Tools to Get Data In and Out


Batch on Bitarchives

The bitarchives are designed to run batch programs on all the arc-files stored in the bitarchive. This is true regardless of whether the bitarchive is installed as a local arc-repository or as a distributed repository with several bitarchives.

The batch programs are also used internally by the NetarchiveSuite software, for example to get a CDX for a specific job, to get checksums of arc-files stored in the bitarchive, and to get a list of arc-files from the bitarchive.

When a batch program is started, it will only be sent to one bitarchive replica.

Prerequisites for running a batch job

A number of prerequisites must be taken care of before a batch job can be executed; in particular, the settings file must contain at least the settings needed for the tool to communicate with the running system.

Execution and Parameters

The execution of a batch program is done by calling the dk.netarkivet.archive.tools.RunBatch program with the following parameters:

If the batch program is given in a single class file, this must be specified in the parameter:

  • -C<classfile> is a file containing a FileBatchJob/ARCBatchJob implementation

If the batch program is given in a jar file, this must be specified in the parameters:

  • -N<className> is the name of the primary class to be loaded and executed as a FileBatchJob/ARCBatchJob implementation

  • -J<jarfile> is a file containing all the classes needed by the primary class

To specify which files the batch program must be executed on, the following parameters may be set:

  • -B<replica> is the name of the bitarchive replica on which the batch job must be executed. The default is the name of the bitarchive replica identified by the setting settings.common.useReplicaId.

  • -R<regexp> is a regular expression that will be matched against file names in the archive. The default is .* which means it will be executed on all files in the bitarchive replica.

To specify output files from the batch program, the following parameters may be set:

  • -O<outputfile> is a file where the output from the batch job will be written. By default, the output goes to stdout, where it will be mixed with any other output written to stdout.

  • -E<errorFile> is a file where the errors from the batch job will be written. By default, it goes to stderr.

An example of an execution command is:

   java -Ddk.netarkivet.settings.file=/home/user/conf/settings.xml \
        -cp lib/dk.netarkivet.archive.jar \
        dk.netarkivet.archive.tools.RunBatch \
        -CFindMime.class -R10-*.arc -BReplicaOne -Oresfile

which adds lib/dk.netarkivet.archive.jar to the class path and executes the general NetarchiveSuite program dk.netarkivet.archive.tools.RunBatch based on settings from the file /home/user/conf/settings.xml. This will result in running the batch program FindMime.class on the bitarchive replica named ReplicaOne, but only on files with names matching the pattern 10-*.arc. The results written by the batch program are concatenated and placed in the output file named resfile.
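
For reference, the following is a minimal sketch of what a batch program such as FindMime.class might look like. The initialize/processRecord/finish method names and the use of Heritrix's ARCRecord metadata are assumptions based on the ARCBatchJob description above; consult the actual ARCBatchJob class for the exact signatures.

    import java.io.IOException;
    import java.io.OutputStream;

    import org.archive.io.arc.ARCRecord;

    import dk.netarkivet.common.utils.arc.ARCBatchJob;

    /**
     * Sketch of a batch job that writes the URL and MIME type of every
     * ARC record it visits to the batch output stream.
     * NOTE: the method signatures are assumed from the description above.
     */
    public class FindMime extends ARCBatchJob {
        public void initialize(OutputStream os) {
            // No setup needed for this job.
        }

        public void processRecord(ARCRecord record, OutputStream os) {
            try {
                String url = record.getMetaData().getUrl();
                String mime = record.getMetaData().getMimetype();
                os.write((url + " " + mime + "\n").getBytes());
            } catch (IOException e) {
                throw new RuntimeException("Failed to write batch output", e);
            }
        }

        public void finish(OutputStream os) {
            // No cleanup needed for this job.
        }
    }

The compiled FindMime.class can then be passed with the -C option, as in the example command above.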

Example of packing and executing a batch job

To package the files do the following:

jar -cvf batch.jar path/batchProgram.class

where path is the path to the directory where the batch class files are placed. This is under the bin/ directory in the Eclipse project. The batchProgram.class is the compiled file for your batch program.

The call to run this batch job is then:

   java -Ddk.netarkivet.settings.file=conf/settings.xml \
        -cp lib/dk.netarkivet.archive.jar \
        dk.netarkivet.archive.tools.RunBatch \
        -Jbatch.jar -Npath.batchProgram

where path in the -N argument has all '/' changed to '.'.

E.g. to run the batch job from the file myBatchJobs/arc/MyArcBatchJob.java, which extends the ARCBatchJob class (dk/netarkivet/common/utils/arc/ARCBatchJob), do the following:

  • cd bin/ - Place yourself in the bin/ folder under your project.

  • jar -cvf batch.jar myBatchJobs/arc/* - Package the compiled Java binaries into a .jar file.

  • mv batch.jar ~/NetarchiveSuite/. - Move the packaged batch job to your NetarchiveSuite directory.

  • cd ~/NetarchiveSuite/ - Go to your NetarchiveSuite directory.

  • Run the following command to execute the batch job:

   java -Ddk.netarkivet.settings.file=conf/settings.xml \
        -cp lib/dk.netarkivet.archive.jar:lib/dk.netarkivet.common.jar \
        dk.netarkivet.archive.tools.RunBatch -Jbatch.jar -NmyBatchJobs.arc.MyArcBatchJob

The lib/dk.netarkivet.common.jar library needs to be included in the classpath since the batch job (myBatchJobs/arc/MyArcBatchJob) inherits from a class within this library (dk/netarkivet/common/utils/arc/ARCBatchJob).

Security

If the security properties for the bitarchive (independent of this execution) are set as described in the Configuration Manual, the batch program will not be allowed to:

  • write files to the bitarchive
  • change files in the bitarchive
  • delete files in the bitarchive

Outstanding Issues

As described in archive assignment B.2.4, there are plans to improve the logging of internal logs and exceptions from batch jobs. Today the only way to do internal logging is to write to the output of the batch program.

If a batch job is run while not all BitarchiveApplications are online, this will not be discovered by the NetarchiveSuite software. As described in archive assignment B.2.2, there are plans to change this at a later stage.

Indexes and caching

The deduplication code and the viewer proxy both make use of an index generating system to extract Lucene indexes from the data in the archive. This system makes extensive use of caching to improve index generation performance. This section describes the default index generating system implemented as the IndexRequestClient plugin.

There are four parts involved in getting an index, each of which has its own cache. The first part resides on the client side, in the IndexRequestClient class, which caches unzipped Lucene indexes and makes them available for use. The IndexRequestClient receives its data from the CrawlLogIndexCache in the form of gzipped Lucene indexes. The CrawlLogIndexCache generates the Lucene indexes based on Heritrix crawl.log files and CDX files extracted from the ARC files, and caches the generated indexes in gzipped form. The crawl.log files and CDX files are in turn received through two more caches, both of which extract their data directly from the archive using batch jobs and store them in sorted form in their caches.

All four caches are based on the generic FileBasedCache class, which handles the necessary synchronization to ensure that not only separate threads but also separate processes can access the cache simultaneously without corrupting it. When a specific cache item is requested, the cache is first checked to see if it already exists. If it does not, a file indicating that work is being done is locked by the process. If this lock is acquired, the actual cache-filling operation can take place; otherwise another thread or process must already be working on it, and we can wait until it finishes and use its data.
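
The pattern can be sketched as follows; the class and method names here are illustrative and do not match the real FileBasedCache code in detail:

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileLock;

    /** Sketch of the check-then-lock pattern described above (illustrative only). */
    public class CacheLockingSketch {
        private final File cacheDir = new File("cache");

        /** Returns the cache file for the given id, filling it first if needed. */
        public File getCacheFile(String id) throws Exception {
            cacheDir.mkdirs();
            File cacheFile = new File(cacheDir, id);
            if (cacheFile.exists()) {
                return cacheFile;                    // Already cached: use it directly.
            }
            // Threads in this JVM synchronize on the (interned) cache file path;
            // separate processes synchronize on a lock file next to the cache file.
            synchronized (cacheFile.getAbsolutePath().intern()) {
                File workFile = new File(cacheDir, id + ".working");
                try (RandomAccessFile raf = new RandomAccessFile(workFile, "rw");
                     FileLock lock = raf.getChannel().lock()) {
                    // Someone else may have filled the cache while we waited for
                    // the lock, so check again before doing the expensive work.
                    if (!cacheFile.exists()) {
                        fillCache(cacheFile);        // The actual cache-filling operation.
                    }
                }
            }
            return cacheFile;
        }

        /** Placeholder for the expensive operation (batch job, index build, ...). */
        private void fillCache(File cacheFile) {
        }
    }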

The FileBasedCache class is generic on the type of the identifier that indicates which item to get. The higher-level caches (IndexRequestClient and CrawlLogIndexCache) use a Set<Long> type to allow indexes of multiple jobs based on their IDs. The two low-level caches just use a Long type, so they operate on one job at a time.

The two caches that handle multiple job IDs as their cache item ID must handle a couple of special scenarios: Their cache item ID may consist of hundreds or thousands of job IDs, and part of the job data may be unavailable. To deal with the first problem, any cache item with more than four job IDs in the ID set is stored in a file whose name contains the four lowest-numbered IDs followed by an MD5 checksum of a concatenation of all the IDs in sorted order. This ensures uniqueness of the cache file without overflowing operating system limits.
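
A sketch of such a naming scheme is shown below; the exact separators and digest formatting used by the real code may differ:

    import java.security.MessageDigest;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Set;

    /**
     * Sketch of the naming scheme described above: for more than four job IDs,
     * use the four lowest IDs plus an MD5 checksum of all IDs in sorted order.
     */
    public final class CacheFileNameSketch {
        public static String cacheFileName(Set<Long> jobIds) throws Exception {
            List<Long> sorted = new ArrayList<Long>(jobIds);
            Collections.sort(sorted);
            if (sorted.size() <= 4) {
                return join(sorted);                     // Small sets: just the IDs.
            }
            // Large sets: four lowest IDs plus MD5 of all IDs in sorted order.
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(join(sorted).getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return join(sorted.subList(0, 4)) + "-" + hex;
        }

        private static String join(List<Long> ids) {
            StringBuilder sb = new StringBuilder();
            for (Long id : ids) {
                if (sb.length() > 0) {
                    sb.append("-");
                }
                sb.append(id);
            }
            return sb.toString();
        }
    }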

Every subclass of FileBasedCache uses its own directory, where its cache files are placed. The name of the final cache file is derived uniquely from the id-set that is to be made into an index. Since synchronization is done on the complete path to the cache file, blocking only occurs when two instances of the same class (e.g. DedupCrawlLogIndexCache) attempt to build a cache for the same id-set at the same time. Even in that case the cache file is only made once, since the waiting instance will use the cache file created by the first instance.

A subclass of CombiningMultiFileBasedCache uses a corresponding subclass of RawMetadataCache to make sure that a cache file exists for every id (an id-cache file). If such a file does not exist, it will be created. Afterwards all the id-cache files are combined into a complete file for the wanted id-set.

The id-cache files are locked against other processes while they are being created, but each is only created once, since it can be reused directly when creating the Lucene cache for other id-sets that contain the same id.

CrawlLogIndexCache

The CrawlLogIndexCache guarantees that an index is always returned for a given request, even if part of the necessary data is unavailable. This is done by performing a preparatory step where the data required to create the index is retrieved. If any of the data chunks are missing, a recursive attempt at generating an index for a reduced set is performed. Since the underlying data is always fetched from a cache, it is very likely that all the data for the reduced set is readily available, so no further recursion is typically needed. The set of job IDs that was actually found is returned from the request to cache data, while the actual data is stored in a file whose name can be requested afterwards. Note that future requests for the full set of job IDs will cause a renewed attempt at downloading the underlying data, which may take a while, especially if the lack of data is caused by a time-out.
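
The strategy can be sketched as follows (illustrative only; the real CrawlLogIndexCache code is structured differently):

    import java.util.HashSet;
    import java.util.Set;

    /** Sketch of the reduced-set retry described above. */
    public abstract class ReducedSetSketch {
        /** Returns the set of job IDs an index could actually be built for. */
        public Set<Long> cacheIndex(Set<Long> requestedJobs) {
            // Step 1: make sure the underlying crawl.log/CDX data is cached.
            Set<Long> available = prepareData(requestedJobs);
            if (available.equals(requestedJobs)) {
                combineIntoLuceneIndex(available);   // All data found: build the index.
                return available;
            }
            // Step 2: some jobs had no data; retry with the jobs that were found.
            // Their data is now cached, so this normally recurses only once.
            return cacheIndex(new HashSet<Long>(available));
        }

        /** Fetches the underlying data, returning the IDs that were available. */
        protected abstract Set<Long> prepareData(Set<Long> jobs);

        /** Builds the Lucene index for the given job IDs. */
        protected abstract void combineIntoLuceneIndex(Set<Long> jobs);
    }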

The CrawlLogIndexCache is the most complex of the caches, but its various responsibilities are spread out over several superclasses.

  • The top class is the generic FileBasedCache, which handles the locking necessary to ensure that only one thread in one process at a time creates the cached data. It also provides two helper methods: getIndex() is a forgiving cache lookup for complex cache items that handles the partial results described before, and the get(Set<I>) method allows for optimized caching of multiple simple cache requests.

  • The MultiFileBasedCache handles the naming of files for caches that use sets as cache item identifiers.

  • The CombiningMultiFileBasedCache extends the MultiFileBasedCache to have another, simpler cache as a data source, and provides an abstract method for combining the data from the underlying cache. It adds a step to the caching process that fetches the underlying data, and only performs the combine action if all required data was found.

  • The CrawlLogIndexCache is a CombiningMultiFileBasedCache whose underlying data is crawl.log files; it adds a simple CDX cache to provide data not found in the crawl.log. It also implements the combine method by creating a Lucene index from the crawl.log and CDX files, using code from Kristinn Sigurðsson. The other subclass of CombiningMultiFileBasedCache, which provides combined CDX indexes, is not currently used in the system, but is available at the IndexRequestClient level.

  • The CrawlLogIndexCache is further subclassed into two flavors: FullCrawlLogIndexCache, which is used in the viewerproxy, and DedupCrawlLogIndexCache, which is used by the deduplicator in the harvester. The DedupCrawlLogIndexCache restricts the index to non-text files, while the FullCrawlLogIndexCache indexes all files.
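
For orientation, the hierarchy just described can be summarized in the following skeleton (simplified; generic parameters and method lists are not exact):

    /* Skeleton of the cache class hierarchy described above. */
    abstract class FileBasedCache<I> { /* locking, getIndex(), get(Set<I>) */ }

    abstract class MultiFileBasedCache
            extends FileBasedCache<java.util.Set<Long>> { /* file naming for ID sets */ }

    abstract class CombiningMultiFileBasedCache
            extends MultiFileBasedCache { /* fetches and combines underlying data */ }

    abstract class CrawlLogIndexCache
            extends CombiningMultiFileBasedCache { /* crawl.log + CDX -> Lucene index */ }

    class FullCrawlLogIndexCache
            extends CrawlLogIndexCache { /* used by the viewerproxy; indexes all files */ }

    class DedupCrawlLogIndexCache
            extends CrawlLogIndexCache { /* used by the deduplicator; non-text files only */ }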

The two caches used by CrawlLogIndexCache are CDXDataCache and CrawlLogDataCache, both of which are simply instantiations of the RawMetadataCache. They both work by extracting records from the archived metadata files based on regular expressions, using batch jobs submitted through the ArcRepositoryClient. This is not the most efficient way of getting the data, as a batch job is submitted separately for getting the files for each job, but it is simple. It could be improved by overriding the get(Set<I>) method to collect all the data in one batch job, though some care has to be taken with synchronization and avoiding refetching unnecessary data.

Viewerproxy

The viewerproxy uses the Jetty HTTP server library to handle connections. Each incoming URL is sent through a pipeline of "resolvers", each of which can either process the URL or pass it on to the next resolver. The executeCommand method should be overridden to handle requests, and should return true if the request was handled by this resolver. The resolver is responsible for calling response.setStatus to set the appropriate HTTP result code.
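
The following sketch illustrates this resolver contract. The Request and Response interfaces and the CommandResolver base class shown here are stand-ins invented for the example; the actual NetarchiveSuite types and signatures may differ:

    /* Sketch of a resolver in the pipeline described above (types are assumed). */
    interface Request  { java.net.URI getURI(); }
    interface Response { void setStatus(int code);
                         java.io.OutputStream getOutputStream() throws java.io.IOException; }

    abstract class CommandResolver {
        /** Return true if this resolver handled the request. */
        protected abstract boolean executeCommand(Request request, Response response);
    }

    class PingResolver extends CommandResolver {
        protected boolean executeCommand(Request request, Response response) {
            if (!"/ping".equals(request.getURI().getPath())) {
                return false;                  // Not ours; the next resolver gets it.
            }
            response.setStatus(200);           // The resolver must set the status itself.
            try {
                response.getOutputStream().write("pong".getBytes());
            } catch (java.io.IOException e) {
                throw new RuntimeException("Could not write response", e);
            }
            return true;                       // Handled here; the pipeline stops.
        }
    }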

Incoming URLs are handled by the resolvers below, in the order shown.

Viewerproxy control resolver

The HTTPControllerServer class manages index setup and missing URL collection for the viewerproxy. It is mainly used through the QA web interface. It has the following commands:

Special Access

Via URLs

The GetDataResolver class provides some special URLs in the viewerproxy that can be used for more direct access to the stored data. To use them, your browser must be set up to access the viewerproxy in the same way as when browsing harvested data. The general format of the commands is http://viewerproxy.invalid/<command>?arg1=value1&arg2=value2... The commands are listed below, followed by an example request:

  • getFile - gets a whole file from the archive
    • arcFile=<filename> - name of the file (without pathnames)

  • getRecord - gets a single ARC record from the archive
    • arcFile=<filename> - name of the file to look up a record in (without pathnames)

    • arcOffset=<offset> - offset into the file the record starts at

  • getMetadata - gets all metadata for a job from the archive
    • jobID=<id> - ID (numeric) of the job for which to fetch metadata
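
For example, with the browser (or another HTTP client) configured to use the viewerproxy as its proxy, all metadata for a job with the hypothetical ID 42 could be fetched with a request such as:

    http://viewerproxy.invalid/getMetadata?jobID=42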

Via Command Line

Furthermore, it is possible to issue getFile and getRecord commands from the command line in the following way:

usage: java dk.netarkivet.archive.tools.GetFile <filename> [destination-file]

This tool retrieves a file from the archive. If [destination-file] is omitted, the file is stored with the same name. The bitarchive replica the file is retrieved from is chosen based on the setting settings.archive.common.useReplicaId.

usage: java dk.netarkivet.archive.tools.GetRecord <indexdir> [uri]

This tool depends on the existence of a Lucene index, as generated by the index server. It uses this index to look up the arc-file and the offset at which a particular record starts, and then retrieves that record from the archive.

The bitarchive replica the file is retrieved from is chosen based on the setting settings.archive.common.useReplicaId. The result is printed to stdout.

Observer resolver

The NotifyingURIResolver class provides a means of logging what users access through the viewerproxy. It never processes any URLs itself; it merely allows a URIObserver to monitor the URLs. It is currently used to record URLs that are not handled by other resolvers.

Upload of Files

NetarchiveSuite offers a separate tool to upload arc-files to the running repository.

usage: java dk.netarkivet.archive.tools.Upload <files...>

This tool will upload a number of local files to all replicas in the archive. An example of an execution command is:

   java -Ddk.netarkivet.settings.file=/home/user/conf/settings.xml \
        -cp lib/dk.netarkivet.archive.jar \
        dk.netarkivet.archive.tools.Upload \
        file1.arc [file2.arc ...]

where file1.arc [file2.arc ...] are the files to be uploaded.

This will result in the files being uploaded via the running ArcRepository application.

This means that (as for batch programs) the settings file must contain a minimum of settings for the tool to communicate with the running system.

When a file is uploaded successfully, it is deleted locally. This means that any files left after execution indicate that those files have not been stored securely.
