== Tools to Get Data In and Out ==

=== Batch on Bitarchives ===

The bitarchives are designed to allow batch programs to run on all the arc-files stored in the bitarchive. This is true whether the bitarchive is installed as a local arc-repository or as a distributed repository with several bitarchives. The batch programs are also used internally by the !NetarchiveSuite software, for example to get a CDX for a specific job, to get checksums of arc-files stored in the bitarchive, and to get a list of arc-files from the bitarchive.

When a batch program is started, it is only sent to one bitarchive replica.

==== Prerequisites for running a batch job ====

A number of prerequisites must be taken care of before a batch job can be executed. These are:

 * ''Setting file:''
  . A settings file must be present and must as a minimum include definitions of the following settings:
   * Replica settings identifying the replica to communicate with:
    * ~+{{{settings.common.replicas}}}+~ in order for the batch program to identify and address messages to the bitarchive.
    * ~+{{{settings.common.useReplicaId}}}+~ in order to determine the default bitarchive replica to use.
   * Channel settings needed to construct the channel names used to communicate with the running system:
    * ~+{{{settings.common.environmentName}}}+~ (typically PROD)
    * ~+{{{settings.common.applicationName}}}+~ (!RunBatchApplication, but currently set automatically)
   * Other communication-related settings where the running system differs from the defaults.
 * ''Batch program:''
  . The batch program is passed as a class file which must be developed in the context of the !NetarchiveSuite code.
  . The batch program must extend ARCBatchJob or !FileBatchJob, depending on whether you want to write a batch program over arc records or a batch program over files (arc-files). Note that a configuration with a local repository means that you can run batch jobs on files of any format using the !FileBatchJob. A minimal sketch of such a batch job is shown after this list.
 * ''Call location:''
  . The batch program can be started from any of the machines in the distributed system.
 * ''Disc space on bitarchive:''
  . The disc space needed depends on the batch program. As an example, the !ChecksumJob produces about 100 bytes per arc-file, whereas a batch program writing out the full contents of the arc-files would require as much space as the archive itself.
 * ''Class Path:''
  . Running a batch program requires ~+{{{lib/dk.netarkivet.archive.jar}}}+~ on the class path.
 * ''Memory space on bitarchive:''
  . The memory needed depends on the batch program. If the batch program depends on a number of jar files, these need to be kept in memory while the batch program is running, and on top of that come the memory requirements of the batch job itself.
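As an illustration of what such a batch program can look like, the sketch below extends ARCBatchJob and writes the mime type and URL of every record it visits. It is only a sketch: the initialize/processRecord/finish signatures and the ARCRecord metadata accessors used here should be verified against the NetarchiveSuite and Heritrix sources for the version you run against.

{{{
import java.io.IOException;
import java.io.OutputStream;

import org.archive.io.arc.ARCRecord;

import dk.netarkivet.common.utils.arc.ARCBatchJob;

/**
 * Illustrative batch job that writes the mime type and URL of every
 * ARC record it visits, one record per line.
 */
public class FindMime extends ARCBatchJob {

    /** Called once, before any records are processed. */
    public void initialize(OutputStream os) {
        // No per-job setup needed for this example.
    }

    /** Called once for every ARC record in every file the job is run on. */
    public void processRecord(ARCRecord record, OutputStream os) {
        try {
            String mime = record.getMetaData().getMimetype();
            String url = record.getMetaData().getUrl();
            os.write((mime + " " + url + "\n").getBytes("UTF-8"));
        } catch (IOException e) {
            throw new RuntimeException("Could not write batch output", e);
        }
    }

    /** Called once, after the last record has been processed. */
    public void finish(OutputStream os) {
        // Nothing to clean up in this example.
    }
}
}}}

Compiled, such a class can be handed to RunBatch with the ~+{{{-C}}}+~ option, as shown in the next section.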
==== Execution and Parameters ====

The execution of a batch program is done by calling the ~+{{{dk.netarkivet.archive.tools.RunBatch}}}+~ program with the following parameters:

If the batch program is given in a single class file, this must be specified with the parameter:

 * ~+{{{-C}}}+~ is a file containing a !FileBatchJob/ARCBatchJob implementation.

If the batch program is given in a jar file, this must be specified with the parameters:

 * ~+{{{-N}}}+~ is the name of the primary class to be loaded and executed as a !FileBatchJob/ARCBatchJob implementation.
 * ~+{{{-J}}}+~ is a jar file containing all the classes needed by the primary class.

To specify which files the batch program must be executed on, the following parameters may be set:

 * ~+{{{-B}}}+~ is the name of the bitarchive replica which the batch job must be executed on. The default is the name of the bitarchive replica identified by the setting ~+{{{settings.common.useReplicaId}}}+~.
 * ~+{{{-R}}}+~ is a regular expression that will be matched against file names in the archive. The default is ~+{{{.*}}}+~, which means the job will be executed on all files in the bitarchive replica.

To specify output files from the batch program, the following parameters may be set:

 * ~+{{{-O}}}+~ is a file where the output from the batch job will be written. By default, the output goes to ~+{{{stdout}}}+~, where it will be mixed with other output.
 * ~+{{{-E}}}+~ is a file where the errors from the batch job will be written. By default, errors go to ~+{{{stderr}}}+~.

An example of an execution command is:

{{{
java -Ddk.netarkivet.settings.file=/home/user/conf/settings.xml \
     -cp lib/dk.netarkivet.archive.jar \
     dk.netarkivet.archive.tools.RunBatch \
     -CFindMime.class -R10-*.arc -BReplicaOne -Oresfile
}}}

This takes ~+{{{lib/dk.netarkivet.archive.jar}}}+~ into the class path and executes the general !NetarchiveSuite program ~+{{{dk.netarkivet.archive.tools.RunBatch}}}+~ based on settings from the file ~+{{{/home/user/conf/settings.xml}}}+~. The result is that the batch program ~+{{{FindMime.class}}}+~ is run on the bitarchive replica named ~+{{{ReplicaOne}}}+~, but only on files with names matching the pattern ~+{{{10-*.arc}}}+~. The results written by the batch program are concatenated and placed in the output file named ~+{{{resfile}}}+~.

==== Example of packing and executing a batch job ====

To package the files, do the following:

~+{{{jar -cvf batch.jar path/batchProgram.class}}}+~

where ~+{{{path}}}+~ is the path to the directory where the batch class files are placed. This is under the ~+{{{bin/}}}+~ directory in the eclipse project. ~+{{{batchProgram.class}}}+~ is the compiled file for your batch program.

The call to run this batch job is then:

{{{
java -Ddk.netarkivet.settings.file=conf/settings.xml \
     -cp lib/dk.netarkivet.archive.jar \
     dk.netarkivet.archive.tools.RunBatch \
     -Jbatch.jar -Npath.batchProgram
}}}

where ~+{{{path}}}+~ in the ~+{{{-N}}}+~ argument has all ~+{{{'/'}}}+~ changed to ~+{{{'.'}}}+~.

E.g. to run the batch job from the file ~+{{{myBatchJobs/arc/MyArcBatchJob.java}}}+~, which extends the ARCBatchJob class (~+{{{dk/netarkivet/common/utils/arc/ARCBatchJob}}}+~), do the following:

 * ~+{{{cd bin/}}}+~ - Place yourself in the bin/ folder under your project.
 * ~+{{{jar -cvf batch.jar myBatchJobs/arc/*}}}+~ - Package the compiled Java binaries into a jar file.
 * ~+{{{mv batch.jar ~/NetarchiveSuite/.}}}+~ - Move the packaged batch job to your NetarchiveSuite directory.
 * ~+{{{cd ~/NetarchiveSuite/}}}+~ - Go to your NetarchiveSuite directory.
 * Run the following command to execute the batch job:

{{{
java -Ddk.netarkivet.settings.file=conf/settings.xml \
     -cp lib/dk.netarkivet.archive.jar:lib/dk.netarkivet.common.jar \
     dk.netarkivet.archive.tools.RunBatch \
     -Jbatch.jar -NmyBatchJobs.arc.MyArcBatchJob
}}}

The ~+{{{lib/dk.netarkivet.common.jar}}}+~ library needs to be included in the class path since the batch job (~+{{{myBatchJobs/arc/MyArcBatchJob}}}+~) inherits from a class within this library (~+{{{dk/netarkivet/common/utils/arc/ARCBatchJob}}}+~).

==== Security ====

If the security properties for the bitarchive (independent of this execution) are set as described in the [[Configuration Manual 3.12#ConfigureSecurity|Configuration Manual]], the batch program will not be allowed to:

 * write files to the bitarchive
 * change files in the bitarchive
 * delete files in the bitarchive

==== Outstanding Issues ====

As described in [[AssignmentGroupB4|archive assignment B.2.4]], there are plans to improve the logging of internal messages and exceptions from batch jobs. Today the only way to do internal logging is to write to the output of the batch program.

If a batch job is run while not all !BitarchiveApplications are online, this will not be discovered by the !NetarchiveSuite software. As described in [[AssignmentGroupB2|archive assignment B.2.2]], there are plans to change this at a later stage.

=== Indexes and caching ===

The deduplication code and the viewer proxy both make use of an index generating system to extract Lucene indexes from the data in the archive. This system makes extensive use of caching to improve index generation performance. This section describes the default index generating system implemented by the !IndexRequestClient plugin.

There are four parts involved in getting an index, each of them having its own cache. The first part resides on the client side, in the !IndexRequestClient class, which caches unzipped Lucene indexes and makes them available for use. The !IndexRequestClient receives its data from the !CrawlLogIndexCache in the form of gzipped Lucene indexes. The !CrawlLogIndexCache generates the Lucene indexes based on Heritrix crawl.log files and CDX files extracted from the ARC files, and caches the generated indexes in gzipped form. The crawl.log files and CDX files are in turn received through two more caches, both of which extract their data directly from the archive using batch jobs and store it in sorted form in their caches.

All four caches are based on the generic !FileBasedCache class, which handles the necessary synchronization to ensure that not only separate threads but also separate processes can access the cache simultaneously without corrupting it. When a specific cache item is requested, the cache is first checked to see if the item already exists. If it doesn't, a file indicating that work is being done is locked by the process. If this lock is acquired, the actual cache-filling operation can take place; otherwise another thread or process must be working on it already, and we can wait until it finishes and take its data.

The !FileBasedCache class is generic on the type of the identifier that indicates which item to get. The higher-level caches (!IndexRequestClient and !CrawlLogIndexCache) use a Set of job IDs as the identifier type, allowing indexes that cover multiple jobs. The two low-level caches just use a Long type, so they operate on one job at a time.
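The lock-then-fill protocol described above can be illustrated with a simplified sketch. This is not the actual !FileBasedCache code: the file naming and the fillCache step are placeholders, and error handling is omitted.

{{{
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

/** Simplified illustration of the lock-then-fill protocol described above. */
public class CacheFillSketch {

    /** Returns the cache file for the given job ID, creating it if necessary. */
    public static File getCacheFile(File cacheDir, long jobId) throws Exception {
        File cacheFile = new File(cacheDir, jobId + "-cache");
        if (cacheFile.exists()) {
            return cacheFile; // Already cached; nothing to do.
        }
        File workFile = new File(cacheDir, cacheFile.getName() + ".working");
        try (RandomAccessFile raf = new RandomAccessFile(workFile, "rw");
             FileLock lock = raf.getChannel().lock()) {
            // lock() blocks until the lock is free, so when we get here either
            // we must fill the cache ourselves, or another thread/process
            // finished the work while we were waiting.
            if (!cacheFile.exists()) {
                fillCache(jobId, cacheFile); // hypothetical cache-filling step
            }
        }
        return cacheFile;
    }

    /** Placeholder: in NetarchiveSuite this would run a batch job against the archive. */
    private static void fillCache(long jobId, File cacheFile) {
        // Left empty in this sketch.
    }
}
}}}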
The two caches that handle multiple job IDs as their cache item ID must handle a couple of special scenarios: their cache item ID may consist of hundreds or thousands of job IDs, and part of the job data may be unavailable.

To deal with the first problem, any cache item with more than four job IDs in the ID set is stored in a file whose name contains the four lowest-numbered IDs followed by an MD5 checksum of a concatenation of all the IDs in sorted order. This ensures uniqueness of the cache file name without overflowing operating system limits.

Every subclass of !FileBasedCache uses its own directory, where its cache files are placed. The name of the final cache file is derived uniquely from the id-set that the index is to be built from. Since synchronization is done on the complete path to the cache file, a synchronization block only occurs when two instances of the same class (e.g. !DedupCrawlLogIndexCache) attempt to build a cache for the same id-set at the same time. Even in this case the cache file is only made once, since the waiting instance will use the cache file created by the first instance.

A subclass of !CombiningMultiFileBasedCache uses a corresponding subclass of !RawMetadataCache to make sure that a cache file exists for every id (an id-cache file). If such a file does not exist, it will be created. Afterwards all the id-cache files are combined into a complete file for the wanted id-set. The id-cache files are locked against other processes during their creation, but they are only created once, since they can be reused directly to create the Lucene cache for other id-sets containing the same id.

==== CrawlLogIndexCache ====

The !CrawlLogIndexCache guarantees that an index is always returned for a given request, regardless of whether part of the necessary data was available. This is done by performing a preparatory step where the data required to create the index is retrieved. If any of the data chunks are missing, a recursive attempt at generating an index for a reduced set is performed. Since the underlying data is always fetched from a cache, it is very likely that all the data for the reduced set is readily available, so no further recursion is typically needed. The set of job IDs that was actually found is returned from the request to cache data, while the actual data is stored in a file whose name can be requested afterwards. Note that future requests for the full set of job IDs will cause a renewed attempt at downloading the underlying data, which may take a while, especially if the lack of data is caused by a time-out.

The !CrawlLogIndexCache is the most complex of the caches, but its various responsibilities are spread out over several superclasses:

 * The top class is the generic !FileBasedCache, which handles the locking necessary to have only one thread in one process at a time create the cached data. It also provides two helper methods: getIndex() is a forgiving cache lookup for complex cache items that handles the partial results described above, and the get(Set) method allows for optimized caching of multiple simple cache requests.
 * The !MultiFileBasedCache handles the naming of files for caches that use sets as cache item identifiers.
 * The !CombiningMultiFileBasedCache extends the !MultiFileBasedCache to have another, simpler cache as a data source, and provides an abstract method for combining the data from the underlying cache. It adds a step to the caching process for getting the underlying data, and only performs the combine action if all required data was found.
 * The !CrawlLogIndexCache is a !CombiningMultiFileBasedCache whose underlying data is crawl.log files, but it adds a simple CDX cache to provide data not found in the crawl.log. It also implements the combine method by creating a Lucene index from the crawl.log and CDX files, using code from Kristinn Sigurðsson. The other subclass of !CombiningMultiFileBasedCache, which provides combined CDX indexes, is not currently used in the system, but is available at the !IndexRequestClient level.
 * The !CrawlLogIndexCache is further subclassed into two flavors: !FullCrawlLogIndexCache, which is used in the viewer proxy, and !DedupCrawlLogIndexCache, which is used by the deduplicator in the harvester. The !DedupCrawlLogIndexCache restricts the index to non-text files, while the !FullCrawlLogIndexCache indexes all files.

The two caches used by !CrawlLogIndexCache are !CDXDataCache and !CrawlLogDataCache, both of which are simply instantiations of the !RawMetadataCache. They both work by extracting records from the archived metadata files based on regular expressions, using batch jobs submitted through the !ArcRepositoryClient. This is not the most efficient way of getting the data, as a separate batch job is submitted to get the files for each job, but it is simple. It could be improved by overriding the get(Set) method to collect all the data in one batch job, though some care has to be taken with synchronization and with avoiding refetching unnecessary data.

=== Viewerproxy ===

The viewerproxy uses the Jetty HTTP server library to handle connections. Each incoming URL is sent through a pipeline of "resolvers", each of which can either process the URL or pass it on to the next resolver. The executeCommand method should be overridden to handle requests, and should return true if the request was handled by this resolver. The resolver is responsible for calling response.setStatus to set the appropriate HTTP result code. Incoming URLs are handled by the resolvers below, in the order shown.

==== Viewerproxy control resolver ====

The HTTPControllerServer class manages index setup and the collection of missing URLs for the viewerproxy. It is mainly used through the QA web interface.

==== Special Access ====

===== Via URLs =====

The !GetDataResolver class provides some special URLs in the viewerproxy that can be used for more direct access to the stored data. To use them, your browser must be set up to access the viewerproxy in the same way as when browsing harvested data.

The general format of the commands is ~+{{{http://viewerproxy.invalid/<command>?arg1=value1&arg2=value2...}}}+~

The commands are:

 * getFile - gets a whole file from the archive
  * arcFile= - name of the file (without pathnames)
 * getRecord - gets a single ARC record from the archive
  * arcFile= - name of the file to look up a record in (without pathnames)
  * arcOffset= - offset into the file where the record starts
 * getMetadata - gets all metadata for a job from the archive
  * jobID= - ID (numeric) of the job for which to fetch metadata

===== Via Command Line =====

Furthermore it is possible to issue getFile and getRecord commands from the command line in the following way:

{{{
usage: java dk.netarkivet.archive.tools.GetFile <arcfile> [destination-file]
}}}

This tool retrieves a file from the archive. If [destination-file] is omitted, the file is stored with the same name. The bitarchive replica the file is retrieved from is chosen based on the setting ~+{{{settings.common.useReplicaId}}}+~.
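Assuming GetFile is invoked like the other archive tools shown above, with the settings file given as a system property and the archive jar on the class path, a call could look like the following; the arc-file name is only an example:

{{{
java -Ddk.netarkivet.settings.file=/home/user/conf/settings.xml \
     -cp lib/dk.netarkivet.archive.jar \
     dk.netarkivet.archive.tools.GetFile \
     2-metadata-1.arc
}}}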
{{{
usage: java dk.netarkivet.archive.tools.GetRecord [uri]
}}}

This tool depends on the existence of a Lucene index, as generated by the index server. It uses this index to look up the arc-file and offset of a particular record, and then retrieves that record from the archive. The bitarchive replica the record is retrieved from is chosen based on the setting ~+{{{settings.common.useReplicaId}}}+~. The result is printed to ~+{{{stdout}}}+~.

==== Observer resolver ====

The NotifyingURIResolver class provides a means of logging what users access through the viewerproxy. It never processes any URLs itself; it merely allows a URIObserver to monitor the URLs. It is currently used to record URLs that are not handled by other resolvers.

=== Upload of Files ===

!NetarchiveSuite offers a separate tool to upload arc-files to the running repository.

{{{
usage: java dk.netarkivet.archive.tools.Upload <file1> [file2 ...]
}}}

This tool uploads a number of local files to all replicas in the archive. An example of an execution command is:

{{{
java -Ddk.netarkivet.settings.file=/home/user/conf/settings.xml \
     -cp lib/dk.netarkivet.archive.jar \
     dk.netarkivet.archive.tools.Upload \
     file1.arc [file2.arc ...]
}}}

where ~+{{{file1.arc [file2.arc ...]}}}+~ are the files to be uploaded. The files are uploaded via the running !ArcRepository application. This means that (as for batch programs) the settings file must contain the minimum set of settings needed for the tool to communicate with the running system.

When a file is uploaded successfully, it is deleted locally. If there are files left after execution, this is an indicator that those files are '''''not''''' stored in a secure way.