AutomaticIndexing - NetarchiveSuite

The purpose of this task is to develop a component which will automatically index new material and add the resultant material to a running wayback instance. The essential requirement is that in some reasonably timely fashion (for example within at most three days since harvesting) any harvested webpage should be discoverable via wayback. We have two basic tools to achieve this task:

ExtractWaybackCDXBatchJob for harvested arc files, and
ExtractDeduplicateCDXBatchJob for dedup records in metadata files

(So every file in the repository needs to be indexed by one batch-job or the other.)

Files returned by these batch-jobs are unsorted, but are ready for sorting. They require no other post-processing.

A major constraint on the architecture is that we can only know what files are available in a given replica of the archive. We cannot easily determine what files are missing due to, e.g., machine downtime. In order to ensure that any missed files are indexed later we therefore need to maintain a list of files which have been indexed, for example in a relational database. Moreover, in our current architecture the only way to know that a batch job has run on a given file is to use a regexp matching that file alone and then check that the std output from the job shows that it has been successfully run on one file. This is therefore what we will do, running many small batch jobs. We should consider whether we should run multiple small batch jobs concurrently. (If so, these must reuse a small pool of applicationInstanceId's or we will pollute the jms broker with a lot of queues.)

The WaybackIndexer should consist of two separate components:

A WaybackBatchIndexer. This is the component which maintains a list of files in the archive (via FileListJob) and starts batch jobs to index any and all files not yet indexed
A WaybackIndexAggregator. This component scans the output directories of WaybackBatchIndexer, takes any new files found, sorts them, and merges the result into existing sorted index files.

It should be possible not only to run these two components separately but to develop them relatively independently of each other as the only interface between them is the filesystem.