AutomaticIndexing - NetarchiveSuite

Overall Aims

The purpose of this task is to develop a component which will automatically index new material and add the resultant material to a running wayback instance. The essential requirement is that in some reasonably timely fashion (for example within at most three days since harvesting) any harvested webpage should be discoverable via wayback. We have two basic tools to achieve this task:

ExtractWaybackCDXBatchJob for harvested arc files, and
ExtractDeduplicateCDXBatchJob for dedup records in metadata files

(So every file in the repository needs to be indexed by one batch-job or the other.)

Files returned by these batch-jobs are unsorted, but are ready for sorting. They require no other post-processing.

Architectural Constraints

A major constraint on the architecture is that we can only know what files are available in a given replica of the archive. We cannot easily determine what files are missing due to, e.g., machine downtime. In order to ensure that any missed files are indexed later we therefore need to maintain a list of files which have been indexed, for example in a relational database.

In our batch architecture, the client only knows the number of files on which the batch job was successfully run, not their names. Here are two possible strategies for this

Strategy 1: Multiple files per Batch Job

Select a set of un-indexed files (either all data or all metadata files) from the database
Create a batch job to index only these files by calling FileBatchJob.processOnlyFilesNamed(List<String> specifiedFilenames)
If the standard output shows that the number of files processed is as expected, keep the output and mark all these files as indexed in the database
Otherwise just throw away the output and select a new set of files to index

In this way we can exploit a controllable amount of concurrency by specifying how many files to index in each batch job.

An important issue is how to select which files to clump together in batch jobs. In particular, any missing files will cause timeouts and result in the output from successfully indexed files being thrown away. One strategy is the following:

In the database, each file has a field "NumberOfTimesFailed" initialised to zero
This is incremented any time the given file is part of a batch job which fails
Not yet indexed files are selected (queued) by date order (most recent first) and in reverse order of NumberOfTimesFailed

We could also tune the algorithm so that, for example, files which have failed multiple times are subsequently indexed one at a time. This will allow us to rescue files which have "gotten a bad reputation" by ending up near actually missing or damaged files in the queue. We should also manually inspect the database for files which cannot be indexed.

Strategy 2: Single file per Batch Job

Create a queue of files (e.g. a LinkedBlockingDeque) to be indexed by selecting all unindexed files reverse-ordered by date, so that previously failed files will be indexed last
Create a small number of consumer threads
In each consumer thread, read (poll) one file from the file-queue and index it in a batch job, blocking the thread until the job is finished
Check batch jobs for success and mark as indexed any which process one file successfully.

	Advantages	Disadvantages
Strategy 1	Uses batch architecture for concurrency Unithreaded application	Loads bamon with potentially larger batch jobs to aggregate. Potentially complex aggregation algorithms for selecting files to index. Must throw all results away whenever one file fails
Strategy 2	Simple single-producer multiple-consumer algorithm using standard components. Easy to keep track of which files cannot be indexed.	Requires our own multithreading. Loads broker/repository with more individual batch jobs.

Strategy 2 appears to be the clear winner in terms of architectural and algorithmic simplicity.

Overall Architecture

The WaybackIndexer should consist of two separate components:

A WaybackBatchIndexerApplication. This is the component which maintains a list of files in the archive (via FileListJob) and starts batch jobs to index any and all files not yet indexed. This application could be parameterized to index either data files or metadata files and one instance of each could be started. In this case, the database would need to run in standalone mode. This is probably a complication to be avoided.
A WaybackIndexAggregatorApplication. This component scans the output directories of WaybackBatchIndexer, takes any new files found, sorts them, and merges the result into existing sorted index files.

It should be possible not only to run these two components separately but to develop them relatively independently of each other as the only interface between them is the filesystem.

Functional Description

WaybackBatchIndexerApplication

The indexer maintains a (FIFO) queue of files (arcfiles and metadata-arcfiles) to be indexed. It also maintains a database (persistent object store) of all known files in the repository with their status - "not yet indexed", "indexed" "indexing failed x times" "currently indexing" - and their discovery date. The indexer runs one producer thread which, on a schedule, polls the arcrepository for a list of files. Any new-found files are added to the database. All files still requiring indexing are added to the queue. The indexer has a small suite of consumer threads which reads files from the file-queue and issue batch-jobs to index them. After a job is run, the file status is updated depending on the result of the batch job. If the job was successful, the batch output file is moved to an output directory for sorting and aggregation.

WaybackIndexAggregatorApplication

This component watches the output directory from the WaybackBatchIndexer. It runs as a timer - for example one per hour. Any new files found are passed to the unix sort command to create a single sorted file. This sorted file is then merged, also via unix sort, with a pre-existing sorted file known to wayback. The list of known index files is specified in the spring configuration file CDXCollection.xml in the wayback webapp. One cannot add new files without reloading the web-application. Ideally it would be smart if we could maintain a single list of cdx files in settings.xml which would be used to generate CDXCollection.xml and be passed into the aggregator. This would ensure that wayback and the aggregator always have the same list. The aggregator could also be responsible for creating new empty cdx files when they are specified. We need to investigate the behaviour of wayback when a specified cdx file is absent.

Acceptance Criteria

Sanity testing will consist of running both applications on the PLIGT system (or the new test-suite discussed in Odense, if it is available). The detailed requirements will need to be finalised later, but should include running the indexer over an extended period with new harvests.

Implementation Labour Estimate

Assignment description (this document) 2md

Detailed architecture / interfaces for indexer 1.5md

Detailed architecture / interfaces for aggregator 1md

Integration of object store (hibernate) 3md

Unit tests for indexer 1.5md

Implementation of indexer 1.5md

Sanity testing of indexer 1.5md

Code review of indexer 1md

Unit tests of aggregator 1md

Implementation of aggregator 1md

Sanity testing of aggregator 1md

Code review of aggregator 1md

Documentation 2md

Total: 19md