Diff for "AutomaticIndexing"

Differences between revisions 2 and 3

Overall Aims

The purpose of this task is to develop a component which will automatically index new material and add the resulting index to a running wayback instance. The essential requirement is that any harvested webpage should be discoverable via wayback in a reasonably timely fashion (for example, within at most three days of harvesting). We have two basic tools to achieve this task:

ExtractWaybackCDXBatchJob for harvested arc files, and
ExtractDeduplicateCDXBatchJob for dedup records in metadata files

(So every file in the repository needs to be indexed by one batch-job or the other.)

Files returned by these batch-jobs are unsorted, but are ready for sorting. They require no other post-processing.

Architectural Constraints

A major constraint on the architecture is that we can only know what files are available in a given replica of the archive. We cannot easily determine what files are missing due to, e.g., machine downtime. In order to ensure that any missed files are indexed later we therefore need to maintain a list of files which have been indexed, for example in a relational database.

In our batch architecture, the client only knows the number of files on which the batch job was successfully run, not their names. Therefore a strategy to ensure that all files are indexed could be as follows:

i. Select a set of un-indexed files (either all data or all metadata files) from the database
ii. Create a batch job to index only these files by calling FileBatchJob.processOnlyFilesNamed(List<String> specifiedFilenames)
iii. If the standard output shows that the number of files processed is as expected, keep the output and mark all these files as indexed in the database
iv. Otherwise just throw away the output and select a new set of files to index

In this way we can exploit a controllable amount of concurrency by specifying how many files to index in each batch job.
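The four steps above can be sketched as follows. This is only an illustration under stated assumptions: BatchRunner and BatchIndexingCycle are hypothetical names, the database is replaced by two in-memory sets, and the actual job submission (where FileBatchJob.processOnlyFilesNamed would be called) is hidden behind the BatchRunner interface, which just reports how many files were processed.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for submitting a batch job; returns the number of files processed. */
interface BatchRunner {
    int runOn(List<String> filenames);
}

/** Sketch of the select / run / check-count / mark-indexed cycle (in-memory, not a real DAO). */
final class BatchIndexingCycle {
    private final Set<String> unindexed = new LinkedHashSet<>();
    private final Set<String> indexed = new LinkedHashSet<>();

    BatchIndexingCycle(List<String> knownUnindexedFiles) {
        unindexed.addAll(knownUnindexedFiles);
    }

    /** Runs one batch of at most batchSize files; returns true if its output should be kept. */
    boolean runOneBatch(int batchSize, BatchRunner runner) {
        // Step i: select a set of un-indexed files (here simply in insertion order).
        List<String> batch = new ArrayList<>();
        for (String name : unindexed) {
            if (batch.size() == batchSize) {
                break;
            }
            batch.add(name);
        }
        if (batch.isEmpty()) {
            return false;
        }
        // Step ii: the real job would be restricted to exactly these names.
        int processed = runner.runOn(batch);
        // Step iv: the count is all the client knows, so any mismatch discards the whole batch.
        if (processed != batch.size()) {
            return false;
        }
        // Step iii: the count matches, so every file in the batch is marked as indexed.
        unindexed.removeAll(batch);
        indexed.addAll(batch);
        return true;
    }

    Set<String> indexedFiles() {
        return indexed;
    }

    Set<String> unindexedFiles() {
        return unindexed;
    }
}
```

The batchSize argument is the tuning knob referred to above: smaller batches lose less output when a single file is missing, at the cost of more jobs. Note that this sketch would reselect the same files after a failure; choosing a different set is a matter for the selection strategy.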

Overall Architecture

The WaybackIndexer should consist of two separate components:

A WaybackBatchIndexerApplication. This is the component which maintains a list of files in the archive (via FileListJob) and starts batch jobs to index any and all files not yet indexed. This application could be parameterized to index either data files or metadata files and one instance of each could be started. In this case, the database would need to run in standalone mode.
A WaybackIndexAggregatorApplication. This component scans the output directories of WaybackBatchIndexer, takes any new files found, sorts them, and merges the result into existing sorted index files.

It should be possible not only to run these two components separately but to develop them relatively independently of each other as the only interface between them is the filesystem.

Functional Description

WaybackBatchIndexer

The indexer runs on a timer - for example once per day - or possibly simply with a wait(ONE_DAY) between runs. It maintains a relational database showing which files have been indexed and possibly also which files are known but have not yet been indexed. The suggested architecture of the database is a standalone Derby RDBMS server with a Hibernate-annotation O-R storage layer. (This will help to keep development overhead to a minimum as generic DAOs can be generated automatically.) Once per day (for example) a complete list of all files in the archive is fetched and any new files are added to the DB. The indexer generates a list of all files not yet indexed and creates batch jobs to index these files in small groups. On completion of a job, the standard output is checked for success on all files. If the job was successful, the batch output file is moved to a suitable directory and the DB is updated to reflect that the given files have been indexed.
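As a rough illustration of the timed run described above, the sketch below uses a ScheduledExecutorService in place of a bare wait(ONE_DAY). The names KnownFiles and IndexerTimer are hypothetical, the database is replaced by an in-memory set, and the archive listing (which would come from a FileListJob) is passed in as a plain Supplier.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical in-memory stand-in for the database table of known files. */
final class KnownFiles {
    private final Set<String> known = new LinkedHashSet<>();

    /** Adds any names not seen before and returns how many were new. */
    synchronized int registerNew(Collection<String> archiveListing) {
        int added = 0;
        for (String name : archiveListing) {
            if (known.add(name)) {
                added++;
            }
        }
        return added;
    }

    synchronized int size() {
        return known.size();
    }
}

/** Sketch of the periodic run: fetch the full listing, then register whatever is new. */
final class IndexerTimer {
    static ScheduledExecutorService start(Supplier<Collection<String>> listArchive,
                                          KnownFiles db, long delay, TimeUnit unit) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(() -> {
            try {
                db.registerNew(listArchive.get());
                // Batch jobs for the not-yet-indexed files would be created here.
            } catch (RuntimeException e) {
                // Report and carry on: an exception escaping the task would cancel all later runs.
                System.err.println("Indexing run failed: " + e);
            }
        }, 0, delay, unit);
        return timer;
    }
}
```

A daily run would be started with IndexerTimer.start(listing, db, 1, TimeUnit.DAYS). The returned executor uses a non-daemon thread, so shutdown() has to be called on it for the application to exit.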

An important issue is how to select which files to clump together in batch jobs. In particular, any missing files will cause timeouts and result in the output from successfully indexed files being thrown away. One strategy is the following:

In the database, each file has a field "NumberOfTimesFailed" initialised to zero
This is incremented any time the given file is part of a batch job which fails
Not yet indexed files are selected (queued) by date order (most recent first) and in reverse order of NumberOfTimesFailed

We could also tune the algorithm so that, for example, files which have failed multiple times are subsequently indexed one at a time. This will allow us to rescue files which have "gotten a bad reputation" by ending up near actually missing or damaged files in the queue. We should also manually inspect the database for files which cannot be indexed.
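One possible reading of this selection strategy is sketched below. Several points are assumptions rather than decisions from the text: "reverse order of NumberOfTimesFailed" is taken to mean fewest failures first, the failure count is taken as the primary sort key with harvest date as the tie-breaker, and the threshold ISOLATE_AFTER_FAILURES for one-at-a-time indexing is a hypothetical value. PendingFile and BatchSelector are illustrative names only.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical record for a file that is known but not yet indexed. */
final class PendingFile {
    final String name;
    final Instant harvested;
    int numberOfTimesFailed = 0; // initialised to zero, as in the proposed database field

    PendingFile(String name, Instant harvested) {
        this.name = name;
        this.harvested = harvested;
    }
}

final class BatchSelector {
    /** Hypothetical threshold after which a file is only ever indexed on its own. */
    static final int ISOLATE_AFTER_FAILURES = 3;

    /** Fewest failures first, then most recently harvested first (an assumed interpretation). */
    static final Comparator<PendingFile> QUEUE_ORDER =
            Comparator.comparingInt((PendingFile f) -> f.numberOfTimesFailed)
                      .thenComparing((PendingFile f) -> f.harvested, Comparator.reverseOrder());

    /** Picks the next batch without modifying the list that was passed in. */
    static List<PendingFile> nextBatch(List<PendingFile> pending, int batchSize) {
        List<PendingFile> queue = new ArrayList<>(pending);
        queue.sort(QUEUE_ORDER);
        if (queue.isEmpty()) {
            return queue;
        }
        // If even the best candidate has failed too often, so have all the others: index it alone.
        if (queue.get(0).numberOfTimesFailed >= ISOLATE_AFTER_FAILURES) {
            return new ArrayList<>(queue.subList(0, 1));
        }
        List<PendingFile> batch = new ArrayList<>();
        for (PendingFile f : queue) {
            if (batch.size() == batchSize || f.numberOfTimesFailed >= ISOLATE_AFTER_FAILURES) {
                break;
            }
            batch.add(f);
        }
        return batch;
    }

    /** Called when a batch job fails: every file that was part of it gets the blame. */
    static void recordFailure(List<PendingFile> batch) {
        for (PendingFile f : batch) {
            f.numberOfTimesFailed++;
        }
    }
}
```

With this ordering, a healthy file that was unlucky enough to share a batch with a missing one drifts towards the back of the queue but is eventually retried alone, at which point it either succeeds or is left for manual inspection.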

WaybackIndexAggregator

This component watches the output directory of the WaybackBatchIndexer. It runs on a timer - for example once per hour. Any new files found are passed to the Unix sort command to create a single sorted file. This sorted file is then merged, also via Unix sort, with a pre-existing sorted file known to wayback.
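The two sort invocations could be driven from Java roughly as follows. This is a sketch, not the planned implementation: AggregatorCommands and all file names are hypothetical, and setting LC_ALL=C (so that both steps use the same byte-wise collation) is an assumption that is not stated in the text. The -o (output file) and -m (merge already sorted inputs) options are standard options of the sort utility.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: builds the two sort commands used by the aggregator. */
final class AggregatorCommands {

    /** Step 1: sort all newly found batch output files into one sorted file. */
    static ProcessBuilder sortNewFiles(List<Path> newFiles, Path sortedOut) {
        List<String> cmd = new ArrayList<>(Arrays.asList("sort", "-o", sortedOut.toString()));
        for (Path p : newFiles) {
            cmd.add(p.toString());
        }
        return withByteOrder(new ProcessBuilder(cmd));
    }

    /** Step 2: merge the new sorted file with the existing sorted index. */
    static ProcessBuilder mergeIntoIndex(Path existingIndex, Path sortedNew, Path mergedOut) {
        // -m only merges; it relies on both inputs already being sorted in the same collation.
        return withByteOrder(new ProcessBuilder(
                "sort", "-m", "-o", mergedOut.toString(),
                existingIndex.toString(), sortedNew.toString()));
    }

    /** Assumption: force the C locale so both steps compare lines byte by byte. */
    private static ProcessBuilder withByteOrder(ProcessBuilder pb) {
        pb.environment().put("LC_ALL", "C");
        return pb.inheritIO();
    }

    /** Runs a prepared command and fails loudly on a non-zero exit code. */
    static void run(ProcessBuilder pb) throws IOException, InterruptedException {
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IOException("Command failed with exit code " + exit + ": " + pb.command());
        }
    }
}
```

Writing the merge result to a separate file and only then moving it into place would avoid wayback ever reading a half-written index; whether that is needed depends on how wayback picks up the file.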

- Revision 2 as of 2009-09-22 13:49:49
  Size: 3847
  Editor: ColinRosenthal
+ Revision 3 as of 2009-09-23 08:22:20
  Size: 5101
  Editor: ColinRosenthal
-Deletions are marked like this.
+Additions are marked like this.
 Line 11:
-A major constraint on the architecture is that we can only know what files are available in a given replica of the archive. We cannot easily determine what files are missing due to, e.g., machine downtime. In order to ensure that any missed files are indexed later we therefore need to maintain a list of files which have been indexed, for example in a relational database. Moreover, in our current architecture the only way to know that a batch job has run on a given file is to use a regexp matching that file alone and then check that the std output from the job shows that it has been successfully run on one file. This is therefore what we will do, running many small batch jobs. We should consider whether we should run multiple small batch jobs concurrently. (If so, these must reuse a small pool of applicationInstanceId's or we will pollute the jms broker with a lot of queues.)
+A major constraint on the architecture is that we can only know what files are available in a given replica of the archive. We cannot easily determine what files are missing due to, e.g., machine downtime. In order to ensure that any missed files are indexed later we therefore need to maintain a list of files which have been indexed, for example in a relational database. 

In our batch architecture, the client only knows the number of files on which the batch job was successfully run, not their names. Therefore a strategy to ensure that all files
are indexed could be as follows:
 i. Select a set of un-indexed files (either all data or all metadata files) from the database
 ii. Create a batch job to index only these files by calling FileBatchJob.processOnlyFilesNamed(List<String> specifiedFilenames)
 iii. If the standard output shows that the number of files processed is as expected, keep the output and mark all these files as indexed in the database
 iv. Otherwise just throw away the output and select a new set of files to index

In this way we can exploit a controllable amount of concurrency by specifying how many files to index in each batch job.
-Line 16:
+Line 25:
- *A !WaybackBatchIndexer. This is the component which maintains a list of files in the archive (via !FileListJob) and starts batch jobs to index any and all files not yet indexed
 *A !WaybackIndexAggregator. This component scans the output directories of !WaybackBatchIndexer, takes any new files found, sorts them, and merges the result into existing sorted index files.
+ *A !WaybackBatchIndexerApplication. This is the component which maintains a list of files in the archive (via !FileListJob) and starts batch jobs to index any and all files not yet indexed. This application could be parameterized to index either data files or metadata files and one instance of each could be started. In this case, the database would need to run in standalone mode.
 *A !WaybackIndexAggregatorApplication. This component scans the output directories of !WaybackBatchIndexer, takes any new files found, sorts them, and merges the result into existing sorted index files.
-Line 26:
+Line 36:
-generates a list of all files not yet indexed and submits batch jobs for them - one per file. Ideally there should be a pooling solution to ensure that a fixed maximum number of such jobs
is being run at any given time (commons-pool will be useful for this) . On completion of a job, the standard output is checked for success on one file. If the job was successful, the
batch output file is moved to a suitable directory and the DB is updated to reflect that the given file has been indexed.
+generates a list of all files not yet indexed and creates batch jobs to index these files in small groups. On completion of a job, the standard output is checked for success on all files. If the job was successful, the
batch output file is moved to a suitable directory and the DB is updated to reflect that the given files have been indexed.

An important issue is how to select which files to clump together in batch jobs. In particular, any missing files will cause timeouts and result in the output from
successfully indexed files being thrown away. One strategy is the following:
 i. In the database, each file has a field "NumberOfTimesFailed" initialised to zero
 i. This is incremented any time the given file is part of a batch job which fails
 i. Not yet indexed files are selected (queued) by date order (most recent first) and in reverse order of NumberOfTimesFailed
We could also tune the algorithm so that, for example, files which have failed multiple times are subsequently indexed one at a time. This will allow
us to rescue files which have "gotten a bad reputation" by ending up near actually missing or damaged files in the queue. We should also manually inspect the database for 
files which cannot be indexed.