Tools for Generation of Wayback-ready Indexes

Deduplication Indexing

This task follows from Feature Request 1678.

Currently, nothing at all is written to the cdx index when a duplicate object is found. When such an object is browsed in wayback, wayback may happen to find the last-harvested version of the object, but there is no guarantee of this.

To correct this situation we require both the ability to generate new cdx index files for old jobs and, ideally, to create them on-the-fly for new jobs. We will, of course, reuse the same code for the two cases. The following subtasks are identified:

  1. Analysis (1 md)
  2. Develop a utility method with signature public String adaptCrawlLogLineToCDX(String crawl_line) which, given a line from a crawl log, returns a cdx record if the line is a deduplication line and null otherwise. Another utility method, public void adaptCrawlRecordToCDX(InputStream is, OutputStream os), will do the same for whole crawl records. (1 md)

  3. Write a batch job using the utility methods from 2. to generate a deduplication cdx index from any existing crawl log or metadata arcfile. (1 md)
  4. Ensure that the NS documentation describes, with worked examples, how users who do not use our Arcrepository component can run batch jobs locally using LocalArcrepositoryClient (1 md)

  5. Create a heritrix processor that can catch deduplication lines as they are written on-the-fly and append an appropriate cdx record at the same time. Note that these records must be put into a new file as wayback cdx records have a different (and richer) content than heritrix-generated cdx's (see below). (2 md)
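
The two utility methods from subtask 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation: it assumes whitespace-separated crawl log fields in the usual heritrix order (timestamp, status code, size, URL, discovery path, referrer, mime type, thread, fetch time, digest, source tag, annotations), a deduplication annotation of the form duplicate:"arcfile,offset", and a cdx field order of key, date, URL, mime type, HTTP code, digest, redirect, offset, filename. The class name DedupCdxSketch is hypothetical, lower-casing the URL is a placeholder for wayback's real URL canonicalization, and the throws IOException clause is an addition to the signature given above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DedupCdxSketch {
    // Assumed annotation format: duplicate:"<arcfile>,<offset>"
    private static final Pattern DUPLICATE =
            Pattern.compile("duplicate:\"([^,\"]+),(\\d+)\"");
    // 14-digit cdx timestamp, always in UTC
    private static final DateTimeFormatter CDX_DATE =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    /** Returns a cdx record for a deduplication line, or null for any other line. */
    public static String adaptCrawlLogLineToCDX(String crawlLine) {
        if (crawlLine == null) {
            return null;
        }
        Matcher m = DUPLICATE.matcher(crawlLine);
        if (!m.find()) {
            return null; // not a deduplication line
        }
        String[] f = crawlLine.trim().split("\\s+");
        if (f.length < 10) {
            return null; // too few fields to build a record
        }
        String date;
        try {
            date = CDX_DATE.format(Instant.parse(f[0]));
        } catch (DateTimeParseException e) {
            return null; // unparseable timestamp
        }
        String url = f[3];
        // Placeholder only: a real index needs wayback's URL canonicalization.
        String key = url.toLowerCase(Locale.ROOT);
        String digest = f[9].replaceFirst("^sha1:", "");
        // Assumed field order: key date url mime httpcode digest redirect offset file
        return String.join(" ", key, date, url, f[6], f[1], digest, "-",
                m.group(2), m.group(1));
    }

    /** Copies one cdx record to os for every deduplication line read from is. */
    public static void adaptCrawlRecordToCDX(InputStream is, OutputStream os)
            throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        Writer writer = new OutputStreamWriter(os, StandardCharsets.UTF_8);
        String line;
        while ((line = reader.readLine()) != null) {
            String cdx = adaptCrawlLogLineToCDX(line);
            if (cdx != null) {
                writer.write(cdx);
                writer.write('\n');
            }
        }
        writer.flush(); // the caller owns the streams, so they are not closed here
    }
}
```

Keeping the per-line method free of I/O means the batch job in subtask 3 and the heritrix processor in subtask 5 can both call it, which is the code reuse intended above.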

Inclusion of Correct HTTP-return Codes

HTTP return codes are absent from the cdx files generated by heritrix. In Iteration 37 we developed a prototype ExtractWaybackCDXBatchJob class which regenerates cdx files from the archive arc files. The code uses methods supplied as part of wayback and would be trivial if it were not for the issue addressed in Bug 1719. The remaining task is to clean up the workaround code included in ExtractWaybackCDXBatchJob and to develop a second version of the BatchJob which runs correctly using native wayback code in situations where the batch-job code is permitted to read System properties. (2 md)

Release Test Criteria

  1. Test of batch job for dedup cdx indexing. Output should be well-formed cdx records.
  2. Test of generation of dedup cdx indexes on the fly. Should auto-generate a cdx file during harvesting which has the same content as that generated afterwards by batch job.
  3. Test of two versions of ExtractWaybackCDXBatchJob (one using workaround code, one using native wayback methods). Both should generate well-formed wayback cdx indexes.
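
For criterion 2, one possible check is to compare the on-the-fly cdx file with the batch-generated one as unordered collections of records, since the two may not list records in the same order. The sketch below assumes that ordering and blank lines are irrelevant and that duplicate records should be counted; the class name CdxFileComparison and its method are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CdxFileComparison {

    /** True if both lists hold the same non-blank records, ignoring order. */
    public static boolean sameRecords(List<String> first, List<String> second) {
        return normalize(first).equals(normalize(second));
    }

    /** Convenience overload that reads both cdx files from disk. */
    public static boolean sameRecords(Path first, Path second) throws IOException {
        return sameRecords(
                Files.readAllLines(first, StandardCharsets.UTF_8),
                Files.readAllLines(second, StandardCharsets.UTF_8));
    }

    // Trim each record, drop blank lines, and sort so that order does not matter.
    private static List<String> normalize(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        Collections.sort(out);
        return out;
    }
}
```

Sorting rather than collecting into a set keeps repeated records significant, so a file that contains the same record twice does not compare equal to one that contains it once.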

Time Estimate

8 md in Total

ImprovedIndexing (last edited 2010-08-16 10:24:45 by localhost)