Tools for Generation of Wayback-ready Indexes

Deduplication Indexing

This task follows from [https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=1678 Feature Request 1678].

At present, nothing at all is written to the cdx index when a duplicate object is found. When such an object is browsed in wayback, wayback may happen to find the last-harvested version of the object, but there is no guarantee of this.

To correct this situation we require both the ability to generate new cdx index files for old jobs and, ideally, to create them on-the-fly for new jobs. We will, of course, reuse the same code for the two cases. The following subtasks are identified:

  1. Develop a utility method with signature public String adaptCrawlLogLineToCDX(String crawl_line) which, given a line from a crawl log, will return a cdx record if the line is a deduplication line, or null otherwise.

  2. Write a batch job using the method from 1. to generate a deduplication cdx index from any existing crawl log or metadata arcfile.
  3. Ensure that the NS documentation describes, with worked examples, how users who do not use our Arcrepository component can nevertheless run batch jobs locally using LocalArcrepositoryClient.

  4. Create a heritrix processor that can catch deduplication lines as they are written on-the-fly and append an appropriate cdx record at the same time.
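
As an illustration of subtask 1, the following is a minimal sketch only, not the agreed implementation. It assumes that a crawl log line has twelve whitespace-separated fields with the annotations last, and that the deduplication module marks a duplicate with an annotation of the form duplicate:"<arcfile>,<offset>". Both assumptions should be checked against real crawl logs. The output layout (url, 14-digit date, mime type, status code, digest, offset, file name) is likewise hypothetical and would have to be matched to the cdx format that wayback is configured to read. The class name DedupCrawlLogToCdx is invented for the example, and the method is made static for convenience.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: class name, field positions and output layout are assumptions. */
final class DedupCrawlLogToCdx {

    // Assumed shape of the deduplication annotation: duplicate:"<arcfile>,<offset>"
    private static final Pattern DUPLICATE =
            Pattern.compile("duplicate:\"([^,\"]+),(\\d+)\"");

    /**
     * Returns a cdx-like record if the crawl log line is a deduplication
     * line, or null otherwise.
     */
    public static String adaptCrawlLogLineToCDX(String crawl_line) {
        if (crawl_line == null) {
            return null;
        }
        String[] f = crawl_line.trim().split("\\s+");
        if (f.length < 12) {
            return null; // not a complete crawl log line
        }
        // Assumed: the annotations are the twelfth field.
        Matcher m = DUPLICATE.matcher(f[11]);
        if (!m.find()) {
            return null; // not a deduplication line
        }
        // Reduce e.g. 2009-05-12T14:23:01.123Z to the 14 digits 20090512142301.
        String digits = f[0].replaceAll("[^0-9]", "");
        if (digits.length() < 14) {
            return null;
        }
        String date14 = digits.substring(0, 14);
        String status = f[1];
        String url = f[3];
        String mime = f[6];
        String digest = f[9];
        String arcFile = m.group(1);
        String offset = m.group(2);
        // Hypothetical layout: url date mime status digest offset file
        return String.join(" ", url, date14, mime, status, digest, offset, arcFile);
    }
}
```

Because the method returns null for every line that is not a deduplication line, both the batch job in subtask 2 and the processor in subtask 4 could pass each crawl log line through it and write out only the non-null results.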

Inclusion of Correct HTTP-return Codes

HTTP return codes are absent from the cdx files generated by heritrix. In Iteration 37 we developed a prototype ExtractWaybackCDXBatchJob class which regenerates cdx files from the archive arc files. The code uses methods supplied as part of wayback and would be trivial if it were not for the issue addressed in [https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=1719 Bug 1719]. The remaining task is to clean up the workaround code included in ExtractWaybackCDXBatchJob and to develop a second version of the BatchJob which will run correctly using wayback code when that code is included in NS itself, and not as part of a JarBatchJob.