Tools for Generation of Wayback-ready Indexes

Deduplication Indexing

This task follows from [https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=1678 Feature Request 1678].

The current situation is that nothing at all is written to the cdx index when a duplicate object is found. When such objects are browsed in wayback, one may be lucky and wayback will find the last-harvested version of the object, but there is no guarantee of this.

To correct this situation, we require both the ability to generate new cdx index files for old jobs and, ideally, the ability to create them on-the-fly for new jobs. We will, of course, reuse the same code for both cases. The following subtasks have been identified:

  1. Analysis (1 md)
  2. Develop a utility method with signature public String adaptCrawlLogLineToCDX(String crawl_line) which, given a line from a crawl log, will return a cdx record if the line is a deduplication line, or null otherwise. A sketch of such a method follows this list. (1 md)
  3. Write a batch job using the utility method from 2. to generate a deduplication cdx index from any existing crawl log or metadata arcfile; see the batch-job sketch after this list. (1 md)
  4. Ensure that the NS documentation describes, with worked examples, how users who do not use our Arcrepository component can nevertheless run batch jobs locally using LocalArcrepositoryClient. (1 md)
  5. Create a heritrix processor that can catch deduplication lines as they are written on-the-fly and append an appropriate cdx record at the same time. Note that these records must be put into a new file, as wayback cdx records have different (and richer) content than heritrix-generated cdx records (see below). (2 md)
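
To make subtask 2 concrete, here is a minimal sketch of the utility method. It assumes the standard heritrix crawl-log field layout (timestamp, status, size, url, hop path, referrer, mimetype, thread, fetch timestamp, digest, source tag, annotations) and a DeDuplicator annotation of the form duplicate:"arcfile,offset"; both the annotation format and the output cdx field order are assumptions to be checked against real crawl logs and against the layout wayback is configured to read. The same method would be reused by the heritrix processor in subtask 5.

{{{
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeduplicateCDXAdapter {

    /** Matches a DeDuplicator annotation, assumed here to have the form
     *  duplicate:"arcfile,offset" (with or without the quotes). */
    private static final Pattern DUPLICATE_PATTERN =
            Pattern.compile("duplicate:\"?([^,\"]+),([0-9]+)");

    /**
     * Given a line from a crawl log, returns a cdx record if the line is
     * a deduplication line, or null otherwise.
     */
    public String adaptCrawlLogLineToCDX(String crawl_line) {
        if (crawl_line == null) {
            return null;
        }
        String[] fields = crawl_line.trim().split("\\s+");
        if (fields.length < 12) {
            return null; // too short to carry an annotations field
        }
        Matcher m = DUPLICATE_PATTERN.matcher(fields[11]);
        if (!m.find()) {
            return null; // not a deduplication line
        }
        // Reduce the ISO timestamp to the 14-digit form used in cdx files.
        String digits = fields[0].replaceAll("[^0-9]", "");
        String timestamp =
                digits.length() > 14 ? digits.substring(0, 14) : digits;
        String url = fields[3];
        String status = fields[1];
        String mimetype = fields[6];
        String digest = fields[9].replaceFirst("^sha1:", "");
        String arcfile = m.group(1);
        String offset = m.group(2);
        // Assumed output field order: url, date, url, mimetype, status,
        // digest, offset, arcfile.
        return url + " " + timestamp + " " + url + " " + mimetype + " "
                + status + " " + digest + " " + offset + " " + arcfile;
    }
}
}}}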

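Subtask 3 then becomes little more than a wrapper around this adapter. The sketch below assumes the NS FileBatchJob contract (initialize/processFile/finish) and, for simplicity, that each input file is a plain crawl log; a real job must additionally pull the crawl log out of a metadata arcfile. The class name and the package of FileBatchJob should be checked against the current code base.

{{{
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStream;

import dk.netarkivet.common.utils.batch.FileBatchJob;

/** Emits one deduplication cdx record per deduplication line found. */
public class DeduplicateCDXBatchJob extends FileBatchJob {

    private final DeduplicateCDXAdapter adapter = new DeduplicateCDXAdapter();

    public void initialize(OutputStream os) {
        // No per-run setup is needed.
    }

    public boolean processFile(File file, OutputStream os) {
        try {
            BufferedReader reader = new BufferedReader(new FileReader(file));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String cdx = adapter.adaptCrawlLogLineToCDX(line);
                    if (cdx != null) {
                        os.write((cdx + "\n").getBytes());
                    }
                }
            } finally {
                reader.close();
            }
            return true;
        } catch (IOException e) {
            return false; // report this file as failed
        }
    }

    public void finish(OutputStream os) {
        // Nothing to flush.
    }
}
}}}

For subtask 4, users without a deployed Arcrepository would submit such a job through LocalArcrepositoryClient against files on local disk; the worked examples called for above should show the exact client calls, which are not reproduced here.
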
Inclusion of Correct HTTP Return Codes

HTTP return codes are absent from the cdx'es generated by heritrix. In Iteration 37 we developed a prototype ExtractWaybackCDXBatchJob class which regenerates cdx files from the archive arc files. The code uses methods supplied as part of wayback and would be trivial if it were not for the issue addressed in [https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=1719 Bug 1719]. The remaining task is to clean up the workaround code included in ExtractWaybackCDXBatchJob and to develop a second version of the BatchJob which will run correctly using wayback code when that code is included in NS itself, rather than as part of a JarBatchJob. (2 md)
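
For reference, the sketch below shows one commonly used wayback cdx layout; the legend letters and the sample values are illustrative, not output from the prototype. The s column, the HTTP response code, is exactly what heritrix-generated cdx'es lack, and the extra columns are the richer content referred to above.

{{{
 CDX N b a m s k r M S V g
dk,netarkivet)/forside 20090825082228 http://netarkivet.dk/forside text/html 200 SOMESHA1DIGEST - - 2816 12345 1-1-20090825000000-00000-sb.arc
}}}

Here N is the canonicalized url, b the 14-digit timestamp, a the original url, m the mime type, s the HTTP response code, k the digest, r the redirect, M meta tags, S the compressed record size, V the arc file offset, and g the arc file name.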

Release Test Criteria

  1. Test of batch job for dedup cdx indexing. Output should be well-formed cdx records.
  2. Test of the generation of dedup cdx indexes on the fly. This should auto-generate a cdx file during harvesting with the same content as that generated afterwards by the batch job.
  3. Test of two versions of ExtractWaybackCDXBatchJob (one as jar batch job, one using native NS code). Both should generate well-formed wayback cdx indexes.

Time Estimate

8 md in total (6 md for the deduplication indexing subtasks and 2 md for the HTTP return code work)
