Harvester assignment 2: Move to WARC writing instead of ARC writing

This assignment contains the subtasks needed to upgrade NetarchiveSuite to allow Heritrix to write to WARC files instead of ARC-files, including writing the contents of the metadata-1.arc files into WARC metadata records.

TBD: Should NetarchiveSuite still allow the user to write to ARC-files?
Comment: Probably not, as this will complicate the post-processing and indexing facilities of NetarchiveSuite.

Prerequisite: Implementation of Feature request #1643: Update Heritrix to version 1.14.3. This version of Heritrix includes a complete implementation of the WARC standard.

Tasks

Subtask 1: Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates

Currently our harvest templates include the following piece of XML, which configures Heritrix to write ARC files:

 <map name="write-processors">
      <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
        <boolean name="enabled">true</boolean>
        <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">false</boolean>
        <string name="prefix">netarkivet</string>
        <string name="suffix">${HOSTNAME}</string>
        <long name="max-size-bytes">100000000</long>
        <stringList name="path">
          <string>arcs</string>
        </stringList>
        <integer name="pool-max-active">5</integer>
        <integer name="pool-max-wait">300000</integer>
        <long name="total-bytes-to-write">0</long>
        <boolean name="skip-identical-digests">false</boolean>
      </newObject>
    </map>

By replacing this piece of XML with the following, you tell Heritrix to write WARC-files:

      <newObject name="WARCArchiver" class="org.archive.crawler.writer.WARCWriterProcessor">
        <boolean name="enabled">true</boolean>
        <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">false</boolean>
        <string name="prefix">netarkivet</string>
        <string name="suffix">${HOSTNAME}</string>
        <long name="max-size-bytes">100000000</long>
        <stringList name="path">
          <string>warcs</string>
        </stringList>
        <integer name="pool-max-active">5</integer>
        <integer name="pool-max-wait">300000</integer>
        <long name="total-bytes-to-write">0</long>
        <boolean name="skip-identical-digests">false</boolean>
        <boolean name="write-requests">true</boolean>
        <boolean name="write-metadata">true</boolean>
        <boolean name="write-revisit-for-identical-digests">true</boolean>
        <boolean name="write-revisit-for-not-modified">true</boolean>
      </newObject>
    </map>

Estimated time: 2 MD

Subtask 2: Implement CDX-generating code that also works for WARC-files

The CDX-generating code must work for both ARC and WARC files. Currently the method dk.netarkivet.common.utils.cdx.ExtractCDX.generateCDX() ignores all files not ending in .arc. This method is used in the harvest documentation phase to generate CDX-files for the ARC-files coming from Heritrix.

When generating a single CDX-entry for a URL request, information from several WARC-records (e.g. the request and response records) is combined.

Note that Wayback already has code to make a CDX from WARC:

https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/
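
For illustration only, the following sketch shows how WARC records could be iterated to emit CDX-like lines, using Heritrix's org.archive.io classes. The class name WarcCdxSketch and the output columns are made up for this example; the real implementation would have to reproduce the field layout of the existing ARC-based ExtractCDX.

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveRecord;
import org.archive.io.ArchiveRecordHeader;
import org.archive.io.warc.WARCConstants;
import org.archive.io.warc.WARCReaderFactory;

/** Rough sketch: emit one CDX-like line per WARC response record. */
public class WarcCdxSketch {
    public static void printCdx(File warcFile) throws IOException {
        ArchiveReader reader = WARCReaderFactory.get(warcFile);
        try {
            for (Iterator<ArchiveRecord> it = reader.iterator(); it.hasNext();) {
                ArchiveRecordHeader header = it.next().getHeader();
                // Skip warcinfo/request/metadata records; only responses become CDX lines here.
                if (!"response".equals(header.getHeaderValue(WARCConstants.HEADER_KEY_TYPE))) {
                    continue;
                }
                // Minimal column set: URL, timestamp, mime type, offset, file name.
                System.out.println(header.getUrl() + " " + header.getDate() + " "
                        + header.getMimetype() + " " + header.getOffset() + " "
                        + warcFile.getName());
            }
        } finally {
            reader.close();
        }
    }
}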

Estimated time: ? MD

Subtask 3: Extend our BatchJob framework to handle WARC-files on record level

Currently our batch framework only handles ARC files at record level.

Currently we only have an abstract class handling ARCRecords (ARCBatchJob), with these concrete implementations:

  • ExtractCDXJob
  • HarvestedUrlsForDomainBatchJob (which also assumes that the crawl.log is stored in the ARC-file with URL "metadata://netarkivet.dk/crawl/logs/crawl.log")

ARCBatchJob could/should be generalized to handle ArchiveRecords instead of ARCRecords. I have a prototype for such a generalization in the trunk: https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/trunk/tests/dk/netarkivet/common/utils/cdx/ArchiveBatchJob.java?root=netarchivesuite&view=markup
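
A rough, standalone sketch of the generalization (not the prototype linked above): the job is typed against org.archive.io.ArchiveRecord, and org.archive.io.ArchiveReaderFactory is assumed to detect ARC vs. WARC from the file. In NetarchiveSuite the real class would extend FileBatchJob, as ARCBatchJob does today.

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;

/** Sketch of a batch job that processes ARC or WARC files record by record. */
public abstract class ArchiveBatchJobSketch {

    /** Subclasses implement per-record processing, as with ARCBatchJob. */
    public abstract void processRecord(ArchiveRecord record, OutputStream os);

    /** Opens the file with ArchiveReaderFactory, which handles both ARC and WARC. */
    public void processFile(File archiveFile, OutputStream os) throws IOException {
        ArchiveReader reader = ArchiveReaderFactory.get(archiveFile);
        try {
            for (Iterator<ArchiveRecord> it = reader.iterator(); it.hasNext();) {
                processRecord(it.next(), os);
            }
        } finally {
            reader.close();
        }
    }
}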

Estimated time: ? MD

Subtask 4: Upgrade or remove dk.netarkivet.viewerproxy.LocalCDXCache (deprecated, and uses inline CDXCacheBatchJob)

The deprecated class LocalCDXCache should either be upgraded to handle WARC-CDX'es or removed.

Estimated time: ? MD

Subtask 5: Store the contents of the metadata-1.arc files as WARC-records

Find a way to store the contents of the metadata-1.arc files as WARC-records. This means identifying a way to link these WARC-records to the WARC-files harvested by Heritrix in the same harvest job.
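
As one possible direction (illustration only, written with plain Java I/O rather than a Heritrix writer class): the crawl.log and other metadata could be stored as WARC metadata records under their existing metadata:// URIs, with each record pointing back to the harvest's WARC files through the standard WARC-Warcinfo-ID (or WARC-Concurrent-To) header. All file names, dates and record ids below are placeholders.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

/** Sketch: append one WARC metadata record holding the crawl.log to a metadata WARC file. */
public class MetadataRecordSketch {
    public static void main(String[] args) throws IOException {
        byte[] body = "crawl.log contents would go here".getBytes("UTF-8");
        // Record id of the warcinfo record of one of the harvest's WARC files (placeholder).
        String warcinfoId = "<urn:uuid:00000000-0000-0000-0000-000000000000>";
        String header = "WARC/1.0\r\n"  // ISO version string; adjust to match the writer in use
                + "WARC-Type: metadata\r\n"
                + "WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/crawl.log\r\n"
                + "WARC-Date: 2010-08-23T10:10:26Z\r\n"
                + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">\r\n"
                + "WARC-Warcinfo-ID: " + warcinfoId + "\r\n"
                + "Content-Type: text/plain\r\n"
                + "Content-Length: " + body.length + "\r\n"
                + "\r\n";
        OutputStream out = new FileOutputStream("1-metadata-1.warc", true);
        try {
            out.write(header.getBytes("UTF-8"));
            out.write(body);
            out.write("\r\n\r\n".getBytes("UTF-8"));  // mandatory record separator
        } finally {
            out.close();
        }
    }
}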

Estimated time: ? MD

Subtask 6: Extend Bitarchive code so WARC-records can be retrieved from the archive

The class dk.netarkivet.common.distribute.arcrepository.BitarchiveRecord only supports ARCRecords. Maybe just add another BitarchiveRecord constructor to support WARCRecords:

public BitarchiveRecord(WARCRecord record) {
    ArgumentNotValid.checkNotNull(record, "WARCRecord record");
    // For ARC the file name comes from record.getMetaData().getArcFile().getName();
    // for WARC it is read from the record header instead.
    offset = record.getHeader().getOffset();
    length = record.getHeader().getLength();
    fileName = (String) record.getHeader().getHeaderValue(WARCRecord.HEADER_KEY_FILENAME);
    ....
}

The same class also uses the method ARCUtils.readARCRecord(ARCRecord ar), and we may need a corresponding method for reading WARCRecords:

public static byte[] readWARCRecord(WARCRecord in) throws IOException {..}
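
A minimal sketch of such a method, assuming that WARCRecord, like ARCRecord, extends org.archive.io.ArchiveRecord and can therefore be read as an InputStream over the record payload:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.archive.io.warc.WARCRecord;

/** Sketch of a WARC counterpart to ARCUtils.readARCRecord(ARCRecord). */
public class WARCUtilsSketch {
    public static byte[] readWARCRecord(WARCRecord in) throws IOException {
        // Drain the record payload into a byte array.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }
}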

The method dk.netarkivet.archive.bitarchive.Bitarchive.get(String arcfile, long index) needs to work for WARC-files as well as ARC-files.
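
A minimal sketch of a format-agnostic lookup, assuming org.archive.io.ArchiveReaderFactory can open both formats and that ArchiveReader.get(long offset) returns the record starting at a byte offset; the names below are illustrative, not the actual Bitarchive code.

import java.io.File;
import java.io.IOException;

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;

/** Sketch of an offset-based record lookup that works for both ARC and WARC files. */
public class ArchiveLookupSketch {
    public static ArchiveRecord getRecord(File archiveFile, long offset) throws IOException {
        // The factory inspects the file and returns an ARCReader or a WARCReader.
        ArchiveReader reader = ArchiveReaderFactory.get(archiveFile);
        // NB: the reader must stay open while the record is read, and be closed afterwards.
        return reader.get(offset);
    }
}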

Estimated time: ? MD

Subtask 7: Upgrade of Indexserver system

The current indexing system assumes that the Heritrix crawl.log and the CDX'es for each harvest job are stored in a metadata-1.arc file. After subtask 5 is implemented, this will no longer be true.

This task should solve the following problems:

  • Which Heritrix processor will handle deduplication in a WARC scenario? There are two possibilities:
    1) Upgrade the Icelandic deduplication module to write its deduplication information to a WARC revisit record instead of to the crawl.log.
    2) Use Heritrix's own deduplication processes.

  • How are we going to construct the indices for deduplication and for the proxyviewer access system?

Estimated time: ? MD
