
Harvester assignment 2: Move to WARC writing instead of ARC writing

This assignment contains the subtasks needed to upgrade NetarchiveSuite to allow Heritrix to write to WARC files instead of ARC-files, including writing the contents of the metadata-1.arc files into WARC metadata records.

TBD: Should NetarchiveSuite still allow the user to write to ARC files?
Comment: Probably not, as this will complicate the post-processing and indexing facilities of NetarchiveSuite.

Prerequisite: Implementation of Feature request #1643: Update Heritrix to version 1.14.3. This version of Heritrix includes a complete implementation of the WARC standard.

Tasks

Subtask 1: Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates

Currently our harvest templates include the following XML fragment, which configures Heritrix to write ARC files:

 <map name="write-processors">
      <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
        <boolean name="enabled">true</boolean>
        <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">false</boolean>
        <string name="prefix">netarkivet</string>
        <string name="suffix">${HOSTNAME}</string>
        <long name="max-size-bytes">100000000</long>
        <stringList name="path">
          <string>arcs</string>
        </stringList>
        <integer name="pool-max-active">5</integer>
        <integer name="pool-max-wait">300000</integer>
        <long name="total-bytes-to-write">0</long>
        <boolean name="skip-identical-digests">false</boolean>
      </newObject>
    </map>

Replacing that fragment with the following tells Heritrix to write WARC files instead:

 <map name="write-processors">
      <newObject name="WARCArchiver" class="org.archive.crawler.writer.WARCWriterProcessor">
        <boolean name="enabled">true</boolean>
        <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">false</boolean>
        <string name="prefix">netarkivet</string>
        <string name="suffix">${HOSTNAME}</string>
        <long name="max-size-bytes">100000000</long>
        <stringList name="path">
          <string>warcs</string>
        </stringList>
        <integer name="pool-max-active">5</integer>
        <integer name="pool-max-wait">300000</integer>
        <long name="total-bytes-to-write">0</long>
        <boolean name="skip-identical-digests">false</boolean>
        <boolean name="write-requests">true</boolean>
        <boolean name="write-metadata">true</boolean>
        <boolean name="write-revisit-for-identical-digests">true</boolean>
        <boolean name="write-revisit-for-not-modified">true</boolean>
      </newObject>
    </map>
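
If many templates need the same change, the edit could be scripted rather than done by hand. The following is a minimal sketch, assuming the templates can be parsed with the JDK's DOM API; the class WarcTemplateConverter and its methods are hypothetical and not part of NetarchiveSuite. The class names, the renamed "WARCArchiver" objects, the "warcs" directory, and the four extra WARC settings are taken from the two fragments above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of NetarchiveSuite): ARC writer config to WARC writer config. */
final class WarcTemplateConverter {
    static final String ARC_CLASS = "org.archive.crawler.writer.ARCWriterProcessor";
    static final String WARC_CLASS = "org.archive.crawler.writer.WARCWriterProcessor";
    // WARC-only settings, copied from the WARC fragment in this subtask.
    static final String[] WARC_FLAGS = {"write-requests", "write-metadata",
        "write-revisit-for-identical-digests", "write-revisit-for-not-modified"};

    static String convert(String templateXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(templateXml)));
        NodeList objects = doc.getElementsByTagName("newObject");
        for (int i = 0; i < objects.getLength(); i++) {
            Element writer = (Element) objects.item(i);
            if (!ARC_CLASS.equals(writer.getAttribute("class"))) {
                continue; // leave every other processor untouched
            }
            writer.setAttribute("class", WARC_CLASS);
            writer.setAttribute("name", "WARCArchiver");
            renameDecideRules(writer);
            renameOutputDir(writer);
            for (String flag : WARC_FLAGS) {
                Element setting = doc.createElement("boolean");
                setting.setAttribute("name", flag);
                setting.setTextContent("true");
                writer.appendChild(setting);
            }
        }
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Renames the nested "Archiver#decide-rules" object to match the new writer name. */
    private static void renameDecideRules(Element writer) {
        NodeList nested = writer.getElementsByTagName("newObject");
        for (int i = 0; i < nested.getLength(); i++) {
            Element e = (Element) nested.item(i);
            if ("Archiver#decide-rules".equals(e.getAttribute("name"))) {
                e.setAttribute("name", "WARCArchiver#decide-rules");
            }
        }
    }

    /** Changes the output directory "arcs" to "warcs" inside the "path" string list only. */
    private static void renameOutputDir(Element writer) {
        NodeList strings = writer.getElementsByTagName("string");
        for (int i = 0; i < strings.getLength(); i++) {
            Element e = (Element) strings.item(i);
            Element parent = (Element) e.getParentNode();
            if ("path".equals(parent.getAttribute("name"))
                    && "arcs".equals(e.getTextContent().trim())) {
                e.setTextContent("warcs");
            }
        }
    }
}
```

The sketch only touches the writer processor; the serialized output may differ from the input in whitespace and attribute order, so converted templates should still be reviewed by hand.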

Estimated time: 2 MD

Subtask 2: Implement CDX-generating code that also works for WARC files

The CDX-generating code must work for both ARC and WARC files. Currently the method dk.netarkivet.common.utils.cdx.ExtractCDX.generateCDX() ignores all files not ending with .arc. This method is used in the harvest documentation phase to generate CDX files for the ARC files coming from Heritrix.
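
One way to lift the ".arc only" restriction is to classify each file by its name before choosing how to index it. The sketch below is illustrative only: the class ArchiveFileNames and its Format enum are hypothetical names, and accepting a trailing ".gz" is an assumption (the templates in subtask 1 set compress to false).

```java
import java.util.Locale;

/** Hypothetical helper (not part of NetarchiveSuite): classifies harvest files by name. */
final class ArchiveFileNames {
    enum Format { ARC, WARC, OTHER }

    static Format formatOf(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        // Assumption: compressed files carry an extra ".gz" suffix.
        if (name.endsWith(".gz")) {
            name = name.substring(0, name.length() - ".gz".length());
        }
        if (name.endsWith(".warc")) {
            return Format.WARC;
        }
        if (name.endsWith(".arc")) {
            return Format.ARC;
        }
        return Format.OTHER; // e.g. logs and reports, which get no CDX entries
    }
}
```

A CDX generator could then skip only the files classified as OTHER and pick an ARC or WARC reader for the rest.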

When generating a single CDX entry for a URL request, information from several WARC records is combined.

Note that Wayback already has code to make a CDX from WARC:

https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/

Estimated time: ? MD

Subtask 3: Extend our BatchJob framework to handle WARC-files on record level

Currently our batch framework only handles ARC files on record level.

The only abstract class handling ARCRecords is ARCBatchJob, which has these concrete implementations:

ExtractCDXJob,
HarvestedUrlsForDomainBatchJob (also assumes crawl.log stored in ARC-file with URL "metadata://netarkivet.dk/crawl/logs/crawl.log")

ARCBatchJob could/should be generalized to handle ArchiveRecords instead of ARCRecords. I have a prototype for such a generalization in the trunk: https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/trunk/tests/dk/netarkivet/common/utils/cdx/ArchiveBatchJob.java?root=netarchivesuite&view=markup
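
To illustrate the shape of such a generalization (this is not the prototype linked above), here is a minimal, self-contained sketch. All types in it are hypothetical stand-ins: SimpleRecord plays the role of a format-neutral record such as Heritrix's ArchiveRecord, and RecordBatchJob the role of a generalized batch job whose subclasses never see whether a record came from an ARC or a WARC file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a format-neutral record (ARC or WARC). */
interface SimpleRecord {
    String url();
    long length();
}

/** Simple fixed-value record, used here only to exercise the job. */
final class FixedRecord implements SimpleRecord {
    private final String url;
    private final long length;

    FixedRecord(String url, long length) {
        this.url = url;
        this.length = length;
    }

    @Override public String url() { return url; }
    @Override public long length() { return length; }
}

/** Hypothetical generalized job: the framework iterates, subclasses handle one record. */
abstract class RecordBatchJob {
    abstract void processRecord(SimpleRecord record, OutputStream out) throws IOException;

    final void processAll(Iterable<? extends SimpleRecord> records, OutputStream out)
            throws IOException {
        for (SimpleRecord record : records) {
            processRecord(record, out);
        }
    }
}

/** Example job: writes one "url length" line per record. */
final class UrlListingJob extends RecordBatchJob {
    @Override
    void processRecord(SimpleRecord record, OutputStream out) throws IOException {
        String line = record.url() + " " + record.length() + "\n";
        out.write(line.getBytes(StandardCharsets.UTF_8));
    }
}
```

With this split, only the code that opens a file and produces records needs to know the file format; jobs such as ExtractCDXJob would be written against the neutral record type.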

Estimated time: ? MD

Subtask 4: Upgrade or remove dk.netarkivet.viewerproxy.LocalCDXCache (deprecated, and uses inline CDXCacheBatchJob)

The deprecated class LocalCDXCache should either be upgraded to handle CDX files generated from WARC files, or removed.

Estimated time: ? MD

Subtask 5: Store the contents of the metadata-1.arc files as WARC-records

Find a way to store the contents of the metadata-1.arc files as WARC records. This means identifying a way to link these WARC records to the WARC files harvested by Heritrix in the same harvest job.

Estimated time: ? MD

Subtask 6: Extend Bitarchive code, so WARC-records can be retrieved from archive

The class dk.netarkivet.common.distribute.arcrepository.BitarchiveRecord only supports ARCRecords. One option is to add another BitarchiveRecord constructor that supports WARCRecords:

public BitarchiveRecord(WARCRecord record) {
    ArgumentNotValid.checkNotNull(record, "WARCRecord record");
    //fileName = record.getMetaData().getArcFile().getName();
    offset = record.getHeader().getOffset();
    length = record.getHeader().getLength();
    fileName = (String) record.getHeader().getHeaderValue(WARCRecord.HEADER_KEY_FILENAME);
    ....
}

The same class also uses the method ARCUtils.readARCRecord(ARCRecord ar), so we may also need a corresponding method for reading WARCRecords:

public static byte[] readWARCRecord(WARCRecord in) throws IOException {..}
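
A possible body for such a method is sketched below. It relies on Heritrix's ArchiveRecord (the superclass of both ARCRecord and WARCRecord) being an InputStream, so the sketch takes a plain InputStream to stay self-contained; the class name RecordBytes and the method name readFully are hypothetical. It also assumes the record content fits in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of NetarchiveSuite): reads a whole record into memory. */
final class RecordBytes {
    static byte[] readFully(InputStream record) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        // read() returns -1 once the stream is exhausted.
        while ((n = record.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Because the helper only depends on InputStream, the same code could serve ARC and WARC records, which would avoid two near-identical utility methods.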

The method dk.netarkivet.archive.bitarchive.Bitarchive.get(String arcfile, long index) needs to work for WARC-files as well as ARC-files.

Estimated time: ? MD

Subtask 7: Upgrade of Indexserver system

The current indexing system assumes that the Heritrix crawl.log and the CDX files for each harvest job are stored in a metadata-1.arc file. Once subtask 5 is implemented, this is no longer true.

This task should solve the following problems:

Which Heritrix processor will handle deduplication in a WARC scenario? There are two possibilities:
- 1) Upgrade the Icelandic deduplication module to write its deduplication information to WARC revisit records instead of to the crawl.log.
- 2) Use Heritrix's own deduplication processes.
How are we going to construct the indices for deduplication and for the proxyviewer access system?

Revision 8 as of 2009-05-23 15:11:35 (size: 7020, editor: SoerenCarlsen)
Revision 9 as of 2009-05-23 15:12:17 (size: 7026, editor: SoerenCarlsen)