Harvester assignment 2: Move to WARC writing instead of ARC writing
This assignment has been replaced by NAS-1720: Enable WARC file writing and handling in the NetarchiveSuite
This assignment contains the subtasks needed to upgrade NetarchiveSuite to allow Heritrix to write WARC files instead of ARC files, including writing the contents of the metadata-1.arc files into WARC metadata records.
TBD: Should NetarchiveSuite still allow the user to write to ARC files?
Comment: Probably not, as this would complicate the post-processing and indexing facilities of NetarchiveSuite.
Prerequisite: Implementation of Feature request #1643: Update Heritrix to version 1.14.3. This version of Heritrix includes a complete implementation of the WARC standard. Note that the subsequent implementation of FR 1951 (Upgrade to Heritrix 1.14.4) has since moved the WARC implementation to version 1.0 (i.e. supposedly following the ISO standard).
Total estimate: 35 MD
Remaining: 20-25 MD
Degree of uncertainty: B
Tasks
Subtask 1: Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates
Currently our harvest templates include the following piece of XML that configures Heritrix to write ARC files:
<map name="write-processors">
  <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
    <boolean name="enabled">true</boolean>
    <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
      <map name="rules">
      </map>
    </newObject>
    <boolean name="compress">false</boolean>
    <string name="prefix">netarkivet</string>
    <string name="suffix">${HOSTNAME}</string>
    <long name="max-size-bytes">100000000</long>
    <stringList name="path">
      <string>arcs</string>
    </stringList>
    <integer name="pool-max-active">5</integer>
    <integer name="pool-max-wait">300000</integer>
    <long name="total-bytes-to-write">0</long>
    <boolean name="skip-identical-digests">false</boolean>
  </newObject>
</map>
By replacing this piece of XML with the following, you tell Heritrix to write WARC files:
<map name="write-processors">
  <newObject name="WARCArchiver" class="org.archive.crawler.writer.WARCWriterProcessor">
    <boolean name="enabled">true</boolean>
    <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
      <map name="rules">
      </map>
    </newObject>
    <boolean name="compress">false</boolean>
    <string name="prefix">netarkivet</string>
    <string name="suffix">${HOSTNAME}</string>
    <long name="max-size-bytes">100000000</long>
    <stringList name="path">
      <string>warcs</string>
    </stringList>
    <integer name="pool-max-active">5</integer>
    <integer name="pool-max-wait">300000</integer>
    <long name="total-bytes-to-write">0</long>
    <boolean name="skip-identical-digests">false</boolean>
    <boolean name="write-requests">true</boolean>
    <boolean name="write-metadata">true</boolean>
    <boolean name="write-revisit-for-identical-digests">true</boolean>
    <boolean name="write-revisit-for-not-modified">true</boolean>
  </newObject>
</map>
Estimated time: 2 MD
Subtask 2: Implement CDX-generating code that also works for WARC files
The CDX-generating code must work for both ARC and WARC files. Currently the method dk.netarkivet.common.utils.cdx.ExtractCDX.generateCDX() ignores all files not ending in .arc. This method is used in the harvest documentation phase to generate CDX files for the ARC files coming from Heritrix.
When generating a single CDX entry for a URL request, information from several WARC records must be combined.
Note that Wayback already has code to generate a CDX from a WARC file.
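A CDX line itself is format-agnostic once the fields have been pulled out of a record header. As a minimal sketch (the class name and field order here are illustrative, not necessarily the exact layout ExtractCDX emits):

```java
/**
 * Sketch: assemble one CDX line from header fields already read from an
 * ARC or WARC record. The field order (url, timestamp, mimetype, length,
 * file name, offset) is an example layout, not NetarchiveSuite's exact one.
 */
public final class CdxLine {
    private CdxLine() {}

    public static String format(String url, String timestamp, String mimeType,
                                long length, String archiveFile, long offset) {
        // Fields are separated by single spaces, as in the classic CDX text format.
        return String.join(" ", url, timestamp, mimeType,
                Long.toString(length), archiveFile, Long.toString(offset));
    }
}
```

The same formatter can then be fed from either an ARC or a WARC record header, leaving only the field extraction format-specific.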
Estimated time: 6 MD
Subtask 3: Extend our BatchJob framework to handle WARC files at record level
Currently our Batch framework only handles ARC files at record level.
Currently we only have an abstract class handling ARCRecords (ARCBatchJob) with these concrete implementations:
- ExtractCDXJob
- HarvestedUrlsForDomainBatchJob (which also assumes the crawl.log is stored in the ARC file under the URL "metadata://netarkivet.dk/crawl/logs/crawl.log")
ARCBatchJob could/should be generalized to handle ArchiveRecords instead of ARCRecords. I have a prototype for such a generalization in the trunk: https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/trunk/tests/dk/netarkivet/common/utils/cdx/ArchiveBatchJob.java?root=netarchivesuite&view=markup
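A hedged sketch of what the generalization could look like: one abstract batch job parameterized over the record type, so ARC- and WARC-specific subclasses share the driver loop. ArchiveBatchJobSketch, run(), and getRecordsProcessed() are invented names for illustration; the real class would extend FileBatchJob and iterate an org.archive.io.ArchiveReader rather than a List.

```java
import java.io.OutputStream;
import java.util.List;

/**
 * Sketch of the proposed generalization (invented name): a batch job
 * parameterized over the record type, so ARCRecord- and WARCRecord-based
 * jobs can share one driver instead of duplicating ARCBatchJob.
 */
abstract class ArchiveBatchJobSketch<T> {
    private int recordsProcessed;

    /** Subclasses implement the per-record processing, writing output to os. */
    protected abstract void processRecord(T record, OutputStream os);

    /** Drive the job over every record of one archive file. */
    public final void run(List<T> records, OutputStream os) {
        for (T record : records) {
            processRecord(record, os);
            recordsProcessed++;
        }
    }

    public final int getRecordsProcessed() {
        return recordsProcessed;
    }
}
```

With this shape, ExtractCDXJob and HarvestedUrlsForDomainBatchJob would only need their processRecord() implementations rewritten against the common record abstraction.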
Estimated time: 5 MD
Subtask 4: Upgrade or remove dk.netarkivet.viewerproxy.LocalCDXCache (deprecated, and uses the inline CDXCacheBatchJob)
The deprecated class LocalCDXCache should either be upgraded to handle WARC CDX files or be removed.
Estimated time: 2 MD
Subtask 5: Store the contents of the metadata-1.arc files as WARC records
Find a way to store the contents of the metadata-1.arc files as WARC records. This means identifying a way to relate these WARC records to the WARC files harvested by Heritrix in this HarvestJob. Idea: extract the ID of the warcinfo record in each WARC file produced by Heritrix, and insert all these IDs as WARC-Concurrent-To identifiers in the WARC metadata records.
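For illustration (the header names follow the WARC specification, but all UUID values below are invented), a harvest-metadata record pointing back at two Heritrix-written WARC files could start like this:

```
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/crawl.log
WARC-Date: 2010-06-01T12:00:00Z
WARC-Record-ID: <urn:uuid:0f2c9a6e-0000-0000-0000-000000000001>
WARC-Concurrent-To: <urn:uuid:0f2c9a6e-0000-0000-0000-000000000002>
WARC-Concurrent-To: <urn:uuid:0f2c9a6e-0000-0000-0000-000000000003>
Content-Type: text/plain
```

The WARC specification allows WARC-Concurrent-To to repeat, so one metadata record can reference the warcinfo record of every WARC file written by the job.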
Estimated time: 5 MD
Subtask 6: Extend Bitarchive code so WARC records can be retrieved from the archive
The class dk.netarkivet.common.distribute.arcrepository.BitarchiveRecord only supports ARCRecords. Maybe just add another BitarchiveRecord constructor to support WARCRecords:
public BitarchiveRecord(WARCRecord record) {
    ArgumentNotValid.checkNotNull(record, "WARCRecord record");
    //fileName = record.getMetaData().getArcFile().getName();
    offset = record.getHeader().getOffset();
    length = record.getHeader().getLength();
    fileName = (String) record.getHeader().getHeaderValue(
            WARCRecord.HEADER_KEY_FILENAME);
    ....
}
The same class also uses the method ARCUtils.readARCRecord(ARCRecord ar); we may also need such a method for reading WARCRecords:
public static byte[] readWARCRecord(WARCRecord in) throws IOException {..}
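As a sketch of that helper (WarcUtilsSketch is an invented name; a WARCRecord is an InputStream positioned at the record content, so the body reduces to draining the stream into a byte array):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Sketch of the suggested readWARCRecord() helper, mirroring
 *  ARCUtils.readARCRecord(). The real method would take an
 *  org.archive.io.warc.WARCRecord, which is itself an InputStream. */
final class WarcUtilsSketch {
    private WarcUtilsSketch() {}

    static byte[] readRecord(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        // Drain the record content; the caller is assumed to have positioned
        // the stream at the start of the payload.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```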
The method dk.netarkivet.archive.bitarchive.Bitarchive.get(String arcfile, long index) needs to work for WARC files as well as ARC files.
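One simple way to make get() format-aware is to branch on the file name before opening the record (ArchiveFileKind is an invented helper shown only as a sketch; org.archive.io.ArchiveReaderFactory already performs a similar dispatch on the file name):

```java
/** Sketch: decide the archive format from the conventional file suffixes,
 *  so Bitarchive.get() can pick an ARC or WARC reader accordingly. */
final class ArchiveFileKind {
    private ArchiveFileKind() {}

    static boolean isWarc(String fileName) {
        String lower = fileName.toLowerCase();
        return lower.endsWith(".warc") || lower.endsWith(".warc.gz");
    }

    static boolean isArc(String fileName) {
        String lower = fileName.toLowerCase();
        return lower.endsWith(".arc") || lower.endsWith(".arc.gz");
    }
}
```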
Estimated time: 5 MD (DONE, FR 2040)
Subtask 7: Upgrade of Indexserver system
The current indexing system assumes that the Heritrix crawl.log and the CDX files for each harvest job are stored in a metadata-1.arc file. After subtask 5 is implemented, this will no longer be true.
This task should solve the following problems:
- Which Heritrix processor will handle deduplication in a WARC scenario? There are two possibilities:
  1) Upgrade the Icelandic deduplication module to write its deduplication information to WARC revisit records instead of to the crawl.log.
  2) Use Heritrix's own deduplication processors.
- How are we going to construct the indices for deduplication and for the viewerproxy access system?
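For reference, possibility 1) would produce records along these lines (key headers only; the WARC-Profile URI is the one the WARC 1.0 specification defines for identical-payload-digest revisits, while the UUIDs, target URI, and digest are invented example values):

```
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.example.dk/index.html
WARC-Date: 2010-06-02T12:00:00Z
WARC-Record-ID: <urn:uuid:0f2c9a6e-0000-0000-0000-000000000004>
WARC-Refers-To: <urn:uuid:0f2c9a6e-0000-0000-0000-000000000005>
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```

An index builder could then treat a revisit record as a pointer to the earlier response record named by WARC-Refers-To, much as the deduplication entries in the crawl.log are resolved today.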
Estimated time: 5 MD (estimate uncertain)
Subtask 8: Extend dk.netarkivet.wayback.NetarchiveResourceStore to handle WARC records
The current NetarchiveResourceStore only handles ARC records. Extend this class to also handle WARC records.