Harvester assignment 2: Move to WARC writing instead of ARC writing
This assignment contains the subtasks needed to upgrade NetarchiveSuite so that Heritrix writes WARC files instead of ARC files, including writing the contents of the metadata-1.arc files into WARC metadata records.
TBD: Should NetarchiveSuite still allow the user to write to ARC files?
Comment: Probably not, as this would complicate the post-processing and indexing facilities of NetarchiveSuite.
Prerequisite: Implementation of Feature request #1643: Update Heritrix to version 1.14.3. This version of Heritrix includes a complete implementation of the WARC standard.
Tasks
Subtask 1: Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates
Currently our harvest templates include the following piece of XML that configures Heritrix to write ARC files:
<map name="write-processors">
  <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
    <boolean name="enabled">true</boolean>
    <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
      <map name="rules">
      </map>
    </newObject>
    <boolean name="compress">false</boolean>
    <string name="prefix">netarkivet</string>
    <string name="suffix">${HOSTNAME}</string>
    <long name="max-size-bytes">100000000</long>
    <stringList name="path">
      <string>arcs</string>
    </stringList>
    <integer name="pool-max-active">5</integer>
    <integer name="pool-max-wait">300000</integer>
    <long name="total-bytes-to-write">0</long>
    <boolean name="skip-identical-digests">false</boolean>
  </newObject>
</map>
By replacing this piece of XML with the following, you tell Heritrix to write WARC files instead. Apart from the class name and the output directory (warcs instead of arcs), note the four extra write-* settings, which control whether request, metadata, and revisit records are written:
<map name="write-processors">
  <newObject name="WARCArchiver" class="org.archive.crawler.writer.WARCWriterProcessor">
    <boolean name="enabled">true</boolean>
    <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
      <map name="rules">
      </map>
    </newObject>
    <boolean name="compress">false</boolean>
    <string name="prefix">netarkivet</string>
    <string name="suffix">${HOSTNAME}</string>
    <long name="max-size-bytes">100000000</long>
    <stringList name="path">
      <string>warcs</string>
    </stringList>
    <integer name="pool-max-active">5</integer>
    <integer name="pool-max-wait">300000</integer>
    <long name="total-bytes-to-write">0</long>
    <boolean name="skip-identical-digests">false</boolean>
    <boolean name="write-requests">true</boolean>
    <boolean name="write-metadata">true</boolean>
    <boolean name="write-revisit-for-identical-digests">true</boolean>
    <boolean name="write-revisit-for-not-modified">true</boolean>
  </newObject>
</map>
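As a quick sanity check that the new processor is active, the records of a harvested WARC can be listed with Heritrix's own I/O classes. This is a minimal sketch, assuming Heritrix 1.14.3 on the classpath; the class name ListWarcRecords is a placeholder:

import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import org.archive.io.ArchiveRecord;
import org.archive.io.warc.WARCConstants;
import org.archive.io.warc.WARCReader;
import org.archive.io.warc.WARCReaderFactory;

/** Placeholder utility: prints the WARC-Type of every record in a WARC file. */
public class ListWarcRecords {
    public static void main(String[] args) throws IOException {
        WARCReader reader = WARCReaderFactory.get(new File(args[0]));
        try {
            Iterator<ArchiveRecord> it = reader.iterator();
            while (it.hasNext()) {
                ArchiveRecord record = it.next();
                // With write-requests and write-metadata enabled we expect
                // warcinfo, request, response, and metadata records.
                System.out.println(record.getHeader()
                        .getHeaderValue(WARCConstants.HEADER_KEY_TYPE));
            }
        } finally {
            reader.close();
        }
    }
}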
Estimated time: 2 MD
Subtask 2: Implement CDX-generating code that also works for WARC files
The CDX-generating code must work for both ARC and WARC files. Currently the method dk.netarkivet.common.utils.cdx.ExtractCDX.generateCDX() ignores all files not ending with .arc. This method is used in the Harvest documentation phase to generate CDX files for the ARC files coming from Heritrix.
When generating a single CDX entry for a URL request, information from several WARC records is combined, since a single fetch can produce separate request, response, and metadata records.
Note that Wayback already has code to generate a CDX from WARC files.
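As a starting point, the following is a minimal sketch of a format-agnostic generateCDX() loop, assuming Heritrix 1.14.3: ArchiveReaderFactory picks an ARCReader or a WARCReader from the file suffix, so one loop covers .arc(.gz) and .warc(.gz) input. The class name is a placeholder, the printed fields are illustrative rather than NetarchiveSuite's actual CDX column set, and the sketch does not yet merge the several WARC records belonging to one URL into a single entry:

import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;
import org.archive.io.ArchiveRecordHeader;

/** Illustrative sketch of a CDX generator working on both ARC and WARC files. */
public class ArchiveCdxSketch {
    public static void generateCDX(File archiveFile) throws IOException {
        // ArchiveReaderFactory dispatches on the file suffix, so the
        // caller no longer needs to skip files not ending in .arc.
        ArchiveReader reader = ArchiveReaderFactory.get(archiveFile);
        try {
            Iterator<ArchiveRecord> it = reader.iterator();
            while (it.hasNext()) {
                ArchiveRecordHeader h = it.next().getHeader();
                // Illustrative CDX-like fields: URL, date, mimetype,
                // offset, length, and source file name.
                System.out.println(h.getUrl() + " " + h.getDate() + " "
                        + h.getMimetype() + " " + h.getOffset() + " "
                        + h.getLength() + " " + archiveFile.getName());
            }
        } finally {
            reader.close();
        }
    }
}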
Estimated time: ? MD
Subtask 3: Extend our BatchJob framework to handle WARC files at record level
Currently our batch framework only handles ARC files at record level.
Currently we only have an abstract class handling ARCRecords (ARCBatchJob), with these concrete implementations:
- ExtractCDXJob
- HarvestedUrlsForDomainBatchJob (which also assumes the crawl.log is stored in an ARC file with URL "metadata://netarkivet.dk/crawl/logs/crawl.log")
ARCBatchJob could, and arguably should, be generalized to handle ArchiveRecords instead of ARCRecords. I have a prototype for such a generalization in the trunk: https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/trunk/tests/dk/netarkivet/common/utils/cdx/ArchiveBatchJob.java?root=netarchivesuite&view=markup
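In outline, the generalization types the per-record callback against ArchiveRecord and lets ArchiveReaderFactory open the input, so the same job runs over ARC and WARC files alike. The following sketch follows ARCBatchJob's initialize/processRecord/finish convention; the class and method names here are assumptions, and the linked prototype has the authoritative signatures:

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;
import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;

/**
 * Sketch of a generalized batch job. Subclasses implement the three
 * callbacks; processFile() drives them over every record in the file.
 */
public abstract class ArchiveBatchJobSketch {
    public abstract void initialize(OutputStream os);
    public abstract void processRecord(ArchiveRecord record, OutputStream os);
    public abstract void finish(OutputStream os);

    public void processFile(File archiveFile, OutputStream os) throws IOException {
        // ArchiveReaderFactory returns an ARCReader or a WARCReader
        // depending on the file suffix, so no format check is needed here.
        ArchiveReader reader = ArchiveReaderFactory.get(archiveFile);
        try {
            initialize(os);
            Iterator<ArchiveRecord> it = reader.iterator();
            while (it.hasNext()) {
                processRecord(it.next(), os);
            }
            finish(os);
        } finally {
            reader.close();
        }
    }
}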