Harvester assignment 2: Move to WARC writing instead of ARC writing

This assignment contains the subtasks needed to upgrade NetarchiveSuite to allow Heritrix to write to WARC files instead of ARC-files, including writing the contents of the metadata-1.arc files into WARC metadata records.

TBD: Should NetarchiveSuite still allow the user to write to ARC-files?
Comment: Probably not, as this will complicate the post-processing and indexing facilities of NetarchiveSuite.

Prerequisite: Implementation of Feature request #1643: Update Heritrix to version 1.14.3. This version of Heritrix includes a complete implementation of the WARC standard.

Tasks

Subtask 1: Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates

Currently our harvest templates include the following piece of XML, which configures Heritrix to write ARC files:

 <map name="write-processors">
      <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
        <boolean name="enabled">true</boolean>
        <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">false</boolean>
        <string name="prefix">netarkivet</string>
        <string name="suffix">${HOSTNAME}</string>
        <long name="max-size-bytes">100000000</long>
        <stringList name="path">
          <string>arcs</string>
        </stringList>
        <integer name="pool-max-active">5</integer>
        <integer name="pool-max-wait">300000</integer>
        <long name="total-bytes-to-write">0</long>
        <boolean name="skip-identical-digests">false</boolean>
      </newObject>
    </map>

By replacing this piece of XML with the following, you tell Heritrix to write WARC-files:

      <newObject name="WARCArchiver" class="org.archive.crawler.writer.WARCWriterProcessor">
        <boolean name="enabled">true</boolean>
        <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
          <map name="rules">
          </map>
        </newObject>
        <boolean name="compress">false</boolean>
        <string name="prefix">netarkivet</string>
        <string name="suffix">${HOSTNAME}</string>
        <long name="max-size-bytes">100000000</long>
        <stringList name="path">
          <string>warcs</string>
        </stringList>
        <integer name="pool-max-active">5</integer>
        <integer name="pool-max-wait">300000</integer>
        <long name="total-bytes-to-write">0</long>
        <boolean name="skip-identical-digests">false</boolean>
        <boolean name="write-requests">true</boolean>
        <boolean name="write-metadata">true</boolean>
        <boolean name="write-revisit-for-identical-digests">true</boolean>
        <boolean name="write-revisit-for-not-modified">true</boolean>
      </newObject>
    </map>

Estimated time: 2 MD

Subtask 2: Implement CDX-generating code that also works for WARC-files

The CDX-generating code must work for both ARC and WARC files. Currently the method dk.netarkivet.common.utils.cdx.ExtractCDX.generateCDX() ignores all files not ending in .arc. This method is used in the harvest documentation phase to generate CDX-files for the ARC-files coming from Heritrix.

When generating a single CDX-entry for a URL request, information from several WARC-records (e.g. the request and response records) is combined.

Note that Wayback already has code to make a CDX from WARC:

https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/
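
For illustration only, the following sketch shows how WARC records could be iterated to emit CDX-like lines, using Heritrix's org.archive.io classes. The class name WarcCdxSketch and the output columns are made up for this example; the real implementation would have to reproduce the field layout of the existing ARC-based ExtractCDX.

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveRecord;
import org.archive.io.ArchiveRecordHeader;
import org.archive.io.warc.WARCConstants;
import org.archive.io.warc.WARCReaderFactory;

/** Rough sketch: emit one CDX-like line per WARC response record. */
public class WarcCdxSketch {
    public static void printCdx(File warcFile) throws IOException {
        ArchiveReader reader = WARCReaderFactory.get(warcFile);
        try {
            for (Iterator<ArchiveRecord> it = reader.iterator(); it.hasNext();) {
                ArchiveRecordHeader header = it.next().getHeader();
                // Skip warcinfo/request/metadata records; only responses become CDX lines here.
                if (!"response".equals(header.getHeaderValue(WARCConstants.HEADER_KEY_TYPE))) {
                    continue;
                }
                // Minimal column set: URL, timestamp, mime type, offset, file name.
                System.out.println(header.getUrl() + " " + header.getDate() + " "
                        + header.getMimetype() + " " + header.getOffset() + " "
                        + warcFile.getName());
            }
        } finally {
            reader.close();
        }
    }
}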

Estimated time: ? MD

Subtask 3: Extend our BatchJob framework to handle WARC-files on record level

Currently our batch framework only handles ARC files at record level.

Currently we only have an abstract class handling ARCRecords (ARCBatchJob), with these concrete implementations:

  • ExtractCDXJob
  • HarvestedUrlsForDomainBatchJob (which also assumes that the crawl.log is stored in the ARC-file with URL "metadata://netarkivet.dk/crawl/logs/crawl.log")

ARCBatchJob could/should be generalized to handle ArchiveRecords instead of ARCRecords. I have a prototype for such a generalization in the trunk: https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/trunk/tests/dk/netarkivet/common/utils/cdx/ArchiveBatchJob.java?root=netarchivesuite&view=markup
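
A rough, standalone sketch of the generalization (not the prototype linked above): the job is typed against org.archive.io.ArchiveRecord, and org.archive.io.ArchiveReaderFactory is assumed to detect ARC vs. WARC from the file. In NetarchiveSuite the real class would extend FileBatchJob, as ARCBatchJob does today.

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;

/** Sketch of a batch job that processes ARC or WARC files record by record. */
public abstract class ArchiveBatchJobSketch {

    /** Subclasses implement per-record processing, as with ARCBatchJob. */
    public abstract void processRecord(ArchiveRecord record, OutputStream os);

    /** Opens the file with ArchiveReaderFactory, which handles both ARC and WARC. */
    public void processFile(File archiveFile, OutputStream os) throws IOException {
        ArchiveReader reader = ArchiveReaderFactory.get(archiveFile);
        try {
            for (Iterator<ArchiveRecord> it = reader.iterator(); it.hasNext();) {
                processRecord(it.next(), os);
            }
        } finally {
            reader.close();
        }
    }
}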

Estimated time: ? MD

Subtask 4: Upgrade or remove dk.netarkivet.viewerproxy.LocalCDXCache (deprecated, and uses inline CDXCacheBatchJob)

The deprecated class LocalCDXCache should either be upgraded to handle WARC-CDX'es or removed.

Estimated time: ? MD

Subtask 5: Store the contents of the metadata-1.arc files as WARC-records

Find a way to store the contents of the metadata-1.arc files as WARC-records. This means identifying a way to link these WARC-records to the WARC-files harvested by Heritrix in the same harvest job.
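
As one possible direction (illustration only, written with plain Java I/O rather than a Heritrix writer class): the crawl.log and other metadata could be stored as WARC metadata records under their existing metadata:// URIs, with each record pointing back to the harvest's WARC files through the standard WARC-Warcinfo-ID (or WARC-Concurrent-To) header. All file names, dates and record ids below are placeholders.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

/** Sketch: append one WARC metadata record holding the crawl.log to a metadata WARC file. */
public class MetadataRecordSketch {
    public static void main(String[] args) throws IOException {
        byte[] body = "crawl.log contents would go here".getBytes("UTF-8");
        // Record id of the warcinfo record of one of the harvest's WARC files (placeholder).
        String warcinfoId = "<urn:uuid:00000000-0000-0000-0000-000000000000>";
        String header = "WARC/1.0\r\n"  // ISO version string; adjust to match the writer in use
                + "WARC-Type: metadata\r\n"
                + "WARC-Target-URI: metadata://netarkivet.dk/crawl/logs/crawl.log\r\n"
                + "WARC-Date: 2010-08-23T10:10:26Z\r\n"
                + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">\r\n"
                + "WARC-Warcinfo-ID: " + warcinfoId + "\r\n"
                + "Content-Type: text/plain\r\n"
                + "Content-Length: " + body.length + "\r\n"
                + "\r\n";
        OutputStream out = new FileOutputStream("1-metadata-1.warc", true);
        try {
            out.write(header.getBytes("UTF-8"));
            out.write(body);
            out.write("\r\n\r\n".getBytes("UTF-8"));  // mandatory record separator
        } finally {
            out.close();
        }
    }
}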

Estimated time: ? MD

Subtask 6: Extend Bitarchive code so WARC-records can be retrieved from the archive

The class dk.netarkivet.common.distribute.arcrepository.BitarchiveRecord only supports ARCRecords. Maybe just add another BitarchiveRecord constructor to support WARCRecords:

public BitarchiveRecord(WARCRecord record) {
    ArgumentNotValid.checkNotNull(record, "WARCRecord record");
    // For ARC the file name comes from record.getMetaData().getArcFile().getName();
    // for WARC it is read from the record header instead.
    offset = record.getHeader().getOffset();
    length = record.getHeader().getLength();
    fileName = (String) record.getHeader().getHeaderValue(WARCRecord.HEADER_KEY_FILENAME);
    ....
}

The same class also uses the method ARCUtils.readARCRecord(ARCRecord ar), and we may need a corresponding method for reading WARCRecords:

public static byte[] readWARCRecord(WARCRecord in) throws IOException {..}
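
A minimal sketch of such a method, assuming that WARCRecord, like ARCRecord, extends org.archive.io.ArchiveRecord and can therefore be read as an InputStream over the record payload:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.archive.io.warc.WARCRecord;

/** Sketch of a WARC counterpart to ARCUtils.readARCRecord(ARCRecord). */
public class WARCUtilsSketch {
    public static byte[] readWARCRecord(WARCRecord in) throws IOException {
        // Drain the record payload into a byte array.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }
}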

The method dk.netarkivet.archive.bitarchive.Bitarchive.get(String arcfile, long index) needs to work for WARC-files as well as ARC-files.
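
A minimal sketch of a format-agnostic lookup, assuming org.archive.io.ArchiveReaderFactory can open both formats and that ArchiveReader.get(long offset) returns the record starting at a byte offset; the names below are illustrative, not the actual Bitarchive code.

import java.io.File;
import java.io.IOException;

import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveReaderFactory;
import org.archive.io.ArchiveRecord;

/** Sketch of an offset-based record lookup that works for both ARC and WARC files. */
public class ArchiveLookupSketch {
    public static ArchiveRecord getRecord(File archiveFile, long offset) throws IOException {
        // The factory inspects the file and returns an ARCReader or a WARCReader.
        ArchiveReader reader = ArchiveReaderFactory.get(archiveFile);
        // NB: the reader must stay open while the record is read, and be closed afterwards.
        return reader.get(offset);
    }
}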

Estimated time: ? MD

Subtask 7: Upgrade of Indexserver system

The current indexing system assumes that the Heritrix crawl.log and the CDX'es for each harvest job are stored in a metadata-1.arc file. After subtask 5 is implemented, this will no longer be true.

This task should solve the following problems:

  • Which Heritrix processor will handle deduplication in a WARC scenario? There are two possibilities:
    1) Upgrade the Icelandic deduplication module to write its deduplication information to a WARC revisit record instead of to the crawl.log.
    2) Use Heritrix's own deduplication processes.

  • How are we going to construct the indices for deduplication and for the proxyviewer access system?

Estimated time: ? MD
