Harvester assignment 2: Move to WARC writing instead of ARC writing
This assignment contains the subtasks needed to upgrade NetarchiveSuite so that Heritrix writes WARC files instead of ARC files, including writing the contents of the metadata-1.arc files into WARC metadata records.
TBD: Should NetarchiveSuite still allow the user to write to ARC files?
Comment: Probably not, as this would complicate the post-processing and indexing facilities of NetarchiveSuite.
Tasks
Subtask 1: Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates
Currently our harvest templates include the following piece of XML, which configures Heritrix to write ARC files:
<map name="write-processors">
  <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
    <boolean name="enabled">true</boolean>
    <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
      <map name="rules">
      </map>
    </newObject>
    <boolean name="compress">false</boolean>
    <string name="prefix">netarkivet</string>
    <string name="suffix">${HOSTNAME}</string>
    <long name="max-size-bytes">100000000</long>
    <stringList name="path">
      <string>arcs</string>
    </stringList>
    <integer name="pool-max-active">5</integer>
    <integer name="pool-max-wait">300000</integer>
    <long name="total-bytes-to-write">0</long>
    <boolean name="skip-identical-digests">false</boolean>
  </newObject>
</map>
Replacing this piece of XML with the following tells Heritrix to write WARC files:
<map name="write-processors">
  <newObject name="WARCArchiver" class="org.archive.crawler.writer.WARCWriterProcessor">
    <boolean name="enabled">true</boolean>
    <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
      <map name="rules">
      </map>
    </newObject>
    <boolean name="compress">false</boolean>
    <string name="prefix">netarkivet</string>
    <string name="suffix">${HOSTNAME}</string>
    <long name="max-size-bytes">100000000</long>
    <stringList name="path">
      <string>warcs</string>
    </stringList>
    <integer name="pool-max-active">5</integer>
    <integer name="pool-max-wait">300000</integer>
    <long name="total-bytes-to-write">0</long>
    <boolean name="skip-identical-digests">false</boolean>
    <boolean name="write-requests">true</boolean>
    <boolean name="write-metadata">true</boolean>
    <boolean name="write-revisit-for-identical-digests">true</boolean>
    <boolean name="write-revisit-for-not-modified">true</boolean>
  </newObject>
</map>
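If many templates need the same change, the replacement could be scripted rather than done by hand. The following is a minimal sketch in Python and is not part of NetarchiveSuite: the function name arc_to_warc and the shortened sample template are hypothetical. It assumes the writer is configured as in the ARC fragment in this section (object name "Archiver", path entry "arcs"). The four WARC-only settings are copied from the WARC fragment.

```python
import xml.etree.ElementTree as ET

ARC_CLASS = "org.archive.crawler.writer.ARCWriterProcessor"
WARC_CLASS = "org.archive.crawler.writer.WARCWriterProcessor"

# Settings that only the WARC writer has, as listed in the WARC fragment.
WARC_ONLY_SETTINGS = (
    "write-requests",
    "write-metadata",
    "write-revisit-for-identical-digests",
    "write-revisit-for-not-modified",
)

# Shortened, hypothetical sample template used only for illustration.
ARC_TEMPLATE = """<map name="write-processors">
  <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
    <boolean name="enabled">true</boolean>
    <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
      <map name="rules"></map>
    </newObject>
    <stringList name="path">
      <string>arcs</string>
    </stringList>
  </newObject>
</map>"""


def arc_to_warc(template_xml: str) -> str:
    """Return template_xml with every ARC writer turned into a WARC writer."""
    root = ET.fromstring(template_xml)
    # Copy the matches into a list first, because the loop modifies the tree.
    for writer in list(root.iter("newObject")):
        if writer.get("class") != ARC_CLASS:
            continue
        writer.set("name", "WARCArchiver")
        writer.set("class", WARC_CLASS)
        for child in writer:
            # Keep the nested decide-rules name in step with its parent.
            if child.get("name") == "Archiver#decide-rules":
                child.set("name", "WARCArchiver#decide-rules")
            # Write to the "warcs" directory instead of "arcs".
            if child.tag == "stringList" and child.get("name") == "path":
                for entry in child.findall("string"):
                    if entry.text == "arcs":
                        entry.text = "warcs"
        # Append the WARC-only settings, all enabled as in the fragment.
        for setting in WARC_ONLY_SETTINGS:
            flag = ET.SubElement(writer, "boolean", {"name": setting})
            flag.text = "true"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(arc_to_warc(ARC_TEMPLATE))
```

ElementTree discards XML comments by default and may rewrite namespace prefixes when it re-serializes a document, so the output for a real template should be compared with the original before it is put into use.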