= Harvester assignment 2: Move to WARC writing instead of ARC writing =
'''This assignment has been replaced by [[https://sbforge.org/jira/browse/NAS-1720|NAS-1720 Enable WARC file writing and handling in the NetarchiveSuite]]'''
This assignment contains the subtasks needed to upgrade !NetarchiveSuite to allow Heritrix to write to WARC files instead of ARC-files, including writing the contents of the metadata-1.arc files into WARC metadata records.
TBD: Should !NetarchiveSuite still allow the user to write to ARC-files?<<BR>>
Comment: Probably not, as this will complicate the post-processing and indexing facilities of !NetarchiveSuite.
Prerequisite: Implementation of Feature request #1643: Update Heritrix to version 1.14.3.
This version of Heritrix includes a complete implementation of the WARC standard.
Note that since then, the implementation of FR 1951 (Upgrade to Heritrix 1.14.4) has moved the WARC implementation to version 1.0 (i.e. supposedly following the ISO standard).
Total estimate: 35 MD
Remaining: 20-25 MD
Degree of uncertainty: B
== Tasks ==
=== Subtask 1: Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates ===
Currently our harvest templates include the following piece of xml that configures Heritrix to write ARC files:
{{{
}}}
By replacing this piece of xml with the following, you tell Heritrix to write WARC-files:
{{{
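<newObject name="Archiver" class="org.archive.crawler.writer.WARCWriterProcessor">
  <boolean name="enabled">true</boolean>
  <boolean name="compress">false</boolean>
  <string name="prefix">netarkivet</string>
  <string name="suffix">${HOSTNAME}</string>
  <long name="max-size-bytes">100000000</long>
  <stringList name="path">
    <string>warcs</string>
  </stringList>
  <integer name="pool-max-active">5</integer>
  <integer name="pool-max-wait">300000</integer>
  <long name="total-bytes-to-write">0</long>
  <boolean name="skip-identical-digests">false</boolean>
  <boolean name="write-requests">true</boolean>
  <boolean name="write-metadata">true</boolean>
  <boolean name="write-revisit-for-identical-digests">true</boolean>
  <boolean name="write-revisit-for-not-modified">true</boolean>
</newObject>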
}}}
==== Estimated time: 2 MD ====
=== Subtask 2: Implement CDX-generating code that also works for WARC-files ===
The CDX-generating code must work for both ARC and WARC files. Currently the method dk.netarkivet.common.utils.cdx.ExtractCDX.generateCDX() ignores all files not ending with .arc. This method is used in the harvest documentation phase to generate CDX-files for the ARC-files coming from Heritrix.
When generating a single CDX-entry for a URL request, information from several WARC-records is combined.
Note that Wayback already has code to make a CDX from WARC:
https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/
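As a first step, the file-name check could be widened to accept WARC files as well. The following is only a minimal sketch: the class name ArchiveFileNames and its methods are hypothetical and not part of !NetarchiveSuite, and accepting gzipped files (.arc.gz, .warc.gz) is an assumption that goes beyond the current ".arc only" behaviour.

```java
import java.util.Locale;

/** Hypothetical helper: decides which files CDX generation should process. */
final class ArchiveFileNames {
    private ArchiveFileNames() {
    }

    /** True for names ending in .arc or .arc.gz (case-insensitive). */
    static boolean isArc(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        return n.endsWith(".arc") || n.endsWith(".arc.gz");
    }

    /** True for names ending in .warc or .warc.gz (case-insensitive). */
    static boolean isWarc(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        return n.endsWith(".warc") || n.endsWith(".warc.gz");
    }

    /** True if the file is in either format, i.e. CDX generation should not skip it. */
    static boolean isArchive(String name) {
        return isArc(name) || isWarc(name);
    }
}
```

Keeping isArc and isWarc separate lets the caller pick the matching record reader once a file has been accepted.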
==== Estimated time: 6 MD ====
=== Subtask 3: Extend our BatchJob framework to handle WARC-files on record level ===
Currently our Batch framework only handles ARC files on record level.
The only abstract class handling ARCRecords is ARCBatchJob, which has these concrete implementations:
* ExtractCDXJob,
* HarvestedUrlsForDomainBatchJob (also assumes crawl.log stored in ARC-file with URL "metadata://netarkivet.dk/crawl/logs/crawl.log")
ARCBatchJob could/should be generalized to handle ArchiveRecords instead of ARCRecords. I have a prototype for such a generalization in the trunk: https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/trunk/tests/dk/netarkivet/common/utils/cdx/ArchiveBatchJob.java?root=netarchivesuite&view=markup
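The shape of such a generalization can be illustrated without any Heritrix classes. The sketch below is not the prototype linked above: SimpleRecord, RecordBatchJob and UrlCollectingJob are hypothetical stand-ins that only show the idea of writing a job once against a format-neutral record type.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical format-neutral record; a stand-in, not the Heritrix ArchiveRecord. */
interface SimpleRecord {
    String url();

    long offset();
}

/** Hypothetical record-level batch job that is indifferent to ARC versus WARC. */
abstract class RecordBatchJob {
    private int processed;

    /** Called once for every record, whatever file format it came from. */
    abstract void processRecord(SimpleRecord record);

    /** Feeds every record to processRecord and counts the records handled. */
    final void run(Iterable<? extends SimpleRecord> records) {
        for (SimpleRecord record : records) {
            processRecord(record);
            processed++;
        }
    }

    int processedCount() {
        return processed;
    }
}

/** Example job: collects the URL of every record it sees. */
final class UrlCollectingJob extends RecordBatchJob {
    final List<String> urls = new ArrayList<String>();

    @Override
    void processRecord(SimpleRecord record) {
        urls.add(record.url());
    }
}
```

With this split, only the code that produces the records needs to know the file format; jobs such as ExtractCDXJob would be written against the neutral type.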
==== Estimated time: 5 MD ====
=== Subtask 4: Upgrade or remove dk.netarkivet.viewerproxy.LocalCDXCache (deprecated, and uses inline CDXCacheBatchJob) ===
The deprecated class LocalCDXCache should either be upgraded to handle WARC CDXes or removed.
==== Estimated time: 2 MD ====
=== Subtask 5: Store the contents of the metadata-1.arc files as WARC-records ===
Find a way to store the contents of the metadata-1.arc files as WARC-records. This means identifying a way to relate these WARC-records to the WARC-files harvested by Heritrix in this HarvestJob.
Idea: Extract the ID of the warcinfo record in each of the WARC-files produced by Heritrix, and insert all these as WARC-Concurrent-To identifiers in the WARC metadata records.
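The idea above could look roughly like this at header level. This is only a sketch under assumptions: the class WarcMetadataHeader and its build method are hypothetical, the header is assembled by hand rather than with the Heritrix WARC writer, and the record IDs are taken to be already extracted from the warcinfo records.

```java
import java.util.List;

/** Hypothetical helper that builds the header block of a WARC 1.0 metadata record. */
final class WarcMetadataHeader {
    private static final String CRLF = "\r\n";

    private WarcMetadataHeader() {
    }

    /**
     * @param recordId      ID of the new metadata record, without angle brackets
     * @param warcDate      timestamp in the WARC-Date format, e.g. 2010-01-01T00:00:00Z
     * @param warcinfoIds   IDs of the warcinfo records in the WARC-files of this job
     * @param contentType   MIME type of the payload
     * @param contentLength payload length in bytes
     */
    static String build(String recordId, String warcDate, List<String> warcinfoIds,
            String contentType, long contentLength) {
        StringBuilder sb = new StringBuilder();
        sb.append("WARC/1.0").append(CRLF);
        sb.append("WARC-Type: metadata").append(CRLF);
        sb.append("WARC-Record-ID: <").append(recordId).append('>').append(CRLF);
        sb.append("WARC-Date: ").append(warcDate).append(CRLF);
        // One WARC-Concurrent-To line per WARC-file produced by the harvest job.
        for (String id : warcinfoIds) {
            sb.append("WARC-Concurrent-To: <").append(id).append('>').append(CRLF);
        }
        sb.append("Content-Type: ").append(contentType).append(CRLF);
        sb.append("Content-Length: ").append(contentLength).append(CRLF);
        sb.append(CRLF); // an empty line ends the header block
        return sb.toString();
    }
}
```

Each metadata record (crawl.log, CDX, and so on) would get the same set of WARC-Concurrent-To lines, which is what ties it to the harvested WARC-files.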
==== Estimated time: 5 MD ====
=== Subtask 6: Extend Bitarchive code, so WARC-records can be retrieved from archive ===
The class dk.netarkivet.common.distribute.arcrepository.!BitarchiveRecord only supports ARCRecords. Maybe just add another !BitarchiveRecord constructor to support WARCRecords:
{{{
public BitarchiveRecord(WARCRecord record) {
    ArgumentNotValid.checkNotNull(record, "WARCRecord record");
    // ARC equivalent: fileName = record.getMetaData().getArcFile().getName();
    offset = record.getHeader().getOffset();
    length = record.getHeader().getLength();
    // For WARC records the file name is read from the record header instead.
    fileName = (String) record.getHeader().getHeaderValue(WARCRecord.HEADER_KEY_FILENAME);
    ....
}
}}}
The same class also uses the method ARCUtils.readARCRecord(ARCRecord ar), and we may need a corresponding method for reading WARCRecords:
{{{
public static byte[] readWARCRecord(WARCRecord in) throws IOException {..}
}}}
The method {{{dk.netarkivet.archive.bitarchive.Bitarchive.get(String arcfile, long index)}}} needs to work for WARC-files as well as ARC-files.
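A readWARCRecord helper like the one proposed above could be little more than a copy loop. The sketch below assumes that a WARCRecord can be read as a plain java.io.InputStream (to my knowledge the Heritrix ArchiveRecord class extends InputStream, but this should be verified); the class RecordReading and the method readFully are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for reading the remaining content of a record into memory. */
final class RecordReading {
    private RecordReading() {
    }

    /**
     * Reads the stream to its end and returns the bytes.
     * Assumes the record content is small enough to hold in memory.
     */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        // read() returns -1 once the end of the record is reached.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Because the sketch only depends on InputStream, the same code would serve ARC and WARC records alike; very large records would need a streaming approach instead.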
==== Estimated time: 5 MD (DONE, FR 2040) ====
=== Subtask 7: Upgrade of Indexserver system ===
The current indexing system assumes that the Heritrix crawl.log and the CDXes for each harvest job are stored in a metadata-1.arc file. Once subtask 5 is implemented, this is no longer true.
This task should solve the following problems:
* Which Heritrix processor will handle deduplication in a WARC scenario? Here there are two possibilities:
. 1) Upgrade the Icelandic deduplication module to write its deduplication information to WARC revisit records instead of to the crawl.log<<BR>>
2) Use Heritrix's own deduplication processes
* How are we going to construct the indices for deduplication, and for the proxyviewer access system?
==== Estimated time: 5 MD (estimate uncertain) ====
=== Subtask 8: Extend the dk.netarkivet.wayback.NetarchiveResourceStore to handle WARC-records ===
The current NetarchiveResourceStore only handles ARC-records. Extend this class to also handle WARC-records.
==== Estimated time: 5 MD (DONE, released in 3.14.0) ====