Differences between revisions 1 and 2
Revision 1 as of 2009-05-22 13:09:12
Size: 2190
Comment:
Revision 2 as of 2009-05-22 13:52:22
Size: 566
Comment:
Line 1: Line 1:
= Harvester assignment 2: Upgrade to Heritrix release 3 =
This assignment contains the subtasks needed to upgrade NetarchiveSuite to use Heritrix 3.0.0+. Note that Internet Archive, the maintainer of Heritrix, more or less deprecates the use of Heritrix 2.0+. Therefore we will bypass the 2.0 branch of Heritrix.
= Harvester assignment 2: Move to WARC writing instead of ARC writing =
Line 4: Line 3:
Heritrix 3.0.0 is still in alpha. Until it is released, you can build a Heritrix 3.0.0 distribution by downloading the code from SVN:
http://archive-crawler.svn.sourceforge.net/viewvc/archive-crawler/trunk/heritrix3.tar.gz?view=tar
This assignment contains the subtasks needed to upgrade !NetarchiveSuite to allow Heritrix to write to WARC files instead
of ARC-files, including writing the contents of the metadata-1.arc files into WARC metadata records.
Line 7: Line 6:
and then {{{
gtar xvfz heritrix3.tar.gz
cd heritrix3
mvn install
}}}
This sequence assumes that Maven 2.0.9+ is installed on your machine.
TBD: Should !NetarchiveSuite still allow the user to write to ARC-files?[[BR]]
Comment: Probably not, as this will complicate the post-processing and indexing facilities of NetarchiveSuite.
Line 15: Line 10:
=== Subtask 1: Replace Heritrix 1.14.X libraries with Heritrix 3.0.0+ libraries. ===
Download a binary copy of Heritrix, update the contents of lib/heritrix and lib/heritrix-dependencies, and make the appropriate changes to our build.xml. The deduplication-X-Y-Z.jar (if we're still using this module for deduplication) needs to be recompiled with the new Heritrix.

=== Subtask 1:
Line 19: Line 15:
=== Subtask 2: Migrate Heritrix templates bundled with NetarchiveSuite to the new template format. ===
This task includes writing extensive documentation on how to migrate.

Beware that already generated jobs in one of the states submitted, started, or failed will not be resubmittable. It is therefore a good idea to minimise the number of such jobs, for instance by disabling all active harvest definitions.

Consider adding a versioning scheme to the "order_templates" table. The version number would then also be added to the jobs table to supplement the fields "orderxml" and "orderxmldoc".
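
One way to picture how a stored template version could be used is sketched below. It is only an illustration under assumptions: the class, field, and constant names (TemplateVersionSketch, JobRow, templateVersion, the version numbers 1 and 3) are hypothetical and do not exist in NetarchiveSuite, and the job states are represented as plain strings. The sketch counts jobs that are in one of the three affected states and whose template version differs from the version the harvester supports.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names exist in NetarchiveSuite.
final class TemplateVersionSketch {
    // Assumed version numbers for the two template formats.
    static final int HERITRIX1_FORMAT = 1;
    static final int HERITRIX3_FORMAT = 3;

    // States in which a job generated from an old template cannot be resubmitted.
    private static final List<String> AFFECTED_STATES =
            Arrays.asList("submitted", "started", "failed");

    /** Minimal stand-in for a row in the jobs table with an added version column. */
    static final class JobRow {
        final long jobId;
        final String status;
        final int templateVersion;

        JobRow(long jobId, String status, int templateVersion) {
            this.jobId = jobId;
            this.status = status;
            this.templateVersion = templateVersion;
        }
    }

    /** True if the job is in an affected state and uses a different template version. */
    static boolean isStranded(JobRow job, int supportedVersion) {
        return AFFECTED_STATES.contains(job.status)
                && job.templateVersion != supportedVersion;
    }

    /** Counts the jobs that could not be resubmitted after the migration. */
    static int countStranded(List<JobRow> jobs, int supportedVersion) {
        int count = 0;
        for (JobRow job : jobs) {
            if (isStranded(job, supportedVersion)) {
                count++;
            }
        }
        return count;
    }
}
```

A check like this could be run before the upgrade to see how many jobs would be affected; whether the version belongs in the jobs table, the templates table, or both is still an open design question.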

==== Estimated time: 5 MD ====
=== Subtask 3: Upgrade all classes that manipulate Heritrix templates ===
Upgrade the classes "Job", "JobDBDAO", and "JobDAO" to handle the new kind of templates. The same goes for HeritrixTemplate.

==== Estimated time: 5 MD ====
=== Subtask 4: Upgrade code that starts, stops, and communicates with Heritrix ===
Upgrade the classes in the dk.netarkivet.harvester.harvesting package to communicate with Heritrix 3.

==== Estimated time: (> 5MD) ====

Harvester assignment 2: Move to WARC writing instead of ARC writing

This assignment contains the subtasks needed to upgrade NetarchiveSuite to allow Heritrix to write to WARC files instead of ARC-files, including writing the contents of the metadata-1.arc files into WARC metadata records.
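
To make the target concrete, the following is a minimal sketch of how one entry from a metadata-1.arc file might be wrapped in a WARC/1.0 record of type "metadata". It uses only the Java standard library rather than the Heritrix WARC writer, and the class name WarcMetadataSketch, the method name, and the example target URI are assumptions made for illustration, not existing NetarchiveSuite code. Which optional header fields the real records should carry is left open here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, not part of NetarchiveSuite or Heritrix.
final class WarcMetadataSketch {
    private static final String CRLF = "\r\n";

    /** Wraps a payload in a single WARC/1.0 record of type "metadata". */
    static byte[] metadataRecord(String targetUri, String warcDate,
                                 String contentType, byte[] payload) {
        // Version line, named header fields, then an empty line.
        String header = "WARC/1.0" + CRLF
                + "WARC-Type: metadata" + CRLF
                + "WARC-Target-URI: " + targetUri + CRLF
                + "WARC-Date: " + warcDate + CRLF
                + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">" + CRLF
                + "Content-Type: " + contentType + CRLF
                + "Content-Length: " + payload.length + CRLF
                + CRLF;
        byte[] headerBytes = header.getBytes(StandardCharsets.UTF_8);
        // Every WARC record ends with two CRLF pairs after the content block.
        byte[] trailer = (CRLF + CRLF).getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headerBytes, 0, headerBytes.length);
        out.write(payload, 0, payload.length);
        out.write(trailer, 0, trailer.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Illustrative URI and payload only.
        byte[] record = metadataRecord(
                "metadata://example.org/crawl/setup/order.xml",
                "2009-05-22T13:52:22Z",
                "text/xml",
                "<order/>".getBytes(StandardCharsets.UTF_8));
        System.out.print(new String(record, StandardCharsets.UTF_8));
    }
}
```

In practice the records would more likely be produced through the WARC writer that ships with Heritrix; the sketch only shows the record layout that the metadata would need to fit into.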

TBD: Should NetarchiveSuite still allow the user to write to ARC-files?
Comment: Probably not, as this will complicate the post-processing and indexing facilities of NetarchiveSuite.

Tasks

=== Subtask 1:

Estimated time: 1 MD

AssignmentHarvester2 (last edited 2011-09-29 14:00:43 by MikisSethSorensen)