Harvester assignment 1: Upgrade to Heritrix release 3

This assignment was replaced by NAS-1780 Upgrade to or support heritrix 3.

This assignment relates to FR 2055

This assignment contains the subtasks needed to upgrade NetarchiveSuite to use Heritrix 3.0.0+. Note that the Internet Archive, the maintainer of Heritrix, more or less deprecates the use of Heritrix 2.0+. We will therefore bypass the 2.0 branch of Heritrix.

Heritrix 3.0.0 is still in alpha (as of 22 May 2009). Until it is released, you can build a Heritrix 3.0.0 alpha distribution by downloading the code from SVN: http://archive-crawler.svn.sourceforge.net/viewvc/archive-crawler/trunk/heritrix3.tar.gz?view=tar

and then

gtar xvfz heritrix3.tar.gz
cd heritrix3
mvn install

This sequence assumes that Maven 2.0.9+ is installed on your machine.

Note: as of 7 October 2009, a beta of Heritrix 3 has been released. See https://webarchive.jira.com/wiki/display/Heritrix/Heritrix3

Note: as of 5 December 2010, Heritrix 3 has been released. See https://webarchive.jira.com/wiki/display/Heritrix/Release+Notes+-+3.0.0

Tasks

Subtask 1: Replace Heritrix 1.14.X libraries with Heritrix 3.0.0+ libraries.

Download a binary copy of Heritrix, update the contents of lib/heritrix and lib/heritrix-dependencies, and make the appropriate changes to our build.xml.
The deduplication-X-Y-Z.jar (if we're still using this module for deduplication) needs to be recompiled with the new Heritrix.

Estimated time: 1 MD

Subtask 2: Migrate the Heritrix templates bundled with NetarchiveSuite to the new template format.

This task includes the writing of extensive documentation on how to migrate.

Beware that already generated jobs in one of the states submitted, started, or failed will not be resubmittable. It is therefore a good idea to minimise the number of such jobs before migrating, for instance by disabling all active harvest definitions.

Consider adding a versioning scheme to the "order_templates" table. The version number would then also be added to the jobs table to supplement the fields "orderxml" and "orderxmldoc".

Note that Heritrix 3.0.0 comes with a tool for converting Heritrix 1.X templates to Heritrix 3.X templates. The following is taken from the H3 release notes:

Migration Tool Limited
The current migration utility, executable class org.archive.crawler.migrate.MigrateH1to3Tool, only works reliably for changed basic order.xml configuration values as reflected in our bundled default configuration. Also, it makes no effort to convert H1 per-domain/per-host settings overrides. By providing a model H3-based configuration with some values brought over, and reporting the values it cannot convert, it may still provide a useful base for other hand-conversion. Please let us know what advanced configuration in your Heritrix1 crawls is most important to support in the automated tool.

Estimated time: 4 MD

Subtask 3: Upgrade all classes that manipulate Heritrix templates

Upgrade the classes "Job", "JobDBDAO", and "JobDAO" to handle the new kind of templates. The same goes for HeritrixTemplate.
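
While both template kinds are in circulation, the template-handling classes may need to tell them apart. The following is only a minimal sketch of one possible approach, not existing NetarchiveSuite code: it relies on the fact that a Heritrix 1 order.xml has the root element "crawl-order", while a Heritrix 3 crawler-beans.cxml is a Spring configuration with the root element "beans". The class name TemplateFormatSketch, the enum Format, and the method detect are hypothetical names introduced for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of NetarchiveSuite. */
class TemplateFormatSketch {

    /** The template kinds this sketch distinguishes. */
    enum Format { HERITRIX_1, HERITRIX_3, UNKNOWN }

    /**
     * Guess the template format from the name of the XML root element.
     * Assumption: H1 templates start with <crawl-order>, H3 templates
     * with <beans> (Spring configuration).
     */
    static Format detect(String templateXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness is needed so that getLocalName() works
            // for the namespaced <beans> root of an H3 template.
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new InputSource(new StringReader(templateXml)));
            String root = doc.getDocumentElement().getLocalName();
            if ("crawl-order".equals(root)) {
                return Format.HERITRIX_1;
            }
            if ("beans".equals(root)) {
                return Format.HERITRIX_3;
            }
            return Format.UNKNOWN;
        } catch (Exception e) {
            // Wrap parser and I/O failures in an unchecked exception.
            throw new IllegalArgumentException(
                    "Template is not well-formed XML", e);
        }
    }
}
```

A caller could use such a check to refuse to submit a job whose stored template is still in the Heritrix 1 format, for example TemplateFormatSketch.detect(orderXmlString) == TemplateFormatSketch.Format.HERITRIX_1, where orderXmlString is whatever string form of the template the caller has at hand.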

Estimated time: 5 MD

Subtask 4: Upgrade the code that starts, stops, and communicates with Heritrix

Upgrade the classes in the dk.netarkivet.harvester.harvesting package to communicate with Heritrix 3.

Estimated time: > 5 MD

AssignmentHarvester1 (last edited 2011-09-29 13:29:37 by MikisSethSorensen)