Differences between revisions 1 and 2
Revision 1 as of 2009-05-20 15:48:59
Size: 1064
Comment:
Revision 2 as of 2009-05-20 16:50:38
Size: 1625
Comment:

Harvester assignment 1: Upgrade to Heritrix release 3

This assignment contains the subtasks needed to upgrade NetarchiveSuite to use Heritrix 3.0.0+. Note that the Internet Archive, the maintainer of Heritrix, more or less deprecates the use of Heritrix 2.0+. We will therefore bypass the 2.0 branch of Heritrix.

Tasks

Task A

Download a binary copy of Heritrix, update the contents of lib/heritrix and lib/heritrix-dependencies, and make the appropriate changes to our build.xml.

The deduplication-X-Y-Z.jar (if we're still using this module for deduplication) needs to be recompiled with the new Heritrix.

Estimated time: 1 MD

Task B

All Heritrix templates bundled with NetarchiveSuite need to be migrated to the new format.

This task includes the writing of extensive documentation on how to migrate.

Beware that already generated jobs in one of the states submitted, started or failed will not be resubmittable. It is therefore a good idea to minimise the number of such jobs, for instance by disabling all active harvest definitions.

Consider adding a versioning scheme to the "order_templates" table. The version number is then added to the jobs table to supplement the fields "orderxml" and "orderxmldoc".
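
A minimal sketch of how such a version number could be used to count the jobs that would be left non-resubmittable, so that the number can be checked before migrating. Everything here is an assumption made for illustration: the class TemplateVersionSketch, the JobRow stand-in for a row in the jobs table, the JobState names and the integer version numbers are hypothetical and are not taken from the NetarchiveSuite code base.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class TemplateVersionSketch {
    /** Hypothetical job states; only the three named in the text matter here. */
    enum JobState { NEW, SUBMITTED, STARTED, DONE, FAILED }

    /** States the text flags as not resubmittable once templates are migrated. */
    static final Set<JobState> AT_RISK =
            EnumSet.of(JobState.SUBMITTED, JobState.STARTED, JobState.FAILED);

    /** Hypothetical stand-in for a jobs-table row carrying the proposed version number. */
    static final class JobRow {
        final long jobId;
        final JobState state;
        final int templateVersion;

        JobRow(long jobId, JobState state, int templateVersion) {
            this.jobId = jobId;
            this.state = state;
            this.templateVersion = templateVersion;
        }
    }

    /** A job is stranded if it is in an at-risk state and uses an older template version. */
    static boolean isStranded(JobRow job, int currentTemplateVersion) {
        return AT_RISK.contains(job.state)
                && job.templateVersion != currentTemplateVersion;
    }

    /** Counts the jobs that could not be resubmitted under the current template version. */
    static long countStranded(List<JobRow> jobs, int currentTemplateVersion) {
        long count = 0;
        for (JobRow job : jobs) {
            if (isStranded(job, currentTemplateVersion)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<JobRow> jobs = Arrays.asList(
                new JobRow(1, JobState.SUBMITTED, 1),
                new JobRow(2, JobState.DONE, 1),
                new JobRow(3, JobState.FAILED, 1),
                new JobRow(4, JobState.NEW, 2));
        // Jobs 1 and 3 are in at-risk states with an old template version.
        System.out.println(countStranded(jobs, 2)); // prints 2
    }
}
```

Storing the version on the job itself, rather than looking it up through the template, keeps the check valid even if the template is later migrated or replaced.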

Estimated time: 5 MD

Task C

Upgrade the classes "Job", "JobDBDAO" and "JobDAO" to handle the new kind of templates. The same goes for HeritrixTemplate.
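
One way the upgraded classes might tell the two kinds of templates apart is by the root element of the template XML. The sketch below assumes that an old-style template is an order.xml document with root element crawl-order and that a Heritrix 3 template is a Spring beans document with root element beans. The class TemplateFormatSketch and the TemplateFormat enum are hypothetical names used only for illustration and are not part of NetarchiveSuite or Heritrix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TemplateFormatSketch {
    /** Hypothetical classification of a stored template. */
    enum TemplateFormat { HERITRIX_1, HERITRIX_3, UNKNOWN }

    /** Classifies a template by the local name of its root element (an assumption). */
    static TemplateFormat detect(String templateXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Namespace awareness lets getLocalName() ignore any prefix or default namespace.
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(templateXml)));
        String root = doc.getDocumentElement().getLocalName();
        if ("crawl-order".equals(root)) {
            return TemplateFormat.HERITRIX_1;
        }
        if ("beans".equals(root)) {
            return TemplateFormat.HERITRIX_3;
        }
        return TemplateFormat.UNKNOWN;
    }

    public static void main(String[] args) throws Exception {
        String oldStyle = "<crawl-order><controller/></crawl-order>";
        String newStyle =
                "<beans xmlns=\"http://www.springframework.org/schema/beans\"/>";
        System.out.println(detect(oldStyle)); // prints HERITRIX_1
        System.out.println(detect(newStyle)); // prints HERITRIX_3
    }
}
```

A check of this kind could let "Job" and HeritrixTemplate reject or convert old-format templates explicitly instead of failing later when the job is handed to the harvester.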

Estimated time: 5 MD

Task D

Upgrade the classes in the dk.netarkivet.harvester.harvesting package to communicate with Heritrix 3.

Estimated time: > 5 MD

AssignmentHarvester1 (last edited 2011-09-29 13:29:37 by MikisSethSorensen)