Heritrix and NetarchiveSuite

Related assignments: Upgrade to Heritrix 3, Move to WARC writing

Scheduling

A HarvesterScheduler is started as a separate thread when the GUIApplication starts. This scheduler thread initiates a job-scheduling phase every minute, unless a previous job-scheduling phase is still running. The core of the job scheduling amounts to this:

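// Create jobs for every harvest definition that is ready to be scheduled now.
HarvestDefinitionDAO hddao = HarvestDefinitionDAO.getInstance();
hddao.generateJobs(new Date());
// Send all jobs with status NEW to the harvesters.
submitNewJobs();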
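
The once-a-minute trigger with its skip-if-busy rule can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the NetarchiveSuite implementation: the class SchedulingLoopSketch, its tick() and startTimer() methods, and the worker executor are hypothetical names, and the phase passed in stands in for the generateJobs() and submitNewJobs() calls above.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; this is not the NetarchiveSuite scheduler class. */
class SchedulingLoopSketch {
    private final AtomicBoolean phaseRunning = new AtomicBoolean(false);
    private final Runnable phase;   // stands in for generateJobs() + submitNewJobs()
    private final Executor worker;  // runs the phase off the timer thread

    SchedulingLoopSketch(Runnable phase, Executor worker) {
        this.phase = phase;
        this.worker = worker;
    }

    /** Starts a phase unless one is still running; returns true if one was started. */
    boolean tick() {
        if (!phaseRunning.compareAndSet(false, true)) {
            return false; // previous phase still running: skip this minute
        }
        try {
            worker.execute(() -> {
                try {
                    phase.run();
                } finally {
                    phaseRunning.set(false); // allow a later tick to start a phase
                }
            });
        } catch (RejectedExecutionException e) {
            phaseRunning.set(false); // the phase never started, so release the flag
            throw e;
        }
        return true;
    }

    /** Calls tick() once a minute; the caller shuts the returned timer down. */
    ScheduledExecutorService startTimer() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::tick, 0, 1, TimeUnit.MINUTES);
        return timer;
    }
}
```

The phase is handed to a separate worker because scheduleAtFixedRate on its own would delay late runs and then catch up, rather than skip them; the flag is what gives the "unless a previous phase is still running" behaviour.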

generateJobs() does the following:

Gets a list of HarvestDefinitions that are ready to be scheduled. Note that there are two types of HarvestDefinitions:
1. FullHarvest (can only be scheduled once), and
2. PartialHarvest (can be scheduled an unlimited number of times)
For each of these definitions, creates a series of jobs using the HarvestDefinition.createJobs() method. Such a series constitutes a run of that harvest definition. Each of these jobs then has status NEW.
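
The steps above can be illustrated with a small in-memory model. Everything in it is an assumption made for illustration: the class names (Definition, FullDefinition, PartialDefinition, Job), the fixed jobsPerRun count, and the rule that a partial harvest becomes ready again after a fixed interval are hypothetical and do not mirror the real HarvestDefinition classes.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

/** Hypothetical model; class and field names are not the NetarchiveSuite API. */
class GenerateJobsSketch {
    enum JobStatus { NEW, SUBMITTED, STARTED, DONE, FAILED }

    static class Job {
        final String harvestName;
        final int run;                    // which run of the definition this job belongs to
        JobStatus status = JobStatus.NEW; // every generated job starts as NEW
        Job(String harvestName, int run) { this.harvestName = harvestName; this.run = run; }
    }

    static abstract class Definition {
        final String name;
        final int jobsPerRun; // assumed fixed here, purely for illustration
        int runs = 0;
        Definition(String name, int jobsPerRun) { this.name = name; this.jobsPerRun = jobsPerRun; }

        abstract boolean isReady(Date now);

        /** Creates the series of jobs that makes up one run. */
        List<Job> createJobs() {
            runs++;
            List<Job> jobs = new ArrayList<>();
            for (int i = 0; i < jobsPerRun; i++) {
                jobs.add(new Job(name, runs));
            }
            return jobs;
        }
    }

    /** Can only be scheduled once. */
    static class FullDefinition extends Definition {
        FullDefinition(String name, int jobsPerRun) { super(name, jobsPerRun); }
        @Override boolean isReady(Date now) { return runs == 0; }
    }

    /** Can be scheduled repeatedly; the fixed interval is an assumption. */
    static class PartialDefinition extends Definition {
        final long intervalMillis;
        Date nextRun;
        PartialDefinition(String name, int jobsPerRun, Date firstRun, long intervalMillis) {
            super(name, jobsPerRun);
            this.nextRun = firstRun;
            this.intervalMillis = intervalMillis;
        }
        @Override boolean isReady(Date now) { return !now.before(nextRun); }
        @Override List<Job> createJobs() {
            nextRun = new Date(nextRun.getTime() + intervalMillis); // plan the following run
            return super.createJobs();
        }
    }

    /** Creates one run of jobs for every definition that is ready at the given time. */
    static List<Job> generateJobs(List<? extends Definition> definitions, Date now) {
        List<Job> created = new ArrayList<>();
        for (Definition d : definitions) {
            if (d.isReady(now)) {
                created.addAll(d.createJobs());
            }
        }
        return created;
    }
}
```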

submitNewJobs() does the following:

For each job in the harvest database with status NEW:
Set the job to state SUBMITTED
Collect metadata to attach to the job
1. AliasInfo for this job
2. DuplicateReduction metadata (used in the deduplication process)
Prepare a DoOneCrawlMessage for this job, and submit it to a harvesting queue. The specific queue is decided by the job priority. Historically, the job priority is either LOWPRIORITY (for full harvests) or HIGHPRIORITY (for selective harvests).
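
As a rough sketch of this submission loop, the following uses in-memory queues in place of the JMS harvesting queues. The names SubmitJobsSketch, Job, and CrawlRequest (standing in for a DoOneCrawlMessage) are hypothetical, and the metadata strings are placeholders rather than the real AliasInfo or duplicate-reduction formats.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch; in-memory queues stand in for the JMS harvesting queues. */
class SubmitJobsSketch {
    enum JobStatus { NEW, SUBMITTED, STARTED, DONE, FAILED }
    enum JobPriority { LOWPRIORITY, HIGHPRIORITY }

    static class Job {
        final long id;
        final JobPriority priority;
        JobStatus status = JobStatus.NEW;
        Job(long id, JobPriority priority) { this.id = id; this.priority = priority; }
    }

    /** Stand-in for a DoOneCrawlMessage: the job plus its attached metadata. */
    static class CrawlRequest {
        final Job job;
        final List<String> metadata;
        CrawlRequest(Job job, List<String> metadata) { this.job = job; this.metadata = metadata; }
    }

    /** One queue per priority, so each harvester mode reads only its own jobs. */
    final Map<JobPriority, Queue<CrawlRequest>> queues = new EnumMap<>(JobPriority.class);

    SubmitJobsSketch() {
        for (JobPriority p : JobPriority.values()) {
            queues.put(p, new ArrayDeque<>());
        }
    }

    /** Submits every NEW job and returns how many were submitted. */
    int submitNewJobs(List<Job> allJobs) {
        int submitted = 0;
        for (Job job : allJobs) {
            if (job.status != JobStatus.NEW) {
                continue; // already submitted, running, or finished
            }
            job.status = JobStatus.SUBMITTED;
            List<String> metadata = new ArrayList<>();
            metadata.add("alias info for job " + job.id);               // placeholder content
            metadata.add("duplicate-reduction info for job " + job.id); // placeholder content
            queues.get(job.priority).add(new CrawlRequest(job, metadata));
            submitted++;
        }
        return submitted;
    }
}
```

Because each job is marked SUBMITTED as it is queued, calling submitNewJobs() again on the same list submits nothing twice.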

The scheduler also listens for responses from the harvesters (CrawlStatusMessages) sent to its own dedicated JMS queue (referred to as Channels.getTheSched()), using the class HarvestSchedulerMonitorServer.

The harvester sends back two messages. The first announces that it has started the crawl, which causes the scheduler to move the job to the STARTED state. The second announces that the crawl has either finished in the expected way (status moved to DONE) or finished abnormally (status moved to FAILED). Another responsibility of the scheduler is to back up the HarvestDatabase.
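
The job life cycle described in this section (NEW, SUBMITTED, STARTED, then DONE or FAILED) can be written down as a small state machine. This is only a sketch: the class JobStatusSketch and its apply() method are hypothetical, and rejecting out-of-order status reports is an assumption, since the text above does not say how the scheduler treats unexpected messages.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the job status transitions; not the NetarchiveSuite code. */
class JobStatusSketch {
    enum JobStatus { NEW, SUBMITTED, STARTED, DONE, FAILED }

    /** For each status, the statuses that may follow it. */
    private static final Map<JobStatus, Set<JobStatus>> ALLOWED = new EnumMap<>(JobStatus.class);
    static {
        ALLOWED.put(JobStatus.NEW, EnumSet.of(JobStatus.SUBMITTED));
        ALLOWED.put(JobStatus.SUBMITTED, EnumSet.of(JobStatus.STARTED));
        ALLOWED.put(JobStatus.STARTED, EnumSet.of(JobStatus.DONE, JobStatus.FAILED));
        ALLOWED.put(JobStatus.DONE, EnumSet.noneOf(JobStatus.class));   // terminal
        ALLOWED.put(JobStatus.FAILED, EnumSet.noneOf(JobStatus.class)); // terminal
    }

    private JobStatus current = JobStatus.NEW;

    JobStatus current() {
        return current;
    }

    /**
     * Applies a reported status. Returns false and leaves the state unchanged
     * if the transition is not in the table (an assumed policy).
     */
    boolean apply(JobStatus reported) {
        if (!ALLOWED.get(current).contains(reported)) {
            return false;
        }
        current = reported;
        return true;
    }
}
```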

TODO: It would probably be a very good idea to separate the scheduler from the GUIApplication. This will require the database to be external instead of embedded, which it now is by default.

Harvesting

The actual harvesting is handled by the HarvestControllerApplication (or, more correctly, the HarvestControllerServer behind it). This application can operate in either HIGHPRIORITY or LOWPRIORITY mode. In HIGHPRIORITY mode, it runs jobs belonging to a selective harvest. In LOWPRIORITY mode, it runs jobs belonging to a full harvest. The modus operandi of the HarvestControllerServer is as follows:

Grab a DoOneCrawlMessage from the harvest queue
Prepare harvest
1. Fetch deduplication index
2. Setup Heritrix directories
Start Heritrix process
Keep fetching Crawlstate from Heritrix using JMX until Crawlstate is finished OR we tell Heritrix to finish because it is harvesting too little material
Shutdown Heritrix
Postprocessing
1. Move the harvest metadata into the metadata-1.arc file
2. Upload the ARC files, including the metadata-1.arc file
3. Send a CrawlStatusMessage back to the scheduler
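
The monitoring step in the middle of this sequence (poll the crawl state until the crawl finishes, or stop it when it harvests too little) might look roughly like the sketch below. It makes several assumptions: CrawlProbe is a hypothetical interface standing in for the JMX calls to Heritrix, and "too little material" is modelled as no growth in harvested bytes over a number of consecutive polls, which may differ from the criterion NetarchiveSuite actually uses.

```java
/** Hypothetical sketch; CrawlProbe stands in for the JMX calls to Heritrix. */
class CrawlMonitorSketch {
    interface CrawlProbe {
        boolean isFinished();   // has the crawl ended by itself?
        long bytesHarvested();  // total amount harvested so far
        void requestStop();     // ask the crawler to finish early
    }

    enum Outcome { FINISHED, STOPPED_FOR_INACTIVITY, INTERRUPTED }

    /**
     * Polls the crawl until it finishes, or stops it after maxIdlePolls
     * consecutive polls without any growth in harvested bytes (assumed criterion).
     */
    static Outcome monitor(CrawlProbe probe, int maxIdlePolls, long pollMillis) {
        long lastBytes = probe.bytesHarvested();
        int idlePolls = 0;
        while (!probe.isFinished()) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                probe.requestStop();
                return Outcome.INTERRUPTED;
            }
            long bytes = probe.bytesHarvested();
            idlePolls = (bytes > lastBytes) ? 0 : idlePolls + 1;
            lastBytes = bytes;
            if (idlePolls >= maxIdlePolls) {
                probe.requestStop();
                return Outcome.STOPPED_FOR_INACTIVITY;
            }
        }
        return Outcome.FINISHED;
    }
}
```

Whatever the outcome, the caller would then continue with the shutdown and postprocessing steps listed above.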