NetarchiveSuite Overview

Software to harvest, archive and browse large parts of the internet.

Introduction

The primary function of the NetarchiveSuite is to plan, schedule and archive web harvests of parts of the internet. Heritrix is used as the web crawler. The NetarchiveSuite can organize three different kinds of harvests:

The software has been designed with the following in mind:

The modules in the NetarchiveSuite

The NetarchiveSuite is split into four main modules: one module with common functionality and three modules corresponding to ingesting, archiving and accessing.

dk.netarkivet.common module

The framework and utilities used by the whole suite, such as exceptions, settings, messaging, file transfer (RemoteFile), and logging. It also defines the interfaces used to communicate between the different modules, so that alternative implementations can be supported.

dk.netarkivet.harvester module

This module handles defining, scheduling, and performing harvests.

dk.netarkivet.archive module

This module allows running a repository with replication, active bit consistency checks for bit preservation, and support for distributed batch jobs on the archive.
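To illustrate what a bit consistency check across replicas can look like, here is a minimal sketch. It is not the NetarchiveSuite implementation: the class ReplicaChecker, its method names, the use of SHA-256 and the majority-vote rule are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of the NetarchiveSuite API. */
final class ReplicaChecker {

    /** Returns the SHA-256 digest of the data as lowercase hex. */
    static String checksum(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns the sorted names of replicas whose copy disagrees with the
     * checksum held by a strict majority. Without a majority, every
     * replica is reported, since none can be trusted.
     */
    static List<String> findSuspectReplicas(Map<String, byte[]> copies) {
        Map<String, String> checksumByReplica = new HashMap<>();
        Map<String, Integer> votes = new HashMap<>();
        for (Map.Entry<String, byte[]> e : copies.entrySet()) {
            String sum = checksum(e.getValue());
            checksumByReplica.put(e.getKey(), sum);
            votes.merge(sum, 1, Integer::sum);
        }
        String majority = null;
        for (Map.Entry<String, Integer> v : votes.entrySet()) {
            if (v.getValue() * 2 > copies.size()) {
                majority = v.getKey();
            }
        }
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, String> e : checksumByReplica.entrySet()) {
            if (!e.getValue().equals(majority)) {
                suspects.add(e.getKey());
            }
        }
        Collections.sort(suspects);
        return suspects;
    }

    public static void main(String[] args) {
        // Replica names and contents are made up for the demonstration.
        Map<String, byte[]> copies = new HashMap<>();
        copies.put("replicaA", "harvested record".getBytes(StandardCharsets.UTF_8));
        copies.put("replicaB", "harvested record".getBytes(StandardCharsets.UTF_8));
        copies.put("replicaC", "harvested recorX".getBytes(StandardCharsets.UTF_8));
        System.out.println(findSuspectReplicas(copies));
    }
}
```

In this sketch two of the three copies agree, so only replicaC is reported as suspect. A real repository would also need a way to repair the damaged copy from a healthy replica, which is left out here.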

dk.netarkivet.viewerproxy module

This module gives access to previously harvested material through a proxy solution. The viewerproxy component supports transparent access to the harvested data, using a proxy solution and an archive with an index.
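The role of the index can be pictured with a small sketch: for each requested URL, the proxy asks the index where the harvested copy is stored instead of contacting the live web. Everything below is hypothetical, including the class ArchiveIndex, the Location fields and the example file name. The actual viewerproxy index format is not described here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch; class and field names are not NetarchiveSuite's. */
final class ArchiveIndex {

    /** Where a harvested record is stored: archive file name plus byte offset. */
    static final class Location {
        final String archiveFile;
        final long offset;

        Location(String archiveFile, long offset) {
            this.archiveFile = archiveFile;
            this.offset = offset;
        }
    }

    private final Map<String, Location> locationByUrl = new HashMap<>();

    /** Registers the storage location of the harvested copy of a URL. */
    void add(String url, String archiveFile, long offset) {
        locationByUrl.put(url, new Location(archiveFile, offset));
    }

    /** Returns the stored location, or empty if the URL was never harvested. */
    Optional<Location> lookup(String url) {
        return Optional.ofNullable(locationByUrl.get(url));
    }

    public static void main(String[] args) {
        ArchiveIndex index = new ArchiveIndex();
        // The file name and offset are invented for the example.
        index.add("http://example.org/", "harvest-0001.arc", 4096L);
        System.out.println(index.lookup("http://example.org/").isPresent());
        System.out.println(index.lookup("http://example.org/missing").isPresent());
    }
}
```

A proxy built on such an index would answer a hit by reading the record at the given offset, and answer a miss with an error page rather than fetching the page from the live web.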

For developers

* The modules are loosely coupled, communicating through interfaces, with the implementation replaceable without recompiling.
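One common way to make an implementation replaceable without recompiling is to name the implementing class in a setting and load it by reflection. The sketch below shows that general technique only. The interface FileTransfer, the class LocalCopyTransfer, the loader and the property name transfer.class are all made up for illustration and are not taken from the NetarchiveSuite code.

```java
/** Hypothetical loader; shows the general technique, not the suite's own code. */
final class PluginLoader {

    /** Instantiates the named class through its no-argument constructor. */
    static FileTransfer load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return cls.asSubclass(FileTransfer.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        // "transfer.class" is an invented property name. The default is given
        // here only so that the demonstration runs without any configuration.
        String className = System.getProperty(
                "transfer.class", LocalCopyTransfer.class.getName());
        FileTransfer transfer = load(className);
        System.out.println(transfer.describe());
    }
}

/** Hypothetical interface standing in for one shared between modules. */
interface FileTransfer {
    String describe();
}

/** Hypothetical default implementation. */
final class LocalCopyTransfer implements FileTransfer {
    @Override
    public String describe() {
        return "local copy";
    }
}
```

Callers depend only on the interface, so pointing the setting at another class on the classpath swaps the implementation while the calling code stays compiled as it is.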