NetarchiveSuite Overview

NetarchiveSuite is software to harvest, archive and browse large parts of the internet.

Note that this document covers the development version of NetarchiveSuite. It is updated continuously to reflect the code currently under development in the SVN repository. For documentation of the stable version, please refer to the stable Overview.


Introduction

The primary function of the NetarchiveSuite is to plan, schedule and archive web harvests of parts of the internet. We use Heritrix as our web crawler. NetarchiveSuite was released in July 2007 as open source under the LGPL and is used by the Danish organization Netarkivet.dk (http://netarkivet.dk). Since July 2005, this organization has used NetarchiveSuite to harvest Danish websites as authorized by the latest Danish Legal Deposit Act.

The NetarchiveSuite can organize three different kinds of harvests:

The software has been designed with the following in mind:

The modules in the NetarchiveSuite

The NetarchiveSuite is split into four main modules: one module with common functionality and three modules corresponding to the processes of harvesting, archiving and accessing, respectively.

[Figure: MainView.jpg]

The Common Module

This module contains the framework and utilities used by the whole suite, such as exceptions, settings, messaging, file transfer (RemoteFile), and logging. It also defines the Java interfaces through which the different modules communicate, so that alternative implementations can be plugged in.
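
The idea of defining interfaces so that implementations can be swapped can be illustrated with a minimal Java sketch. Everything in it is hypothetical and chosen for illustration only: the interface FileTransfer, the classes LocalCopyTransfer and FtpTransfer, the setting values "local" and "ftp", and the fromSetting method are assumptions and are not taken from the NetarchiveSuite code base.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class PluggableTransferSketch {

    /** Hypothetical interface: the only thing other modules depend on. */
    interface FileTransfer {
        String describe(String fileName);
    }

    /** Hypothetical implementation that copies via a shared file system. */
    static class LocalCopyTransfer implements FileTransfer {
        @Override
        public String describe(String fileName) {
            return "local copy of " + fileName;
        }
    }

    /** Hypothetical implementation that uploads to an FTP server. */
    static class FtpTransfer implements FileTransfer {
        @Override
        public String describe(String fileName) {
            return "FTP upload of " + fileName;
        }
    }

    /** Maps an (assumed) settings value to the implementation to create. */
    private static final Map<String, Supplier<FileTransfer>> IMPLEMENTATIONS =
            new HashMap<>();
    static {
        IMPLEMENTATIONS.put("local", LocalCopyTransfer::new);
        IMPLEMENTATIONS.put("ftp", FtpTransfer::new);
    }

    /** Returns the implementation selected by the given settings value. */
    static FileTransfer fromSetting(String settingValue) {
        Supplier<FileTransfer> supplier = IMPLEMENTATIONS.get(settingValue);
        if (supplier == null) {
            throw new IllegalArgumentException(
                    "Unknown file transfer setting: " + settingValue);
        }
        return supplier.get();
    }

    public static void main(String[] args) {
        // Callers only see the interface; the setting decides the class.
        FileTransfer transfer = fromSetting("ftp");
        System.out.println(transfer.describe("job-1.arc"));
    }
}
```

Because callers depend only on the interface, changing the setting value is enough to switch implementations without touching the calling code.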

The Harvester Module

This module handles defining, scheduling, and performing harvests.

The Archive Module

This module makes it possible to set up and run a repository with replication, active bit consistency checks for bit preservation, and support for distributed batch jobs on the archive.
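
As a rough illustration of what a bit consistency check across replicas might look like, the following Java sketch compares checksums of the copies of one file and reports the replicas that disagree with the majority. This is an assumption about one possible approach, not a description of how NetarchiveSuite implements it; the class ReplicaChecksumSketch, the method names, the replica names and the choice of SHA-256 are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReplicaChecksumSketch {

    /** Returns the SHA-256 checksum of the content as a hex string. */
    static String checksum(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /**
     * Given replica name -> file content, returns the names of replicas
     * whose checksum differs from the strict-majority checksum, sorted by
     * name. If no checksum has a strict majority, all replicas are returned.
     */
    static List<String> inconsistentReplicas(Map<String, byte[]> copies) {
        Map<String, String> sums = new TreeMap<>();
        Map<String, Integer> votes = new HashMap<>();
        for (Map.Entry<String, byte[]> copy : copies.entrySet()) {
            String sum = checksum(copy.getValue());
            sums.put(copy.getKey(), sum);
            votes.merge(sum, 1, Integer::sum);
        }
        String majority = null;
        for (Map.Entry<String, Integer> vote : votes.entrySet()) {
            if (vote.getValue() * 2 > copies.size()) {
                majority = vote.getKey();
            }
        }
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, String> sum : sums.entrySet()) {
            if (!sum.getValue().equals(majority)) {
                suspects.add(sum.getKey());
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        Map<String, byte[]> copies = new HashMap<>();
        copies.put("replicaA", "harvested".getBytes(StandardCharsets.UTF_8));
        copies.put("replicaB", "harvested".getBytes(StandardCharsets.UTF_8));
        copies.put("replicaC", "harvestex".getBytes(StandardCharsets.UTF_8));
        System.out.println(inconsistentReplicas(copies)); // prints [replicaC]
    }
}
```

A strict majority is required here so that a two-way disagreement is flagged on both sides rather than resolved arbitrarily.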

The Access (Viewerproxy) Module

This module gives access to previously harvested material through a proxy solution.

For developers

Overview devel (last edited 2010-08-16 10:24:49 by localhost)