Overall Systems Design

This section includes an overall description of the NetarchiveSuite modules. Additional information can be found in the Overview document.

There are seven modules in the NetarchiveSuite software. This section gives an overview of what each module contains. All Java source files are found in the src directory, and all packages start with dk.netarkivet. Unit tests are arranged the same way, but under tests instead of src. The web interface definitions are found in the webpages directory. The lib directory contains all the libraries necessary to compile and run the code.
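The layout described above can be sketched as a small lookup from package name to directory. This is only an illustration: the class ModuleLayout and its methods are hypothetical and not part of NetarchiveSuite, and the sketch assumes the usual Java convention that a package maps to a nested directory under its source root.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of NetarchiveSuite) illustrating the source layout. */
public class ModuleLayout {
    /** All NetarchiveSuite packages start with this prefix. */
    static final String ROOT_PACKAGE = "dk.netarkivet";

    /** The seven module packages named in this section. */
    static final String[] MODULES = {
        "dk.netarkivet.viewerproxy", "dk.netarkivet.archive",
        "dk.netarkivet.common", "dk.netarkivet.deploy",
        "dk.netarkivet.harvester", "dk.netarkivet.monitor",
        "dk.netarkivet.wayback"
    };

    /** Directory holding the sources of a package, relative to the project root. */
    static Path sourceDir(String pkg) {
        return toDir("src", pkg);
    }

    /** Directory holding the unit tests of a package, relative to the project root. */
    static Path testDir(String pkg) {
        return toDir("tests", pkg);
    }

    // Assumption: packages map to nested directories, as is conventional in Java.
    private static Path toDir(String root, String pkg) {
        if (!pkg.equals(ROOT_PACKAGE) && !pkg.startsWith(ROOT_PACKAGE + ".")) {
            throw new IllegalArgumentException("Not a NetarchiveSuite package: " + pkg);
        }
        return Paths.get(root, pkg.split("\\."));
    }

    public static void main(String[] args) {
        // Print where each module's sources and tests would be expected.
        for (String module : MODULES) {
            System.out.println(module + " -> " + sourceDir(module) + ", " + testDir(module));
        }
    }
}
```

Under these assumptions, the sources of dk.netarkivet.archive.bitarchive would be expected under src/dk/netarkivet/archive/bitarchive, with the corresponding unit tests under tests/dk/netarkivet/archive/bitarchive.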

More detailed descriptions are given later in this document.

Access (Viewerproxy)

The dk.netarkivet.viewerproxy package implements a simple access client to the archived data, based on web-page proxying. For more details please refer to Detailed Access Design.

Archive

The dk.netarkivet.archive package and its subpackages provide redundant, distributed storage, primarily for ARC files, as well as Lucene indexing of those files. The arcrepository subpackage contains the logic for keeping multiple bit archives synchronized. The bitarchive subpackage contains the application that stores the actual files and manages access to them. The indexserver subpackage handles merging CDX files and crawl.log files into a Lucene index used for deduplication and for viewerproxy access. The checksum subpackage contains the checksum replica code. For more details please refer to Detailed Archive Design.

Common

The dk.netarkivet.common package and its subpackages provide module-neutral code, some of it generic and some specific to NetarchiveSuite, e.g. settings and channels. For more details please refer to Detailed Common Design.

Deploy

The dk.netarkivet.deploy module contains software for installing NetarchiveSuite on multiple machines. This module is only used in the deployment phase. For more details please refer to Detailed Deploy Design.

Harvester

The dk.netarkivet.harvester package and its subpackages handle the definition and execution of harvests. Its main parts are the database containing the harvest definitions (the datamodel subpackage), the webinterface subpackage through which the user accesses the database, the scheduler subpackage, which handles scheduling and splitting harvests into jobs, and the harvesting subpackage, which encapsulates running Heritrix and sending the results off to the archive. For more details please refer to Detailed Harvester Design.

Monitor

The dk.netarkivet.monitor package provides web access to JMX-packaged information from all NetarchiveSuite applications. For more details please refer to Detailed Monitor Design.

Wayback

The dk.netarkivet.wayback package provides tools for integrating NetarchiveSuite with the open-source Wayback machine for browsing web archives. The tools we provide can be divided under three headings:

System Design 3.14/Overall Systems Description (last edited 2010-11-24 14:57:57 by SoerenCarlsen)