NetarchiveSuite Overview
Software to harvest, archive and browse large parts of the internet.
Introduction
The primary function of the NetarchiveSuite is to plan, schedule and archive web harvests of parts of the internet. We use Heritrix as our web crawler. The NetarchiveSuite can organize three different kinds of harvests:
- Event harvesting (harvests of a set of domains related to a specific event, e.g. 9/11, royal weddings or elections).
- Selective harvesting (recurrent harvests of a set of domains).
- Snapshot harvesting (a complete snapshot of all known domains).
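The difference between the three kinds of harvest is mostly a question of which domains are covered. The following is a minimal sketch of that idea only, not the NetarchiveSuite API: the class `HarvestKindsSketch`, the enum `HarvestKind`, the helper `domainsFor` and the example domain names are all hypothetical names introduced for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

public class HarvestKindsSketch {
    /** The three kinds of harvest named in the overview (hypothetical enum). */
    enum HarvestKind { EVENT, SELECTIVE, SNAPSHOT }

    /**
     * Hypothetical helper: returns the domains a harvest would cover.
     * A snapshot covers every known domain; event and selective
     * harvests cover only the curated set chosen for them.
     */
    static Set<String> domainsFor(HarvestKind kind,
                                  Set<String> curated,
                                  Set<String> allKnown) {
        if (kind == HarvestKind.SNAPSHOT) {
            return new TreeSet<>(allKnown);
        }
        return new TreeSet<>(curated);
    }

    public static void main(String[] args) {
        // Placeholder domain names, assumed for the example.
        Set<String> allKnown = Set.of("example.dk", "example.org", "example.net");
        Set<String> curated = Set.of("example.dk");
        for (HarvestKind kind : HarvestKind.values()) {
            System.out.println(kind + " -> " + domainsFor(kind, curated, allKnown));
        }
    }
}
```

In this sketch an event harvest and a selective harvest differ only in how the curated set is chosen and how often the harvest is repeated, which is left out here.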
The software has been designed with the following in mind:
- Friendly to non-developers - designed to be usable by librarians and curators with a minimum of technical supervision
- Low maintenance - easy setup of automated harvests, automated bit-integrity checks, and simple curation tools
- High bit-preservation security - replication and active integrity tests of large data contents
- Loosely coupled - the suite consists of modules that can be integrated individually, or be used as one large web archiving system
The modules in the NetarchiveSuite
The NetarchiveSuite is split into four main modules: one module with common functionality and three modules corresponding to ingesting, archiving and accessing.
dk.netarkivet.common module
The framework and utilities used by the whole suite, such as exceptions, settings, messaging, file transfer (RemoteFile), and logging. It also defines the interfaces used to communicate between the different modules, to support alternative implementations.
dk.netarkivet.harvester module
This module handles defining, scheduling, and performing harvests.
- Harvesting uses Heritrix from the Internet Archive as the crawler. The harvesting module allows flexible, automated definitions of harvests. The full power of Heritrix is available to users who know the Heritrix crawler. NetarchiveSuite wraps the crawler in an easy-to-use interface that handles scheduling and configuring the crawl, and distributing it to several crawling servers.
dk.netarkivet.archive module
dk.netarkivet.viewerproxy module
For developers
NetarchiveSuite is available under a well-known, integration-friendly, open source license (LGPL).