title(Quick Start Manual)


TableOfContents

Include(User Manual/Introduction)

Include(User Manual/Download and installation)

Include(User Manual/Running a simple harvest)

Include(User Manual/Running a snapshot harvest)

Include(User Manual/Carrying on)

Introduction

Note: the site netarchivesuite.netarkivet.dk does not currently work and must be fixed. The software is presently placed at http://developer.statsbiblioteket.dk/NetarchiveSuite/

This manual provides instructions for quickly getting a basic NetarchiveSuite system up and running. It uses a pre-built script that starts all components on the same machine. This allows you to start experimenting with the functionality without having to do any more setup than absolutely necessary.

Download and installation

For a quick start, we have prepared a bash script that starts all the necessary components on one machine. We will use this script throughout this quickstart manual to allow you to get a feel for what the system can do and how it works without having to deal with issues of distributing to other servers.

Base system required

For the quick startup, NetarchiveSuite requires

To check that you have the right version of Java, do the following:

<verbatim>
linux>java -version
java version "1.5.0_06"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode, sharing)
</verbatim>
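If you want to script this check, the following is a minimal sketch. The function names `parse_java_version` and `java_version_at_least` are our own, and the minimum version is passed as a parameter: the value 1.5 used in the comments is taken only from the sample output above and is an assumption, not a stated requirement. The sketch handles purely numeric version strings only.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of NetarchiveSuite.

# Read `java -version` output on stdin and print the quoted version
# string found on its first line (e.g. 1.5.0_06).
parse_java_version() {
    sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Succeed if version $1 (e.g. 1.5.0_06) is at least major.minor $2 (e.g. 1.5).
# Assumption: both arguments start with numeric major and minor fields.
java_version_at_least() {
    have_major=$(echo "$1" | cut -d. -f1)
    have_minor=$(echo "$1" | cut -d. -f2)
    want_major=$(echo "$2" | cut -d. -f1)
    want_minor=$(echo "$2" | cut -d. -f2)
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Example use (java prints its version on stderr, hence 2>&1):
#   version=$(java -version 2>&1 | parse_java_version)
#   java_version_at_least "$version" 1.5 || echo "Java $version is too old"
```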

Downloading

First go to netarchivesuite.netarkivet.dk and download the newest release of the binary.

Unzip the binary package into a directory of your choice, e.g. ~/netarchive.

JMS

NetarchiveSuite uses JMS for inter-process communication. We recommend using the open-source version of Sun's JMS implementation, since some functionality of other implementations does not match our assumptions well.

If you do not already have JMS installed, then do the following to download and install it (note that if you want to install it under /opt/ then you need to be the root user):

Configuration

Changeable settings for NetarchiveSuite are placed in an XML file. For the quickstart, only the mail addresses and the viewerproxy server need to be configured.

To change the necessary settings, do the following (netarchive is assumed to contain the downloaded and unzipped NetarchiveSuite files):

Starting simple_harvest version

Note: Starting the script clears all data from previous runs. This is intentional, since the setup is meant as a "playground".

To start the program do the following:

A typical set of windows looks like this:

You can now see the web interface in the browser.

Stopping simple_harvest version

If at a later point you need to stop the whole system, you can use the scripts described below to stop the programs. After that, it should be possible to restart from scratch by following the instructions described above under section "Starting simple_harvest version".

Do the following: <br> $ cd ~/netarchive/scripts/simple_harvest <br> $ ./killharvest.sh

Or, if that does not kill all the xterms, try the following: <br> $ cd ~/netarchive/scripts/simple_harvest <br> $ ./killhard.sh

Running a simple harvest

Start the program as described under section "Starting simple_harvest version". Click 'Selective Harvests' under menu 'Definitions'.

<img src="%ATTACHURLPATH%/Screenshot-SelectiveHarvests-MozillaFirefox.png" alt="Screenshot-SelectiveHarvests-MozillaFirefox.png" />

Click 'Create new harvest definition' under the (empty) table of existing harvests.

Enter an arbitrary name for the harvest at the top. Enter some second-level domain name (e.g., fnord.com) in the box and press 'Add domains'. Preferably the domain should be one that you know is not bothered by being harvested. You can add multiple domains if you want, but for simplicity and speed of harvesting we will assume a single domain has been entered.

Since the domain didn't exist in the database, the system suggests that you add it. Click 'Create and add to harvest definition'. You can now click 'Save' on the 'Selective Harvest' page.

Click 'Activate' for the newly defined harvest.

Go to the Job Status page by clicking "Harvest status". Refresh it periodically until a job appears and changes to state "Started". This should take no more than two minutes. At this point, a Heritrix instance has started harvesting.

Go to the System Status page by clicking "Systemstate". Click on the application HarvestControllerServer. Its latest log record will give status information from Heritrix (more application information can be found by clicking on "Show all" in the Index column).

<img src="%ATTACHURLPATH%/Screenshot-Overviewofthesystemstate-MozillaFirefox.png" alt="Screenshot-Overviewofthesystemstate-MozillaFirefox.png" />

Viewing the results

Once some web pages have been harvested, we can go over to the viewerproxy part to view them.

Go to the Job status page by clicking "Harvest status". The job created will take a little while to upload. Refresh the page until the job changes state to "Done". Click on the link with the Job Id.

Set your browser to use localhost:8074 as a proxy for everything except localhost.
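The same proxy arrangement can be tried from the command line, which is a different technique from the browser setting described above. The sketch below assumes command-line tools that honour the conventional `http_proxy` and `no_proxy` environment variables (curl does). The function name `set_viewerproxy_env` is our own, and the host and port default to the localhost:8074 value given above.

```shell
#!/bin/sh
# Sketch only: the function name is hypothetical, not part of NetarchiveSuite.

# Point command-line HTTP clients at the viewerproxy.
# $1 = proxy host (default localhost), $2 = proxy port (default 8074).
# Requests to localhost itself bypass the proxy, mirroring the
# "everything except localhost" rule.
set_viewerproxy_env() {
    http_proxy="http://${1:-localhost}:${2:-8074}"
    no_proxy="localhost"
    export http_proxy no_proxy
}

# Example use (www.example.com stands in for your harvested domain):
#   set_viewerproxy_env
#   curl http://www.example.com/
# or, without touching the environment:
#   curl --proxy localhost:8074 --noproxy localhost http://www.example.com/
```

Browsers may ignore these variables, so the browser proxy setting is still needed for interactive QA.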

Click on 'Select this job for QA with viewerproxy'. The system spends a while generating an index, then takes you to the viewerproxy status page.

Now go to the URL that you started harvesting from (with www). It should show you the harvested material. If you go to a URL in another domain, you should get an error. Depending on the layout of the domain you harvested, there may also be missing pages or images from that domain.

The NetarchiveSuite allows automatic collection of unharvested URLs during browsing. To try this, go back to the viewerproxy status page and click 'Start collecting URLs'. Now browse in the collected material until you find a page or image that did not get harvested. Go back to the viewerproxy status page and click 'Show collected URLs'.

It should contain several URLs, including the ones you just found missing. Copy the URLs for your harvested domain that were found missing into the clipboard. Go to the domain definition page by clicking "Find Domain(s)" under "Definitions" and search for your domain. On the domain definition page, click "Edit" next to the seedlist.

Add the URLs from the clipboard to the seedlist and press 'Update'. These URLs will be used as seeds the next time the domain is harvested. To see this in effect, create another harvest of this domain as above, and go to the job status page for the new job. Limit the viewerproxy to that job only and browse the material again. The URLs that were missing last time should now have been found.

Running a snapshot harvest

A snapshot harvest harvests all known domains up to a given limit. Each domain has one "default configuration" automatically generated when the domain is created. The default configuration is used to determine how to harvest the domain in a snapshot harvest. Typically, the default configuration is good enough, but if you want to have a domain excluded from the snapshot harvest (e.g. if the domain is outside the group you're interested in) you may want to set the harvest limit on the default configuration for that domain to 0. The default configuration is also the one used in a selective harvest unless another configuration is chosen in the drop-down menu on the selective harvest page. The other way to control how a snapshot harvest is executed is by choosing a different harvest template. Descriptions of how harvest templates work are in the (upcoming) user manual.

NetarchiveSuite has support for mass ingest of domains, for instance from a list given by a national TLD administrator. To ingest, go to the "Create Domain" page under "Definitions" and specify the file containing the list of domains (or paste them in the text window, but typical national TLD lists are very long). The list should be a newline-separated list of domain names including the top level domain, but not including subdomains, protocol specifications or URL paths. Thus netarkivet.dk or archive.org are useable, while http://foo.com, bar.dk/hest or news.bbc.co.uk are not. Note that we assume only one level under the TLD at the moment. When the file is specified, press "Ingest" and wait while the domains are ingested. For a first test, you probably want to keep it to a fairly small number of sites.
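Before ingesting a long list, it can help to pre-filter it against the format rules above. The following is a minimal sketch: the function names are our own, and the regular expression is only an approximation of "one name directly under a TLD", not the validation NetarchiveSuite itself performs.

```shell
#!/bin/sh
# Sketch only: function names and the pattern are assumptions,
# not NetarchiveSuite's own validation.

# One label, a dot, and a TLD: no protocol, no path, no subdomains.
DOMAIN_PATTERN='^[A-Za-z0-9-]+\.[A-Za-z]+$'

# Read a newline-separated list on stdin; print only lines that look ingestable.
filter_ingestable_domains() {
    grep -E "$DOMAIN_PATTERN"
}

# Read the same list; print the lines that would be left out.
rejected_domains() {
    grep -v -E "$DOMAIN_PATTERN"
}

# Example use (domains.txt and domains-clean.txt are hypothetical file names):
#   filter_ingestable_domains < domains.txt > domains-clean.txt
#   rejected_domains < domains.txt
```

With the examples from the text, netarkivet.dk and archive.org pass, while http://foo.com, bar.dk/hest and news.bbc.co.uk are rejected.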

After ingest, you can click on "Domain statistics" under "Definitions" to see an overview of how many domains are registered under the TLDs. To create a snapshot definition, go to "Snapshot harvests" and press "Create new snapshot harvest". The harvest definition presented will require you to enter a harvest name, and also allows adding comments or changing the limit of how many bytes to collect per domain. Keep this to a fairly low number for a first test.

When done, press "Save", then "Activate".

Watch it run as above, and when it is done (this will take somewhat longer, depending on how many sites were used) go to the job created and select it for QA as above before browsing the sites you just added.

Carrying on...

This marks the end of the quickstart manual. While you can of course continue to play around with this simple setup, there are numerous options and possibilities that are not mentioned herein that are useful for scalability and for adapting the harvesting to your situation. Further information about installation and configuration can be found in the InstallationManual, and more details on how to use the web interface can be found in the UserManual. The DeveloperManual has information on how to program new or altered functionality for NetarchiveSuite in Java.

Good luck!