Quick Start Manual - NetarchiveSuite

Introduction

Note: the address netarchivesuite.netarkivet.dk does not currently work. The software is presently available at http://developer.statsbiblioteket.dk/NetarchiveSuite/

This manual provides instructions for quickly getting a basic NetarchiveSuite system up and running. It uses a pre-built script that starts all components on the same machine. This allows you to start experimenting with the functionality without having to do any more setup than absolutely necessary.

Download and installation

For a quick start, we have prepared a bash script that starts all the necessary components on one machine. We will use this script throughout this quickstart manual to allow you to get a feel for what the system can do and how it works without having to deal with issues of distributing to other servers.

Base system required

For the quick startup, NetarchiveSuite requires

a Linux system. Note that for the quickstart, you must be able to run a browser on the machine that you run the system on -- this is an artifact of the quickstart system and is not the case in the full system.
Sun Java SE (Standard Edition) JDK ver. 1.5.0_06 running on the Linux system. Higher versions of Java may work, but have not been tested. Lower versions of Java will not work correctly. (Current downloadable version of Sun Java 5 SE is "JDK 5.0 Update 12")

To check that you have the right version of Java, do the following:

in a terminal, log in to the Linux system as an ordinary user
check that the Java version is 1.5.0_06 (or higher) by running: $ java -version

<verbatim>
linux>java -version
java version "1.5.0_06"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode, sharing)
</verbatim>
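The version check above can also be scripted. The following is a minimal sketch, assuming the first line of the `java -version` output has the form shown in the sample (`java version "1.5.0_06"`). The function names `parse_java_version` and `java_version_ok` are our own illustrations and are not part of NetarchiveSuite.

```shell
# Extract the quoted version string from the first line of `java -version`
# output read on stdin, e.g. 1.5.0_06.
parse_java_version() {
    sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Succeed if version $1 is at least the minimum $2 (default 1.5.0_06).
# The versions are split on '.' and '_' and compared field by field as numbers.
java_version_ok() {
    awk -v have="$1" -v want="${2:-1.5.0_06}" 'BEGIN {
        split(have, h, "[._]")
        split(want, w, "[._]")
        for (i = 1; i <= 4; i++) {
            a = h[i] + 0    # missing fields count as 0
            b = w[i] + 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}
```

Since `java -version` writes to standard error, a typical call would be `java_version_ok "$(java -version 2>&1 | parse_java_version)" || echo "Java is too old"`.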

Downloading

First go to netarchivesuite.netarkivet.dk and download the newest release of the binary.

Unzip the binary package in some directory, e.g. directory ~/netarchive

in a terminal, log in to the Linux system as an ordinary user in a bash shell
make a directory for the download, e.g. ~/netarchive: $ mkdir ~/netarchive
start a browser: $ mozilla
enter this URL in the browser: http://netarchivesuite.netarkivet.dk
click on the download link: NetarchiveSuite.zip
select "Save it to disk" and click "OK"
- <img src="%ATTACHURLPATH%/download.jpg" alt="save as - download binary">
go to the empty directory created for the download
- <img src="%ATTACHURLPATH%/downloading_filefullpath.jpg" alt="specify filepath - download binary">
click "Save"
go to directory where file was downloaded $ cd ~/netarchive
unzip the binary package $ unzip NetarchiveSuite.zip

JMS

NetarchiveSuite uses JMS for inter-process communication. We recommend using the open-source version of Sun's JMS implementation, since some functionality of other implementations does not match our assumptions well.

If you do not already have JMS installed, then do the following to download and install it (note that if you want to install it under /opt/ then you need to be the root user):

enter this URL in the browser: https://mq.dev.java.net/downloads.html
click the Linux link under version 4.1 to download the file mq4_1-binary-Linux_X86-20070518.jar (or later)
select "Save it to disk" and click "OK" <img src="%ATTACHURLPATH%/JMS_saveas.jpg" alt="save as - download JMS" />
go to the directory for the download, e.g. ~/netarchive <img src="%ATTACHURLPATH%/JMS_filefullpath.jpg" alt="specify filepath - download JMS" />
click "Save"
go to directory where file was downloaded $ cd ~/netarchive
unpack the jar file (resulting in creation of a directory mq and three files with legalese) $ jar xvf mq4_1-binary-Linux_X86-20070518.jar
make imqbrokerd executable and run it in order to create the instance settings: $ chmod +x ./mq/bin/imqbrokerd $ ./mq/bin/imqbrokerd
check that it starts and that the last message is "Broker <localhost>:7676 ready"
stop imqbrokerd by pressing Ctrl-C
edit the settings to allow enough listeners on a queue by
- editing ~/netarchive/mq/var/instances/imqbroker/props/config.properties
- uncommenting the listener setting and giving it the value 20, i.e. changing the line "# imq.autocreate.queue.maxNumActiveConsumers" to "imq.autocreate.queue.maxNumActiveConsumers=20"
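If you prefer not to edit the file by hand, the change can be made with sed. This is a sketch only: the function name `set_max_consumers` is ours, and it assumes that the property appears at most once in the file, either commented out or already set.

```shell
# Set imq.autocreate.queue.maxNumActiveConsumers in a broker config.properties.
# $1 = path to config.properties, $2 = number of consumers (default 20).
# The original file is kept as <file>.bak.
set_max_consumers() {
    local file="$1"
    local count="${2:-20}"
    local key='imq.autocreate.queue.maxNumActiveConsumers'
    local key_re='imq\.autocreate\.queue\.maxNumActiveConsumers'
    cp "$file" "$file.bak" || return 1
    if grep -q "^[#[:space:]]*$key_re" "$file.bak"; then
        # Replace the commented-out (or existing) line with the active setting.
        sed "s|^[#[:space:]]*$key_re.*|$key=$count|" "$file.bak" > "$file"
    else
        # The property is not mentioned at all: append it.
        echo "$key=$count" >> "$file"
    fi
}
```

With the paths used in this manual, the call would be `set_max_consumers ~/netarchive/mq/var/instances/imqbroker/props/config.properties 20`.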

Configuration

Changeable settings for NetarchiveSuite are placed in an XML file. For the quickstart, only the mail addresses and viewerproxy server need to be configured.

To change the necessary settings do the following: (netarchive is assumed to contain the downloaded and unzipped NetarchiveSuite files)

edit the changeable settings in ~/netarchive/scripts/simple_harvest/settings.xml
find the <notifications> tag, and change
- the <sender> email address to your address <sender>edit@this.mail.address</sender> -> <sender>your email address</sender> e.g. <sender>olsen@kb.dk</sender>
- the <receiver> email address to your address <receiver>edit@this.mail.address</receiver> -> <receiver>administrator email address</receiver> e.g. <receiver>root@dia-test-int-01.kb.dk</receiver>
find the <mail> tag, and change the hostname of the mailserver to be that of your friendly neighbourhood mail server. <server>your.mail.server</server> -> <server>your mail server</server> e.g. <server>post.kb.dk</server>
Since zip does not store the executable bit, you need to make the three shell scripts executable: $ chmod +x ~/netarchive/scripts/simple_harvest/*.sh
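The edits to settings.xml described above can also be scripted. The sketch below is illustrative rather than an official tool: the function name `configure_mail` is ours, and it assumes that each of the `<sender>`, `<receiver>` and `<server>` elements sits on a single line, that no other element in the file uses these names, and that the new values contain neither `|` nor `&`.

```shell
# Replace the text of the <sender>, <receiver> and <server> elements.
# $1 = settings file, $2 = sender address, $3 = receiver address,
# $4 = mail server host name. The original file is kept as <file>.bak.
configure_mail() {
    local file="$1"
    local sender="$2"
    local receiver="$3"
    local server="$4"
    cp "$file" "$file.bak" || return 1
    sed -e "s|<sender>[^<]*</sender>|<sender>$sender</sender>|" \
        -e "s|<receiver>[^<]*</receiver>|<receiver>$receiver</receiver>|" \
        -e "s|<server>[^<]*</server>|<server>$server</server>|" \
        "$file.bak" > "$file"
}
```

Using the example values from the steps above, the call would be `configure_mail ~/netarchive/scripts/simple_harvest/settings.xml olsen@kb.dk root@dia-test-int-01.kb.dk post.kb.dk`. Compare the result with the `.bak` copy before starting the system.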
For the simple harvest setup, the startup script needs to know a few paths. You can either do this by exporting the environment variables in your shell or by changing the harvest script. The following describes how to set the environment variables:
Set JAVA to point to your java installation directory, e.g. /usr/java/jdk1.5.0_06. =$ export JAVA=/usr/java/jdk1.5.0_06=
Set WEBARKIVERING to point to the directory that you unpacked the NetarchiveSuite files into, e.g. ~/netarchive/. =$ export WEBARKIVERING=~/netarchive/=
Set IMQ to the full path of the executable of the Sun JMS broker, e.g. ~/netarchive/mq/bin/imqbrokerd =$ export IMQ=~/netarchive/mq/bin/imqbrokerd=
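Before starting the system it can be useful to confirm that the three variables point where you expect. The following sketch uses a function name of our own, `check_quickstart_env`. It assumes that a Java installation directory contains `bin/java`, which is the usual JDK layout but is not stated in this manual, and that the unpacked NetarchiveSuite directory contains `scripts/simple_harvest`.

```shell
# Check that JAVA, WEBARKIVERING and IMQ are set as described above.
# Prints one line per problem found and returns non-zero if any check fails.
check_quickstart_env() {
    local status=0
    # Assumption: a Java installation directory contains bin/java.
    if [ -z "${JAVA:-}" ] || [ ! -x "$JAVA/bin/java" ]; then
        echo "JAVA does not point to a Java installation directory: '${JAVA:-}'"
        status=1
    fi
    # The unpacked release should contain the simple_harvest scripts.
    if [ -z "${WEBARKIVERING:-}" ] || [ ! -d "$WEBARKIVERING/scripts/simple_harvest" ]; then
        echo "WEBARKIVERING does not point to the unpacked NetarchiveSuite files: '${WEBARKIVERING:-}'"
        status=1
    fi
    # IMQ must be the broker executable itself, not its directory.
    if [ -z "${IMQ:-}" ] || [ ! -f "$IMQ" ] || [ ! -x "$IMQ" ]; then
        echo "IMQ is not an executable file: '${IMQ:-}'"
        status=1
    fi
    return "$status"
}
```

After the three `export` commands, running `check_quickstart_env` should print nothing and return success.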

Starting simple_harvest version

Note: Starting the script clears all data from previous runs. This is a feature of it being a "playground" setup.

To start the program do the following:

in a terminal, log in to the Linux system as an ordinary user in a bash shell
if the environment variables JAVA, WEBARKIVERING and IMQ are not set (as described above), set them now
go to the simple harvest directory and run the program: $ cd ~/netarchive/scripts/simple_harvest $ ./harvest.sh Note that this opens a number of xterms, one for each running application (including JMS). Two of them will have smaller xterms either over or under them. All the xterms display the logs of their applications (see the figure below), so you can follow how they start up and when they are ready to use.
wait until the xterm with the title HarvestDefinition has stopped printing messages (towards the end it should include one saying "Scheduler running every 60 seconds")

A typical set of windows looks like this:

<img src="%ATTACHURLPATH%/Screenshot-1.png" alt="Screenshot-1.png" />

start a browser, e.g. $ mozilla Note that it is important that the browser is started on the same machine that the simple harvest script runs on.
enter this URL in the browser: http://localhost:8073/HarvestDefinition

You can now see the web interface in the browser.
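Instead of watching the xterms, you can poll the web interface from a second terminal until it answers. This is a rough sketch: `wait_until` and `gui_ready` are names of our own choosing, and it assumes that curl is installed and that the interface answers plain HTTP requests on the URL given above.

```shell
# Run a probe command repeatedly until it succeeds.
# $1 = maximum number of attempts, $2 = seconds to sleep between attempts,
# remaining arguments = the probe command and its arguments.
wait_until() {
    local tries="$1"
    local delay="$2"
    shift 2
    local n=0
    while [ "$n" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        n=$((n + 1))
        # Only sleep if another attempt will follow.
        if [ "$n" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Probe for the quickstart web interface.
# Succeeds once the server answers without an HTTP error status.
gui_ready() {
    curl --silent --fail --output /dev/null http://localhost:8073/HarvestDefinition
}
```

For example, `wait_until 30 5 gui_ready && echo "web interface is up"` tries for up to about two and a half minutes.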

Stopping simple_harvest version

If at a later point you need to stop the whole system, you can use the scripts described below to stop the programs. After that, it should be possible to restart from scratch by following the instructions described above under section "Starting simple_harvest version".
Do the following: $ cd ~/netarchive/scripts/simple_harvest $ ./killharvest.sh
or if that does not kill all the xterms, try the following $ cd ~/netarchive/scripts/simple_harvest $ ./killhard.sh

Running a simple harvest

Start the program as described under section "Starting simple_harvest version". Click 'Selective Harvests' under the menu 'Definitions'.
<img src="%ATTACHURLPATH%/Screenshot-SelectiveHarvests-MozillaFirefox.png" alt="Screenshot-SelectiveHarvests-MozillaFirefox.png" />
Click 'Create new harvest definition' under the (empty) table of existing harvests.
<img src="%ATTACHURLPATH%/Screenshot-SelectiveHarvest-MozillaFirefox.png" alt="Screenshot-SelectiveHarvest-MozillaFirefox.png" />
Enter an arbitrary name for the harvest at the top. Enter some second-level domain name (e.g., fnord.com) in the box and press 'Add domains'. Preferably, the domain should be one that you know will not be bothered by being harvested. You can add multiple domains if you want, but for simplicity and speed of harvesting we will assume a single domain has been entered.
<img src="%ATTACHURLPATH%/Screenshot-SelectiveHarvest-MozillaFirefox-1.png" alt="Screenshot-SelectiveHarvest-MozillaFirefox-1.png" />
Since the domain didn't exist in the database, it suggests you add it. Click 'Create and add to harvest definition'. You can now click 'Save' on the 'Selective Harvest' page
<img src="%ATTACHURLPATH%/Screenshot-SelectiveHarvests-MozillaFirefox-1.png" alt="Screenshot-SelectiveHarvests-MozillaFirefox-1.png" />
Click 'Activate' for the newly defined harvest.
Go to the Job Status page by clicking "Harvest status". Refresh it periodically until a job appears and changes to state "Started". This should take no more than two minutes. At this point, a Heritrix instance has started harvesting.
<img src="%ATTACHURLPATH%/Screenshot-AllJobs-MozillaFirefox.png" alt="Screenshot-AllJobs-MozillaFirefox.png" />
Go to the System Status page by clicking "Systemstate". Click on the application HarvestControllerServer. Its latest log record will give status information from Heritrix (more application information can be found by clicking on "Show all" in the Index column).
<img src="%ATTACHURLPATH%/Screenshot-Overviewofthesystemstate-MozillaFirefox.png" alt="Screenshot-Overviewofthesystemstate-MozillaFirefox.png" />

Viewing the results

Once some web pages have been harvested, we can go over to the viewerproxy part to view them.
Go to the Job status page by clicking "Harvest status". The job created will take a little while to upload. Refresh the page until the job changes state to "Done". Click on the link with the Job Id.
<img src="%ATTACHURLPATH%/Screenshot-AllJobs-MozillaFirefox-1.png" alt="Screenshot-AllJobs-MozillaFirefox-1.png" />
Set your browser to use localhost:8074 as a proxy for everything except localhost.
<img src="%ATTACHURLPATH%/Screenshot-ConnectionSettings.png" alt="Screenshot-ConnectionSettings.png" />
Click on 'Select this job for QA with viewerproxy'. It will generate an index for a while, then go to the viewerproxy status page.
<img src="%ATTACHURLPATH%/Screenshot-ViewerproxyStatus-MozillaFirefox.png" alt="Screenshot-ViewerproxyStatus-MozillaFirefox.png" />
Now go to the URL that you started harvesting from (with www). It should show you the harvested material. If you go to a URL in another domain, you should get an error. Depending on the layout of the domain you harvested, there may also be missing pages or images from that domain.
The NetarchiveSuite allows automatic collection of unharvested URLs during browsing. To try this, go back to the viewerproxy status page and click 'Start collecting URLs'. Now browse in the collected material until you find a page or image that did not get harvested. Go back to the viewerproxy status page and click 'Show collected URLs'.
<img src="%ATTACHURLPATH%/Screenshot-ViewerproxyStatus-MozillaFirefox-2.png" alt="Screenshot-ViewerproxyStatus-MozillaFirefox-2.png" />
It should contain several URLs, including the ones you just found missing. Copy the URLs for your harvested domain that were found missing into the clipboard. Go to the domain definition page by clicking "Find Domain(s)" under "Definitions" and search for your domain. On the domain definition page, click "Edit" next to the seedlist.
<img src="%ATTACHURLPATH%/Screenshot-Editseedlist-MozillaFirefox-1.png" alt="Screenshot-Editseedlist-MozillaFirefox-1.png" />
Add the URLs from the clipboard to the seedlist and press 'Update'. These URLs will be used as seeds the next time the domain is harvested. To see this in effect, create another harvest of this domain as above, and go to the job status page for the new job. Limit the viewerproxy to that job only and browse the material again. The URLs that were missing last time should now have been found.
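If the list of collected URLs is long, picking out the ones that belong to your harvested domain can be done with a small filter. The sketch below is purely illustrative: it assumes you have saved the collected URLs to a text file with one URL per line (the web interface itself only shows them on a page), and `urls_for_domain` is a helper name of our own.

```shell
# Print the URLs from stdin whose host is the given domain or one of its
# subdomains. $1 = domain, e.g. fnord.com.
# Duplicates are removed and the output is sorted.
urls_for_domain() {
    awk -v d="$1" '
        BEGIN { d = tolower(d); suffix = "." d }
        {
            host = $0
            sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", host)  # drop the scheme
            sub("[/?#].*$", "", host)                      # drop path, query, fragment
            sub(":[0-9]+$", "", host)                      # drop the port
            host = tolower(host)
            if (host == d) {
                print $0
            } else if (length(host) > length(suffix) &&
                       substr(host, length(host) - length(suffix) + 1) == suffix) {
                print $0
            }
        }' | LC_ALL=C sort -u
}
```

With a hypothetical file `collected.txt`, `urls_for_domain fnord.com < collected.txt` prints the lines to paste into the seedlist.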

Running a snapshot harvest

A snapshot harvest harvests all known domains up to a given limit. Each domain has one "default configuration" automatically generated when the domain is created. The default configuration is used to determine how to harvest the domain in a snapshot harvest. Typically, the default configuration is good enough, but if you want to have a domain excluded from the snapshot harvest (e.g. if the domain is outside the group you're interested in) you may want to set the harvest limit on the default configuration for that domain to 0. The default configuration is also the one used in a selective harvest unless another configuration is chosen in the drop-down menu on the selective harvest page. The other way to control how a snapshot harvest is executed is by choosing a different harvest template. Descriptions of how harvest templates work are in the (upcoming) user manual.
NetarchiveSuite has support for mass ingest of domains, for instance from a list given by a national TLD administrator. To ingest, go to the "Create Domain" page under "Definitions" and specify the file containing the list of domains (or paste them in the text window, but typical national TLD lists are very long). The list should be a newline-separated list of domain names including the top level domain, but not including subdomains, protocol specifications or URL paths. Thus netarkivet.dk or archive.org are useable, while http://foo.com, bar.dk/hest or news.bbc.co.uk are not. Note that we assume only one level under the TLD at the moment. When the file is specified, press "Ingest" and wait while the domains are ingested. For a first test, you probably want to keep it to a fairly small number of sites.
<img src="%ATTACHURLPATH%/Screenshot-CreateDomain-MozillaFirefox.png" alt="Screenshot-CreateDomain-MozillaFirefox.png" />
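Before ingesting a long list, it may help to check that every line has the expected form. The following sketch encodes the rule described above (exactly one label under the TLD, no protocol, no path) as a regular expression. The pattern and the function names are our own approximation and may be stricter or looser than the validation NetarchiveSuite itself applies; for example, it only accepts ASCII letters, digits and hyphens.

```shell
# Succeed if $1 looks like a domain name exactly one level under a TLD,
# e.g. netarkivet.dk, with no protocol, subdomain or path.
is_ingestable_domain() {
    printf '%s\n' "$1" | grep -Eqi '^[a-z0-9]([a-z0-9-]*[a-z0-9])?\.[a-z]{2,}$'
}

# Read a newline-separated domain list on stdin and report the lines that
# do not pass the check. Returns non-zero if any line was rejected.
check_domain_list() {
    local bad=0
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        # Ignore empty lines.
        if [ -z "$line" ]; then
            continue
        fi
        if ! is_ingestable_domain "$line"; then
            echo "rejected: $line"
            bad=1
        fi
    done
    return "$bad"
}
```

For a hypothetical list file `domains.txt`, running `check_domain_list < domains.txt` prints the lines that should be fixed or removed before pressing "Ingest".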
After ingest, you can click on "Domain statistics" under "Definitions" to see an overview of how many domains are registered under the TLDs. To create a snapshot definition, go to "Snapshot harvests" and press "Create new snapshot harvest". The harvest definition presented will require you to enter a harvest name, and also allows adding comments or changing the limit of how many bytes to collect per domain. Keep this to a fairly low number for a first test.
<img src="%ATTACHURLPATH%/Screenshot-SnapshotHarvest-MozillaFirefox.png" alt="Screenshot-SnapshotHarvest-MozillaFirefox.png" />
When done, press "Save", then "Activate".
Watch it run as above, and when it is done (this will take somewhat longer, depending on how many sites were used) go to the job created and select it for QA as above before browsing the sites you just added.

Carrying on...

This marks the end of the quickstart manual. While you can of course continue to play around with this simple setup, there are numerous options and possibilities that are not mentioned herein that are useful for scalability and for adapting the harvesting to your situation. Further information about installation and configuration can be found in the InstallationManual, and more details on how to use the web interface can be found in the UserManual. The DeveloperManual has information on how to program new or altered functionality for NetarchiveSuite in Java.
Good luck!