Release Notes for NetarchiveSuite 3.4.2

This version of NetarchiveSuite was released on 2008-04-14.

New features since NetarchiveSuite 3.4.1

This is a bugfix release that fixes the following bug:

[#1238] JMX connection to Heritrix fails immediately after Heritrix process being started

This bug caused JMX connections to Heritrix to time out on busy harvester machines.

New features since NetarchiveSuite 3.4.0

This is a bugfix release that fixes the following bugs:

[#1078] DeDuplikator index too large
[#1079] snap shot harvest not browsable due to large index

These bugs caused OutOfMemory exceptions on harvests as large as a full harvest of the Danish domain.

New features since NetarchiveSuite 3.2.*

General

We now have a publicly available SVN repository. See https://gforge.statsbiblioteket.dk/scm/?group_id=7.

We also have a public tracking system available at the GForge site. Here you can browse known bugs and feature requests, as well as report new bugs or request new features. You can also submit patches for the NetarchiveSuite code. See https://gforge.statsbiblioteket.dk/tracker/index.php?group_id=7

We also have three mailing lists available. The netarchivesuite-announce list is a low-traffic list for announcements of new releases. The netarchivesuite-users list is for general discussion. The developers are active on this list, and feedback and comments are very welcome. Finally, there is a mailing list where all commits to our repository are automatically reported. This third list is fairly high volume and of interest to developers only. See https://gforge.statsbiblioteket.dk/mail/?group_id=7

Common Module

New secure remotefile implementation

A new remote file implementation using HTTPS has been added. To use it, set your remote file settings in settings.xml as follows:

<common>
  ...
  <remoteFile xsi:type="httpsremotefile">
    <!-- The class to use for RemoteFile objects. -->
    <class>dk.netarkivet.common.distribute.HTTPSRemoteFile</class>
    <!-- The port for the remote file transfers -->
    <port>8300</port>
    <!-- The keystore -->
    <certificateKeyStore>path/to/keystore</certificateKeyStore>
    <!-- The keystore password -->
    <certificateKeyStorePassword>testpass</certificateKeyStorePassword>
    <!-- The key password -->
    <certificatePassword>testpass2</certificatePassword>
  </remoteFile>
</common>

The keystore file must contain a certificate with the given passwords.

It can be generated with the keytool application distributed with Java 5.

Run the following command:

keytool -alias NetarchiveSuite -keystore keystore -genkey

It should then respond with the following:

Enter keystore password:  

Enter the password for the keystore.

The keytool will now prompt you for the following information:

What is your first and last name?
  [Unknown]:  
What is the name of your organizational unit?
  [Unknown]:  
What is the name of your organization?
  [Unknown]:  
What is the name of your City or Locality?
  [Unknown]:  
What is the name of your State or Province?
  [Unknown]:  
What is the two-letter country code for this unit?
  [Unknown]:  
Is CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown correct?
  [no]:  

Answer all the questions, and finish by answering "yes" to the last one.

Finally you will be asked for the certificate password.

Enter key password for <NetarchiveSuite>
        (RETURN if same as keystore password): 

Answer with a password for the certificate.

You now have a file called keystore which contains a certificate.

To keep your environment secure, make sure that the keystore and the settings file are readable only by the user running the application.
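One way to apply and check such a restriction on a POSIX system is sketched below. This is only an illustration: the helper names are hypothetical, the demonstration runs on a temporary stand-in file rather than a real keystore, and it assumes the file is owned by the user that runs the application.

```python
import os
import stat
import tempfile

def restrict_to_owner(path):
    """Allow only the owning user to read and write the file (mode 0600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)

def is_private(path):
    """True if neither group nor others have any permission on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0

# Demonstration on a temporary stand-in for the keystore file.
fd, demo_keystore = tempfile.mkstemp()
os.close(fd)
os.chmod(demo_keystore, 0o644)   # readable by everyone, as after a careless copy
before = is_private(demo_keystore)
restrict_to_owner(demo_keystore)
after = is_private(demo_keystore)
print(before, after)
os.remove(demo_keystore)
```

The same call would be applied to both the keystore and the settings file, since the settings file contains the keystore passwords in clear text.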

Settings split

The settings for the monitor module have been moved to a separate settings file. It is expected that we will do further restructuring of our settings to make them more extensible and modular in a later release.

Better pluggability of JMS implementation

If you wish to use different JMS queue software, the class handling all JMS communication is now pluggable and can be specified in the settings.

Harvester Module

Two-part TLDs possible

It is now possible to use e.g. co.uk as a top-level domain. See the upgrade section for an example.

Heritrix integration

Heritrix is now run as a separate process and controlled through JMX. Among other things, this makes it possible to use the Heritrix harvest interface while the harvest is running. It also opens up new ways to control Heritrix during a harvest in later releases.

Manual override of scheduling

For selective harvests it is now possible to override the time when it should run next. This will override the scheduling with a date chosen by the operator.

Sorting and grouping seed lists

The list of jobs can now be grouped by job status and sorted in ascending or descending order.

Checking of harvest templates

When a new harvest template is added, it is now checked that the parts of an order.xml template that the NetarchiveSuite software expects are actually present.

Archive Module

Index Client Performance

The performance for requesting the same index multiple times should now be much better.

GetFile command line tool

A new command line tool has been added. Try java dk.netarkivet.archive.tools.GetFile

Access Module

Work on making http auth protected material available

The access module now makes a best guess at retrieving the wanted record when both a 401 and a 200 record have been harvested during an HTTP authentication negotiation.

Monitor Module

Dynamic reload of settings for deployed applications

With the new settings file for the monitor module, you will be able to update which applications are to be monitored. Simply change the settings file, and the file should be reloaded on the next load of the monitor application.

Bugs fixed since NetarchiveSuite 3.4.0

Harvester Module

1078  DeDuplikator index too large

Viewerproxy Module

1079  snap shot harvest not browsable due to large index

Bugs fixed since NetarchiveSuite 3.2.*

Common Module

1016   ExtractCDX tool does not handle arc-files with large records
1023   [javadoc] Tag @see: can't find getInstance(File) in dk.netarkivet.common.distribute.RemoteFile
1034   English NetarchiveSuite thumbnail redirects to Danish site
1057   HTTPRemoteFile breaks ?
1061   Remove obsolete settings (monitor.htmlDir, viewerproxy.hostname)
1070   SUNMQ class must be fixed
1083   The NetarchiveSuite releases contain test code in lib/dk.netarkivet.*.jar
1147   AbstractRemoteFile.copyTo(File destfile) throws NullPointerException if destfile has no parent

Harvester Module

628    need a way to reset the nextdate of a HD
680    Cannot browse harvested password protected material
789    illegal regexp crashes entire job
861    No logmessage in HarvestDefinitionGUI, when submitting crawljob
897    sorting of seed-lists on job-page
915    TLDs having two parts i.e co.uk are disallowed
926    error writing crawl.log to metadata.arc
937    rescheduling of jobs is very slow and blocks normal scheduling
939    Webpages should handle the case, where no schedules or harvestdefinitions exist already properly
970    DB error on registering jobs with too many upload-errors
971    Seeds, passwords and configurations sorted without regard to locale
973    Required parameter urlListList not given
984    Missing headlines in "Selective Harvests" window
993    missing trim of domain strings or a normal error message
1010   long string in ordertemplate blows up formating of Job Details page
1033   Sensitive "Find Domain(s)"
1038   New title for 'Harvest status' - 'All Jobs per domain'
1039   SideKick sets it's application name to "dk.netarkivet.SideKick"
1040   sortNamedObjectList and Named should be moved to common.utils
1049   Danish translation of resubmitted is wrong
1051   Error on Definitions-create-domain.jsp when trying to create an invalid domain
1052   wrong description of setting in installation manual
1053   wrong parameter in configuration-link on job details page
1065   Show harvest run count
1071   Checks of Xpaths in Heritrix templates
1074   more information from harvesters (JobID and Priority)
1075   harvester with no diskspace keeps starting new jobs
1088   Time synchronization issue?
1096   Wrong timeout listed in log
1159   long string in seedlist blows up formating of Job Details page
1179   Loading job details takes too long

Archive Module

1022    [javadoc] Parameter "data" is documented more than once in RemoveAndGetFileMessage
1024    An error message is always shown on the bitarchive checksum page
1036    missing translation
1046    Bitpreservation-filestatus-checksum.jsp: Missing whitespace between filename and Info-link
1059    Mention of locations as institutions in comments, and variable names
1062    indexserver skips a lot of lines due to threading problem with SimpeDateFormat
1077    more logging when indexing large number of jobs
1080    failed upload does not clean up FTPRemoteFile
1099    Only batch jobs being started is visible in the system overview

Viewerproxy Module

1030    new viewerproxy command URL gives strange behavior in some browsers

Monitor Module

936    no way to add a new bitarchive machine to the JMX-overview without restarting the GUI

Deploy module (unsupported)

1081   deploy uses largeIndexRequestTimeout for both LOW and HIGH priority harvester instances
1086   Use unzip in quiet mode during deploy

Upgrade instructions

Upgrading from 3.4.1

The 3.4.2 release should be interchangeable with the 3.4.1 release.

It is possible to upgrade by only replacing the jar file lib/dk.netarkivet.common.jar on all harvester machines, while keeping all other applications running.

Upgrading from 3.4.0

The 3.4.2 release should be interchangeable with the 3.4.0 release.

Upgrading from 3.2.*

Please note that we only support upgrading from the previous stable release. Upgrading across several stable releases is not supported. You will need to upgrade step-by-step through all stable releases (for instance 3.0 -> 3.2 -> 3.4).

Monitor settings

To upgrade from a 3.2.* version of NetarchiveSuite, you will need to update the settings files for the application running the monitor GUI, usually HarvestDefinitionApplication.

You need to move the section that looks like

<deploy>
    <jmxMonitorRolePassword>JMX_MONITOR_ROLE_PASSWORD_PLACEHOLDER</jmxMonitorRolePassword>
    <numberOfHosts>3</numberOfHosts>
    <host1>
        <name>hostname1.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
        <jmxport>8102</jmxport>
    </host1>
    <host2>
        <name>hostname2.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
    </host2>
    <host3>
        <name>hostname3.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
        <jmxport>8102</jmxport>
        <jmxport>8103</jmxport>
    </host3>
</deploy>

to the file called monitor_settings.xml.

If you use -Ddk.netarkivet.settings.file=path/to/your/settings.xml you should now also use -Ddk.netarkivet.monitorsettings.file=path/to/your/monitor_settings.xml.

Setting top-level domains

The old setting settings.harvester.datamodel.domain.validDomainRegex has been removed. Instead a new repeatable setting settings.harvester.datamodel.domain.tld is introduced, which declares the valid top-level-domain-parts for domains in the system. This setting also defines how domains are split. Thus, if 'co.uk' is a valid top-level domain, a legal domain would be "bbc.co.uk", and the host "news.bbc.co.uk" would be considered a host in that domain.

Example: Say your old regex looked like this:

<validDomainRegex>^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+|[^\0000-,.-/:-@\[-`{-\0177]+\.(dk|com|net|uk))$</validDomainRegex>

You could replace it with the following for the exact same meaning:

<tld>dk</tld>
<tld>com</tld>
<tld>net</tld>
<tld>uk</tld>

Or you could introduce a better handling of .uk domains as follows:

<tld>dk</tld>
<tld>com</tld>
<tld>net</tld>
<tld>co.uk</tld>
<tld>gov.uk</tld>
<tld>ltd.uk</tld>
<tld>me.uk</tld>
<tld>mod.uk</tld>
<tld>net.uk</tld>
<tld>nic.uk</tld>
<tld>nhs.uk</tld>
<tld>org.uk</tld>
<tld>plc.uk</tld>
<tld>police.uk</tld>
<tld>sch.uk</tld>
<tld>uk</tld>

Note that uk is added at the end. This ensures that domains not belonging to any of the two-part TLDs are still caught (such as the British Library, bl.uk).
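The splitting described above can be illustrated with a small sketch. It is not the NetarchiveSuite implementation: the function name is hypothetical, and it assumes that the longest configured TLD matching the end of a host name wins, regardless of the order in which the tld settings are listed.

```python
def domain_of(host, tlds):
    """Return the domain that `host` belongs to, or None if no TLD matches.

    Hypothetical helper: finds the longest configured TLD that is a suffix
    of the host name and keeps exactly one label in front of it.
    """
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "co.uk" is tried before "uk".
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in tlds:
            return ".".join(labels[i - 1:])
    return None

# A subset of the tld values from the example above.
TLDS = {"dk", "com", "net", "co.uk", "gov.uk", "uk"}

print(domain_of("news.bbc.co.uk", TLDS))  # bbc.co.uk
print(domain_of("www.bl.uk", TLDS))       # bl.uk (caught by the plain uk entry)
print(domain_of("example.org", TLDS))     # None, org is not configured
```

Without the plain uk entry, the second call would return None, which is why uk is kept in the list alongside the two-part TLDs.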

New settings for controlling Heritrix

You can control the time we wait for the Heritrix external process to start. That time is set with the setting settings.common.processTimeout. Also, new settings set the ports for communication with Heritrix using JMX in the external process.

Your new settings should be as follows:

<settings>
  ...
  <common>
    ...
    <!--The number of milliseconds we wait for processes to react
        to shutdown requests.-->
    <processTimeout>5000</processTimeout>
    ...
  </common>
  ...
  <harvesters>
    <harvesting>
      ...
      <heritrix>
        <!-- The timeout setting for aborting a crawl based on
             crawler-inactivity. If the crawler is inactive for this
             amount of seconds the crawl will be aborted.
             The inactivity is measured on the
             crawlController.activeToeCount(). -->
        <inactivityTimeout>100</inactivityTimeout>
        <!-- The timeout value (in seconds) used in HeritrixLauncher
             for aborting crawl when no bytes are being received from
             web servers. -->
        <noresponseTimeout>100</noresponseTimeout>
        <!-- Name for accessing the Heritrix GUI -->
        <adminName>admin</adminName>
        <!-- Password for accessing the Heritrix GUI -->
        <adminPassword>adminPassword</adminPassword>
        <!-- Port used to access the Heritrix web user interface.
             This port must not be used by anything else on the machine.
             -->
        <guiPort>8090</guiPort>
        <!-- Port used to access the Heritrix JMX interface.
             This port must not be used by anything else on the machine,
             but does not need to be accessible from other machines
             unless you want to be able to use jconsole to access
             Heritrix directly
             -->
        <jmxPort>8091</jmxPort>
        <!-- The heap size to use for the Heritrix sub-process.  This
             should probably be fairly large.  It can be specified in
             the same way as for the -Xmx argument to Java, e.g.
             512M, 2G etc.-->
        <heapSize>1598M</heapSize>
      </heritrix>
    </harvesting>
  </harvesters>
</settings>

New setting for when harvesters should no longer listen for new jobs

The harvester will now stop listening for jobs if the required amount of disk space is not available. This value needs to be set. The value defined below equals 400 Mbytes.

<settings>
  ...
  <harvesters>
    <harvesting>
      ...
      <!--  The minimum amount of free bytes in the serverDir
            required before accepting any harvest-jobs. Default is 
            400000000 bytes (~400 Mbytes).
      -->
      <minSpaceLeft>400000000</minSpaceLeft>
    </harvesting>
  </harvesters>
  ...
</settings>
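The check implied by this setting, and the size of the example value, can be sketched as follows. The function names are hypothetical, and the exact comparison used by the harvester (for instance whether the limit itself counts as enough) is an assumption.

```python
import shutil

MIN_SPACE_LEFT = 400_000_000  # bytes; the example value from the settings above

def may_accept_jobs(free_bytes, min_space_left=MIN_SPACE_LEFT):
    """True if enough free space remains to accept new harvest jobs."""
    return free_bytes >= min_space_left

def free_bytes_in(directory):
    """Free bytes on the file system that holds `directory`."""
    return shutil.disk_usage(directory).free

print(MIN_SPACE_LEFT / 10**6)            # 400.0 decimal megabytes
print(round(MIN_SPACE_LEFT / 2**20, 1))  # 381.5 binary mebibytes
print(may_accept_jobs(free_bytes_in(".")))
```

Note that the value is given in bytes, so 400000000 corresponds to 400 Mbytes only when a megabyte is counted as one million bytes.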

Full class name needed for MQ implementation

Previously the setting settings.common.jms.class was just the prefix of a class name, e.g. SunMQ. Now the full class name is needed.

Example:

If your previous settings.xml file contained:

<settings>
  <common>
    ...
    <jms>
        <!-- Selects the broker vendor to be used, e.g. ActiveMQ. -->
        <class>SunMQ</class>
        ...
    </jms>
    ...
  </common>
  ...
</settings>

it should now become

<settings>
  <common>
    ...
    <jms>
        <!-- Selects the broker class to be used. Must be a subclass of
        dk.netarkivet.common.distribute.JMSConnection. -->
        <class>dk.netarkivet.common.distribute.JMSConnectionSunMQ</class>
        ...
    </jms>
    ...
  </common>
  ...
</settings>

You must clear all job-caches

On all harvester machines, on the index server, and on the viewer proxy, please delete all contents in the cache directories.

The format of generated caches has changed.

With default settings, this is the directory 'cache' on the viewerproxies, the index servers, and the harvester machines.
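Emptying a cache directory while keeping the directory itself could look like the sketch below. It is an illustration only: the helper name is hypothetical, the demonstration works on a throwaway temporary directory rather than a real cache, and it assumes that the applications using the cache are not running while the contents are deleted.

```python
import shutil
import tempfile
from pathlib import Path

def clear_directory(cache_dir):
    """Delete everything inside cache_dir but keep the directory itself.

    Returns the number of top-level entries that were removed.
    """
    removed = 0
    for entry in Path(cache_dir).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # remove a whole subdirectory tree
        else:
            entry.unlink()         # remove a plain file or a symlink
        removed += 1
    return removed

# Demonstration on a throwaway directory standing in for 'cache'.
demo = Path(tempfile.mkdtemp())
(demo / "index-1").mkdir()
(demo / "index-1" / "segments").write_text("data")
(demo / "stale.txt").write_text("data")
removed = clear_directory(demo)
print(removed, list(demo.iterdir()))  # 2 []
```

On a real installation the argument would be the cache directory of each viewerproxy, index server, and harvester machine.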

Version History

Version 3.4.2

2008-04-14

Bug fix release, fixing JMX timeout on busy harvester machines

Version 3.4.1

2008-01-16

Bug fix release, fixing out of memory on very large indexes

Version 3.4.0

2008-01-03

Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes

Version 3.3.*

Development versions aiming for 3.4.0

Version 3.2.3

2007-09-27

Bugfix of 3.2.2 with a patched deduplicator that fixes a problem in parallel indexing

Version 3.2.2

2007-08-03

Bugfix of 3.2.1 with a patched Heritrix 1.12.1 that supports ARCRecords larger than 2 GB

Version 3.2.1

2007-07-04

Bugfix of 3.2.0 fixing trouble using the quick start manual.

Version 3.2.0

2007-07-04

Open source release

Version 3.1.*

Development versions. Version 3.1.7 was kindly reviewed by the Internet Archive and the Norwegian National Library.

Version 3.0.0

2007-02-02

Marked the naming of the NetarchiveSuite, the splitting of NetarchiveSuite into independent modules, and the licensing of NetarchiveSuite under LGPL

Version 2.*

Various features and updates

Version 2.0

2006-08-30

Marked a general restructuring of the code, in which harvest definition data became backed by a database and the viewerproxy was trimmed and rewritten.

Version 1.*

Various features and updates

Version 1.0

2005-07-01

The first version of the netarchive software put into production for harvesting the entire Danish web

Version 0.*

Various pre-production development versions

ReleaseNotes3_4_2 (last edited 2010-08-16 10:24:52 by localhost)