Release Notes for NetarchiveSuite 3.3.2

This version of NetarchiveSuite was released on 2007-11-08.

Note: This is a development release; it has not been tested for production quality.

New features since NetarchiveSuite 3.2.*

General

We now have a publicly available SVN repository. See https://gforge.statsbiblioteket.dk/scm/?group_id=7.

We also have a public tracking system available at the GForge site. Here you can browse known bugs and feature requests, as well as report new bugs or request new features. You can also submit patches for the NetarchiveSuite code. See https://gforge.statsbiblioteket.dk/tracker/index.php?group_id=7

We also have three mailing lists available. The netarchivesuite-announce list is a low-traffic list for announcements of new releases. The netarchivesuite-users list is for general discussion. The developers are active on this list, and feedback and comments are very welcome. Finally, there is a mailing list where all commits to our repository are automatically reported. This third list is somewhat high volume and of interest to developers only. See https://gforge.statsbiblioteket.dk/mail/?group_id=7

Common Module

New secure remotefile implementation

A new remote file implementation using HTTPS has been added. To use it, set your remote file settings in settings.xml as follows:

<common>
  ...
  <remoteFile xsi:type="httpsremotefile">
    <!-- The class to use for RemoteFile objects. -->
    <class>dk.netarkivet.common.distribute.HTTPSRemoteFile</class>
    <!-- The port for the remote file transfers -->
    <port>8300</port>
    <!-- The keystore -->
    <certificateKeyStore>path/to/keystore</certificateKeyStore>
    <!-- The keystore password -->
    <certificateKeyStorePassword>testpass</certificateKeyStorePassword>
    <!-- The key password -->
    <certificatePassword>testpass2</certificatePassword>
  </remoteFile>
</common>

The keystore file must contain a certificate with the given passwords.

It can be generated with the keytool application distributed with Java 5.

Run the following command:

keytool -alias NetarchiveSuite -keystore keystore -genkey

It should then respond with the following:

Enter keystore password:  

Enter the password for the keystore.

The keytool will now prompt you for the following information:

What is your first and last name?
  [Unknown]:  
What is the name of your organizational unit?
  [Unknown]:  
What is the name of your organization?
  [Unknown]:  
What is the name of your City or Locality?
  [Unknown]:  
What is the name of your State or Province?
  [Unknown]:  
What is the two-letter country code for this unit?
  [Unknown]:  
Is CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown correct?
  [no]:  

Answer all the questions, and finish by answering "yes" to the last one.

Finally, you will be asked for the certificate password:

Enter key password for <NetarchiveSuite>
        (RETURN if same as keystore password): 

Answer with a password for the certificate.

You now have a file called keystore, which contains a certificate.

To keep your environment secure, make sure that the keystore and the settings file are not readable by anyone but the application.
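As a quick way to check the advice above on a POSIX system, the following Python sketch reports whether a file is accessible to anyone other than its owner, and restricts it if needed. The function names are hypothetical and not part of NetarchiveSuite, and the sketch assumes that the application runs as the user owning the files.

```python
import os
import stat


def is_private(path):
    """Return True if only the owning user has any permission on path.

    Hypothetical helper, not part of NetarchiveSuite. POSIX only.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # 0o077 masks the permission bits for group and others.
    return mode & 0o077 == 0


def make_private(path):
    """Restrict path to read/write for the owning user only (mode 600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

Calling make_private on both the keystore and settings.xml, and then confirming with is_private, would be one way to apply this before starting the applications.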

Settings split

The settings for the monitor module have been moved to a separate settings file. It is expected that we will do further restructuring of our settings to make them more extensible and modular in a later release.

Better pluggability of JMS implementation

If you wish to use different JMS queue software, the class handling all JMS communication is now pluggable and can be specified in the settings.

Harvester Module

Two-part TLDs possible

It is now possible to use two-part names such as co.uk as top-level domains. See the upgrade instructions for an example.

Heritrix integration

Heritrix is now run as a separate process and controlled by JMX. Among other things, this makes it possible to use the Heritrix harvest interface while the harvest is running. It also opens new possibilities for how we can control Heritrix during a harvest in later releases.

Manual override of scheduling

For selective harvests it is now possible to override when a harvest should next run. The date chosen by the operator takes precedence over the scheduled date.

Monitor Module

Dynamic reload of settings for deployed applications

With the new settings file for the monitor module, you will be able to update which applications are to be monitored. Simply change the settings file, and the file should be reloaded on the next load of the monitor application.

Bugs fixed since NetarchiveSuite 3.2.*

Common Module

1016   ExtractCDX tool does not handle arc-files with large records
1023   [javadoc] Tag @see: can't find getInstance(File) in dk.netarkivet.common.distribute.RemoteFile
1034   English NetarchiveSuite thumbnail redirects to Danish site
1057   HTTPRemoteFile breaks ?
1061   Remove obsolete settings (monitor.htmlDir, viewerproxy.hostname)
1070   SUNMQ class must be fixed

Harvester Module

628    need a way to reset the nextdate of a HD
789    illegal regexp crashes entire job
861    No logmessage in HarvestDefinitionGUI, when submitting crawljob
915    TLDs having two parts i.e co.uk are disallowed
926    error writing crawl.log to metadata.arc
937    rescheduling of jobs is very slow and blocks normal scheduling
939    Webpages should handle the case, where no schedules or harvestdefinitions exist already properly
970    DB error on registering jobs with too many upload-errors
971    Seeds, passwords and configurations sorted without regard to locale
984    Missing headlines in "Selective Harvests" window
993    missing trim of domain strings or a normal error message
1010   long string in ordertemplate blows up formating of Job Details page
1033   Sensitive "Find Domain(s)"
1038   New title for 'Harvest status' - 'All Jobs per domain'
1039   SideKick sets it's application name to "dk.netarkivet.SideKick"
1040   sortNamedObjectList and Named should be moved to common.utils
1049   Danish translation of resubmitted is wrong
1051   Error on Definitions-create-domain.jsp when trying to create an invalid domain
1052   wrong description of setting in installation manual
1053   wrong parameter in configuration-link on job details page
1065   Show harvest run count
1074   more information from harvesters (JobID and Priority)

Archive Module

1022    [javadoc] Parameter "data" is documented more than once in RemoveAndGetFileMessage
1024    An error message is always shown on the bitarchive checksum page
1036    missing translation
1046    Bitpreservation-filestatus-checksum.jsp: Missing whitespace between filename and Info-link
1059    Mention of locations as institutions in comments, and variable names
1062    indexserver skips a lot of lines due to threading problem with SimpeDateFormat
1077    more logging when indexing large number of jobs
1080    failed upload does not clean up FTPRemoteFile

Viewerproxy Module

1030    new viewerproxy command URL gives strange behavior in some browsers

Monitor Module

936    no way to add a new bitarchive machine to the JMX-overview without restarting the GUI

Deploy module (unsupported)

1081   deploy uses largeIndexRequestTimeout for both LOW and HIGH priority harvester instances

Upgrade instructions

Please note that we only support upgrading from the previous stable release. Upgrading across several stable releases is not supported. You will need to upgrade step-by-step through all stable releases (for instance 3.0 -> 3.2 -> 3.4).

Monitor settings

To upgrade from a 3.2.* version of NetarchiveSuite, you will need to update your settings files on the application running the monitor GUI, usually HarvestDefinitionApplication.

You need to move the section that looks like

<deploy>
    <jmxMonitorRolePassword>JMX_MONITOR_ROLE_PASSWORD_PLACEHOLDER</jmxMonitorRolePassword>
    <numberOfHosts>3</numberOfHosts>
    <host1>
        <name>hostname1.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
        <jmxport>8102</jmxport>
    </host1>
    <host2>
        <name>hostname2.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
    </host2>
    <host3>
        <name>hostname3.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
        <jmxport>8102</jmxport>
        <jmxport>8103</jmxport>
    </host3>
</deploy>

to the file called monitor_settings.xml.

If you use -Ddk.netarkivet.settings.file=path/to/your/settings.xml you should now also use -Ddk.netarkivet.monitorsettings.file=path/to/your/monitor_settings.xml.

Setting top-level domains

The old setting settings.harvester.datamodel.domain.validDomainRegex has been removed. Instead, a new repeatable setting settings.harvester.datamodel.domain.tld has been introduced, which declares the valid top-level domain parts for domains in the system. This setting also defines how domains are split. Thus, if "co.uk" is a valid top-level domain, a legal domain would be "bbc.co.uk", and the host "news.bbc.co.uk" would be considered a host in that domain.

Example: Say your old regex looked like this:

<validDomainRegex>^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+|[^\0000-,.-/:-@\[-`{-\0177]+\.(dk|com|net|uk))$</validDomainRegex>

You could replace it with the following for the exact same meaning:

<tld>dk</tld>
<tld>com</tld>
<tld>net</tld>
<tld>uk</tld>

Or you could introduce a better handling of .uk domains as follows:

<tld>dk</tld>
<tld>com</tld>
<tld>net</tld>
<tld>co.uk</tld>
<tld>gov.uk</tld>
<tld>ltd.uk</tld>
<tld>me.uk</tld>
<tld>mod.uk</tld>
<tld>net.uk</tld>
<tld>nic.uk</tld>
<tld>nhs.uk</tld>
<tld>org.uk</tld>
<tld>plc.uk</tld>
<tld>police.uk</tld>
<tld>sch.uk</tld>
<tld>uk</tld>

Note that uk is added at the end. This lets domains that do not belong to any of the two-part TLDs still be caught (such as that of the British Library).
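The effect of the tld list on domain splitting can be illustrated with a small Python sketch. This is not the NetarchiveSuite implementation: the function name is hypothetical, and the rule that the longest matching TLD wins is an assumption made for the illustration.

```python
def domain_of(host, tlds):
    """Return the domain that host belongs to, or None if no TLD matches.

    Hypothetical illustration, not NetarchiveSuite code. Assumes that the
    longest configured TLD matching the end of the host name wins.
    """
    labels = host.lower().split(".")
    best = None  # number of labels in the longest matching TLD so far
    for tld in tlds:
        tld_labels = tld.lower().split(".")
        n = len(tld_labels)
        # A domain needs at least one label in front of the TLD.
        if len(labels) > n and labels[-n:] == tld_labels:
            if best is None or n > best:
                best = n
    if best is None:
        return None
    # The domain is the TLD plus the single label in front of it.
    return ".".join(labels[-(best + 1):])


tlds = ["dk", "com", "net", "co.uk", "uk"]
print(domain_of("news.bbc.co.uk", tlds))  # bbc.co.uk
print(domain_of("www.bl.uk", tlds))       # bl.uk
```

With only uk configured, the same sketch would place "news.bbc.co.uk" in the domain "co.uk", which is why listing the two-part TLDs gives better handling of .uk domains.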

New settings for controlling Heritrix

You can control the time we wait for the Heritrix external process to start. That time is set with the setting settings.common.processTimeout. Also, new settings set the ports for communication with Heritrix using JMX in the external process.

Your new settings should be as follows:

<settings>
  ...
  <common>
    ...
    <!--The number of milliseconds we wait for processes to react
        to shutdown requests.-->
    <processTimeout>5000</processTimeout>
    ...
  </common>
  ...
  <harvesters>
    <harvesting>
      ...
      <!-- Name for accessing the Heritrix GUI -->
      <adminName>admin</adminName>
      <!-- Password for accessing the Heritrix GUI -->
      <adminPassword>adminPassword</adminPassword>
      <!-- Port used to access the Heritrix web user interface.
           This port must not be used by anything else on the machine.
           -->
      <guiPort>8090</guiPort>
      <!-- Port used to access the Heritrix JMX interface.
           This port must not be used by anything else on the machine,
           but does not need to be accessible from other machines
           unless you want to be able to use jconsole to access
           Heritrix directly
           -->
      <jmxPort>8091</jmxPort>
      <!-- The heap size to use for the Heritrix sub-process.  This
           should probably be fairly large.  It can be specified in
           the same way as for the -Xmx argument to Java, e.g.
           512M, 2G etc.-->
      <heapSize>1598M</heapSize>
    </harvesting>
  </harvesters>
</settings>

Full class name needed for MQ implementation

Previously the setting settings.common.jms.class was just the prefix of a class name, e.g. SunMQ. Now the full class name is needed.

Example:

If your previous settings.xml file contained:

<settings>
  <common>
    ...
    <jms>
        <!-- Selects the broker vendor to be used. Fx. ActiveMQ. -->
        <class>SunMQ</class>
        ...
    </jms>
    ...
  </common>

it should now become

<settings>
  <common>
    ...
    <jms>
        <!-- Selects the broker class to be used. Must be a subclass of
        dk.netarkivet.common.distribute.JMSConnection. -->
        <class>dk.netarkivet.common.distribute.JMSConnectionSunMQ</class>
        ...
    </jms>
    ...
  </common>
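Before upgrading, it may help to check whether a settings file still uses the old short form. The following Python sketch does that under some assumptions: the function name is hypothetical, settings is taken to be the root element, the file is assumed to declare no default XML namespace, and a class value without a package (no dot) is treated as the old prefix form.

```python
import xml.etree.ElementTree as ET


def jms_class_needs_update(settings_xml):
    """Return True if settings.common.jms.class looks like an old prefix.

    Hypothetical helper, not part of NetarchiveSuite. Takes the content
    of a settings file as a string.
    """
    root = ET.fromstring(settings_xml)
    # Assumes <settings> is the root element and no default namespace.
    node = root.find("./common/jms/class")
    if node is None or not (node.text or "").strip():
        return False  # setting absent or empty: nothing to check
    # A full class name contains a package, and therefore a dot.
    return "." not in node.text.strip()
```

For a real file, the content can be read with open(path).read() and passed to the function. A result of True means the class value should be replaced with the full class name as shown above.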

Version History

Current development versions

Version 3.3.2

2007-11-08

Separation of Heritrix, more bugfix work

Version 3.3.1

2007-09-24

Mostly bugfix work, including possibility to use two-part TLDs like co.uk

Version 3.3.0

2007-08-06

Mostly bugfix work, including upgradability of the monitored applications, and faster resubmitting of jobs

Stable versions

Version 3.2.3

2007-09-27

Bugfix of 3.2.2 with a patched deduplicator that fixes a problem in parallel indexing

Version 3.2.2

2007-08-03

Bugfix of 3.2.1 with a patched Heritrix 1.12.1 that supports ARCRecords larger than 2 GB

Version 3.2.1

2007-07-04

Bugfix of 3.2.0 fixing trouble using the quick start manual.

Version 3.2.0

2007-07-04

Open source release

Version 3.1.*

Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library.

Version 3.0.0

2007-02-02

Marked the naming of the NetarchiveSuite, the splitting of NetarchiveSuite into independent modules, and the licensing of NetarchiveSuite under LGPL

Version 2.*

Various features and updates

Version 2.0

2006-08-30

Marked a general restructuring of the code, in which harvest definition data became backed by a database and the viewerproxy was trimmed and rewritten.

Version 1.*

Various features and updates

Version 1.0

2005-07-01

The first version of the netarchive software put into production for harvesting the entire Danish web

Version 0.*

Various pre-production development versions

ReleaseNotes3_3_2 (last edited 2010-08-16 10:24:41 by localhost)