Release Notes for NetarchiveSuite 3.3.1

This version of the NetarchiveSuite was released on 2007-09-24.

Note: This is a development release; it has not been tested to production quality.

New features since NetarchiveSuite 3.2.*

General

We now have a publicly available SVN repository. See https://gforge.statsbiblioteket.dk/scm/?group_id=7.

Please note that anonymous access is not available, although the page claims it is. You will need to create a gforge account and follow the instructions under Developer Subversion Access.

Common Module

Settings split

The settings for the monitor module have been moved to a separate settings file. It is expected that we will do further restructuring of our settings to make them more extensible and modular in a later release.

Harvester Module

Two-part TLDs possible

It is now possible to use two-part names such as co.uk as top-level domains. See the upgrade instructions for an example.

Heritrix integration

Preliminary work has been done on communicating with the harvesters through JMX and on getting the Heritrix UI up and running.

Monitor Module

Dynamic reload of settings for deployed applications

With the new settings file for the monitor module, you will be able to update which applications are to be monitored. Simply change the settings file, and the file should be reloaded on the next load of the monitor application.

Bugs fixed since NetarchiveSuite 3.2.*

Common Module

1016   ExtractCDX tool does not handle arc-files with large records
1023   [javadoc] Tag @see: can't find getInstance(File) in dk.netarkivet.common.distribute.RemoteFile
1034   English NetarchiveSuite thumbnail redirects to Danish site
1057   HTTPRemoteFile breaks ?

Harvester Module

628    need a way to reset the nextdate of a HD
789    illegal regexp crashes entire job
861    No logmessage in HarvestDefinitionGUI, when submitting crawljob
915    TLDs having two parts i.e co.uk are disallowed
926    error writing crawl.log to metadata.arc
937    rescheduling of jobs is very slow and blocks normal scheduling
939    Webpages should handle the case, where no schedules or harvestdefinitions exist already properly
970    DB error on registering jobs with too many upload-errors
971    Seeds, passwords and configurations sorted without regard to locale
984    Missing headlines in "Selective Harvests" window
993    missing trim of domain strings or a normal error message
1033   Sensitive "Find Domain(s)"
1038   New title for 'Harvest status' - 'All Jobs per domain'
1039   SideKick sets it's application name to "dk.netarkivet.SideKick"
1040   sortNamedObjectList and Named should be moved to common.utils
1049   Danish translation of resubmitted is wrong
1051   Error on Definitions-create-domain.jsp when trying to create an invalid domain
1053   wrong parameter in configuration-link on job details page

Archive Module

1022    [javadoc] Parameter "data" is documented more than once in RemoveAndGetFileMessage
1024    An error message is always shown on the bitarchive checksum page
1036    missing translation
1046    Bitpreservation-filestatus-checksum.jsp: Missing whitespace between filename and Info-link
1059    Mention of locations as institutions in comments, and variable names

Viewerproxy Module

1030    new viewerproxy command URL gives strange behavior in some browsers

Monitor Module

936    no way to add a new bitarchive machine to the JMX-overview without restarting the GUI

Upgrade instructions

Monitor settings

To upgrade from a previous version of NetarchiveSuite, you will need to update the settings files for the application running the monitor GUI, usually HarvestDefinitionApplication.

You need to move the section that looks like

<deploy>
    <jmxMonitorRolePassword>JMX_MONITOR_ROLE_PASSWORD_PLACEHOLDER</jmxMonitorRolePassword>
    <numberOfHosts>3</numberOfHosts>
    <host1>
        <name>hostname1.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
        <jmxport>8102</jmxport>
    </host1>
    <host2>
        <name>hostname2.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
    </host2>
    <host3>
        <name>hostname3.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
        <jmxport>8102</jmxport>
        <jmxport>8103</jmxport>
    </host3>
</deploy>

to the file called monitor_settings.xml.

If you use -Ddk.netarkivet.settings.file=path/to/your/settings.xml you should now also use -Ddk.netarkivet.monitorsettings.file=path/to/your/monitor_settings.xml.

Setting top-level domains

The old setting settings.harvester.datamodel.domain.validDomainRegex has been removed. Instead, a new repeatable setting settings.harvester.datamodel.domain.tld has been introduced, which declares the valid top-level domains for domains in the system. This setting also defines how domains are split. Thus, if "co.uk" is a valid top-level domain, a legal domain would be "bbc.co.uk", and the host "news.bbc.co.uk" would be considered a host in that domain.

Example: Say your old regex looked like this:

<validDomainRegex>^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+|[^\0000-,.-/:-@\[-`{-\0177]+\.(dk|com|net|uk))$</validDomainRegex>

You could replace it with the following for the exact same meaning:

<tld>dk</tld>
<tld>com</tld>
<tld>net</tld>
<tld>uk</tld>

Or you could introduce a better handling of .uk domains as follows:

<tld>dk</tld>
<tld>com</tld>
<tld>net</tld>
<tld>co.uk</tld>
<tld>gov.uk</tld>
<tld>ltd.uk</tld>
<tld>me.uk</tld>
<tld>mod.uk</tld>
<tld>net.uk</tld>
<tld>nic.uk</tld>
<tld>nhs.uk</tld>
<tld>org.uk</tld>
<tld>plc.uk</tld>
<tld>police.uk</tld>
<tld>sch.uk</tld>
<tld>uk</tld>

Note that uk is added at the end. This lets domains that do not belong to any of the two-part TLDs still be matched (for example the British Library, bl.uk).
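
As a rough illustration of how the tld entries decide where a host name is split into a domain, here is a minimal Python sketch. It is not NetarchiveSuite code: the function name domain_of and the rule that entries are tried in the order listed are assumptions made for this example. The first-match rule is only one possible reading of why uk is placed last.

```python
def domain_of(host, tlds):
    """Return the domain part of host, or None if no configured TLD matches.

    Hypothetical helper: tlds is tried in the order given (an assumption),
    and the domain is the matching TLD plus the one label in front of it.
    """
    labels = host.lower().rstrip(".").split(".")
    for tld in tlds:
        tld_labels = tld.lower().split(".")
        n = len(tld_labels)
        # The host must end with the TLD and have at least one label before it.
        if len(labels) > n and labels[-n:] == tld_labels:
            return ".".join(labels[-(n + 1):])
    return None


# Shortened version of the list above, with uk last.
TLDS = ["dk", "com", "net", "co.uk", "gov.uk", "uk"]

if __name__ == "__main__":
    print(domain_of("news.bbc.co.uk", TLDS))  # bbc.co.uk
    print(domain_of("www.bl.uk", TLDS))       # bl.uk
```

Under the same first-match assumption, listing uk before co.uk would make the sketch return "co.uk" for news.bbc.co.uk, which is the outcome the ordering is meant to avoid.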

Version History

Current development versions

Version 3.3.1

2007-09-24

Mostly bugfix work, including possibility to use two-part TLDs like co.uk

Version 3.3.0

2007-08-06

Mostly bugfix work, including upgradability of the monitored applications, and faster resubmitting of jobs

Stable versions

Version 3.2.3

2007-09-27

Bugfix of 3.2.2 with a patched deduplicator that fixes a problem in parallel indexing

Version 3.2.2

2007-08-03

Bugfix of 3.2.1 with a patched Heritrix 1.12.1 that supports ARCRecords larger than 2 GB

Version 3.2.1

2007-07-04

Bugfix of 3.2.0 fixing trouble using the quick start manual.

Version 3.2.0

2007-07-04

Open source release

Version 3.1.*

Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library.

Version 3.0.0

2007-02-02

Marked the naming of the NetarchiveSuite, the splitting of NetarchiveSuite into independent modules, and the licensing of NetarchiveSuite under LGPL

Version 2.*

Various features and updates

Version 2.0

2006-08-30

Marked a general restructuring of the code, in which harvest definition data became backed by a database and the viewerproxy was trimmed and rewritten.

Version 1.*

Various features and updates

Version 1.0

2005-07-01

The first version of the netarchive software put in production for harvesting the entire Danish web

Version 0.*

Various pre-production development versions
