Release Notes for NetarchiveSuite 3.6.0

This version of NetarchiveSuite was released on 2008-07-03.

New features since NetarchiveSuite 3.4.*

General

Common Module

Improved stability with JMS connections

Previously, if an application lost its connection to the JMS server, the application had to be restarted. It will now attempt to reconnect in known recoverable scenarios.
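As an illustration only, reconnecting after a recoverable failure can be sketched as a bounded retry loop. This is not the NetarchiveSuite implementation: the class ReconnectSketch, the method connectWithRetry and its parameters are hypothetical names, and the simulated "broker unavailable" failure stands in for a lost JMS connection.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch: retry a connection attempt a bounded number of times. */
final class ReconnectSketch {

    /**
     * Calls connect until it succeeds or maxAttempts attempts have failed,
     * sleeping delayMillis between attempts. Rethrows the last failure.
     */
    static <T> T connectWithRetry(Callable<T> connect, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated connection that fails twice before succeeding.
        String result = connectWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("broker unavailable");
            }
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "connected after 3 attempts"
    }
}
```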

Security manager supported

NetarchiveSuite now comes with security.policy files that limit what code outside the NetarchiveSuite jar files and libraries is allowed to do. This increases security, especially for bitarchives.

The file is distributed as conf/security.policy. To use it, start Java with -Djava.security.manager -Djava.security.policy=conf/security.policy

See the new batch job possibilities in the Archive Module section for a use case.

Embedded web server upgraded to Jetty 6.1.6

The embedded web server has been upgraded. This should fix some rare situations where large form fields caused exceptions, and generally improve web server stability.

Code supporting MacOS/X

Although we do not test our software for compatibility with MacOS/X, we have received a patch that should enable NetarchiveSuite to run on Macs, along with a report that it runs fine with this patch. Many thanks to Lars Clausen and The National Library of Scotland for this patch.

Better log file naming in QuickStart scripts

QuickStart installations are a lot easier to debug now that the log files are properly named for each application. Many thanks to Lars Clausen and The National Library of Scotland for this patch.

Harvester Module

Switch to DecidingScope

NetarchiveSuite now controls Heritrix using DecidingScope, rather than the older, deprecated HostScope and DomainScope. This is expected to improve harvesting performance. For the end user, there is no visible difference.

HarvestDefinitionApplication replaced by GUIApplication

The application HarvestDefinitionApplication has been removed. Instead, use dk.netarkivet.common.GUIApplication. It will work exactly as the old application, assuming you have the HarvestDefinition site section deployed in settings, that is:

<settings>
...
  <common>
  ...
    <webinterface>
    ...
      <siteSection>
          <!-- A subclass of SiteSection that defines this part of the
               web interface. -->
          <class>dk.netarkivet.harvester.webinterface.DefinitionsSiteSection</class>
          <!-- The directory or war-file containing the web application
               for this site section.-->
          <webapplication>webpages/HarvestDefinition</webapplication>
          <!-- The URL path for this section of the web interface. -->
          <deployPath>/HarvestDefinition</deployPath>
      </siteSection>
...

This is the default, and unchanged since previous versions.

OutOfMemory fix

When running very large harvests, the index over the harvests caused out-of-memory errors, both during harvests, due to the index used for deduplication, and during access, due to the index used for browsing. This has been fixed in this release.

Support for setting a byte limit on event harvests

When adding seeds to a harvest definition using the "Add seeds" functionality, it is now possible to set a byte limit. Many thanks to Lars Clausen and The National Library of Scotland for this patch.

Support for byte limits over 2GB

The previous restriction on how large a byte limit you could set has been removed. Byte limits over 2GB are now supported.
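The release notes do not say what caused the old restriction. As an assumption for illustration only, the sketch below shows the usual reason a 2GB ceiling appears in Java code: a 32-bit int cannot hold values above 2147483647, while a 64-bit long can. The class ByteLimitSketch and its methods are hypothetical and not part of NetarchiveSuite.

```java
/** Hypothetical sketch: why a byte limit above 2 GB needs a 64-bit type. */
final class ByteLimitSketch {

    /** Parses a byte limit as a long, which covers values far above Integer.MAX_VALUE. */
    static long parseByteLimit(String text) {
        return Long.parseLong(text.trim());
    }

    /** Reports whether the value could be stored in a 32-bit int without loss. */
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long threeGb = parseByteLimit("3000000000");
        System.out.println(Integer.MAX_VALUE);   // prints 2147483647
        System.out.println(fitsInInt(threeGb));  // prints false
        System.out.println((int) threeGb);       // prints -1294967296 (wrapped around)
    }
}
```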

The default byte limit for new configurations is now a setting

Previously, when creating a new configuration, the default byte limit was hardcoded to 500MB. Now, it is a setting. Simply set

    <harvester>
        <datamodel>
            <domain>
                ...
                <!-- Default byte limit for domain configuration. -->
                <defaultMaxbytes>1000000000</defaultMaxbytes>
                ...
            </domain>
        </datamodel>
    </harvester>

Archive Module

Bit preservation restructuring

Bit preservation has undergone a major restructuring. This is partly preparation for further actions that will improve bit preservation, but it also has the immediate effect that reestablishing a missing file, or fixing a file with a bit error, requires fewer mouse clicks and generally performs faster. It should also be possible to restore more files in one go than was previously possible.

Support for submitting externally contributed batch jobs

The bit archive now supports launching a batch job written by an external source on the archive, without recompiling NetarchiveSuite.

This is a great tool for researchers wishing to do some analysis on the entire archive.

All you have to do is to subclass the abstract class dk.netarkivet.common.utils.arc.FileBatchJob and implement methods for initialisation, finishing and what to do on each file. The results must be written to an output stream. It will then be executed on all bitarchive machines. The results will be written to a file, or to the screen. To work on individual arc records, rather than entire files, subclass dk.netarkivet.common.utils.arc.ARCBatchJob instead.

The mechanism to do this is the command line tool dk.netarkivet.archive.tools.RunBatch. E.g.

java dk.netarkivet.archive.tools.RunBatch MyBatchJob.class

Optionally, you can run the job on a subset of all files, on a specific location, and save the output to a file. Try starting the command line client with no arguments for an example.

When using this option, it is important to run your bitarchive with a security manager and a restrictive policy (see above). Otherwise, external batch jobs might damage your bit archives.
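The subclassing described above can be sketched as follows. To keep the example self-contained, it does not use the real dk.netarkivet.common.utils.arc.FileBatchJob class: FileBatchJobSketch is a hypothetical stand-in, and its method names and signatures are assumptions based only on the description (initialisation, per-file processing, finishing, results written to an output stream). Check the Javadoc of the real class before writing an actual batch job.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Demo driver standing in for what a bitarchive would do with a submitted job. */
final class BatchJobDemo {

    /** Runs the job over the given files and returns everything it wrote. */
    static String run(FileBatchJobSketch job, File... files) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        job.initialize(out);
        for (File file : files) {
            job.processFile(file, out);
        }
        job.finish(out);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("batchdemo");
        Path a = Files.write(dir.resolve("a.arc"), new byte[10]);
        Path b = Files.write(dir.resolve("b.arc"), new byte[32]);
        System.out.print(run(new FileSizeJob(), a.toFile(), b.toFile()));
    }
}

/** Hypothetical stand-in for the abstract batch job class (signatures assumed). */
abstract class FileBatchJobSketch {
    abstract void initialize(OutputStream os) throws IOException;
    abstract boolean processFile(File file, OutputStream os) throws IOException;
    abstract void finish(OutputStream os) throws IOException;
}

/** Example job: lists each file with its size, then writes a total. */
final class FileSizeJob extends FileBatchJobSketch {
    private int files;
    private long bytes;

    @Override
    void initialize(OutputStream os) {
        files = 0;
        bytes = 0;
    }

    @Override
    boolean processFile(File file, OutputStream os) throws IOException {
        files++;
        bytes += file.length();
        String line = file.getName() + " " + file.length() + "\n";
        os.write(line.getBytes(StandardCharsets.UTF_8));
        return true;
    }

    @Override
    void finish(OutputStream os) throws IOException {
        String line = "total " + files + " files, " + bytes + " bytes\n";
        os.write(line.getBytes(StandardCharsets.UTF_8));
    }
}
```

Keeping all state in fields that are reset during initialisation, and writing results only to the supplied output stream, is a reasonable habit for jobs like this, since a restrictive security policy may deny other kinds of file or network access.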

ViewerProxy Module

Feedback when only a partial browse index is available

When using the ViewerProxy for QA, it is important to know exactly which set you are browsing. When you request a browse index, you are automatically given the largest subset of the index that is actually available. Previously, missing parts were silently ignored; now they are reported in the ViewerProxy status interface.

Monitor Module

Automatic registering of applications for monitoring

Applications now automatically register themselves for monitoring by the monitor GUI.

This has two effects:

Documentation

The documentation has been brought up to date, and some parts have been elaborated. In particular, the database documentation in the Developer Manual has been updated, and the scripts to generate a new harvest database now have elaborate documentation on the tables used by the harvest definition interface. These can be found in the distribution packages under scripts/sql.

Bugs fixed since NetarchiveSuite 3.4.*

Common Module

555  JMS connections cannot reconnect
788  problems with removing MessageListener
1063 Mention environment in webpages
1163 Upgrade to jetty 6.1.6
1175 Underscore in setting environmentName results in Internal Server error in Viewerproxy
1184 Notification is not sent, if backup fails
1185 ProcessUtils starts timer threads that cause random interrupts in our code
1194 Source for webpages is not packaged in source-zipball
1220 Wrong javadoc for method HTMLUtils.makeTableHeader() 
1235 Mac support

Harvester Module

924  jobs that completely fails reports IOException
1078 DeDuplikator index too large
1093 Domain configurations with large limits are wrongly handled, causing unharvested configurations and schedule time
1100 Distributed heritrix.properties file contains wrong version number
1161 Use of known null value in dk.netarkivet.harvester.webinterface.SelectiveHarvest.updateHarvestDefinition
1118 Harvest templates should have their URL and email address checked
1165 Switch to DecidingScope
1182 Logging of job status no longer contains headings
1206 JMXHeritrixController can fail to initialize without throwing an exception
1213 Missing space before listing of aliases of this domain
1232 Incorrect ArgumentNotValid check in constructor of dk.netarkivet.harvester.harvesting.distribute.DomainStats
1234 Allow max bytes setting for event harvests
1238 JMX connection to Heritrix fails immediately after Heritrix process being started
1242 TimeUnit should be public enumeration type
1245 Required arcsdir value not checked
1248 NPE in deduplicator-0.3.0-20061218b.jar
1250 Check on table version of tabel harvestdefinitions is missing
1255 The default bytelimit for configurations should be a setting

Archive Module

1079 snap shot harvest not browsable due to large index
1203 Reestablishing missing files is a lot more inefficient than it should be
1246 Bitpreservation GUI fails
1258 no "Shift infobox for 4 files" on the "Missing Files" page 

ViewerProxy Module

1253 It is impossible to see that only a partial index is available

Monitor Module

900  Liveness logger logs too often, every 2  minutes
1042 Automatic registering to monitor of running applications desirable
1050 monitor_settings.xml is not parsed properly
1190 missing letter in danish translation of error message

Documentation

1177 UserManual contains wrong information about domains
1273 QuickStart using Mysql as Database fails due to Java access permissions

Tools

1029 A tool in dk.netarkivet.archive.tools to retrieve a file from the archive would be nice

Scripts

1233 Make harvest.sh name logfiles properly

Upgrade instructions

Upgrading from 3.4.*

Please note that we only support upgrading from the previous stable release. Upgrading across several stable releases is not supported. You will need to upgrade step-by-step through all stable releases (for instance 3.0 -> 3.2 -> 3.4).

Upgrade your templates to use DecidingScope

NetarchiveSuite now requires that you use DecidingScope in all Heritrix templates. You will need to download all your templates, make the updates described in the Installation Manual, Appendix E, and upload them again.

Upgrade of jobs table in database

NetarchiveSuite will automatically upgrade the jobs table in the database. For Derby databases, this may take some time, depending on the number of entries in the jobs table.

If the system crashes during the upgrade, you will get an error about this when restarting the system. The upgrade is done safely, so no data will be lost, but in this rare case the system must be reestablished manually. Please do not hesitate to contact us on the mailing list in case of such an unfortunate accident. For more detailed information, please refer to ConvertingJobsTableFromVersion3To4.

Automatic monitor registration settings

The automatic registering of clients is a new pluggable setting. The default implementation sends a JMS notification to the monitoring interface every minute, informing the monitor that this application is alive and can be monitored.

This needs to be set in settings with the following setting:

<settings>
  ...
  <common>
    ...
      <monitorregistryClient xsi:type="jmsmonitorregistryclient">
          <!-- The class instantiated to register JMX urls at a registry. -->
          <class>dk.netarkivet.common.distribute.monitorregistry.JMSMonitorRegistryClient</class>
      </monitorregistryClient>
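The once-a-minute registration described above can be pictured as a periodic task. The sketch below is only an illustration under stated assumptions: HeartbeatSketch and its start method are hypothetical names, it uses a plain java.util.concurrent scheduler, and the announce callback stands in for sending the JMS notification. It is not the code in JMSMonitorRegistryClient.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: announce liveness at a fixed interval. */
final class HeartbeatSketch {

    /** Schedules announce to run immediately and then once per period. */
    static ScheduledExecutorService start(Runnable announce, long period, TimeUnit unit) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "monitor-heartbeat");
            t.setDaemon(true); // do not keep the application alive just for heartbeats
            return t;
        });
        timer.scheduleAtFixedRate(announce, 0, period, unit);
        return timer;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger sent = new AtomicInteger();
        // A real client would use a one-minute period and send a message here;
        // a short period is used so the demo finishes quickly.
        ScheduledExecutorService timer = start(sent::incrementAndGet, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(300);
        timer.shutdownNow();
        System.out.println("heartbeats sent: " + sent.get());
    }
}
```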

The settings from monitor_settings.xml that describe running applications can be removed. The only thing left in that settings file is:

<settings>
  <monitor>
    <!--  the password used to connect to the all Mbeanservers started by the application. This password must be same as the one in jmxremote.password -->
    <jmxMonitorRolePassword>JMX_MONITOR_ROLE_PASSWORD_PLACEHOLDER</jmxMonitorRolePassword>
  </monitor>
</settings>

Add setting for time out when waiting for responses from JMX subsystems

We communicate with the monitoring framework and Heritrix using JMX. A new setting controls how long we wait for a reply before giving up when trying to communicate.

You should add the following to your settings.xml files:

    <common>
        <jmx>
            ...
            <!-- How many seconds we will wait before giving up on a JMX
            connection. -->
            <timeout>120</timeout>
        </jmx>
    </common>
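To illustrate what such a timeout means in practice, the sketch below waits for a reply for a bounded time and then gives up. It is a generic pattern built on java.util.concurrent, not the NetarchiveSuite JMX code: JmxTimeoutSketch and callWithTimeout are hypothetical names, and the lambdas stand in for a JMX request.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: give up on a request that does not reply in time. */
final class JmxTimeoutSketch {

    /** Runs request on a worker thread and waits at most the given time for its reply. */
    static <T> T callWithTimeout(Callable<T> request, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<T> reply = worker.submit(request);
            return reply.get(timeout, unit); // throws TimeoutException if too slow
        } finally {
            worker.shutdownNow(); // interrupt the request if it is still running
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast request returns normally, well within a 120 second limit.
        System.out.println(callWithTimeout(() -> "pong", 120, TimeUnit.SECONDS));
        try {
            // A slow request is abandoned once the (shortened) limit has passed.
            callWithTimeout(() -> {
                Thread.sleep(2000);
                return "late";
            }, 100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("gave up waiting");
        }
    }
}
```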

Add setting for default configuration byte limit

The byte limit assigned to newly created configurations is now configurable.

You should add the following to your settings.xml files:

    <harvester>
        <datamodel>
            <domain>
                ...
                <!-- Default byte limit for domain configuration. -->
                <defaultMaxbytes>1000000000</defaultMaxbytes>
                ...
            </domain>
        </datamodel>
    </harvester>

New translations

If you are maintaining a translation, please note that the following new keys have been added:

viewerproxy/Translations.properties:

errormsg;request.was.for.0.but.got.1.missing.2=WARNING: The request was for the jobs {0}, but only the jobs {1} had available data for the index. Missing data for the jobs {2} 

archive/Translations.properties:

change.0.failed=Change {0} failed
change.0.may.be.added=Change {0} that can be added
errormsg;admin.data.not.consistent.for.file.0=Admin data are not consistent for the file {0}
replace.file.in.bitarchive.0=Replace the file in bitarchive {0}
file.0.has.been.replaced.in.1=The file {0} has been replaced at bitarchive {1}
unable.to.correct=Unable to correct
no.info.on.file.0=No information could be found on the file ''{0}''.
no.checksum=No checksum

The following translations are no longer used, and can be removed:

change
failed
may.be.added
admin.data.not.consistent.for.file.0
remove.file.from.bitarchive.0
file.0.has.been.deleted.in.1.needs.copy

archive/Translations.properties and monitor/Translations.properties:

All properties starting with "errmsg;" have been renamed to "errormsg;"
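If you maintain a translation file of your own, the rename can be done with any text editor. As a sketch only, the same change can be expressed with java.util.Properties. RenameErrmsgKeys and renamePrefix are hypothetical names, and the key errmsg;unknown.domain.0 used in the demo is made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical sketch: rename a key prefix in a set of translation properties. */
final class RenameErrmsgKeys {

    /** Returns a copy of in where keys starting with oldPrefix start with newPrefix instead. */
    static Properties renamePrefix(Properties in, String oldPrefix, String newPrefix) {
        Properties out = new Properties();
        for (String key : in.stringPropertyNames()) {
            String newKey = key.startsWith(oldPrefix)
                    ? newPrefix + key.substring(oldPrefix.length())
                    : key;
            out.setProperty(newKey, in.getProperty(key));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Properties translations = new Properties();
        // The first key is invented for this demo; the second appears in the list above.
        translations.load(new StringReader(
                "errmsg;unknown.domain.0=Unknown domain {0}\n"
                + "no.checksum=No checksum\n"));
        Properties renamed = renamePrefix(translations, "errmsg;", "errormsg;");
        System.out.println(renamed.getProperty("errormsg;unknown.domain.0"));
        // prints "Unknown domain {0}"
        System.out.println(renamed.containsKey("errmsg;unknown.domain.0"));
        // prints "false"
    }
}
```

Note that writing the result back with Properties.store does not preserve comments or key order, so a plain search and replace may be preferable for hand-maintained files.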

Version History

Version 3.5.0

2008-03-04

Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes

Version 3.4.2

2008-03-14

Bug fix release, fixing JMX timeout

Version 3.4.1

2008-01-16

Bug fix release, fixing out of memory on very large indexes

Version 3.4.0

2008-01-03

Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes

Version 3.3.*

Development versions aiming for 3.4.0

Version 3.2.3

2007-09-27

Bugfix of 3.2.2 with patched deduplicator, that fixes problem in parallel indexing

Version 3.2.2

2007-08-03

Bugfix of 3.2.1 with patched Heritrix 1.12.1 that supports ARCRecords larger than 2GB

Version 3.2.1

2007-07-04

Bugfix of 3.2.0 fixing trouble using the quick start manual.

Version 3.2.0

2007-07-04

Open source release

Version 3.1.*

Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library.

Version 3.0.0

2007-02-02

Marked the naming of the NetarchiveSuite, the splitting of NetarchiveSuite into independent modules, and the licensing of NetarchiveSuite under LGPL

Version 2.*

Various features and updates

Version 2.0

2006-08-30

Marked a general restructuring of the code, where harvest definition data was backed by a database, and the viewerproxy was trimmed and rewritten.

Version 1.*

Various features and updates

Version 1.0

2005-07-01

The first version of the netarchive software put in production for harvesting the entire Danish web

Version 0.*

Various pre-production development versions

ReleaseNotes3_6_0 (last edited 2010-08-16 10:24:52 by localhost)