Release Notes for NetarchiveSuite 3.12.0

This version of NetarchiveSuite was released on 2010-05-03.

New features since NetarchiveSuite 3.10.*

Major work has been done on improving the bitpreservation infrastructure, including the addition of an external bitpreservation database, which by default is Derby. This is still under development, and the old bitpreservation framework has not been fully replaced yet. A new type of replica, the checksum replica, has been added to supplement the bitarchive type. A replica of this type stores only the names of the ingested files and their checksums. At least one checksum replica is necessary (several may be installed) to make it possible to determine which replica holds the copy of a file with the correct checksum when the bitarchive replicas report different checksums for the file.

Each checksum replica is represented by a ChecksumFileApplication.
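The role of the checksum replica as a tie-breaker can be illustrated with a small sketch. This is not the algorithm used by NetarchiveSuite; it is only a minimal illustration of picking the checksum reported by most replicas. The function name majority_checksum, the input format ("replicaId checksum" per line) and the sample checksums are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: read "replicaId checksum" lines on stdin and print
# the checksum reported by the largest number of replicas.
# With a tie, which checksum is printed is not defined by this sketch.
majority_checksum() {
    awk '{print $2}' | sort | uniq -c | sort -rn | awk 'NR==1 {print $2}'
}

# Made-up sample: two bitarchive replicas disagree, the checksum replica decides.
printf '%s\n' "ONE 1a2b" "TWO ffff" "CS 1a2b" | majority_checksum
```

In this sample the checksum replica agrees with replica ONE, so the copy held by replica TWO would be the one to investigate.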

Global crawlertraps can now be uploaded to the database, so that they are inserted into the harvest template during job generation instead of being hardwired into the harvest templates.

The full name of the HarvestTemplateApplication has been changed from dk.netarkivet.harvester.datamodel.HarvestTemplateApplication to dk.netarkivet.harvester.tools.HarvestTemplateApplication.
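If your own scripts refer to the old class name, a filter along the following lines could be used to rewrite them. This is a sketch only: the function name rename_template_app is hypothetical, and the sample command line (including the "ARGS" placeholder) is made up for illustration.

```shell
#!/bin/sh
# Hypothetical filter: replace the old fully qualified class name with the
# new one in whatever text is passed on stdin.
rename_template_app() {
    sed 's/dk\.netarkivet\.harvester\.datamodel\.HarvestTemplateApplication/dk.netarkivet.harvester.tools.HarvestTemplateApplication/g'
}

# Made-up sample line from a local script.
echo 'java dk.netarkivet.harvester.datamodel.HarvestTemplateApplication ARGS' | rename_template_app
```

To apply it to a file, redirect the file into the function and write the result to a new file, then compare the two before replacing the original.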

The following bugs have been fixed and feature requests implemented since 3.10:

Common Module

Bug 1835 The new examples folder is missing from build
FR 1580 The applications should be able to tell us the version of NetarchiveSuite
FR 1880 Change copyright string from "Copyright 2004-2009" to "Copyright 2004-2010"

Deploy Module

Bug 1705 Make jmxremote.access writable before overwriting it (install script)
Bug 1914 The script to start the archive database lacks max heap option
FR 1790 Print usage of RunNetarchiveSuite.sh
FR 1846 Deploy the bitpreservation database (fixed QA)
FR 1876 Automatic startup of archive database and database url generation for test instance

Harvester Module

Bug 1777 Add event seeds only accepts a very short list of seeds
FR 1116 Global crawlertraps
FR 872 More logging needed in method HarvestControllerServer.HarvesterThread.run()
Bug 1856 Schedule problem after first start on NAS 3.10.0. No schedule started

Access Module

Bug 1758 UrlCanonicalizerFactory falls back to default value silently

Archive Module

Bug 1934 SQLNonTransientConnectionException: Insufficient data while reading from the network
Bug 1920 Duplicates are currently ignored in DatabaseBasedBitpreservation.FindMissingFiles
Bug 1917 Request for all checksum timed out after 60 seconds
Bug 1911 java.lang.NullPointerException WARNING: Cannot retrieve the filenames to reply on the
Bug 1910 Update of Checksum replica takes more than 1 minut
Bug 1909 creation of adminDB with prod admin.data will take 6-7 days...
Bug 1905 DatabasedBased Bitpreservation can initiate multiple checksum and filelist requests at the same time.
Bug 1903 SEVERE: Cannot handle 2 files with the name '22583-MB100.arc'
Bug 1897 Wrong nulls in filelist status
Bug 1894 INFO: No replica name found in request.
Bug 1833 The method isAdminCheckSumOk() from the type FilePreservationState is not visible
FR 1734 Unittest of BitachiveMonitor
FR 1736 Monitoring batchjobs through logging
Bug 754 NullPointerException i bitpreservation
Bug 842 Bitpreservation GUI fetches checksums twice
Bug 810 TEST7, step 8: "Send failed" error  when updating filestatus for location SB

Documentation Module

FR 1818 Have all the example configuration files in one folder different from the "conf"

Monitor Module

FR 1757 Need a way to remove an application from lists of monitored applications
FR 1861 When clicking "More" in a status page, the link jumps to the top
FR 1578 The interval (REREGISTER_DELAY) between apps re-registering themselves should be a setting

External Software

ES 1587 Upgrade Apache Commons Net library to version 2.0
ES 1784 Upgrade Jetty
ES 1808 Upgrade Apache Commons Fileupload to 1.2.1
FR 1890 Upgrade to Apache Derby 10.5.3.0

Upgrade instructions

Beware that the contents of the /conf directory have been moved to the /examples directory. This may cause your existing deploy scripts to fail.

Remember to stop the running installation before upgrading. If you use Derby as your database, it is recommended that you upgrade your database to 10.5.3.0 by appending ";upgrade=true" to your JDBC URL (cf. http://db.apache.org/derby/docs/10.5/devguide/tdevupgradedb.html).
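As a minimal sketch of the URL change, the following helper appends the upgrade attribute to a JDBC URL unless it is already present. The function name derby_upgrade_url and the database name "harvestdb" in the sample URL are assumptions made for this example; use the URL from your own settings.

```shell
#!/bin/sh
# Hypothetical helper: append ";upgrade=true" to a Derby JDBC URL,
# leaving the URL untouched if the attribute is already there.
derby_upgrade_url() {
    case $1 in
        *";upgrade=true"*) echo "$1" ;;
        *) echo "$1;upgrade=true" ;;
    esac
}

# Sample URL; host, port and database name are placeholders.
derby_upgrade_url 'jdbc:derby://localhost:1527/harvestdb'
```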

If you are using the standard distributed NetarchiveSuite archive, you can now choose to use an archive database. An empty database is bundled with NetarchiveSuite.

This empty archive database needs to be filled with the data embedded in any existing admin.data, and this must be done after the new version of NetarchiveSuite is installed, but before it is started. The steps for filling the database are the following.

The ReestablishAdminDatabase tool takes a long time to complete (about 7 hours for an admin.data file of 1.5 million lines), and prints a progress line for every 10000 lines processed:

Using default admin.data: /home/test/TEST7A/admin.data
Reading admin.data version 0.4
[Fri Apr 23 05:02:00 CEST 2010] Processed 10000 admin data lines
[Fri Apr 23 05:04:37 CEST 2010] Processed 20000 admin data lines
...
[Fri Apr 23 12:01:30 CEST 2010] Processed 1560000 admin data lines
[Fri Apr 23 12:01:34 CEST 2010] ReestablishAdminDatabase tool finished ingest of file '/home/test/TEST7A/admin.data'.
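Because the ingest runs for hours, it can be convenient to pull the latest progress figure out of the tool's output. The following is a sketch that assumes the output has been captured in a file or pipe and has the format shown above; the function name last_processed is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the tool output on stdin and print the number
# of admin data lines reported in the most recent progress line.
last_processed() {
    sed -n 's/.*Processed \([0-9][0-9]*\) admin data lines.*/\1/p' | tail -n 1
}

# Sample input copied from the output shown above.
printf '%s\n' \
    'Reading admin.data version 0.4' \
    '[Fri Apr 23 05:02:00 CEST 2010] Processed 10000 admin data lines' \
    '[Fri Apr 23 05:04:37 CEST 2010] Processed 20000 admin data lines' \
    | last_processed
```

Comparing this number with the line count of admin.data (for instance from "wc -l") gives a rough idea of how far the ingest has come.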

Adding a checksum-replica to your installation requires some work as well:

1) You need to add a checksum replica to your deploy configuration

<deployGlobal>
     <settings>
           <replicas>
                <replica>
                    <replicaId>CS</replicaId>
                    <replicaName>CSN</replicaName>
                    <replicaType>checksum</replicaType>
                </replica>

2) You need to add a ChecksumFileApplication to represent this replica. We recommend that this application is placed at a different location than the Arcrepository. Below, the application is installed on the machine east-acs-001.kb.dk, which is part of the physical location EAST:

<thisPhysicalLocation name="EAST">
...
<deployMachine name="east-acs-001.kb.dk">
...
           <applicationName name="dk.netarkivet.archive.checksum.ChecksumFileApplication">
                <settings>
                    <archive>
                        <checksum>
                            <baseDir>CS</baseDir>
                        </checksum>
                    </archive>
                    <common>
                        <jmx>
                            <port>8112</port>
                            <rmiPort>8212</rmiPort>
                        </jmx>
                        <useReplicaId>CS</useReplicaId>
                    </common>
                </settings>
            </applicationName>
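Before deploying, it may be worth checking that the deploy configuration really declares a checksum replica. The following sketch counts the lines that declare one; it assumes each replicaType element sits on its own line, as in the fragments above, and the function name count_checksum_replicas is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the lines on stdin that declare a replica of
# type "checksum". At least one is needed, according to these notes.
count_checksum_replicas() {
    grep -c '<replicaType>checksum</replicaType>'
}

# Sample input modelled on the replica definition in step 1.
count_checksum_replicas <<'EOF'
<replica>
    <replicaId>CS</replicaId>
    <replicaName>CSN</replicaName>
    <replicaType>checksum</replicaType>
</replica>
EOF
```

To check a real file, redirect your deploy configuration into the function instead of the inline sample.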

If you have an existing admin.data file from the old installation, and you would like to ingest this information into the checksum replica, you need to do the following after you have installed, but not yet started, the new version of NetarchiveSuite:

How to update the bitpreservation information automatically

The bitpreservation framework allows you to update the status of each replica concerning "missing files" and "changed files". These operations can take a long time to complete, so the user no longer has to wait for the operation to finish before the web page is rendered. All operations are now performed as background processes.

However, in order not to lose track of the update processes, it is recommended to schedule them as cron jobs, so that each operation has 7 hours to complete before the next operation begins. The following updatebitpreservationinfo.sh script uses the 'wget' command to activate an update process.

##
## Usage: updatebitpreservationinfo.sh type replicaname
##
## The type can be either 'findmissingfiles' or 'checksum'
## The replicaname is the name of one of the replicas in your installation (e.g. replicaOne, replicaTwo, replicaCs)
## Note that GUI_BASEDIR below must be replaced with the correct URL in your installation.
export TYPE=$1
export BITARCHIVE=$2
export GUI_BASEDIR=http://east-adm-001.kb.dk:8076
URL="$GUI_BASEDIR/BitPreservation/Bitpreservation-filestatus-update.jsp?type=$TYPE&bitarchive=$BITARCHIVE"
wget "$URL"

Example usage of this script:

bash updatebitpreservationinfo.sh findmissingfiles CSN
bash updatebitpreservationinfo.sh checksum CSN
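The recommended scheduling could look roughly like the crontab entries printed by the sketch below, which starts one update every 7 hours. The function name print_update_crontab, the script path /home/test/updatebitpreservationinfo.sh and the choice of jobs are assumptions for illustration; the replica names are taken from the default settings.

```shell
#!/bin/sh
# Hypothetical helper: print crontab entries that start one update
# operation every 7 hours (00:00, 07:00, 14:00).
# $1 is the path to the update script in your installation.
print_update_crontab() {
    script=$1
    hour=0
    for job in "findmissingfiles replicaOne" "checksum replicaOne" "findmissingfiles replicaTwo"; do
        echo "0 $hour * * * bash $script $job"
        hour=$((hour + 7))
    done
}

# The script path is a placeholder.
print_update_crontab /home/test/updatebitpreservationinfo.sh
```

Only three 7-hour slots fit in one day, so further operations would need to be placed on other days, for example by using the day-of-week field.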

How to back up the external Derby archive database

External Derby databases are not backed up by the NetarchiveSuite software. This must be done outside NetarchiveSuite with a script that uses the 'ij' command bundled with the Derby software. The following script makes a backup to a unique directory below the installation directory. The script assumes that the $INSTALLDIR variable contains the installation directory. The DATABASE_PORT and DATABASE variables need to be changed to fit your environment.

cd "$INSTALLDIR/" || exit 1
export DATABASE_PORT=8169
export SecondsSinceEpoch=`date +"%s"`
export DATABASE="jdbc:derby://localhost:$DATABASE_PORT/adminDB"
export BACKUPDIR=$INSTALLDIR/adminDB-backup-$SecondsSinceEpoch
export TMP_FILE=tmp-$SecondsSinceEpoch
# Write the ij commands (connect, then backup) to a temporary file.
# Each ij statement is terminated with a semicolon.
echo "connect '$DATABASE';" > $TMP_FILE
echo "Making a backup of the database in directory $BACKUPDIR"
echo "CALL SYSCS_UTIL.SYSCS_BACKUP_DATABASE('$BACKUPDIR');" >> $TMP_FILE
java -cp lib/db/derbyclient-10.5.3.0.jar:lib/db/derbytools-10.5.3.0.jar org.apache.derby.tools.ij < $TMP_FILE
rm $TMP_FILE
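Since every run of the backup script creates a new adminDB-backup-<seconds since epoch> directory, a small helper can locate the most recent one. This is a sketch that relies only on the naming scheme used by the script above; the function name latest_backup is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the name of the newest backup directory in
# the directory given as $1, judged by the epoch suffix in its name.
# Prints nothing if no backup directory exists.
latest_backup() {
    (cd "$1" && ls -d adminDB-backup-* 2>/dev/null | sort -t- -k3,3n | tail -n 1)
}

# Look in the installation directory, or the current directory if
# INSTALLDIR is not set.
latest_backup "${INSTALLDIR:-.}"
```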

New settings

The following new settings have been introduced:

settings.common.batch.timeBetweenLogging (default: 30000): The time between logging the status of a batch job, measured in milliseconds. The default amounts to 30 seconds.

settings.common.batch.defaultBatchTimeout (default: 604800000): Batch jobs without a specified timeout get this value in milliseconds (one week).
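The two defaults can be checked with a little shell arithmetic. The helper names ms_to_seconds and ms_to_days are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers: convert a value in milliseconds to whole seconds
# or whole days using integer arithmetic.
ms_to_seconds() { echo $(( $1 / 1000 )); }
ms_to_days() { echo $(( $1 / 1000 / 60 / 60 / 24 )); }

ms_to_seconds 30000      # timeBetweenLogging: prints 30
ms_to_days 604800000     # defaultBatchTimeout: prints 7
```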

Note that a checksum replica (replicaCs) has been added to the default set of replicas, which now is:

        <replicas> <!-- The entire settings for replicas. -->
            <replica>
                <replicaId>ONE</replicaId>
                <replicaName>replicaOne</replicaName>
                <replicaType>bitarchive</replicaType>
            </replica>
            <replica>
                <replicaId>TWO</replicaId>
                <replicaName>replicaTwo</replicaName>
                <replicaType>bitarchive</replicaType>
            </replica>
            <replica>
                <replicaId>THREE</replicaId>
                <replicaName>replicaCs</replicaName>
                <replicaType>checksum</replicaType>
            </replica>
        </replicas>

settings.archive.bitpreservation.class (default

                <!-- This database cannot be embedded. -->
                <class>dk.netarkivet.archive.arcrepositoryadmin.DerbyServerSpecifics</class>
                <!-- The url is default: jdbc:derby://localhost:1527/admindb -->
                <baseUrl>jdbc:derby</baseUrl>
                <machine>localhost</machine>
                <port>1527</port>
                <dir>admindb</dir>
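Going by the comment in the fragment above, the default URL appears to be composed of the baseUrl, machine, port and dir values. The following sketch reproduces that composition; the function name admin_db_url is hypothetical, and the composition rule is inferred from the comment rather than taken from the NetarchiveSuite source.

```shell
#!/bin/sh
# Hypothetical helper: compose a database URL from the four settings
# baseUrl ($1), machine ($2), port ($3) and dir ($4).
admin_db_url() {
    echo "$1://$2:$3/$4"
}

# Default values from the settings fragment above.
admin_db_url jdbc:derby localhost 1527 admindb
```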

settings.archive.admin.class (default: dk.netarkivet.archive.arcrepositoryadmin.UpdateableAdminData): The setting for the administration instance class. The alternative is dk.netarkivet.archive.arcrepositoryadmin.DatabaseAdmin, which requires an external database, just as DatabaseBasedActiveBitPreservation does.

New translation strings

harvester/Translations.properties

errormsg;template.upload.failed.with.exception.0=Harvest template upload failed with exception {0}
errormsg;uploading.file.0.failed.it.does.not.exist=Uploading the file ''{0}'' failed. It does not exist
errormsg;missing.parameter.0=Required parameter {0} not given
pagetitle;seeds.for.harvestdefinition=Seeds for the harvestdefinition
harveststatus.seeds.for.harvest.0=Domain/Seeds for harvestdefinition {0}
add.seeds.from.file=Add seeds from a file
errormsg;no.seedsfile.was.uploaded=No seedsfile was uploaded
// strings regarding the global crawlertraps facility
pagetitle;edit.global.crawler.traps=Global Crawler Traps
crawlertrap.active.header=Active Crawler Traps
crawlertrap.noactive=There are no active global crawler traps.
crawlertrap.inactive.header=Inactive Crawler Traps
crawlertrap.noinactive=There are no inactive global crawler traps.
crawlertrap.createnew=Upload New Global Crawler Trap List
errormsg;crawlertrap.upload.error=Error uploading new Global Crawler Trap
crawlertrap.activate=Activate
crawlertrap.deactivate=Deactivate
crawlertrap.create=Create
crawlertrap.update=Update
crawlertrap.delete=Delete
crawlertrap.name=Name
crawlertrap.active=Active
crawlertrap.inactive=Inactive
crawlertrap.description=Description
crawlertrap.filetoupload=File To Upload
hide=Hide

monitor/Translations.properties

tablefield;removeapplication=Remove Application
errormsg.error.when.unregistering.mbean.0=Error when unreqistering JMX MBean identified with query ''{0}''.

(Note: the latter key needs to be changed to "errormsg;error.when.unregistering.mbean.0". See outstanding bug 1844, Wrong labelling of the translation key "errormsg.error.when.unregistering.mbean.0".)

Deleted translation strings

harvester/Translations.properties

errormsg;template.upload.failed=Harvest template upload failed

archive/Translations.properties

pagetitle;filestatus.update=Update of filestatus information
errormsg;unknown.filestatus.update.type.0=Unknown filestatus update type ''{0}''.
initiating;update.of.0.for.replica.1=Initiating update of ''{0}'' for replica ''{1}''
be.patient.this.operation.can.take.hours=Please be patient. This operation can take hours

Version History

Version 3.11.*

Development versions aiming for 3.12.0

Version 3.10.0

2009-11-16

New deploy application; JMX stability issues fixed; JMS stability issues also fixed

Version 3.9.*

Development versions aiming for 3.10.0

Version 3.8.2

2009-09-10

Fix an important index synchronization bug

Version 3.8.1

2009-07-15

Fix of important bug leading to unresponsive harvesters

Version 3.8.0

2009-05-23

Java 1.6, Heritrix 1.14.1, Derby 10.4.2.0, complete rewrite of settings, new supported deploy module, gui access to harvest logs

Version 3.7.0

2008-11-04

Development version aiming for 3.8.0

Version 3.6.0

2008-07-03

Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes

Version 3.5.*

Development versions aiming for 3.6.0

Version 3.4.2

2008-03-14

Bug fix release, fixing JMX timeout

Version 3.4.1

2008-01-16

Bug fix release, fixing out of memory on very large indexes

Version 3.4.0

2008-01-03

Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes

Version 3.3.*

Development versions aiming for 3.4.0

Version 3.2.3

2007-09-27

Bugfix of 3.2.2 with patched deduplicator that fixes a problem in parallel indexing

Version 3.2.2

2007-08-03

Bugfix of 3.2.1 with patched Heritrix 1.12.1 that supports ARCRecords larger than 2 GB

Version 3.2.1

2007-07-04

Bugfix of 3.2.0 fixing trouble using the quick start manual.

Version 3.2.0

2007-07-04

Open source release

Version 3.1.*

Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library.

Version 3.0.0

2007-02-02

Marked the naming of the NetarchiveSuite, the splitting of NetarchiveSuite into independent modules, and the licensing of NetarchiveSuite under LGPL

Version 2.*

Various features and updates

Version 2.0

2006-08-30

Marked a general restructuring of the code, where harvest definition data was backed by a database, the viewerproxy was trimmed and rewritten.

Version 1.*

Various features and updates

Version 1.0

2005-07-01

The first version of the netarchive software put in production for harvesting the entire Danish web

Version 0.*

Various pre-production development versions
