= Release Notes for NetarchiveSuite 3.12.0 =
This version of !NetarchiveSuite was released on 2010-05-03.

== New features since NetarchiveSuite 3.10.* ==
Major work has been done on improving the bitpreservation infrastructure, including the addition of an external bitpreservation database, which by default is Derby. This is still under development, and the old bitpreservation framework has not yet been fully replaced.

A new type of replica, the checksum replica, has been added to supplement the bitarchive type. This type of replica stores only the names of the ingested files and their checksums. At least one checksum replica is necessary (several may be installed) to make it possible to determine which replica holds the copy of a file with the correct checksum in cases where the bitarchive replicas report different checksums for the file. Each checksum replica is represented by a !ChecksumFileApplication.

Global crawlertraps can now be uploaded to the database, so that they are inserted into the harvest template during job generation instead of being hardwired into the harvest templates.
The full name of the !HarvestTemplateApplication has been changed from dk.netarkivet.harvester.datamodel.!HarvestTemplateApplication to dk.netarkivet.harvester.tools.!HarvestTemplateApplication.

The following bugs have been fixed and feature requests implemented since 3.10:

=== Common Module ===
{{{
Bug 1835 The new examples folder is missing from build
FR 1580 The applications should be able to tell us the version of NetarchiveSuite
FR 1880 Change copyright string from "Copyright 2004-2009" to "Copyright 2004-2010"
}}}
=== Deploy Module ===
{{{
Bug 1705 Make jmxremote.access writable before overwriting it (install script)
Bug 1914 The script to start the archive database lacks max heap option
FR 1790 Print usage of RunNetarchiveSuite.sh
FR 1846 Deploy the bitpreservation database (fixed QA)
FR 1876 Automatic startup of archive database and database url generation for test instance
}}}
=== Harvester Module ===
{{{
Bug 1777 Add event seeds only accepts a very short list of seeds
FR 1116 Global crawlertraps
FR 872 More logging needed in method HarvestControllerServer.HarvesterThread.run()
Bug 1856 Schedule problem after first start on NAS 3.10.0. No schedule started
}}}
=== Access Module ===
{{{
Bug 1758 UrlCanonicalizerFactory falls back to default value silently
}}}
=== Archive Module ===
{{{
Bug 1934 SQLNonTransientConnectionException: Insufficient data while reading from the network
Bug 1920 Duplicates are currently ignored in DatabaseBasedBitpreservation.FindMissingFiles
Bug 1917 Request for all checksum timed out after 60 seconds
Bug 1911 java.lang.NullPointerException WARNING: Cannot retrieve the filenames to reply on the
Bug 1910 Update of Checksum replica takes more than 1 minut
Bug 1909 creation of adminDB with prod admin.data will take 6-7 days...
Bug 1905 DatabasedBased Bitpreservation can initiate multiple checksum and filelist requests at the same time.
Bug 1903 SEVERE: Cannot handle 2 files with the name '22583-MB100.arc'
Bug 1897 Wrong nulls in filelist status
Bug 1894 INFO: No replica name found in request.
Bug 1833 The method isAdminCheckSumOk() from the type FilePreservationState is not visible
FR 1734 Unittest of BitachiveMonitor
FR 1736 Monitoring batchjobs through logging
Bug 754 NullPointerException i bitpreservation
Bug 842 Bitpreservation GUI fetches checksums twice
Bug 810 TEST7, step 8: "Send failed" error when updating filestatus for location SB
}}}
=== Documentation Module ===
{{{
FR 1818 Have all the example configuration files in one folder different from the "conf"
}}}
=== Monitor Module ===
{{{
FR 1757 Need a way to remove an application from lists of monitored applications
FR 1861 When clicking "More" in a status page, the link jumps to the top
FR 1578 The interval (REREGISTER_DELAY) between apps re-registering themselves should be a setting Documentation
}}}
=== External Software ===
{{{
ES 1587 Upgrade Apache Commons Net library to version 2.0
ES 1784 Upgrade Jetty
ES 1808 Upgrade Apache Commons Fileupload to 1.2.1
FR 1890 Upgrade to Apache Derby 10.5.3.0
}}}
== Upgrade instructions ==
Beware that the contents of the '''/conf''' directory have been moved to the '''/examples''' directory. This may cause your existing deploy scripts to fail.

Remember to stop the running installation before upgrading.

If you use Derby as your database, it is recommended that you upgrade your database to 10.5.3.0 by appending ";upgrade=true" to your JDBC URL (cf. http://db.apache.org/derby/docs/10.5/devguide/tdevupgradedb.html).

If you are using the standard distributed !NetarchiveSuite archive, you can now choose to use an archive database. An empty database is bundled with !NetarchiveSuite. This empty archive database needs to be filled with the data embedded in any existing admin.data, and this must be done after the new version of !NetarchiveSuite is installed, but before it is started.
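For the Derby upgrade recommended above, the following is a minimal sketch of how the ";upgrade=true" attribute could be appended to a JDBC URL in a shell script. The function name add_upgrade_attribute is hypothetical, and the example URL is only an assumption built from the default database settings values (jdbc:derby, localhost, 1527, admindb); replace it with the URL of your own installation.

```shell
#!/bin/bash
# Sketch only: append Derby's ";upgrade=true" connection attribute to a JDBC URL.
# The function name and the example URL are illustrative assumptions,
# not part of the NetarchiveSuite distribution.

add_upgrade_attribute() {
    case "$1" in
        # The attribute is already present: return the URL unchanged.
        *";upgrade=true"*) printf '%s\n' "$1" ;;
        # Otherwise append the attribute to the end of the URL.
        *)                 printf '%s;upgrade=true\n' "$1" ;;
    esac
}

# Hypothetical URL for a Derby network server; adjust host, port and name.
JDBC_URL=$(add_upgrade_attribute "jdbc:derby://localhost:1527/admindb")
echo "$JDBC_URL"
```

The check for an existing attribute makes the function safe to call more than once on the same URL. The attribute is only needed for the first connection after the upgrade and can be removed from the URL afterwards.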
The steps for filling the archive database are the following:

 * Copy the admin.data to the machine where you have deployed the archive database (in your deploy configuration, find the deployMachine where the deployArchiveDatabaseDir is set). In the extract of a deploy configuration below, the archive database is deployed on kb-east-adm-001.kb.dk:
{{{
/home/test harvestDatabase adminDB dk.netarkivet.archive.arcrepositoryadmin.DatabaseAdmin dk.netarkivet.archive.arcrepositoryadmin.DerbyServerSpecifics jdbc:derby localhost 8119 adminDB
}}}
 (Note: the admin.data may already be on that machine.)
 * Copy the admin.data file to the installdir
{{{
cp -pv $AdminDataDir/admin.data $INSTALLDIR/
}}}
 * Start the archive database server
{{{
cd $INSTALLDIR/conf/; ./start_external_database.sh
}}}
 * Run the !ReestablishAdminDatabase tool (requires a Sun Java JDK 1.6.0_07 or higher in the classpath)
{{{
cd $INSTALLDIR
sed 's/ArcRepositoryApplication%u/ReestablishAdminDatabase%u/' conf/log_ArcRepositoryApplication.prop > conf/log_ReestablishAdminDatabase.props
sed 's/limit=1000000/limit=0/' conf/log_ReestablishAdminDatabase.props > conf/log_ReestablishAdminDatabase.prop
rm conf/log_ReestablishAdminDatabase.props
java -cp lib/dk.netarkivet.archive.jar:lib/dk.netarkivet.monitor.jar \
  -Ddk.netarkivet.settings.file=conf/settings_GUIApplication.xml \
  -Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.Jdk14Logger \
  -Djava.util.logging.config.file=$INSTALLDIR/conf/log_ReestablishAdminDatabase.prop \
  dk.netarkivet.archive.tools.ReestablishAdminDatabase > nohup.out 2>&1 &
}}}
 This tool takes a long time to complete (about 7 hours for an admin.data file of 1.5 million lines) and prints a progress line for every 10000 lines processed:
{{{
Using default admin.data: /home/test/TEST7A/admin.data
Reading admin.data version 0.4
[Fri Apr 23 05:02:00 CEST 2010] Processed 10000 admin data lines
[Fri Apr 23 05:04:37 CEST 2010] Processed 20000 admin data lines
...
[Fri Apr 23 12:01:30 CEST 2010] Processed 1560000 admin data lines
[Fri Apr 23 12:01:34 CEST 2010] ReestablishAdminDatabase tool finished ingest of file '/home/test/TEST7A/admin.data'.
}}}
 * Shut down the archive database server
{{{
cd $INSTALLDIR/conf/; ./kill_external_database.sh
}}}
Adding a checksum replica to your installation requires some work as well:

1) You need to add a checksum replica to your deploy configuration
{{{
CS CSN checksum
}}}
2) You need to add a !ChecksumFileApplication to represent this replica. We recommend that this application is placed at a different location than the Arcrepository. Below, the application is installed on machine east-acs-001.kb.dk, which is part of physical location EAST:
{{{
... ... CS 8112 8212 CS
}}}
If you have an existing admin.data file from the old installation and would like to ingest this information into the checksum replica, do the following after you have installed, but not started, the new version of !NetarchiveSuite:

 * Copy the admin.data to the installdir on the machine where you have installed the !ChecksumFileApplication
 * Start the checksum application
{{{
cd $INSTALLDIR/conf; ./start_ChecksumFileApplication.sh
}}}
 * Wait for the migration to complete. The migration is finished when the line "Finished loading admin data" can be found in the logs. Check this by going to the logs directory (cd $INSTALLDIR/logs) and executing the following grep command:
{{{
grep "Finished loading admin data" *
}}}
 * Stop the checksum application
{{{
cd $INSTALLDIR/conf; ./kill_ChecksumFileApplication.sh
}}}
==== How to update the bitpreservation information automatically ====
The bitpreservation framework allows you to update the status of each replica concerning "missing files" and "changed files". These operations can take a long time to complete, so the user no longer has to wait for an operation to finish before the web page is rendered. All operations are now performed as background processes.
However, in order not to lose track of the update processes, it is recommended to schedule them as cron jobs, so that each operation has 7 hours to complete before the next operation begins. The following updatebitpreservationinfo.sh script uses the 'wget' command to activate an update process.
{{{
#!/bin/bash
##
## Usage: updatebitpreservationinfo.sh type replicaname
##
## The type can be either 'findmissingfiles' or 'checksum'
## The replicaname is the name of one of the replicas in your installation (e.g. replicaOne, replicaTwo, replicaCs)
## Note that GUI_BASEDIR below must be replaced with the correct URL in your installation.
export TYPE=$1
export BITARCHIVE=$2
export GUI_BASEDIR=http://east-adm-001.kb.dk:8076
URL="$GUI_BASEDIR/BitPreservation/Bitpreservation-filestatus-update.jsp?type=$TYPE&bitarchive=$BITARCHIVE"
# Quote the URL to protect it from word splitting and globbing ('?' is a glob character)
wget "$URL"
}}}
Example usage of this script:
{{{
bash updatebitpreservationinfo.sh findmissingfiles CSN
bash updatebitpreservationinfo.sh checksum CSN
}}}
==== How to back up the external Derby archive database ====
External Derby databases are not backed up by the !NetarchiveSuite software. This must be done outside !NetarchiveSuite with a script that uses the 'ij' command bundled with the Derby software. The following script makes a backup to a unique directory relative to the installation directory. The script assumes that the $INSTALLDIR variable contains the installation directory. The DATABASE_PORT and DATABASE variables need to be changed to fit your environment.
{{{
#!/bin/bash
cd $INSTALLDIR/
export DATABASE_PORT=8169
# A timestamp makes the backup directory and the temporary file unique
export SecondsSinceEpoch=`date +"%s"`
export DATABASE="jdbc:derby://localhost:$DATABASE_PORT/adminDB"
export BACKUPDIR=$INSTALLDIR/adminDB-backup-$SecondsSinceEpoch
export TMP_FILE=tmp-$SecondsSinceEpoch
# Write the ij commands (connect, then backup) to the temporary file
echo connect \'$DATABASE\'\; > $TMP_FILE
echo "Making a backup of the database in directory $BACKUPDIR"
echo CALL SYSCS_UTIL.SYSCS_BACKUP_DATABASE\(\'$BACKUPDIR\'\)\; >> $TMP_FILE
# Feed the commands to the Derby ij tool
java -cp lib/db/derbyclient-10.5.3.0.jar:lib/db/derbytools-10.5.3.0.jar org.apache.derby.tools.ij < $TMP_FILE
rm $TMP_FILE
}}}
=== New settings ===
The following new settings have been introduced:

''settings.common.batch.timeBetweenLogging'' (default: 30000): The time between logging the status of a batch job, measured in milliseconds. The default amounts to 30 seconds.

''settings.common.batch.defaultBatchTimeout'' (default: 604800000): Batch jobs without a specified timeout will get this value in milliseconds (one week).

Note that a checksum replica (THREE) has been added to the default set of replicas, which is now:
{{{
ONE replicaOne bitarchive
TWO replicaTwo bitarchive
THREE replicaCs checksum
}}}
''settings.archive.bitpreservation.class'' (default: dk.netarkivet.archive.arcrepository.bitpreservation.!FileBasedActiveBitPreservation): Selects which implementation of !ActiveBitPreservation is used for preservation. The alternative class is dk.netarkivet.archive.arcrepository.bitpreservation.!DatabaseBasedActiveBitPreservation, which requires an external database, as two embedded databases cannot run at the same time. This database is defined by the ''settings.archive.admin.database.[class|baseUrl|machine|port|dir]'' settings. The default settings are the following:
{{{
dk.netarkivet.archive.arcrepositoryadmin.DerbyServerSpecifics jdbc:derby localhost 1527 admindb
}}}
''settings.archive.admin.class'' (default: dk.netarkivet.archive.arcrepositoryadmin.!UpdateableAdminData): The setting for the administration instance class.
The alternative option is the dk.netarkivet.archive.arcrepositoryadmin.!DatabaseAdmin, which requires an external database, just as is the case for the !DatabaseBasedActiveBitPreservation.

=== New translation strings ===
'''harvester/Translations.properties'''
{{{
errormsg;template.upload.failed.with.exception.0=Harvest template upload failed with exception {0}
errormsg;uploading.file.0.failed.it.does.not.exist=Uploading the file ''{0}'' failed. It does not exist
errormsg;missing.parameter.0=Required parameter {0} not given
pagetitle;seeds.for.harvestdefinition=Seeds for the harvestdefinition
harveststatus.seeds.for.harvest.0=Domain/Seeds for harvestdefinition {0}
add.seeds.from.file=Add seeds from a file
errormsg;no.seedsfile.was.uploaded=No seedsfile was uploaded
// seeds regarding global crawlertraps facility
pagetitle;edit.global.crawler.traps=Global Crawler Traps
crawlertrap.active.header=Active Crawler Traps
crawlertrap.noactive=There are no active global crawler traps.
crawlertrap.inactive.header=Inactive Crawler Traps
crawlertrap.noinactive=There are no inactive global crawler traps.
crawlertrap.createnew=Upload New Global Crawler Trap List
errormsg;crawlertrap.upload.error=Error uploading new Global Crawler Trap
crawlertrap.activate=Activate
crawlertrap.deactivate=Deactivate
crawlertrap.create=Create
crawlertrap.update=Update
crawlertrap.delete=Delete
crawlertrap.name=Name
crawlertrap.active=Active
crawlertrap.inactive=Inactive
crawlertrap.description=Description
crawlertrap.filetoupload=File To Upload
hide=Hide
}}}
'''monitor/Translation.properties'''
{{{
tablefield;removeapplication=Remove Application
errormsg.error.when.unregistering.mbean.0=Error when unreqistering JMX MBean identified with query ''{0}''.
}}}
(Note: the latter key needs to be changed to "errormsg;error.when.unregistering.mbean.0". See outstanding bug 1844, Wrong labelling of the translation key "errormsg.error.when.unregistering.mbean.0".)

=== Deleted translation strings ===
'''harvester/Translation.properties'''
{{{
errormsg;template.upload.failed=Harvest template upload failed
}}}
'''archive/Translation.properties''':
{{{
pagetitle;filestatus.update=Update of filestatus information
errormsg;unknown.filestatus.update.type.0=Unknown filestatus update type ''{0}''.
initiating;update.of.0.for.replica.1=Initiating update of ''{0}'' for replica ''{1}''
be.patient.this.operation.can.take.hours=Please be patient. This operation can take hours
}}}
== Version History ==
||Version 3.11.* || ||Development versions aiming for 3.12.0 ||
||Version 3.10.0 ||2009-11-16 ||New deploy application; JMX stability issues fixed; JMS stability issues also fixed ||
||Version 3.9.* || ||Development versions aiming for 3.10.0 ||
||Version 3.8.2 ||2009-09-10 ||Fix of an important index synchronization bug ||
||Version 3.8.1 ||2009-07-15 ||Fix of an important bug leading to unresponsive harvesters ||
||Version 3.8.0 ||2009-05-23 ||Java 1.6, Heritrix 1.14.1, Derby 10.4.2.0, complete rewrite of settings, new supported deploy module, GUI access to harvest logs ||
||Version 3.7.0 ||2008-11-04 ||Development version aiming for 3.8.0 ||
||Version 3.6.0 ||2008-07-03 ||Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes ||
||Version 3.5.* || ||Development versions aiming for 3.6.0 ||
||Version 3.4.2 ||2008-03-14 ||Bug fix release, fixing JMX timeout ||
||Version 3.4.1 ||2008-01-16 ||Bug fix release, fixing out of memory on very large indexes ||
||Version 3.4.0 ||2008-01-03 ||Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes ||
||Version 3.3.* || ||Development versions aiming for 3.4.0 ||
||Version 3.2.3 ||2007-09-27 ||Bugfix of 3.2.2 with patched deduplicator that fixes a problem in parallel indexing ||
||Version 3.2.2 ||2007-08-03 ||Bugfix of 3.2.1 with patched Heritrix 1.12.1 that supports ARCRecords larger than 2 GB ||
||Version 3.2.1 ||2007-07-04 ||Bugfix of 3.2.0 fixing trouble using the quick start manual ||
||Version 3.2.0 ||2007-07-04 ||Open source release ||
||Version 3.1.* || ||Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library. ||
||Version 3.0.0 ||2007-02-02 ||Marked the naming of the !NetarchiveSuite, the splitting of !NetarchiveSuite into independent modules, and the licensing of !NetarchiveSuite under LGPL ||
||Version 2.* || ||Various features and updates ||
||Version 2.0 ||2006-08-30 ||Marked a general restructuring of the code, where harvest definition data was backed by a database and the viewerproxy was trimmed and rewritten ||
||Version 1.* || ||Various features and updates ||
||Version 1.0 ||2005-07-01 ||The first version of the netarchive software put in production for harvesting the entire Danish web ||
||Version 0.* || ||Various pre-production development versions ||