Release Notes for NetarchiveSuite 3.10.0

This version of NetarchiveSuite was released on 2009-11-16.

New features since NetarchiveSuite 3.8.*

Apart from general bug fixes (see below), the most important new features are:

General

This is a great release for NetarchiveSuite, because so much of the code has been done by new partners in the project! Thank you very much, and welcome to the national libraries in France and Austria for their great contributions!

Common Module

It is now possible to override the implementation of the getBytesFree() method, which by default uses the standard Java method File.getUsableSpace().
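A custom implementation could look roughly like the following. This is a minimal sketch, not the NetarchiveSuite code: the FreeSpaceProvider interface declared here is a local stand-in whose signature is an assumption, and ReservedSpaceProvider is a hypothetical example class.

```java
import java.io.File;

/** Local stand-in for the NetarchiveSuite interface; the real signature may differ. */
interface FreeSpaceProvider {
    long getBytesFree(File dir);
}

/** Mirrors the default behaviour described above: delegate to File.getUsableSpace(). */
final class UsableSpaceProvider implements FreeSpaceProvider {
    @Override
    public long getBytesFree(File dir) {
        return dir.getUsableSpace();
    }
}

/** Hypothetical override that holds back a fixed reserve from the reported free space. */
final class ReservedSpaceProvider implements FreeSpaceProvider {
    private final long reserveBytes;

    ReservedSpaceProvider(long reserveBytes) {
        if (reserveBytes < 0) {
            throw new IllegalArgumentException("reserveBytes must not be negative");
        }
        this.reserveBytes = reserveBytes;
    }

    @Override
    public long getBytesFree(File dir) {
        // Never report a negative amount of free space.
        return Math.max(0L, dir.getUsableSpace() - reserveBytes);
    }
}
```

An override of this kind would be selected through the settings.common.freespaceprovider.class setting listed under "New settings".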

We now have translations in Italian and French, thanks to the kind donations from the national libraries in Austria and France.

Harvester Module

NetarchiveSuite now works properly with MySQL (see bug 1254).

The HarvestScheduler now enforces a deadline for jobs in status STARTED; the default is one week. Jobs that exceed the deadline are set to FAILED.
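The deadline check can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the timeout is taken to be in seconds (one week is 604800 seconds, the default given for settings.harvester.scheduler.jobtimeouttime), and the class, enum and method names are hypothetical rather than taken from the HarvestScheduler.

```java
import java.time.Duration;
import java.time.Instant;

final class JobTimeoutCheck {
    /** One week in seconds, matching the documented default. */
    static final long DEFAULT_TIMEOUT_SECONDS = 604800L;

    /** Simplified job states for this illustration only. */
    enum Status { STARTED, DONE, FAILED }

    /**
     * Returns the status a job should have after the deadline check.
     * Only jobs in STARTED are affected; all other states pass through unchanged.
     */
    static Status statusAfterCheck(Status current, Instant startedAt,
                                   Instant now, long timeoutSeconds) {
        if (current != Status.STARTED) {
            return current;
        }
        long runningSeconds = Duration.between(startedAt, now).getSeconds();
        return runningSeconds > timeoutSeconds ? Status.FAILED : Status.STARTED;
    }
}
```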

Deduplication is now completely optional, and can be turned off with a setting.

It is now possible to use number of objects as harvest limits, as well as size in bytes.
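How an object limit and a byte limit might be checked side by side can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and treating -1 as "unlimited" for byte limits too is an assumption (these notes state it only for the object limit setting, settings.harvester.datamodel.domain.defaultMaxObjects).

```java
final class HarvestLimits {
    /** Value that disables a limit. */
    static final long UNLIMITED = -1L;

    /** True when a limit is set and the counted amount has reached it. */
    static boolean limitReached(long count, long limit) {
        return limit != UNLIMITED && count >= limit;
    }

    /** A harvest of a domain stops as soon as either limit is reached. */
    static boolean shouldStop(long objects, long bytes, long maxObjects, long maxBytes) {
        return limitReached(objects, maxObjects) || limitReached(bytes, maxBytes);
    }
}
```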

The harvest reports included in the metadata files uploaded to the archive are now configurable. This also makes it possible to include in the metadata any order.xml-file that has been changed during harvesting using the Heritrix web interface (these are included by default).

A new page lists all seeds for a given harvest definition.

Fewer jobs will be included in later harvests when doing a step-by-step snapshot harvest, since aborted jobs are now considered complete.

Archive Module

Some excessive logging has been removed, and the previously fixed indexing bugs 1078 and 1079 were reopened and fixed again.

Work has started on a derived archive type that stores only the checksums of files instead of the entire files, for better bit preservation functionality. This is still highly experimental.

Work has also started on database-based active bit preservation functionality. This is not functional yet.

Access Module

Code for supporting Wayback is now in NetarchiveSuite.

Current support includes a batch job for generating CDXs for use with Wayback, including deduplicated entries, and a ResourceStore for use with Wayback that uses the NetarchiveSuite bitarchive. Setting up Wayback is still a somewhat manual process; please refer to the documentation.

Deploy

The deploy module now generates files for use when starting bitarchive applications as Windows services.

Bugs fixed since NetarchiveSuite 3.8.*

Common Module

Bug 555 JMS connections cannot reconnect
Bug 1218 Exception while adding listeners to JMSConnection
Bug 1275 The message limit (maxNumMsgs) of 100000 has been reached
Bug 1299 Network I/O errors shuts down JMSConnection
Bug 1298 Set JMXConnection timeout, if possible
Bug 1620 Synchronization issues in reading settings while starting BitarchiveServer
Bug 1694 LocalArcRepositoryClient is broken
Bug 1712 Starting multiple applications on one machine leads to potential failure of startup
Bug 1730 The prefix to the messages is thrown away
Bug 1767 After JMS Broker restart, the bitarchive monitors disappeared.
Bug 1785 NullPointerException at dk.netarkivet.common.utils.Settings.reload(Settings.java:385)
FR 1511 Thousand separators requested in user interface
FR 1654 second-level domains for .at in settings.xml
FR 1687 French translation
FR 1709 Module getBytesFree()
FR 1735 File headers need update
FR 1750 Italian Translation

Harvester Module

Bug  749 Job-IDs are not unique after restore from DB
Bug  928 The guess of initial size of unharvested domains is very bad on harvests with a large object limit
Bug 1073 resubmitting jobs redirects the browser to the list of all jobs
Bug 1172 password protected domain was not harvested
Bug 1174 Poor error message on dead job
Bug 1188 Heritrix side exceptions on JMX calls are ignored
Bug 1254 Database connections to MySQL close down intermittently
Bug 1611 Missing space in error message in DefinitionsSiteSection.initialize
Bug 1641 It should be possible to turn off deduplication completely
Bug 1644 On Edit Domain page, the text field only shows 21 characters of the domainname
Bug 1646 dk.netarkivet.harvester.harvesting.distribute.MetadataEntry needs toString method
Bug 1650 It is not checked when creating the Heritrix process, that the JMX password file assigned to Heritrix exists
Bug 1661 Too many warnings logged when looking up Heritrix running state
Bug 1670 Default timeout settings are set way too low in the default settings
Bug 1690 Keep track of order XML changes
Bug 1711 harvester that is not destroyable makes harvesterApplication take and immediately fail jobs
Bug 1718 The link to monitor Heritrix process does not necessarily give fully qualified hostnames
Bug 1728 The template 'ultra_big_domains.xml' does not use Decidingscope, and is not included in the bundled database
Bug 1729 Remove use of deprecated ARCWriter.write() method
Bug 1759 HarvestStatus-seeds.jsp needs update
Bug 1770 WARNING: Non-fatal error in JMX call during crawl
Bug 1771 INFO: Job ID: 3, Harvest ID: 3, http://kb-test-har-002.kb.dk:8490 null
Bug 1786 Harvest template upload failed and the reason is not shown in the errormessage
FR 1014 No good way to mark a non-reported-stopped job as FAILED or DONE
FR 1227 Log the Heritrix command line
FR 1628 Add custom JVM parameters to Heritrix subprocess
FR 1675 List of all Seeds of a selective Harvests
FR 1689 Managing crawls using object number
FR 1691 Configure which Heritrix reports to include in metadata ARC file
FR 1702 Value for max-trans-hops is way too high in default order templates
FR 1716 Increase size of input field, when uploading harvest templates
FR 1717 Increase crawler trap textarea size
FR 1723 Update the Heritrix templates in harvestdefinionbasedir/order_template_dist to Heritrix 1.14.3
FR 1726 Upgrade to wayback 1.4.2
FR 1765 Harvest documentation: domain-specific settings are looked up incorrectly
FR 1772 Input field sizes requested larger
FR 1773 Manually stopped jobs are treated as failed rather than complete

Archive Module

Bug 1078 DeDuplikator index too large (refixed)
Bug 1079 snap shot harvest not browsable due to large index (refixed)
Bug 1547 Wrong synchronization in the IndexRequestServer and the FileBasedCache let two processes generate Index at the same time, and one of them fails
Bug 1721 Batch timeout is not configurable
Bug 1722 Excessive logging in indexserver
FR 1733 Default security policies do not allow reading system properties, making batch jobs difficult to write

Access Module

Bug 1700 The WebProxy.handle() method creates CreateErrorResponse for null Uri
Bug 1707 Wayback problem associated with canonicalized urls
Bug 1726 Upgrade to wayback 1.4.2
Bug 1758 UrlCanonicalizerFactory falls back to default value silently
FR 1678 Make CDX-entries for the deduplicate entries in the crawl.log, and append to the other CDX-entries

Monitor Module

FR 1776 Make REREGISTER_DELAY customizable

Deploy Module

Bug 1705 Make jmxremote.access writable before overwriting it (install script)
Bug 1780 Deploy script is defining 'bitArchive' which should be 'bitarchive'
FR 1660 Install script cannot handle the NetarchiveSuite.zip file having a different location
FR 1745 Deploy should fail if the environment name is invalid
FR 1751 Create restart script for windows services
FR 1783 use std deploy in quickstart instead of simpel_harvest

Documentation

FR 1288 Batch and use of Tools must be described
Bug 1636 Warnings in javadoc
Bug 1710 deploy Application seems not to support multiple FTP-servers
Bug 1737 Underscore character in environment name causes wrong applicationInstanceId extraction

Upgrade instructions

Remember to stop the running installation before upgrading.

New settings

The following new settings have been introduced:

settings.common.database.validityCheckTimeout (default: 0): Timeout in seconds for checking the validity of a JDBC connection on the server. This is the time to wait for the database operation used to validate the connection to complete. If the timeout period expires before the operation completes, the connection is reported as invalid. A value of 0 indicates that no timeout is applied to the database operation.
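The description above reads like the contract of the standard JDBC method java.sql.Connection.isValid(int). A minimal sketch of such a check follows, assuming that this is the call the timeout is passed to; the class and method names are hypothetical and not taken from NetarchiveSuite.

```java
import java.sql.Connection;
import java.sql.SQLException;

final class ConnectionCheck {
    /**
     * Returns true if the connection answers a validity check within the
     * given number of seconds. A timeout of 0 means no timeout is applied.
     */
    static boolean isUsable(Connection connection, int timeoutSeconds) {
        if (connection == null) {
            return false;
        }
        try {
            return connection.isValid(timeoutSeconds);
        } catch (SQLException e) {
            // isValid throws only for a negative timeout; treat that as unusable.
            return false;
        }
    }
}
```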

settings.common.freespaceprovider.class (default: dk.netarkivet.common.utils.DefaultFreeSpaceProvider): The implementation class for the free space provider. The class must implement the FreeSpaceProvider interface.

settings.harvester.datamodel.domain.defaultMaxObjects (default: -1): The default object limit for harvests if using object limits rather than bytelimits. -1 means unlimited.

settings.harvester.scheduler.splitByObjectLimit (default: false): Split by object limits rather than byte limits.

settings.harvester.scheduler.jobtimeouttime (default: 604800, i.e. one week in seconds): After this amount of time, jobs that are in status 'Started' change to status 'Failed', since they are expected to have expired.

settings.harvester.harvesting.heritrix.javaOpts (default: ""): Additional JVM options for the Heritrix sub-process.

settings.harvester.harvesting.heritrixControllerClass (default: dk.netarkivet.harvester.harvesting.JMXHeritrixController): The implementation of the HeritrixController interface to be used.

settings.harvester.harvesting.deduplication.enabled (default: true): If false, deduplication is completely disabled.

settings.harvester.harvesting.metadata.heritrixFilePattern (default: .*(\.xml|\.txt|\.log|\.out)): Regexp matching Heritrix files to be included in metadata arc files after harvesting.

settings.harvester.harvesting.metadata.reportFilePattern (default: .*-report.txt): Regexp matching Heritrix reports to be included in metadata arc files after harvesting.

settings.harvester.harvesting.metadata.logFilePattern (default: .*(\.log|\.out)): Regexp matching Heritrix logs to be included in metadata arc files after harvesting.
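The effect of the heritrixFilePattern and logFilePattern defaults can be tried out against a few file names. This is a small sketch, assuming the patterns are applied to the whole file name; the class name, the helper method and the sample file names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class MetadataFileFilter {
    /** Default of settings.harvester.harvesting.metadata.heritrixFilePattern. */
    static final Pattern HERITRIX_FILES = Pattern.compile(".*(\\.xml|\\.txt|\\.log|\\.out)");

    /** Default of settings.harvester.harvesting.metadata.logFilePattern. */
    static final Pattern LOG_FILES = Pattern.compile(".*(\\.log|\\.out)");

    /** Returns the names that match the pattern in full, in their original order. */
    static List<String> select(Pattern pattern, List<String> names) {
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }
}
```

With the sample names order.xml, crawl.log, heritrix.out, seeds.txt and state.job, the log pattern selects crawl.log and heritrix.out, while the Heritrix file pattern selects everything except state.job.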

settings.archive.checksum.baseDir (default: checksum): The directory used to store checksums in, if using the new checksum-only replicas.

settings.archive.checksum.minSpaceLeft (default: 1000000): The minimum amount of free space that must remain for the new checksum-only replicas to accept messages.

settings.common.monitorregistryClient.reregisterdelay (default: 1): The delay in minutes between each time a JMS monitor registry client reports itself as available for monitoring.

settings.archive.bitpreservation.class (default: dk.netarkivet.archive.arcrepository.bitpreservation.FileBasedActiveBitPreservation): The class used for active bit preservation. Currently, only the default class is available, but bit preservation is planned to be handled by database-backed functionality.

settings.archive.bitpreservation.database.class (default: dk.netarkivet.harvester.datamodel.DerbyEmbeddedSpecifics): Not used yet, but available for the upcoming database-backed file preservation functionality.

settings.archive.bitpreservation.database.url (default: jdbc:derby:bitpreservationdb): Not used yet, but available for the upcoming database-backed file preservation functionality.

settings.monitor.jmxProxyTimeout (default: 500): Timeout in milliseconds before the monitor interface assumes the monitored applications are dead.

settings.wayback.urlcanonicalizer.classname (default: dk.netarkivet.wayback.batch.copycode.NetarchiveSuiteAggressiveUrlCanonicalizer): The class used for canonicalizing URLs.

New translation strings

harvester/Translations.properties

stopreason.max.domainobjects.limit.reached=Domain-config object limit reached
harvestdefinition.snapshot.header.maxobjects=Max objects
maximum.number.of.objects=Maximum number of objects
harvestdefinition.linktext.seeds=Seeds
harveststatus.seeds.total=Total
harveststatus.seeds.domains=Domains
harveststatus.seeds.seeds=Seeds
errormsg;missing.parameter.0=Required parameter {0} not given
pagetitle;seeds.for.harvestdefinition=Seeds for the harvestdefinition
harveststatus.seeds.for.harvest.0=Domain/Seeds for harvestdefinition {0}
errormsg;template.upload.failed.with.exception.0=Harvest template upload failed with exception {0}
errormsg;uploading.file.0.failed.it.does.not.exist=Uploading the file ''{0}'' failed. It does not exist
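The {0} placeholders and the doubled single quotes in these strings follow the conventions of java.text.MessageFormat. A short sketch of how such a string is rendered, assuming that MessageFormat (or something compatible) is what formats the translations; the class and method names are illustrative.

```java
import java.text.MessageFormat;

final class TranslationDemo {
    /** Fills the numbered placeholders of a translation string with the given arguments. */
    static String render(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }
}
```

For example, rendering "Uploading the file ''{0}'' failed. It does not exist" with the argument order.xml gives "Uploading the file 'order.xml' failed. It does not exist": the doubled quotes collapse to a single literal quote.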

The following string was removed:

errormsg;template.upload.failed=Harvest template upload failed

Version History

Version 3.9.*

Development versions aiming for 3.10.0

Version 3.8.2

2009-09-10

Fix an important index synchronization bug

Version 3.8.1

2009-07-15

Fix of important bug leading to unresponsive harvesters

Version 3.8.0

2009-05-23

Java 1.6, Heritrix 1.14.1, Derby 10.4.2.0, complete rewrite of settings, new supported deploy module, gui access to harvest logs

Version 3.7.0

2008-11-04

Development version aiming for 3.8.0

Version 3.6.0

2008-07-03

Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes

Version 3.5.*

Development versions aiming for 3.6.0

Version 3.4.2

2008-03-14

Bug fix release, fixing JMX timeout

Version 3.4.1

2008-01-16

Bug fix release, fixing out of memory on very large indexes

Version 3.4.0

2008-01-03

Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes

Version 3.3.*

Development versions aiming for 3.4.0

Version 3.2.3

2007-09-27

Bugfix of 3.2.2 with a patched deduplicator that fixes a problem in parallel indexing

Version 3.2.2

2007-08-03

Bugfix of 3.2.1 with a patched Heritrix 1.12.1 that supports ARCRecords larger than 2 GB

Version 3.2.1

2007-07-04

Bugfix of 3.2.0 fixing trouble using the quick start manual.

Version 3.2.0

2007-07-04

Open source release

Version 3.1.*

Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library.

Version 3.0.0

2007-02-02

Marked the naming of the NetarchiveSuite, the splitting of NetarchiveSuite into independent modules, and the licensing of NetarchiveSuite under LGPL

Version 2.*

Various features and updates

Version 2.0

2006-08-30

Marked a general restructuring of the code, where harvest definition data was backed by a database and the viewerproxy was trimmed and rewritten.

Version 1.*

Various features and updates

Version 1.0

2005-07-01

The first version of the netarchive software put in production for harvesting the entire Danish web

Version 0.*

Various pre-production development versions
