= Release Notes for NetarchiveSuite 3.10.0 =
This version of !NetarchiveSuite was released on 2009-11-16.

== New features since NetarchiveSuite 3.8.* ==
Apart from general bug fixes (see below), the most important new features are:

=== General ===
This is a great release for NetarchiveSuite, because so much of the code has been contributed by new partners in the project! Thank you very much, and welcome to the national libraries in France and Austria for their great contributions!

=== Common Module ===
It is now possible to override the implementation of the getBytesFree() method, which by default is calculated using the standard Java method File.getUsableSpace().

We now have translations in Italian and French, thanks to the kind donation from the National Libraries in Austria and France.

=== Harvester Module ===
NetarchiveSuite now works properly with MySQL (see bug 1254).

A deadline has now been introduced in the HarvestScheduler for jobs in status STARTED, by default 1 week.

Deduplication is now completely optional, and can be turned off with a setting.

It is now possible to use number of objects as harvest limits, as well as size in bytes.

The harvest reports included in the metadata files uploaded to the archive are now configurable. This also makes it possible to include in the metadata any order.xml file that has been changed during harvesting using the Heritrix web interface (these are included by default).

A new page lists all seeds for a given harvest definition.

Fewer jobs will be included in later harvests when doing a step-by-step snapshot harvest, since aborted jobs are now considered complete.

=== Archive Module ===
Some excessive logging has been removed, and the previously fixed indexing bugs 1078 and 1079 were reopened and fixed again.

Start of a derived archive type that stores only checksums of files instead of the entire files, for better bit preservation functionality. This is still highly experimental.
Start of database-based active bit preservation functionality. This is not functional yet, but work has started.

=== Access Module ===
Code for supporting Wayback is now in NetarchiveSuite. Current support includes a batch job for generating CDXs for use with Wayback, including deduplicated entries, and a ResourceStore for use with Wayback that uses the NetarchiveSuite bitarchive. Setting up Wayback is still a somewhat manual process; please refer to the documentation.

=== Deploy ===
The deploy module now generates files for use when starting bitarchive applications as Windows services.

== Bugs fixed since NetarchiveSuite 3.8.* ==
=== Common Module ===
{{{
Bug 555 JMS connections cannot reconnect
Bug 1218 Exception while adding listeners to JMSConnection
Bug 1275 The message limit (maxNumMsgs) of 100000 has been reached
Bug 1299 Network I/O errors shuts down JMSConnection
Bug 1298 Set JMXConnection timeout, if possible
Bug 1620 Synchronization issues in reading settings while starting BitarchiveServer
Bug 1694 LocalArcRepositoryClient is broken
Bug 1712 Starting multiple applications on one machine leads to potential failure of startup
Bug 1730 The prefix to the messages is thrown away
Bug 1767 After JMS Broker restart, the bitarchive monitors disappeared.
Bug 1785 NullPointerException at dk.netarkivet.common.utils.Settings.reload(Settings.java:385)
FR 1511 Thousand separators requested in user interface
FR 1654 second-level domains for .at in settings.xml
FR 1687 French translation
FR 1709 Module getBytesFree()
FR 1735 File headers need update
FR 1750 Italian Translation
}}}

=== Harvester Module ===
{{{
Bug 749 Job-IDs are not unique after restore from DB
Bug 928 The guess of initial size of unharvested domains is very bad on harvests with a large object limit
Bug 1073 resubmitting jobs redirects the browser to the list of all jobs
Bug 1172 password protected domain was not harvested
Bug 1174 Poor error message on dead job
Bug 1188 Heritrix side exceptions on JMX calls are ignored
Bug 1254 Database connections to MySQL close down intermittently
Bug 1611 Missing space in error message in DefinitionsSiteSection.initialize
Bug 1641 It should be possible to turn off deduplication completely
Bug 1644 On Edit Domain page, the text field only shows 21 characters of the domainname
Bug 1646 dk.netarkivet.harvester.harvesting.distribute.MetadataEntry needs toString method
Bug 1650 It is not checked when creating the Heritrix process, that the JMX password file assigned to Heritrix exists
Bug 1661 Too many warnings logged when looking up Heritrix running state
Bug 1670 Default timeout settings are set way too low in the default settings
Bug 1690 Keep track of order XML changes
Bug 1711 harvester that is not destroyable makes harvesterApplication take and immediately fail jobs
Bug 1718 The link to monitor Heritrix process does not necessarily give fully qualified hostnames
Bug 1728 The template 'ultra_big_domains.xml' does not use Decidingscope, and is not included in the bundled database
Bug 1729 Remove use of deprecated ARCWriter.write() method
Bug 1759 HarvestStatus-seeds.jsp needs update
Bug 1770 WARNING: Non-fatal error in JMX call during crawl
Bug 1771 INFO: Job ID: 3, Harvest ID: 3, http://kb-test-har-002.kb.dk:8490 null
Bug 1786 Harvest template upload failed and the reason is not shown in the errormessage
FR 1014 No good way to mark a non-reported-stopped job as FAILED or DONE
FR 1227 Log the Heritrix command line
FR 1628 Add custom JVM parameters to Heritrix subprocess
FR 1675 List of all Seeds of a selective Harvests
FR 1689 Managing crawls using object number
FR 1691 Configure which Heritrix reports to include in metadata ARC file
FR 1702 Value for max-trans-hops is way too high in default order templates
FR 1716 Increase size of input field, when uploading harvest templates
FR 1717 Increase crawler trap textarea size
FR 1723 Update the Heritrix templates in harvestdefinionbasedir/order_template_dist to Heritrix 1.14.3
FR 1726 Upgrade to wayback 1.4.2
FR 1765 Harvest documentation: domain-specific settings are looked up incorrectly
FR 1772 Input field sizes requested larger
FR 1773 Manually stopped jobs are treated as failed rather than complete
}}}

=== Archive Module ===
{{{
Bug 1078 DeDuplikator index too large (refixed)
Bug 1079 snap shot harvest not browsable due to large index (refixed)
Bug 1547 Wrong synchronization in the IndexRequestServer and the FileBasedCache let two processes generate Index at the same time, and one of them fails
Bug 1721 Batch timeout is not configurable
Bug 1722 Excessive logging in indexserver
FR 1733 Default security policies do not allow reading system properties, making batch jobs difficult to write
}}}

=== Access Module ===
{{{
Bug 1700 The WebProxy.handle() method creates CreateErrorResponse for null Uri
Bug 1707 Wayback problem associated with canonicalized urls
Bug 1726 Upgrade to wayback 1.4.2
Bug 1758 UrlCanonicalizerFactory falls back to default value silently
FR 1678 Make CDX-entries for the deduplicate entries in the crawl.log, and append to the other CDX-entries
}}}

=== Monitor Module ===
{{{
FR 1776 Make REREGISTER_DELAY customizable
}}}

=== Deploy Module ===
{{{
Bug 1705 Make jmxremote.access writable before overwriting it (install script)
Bug 1780 Deploy script is defining 'bitArchive' which should be 'bitarchive'
FR 1660 Install script cannot handle the NetarchiveSuite.zip file having a different location
FR 1745 Deploy should fail if the environment name is invalid
FR 1751 Create restart script for windows services
FR 1783 use std deploy in quickstart instead of simpel_harvest
}}}

=== Documentation ===
{{{
FR 1288 Batch and use of Tools must be described
Bug 1636 Warnings in javadoc
Bug 1710 deploy Application seems not to support multiple FTP-servers
Bug 1737 Underscore character in environment name causes wrong applicationInstanceId extraction
}}}

== Upgrade instructions ==
Remember to stop the running installation before upgrading.

=== New settings ===
The following new settings have been introduced:

''settings.common.database.validityCheckTimeout'' (default: 0): Timeout, in seconds, for checking the validity of a JDBC connection on the server, i.e. the time to wait for the database operation used to validate the connection to complete. If the timeout expires before the operation completes, the connection is treated as invalid. A value of 0 means that no timeout is applied to the database operation.

''settings.common.freespaceprovider.class'' (default: dk.netarkivet.common.utils.DefaultFreeSpaceProvider): The implementation class for the free space provider, e.g. dk.netarkivet.common.utils.DefaultFreeSpaceProvider. The class must implement the FreeSpaceProvider interface.

''settings.harvester.datamodel.domain.defaultMaxObjects'' (default: -1): The default object limit for harvests if using object limits rather than byte limits. -1 means unlimited.

''settings.harvester.scheduler.splitByObjectLimit'' (default: false): Split by object limits rather than byte limits.

''settings.harvester.scheduler.jobtimeouttime'' (default: 604800 seconds, i.e. one week): After this amount of time, jobs that are in status 'Started' change to status 'Failed', since they are expected to have expired.
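As a rough illustration of the ''settings.harvester.scheduler.jobtimeouttime'' setting, the following Java sketch checks whether a job in status 'Started' has exceeded the default timeout of 604800 seconds. The class and method names (JobTimeoutSketch, isTimedOut) are hypothetical and not part of NetarchiveSuite; the actual HarvestScheduler may implement the check differently.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical illustration only; not NetarchiveSuite code. */
final class JobTimeoutSketch {
    /** Default value of settings.harvester.scheduler.jobtimeouttime, in seconds. */
    static final long DEFAULT_JOB_TIMEOUT_SECONDS = 604800L;

    /**
     * Returns true if a job started at 'started' has, at time 'now',
     * been running for longer than 'timeoutSeconds'.
     */
    static boolean isTimedOut(Instant started, Instant now, long timeoutSeconds) {
        return Duration.between(started, now).getSeconds() > timeoutSeconds;
    }

    public static void main(String[] args) {
        Instant started = Instant.parse("2009-11-01T00:00:00Z");
        // Six days after start: still within the one-week default.
        System.out.println(isTimedOut(started, started.plus(Duration.ofDays(6)),
                DEFAULT_JOB_TIMEOUT_SECONDS)); // false
        // Eight days after start: past the default, so the job would be failed.
        System.out.println(isTimedOut(started, started.plus(Duration.ofDays(8)),
                DEFAULT_JOB_TIMEOUT_SECONDS)); // true
    }
}
```

Under this reading of the setting, a scheduler that finds a 'Started' job for which the check returns true would move it to status 'Failed'.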
''settings.harvester.harvesting.heritrix.javaOpts'' (default: ""): Additional JVM options for the Heritrix sub-process.

''settings.harvester.harvesting.heritrixControllerClass'' (default: dk.netarkivet.harvester.harvesting.JMXHeritrixController): The implementation of the HeritrixController interface to be used.

''settings.harvester.harvesting.deduplication.enabled'' (default: true): If false, deduplication is completely disabled.

''settings.harvester.harvesting.metadata.heritrixFilePattern'' (default: .*(\.xml|\.txt|\.log|\.out)): Regexp matching Heritrix files to be included in metadata ARC files after harvesting.

''settings.harvester.harvesting.metadata.reportFilePattern'' (default: .*-report.tx): Regexp matching Heritrix reports to be included in metadata ARC files after harvesting.

''settings.harvester.harvesting.metadata.logFilePattern'' (default: .*(\.log|\.out)): Regexp matching Heritrix logs to be included in metadata ARC files after harvesting.

''settings.archive.checksum.baseDir'' (default: checksum): The directory used to store checksums in, if using the new checksum-only replicas.

''settings.archive.checksum.minSpaceLeft'' (default: 1000000): The minimum amount of free space required for the new checksum-only replicas to receive messages.

''settings.common.monitorregistryClient.reregisterdelay'' (default: 1): The delay in minutes between each time a JMS monitor registry client reports itself as available for monitoring.

''settings.archive.bitpreservation.class'' (default: dk.netarkivet.archive.arcrepository.bitpreservation.FileBasedActiveBitPreservation): The class used for active bit preservation. Currently, only the default class is available, but bit preservation is planned to be handled by database-backed functionality.

''settings.archive.bitpreservation.database.class'' (default: dk.netarkivet.harvester.datamodel.DerbyEmbeddedSpecifics): Not used yet, but available for the upcoming database-backed file preservation functionality.
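To get a feel for what the default ''heritrixFilePattern'' and ''logFilePattern'' values select, the following Java sketch applies them to a few file names. It assumes that a file is included when the whole file name matches the pattern, which may not be exactly how NetarchiveSuite applies the settings. The class name MetadataPatternSketch and the sample file names other than order.xml and crawl.log are illustrative assumptions.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration only; not NetarchiveSuite code. */
final class MetadataPatternSketch {
    // Default values as quoted in the settings list.
    static final Pattern HERITRIX_FILE_PATTERN =
            Pattern.compile(".*(\\.xml|\\.txt|\\.log|\\.out)");
    static final Pattern LOG_FILE_PATTERN =
            Pattern.compile(".*(\\.log|\\.out)");

    /** True if the entire file name matches the pattern (full match is an assumption). */
    static boolean included(Pattern pattern, String fileName) {
        return pattern.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Sample names are illustrative, not an exhaustive list of Heritrix output.
        String[] names = {"order.xml", "seeds.txt", "crawl.log", "heritrix.out", "state.job"};
        for (String name : names) {
            System.out.println(name
                    + " heritrixFile=" + included(HERITRIX_FILE_PATTERN, name)
                    + " logFile=" + included(LOG_FILE_PATTERN, name));
        }
    }
}
```

With these defaults, every name matched by the log pattern is also matched by the Heritrix file pattern, since the log pattern's suffixes are a subset of the other's.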
''settings.archive.bitpreservation.database.url'' (default: jdbc:derby:bitpreservationdb): Not used yet, but available for the upcoming database-backed file preservation functionality.

''settings.monitor.jmxProxyTimeout'' (default: 500): Timeout in milliseconds before the monitor interface assumes the monitored applications are dead.

''settings.wayback.urlcanonicalizer.classname'' (default: dk.netarkivet.wayback.batch.copycode.NetarchiveSuiteAggressiveUrlCanonicalizer): The class used for canonicalizing URLs.

=== New translation strings ===
'''harvester/Translations.properties'''
{{{
stopreason.max.domainobjects.limit.reached=Domain-config object limit reached
harvestdefinition.snapshot.header.maxobjects=Max objects
maximum.number.of.objects=Maximum number of objects
harvestdefinition.linktext.seeds=Seeds
harveststatus.seeds.total=Total
harveststatus.seeds.domains=Domains
harveststatus.seeds.seeds=Seeds
errormsg;missing.parameter.0=Required parameter {0} not given
pagetitle;seeds.for.harvestdefinition=Seeds for the harvestdefinition
harveststatus.seeds.for.harvest.0=Domain/Seeds for harvestdefinition {0}
errormsg;template.upload.failed.with.exception.0=Harvest template upload failed with exception {0}
errormsg;uploading.file.0.failed.it.does.not.exist=Uploading the file ''{0}'' failed. It does not exist
}}}

The following string was removed:
{{{
errormsg;template.upload.failed=Harvest template upload failed
}}}

== Version History ==
||Version 3.9.* || ||Development versions aiming for 3.10.0 ||
||Version 3.8.2 ||2009-09-10 ||Fix an important index synchronization bug ||
||Version 3.8.1 ||2009-07-15 ||Fix of important bug leading to unresponsive harvesters ||
||Version 3.8.0 ||2009-05-23 ||Java 1.6, Heritrix 1.14.1, Derby 10.4.2.0, complete rewrite of settings, new supported deploy module, gui access to harvest logs ||
||Version 3.7.0 ||2008-11-04 ||Development version aiming for 3.8.0 ||
||Version 3.6.0 ||2008-07-03 ||Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes ||
||Version 3.5.* || ||Development versions aiming for 3.6.0 ||
||Version 3.4.2 ||2008-03-14 ||Bug fix release, fixing JMX timeout ||
||Version 3.4.1 ||2008-01-16 ||Bug fix release, fixing out of memory on very large indexes ||
||Version 3.4.0 ||2008-01-03 ||Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes ||
||Version 3.3.* || ||Development versions aiming for 3.4.0 ||
||Version 3.2.3 ||2007-09-27 ||Bugfix of 3.2.2 with patched deduplicator, that fixes problem in parallel indexing ||
||Version 3.2.2 ||2007-08-03 ||Bugfix of 3.2.1 with patched Heritrix 1.12.1, that supports ARCRecords larger than 2GBs ||
||Version 3.2.1 ||2007-07-04 ||Bugfix of 3.2.0 fixing trouble using the quick start manual. ||
||Version 3.2.0 ||2007-07-04 ||Open source release ||
||Version 3.1.* || ||Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library. ||
||Version 3.0.0 ||2007-02-02 ||Marked the naming of the !NetarchiveSuite, the splitting of !NetarchiveSuite into independent modules, and the licensing of !NetarchiveSuite under LGPL ||
||Version 2.* || ||Various features and updates ||
||Version 2.0 ||2006-08-30 ||Marked a general restructuring of the code, where harvest definition data was backed by a database, and the viewerproxy was trimmed and rewritten. ||
||Version 1.* || ||Various features and updates ||
||Version 1.0 ||2005-07-01 ||The first version of the netarchive software put in production for harvesting the entire Danish web ||
||Version 0.* || ||Various pre-production development versions ||