Release Notes for NetarchiveSuite 3.14.0

This version of NetarchiveSuite was released on 2010-11-12.

New features since NetarchiveSuite 3.12.*

This release has primarily focused on integrating the harvesting code implemented at BNF into the main NetarchiveSuite branch. We have also refactored our harvesting system significantly and introduced two new applications: the HarvestJobManagerApplication, which takes care of the scheduling of harvest jobs, and the HarvestMonitorApplication, which handles CrawlProgress and JobEnded messages from the harvesters. We now have an improved interface to Heritrix, and a Running Jobs overview that eases the monitoring of the running harvest jobs in your installation.

Additionally, we now require the harvestdatabase to be external (i.e. not embedded). Previously, the default harvestdatabase was an embedded Derby database.

Some work has also been done in the archive (batchGUI) and wayback packages.

Note that the code implementing FR 1774 (Stop using the JMS queues for queuing snapshot harvests) has not been removed from 3.14.0, even though it leaks memory fast enough to cause an OutOfMemoryError (OOM, Bug 2059). However, the setting settings.harvester.scheduler.singlejobdispatching defaults to false, which disables this buggy feature. If you want to experience the OOM, set this setting to true.

Note: The BatchGUI has not been translated into Italian.

The following bugs have been fixed and feature requests (FR) implemented since 3.12

Common Module

Bug 1849 Invalid javadoc for DomainUtils#DOMAINNAME_CHAR_REGEX_STRING
Bug 2002 Remove Common module reference to Archive module
Bug 2012 Mysterious upload error: Ftp transaction aborted (connection closed without indication)

FR 1929 15 second level TLD related to the .fr and .re domains
FR 1998 Change max default logfile size from 1 MB to 10 MB
FR 2056 Do not accept invalid URLs when editing the seedlist
FR 2072 Add TLD .me to the default settings.xml

Harvester Module

Bug 1851 Close connections with Heritrix on unfinished jobs
Bug 1856 Schedule problem after first start on NAS 3.10.0. No schedule started
Bug 1940 Trying to upload a global crawlertraps list with an existing name throws an ugly database error at the user
Bug 1961 Possible to store invalid set of crawlertraps to domain causing the domain to be unreadable
Bug 1964 PWC6033: Unable to compile class for JSP
Bug 1965 "Running Jobs" in the GUI does not show any jobs even though 2 jobs are running
Bug 1970 SQLDataException: The syntax of the string representation of a datetime value is incorrect.
Bug 1971 PostgreSQL does not work with current NetarchiveSuite: forgot to copy BNF change to trunk
Bug 1972 Resubmit selected failed job does not do anything and gives a webpage error
Bug 2001 No results when startdate equals enddate on page harveststatus-alljobs.jsp
Bug 2014 Logs at level WARNING when the HarvestControllerServer returns a CrawlStatusMessage for a job in status DONE
Bug 2016 Still missing Heritrix reports
Bug 2018 Scheduling is not always working
Bug 2019 Select of Harvest name may not work
Bug 2043 OutOfMemoryError while scheduling job
Bug 2058 Javadoc in CrawlProgressMessage refers to unknown HeritrixStatus class
Bug 2062 WARNING: not integer! java.lang.NumberFormatException: For input string: ""
Bug 2066 Quickstart: too long waittime before Harveststatus is "done"
Bug 2080 Missing language keys in edit configuration

FR 1134 Filter job lists by category
FR 1668 Paginate and make sortable and searchable the list of jobs
FR 1688 Monitoring broad crawls
FR 1896 Crawl of password protected FTP-sites
FR 1924 Allow searching for a domain in active jobs (in case of a webmaster complaint)
FR 1925 PostgreSQL connectivity (using the PostgreSQL driver version 8.4 - JDBC 4)
FR 1926 Ability to disable the inactivity check
FR 1927 Delay job end to allow Heritrix report generation
FR 1928 Ability to easily resubmit a selection of failed jobs (but see Bug 1972)
FR 1930 Ability to implement a different crawl control loop via HeritrixLauncher / new Heritrix JMX controller
FR 1951 Upgrade to heritrix 1.14.4
FR 1963 Make a HarvestSchedulerApplication that runs the HarvestScheduler
FR 2011 Harvest documentation : generate an optional arcfiles-report.txt
FR 2031 Use the HeritrixLauncher and HeritrixController supplied by BNF as default
FR 2071 Make HarvestMonitorServer run in a separate application

Access Module

Bug 1843 Typos and missing TODOs in standalone_archive.xml
Bug 1955 NetarchiveResourceStore fails to handle redirects
Bug 1976 Wayback fails to start with 'all_test.sh'
Bug 1980 Wayback deploy script does not build warfile

FR 1956 Shared dependency links harvester and wayback modules
FR 1959 Add wayback module to archive classpath

Archive Module

Bug 1892 Different refresh of File list and checksums checks
Bug 1938 Incomplete log in ReplicaCacheDatabase.addChecksumInformation
Bug 1945 The "add to archive" check box is inappropriately named in the new replica context
Bug 1949 Findbugs: use of known null pointer in BitpreserveFileState.processUpdateRequest
Bug 1999 Better logging for ingest of filelist and checksums
Bug 2029 No files in the checksum replica in the Bitpreservation webpage
Bug 2030 Message saturation of bitarchive applications result in OOM
Bug 2036 Poor warnings in log about missing filelist update
Bug 2041 Could not instantiate the batchjob.
Bug 2042 Only one checksum or filelist per replica at the time
Bug 2046 Exceptions thrown during a full checksum batchjob will make the system wait forever
Bug 2051 Terminate timer when batchjob timeout
Bug 2068 BatchGui: UTF8 errors and only in danish language

FR 1809 Write assignment for improving batchjob interface
FR 1817 Add Batch Post Processing
FR 1881 Quality assurance through a batchjob interface
FR 1937 Settings for archive database reconnection variables
FR 2003 Allow parameter for batchjobs through RunBatch
FR 2040 Enable retrieval of WARCRecords from the bitarchives

Documentation Module

Bug 1732 LocalArcRepositoryClient not documented
Bug 1779 Improve documentation of the additional tools
Bug 1948 Problem when running multiple batchjobs through commandline interface
Bug 1952 BitarchiveApplication depends on wayback jar
Bug 2034 Wrong syntax for attachments in Manuals

Deploy Module

Bug 1850 The replacing of Heritrix ports for the deploy test instance should be changed
Bug 2052 Wayback settings are missing from complete_settings

FR 2039 Extend deploy program to handle the deployment of an external harvest database

Upgrade instructions

Note that many new settings have been introduced in this release, and a number of new third-party libraries have been added. Most of these libraries relate to Hibernate, which is now required by the applications in the wayback package.

Three applications have been added to the deployment: the harvestdatabase, the HarvestJobManagerApplication and the HarvestMonitorApplication. The harvestdatabase, which previously could be embedded, must now be external. If you use the Derby database, you can tell deploy where to install your database, which is then started as a server during start-up.

By default, all three are deployed on the same physical server (called admin.domain below) and added to the deploy configuration as follows:

<deployGlobal>
  ..
  <deployClassPath>lib/dk.netarkivet.wayback.jar</deployClassPath>
  ..
  <thisPhysicalLocation name="K">
    <deployMachine name="admin.domain">
      <deployHarvestDatabaseDir>harvestDatabase</deployHarvestDatabaseDir>
      <settings>
        <common>
          <database>
            <class>dk.netarkivet.harvester.datamodel.DerbyServerSpecifics</class>
            <baseUrl>jdbc:derby</baseUrl>
            <machine>localhost</machine>
            <port>8118</port>
            <dir>harvestDatabase/fullhddb</dir>
          </database>
        </common>
        ..
      <applicationName name="dk.netarkivet.harvester.scheduler.HarvestJobManagerApplication">
        <deployClassPath>lib/dk.netarkivet.harvester.jar</deployClassPath>
        <deployClassPath>lib/dk.netarkivet.monitor.jar</deployClassPath>
        <settings>
          <common>
            <jmx>
              <port>8114</port>
              <rmiPort>8214</rmiPort>
            </jmx>
          </common>
        </settings>
      </applicationName>
      ..
      <applicationName name="dk.netarkivet.harvester.harvesting.monitor.HarvestMonitorApplication">
        <deployClassPath>lib/dk.netarkivet.harvester.jar</deployClassPath>
        <deployClassPath>lib/dk.netarkivet.monitor.jar</deployClassPath>
        <settings>
          <common>
            <jmx>
              <port>8115</port>
              <rmiPort>8215</rmiPort>
            </jmx>
          </common>
        </settings>
      </applicationName>

New settings in the common module

settings.common.webinterface.harvestStatus.defaultPageSize: The default number of jobs to show in the harvest status section on one result page. The default number is 100.

settings.common.database.url: Replaced by the new settings.common.database.baseUrl, settings.common.database.machine, settings.common.database.port, settings.common.database.dir. Note that currently, you need to include username/password information in the settings.common.database.dir value. An FR/bug has been created to address this.

settings.common.database.baseUrl: The baseUrl to use to connect to the database

settings.common.database.machine: The server where the harvestdatabase is installed

settings.common.database.port: The port number used to connect to the harvestdatabase

settings.common.database.dir: The directory of the harvestdatabase

settings.common.batch.batchjobs.batchjob.class: The list of batchjobs to be runnable from the GUI. Must be the complete path to the batchjob classes (e.g. dk.netarkivet.archive.arcrepository.bitpreservation.ChecksumJob). Must inherit FileBatchJob. The default is the following:

<batchjobs>
  <batchjob>
    <class>dk.netarkivet.archive.arcrepository.bitpreservation.ChecksumJob</class>
    <jarfile></jarfile>
  </batchjob>
  <batchjob>
    <class>dk.netarkivet.archive.arcrepository.bitpreservation.FileListJob</class>
    <jarfile></jarfile>
  </batchjob>
</batchjobs>

settings.common.batch.batchjobs.batchjob.jarfile: The list of the corresponding jar-files containing the batchjobs. This will be used for LoadableJarBatchJobs. If no file is specified, it is assumed that the batchjob exists on the default classpath of the involved applications (BitarchiveMonitor, ArcRepository, GUIWebServer and BitArchive).

settings.common.batch.baseDir: The directory where the resulting files will be placed when running a batchjob through the GUI. The default is the relative dir "batch".

Note that the setting settings.common.database.backupInitHour has been removed, because all database backup is now done outside NetarchiveSuite.

Note that the default for setting settings.archive.bitarchive.singleChecksumTimeout is now 600000 (10 minutes).

New settings in the harvester module

settings.harvester.datamodel.domain.validSeedregex: The regular expression used to validate a seed within a seedlist. The default value (^.*$) accepts all non-empty strings.
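As a minimal sketch of how this setting might appear in settings.xml, assuming the dotted setting name maps onto nested XML elements in the usual way: the fragment below only spells out the default. The stricter pattern in the comment is a hypothetical illustration, not a recommended value, and it would reject seeds written without a scheme.

```xml
<settings>
  <harvester>
    <datamodel>
      <domain>
        <!-- Default value as stated in these release notes. -->
        <validSeedregex>^.*$</validSeedregex>
        <!-- Hypothetical stricter alternative (illustration only):
             <validSeedregex>^(https?|ftp)://.+$</validSeedregex> -->
      </domain>
    </datamodel>
  </harvester>
</settings>
```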

settings.harvester.scheduler.dispatchperiode: The period between checks for new jobs to dispatch to the harvest servers. New jobs are dispatched if the relevant harvest job queue is empty and new jobs exist for this queue. The default is set to 5 seconds, based on an estimate of the harvest servers' ability to consume messages.

settings.harvester.scheduler.jobgenerationperiode: The period between checks for new jobs to generate. The default is one minute, because that is the finest granularity we can define in a harvest definition.

settings.harvester.scheduler.singlejobdispatching: If true, new jobs are dispatched when a Harvester is ready; otherwise, jobs are dispatched to the job queue as soon as they are generated. Note: The ability to switch off singlejobdispatching was introduced because of Bug 2059, a memory leak caused by the singlejobdispatching functionality. The default is false (i.e. singlejobdispatching disabled).
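If you want to pin this behaviour explicitly rather than rely on the default, the override could look like the sketch below. It assumes the dotted setting name maps onto nested elements in settings.xml; only the value itself comes from these notes.

```xml
<settings>
  <harvester>
    <scheduler>
      <!-- false (the default) keeps single-job dispatching off,
           which avoids the memory leak described in Bug 2059. -->
      <singlejobdispatching>false</singlejobdispatching>
    </scheduler>
  </harvester>
</settings>
```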

settings.harvester.monitor.historySampleRate: The time interval in seconds between storing historical records in the DB. The default value is 5 minutes (300 seconds).

settings.harvester.monitor.historyChartGenInterval: The time interval in seconds between regenerating the chart of historical data for a running job. The default value is 5 minutes (300 seconds).

settings.harvester.monitor.displayedHistorySize: The maximum number of most recent history records displayed on the running job details page. The default value is 30.

settings.harvester.harvesting.frontier.frontierReportWaitTime: The time interval in seconds to wait between two requests to generate a full frontier report. Default value is 600 seconds (10 min).

settings.harvester.harvesting.frontier.filter.class: Defines a filter to apply to the full frontier report. The default is dk.netarkivet.harvester.harvesting.frontier.TopTotalEnqueuesFilter.

settings.harvester.harvesting.frontier.filter.args: The arguments for the frontier report filter, separated by semicolons. By default, there are no arguments.

settings.harvester.monitorResetInterval: The time interval in seconds after which the HarvestMonitorServer will reset the job state data. This is a simple way to detect the end of a job. The default is 300 seconds (5 minutes).

settings.harvester.harvesting.heritrix.crawlLoopWaitTime: Time interval in seconds to wait during a crawl loop in the harvest controller. The default is 20 seconds.

settings.harvester.harvesting.heritrix.abortIfConnectionLost: A boolean flag. If set to true, the harvest controller will abort the current crawl when the JMX connection is lost. If set to false, it will only log a warning, leaving it to the crawl operator to shut down the harvester manually. Used only by the BnfHeritrixController. The default is true.

settings.harvester.harvesting.heritrix.waitForReportGenerationTimeout: Maximum time in seconds to wait for Heritrix to generate report files once crawling is over. The default is 600 seconds (10 minutes).
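For orientation, the three Heritrix-related settings above could be written out as follows. This is a sketch that assumes the dotted names map directly onto nested elements in settings.xml; the values are the defaults listed above (times in seconds).

```xml
<settings>
  <harvester>
    <harvesting>
      <heritrix>
        <!-- Seconds to wait in each crawl loop iteration. -->
        <crawlLoopWaitTime>20</crawlLoopWaitTime>
        <!-- Abort the crawl if the JMX connection is lost. -->
        <abortIfConnectionLost>true</abortIfConnectionLost>
        <!-- Seconds to wait for Heritrix report files after the crawl. -->
        <waitForReportGenerationTimeout>600</waitForReportGenerationTimeout>
      </heritrix>
    </harvesting>
  </harvester>
</settings>
```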

settings.harvester.harvesting.heritrixLauncherClass: The implementation of the HeritrixLauncher abstract class to be used. The default is dk.netarkivet.harvester.harvesting.controller.BnfHeritrixLauncher.

Note: the default of the setting settings.harvester.harvesting.heritrixController.class is now dk.netarkivet.harvester.harvesting.controller.BnfHeritrixController.

New settings in the archive module

settings.archive.admin.database.reconnectMaxRetries: The maximum number of attempts to reconnect to the admin database. The default is 5.

settings.archive.admin.database.reconnectRetryDelay: The delay between attempts to reconnect to the admin database. The default is 5 minutes.

New settings in the wayback module

Many new settings have been added to configure Hibernate and c3p0, the database connection pool used by Hibernate. Documentation for c3p0 can be found here:
About configuring c3p0: http://www.mchange.com/projects/c3p0/index.html#configuration_properties
About c3p0 in general: http://www.mchange.com/projects/c3p0/index.html

settings.wayback.hibernate.c3p0.acquireIncrement: Determines how many connections at a time c3p0 will try to acquire when the pool is exhausted. Defines the value of hibernate configuration key "hibernate.c3p0.acquire_increment" which is the same as the c3p0-native property name "c3p0.acquireIncrement". The default is 1.

settings.wayback.hibernate.c3p0.idleTestPeriod: If this is a number greater than 0, c3p0 will test all idle, pooled but unchecked-out connections at this interval, given in seconds. Defines the value of hibernate configuration key "hibernate.c3p0.idle_test_period" which is the same as the c3p0-native property name "c3p0.idleConnectionTestPeriod". The default is 100.

settings.wayback.hibernate.c3p0.maxSize: Maximum number of Connections a pool will maintain at any given time. Defines the value of hibernate configuration key "hibernate.c3p0.max_size" which is the same as the c3p0-native property name "c3p0.maxPoolSize". The default is 100.

settings.wayback.hibernate.c3p0.maxStatements: The size of c3p0's global PreparedStatement cache. Defines the value of hibernate configuration key "hibernate.c3p0.max_statements" which is the same as the c3p0-native property name "c3p0.maxStatements". The default is 100.

settings.wayback.hibernate.c3p0.minSize: Minimum number of Connections a pool will maintain at any given time. Defines the value of hibernate configuration key "hibernate.c3p0.min_size" which is the same as the c3p0-native property name "c3p0.minPoolSize". The default is 10.

settings.wayback.hibernate.c3p0.timeout: The number of seconds a connection can remain pooled but unused before being discarded. Defines the value of hibernate configuration key "hibernate.c3p0.timeout" which is the same as the c3p0-native property name "c3p0.maxIdleTime". The default is 100.
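Taken together, the c3p0 pool settings above might be written as in the sketch below. It assumes the dotted names map onto nested elements in settings.xml; every value is the default listed above, so the fragment changes nothing by itself and serves only as a starting point for tuning.

```xml
<settings>
  <wayback>
    <hibernate>
      <c3p0>
        <!-- Connections acquired at a time when the pool is exhausted. -->
        <acquireIncrement>1</acquireIncrement>
        <!-- Seconds between tests of idle connections. -->
        <idleTestPeriod>100</idleTestPeriod>
        <!-- Upper and lower bounds on the pool size. -->
        <maxSize>100</maxSize>
        <minSize>10</minSize>
        <!-- Size of the global PreparedStatement cache. -->
        <maxStatements>100</maxStatements>
        <!-- Maps to c3p0.maxIdleTime. -->
        <timeout>100</timeout>
      </c3p0>
    </hibernate>
  </wayback>
</settings>
```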

settings.wayback.hibernate.connectionUrl: The hibernate connection url. The default is "jdbc:derby:derbyDB/wayback_indexer_db;create=true"

settings.wayback.hibernate.dbDriverClass: The hibernate client driver class. The default is "org.apache.derby.jdbc.ClientDriver"

settings.wayback.hibernate.useReflectionOptimizer: See the Hibernate documentation for its meaning. The default is "false".

settings.wayback.hibernate.transactionFactory: See the Hibernate documentation for its meaning. The default is org.hibernate.transaction.JDBCTransactionFactory.

settings.wayback.hibernate.dialect: See the Hibernate documentation for its meaning. The default is "org.hibernate.dialect.DerbyDialect".

settings.wayback.hibernate.showSql: See the Hibernate documentation for its meaning. The default is "true".

settings.wayback.hibernate.formatSql: See the Hibernate documentation for its meaning. The default is "true".

settings.wayback.hibernate.hbm2ddlAuto: See the Hibernate documentation for its meaning. The default is "update".

settings.wayback.hibernate.user: See the Hibernate documentation for its meaning. The default is "".

settings.wayback.hibernate.password: See the Hibernate documentation for its meaning. The default is "".

settings.wayback.indexer.replicaId: The replica to be used by the wayback indexer. Default value is "ONE".

settings.wayback.indexer.tempBatchOutputDir: The directory to which batch output is written during indexing. The default value is "tempdir".

settings.wayback.indexer.finalBatchOutputDir: The directory to which batch output is moved after a batch indexing job is successfully completed. The default value is "batchOutputDir".

settings.wayback.indexer.maxFailedAttempts: The maximum number of times an archive file may generate a batch error during indexing before we give up on it. The default value is "3".

settings.wayback.indexer.producerDelay: The delay in milliseconds before the producer thread is started. The default value is "0".

settings.wayback.indexer.producerInterval: The interval, in milliseconds, between successive runs of the producer thread. The default value is "86400000" (24 hours).

settings.wayback.indexer.consumerThreads: The number of consumer threads to run. The default value is "5".

settings.wayback.indexer.initialFiles: A file containing a list of files which have been archived and therefore do not need to be archived again. This key may be unset. The default value is "".
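The indexer settings above could be collected in settings.xml roughly as follows. This is a sketch under the assumption that the dotted names map onto nested elements; all values are the defaults listed above, and initialFiles is left out because it may be unset.

```xml
<settings>
  <wayback>
    <indexer>
      <replicaId>ONE</replicaId>
      <tempBatchOutputDir>tempdir</tempBatchOutputDir>
      <finalBatchOutputDir>batchOutputDir</finalBatchOutputDir>
      <maxFailedAttempts>3</maxFailedAttempts>
      <!-- Milliseconds before the producer thread starts. -->
      <producerDelay>0</producerDelay>
      <!-- Milliseconds between producer runs: 86400000 ms = 24 hours. -->
      <producerInterval>86400000</producerInterval>
      <consumerThreads>5</consumerThreads>
    </indexer>
  </wayback>
</settings>
```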

settings.wayback.aggregator.indexFileInputDir: The directory the Aggregator consumes raw index files from. The default value is "batchOutputDir".

settings.wayback.aggregator.indexFileOutputDir: The directory the Aggregator places the aggregated and sorted files into. The default value is "indexDir".

settings.wayback.aggregator.tempAggregatorDir: The directory used by the aggregator to store temporary files. The default value is "aggregator_tempdir".

settings.wayback.aggregator.aggregationInterval: The time between scheduled aggregation runs (in milliseconds). The default value is "86400000" (24 hours).

settings.wayback.aggregator.maxIntermediateIndexFileSize: The maximum size of the intermediate index file in MB. When this limit is reached a new index file is created and new indexes are added to this file. In case of a 0 value, the intermediate index file will always be merged into the main index file. The default value is "102400".

settings.wayback.aggregator.maxMainIndexFileSize: The maximum size of the main wayback index file in MB. When this limit is reached a new index file is created and new indexes are added to this file. The old index file will be renamed to ${finalIndexFileSizeLimit}.1. The default value is "104857600".

New tables in the harvest database

frontierReportMonitor
runningJobsMonitor
runningJobsHistory
global_crawler_trap_expressions
global_crawler_trap_lists

Note that the tables listed above should be added automatically if you are using MySQL or Derby. If you are using PostgreSQL, please contact Nicolas Giraud at BNF for further instructions.

Version History

Version 3.14.0	2010-11-12	Added running jobs overview, and batchGUI; fixed major OOM problem in the batch monitoring code
Version 3.12.0	2010-05-03	New Bitpreservation infrastructure, and upgrade of Apache Derby to version 10.5.3.0
Version 3.11.*		Development versions aiming for 3.12.0
Version 3.10.0	2009-11-16	New deploy application; JMX stability issues fixed; JMS stability issues also fixed
Version 3.9.*		Development versions aiming for 3.10.0
Version 3.8.2	2009-09-10	Fix an important index synchronization bug
Version 3.8.1	2009-07-15	Fix of important bug leading to unresponsive harvesters
Version 3.8.0	2009-05-23	Java 1.6, Heritrix 1.14.1, Derby 10.4.2.0, complete rewrite of settings, new supported deploy module, gui access to harvest logs
Version 3.7.0	2008-11-04	Development version aiming for 3.8.0
Version 3.6.0	2008-07-03	Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes
Version 3.5.*		Development versions aiming for 3.6.0
Version 3.4.2	2008-03-14	Bug fix release, fixing JMX timeout
Version 3.4.1	2008-01-16	Bug fix release, fixing out of memory on very large indexes
Version 3.4.0	2008-01-03	Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes
Version 3.3.*		Development versions aiming for 3.4.0
Version 3.2.3	2007-09-27	Bugfix of 3.2.2 with patched deduplicator, that fixes problem in parallel indexing
Version 3.2.2	2007-08-03	Bugfix of 3.2.1 with patched Heritrix 1.12.1, that supports ARCRecords larger than 2 GB
Version 3.2.1	2007-07-04	Bugfix of 3.2.0 fixing trouble using the quick start manual
Version 3.2.0	2007-07-04	Open source release
Version 3.1.*		Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library.
Version 3.0.0	2007-02-02	Marked the naming of the NetarchiveSuite, the splitting of NetarchiveSuite into independent modules, and the licensing of NetarchiveSuite under LGPL
Version 2.*		Various features and updates
Version 2.0	2006-08-30	Marked a general restructuring of the code, where harvest definition data was backed by a database, the viewerproxy was trimmed and rewritten.
Version 1.*		Various features and updates
Version 1.0	2005-07-01	The first version of the netarchive software put into production for harvesting the entire Danish web
Version 0.*		Various pre-production development versions