= Release Notes for NetarchiveSuite 3.13.1 =
This version of !NetarchiveSuite was released on 2010-09-08.

== New features since NetarchiveSuite 3.12.* ==
This release has primarily focused on integrating the harvesting code implemented at BNF into the main !NetarchiveSuite branch. Additionally, some work has been done in the archive and wayback packages.

The following bugs and feature requests (FR) have been resolved since 3.12:

=== Common Module ===
{{{
FR 1929 15 second-level TLDs related to the .fr and .re domains
FR 1998 Change max default logfile size from 1 MB to 10 MB
Bug 1849 Invalid javadoc for DomainUtils#DOMAINNAME_CHAR_REGEX_STRING
Bug 2002 Remove Common module reference to Archive module
}}}

=== Harvester Module ===
{{{
Bug 1856 Schedule problem after first start on NAS 3.10.0. No schedule started
Bug 1940 Trying to upload a global crawlertraps list with an existing name throws an ugly database error at the user
Bug 1961 Possible to store invalid set of crawlertraps to domain causing the domain to be unreadable
Bug 1964 PWC6033: Unable to compile class for JSP
Bug 1971 PostgreSQL does not work with current NetarchiveSuite: forgot to copy BNF change to trunk
Bug 2014 Logs at level WARNING when the HarvestControllerServer returns a CrawlStatusMessage for a job in status DONE
FR 1134 Filter job lists by category
FR 1668 Paginate and make sortable and searchable the list of jobs
FR 1774 Stop using the JMS queues for queuing snapshot harvests
FR 1688 Monitoring broad crawls
FR 1924 Allow searching for a domain in active jobs (in case of a webmaster complaint)
FR 1925 PostgreSQL connectivity (using the PostgreSQL driver version 8.4 - JDBC 4)
FR 1927 Delay job end to allow Heritrix report generation
FR 1928 Ability to easily resubmit a selection of failed jobs (but see Bug 1972)
FR 1930 Ability to implement a different crawl control loop via HeritrixLauncher / new Heritrix JMX controller
FR 1951 Upgrade to heritrix 1.14.4
FR 2031 Use the HeritrixLauncher and HeritrixController supplied by BNF as default
}}}

=== Access Module ===
{{{
Bug 1843 Typos and missing TODOs in standalone_archive.xml
Bug 1955 NetarchiveResourceStore fails to handle redirects
Bug 1976 Wayback fails to start with 'all_test.sh'
Bug 1980 Wayback deploy script does not build warfile
FR 1956 Shared dependency links harvester and wayback modules
FR 1959 Add wayback module to archive classpath
}}}

=== Archive Module ===
{{{
FR 1937 Settings for archive database reconnection variables
FR 1809 Write assignment for improving batchjob interface
FR 1817 Add Batch Post Processing
FR 1881 Quality assurance through a batchjob interface
FR 2003 Allow parameter for batchjobs through RunBatch
Bug 1892 Different refresh of File list and checksums checks
Bug 1938 Incomplete log in ReplicaCacheDatabase.addChecksumInformation
Bug 1945 The "add to archive" check box is inappropriately named in the new replica context
Bug 1949 Findbugs: use of known null pointer in BitpreserveFileState.processUpdateRequest
Bug 1999 Better logging for ingest of filelist and checksums
}}}

=== Documentation Module ===
{{{
Bug 1732 LocalArcRepositoryClient not documented
Bug 1779 Improve documentation of the additional tools
Bug 1948 Problem when running multiple batchjobs through commandline interface
Bug 1952 BitarchiveApplication depends on wayback jar
Bug 2034 Wrong syntax for attachments in Manuals
}}}

=== Deploy Module ===
{{{
Bug 1850 The replacing of Heritrix ports for the deploy test instance should be changed
}}}

== Upgrade instructions ==
Note that many new settings have been introduced in this release, and a number of new third-party libraries have been added, primarily related to Hibernate, which is now required by the applications in the wayback package.

=== New settings in the common module ===
'''settings.common.webinterface.harvestStatus.defaultPageSize''': The number of jobs shown per result page in the harvest status section. The default is 100.
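
To illustrate how a page size such as `defaultPageSize` divides a job list into result pages, here is a minimal Java sketch. The class `JobPaging` and its methods are hypothetical and are not part of !NetarchiveSuite; only the default value of 100 is taken from these notes.

```java
// Hypothetical helper, NOT NetarchiveSuite code: shows the arithmetic behind
// paginating a job list with a configurable page size.
final class JobPaging {
    /** Default page size as stated in the release notes. */
    static final int DEFAULT_PAGE_SIZE = 100;

    /** Number of result pages needed to show totalJobs jobs. */
    static int pageCount(int totalJobs, int pageSize) {
        if (totalJobs < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("totalJobs must be >= 0 and pageSize > 0");
        }
        // Ceiling division without floating point.
        return (totalJobs + pageSize - 1) / pageSize;
    }

    /** Zero-based index of the first job shown on the given 1-based page. */
    static int firstIndex(int page, int pageSize) {
        return (page - 1) * pageSize;
    }

    /** Exclusive end index of the jobs shown on the given 1-based page. */
    static int endIndex(int page, int pageSize, int totalJobs) {
        return Math.min(page * pageSize, totalJobs);
    }
}
```

With 250 jobs and the default page size, this sketch yields three pages, the last of which holds jobs 200 to 249.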

'''settings.common.batch.batchjobs.batchjob.class''': The list of batchjobs that can be run from the GUI. Each entry must be the fully qualified name of a batchjob class (e.g. dk.netarkivet.archive.arcrepository.bitpreservation.!ChecksumJob), and the class must inherit from !FileBatchJob. The default is the following:
{{{
dk.netarkivet.archive.arcrepository.bitpreservation.ChecksumJob
dk.netarkivet.archive.arcrepository.bitpreservation.FileListJob
}}}

'''settings.common.batch.batchjobs.batchjob.arcfile''': The list of the corresponding jar files containing the batchjobs. This is used for !LoadableJarBatchJobs. If no file is specified, it is assumed that the batchjob exists on the default classpath of the involved applications (!BitarchiveMonitor, !ArcRepository, GUIWebServer and !BitArchive).

'''settings.common.batch.baseDir''': The directory where the resulting files are placed when a batchjob is run through the GUI. The default is the relative directory "batch".

=== New settings in the harvester module ===
'''settings.harvester.harvesting.heritrix.monitorResetInterval''': The time interval in seconds after which the !HarvestMonitorServer will reset the job state data. This is a simple way to detect the end of a job. The default is 300 seconds (5 minutes).

'''settings.harvester.harvesting.heritrix.crawlLoopWaitTime''': The time interval in seconds to wait during a crawl loop in the harvest controller. The default is 20 seconds.

'''settings.harvester.harvesting.heritrix.abortIfConnectionLost''': A boolean flag. If set to true, the harvest controller aborts the current crawl when the JMX connection is lost. If set to false, it only logs a warning, leaving the crawl operator to shut down the harvester manually. Used only by the !BnfHeritrixController. The default is true.

'''settings.harvester.harvesting.heritrix.waitForReportGenerationTimeout''': The maximum time in seconds to wait for Heritrix to generate report files once crawling is over. The default is 600 seconds (10 minutes).
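
The idea behind a setting like `waitForReportGenerationTimeout` can be sketched as a bounded polling loop: keep checking whether the reports exist, and give up once the timeout has passed. The Java sketch below is an illustration only. The class `ReportWait`, the `reportsReady` check and the `pollMillis` parameter are assumptions made for this example and do not describe how the harvest controller is actually implemented.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch, NOT NetarchiveSuite code: wait for a condition
// (e.g. "report files exist") until a timeout in seconds has elapsed.
final class ReportWait {
    /** Default timeout as stated in the release notes (10 minutes). */
    static final long DEFAULT_TIMEOUT_SECONDS = 600;

    /**
     * Polls reportsReady until it returns true or the timeout expires.
     * Returns true if the condition became true in time, false otherwise.
     */
    static boolean waitForReports(BooleanSupplier reportsReady,
                                  long timeoutSeconds, long pollMillis) {
        long deadline = System.nanoTime() + timeoutSeconds * 1_000_000_000L;
        while (!reportsReady.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timeout reached, give up waiting
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A caller might invoke `ReportWait.waitForReports(check, ReportWait.DEFAULT_TIMEOUT_SECONDS, 1000)`, where `check` is whatever test for finished reports is appropriate in the caller's context.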

'''settings.harvester.harvesting.heritrixLauncherClass''': The implementation of the !HeritrixLauncher abstract class to be used. The default is dk.netarkivet.harvester.harvesting.controller.!DefaultHeritrixLauncher.

=== New settings in the wayback module ===
Many new settings have been added to configure Hibernate and the database connection pool it uses (c3p0). Documentation about the c3p0 settings and c3p0 in general can be found here:

About configuring c3p0: http://www.mchange.com/projects/c3p0/index.html#configuration_properties

About c3p0 in general: http://www.mchange.com/projects/c3p0/index.html

'''settings.wayback.hibernate.c3p0.acquire_increment''': Determines how many connections at a time c3p0 will try to acquire when the pool is exhausted. Sets the Hibernate configuration key "hibernate.c3p0.acquire_increment", which corresponds to the c3p0-native property "c3p0.acquireIncrement". The default is 1.

'''settings.wayback.hibernate.c3p0.idle_test_period''': If this is a number greater than 0, c3p0 will test all idle, pooled but unchecked-out connections every this many seconds. Sets the Hibernate configuration key "hibernate.c3p0.idle_test_period", which corresponds to the c3p0-native property "c3p0.idleConnectionTestPeriod". The default is 100.

'''settings.wayback.hibernate.c3p0.max_size''': The maximum number of connections a pool will maintain at any given time. Sets the Hibernate configuration key "hibernate.c3p0.max_size", which corresponds to the c3p0-native property "c3p0.maxPoolSize". The default is 100.

'''settings.wayback.hibernate.c3p0.max_statements''': The size of c3p0's global !PreparedStatement cache. Sets the Hibernate configuration key "hibernate.c3p0.max_statements", which corresponds to the c3p0-native property "c3p0.maxStatements". The default is 100.

'''settings.wayback.hibernate.c3p0.min_size''': The minimum number of connections a pool will maintain at any given time. Sets the Hibernate configuration key "hibernate.c3p0.min_size", which corresponds to the c3p0-native property "c3p0.minPoolSize". The default is 10.

'''settings.wayback.hibernate.c3p0.timeout''': The number of seconds a connection can remain pooled but unused before being discarded. Sets the Hibernate configuration key "hibernate.c3p0.timeout", which corresponds to the c3p0-native property "c3p0.maxIdleTime". The default is 100.

'''settings.wayback.hibernate.connection_url''': The Hibernate connection URL. The default is "jdbc:derby:derbyDB/wayback_indexer_db;create=true".

'''settings.wayback.hibernate.db_driver_class''': The Hibernate client driver class. The default is "org.apache.derby.jdbc.!ClientDriver".

'''settings.wayback.hibernate.use_reflection_optimizer''': See the Hibernate documentation for its meaning. The default is "false".

'''settings.wayback.hibernate.transaction_factory''': See the Hibernate documentation for its meaning. The default is org.hibernate.transaction.JDBCTransactionFactory.

'''settings.wayback.hibernate.dialect''': See the Hibernate documentation for its meaning. The default is "org.hibernate.dialect.!DerbyDialect".

'''settings.wayback.hibernate.show_sql''': See the Hibernate documentation for its meaning. The default is "true".

'''settings.wayback.hibernate.format_sql''': See the Hibernate documentation for its meaning. The default is "true".

'''settings.wayback.hibernate.hbm2ddl_auto''': See the Hibernate documentation for its meaning. The default is "update".

'''settings.wayback.hibernate.user''': See the Hibernate documentation for its meaning. The default is "".

'''settings.wayback.hibernate.password''': See the Hibernate documentation for its meaning. The default is "".

'''settings.wayback.indexer.replicaId''': The replica to be used by the wayback indexer. The default value is "ONE".

'''settings.wayback.indexer.temp_batch_output_dir''': The directory to which batch output is written during indexing. The default value is "tempdir".

'''settings.wayback.indexer.final_batch_output_dir''': The directory to which batch output is moved after a batch indexing job is successfully completed. The default value is "batchOutputDir".

'''settings.wayback.indexer.maxFailedAttempts''': The maximum number of times an archive file may generate a batch error during indexing before we give up on it. The default value is "3".

'''settings.wayback.indexer.producerDelay''': The delay in milliseconds before the producer thread is started. The default value is "0".

'''settings.wayback.indexer.producerInterval''': The interval, in milliseconds, between successive runs of the producer thread. The default value is "86400000".

'''settings.wayback.indexer.consumerThreads''': The number of consumer threads to run. The default value is "5".

'''settings.wayback.indexer.initialFiles''': A file containing a list of files which have been archived and therefore do not need to be archived again. This key may be unset. The default value is "".

'''settings.wayback.aggregator.index_file-input_dir''': The directory from which the Aggregator consumes raw index files. The default value is "batchOutputDir".

'''settings.wayback.aggregator.index_file-output_dir''': The directory into which the Aggregator places the aggregated and sorted files. The default value is "indexDir".

'''settings.wayback.aggregator.temp_aggregator_dir''': The directory used by the Aggregator to store temporary files. The default value is "aggregator_tempdir".

'''settings.wayback.aggregator.aggregation_interval''': The time between each scheduled aggregation run (in milliseconds). The default value is "86400000".

'''settings.wayback.aggregator.max_intermediate_index_file_size''': The maximum size of the intermediate index file in MB. When this limit is reached, a new index file is created and new indexes are added to this file. In case of a 0 value, the intermediate index file will always be merged into the main index file. The default value is "102400".

'''settings.wayback.aggregator.max_main_index_file_size''': The maximum size of the main wayback index file in MB. When this limit is reached, a new index file is created and new indexes are added to this file. The old index file will be renamed to ${finalIndexFileSizeLimit}.1. The default value is "104857600".

=== New translation strings ===
'''common/Translation.properties'''
{{{
batchpage;Name.of.batchjob=Name of batchjob
batchpage;Description=Description
batchpage;Last.run=Last run
batchpage;Batchjob.has.never.been.run=Batchjob has never been run
batchpage;Size.of.output.file=Size of output file
batchpage;Number.of.lines.in.output.file=Number of lines in output file
batchpage;Size.of.error.file=Size of error file
batchpage;Number.of.lines.in.error.file=Number of lines in error file
batchpage;Choose.replica=Choose replica
batchpage;Regular.expression.for.filenames.(all.files)=Regular expression for file names (\".*\" = all files)
batchpage;Execute.batchjob=Execute batchjob
batchpage;Arguments=Arguments
batchpage;Bad.argument.metadata.for.the.constructor=Bad argument metadata resources for the constructor ''{0}''
batchpage;Argument.i=Argument {0}
batchpage;Argument.i.missing.argument.metadata=Argument {0} (missing argument metadata)
batchpage;No.batchjobs.defined.in.settings=No batchjobs defined in settings
batchpage;Predefined.batchjobs=Predefined batchjobs
batchpage;Batchjob=Batchjob
batchpage;No.output.file=No output file
batchpage;No.error.file=No error file
batchpage;Warning.0=Warning: {0}
batchpage;Which.files=Which files
batchpage;Job.ID=Job ID
batchpage;Metadata=Metadata
batchpage;Content=Content
batchpage;Both=Both
batchpage;No.outputfile=No outputfile
batchpage;No.errorfile=No errorfile
batchpage;Download.outputfile=Download outputfile
batchpage;Download.errorfile=Download errorfile
batchpage;No.valid.timestamp=No valid timestamp
batchpage;bytes=bytes
batchpage;lines=lines
batchpage;Started.date=Stated date
batchpage;Ended.date=Ended date
batchpage;Output.file=Output file
batchpage;Error.file=Error file
batchpage;Number.of.runs.0=Number of runs: {0}
}}}

'''harvester/Translations.properties'''
{{{
pagetitle;all.jobs.running=Running Jobs
status.job.filters.group1=Job status {0} Harvest name {1} Start date {2} End date {3}
status.job.filters.group2=Order {0} Display {1} rows per page.
status.job.filters.group3=Job status {0} Harvest name {1}
status.sort.order.job.reset=Reset
status.results.page=Displaying result page {0}.
status.results.displayed=Search results: {0}, displaying results {1} to {2}.
status.results.displayed.pagination={0} / {1}
status.results.displayed.nextPage=next
status.results.displayed.prevPage=previous
status.harvest.all=All
table.running.jobs.jobId=Job ID
table.running.jobs.harvestName=Harvest definition
table.running.jobs.host=Host
table.running.jobs.progress=Progress
table.running.jobs.queuedFiles=Queued files
table.running.jobs.totalQueues=Queues
table.running.jobs.activeQueues=Active
table.running.jobs.retiredQueues=Retired
table.running.jobs.exhaustedQueues=Exhausted
table.running.jobs.elapsedTime=Elapsed time
table.running.jobs.alerts=Alerts
table.running.jobs.downloadedCount=Downloaded files
table.running.jobs.currentProcessedKBPerSec=KB/s
table.running.jobs.currentProcessedDocsPerSec=URL/s
table.running.jobs.queues=Queues
table.running.jobs.performance=Performance
table.running.jobs.toeThreads=Threads
table.running.jobs.status.preCrawl=Crawl in preparation
table.running.jobs.status.crawlerRunning=Crawler is running
table.running.jobs.status.crawlerPaused=Crawler is paused
table.running.jobs.status.crawlFinished=Crawl finished
table.running.jobs.legend={0} - crawl in preparation, {1} crawler is running, {2} - crawler is paused, {3} - crawl finished
running.jobs.finder.inputGroup=Find job harvesting domain {0}
running.jobs.finder.submit=Search
resubmit.jobs.submit=Resubmit selected failed jobs
errormsg;resubmit.jobs.selectionEmpty=Please select at least one job!
running.jobs.finder.table.jobId=Job ID
}}}

'''monitor/Translation.properties'''
{{{
}}}

'''viewerproxy/Translation.properties'''
{{{
pagetitle;qa.batchjob.overview=Batchjob Overview
pagetitle;qa.batchjob.retrieve.resultfile=BatchJob resultfile
pagetitle;qa.batchjob=Batchjob
pagetitle;qa.batchjob.execute=Executing batchjob
pagetitle;qa.get.files=Get harvested files
pagetitle;qa.get.reports=Get harvest reports
pagetitle;qa.crawllog.lines.for.domain=Lines from crawl.log about domain
pagetitle;files.for.job.0=Files for job {0}
pagetitle;reports.for.job.1=Reports for job {0}
pagetitle;qa.crawllog.lines.for.domain.0.in.1=Lines from crawl.log of job {1} concerning domain {0}
}}}

=== Deleted translation strings ===
'''harvester/Translation.properties'''
{{{
}}}

'''archive/Translation.properties'''
{{{
}}}

== Version History ==
||Version 3.12.0 ||2010-05-03 ||New Bitpreservation infrastructure, and upgrade of Apache Derby to version 10.5.3.0 ||
||Version 3.11.* || ||Development versions aiming for 3.12.0 ||
||Version 3.10.0 ||2009-11-16 ||New deploy application; JMX stability issues fixed; JMS stability issues also fixed ||
||Version 3.9.* || ||Development versions aiming for 3.10.0 ||
||Version 3.8.2 ||2009-09-10 ||Fix of an important index synchronization bug ||
||Version 3.8.1 ||2009-07-15 ||Fix of an important bug leading to unresponsive harvesters ||
||Version 3.8.0 ||2009-05-23 ||Java 1.6, Heritrix 1.14.1, Derby 10.4.2.0, complete rewrite of settings, new supported deploy module, GUI access to harvest logs ||
||Version 3.7.0 ||2008-11-04 ||Development version aiming for 3.8.0 ||
||Version 3.6.0 ||2008-07-03 ||Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes ||
||Version 3.5.* || ||Development versions aiming for 3.6.0 ||
||Version 3.4.2 ||2008-03-14 ||Bug fix release, fixing JMX timeout ||
||Version 3.4.1 ||2008-01-16 ||Bug fix release, fixing out of memory on very large indexes ||
||Version 3.4.0 ||2008-01-03 ||Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes ||
||Version 3.3.* || ||Development versions aiming for 3.4.0 ||
||Version 3.2.3 ||2007-09-27 ||Bugfix of 3.2.2 with patched deduplicator that fixes a problem in parallel indexing ||
||Version 3.2.2 ||2007-08-03 ||Bugfix of 3.2.1 with patched Heritrix 1.12.1 that supports ARCRecords larger than 2 GB ||
||Version 3.2.1 ||2007-07-04 ||Bugfix of 3.2.0 fixing trouble using the quick start manual ||
||Version 3.2.0 ||2007-07-04 ||Open source release ||
||Version 3.1.* || ||Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library. ||
||Version 3.0.0 ||2007-02-02 ||Marked the naming of the !NetarchiveSuite, the splitting of !NetarchiveSuite into independent modules, and the licensing of !NetarchiveSuite under LGPL ||
||Version 2.* || ||Various features and updates ||
||Version 2.0 ||2006-08-30 ||Marked a general restructuring of the code, where harvest definition data was backed by a database and the viewerproxy was trimmed and rewritten ||
||Version 1.* || ||Various features and updates ||
||Version 1.0 ||2005-07-01 ||The first version of the netarchive software put in production for harvesting the entire Danish web ||
||Version 0.* || ||Various pre-production development versions ||