Release Notes for NetarchiveSuite 3.9.0
This version of NetarchiveSuite was released on 2009-08-10.
New features since NetarchiveSuite 3.8.*
Apart from general bug fixes (see below), the most important new features are:
General
Common Module
It is now possible to override the implementation of the getBytesFree() method, which by default is calculated using the standard Java method File.getUsableSpace().
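As a rough illustration of what such an override could look like, the sketch below wraps the default calculation and holds back a fixed reserve. The interface FreeSpaceProviderSketch and both class names are hypothetical stand-ins invented for this example; the real FreeSpaceProvider interface in NetarchiveSuite may declare a different signature, so check the source before adapting it.

```java
import java.io.File;

// Hypothetical stand-in for NetarchiveSuite's FreeSpaceProvider interface;
// the real interface may declare a different method signature.
interface FreeSpaceProviderSketch {
    long getBytesFree(File f);
}

// Mirrors the default behaviour described above: delegate to File.getUsableSpace().
class UsableSpaceProvider implements FreeSpaceProviderSketch {
    public long getBytesFree(File f) {
        return f.getUsableSpace();
    }
}

// Example override (an assumption, not part of NetarchiveSuite):
// report the delegate's value minus a fixed reserve, never below zero.
class ReservedSpaceProvider implements FreeSpaceProviderSketch {
    private final FreeSpaceProviderSketch delegate;
    private final long reserveBytes;

    ReservedSpaceProvider(FreeSpaceProviderSketch delegate, long reserveBytes) {
        if (delegate == null || reserveBytes < 0) {
            throw new IllegalArgumentException("delegate must be set and reserve non-negative");
        }
        this.delegate = delegate;
        this.reserveBytes = reserveBytes;
    }

    public long getBytesFree(File f) {
        return Math.max(0L, delegate.getBytesFree(f) - reserveBytes);
    }
}
```

A class of this kind would presumably be selected through the new setting settings.common.freespaceprovider.class, provided it implements the actual interface.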
Harvester Module
NetarchiveSuite now works properly with MySQL (see bug 1254). A deadline has also been introduced in the HarvestScheduler for jobs in status STARTED; the default is 1 week.
Archive Module
Some excessive logging has been removed, and the previously fixed indexing bugs 1078 and 1079 were reopened and fixed again.
Bugs fixed since NetarchiveSuite 3.8.*
Common Module
Bug 1694: LocalArcRepositoryClient is broken
Bug 1712: Starting multiple applications on one machine leads to potential failure of startup
FR 1654: second-level domains for .at in settings.xml
FR 1709: Module getBytesFree()
Harvester Module
Bug 928: The guess of initial size of unharvested domains is very bad on harvests with a large object limit
Bug 1254: Database connections to MySQL close down intermittently
Bug 1611: Missing space in error message in DefinitionsSiteSection.initialize
Bug 1644: On Edit Domain page, the text field only shows 21 characters of the domainname
Bug 1646: dk.netarkivet.harvester.harvesting.distribute.MetadataEntry needs toString method
Bug 1650: It is not checked when creating the Heritrix process, that the JMX password file assigned to Heritrix exists
Bug 1670: Default timeout settings are set way too low in the default settings
Bug 1711: harvester that is not destroyable makes harvesterApplication take and immediately fail jobs
Bug 1718: The link to monitor Heritrix process does not necessarily give fully qualified hostnames
FR 1014: No good way to mark a non-reported-stopped job as FAILED or DONE
FR 1227: Log the Heritrix command line
FR 1628: Add custom JVM parameters to Heritrix subprocess
FR 1675: List of all Seeds of a selective Harvests
FR 1702: Value for max-trans-hops is way too high in default order templates
FR 1716: Increase size of input field, when uploading harvest templates
FR 1717: Increase crawler trap textarea size
FR 1723: Update the Heritrix templates in harvestdefinionbasedir/order_template_dist to Heritrix 1.14.3
Archive Module
Bug 1078: DeDuplikator index too large (refixed)
Bug 1079: snap shot harvest not browsable due to large index (refixed)
Bug 1547: Wrong synchronization in the IndexRequestServer and the FileBasedCache let two processes generate Index at the same time, and one of them fails
Bug 1722: Excessive logging in indexserver
Access Module
Bug 1700: The WebProxy.handle() method creates CreateErrorResponse for null Uri
Monitor Module
Deploy Module
Documentation
Bug 1636: Warnings in javadoc
Bug 1710: deploy Application seems not to support multiple FTP-servers
Upgrade instructions
Remember to stop the running installation before upgrading.
New settings
The following new settings have been introduced:
settings.common.database.validityCheckTimeout (default: 0): Timeout in seconds for checking the validity of a JDBC connection on the server, i.e. the time to wait for the database operation used to validate the connection to complete. If the timeout period expires before the operation completes, the validity check returns false. A value of 0 indicates that no timeout is applied to the database operation.
settings.common.freespaceprovider.class (default: dk.netarkivet.common.utils.DefaultFreeSpaceProvider): The implementation class for the free space provider, e.g. dk.netarkivet.common.utils.DefaultFreeSpaceProvider. The class must implement the FreeSpaceProvider interface.
settings.harvester.harvesting.heritrix.javaOpts (default: ""): Additional JVM options for the Heritrix sub-process.
settings.harvester.harvesting.heritrixControllerClass (default: dk.netarkivet.harvester.harvesting.JMXHeritrixController): The implementation of the HeritrixController interface to be used.
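The timeout semantics described for settings.common.database.validityCheckTimeout match the contract of the standard JDBC method java.sql.Connection.isValid(int), where 0 means no timeout and false is returned if the check does not complete in time. The following is a minimal sketch of such a check, assuming that method is what is used; the class ConnectionValidator and its method isUsable are hypothetical names for this example and not part of NetarchiveSuite.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical helper illustrating a validity check with a configurable timeout.
class ConnectionValidator {
    // Timeout in seconds; 0 means no timeout is applied (as in Connection.isValid).
    private final int validityCheckTimeoutSeconds;

    ConnectionValidator(int validityCheckTimeoutSeconds) {
        if (validityCheckTimeoutSeconds < 0) {
            throw new IllegalArgumentException("timeout must be >= 0");
        }
        this.validityCheckTimeoutSeconds = validityCheckTimeoutSeconds;
    }

    // Returns true only if the connection is open and passes the JDBC validity check.
    boolean isUsable(Connection connection) {
        if (connection == null) {
            return false;
        }
        try {
            return !connection.isClosed()
                    && connection.isValid(validityCheckTimeoutSeconds);
        } catch (SQLException e) {
            // A failure while checking is treated as an unusable connection.
            return false;
        }
    }
}
```

A caller would typically discard a connection for which isUsable returns false and open a new one, which is one way to cope with connections that the database server has closed in the meantime.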
New translations
If you are maintaining a translation, please note that the following new keys have been added:
archive/Translations.properties
harvester/Translations.properties
harvestdefinition.linktext.seeds=Seeds
harveststatus.seeds.total=Total
harveststatus.seeds.domains=Domains
viewerproxy/Translations.properties
Version History
Version 3.8.1 | 2009-07-15 | Fix of important bug leading to unresponsive harvesters
Version 3.8.0 | 2009-05-23 | Java 1.6, Heritrix 1.14.1, Derby 10.4.2.0, complete rewrite of settings, new supported deploy module, gui access to harvest logs
Version 3.7.0 | 2008-11-04 | Develop version aiming for 3.8.0
Version 3.6.0 | 2008-07-03 | Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes
Version 3.5.* | | Develop versions aiming for 3.6.0
Version 3.4.2 | 2008-03-14 | Bug fix release, fixing JMX timeout
Version 3.4.1 | 2008-01-16 | Bug fix release, fixing out of memory on very large indexes
Version 3.4.0 | 2008-01-03 | Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes
Version 3.3.* | | Develop versions aiming for 3.4.0
Version 3.2.3 | 2007-09-27 | Bugfix of 3.2.2 with patched deduplicator, that fixes problem in parallel indexing
Version 3.2.2 | 2007-08-03 | Bugfix of 3.2.1 with patched Heritrix 1.12.1, that supports ARCRecords larger than 2GBs
Version 3.2.1 | 2007-07-04 | Bugfix of 3.2.0 fixing trouble using the quick start manual.
Version 3.2.0 | 2007-07-04 | Open source release
Version 3.1.* | | Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library.
Version 3.0.0 | 2007-02-02 | Marked the naming of the NetarchiveSuite, the splitting of NetarchiveSuite into independent modules, and the licensing of NetarchiveSuite under LGPL
Version 2.* | | Various features and updates
Version 2.0 | 2006-08-30 | Marked a general restructuring of the code, where harvest definition data was backed by a database, the viewerproxy was trimmed and rewritten.
Version 1.* | | Various features and updates
Version 1.0 | 2005-07-01 | The first version of the netarchive software put in production for harvesting the entire Danish web
Version 0.* | | Various pre-production development versions