= Release Notes for NetarchiveSuite 3.4.0 =

This version of !NetarchiveSuite was released on 2008-01-03. Happy new year!

These release notes were updated with extra upgrade instructions on 2008-02-18.

== New features since NetarchiveSuite 3.2.* ==

=== General ===

We now have a publicly available SVN repository. See https://gforge.statsbiblioteket.dk/scm/?group_id=7.

We also have a public tracking system available at the GForge site. Here you can browse known bugs and feature requests, report new bugs, and request new features. You can also submit patches for the NetarchiveSuite code. See https://gforge.statsbiblioteket.dk/tracker/index.php?group_id=7

We also have three mailing lists available. The netarchivesuite-announce list carries low-traffic announcements of new releases. The netarchivesuite-users list is for general discussion; the developers are active on this list, and feedback and comments are very welcome. Finally, there is a mailing list where all commits to our repository are automatically reported. This third list is somewhat high volume and of interest to developers only. See https://gforge.statsbiblioteket.dk/mail/?group_id=7

=== Common Module ===

==== New secure remotefile implementation ====

A new remote file implementation using secure HTTPS has been added. To use it, set your remote file settings in {{{settings.xml}}} as follows:

{{{
... dk.netarkivet.common.distribute.HTTPSRemoteFile 8300 path/to/keystore testpass testpass2
}}}

The keystore file must contain a certificate with the given passwords. It can be generated with the {{{keytool}}} application distributed with Java 5. Run the following command:

{{{
keytool -alias NetarchiveSuite -keystore keystore -genkey
}}}

It should then respond with the following:

{{{
Enter keystore password:
}}}

Enter the password for the keystore. The keytool will now prompt you for the following information:

{{{
What is your first and last name?
  [Unknown]:
What is the name of your organizational unit?
  [Unknown]:
What is the name of your organization?
  [Unknown]:
What is the name of your City or Locality?
  [Unknown]:
What is the name of your State or Province?
  [Unknown]:
What is the two-letter country code for this unit?
  [Unknown]:
Is CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown correct?
  [no]:
}}}

Answer all the questions, and end with "yes". Finally, you will be asked for the certificate password.

{{{
Enter key password for (RETURN if same as keystore password):
}}}

Answer with a password for the certificate. You now have a file called {{{keystore}}} which contains a certificate.

To keep your environment secure, make sure the keystore and settings files are readable only by the application.

==== Settings split ====

The settings for the monitor module have been moved to a separate settings file. We expect to restructure our settings further in a later release to make them more extensible and modular.

==== Better pluggability of JMS implementation ====

If you wish to use different JMS queue software, the class handling all JMS communication is now pluggable and can be specified in the settings.

=== Harvester Module ===

==== Two-part TLDs possible ====

It is now possible to use e.g. co.uk as a top-level domain. See the upgrade section for an example.

==== Heritrix integration ====

Heritrix is now run as a separate process and controlled by JMX. Among other things, this makes it possible to use the Heritrix harvest interface while the harvest is running. It also opens new possibilities for how we can control Heritrix during a harvest in later releases.

==== Manual override of scheduling ====

For selective harvests, it is now possible to override when the harvest should next run. The override replaces the scheduled time with a date chosen by the operator.
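
To illustrate the two-part TLD support introduced for this module, the following is a minimal sketch of how a host name could be mapped to its domain from a list of configured TLDs. The class {{{TldDomainSketch}}} and the method {{{domainOf}}} are hypothetical names and not part of the NetarchiveSuite code. Preferring the longest matching TLD is an assumption, chosen because it agrees with the {{{bbc.co.uk}}} example in the upgrade section. IP addresses are not handled.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not the NetarchiveSuite implementation. */
final class TldDomainSketch {

    /**
     * Returns the domain of a host: the label directly in front of the
     * longest configured TLD that the host ends with, plus that TLD.
     * Returns null if no configured TLD matches.
     */
    static String domainOf(String host, List<String> tlds) {
        String h = host.toLowerCase();
        String best = null;
        for (String tld : tlds) {
            // Assumption: the longest matching TLD wins, so "co.uk" beats "uk".
            if (h.endsWith("." + tld)
                    && (best == null || tld.length() > best.length())) {
                best = tld;
            }
        }
        if (best == null) {
            return null;
        }
        // Everything in front of ".<tld>"; keep only its last label.
        String prefix = h.substring(0, h.length() - best.length() - 1);
        String label = prefix.substring(prefix.lastIndexOf('.') + 1);
        if (label.length() == 0) {
            return null;
        }
        return label + "." + best;
    }

    public static void main(String[] args) {
        List<String> tlds = Arrays.asList("dk", "com", "net", "co.uk", "uk");
        System.out.println(domainOf("news.bbc.co.uk", tlds)); // bbc.co.uk
        System.out.println(domainOf("www.bl.uk", tlds));      // bl.uk
        System.out.println(domainOf("example.org", tlds));    // null
    }
}
```

With {{{uk}}} in the list next to {{{co.uk}}}, a host such as {{{www.bl.uk}}} still resolves to a domain even though it sits under none of the two-part TLDs.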
==== Sorting and grouping seed lists ====

The list of jobs can now be grouped by job status and sorted in ascending or descending order.

==== Checking of harvest templates ====

When a new harvest template is added, it is now checked that the parts of an order.xml template that the NetarchiveSuite software expects are actually present.

=== Archive Module ===

==== Index Client Performance ====

Requesting the same index multiple times should now be much faster.

==== GetFile command line tool ====

A new command line tool has been added. Try {{{java dk.netarkivet.archive.tools.GetFile}}}

=== Access Module ===

==== Work on making http auth protected material available ====

The access module now makes a best guess at retrieving the wanted record when both a 401 and a 200 record are harvested during an HTTP authentication negotiation.

=== Monitor Module ===

==== Dynamic reload of settings for deployed applications ====

With the new settings file for the monitor module, you can update which applications are monitored. Simply change the settings file, and it should be reloaded on the next load of the monitor application.

== Bugs fixed since NetarchiveSuite 3.2.* ==

=== Common Module ===

{{{
1016 ExtractCDX tool does not handle arc-files with large records
1023 [javadoc] Tag @see: can't find getInstance(File) in dk.netarkivet.common.distribute.RemoteFile
1034 English NetarchiveSuite thumbnail redirects to Danish site
1057 HTTPRemoteFile breaks ?
1061 Remove obsolete settings (monitor.htmlDir, viewerproxy.hostname)
1070 SUNMQ class must be fixed
1083 The NetarchiveSuite releases contain test code in lib/dk.netarkivet.*.jar
1147 AbstractRemoteFile.copyTo(File destfile) throws NullPointerException if destfile has no parent
}}}

=== Harvester Module ===

{{{
628 need a way to reset the nextdate of a HD
680 Cannot browse harvested password protected material
789 illegal regexp crashes entire job
861 No logmessage in HarvestDefinitionGUI, when submitting crawljob
897 sorting of seed-lists on job-page
915 TLDs having two parts i.e co.uk are disallowed
926 error writing crawl.log to metadata.arc
937 rescheduling of jobs is very slow and blocks normal scheduling
939 Webpages should handle the case, where no schedules or harvestdefinitions exist already properly
970 DB error on registering jobs with too many upload-errors
971 Seeds, passwords and configurations sorted without regard to locale
973 Required parameter urlListList not given
984 Missing headlines in "Selective Harvests" window
993 missing trim of domain strings or a normal error message
1010 long string in ordertemplate blows up formating of Job Details page
1033 Sensitive "Find Domain(s)"
1038 New title for 'Harvest status' - 'All Jobs per domain'
1039 SideKick sets it's application name to "dk.netarkivet.SideKick"
1040 sortNamedObjectList and Named should be moved to common.utils
1049 Danish translation of resubmitted is wrong
1051 Error on Definitions-create-domain.jsp when trying to create an invalid domain
1052 wrong description of setting in installation manual
1053 wrong parameter in configuration-link on job details page
1065 Show harvest run count
1071 Checks of Xpaths in Heritrix templates
1074 more information from harvesters (JobID and Priority)
1075 harvester with no diskspace keeps starting new jobs
1088 Time synchronization issue?
1096 Wrong timeout listed in log
1159 long string in seedlist blows up formating of Job Details page
1179 Loading job details takes too long
}}}

=== Archive Module ===

{{{
1022 [javadoc] Parameter "data" is documented more than once in RemoveAndGetFileMessage
1024 An error message is always shown on the bitarchive checksum page
1036 missing translation
1046 Bitpreservation-filestatus-checksum.jsp: Missing whitespace between filename and Info-link
1059 Mention of locations as institutions in comments, and variable names
1062 indexserver skips a lot of lines due to threading problem with SimpeDateFormat
1077 more logging when indexing large number of jobs
1080 failed upload does not clean up FTPRemoteFile
1099 Only batch jobs being started is visible in the system overview
}}}

=== Viewerproxy Module ===

{{{
1030 new viewerproxy command URL gives strange behavior in some browsers
}}}

=== Monitor Module ===

{{{
936 no way to add a new bitarchive machine to the JMX-overview without restarting the GUI
}}}

=== Deploy module (unsupported) ===

{{{
1081 deploy uses largeIndexRequestTimeout for both LOW and HIGH priority harvester instances
1086 Use unzip in quiet mode during deploy
}}}

== Upgrade instructions ==

Please note that we only support upgrading from the previous stable release. Upgrading across several stable releases in one step is not supported; you will need to upgrade step-by-step through all stable releases (for instance 3.0 -> 3.2 -> 3.4).

=== Monitor settings ===

To upgrade from a 3.2.* version of !NetarchiveSuite, you will need to update the settings files of the application running the monitor GUI, usually !HarvestDefinitionApplication. Move the section that looks like

{{{
JMX_MONITOR_ROLE_PASSWORD_PLACEHOLDER 3 hostname1.example.com 8100 8101 8102 hostname2.example.com 8100 8101 hostname3.example.com 8100 8101 8102 8103
}}}

to the file called {{{monitor_settings.xml}}}.
If you use {{{-Ddk.netarkivet.settings.file=path/to/your/settings.xml}}}, you should now also use {{{-Ddk.netarkivet.monitorsettings.file=path/to/your/monitor_settings.xml}}}.

=== Setting top-level domains ===

The old setting {{{settings.harvester.datamodel.domain.validDomainRegex}}} has been removed. Instead, a new repeatable setting {{{settings.harvester.datamodel.domain.tld}}} has been introduced, which declares the valid top-level domain parts for domains in the system. This setting also defines how domains are split. Thus, if '{{{co.uk}}}' is a valid top-level domain, a legal domain would be "{{{bbc.co.uk}}}", and the host "{{{news.bbc.co.uk}}}" would be considered a host in that domain.

Example: Say your old regex looked like this:

{{{
^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+|[^\0000-,.-/:-@\[-`{-\0177]+\.(dk|com|net|uk))$
}}}

You could replace it with the following for the exact same meaning:

{{{
dk com net uk
}}}

Or you could introduce better handling of .uk domains as follows:

{{{
dk com net co.uk gov.uk ltd.uk me.uk mod.uk net.uk nic.uk nhs.uk org.uk plc.uk police.uk sch.uk uk
}}}

Note that {{{uk}}} is added at the end. This lets domains that do not belong to any of the two-part TLDs still be caught (like British Library).

=== New settings for controlling heritrix ===

You can control how long we wait for the external Heritrix process to start. That time is set with the setting {{{settings.common.processTimeout}}}. In addition, new settings set the ports used for JMX communication with Heritrix in the external process. Your new settings should be as follows:

{{{
... ... 5000 ... ... ... 100 100 admin adminPassword 8090 8091 1598M
}}}

=== New setting for when harvesters should no longer listen for new jobs ===

The harvester will now stop listening for jobs if a given amount of disk space is not available. This needs to be set.

{{{
... ... 400000000 ...
}}}

=== Full classname needed for MQ implementation ===

Previously the setting {{{settings.common.jms.class}}} was just the prefix of a class name, e.g. {{{SunMQ}}}. Now the full class name is needed. Example: If your previous settings.xml file contained:

{{{
... SunMQ ... ...
}}}

it should now become

{{{
... dk.netarkivet.common.distribute.JMSConnectionSunMQ ... ...
}}}

=== You must clear all job-caches ===

On all harvester machines, on the index server, and on the viewer proxy, please delete all contents of the cache directories. The format of generated caches has changed. With default settings, this is the directory 'cache' on the viewerproxies, the index servers, and the harvester machines.

== Version History ==

||Version 3.4.0||2008-01-03||Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes||
||Version 3.3.*|| ||Development versions aiming for 3.4.0||
||Version 3.2.3||2007-09-27||Bugfix of 3.2.2 with patched deduplicator that fixes a problem in parallel indexing||
||Version 3.2.2||2007-08-03||Bugfix of 3.2.1 with patched Heritrix 1.12.1 that supports ARCRecords larger than 2 GB||
||Version 3.2.1||2007-07-04||Bugfix of 3.2.0 fixing trouble using the quick start manual.||
||Version 3.2.0||2007-07-04||Open source release||
||Version 3.1.*|| ||Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library.||
||Version 3.0.0||2007-02-02||Marked the naming of the !NetarchiveSuite, the splitting of !NetarchiveSuite into independent modules, and the licensing of !NetarchiveSuite under LGPL||
||Version 2.* || ||Various features and updates||
||Version 2.0 ||2006-08-30||Marked a general restructuring of the code, where harvest definition data was backed by a database and the viewerproxy was trimmed and rewritten.||
||Version 1.* || ||Various features and updates||
||Version 1.0 ||2005-07-01||The first version of the netarchive software put in production for harvesting the entire Danish web||
||Version 0.* || ||Various pre-production development versions||