Release Notes for NetarchiveSuite 3.3.1
This version of the NetarchiveSuite was released on 2007-09-24.
Note: This is a development release; it has not been tested to be of production quality.
New features since NetarchiveSuite 3.2.*
General
We now have a publicly available SVN repository. See https://gforge.statsbiblioteket.dk/scm/?group_id=7.
Please note that anonymous access is not available, although the page claims it is. You will need to create a gforge account and follow the instructions under Developer Subversion Access.
Common Module
Settings split
The settings for the monitor module have been moved to a separate settings file. It is expected that we will do further restructuring of our settings to make them more extensible and modular in a later release.
Harvester Module
Two-part TLDs possible
It is now possible to use two-part names such as co.uk as top-level domains. See the upgrade section for an example.
Heritrix integration
Preliminary work has been done on communicating with the harvesters through JMX and on getting the Heritrix UI up and running.
Monitor Module
Dynamic reload of settings for deployed applications
With the new settings file for the monitor module, you will be able to update which applications are to be monitored. Simply change the settings file, and the file should be reloaded on the next load of the monitor application.
Bugs fixed since NetarchiveSuite 3.2.*
Common Module
1016 ExtractCDX tool does not handle arc-files with large records
1023 [javadoc] Tag @see: can't find getInstance(File) in dk.netarkivet.common.distribute.RemoteFile
1034 English NetarchiveSuite thumbnail redirects to Danish site
1057 HTTPRemoteFile breaks ?
Harvester Module
628 need a way to reset the nextdate of a HD
789 illegal regexp crashes entire job
861 No logmessage in HarvestDefinitionGUI, when submitting crawljob
915 TLDs having two parts i.e co.uk are disallowed
926 error writing crawl.log to metadata.arc
937 rescheduling of jobs is very slow and blocks normal scheduling
939 Webpages should handle the case, where no schedules or harvestdefinitions exist already properly
970 DB error on registering jobs with too many upload-errors
971 Seeds, passwords and configurations sorted without regard to locale
984 Missing headlines in "Selective Harvests" window
993 missing trim of domain strings or a normal error message
1033 Sensitive "Find Domain(s)"
1038 New title for 'Harvest status' - 'All Jobs per domain'
1039 SideKick sets it's application name to "dk.netarkivet.SideKick"
1040 sortNamedObjectList and Named should be moved to common.utils
1049 Danish translation of resubmitted is wrong
1051 Error on Definitions-create-domain.jsp when trying to create an invalid domain
1053 wrong parameter in configuration-link on job details page
Archive Module
1022 [javadoc] Parameter "data" is documented more than once in RemoveAndGetFileMessage
1024 An error message is always shown on the bitarchive checksum page
1036 missing translation
1046 Bitpreservation-filestatus-checksum.jsp: Missing whitespace between filename and Info-link
1059 Mention of locations as institutions in comments, and variable names
Viewerproxy Module
1030 new viewerproxy command URL gives strange behavior in some browsers
Monitor Module
936 no way to add a new bitarchive machine to the JMX-overview without restarting the GUI
Upgrade instructions
Monitor settings
To upgrade from a previous version of NetarchiveSuite, you will need to update your settings files on the application running the monitor GUI, usually HarvestDefinitionApplication.
You need to move the section that looks like
<deploy>
    <jmxMonitorRolePassword>JMX_MONITOR_ROLE_PASSWORD_PLACEHOLDER</jmxMonitorRolePassword>
    <numberOfHosts>3</numberOfHosts>
    <host1>
        <name>hostname1.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
        <jmxport>8102</jmxport>
    </host1>
    <host2>
        <name>hostname2.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
    </host2>
    <host3>
        <name>hostname3.example.com</name>
        <jmxport>8100</jmxport>
        <jmxport>8101</jmxport>
        <jmxport>8102</jmxport>
        <jmxport>8103</jmxport>
    </host3>
</deploy>
to the file called monitor_settings.xml.
If you use -Ddk.netarkivet.settings.file=path/to/your/settings.xml you should now also use -Ddk.netarkivet.monitorsettings.file=path/to/your/monitor_settings.xml.
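To illustrate how the deploy section is organised, the following is a minimal sketch, not the NetarchiveSuite implementation, that lists the host:port pairs described by such a fragment. It assumes the fragment is passed on its own with deploy as the root element; the class MonitorTargetsSketch and the method jmxTargets are hypothetical names introduced only for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration; not part of NetarchiveSuite. */
final class MonitorTargetsSketch {

    /** Returns one "hostname:port" entry per jmxport element, in document order. */
    static List<String> jmxTargets(String deployXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(deployXml.getBytes(StandardCharsets.UTF_8)));
            // Assumption: <deploy> is the root element of the parsed fragment.
            Element deploy = doc.getDocumentElement();
            int hosts = Integer.parseInt(text(deploy, "numberOfHosts"));
            List<String> targets = new ArrayList<>();
            // Hosts are numbered host1..hostN, where N is numberOfHosts.
            for (int i = 1; i <= hosts; i++) {
                Element host = (Element) deploy.getElementsByTagName("host" + i).item(0);
                String name = text(host, "name");
                NodeList ports = host.getElementsByTagName("jmxport");
                for (int p = 0; p < ports.getLength(); p++) {
                    targets.add(name + ":" + ports.item(p).getTextContent().trim());
                }
            }
            return targets;
        } catch (Exception e) {
            // Malformed XML or a missing element ends up here in this sketch.
            throw new IllegalArgumentException("Could not read deploy section", e);
        }
    }

    /** Trimmed text of the first descendant element with the given tag name. */
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<deploy><numberOfHosts>1</numberOfHosts>"
                + "<host1><name>hostname1.example.com</name>"
                + "<jmxport>8100</jmxport><jmxport>8101</jmxport></host1></deploy>";
        jmxTargets(xml).forEach(System.out::println);
    }
}
```

Each jmxport entry under a host corresponds to one monitored application on that host, so the example fragment above with three hosts would yield nine entries.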
Setting top-level domains
The old setting settings.harvester.datamodel.domain.validDomainRegex has been removed. Instead, a new repeatable setting settings.harvester.datamodel.domain.tld has been introduced, which declares the valid top-level-domain parts for domains in the system. This setting also defines how domains are split. Thus, if "co.uk" is a valid top-level domain, a legal domain would be "bbc.co.uk", and the host "news.bbc.co.uk" would be considered a host in that domain.
Example: Say your old regex looked like this:
<validDomainRegex>^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+|[^\0000-,.-/:-@\[-`{-\0177]+\.(dk|com|net|uk))$</validDomainRegex>
You could replace it with the following for the exact same meaning:
<tld>dk</tld>
<tld>com</tld>
<tld>net</tld>
<tld>uk</tld>
Or you could introduce a better handling of .uk domains as follows:
<tld>dk</tld>
<tld>com</tld>
<tld>net</tld>
<tld>co.uk</tld>
<tld>gov.uk</tld>
<tld>ltd.uk</tld>
<tld>me.uk</tld>
<tld>mod.uk</tld>
<tld>net.uk</tld>
<tld>nic.uk</tld>
<tld>nhs.uk</tld>
<tld>org.uk</tld>
<tld>plc.uk</tld>
<tld>police.uk</tld>
<tld>sch.uk</tld>
<tld>uk</tld>
Note that uk is added at the end. This ensures that domains not belonging to any of the two-part TLDs are still caught (like the British Library).
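The effect of the tld list on how hosts are split into domains can be sketched as follows. This is an illustration only: it assumes that the longest configured TLD matching the end of the host name wins, which may differ from how NetarchiveSuite actually matches. The class TldSplitSketch and the method domainOf are hypothetical names, not part of the NetarchiveSuite API.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of NetarchiveSuite. */
final class TldSplitSketch {

    /**
     * Returns the domain a host belongs to: the longest configured TLD that
     * ends the host name, preceded by exactly one more label. Returns empty
     * if no configured TLD matches.
     */
    static Optional<String> domainOf(String host, List<String> tlds) {
        String h = host.toLowerCase(Locale.ROOT);
        String best = null;
        for (String tld : tlds) {
            // Assumption: the longest matching suffix wins, so "co.uk" beats "uk".
            if (h.endsWith("." + tld) && (best == null || tld.length() > best.length())) {
                best = tld;
            }
        }
        if (best == null) {
            return Optional.empty();
        }
        // Everything before ".<tld>"; its last label names the domain.
        String prefix = h.substring(0, h.length() - best.length() - 1);
        String label = prefix.substring(prefix.lastIndexOf('.') + 1);
        if (label.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(label + "." + best);
    }

    public static void main(String[] args) {
        List<String> tlds = List.of("dk", "com", "net", "co.uk", "uk");
        System.out.println(domainOf("news.bbc.co.uk", tlds)); // Optional[bbc.co.uk]
        System.out.println(domainOf("www.bl.uk", tlds));      // Optional[bl.uk]
    }
}
```

Under the same assumption, a list containing only uk would place news.bbc.co.uk in the domain co.uk, which is why adding the two-part entries gives better handling of .uk domains.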
Version History
Current development versions
Version 3.3.1 | 2007-09-24 | Mostly bugfix work, including possibility to use two-part TLDs like co.uk
Version 3.3.0 | 2007-08-06 | Mostly bugfix work, including upgradability of the monitored applications, and faster resubmitting of jobs
Stable versions
Version 3.2.3 | 2007-09-27 | Bugfix of 3.2.2 with patched deduplicator, that fixes problem in parallel indexing
Version 3.2.2 | 2007-08-03 | Bugfix of 3.2.1 with patched Heritrix 1.12.1, that supports ARCRecords larger than 2GBs
Version 3.2.1 | 2007-07-04 | Bugfix of 3.2.0 fixing trouble using the quick start manual.
Version 3.2.0 | 2007-07-04 | Open source release
Version 3.1.* | | Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library.
Version 3.0.0 | 2007-02-02 | Marked the naming of the NetarchiveSuite, the splitting of NetarchiveSuite into independent modules, and the licensing of NetarchiveSuite under LGPL
Version 2.* | | Various features and updates
Version 2.0 | 2006-08-30 | Marked a general restructuring of the code, where harvest definition data was backed by a database, the viewerproxy was trimmed and rewritten.
Version 1.* | | Various features and updates
Version 1.0 | 2005-07-01 | The first version of the netarchive software put in production for harvesting the entire Danish web
Version 0.* | | Various pre-production development versions