Differences between revisions 1 and 29 (spanning 28 versions)
Revision 1 as of 2009-02-02 11:07:35
Size: 16236
Comment:
Revision 29 as of 2010-08-16 10:24:45
Size: 15237
Editor: localhost
Comment: converted to 1.6 markup
#acl NetarkivetGroup:read,write,delete,revert,admin All:
This version of !NetarchiveSuite was released on 2009-05-23.

<<TableOfContents>>
Apart from a general fixing of bugs (see below) the most important new features are:
==== Issue tracking cleanup ====
We have cleaned up the bug, feature request, and patch trackers, removing many confusing fields. We are still finalizing the documentation that defines each remaining field and describes how bugs are handled.

==== Upgrade to Java 1.6.0 ====
!NetarchiveSuite now requires Java 1.6.0.
==== Improved stability with JMS connections ====
Previously, if an application lost its connection to the JMS server, the application had to be restarted. It will now attempt to reconnect in known recoverable scenarios.

==== Security manager supported ====
!NetarchiveSuite now ships with security policy files that limit what code outside the !NetarchiveSuite jar files and libraries is allowed to do. This increases security, especially for bitarchives. The file is distributed as conf/security.policy; to use it, start Java with {{{-Djava.security.manager -Djava.security.policy=conf/test.policy}}}. See the section on externally contributed batch jobs for a use case.

==== Embedded web server upgraded to Jetty 6.1.6 ====
The embedded web server has been upgraded. This fixes rare situations where large form fields caused exceptions, and should improve web server stability in general.

==== Code supporting Mac OS X ====
Although we do not test our software for compatibility with Mac OS X, we have received a patch that should enable !NetarchiveSuite to run on Macs, along with a report that it runs fine with this patch applied. Many thanks to Lars Clausen and The National Library of Scotland for this patch.

==== Better log file naming in QuickStart scripts ====
It is a lot easier to debug !QuickStart installations after the logfiles have been properly named for each application. Many thanks to Lars Clausen and The National Library of Scotland for this patch.
==== New settings structure ====
The way settings are read has been radically changed. A settings file no longer has to contain every setting; default values are used for any setting that is not set. Furthermore, more than one settings file can be used, with one overriding the other. Refer to the [[Installation Manual]] for details on how they are read. We recommend not setting values where the defaults are acceptable, so that new defaults are picked up automatically in new versions of !NetarchiveSuite.
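
The lookup order described above can be sketched as follows. This is only an illustration, assuming settings are flattened into key/value maps: the class {{{LayeredSettings}}}, its methods, the example values, and the rule that files added first take precedence are all hypothetical and not taken from !NetarchiveSuite. The [[Installation Manual]] remains the reference for the real behaviour.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of layered settings; not NetarchiveSuite code. */
class LayeredSettings {
    /** Settings files in precedence order (assumed: first added wins). */
    private final List<Map<String, String>> layers = new ArrayList<>();
    /** Built-in defaults, consulted only when no file sets the key. */
    private final Map<String, String> defaults;

    LayeredSettings(Map<String, String> defaults) {
        this.defaults = new HashMap<>(defaults);
    }

    /** Registers the key/value pairs read from one settings file. */
    void addLayer(Map<String, String> fileSettings) {
        layers.add(new HashMap<>(fileSettings));
    }

    /** Returns the first value found in the files, else the default. */
    String get(String key) {
        for (Map<String, String> layer : layers) {
            String value = layer.get(key);
            if (value != null) {
                return value;
            }
        }
        String value = defaults.get(key);
        if (value == null) {
            throw new IllegalArgumentException("Unknown setting: " + key);
        }
        return value;
    }

    /** Builds a small example: one default is overridden, one is not. */
    static LayeredSettings example() {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("settings.common.environmentName", "DEV");   // illustrative value
        defaults.put("settings.common.jmx.timeout", "120");       // illustrative value
        LayeredSettings settings = new LayeredSettings(defaults);

        Map<String, String> siteFile = new HashMap<>();
        siteFile.put("settings.common.environmentName", "PROD");  // site-specific override
        settings.addLayer(siteFile);
        return settings;
    }
}
```

Only the overridden key lives in the site file; the timeout is left to the default, so a changed default in a later release would take effect without editing the file.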

==== German added as language ====
Language files for German have been added to !NetarchiveSuite, thanks to a patch from ONB.

=== Deploy Module ===
==== New and improved software to deploy NetarchiveSuite ====
The existing deploy software, which was developed for the needs of the Netarkivet installation and was lacking in many areas, has been replaced by new deploy software that is more versatile and configurable, and thus hopefully useful for a wider range of scenarios.
==== Switch to DecidingScope ====
!NetarchiveSuite now controls Heritrix using !DecidingScope rather than the older, deprecated !HostScope and !DomainScope. This is expected to improve harvesting performance. For the end user, there is no visible difference.

==== HarvestDefinitionApplication replaced by GUIApplication ====
The application !HarvestDefinitionApplication has been removed. Instead, use dk.netarkivet.common.GUIApplication. It will work exactly as the old application, assuming you have the !HarvestDefinition site section deployed in settings, that is:

{{{
<settings>
...
<common>
...
<webinterface>
...
<siteSection>
<!-- A subclass of SiteSection that defines this part of the
web interface. -->
<class>dk.netarkivet.harvester.webinterface.DefinitionsSiteSection</class>
<!-- The directory or war-file containing the web application
for this site section.-->
<webapplication>webpages/HarvestDefinition</webapplication>
<!-- The URL path for this section of the web interface. -->
<deployPath>/HarvestDefinition</deployPath>
</siteSection>
...
}}}
This is the default, and unchanged since previous versions.

==== OutOfMemory fix ====
When running very large harvests, the harvest index caused out-of-memory errors both during harvesting (the index used for deduplication) and during access (the index used for browsing). This has been fixed in this release.

==== Support for setting a byte limit on event harvests ====
When adding seeds to a harvest definition using the "Add seeds" functionality, it is now possible to set a byte limit. Many thanks to Lars Clausen and The National Library of Scotland for this patch.

==== Support for byte limits over 2GB ====
The previous upper bound of 2GB on the byte limits you can set has been removed.

==== The default byte limit for new configurations is now a setting ====
Previously, when creating a new configuration, the default byte limit was hardcoded to 500MB. Now, it is a setting. Simply set

{{{
<harvester>
<datamodel>
<domain>
...
<!-- Default byte limit for domain configuration. -->
<defaultMaxbytes>1000000000</defaultMaxbytes>
...
</domain>
</datamodel>
</harvester>
}}}
==== Heritrix upgraded to version 1.14.3 ====
The version of Heritrix bundled with !NetarchiveSuite has now been upgraded to version 1.14.3, released 2009-03-03.

Note that two Heritrix attributes have been renamed. 'overly-eager-link-detection', used as an attribute by ExtractorHTML and JerichoExtractorHTML with a default value of 'true', has been renamed 'extract-value-attributes' to more accurately reflect its effect. 'bind-address' in FetchHTTP, with a default value of the empty string, has been renamed 'http-bind-address'.

[[http://webteam.archive.org/confluence/display/Heritrix/Release+Notes+-+1.14.0|ReleaseNotes for Heritrix 1.14.0]]

[[http://webteam.archive.org/confluence/display/Heritrix/Release+Notes+-+1.14.1|ReleaseNotes for Heritrix 1.14.1]]

[[http://webteam.archive.org/confluence/display/Heritrix/Release+Notes+-+1.14.2|ReleaseNotes for Heritrix 1.14.2]]

[[http://webteam.archive.org/confluence/display/Heritrix/Release+Notes+-+1.14.3|ReleaseNotes for Heritrix 1.14.3]]

==== Access to harvest logs ====
From the job page of any DONE or FAILED job, there is now a link to inspect the Heritrix logs, metadata files, and harvested files from that job! This greatly enhances the possibilities of debugging why a job behaved as it did. You can also get the subset of a crawl.log that referred to a specific domain in a harvest.

==== Derby DB upgraded to 10.4.2.0 ====

We have upgraded Derby DB to version 10.4.2.0 (released September 5, 2008).
To migrate the Derby database files from the old 10.1.1.0 format to the new internal
format, add the upgrade property to the connection URL:
{{{ "jdbc:derby:harvestDatabase;upgrade=true" }}}
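
As a rough sketch of how such a one-time upgrade connection could be made from Java: the class {{{DerbyUpgradeSketch}}} and its method names are hypothetical, and the {{{upgrade}}} method assumes the Derby embedded driver is on the classpath. Only the URL format is taken from the text above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Hypothetical helper; not part of NetarchiveSuite. */
class DerbyUpgradeSketch {
    /** Builds the connection URL with the upgrade attribute appended. */
    static String upgradeUrl(String databaseName) {
        return "jdbc:derby:" + databaseName + ";upgrade=true";
    }

    /**
     * Connects once with upgrade=true and closes the connection again.
     * Requires the Derby embedded driver on the classpath.
     */
    static void upgrade(String databaseName) throws SQLException {
        try (Connection connection = DriverManager.getConnection(upgradeUrl(databaseName))) {
            // Nothing to do here: Derby performs the upgrade while connecting.
        }
    }
}
```

Back up the database directory before running this; a database upgraded this way can no longer be opened by the older Derby version.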
==== Bit preservation restructuring ====
Bit preservation has undergone a major restructuring. This is partly preparation for further actions that will improve bit preservation, but it has the immediate effect that reestablishing a missing file, or fixing a file with a bit error, requires fewer mouse clicks and generally runs faster. It should also be possible to restore more files in one go than before.

==== Support for submitting externally contributed batch jobs ====
The bit archive now has support for launching a batch job on the archive, that is written by an external source, without recompiling. This is a great tool for researchers wishing to do some analysis on the entire archive. All you have to do is to subclass the abstract class dk.netarkivet.common.utils.arc.!FileBatchJob and implement methods for initialisation, finishing and what to do on each file. The results must be written to an output stream. It will then be executed on all bitarchive machines. The results will be written to a file, or to the screen. To work on individual arc records, rather than entire files, subclass dk.netarkivet.common.utils.arc.ARCBatchJob instead. The mechanism to do this is the command line tool dk.netarkivet.archive.tools.!RunBatch. E.g.

{{{
java dk.netarkivet.archive.tools.RunBatch MyBatchJob.class
}}}
Optionally, you can run the job on a subset of all files, on a specific location, and save the output to a file. Try starting the command line client with no arguments for an example. When using this option, it is important to run your bitarchive with a security manager and a restrictive policy (see above); otherwise external batch jobs might damage your bit archives.
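
The three phases of a batch job (initialisation, per-file processing, finishing) can be sketched as below. To keep the example self-contained it does not use the real !NetarchiveSuite classes: {{{SketchBatchJob}}} is a hypothetical stand-in whose method names and signatures are assumptions, and {{{FileCountJob}}} is an invented example job. Check the Javadoc of dk.netarkivet.common.utils.arc.!FileBatchJob for the actual methods to override.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a batch job base class; signatures are assumed. */
abstract class SketchBatchJob {
    abstract void initialize(OutputStream os) throws IOException;
    abstract boolean processFile(File file, OutputStream os) throws IOException;
    abstract void finish(OutputStream os) throws IOException;
}

/** Example job: lists each file name and reports the number of files seen. */
class FileCountJob extends SketchBatchJob {
    private int count;

    @Override
    void initialize(OutputStream os) {
        count = 0; // reset state before the first file
    }

    @Override
    boolean processFile(File file, OutputStream os) throws IOException {
        count++;
        os.write((file.getName() + "\n").getBytes(StandardCharsets.UTF_8));
        return true; // signal that this file was processed successfully
    }

    @Override
    void finish(OutputStream os) throws IOException {
        os.write(("files=" + count + "\n").getBytes(StandardCharsets.UTF_8));
    }

    /** Drives the job locally over the given files and returns its output. */
    static String runOn(File... files) {
        FileCountJob job = new FileCountJob();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            job.initialize(out);
            for (File file : files) {
                job.processFile(file, out);
            }
            job.finish(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The {{{runOn}}} driver only imitates locally what the bitarchive machines would do with a submitted job; it is handy for testing a job's logic before submitting it.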

=== ViewerProxy Module ===
==== Feedback when only a partial browse index is available ====
When using the !ViewerProxy for QA, you need to know exactly which set you are browsing. When requesting a browse index, you are automatically given the largest subset of the index that is actually available. Previously, missing parts were silently ignored; now they are reported in the !ViewerProxy status interface.
==== Better handling of third-party batch jobs with tool ====
The command-line tool for submitting third-party batch jobs now supports a much better syntax. Simply launch the tool ({{{dk.netarkivet.archive.tools.RunBatch}}}) with no arguments for a description.

== Documentation ==
The documentation has been brought up to date, and some parts have been elaborated. A new manual, the Configuration Manual, has been added. The [[Developer Manual]] has been replaced with a Systems Design manual that includes database documentation. Complementing this documentation, the scripts to generate a new harvest database now contain elaborate documentation on the tables used by the harvest definition interface. These can be found in the distribution packages under {{{scripts/sql}}}.

Furthermore, an Additional Tools manual has been added, complementing the User Manual and the Installation Manual.

== Bugs fixed since NetarchiveSuite 3.6.* ==
=== Common Module ===
{{{
bug 1247: FileUtils.unzip does not unzip directories properly
bug 1308: 3 different applications uses 'COMMON_THIS_HACO' in the JMS-queue naming
bug 1501: Setting schema and xsi type not valid
FR 291: HarvestControllerServer uses http port to set unique THIS_HACO
FR 1101: Should upgrade to using Java 1.6
FR 1252: Upgrade to Apache Derby 10.4.1.3
FR 1276: QuickStart installation should not use DEV as its environmentName
FR 1277: Move derbytools library from tests/lib/db to lib/db
Patch 1493: German Translation
}}}
=== Harvester Module ===
{{{
bug 662: We throw away important information from SQLException
bug 934: could not start harvester due to JMX problem (Adress already in use)
bug 1157: jobs disappeared in a strange way
bug 1167: Danish translation of "save configuration" is inconsistent
bug 1181: The templates in scripts/simple_harvest/data/originals/harvestdefinitionbasedir/order_templates are invalid
bug 1226: Discrepancy between how our database are defined in the dev, and prod environments respectively
bug 1240: PAUSED heritrix gets terminated by JMXHarvestController
bug 1292: Possible to store Jobs in database coming from unknown harvestdefinition
bug 1468: If harvest process ends with non-zero exit code for any reason, it will never be retried to start the harvester by the SideKick
bug 1469: Wrong translation label in Definitions-find-domains if no domains are in the database
bug 1519: Harvesting with no domains gives runs without history
FR 770: Resubmitted jobs do not provide information about which job they are resubmitted as
FR 1108: Heritrix logs should be accessible in harvest definition interface
FR 1146: submit date as new field on jobs
FR 1160: Status and order selection (771) on History/Harveststatus-perharvestrun.jsp
FR 1162: Multiselect of status on all jobs
FR 1485: File must be streamed instead on QA/QA-crawlloglines.jsp
FR 1497: Button in NetarchiveSuite
FR 1643: Update Heritrix to version 1.14.3
}}}
=== Archive Module ===
{{{
bug 574: When parsing checksum in ArcRepository we do strange things on wrong results
bug 800: Indexserver generates indices for the empty job set
bug 1191: NullPointerException in BitarchiveMonitorServer
bug 1193: Exceptions from FileBatchJob stop batch job processing
bug 1212: When a file with bit errors has been restored, it still appears on the list of files with checksum errors
bug 1216: no commandline argument support for batch-output file
bug 1246: Bitpreservation GUI fails
bug 1261: Wrong table headings (mostly Danish) in info part on page "Missing Files"
bug 1278: Error from RunBatch unexpected
bug 1279: Missing toString method on FileBatchJob classes
bug 1294: ref. to argument before check of argument in Filelist batchjob
bug 1492: The "Missing Files" page repeats the "status" headline
bug 1564: Disk mount filled to last block - and then it's not possible to move a single file
FR 1029: A tool in dk.netarkivet.archive.tools to retrieve a file from the archive would be nice
FR 1195: TrivialArcRepositoryClient does not offer setting for file directory
FR 1263: Confusing layout of "Files with checksum errors"
FR 1410: Missing files page could look nicer after update
FR 1498: batchJobs given in jar files
}}}
=== Access Module ===
{{{
bug 1152: URLs with { or } not browsable
}}}
==== Automatic registering of applications for monitoring ====
Applications now automatically register themselves for monitoring by the monitor GUI. This has two effects:

 * Positive: You do not need to have a list of the machines to monitor in monitor_settings.xml, and you do not need to update it when adding new machines.
 * Negative: You will not get an error message if an expected machine does not show up on the list of monitored machines.

== Documentation ==
The documentation has been brought up to date, and some parts have been elaborated. In particular, the database documentation in the [[Developer Manual]] has been updated, and the scripts to generate a new harvest database now have elaborate documentation on the tables used by the harvest definition interface. These can be found in the distribution packages under {{{scripts/sql}}}.

== Bugs fixed since NetarchiveSuite 3.4.* ==
=== Common Module ===
{{{
555 JMS connections cannot reconnect
788 problems with removing MessageListener
1063 Mention environment in webpages
1163 Upgrade to jetty 6.1.6
1175 Underscore in setting environmentName results in Internal Server error in Viewerproxy
1184 Notification is not sent, if backup fails
1185 ProcessUtils starts timer threads that cause random interrupts in our code
1194 Source for webpages is not packaged in source-zipball
1220 Wrong javadoc for method HTMLUtils.makeTableHeader()
1235 Mac support
}}}
=== Harvester Module ===
{{{
924 jobs that completely fails reports IOException
1078 DeDuplikator index too large
1093 Domain configurations with large limits are wrongly handled, causing unharvested configurations and schedule time
1100 Distributed heritrix.properties file contains wrong version number
1161 Use of known null value in dk.netarkivet.harvester.webinterface.SelectiveHarvest.updateHarvestDefinition
1118 Harvest templates should have their URL and email address checked
1165 Switch to DecidingScope
1182 Logging of job status no longer contains headings
1206 JMXHeritrixController can fail to initialize without throwing an exception
1213 Missing space before listing of aliases of this domain
1232 Incorrect ArgumentNotValid check in constructor of dk.netarkivet.harvester.harvesting.distribute.DomainStats
1234 Allow max bytes setting for event harvests
1238 JMX connection to Heritrix fails immediately after Heritrix process being started
1242 TimeUnit should be public enumeration type
1245 Required arcsdir value not checked
1248 NPE in deduplicator-0.3.0-20061218b.jar
1250 Check on table version of tabel harvestdefinitions is missing
1255 The default bytelimit for configurations should be a setting
}}}
=== Archive Module ===
{{{
1079 snap shot harvest not browsable due to large index
1203 Reestablishing missing files is a lot more inefficient than it should be
1246 Bitpreservation GUI fails
1258 no "Shift infobox for 4 files" on the "Missing Files" page
}}}
=== ViewerProxy module ===
{{{
1253 It is impossible to see that only a partial index is available
}}}
=== Monitor Module ===
{{{
900 Liveness logger logs too often, every 2 minutes
1042 Automatic registering to monitor of running applications desirable
1050 monitor_settings.xml is not parsed properly
1190 missing letter in danish translation of error message
}}}
{{{
bug 1223: System overview takes more than 20 secs and Show all never returns
bug 1388: Too much logging during automatic registering of applications
FR 1042: Automatic registering to monitor of running applications desirable
FR 1379: Misleading error in SystemState when harvester has not started
FR 1471: Monitor plugin had no alternative
FR 1483: Alternative monitor plugin could be print of jmx url
}}}
=== Deploy Module ===
{{{
bug 431: Settings.DIR_COMMONTEMPDIR directories should be emptied upon startup
bug 433: Starting the bit archives twice without killing inbetween make bitarchive immortal
bug 846: Make sure that install-scripts makes the necessary directories
bug 1271: Naming of GuiApplication scripts are misleading
FR 1520: Deploy of more than one Bitapplication per server
FR 1572: We need a new DeployApplication that is more usable, and more configurable
1177 UserManual contains wrong information about domains
1273 QuickStart using Mysql as Database fails due to Java access permissions
}}}
=== Tools ===
{{{
1029 A tool in dk.netarkivet.archive.tools to retrieve a file from the archive would be nice
}}}
=== Scripts ===
{{{
1233 Make harvest.sh name logfiles properly
bug 1178: Missing information in UserManual: Adding domains not in database to a selective harvest
bug 1266: UserManual: missing screendump desc for new field in "Adding seeds to an event harvest"
bug 1273: QuickStart using Mysql as Database fails due to Java access permissions
bug 1484: Document that we are not entirely platform independent in installation manual
bug 1514: Add "ContentSize" object to list of mandatory fields for harvester templates in Installation Manual
FR 281: set FTP-directory to ~/ftp (default is ~)
FR 1287: Developer Manual should have better splits
FR 1389: Installation Manual should document the need to set the maximum number of producers on JMS broker
}}}
=== Upgrading from 3.4.* ===
Please note that we only support upgrading from the previous stable release. Upgrading across several stable releases is not supported. You will need to upgrade step-by-step through all stable releases (for instance 3.0 -> 3.2 -> 3.4).

=== Upgrade your templates to use DecidingScope ===
!NetarchiveSuite now requires that you use !DecidingScope in all Heritrix templates. You will need to download all your templates, make the updates described in the [[Installation Manual]], Appendix E, and upload them again.

=== Upgrade of jobs table in database ===
!NetarchiveSuite will automatically upgrade the jobs table in the database. For Derby databases, this may take some time depending on the number of entries in the jobs table. If the system crashes during the upgrade, you will get an error about this when restarting the system. The upgrade is done safely, so no data will be lost, but in this rare case the system must be reestablished manually. Please do not hesitate to contact us on the mailing list in case of such an unfortunate accident. For more detailed information, please refer to ConvertingJobsTableFromVersion3To4.

=== Automatic monitor registration settings ===
The automatic registering of clients is a new pluggable setting. The default implementation sends a JMS notification to the monitoring interface every minute, informing the monitor that this application is alive and can be monitored. Enable it with the following setting:

{{{
<settings>
...
<common>
...
<monitorregistryClient xsi:type="jmsmonitorregistryclient">
<!-- The class instantiated to register JMX urls at a registry. -->
<class>dk.netarkivet.common.distribute.monitorregistry.JMSMonitorRegistryClient</class>
</monitorregistryClient>
}}}
The settings from monitor_settings.xml that describe running applications can be removed. The only thing left in that settings file is:

{{{
<settings>
<monitor>
<!-- The password used to connect to all MBean servers started by the application. This password must be the same as the one in jmxremote.password -->
<jmxMonitorRolePassword>JMX_MONITOR_ROLE_PASSWORD_PLACEHOLDER</jmxMonitorRolePassword>
</monitor>
</settings>
}}}
=== Add setting for time out when waiting for responses from JMX subsystems ===
We communicate with the monitoring framework and Heritrix using JMX. A new setting controls how long we wait for a reply before giving up when trying to communicate. You should add the following to your settings.xml files:

{{{
<common>
<jmx>
...
<!-- How many seconds we will wait before giving up on a JMX
connection. -->
<timeout>120</timeout>
</jmx>
</common>
}}}
=== Add setting for default configuration byte limit ===
The byte limit used when creating new configurations is now configurable. You should add the following to your settings.xml files:

{{{
<harvester>
<datamodel>
<domain>
...
<!-- Default byte limit for domain configuration. -->
<defaultMaxbytes>1000000000</defaultMaxbytes>
...
</domain>
</datamodel>
</harvester>
}}}
Remember to stop the running installation before upgrading.

=== Order templates ===
In the order templates, you need to add a new setting in the deduplication module, or the deduplication will require far too much RAM.
The setting is
{{{
<newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
    [...]
    <!--Start of change-->
    <boolean name="use-sparse-range-filter">true</boolean>
    <!--End of change-->
    [...]
</newObject>
}}}

=== New settings structure ===
As mentioned above, settings now have defaults and can be read from multiple settings files. We recommend that you no longer set settings for which you wish to use the default values.

=== Default Monitor class moved ===
The plugin for distributed monitoring using !StatusSiteSection has moved. The setting {{{settings.monitorregistryClient.class}}} that was previously {{{dk.netarkivet.common.distribute.monitorregistry.JMSMonitorRegistryClient}}} is now {{{dk.netarkivet.monitor.distribute.JMSMonitorRegistryClient}}}

=== New settings ===
Settings from monitor_settings.xml are now settings in the standard settings files. The new settings are:

{{{settings.monitor.jmxUsername}}}: Must match the JMX username in all monitored applications. Default: monitorRole

Note that if you change the jmxUsername, you must change the security.policy accordingly. Assuming you set jmxUsername to "anonymous", we need the following line in the security.policy:

{{{
grant principal javax.management.remote.JMXPrincipal "anonymous" {
  permission java.security.AllPermission; };
}}}
{{{settings.monitor.jmxPassword}}}: Must match the JMX password in all monitored applications. Default: JMX_MONITOR_ROLE_PASSWORD_PLACEHOLDER

=== Changed setting names ===
The setting {{{settings.common.jms.environmentName}}} is now {{{settings.common.environmentName}}}

The setting {{{settings.common.harvester.datamodel.domain.tld}}} is now {{{settings.common.topLevelDomains.tld}}}

The setting {{{settings.common.database.specificsclass}}} is now {{{settings.common.database.class}}}

The setting {{{settings.archive.bitarchive.limitForRecordDatatransferInFile}}} is now {{{settings.common.repository.limitForRecordDatatransferInFile}}}

The setting {{{settings.archive.arcrepository.location}}} is now {{{settings.common.locations.location}}}

The setting {{{settings.archive.arcrepository.batchLocation}}} is now {{{settings.common.locations.batchLocation}}}

The setting {{{settings.archive.bitarchive.thisLocation}}} is now {{{settings.common.thisPhysicalLocation}}}

The setting {{{settings.monitor.applicationName}}} is now {{{settings.common.applicationName}}}
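
When migrating existing settings files, the renames listed above can be captured in a lookup table. The sketch below is only an illustration: the class {{{SettingRenames}}} and its method are hypothetical helpers, not part of !NetarchiveSuite, and only the setting names themselves come from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical migration helper; not part of NetarchiveSuite. */
class SettingRenames {
    /** Old setting name mapped to its new name, in the order listed above. */
    static final Map<String, String> RENAMED = new LinkedHashMap<>();

    static {
        RENAMED.put("settings.common.jms.environmentName",
                "settings.common.environmentName");
        RENAMED.put("settings.common.harvester.datamodel.domain.tld",
                "settings.common.topLevelDomains.tld");
        RENAMED.put("settings.common.database.specificsclass",
                "settings.common.database.class");
        RENAMED.put("settings.archive.bitarchive.limitForRecordDatatransferInFile",
                "settings.common.repository.limitForRecordDatatransferInFile");
        RENAMED.put("settings.archive.arcrepository.location",
                "settings.common.locations.location");
        RENAMED.put("settings.archive.arcrepository.batchLocation",
                "settings.common.locations.batchLocation");
        RENAMED.put("settings.archive.bitarchive.thisLocation",
                "settings.common.thisPhysicalLocation");
        RENAMED.put("settings.monitor.applicationName",
                "settings.common.applicationName");
    }

    /** Returns the new name for a renamed setting, or the name unchanged. */
    static String currentName(String name) {
        String renamed = RENAMED.get(name);
        return renamed != null ? renamed : name;
    }
}
```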

=== Removed settings ===
The setting {{{settings.common.siteSection.deployPath}}} is no longer used.
If you are maintaining a translation, please note that the following new keys have been added: '''viewerproxy/Translations.properties''':

{{{
errormsg;request.was.for.0.but.got.1.missing.2=WARNING: The request was for the jobs {0}, but only the jobs {1} had available data for the index. Missing data for the jobs {2}
}}}
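
For translators, it may help to see how the numbered placeholders are filled in. The sketch below assumes the placeholders follow the conventions of {{{java.text.MessageFormat}}}; that assumption is ours, and the class {{{TranslationFormatSketch}}} is a hypothetical illustration rather than !NetarchiveSuite code. The pattern text is the English value of the key above.

```java
import java.text.MessageFormat;

/** Hypothetical illustration of placeholder substitution. */
class TranslationFormatSketch {
    /** English text of the new viewerproxy key, copied from above. */
    static final String PATTERN =
            "WARNING: The request was for the jobs {0}, but only the jobs {1} "
            + "had available data for the index. Missing data for the jobs {2}";

    /** Fills {0}, {1} and {2} with the three job lists, in that order. */
    static String render(String requested, String available, String missing) {
        return MessageFormat.format(PATTERN, requested, available, missing);
    }
}
```

A translation may reorder the placeholders freely, but it must keep all three numbers so that no job list is dropped from the message.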
If you are maintaining a translation, please note that the following new keys have been added:
{{{
change.0.failed=Change {0} failed
change.0.may.be.added=Change {0} that can be added
errormsg;admin.data.not.consistent.for.file.0=Admin data are not consistent for the file {0}
replace.file.in.bitarchive.0=Replace the file in bitarchive {0}
file.0.has.been.replaced.in.1=The file {0} has been replaced at bitarchive {1}
unable.to.correct=Unable to correct
no.info.on.file.0=No information could be found on the file ''{0}''.
no.checksum=No checksum
}}}
The following translations are no longer used, and can be removed:

{{{
change
failed
may.be.added
admin.data.not.consistent.for.file.0
remove.file.from.bitarchive.0
file.0.has.been.deleted.in.1.needs.copy
}}}
'''archive/Translations.properties''' and '''monitor/Translations.properties''': All properties starting with "errmsg;" have been renamed to "errormsg;"

{{{
no.files.with.checksum.errors=No files with checksum errors were found
location=Location
}}}
'''harvester/Translations.properties'''

{{{
subtitle;reports.for.job=Harvest information for job
harvest.reports=Browse reports for jobs
harvest.files=Browse harvest files for job
crawl.log.lines.for.domain.0=Browse only relevant crawl-log lines for domain {0}
}}}
'''viewerproxy/Translations.properties'''

{{{
pagetitle;qa.get.files=Get harvested files
pagetitle;qa.get.reports=Get harvest reports
pagetitle;qa.crawllog.lines.for.domain=Lines from crawl.log about domain
pagetitle;files.for.job.0=Files for job {0}
pagetitle;reports.for.job.1=Reports for job {0}
pagetitle;qa.crawllog.lines.for.domain.0.in.1=Lines from crawl.log of job {1} concerning domain {0}
helptext;get.job.qa.information.with.viewerproxy=The links below will only work \
if your browser is set up to use the viewerproxy as web proxy.
}}}
||Version 3.6.0 ||2008-07-03 ||Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes, automatic registering of applications for monitoring ||
||Version 3.5.0 ||2008-03-04 ||Development version aiming for 3.6.0 ||
||Version 3.7.0 ||2008-11-04 ||Development version aiming for 3.8.0 ||
||Version 3.6.0 ||2008-07-03 ||Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes ||
||Version 3.5.* || ||Development versions aiming for 3.6.0 ||

Release Notes for NetarchiveSuite 3.8.0

This version of NetarchiveSuite was released on 2009-05-23.

New features since NetarchiveSuite 3.6.*

Apart from a general fixing of bugs (see below) the most important new features are:

General

Issue tracking cleanup

We have cleaned up the bug, feature request, and patch trackers, removing many confusing fields. We are still finalizing the documentation that defines each remaining field and describes how bugs are handled.

Upgrade to Java 1.6.0

NetarchiveSuite now requires Java 1.6.0.

Common Module

New settings structure

The way settings are read has been radically changed. A settings file no longer has to contain every setting; default values are used for any setting that is not set. Furthermore, more than one settings file can be used, with one overriding the other. Refer to the Installation Manual for details on how they are read. We recommend not setting values where the defaults are acceptable, so that new defaults are picked up automatically in new versions of NetarchiveSuite.

German added as language

Language files for German have been added to NetarchiveSuite, thanks to a patch from ONB.

Deploy Module

New and improved software to deploy NetarchiveSuite

The existing deploy software, which was developed for the needs of the Netarkivet installation and was lacking in many areas, has been replaced by new deploy software that is more versatile and configurable, and thus hopefully useful for a wider range of scenarios.

Harvester Module

Heritrix upgraded to version 1.14.3

The version of Heritrix bundled with NetarchiveSuite has now been upgraded to version 1.14.3, released 2009-03-03.

Note that two Heritrix attributes have been renamed. 'overly-eager-link-detection', used as an attribute by ExtractorHTML and JerichoExtractorHTML with a default value of 'true', has been renamed 'extract-value-attributes' to more accurately reflect its effect. 'bind-address' in FetchHTTP, with a default value of the empty string, has been renamed 'http-bind-address'.

ReleaseNotes for Heritrix 1.14.0

ReleaseNotes for Heritrix 1.14.1

ReleaseNotes for Heritrix 1.14.2

ReleaseNotes for Heritrix 1.14.3

Access to harvest logs

From the job page of any DONE or FAILED job, there is now a link to inspect the Heritrix logs, metadata files, and harvested files from that job! This greatly enhances the possibilities of debugging why a job behaved as it did. You can also get the subset of a crawl.log that referred to a specific domain in a harvest.

Derby DB upgraded to 10.4.2.0

We have upgraded Derby DB to version 10.4.2.0 (released September 5, 2008). To migrate the Derby database files from the old 10.1.1.0 format to the new internal format, add the upgrade property to the connection URL: "jdbc:derby:harvestDatabase;upgrade=true"

Archive Module

Better handling of third-party batch jobs with tool

The command-line tool for submitting third-party batch jobs now supports a much better syntax. Simply launch the tool (dk.netarkivet.archive.tools.RunBatch) with no arguments for a description.

Documentation

The documentation has been brought up to date, and some parts have been elaborated. A new manual, the Configuration Manual, has been added. The Developer Manual has been replaced with a Systems Design manual that includes database documentation. Complementing this documentation, the scripts to generate a new harvest database now contain elaborate documentation on the tables used by the harvest definition interface. These can be found in the distribution packages under scripts/sql.

Furthermore, an Additional Tools manual has been added, complementing the User Manual and the Installation Manual.

Bugs fixed since NetarchiveSuite 3.6.*

Common Module

bug 1247: FileUtils.unzip does not unzip directories properly
bug 1308: 3 different applications uses 'COMMON_THIS_HACO' in the JMS-queue naming
bug 1501: Setting schema and xsi type not valid
FR 291: HarvestControllerServer uses http port to set unique THIS_HACO
FR 1101: Should upgrade to using Java 1.6
FR 1252: Upgrade to Apache Derby 10.4.1.3
FR 1276: QuickStart installation should not use DEV as its environmentName
FR 1277: Move derbytools library from tests/lib/db to lib/db
Patch 1493: German Translation

Harvester Module

bug 662: We throw away important information from SQLException
bug 934: could not start harvester due to JMX problem (Adress already in use)
bug 1157: jobs disappeared in a strange way
bug 1167: Danish translation of "save configuration" is inconsistent
bug 1181: The templates in scripts/simple_harvest/data/originals/harvestdefinitionbasedir/order_templates are invalid
bug 1226: Discrepancy between how our database are defined in the dev, and prod environments respectively
bug 1240: PAUSED heritrix gets terminated by JMXHarvestController
bug 1292: Possible to store Jobs in database coming from unknown harvestdefinition
bug 1468: If harvest process ends with non-zero exit code for any reason, it will never be retried to start the harvester by the SideKick
bug 1469: Wrong translation label in Definitions-find-domains if no domains are in the database
bug 1519: Harvesting with no domains gives runs without history
FR 770: Resubmitted jobs do not provide information about which job they are resubmitted as
FR 1108: Heritrix logs should be accessible in harvest definition interface
FR 1146: submit date as new field on jobs
FR 1160: Status and order selection (771) on History/Harveststatus-perharvestrun.jsp
FR 1162: Multiselect of status on all jobs
FR 1485: File must be streamed instead on QA/QA-crawlloglines.jsp
FR 1497: Button in NetarchiveSuite
FR 1643: Update Heritrix to version 1.14.3

Archive Module

bug 574: When parsing checksum in ArcRepository we do strange things on wrong results
bug 800: Indexserver generates indices for the empty job set
bug 1191: NullPointerException in BitarchiveMonitorServer
bug 1193: Exceptions from FileBatchJob stop batch job processing
bug 1212: When a file with bit errors has been restored, it still appears on the list of files with checksum errors
bug 1216: no commandline argument support for batch-output file
bug 1246: Bitpreservation GUI fails
bug 1261: Wrong table headings (mostly Danish) in info part on page "Missing Files"
bug 1278: Error from RunBatch unexpected
bug 1279: Missing toString method on FileBatchJob classes
bug 1294: ref. to argument before check of argument in Filelist batchjob
bug 1492: The "Missing Files" page repeats the "status" headline
bug 1564: Disk mount filled to last block - and then it's not possible to move a single file
FR 1029: A tool in dk.netarkivet.archive.tools to retrieve a file from the archive would be nice
FR 1195: TrivialArcRepositoryClient does not offer setting for file directory
FR 1263: Confusing layout of "Files with checksum errors"
FR 1410: Missing files page could look nicer after update
FR 1498: batchJobs given in jar files

Access Module

bug 1152: URLs with { or } not browsable

Monitor Module

bug 1223: System overview takes more than 20 secs and Show all never returns
bug 1388: Too much logging during automatic registering of applications
FR 1042: Automatic registering to monitor of running applications desirable
FR 1379: Misleading error in SystemState when harvester has not started
FR 1471: Monitor plugin had no alternative
FR 1483: Alternative monitor plugin could be print of jmx url

Deploy Module

bug 431: Settings.DIR_COMMONTEMPDIR directories should be emptied upon startup
bug 433: Starting the bit archives twice without killing inbetween make bitarchive immortal
bug 846: Make sure that install-scripts makes the necessary directories
bug 1271: Naming of GuiApplication scripts are misleading
FR 1520: Deploy of more than one Bitapplication per server
FR 1572: We need a new DeployApplication that is more usable, and more configurable

Documentation

bug 1178: Missing information in UserManual: Adding domains not in database to a selective harvest
bug 1266: UserManual: missing screendump desc for new field in "Adding seeds to an event harvest"
bug 1273: QuickStart using Mysql as Database fails due to Java access permissions
bug 1484: Document that we are not entirely platform independent in installation manual
bug 1514: Add "ContentSize" object to list of mandatory fields for harvester templates in Installation Manual
FR 281: set FTP-directory to ~/ftp (default is ~)
FR 1287: Developer Manual should have better splits
FR 1389: Installation Manual should document the need to set the maximum number of producers on JMS broker

Upgrade instructions

Remember to stop the running installation before upgrading.

Order templates

In the order templates, you need to add a new setting to the deduplication module; otherwise deduplication will require far too much RAM. The setting is:

<newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
    [...]
    <!--Start of change-->
    <boolean name="use-sparse-range-filter">true</boolean>
    <!--End of change-->
    [...]
</newObject>

New settings structure

As mentioned above, settings now have defaults and can be read from multiple settings files. We recommend that you no longer specify settings for which you want the default values.

Default Monitor class moved

The plugin for distributed monitoring using StatusSiteSection has moved. The value of the setting settings.monitorregistryClient.class, previously dk.netarkivet.common.distribute.monitorregistry.JMSMonitorRegistryClient, is now dk.netarkivet.monitor.distribute.JMSMonitorRegistryClient.

New settings

Settings from monitor_settings.xml are now settings in the standard settings files. The new settings are:

settings.monitor.jmxUsername: Must match the JMX username in all monitored applications. Default: monitorRole

Note that if you change the jmxUsername, you must change the security.policy accordingly. For example, if you set jmxUsername to "anonymous", the security.policy needs the following entry:

grant principal javax.management.remote.JMXPrincipal "anonymous" {
  permission java.security.AllPermission;
};

settings.monitor.jmxPassword: Must match the JMX password in all monitored applications. Default: JMX_MONITOR_ROLE_PASSWORD_PLACEHOLDER

Changed setting names

The setting settings.common.jms.environmentName is now settings.common.environmentName

The setting settings.common.harvester.datamodel.domain.tld is now settings.common.topLevelDomains.tld

The setting settings.common.database.specificsclass is now settings.common.database.class

The setting settings.archive.bitarchive.limitForRecordDatatransferInFile is now settings.common.repository.limitForRecordDatatransferInFile

The setting settings.archive.arcrepository.location is now settings.common.locations.location

The setting settings.archive.arcrepository.batchLocation is now settings.common.locations.batchLocation

The setting settings.archive.bitarchive.thisLocation is now settings.common.thisPhysicalLocation

The setting settings.monitor.applicationName is now settings.common.applicationName
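When migrating existing settings files, the renames listed above can be kept in a small lookup table. The Java sketch below is one hypothetical way to do that; the class SettingRenamesSketch and the method currentName are illustrative names and are not part of NetarchiveSuite. Only the setting names themselves are taken from these release notes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SettingRenamesSketch {
    /** Old setting name mapped to new setting name, as listed in these release notes. */
    static final Map<String, String> RENAMES = new LinkedHashMap<String, String>();
    static {
        RENAMES.put("settings.common.jms.environmentName",
                    "settings.common.environmentName");
        RENAMES.put("settings.common.harvester.datamodel.domain.tld",
                    "settings.common.topLevelDomains.tld");
        RENAMES.put("settings.common.database.specificsclass",
                    "settings.common.database.class");
        RENAMES.put("settings.archive.bitarchive.limitForRecordDatatransferInFile",
                    "settings.common.repository.limitForRecordDatatransferInFile");
        RENAMES.put("settings.archive.arcrepository.location",
                    "settings.common.locations.location");
        RENAMES.put("settings.archive.arcrepository.batchLocation",
                    "settings.common.locations.batchLocation");
        RENAMES.put("settings.archive.bitarchive.thisLocation",
                    "settings.common.thisPhysicalLocation");
        RENAMES.put("settings.monitor.applicationName",
                    "settings.common.applicationName");
    }

    /** Returns the new name of a renamed setting, or the given name if it was not renamed. */
    static String currentName(String settingName) {
        String renamed = RENAMES.get(settingName);
        return renamed != null ? renamed : settingName;
    }

    public static void main(String[] args) {
        // Print the full rename table, one "old -> new" pair per line.
        for (Map.Entry<String, String> entry : RENAMES.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

A lookup like this only translates names. The values, and the position of each element in the settings XML, still have to be moved by hand or by a separate script.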

Removed settings

The setting settings.common.siteSection.deployPath is no longer used.

New translations

If you are maintaining a translation, please note that the following new keys have been added:

archive/Translations.properties

no.files.with.checksum.errors=No files with checksum errors were found
location=Location

harvester/Translations.properties

subtitle;reports.for.job=Harvest information for job
harvest.reports=Browse reports for jobs
harvest.files=Browse harvest files for job
crawl.log.lines.for.domain.0=Browse only relevant crawl-log lines for domain {0}

viewerproxy/Translations.properties

pagetitle;qa.get.files=Get harvested files
pagetitle;qa.get.reports=Get harvest reports
pagetitle;qa.crawllog.lines.for.domain=Lines from crawl.log about domain
pagetitle;files.for.job.0=Files for job {0}
pagetitle;reports.for.job.1=Reports for job {0}
pagetitle;qa.crawllog.lines.for.domain.0.in.1=Lines from crawl.log of job {1} concerning domain {0}
helptext;get.job.qa.information.with.viewerproxy=The links below will only work \
if your browser is set up to use the viewerproxy as web proxy.

Version History

Version 3.7.0

2008-11-04

Development version aiming for 3.8.0

Version 3.6.0

2008-07-03

Improvement of archive component with regard to security, batch, and preservation; greater JMS stability; important bug fixes

Version 3.5.*

Development versions aiming for 3.6.0

Version 3.4.2

2008-03-14

Bug fix release, fixing JMX timeout

Version 3.4.1

2008-01-16

Bug fix release, fixing out of memory on very large indexes

Version 3.4.0

2008-01-03

Separation of Heritrix, work on developing our open source platform, two-part TLDs like co.uk, and lots of bugfixes

Version 3.3.*

Development versions aiming for 3.4.0

Version 3.2.3

2007-09-27

Bugfix of 3.2.2 with a patched deduplicator that fixes a problem in parallel indexing

Version 3.2.2

2007-08-03

Bugfix of 3.2.1 with a patched Heritrix 1.12.1 that supports ARCRecords larger than 2 GB

Version 3.2.1

2007-07-04

Bugfix of 3.2.0 fixing trouble using the quick start manual.

Version 3.2.0

2007-07-04

Open source release

Version 3.1.*

Development versions. Version 3.1.7 was kindly reviewed by Internet Archive and the Norwegian national library.

Version 3.0.0

2007-02-02

Marked the naming of the NetarchiveSuite, the splitting of NetarchiveSuite into independent modules, and the licensing of NetarchiveSuite under LGPL

Version 2.*

Various features and updates

Version 2.0

2006-08-30

Marked a general restructuring of the code, where harvest definition data was backed by a database and the viewerproxy was trimmed and rewritten.

Version 1.*

Various features and updates

Version 1.0

2005-07-01

The first version of the netarchive software put in production for harvesting the entire Danish web

Version 0.*

Various pre-production development versions

ReleaseNotes3_8_0 (last edited 2010-08-16 10:24:45 by localhost)