== Tools in Wayback Module ==

=== dk.netarkivet.wayback.NetarchiveCacheResourceStore ===

Wayback is a tool for browsing web archives. It can be downloaded from http://archive-access.sourceforge.net/projects/wayback/. The !NetarchiveSuite plugin for wayback is a class {{{NetarchiveCacheResourceStore}}} which implements {{{org.archive.wayback.ResourceStore}}}. {{{NetarchiveCacheResourceStore}}} opens a connection to a !NetarchiveSuite !ArcRepository and retrieves archive data from it via !NetarchiveSuite.

To use the plugin, it is necessary to

 i. Copy the required jar files into the lib directory of your wayback installation
 i. Ensure that wayback has access to a !NetarchiveSuite settings file with the necessary connection information
 i. Configure wayback to use !NetarchiveCacheResourceStore

The lib directory for wayback is {{{wayback/WEB-INF/lib}}} in your tomcat webapps directory. Copy into the lib directory all the jar files in {{{netarchivesuite/lib}}} ''except''

{{{
dk.netarkivet.deploy.jar
dk.netarkivet.harvester.jar
dk.netarkivet.viewerproxy.jar
dk.netarkivet.monitor.jar
}}}

and the jar files for the packages {{{wayback, je, jericho, jetty, junit, poi}}} and {{{libidn}}}. These are either not required or already included in the wayback distribution.

The location of the !NetarchiveSuite settings file can be specified in the catalina.sh file of your tomcat with a line like {{{CATALINA_OPTS="-Ddk.netarkivet.settings.file=/home/user/settings_for_my_repository.xml"}}}.

!NetarchiveCacheResourceStore has been tested with a wayback localcdxcollection using settings like:

{{{
index1.cdx
index2.cdex
}}}

but should work with other types of wayback collection.

There is an ant build file which can be used to repack the wayback war-file with the netarchivesuite plugin added. The ant tasks to unpack and repack the wayback war-file are in wayback.build.xml, and there are sample settings in ''examples/wayback/*''.
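The jar-copying step described above can be sketched as a small shell function. This is only an illustration under assumptions: the function name {{{copy_plugin_jars}}} is hypothetical, and the filename patterns for the third-party packages ({{{wayback*}}}, {{{je-*}}} and so on) are guesses at how those jars are named in your {{{netarchivesuite/lib}}} directory, so check them against the actual file names before relying on it.

```shell
#!/bin/sh
# Sketch: copy NetarchiveSuite jars into the wayback lib directory,
# skipping the jars that the text above says to leave out.
# copy_plugin_jars is a hypothetical helper name, not part of NetarchiveSuite.
copy_plugin_jars() {
    src="$1"    # e.g. netarchivesuite/lib
    dest="$2"   # e.g. <tomcat>/webapps/wayback/WEB-INF/lib
    for jar in "$src"/*.jar; do
        # If the glob matched nothing, the literal pattern is skipped here.
        [ -e "$jar" ] || continue
        name=$(basename "$jar")
        case "$name" in
            # NetarchiveSuite modules that wayback does not need.
            dk.netarkivet.deploy.jar|dk.netarkivet.harvester.jar)
                continue ;;
            dk.netarkivet.viewerproxy.jar|dk.netarkivet.monitor.jar)
                continue ;;
            # Third-party packages (ASSUMED filename prefixes).
            wayback*|je-*|jericho*|jetty*|junit*|poi*|libidn*)
                continue ;;
        esac
        cp "$jar" "$dest"/
    done
}
```

It could then be called as {{{copy_plugin_jars netarchivesuite/lib /path/to/tomcat/webapps/wayback/WEB-INF/lib}}}, where the tomcat path is a placeholder. Note that the {{{wayback*}}} pattern only matches names that start with "wayback", so {{{dk.netarkivet.wayback.jar}}}, which contains the plugin itself, is still copied.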
=== dk.netarkivet.wayback.batch.ExtractWaybackCDXBatchJob ===

This batch job is a wrapper for the parts of the wayback API which generate CDX index files for use in wayback. The job can be called with a script like

{{{
java \
 -Ddk.netarkivet.settings.file=../../settings_wayback.xml \
 -Dsettings.common.applicationInstanceId=CDX_BATCH \
 -cp ../lib/dk.netarkivet.archive.jar \
 dk.netarkivet.archive.tools.RunBatch \
 -Ndk.netarkivet.wayback.ExtractWaybackCDXBatchJob \
 -J../lib/dk.netarkivet.wayback.jar,../lib/wayback-core-1.4.2.jar \
 -R1042-.*(? sorted.cdx
}}}

Our experience is that sorting and merging files with a total size of up to 100GB can be accomplished in a few hours on a moderately powerful server machine.

=== dk.netarkivet.wayback.batch.ExtractDeduplicateCDXBatchJob ===

In netarchivesuite, duplicate objects which are not harvested are recorded as extra metadata in the heritrix crawl log. To make these items browsable, the deduplication records need to be indexed. Each deduplication record generates a cdx record which shows the harvest time as the time when the duplicate was discovered, but which points to the archive location where the original record is stored.

The batch job that performs this indexing is invoked in exactly the same way as described above for indexing the archived data, except that the regular expression must match only metadata files, rather than everything except metadata files. For example

{{{
-R1042-'.*'metadata'.*'arc
}}}

As in the above case, the returned cdx files are unsorted.

=== dk.netarkivet.wayback.DeduplicateToCDXApplication ===

This is a command line interface to the same code, generating CDX indexes from deduplication records in crawl log files (not metadata arcfiles). It can be invoked by

{{{
java -cp dk.netarkivet.wayback.jar dk.netarkivet.wayback.DeduplicateToCDXApplication crawl1.log crawl2.log crawl3.log > out.cdx
}}}
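
Since the batch jobs above return unsorted cdx files, their output has to be sorted and merged before wayback can use it as an index. The following is a minimal sketch of that step, not the project's own script: the function name {{{merge_cdx}}} and the file names in the usage line are hypothetical, and the use of {{{LC_ALL=C}}} rests on the assumption that your wayback version expects CDX files in plain byte order, which is how wayback's flat-file CDX indexes are usually documented.

```shell
#!/bin/sh
# Sketch: merge any number of unsorted CDX files into one sorted index.
# merge_cdx is a hypothetical helper name.
merge_cdx() {
    out="$1"    # path of the sorted index to write
    shift       # remaining arguments are the unsorted input files
    # LC_ALL=C makes sort compare raw bytes instead of using locale
    # collation rules (ASSUMED to be the order wayback expects).
    LC_ALL=C sort "$@" > "$out"
}
```

A call might look like {{{merge_cdx sorted.cdx archive_batch.cdx dedup_batch.cdx out.cdx}}}. For very large inputs, GNU sort's {{{-T}}} option can point its temporary files at a disk with enough free space.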