Diff for "Additional Tools Manual 3.16/Tools in Wayback Module"

Tools in Wayback Module

dk.netarkivet.wayback.NetarchiveCacheResourceStore

Wayback is a tool for browsing web archives. It can be downloaded from http://archive-access.sourceforge.net/projects/wayback/. The NetarchiveSuite plugin for wayback is the class NetarchiveCacheResourceStore, which implements org.archive.wayback.ResourceStore. NetarchiveCacheResourceStore opens a connection to a NetarchiveSuite ArcRepository and retrieves archive data through it.

To use the plugin, you need to:

Copy the required jar files into the lib directory of your wayback installation
Ensure that wayback has access to a NetarchiveSuite settings file with the necessary connection information
Configure wayback to use NetarchiveCacheResourceStore

The lib directory for wayback is wayback/WEB-INF/lib under your tomcat webapps directory. Copy into it all the jar files in netarchivesuite/lib except

 dk.netarkivet.deploy.jar
 dk.netarkivet.harvester.jar
 dk.netarkivet.viewerproxy.jar
 dk.netarkivet.monitor.jar

and the jar files for the packages wayback, je, jericho, jetty, junit, poi and libidn. These are either not required or already included in the wayback distribution.
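The copy step can be scripted. The following is a minimal sketch, not part of the NetarchiveSuite distribution: the function name copy_plugin_jars and its two directory arguments are hypothetical, and it assumes that the jars of the excluded packages have file names beginning with the package name (for example wayback-core-1.4.2.jar or je-<version>.jar). Check this against the contents of your own lib directory before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: copy jars from a NetarchiveSuite lib directory
# into wayback's WEB-INF/lib, skipping the exclusions listed above.
# Assumption: excluded third-party jars are named <package>*.jar.
copy_plugin_jars() {
    src=$1
    dest=$2
    for jar in "$src"/*.jar; do
        [ -e "$jar" ] || continue    # the glob matched nothing
        name=$(basename "$jar")
        case "$name" in
            # NetarchiveSuite modules that the plugin does not need
            dk.netarkivet.deploy.jar|dk.netarkivet.harvester.jar) continue ;;
            dk.netarkivet.viewerproxy.jar|dk.netarkivet.monitor.jar) continue ;;
            # not required, or already shipped with wayback
            wayback*|je-*|jericho*|jetty*|junit*|poi*|libidn*) continue ;;
        esac
        cp "$jar" "$dest/" || return 1
    done
}
```

It could be called as, for example, copy_plugin_jars netarchivesuite/lib /path/to/tomcat/webapps/wayback/WEB-INF/lib, where the tomcat path is a placeholder.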

The NetarchiveSuite settings file location can be specified in the catalina.sh file of your tomcat with a line like CATALINA_OPTS="-Ddk.netarkivet.settings.file=/home/user/settings_for_my_repository.xml".
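As an alternative to editing catalina.sh itself, tomcat's catalina.sh reads a bin/setenv.sh file at startup if one exists. A sketch of such a file, assuming the same example settings path as above and that any options already present in CATALINA_OPTS should be preserved:

```shell
# bin/setenv.sh in the tomcat installation (read by catalina.sh if present).
# The settings path is the example path used above, not a required location.
CATALINA_OPTS="${CATALINA_OPTS:-} -Ddk.netarkivet.settings.file=/home/user/settings_for_my_repository.xml"
export CATALINA_OPTS
```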

NetarchiveCacheResourceStore has been tested with a wayback localcdxcollection using settings like:

<bean id="localcdxcollection" class="org.archive.wayback.webapp.WaybackCollection">
  <property name="resourceStore">
    <bean class="dk.netarkivet.wayback.NetarchiveCacheResourceStore">
    </bean>
  </property>
  <property name="resourceIndex">
    <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
      <property name="source">
        <bean class="org.archive.wayback.resourceindex.CompositeSearchResultSource">
          <property name="CDXSources">
            <list>
              <value>index1.cdx</value>
              <value>index2.cdx</value>
            </list>
          </property>
        </bean>
      </property>
      <property name="maxRecords" value="40000" />
    </bean>
  </property>
</bean>

but should work with other types of wayback collection.

An ant build file is provided for repacking the wayback war-file with the NetarchiveSuite plugin added. The ant tasks to unpack and repack the war-file are in wayback.build.xml, and there are sample settings in examples/wayback/*.

dk.netarkivet.wayback.batch.ExtractWaybackCDXBatchJob

This batch job is a wrapper for the parts of the wayback API which generate CDX index files for use in wayback. The job can be called with a script like

java \
   -Ddk.netarkivet.settings.file=../../settings_wayback.xml \
   -Dsettings.common.applicationInstanceId=CDX_BATCH \
   -cp ../lib/dk.netarkivet.archive.jar \
   dk.netarkivet.archive.tools.RunBatch \
   -Ndk.netarkivet.wayback.batch.ExtractWaybackCDXBatchJob \
   -J../lib/dk.netarkivet.wayback.jar,../lib/wayback-core-1.4.2.jar \
   -R'1042-.*(?<!metadata-[0-9]).arc' \
   -BMYREPLICA \
   -Oout.cdx

Note the syntax of the regular expression: the negative lookbehind (?<!metadata-[0-9]) makes it select all arcfiles generated by job 1042 except the metadata arcfiles. The generated cdx files are unsorted. For use in wayback they must be sorted and merged, e.g. using unix sort:

export LC_ALL=C; sort --temporary-directory=/tmp 1.cdx 2.cdx 3.cdx 4.cdx > sorted.cdx

Our experience is that sorting and merging of files with total size up to 100GB can be accomplished in a few hours on a moderately powerful server machine.
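The sort step can be wrapped in a small shell function that also checks the result. This is a sketch under assumptions: the name merge_cdx is hypothetical, and the temporary directory defaults to /tmp as in the command above. The -T, -o and -c options are supported by both GNU and BSD sort.

```shell
#!/bin/sh
# Hypothetical helper: merge unsorted CDX files into one sorted index.
# Usage: merge_cdx OUTPUT INPUT...
merge_cdx() {
    out=$1
    shift
    # Byte-order collation (LC_ALL=C), as in the command above.
    LC_ALL=C sort -T "${TMPDIR:-/tmp}" -o "$out" "$@" || return 1
    # sort -c exits non-zero if the file is not in sorted order.
    LC_ALL=C sort -c "$out"
}
```

For example, merge_cdx sorted.cdx 1.cdx 2.cdx 3.cdx 4.cdx produces the same sorted.cdx as the one-line command, and returns a non-zero status if the output fails the order check.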

dk.netarkivet.wayback.batch.ExtractDeduplicateCDXBatchJob

In NetarchiveSuite, duplicate objects which are not harvested are recorded as extra metadata in the heritrix crawl log. To make these items browsable, the deduplication records need to be indexed. Each deduplication record generates a cdx record which shows the harvested time as the time when the duplicate was discovered, but which points to the archive location where the original record is stored. The batch job that performs this indexing is invoked in exactly the same way as the one described above for indexing the archived data, except that the regular expression must match only metadata files, rather than everything except metadata files. For example

-R'1042-.*metadata.*arc'

As in the above case, the returned cdx files are unsorted.
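Put together, the invocation might look like the following sketch. It is an assumption-laden adaptation of the RunBatch command shown for ExtractWaybackCDXBatchJob, not a tested script: the wrapper name run_dedup_cdx_batch is hypothetical, the class name passed to -N is taken from the heading of this section, and the settings file, jar paths and replica name MYREPLICA are copied from the earlier example and will differ between installations.

```shell
#!/bin/sh
# Hypothetical wrapper: index the deduplication records of one harvest job.
# Usage: run_dedup_cdx_batch JOB_ID OUTPUT_FILE
run_dedup_cdx_batch() {
    job_id=$1
    out_file=$2
    # Same options as the earlier RunBatch example, but with the
    # deduplication batch job and a regexp matching only metadata files.
    java \
        -Ddk.netarkivet.settings.file=../../settings_wayback.xml \
        -Dsettings.common.applicationInstanceId=CDX_BATCH \
        -cp ../lib/dk.netarkivet.archive.jar \
        dk.netarkivet.archive.tools.RunBatch \
        -Ndk.netarkivet.wayback.batch.ExtractDeduplicateCDXBatchJob \
        -J../lib/dk.netarkivet.wayback.jar,../lib/wayback-core-1.4.2.jar \
        -R"${job_id}-.*metadata.*arc" \
        -BMYREPLICA \
        -O"$out_file"
}
```

Calling run_dedup_cdx_batch 1042 dedup.cdx would then pass the regular expression from the example above to RunBatch.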

dk.netarkivet.wayback.DeduplicateToCDXApplication

This is a command line interface to the same code for generating CDX indexes from deduplication records in crawl log files (not metadata arcfiles). It can be invoked by

java -cp dk.netarkivet.wayback.jar dk.netarkivet.wayback.DeduplicateToCDXApplication crawl1.log crawl2.log crawl3.log > out.cdx

Additional Tools Manual 3.16/Tools in Wayback Module (last edited 2011-05-03 14:55:47 by SoerenCarlsen)
