Diff for "Configuration Manual 3.14/Wayback Configurations"

Differences between revisions 1 and 8 (spanning 7 versions)

Wayback Configuration

This section describes the configuration of the applications responsible for the continuous indexing of files in a NetarchiveSuite arcrepository. In addition, there is a plugin which enables an arcrepository to be accessed by an instance of wayback. This is described in the Additional Tools Manual, along with various batch jobs which may be of use to anyone wishing to index an arcrepository without using the applications described here.

Basic Concepts for the Indexer/Aggregator

There are two applications responsible for indexing an arcrepository. The WaybackIndexerApplication checks a repository for any new files and issues batch jobs to index each new file individually. The resulting unsorted index files are deposited in a local folder. The AggregatorApplication sorts and merges these index files and then merges the result into the existing index files being used by your wayback instance. Both applications may be configured and deployed using the NetarchiveSuite Deploy Tool.
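The sort-and-merge step performed by the aggregator can be illustrated with a minimal Python sketch. This is not the NetarchiveSuite implementation: the function name sort_and_merge is hypothetical, and in-memory lists of lines stand in for the index files on disk.

```python
import heapq


def sort_and_merge(unsorted_chunks, existing_sorted):
    """Sort each new chunk, then merge all chunks into the existing sorted index.

    unsorted_chunks: list of lists of index lines, one list per new index file.
    existing_sorted: list of index lines that is already in sorted order.
    """
    # Step 1: sort every newly produced (unsorted) index file on its own.
    sorted_chunks = [sorted(chunk) for chunk in unsorted_chunks]
    # Step 2: merge the sorted inputs; heapq.merge needs each input pre-sorted.
    return list(heapq.merge(existing_sorted, *sorted_chunks))
```

Because every input to the merge is already sorted, the merge only ever compares the current head of each input, which is why sorting the small new files first is cheaper than re-sorting the whole index.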

WaybackIndexerApplication

This application uses a database to maintain a list of all files in a repository, together with information as to whether or not each file has been indexed. It uses a set of worker threads to issue batch jobs to index any new files found. Any arcfile which contains the string "metadata" in its name is assumed to be a metadata file and is indexed with a tool that searches for deduplication records. Any other file is simply indexed as an arcfile.
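The file-name rule above can be sketched in a few lines of Python. The function name choose_indexing_tool and the returned labels are hypothetical and only illustrate the decision; they are not names from the NetarchiveSuite code.

```python
def choose_indexing_tool(filename: str) -> str:
    """Return a label for the kind of indexing a file would receive."""
    # Rule from the text: "metadata" anywhere in the name marks a metadata file,
    # which is searched for deduplication records.
    if "metadata" in filename:
        return "deduplication"
    # Every other file is indexed as an ordinary arcfile.
    return "arcfile"
```

Note that this is a plain substring match, so any file whose name happens to contain "metadata" is treated as a metadata file.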

The default settings for this application are:

<settings>
    <wayback>
        <hibernate>
            <c3p0>
                <acquire_increment>1</acquire_increment>
                <idle_test_period>100</idle_test_period>
                <max_size>100</max_size>
                <max_statements>100</max_statements>
                <min_size>10</min_size>
                <timeout>100</timeout>
            </c3p0>
            <connection_url>jdbc:derby:derbyDB/wayback_indexer_db;create=true</connection_url>
            <db_driver_class>org.apache.derby.jdbc.ClientDriver</db_driver_class>
            <use_reflection_optimizer>false</use_reflection_optimizer>
            <transaction_factory>org.hibernate.transaction.JDBCTransactionFactory</transaction_factory>
            <dialect>org.hibernate.dialect.DerbyDialect</dialect>
            <show_sql>true</show_sql>
            <format_sql>true</format_sql>
            <hbm2ddl_auto>update</hbm2ddl_auto>
            <user></user>
            <password></password>
        </hibernate>
        <indexer>
            <replicaId>ONE</replicaId>
            <final_batch_output_dir>batchOutputDir</final_batch_output_dir>
            <temp_batch_output_dir>tempdir</temp_batch_output_dir>
            <maxFailedAttempts>3</maxFailedAttempts>
            <producerDelay>0</producerDelay>
            <producerInterval>86400000</producerInterval>
            <consumerThreads>5</consumerThreads>
            <initialFiles></initialFiles>
        </indexer>
    </wayback>
</settings>

As can be seen, the application uses a Hibernate object-relational mapping layer to communicate with a relational database. It should therefore be possible to plug in any RDBMS simply by changing the Hibernate settings, although the code has only been tested with DerbyDB and PostgreSQL. The Hibernate settings are not described in any more detail here, as they are fully documented in the Hibernate documentation at http://www.hibernate.org.

The NetarchiveSuite-specific settings are as follows:

dk.netarkivet.wayback.settings.indexer.replicaId: The Id of the replica to be used for indexing. Since indexing is a relatively intensive operation, it is useful to be able to specify which replica is used by the indexer.

dk.netarkivet.wayback.settings.indexer.final_batch_output_dir: The directory where the unsorted index files are stored.

dk.netarkivet.wayback.settings.indexer.temp_batch_output_dir: A directory in which the output from partially finished batch jobs can be written.

dk.netarkivet.wayback.settings.indexer.maxFailedAttempts: The maximum number of failures allowed per file before the indexer permanently gives up attempting to index a given file. At present there is no way, other than manipulating the database, to retry indexing a file once it has reached this limit.

dk.netarkivet.wayback.settings.indexer.producerDelay: The delay in milliseconds between system start and the beginning of the indexing process.

dk.netarkivet.wayback.settings.indexer.producerInterval: The interval (in milliseconds) between successive reads of the latest file list from the repository. The value of this parameter is a compromise between updating the index as quickly as possible and overburdening the repository with heavy-duty FileListBatchJobs. The default value of 86400000 corresponds to 24 hours.

dk.netarkivet.wayback.settings.indexer.consumerThreads: The number of simultaneous indexing threads to be started and hence the maximum number of indexing batch jobs to be run simultaneously.

dk.netarkivet.wayback.settings.indexer.initialFiles: The path to a file containing a list of files in the archive which the indexer should not index. This can be used when deploying the indexer to a legacy system to ensure that archive files already indexed are not reindexed at unnecessary computational expense.
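How consumerThreads and maxFailedAttempts might interact can be sketched as follows. This is a simplified Python model under stated assumptions, not the application's actual code: the function run_indexing_round, the index_one callback and the in-memory failed_attempts dictionary (standing in for the database) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_indexing_round(files, failed_attempts, index_one,
                       consumer_threads=5, max_failed_attempts=3):
    """Try to index each eligible file once, using a bounded pool of workers.

    files: names of files that have not yet been indexed.
    failed_attempts: dict mapping file name -> number of failures so far.
    index_one: callable that indexes one file and raises on failure.
    Returns the list of files indexed successfully in this round.
    """
    # Files that have reached the failure limit are skipped permanently.
    eligible = [f for f in files
                if failed_attempts.get(f, 0) < max_failed_attempts]

    def attempt(name):
        try:
            index_one(name)
            return name, True
        except Exception:
            return name, False

    # At most consumer_threads indexing jobs run at the same time.
    with ThreadPoolExecutor(max_workers=consumer_threads) as pool:
        results = list(pool.map(attempt, eligible))

    indexed = []
    for name, ok in results:
        if ok:
            indexed.append(name)
        else:
            failed_attempts[name] = failed_attempts.get(name, 0) + 1
    return indexed
```

In this model a file that fails max_failed_attempts times is never submitted again unless its counter is reset by hand, which mirrors the statement above that only manipulating the database allows a retry.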

AggregatorApplication

The aggregator takes all files found in the indexer's output directory, sorts them, and merges them into an existing index file. The Unix sort command is used, so this application runs only on Unix-like systems. The aggregator uses a rollover system to deal with the growth of index files. At any given time, the active index files will consist of a list such as:

wayback_intermediate.index
wayback.index
wayback.index.1
wayback.index.2
wayback.index.3 (etc.)

Whenever the aggregator runs, the new index files are sorted and merged into wayback_intermediate.index. The interval between aggregator runs is determined by the parameter dk.netarkivet.wayback.settings.aggregator.aggregationInterval (in milliseconds). If wayback_intermediate.index is now larger than dk.netarkivet.wayback.settings.aggregator.maxIntermediateIndexFileSize (in KB), it is merged into wayback.index. If this merge would cause wayback.index to grow larger than dk.netarkivet.wayback.settings.aggregator.maxMainIndexFileSize, the main index files are rolled over: wayback.index is renamed to wayback.index.1, and so on, and a new wayback.index is started. Note that at present the names of these index files are hard-coded.
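The rollover rule can be modelled with a small Python sketch that tracks file sizes only. It rests on several assumptions that the text does not confirm: that maxMainIndexFileSize uses the same unit (KB) as maxIntermediateIndexFileSize, that existing numbered files are each shifted up by one during a rollover, and that the intermediate file's contents seed the new wayback.index and the intermediate file is then emptied. The function name aggregate_step is hypothetical.

```python
INTERMEDIATE = "wayback_intermediate.index"
MAIN = "wayback.index"


def aggregate_step(sizes, new_kb, max_intermediate_kb, max_main_kb):
    """Model one aggregator run on a dict mapping file name -> size in KB."""
    sizes = dict(sizes)  # work on a copy so the caller's state is untouched
    # New, sorted index data always lands in the intermediate file first.
    sizes[INTERMEDIATE] = sizes.get(INTERMEDIATE, 0) + new_kb
    if sizes[INTERMEDIATE] <= max_intermediate_kb:
        return sizes  # intermediate file still small enough; nothing else to do

    # Would merging push the main index over its limit? Then roll over first.
    if sizes.get(MAIN, 0) + sizes[INTERMEDIATE] > max_main_kb:
        n = 1
        while f"{MAIN}.{n}" in sizes:
            n += 1
        # Shift wayback.index.N -> wayback.index.N+1, highest number first.
        for i in range(n - 1, 0, -1):
            sizes[f"{MAIN}.{i + 1}"] = sizes.pop(f"{MAIN}.{i}")
        if MAIN in sizes:
            sizes[f"{MAIN}.1"] = sizes.pop(MAIN)

    # Merge the intermediate file into the (possibly fresh) main index.
    sizes[MAIN] = sizes.get(MAIN, 0) + sizes[INTERMEDIATE]
    sizes[INTERMEDIATE] = 0
    return sizes
```

With limits of 50 KB (intermediate) and 100 KB (main), two successive runs that each add 60 KB leave wayback.index and wayback.index.1 at 60 KB each in this model.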

In addition to the settings described above, the aggregator also uses:

dk.netarkivet.wayback.settings.aggregator.indexFileInputDir: The directory where the unsorted index files are to be found, i.e. the output directory of the indexer.

dk.netarkivet.wayback.settings.aggregator.indexFileOutputDir: The directory containing all the final active index files.

dk.netarkivet.wayback.settings.aggregator.tempAggregatorDir: A temporary workspace directory. This directory should have storage space at least equal to maxMainIndexFileSize and should ideally be on the same file system as indexFileOutputDir.

Configuration Manual 3.14/Wayback Configurations (last edited 2010-08-31 07:54:22 by ColinRosenthal)

-  ⇤ ← Revision 1 as of 2010-05-04 13:16:12 → 
  Size: 4052
  Editor: SoerenCarlsen
  Comment: Generated documentation branch for 3.14
+   ← Revision 8 as of 2010-08-31 07:54:22 → ⇥
  Size: 7492
  Editor: ColinRosenthal
  Comment:
 Line 2:
-[[Action(edit)]]
+<<Action(edit)>>
 Line 4:
-The Wayback installation under !NetarchiveSuite is only tested on a pc installed with linux and in !ProxyReplay mode. Other modes should work, but no guaranties are given.
+This section describes the configuration of the applications responsible for the continuous indexing of files in a !NetarchiveSuite arcrepository. In addition, there is a plugin which enables an arcrepository to be accessed by an instance of wayback. This is described in the Additional Tools Manual, along with various batch jobs which may be of use to anyone wishing to index an arcrepository without using the applications described here.
 Line 6:
-=== Requirements ===
The following applications should be running and reachable from the machine running Tomcat with Wayback web application.
+=== Basic Concepts for the Indexer/Aggregator ===
There are two applications responsible for indexing an arcrepository. The {{{WaybackIndexerApplication}}} checks a repository for any new files and issues batch jobs to index each new file individually. These unsorted index files are deposited in a local folder. The {{{AggregatorApplication}}} sorts and merges these index files and then merges the result into the existing index files being used by your wayback instance. These applications may be configured and deployed using the !NetarchiveSuite Deploy Tool.
 Line 9:
-. JMS server.
 1. FTP server.
 1. Archive (eg. Standalone archive given in ./conf/wayback/standalone_archive.xml). The needed applications from !NetarchiveSuite is !BitarchiveApplication, !BitarchiveMonitorApplication, !ArcRepositoryApplication. The !NetarchveSuite version should be newer than 3.10.
This setup has been tested with Tomcat (6.0.20).
+=== WaybackIndexerApplication ===
This application uses a database to maintain a list of all files in a repository and information as to whether or not they have been archived. It uses a set of worker threads to issue batch jobs to index any new files found. The application behaviour is that any arcfile which contains the string "metadata" in its name is assumed to be a metadata file and is indexed with a tool that searches for deduplication records. Any other file is simple indexed as an arcfile.
-Line 14:
+Line 12:
-When configuring Wayback to work with !NetarchiveSuite, the above services is needed, furthermore it is needed to have a full source package of the !NetarchiveSuite and an installation of ''ant'', it has been tested with 1.7.1.

=== Configuration ===
The two configuration files that should be modified are located in ''./conf/wayback/'' in the !NetarchiveSuite full source package. The files are named ''CDXCollection.xml'' and ''wayback.xml''.

==== wayback.xml ====
In this config file there are multiple settings that should be changed to fit your setup, to make the system run correctly:

''wayback.basedir=/tmp/wayback'' - The web application should have read and write access to this directory.

The port should be specified in the following three lines, and be available (i.e. not yet already used by another application).

 * <bean name="8080:wayback" class="org.archive.wayback.webapp.!AccessPoint">
 * <property name="replayURIPrefix" value="http://localhost.archive.org:8080/wayback/"/>
 * <bean name="8090" parent="8080:wayback">
==== CDXCollection.xml ====
This configuration file describes where Wayback finds its CDX files (i.e indices of the ARC/WARC files).

In this file it should only be necessary to change the following path to point a local CDX collection.

''<value>/wayback/file.sorted.cdx</value>''

=== Compiling Tomcat target ===
This can be done from the !NetarchiveSuite root directory. By running the command ''ant -file wayback.build.xml warfile'', this produces a ROOT.war file in the !NetarchiveSuite root director, and this ROOT.war file should be copied to'' $TOMCAT_HOME/webapps/''.

Tomcat should furthermore have access to a settings.xml file, see below. This can be done by adding the following line to ''$TOMCAT_HOME/bin/catalina.sh'' just after the first line.

''CATALINA_OPTS='-Ddk.netarkivet.settings.file=$TOMCAT_HOME/webapps/ROOT/WEB-INF/settings.xml' ''

This setting file is a !NetarchiveSuite settings.xml file, and only includes the ''common'' and ''wayback'' sections.

The following settings should be modified to fit the local installation.

Change the following to match the FTP settings on the system.
+The default settings for this application are
-Line 50:
+Line 15:
-        <remoteFile>
            <!-- TODO: See user documentation for NetarchiveSuite
            http://netarkivet.dk/suite/Documentation . -->
            <serverName>ftp.yourdomain.com</serverName>
            <userName>ftpuser</userName>
            <userPassword>ftppassword</userPassword>
        </remoteFile>
+<settings>
    <wayback>
        <hibernate>
            <c3p0>
                <acquire_increment>1</acquire_increment>
                <idle_test_period>100</idle_test_period>
                <max_size>100</max_size>
                <max_statements>100</max_statements>
                <min_size>10</min_size>
                <timeout>100</timeout>
            </c3p0>
            <connection_url>jdbc:derby:derbyDB/wayback_indexer_db;create=true</connection_url>
            <db_driver_class>org.apache.derby.jdbc.ClientDriver</db_driver_class>
            <use_reflection_optimizer>false</use_reflection_optimizer>
            <transaction_factory>org.hibernate.transaction.JDBCTransactionFactory</transaction_factory>
            <dialect>org.hibernate.dialect.DerbyDialect</dialect>
            <show_sql>true</show_sql>
            <format_sql>true</format_sql>
            <hbm2ddl_auto>update</hbm2ddl_auto>
            <user></user>
            <password></password>
        </hibernate>
        <indexer>
            <replicaId>ONE</replicaId>
            <final_batch_output_dir>batchOutputDir</final_batch_output_dir>
            <temp_batch_output_dir>tempdir</temp_batch_output_dir>
            <maxFailedAttempts>3</maxFailedAttempts>
            <producerDelay>0</producerDelay>
            <producerInterval>86400000</producerInterval>
            <consumerThreads>5</consumerThreads>
            <initialFiles></initialFiles>
        </indexer>
    </wayback>
</settings>
-Line 58:
+Line 50:
-Update the following mail settings
+As can be seen, the application uses a hibernate object-relational mapping layer to communicate with a relational database. Thus it should be possible to plug in any RDBMS simply by changing the hibernate settings. The code has only been tested with {{{DerbyDB}}} and {{{postgresql}}}. The hibernate settings are not described in any more detail here as they are fully documented in the hibernate documentation at http://www.hibernate.org.
-Line 60:
+Line 52:
+The !NetarchiveSuite-specific settings are as follows:

~+{{{dk.netarkivet.wayback.settings.indexer.replicaId}}}: The Id of the replica to be used for indexing. Since indexing is a relatively intensive operation, it is useful to be able to specify which replica is used by the indexer. +~

~+{{{dk.netarkivet.wayback.settings.indexer.final_batch_output_dir}}}: The directory where the unsorted index files are stored.  +~

~+{{{dk.netarkivet.wayback.settings.indexer.temp_batch_output_dir}}}: A directory in which the output from partially finished batch jobs can be written. +~

~+{{{dk.netarkivet.wayback.settings.indexer.maxFailedAttempts}}}: The maximum number of failures allowed per file before the indexer permanently gives up attempting to index a given file. At present there is no way, other than manipulating the database, to retry indexing a file once it has reached this limit. +~

~+{{{dk.netarkivet.wayback.settings.indexer.producerDelay}}}: The delay in milliseconds after the system start before the indexing process begins. +~

~+{{{dk.netarkivet.wayback.settings.indexer.producerInterval}}}: The interval (in milliseconds) between successive reads of the latest filelist from the repository. The value of this parameter is a compromise between updating the index as quickly as possible and overburdening the repository with heavy-duty !FileListBatchJobs. +~

~+{{{dk.netarkivet.wayback.settings.indexer.consumerThreads}}}: The number of simultaneous indexing threads to be started and hence the maximum number of indexing batch jobs to be run simultaneously. +~ 

~+{{{dk.netarkivet.wayback.settings.indexer.initialFiles}}}: the path to a file containing a list of files in the archive which the indexer should not archive. This can be used when deploying the indexer to a legacy system to ensure that archive files already indexed are not reindexed at unnecessary computational expense.<<BR>>+~

=== AggregatorApplication ===

The aggregator takes all files found in the indexer's output directory, sorts them, and merges them into an existing index file. The unix {{{sort}}} command is used so this application runs only in unix-like systems. The aggregator uses a rollover system to deal with the growth of index files. At any given time, the active index files will consist of a list such as
-Line 61:
+Line 74:
-        <mail>
            <server>mail.yourdomain.com</server>
        </mail>
        <notifications>
            <class>dk.netarkivet.common.utils.EMailNotifications</class>
            <sender>example@yourdomain.com</sender>
            <receiver>example@yourdomain.com</receiver>
        </notifications>
+wayback_intermediate.index
wayback.index
wayback.index.1
wayback.index.2
wayback.index.3 (etc.)
-Line 70:
+Line 80:
-=== Described elsewhere ===
It is outside the scope of this configuration guide to describe how to harvest a ARC/WARC file. It is also outside the scope of this guide to describe how to get import an    ARC/WARC collection into Wayback by way of CDX-entries for each object in the colletion.
-Line 73:
+Line 81:
-Setting up !NetarchiveSuite archive is described elsewhere and a sample setup file is given in the !NetarchiveSuite source package.
+Whenever the aggregator runs (the interval between aggregator runs is determined by the parameter {{{dk.netarkivet.wayback.settings.aggregator.aggregationInterval}} in milliseconds) the new index files are sorted and merged into {{{wayback_intermediate.index}}}. If this file is now larger than {{{dk.netarkivet.wayback.settings.aggregator.maxIntermediateIndexFileSize}}} (in KB) then this file is merged into {{{wayback.index}}}. If this would cause {{{wayback.index}}} to grow to larger than {{{dk.netarkivet.wayback.settings.aggregator.maxMainIndexFileSize}}} then the filenames for the main index files are rolled over - {{{wayback.index}}} is renamed to {{{wayback.index.1}}} etc., and a new {{{wayback.index}}} is started. Note that at present the names of these index files are hard-coded.

In addition to the settings described above, the aggregator also uses

~+{{{dk.netarkivet.wayback.settings.aggregator.indexFileInputDir}}}: the directory where the unsorted index files are to be found, ie the output directory of the indexer+~

~+{{{dk.netarkivet.wayback.settings.aggregator.indexFileOutputDir}}}: the directory containing all the final active index files+~

~+{{{dk.netarkivet.wayback.settings.aggregator.tempAggregatorDir}}}: a temporary workspace directory. This directory should have storage space at least equal to {{{maxMainIndexFileSize}}} and should ideally be on the same file system as {{{indexFileOutputDir}}}.+~