Wayback Configuration

This section describes the configuration of the applications responsible for the continuous indexing of files in a NetarchiveSuite arcrepository. In addition, there is a plugin which enables an arcrepository to be accessed by an instance of wayback. This is described in the Additional Tools Manual, along with various batch jobs which may be of use to anyone wishing to index an arcrepository without using the applications described here.

Basic Concepts for the Indexer/Aggregator

There are two applications responsible for indexing an arcrepository. The WaybackIndexerApplication checks a repository for any new files and issues batch jobs to index each new file individually. These unsorted index files are deposited in a local folder. The AggregatorApplication sorts and merges these index files and then merges the result into the existing index files being used by your wayback instance. These applications may be configured and deployed using the NetarchiveSuite Deploy Tool.

WaybackIndexerApplication

This application uses a database to maintain a list of all files in a repository, together with information as to whether or not each file has been indexed. It uses a set of worker threads to issue batch jobs to index any new files found. Any arcfile whose name contains the string "metadata" is assumed to be a metadata file and is indexed with a tool that searches for deduplication records. Any other file is simply indexed as an arcfile.

The default settings for this application are:

<settings>
    <wayback>
        <hibernate>
            <c3p0>
                <acquire_increment>1</acquire_increment>
                <idle_test_period>100</idle_test_period>
                <max_size>100</max_size>
                <max_statements>100</max_statements>
                <min_size>10</min_size>
                <timeout>100</timeout>
            </c3p0>
            <connection_url>jdbc:derby:derbyDB/wayback_indexer_db;create=true</connection_url>
            <db_driver_class>org.apache.derby.jdbc.ClientDriver</db_driver_class>
            <use_reflection_optimizer>false</use_reflection_optimizer>
            <transaction_factory>org.hibernate.transaction.JDBCTransactionFactory</transaction_factory>
            <dialect>org.hibernate.dialect.DerbyDialect</dialect>
            <show_sql>true</show_sql>
            <format_sql>true</format_sql>
            <hbm2ddl_auto>update</hbm2ddl_auto>
            <user></user>
            <password></password>
        </hibernate>
        <indexer>
            <replicaId>ONE</replicaId>
            <final_batch_output_dir>batchOutputDir</final_batch_output_dir>
            <temp_batch_output_dir>tempdir</temp_batch_output_dir>
            <maxFailedAttempts>3</maxFailedAttempts>
            <producerDelay>0</producerDelay>
            <producerInterval>86400000</producerInterval>
            <consumerThreads>5</consumerThreads>
            <initialFiles></initialFiles>
        </indexer>
    </wayback>
</settings>

As can be seen, the application uses a Hibernate object-relational mapping layer to communicate with a relational database, so it should be possible to plug in any RDBMS simply by changing the Hibernate settings. However, the code has only been tested with Derby and PostgreSQL. The Hibernate settings are not described in any more detail here as they are fully documented in the Hibernate documentation at http://www.hibernate.org.
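
As an illustration of switching the database, an override for PostgreSQL might look like the following sketch. The host, port, database name, user and password are placeholders chosen for this example, not values shipped with NetarchiveSuite. The driver class and dialect are the standard names used by the PostgreSQL JDBC driver and by Hibernate, and the driver jar is assumed to be on the application's classpath. Settings that are not listed are assumed to keep the default values shown above.

```xml
<settings>
    <wayback>
        <hibernate>
            <!-- Placeholder host, port and database name; adjust to your installation -->
            <connection_url>jdbc:postgresql://localhost:5432/wayback_indexer_db</connection_url>
            <db_driver_class>org.postgresql.Driver</db_driver_class>
            <dialect>org.hibernate.dialect.PostgreSQLDialect</dialect>
            <!-- Placeholder credentials -->
            <user>wayback</user>
            <password>changeme</password>
        </hibernate>
    </wayback>
</settings>
```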

The NetarchiveSuite-specific settings are as follows:

dk.netarkivet.wayback.settings.indexer.replicaId: The Id of the replica to be used for indexing. Since indexing is a relatively intensive operation, it is useful to be able to specify which replica is used by the indexer.

dk.netarkivet.wayback.settings.indexer.final_batch_output_dir: The directory where the unsorted index files are stored.

dk.netarkivet.wayback.settings.indexer.temp_batch_output_dir: A directory in which the output from partially finished batch jobs can be written.

dk.netarkivet.wayback.settings.indexer.maxFailedAttempts: The maximum number of failures allowed per file before the indexer permanently gives up attempting to index a given file. At present there is no way, other than manipulating the database, to retry indexing a file once it has reached this limit.

dk.netarkivet.wayback.settings.indexer.producerDelay: The delay in milliseconds after the system start before the indexing process begins.

dk.netarkivet.wayback.settings.indexer.producerInterval: The interval (in milliseconds) between successive reads of the latest filelist from the repository. The value of this parameter is a compromise between updating the index as quickly as possible and overburdening the repository with heavy-duty FileListBatchJobs.

dk.netarkivet.wayback.settings.indexer.consumerThreads:

dk.netarkivet.wayback.settings.indexer.initialFiles:
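
To make the millisecond values concrete: the default producerInterval of 86400000 corresponds to 24 hours (24 × 60 × 60 × 1000). A hypothetical override that reads the filelist every six hours, waits one minute after startup, and allows five failed attempts per file might look like the sketch below. The values are illustrative only and are not recommendations. Settings that are not listed are assumed to keep their defaults.

```xml
<settings>
    <wayback>
        <indexer>
            <!-- Illustrative: allow 5 failures per file instead of the default 3 -->
            <maxFailedAttempts>5</maxFailedAttempts>
            <!-- Illustrative: 60000 ms = 1 minute after system start -->
            <producerDelay>60000</producerDelay>
            <!-- Illustrative: 21600000 ms = 6 hours (6 x 60 x 60 x 1000) -->
            <producerInterval>21600000</producerInterval>
        </indexer>
    </wayback>
</settings>
```

A shorter interval updates the index sooner, at the cost of issuing FileListBatchJobs against the repository more often.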