Overview of the NetarchiveSuite Wayback module

The Wayback module in NetarchiveSuite consists of three components:

  1. The Wayback web application. This is a Tomcat web application configured with a NetarchiveSuite plugin that allows it to communicate with the archive.

  2. The NetarchiveSuite batch-indexer. This is a standalone NetarchiveSuite application which indexes newly harvested material.

  3. The NetarchiveSuite index-aggregator. This is a standalone NetarchiveSuite application which sorts the indexes and merges them into the large index files used by wayback.

Status of Wayback

The first two components are complete and the third is nearing completion. Because the batch-indexer is not useful without the index-aggregator, it should not be run at present. In the setup used for the release test, however, the settings file generated for the batch-indexer is also used by wayback itself. The indexer must therefore still be configured and deployed, even though it should not be started.

Configuration of Tomcat

Tomcat should be configured with two Connectors in server.xml:

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"  />
    <Connector port="8090" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"  />
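
As a quick sanity check, the two connectors above can be confirmed with a small shell function. This is only a sketch: the function names are made up for illustration, and it assumes each Connector start tag carries its port attribute on the same line, as in the snippet above.

```shell
# Hypothetical helper: count the Connector start tags declared for one port.
# Assumes the port attribute directly follows "<Connector" on the same line.
count_connectors() {
    server_xml=$1
    port=$2
    grep -c "<Connector port=\"$port\"" "$server_xml" || true
}

# Succeeds only if ports 8080 and 8090 are each declared exactly once.
check_connectors() {
    server_xml=$1
    [ -r "$server_xml" ] || { echo "cannot read $server_xml" >&2; return 1; }
    for port in 8080 8090; do
        n=$(count_connectors "$server_xml" "$port")
        if [ "$n" -ne 1 ]; then
            echo "expected one Connector on port $port, found $n" >&2
            return 1
        fi
    done
    echo "connectors OK"
}
```

It would be called as check_connectors followed by the path to server.xml in the Tomcat conf directory.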

Start and stop scripts for Tomcat are in the webarkivering CVS under conf/wayback/test/scripts. The start script should be modified to set the location of the NetarchiveSuite settings file, the name of the archive environment (PROD or ACCEPT) and a unique applicationInstanceId (e.g. WAYBACK). Note that the stop script includes a cleanup command which removes an unwanted directory from Tomcat's temp area.

Configuration of Apache

As discussed at the February teleconference, PROD wayback must only be visible externally via a secured Apache proxy (connected to port 8080).

Configuration of Wayback

The script used to deploy and start wayback in the release test is as follows:

MACHINE=test@kb-test-way-001
rm -rf wayback_scripts
## Get the start and stop scripts for tomcat
CVS_RSH=ssh cvs -Q -d:ext:test@kb-prod-udv-001.kb.dk:/home/cvsroot checkout -P -d wayback_scripts projects/webarkivering/conf/wayback/test/scripts/start.sh
CVS_RSH=ssh cvs -Q -d:ext:test@kb-prod-udv-001.kb.dk:/home/cvsroot checkout -P -d wayback_scripts projects/webarkivering/conf/wayback/test/scripts/stop.sh
## Set the applicationInstanceId and Environment for wayback
INST_ID=WAYBACKWEBAPP$TESTX
sed "s/##ENV##/$TESTX/" wayback_scripts/start.sh >  wayback_scripts/temp.sh; mv  wayback_scripts/temp.sh  wayback_scripts/start.sh
sed "s/##INST_ID##/$INST_ID/"  wayback_scripts/start.sh >  wayback_scripts/temp.sh; mv  wayback_scripts/temp.sh  wayback_scripts/start.sh
scp wayback_scripts/*.sh $MACHINE:/home/test
ssh $MACHINE chmod 755 start.sh
ssh $MACHINE chmod 755 stop.sh
ssh $MACHINE ./start.sh
scp $HOME/wayback/wayback.war $MACHINE:tomcat/webapps/ROOT.war
## Copy the wayback settings file to the target machine
## **
scp  /home/test/release_software_dist/$TESTX/kb-test-way-001.kb.dk/settings_WaybackIndexerApplication.xml $MACHINE:/home/test/settings_wayback.xml
## Check out all spring configuration files
CVS_RSH=ssh cvs -Q -d:ext:test@kb-prod-udv-001.kb.dk:/home/cvsroot checkout -P -d spring projects/webarkivering/conf/wayback/test/spring
## Copy Spring files over to wayback
scp spring/* $MACHINE:/home/test/tomcat/webapps/ROOT/WEB-INF
## Restart tomcat
sleep 5
ssh $MACHINE ./start.sh
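
The two sed steps in the script can be wrapped in a function that also verifies that no placeholder was missed. This is a sketch only. The function name is hypothetical, and it assumes that all placeholders in start.sh follow the ##NAME## pattern seen above and that the substituted values contain no "/" or "&" characters.

```shell
# Hypothetical helper: fill in ##ENV## and ##INST_ID## in a start script,
# then fail if any ##NAME## style placeholder is still present.
# Values must not contain "/" or "&", which are special to sed here.
fill_start_script() {
    script=$1
    env_name=$2
    inst_id=$3
    tmp="$script.tmp"
    sed -e "s/##ENV##/$env_name/" -e "s/##INST_ID##/$inst_id/" "$script" > "$tmp" || return 1
    mv "$tmp" "$script"
    # Require at least one character between the markers so that plain
    # comment lines such as "####" are not reported.
    if grep -q '##[A-Z_][A-Z_]*##' "$script"; then
        echo "unreplaced placeholder in $script" >&2
        return 1
    fi
}
```

For example, fill_start_script wayback_scripts/start.sh "$TESTX" "$INST_ID" would take the place of the two sed lines.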

It should be straightforward to adapt this for use in production. Simply copy all the referenced configuration files in CVS to a new directory conf/wayback/prod and edit the connection information where necessary to make it appropriate for the production system (see below for more details).
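
The copy step might be done along the following lines in a checked-out working copy. This is a sketch under assumptions: the function name is invented, and the argument is assumed to be a local checkout of conf/wayback. CVS metadata directories are skipped so that the new files can be added to the repository cleanly.

```shell
# Hypothetical helper: copy <conf_root>/test to <conf_root>/prod,
# leaving out the CVS bookkeeping directories of the working copy.
copy_test_to_prod() {
    conf_root=$1
    src="$conf_root/test"
    dst="$conf_root/prod"
    [ -d "$src" ] || { echo "missing $src" >&2; return 1; }
    [ ! -e "$dst" ] || { echo "$dst already exists" >&2; return 1; }
    # List regular files relative to $src, pruning any directory named CVS.
    (cd "$src" && find . -name CVS -prune -o -type f -print) |
    while IFS= read -r f; do
        mkdir -p "$dst/$(dirname "$f")"
        cp -p "$src/$f" "$dst/$f"
    done
}
```

The copied files still need to be edited by hand and added to CVS afterwards.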

The line marked ** copies the NetarchiveSuite settings file for the batch-indexer over to the location of the wayback settings file (as defined in the Tomcat start script). This settings file is generated from deploy_config_multi_bitapps.xml in CVS. It is therefore necessary to modify the production deploy.xml to include the wayback batch-indexer (see below for more details).

The file wayback.war is the unmodified web application released by the wayback developers. It may be copied from kb-prod-udv-001:/home/test/wayback/wayback.war.

Wayback Configuration Files

These lie in conf/wayback/test/spring in CVS. The following files require modification:

  1. CDXCollection.xml: modify the element

    <property name="CDXSources">
        <list>
            <value>/home/test/wayback_cdx/index.cdx</value>
        </list>
    </property>

     so that it points to the actual locations of the index files in the PROD or ACCEPT environment.

  2. wayback.xml: change the name of the host machine (kb-test-way-001) to the name of the actual host at all occurrences (five places). Change the value of the property maxRecords to 20000.
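
The host name change in wayback.xml can be scripted and checked against the expected five occurrences. The following is a sketch only: the function name is invented, and any production host name passed to it is the caller's choice. The maxRecords and CDXSources changes are left as manual edits.

```shell
# Hypothetical helper: replace every occurrence of the test host name in a
# Spring file and print how many occurrences were found before the change.
retarget_host() {
    file=$1
    new_host=$2
    old_host=kb-test-way-001
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # grep -o prints one line per match, so wc -l counts occurrences,
    # not just matching lines.
    before=$(grep -o "$old_host" "$file" | wc -l | tr -d ' ')
    sed "s/$old_host/$new_host/g" "$file" > "$file.tmp" || return 1
    mv "$file.tmp" "$file"
    echo "$before"
}
```

If the printed count for wayback.xml is not 5, the file probably differs from the version described here and should be inspected by hand.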

Deploy Configuration

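The following section from the test deploy configuration defines the wayback batch-indexer. An equivalent section must be added to the production deploy.xml so that the settings file used by wayback is generated: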

<deployMachine name="kb-test-way-001.kb.dk">
    <deployClassPath>lib/dk.netarkivet.archive.jar</deployClassPath>
    <deployClassPath>lib/dk.netarkivet.common.jar</deployClassPath>
    <deployClassPath>lib/dk.netarkivet.monitor.jar</deployClassPath>
    <deployClassPath>lib/dk.netarkivet.wayback.jar</deployClassPath>
    <applicationName name="dk.netarkivet.wayback.indexer.WaybackIndexerApplication">
        <settings>
            <wayback>
                <hibernate>
                    <c3p0>
                        <acquire_increment>1</acquire_increment>
                        <idle_test_period>100</idle_test_period>
                        <max_size>100</max_size>
                        <max_statements>100</max_statements>
                        <min_size>10</min_size>
                        <timeout>100</timeout>
                    </c3p0>
                    <connection_url>jdbc:derby://localhost:1527/wayback_indexer_db;create=true</connection_url>
                    <db_driver_class>org.apache.derby.jdbc.ClientDriver</db_driver_class>
                    <use_reflection_optimizer>false</use_reflection_optimizer>
                    <transaction_factory>org.hibernate.transaction.JDBCTransactionFactory</transaction_factory>
                    <dialect>org.hibernate.dialect.DerbyDialect</dialect>
                    <show_sql>true</show_sql>
                    <format_sql>true</format_sql>
                    <hbm2ddl_auto>update</hbm2ddl_auto>
                    <user></user>
                    <password></password>
                </hibernate>
                <indexer>
                    <replicaId>KB</replicaId>
                    <batch_output_dir>batchOutputDir</batch_output_dir>
                    <tempdir>tempdir</tempdir>
                    <maxFailedAttempts>3</maxFailedAttempts>
                    <producerDelay>0</producerDelay>
                    <producerInterval>300000</producerInterval>
                    <consumerThreads>5</consumerThreads>
                </indexer>
            </wayback>
        </settings>
    </applicationName>
</deployMachine>

Note also that lib/dk.netarkivet.wayback.jar has been added to the class path of all archive applications in this deploy file. This should be carried over to the production deployment configuration for future use.
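
Before a deploy file is used, a simple text check can confirm that it mentions both the indexer application and the wayback jar. This is a sketch with an invented function name. It matches literal strings only and does not parse the XML, so it cannot tell whether every archive application has the class path entry.

```shell
# Hypothetical helper: check that a deploy file declares the wayback
# batch-indexer and lists the wayback jar on at least one class path.
check_deploy_file() {
    deploy_xml=$1
    [ -r "$deploy_xml" ] || { echo "cannot read $deploy_xml" >&2; return 1; }
    grep -qF 'name="dk.netarkivet.wayback.indexer.WaybackIndexerApplication"' "$deploy_xml" || {
        echo "WaybackIndexerApplication not found" >&2
        return 1
    }
    grep -qF '<deployClassPath>lib/dk.netarkivet.wayback.jar</deployClassPath>' "$deploy_xml" || {
        echo "wayback jar missing from class path" >&2
        return 1
    }
    echo "deploy file OK"
}
```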

Versioning

The accept test must run against an archive running the same version as the current PROD system. Wayback must use NetarchiveSuite files from the current release.

For reference, minutes of the February teleconference are at http://kb-prod-udv-001.kb.dk/twiki/bin/view/Netarkiv/BriefMeeting22Feb2010

WaybackProdInstallation (last edited 2010-08-16 10:24:30 by localhost)