Wayback to the Future

A prototype of netarkivets wayback has been implemented.

Here is a comparison of dependencies between wayback and netarchivesuite.

A proto-prototype of netarkivets wayback has now been implemented.

The implementation of Wayback as an interface to the netarkivet collection is envisaged as a three-phase project:

Establishing a Basic Wayback System (BWS)
Customising the Basic Wayback System and analysis of special requirements for scaling and automation for an Advanced Wayback System (AWS)
Implementation of Advanced Wayback System

Wayback is implemented as a Java web application running in Tomcat. It has three major components which need to be configured:

resourceIndex
resourceStore
replay

The replay mode, i.e. how archived webpages are presented to the viewer, is highly configurable, and at some point we will need to assess the various options to determine which one(s) are most suitable for us. In the meantime, the default, which uses URL rewriting, is certainly good enough for the purposes of building a working installation. For the BWS we will therefore use the default replay interface.

Phase 1: Implementation of Basic Wayback System

The architecture for the BWS, as discussed in the Odense meeting, is as follows:

In this architecture, there are two installations of Wayback to be configured: one for index-searching only, and one for interaction with clients and the arcrepository. In order to achieve a functional BWS we will need both a functioning resourceIndex and a functioning resourceStore. The tasks for Phase 1 are therefore divided into two assignment groups.

Assignment Group P1.1: resourceIndex

This component takes a query URL and looks up matching results in an index which, for us, will be a collection of sorted CDX files. BnF run their index server as a single machine holding their 2TB of CDX data in 26 files. Since they are happy with its performance, this strongly suggests we should be able to do the same, i.e. mount our total CDX collection (about 800GB) on a single machine running Tomcat and the Wayback webapp. We should, nevertheless, test the viability of this solution.
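To illustrate why a lookup in a large sorted CDX file can stay cheap, here is a minimal sketch of a binary search over byte offsets. It is not Wayback's own implementation. It assumes that each line starts with the canonicalised URL key followed by a space, and that the file is sorted bytewise (for example with LC_ALL=C sort). The names find_captures and _first_line_at_or_after are hypothetical.

```python
import os


def _first_line_at_or_after(f, offset):
    """Return the first complete line that starts at or after the byte offset."""
    if offset == 0:
        f.seek(0)
    else:
        # Step back one byte and discard up to the next newline, so that a
        # line starting exactly at `offset` is not skipped.
        f.seek(offset - 1)
        f.readline()
    return f.readline()


def find_captures(cdx_path, url_key):
    """Return all lines of a sorted CDX file whose first field equals url_key.

    Assumption: the first field is the URL key and fields are space-separated.
    """
    prefix = url_key.encode("utf-8") + b" "
    with open(cdx_path, "rb") as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size
        # Binary search for the smallest offset whose following line
        # sorts at or after the wanted prefix.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _first_line_at_or_after(f, mid)
            if not line or line >= prefix:
                hi = mid
            else:
                lo = mid + 1
        # Matching lines are contiguous in a sorted file: read until they stop.
        line = _first_line_at_or_after(f, lo)
        hits = []
        while line.startswith(prefix):
            hits.append(line.decode("utf-8").rstrip("\n"))
            line = f.readline()
        return hits
```

Each lookup needs roughly log2(file size) seeks plus one sequential read of the matching lines, which is why the number of files searched, rather than their total size, is likely to dominate the cost.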

Assignment P1.1.1: Identify subset of CDX files for pre-prototype resourceIndex testing

The purpose of this assignment is to identify and collect a set of CDX files representing a suitably large subset of our archive for preliminary feasibility and performance tests of indexing. 100GB should be sufficient. The files should include some results from both snapshot and selective harvests so that some URL searches will produce multiple hits and others only a few. (The following two assignments may strictly be considered unnecessary, but they will provide us with invaluable experience in configuring and using resourceIndexes and we should therefore strongly consider devoting resources to them.)

Assignment P1.1.2: Configure a prototype resourceIndex tester

This assignment involves configuring Tomcat and Wayback on a suitable machine and mounting the pre-prototype index subset on it. The success criterion for this task is one successful search (i.e. a search with at least one result). Note that the pre-prototype resourceIndex tester is not required to be able to communicate with the ArcRepository; it is only for testing the resourceIndex component. A developer PC with a plentiful supply of disk space should therefore be adequate.

Assignment P1.1.3: pre-prototype resourceIndex performance testing

The goal of this task is to estimate the typical performance of the production resourceIndex as well as provide data that can be used to estimate future scaling. A possible strategy is to use shell scripts to test the response time of the pre-prototype resourceIndex to various search loads, including scaling with the number and size of CDX files. It would be especially valuable if this task could assess the performance gains resulting from merging CDX files.
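The text above suggests shell scripts. The same measurements could equally be scripted along the following lines. This is only a sketch: time_queries is a hypothetical helper, and the lookup callable stands in for whatever actually issues the query, for instance an HTTP request to the pre-prototype resourceIndex.

```python
import statistics
import time


def time_queries(lookup, queries, repeats=3):
    """Call lookup(query) repeatedly and summarise the response times.

    `lookup` is any callable taking one query URL; it is an assumption
    of this sketch, not part of Wayback.
    """
    if not queries:
        raise ValueError("at least one query is required")
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            lookup(query)
            samples.append(time.perf_counter() - start)
    return {
        "count": len(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }
```

Running the same query list against configurations with different numbers and sizes of CDX files, before and after merging, would give comparable figures for the scaling estimate.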

Assignment P1.1.4: identify requirements for BWS resourceIndex server

The purpose of this task is to identify the requirements, including the configuration and firewall requirements, for a machine to act as a resourceIndex server in which the Wayback composite CDX index webapp runs. (This machine is not directly accessible to any end user but talks only to the Wayback Access Webapp which runs in another tomcat.)

Assignment P1.1.5: configure prototype resourceIndex server

Implement the requirements identified in assignment P1.1.4.

Assignment Group P1.2: resourceStore and Access Machine

Assignment P1.2.1: identify requirements for Wayback Access Machine

This is the machine which runs the Wayback instance through which end users have access. It must be able to talk to the Wayback Index Machine and to instantiate a JMSArcRepositoryClient capable of talking to the main ArcRepository. It must also have controllable access from client machines at both SB and KB.

Assignment P1.2.2: configure Wayback Access Machine for search

Configure the resourceIndex on the Wayback Access Machine so that clients can query URLs using the RemoteIndexSource. At this stage, it will be possible to see what results are obtained, but not to view them.

Assignment P1.2.3: develop NetarchiveResourceStore for arcfiles

This is an implementation of the interface org.archive.wayback.ResourceStore which takes a CaptureSearchResult and returns an ArcResource. It should take an ArcRepositoryClient in its constructor. Its development should follow the normal cycle for netarchive development. However, there may be unexpected difficulties caused by a lack of clear documentation in the APIs of the various objects involved, and these may require some communication with the Wayback developers. (In the worst case, we may need to invoke the alternative strategy of implementing an HTTP wrapper for a JMSArcRepositoryClient. This would allow us to use Wayback's inbuilt SimpleResourceStore.) [As an example, and a note to future developers, a look at the source code of the class ArcResource indicates that the ArcReader parameter to its constructor may be null. This is documented in a comment, but is invisible in the javadoc API.]

Assignment P1.2.4: integration testing of BWS

Develop and apply a suite of tests to determine that the BWS has acceptable functionality.

Phase 2: Customisation of Basic Wayback System and analysis of future requirements

This phase incorporates two elements: customisation of the user experience, and analysis of areas to be addressed in future versions of the Netarkivet/Wayback interface.

Assignment Group P2.1: customisation

In this assignment group, the various forms of replay provided by Wayback will be analysed, and we will choose and implement the one(s) we will support. We will also look at the possibilities for customising the user experience by, for example, adding netarkivet "watermarking" or other information to pages being viewed, as well as cosmetic customisation of the search interface.

Assignment Group P2.2: future requirements

There are three primary areas which we already know will have to be considered for future requirements:

Automatic or semi-automatic indexing workflow. We require a workflow in which indexes for new content are automatically generated, sorted, possibly merged, and added to Wayback.
Identification and handling of scaling bottlenecks. The BWS has several possible bottlenecks (for example the client-access communication, the index lookup, the arcrepository access, etc.) and it is unclear which of them will actually prove troublesome in the short to medium term. This assignment group will involve analysing the BWS performance in order to identify the relative importance of these bottlenecks and assessing distinct strategies (e.g. distribution vs. duplication) for dealing with them. It is possible that by this stage new functionality in Wayback itself will be available to help.
WARC support. The NetarchiveResourceStore implemented in Phase 1 will need to be extended to deal with WARC files.
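For the indexing workflow in the first item above, the merge step might look roughly like the sketch below. It assumes that every input file is already sorted bytewise and is small enough in number to be opened simultaneously. The function name merge_sorted_cdx is hypothetical, and dropping exact duplicate lines is a choice made for this example rather than a stated requirement.

```python
import heapq
from contextlib import ExitStack


def merge_sorted_cdx(input_paths, output_path):
    """Merge already-sorted CDX files into one sorted file.

    Returns the number of lines written. Exact duplicate lines are
    written only once (an assumption of this sketch).
    """
    written = 0
    with ExitStack() as stack:
        inputs = [stack.enter_context(open(p, "rb")) for p in input_paths]
        out = stack.enter_context(open(output_path, "wb"))
        previous = None
        # heapq.merge streams the inputs, holding one line per file in memory.
        for line in heapq.merge(*inputs):
            if not line.endswith(b"\n"):
                line += b"\n"  # final line of a file may lack a newline
            if line != previous:
                out.write(line)
                written += 1
                previous = line
    return written
```

Because the merge is streaming, memory use does not grow with file size, so the same step could be applied to newly generated indexes before they are added to Wayback.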

Phase 3: Implementation of Advanced Wayback System

Following on from the analysis undertaken in Phase 2, a scalable and easily managed AWS will be developed.

Assignment	Description	Estimate (man-days)
P1.1.1	Identify subset of CDX files for pre-prototype resourceIndex testing	1
P1.1.2	Configure a prototype resourceIndex tester	1
P1.1.3	pre-prototype resourceIndex performance testing	3
P1.1.4	identify requirements for BWS resourceIndex server	3
P1.1.5	configure prototype resourceIndex server	2
P1.2.1	identify requirements for Wayback Access Machine	3
P1.2.2	configure Wayback Access Machine for search	2
P1.2.3	develop NetarchiveResourceStore for arcrepository	6
P1.2.4	integration testing of BWS	4