NutchWAX and NetarchiveSuite integration

A demo project uses NutchWAX as a standalone product directly on a subset of ARC files. This demo can be found at http://sb-prod-acs-001:8993 and requires access to an SB proxy.

NutchWAX

NutchWAX has two separate functions, which work independently of one another. One is the ability to create a Lucene index from a number of ARC/WARC files via a command line tool; the other is a Java application which runs in Tomcat and serves search results from one or more searchers.

Indexing

Indexing web data with NutchWAX is done using the two libraries nutchwax-0.10.0 and hadoop-0.9.2. The indexing is executed with these two libraries by the following commands:

% export HADOOP_HOME=path_to/hadoop-0.9.2/

% export NUTCHWAX_HOME=path_to/nutchwax-0.10.0/

% ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test

The commands are all described at http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html#toc. Scripts have been made for exporting the environment variables, and scripts for automatically generating the indexing commands for one or more ARC/WARC files are easily written.
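A minimal sketch of such a wrapper script is shown below. The script name index-arcs.sh, the manifest format (one "ARC-location collection" pair per line) and all paths are assumptions for illustration only and should be verified against the NutchWAX documentation linked above.

#!/bin/sh
# index-arcs.sh (hypothetical name): index a set of ARC files with NutchWAX.
# Usage: index-arcs.sh <collection> <arc-file> [<arc-file> ...]
export HADOOP_HOME=path_to/hadoop-0.9.2/
export NUTCHWAX_HOME=path_to/nutchwax-0.10.0/

COLLECTION=$1; shift
INPUTS=/tmp/inputs
OUTPUTS=/tmp/outputs

# Build a manifest listing the ARC files and the collection they belong to
# (format assumed here; check against the NutchWAX documentation).
mkdir -p ${INPUTS}
for arc in "$@"; do
  echo "${arc} ${COLLECTION}"
done > ${INPUTS}/manifest

${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all ${INPUTS} ${OUTPUTS} ${COLLECTION}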

Searching

Searching is done either by a single search slave, which is controlled by the front end running in Tomcat, or by a search master controlling one or more search slaves; in the second case the search master is controlled by the same front end. Each search slave searches in one Lucene index. This means that in order to cover a large amount of data with an index, the most feasible way is to construct multiple indexes and distribute them over the server park, with each machine holding approximately the same number of slaves and the same amount of index data.

In the test setup on Netarkivet.dk we indexed around 2.1 TB of web data, which resulted in a 454 GB index distributed over 12 search slaves on two machines. Query times against the Lucene indexes, which were not warmed up, were between 0.05 s and 6.0 s.

Search slaves

Search slaves are autonomous applications, each associated with a single index.

A search slave is started from the command line, given a port to listen on and the index it should serve.
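A hedged sketch of starting a slave, assuming the standard Nutch distributed-search server command (NutchWAX 0.10.0 builds on Nutch 0.9); NUTCH_HOME, the port and the index path are placeholders, and the exact invocation should be verified against the NutchWAX documentation:

% ${NUTCH_HOME}/bin/nutch server 1234 /path/to/index-dir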

Search application

The main searcher is a Tomcat application which, before use, must be set up to match the specific server configuration. To connect the main searcher with the search slaves, the following must be set in the webapps/nutchwax/WEB-INF/classes/hadoop-site.xml configuration file. The property named searcher.dir should point to a directory containing a single text file listing the hostname and port of each search slave.

<property>
  <name>searcher.dir</name>
  <value>/nutchwax-slaves/</value>
</property> 

In the directory /nutchwax-slaves there should be a text file in this format:

% cat /nutchwax-slaves/search-servers.txt
localhost 1234
localhost 1235
localhost 1236
localhost 1237
host2 1234
host2 1235
host3 1234 

Other configuration of the main searcher is server specific.

Integration

Integrating the NutchWAX project into NetarchiveSuite involves several problems, each of which must be solved:

  1. Set up and customize the search master.
  2. Make NutchWAX indexing possible via a batch script, such that it is possible to continuously index and add new data to the search slaves.

Workflow:

  1. Establish a basic way, via a batch job, of getting hold of the ARC files needed for a specific index, and merge the newly generated index with existing ones.
  2. Create the index from the ARC files via the NutchWAX indexing tools.
  3. Push the index to a free search machine.
  4. Automatically set up a new search slave, start it, and register it with the search master (a hedged sketch of this workflow follows the list).
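The following is a minimal sketch of the workflow above as a shell script. The host name searchhost3, the helper scripts index-arcs.sh and start-search-slave.sh, and all paths are hypothetical placeholders; in NetarchiveSuite the ARC selection would come from a batch job, and registering a new slave would have to be coordinated with a restart or refresh of the search master.

#!/bin/sh
COLLECTION=netarkivet
# 1. ARC files needed for this index, here simply read from a list produced by a batch job.
ARCS=$(cat /tmp/arcs-for-this-index.txt)
# 2. Create the index with the NutchWAX indexing tools (wrapper script sketched above).
./index-arcs.sh ${COLLECTION} ${ARCS}
# 3. Push the index to a free search machine.
scp -r /tmp/outputs searchhost3:/nutchwax/indexes/index-new
# 4. Start a new search slave on that machine and register it with the search master.
ssh searchhost3 "start-search-slave.sh 1234 /nutchwax/indexes/index-new"
echo "searchhost3 1234" >> /nutchwax-slaves/search-servers.txt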

The indexing facility of NutchWAX is implemented in Java and is run as a standalone tool. It should be investigated whether a Java API is available.

The search facility of NutchWAX is implemented as a Java web application running in a Tomcat server. As with Wayback, a solution with Jetty as the web container should be investigated.
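As a starting point for that investigation, a hedged sketch of deploying the NutchWAX webapp under a standalone Jetty distribution; JETTY_HOME and the location of the war file are placeholders, and the exact layout depends on the Jetty version:

% cp path_to/nutchwax.war ${JETTY_HOME}/webapps/
% cd ${JETTY_HOME} && java -jar start.jar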

Roadmap

something

Comments

2010-02-05 (SVC) Aaron Binns suggested moving to NutchWAX 0.12.9, as its space requirements are much less than those of 0.10.0. Furthermore, it avoids creating/updating a Nutch harvestDB, which is required when Nutch is used for harvesting but not for indexing and search purposes. In later releases, the use of Hadoop code by NutchWAX is hidden, as NutchWAX comes with its own script for indexing. Aaron Binns also suggested using Jetty instead of Tomcat, which they are currently doing at IA.
