NutchWAX and NetarchiveSuite integration

This demo project uses NutchWAX as a standalone product directly on a subset of ARC files. The demo can be found at http://sb-prod-acs-001:8993 and requires access to an SB proxy.

NutchWAX

NutchWAX has two separate functions, which work independently of one another. One creates a Lucene index from a number of ARC/WARC files via a command line tool; the other is a Java application running in Tomcat, which presents search results from one or more indexes.

Indexing

Indexing web data with NutchWAX requires the two packages nutchwax-0.10.0 and hadoop-0.9.2. The indexing is run with the following commands:

% export HADOOP_HOME=path_to/hadoop-0.9.2/

% export NUTCHWAX_HOME=path_to/nutchwax-0.10.0/

% ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test

These steps are all described at [http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html#toc].
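
The indexing command takes an input directory, an output directory, and a collection name (test in the example). The following is a minimal sketch of preparing the input directory, assuming that it should hold text files listing one ARC location per line; check the NutchWAX overview before relying on this. The helper name write_arc_list, the file name arcs.txt, and the example paths are hypothetical.

```shell
#!/bin/sh
# Sketch: build the input listing for the NutchWAX "all" command.
# Assumption: the input directory holds text files with one ARC location per line.

# write_arc_list ARC_DIR INPUT_DIR   (hypothetical helper)
# Writes the paths of all *.arc and *.arc.gz files under ARC_DIR
# to INPUT_DIR/arcs.txt, sorted, one per line.
write_arc_list() {
    arc_dir=$1
    input_dir=$2
    mkdir -p "$input_dir"
    find "$arc_dir" -type f \( -name '*.arc' -o -name '*.arc.gz' \) \
        | sort > "$input_dir/arcs.txt"
}

# Example (paths are placeholders):
#   write_arc_list /data/arcs /tmp/inputs
#   ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test
```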

Searching

Searching is done either by a single search slave controlled directly by the front end running in Tomcat, or by a search master controlling one or more search slaves; in the second case the search master is controlled by the same front end. Each search slave searches exactly one Lucene index. To cover a large data set, the most feasible approach is therefore to build multiple indexes and distribute them over the server park, so that each server hosts approximately the same number of slaves and the same amount of index data.

In the test setup on Netarkivet.dk we indexed around 2.1 TB of web data, which resulted in a 454 GB index. The index was distributed over 12 search slaves on two machines. Query times against the Lucene indexes, which had not been warmed up, were between 0.05 s and 6.0 s.
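
To make these figures easier to reuse when planning a larger setup, the sketch below derives the index-to-data ratio and the average index size per slave. It assumes 1 TB is counted as 1000 GB and that the index is split evenly over the slaves; the helper name index_stats is hypothetical.

```shell
#!/bin/sh
# Sketch: derived figures for the test setup.
# Assumptions: 1 TB = 1000 GB, index split evenly over the slaves.

# index_stats DATA_GB INDEX_GB SLAVES   (hypothetical helper)
index_stats() {
    awk -v data_gb="$1" -v index_gb="$2" -v slaves="$3" 'BEGIN {
        printf "index/data ratio: %.1f%%\n", 100 * index_gb / data_gb
        printf "index per slave: %.1f GB\n", index_gb / slaves
    }'
}

index_stats 2100 454 12
# prints:
#   index/data ratio: 21.6%
#   index per slave: 37.8 GB
```

Under these assumptions the index is roughly a fifth of the size of the indexed data, with just under 38 GB of index per slave.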

Search slaves

Search slaves are autonomous applications, each associated with exactly one index.

A search slave is started like this:

% $HADOOP_HOME/bin/hadoop jar $NUTCHWAX_HOME/nutchwax-0.10.0.jar class 'org.archive.access.nutch.NutchwaxDistributedSearch$Server' 1234 /index-0001/ &> slave-searcher-1234-index-0001.log &
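
Since each slave serves one index, several slaves on one machine are typically started in a loop. The following is a minimal sketch that starts one slave per index directory on consecutive ports, using the same command and log naming as above. The helper names start_slave and start_slaves and the DRY_RUN switch are hypothetical, and the bash-only redirection &> is written as > file 2>&1 so the script also runs under plain sh.

```shell
#!/bin/sh
# Sketch: start one search slave per index directory on consecutive ports.
# DRY_RUN=1 only prints what would be started (illustrative convention).

# start_slave PORT INDEX_DIR   (hypothetical helper)
start_slave() {
    s_port=$1
    s_index=$2
    s_log="slave-searcher-$s_port-$(basename "$s_index").log"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would start slave on port $s_port for $s_index, logging to $s_log"
    else
        "$HADOOP_HOME/bin/hadoop" jar "$NUTCHWAX_HOME/nutchwax-0.10.0.jar" class \
            'org.archive.access.nutch.NutchwaxDistributedSearch$Server' \
            "$s_port" "$s_index" > "$s_log" 2>&1 &
    fi
}

# start_slaves FIRST_PORT INDEX_DIR...   (hypothetical helper)
# The first index gets FIRST_PORT, the next FIRST_PORT+1, and so on.
start_slaves() {
    next_port=$1
    shift
    for idx in "$@"; do
        start_slave "$next_port" "$idx"
        next_port=$((next_port + 1))
    done
}

# Example:
#   start_slaves 1234 /index-0001/ /index-0002/
```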

Search application

The main searcher is a Tomcat application, which must be set up for the specific server configuration before use. To connect the main searcher to the search slaves, the following must be set in the webapps/nutchwax/WEB-INF/classes/hadoop-site.xml configuration file. The property named searcher.dir should point to a directory containing a single text file that lists the hostname and port of each search slave.

<property>
  <name>searcher.dir</name>
  <value>/nutchwax-slaves/</value>
</property> 

The directory /nutchwax-slaves should contain a text file in this format:

% cat /nutchwax-slaves/search-servers.txt
localhost 1234
localhost 1235
localhost 1236
localhost 1237
host2 1234
host2 1235
host3 1234 
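
With many slaves this file is easier to generate than to maintain by hand. The following is a minimal sketch that writes one "hostname port" line per slave from host, first port, and slave count triples; the helper name write_search_servers is hypothetical.

```shell
#!/bin/sh
# Sketch: write search-servers.txt, one "hostname port" line per slave.

# write_search_servers FILE HOST FIRST_PORT COUNT [HOST FIRST_PORT COUNT ...]
# (hypothetical helper) Overwrites FILE.
write_search_servers() {
    out_file=$1
    shift
    : > "$out_file"
    while [ "$#" -ge 3 ]; do
        host=$1
        first_port=$2
        count=$3
        shift 3
        i=0
        while [ "$i" -lt "$count" ]; do
            echo "$host $((first_port + i))" >> "$out_file"
            i=$((i + 1))
        done
    done
}

# Reproduces the listing above:
#   write_search_servers /nutchwax-slaves/search-servers.txt \
#       localhost 1234 4 host2 1234 2 host3 1234 1
```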

Any other configuration of the main searcher is server specific.

Integration

Integrating the NutchWAX project into NetarchiveSuite consists of several problems, each of which must be solved:

  1. Set up and customize the search master.
  2. Make NutchWAX indexing accessible from NetarchiveSuite.

Workflow:

  1. Establish a basic way of retrieving, via a batch job, the ARC files needed for a specific index.
  2. Create the index from the ARC files using the NutchWAX indexing tools.
  3. Push the index to a free search machine.
  4. Automatically set up and start a new search slave, and register it with the search master.
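
Steps 3 and 4 of the workflow could be sketched as below. This is only an illustration of the intended flow, not an existing NetarchiveSuite or NutchWAX facility: the use of rsync and ssh, the remote script start-slave.sh, and the helper names run, register_slave, and deploy_index are all assumptions.

```shell
#!/bin/sh
# Sketch of workflow steps 3 and 4. Host names, paths, rsync/ssh, and
# start-slave.sh are assumptions for illustration only.

# run CMD...: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# register_slave FILE HOST PORT: append "HOST PORT" to the search master's
# server list unless exactly that line is already present.
register_slave() {
    grep -qxF "$2 $3" "$1" 2>/dev/null || echo "$2 $3" >> "$1"
}

# deploy_index INDEX_DIR HOST PORT SERVERS_FILE
deploy_index() {
    run rsync -a "$1" "$2:$1"             # step 3: push the index
    run ssh "$2" "start-slave.sh $3 $1"   # step 4: start the slave (hypothetical script)
    register_slave "$4" "$2" "$3"         # step 4: register with the search master
}

# Example:
#   deploy_index /index-0002/ host2 1236 /nutchwax-slaves/search-servers.txt
```

Registration is idempotent here, so rerunning a deployment does not duplicate entries in the server list.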

The indexing facility of NutchWAX is implemented in Java and is by default run as a command line tool.

The search facility of NutchWAX is implemented as a Java web application running in a Tomcat server.

Time table

something