NutchWAX and NetarchiveSuite integration

A demo project using NutchWAX as a standalone product directly on a subset of ARC files. The demo can be found at http://sb-prod-acs-001:8993 and requires access to an SB proxy.

NutchWAX

NutchWAX has two separate functions, which work independently of one another. One creates a Lucene index from a number of ARC/WARC files via a command-line tool; the other is a Java application running in Tomcat that presents search results from one or more indexes.

Indexing

Indexing web data with NutchWAX requires two libraries, nutchwax-0.10.0 and hadoop-0.9.2. The indexing is run with the following commands:

% export HADOOP_HOME=path_to/hadoop-0.9.2/

% export NUTCHWAX_HOME=path_to/nutchwax-0.10.0/

% ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test

These commands are all described at [http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html#toc]
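As a minimal sketch, the indexing command above could be wrapped in a shell function so that it can be called repeatedly with different directories and collection names. The function name nutchwax_index and the DRY_RUN variable are hypothetical and not part of NutchWAX or Hadoop; only the command line itself is taken from this page.

```shell
#!/bin/sh
# Sketch only: nutchwax_index and DRY_RUN are hypothetical names introduced
# for illustration. The hadoop command line is the one shown on this page.
nutchwax_index() {
    # $1 = input directory, $2 = output directory, $3 = collection name
    if [ "$#" -ne 3 ]; then
        echo "usage: nutchwax_index INPUT_DIR OUTPUT_DIR COLLECTION" >&2
        return 2
    fi
    if [ -z "${HADOOP_HOME:-}" ] || [ -z "${NUTCHWAX_HOME:-}" ]; then
        echo "HADOOP_HOME and NUTCHWAX_HOME must be set" >&2
        return 1
    fi
    # Strip a trailing slash so the resulting paths have no double slash.
    hadoop_bin="${HADOOP_HOME%/}/bin/hadoop"
    nutchwax_jar="${NUTCHWAX_HOME%/}/nutchwax.jar"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo "$hadoop_bin jar $nutchwax_jar all $1 $2 $3"
    else
        "$hadoop_bin" jar "$nutchwax_jar" all "$1" "$2" "$3"
    fi
}
```

With DRY_RUN=1 set, calling nutchwax_index /tmp/inputs /tmp/outputs test only prints the command, which is convenient for checking paths before starting a long indexing run.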

Searching

master searcher

search slaves

Integration

Integrating the NutchWAX project into NetarchiveSuite consists of several problems, each of which must be solved.

  1. Set up and customize the search master.
  2. Make NutchWAX indexing accessible from NetarchiveSuite.

Workflow:

  1. Establish a basic way of retrieving, via a batch job, the ARC files needed for a specific index.
  2. Create the index from the ARC files using the NutchWAX indexing tools.
  3. Push the index to a free search machine.
  4. Automatically set up a new search slave, start it, and register it with the search master.
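Steps 1 and 2 of the workflow could be connected roughly as follows. This is a sketch under stated assumptions, not the NetarchiveSuite implementation: it assumes the batch job has already placed the ARC files in a local directory, and that the indexer's input directory should contain a text file listing one ARC path per line. The function name make_arc_manifest and the file name arcs.txt are hypothetical.

```shell
#!/bin/sh
# Sketch only: make_arc_manifest and arcs.txt are hypothetical names, and the
# one-path-per-line manifest format is an assumption about the indexer input.
make_arc_manifest() {
    # $1 = directory holding the fetched ARC files
    # $2 = input directory to be handed to the indexing step
    arc_dir=$1
    input_dir=$2
    mkdir -p "$input_dir" || return 1
    manifest="$input_dir/arcs.txt"
    # Start from an empty manifest.
    : > "$manifest"
    for f in "$arc_dir"/*.arc "$arc_dir"/*.arc.gz; do
        # Skip unmatched glob patterns, which the shell leaves as literal text.
        [ -f "$f" ] || continue
        echo "$f" >> "$manifest"
    done
    # Fail if no ARC files were found, so that no empty index is built.
    [ -s "$manifest" ]
}
```

The function returns a non-zero status when the directory contains no ARC files, so a calling script can stop before the indexing step.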

The indexing facility of NutchWAX is implemented in Java and is by default run as a command-line tool.

The search facility of NutchWAX is implemented as a Java web application running in a Tomcat server.

== Time table ==