NutchWAX and NetarchiveSuite integration
This is a demo project that uses NutchWAX as a standalone product directly on a subset of ARC files. The demo can be found at http://sb-prod-acs-001:8993 and requires access to an SB proxy.
NutchWAX
NutchWAX has two separate functions, which work independently of one another. One is a command-line tool that creates a Lucene index from a number of ARC/WARC files; the other is a Java application that runs in Tomcat and serves the search results from one or more searchers.
Indexing
Indexing web data with NutchWAX uses two libraries, nutchwax-0.10.0 and hadoop-0.9.2. The indexing is run with the following commands:
% export HADOOP_HOME=path_to/hadoop-0.9.2/
% export NUTCHWAX_HOME=path_to/nutchwax-0.10.0/
% ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test
These commands are all described at [http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html#toc]. Scripts have been made for exporting the right variables, and scripts that automatically generate the indexing commands for one or more ARC/WARC files are easy to write.
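A script of that kind could look roughly like the sketch below. It is only an illustration: the function names write_arc_list and index_command and the file name arcs.txt are hypothetical, and it assumes that the inputs directory is expected to hold a text file listing one ARC/WARC location per line, which should be checked against the NutchWAX documentation linked above. The command is printed rather than executed so it can be inspected first.

```shell
#!/bin/sh
# Sketch only: helper names and arcs.txt are hypothetical, not part of NutchWAX.
# Assumption: the inputs directory holds a text file with one ARC/WARC
# location per line.

HADOOP_HOME="${HADOOP_HOME:-path_to/hadoop-0.9.2}"
NUTCHWAX_HOME="${NUTCHWAX_HOME:-path_to/nutchwax-0.10.0}"

# write_arc_list ARC_DIR INPUT_DIR
# Lists every ARC/WARC file found in ARC_DIR in INPUT_DIR/arcs.txt.
write_arc_list() {
    arc_dir=$1
    input_dir=$2
    mkdir -p "$input_dir"
    : > "$input_dir/arcs.txt"
    for f in "$arc_dir"/*.arc "$arc_dir"/*.arc.gz "$arc_dir"/*.warc "$arc_dir"/*.warc.gz; do
        # Skip patterns that matched nothing.
        [ -e "$f" ] || continue
        printf '%s\n' "$f" >> "$input_dir/arcs.txt"
    done
}

# index_command INPUT_DIR OUTPUT_DIR COLLECTION
# Prints the indexing command in the same form as the one shown above.
index_command() {
    printf '%s/bin/hadoop jar %s/nutchwax.jar all %s %s %s\n' \
        "${HADOOP_HOME%/}" "${NUTCHWAX_HOME%/}" "$1" "$2" "$3"
}
```

Calling write_arc_list with a directory of ARC files and /tmp/inputs, followed by index_command /tmp/inputs /tmp/outputs test, prints a command of the same shape as the one shown earlier.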
Searching
Searching is done either by a single search slave controlled by the front end running in Tomcat, or by a search master controlling one or more search slaves; in the second case the search master is controlled by the same front end. Each search slave searches one Lucene index. This means that, to cover a large amount of data, the most feasible approach is to build multiple indexes and distribute them over the server park, with each server holding approximately the same number of slaves and the same amount of index data. In the test setup on Netarkivet.dk we indexed around 2.1 TB of web data, which resulted in a 454 GB index, which was distributed on 12
Search slaves are autonomous applications, each associated only with an index. A search slave is started like this.
The main searcher is a Tomcat application, which must be set up for the specific server configuration before use. To connect the main searcher with the search slaves, the searcher.dir property must be set to a directory containing a search-servers.txt file, which lists the host and port of each search slave. Other configuration of the main searcher is server specific.
Integrating NutchWAX into NetarchiveSuite involves several problems, each of which must be solved. Workflow: The indexing facility of NutchWAX is implemented in Java and is run as a standalone tool. It should be investigated whether a Java API is available. The search facility of NutchWAX is implemented as a Java web application running in a Tomcat server. As with Wayback, a solution using Jetty as the web container should be investigated.
something search slaves
search application
<property>
<name>searcher.dir</name>
<value>/nutchwax-slaves/</value>
</property>
% cat /nutchwax-slaves/search-servers.txt
localhost 1234
localhost 1235
localhost 1236
localhost 1237
host2 1234
host2 1235
host3 1234
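When several slaves run on consecutive ports on the same host, the listing above can be generated instead of typed by hand. The following is a minimal sketch; slave_lines is a hypothetical helper, not part of NutchWAX, and it assumes only the "host port" line format shown above.

```shell
#!/bin/sh
# Sketch only: slave_lines is a hypothetical helper, not part of NutchWAX.

# slave_lines HOST FIRST_PORT COUNT
# Prints COUNT "host port" lines, starting at FIRST_PORT.
slave_lines() {
    host=$1
    port=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s %s\n' "$host" "$((port + i))"
        i=$((i + 1))
    done
}

# Reproduce the listing above on standard output.
{
    slave_lines localhost 1234 4
    slave_lines host2 1234 2
    slave_lines host3 1234 1
}
```

Once the output looks right, it can be redirected to search-servers.txt in the directory named by searcher.dir.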
Integration
Roadmap