NutchWAX and NetarchiveSuite integration
This is a demo project that uses NutchWAX as a standalone product directly on a subset of ARC files. The demo can be found at http://sb-prod-acs-001:8993 and requires access to an SB proxy.
NutchWAX
NutchWAX has two separate functions, which work independently of one another. One creates a Lucene index from a number of ARC/WARC files via a command line tool; the other is a Java application running in Tomcat, which presents search results from one or more Lucene indexes.
Indexing
Indexing web data with NutchWAX is done using the two libraries nutchwax-0.10.0 and hadoop-0.9.2. The indexing is executed with the following commands:
% export HADOOP_HOME=path_to/hadoop-0.9.2/
% export NUTCHWAX_HOME=path_to/nutchwax-0.10.0/
% ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test
These commands are all described at [http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html#toc]
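The indexing invocation above can be wrapped in a small helper so that the same command is built consistently for different inputs. The following is only a sketch: the function name build_index_command is hypothetical, and it assumes that the last argument (test in the example above) is the collection name. It prints the command rather than running it, so the command can be reviewed first.

```shell
# Hypothetical helper: print the NutchWAX indexing command used in the text.
# Arguments: input directory, output directory, third argument (assumed to be
# the collection name).
build_index_command() {
    inputs=$1
    outputs=$2
    collection=$3
    # Refuse to build a command if either home directory is unset or empty.
    if [ -z "${HADOOP_HOME:-}" ] || [ -z "${NUTCHWAX_HOME:-}" ]; then
        echo "HADOOP_HOME and NUTCHWAX_HOME must both be set" >&2
        return 1
    fi
    # ${VAR%/} drops one trailing slash so the printed path has no "//".
    printf '%s jar %s all %s %s %s\n' \
        "${HADOOP_HOME%/}/bin/hadoop" \
        "${NUTCHWAX_HOME%/}/nutchwax.jar" \
        "$inputs" "$outputs" "$collection"
}
```

With HADOOP_HOME and NUTCHWAX_HOME exported as shown above, build_index_command /tmp/inputs /tmp/outputs test prints the same command line as in the text.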
Searching
Searching is done either by a single search slave, which is controlled by the front end running in Tomcat, or by a search master controlling one or more search slaves; in the second case the search master is controlled by the same front end. Each search slave searches in one Lucene index. This means that in order to cover a large amount of data, the most feasible way is to construct multiple indexes and distribute these over the server park, with each server holding approximately the same number of slaves and the same amount of index data. In the test setup on Netarkivet.dk we indexed around 2.1 TB of web data, which resulted in a 454 GB index, which was distributed on 12
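One way to get indexes of roughly equal size is to split the input files into groups of roughly equal total size before indexing. The sketch below is an illustration only and not part of NutchWAX: the function name split_into_groups is hypothetical, and it assumes that input file size is a usable proxy for the resulting index size. It uses a simple greedy heuristic (largest file first, always into the currently smallest group).

```shell
# Hypothetical helper: read "size path" lines on stdin and print "group path"
# lines, assigning each file to one of N groups of roughly equal total size.
# Paths containing whitespace are not handled.
split_into_groups() {
    n_groups=$1
    # Largest files first, then always add to the currently smallest group.
    sort -k1,1nr | awk -v n="$n_groups" '
        {
            best = 1
            for (g = 2; g <= n; g++)
                if (total[g] + 0 < total[best] + 0) best = g
            total[best] += $1
            print best, $2
        }'
}
```

The input can for example be produced with du -k on the ARC files, as in du -k /path/to/arcs/* | split_into_groups 4, where the path and the group count are placeholders.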
Search slaves are autonomous applications, each associated with only one index. A search slave is started like this.
The main searcher is a Tomcat application, which must be set up for the specific server configuration before use. To connect the main searcher with the search slaves, the searcher.dir property listed under "search application" must be set, and the directory it points to must contain a search-servers.txt file with one line per search slave, giving its host and port. Other configuration of the main searcher is server specific.
Integrating the NutchWAX project into NetarchiveSuite involves several problems, each of which must be solved.
Make NutchWAX indexing accessible from NetarchiveSuite.
Workflow: The index facility of NutchWAX is implemented in Java and is by default run as a command line tool. The search facility of NutchWAX is implemented as a Java web application running in a Tomcat server.
Time table
something
search slaves
search application
<property>
<name>searcher.dir</name>
<value>/nutchwax-slaves/</value>
</property>
% cat /nutchwax-slaves/search-servers.txt
localhost 1234
localhost 1235
localhost 1236
localhost 1237
host2 1234
host2 1235
host3 1234
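Since the slaves should be spread evenly over the servers, it can help to check the search-servers.txt file before starting the main searcher. The following sketch is not part of NutchWAX; the function name summarise_search_servers is hypothetical. It assumes the format shown in the listing above, one host and one port per line, reports malformed lines on standard error, and prints the number of slaves configured per host.

```shell
# Hypothetical helper: validate a search-servers.txt style file and print
# "host count" for each host, in order of first appearance.
# Returns non-zero if any line is not of the form "host port".
summarise_search_servers() {
    awk '
        NF == 0 { next }    # ignore blank lines
        NF != 2 || $2 !~ /^[0-9]+$/ || $2 < 1 || $2 > 65535 {
            printf "line %d: expected \"host port\", got: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
            next
        }
        {
            # Remember the order in which hosts first appear.
            if (!($1 in count)) order[++hosts] = $1
            count[$1]++
        }
        END {
            for (i = 1; i <= hosts; i++) print order[i], count[order[i]]
            exit bad + 0
        }' "$1"
}
```

Run as summarise_search_servers /nutchwax-slaves/search-servers.txt, it prints one line per host, which makes an uneven distribution of slaves easy to spot.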
Integration