
NutchWAX and NetarchiveSuite integration

A demo project uses NutchWAX as a standalone product directly on a subset of ARC files. The demo can be found at http://sb-prod-acs-001:8993 and requires access to an SB proxy.

NutchWAX

NutchWAX has two separate functions, which work independently of one another. One is the ability to create a Lucene index from a number of ARC/WARC files via a command-line tool; the other is a Java application which runs in Tomcat and serves search results from one or more searchers.

Indexing

Indexing web data with NutchWAX is done using the two libraries nutchwax-0.10.0 and hadoop-0.9.2. The indexing is executed with the following commands:

% export HADOOP_HOME=path_to/hadoop-0.9.2/

% export NUTCHWAX_HOME=path_to/nutchwax-0.10.0/

% ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test

These commands are all described at http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html#toc. Scripts have been made for exporting the right environment variables, and scripts that automatically generate the indexing commands for one or more ARC/WARC files are easily written.
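
A minimal helper along these lines could assemble the indexing command for a staged batch of files. The installation paths, the default locations, and the staging layout are assumptions, not part of the NutchWAX distribution:

```shell
# Sketch: build the NutchWAX indexing command for a staged input directory.
# The HADOOP_HOME/NUTCHWAX_HOME defaults and /tmp staging layout are assumptions.
emit_index_cmd() {
    hadoop_home=${HADOOP_HOME:-/opt/hadoop-0.9.2}
    nutchwax_home=${NUTCHWAX_HOME:-/opt/nutchwax-0.10.0}
    inputs=$1
    outputs=$2
    collection=$3
    # Print the command instead of running it, so it can be reviewed or batched.
    echo "$hadoop_home/bin/hadoop jar $nutchwax_home/nutchwax.jar all $inputs $outputs $collection"
}

# Example: print the command for the staging layout used above.
emit_index_cmd /tmp/inputs /tmp/outputs test
```

Piping the printed command to `sh` (or dropping the `echo`) would execute the actual indexing run.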

Searching

Searching is done either by a single search slave, controlled by the front end running in Tomcat, or by a search master controlling one or more search slaves; in the latter case the search master is controlled by the same front end. Each search slave searches one Lucene index. This means that in order to cover a large dataset with an index, the most feasible way is to construct multiple indexes and distribute them over the server park, with each machine holding approximately the same number of slaves and the same amount of index data.

In the test setup on Netarkivet.dk we indexed around 2.1 TB of web data, which resulted in a 454 GB index distributed over 12 search slaves on two machines. Query times against the Lucene indexes, which were not warmed up, were between 0.05 s and 6.0 s.

search slaves

Search slaves are autonomous applications, each associated with only one index.

A search slave is started like this:

  • % $HADOOP_HOME/bin/hadoop jar $NUTCHWAX_HOME/nutchwax-0.10.0.jar class 'org.archive.access.nutch.NutchwaxDistributedSearch$Server' 1234 /index-0001/ &> slave-searcher-1234-index-0001.log &
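
With several indexes, one start command per slave is needed. The helper below only prints the commands; the consecutive-port scheme and the log-file naming are assumptions:

```shell
# Sketch: print the start command for one search slave serving one index.
# Port assignment and log-file naming are assumptions.
slave_cmd() {
    port=$1
    index=$2
    # \$HADOOP_HOME and \$NUTCHWAX_HOME are left unexpanded so the printed
    # command can be run on any machine with those variables exported.
    echo "\$HADOOP_HOME/bin/hadoop jar \$NUTCHWAX_HOME/nutchwax-0.10.0.jar class 'org.archive.access.nutch.NutchwaxDistributedSearch\$Server' $port $index &> slave-searcher-$port.log &"
}

# Example: one slave per index directory, consecutive ports from 1234.
port=1234
for index in /index-0001/ /index-0002/; do
    slave_cmd "$port" "$index"
    port=$((port + 1))
done
```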

search application

The main searcher is a Tomcat application which must be adapted to the specific server configuration before use. To connect the main searcher with the search slaves, the following must be set in the webapps/nutchwax/WEB-INF/classes/hadoop-site.xml configuration file. The property named searcher.dir should point to a directory containing a single text file listing the hostname and port of each search slave.

<property>
  <name>searcher.dir</name>
  <value>/nutchwax-slaves/</value>
</property> 

In the directory /nutchwax-slaves there should be a text file in this format:

% cat /nutchwax-slaves/search-servers.txt
localhost 1234
localhost 1235
localhost 1236
localhost 1237
host2 1234
host2 1235
host3 1234 

Other configuration of the main searcher is server-specific.

Integration

Integrating the NutchWAX project into NetarchiveSuite involves several problems, each of which must be solved:

  1. Set up and customize the search master.
  2. Make NutchWAX indexing possible via a batch script, so that new data can be continuously indexed and added to the search slaves.

Workflow:

  1. Establish a basic way of obtaining, via a batch job, the ARC files needed for a specific index, and merge the newly generated index with existing ones.
  2. Create the index from the ARC files via the NutchWAX indexing tools.
  3. Push the index to a free search machine.
  4. Automatically set up a new search slave, start it, and register it with the search master.
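
The steps above can be sketched as a dry-run driver. Every step here just prints what would happen, since the supporting tooling (ARC retrieval, index transfer, slave registration) does not exist yet; the batch naming, target host, and port are hypothetical:

```shell
# Sketch: dry-run of the four workflow steps for one index batch.
# All names (batch id, host, port, /tmp layout) are hypothetical placeholders.
index_workflow() {
    batch=$1
    host=$2
    port=$3
    echo "1: fetch ARC files for $batch into /tmp/$batch/inputs via a batch job"
    echo "2: hadoop jar nutchwax.jar all /tmp/$batch/inputs /tmp/$batch/outputs $batch"
    echo "3: push /tmp/$batch/outputs to free search machine $host"
    echo "4: start search slave on $host:$port and register it with the search master"
}

# Example: print the plan for one batch.
index_workflow batch-0001 search-host-3 1234
```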

The indexing facility of NutchWAX is implemented in Java and run as a standalone tool. It should be investigated whether a Java API is available.

The search facility of NutchWAX is implemented as a Java web application running in a Tomcat server. As with Wayback, a solution using Jetty as the web container should be investigated.

Roadmap

something

Comments

2010-02-05 (SVC): Aaron Binns suggested moving to NutchWAX 0.12.9, as its space requirements are much lower than those of 0.10.0. Furthermore, it avoids creating/updating a Nutch harvestDB, which Nutch requires when used for harvesting but not for indexing and search purposes. In later releases the use of Hadoop code by NutchWAX is hidden, as NutchWAX comes with its own script for indexing. Aaron Binns also suggested using Jetty instead of Tomcat, which is what they currently do at IA.

TaskOverviewNutchWAX (last edited 2010-08-16 10:24:47 by localhost)