Line 7: | Line 7: |
NutchWAX has in two separate function, which works independently of one another. One is creating Lucene index from a number of ARC/WARC files via a command line tool, the other is a JAVA application which runs in Tomcat which presents search results from one or more search results. | NutchWAX has in two separate function, which works independently of one another. One is the ability to create Lucene index from a number of ARC/WARC files via a command line tool, the other is a JAVA application which runs in Tomcat and facilitate the search results from one or more searchers. |
Line 19: | Line 19: |
Which are all described on [http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html#toc] | Which are all described on [http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html#toc]. Scrips are made for exporting the right things, and are easily made for automaticly generating the commands for indexing one or more ARC/WARC files. |
Line 67: | Line 67: |
2. Make NutchWAX indexing accessible from NetarchiveSuite. | 2. Make NutchWAX indexing possible via batch script, such that it is possible to continuous index and add new data to the search slaves. |
Line 70: | Line 70: |
1. Establishing a basic way of via a batch job getting a hold of the ARC files needed for a specific index. | 1. Establishing a basic way of via a batch job getting a hold of the ARC files needed for a specific index and merge the newly generated index with some existing ones. |
Line 75: | Line 75: |
This index facility of NutchWAX is implemented in JAVA, as is as standard run as a command line tool. | This index facility of NutchWAX is implemented in JAVA, which is run as a standalone tool. It should be investigated if a JAVA API is available. |
Line 77: | Line 77: |
The search facility of NutchWAX is implemented as a JAVA web-application running in a Tomcat server. | The search facility of NutchWAX is implemented as a JAVA web-application running in a Tomcat server. Like Wayback a solution with Jetty as web container, should be investigated. |
Line 80: | Line 80: |
== Time table == | == Roadmap == |
NutchWAX and NetarchiveSuite integration
A demo project uses NutchWAX as a standalone product directly on a subset of ARC files. This demo can be found at http://sb-prod-acs-001:8993 and requires access to an SB proxy.
NutchWAX
NutchWAX has two separate functions, which work independently of one another. One is the ability to create a Lucene index from a number of ARC/WARC files via a command-line tool; the other is a Java application which runs in Tomcat and serves the search results from one or more searchers.
Indexing
Indexing web data with NutchWAX is done using the two libraries nutchwax-0.10.0 and hadoop-0.9.2. The indexing is executed with the following commands:
% export HADOOP_HOME=path_to/hadoop-0.9.2/
% export NUTCHWAX_HOME=path_to/nutchwax-0.10.0/
% ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test
These are all described at [http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html#toc]. Scripts have been made for exporting the right variables, and scripts that automatically generate the commands for indexing one or more ARC/WARC files are easily made.
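As a rough illustration of such a generator script, the sketch below prints (but does not run) the indexing command for a given input directory, output directory and collection name. The function name gen_index_cmd and the fallback paths are assumptions made for this example only; the command form itself is the one shown above, and the layout of the input directory is left as described in the NutchWAX documentation.

```shell
#!/bin/sh
# Hypothetical helper, not part of NutchWAX: prints one indexing command.
# Fallback paths are placeholders, used only if the variables are unset.
HADOOP_HOME="${HADOOP_HOME:-path_to/hadoop-0.9.2}"
NUTCHWAX_HOME="${NUTCHWAX_HOME:-path_to/nutchwax-0.10.0}"

# gen_index_cmd INPUT_DIR OUTPUT_DIR COLLECTION
# Emits the command on stdout so it can be reviewed or piped to sh.
gen_index_cmd() {
    # ${VAR%/} strips one trailing slash so the paths join cleanly.
    printf '%s/bin/hadoop jar %s/nutchwax.jar all %s %s %s\n' \
        "${HADOOP_HOME%/}" "${NUTCHWAX_HOME%/}" "$1" "$2" "$3"
}
```

Calling gen_index_cmd /tmp/inputs /tmp/outputs test prints the same command as in the example above; printing rather than executing makes it possible to inspect the generated commands before a long indexing run is started.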
Searching
Searching is done either by a single search slave, which is controlled by the front end running in Tomcat, or by a search master controlling one or more search slaves; in the second case the search master is controlled by the same front end. Each search slave searches in one Lucene index. This means that in order to cover a large amount of data with an index, the most feasible way is to construct multiple indexes and distribute these over the server park, with each server holding approximately the same number of slaves and the same amount of index data. In the test setup on Netarkivet.dk we indexed around 2.1 TB of web data, which resulted in a 454 GB index, which was distributed on 12
Search slaves are autonomous applications, each associated with only one index. A search slave is started like this:
something search slaves
The main searcher is a Tomcat application, which must be set up for the specific server configuration before use. To connect the main searcher with the search slaves, the following must be set in the search application:
<property>
<name>searcher.dir</name>
<value>/nutchwax-slaves/</value>
</property>
In the directory /nutchwax-slaves/, the file search-servers.txt lists the host and port of each search slave:
% cat /nutchwax-slaves/search-servers.txt
localhost 1234
localhost 1235
localhost 1236
localhost 1237
host2 1234
host2 1235
host3 1234
Other configuration of the main searcher is server specific.
Integration
Integrating the NutchWAX project into NetarchiveSuite consists of several problems, each of which should be solved.
Workflow:
The index facility of NutchWAX is implemented in Java and is run as a standalone tool. It should be investigated whether a Java API is available.
The search facility of NutchWAX is implemented as a Java web application running in a Tomcat server. As with Wayback, a solution with Jetty as web container should be investigated.
Roadmap