Differences between revisions 3 and 4

Appendix C: Easy Installation of NetarchiveSuite

Verify that you have all the needed software installed by installing the QuickStart according to https://netarchive.dk/suite/Quick_Start_Manual_3.12 (e.g. in /home/test/netarchive) and starting it.
Shut down the QuickStart according to the QuickStart Manual.
Download the following attached files to e.g. /home/test/netarchive:

[https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/*checkout*/trunk/scripts/installation/RunNetarchiveSuite.sh?root=netarchivesuite RunNetarchiveSuite.sh]

[https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/*checkout*/trunk/examples/deploy_standalone_example.xml?root=netarchivesuite deploy example one machine.xml]

The first file is a simple script that performs all the steps of a deployment. It takes a NetarchiveSuite package ('.zip'), a configuration file (the second file), and a temporary installation directory as arguments (in the given order).

In the configuration file all the applications are placed on one machine (e.g. the current machine, localhost). This gives the same kind of instance as the QuickStart. If run directly it is installed and run from the directory /home/test/USER.

Below, you will find other deploy examples. (They have to be modified to fit your environment.)

E.g.

cd /home/test/netarchive
bash RunNetarchiveSuite.sh NetarchiveSuite.zip deploy_example_one_machine.xml USER/
# If you have not set up your SSH keys correctly, you will need to log in several times before the installation finishes successfully.

The script creates a "USER" folder in e.g. /home/test, which contains e.g. methods for starting and stopping NetarchiveSuite. The script then starts the whole NetarchiveSuite.

Set your browser to use the proxy on port 8070, as described in the QuickStart Manual.
Open the URL, e.g. http://dia-test-int-01.kb.dk:8074/HarvestDefinition/
You can now create, run and browse according to the QuickStart Manual or the User Manual.

Examples of deploy configuration files

The following are three examples of configuration files for deploy. The first two require adaptation to your own system before use.

[https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/*checkout*/trunk/examples/deploy_distributed_example.xml?root=netarchivesuite deploy distributed example.xml]

This is the instance with two replicas divided over two physical locations. Each physical location contains several machines: bitarchive machines, a harvester machine and a viewerproxy machine. Only one physical location has an administrator machine, which contains the GUI application, the bitarchive monitors and the arc repository.

[https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/*checkout*/trunk/examples/deploy_distributed_example_single.xml?root=netarchivesuite deploy distributed example single.xml]

This is the instance with only one replica and one physical location. It is very close to the first example, just with one replica removed.

[https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/*checkout*/trunk/examples/deploy_distributed_example_checksum.xml?root=netarchivesuite deploy distributed example checksum.xml]

Compared to the 'deploy distributed example', this instance contains a third replica, a checksum replica. This new replica has only one application, which is placed on the east location and shares a machine with some of the other applications.

A running HW/SW setup example from June 2009 for Netarkivet.dk

http://netarchive.dk/suite/Installation_Manual_3.12?action=AttachFile&do=view&target=HW_SW_production_example.txt

How to add one more harvester on the same machine and set all to HIGHPRIORITY selective harvesting

Using e.g. deploy_example.xml

Duplicate the existing harvester <applicationName> definition within <deployMachine>

In the duplicated harvester config, change each of the following values, which are still copies of the original, to a new value that is unique within <deployMachine>:

<applicationInstanceId>
<common><jmx><port> and <rmiPort>
<heritrix><guiPort> and <jmxPort>
<serverDir>harvester_high_2</serverDir>

and set

<queuePriority>HIGHPRIORITY</queuePriority>

e.g.:

<applicationName name="dk.netarkivet.harvester.harvesting.HarvestControllerApplication">
  <settings>
    <common>
      <applicationInstanceId>high2</applicationInstanceId>
      <jmx>
        <port>8112</port>
        <rmiPort>8212</rmiPort>
      </jmx>
    </common>
    <harvester>
      <harvesting>
        <queuePriority>HIGHPRIORITY</queuePriority>
        <heritrix>
          <guiPort>8192</guiPort>
          <jmxPort>8193</jmxPort>
          <jmxUsername>controlRole</jmxUsername>
          <jmxPassword>R_D</jmxPassword>
        </heritrix>
        <serverDir>harvester_high_2</serverDir>
      </harvesting>
    </harvester>
  </settings>
</applicationName>

How to configure which Heritrix reports are uploaded to the metadata ARC file

Three new setting properties have been added:

- settings.harvester.harvesting.metadata.heritrixFilePattern is a Java pattern that selects which files in the crawl dir (not recursively) are included in the metadata ARC.

- settings.harvester.harvesting.metadata.reportFilePattern is also a Java pattern. It controls which subset of the files selected by heritrixFilePattern are considered report files. All the other selected files are considered setup files.

- settings.harvester.harvesting.metadata.logFilePattern is a third Java pattern. It controls which files in the logs subdirectory of the crawl dir are added as log files to the metadata ARC.
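
As a rough sketch, the three properties might be placed in a settings file as shown below. The element nesting is inferred from the dotted property names, in the same way that the harvester example earlier in this appendix nests queuePriority under settings, harvester and harvesting. The pattern values are illustrative assumptions only and are not the documented defaults.

```xml
<settings>
  <harvester>
    <harvesting>
      <metadata>
        <!-- Assumed example value: crawl dir files to include in the metadata ARC -->
        <heritrixFilePattern>.*\.(xml|txt)</heritrixFilePattern>
        <!-- Assumed example value: subset of the files above treated as report files -->
        <reportFilePattern>.*-report\.txt</reportFilePattern>
        <!-- Assumed example value: files in the logs subdirectory added as log files -->
        <logFilePattern>.*\.log</logFilePattern>
      </metadata>
    </harvesting>
  </harvester>
</settings>
```

With these assumed values, and assuming each pattern has to match the whole file name, a file such as order.xml would be selected by heritrixFilePattern but not by reportFilePattern, so it would count as a setup file. A file such as crawl-report.txt would match both patterns and count as a report file.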

Installation Manual 3.12/AppendixC (last edited 2010-09-14 07:31:28 by TueLarsen)

Revision 3 as of 2009-12-10 09:03:48 (Size: 5741, Editor: JonasFrellesen)
Revision 4 as of 2010-01-27 12:17:16 (Size: 5728, Editor: SoerenCarlsen)