Configuration Manual 3.8/AppendixB

Appendix B: Managing Heritrix Harvest Templates (order.xml)

Contents
Mandatory elements in the NetarchiveSuite and their role
    A. The QuotaEnforcer
    B. The DeDuplicator
    C. The "http-headers" element
    D. The Archiver element
    E. The ContentSize element
    F. The Scope element
The anatomy of a decidingscope
    The header
    The defining deciderule
    Standard harvest rules
    Define general crawlertraps to be avoided
The HarvestTemplateApplication tool
Predefined harvest templates
    Templates w/ DomainScope
    Templates w/ HostScope
    Templates w/ PathScope

The NetarchiveSuite software uses a patched version of Heritrix 1.12.1 to harvest webpages. A harvest done by Heritrix is specified with a harvest template (invariably named order.xml). A harvest template describes how much to harvest and from where. Furthermore, a seedlist is always associated with a given order.xml.
The standard harvest template used by NetarchiveSuite follows the order.xml standard of Heritrix 1.10+.
Our default harvest template can be seen here in full: default_orderxml.xml
If you intend to build your own templates, it is recommended to use this template as a baseline.

Mandatory elements in the NetarchiveSuite and their role

A number of elements in the order.xml are required in all NetarchiveSuite harvest templates:

A. The QuotaEnforcer

The QuotaEnforcer is used to restrict the number of bytes harvested from each domain in the seedlist.
<newObject name="QuotaEnforcer" class="org.archive.crawler.prefetch.QuotaEnforcer">
  <boolean name="force-retire">false</boolean>
  <boolean name="enabled">true</boolean>
  <newObject name="QuotaEnforcer#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules">
    </map>
  </newObject>
  <long name="server-max-fetch-successes">-1</long>
  <long name="server-max-success-kb">-1</long>
  <long name="server-max-fetch-responses">-1</long>
  <long name="server-max-all-kb">-1</long>
  <long name="host-max-fetch-successes">-1</long>
  <long name="host-max-success-kb">-1</long>
  <long name="host-max-fetch-responses">-1</long>
  <long name="host-max-all-kb">-1</long>
  <long name="group-max-fetch-successes">-1</long>
  <long name="group-max-success-kb">-1</long>
  <long name="group-max-fetch-responses">-1</long>
  <long name="group-max-all-kb">-1</long>
</newObject>
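As an illustration of how a limit would be expressed, here is a minimal sketch of two of the settings above with non-default values. It assumes that -1 means "no limit" and that a positive value is an upper bound, which is how the Heritrix QuotaEnforcer is usually described. The numbers are hypothetical, and NetarchiveSuite may overwrite some of these values itself when it creates a job.

```xml
<!-- Hypothetical values, not taken from any shipped template:
     stop after 10000 successfully fetched objects or roughly 500 MB
     of responses per group. -->
<long name="group-max-fetch-successes">10000</long>
<long name="group-max-all-kb">500000</long>
```

All other quota settings would stay at -1 in this sketch.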

B. The DeDuplicator

The DeDuplicator is a module authored by Kristinn Sigurdsson from the National Library of Iceland. It is part of the Heritrix write-processor chain. It enables us to avoid saving duplicates in our storage. It does this by looking up the URL of the potential duplicate object in the index associated with this module. If the URL is found in the index, and the checksum recorded for it there is unaltered, the object is not stored. However, a reference to where the object is stored is written to the crawl log. If the URL for the object is not found in the index, the object is stored normally. Note that only non-text objects are examined by this module, i.e. objects whose mimetype does not match "^text/.*"; text/html and text/plain, for example, are skipped.
NetarchiveSuite uses a patched version of the 0.3.0-20061218 beta version of the deduplicator.
<newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
  <boolean name="enabled">true</boolean>
  <map name="filters">
  </map>
  <string name="index-location"/>
  <string name="matching-method">By URL</string>
  <boolean name="try-equivalent">true</boolean>
  <boolean name="change-content-size">false</boolean>
  <string name="mime-filter">^text/.*</string>
  <string name="filter-mode">Blacklist</string>
  <string name="analysis-mode">Timestamp</string>
  <string name="log-level">SEVERE</string>
  <string name="origin"/>
  <string name="origin-handling">Use index information</string>
  <boolean name="stats-per-host">true</boolean>
  <boolean name="use-sparse-range-filter">true</boolean>
</newObject>

C. The "http-headers" element

This element describes how Heritrix will present itself to the web servers when fetching data. By default it points to the non-existent web page http://my_website.com/my_infopage.html and the equally non-existent mail address "my_email@my_website.com". Please update this to your own institution and email!
<map name="http-headers">
  <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.12.1 +http://my_website.com/my_infopage.html)</string>
  <string name="from">my_email@my_website.com</string>
</map>
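A filled-in version might look like the sketch below. The information page and the mail address are placeholders on the reserved example.org domain, not real contact points. The sketch keeps the "+URL" inside the parentheses of the user-agent, on the assumption that Heritrix expects the value in that form.

```xml
<map name="http-headers">
  <!-- Placeholder URL: replace with a page describing your harvesting. -->
  <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.example.org/webarchive.html)</string>
  <!-- Placeholder address: replace with a mailbox that someone reads. -->
  <string name="from">webarchive@example.org</string>
</map>
```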

D. The Archiver element

This element does the actual writing of the fetched objects to an ARC file. In the future we may want to write to WARC files instead, which can easily be done. Heritrix allows you to have multiple 'Writers' in use at the same time. For instance, you can write your objects to both ARC and WARC at the same time, as well as writing the objects to a database.
<newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
  <boolean name="enabled">true</boolean>
  <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules">
    </map>
  </newObject>
  <boolean name="compress">false</boolean>
  <string name="prefix">IAH</string>
  <string name="suffix">${HOSTNAME}</string>
  <integer name="max-size-bytes">100000000</integer>
  <stringList name="path">
    <string>arcs</string>
  </stringList>
  <integer name="pool-max-active">5</integer>
  <integer name="pool-max-wait">300000</integer>
  <long name="total-bytes-to-write">0</long>
  <boolean name="skip-identical-digests">false</boolean>
</newObject>

E. The ContentSize element

To have statistics work right when jobs finish and go back into the database, all templates in NetarchiveSuite need a special content-size annotation post-processor. If this element is not present in the template, the size will always be 0 in the database for harvests done with that template:
<newObject name="ContentSize" class="dk.netarkivet.harvester.harvesting.ContentSizeAnnotationPostProcessor">
  <boolean name="enabled">true</boolean>
  <newObject name="ContentSize#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules">
    </map>
  </newObject>
</newObject>

F. The Scope element

The scope element decides which URLs to harvest and which not to harvest. Before release 3.6.0, we used the following three scopes:
DomainScope. The standard NetarchiveSuite scope allows the harvester to fetch all objects coming from any second-level domain represented by one of the seeds. Embedded objects, like images and stylesheets, are always fetched even when coming from other domains.
HostScope. This scope is restricted to fetching objects from the hosts represented by the seeds.
PathScope. This scope is restricted to fetching objects from
These three scopes were all deprecated from Heritrix 1.10.0, and all NetarchiveSuite templates are now required to use the DecidingScope instead. This type of scope uses a sequence of DecideRules to define the scope of the harvest. We now emulate the three old scopes by adding a specific DecideRule to the DecidingScope. In the case of DomainScope, this required designing our own DecideRule (dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule). So for DomainScope type scopes, you add the following element:
<newObject name="acceptURIFromSeedDomains" class="dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule">
  <string name="decision">ACCEPT</string>
  <string name="surts-source-file">seeds.txt</string>
  <boolean name="seeds-as-surt-prefixes">false</boolean>
  <string name="surts-dump-file"/>
  <boolean name="also-check-via">false</boolean>
  <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>
Emulating the HostScope requires adding the OnHostsDecideRule element:
<newObject name="acceptIfOnSeedsHosts" class="org.archive.crawler.deciderules.OnHostsDecideRule">
  <string name="decision">ACCEPT</string>
  <string name="surts-dump-file"></string>
  <boolean name="also-check-via">false</boolean>
  <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>
Emulating the PathScope requires adding the SurtPrefixedDecideRule element:
<newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
  <string name="decision">ACCEPT</string>
  <string name="surts-source-file"></string>
  <boolean name="seeds-as-surt-prefixes">true</boolean>
  <string name="surts-dump-file"></string>
  <boolean name="also-check-via">false</boolean>
  <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>
An example of a complete DecidingScope element is shown below.
<newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
  <boolean name="enabled">true</boolean>
  <string name="seedsfile">seeds.txt</string>
  <boolean name="reread-seeds-on-config">true</boolean>
  <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules">
      <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"/>
      <newObject name="acceptURIFromSeedDomains" class="dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule">
        <string name="decision">ACCEPT</string>
        <string name="surts-source-file"></string>
        <boolean name="seeds-as-surt-prefixes">true</boolean>
        <string name="surts-dump-file"/>
        <boolean name="also-check-via">false</boolean>
        <boolean name="rebuild-on-reconfig">true</boolean>
      </newObject>
      <newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
        <integer name="max-hops">25</integer>
      </newObject>
      <newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
        <integer name="max-repetitions">3</integer>
      </newObject>
      <newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
        <integer name="max-trans-hops">25</integer>
        <integer name="max-speculative-hops">1</integer>
      </newObject>
      <newObject name="pathdepthfilter" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
        <integer name="max-path-depth">20</integer>
      </newObject>
      <newObject name="global_crawlertraps" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
        <string name="decision">REJECT</string>
        <string name="list-logic">OR</string>
        <stringList name="regexp-list">
          <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
          <string>.*core\.UserAdmin.*register\.UserSelfRegistration.*</string>
          <string>.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*</string>
          <string>.*act=calendar&amp;cal_id=.*</string>
          <string>.*advCalendar_pi.*</string>
          <string>.*cal\.asp\?date=.*</string>
          <string>.*cal\.asp\?view=monthly&amp;date=.*</string>
          <string>.*cal\.asp\?view=weekly&amp;date=.*</string>
          <string>.*cal\.asp\?view=yearly&amp;date=.*</string>
          .....
          <string>.*index\.php\?iDate=.*</string>
          <string>.*index\.php\?module=PostCalendar&amp;func=view.*</string>
          <string>.*index\.php\?option=com_events&amp;task=view.*</string>
          <string>.*index\.php\?option=com_events&amp;task=view_day&amp;year=.*</string>
          <string>.*index\.php\?option=com_events&amp;task=view_detail&amp;year=.*</string>
          <string>.*index\.php\?option=com_events&amp;task=view_month&amp;year=.*</string>
          <string>.*index\.php\?option=com_events&amp;task=view_week&amp;year=.*</string>
        </stringList>
      </newObject>
    </map>
  </newObject>
</newObject>

The anatomy of a decidingscope

Finally, we describe the rest of the components of a decidingscope element.

The header

<newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
  <boolean name="enabled">true</boolean>
  <string name="seedsfile">seeds.txt</string>
  <boolean name="reread-seeds-on-config">true</boolean>
  <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules">
      <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"/>

The defining deciderule

Here we have the deciderule that defines this as either a DomainScope, a HostScope, or a PathScope.
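For instance, a HostScope-style template would place the OnHostsDecideRule from section F directly after the rejectByDefault rule. The fragment below is a sketch assembled from the listings in this appendix, not a copy of one of the shipped templates; the remaining rules would follow inside the same map.

```xml
<map name="rules">
  <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"/>
  <!-- The defining rule: accept URIs on the same hosts as the seeds. -->
  <newObject name="acceptIfOnSeedsHosts" class="org.archive.crawler.deciderules.OnHostsDecideRule">
    <string name="decision">ACCEPT</string>
    <string name="surts-dump-file"></string>
    <boolean name="also-check-via">false</boolean>
    <boolean name="rebuild-on-reconfig">true</boolean>
  </newObject>
  <!-- The standard harvest rules and the crawlertrap rule follow here. -->
</map>
```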

Standard harvest rules

These rules add more restrictions to the scope:
Restrict the number of hops allowed from any seed. Normally set to 25.
Restrict the number of repetitions in a URL path, e.g. repetition/repetition/... Repetitions are normally symptoms of crawlertraps.
Define the maximal transclusion hops and maximal speculative hops. (http://crawler.archive.org/apidocs/org/archive/crawler/deciderules/TransclusionDecideRule.html)
Restrict the maximal path depth. Normally set to 20.
<newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
  <integer name="max-hops">25</integer>
</newObject>
<newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
  <integer name="max-repetitions">3</integer>
</newObject>
<newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
  <integer name="max-trans-hops">25</integer>
  <integer name="max-speculative-hops">1</integer>
</newObject>
<newObject name="pathdepthfilter" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
  <integer name="max-path-depth">20</integer>
</newObject>

Define general crawlertraps to be avoided

Lists of crawlertraps to be avoided are defined with a MatchesListRegExpDecideRule. Here we list all crawlertraps, each defined by a regular expression. If any object matches one of these regular expressions, the object is not fetched (unless a previous rule requires the object to be fetched).
<newObject name="global_crawlertraps" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="list-logic">OR</string>
  <stringList name="regexp-list">
    <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
  </stringList>
</newObject>
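To add a trap of your own, you would append another string element to the regexp-list. The sketch below uses a made-up pattern for an endless calendar. It assumes, as the leading and trailing .* of the existing patterns suggest, that each expression has to match the whole URL. Because the template is XML, a literal & in a pattern is written as &amp;.

```xml
<stringList name="regexp-list">
  <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
  <!-- Hypothetical pattern: reject calendar pages addressed by year and month. -->
  <string>.*\/calendar\.php\?year=\d+&amp;month=\d+.*</string>
</stringList>
```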
When a new harvest job is created, another MatchesListRegExpDecideRule is added to the harvest template, specifying the crawlertraps to be avoided.

The HarvestTemplateApplication tool

You can upload and download the templates using our GUI. This is described in our User Manual. But you can also upload and download the templates using the command-line HarvestTemplateApplication. This application allows you to create, download, and update templates. We have made a script to make it easier to use this application: HarvestTemplateApplication.sh.txt
java dk.netarkivet.harvester.datamodel.HarvestTemplateApplication <command> <args>
    create <template-name> <xml-file for this template>
    download [<template-name>]
    update <template-name> <xml-file to replace this template>
    showall

Predefined harvest templates

All our templates fall into three categories depending on the scope defined in the template. Note that our templates generally do not obey robots.txt. This is because the Danish legislation allows us to ignore the constraints dictated by robots.txt. However, there are two exceptions to this rule:
default_obeyrobots.xml
default_obeyrobots_withforms.xml
Even though DomainScope, HostScope, PathScope are now emulated using DecidingScope, these categories are still useful:

Templates w/ DomainScope

default_orderxml.xml (standard template)
default_withforms.xml (standard template that can handle forms)
default_obeyrobots.xml (standard template that obeys robots.txt)
default_obeyrobots_withforms.xml (standard template that obeys robots.txt and handles forms)
default_orderxml_low_bandwidth.xml (standard template for sites with low bandwidth)
frontpages.xml (harvest template that only harvests the seeds and associated stylesheets and images)
frontpages_plus_1level.xml (the above plus one extra level)
frontpages_plus_2levels.xml (the above plus two extra levels)

Templates w/ HostScope

host_10levels_orderxml.xml (harvest the hosts of the seeds up to 10 levels from seeds)
host_100levels_orderxml.xml (harvest the hosts of the seeds up to 100 levels from seeds)

Templates w/ PathScope

path_10levels_orderxml.xml (harvest the hosts of the seeds up to 10 levels from seeds)
path_100levels_orderxml.xml (harvest the hosts of the seeds up to 100 levels from seeds)

Configuration Manual 3.8/AppendixB (last edited 2010-08-16 10:24:08 by localhost)