= Appendix B: Managing Heritrix Harvest Templates (order.xml) =

The !NetarchiveSuite software uses [[HeritrixPatches|Heritrix 1.14.3]] to harvest webpages. A harvest done by Heritrix is specified with a harvest template (invariably named order.xml). A harvest template describes how much to harvest and from where. Furthermore, a seedlist is always associated with a given order.xml. The standard harvest template used by !NetarchiveSuite follows the order.xml standard of Heritrix 1.10+.

Our default harvest template can be seen here in full: [[attachment:default_orderxml.xml|default_orderxml.xml]]

If you intend to build your own templates, it is recommended to use this template as a baseline.

== Mandatory elements in the NetarchiveSuite and their role ==
A number of elements in the order.xml are required in all !NetarchiveSuite harvest templates:

=== A. The QuotaEnforcer ===
The !QuotaEnforcer is used to restrict the number of bytes harvested from each domain in the seedlist.

{{{
false true -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 true
}}}

=== B. The DeDuplicator ===
The !DeDuplicator is a module authored by Kristinn Sigurdsson from the National Library of Iceland. It is part of the [[http://crawler.archive.org/articles/developer_manual/overview.html#processor_chains|Heritrix Write-processor chain]]. It enables us to avoid saving duplicates in our storage. It does this by looking up the URL of the potential duplicate object in the index associated with this module. If the URL is found in the index and the checksum recorded for it there is unchanged, the object is not stored again. Instead, a reference to where the object is already stored is written to the crawl log. If the URL of the object is not found in the index, the object is stored normally. Note that only non-text objects are examined by this module, i.e.
where the mimetype of the object does not match "^text/.*" (this pattern excludes types such as text/html and text/plain). Note that deduplication is disabled if either the !DeDuplicator element in the harvest template is disabled (the value of the attribute "enabled" is set to false), or the general setting ''settings.harvester.harvesting.deduplication.enabled'' is set to ''false''. !NetarchiveSuite uses version 0.4.0 of the deduplicator.

{{{
true By URL true false ^text/.* Blacklist Timestamp SEVERE Use index information true
}}}

=== C. The "http-headers" element ===
This element describes how Heritrix presents itself to the webservers when fetching data. By default it points to the non-existent webpage http://my_website.com/my_infopage.html and the equally non-existent mail address "my_email@my_website.com". Please update these to the webpage and email address of your own institution!

{{{
Mozilla/5.0 (compatible; heritrix/1.14.3 +http://my_website.com/my_infopage.html)
my_email@my_website.com
}}}

=== D. The Archiver element ===
This element does the actual writing of the fetched objects to an ARC file. In the future we may want to write to WARC files instead, which can easily be done. Heritrix allows you to have multiple 'Writers' in use at the same time. For instance, you can write your objects to both ARC and WARC at the same time, as well as writing the objects to a database.

{{{
true false IAH ${HOSTNAME} 100000000 arcs 5 300000 0 false
}}}

=== E. The ContentSize element ===
For the statistics to be correct when jobs finish and are written back to the database, all templates in !NetarchiveSuite require a special content-size annotation post-processor. If this element is not present in the template, the size recorded in the database for the harvest will always be 0:

{{{
true
}}}

=== F. The Scope element ===
The scope element decides which URLs to harvest and which not to harvest. Before release 3.6.0, we used the following three scopes:

 A. !DomainScope.
   The standard !NetarchiveSuite scope allows the harvester to fetch all objects coming from any of the second-level domains represented by the seeds. Embedded objects, like images and stylesheets, are always fetched, even when they come from other domains.
 A. !HostScope. This scope is restricted to fetching objects from the hosts represented by the seeds.
 A. !PathScope. This scope is restricted to fetching objects from the paths represented by the seeds.

These three scopes were all deprecated from Heritrix 1.10.0, and all !NetarchiveSuite templates are now required to use the !DecidingScope instead. This type of scope uses a sequence of !DecideRules to define the scope of the harvest. We now emulate the three old scopes by adding a specific !DecideRule to the !DecidingScope. In the case of !DomainScope, this required designing our own !DecideRule (dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule). So to emulate the !DomainScope, you add the following element:

{{{
ACCEPT seeds.txt false false true
}}}

Emulating the !HostScope requires adding the !OnHostsDecideRule element:

{{{
ACCEPT false true
}}}

Emulating the !PathScope requires adding the !SurtPrefixesDecideRule element:

{{{
ACCEPT true false true
}}}

An example of a complete !DecidingScope element is shown below.

{{{
true seeds.txt true ACCEPT true false true 25 3 25 1 20 REJECT OR
.*core\.UserAdmin.*core\.UserLogin.*
.*core\.UserAdmin.*register\.UserSelfRegistration.*
.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*
.*act=calendar&cal_id=.*
.*advCalendar_pi.*
.*cal\.asp\?date=.*
.*cal\.asp\?view=monthly&date=.*
.*cal\.asp\?view=weekly&date=.*
.*cal\.asp\?view=yearly&date=.*
.....
.*index\.php\?iDate=.*
.*index\.php\?module=PostCalendar&func=view.*
.*index\.php\?option=com_events&task=view.*
.*index\.php\?option=com_events&task=view_day&year=.*
.*index\.php\?option=com_events&task=view_detail&year=.*
.*index\.php\?option=com_events&task=view_month&year=.*
.*index\.php\?option=com_events&task=view_week&year=.*
}}}

==== The anatomy of a decidingscope ====
Finally, we describe the rest of the components of a decidingscope element.

===== The header =====
{{{
true seeds.txt true
}}}

===== The defining deciderule =====
This is the deciderule that defines the scope as either a !DomainScope, a !HostScope, or a !PathScope.

===== Standard harvest rules =====
These rules add more restrictions to the scope:

 * Restrict the number of hops allowed from any seed. Normally set to 25.
 * Restrict the number of repetitions in a URL path, e.g. repetition/repetition/... Repetitions are normally symptoms of crawlertraps.
 * Define the maximal number of transclusion hops and the maximal number of speculative hops (see http://crawler.archive.org/apidocs/org/archive/crawler/deciderules/TransclusionDecideRule.html).
 * Restrict the maximal path depth. Normally set to 20.

{{{
25 3 25 1 20
}}}

===== Define general crawlertraps to be avoided =====
Lists of crawlertraps to be avoided are defined with a !MatchesListRegExpDecideRule. Here we list all crawlertraps, each defined by a regular expression. If an object matches any of these regular expressions, the object is not fetched (unless a previous rule requires the object to be fetched).

{{{
REJECT OR .*core\.UserAdmin.*core\.UserLogin.*
}}}

When a new harvest job is created, another !MatchesListRegExpDecideRule, which specifies the crawlertraps to be avoided, is added to the harvest template.

== The HarvestTemplateApplication tool ==
You can upload and download the templates using our GUI. This is described in our [[User Manual 3.12/Harvester Templates|User Manual]].
You can also upload and download the templates using the command-line tool !HarvestTemplateApplication. This application allows you to create, download, and update templates. We have made a script to make it easier to use this application: [[attachment:HarvestTemplateApplication.sh.txt]]

{{{
java dk.netarkivet.harvester.tools.HarvestTemplateApplication create download [] update showall
}}}

== Predefined harvest templates ==
All our templates fall into three categories depending on the scope defined in the template. Note that our templates generally do not obey robots.txt. This is because Danish legislation allows us to ignore the constraints dictated by robots.txt. However, there are two exceptions to this rule:

 * default_obeyrobots.xml
 * default_obeyrobots_withforms.xml

Even though !DomainScope, !HostScope, and !PathScope are now emulated using !DecidingScope, these categories are still useful:

=== Templates w/ DomainScope ===
 1. default_orderxml.xml (standard template)
 1. default_withforms.xml (standard template that can handle forms)
 1. default_obeyrobots.xml (standard template that obeys robots.txt)
 1. default_obeyrobots_withforms.xml (standard template that obeys robots.txt and handles forms)
 1. default_orderxml_low_bandwidth.xml (standard template for sites with low bandwidth)
 1. frontpages.xml (harvest template that only harvests the seeds and associated stylesheets and images)
 1. frontpages_plus_1level.xml (the above plus one extra level)
 1. frontpages_plus_2levels.xml (the above plus two extra levels)

=== Templates w/ HostScope ===
 1. host_10levels_orderxml.xml (harvest the hosts of the seeds up to 10 levels from seeds)
 1. host_100levels_orderxml.xml (harvest the hosts of the seeds up to 100 levels from seeds)

=== Templates w/ PathScope ===
 1. path_10levels_orderxml.xml (harvest the paths of the seeds up to 10 levels from seeds)
 1. path_100levels_orderxml.xml (harvest the paths of the seeds up to 100 levels from seeds)
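
When adapting one of these templates, it can help to check which URLs the crawlertrap list from the section "Define general crawlertraps to be avoided" would reject. The following is a minimal Python sketch of that REJECT/OR behaviour, not the Heritrix implementation. It assumes that a pattern must match the entire URL (which is consistent with every pattern starting and ending with `.*`). The names `CRAWLERTRAP_PATTERNS`, `is_crawlertrap` and `decide`, the "PASS" label, and the example.com URLs are hypothetical and only serve the illustration.

```python
import re

# A few patterns copied from the example DecidingScope in this appendix.
# The list name is hypothetical; in a template the patterns live in the
# MatchesListRegExpDecideRule element.
CRAWLERTRAP_PATTERNS = [
    r".*core\.UserAdmin.*core\.UserLogin.*",
    r".*cal\.asp\?date=.*",
    r".*index\.php\?iDate=.*",
]


def is_crawlertrap(url, patterns=CRAWLERTRAP_PATTERNS):
    """Return True if the whole URL matches at least one pattern (OR logic)."""
    # Assumption: full-string matching, hence re.fullmatch rather than re.search.
    return any(re.fullmatch(pattern, url) for pattern in patterns)


def decide(url, patterns=CRAWLERTRAP_PATTERNS):
    """Return "REJECT" for a crawlertrap URL, otherwise the illustrative label "PASS"."""
    return "REJECT" if is_crawlertrap(url, patterns) else "PASS"


if __name__ == "__main__":
    for candidate in (
        "http://example.com/cal.asp?date=2010-01-01",
        "http://example.com/index.html",
    ):
        print(candidate, decide(candidate))
```

Running the sketch against a handful of URLs from a previous crawl log is a cheap way to see whether a new regular expression is too broad before it is added to a template.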