Appendices

Appendix A: Plug-ins in the NetarchiveSuite

All the settings above ending in ".class" indicate that the implementation of a certain feature can be replaced by an alternative implementation. There are usually several existing classes to choose from. If none of them is suitable, the framework also lets the installer replace the default class with a class of their own.

We now describe the available plugs and the existing plugins for each of them.

settings.common.remoteFile.class: This setting selects the method of file transfer used in the NetarchiveSuite. You can choose between FTPRemoteFile (where the data is transferred using an FTP server), HTTPRemoteFile (where the data is transferred using two embedded webservers, one at each end), and HTTPSRemoteFile, which works just like HTTPRemoteFile except that it uses a shared certificate file for secure communication. Note that HTTPRemoteFile and HTTPSRemoteFile require dedicated ports in the firewall to be open between all possible senders and recipients of data. Implementers of new file transfer methods must implement dk.netarkivet.common.distribute.RemoteFile. The default value is FTPRemoteFile.
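
As an illustration, switching to HTTP transfer could look roughly like the settings fragment below. This is a sketch only: it assumes that HTTPRemoteFile lives in the same package as RemoteFile (dk.netarkivet.common.distribute), and it leaves out any port or certificate settings that the HTTP-based plugins may need.

    <settings>
        <common>
            <remoteFile>
                <!-- Assumed fully qualified name; check it against your NetarchiveSuite version -->
                <class>dk.netarkivet.common.distribute.HTTPRemoteFile</class>
            </remoteFile>
        </common>
    </settings>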

settings.common.database.class: This setting selects the DBSpecifics class that matches your database type. Change it only if your database type differs from the default. See the javadoc for details.

settings.common.jms.class: This setting designates what kind of JMS broker the NetarchiveSuite uses to send messages between applications. Presently only the Sun JMS broker is supported (dk.netarkivet.common.distribute.JMSConnectionSunMQ). The class must implement dk.netarkivet.common.distribute.JMSConnection.

settings.common.arcrepositoryClient.class: The class must implement dk.netarkivet.common.distribute.ArcRepositoryClient. The available choices are the default dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient (required if you want to access the distributed type of archive that is included in the NetarchiveSuite) and dk.netarkivet.common.distribute.LocalArcRepositoryClient (which allows access to a local archive).

settings.common.notifications.class: Allows for different ways of making notifications. The default choice is the class dk.netarkivet.common.utils.EMailNotifications (which allows you to receive notifications by email). The use of this plugin requires setting the mail server and the recipient and sender email addresses. Alternatively, you can use dk.netarkivet.common.utils.PrintNotifications, which simply prints the notifications to stderr on the terminal.
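
For example, a test installation without a mail server might select the printing plugin. The fragment below is a minimal sketch; the nesting is derived from the setting name, and the rest of the settings file is omitted.

    <settings>
        <common>
            <notifications>
                <!-- Print notifications to stderr instead of sending email -->
                <class>dk.netarkivet.common.utils.PrintNotifications</class>
            </notifications>
        </common>
    </settings>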

settings.common.webinterface.sitesection.class: This setting allows you to add webmodules to the NetarchiveSuite GUI. Several SiteSection classes can be active in the same GUI. The default (standard) configuration contains all 5 existing webmodules:

  1. HarvestDefinition: Allows you to define and schedule harvests

  2. HarvestHistory: Shows the status of running and finished harvest jobs

  3. BitPreservation: Contains tools for sanity testing data in the bitarchives

  4. QA: Module for doing Quality Assurance

  5. Status: Module for monitoring the health of all machines and applications

settings.common.webinterface.language: The languages supported by the webinterface. Danish (locale=da), English (locale=en), German (locale=de), and Italian (locale=it) are supported currently. The Coding Guidelines will tell you how to add support for more languages to the NetarchiveSuite.

settings.common.indexClient: The client selected for access to indices. Indices are requested by the HarvestControllerApplication instances.

    <indexClient>
        <!-- The class instantiated to give access to indices. Will be created by IndexClientFactory -->
        <class>dk.netarkivet.archive.indexserver.distribute.IndexRequestClient</class>
        <!-- The amount of time, in milliseconds, we should wait for replies
             when issuing a call to generate an index over some jobs.
         -->
        <indexRequestTimeout>43200000</indexRequestTimeout>
    </indexClient>

settings.common.monitorregistryClient.class: This defines which class to use for the monitor registry. The class must implement the interface dk.netarkivet.common.distribute.monitorregistry.MonitorRegistryClient. There are two available implementations:

  • dk.netarkivet.common.distribute.monitorregistry.PrintMonitorRegistryClient (just prints to stdout the JMX port and RMI port to use for connecting to its JVM).

  • dk.netarkivet.monitor.distribute.JMSMonitorRegistryClient: registers itself centrally with a registry by sending JMS messages every minute. This delay can be configured with the settings.common.monitorregistryClient.reregisterdelay setting.

The default class is dk.netarkivet.monitor.distribute.JMSMonitorRegistryClient.

settings.common.freespaceprovider.class: This setting defines which plugin to use for reporting how much free space is available. Must implement the dk.netarkivet.common.utils.FreeSpaceProvider interface. Available implementations are:

  • dk.netarkivet.common.utils.DefaultFreeSpaceProvider (uses File.getUsableSpace() to compute the free space available)

  • dk.netarkivet.common.utils.FilebasedFreeSpaceProvider (Reads the free space available out of a file)

The default class is dk.netarkivet.common.utils.DefaultFreeSpaceProvider.

settings.archive.admin.class: Class for accessing and manipulating the administrative data for the ArcRepository. All classes must implement the dk.netarkivet.archive.arcrepositoryadmin.AdminData interface. The available implementations are:

  • dk.netarkivet.archive.arcrepositoryadmin.UpdateableAdminData (file-based implementation that uses an admin.data file containing the ingested files and their checksums).

  • dk.netarkivet.archive.arcrepositoryadmin.DatabaseAdmin (database implementation that uses a database defined by the following settings: settings.archive.admin.database.[class|machine|port|dir]).

The default class is dk.netarkivet.archive.arcrepositoryadmin.UpdateableAdminData.

settings.archive.admin.database.class: Which class to use for your adminDB database. This plugin is used if the setting settings.archive.admin.class is set to the class dk.netarkivet.archive.arcrepositoryadmin.DatabaseAdmin and the setting settings.archive.bitpreservation.class is set to the class dk.netarkivet.archive.arcrepository.bitpreservation.DatabaseBasedActiveBitPreservation.

settings.archive.bitpreservation.class: Setting for which class should handle ActiveBitPreservation. All implementations must implement the dk.netarkivet.archive.arcrepository.bitpreservation.ActiveBitPreservation interface. The following implementations are available:

  • dk.netarkivet.archive.arcrepository.bitpreservation.DatabaseBasedActiveBitPreservation (uses a database to store the results of the bitpreservation actions).

  • dk.netarkivet.archive.arcrepository.bitpreservation.FileBasedActiveBitPreservation (stores the results of bitpreservation actions to a set of files on disk).
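
To show how the three archive settings above fit together, here is a sketch of a database-based configuration. The class names for DatabaseAdmin and DatabaseBasedActiveBitPreservation are taken from the text; everything inside the database element (plugin class, machine, port, and dir) is a hypothetical placeholder that must be replaced with the values of your own installation.

    <settings>
        <archive>
            <admin>
                <class>dk.netarkivet.archive.arcrepositoryadmin.DatabaseAdmin</class>
                <database>
                    <!-- Hypothetical values: replace with your own adminDB plugin class and connection details -->
                    <class>org.example.MyAdminDatabasePlugin</class>
                    <machine>localhost</machine>
                    <port>1527</port>
                    <dir>adminDB</dir>
                </database>
            </admin>
            <bitpreservation>
                <class>dk.netarkivet.archive.arcrepository.bitpreservation.DatabaseBasedActiveBitPreservation</class>
            </bitpreservation>
        </archive>
    </settings>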

settings.harvester.harvesting.heritrixController.class: This class handles the communication to a running Heritrix instance. All implementations must implement the dk.netarkivet.harvester.harvesting.HeritrixController interface. There are two implementations available of which one is deprecated:

  • dk.netarkivet.harvester.harvesting.DirectHeritrixController (deprecated; embeds a Heritrix CrawlController which starts and stops one crawl job)

  • dk.netarkivet.harvester.harvesting.JMXHeritrixController (starts Heritrix as an independent process, ready to crawl a predefined crawl job; Heritrix is asked to shut down after the crawl job has terminated)

The default class is dk.netarkivet.harvester.harvesting.JMXHeritrixController.

settings.wayback.urlcanonicalizer.classname: The class used to canonicalize URLs. This class must implement the interface org.archive.wayback.UrlCanonicalizer. The only acceptable implementation is dk.netarkivet.wayback.batch.copycode.NetarchiveSuiteAggressiveUrlCanonicalizer, which is also the default class.

Appendix B: Managing Heritrix Harvest Templates (order.xml)

The NetarchiveSuite software uses Heritrix 1.14.3 to harvest webpages. A harvest done by Heritrix is specified with a harvest template (invariably named order.xml). A harvest template describes how much to harvest and from where. Furthermore a seedlist is always associated with a given order.xml.

The standard harvest template used by NetarchiveSuite follows the order.xml standard of Heritrix 1.10+.

Our default harvest template can be seen here in full: default_orderxml.xm

If you intend to build your own templates, it is recommended to use this template as a baseline.

Mandatory elements in the NetarchiveSuite and their role

A number of elements in the order.xml are required in all NetarchiveSuite harvest templates:

A. The QuotaEnforcer

The QuotaEnforcer is used to restrict the number of bytes harvested from each domain in the seedlist.

<newObject name="QuotaEnforcer" class="org.archive.crawler.prefetch.QuotaEnforcer">
        <boolean name="force-retire">false</boolean>
        <boolean name="enabled">true</boolean>
        <newObject name="QuotaEnforcer#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                <map name="rules">
                </map>
        </newObject>
        <long name="server-max-fetch-successes">-1</long>
        <long name="server-max-success-kb">-1</long>
        <long name="server-max-fetch-responses">-1</long>
        <long name="server-max-all-kb">-1</long>
        <long name="host-max-fetch-successes">-1</long>
        <long name="host-max-success-kb">-1</long>
        <long name="host-max-fetch-responses">-1</long>
        <long name="host-max-all-kb">-1</long>
        <long name="group-max-fetch-successes">-1</long>
        <long name="group-max-success-kb">-1</long>
        <long name="group-max-fetch-responses">-1</long>
        <long name="group-max-all-kb">-1</long>
        <boolean name="use-sparse-range-filter">true</boolean>
</newObject>
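
In the template above every quota is set to -1, which Heritrix treats as "no limit". As a sketch of how a limit is expressed, the line below caps one of the quotas. It assumes that the per-domain byte limit corresponds to the group-max-all-kb attribute, and the figure of 500000 KB is purely illustrative.

        <!-- Illustrative only: retire a queue group once about 500000 KB have been received -->
        <long name="group-max-all-kb">500000</long>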

B. The DeDuplicator

The DeDuplicator is a module authored by Kristinn Sigurdsson from the National Library of Iceland. It is part of the Heritrix Write-processor chain. It enables us to avoid saving duplicates in our storage. It does this by looking up the URL of the potential duplicate object in the index associated with this module. If the URL is found in the index, and the checksum for the URL in the index is unaltered, the object is not stored. However, a reference to where the object is stored is written to the crawl log. If the URL for the object is not found in the index, the object is stored normally. Note that only non-text objects are examined by this module, i.e. objects whose mimetype does not match "^text/.*" (like text/html or text/plain). Note also that deduplication is disabled if either the DeDuplicator element in the harvest template is disabled (the value of the attribute "enabled" is set to false), or the general setting settings.harvester.harvesting.deduplication.enabled is set to false. NetarchiveSuite uses version 0.4.0 of the DeDuplicator.

<newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
        <boolean name="enabled">true</boolean>
        <map name="filters">
        </map>
        <string name="index-location"/>
        <string name="matching-method">By URL</string>
        <boolean name="try-equivalent">true</boolean>
        <boolean name="change-content-size">false</boolean>
        <string name="mime-filter">^text/.*</string>
        <string name="filter-mode">Blacklist</string>
        <string name="analysis-mode">Timestamp</string>
        <string name="log-level">SEVERE</string>
        <string name="origin"/>
        <string name="origin-handling">Use index information</string>
        <boolean name="stats-per-host">true</boolean>
</newObject>
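
As described above, deduplication can be switched off in two places. The following sketch shows the general setting; the nesting is derived from the setting name settings.harvester.harvesting.deduplication.enabled, and the rest of the settings file is omitted.

    <settings>
        <harvester>
            <harvesting>
                <deduplication>
                    <!-- Disables deduplication for all harvests, whatever the template says -->
                    <enabled>false</enabled>
                </deduplication>
            </harvesting>
        </harvester>
    </settings>

To disable it for a single template instead, the "enabled" attribute inside the DeDuplicator element would be set to false:

        <boolean name="enabled">false</boolean>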

C. The "http-headers" element

This element describes how Heritrix presents itself to the webservers when fetching data. By default it points to the non-existent webpage http://my_website.com/my_infopage.html and the equally non-existent mail address "my_email@my_website.com". Please update this to your own institution and email!

        <map name="http-headers">
            <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.14.3 +http://my_website.com/my_infopage.html)</string>
            <string name="from">my_email@my_website.com</string>
        </map>
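
A filled-in version might look like the sketch below. The institution URL and the mail address are hypothetical (example.org is a reserved example domain) and stand in for your own values.

        <map name="http-headers">
            <!-- Hypothetical values: use your own info page and contact address -->
            <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.14.3 +http://www.example.org/webarchive/info.html)</string>
            <string name="from">webarchive@example.org</string>
        </map>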

D. The Archiver element

This element does the actual writing of the fetched objects to an ARC file. In the future we may want to write to WARC files instead, which can easily be done. Heritrix allows you to have multiple 'Writers' in use at the same time. For instance, you can write your objects to both ARC and WARC at the same time, as well as writing the objects to a database.

<newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
        <boolean name="enabled">true</boolean>
        <newObject name="Archiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                <map name="rules">
                </map>
        </newObject>
        <boolean name="compress">false</boolean>
        <string name="prefix">IAH</string>
        <string name="suffix">${HOSTNAME}</string>
        <integer name="max-size-bytes">100000000</integer>
        <stringList name="path">
                <string>arcs</string>
        </stringList>
        <integer name="pool-max-active">5</integer>
        <integer name="pool-max-wait">300000</integer>
        <long name="total-bytes-to-write">0</long>
        <boolean name="skip-identical-digests">false</boolean>
</newObject>

E. The ContentSize element

For the statistics to be correct when a job finishes and its results are written back to the database, all templates in NetarchiveSuite require a special content-size annotation post-processor. If this element is not present in the template, the size will always be 0 in the database for harvests done with that template:

<newObject name="ContentSize" class="dk.netarkivet.harvester.harvesting.ContentSizeAnnotationPostProcessor">
        <boolean name="enabled">true</boolean>
        <newObject name="ContentSize#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                <map name="rules">
                </map>
        </newObject>
</newObject>

F. The Scope element

The scope element decides which urls to harvest and which not to harvest. Before release 3.6.0, we used the following three scopes:

  1. DomainScope. The standard NetarchiveSuite scope allows the harvester to fetch all objects coming from any of the 2nd level domains represented by the seeds. Embedded objects, like images and stylesheets, are always fetched even when coming from other domains.

  2. HostScope. This scope is restricted to fetching objects from the hosts represented by the seeds.

  3. PathScope. This scope is restricted to fetching objects from the paths represented by the seeds.

These 3 scopes were all deprecated from Heritrix 1.10.0, and now all NetarchiveSuite templates are required to use the DecidingScope instead. This type of Scope uses a sequence of DecideRules to define the scope of the harvest. We now emulate these three scopes by adding a specific DecideRule to the DecidingScope. In the case of DomainScope, it required designing our own DecideRule (dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule). So for DomainScope type scopes, you add the following element:

<newObject name="acceptURIFromSeedDomains" class="dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-source-file">seeds.txt</string>
                                <boolean name="seeds-as-surt-prefixes">false</boolean>
                                <string name="surts-dump-file"/>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>

Emulating the HostScope requires adding the OnHostsDecideRule element:

<newObject name="acceptIfOnSeedsHosts" class="org.archive.crawler.deciderules.OnHostsDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-dump-file"></string>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
                        </newObject>

Emulating the PathScope requires adding the SurtPrefixedDecideRule element:

<newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-source-file"></string>
                                <boolean name="seeds-as-surt-prefixes">true</boolean>
                                <string name="surts-dump-file"></string>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
                        </newObject>

An example of a complete DecidingScope element is shown below.

        <newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
            <boolean name="enabled">true</boolean>
            <string name="seedsfile">seeds.txt</string>
            <boolean name="reread-seeds-on-config">true</boolean>
            <!-- DecideRuleSequence. Multiple DecideRules applied in order with last non-PASS the resulting decision -->
            <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                <map name="rules">
                        <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"/>
                        <newObject name="acceptURIFromSeedDomains" class="dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-source-file"></string>
                                <boolean name="seeds-as-surt-prefixes">true</boolean>
                                <string name="surts-dump-file"/>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
                        </newObject>
                        <newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
                                <integer name="max-hops">25</integer>
                        </newObject>
                        <newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
                                <integer name="max-repetitions">3</integer>
                        </newObject>
                        <newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
                                <integer name="max-trans-hops">25</integer>
                                <integer name="max-speculative-hops">1</integer>
                        </newObject>
                        <newObject name="pathdepthfilter" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
                                <integer name="max-path-depth">20</integer>
                        </newObject>
                        <newObject name="global_crawlertraps" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
                             <string name="decision">REJECT</string>
                             <string name="list-logic">OR</string>
                             <stringList name="regexp-list">
                             <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                             <string>.*core\.UserAdmin.*register\.UserSelfRegistration.*</string>
                             <string>.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*</string>
                             <string>.*act=calendar&amp;cal_id=.*</string>
                             <string>.*advCalendar_pi.*</string>
                             <string>.*cal\.asp\?date=.*</string>
                             <string>.*cal\.asp\?view=monthly&amp;date=.*</string>
                             <string>.*cal\.asp\?view=weekly&amp;date=.*</string>
                             <string>.*cal\.asp\?view=yearly&amp;date=.*</string>
                             .....
                             <string>.*index\.php\?iDate=.*</string>
                             <string>.*index\.php\?module=PostCalendar&amp;func=view.*</string>
                             <string>.*index\.php\?option=com_events&amp;task=view.*</string>
                             <string>.*index\.php\?option=com_events&amp;task=view_day&amp;year=.*</string>
                             <string>.*index\.php\?option=com_events&amp;task=view_detail&amp;year=.*</string>
                             <string>.*index\.php\?option=com_events&amp;task=view_month&amp;year=.*</string>
                             <string>.*index\.php\?option=com_events&amp;task=view_week&amp;year=.*</string>
                        </stringList>
                    </newObject>
                </map> <!-- end rules -->
            </newObject> <!-- end decide-rules -->
        </newObject> <!-- End DecidingScope -->

The anatomy of a DecidingScope

Finally, we describe the components of a DecidingScope element one by one.

The header

<newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
            <boolean name="enabled">true</boolean>
            <string name="seedsfile">seeds.txt</string>
            <boolean name="reread-seeds-on-config">true</boolean>
            <!-- DecideRuleSequence. Multiple DecideRules applied in order with last non-PASS the resulting decision -->
            <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
                  <map name="rules">
                        <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"/>

The defining deciderule

This is where the deciderule goes that defines the scope as either a DomainScope, a HostScope, or a PathScope.
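
For instance, in a HostScope-type template this position would hold the OnHostsDecideRule element shown earlier in this appendix:

                        <!-- Defining deciderule for a HostScope-type template -->
                        <newObject name="acceptIfOnSeedsHosts" class="org.archive.crawler.deciderules.OnHostsDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-dump-file"></string>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
                        </newObject>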

Standard harvest rules

These rules add more restrictions to the scope:

                        <newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
                                <integer name="max-hops">25</integer>
                        </newObject>
                        <newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
                                <integer name="max-repetitions">3</integer>
                        </newObject>
                        <newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
                                <integer name="max-trans-hops">25</integer>
                                <integer name="max-speculative-hops">1</integer>
                        </newObject>
                        <newObject name="pathdepthfilter" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
                                <integer name="max-path-depth">20</integer>
                        </newObject>

Define general crawlertraps to be avoided

Lists of crawlertraps to be avoided are defined with a MatchesListRegExpDecideRule. Each crawlertrap is defined by a regular expression. If the URL of an object matches one of these regular expressions, the object is not fetched (unless a later rule in the sequence accepts it, since the last non-PASS decision is the one that counts).

<newObject name="global_crawlertraps" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
                             <string name="decision">REJECT</string>
                             <string name="list-logic">OR</string>
                             <stringList name="regexp-list">
                               <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                             </stringList>
</newObject>

When a new harvest job is created, another MatchesListRegExpDecideRule, specifying the crawlertraps to be avoided for that job, is added to the harvest template.
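
Such an added rule could look roughly like the sketch below. The element name "job_crawlertraps" and the regular expression are hypothetical and only illustrate the shape of the rule; the name NetarchiveSuite actually generates is not stated here.

                        <!-- Hypothetical example of a crawlertrap rule added for one job -->
                        <newObject name="job_crawlertraps" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
                             <string name="decision">REJECT</string>
                             <string name="list-logic">OR</string>
                             <stringList name="regexp-list">
                               <string>.*www\.example\.org/calendar\?month=.*</string>
                             </stringList>
                        </newObject>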

The HarvestTemplateApplication tool

You can upload and download the templates using our GUI. This is described in our User Manual. But you can also upload and download the templates using the command-line tool HarvestTemplateApplication. This application allows you to create, download, and update templates. We have made a script to make it easier to use this application: HarvestTemplateApplication.sh.txt

java dk.netarkivet.harvester.tools.HarvestTemplateApplication <command> <args>
create <template-name> <xml-file for this template>
download [<template-name>]
update <template-name> <xml-file to replace this template>
showall

Predefined harvest templates

All our templates fall into three categories depending on the scope defined in the template. Note that our templates generally do not obey robots.txt, because Danish legislation allows us to ignore the constraints dictated by robots.txt. However, there are two exceptions to this rule:

  • default_obeyrobots.xml
  • default_obeyrobots_withforms.xml

Even though DomainScope, HostScope, PathScope are now emulated using DecidingScope, these categories are still useful:

Templates w/ DomainScope

  1. default_orderxml.xml (standard template)
  2. default_withforms.xml (standard template that can handle forms)
  3. default_obeyrobots.xml (standard template that obeys robots.txt)
  4. default_obeyrobots_withforms.xml (standard template that obeys robots.txt and handles forms)
  5. default_orderxml_low_bandwidth.xml (standard template for sites with low bandwidth)
  6. frontpages.xml (harvest template that only harvests the seeds and associated stylesheets and images)
  7. frontpages_plus_1level.xml (the above plus one extra level)
  8. frontpages_plus_2levels.xml (the above plus 2 extra levels)

Templates w/ HostScope

  1. host_10levels_orderxml.xml (harvest the hosts of the seeds up to 10 levels from seeds)
  2. host_100levels_orderxml.xml (harvest the hosts of the seeds up to 100 levels from seeds)

Templates w/ PathScope

  1. path_10levels_orderxml.xml (harvest the paths of the seeds up to 10 levels from seeds)
  2. path_100levels_orderxml.xml (harvest the paths of the seeds up to 100 levels from seeds)

Appendix C: Migrate the Heritrix templates to NetarchiveSuite 3.6.0+

If you are just using the predefined templates with few changes, such as a changed email address and website information, the easiest way to migrate is to take the predefined templates found in the binary distribution of NetarchiveSuite in the harvestdefinitionbasedir/order_templates_dist directory and change the email address and website information again.

If you do this, you also get the more inconsequential updates to the template:

  • The removal of obsolete attributes from some elements
  • Addition of new attributes to some elements

Then you just update the existing templates in your database with these modified ones using the HarvestTemplateApplication tool mentioned in Configuration Manual - Appendix B: Managing Heritrix Harvest Templates. Note that some templates are no longer distributed with NetarchiveSuite. If you want to keep using those, you need to follow the procedure described below.

If you have already put a lot of effort into making your own templates, you can update them by "only" upgrading the scope element in the templates from a DomainScope, HostScope, or PathScope to a DecidingScope.

Before we explain how to migrate these scopes to a DecidingScope, you need to know something about the anatomy of these scopes.

1) Header (includes scope class and attributes):

<newObject name="scope" class="org.archive.crawler.scope.PathScope">
            <boolean name="enabled">true</boolean>
            <string name="seedsfile">seeds.txt</string>
            <boolean name="reread-seeds-on-config">true</boolean>
            <integer name="max-link-hops">10</integer>
            <integer name="max-trans-hops">5</integer>

2) An OrFilter element named "exclude-filter" containing a number of filters as components: a HopsFilter, a PathDepthFilter, a PathologicalPathFilter, a URIRegExpFilter, a URIListRegExpFilter (filter to avoid common crawlertraps), and potentially other types of filters. Each of these filters will have to be converted to a similar DecideRule. Explanation to follow.

            <newObject name="exclude-filter" class="org.archive.crawler.filter.OrFilter">
                <boolean name="enabled">true</boolean>
                <boolean name="if-matches-return">true</boolean>
                <map name="filters">
                    <newObject name="hops_filter" class="org.archive.crawler.filter.HopsFilter">
                        <boolean name="enabled">true</boolean>
                    </newObject>
                    <newObject name="pathdepth" class="org.archive.crawler.filter.PathDepthFilter">
                        <boolean name="enabled">true</boolean>
                        <integer name="max-path-depth">20</integer>
                        <boolean name="path-less-or-equal-return">false</boolean>
                    </newObject>
                    <newObject name="pathologicalpath" class="org.archive.crawler.filter.PathologicalPathFilter">
                        <boolean name="enabled">true</boolean>
                        <integer name="repetitions">3</integer>
                    </newObject>
                    <newObject name="dr_dk" class="org.archive.crawler.filter.URIRegExpFilter">
                        <boolean name="enabled">true</boolean>
                        <boolean name="if-match-return">true</boolean>
                        <string name="regexp">.*dr\.dk.*epg\.asp.*</string>
                    </newObject>
                    <newObject name="globale_crawlertraps" class="org.archive.crawler.filter.URIListRegExpFilter">
                        <boolean name="enabled">true</boolean>
                        <boolean name="if-match-return">true</boolean>
                        <string name="list-logic">OR</string>
                        <stringList name="regexp-list">
                            <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                            <string>.*core\.UserAdmin.*register\.UserSelfRegistration.*</string>
                            <string>.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*</string>
                            <string>.*act=calendar&amp;cal_id=.*</string>
                            .....
                            <string>.*calendar\.asp\?qMonth=.*</string>
                            <string>.*calendar\.php\?sid=.*</string>
                            <string>.*worldscinet\.com.*</string>
                            <string>.*www3\.interscience\.wiley\.com.*</string>
                            <string>.*www-gdz\.sub\.uni-goettingen\.de.*</string>
                        </stringList>
                    </newObject>
                </map>
            </newObject>

3) Additional filters. Here we have a "force-accept-filter", an "additionalScopeFocus" filter, and a "transitiveFilter", of which only the transitiveFilter element needs to be converted. The two other elements are simply deleted.

            <newObject name="force-accept-filter" class="org.archive.crawler.filter.OrFilter">
                <boolean name="enabled">true</boolean>
                <boolean name="if-matches-return">true</boolean>
                <map name="filters">
                </map>
            </newObject>
            <newObject name="additionalScopeFocus" class="org.archive.crawler.filter.FilePatternFilter">
                <boolean name="enabled">true</boolean>
                <boolean name="if-match-return">true</boolean>
                <string name="use-default-patterns">All</string>
                <string name="regexp"/>
            </newObject>
            <newObject name="transitiveFilter" class="org.archive.crawler.filter.TransclusionFilter">
                <boolean name="enabled">true</boolean>
                <integer name="max-speculative-hops">1</integer>
                <integer name="max-referral-hops">15</integer>
                <integer name="max-embed-hops">15</integer>
            </newObject>
      </newObject> <!-- end of scope element -->

How to convert from the former scopes to a DecidingScope

Converting the header is easy. All headers have the form:

 <newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
            <boolean name="enabled">true</boolean>
            <string name="seedsfile">seeds.txt</string>
            <boolean name="reread-seeds-on-config">true</boolean>
            <!-- DecideRuleSequence. Multiple DecideRules applied in order with last non-PASS the resulting decision -->
            <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">

                <map name="rules">
                        <newObject name="rejectByDefault" 
                                class="org.archive.crawler.deciderules.RejectDecideRule"/>

followed by a defining deciderule that emulates the DomainScope, the HostScope, or the PathScope.

1) The defining deciderule for DomainScope (the only one that uses a special-purpose DecideRule) is:

<newObject name="acceptURIFromSeedDomains" class="dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-source-file">seeds.txt</string>
                                <boolean name="seeds-as-surt-prefixes">false</boolean>
                                <string name="surts-dump-file"/>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>

2) The defining deciderule for HostScope is:

<newObject name="OnHostsRule" class="org.archive.crawler.deciderules.OnHostsDecideRule">
        <string name="decision">ACCEPT</string>
        <string name="surts-dump-file"/>
        <boolean name="also-check-via">false</boolean>
        <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>

3) The defining deciderule for PathScope is:

<newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
                                <string name="decision">ACCEPT</string>
                                <string name="surts-source-file"></string>
                                <boolean name="seeds-as-surt-prefixes">true</boolean>
                                <string name="surts-dump-file"></string>
                                <boolean name="also-check-via">false</boolean>
                                <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>

After the header and the defining deciderule, we add a deciderule corresponding to the 'hops_filter'. Note that the last two attributes in the old header, 'max-link-hops' and 'max-trans-hops', cease to be general scope attributes. Instead, 'max-trans-hops' becomes an attribute of the "acceptIfTranscluded" deciderule (shown further below), and 'max-link-hops' becomes the 'max-hops' attribute of the new deciderule that replaces the 'hops_filter'. The following

<integer name="max-link-hops">10</integer>
<newObject name="hops_filter" class="org.archive.crawler.filter.HopsFilter">
        <boolean name="enabled">true</boolean>
</newObject>

is then translated to the following deciderule

<newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
        <integer name="max-hops">10</integer>
</newObject>

Following this, we need to add translations of the 'pathdepth' and 'pathologicalpath' elements, plus a translation of the 'transitiveFilter' element from the last part of the scope. The following

<newObject name="pathdepth" class="org.archive.crawler.filter.PathDepthFilter">
        <boolean name="enabled">true</boolean>
        <integer name="max-path-depth">20</integer>
        <boolean name="path-less-or-equal-return">false</boolean>
</newObject>
<newObject name="pathologicalpath" class="org.archive.crawler.filter.PathologicalPathFilter">
        <boolean name="enabled">true</boolean>
        <integer name="repetitions">3</integer>
</newObject>

<newObject name="transitiveFilter" class="org.archive.crawler.filter.TransclusionFilter">
        <boolean name="enabled">true</boolean>
        <integer name="max-speculative-hops">1</integer>
        <integer name="max-referral-hops">15</integer>
        <integer name="max-embed-hops">15</integer>
</newObject>

is translated to

<newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
        <integer name="max-repetitions">3</integer>
</newObject>
<newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
        <integer name="max-trans-hops">5</integer>
        <integer name="max-speculative-hops">1</integer>
</newObject>
<newObject name="pathdepthfilter" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
        <integer name="max-path-depth">20</integer>
</newObject>


Note that the attributes 'max-referral-hops' and 'max-embed-hops' of the 'transitiveFilter' element have been merged into the single attribute 'max-trans-hops', which is no longer an attribute of the scope itself, as it was in the old scopes.
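
As a hedged illustration of where the transclusion hop limit ends up, the sketch below places an old-style scope-level setting next to its new home. The old-scope value of 5 is an assumption, chosen only so that it matches the converted example; use the value from your own template.

```xml
<!-- Old scope (sketch): the limit is an attribute of the scope element itself.
     The value 5 is assumed here for illustration only. -->
<integer name="max-trans-hops">5</integer>

<!-- New DecidingScope: the same limit is set on the transclusion deciderule. -->
<newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
        <integer name="max-trans-hops">5</integer>
        <integer name="max-speculative-hops">1</integer>
</newObject>
```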


Now you only need to convert all remaining URIRegExpFilter and URIListRegExpFilter elements to a corresponding DecideRule. The deciderule corresponding to URIRegExpFilter is MatchesRegExpDecideRule, and the deciderule corresponding to URIListRegExpFilter is MatchesListRegExpDecideRule. Converting the dr_dk element (a URIRegExpFilter)

<newObject name="dr_dk" class="org.archive.crawler.filter.URIRegExpFilter">
        <boolean name="enabled">true</boolean>
        <boolean name="if-match-return">true</boolean>
        <string name="regexp">.*dr\.dk.*epg\.asp.*</string>
</newObject>

gives us:

<newObject name="dr_dk" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
        <string name="decision">REJECT</string>
        <string name="regexp">.*dr\.dk.*epg\.asp.*</string>
</newObject>

Converting the globale_crawlertraps element (a URIListRegExpFilter)

<newObject name="globale_crawlertraps" class="org.archive.crawler.filter.URIListRegExpFilter">
            <boolean name="enabled">true</boolean>
            <boolean name="if-match-return">true</boolean>
            <string name="list-logic">OR</string>
            <stringList name="regexp-list">
                <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                <string>.*core\.UserAdmin.*register\.UserSelfRegistration.*</string>
                <string>.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*</string>
                <string>.*act=calendar&amp;cal_id=.*</string>
                .....
                <string>.*calendar\.asp\?qMonth=.*</string>
                <string>.*calendar\.php\?sid=.*</string>
                <string>.*worldscinet\.com.*</string>
                <string>.*www3\.interscience\.wiley\.com.*</string>
                <string>.*www-gdz\.sub\.uni-goettingen\.de.*</string>
            </stringList>
</newObject>

gives us

<newObject name="globale_crawlertraps" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
            <string name="decision">REJECT</string>
            <string name="list-logic">OR</string>
            <stringList name="regexp-list">
                <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                <string>.*core\.UserAdmin.*register\.UserSelfRegistration.*</string>
                <string>.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*</string>
                <string>.*act=calendar&amp;cal_id=.*</string>
                .....
                <string>.*calendar\.asp\?qMonth=.*</string>
                <string>.*calendar\.php\?sid=.*</string>
                <string>.*worldscinet\.com.*</string>
                <string>.*www3\.interscience\.wiley\.com.*</string>
                <string>.*www-gdz\.sub\.uni-goettingen\.de.*</string>
            </stringList>
</newObject>


Finally, we need to close the sequence of deciderules and the scope itself. So we add

                </map> <!-- end rules -->
            </newObject> <!-- end decide-rules  -->
        </newObject> <!-- End DecidingScope -->

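
Putting the fragments together, a converted HostScope-style scope could look roughly like the sketch below. It only rearranges elements already shown in this appendix, in the order in which they were introduced. The crawler-trap list rule is left out for brevity, and the comments are illustrative additions rather than part of any shipped template.

```xml
<newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
    <boolean name="enabled">true</boolean>
    <string name="seedsfile">seeds.txt</string>
    <boolean name="reread-seeds-on-config">true</boolean>
    <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
        <map name="rules">
            <!-- Header: reject everything unless a later rule accepts it -->
            <newObject name="rejectByDefault"
                       class="org.archive.crawler.deciderules.RejectDecideRule"/>
            <!-- Defining deciderule (HostScope variant) -->
            <newObject name="OnHostsRule" class="org.archive.crawler.deciderules.OnHostsDecideRule">
                <string name="decision">ACCEPT</string>
                <string name="surts-dump-file"/>
                <boolean name="also-check-via">false</boolean>
                <boolean name="rebuild-on-reconfig">true</boolean>
            </newObject>
            <!-- Replaces hops_filter and the scope-level max-link-hops -->
            <newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
                <integer name="max-hops">10</integer>
            </newObject>
            <!-- Replaces pathologicalpath -->
            <newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
                <integer name="max-repetitions">3</integer>
            </newObject>
            <!-- Replaces transitiveFilter -->
            <newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
                <integer name="max-trans-hops">5</integer>
                <integer name="max-speculative-hops">1</integer>
            </newObject>
            <!-- Replaces pathdepth -->
            <newObject name="pathdepthfilter" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
                <integer name="max-path-depth">20</integer>
            </newObject>
            <!-- Replaces the dr_dk URIRegExpFilter -->
            <newObject name="dr_dk" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
                <string name="decision">REJECT</string>
                <string name="regexp">.*dr\.dk.*epg\.asp.*</string>
            </newObject>
        </map> <!-- end rules -->
    </newObject> <!-- end decide-rules -->
</newObject> <!-- End DecidingScope -->
```

Because the last rule that does not return PASS determines the outcome, the order of the rules matters: in this sketch the crawler-trap REJECT rules come last, so they override any earlier ACCEPT decision.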

Configuration Manual 3.14/Appendices (last edited 2010-08-16 10:24:38 by localhost)