Anchor(MigrateHeritrixTemplatesTo36)

Appendix C: Migrate the Heritrix templates to NetarchiveSuite 3.6.0+


If you are using the predefined templates with only a few changes, such as a changed email address and website information, the easiest way to migrate is to take the predefined templates found in the binary distribution of NetarchiveSuite, in the harvestdefinitionbasedir/order_templates_dist directory, and apply your email address and website changes to them again.

If you do this, you also get the other, less significant updates to the templates.

Then you just update the existing templates in your database with these modified ones using the HarvestTemplateApplication tool mentioned in [:Configuration Manual 3.12#ManagingHeritrixHarvestTemplates:Configuration Manual - Appendix B: Managing Heritrix Harvest Templates]. Note that some templates are no longer distributed with NetarchiveSuite. If you want to keep using those, you need to follow the procedure described below.

If you have already put a lot of effort into making your own templates, you can update your existing templates by "only" upgrading the scope element in the templates from a DomainScope, a HostScope, or a PathScope to a DecidingScope.

Before we explain how to migrate these scopes to a DecidingScope, you need to know something about the anatomy of these scopes.

1) Header (includes the scope class and attributes):

<newObject name="scope" class="org.archive.crawler.scope.PathScope">
            <boolean name="enabled">true</boolean>
            <string name="seedsfile">seeds.txt</string>
            <boolean name="reread-seeds-on-config">true</boolean>
            <integer name="max-link-hops">10</integer>
            <integer name="max-trans-hops">5</integer>

2) An OrFilter element named "exclude-filter" containing a number of filters as components: a HopsFilter, a PathDepthFilter, a PathologicalPathFilter, a URIRegExpFilter, a URIListRegExpFilter (a filter to avoid common crawler traps), and potentially other types of filters. Each of these filters has to be converted to a corresponding DecideRule, as explained below.

            <newObject name="exclude-filter" class="org.archive.crawler.filter.OrFilter">
                <boolean name="enabled">true</boolean>
                <boolean name="if-matches-return">true</boolean>
                <map name="filters">
                    <newObject name="hops_filter" class="org.archive.crawler.filter.HopsFilter">
                        <boolean name="enabled">true</boolean>
                    </newObject>
                    <newObject name="pathdepth" class="org.archive.crawler.filter.PathDepthFilter">
                        <boolean name="enabled">true</boolean>
                        <integer name="max-path-depth">20</integer>
                        <boolean name="path-less-or-equal-return">false</boolean>
                    </newObject>
                    <newObject name="pathologicalpath" class="org.archive.crawler.filter.PathologicalPathFilter">
                        <boolean name="enabled">true</boolean>
                        <integer name="repetitions">3</integer>
                    </newObject>
                    <newObject name="dr_dk" class="org.archive.crawler.filter.URIRegExpFilter">
                        <boolean name="enabled">true</boolean>
                        <boolean name="if-match-return">true</boolean>
                        <string name="regexp">.*dr\.dk.*epg\.asp.*</string>
                    </newObject>
                    <newObject name="globale_crawlertraps" class="org.archive.crawler.filter.URIListRegExpFilter">
                        <boolean name="enabled">true</boolean>
                        <boolean name="if-match-return">true</boolean>
                        <string name="list-logic">OR</string>
                        <stringList name="regexp-list">
                            <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                            <string>.*core\.UserAdmin.*register\.UserSelfRegistration.*</string>
                            <string>.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*</string>
                            <string>.*act=calendar&amp;cal_id=.*</string>
                            .....
                            <string>.*calendar\.asp\?qMonth=.*</string>
                            <string>.*calendar\.php\?sid=.*</string>
                            <string>.*worldscinet\.com.*</string>
                            <string>.*www3\.interscience\.wiley\.com.*</string>
                            <string>.*www-gdz\.sub\.uni-goettingen\.de.*</string>
                        </stringList>
                    </newObject>
                </map>
            </newObject>

3) Additional filters. Here we have a "force-accept-filter", an "additionalScopeFocus" filter, and a "transitiveFilter", of which only the transitiveFilter element needs to be converted. The other two elements are simply deleted.

            <newObject name="force-accept-filter" class="org.archive.crawler.filter.OrFilter">
                <boolean name="enabled">true</boolean>
                <boolean name="if-matches-return">true</boolean>
                <map name="filters">
                </map>
            </newObject>
            <newObject name="additionalScopeFocus" class="org.archive.crawler.filter.FilePatternFilter">
                <boolean name="enabled">true</boolean>
                <boolean name="if-match-return">true</boolean>
                <string name="use-default-patterns">All</string>
                <string name="regexp"/>
            </newObject>
            <newObject name="transitiveFilter" class="org.archive.crawler.filter.TransclusionFilter">
                <boolean name="enabled">true</boolean>
                <integer name="max-speculative-hops">1</integer>
                <integer name="max-referral-hops">15</integer>
                <integer name="max-embed-hops">15</integer>
            </newObject>
      </newObject> <!-- end of scope element -->

How to convert from the former scopes to a DecidingScope

Converting the header is easy. All headers have the form:

 <newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
            <boolean name="enabled">true</boolean>
            <string name="seedsfile">seeds.txt</string>
            <boolean name="reread-seeds-on-config">true</boolean>
            <!-- DecideRuleSequence. Multiple DecideRules applied in order with last non-PASS the resulting decision -->
            <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">

                <map name="rules">
                        <newObject name="rejectByDefault" 
                                class="org.archive.crawler.deciderules.RejectDecideRule"/>

plus a special defining deciderule that emulates the DomainScope, the HostScope, or the PathScope.

1) The defining deciderule for DomainScope is (the only one using a special-purpose DecideRule):

<newObject name="acceptURIFromSeedDomains" class="dk.netarkivet.harvester.harvesting.OnNSDomainsDecideRule">
        <string name="decision">ACCEPT</string>
        <string name="surts-source-file">seeds.txt</string>
        <boolean name="seeds-as-surt-prefixes">false</boolean>
        <string name="surts-dump-file"/>
        <boolean name="also-check-via">false</boolean>
        <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>

2) The defining deciderule for HostScope is:

<newObject name="OnHostsRule" class="org.archive.crawler.deciderules.OnHostsDecideRule">
        <string name="decision">ACCEPT</string>
        <string name="surts-dump-file"/>
        <boolean name="also-check-via">false</boolean>
        <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>

3) The defining deciderule for PathScope is:

<newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
        <string name="decision">ACCEPT</string>
        <string name="surts-source-file"></string>
        <boolean name="seeds-as-surt-prefixes">true</boolean>
        <string name="surts-dump-file"></string>
        <boolean name="also-check-via">false</boolean>
        <boolean name="rebuild-on-reconfig">true</boolean>
</newObject>
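
To make the nesting explicit, here is a minimal sketch of how the header and a defining deciderule fit together, using the HostScope variant. It is assembled only from the fragments above. The closing tags are added here so that the fragment is well-formed on its own, and the sketch assumes that any further deciderules are placed inside the rules map, before the closing map tag.

```xml
<newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
    <boolean name="enabled">true</boolean>
    <string name="seedsfile">seeds.txt</string>
    <boolean name="reread-seeds-on-config">true</boolean>
    <!-- Rules are applied in order; the last non-PASS decision wins. -->
    <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
        <map name="rules">
            <!-- Start by rejecting everything; later rules may override this. -->
            <newObject name="rejectByDefault"
                    class="org.archive.crawler.deciderules.RejectDecideRule"/>
            <!-- Defining deciderule that emulates the old HostScope. -->
            <newObject name="OnHostsRule" class="org.archive.crawler.deciderules.OnHostsDecideRule">
                <string name="decision">ACCEPT</string>
                <string name="surts-dump-file"/>
                <boolean name="also-check-via">false</boolean>
                <boolean name="rebuild-on-reconfig">true</boolean>
            </newObject>
            <!-- Further deciderules (hops, path depth, regexps) would go here. -->
        </map>
    </newObject>
</newObject>
```

For a DomainScope or PathScope template, the OnHostsRule element would be swapped for the corresponding defining deciderule listed above.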

After the header and the defining deciderule, we add a deciderule corresponding to the 'hops_filter'. Note that the last two attributes of the old header, 'max-link-hops' and 'max-trans-hops', cease to be general scope attributes. Instead, 'max-trans-hops' becomes an attribute of the "acceptIfTranscluded" deciderule shown further below, and 'max-link-hops' becomes the 'max-hops' attribute of the new deciderule that replaces 'hops_filter'. The following

<integer name="max-link-hops">10</integer>
<newObject name="hops_filter" class="org.archive.crawler.filter.HopsFilter">
        <boolean name="enabled">true</boolean>
</newObject>

is then translated to the following deciderule

<newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
        <integer name="max-hops">10</integer>
</newObject>

Following this, we need to add a translation of the 'pathdepth' element, and the 'pathologicalpath' element, plus a translation of the 'transitiveFilter' element in the last part of the scope. The following

     <newObject name="pathdepth" class="org.archive.crawler.filter.PathDepthFilter">
                        <boolean name="enabled">true</boolean>
                        <integer name="max-path-depth">20</integer>
                        <boolean name="path-less-or-equal-return">false</boolean>
     </newObject>
     <newObject name="pathologicalpath" class="org.archive.crawler.filter.PathologicalPathFilter">
        <boolean name="enabled">true</boolean>
        <integer name="repetitions">3</integer>
     </newObject>

     <newObject name="transitiveFilter" class="org.archive.crawler.filter.TransclusionFilter">
                <boolean name="enabled">true</boolean>
                <integer name="max-speculative-hops">1</integer>
                <integer name="max-referral-hops">15</integer>
                <integer name="max-embed-hops">15</integer>
     </newObject>

is translated to

<newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
        <integer name="max-repetitions">3</integer>
</newObject>
<newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
        <integer name="max-trans-hops">5</integer>
        <integer name="max-speculative-hops">1</integer>
</newObject>
<newObject name="pathdepthfilter" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
        <integer name="max-path-depth">20</integer>
</newObject>


Note that the attributes 'max-referral-hops' and 'max-embed-hops' in the 'transitiveFilter' element have been merged into a single attribute, 'max-trans-hops', which is no longer an attribute of the scope itself, as it was in the old scopes.


Now you only need to convert all remaining URIRegExpFilter and URIListRegExpFilter elements to a corresponding DecideRule. The deciderule corresponding to URIRegExpFilter is MatchesRegExpDecideRule, and the deciderule corresponding to URIListRegExpFilter is MatchesListRegExpDecideRule. Converting the dr_dk element (a URIRegExpFilter)

<newObject name="dr_dk" class="org.archive.crawler.filter.URIRegExpFilter">
                        <boolean name="enabled">true</boolean>
                        <boolean name="if-match-return">true</boolean>
                        <string name="regexp">.*dr\.dk.*epg\.asp.*</string>
                    </newObject>

gives us:

<newObject name="dr_dk" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
        <string name="decision">REJECT</string>
        <string name="regexp">.*dr\.dk.*epg\.asp.*</string>
 </newObject>

Converting the globale_crawlertraps element (URIListRegExpFilter)

<newObject name="globale_crawlertraps" class="org.archive.crawler.filter.URIListRegExpFilter">
                        <boolean name="enabled">true</boolean>
                        <boolean name="if-match-return">true</boolean>
                        <string name="list-logic">OR</string>
                        <stringList name="regexp-list">
                            <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                            <string>.*core\.UserAdmin.*register\.UserSelfRegistration.*</string>
                            <string>.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*</string>
                            <string>.*act=calendar&amp;cal_id=.*</string>
                            .....
                            <string>.*calendar\.asp\?qMonth=.*</string>
                            <string>.*calendar\.php\?sid=.*</string>
                            <string>.*worldscinet\.com.*</string>
                            <string>.*www3\.interscience\.wiley\.com.*</string>
                            <string>.*www-gdz\.sub\.uni-goettingen\.de.*</string>
                        </stringList>
                    </newObject>

gives us

<newObject name="globale_crawlertraps" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
            <string name="decision">REJECT</string>
            <string name="list-logic">OR</string>
            <stringList name="regexp-list">
                <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                <string>.*core\.UserAdmin.*register\.UserSelfRegistration.*</string>
                <string>.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*</string>
                <string>.*act=calendar&amp;cal_id=.*</string>
                .....
                <string>.*calendar\.asp\?qMonth=.*</string>
                <string>.*calendar\.php\?sid=.*</string>
                <string>.*worldscinet\.com.*</string>
                <string>.*www3\.interscience\.wiley\.com.*</string>
                <string>.*www-gdz\.sub\.uni-goettingen\.de.*</string>
            </stringList>
</newObject>


Finally, we need to wrap up the sequence of deciderules and the scope itself. So we add

                </map> <!-- end rules -->
            </newObject> <!-- end decide-rules  -->
        </newObject> <!-- End DecidingScope -->
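
Putting the pieces together, the PathScope example from the beginning of this appendix would look roughly like the following after conversion. This is only a sketch assembled from the fragments above, in the order in which they were introduced. The "....." elision in the original regexp list is kept as an XML comment, and you should check the result against your own template before loading it into the database.

```xml
<newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
    <boolean name="enabled">true</boolean>
    <string name="seedsfile">seeds.txt</string>
    <boolean name="reread-seeds-on-config">true</boolean>
    <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
        <map name="rules">
            <newObject name="rejectByDefault"
                    class="org.archive.crawler.deciderules.RejectDecideRule"/>
            <!-- Defining deciderule emulating the old PathScope -->
            <newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
                <string name="decision">ACCEPT</string>
                <string name="surts-source-file"></string>
                <boolean name="seeds-as-surt-prefixes">true</boolean>
                <string name="surts-dump-file"></string>
                <boolean name="also-check-via">false</boolean>
                <boolean name="rebuild-on-reconfig">true</boolean>
            </newObject>
            <!-- Replaces hops_filter; max-hops comes from the old max-link-hops -->
            <newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule">
                <integer name="max-hops">10</integer>
            </newObject>
            <!-- Replaces pathologicalpath -->
            <newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule">
                <integer name="max-repetitions">3</integer>
            </newObject>
            <!-- Replaces transitiveFilter; max-trans-hops comes from the old header -->
            <newObject name="acceptIfTranscluded" class="org.archive.crawler.deciderules.TransclusionDecideRule">
                <integer name="max-trans-hops">5</integer>
                <integer name="max-speculative-hops">1</integer>
            </newObject>
            <!-- Replaces pathdepth -->
            <newObject name="pathdepthfilter" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule">
                <integer name="max-path-depth">20</integer>
            </newObject>
            <!-- Replaces the dr_dk URIRegExpFilter -->
            <newObject name="dr_dk" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
                <string name="decision">REJECT</string>
                <string name="regexp">.*dr\.dk.*epg\.asp.*</string>
            </newObject>
            <!-- Replaces the globale_crawlertraps URIListRegExpFilter -->
            <newObject name="globale_crawlertraps" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
                <string name="decision">REJECT</string>
                <string name="list-logic">OR</string>
                <stringList name="regexp-list">
                    <string>.*core\.UserAdmin.*core\.UserLogin.*</string>
                    <string>.*core\.UserAdmin.*register\.UserSelfRegistration.*</string>
                    <string>.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*</string>
                    <string>.*act=calendar&amp;cal_id=.*</string>
                    <!-- ..... (entries elided in the original list) -->
                    <string>.*calendar\.asp\?qMonth=.*</string>
                    <string>.*calendar\.php\?sid=.*</string>
                    <string>.*worldscinet\.com.*</string>
                    <string>.*www3\.interscience\.wiley\.com.*</string>
                    <string>.*www-gdz\.sub\.uni-goettingen\.de.*</string>
                </stringList>
            </newObject>
        </map> <!-- end rules -->
    </newObject> <!-- end decide-rules -->
</newObject> <!-- End DecidingScope -->
```

Because the DecideRuleSequence applies its rules in order and the last non-PASS decision wins, the REJECT rules for crawler traps are placed after the ACCEPT rules here so that they can override them.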
