= Appendix C: Migrate the Heritrix templates to NetarchiveSuite 3.6.0+ =

If you use the predefined templates with only minor changes, such as a changed email address and website information, the easiest way to migrate is to take the predefined templates from the harvestdefinitionbasedir/order_templates_dist directory of the !NetarchiveSuite binary distribution and enter your email address and website information in them again. This way you also get the minor updates to the templates:

 * The removal of obsolete attributes from some elements
 * The addition of new attributes to some elements

Then update the existing templates in your database with the modified ones, using the !HarvestTemplateApplication tool described in [[Configuration Manual 3.16#ManagingHeritrixHarvestTemplates|Configuration Manual - Appendix B: Managing Heritrix Harvest Templates]].

Note that some templates are no longer distributed with !NetarchiveSuite. If you want to keep using those, you need to follow the procedure described below.

If you have already put a lot of effort into making your own templates, you can update them by "only" upgrading the scope element in each template from a !DomainScope, a !HostScope, or a !PathScope to a !DecidingScope. Before we explain how to migrate these scopes to a !DecidingScope, you need to know something about their anatomy.

1) Header (includes the scope class and attributes):
{{{
true seeds.txt true 10 5
}}}

2) An !OrFilter element named "exclude-filter" containing a number of filters as components: a !HopsFilter, a !PathDepthFilter, a !PathologicalPathFilter, a URIRegExpFilter, a URIListRegExpFilter (a filter to avoid common crawler traps), and potentially other types of filters. Each of these filters has to be converted to a corresponding !DecideRule, as explained further down:
{{{
true true true true 20 false true 3 true true .*dr\.dk.*epg\.asp.* true true OR
.*core\.UserAdmin.*core\.UserLogin.*
.*core\.UserAdmin.*register\.UserSelfRegistration.*
.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*
.*act=calendar&cal_id=.*
.....
.*calendar\.asp\?qMonth=.*
.*calendar\.php\?sid=.*
.*worldscinet\.com.*
.*www3\.interscience\.wiley\.com.*
.*www-gdz\.sub\.uni-goettingen\.de.*
}}}

3) Additional filters. Here we have a "Force-accept-filter", an "additionalScopeFocus" filter, and a "transitiveFilter", of which only the transitiveFilter element needs to be converted. The other two elements are simply deleted.
{{{
true true true true All true 1 15 15
}}}

== How to convert from the former scopes to a decidingscope ==

Converting the header is easy. All headers have the form:
{{{
true seeds.txt true
}}}
plus a special defining deciderule that emulates the !DomainScope, the !HostScope, or the !PathScope.

1) The defining deciderule for !DomainScope (the only one that uses a special-purpose !DecideRule) is:
{{{
ACCEPT seeds.txt false false true
}}}

2) The defining deciderule for !HostScope is:
{{{
ACCEPT false true
}}}

3) The defining deciderule for !PathScope is:
{{{
ACCEPT true false true
}}}

After the header and the defining deciderule, we add a deciderule corresponding to the 'hops_filter'. Note that the last two attributes in the old header, 'max-link-hops' and 'max-trans-hops', cease to be general scope attributes. Instead, 'max-trans-hops' becomes an attribute of the "acceptIfTranscluded" deciderule, and 'max-link-hops' becomes an attribute of the new 'hops_filter' deciderule. The following
{{{
10 true
}}}
is then translated to the following deciderule:
{{{
10
}}}

Following this, we need to add a translation of the 'pathdepth' element and the 'pathologicalpath' element, plus a translation of the 'transitiveFilter' element from the last part of the scope. The following
{{{
true 20 false true 3 true 1 15 15
}}}
is translated to:
{{{
3 5 1 20
}}}
Note that the attributes 'max-referral-hops' and 'max-embed-hops' of the 'transitiveFilter' element have been merged into a single attribute, 'max-trans-hops', which is no longer an attribute of the scope itself, as it was in the old scopes.

----

Now you only need to convert all remaining URIRegExpFilter and URIListRegExpFilter elements to a corresponding !DecideRule. The deciderule corresponding to URIRegExpFilter is !MatchesRegExpDecideRule, and the deciderule corresponding to URIListRegExpFilter is !MatchesListRegExpDecideRule.

Converting the dr_dk element (a URIRegExpFilter)
{{{
true true .*dr\.dk.*epg\.asp.*
}}}
gives us:
{{{
REJECT .*dr\.dk.*epg\.asp.*
}}}

Converting the globale_crawlertraps element (a URIListRegExpFilter)
{{{
true true OR
.*core\.UserAdmin.*core\.UserLogin.*
.*core\.UserAdmin.*register\.UserSelfRegistration.*
.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*
.*act=calendar&cal_id=.*
.....
.*calendar\.asp\?qMonth=.*
.*calendar\.php\?sid=.*
.*worldscinet\.com.*
.*www3\.interscience\.wiley\.com.*
.*www-gdz\.sub\.uni-goettingen\.de.*
}}}
gives us:
{{{
REJECT OR
.*core\.UserAdmin.*core\.UserLogin.*
.*core\.UserAdmin.*register\.UserSelfRegistration.*
.*\/w\/index\.php\?title=Speci[ae]l:Recentchanges.*
.*act=calendar&cal_id=.*
.....
.*calendar\.asp\?qMonth=.*
.*calendar\.php\?sid=.*
.*worldscinet\.com.*
.*www3\.interscience\.wiley\.com.*
}}}

----

Finally, we need to close the sequence of deciderules and the scope itself. So we add:
{{{ }}}
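
The regular expressions themselves are carried over unchanged when a URIRegExpFilter or URIListRegExpFilter is converted, so a quick sanity check after migrating is to run a few test URIs against the patterns and confirm they would still be rejected. The following is only a minimal sketch in Python, not part of !NetarchiveSuite or Heritrix. It assumes whole-URI matching and OR logic for the list, and uses Python's `re` module as a stand-in for the Java regex engine, which may differ in corner cases. The sample URIs and all function names are hypothetical.

```python
import re

# Mapping stated in this appendix: old filter type -> replacement deciderule.
FILTER_TO_DECIDE_RULE = {
    "URIRegExpFilter": "MatchesRegExpDecideRule",
    "URIListRegExpFilter": "MatchesListRegExpDecideRule",
}

# Pattern copied from the dr_dk example.
DR_DK = r".*dr\.dk.*epg\.asp.*"

# A few patterns copied from the globale_crawlertraps example.
CRAWLERTRAPS = [
    r".*core\.UserAdmin.*core\.UserLogin.*",
    r".*calendar\.php\?sid=.*",
    r".*worldscinet\.com.*",
]


def rejected(uri, patterns):
    """Return True if the URI matches at least one pattern (OR logic).

    Assumption: a pattern must match the whole URI, hence re.fullmatch.
    """
    return any(re.fullmatch(p, uri) for p in patterns)


# Hypothetical sample URIs with the outcome we expect for each.
SAMPLES = {
    "http://www.dr.dk/tv/epg.asp?day=1": True,
    "http://example.org/calendar.php?sid=abc": True,
    "http://example.org/index.html": False,
}


def check_samples():
    """Evaluate every sample URI against all patterns."""
    patterns = [DR_DK] + CRAWLERTRAPS
    return {uri: rejected(uri, patterns) for uri in SAMPLES}


if __name__ == "__main__":
    for uri, is_rejected in check_samples().items():
        print(uri, "REJECT" if is_rejected else "not rejected")
```

Running the same sample URIs before and after the migration gives a cheap way to notice a pattern that was lost or mistyped during the conversion.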