It38CheckHarvestNotDeduplicated
Run a standard TEST1 and verify that no deduplication is performed, even though it is enabled in the Heritrix template used (default_order.xml). That the template enables it can be checked at http://kb-test-adm-001.kb.dk:8077/HarvestDefinition/Definitions-download-harvest-template.jsp
Verify that no order.xml for a harvested job contains the DeDuplicator element:
<newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
  <boolean name="enabled">true</boolean>
  <map name="filters">
  </map>
  <string name="index-location"/>
  <string name="matching-method">By URL</string>
  <boolean name="try-equivalent">true</boolean>
  <boolean name="change-content-size">false</boolean>
  <string name="mime-filter">^text/.*</string>
  <string name="filter-mode">Blacklist</string>
  <string name="analysis-mode">Timestamp</string>
  <string name="log-level">SEVERE</string>
  <string name="origin"/>
  <string name="origin-handling">Use index information</string>
  <boolean name="stats-per-host">true</boolean>
  <boolean name="use-sparse-range-filter">true</boolean>
</newObject>
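The order.xml check above could be automated along the following lines. This is only a minimal sketch: the function name and the two sample fragments are hypothetical, and it assumes that a DeDuplicator processor always appears as a newObject element whose name is "DeDuplicator" or whose class ends in ".DeDuplicator", as in the element shown above.

```python
import xml.etree.ElementTree as ET


def find_deduplicator_classes(order_xml_text):
    """Return the class attribute of every DeDuplicator newObject element.

    Assumption: the processor is declared as <newObject name="DeDuplicator" ...>
    or with a class name ending in ".DeDuplicator".
    """
    root = ET.fromstring(order_xml_text)
    hits = []
    for element in root.iter("newObject"):
        name = element.get("name", "")
        cls = element.get("class", "")
        if name == "DeDuplicator" or cls.endswith(".DeDuplicator"):
            hits.append(cls)
    return hits


# Simplified, hypothetical fragments; a real order.xml is much larger.
TEMPLATE_WITH_DEDUP = """<crawl-order>
  <map name="processors">
    <newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
      <boolean name="enabled">true</boolean>
    </newObject>
  </map>
</crawl-order>"""

JOB_WITHOUT_DEDUP = """<crawl-order>
  <map name="processors">
    <newObject name="HTTP" class="org.archive.crawler.fetcher.FetchHTTP"/>
  </map>
</crawl-order>"""
```

For this test, the template (default_order.xml) is expected to yield a non-empty list, while the order.xml of every harvested job should yield an empty list.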
Verify that the metadata ARC file does not contain a processors report entry for the deduplicator, as in the following report:
metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=1.14.3&harvestid=2&jobid=6 130.225.27.140 20090916113022 text/plain 1846
Processors report - 200909161130
  Job being crawled: forsider_plus_1niveau
  Number of Processors: 16
  NOTE: Some processors may not return a report!

Processor: org.archive.crawler.fetcher.FetchHTTP
  Function: Fetch HTTP URIs
  CrawlURIs handled: 140
  Recovery retries: 0

Processor: org.archive.crawler.extractor.ExtractorHTTP
  Function: Extracts URIs from HTTP response headers
  CrawlURIs handled: 140
  Links extracted: 0

Processor: org.archive.crawler.extractor.ExtractorHTML
  Function: Link extraction on HTML documents
  CrawlURIs handled: 51
  Links extracted: 443

Processor: org.archive.crawler.extractor.ExtractorCSS
  Function: Link extraction on Cascading Style Sheets (.css)
  CrawlURIs handled: 0
  Links extracted: 0

Processor: org.archive.crawler.extractor.ExtractorJS
  Function: Link extraction on JavaScript code
  CrawlURIs handled: 0
  Links extracted: 0

Processor: org.archive.crawler.extractor.ExtractorSWF
  Function: Link extraction on Shockwave Flash documents (.swf)
  CrawlURIs handled: 0
  Links extracted: 0

Processor: is.hi.bok.digest.DeDuplicator
  Function: Abort processing of duplicate records - Lookup by url in use
  Total handled: 88
  Duplicates found: 0 0.0%
  Bytes total: 6391852 (6.1 MB)
  Bytes discarded: 0 (0 0.0%
  New (no hits): 88
  Exact hits: 0
  Equivalent hits: 0
  Timestamp predicts: (Where exact URL existed in the index)
    Change correctly: 0
    Change falsly: 0
    Non-change correct:0
    Non-change falsly: 0
    Missing timpestamp:0
  [Host] [total] [duplicates] [bytes] [bytes discarded] [new] [exact] [equiv] [change correct] [change falsly] [non-change correct] [non-change falsly] [no timestamp]
  sejr.kb.dk 88 0 6391852 0 88 0 0 0 0 0 0 0
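The processors report check could be scripted roughly as follows. This is a sketch under stated assumptions, not part of the test suite: the function names and the shortened sample text are hypothetical, and it assumes that each processor is announced by the word "Processor:" followed by its class name, as in the report above. It works on the report text only and does not parse ARC record headers.

```python
import re

# "Processor:" followed by a class name. The word boundary and the colon keep
# the summary line "Number of Processors: 16" from matching.
PROCESSOR_PATTERN = re.compile(r"\bProcessor:\s*(\S+)")


def reported_processors(report_text):
    """Return the class names of all processors listed in the report text."""
    return PROCESSOR_PATTERN.findall(report_text)


def deduplicator_reported(report_text):
    """Return True if any listed processor class is named DeDuplicator."""
    return any(
        name.split(".")[-1] == "DeDuplicator"
        for name in reported_processors(report_text)
    )


# Shortened sample based on the report above.
SAMPLE_REPORT = """Processors report - 200909161130
  Job being crawled: forsider_plus_1niveau
  Number of Processors: 16
Processor: org.archive.crawler.fetcher.FetchHTTP
  Function: Fetch HTTP URIs
Processor: is.hi.bok.digest.DeDuplicator
  Function: Abort processing of duplicate records - Lookup by url in use
"""
```

For this test, deduplicator_reported should return False for the text of the job's metadata ARC file. For an uncompressed file, the text can be obtained with something like open(path, encoding="latin-1").read(), where path is a placeholder for the actual file location.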