This page describes how to verify that harvested data has been deduplicated.

Click on the JobID for your finished snapshot harvest (or repeated selective harvest) in the Job status overview

Click on "Browse reports for jobs"

Click on the "processors-report" entry, e.g. "metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=1.14.4&harvestid=1&jobid=1"

Check that the processors-report contains a deduplicator section similar to the one below. The numbers will differ, but "Duplicates found" should be non-zero:

Total handled: 88
Duplicates found: 20 20.0%
Bytes total: 6391852 (6.1 MB)
Bytes discarded: 0 (0 B) 0.0%
New (no hits): 88
Exact hits: 0
Equivalent hits: 0
......
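
The manual check above could also be scripted. The following is a minimal sketch in Python, assuming the text of the processors-report has already been saved or copied into a string; the function names `duplicates_found` and `is_deduplicated` are hypothetical and not part of NetarchiveSuite. It only looks at the "Duplicates found" line shown in the sample above.

```python
import re


def duplicates_found(report_text: str) -> int:
    """Return the count from the 'Duplicates found' line of a report.

    Hypothetical helper: it assumes the line looks like the sample above,
    i.e. 'Duplicates found: <count> <percentage>'.
    """
    match = re.search(r"Duplicates found:\s*(\d+)", report_text)
    if match is None:
        raise ValueError("no 'Duplicates found' line in the report")
    return int(match.group(1))


def is_deduplicated(report_text: str) -> bool:
    """True when the report shows a non-zero number of duplicates."""
    return duplicates_found(report_text) > 0


# Excerpt of the sample report from this page.
sample = """Total handled: 88
Duplicates found: 20 20.0%
Bytes total: 6391852 (6.1 MB)
"""

print(is_deduplicated(sample))
```

A report without a "Duplicates found" line raises an error instead of returning False, so a missing deduplicator section is not mistaken for a harvest that merely found no duplicates.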

It38CheckHarvestDeduplicated (last edited 2012-07-03 12:36:55 by SoerenCarlsen)