This page describes how to verify that data is deduplicated.

1. Go to http://$GUIadminserver:$http-port/HarvestDefinition/
   In the one-machine setup (deploy_example_one_machine.xml) the link will be: http://localhost:8074
2. Click on the JobID for your finished snapshot harvest in the Job status overview.
3. Click on "Browse reports for jobs".
4. Click on the "processors-report", e.g. "metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=1.14.3&harvestid=1&jobid=1"
5. Check that the processors-report contains a deduplicator report like this one:

Total handled: 88
Duplicates found: 20 20.0%
Bytes total: 6391852 (6.1 MB)
Bytes discarded: 0 (0 B) 0.0%
New (no hits): 88
Exact hits: 0
Equivalent hits: 0
......
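
If the processors-report has been saved to a text file, the same check can be scripted. The following is a minimal sketch, not part of NetarchiveSuite: the function names `parse_dedup_report` and `has_dedup_section` are hypothetical, and it assumes the field labels appear exactly as in the sample above. It reads only the leading integer after each label and ignores the percentages and human-readable sizes.

```python
import re

# Field labels as they appear in the sample deduplicator report above.
FIELDS = (
    "Total handled",
    "Duplicates found",
    "Bytes total",
    "Bytes discarded",
    "New (no hits)",
    "Exact hits",
    "Equivalent hits",
)

# Match "<label>: <integer>" at the start of a line, allowing indentation.
_LINE = re.compile(
    r"^\s*(" + "|".join(re.escape(f) for f in FIELDS) + r"):\s*(\d+)",
    re.MULTILINE,
)


def parse_dedup_report(text):
    """Return a dict mapping each field label found to its leading integer."""
    return {m.group(1): int(m.group(2)) for m in _LINE.finditer(text)}


def has_dedup_section(stats):
    """Hypothetical check: the report counts as present if both the
    'Total handled' and 'Duplicates found' lines were found."""
    return "Total handled" in stats and "Duplicates found" in stats


# Abbreviated copy of the sample report from this page.
SAMPLE = """Total handled: 88
Duplicates found: 20 20.0%
Bytes total: 6391852 (6.1 MB)
New (no hits): 88
Exact hits: 0
Equivalent hits: 0
"""

if __name__ == "__main__":
    stats = parse_dedup_report(SAMPLE)
    print(has_dedup_section(stats), stats["Duplicates found"])
```

A report without any of these lines yields an empty dict, so `has_dedup_section` returns False, which suggests the deduplicator did not run for that job.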