This page describes how to verify that data has been deduplicated.
Go to http://$GUIadminserver:$http-port/HarvestDefinition/
- where GUIadminserver and http-port are specified in the deploy configuration file under the application named dk.netarkivet.common.webinterface.GUIApplication
In the one-machine setup (deploy_example_one_machine.xml), the link will be: http://localhost:8074/HarvestDefinition/
Click on the JobID for your finished snapshot harvest in the Job status overview
Click on "Browse reports for jobs"
Click on the "processors-report" e.g. "metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=1.14.4&harvestid=1&jobid=1"
Check that the processors-report contains a deduplicator section like this one:
  Total handled: 88
  Duplicates found: 20 20.0%
  Bytes total: 6391852 (6.1 MB)
  Bytes discarded: 0 (0 0.0%
  New (no hits): 88
  Exact hits: 0
  Equivalent hits: 0
  ......
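
If you have saved the processors-report text to a file or a string, the check above can also be scripted. The following is a minimal Python sketch, not part of NetarchiveSuite: the function names parse_dedup_stats and deduplication_happened are hypothetical, the field labels are taken from the sample report above, and treating "Duplicates found" greater than zero as evidence of deduplication is an assumption.

```python
import re

# Field labels copied from the sample deduplicator report above.
FIELDS = (
    "Total handled",
    "Duplicates found",
    "Bytes total",
    "Bytes discarded",
    "New (no hits)",
    "Exact hits",
    "Equivalent hits",
)


def parse_dedup_stats(report_text):
    """Return a dict mapping each label found to the first integer after it.

    Works whether the report is kept on separate lines or flattened onto one.
    """
    stats = {}
    for label in FIELDS:
        match = re.search(re.escape(label) + r":\s*(\d+)", report_text)
        if match:
            stats[label] = int(match.group(1))
    return stats


def deduplication_happened(stats):
    """Assumption: at least one duplicate found means deduplication was active."""
    return stats.get("Duplicates found", 0) > 0


# Sample text taken from the report shown above.
sample = (
    "Total handled: 88 Duplicates found: 20 20.0% "
    "Bytes total: 6391852 (6.1 MB) Bytes discarded: 0 (0 0.0% "
    "New (no hits): 88 Exact hits: 0 Equivalent hits: 0"
)

stats = parse_dedup_stats(sample)
print(stats["Duplicates found"], "of", stats["Total handled"], "handled items were duplicates")
```

A report with no deduplicator section yields an empty dict, so deduplication_happened returns False in that case.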