Revision 12 as of 2010-08-16 10:24:54. Comment: converted to 1.6 markup
Start a web browser, e.g. with $ mozilla. Note that the browser must be started on the same machine that RunNetarchiveSuite.sh is run on.
Set up the browser to use a proxy on port 8070 and to exclude localhost. For example, in Mozilla:
Choose Edit->Preferences->Advanced->Proxies in the Mozilla toolbar. Check "Manual Proxy Configuration" and add: Proxy: localhost, Port: 8070, No Proxy for: localhost. Then enter the following URL in the started browser: http://localhost:8074/HarvestDefinition
If you are a netarchive tester:
Forward the access port and set the browser to use the local forwarded port:
Do the following on kb-prod-udv-001.kb.dk as a test user:
Enter the following at a prompt, with $PORT set to the test port: ssh -g -N -L$PORT:kb-test-acs-001.kb.dk:$PORT kb-test-acs-001.kb.dk
Note that the shell will hang after this command. The ssh process must be killed after the test (with <CTRL>C).
Set the browser (Mozilla, Internet Explorer or Firefox) to use the local forwarded port.
Mozilla: Go to "Preferences -> Advanced -> Proxies" and activate 'Manual proxy configuration'. Set 'HTTP Proxy:' to "kb-prod-udv-001.kb.dk" and the corresponding 'Port:' to the test port (807?). Set 'No proxy for' to "kb-test-adm-001.kb.dk".
Internet Explorer: Go to "Tools -> Internet options", choose "Connections" and click "LAN settings". Mark "Use proxy with …", insert "kb-prod-udv-001.kb.dk" in 'Address' and set 'Port:' to the test port (807?). Click 'Advanced' and insert "kb-test-adm-001.kb.dk" under 'Exception' ('Do not use proxy for …').
Firefox 2: Go to "General->Settings", choose "Connection" and click "Settings". Mark 'Manual proxy configuration:', insert "kb-prod-udv-001.kb.dk" in 'HTTP Proxy' and set 'Port:' to the test port (807?). Mark 'Use for all protocols' and insert "kb-test-adm-001.kb.dk" in 'No proxy for'.
Firefox 3: Go to "Tools->Settings", choose "Advanced->Network" and click "Settings...". Mark 'Manual proxy configuration:', insert "kb-prod-udv-001.kb.dk" in 'HTTP Proxy' and set 'Port:' to the test port (807?). Mark 'Use this proxy server for all protocols' and insert "kb-test-adm-001.kb.dk, kb-prod-udv-001.kb.dk" in 'No proxy for'.
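The port forward for testers can also be wrapped in a small script so that the ssh process is not left running after the test. This is only a sketch: the function names forward_cmd and run_forward are hypothetical helpers, not part of NetarchiveSuite, and the port must be supplied by the tester because the text only gives it as "807?".

```shell
#!/bin/sh
# Access server taken from the instructions above.
ACCESS_HOST="kb-test-acs-001.kb.dk"

# Print the ssh port-forward command for the given port (for inspection).
forward_cmd() {
    port="$1"
    printf 'ssh -g -N -L%s:%s:%s %s\n' "$port" "$ACCESS_HOST" "$port" "$ACCESS_HOST"
}

# Hypothetical helper: run the forward in the background, wait until the
# tester presses Enter, then kill the ssh process again.
run_forward() {
    port="$1"
    ssh -g -N -L"$port:$ACCESS_HOST:$port" "$ACCESS_HOST" &
    ssh_pid=$!
    # Make sure the tunnel is removed even if the script is interrupted.
    trap 'kill "$ssh_pid" 2>/dev/null' EXIT INT TERM
    printf 'Forwarding port %s (pid %s). Press Enter when the test is done.\n' "$port" "$ssh_pid"
    read -r _
    kill "$ssh_pid" 2>/dev/null
}
```

Calling forward_cmd "$PORT" only prints the command, which is useful for checking it before running it. run_forward "$PORT" replaces the manual <CTRL>C step, on the assumption that the tester runs it on kb-prod-udv-001.kb.dk as described.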
Go to http://$GUIadminserver:$http-port/HarvestDefinition/
- where GUIadminserver and http-port are specified in the deploy configuration file under the application named dk.netarkivet.common.webinterface.GUIApplication.
In the one-machine setup (deploy_example_one_machine.xml) the link will be: http://localhost:8074
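As an illustration of how the address is put together, the following sketch builds the URL from the two deploy settings. The function name gui_url is hypothetical, and the defaults are the one-machine values quoted above (localhost and 8074).

```shell
#!/bin/sh
# Hypothetical helper: build the HarvestDefinition URL from the GUI
# admin server and http port found in the deploy configuration file.
# Defaults are the one-machine values from the text.
gui_url() {
    host="${1:-localhost}"
    port="${2:-8074}"
    printf 'http://%s:%s/HarvestDefinition/\n' "$host" "$port"
}
```

If curl is available, curl --silent --output /dev/null --write-out '%{http_code}' "$(gui_url)" prints the HTTP status code and gives a quick check that the GUI is reachable before opening the browser.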
Click on the JobID for your finished snapshot harvest in the Job status overview.
Click on "Browse reports for jobs".
Click on the "processors-report", e.g. "metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=1.14.3&harvestid=1&jobid=1"
Check that the processors-report does not contain a deduplicator section like this one:
Total handled: 88
Duplicates found: 0 0.0%
Bytes total: 6391852 (6.1 MB)
Bytes discarded: 0 (0 0.0%
New (no hits): 88
Exact hits: 0
Equivalent hits: 0
......
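If the processors-report has been saved to a text file, the same check can be sketched with grep. The function name has_dedup_report is hypothetical. The two patterns are assumptions based on the sample report: the processor class is.hi.bok.digest.DeDuplicator and the "Duplicates found:" line. Other Heritrix or DeDuplicator versions may word the report differently.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the report given as a
# file argument, or on standard input, contains a deduplicator section.
# The patterns are assumptions taken from the sample report.
has_dedup_report() {
    grep -q -e 'is\.hi\.bok\.digest\.DeDuplicator' -e 'Duplicates found:' "$@"
}

# Print a verdict for this test: the harvest must NOT be deduplicated.
check_not_deduplicated() {
    if has_dedup_report "$@"; then
        echo "FAIL: deduplicator section found"
        return 1
    fi
    echo "OK: no deduplicator section"
}
```

For example, check_not_deduplicated processors-report.txt, where processors-report.txt is an assumed name for the saved report, prints the verdict and returns a non-zero status when a deduplicator section is present.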