Differences between revisions 1 and 8 (spanning 7 versions)
Revision 1 as of 2009-10-02 09:14:06
Size: 3084
Editor: TueLarsen
Comment:
Revision 8 as of 2009-10-05 13:13:20
Size: 3087
Editor: TueLarsen
Comment:
Line 1: Line 1:
Describe It38CheckHarvestNotDeduplicated here.
Line 3: Line 2:
Run a standard TEST1 and verify that no deduplication is performed, even though it is enabled in the Heritrix template used (default_order.xml). The template setting can be checked at http://kb-test-adm-001.kb.dk:8077/HarvestDefinition/Definitions-download-harvest-template.jsp
Start a web browser, e.g. with $ mozilla. Note that the browser must be started on the same machine that RunNetarchiveSuite.sh runs on.
Line 5: Line 4:
Verify that no order.xml for a harvested job contains the DeDuplicator element:
Set up the browser to use a proxy on port 8070 and exclude localhost, e.g. in Mozilla:
Line 7: Line 6:
<newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
Choose in the Mozilla toolbar:
Edit->Preferences->Advanced->Proxies
Checkmark:
Manual Proxy Configuration
and add:
Proxy: localhost
Port: 8070
No Proxy for: localhost
Enter the following URL in the started browser: http://localhost:8074/HarvestDefinition
Line 9: Line 16:
<boolean name="enabled">true</boolean>
<map name="filters"> </map>
<string name="index-location"/>
<string name="matching-method">By URL</string>
<boolean name="try-equivalent">true</boolean>
<boolean name="change-content-size">false</boolean>
<string name="mime-filter">^text/.*</string>
<string name="filter-mode">Blacklist</string>
<string name="analysis-mode">Timestamp</string>
<string name="log-level">SEVERE</string>
<string name="origin"/>
<string name="origin-handling">Use index information</string>
<boolean name="stats-per-host">true</boolean>
Line 11: Line 17:
<boolean name="use-sparse-range-filter">true</boolean>
If you are a netarchive tester:
Line 13: Line 19:
</newObject>
Forward the access port and set the browser to use the locally forwarded port.
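The order.xml check described above (no DeDuplicator element in a harvested job's order.xml) could be automated along the following lines. This is only a minimal sketch: the function name and both sample documents are hypothetical, and the `crawl-order` root and the `processors` map are assumed wrappers for illustration, not taken from a real NetarchiveSuite template.

```python
import xml.etree.ElementTree as ET


def has_deduplicator(order_xml_text: str) -> bool:
    """Return True if the XML text contains a newObject element named DeDuplicator."""
    root = ET.fromstring(order_xml_text)
    for element in root.iter():
        # Strip a possible "{namespace}" prefix before comparing the tag name.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "newObject" and element.get("name") == "DeDuplicator":
            return True
    return False


# Hypothetical, heavily trimmed order.xml fragments used only for illustration.
SAMPLE_WITH_DEDUP = """<crawl-order>
  <map name="processors">
    <newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
      <boolean name="enabled">true</boolean>
    </newObject>
  </map>
</crawl-order>"""

SAMPLE_WITHOUT_DEDUP = """<crawl-order>
  <map name="processors">
    <newObject name="HTTP" class="org.archive.crawler.fetcher.FetchHTTP">
      <boolean name="enabled">true</boolean>
    </newObject>
  </map>
</crawl-order>"""
```

For this test the expected result is that `has_deduplicator` returns False for the order.xml of every harvested job.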
Line 15: Line 21:
Verify that the metadata ARC file does not contain a processor report for the deduplicator, as in the following report:
Do the following on kb-prod-udv-001.kb.dk as a test user:
Line 17: Line 23:
metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=1.14.3&harvestid=2&jobid=6 130.225.27.140 20090916113022 text/plain 1846
Processors report - 200909161130
Write the following in a prompt:
<verbatim>
ssh -g -N -L$PORT:kb-test-acs-001.kb.dk:$PORT kb-test-acs-001.kb.dk
</verbatim>
Note that the shell will hang after this command.
The ssh must be killed after the test (by <CTRL>C)
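Before configuring the browser, it can help to confirm that the forwarded port actually accepts connections. The following is a small sketch using only the Python standard library; the function name `port_is_open` is hypothetical, and the host and port to pass in are whatever you used in the ssh command above (the test port is not filled in here).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Usage (assumption: test_port holds the port you forwarded with ssh -L):
# print(port_is_open("kb-prod-udv-001.kb.dk", test_port))
```

A False result while the ssh command is still running suggests the tunnel was not set up on the port the browser is about to use.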
Line 19: Line 30:
Job being crawled: forsider_plus_1niveau
Number of Processors: 16
NOTE: Some processors may not return a report!
Processor: org.archive.crawler.fetcher.FetchHTTP
Set the browser (Mozilla, Internet Explorer or <nop>FireFox) to use the locally forwarded port.
Line 22: Line 32:
Function: Fetch HTTP URIs
CrawlURIs handled: 140
Recovery retries: 0
Processor: org.archive.crawler.extractor.ExtractorHTTP
Mozilla: Go to =preferences -> advanced.proxies= and activate 'manual proxy-settings'. Set 'HTTP-proxy:' to "kb-prod-udv-001.kb.dk" and the corresponding 'Port:' to the test port (807?). Set 'No proxy for' to "kb-test-adm-001.kb.dk".
Line 25: Line 34:
Function: Extracts URIs from HTTP response headers
CrawlURIs handled: 140
Links extracted: 0
Processor: org.archive.crawler.extractor.ExtractorHTML
Internet Explorer: Go to =Tools -> Internet options=, choose =Connections= and click =LAN settings=. Mark "Use proxy with …", insert "kb-prod-udv-001.kb.dk" in 'Address' and set 'Port:' to the test port (807?). Click 'Advanced' and insert =kb-test-adm-001.kb.dk= under 'Exception' ('Do not use proxy for …').
Line 28: Line 36:
Function: Link extraction on HTML documents
CrawlURIs handled: 51
Links extracted: 443
Processor: org.archive.crawler.extractor.ExtractorCSS
<nop>FireFox 2: Go to =General->Settings=, choose =Connection= and click =Settings=. Mark 'Manual proxy configuration:', insert "kb-prod-udv-001.kb.dk" in 'HTTP Proxy' and set 'Port:' to the test port (807?). Mark ‘Use for all protocols’ and insert the text “=kb-test-adm-001.kb.dk=” in 'No proxy for'.
Line 31: Line 38:
Function: Link extraction on Cascading Style Sheets (.css)
CrawlURIs handled: 0
Links extracted: 0
Processor: org.archive.crawler.extractor.ExtractorJS
<nop>FireFox 3: Go to =Tools->Settings=, choose =Advanced->Network= and click =Settings...=. Mark 'Manual proxy configuration:', insert "kb-prod-udv-001.kb.dk" in 'HTTP Proxy' and set 'Port:' to the test port (807?). Mark ‘Use this proxy for all protocols’ and insert the text “=kb-test-adm-001.kb.dk, kb-prod-udv-001.kb.dk=” in 'No proxy for'.
Line 34: Line 40:
Function: Link extraction on JavaScript code
CrawlURIs handled: 0
Links extracted: 0
 * Go to http://$GUIadminserver:$http-port/HarvestDefinition/
  . where GUIadminserver and http-port are specified in the deploy configuration file under the application named dk.netarkivet.common.webinterface.GUIApplication
  . In the one-machine setup (deploy_example_one_machine.xml) the link will be: http://localhost:8074
Line 36: Line 44:
Processor: org.archive.crawler.extractor.ExtractorSWF
Click on the JobID for your finished snapshot harvest in the Job status overview
Line 38: Line 46:
Function: Link extraction on Shockwave Flash documents (.swf)
CrawlURIs handled: 0
Links extracted: 0
Processor: is.hi.bok.digest.DeDuplicator
Click on "Browse reports for jobs"
Line 41: Line 48:
Function: Abort processing of duplicate records
- Lookup by url in use
Total handled: 88
Duplicates found: 0 0.0%
Bytes total: 6391852 (6.1 MB)
Bytes discarded: 0 (0 0.0%
New (no hits): 88
Exact hits: 0
Equivalent hits: 0
Timestamp predicts: (Where exact URL existed in the index)
Change correctly: 0
Change falsly: 0
Non-change correct:0
Non-change falsly: 0
Missing timpestamp:0
[Host] [total] [duplicates] [bytes] [bytes discarded] [new] [exact] [equiv] [change correct] [change falsly] [non-change correct] [non-change falsly] [no timestamp]
sejr.kb.dk 88 0 6391852 0 88 0 0 0 0 0 0 0
Click on the "processors-report" e.g. "metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=1.14.3&harvestid=1&jobid=1"

Check that there is no deduplicator processors-report like this one:

{{{
Total handled: 88
Duplicates found: 0 0.0%
Bytes total: 6391852 (6.1 MB)
Bytes discarded: 0 (0 0.0%
New (no hits): 88
Exact hits: 0
Equivalent hits: 0
......
}}}
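The processors-report check could also be scripted once the report text has been saved from the browser. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, and it assumes each processor section starts with a line beginning with "Processor:", as in the report excerpts on this page.

```python
def has_deduplicator_report(report_text: str) -> bool:
    """Return True if the processors report has a section for a DeDuplicator processor."""
    for line in report_text.splitlines():
        stripped = line.strip()
        # Each section is assumed to start with "Processor: <class name>".
        if stripped.startswith("Processor:") and "DeDuplicator" in stripped:
            return True
    return False


# Shortened report samples built from lines shown on this page.
REPORT_WITH_DEDUP = """Processor: org.archive.crawler.fetcher.FetchHTTP
Function: Fetch HTTP URIs
Processor: is.hi.bok.digest.DeDuplicator
Function: Abort processing of duplicate records
Total handled: 88
"""

REPORT_WITHOUT_DEDUP = """Processor: org.archive.crawler.fetcher.FetchHTTP
Function: Fetch HTTP URIs
Processor: org.archive.crawler.extractor.ExtractorHTTP
Function: Extracts URIs from HTTP response headers
"""
```

The test passes when `has_deduplicator_report` returns False for the processors-report of the finished job.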


It38CheckHarvestNotDeduplicated (last edited 2010-08-16 10:24:54 by localhost)