Differences between revisions 1 and 18 (spanning 17 versions)
Revision 1 as of 2009-10-28 09:11:15
Size: 4903
Editor: TueLarsen
Comment:
Revision 18 as of 2013-05-03 12:41:01
Size: 4820
Comment:
Line 1: Line 1:
---+ Define and run an event harvest
<verbatim>
This page describes how to define and run an event harvest.
It also test that seeds lists are created from first and second level definition of the domain names.
</verbatim>
'''Define and run an event harvest'''
Line 7: Line 3:
---++ Do following in a browser:
Start program
   * Go to =http://kb-test-adm-001.kb.dk:807?/HarvestDefinition/= (where '807?' is the port number)
This page describes how to define and run an event harvest. It also test that seeds lists are created from first and second level definition of the domain names.

Do following in a browser: Start Program

 * Go to http://$GUIadminserver:$http-port/HarvestDefinition/
  . where GUIadminserver and http-port are specified in the deploy configuration file under the application named dk.netarkivet.common.webinterface.GUIApplication
  . In the one-machine setup (deploy_example_one_machine.xml ) the link will be : http://localhost:8074
Line 11: Line 12:
   * Click 'Definitions'->'Selective Harvests' in the left menu
   * Click 'Create new harvestdefinition' in the bottom of the main window
   * Fill in the Harvest name and note the name for later use (from now referred as <eh. name>)
   * Choose "Once_an_hour" in the drop down list for 'Schedule'
   * Click Save (DO NOT CLICK ACTIVATE YET)

* Click 'Definitions'->'Selective Harvests' in the left menu
 * Click 'Create new harvestdefinition' in the bottom of the main window
* Fill in the Harvest name and note the name for later use (from now referred as <eh. name>)
 * Choose '''Once_an_hour''' in the drop down list for 'Schedule'
 * Click Save (DO NOT CLICK ACTIVATE YET)
Line 17: Line 20:
   * Click 'Edit' in column 6 on the line with the <eh. name>
   * Click 'Add seeds' at the bottom of the main page
   * Choose "forsider" in the drop-down list for 'Harvest template'
   * Write domain list from 'Seed list 1' given below (you can cut and paste from this page)
   * Click 'Insert'
   * Click 'Add seeds' again
   * Choose "forsider_plus_2niveauer" in the drop-down list for 'Harvest template'
   * Write domain list from 'Seed list 2' given below (you can cut and paste from this page)
   * Click 'Insert'
   * Click 'Save'

 * Click 'Edit' in column 6 on the line with the <eh. name>
 * Write domain list from 'Seed list 1' given below to a file on your desktop e.g. notepad)
 * Click 'Add seeds from a file' at the bottom of the main page
 * Click 'Browse" and pick up the just created file with seeds
 * Choose '''frontpages''' in the drop-down list for 'Harvest template'
 * Click 'Insert'
 * Now click 'Add seeds'
 * Choose '''frontpages_plus_2levels''' in the drop-down list for 'Harvest template'
 * Write domain list from 'Seed list 2' given below (you can cut and paste from this page)
 * Click 'Insert'
 * Click 'Save'
Line 28: Line 34:
   * For each of the domains =raeder.dk=, =statsbiblioteket.dk=, =netarkivet.dk= do:
      * Click 'Definitions'->'Find Domain(s)'
      * Search for domain by writing its name as text and click 'Search'
      * Check that there exists a configuration with the name "<eh. name>_forsider__"
      * Check that there exists a seed list with the name "<eh. name>_forsider__"
      * Click 'Edit' in the line with seed list "<eh. name>_forsider__"
      * Check that the seed list shown corresponds to the seed list for the domain (see below)
(Note that after fixing NAS-2165, you have to click on Show unused configurations/seedlists show all)

* For each of the domains =raeder.dk=, =statsbiblioteket.dk=, =netarkivet.dk= do:
  * Click 'Definitions'->'Find Domain(s)'
  * Search for domain by writing its name as text and click 'Search'
  * Check that there exists a configuration with the name "<eh. name>_frontpages__"
  * Check that there exists a seed list with the name "<eh. name>_frontpages__"
  * Click 'Edit' in the line with seed list "<eh. name>_frontpages__"
  * Check that the seed list shown corresponds to the seed list for the domain (see below)
Line 36: Line 45:
   * For the domains =kaarefc.dk=, =netarkivet.dk= do:
     * Click 'Definitions'->'Find Domain(s)'
      * Search for =netarkivet.dk= by writing this text and click 'Search'
      * Check that there exists a configuration with name "<eh. name>_forsider_plus_2niveauer__"
      * Check that there exists a seed list with the name "<eh. name>_forsider_plus_2niveauer__"
      * Click 'Edit' in the line with seed list "<eh. name>_forsider_plus_2niveauer__"
      * Check that the seed list shown corresponds to the seed list for the domain (see below)
(Note that after fixing NAS-2165, you have to click on Show unused configurations/seedlists show all)

* For the domains =kaarefc.dk=, =netarkivet.dk= do:
  * Click 'Definitions'->'Find Domain(s)'
  * Search for =netarkivet.dk= by writing this text and click ‘Search’
  * Check that there exists a configuration with name "<eh. name>_frontpages_plus_2levels__"
  * Check that there exists a seed list with the name "<eh. name>_frontpages_plus_2levels__"
  * Click 'Edit' in the line with seed list "<eh. name>_frontpages_plus_2levels__"
  * Check that the seed list shown corresponds to the seed list for the domain (see below)
Line 44: Line 56:
   * Click 'Definitions'->'Selective Harvests' in the left menu
   * Click 'Activate' in column 5 on the line with the <eh. name>

Check harvest status of the event harvest

   * Click 'Harvest status'->'All Jobs' in the left menu
   * Select "All" in "Only display job status" to the right of the menu
   * Click the "Show" button until the <eh. name> appears in a new job line (after approximately a minute)
   * Check that two jobs appear and that they both have Harvest name <eh. name>
---++++
Seed list 1 (Harvest template "forsider"):
<verbatim>
http://netarkivet.dk/index-da.php

* Click 'Definitions'->'Selective Harvests' in the left menu
 * Click 'Activate' in column 5 on the line with the <eh. name>

Check harvest status of the event harvest using menu "All Jobs"

* Click 'Harvest status'->'All Jobs' in the left menu
 * Select "All" in "Only display job status" to the right of the menu
 * Click the "Show" button until the <eh. name> appears in a new job line (after approximately a minute)
 * Check that two jobs appear and that they both have Harvest name <eh. name>
 * Check in the menu "Running jobs" that the jobs appear and that you can reach the Heritrix GUI
  . by clicking on the host link and logging in with the login/password "admin"/"adminPassword"; close the window again afterwards.

Seed list 1 (Harvest template "frontpages"):

{{{

http://netarkivet.dk/adgang-for-forskere/
Line 55: Line 74:
http://netarkivet.dk/pilot-index-da.php
http://netarkivet.dk/pilot-index-en.php
http://netarkivet.dk/fase2index-da.php
http://netarkivet.dk/fase2index-en.php
Line 60: Line 75:
http://sb-test-net-001.statsbiblioteket.dk/website/testsite/clock.php
</verbatim>
http://kb-prod-udv-001.kb.dk/netarchivesuite/clock.php
}}}
Seed list 2 (Harvest template "frontpages_plus_2levels"):
Line 63: Line 79:
---++++Seed list 2 (Harvest template "forsider_plus_2niveauer"):
<verbatim>
{{{
Line 66: Line 81:
http://www.netarkivet.dk/website/sources/index-da.php
Line 70: Line 84:
</verbatim> }}}
Seed list "<eh. name>_frontpages__" for domain =raeder.dk= __"
Line 72: Line 87:
---++++Seed list "<eh. name>_forsider__" for domain =raeder.dk=
<verbatim>
{{{
Line 75: Line 89:
</verbatim> }}}
Seed list "<eh. name>_frontpages
Line 77: Line 92:
---++++Seed list "<eh. name>_forsider__" for domain =statsbiblioteket.dk=
<verbatim>
http://sb-test-net-001.statsbiblioteket.dk/website/testsite/clock.php
</verbatim>
{{{
http://kb-prod-udv-001.kb.dk/netarchivesuite/clock.php
}}}
Seed list "<eh. name>_frontpages__" for domain =netarkivet.dk=
Line 82: Line 97:
---++++Seed list "<eh. name>_forsider__" for domain =netarkivet.dk=
<verbatim>
http://netarkivet.dk/fase2index-en.php
http://netarkivet.dk/fase2index-da.php
http://netarkivet.dk/pilot-index-en.php
http://netarkivet.dk/pilot-index-da.php
{{{
Line 89: Line 99:
http://netarkivet.dk/index-da.php
</verbatim>
http://netarkivet.dk/adgang-for-forskere/
}}}
Seed list "<eh. name>_frontpages_plus_2levels
Line 92: Line 103:
---++++Seed list "<eh. name>_forsider_plus_2niveauer__" for domain =netarkivet.dk=
<verbatim>
http://www.netarkivet.dk/website/sources/index-da.php
{{{
Line 96: Line 105:
</verbatim> }}}
Seed list "<eh. name>_frontpages_plus_2levels" for domain =kaarefc.dk=
Line 98: Line 108:
---++++Seed list "<eh. name>_forsider_plus_2niveauer" for domain =kaarefc.dk=
<verbatim>
{{{
Line 101: Line 110:
</verbatim> }}}

Define and run an event harvest

This page describes how to define and run an event harvest. It also tests that seed lists are created from first- and second-level definitions of the domain names.

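Do the following in a browser:

Start the program

  • Go to http://$GUIadminserver:$http-port/HarvestDefinition/
    • where GUIadminserver and http-port are specified in the deploy configuration file under the application named dk.netarkivet.common.webinterface.GUIApplication
    • In the one-machine setup (deploy_example_one_machine.xml) the link will be: http://localhost:8074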

Make a new selective (event) harvest definition with a name you can remember

  • Click 'Definitions'->'Selective Harvests' in the left menu

  • Click 'Create new harvestdefinition' at the bottom of the main window
  • Fill in the Harvest name and note the name for later use (from now on referred to as <eh. name>)

  • Choose Once_an_hour in the drop-down list for 'Schedule'

  • Click Save (DO NOT CLICK ACTIVATE YET)

Add seeds to the selective (event) harvest

  • Click 'Edit' in column 6 on the line with the <eh. name>

  • Write the domain list from 'Seed list 1' given below to a file on your desktop (e.g. using Notepad)
  • Click 'Add seeds from a file' at the bottom of the main page
  • Click 'Browse' and pick the seed file you just created
  • Choose frontpages in the drop-down list for 'Harvest template'

  • Click 'Insert'
  • Now click 'Add seeds'
  • Choose frontpages_plus_2levels in the drop-down list for 'Harvest template'

  • Write the domain list from 'Seed list 2' given below (you can cut and paste from this page)
  • Click 'Insert'
  • Click 'Save'
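
The seed file for the 'Add seeds from a file' step can also be produced with a short script instead of an editor. The following is a minimal sketch, assuming the upload expects plain text with one seed URL per line; the file name `seed_list_1.txt` and the function name are hypothetical.

```python
from pathlib import Path

# Seed list 1 as given on this page.
SEED_LIST_1 = [
    "http://netarkivet.dk/adgang-for-forskere/",
    "http://netarkivet.dk/index-en.php",
    "http://www.raeder.dk/",
    "http://kb-prod-udv-001.kb.dk/netarchivesuite/clock.php",
]


def write_seed_file(seeds, path):
    """Write one seed URL per line (assumed format) and return the path."""
    target = Path(path)
    target.write_text("\n".join(seeds) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    # Hypothetical file name; pick this file with 'Browse' afterwards.
    print(write_seed_file(SEED_LIST_1, "seed_list_1.txt"))
```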

Check that the seed lists for the domains in Seed list 1 have changed correspondingly (Note that after fixing NAS-2165, you have to click on Show unused configurations/seedlists show all)

  • For each of the domains =raeder.dk=, =statsbiblioteket.dk=, =netarkivet.dk= do:
    • Click 'Definitions'->'Find Domain(s)'

    • Search for domain by writing its name as text and click 'Search'
    • Check that there exists a configuration with the name "<eh. name>_frontpages__"
    • Check that there exists a seed list with the name "<eh. name>_frontpages__"
    • Click 'Edit' in the line with seed list "<eh. name>_frontpages__"

    • Check that the seed list shown corresponds to the seed list for the domain (see below)

Check that the seed lists for the domains in Seed list 2 have changed correspondingly (Note that after fixing NAS-2165, you have to click on Show unused configurations/seedlists show all)

  • For the domains =kaarefc.dk=, =netarkivet.dk= do:
    • Click 'Definitions'->'Find Domain(s)'

    • Search for the domain by writing its name as text and click 'Search'
    • Check that there exists a configuration with the name "<eh. name>_frontpages_plus_2levels__"
    • Check that there exists a seed list with the name "<eh. name>_frontpages_plus_2levels__"
    • Click 'Edit' in the line with seed list "<eh. name>_frontpages_plus_2levels__"

    • Check that the seed list shown corresponds to the seed list for the domain (see below)
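
To keep track of which names to look for in the two checks above, the expected configuration and seed list names can be generated from the harvest name. This is only a sketch: it assumes the names follow the pattern "<eh. name>_<template>__" suggested by the checks, and the harvest name `my_event` is a hypothetical placeholder.

```python
# Template -> domains to check, as listed in the two checks on this page.
CHECKS = {
    "frontpages": ["raeder.dk", "statsbiblioteket.dk", "netarkivet.dk"],
    "frontpages_plus_2levels": ["kaarefc.dk", "netarkivet.dk"],
}


def expected_name(harvest_name, template):
    """Assumed naming pattern: '<eh. name>_<template>__'."""
    return f"{harvest_name}_{template}__"


def checklist(harvest_name):
    """Return (domain, expected configuration/seed list name) pairs."""
    return [
        (domain, expected_name(harvest_name, template))
        for template, domains in CHECKS.items()
        for domain in domains
    ]


if __name__ == "__main__":
    # "my_event" stands in for the <eh. name> noted earlier.
    for domain, name in checklist("my_event"):
        print(f"{domain}: {name}")
```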

Activate the harvest

  • Click 'Definitions'->'Selective Harvests' in the left menu

  • Click 'Activate' in column 5 on the line with the <eh. name>

Check harvest status of the event harvest using menu "All Jobs"

  • Click 'Harvest status'->'All Jobs' in the left menu

  • Select "All" in "Only display job status" to the right of the menu
  • Click the "Show" button repeatedly until <eh. name> appears in a new job line (after approximately one minute)

  • Check that two jobs appear and that both have the Harvest name <eh. name>

  • Check in the menu "Running jobs" that the jobs appear and that you can reach the Heritrix GUI
    • by clicking on the host link and logging in with the login/password "admin"/"adminPassword"; close the window again afterwards.
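
Clicking "Show" until the jobs appear is a polling loop with a timeout. If this check is ever scripted, it could be sketched as below. The `fetch_harvest_names` callable is purely hypothetical (this page does not describe any programmatic interface to the job list), and the timeout and interval values are assumptions.

```python
import time


def wait_for_jobs(fetch_harvest_names, harvest_name, expected=2,
                  timeout=120.0, interval=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until at least `expected` jobs carry `harvest_name`.

    `fetch_harvest_names` is a hypothetical callable returning the harvest
    name of every job currently listed. Raises TimeoutError on timeout.
    """
    deadline = clock() + timeout
    while True:
        count = sum(1 for name in fetch_harvest_names() if name == harvest_name)
        if count >= expected:
            return count
        if clock() >= deadline:
            raise TimeoutError(
                f"only {count} job(s) for {harvest_name!r} after {timeout} s"
            )
        sleep(interval)


if __name__ == "__main__":
    # Stand-in for the job list: the jobs show up on the third poll.
    polls = []

    def fake_fetch():
        polls.append(1)
        return ["my_event", "my_event"] if len(polls) >= 3 else []

    print(wait_for_jobs(fake_fetch, "my_event", interval=0.0))
```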

Seed list 1 (Harvest template "frontpages"):

http://netarkivet.dk/adgang-for-forskere/
http://netarkivet.dk/index-en.php
http://www.raeder.dk/
http://kb-prod-udv-001.kb.dk/netarchivesuite/clock.php

Seed list 2 (Harvest template "frontpages_plus_2levels"):

http://netarkivet.dk/index-en.php
http://www.kaarefc.dk/
http://www.kaarefc.dk/private/
http://www.kaarefc.dk/wop/

Seed list "<eh. name>_frontpages__" for domain =raeder.dk=

http://www.raeder.dk/

Seed list "<eh. name>_frontpages__" for domain =statsbiblioteket.dk=

http://kb-prod-udv-001.kb.dk/netarchivesuite/clock.php

Seed list "<eh. name>_frontpages__" for domain =netarkivet.dk=

http://netarkivet.dk/index-en.php
http://netarkivet.dk/adgang-for-forskere/

Seed list "<eh. name>_frontpages_plus_2levels__" for domain =netarkivet.dk=

http://netarkivet.dk/index-en.php

Seed list "<eh. name>_frontpages_plus_2levels__" for domain =kaarefc.dk=

http://www.kaarefc.dk/
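
When comparing a per-domain seed list in the GUI with the lists on this page, it can help to filter an original seed list by domain. The sketch below does this by matching the host name suffix; treating a seed as belonging to a domain when its host equals the domain or ends with "." plus the domain is an assumption of this example, not a description of how NetarchiveSuite itself assigns seeds to domains.

```python
from urllib.parse import urlsplit

# Seed list 1 as given on this page.
SEED_LIST_1 = [
    "http://netarkivet.dk/adgang-for-forskere/",
    "http://netarkivet.dk/index-en.php",
    "http://www.raeder.dk/",
    "http://kb-prod-udv-001.kb.dk/netarchivesuite/clock.php",
]


def belongs_to(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def seeds_for_domain(seeds, domain):
    """Return the seeds whose host falls under `domain`, in input order."""
    return [seed for seed in seeds if belongs_to(seed, domain)]


if __name__ == "__main__":
    for domain in ("raeder.dk", "netarkivet.dk"):
        print(domain, seeds_for_domain(SEED_LIST_1, domain))
```

The order of the seeds shown in the GUI may differ from the input order, so compare the results as sets rather than as ordered lists.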

It11DefineEventHarvest (last edited 2013-05-03 12:41:01 by SoerenCarlsen)