Domains

Creating Domains

create_domains.png

The [Create domain] page is used for creating new domains in the system. You can create a single domain or a list of domains, and you can also import domains from a file.

To create one or more domains, enter the domain names in the text box and press [Create].

To bulk create domains from a file, select the file on your local computer with [Browse] and press [Ingest]. The file must be a simple list of domain names, one per line. The file must be UTF-8 encoded if it contains special characters.
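A minimal sketch of preparing such an ingest file, written in Python for illustration. The function name and the sample domain names are hypothetical; the only requirements taken from this manual are one domain name per line and UTF-8 encoding.

```python
from pathlib import Path


def write_domain_list(path, domains):
    """Write one domain name per line as UTF-8, skipping blanks and duplicates.

    Returns the list of names actually written, in their original order.
    """
    seen = set()
    lines = []
    for name in domains:
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            lines.append(name)
    # Explicit encoding so that special characters are stored as UTF-8.
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Calling `write_domain_list("domains.txt", ["netarchive.dk", "kb.dk"])` would produce a two-line file that can be selected with [Browse].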

New domains get a default configuration when created (with the defaultorderxml template and a default maximum number of bytes). New domains also get a defaultseedlist when created. The front page is defined as www.domainname, e.g. 'www.netarchive.dk' for the domain 'netarchive.dk'.

Domains that already exist in the system are not recreated.
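The front page rule can be written as a one-line function. This is only an illustration of the rule stated above; the function name is hypothetical and it is not taken from the system's code.

```python
def default_front_page(domain: str) -> str:
    """Return the front page used as the default seed: 'www.' + domain name."""
    return "www." + domain
```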

Finding Domains

find_domains.png

[Find Domain(s)] is used to find domains existing in the system.

Enter a domain name in the box (e.g. kb.dk) and press [Search]. The search matches the complete text string.

The wildcard * can be used at the left and/or right end of the search string.

If there are several hits, a list of the domains found is shown. If there is only one hit, the search leads directly to the Domain page.

If a search for a specific domain gives no hits, you are asked whether you want to create the domain in the system. Accepting with [Yes] leads directly to the Domain page for the newly created domain.
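The difference between an exact search and a wildcard search can be sketched as follows. This Python example only illustrates the described behaviour; the function names are hypothetical, and how the system itself implements the search is not stated here.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a search string with * wildcards into a compiled regular expression.

    Everything except * is matched literally (dots are escaped).
    """
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


def find_domains(pattern, known_domains):
    """Return the known domains whose complete name matches the pattern."""
    regex = wildcard_to_regex(pattern)
    return [d for d in known_domains if regex.fullmatch(d)]
```

With the made-up domain list `["kb.dk", "www.kb.dk", "kb.dk.example"]`, the search string `kb.dk` matches only `kb.dk`, while `*kb.dk` also matches `www.kb.dk`.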

Editing Domains

domain_edit.png

‘Edit domain’ is an overview of a single domain where it is possible to edit the domain’s definition in the harvest system.

A free-text box is available for comments.

‘Alias of’: Here it can be stated that the domain is an alias of another domain, meaning that the two are identical in content and only one of them should be harvested. Domains marked as an alias will not be harvested in snapshot harvests. An alias is defined for one year at a time and then has to be renewed.

‘Configurations’: [New configuration] and [Edit] open a new page: ‘Enter/edit configuration’ (see below).

‘Seed lists’: [New seed list] and [Edit] open a new page: ‘Enter/edit seed list’ (see below).

‘Crawler traps’: [Show crawler traps] opens a new text box: ‘Crawler traps’ (see below).

[Show historical harvest information for …] opens a new page “Harvest history for domain....” (see ["User Manual 3.16/Harvest History"])

Editing configurations

configuration_edit.png

‘Enter/edit configuration’ is used to define a new configuration or to edit an existing one. A configuration contains information about which Harvest template and Seed lists are used (to select more than one Seed list, hold down CTRL).

When a new configuration is created it is given a name, which cannot be changed afterwards.

Furthermore, it is possible to choose between different Harvester templates and to set the maximum number of bytes to be harvested in each harvest of the configuration. On creation, the default number of bytes is chosen for each domain. A default maximum number of objects is also set, but it can be overridden.

Editing seed lists

seedlist_edit.png

‘Enter/edit seed list’ is used to define a new Seed list or to edit an existing one.

When a new Seed list is created it is given a name, which cannot be changed afterwards.

The ‘Seeds’ text box contains the list of seeds to be harvested. A seed can be left out by prefixing it with #, e.g. #http://www.kb.dk. The same prefix can be used for comments inside the seed list, e.g. '#this seed is important'.
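The effect of the # prefix can be sketched as a small filter. This is an illustration in Python under the assumption that seeds are listed one per line; the function name is hypothetical and the system's own parsing may differ in detail.

```python
def active_seeds(seed_text):
    """Return the seeds that would be harvested from the text of a seed list.

    Blank lines and lines starting with # (disabled seeds or comments)
    are skipped.
    """
    seeds = []
    for line in seed_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        seeds.append(line)
    return seeds
```

For a seed list containing `http://www.netarchive.dk`, `#http://www.kb.dk` and `#this seed is important`, only `http://www.netarchive.dk` remains active.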

Editing crawler traps

crawlertrap.png

A crawler trap is a path that the harvester follows blindly and that can in principle continue forever. A typical example is a calendar.

To avoid crawler traps on a domain, the administrator can state parts of URLs that should never be harvested (in any configuration). Matching URLs are omitted in all harvests of the domain and in other domains harvested in the same job. Be careful not to write statements so general that they could omit content on other domains; one precaution is to always include the domain name itself in the statement.

The string of text must be stated as a 'regular expression'.
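The following sketch shows why including the domain name matters. It uses Python's `re` module purely for illustration; the regular expression dialect and matching rules used by the harvester are not described here, and the `/calendar/` path and the example URLs are made up.

```python
import re

# Hypothetical trap expression limited to one domain (note the escaped dot).
SPECIFIC_TRAP = re.compile(r".*kb\.dk/calendar/.*")

# Hypothetical expression that is too general: it matches any URL
# containing "calendar", on any domain harvested in the same job.
GENERAL_TRAP = re.compile(r".*calendar.*")


def is_omitted(url, traps):
    """Return True if the whole URL matches at least one trap expression."""
    return any(trap.fullmatch(url) for trap in traps)
```

With `SPECIFIC_TRAP`, a URL such as `http://www.kb.dk/calendar/2011/05` is omitted while `http://www.example.org/calendar/2011` is still harvested; with `GENERAL_TRAP`, both would be omitted.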

Domain statistics

domain_statistics.png

The domain statistics page shows the number of subdomains for each unique top-level domain known in the system. IP addresses are counted separately.

The number in the “Number of subdomains” column is clickable and starts a search for all domains matching that top-level domain. This is only practical for top-level domains with a limited number of subdomains, since the matching domains are listed on a single page, and that page becomes very long if the system contains hundreds of thousands of domains.

Alias summary

alias_summary.png

The alias summary page gives an overview of the domains marked as aliases of other domains in the system. Both domain names are clickable and will open the domain page for the clicked domain.

The “Expires” column shows when each alias expires (12 months after it was marked). The mark is not removed from the database after 12 months, but the “Overview of Aliases” page shows the expired ones at the top.

To renew an alias for another 12 months, you currently have to open the domain page of the marked domain (the “Domain” column), select “renew alias” and press “Save”.
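The 12-month rule can be illustrated with a small date calculation. This Python sketch assumes that "12 months" means the same calendar date one year later and that an alias counts as expired from that date onwards; how the system actually computes the expiry is not stated here, and the function names are hypothetical.

```python
from datetime import date


def alias_expiry(marked: date) -> date:
    """Return the date 12 months after the alias was marked."""
    try:
        return marked.replace(year=marked.year + 1)
    except ValueError:
        # Marked on 29 February: the following year has no such date.
        return marked.replace(year=marked.year + 1, day=28)


def is_expired(marked: date, today: date) -> bool:
    """Assumption: the alias is expired from the expiry date onwards."""
    return today >= alias_expiry(marked)
```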

User Manual 3.16/Domains (last edited 2011-05-11 09:54:19 by MikisSethSorensen)