Global Crawler Traps

A crawler trap is any sequence of web pages that a crawler can blindly and endlessly follow without harvesting any new information. A common example is a calendar system with hyperlinks to the next and previous dates. Crawler traps can be avoided by specifying, as regular expressions, the URLs that the crawler should ignore. In NetarchiveSuite, crawler traps can be specified either per domain or globally. This section describes the management of global crawler trap lists.
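
As an illustration, a single regular expression can cover all the URLs generated by such a calendar. The following minimal Java sketch (the URL and the expression are hypothetical, not taken from an actual harvest) shows how one such expression matches a calendar URL:

    import java.util.regex.Pattern;

    public class CrawlerTrapMatch {
        public static void main(String[] args) {
            // Hypothetical trap expression for a calendar system that links
            // endlessly to the next and previous dates via a date parameter.
            String trapRegExp = ".*/calendar\\.php\\?date=.*";
            String url = "http://www.example.org/calendar.php?date=2010-08-16";
            // Prints "true": a URL matching the expression would be ignored.
            System.out.println(Pattern.matches(trapRegExp, url));
        }
    }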

A crawler trap list is simply a plain text file containing one crawler-trap regular expression per line. Lists may be active or inactive. When NetarchiveSuite creates a new job for any harvest, the crawler traps from all active lists (with duplicates removed) are added to the crawl template for that job.
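
A hypothetical example of such a file, with one regular expression per line (the expressions are only illustrative):

    .*/calendar\.php\?date=.*
    .*action=print.*
    .*[&?]sessionid=[0-9a-f]+.*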

GlobalCrawlerTraps_1.png

To upload a list of global crawler traps, click the [Edit] link and fill in a name and a description for the list, along with the path to the file containing the crawler trap expressions. You can also choose whether the list should initially be active or inactive. Click [Create] to upload the list.

GlobalCrawlerTraps_2.png

A list may be made active or inactive by clicking the [Activate] and [Deactivate] buttons. Lists may also be viewed (via the [Retrieve] button), deleted, or edited. Note that the retrieved version of a crawler trap list may differ from the originally uploaded file: duplicate lines are removed during upload, and the order of the lines is not preserved. The [Edit] action allows a new version of the list to be uploaded.
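
As an illustration of why this happens (this sketch is not the actual NetarchiveSuite implementation), storing the uploaded expressions in a set removes duplicates and gives no guarantee about preserving the original line order:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class TrapListDeduplication {
        public static void main(String[] args) {
            // Hypothetical uploaded list containing a duplicate expression.
            List<String> uploadedLines = List.of(
                    ".*/calendar\\.php\\?date=.*",
                    ".*action=print.*",
                    ".*/calendar\\.php\\?date=.*");

            // Putting the expressions into a set removes the duplicate, and a
            // HashSet makes no guarantee about iteration order, so a list
            // written back out from it may come back reordered.
            Set<String> traps = new HashSet<>(uploadedLines);
            traps.forEach(System.out::println); // prints 2 lines, order not guaranteed
        }
    }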

GlobalCrawlerTraps_3.png

A side effect of using global crawler trap lists is that the database will grow more rapidly, because the modified crawl template, including all the active crawler traps, is stored for every job.
