
Global Crawler Traps

A crawler trap is any sequence of web pages that a crawler can follow blindly and endlessly without harvesting any new information. A common example is a calendar system with hyperlinks to the next or previous date. Crawler traps can be avoided by specifying, as regular expressions, the URLs that the crawler should ignore. In NetarchiveSuite, crawler traps can be specified either per domain or globally. This section describes the management of global crawler traps.
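
As an illustration, the sketch below uses Java's standard regular expressions to show how one such exclusion expression would match the calendar URLs from the example above. The pattern and URLs are invented for this illustration and are not part of NetarchiveSuite; they only demonstrate how a trap expression separates URLs to ignore from URLs to harvest.

import java.util.regex.Pattern;

public class CrawlerTrapExample {
    public static void main(String[] args) {
        // Hypothetical trap expression: any URL whose query string selects a calendar date.
        Pattern trap = Pattern.compile(".*calendar.*\\?date=.*");

        // The first two URLs are the kind a crawler could follow forever via
        // "previous day" / "next day" links; the third is ordinary content.
        String[] urls = {
            "http://www.example.org/calendar?date=2010-01-19",
            "http://www.example.org/calendar?date=2010-01-21",
            "http://www.example.org/articles/index.html"
        };
        for (String url : urls) {
            boolean ignored = trap.matcher(url).matches();
            System.out.println(url + " -> " + (ignored ? "ignored" : "harvested"));
        }
    }
}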

A list of crawler traps is simply a plain-text file containing one crawler-trap regular expression per line. Lists may be active or inactive. When NetarchiveSuite creates a new job for any harvest, the crawler traps from all active lists (excluding duplicates) are added to the crawl template for that job.
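
A small trap list file might, for instance, look like the following; the expressions are purely illustrative and should be adapted to the sites being harvested:

.*calendar.*\?date=.*
.*/cgi-bin/calendar\.pl\?.*
.*[&?]sessionid=[0-9a-f]+.*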

To upload a list of global traps, first click the "Edit" link, then fill in a name and a description for the list of crawler traps, together with the path to the file containing the crawler-trap expressions. You can also choose whether the list should initially be active or inactive. Click "Create" to upload the list.

A list may be made active or inactive by clicking the "Activate" and "Deactivate" buttons. Lists may also be viewed (via the "Retrieve" button), deleted, or edited. The "Edit" action allows a new version of the list to be uploaded.

A side effect of using global crawler trap lists is that the database will grow more rapidly, because the modified crawl template, including all the active crawler traps, is stored for every job.
