How do you manage and validate your crawler traps?

When you add one or more regular expression ("regexp") crawler traps in the Global Crawler Traps screen of the NAS GUI, they are NOT validated.

If you add an invalid regexp string to the Global Crawler Trap list by mistake, it can stop all harvesting activity in the system. You will typically see a Heritrix log error message like this in all harvesting logs: "2011-03-01T11:39:10.400Z -5 - http://www.sf.dk/Default.aspx - - no-type #049 - - - err=java.util.regex.PatternSyntaxException".

The only places where your regexps are validated before they are inserted into NAS are the "Edit Harvest Templates" menu and "Crawler traps" in the "Edit Domain" screen.

Netarkivet.dk :

I have created a feature request here: https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=2140&group_id=7&atid=108

For now, I recommend that you check your regexp strings on your test system before you upload them to the production system.
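A quick way to do such a check outside NAS is to try compiling each string with java.util.regex, the package named in the PatternSyntaxException above. The sketch below is only an illustration and is not part of NAS or Heritrix: the class name CrawlerTrapCheck, the method findInvalid and the sample trap strings are all made up for this example, and it assumes that your crawler traps are compiled with the same java.util.regex engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of NAS or Heritrix.
final class CrawlerTrapCheck {

    /** Returns the entries that java.util.regex cannot compile. */
    static List<String> findInvalid(List<String> traps) {
        List<String> invalid = new ArrayList<>();
        for (String trap : traps) {
            try {
                Pattern.compile(trap);
            } catch (PatternSyntaxException e) {
                // Print the reason so the entry can be corrected.
                System.err.println("Invalid regexp: " + trap
                        + " (" + e.getDescription() + ")");
                invalid.add(trap);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        // Made-up sample entries; replace with your own crawler traps.
        List<String> traps = Arrays.asList(
                ".*/calendar/.*",       // compiles
                "*.jpg",                // dangling '*' at the start
                ".*[unclosed.*");       // character class never closed
        List<String> invalid = findInvalid(traps);
        System.out.println(invalid.size() + " invalid of " + traps.size());
    }
}
```

This only catches syntax errors. A regexp that compiles can still match more or fewer URLs than you intended, so a test harvest is still worth doing.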

You can also check your regexp strings here, for example: http://www.javaregex.com/test.html

There is more about regexps here: http://en.wikipedia.org/wiki/Regular_expression

How do you manage and validate your crawler traps today?

Best regards Tue

Validation of crawler traps or regular expressions (last edited 2011-04-13 11:44:11 by TueLarsen)