Revision 5 as of 2011-03-02 13:48:23
Editor: TueLarsen
How do you manage and validate your crawler traps?

When you add one or more regular expression ("regexp") crawler traps in the NAS GUI, either on the domain screen or in the Global Crawler Traps GUI, their syntax is NOT validated.

An invalid regexp string can stop all harvesting activity in the system if, for example, you add it by mistake to the Global Crawler Trap list. You will typically see this Heritrix log error message in all harvesting logs: "2011-03-01T11:39:10.400Z -5 - http://www.sf.dk/Default.aspx - - no-type #049 - - - err=java.util.regex.PatternSyntaxException".
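The exception named in that log line comes from Java's standard regexp compiler, so a single trap string can be checked the same way outside NAS. The following is only a minimal sketch: the class name, method name and sample trap strings are made up for illustration, and it assumes a trap is compiled as a plain java.util.regex pattern.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of NAS or Heritrix.
final class TrapSyntaxDemo {

    // Returns a description of the syntax error, or empty if the regexp compiles.
    static Optional<String> syntaxError(String regexp) {
        try {
            Pattern.compile(regexp);
            return Optional.empty();
        } catch (PatternSyntaxException e) {
            // getDescription() and getIndex() say what is wrong and where.
            return Optional.of(e.getDescription() + " at index " + e.getIndex());
        }
    }

    public static void main(String[] args) {
        // Made-up trap strings: the first compiles, the second has an unclosed group.
        System.out.println(syntaxError(".*/calendar/.*"));
        System.out.println(syntaxError(".*(unclosed.*"));
    }
}
```

A string that yields an error description here would raise the same PatternSyntaxException wherever it is compiled with java.util.regex.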

The only place where your regexps are validated before they are inserted into NAS is the "Edit Harvest Templates" menu, when you upload an order.xml containing regexp lists.

Netarkivet.dk:

I have created a feature request here: https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=2140&group_id=7&atid=108

For now, I recommend that you check your regexp strings on your test system before you upload them to the production system.

You can also check your regexp strings online, for example here: http://www.javaregex.com/test.html
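If you would rather check a whole list of traps locally before pasting it into the GUI, a small program along these lines could do it. This is a sketch under assumptions, not an existing NAS tool: the class and method names and the sample list are hypothetical, and it only tests whether each string compiles with java.util.regex, not whether it matches the URLs you intend.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical pre-upload check, not part of NAS or Heritrix.
final class CrawlerTrapListCheck {

    // Returns the entries of the list that do not compile as Java regexps.
    static List<String> findInvalid(List<String> traps) {
        List<String> invalid = new ArrayList<>();
        for (String trap : traps) {
            try {
                Pattern.compile(trap);
            } catch (PatternSyntaxException e) {
                invalid.add(trap);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        // Made-up trap list; replace with the strings you plan to add.
        List<String> traps = Arrays.asList(
                ".*/calendar/.*",
                ".*[0-9{4}.*",          // unclosed character class
                ".*\\?sessionid=.*");
        for (String bad : findInvalid(traps)) {
            System.out.println("Invalid regexp: " + bad);
        }
    }
}
```

An empty result only means the syntax is accepted; it is still worth trying the traps on the test system to see what they actually exclude.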

More about regexps can be found here: http://en.wikipedia.org/wiki/Regular_expression

How do you manage and validate your crawler traps today?

Best regards, Tue

Validation of crawler traps or regular expressions (last edited 2011-04-13 11:44:11 by TueLarsen)