Differences between revisions 2 and 3
Revision 2 as of 2011-03-02 13:19:25
Size: 1052
Editor: TueLarsen
Comment:
Revision 3 as of 2011-03-02 13:26:58
Size: 1154
Editor: TueLarsen
Comment:

How do you manage and validate your crawler traps?

When you add one or more regular expression ("regexp") crawler traps in the NAS GUI, either on the domain screen or in the Global Crawler Traps GUI, they are NOT checked for syntax errors.

An invalid regexp string can stop all harvesting activity in the system, for example if you add it to the Global Crawler Trap list by mistake. You will then typically see a Heritrix log error message like this one in all harvesting logs: "2011-03-01T11:39:10.400Z -5 - http://www.sf.dk/Default.aspx - - no-type #049 - - - err=java.util.regex.PatternSyntaxException"

The only place where your regexps are validated before they are inserted into NAS is the "Edit Harvest Templates" menu, when you upload an order.xml containing regexp lists.

I recommend that you check your regexp strings on your test system before you upload them to the production system.
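
Since the error reported by Heritrix is a java.util.regex.PatternSyntaxException, one way to do such a check locally is to try compiling each string with Java's own regex engine. The following is only a minimal sketch, not part of NAS: the class name CrawlerTrapCheck, its method names, and the sample trap strings are made up for illustration. It only tests syntax; it does not tell you whether a trap matches the URLs you intend.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of NAS or Heritrix.
public class CrawlerTrapCheck {

    /** Returns null if the regexp compiles, otherwise a description of the syntax error. */
    public static String syntaxError(String regexp) {
        try {
            Pattern.compile(regexp);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    /** Returns the regexps from the list that do not compile. */
    public static List<String> findInvalid(List<String> regexps) {
        List<String> invalid = new ArrayList<>();
        for (String regexp : regexps) {
            if (syntaxError(regexp) != null) {
                invalid.add(regexp);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        // Made-up sample traps: the first is valid, the other two have syntax errors.
        List<String> traps = Arrays.asList(
            ".*/calendar/.*",
            ".*(unclosed-group.*",
            ".*[a-z.*"
        );
        for (String trap : traps) {
            String error = syntaxError(trap);
            if (error == null) {
                System.out.println("OK:      " + trap);
            } else {
                System.out.println("INVALID: " + trap + " (" + error + ")");
            }
        }
    }
}
```

Any string reported as INVALID by a check like this should be corrected before it is added as a crawler trap.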

You can also check your regexp strings online, e.g. here: http://www.javaregex.com/test.html

More about regular expressions can be found here: http://en.wikipedia.org/wiki/Regular_expression

How do you manage and validate your crawler traps today?

Best regards, Tue

Validation of crawler traps or regular expressions (last edited 2011-04-13 11:44:11 by TueLarsen)