Assignment Harvester 4: Global Crawler Traps

Global crawler traps are crawler traps that apply to every job started by a given instance of the Harvest Definition webapp. They are modelled as a list of resources (local files, and possibly local directories and web resources) from which crawler traps are loaded. The traps themselves are just regular expressions in a newline-separated text file. The list of global crawler traps should be initialised from settings and be modifiable from the GUI.
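To make the trap file format concrete, here is a minimal sketch of reading one such resource. The class name TrapFileReader and the method readTraps are hypothetical and not part of the existing code base. Skipping blank lines and compiling each line as a java.util.regex pattern are assumptions that go beyond what the text above requires.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Hypothetical helper for illustration only. */
final class TrapFileReader {
    private TrapFileReader() {
    }

    /**
     * Reads a newline-separated trap file, one regular expression per line.
     *
     * @throws IllegalArgumentException if a non-blank line is not a valid regexp
     */
    static List<String> readTraps(Path trapFile) throws IOException {
        List<String> traps = new ArrayList<>();
        int lineNumber = 0;
        for (String line : Files.readAllLines(trapFile)) {
            lineNumber++;
            if (line.trim().isEmpty()) {
                continue; // assumption: blank lines are not traps
            }
            try {
                Pattern.compile(line); // syntax check only; the pattern is not kept
            } catch (PatternSyntaxException e) {
                throw new IllegalArgumentException(
                        trapFile + ", line " + lineNumber
                                + ": not a valid regular expression", e);
            }
            traps.add(line);
        }
        return traps;
    }
}
```

The regular expressions are returned as plain Strings, so the caller is free to decide how they end up in a job's configuration.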

This assignment consists of three relatively small tasks.

  1. "Backend": a model representing a collection of globally defined crawler traps. This should be a singleton class GlobalCrawlerTraps containing a collection representing all the known global traps. Its constructor initialises the global crawler traps from settings.xml. It should have the methods {{{getAll()}}}, {{{remove()}}} and {{{add()}}}. The precise API will depend on how we structure the crawler traps, e.g. as a CrawlerTrap class, as a collection of Strings, or as a Map representing traps with a unique id mapped to the regexp String. The question of how much validation to apply is open. The possibilities are:

    1. No validation.
    2. Check that resources exist.
    3. Check that resources exist and contain a list of regular expressions.
    4. Configurable validation.
  Option 2 is preferred. An exception should be thrown on attempting to initialise or add an invalid resource.
  2. "Business Logic": modifications to the class Job such that global crawler traps are added to each job. They should be added by the constructor. The code currently used to add domain-specific crawler traps, Job.editOrderXML_crawlerTraps, should be extracted to a utility method and shared between global and domain-specific traps.

  3. "Front End": a single JSP page which lists the current traps, with a "delete" button next to each, and a text-input box for adding a path to a new trap.
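The backend task can be sketched as follows, assuming the simplest of the API options mentioned above (a collection of Strings, each a path to a trap resource) and the preferred validation level (check that the resource exists). The method names getInstance() and readResourcesFromSettings() are hypothetical. The latter is a placeholder that returns an empty list, since the settings.xml key for the trap list is not defined here. Only local files and directories are checked; web resources are left out.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the singleton model; not the final API. */
final class GlobalCrawlerTraps {
    private static GlobalCrawlerTraps instance;

    /** Paths of the trap resources, in the order they were added. */
    private final Set<String> resources = new LinkedHashSet<>();

    private GlobalCrawlerTraps(List<String> initialResources) {
        for (String resource : initialResources) {
            add(resource); // an invalid initial resource also throws
        }
    }

    static synchronized GlobalCrawlerTraps getInstance() {
        if (instance == null) {
            instance = new GlobalCrawlerTraps(readResourcesFromSettings());
        }
        return instance;
    }

    /** Placeholder (assumption): the real version would read settings.xml. */
    private static List<String> readResourcesFromSettings() {
        return Collections.emptyList();
    }

    /** Returns a copy, so callers cannot modify the model directly. */
    synchronized List<String> getAll() {
        return new ArrayList<>(resources);
    }

    /** Adds a resource after checking that it exists on the local file system. */
    synchronized void add(String resourcePath) {
        if (resourcePath == null || !new File(resourcePath).exists()) {
            throw new IllegalArgumentException(
                    "Crawler trap resource does not exist: " + resourcePath);
        }
        resources.add(resourcePath);
    }

    /** Returns true if the resource was known and has been removed. */
    synchronized boolean remove(String resourcePath) {
        return resources.remove(resourcePath);
    }
}
```

A JSP page along the lines of the front-end task would then only need getAll() to render the list, remove() behind each "delete" button, and add() behind the text-input box, reporting the IllegalArgumentException to the user when a path is invalid.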

Possible Enhancements

  1. GUI to include possibility to browse content of global traps. This is a relatively minor enhancement which can be added as time allows.
  2. Model to include inactive traps, GUI to include activate/deactivate options for traps. This would allow us, for example, to distribute NAS with a selection of inactive traps which end-users could choose to activate or not as they wish. The additional coding effort to include this feature is probably small enough to recommend its inclusion.
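As a rough indication of how small the second enhancement could be, the sketch below keeps an active flag per trap resource. The class name ActivatableTraps and its methods are hypothetical. It is shown as a standalone class for brevity, whereas in practice the flag would presumably live inside the GlobalCrawlerTraps model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of trap resources that can be switched on and off. */
final class ActivatableTraps {
    /** Resource path mapped to its active flag, in insertion order. */
    private final Map<String, Boolean> traps = new LinkedHashMap<>();

    void add(String resource, boolean active) {
        traps.put(resource, active);
    }

    void setActive(String resource, boolean active) {
        if (!traps.containsKey(resource)) {
            throw new IllegalArgumentException("Unknown trap resource: " + resource);
        }
        traps.put(resource, active);
    }

    /** All known resources, active or not (what the GUI would list). */
    List<String> getAll() {
        return new ArrayList<>(traps.keySet());
    }

    /** Only the active resources (what a new job would use). */
    List<String> getActive() {
        List<String> active = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : traps.entrySet()) {
            if (entry.getValue()) {
                active.add(entry.getKey());
            }
        }
        return active;
    }
}
```

A distribution could then ship with every trap added as inactive, leaving end-users to call setActive() from the GUI for the ones they want.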

Estimates

4 md (man-days)

AssignmentHarvester4 (last edited 2010-08-16 10:24:58 by localhost)