
Current plugins

The current system (v. 3.2) contains the following pluggable parts:

  1. RemoteFile
  2. JMSConnection
  3. ArcRepositoryClient
  4. IndexClient
  5. DBSpecifics

They are described in some detail in the DeveloperManual; this page only collects ideas for future changes to them.
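All of the pluggable parts above share the same pattern: the settings file names a fully qualified class, which is then instantiated by reflection. The sketch below illustrates that pattern with hypothetical names (`PluginFactory`, `getInstance`); it is not the actual NetarchiveSuite factory, just a minimal illustration of how settings-driven plug-in loading typically works.

```java
// Minimal sketch of settings-driven plug-in loading (hypothetical names).
// In the real system the class name would be read from the settings file;
// here it is passed in directly to keep the example self-contained.
class PluginFactory {
    /** Instantiate a pluggable implementation by fully qualified class name. */
    static <T> T getInstance(Class<T> pluginType, String className) {
        try {
            Class<?> clazz = Class.forName(className);
            // Cast fails fast if the configured class has the wrong type.
            return pluginType.cast(clazz.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "Cannot load plugin '" + className + "'", e);
        }
    }
}
```

A caller would then do something like `PluginFactory.getInstance(RemoteFile.class, settingValue)`, so that swapping implementations only requires editing the settings file.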

Possible future plugins

There are always many places in a system that can be made pluggable (Emacs, for instance, is all plug-ins), but each plug-in in NetarchiveSuite requires a certain amount of work to set up, so we can only make so many. While we cannot promise to implement any of these, here are some places where we think plug-ins may be useful.

  1. Job partitioning. The current system for splitting harvest runs into multiple jobs is based on assumptions about harvest style and the composition of web sites that may no longer be valid. While there is a need to split harvest runs into smaller chunks once a certain size is reached, some installations may never reach that size. Thus there are two obvious alternative implementations of the job scheduling algorithm: never split jobs, or split jobs in a more comprehensible way.

  2. External harvester invocation. The use of Heritrix is currently hardcoded to the point where some changes to the order.xml are done on the scheduler side. If someone wants to use a different harvester or invoke Heritrix in a different way, the harvester-specific parts could be turned into a plug-in. This would also require that the template system in the harvester.datamodel package allow for templates other than the Heritrix order.xml.
  3. Harvesting metadata handling. Which files to collect after a harvest is currently hardcoded and very Heritrix-specific. Not only will this have to change if another harvester is used, but some people may also want to store the metadata differently. An obvious place for a plug-in.
  4. Domain differentiation. We currently use a simplistic domain name model of <domain>.<tld>. Not only are there countries where this model does not work at all, there are also frequently single domains that should be handled in a more complex way, because their subdomains are really separate entities, just as other domains are. Since different institutions will have different views on this problem, a plug-in would be helpful.

  5. Index server. The use of Lucene is currently hardcoded into the CrawlLogIndexCache, which is used for both deduplication and viewerproxy indexes. Other indexing systems or different uses of Lucene would be possible if this was pluggable.

  6. Batch job handling. Currently, batch job results are simply concatenated together, but there are many other ways this combination could happen. One interesting possibility is to use map-reduce algorithms, but there are other potentially useful approaches that could be placed in plug-ins.

  7. Bitarchive. While the current bitarchive is made to work with the most common setups (multiple machines possible, multiple bitarchives possible per machine, multiple directories useable per bitarchive), there are potentially other ways to store files that do not fit well with the bitarchive system. Rather than complicating the bitarchive system, it may make more sense to pull some part of it out into a plug-in.
  8. Viewerproxy command resolvers. The viewerproxy system has separate classes to handle various commands in a chain. It ought to be possible to add more resolvers to this chain as a plug-in.
  9. Viewerproxy missing-URL systems. One of the strengths of the viewerproxy system is that it can collect a list of URLs that are not in the archive. It is, however, very simplistic in its approach, and should not only be improved but also made pluggable.
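To make the first idea above concrete, a job-partitioning plug-in point could be expressed as a small strategy interface. The names here (`JobSplitter`, `NeverSplit`, `FixedSizeSplit`) are hypothetical, and the sketch simplifies domains to plain strings; the real data model is richer.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical strategy interface for the job-partitioning plug-in point. */
interface JobSplitter {
    /** Split a harvest run's domains into one or more job-sized chunks. */
    List<List<String>> split(List<String> domains);
}

/** The "never split" variant: every harvest run becomes exactly one job. */
class NeverSplit implements JobSplitter {
    @Override
    public List<List<String>> split(List<String> domains) {
        return Collections.singletonList(domains);
    }
}

/** A size-bounded variant: chunks of at most maxDomains domains each. */
class FixedSizeSplit implements JobSplitter {
    private final int maxDomains;

    FixedSizeSplit(int maxDomains) {
        this.maxDomains = maxDomains;
    }

    @Override
    public List<List<String>> split(List<String> domains) {
        List<List<String>> jobs = new java.util.ArrayList<>();
        for (int i = 0; i < domains.size(); i += maxDomains) {
            jobs.add(domains.subList(i, Math.min(i + maxDomains, domains.size())));
        }
        return jobs;
    }
}
```

The scheduler would only depend on `JobSplitter`, so installations that never reach the size limit could simply configure `NeverSplit`.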

Potential changes to the plug-in system

The plug-in system as it is now is not optimal for creating new plug-ins.

In particular, it is impractical to specify different settings for different plug-ins. The XML Schemas work well enough, in that they can specify which entries in the XML a particular class allows, but all known settings are listed in the Settings classes with no distinction between pluggable and non-pluggable settings. Thus we cannot verify that a settings file is correct.

Only one plug-in of a particular kind can exist in a given JVM, since the plug-in used is determined by the settings file. Allowing multiple plug-ins would make it easier to, say, provide more ways of combining batch results. There would then need to be a mechanism for specifying which plug-in to use when.
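One way to lift the one-plug-in-per-JVM restriction would be a registry that maps names to plug-in instances, so callers choose an implementation at the call site rather than relying on a single settings entry. This is a hypothetical sketch, not an existing NetarchiveSuite class:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical registry allowing several plug-ins of one kind in a JVM,
 * looked up by name instead of a single settings-file entry.
 */
class PluginRegistry<T> {
    private final Map<String, T> plugins = new ConcurrentHashMap<>();

    /** Register a plug-in under a symbolic name, e.g. from the settings file. */
    void register(String name, T plugin) {
        plugins.put(name, plugin);
    }

    /** Look up a plug-in by name, failing fast on unknown names. */
    T get(String name) {
        T plugin = plugins.get(name);
        if (plugin == null) {
            throw new IllegalArgumentException(
                    "No plug-in registered as '" + name + "'");
        }
        return plugin;
    }
}
```

With this in place, a batch job could name the result-combination plug-in it wants, and several combiners could coexist in the same JVM.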

Development/Plugins (last edited 2010-08-16 10:25:08 by localhost)