System Design 3.10/Coding Guidelines - Design Related

Settings

Also include relevant parts to the design decision input document (is now implemented) settings structure

Until release 3.8, all our settings were defined in the class dk.netarkivet.common.Settings, and all access to the settings was done using static methods in the same class. Furthermore, you needed to have all settings defined in a settings.xml.

As of version 3.8, the settings are read using methods in the new class dk.netarkivet.common.utils.Settings. The declaration of the settings themselves, however, has been moved to new settings classes in each of the modules. Every module except for the Deploy module now has its own Settings class and a settings.xml file containing the default values for the module settings.

In addition to that, every plug-in is intended to declare its own settings inside itself, and be associated with an xml-file containing the default values of these settings placed in the same directory as the plugin itself. For more information, please refer to the Configuration Basics description in the Configuration Manual.

Associated with most of the NetarchiveSuite plug-ins, there are also factory classes that hide the complexity of selecting the correct plug-in according to the chosen settings. The names of these classes all end in Factory, e.g. JMSConnectionFactory, RemoteFileFactory.
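
The factory idea can be illustrated with a minimal sketch. All names in it (NotifierSketch, its two implementations, and the setting key settings.example.notifierClass) are hypothetical and are not taken from the NetarchiveSuite code base. The sketch only assumes that a setting holds the fully qualified class name of the plug-in to instantiate, and that the plug-in has a no-argument constructor.

```java
import java.util.Map;

/** Hypothetical plug-in interface, for illustration only. */
interface NotifierSketch {
    String describe();
}

/** One hypothetical implementation of the plug-in. */
class StderrNotifierSketch implements NotifierSketch {
    public String describe() { return "prints to System.err"; }
}

/** Another hypothetical implementation of the plug-in. */
class EmailNotifierSketch implements NotifierSketch {
    public String describe() { return "sends an email"; }
}

/** Hides the choice of implementation behind a single method. */
final class NotifierFactorySketch {
    /** Hypothetical setting key naming the implementation class. */
    static final String CLASS_SETTING = "settings.example.notifierClass";

    /** Instantiates the class named in the settings via its no-argument constructor. */
    static NotifierSketch getInstance(Map<String, String> settings) {
        String className = settings.get(CLASS_SETTING);
        if (className == null) {
            throw new IllegalArgumentException("Missing setting " + CLASS_SETTING);
        }
        try {
            Class<?> pluginClass = Class.forName(className);
            return (NotifierSketch) pluginClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot create plug-in " + className, e);
        }
    }
}
```

Callers only ever ask the factory for an instance, so switching implementation is a matter of changing the setting value rather than the calling code.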

Almost all configuration of NetarchiveSuite is done through the Settings classes of the main modules, for example the dk.netarkivet.common.Settings class. Such a class provides a simple interface to the settings.xml file as well as definitions of all current configuration default settings. The settings.xml file itself is an XML file with a structure reminiscent of the package structure.

Settings are referred to inside NetarchiveSuite by their path in the XML structure. For instance, the storeRetries setting in arcrepositoryClient under common is referred to with the string "settings.common.arcrepositoryClient.storeRetries". However, to avoid typos, each known setting has its path defined as a String constant in the Settings class, and this constant is used throughout the code (and by deploy when checking whether settings are known). The Settings class file also includes a description of each setting in its Javadoc.

To add a new general setting, the following steps need to be taken:

The Settings class should get a definition for the path of the setting. This String, although a constant, must not be declared final, since settings are initialised by a static initialiser in the class.
The Javadoc for the definition must include the path name of the setting as well as a description of the setting.
All default settings.xml files must be updated (including those in the unit tests).
conf/settings_example.xml must be updated.
scripts/simple_harvest/settings.xml (for quickstart) must be updated, if needed.
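
The path convention and the constant declaration described above can be sketched as follows. This is a minimal illustration, not the NetarchiveSuite implementation: the class SettingsPathSketch and its get() method are hypothetical, the lookup uses the JDK's DOM parser for simplicity, and the Javadoc text for the constant is only an example of the required form.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class SettingsPathSketch {
    /**
     * <b>settings.common.arcrepositoryClient.storeRetries</b>: <br>
     * Example description of the setting (wording is illustrative only).
     * Deliberately not final, following the guideline above.
     */
    public static String ARCREPOSITORY_STORE_RETRIES
            = "settings.common.arcrepositoryClient.storeRetries";

    /** Returns the text at the dotted path in the XML, or null if the path is absent. */
    static String get(String xml, String path) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Unreadable settings XML", e);
        }
        String[] parts = path.split("\\.");
        Element current = doc.getDocumentElement();
        if (!current.getTagName().equals(parts[0])) {
            return null;
        }
        // Walk down one element per path component.
        for (int i = 1; i < parts.length; i++) {
            current = firstChild(current, parts[i]);
            if (current == null) {
                return null;
            }
        }
        return current.getTextContent().trim();
    }

    /** Finds the first child element with the given tag name, or null. */
    private static Element firstChild(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && ((Element) n).getTagName().equals(name)) {
                return (Element) n;
            }
        }
        return null;
    }
}
```

Using the constant rather than a string literal at each call site means a mistyped path becomes a compile error instead of a silently missing setting.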

Note that there is no XML Schema to be updated, because the use of default settings means that settings files do not need to be present, and plug-in settings mean that we generally do not know which settings will be used. Note also that the Configuration Manual includes a description of how deploy can be used to validate whether the settings in a settings file are valid (not necessarily exhaustive!).

JMS Channels

Placement in NetarchiveSuite software

Every channel is named after the set of application instances that are expected to receive messages sent to it. All channel names are constructed privately in the dk.netarkivet.distribute.ChannelID class. To get a channel, you must use one of the public methods in dk.netarkivet.distribute.Channels.

Channel Behavior

Two types of channels are used:

Queue, where only one listener takes each message from the queue. For example, a request to do a harvest is received by only one harvester.
Topic, where all listeners take a copy of the message. For example, a batch job that has to be executed by all bitarchive applications.

The channel type does not affect the channel name directly. It is indicated by the Channel Prefix: only channels whose prefix starts with ALL are topics, while the rest are queues.

The channel type of each channel is given in the table in the next section.

Naming Conventions

The actual channel name of a given channel has the following structure:

<Environment Name>_<Channel Prefix>_<Replica Annotation>[_<Machine>][_<Application Instance>]

Environment Name must be the same for all channels on all JVMs in an integrated system. It identifies the environment for the queue, e.g. "DEV", "TEST", "RELEASETEST", "CLOVER", "PROD", or the initials of a developer running a personal environment, e.g. for sanity tests. It is read from the setting settings.common.environmentName of the system.
Channel Prefix is constructed from a convention for channel behavior together with a denotation of the application(s) concerned. For example, THE_SCHED uses the convention THE with SCHED denoting the scheduler of the harvester.
The conventions for channel behavior are:
- THE - communicate with a singleton in the system.
- THIS - communicate with one uniquely identified instance of a number of distributed listeners.
- ANY - has all instances listening to the same queue.
- ALL - is used for topics only, i.e. messages are sent to all listeners on the channel.
Use Replica Id results in a channel name containing the replica identifier in question (which must match one of the possible replicas in the settings). If no replica id is used, COMMON is used instead.
Use IP Node results in a channel name containing the IP address of the machine where the application in question is placed. It is read directly from the running system.
If IP Node is not used, nothing is written.
Use Application Instance Id is used if only one process is expected to listen to the queue. It results in a channel name containing an identification of the individual application instance. It consists of the application abbreviation (the uppercase letters of the application name) and the application instance identifier from the settings. It is read from the common settings settings.common.applicationName and settings.common.applicationInstanceId.
If Application Instance Id is not used, nothing is written.
Channel Type is only included to complete the picture. Please refer to the section on channel behavior.

Channel names are constructed as described in the table below (the columns are described above).

Channel Prefix	Use Replica Id (for Replica Annotation)	Use IP Node (for Machine)	Use Application Instance Identification	Channel Type	Example
THE_SCHED	No	No	No	Queue	PROD_THE_SCHED_COMMON
ANY_HIGHPRIORITY_HACO	No	No	No	Queue	PROD_ANY_HIGHPRIORITY_HACO_COMMON
ANY_LOWPRIORITY_HACO	No	No	No	Queue	PROD_ANY_LOWPRIORITY_HACO_COMMON
THIS_REPOS_CLIENT	No	Yes	Yes	Queue	PROD_THIS_REPOS_CLIENT_COMMON_130_226_258_7_HCA_HIGH
THE_REPOS	No	No	No	Queue	PROD_THE_REPOS_COMMON
THE_BAMON	Yes	No	No	Queue	PROD_THE_BAMON_TWO
ALL_BA	Yes	No	No	Topic	PROD_ALL_BA_TWO
ANY_BA	Yes	No	No	Queue	PROD_ANY_BA_TWO
THIS_INDEX_CLIENT	No	Yes	Yes	Queue	PROD_THIS_INDEX_CLIENT_COMMON_130_226_258_7_ISA
INDEX_SERVER	No	No	No	Queue	PROD_INDEX_SERVER_COMMON
MONITOR	No	No	No	Queue	PROD_MONITOR_COMMON
ERROR	No	No	No	Queue	PROD_ERROR_COMMON
THE_CR	Yes	No	No	Queue	PROD_THE_CR_THREE

The examples use the following values:

Environment name PROD
Possible replica identifiers ONE or TWO
IP on machine 130.226.258.7
Application instances
- HCA_HIGH (for HarvestControllerApplication with instance id "HIGH" )
- ISA (for IndexServerApplication with instance id "" )
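
The naming rules above can be sketched in a few lines of Java. This is an illustration of the convention only, not the code in dk.netarkivet.distribute.ChannelID; the class ChannelNameSketch and its method names are hypothetical. A null argument stands for "No" in the corresponding table column.

```java
final class ChannelNameSketch {
    /** Only prefixes starting with ALL denote topics; all other prefixes denote queues. */
    static boolean isTopic(String channelPrefix) {
        return channelPrefix.startsWith("ALL");
    }

    /**
     * Builds the application instance part: the uppercase letters of the
     * application name, followed by "_" and the instance id if the id is non-empty.
     */
    static String applicationInstance(String applicationName, String instanceId) {
        StringBuilder result = new StringBuilder();
        for (char c : applicationName.toCharArray()) {
            if (Character.isUpperCase(c)) {
                result.append(c);
            }
        }
        if (instanceId != null && !instanceId.isEmpty()) {
            result.append('_').append(instanceId);
        }
        return result.toString();
    }

    /** Assembles the name: environment, prefix, replica annotation, then optional machine and instance. */
    static String channelName(String environment, String prefix, String replicaId,
                              String ipAddress, String appInstance) {
        StringBuilder name = new StringBuilder(environment)
                .append('_').append(prefix)
                .append('_').append(replicaId == null ? "COMMON" : replicaId);
        if (ipAddress != null) {
            // Dots in the IP address are written as underscores, as in the table examples.
            name.append('_').append(ipAddress.replace('.', '_'));
        }
        if (appInstance != null) {
            name.append('_').append(appInstance);
        }
        return name.toString();
    }
}
```

With the example values, channelName("PROD", "THE_BAMON", "TWO", null, null) gives PROD_THE_BAMON_TWO, matching the corresponding table row.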

Design Notes

Note that the creation of channel names for the ANY_xxx_HACO queues is designed so that extension to more priorities is easy.

Localization

The NetarchiveSuite web pages are internationalized, that is they are ready to be translated into other languages. The default distribution contains a default (English) version and Danish, Italian, French and German versions, but adding a new language does not take any coding. All translatable strings are collected in five resource bundles, one for each of the five main modules mentioned above. The default translation files are src/dk/netarkivet/common/Translations.properties, src/dk/netarkivet/archive/Translations.properties, src/dk/netarkivet/harvester/Translations.properties, src/dk/netarkivet/viewerproxy/Translations.properties, and src/dk/netarkivet/monitor/Translations.properties.

To translate to a new language, first copy each of these files to a file in the same directory, but with _XX after Translations, where XX is the ISO 639 language code for the language you are translating into; for example, if you are translating into Limburgish, use Translations_li.properties. If you are translating into a language that has different versions for different countries, you may need to use _XX_YY, where XX is the language code and YY is the ISO country code, e.g. Translations_fr_CA.properties for Canadian French. Then edit each of the new files so that every line has your translation instead of the English one. Most of the important syntax should be evident from the original, but for details consult the XXX. According to the Java documentation (specifically the Javadoc of the Properties class), resource bundles should use ISO-8859-1, with escaped Unicode for all other characters. It is good practice to use escaped Unicode for all non-ASCII characters, as this results in files that are more easily shared between different text-editing environments.
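
As a small illustration of the file naming and escaping rules, the following sketch derives a translation file name from a java.util.Locale, escapes non-ASCII characters, and reads the escaped value back with java.util.Properties. The class TranslationFileSketch and its methods are hypothetical helpers written for this example; they are not part of NetarchiveSuite.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.Properties;

final class TranslationFileSketch {
    /** Builds the file name, e.g. Translations_fr_CA.properties for Canadian French. */
    static String fileNameFor(Locale locale) {
        return "Translations_" + locale + ".properties";
    }

    /** Replaces every non-ASCII character with its escaped four-digit hexadecimal Unicode form. */
    static String escapeNonAscii(String text) {
        StringBuilder escaped = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128) {
                escaped.append(c);
            } else {
                escaped.append(String.format("\\u%04X", (int) c));
            }
        }
        return escaped.toString();
    }

    /** Parses properties-file text and returns the value for the key, with escapes decoded. */
    static String readValue(String propertiesText, String key) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return properties.getProperty(key);
    }
}
```

The escaped form is pure ASCII, so it survives any editor or transfer, and Properties decodes it back to the original characters on load.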

The translation has not been done throughout the code, only in the web-related parts. Thus log messages and unexpected error messages are in English and cannot be translated through the resource bundles.

JSP

The webpages in NetarchiveSuite are written using JSP (Java Server Pages) with Apache I18N Taglib for internationalization. To support a unified look across pages from different modules, we have divided the pages into SiteSections as described in the next section. Any processing of requests happens in Java code before the web page is displayed, such that update errors can be handled with a uniform error page. Internationalization is primarily done with the taglib tags <fmt:message>, <fmt:date> etc.

The main feature of JSP is that ordinary Java (not JavaScript) can be used at server-side to generate HTML. The special tags <%...%> indicate a piece of Java code to run, while the tags <%=...%> indicate a Java expression whose value will be inserted (as is, see escape mechanisms below) in the HTML. While it is possible to output to HTML from Java code using out.print(), it is discouraged as it a) is confusing to read, and b) does not allow for using taglibs for internationalization.

We use a number of standard methods defined in dk.netarkivet.common.webinterface.HTMLUtils. Of particular note are the following methods:

generateHeader(): This method takes a PageContext and generates the full header part of the HTML page, including the starting <body> tag. It should always be used to create the header, as it also creates the menu and language selection links. After this method has been called, redirection or forwarding is no longer possible, so any processing that can cause fatal errors must be done before calling generateHeader(). The title of the page is taken from the SiteSection information based on the URL used in the request.
generateFooter(): This closes the HTML page and should be called as the last thing on any JSP page.
setUTF8(): This method must be called at the very start of the JSP page to ensure that reading from the request is handled in UTF-8.
encode(): This encodes any character that is not legal to have in a URL. It should be used whenever an unknown string (or a string with known illegal characters) is made part of a URL. Note that it is not idempotent, calling it twice on a string is likely to create a mess.
escapeHTML(): This escapes any character that has special meaning in HTML (such as < or &). It should be used any time an unknown string (or a string with known special characters) is being put into HTML. Note that it is not idempotent: if you escape something twice, you get a horrible-looking mess.
encodeAndEscape(): This method combines encode() and escapeHTML() in one, which is useful when you're putting unknown strings directly into URLs in HTML.
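
The warning about idempotence can be made concrete with a short sketch. The class EscapingSketch below is a hypothetical stand-in that uses the JDK's URLEncoder and a hand-written HTML escape; it is not the HTMLUtils implementation, whose exact output may differ.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class EscapingSketch {
    /** URL-encodes a string as UTF-8 (stand-in for the idea behind encode()). */
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /** Escapes the characters with special meaning in HTML (stand-in for escapeHTML()). */
    static String escapeHtml(String s) {
        // The ampersand must be replaced first, or the other entities would be re-escaped.
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** First make the string URL-safe, then make the result HTML-safe. */
    static String encodeAndEscape(String s) {
        return escapeHtml(encode(s));
    }
}
```

Applying escapeHtml() twice to a<b yields a&amp;lt;b rather than a&lt;b, and applying encode() twice to a&b yields a%2526b rather than a%26b, which is why each string should be escaped exactly once, at the point where it is written into the page.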

The SiteSection system

Each part of the web site (as identified by the top-level menu items on the left side) is defined by one subclass of the SiteSection class. These sections are loaded through the <siteSection> settings, each of which connects one SiteSection class with its WAR file and the path it will appear under in the URL.

Each SiteSection subclass defines the name used in the left-hand menu, the prefix of all its pages, the number of pages visible in the left-hand menu when within this section, a suffix and title for each page in the section (including hidden pages), the directory that the section should be deployed under, and a resource bundle name for translations. Furthermore, the SiteSections have hooks for code that needs to be run on deployment and undeployment. If you want to add a new page to the section, you only need to add a new line to the list of pages with a unique (within the SiteSection) suffix and a key for the page title, plus a default translation in the corresponding Translations.properties file. If you want it to appear in the left-hand menu, update the number of visible pages to n+1 and put your new page among the first n+1 lines.

This is an example of what a simple SiteSection can look like. Note that only the first two pages from the list have entries in the left-hand menu. This class does no special initialisation and shutdown.

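    public class HistorySiteSection extends SiteSection {
        /** Declares menu name, page prefix, visible page count, pages, directory and bundle. */
        public HistorySiteSection() {
            super("sitesection;history", "Harveststatus", 2,
                  new String[][]{
                          // The first two pages are visible in the left-hand menu.
                          {"alljobs", "pagetitle;all.jobs"},
                          {"perdomain", "pagetitle;all.jobs.per.domain"},
                          // The remaining pages are hidden from the menu.
                          {"perhd", "pagetitle;all.jobs.per.harvestdefinition"},
                          {"perharvestrun", "pagetitle;all.jobs.per.harvestrun"},
                          {"jobdetails", "pagetitle;details.for.job"}
                  }, "History",
                     dk.netarkivet.harvester.Constants.TRANSLATIONS_BUNDLE);
        }

        /** No special initialisation is needed. */
        public void initialize() {}

        /** No special shutdown is needed. */
        public void close() {}
    }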

Processing updates

Some JSP pages cause updates when posted with specific parameters. Such parameters should always be specified at the beginning of the JSP file. All updates of underlying file systems, databases etc. should happen before generateHeader() is called, so that processing errors can be properly redirected. The preferred way to process updates is to create a method processRequest() in a class corresponding to the web page, but under the webinterface package of the corresponding module. This method should take the pageContext and I18N parameters from the JSP page; together they contain all the information needed from there.

In case of processing errors, the processing method should call HTMLUtils.forwardToErrorPage() and then throw a ForwardedToErrorPage exception. The JSP code should always enclose the processRequest() call in a try-catch block and return immediately if ForwardedToErrorPage is thrown. This mechanism should be used for "expected" errors, mainly illegal parameters. Errors of the "this can never happen" nature should just cause normal exceptions. Like in other code, the processRequest() method should check its parameters, but it should also check the parameters posted in the request to check that they conform to the requirements. Some methods for that purpose can be found in HTMLUtils.
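
The control flow described above can be sketched without the servlet API. Everything in this example is a simplified stand-in: ForwardedToErrorPageSketch plays the role of ForwardedToErrorPage, a Map replaces the posted request, renderPage() replaces the JSP page, and the parameter name maxObjects is hypothetical. The real processRequest() takes the pageContext and I18N objects instead.

```java
import java.util.Map;

/** Stand-in for the ForwardedToErrorPage exception named in the text. */
class ForwardedToErrorPageSketch extends RuntimeException {
    ForwardedToErrorPageSketch(String message) {
        super(message);
    }
}

final class UpdatePageSketch {
    /** Plays the role of processRequest(): validate the posted parameters, then update. */
    static int processRequest(Map<String, String> postedParameters) {
        String raw = postedParameters.get("maxObjects"); // hypothetical parameter name
        try {
            int value = Integer.parseInt(raw);
            if (value < 0) {
                throw new NumberFormatException("negative value");
            }
            return value; // the actual update would happen here
        } catch (NumberFormatException e) {
            // An "expected" error: the real code would first forward to the
            // error page and then throw, so the JSP page knows it must stop.
            throw new ForwardedToErrorPageSketch("Illegal value for maxObjects: " + raw);
        }
    }

    /** Plays the role of the JSP page: returns immediately if the request was forwarded. */
    static String renderPage(Map<String, String> postedParameters) {
        int value;
        try {
            value = processRequest(postedParameters);
        } catch (ForwardedToErrorPageSketch e) {
            return "error page"; // the header was never generated
        }
        // Only now would generateHeader() be called and the page body written.
        return "header ... maxObjects=" + value + " ... footer";
    }
}
```

The point of the ordering is that all validation and updating finishes before any header output, so a forward to the error page is still possible when something is wrong.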

I18n

We use the Apache I18n taglib for most internationalization on the web pages. This means that instead of writing different versions of a web page for different languages, we replace all natural language parts of the page with special formatting instructions. These are then used to look up translations to the language in effect in translation resource bundles.

Normal strings can be handled with the <fmt:message/> tag. If variable parameters are introduced, such as object names or domain names, they can be passed as parameters using <fmt:message key="translation.key"><fmt:param value="<%=myVal%>"/></fmt:message>. Note that while the message retrieved for the key gets any HTML-specific characters escaped, the parameter values do not and should be escaped manually. If necessary, it is possible to pass HTML as parameters.

Dates should in general be entered using <fmt:formatDate type="both">, though a few places use a more explicit handling of formats. This lets the date be expressed in the native language's favorite style.

Integers and longs are handled in Java properties files with {0, number, integer} or {0, number, long} as described in MessageFormat. In JSP files we convert integers and longs to strings with HTMLUtils.localiseLong(long, PageContext) or HTMLUtils.localiseLong(long, Locale).
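
A minimal sketch of how such a pattern behaves with java.text.MessageFormat is shown below. The helper class NumberLocalisationSketch and the message text are made up for this example; only MessageFormat itself is the real JDK API.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class NumberLocalisationSketch {
    /** Formats a translation pattern with its arguments according to the given locale. */
    static String format(String pattern, Locale locale, Object... arguments) {
        // The locale decides, among other things, which grouping separator is used.
        return new MessageFormat(pattern, locale).format(arguments);
    }
}
```

For example, format("{0,number,integer} jobs", Locale.ENGLISH, 1234567L) gives "1,234,567 jobs"; with another locale the same pattern produces that locale's number formatting, which is why numbers should be passed as parameters rather than concatenated into the message.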

Note the boilerplate code at the start of every page that defines output encoding, taglib usage, translation bundle, and a general-purpose I18N object. It is important that the translation bundle from the Constants class of the module you are in is used, or incomprehensible errors will occur.

    pageEncoding="UTF-8"
%><%@taglib uri="http://java.sun.com/jsp/jstl/fmt" prefix="fmt"
%><fmt:setLocale value="<%=HTMLUtils.getLocale(request)%>" scope="page"
/><fmt:setBundle scope="page" basename="<%=dk.netarkivet.archive.Constants.TRANSLATIONS_BUNDLE%>"/><%!
    private static final I18n I18N
            = new I18n(dk.netarkivet.archive.Constants.TRANSLATIONS_BUNDLE);
%><%

Pluggable parts

Some points in NetarchiveSuite can be swapped out for other implementations, in a way similar to what Heritrix uses.

Also include relevant parts of design document that was basis for implementation of plug-ins

[To be introduced more]

How pluggability works

Factories [To be described more]

...request for suggestions on pluggability areas [To be described more]

RemoteFile

The RemoteFile interface defines how large chunks of data are transferred between machines in a NetarchiveSuite installation. This is necessary because JMS has a relatively low limit on the size of messages, well below the several hundred megabytes to over a gigabyte that is easily stored in an ARC file. There are currently three implementations available in the default distribution:

FTPRemoteFile - this implementation uses one or more FTP servers for transfer. While this requires more setup and causes extra copying of data, the method has the advantage of allowing more protective network configurations.
HTTPRemoteFile - this implementation uses an embedded HTTP server in each application that wants to send a RemoteFile. Additionally, it will detect when a file transfer happens within the same machine and use local copying or renaming as applicable. For single-machine installations, this is the implementation to use. In a multi-machine installation, it requires that all machines that can send RemoteFile objects (including the bitarchive machines) have a port accessible from the rest of the system, which may go against security policies.
HTTPSRemoteFile - This is an extension of HTTPRemoteFile that ensures that the communication is secure and encrypted. It is implemented with a shared certificate scheme, and only clients with access to the certificate will be able to contact the embedded HTTP server.

All three implementations will detect when 0 bytes are to be transferred and avoid creating an unnecessary file in this case.

Describe interface...

JMSConnection

The JMSConnection provides access to a specific JMS connection. The default NetarchiveSuite distribution contains only one implementation, namely JMSConnectionSunMQ which uses Sun's OpenMQ. We recommend using this implementation, as other implementations have previously been found to violate some assumptions that NetarchiveSuite depends on.

Describe interface...

ArcRepositoryClient

The ArcRepositoryClient handles access to the Archive module, both upload and low-level access. There are two implementations in the default distribution:

JMSArcRepositoryClient - this is a full-fledged distributed implementation using JMS for communication, allowing multiple locations with multiple machines each.
LocalArcRepositoryClient - an ARC repository implementation that stores all files in local directories.

Describe interface...

IndexClient

The IndexClient provides the Lucene indices that are used for deduplication and for viewerproxy access. It makes use of the ArcRepositoryClient to fetch data from the archive and implements several layers of caching of these data and of Lucene-indices created from the data. It is advisable to perform regular clean-up of the cache directories.

Describe interface...

DBSpecifics

The DBSpecifics interface allows substitution of the database used to store harvest definitions. There are three implementations: one for MySQL, one for Derby running as a separate server, and one for embedded Derby. Which of these to choose is mostly a matter of individual preference. The embedded Derby implementation has been in use at the Danish web archive for over two years.

Describe interface...

Notifications

The Notifications interface lets you choose how you want important error notifications to be handled in your system. Two implementations exist, one to send emails, and one to print the messages to System.err. Adding more specialised plugins should be easy.

Describe interface...

XML handling by Deploy

Deploy needs to create a settings file in XML format for each application. Creating these files requires the ability to manipulate the XML data structure, create new XML instances based on previous ones, and save the result to a file.

The existing XMLTree class (under dk.netarkivet.common.utils) did not support these abilities, and it would require a larger structural change of the class to meet these demands.

A new class (dk.netarkivet.deploy.XmlStructure) has therefore been implemented to handle the XML data structure. This class uses dom4j (package org.dom4j) for handling the XML data structure.

The dk.netarkivet.deploy.XmlStructure class has the ability to inherit and overwrite, which is used in the deploy configuration structure. The ability to inherit is implemented by creating a new instance identical to a current instance; thus inheritance can only occur during creation of a new instance of this class.

The overwrite function merges the current XmlStructure with a new tree. For leaves that are present in both the new tree and the current one, the value of the current leaf is overwritten by the value of the leaf in the new tree. Branches that exist only in the new tree are appended to the current tree. Branches present in both trees are overwritten recursively.
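
The inherit and overwrite semantics can be sketched on a simplified tree. This is not the XmlStructure implementation: instead of dom4j elements, the sketch models a branch as a Map and a leaf as a String, and the class name OverwriteSketch and the example keys are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OverwriteSketch {
    /** Inherit: create a new tree identical to the source (a deep copy). */
    @SuppressWarnings("unchecked")
    static Map<String, Object> inherit(Map<String, Object> source) {
        Map<String, Object> copy = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            Object value = entry.getValue();
            copy.put(entry.getKey(),
                     value instanceof Map ? inherit((Map<String, Object>) value) : value);
        }
        return copy;
    }

    /** Overwrite: merge the overrides into the current tree, modifying it in place. */
    @SuppressWarnings("unchecked")
    static void overwrite(Map<String, Object> current, Map<String, Object> overrides) {
        for (Map.Entry<String, Object> entry : overrides.entrySet()) {
            Object existing = current.get(entry.getKey());
            Object incoming = entry.getValue();
            if (existing instanceof Map && incoming instanceof Map) {
                // Branch present in both trees: overwrite recursively.
                overwrite((Map<String, Object>) existing, (Map<String, Object>) incoming);
            } else if (incoming instanceof Map) {
                // Branch only in the new tree: append a copy of it.
                current.put(entry.getKey(), inherit((Map<String, Object>) incoming));
            } else {
                // Leaf: the new value replaces the current one (or is added).
                current.put(entry.getKey(), incoming);
            }
        }
    }
}
```

A typical use would be to inherit() a global settings tree for each application and then overwrite() the copy with the application-specific values, leaving the global tree untouched.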

System Design 3.10/Coding Guidelines - Design Related (last edited 2010-08-16 10:24:12 by localhost)