Pluggable parts

Some points in NetarchiveSuite can be swapped out for other implementations, in a way similar to what Heritrix uses.

[To do: also include the relevant parts of the design document that was the basis for the implementation of plug-ins (see Development/Plugins).]

[To be introduced more]

How pluggability works

Factories [To be described more]
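As a rough illustration of the factory idea named above, the sketch below picks an implementation class from a setting and instantiates it by reflection. This is not NetarchiveSuite's actual factory code: the Transport interface, its two implementations, the PluginFactory class and the settings key used in the usage note are all hypothetical names chosen for the example, and java.util.Properties stands in for whatever settings mechanism the installation uses.

```java
import java.util.Properties;

/** Hypothetical plug-in point; stands in for an interface such as RemoteFile. */
interface Transport {
    String describe();
}

/** Hypothetical implementation that a setting could select. */
class FtpTransport implements Transport {
    public String describe() {
        return "ftp";
    }
}

/** A second hypothetical implementation of the same plug-in point. */
class HttpTransport implements Transport {
    public String describe() {
        return "http";
    }
}

/** Assumed factory: looks up a class name in the settings and instantiates it. */
final class PluginFactory {
    private PluginFactory() {
    }

    static <T> T create(Properties settings, String key, Class<T> type) {
        String className = settings.getProperty(key);
        if (className == null) {
            throw new IllegalArgumentException("Missing setting: " + key);
        }
        try {
            // Requires the implementation to have a no-argument constructor.
            Object instance = Class.forName(className)
                    .getDeclaredConstructor().newInstance();
            // cast() throws ClassCastException if the class does not
            // implement the requested plug-in interface.
            return type.cast(instance);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException(
                    "Cannot instantiate " + className + " for " + key, e);
        }
    }
}
```

With this sketch, setting the (made-up) key settings.transport.class to the name of HttpTransport and calling PluginFactory.create(settings, "settings.transport.class", Transport.class) would return an HttpTransport, so swapping the implementation needs only a settings change.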

...request for suggestions on pluggability areas [To be described more]

RemoteFile

The RemoteFile interface defines how large chunks of data are transferred between machines in a NetarchiveSuite installation. This is necessary because JMS has a relatively low limit on the size of messages, well below the several hundred megabytes to over a gigabyte that an ARC file can easily hold. There are three implementations available in the default distribution:

  • FTPRemoteFile - this implementation uses one or more FTP servers for transfer. While this requires more setup and causes extra copying of data, the method has the advantage of allowing more protective network configurations.
  • HTTPRemoteFile - this implementation uses an embedded HTTP server in each application that wants to send a RemoteFile. Additionally, it will detect when a file transfer happens within the same machine and use local copying or renaming as applicable. For single-machine installations, this is the implementation to use. In a multi-machine installation, it requires that every machine that can send RemoteFile objects (including the bitarchive machines) has a port accessible from the rest of the system, which may go against security policies.
  • HTTPSRemoteFile - this is an extension of HTTPRemoteFile that ensures that the communication is secure and encrypted. It is implemented with a shared certificate scheme, and only clients with access to the certificate will be able to contact the embedded HTTP server.

All three implementations will detect when 0 bytes are to be transferred and avoid creating an unnecessary file in this case.

The RemoteFile interface is defined by the Java interface dk.netarkivet.common.distribute.RemoteFile:

package dk.netarkivet.common.distribute;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.Serializable;
import dk.netarkivet.common.exceptions.ArgumentNotValid;
import dk.netarkivet.common.exceptions.IOFailure;
/**
 *  RemoteFile: Interface for encapsulating remote files.
 *  Enables us to transmit large files between system components situated
 *  on different machines.
 *  Our current JMS broker(s) do not allow large messages
 *  (i.e. messages > 70 MB).
 */
public interface RemoteFile extends Serializable {
    /**
     * Copy remotefile to local disk storage.
     * Used by the data recipient
     * @param destFile local File
     * @throws IOFailure on communication trouble.
     * @throws ArgumentNotValid on null parameter or non-writable file
     */
    public void copyTo(File destFile);
    /**
     * Write the contents of this remote file to an output stream.
     * @param out OutputStream that the data will be written to.  This stream
     * will not be closed by this operation.
     * @throws IOFailure If append operation fails
     * @throws ArgumentNotValid on null parameter
     */
    public void appendTo(OutputStream out);
    /**
     * Get an inputstream that contains the data transferred in this RemoteFile.
     * @return A stream object with the data in the object.  Note that the
     * close() method of this may throw exceptions if e.g. a transmission error
     * is detected.
     * @throws IOFailure on communication trouble.
     */
    public InputStream getInputStream();
    /**
     * Return the file name.
     * @return the file name
     */
    public String getName();
    /**
     * Returns an MD5 checksum of the file. May return null if checksums are
     * not supported for this operation.
     * @return MD5 checksum
     */
    public String getChecksum();
    /**
     * Cleanup this remote file. The file is invalid after this.
     */
    public void cleanup();
    /** Returns the total size of the remote file.
     * @return Size of the remote file.
     */
    public long getSize();
}
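
To make the contract above more concrete, here is a minimal sketch of a same-machine implementation and of how a recipient might consume it. It is an illustration only, not one of the implementations shipped with NetarchiveSuite. RemoteFileSketch is a trimmed stand-in that copies the method names of the interface above so that the example compiles on its own; LocalRemoteFile and RemoteFileReceiver are hypothetical names, and IllegalArgumentException and UncheckedIOException stand in for ArgumentNotValid and IOFailure.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Stand-in with the same method names as the RemoteFile interface. */
interface RemoteFileSketch extends Serializable {
    void copyTo(File destFile);
    void appendTo(OutputStream out);
    InputStream getInputStream();
    String getName();
    String getChecksum();
    void cleanup();
    long getSize();
}

/** Hypothetical implementation where the "remote" file is a local file. */
class LocalRemoteFile implements RemoteFileSketch {
    private static final long serialVersionUID = 1L;
    private final File source;
    private boolean cleanedUp;

    LocalRemoteFile(File source) {
        if (source == null || !source.isFile()) {
            throw new IllegalArgumentException("source must be an existing file");
        }
        this.source = source;
    }

    private void ensureValid() {
        if (cleanedUp) {
            throw new IllegalStateException("RemoteFile used after cleanup()");
        }
    }

    public void copyTo(File destFile) {
        ensureValid();
        if (destFile == null) {
            throw new IllegalArgumentException("destFile must not be null");
        }
        try {
            Files.copy(source.toPath(), destFile.toPath(),
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void appendTo(OutputStream out) {
        ensureValid();
        if (out == null) {
            throw new IllegalArgumentException("out must not be null");
        }
        try {
            Files.copy(source.toPath(), out); // leaves the stream open
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public InputStream getInputStream() {
        ensureValid();
        try {
            return Files.newInputStream(source.toPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public String getName() {
        return source.getName();
    }

    public String getChecksum() {
        return null; // checksums are not supported in this sketch
    }

    public void cleanup() {
        cleanedUp = true; // the sketch only marks the object invalid
    }

    public long getSize() {
        return source.length();
    }
}

/** Assumed recipient-side pattern: copy the data, then always clean up. */
final class RemoteFileReceiver {
    private RemoteFileReceiver() {
    }

    static void receive(RemoteFileSketch remoteFile, File destFile) {
        try {
            remoteFile.copyTo(destFile);
        } finally {
            remoteFile.cleanup();
        }
    }
}
```

Calling cleanup() in a finally block reflects the note in the interface that the file is invalid afterwards: the recipient releases the transfer even when the copy fails, and any later use of the object is rejected.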

JMSConnection

The JMSConnection provides access to a specific JMS connection. The default NetarchiveSuite distribution contains only one implementation, JMSConnectionSunMQ, which uses Sun's OpenMQ. We recommend using this implementation, as other implementations have previously been found to violate some assumptions that NetarchiveSuite depends on.

Describe interface...

ArcRepositoryClient

The ArcRepositoryClient handles access to the Archive module, both upload and low-level access. There are two implementations in the default distribution:

  • JMSArcRepositoryClient - this is a full-fledged distributed implementation using JMS for communication, allowing multiple locations with multiple machines each.
  • LocalArcRepositoryClient - an ARC repository implementation that stores all files in local directories.

Describe interface...

IndexClient

The IndexClient provides the Lucene indices that are used for deduplication and for viewerproxy access. It makes use of the ArcRepositoryClient to fetch data from the archive and implements several layers of caching, both of the fetched data and of the Lucene indices created from it. It is advisable to perform regular clean-up of the cache directories.

Describe interface...

DBSpecifics

The DBSpecifics interface allows substitution of the database used to store harvest definitions. There are three implementations: one for MySQL, one for Derby running as a separate server, and one for Derby running embedded. Which of these to choose is mostly a matter of individual preference. The embedded Derby implementation has been in use at the Danish web archive for over two years.

Describe interface...

Notifications

The Notifications interface lets you choose how you want important error notifications to be handled in your system. Two implementations exist, one to send emails, and one to print the messages to System.err. Adding more specialised plugins should be easy.
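
As a hedged sketch of what a small notification plug-in of the kind described above could look like, the example below prints error notifications to a stream, System.err by default. The ErrorNotifier interface, its errorEvent method and the PrintNotifier class are hypothetical names made up for this illustration; the real Notifications interface is not shown on this page and may differ.

```java
import java.io.PrintStream;

/** Hypothetical plug-in point for important error notifications. */
interface ErrorNotifier {
    void errorEvent(String message, Throwable cause);
}

/** Hypothetical implementation that prints notifications to a stream. */
class PrintNotifier implements ErrorNotifier {
    private final PrintStream out;

    /** Default constructor: print to System.err. */
    PrintNotifier() {
        this(System.err);
    }

    /** Alternative constructor, mainly useful for testing. */
    PrintNotifier(PrintStream out) {
        if (out == null) {
            throw new IllegalArgumentException("out must not be null");
        }
        this.out = out;
    }

    public void errorEvent(String message, Throwable cause) {
        out.println("[errorNotification] " + message);
        if (cause != null) {
            cause.printStackTrace(out);
        }
    }
}
```

A more specialised plug-in (for example one that writes to a ticketing system) would, under these assumptions, only need to implement the same single method.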

Describe interface...

System Design 3.12/Pluggable parts (last edited 2010-08-16 10:24:30 by localhost)