Assignment group B.2 - Improve and isolate bit preservation GUI.

Note (August 11, 2011): The assignments in this group has either been implemented in the NetarchiveSuite (3.12.X) or may eventually be implemented as part of the Bitrepository project (https://sbforge.org/display/BITMAG/Bitrepository+frontpage) or as part of NetarchiveSuite when the Bitrepositoryproject has matured. The latter holds for the following assignments:

Assignment B.2.3 - Use segments in bitarchives
Assignment B.2.4 - Write BitPreservation scheduler
Assignment B.2.5 - Write BitPreservation webinterface
Assignment B.2.6 - Handling doublets

Printer friendly version

Contents

Assignment group B.2 - Improve and isolate bit preservation GUI.

References

Reference documents

Dependencies

All the tasks below need to be done in the order they are specified.
Neither of the tasks require to be done in the same iteration, although there is only limited added value before task B.2.5 is implemented.

Terminology

From section 2.2a below, the term replica is used for the old term location. The reason for this rename is that location was confusingly giving association that it has to do with the location of for instance a bitarchive.
Below I often use the words "wrong" or "changed" about files to mean that they have a different checksum than expected.

Bugs

(maybe) addressed by these assignments (import ones in bold):

Feature requests

(maybe) addressed by these assignments (import ones in bold):

Assignment B.2.1 - Refactor code

Currently there is no clear design in the bit preservation classes.
Responsibility for running and analysing status jobs on the bitarchives is shared between classes in the packages dk.netarkivet.archive.arcrepository.bitpreservation, dk.netarkivet.archive.webinterface, and the webpages in BitPreservation.war.
Clearly this is not a good solution, and it would be a good idea to get a clear distinction. The obvious design will:
Have one clear interface with the needed operations.
Reduce the responsibilities of the classes in dk.netarkivet.archive.webinterface to delegating requests from the webinterface to the bit preservation classes, and converting the results to HTML where needed.
Move all logic from the jsp files in BitPreservation.war to the dk.netarkivet.archive.webinterface classes.
The interface for the needed operations seems to be
/** * Active bitpreservation is assumed to have access to admin data and bitarchives. * Operations may request informations from the bitarchive by sending batch jobs, * reading admin data directly, or reading from cached information from either. */ public interface ActiveBitPreservation { // General status /** * Get details of the status of one file in the bitarchives * * @return The details of one file. */ FilePreservationStatus getFilePreservationStatus(String filename); // Check status for bitarchives /** * Return a list of files marked as missing on this location. * A file is considered missing if it does not exist compared to * admin data. * * @param location The location to get missing files from. * * @return A list of missing files. * * @throw IOFailure if the list cannot be generated. */ Iterable getMissingFiles(Location location); /** * Return a list of files with changed checksums on this location. * A file is considered changed if checksum does not compare to * admin data. * * @param location The location to get list of changed files from. * * @return A list of fiels with changed checksums. * * @throw IOFailure if the list cannot be generated. */ Iterable getChangedFiles(Location location); /** * Return the number of missing files for location. Guaranteed not to * recheck the archive, simply returns the number returned by the last * test. * * @param location The location to get number of missing files from. * * @return The number of missing files. */ int getNumberOfMissingFiles(Location location); /** * Return the number of changed files for location. Guaranteed not to * recheck the archive, simply returns the number returned by the last * test. * * @param location The location to get number of changed files from. * * @return The number of changed files. */ int getNumberOfChangedFiles(Location location); /** * Return the date for last check of missing files for location. Guaranteed * not to recheck the archive, simply returns the number returned by the * last test. * * @param location The location to get date for changed files from. * * @return The date for last check of missing files. */ Date getDateForMissingFiles(Location location); /** * Return the date for last check of changed files for location. Guaranteed * not to recheck the archive, simply returns the number returned by the * last test. * * @param location The location to get date for changed files from. * * @return The date for last check of changed files. */ Date getDateForChangedFiles(Location location); // Update files in bitarchives /** * Check that files are indeed missing, and present in admin data and * reference location. If so, upload missing files from reference location * to this location. * * @param location The location to restore files to * @param filename The names of the files. * * @throw IOFailure if the file cannot be reestablished * @throw PermissionDenied if the file is not in correct state */ void reuploadMissingFiles(Location location, String... filename); /** * Check that file checksum is indeed different to admin data and reference * location. If so, remove missing file and upload it from reference * location to this location. * * @param location The location to restore file to * @param filename The name of the file. * @param filename The credentials needed for removing the file (password). * * @throw IOFailure if the file cannot be reestablished * @throw PermissionDenied if the file is not in correct state */ void replaceChangedFile(Location location, String filename, String credentails); // Check status for admin data /** * Return a list of files present in bitarchive but missing in AdminData. * * @return A list of missing files. * * @throw IOFailure if the list cannot be generated. */ Iterable getMissingFilesForAdminData(); /** * Return a list of files with wrong checksum or status in admin data. * * @return A list of files with wrong checksum or status. * * @throw IOFailure if the list cannot be generated. */ Iterable getChangedFilesForAdminData(); // Update admin data /** * Reestablish admin data to match bitarchive states for files. * * @param filename The files to reestablish state for. * * @throw PermissionDenied if the file is not in correct state */ void addMissingFilesToAdminData(String... filename); /** * Reestablish admin data to match bitarchive states for file. * * @param filename The file to reestablish state for. * * @throw PermissionDenied if the file is not in correct state */ void changeStatusForAdminData(String filename); }
It should be possible to do this by pure refactoring.
In the current implementation, admin data is read directly, and the get...files operations will submit a batchjob to the requested bitarchives.
The time for last check, and the number found in that are available in filestorage, where the results are cached.
Take care to implement especially the methods to update a wrong state correctly. They need to re-request status before correcting, to be sure we are in a state where correct is allowed.
Note: As is currently the case, the above interface does not address the issue where there is not a clear majority. Possibly we should always allow corrects in this case? (OR let this be configurable)

Estimated time

Estimator

8 md

KFC

Assignment B.2.2a0 - Separate the meaning of replica and physical location

Currently a location has two different meanings. It is either the physical location where the mascine on which an application is running, or it refers to the bitarchive replica identy or use.
This information is currently set in the following settings:
locations.location
thisLocation
batchLocation
In order to separate the two meanings these needs to be transformed into the following settings:
replicas.replicaId (instead of locations.location, and accopanied with replicaType, replicaName according to the following sections)
useReplicaId (for default batchjob location, get operations)
thisReplicaId (for bitArchive and bitArchiveMonitor applications in order to define channels)
thisPhysicalLocation (for applications definition in status overview)
Besides correcting the settings, the code must be adjusted for setting of channels for biarchive and bitArchiveMonitor. Furthermore changes must be made for SingleMBeans in order to transfer relevant data to the system state screen. Currently, the organisation on the system state screen is the physical lcoation except in the case of vitarchivemonitor applications.

Estimated time

Estimator

Risk factor

2 md

KFC+ELZI

1,5

Assignment B.2.2a - Generalise replica to include all checksum voters

Note that replica is a new name for location in the current code and documentation. The reason for this rename is that location was confusingly giving association that it has to do with the location of for instance a bitarchive.
Today admin.data contains the third vote in checksum-votes as well as upload status for bitarchive replicas which contain the files from which the checksum can be calculated for a checksum vote. This means that admin.data represents two different kinds of information:
checksum to be used for vote about right version of file
log data for upload status of bitarchive replicas (the status of ArcRepository)
In the following the checksums-part of admin.data will be referred to as a checksum replica. Bitarchive replicas will still be called bitarchive replicas. A common term for both checksum replicas and bitarchive replicas will be replicas. In this first step the admin.data will only be the upload status data part.
In other words the checksum part of admin.data will be represented in a checksum replica, while the upload status will be stored separately (eventially in a database).
Assignment B.2.2a is concerned with separating the admin.data into these two parts and generalise the replica concept. Many of the terms used are defined in the Glossary, especially the figure under term Repository can be of help.
The benefits gained is that functionality and the logic becomes more simple and flexible. Furthermore it becomes simple to add functionality to restore data for the checksum replica. The functionality will also become independent of number of checksum and bitarchive replicas. And lastly, it will be possible to move the checksum replicas to other physical locations than the ArcRepository. The disadvantage is that we do not use the checksum information that is stored in the admin.data, - however the benefits is considered much more valuable.
The below figure illustrates this new view, where 'CS' represents the checksums-part of the old admin.data (for example stored in a file). The upload status data is represented in an internal BitPreservation database (see also next section) and the log of the upload status is still logged in admin.data.

According to the figure, the new checksum instance must be implemented separately (preferably as a pluggable part leaving it pluggable whether checksums are stored in file or database etc.). An interface must be defined and an implementation where the checksum instance is implemented as file based will part of this assignment.
Futhermore a replica API to interface to bitarchive and checksum replicas must be implemented. It will be in this API that for instance getAllFileNames will be effectuated as batchjobs to bitarchive replicas and direct calls to checksum replicas.
In dk.netarkivet.archive.arcrepository.bitpreservation.ActiveBitPreservation and its implementation dk.netarkivet.archive.arcrepository.bitpreservation.FileBasedActiveBitPreservation, this will mean that the following methods will be generalised for all replicas corresponding to the methods for bitarchive replicas
getMissingFilesForAdminData()
getChangedFilesForAdminData()
addMissingFilesToAdminData
changeStateForAdminData
The Checksum Instance of the Checksum replica can be implemented as an interface, where we choose to make implementation in form of a file containing records per line: <filename>,<checksum>
The settings must be changed, so it requires definition of all replicas including checksum replicas. Extra information must be added to replica setting: replica id (in order to ensure consistency between database and setting, even though names changes) and replica type to tell whether it is a checksum or bitarchive replica - this has to be known by the bit preservation interface, since checksum replicas will not be able to provide data objects for correction of currupted data objects.
Functionality to communicate with replicas in dk.netarkivet.archive.arcrepository.bitpreservation.FileBasedActiveBitPreservation must be moved to a Replica class (renamed Location), that has adjuted behaviour according to type of replica (checksum or bitarchive). Whether this is an extension of the existing Location (renamed to Replica) class needs to be analysed further (see also comment under assignment B2.2b)
The old admin.data must be converted into the new structure:
let the new checksum replica defined in the settings be empty at start
read old admin data
(class should be adjusted to only contain methods for the conversion)
"upload" each file with checksum to new checksum replica
store upload status in database (alternatively in log if db is not yet there)
check successful conversion
This assignment can be made independent by B.2.2b if upload status is stored in a separate file. However, there probably will be considerable reduction in time used (especially concerning the convertion of admin.data), if they are implemented in one go.

Estimated time

Estimator

Risk factor

14 md

KFC+ELZI

1,5

Assignment B.2.2b - Store bit preservation information in a database

The status of the last checks is currently stored in files, and handled by the classes WorkFiles and ActiveBitPreservation.
Replace the use of files with data stored in a database. The database will contain all cache information of files, including the upload status primerily used at upload. The history needs to be saved via history table with backup or in logs from bit preservation actions.
The database needs to contain the following tables:

note:
the grayed fields belongs to assignment B.2.3 on segments, which is described later
replicafileinfo checksum may in rare cases be null. It will normally be set first time at upload, but in case it have been lost the file list job can leave it with null value, meaning a file has not had its checksum calculated yet.
a file is missing if the replicafileinfo filelist_status say so or if the table entry is non-existing in the file-info table for the given replica id. In the first case it is because the file has been lost at some stage, in the other case the file has never existed.
replica_type is either checksum or bitarchive depending on whether there is only a checksum or includes file/data objects.
the checksum_lastcheckdate and filelist_lastcheckdate are introduced in order for the schedular to schedulation of filelist and checksums jobs. For instance in a segment is to be checked every second month then the next checking time is min(<.._lastcheckdate for files in mount)+<2 months>
At the moment the known replica is calculated from settings. Introducing the database means that there can be intentionally or unintentionally conflict between settings and data in database. This needs to be checked at the start-up.
Note that the update of a replicafileinfo record needs to take into account that a segment can be changed or a copy can exist (see also assignment B.2.6)
Note also that we would like to register historical data on the checksum and file-list. It must be investigated whether we to do this in a history table (with back-up) in order to have quick access to historical data, or we wish to keep the information in logs.
All data in the database is reproducible from settings and replicas.
Reproducing data would then be done via the following steps:
read replicas from settings
create replica entries on basis of read settings including replica_id
for each replica entry { <br>
for each file_name in <all file names from replica entry> {
if (File entry does not allready exist for file_name) {
create File entry with file_name and new file_id
}
let
replcia_id = replica_id from iteration
file_id be file_id from File
create replicaFileInfo entry with
replica_id, file_id from above
checksum = unknown
checksum_checkdate = unknown
filelist_status = ok
filelist_checkdate = <now>
upload_status = <look up in admin.data>
}
for each File entry belonging to actual replica entry {
find checksum of
file_name from given File entry
on the actual replica entry
update replicaFileInfo corresponding to given file_id and replica_id with
checksum = found checksum
checksum_checkdate = <now>
}
}
Reproduction of admin.data can now also be done on basis of the BitPreservation database (if it exists).

Further acitivities are:

Create a create-script like we have for the HarvestDefinition database.

Update the developer documentation to contain info about this database.

Create a DAO for this database. It must contain the following methods:

interface BitPreservationDAO {
    /** Given the output of a checksum job, add the results to the database.
      * NOTE: the Checksum version of Replica must be implemented with output
      *       in the same form as checksumJobOutput for implementation of
      *       bitArchive replicas
      * @param checksumJobOutput The output of a checksum job.
      * @param replica The replica this checksum job is for.
      */
      void addChecksumInformation(File checksumjobOutput, Replica replica);

    /** Given the output of a file list job, add the results to the database.
      * NOTE: the Checksum version of Replica must be implemented with output
      *       in the same form as filelistJobOutput for implementation of
      *       bitArchive replicas
      * @param filelistJobOutput The output of a filelist job.
      * @param replica The replica this filelist job is for.
      */
      void addFileListInformation(File filelistjobOutput, Replica replica);

    /** Return files with status COMPLETE for the replica , but not present in the last missing
      * files job.
      * This is done by querying the database for files with no or different update date from
      * the last known update date for bitarchive, but which are present from admin data.
      * @param replica The replica to check for.
      */
      Iterable getMissingFilesInLastUpdate(Replica replica);

    /** Return files with status COMPLETE for the replica, but not present in the last missing
      * files job.
      * This is done by querying the database for files with different checksum from the
      * checksum in the last known update date for bitarchive, but which are present from admin
      * data.
      * @param replica The replica to check for.
      */
      Iterable getWrongFilesInLastUpdate(Replica replica);

    /** Return the count of missing files for replica.
      * @param replica The replica to get the count for.
      */
    int getNumberOfMissingFilesInLastUpdate(Replica replica );

    /** Return the count of changed files for replica.
      * @param replica The replica to get the count for.
      */
    int getNumberOfWrongFilesInLastUpdate(Replica replica);

    /** Get the date for the last file list job.
      * @param replica The replica to get the date for.
      */
    Date getDateOfLastMissingFilesUpdate(Replica replica);

    /** Get the date for the last file list job.
      * @param replica The replica to get the date for.
      */
    Date getDateOfLastWrongFilesUpdate(Replica replica);
}

Note that we need to move the database classes (DBConnect; DBSpecifics and subclasses) from dk.netarkivet.harvester.datamodel to dk.netarkivet.common.utils.db, since they are now used by two modules.
Special thought needs to be put in to database backup. Currently it is done by the HarvestScheduler, but that should probably be moved to it's own timer thread somewhere in dk.netarkivet.common.utils.db.
Implement ActiveBitPreservation using the DAO methods.
For instance, getMissingFiles would be implemented by
Sending a FileListJob (for all replicas including checksum replicas)
Storing the result in the database
Returning the result of getMissingFiles from the DAO

Estimated time

Estimator

Risk factor

8 md

KFC+ELZI

2

Implementation Of B.2.2

The database is initialised by checking whether all the replicas have been created, and then creating the missing ones.
When a new file is added to the database the replicafileinfo for each replica is created.

Correct

The correct method will not be done through a 'CorrectMessage', but instead in the following way:
Send a 'RemoveAndGetMessage' for the file to the replica. This way it is validated whether the entry in the archive actually has a wrong checksum.
Then a valid file is uploaded through a 'UploadMessage' to the replica.

GetAllFilenames

This is the checksum replicas alternative to running a FileListJob on a bitarchive. It extracts all the filenames and puts them into a RemoteFile, which is returned through the reply message.

GetAllChecksums

This is the checksum replicas alternative to running a ChecksumJob on a bitarchive. It extracts the filename and checksum for all the entries in its archive, puts them into a RemoteFile, which is returned through the reply of the message.

Generel changes needed for this assignment (found by review)

Change the method for retrieving filenames and checksums from Bitarchives (use GetAllChecksumsMessage and GetAllFilenamesMessage).
Implement CorrectMessage for both the ChecksumArchive and the BitArchive, and then deprecate the old method ('GetAndRemoveFile' followed by 'Upload').
Move the classes for handling databases to common (since the DatabaseBasedActiveBitPreservation also will be using them).
Make LocalArcRepositoryClient able to use the checksum archive.
Remodel the channels so only one replica channels is public available (only the BitarchiveMonitors should know of the ALL_BA and ANY_BA channels).
Create one package for the archive messages (they are currently placed in either dk.netarkivet.archive.bitarchive.distribute or dk.netarkivet.archive.checksum.distribute).
Ensure that Reestablish file will not try to retrieve the file from a checksum replica.

Subtask involved in this assignment

B.2.2b opgave inddeling
Create database. DONE (2 md)
Described in the Assignment.
Make interface to database. DONE (5 md)
A class for accessing the different parts of the database (input/output).
Make bit preservation based on the database (DatabaseBasedActiveBitPreservation). DONE (5 md)
Opdate the database based on the results of messages from replicas (cannot handle correction/upload, since it at the moment requires admin.data)
Expand Deploy to automatically install a bitpreservation database. DONE (2 md)
The bitpreservation database has to be deploy the same way as the harvest-definition database (bpdb.jar instead of fullhddb.jar).
Make a common message interface for communication with the replicas. DONE (5 md)
The Bitarchive replica needs to be able to handle the same kind of message as the Checksum replica (GetChecksumMessage, GetAllFilenamesMessage, GetAllChecksumsMessage, and CorrectMessage).
Change ArcRepository, FileBasedActiveBitPreservation and DatabaseBasedActiveBitPreservation to only use these messages (stop using batch!).
Change ArcRepository to not be dependent on admin.data, but be able to use the database instead. DONE (5 md)
Requires a new database access (e.g. External Derby or MySQL), since the current access only allows a single application at the time, which will conflict with both the ArcRepository and the DatabaseBasedActiveBitPreservation (located on GUIApplication) using the same database).
Create as settings to avoid destroying existing functionality, but instead allow other options.
Remove admin.data completely. DONE (2 md)
Make a tool for ingest admin.data into the database.
Make a new bitpreservation homepage design. DONE (5 md)
The current design for bitpreservation is dependent on admin.data, and has to be changed.
Co-operate with DatabaseBasedActiveBitPreservation and FileBasedActiveBitPreservation.
LocalArcRepositoryClient handling of DatabaseBasedActiveBitPreservation? DONE (4 md)
TOTAL: DONE

Assignment B.2.3 - Use segments in bitarchives

A segment in a bitarchive is defined as one bitarchive directory on one server. It is represented with a URL like:
mountpoint://sb-prod-bar-001.statsbiblioteket.dk:7676/netarkiv/0003
That is
mountpoint://<<host>>:<<appid>>/<<bitarchivedir>>
We will extend the system to know the position of a file by segment, in the following places:
all replicas
upload replies
Batch jobs
The bit preservation database
The bit preservation operations

All Replicas

Extend with method that given a file name will return the segment this file exists in. For bitarchive replicas call BitarchiveAdmin, for checksum replicas return "".
Futhermore, for bitarchives, write a package local method in BitarchiveAdmin, that given a file name will return the segment this file exists in.
Change the signature of Birarchive.upload() to return a URL. On successful uploads, return the segment.

Upload replies

Extend UploadMessage to include a field "URL segment". In BitarchiveServer.upload, set this field to the result of Bitarchive.upload() on success.
In ArcRepository, extend onUpload to extract the segment URL on success, and pass it on to the BitPreservation database
Note: that the update of a !replicaFileInfo record needs to take into acount that a segment can be changed or a copy can exist (see also assignment B.2.6)
Note: we need to check whether it is necessary to add segment to ArcRepositoryEntry and ArchiveRecord (A Heritrix class)

Batch jobs

Batch jobs should be extended in two ways:
A specialised subclass of FileBatchJob must be implemented that only considers those files that are part of given segments. This should be fairly easy to implement, using the same mechanism as FileBatchJob.processOnlyFilesNamed(...), and extending BatchLocalFiles to use that mechanism when present.
The output of FileListJob and ChecksumJob should be extended to report the segment as well.
This information should be used in BitPreservation database (old AdminData) too.

Bit Preservation Checks

Checks are done by updates of and queries on replicafileinfo entries in the database.

In the following the decisions on actions for files in result of file-list jobs and checksum jobs will be done considering the following cases (each indicating action to take for the particular (file_id, replica_id, segment_id)

File_id

Replica_id

Segment_id

Action

1.

match

match

match

Update entry

2.

match

No match

match

Error (inconsistent database)

3.

match

No match

No match

Insert new entry (none for this replica)

4.

match

match

No match

Doublet, moved file or missing file

5.

No match

No match

No match

Insert new entry

6.

No match

match

No match

Insert new entry (none for this replica)

7.

No match

No match

match

Insert new entry (none for this replica)

8.

No match

match

match

Insert new entry (none for this replica)

File list jobs

A file-list job on a segment belonging to a bitarchive will result in following updates/insertions of replicafileinfo entries (per file):

If a matching replicafileinfo entry can be found on bitarchive (for replicaid AND fileid) then
If a matching replicafileinfo entry can be found on the segment then Update
filelist_status=true filelist_checkdatetime=<file list job execution time> upload_status=UPLOAD_COMPLETED
Else (none on the segment)
here - as first implementation - we should create a new entry (values shown below). However this this case covers two different possible diagnostics:
a doublet (see assignment section 2.6 – must be analysed further) or
a moved file (to new segment – must be analysed further)
Else
create a new entry (values shown below). Note that we here assume that it is not caused by inconsistent database where segment suddently belong to another bitArchive (see above table)

values for insertion is listed here:

replica_id=1 file_id=file1 segment_id=A checksum=<empty> upload_status=UPLOAD_COMPLETED filelist_status=true filelist_checkdatetime=<file list job execution time> checksum_checkdatetime=<empty>

When this is done for all files in the file-list job on a segment, the following must be done:

Update all replicafileinfo for segment with filelist_checkdatetime != <file list job execution time> :

filelist_status=false filelist_checkdatetime=<file list job execution time> Update Segment filelist_checkdatetime=<file list job execution time>

Note that the information of filelist_checkdatetime for segment is an uptimization. The same result could be gained with the query: min(filelist_checkdatetime) on replicafileinfos for the segment.

Now a missing file list status for bitarchive a would be:

Files with filelist_status=false UNION
Files represented in replicafileinfo with status UPLOAD_COMPLETED that do not exist with a replica_id for bitarchive 1

Checksum jobs

A checksum-list job on a segment belonging to a bitarchive will result in following updates/insertions of replicafileinfo entries (per file):
Treat is as for the file-list job described above, but where
update includes following
checksum=<checksum returned for file> upload_status=UPLOAD_COMPLETED filelist_status=true checksum_checkdatetime=<checksum job execution time>
create a new entry includes following
replica_id=1 file_id=file1 segment_id=A checksum=<checksum returned for file> upload_status=UPLOAD_COMPLETED filelist_status=true checksum_checkdatetime=<checksum job execution time>
Note that it can be discussed what relevance the file list job has, and whether filelist_checkdatetime should be updated. One argument for the file list job is that the checksum job may fail on some files (the file list job is more simple). An argument for setting file_list status is that when there is a checksum, there is also a file, - furthermore the procedure for different cases of existens is the same. However, if we still want file_list jobs to exist, it will be good to be able to distinguish whether the existens of the file was found via a checksum job or a filelist job, thus it is only the checksum_checkdatetime that are set.
When the above checksum updates are made then
Update Segment
checksum_checkdatetime=<file list job execution time>
Note that the information of checksum_checkdatetime for segment is an uptimization. The same result could be gained with the query: min(checksum_checkdatetime) on replicafileinfos for the segment.

The Bit Preservation Database

Once these are working and correct, we can add the segment as a field in the database.
The DAO-methods to update the database must be extended to store this field as well.
The get-methods from the DAO need to be updated in the following way:
Each of the methods must be overloaded to have a version that takes a segment as an extra parameter. These overloaded methods must behave exactly as the old methods, except all queries are restricted to query only for files in this segment.
However the original methods are more tricky. Where previously they only needed to know if the files were present/has correct checksum in the last update from a specific job, they now need to take into account that it is the last update of that particular segment.

The Bit Preservation Operations

Obviously, the ActiveBitPreservation interface now also needs to have its get() methods overloaded, to make it possible to specify a segment.

Estimated time

Estimator

Risk factor

6 md

KFC+ELZI

3

Assignment B.2.4 - Write BitPreservation scheduler

The datamodel

The datamodel for a bit preservation scheduler is very similar to that of the harvest definition scheduler.
The datamodel has:
A BitPreservationDefinition
A BitPreservationJob
A Schedule
A BitPreservationDefinition defines: A name, a comment, a schedule, a bitarchive, which segments to check (choice of all, random or specified), and the kind of job (Missing files or checksum). It will also contain an isActive, numEvents and nextDate field, defined as in HarvestDefinition and PartialHarvest.
A BitPreservationJob is one run of a BitPreservationDefinition. How a run is done is defined below. It also contains the results of the run (i.e. how many and which files were damaged) and whether any errors have been corrected.
The Schedule can be reused from HarvestDefinition.
Write DAOs and objects for the datamodel. Use inspiration from the equivalent classes in HarvestDefinition.
Also write the BitPreservationDefinition.createJob() method that generates a job from a BitPreservationDefinition. Remember to update nextDate afterwards.
Note: This means we will need to move the Schedule class and DAOs as well as the Named interface and NamedUtils to dk.netarkivet.common.datamodel.
Note: ScheduleDBDao.mayDelete checks the table 'partialharvests'. We may wish to rewrite that statement to check it only if that table is present. It should now also check on the bitpreservationdefinition table.

The scheduler

The way the harvest scheduler is currently written, it does not seem to be reasonable to try to make a common super class. However it will be valuable to see how the HarvestScheduler is written.
Make a new singleton BitPreservationScheduler.
Copy the method HarvestDefinitionDAO.generateJobs to the scheduler class (why is this method in HarvestDefinitionDAO anyway?) and change it in the obvious way. You will also need to copy the method !HarvestDefinitionDAO.getReadyHarvestDefinitions() to BitPreservationDefinitionDAO.getReadyBitPreservationDefinitions().
Make a method submitNewJobs(). It should look for every newly generated job, and for each of them do the following in a separate thread:
Send the batch job defined by the BitPreservationDefinition (missing/checksum, specified/random/all segments)
Wait for reply, and add these to the database
Check for missing/wrong files in the segments we just got reply for
Write the results to the job history
If there were any errors, send a notification
Make a timerthread that starts every minute. In that task, call "generateJobs()" and then "submitJobs()"
Note that means that the methods findMissingFiles and findChangedFiles must be removed from dk.netarkivet.archive.arcrepository.bitpreservation.ActiveBitPreservation and its implementation dk.netarkivet.archive.arcrepository.bitpreservation.FileBasedActiveBitPreservation, - or there functionality must be replaced by starting a job that can be monitored with the other jobs

Estimated time

Estimator

Risk factor

5 md

KFC+ELZI

3

Assignment B.2.5 - Write BitPreservation webinterface

See the mockup for inspiration.
Status page (see mockup)
BitPreservationDefinition page
Schedule definition page (can be copied verbatim as is)
List of all jobs
Job details for job with missing files (see mockup)
Job details for job with wrong files (see mockup)
The most interesting pages are the two detail pages for jobs with failed files.
The intention from the mockup is that the details are generated from the database.
The links will call a getFile on the replica to see the contents (here the type of the replica is important, since it is only on bitarchive replicas you can use the get operations). This is especially important in case of differing checksum data.
When you select to do an update, the relevant methods from ActiveBitPreservation are called. Note that you can only upload FROM bitarchive replicas (not checksum replicas).
The status is then updated in the status from the job! It is important that it is obvious which problems have been resolved. Possibly there should be a "recheck"-button (for when a mountpoint goes back online).
In the final design it is important to take the following into account (which is not included in the mock-up) The following pages will be needed:
the layout shoud be independent of the number of checksum and bitarchive replicas
it should be part of the design that upload is only possible from bitarchive replicas
it should be part of the design that viewing file contents is only possible from bitarchive replicas
it should be considered where to put recheck buttons

Estimated time

Estimator

Risk factor

6 md

KFC+ELZI

3

Assignment B.2.6 - Handling doublets

In rare cases it can happen that a file is uploaded twice on two different instances. This is the case when the first upload completes at the bitarchive instance, but for some reason cannot deliver the message of the completion.
In these cases there will be two entries in the BitPresevation database, and this case must be handled separatly.
Note that this case must be taken into account when updating af replicaFileInfo entry. A different segment can either mean it is a dublet, or that the mount address has changed.

Estimated time

Estimator

Risk factor

5 md

KFC+ELZI

3

	File_id	Replica_id	Segment_id	Action
1.	match	match	match	Update entry
2.	match	No match	match	Error (inconsistent database)
3.	match	No match	No match	Insert new entry (none for this replica)
4.	match	match	No match	Doublet, moved file or missing file
5.	No match	No match	No match	Insert new entry
6.	No match	match	No match	Insert new entry (none for this replica)
7.	No match	No match	match	Insert new entry (none for this replica)
8.	No match	match	match	Insert new entry (none for this replica)