Support of Data Analysis

Data analysis of the archive content can be performed through the batch functionality.

This assignment has been split into the following sub-parts:

Batchjob

By creating the necessary batchjobs for extracting and processing the desired data, it is possible to perform the desired analysis of the data in the archive. A batchjob extracts the data from the requested files in each individual bitarchive; this data is then combined and processed into the desired statistics before the results are returned to the user.

Currently these statistical results have to be converted manually into whatever presentation the user wants, e.g. using a spreadsheet tool such as Microsoft Excel or OpenOffice Calc.

Basically, this requires a different batchjob for each statistical analysis to be performed, e.g. one batchjob for extracting the metadata, another for extracting the hosts, etc.

Estimate: Depends heavily on the purpose of the batchjob.

Example: A batchjob for retrieving statistics about mimetypes

A batchjob which retrieves the mimetype of every arc-record and calculates how often each unique mimetype occurs; see the sketch below.

Estimate: 1 md.

Uncertainty: A (such a batchjob already exists)
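
A minimal sketch of what such a job could look like is shown below. It assumes NetarchiveSuite's ARCBatchJob base class with initialize/processRecord/finish callbacks and the Heritrix ARCRecord class; exact package names and signatures may differ between versions, and the existing mimetype job may of course look different. The output is one "mimetype count" line per unique mimetype; since each bitarchive replica produces its own output, these lines still have to be merged in post-processing.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.HashMap;
    import java.util.Map;

    import org.archive.io.arc.ARCRecord;

    import dk.netarkivet.common.utils.arc.ARCBatchJob;

    /**
     * Sketch: counts how often each unique mimetype occurs in the
     * processed arc-records. Base class and record API are assumptions;
     * adjust to the NetarchiveSuite version in use.
     */
    public class MimetypeStatisticsJob extends ARCBatchJob {
        private Map<String, Long> counts;

        public void initialize(OutputStream os) {
            counts = new HashMap<String, Long>();
        }

        public void processRecord(ARCRecord record, OutputStream os) {
            String mimetype = record.getMetaData().getMimetype();
            Long current = counts.get(mimetype);
            counts.put(mimetype, current == null ? 1L : current + 1L);
        }

        public void finish(OutputStream os) {
            try {
                // One "mimetype count" line per unique mimetype.
                for (Map.Entry<String, Long> entry : counts.entrySet()) {
                    os.write((entry.getKey() + " " + entry.getValue() + "\n").getBytes());
                }
            } catch (IOException e) {
                throw new RuntimeException("Could not write mimetype statistics", e);
            }
        }
    }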

Performance enhancement

Currently the only way to speed up the extraction of data is to limit the number of files from which the data is extracted. This is done with regular expression patterns restricting which files the batchjob is executed on. Besides limiting the batchjob to a single file, the restriction can be applied in three ways: to the metadata files only, to specific harvest ids, or to files with a specific date stamp.
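
As an illustration, the sketch below shows what such limiting patterns could look like in Java. The file naming convention assumed here (a job id prefix, "-metadata-" in metadata file names, a date stamp in the name) is an assumption and has to be adjusted to the actual file names in the archive.

    import java.util.regex.Pattern;

    /**
     * Illustrative filename patterns for limiting which files a batchjob
     * is executed on. The naming convention is assumed, not authoritative.
     */
    public class BatchFilePatterns {
        /** Only metadata files. */
        public static final Pattern METADATA_FILES =
                Pattern.compile(".*-metadata-[0-9]+\\.arc");
        /** Only files belonging to the (hypothetical) harvest/job id 42. */
        public static final Pattern HARVEST_42 =
                Pattern.compile("42-.*\\.arc");
        /** Only files carrying a 2010-12-01 date stamp in their name. */
        public static final Pattern DATED_20101201 =
                Pattern.compile(".*-20101201.*\\.arc");

        public static void main(String[] args) {
            System.out.println(METADATA_FILES.matcher("42-metadata-1.arc").matches());             // true
            System.out.println(HARVEST_42.matcher("43-1-20101201120000-00000-sb.arc").matches());  // false
        }
    }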

Daily up-to-date statistics

A system that automatically keeps statistics about the content of the archive up to date is possible.

This would require a system which, every day, checks for new files in the repository and then retrieves the desired statistical data from those files with a batchjob.

Such an up-to-date system would be closely linked to the batchjobs retrieving the specific statistical data. It would be logical to use a database to store the list of files and record which batchjobs have been executed on them, as sketched below.
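
A minimal sketch of such bookkeeping is given below, using plain JDBC with a hypothetical processed_files table (one row per file/batchjob combination). Table and column names, and the "CREATE TABLE IF NOT EXISTS" syntax, are assumptions that depend on the chosen database.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    /**
     * Sketch: bookkeeping of which batchjobs have been executed on which
     * files, so the daily run only processes new files. All names are
     * hypothetical.
     */
    public class BatchBookkeeping {
        private final Connection connection;

        public BatchBookkeeping(String jdbcUrl) throws SQLException {
            connection = DriverManager.getConnection(jdbcUrl);
            Statement stmt = connection.createStatement();
            // "IF NOT EXISTS" is not supported by every database.
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS processed_files ("
                    + "filename VARCHAR(255) NOT NULL, "
                    + "batchjob VARCHAR(255) NOT NULL, "
                    + "processed_date TIMESTAMP NOT NULL, "
                    + "PRIMARY KEY (filename, batchjob))");
            stmt.close();
        }

        /** True if the given batchjob has not yet been executed on the file. */
        public boolean needsProcessing(String filename, String batchjob) throws SQLException {
            PreparedStatement stmt = connection.prepareStatement(
                    "SELECT 1 FROM processed_files WHERE filename = ? AND batchjob = ?");
            stmt.setString(1, filename);
            stmt.setString(2, batchjob);
            boolean alreadyDone = stmt.executeQuery().next();
            stmt.close();
            return !alreadyDone;
        }

        /** Records that the batchjob has been executed on the file. */
        public void markProcessed(String filename, String batchjob) throws SQLException {
            PreparedStatement stmt = connection.prepareStatement(
                    "INSERT INTO processed_files (filename, batchjob, processed_date) "
                    + "VALUES (?, ?, CURRENT_TIMESTAMP)");
            stmt.setString(1, filename);
            stmt.setString(2, batchjob);
            stmt.executeUpdate();
            stmt.close();
        }
    }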

Estimate: 15 md

Uncertainty: C

Improved presentation of statistical results (by using JFreeChart)

JFreeChart seems like a good tool for creating charts that present data in a neat way. It provides a complete interface for creating and presenting such charts.

The default way of presenting the charts seems to be through the Java GUI API, but it should also be possible to export a chart as an image, which would be preferable since we do not use the Java GUI API.
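
A minimal sketch of such an image export is shown below, assuming JFreeChart 1.0.x (DefaultPieDataset, ChartFactory, ChartUtilities). The mimetype counts in the main method only stand in for parsed batchjob results.

    import java.io.File;
    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.jfree.chart.ChartFactory;
    import org.jfree.chart.ChartUtilities;
    import org.jfree.chart.JFreeChart;
    import org.jfree.data.general.DefaultPieDataset;

    /** Sketch: render mimetype counts as a pie chart and save it as a PNG. */
    public class MimetypeChart {
        public static void writeChart(Map<String, Long> counts, File pngFile)
                throws IOException {
            DefaultPieDataset dataset = new DefaultPieDataset();
            for (Map.Entry<String, Long> entry : counts.entrySet()) {
                dataset.setValue(entry.getKey(), entry.getValue());
            }
            JFreeChart chart = ChartFactory.createPieChart(
                    "Mimetype distribution", dataset, true, false, false);
            // Writes the chart straight to a PNG file, no GUI involved.
            ChartUtilities.saveChartAsPNG(pngFile, chart, 800, 600);
        }

        public static void main(String[] args) throws IOException {
            Map<String, Long> counts = new TreeMap<String, Long>();
            counts.put("text/html", 1200L);
            counts.put("image/jpeg", 450L);
            counts.put("application/pdf", 30L);
            writeChart(counts, new File("mimetypes.png"));
        }
    }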

The creation of these charts will be performed as post-processing of the batchjob results. Currently this post-processing requires external software, but the NetarchiveSuite GUI could be extended to present the results in this way.

It would be logical to link this presentation tool with the BatchGUI, so that the results of the batchjobs can be presented as charts. If this is to be the default, the batchjob results would have to be produced in a format that an interface to JFreeChart can handle automatically.

Estimate: 10 md.

Uncertainty: C
