Support of Data Analysis

Data analysis of the archive content can be performed through the batch functionality.

This assignment has been split into the following sub-parts:

Batchjob

By creating the necessary batchjobs for extracting and processing the desired data, it is possible to perform the desired analysis of the data in the archive. A batchjob extracts the data from the requested files in each individual bitarchive; this data is then combined and processed into the desired statistics before the results are returned to the user.

Currently these statistical results have to be converted manually into whatever presentation the user wants, e.g. using a spreadsheet tool such as Microsoft Excel or OpenOffice Calc.

Basically, this requires a different batchjob for each statistical analysis to be performed, e.g. one batchjob for extracting the metadata, another for extracting the hosts, etc.

Estimate: Depends heavily on the purpose of the batchjob.

Example: A batchjob for retrieving statistics about mimetypes

A batchjob which retrieves the mimetype of every arc-record and calculates how often each unique mimetype occurs; see the sketch below.

Estimate: 1 md.

Uncertainty: A (such a batchjob already exists)
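
A minimal sketch of what such a job could look like is shown below. It assumes NetarchiveSuite's ARCBatchJob base class with initialize/processRecord/finish callbacks and the Heritrix ARCRecord class; exact package names and signatures may differ between versions, and the existing mimetype job may of course look different. The output is one "mimetype count" line per unique mimetype; since each bitarchive replica produces its own output, these lines still have to be merged in post-processing.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.HashMap;
    import java.util.Map;

    import org.archive.io.arc.ARCRecord;

    import dk.netarkivet.common.utils.arc.ARCBatchJob;

    /**
     * Sketch: counts how often each unique mimetype occurs in the
     * processed arc-records. Base class and record API are assumptions;
     * adjust to the NetarchiveSuite version in use.
     */
    public class MimetypeStatisticsJob extends ARCBatchJob {
        private Map<String, Long> counts;

        public void initialize(OutputStream os) {
            counts = new HashMap<String, Long>();
        }

        public void processRecord(ARCRecord record, OutputStream os) {
            String mimetype = record.getMetaData().getMimetype();
            Long current = counts.get(mimetype);
            counts.put(mimetype, current == null ? 1L : current + 1L);
        }

        public void finish(OutputStream os) {
            try {
                // One "mimetype count" line per unique mimetype.
                for (Map.Entry<String, Long> entry : counts.entrySet()) {
                    os.write((entry.getKey() + " " + entry.getValue() + "\n").getBytes());
                }
            } catch (IOException e) {
                throw new RuntimeException("Could not write mimetype statistics", e);
            }
        }
    }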

Performance enhancement

Currently the only way to speed up the extraction of data is to limit the number of files from which the data is extracted. This is done with regular expression patterns restricting which files the batchjob is executed on. Besides limiting the batchjob to a single file, the restriction can be applied in three ways: to the metadata files only, to specific harvest ids, or to files with a specific date stamp.
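
As an illustration, the sketch below shows what such limiting patterns could look like in Java. The file naming convention assumed here (a job id prefix, "-metadata-" in metadata file names, a date stamp in the name) is an assumption and has to be adjusted to the actual file names in the archive.

    import java.util.regex.Pattern;

    /**
     * Illustrative filename patterns for limiting which files a batchjob
     * is executed on. The naming convention is assumed, not authoritative.
     */
    public class BatchFilePatterns {
        /** Only metadata files. */
        public static final Pattern METADATA_FILES =
                Pattern.compile(".*-metadata-[0-9]+\\.arc");
        /** Only files belonging to the (hypothetical) harvest/job id 42. */
        public static final Pattern HARVEST_42 =
                Pattern.compile("42-.*\\.arc");
        /** Only files carrying a 2010-12-01 date stamp in their name. */
        public static final Pattern DATED_20101201 =
                Pattern.compile(".*-20101201.*\\.arc");

        public static void main(String[] args) {
            System.out.println(METADATA_FILES.matcher("42-metadata-1.arc").matches());             // true
            System.out.println(HARVEST_42.matcher("43-1-20101201120000-00000-sb.arc").matches());  // false
        }
    }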

Daily up-to-date statistics

A system that automatically keeps statistics about the content of the archive up to date is possible.

This would require a system which, every day, checks for new files in the repository and then retrieves the desired statistical data from those files with a batchjob.

Such an up-to-date system would be closely linked to the batchjobs retrieving the specific statistical data. It would be logical to use a database to store the list of files and record which batchjobs have been executed on them, as sketched below.
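
A minimal sketch of such bookkeeping is given below, using plain JDBC with a hypothetical processed_files table (one row per file/batchjob combination). Table and column names, and the "CREATE TABLE IF NOT EXISTS" syntax, are assumptions that depend on the chosen database.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    /**
     * Sketch: bookkeeping of which batchjobs have been executed on which
     * files, so the daily run only processes new files. All names are
     * hypothetical.
     */
    public class BatchBookkeeping {
        private final Connection connection;

        public BatchBookkeeping(String jdbcUrl) throws SQLException {
            connection = DriverManager.getConnection(jdbcUrl);
            Statement stmt = connection.createStatement();
            // "IF NOT EXISTS" is not supported by every database.
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS processed_files ("
                    + "filename VARCHAR(255) NOT NULL, "
                    + "batchjob VARCHAR(255) NOT NULL, "
                    + "processed_date TIMESTAMP NOT NULL, "
                    + "PRIMARY KEY (filename, batchjob))");
            stmt.close();
        }

        /** True if the given batchjob has not yet been executed on the file. */
        public boolean needsProcessing(String filename, String batchjob) throws SQLException {
            PreparedStatement stmt = connection.prepareStatement(
                    "SELECT 1 FROM processed_files WHERE filename = ? AND batchjob = ?");
            stmt.setString(1, filename);
            stmt.setString(2, batchjob);
            boolean alreadyDone = stmt.executeQuery().next();
            stmt.close();
            return !alreadyDone;
        }

        /** Records that the batchjob has been executed on the file. */
        public void markProcessed(String filename, String batchjob) throws SQLException {
            PreparedStatement stmt = connection.prepareStatement(
                    "INSERT INTO processed_files (filename, batchjob, processed_date) "
                    + "VALUES (?, ?, CURRENT_TIMESTAMP)");
            stmt.setString(1, filename);
            stmt.setString(2, batchjob);
            stmt.executeUpdate();
            stmt.close();
        }
    }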

Estimate: 15 md

Uncertainty: C

Improved presentation of statistical results (by using JFreeChart)

JFreeChart seems like a good tool for creating charts that present data in a neat way. It provides a complete interface for creating and presenting such charts.

The default way of presenting the charts seems to be through the Java GUI API, but it should also be possible to export a chart as an image, which would be preferable since we do not use the Java GUI API.
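
A minimal sketch of such an image export is shown below, assuming JFreeChart 1.0.x (DefaultPieDataset, ChartFactory, ChartUtilities). The mimetype counts in the main method only stand in for parsed batchjob results.

    import java.io.File;
    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.jfree.chart.ChartFactory;
    import org.jfree.chart.ChartUtilities;
    import org.jfree.chart.JFreeChart;
    import org.jfree.data.general.DefaultPieDataset;

    /** Sketch: render mimetype counts as a pie chart and save it as a PNG. */
    public class MimetypeChart {
        public static void writeChart(Map<String, Long> counts, File pngFile)
                throws IOException {
            DefaultPieDataset dataset = new DefaultPieDataset();
            for (Map.Entry<String, Long> entry : counts.entrySet()) {
                dataset.setValue(entry.getKey(), entry.getValue());
            }
            JFreeChart chart = ChartFactory.createPieChart(
                    "Mimetype distribution", dataset, true, false, false);
            // Writes the chart straight to a PNG file, no GUI involved.
            ChartUtilities.saveChartAsPNG(pngFile, chart, 800, 600);
        }

        public static void main(String[] args) throws IOException {
            Map<String, Long> counts = new TreeMap<String, Long>();
            counts.put("text/html", 1200L);
            counts.put("image/jpeg", 450L);
            counts.put("application/pdf", 30L);
            writeChart(counts, new File("mimetypes.png"));
        }
    }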

The creation of these charts will be performed as post-processing of the batchjob results. Currently this post-processing requires external software, but the NetarchiveSuite GUI could be extended to present the results in this way.

It would be logical to link this presentation tool with the BatchGUI, so that the results of the batchjobs can be presented as charts. If this is to be the default, the batchjob results would have to be produced in a format that an interface to JFreeChart can handle automatically.

Estimate: 10 md.

Uncertainty: C
