Future development tasks in NetarchiveSuite (draft)

The following list is open for discussion (and the list is not prioritized!)

1) Heritrix 3. Wait for Heritrix 3.0.1/2. Kristinn Sigurdsson is working on a deduplicator module for H3; an alpha/beta version has already been released on Sourceforge (http://deduplicator.sourceforge.net/release3.html). I note with some trepidation the following comment about changes in memory usage: "Please note that changes in the Lucene library mean that memory usage will be approximately 40% greater than before"!

2) Redundancy in NetarchiveSuite, to implement graceful degradation instead of JVM crashes such as those we have recently seen with the BitarchiveMonitors that oversee the communication with the distributed bitarchive servers.

3) Support for other message broker systems. Currently NetarchiveSuite supports only one type (Open Message Queue, developed as open source by Oracle). One possibility is using the Advanced Message Queuing Protocol (AMQP), which the LiWA Video Capture software uses and which RabbitMQ supports. Could one serialize our Message objects and send them through
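A minimal sketch of the serialization idea in item 3, assuming the message classes implement java.io.Serializable. AMQP message bodies are opaque byte arrays, so a message would first have to be turned into bytes and restored on the receiving side. The names MessageCodec and DemoMessage are hypothetical stand-ins, not NetarchiveSuite classes, and no broker client is involved here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical helper: converts a Serializable message to a byte[] payload and back. */
final class MessageCodec {

    /** Stand-in for a real message class (assumption: real messages are Serializable). */
    static final class DemoMessage implements Serializable {
        private static final long serialVersionUID = 1L;
        final String replyTo;
        final String body;

        DemoMessage(String replyTo, String body) {
            this.replyTo = replyTo;
            this.body = body;
        }
    }

    /** Serializes the message with standard Java object serialization. */
    static byte[] toBytes(Serializable message) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(message);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores a message from a payload produced by toBytes. */
    static Object fromBytes(byte[] payload) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Message class not on the classpath", e);
        }
    }
}
```

The resulting byte array could then be handed to whichever broker client is chosen. Both sides would need the same message classes on their classpath.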

4) Improve index generation. Could parts of the index generation be parallelized? (Existing FR/Bugs: enable reading logs/CDX files from all bitarchive replicas, and lower/optimize the timeouts for the batch jobs that retrieve the logs/CDX files.)
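One way the parallelization asked about in item 4 might look, as a rough sketch: fetch the per-job data concurrently with a fixed thread pool and collect the results in submission order. The class ParallelIndexSketch and the fetcher function are hypothetical; how the logs or CDX data are actually retrieved is not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch: run one fetch per job ID on a fixed-size thread pool. */
final class ParallelIndexSketch {

    /**
     * Applies fetcher to every job ID using up to 'threads' worker threads.
     * Results are returned in the same order as jobIds.
     */
    static <T> List<T> fetchAll(List<Long> jobIds, Function<Long, T> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> pending = new ArrayList<>();
            for (Long jobId : jobIds) {
                Callable<T> task = () -> fetcher.apply(jobId);
                pending.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : pending) {
                results.add(future.get()); // blocks until this fetch has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A fetch failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

For example, `ParallelIndexSketch.fetchAll(jobIds, id -> fetchCdx(id), 4)` would run at most four fetches at a time, where `fetchCdx` is a placeholder for the real retrieval step.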

5) The Hibernate framework has been introduced in the wayback package. Maybe we should use Hibernate in all DAOs. However, the introduction of Hibernate and its dependencies on other packages has currently

6) Introduction of WARC support in NetarchiveSuite. An effort is already underway to implement basic WARC support, enabling upload of WARC files to a NAS archive and retrieval of WARC records.

7) Support for data analysis of metadata ARC files (mimetypes, hosts) and maybe also of ARC files (content of some specific file types). There are probably already scripts and batch jobs doing this outside NetarchiveSuite. Do you think these functionalities can be integrated into NetarchiveSuite, giving daily up-to-date statistics based on harvest definitions or on different runs of definitions? Graphical output alongside the statistical data would be a nice add-on (jfree.org). Or is it better to keep this outside the suite, with tools that just read the NetarchiveSuite database to get the expected results? How can that be done in a smart way (maybe by caching snippets from ARC files), given that reading ARC files is a very expensive task?
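The caching idea in item 7 could be sketched roughly as follows: compute a per-file mimetype summary once, keep it, and build the overall statistics from the cached summaries so that each ARC file is read at most once. MimeStatsSketch and its reader function are hypothetical; the reader stands in for whatever actually extracts the mimetypes from a file.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical sketch: per-file mimetype counts, cached to avoid re-reading files. */
final class MimeStatsSketch {
    private final Map<String, Map<String, Long>> perFileCache = new HashMap<>();
    private final Function<String, List<String>> reader;

    /** Number of times the (expensive) reader was actually invoked. */
    int expensiveReads = 0;

    /** reader maps a file name to the mimetypes of the records in that file. */
    MimeStatsSketch(Function<String, List<String>> reader) {
        this.reader = reader;
    }

    /** Returns the mimetype counts for one file, reading it only on first use. */
    Map<String, Long> countsForFile(String fileName) {
        Map<String, Long> counts = perFileCache.get(fileName);
        if (counts == null) {
            expensiveReads++;
            counts = new TreeMap<>();
            for (String mimetype : reader.apply(fileName)) {
                counts.merge(mimetype, 1L, Long::sum);
            }
            perFileCache.put(fileName, counts);
        }
        return counts;
    }

    /** Sums the cached per-file counts over the given files. */
    Map<String, Long> totals(List<String> fileNames) {
        Map<String, Long> total = new TreeMap<>();
        for (String fileName : fileNames) {
            countsForFile(fileName).forEach((mimetype, n) -> total.merge(mimetype, n, Long::sum));
        }
        return total;
    }
}
```

In a real setup the cache would presumably live in a database table or on disk rather than in memory, so that daily statistics runs only need to read files added since the previous run.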

8) ???

9) ???

10) ???

FutureTasks (last edited 2010-09-17 15:09:30 by SoerenCarlsen)