Differences between revisions 6 and 16 (spanning 10 versions)
Revision 6 as of 2007-09-11 06:48:00
Size: 2270
Editor: EldZierau
Comment:
Revision 16 as of 2010-08-16 10:25:11
Size: 3070
Editor: localhost
Comment: converted to 1.6 markup
Line 1: Line 1:
#acl NetarkivetGroup:read,write,delete,revert,admin All:
= Yellow Notes for Potential Future Developments =
Points noted as being important for some participants are marked by being bolded
= Potential Future Developments =
Line 5: Line 3:
Points that needs further explanation is marked with ?
== Yellow Notes for Potential Future Developments ==
These are the notes from the "potential future developments" board at the [[AgendaWorkshopSeptember2007|workshop 2007]].
The notes collected during the workshop were discussed and most important issues for some participants were marked.
Line 7: Line 7:
== Roadmap ==
The issues marked as important are bolded in the below listing.

=== Roadmap ===
Line 10: Line 12:
 * New Heritrix integration
== Quality Assurance ==
 * '''New Heritrix integration'''
=== Quality Assurance ===
Line 13: Line 15:
 * Use scope in QA ?
== Extra Metadata ==
 * Use scope in QA (improved integration between harvest info and QA)
=== Extra Metadata ===
Line 16: Line 18:
 * Naming of jobs
 * Look up of ???? data to add to metadata
 * Naming of jobs/groups of harvests
 * Look up of WHOIS data to add to metadata
Line 21: Line 23:
== Harvest Tuning ==
 * Possibility to inactivate domains
 * Removing seeds from database
 * Automatically check seeds
 * Exclude domains from snapshot harvest
=== Harvest Tuning ===
 * Possibility to deactivate domains
 * '''Removing seeds from database'''
 * Automatically check seeds when entered
 * '''Exclude domains from snapshot harvest'''
Line 27: Line 29:
 * Domain implementation
 * QueueAssignmentPolicy part of plugin ???
 * Global crawlertraps
 * '''Domain implementation'''
 * !QueueAssignmentPolicy part of plugin ???
 * '''Global crawlertraps'''
Line 31: Line 33:
== Heritrix ==
 * Pause/stop harvest from User Interface
=== Heritrix ===
 * '''Pause/stop harvest from User Interface'''
Line 35: Line 37:
 * Operator in metadata
 * Store operator in metadata
Line 37: Line 39:
 * Meta language for Heritix templates
 * Sharing of crawlertraps (Heritrix)
 * Arcreader and Arcwriter separately
 * Heritrix split
 * Use Heritrix deduplcation module
 * Arcfile naming
== Infrastructure ==
 * Take down parts and start them again
 * Meta language for Heritrix templates
 * '''Sharing of crawlertraps (Heritrix)'''
 * '''Arcreader and Arcwriter separately'''
 * Heritrix split of a single job
 * Use Heritrix deduplication module
 * '''Arcfile naming - should be configurable'''
=== Infrastructure ===
 * Take down applications/machines and start them again
Line 47: Line 49:
== Database ==
=== Database ===
Line 51: Line 53:
 * All SQL out in DBSpecific for plugin
== Open Source Procedure ==
 * '''All SQL out in DBSpecific/separate file for plugin'''
=== Open Source Procedure ===
Line 55: Line 57:
 * User traning material
 * Documentaiton on listen to queues
== User Interface ==
 * More neat user interface
 * Click pattern
 * Filter job lists by category
 * Password & user nameto control settings
 * Calendar view of jobs
 * User training material translated to English
 * Documentation on listen to queues
=== User Interface ===
 * Neater user interface
 * '''Click pattern '''
 * '''Filter job lists by category '''
 * '''Password & user name to control settings '''
 * '''Calendar view of jobs'''
Line 64: Line 66:
 * Provinance
== End User access ==
 * Provenance
=== End User access ===
 * '''Access to harvested data by public users'''

== "Round the table" on Potential Future Developments ==
All participants pointed out the most important issue for them to be highest priority in future development. All pointed at one of the following two issues:
 * '''Harvest Tuning: Domain implementation''''
 * '''Roadmap: New Heritrix integration'''

Potential Future Developments

Yellow Notes for Potential Future Developments

These are the notes from the "potential future developments" board at the workshop 2007. The notes collected during the workshop were discussed, and the issues that some participants considered most important were marked.

The issues marked as important are shown in bold in the listing below.

Roadmap

  • Java upgrade to Java 1.6
  • Encrypting password for bit archive
  • New Heritrix integration

Quality Assurance

  • Automatic filtering of collected URLs
  • Use scope in QA (improved integration between harvest info and QA)

Extra Metadata

  • Contractual Information
  • Naming of jobs/groups of harvests
  • Look up of WHOIS data to add to metadata
  • Access to Heritrix logs via Harvest definition interface
  • A way to show count of stored bytes (after deduplication) and information on deduplication
  • Subdomains from logs

Harvest Tuning

  • Possibility to deactivate domains
  • Removing seeds from database
  • Automatically check seeds when entered
  • Exclude domains from snapshot harvest
  • Alias detection tool integrated in system
  • Domain implementation
  • QueueAssignmentPolicy part of plugin ???
  • Global crawlertraps
  • Count inline images in the same domain

Heritrix

  • Pause/stop harvest from User Interface
  • Check email address using Heritrix method
  • Internationalisation of Heritrix (not seen as important)
  • Store operator in metadata
  • Heritrix timeout challenge
  • Meta language for Heritrix templates
  • Sharing of crawlertraps (Heritrix)
  • Arcreader and Arcwriter separately
  • Heritrix split of a single job
  • Use Heritrix deduplication module
  • Arcfile naming - should be configurable

Infrastructure

  • Take down applications/machines and start them again
  • Use of SPRING for deploy
  • Automatic start of application that is down

Database

  • Specify wanted backup directory
  • Elaborate on update procedure when database has changed
  • Oracle implementation as plugin
  • All SQL out in DBSpecific/separate file for plugin

Open Source Procedure

  • Control on formats for patches
  • What about new translations (in release information)
  • User training material translated to English
  • Documentation on listen to queues

User Interface

  • Neater user interface
  • Click pattern
  • Filter job lists by category
  • Password & user name to control settings
  • Calendar view of jobs
  • For people with special needs
  • Provenance

End User access

  • Access to harvested data by public users

"Round the table" on Potential Future Developments

Each participant named the issue they considered the highest priority for future development. Everyone chose one of the following two issues:

  • Harvest Tuning: Domain implementation
  • Roadmap: New Heritrix integration

WorkshopSeptember2007YellowNotes (last edited 2010-08-16 10:25:11 by localhost)