Differences between revisions 5 and 7 (spanning 2 versions)
Revision 5 as of 2007-09-11 06:46:56
Size: 2250
Editor: EldZierau
Revision 7 as of 2007-09-11 06:50:49
Size: 2364
Editor: EldZierau
Line 7: Line 7:
= Roadmap = == Roadmap ==
Line 10: Line 10:
 * New Heritrix integration
= Quality Assurance =
 * '''New Heritrix integration'''
== Quality Assurance ==
Line 14: Line 14:
= Extra Metadata = == Extra Metadata ==
Line 21: Line 21:
= Harvest Tuning = == Harvest Tuning ==
Line 23: Line 23:
 * Removing seeds from database  * '''Removing seeds from database'''
Line 25: Line 25:
 * Exclude domains from snapshot harvest  * '''Exclude domains from snapshot harvest'''
Line 27: Line 27:
 * Domain implementation  * '''Domain implementation'''
Line 29: Line 29:
 * Global crawlertraps  * '''Global crawlertraps'''
Line 31: Line 31:
= Heritrix =
 * Pause/stop harvest from User Interface
== Heritrix ==
 * '''Pause/stop harvest from User Interface'''
Line 38: Line 38:
 * Sharing of crawlertraps (Heritrix)
 * Arcreader and Arcwriter separately
 * '''Sharing of crawlertraps (Heritrix)'''
 * '''Arcreader and Arcwriter separately'''
Line 42: Line 42:
 * Arcfile naming
= Infrastructure =
 * '''Arcfile naming'''
== Infrastructure ==
Line 47: Line 47:
= Database = == Database ==
Line 51: Line 51:
 * All SQL out in DBSpecific for plugin
= Open Source Procedure =
 * '''All SQL out in DBSpecific for plugin'''
== Open Source Procedure ==
Line 57: Line 57:
= User Interface =
 * More neat user interface
 * Click pattern
 * Filter job lists by category
 * Password & user nameto control settings
 * Calendar view of jobs
== User Interface ==
 * '''More neat user interface '''
 * '''Click pattern '''
 * '''Filter job lists by category '''
 * '''Password & user nameto control settings '''
 * '''Calendar view of jobs'''
Line 65: Line 65:
= End User access = == End User access ==

Yellow Notes for Potential Future Developments

Points noted as important for some participants are marked in bold

Points that need further explanation are marked with ?

Roadmap

  • Java upgrade to Java 1.6
  • Encrypting password for bit archive
  • New Heritrix integration

Quality Assurance

  • Automatic filtering of collected URLs
  • Use scope in QA ?

Extra Metadata

  • Contractual Information
  • Naming of jobs
  • Look up of ???? data to add to metadata
  • Access to Heritrix logs via Harvest definition interface
  • A way to show count of stored bytes (after deduplication) and information on deduplication
  • Subdomains from logs

Harvest Tuning

  • Possibility to inactivate domains
  • Removing seeds from database

  • Automatically check seeds
  • Exclude domains from snapshot harvest

  • Alias detection tool integrated in system
  • Domain implementation

  • QueueAssignmentPolicy part of plugin ???

  • Global crawlertraps

  • Count inline images in the same domain

Heritrix

  • Pause/stop harvest from User Interface

  • Check email address using Heritrix method
  • Internationalisation of Heritrix (not seen as important)
  • Operator in metadata
  • Heritrix timeout challenge
  • Meta language for Heritrix templates
  • Sharing of crawlertraps (Heritrix)

  • Arcreader and Arcwriter separately

  • Heritrix split
  • Use Heritrix deduplication module
  • Arcfile naming

Infrastructure

  • Take down parts and start them again
  • Use of Spring for deployment
  • Automatic start of application that is down

Database

  • Specify wanted backup directory
  • Elaborate on update procedure when database has changed
  • Oracle implementation as plugin
  • All SQL out in DBSpecific for plugin

Open Source Procedure

  • Control on formats for patches
  • What about new translations (in release information)
  • User training material
  • Documentation on listening to queues

User Interface

  • Neater user interface

  • Click pattern

  • Filter job lists by category

  • Password & user name to control settings

  • Calendar view of jobs

  • For people with special needs
  • Provenance

End User access

WorkshopSeptember2007YellowNotes (last edited 2010-08-16 10:25:11 by localhost)