Differences between revisions 5 and 6
Revision 5 as of 2007-09-11 06:46:56
Size: 2250
Editor: EldZierau
Comment:
Revision 6 as of 2007-09-11 06:48:00
Size: 2270
Editor: EldZierau
Comment:

Yellow Notes for Potential Future Developments

Points noted as important to some participants are marked in bold

Points that need further explanation are marked with ?

Roadmap

  • Java upgrade to Java 1.6
  • Encrypting password for bit archive
  • New Heritrix integration

Quality Assurance

  • Automatic filtering of collected URLs
  • Use scope in QA ?

Extra Metadata

  • Contractual Information
  • Naming of jobs
  • Look up of ???? data to add to metadata
  • Access to Heritrix logs via Harvest definition interface
  • A way to show count of stored bytes (after deduplication) and information on deduplication
  • Subdomains from logs

Harvest Tuning

  • Possibility to inactivate domains
  • Removing seeds from database
  • Automatically check seeds
  • Exclude domains from snapshot harvest
  • Alias detection tool integrated in system
  • Domain implementation
  • QueueAssignmentPolicy part of plugin ???

  • Global crawlertraps
  • Count inline images in the same domain

Heritrix

  • Pause/stop harvest from User Interface
  • Check email address using Heritrix method
  • Internationalisation of Heritrix (not seen as important)
  • Operator in metadata
  • Heritrix timeout challenge
  • Meta language for Heritrix templates
  • Sharing of crawlertraps (Heritrix)
  • Arcreader and Arcwriter separately
  • Heritrix split
  • Use Heritrix deduplication module
  • Arcfile naming

Infrastructure

  • Take down parts and start them again
  • Use of SPRING for deploy
  • Automatic start of application that is down

Database

  • Specify wanted backup directory
  • Elaborate on update procedure when database has changed
  • Oracle implementation as plugin
  • All SQL out in DBSpecific for plugin

Open Source Procedure

  • Control on formats for patches
  • What about new translations (in release information)
  • User training material
  • Documentation on listening to queues

User Interface

  • Neater user interface
  • Click pattern
  • Filter job lists by category
  • Password & user name to control settings

  • Calendar view of jobs
  • For people with special needs
  • Provenance

End User access

WorkshopSeptember2007YellowNotes (last edited 2010-08-16 10:25:11 by localhost)