Revision 7 as of 2007-09-11 06:50:49 (editor: EldZierau)

Yellow Notes for Potential Future Developments

Points noted as important by some participants are marked in bold.

Points that need further explanation are marked with ?.

Roadmap

  • Java upgrade to Java 1.6
  • Encrypting password for bit archive
  • '''New Heritrix integration'''

Quality Assurance

  • Automatic filtering of collected URLs
  • Use scope in QA ?

Extra Metadata

  • Contractual Information
  • Naming of jobs
  • Look up of ???? data to add to metadata
  • Access to Heritrix logs via Harvest definition interface
  • A way to show count of stored bytes (after deduplication) and information on deduplication
  • Subdomains from logs

Harvest Tuning

  • Possibility to inactivate domains
  • '''Removing seeds from database'''
  • Automatically check seeds
  • '''Exclude domains from snapshot harvest'''
  • Alias detection tool integrated in system
  • '''Domain implementation'''
  • QueueAssignmentPolicy part of plugin ???
  • '''Global crawlertraps'''
  • Count inline images in the same domain

Heritrix

  • '''Pause/stop harvest from User Interface'''
  • Check email address using Heritrix method
  • Internationalisation of Heritrix (not seen as important)
  • Operator in metadata
  • Heritrix timeout challenge
  • Meta language for Heritrix templates
  • '''Sharing of crawlertraps (Heritrix)'''
  • '''Arcreader and Arcwriter separately'''
  • Heritrix split
  • Use Heritrix deduplication module
  • '''Arcfile naming'''

Infrastructure

  • Take down parts and start them again
  • Use of SPRING for deploy
  • Automatic start of application that is down

Database

  • Specify wanted backup directory
  • Elaborate on update procedure when database has changed
  • Oracle implementation as plugin
  • '''All SQL out in DBSpecific for plugin'''

Open Source Procedure

  • Control on formats for patches
  • What about new translations (in release information)
  • User training material
  • Documentation on listening to queues

User Interface

  • '''Neater user interface'''
  • '''Click pattern'''
  • '''Filter job lists by category'''
  • '''Password & user name to control settings'''
  • '''Calendar view of jobs'''
  • For people with special needs
  • Provenance

End User access

WorkshopSeptember2007YellowNotes (last edited 2010-08-16 10:25:11 by localhost)