Differences between revisions 11 and 12
Revision 11 as of 2007-09-11 10:47:22
Size: 2527
Editor: LarsClausen
Comment:
Revision 12 as of 2007-09-11 10:51:30
Size: 2637
Editor: LarsClausen
Comment:

Yellow Notes for Potential Future Developments

These are the notes from the "potential future developments" board at the [:AgendaWorkshopSeptember2007:workshop 2007]. Points that some participants noted as important are marked in bold.

Points that need further explanation are marked with ?

Roadmap

  • Java upgrade to Java 1.6
  • Encrypting password for bit archive
  • New Heritrix integration

Quality Assurance

  • Automatic filtering of collected URLs
  • Use scope in QA (improved integration between harvest info and QA)

Extra Metadata

  • Contractual Information
  • Naming of jobs/groups of harvests
  • Look up of WHOIS data to add to metadata
  • Access to Heritrix logs via Harvest definition interface
  • A way to show count of stored bytes (after deduplication) and information on deduplication
  • Subdomains from logs

Harvest Tuning

  • Possibility to inactivate domains
  • Removing seeds from database

  • Automatically check seeds when entered
  • Exclude domains from snapshot harvest

  • Alias detection tool integrated in system
  • Domain implementation

  • QueueAssignmentPolicy part of plugin ???

  • Global crawlertraps

  • Count inline images in the same domain

Heritrix

  • Pause/stop harvest from User Interface

  • Check email address using Heritrix method
  • Internationalisation of Heritrix (not seen as important)
  • Store operator in metadata
  • Heritrix timeout challenge
  • Meta language for Heritrix templates
  • Sharing of crawlertraps (Heritrix)

  • Arcreader and Arcwriter separately

  • Heritrix split of a single job
  • Use Heritrix deduplication module
  • Arcfile naming

Infrastructure

  • Take down applications/machines and start them again
  • Use of SPRING for deploy
  • Automatic start of application that is down

Database

  • Specify wanted backup directory
  • Elaborate on update procedure when database has changed
  • Oracle implementation as plugin
  • All SQL out in DBSpecific/separate file for plugin

Open Source Procedure

  • Control on formats for patches
  • What about new translations (in release information)
  • User training material translated to English
  • Documentation on listening to queues

User Interface

  • Neater user interface
  • Click pattern

  • Filter job lists by category

  • Password & user name to control settings

  • Calendar view of jobs

  • For people with special needs
  • Provenance

End User access

WorkshopSeptember2007YellowNotes (last edited 2010-08-16 10:25:11 by localhost)