Yellow Notes for Potential Future Developments
Points noted as important by some participants are marked in bold.
Points that need further explanation are marked with ?
Roadmap
- Java upgrade to Java 1.6
- Encrypting password for bit archive
- '''New Heritrix integration'''
Quality Assurance
- Automatic filtering of collected URLs
- Use scope in QA ?
Extra Metadata
- Contractual Information
- Naming of jobs
- Look up of ???? data to add to metadata
- Access to Heritrix logs via Harvest definition interface
- A way to show count of stored bytes (after deduplication) and information on deduplication
- Subdomains from logs
Harvest Tuning
- Possibility to inactivate domains
- '''Removing seeds from database'''
- Automatically check seeds
- '''Exclude domains from snapshot harvest'''
- Alias detection tool integrated in system
- '''Domain implementation'''
QueueAssignmentPolicy part of plugin ???
- '''Global crawlertraps'''
- Count inline images in the same domain
Heritrix
- '''Pause/stop harvest from User Interface'''
- Check email address using Heritrix method
- Internationalisation of Heritrix (not seen as important)
- Operator in metadata
- Heritrix timeout challenge
- Meta language for Heritrix templates
- '''Sharing of crawlertraps (Heritrix)'''
- '''Arcreader and Arcwriter separately'''
- Heritrix split
- Use Heritrix deduplication module
- '''Arcfile naming'''
Infrastructure
- Take down parts and start them again
- Use of SPRING for deploy
- Automatic start of application that is down
Database
- Specify wanted backup directory
- Elaborate on update procedure when database has changed
- Oracle implementation as plugin
- '''All SQL out in DBSpecific for plugin'''
Open Source Procedure
- Control on formats for patches
- What about new translations (in release information)
- User training material
- Documentation on listening to queues
User Interface
- '''Neater user interface'''
- '''Click pattern'''
- '''Filter job lists by category'''
- '''Password & user name to control settings'''
- '''Calendar view of jobs'''
- For people with special needs
- Provenance