Differences between revisions 15 and 16
Revision 15 as of 2009-04-09 15:02:07
Size: 990
Editor: OnbUser
Comment:
Revision 16 as of 2009-04-23 07:43:00
Size: 1363
Comment:
Line 11: Line 11:
  * scheduling
  * responsibilities, roles of participants in a broad crawl
  * defining crawl target (number of URLs, scope, seed lists, politeness, budget...)
  * dealing with junk data
  * sorting and splitting seed lists into different jobs
  * running test crawl
Line 15: Line 19:
  * using frontier reports
  * modifying settings, creating overrides
Line 16: Line 22:
  * visual QA
  * running a patch crawl
Line 17: Line 25:

Preliminary Agenda items (proposals) for the non-technical workshop

Introduction
  • Presentation of participants
  • Expectations
  • Review/update of agenda
  • Step-by-step experience from Netarchive.dk concerning broad crawls
  • Preparation
    • scheduling
    • responsibilities, roles of participants in a broad crawl
    • defining crawl target (number of URLs, scope, seed lists, politeness, budget...)
    • dealing with junk data
    • sorting and splitting seed lists into different jobs
    • running test crawl
  • Selection of sites
  • How to manage deduplication
  • Actual impact on computing and storage
  • Experience during the crawl
    • using frontier reports
    • modifying settings, creating overrides
  • QA
    • visual QA
    • running a patch crawl
  • Metrics from the past domain crawls: how much, how many, how fast, etc.
  • Collection
  • What's a collection?
  • User management in NetarchiveSuite
  • Different sets of roles for using NetarchiveSuite
  • A simple user interface for people who are not very familiar with web archiving
  • Statistics module
  • Base for all kinds of calculations and general information about the web archive
  • Comparing results of crawls for quality control
  • Access
  • Comparison of legal basis regarding access
  • Access with Wayback
  • PreliminaryAgendaItemsNonTech (last edited 2010-08-16 10:24:58 by localhost)