Differences between revisions 5 and 6
Revision 5 as of 2008-03-27 12:35:07
Size: 10929
Editor: LarsClausen
Comment:
Revision 6 as of 2010-08-16 10:24:08
Size: 10987
Editor: localhost
Comment: converted to 1.6 markup
Line 3: Line 3:
 * [http://library.wellcome.ac.uk/assets/WTL039229.pdf ''Collecting and preserving the World Wide Web: A feasibility study undertaken for the JISC and Wellcome Trust''], Michael Day, UKOLN, University of Bath, Version 1.0, 25 February 2003  * [[http://library.wellcome.ac.uk/assets/WTL039229.pdf|''Collecting and preserving the World Wide Web: A feasibility study undertaken for the JISC and Wellcome Trust'']], Michael Day, UKOLN, University of Bath, Version 1.0, 25 February 2003
Line 7: Line 7:
 * [http://netarchive.dk/publikationer/nhc-kb-dk-msst2006.pdf ''A formal analysis of recovery in a preservational data grid''], Niels H. Christensen, Royal Library of Denmark, Dept. of Digital Preservation & Netarkivet.dk, presented at MSST2006, the 14th NASA Goddard - 23rd IEEE Conference on Mass Storage Systems and Technologies, May 15-18, 2006, College Park, Maryland, USA  * [[http://netarchive.dk/publikationer/nhc-kb-dk-msst2006.pdf|''A formal analysis of recovery in a preservational data grid'']], Niels H. Christensen, Royal Library of Denmark, Dept. of Digital Preservation & Netarkivet.dk, presented at MSST2006, the 14th NASA Goddard - 23rd IEEE Conference on Mass Storage Systems and Technologies, May 15-18, 2006, College Park, Maryland, USA
Line 11: Line 11:
 * [http://www.dia.uniroma3.it/~vldbproc/017_129.pdf ''Crawling the Hidden Web''], Sriram Raghavan and Hector Garcia-Molina, Computer Science Department, Stanford University, Stanford, CA 94305, USA, Proceedings of the 27th VLDB Conference, Roma, Italy, 2001  * [[http://www.dia.uniroma3.it/~vldbproc/017_129.pdf|''Crawling the Hidden Web'']], Sriram Raghavan and Hector Garcia-Molina, Computer Science Department, Stanford University, Stanford, CA 94305, USA, Proceedings of the 27th VLDB Conference, Roma, Italy, 2001
Line 13: Line 13:
 * [http://nicomedia.math.upatras.gr/courses/mnets/mat/Lawrence&Giles_AccessibilityOfInformationOnTheWeb.pdf ''Accessibility of information on the web''], Steve Lawrence and C. Lee Giles, Nature, Vol 400, 8 July 1999  * [[http://nicomedia.math.upatras.gr/courses/mnets/mat/Lawrence&Giles_AccessibilityOfInformationOnTheWeb.pdf|''Accessibility of information on the web'']], Steve Lawrence and C. Lee Giles, Nature, Vol 400, 8 July 1999
Line 15: Line 15:
 * [http://www10.org/cdrom/papers/102/index.html ''Effective Web Data Extraction with Standard XML Technologies''], Jussi Myllymaki, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA, WWW10, May 2-5, 2001, Hong Kong  * [[http://www10.org/cdrom/papers/102/index.html|''Effective Web Data Extraction with Standard XML Technologies'']], Jussi Myllymaki, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA, WWW10, May 2-5, 2001, Hong Kong
Line 17: Line 17:
 * [http://kushmerick.org/nick/research/download/kushmerick-odbase2003.pdf ''Learning to invoke Web forms''], Nicholas Kushmerick, Computer Science Department, University College Dublin, Ireland, Proc. Int. Conf. Ontologies, Databases & Applications of Semantics (Catania, 2003).  * [[http://kushmerick.org/nick/research/download/kushmerick-odbase2003.pdf|''Learning to invoke Web forms'']], Nicholas Kushmerick, Computer Science Department, University College Dublin, Ireland, Proc. Int. Conf. Ontologies, Databases & Applications of Semantics (Catania, 2003).
Line 19: Line 19:
 * [http://dx.doi.org/10.1016/j.ijar.2003.07.002 ''Information retrieval in the Web: beyond current search engines''], Ricardo Baeza-Yates, Center for Web Research, Department of Computer Science, University of Chile, Blanco Encalada, 2120, Santiago, Chile, International Journal of Approximate Reasoning, Volume 34, Issues 2-3, November 2003, Pages 97-104  * [[http://dx.doi.org/10.1016/j.ijar.2003.07.002|''Information retrieval in the Web: beyond current search engines'']], Ricardo Baeza-Yates, Center for Web Research, Department of Computer Science, University of Chile, Blanco Encalada, 2120, Santiago, Chile, International Journal of Approximate Reasoning, Volume 34, Issues 2-3, November 2003, Pages 97-104
Line 21: Line 21:
 * [http://dx.doi.org/10.1016/S1389-1286(02)00213-X ''Search engines and Web dynamics''], Knut Magne Risvik, Rolf Michelsen, Fast Search and Transfer ASA, Stoperigata 2, P.O. Box 1677, Vika NO-0120 Oslo, Norway, Computer Networks, Volume 39, Issue 3, 21 June 2002, Pages 289-302  * [[http://dx.doi.org/10.1016/S1389-1286(02)00213-X|''Search engines and Web dynamics'']], Knut Magne Risvik, Rolf Michelsen, Fast Search and Transfer ASA, Stoperigata 2, P.O. Box 1677, Vika NO-0120 Oslo, Norway, Computer Networks, Volume 39, Issue 3, 21 June 2002, Pages 289-302
Line 23: Line 23:
 * [http://dx.doi.org/10.1016/S0098-7913(02)00184-3 ''Invisible Web: Uncovering Information Sources Search Engines Can't See''], Chris Sherman and Gary Price, NJ: Information Today, Inc., 2001. 439 p. $29.95. ISBN 0-910965-51-X  * [[http://dx.doi.org/10.1016/S0098-7913(02)00184-3|''Invisible Web: Uncovering Information Sources Search Engines Can't See'']], Chris Sherman and Gary Price, NJ: Information Today, Inc., 2001. 439 p. $29.95. ISBN 0-910965-51-X
Line 25: Line 25:
 * [http://portal.acm.org/citation.cfm?id=779235 ''High-performance web crawling''], Marc Najork, Compaq Computer Corporation Systems Research Center, Palo Alto, CA and Allan Heydon, Model N, Inc., South San Francisco, CA, Handbook of massive data sets, pages 25 - 45, 2002, ISBN:1-4020-0489-3  * [[http://portal.acm.org/citation.cfm?id=779235|''High-performance web crawling'']], Marc Najork, Compaq Computer Corporation Systems Research Center, Palo Alto, CA and Allan Heydon, Model N, Inc., South San Francisco, CA, Handbook of massive data sets, pages 25 - 45, 2002, ISBN:1-4020-0489-3
Line 29: Line 29:
 * [http://lis.sagepub.com/cgi/reprint/31/1/21 ''Classifying Web sites and Web pages: the use of metrics and URL characteristics as markers''], Wallace C. Koehler, Jr., University of Oklahoma, Journal of Librarianship and Information Science, Vol. 31, No. 1, 21-31 (1999)  * [[http://lis.sagepub.com/cgi/reprint/31/1/21|''Classifying Web sites and Web pages: the use of metrics and URL characteristics as markers'']], Wallace C. Koehler, Jr., University of Oklahoma, Journal of Librarianship and Information Science, Vol. 31, No. 1, 21-31 (1999)
Line 31: Line 31:
 * [http://portal.acm.org/ft_gateway.cfm?id=511484&type=pdf&coll=GUIDE&dl=GUIDE&CFID=57533127&CFTOKEN=37490285 ''Aliasing on the world wide web: prevalence and performance implications''], Terence Kelly, University of Michigan, Ann Arbor, MI and Jeffrey Mogul, Compaq Western Research Lab, Palo Alto, CA, Proceedings of the 11th international conference on World Wide Web, Honolulu, Hawaii, USA, pages 281 - 292, 2002, ISBN:1-58113-449-5  * [[http://portal.acm.org/ft_gateway.cfm?id=511484&type=pdf&coll=GUIDE&dl=GUIDE&CFID=57533127&CFTOKEN=37490285|''Aliasing on the world wide web: prevalence and performance implications'']], Terence Kelly, University of Michigan, Ann Arbor, MI and Jeffrey Mogul, Compaq Western Research Lab, Palo Alto, CA, Proceedings of the 11th international conference on World Wide Web, Honolulu, Hawaii, USA, pages 281 - 292, 2002, ISBN:1-58113-449-5
Line 33: Line 33:
 * [http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=1999-39&format=pdf&compression=&name=1999-39.pdf ''Finding replicated web collections.''], J. Cho, N. Shivakumar, H. Garcia-Molina, Proceedings of 2000 ACM International Conference on Management of Data (SIGMOD) Conference, May 2000  * [[http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=1999-39&format=pdf&compression=&name=1999-39.pdf|''Finding replicated web collections.'']], J. Cho, N. Shivakumar, H. Garcia-Molina, Proceedings of 2000 ACM International Conference on Management of Data (SIGMOD) Conference, May 2000
Line 35: Line 35:
 * [http://portal.acm.org/ft_gateway.cfm?id=1083676&type=pdf&coll=GUIDE&dl=GUIDE&CFID=18586673&CFTOKEN=47728477 ''Discovering large dense subgraphs in massive graphs''], David Gibson, Ravi Kumar, and Andrew Tomkins, IBM Almaden Research Center, San Jose, CA, Proceedings of the 31st international conference on Very large data bases, Trondheim, Norway, pages 721 - 732, ISBN:1-59593-154-6  * [[http://portal.acm.org/ft_gateway.cfm?id=1083676&type=pdf&coll=GUIDE&dl=GUIDE&CFID=18586673&CFTOKEN=47728477|''Discovering large dense subgraphs in massive graphs'']], David Gibson, Ravi Kumar, and Andrew Tomkins, IBM Almaden Research Center, San Jose, CA, Proceedings of the 31st international conference on Very large data bases, Trondheim, Norway, pages 721 - 732, ISBN:1-59593-154-6
Line 37: Line 37:
 * [http://dx.doi.org/10.1016/S1389-1286(99)00021-3 ''Mirror, mirror on the Web: a study of host pairs with replicated content''], Krishna Bharat, and Andrei Broder, Compaq Systems Research Center, 130 Lytton Avenue, Palo Alto, CA 94301, USA, Computer Networks, Volume 31, Issues 11-16, 17 May 1999, Pages 1579-1590  * [[http://dx.doi.org/10.1016/S1389-1286(99)00021-3|''Mirror, mirror on the Web: a study of host pairs with replicated content'']], Krishna Bharat, and Andrei Broder, Compaq Systems Research Center, 130 Lytton Avenue, Palo Alto, CA 94301, USA, Computer Networks, Volume 31, Issues 11-16, 17 May 1999, Pages 1579-1590
Line 39: Line 39:
 * [http://dx.doi.org/10.1002/1097-4571(2000)9999:9999<::AID-ASI1025>3.0.CO;2-0 ''A comparison of techniques to find mirrored hosts on the WWW''], Krishna Bharat, Andrei Broder, and Jeffrey Dean, Google Inc., 2400 Bayshore Ave., Mountain View, CA 94043, and Monika R. Henzinger, AltaVista Company, 1825 S. Grant St., San Mateo, CA 94402, Journal of the American Society for Information Science, Volume 51, Issue 12, Pages 1114 - 1122  * [[http://dx.doi.org/10.1002/1097-4571(2000)9999:9999<::AID-ASI1025>3.0.CO;2-0|''A comparison of techniques to find mirrored hosts on the WWW'']], Krishna Bharat, Andrei Broder, and Jeffrey Dean, Google Inc., 2400 Bayshore Ave., Mountain View, CA 94043, and Monika R. Henzinger, AltaVista Company, 1825 S. Grant St., San Mateo, CA 94402, Journal of the American Society for Information Science, Volume 51, Issue 12, Pages 1114 - 1122
Line 41: Line 41:
 * [http://doi.acm.org/10.1145/336296.336345 ''Defining Logical Domains in a Web Site''], Wen-Syan Li, Okan Kolak, Quoc Vu, and Hajime Takano, Proceedings of the eleventh ACM Conference on Hypertext and hypermedia, San Antonio, Texas, United States, pages: 123 - 132, ISBN:1-58113-227-1  * [[http://doi.acm.org/10.1145/336296.336345|''Defining Logical Domains in a Web Site'']], Wen-Syan Li, Okan Kolak, Quoc Vu, and Hajime Takano, Proceedings of the eleventh ACM Conference on Hypertext and hypermedia, San Antonio, Texas, United States, pages: 123 - 132, ISBN:1-58113-227-1
Line 45: Line 45:
 * [http://vefsofnun.bok.hi.is/upload/3/ManagingDuplicatesAcrossSequentialCrawls.pdf ''Managing duplicates across sequential crawls''], Kristinn Sigurðsson, National and University Library of Iceland, Arngrímsgötu 3, 107 Reykjavík, Iceland, Proceedings of the 5th International Web Archiving Workshop, Alicante, Spain, September 21-22, 2006.  * [[http://vefsofnun.bok.hi.is/upload/3/ManagingDuplicatesAcrossSequentialCrawls.pdf|''Managing duplicates across sequential crawls'']], Kristinn Sigurðsson, National and University Library of Iceland, Arngrímsgötu 3, 107 Reykjavík, Iceland, Proceedings of the 5th International Web Archiving Workshop, Alicante, Spain, September 21-22, 2006.
Line 47: Line 47:
 * [http://dx.doi.org/10.1016/S1389-1286(99)00037-7 ''Towards a better understanding of Web resources and server responses for improved caching''], Craig E. Wills and Mikhail Mikhailov, Computer Science Department, Worcester Polytechnic Institute, Worcester, MA 01609, USA, Computer Networks, Volume 31, Issues 11-16, 17 May 1999, Pages 1231-1243  * [[http://dx.doi.org/10.1016/S1389-1286(99)00037-7|''Towards a better understanding of Web resources and server responses for improved caching'']], Craig E. Wills and Mikhail Mikhailov, Computer Science Department, Worcester Polytechnic Institute, Worcester, MA 01609, USA, Computer Networks, Volume 31, Issues 11-16, 17 May 1999, Pages 1231-1243
Line 49: Line 49:
 * [http://workshop99.ircache.net/Papers/wills-final.ps.gz ''Examining the cacheability of user-requested Web resources''], C. E. Wills and M. Mikhailov, Proc. of the Fourth Int. Workshop on Web Content Caching and Distribution, Apr. 1999.  * [[http://workshop99.ircache.net/Papers/wills-final.ps.gz|''Examining the cacheability of user-requested Web resources'']], C. E. Wills and M. Mikhailov, Proc. of the Fourth Int. Workshop on Web Content Caching and Distribution, Apr. 1999.
Line 51: Line 51:
 * [http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-99-3.pdf ''Errors in timestamp-based HTTP header values''], Mogul, J., Tech. rep. 99/3, (Dec.) Compaq Computer Corporation, Western Research Laboratory.  * [[http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-99-3.pdf|''Errors in timestamp-based HTTP header values'']], Mogul, J., Tech. rep. 99/3, (Dec.) Compaq Computer Corporation, Western Research Laboratory.
Line 53: Line 53:
 * [http://dx.doi.org/10.1016/S0169-7552(98)00108-1 ''Efficient crawling through URL ordering''], Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, Proceedings of the Seventh International Conference on World Wide Web 7, Brisbane, Australia, pages 161 - 172, ISSN:0169-7552  * [[http://dx.doi.org/10.1016/S0169-7552(98)00108-1|''Efficient crawling through URL ordering'']], Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, Proceedings of the Seventh International Conference on World Wide Web 7, Brisbane, Australia, pages 161 - 172, ISSN:0169-7552
Line 55: Line 55:
 * [http://sites.computer.org/debull/A02mar/short-a.ps ''Computing web page importance without storing the graph of the web (extended abstract)''], S. Abiteboul, M. Preda, and G. Cobena, IEEE-CS Data Engineering Bulletin, Volume 25, 2002.  * [[http://sites.computer.org/debull/A02mar/short-a.ps|''Computing web page importance without storing the graph of the web (extended abstract)'']], S. Abiteboul, M. Preda, and G. Cobena, IEEE-CS Data Engineering Bulletin, Volume 25, 2002.
Line 57: Line 57:
 * [http://www.springerlink.com/content/jdyw0e7kqlf6u4aw/fulltext.pdf ''Issues in Monitoring Web Data''], Serge Abiteboul, INRIA and Xyleme, France, R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 1-8, 2002.  * [[http://www.springerlink.com/content/jdyw0e7kqlf6u4aw/fulltext.pdf|''Issues in Monitoring Web Data'']], Serge Abiteboul, INRIA and Xyleme, France, R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 1-8, 2002.
Line 59: Line 59:
 * [http://dx.doi.org/10.1045/december2002-masanes ''Towards Continuous Web Archiving: First Results and an Agenda for the Future''], Julien Masanès, Bibliothèque Nationale de France, D-Lib Magazine, December 2002, Volume 8 Number 12, ISSN 1082-9873  * [[http://dx.doi.org/10.1045/december2002-masanes|''Towards Continuous Web Archiving: First Results and an Agenda for the Future'']], Julien Masanès, Bibliothèque Nationale de France, D-Lib Magazine, December 2002, Volume 8 Number 12, ISSN 1082-9873
Line 65: Line 65:
 * [http://www.niso.org/international/SC4/N595.pdf ''Information and documentation - The WARC File Format''], ISO TC 46/SC 4 N 595, Working Draft.  * [[http://www.niso.org/international/SC4/N595.pdf|''Information and documentation - The WARC File Format'']], ISO TC 46/SC 4 N 595, Working Draft.
Line 67: Line 67:
 * [http://www.apsr.edu.au/publications/aons_report.pdf ''AONS System Documentation''], Joseph Curtis, Joseph, Australian Partnership for Sustainable Repositories, The Australian National University, Revision 169 2006-09-29, September 2006.  * [[http://www.apsr.edu.au/publications/aons_report.pdf|''AONS System Documentation'']], Joseph Curtis, Joseph, Australian Partnership for Sustainable Repositories, The Australian National University, Revision 169 2006-09-29, September 2006.
Line 69: Line 69:
 * [http://unesdoc.unesco.org/images/0014/001477/147782E.pdf ''Risks associated with the Use of Recordable CDs and DVDs as Reliable storage Media in Archival Collections - Strategies and Alternatives''], Kevin Bradley, Memory of the World Programme Sub-Committee on Technology, CI/INF/2006/1/REV, October 2006.  * [[http://unesdoc.unesco.org/images/0014/001477/147782E.pdf|''Risks associated with the Use of Recordable CDs and DVDs as Reliable storage Media in Archival Collections - Strategies and Alternatives'']], Kevin Bradley, Memory of the World Programme Sub-Committee on Technology, CI/INF/2006/1/REV, October 2006.
Line 79: Line 79:
 * [http://www.nla.gov.au/padi/topics/43.html National Library of Australia's PADI bibliography]  * [[http://www.nla.gov.au/padi/topics/43.html|National Library of Australia's PADI bibliography]]

Articles about web archiving in general

Articles about NetarchiveSuite

  • ''A formal analysis of recovery in a preservational data grid'', Niels H. Christensen, Royal Library of Denmark, Dept. of Digital Preservation & Netarkivet.dk, presented at MSST2006, the 14th NASA Goddard - 23rd IEEE Conference on Mass Storage Systems and Technologies, May 15-18, 2006, College Park, Maryland, USA

Articles about harvesting

Alias detection

Etags, datestamps, and adaptive revisiting

Articles about archiving and preservation

Articles about browsing harvested material

Description of other harvesting systems/archives

  • [Web Curator Tool], Philip Beresford, British Library, Ariadne, Issue 50, January 2007

Arms, William: [ARMS-01a]

Arms, William: [ARMS-01b]

Aschenbrenner, Andreas: [ASCHENBRENNER-01]

Cho, Junghoo & Garcia-Molina, Hector: [CHO-00]

Lyman, Peter: [LYMAN-02]

  • “Archiving the World Wide Web”

in “Building a National Strategy for Digital Preservation: Issues in Digital Media Archiving” Library of Congress & Council on Library and Information Resources, April 2002

Phillips, Margaret E: [PHILLIPS-02] “Archiving the Web: The national collection of Australian online publications”

Raghavan, Sriram & Garcia-Molina, Hector: [RAGHAVAN-01]

UNESCO & National Library of Australia: [UNLA-03]

Webb, Colin: [WEBB-00]

Bibliography (last edited 2010-08-16 10:24:08 by localhost)