== Articles about web archiving in general ==
 * [[http://library.wellcome.ac.uk/assets/WTL039229.pdf|''Collecting and preserving the World Wide Web: A feasibility study undertaken for the JISC and Wellcome Trust'']], Michael Day, UKOLN, University of Bath, Version 1.0, 25 February 2003

== Articles about NetarchiveSuite ==
 * [[http://netarchive.dk/publikationer/nhc-kb-dk-msst2006.pdf|''A formal analysis of recovery in a preservational data grid'']], Niels H. Christensen, Royal Library of Denmark, Dept. of Digital Preservation & Netarkivet.dk, presented at MSST2006, the 14th NASA Goddard - 23rd IEEE Conference on Mass Storage Systems and Technologies, May 15-18, 2006, College Park, Maryland, USA

== Articles about harvesting ==
 * [[http://www.dia.uniroma3.it/~vldbproc/017_129.pdf|''Crawling the Hidden Web'']], Sriram Raghavan and Hector Garcia-Molina, Computer Science Department, Stanford University, Stanford, CA 94305, USA, Proceedings of the 27th VLDB Conference, Roma, Italy, 2001
 * [[http://nicomedia.math.upatras.gr/courses/mnets/mat/Lawrence&Giles_AccessibilityOfInformationOnTheWeb.pdf|''Accessibility of information on the web'']], Steve Lawrence and C. Lee Giles, Nature, Vol 400, 8 July 1999
 * [[http://www10.org/cdrom/papers/102/index.html|''Effective Web Data Extraction with Standard XML Technologies'']], Jussi Myllymaki, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA, WWW10, May 2-5, 2001, Hong Kong
 * [[http://kushmerick.org/nick/research/download/kushmerick-odbase2003.pdf|''Learning to invoke Web forms'']], Nicholas Kushmerick, Computer Science Department, University College Dublin, Ireland, Proc. Int. Conf. Ontologies, Databases & Applications of Semantics (Catania, 2003).
 * [[http://dx.doi.org/10.1016/j.ijar.2003.07.002|''Information retrieval in the Web: beyond current search engines'']], Ricardo Baeza-Yates, Center for Web Research, Department of Computer Science, University of Chile, Blanco Encalada, 2120, Santiago, Chile, International Journal of Approximate Reasoning, Volume 34, Issues 2-3, November 2003, Pages 97-104
 * [[http://dx.doi.org/10.1016/S1389-1286(02)00213-X|''Search engines and Web dynamics'']], Knut Magne Risvik, Rolf Michelsen, Fast Search and Transfer ASA, Stoperigata 2, P.O. Box 1677, Vika NO-0120 Oslo, Norway, Computer Networks, Volume 39, Issue 3, 21 June 2002, Pages 289-302
 * [[http://dx.doi.org/10.1016/S0098-7913(02)00184-3|''Invisible Web: Uncovering Information Sources Search Engines Can't See'']], Chris Sherman and Gary Price, NJ: Information Today, Inc., 2001. 439 p. $29.95. ISBN 0-910965-51-X
 * [[http://portal.acm.org/citation.cfm?id=779235|''High-performance web crawling'']], Marc Najork, Compaq Computer Corporation Systems Research Center, Palo Alto, CA and Allan Heydon, Model N, Inc., South San Francisco, CA, Handbook of massive data sets, pages 25 - 45, 2002, ISBN:1-4020-0489-3

=== Alias detection ===
 * [[http://lis.sagepub.com/cgi/reprint/31/1/21|''Classifying Web sites and Web pages: the use of metrics and URL characteristics as markers'']], Wallace C. Koehler, Jr., University of Oklahoma, Journal of Librarianship and Information Science, Vol. 31, No. 1, 21-31 (1999)
 * [[http://portal.acm.org/ft_gateway.cfm?id=511484&type=pdf&coll=GUIDE&dl=GUIDE&CFID=57533127&CFTOKEN=37490285|''Aliasing on the world wide web: prevalence and performance implications'']], Terence Kelly, University of Michigan, Ann Arbor, MI and Jeffrey Mogul, Compaq Western Research Lab, Palo Alto, CA, Proceedings of the 11th international conference on World Wide Web, Honolulu, Hawaii, USA, pages 281 - 292, 2002, ISBN:1-58113-449-5
 * [[http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=1999-39&format=pdf&compression=&name=1999-39.pdf|''Finding replicated web collections.'']], J. Cho, N. Shivakumar, H. Garcia-Molina, Proceedings of 2000 ACM International Conference on Management of Data (SIGMOD) Conference, May 2000
 * [[http://portal.acm.org/ft_gateway.cfm?id=1083676&type=pdf&coll=GUIDE&dl=GUIDE&CFID=18586673&CFTOKEN=47728477|''Discovering large dense subgraphs in massive graphs'']], David Gibson, Ravi Kumar, and Andrew Tomkins, IBM Almaden Research Center, San Jose, CA, Proceedings of the 31st international conference on Very large data bases, Trondheim, Norway, pages 721 - 732, ISBN:1-59593-154-6
 * [[http://dx.doi.org/10.1016/S1389-1286(99)00021-3|''Mirror, mirror on the Web: a study of host pairs with replicated content'']], Krishna Bharat and Andrei Broder, Compaq Systems Research Center, 130 Lytton Avenue, Palo Alto, CA 94301, USA, Computer Networks, Volume 31, Issues 11-16, 17 May 1999, Pages 1579-1590
 * [[http://dx.doi.org/10.1002/1097-4571(2000)9999:9999<::AID-ASI1025>3.0.CO;2-0|''A comparison of techniques to find mirrored hosts on the WWW'']], Krishna Bharat, Andrei Broder, and Jeffrey Dean, Google Inc., 2400 Bayshore Ave., Mountain View, CA 94043, and Monika R. Henzinger, AltaVista Company, 1825 S. Grant St., San Mateo, CA 94402, Journal of the American Society for Information Science, Volume 51, Issue 12, Pages 1114 - 1122
 * [[http://doi.acm.org/10.1145/336296.336345|''Defining Logical Domains in a Web Site'']], Wen-Syan Li, Okan Kolak, Quoc Vu, and Hajime Takano, Proceedings of the eleventh ACM Conference on Hypertext and hypermedia, San Antonio, Texas, United States, pages: 123 - 132, ISBN:1-58113-227-1

=== Etags, datestamps, and adaptive revisiting ===
 * [[http://vefsofnun.bok.hi.is/upload/3/ManagingDuplicatesAcrossSequentialCrawls.pdf|''Managing duplicates across sequential crawls'']], Kristinn Sigurðsson, National and University Library of Iceland, Arngrímsgötu 3, 107 Reykjavík, Iceland, Proceedings of the 5th International Web Archiving Workshop, Alicante, Spain, September 21-22, 2006.
 * [[http://dx.doi.org/10.1016/S1389-1286(99)00037-7|''Towards a better understanding of Web resources and server responses for improved caching'']], Craig E. Wills and Mikhail Mikhailov, Computer Science Department, Worcester Polytechnic Institute, Worcester, MA 01609, USA, Computer Networks, Volume 31, Issues 11-16, 17 May 1999, Pages 1231-1243
 * [[http://workshop99.ircache.net/Papers/wills-final.ps.gz|''Examining the cacheability of user-requested Web resources'']], C. E. Wills and M. Mikhailov, Proc. of the Fourth Int. Workshop on Web Content Caching and Distribution, Apr. 1999.
 * [[http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-99-3.pdf|''Errors in timestamp-based HTTP header values'']], Mogul, J., Tech. rep. 99/3, (Dec.) Compaq Computer Corporation, Western Research Laboratory.
 * [[http://dx.doi.org/10.1016/S0169-7552(98)00108-1|''Efficient crawling through URL ordering'']], Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, Proceedings of the Seventh International Conference on World Wide Web 7, Brisbane, Australia, pages 161 - 172, ISSN:0169-7552
 * [[http://sites.computer.org/debull/A02mar/short-a.ps|''Computing web page importance without storing the graph of the web (extended abstract)'']], S. Abiteboul, M. Preda, and G. Cobena, IEEE-CS Data Engineering Bulletin, Volume 25, 2002.
 * [[http://www.springerlink.com/content/jdyw0e7kqlf6u4aw/fulltext.pdf|''Issues in Monitoring Web Data'']], Serge Abiteboul, INRIA and Xyleme, France, R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 1-8, 2002.
 * [[http://dx.doi.org/10.1045/december2002-masanes|''Towards Continuous Web Archiving: First Results and an Agenda for the Future'']], Julien Masanès, Bibliothèque Nationale de France, D-Lib Magazine, December 2002, Volume 8 Number 12, ISSN 1082-9873

== Articles about archiving and preservation ==
 * ''An Architectural Framework for the IIPC Toolset'', Svein Arne Solbakk and Sverre Bang, National Library of Norway, IIPC working draft, October 2005
 * [[http://www.niso.org/international/SC4/N595.pdf|''Information and documentation - The WARC File Format'']], ISO TC 46/SC 4 N 595, Working Draft.
 * [[http://www.apsr.edu.au/publications/aons_report.pdf|''AONS System Documentation'']], Joseph Curtis, Australian Partnership for Sustainable Repositories, The Australian National University, Revision 169 2006-09-29, September 2006.
 * [[http://unesdoc.unesco.org/images/0014/001477/147782E.pdf|''Risks associated with the Use of Recordable CDs and DVDs as Reliable storage Media in Archival Collections - Strategies and Alternatives'']], Kevin Bradley, Memory of the World Programme Sub-Committee on Technology, CI/INF/2006/1/REV, October 2006.
== Articles about browsing harvested material ==

== Description of other harvesting systems/archives ==
 * [''Web Curator Tool''], Philip Beresford, British Library, Ariadne, Issue 50, January 2007

== Other bibliographies of related matter ==
 * [[http://www.nla.gov.au/padi/topics/43.html|National Library of Australia's PADI bibliography]]

== Unclassified related material ==
 * Arms, William: [ARMS-01a] “Web Preservation Project – Interim Report” Cornell University, January 2001 http://www.cs.cornell.edu/wya/LC-web/interim.doc
 * Arms, William: [ARMS-01b] “Web Preservation Project – Final Report” Cornell University, September 2001 http://www.loc.gov/minerva/webpresf.pdf
 * Aschenbrenner, Andreas: [ASCHENBRENNER-01] “Long-Term Preservation of Digital Material - Building an Archive to Preserve Digital Cultural Heritage from the Internet” Master Thesis, Technical University Vienna, December 2001. http://www.ifs.tuwien.ac.at/~aola/publications/thesis-ando.pdf
 * Cho, Junghoo & Garcia-Molina, Hector: [CHO-00] “The Evolution of the web and Implications for an Incremental Crawler” Proceedings of the 26th VLDB Conference, Cairo, Egypt, 2000 http://www.vldb.org/conf/2000/P200.pdf
 * Lyman, Peter: [LYMAN-02] “Archiving the World Wide Web” in “Building a National Strategy for Digital Preservation: Issues in Digital Media Archiving” Library of Congress & Council on Library and Information Resources, April 2002 http://www.clir.org/pubs/reports/pub106/pub106.pdf
 * Phillips, Margaret E: [PHILLIPS-02] “Archiving the Web: The national collection of Australian online publications” International Symposium on web archiving, Tokyo, Japan, January 2002 http://www.ndl.go.jp/e/enews/nla.doc
 * Raghavan, Sriram & Garcia-Molina, Hector: [RAGHAVAN-01] “Crawling the Hidden Web” Proceedings of the 27th VLDB Conference, Rome, Italy, 2001 http://www.vldb.org/conf/2001/P129.pdf
 * UNESCO & National Library of Australia: [UNLA-03] “Guidelines for the Preservation of Digital Heritage” UNESCO, 2003 http://unesdoc.unesco.org/images/0013/001300/130071e.pdf
 * Webb, Colin: [WEBB-00] “Towards a Preserved National Collection of Selected Australian Digital Publications” Paper from “Preservation 2000 Conference”, York, December 2000 http://www.nla.gov.au/nla/staffpaper/2000/webb6.HTML