fea0c27f39741be625dc0efe3ae6be74.ppt
- Number of slides: 17
Archiving and Preserving the Web
Kristine Hanna
Internet Archive
July 2008
Open Source Technology primarily developed by Internet Archive and IIPC
• Heritrix: web harvester to capture the content
• Wayback Machine: access tool for rendering and viewing content. Displays archived web pages, so you can surf the web as it was.
• NutchWAX: search engine. Standard full-text search
Heritrix development
2.0 (2008)
• Duplicate reduction (saving storage)
• Prioritization of seeds, domains, URLs
• Adapting to WARC format
2.2 (September 2008) and 2.4 (2009)
• Adaptive and continuous revisit crawling at a large scale
  – Ability to run one never-ending 'master crawl' on the same 'scope' and not break up the crawl
• Improving checkpointing for stable long-running crawls
  – Essentially a 'snapshot' of the entire state of the crawl, so if anything goes wrong, we can pick up from exactly that 'snapshot' point, with all internal queues/counters in exactly the same state
• Better crawling of web video content
• Improving the usability and documentation features
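The checkpointing idea above (a snapshot of all queues and counters that a crawl can resume from) can be sketched in a few lines. This is only an illustration under stated assumptions, not Heritrix's actual implementation: `CrawlState`, `step`, `TOY_WEB` and the JSON snapshot format are all hypothetical names invented for this example.

```python
import json
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CrawlState:
    """Hypothetical crawl state: URL frontier, set of seen URLs, a counter."""
    frontier: deque = field(default_factory=deque)
    seen: set = field(default_factory=set)
    fetched: int = 0

    def checkpoint(self) -> str:
        # Serialize everything needed to resume: queue order, seen set, counter.
        return json.dumps({
            "frontier": list(self.frontier),
            "seen": sorted(self.seen),
            "fetched": self.fetched,
        })

    @classmethod
    def restore(cls, snapshot: str) -> "CrawlState":
        # Rebuild the state exactly as it was when the snapshot was taken.
        data = json.loads(snapshot)
        return cls(deque(data["frontier"]), set(data["seen"]), data["fetched"])


# Toy link graph standing in for the live web (an assumption for the demo).
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}


def toy_links(url: str) -> list:
    return TOY_WEB.get(url, [])


def step(state: CrawlState, links_of) -> None:
    """Process one URL: pop it, count it, enqueue outlinks not seen before."""
    url = state.frontier.popleft()
    state.fetched += 1
    for link in links_of(url):
        if link not in state.seen:
            state.seen.add(link)
            state.frontier.append(link)


def run_to_completion(state: CrawlState) -> CrawlState:
    while state.frontier:
        step(state, toy_links)
    return state
```

A crawl restored from `checkpoint()` and run to completion should end in the same state as the uninterrupted crawl, which is the property the slide describes.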
NutchWAX Development
0.12 (September)
• De-duplication of archive content during indexing
• Adds support for WARC files
• Addresses high-priority bugs
• Built on most recent versions of Nutch/Hadoop
• Distributed computing system scales to 100 millions of documents
• OpenSearch interface to integrate with numerous 3rd-party systems
1.0 (December)
• Improve and simplify installation, indexing and service deployment of Nutch
• Provide NutchWAX documentation
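De-duplication during indexing can be illustrated with content digests: captures of the same URL whose payload is byte-identical collapse into one index entry that remembers every capture date. This is a minimal sketch, assuming SHA-1 digests and a `(url, timestamp, payload)` record shape; `dedupe_for_index` is a hypothetical helper and not part of NutchWAX.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

# Assumed record shape: (url, 14-digit capture timestamp, payload bytes).
Record = Tuple[str, str, bytes]


def dedupe_for_index(records: Iterable[Record]) -> Dict[Tuple[str, str], List[str]]:
    """Group captures by (url, content digest).

    Each key would be indexed once; the value lists every capture
    timestamp that carried that exact content.
    """
    index: Dict[Tuple[str, str], List[str]] = {}
    for url, timestamp, payload in records:
        digest = hashlib.sha1(payload).hexdigest()
        index.setdefault((url, digest), []).append(timestamp)
    return index
```

With three captures of one URL, two of them identical, the result has two entries instead of three, so the full text is indexed only once per distinct version.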
Wayback Development
1.4 (July)
• Configurable/customizable error messages per website
• Support for exclusions framework including date ranges
• Anchoring date during replay to prevent "drift" through a replay session
• Anchoring window, to limit embedded content to a defined time range within a replay session
• Index format change to "identity format"
• Proxy mode embedding of time lines, banners, etc.
1.6 (December)
• Performance optimizations and better documentation
• Ability to play back https
• Improved packaging, installation and documentation
• Formal support for Windows platform
• Improved video replay
• Thumbnails and/or document titles in the UI
• In-page difference between two captures (visual comparison as you move through time)
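One reading of the anchoring date and anchoring window features is this: every request in a replay session resolves against the same fixed date, and a capture is used only if it falls within a set distance of that date. The sketch below follows that reading and is not the Wayback Machine's code. `pick_capture` and its parameters are hypothetical; the 14-digit `YYYYMMDDhhmmss` timestamp form is the one used in Wayback URLs.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def parse_ts(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss capture timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def pick_capture(captures: Sequence[str], anchor: str,
                 window: timedelta) -> Optional[str]:
    """Return the capture closest to the session's anchor date.

    Captures further than `window` from the anchor are ignored, so
    embedded content cannot be pulled in from a very different time.
    Returns None when nothing falls inside the window.
    """
    anchor_dt = parse_ts(anchor)
    in_window = [c for c in captures
                 if abs(parse_ts(c) - anchor_dt) <= window]
    if not in_window:
        return None
    return min(in_window, key=lambda c: abs(parse_ts(c) - anchor_dt))
```

Because the anchor stays fixed for the whole session, following several links in a row does not gradually move the user away from the date they started at.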
IA Projects
• Using Open Source tools
• Collaborating with Partners
National Libraries
Ongoing thematic crawls, event-based harvests, and domain snapshots
• Iceland
• Germany
• UK
• Norway
• Denmark
• US
• Czech Republic
• France
• Ireland
• Australia
• Sweden
Topic/Event crawls
Library of Congress
• National elections – 2000, 2002, 2004, 2006, 2008
• Supreme Court Nomination
• War in Iraq
• Crisis in Darfur
• Egyptian Elections
• Olympics. gov
• Papal Election
Community Web archiving
• Hurricane Katrina collection
  – Contributors: The Internet Archive, the Library of Congress, CDL, a group of universities, and many individual contributors
  – Spans content generated between September 4 and November 8, 2005
  – 1,700 web sites / 61 million pages, all text searchable
  – Public access at http://websearchive.org/katrina/
• Tsunami Collection
  – Contributors: The Internet Archive, Singapore Internet Research Centre, Web Archivist
  – 1,500 sites / 4 million pages, all text searchable
  – Public access at http://tsunami.archive.org/
Virginia Tech University
Web archiving as a result of crisis and tragedy
• Tragedy at Virginia Tech
  – 3 million documents, all text searchable
  – Accessible to the public at http://www.dl-vt416.org/
• Northern Illinois University
World Wide Web of Humanities
• Collaboration between IA, Hanzo Web and Oxford Internet Institute. Funded by NEH and JISC
• Objective is to support new methodologies for digital humanities research built around large collections of web and digitized data, using automated tools to extract, index, and analyze the data
• Chose a well-rounded set of humanities materials that will allow us to test the tools against a variety of types of documents and resource types
• Will build focused research collections around the topics of World Wars I and II
K-12
• Collaboration with LOC and CDL
• Chose 3 high schools from around the country (California, Illinois and Louisiana)
• http://www.archive-it.org/k12
Around the World in Two Billion Pages
• Mellon Award: unique global snapshot of the Web
  – Crawled from June 2007 to December 2007
  – Over 60 countries participated
  – Started with 18,000 seeds (websites)
  – Completed with 2 billion pages
  – http://wa.archive.org/aroundtheworld/
Archive-It (state archives, state university and public libraries, university libraries and non-governmental non-profits)
  – Web-based application that allows users to harvest, manage and preserve collections of born-digital content
  – Own institution's websites, topics/subjects/events and/or government records
  – Functions include: setting crawl frequencies, defining scope, cataloging with metadata, managing and analyzing collections, and full-text search
  – Includes hosting and storage
Video
• 2007:
  – IA engineers crawled over one million YouTube videos. Broad crawls off of home page links (most popular, most viewed)
  – Started crawling embedded videos for LOC Election '08 collection
• 2008:
  – NDIIPP project with UNC: 8 weekly crawls
  – Broad crawls: 2 weekly crawls from YouTube home page, prioritized based on popularity
  – Focused/topical crawls: 3 weekly crawls with specific IDs or search queries provided by UNC
  – Broad and/or focused: last 3 crawls (TBA)
Video Harvests
• Difficult to interact with YouTube and other proprietary Flash video players
• Configuration is a moving target, since these video hosting sites may change their software at any time
• Highly customized scoping rules need to be added to capture all the URLs relevant to embedded Flash videos
• Replay (through the Wayback Machine) is complicated by some of the same issues we face with Flash in general
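A scoping rule of the kind mentioned above can be thought of as a predicate over URLs: the crawl accepts pages on the seed hosts, plus off-host URLs that look like part of an embedded Flash video. The sketch below only illustrates that idea and is not Heritrix configuration. `in_scope`, `VIDEO_PATTERNS` and the example hosts are hypothetical, and the single pattern shown (paths ending in `.flv` or `.swf`) is far simpler than the site-specific rules a real harvest needs.

```python
import re
from typing import Iterable, Pattern, Set
from urllib.parse import urlparse

# Hypothetical extra rule: accept Flash players and Flash video files
# regardless of which host serves them.
VIDEO_PATTERNS = [re.compile(r"\.(flv|swf)$", re.IGNORECASE)]


def in_scope(url: str, seed_hosts: Set[str],
             extra_patterns: Iterable[Pattern] = VIDEO_PATTERNS) -> bool:
    """Decide whether a discovered URL should be crawled."""
    parsed = urlparse(url)
    # Rule 1: anything on a seed host is in scope.
    if parsed.hostname in seed_hosts:
        return True
    # Rule 2: off-host URLs are in scope only if their path matches a
    # video-related pattern (the query string is ignored here).
    return any(p.search(parsed.path) for p in extra_patterns)
```

Because hosting sites change their players and URL layouts, the pattern list is the part that would need frequent revision, which matches the "moving target" point above.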
What's Next for Internet Archive and Web Archiving
• Collaboration and Partnerships
  – Continue to act as a technology partner in providing web archiving services
  – Continue to develop Open Source software
  – Develop common tools, storage formats and standards through the IIPC, and with our partners
• Multiple copies around the world
  – Within IA's own repository, and with partners such as LC, BnF, Library of Alexandria