
- Number of slides: 43
WAX: A candle in the darkness A digital to digital project Wendy Gogel, Andrea Goethals Harvard University Library, Office for Information Systems May 1, 2009
Today’s Journey • The Darkness – The Web Introducing the challenge of web archiving • The Candle – WAX HUL’s Web Archive Collection Service • The Light – The Collections Demonstrating the results
The Darkness: The Web
The Challenges of Web Archiving • A fleeting record – here today, gone tomorrow • Government Documents • Public Debate • Culture • Personal expression • University Output
Harvard Magazine May/June 2009
Curator Activities • Selection • Acquisition • Rights management • Quality assurance • Arrangement • Storage • Description and indexing for discovery (cataloguing, searching, browsing) • Presentations and exhibitions • Preservation
IP and Other Legal Risks • Copyright infringement • State tort liability • Civil damages, resulting from invasion of privacy, sensitive personal data, commercial content, defamatory content • Statutory content restrictions • Foreign Laws
Preservation Challenges • We were not there at creation • Viruses more likely • Formats misidentify themselves • A lot of formats are invalid (especially HTML) • It’s a moving target – what should we preserve? • Evolving born-digital formats • Proliferation of formats • Partial capture • Complex behaviors and styles • Complex delivery to maintain • Hyperlinked resources • Multiple renderers will continue to evolve
2006/07 Alternatives Selection Crawling Management (QA and Metadata) Storage Preservation Discovery and Display Wayback (IA) No Yes Partial Replicated storage – Not Harvard owned No full text searching Contract IA Yes No, handle in-house No, handle in-house Archive It! (IA) Yes Minimal, has since improved Yes Partial Replicated storage Minimal, has since improved Customize IIPC Tools (WAX)* Yes Yes More than others Yes * Additional benefit of integration with HUL central services Notes 2008 costs: $16,000/yr $2,000/yr Harvard copy
The Candle: WAX
HUL’s Web Archiving Project • 2.5-year pilot project funded by LDI • Key Goals 1. Gain experience in domain 2. Explore legal terrain 3. Investigate sustainability of a Harvard web archiving service • quantify technical, human, and $ requirements • aim for operational efficiencies
Project Players 1. Curators and Collection Managers • Harvard University Archives • Schlesinger Library on the History of Women in America • Edwin O. Reischauer Institute of Japanese Studies 2. Legal Counsel – Office of General Counsel (OGC) 3. Technologists – OIS
What Did We Build? WAX
Third Party Software • International Internet Preservation Consortium (IIPC) tools www.netpreserve.org • Heritrix • HCC • NutchWAX • Wayback • JBoss • Oracle • Struts • Tomcat • Quartz job scheduler
The Web is vast and interconnected. How do you specify the part you want to capture? Or “training a web crawler”…
How to Train a Web Crawler 1. Tell it where to start • “Seed URI” 2. Tell it what to collect and where to stop • “Scope” 3. Tell it when and how often • “Schedule”
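The three instructions above can be pictured as a single record per website. The following is a minimal sketch only: the class name, field names, and example URI are hypothetical and are not taken from WAX; the scope names are the three listed later in the deck.

```python
from dataclasses import dataclass
from datetime import timedelta

# Scope names as listed in the deck; their exact semantics are not defined here.
SCOPES = ("directory", "host", "host+1")


@dataclass(frozen=True)
class HarvestProfile:
    """Hypothetical record of what a curator tells the crawler."""
    seed_uri: str        # where to start
    scope: str           # what to collect and where to stop
    interval: timedelta  # when and how often to re-crawl

    def __post_init__(self):
        if self.scope not in SCOPES:
            raise ValueError(f"unknown scope: {self.scope!r}")


# Example: harvest one host every week (illustrative values).
profile = HarvestProfile("http://example.org/blog/", "host", timedelta(weeks=1))
```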
Web Archiving Steps 1. Create a harvest profile: identify website URI (“seed”), define scope and schedule 2. Harvest web site 3. QA harvest 4. Send harvest to DRS 5. Index harvest: becomes searchable and viewable by users. A lot of work per website – which steps can be automated?
Web Archiving Steps 1. Create a harvest profile (manual by curator) 2. Harvest web site (automated by scheduler and crawler software) 3. QA harvest (manual by curator) 4. Send harvest to DRS (manual by curator) 5. Index harvest (automated by indexing software)
Workflow Efficiencies • Curator’s manual tasks: • Create a harvest profile • 3 scopes: Directory, host and host+1 • Schedules • Global excluded URIs • QA harvests • Remove unwanted pieces • Detect missing pieces • Refinement of seed scope • Send harvests to DRS How can the system help with these tasks?
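The slides do not define the three scopes, so the sketch below rests on one plausible reading: "directory" keeps URIs under the seed's directory on the same host, "host" keeps anything on the seed's host, and "host+1" also keeps resources exactly one link hop off that host. The function names and the `hops_off_host` parameter are hypothetical, not part of WAX.

```python
import posixpath
from urllib.parse import urlsplit


def seed_directory(seed_uri):
    """Return the directory part of the seed's path, always ending in '/'."""
    path = urlsplit(seed_uri).path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)
        if not path.endswith("/"):
            path += "/"
    return path


def in_scope(seed_uri, uri, scope, hops_off_host=0):
    """Decide whether uri belongs to the harvest under the assumed scope rules.

    hops_off_host is the number of link hops taken since leaving the seed's
    host (0 while still on it); the caller is assumed to track this.
    """
    seed, target = urlsplit(seed_uri), urlsplit(uri)
    same_host = seed.hostname == target.hostname
    if scope == "directory":
        return same_host and (target.path or "/").startswith(seed_directory(seed_uri))
    if scope == "host":
        return same_host
    if scope == "host+1":
        return same_host or hops_off_host == 1
    raise ValueError(f"unknown scope: {scope!r}")
```

For example, with the seed `http://example.org/blog/`, the page `http://example.org/about.html` is outside the "directory" scope but inside the "host" scope under these assumptions.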
Efficiencies: QA Harvests
• Exclude URIs from future crawls • Delete URIs from harvest and exclude them from future crawls
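The two QA actions can be illustrated with plain lists. This is a sketch under the assumption that exclusions are matched by URI prefix; the slides do not say how WAX matches them, and both function names are hypothetical.

```python
def delete_from_harvest(harvested_uris, prefix):
    """Return the harvest without any URI that starts with prefix."""
    return [uri for uri in harvested_uris if not uri.startswith(prefix)]


def is_excluded(uri, excluded_prefixes):
    """True if a future crawl should skip uri."""
    return any(uri.startswith(prefix) for prefix in excluded_prefixes)


harvest = [
    "http://example.org/blog/post.html",
    "http://example.org/ads/banner.gif",
]
excluded = []

# "Delete and exclude": drop the unwanted piece now and skip it next time.
unwanted = "http://example.org/ads/"
harvest = delete_from_harvest(harvest, unwanted)
excluded.append(unwanted)
```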
Efficiencies: Send Harvests to DRS
The Ultimate Shortcut? • Can pre-configure WAX to send harvests directly to the DRS • Skip QA step • Skip push to archive step
Web Harvest Objects: Unit of Preservation in the DRS • For each crawl starting from a seed URI: • One or more ARC files (*.arc.gz) • contain one or more “resources” - the individual HTML, JPEG, JavaScript, etc. files that make up the harvested web pages • Crawl log • records all URI requests, regardless of result • Crawler configuration • Metadata • descriptive, administrative, technical
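Each resource inside an ARC file is preceded by a one-line header. The sketch below assumes the ARC version 1 layout of five space-separated fields (URL, IP address, 14-digit archive date, content type, length in bytes) and assumes the content type itself contains no spaces; check both against the format specification before relying on it. The type and function names are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size of the record body in bytes


def parse_arc_header(line):
    """Parse one ARC v1 record header line (assumed five-field layout)."""
    # Split from the right so the last four fields are taken as-is.
    url, ip, date, content_type, length = line.rstrip("\r\n").rsplit(" ", 4)
    return ArcRecordHeader(
        url, ip, datetime.strptime(date, "%Y%m%d%H%M%S"), content_type, int(length)
    )


# Illustrative header line, not taken from a real harvest.
header = parse_arc_header("http://example.org/ 192.0.2.10 20090501120000 text/html 1024\n")
```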
WAX Legal Mitigations: Crawls • Polite crawling • Obey robots.txt • Leave WAX crawler information in logs • Employ a respectful “request frequency” during crawls • Don’t overload web servers • Capture surface web only • No attempt to crawl protected content • Choice of offsite crawler for curators • Non-Harvard IP address
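Polite crawling can be sketched with Python's standard `urllib.robotparser`. This is not how WAX or Heritrix implements it; the user agent string, the sample robots.txt, and the function names are hypothetical, and the sketch only plans the fetches (which URIs are allowed and how long to wait between requests) without touching the network.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-archive-crawler"  # hypothetical crawler name

# Illustrative robots.txt content, parsed locally instead of being downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(parser, uris, default_delay=1.0):
    """Return (allowed URIs, seconds to wait between requests)."""
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    allowed = [uri for uri in uris if parser.can_fetch(USER_AGENT, uri)]
    return allowed, delay


allowed, delay = polite_fetch_plan(
    build_parser(ROBOTS_TXT),
    ["http://example.org/public/page.html", "http://example.org/private/page.html"],
)
```

A real crawler would sleep for `delay` seconds between requests to the same server and would identify itself in the `User-Agent` header so that site owners can find it in their logs.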
WAX Legal Mitigations: Use • Don’t compete with or divert traffic from live site • Exclude robots from the WAX archive • Add transformative content • Framing • Presentation pages with original intellectual content • Embargo display for 3 months • Link to live site
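The three-month display embargo amounts to a date comparison. The sketch below assumes the embargo runs in calendar months from the harvest date; the slides do not state the exact rule, and the function names are hypothetical.

```python
import calendar
from datetime import date

EMBARGO_MONTHS = 3  # from the slide; the start point is an assumption


def add_months(d, months):
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_displayable(harvest_date, today):
    """True once the embargo period after harvest_date has passed."""
    return today >= add_months(harvest_date, EMBARGO_MONTHS)
```

Under these assumptions a harvest taken on May 1, 2009 would first be displayed on August 1, 2009.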
The Collections • 191 “seeds” identified by curators for harvesting • Stored in DRS: • Over 8 million web archive resources • 365.17 gigabytes of storage ($913/year) • 291 MIME types
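The storage figures imply a unit cost that is easy to check. The sketch below only divides the two numbers on the slide; the function name is hypothetical and the result is an implied rate, not a published DRS price.

```python
STORED_GB = 365.17       # from the slide
ANNUAL_COST_USD = 913.0  # from the slide


def cost_per_gb_year(annual_cost_usd, stored_gb):
    """Implied storage cost in dollars per gigabyte per year."""
    return annual_cost_usd / stored_gb


rate = cost_per_gb_year(ANNUAL_COST_USD, STORED_GB)
print(f"Implied rate: ${rate:.2f} per GB per year")
```

The implied rate is roughly $2.50 per gigabyte per year.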
application/x-download application/x-java-vm Shockwave message/rfc822 text/Javascript audio/x-realaudio image/x-portable-anymap textcss chemical/mdl-rdf javascript/x-javascript application/x-Shockwave-Flash content-type application/bds png image/png?ver=074219b2138e87ecf980914471183dfc text/x-c++ application/xrds+xml "text/xml" image/x-bmp gif application/x-rar-compressed Image/png mime/type image/null text/troff application/vnd.sun.xml.impress text/enriched application/icalendar application-x/javascript x-mapp-php4 imag/x-icon application/x-shockwave-flash2-preview Swish image/x-photoshop application/x-quicktimeplayer image/x-cmu-raster httpd/yahoo-send-as-is application/x-mpeg Video/X-Flv text/x-python text/text Text/HTML audio/mid text/Calendar application/x-wais-source application/x-perl audio/x-scpls image/txt application/pgp-keys Applicationxm text/calendar PNG text/x-vcard x-png application/octet-string unknown/unknown application/x-troff-me text/x-javascript video/x-m4v application/octetstream application/pgp-signature Image application/x-sh audio/x-mpegurl audio/unknown chemical/x-xyz image/x-portable-graymap image/#{favicon_formats[format]} image/files/curryjpg test/xml text/x-invalid video/x-flv text/javascript+json application/perl application/x.atom+xml application/octet_stream video/mp4
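Many of the server-reported types above differ only in case, quoting, or trailing query strings, and others are not type/subtype pairs at all. The sketch below shows one way to clean such values before counting formats; it is an illustration, not the WAX or DRS procedure, and the function name and the simplified syntax pattern are assumptions.

```python
import re

# Simplified "type/subtype" syntax check (an approximation of the token rules).
_MIME_SYNTAX = re.compile(r"^[a-z0-9][a-z0-9!#$&^_.+-]*/[a-z0-9][a-z0-9!#$&^_.+-]*$")


def normalize_mime(raw):
    """Lowercase, unquote, and strip parameters; return None if malformed."""
    value = raw.strip().strip('"').lower()
    # Drop anything after ';' (parameters) or '?' (stray query strings).
    value = re.split(r"[;?]", value, maxsplit=1)[0].strip()
    return value if _MIME_SYNTAX.match(value) else None


samples = ["Text/HTML", '"text/xml"', "image/png?ver=074219b2138e87ecf980914471183dfc",
           "textcss", "PNG", "image/#{favicon_formats[format]}"]
cleaned = [normalize_mime(s) for s in samples]
```

A syntax check like this cannot catch well-formed but wrong values such as `test/xml` or `image/txt`, which is one reason formats still need to be identified from the content itself.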
The Light: The Collections
The Partners Megan Sniffin-Marinoff, University Archivist A-Sites: Archived Harvard Web Sites collected by the Harvard University Archives Marilyn Dunn, Executive Director of the Schlesinger Library and Librarian of the Radcliffe Institute Blogs: Capturing Women's Voices collected by the Arthur and Elizabeth Schlesinger Library on the History of Women in America Helen Hardacre, Reischauer Institute Professor of Japanese Religions and Society Web Archiving Project on Constitutional Revision collected by the Reischauer Institute of Japanese Studies with Sponsorship from the Harvard College Library Documentation Center on Contemporary Japan
To Participate http://hul.harvard.edu/ois/systems/wax
Questions? “…we have rather chosen to fill our hives with honey and wax, thus furnishing mankind with the two noblest of things, which are sweetness and light.” Jonathan Swift
Image Credits Title slide: http://www.flickr.com/photos/lwr/59014972/in/set-1552655/ The darkness: http://www.melegraph.com/images/outerspace.jpg The candle: http://www.sxc.hu/pic/m/a/as/asolario/472153_peach_votive_candle.jpg The Web: http://projecta-z.com/Internet_map_1024.jpg The light: http://i252.photobucket.com/albums/hh2/habeba2007/candles-1-1.gif