  • Number of slides: 43

WAX: A candle in the darkness
A digital to digital project
Wendy Gogel, Andrea Goethals
Harvard University Library, Office for Information Systems
May 1, 2009

Today’s Journey
• The Darkness – The Web
  Introducing the challenge of web archiving
• The Candle – WAX
  HUL’s Web Archive Collection Service
• The Light – The Collections
  Demonstrating the results

The Darkness: The Web

The Challenges of Web Archiving
• A fleeting record – here today, gone tomorrow
  • Government Documents
  • Public Debate
  • Culture
  • Personal expression
  • University Output

Harvard Magazine May/June 2009

Curator Activities
• Selection
• Acquisition
• Rights management
• Quality assurance
• Arrangement
• Storage
• Description and indexing for discovery (cataloguing, searching, browsing)
• Presentations and exhibitions
• Preservation

IP and Other Legal Risks
• Copyright infringement
• State tort liability
• Civil damages, resulting from invasion of privacy, sensitive personal data, commercial content, defamatory content
• Statutory content restrictions
• Foreign Laws

Preservation Challenges
• We were not there at creation
• Viruses more likely
• Formats misidentify themselves
• A lot of formats are invalid (especially HTML)
• It’s a moving target – what should we preserve?
  • Evolving born digital formats
  • Proliferation of formats
  • Partial capture
  • Complex behaviors and styles
• Complex delivery to maintain
• Hyperlinked resources
• Multiple renderers will continue to evolve

2006/07 Alternatives
Criteria compared: Selection; Crawling; Management (QA and Metadata); Storage; Preservation; Discovery and Display
• Wayback (IA): No; Yes; Partial; Replicated storage – Not Harvard owned; No full text searching
• Contract IA: Yes; No, handle in-house; No, handle in-house
• Archive It! (IA): Yes; Minimal, has since improved; Yes; Partial; Replicated storage; Minimal, has since improved
• Customize IIPC Tools (WAX)*: Yes; Yes; More than others; Yes
* Additional benefit of integration with HUL central services
Notes: 2008 costs: $16,000/yr; $2,000/yr; Harvard copy

The Candle: WAX

HUL’s Web Archiving Project
• 2.5 year pilot project funded by LDI
• Key Goals
  1. Gain experience in domain
  2. Explore legal terrain
  3. Investigate sustainability of a Harvard web archiving service
     • quantify technical, human, and $ requirements
     • aim for operational efficiencies

Project Players
1. Curators and Collection Managers
   • Harvard University Archives
   • Schlesinger Library on the History of Women in America
   • Edwin O. Reischauer Institute of Japanese Studies
2. Legal Counsel – Office of General Counsel (OGC)
3. Technologists – OIS

What Did We Build? WAX

Third Party Software
• International Internet Preservation Consortium (IIPC) tools (www.netpreserve.org)
  • Heritrix
  • HCC
  • NutchWAX
  • Wayback
• JBoss
• Oracle
• Struts
• Tomcat
• Quartz job scheduler

The Web is vast and interconnected. How do you specify the part you want to capture? Or “training a web crawler”…

How to Train a Web Crawler
1. Tell it where to start
   • “Seed URI”
2. Tell it what to collect and where to stop
   • “Scope”
3. Tell it when and how often
   • “Schedule”
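
The three settings above (seed, scope, schedule) can be pictured as one small record per website. The sketch below is only an illustration: the class name `HarvestProfile`, its field names, and the example URI are assumptions made for this example and are not taken from WAX itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class HarvestProfile:
    """Hypothetical record of what a curator tells the crawler."""
    seed_uri: str        # where to start
    scope: str           # what to collect and where to stop
    interval_days: int   # how often to harvest

    def next_harvest(self, last_harvest: date) -> date:
        """Return the date of the next scheduled harvest."""
        return last_harvest + timedelta(days=self.interval_days)


# Example values are placeholders, not a real WAX seed.
profile = HarvestProfile(
    seed_uri="http://www.example.edu/dept/",
    scope="host",
    interval_days=30,
)
upcoming = profile.next_harvest(date(2009, 5, 1))
```

Keeping the three answers in one immutable record makes it easy for a scheduler to decide when a crawl is due without touching the crawler configuration.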

Web Archiving Steps
1. Create a harvest profile
   Identify website URI (“seed”), define scope and schedule
2. Harvest web site
3. QA harvest
4. Send harvest to DRS
5. Index harvest
   Becomes searchable and viewable by users
A lot of work per website – which steps can be automated?

Web Archiving Steps
Manual by curator → 1. Create a harvest profile
Automated by scheduler and crawler software → 2. Harvest web site
Manual by curator → 3. QA harvest
Manual by curator → 4. Send harvest to DRS
Automated by indexing software → 5. Index harvest

Workflow Efficiencies
• Curator’s manual tasks:
  • Create a harvest profile
    • 3 scopes: Directory, host and host+1
    • Schedules
    • Global excluded URIs
  • QA harvests
    • Remove unwanted pieces
    • Detect missing pieces
    • Refinement of seed scope
  • Send harvests to DRS
How can the system help with these tasks?
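
The slide names three scopes but does not define them, so the sketch below rests on assumed meanings: "directory" keeps URIs under the seed's directory, "host" keeps URIs on the seed's host, and "host+1" additionally keeps off-host URIs reached by exactly one link from an on-host page. The function name `in_scope` and the `off_host_hops` parameter are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, candidate: str, scope: str, off_host_hops: int = 1) -> bool:
    """Decide whether a candidate URI falls inside the crawl scope.

    off_host_hops counts the links followed since leaving the seed host;
    it only matters for off-host candidates under the assumed "host+1" rule.
    """
    s, c = urlsplit(seed), urlsplit(candidate)
    same_host = s.hostname == c.hostname
    if scope == "directory":
        # Directory of the seed, e.g. "/dept/" for "/dept/index.html".
        directory = s.path.rsplit("/", 1)[0] + "/"
        return same_host and c.path.startswith(directory)
    if scope == "host":
        return same_host
    if scope == "host+1":
        return same_host or off_host_hops == 1
    raise ValueError(f"unknown scope: {scope!r}")


SEED = "http://www.example.edu/dept/index.html"
```

For example, `in_scope(SEED, "http://www.example.edu/other/", "directory")` is false while the same call with `"host"` is true, which mirrors how widening the scope pulls in more of a site.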

Efficiencies: QA Harvests

• Exclude URIs from future crawls
• Delete URIs from harvest and exclude them from future crawls

Efficiencies: Send Harvests to DRS

The Ultimate Shortcut?
• Can pre-configure WAX to send harvests directly to the DRS
  • Skip QA step
  • Skip push to archive step

Web Harvest Objects: Unit of Preservation in the DRS
• For each crawl starting from a seed URI:
  • One or more ARC files (*.arc.gz)
    • contain one or more “resources” - the individual HTML, JPEG, JavaScript, etc. files that make up the harvested web pages
  • Crawl log
    • records all URI requests, regardless of result
  • Crawler configuration
  • Metadata
    • descriptive, administrative, technical
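
Each resource inside an ARC file is preceded by a one-line header. The sketch below parses such a line, assuming the version 1 ARC layout of five space-separated fields (URL, IP address, 14-digit archive date, content type, length). The sample line is invented for illustration, and `ArcRecordHeader` and `parse_arc_header` are hypothetical names, not part of any WAX or IIPC tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    url: str
    ip: str
    archived_at: datetime
    content_type: str
    length: int  # size of the record body in bytes


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one ARC v1 record header line (assumed five-field layout)."""
    parts = line.strip().split(" ")
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(parts)}")
    return ArcRecordHeader(
        url=parts[0],
        ip=parts[1],
        archived_at=datetime.strptime(parts[2], "%Y%m%d%H%M%S"),
        # Harvested content types are sometimes malformed, so keep
        # everything between the date and the length as-is.
        content_type=" ".join(parts[3:-1]),
        length=int(parts[-1]),
    )


header = parse_arc_header(
    "http://www.example.edu/index.html 192.0.2.10 20090501120000 text/html 2048"
)
```

The content type is stored exactly as the server reported it, which is why a harvest can accumulate many odd type strings.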

WAX Legal Mitigations: Crawls
• Polite crawling
  • Obey robots.txt
  • Leave WAX crawler information in logs
  • Employ a respectful “request frequency” during crawls
  • Don’t overload web servers
• Capture surface web only
  • No attempt to crawl protected content
• Choice of offsite crawler for curators
  • Non-Harvard IP address
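
As a rough illustration of the polite-crawling rules above, the sketch below uses Python's standard `urllib.robotparser` to check whether a URI may be fetched and how long to wait between requests. It is not how Heritrix or WAX implements politeness; the user agent `wax-example-bot`, the sample robots.txt, and the one-second default delay are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: wax-example-bot
Disallow: /private/
Crawl-delay: 5
"""


def fetch_decision(robots_txt: str, user_agent: str, url: str,
                   default_delay: float = 1.0) -> tuple[bool, float]:
    """Return (allowed, seconds to wait between requests) for one URI."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    # crawl_delay() returns None when robots.txt sets no delay for this agent.
    delay = parser.crawl_delay(user_agent)
    return allowed, float(delay) if delay is not None else default_delay


public = fetch_decision(ROBOTS_TXT, "wax-example-bot",
                        "http://www.example.edu/public/page.html")
private = fetch_decision(ROBOTS_TXT, "wax-example-bot",
                         "http://www.example.edu/private/page.html")
```

A crawler would skip any URI for which the first value is false and sleep for the returned delay between requests to the same server.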

WAX Legal Mitigations: Use
• Don’t compete with or divert traffic from live site
  • Exclude robots from the WAX archive
• Add transformative content
  • Framing
  • Presentation pages with original intellectual content
• Embargo display for 3 months
• Link to live site

The Collections
• 191 “seeds” identified by curators for harvesting
• Stored in DRS:
  • Over 8 million web archive resources
  • 365.17 gigabytes of storage ($913/year)
  • 291 MIME types
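
The two storage figures above imply a per-gigabyte rate, which the short calculation below derives. The rate itself is not stated on the slide; it is only what the quoted numbers work out to, and the function name is hypothetical.

```python
def cost_per_gb(annual_cost_usd: float, storage_gb: float) -> float:
    """Annual storage cost per gigabyte implied by a total cost and size."""
    return annual_cost_usd / storage_gb


# Figures quoted on the slide: 365.17 GB stored at $913 per year.
rate = cost_per_gb(913, 365.17)
print(f"about ${rate:.2f} per GB per year")
```

At roughly $2.50 per gigabyte per year, storage cost grows linearly with the size of each harvest, so excluding unwanted URIs during QA directly reduces the recurring bill.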

application/x-download application/x-java-vm Shockwave message/rfc822 text/Javascript audio/x-realaudio image/x-portable-anymap textcss chemical/mdl-rdf javascript/x-javascript application/x-Shockwave-Flash content-type application/bds png image/png?ver=074219b2138e87ecf980914471183dfc text/x-c++ application/xrds+xml "text/xml" image/x-bmp gif application/x-rar-compressed Image/png mime/type image/null text/troff application/vnd.sun.xml.impress text/enriched application/icalendar application-x/javascript x-mapp-php4 imag/x-icon application/x-shockwave-flash2-preview Swish image/x-photoshop application/x-quicktimeplayer image/x-cmu-raster httpd/yahoo-send-as-is application/x-mpeg Video/X-Flv text/x-python text/text Text/HTML audio/mid text/Calendar application/x-wais-source application/x-perl audio/x-scpls image/txt application/pgp-keys Applicationxm text/calendar PNG text/x-vcard x-png application/octet-string unknown/unknown application/x-troff-me text/x-javascript video/x-m4v application/octetstream application/pgp-signature Image application/x-sh audio/x-mpegurl audio/unknown chemical/x-xyz image/x-portable-graymap image/#{favicon_formats[format]} image/files/curryjpg test/xml text/x-invalid video/x-flv text/javascript+json application/perl application/x.atom+xml application/octet_stream video/mp4
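
Many of the strings above are not well-formed media types at all. The sketch below shows one possible first-pass check: it lowercases a reported type and tests it against the `type/subtype` name syntax of RFC 6838. This is an assumed approach for illustration, not the check WAX or the DRS performs, and a syntactically valid string (such as `test/xml`) can still be wrong or unregistered.

```python
import re
from typing import Optional

# RFC 6838 restricted-name: a letter or digit, then up to 126 more
# characters drawn from letters, digits, and ! # $ & - ^ _ . +
_NAME = r"[a-z0-9][a-z0-9!#$&^_.+-]{0,126}"
_MEDIA_TYPE = re.compile(rf"{_NAME}/{_NAME}")


def normalize_mime(reported: str) -> Optional[str]:
    """Return a lowercased type/subtype, or None if it is not well formed."""
    # Strip whitespace and stray surrounding quotes, then lowercase.
    value = reported.strip().strip('"').lower()
    return value if _MEDIA_TYPE.fullmatch(value) else None


samples = ["Text/HTML", '"text/xml"', "textcss",
           "image/files/curryjpg", "image/#{favicon_formats[format]}"]
results = {s: normalize_mime(s) for s in samples}
```

Case variants such as `Text/HTML` and `text/html` collapse into one entry after normalization, while strings like `textcss` are flagged for format identification by other means.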

The Light: The Collections

The Partners
Megan Sniffin-Marinoff, University Archivist
  A-Sites: Archived Harvard Web Sites, collected by the Harvard University Archives
Marilyn Dunn, Executive Director of the Schlesinger Library and Librarian of the Radcliffe Institute
  Blogs: Capturing Women's Voices, collected by the Arthur and Elizabeth Schlesinger Library on the History of Women in America
Helen Hardacre, Reischauer Institute Professor of Japanese Religions and Society
  Web Archiving Project on Constitutional Revision, collected by the Reischauer Institute of Japanese Studies with sponsorship from the Harvard College Library Documentation Center on Contemporary Japan

To Participate
http://hul.harvard.edu/ois/systems/wax

Questions?
“…we have rather chosen to fill our hives with honey and wax, thus furnishing mankind with the two noblest of things, which are sweetness and light.” Jonathan Swift

Image Credits
Title slide: http://www.flickr.com/photos/lwr/59014972/in/set-1552655/
The darkness: http://www.melegraph.com/images/outerspace.jpg
The candle: http://www.sxc.hu/pic/m/a/as/asolario/472153_peach_votive_candle.jpg
The Web: http://projecta-z.com/Internet_map_1024.jpg
The light: http://i252.photobucket.com/albums/hh2/habeba2007/candles-1-1.gif