

  • Number of slides: 77

Thinking Differently About Web Page Preservation
Michael L. Nelson, Frank McCown, Joan A. Smith
Old Dominion University, Norfolk VA
{mln, fmccown, jsmit}@cs.odu.edu
Library of Congress Brown Bag Seminar, June 29, 2006
Research supported in part by NSF, Library of Congress and the Andrew Mellon Foundation

Background
• “We can’t save everything!”
  – if not “everything”, then how much?
  – what does “save” mean?

“Women and Children First”
HMS Birkenhead, Cape Danger, 1852: 638 passengers, 193 survivors; all 7 women & 13 children survived.
image from: http://www.btinternet.com/~palmiped/Birkenhead.htm

We should probably save a copy of this…

Or maybe we don’t have to… the Wikipedia link is in the top 10, so we’re ok, right?

Surely we’re saving copies of this…

2 copies in the UK, 2 Dublin Core records. That’s probably good enough…

What about the things that we know we don’t need to keep? You DO support recycling, right?

A higher moral calling for pack rats?

Just Keep the Important Stuff!

Lessons Learned from the AIHT
(Boring stuff: D-Lib Magazine, December 2005)
Preservation metadata is like a David Hockney Polaroid collage: each image is both true and incomplete, and while the result is not faithful, it does capture the “essence”.
images from: http://facweb.cs.depaul.edu/sgrais/collage.htm

Preservation: Fortress Model
Five Easy Steps for Preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and despair!”
image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

Alternate Models of Preservation
• Lazy Preservation – Let Google, IA et al. preserve your website
• Just-In-Time Preservation – Wait for it to disappear first, then recover a “good enough” version
• Shared Infrastructure Preservation – Push your content to sites that might preserve it
• Web Server Enhanced Preservation – Use Apache modules to create archival-ready resources
image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

Lazy Preservation
“How much preservation do I get if I do nothing?”
Frank McCown

Outline: Lazy Preservation
• Web Infrastructure as a Resource
• Reconstructing Web Sites
• Research Focus

Web Infrastructure

Cost of Preservation
[chart: publisher’s cost (time, equipment, knowledge) vs. coverage of the Web, spanning client-view to server-view approaches – filesystem backups, Furl/Spurl, browser cache, LOCKSS, InfoMonitor, Hanzo:web, iPROXY, TTApache, web archives, SE caches]

Outline: Lazy Preservation
• Web Infrastructure as a Resource
• Reconstructing Web Sites
• Research Focus

Research Questions
• How much digital preservation of websites is afforded by lazy preservation?
  – Can we reconstruct entire websites from the WI?
  – What factors contribute to the success of website reconstruction?
  – Can we predict how much of a lost website can be recovered?
  – How can the WI be utilized to provide preservation of server-side components?

Prior Work
• Is website reconstruction from the WI feasible?
  – Web repositories: G, M, Y, IA
  – Web-repository crawler: Warrick
  – Reconstructed 24 websites
• How long do search engines keep cached content after it is removed?

Timeline of SE Resource Acquisition and Release
Vulnerable resource – not yet cached (tca is not defined)
Replicated resource – available on web server and in SE cache (tca < current time < tr)
Endangered resource – removed from web server but still cached (tr < current time < tcr)
Unrecoverable resource – missing from web server and cache (tca < tcr < current time)
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
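The four states are just comparisons among these timestamps. A minimal sketch of the classification, assuming tca (time cached), tr (time removed from the server), and tcr (time removed from the cache) are known, with None standing for “hasn’t happened yet” (the helper itself is illustrative, not from the cited papers):

```python
from datetime import datetime
from typing import Optional

def resource_state(now: datetime,
                   t_ca: Optional[datetime],   # time the SE cached the resource
                   t_r: Optional[datetime],    # time it was removed from the server
                   t_cr: Optional[datetime]) -> str:
    """Classify a resource per the timeline above (hypothetical helper)."""
    if t_ca is None:
        return "vulnerable"      # on the server, not yet cached
    if t_r is None or now < t_r:
        return "replicated"      # on the server and in the SE cache
    if t_cr is None or now < t_cr:
        return "endangered"      # gone from the server, still cached
    return "unrecoverable"       # gone from both server and cache
```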

Cached Image

Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
canonical | MSN version | Yahoo version | Google version

Web Repository Characteristics

Type                               MIME type                       Typical file ext   Google   Yahoo   MSN   IA
HTML                               text/html                       html               C        C       C     C
Plain text                         text/plain                      txt, ans           M        M       M     C
Graphic Interchange Format         image/gif                       gif                M        M       ~R    C
Joint Photographic Experts Group   image/jpeg                      jpg                M        M       ~R    C
Portable Network Graphic           image/png                       png                M        M       ~R    C
Adobe Portable Document Format     application/pdf                 pdf                M        M       M     C
JavaScript                         application/javascript          js                 M                      C
Microsoft Excel                    application/vnd.ms-excel        xls                M        ~S      M     C
Microsoft PowerPoint               application/vnd.ms-powerpoint   ppt                M        M       M     C
Microsoft Word                     application/msword              doc                M        M       M     C
PostScript                         application/postscript          ps                 M        ~S            C

C = canonical version is stored; M = modified version is stored (modified images are thumbnails, all others are HTML conversions); ~R = indexed but not retrievable; ~S = indexed but not stored

SE Caching Experiment
• Create HTML, PDF, and image files
• Place files on 4 web servers
• Remove files on a regular schedule
• Examine web server logs to determine when each page is crawled and by whom
• Query each search engine daily, using a unique identifier, to see if they have cached the page or image (sketched below)
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
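A hedged sketch of the daily cache probe: the query endpoint and the “identifier appears in the results” check are stand-ins (each engine had its own query syntax and cached-copy links), but the logic matches the experiment as described:

```python
import requests

def is_cached(search_url_template: str, unique_id: str) -> bool:
    """Ask a search engine whether a page carrying unique_id is indexed.

    search_url_template is a placeholder, e.g.
    'https://search.example.com/search?q={q}' -- not a real endpoint.
    """
    resp = requests.get(search_url_template.format(q=unique_id), timeout=30)
    resp.raise_for_status()
    # Crude check: the unique identifier can only appear in the results
    # if the engine crawled and indexed the page; a fuller probe would
    # also follow the "Cached" link and store the cached copy.
    return unique_id in resp.text
```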

Caching of HTML Resources – mln

Reconstructing a Website
[diagram: starting from an original URL, Warrick queries each web repo for cached URLs, pulls the cached resources from results pages, and writes retrieved resources to the file system]
1. Pull resources from all web repositories
2. Strip off extra header and footer HTML
3. Store the most recently cached version or the canonical version
4. Parse HTML for links to other resources
(a sketch of this loop follows)
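In outline, Warrick behaves like an ordinary crawler whose “fetch” consults repository caches instead of the live site. A simplified sketch of the four-step loop (the repository interface and the save/extract_links helpers are hypothetical stand-ins, not Warrick’s actual API):

```python
def reconstruct(start_url, repositories, save, extract_links):
    """repositories: objects whose .lookup(url) returns
    (cache_timestamp, body) with repo headers/footers already
    stripped (step 2), or None if the repo holds no copy."""
    queue, seen, recovered = [start_url], {start_url}, {}
    while queue:
        url = queue.pop(0)
        # 1. Pull the resource from every repository that holds it
        hits = [hit for repo in repositories if (hit := repo.lookup(url))]
        if not hits:
            continue                       # this URL could not be recovered
        # 3. Keep the most recently cached (or canonical) version
        _, body = max(hits, key=lambda h: h[0])
        recovered[url] = body
        save(url, body)
        # 4. Parse HTML for links to further resources
        for link in extract_links(body, base=url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return recovered
```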

How Much Did We Reconstruct?
[diagram: the “lost” web site (resources A–G) beside the reconstructed web site (A, B′, C′, E, F): the link to D is missing, a link points to the old resource G, and G can’t be found]

Reconstruction Diagram
added 20% | identical 50% | changed 33% | missing 17%

Websites to Reconstruct
• Reconstruct 24 sites in 3 categories:
  1. small (1-150 resources)
  2. medium (150-499 resources)
  3. large (500+ resources)
• Use Wget to download the current website
• Use Warrick to reconstruct it
• Calculate the reconstruction vector (see the sketch below)
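The reconstruction vector summarizes each site’s outcome as the fractions in the diagram above (identical, changed, missing, plus added resources). A plausible computation, assuming URL-to-content maps for both sites and a same(a, b) equivalence test (the cited tech report gives the exact definition):

```python
def reconstruction_vector(original: dict, recovered: dict, same) -> dict:
    """original/recovered map URL -> content; fractions are relative
    to the original site's size, as in the slide's percentages."""
    changed = sum(1 for u in original
                  if u in recovered and not same(original[u], recovered[u]))
    missing = sum(1 for u in original if u not in recovered)
    added = sum(1 for u in recovered if u not in original)
    n = len(original)
    return {"identical": (n - changed - missing) / n,
            "changed": changed / n,
            "missing": missing / n,
            "added": added / n}
```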

Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.

Aggregation of Websites

Web Repository Contributions

Warrick Milestones
• www2006.org – first lost website reconstructed (Nov 2005)
• DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
• www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
• Internet Archive officially “blesses” Warrick (mid Mar 2006)¹
¹ http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html

Outline: Lazy Preservation
• Web Infrastructure as a Resource
• Reconstructing Web Sites
• Research Focus

Proposed Work
• How lazy can we afford to be?
  – Find factors influencing the success of website reconstruction from the WI
  – Perform search engine cache characterization
• Inject server-side components into the WI for complete website reconstruction
• Improve the Warrick crawler
  – Evaluate different crawling policies
    • Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-repository Crawler, ACM Hypertext 2006.
  – Develop a web-repository API for inclusion in Warrick

Factors Influencing Website Recoverability from the WI
• Previous study did not find a statistically significant relationship between recoverability and website size or PageRank
• Methodology
  – Sample a large number of websites from dmoz.org
  – Perform several reconstructions over time using the same policy
  – Download sites several times over time to capture change rates

Evaluation
• Use statistical analysis to test for the following factors:
  – Size
  – Makeup
  – Path depth
  – PageRank
  – Change rate
• Create a predictive model: how much of my lost website do I expect to get back?

Marshall TR Server – running EPrints

We can recover the missing page and PDF, but what about the services?

Recovery of Web Server Components
• Recovering the client-side representation is not enough to reconstruct a dynamically-produced website
• How can we inject the server-side functionality into the WI?
• Web repositories like HTML
  – Canonical versions stored by all web repos
  – Text-based
  – Comments can be inserted without changing the appearance of the page
• Injection: use erasure codes to break a server file into chunks and insert the chunks into HTML comments of different pages (sketched below)
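A toy illustration of the injection idea, using plain fixed-size chunks instead of a real erasure code (with an actual erasure code, any r of the n chunks would suffice to rebuild the file; here all n are needed, and the comment marker is invented for the sketch):

```python
import base64

def make_chunks(path: str, n: int) -> list[str]:
    """Encode a server-side file as n HTML comments, one per page."""
    data = base64.b64encode(open(path, "rb").read()).decode()
    size = -(-len(data) // n)            # ceiling division
    return [f"<!-- chunk {i}/{n}: {data[i * size:(i + 1) * size]} -->"
            for i in range(n)]

def recover_file(chunks: list[str]) -> bytes:
    """Reassemble the file from comments scraped out of cached pages."""
    ordered = sorted(chunks, key=lambda c: int(c.split()[2].split("/")[0]))
    payload = "".join(c.split(": ", 1)[1].removesuffix(" -->") for c in ordered)
    return base64.b64decode(payload)
```

Recovering the chunks is then just a matter of pulling the pages that contain them back out of the WI, as on the next slide.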

Recover Server File from WI

Evaluation
• Find the most efficient values for n and r (chunks created/recovered)
• Security
  – Develop a simple mechanism for selecting files that can be injected into the WI
  – Address encryption issues
• Reconstruct an EPrints website with a few hundred resources

SE Cache Characterization
• Web characterization is an active field
• Search engine caches have never been characterized
• Methodology (one sampling step is sketched below)
  – Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
  – Download the cached version and the live version from the Web
  – Examine HTTP headers and page content
  – Test for overlap with the Internet Archive
  – Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache
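One sampling step might look like the following, assuming the cached-copy URL for a sampled page is already known (obtaining it differs per engine) and using an MD5 digest to compare page content:

```python
import hashlib
import requests

def compare_versions(live_url: str, cached_url: str) -> dict:
    """Fetch the live and cached versions and record the basics
    the characterization needs (headers and content identity)."""
    live = requests.get(live_url, timeout=30)
    cached = requests.get(cached_url, timeout=30)
    md5 = lambda r: hashlib.md5(r.content).hexdigest()
    return {"live_status": live.status_code,
            "content_type": live.headers.get("Content-Type"),
            "last_modified": live.headers.get("Last-Modified"),
            "identical": md5(live) == md5(cached)}
```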

Summary: Lazy Preservation
When this work is completed, we will have…
• demonstrated and evaluated the lazy preservation technique
• provided a reference implementation
• characterized SE caching behavior
• provided a layer of abstraction on top of SE behavior (API)
• explored how much we can store in the WI (server-side vs. client-side representations)

Web Server Enhanced Preservation
“How much preservation do I get if I do just a little bit?”
Joan A. Smith

Outline: Web Server Enhanced Preservation
• OAI-PMH
• mod_oai: complex objects + resource harvesting
• Research Focus

WWW and DL: Separate Worlds
[timeline, 1994 to today: the WWW (“Crawlapalooza”) and the DL world (“Harvester Home Companion”) evolving side by side]
The problem is not that the WWW doesn’t work; it clearly does. The problem is that our (preservation) expectations have been lowered.

Data Providers / Repositories: “A repository is a network accessible server that can process the 6 OAI-PMH requests… A repository is managed by a data provider to expose metadata to harvesters.”
Service Providers / Harvesters: “A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”

Aggregators allow for:
• scalability for OAI-PMH
• load balancing
• community building
• discovery
data providers (repositories) → aggregator → service providers (harvesters)

OAI-PMH data model
resource → item (the OAI-PMH identifier is the entry point to all records pertaining to the resource) → records (metadata pertaining to the resource, e.g., Dublin Core metadata, MARCXML metadata)
Each record carries a metadataPrefix and a datestamp; items can be grouped into OAI-PMH sets.

OAI-PMH Used by Google & Academic Live (MSN)
Why support OAI-PMH?
$ These guys are in business (i.e., for profit)
Q: How does OAI-PMH help their bottom line?
A: By improving the search and analysis process

Resource Harvesting with OAI-PMH
resource → item (the OAI-PMH identifier is the entry point to all records pertaining to the resource) → records (metadata pertaining to the resource):
• Dublin Core metadata – simple
• MARCXML metadata – more expressive
• MPEG-21 DIDL / METS – highly expressive

Outline: Web Server Enhanced Preservation
• OAI-PMH
• mod_oai: complex objects + resource harvesting
• Research Focus

Two Problems
The counting problem: there is no way to determine the list of valid URLs at a web site.
The representation problem: machine-readable formats and human-readable formats have different requirements.

mod_oai solution
• Integrate OAI-PMH functionality into the web server itself…
• mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an HTTP server
  – written in C
  – respects values in .htaccess, httpd.conf
• Compile mod_oai on http://www.foo.edu/
  – baseURL is now http://www.foo.edu/modoai
• Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)

http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg

The human-readable web site, prepped for machine-friendly harvesting: “Give me a list of all resources, with Dublin Core metadata, dated from 9/15/2004 through today, that are MIME type video/mpeg.”
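Since the baseURL behaves like any OAI-PMH endpoint, a few lines of client code can walk an entire site. A sketch against the slide’s hypothetical www.foo.edu server, handling resumption tokens as the protocol defines:

```python
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"   # OAI-PMH response namespace
BASE = "http://www.foo.edu/modoai"               # hypothetical baseURL

def list_identifiers(frm=None, set_=None):
    """Yield OAI-PMH identifiers (here, URLs) of matching resources."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
    if frm:
        params["from"] = frm
    if set_:
        params["set"] = set_
    while True:
        resp = requests.get(BASE, params=params, timeout=60)
        root = ET.fromstring(resp.content)
        for header in root.iter(f"{OAI}header"):
            yield header.findtext(f"{OAI}identifier")
        token = root.findtext(f"{OAI}ListIdentifiers/{OAI}resumptionToken")
        if not token:
            break                                # no more pages of results
        params = {"verb": "ListIdentifiers", "resumptionToken": token}

# e.g., everything of MIME type video/mpeg modified since 2004-09-15:
# for url in list_identifiers(frm="2004-09-15", set_="mime:video:mpeg"): ...
```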

A Crawler’s View of the Web Site
[diagram: web root with crawled pages at the center, surrounded by regions that are not crawled – protected areas, pages generated on-the-fly (by CGI, e.g.), pages excluded by robots.txt or robots META tags, unadvertised & unlinked pages, pages reachable only by remote link, and pages that are too deep; remote web site]

Apache’s View of the Web Site
[diagram: web root with areas that require authentication, content generated on-the-fly (CGI, e.g.), pages tagged “no robots”, and content that is unknown/not visible to the server]

The Problem: Defining The “Whole Site”
• For a given server, there is a set of URLs, U, and a set of files, F
  – Apache maps U → F
  – mod_oai maps F → U
• Neither function is 1-1 nor onto
  – We can easily check if a single u maps into F, but given F we cannot (easily) generate U
• Short-term issues:
  – dynamic files
    • exporting unprocessed server-side files would be a security hole
  – IndexIgnore
    • httpd will “hide” valid URLs
  – File permissions
    • httpd will advertise files it cannot read
• Long-term issues:
  – Alias, Location
    • files can be covered up by the httpd
  – UserDir
    • interactions between the httpd and the filesystem

A Webmaster’s Omniscient View
[diagram: the full site, including MySQL-backed dynamic content (Data1, User.abc, Fred.foo), authenticated areas, pages tagged “no robots”, orphaned files (file1, /dir/wwx, Foo.html), deep pages, and content unknown/not visible to the httpd]

HTTP “GET” versus OAI-PMH GetRecord
[diagram: the same Apache web server, with mod_oai installed, answers both requests]
“GET /headlines.html HTTP/1.1” → the human-readable page
“GET /modoai/?verb=GetRecord&identifier=headlines.html&metadataPrefix=oai_didl” → a machine-readable complex object (JHOVE metadata, MD-5, LS)

OAI-PMH data model in mod_oai
resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
OAI-PMH identifier = entry point to all records pertaining to the resource
OAI-PMH sets = MIME type
item → records (metadata pertaining to the resource): Dublin Core metadata, HTTP header metadata, MPEG-21 DIDL

Complex Objects That Tell A Story
http://foo.edu/bar.pdf encoded as an MPEG-21 DIDL, a Russian nesting doll:
• DC metadata
• JHOVE metadata
• Checksum
• … Provenance (“First came Lenin, then came Stalin…”)
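In the spirit of that nesting doll, a simplified sketch of wrapping a resource by value together with its metadata in a single XML object (the element names are illustrative, not the actual MPEG-21 DIDL schema):

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

def package(url: str, data: bytes, metadata: dict) -> bytes:
    """Nest metadata, checksum, and the base64-encoded resource
    inside one self-describing XML document."""
    obj = ET.Element("DigitalObject", {"identifier": url})
    for name, value in metadata.items():     # e.g. DC fields, JHOVE output
        ET.SubElement(obj, "Metadata", {"type": name}).text = value
    ET.SubElement(obj, "Checksum", {"algorithm": "MD5"}).text = \
        hashlib.md5(data).hexdigest()
    ET.SubElement(obj, "Resource", {"encoding": "base64"}).text = \
        base64.b64encode(data).decode()
    return ET.tostring(obj, encoding="utf-8", xml_declaration=True)
```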

Resource Discovery: ListIdentifiers
HARVESTER:
• issues a ListIdentifiers
• finds URLs of updated resources
• does HTTP GETs on updates only
• can get URLs of resources with specified MIME types

Preservation: ListRecords
HARVESTER:
• issues a ListRecords
• gets updates as MPEG-21 DIDL documents (HTTP headers, resource by value or by reference)
• can get resources with specified MIME types

What does this mean?
• For an entire web site, we can:
  – serialize everything as an XML stream
  – extract it using off-the-shelf OAI-PMH harvesters
  – efficiently discover updates & additions
• For each URL, we can:
  – create a “preservation ready” version with configurable {descriptive|technical|structural} metadata
    • e.g., JHOVE output, datestamps, signatures, provenance, automatically generated summary, etc.
[diagram: harvest the resource → extract metadata (JHOVE & other pertinent info) → include an index, translations, lexical signatures, summaries, etc. → wrap it all together in an XML stream → ready for the future]

Outline: Web Server Enhanced Preservation
• OAI-PMH
• mod_oai: complex objects + resource harvesting
• Research Focus

Research Contributions
Thesis question: How well can Apache support web page preservation?
Goal: to make web resources “preservation ready”
– Support refreshing (“how many URLs at this site?”): the counting problem
– Support migration (“what is this object?”): the representation problem
How: using OAI-PMH resource harvesting
– Aggregate forensic metadata • automate extraction
– Encapsulate into an object • an XML stream of information
– Maximize preservation opportunity • bring DL technology into the realm of the WWW

Experimentation & Evaluation
• Research solutions to the counting problem
  – Different tools yield different results
  – Google sitemap <> Apache file list <> robot-crawled pages
  – Combine approaches for one automated, full URL listing
• Apache logs are a detailed history of site activity
  – Compare user page requests with crawlers’ requests (see the sketch below)
  – Compare crawled pages with the actual site tree
• Continue research on the representation problem
  – Integrate utilities into mod_oai (JHOVE, etc.)
  – Automate metadata extraction & encapsulation
• Serialize and reconstitute: complete back-up of a site & reconstitution through an XML stream
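A sketch of the log-based comparison, assuming Apache’s combined log format and a simple user-agent heuristic for spotting crawlers (a real analysis would also match known robot IPs):

```python
import re
from collections import Counter

LINE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)[^"]*"'
                  r' \d+ \S+(?: "[^"]*" "([^"]*)")?')
ROBOT = re.compile(r"googlebot|msnbot|slurp|ia_archiver|crawler|spider", re.I)

def split_requests(log_path: str):
    """Tally URLs requested by crawlers vs. everyone else."""
    crawled, users = Counter(), Counter()
    with open(log_path) as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue                 # skip unparseable lines
            url, agent = m.group(1), m.group(2) or ""
            (crawled if ROBOT.search(agent) else users)[url] += 1
    return crawled, users
```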

Summary: Web Server Enhanced Preservation
• Better web harvesting can be achieved through:
  – OAI-PMH: structured access to updates
  – Complex object formats: modeled representation of digital objects
• Addresses 2 key problems:
  – Preservation (ListRecords): the representation problem
  – Web crawling (ListIdentifiers): the counting problem
• mod_oai: reference implementation
  – Better performance than wget & crawlers
  – Not a replacement for DSpace, Fedora, eprints.org, etc.
• More info:
  – http://www.modoai.org/
  – http://whiskey.cs.odu.edu/
Automatic harvesting of web resources, rich in metadata, packaged for the future. Today: manual; tomorrow: automatic!

Summary
Michael L. Nelson

Summary
• Digital preservation is not hard, it’s just big.
  – Save the women and children first, of course, but there is room for many more…
• Using the by-products of the SE and WI, we can get a good amount of preservation for free
  – prediction: Google et al. will eventually see preservation as a business opportunity
• Increasing the role of the web server will solve most of the digital preservation problems
  – complex objects + OAI-PMH = digital preservation solution

“As you know, you preserve the files you have. They’re not the files you might want or wish to have at a later time.”
“If you think about it, you can have all the metadata in the world on a file and a file can be blown up.”
image from: http://www.washingtonpost.com/wp-dyn/articles/A132-2004Dec14.html