db252a626e792d789c8f5b9b5e54d866.ppt
- Number of slides: 15
Strategies for archiving the Danish web space
Bjarne Andersen
Head of Digital Resources
State and University Library, Aarhus
bja@netarkivet.dk
http://netarchive.dk
Agenda
- New legal deposit law in Denmark
- Collection strategies
- NetarchiveSuite software package
- Snapshot harvesting
- Selective harvesting
- Event harvesting
- Challenges in snapshot harvesting
- Snapshot harvesting usefulness
- Future work
Legal deposit law 1
- Revision of the legal deposit law in 1997
  - Legal deposit now included static documents on the internet
- During 1998-1999 we found out that:
  - We were actually preserving the least interesting part
    - Many of the documents in that collection are also available in print
- A lot of work was done between 2000 and 2004
  - 2 pilot projects run by the two national libraries
    - Testing different software / different strategies for archiving / storing web material
  - A governmental publication on "preserving the Danish digital cultural heritage" (2003)
  - A report to the Ministry of Culture (2004) outlining
    - Recommendations from the two national libraries on how to solve the "entire" problem
    - Issues to be covered by a new revision of the legal deposit law
Legal deposit law 2
- A new revision came into force on July 1st 2005
  - Allowing the two national libraries to automatically gather all Danish websites
  - "Danish" roughly defined as:
    - Websites on the .dk TLD
    - Websites aimed at a Danish audience / written in Danish
    - Websites about Danish people (Hans Christian Andersen)
    - More or less any site of interest to Denmark
  - We are by law granted access to all relevant data from the .dk TLD administrator
Legal deposit law 3
- The law covers all publicly available material
  - Material that all Danish people in principle can gain access to
    - Including material which requires action before usage (payment, registration...)
    - Pay-sites should hand out username / password upon request (for free)
- Other interesting parts
  - Combined strategy (snapshot, selective and event harvesting)
  - Robots.txt explicitly mentioned in the regulations of the law
    - A lot of the very interesting websites have very restrictive robots.txt files (we discovered around 35,000 robots.txt files)
    - During 6 snapshots of more than 750,000 web sites we had fewer than 50 complaints about robots.txt
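To make the robots.txt point concrete, here is a minimal Python sketch of how a restrictive robots.txt would exclude a whole site from a crawl. The robots.txt content, the user agent name and the `obey_robots` switch are hypothetical illustrations; the slides do not describe the actual harvester configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a very restrictive site (not taken from the archive).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_harvest(robots_txt, user_agent, url, obey_robots):
    """Decide whether to fetch a URL; obey_robots is an assumed policy switch."""
    if not obey_robots:
        return True
    return allowed_by_robots(robots_txt, user_agent, url)


# A crawler that obeys this file gets nothing from the site;
# one that is permitted to override it gets everything.
blocked = should_harvest(ROBOTS_TXT, "example-crawler", "http://example.dk/news/", True)
overridden = should_harvest(ROBOTS_TXT, "example-crawler", "http://example.dk/news/", False)
```

With roughly 35,000 robots.txt files in the harvested web space, a switch of this kind determines whether many of the "very interesting" sites end up in the archive at all.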
Legal deposit law 4
- In the end led to funding of
  - Netarchive.dk
    - Virtual centre in cooperation between
      - The Royal Library, Copenhagen
      - The State & University Library, Aarhus
    - Implementing a complete system
    - Running the archiving on a daily basis
  - Currently with an annual budget of 450,000 euros
  - Involving 15 people from the two libraries
    - 4.5 man-years of effort
The 3 collection strategies
- Illustrated by coverage over time
- Amount of data collected so far
  - Snapshots: 61 TB (6 times)
  - Selective harvests: 9.5 TB (80 web sites)
  - Event harvests: 5.6 TB (9 events)
NetarchiveSuite software package
- We needed a curator tool ready by July 1st 2005
  - Requirement number 1: Operated by librarians
- With the web interface librarians can:
  - Define harvests (all three types)
    - Based on quite simple settings + a number of different predefined Heritrix setups
  - Do quality control
    - Looking at harvest results (simple reports and statistics)
    - Browsing through harvested material
      - Automated pickup of missing URIs
- NetarchiveSuite was released as Open Source in July 2007
  - Currently used by a number of national libraries
Snapshot harvesting
- The .dk TLD currently holds > 750,000 active domains
- We encountered around 42,000 Danish domains outside the .dk TLD
  - By extracting links from the entire .dk web space and checking the country code by IP number (GeoIP)
  - By doing Google searches on Danish localities (city names...)
- With 8 machines we can do
  - One complete snapshot (including deduplication) at 20 TB in 80 days
  - Deduplication saves around 30% of the storage space
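The deduplication idea can be sketched in a few lines of Python: a payload is stored only if its content digest has not been seen in an earlier harvest. This is a simplified illustration under stated assumptions, not a description of how NetarchiveSuite or Heritrix implement it; the function name, the in-memory digest set and the example payloads are all hypothetical.

```python
import hashlib


def harvest_with_dedup(responses, seen_digests):
    """Store only payloads with a new digest.

    responses: iterable of (url, payload_bytes) pairs.
    seen_digests: set of hex digests from earlier harvests (updated in place).
    Returns (stored_bytes, skipped_bytes).
    """
    stored = skipped = 0
    for _url, payload in responses:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen_digests:
            skipped += len(payload)  # unchanged since an earlier harvest
        else:
            seen_digests.add(digest)
            stored += len(payload)
    return stored, skipped


# Hypothetical data: the front page changes between snapshots, the logo does not.
seen = set()
first = harvest_with_dedup(
    [("http://example.dk/", b"a" * 700), ("http://example.dk/logo.png", b"b" * 300)],
    seen,
)
second = harvest_with_dedup(
    [("http://example.dk/", b"c" * 700), ("http://example.dk/logo.png", b"b" * 300)],
    seen,
)
```

In this made-up example the second snapshot skips 300 of 1000 bytes. The numbers were chosen only to mirror the order of magnitude of the savings quoted on the slide.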
Selective harvesting
- Archiving of 80 selected websites
  - News sites
  - "Typical" dynamic and heavily used sites representing civic society, the commercial sector and public authorities
  - Experimental and/or unique sites, documenting new ways of using the web (e.g. net art)
- Harvested much more frequently
  - From weekly to several times per day
Event harvesting
- Combining the other two strategies
  - Taking a larger number of sites (200-3000)
  - On a more frequent basis (daily / weekly)
  - In a shorter period of time
- We have done 9 event harvests so far
  - Elections, different national events
- We have pre-defined some harvest definitions, especially for news sites (both local and national)
  - With one click we can start these if a sudden event should happen, to ensure collection of important sites from the very beginning
Challenges in snapshot harvesting
- Number of domains is constantly growing
  - 2005: 607,000 domains, 480,000 active
  - 2008: 950,000 domains, 750,000 active
- Domains are growing bigger and bigger
  - Audio/video is getting more and more popular
  - Sites larger than 10 MB increased from 40,000 to 90,000
  - Sites larger than 500 MB increased from 6,000 to 12,000
- Web 2.0 makes harvesting difficult
  - Web material is inlined from other web sites, from all over the world
    - The border of a web site is disappearing
  - The web is becoming more and more dynamic: Flash / Ajax
- The amount of traps and spam grows constantly
  - In Denmark librarians manually inspect all websites larger than 1 GB
    - Currently over 3000 domains
    - They identify aliases and potential crawler traps
  - That task should be (semi-)automated
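As an illustration of what "(semi-)automated" inspection could look like, the Python sketch below flags domains above a size threshold and applies one common trap heuristic, a path segment repeated many times in a URL. Both functions, the 3-repeat limit and the example sizes are assumptions for illustration; the slides do not say which checks the librarians actually perform.

```python
from collections import Counter
from urllib.parse import urlsplit

ONE_GB = 10**9  # assumption: decimal gigabytes; the slide does not specify


def domains_for_inspection(harvested_bytes, threshold=ONE_GB):
    """Return domains whose harvested size exceeds the threshold, largest first."""
    large = [d for d, size in harvested_bytes.items() if size > threshold]
    return sorted(large, key=lambda d: harvested_bytes[d], reverse=True)


def looks_like_trap(url, max_repeats=3):
    """Heuristic: the same path segment occurring max_repeats times or more."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(n >= max_repeats for n in Counter(segments).values())


# Hypothetical per-domain harvest sizes in bytes.
sizes = {
    "small.example.dk": 5 * 10**6,
    "big.example.dk": 3 * 10**9,
    "huge.example.dk": 7 * 10**9,
}
to_inspect = domains_for_inspection(sizes)
```

A heuristic like this can only pre-sort the queue. Aliases and more subtle traps (calendars, session IDs in query strings) would still need a human or further rules.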
Snapshot harvesting usefulness
- With snapshot harvesting a web archive ensures cultural heritage by
  - Archiving regular "pictures" of entire national parts of the internet
  - Archiving as much as possible in a quite cheap way
    - Netarchive.dk: Storage space and 15 hours per week for librarians
- Snapshots are very useful for research in many different areas
  - Linguistics
  - Web technologies
  - File formats and their evolution
  - Web design
  - Genealogy / ancestor search
  - Web site history
  - And many more, to be defined in the future
- And of course useful for more ordinary users wanting
  - To find content that has disappeared from the live web (40-100 days lifetime)
- Getting more and more interesting over time
- Currently access to Netarchive.dk is limited to researchers
Future work
- Automating discovery of Danish web sites outside the .dk TLD
- Automated quality assurance for large crawls
- Automating filtering of web spam and traps
- Improving archiving of web 2.0
  - Dynamic web content
  - Streaming audio/video
- None of these problems are Danish
  - Let's solve them together
  - LIWA: European project working on most of these problems
- Danish challenges
  - Working for better access possibilities
    - On the system level: Wayback Machine / NutchWAX search
    - On the political level: Change of law
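The first future-work item builds on the discovery approach mentioned under snapshot harvesting (extract links, then check the country by IP). A minimal Python sketch of that pipeline follows. The `country_of_host` callable is a hypothetical stand-in for a real GeoIP lookup, and the example HTML and hosts are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect the host names of all <a href=...> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            host = urlsplit(urljoin(self.base_url, href)).hostname
            if host:
                self.hosts.add(host)


def candidate_danish_hosts(base_url, html, country_of_host):
    """Hosts outside .dk that the (assumed) lookup places in Denmark."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return sorted(
        h for h in extractor.hosts
        if not h.endswith(".dk") and country_of_host(h) == "DK"
    )


# Hypothetical page and a fake lookup table in place of a GeoIP database.
PAGE = (
    '<a href="http://example.com/">x</a>'
    '<a href="/local">y</a>'
    '<a href="http://example.org/">z</a>'
)
fake_geoip = {"example.com": "DK", "example.org": "US"}.get
candidates = candidate_danish_hosts("http://example.dk/", PAGE, fake_geoip)
```

Server location is only a rough signal for "Danish" as the law defines it, so candidates found this way would presumably still need review before being added to a harvest.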
Questions?