- Number of slides: 21
Archiving the UK Web
Helen Hockx-Yu, Web Archiving Programme Manager, British Library
13 October 2008
Overview
- The need to archive the web
- The UK domain
- British Library web archiving activities
- What do we collect and how do we select?
- Workflow
- Metadata
- System components
- Permission / legal issues
- Selective archiving: pros and cons
- Conclusion
The need to archive the web
- To save e-publishing
- Reduce the risk of loss of ephemeral in-scope material: cited average life expectancy of websites is 44 days
- Research requires reliable references to e-publishing
- Contributes to the memory of the nation
- Issues
  - Illegal (pornography, terrorist, etc.) or rubbish sites
  - Needle in a haystack: how to filter out useful material
  - How to foresee future research interests
  - How to keep pace with rapidly changing web technology
The UK domain: an overview
- Territoriality criteria
  - Has a .uk domain name
  - Relates to UK-based individuals or organisations which use other domain names, e.g. .org, .com, .net or alternatives
  - Can be demonstrated, if an overseas publication, to be made available by a UK-based publisher
- 6.1 million .uk domains registered in 2007, plus 50,000 other domains which can be identified as published in the UK
- Growth by 17% per annum till 2011, then by 15% till 2016
- Online publications in scope for legal deposit estimated at 3.9 million sites in 2007, rising to 14.6 million in 2016
- Average size of websites is 25 MB (domain harvesting model) or 180 MB (permission-based harvesting)
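The growth figures above can be checked with a few lines of arithmetic. This is a minimal sketch, not the model behind the slide: it assumes the 17% rate applies to each year from 2008 to 2011 and the 15% rate to each year from 2012 to 2016, and it uses decimal units (1 TB = 1,000,000 MB). The function and constant names are illustrative only.

```python
# Figures quoted on the slide. How the two growth phases map onto
# calendar years is an assumption made for this sketch.
SITES_2007 = 3.9e6        # in-scope sites estimated for 2007
AVG_SITE_MB_DOMAIN = 25   # average site size, domain harvesting model

# (last year the rate applies to, annual growth rate)
PHASES = [(2011, 0.17), (2016, 0.15)]


def project_sites(start_count, start_year, end_year, phases):
    """Compound start_count one year at a time using the phase rates."""
    count = start_count
    for year in range(start_year + 1, end_year + 1):
        # Pick the first phase whose final year has not yet passed.
        rate = next(r for last_year, r in phases if year <= last_year)
        count *= 1 + rate
    return count


sites_2016 = project_sites(SITES_2007, 2007, 2016, PHASES)

# Storage for a single pass over the 2007 estimate, in decimal TB.
storage_2007_tb = SITES_2007 * AVG_SITE_MB_DOMAIN / 1e6

print(f"Projected in-scope sites in 2016: {sites_2016 / 1e6:.1f} million")
print(f"One pass over the 2007 estimate: {storage_2007_tb:.1f} TB")
```

Under these assumptions the projection comes out at roughly 14.7 million sites, close to the 14.6 million quoted, and a single domain-harvest pass over the 2007 estimate would be on the order of 100 TB.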
British Library web archiving activities: history (1)
- DOMAIN.UK project by the BL in 2001: 6-month experiment to select and capture 100 UK websites, leading to the ongoing Web Archiving Programme
- BL participated in and led the UK Web Archiving Consortium (UKWAC), a collaborative initiative since 2004 to build a collective national web archive
- Permission-based selective archive
- Underwent major system / data migration earlier this year: from PANDAS to IIPC toolset
- Over 3,700 unique websites and over 11,400 instances, measuring approximately 2 TB
British Library web archiving activities: history (2)
- BL is the largest collector: to date it has archived 1,853 unique websites, 5,264 instances, or 1 TB of data
- BL infrastructure shared by NLW, TNA, Wellcome Trust and JISC. Currently finalising the public access interface to the archive
- Web Archiving Programme: BL as the point of first resort for a comprehensive archive of material from the UK web domain
  - Embed web archiving within the BL's overall collection development policy
  - Provide the infrastructure to collect, preserve and make accessible web materials
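As a rough consistency check, the archive totals on the two history slides can be turned into an average size per archived instance and compared with the 180 MB figure quoted for permission-based harvesting. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and treats the rounded totals ("approximately 2 TB", "1 TB") as exact, so the results are indicative only.

```python
MB_PER_TB = 1_000_000  # decimal units; an assumption, the slides do not say


def avg_instance_mb(total_tb, instances):
    """Average size of one archived instance in MB."""
    return total_tb * MB_PER_TB / instances


# Totals quoted on the slides (rounded in the source).
archive_avg = avg_instance_mb(2, 11_400)   # whole archive: ~2 TB, 11,400 instances
bl_avg = avg_instance_mb(1, 5_264)         # BL share: 1 TB, 5,264 instances

# Average number of captures per unique website.
archive_captures = 11_400 / 3_700
bl_captures = 5_264 / 1_853

print(f"Whole archive: {archive_avg:.0f} MB per instance, "
      f"{archive_captures:.1f} instances per site")
print(f"BL:            {bl_avg:.0f} MB per instance, "
      f"{bl_captures:.1f} instances per site")
```

Both averages land in the 175 to 190 MB range, in line with the 180 MB quoted for permission-based harvesting, with about three captures per site.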
British Library Web Collection Policy
"The BL will collect freely available sites selectively from the UK web space by prioritising the archiving of sites of research value across the spectrum of knowledge. In addition, the BL archives a selection of sites which are representative of British social history and cultural heritage in all its diversity and across the regions. It will also archive a small number of sites which demonstrate web innovation."
(British Library Web Collection Policy)
What does the British Library select?
- Research
  - Stated intentions of sites themselves or potential to be primary resources for research
  - Sites hosted by universities, government bodies; grey material published by campaigning organisations and charities
  - "Research on the web"
- Social history & culture
  - Sites representing British cultural diversity, regional difference, social significance (current trends, e.g. Facebook); key events of national life; topicality
- Innovation
  - Award-winning sites (pre-selected), sites illustrating the web's information, communication and training strength
Some examples
- Institute for Criminal Policy Research: Live site
- UK General Election 2005: Andrew George MP: Live site, Archived site (25 September 2005)
- David Shaw's Homepage: Live site, Archived site (13 March 2007)
- London Bombings: ABC News Map: Live site, Archived site (21 April 2005)
- Football Poets: Live site, Archived site (17 March 2005), Archived site (25 June 2006)
- E-publishing trends: Egg Bacon Chips and Beans: Live site, Archived site (17 Jan 2006)
How do we select?
- Subject specialists team
  - 20 curators spending 5% of their time
  - Cultural change
  - Appreciation of web technology and understanding of crawler limitations
- Collaboration with external organisations
  - Women's Library, V&A
  - UK Web Archiving Consortium
- Recommendations (e.g. colleagues)
- Self-selected via UKWAC web form
Workflow
Workflow (1)
Workflow (2)
Workflow (3): Archiving
Metadata
- Broad subject headings
- Collections
  - Catalogue records for collections
  - 26 thematic collections (e.g. blogs, British countryside, Digital lives)
  - Event-based collections
    - UK general election 2005
    - London terrorist attack 7/7/05
    - Indian Ocean Tsunami
    - London Mayor election 2008
    - Olympic & Paralympic Games 2012
- Titles
- Descriptions, at instance, site and collection levels, but not consistently collected
- Permissions
- Harvest log files (generated during the crawling process)
- Will start project to define requirements for preservation metadata and workflow
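One way to picture the three description levels listed above (instance, site, collection) is as nested records. The sketch below is purely illustrative: the class names, field names and example values are hypothetical and do not describe the schema of the BL systems. It also shows how gaps in inconsistently collected descriptions could be listed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    """One capture of a site (hypothetical fields)."""
    harvest_date: str
    description: Optional[str] = None
    harvest_log: Optional[str] = None  # path to the crawl log, if kept


@dataclass
class Site:
    title: str
    url: str
    subject_headings: List[str] = field(default_factory=list)
    description: Optional[str] = None
    permission_granted: bool = False
    instances: List[Instance] = field(default_factory=list)


@dataclass
class Collection:
    name: str
    kind: str  # "thematic" or "event"
    description: Optional[str] = None
    sites: List[Site] = field(default_factory=list)


def missing_descriptions(collection):
    """List every record in the collection that has no description."""
    gaps = []
    if collection.description is None:
        gaps.append(f"collection: {collection.name}")
    for site in collection.sites:
        if site.description is None:
            gaps.append(f"site: {site.title}")
        for inst in site.instances:
            if inst.description is None:
                gaps.append(f"instance: {site.title} @ {inst.harvest_date}")
    return gaps


# Made-up example data for illustration only.
election = Collection(
    name="UK general election 2005",
    kind="event",
    description="Sites relating to the 2005 general election",
    sites=[
        Site(
            title="Example candidate site",
            url="http://www.example.org/",
            permission_granted=True,
            instances=[Instance(harvest_date="2005-04-21")],
        )
    ],
)

for gap in missing_descriptions(election):
    print(gap)
```

Keeping descriptions optional at every level mirrors the point that they are not consistently collected; a report like this could feed the planned work on preservation metadata requirements.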
System Components
- Web Curator Tool (WCT)
  - Workflow management tool for selective archiving, jointly developed by the BL and NLNZ
  - 4 iterations of development completed
  - Requirements collected from sources worldwide
  - Uses embedded Heritrix for crawling (1.14.1)
  - Archivists still learning to use crawl profiles and settings effectively
- NutchWAX for (keyword) indexing and search
- Access interface incorporating OSWM
  - URL and keyword search
  - Browsing by titles, subjects and collections
  - RESTful API based on OpenSearch standards
- Smart crawler project: BL works together with the Internet Archive and other national libraries to develop smart capabilities of the crawler
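The slide does not give the API's URLs. OpenSearch 1.1 describes a search endpoint as a URL template with placeholders such as {searchTerms}, {startIndex?} and {count?}, where a trailing "?" marks an optional parameter. The sketch below fills in such a template; the host name and query parameter names are hypothetical, and the helper is a simplified reading of the template syntax (it ignores namespaced parameters).

```python
import re
from urllib.parse import quote

# Hypothetical template; the real archive endpoint is not given in the slides.
TEMPLATE = (
    "https://archive.example.org/search"
    "?q={searchTerms}&start={startIndex?}&count={count?}"
)


def fill_template(template, **params):
    """Substitute OpenSearch-style {name} and {name?} placeholders."""

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in params:
            # Percent-encode the value so it is safe inside a query string.
            return quote(str(params[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{(\w+)(\??)\}", substitute, template)


url = fill_template(TEMPLATE, searchTerms="general election 2005", count=10)
print(url)
```

A client would then fetch the resulting URL and parse the response, typically an RSS or Atom feed carrying OpenSearch elements.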
Permission / legal issues
- Legal Deposit Libraries Act 2003 and extension of legal deposit to non-print publications: not yet fully implemented
- LDAP Web Archiving Sub-committee advising the Secretary of State on implementation of the Act: regulation-based harvesting and archiving of freely available online publications
- Slow process with delays; earliest legislation expected April 2010
- Low response rate to permission requests (25% success rate)
- 3rd-party cover can be prohibitive (e.g. multiple contributors)
- Resources required on both sides to do admin
- Difficulties in tracking down the right person
- Valuable websites disappearing before owners are found or contacted
Selective archiving: pros and cons
- No access restrictions
- Added value and depth to collection due to curatorial input
- Better quality (look and feel) of archived sites due to detailed quality assurance checks
- Easier to navigate as it is based on subject knowledge and tools
- Offers better hooks to build hybrid library collections (as catalogue records)
However:
- Low response rate to permission requests
- Labour intensive
- Expensive
- Only a small portion of the UK domain is being collected
Conclusions
- Lack of national legislation is holding us back and is the biggest issue for web archiving in the UK
- Most national libraries in the world undertake selective and domain harvesting at the same time
- Selective and domain harvesting are not mutually exclusive; they can and should complement each other
- BL needs to get up to speed
- Need to shift web archiving to the centre of BL's overall collection and infrastructure
- Selective archiving is expected to play a continued role alongside domain harvesting
Any Questions?
Helen Hockx-Yu
Web Archiving Programme Manager
The British Library
96 Euston Road, London NW1 2DB
Tel: +44 (0)20 7412 7184
Mobile: 07766 474 368
Email: helen.hockx-yu@bl.uk
Website: http://www.bl.uk/aboutus/stratpolprog/digi/webarch/index.html