88a09b8e011b75aaac365b2be2f3bb6f.ppt
- Number of slides: 15

WEB ARCHIVING: issues and challenges
Deborah Woodyard, Digital Preservation Coordinator
BCS, Oxfordshire, 19 February 2004

Where to start?
- Selection
- Collection Development Policy
- Need to be able to find them again
- Cataloguing issues
- 404 Not Found
- Need to capture web sites
- Who is responsible for capture?
- Who is responsible for preservation/access?
- What does this mean?
- Define a web site - where are the boundaries:
  - Links
  - Content on other sites / other servers
  - Changes with time (significant change)
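
One way to make the "where are the boundaries" question concrete is a scope rule that a capture tool applies to every link it finds. The sketch below is only an illustration, not the method used by any project in this talk: it assumes a site is "everything on the seed's host under the seed's directory", and the names `in_scope` and `classify_links` are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def in_scope(url, seed):
    """Return True if url is on the seed's host and under the seed's directory.

    Assumed boundary rule for illustration only; real policies differ.
    """
    u, s = urlsplit(url), urlsplit(seed)
    # Directory of the seed: everything up to and including the last "/".
    base_dir = (s.path or "/").rsplit("/", 1)[0] + "/"
    return u.hostname == s.hostname and (u.path or "/").startswith(base_dir)


def classify_links(page_url, hrefs, seed):
    """Resolve each href against the page, then split into inside/outside the site."""
    inside, outside = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # handles relative and absolute links
        (inside if in_scope(absolute, seed) else outside).append(absolute)
    return inside, outside
```

Links that land in the "outside" list are the ones a capture policy has to decide about: follow them, leave them pointing at the live web, or replace them.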

Technical issues: capture software
- Taking ‘snapshots’
- Follow directory structure or links?
- Where to break links / replace broken links?
- Relative vs absolute linking
- No changes to code for authenticity
- Preserve ‘original’ version, provide ‘access’ version
- Obey robots.txt exclusions
- Politeness - server load
- Quality control checking
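
The robots.txt and politeness points can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration under stated assumptions: the robots.txt text, the agent name and the `fetch` callable are made up for the example, and the one-second default delay is an arbitrary choice, not a recommendation from the talk.

```python
import time
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical site policy).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_parser(robots_text):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_plan(parser, agent, urls, default_delay=1.0):
    """Return (urls the agent may fetch, seconds to wait between requests)."""
    delay = parser.crawl_delay(agent) or default_delay
    allowed = [u for u in urls if parser.can_fetch(agent, u)]
    return allowed, delay


def crawl(parser, agent, urls, fetch, sleep=time.sleep):
    """Fetch allowed URLs one at a time, pausing between requests."""
    allowed, delay = polite_plan(parser, agent, urls)
    for i, url in enumerate(allowed):
        if i:  # no pause before the first request
            sleep(delay)
        yield url, fetch(url)
```

Passing `fetch` and `sleep` in as parameters keeps the sketch free of network calls; a real harvester would supply its own HTTP client.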

Technical issues: web sites
- File types - HTML, GIF, JPEG, JavaScript, ASP, etc.
- Software plug-ins
  - permission
  - access
- Dynamic database-driven sites
  - producing static pages
  - producing pages on-the-fly
- Frequency of capture
- Extent of capture
  - volume
  - duplication
  - storage and access to partial sites

Technical issues: storage and access
- Management and storage
  - high volume
  - multiple captures
  - long term, inc. storage system migration
  - disaster recovery
- Permanent naming
- Ensuring authenticity
  - trusted digital repository
  - checksums, signatures (long term)
- Signifying access to archived version
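
As an illustration of the checksum point, the sketch below records a SHA-256 digest for each stored file and later reports any file whose content no longer matches. The function names and the choice of SHA-256 are assumptions for the example; the talk does not name an algorithm or tool.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(paths):
    """Build a manifest mapping each path to its digest at ingest time."""
    return {path: sha256_of(path) for path in paths}


def verify_fixity(manifest):
    """Return the paths whose current digest differs from the manifest."""
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]
```

A checksum only detects change; keeping the manifest itself trustworthy over the long term is the part that signatures and a trusted repository are meant to address.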

Technical issues: preservation
- Preserve bits
- Preserve intellectual object, + ‘look & feel’
- Preserve functionality
- Technology changes
  - physical storage
  - hardware platform
  - operating systems
  - application software
  - HTML

Technical issues: preservation strategies
- Metadata for preservation
  - describe bits: how and where stored
  - describe how to interpret/use bits
  - describe the context for the bits
- Migration
  - in part / in whole
  - valid code?
  - keep all versions?
  - manage multiple versions
- Emulation
  - of software / OS / platform
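
The three kinds of preservation metadata listed above (storage, interpretation, context) could be held in a simple record per captured file. The sketch below is hypothetical throughout: the class name and every field name are invented for illustration and do not follow any particular metadata standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PreservationRecord:
    # Describe the bits: how and where stored (hypothetical fields).
    storage_path: str
    sha256: str
    # Describe how to interpret/use the bits.
    mime_type: str
    rendering_hint: str
    # Describe the context for the bits.
    source_url: str
    captured_at: str
    capture_tool: str


def to_json(record):
    """Serialise a record so it can be stored alongside the captured file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def from_json(text):
    """Rebuild a record from its JSON form."""
    return PreservationRecord(**json.loads(text))
```

Storing the record as plain JSON text is itself a preservation choice: it stays readable without the software that wrote it.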

LEGAL DISCUSSION
- Minimise risk
- Capture non-commercial sites
- Preserve without providing access
- Embargo or limit access
- Document actions taken
- Maintain ability to remove access

Cost £££ ??
- to do it
- of not doing it

PROJECTS
General project types:
- Selective - narrow, high quality, low volume
- Comprehensive - broad, lower quality, high volume
- Combination - useful, high quality, high volume

PROJECTS
British Library involvement:
- Domain.UK - selective
- UK Web Archiving Consortium - selective
- International Internet Preservation Consortium (IIPC) - comprehensive/combination

Project details
Domain.uk
- WebWhacker, HTTrack
- Regular captures of simple sites
- Staff PC (later networked drive), very small
- No access
UK WAC
- UK partners sharing one system
- PANDAS management, HTTrack, Oracle
- Manual selection, cataloguing and quality checking
- Web interface for capture and public access

Project details
IIPC
- Comprehensive automated selection
  - links in / links out
  - authority / hits
  - rare words
- Designing new crawler / harvester
- Developing technical architecture
- Deep web?
- Access challenging
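
The "links in" signal for automated selection can be illustrated by counting how many distinct other hosts link to each host. This is a toy sketch under assumptions, not the IIPC design: the input format (a dict from page URL to the URLs it links to) and the function name `rank_hosts_by_inlinks` are invented for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def rank_hosts_by_inlinks(link_graph):
    """Rank hosts by the number of distinct other hosts that link to them.

    link_graph maps a page URL to the list of URLs that page links to.
    Links within the same host are ignored.
    """
    sources = defaultdict(set)
    for page, targets in link_graph.items():
        src = urlsplit(page).hostname
        for target in targets:
            dst = urlsplit(target).hostname
            if dst and dst != src:
                sources[dst].add(src)
    # Most-linked hosts first; ties broken alphabetically for stable output.
    return sorted(((host, len(srcs)) for host, srcs in sources.items()),
                  key=lambda item: (-item[1], item[0]))
```

Counting distinct source hosts rather than raw links keeps one site with many pages from dominating the ranking.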

FUTURE WORK
- Expand collection
- Collaborative projects, inc. automated capture and metadata generation
- Legal deposit instruments for web archiving
- Provide restricted access

USEFUL REFERENCES
- http://library.wellcome.ac.uk/projects/archiving_reports.shtml
  - Collecting and preserving the World Wide Web: A feasibility study undertaken for the JISC and Wellcome Trust. Michael Day, UKOLN, University of Bath. Version 1.0, 25 February 2003
  - Legal issues relating to the archiving of Internet resources in the UK, EU, US and Australia. Andrew Charlesworth, University of Bristol, Centre for IT and Law. Version 1.0, 25 February 2003
- 2nd ECDL workshop on Web archiving: http://bibnum.bnf.fr/ecdl/2002/index.html
- Digital Preservation Coalition: http://www.dpconline.org/