
Archiving the Web – The Bibliothèque nationale de France’s « L’archivage du Web »
Bert Wendland, Bibliothèque nationale de France

Who I am / who we are
> Bert Wendland
  > Crawl engineer in the IT department of BnF
> Semi-joint working group
  > Legal Deposit department
    > 1 head of group
    > 4 librarians
  > IT department
    > 1 project coordinator
    > 1 developer
    > 2 crawl engineers
> A network of 80 digital curators
23 May 2013 Archiving the Web

27th November 2012 Session 4 - Web archiving for decision-makers


Agenda
> Context: I will present the BnF and web archiving as part of its legal mission.
> Concepts: I will describe how we operationalise the task of collecting and preserving the French web in terms of data, and how this relates to the general web archive at www.archive.org.
> Infrastructure: I will give an overview of the infrastructure that supports this task.
> Data acquisition: I will describe our mixed model of web harvesting that combines broad crawls and selective crawls to achieve a good trade-off between breadth and depth in coverage and temporal granularity.
> Data storage and access: I will describe the indexing structures that allow users to query this web archive.
28th November 2012 Session 5 - Integrating web archiving in IT operations

Context
The BnF and web archiving as part of its legal mission

The BnF
> Bibliothèque nationale de France
> About 30 million books, periodicals and others
  > 10 million at the new site
  > 60,000 new books every year
> 400 TB of data in the web archive
  > 100 TB of new data every year
> Two sites
  > Old site « Richelieu » in the centre of Paris
  > New site « François-Mitterrand » since 1996
> Two levels at the new site
  > Study library (« Haut-de-jardin »): open stacks
  > Research library (« Rez-de-jardin »): access to all collections, including web archives

The legal deposit
1368  Royal manuscripts of King Charles V in the Louvre
1537  Legal deposit by King Francis I: all publishers should send copies of their productions to the royal library
1648  Legal deposit extended to maps and plans
1793  Musical scores
1925  Photographs and gramophone records
1975  Video recordings
1992  CD-ROMs and electronic documents
2002  Websites (experimentally)
2006  Websites (in production)

Extension of the Legal Deposit Act in 2006
> Coverage (article 39)
  « Sont également soumis au dépôt légal les signes, signaux, écrits, images, sons ou messages de toute nature faisant l’objet d’une communication au public par voie électronique. »
  (“Signs, signals, writings, images, sounds or messages of any kind that are communicated to the public by electronic means are also subject to legal deposit.”)
> Conditions (article 41 II)
  « Les organismes dépositaires procèdent à la collecte des signes, signaux, écrits, images, sons ou messages de toute nature mis à la disposition du public ou de catégories de public, … Ils peuvent procéder eux-mêmes à cette collecte selon des procédures automatiques ou en déterminer les modalités en accord avec ces personnes. »
  (“The depository institutions collect the signs, signals, writings, images, sounds or messages of any kind made available to the public or to categories of the public, … They may carry out this collection themselves using automatic procedures, or determine its terms in agreement with these persons.”)
> Responsibilities (article 50)
  > INA (Institut national de l'audiovisuel) for radio and TV websites
  > BnF for anything else
> No permission is required to collect, but access to the archive is restricted to in-house use
> The goal is not to gather all or the “best of” the Web, but to preserve a representative collection of the Web at a certain date

Concepts
How we collect and preserve the French web


The Internet Archive
> Non-profit organisation, founded in 1996 by Brewster Kahle in San Francisco
> Stated mission of “universal access to all knowledge”
> Websites, but also other media like scanned books, movies, audio collections, …
> Web archiving from the beginning, only 4 years after the start of the WWW
> Main technologies for web archiving:
  > Heritrix: the crawler
  > Wayback Machine: access to the archive

Partnership BnF – IA
> A five-year partnership between 2004 and 2008
> Data
  > 2 focused crawls and 5 broad crawls on behalf of BnF
  > Extraction of historical Alexa data concerning .fr back to 1996
> Technology
  > Heritrix
  > Wayback Machine
  > 5 Petaboxes
> Know-how
  > Installation of Petaboxes by engineers of IA
  > Presence of an IA crawl engineer one day a week for 6 months

How search engines work
Archiving the Web means archiving the files, the links and some metadata.
Source: www.brightplanet.com

How the web crawler works
[Diagram: crawl loop of the web crawler (“Heritrix”)]
> Queue of URLs, initialised with the “seeds”: http://www.site-untel.fr, http://www.monblog.fr, …
> For each URL in the queue: connection to the page, storing the data, extraction of links
> Discovered URLs: http://www.unautre-site.fr, http://www.autre-blog.fr, …
> Verification of parameters for each discovered URL
  > YES: connection to the page, storing the data, extraction of links, …
  > NO: URL rejected
> Storage
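The loop in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Heritrix code: `crawl`, `fetch`, `in_scope` and `FAKE_WEB` are hypothetical names, and the "web" is a small in-memory dictionary so that the sketch runs without network access.

```python
from collections import deque

def crawl(seeds, fetch, in_scope, max_pages=100):
    """Breadth-first crawl: fetch a URL, store it, extract links, queue new ones."""
    queue = deque(seeds)      # queue of URLs, initialised with the seeds
    seen = set(seeds)         # URLs already examined, to avoid crawling them twice
    storage = {}              # stands in for the archive storage
    rejected = []             # discovered URLs that failed the scope check
    while queue and len(storage) < max_pages:
        url = queue.popleft()
        page = fetch(url)     # "connection to the page"
        if page is None:
            continue
        content, links = page
        storage[url] = content            # "storing the data"
        for link in links:                # "extraction of links"
            if link in seen:
                continue
            seen.add(link)
            if in_scope(link):            # "verification of parameters"
                queue.append(link)
            else:
                rejected.append(link)
    return storage, rejected

# Hypothetical in-memory web: URL -> (content, outgoing links)
FAKE_WEB = {
    "http://www.site-untel.fr": ("home", ["http://www.unautre-site.fr", "http://example.com/ad"]),
    "http://www.monblog.fr": ("blog", ["http://www.autre-blog.fr"]),
    "http://www.unautre-site.fr": ("other site", []),
    "http://www.autre-blog.fr": ("other blog", ["http://www.site-untel.fr"]),
}

storage, rejected = crawl(
    seeds=["http://www.site-untel.fr", "http://www.monblog.fr"],
    fetch=FAKE_WEB.get,
    in_scope=lambda url: url.endswith(".fr"),   # toy scope rule, an assumption
)
```

With these inputs the four .fr pages end up in `storage` and the single out-of-scope link ends up in `rejected`. A production crawler adds politeness delays, robots.txt handling and persistent queues, none of which are modelled here.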

Current production workflow
[Diagram: workflow steps (Selection, Validation, Planning, Crawling, Monitoring, Quality Assurance, Experience, Indexing, Preservation, Access) and the tools that support them (BCWeb, NAS_preload, NetarchiveSuite, VMware, Heritrix, NAS_qual, Indexing Process, SPAR, Wayback Machine)]


Applications
> BCWeb (“BnF Collecte du Web”)
  > BnF in-house development
  > Selection tool for librarians: proposal of URLs to collect for selective crawls
  > Technical validation of URLs by digital curators
  > Definition of collection packages
  > Transfer to NetarchiveSuite
> NAS_preload (“NetarchiveSuite Pre-Load”)
  > BnF in-house development
  > Preparation of broad crawls, based on a list of officially registered domains provided by AFNIC

Applications
> NetarchiveSuite
  > Open source application
  > Collaborative work of:
    > BnF
    > The two national deposit libraries in Denmark (the Royal Library in Copenhagen and the State and University Library in Aarhus)
    > Austrian National Library (ÖNB)
  > Central and main application of the archiving process
    > Planning the crawls
    > Creating and launching jobs
    > Monitoring
    > Quality Assurance
    > Experience evaluation

Applications
> Heritrix
  > Open source application by Internet Archive
  > Its name is an archaic English word for heiress (woman who inherits)
  > A crawl is configured as a job in Heritrix, which consists mainly of:
    > a list of URLs to start from (the seeds)
    > a scope (collect all URLs in the domain of a seed, stay on the same host, only a particular web page, etc.)
    > a set of filters to exclude unwanted URLs from the crawl
    > a list of extractors (to extract URLs from HTML, CSS, JavaScript)
    > many other technical parameters, for instance to define the “politeness” of a crawl or whether or not to obey a website’s robots.txt file
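To make the notions of seeds, scope and filters more concrete, here is a minimal Python sketch. It does not reproduce Heritrix's configuration format or API: the class `CrawlJob`, its field names and the three scope labels are hypothetical, and the "domain" rule is a deliberate simplification (the seed host minus a leading "www.").

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit

@dataclass
class CrawlJob:
    """Hypothetical stand-in for the main parts of a crawl job."""
    seeds: list
    scope: str = "domain"                  # "domain", "host" or "page"
    exclude_substrings: list = field(default_factory=list)  # crude URL filters
    obey_robots_txt: bool = True           # recorded only, not enforced here
    delay_seconds: float = 1.0             # "politeness", recorded only

    def accepts(self, url: str) -> bool:
        """Return True if url is in scope of some seed and not filtered out."""
        if any(s in url for s in self.exclude_substrings):
            return False
        host = urlsplit(url).hostname or ""
        for seed in self.seeds:
            seed_host = urlsplit(seed).hostname or ""
            if self.scope == "page" and url == seed:
                return True
            if self.scope == "host" and host == seed_host:
                return True
            if self.scope == "domain":
                # Simplification: seed host without a leading "www." is the domain.
                domain = seed_host[4:] if seed_host.startswith("www.") else seed_host
                if host == domain or host.endswith("." + domain):
                    return True
        return False

job = CrawlJob(seeds=["http://www.monblog.fr/"], exclude_substrings=["/calendar/"])
host_job = CrawlJob(seeds=["http://www.monblog.fr/"], scope="host")
```

Here `job.accepts("http://images.monblog.fr/a.png")` is true under the domain scope, while the same URL is refused by `host_job`, and any URL containing `/calendar/` is filtered out by `job`.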

Applications
> The Wayback Machine
  > Open source application by Internet Archive
  > Gives access to the archived data
> SPAR (“Système de Préservation et d’Archive Réparti”)
  > Not really an application: it is the BnF’s digital repository
  > Long-term preservation system for digital objects, compliant with the OAIS (Open Archival Information System) standard, ISO 14721

Applications
> NAS_qual (“NetarchiveSuite Quality Assurance”)
  > BnF in-house development
  > Indicators and statistics about the crawls
> The Indexing Process
  > Chain of shell scripts, developed in-house by BnF

Data and process model

Daily operations: same steps, different actions
Curators
> Monitoring: dashboard in NetarchiveSuite, filters in Heritrix, answers to webmasters' requests
> Quality assurance: analysis of indicators, visual control in the Wayback Machine
> Experience: reports on harvests concerning contents and website descriptions
Engineers
> Monitoring: dashboard in Nagios, operations on virtual machines, information to give to webmasters
> Quality assurance: production of indicators
> Experience: reports on harvests concerning IT operations

Challenges
> What is the French web?
  > Not only .fr, also .com or .org
> Some data remain difficult to harvest
  > Streaming, databases, videos, JavaScript
  > Dynamic web pages
  > Contents protected by passwords
  > Complex instructions for Dailymotion, paid contents for newspapers

Infrastructure
The machines that support the task

Platforms
[Diagram of the Operational Platform; labels: Application, Pilot, PostgreSQL database, Indexer master, Indexer, NAS, machines with Linux]

Platforms
> Operational Platform: PFO
  > 1 pilot, 1 indexer master, 2 to 10 indexers, 20 to 70 crawlers
  > Variable and scalable number of computers
> Trial Run Platform: MAB (Marche À Blanc, Trial Run)
  > Identical in setup to the PFO, the MAB aims to simulate and test harvests in real conditions for our curator team
  > Its size is also variable and subject to change
> Pre-production Platform: PFP
  > A technical test platform for the use of our engineering team

Platforms
Our needs:
> Flexibility regarding the number of crawlers allocated to a platform
> Hardware resource sharing and optimisation
> All classical needs of production environments such as robustness and reliability
Solution: Virtualisation!
> Virtual computers
> Configuration « templates »
> Resource pool grouping of the computers
> Automatic management of all shared resources
[Diagram: physical machines 1 to 9 under a hypervisor]

The DL-WEB cluster
[Diagram: cluster DL-WEB, physical machines 1 to 9 with shared resources]

Dive into the hardware
> 2 x 9 RAM modules of 4 GB = 72 GB RAM per machine
> 2 sockets; on every socket, 1 CPU
> 2 cores, 4 threads
> Total of 16 logical CPUs per machine

Physical machines
> Per machine: 2 x 9 x 4 GB = 72 GB RAM; 2 x 2 x 4 = 16 logical CPUs
> Cluster of 9 machines: 9 x 72 GB = 648 GB RAM; 9 x 16 = 144 logical CPUs

Park of virtual machines
                PFO   MAB   PFP
pilot             1     1     1
index-server      1     1     1
index-master      1     -     -
crawler          70    70    10
indexer          10     -     -
heritrix          5     5     5
free              5     5     5
Total            93    82    22   (197 virtual machines overall)

Distributed Resource Scheduler (DRS) and vMotion
> A virtual machine is hosted on a single physical server at a given time.
> If the load of the VMs hosted on one of the servers becomes too heavy, some of the VMs are moved onto another host dynamically and without interruption.
> If one of the hosts fails, all the VMs hosted on this server are moved to other hosts and are rebooted.

Fault tolerance (FT)
> An active copy of the FT VM runs on another server
> If the server where the master VM is hosted fails, the ghost VM instantly takes control without interruption
> A copy is then created on a third server
> The other VMs are moved and restarted
Fault Tolerance can be quite greedy for resources, especially network bandwidth. That’s why we have activated this functionality only for the pilot machine.

Data acquisition
Our mixed model of web harvesting

BnF “mixed model” of harvesting
[Chart: number of websites (vertical axis) over the calendar year (horizontal axis)]
> Broad crawls: once a year; .fr domains and beyond
> Project crawls: one-shots; related to an event or a theme
> Ongoing crawls: running throughout the year; news or reference websites

Aggregation of a large number of sources
> In 2012:
  > 2.4 million domains in .fr and .re, provided by AFNIC (Association française pour le nommage Internet en coopération, the French domain name allocation authority)
  > 3,000 domains in .nc, provided by OPT-NC (Office des postes et télécommunications de Nouvelle-Calédonie, the office of telecommunications of New Caledonia)
  > 2.6 million domains already present in the NetarchiveSuite database
  > 13,000 domains from the selection of URLs by BnF librarians (in BCWeb)
  > 6,000 domains from other workflows of the Library that contain URLs as part of the metadata: publishers’ declarations for books and periodicals, the BnF catalogue, identification of new periodicals by librarians, print periodicals that move to online publishing, and others
> After de-duplication, this generated a list of 3.3 million unique domains
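The de-duplication step amounts to taking the union of several domain lists after normalising the names. The Python sketch below illustrates the idea only; `merge_domain_lists`, the normalisation rules (lower-casing, trimming whitespace and a trailing dot) and the sample domains are assumptions, not the rules actually used in NAS_preload.

```python
def merge_domain_lists(*sources):
    """Union several iterables of domain names after light normalisation."""
    unique = set()
    for source in sources:
        for name in source:
            # Assumed normalisation: trim whitespace, lower-case, drop trailing dot.
            name = name.strip().lower().rstrip(".")
            if name:
                unique.add(name)
    return sorted(unique)

# Tiny made-up samples standing in for the real sources.
afnic = ["exemple.fr", "Autre-Site.fr", "ile.re"]
opt_nc = ["lagon.nc"]
already_known = ["exemple.fr", "autre-site.fr", "ancien.com"]
bcweb = ["monblog.fr", "ancien.com "]

domains = merge_domain_lists(afnic, opt_nc, already_known, bcweb)
```

The nine input entries collapse to six unique domains, in the same way that the overlapping source lists of 2012 collapsed to 3.3 million unique domains.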

Volume of collections
> Seven broad crawls since 2004
> 1996-2005 collections thanks to Internet Archive
> Tens of thousands of focus-crawled websites since 2002
> Total size
  > 20 billion URLs
  > 400 terabytes


Data storage and access
The indexing structures and how users query the web archive

Data access: the Wayback Machine
[Diagram: numbered request flow (steps 1 to 14) between the client (browser), the web server (web interface, URL server), the CDX machine (CDX server, CDX and PATH files) and the data storage machines (data servers, ARC files)]

The ARC files
File description
For every collected URL: URL, IP-address, Archive-date, Content-type, Archive-length, HTTP headers and HTML code

    filedesc://IA-001102.arc 0 19960923142103 text/plain 76
    1 0 Alexa Internet

    http://www.dryswamp.edu:80/index.html 127.100.2 19961104142103 text/html 202
    HTTP/1.0 200 Document follows
    Date: Mon, 04 Nov 1996 14:21:06 GMT
    Server: NCSA/1.4.1
    Content-type: text/html
    Last-modified: Sat, 10 Aug 1996 22:33:11 GMT
    Content-length: 30

    <HTML>
    Hello World!!!
    </HTML>

    http://www.…
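As an illustration of the record layout, the first line of each URL record can be split into the five fields named on the slide. The Python sketch below is an assumption-laden reading of that one line only (space-separated fields, length as a decimal integer); `ArcRecordHeader` and `parse_arc_header` are hypothetical names, and the sketch does not read the body or handle compressed files.

```python
from typing import NamedTuple

class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: str     # timestamp in the form YYYYMMDDhhmmss
    content_type: str
    archive_length: int   # length in bytes of the content that follows

def parse_arc_header(line: str) -> ArcRecordHeader:
    """Split one URL-record header line into its five space-separated fields."""
    parts = line.strip().split(" ")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    url, ip, date, ctype, length = parts
    return ArcRecordHeader(url, ip, date, ctype, int(length))

# Header line taken from the slide's example record, values kept as shown there.
header = parse_arc_header(
    "http://www.dryswamp.edu:80/index.html 127.100.2 19961104142103 text/html 202"
)
```

A reader would then consume `header.archive_length` bytes (the HTTP headers and the HTML code) before expecting the next record header.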

ARC file format

The CDX files
Indexing of the ARC files

    CDX A b e m s c V v D d g n
    0-0-0checkmate.com/Bugs/Bug_Investigators.html 20010424210551 209.52.183.152 text/html 200 58670fbe7432c5bed6f3dcd7ea32b221 17130110 59129865 1927657 6501523 DE_crawl6.20010424210458 5750

A = canonized URL, b = date, e = IP, m = mime type, s = response code, c = checksum, V = compressed arc file offset, v = uncompressed arc file offset, D = compressed dat file offset, d = uncompressed dat file offset, g = file name, n = arc document length
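Because the first line of a CDX file is a legend, a line can be decoded by pairing each letter with the value in the same position. The Python sketch below shows this for the twelve letters explained on the slide; `CDX_FIELD_NAMES` and `parse_cdx` are hypothetical names, and fields are assumed to be separated by single spaces.

```python
CDX_FIELD_NAMES = {
    "A": "canonized_url", "b": "date", "e": "ip", "m": "mime_type",
    "s": "response_code", "c": "checksum",
    "V": "compressed_arc_offset", "v": "uncompressed_arc_offset",
    "D": "compressed_dat_offset", "d": "uncompressed_dat_offset",
    "g": "file_name", "n": "arc_document_length",
}

def parse_cdx(header: str, line: str) -> dict:
    """Map the fields of one CDX line to names, using the legend in the header."""
    letters = header.split()[1:]      # drop the leading "CDX" token
    values = line.split()
    if len(letters) != len(values):
        raise ValueError("field count does not match the CDX header")
    return {CDX_FIELD_NAMES[letter]: value for letter, value in zip(letters, values)}

record = parse_cdx(
    "CDX A b e m s c V v D d g n",
    "0-0-0checkmate.com/Bugs/Bug_Investigators.html 20010424210551 209.52.183.152 "
    "text/html 200 58670fbe7432c5bed6f3dcd7ea32b221 17130110 59129865 1927657 "
    "6501523 DE_crawl6.20010424210458 5750",
)
```

The two fields needed to fetch the document are then `record["file_name"]` (which ARC file) and `record["compressed_arc_offset"]` (where the record starts inside it).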

The PATH files
Location of the ARC files
ARC file name, location

    DE_crawl6.20010424210458    /dlwebdata/01002/DE_crawl6.20010424210458.arc.gz
    IA-001102.arc               /dlwebdata/01003/IA-001102.arc
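Combined with the CDX index, the PATH file answers the question "on which disk path does this record live?". The Python sketch below assumes a PATH file made of two whitespace-separated columns (file name, location), which is a guess based on the slide; `load_path_index` and `locate` are hypothetical names.

```python
def load_path_index(text: str) -> dict:
    """Parse lines of the form '<ARC file name> <location>' into a dictionary."""
    index = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:           # skip blank or malformed lines
            name, location = parts
            index[name] = location
    return index

PATH_TEXT = """\
DE_crawl6.20010424210458 /dlwebdata/01002/DE_crawl6.20010424210458.arc.gz
IA-001102.arc /dlwebdata/01003/IA-001102.arc
"""

def locate(file_name: str, compressed_offset: int, path_index: dict):
    """Return (location, offset): where a data server should start reading."""
    return path_index[file_name], compressed_offset

paths = load_path_index(PATH_TEXT)
# File name and offset as they appear in the CDX example line.
location, offset = locate("DE_crawl6.20010424210458", 17130110, paths)
```

A data server could then open `location`, seek to `offset` and read one record; that last step is left out here because it depends on the compression layout of the ARC files.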

Indexing the data

Binary Search
> Sorted list of data
> O(log n)
> A maximum of 35 search operations for 20 billion lines!
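The figure of 35 follows from the base-2 logarithm of 20 billion, which is about 34.2 and rounds up to 35. The Python sketch below checks that bound and shows a binary search over a small sorted list of index lines; `worst_case_probes`, `find_first` and the sample lines are illustrative assumptions, and the real index is searched on disk rather than in memory.

```python
import math
from bisect import bisect_left

def worst_case_probes(n: int) -> int:
    """Upper bound on the halving steps needed to search n sorted lines."""
    return math.ceil(math.log2(n))

def find_first(sorted_lines, url_key):
    """Return the first line whose leading field equals url_key, or None."""
    i = bisect_left(sorted_lines, url_key)   # binary search for the insertion point
    if i < len(sorted_lines) and sorted_lines[i].split(" ", 1)[0] == url_key:
        return sorted_lines[i]
    return None

# Made-up index lines: "<canonized URL> <date>", kept in sorted order.
lines = sorted([
    "autre-blog.fr/ 20120301000000",
    "monblog.fr/ 20120105120000",
    "monblog.fr/ 20121105120000",
    "site-untel.fr/ 20120610093000",
])

hit = find_first(lines, "monblog.fr/")
miss = find_first(lines, "absent.fr/")
probes = worst_case_probes(20_000_000_000)
```

Because the lines are sorted by URL and then by date, the first match is also the earliest capture of that URL, and later captures sit on the lines immediately after it.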

SPAR
Système de Préservation et d’Archive Réparti
Long-term preservation system for digital objects, compliant with the OAIS (Open Archival Information System) standard, ISO 14721
[Diagram labels: Digitized books, Digitized audiovisual documents, Web archiving, Pre-Ingest]

SPAR
A generic repository solution at BnF

Public access to the collections
> Customised version of the open-source Wayback Machine
> Three access points:
  > URL search
  > Experimental full-text search using NutchWAX (only covers about 10% of collections…)
  > Guided tours

“Guided tours”
> Selections in the web archives, created by BnF subject librarians and external partners
> Provide a user-friendly way of discovering the contents of the archives
> Provide visibility for project collections

Thank you for your attention
Questions?