74dadc6d716d2253596fa5261cc3c7db.ppt
- Number of slides: 21
Webscraping at Statistics Netherlands
Olav ten Bosch
23 March 2016, ESSnet big data WP 2, Rome
Content
– Internet as a datasource (IAD): motivation
– Some IAD projects over past years
– Technologies used
– Summary / trends
– Observations / thoughts
– Legal
– The Dutch Business Register
The why
Faster, better, more efficient
Administrative sources
– Tax, social security services
– Municipalities / Provinces
– Supermarkets
– …
Internet sources
– …
Surveys
New indicators
Less!!!
Fuel prices (2009)
‐ Daily fuel prices from the website of unmanned petrol stations (tinq.nl)
‐ Regional prices (per station) every day
2016:
‐ A direct data feed from a travelcard company, weekly
‐ Fuel prices per day and all transactions of that week
‐ Publication on website: prices per month
Airline tickets (2010)
– Pilot: 3 robots on 6 airline companies
– 2 robots by external companies, 1 by SN
– Prices comply with manual collection
– Quite expensive; negative business case
2016: still manual price collection of airline tickets
Housing market
– Housing market (from 2011):
‐ Discussions with external company for > 1 year (iWoz)
‐ We scraped 5 sites, about 250,000 observations / week, for 2 years
From 2013:
‐ Direct feed from one of the sites (Jaap.nl)
‐ Statline tables: Bestaande woningen in verkoop (existing dwellings for sale)
‐ “based on 80-90 percent of the market”
Bulk price collection for CPI (1)
– Bulk price collection for CPI (from 2012):
‐ Mainly clothing
‐ Software scrapes all prices and product data (id, name, description, category, colour, size, …)
2016:
‐ About 500,000 price observations daily from 10 sites
‐ Data from 3 sites used in production of Dutch CPI
‐ Price collection process embedded in organisation
‐ Plans to extend to > 20 sites; other domains
Bulk price collection for CPI (2)
[Diagram: Processing bulk data from the Internet. Labels: Big Data; data collection & feature extraction; structured data; index methods; index based on internet data. Example features: “Fine‐knit Jumper, Dark blue, Striped, Cotton edges”.]
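The feature-extraction step on this slide could look roughly like the Python sketch below. It is only an illustration: the vocabulary, the feature names and the function `extract_features` are assumptions made for this example and are not the software used at Statistics Netherlands. It turns a free-text product description into structured fields by matching known terms on word boundaries.

```python
import re

# Hypothetical vocabulary, invented for illustration only.
FEATURE_VOCABULARY = {
    "knit": ["fine-knit", "chunky-knit"],
    "garment": ["jumper", "t-shirt", "jeans"],
    "colour": ["blue", "dark blue", "red"],
    "pattern": ["striped", "checked"],
    "material": ["cotton", "wool"],
}


def extract_features(description):
    """Map a free-text product description to a dict of structured features."""
    text = description.lower()
    features = {}
    for feature, terms in FEATURE_VOCABULARY.items():
        # Try longer terms first so "dark blue" wins over "blue".
        for term in sorted(terms, key=len, reverse=True):
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                features[feature] = term
                break
    return features


example = extract_features("Fine-knit jumper, dark blue, striped, cotton edges")
```

Matching on word boundaries avoids false hits such as "red" inside "coloured"; a production system would presumably need a far larger vocabulary or a trained classifier.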
Robot-assisted price collection
– Robot tool for detecting price changes on (parts of) websites
– Traffic light indicates status:
‐ Green: nothing changed, price is saved in database
‐ Red: some change, needs attention of statistician
‐ Two clicks to hold the price or store a new one
‐ In production from 2014
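The traffic-light logic can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the `Snapshot` class, the hashing of the watched page fragment and the function names are hypothetical and do not describe the actual robot tool.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One observation of a watched part of a web page (hypothetical structure)."""
    price: float
    fragment_hash: str


def take_snapshot(price: float, html_fragment: str) -> Snapshot:
    # Hash the watched fragment so layout or text changes are detected too.
    digest = hashlib.sha256(html_fragment.encode("utf-8")).hexdigest()
    return Snapshot(price=price, fragment_hash=digest)


def traffic_light(previous: Optional[Snapshot], current: Snapshot) -> str:
    """Green: nothing changed, the price can be saved. Red: a statistician must look."""
    if previous is not None and previous == current:
        return "green"
    return "red"


old = take_snapshot(19.95, "<span class='price'>19.95</span>")
same = take_snapshot(19.95, "<span class='price'>19.95</span>")
changed = take_snapshot(17.50, "<span class='price'>17.50</span>")
```

Here a first observation with no history is treated as red, on the assumption that a new item should be checked by a person before its price is stored.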
Collect data on enterprises for EGR (2013)
– Pilot: find data about EGR enterprises on the web
‐ We scraped semi-structured data from Wikipedia
‐ Multiple Wikipedia languages (NL, EN, DE, FR)
‐ 2016: something similar in ESSnet BD WP 2?
Search product descriptions for classifying business activities
– Search product descriptions on web (from 2014)
‐ First time we used automated search with Google search API for statistics
‐ Pilot, no production
‐ Some doubts on Google results
Twitter-LinkedIn (1)
– LinkedIn-Twitter for profiling (2015)
‐ Automated search on LinkedIn based on a sample of Twitter users
‐ Very specific and experimental
‐ “Profiling of Twitter data, a big data selectivity study”, Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch
Scraping websites of enterprises
– Identify family businesses (search and/or crawling) (2016)
– Identify businesses with a Corporate Social Responsibility (CSR) (search and/or crawling) (2016)
– Research program:
‐ “Extracting information from websites to improve economic figures”
– This ESSnet BD WP 2 !!!
Crawling for Statistics
[Diagram labels: URL base; Internet; incomplete statistical data; search terms; navigation terms; focused crawler (Roboto); item identifier terms (“year report, family business”); more complete statistical data; data store; search & match (ElasticSearch).]
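One way to read the focused-crawler idea is sketched below in Python. It is an assumption-laden illustration, not the Roboto package or its API: the function `focused_crawl`, its parameters and the in-memory `site` that stands in for the Internet are all hypothetical. Links are followed only when their anchor text contains a navigation term, and a page counts as a hit when its text contains an item identifier term.

```python
from collections import deque


def focused_crawl(start_url, pages, navigation_terms, item_terms, max_pages=50):
    """Breadth-first crawl that only follows links matching navigation terms.

    pages maps url -> (page_text, [(anchor_text, target_url), ...]) and is an
    in-memory stand-in for real HTTP fetching.
    """
    queue = deque([start_url])
    seen = {start_url}
    hits = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        if url not in pages:
            continue
        visited += 1
        text, links = pages[url]
        lowered = text.lower()
        if any(term in lowered for term in item_terms):
            hits.append(url)
        for anchor, target in links:
            if target in seen:
                continue
            if any(term in anchor.lower() for term in navigation_terms):
                seen.add(target)
                queue.append(target)
    return hits


site = {
    "https://example.org/": ("Welcome", [
        ("About us", "https://example.org/about"),
        ("Shop", "https://example.org/shop"),
    ]),
    "https://example.org/about": ("We are a family business since 1950", [
        ("Year report", "https://example.org/report"),
    ]),
    "https://example.org/report": ("Year report 2015", []),
    "https://example.org/shop": ("Buy things, family business discount", []),
}
hits = focused_crawl(
    "https://example.org/",
    site,
    navigation_terms=["about", "year report"],
    item_terms=["family business", "year report"],
)
```

The shop page is never visited because its link text matches no navigation term, which is the point of a focused crawl: fewer requests at the cost of possibly missing relevant pages.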
Technologies used
– Perl (2009), Djuggler (2010)
– Python, Scrapy (2010)
– R (2011-2015)
– Node.js (JavaScript on server) (2014-)
– Google Search API (2014-)
– ElasticSearch (2016)
– Roboto (nodejs package, 2015-2016)
– Nutch: tested, not used
– Generic framework (robot framework) for bulk scraping of prices
Summary / trends
Columns: Production | Scrape | Search | Crawl | External company
Tinq x Travelcard x Airlines (x) 2 robots Jaap.nl Housing x (x) BulkCPI x x Robottool x x (x) x x EGR RGS Twitter/LinkedIn Enterprises x x x Dataprovider?
Observations / thoughts …
‐ If it is there, we can get it
‐ Technology is (usually) not the problem!
‐ The internet is a living thing!
‐ It’s too simple to think we can just buy the internet somewhere and then make statistics!
‐ It’s powerful to combine something we know with something we observe!
‐ External companies can help, but be careful …
Legal
– Dutch Statistics Law:
‐ Enterprises have to provide data to Statistics Netherlands on request
‐ Scraping information from websites reduces response burden
‐ Statistics Netherlands uses data for official statistics only
– Dutch database legislation:
‐ Commercial re-use of intellectual property is forbidden
‐ This may also apply to internet sources
– Privacy:
‐ Dutch (statistical) legislation on protection of personal information
‐ Statistics Netherlands only scrapes public sources and processes the data within its own safe environment, just as with other (privacy-sensitive) data internally
– Netiquette:
‐ respect robots.txt
‐ identify yourself (user-agent)
‐ do not overload servers, use some idle time between requests
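The three netiquette rules can be sketched in Python with the standard library's `urllib.robotparser`. This is a minimal sketch under assumptions: the user-agent string, the one-second delay and the `fetch` callable are hypothetical placeholders, and no real network request is made here.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent; a real scraper would name its organisation and a contact.
USER_AGENT = "example-stats-bot"


def build_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Rule 1: respect robots.txt by dropping disallowed URLs."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def crawl_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Rules 2 and 3: identify yourself and pause between requests.

    fetch is any callable taking (url, headers); it is injected so the
    sketch stays independent of a particular HTTP library.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # idle time so the server is not overloaded
        results[url] = fetch(url, headers={"User-Agent": USER_AGENT})
    return results


robots = build_robots("User-agent: *\nDisallow: /private/\n")
candidates = [
    "https://example.org/prices",
    "https://example.org/private/admin",
    "https://example.org/products",
]
permitted = allowed_urls(robots, candidates)
```

In practice `fetch` could wrap any HTTP client; injecting both `fetch` and `sleep` keeps the politeness logic testable without touching a live site.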
Dutch Business Register (simplified)
‐ From administrative units to statistical units: legal units, relationships, cluster of control, enterprise groups, enterprises, local units
Sources:
‐ Trade Register
‐ Tax Register
‐ Social security register (employees)
‐ Profilers
‐ About 1.5 million administrative entities
‐ About 0.5 million have a URL
‐ Quality of URL field not known, but seems usable