229cee8534f719f096a382b902d7548b.ppt
- Number of slides: 24
ONS Big Data Project GSS Methodology Symposium 3 July 2014
Session objectives • Provide an overview of big data, particularly in Official Statistics • Introduce the ONS Big Data Project • Provide a brief overview of our 4 pilot studies and other project objectives • Provide links to more information
What is Big Data? “Big data are high volume, high velocity, and high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” (Gartner 2012) • Volume - exceeds limits of traditional column and row relational DB - constantly growing. Requires vertical scalability - ability to grow storage to accommodate new ‘records’ • Velocity - arrives rapidly, often in real time. Requires data streaming - real time processing, analysis and transformation • Variety - does not have a standard structure, e.g. text, images. Requires horizontal scalability - ability to add additional data structures
How is big data generated? Digital satellite images
Big Data Technologies • Cloud Computing • Parallel Computing • NoSQL Databases • General Programming • Data Visualization • Machine Learning
Big Data and Official Statistics • Replace existing outputs • Produce entirely new outputs • Complement other sources: • Filling in gaps • Auxiliary variables for statistical models • Improve operational processes • Quality assurance
What is the ONS Big Data Project? • A one year project which aims to: • investigate the potential for big data in official statistics while understanding the challenges • establish an ONS policy and longer term strategy which incorporates ONS’s position within Government and internationally in this field • recommend next steps to support the strategy going forward
Big Data Project work packages • Management and Strategy • Stakeholder Engagement • Communication • Analysis and infrastructure – pilot projects: Smart meter, Mobile Phones, Prices, Twitter
Stakeholder Engagement • International: • UNECE / ESS • Leading NSIs are Italy and Netherlands • Cross-government: • HMG Data Science Community of Interest Group • Big data for statistics vs other types of analysis • UK Government Big Data Champion (Jane Naylor) • Academia: • University of Southampton • ESRC Big data network
Analysis & Infrastructure: Technical challenges • Huge and continuously growing data streams, requiring new data architectures and software • Feasibility and efficiency of processing, typically requiring parallel computing on a large scale • New skills will be required, bringing together statistical and technological expertise
Analysis & Infrastructure: The Data Science Skill Set No living person is expert in all these disciplines. A very rare person would be proficient in them all. An individual might be expert in one or two, and proficient in another two or three. Data science is a TEAM SPORT. Source: http://en.wikibooks.org/wiki/Data_Science:_An_Introduction
Pilot 1: Smart meter project
Pilot 1: Smart meter project Research Question: Investigate the potential of smart meter electricity data (high frequency – every 30 minutes) to identify household occupancy levels and, potentially, household structure • England and Ireland both conducted pilot rollouts in 2009-2010 – data now available for research • Southampton University commissioned by Beyond 2011 to conduct preliminary research (due mid Feb 2014)
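A minimal sketch of how half-hourly readings might be turned into a rough occupancy signal. The threshold rule, the `fraction` and `factor` parameters, and the sample day are illustrative assumptions, not the method used in the pilot or in the Southampton research.

```python
def estimate_baseline(readings_kwh, fraction=0.25):
    """Average of the lowest `fraction` of readings, taken as always-on load."""
    ordered = sorted(readings_kwh)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / k


def flag_active_periods(readings_kwh, factor=2.0):
    """Mark each half-hour whose consumption exceeds `factor` times the baseline."""
    baseline = estimate_baseline(readings_kwh)
    return [reading > factor * baseline for reading in readings_kwh]


def active_share(readings_kwh, factor=2.0):
    """Share of half-hour periods flagged as active (a crude occupancy proxy)."""
    flags = flag_active_periods(readings_kwh, factor)
    return sum(flags) / len(flags)


# Hypothetical day of 48 half-hourly readings (kWh): a morning and an evening peak.
sample_day = [0.1] * 14 + [0.6] * 4 + [0.1] * 18 + [0.8] * 10 + [0.1] * 2
print(sum(flag_active_periods(sample_day)), "active half-hours out of", len(sample_day))
```

A rule this simple would confuse timed appliances with people being at home, which is one reason the research question is framed as investigating potential rather than assuming it.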
Pilot 2: Mobile Phone Project • 4 pilot projects: Smart meter, Mobile Phones, Prices, Twitter • RD&I Research Innovation Labs
Pilot 2: Mobile Phone Project Research Question: To investigate the possibility of using mobile phone data to model population flows, e.g. travel to work statistics • Location data: Telefonica proposal to provide aggregate data on origin-destination flows • Requirement to engage with GDS before proceeding further
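To illustrate what aggregate origin-destination data could look like in practice, here is a small sketch that builds an origin-destination matrix and a net inflow per area. The record layout `(origin, destination, count)` and the area codes are hypothetical; the format of the proposed Telefonica data is not described in the slides.

```python
from collections import defaultdict


def build_od_matrix(flows):
    """Sum counts for each (origin, destination) pair into a nested mapping."""
    matrix = defaultdict(lambda: defaultdict(int))
    for origin, destination, count in flows:
        matrix[origin][destination] += count
    return matrix


def net_inflow(matrix):
    """Net number of people arriving in each area, ignoring within-area flows."""
    net = defaultdict(int)
    for origin, row in matrix.items():
        for destination, count in row.items():
            if origin != destination:
                net[origin] -= count
                net[destination] += count
    return dict(net)


# Hypothetical aggregate records: (origin area, destination area, people counted).
sample_flows = [("A", "B", 120), ("B", "A", 30), ("A", "A", 200), ("A", "B", 10)]
od = build_od_matrix(sample_flows)
print(net_inflow(od))
```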
Pilot 3: Prices Project Research Question: To investigate how we can scrape prices data from the internet and how this data could be used within price statistics • 2-day workshop held with big data experts from Statistics Netherlands • Focus on groceries • Early prototype code in place • Engagement with Billion Prices Project
Pilot 3: Prices by webscraping Rendered webpage: HTML code:
... </div><div class="productLists" id="endFacets-1"><ul class="cf products line"><li id="p-254942348-3" class=" first"><div class="desc"><h3 class="inBasketInfoContainer"><a id="h-254942348" href="/groceries/Product/Details/?id=254942348" class="si_pl_254942348-title"><span class="image"><img src="http://img.tesco.com/Groceries/pi/1215010044000121IDShot_90x90.jpg" alt="" /><!----></span>Warburtons Toastie Sliced White Bread 800G</a></h3><p class="limitedLife"><a href="http://www.tesco.com/groceries/zones/default.aspx?name=quality-andfreshness">Delivering the freshest food to your door- Find out more > </a></p><div class="descContent"><!----><div class="promo"><a href="/groceries/SpecialOfferDetail/Default.aspx?promoId=A31234788" title="All products available for this offer" id="flyout-254942348-promo-A31234788--pos" class="promoFlyout"><span class="promoImgBox"><img src="/Groceries/UIAssets/I/Sites/Retail/Superstore/Online/Product/pos/2for.png" class="promoFlyout promo" alt="Special Offer" id="flyout-254942348-promo-A31234788--posimg" /></span><em>Any 2 for £2.00</em></a><span> valid from 21/1/2014 until 10/2/2014</span></div><div class="tools"><div class="moreInfo"><a href="/groceries/Product/Details/?id=254942348" class="midiFlyout" id="flyout-254942348-midi-0-"><img class="midiFlyout hd" src="http://ui.tescoassets.com/groceries/UIAssets/I/../Compressed/I_635209615845382232/Sites/Retail/Superstore/Online/Product/infoBlue.gif" alt="" title="View product information" id="flyout-254942348-midi-1-" /></a></div><!----><div class="links"><ul><li><a href="http://www.tesco.com/groceries/product/browse/default.aspx?notepad=white%20sliced%20loaf%20800g&N=4294793217" class="shelfFlyout active plaintooltip" id="s-tt-254942348" title="Premium White Bread"> Rest of <span class="hide">Premium White Bread <!----></span>shelf </a></li></ul></div></div><div class="quantity"><div class="content addToBasket"><p class="price"><span class="linePrice"> £1.45<!----></span><span class="linePriceAbbr"> (£0.18/100g)</span></p><h4 class="hide">Add to basket</h4><form method="post" id="fMultisearch-254942348"...
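As an illustration of how a price could be pulled out of markup like this, here is a minimal sketch using only Python's standard library. The sample string is a shortened, hand-written stand-in for the fragment above, the `linePrice` class name is assumed from it, and `PriceParser` and `parse_price` are hypothetical names; this is not the project's prototype code.

```python
from html.parser import HTMLParser


def parse_price(text):
    """Convert a string such as ' £1.45' to a float number of pounds."""
    return float(text.strip().lstrip("£").strip())


class PriceParser(HTMLParser):
    """Collect the prices found inside <span class="linePrice"> elements."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Only the shelf price span is of interest, not the per-100g span.
        if tag == "span" and dict(attrs).get("class") == "linePrice":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(parse_price(data))


sample_html = (
    '<p class="price"><span class="linePrice"> £1.45<!----></span>'
    '<span class="linePriceAbbr"> (£0.18/100g)</span></p>'
)
parser = PriceParser()
parser.feed(sample_html)
parser.close()
print(parser.prices)
```

A parser keyed to one class name breaks as soon as the retailer changes its page layout, so a production scraper would need monitoring for silent failures.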
Pilot 3: The Billion Prices Project @ MIT Lehman Brothers files for bankruptcy (15 Sept 2008) Daily Online Price Index (United States)
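One simple way to turn scraped prices into an index is an unweighted Jevons index, the geometric mean of price relatives for products observed in both periods. The sketch below assumes that formula and hypothetical product names and prices; it is not a description of how the Billion Prices Project or ONS compile their indices.

```python
import math


def jevons_index(base_prices, current_prices):
    """Geometric mean of price relatives over matched products, with base = 100."""
    matched = sorted(set(base_prices) & set(current_prices))
    if not matched:
        raise ValueError("no products observed in both periods")
    log_sum = sum(math.log(current_prices[p] / base_prices[p]) for p in matched)
    return 100.0 * math.exp(log_sum / len(matched))


# Hypothetical prices in pounds; "eggs" is unmatched and therefore ignored.
base = {"bread": 1.00, "milk": 2.00}
current = {"bread": 1.21, "milk": 2.00, "eggs": 3.00}
print(round(jevons_index(base, current), 2))
```

Restricting the calculation to matched products sidesteps product churn, which is frequent in scraped data and would need more careful handling in practice.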
Pilot 4: Twitter Project Research Question: To investigate how to capture geolocated tweets from Twitter and how this data might provide insights on commuting patterns and internal migration • Opportunity to start experimenting early on with big data technologies • Pilot work has successfully harvested geolocated tweets from the live Twitter feed using Python and the Twitter API • Need to determine whether planned application will exceed rate limits
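A sketch of the filtering step that might follow harvesting. It assumes each tweet has already been parsed into a dictionary whose `coordinates` field is either missing, null, or a GeoJSON point ordered as longitude then latitude. The bounding box and the rate-limit figures are hypothetical illustrations, and no call to the Twitter API is made.

```python
def has_point(tweet):
    """True if the tweet carries an exact point location."""
    coords = tweet.get("coordinates")
    return bool(coords) and coords.get("type") == "Point"


def in_bounding_box(tweet, west, south, east, north):
    """True if the tweet's point lies inside the box (degrees, lon/lat)."""
    lon, lat = tweet["coordinates"]["coordinates"]
    return west <= lon <= east and south <= lat <= north


def filter_geolocated(tweets, bbox):
    """Keep only tweets with a point location inside `bbox`."""
    return [t for t in tweets if has_point(t) and in_bounding_box(t, *bbox)]


def within_rate_limit(requests_needed, limit_per_window):
    """Check a planned request volume against an assumed per-window limit."""
    return requests_needed <= limit_per_window


# Hypothetical rough box around Great Britain: (west, south, east, north).
ROUGH_GB_BOX = (-8.0, 49.5, 2.0, 61.0)

sample_tweets = [
    {"id": 1, "coordinates": {"type": "Point", "coordinates": [-0.1276, 51.5072]}},
    {"id": 2, "coordinates": None},
    {"id": 3, "coordinates": {"type": "Point", "coordinates": [2.3522, 48.8566]}},
    {"id": 4},
]
print([t["id"] for t in filter_geolocated(sample_tweets, ROUGH_GB_BOX)])
```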
Pilot 4: Twitter Project Temporal Patterns of International Mobility by selected country
Pilot 4: Mobility patterns from Twitter Dover Calais
Where to from here? • We need to think hard about how we can exploit this deluge of data, new tools and technologies • We must share and collaborate in applications of big data for official statistics • We need to be able to respond to challenges about our statistical outputs arising from big data sources • We need to look beyond our national borders
Finding out more information
Questions?