
大规模数据处理/云计算 (Large-Scale Data Processing / Cloud Computing)
Lecture 1: Introduction to MapReduce
闫宏飞, 北京大学信息科学技术学院 (School of EECS, Peking University), 7/5/2011
http://net.pku.edu.cn/~course/cs402/
Jimmy Lin, University of Maryland
Course development: SEWMGroup
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

What is this course about?
• Data-intensive information processing
• Large-data ("web-scale") problems
• Focus on MapReduce programming
• An entry-level course

What is MapReduce?
• Programming model for expressing distributed computations at a massive scale
• Execution framework for organizing and performing such computations
• Open-source implementation called Hadoop
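To make the model concrete, below is a minimal sketch of the canonical WordCount job in the Hadoop Java API (the same exercise that appears later in the course schedule). It assumes a recent Hadoop release and is illustrative only: the programmer supplies map() and reduce(), while the framework handles partitioning, shuffling, scheduling, and fault tolerance.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: the framework groups values by key; sum the counts per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // The execution framework (Hadoop) schedules tasks, moves data, and
    // recovers from failures; this driver only wires the job together.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A typical invocation would be something like `hadoop jar wordcount.jar WordCount <input-dir> <output-dir>`, where the output directory must not already exist.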

Why Large Data?

How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN's LHC will generate 15 PB a year (??)
"640K ought to be enough for anybody."


Happening everywhere!
• Molecular biology (cancer): microarray chips
• Network traffic (spam): fiber optics, ~300M/day
• Simulations (Millennium): microprocessors
• Particle events (LHC): particle colliders, ~1M/sec

[Photo slides: CERN. Image credit: Maximilien Brice, © CERN]

No data like more data!
s/knowledge/data/g;
How do we get here if we're not Google?
(Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007)

Example: information extraction
• Answering factoid questions
– Pattern matching on the Web
– Works amazingly well
– "Who shot Abraham Lincoln?" → "X shot Abraham Lincoln"
• Learning relations
– Start with seed instances
– Search for patterns on the Web
– Use the patterns to find more instances
– "Wolfgang Amadeus Mozart (1756-1791)", "Einstein was born in 1879" → Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879)
– Patterns: PERSON (DATE –, PERSON was born in DATE
(Brill et al., TREC 2001; Lin, ACM TOIS 2007) (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; …)
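As a concrete illustration of the pattern idea, here is a small, hypothetical Java sketch (not code from the cited papers) that applies a single hard-coded "PERSON was born in DATE" pattern; real systems bootstrap many such patterns from seed instances and apply them to web-scale text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy version of one learned pattern; the regex and class name are illustrative only.
public class BirthdayPattern {
    private static final Pattern BORN_IN =
            Pattern.compile("([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (\\d{4})");

    public static void main(String[] args) {
        String text = "Einstein was born in 1879. Wolfgang Amadeus Mozart was born in 1756.";
        Matcher m = BORN_IN.matcher(text);
        while (m.find()) {
            // Emit the extracted relation, e.g. Birthday-of(Einstein, 1879)
            System.out.println("Birthday-of(" + m.group(1) + ", " + m.group(2) + ")");
        }
    }
}
```

On the two example sentences above this prints Birthday-of(Einstein, 1879) and Birthday-of(Wolfgang Amadeus Mozart, 1756).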

Example: Scene Completion
Hays, Efros (CMU), "Scene Completion Using Millions of Photographs", SIGGRAPH 2007
• Image database grouped by semantic content
– 30 different Flickr.com groups
– 2.3M images total (396 GB)
• Select candidate images most suitable for filling the hole
– Classify images with gist scene detector [Torralba]
– Color similarity
– Local context matching
• Computation
– Index images offline
– 50 min. scene matching, 20 min. local matching, 4 min. compositing
– Reduces to 5 minutes total by using 5 machines
• Extension
– Flickr.com has over 500 million images …

More Data, More Gains?
• CNNIC Statistical Report on Internet Development in China: by the end of June 2010, China had 420 million Internet users, and Internet penetration had risen to 31.8%. Mobile Internet users were the main driver of overall growth, adding 43.34 million in half a year to reach 277 million, an increase of 18.6%. Notably, commercial use of the Internet grew rapidly: online shopping users reached 140 million nationwide, and half-year user growth for online payment, online shopping, and online banking was around 30% each, far exceeding other categories of Internet applications.

Basic statistics of China's press and publishing industry, 2009
• In 2009, 238,868 book titles were published (145,475 first editions; 93,393 reprints and new editions), with a total print run of 3.788 billion copies (sheets), 31.246 billion printed sheets, equivalent to 734,000 tonnes of paper (including 141 million printed sheets of appendices, equivalent to 3,300 tonnes of paper), and a total list price of 56.727 billion yuan (including 473 million yuan for appendices). Compared with the previous year, the number of titles grew 8.86% (first editions up 11.24%; reprints and new editions up 5.36%), the total print run grew 4.53%, printed sheets grew 4.61%, and the total list price grew 8.94%.

Did you know?

Did you know?
• "We are currently preparing our students for jobs that don't yet exist …"
• "It is estimated that a week's worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century"
• "The amount of new technical information is doubling every 2 years"
• "So what does IT ALL MEAN?"

"We are living in exponential times"

Two Different Views
• Jennifer Widom, a "thrower-awayer": "The cost of throwing things away and retrieving them when needed is far lower than the cost of maintaining them."
• Gordon Bell, MyLifeBits: "trying to live an efficient life so that one has time to work and be with one's family."

Information Overloading
• One reason we fail to put what we learn into practice: information overload
– Of information we encounter only once, we usually remember only a small fraction.
– We should learn fewer things in depth rather than many things superficially.
– The key to mastering something is spaced repetition.
– Once people have truly and thoroughly mastered their own work, they become more creative and can even work wonders.

What is Cloud Computing?

The best thing since sliced bread?
• Before clouds…
– Grids
– Vector supercomputers
– …
• Cloud computing means many different things:
– Large-data processing
– Rebranding of Web 2.0
– Utility computing
– Everything as a service

Rebranding of Web 2.0
• Rich, interactive web applications
– Clouds refer to the servers that run them
– AJAX as the de facto standard (for better or worse)
– Examples: Facebook, YouTube, Gmail, …
• "The network is the computer": take two
– User data is stored "in the clouds"
– Rise of the netbook, smartphones, etc.
– Browser is the OS

[Image slide: electricity meter. Source: Wikipedia]

Utility Computing
• What?
– Computing resources as a metered service ("pay as you go")
– Ability to dynamically provision virtual machines
• Why?
– Cost: capital vs. operating expenses
– Scalability: "infinite" capacity
– Elasticity: scale up or down on demand
• Does it make sense?
– Benefits to cloud users
– Business case for cloud providers
"I think there is a world market for about five computers."

Everything as a Service
• Utility computing = Infrastructure as a Service (IaaS)
– Why buy machines when you can rent cycles?
– Examples: Amazon's EC2, Rackspace
• Platform as a Service (PaaS)
– Give me a nice API and take care of the maintenance, upgrades, …
– Example: Google App Engine
• Software as a Service (SaaS)
– Just run it for me!
– Examples: Gmail, Salesforce

Utility Computing
• "Pay-as-you-go" is like plugging into a wall socket: you get the same voltage that Microsoft gets, you just use less and pay less. The goal of utility computing is to give computing resources the same kind of service model, so that an individual user can draw on the computing resources of a Fortune 500 company, simply using less and paying less. This is an important aspect of cloud computing.

Platform as a Service (PaaS)
• For developing web applications and services, PaaS provides a complete Internet-based, integrated environment spanning development, testing, deployment, operation, and maintenance. In particular, it is built from the start on a multi-tenant architecture: users do not have to handle multi-user concurrency themselves; the platform takes care of it, including concurrency management, scalability, failure recovery, and security.

Software as a Service (SaaS)
• A model of software deployment whereby a provider licenses an application to customers for use as a service on demand.

Who cares?
• Ready-made large-data problems
– Lots of user-generated content
– Even more user behavior data
– Examples: Facebook friend suggestions, Google ad placement
– Business intelligence: gather everything in a data warehouse and run analytics to generate insight
• Utility computing
– Provision Hadoop clusters on demand in the cloud
– Lower barrier to entry for tackling large-data problems
– Commoditization and democratization of large-data capabilities

Story around Hadoop

Google-IBM Cloud Computing Initiative
• In early October 2007, Google and IBM signed agreements with six universities to provide courses and support services for developing software on large distributed computing systems, helping students and researchers gain experience building web-scale applications. The main content of the program is teaching the MapReduce algorithm and the Hadoop file system. The two companies will each contribute US$20-25 million to provide the computer hardware, software, and related services needed by professors and students engaged in computer science research.

Cloud Computing Initiative
• Google and IBM team on cloud computing initiative for universities (2007)
– Provide several hundred computers
– Access through the Internet to test parallel programming projects
• The idea for the program came from Google senior software engineer Christophe Bisciglia
– Google Code University

The Information Factories
• Googleplex (pre-2008)
– Servers number 450,000, according to the lowest estimate
– 200 petabytes of hard disk storage
– Four petabytes of RAM
– To handle the current load of 100 million queries a day, input-output bandwidth must be in the neighborhood of 3 petabits per second

Google Infrastructure
• 2003: "The Google File System," in SOSP, Bolton Landing, NY, USA: ACM Press, 2003.
• 2004: "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004.
• 2006: "Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!)," in OSDI, 2006.
• …

Hadoop Project (Doug Cutting)

History of Hadoop
• 2004: Initial versions of what are now the Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
• December 2005: Nutch ported to the new framework; Hadoop runs reliably on 20 nodes
• January 2006: Doug Cutting joins Yahoo!
• February 2006: Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS
• March 2006: Formation of the Yahoo! Hadoop team
• May 2006: Yahoo! sets up a Hadoop research cluster (300 nodes)
• April 2006: Sort benchmark run on 188 nodes in 47.9 hours
• May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
• October 2006: Research cluster reaches 600 nodes
• December 2006: Sort times of 1.8 hrs on 20 nodes, 3.3 hrs on 100 nodes, 5.2 hrs on 500 nodes, 7.8 hrs on 900 nodes
• January 2007: Research cluster reaches 900 nodes
• April 2007: Research clusters: 2 clusters of 1000 nodes
• September 2008: Scaling Hadoop to 4,000 nodes at Yahoo!

Google Code University
• 2008 Seminar: Mass Data Processing Technology on Large Scale Clusters, Tsinghua University (Aaron Kimball)

Startup: Cloudera
• Cloudera is pushing a commercial distribution for Hadoop
• Mike Olson, Christophe Bisciglia, Doug Cutting, Aaron Kimball, Tom White

Course Administrivia

Textbooks
• [Tom] Tom White, Hadoop: The Definitive Guide, O'Reilly, 2009. [Chinese Edition]
• [Lin] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, 2010. [EBook]

This schedule is tentative and subject to change without notice.
1. Background. Contents: the emergence, evolution, and current state of cloud computing; the main content and requirements of this course. Reading: [Lin] Ch 1: Introduction; [Tom] Ch 1: Meet Hadoop.
2. System. Contents: MapReduce and its runtime system; HDFS. Reading: [Lin] Ch 2: MapReduce Basics; [Tom] Ch 6: How MapReduce Works; [paper].
3. Environment. Contents: getting familiar with the Hadoop environment; (*) Hadoop environment setup; (*) WordCount. Reading: [Tom] Ch 9: Setting Up a Hadoop Cluster; [Tom] Appendix A: Installing Apache Hadoop; [Tom] Ch 2: MapReduce.
4. Algorithm Design. Contents: MapReduce program development; basic MapReduce algorithm design and design patterns; (*) WordCount (MapReduce library classes). Reading: [Tom] Ch 5: Developing a MapReduce Application; [Lin] Ch 3: MapReduce Algorithm Design.
5. Text Retrieval. Contents: (*) Inverted Index. Reading: [Lin] Ch 4: Inverted Indexing for Text Retrieval.
6. Graph Algorithms. Contents: (*) PageRank (MapReduce feature: side data distribution). Reading: [Lin] Ch 5: Graph Algorithms.

Recap
• Why large data?
• Cloud computing
• Story about Hadoop