
  • Number of slides: 16

CMSC 100
Google Deconstructed
Professor Marie desJardins
Tuesday, September 4, 2012
CMSC 100 -- Google Tue 9/4/12

Everyone Knows What Google Is…
  • Right?
  • So how does it work?

What’s So Great About Google?
  • Abstraction (of course!)
  • Crawl/index/search model
  • PageRank algorithm and complex ranking model
  • Scalability and robustness

Let’s Try It…
  • Google demo/activity:
    • Find the market capitalization of Google
    • Find a zero-hit “word”
      • Let’s add it to the course website
      • Now does Google notice it?
      • How long do you think it will take for Google to “learn” it?
    • Find a satellite picture of this building
    • Show the PageRank of several different pages

PreHistory of the Web: Computation
  • First, there were numbers (early counting systems)
  • Then there were “computers” (abaci, Babbage, Difference Engine, Jacquard loom)
  • Then there was theory of computation (Turing)
  • Then there were computers (ENIAC, PCs, Macs; explosion of computing power in the latter half of the 20th century)

History of the Web: Medieval Times
  • Internet (1960s – government sites and a few universities)
  • Talk/email and Usenet newsgroups used by a limited population in the ’80s
  • 1988: First computer worm (created by Robert Morris)
  • 1989: World Wide Web invented by Tim Berners-Lee
  • 1993: World Wide Web Worm (not a virus: an early search engine); 300K documents
  • 1994: Netscape browser released
  • 1995: eBay goes live
  • 1996: Larry Page invents PageRank
  • 1997: Google.com domain registered; 2M-100M documents; 20M queries in AltaVista

History of the Web: The Renaissance
  • Notable quotes:
    • “It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents”
    • “It is likely that search engines will handle hundreds of millions of queries per day by the year 2000.”
    [Brin & Page, “The Anatomy of a Search Engine”]
  • 1998: Google incorporated with initial investment of $1M
  • 1999: blogger.com goes live; blogs explode in popularity
  • 2000: Google Ads introduced
  • 2003: MySpace.com
  • 2004: Facebook
  • 2004: Google IPO, with a market capitalization of $23B
  • 2004: Mass-market VOIP (Voice Over Internet Protocol)
  • 2006: Google buys YouTube

History of the Web: The Modern Era
  • 2007: Google has a 53.6% market share (Yahoo has 20%)
  • 2008: Facebook overtakes MySpace as the #1 social networking site
  • Today:
    • Google has a market capitalization of $221.5B, over 30K employees, and a 67% US search market share
    • Google is #73 on the Fortune 500
    • The planet has 6.9B people, of whom 2.3B have Internet access
  • Tomorrow:
    • The Semantic Web will enable more meaningful information retrieval
    • Service-based computing and intelligent agent technology will let you perform all sorts of tasks automatically (travel planning, etc.)
    • Mobile phones will become the primary Internet platform
    • Continuing increased global access to the Web will lead to increased democracy, freedom of information, and protection of human rights
Sources: [investor.google.com, googlesystem.blogspot.com, internetworldstats.com]

Google: Crawling the Web
  • Googlebot: Start with a few “well connected” pages, follow the links. Lather, rinse, repeat.
    • Deepbot: Crawl the entire web once a month
    • Freshbot: Visit frequently updated sites more often
  • As of 2012, Google indexes about 35B documents.
    • Bing indexes 18B, and Yahoo indexes 3-4B [worldwidewebsize.com]
  • As of August 2012, Google had identified 30T (that’s thirty trillion!) unique URLs (but not all of them are interesting enough to index)
  • Google claims to handle more than 3 billion queries a day
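The "start with a few well connected pages, follow the links, repeat" loop on the slide above can be sketched as a breadth-first traversal. This is only an illustration, not Googlebot's actual implementation: the toy link graph `TOY_WEB`, the function name `crawl`, and the `max_pages` limit are all made up for the example, and a real crawler would fetch pages over the network instead of reading a dictionary.

```python
from collections import deque

# Hypothetical toy "web": each page name maps to the pages it links to.
TOY_WEB = {
    "home": ["news", "about"],
    "news": ["home", "story1", "story2"],
    "about": ["home"],
    "story1": ["news"],
    "story2": ["story1", "archive"],
    "archive": [],
    "island": ["home"],  # no page links here, so the crawl never finds it
}


def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl: visit the seeds, then follow links to unseen pages."""
    seen = set(seeds)      # pages already queued, so nothing is visited twice
    queue = deque(seeds)   # pages waiting to be visited
    order = []             # pages in the order they were visited
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl(["home"], lambda page: TOY_WEB.get(page, []))
print(visited)
```

Note that "island" is never visited: a page that nothing links to is invisible to a link-following crawler, which is one reason a newly invented zero-hit word takes time to show up in search results.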

Google Queries and Tools
  • Demo:
    • “Advanced Google Operators” – http://www.google.com/intl/gn/help/operators.html
    • “Google Cheat Sheet” – http://www.googleguide.com/advanced_operators_reference.html
  • Important concepts: stemming, stop words
  • Parsing
  • Term reordering for efficiency
  • More apps:
    • Google products on “Even More” link
    • Google Maps and Earth
    • Google Mail, Calendar, Documents, Plus...
    • ...and even a freakin’ self-driving car!! [check it out...]
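To make "stemming" and "stop words" concrete, here is a deliberately crude sketch. The stop-word set and the suffix list are assumptions chosen for the example; real search engines use far more careful rules (and Google's own are not public), so treat this as an illustration of the idea only.

```python
# Illustrative stop-word list (an assumption, not Google's actual list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}

# Crude suffix list for the example; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_query(query):
    """Lowercase the query, drop stop words, and stem what is left."""
    words = query.lower().split()
    return [stem(word) for word in words if word not in STOP_WORDS]


print(normalize_query("the history of searching engines"))
```

With this toy stemmer, "crawling", "crawled", and "crawls" all reduce to "crawl", so a query for any one form can match documents that contain the others.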

Ranking the Web
  • Googlefight: When queries duke it out
  • Google uses keyword similarity but includes other criteria in “scoring” documents, most notably PageRank
    • PageRank assigns a “score” to a web page based on how many other pages point to it
    • A page’s PageRank depends on the PageRanks of the “referring” pages, so it is a recursive definition!
  • Other criteria in Google’s ranking scheme include “extra credit” for keywords that appear earlier in the body of the document, in the title tag, in H1 HTML headers, and in anchor text (on links to the page being ranked)
  • “Search engine optimization” (legitimate) vs. “Google bombing” (not allowed)
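One common way to handle the recursive definition on the slide above is to start every page with the same score and repeatedly recompute each score from the scores of the pages that link to it. The sketch below does that on a made-up four-page graph. The damping factor of 0.85, the iteration count, the graph `TOY_LINKS`, and the handling of pages with no outgoing links are assumptions for the example, not a description of Google's production ranking.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a dict of page -> list of linked pages.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped score evenly among its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its score to all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B, and D; nothing links to D.
TOY_LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(TOY_LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C ends up with the highest score because three pages point to it, and A comes second even though only one page links to it, because that one page is the highly ranked C.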

Indexing the Web
  • Googlewhack: Query that returns exactly one page (can you find one?)
  • Efficient data structures are needed!
    • How do you find a phone number in the yellow pages, or a word in the dictionary?
    • Examples of successively more efficient data structures: Unordered list, ordered binary tree, hash table
  • Indexes: sorted by keyword, with “pointer” to pages containing that keyword
  • Document server: store the cached versions of the actual pages
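The keyword-to-pages index described on the slide above is usually called an inverted index. Here is a minimal sketch that uses a Python dictionary (a hash table) to map each word to the set of documents containing it. The three tiny documents in `DOCS` and the function names are invented for the example, and the search requires every query word to be present, which is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical mini-collection: document id -> text.
DOCS = {
    1: "google crawls the web",
    2: "the web is big",
    3: "google ranks pages",
}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Intersect the shortest lists first so intermediate results stay small.
    postings = sorted((index.get(word, set()) for word in words), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


index = build_index(DOCS)
print(search(index, "google web"))
print(search(index, "web"))
```

Looking up a word in the dictionary takes roughly the same time no matter how many documents are indexed, which is the advantage of a hash table over scanning an unordered list.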

Storing the Web
  • How much storage is needed for (say) 10B documents?
  • Server farms
    • As of 2012, Google had an estimated 1.8M (yes, 1.8 million) servers in locations around the globe
    • Issues: Energy consumption, heat generation/dissipation, environmental impact (server farms are typically located near, and dissipate their heat into, water sources)
    • Google claims they have been carbon neutral since 2007 (through energy savings, green energy sources, and carbon offsets) [google.com/green]
  • Design philosophy: Use many cheap, redundant machines, plus failure detection and handling → robust, scalable, affordable
  • Cloud computing (21st-century computation paradigm: use a very loosely connected network of many servers to manage enormous quantities of data and huge numbers of queries)
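The storage question on the slide above can be answered with a back-of-the-envelope calculation. The 10B document count comes from the slide; the average document size of 100 KB and the three redundant copies are assumptions picked only to show the arithmetic, so the result is an order-of-magnitude illustration rather than a real figure.

```python
NUM_DOCS = 10_000_000_000   # 10B documents, from the slide
AVG_DOC_BYTES = 100_000     # ASSUMPTION: 100 KB per stored document
REPLICAS = 3                # ASSUMPTION: three redundant copies of everything

PETABYTE = 10**15           # decimal petabyte, in bytes


def storage_bytes(num_docs, avg_doc_bytes, replicas=1):
    """Total bytes needed to store every document `replicas` times."""
    return num_docs * avg_doc_bytes * replicas


raw = storage_bytes(NUM_DOCS, AVG_DOC_BYTES)
replicated = storage_bytes(NUM_DOCS, AVG_DOC_BYTES, REPLICAS)
print(f"one copy: {raw / PETABYTE:.1f} PB")
print(f"with {REPLICAS} copies: {replicated / PETABYTE:.1f} PB")
```

Under these assumptions a single copy of the cached pages is about one petabyte, before counting the index itself, which is why the design philosophy of many cheap, redundant machines matters.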

Privacy and Security
  • Is it good that Google provides ready access to “all” information?
    • Do you want all of your publicly available data to be readily available to any random inquirer?
    • Do you want all of your data to be permanently available via caching, even if you’ve taken it off your Facebook page?
    • Do you want “leaked” (legally or not, but against your wishes) private information to become readily, permanently available public data?
    • If you create a story/book/painting/drawing/idea, do you want other people to be able to get it and use it as much as they want, for any purpose, for free?
    • Do you have any music you didn’t pay for on your iPod? Do you have movies you didn’t buy?
  • Google Earth: security and privacy issues
  • China/censorship – filtering search results
  • Digitizing books – copyright issues
  • Cookies/user data
  • Tabulating and selling search queries
  • Copyright and privacy law are way behind the technology curve!

Google’s Business Model
  • Why is it free?
    • Advertising: $10.6B ad revenue in 2006.
  • How Google Ads work:
    • AdWords: “pay-per-click” for “sponsored links” using Google’s technology to identify relevant ads
    • AdSense to put sponsored advertising on your own website
  • Big issue: Click fraud

What Doesn’t Google Do Well?
  • What doesn’t Google do well?
    • Non-digitized documents (handwritten/scanned data); non-text documents (audio, images, video, multimedia (e.g., PowerPoint presentations))
    • Databases and dynamically generated content
    • Semantic knowledge (beyond keywords and PageRank)