Characterizing the Web CSCI 572: Information Retrieval and Search Engines Summer 2010

Outline
• The web
  – Scale
  – Complexity
  – Growth
• Differences between then and now
• Where the web is headed
May-20-10 CS 572 - Summer 2010 CAM-2

The Web
• Massive-scale directed graph
• Driven by the underlying REST architecture
  – The key abstraction of information is a resource, named by a URL.
  – The representation of a resource is a sequence of bytes, plus representation metadata to describe those bytes.
  – All interactions are context-free: each interaction contains all of the information necessary to understand the request.
  – Components perform only a small set of well-defined methods on a resource, producing a representation to capture the current or intended state of that resource and transferring that representation between components.
  – Representation metadata are encouraged in support of caching and representation reuse.
  – The presence of intermediaries is promoted.
Copyright © Richard N. Taylor, Nenad Medvidovic, and Eric M. Dashofy. All rights reserved.
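The "directed graph" view of the web can be made concrete with a small sketch: pages are nodes and hyperlinks are directed edges. The page names and the `build_web_graph` helper below are hypothetical, chosen only for illustration, and are not taken from any real crawl.

```python
from collections import defaultdict

def build_web_graph(links):
    """Build adjacency sets from (source_page, target_page) hyperlink pairs."""
    out_links = defaultdict(set)  # page -> pages it links to
    in_links = defaultdict(set)   # page -> pages that link to it
    for source, target in links:
        out_links[source].add(target)
        in_links[target].add(source)
    return out_links, in_links

# Hypothetical pages and links, for illustration only.
links = [
    ("a.example/index", "b.example/home"),
    ("a.example/index", "c.example/news"),
    ("b.example/home", "c.example/news"),
    ("c.example/news", "a.example/index"),
]
out_links, in_links = build_web_graph(links)

# In-degree (number of distinct pages linking to a page) is one simple
# signal a search engine could derive from the graph.
in_degree = {page: len(sources) for page, sources in in_links.items()}
```

Because edges are directed, a link from one page to another says nothing about a link in the opposite direction, which is why the sketch tracks outgoing and incoming links separately.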

Scale
• http://www.worldwidewebsize.com/
  – GYBA = Sorted on Google, Yahoo!, Bing and Ask
  – YGBA = Sorted on Yahoo!, Google, Bing and Ask

How is the scale measured?
• Number of web pages indexed by search engines?
  – Is this an accurate representation?
• Published data from major ISPs?
  – Is this accurate information?
• What’s missing?
  – The “deep” web, or dynamic pages
  – Pages behind security firewalls

Why is scale important?
• Strongly influences the ultimate use cases of the web
  – Discovery and retrieval of information via:
    • Search engines
    • Web services and grid computing
    • Targeted communities like social networking and the growing field of analytics
• Strongly influences the way we build software for web-scale systems
  – New programming paradigms, e.g., MapReduce
  – New technologies to handle huge-scale computing, or “Big Data”
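As a rough illustration of the MapReduce paradigm mentioned above, here is a minimal single-process word count. It only sketches the map, shuffle, and reduce phases; the function names and sample documents are assumptions made for this example, and real frameworks distribute each phase across many machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    """Run map over every document, then shuffle and reduce the results."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle(mapped))

# Hypothetical input documents, for illustration only.
documents = ["the web is big", "the deep web is bigger"]
counts = word_count(documents)
```

The appeal at web scale is that each call to `map_phase` depends on a single document and each reduction depends on a single key, so both phases can be parallelized without shared state.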

Complexity

Proliferation of content types available
• By some accounts, 16K to 51K content types*
• What to do with content types?
  – Parse them
    • How?
    • Extract their text and structure
  – Index their metadata
    • In an indexing technology like Lucene, Solr, or Compass, or in Google Appliance
  – Identify what language they belong to
    • N-grams
*http://filext.com/
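The n-gram approach to language identification can be sketched as follows. This is one simple variant, assuming character trigram profiles compared by cosine similarity; it is not necessarily the method used in the course. The training sentences are tiny hypothetical samples, and a usable profile would need far more text.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify_language(text, profiles):
    """Return the language whose n-gram profile is most similar to the text."""
    sample = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))

# Tiny hypothetical training samples; real profiles need far more text.
profiles = {
    "english": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "spanish": char_ngrams("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}
guess = identify_language("the dog and the fox", profiles)
```

Character n-grams work without a dictionary or tokenizer, which matters when the text has just been extracted from an arbitrary content type.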

Growth
• Steady growth on a logarithmic scale since the mid-90s
• Well into the hundreds of millions of websites and tens of billions of web pages (even without the deep web)
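One way to read "steady growth on a logarithmic scale" is as roughly exponential growth. The following is an interpretive sketch, not something stated on the slide: the symbols N(t) (number of sites or pages at time t), N_0 (the starting count), and k (the growth rate) are assumptions introduced here.

```latex
N(t) = N_0 \cdot 10^{k t}
\quad\Longrightarrow\quad
\log_{10} N(t) = \log_{10} N_0 + k\,t
```

Under this assumption, a straight line on a log-scale plot means the count is multiplied by the same factor, 10^k, in every unit of time.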

What does growth mean to us (you)?
• Need for efficient algorithms for all sorts of things
  – Mining the web for information on you to target ads
  – Mining the web for information on you to decide whether to hire you or not
  – Disseminating news effectively (to you)
  – Disseminating media effectively (to you)
  – Providing rich browser experiences to lure you to web sites so that you can be sold products
• NOTE: I underlined “you” everywhere above for those who missed it; we’ll get back to this

The Web: Then and Now
• Before
  – The purpose of the web was for geeks to exchange email, post on bulletin boards regarding their favorite D&D games, and send files to one another
  – Scope was limited to geeks; broad infection was many years away
  – Search* since 1996: Hotbot, Excite, WebCrawler, AskJeeves, Yahoo!, Google, DogPile, Altavista, Lycos, MSN Search, AOL Search, Infoseek, Netscape, Metacrawler, AllTheWeb
*http://sixrevisions.com/web_design/popular-search-engines-in-the-90s-then-and-now/

The Web: Then and Now
• Now
  – The purpose is limitless
    • Computation with services, semantic description of content, proliferation of content, rich browsers, clients, interaction, media
    • The social web is the next big thing
  – Scope is everyone (I kid you not, from a 2-year-old on up)
  – Search* now: Google, with competitors like Yahoo and Bing bringing up the rear and trying to build out open source computational infrastructures to compete
*http://sixrevisions.com/web_design/popular-search-engines-in-the-90s-then-and-now/

The movement towards the social web
• Social networking companies have figured out that mining info about you guys can help build the “semantic” information that was once dreamed about by the likes of Tim Berners-Lee in his 2001 Scientific American article
• Why did the semantic web fail to gain acceptance while the social web has succeeded?
  – The realization that machines are poor annotators of information and that they are even worse at establishing trust
  – And that you guys are the experts at this!

Social Web and “Big Data”
• Many challenges induced by the complexity, scale, and growth of the traditional web only increase when the social web is taken into account
• The development of algorithms to crawl the social graph has led to several Ph.D.s, and these algorithms are huge money makers for existing businesses
  – Analytics is what they call this nowadays
• Search is a HUGE challenge and an interesting research problem within the social web
  – Instead of using information retrieval to deduce a “rank” for a page, use the trust value assigned via your social graph
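The idea of ranking by trust from your social graph, rather than by a computed page rank, could look something like the sketch below. Everything here is a hypothetical model made up for illustration: the assumption that trust halves with every hop away from the searcher, the `rank_by_trust` function, and the sample users and pages are not from the slides or from any real system.

```python
from collections import deque

def hop_distances(graph, start):
    """Breadth-first search: number of hops from start to each reachable user."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        for friend in graph.get(user, ()):
            if friend not in distances:
                distances[friend] = distances[user] + 1
                queue.append(friend)
    return distances

def trust_score(distance, decay=0.5):
    """Assumed trust model: trust is multiplied by `decay` for every hop."""
    return decay ** distance

def rank_by_trust(graph, searcher, shares):
    """Order pages by the summed trust of the reachable users who shared them."""
    distances = hop_distances(graph, searcher)
    scores = {}
    for user, page in shares:
        # Users outside the searcher's social graph contribute no trust.
        if user in distances and user != searcher:
            scores[page] = scores.get(page, 0.0) + trust_score(distances[user])
    # Highest score first; ties are broken alphabetically by page name.
    return sorted(scores, key=lambda page: (-scores[page], page))

# Hypothetical social graph (who follows whom) and shared pages.
graph = {"you": ["ann", "bob"], "ann": ["cat"], "bob": [], "cat": []}
shares = [("ann", "page-a"), ("cat", "page-b"), ("bob", "page-a"), ("dan", "page-c")]
ranking = rank_by_trust(graph, "you", shares)
```

In this toy model, a page shared by two direct connections outranks a page shared by a friend of a friend, and a page shared only by a stranger is not ranked at all.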

Wrapup
• The web has changed dramatically in the last 10 years
• Understand the different dimensions of the web and the variation points
  – Scale, complexity, and growth are only a select few
• Understand where the web is going and why