
News and Notes, 2/24
• Homework 2 due at the start of Thursday’s class
• New required readings:
  – “Micromotives and Macrobehavior”, chapters 1, 3 and 4
  – Watts, chapters 7 and 8
• Midterm Thursday March 4 – will cover up to and including “The Web as Network”
• Today’s class: short lecture + an in-class exercise

News and Notes, 2/19
• Three new required articles in Web as Network section
• Homework 2 distributed last class; due Feb 26
• Pick up your Homework 1 if you haven’t
• Grading of Homework 1, problem 4.1
• Midterm March 4
• MK office hours 10:30-12

The Web as Network
Networked Life, CSE 112, Spring 2004
Prof. Michael Kearns

The Web as Network
• Consider the web as a network
  – vertices: individual (html) pages
  – edges: hyperlinks between pages
  – will view as both a directed and undirected graph
• What is the structure of this network?
  – connected components
  – degree distributions
  – etc.
• What does it say about the people building and using it?
  – page and link generation
  – visitation statistics
• What are the algorithmic consequences?
  – web search
  – community identification

Graph Structure in the Web [Broder et al. paper]
• Report on the results of two massive “web crawls”
• Executed by AltaVista in May and October 1999
• Details of the crawls:
  – automated script following hyperlinks (URLs) from pages found
  – large set of starting points collected over time
  – crawl implemented as breadth-first search
  – have to deal with spam, infinite paths, timeouts, duplicates, etc.
• May ’99 crawl:
  – 200 million pages, 1.5 billion links
• Oct ’99 crawl:
  – 271 million pages, 2.1 billion links
• Unaudited, self-reported Sep ’03 stats:
  – 3 major search engines claim > 3 billion pages indexed
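
As a minimal sketch of the breadth-first crawl described above: `fetch_links` is a hypothetical stand-in for fetching a page and extracting its URLs, and the spam/timeout handling the slide mentions is omitted.

```python
from collections import deque

def bfs_crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl: visit pages in FIFO order, following hyperlinks.

    `fetch_links(url)` is assumed to return the URLs that a page links to.
    """
    seen = set(seeds)
    queue = deque(seeds)
    edges = []                          # (source, target) hyperlinks discovered
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        for target in fetch_links(url):
            edges.append((url, target))
            if target not in seen:      # skip duplicate pages
                seen.add(target)
                queue.append(target)
    return seen, edges
```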

Five Easy Pieces
• Authors did two kinds of breadth-first search:
  – ignoring link direction → weak connectivity
  – only following forward links → strong connectivity
• They then identify five different regions of the web:
  – strongly connected component (SCC):
    • can reach any page in SCC from any other in directed fashion
  – component IN:
    • can reach any page in SCC in directed fashion, but not reverse
  – component OUT:
    • can be reached from any page in SCC, but not reverse
  – component TENDRILS:
    • weakly connected to all of the above, but cannot reach SCC or be reached from SCC in directed fashion (e.g. pointed to by IN)
  – SCC+IN+OUT+TENDRILS form the weakly connected component (WCC)
  – everything else is called DISC (disconnected from the above)
  – here is a visualization of this structure
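
To make the five regions concrete, here is a hedged sketch of how they can be computed once a crawl is in hand. It assumes the crawl is stored as a dict mapping each page to the pages it links to, and that `seed` is some page known to lie inside the giant SCC; both names are illustrative, not from the paper.

```python
from collections import deque

def reachable(start, adj):
    """All nodes reachable from `start` by BFS over the adjacency dict `adj`."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bowtie(fwd, seed):
    """Split pages into SCC / IN / OUT / TENDRILS / DISC around `seed`'s SCC.

    `fwd` maps each page to the pages it links to; `seed` is assumed to lie
    inside the giant strongly connected component.
    """
    nodes = set(fwd) | {v for vs in fwd.values() for v in vs}
    rev = {u: set() for u in nodes}
    for u, vs in fwd.items():
        for v in vs:
            rev[v].add(u)
    fwd_reach = reachable(seed, fwd)          # pages the SCC can reach
    bwd_reach = reachable(seed, rev)          # pages that can reach the SCC
    scc = fwd_reach & bwd_reach
    in_, out = bwd_reach - scc, fwd_reach - scc
    und = {u: set(fwd.get(u, ())) | rev[u] for u in nodes}   # undirected view
    wcc = reachable(seed, und)
    tendrils = wcc - scc - in_ - out
    disc = nodes - wcc
    return scc, in_, out, tendrils, disc
```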

Size of the Five
• SCC: ~56M pages, ~28%
• IN: ~43M pages, ~21%
• OUT: ~43M pages, ~21%
• TENDRILS: ~44M pages, ~22%
• DISC: ~17M pages, ~8%
• WCC > 91% of the web --- the giant component
• One interpretation of the pieces:
  – SCC: the heart of the web
  – IN: newer sites not yet discovered and linked to
  – OUT: “insular” pages like corporate web sites

Diameter Measurements
• Directed worst-case diameter of the SCC:
  – at least 28
• Directed worst-case diameter of IN → SCC → OUT:
  – at least 503
• Over 75% of the time, there is no directed path between a random start and finish page in the WCC
  – when there is a directed path, average length is 16
• Average undirected distance in the WCC is 7
• Moral:
  – the web is a “small world” when we ignore direction
  – otherwise the picture is more complex

Degree Distributions
• They are, of course, heavy-tailed
• Power law distribution of component size
  – consistent with the Erdos-Renyi model
• Undirected connectivity of web not reliant on “connectors”
  – what happens as we remove high-degree vertices?
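
A minimal sketch of tabulating such a distribution from the same adjacency-dict representation used above (the `und` argument, mapping each page to its set of undirected neighbours, is an assumed data structure): a heavy tail shows up as a roughly straight line when degree and count are both plotted on logarithmic axes.

```python
from collections import Counter

def degree_distribution(und):
    """Map each degree k to the number of vertices with that degree,
    given an undirected adjacency dict (page -> set of neighbours)."""
    return Counter(len(neighbours) for neighbours in und.values())
```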

Beyond Macroscopic Structure
• Such studies tell us the coarse overall structure of the web
• Use and construction of the web are more fine-grained
  – people browse the web for certain information or topics
  – people build pages that link to related or “similar” pages
• How do we quantify & analyze this more detailed structure?
• We’ll examine two related examples:
  – Kleinberg’s hubs and authorities
    • automatic identification of “web communities”
  – PageRank
    • automatic identification of “important” pages
    • one of the main criteria used by Google
  – both rely mainly on the link structure of the web
  – both have an algorithm and a theory supporting them

Hubs and Authorities
• Suppose we have a large collection of pages on some topic
  – possibly the results of a standard web search
• Some of these pages are highly relevant, others not at all
• How could we automatically identify the important ones?
• What’s a good definition of importance?
• Kleinberg’s idea: there are two kinds of important pages:
  – authorities: highly relevant pages
  – hubs: pages that point to lots of relevant pages
  – (I had these backwards last time…)
• If you buy this definition, it further stands to reason that:
  – a good hub should point to lots of good authorities
  – a good authority should be pointed to by many good hubs
  – this logic is, of course, circular
• We need some math and an algorithm to sort it out

The HITS System (Hyperlink-Induced Topic Search)
• Given a user-supplied query Q:
  – assemble root set S of pages (e.g. first 200 pages returned by AltaVista)
  – grow S to base set T by adding all pages linked (undirected) to S
  – might bound the number of links considered from each page in S
• Now consider the directed subgraph induced on just the pages in T
• For each page p in T, define its
  – hub weight h(p); initialize all to be 1
  – authority weight a(p); initialize all to be 1
• Repeat “forever”:
  – a(p) := sum of h(q) over all pages q → p
  – h(p) := sum of a(q) over all pages p → q
  – renormalize all the weights
• This algorithm will always converge!
  – the weights computed are related to eigenvectors of the connectivity matrix
  – further substructure is revealed by different eigenvectors
• Here are some examples
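
A minimal sketch of the iteration just described, assuming the base set T is given as a dict `links` mapping each page to the pages it points to within T; a fixed number of rounds stands in for “repeat forever”, and the renormalization here simply rescales each weight vector to unit length.

```python
import math

def hits(links, n_iter=50):
    """Hub/authority updates over the base set T.

    `links` maps each page in T to the set of pages it points to (within T).
    """
    pages = set(links) | {q for qs in links.values() for q in qs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(n_iter):
        # a(p) := sum of h(q) over all pages q that link to p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # h(p) := sum of a(q) over all pages q that p links to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # renormalize so the weights stay bounded
        for weights in (auth, hub):
            norm = math.sqrt(sum(x * x for x in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth
```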

The PageRank Algorithm
• Let’s define a measure of page importance we will call the rank
• Notation: for any page p, let
  – N(p) be the number of forward links (pages p points to)
  – R(p) be the (to-be-defined) rank of p
• Idea: important pages distribute importance over their forward links
• So we might try defining
  – R(p) := sum of R(q)/N(q) over all pages q → p
  – can again define an iterative algorithm for computing the R(p)
  – if it converges, the solution again has an eigenvector interpretation
  – problem: cycles accumulate rank but never distribute it
• The fix:
  – R(p) := [sum of R(q)/N(q) over all pages q → p] + E(p)
  – E(p) is some external or exogenous measure of importance
  – some technical details omitted here (e.g. normalization)
• Let’s play with the PageRank calculator
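
Here is one hedged sketch of the fixed-up iteration, with `links` mapping each page to its forward links and `E` mapping pages to their exogenous importance; since the slide omits the normalization details, this version simply rescales the ranks to sum to 1 after every round.

```python
def pagerank(links, E, n_iter=100):
    """Iterative rank computation: R(p) := sum over q -> p of R(q)/N(q), plus E(p).

    `links` maps each page to the pages it points to;
    `E` maps each page to its exogenous importance.
    """
    pages = set(links) | {q for qs in links.values() for q in qs} | set(E)
    R = {p: 1.0 / len(pages) for p in pages}
    for _ in range(n_iter):
        new = {p: E.get(p, 0.0) for p in pages}
        for q, outs in links.items():
            if outs:
                share = R[q] / len(outs)      # q spreads its rank over its N(q) links
                for p in outs:
                    new[p] += share
        total = sum(new.values()) or 1.0
        R = {p: r / total for p, r in new.items()}   # renormalize each round
    return R
```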

The “Random Surfer” Model
• Let’s suppose that E(p) sums to 1 (normalized)
• Then the resulting PageRank solution R(p) will
  – also be normalized
  – can be interpreted as a probability distribution
• R(p) is the stationary distribution of the following process:
  – starting from some random page, just keep following random links
  – if stuck in a loop, jump to a random page drawn according to E(p)
  – so the surfer periodically gets “bored” and jumps to a new page
  – E(p) can thus be personalized for each surfer
• An important component of Google’s search criteria
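
The same quantity can be approximated directly by simulating the surfer. In this sketch `E_pages` is a list to draw the “bored” restarts from (uniformly, for simplicity), and the 15% boredom probability is just an illustrative choice, not a figure from the slides.

```python
import random

def random_surfer(links, E_pages, steps=100_000, boredom=0.15):
    """Monte Carlo version of the random-surfer story: follow a random outgoing
    link, and with probability `boredom` (or at a dead end) restart at a page
    drawn from E.  The visit frequencies approximate the stationary
    distribution, i.e. the PageRank-style score."""
    visits = {}
    page = random.choice(E_pages)
    for _ in range(steps):
        visits[page] = visits.get(page, 0) + 1
        outs = links.get(page, ())
        if not outs or random.random() < boredom:
            page = random.choice(E_pages)       # bored (or stuck): restart from E
        else:
            page = random.choice(list(outs))    # follow a random link
    return {p: count / steps for p, count in visits.items()}
```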

But What About Content?
• PageRank and Hubs & Authorities
  – both based purely on link structure
  – often applied to a pre-computed set of pages filtered for content
• So how do (say) search engines do this filtering?
• This is the domain of information retrieval

Basics of Information Retrieval
• Represent a document as a “bag of words”:
  – for each word in the English language, count the number of occurrences
  – so d[i] is the number of times the i-th word appears in the document
  – usually ignore common words (the, and, of, etc.)
  – usually do some stemming (e.g. “washed” → “wash”)
  – vectors are very long (~100Ks) but very sparse
  – need some special representation exploiting sparseness
• Note all that we ignore or throw away:
  – the order in which the words appear
  – the grammatical structure of sentences (parsing)
  – the sense in which a word is used
    • firing a gun or firing an employee
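
A minimal sketch of building such a sparse vector, with a tiny illustrative stopword list and a deliberately crude suffix-stripping “stemmer” standing in for a real one.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "a", "to", "in"}   # illustrative subset only

def crude_stem(word):
    # purely illustrative: a real system would use a proper stemmer
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Sparse bag-of-words vector: word -> count; only non-zero entries are stored."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(w) for w in words if w not in STOPWORDS)
```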

Bag of Words Document Comparison
• View documents as vectors in a very high-dimensional space
• Can now import geometry and linear algebra concepts
• Similarity between documents d and e:
  – Σ d[i]·e[i] over all words i
  – may normalize d and e first
  – this is their projection onto each other
• Improve by using TF/IDF weighting of words:
  – term frequency --- how frequent is the word in this document?
  – inverse document frequency --- how frequent in all documents?
  – give high weight to words with high TF and low IDF
• Search engines:
  – view the query as just another “document”
  – look for similar documents via the above
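
A hedged sketch of the weighted comparison, building on the bag-of-words counters above; the exact TF/IDF formula varies across systems, so the one here (count fraction times log of inverse document frequency) is just one common choice.

```python
import math

def tfidf(doc_counts, doc_freq, n_docs):
    """Weight each word by term frequency times inverse document frequency.

    `doc_counts` is a bag-of-words Counter for one document,
    `doc_freq[w]` the number of documents containing w,
    `n_docs` the total number of documents in the collection.
    """
    total = sum(doc_counts.values()) or 1
    return {w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in doc_counts.items() if doc_freq.get(w)}

def similarity(d, e):
    """Normalized dot product (cosine) between two sparse weight vectors."""
    dot = sum(weight * e.get(w, 0.0) for w, weight in d.items())
    nd = math.sqrt(sum(v * v for v in d.values()))
    ne = math.sqrt(sum(v * v for v in e.values()))
    return dot / (nd * ne) if nd and ne else 0.0
```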

Marrying Content and Structure
• So one overall approach is to
  – use traditional IR methods to find documents with desired content
  – apply structural methods to elevate authorities, high PageRank, etc.
• For some problems, a more integrated approach exists
• Example: co-training
• Suppose we want to learn a rule to classify pages
  – e.g. classify as “CS faculty home page” or not
  – two sources of info:
    • content on page --- technical terms, educational background, etc.
    • links on page --- to other faculty, department site, students, etc.
  – first learn a rule using content only from a small labeled set
  – now use this rule to label further pages
  – now learn a rule using links only from these labeled pages
  – repeat…
• Another example: maintaining a list of company names
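
As a rough sketch of the co-training loop described above: the `train` learner, the feature tuples, and the confidence-based selection are all illustrative assumptions, not a specific published recipe.

```python
def co_train(labeled, unlabeled, train, n_rounds=5, n_add=10):
    """Alternate between a content-only rule and a links-only rule.

    labeled   : list of (content_feats, link_feats, label) triples
    unlabeled : list of (content_feats, link_feats) pairs
    train(X, y) is assumed to return a classifier whose predict(x)
    gives back a (label, confidence) pair.
    """
    which = 0                                  # 0 = content view, 1 = link view
    rule = None
    for _ in range(2 * n_rounds):
        X = [example[which] for example in labeled]
        y = [example[2] for example in labeled]
        rule = train(X, y)                     # learn a rule using one view only
        # Label the unlabeled pages this rule is most confident about,
        # and feed them back in as training data for the other view.
        scored = sorted(((rule.predict(ex[which]), ex) for ex in unlabeled),
                        key=lambda item: item[0][1], reverse=True)
        labeled += [(ex[0], ex[1], label) for (label, _), ex in scored[:n_add]]
        unlabeled = [ex for _, ex in scored[n_add:]]
        which = 1 - which                      # swap the roles of the two views
    return rule
```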