  • Number of slides: 29

CS 345A Data Mining
Lecture 1: Introduction to Web Mining

What is Web Mining?
Discovering useful information from the World-Wide Web and its usage patterns

Web Mining v. Data Mining
  • Structure (or lack of it)
      - Textual information and linkage structure
  • Scale
      - Data generated per day is comparable to the largest conventional data warehouses
  • Speed
      - Often need to react to evolving usage patterns in real time (e.g., merchandising)

Web Mining topics
  • Web graph analysis
  • Power Laws and The Long Tail
  • Structured data extraction
  • Web advertising
  • Systems Issues

Size of the Web
  • Number of pages
      - Technically, infinite
      - Much duplication (30-40%)
      - Best estimate of “unique” static HTML pages comes from search engine claims
          • Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
          • Google recently announced that their index contains 1 trillion pages
              - How to explain the discrepancy?

The web as a graph
  • Pages = nodes, hyperlinks = edges
      - Ignore content
      - Directed graph
  • High linkage
      - 10-20 links/page on average
      - Power-law degree distribution
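
The graph view of the web can be made concrete with a small sketch. It treats pages as nodes and hyperlinks as directed edges, ignores page content, and tabulates in-degrees and out-degrees. The edge list `links` and the helper names are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy link data: (source page, destination page) pairs.
links = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "a"), ("d", "c"), ("e", "c"),
]

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg

def degree_histogram(deg, nodes):
    """Map each degree value to the number of nodes that have it."""
    return Counter(deg[n] for n in nodes)

nodes = {n for edge in links for n in edge}
in_deg, out_deg = degree_counts(links)
in_hist = degree_histogram(in_deg, nodes)

# Average number of out-links per page in the toy graph.
avg_out = sum(out_deg.values()) / len(nodes)
```

On a real crawl the same histogram, plotted on log-log axes, is what one would inspect for a power-law degree distribution.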

Structure of Web graph
  • Let’s take a closer look at structure
      - Broder et al (2000) studied a crawl of 200M pages and other smaller crawls
      - Bow-tie structure
          • Not a “small world”

Bow-tie Structure
Source: Broder et al, 2000
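
A bow-tie decomposition can be sketched on a toy graph as follows. This is a minimal illustration, not the procedure used by Broder et al: it finds the largest strongly connected component (the core) by repeated forward and backward breadth-first search, then labels nodes that can reach the core as IN and nodes reachable from the core as OUT. The function names are hypothetical, and the `other` bucket lumps together everything else (tendrils, tubes, disconnected pieces). The repeated-BFS approach is only suitable for small graphs.

```python
from collections import defaultdict, deque

def reachable(start_nodes, adj):
    """Breadth-first search: all nodes reachable from start_nodes via adj."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges):
    """Split nodes into (core, in_part, out_part, other)."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    nodes = set()
    for src, dst in edges:
        fwd[src].add(dst)
        bwd[dst].add(src)
        nodes.update((src, dst))

    # The SCC of a node is the intersection of what it reaches and
    # what reaches it; keep the largest SCC found as the core.
    core, unassigned = set(), set(nodes)
    while unassigned:
        seed = next(iter(unassigned))
        scc = reachable([seed], fwd) & reachable([seed], bwd)
        if len(scc) > len(core):
            core = scc
        unassigned -= scc

    out_part = reachable(core, fwd) - core  # reachable from the core
    in_part = reachable(core, bwd) - core   # can reach the core
    other = nodes - core - in_part - out_part
    return core, in_part, out_part, other

# Hypothetical toy graph: i1 feeds a 3-cycle, which feeds o1 -> o2;
# x -> y is disconnected from the rest.
toy_edges = [
    ("i1", "a"), ("a", "b"), ("b", "c"), ("c", "a"),
    ("c", "o1"), ("o1", "o2"), ("x", "y"),
]
core, in_part, out_part, other = bow_tie(toy_edges)
```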

What can the graph tell us?
  • Distinguish “important” pages from unimportant ones
      - Page rank
  • Discover communities of related pages
      - Hubs and Authorities
  • Detect web spam
      - Trust rank
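
As a rough illustration of how link structure can rank pages, here is a minimal power-iteration sketch in the style of PageRank. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from dead-end pages are assumptions made for this sketch, not details taken from the slides.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration on a directed edge list; returns {node: rank}."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dead-end pages is spread evenly over all pages
        # (an assumption of this sketch).
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for u in nodes:
            for v in out_links[u]:
                new_rank[v] += damping * rank[u] / len(out_links[u])
        rank = new_rank
    return rank

# Hypothetical toy graph: page "c" receives the most links.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(toy_edges, iterations=100)
```

Each iteration preserves the total rank, so the values can be read as a probability distribution over pages.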

Power-law degree distribution
Source: Broder et al, 2000

Power-laws galore
  • Structure
      - In-degrees
      - Out-degrees
      - Number of pages per site
  • Usage patterns
      - Number of visitors
      - Popularity, e.g., products, movies, music
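
A power law says the number of items with value k is roughly proportional to k^(-alpha), which appears as a straight line on log-log axes. The sketch below estimates alpha by least squares on the logged histogram. The synthetic histogram and its exponent are chosen purely for illustration, and log-log regression is only a quick diagnostic, not a rigorous estimator.

```python
import math

def fit_power_law(histogram):
    """Least-squares fit of log(count) = log(C) - alpha * log(k).

    `histogram` maps a positive value k (e.g. an in-degree) to the
    number of items having that value. Returns (alpha, C).
    """
    points = [
        (math.log(k), math.log(count))
        for k, count in histogram.items()
        if k > 0 and count > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return -slope, math.exp(mean_y - slope * mean_x)

# Synthetic, noise-free histogram with an illustrative exponent.
hist = {k: 1e6 * k ** -2.1 for k in range(1, 101)}
alpha, scale = fit_power_law(hist)
```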

The Long Tail
Source: Chris Anderson (2004)

The Long Tail
  • Shelf space is a scarce commodity for traditional retailers
      - Also: TV networks, movie theaters, …
  • The web enables near-zero-cost dissemination of information about products
  • More choice necessitates better filters
      - Recommendation engines (e.g., Amazon)
      - How Into Thin Air made Touching the Void a bestseller
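
To get a feel for why limited shelf space matters, the following sketch computes the share of total demand that falls outside the top-ranked items. It assumes demand for the item at popularity rank r is proportional to r^(-exponent), a Zipf-like model adopted only for illustration; the function name and parameters are hypothetical.

```python
def tail_share(num_items, shelf_size, exponent=1.0):
    """Fraction of total demand outside the top `shelf_size` items,
    assuming demand at rank r is proportional to r ** -exponent."""
    weights = [r ** -exponent for r in range(1, num_items + 1)]
    return sum(weights[shelf_size:]) / sum(weights)

# A store limited to 1,000 titles versus catalogs of different sizes.
small_catalog = tail_share(num_items=10_000, shelf_size=1_000)
large_catalog = tail_share(num_items=100_000, shelf_size=1_000)
```

Under this assumed model, the larger the catalog, the larger the share of demand that a fixed-size shelf cannot serve.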

Extracting Structured Data
http://www.simplyhired.com

Extracting structured data
http://www.fatlens.com

Ads vs. search results

Ads vs. search results
  • Search advertising is the revenue model
      - Multi-billion-dollar industry
      - Advertisers pay for clicks on their ads
  • Interesting problems
      - What ads to show for a search?
      - If I’m an advertiser, which search terms should I bid on and how much to bid?
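
One simple way to think about the "what ads to show" question: since advertisers pay only when an ad is clicked, a search engine might rank candidate ads by expected revenue per impression, that is, bid times estimated click-through rate. The sketch below shows that heuristic only. The `Ad` class, its fields, and the assumption that click-through rates are known in advance are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Ad:
    advertiser: str
    bid_per_click: float  # amount paid only if the ad is clicked
    est_ctr: float        # estimated click-through rate (assumed known)

def expected_revenue(ad):
    """Expected revenue from showing the ad once."""
    return ad.bid_per_click * ad.est_ctr

def rank_ads(candidates, slots):
    """Pick the `slots` ads with the highest expected revenue."""
    return sorted(candidates, key=expected_revenue, reverse=True)[:slots]

candidates = [
    Ad("A", bid_per_click=2.00, est_ctr=0.01),
    Ad("B", bid_per_click=1.00, est_ctr=0.05),
    Ad("C", bid_per_click=0.50, est_ctr=0.03),
]
shown = rank_ads(candidates, slots=2)
```

Note that the highest bidder (A) does not rank first here, because its estimated click-through rate is low.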

Two Approaches to Analyzing Data
  • Machine Learning approach
      - Emphasizes sophisticated algorithms, e.g., Support Vector Machines
      - Data sets tend to be small, fit in memory
  • Data Mining approach
      - Emphasizes big data sets (e.g., in the terabytes)
      - Data cannot even fit on a single disk!
      - Necessarily leads to simpler algorithms

Philosophy
  • In many cases, adding more data leads to better results than improving algorithms
      - Netflix
      - Google search
      - Google ads
  • More on my blog: Datawocky (datawocky.com)

Systems architecture
[Diagram labels: CPU; Machine Learning, Statistics; Memory; “Classical” Data Mining; Disk]

Very Large-Scale Data Mining
[Diagram: cluster of commodity nodes, each with its own CPU, Mem, and Disk]

Systems Issues
  • Web data sets can be very large
      - Tens to hundreds of terabytes
  • Cannot mine on a single server!
      - Need large farms of servers
  • How to organize hardware/software to mine multi-terabyte data sets
      - Without breaking the bank!
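
One programming model used to spread such work over many servers is MapReduce. The sketch below imitates its three phases (map, group by key, reduce) inside a single Python process, counting in-links per page. It illustrates the model only: the function names are hypothetical, and a real cluster would run the map and reduce steps in parallel on different machines.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

def link_mapper(record):
    """Emit (destination page, 1) for each hyperlink record."""
    _src, dst = record
    yield dst, 1

def count_reducer(_key, values):
    """Sum the 1s emitted for a page to get its in-degree."""
    return sum(values)

# Hypothetical link records: (source page, destination page).
links = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_degree = reduce_phase(shuffle(map_phase(links, link_mapper)), count_reducer)
```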

Project
  • Lots of interesting project ideas
      - If you can’t think of one, please come discuss with us
  • Infrastructure
      - Aster Data cluster on Amazon EC2
      - Supports both MapReduce and SQL
  • Data
      - Netflix
      - ShareThis
      - Google
      - WebBase
      - TREC