Mining di dati web (Web Data Mining), academic year 2006/2007: Mining the Web

The Course
  Code: nw 451
  Abbreviation: MDW
  Credits: 6
  Schedule: Wednesday and Friday, 16:00-18:00, room B
  Office hours: request an appointment by e-mail; meetings are held at ISTI, Area della Ricerca CNR, località San Cataldo, Pisa, entrance 19

Instructors
  Raffaele Perego, raffaele.perego@isti.cnr.it, tel. 0503152993
  Claudio Lucchese, claudio.lucchese@isti.cnr.it, tel. 0503152967
  Fabrizio Silvestri, fabrizio.silvestri@isti.cnr.it, tel. 0503153011
  Diego Puppin, diego.puppin@isti.cnr.it, tel. 0503153011
  Antonio Panciatici, antonio.panciatici@isti.cnr.it, tel. 0503152967

Course objectives
  The World Wide Web (WWW) has changed the way information is conceived, made accessible, and managed.
  Discovering previously unknown, non-trivial, and relevant information on the Web is increasingly important and increasingly difficult.
  Web mining has therefore become fundamental for optimizing strategic tools such as e-commerce sites, search engines, and directories.
  The course aims to provide tools and knowledge in this field.

Course contents
  Introduction: Data Mining, Knowledge Discovery, and the Web
  Search engines: crawling, indexing, querying
  Web Content Mining: similarity, clustering, and classification of texts
  Web Structure Mining: social networks, ranking, etc.
  Web Usage Mining: recommender systems, etc.
  Advanced topics (?!)

Teaching material
  Textbook
    Mining the Web: Discovering Knowledge from Hypertext Data. S. Chakrabarti. Morgan Kaufmann, 2003.
  Recommended books
    Managing Gigabytes. I. H. Witten, A. Moffat, and T. C. Bell. Morgan Kaufmann, 1999.
    Modern Information Retrieval. R. Baeza-Yates and B. Ribeiro-Neto. Addison Wesley, 1999.
  Lecture slides and articles
    Published at http://malvasia.isti.cnr.it/~raffaele/webmining

Teaching material
  Thanks to:
    Chakrabarti and Ramakrishnan, for the slides accompanying the textbook, downloadable from http://www.cse.iitb.ac.in/~soumen/mining-the-web/
    Fosca Giannotti and Dino Pedreschi, for the introductory slides borrowed from the TDM course
    KDNUGGETS (http://www.kdnuggets.com)
    Ferragina, Attardi, Garcia Molina, etc.
    Internet :-)

Exam
  Prerequisites (recommended)
    AA 270, TDM, Tecniche di "Data Mining" (Data Mining Techniques), first semester.
  Exam format
    Passing the exam requires successfully completing a project (individual or group?) and an oral discussion of the course contents (a seminar on an article of the student's choice?).

Introduction
  Data Mining and Knowledge Discovery
  Hypertexts and a brief history of the Web

What is DM?

Motivations for DM
  Data explosion problem: automated data collection tools, mature database technology, and the internet lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
  "We are drowning in information, but starving for knowledge!" (John Naisbitt)
  Data mining: extraction of interesting knowledge (rules, regularities, patterns, constraints) from large amounts of data.

Motivations for DM
  Abundance of business and industry data
  Competitive focus: knowledge management
  Inexpensive, powerful computing engines
  Strong theoretical/mathematical foundations
    machine learning & logic
    statistics
    database management systems
  Etc.

Sources of Data (e.g.)
  Business transactions
    widespread use of bar codes => storage of millions of transactions daily (e.g., Walmart: 2000 stores => 20 M transactions per day; credit card records!)
    most important problem: effective use of the data in a reasonable time frame for competitive decision-making
    e-commerce data
  Scientific data
    data generated through a multitude of experiments and observations
    examples: geological data, satellite imaging data, NASA earth observations, CERN HEP
    the rate of data collection far exceeds the speed at which we can analyze the data
  Financial data
    company information
    economic data (GNP, price indexes, etc.)
    stock markets

Sources of Data (e.g.)
  Personal / statistical data
    government census
    medical histories
    customer profiles
    demographic data
    data and statistics about sports and athletes
  World Wide Web and online repositories
    billions of Web documents, images, video, etc.
    emails, news, messages
    link structure of the hypertext from millions of Web sites
    Web usage data (from server/proxy logs, network traffic, and user registrations)
    online databases and digital libraries

Classes of DM applications
  Database analysis and decision support
    Market analysis: target marketing, customer relationship management, market basket analysis
    Risk analysis: forecasting, customer retention, quality control, competitive analysis
    Fraud detection
  Text mining
    E.g., mining opinions from email and documents

Classes of DM applications
  THE WEB!!
    Searching: Google, AskJeeves, Yahoo, etc.
    Social network analysis
    Web advertising
    E.g., IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
    Watch for the PRIVACY pitfall!
  Many others...
    Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat.
    Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.

What is KDD? A process!
  The selection and processing of data for:
    the identification of novel, accurate, and useful patterns, and
    the modeling of real-world phenomena.
  Data mining is a major component of the KDD process: automated discovery of patterns and development of predictive and explanatory models.

The KDD process
  [Figure: Data Sources -> Data Consolidation -> Consolidated Data (Warehouse) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (e.g., p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]

The KDD Process in Practice
  KDD steps can be merged or combined
    Data Selection + Data Transformation = Data Consolidation
    Data Cleaning + Data Integration = Data Preprocessing
  KDD is an iterative process
    art + engineering rather than science

The virtuous cycle
  [Figure: a cycle with the stages Identify Problem or Opportunity, Act on Knowledge, and Measure Effect of Action; other labels in the figure: Problem, Knowledge, Strategy, Results]

The steps of the KDD process
  Learning the application domain: relevant prior knowledge and goals of the application
  Data consolidation: creating a target data set
  Selection and preprocessing
    Data cleaning (may take 60% of the effort!)
    Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation
  Choosing data mining methods, e.g., classification, association, clustering
  Choosing the mining algorithm(s)
  Data mining: search for patterns of interest
  Interpretation and evaluation: analysis of results (visualization, transformation, removing redundant patterns, ...)
  Use of discovered knowledge

Roles in the KDD process

Major Data Mining Tasks
  Classification: predicting an item class
  Clustering: finding clusters in data
  Associations: e.g., A & B & C occur frequently
  Visualization: to facilitate human discovery
  Summarization: describing a group
  Deviation detection: finding changes
  Estimation: predicting a continuous value
  Link analysis: finding relationships
  ...

Classification
  Learn a method for predicting the instance class from pre-labeled (classified) instances
  Many approaches: statistics, decision trees, neural networks, ...
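
As a minimal sketch of "predicting the instance class from pre-labeled instances", the example below uses a 1-nearest-neighbor rule. This technique is not one of the approaches listed on the slide; it is chosen here only because it fits in a few lines. The data points, the labels "A" and "B", and the function name are hypothetical.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def nearest_neighbor_predict(training, query):
    """Return the label of the pre-labeled instance closest to `query`."""
    _, label = min(training, key=lambda item: dist(item[0], query))
    return label


# Hypothetical pre-labeled instances: (feature vector, class label).
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((4.8, 5.3), "B"),
]

print(nearest_neighbor_predict(training, (1.1, 0.9)))
print(nearest_neighbor_predict(training, (5.1, 4.9)))
```

The "learning" step here is trivial (the model is simply the stored training set); decision trees or neural networks would instead fit an explicit model before predicting.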

Clustering
  Find a "natural" grouping of instances given unlabeled data
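
One common way to look for such groupings is k-means. The slide does not name an algorithm, so the sketch below is only an illustration: it works on one-dimensional points, and the data, the starting centroids, and the function name are assumptions.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No labels are used anywhere: the two groups emerge from the distances alone, which is the difference from classification.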

Association Rules & Frequent Itemsets
  [Figure: table of transactions]
  Frequent itemsets: Milk, Bread (4); Bread, Cereal (3); Milk, Bread, Cereal (2); ...
  Rules: Milk => Bread (66%)
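
The counts and the rule confidence can be computed as sketched below. The slide's transaction table is not available here, so the transactions are hypothetical, chosen only to be consistent with the counts shown (4, 3, 2) and with a confidence of about 66% for Milk => Bread. The brute-force enumeration is an assumption made for brevity; it is not an efficient frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, consistent with the counts on the slide.
transactions = [
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread"},
    {"Milk", "Bread"},
    {"Bread", "Cereal"},
    {"Milk"},
    {"Milk", "Cereal"},
]


def itemset_counts(transactions, max_size=3):
    """Count in how many transactions each itemset (up to max_size items) occurs."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[frozenset(itemset)] += 1
    return counts


def confidence(counts, antecedent, consequent):
    """confidence(A => B) = count(A and B together) / count(A)."""
    both = counts[frozenset(antecedent) | frozenset(consequent)]
    return both / counts[frozenset(antecedent)]


counts = itemset_counts(transactions)
print(counts[frozenset({"Milk", "Bread"})])
print(confidence(counts, {"Milk"}, {"Bread"}))
```

With these transactions, Milk appears 6 times and Milk together with Bread 4 times, so the confidence of Milk => Bread is 4/6.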

Visualization & Data Mining
  Visualizing the data to facilitate human discovery
  Presenting the discovered results in a visually "nice" way

Summarization
  Describe features of the selected group
  Use natural language and graphics
  Usually in combination with deviation detection or other methods
  Example: "Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because..."

Data Mining Central Quest
  Find true patterns and avoid overfitting

Overfitting
  Finding seemingly significant but really random patterns due to searching too many possibilities
  Violation of Occam's razor
    the explanation of any phenomenon should make as few assumptions as possible
    lex parsimoniae: "entia non sunt multiplicanda praeter necessitatem" (entities must not be multiplied beyond necessity)
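
The effect of "searching too many possibilities" can be illustrated with a small simulation. In this sketch, labels and candidate features are all random coin flips, so no feature carries real information; the sample sizes, the number of features, and the function names are assumptions chosen for illustration.

```python
import random


def accuracy(feature, labels):
    """Fraction of instances where the feature value equals the label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)


def run_experiment(n_train=20, n_test=1000, n_features=2000, seed=0):
    rng = random.Random(seed)
    n = n_train + n_test
    # Labels and features are independent random bits: there is nothing to learn.
    labels = [rng.getrandbits(1) for _ in range(n)]
    features = [[rng.getrandbits(1) for _ in range(n)] for _ in range(n_features)]
    # Search all candidates and keep the one that looks best on the small training set.
    best = max(features, key=lambda f: accuracy(f[:n_train], labels[:n_train]))
    train_acc = accuracy(best[:n_train], labels[:n_train])
    test_acc = accuracy(best[n_train:], labels[n_train:])
    return train_acc, test_acc


train_acc, test_acc = run_experiment()
print(f"training accuracy of best feature: {train_acc:.2f}")
print(f"held-out accuracy of same feature: {test_acc:.2f}")
```

Among thousands of random candidates, some match 20 training labels well by chance alone, while on held-out data the selected feature falls back to roughly coin-flip accuracy. Evaluating on data not used for the search is what exposes the pattern as random.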

Hypertexts and the Web

World Wide Web
  Hypertext documents
    Text
    Links
  Web
    billions of documents
    authored by millions of diverse people
    edited by no one in particular
    distributed over millions of computers, connected by a variety of media

History of Hypertext
  Citation, hyperlinking
  Branching, non-linear discourse, nested commentary
    Ramayana: one of the great epic poems of India; attributed to the sage Valmiki, it recounts the life and exploits of Lord Rama.
    Mahabharata: an epic poem that recounts the struggle between the Kauravas and Pandavas over the disputed kingdom of Bharata, the ancient name for India.
    Talmud: compilation of Jewish oral teachings, assembled in written form in the early centuries of the Christian era.
  Dictionary, encyclopedia
    self-contained networks of textual nodes joined by referential links

Hypertext systems
  Memex, 1945 [Vannevar Bush, US President Roosevelt's science advisor]
    The name stands for "memory extension"
    Aim: to create and help follow hyperlinks across documents
    A photoelectrical-mechanical storage and computing device that could store vast amounts of information, in which a user could create links between related text and illustrations. This trail could then be stored and used for future reference.
    Bush believed that this associative method of information gathering was not only practical in its own right, but also closer to the way the mind orders information.

Hypertext systems
  Hypertext: term coined by Ted Nelson in a 1965 paper to the ACM 20th National Conference:
    "[...] By 'hypertext' I mean nonsequential writing - text that branches and allows choice to the reader, best read at an interactive screen."

Hypertext systems
  The first hypertext-based system was developed in 1967 by a team of researchers led by Dr. Andries van Dam at Brown University. The research was funded by IBM, and the first hypertext implementation, the Hypertext Editing System, ran on an IBM/360 mainframe. IBM later sold the system to the Houston Manned Spacecraft Center, which reportedly used it for the Apollo space program documentation.

Hypertext systems
  Xanadu hypertext, by Ted Nelson, 1981
    In the Xanadu scheme, a universal document database (docuverse) would allow addressing of any substring of any document from any other document. "This requires an even stronger addressing scheme than the Universal Resource Locators used in the World-Wide Web." [De Bra]
    Additionally, Xanadu would permanently keep every version of every document, thereby eliminating the possibility of a broken link. Xanadu would only maintain the current version of the document in its entirety.

World-wide Web
  Initiated at CERN in 1989 by Tim Berners-Lee, now W3C director:
    "W3 was originally developed to allow information sharing within internationally dispersed teams, and the dissemination of information by support groups. Originally aimed at the High Energy Physics community, it has spread to other areas and attracted much interest in user support, resource discovery and collaborative work areas. It is currently the most advanced information system deployed on the Internet, and embraces within its data model most information in previous networked information systems."

World-wide Web
  GUIs
    Berners-Lee (WorldWideWeb, 1990)
    Erwise and Viola (1992), Midas (1993)
    Mosaic (1993): a hypertext GUI for the X Window System
  HTML: markup language for rendering hypertext
  HTTP: Hypertext Transfer Protocol, for sending HTML and other data over the Internet
  CERN HTTPD: server of hypertext documents

The early days of the Web
  CERN HTTP traffic grows by a factor of 1000 between 1991 and 1994 (image courtesy W3C)

The early days of the Web
  The number of servers grows from a few hundred to a million between 1991 and 1997 (image courtesy Nielsen)

1994: the landmark year
  Foundation of the "Mosaic Communications Corporation" (later Netscape)
  First World-Wide Web conference
  MIT and CERN agreed to set up the World Wide Web Consortium (W3C)

The Web
  A populist, participatory medium
    number of writers = (approx.) number of readers
    enables near-zero-cost dissemination of information
  Abundance and authority crisis
    liberal and informal culture of content generation and dissemination
    very little uniform civil code
    redundancy and non-standard form and content
    millions of qualifying pages for most broad queries (example: java or kayaking)
    no per se authoritative information about the reliability of a site

Problems due to uniform accessibility
  little support for adapting to the background of specific users
  commercial interests routinely influence the operation of Web search
    users pay for connection costs, not for content
    profit depends on ads, sales, etc.
    "Search Engine Optimization"!!

What is Web Mining?
  Discovering interesting and useful information from Web content, structure, and usage
  Examples:
    Web search, e.g., Google, Yahoo, MSN, Ask, ...
    Specialized search, e.g., Froogle (comparison shopping), job ads (Flipdog)
    eCommerce
      recommendations, e.g., Netflix, Amazon
      improving conversion rate: next best product to offer
    Advertising, e.g., Google Adsense
    Fraud detection: click fraud detection, ...
    Improving Web site design and performance

How does it differ from "classical" Data Mining?
  The web is not a relation
    textual information and linkage structure
  Usage data is huge and growing rapidly
    Google's usage logs are bigger than their web crawl
    data generated per day is comparable to the largest conventional data warehouses
  Content and structure data are rich in features and patterns
    spontaneous formation and evolution of topic-induced graph clusters and hyperlink-induced communities
  Ability to react in real time to usage patterns
    no human in the loop
  (Reproduced from Ullman & Rajaraman with permission)

How big is the Web?
  Number of pages
    technically, infinite, because of dynamically generated content
    lots of duplication (30-40%)
  Best estimate of "unique" static HTML pages comes from search engine claims
    Google = 8 billion, Yahoo = 20 billion
    lots of marketing hype
  (Reproduced from Ullman & Rajaraman with permission)

96,854,877 web sites (Sept 2006)
  http://news.netcraft.com/archives/web_server_survey.html
  [Figure: Total Sites Across All Domains, August 1995 - September 2006]

The web as a graph
  Pages = nodes, hyperlinks = edges
    ignore content
    directed graph
  High linkage
    8-10 links/page on average
    power-law degree distribution
  (Reproduced from Ullman & Rajaraman with permission)
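
The graph view can be made concrete with a short sketch that treats pages as nodes and hyperlinks as directed edges, then tabulates in-degrees and out-degrees. The five-edge graph and the function names are hypothetical; a graph this small cannot show a power law, it only shows how the degree distribution would be computed on real crawl data.

```python
from collections import Counter

# Hypothetical hyperlinks as (source page, destination page) pairs.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]


def degree_counts(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    return in_degree, out_degree


def degree_histogram(degree):
    """Map each degree value to the number of pages having that degree."""
    return Counter(degree.values())


in_degree, out_degree = degree_counts(edges)
print(in_degree)
print(degree_histogram(in_degree))
```

Pages that are never linked to (such as "d" here) do not appear in the in-degree counter. On a large crawl, plotting the histogram on log-log axes is the usual way to check for a power-law shape.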

Power-law degree distribution
  [Figure: degree distribution plot. Source: Broder et al., 2000]
  (Reproduced from Ullman & Rajaraman with permission)

Power-laws abounding
  In-degrees
  Out-degrees
  Number of pages per site
  Number of visitors
  Term distribution in pages
  Query distribution in query logs
  Let's take a closer look at structure
    Broder et al. (2000) studied a crawl of 200 M pages and other smaller crawls
    Not a "small world"
  (Reproduced from Ullman & Rajaraman with permission)

Bow-tie Structure
  [Figure: bow-tie structure of the Web graph. Source: Broder et al., 2000]
  (Reproduced from Ullman & Rajaraman with permission)

Searching the Web
  [Figure: The Web, content aggregators, content consumers]
  (Reproduced from Ullman & Rajaraman with permission)

Ads vs. search results
  (Reproduced from Ullman & Rajaraman with permission)

Ads vs. search results
  Search advertising is the revenue model
    multi-billion-dollar industry
    advertisers pay for clicks on their ads
  Interesting problems
    How to pick the top 10 results for a search from 2,230,000 matching pages?
    What ads to show for a search?
    If I'm an advertiser, which search terms should I bid on, and how much should I bid?
  (Reproduced from Ullman & Rajaraman with permission)

Sidebar: What's in a name?
  Geico sued Google, contending that it owned the trademark "Geico"
    thus, ads for the keyword geico couldn't be sold to others
  Court ruling: search engines can sell keywords including trademarks
  No court ruling yet: whether the ad itself can use the trademarked word(s)
  (Reproduced from Ullman & Rajaraman with permission)

Extracting structured data
  [Figure: http://www.simplyhired.com]
  (Reproduced from Ullman & Rajaraman with permission)

Extracting structured data
  [Figure: http://www.fatlens.com]
  (Reproduced from Ullman & Rajaraman with permission)

The Long Tail (yet another power-law)
  [Figure: the long tail. Source: Chris Anderson (2004)]
  (Reproduced from Ullman & Rajaraman with permission)

The Long Tail
  Shelf space is a scarce commodity for traditional retailers
    also: TV networks, movie theaters, ...
  The web enables near-zero-cost dissemination of information about products
  More choices necessitate better filters
    recommendation engines (e.g., Amazon)
  (Reproduced from Ullman & Rajaraman with permission)

Major Web Mining topics
  Crawling the web
  Web graph analysis
  Structured data extraction
  Classification and vertical search
  Collaborative filtering
  Web advertising and optimization
  Mining web logs
  Systems issues
  (Reproduced from Ullman & Rajaraman with permission)

Web search basics
  [Figure: User, Search, Web crawler, Indexer, The Web, Indexes, Ad indexes]
  (Reproduced from Ullman & Rajaraman with permission)

Search engine components
  Spider (a.k.a. crawler/robot): builds the corpus
    Collects web pages recursively
      For each known URL, fetch the page, parse it, and extract new URLs
      Repeat
    Additional pages from direct submissions & other sources
  The indexer: creates inverted indexes
    Various policies regarding which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.
  Query processor: serves query results
    Front end: query reformulation, word stemming, capitalization, optimization of Booleans, etc.
    Back end: finds matching documents and ranks them
  (Reproduced from Ullman & Rajaraman with permission)
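
The three components can be sketched end to end on a toy scale. Everything below is an assumption made for illustration: the "web" is a hypothetical in-memory dictionary rather than real HTTP fetching, the example.org URLs and page texts are invented, indexing simply lowercases words with no stemming or phrase support, and the query processor does a Boolean AND without any ranking.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://example.org/": ("Web mining course home",
                            ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("Crawling and indexing the web",
                             ["http://example.org/b"]),
    "http://example.org/b": ("Ranking web search results",
                             ["http://example.org/"]),
}


def crawl(seed):
    """Spider: fetch each known URL, store its text, and enqueue newly found URLs."""
    seen, frontier, corpus = {seed}, deque([seed]), {}
    while frontier:
        url = frontier.popleft()
        if url not in WEB:  # stands in for a failed fetch
            continue
        text, links = WEB[url]
        corpus[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return corpus


def build_inverted_index(corpus):
    """Indexer: map each lowercased term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in corpus.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index


def search(index, query):
    """Query processor: return the URLs containing every query term (Boolean AND)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


corpus = crawl("http://example.org/")
index = build_inverted_index(corpus)
print(search(index, "web indexing"))
```

The `seen` set is what keeps the crawl from looping on the cycle of links back to the home page. A real back end would also rank the matching documents, which this sketch leaves out.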