Data Mining: Introduction
Lecture Notes for Chapter 1, Introduction to Data Mining
Gun Ho Lee, ghlee@ssu.ac.kr
Intelligent Information Systems Lab, Soongsil University, Korea
This material is modified and reproduced from books and materials by P.-N. Tan et al., J. Han and M. Kamber, M. Dunham, and others.
- Introduction to Data Mining
  - Pang-Ning Tan (Michigan State University), Michael Steinbach (University of Minnesota), Vipin Kumar (University of Minnesota)
  - Publisher: Addison-Wesley; Copyright: 2006; Format: Cloth, 769 pp.
- Data Mining: Concepts and Techniques
  - J. Han and M. Kamber (2001)
- Data Mining: Practical Machine Learning Tools and Techniques
  - Ian H. Witten
- Data Mining: Introductory and Advanced Topics
  - Margaret Dunham (2003)
Why Mine Data? Motivation: "Necessity is the Mother of Invention"
- Data explosion problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories
- We are drowning in data, but starving for knowledge!
- Solution: data warehousing and data mining
  - Data warehousing and on-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Mining Large Data Sets - Motivation
- There is often information "hidden" in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts]
Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g., in Customer Relationship Management)
Data Mining History
- The approach has roots in practice dating back over 40 years.
- In the early 1960s, data mining was called statistical analysis, and the pioneers were statistical software companies such as SAS and SPSS.
- By the late 1980s, the traditional techniques had been augmented by new methods such as fuzzy logic, heuristics, and neural networks.
Evolution of Database Technology
- 1960s:
  - Data collection, database creation, IMS and network DBMS
- 1970s:
  - Relational data model, relational DBMS implementation
- 1980s:
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s:
  - Data mining and data warehousing, multimedia databases, and Web databases
Why Data Mining? Potential Applications
- Database analysis and decision support
  - Market analysis and management
    - Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (news groups, email, documents) and Web analysis
  - Intelligent query answering
Customer Relationship Management (CRM)
Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time series analysis (financial-ratio analysis, trend analysis, etc.)
- Resource planning:
  - Summarize and compare the resources and spending
- Competition:
  - Monitor competitors and market directions
  - Group customers into classes and set up a class-based pricing procedure
  - Set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
- Applications
  - Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach
  - Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
- Examples
  - Auto insurance: detect a group of people who stage accidents to collect on insurance
  - Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - Medical insurance: detect professional patients and rings of doctors and rings of references
Fraud Detection and Management (2)
- Detecting inappropriate medical treatment
  - The Australian Health Insurance Commission found that in many cases blanket screening tests were requested (saving Australian $1m/yr).
- Detecting telephone fraud
  - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  - British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud.
- Retail
  - Analysts estimate that 38% of retail shrink is due to dishonest employees.
Scientific Viewpoint
- Data is collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining may help scientists
  - In classifying and segmenting data
  - In hypothesis formation
Other Applications
- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
What is Data Mining?
- Many definitions
  - Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
What is (not) Data Mining?
- What is not data mining?
  - Looking up a phone number in a phone directory
  - Querying a Web search engine for information about "Amazon"
- What is data mining?
  - Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly... in the Boston area)
  - Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Figure: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
Convergence of Three Technologies
Data Mining Tasks
- Prediction methods
  - Use some variables to predict unknown or future values of other variables.
- Description methods
  - Find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
- Exploratory Data Analysis
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
Exploratory Data Analysis (EDA)
- Explore the data without any clear idea of what we are looking for
- EDA techniques are interactive and visual
- Many effective visualization techniques exist for small and low-dimensional data
- High dimensionality => difficult visualization => requires dimensionality reduction and projection techniques
- Examples of visualization techniques: pie charts, histograms, scatterplots, contour plots
Predictive Data Mining
Predictive Modeling - Classification and Regression
- Goal: build a model that will predict the value of one variable from the known values of other variables
  - Classification: the variable to be predicted is categorical (i.e., its values belong to a pre-specified, finite set of possibilities)
  - Regression: the variable to be predicted is numeric
- Called supervised learning in machine learning
Classification: Definition
- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
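The train/test procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the referenced books. The records, the attribute names (income, age), and the choice of a 1-nearest-neighbor classifier are all assumptions made for the example.

```python
import random

def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into a training set and a test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(training_set, attributes):
    """Assign the class of the closest training record (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, nearest_class = min(training_set, key=lambda rec: sq_dist(rec[0], attributes))
    return nearest_class

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attrs, cls in test_set
                  if predict_1nn(training_set, attrs) == cls)
    return correct / len(test_set)

# Hypothetical records: ((income, age), class). The class attribute is "buy".
records = [
    ((20, 25), "no"), ((22, 30), "no"), ((25, 28), "no"), ((24, 35), "no"), ((21, 40), "no"),
    ((80, 45), "yes"), ((85, 50), "yes"), ((90, 38), "yes"), ((78, 42), "yes"), ((88, 55), "yes"),
]
training_set, test_set = split_train_test(records)
print("test accuracy:", accuracy(training_set, test_set))
```

The model is built only from `training_set`; `test_set` is held back so that the reported accuracy reflects previously unseen records.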
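Classification Example
[Figure: a table of records with two categorical attributes, one continuous attribute, and a class column. The Training Set is used to learn a classifier model, which is then applied to the Test Set.]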
Classification: Application 1
- Direct Marketing
  - Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  - Approach:
    - Use the data for a similar product introduced before.
    - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
    - Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.).
    - Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997
- Ex. 1: Credit card purchase authorization
  - Credit card companies must determine whether to authorize credit card purchases based on past transactions. 4 classes have been identified:
    - authorize
    - ask for further identification before authorization
    - do not authorize
    - do not authorize and call police
- Ex. 2: Credit card application approval
  - Predict whether to accept or deny credit card applications
- Historic data: [table not recovered]
Classification: Application 2
- Fraud Detection
  - Goal: predict fraudulent cases in credit card transactions.
  - Approach:
    - Use credit card transactions and the information on the account holder as attributes (when does a customer buy, what does he buy, how often does he pay on time, etc.).
    - Label past transactions as fraud or fair transactions. This forms the class attribute.
    - Learn a model for the class of the transactions.
    - Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
- Customer Attrition/Churn:
  - Goal: predict whether a customer is likely to be lost to a competitor.
  - Approach:
    - Use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.).
    - Label the customers as loyal or disloyal.
    - Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 4
- Sky Survey Cataloging
  - Goal: predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
    - 3000 images with 23,040 x 23,040 pixels per image.
  - Approach:
    - Segment the image.
    - Measure image attributes (features), 40 of them per object.
    - Model the class based on these features.
    - Success story: found 16 new high red-shift quasars, some of the farthest objects, which are difficult to find!
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies (Courtesy: http://aps.umn.edu)
- Class: stages of formation (Early, Intermediate, Late)
- Attributes: image features, characteristics of light waves received, etc.
- Data size:
  - 72 million stars, 20 million galaxies
  - Object catalog: 9 GB
  - Image database: 150 GB
Descriptive Data Mining
- Goal: describe all of the data (or the process that generated the data)
  - Density estimation: what is the probability distribution?
  - Dependency modeling: what are the relationships between variables?
  - Clustering (segmentation): find groups of data objects that are
    - similar to one another within the same group (cluster)
    - dissimilar to the objects in other clusters
- Called unsupervised learning in machine learning
Clustering - More Examples
Ex. 3: Redesign of uniforms for female soldiers in the US Army
- Goal: reduce the number of uniform sizes to be kept in inventory while still providing a good fit
- Researchers from Cornell University used clustering and designed a new set of sizes
  - Traditional clothing size system: an ordered set of graduated sizes where all dimensions increase together
  - The new system: sizes that fit body types, e.g., one size for short-legged, small-waisted women with wide and long torsos, average arms, broad shoulders, and skinny necks
Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
- Similarity measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.
Illustrating Clustering
- Euclidean distance based clustering in 3-D space.
  - Intracluster distances are minimized
  - Intercluster distances are maximized
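As a rough illustration of Euclidean-distance clustering in 3-D, the sketch below uses k-means. The slides do not name a specific algorithm, so k-means is an assumption chosen because it is a common way to keep intracluster distances small. The points and the initial centers are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, initial_centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-centering."""
    centers = [tuple(c) for c in initial_centers]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical 3-D data with two visibly separated groups.
points = [(1.0, 1.0, 1.0), (1.2, 0.8, 1.1), (0.9, 1.1, 0.9),
          (8.0, 8.0, 8.0), (8.2, 7.9, 8.1), (7.8, 8.1, 7.9)]
centers, clusters = k_means(points, initial_centers=[points[0], points[3]])
for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

The result depends on the initial centers; in practice several starts are usually tried and the clustering with the smallest total intracluster distance is kept.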
Clustering: Application 1
- Market Segmentation:
  - Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  - Approach:
    - Collect different attributes of customers based on their geographical and lifestyle related information.
    - Find clusters of similar customers.
    - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Clustering: Application 2
- Document Clustering:
  - Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  - Gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
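One possible similarity measure built from term frequencies is cosine similarity, sketched below. The slides only say that the measure is based on term frequencies, so cosine similarity is an assumption, and the three short documents are made up for the example.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents.
docs = {
    "finance_1": "stock market shares rise as market rallies",
    "finance_2": "shares fall as stock market slides",
    "sports_1": "team wins final game of the season",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
print(cosine_similarity(tf["finance_1"], tf["finance_2"]))
print(cosine_similarity(tf["finance_1"], tf["sports_1"]))
```

A real system would also filter out very common words before counting, as the "word filtering" note on the next slide suggests.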
Illustrating Document Clustering
- Clustering points: 3204 articles of the Los Angeles Times.
- Similarity measure: how many words are common in these documents (after some word filtering).
Clustering of S&P 500 Stock Data
- Observe stock movements every day.
- Clustering points: Stock-{UP/DOWN}
- Similarity measure: two points are more similar if the events described by them frequently happen together on the same day.
- We used association rules to quantify a similarity measure.
Associative DM
- Goal: find relationships among data:
  - Market-basket analysis: find combinations of items that typically occur together
  - Sequential analysis: find sequential patterns in data
- Market-basket analysis
  - Uses the information about what customers buy to give us insight into who they are and why they make certain purchases
  - Ex. 1: A grocery store retailer is trying to decide whether to put bread on sale. He generates association rules and finds what other products are typically purchased with bread. A particular type of cheese is sold 60% of the time the bread is sold, and a jelly is sold 70% of the time. Based on these findings, he decides: 1) to place some cheese and jelly at the end of the aisle where the bread is placed and 2) not to place all 3 of these items on sale at the same time.
Market-Basket Analysis - More Examples
- Where should strawberries be placed to maximize their sales?
- Services purchased together by telecommunication customers (e.g., broadband Internet, call forwarding, etc.) help determine how to bundle these services together to maximize revenue.
- Unusual combinations of insurance claims can be a sign of fraud
- Medical histories can give indications of complications based on combinations of treatments
- Sport: analyzing game statistics (shots blocked, assists, and fouls) to gain competitive advantage
  - "When player X is on the floor, player Y's shot accuracy decreases from 75% to 30%"
  - Bhandari et al. (1997). Advanced Scout: data mining and knowledge discovery in NBA data, Data Mining and Knowledge Discovery, 1(1), pp. 121-125
Association Rule Discovery: Definition
- Given a set of records, each of which contains some number of items from a given collection,
  - Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
- Rules discovered:
  - {Milk} --> {Coke}
  - {Diaper, Milk} --> {Beer}
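To make rules such as {Milk} --> {Coke} concrete, the sketch below computes two standard measures for a rule: support (how often the items occur together) and confidence (how often the consequent occurs when the antecedent does). The transaction list is hypothetical and the function names are illustrative only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    if n_antecedent == 0:
        return 0.0
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return n_both / n_antecedent

# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
print(confidence(transactions, {"Milk"}, {"Coke"}))
print(confidence(transactions, {"Diaper", "Milk"}, {"Beer"}))
print(support(transactions, {"Milk", "Coke"}))
```

A statement like "cheese is sold 60% of the time the bread is sold" corresponds to a confidence of 0.6 for the rule {Bread} --> {Cheese}.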
Association Rule Discovery: Application 1
- Marketing and Sales Promotion:
  - Let the rule discovered be {Bagels, ...} --> {Potato Chips}
  - Potato Chips as consequent => can be used to determine what should be done to boost its sales.
  - Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
  - Bagels in the antecedent and Potato Chips in the consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!
Association Rule Discovery: Application 2
- Supermarket shelf management.
  - Goal: identify items that are bought together by sufficiently many customers.
  - Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then he is very likely to buy beer.
    - So, don't be surprised if you find six-packs stacked next to diapers!
Association Rule Discovery: Application 3
- Inventory Management:
  - Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts, to reduce the number of visits to consumer households.
  - Approach: process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
Sequential Pattern Discovery: Definition
- Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
  - Example pattern: (A B) (C) (D E)
- Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
[Figure: the pattern (A B) (C) (D E) annotated with the timing constraints <= xg, > ng, <= ws, and <= ms]
Sequential Analysis
- Finds sequential patterns in data
  - These patterns are similar to those in market-basket analysis, but the relationship is based on time
- Ex. 1: Most people who purchase CD players purchase CDs within 3 days.
- Ex. 2: The webmaster at company X periodically analyzes the web log data to determine how the users of X browse the site. He finds that 70% of the users of page A follow one of the following patterns:
  - A->B->C
  - A->D->B->C
  - A->E->B->C
- He then decides to add a link from page A to C
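The web-log example can be checked with a small ordered-subsequence test, sketched below. The sessions are hypothetical and were constructed so that 7 of the 10 sessions visiting page A go on to B and then C; the function names are illustrative only.

```python
def contains_in_order(session, pattern):
    """True if the pages in pattern appear in session in the same order (gaps allowed)."""
    remaining = iter(session)
    # "page in remaining" advances the iterator, so later pages must come later in the session.
    return all(page in remaining for page in pattern)

def pattern_fraction(sessions, start_page, pattern):
    """Among sessions that visit start_page, the fraction that contain the pattern in order."""
    starting = [s for s in sessions if start_page in s]
    if not starting:
        return 0.0
    return sum(1 for s in starting if contains_in_order(s, pattern)) / len(starting)

# Hypothetical page-visit sequences extracted from a web log.
sessions = [
    ["A", "B", "C"], ["A", "D", "B", "C"], ["A", "E", "B", "C"],
    ["A", "B", "C"], ["A", "D", "B", "C"], ["A", "E", "B", "C"], ["A", "B", "C"],
    ["A", "F"], ["A", "D"], ["A", "G", "H"],
    ["X", "Y"],
]
print(pattern_fraction(sessions, "A", ["A", "B", "C"]))
```

Because gaps are allowed, the single pattern A, B, C covers all three paths listed above (directly, via D, or via E).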
Sequential Pattern Discovery: Examples
- In telecommunications alarm logs,
  - (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
- In point-of-sale transaction sequences,
  - Computer bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
  - Athletic apparel store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
Regression
- Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in the statistics and neural network fields.
- Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.
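For the linear case with a single input variable, the dependency can be fitted by ordinary least squares, as in the sketch below. The advertising and sales figures are invented for illustration, and the closed-form formulas cover only simple (one-variable) linear regression.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Predicted value of the continuous target for input x."""
    return slope * x + intercept

# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.1, 15.9, 18.2, 19.8]
slope, intercept = fit_line(advertising, sales)
print("predicted sales at advertising = 6.0:", predict(slope, intercept, 6.0))
```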
Deviation/Anomaly Detection
- Detect significant deviations from normal behavior
- Applications:
  - Credit card fraud detection
  - Network intrusion detection
- Typical network traffic at the university level may reach over 100 million connections per day
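A very simple way to flag deviations from normal behavior is to measure how many standard deviations each value lies from the mean (a z-score). This is only one of many possible approaches and is not prescribed by the slides. The transaction amounts and the threshold of 3 are assumptions for the example.

```python
import statistics

def find_deviations(values, threshold=3.0):
    """Return (index, value) pairs lying more than threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Hypothetical credit card transaction amounts for one account.
amounts = [25, 30, 22, 28, 31, 27, 24, 29, 26, 950, 23, 30]
print(find_deviations(amounts))
```

A single extreme value inflates both the mean and the standard deviation, so more robust measures (for example, ones based on the median) are often preferred on real data.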
Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy preservation
- Streaming data
Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process.
[Figure: the KDD pipeline, from bottom to top: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data Mining and Business Intelligence
[Figure: a layered pyramid with increasing potential to support business decisions toward the top. Layers, from top to bottom:]
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the layers: End User, Business Analyst, DBA
Steps of a KDD Process
- Learning the application domain:
  - Relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation:
  - Find useful features, dimensionality/variable reduction, invariant representation.
- Choosing the functions of data mining
  - Summarization, classification, regression, association, clustering.
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
Architecture of a Typical Data Mining System
[Figure: layered architecture, from top to bottom:]
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Database or data warehouse server
- Data cleaning and data integration; filtering
- Databases and data warehouse
A knowledge base is shown alongside the layers.
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Machine Learning, Information Science, Statistics, Visualization, and Other Disciplines]
Components of DM Algorithms
- DM algorithms have 3 main components
  - Model (structure): DM algorithms attempt to fit a model to the data
    - A tree in Decision Trees (DTs)
    - Layers of non-linear transformations of weighted sums of the inputs in backpropagation Neural Networks (NNs)
  - Preference (score function): the criterion used to prefer one fitted model over another
    - Number of misclassifications in DTs
    - Mean squared error in NNs
  - Search method: how the algorithm searches for a good model
    - Greedy search over structure in DTs
    - Gradient descent over parameters in NNs
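The three components can be seen in miniature in a one-level decision tree (a "decision stump"), sketched below under simplifying assumptions: one numeric attribute, two classes labeled 0 and 1, and hypothetical data. The model is a single threshold, the score function is the number of misclassifications, and the search tries every candidate threshold, which is the single split-selection step that a greedy tree builder would repeat at each node.

```python
def misclassifications(threshold, data):
    """Score function: number of records the rule 'class 1 if x > threshold else 0' gets wrong."""
    return sum(1 for x, label in data if (1 if x > threshold else 0) != label)

def fit_stump(data):
    """Search method: try a threshold midway between each pair of adjacent values, keep the best."""
    xs = sorted({x for x, _ in data})
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: misclassifications(t, data))

# Hypothetical records: (attribute value, class label).
data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]

# Model (structure): a single split on one attribute, described by its threshold.
threshold = fit_stump(data)
print("threshold:", threshold, "errors:", misclassifications(threshold, data))
```

Swapping the score function (for example, to mean squared error) or the search method (for example, to gradient descent) changes the algorithm without changing this overall three-part shape.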
Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Classification of data mining systems
- Major issues in data mining