- Number of slides: 27
Chapter 26: Data Mining Prepared by Assoc. Professor Bela Stantic

Definition
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Example pattern: 62% of customers who bought milk bought cheese as well.
Ramakrishnan and Gehrke. Database Management Systems, 3rd Edition.

Definition (Cont.)
Each term in the definition has a specific meaning:
• Valid: The patterns hold in general.
• Novel: We did not know the pattern beforehand.
• Useful: We can devise actions from the patterns.
• Understandable: We can interpret and comprehend the patterns.

Why Use Data Mining Today?
Human analysis skills are inadequate:
• Volume and dimensionality of the data
• High data growth rate
Availability of:
• Data
• Storage
• Computational power
• Off-the-shelf software
• Expertise

Sources of Data
• Supermarket scanners, POS data
• Credit card transactions
• Direct mail response
• Call center records
• ATM machines
• Demographic data
• Sensor networks
• Cameras
• Web server logs
• Customer web site trails

Why Use Data Mining Today?
Competitive pressure!
“The secret of success is to know something that nobody else knows.” (Aristotle Onassis)
• Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
• Personalization
• The real-time enterprise

The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes

Data Mining Step in Detail
2.1 Data preprocessing
• Data selection: Identify target datasets and relevant fields
• Data cleaning: Remove noise and outliers
• Data transformation: Create common units, generate new fields
2.2 Data mining model construction
2.3 Model evaluation: present to the end user in understandable form (visually)

Preprocessing and Mining
[Figure: Original Data → Data Integration and Selection → Target Data → Preprocessing → Preprocessed Data → Model Construction → Patterns → Interpretation → Knowledge]

What is a Data Mining Model?
A data mining model is a description of a specific aspect of a dataset. It produces output values for an assigned set of input values.
Examples:
• Linear regression model
• Classification model
• Clustering

Data Mining: Types of Data
• Relational data and transactional data
• Spatial and temporal data, spatio-temporal observations
• Time-series data
• Text
• Images, video
• Mixtures of data
• Sequence data
• Features from processing other data sources

Types of Variables
• Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
• Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
• Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)

Applications of Frequent Itemsets
• Market Basket Analysis
• Association Rules
• Classification (especially: text)
• Seeds for construction of Bayesian Networks
• Web log analysis
• Collaborative filtering

Frequent Itemset
• Itemset: a set of items purchased, e.g., {pen}, {pen, milk}, …
• Support of an itemset: the fraction of transactions in the database that contain all the items in the itemset.
• Frequent itemset: an itemset whose support is higher than the user-specified minimum support.
• The a priori property: every subset of a frequent itemset is also a frequent itemset.
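
As a minimal sketch of the support definition above, the following Python computes the fraction of transactions that contain an itemset. The four transactions are a hypothetical toy database (only transaction 111 is listed in these slides), and the names `TRANSACTIONS`, `support`, and `is_frequent` are illustrative. Treating an itemset whose support equals the threshold as frequent is also an assumption about the boundary case.

```python
# Hypothetical toy database: TID -> set of purchased items.
# Only TID 111 appears in the slides; the other rows are assumed.
TRANSACTIONS = {
    111: {"pen", "ink", "milk", "juice"},
    112: {"pen", "ink", "milk"},
    113: {"pen", "milk"},
    114: {"pen", "ink", "juice"},
}


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def is_frequent(itemset, minsup, transactions=TRANSACTIONS):
    """True if the itemset's support reaches the minimum support (assumed >=)."""
    return support(itemset, transactions) >= minsup
```

With this toy data, {pen} is contained in every transaction, while {ink, milk} is contained in only two of the four.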

Frequent Itemset: refined algorithm
• Minimum support: 70%
• Level 1: we find that the itemsets {pen}, {milk}, and {ink} are frequent.
• Level 2: by the a priori property, candidate two-item sets can be built only from frequent items: {pen, milk}, {pen, ink}, and {ink, milk}. We find that the itemsets {pen, milk} and {pen, ink} are frequent.
• Level 3 is not required: the itemset {ink, milk} is not frequent, so the itemset {pen, ink, milk} cannot be frequent either.
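
The level-by-level procedure above can be sketched roughly as follows. This is an illustrative implementation under stated assumptions, not the textbook's exact pseudocode: the function name `apriori`, the `baskets` toy data, and the use of `>=` for the threshold are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {frozenset: support} for all frequent itemsets, level by level."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    level = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    frequent = {s: sup(s) for s in level}

    k = 2
    while level:
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (a priori property): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Count step: keep only the candidates that reach the minimum support.
        level = {c for c in candidates if sup(c) >= minsup}
        frequent.update({c: sup(c) for c in level})
        k += 1
    return frequent


# Hypothetical toy database, assumed for illustration.
baskets = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
result = apriori(baskets, minsup=0.7)
```

On this toy data the candidate {pen, ink, milk} is discarded in the prune step without scanning the transactions, because its subset {ink, milk} is not frequent.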

Iceberg queries
• We can apply the logic of the refined frequent itemset algorithm to iceberg queries.
• Consider the example:
    SELECT custID, Item, SUM(qty)
    FROM Purchase
    GROUP BY custID, Item
    HAVING SUM(qty) > 5
• This query performs better if we first restrict attention to the customers and the items that satisfy the criterion on their own:
    SELECT custID, SUM(qty)
    FROM Purchase
    GROUP BY custID
    HAVING SUM(qty) > 5
  or
    SELECT Item, SUM(qty)
    FROM Purchase
    GROUP BY Item
    HAVING SUM(qty) > 5
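
A rough sketch of this pruning idea, using Python's standard `sqlite3` module with an in-memory database. The rows are invented for illustration, the column is spelled `custID`, and the way the two single-attribute queries are combined (as `IN` subqueries) is one possible formulation rather than the method prescribed by the slides. The pruning is only valid when `qty` is never negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchase (custID INTEGER, Item TEXT, qty INTEGER)")
# Hypothetical rows, invented for illustration.
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?)",
    [(1, "pen", 4), (1, "pen", 3), (1, "ink", 1),
     (2, "pen", 2), (2, "milk", 1), (3, "milk", 6)],
)

# Direct evaluation of the iceberg query.
direct = conn.execute("""
    SELECT custID, Item, SUM(qty) FROM Purchase
    GROUP BY custID, Item HAVING SUM(qty) > 5
    ORDER BY custID, Item
""").fetchall()

# Pruned evaluation: a (customer, item) group can exceed the threshold
# only if the customer and the item each exceed it on their own
# (assuming qty >= 0), so other rows are filtered out before grouping.
pruned = conn.execute("""
    SELECT custID, Item, SUM(qty) FROM Purchase
    WHERE custID IN (SELECT custID FROM Purchase
                     GROUP BY custID HAVING SUM(qty) > 5)
      AND Item IN (SELECT Item FROM Purchase
                   GROUP BY Item HAVING SUM(qty) > 5)
    GROUP BY custID, Item HAVING SUM(qty) > 5
    ORDER BY custID, Item
""").fetchall()
```

Both evaluations return the same groups; the pruned form simply gives the final GROUP BY fewer rows to aggregate.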

Association Analysis
• Consider a shopping cart filled with several items.
• Market basket analysis tries to answer the following questions:
  • Who makes purchases?
  • What do customers buy together?
  • In what order do customers purchase items?
  • When do customers purchase the most, and what?

Market Basket Analysis
Given:
• A database of customer transactions
• Each transaction is a set of items
• Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

Market Basket Analysis (Contd.)
• Co-occurrences
  • 80% of all customers purchase items X, Y and Z together.
• Association rules
  • 60% of all customers who purchase X and Y also buy Z.
• Sequential patterns
  • 60% of customers who first buy X also purchase Y within three weeks.

Confidence and Support
We prune the set of all possible association rules using two interestingness measures:
• Support of a rule:
  • X → Y has support s if P(XY) = s
  • Represents the percentage of transactions that contain all the items in X and Y
• Confidence of a rule:
  • X → Y has confidence c if P(Y | X) = sup(LHS ∪ RHS) / sup(LHS) = c
  • Represents the percentage of the transactions containing all items in X that also contain all items in Y
We can also define:
• Support of an itemset (a co-occurrence) XY:
  • XY has support s if P(XY) = s
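
The two measures can be computed directly from their definitions, as in the sketch below. The function name `rule_metrics` and the `baskets` data are hypothetical (only the first basket corresponds to a transaction listed in these slides), and returning a confidence of 0.0 when no transaction contains the left-hand side is an assumption made to avoid division by zero.

```python
def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence) of the rule lhs -> rhs."""
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    # Transactions containing the left-hand side.
    n_lhs = sum(1 for t in transactions if lhs <= t)
    # Transactions containing both sides, i.e. LHS union RHS.
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / n
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Hypothetical toy database, assumed for illustration.
baskets = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
ink_to_pen = rule_metrics({"ink"}, {"pen"}, baskets)
pen_to_milk = rule_metrics({"pen"}, {"milk"}, baskets)
```

Because confidence divides the rule's support by the support of the left-hand side, which is at most 1, a rule's confidence is never smaller than its support.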

Examples:
• {Pen} => {Milk}: Support 75%, Confidence 75%
• {Ink} => {Pen}: Support 75%, Confidence 100%

Example
• Find all itemsets with support >= 75%.

Example
• Can you find all association rules with support >= 50%?

Market Basket Analysis: Applications
Sample applications:
• Direct marketing
• Fraud detection for medical insurance
• Floor/shelf planning
• Web site layout
• Cross-selling

Association Rules and ISA Hierarchies
• An ISA hierarchy, or category hierarchy, can be imposed on the set of items; for example, Pen and Ink belong to Stationery, while Juice and Milk belong to Beverages.
• Applying association rules over the hierarchy allows us to detect relationships between items at different levels of the hierarchy.
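
One common way to mine across hierarchy levels is to extend each transaction with the categories of its items and then compute support as usual. The sketch below assumes that approach; the `ISA` mapping follows the example on this slide, while the `baskets` data and the function names are hypothetical.

```python
# Category hierarchy from the slide: item -> parent category.
ISA = {
    "pen": "stationery",
    "ink": "stationery",
    "juice": "beverages",
    "milk": "beverages",
}


def generalize(transaction, hierarchy=ISA):
    """Add each item's category so itemsets can mix hierarchy levels."""
    categories = {hierarchy[i] for i in transaction if i in hierarchy}
    return set(transaction) | categories


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Hypothetical toy database, assumed for illustration.
baskets = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
extended = [generalize(t) for t in baskets]
ink_juice = support({"ink", "juice"}, extended)
ink_beverages = support({"ink", "beverages"}, extended)
```

Replacing an item with its category can only keep or raise the support of an itemset, so {ink, beverages} may be frequent even when {ink, juice} is not.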

Generalised Association Rules
“On a day when a pen is purchased, it is likely that milk is also purchased.”
• If we group by the date field, we can consider a more general problem called calendric market basket analysis.
• Example calendars: every Thursday, the first Sunday of every month, the first Monday of every semester, etc.

The use of Assoc. Rules for prediction
• Association rules are widely used for prediction; however, such predictive use is not justified without additional analysis and domain knowledge.