Chapter 1 Introduction to Data Mining Chen Chun-Hsien


Chapter 1 Introduction to Data Mining
Chen Chun-Hsien
Department of Information Management, Chang Gung University
2018/3/15

Outline
- Motivation for data mining
- What is data mining?
- Applications of data mining
- Data mining process
- Main data mining techniques
- Classification of data mining systems

Motivation
- Phenomenon: data explosion (automated data collection tools, mature database technology)
  - Tremendous number of Web pages
  - 40+ billion photos on Facebook
  - 1 million new transactions per hour added to the Walmart database
  - Big data in clouds
  - Data from wearable devices and the Internet of Things (IoT)
- Problem: we are drowning in data but need knowledge for decision making
- Solution: data mining, one of the main emerging technologies that will change the world in the near future

What Is Data Mining?
- Formal definition of data mining
  - Automatic extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge (rules, regularities, patterns, trends, affinities) from large amounts of data
- Alternative names
  - Business intelligence (BI), knowledge discovery in databases (KDD), data/pattern analysis, knowledge extraction, data dredging, information harvesting, etc.

Example: Mining a Concept Hierarchy
Levels: all > region > country > city > office
- all
  - Europe (region)
    - Germany (country): Frankfurt, ...
    - Spain, ...
  - North_America (region)
    - Canada (country): Vancouver, Toronto, ...
    - Mexico, ...
- Offices (lowest level): L. Chan, M. Wind, ...
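A concept hierarchy like the one above is typically used to summarize data at a coarser level (roll-up). The sketch below is only an illustration: the two mappings follow the region/country/city names on the slide, while the sales figures and the helper name roll_up are hypothetical and not part of the original material.

```python
from collections import defaultdict

# Child -> parent mappings taken from the hierarchy levels on the slide.
CITY_TO_COUNTRY = {"Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada"}
COUNTRY_TO_REGION = {
    "Germany": "Europe",
    "Spain": "Europe",
    "Canada": "North_America",
    "Mexico": "North_America",
}

# Hypothetical sales per city (illustration only, not from the slides).
sales_by_city = {"Frankfurt": 120, "Vancouver": 80, "Toronto": 150}


def roll_up(values, mapping):
    """Aggregate values one level up using a child -> parent mapping."""
    totals = defaultdict(int)
    for child, amount in values.items():
        totals[mapping[child]] += amount
    return dict(totals)


sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
sales_by_region = roll_up(sales_by_country, COUNTRY_TO_REGION)
print(sales_by_country)
print(sales_by_region)
```

Each call to roll_up moves one level up the hierarchy, so the same helper serves city to country and country to region.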

Part of International Sales, Shipping Data

Confluence of Multiple Disciplines
[Diagram: data mining at the center, surrounded by artificial intelligence, machine learning, statistics, information science, visualization, and database technology; additional label: KDD process]

Evolution of Database Technology
- 1960s: data collection, database creation, network DBMS
- 1970s: relational data model, relational DBMS
- 1980s: advanced data models (OO, spatial, temporal D/Bs, etc.)
- 1990s onward: data mining, data warehousing, multimedia D/B, and Web

Applications of Data Mining
- Decision support
  - Business decision support
    - Consumer understanding and service improvement
    - Market trend analysis and management
    - Risk analysis and management
    - Fraud detection and management
  - Medical decision support
- Other applications
  - Web analysis
  - Bioinformatics
  - Text mining

Applications of Data Mining (Market Analysis and Management) (1/2)
- Data sources for analysis
  - Credit card transactions, retail transactions, etc.
  - Public lifestyle studies (breakfast, brunch, coffee)
  - Various questionnaires
- Market basket analysis and cross selling
  - Associations/correlations between product sales
  - Prediction based on the association information

Applications of Data Mining (Market Analysis and Management) (2/2)
- Group profiling of customers
  - Find clusters of "model" customers who share the same characteristics, e.g., spending habits, income level, interests, ...
  - Data mining can tell you what types of customers buy what products (by clustering or classification techniques)
- Identifying customer requirements
  - Identifying potential product sales for e-commerce (eC) customers
  - Use prediction to find what factors will attract new customers

Applications of Data Mining (Risk Analysis and Management)
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Asset evaluation
  - Time series analysis (trend analysis)
- Competitive analysis and market segmentation
  - Monitoring competitors and market directions
  - Setting pricing strategy in a highly competitive market
  - Grouping customers for a class-based pricing procedure

Applications of Data Mining (Fraud Detection and Management)
- Applications
  - Health care, insurance, credit card services
- Approach
  - Use historical data to build models of fraudulent behavior, and use data mining techniques to help identify similar instances
- Examples
  - Detection of money laundering: detect suspicious money transaction patterns in banks
  - Fraud detection in medical insurance: detect cheating rings of patients and doctors

Applications of Data Mining (Other Applications)
- Web mining: mining web logs (FB + news portals)
  - Discovering customer preference and behavior
  - Analyzing effectiveness of Web marketing
- Biomedical informatics
  - Drug discovery
  - Finding related genes of genetic diseases
  - Bacterial identification
- Text mining
  - Detection of email spam: analyze email content
  - Medical informatics: automatic classification of cancer reports
  - News classification: find related articles

Steps in KDD Process (Technically)
- Data mining: the core step of the KDD process
[Diagram: Databases → Preprocessing → Relevant Data → Data Mining → Pattern → Evaluation/Presentation → Knowledge]

Main Steps of a KDD Process (Fully)
- Domain knowledge acquisition
  - Learning relevant prior knowledge and goals of application
- Data collection and preprocessing (may take 60% of effort!)
  - Data selection and integration: creating a target data set
  - Data cleaning, data transformation, and data reduction
- Data mining
  - Choosing functions of data mining: association, classification, clustering, regression, summarization
  - Choosing the mining algorithm(s)
  - Searching for patterns of interest
- Pattern evaluation and knowledge presentation
  - Removing redundant patterns, transformation, visualization, etc.
- Use of discovered knowledge

Mining On What Kind of Data?
- Relational databases
- Transactional databases
- Data warehouses
- Advanced D/B and information repositories
  - Temporal data (time-series data)
  - Spatial databases
  - Text databases and multimedia databases
  - Object-oriented databases
  - Web pages
  - Heterogeneous and legacy databases

Steps in KDD Process
[Diagram: Databases → Preprocessing → Relevant Data]

Why Data Preprocessing?
- Data in the real world is dirty (e.g., Facebook)
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - Noisy: containing errors or outliers
  - Inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data

Major Tasks in Data Preprocessing
- Data cleaning
- Data integration
- Data transformation
- Data reduction
- Data discretization
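Three of the preprocessing tasks listed above can be sketched on a single numeric attribute. This is a minimal illustration under stated assumptions: the ages list is hypothetical, the function names are invented for this example, and mean imputation, min-max scaling, and equal-width binning are only one common choice for each task (the slides do not prescribe specific methods). The helpers assume the attribute has at least two distinct values.

```python
# Hypothetical customer ages with one missing value (None).
ages = [25, 32, None, 47, 51, 38]


def fill_missing_with_mean(values):
    """Data cleaning: replace missing entries with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Data discretization: map each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


cleaned = fill_missing_with_mean(ages)
normalized = min_max_normalize(cleaned)
bins = equal_width_bins(cleaned, 3)
print(cleaned)
print(normalized)
print(bins)
```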

Steps in KDD Process
[Diagram: Databases → Preprocessing → Relevant Data → Data Mining → Pattern]

Main Data Mining Techniques
- Association rule mining (descriptive analysis)
- Classification and prediction (predictive analysis)
- Cluster analysis (exploratory analysis)
- Regression analysis
- Outlier analysis
- Trend analysis

Main Data Mining Techniques (1/4)
Association Rule Mining (association rule: correlation and causality)
- Form of association rules
  - buy(T, "Beer") → buy(T, "Diaper") [support = 2%, confidence = 70%]
  - sales(T, "computer") → sales(T, "software") [support = 1%, confidence = 75%]
  - age(X, "31..35") ^ income(X, "40..49K") → buys(X, "iPad") [support = 1%, confidence = 70%]
  - age(X, "21..25") ^ income(X, "30..39K") → buys(X, "PC") [support = 2%, confidence = 60%]
- Callouts on the slide: Walmart story, 3C retail stores, IBM failure story, Acer failure story

Association Rule Mining (Support and Confidence)
- Given a transaction D/B, find all the rules X → Y with minimum support and confidence
  - Support, S: probability that a transaction contains {X & Y}
  - Confidence, C: conditional probability that a transaction containing X also contains Y
- Association rules with support >= 50%
  - A → C (50%, 66.6%)
  - C → A (50%, 100%)
[Venn diagram: all transactions; transactions that buy X; transactions that buy Y; transactions that buy both]
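The two measures can be checked with a few lines of code. The transaction table behind the slide's numbers did not survive extraction, so the four transactions below are an assumption: a small hypothetical database chosen so that A → C and C → A come out at the (50%, 66.6%) and (50%, 100%) quoted above. The function names are likewise illustrative.

```python
# Hypothetical transaction database (assumed; the slide's own table is not available).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Conditional probability that a transaction containing lhs also contains rhs."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)


print(support({"A", "C"}, transactions))       # support of A -> C and of C -> A
print(confidence({"A"}, {"C"}, transactions))  # confidence of A -> C
print(confidence({"C"}, {"A"}, transactions))  # confidence of C -> A
```

Support is symmetric in X and Y, while confidence is not, which is why the two rules share a support of 50% but differ in confidence.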

Main Data Mining Techniques (2/4)
Supervised Learning
Use a training set to construct a model that forecasts the outcome of future events. Two main types:
- Classification
  - Finding a model that distinguishes classes for future events, e.g., loan approval, customer classification, fingerprint recognition
  - Model representation: decision trees, artificial neural networks
- Prediction
  - Finding a model that predicts numerical values for future events, e.g., stock price prediction
  - Model representation: regression, artificial neural networks

Classification vs. Prediction
Use a training set to construct a model that forecasts the outcome of future events.
- Classification
  - Predicts categorical class labels
  - Constructs a classification model to classify new data
- Prediction
  - Predicts numerical values
  - Constructs a continuous-valued function to predict unknown or missing values
- Typical applications
  - Credit card approval
  - Medical diagnosis & treatment
  - Pattern recognition

Process of Classification & Prediction (A Two-Step Process)
- Model construction: Training Data (I, O) → Learning Algorithms → Model f, where y = f(x) (x ∈ I, y ∈ O)
- Model usage: input features x' → Model f → output y' (class label or value)
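The two steps, model construction and model usage, can be sketched with a deliberately simple learner. The example below uses a 1-nearest-neighbor rule as a stand-in for the "learning algorithm"; that choice, the function name construct_model, and the (age, income) training pairs are all assumptions made for illustration, not something the slides specify.

```python
import math


def construct_model(training_data):
    """Step 1 (model construction): build f from (x, y) training pairs.

    A 1-nearest-neighbor learner simply memorizes the examples.
    """
    examples = list(training_data)

    def f(x_new):
        # Step 2 (model usage): return the label of the closest training input.
        _, nearest_y = min(examples, key=lambda pair: math.dist(pair[0], x_new))
        return nearest_y

    return f


# Hypothetical training data: x = (age, income in thousands), y = class label.
training_data = [
    ((25, 20), "no"),
    ((30, 25), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
]

f = construct_model(training_data)
y_new = f((48, 85))  # new input features x' -> predicted output y'
print(y_new)
```

Swapping in a different learner (a decision tree, a regression model) changes only the body of construct_model; the construct-then-use structure stays the same.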

An Example of Training Dataset (Data of Consumers' Buying Behavior)
[Table: input features (I) are customer characteristics; the output (O) is the class label]
This follows an example from Quinlan's ID3

A Decision Tree Model for Predicting buy_PC
Model: buy_PC = f(age, student, credit rating)
Example: x = < <=30, yes, fair > gives y = yes
Legend: "?" marks a test (input) attribute; branch labels are attribute values; leaves are class labels for buy_PC
age?
- <= 30: student?
  - no: no
  - yes: yes
- 30..40: yes
- > 40: credit rating?
  - excellent: no
  - fair: yes
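A decision tree is just a nested sequence of attribute tests, so it translates directly into conditionals. The sketch below is an assumption-laden reading of the slide: the tree figure is only partly recoverable from the text, so the leaf labels follow the widely used textbook version of this buy_PC example, the "30..40" branch is taken to mean ages 31 to 40, and the function name predict_buy_pc is invented here.

```python
def predict_buy_pc(age, student, credit_rating):
    """Walk the buy_PC decision tree from the root test (age) down to a leaf.

    Leaf labels are assumed from the classic textbook example, not read
    directly from the (garbled) slide figure.
    """
    if age <= 30:
        # Young customers: the decision depends on student status.
        return "yes" if student == "yes" else "no"
    if age <= 40:
        # Middle branch ("30..40"): always predicted to buy.
        return "yes"
    # Customers over 40: the decision depends on credit rating.
    return "yes" if credit_rating == "fair" else "no"


# The slide's sample input x = < <=30, yes, fair >, using 25 as a concrete age.
print(predict_buy_pc(25, "yes", "fair"))
```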

A Decision Tree for CAD Screening (Constructed from ~500 Records)

Main Data Mining Techniques (3/4)
Cluster Analysis (unsupervised learning)
- Class label is unknown: group data to form new classes
- Application example: customer profiling for product recommendation (online bookstores)
- Typical clustering principle: maximize the intra-class similarity and minimize the inter-class similarity
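One way to see the clustering principle in action is k-means, which repeatedly assigns each point to its nearest centroid and then moves each centroid to the mean of its points. The slides do not name a specific algorithm, so k-means is swapped in here purely as an illustration; the points, the initial centroids, and the function name kmeans are hypothetical. The sketch needs Python 3.8+ for math.dist.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assignment and centroid update."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2D data with two visually separate groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The initial centroids are passed in explicitly so the run is deterministic; in practice they are usually chosen at random, and the result can depend on that choice.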

Example of 2D Cluster Analysis
[Figure: three clusters A, B, and C, with points X, Y, and Z as outliers]
Difficulty: the distribution of high-dimensional data cannot be inspected visually.

Clustering Example in High Dimension (Cluster Analysis of CAD Data)
[Figure: clustering dendrogram and data matrix for visualization, showing the profile of CAD patients and the profile of healthy people]

Profile of Stroke Patients (Diagnosed by Indices of Chinese Medicine)

Main Data Mining Techniques (4/4)
Example of Linear Regression
- Predict y's value at X1 using linear regression
- y = f(x): what is f?
- Explore the meaning of a and b
[Figure: scatter plot with fitted line y = ax + b; the unknown value Y1 is read off the line at x = X1]
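For a single input variable, a and b in y = ax + b can be obtained in closed form by ordinary least squares: a is the covariance of x and y divided by the variance of x, and b makes the line pass through the point of means. The sketch below assumes that fitting method (the slide does not say which one is used), and the data points and the name fit_line are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx            # slope: change in y per unit change in x
    b = mean_y - a * mean_x  # intercept: value of y when x = 0
    return a, b


# Hypothetical observations lying exactly on a line, to keep the result easy to check.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
x1 = 6
y1 = a * x1 + b  # predicted value of y at X1
print(a, b, y1)
```

Read this way, a is the slope (how much y changes per unit of x) and b is the intercept, which is one answer to the slide's question about their meaning.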

Other Data Mining Techniques (4/4)
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be considered noise or an exception, but is quite useful in fraud detection and rare-event analysis
- Trend analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
- Other pattern-directed or statistical analyses
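A very simple statistical take on outlier analysis is to flag values that lie far from the mean in units of standard deviation (a z-score test). This is only one basic approach and is not taken from the slides; the transaction amounts, the threshold of 2.0, and the function name find_outliers are assumptions for illustration.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(find_outliers(amounts))
```

In a fraud-detection setting the flagged value would be a candidate for manual review rather than proof of fraud. In a sample this small a single extreme value also inflates the standard deviation, so the threshold has to be chosen with care.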

Are All the "Discovered" Patterns Interesting?
- A data mining system/query may generate thousands of patterns, and not all of them are interesting, so pattern screening becomes a problem.
- Interestingness: a measure for automatic pattern screening
  - A pattern is interesting if it is easily understood, potentially useful, novel, valid on new or test data with some degree of certainty, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures for data screening
  - Objective: based on statistics and structures of data patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns? Completeness vs. Optimization
- Completeness: find all the interesting patterns
  - Can a data mining system find all the interesting patterns?
- Optimization: find only interesting patterns
  - Can a data mining system find only the interesting patterns?
- Approaches
  - First generate all the patterns and then filter out the uninteresting ones
  - Generate only the interesting patterns (mining query optimization)

Classification Scheme of DM Techniques
- General functionality
  - Descriptive/exploratory data mining
  - Predictive data mining
- Different views, different classifications
  - Kinds of databases to be mined
  - Kinds of knowledge to be discovered
  - Kinds of techniques utilized
  - Kinds of applications adapted

A Multi-Dimensional View of DM Technique Classification
- Databases to be mined
  - Relational, transactional, Web, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, etc.
- Knowledge to be mined
  - Association, classification, clustering, trend, characterization, deviation and outlier analysis, etc.
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, stock market analysis, Web mining, biomedical informatics, etc.

Summary for Data Mining
- Data mining: automatic discovery of interesting knowledge from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data preprocessing, data mining, pattern evaluation, and knowledge presentation
- Main data mining functions: association, classification, clustering, outlier and trend analysis, characterization, etc.

Thanks! Have a nice day!