Week 9 Data Mining System (Knowledge Data Discovery)

Case Scenario
ABC Enterprise is a multinational company that offers multimedia content services in several regions in Asia. It has more than 6 million content subscribers. For a company of this size, a major problem is maintaining a good relationship with its existing content subscribers. Every year, it has to offer good content promotions to suit its customers' needs. However, this is a difficult task because the company holds a huge collection of data about its subscribers, who have different needs and lifestyles. Therefore, the CEO of the company, Mr. Ridzuan, wants a system that can analyze the enormous volume of subscriber data and suggest which content promotions are suitable for which subscribers.

Knowledge Discovery & Data Mining
  • Knowledge Discovery (KD) is the process of extracting previously unknown, valid, and actionable (understandable) information from large databases.
  • Data mining is the step in the KDD process that applies data analysis and discovery algorithms.
  • It relates to machine learning, pattern recognition, statistics, data visualization, etc.

  • Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, potentially useful, and ultimately understandable patterns in data.
[Figure: the KDD process. Diagram labels: Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation]

Why Mine Data?
  • Huge amounts of data are being collected and warehoused
      • Walmart records 20 million transactions per day
      • health care transactions: multi-gigabyte databases
      • Mobil Oil: geological data of over 100 terabytes
  • Affordable computing
  • Competitive pressure
      • gain an edge by providing improved, customized services
      • information as a product in its own right

Data Mining Methods
  • Prediction Methods
      • using some variables to predict unknown or future values of other variables
  • Descriptive Methods
      • finding human-interpretable patterns describing the data

Data Mining Tasks
  • Classification
  • Clustering
  • Association Rule Discovery
  • Sequential Pattern Discovery

1. Classification
  • Data is defined in terms of attributes, one of which is the class.
  • Find a model for the class attribute as a function of the values of the other (predictor) attributes, such that previously unseen records can be assigned a class as accurately as possible.
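
To make the classification task above concrete, here is a minimal sketch in Python. It uses a k-nearest-neighbour classifier, which is only one of many possible models and is not named in the slides; the attributes (age, monthly spend) and all records are hypothetical illustration data.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Assign a class to `query` by majority vote among its k nearest
    training records. `train` is a list of (predictor_values, class_label)."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training records: (age, monthly_spend) -> class attribute.
train = [
    ((25, 10.0), "not buy"),
    ((30, 12.0), "not buy"),
    ((28, 15.0), "not buy"),
    ((45, 60.0), "buy"),
    ((50, 55.0), "buy"),
    ((40, 65.0), "buy"),
]

# A previously unseen record is assigned the class of its closest neighbours.
prediction = knn_predict(train, (42, 58.0))
print(prediction)
```

The predictor attributes feed the distance computation, and the class attribute is the label being voted on, which mirrors the "class as a function of the other attributes" formulation.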

Classification: Example

Classification: Direct Marketing
  • Goal: reduce the cost of soliciting (mailing) by targeting a set of consumers likely to buy a new product.
  • Data
      • for a similar product introduced earlier, we know which customers decided to buy and which did not; this {buy, not buy} decision is the class attribute
      • collect various demographic, lifestyle, and company-related information about all such customers as possible predictor variables
  • Learn a classifier model

2. Clustering
  • Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
      • data points in one cluster are more similar to one another
      • data points in separate clusters are less similar to one another
  • Similarity measures
      • Euclidean distance if attributes are continuous
      • problem-specific measures
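
As an illustration of clustering with Euclidean distance, the sketch below implements a basic k-means loop. K-means is an assumed choice of algorithm (the slides do not name one), and the points and starting centroids are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid
    (Euclidean distance), then move each centroid to its cluster mean."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                # Mean of each coordinate over the points in the cluster.
                new_centroids.append(
                    tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data points with continuous attributes.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(clusters)
```

Points that end up in the same cluster are closer to each other than to points in the other cluster, which is the property the definition above asks for.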

Clustering: Market Segmentation
  • Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  • Approach:
      • collect different attributes on customers based on geographical and lifestyle-related information
      • identify clusters of similar customers
      • measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters

3. Association Rule Discovery
  • Given a set of records, each of which contains some number of items from a given collection:
      • produce dependency rules that predict the occurrence of an item based on occurrences of other items
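
The following sketch shows one way such dependency rules could be produced. It scores single-item rules of the form {A} --> {B} with support and confidence, two standard measures that the slides do not mention explicitly; the thresholds and the subscriber records are hypothetical.

```python
from itertools import permutations


def count(transactions, itemset):
    """Number of records that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)


def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) for single items.
    support    = fraction of records containing both lhs and rhs
    confidence = fraction of records with lhs that also contain rhs"""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for lhs, rhs in permutations(items, 2):
        both = count(transactions, {lhs, rhs})
        lhs_count = count(transactions, {lhs})
        if lhs_count == 0:
            continue
        support = both / n
        confidence = both / lhs_count
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical records: content categories each subscriber has used.
transactions = [
    {"movies", "music"},
    {"movies", "music", "games"},
    {"movies", "games"},
    {"music", "movies"},
    {"news"},
]
rules = pair_rules(transactions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} --> {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

This brute-force enumeration only suits tiny examples; practical systems prune the candidate itemsets instead of testing every pair.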

Association Rule Discovery: Marketing and Sales Promotion Application

4. Sequential Pattern Discovery
  • Given: a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events, of the form (A B) (C) (D E) --> (F)
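
A rule of this form can be checked against data as sketched below. The code tests whether a timeline contains an ordered pattern of itemsets and then estimates how often the antecedent is followed by the consequent. The confidence measure, the helper names, and the purchase timelines are assumptions made for illustration.

```python
def contains(sequence, pattern):
    """True if the itemsets in `pattern` occur, in order, in `sequence`.
    Each pattern itemset must be a subset of a distinct, later event."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must match a strictly later event
    return True


def rule_confidence(sequences, antecedent, consequent):
    """Fraction of timelines matching `antecedent` in which
    `consequent` also occurs afterwards."""
    matched = [s for s in sequences if contains(s, antecedent)]
    if not matched:
        return 0.0
    followed = sum(contains(s, antecedent + consequent) for s in matched)
    return followed / len(matched)


# Hypothetical purchase timelines, one list of events per customer.
sequences = [
    [{"Shoes"}, {"Racket", "Racketball"}, {"Sports Jacket"}],
    [{"Shoes", "Socks"}, {"Racket", "Racketball"}, {"Water Bottle"}, {"Sports Jacket"}],
    [{"Shoes"}, {"Racket", "Racketball"}],
    [{"Sports Jacket"}, {"Shoes"}],
]
antecedent = [{"Shoes"}, {"Racket", "Racketball"}]
consequent = [{"Sports Jacket"}]
confidence = rule_confidence(sequences, antecedent, consequent)
print(confidence)
```

Order matters here: the fourth customer bought both shoes and a sports jacket, but not in the sequence the rule describes, so that timeline does not match.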

Sequential Pattern Discovery: Examples
  • sequences in which customers purchase goods/services
  • understanding long-term customer behavior, for timely promotions
  • in point-of-sale transaction sequences
      • Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports Jacket)

Data Mining Systems
  • Clementine (SPSS)
      • http://www.spss.com/spssbi/clementine/index.htm
  • Data Miner (Statistica)
      • http://www.statsoft.com/dataminer.html
  • RuleQuest (C5.0)
      • http://www.rulequest.com/

Limitations/Challenges
  • Large data
      • number of variables (features), number of cases (examples)
      • multi-gigabyte, terabyte databases
      • efficient algorithms, parallel processing
  • High dimensionality
      • large number of features: exponential increase in search space (potential for spurious patterns)
  • Use of domain knowledge
      • utilizing knowledge on complex data relationships, known facts
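
The exponential growth mentioned under high dimensionality can be illustrated with a small calculation. The sketch assumes one particular reading of "search space", namely the number of non-empty subsets of features a mining algorithm could consider; other readings are possible.

```python
def feature_subsets(n_features):
    """Number of non-empty subsets of n features: 2^n - 1."""
    return 2 ** n_features - 1


# Each added feature roughly doubles the number of candidate subsets.
for n in (10, 20, 30):
    print(n, feature_subsets(n))
```

With so many candidate combinations, some will fit the data by chance alone, which is the source of the spurious patterns noted above.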

Intelligence Density Dimension
  • Accuracy
  • Explainability
  • Flexibility
  • Response speed