Issues in Data Mining Applications - Tutorial
How to Make a Decision About Your Own Data Mining Tool?
Authors:
- Nemanja Jovanovic, [email protected]
- Valentina Milenkovic, [email protected]
- Prof. Dr. Veljko Milutinovic, [email protected]

Data Mining vs. Knowledge Mining = ??

Instead of a foreword
“...If you are not able to swim in the ocean of data, you will drown...”

Tutorial Content
This tutorial will guide you through the following sections:
- What does Data Mining really mean?
- Successful Data Mining
- Comparison of fourteen DM tools
- How to improve existing Data Mining applications
- Potential applications
- Myths and facts about Data Mining
- Two case studies
- The future of DM applications

Other definitions of Data Mining
- Data mining is the (semi)automatic discovery of patterns, associations, anomalies, and changes in data.
- Data mining extracts information from a database that the user did not know existed.
- Data mining is also the search for relationships and global patterns that exist in large databases but are hidden among the vast amount of data.

The Foundations of Data Mining
- Massive data collection
- Powerful multiprocessor computers
- Data mining algorithms
[Figure: volume of data, 1970 to 2000]

Evolution of Data Mining

Data Collection (1960s)
- Business question: What was my average total revenue over the last 5 years?
- Enabling technologies: Computers, tapes, disks
- Product providers: IBM, CDC
- Characteristics: Retrospective, static data delivery

Data Access (1980s)
- Business question: What were unit sales in New England last March?
- Enabling technologies: RDBMS, SQL, ODBC
- Product providers: Oracle, Sybase, Informix, IBM, Microsoft
- Characteristics: Retrospective, dynamic data delivery at record level

Data Navigation (1990s)
- Business question: What were unit sales in New England last March? Drill down to Boston.
- Enabling technologies: OLAP, multidimensional databases, data warehouses
- Product providers: Pilot, IRI, Arbor, Redbrick, Evolutionary Technologies
- Characteristics: Retrospective, dynamic data delivery at multiple levels

Data Mining (2000)
- Business question: What's likely to happen to Boston unit sales next month? Why?
- Enabling technologies: Advanced algorithms, multiprocessors, massive databases
- Product providers: Lockheed, IBM, SGI, numerous startups
- Characteristics: Prospective, proactive information delivery

Data Mining context
- Application domain
- Data mining problem type
- Technical aspect
- Data mining tools and techniques

Data Mining Techniques
- Artificial Neural Networks
- Decision Trees
- Genetic Algorithms
- Rule Induction
- K-Nearest Neighbor (k-NN)
- Data Visualization
[Figure: input pattern mapped to output pattern]

Examples of DM projects to stimulate your imagination
Here are six examples of how data mining is helping corporations operate more efficiently and profitably in today's business environment:
- Targeting a set of consumers who are most likely to respond to a direct mail campaign
- Predicting the probability of default for consumer loan applications
- Reducing fabrication flaws in VLSI chips
- Predicting audience share for television programs
- Predicting the probability that a cancer patient will respond to radiation therapy
- Predicting the probability that an offshore oil well is actually going to produce oil

Successful Data Mining
- Come up with a precise formulation of the problem you are trying to solve, and use the right data
- Have a clearly articulated business problem, then determine whether data mining is the proper solution technology
- Understand and deliver the fundamentals
- Have your technology people involved, too
- Visualizing the data mining output in a meaningful way is very important
- Allow the user to interact with the visualization

Comparison of fourteen DM tools
- Evaluated by four undergraduates inexperienced at data mining, a relatively experienced graduate student, and a professional data mining consultant
- Run under MS Windows 95, MS Windows NT, or Macintosh System 7.5
- Use one of four technologies: Decision Trees, Rule Induction, Neural Networks, or Polynomial Networks
- Solve two binary classification problems, a multi-class classification problem, and a noiseless estimation problem
- Price from $75 to $25,000

Comparison of fourteen DM tools
The Decision Tree products were:
- CART
- Scenario
- See5
- S-Plus
The Rule Induction tools were:
- WizWhy
- DataMind
- DMSK
Neural Networks were built from three programs:
- NeuroShell 2
- PcOLPARS
- PRW
The Polynomial Network tools were:
- ModelQuest Expert
- Gnosis
- a module of NeuroShell 2
- KnowledgeMiner

Criteria for evaluating DM tools
A list of 20 criteria for evaluating DM tools, grouped into 4 categories.
Capability measures what a desktop tool can do and how well it does it:
- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Quality of testing options
- Has a programming language
- Provides useful output reports
- Visualisation

Visualisation
[Table: per-tool ratings. Legend: + excellent capability, good capability, - some capability, “blank” no capability]

Criteria for evaluating DM tools
Learnability/Usability shows how easy a tool is to learn and use:
- Tutorials
- Wizards
- Easy to learn
- User's manual
- Online help
- Interface

Criteria for evaluating DM tools
Interoperability shows a tool's ability to interface with other computer applications:
- Importing data
- Exporting data
- Links to other applications
Flexibility:
- Model adjustment flexibility
- Customizable work environment
- Ability to write or change code

Data Input & Output, Model
[Table: per-tool ratings. Legend: + excellent capability, good capability, - some capability, “blank” no capability]

A classification of data sets
Pima Indians Diabetes data set
- 768 cases of Native American women from the Pima tribe, some of whom are diabetic, most of whom are not
- 8 attributes plus the binary class variable for diabetes per instance
Wisconsin Breast Cancer data set
- 699 instances of breast tumors, some of which are malignant, most of which are benign
- 10 attributes plus the binary malignancy variable per case
The Forensic Glass Identification data set
- 214 instances of glass collected during crime investigations
- 10 attributes plus the multi-class output variable per instance
Moon Cannon data set
- 300 solutions to the equation x = 2 v^2 sin(θ) cos(θ) / g, where v is the launch speed, θ the launch angle, and g the gravitational acceleration
- the data were generated without adding noise
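As a rough illustration of how a noiseless data set like Moon Cannon could be generated, the sketch below samples launch speeds and angles and applies the range equation. It reads the argument of sin and cos as a launch angle (theta) distinct from the gravitational acceleration g. The sampling ranges, the random seed, the gravity value, and the function names are assumptions made for illustration; the slides give only the number of cases (300) and the equation.

```python
import math
import random

MOON_GRAVITY = 1.62  # m/s^2, approximate lunar surface gravity (assumed value)


def cannon_range(v, theta, g=MOON_GRAVITY):
    """Horizontal range x = 2 v^2 sin(theta) cos(theta) / g."""
    return 2.0 * v ** 2 * math.sin(theta) * math.cos(theta) / g


def make_moon_cannon(n=300, seed=0):
    """Return n noiseless (v, theta, x) rows; input ranges are hypothetical."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)            # launch speed in m/s (assumed range)
        theta = rng.uniform(0.0, math.pi / 2)   # launch angle in radians
        rows.append((v, theta, cannon_range(v, theta)))
    return rows


data = make_moon_cannon()
```

Because no noise is added, a tool that recovers the functional form exactly could in principle reach zero error on this estimation problem, which is what distinguishes it from the three classification sets.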

Evaluation of fourteen DM tools

Strengths and Weaknesses
Strengths
- Ease of use (Scenario, WizWhy, ...)
- Data visualisation (S-Plus, MineSet, ...)
- Depth of algorithms (tree options) (CART, See5, S-Plus, ...)
- Multiple neural network architectures (NeuroShell)
Weaknesses
- Difficult file I/O (OLPARS, CART)
- Limited visualisation (PRW, See5, WizWhy)
- Narrow analysis path (Scenario)

How to improve existing DM applications
The top ten points:
- Database integration: no more flat files; use the millions of dollars spent on data warehousing
- Automated model scoring: without scoring, DM is pretty useless; scoring should be integrated with the driving applications
- Exporting models to other applications: close the loop between DM and the applications that need to use the results (scores)

How to improve existing DM applications
- Business templates: an application built for a specific task, such as cross-selling, is more valuable than a general modeling tool
- Effort knob: it is meaningful to the user in a way that tuning parameters are not
- Incorporate financial information: financial information is very important and often available, and should be provided as input to the DM application

How to improve existing DM applications
- Computed target columns: allow the user to interactively create a new target variable
- Time-series data: a year's worth of monthly balance information is qualitatively different from twelve distinct non-time-series variables
- Use versus View: do not show the user the full model, only its most important levels
- Wizards: not strictly necessary but desirable; they prevent human error by keeping the user on track

Potential Applications
Data mining has many varied fields of application, some of which are listed below.
Retail/Marketing
- Identify buying patterns from customers
- Find associations among customer demographic characteristics
- Predict response to mailing campaigns
- Market basket analysis

Potential Applications
Banking
- Detect patterns of fraudulent credit card use
- Identify "loyal" customers
- Determine credit card spending by customer groups
- Find hidden correlations between different financial indicators
- Identify stock trading rules from historical market data

Potential Applications
Insurance and Health Care
- Claims analysis, i.e., which medical procedures are claimed together
- Predict which customers will buy new policies
- Identify behaviour patterns of risky customers
- Identify fraudulent behaviour

Potential Applications
Transportation
- Determine the distribution schedules among outlets
- Analyse loading patterns
Medicine
- Characterise patient behaviour to predict office visits
- Identify successful medical therapies for different illnesses
- Predict the effectiveness of surgical procedures or medical tests

Potential Applications
Sport
- Make the best choice of players for different circumstances
- Predict the results of relevant matches
- Produce a better list of seeded players in groups or tournaments
DM report from an NBA game:
When Price was Point-Guard, J. Williams missed 0% (0) of his jump field-goal attempts and made 100% (4) of his jump field-goal attempts. The total number of such field-goal attempts was 4.

DM and Customer Relationship Management
- CRM is a process that manages the interactions between a company and its customers
- Users of CRM software applications are database marketers
- Goals of database marketers are:
  - identifying market segments, which requires significant data about prospective customers and their buying behaviors
  - building and executing campaigns
- Tightly integrating the two disciplines presents an opportunity for companies to gain competitive advantage

DM and Customer Relationship Management
- How Data Mining helps Database Marketing
- Scoring
- The role of Campaign Management Software
- Increasing the customer lifetime value
- Combining Data Mining and Campaign Management

DM and Customer Relationship Management
Evaluating the benefits of a Data Mining model:
- Gains chart
- Profitability chart

Myths and Facts about Data Mining
- Myth: DM produces surprising results that will utterly transform your business.
- Myth: DM techniques are so sophisticated that they can substitute for domain knowledge or for experience in analysis and model building.
- Myth: DM tools automatically find the patterns you are looking for, without being told what to do.

Myths and Facts about Data Mining
- Myth: Data mining is more effective with more data, so all existing data should be brought into any data-mining effort.
- Myth: Building a DM model on a sample of a database is ineffective, because sampling loses the information in the unused data.
- Myth: Data mining is another fad that will soon fade, allowing us to return to standard business practice.

Myths and Facts about Data Mining
- Myth: DM is useful only in certain areas, such as marketing, sales, and fraud detection.
- Myth: The methods used in DM are fundamentally different from the older quantitative model-building techniques.
- Myth: Data mining is an extremely complex process.
- Myth: Only massive databases are worth mining.

Data Mining Examples
- Bass Brewers: “We’ve been brewing beer since 1777. With increased competition comes a demand to make faster, better-informed decisions.”
- Northern Bank: “The information is now more accessible, paperless and timely.”
- TSB Group Plc: “We are using Holos because of its flexibility and its excellent multidimensional database.”

Data Mining Examples
- Delphic Universites: “Real value is added to data by multidimensional manipulation (being able to easily compare many different views of the available information in one report) and by modeling.”
- Harvard - Holden: “Sybase technology has allowed us to develop an information system that will preserve this legacy into the twenty-first century.”
- J. P. Morgan: “The promise of data mining tools like Information Harvester is that they are able to quickly wade through massive amounts of data to identify relationships or trending information that would not have been available without the tool.”

Case study of Breast Cancer Survival Analysis
- A case study of the influence of various patient characteristics on survival rates for breast cancer
- The survival analysis technique employed is Cox Regression, which is useful in situations where some of the patients do not die during the observation period
- A linear regression technique could be used if all patients had died during the observation period

Case study of Breast Cancer Survival Analysis
- The observation period runs for 133.8 months
- The modeling sample contains 746 patients (50 who died during the observation period and 696 who survived beyond its end)
- In this example, we are testing only four predictors:
  - Age, in years, at the start of the observation period (22 to 88)
  - Pathological tumor size, in centimeters (0.10 to 7.00)
  - Number of positive axillary lymph nodes (0 to 35)
  - Estrogen receptor status (positive vs. negative)

Case study of Breast Cancer Survival Analysis
- The Cox Regression used a backward stepwise likelihood-ratio variable selection method
- Significance criteria were set at 0.05 for inclusion in the model, and 0.10 for removal from the model
Printout from the final step of the stepwise regression analysis:

________________ Variables in the Equation ________________
Variable   B       S.E.   Wald     df  Sig    R       Exp(B)
AGE        -.0314  .0121  6.7486   1   .0094  -.0893  .9691
PATHSIZE   .3975   .1175  11.4476  1   .0007  .1259   1.4881
LNPOS      .1372   .0361  14.4100  1   .0001  .1443   1.1471
___________________________________________________________

- The column labeled "Sig" shows the statistical significance of included variables
- The column labeled "R" shows the degree of unique correlation with the dependent variable
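The columns of the printout are related by simple arithmetic, which the following sketch re-derives. It assumes the usual conventions for this kind of output: Exp(B) is the exponential of the coefficient B (the hazard ratio per unit increase of the predictor), and the Wald statistic with one degree of freedom is (B / S.E.) squared. The dictionary and function names are illustrative, and the results match the printout only approximately because the printed B and S.E. values are rounded.

```python
import math

# (B, S.E.) pairs copied from the printout above.
coefficients = {
    "AGE": (-0.0314, 0.0121),
    "PATHSIZE": (0.3975, 0.1175),
    "LNPOS": (0.1372, 0.0361),
}


def hazard_ratio(b):
    """Exp(B): multiplicative change in hazard per unit increase of the predictor."""
    return math.exp(b)


def wald_statistic(b, se):
    """Wald chi-square with one degree of freedom: (B / S.E.)^2."""
    return (b / se) ** 2


for name, (b, se) in coefficients.items():
    print(f"{name}: Exp(B) = {hazard_ratio(b):.4f}, Wald = {wald_statistic(b, se):.2f}")
```

Read this way, an Exp(B) above 1 (PATHSIZE, LNPOS) means the hazard rises as the predictor increases, while a value below 1 (AGE) means the hazard falls.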

Case study of Breast Cancer Survival Analysis
Some key things to note:
- Estrogen receptor status was removed as a predictor because it did not reach the 0.05 significance criterion for inclusion.
- Number of positive axillary lymph nodes was the strongest predictor of survival rates over the course of the observation period (R = .1443, Sig = .0001), followed by pathological tumor size (R = .1259, Sig = .0007).
- Age, although significant, is somewhat less influential than the other two predictors (R = -.0893, Sig = .0094).
- Both the number of positive axillary lymph nodes and the pathological tumor size are positively correlated with the dependent variable, which means that they are directly associated with more rapid mortality.
- Age is negatively correlated with the dependent variable (Exp(B) = .9691, below 1), which means that older age is predictive of somewhat longer survival.

Case study of Breast Cancer Survival Analysis
- All patients survive through the tenth month of the observation period
- At the fortieth month, the mortality rate increases and continues at this fairly constant increased rate through the forty-fifth month
- At the forty-fifth month, there is a five-month period without additional mortality
- 11% of the original sample has died
[Figure: cumulative survival function during the observation period]

Case study of Breast Cancer Survival Analysis
Conclusions and Implications
- The case study presented here is relatively simple, and is for illustrative purposes only.
- With the addition of more candidate predictors (progesterone receptor status, histologic grade, blood type, etc.), an even more powerful model could emerge.
- By understanding the influence of patient characteristics on mortality rates over time, we are in a better position to estimate survival times for individual patients, and to defend using different or more aggressive therapeutic approaches for some patients.

Securities Brokerage Case Study
The following four pages are derived from a copyrighted case study originally created by SmartDrill Data Mining (Marlborough, MA, U.S.A.).
Their website is: http://smartdrill.com
The original case study appears in its entirety here: http://smartdrill.com/CHAID.html

Securities Brokerage Case Study
- A predictive market segmentation model designed to identify and profile high-value brokerage customer segments as targets for special marketing communications efforts.
- The dependent variable for this ordinal CHAID model is brokerage account commission dollars during the past 12 months.
- We begin by splitting the client's entire customer file into a modeling sample and a validation sample. (Once the model is built using the modeling sample, we apply it to the validation sample to see how well it works on a sample other than the one on which it was built.)
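The modeling/validation split described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the original case study: the 50/50 proportion, the fixed seed, the function name, and the integer IDs standing in for customer records are all assumptions.

```python
import random


def split_customers(records, validation_fraction=0.5, seed=42):
    """Randomly partition records into (modeling, validation) samples."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(records)        # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


customer_ids = list(range(1000))    # hypothetical stand-in for the customer file
modeling, validation = split_customers(customer_ids)
```

The model is fitted on `modeling` only; `validation` is held back so that its performance estimate is not inflated by records the model has already seen.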

Securities Brokerage Case Study
The resulting CHAID model has 55 segments. The results are summarized in the following comb chart, which shows the segment indexes (indexes of average dollar value).
[Figure: comb chart of segment indexes]

Securities Brokerage Case Study
Part of the gains chart: Average Annual Brokerage Commission Dollars
- The gains chart provides quantitative detail useful for financial and marketing planning.
- We have highlighted the top 20% of the file in blue.
- The top 20% of the file is worth an average of about $334 per account, which is nearly three times the average account value for the entire sample.
[Table: gains chart excerpt]
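To make the "index versus the file average" idea concrete, here is a small sketch that builds a gains table from scored accounts. It is a simplification under stated assumptions: it bins the ranked file into equal shares rather than using CHAID segments as the case study does, the function name and the ten sample accounts are invented, and the input must contain at least as many accounts as bins.

```python
def gains_table(scored_accounts, n_bins=5):
    """Build gains rows from (predicted_score, actual_value) pairs.

    Accounts are ranked by predicted score, best first, and cut into
    n_bins equal shares of the file. An index of 100 means the bin's
    average value equals the average of the whole file.
    """
    ranked = sorted(scored_accounts, key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    overall_avg = sum(value for _, value in ranked) / total
    rows = []
    for i in range(n_bins):
        start = i * total // n_bins
        end = (i + 1) * total // n_bins
        values = [value for _, value in ranked[start:end]]
        bin_avg = sum(values) / len(values)
        rows.append({
            "bin": i + 1,
            "cum_pct_of_file": 100.0 * end / total,
            "avg_value": bin_avg,
            "index": 100.0 * bin_avg / overall_avg,
        })
    return rows


# Hypothetical accounts: (model score, commission dollars in the past 12 months).
accounts = [(10, 400), (9, 300), (8, 150), (7, 120), (6, 100),
            (5, 80), (4, 60), (3, 40), (2, 30), (1, 20)]
rows = gains_table(accounts)
```

With these made-up numbers the first row covers the top 20% of the file, and its index shows how many times the file average those accounts are worth, which is the same reading applied to the $334 figure above.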

Securities Brokerage Case Study
- Using the information in the gains chart, we can better plan our communications/promotion budget.
- In general, the best segments represent customers who are experienced, aggressive, self-directed traders.
- Other decisions that the gains chart and the segmentation rules can help us make:
  - We might wish to conduct some market research among customers in under-performing segments, or among under-performing customers in the better segments
  - We can use the segment definitions to help us identify possible issues and question areas to include in the survey
- Before we try to apply such a model, we perform a validation against a holdout sample, to confirm that it is a good model.

The future of DM applications
- Different opinions
- Very little functionality in DB systems to support DM applications
- Data mining, as a vital application, is just one more advance in the on-going research process
- Data mining will not go away
The End