
  • Number of slides: 33

KDD Cup Survey
Xinyue Liu

Outline
  • Nuts and Bolts of KDD Cup
  • KDD Cup 97-99
  • KDD Cup 2000
  • Summary

About KDD Cup
A knowledge discovery and data mining tools competition held in conjunction with the KDD conferences. It aims at:
  • Showcasing the best methods for discovering higher-level knowledge from data
  • Helping to close the gap between research and industry
  • Stimulating further KDD research and development

Statistics
  • Participation in KDD Cup grew steadily, especially requests to access the data
  • Average person-hours per submission: 204
  • Max person-hours per submission: 910
  • Commercial software grew from 44% (Cup 97) to 52% (Cup 98) to 77% (Cup 2000)

Algorithms
  • Decision trees were the most widely tried and by far the most commonly submitted algorithm

KDD Cup 97
  • A classification task: predict direct mail response in the financial services industry
  • Winners
      • Charles Elkan, a Ph.D. from UC San Diego, with his Boosted Naive Bayesian (BNB) classifier
      • Silicon Graphics, Inc. with their software MineSet
      • Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System

BNB
  • Boosting: learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor. Repeated for T rounds.
  • BNB: representationally equivalent to a multilayer perceptron with a single hidden layer.
  • Complexity: O(ef), where e is the number of examples and f is the number of attributes.
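The boosting loop described above can be sketched in a few lines. This is a generic AdaBoost-style loop wrapped around a weighted naive Bayes learner on categorical attributes, not necessarily the exact procedure in Elkan's report; the function names, the smoothing scheme, and the -1/+1 label encoding are assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_weighted_nb(X, y, w):
    """Fit naive Bayes on categorical attributes using per-example weights."""
    classes = sorted(set(y))
    n = len(X)
    prior = {c: 1e-9 for c in classes}   # tiny floor avoids log(0)
    cond = defaultdict(float)            # (attribute, value, class) -> weight mass
    for xi, yi, wi in zip(X, y, w):
        prior[yi] += wi
        for j, v in enumerate(xi):
            cond[(j, v, yi)] += wi
    values = [set(x[j] for x in X) for j in range(len(X[0]))]

    def predict(x):
        best, best_score = None, -math.inf
        for c in classes:
            score = math.log(prior[c])
            for j, v in enumerate(x):
                # Smoothing (an assumption): add a pseudo-weight of 1/n per value
                num = cond[(j, v, c)] + 1.0 / n
                den = prior[c] + len(values[j]) / n
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best

    return predict


def boost_nb(X, y, rounds=10):
    """AdaBoost-style boosting of naive Bayes; labels must be -1 or +1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []                        # list of (vote weight, classifier)
    for _ in range(rounds):
        clf = train_weighted_nb(X, y, w)
        preds = [clf(x) for x in X]
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        if err >= 0.5:
            break                        # no better than chance: stop
        perfect = err <= 1e-10
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, clf))
        if perfect:
            break                        # nothing left to re-weight
        # Misclassified examples gain weight, correct ones lose weight
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]

    def predict(x):
        vote = sum(alpha * clf(x) for alpha, clf in ensemble)
        return 1 if vote >= 0 else -1

    return predict
```

Each round makes one pass over the examples and attributes, which matches the linear dependence on e and f noted on the slide; the number of rounds multiplies that cost.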

MineSet
  • A KDD tool that combines data access, transformation, classification, and visualization.

KDD Cup 98
  • URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html
  • A classification task: analyze fund-raising mail responses to a non-profit organization
  • Winners
      • Urban Science Applications, Inc. with their software GainSmarts
      • SAS Institute, Inc. with their software Enterprise Miner
      • Quadstone Limited with their software Decisionhouse

GainSmarts
  • GainSmarts: a feature selection expert system
      • First step: used logistic regression to assign each prospect a probability of donation (Pi)
      • Second step: used linear regression to estimate a conditional donation amount for responding donors (Ai)
      • Result (<1% error): Prediction = Pi * Ai
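The two-stage prediction Pi * Ai can be illustrated with a minimal sketch. The coefficients, feature vectors, and the `mailing_cost` threshold below are hypothetical placeholders, not values from GainSmarts or the competition; only the structure (logistic stage, linear stage, product) comes from the slide.

```python
import math


def donation_probability(features, weights, bias):
    """Stage 1: logistic model giving P_i, the probability that prospect i donates."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def conditional_amount(features, weights, bias):
    """Stage 2: linear model giving A_i, the donation amount given a response."""
    return bias + sum(w * x for w, x in zip(weights, features))


def expected_donation(p_i, a_i):
    """Prediction = P_i * A_i, the expected donation from prospect i."""
    return p_i * a_i


def should_mail(p_i, a_i, mailing_cost):
    """Mail only if the expected donation exceeds the (hypothetical) cost per piece."""
    return expected_donation(p_i, a_i) > mailing_cost


# Hypothetical prospect and coefficients, for illustration only
prospect = [1.0, 0.0]
p = donation_probability(prospect, weights=[0.4, -0.2], bias=-3.0)
a = conditional_amount(prospect, weights=[5.0, 2.0], bias=10.0)
decision = should_mail(p, a, mailing_cost=0.5)
```

Separating the two stages lets the amount model be fitted on responders only, while the probability model uses every prospect.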

Enterprise Miner
  • A data mining solution that addresses the entire data mining process
  • SEMMA process
      • Sample
      • Explore
      • Modify
      • Model
      • Assess
  • Algorithms
      • Decision tree
      • Neural network
      • Regression

Decisionhouse
  • Decisionhouse: an integrated modelling software suite by Quadstone
      • Data exploration using visualization modules.
      • Uses decision trees and scorecards to model more complex tasks.
      • Chooses the final model by comparing a variety of modeling approaches and looking at the difference in predicted net profitability (lift curve).

Results
[Figure: results chart comparing GainSmarts, SAS/Enterprise Miner, and Quadstone/Decisionhouse against two reference lines]
  • Maximum Possible Profit line: $72,776 in profits with 4,873 mailed
  • Mail to Everyone solution: $10,560 in profits with 96,367 mailed
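The two reference lines are easier to compare as profit per mailed piece. The short calculation below uses only the four figures from the slide and assumes that the 96,367 pieces in the mail-to-everyone case are the full list.

```python
def profit_per_piece(total_profit, pieces_mailed):
    """Average profit in dollars for each piece of mail sent."""
    return total_profit / pieces_mailed


# Figures taken from the slide
best_case = profit_per_piece(72_776, 4_873)    # maximum possible profit line
mail_all = profit_per_piece(10_560, 96_367)    # mail-to-everyone solution

# Share of the full list that the ideal selection would mail (assumes 96,367 is the full list)
fraction_mailed = 4_873 / 96_367

print(f"best case:     ${best_case:.2f} per piece")
print(f"mail everyone: ${mail_all:.2f} per piece")
print(f"ideal mailing covers {fraction_mailed:.1%} of the list")
```

A perfect selection would earn roughly $14.93 per piece while mailing about 5% of the list, against about $0.11 per piece when mailing everyone, which is why ranking prospects matters so much in this task.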

KDD Cup 99
  • URL: www.cse.ucsd.edu/users/elkan/kdresults.html
  • Problem: same data set as KDD Cup 98
  • Winners
      • SAS Institute Inc. with their software Enterprise Miner
      • Amdocs with their Information Analysis Environment

Software
  • SAS: used a two-stage model that includes two multi-layer perceptron (MLP) neural network models.
  • Amdocs: used its own Information Analysis Environment, which allows modeling of the value and class membership simultaneously. The algorithm used is a hybrid logistic regression model.

KDD Cup 2000
  • URL: www.ecn.purdue.edu/KDDCUP/
  • Sponsored by Purdue University and Blue Martini Software

Data Set
  • Data collected from Gazelle.com, a legwear and legcare web retailer
  • Pre-processed
      • Training set: 2 months
      • Test sets: one month
  • Data collected includes:
      • Click streams
      • Order information
      • Registration form

Problems
  • The goal: design models to support web-site personalization and to improve the profitability of the site by increasing customer response.
  • Questions
      1. Given a set of page views, will the visitor view another page on the site or leave?
      2. Which product brand will the visitor view in the remainder of the session?
      3. Characterize heavy spenders
      4. Characterize killer pages
      5. Characterize which product brand a visitor will view in the remainder of the session

Evaluation
  • Accuracy/score was measured for the two questions with test sets
  • Insight questions were judged with the help of retail experts from Gazelle and Blue Martini
      • Created a list of insights from all participants
      • Each insight was given a weight
      • Each participant was scored on all insights
  • Additional factors:
      • Presentation quality
      • Correctness
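One way to read the insight scoring is as a weighted sum over a shared list of insights. The slides do not give the actual formula, so the sketch below is only an assumption; the insight names, weights, and per-participant marks are invented for illustration.

```python
def insight_score(weights, marks):
    """Weighted sum over all insights; an insight the participant missed counts as 0.

    weights: insight name -> weight assigned by the judges (hypothetical values)
    marks:   insight name -> credit in [0, 1] earned by one participant
    """
    return sum(w * marks.get(name, 0.0) for name, w in weights.items())


# Hypothetical master list of insights and their weights
weights = {"bots inflate sessions": 3.0, "slow pages lose visitors": 2.0, "referrer matters": 1.0}

# Hypothetical marks for two participants
team_a = {"bots inflate sessions": 1.0, "referrer matters": 0.5}
team_b = {"slow pages lose visitors": 1.0}

scores = {"team_a": insight_score(weights, team_a), "team_b": insight_score(weights, team_b)}
```

Scoring every participant against the same master list rewards teams that found many of the important insights, not only those they reported most prominently.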

The Winners
  • Questions 1 & 5 winner: Amdocs
  • Questions 2 & 3 winner: Salford Systems
  • Question 4 winner: e-steam (poster)

Software (Amdocs)
  • Exploratory data analysis: SAS
  • Classification tree: Amdocs Business Insight Tool
      • Decision tree
      • Rules extraction
      • Modeling
      • Combining models

Scheme
[Figure: scheme diagram]

Main Model
[Figure: model pipeline diagram]
  • Decision Tree: 5 trees built on 34000 cases
  • Rule Generator: 1466 rules, 111 continue rules
  • Sub-models: Best Rule, Hybrid Model, Merged Rules

Sub-models
  • Best rule: chooses the most accurate rule satisfied by each record
  • Hybrid Model: logistic regression on rule set + raw field values combine to define the score for each record
  • Merged Rules: logistic regression on the rule set defines the score for each record as a combination of the rules the record satisfies
Each model captures a different aspect of the overall behavior in the data. Combining or ensembling the models provides the best prediction results.
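The three sub-models can be sketched as scoring functions over a shared rule set. This is an illustrative reading of the slide, not Amdocs' implementation: the `Rule` structure, the coefficient arguments, and in particular the simple averaging in `ensemble_score` are assumptions, since the slides do not say how the models were combined.

```python
import math
from typing import Callable, Dict, List, NamedTuple


class Rule(NamedTuple):
    condition: Callable[[Dict], bool]  # does the record satisfy the rule?
    accuracy: float                    # accuracy of the rule on training data
    p_continue: float                  # score the rule assigns (e.g. P(visitor continues))


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def best_rule_score(record: Dict, rules: List[Rule], default: float = 0.5) -> float:
    """Best rule: use the most accurate rule that the record satisfies."""
    matching = [r for r in rules if r.condition(record)]
    if not matching:
        return default
    return max(matching, key=lambda r: r.accuracy).p_continue


def merged_rules_score(record: Dict, rules: List[Rule], coefs: List[float],
                       intercept: float) -> float:
    """Merged rules: logistic regression over indicators of the satisfied rules."""
    z = intercept + sum(c for r, c in zip(rules, coefs) if r.condition(record))
    return _sigmoid(z)


def hybrid_score(record: Dict, rules: List[Rule], rule_coefs: List[float],
                 field_coefs: Dict[str, float], intercept: float) -> float:
    """Hybrid: logistic regression over rule indicators plus raw field values."""
    z = intercept + sum(c for r, c in zip(rules, rule_coefs) if r.condition(record))
    z += sum(coef * record[name] for name, coef in field_coefs.items())
    return _sigmoid(z)


def ensemble_score(scores: List[float]) -> float:
    """Combine sub-model scores by plain averaging (an assumed combination rule)."""
    return sum(scores) / len(scores)
```

In practice the logistic coefficients would be fitted on training data; they are passed in here so the sketch stays self-contained.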

Software (Salford)
  • CART: a decision tree tool that automatically searches for and isolates significant patterns and relationships
  • MARS: a multivariate non-parametric regression procedure
  • HotSpotDetector
  • TreeNet

CART
  • Binary recursive partitioning.
  • Key elements:
      • Splitting rules
          • Brute-force search of all possible splits for all variables
          • Rank each splitting rule on the basis of a quality-of-split criterion (default: Gini)
      • Recursion: split until further splitting is impossible or stopped.
      • Class assignment
          • Plurality rule
          • Assign a class to every node, whether it is terminal or not.
      • Pruning trees: growth does not stop in the middle; the tree is grown out and then pruned back
      • Testing: the best sub-tree is the one with the lowest error
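The splitting-rule and class-assignment steps can be sketched as follows. This is a minimal illustration of a brute-force Gini split search on numeric variables, not Salford's CART implementation; the function names and the `<=` threshold convention are choices made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Brute-force search over every variable and every observed threshold.

    Returns (variable index, threshold, weighted Gini of the two children),
    or None when no split separates the rows.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted(set(r[j] for r in rows)):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


def plurality_class(labels):
    """Class assignment by the plurality rule: the most frequent label in the node."""
    return Counter(labels).most_common(1)[0][0]
```

Recursion would apply `best_split` to each child in turn until no split is possible, after which pruning and testing select the sub-tree with the lowest error.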

MARS
  • Automatic variable search
  • Automatic variable transformation
  • Automatic limited interaction searches
  • Variable nesting
  • Built-in testing regimens
  • Model selection parameters
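The "automatic variable transformation" and "limited interaction" features rest on hinge basis functions, which is how MARS models are generally built. The sketch below only evaluates an already-fitted model of that form; it does not perform the search itself, and the knots and coefficients are hypothetical.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot) for direction=1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))


def mars_predict(x, intercept, terms):
    """Evaluate a MARS-style model.

    terms is a list of (coefficient, factors), where factors is a list of
    (variable index, knot, direction). A term with several factors is a
    product of hinges, i.e. an interaction between variables.
    """
    total = intercept
    for coef, factors in terms:
        basis = 1.0
        for var, knot, direction in factors:
            basis *= hinge(x[var], knot, direction)
        total += coef * basis
    return total


# Hypothetical fitted model: one main effect and one two-way interaction
terms = [
    (2.0, [(0, 3.0, 1)]),               # 2.0 * max(0, x0 - 3)
    (0.5, [(0, 3.0, 1), (1, 1.0, 1)]),  # 0.5 * max(0, x0 - 3) * max(0, x1 - 1)
]
prediction = mars_predict([5.0, 2.0], intercept=1.0, terms=terms)
```

Because each hinge is zero on one side of its knot, a sum of such terms gives a piecewise-linear fit without the analyst choosing transformations by hand.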

Insights (Heavy Spenders)
  • Some of the good insights
      • Referrers: establish ad policy based on conversion rates, not click-throughs
      • Not an AOL user (browser window too small for layout)
      • Referring site traffic changed dramatically over time
      • Came to site from print ad or news, not friends & families
      • Very high and very low income
      • Geographic: Northeast U.S. states
      • Repeat visitors

Insights (Who leaves?)
  • Some of the good insights
      • Crawlers and bots accounted for 16% of sessions
      • Long processing time (> 12 seconds) implies high abandonment
      • Referring sites: visitors from mycoupons have long sessions, while visitors from shopnow.com are prone to exit quickly
      • Returning visitors' probability of continuing is double
      • Views of specific products (Oroblue, Levante) cause abandonment
      • Probability of leaving decreases with page views
      • Free Gift and Welcome templates on the first three pages encouraged visitors to stay at the site

Insights (Brand view)
  • Some good insights
      • Referrer URL is a great predictor:
          • Fashionmall.com and winnie-cooper are referrers for Hanes and Donna Karan
          • mycoupons.com, tripod, and deal-finder are referrers for American Essentials
      • Previous views of a product imply later views

Summary
  • Data mining requires background knowledge and access to business users
  • Successful data mining solutions combine automated and manual analysis, integrating the power of the machine with expert knowledge and human insight
  • Web mining is challenging: crawlers/bots, frequent site changes, etc.
  • KDD Cup is an excellent source for learning state-of-the-art KDD techniques
  • KDD Cup data is available for research and education
