  • Number of slides: 35

KDD Cup 2009: Fast Scoring on a Large Database
Presentation of the Results at the KDD Cup Workshop, June 28, 2009
The Organizing Team

KDD Cup 2009 Organizing Team
Project team at Orange Labs R&D:
• Vincent Lemaire
• Marc Boullé
• Fabrice Clérot
• Raphaël Féraud
• Aurélie Le Cam
• Pascal Gouzien
Beta testing and proceedings editor:
• Gideon Dror
Web site design:
• Olivier Guyon (MisterP.net, France)
Coordination (KDD cup co-chairs):
• Isabelle Guyon
• David Vogel

Thanks to our sponsors…
§ Orange
§ ACM SIGKDD
§ Pascal
§ Unipen
§ Google
§ Health Discovery Corp
§ Clopinet
§ Data Mining Solutions
§ MPS

Record KDD Cup Participation

Year   # Teams
1997   45
1998   57
1999   24
2000   31
2001   136
2002   18
2003   57
2004   102
2005   37
2006   68
2007   95
2008   128
2009   453

Participation Statistics
§ 1299 registered teams
§ 7865 entries
§ 46 countries: Argentina, Australia, Austria, Belgium, Brazil, Bulgaria, Canada, Chile, China, Fiji, Finland, France, Germany, Greece, Hong Kong, Hungary, India, Iran, Ireland, Israel, Italy, Japan, Jordan, Latvia, Malaysia, Mexico, Netherlands, New Zealand, Pakistan, Portugal, Romania, Russian Federation, Singapore, Slovak Republic, Slovenia, South Africa, South Korea, Spain, Sweden, Switzerland, Taiwan, Turkey, Uganda, United Kingdom, United States, Uruguay

A worldwide operator
§ One of the main telecommunication operators in the world
§ Providing services to more than 170 million customers over five continents
§ Including 120 million under the Orange brand

KDD Cup 2009 organized by Orange
Customer Relationship Management (CRM)
§ Three marketing tasks: predict the propensity of customers
– to switch provider: Churn
– to buy new products or services: Appetency
– to buy upgrades or new options proposed to them: Up-selling
§ Objective: improve the return on investment (ROI) of marketing campaigns
– Increase the efficiency of the campaign for a given campaign cost
– Decrease the campaign cost for a given marketing objective
§ Better prediction leads to better ROI (see the sketch below)
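
To make the ROI logic concrete, here is a minimal sketch (not Orange's system; the model, budget, and cost names are invented for illustration) of how propensity scores drive campaign targeting: rank customers by predicted score and contact only as many of the top-ranked ones as the budget allows.

```python
# Hypothetical illustration: turning propensity scores into a campaign plan.
# `model` is any fitted classifier with predict_proba; all names are invented.
import numpy as np

def select_campaign_targets(model, X, budget, cost_per_contact):
    """Rank customers by predicted propensity and keep the affordable top."""
    scores = model.predict_proba(X)[:, 1]          # estimated P(churn/buy) per customer
    n_contacts = int(budget // cost_per_contact)   # contacts the budget can pay for
    ranked = np.argsort(scores)[::-1]              # highest propensity first
    return ranked[:n_contacts]                     # indices of customers to contact
```

A better model concentrates true positives at the top of the ranking, so the same budget reaches more responsive customers.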

Data, constraints and requirements
§ Input data
– Relational databases
– Numerical or categorical
– Noisy
– Missing values
– Heavily unbalanced distribution
§ Train data
– Hundreds of thousands of instances
– Tens of thousands of variables
§ Deployment
– Tens of millions of instances
§ Model requirements
– Robust
– Accurate
– Understandable
§ Train and deploy requirements
– About one hundred models per month
– Fast data preparation and modeling
– Fast deployment
§ Business requirement
– Return on investment for the whole process

In-house system: from raw data to scoring models
§ Data warehouse
– Relational database
§ Data mart
– Star schema
– Data feeding: customers, services, products, call details, …
§ Feature construction (PAC technology)
– Generates tens of thousands of variables (see the sketch below)
§ Data preparation and modeling (Khiops technology)
– Produces the scoring model
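
PAC itself is proprietary, but the general idea of turning relational tables into a tabular representation can be sketched with ordinary aggregates; the schema and column names below are invented for illustration.

```python
# Generic sketch of relational-to-tabular feature construction (not PAC itself).
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
calls = pd.DataFrame({"customer_id": [1, 1, 2],
                      "duration":    [120, 300, 45]})

# One aggregate per (secondary table, column, statistic): on a real schema this
# combinatorial construction quickly yields tens of thousands of variables.
feats = calls.groupby("customer_id")["duration"].agg(["count", "sum", "mean", "max"])
feats.columns = [f"calls_duration_{c}" for c in feats.columns]
tabular = customers.merge(feats, left_on="customer_id", right_index=True, how="left")
print(tabular)
```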

Design of the challenge
§ Orange business objective
– Benchmark the in-house system against state-of-the-art techniques
§ Data
– Data warehouse or data store: not an option
– Confidentiality and scalability issues
– Relational data requires domain knowledge and specialized skills
– Tabular format
– Standard format for the data mining community
– Domain knowledge incorporated using feature construction (PAC)
– Easy anonymization
§ Tasks
– Three representative marketing tasks
§ Requirements
– Fast data preparation and modeling (fully automatic)
– Accurate
– Fast deployment
– Robust
– Understandable

Data sets extraction and preparation
§ Input data
– 10 relational tables
– A few hundreds of fields
– One million customers
§ Instance selection
– Resampling given the three marketing tasks
– Keep 100 000 instances, with less unbalanced target distributions
§ Variable construction
– Using PAC technology, 20 000 constructed variables to get a tabular representation
– Keep 15 000 variables (discard constant variables)
– Small track: subset of 230 variables related to classical domain knowledge
§ Anonymization (see the sketch below)
– Discard variable names, discard identifiers
– Randomize order of variables
– Rescale each numerical variable by a random factor
– Recode each categorical variable using random category names
§ Data samples
– 50 000 train and test instances sampled randomly
– 5 000 validation instances sampled randomly from the test set
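
The anonymization steps above can be summarized in a short sketch; this assumes a pandas DataFrame and only illustrates the listed steps, not the organizers' actual script.

```python
# Illustrative anonymization: drop names/ids, shuffle columns, rescale numerics,
# recode categories with opaque labels. Not the organizers' actual code.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    df = df[rng.permutation(df.columns)]                 # randomize variable order
    out = {}
    for i, col in enumerate(df.columns):                 # VarN replaces real names
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            out[f"Var{i}"] = s * rng.uniform(0.5, 2.0)   # random rescaling factor
        else:
            codes = {c: f"cat{j}" for j, c in enumerate(s.dropna().unique())}
            out[f"Var{i}"] = s.map(codes)                # opaque category names
    return pd.DataFrame(out)
```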

Scientific and technical challenge
§ Scientific objective
– Fast data preparation and modeling: within five days
– Large scale: 50 000 train and test instances, 15 000 variables
– Heterogeneous data
– Numerical with missing values
– Categorical with hundreds of values
– Heavily unbalanced distribution
§ KDD social meeting objective
– Attract as many participants as possible
– Additional small track and slow track
– Online feedback on the validation dataset
– Toy problem (only one informative input variable)
– Leverage challenge protocol overhead
– One month to explore descriptive data and test the submission protocol
– Attractive conditions
– No intellectual property conditions
– Money prizes

Business impact of the challenge
§ Bring Orange datasets to the data mining community
– Benefit for the community: access to challenging data
– Benefit for Orange: benchmark of numerous competing techniques; drive the research efforts towards Orange needs
§ Evaluate the Orange in-house system
– High number of participants and high quality of the results
– Orange in-house results:
– Improved by a significant margin when leveraging all business requirements
– Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness, and understandability)
– Need to study the best challenge methods to get more insights

KDD Cup 2009: Result Analysis
[Figure legend: Best Result (period considered in the figure); In-House System (downloadable: www.khiops.com); Baseline (Naïve Bayes)]

Overall – Test AUC – Fast: good results very quickly
[Figure: Best Results (on each dataset) vs. Submissions]

Overall – Test AUC – Fast: good results very quickly
[Figure: Best Results (on each dataset) vs. Submissions]
In-house (Orange) system:
• No parameters
• On 1 standard laptop (single processor)
• When the three tasks are dealt with as 3 different problems

Overall – Test AUC – Fast: very fast good results
Small improvement after the first day (83.85 → 84.93)

Overall – Test AUC – Slow: very small improvement after the 5th day (84.93 → 85.2)
Improvement due to unscrambling?

Overall – Test AUC – Submissions
23.24% of the submissions (>0.5) < Baseline
84.75% of the submissions (>0.5) < In-House
15.25% of the submissions (>0.5) > In-House
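
All of these comparisons use the area under the ROC curve (AUC), the challenge metric; here is a minimal scoring sketch with scikit-learn (the labels and scores are made up):

```python
# Scoring a submission on one task with the challenge metric (AUC).
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 0, 1, 1]                 # hidden test labels for one task
y_score = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]    # submitted scores (any monotonic scale)

print(roc_auc_score(y_true, y_score))        # 0.5 = random ranking, 1.0 = perfect
```

Since AUC depends only on the ranking of the scores, submissions need not output calibrated probabilities.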

Overall – Test AUC: 'Correlation' Test / Valid

Overall – Test AUC: 'Correlation' Test / Train?
[Figure annotations: Random values submitted; Boosting method or train target submitted; Overfitting]

Overall – Test AUC
[Figure panels: Test AUC at 12 hours, 24 hours, 5 days, 36 days]

Overall – Test AUC
Difference between:
• best result at the end of the first day and
• best result at the end of the 36 days
= 1.35%
[Figure panels: Test AUC at 12 hours; Test AUC at 36 days]
Possible uses of the extra time:
• time to adjust model parameters?
• time to train ensemble methods?
• time to find more processors?
• time to test more methods?
• time to unscramble?
• …

Test AUC = f(time)
[Figure panels: Churn – Test AUC – day [0:36] (harder?); Appetency – Test AUC – day [0:36]; Up-selling – Test AUC – day [0:36] (easier?)]

Test AUC = f(time)
Difference between the best result at the end of the first day and the best result at the end of the 36 days:
• Churn – Test AUC – day [0:36]: = 1.84% (harder?)
• Appetency – Test AUC – day [0:36]: = 1.38%
• Up-selling – Test AUC – day [0:36]: = 0.11% (easier?)

Correlation Test AUC / Valid AUC (5 days)
[Figure panels: Churn – Test/Valid – day [0:5] (harder?); Appetency – Test/Valid – day [0:5]; Up-selling – Test/Valid – day [0:5] (easier?)]

Correlation Test AUC / Train AUC (36 days)
[Figure panels: Churn – Test/Train – day [0:36]; Appetency – Test/Train – day [0:36]; Up-selling – Test/Train – day [0:36]]
Difficult to conclude anything…

Histogram Test AUC / Valid AUC ([0:5] or ]5:36] days)
[Figure panels: Churn – Test AUC – day [0:36]; Appetency – Test AUC – day [0:36]; Up-selling – Test AUC – day [0:36]]
Does knowledge (parameters?) found during the first 5 days help afterwards?

Histogram Test AUC / Valid AUC ([0:5] or ]5:36] days)
[Figure panels, first row: Churn – Test AUC – day [0:36]; Appetency – Test AUC – day [0:36]; Up-selling – Test AUC – day [0:36]]
Does knowledge (parameters?) found during the first 5 days help afterwards?
[Figure panels, second row: Churn – Test AUC – day ]5:36]; Appetency – Test AUC – day ]5:36]; Up-selling – Test AUC – day ]5:36]]
YES!

Fact Sheets: Preprocessing & Feature Selection
PREPROCESSING (overall usage = 95%), by percent of participants:
• Replacement of the missing values
• Discretization
• Normalizations
• Grouping modalities
• Other preprocessing
• Principal Component Analysis
FEATURE SELECTION (overall usage = 85%), by percent of participants:
• Feature ranking
• Filter method
• Other FS
• Forward / backward wrapper
• Embedded method
• Wrapper with search
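
As an illustration of the two most reported preprocessing steps (missing-value replacement and discretization), here is a minimal scikit-learn sketch; the data and parameters are invented, not any team's settings.

```python
# Missing-value imputation followed by quantile discretization.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.0, np.nan],
              [2.0, 10.0],
              [np.nan, 12.0],
              [4.0, 11.0]])

X_imp = SimpleImputer(strategy="mean").fit_transform(X)   # fill missing values
X_bin = KBinsDiscretizer(n_bins=3, encode="ordinal",
                         strategy="quantile").fit_transform(X_imp)
print(X_bin)                                              # bin index per value
```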

Fact Sheets: Classifier
CLASSIFIER (overall usage = 93%), by percent of participants:
• Decision tree
• Linear classifier
• Non-linear kernel
• Other classifiers
• Neural network
• Naïve Bayes
• Nearest neighbors
• Bayesian network
• Bayesian neural network
Notes:
– About 30% logistic loss, >15% exp loss, >15% squared loss, ~10% hinge loss.
– Less than 50% regularization (20% 2-norm, 10% 1-norm).
– Only 13% used unlabeled data.

Fact Sheets: Model Selection
MODEL SELECTION (overall usage = 90%), by percent of participants:
• 10% test
• K-fold or leave-one-out
• Out-of-bag estimation
• Bootstrap estimation
• Other model selection
• Other cross-validation
• Virtual leave-one-out
• Penalty-based
• Bi-level
• Bayesian
Notes:
– About 75% ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).
– About 10% used unscrambling.
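
Since K-fold validation and ensembles dominate the fact sheets, here is a minimal sketch combining both with scikit-learn (synthetic data and invented settings, not any team's pipeline):

```python
# Bagged decision trees evaluated by 5-fold cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean())          # average AUC over the 5 folds
```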

Fact Sheets: Implementation
Memory: <= 2 GB / <= 8 GB / > 8 GB / >= 32 GB
Parallelism: run in parallel / multi-processor / none
Software platform: Java, Matlab, C, C++, Other (R, SAS)
Operating system: Windows, Linux, Unix, Mac OS

Winning methods
Fast track:
– IBM Research, USA (+): Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)
– ID Analytics, Inc., USA (+): Filter + wrapper feature selection. TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used.
– David Slate & Peter Frey, USA: Grouping of modalities / discretization, filter feature selection, ensemble of decision trees.
Slow track:
– University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.
– Financial Engineering Group, Inc., Japan: Grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting.
– National Taiwan University (+): Average of 3 classifiers: (1) solve the joint multiclass problem with an L1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learners; (3) selective Naïve Bayes.
(+: small dataset unscrambling)
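
Most winners relied on some form of boosted decision trees; as a generic open-source stand-in (not the winners' actual code, data, or settings), here is a gradient-boosting sketch on a synthetic unbalanced task:

```python
# Gradient-boosted trees with shrinkage on a heavily unbalanced binary task.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30,
                           weights=[0.93], random_state=0)  # ~7% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,  # shrinkage
                                 max_depth=3, random_state=0).fit(X_tr, y_tr)
print(roc_auc_score(y_te, gbt.predict_proba(X_te)[:, 1]))
```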

Conclusion
§ Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, which offered:
– A problem of real industrial interest with challenging scientific and technical aspects
– Prizes
§ Lessons learned:
– Do not under-estimate the participants: five days were given for the fast challenge, but a few hours sufficed for some participants.
– Ensemble methods are effective.
– Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed types of variables, and lots of missing values.