
c0e6b645c2f2f2fb00d3a971efcaac4b.ppt
- Number of slides: 45
DATA SCIENCE APPLICATIONS @DATAMININGAPPS
Prof. dr. Bart Baesens, Dr. Seppe vanden Broucke
Department of Decision Sciences and Information Management, KU Leuven (Belgium)
School of Management, University of Southampton (United Kingdom)
{Bart.Baesens; Seppe.vandenBroucke}@kuleuven.be
Twitter/Facebook/YouTube: DataMiningApps
www.dataminingapps.com
The Analytics Process Model
Baesens (2014), Analytics in a big data world: The essential guide to data science and its applications
Team members
• Database/data warehouse administrator
• Business expert (e.g. marketeer, credit risk analyst, …)
• Legal expert
• Data scientist/data miner
• Software/tool vendors
A multidisciplinary team needs to be set up!
Data Scientist
• A data scientist should be a good programmer!
• A data scientist should have solid quantitative skills!
• A data scientist should excel in communication and visualization skills!
• A data scientist should have a solid business understanding!
• A data scientist should be creative!
Baesens, Weber, Bravo, vanden Broucke (2015), Hiring Data Scientists: what to look for!
Analytics
• Term often used interchangeably with data mining, knowledge discovery, predictive/descriptive modeling, …
• Essentially refers to extracting useful business patterns and/or mathematical decision models from a preprocessed data set
• Predictive analytics
  – Predict the future based on patterns learnt from past data
  – Classification (churn, response) versus regression (CLV)
• Descriptive analytics
  – Describe patterns in data
  – Clustering, association rules, sequence rules
Analytic Model requirements
• Business relevance
  – Solve a particular business problem
• Statistical performance
  – Statistical significance of model
  – Statistical prediction performance
• Interpretability + justifiability
  – Very subjective (depends on decision maker), but CRUCIAL!
  – Often needs to be balanced against statistical performance
• Operational efficiency
  – How can the analytical models be integrated with campaign management?
• Economic cost
  – What is the cost to gather the model inputs and evaluate the model?
  – Is it worthwhile buying external data and/or models?
• Regulatory compliance
  – In accordance with regulation and legislation
Baesens et al. (2003), Using neural network rule extraction and decision tables for credit-risk evaluation
Verbraken, Verbeke, Baesens (2013), A novel profit maximizing metric for measuring classification performance of customer churn prediction models
Post processing
• Interpretation and validation of analytical models by business experts
  – Trivial versus unexpected (interesting?) patterns
• Sensitivity analysis
  – How sensitive is the model with respect to sample characteristics, assumptions, etc.?
• Deploy analytical model into business setting
  – Represent model output in a user-friendly way
  – Integrate with campaign management tools and marketing decision engines
• Model monitoring and backtesting
  – Continuously monitor model output
  – Contrast model output with observed numbers
Castermans, Martens, Van Gestel, Hamers, Baesens (2010), An overview and framework for PD backtesting and benchmarking
Applications
• Credit scoring
• Market basket analysis/recommender systems
• Retention modeling/churn prediction
• Response modeling
• On-line analytics
• Social media analytics
• Social network analytics
• Fraud analytics
• HR analytics
• Process analytics
Credit Scoring
• Estimate probability of default at the time the applicant applies for the loan!
• Use predetermined definition of default (e.g. 3 months of payment arrears)
• Use application variables
  – E.g. age, income, marital status, years at address, years with employer, …
• Use bureau variables
  – Bureau score, raw bureau data (e.g. number of credit checks, total amount of credits, delinquency history, …)
  – In the US: FICO scores between 300 and 850
  – Experian, Equifax, TransUnion
  – E.g., Baycorp Advantage (Australia & New Zealand), Schufa (Germany), BKR (the Netherlands), CKP (Belgium), Dun & Bradstreet
Van Gestel, Baesens (2009), Credit Risk Management: Basic Concepts.
Example Credit Scorecard

Characteristic Name   Attribute      Scorecard Points
AGE 1                 Up to 26       100
AGE 2                 26 - 35        120
AGE 3                 35 - 37        185
AGE 4                 37+            225
GENDER 1              Male           90
GENDER 2              Female         180
SALARY 1              Up to 500      120
SALARY 2              501 - 1000     140
SALARY 3              1001 - 1500    160
SALARY 4              1501 - 2000    200
SALARY 5              2001+          240

Let cut-off = 500. So, a new customer applies for credit:

AGE      32        120 points
GENDER   Female    180 points
SALARY   $1,150    160 points
Total              460 points   REFUSE CREDIT

Baesens et al. (2003), Benchmarking state-of-the-art classification algorithms for credit scoring
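The scoring step in this example can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the slide. Two details are assumptions, because the slide leaves them open: each bin's upper bound is treated as inclusive (so age 26 falls in "Up to 26"), and a total equal to the cut-off is accepted.

```python
from bisect import bisect_left

# Upper bin edges and points, transcribed from the example scorecard.
AGE_EDGES, AGE_POINTS = [26, 35, 37], [100, 120, 185, 225]
SALARY_EDGES, SALARY_POINTS = [500, 1000, 1500, 2000], [120, 140, 160, 200, 240]
GENDER_POINTS = {"Male": 90, "Female": 180}
CUTOFF = 500


def binned_points(value, edges, points):
    # bisect_left treats each edge as an inclusive upper bound (assumption).
    return points[bisect_left(edges, value)]


def score(age, gender, salary):
    """Sum the scorecard points over the three characteristics."""
    return (binned_points(age, AGE_EDGES, AGE_POINTS)
            + GENDER_POINTS[gender]
            + binned_points(salary, SALARY_EDGES, SALARY_POINTS))


def decide(total, cutoff=CUTOFF):
    # Accepting at exactly the cut-off is an assumption, not stated on the slide.
    return "ACCEPT" if total >= cutoff else "REFUSE"


total = score(age=32, gender="Female", salary=1150)
print(total, decide(total))  # 460 REFUSE
```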
Association rules
• Purpose
  – Detect frequently occurring patterns between items
• How?
  – Unsupervised data mining (no real target to optimise)
  – Deriving association rules
• Example applications
  – Which products/services are frequently bought together?
  – Which web pages are frequently visited together?
  – Which terms often co-occur in a text document?
Association rules
• Notation:
  – $D$: database of transactions $t_p$
  – each transaction $t_p$ consists of a transaction ID and a set of items $\{i_1, i_2, \ldots, i_n\}$ selected from all possible items $I$
• An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$
  – $X$: rule antecedent, $Y$: rule consequent
• Example:
  – If a customer has a car loan and car insurance, then the customer has a checking account in 80% of the cases
  – If a customer buys spaghetti, then the customer buys red wine in 70% of the cases
Example transactions database

Transaction   Items
1             stella, hoegaarden, diapers, baby food
2             coke, stella, diapers
3             cigarettes, diapers, baby food
4             chocolates, diapers, hoegaarden, apples
5             tomatoes, water, leffe, stella
6             spaghetti, diapers, baby food, stella
7             water, stella, baby food
8             diapers, baby food, spaghetti
9             baby food, stella, diapers, hoegaarden
10            apples, chimay, baby food
Association Rules: Support and Confidence
• The support of an itemset is the percentage of total transactions in the database that contain the itemset.
• The rule $X \Rightarrow Y$ has support $s$ if $100s\%$ of the transactions in $D$ contain $X \cup Y$.
• A frequent itemset is an itemset for which the support is higher than a prespecified threshold (minsup).
• The rule $X \Rightarrow Y$ has confidence $c$ if $100c\%$ of the transactions in $D$ that contain $X$ also contain $Y$.
Associations: Support and Confidence

Transaction   Items
1             stella, hoegaarden, diapers, baby food
2             coke, stella, diapers
3             cigarettes, diapers, baby food
4             chocolates, diapers, hoegaarden, apples
5             tomatoes, water, leffe, stella
6             spaghetti, diapers, baby food, stella
7             water, stella, baby food
8             diapers, baby food, spaghetti
9             baby food, stella, diapers, hoegaarden
10            apples, chimay, baby food

E.g. itemset {baby food, diapers, stella} has support = 3/10 or 30%
Association rule: baby food, diapers $\Rightarrow$ stella has confidence of 3/5 or 60%
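The two numbers in this example can be checked with a short Python sketch. The function names are illustrative only. The transactions are copied from the table, and support and confidence are computed by counting, following the definitions given earlier.

```python
# Transactions copied from the example database.
TRANSACTIONS = [
    {"stella", "hoegaarden", "diapers", "baby food"},
    {"coke", "stella", "diapers"},
    {"cigarettes", "diapers", "baby food"},
    {"chocolates", "diapers", "hoegaarden", "apples"},
    {"tomatoes", "water", "leffe", "stella"},
    {"spaghetti", "diapers", "baby food", "stella"},
    {"water", "stella", "baby food"},
    {"diapers", "baby food", "spaghetti"},
    {"baby food", "stella", "diapers", "hoegaarden"},
    {"apples", "chimay", "baby food"},
]


def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions=TRANSACTIONS):
    """Share of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Share of transactions containing X that also contain Y."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


print(support({"baby food", "diapers", "stella"}))       # 0.3
print(confidence({"baby food", "diapers"}, {"stella"}))  # 0.6
```

Re-running `support` and `confidence` while varying the minsup and minconf thresholds is one simple way to carry out the sensitivity analysis that post-processing calls for.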
Association rule discovery
• Often lots of association rules will be discovered
• Post-processing is a necessity
• Perform sensitivity analysis using minsup and minconf thresholds
• Trivial rules, e.g., buy spaghetti and spaghetti sauce
• Unexpected/unknown rules
• Novel and actionable patterns, potentially interesting!
• Appropriate visualisation facilities are crucial!
Baesens et al. (2000), Post-processing of association rules
Market Basket Analysis
baby food, diapers $\Rightarrow$ stella
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package baby food, diapers and stella with a poorly selling item.
4. Raise the price on one, and lower it on the other.
5. Do not advertise baby food, diapers and stella together.
Recommender Systems
• Help people make decisions by giving them recommendations.
• Recommendations are based on preferences of individuals/groups.
• Examples
  – In e-Business, recommend items.
  – In e-Learning, recommend content.
  – In search and navigation, recommend links.
• Netflix competition
  – Predict whether someone will enjoy a movie based on how much they liked or disliked other movies.
• Amazon, eBay, …
Seret, Verbraken, Versailles, Baesens (2012), A new SOM-based method for profile generation: Theory and an application in direct marketing
Example: Recommender Systems
Retention modeling/Churn prediction
• Understanding why customers leave you
• Customer retention is important because long-term loyal customers are less price sensitive, cost less to serve and have a higher lifetime value
• Small improvements in customer retention generate significant returns.
• Very important in the Telco sector (about 2% monthly churn rate)
• Transaction versus relationship buyers
  – Transaction buyers: buy because of low price
  – Relationship buyers: want to build a loyal relationship with the firm
Glady, Baesens, Croux (2009), Modeling churn using customer lifetime value
Defining churn
• Contractual versus non-contractual setting
  – Contractual setting: customer cancels contract (e.g. postpaid Telco)
  – Non-contractual setting: customer hasn’t purchased any products or services during previous 3 months (e.g. online retailer)
• Types of churn
  – Active: customer stops relationship with firm
  – Passive: customer decreases intensity of relationship
  – Forced: company stops relationship because of e.g. fraud
  – Expected: customer no longer needs product/service (e.g. baby products)
Baesens et al. (2002), Bayesian neural network learning for repeat purchase modelling in direct marketing
Churn prediction: types of predictors
• Demographic data
  – E.g., age, gender, marital status
• Relationship variables
  – E.g. length of relationship, number of products purchased, …
• Product/service usage data
  – E.g., number of transactions in previous month, trend in usage, …
• Complaints data
  – Number of filed complaints, service desk contacted, …
• RFM data
• (Social) network information (cf. infra)
RFM Framework
• Popular since Cullinan (1977)
• Recency: number of months since last purchase
• Frequency: number of purchases within a given time frame
• Monetary: dollar value of purchases
• Different operationalisations of RFM variables
  – E.g., Monetary: average/maximum/total dollar value?
  – Trend variables
• Can only be measured for existing customers, not for prospects (e.g. response modeling)
• Often used to build a segmentation scheme or combined into a single RFM score
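One possible operationalisation of the three RFM variables is sketched below in Python. The input layout (a list of `(purchase_date, amount)` tuples per customer), the function name and the example figures are all hypothetical. Recency is counted in whole calendar months, and the monetary value is reported as total, average and maximum, since the slide leaves that choice open.

```python
from datetime import date


def rfm(purchases, as_of, window_start):
    """Compute RFM variables for one customer.

    purchases: list of (purchase_date, amount) tuples (hypothetical layout).
    as_of: date at which the variables are measured.
    window_start: start of the time frame used for frequency and monetary.
    """
    past = [(d, a) for d, a in purchases if d <= as_of]
    if not past:
        # RFM needs a purchase history, so it is undefined for prospects.
        raise ValueError("no purchases on or before the measurement date")

    last = max(d for d, _ in past)
    # Recency: whole calendar months since the last purchase.
    recency = (as_of.year - last.year) * 12 + (as_of.month - last.month)

    amounts = [a for d, a in past if d >= window_start]
    frequency = len(amounts)
    monetary = {
        "total": sum(amounts),
        "average": sum(amounts) / frequency if frequency else 0.0,
        "maximum": max(amounts, default=0.0),
    }
    return recency, frequency, monetary


# Made-up purchase history for illustration.
history = [(date(2016, 1, 15), 40.0), (date(2016, 3, 2), 25.0), (date(2016, 5, 20), 60.0)]
recency, frequency, monetary = rfm(history, as_of=date(2016, 8, 1), window_start=date(2016, 2, 1))
print(recency, frequency, monetary)
```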
Response modeling
• Customer acquisition: acquiring new customers with targeted campaigns, win-back campaigns
• Campaign can be mail catalogue, email, coupon, A/B or multivariate testing, …
• Identify the customers most likely to respond based on the following information:
  – Demographic variables (age, gender, marital status, …)
  – Relationship variables (length of relationship, number of products purchased, …)
  – RFM variables
  – (Social) network information (cf. infra)
Response modeling setup
• Split target group into test group and control group
• Test group receives marketing material and control group does not
• Incremental impact equals the additional purchases that are directly attributable to the campaign (Larsen, 2010)
• Incremental impact = test group purchase rate - control group purchase rate
Baesens (2014), Analytics in a big data world: The essential guide to data science and its applications
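As a small worked example of the incremental impact calculation, the Python sketch below uses made-up campaign numbers (10,000 customers in the test group, 5,000 in the control group). Only the formula itself comes from the slide.

```python
def purchase_rate(purchasers, group_size):
    """Share of a group that made a purchase."""
    return purchasers / group_size


def incremental_impact(test_purchasers, test_size, control_purchasers, control_size):
    """Test group purchase rate minus control group purchase rate."""
    return (purchase_rate(test_purchasers, test_size)
            - purchase_rate(control_purchasers, control_size))


# Hypothetical campaign: 6% of the test group buys versus 4% of the control group.
impact = incremental_impact(600, 10_000, 200, 5_000)
print(f"{impact:.1%}")  # 2.0%
```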
Measuring incremental impact (Larsen, 2010)
• Try to factor in the behavior of self-selecting clients, clients that purchase regardless of the marketing campaign
• Focus should be on swing clients: interested in the product, but need to be motivated (by e.g. marketing message) to take action
• Both test and control group should be representative
• Find a model such that the difference between the test group purchase rate and the control group purchase rate is maximized (i.e. identifying the swing clients)
Gross versus Net Lift Models (Lo, 2002)
[Figure: two panels, Gross Lift and Net Lift. Labels: previous campaign data, test, control, training data, holdout data, model.]
Net Lift models (Larsen, 2010)
[Figure: test group and control group, each divided into segments (self-selectors, converted swing clients or swing clients, no purchase) that are labeled Y=1 or Y=0.]
Building a Difference Score Model
• Step 1: Build a logistic regression model estimating probability of purchase given treatment, P(purchase|test)
• Step 2: Build a logistic regression model estimating probability of purchase given control, P(purchase|control)
• Step 3: Incremental score = P(purchase|test) - P(purchase|control)
Note: to understand the impact of the predictors, regress the incremental lift scores on the original data!
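The three steps can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation. The logistic regression is fitted with simple batch gradient descent to keep the example dependency-free. The data set is invented: it has a single predictor taking five levels between 0 and 1, with a fixed number of buyers per level. All function and variable names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """P(purchase) for feature vector x under weights w = [intercept, w1, ...]."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Logistic regression fitted with plain batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def make_group(buyers_per_level, group_size=20):
    """Invented data: 20 customers per predictor level, a fixed number buy."""
    X, y = [], []
    for level, buyers in zip([0.0, 0.25, 0.5, 0.75, 1.0], buyers_per_level):
        for k in range(group_size):
            X.append([level])
            y.append(1 if k < buyers else 0)
    return X, y


# Step 1: model on the test (treated) group.
X_test_group, y_test_group = make_group([2, 5, 8, 11, 14])
w_test = fit_logistic(X_test_group, y_test_group)

# Step 2: model on the control group.
X_control, y_control = make_group([2, 2, 3, 3, 4])
w_control = fit_logistic(X_control, y_control)


# Step 3: incremental score = P(purchase|test) - P(purchase|control).
def incremental_score(x):
    return predict(w_test, x) - predict(w_control, x)


for level in (0.0, 0.5, 1.0):
    print(level, round(incremental_score([level]), 3))
```

In this toy data the campaign effect grows with the predictor, so customers with high predictor values get the highest incremental scores. These are the candidates for being swing clients.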
On-line analytics: example questions
• How do customers find my website (Google, Facebook, …)?
• How to optimise my on-line marketing mix (e.g. Google SEO versus Google AdWords)?
• Where am I sending customers to?
• What is the average time customers spend on my website?
• How can I customise the on-line experience?
• How to measure customer engagement?
• …
On-line analytics: data collection
• Web server logs (server side)
  195.162.218.155 - - [27/Jun/2002:00:01:54 +0200] "GET /dutch/shop/detail.html HTTP/1.1" 200 38890 "http://www.msn.be/shopping/food/" "Mozilla/4.0 (MSIE 6.0)"
• Page tagging (client side)
  – “tagging” web page with a code snippet referencing a separate JavaScript file
• Cookies
  – small text string that a Web server can send to a visitor's Web browser (as part of its HTTP response)
  – privacy! (cf. regulatory compliance)
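A server-side log entry like the one above can be turned into structured fields with a regular expression. The Python sketch below assumes the entry follows the common "combined" access log layout (client, identity, user, timestamp, request, status, size, referrer, user agent). The field names are chosen for illustration, and the sample line is the slide's entry with extraction spacing removed.

```python
import re
from datetime import datetime

# One named group per field of a combined-format access log entry (assumed layout).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of parsed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


LINE = ('195.162.218.155 - - [27/Jun/2002:00:01:54 +0200] '
        '"GET /dutch/shop/detail.html HTTP/1.1" 200 38890 '
        '"http://www.msn.be/shopping/food/" "Mozilla/4.0 (MSIE 6.0)"')

record = parse_line(LINE)
print(record["ip"], record["path"], record["status"], record["referrer"])
```

The referrer field is what answers the question of how a visitor found the page. In this entry the visit came from a shopping page on msn.be.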
KPI monitoring using dashboards
On-line Analytics: challenges
• Extremely messy data
  – Extensive preprocessing needed
  – Focus on trends + segmentation
• Information overload: too many metrics!
• Focus on actionable metrics
  – Bounce rate: ratio of visits where visitor left instantly
  – Conversion rate: percentage of visitors for which we observed the event (e.g. purchase, pdf download, registration, …)
• Integrate on-line with off-line customer data!
Huysmans, Mues, Vanthienen, Baesens (2004), Web Usage Mining with Time Constrained Association Rules.
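The two actionable metrics can be computed from preprocessed visit data along the following lines. The visit records and field names are invented for illustration. Treating a single-page visit as a bounce is an assumption. It is one common way to operationalise "left instantly"; a time-on-site threshold would be another.

```python
# Hypothetical preprocessed visit data: one record per visit.
VISITS = [
    {"visitor": "a", "pages": ["/home"], "converted": False},
    {"visitor": "b", "pages": ["/home", "/shop", "/checkout"], "converted": True},
    {"visitor": "c", "pages": ["/shop"], "converted": False},
    {"visitor": "a", "pages": ["/home", "/shop"], "converted": False},
    {"visitor": "d", "pages": ["/home", "/whitepaper.pdf"], "converted": True},
]


def bounce_rate(visits):
    """Share of visits with a single page view (assumed bounce definition)."""
    bounces = sum(1 for v in visits if len(v["pages"]) == 1)
    return bounces / len(visits)


def conversion_rate(visits):
    """Share of distinct visitors for whom the target event was observed."""
    visitors = {v["visitor"] for v in visits}
    converters = {v["visitor"] for v in visits if v["converted"]}
    return len(converters) / len(visitors)


print(bounce_rate(VISITS))      # 0.4
print(conversion_rate(VISITS))  # 0.5
```

Note that the two metrics use different denominators here: bounce rate is per visit, while conversion rate is per visitor, matching the wording of the definitions above.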
Social Media Analytics
• Analyse on-line social media data (e.g. Twitter feeds, Facebook messages, …)
• Applications
  – Corporate reputation and sentiment analysis
  – Identification of key themes, opinions and trending topics
  – Social graphing and viral tracking
• Develop a social CRM strategy!
Social Network Analytics
• Networked data
  – Telephone calls
  – Facebook, Twitter, LinkedIn, …
  – Web pages connected by hyperlinks
  – Research papers connected by citations
  – Terrorism networks
• Applications
  – Product recommendations
  – Churn detection
  – Web page classification
  – Fraud detection
  – Terrorism detection?
Baesens (2014), Analytics in a big data world: The essential guide to data science and its applications
Example: Social Networks in a Telco context
• Traditional churn prediction models treat customers as isolated entities
• However, customers are strongly influenced by their social environment:
  – recommendations from peers, word-of-mouth publicity
  – social leader influence
  – promotional offers from operators to acquire groups of friends
  – reduced tariffs for intra-operator traffic
• Take into account the customers’ social network!
Verbeke, Martens, Baesens (2014), Social network analysis for customer churn prediction
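One simple way to bring the social network into a churn model is to add network-based predictors. The Python sketch below computes, for each customer, the share of call-graph neighbours who have already churned. The call list, the customer names and the choice of this particular feature are illustrative assumptions and are not taken from the cited paper.

```python
from collections import defaultdict


def churned_neighbour_share(calls, churned):
    """Share of each customer's call contacts that have already churned.

    calls: iterable of (customer_a, customer_b) pairs, treated as undirected.
    churned: set of customers known to have churned.
    """
    neighbours = defaultdict(set)
    for a, b in calls:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {
        customer: len(contacts & churned) / len(contacts)
        for customer, contacts in neighbours.items()
    }


# Hypothetical call graph and churn labels.
CALLS = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan"), ("dan", "eve")]
CHURNED = {"bob", "dan"}

shares = churned_neighbour_share(CALLS, CHURNED)
print(shares["ann"], shares["eve"])  # 0.5 1.0
```

Such a feature can then be used alongside the demographic, relationship and RFM variables in a conventional churn model.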
Fraud Analytics
• Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms.
Baesens, Van Vlasselaer, Verbeke (2015), Fraud Analytics using Descriptive, Predictive and Social Network Techniques.
Fraud Analytics
Credit card transaction fraud:
• Stolen credit cards (yellow nodes) are often used in the same stores (blue nodes)
• The store itself also processes legitimate transactions to cover its fraudulent activities
Fraud Analytics
Identity theft:
• Before: person calls his/her frequent contacts
• After: person also calls new contacts, which coincidentally overlap with another person's contacts.
[Figure: call network before and after]
Fraud Analytics
Social security fraud:
• Companies are frequently associated with other companies that perpetrate suspicious/fraudulent activities.
Van Vlasselaer, Eliassi-Rad, Akoglu, Snoeck, Baesens (2016), Gotcha! Network-based fraud detection for social security fraud
HR analytics
• Employee churn
• Employee performance
• Employee absence
• Employee satisfaction
• Employee Lifetime Value
• …
Example Absenteeism scorecard

Characteristic Name   Attribute     Points
Age                   Up to 26      100
Age                   26 - 35       120
Age                   35 - 37       185
Age                   37+           225
Function              No-manager    90
Function              Manager       180
Department            HR            120
Department            Marketing     140
Department            Finance       160
Department            Production    200
Department            IT            240

Let cutoff = 500. So, a new employee needs to be scored:

Age          32         120 points
Function     Manager    180 points
Department   Finance    160 points
Total                   460 points   No Absenteeism!

Baesens (2014), 5 Reasons to Start with Predictive Employee Turnover Analytics.
Hiring & Firing
Baesens, De Winne, Sels (2016), What to Do Before You Fire a Pivotal Employee
Process Analytics
• Extracting knowledge from event logs of information systems
  – Control flow perspective
  – Organizational perspective
  – Information perspective
De Weerdt, De Backer, Vanthienen, Baesens (2012), A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs
Process Analytics: Example

Case ID   Activity Name          Originator   Timestamp             Extra Data
001       Make order form        Mary         20-07-2010 14:02:06   …
002       Make order form        Jane         20-07-2010 15:45:29   …
001       Scan invoice           John         10-08-2010 09:52:31   …
001       Central registration   John         10-08-2010 10:00:36   …
002       Scan invoice           John         11-08-2010 09:15:22   …
002       Central registration   John         11-08-2010 09:20:01   …
001       Accepted               Sophie       13-08-2010 08:20:54   …
002       Accepted               Sophie       13-08-2010 08:21:12   …
001       Decentral rejection    Mary         14-08-2010 14:15:14   …
001       End                    System       14-08-2010 14:15      …
002       Decentral approval     Jane         16-08-2010 19:22:56   …
002       Invoice booked         System       16-08-2010 19:22:59   …
002       End                    System       16-08-2010 19:23:00   …
003       Make order form        Mary         19-08-2010 07:52:41   …
004       Make order form        Mary         19-08-2010 15:21:39   …

[Figure: process model discovered from the event log, with activities Make order form, Scan invoice, Central registration, Rejected, Decentral revision, Accepted, Decentral approval, Invoice booked, Decentral rejection and End. Discovery algorithms shown: Genetic Miner, HeuristicsMiner, AGNEs Miner.]
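From the control flow perspective, a common first step is to group the events by case and count how often one activity is directly followed by another. The Python sketch below does this for the case IDs and activity names in the log above. It assumes the rows are already in timestamp order, as they are in the table, and it is not a reproduction of any of the discovery algorithms named in the figure.

```python
from collections import Counter, defaultdict

# (case ID, activity) pairs in the order they appear in the event log.
EVENTS = [
    ("001", "Make order form"), ("002", "Make order form"),
    ("001", "Scan invoice"), ("001", "Central registration"),
    ("002", "Scan invoice"), ("002", "Central registration"),
    ("001", "Accepted"), ("002", "Accepted"),
    ("001", "Decentral rejection"), ("001", "End"),
    ("002", "Decentral approval"), ("002", "Invoice booked"), ("002", "End"),
    ("003", "Make order form"), ("004", "Make order form"),
]


def build_traces(events):
    """Group activities per case, keeping the log order within each case."""
    traces = defaultdict(list)
    for case_id, activity in events:
        traces[case_id].append(activity)
    return dict(traces)


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for activities in traces.values():
        counts.update(zip(activities, activities[1:]))
    return counts


traces = build_traces(EVENTS)
follows = directly_follows(traces)

print(traces["001"])
for (a, b), n in sorted(follows.items()):
    print(f"{a} -> {b}: {n}")
```

In this log, "Accepted" is followed once by "Decentral rejection" and once by "Decentral approval". That split is the kind of branching point a discovered process model would show.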