Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
Ronny Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca
Microsoft Corporation, Experimentation Platform

Controlled Experiments in One Slide
• The concept is trivial: randomly split traffic between two (or more) versions, A (Control) and B (Treatment)
• Collect the metrics of interest and analyze
• This is the best scientific way to prove causality, i.e., that the changes in the metrics are caused by the changes introduced in the Treatment
• Must run statistical tests to confirm the differences are not due to chance (see the sketch below)
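To make the mechanics concrete, here is a minimal sketch in Python, assuming SciPy is available; the hash-based `assign_variant` helper and the toy conversion data are illustrative, not from the talk.

```python
import hashlib
from scipy import stats

def assign_variant(user_id: str) -> str:
    """Deterministically split traffic 50/50 by hashing the user id."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < 50 else "control"

# Metric of interest collected per user (here a 0/1 conversion indicator).
control = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
treatment = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]

# Statistical test to confirm the difference is not due to chance.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

Hashing the user id (rather than drawing a fresh random number per request) keeps each user in the same variant across visits, which is what makes per-user metrics meaningful.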

Pitfall 1: Wrong Success Metric
• Office Online tested a new design for its homepage
• Objective: increase sales of Office products
• [Screenshots: the Control and Treatment homepage designs]
• The Overall Evaluation Criterion (OEC) was clicks on the Buy button
• Which one was better?

Pitfall 1: Wrong OEC
• The Treatment had a 64% drop in the OEC! Were sales for the Treatment correspondingly lower as well?
• Our interpretation is that not having the price shown in the Control led more people to click just to determine the price
• It's possible the Treatment group ended up with the same number of sales; that data was not available
• Lesson: measure what you really need to measure, even if it's difficult!

Pitfall 2: Incorrect Interval Calculation
• Confidence intervals are a great way to summarize statistical results
• Example: the CI for a single mean, assuming a Normal distribution, is x̄ ± t(1−α/2) · s/√n
• Some cases are not so straightforward, e.g. the CI for Percent Effect (use Fieller's formula; see the sketch below)
• Example for a difference in two means: Effect = x̄_T − x̄_C = 0.62, and the 95% CI for the difference is (0, 1.24)
• The percent effect is an increase of 62%, and the 95% CI for the percent effect is (0%, 201%) => not symmetric about 62%
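Since the slide only names Fieller's formula, here is a hedged sketch of computing a percent-effect CI with it, assuming two independent groups and a Normal critical value; the function and the input numbers are illustrative and do not reproduce the slide's (0%, 201%) example.

```python
import math
from scipy import stats

def fieller_percent_effect_ci(mean_t, var_t, mean_c, var_c, alpha=0.05):
    """CI for the percent effect (mean_t / mean_c - 1) via Fieller's formula.

    var_t and var_c are the variances of the two *sample means*
    (i.e. s**2 / n), assumed independent.
    """
    z = stats.norm.ppf(1 - alpha / 2)
    # Roots of (mean_t - theta * mean_c)^2 = z^2 * (var_t + theta^2 * var_c)
    disc = var_t * mean_c**2 + var_c * mean_t**2 - z**2 * var_t * var_c
    denom = mean_c**2 - z**2 * var_c
    if denom <= 0 or disc < 0:
        raise ValueError("control mean is not significantly different from 0")
    lo = (mean_t * mean_c - z * math.sqrt(disc)) / denom
    hi = (mean_t * mean_c + z * math.sqrt(disc)) / denom
    return 100 * (lo - 1), 100 * (hi - 1)  # ratio bounds -> percent effect

# Point estimate here is +62%; note the interval is NOT symmetric about it.
print(fieller_percent_effect_ci(mean_t=1.62, var_t=0.10, mean_c=1.00, var_c=0.10))
```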

Pitfall 3: Using Standard Formulas for Standard Deviation
• Most metrics for online experiments cannot use the standard statistical formulas
• Example: click-through rate (CTR)
• The standard statistical approach would treat each page view as approximately a Bernoulli trial
• However, because a single user contributes many correlated page views, the true standard deviation can be much larger than the Bernoulli formula suggests, depending on the site (see the simulation below)
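A small simulation can show the gap. This sketch compares the naive Bernoulli standard deviation of CTR against a delta-method calculation for a ratio of user-level sums, one standard way to handle such metrics when users, not page views, are randomized; the simulated data is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users = 10_000

# Heterogeneous users: page views and per-user CTR both vary, so clicks
# are correlated within a user.
pageviews = rng.poisson(5, n_users) + 1
user_ctr = rng.beta(0.5, 4.5, n_users)
clicks = rng.binomial(pageviews, user_ctr)

ctr = clicks.sum() / pageviews.sum()

# Naive Bernoulli std, treating every page view as independent:
naive_std = np.sqrt(ctr * (1 - ctr) / pageviews.sum())

# Delta-method std for the ratio of user-level sums:
mx, my = clicks.mean(), pageviews.mean()
cov = np.cov(clicks, pageviews, ddof=1)[0, 1]
var_ratio = ((clicks.var(ddof=1) / mx**2 + pageviews.var(ddof=1) / my**2
              - 2 * cov / (mx * my)) * (mx / my)**2 / n_users)
print(f"naive: {naive_std:.5f}  user-level: {np.sqrt(var_ratio):.5f}")
```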

Pitfall 4: Combining Data when Percent to Treatment Varies (Simpson's Paradox)
• Simplified example: 1,000,000 users per day; conversion rates for two days:

               Friday (C/T split: 99/1)    Saturday (C/T split: 50/50)   Total
  Control      20,000/990,000 = 2.02%      5,000/500,000 = 1.00%         25,000/1,490,000 = 1.68%
  Treatment    230/10,000 = 2.30%          6,000/500,000 = 1.20%         6,230/510,000 = 1.22%

• For each individual day the Treatment is much better
• However, the cumulative result for the Treatment is worse (verified in the sketch below)
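The reversal can be checked directly from the table's numbers; this sketch just recomputes the per-day and combined rates.

```python
# Conversions and users per group, taken from the table above.
days = {
    "Friday (99/1 split)": {"C": (20_000, 990_000), "T": (230, 10_000)},
    "Saturday (50/50 split)": {"C": (5_000, 500_000), "T": (6_000, 500_000)},
}

totals = {"C": [0, 0], "T": [0, 0]}
for day, groups in days.items():
    for g, (conv, users) in groups.items():
        totals[g][0] += conv
        totals[g][1] += users
        print(f"{day} {g}: {conv / users:.2%}")   # Treatment wins each day...

for g, (conv, users) in totals.items():
    print(f"Total {g}: {conv / users:.2%}")       # ...yet Control wins overall
```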

Pitfall 5: Not Filtering out Robots
• Internet sites can get a significant amount of robot traffic (search engine crawlers, email harvesters, botnets, etc.)
• Robots can cause misleading results
• Most concerning are robots with high traffic (e.g. clicks or PVs) that stay in the Treatment or Control (we've seen one robot with > 600,000 clicks in a month on one page)
• Identifying robots can be difficult: some robots identify themselves, but many look like human users and even execute JavaScript
• Use heuristics to identify robots and remove them from the analysis (e.g. more than 100 clicks in an hour; see the sketch below)
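A minimal sketch of that "more than 100 clicks in an hour" heuristic, assuming the click log is available as (user_id, timestamp) pairs; the function name and log format are assumptions, not the platform's actual API.

```python
from collections import Counter
from datetime import datetime

def find_likely_robots(click_log, max_clicks_per_hour=100):
    """Return user ids whose click count exceeds the threshold in any hour.

    click_log: iterable of (user_id, timestamp) pairs, timestamp a datetime.
    """
    per_user_hour = Counter(
        (user_id, ts.replace(minute=0, second=0, microsecond=0))
        for user_id, ts in click_log
    )
    return {user for (user, _hour), n in per_user_hour.items()
            if n > max_clicks_per_hour}

# The analysis then excludes these ids before computing experiment metrics:
log = [("u1", datetime(2007, 6, 7, 15, m % 60, m // 60)) for m in range(150)]
print(find_likely_robots(log))  # {'u1'}: 150 clicks in a single hour
```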

[Figure: Effect of Robots on an A/A Experiment. Clicks for Treatment minus Control by hour for an A/A test, no robots removed, June 7-20, 2007; y-axis from -8,000 to 8,000. Each hour represents clicks from thousands of users; the "spikes" can be traced to single "users" (robots).]

Pitfall 6: Invalid Instrumentation
• Validate the initial instrumentation:
• Logging audit: compare the experimentation observations with the recording system of record
• A/A experiment: run a "mock" experiment where users are randomly assigned to two groups but get the Control in both
• Expect about 5% of metrics to be statistically significant
• P-values should be uniformly distributed on the interval (0, 1), and no p-values should be very close to zero (e.g. < 0.001); see the simulation below
• A surprising number of partners initially fail either the logging audit or the A/A experiment
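The p-value check is easy to simulate. This sketch runs many A/A comparisons on identically distributed data and confirms that roughly 5% come out significant at the 0.05 level and that the distribution looks uniform; the sample sizes and repetition count are arbitrary. (In a real A/A run over a few dozen metrics, a p-value below 0.001 signals a problem; across 1,000 simulated tests, one or two are expected by chance.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_values = []
for _ in range(1_000):          # 1,000 simulated A/A comparisons
    a = rng.normal(size=500)    # both "variants" get the same distribution,
    b = rng.normal(size=500)    # mimicking Control served to both groups
    p_values.append(stats.ttest_ind(a, b).pvalue)

p = np.array(p_values)
print(f"fraction significant at 0.05: {(p < 0.05).mean():.3f}")  # ~0.05
# Kolmogorov-Smirnov test against Uniform(0, 1) checks the whole distribution:
print(stats.kstest(p, "uniform"))
```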

Pitfall 7: Insufficient Experimental Control
• Must make sure the only difference between the Treatment and the Control is the change being tested
• The hourly click-through rate was plotted for the Treatment and the Control in a recent experiment
• The headlines were supposed to be the same in both
• One headline was different for one 7-hour period, changing the result of the experiment

Experimentation is Easy!
• But it requires vigilance and attention to detail
• "Good judgment comes from experience, and a lot of that comes from bad judgment." -- Will Rogers