
STAT6202_Ch2_PrintsC_1213.ppt
- Number of slides: 44
CHAPTER 2: Describing bivariate data
- NOTATION
- SCATTER PLOT
- SIMPLE LINEAR REGRESSION
  - What is it?
  - Least squares estimation
  - Watch out!
  - Log transformation and regression
  - A practical approach
- SAMPLE CORRELATION COEFFICIENT
- RANK CORRELATION COEFFICIENT
STAT 6202: LECTURE 1

BIVARIATE DATA: What is it?
- VALUES OF 2 QUANTITATIVE VARIABLES x AND y, MEASURED ON n INDIVIDUALS
  - Measurements $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, or $(x_i, y_i)$ for $i = 1, 2, \ldots, n$
- AN EXAMPLE: DATA ON ALCOHOL CONSUMPTION

Number  Country  Alcohol consumption    Cirrhosis & alcoholism
                 (liters/person/year)   (death rate/100,000)
1       France   24.7                   46.1
2       Italy    15.2                   23.6
3       Germany  12.3                   23.7
:       :        :                      :
12      Ireland  5.6                    6.4
13      Norway   4.2                    4.3
14      Finland  3.9                    3.6
15      Israel   3.1                    5.4

STAT 6202 Chapter 2 2012/2013

BIVARIATE DATA: Notation
- 2 QUANTITATIVE VARIABLES x AND y, n INDIVIDUALS
  - Some old statistics
  - And some new

BIVARIATE DATA: Alcohol example (without France)

i      Country  x_i    y_i    x_i^2    y_i^2    x_i y_i
1      Italy    15.2   23.6   231.04   556.96   358.72
2      Germany  12.3   23.7   151.29   561.69   291.51
3      Austria  10.9   7      118.81   49       76.3
:      :        :      :      :        :        :
14     Israel   3.1    5.4    9.61     29.16    16.74
Total           109.5  132.4  1020.3   1863.80  1297.74

BIVARIATE DATA: Scatter plot (including France)
- REPRESENT DATA AS A SCATTER OF POINTS

Country  Alcohol consumption    Cirrhosis & alcoholism
         (liters/person/year)   (death rate/100,000)
France   24.7                   46.1
Italy    15.2                   23.6
Germany  12.3                   23.7
:        :                      :
Ireland  5.6                    6.4
Norway   4.2                    4.3
Finland  3.9                    3.6
Israel   3.1                    5.4

BIVARIATE DATA: Scatter plot (excluding France)
- REPRESENT DATA AS A SCATTER OF POINTS
[Scatter plot: cirrhosis & alcoholism (death rate/100,000) against alcohol consumption (liters/person/year). Points listed: Italy (15.2, 23.6), Germany (12.3, 23.7), ..., Ireland (5.6, 6.4), Norway (4.2, 4.3), Finland (3.9, 3.6), Israel (3.1, 5.4)]

REGRESSION: What is it? It’s everywhere!
- TRYING TO PREDICT (OR EXPLAIN) ONE VARIABLE AS A FUNCTION OF ANOTHER ONE
  - Or of multiple other ones
- ONE OF THE MOST OFTEN USED STATISTICAL TOOLS
- REGRESSION CAN BE VERY POWERFUL AND COMPLEX
- WE WILL LOOK AT THE MOST BASIC FORM
  - Simple linear regression:
    - Linear relationship between two variables

SIMPLE LINEAR REGRESSION: Linear relationship
- IN GENERAL: A LINEAR RELATIONSHIP BETWEEN TWO VARIABLES x AND y IS
  - A straight line, or equivalently
  - $y = a + bx$, where
    - a is the intercept (value of y when x = 0)
    - b is the slope (amount that y changes when x changes by 1 unit)

SIMPLE LINEAR REGRESSION: The challenge
- QUESTION: Given the data, what is the linear relationship, i.e.
  - What is the “correct” straight line? Or equivalently,
  - What is the “correct” equation $y_i = a + b x_i$?
- THE ALCOHOL EXAMPLE (WITHOUT FRANCE)
- ANSWER: Least squares estimation

SIMPLE LINEAR REGRESSION: Least Squares Estimation
- COMPUTING THE STRAIGHT LINE WHICH IS CLOSER TO THE DATA POINTS THAN ANY OTHER LINE
- MINIMIZING THE SUM OF SQUARED RESIDUALS
- THE LEAST SQUARES ESTIMATORS ARE
  - b: estimated change in mean y for a 1 unit increase in x
  - a: estimated mean y for x = 0

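
For reference, the standard least squares estimators for the line $y = a + bx$ can be written as below. The shorthand $S_{xx}$ and $S_{xy}$ is an assumption of this sketch and may differ from the notation used in the lecture; the right-hand forms use only column totals of the kind tabulated for the alcohol example.

```latex
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2
        = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2, \\
S_{xy} &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
        = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)\Big(\sum_{i=1}^{n} y_i\Big), \\
b &= \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x}.
\end{aligned}
```

These values of $a$ and $b$ minimize the sum of squared residuals $\sum_{i=1}^{n} (y_i - a - b x_i)^2$.
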
SIMPLE LINEAR REGRESSION: Residuals
- DATA POINTS ARE SCATTERED ABOUT THE LINE: RESIDUALS
  - Vertical distance between data points and regression line, or
  - $y_i - a - b x_i$
- LEAST SQUARES ESTIMATION RESULTS IN

SIMPLE LINEAR REGRESSION: The alcohol example (without France)
- LEAST SQUARES ESTIMATES ARE $a = -3.0786$ AND $b = 1.6027$
- ESTIMATED LINEAR REGRESSION EQUATION: $\hat{y}_i = -3.0786 + 1.6027\,x_i$

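
As a rough check, the estimates can be recomputed from the column totals of the alcohol table (without France). This is a minimal sketch, assuming the standard formulas $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$; the variable names are this sketch's own.

```python
# Column totals from the alcohol table without France (n = 14 countries).
n = 14
sum_x, sum_y = 109.5, 132.4        # totals of x_i and y_i
sum_xx, sum_xy = 1020.3, 1297.74   # totals of x_i^2 and x_i * y_i

# Corrected sums of squares and products (shorthand assumed by this sketch).
s_xx = sum_xx - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

b = s_xy / s_xx                    # slope
a = sum_y / n - b * sum_x / n      # intercept

print(f"b = {b:.4f}, a = {a:.4f}")
```

With the totals as tabulated this gives a slope of about 1.60 and an intercept of about -3.06, close to but not identical with the quoted 1.6027 and -3.0786. The small gap may reflect rounding in the tabulated totals.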
SIMPLE LINEAR REGRESSION: Drawing the regression line
- CALCULATE 2 POINTS USING THE REGRESSION EQUATION
- CONNECT THE TWO POINTS

SIMPLE LINEAR REGRESSION: The alcohol example (without France)
$\hat{y}_i = -3.0786 + 1.6027\,x_i$
- INTERPRETATION OF THE REGRESSION COEFFICIENTS
  - b: the mean number of deaths per 100,000 is estimated to increase by 1.6027 for a 1 liter increase in annual alcohol intake per person
  - a: the mean number of deaths per 100,000 is estimated to be -3.0786 for an annual alcohol intake of 0 liters per person

SIMPLE LINEAR REGRESSION: Prediction
- WHAT IS THE EXPECTED DEATH RATE FOR AN ANNUAL ALCOHOL INTAKE OF 15 LITERS PER PERSON?
  - Don’t read from the graph, use the regression equation
  - Answer: 21.0 deaths per 100,000 people
- WHAT IS THE STANDARD DEVIATION OF THE DEATH RATE FOR A GIVEN LEVEL OF ALCOHOL INTAKE?

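
The prediction for 15 liters can be reproduced by plugging into the fitted equation. A minimal sketch; the function name is hypothetical and the coefficients are the ones quoted for the alcohol example without France.

```python
def predict_death_rate(liters_per_person_per_year: float) -> float:
    """Fitted line for the alcohol data without France: y = -3.0786 + 1.6027 x."""
    a, b = -3.0786, 1.6027
    return a + b * liters_per_person_per_year


prediction = predict_death_rate(15.0)
print(round(prediction, 1))  # 21.0
```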
SIMPLE LINEAR REGRESSION: Watch out!
- YOU CAN ALWAYS DRAW A REGRESSION LINE, BUT
  - It is not always appropriate or a good fit
  - The relationship does not necessarily hold outside the range of the data
    - Average death rate for an alcohol intake of 1 liter per person per year?
  - Regression does not imply causality
    - Drawing a line does not prove that one variable is causing a change in another one
    - Causality could be the case, but just a line is not proof
    - There can be confounding variables
    - An example: national ice cream sales and my mood

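
The 1 liter question can be made concrete with a short sketch. It assumes the observed intake range without France runs from 3.1 (Israel) to 15.2 (Italy), which is read off the visible table rows; the helper name is hypothetical.

```python
A, B = -3.0786, 1.6027      # fitted coefficients (alcohol data without France)
X_MIN, X_MAX = 3.1, 15.2    # assumed observed range of alcohol intake


def predict_with_range_check(x: float):
    """Return the fitted value and whether x lies inside the observed range."""
    return A + B * x, X_MIN <= x <= X_MAX


y_hat, inside = predict_with_range_check(1.0)
print(f"{y_hat:.2f}", inside)
```

At 1 liter the line returns a negative death rate, which is impossible, and the range check flags the input as an extrapolation.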
LOG TRANSFORMS IN REGRESSION: Why/When?
- When natural comparisons are in terms of ratios.
- REMEMBER: LOG APPROXIMATION PROPERTY
  - The difference between two numbers u and v, as a fraction of their mean, approximately equals the difference between their natural logs:
    $$\frac{u - v}{(u + v)/2} \approx \ln u - \ln v$$
  - This works well for fractional differences up to 0.5

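
A quick numerical check of the log approximation property, as a sketch with arbitrarily chosen pairs of numbers (the pairs and function names are illustrative only):

```python
import math


def fractional_difference(u: float, v: float) -> float:
    """Difference between u and v as a fraction of their mean."""
    return (u - v) / ((u + v) / 2)


def log_difference(u: float, v: float) -> float:
    """Difference between the natural logs of u and v."""
    return math.log(u) - math.log(v)


for u, v in [(105, 100), (120, 100), (5, 3), (3, 1)]:
    print(u, v, round(fractional_difference(u, v), 3), round(log_difference(u, v), 3))
```

The pair (5, 3) has a fractional difference of exactly 0.5 and the two quantities still agree to about two decimal places; for (3, 1), with a fractional difference of 1, the agreement is visibly worse.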
LOG TRANSFORMS IN REGRESSION: Why/When?
- When natural comparisons are in terms of ratios.
- When data vary over several orders of magnitude.
- For transforming a positively skewed distribution to a more symmetric scale.
  - An example: the Handbook of Biological Statistics (http://udel.edu/~mcdonald/stattransform.html), Eastern mudminnow (Umbra pygmaea).

LOG TRANSFORMS IN REGRESSION: Why/When?
- When natural comparisons are in terms of ratios.
- When data vary over several orders of magnitude.
- For transforming a positively skewed distribution to a more symmetric scale.
- For making relationships more linear.

LOG TRANSFORMS IN REGRESSION: Pay attention to the following
- PREVIOUSLY WE SAW FOR y = a + bx
  - Least squares estimates a and b
  - Interpretations:
    - b: mean y is estimated to increase by b for a 1 unit increase in x
    - a: mean y is estimated to be a when x is 0
- THE ABOVE NEEDS TO BE ‘TRANSLATED’ FOR TRANSFORMATIONS
- AN EXAMPLE: y = a + bz, where z = log(x)
  - Least squares estimates a and b, computed with z in place of x
  - Interpretations:
    - b: mean y is estimated to increase by b for a 1 unit increase in z
    - a: mean y is estimated to be a when z is 0, i.e. when x is 1

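
The interpretation of $b$ can also be translated back to the original scale of $x$. The short derivation below is a supplementary sketch, assuming $\log$ denotes the natural logarithm and writing $k$ for an arbitrary positive multiplier (a symbol introduced here for illustration).

```latex
\begin{aligned}
y &= a + b \ln x, \\
y_{\text{new}} - y_{\text{old}} &= \big(a + b \ln(kx)\big) - \big(a + b \ln x\big) = b \ln k .
\end{aligned}
```

So a 1 unit increase in $z = \ln x$ corresponds to multiplying $x$ by $e \approx 2.718$, and doubling $x$ changes the estimated mean of $y$ by $b \ln 2 \approx 0.693\,b$.
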
LOG TRANSFORMS IN REGRESSION: An example: BMI and GDP
[Figure: BMI and GDP]

LOG TRANSFORMS IN REGRESSION: An example: BMI and log(GDP)
[Figure: BMI and log(GDP)]

LOG TRANSFORMS IN REGRESSION: An example: BMI and GDP
- DATA FOR BMI (kg/m²) AND GDP FOR DIFFERENT COUNTRIES
- PREDICT BMI FOR A COUNTRY WITH A GDP OF 4000
- APPLYING A LOG TRANSFORMATION MIGHT BE USEFUL
- REGRESSION EQUATION FOR BMI ON LOG(GDP): $\mathrm{BMI} = 6.89 + 2 \cdot \log_e(\mathrm{GDP})$
- $\log_e(4000) = 8.29$, SO THE PREDICTED BMI IS $\mathrm{BMI} = 6.89 + 2 \cdot \log_e(4000) = 23.5$

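
The BMI prediction can be checked with a few lines of code. A minimal sketch; the function name is hypothetical and the coefficients are the ones quoted in the regression equation for BMI on log(GDP).

```python
import math


def predict_bmi(gdp: float) -> float:
    """Regression of BMI on the natural log of GDP: BMI = 6.89 + 2 * ln(GDP)."""
    return 6.89 + 2 * math.log(gdp)


print(round(math.log(4000), 2))     # 8.29
print(round(predict_bmi(4000), 1))  # 23.5
```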
SIMPLE LINEAR REGRESSION: A practical approach
1. DRAW A SCATTER PLOT
2. THINK OF POTENTIALLY USEFUL TRANSFORMATIONS
3. TRANSFORM VARIABLES
4. SELECT VARIABLES WITH STRONGEST LINEAR RELATIONSHIP BY
   1. Comparing scatterplots of original and transformed variables
   2. Comparing correlation coefficients of original and transformed variables
5. USE VARIABLES WITH STRONGEST LINEAR RELATIONSHIP
6. CALCULATE REGRESSION EQUATION USING LEAST SQUARES ESTIMATION
7. USE REGRESSION EQUATION FOR PREDICTION
In this course we consider the following 4 options:
a. $y = a + b \cdot x$
b. $\log_e(y) = a + b \cdot x$
c. $y = a + b \cdot \log_e(x)$
d. $\log_e(y) = a + b \cdot \log_e(x)$

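
Step 4 (comparing correlation coefficients across the four options) can be sketched as follows. The data are hypothetical, generated so that y is exactly linear in log(x); `pearson_r` is a helper written for this sketch using the usual sample correlation formula, not a function from the course.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y = 3 + 2 * ln(x), so option c should fit perfectly.
x = [1, 2, 4, 8, 16, 32]
y = [3 + 2 * math.log(v) for v in x]

log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]

options = {
    "a. y ~ x": pearson_r(x, y),
    "b. log(y) ~ x": pearson_r(x, log_y),
    "c. y ~ log(x)": pearson_r(log_x, y),
    "d. log(y) ~ log(x)": pearson_r(log_x, log_y),
}

# Pick the option whose correlation is furthest from 0.
best = max(options, key=lambda name: abs(options[name]))
for name, r in options.items():
    print(f"{name}: r = {r:.3f}")
print("strongest linear relationship:", best)
```

Comparing correlations is only half of step 4; the scatterplots of the original and transformed variables should be inspected as well.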
BIVARIATE DATA: Scatter plot (excluding France)
- REPRESENT DATA AS A SCATTER OF POINTS
[Scatter plot with fitted regression line $\hat{y}_i = -3.0786 + 1.6027\,x_i$: cirrhosis & alcoholism (death rate/100,000) against alcohol consumption (liters/person/year). Points listed: Italy (15.2, 23.6), Germany (12.3, 23.7), ..., Ireland (5.6, 6.4), Norway (4.2, 4.3), Finland (3.9, 3.6), Israel (3.1, 5.4)]

CORRELATION COEFFICIENT: The theory
- THE CORRELATION COEFFICIENT MEASURES THE STRENGTH OF A LINEAR RELATIONSHIP BETWEEN TWO VARIABLES
- FORMULA:
  $$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
- SOME PROPERTIES
  - $-1 \le r_{xy} \le 1$
  - $r_{xy} > 0$: y tends to increase as x increases and vice versa
  - $r_{xy} < 0$: y tends to decrease as x increases and vice versa
  - $r_{xy} = 1$ or $r_{xy} = -1$: all points $(x_i, y_i)$ lie on a straight line
  - the further $r_{xy}$ is away from 0, the closer the points are to a straight line
  - $|r_{xy}|$ is not affected by linear transformations

CORRELATION COEFFICIENT: Illustrations (1)

CORRELATION COEFFICIENT: Illustrations (2)

CORRELATION COEFFICIENT: The alcohol example (without France)
- REPRESENT DATA AS A SCATTER OF POINTS
[Scatter plot: cirrhosis & alcoholism (death rate/100,000) against alcohol consumption (liters/person/year). Points listed: Italy (15.2, 23.6), Germany (12.3, 23.7), ..., Ireland (5.6, 6.4), Norway (4.2, 4.3), Finland (3.9, 3.6), Israel (3.1, 5.4)]

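
For the alcohol data without France, the sample correlation can be computed from the column totals of the earlier table. This is a sketch that assumes the usual computational form $r_{xy} = S_{xy} / \sqrt{S_{xx} S_{yy}}$; the variable names are this sketch's own.

```python
import math

# Column totals from the alcohol table without France (n = 14 countries).
n = 14
sum_x, sum_y = 109.5, 132.4
sum_xx, sum_yy, sum_xy = 1020.3, 1863.80, 1297.74

# Corrected sums of squares and products.
s_xx = sum_xx - sum_x ** 2 / n
s_yy = sum_yy - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

r_xy = s_xy / math.sqrt(s_xx * s_yy)
print(f"r_xy = {r_xy:.3f}")
```

With these totals the result is roughly 0.83, a fairly strong positive linear relationship between alcohol consumption and the death rate.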
CORRELATION COEFFICIENT: Illustrations (3)
- IN BOTH GRAPHS, THE CORRELATION COEFFICIENT BETWEEN THE X AND Y POINTS IS $r_{xy} = 0.7$
- THE PRESENCE OF REMOTE POINTS AND/OR OUTLIERS DOES MODIFY THE APPEARANCE OF THE GRAPHS

SPEARMAN’S CORRELATION COEFFICIENT: The theory
- SPEARMAN’S RANK CORRELATION COEFFICIENT, $r_s$, MEASURES THE STRENGTH OF THE LINEAR RELATIONSHIP BETWEEN ORDERINGS OF 2 VARIABLES
- FORMULA:
  $$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
  - Where $d_i$ is the difference between the rank of $x_i$ and the rank of $y_i$
- PROPERTIES (SIMILAR TO $r_{xy}$)
  - $-1 \le r_s \le 1$
  - $r_s > 0$: y tends to increase as x increases and vice versa
  - $r_s < 0$: y tends to decrease as x increases and vice versa
  - the further $r_s$ is away from 0, the stronger the relationship
  - $|r_s|$ is not affected by linear transformations

RANKING DATA/OBSERVATIONS: How to go about it?
- RANKING DATA
  - Put n observations in ascending order
  - Assign each observation its order number from 1 to n
  - Rank:
    - For unique observations: the order number
    - For identical observations: the average order number
- AN EXAMPLE: 20 40 30 20 10 20 30
  - Order data:          10  20  20  20  30   30   40
  - Assign order number: 1   2   3   4   5    6    7
  - Final rank:          1   3   3   3   5.5  5.5  7
    where 3 = (2 + 3 + 4)/3 and 5.5 = (5 + 6)/2

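
The ranking procedure can be sketched in code. This is an illustrative implementation written for this example (the function name is hypothetical); it returns the rank of each observation in its original position.

```python
def average_ranks(values):
    """Rank in ascending order; tied values share the mean of their order numbers."""
    # Indices of the observations, sorted by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2  # order numbers are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


data = [20, 40, 30, 20, 10, 20, 30]
print(average_ranks(data))  # [3.0, 7.0, 5.5, 3.0, 1.0, 3.0, 5.5]
```

Sorting the output gives 1, 3, 3, 3, 5.5, 5.5, 7, matching the final ranks in the example.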
SPEARMAN’S CORRELATION COEFFICIENT: Illustrations
[Figure: illustrative plots with axes x and y]

SPEARMAN’S CORRELATION COEFFICIENT: An example (1)
- A HOUSEHOLD INCOME AND EXPENDITURE EXAMPLE

Household Obs  Income (£)  Expenditure (£)
1              100         50
2              100         100
3              200         95
4              300         225
5              400         280
6              400         270
7              400         340
8              500         380
9              500         400
10             500         455
11             500         480
12             600         535

SPEARMAN’S CORRELATION COEFFICIENT: An example (2)
- A HOUSEHOLD SPENDING EXAMPLE

Original data                               Ordered data with ranks
Household Obs  Income (£)  Expenditure (£)  Income (£)  Rank  Expenditure (£)  Rank
1              100         50               100         1.5   50               1
2              100         100              100         1.5   95               2
3              200         95               200         3     100              3
4              300         225              300         4     225              4
5              400         280              400         6     270              5
6              400         270              400         6     280              6
7              400         340              400         6     340              7
8              500         380              500         9.5   380              8
9              500         400              500         9.5   400              9
10             500         455              500         9.5   455              10
11             500         480              500         9.5   480              11
12             600         535              600         12    535              12

SPEARMAN’S CORRELATION COEFFICIENT: An example (3)
- A HOUSEHOLD SPENDING EXAMPLE

Household Obs  Income (£)  Expenditure (£)  Rank x_i  Rank y_i  d_i   d_i^2
1              100         50               1.5       1         0.5   0.25
2              100         100              1.5       3         -1.5  2.25
3              200         95               3         2         1     1
4              300         225              4         4         0     0
5              400         280              6         6         0     0
6              400         270              6         5         1     1
7              400         340              6         7         -1    1
8              500         380              9.5       8         1.5   2.25
9              500         400              9.5       9         0.5   0.25
10             500         455              9.5       10        -0.5  0.25
11             500         480              9.5       11        -1.5  2.25
12             600         535              12        12        0     0
Total                                                                 10.5

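
The household example can be carried through to a value of $r_s$. The sketch below assumes the usual shortcut formula $r_s = 1 - 6 \sum d_i^2 / (n(n^2 - 1))$ and takes the expenditure of household 2 to be 100, an assumption inferred from its expenditure rank of 3. The helper `average_ranks` is hypothetical.

```python
def average_ranks(values):
    """Average ranks: each value gets the mean of the 1-based positions it occupies when sorted."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


income = [100, 100, 200, 300, 400, 400, 400, 500, 500, 500, 500, 600]
# Household 2's expenditure (100) is an assumption inferred from its rank.
expenditure = [50, 100, 95, 225, 280, 270, 340, 380, 400, 455, 480, 535]

rank_x = average_ranks(income)
rank_y = average_ranks(expenditure)

# Sum of squared rank differences d_i^2.
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
n = len(income)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2, round(r_s, 3))  # 10.5 0.963
```

When there are ties, as in the income column, this shortcut formula is an approximation to the correlation coefficient computed directly on the ranks.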