- Number of slides: 57
Chapter 11: Simple Linear Regression
Where We’ve Been
- Presented methods for estimating and testing population parameters for a single sample
- Extended those methods to allow for a comparison of population parameters for multiple samples
McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
Where We’re Going
- Introduce the straight-line regression model as a means of relating one quantitative variable to another quantitative variable
- Introduce the correlation coefficient as a means of relating one quantitative variable to another quantitative variable
- Assess how well the simple linear regression model fits the sample data
- Use the simple linear regression model to predict the value of one variable given the value of another variable
11.1: Probabilistic Models
There may be a deterministic reality connecting two variables, $y$ and $x$, but we may not know exactly what that reality is, or there may be an imprecise, or random, connection between the variables. The unknown or unknowable influence is referred to as the random error. So our probabilistic models refer to a specific connection between variables, as well as influences we can’t specify exactly in each case:
$y = f(x) + \text{random error}$
The relationship between home runs and runs in baseball seems at first glance to be deterministic …
But if you consider how many runners are on base when the home run is hit, or even how often the batter misses a base and is called out, the rigid model becomes more variable.
General Form of Probabilistic Models
$y = \text{Deterministic component} + \text{Random error}$
where $y$ is the variable of interest, and the mean value of the random error is assumed to be 0, so that $E(y) = \text{Deterministic component}$.
The goal of regression analysis is to find the straight line that comes closest to all of the points in the scatter plot simultaneously.
[Figure: scatter plot of Attendance against Week with a fitted line; $R^2 = 0.0602$]
A First-Order Probabilistic Model
$y = \beta_0 + \beta_1 x + \varepsilon$
where
- $y$ = dependent variable
- $x$ = independent variable
- $\beta_0 + \beta_1 x = E(y)$ = deterministic component
- $\varepsilon$ = random error component
- $\beta_0$ = y-intercept
- $\beta_1$ = slope of the line
$\beta_0$, the y-intercept, and $\beta_1$, the slope of the line, are population parameters, and invariably unknown. Regression analysis is designed to estimate these parameters.
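A small simulation can make the first-order model concrete. The sketch below draws many values of $y$ at one fixed $x$ and compares their average with the deterministic component $E(y) = \beta_0 + \beta_1 x$. The parameter values, the normal error distribution, and the function name `simulate_y` are all assumptions chosen for illustration; none of them come from the slides.

```python
import random


def simulate_y(x, beta0, beta1, sigma, rng):
    """Draw one y from the first-order model y = beta0 + beta1*x + epsilon."""
    epsilon = rng.gauss(0.0, sigma)  # random error with mean 0
    return beta0 + beta1 * x + epsilon


# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, SIGMA = 2.0, 0.5, 1.0
rng = random.Random(11)  # fixed seed so the run is repeatable

x = 4.0
draws = [simulate_y(x, BETA0, BETA1, SIGMA, rng) for _ in range(10_000)]
mean_y = sum(draws) / len(draws)
expected_y = BETA0 + BETA1 * x  # deterministic component E(y)

print(f"E(y) = {expected_y:.2f}, average of simulated y = {mean_y:.2f}")
```

Individual draws scatter around the line, but their average settles close to $E(y)$, which is what the zero-mean assumption on the random error implies.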
11.2: Fitting the Model: The Least Squares Approach
Step 1: Hypothesize the deterministic component of the probabilistic model, $E(y) = \beta_0 + \beta_1 x$.
Step 2: Use sample data to estimate the unknown parameters in the model.
[Figure: scatter plot of Total Offering against Average Offering with the fitted line]
Values on the line are the predicted values of total offerings given the average offering. The distances between the scattered dots and the line are the errors of prediction.
The line’s estimated parameters are the values that minimize the sum of the squared errors of prediction, and the method of finding those values is called the method of least squares.
- Model: $y = \beta_0 + \beta_1 x + \varepsilon$
- Estimates: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
- Deviation: $\hat{\varepsilon}_i = y_i - \hat{y}_i$
- SSE: $\sum_i (y_i - \hat{y}_i)^2$
The least squares line is the line that has the following two properties:
1. The sum of the errors (SE) equals 0.
2. The sum of squared errors (SSE) is smaller than that for any other straight-line model.
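Both properties can be checked numerically. The sketch below fits a line to a tiny data set using the standard textbook formulas (slope $= SS_{xy}/SS_{xx}$, intercept $= \bar{y} - \text{slope} \cdot \bar{x}$), then compares its SSE with that of one other straight line. The five data points and the comparison line are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def sse(xs, ys, intercept, slope):
    """Sum of squared errors of prediction for the line intercept + slope*x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical five-point data set, used only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)

slope = ss_xy / ss_xx              # least squares slope
intercept = y_bar - slope * x_bar  # least squares intercept

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sum_of_errors = sum(residuals)                     # property 1: should be 0
sse_least_squares = sse(xs, ys, intercept, slope)  # property 2: the minimum
sse_other_line = sse(xs, ys, -1.0, 1.0)            # an arbitrary competing line

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"sum of errors = {sum_of_errors:.2e}")
print(f"SSE (least squares) = {sse_least_squares:.2f}, SSE (other line) = {sse_other_line:.2f}")
```

Comparing against a single alternative line does not prove minimality, of course; it only illustrates what property 2 claims.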
Can home runs be used to predict errors? Is there a relationship between the number of home runs a team hits and the quality of its fielding?
Home Runs (x)    Errors (y)    x_i^2    x_i y_i
158              126           24964    19908
155              87            24025    13485
139              65            19321    9035
191              95            36481    18145
124              119           15376    14756
$\sum x_i = 767$, $\sum y_i = 492$, $\sum x_i^2 = 120167$, $\sum x_i y_i = 75329$
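The column totals are the inputs to the least squares estimates. The slide that stated the formulas did not survive extraction, so the sketch below assumes the standard textbook shortcuts, $SS_{xx} = \sum x_i^2 - (\sum x_i)^2/n$, $SS_{xy} = \sum x_i y_i - (\sum x_i)(\sum y_i)/n$, $\hat{\beta}_1 = SS_{xy}/SS_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, and applies them directly to the five (x, y) pairs.

```python
home_runs = [158, 155, 139, 191, 124]  # x
errors = [126, 87, 65, 95, 119]        # y
n = len(home_runs)

# Column totals, computed from the raw pairs.
sum_x = sum(home_runs)
sum_y = sum(errors)
sum_x2 = sum(x * x for x in home_runs)
sum_xy = sum(x * y for x, y in zip(home_runs, errors))

# Shortcut formulas for the sums of squares (assumed standard forms).
ss_xx = sum_x2 - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

slope = ss_xy / ss_xx                       # estimate of beta_1
intercept = sum_y / n - slope * sum_x / n   # estimate of beta_0

print(f"SSxx = {ss_xx:.1f}, SSxy = {ss_xy:.1f}")
print(f"slope = {slope:.4f}, intercept = {intercept:.2f}")
```

The slope comes out slightly negative: in this small sample, each additional home run goes with a very small decrease in predicted errors.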
These results suggest that teams which hit more home runs are (slightly) better fielders (maybe not what we expected). There are, however, only five observations in the sample. It is important to take a closer look at the assumptions we made and the results we got.
11.3: Model Assumptions
1. The mean of the probability distribution of $\varepsilon$ is 0.
2. The variance, $\sigma^2$, of the probability distribution of $\varepsilon$ is constant.
3. The probability distribution of $\varepsilon$ is normal.
4. The values of $\varepsilon$ associated with any two values of $y$ are independent.
- The variance, $\sigma^2$, is used in every test statistic and confidence interval used to evaluate the model.
- Invariably, $\sigma^2$ is unknown and must be estimated.
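The estimator itself is not shown in the surviving slide text. A minimal sketch, assuming the usual choice $s^2 = SSE/(n - 2)$ (two degrees of freedom are spent on the intercept and slope), applied to the home runs and errors data:

```python
import math

home_runs = [158, 155, 139, 191, 124]  # x
errors = [126, 87, 65, 95, 119]        # y
n = len(home_runs)

x_bar = sum(home_runs) / n
y_bar = sum(errors) / n
ss_xx = sum((x - x_bar) ** 2 for x in home_runs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, errors))

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

# Sum of squared errors of prediction around the fitted line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(home_runs, errors))

s_squared = sse / (n - 2)    # assumed estimator of sigma^2
s = math.sqrt(s_squared)     # estimated standard deviation of the random error

print(f"SSE = {sse:.2f}, s^2 = {s_squared:.2f}, s = {s:.2f}")
```

With only five observations there are just three degrees of freedom, so this estimate of $\sigma$ is itself quite imprecise.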
11.4: Assessing the Utility of the Model: Making Inferences about the Slope $\beta_1$
Note: There may be many different patterns in the scatter plot when there is no linear relationship.
A critical step in the evaluation of the model is to test whether $\beta_1 = 0$.
$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$
[Figure: three scatter plots of y against x: positive relationship ($\beta_1 > 0$), no relationship ($\beta_1 = 0$), negative relationship ($\beta_1 < 0$)]
The four assumptions described above produce a normal sampling distribution for the slope estimate $\hat{\beta}_1$, with mean $\beta_1$ and standard deviation $\sigma_{\hat{\beta}_1} = \sigma / \sqrt{SS_{xx}}$. Because $\sigma$ is unknown, it is replaced by $s$, giving $s_{\hat{\beta}_1} = s / \sqrt{SS_{xx}}$, called the estimated standard error of the least squares slope estimate.
Home Runs (x)   Errors (y)   x - xbar   (x - xbar)^2   yhat = E(Errors|HRs)   y - yhat   (y - yhat)^2
158             126          4.6        21.16          98.14                  27.86      776.4
155             87           1.6        2.56           98.31                  -11.31     127.9
139             65           -14.4      207.4          99.23                  -34.23     1171
191             95           37.6       1414           96.25                  -1.245     1.55
124             119          -29.4      864.4          100.1                  18.92      357.8
$\sum x_i = 767$, $\sum y_i = 492$, $\bar{x} = 153.4$, $SS_{xx} = 2509$, $SSE = 2435$
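From these quantities the test statistic for $H_0: \beta_1 = 0$ follows. The slide with the statistic is missing, so the sketch below assumes the standard form $t = \hat{\beta}_1 / s_{\hat{\beta}_1}$ with $s_{\hat{\beta}_1} = s/\sqrt{SS_{xx}}$ and $n - 2$ degrees of freedom. The critical value 3.182 is the two-tailed 5% value for 3 degrees of freedom quoted later in the slides.

```python
import math

home_runs = [158, 155, 139, 191, 124]  # x
errors = [126, 87, 65, 95, 119]        # y
n = len(home_runs)

x_bar = sum(home_runs) / n
y_bar = sum(errors) / n
ss_xx = sum((x - x_bar) ** 2 for x in home_runs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, errors))

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(home_runs, errors))
s = math.sqrt(sse / (n - 2))          # estimated standard deviation of the error

se_slope = s / math.sqrt(ss_xx)       # estimated standard error of the slope
t_stat = slope / se_slope             # test statistic for H0: beta_1 = 0

T_CRITICAL = 3.182                    # t(.025) with 3 degrees of freedom
reject_null = abs(t_stat) > T_CRITICAL

print(f"standard error = {se_slope:.3f}, t = {t_stat:.3f}, reject H0: {reject_null}")
```

The statistic is far inside the non-rejection region, which is the situation the next slide discusses.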
Since the t-value does not lead to rejection of the null hypothesis, we can conclude that one of the following holds:
- A different set of data may yield different results.
- There is a more complicated relationship.
- There is no relationship (non-rejection does not lead to this conclusion automatically).
Interpreting p-Values for $\beta$ Coefficients
- Software packages report two-tailed p-values.
- To conduct one-tailed tests of hypotheses, the reported p-values must be adjusted.
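The adjustment rule itself is missing from the extracted slide. A common convention, assumed here, is to halve the reported p-value when the sign of the estimate agrees with the alternative hypothesis and to use one minus half of it otherwise. The function name `one_tailed_p` and its arguments are hypothetical.

```python
def one_tailed_p(two_tailed_p, estimate, alternative):
    """Convert a reported two-tailed p-value to a one-tailed p-value.

    alternative is "greater" for Ha: beta_1 > 0 or "less" for Ha: beta_1 < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    if alternative == "greater":
        sign_agrees = estimate > 0
    else:
        sign_agrees = estimate < 0
    half = two_tailed_p / 2
    # Halve when the estimate points in the direction of Ha, otherwise take the other tail.
    return half if sign_agrees else 1 - half


# A positive slope estimate tested against Ha: beta_1 > 0.
print(one_tailed_p(0.0001, 4.92, "greater"))
# A negative slope estimate tested against the same alternative.
print(one_tailed_p(0.2, -1.0, "greater"))
```

The first call mirrors the fire damage example later in the chapter, where a reported two-tailed p-value below .0001 becomes a one-tailed p-value below .00005.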
A Confidence Interval on $\beta_1$
$\hat{\beta}_1 \pm t_{\alpha/2}\, s_{\hat{\beta}_1}$
where the estimated standard error is $s_{\hat{\beta}_1} = s / \sqrt{SS_{xx}}$ and $t_{\alpha/2}$ is based on $(n - 2)$ degrees of freedom.
In the home runs and errors example, the estimated $\beta_1$ was $-.0573$ (that is, $SS_{xy}/SS_{xx} = -143.8/2509$), and the estimated standard error was .569. With 3 degrees of freedom, $t_{.025} = 3.182$. The confidence interval is, therefore, $-.0573 \pm 3.182(.569) = -.0573 \pm 1.811$, or $(-1.868, 1.753)$, which includes 0, so there may be no relationship between the two variables.
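The interval can be recomputed from the raw data as a check. This is a sketch under the same assumptions as the formula on the previous slide; the critical value 3.182 is typed in from the slide rather than looked up by the code.

```python
import math

home_runs = [158, 155, 139, 191, 124]  # x
errors = [126, 87, 65, 95, 119]        # y
n = len(home_runs)

x_bar = sum(home_runs) / n
y_bar = sum(errors) / n
ss_xx = sum((x - x_bar) ** 2 for x in home_runs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, errors))

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(home_runs, errors))
se_slope = math.sqrt(sse / (n - 2)) / math.sqrt(ss_xx)

T_CRITICAL = 3.182                     # t(.025), 3 degrees of freedom (from the slide)
margin = T_CRITICAL * se_slope
lower, upper = slope - margin, slope + margin
contains_zero = lower < 0 < upper

print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f}); contains 0: {contains_zero}")
```

The interval is very wide relative to the estimate, a direct consequence of having only five observations.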
11.5: The Coefficients of Correlation and Determination
The coefficient of correlation, $r$, is a measure of the strength of the linear relationship between two variables. It is computed as follows:
$r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$
[Figure: three scatter plots of y against x: positive linear relationship ($r \to +1$), no linear relationship ($r \approx 0$), negative linear relationship ($r \to -1$)]
Values of $r$ equal to +1 or -1 require each point in the scatter plot to lie on a single straight line.
In the example about home runs and errors, $SS_{xy} = -143.8$ and $SS_{xx} = 2509$. $SS_{yy}$ is computed as
$SS_{yy} = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n} = 50856 - \dfrac{492^2}{5} = 2443.2$
so
$r = \dfrac{-143.8}{\sqrt{(2509)(2443.2)}} \approx -.058$
An $r$ value this close to zero suggests there may not be a linear relationship between the variables, which is consistent with our earlier look at the null hypothesis and the confidence interval on $\beta_1$.
The coefficient of determination, $r^2$, represents the proportion of the total sample variability around the mean of $y$ that is explained by the linear relationship between $x$ and $y$.
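Both coefficients can be computed from the home runs and errors data. The sketch below uses $r = SS_{xy}/\sqrt{SS_{xx} SS_{yy}}$ and, as an assumed equivalent form for the proportion of explained variability, $r^2 = (SS_{yy} - SSE)/SS_{yy}$, and checks that the two routes agree.

```python
import math

home_runs = [158, 155, 139, 191, 124]  # x
errors = [126, 87, 65, 95, 119]        # y
n = len(home_runs)

x_bar = sum(home_runs) / n
y_bar = sum(errors) / n
ss_xx = sum((x - x_bar) ** 2 for x in home_runs)
ss_yy = sum((y - y_bar) ** 2 for y in errors)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, errors))

r = ss_xy / math.sqrt(ss_xx * ss_yy)   # coefficient of correlation
r_squared = r ** 2                     # coefficient of determination

# Same quantity via the explained share of the variability around y_bar.
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(home_runs, errors))
explained_share = (ss_yy - sse) / ss_yy

print(f"SSyy = {ss_yy:.1f}, r = {r:.4f}, r^2 = {r_squared:.4f}")
print(f"(SSyy - SSE) / SSyy = {explained_share:.4f}")
```

Here the linear relationship with home runs accounts for well under 1% of the sample variability in errors.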
- Predict values of $y$ with the mean of $y$ if no other information is available.
- Predict values of $y|x$ based on a hypothesized linear relationship.
- Evaluate the power of $x$ to predict values of $y$ with the coefficient of determination.
High $r^2$:
- $x$ provides important information about $y$
- Predictions are more accurate based on the model
Low $r^2$:
- Knowing values of $x$ does not substantially improve predictions on $y$
- There may be no relationship between $x$ and $y$, or it may be more subtle than a linear relationship
11.6: Using the Model for Estimation and Prediction
Statistical inference based on the linear regression model:
- Estimate the mean of $y$ for a specific value of $x$, $E(y)|x$ (over many experiments with this x-value)
- Estimate an individual value of $y$ for a given $x$ value (for a single experiment with this value of $x$)
Based on our model results, a team that hits 140 home runs is expected to make about 99.2 errors:
$\hat{y} = 107.19 - .0573(140) \approx 99.2$
A 95% Prediction Interval for an Individual Team’s Errors
$\hat{y} \pm t_{.025}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{SS_{xx}}}$, with $t_{.025}$ based on $(n - 2)$ degrees of freedom
Prediction intervals for individual new values of $y$ are wider than confidence intervals on the mean of $y$ because of the extra source of error.
- Error in predicting a mean value of $y|x_p$: error in $E(y|x_p)$
- Error in predicting a specific value of $y|x_p$: error in $E(y|x_p)$ plus sampling error from the $y$ population
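The difference in width can be seen by computing both intervals at the same point. This is a sketch that assumes the standard textbook half-widths, $t_{\alpha/2}\, s \sqrt{1/n + (x_p - \bar{x})^2/SS_{xx}}$ for the mean and $t_{\alpha/2}\, s \sqrt{1 + 1/n + (x_p - \bar{x})^2/SS_{xx}}$ for an individual value, evaluated at $x_p = 140$ home runs with the critical value 3.182 quoted earlier.

```python
import math

home_runs = [158, 155, 139, 191, 124]  # x
errors = [126, 87, 65, 95, 119]        # y
n = len(home_runs)

x_bar = sum(home_runs) / n
y_bar = sum(errors) / n
ss_xx = sum((x - x_bar) ** 2 for x in home_runs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(home_runs, errors))

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(home_runs, errors))
s = math.sqrt(sse / (n - 2))

T_CRITICAL = 3.182   # t(.025) with 3 degrees of freedom
x_p = 140            # home runs for which we predict errors

y_hat = intercept + slope * x_p
distance_term = 1 / n + (x_p - x_bar) ** 2 / ss_xx

ci_half_width = T_CRITICAL * s * math.sqrt(distance_term)      # mean of y at x_p
pi_half_width = T_CRITICAL * s * math.sqrt(1 + distance_term)  # one new y at x_p

print(f"predicted errors at {x_p} home runs: {y_hat:.2f}")
print(f"95% CI half-width for the mean: {ci_half_width:.1f}")
print(f"95% PI half-width for one team: {pi_half_width:.1f}")
```

The extra 1 under the square root is the sampling error of a single observation, which is why the prediction interval is always the wider of the two.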
- Estimating $y$ beyond the range of values associated with the observed values of $x$ can lead to large prediction errors.
- Beyond the range of observed $x$ values, the relationship may look very different.
[Figure: estimated relationship versus true relationship, with the range of observed values of $x$ marked from $X_i$ to $X_j$]
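One simple safeguard is to flag any prediction made outside the observed range of $x$. The sketch below does this for the home runs and errors line; the helper name `predict` and the rounded coefficients (computed from the five data points) are assumptions made for illustration.

```python
def predict(x, intercept, slope, x_min, x_max):
    """Return (prediction, inside_range) for the fitted line at x."""
    inside_range = x_min <= x <= x_max
    return intercept + slope * x, inside_range


# Rounded estimates for the home runs and errors example (assumed from the data).
INTERCEPT, SLOPE = 107.19, -0.0573
X_MIN, X_MAX = 124, 191   # smallest and largest observed home run totals

for x in (140, 300):
    y_hat, inside = predict(x, INTERCEPT, SLOPE, X_MIN, X_MAX)
    note = "within observed range" if inside else "extrapolation, treat with caution"
    print(f"x = {x}: predicted errors = {y_hat:.1f} ({note})")
```

The check does not make an extrapolated prediction any more reliable; it only makes the extrapolation visible.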
11.7: A Complete Example
Step 1: How does the proximity of a fire house ($x$) affect the damages ($y$) from a fire?
- $y = f(x)$
- $y = \beta_0 + \beta_1 x + \varepsilon$
Step 2: The data (found in Table 11.7) produce the following estimates (in thousands of dollars): $\hat{\beta}_0 \approx 10.28$ and $\hat{\beta}_1 \approx 4.92$.
The estimated damages equal $10,280 plus about $4,920 for each mile from the fire station, or
$\hat{y} \approx 10.28 + 4.92x$
Step 3:
- The estimate of the standard deviation, $\sigma$, of $\varepsilon$ is $s = 2.31635$.
- Most of the observed fire damages will be within $2s \approx 4.63$ thousand dollars of the predicted value.
Step 4:
- Test that the true slope is 0: $H_0: \beta_1 = 0$ against $H_a: \beta_1 > 0$.
- SAS automatically performs a two-tailed test, with a reported p-value < .0001. The one-tailed p-value is < .00005, which provides strong evidence to reject the null.
Step 4 (continued):
- A 95% confidence interval on $\beta_1$ from the SAS output is $4.071 \le \beta_1 \le 5.768$.
- The coefficient of determination, $r^2$, is .9235.
- The coefficient of correlation, $r$, is $+\sqrt{.9235} \approx .96$.
Suppose the distance from the nearest station is 3.5 miles. We can estimate the damage with the model estimates.
We’re 95% sure the damage for a fire 3.5 miles from the nearest station will be between $22,324 and $32,667.
Since the x-values in our sample range from .7 to 6.1, predictions about $y$ for x-values beyond this range will be unreliable.
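The point prediction behind this interval can be reproduced from numbers quoted in the slides. Table 11.7 itself is not reproduced here, so this sketch does not refit the model. It assumes the rounded intercept of 10.28 (thousands of dollars) and takes the slope as the midpoint of the reported 95% interval on $\beta_1$, which for a symmetric t-interval equals the point estimate.

```python
# All amounts are in thousands of dollars.
INTERCEPT = 10.28                  # rounded intercept quoted in the slides
SLOPE = (4.071 + 5.768) / 2        # midpoint of the reported 95% interval on the slope
X_MIN, X_MAX = 0.7, 6.1            # range of distances in the sample (miles)
PI_LOW, PI_HIGH = 22.324, 32.667   # reported 95% prediction interval at 3.5 miles

distance = 3.5
inside_range = X_MIN <= distance <= X_MAX   # guard against extrapolation

y_hat = INTERCEPT + SLOPE * distance        # predicted damage
pi_midpoint = (PI_LOW + PI_HIGH) / 2        # centre of the reported interval

print(f"distance inside sampled range: {inside_range}")
print(f"predicted damage: {y_hat:.2f} thousand dollars")
print(f"midpoint of reported prediction interval: {pi_midpoint:.2f} thousand dollars")
```

The predicted damage lands at the centre of the reported prediction interval, as it should, since the interval is built symmetrically around $\hat{y}$.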