
STAT6202_Ch1_PrintsC_1213.ppt
- Number of slides: 57
STAT 6202: Introductory Statistical Methods (B)
- Dr. Vera Raats
- Email: vera@stats.ucl.ac.uk
- Office hours: ?
- Office: Room 140, 1-19 Torrington Place
- Course website: Moodle
STAT 6202 Chapter 1 2012/2013
ADMINISTRATION
- COURSE SET-UP
  - Weekly lectures: 2 hours
  - Weekly compulsory tutorials: 1 hour
- COURSE NOTES AND BOOKS
  - Print a copy of the course notes yourself
  - Supplementary reading list (not compulsory)
- COURSE ASSESSMENT
  - Two parts
    - ICA (20%): 45 minutes on 14 Nov 2012
    - Exam (80%): 2.5 hours in term 3
  - Both open book (any printed/written material) and compulsory
- CALCULATORS: only college-approved ones are allowed
- MOODLE COURSE WEBSITE: enrolment key is Milly 6202
CHAPTER 1: Data summary and presentation
- POPULATION AND SAMPLES
- TYPES OF VARIABLES
- SUMMARY STATISTICS
- TABLES AND GRAPHS
- BOXPLOT AND SKEWNESS
- LINEAR TRANSFORMATIONS
- LOG TRANSFORMATIONS
POPULATION AND SAMPLES
- IN A SIMPLE RANDOM SAMPLE EACH INDIVIDUAL IS EQUALLY LIKELY TO GET INTO THE SAMPLE.
- AN EXAMPLE?
  - The managing director of a large multinational company is interested in the age of managers across her company.
  - The statistician picks a manager at random from a database of UK managers and calls them. If he cannot make contact, he calls another manager (picked at random from the database) until he has information about 20 managers.
POPULATION AND SAMPLES: Drawing a simple random sample
- ASSIGN A UNIQUE NUMBER TO EACH MEMBER OF THE POPULATION.
- THEN DRAW RANDOM NUMBERS.
  - One option: Table 1
  - Another (often used) option: computer-generated random numbers
- THE SAMPLE CONSISTS OF THE POPULATION MEMBERS WITH THE (RANDOMLY) SELECTED NUMBERS.
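The three steps above can be sketched in Python using computer-generated random numbers. This is only an illustration: the population of 100 managers, the function name `simple_random_sample` and the seed are assumptions made for the example, not part of the course material.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct members so that every member is equally likely to be chosen."""
    rng = random.Random(seed)
    # Step 1: assign a unique number 1..N to each member of the population.
    numbered = {number: member for number, member in enumerate(population, start=1)}
    # Step 2: draw k distinct random numbers from 1..N.
    chosen_numbers = rng.sample(sorted(numbered), k)
    # Step 3: the sample consists of the members with the selected numbers.
    return [numbered[number] for number in chosen_numbers]


# Hypothetical population: 100 managers identified by a label.
managers = [f"manager_{i}" for i in range(1, 101)]
sample = simple_random_sample(managers, 20, seed=1)
print(sample)
```

Passing a seed only makes the draw reproducible; leaving it out gives a fresh sample on every run.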
TYPES OF VARIABLES
- QUALITATIVE
  - Categorical (nominal): hair color, smoking status
  - Ordinal (ordered categorical): degree class, severity of illness
- QUANTITATIVE
  - Discrete (count): no. of children in family, no. of accidents in a week
  - Continuous (measurements): weight, time
SUMMARY STATISTICS: Overview
- LOCATION:
  - Mean
  - Median
  - Quartiles
- SPREAD:
  - Range (min, max)
  - Interquartile range (IQR)
  - Variance
    - Or equivalently: standard deviation
SUMMARY STATISTICS: Notation
- NOTATION:
  - Sample size: $n$
  - Data values: $x_1, x_2, \dots, x_n$
- AN EXAMPLE:
  - Score data:
    48 0 87 15 86 55 78 32 60 79 0 65 48 52 71 71 59
    59 61 39 62 42 68 77 25 52 29 79 40 82 84 22 0 88
  - Number of scores: $n = 34$
  - $x_1 = 48$, $x_2 = 0$, …, $x_{34} = 88$
SUMMARY STATISTICS: Measures of location: mean
- MEAN VALUE: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- SCORE DATA EXAMPLE:
  - $n = 34$
  - Mean: $\bar{x} = 1815 / 34 \approx 53.38$
SUMMARY STATISTICS: Measures of location: median
- MEDIAN:
  - “Middle” observation
  - Calculations:
    (i) Put the observations in increasing order
    (ii) If $n$ is odd: the median is the value of the $\frac{n+1}{2}$th observation
    (iii) If $n$ is even: the median is the average of the $\frac{n}{2}$th and $(\frac{n}{2}+1)$th observations
- SCORE DATA EXAMPLE:
    0 0 0 15 22 25 29 32 39 40 42 48 48 52 52 55 59
    59 60 61 62 65 68 71 71 77 78 79 79 82 84 86 87 88
  - 34 scores ($n = 34$)
  - Since $n$ is even, the median is the average of the 17th and 18th observations, so median = (59 + 59)/2 = 59
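The median rule above can be written out directly. The sketch below follows the odd/even cases from the slide and applies them to the score data; the function name `median` is chosen for this example only.

```python
def median(values):
    """Median following the slide's rule: middle value, or average of the two middle values."""
    ordered = sorted(values)          # (i) put observations in increasing order
    n = len(ordered)
    if n % 2 == 1:
        # (ii) n odd: the ((n + 1) / 2)th observation (lists are 0-indexed, hence the -1)
        return ordered[(n + 1) // 2 - 1]
    # (iii) n even: average of the (n / 2)th and (n / 2 + 1)th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


scores = [48, 0, 87, 15, 86, 55, 78, 32, 60, 79, 0, 65, 48, 52, 71, 71, 59,
          59, 61, 39, 62, 42, 68, 77, 25, 52, 29, 79, 40, 82, 84, 22, 0, 88]

print(median(scores))  # 59.0, the average of the 17th and 18th ordered scores
```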
SUMMARY STATISTICS: Measures of location: quartiles
- QUARTILES:
  - Lower quartile: $q_L$ = value that ¼ of the data points are less than
  - Upper quartile: $q_U$ = value that ¾ of the data points are less than
  - The lower and upper quartiles are also called the 25th and 75th percentiles
  - Calculations:
    (i) Put the data values in increasing order
    (ii) If $n/4$ is not a whole number:
      - $q_L$ is the $j$th value, where $j$ is the next integer larger than $n/4$
      - $q_U$ is the $k$th value, where $k$ is the next integer larger than $3n/4$
    (iii) If $n/4$ is a whole number:
      - $q_L$ is the average of the $\frac{n}{4}$th and $(\frac{n}{4}+1)$th values
      - $q_U$ is the average of the $\frac{3n}{4}$th and $(\frac{3n}{4}+1)$th values
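A minimal sketch of the quartile rule above, applied to the score data from the notation slide. Note that statistical software often uses slightly different quartile conventions, so results from other tools may differ; the function name `quartiles` is an assumption for this example.

```python
def quartiles(values):
    """Return (q_L, q_U) using the rule stated on the slide."""
    ordered = sorted(values)              # (i) increasing order
    n = len(ordered)
    if n % 4 != 0:
        # (ii) n/4 not whole: j and k are the next integers above n/4 and 3n/4
        j = n // 4 + 1
        k = 3 * n // 4 + 1
        return ordered[j - 1], ordered[k - 1]
    # (iii) n/4 whole: average the (n/4)th and (n/4 + 1)th values, likewise for 3n/4
    j = n // 4
    k = 3 * n // 4
    q_lower = (ordered[j - 1] + ordered[j]) / 2
    q_upper = (ordered[k - 1] + ordered[k]) / 2
    return q_lower, q_upper


scores = [48, 0, 87, 15, 86, 55, 78, 32, 60, 79, 0, 65, 48, 52, 71, 71, 59,
          59, 61, 39, 62, 42, 68, 77, 25, 52, 29, 79, 40, 82, 84, 22, 0, 88]

q_lower, q_upper = quartiles(scores)
print(q_lower, q_upper)  # 39 77: the 9th and 26th ordered scores (n/4 = 8.5, 3n/4 = 25.5)
```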
SUMMARY STATISTICS: Measures of spread
- RANGE: largest data value − smallest data value
- INTERQUARTILE RANGE: IQR = $q_U - q_L$
  - IQR = distance between the upper and lower quartiles
- SAMPLE VARIANCE: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- STANDARD DEVIATION: square root of the variance, $s = \sqrt{s^2}$
- On a calculator: σn-1
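As a rough illustration of these measures, the sketch below computes the range, sample variance and standard deviation of the score data. It assumes the sample variance uses the $n-1$ divisor, which is what the calculator key σn-1 suggests; Python's `statistics.variance` and `statistics.stdev` use the same divisor.

```python
import math
import statistics

scores = [48, 0, 87, 15, 86, 55, 78, 32, 60, 79, 0, 65, 48, 52, 71, 71, 59,
          59, 61, 39, 62, 42, 68, 77, 25, 52, 29, 79, 40, 82, 84, 22, 0, 88]

# Range: largest data value minus smallest data value.
data_range = max(scores) - min(scores)

# Sample variance written out by hand, with divisor n - 1.
n = len(scores)
mean = sum(scores) / n
manual_variance = sum((x - mean) ** 2 for x in scores) / (n - 1)

# The standard library gives the same quantities.
sample_variance = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

print(data_range)                    # 88
print(manual_variance, sample_variance)
print(sample_sd, math.sqrt(sample_variance))
```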
TABLES AND GRAPHS: Frequency table
- FREQUENCY TABLE
  - What is your favourite football team?
  - Frequency table:

    Categories       Frequency   Relative frequency
    Liverpool        80          0.40
    Manchester Un.   75          0.375
    Arsenal          30          0.15
    …                …           …
    Total            200         1
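The relative frequencies in the table are simply each frequency divided by the total. A small sketch, using only the three teams listed on the slide and the stated total of 200 (the remaining categories are elided in the source and are not filled in here):

```python
# Frequencies as listed on the slide; other categories are elided in the source.
frequencies = {"Liverpool": 80, "Manchester Un.": 75, "Arsenal": 30}
total = 200  # total from the table, which includes the categories not listed

# Relative frequency = frequency / total number of observations.
relative = {team: count / total for team, count in frequencies.items()}

for team, value in relative.items():
    print(f"{team:<15} {frequencies[team]:>4} {value:.3f}")
```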
TABLES AND GRAPHS: A graph overview
- DISPLAYS OF VARIABLES:

    Type of variable   Graph
    Qualitative        Bar plot
    Discrete           Relative frequency chart (+ below)
    Continuous         Dot plot, stem plot, histogram, box plot
TABLES AND GRAPHS: Qualitative variables: bar plot (1)
- BACK TO FOOTBALL EXAMPLE
  - Favourite football team

    Categories       Frequency
    Liverpool        80
    Manchester Un.   75
    Arsenal          30
    …                …
    Total            200
- BAR PLOT
  [Figure: bar plot of favourite football team]
TABLES AND GRAPHS: Qualitative variables: bar plot (2)
- NOTES FOR BAR PLOT:
  - Two axes:
    - One with frequencies
    - One with categories
  - Length of bar represents frequency
  - Often bars are ordered by size
- GENERAL NOTE:
  - Make sure to label and mark axes clearly
TABLES AND GRAPHS: Discrete variables: relative frequency graph (1)
- NOTES FOR RELATIVE FREQUENCY GRAPH:
  - Two axes:
    - One with relative frequencies
    - One with outcomes
  - Length of bar represents relative frequency
- AN EXAMPLE:
  - How many children do you have?

    Number of children   0        1        …   8        Total
    Frequency            20       25       …   1        80
    Relative frequency   0.2500   0.3125   …   0.0125   1
TABLES AND GRAPHS: Discrete variables: relative frequency graph (2)
- AN EXAMPLE:
  - How many children do you have?

    Number of children   0        1        …   8        Total
    Frequency            20       25       …   1        80
    Relative frequency   0.25     0.3125   …   0.0125   1

  - Relative frequency graph
    [Figure: relative frequency graph of number of children]
TABLES AND GRAPHS: Quantitative variables: dot plot (1)
- NOTES FOR DOT PLOT:
  - Optional (but often useful): order your outcomes
  - One axis with scale for outcomes
  - Outcomes represented by dots
  - Identical outcomes are stacked above each other
- AN EXAMPLE (1.3, page 4 of course notes)
  - Blood pressure (in mmHg) of 21 women:
    152 105 123 131 99 115 149 137 126 124 128
    143 150 112 135 130 123 118 122 136 141
TABLES AND GRAPHS: Quantitative variables: dot plot (2)
- AN EXAMPLE (1.3, page 4 of course notes)
  - Blood pressure (in mmHg) of 21 women:
    152 105 123 131 99 115 149 137 126 124 128
    143 150 112 135 130 123 118 122 136 141
  - Ordered data:
    99 105 112 115 118 122 123 123 124 126 128
    130 131 135 136 137 141 143 149 150 152
  - Dot plot:
    [Figure: dot plot of the blood pressure data]
TABLES AND GRAPHS: Quantitative variables: stem plot (1)
- NOTES FOR STEM PLOT:
  - Optional (but often useful): order your outcomes
  - Often: take the last digit as “leaf”, the rest is the stem
  - Usually aim at 5-15 stem values, otherwise
    - Round values (example on page 5 of course notes)
    - Repeat stem (example in a couple of slides)
  - Stem plot is less suitable for large amounts of data
- BACK TO EXAMPLE 1.3:
  - Ordered blood pressure (in mmHg) data:
    99 105 112 115 118 122 123 123 124 126 128
    130 131 135 136 137 141 143 149 150 152
TABLES AND GRAPHS: Quantitative variables: stem plot (2)
- BACK TO EXAMPLE 1.3:
  - Ordered blood pressure (in mmHg) data:
    99 105 112 115 118 122 123 123 124 126 128
    130 131 135 136 137 141 143 149 150 152
  - Stem plot:

    Stem   Leaves
     9     9
    10     5
    11     2 5 8
    12     2 3 3 4 6 8
    13     0 1 5 6 7
    14     1 3 9
    15     0 2
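The stem plot for Example 1.3 can be reproduced with a few lines of Python. This sketch takes the last digit as the leaf and the remaining digits as the stem, as described in the notes for stem plots; the function name `stem_plot` and the text layout are choices made for this example.

```python
from collections import defaultdict


def stem_plot(values):
    """Return the lines of a stem plot with the last digit as leaf."""
    stems = defaultdict(list)
    for value in sorted(values):           # ordering first gives ordered leaves
        stems[value // 10].append(value % 10)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


blood_pressure = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
                  143, 150, 112, 135, 130, 123, 118, 122, 136, 141]

for line in stem_plot(blood_pressure):
    print(line)
```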
TABLES AND GRAPHS: Quantitative variables: stem plot (3)
- SOMETIMES IT IS GOOD TO USE DIFFERENT INTERVALS
  - E.g. 5 rather than 10
- AN EXAMPLE
  - Data:
    24 27 28 28 29 30 31 32
    32 34 36 36 39 40 45
  - Stem plot:

    Categories   Stem   Leaves
    20-24        2      4
    25-29        2      7 8 8 9
    30-34        3      0 1 2 2 4
    35-39        3      6 6 9
    40-44        4      0
    45-49        4      5
AREA (OR SURFACE)
- AREA = width x height
- Width 2, height 3: area = 2 x 3 = 6
- Width 2 meter, height 3 meter: area = 2 meter x 3 meter = 6 meter²
- Width 2 meter, height 3 men/meter: area = 2 meter x 3 men/meter = 6 men
TABLES AND GRAPHS: Quantitative data: continuous – big amount (1)
- FREQUENCY HISTOGRAM
  - Horizontal axis: variable of interest
  - Vertical axis: frequency density
  - The area (or surface) represents the frequency!
  - An example:

    Weight      Frequency
    [50, 60)    10
    [60, 70)    20
    [70, 80)    20
    [80, 90)    5
    [90, 100)   5
    Total       60
TABLES AND GRAPHS: Quantitative data: continuous – big amount (3)
- FREQUENCY HISTOGRAM
  - Horizontal axis: variable of interest
  - Vertical axis: frequency density
  - The area (or surface) represents the frequency!
  - An example:

    Weight      Frequency
    [50, 60)    10
    [60, 70)    20
    [70, 100)   30
    Total       60
TABLES AND GRAPHS: Quantitative data: continuous – big amount (4)
- RELATIVE FREQUENCY HISTOGRAM
  - Horizontal axis: variable of interest
  - Vertical axis: relative frequency density
  - The area (or surface) represents the relative frequency!
  - An example:

    Weight      Freq.   Relative frequency
    [50, 60)    10      0.167
    [60, 70)    20      0.333
    [70, 100)   30      0.500
    Total       60      1
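Because the area of each bar represents the (relative) frequency, the bar height is the frequency divided by the class width. The sketch below works this out for the weight example with unequal class widths; the variable names are chosen for this illustration only.

```python
# Class intervals [low, high) with their frequencies, from the weight example.
classes = [((50, 60), 10), ((60, 70), 20), ((70, 100), 30)]
total = sum(frequency for _, frequency in classes)

densities = []           # bar heights of the frequency histogram
relative_densities = []  # bar heights of the relative frequency histogram
for (low, high), frequency in classes:
    width = high - low
    densities.append(frequency / width)
    relative_densities.append((frequency / total) / width)

for ((low, high), frequency), height in zip(classes, densities):
    # height x width recovers the frequency, i.e. the area of the bar
    print(f"[{low}, {high}): height {height:.2f}, area {height * (high - low):.0f}")
```

The wide class [70, 100) holds the most observations but gets the same bar height as [50, 60), since its 30 observations are spread over a width of 30.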
QUIZ
- 50 of the people questioned owned 1 computer
- Frequency histogram
  [Figure: frequency histogram of number of computers owned]
- How many people owned 2 or 3 computers?
  (a) 50   (b) 100   (c) 200
TABLES AND GRAPHS: Quantitative variables: histogram (5)
- EXAMPLE CONTINUED:
  - Relative frequency histogram
    [Figure: relative frequency histogram]
BOXPLOT AND SKEWNESS: Boxplot
- BOX PLOT:
  - Five figure summary: min, $q_L$, median, $q_U$, max
  - Whiskers are drawn to the minimum and maximum unless there are extreme points
  - Extreme points:
    - Smaller than or equal to $q_L - 1.5\,\mathrm{IQR}$
    - Bigger than or equal to $q_U + 1.5\,\mathrm{IQR}$
  - If there are extreme points, the whiskers are drawn to the last points still within the range ($q_L - 1.5\,\mathrm{IQR}$, $q_U + 1.5\,\mathrm{IQR}$)
- NOTES ON BOX PLOT:
  - Schematic plot to summarize a data set
  - Focusses on location, spread and shape
  - Sample size should not be too small
  - When used to compare groups, only compare groups with similar sample sizes
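The ingredients of a box plot can be computed without drawing anything. The sketch below applies the rules above to the blood pressure data of Example 1.3; using that data set here is a choice made for illustration, and the quartiles follow the rule from the quartiles slide, so other software may give slightly different values.

```python
def quartiles(values):
    """Return (q_L, q_U) using the course rule for quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 4 != 0:
        # next integers above n/4 and 3n/4 (1-based positions)
        return ordered[n // 4], ordered[3 * n // 4]
    j, k = n // 4, 3 * n // 4
    return (ordered[j - 1] + ordered[j]) / 2, (ordered[k - 1] + ordered[k]) / 2


def box_plot_summary(values):
    """Quartiles, extreme points and whisker ends for a box plot."""
    ordered = sorted(values)
    q_lower, q_upper = quartiles(ordered)
    iqr = q_upper - q_lower
    low_limit = q_lower - 1.5 * iqr
    high_limit = q_upper + 1.5 * iqr
    # Extreme points lie on or beyond the limits.
    extremes = [x for x in ordered if x <= low_limit or x >= high_limit]
    # Whiskers go to the last points strictly inside the limits.
    inside = [x for x in ordered if low_limit < x < high_limit]
    return {"q_lower": q_lower, "q_upper": q_upper, "iqr": iqr,
            "extremes": extremes, "whiskers": (inside[0], inside[-1])}


blood_pressure = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
                  143, 150, 112, 135, 130, 123, 118, 122, 136, 141]

summary = box_plot_summary(blood_pressure)
print(summary)
```

For these data the lower limit is 122 − 1.5 × 15 = 99.5, so the smallest value (99) counts as an extreme point and the lower whisker stops at 105.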
BOXPLOT AND SKEWNESS: Skewness
- ARE THE DATA SYMMETRIC OR SKEWED?
  - If the data have a tail, they are not symmetric
    - Left or negatively skewed: mean < median
    - Right or positively skewed: mean > median
  - NOTE: for skewed data the median is a better measure of what is typical than the mean!
BOXPLOT AND SKEWNESS: Skewness: an example
- Median: $n = 7$ (odd), so the median is the 4th value: 2
DESCRIPTIVE STATISTICS: What are we trying to do?
- DETECT MAIN DATA FEATURES AND PATTERNS
  - Location, spread, shape
  - Exceptions to the general pattern, e.g. outliers
  - Comparison between groups (see appendix)
  - Relationships between variables
- SUMMARIZE THE ESSENTIAL INFORMATION
  - Often a combination of text, summary statistics and graphs is best
- PRESENT RESULTS INFORMATIVELY AND WITH INTEGRITY
- SEE THROUGH INADEQUATE AND MISLEADING PRESENTATION
LINEAR TRANSFORMATIONS: The theory
- SAY WE HAVE DATA $x_i$ WITH MEAN $\bar{x}$ AND VARIANCE $s_x^2$
- LINEAR TRANSFORMATION OF $x_i$: $u_i = a + b x_i$
- WHAT ARE THE MEAN AND VARIANCE OF $u_i$?
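One way to answer the question is to substitute $u_i = a + b x_i$ into the definitions of the sample mean and the sample variance. The derivation below is a sketch that assumes the $n-1$ divisor for the variance and writes $\bar{u}$, $s_u^2$ for the mean and variance of the transformed data; these symbols are introduced here for illustration.

```latex
\begin{align*}
\bar{u} &= \frac{1}{n}\sum_{i=1}^{n} (a + b x_i)
         = a + b \cdot \frac{1}{n}\sum_{i=1}^{n} x_i
         = a + b\,\bar{x}, \\
s_u^2   &= \frac{1}{n-1}\sum_{i=1}^{n} (u_i - \bar{u})^2
         = \frac{1}{n-1}\sum_{i=1}^{n} \bigl(b x_i - b\,\bar{x}\bigr)^2
         = b^2 s_x^2, \\
s_u     &= |b|\, s_x .
\end{align*}
```

Adding a constant $a$ shifts the mean but leaves the spread unchanged, while multiplying by $b$ scales the mean by $b$ and the variance by $b^2$.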
COURSE MARKS (2)
- SOME SUMMARY STATISTICS:
  - Number of marks: 177
  - Mean mark: 56.85
  - Median mark: 59
  - Lowest mark: 11
  - Highest mark: 89
  - Standard deviation: 17.99
  - Variance: 323.70
LINEAR TRANSFORMATIONS: An example: the course marks
- TO RESCALE, THREE OPTIONS ARE CONSIDERED
  1. Give everyone 5 marks extra
  2. Give everyone 5% of their marks extra
  3. Give everyone 5% of their marks extra and then add 5 marks
- MEAN AND VARIANCE OF RESCALED MARKS
  1. $u_i = 5 + 1 \cdot x_i$
  2. $u_i = 0 + 1.05 \cdot x_i$
  3. $u_i = 5 + 1.05 \cdot x_i$
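For a linear transformation $u_i = a + b x_i$ the mean becomes $a + b\bar{x}$ and the variance becomes $b^2 s_x^2$. The sketch below applies this to the three rescaling options, assuming the marks being rescaled are the ones summarised on the course marks slide (mean 56.85, variance 323.70).

```python
# Summary statistics of the original course marks (from the course marks slide).
mean_x = 56.85
var_x = 323.70

# Each option is a pair (a, b) in u = a + b * x.
options = {
    1: (5, 1.0),    # 5 marks extra
    2: (0, 1.05),   # 5% of their marks extra
    3: (5, 1.05),   # 5% extra, then 5 marks
}

rescaled = {}
for option, (a, b) in options.items():
    mean_u = a + b * mean_x     # the mean shifts by a and scales by b
    var_u = b ** 2 * var_x      # the variance only scales, by b squared
    rescaled[option] = (mean_u, var_u)
    print(f"Option {option}: mean {mean_u:.2f}, variance {var_u:.2f}")
```

Option 1 leaves the spread of the marks unchanged, whereas options 2 and 3 both increase the variance by a factor of 1.05² = 1.1025.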
LINEAR TRANSFORMATIONS: Standardisation
- FOR A SAMPLE $x_1, x_2, \dots, x_n$ WITH MEAN $\bar{x}$ AND STANDARD DEVIATION $s_x$, LET $z_i = \frac{x_i - \bar{x}}{s_x}$
- THEN WE HAVE $\bar{z} = 0$ AND $s_z = 1$
- HENCE $z_1, z_2, \dots, z_n$ IS CALLED THE STANDARDISED SAMPLE
  - And this process is called standardisation
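Standardisation is a linear transformation with $a = -\bar{x}/s_x$ and $b = 1/s_x$, so the standardised sample has mean 0 and standard deviation 1. A quick numerical check on the score data from the notation slide, using a helper `standardise` named for this example:

```python
import statistics


def standardise(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (divisor n - 1)
    return [(x - mean) / sd for x in values]


scores = [48, 0, 87, 15, 86, 55, 78, 32, 60, 79, 0, 65, 48, 52, 71, 71, 59,
          59, 61, 39, 62, 42, 68, 77, 25, 52, 29, 79, 40, 82, 84, 22, 0, 88]

z = standardise(scores)
print(round(statistics.mean(z), 10), round(statistics.stdev(z), 10))
```

Up to floating-point rounding, the printed mean is 0 and the printed standard deviation is 1.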
LOG TRANSFORMATIONS: The log function: what is it?
- THE INVERSE FUNCTION OF THE “POWER” FUNCTION
- A “10” EXAMPLE
  - $10^2 = 100$
  - $\log_{10}(100) = 2$
- In general
  - $a^y = x$
  - $\log_a(x) = y$
- Natural logs
  - $e \approx 2.718281828$
  - $e^y = x$
  - $\log_e(x) = y$
- $\log_e(x) = \log_e(10) \cdot \log_{10}(x) = 2.3026 \cdot \log_{10}(x)$
LOG TRANSFORMATIONS: The log function: some properties
- SOME PROPERTIES:
  - $\log_a(x)$ only exists for $x > 0$
  - $\log_a(x \cdot z) = \log_a(x) + \log_a(z)$
  - $\log_a(x / z) = \log_a(x) - \log_a(z)$
    - Equal ratios ↔ equal differences of $\log_a$, e.g. $20/10 = 2/1$, so $\log_a(20) - \log_a(10) = \log_a(2) - \log_a(1)$
LOG TRANSFORMATIONS: The log function: some approximations
- APPROXIMATION PROPERTY
  - The difference between two numbers, as a fraction of their mean, approximately equals the difference between their natural logs:
    $\frac{x - z}{(x + z)/2} \approx \log_e(x) - \log_e(z)$
  - This works well for fractional differences up to 0.5
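The approximation property can be checked numerically. The sketch below compares the fractional difference with the difference of natural logs for a few pairs of numbers; the pairs themselves are made up for illustration.

```python
import math


def fractional_difference(x, z):
    """Difference between x and z as a fraction of their mean."""
    return (x - z) / ((x + z) / 2)


def log_difference(x, z):
    """Difference between the natural logs of x and z."""
    return math.log(x) - math.log(z)


# Hypothetical pairs, from a small to a large relative difference.
pairs = [(105, 100), (120, 100), (150, 100), (300, 100)]

for x, z in pairs:
    frac = fractional_difference(x, z)
    logd = log_difference(x, z)
    print(f"x={x}, z={z}: fractional {frac:.4f}, log difference {logd:.4f}")
```

For the first three pairs the fractional difference is at most 0.4 and the two columns agree closely; for the last pair the fractional difference is 1.0 and the gap is clearly visible.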
LOG TRANSFORMATIONS: Note for transforming data
- SAMPLE $x_1, x_2, \dots, x_n$
  - All $x_1, x_2, \dots, x_n$ need to be larger than 0
  - The transformation results in a sample of natural logarithms: $\log_e(x_1), \log_e(x_2), \dots, \log_e(x_n)$
- HOWEVER, PLEASE NOTE
  - Mean of $\log_e(x)$ ≠ $\log_e$(mean of $x$)
  - The standard deviation of $\log_e(x)$ is approximately equal to the standard deviation of $x$ divided by the mean of $x$ (this works well for relative standard deviations up to about 0.5)
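Both notes can be illustrated on a small data set. The sketch below uses the blood pressure data of Example 1.3, which is an arbitrary choice of positive data for this illustration, and compares the quantities mentioned above.

```python
import math
import statistics

blood_pressure = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
                  143, 150, 112, 135, 130, 123, 118, 122, 136, 141]

# Log-transform the sample (all values must be larger than 0).
logs = [math.log(x) for x in blood_pressure]

mean_of_logs = statistics.mean(logs)
log_of_mean = math.log(statistics.mean(blood_pressure))

sd_of_logs = statistics.stdev(logs)
relative_sd = statistics.stdev(blood_pressure) / statistics.mean(blood_pressure)

print(f"mean of logs {mean_of_logs:.4f} vs log of mean {log_of_mean:.4f}")
print(f"sd of logs {sd_of_logs:.4f} vs sd / mean {relative_sd:.4f}")
```

The mean of the logs comes out slightly below the log of the mean, while the standard deviation of the logs is close to the relative standard deviation, which is small for these data.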
APPENDIX: Comparing groups
COMPARING GROUPS: Bar plot (1)
- AN EXAMPLE
  - Type of employment by sex (USA workforce, 1986)

    Type of employment   Male (millions)   Female (millions)
    Professional         15.00             11.60
    Industrial           12.90             4.45
    Craftsmen            12.30             1.25
    Sales                6.90              6.45
    Service              5.80              9.60
    Clerical             3.50              14.30
    Agriculture          2.90              0.65
COMPARING GROUPS: Bar plot (2)
- EMPLOYMENT EXAMPLE (CONTINUED)
  - Bar plot: composition of US workforce, 1986 (US Labour Department)
    [Figure: bar plot of type of employment by sex]
COMPARING GROUPS: Dot plot (1)
- AN EXAMPLE
  - Weight gains for chickens on one of two diets

    Control               Experimental
    390  321  366  356    361  447  401  375
    283  349  402  462    434  403  393  426
    356  410  329  399    406  318  467  407
    350  384  316  272    427  420  477  392
    345  455  360  431    430  339  410  326
COMPARING GROUPS: Dot plot (2)
- CHICKEN EXAMPLE CONTINUED
  - Dot plot
    [Figure: dot plots of weight gain for the control and experimental diets]
COMPARING GROUPS: Stem plot (1)
- AN EXAMPLE
  - Number of home runs that Babe Ruth hit in each of his 15 years with the NY Yankees, 1920 to 1934:
    54 59 35 41 46 25 47 60 54 46 49 46 41 34 22

    RAW DATA                  ORDERED SERIES
    Stem   Leaves             Stem   Leaves
    2      5 2                2      2 5
    3      5 4                3      4 5
    4      1 6 7 6 9 6 1      4      1 1 6 6 6 7 9
    5      4 9 4              5      4 4 9
    6      0                  6      0
COMPARING GROUPS: Stem plot (2)
- EXAMPLE CONTINUED
  - Home run data also given for Roger Maris
  - Back-to-back stem plot:

        Babe Ruth   Stem   Roger Maris
                    0      8
                    1      3 4 6
              5 2   2      3 6 8
              5 4   3      3 9
    9 7 6 6 6 1 1   4
            9 4 4   5
                0   6      1
COMPARING GROUPS: Box plot (1)
- AN EXAMPLE: Compare lifetime of two light bulb brands (hundreds of hours)
  - Data:

    BRAND A   BRAND B
    10.500    11.300
    9.100     7.000
    10.225    9.700
    10.000    9.600
    9.450     10.500
    9.600     11.800
    9.700     8.925
COMPARING GROUPS: Box plot (2)
- LIGHT BULB EXAMPLE CONTINUED:

             BRAND A   BRAND B
    Mean     9.796     9.804
    Median   9.700     9.700
    q_L      9.450     8.925
    q_U      10.225    11.100