Chapter 1 Looking at Data

Chapter 1 Looking at Data - Distributions

Introduction • Goal: Using Data to Gain Knowledge • Terms/Definitions: – Individiduals: Units described by or used to obtain data, such as humans, animals, objects (aka experimental or sampling units) – Variables: Characteristics corresponding to individuals that can take on different values among individuals • Categorical Variable: Levels correspond to one of several groups or categories • Quantitaive Variable: Take on numeric values such that arithmetic operations make sense

Introduction • Spreadsheets for Statistical Analyses – Rows: Represent Individuals – Columns: Represent Variables – SPSS, Minitab, EXCEL are examples • Measuring Variables – Instrument: Tool used to make quantitative measurement on subjects (e. g. psychological test or physical fitness measurement) • Independent and Dependent Variables – Independent Variable: Describes a group an individal comes from (categorical) or its level (quantitative) prior to observation – Dependent Variable: Random outcome of interest

Independent and Dependent Variables • Dependent variables are also called response variables • Independent Variables are also called explanatory variables • Marketing: Does amount of exposure effect attitudes? – I. V. : Exposure (in time or number), different subjects receive different levels – D. V. : Measurement of liking of a product or brand • Medicine: Does a new drug reduce heart disease? – I. V. : Treatment (Active Drug vs Placebo) – D. V. : Presence/Absence of heart disease in a time period • Psychology/Finance: Risk Perceptions – I. V. : Framing of Choice (Loss vs Gain) – D. V. : Choice Taken (Risky vs Certain)

Rates and Proportions • Categorical Variables: Typically we count the number with some characteristic in a group of individuals. • The actual count is not a useful summary. More useful summaries include: – Proportion: The number with the characteristic divided by the group size (will lie between 0 and 1) – Percent: # with characteristic per 100 individuals (proportion*100) – Rate per 100, 000: proportion*100, 000

Graphical Displays of Distributions • Graphs of Categorical Variables – Bar Graph: Horizontal axis defines the various categories, heights of bars represent numbers of individuals – Pie Chart: Breaks down a circle (pie) such that the size of the slices represent the numbers of individuals in the categories or percentage of individuals.

Example - AAA Ratings of FL Hotels (Bar Chart)

Example - AAA Ratings of FL Hotels (Pie Chart)

Graphical Displays of Distributions • Graphs of Numeric Variables – Stemplot: Crude, but quick method of displaying the entire set of data and observing shape of distribution 1 Stem: All but rightmost digit, Leaf: Rightmost Digit 2 Put stems in vertical column (small at top), draw vertical line 3 Put leaves in appropriate row in increasing order from stem – Histogram: Breaks data into equally spaced ranges on horizontal axis, heights of bars represent frequencies or percentages

Example: Time (Hours/Year) Lost to Traffic Step 1: Step 2: Step 3: Stems: 10 s of hours Leaves: Hours Stems: 1 2 3 4 5 Stems and Leaves 1 48 2 01244699 3 0112244457778 4 122222245566 5 0336 . Source: Texas Transportation Institute (5/7/2001).

Example: Time (Hours/Year) Lost to Traffic EXCEL Output Note in histogram, the bins represent the number up to and including that number (e. g. T 14, 14<T 21, …, 42<T 49, T>49)

Comparing 2 Groups - Back-to-back Stemplots • Places Stems in Middle, group 1 to left, group 2 to right • Example: Maze Learning: – Groups (I. V. ): Adults vs Children – Measured Response (D. V. ): Average number of Errors in series of Trials

Example - Maze Learning (Average Errors) Stems: Integer parts Leaves: Decimal Parts

Examinining Distributions • Overall Pattern and Deviations • Shape: symmetric, stretched to one direction, multiple humps • Center: Typical values • Spread: Wide or narrow • Outlier: Individual whose value is far from others (see bottom right corner of previous slide) – May be due to data entry error, instrument malfunction, or individual being unusual wrt others

Time Plots -Variable Measured Over Time

Time Plot with Trend/Seasonality

Numeric Descriptions of Distributions • Measures of Central Tendency – Arithmetic Mean: Total equally divided among individual cases – Median: Midpoint of the distribution (M) • Measures of Spread (Dispersion) – Quartiles (first/third): Points that break out the smallest and largest 25% of distribution (Q 1 , Q 3) – 5 Number Summary: (Minimum, Q 1, M, Q 3, Maximum) – Interquartile Range: IQR = Q 3 -Q 1 – Boxplot: Graphical summary of 5 Number Summary – Variance: “Average” squared deviation from mean (s 2) – Standard Deviation: Square root of variance (s)

Measures of Central Tendency • Arithmetic Mean: Obtain the total by summing all values and divide by sample size (“equal allotment” among individuals) • Median: Midpoint of Distribution • Sort values from smallest to largest • If n odd, take the (n+1)/2 ordered value • If n even, take average of n/2 and (n/2)+1 ordered values

2005 Oscar Nominees (Best Picture) • Movie: Domestic Gross/Worldwide Gross – – – The Aviator: $103 M / $214 M Finding Neverland: $52 M / $116 M Million Dollar Baby: $100 M / $216 M Ray: $75 M / $97 M Sideways: $72 M / $108 M • Mean & Median Domestic Gross among nominees ($M):

Delta Flight Times - ATL/MCO Oct, 2004 • • N=372 Flights 10/1/2004 -10/31/2004 Total actual time: 30536 Minutes Mean Time: 30536/372 = 82. 1 Minutes Median: 372/2=186, (372/2)+1=187 – 186 th and 187 th ordered times are 81 minutes: M=81

Measures of Spread • Quartiles: First (Q 1 aka Lower) and Third (Q 3 aka Upper) – Q 1 is the median of the values below the median position – Q 3 is the median of the values below the median position – Notes(See examples on next page): • If n is odd, median position is (n+1)/2, and finding quartiles does not include this value. • If n is even, median position is treated (most commonly) as (n+1)/2 and the two values (positions) used to compute median are used for quartiles.

• Oscar Nominations: – – – # of Individuals: n=5 Median Position: (5+1)/2=3 Positions Below Median Position: 1 -2 Positions Above Median Position: 4 -5 Median of Lower Positions: 1. 5 Median of Lower Positions: 4. 5 • ATL/MCO Flights: – – – # of Individuals: n=372 Median Position: (372+1)/2=186. 5 Positions Below Median Position: 1 -186 Positions Above Median Position: 187 -372 Median of Lower Positions: 93. 5 Median of Upper Positions: 279. 5

Outliers - 1. 5 x. IQR Rule • Outlier: Value that falls a long way from other values in the distribution • 1. 5 x. IQR Rule: An observation may be considered an outlier if it falls either 1. 5 times the interquartile range above third (upper) quartile or the same distance below the first (lower) quartile. • ATL/MCO Data: Q 1=76 Q 3=86 IQR=10 1. 5 x. IQR=15 – “High” Outliers: Above 86+15=101 minutes – “Low” Outliers: Below 76 -15=61 minutes – 12 Flights are at 102 minutes or more (Highest is 122). See (modified) boxplot below

Measures of Spread - Variance and S. D. • Deviation: Difference between an observed value and the overall mean (sign is important): • Variance: “Average” squared deviation (divides the sum of squared deviations by n-1 (as opposed to n) for reasons we see later: • Standard Deviation: Positive square root of s 2

Example - 2005 Oscar Movie Revenues • • • Mean: x=80. 4 The Aviator: i=1 x 1=103 Deviation: 103 -80. 4=22. 6 Finding Neverland: i=2 x 2=52 Dev: 52 -80. 4= -28. 4 Million Dollar Baby: i=3 x 3=100 Dev: 100 -80. 4=19. 6 Ray: i=4 x 4=75 Dev: 75 -80. 4 = -5. 4 Sideways: i=5 x 5=72 Dev: 72 -80. 4 = -8. 4

Computer Output of Summary Measures and Boxplot (SPSS) - ATL/MCO Data

Linear Transformations • Often work with transformed data • Linear Transformation: xnew = a + bx for constants a and b (e. g. transforming from metric system to U. S. , celsius to fahrenheit, etc) • Effects: – Multiplying by b causes both mean and standard deviation to be multiplied by b – Addition by a shifts mean and all percentiles by a but does not effect the standard deviation or spread – Note that for locations, multiplication of b precedes addition of a

Density Curves/Normal Distributions • Continuous (or practically continuous) variables that can lie along a continuous (practically) range of values • Obtain a histogram of data (will be irregular with rigid blocks as bars over ranges) • Density curves are smooth approximations (models) to the coarse histogram – Curve lies above the horizontal axis – Total area under curve is 1 – Area of curve over a range of values represents its probability • Normal Distributions - Family of density curves with very specific properties

Mean and Median of a Density Curve • Mean is the balance point of a distribution of measurements. If the height of the curve represented weight, its where the density curve would balance • Median is the point where half the area is below and half the area is above the point – Symmetric Densities: Mean = Median – Right Skew Densities: Mean > Median – Left Skew Densities: Mean < Median • We will mainly work with means. Notation:

Symmetric (Normal) Distribution

Right Skewed Density Curve

Mean is the Balance Point

Normal Distribution • Bell-shaped, symmetric family of distributions • Classified by 2 parameters: Mean (m) and standard deviation (s). These represent location and spread • Random variables that are approximately normal have the following properties wrt individual measurements: – – Approximately half (50%) fall above (and below) mean Approximately 68% fall within 1 standard deviation of mean Approximately 95% fall within 2 standard deviations of mean Virtually all fall within 3 standard deviations of mean • Notation when X is normally distributed with mean m and standard deviation s :

Two Normal Distributions

Normal Distribution

Example - Heights of U. S. Adults • Female and Male adult heights are well approximated by normal distributions: XF~N(63. 7, 2. 5) XM~N(69. 1, 2. 6) Source: Statistical Abstract of the U. S. (1992)

Standard Normal (Z) Distribution • Problem: Unlimited number of possible normal distributions (- < m < , s > 0) • Solution: Standardize the random variable to have mean 0 and standard deviation 1 • Probabilities of certain ranges of values and specific percentiles of interest can be obtained through the standard normal (Z) distribution

Standard Normal (Z) Distribution Table Area 1 -Table Area z

2 nd Decimal Place I n t g e r p a r t & 1 st D e c i m a l

Finding Probabilities of Specific Ranges • Step 1 - Identify the normal distribution of interest (e. g. its mean (m) and standard deviation (s) ) • Step 2 - Identify the range of values that you wish to determine the probability of observing (XL , XU), where often the upper or lower bounds are or - • Step 3 - Transform XL and XU into Z-values: • Step 4 - Obtain P(ZL Z ZU) from Z-table

Example - Adult Female Heights • What is the probability a randomly selected female is 5’ 10” or taller (70 inches)? • Step 1 - X ~ N(63. 7 , 2. 5) • Step 2 - XL = 70. 0 XU = • Step 3 - • Step 4 - P(X 70) = P(Z 2. 52) = 1 -P(Z 2. 52)=1. 9941=. 0059 ( 1/170)

Finding Percentiles of a Distribution • Step 1 - Identify the normal distribution of interest (e. g. its mean (m) and standard deviation (s) ) • Step 2 - Determine the percentile of interest 100 p% (e. g. the 90 th percentile is the cut-off where only 90% of scores are below and 10% are above). • Step 3 - Find p in the body of the z-table and itscorresponding z-value (zp) on the outer edge: – If 100 p < 50 then use left-hand page of table – If 100 p 50 then use right-hand page of table • Step 4 - Transform zp back to original units:

Example - Adult Male Heights • • • Above what height do the tallest 5% of males lie above? Step 1 - X ~ N(69. 1 , 2. 6) Step 2 - Want to determine 95 th percentile (p =. 95) Step 3 - P(z 1. 645) =. 95 Step 4 - X. 95 = 69. 1 + (1. 645)(2. 6) = 73. 4 (6’, 1. 4”)

Statistical Models • When making statistical inference it is useful to write random variables in terms of model parameters and random errors • Here m is a fixed constant and e is a random variable • In practice m will be unknown, and we will use sample data to estimate or make statements regarding its value

Chapter 1 Looking at Data — Distributions