bb34bcd850975f70c7153af275fa14d0.ppt

- Количество слайдов: 138

Statistical Review Measures of Central Location

SALARY. XLS n Lists starting salaries for 190 graduates from an undergraduate school of business. n The data is in the range named Salary on a sheet called Data. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

The Mean n We calculate the mean salary by entering the formula “=AVERAGE(Salary)” in cell B 6 of the Excel Functions worksheet. n The mean salary is $29, 762. n The mean in this example is a “representative” measure because the distribution of salaries is nearly symmetric. n The mean can be misleading due to skewness. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

The Median n The median is the “middle” observation when the data are listed from smallest to largest. n If there is an odd number of observations, the median is the middle observation. n If there is an even number of observations, we take the median to be the average of the two middle observations. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

The Median -- continued n We calculate the median salary in Example 3. 1 by entering the formula “=MEDIAN(Salary)” in cell B 7 of the Excel Functions worksheet. n The median in this example is $29, 850. n In this case, the mean and the median values are nearly the same because the distribution is approximately symmetric. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

The Median -- continued n If the salary distribution were skewed (for example, a few graduates received abnormally large salaries), the mean would be biased upward while the median would not be affected by the unusual values. n Thus, it is better to use the median in characterizing the center of a distribution when that distribution is skewed. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

The Mode n The mode is the most frequently occurring value. n If the values are essentially continuous, as with the salaries in Example 3. 1, then the mode is essentially irrelevant. There is typically no single value that occurs more than once. n Thus, the mode is not likely to provide much information. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

The Mode

SHOES. XLS n This file lists shoe sizes purchased at a shoe store. n We seek to find the best-selling shoe size at this store. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

The Mode n The mode is the most frequently occurring value. n If the values are essentially continuous, then the mode is usually not relevant. There is typically no single value that occurs more than once. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Why is the mode relevant here? n Shoe sizes come in discrete increments, rather than a continuum; so it makes sense to find the mode, the size that is requested most often, in this example. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Computing the Mode in EXCEL n The mode can be found in Excel by entering the formula “=MODE(Range of Data)”. n Applying this formula in the present example, we find that size 11 is the most frequently purchased shoe size. n This is also apparent form the histogram on the next slide. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Distribution of Shoe Sizes 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Measures of Variability: Variance and Standard Deviation

OTIS 4. XLS n Suppose that Otis Elevator is going to stop manufacturing elevator rails. Instead, it is going to buy them from an outside supplier. n Otis would like each rail to have a diameter of 1 inch. n The company has obtained samples of ten elevator rails from each supplier. They are listed in columns A and B of this Excel file. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Which should Otis prefer? n Observe that the mean, median, and mode are all exactly 1 inch for each of the two suppliers. n Based on these measures, the two suppliers are equally good and right on the mark. However, we when we consider measures of variability, supplier 1 is somewhat better than supplier 2. Why? 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Explanation n The reason is that supplier 2’s rails exhibit more variability about the mean than do supplier 1’s rails. n If we want rails to have a diameter of 1 inch, then more variability around the mean is very undesirable! 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Variance n The most commonly used measures of variability are the variance and standard deviation. n The variance is essentially the average of the squared deviations from the mean. n We say “essentially” because there are two versions of the variance: the population variance and the sample variance. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More on the Variance n The variance tends to increase when there is more variability around the mean. n Indeed, large deviations from the mean contribute heavily to the variance because they are squared. n One consequence of this is that the variance is expressed in squared units (squared dollars, for example) rather than original units. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Standard Deviation n A more intuitive measure of variability is the standard deviation. n The standard deviation is defined to be the square root of the variance. n Thus, the standard deviation is measured in original units, such as dollars, and it is much easier to interpret. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Computing Variance and Standard Deviation in Excel has built-in functions for computing these measures of variability. n The sample variances and standard deviations of the rail diameters from the suppliers in the present example can be found by entering the following formulas: “=VAR(Supplier 1)” in cell E 8 and “=STDEV(Supplier 1)” in cell E 9. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Computing Variances & Standard Deviations -- continued n Of course, enter similar formulas for supplier 2 in cells F 8 and F 9. n As we mentioned earlier, it is difficult to interpret the variances numerically because they are expressed in squared inches, not inches. n All we can say is that the variance from supplier 2 is considerably larger than the variance from supplier 1. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Interpretation of the Standard Deviation n The standard deviations, on the other hand, are expressed in inches. The standard deviation for supplier 1 is approximately 0. 012 inch, and supplier 2’s standard deviation is approximately three times this large. n This is quite a disparity. Hence, Otis will prefer to buy rails from supplier 1. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Interpretation of the Standard Deviation: Rules of Thumb

DOW. XLS n This file contains monthly closing prices for the Dow Jones Index from January 1947 through January 1993. n The monthly returns from the index are also shown starting with February 1947. Each return is the monthly percentage change (expressed) as a decimal) in the index. n How well do the rules of thumb work for these data? 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Rules of Thumb n Many data sets follow “rules of thumb”. n Approximately 68% of the observations are within one standard deviation of the mean. n Approximately 95% of the observations are within two standard deviations of the mean. n Approximately 99. 7% - almost all - of the observations are within three standard deviations of the mean. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Index Time Series Plot n A time series plot of the index show that the index has been increasingly fairly steadily over the period. n Whenever a series indicates a clear trend such as the index does, most of the measures we have been discussing are less relevant. n For example, the mean of the index for this period has at most historical interest. We are probably more interested in predicting the future of the Dow, and the historical mean has little relevance for predicting the future. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Time Series Plot of Dow Closing Index 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Time Series Plot of Dow Returns 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Return Time Series Plot n A time series plot of the returns show no obvious trend over the period. n The measures we have been discussing are relevant in discussing the series of returns, which fluctuate around a stable mean. n We first calculate the mean and standard deviation of the returns by using the Excel functions AVERAGE and STDEV in cells B 4 and B 5. See the table on the next slide. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Rules of Thumb for Dow Jones Data 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Returns -- continued n The average return is 0. 59% and the standard deviation of about 3. 37%. n Therefore, the rules of thumb (if they apply) imply, for example, that about 2/3 of all returns are within the interval 0. 59% + 3. 37%, that is from -2. 78% to 3. 95%. n In order to determine if the rules of thumb apply to these returns, we can use a frequency table. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Creating the Frequency Table n We first enter the upper limits of the suitable categories in the range A 8: A 15. n Any categories can be chosen but it is convenient to choose categories in which each breakpoint is one standard deviation higher than the previous one with the open-ended categories on either end are “more than 3 standard deviations from the mean”. n Next we use the FREQUENCY function to fill in column C. “=FREQUENCY(Returns, Bins)” 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Frequency Table continued n Finally, we use the frequencies in column C to calculate the actual percentage of return within k standard deviations of the mean for k=1, k=2 and k=3 and we compare these with the percentages from the rules of thumb. n The agreement between these percentages is not perfect - there a few more observations within one standard deviation of the mean the rule of thumb predicts - but in general the rules of thumb work quite well. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Obtaining Summary Measures with Add-Ins

SALARY. XLS n Lists starting salaries for 190 graduates from an undergraduate school of business. n The data is in the range named Salary on a sheet called Data. n We need to find a set of useful summary measures for the salaries. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Summary Statistics n To find the summary statistics of a set of data we can use the Stat-Pro Add-In or Excel’s Analysis Tool. Pak. In this example we use the Stat-Pro Add-In. n Begin by placing the cursor anywhere within the data range. Then select Stat. Pro/Summary Stats/One. Variable Summary Stats menu item. n Select all variables you want to summarize, and select the summary measures you want to find from the Available Summary Measures dialog box shown on the next slide. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Available Summary Measures 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

About the Measures n Four measures are selected by default. These are mean, median, Standard Deviation and Count. You can override these. n A typical output appears here. n It includes many of the measures we have discussed plus a few more. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

About the Measures -- continued n The mean absolute deviation is similar to the variance except that it is an average of the absolute (note squared) deviations from the mean. n The kurtosis and skewness indicate the relative peakedness of the distribution and its skewness. n By clicking on any of the cells containing the measures (Column B), you’ll see that Stat. Pro provides the formulas for the outputs. (Analysis Tool. Pak does not do so. ) 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

About the Measures -- continued n The effect of this is that if any of the data changes the summary measures we produced change automatically. n All output is formatted as “numerical” to three decimal places by default. You can reformat them in a more appropriate manner if you would like. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Measures of Association: Covariance and Correlation

EXPENSES. XLS n A survey questions members of 100 households about their spending habits. n The data in this file represent the salary, expense for cultural activities, expense for sports-related activities, and the expense for dining-out for each household over the past year. n Do these variables appear to be related linearly? 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Covariance and Correlation n When we need to summarize the relationship between two variables we can use the measures covariance and correlation. We summarize the type of behavior observed in a scatterplot. n Each measures the strength (and direction) of a linear relationship between two numerical variables. n The relationship is “strong” if the points in a scatterplot cluster tightly around some straight line. If this line rises form left to right then the relationship is “positive”. If it falls from left to right then the relationship is “negative”. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Determining Linear Relationships n Scatterplots of each variable versus each other would provide the answer to the question but six scatterplots would be required, one for each pair. n To get a quick indication of possible linear relationships we can use Stat-Proto obtain a table of correlations and/or covariances. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Table of Correlations and Covariances n To get the table, place the cursor anywhere in the data set and use the Stat. Pro/Summary Stats/Correlations, Covariances menu item and proceed in the obvious way. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Relationships n The only relationships that stand out are the positive relationships between salary and cultural expenses and between salary and dining expenses. n The negative relationships are between cultural and sports-related expenses. n To confirm these graphically we show scatterplots of Salary versus Culture and Culture versus Sports 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot Indicating Positive Relationship 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot Indicating Negative Relationship 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Correlation and Covariance Properties n In general, the following properties are evident from the Table of correlations and covariances. – The correlation between a variable and itself is 1. – The correlation between X and Y is the same as the correlation between Y and X. Therefore, it is sufficient to list the correlations below (or above) the diagonal in the table. (The same is true for the covariances). – The covariance between a variable and itself is the variance of the variable. We indicate this in the heading of the covariance table. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Correlation and Covariance Properties -- continued – It is difficult to interpret the magnitudes of covariances. These depend on the fact that the data are measured in dollars rather than, say, thousands of dollars. It is such easier to interpret the magnitudes of the correlations because they are scaled to be between -1 and +1. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Describing Data Sets with Boxplots

DOW. XLS n This file lists the monthly returns on the Dow from February 1947 through January 1993. n Use a boxplot to summarize the distribution of these returns. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Boxplots n A boxplot is a very useful graphical method for summarizing data. n Boxplots can be used in two ways: either to describe a single variable in a data set or to compare two (or more) variables. n Excel has no boxplot option, but we included this option in the Stat. Pro add-in. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Creating Boxplots n Place the cursor anywhere within the data set, use the Stat. Pro/Charts/Box. Plots(s) menu item and proceed in the obvious way. n Eventually two sheets will be added to your workbook. One has a the boxplot chart, while the other contains summary measures used to form the boxplot. n The following slides show the chart and the summary measure information. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Boxplot Chart 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Boxplot Summary Measures 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Keys to Understanding Boxplots n The right and left of the box are at the third and first quartiles. Therefore, the length of the box equals the interquartile range (IQR), and the box itself represents the middle 50% of the observations. The height of the box has no significance. n The vertical line inside the box indicates the location of the median. The point inside the box indicates the location of the mean. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Keys to Understanding Boxplots -- continued n Horizontal lines are drawn from each side of the box. They extend to the most extreme observations tat are no farther than 1. 5 IQRs from the box. They are useful for indicating variability and skewness. n Observations farther than 1. 5 IQRs from the box are shown as individual points. If they are between 1. 5 IRQs and 3 IQRs from the box, they are call mild outliers and are hollow. Otherwise, they are called extreme outliers and are solid. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Dow Returns Boxplot n The boxplot for this example summarizes the distribution of the returns. n It implies that the Dow returns are approximately symmetric on each side of the median, although the mean is a bit below the median. n Also there a few mild outliers but no extreme outliers. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Describing Data Sets with Boxplots

ACTORS. XLS n Recall that the salaries of famous actors and actresses are listed in this file. n Use side-by-side boxplots to compare the salaries of male and female actors and actresses. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Side-by-Side Boxplots n Boxplots are probably most useful for comparing two populations graphically. This is done using side-byside boxplots. n The data setup for this type of “comparison” can be in one of two forms: stacked or unstacked. n Data are stacked if there is a “code” variable that designates which category each observation is in, and there is a single “measurement” variable that contains the data for both categories. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Side-by-Side Boxplots -continued n The data are unstacked if there is a separate data column for each category. n In this example the data are stacked because Gender designates the gender associated with each observation and Salary is the single measurement variable. n If the data were unstacked, for example, actors’ salaries would be in one column and actresses’ salaries would be in another. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Creating a Side-by-Side Boxplot n Since the data are stacked the following must be steps must be followed to create the boxplots. – Place cursor within the data set and select the Stat. Pro/Charts/Boxplot(s) menu item. – In the dialog box that opens check the “stacked” option. – Then choose Gender as the code variable and Salary as the measurement variable. n The resulting data in stacked form and the side-byside boxplot appears. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Actor Data in Stacked Form 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Side-by-Side Boxplot Chart 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Reading the Boxplot n It is clear the female salary box is considerably to the left of the male salary box, although both have about the same IQR. n Each boxplot has three indications that the salary distributions are skewed to the right: – the means are larger than the medians – the medians are closer to the left sides of the boxes than to the right sides – the horizontal lines extend farther to the right than to the left of the boxes. However, there are no outliers. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Applying the Tools

Background Information n The Spring Mills Company produces and distributes a wide variety of manufactured goods. Due to its variety, it has a a large number of customers. n Spring Mills classifies these customers as small, medium and large, depending on the volume of business each does with them. n Recently they have noticed a problem with accounts receivable. They are not getting paid by their customers in as timely a manner as they would like. This obviously costs them money. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

RECEIVE. XLS n Spring Mills has gathered data on 280 customer accounts. n For each of these accounts the data set lists three variables: – Size, the size of the customer (coded 1 for small, 2 for medium, 3 for large) – Days, the number of days since the customer was billed – Amount, the amount the customer owes n What information can we obtain from this data? 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis n It is best to start by getting a good sense of the data. To do this we: – Calculate several summary measures for the Days and Amount – Create a histogram of Amount – Create a scatterplot of Amount versus Days n From these we determine: – a positive skewness in the Amount variable (The mean is considerably larger than the median and the standard deviation of Amount is quite large). – The scatterplot suggests some suspicious behavior, with two distinct groups of points. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis -- continued n The next step is to see whether the different customer sizes have any effect on either Days, Amount, or the relationship between Days and Amount. n To do this, it is useful to unstack the Days and Amount variables - that is to create a new Days and Amounts variable for each group of customer sizes. For example, the Days and Amount variables for customers of size 1 are named Days 1 and Amount 1. This can be accomplished by using Stat. Pro’s Unstack procedure but copying and pasting also work. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis -- continued n Once unstacked we need to calculate summary measures and a variety of charts on these unstacked variables. n The charts include: – Histograms of Amount for each size customer – Boxplots of days owed by different size customers – Boxplots of amounts owed by different size customers – Scatterplots of Amount versus Days for each size customer 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Summary Measures for Combined Data 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Histogram of All Amounts Owed 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot of Amount versus Days for All Customers 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Summary Measures Broken Down by Size 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Histogram for Small Customers 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Histogram of Amount for Medium Customers 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Histogram of Amount for Large Customers 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Boxplots of Days Owed by Different Size Customers 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Boxplots of Amounts Owed by Different Size Customers 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot of Amount versus Days for Small Companies 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot of Amount versus Days for Medium Companies 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot of Amount versus Days for Large Companies 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis -- continued n There is obviously a lot going on here and it is evident form the charts. We point out the following: – there are considerably fewer large customers than small or medium customers. – the large customers tend to owe considerably more than small or medium customers. – the small customers do not tend to be as long overdue as the small and medium customers. – there is no relationship between Days and Amount for the small customers, but there is a definite positive relationship between these variables for the medium and large customers. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis n We have done the obvious but here is still much more we can do. n For example, suppose Spring Mills wants a breakdown of customers who owe at least $500. n We first create a new variable called “Large? ” next to the original variables that equals 1 for all amounts greater than $500 and equals 0 otherwise. n We do this by entering a formula in D 6 and copying down. The next slide shows a sample of data and the formula to be used. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis - -continued n We enter the formula =IF(C 6>$B$3, 1, 0) in cell D 6 and copy it down. n We can then use a pivot table to create a count of the number of 1’s in this new variable for each value of the Size variable. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Pivot Tables for Counts of Customers Who Owe More than $500 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis -- continued n We created the pivot table twice, once showing the counts as percentages of each column, and once showing them as percentages of each row. n One table shows that 73% of all customers with amounts less than $500 are small customers. The other table shows that 45% of all medium-sized customers owe more than $500. n This type of analysis is often referred to as “slicing and dicing the data”. They are based on the same counts but portray them in different ways. Neither way is better; they both provide useful information. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis -- continued n Finally, we investigate the amount of interest Spring Mills is losing by the delays in its customers’ payments. n We assume that the company can make 12% annual interest on excess cash. Then we can create a Lost variable for each customer size that indicates the amount of interest Spring Mills loses on each customer group. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis n The formula entered in cell C 10 to calculate Lost 1 is =B 10*A 10*$C$7/365. This is the amount owed by the number of days owed multiplied by the interest rate, divided by the number of days in a year. Then we copy this formula down and to the columns to calculate Lost 2 and Lost 3. n Then we calculate the sums of these amounts in row 5. n The next slide shows a sample of data at this point. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis -- continued n Although Spring Mills is losing more per customer from the large customers, it is losing more total from the medium-sized customers - because there are more of them. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis -- continued n This is shown graphically by a pie chart of the sums in row 5. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Findings n If Spring Mills really wants to decrease receivables, it might want to target the medium-sized customer group, from which it is losing the most interest. n Or it could target the large customers because they owe the most on average. n The most appropriate action depends on the cost and effectiveness of targeting any particular customer group. However, the analysis presented here gives the company a much better picture of what’s currently going on. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Applying the Tools

Background Information n The R&P Supermarket is open 24 hours a day, 7 days a week. Lately it has been receiving a lot of complaints from customers about excessive waiting in line for checking out. n R&P has decided to investigate this situation by gathering data on arrivals, departures, and line lengths at the checkout stations. n It has collected data in half-hour increments for an entire week - 336 observations - starting at 8 am on Monday and ending at 8 am on the following Monday. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

CHECKOUT. XLS n This file includes the data collected on the following variables: – Initial. Waiting, the number waiting or being checked out at the beginning of a half-hour period – Arrivals, the number of arrivals to the checkout stations during a period – Departures, the number of checkout stations open during a period – Checkers, the number of checkout stations open during a period 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

CHECKOUT. XLS -- continued n The data set also includes time variables: – Day, day of week – Start. Time, clock time at the beginning of each half-hour period – Time. Interval, a descriptive term for the time of day such as Lunch rush for 11: 30 a. m. to 1: 30 p. m. n Finally the data set includes a calculated variable: – End. Waiting, the number waiting or being checked out at the end of a half-hour period This variable for any time period equals Initial. Waiting plus Arrivals minus Departures; it also equals Initial. Waiting for the next period. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

The Data 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis Information n The manager of R&P wants to analyze the data to discover any trends, particularly in the pattern of arrivals throughout the day or across the entire week. n Also the store currently uses a “seat-of-the-pants” approach to opening and closing checkout stations each half hour. The manager would like to see how well the current approach is working. n Of course, she would love to know the “best” strategy for opening and closing checkout stations - but this is beyond her (and our) capabilities at this point 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis n Obviously time plays a crucial role in this data, so a good place to start is to create one or more time series plots. n The time series plot of Initial. Waiting and Arrivals Variables shown on the next slide shows that: – Fridays and Saturdays are the busiest days – the time pattern of arrivals is somewhat different - more spread out - during the weekends than during the weekdays – there are fairly regular peak arrival periods during the weekdays – the number waiting is sometimes as large as 10 or 20, and the largest of these tend to be around peak arrival times 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Time Series Plot of Initial Waiting and Arrivals Variables 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis -- continued n A similar time series plot shown below shows Arrivals and Departures. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis -- continued n On this plot it is difficult to separate the two time series; they are practically on top of each other. n Perhaps this is not so bad because this tells us that the store is checking out customers approximately as quickly as they are arriving. n A somewhat more efficient way to understand the time series behavior is to use pivot tables. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis - continued n A pivot table for Average Initial. Waiting by Hour of Day is generated. This table can be seen on the next slide. n To create this table, we: – drag the Initial. Waiting variable to the Data area, express it as an average – drag Start. Time variable to the Row area – drag the Day variable to the Page area n Finally, we use the data in the pivot table to create a time series plot. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Average Initial Waiting by Hour of Day 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis -- continued n Note how the variable in the Page area works. Depending on which day we select in the page area the corresponding table and time series plot changes. n Similarly, a pivot table can be created, accompanied by a column chart. Average Arrivals by Time. Interval of Day can be analyzed with a pivot table by dragging Arrival to the Data area, expressing it as an average, dragging the Time. Interval variable to the Row area, and dragging the day variable to the Page area. This table is is on the next slide. You can check for the patterns on each day. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Average Arrivals by Time Interval of Days 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis n The manager of R&P is ultimately interested in whether the “right” number of checkout stations are available throughout the day. n We can create two scatterplots to provide some evidence. The plots follow this slide. n The first of these would be a scatterplot of Checkers versus Total. Customers. The Total. Customers variable is calculated as the sum of Initial. Waiting and the Arrivals to measure the total amount of work presented to the checkout stations in any half-hour period. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot of Checkers versus Total Customers 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot of End Waiting versus Checkers 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis -- continued n From the scatterplot we can see an obvious positive relationship between these two variables. n Evidently management is reacting as they should - it is opening more checkout stations when there is more traffic. n The second scatterplot shows End. Waiting versus Checkers. There is again a definite upward trend. Periods when more checkout stations are open tend to be associated with periods where more customers still remain in the checkout process. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis - continued n Presumably, management is reacting with more open checkout stations in busy periods, but it is not reacting strongly enough. n Just fiddling with the numbers in the Checker column will not solve the manager's problems. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Conclusions n Two problems exist. First, there is a trade-off between the “cost” of having customers wait in line and the cost of paying extra checkout people. This is a difficult tradeoff for any supermarket manager. n Second, the number of departures is clearly related to the number of checkout stations open. Therefore it doesn’t make sense to change the numbers in the Checkers column without changing the numbers in the Departures (and hence Initial. Waiting and End. Waiting) column in an appropriate way. This is not an easy problem. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Applying the Tools

Background Information n The Hy. Tex Company is a direct marketer of stereophonic equipment, personal computers and other electronic products. n Hy. Tex advertises entirely by mailing catalogs to its customers, and all of its orders are taken over the telephone. n The company spends a great deal of money on its catalog mailings, and it wants to be sure that this is paying off in sales. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

CATALOGS. XLS n This file contains the data that has been collected on 1000 customers at the end of the current year. n For each customer it has data on the following variables: – Age: coded as 1 for 30 or younger, 2 for 31 to 55, 3 for 56 and older. – Gender: coded as 1 for males, 2 for females – Own. Home: coded as 1 if customer owns a home, 2 otherwise – Married: coded as 1 if customer is currently married, 2 otherwise 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

CATALOGS. XLS -- continued – Close: coded as 1 if customers lives reasonably close to a shopping area that sells similar merchandise, 2 otherwise – Salary: combined annual salary of customer and spouse (if any) – Children: number of children living with customer – History: coded as “NA” if customer had no dealings with the company before this year, 1 if customer was a low spending customer last year, 2 if medium-spending, 3 if high-spending – Catalogs: Number of catalogs sent to the customer this year – Amount. Spent: Total amount of purchases made by the customer this year 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis Desired n Hy. Tex wants to analyze these data carefully to understand its customers better. n They want to see whether they are sending the catalogs to the right customers. n Each customer either receives 6, 12, 18, or 24 catalogs through the mail. Currently which customer receives which amount is not thought out carefully. They want to know if the current distribution of catalogs is effective. Is there room for improvement? 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis n Hy. Tex is obviously interested in the Amount. Spent variable. Therefore, it makes sense to create scatterplots of Amount. Spent versus selected “explanatory” variables. n First we create the scatterplot shown on the next slide showing Amount. Spent versus Salary. – It is clear that customers with higher salaries tend to spend more, although the variability in amounts spent increases significantly as salary increases. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot of Amount Spent versus Salary 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Analysis -- continued n Second, we create a scatterplot of Amount. Spent versus Catalogs – This plot shows that there is some tendency toward higher spending among customers who receive more catalogs. – But do the catalogs cause more spending, or are more catalogs sent to customers who would tend to spend more anyway? There is no way to answer this with this data. n Next is the scatterplot of Amount. Spent versus Children – This plot shows the interesting tendency of customers with more children to spend less. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot of Amount Spent versus Catalogs 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Scatterplot of Amount Spent versus Children 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis n Pivot tables and accompanying charts are very useful in this type of situation. We can begin by using this technique to get a better understanding of the demographics of the customers. n The first pivot table and chart shows the percentage of an age group who own homes. Using the pivot table we can see how these percentages change for women, unmarried men and so on. – Specifically small percentages of the younger people own their own hoe, regardless of marital status or gender. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Percent of Home Owners versus Age, Married and Gender 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis -- continued n The second pivot table shows the percentage of each age group who are married, for any combination of the Gender and Own. Home variables. – For example, we can check that the married/unmarried split is quite different for women who don’t own a home than for male home owners. n The third pivot table shows the average Salary broken down by Age and Gender, with page variables for Own. Home and Married. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Percent of Married versus Age, Own. Home and Gender 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Average Salary versus Age, Gender, Married and Own. Home 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis --continued – The shape of the resulting charts is practically the same for any combination of the page variables. However, the heights of the bars change appreciably. For example, salaries are higher for the married home owners than for unmarried customers who are not home owners. n Another pivot table and chart break the data down in another way. Each column in the pivot table shows the percentages in the various History categories for a particular number of children. Each of these columns corresponds to one of the bars in the “stacked” bar chart. We have also used Close as a page variable. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Percentages in History Categories versus Children and Close 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis -- continued n Two interesting points emerge from this plot. – First, customers with more children tend to be more heavily represented in the low-spending History category. – Second, you can check by changing the setting of the Close variable from 1 to 2, the percentage of high spenders among customers who live far from electronic stores is much higher than for those who live close to such stores. n The next pivot table provide insight into how Hy. Tex determined its catalog mailing distributions. Each row shows a percentage of a particular History category that were sent 6, 12, 18 or 24 catalogs. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Catalog Distribution versus History 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis -- continued n The company’s distribution policy is still unclear but we can see that it definitely sends more catalogs to high spending customers and fewer to low spending customers. n Finally the last pivot table shows Amount. Spent versus History and Catalogs, with a variety of demographic variables in the page area. There are so many possible combinations so it can be difficult to discover all the existing patterns 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

Average Amount Spent versus History, Catalogs and Demographic Variables 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11

More Analysis -- continued n One things stands out loud and clear from the graph: the more customers receive, the more they tend to spend. n Also, if they were large spenders last year, they tend to be large spenders this year. n In a pivot table with this many combinations there will almost certainly be some combinations with no observations. 3. 2 | 3. 3 | 3. 4 | 3. 5 | 3. 6 | 3. 7 | 3. 8 | 3. 9 | 3. 10 | 3. 11