SCCC 312A
Lectures 03-05 – Tabular and Graphical Presentation of Data, Numerical Summary Measures
Spring 2004 – E. Peña’s Class
A study was conducted regarding the weights of students in a certain school. A sample of n = 50 students was obtained and the resulting raw sample data, which has the variables GENDER and WEIGHT, is given in Table 2.2 of the textbook, and can also be found in the DATASETS folder of the course website (in the MINITAB folder it is Table02-02.mtw). Note that even though the variable GENDER is a qualitative or categorical variable, its responses were coded numerically (0 = Female, 1 = Male). The variable WEIGHT is a quantitative variable. We will use this data set to demonstrate the following graphical methods of data presentation.
a) frequency table and bar graph for qualitative variable;
b) dot plot and comparative dot plots;
c) stem-and-leaf and comparative stem-and-leaf;
d) frequency tables and histograms;
e) distributions.
The first step in constructing a bar graph is to count the frequency of occurrence of each of the distinct values of the qualitative variable. This can be done manually (using a “tally” column) or, if the data set is large, by computer (in Minitab, use STAT > TABLES > TALLY from the menu bar). For the variable GENDER, either manually or via Minitab, we find the following frequencies, together with the relative frequencies (in percent):
Value      | Frequency | Percentage
0 = Female | 20        | 40
1 = Male   | 30        | 60
Notice immediately that this table provides a more informative summary than the raw data. It shows that 20 of the 50 units in the sample were females and the other 30 were males, and the third column tells us that 40% of all respondents were female.
Such a frequency table could be converted into a bar graph, either using the frequency or using the relative frequency. For example, using Minitab, this leads to the bar graph depicted below:
It might be more appropriate in this case to separate the bars as the values are categorical responses.
Another way of depicting the frequency table above is with a pie chart. Again using Minitab, the associated pie chart for GENDER is given below. To compute the angle for a category, multiply 360 degrees by its relative frequency; for instance, the angle for Female is (360 degrees)(.40) = 144 degrees.
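The tally, relative frequencies, and pie-chart angles can also be computed outside Minitab. Here is a minimal Python sketch; the gender list below is constructed to match the 20/30 split in the frequency table above, not read from Table02-02.mtw:

```python
from collections import Counter

# Gender codes matching the frequency table above: 20 females (0), 30 males (1).
gender = [0] * 20 + [1] * 30

counts = Counter(gender)
n = len(gender)
for value, label in [(0, "Female"), (1, "Male")]:
    rel_freq = counts[value] / n
    angle = 360 * rel_freq  # pie-chart slice angle in degrees
    print(f"{label}: frequency = {counts[value]}, "
          f"percentage = {100 * rel_freq:.0f}%, angle = {angle:.0f} degrees")
```

Running this reproduces the table (20, 40%, 144 degrees for Female; 30, 60%, 216 degrees for Male).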
There are several ways of presenting quantitative variables, and you may appeal to your artistic talent in doing so (but don’t overdo it, as simplicity is a virtue!). The crudest way is simply to arrange the observations in ascending (or descending) order, but this is usually not very informative. The first method we study is called a dot plot. The dot plot for the variable WEIGHT is presented below.
In constructing a dot plot, each observation is represented by one dot, and if a value appears more than once, we simply stack the dots. This particular dot plot is not the most enlightening, as not much grouping arose! We could also construct a comparative dot plot if there is a grouping variable, such as GENDER in our example. A comparative dot plot is used to compare the groups in terms of the values of a quantitative variable. Again using Minitab, we have the following comparative dot plot.
And through this we of course immediately see that females tend to be lighter than males!
Stem-and-Leaf Plots
Similar to the dot plot and comparative dot plots is the notion of a stem-and-leaf plot. The basic idea is to subdivide each value into a stem and a leaf, and then to draw a tree with the trunk consisting of the stems, and the leaves attached to the stem. For example, the first value of WEIGHT is 98, and we could split this into a stem taking value 9, and a leaf taking value 8. The stem-and-leaf plot for the WEIGHT variable then looks like (with the first column consisting of cumulative counts):
Leaf Unit = 1.0
1 9 8
4 10 188
11 11 0255688
16 12 00089
19 13 257
23 14 2358
(6) 15 044578
21 16 122578
15 17 00667
10 18 3468
6 19 0155
2 20 5
1 21 5
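The splitting rule behind the display can be sketched in a few lines of Python. This is not Minitab's algorithm, just the stem/leaf idea, demonstrated on the first few weights from the data set:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Split each value into a stem (all digits but the last) and a leaf
    (the last digit), collecting sorted leaves under each stem."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(groups.items())}

# First few weights from the data set: 98 splits into stem 9 and leaf 8, etc.
print(stem_and_leaf([98, 101, 108, 108, 110, 112]))
# {9: '8', 10: '188', 11: '02'}
```

These stems and leaves match the first rows of the Minitab display above.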
The associated comparative stem-and-leaf plot, with GENDER as grouping variable, is:
Character Stem-and-Leaf Display
Stem-and-leaf of Weight Gender = 0 N = 20
Leaf Unit = 1.0
1 9 8
4 10 188
(7) 11 0255688
9 12 00089
4 13 257
1 14 2
Stem-and-leaf of Weight Gender = 1 N = 30
Leaf Unit = 1.0
3 14 358
9 15 044578
15 16 122578
15 17 00667
10 18 3468
6 19 0155
2 20 5
1 21 5
Usually, this will be presented as a side-by-side stem-and-leaf plot so as to make the comparison easier.
The final graphical method that we study is the frequency distribution table and its associated histogram. The basic idea is to subdivide the range of the values of the variable into a small number of (usually equal-length) intervals and to count the frequencies of occurrence of values in each of these intervals. Usually the number of intervals is between 7 and 15, but there is no hard rule; the more observations you have, the more intervals you can afford. The histogram or relative frequency histogram is the graph associated with the frequency table. For the WEIGHT variable, using Minitab and the menu commands GRAPH > HISTOGRAM, the histogram for all observations is given by
Here, notice that the intervals were 90-99, 100-109, and so on. There should always be enough intervals to include all the observations. The histogram above is also referred to as a distribution; in this case it is bi-modal, as there are two “humps” associated with the females and the males, though the histogram alone does not reveal where the humps come from. To see that the two humps arise from the grouping variable, we could construct a comparative histogram. With GENDER as grouping variable, the comparative histogram is provided by:
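The counting step behind a frequency table can be sketched as a short function. As an illustration, we bin the 20 female weights from the appendix into the intervals 90-99, 100-109, and so on:

```python
from collections import Counter

def frequency_table(values, start, width):
    """Count observations falling in [start, start+width-1],
    [start+width, start+2*width-1], and so on."""
    bins = Counter((v - start) // width for v in values)
    return {(start + k * width, start + (k + 1) * width - 1): bins[k]
            for k in sorted(bins)}

# The 20 female weights from the appendix data set.
female_weights = [98, 108, 112, 118, 120, 128, 135, 137, 120, 118,
                  115, 115, 101, 108, 110, 116, 142, 120, 132, 129]
for interval, count in frequency_table(female_weights, start=90, width=10).items():
    print(interval, count)
```

The resulting counts (1, 3, 7, 5, 3, 1) agree with the female stem-and-leaf display shown earlier.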
And again, through this graph we immediately see that females are lighter than males; these plots confirm the saying that “a picture is worth a thousand words”!
When dealing with distributions, we can usually characterize them as symmetric, mound-shaped, left-skewed, right-skewed, bi-modal, or one of some other types that we will encounter later.
Illustration #2: Poverty and PACT in South Carolina
A study was conducted to determine if there is a difference in the PACT scores of SC school districts with respect to the percentage of students who receive subsidies for school lunches (this is supposed to be an indicator of the level of poverty of the school district, with high values indicating a poorer school district; whether it is a good measure is certainly another issue). The data set was obtained from an issue of The State, Columbia’s city newspaper, and is available in the DATASETS folder of the course website under the name pactvspoverty.mpj. This data set contains four variables:
District = school district name
Lunch = percentage of students receiving school lunch subsidy
ActualLang = percent of students who were not proficient on the Language portion of PACT
ActualMath = percent of students who were not proficient on the Math portion of the PACT
There were a total of 84 school districts.
Question: Is this a population or a sample?
For the moment we simply use this data set to demonstrate the graphical techniques of presenting data. Here are the histograms of the three quantitative variables.
What shapes do we observe from these histograms? Are they all symmetric? Is the histogram for the Language Proficiency right-skewed?
More importantly, at this point, we introduce the scatterplot, which is a plot of two quantitative variables, and aids in determining if there is an association or relationship between the two variables. In our case, the question of interest is whether there is a relationship between the proficiency (Math or Language) and the “poverty” level of the school district.
A provocative question is: If there is such a relationship, would the standard exams (SAT, ACT, PACT) be valid criteria for admission to public universities?
Based on these scatterplots, what could we conclude about the relationship between “poverty” and PACT scores? Of course, one might also be interested in whether, within school districts, there is a relationship between language and math scores. Here’s the scatterplot of these two variables.
This lecture demonstrates some basic methods of graphically presenting data, both qualitative and quantitative. We will have occasion to encounter other types of presentation methods, and in thinking of a way to present data graphically, we might add that its only limitation is our imagination!
Appendix: The Data Sets
Weight Data Set | Poverty vs PACT Data Set

Gender | Weight | District | %LunchSubsidy | %NotProficientLanguage | %NotProficientMath
0 | 98 | Abbeville | 59 | 32 | 38
1 | 150 | Aiken | 46 | 26 | 30
0 | 108 | Allendale | 90 | 63 | 67
1 | 158 | Anderson1 | 29 | 17 | 24
1 | 162 | Anderson2 | 41 | 24 | 26
0 | 112 | Anderson3 | 51 | 30 | 41
0 | 118 | Anderson4 | 41 | 25 | 30
1 | 167 | Anderson5 | 43 | 32 | 36
1 | 170 | Bamberg1 | 70 | 33 | 36
0 | 120 | Bamberg2 | 93 | 50 | 66
1 | 177 | Barnwell19 | 84 | 50 | 66
1 | 186 | Barnwell29 | 64 | 27 | 32
1 | 191 | Barnwell45 | 52 | 36 | 43
0 | 128 | Beaufort | 50 | 31 | 43
0 | 135 | Berkeley | 53 | 28 | 35
1 | 195 | Calhoun | 78 | 36 | 41
0 | 137 | Charleston | 57 | 31 | 42
1 | 205 | Cherokee | 51 | 39 | 42
1 | 190 | Chester | 55 | 41 | 53
0 | 120 | Chesterfield | 60 | 37 | 45
1 | 188 | Clarendon1 | 96 | 46 | 66
1 | 176 | Clarendon2 | 75 | 34 | 45
0 | 118 | Clarendon3 | 60 | 29 | 36
1 | 168 | Colleton | 71 | 43 | 53
0 | 115 | Darlington | 68 | 42 | 51
0 | 115 | Dillon1 | 76 | 47 | 52
1 | 162 | Dillon2 | 82 | 49 | 55
1 | 157 | Dillon3 | 73 | 30 | 41
1 | 154 | Dorchester2 | 31 | 24 | 30
1 | 148 | Dorchester4 | 75 | 45 | 57
0 | 101 | Edgefield | 57 | 29 | 40
1 | 143 | Fairfield | 80 | 51 | 63
1 | 145 | Florence1 | 54 | 30 | 44
0 | 108 | Florence2 | 67 | 28 | 33
1 | 155 | Florence3 | 76 | 45 | 50
0 | 110 | Florence4 | 87 | 61 | 61
1 | 154 | Florence5 | 54 | 27 | 33
0 | 116 | Georgetown | 60 | 32 | 41
1 | 161 | Greenville | 35 | 26 | 35
1 | 165 | Greenwood50 | 51 | 29 | 36
0 | 142 | Greenwood51 | 50 | 35 | 42
1 | 184 | Greenwood52 | 43 | 23 | 26
0 | 120 | Hampton1 | 66 | 32 | 44
1 | 170 | Hampton2 | 86 | 63 | 75
1 | 195 | Horry | 54 | 25 | 33
0 | 132 | Jasper | 87 | 60 | 69
0 | 129 | Kershaw | 49 | 29 | 37
1 | 215 | Lancaster | 46 | 38 | 43
1 | 176 | Laurens55 | 50 | 38 | 44
1 | 183 | Laurens56 | 57 | 40 | 50
  |     | Lee | 90 | 60 | 75
  |     | Lexington1 | 26 | 17 | 20
  |     | Lexington2 | 47 | 23 | 27
  |     | Lexington3 | 53 | 37 | 39
  |     | Lexington4 | 58 | 34 | 43
  |     | Lexington5 | 16 | 13 | 15
  |     | Marion1 | 74 | 48 | 54
  |     | Marion2 | 77 | 43 | 55
  |     | Marion3 | 94 | 41 | 62
  |     | Marion4 | 88 | 49 | 62
  |     | Marlboro | 78 | 50 | 59
  |     | McCormick | 79 | 46 | 58
  |     | Newberry | 61 | 41 | 47
  |     | Oconee | 45 | 26 | 34
  |     | Orangeburg3 | 87 | 49 | 62
  |     | Orangeburg4 | 68 | 36 | 52
  |     | Orangeburg5 | 76 | 45 | 56
  |     | Pickens | 32 | 22 | 31
  |     | Richland1 | 63 | 39 | 53
  |     | Richland2 | 33 | 20 | 26
  |     | Saluda | 64 | 44 | 53
  |     | Spartanburg1 | 39 | 20 | 22
  |     | Spartanburg2 | 37 | 21 | 27
  |     | Spartanburg3 | 47 | 23 | 30
  |     | Spartanburg4 | 40 | 29 | 41
  |     | Spartanburg5 | 43 | 25 | 27
  |     | Spartanburg6 | 37 | 24 | 31
  |     | Spartanburg7 | 64 | 37 | 43
  |     | Sumter17 | 59 | 36 | 45
  |     | Sumter2 | 70 | 32 | 41
  |     | Union | 55 | 37 | 46
  |     | Williamsburg | 90 | 38 | 47
  |     | York1 | 45 | 32 | 35
  |     | York2 | 31 | 25 | 24

(The Weight Data Set columns list the 50 sampled students; the Poverty vs PACT columns list all 84 school districts, so the last 34 rows have no Gender/Weight entries.)
Lecture 04 – Measures of Location and Position
Graphical methods of presenting data are important because of their visual impact. However, in summarizing data, we oftentimes would like a few numerical measures that capture the important features of the data set. There are three types of numerical summary measures that we will study for univariate data sets: measures of location, measures of position, and measures of dispersion. We begin our discussion with measures of location.
These summary measures purport to provide a summary of the “center” or central tendency of the data set. The center need not be the value on which the observations cluster, though most often it will be. The first measure of location is the well-known arithmetic average, or the mean. Since we will be dealing with sample data, this will be referred to as the sample mean. It is defined via:
Sample Mean (X-Bar) = (Sum of all observations)/(Number of observations).
For purposes of demonstrating the computation of these summary measures, let us consider the following data set, which represents the number of homeruns for each of 8 seasons of Sammy Sosa, a Chicago Cubs professional baseball player.
15, 10, 33, 25, 36, 40, 36, 66
The sample size for this data set is n = 8. The sample mean number of homeruns is
X-Bar = (15 + 10 + 33 + 25 + 36 + 40 + 36 + 66)/8 = 261/8 = 32.625 homeruns/season.
There are several properties of the sample mean.
a) It represents the “center of gravity” of the data set in the sense that this value will “balance” out the data set.
b) It is the value, A, that will minimize the sum of the squared deviations of each of the observations from A.
c) It is easily affected by outliers or extreme observations, hence it may not be the best measure when dealing with skewed distributions.
d) It uses all the observations in its computation.
The second measure of location or central tendency is the median. This is the value that divides the arranged data set into two equal parts. It is obtained by first arranging the data set into an ascending order, and then determining that value that will split the data set into a 50:50 split. To demonstrate, the arranged data set for Sosa’s homeruns is:
10, 15, 25, 33, 36, 36, 40, 66
Because n = 8, the observation that divides the data set into two equal parts is the (n+1)/2 = (8+1)/2 = 4.5th observation in the arranged data set. Since 4.5 is not a whole number, we take the median as the average of the 4th and 5th observations in the arranged data. Thus,
Median = (33 + 36)/2 = 34.5.
In contrast to the mean, the median is not affected by outliers, hence is suitable to use when dealing with skewed distributions. On the other hand, it does not utilize the magnitudes of all the observations, except for purposes of ranking the observations.
The third measure of location is the mode. This is the observation that occurs most frequently in the data set. For the homerun data set, this is the value of 36 since it appears twice. Most often, we would look for the modal class when given the frequency distribution or histogram, and the modal class is the interval of values which has the highest frequency.
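These three measures are easy to verify for the homerun data using Python's standard library (Minitab is the package used in these notes; this is just an illustrative alternative for small data sets):

```python
from statistics import mean, median, mode

homeruns = [15, 10, 33, 25, 36, 40, 36, 66]  # Sosa's 8 seasons

print(mean(homeruns))    # 32.625, matching the hand computation
print(median(homeruns))  # 34.5, the average of the 4th and 5th ordered values
print(mode(homeruns))    # 36, the only value appearing twice
```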
We now demonstrate these using the weight data, which also has “Gender” as a classification variable. When dealing with large data sets, it is better to use computer packages or your calculator for computing these quantities. In our case, we use Minitab to compute these quantities, using the Stat > Basic Stat > Descriptive Statistics menu commands.
Descriptive Statistics: Weight

Variable   N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
Weight    50   0  150.64     4.32  30.53    98.00  120.00  154.00  176.00   215.00
This tells us that the mean weight of all students is 150.64 pounds and the median weight is 154 pounds. Examining the frequency table or histogram, the modal classes are 157.5-172.5 and 112.5-127.5. The histogram is depicted below.
In this case we would say that the distribution is “bi-modal” of course owing to the fact that there are two groups: males and females.
With a classification variable like “Gender” we may also compute these quantities for each group in order to compare these measures across groups. Again, we may do this using Minitab.
Descriptive Statistics: Weight

Variable  Gender   N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
Weight    0       20   0  119.10     2.64  11.79    98.00  110.50  118.00  128.75   142.00
          1       30   0  171.67     3.37  18.44   143.00  156.50  169.00  186.50   215.00
From these summary measures, we note that the mean and median for the females are 119.10 and 118.00, respectively; whereas for the males they are 171.67 and 169.00, respectively.
In comparing the values of the mean, median, and mode, we have the following general guidelines:
a) for symmetric distributions, the mean, median and mode will generally coincide or be close to each other.
b) for right-skewed distributions (tail on the right), in order of increasing magnitude we have: mode, median, mean. This is because the mean will be affected by the extreme values to the right.
c) for left-skewed distributions (tail on the left), in order of increasing magnitude we have: mean, median, mode. This is because the mean will be affected by the extreme values to the left.
The median is an example of a measure of position as it divides the data set into two equal parts. The quantity that divides the arranged data set into a 25%:75% split is called the first quartile, denoted by Q1; whereas the quantity that divides the arranged data set into a 75%:25% split is called the third quartile, and denoted by Q3. To compute Q1, obtain (n+1)/4, and take the observation in the arranged data set whose index is closest to (n+1)/4. This is not the most precise way of doing it, but this will suffice. To obtain Q3, take the observation in the arranged data set whose index is closest to 3(n+1)/4. Thus, for the homerun data set where n = 8, we have:
(n+1)/4 = (8+1)/4 = 2.25, so Q1 = 2nd observation in arranged data = 15;
3(n+1)/4 = 3(8+1)/4 = 6.75, so Q3 = 7th observation in arranged data = 40.
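The rough "closest index" rule above can be written out as a small function. Note that this is the notes' rule, not Minitab's interpolation method, so Minitab's Q1 and Q3 may differ slightly:

```python
def quartile(data, frac):
    """Observation whose 1-based position in the arranged data is closest
    to frac * (n + 1) -- the rough rule used in these notes."""
    arranged = sorted(data)
    position = round(frac * (len(arranged) + 1))
    return arranged[position - 1]

homeruns = [15, 10, 33, 25, 36, 40, 36, 66]
print(quartile(homeruns, 0.25))  # Q1 = 15 (index closest to 2.25 is 2)
print(quartile(homeruns, 0.75))  # Q3 = 40 (index closest to 6.75 is 7)
```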
When using Minitab, Q1 and Q3 are also computed, as the outputs above demonstrate.
Other measures of position are the percentiles. For example, the 90th percentile is the value that divides the arranged data set into a 90%:10% split; thus 90% of all observations will be smaller than or equal to the 90th percentile. To determine the 90th percentile, one may look for the observation in the arranged data set whose index is closest to (n+1)(.90).
A box plot is another method of presenting data in a pictorial and compact way. In its simplest form, we form a box whose edges are Q1 and Q3, and we also indicate where the median is. Then we construct whiskers, which emanate from the edges and extend to the two extreme values. There are more complicated box plots that one may construct. For the weight data set, the box plot for the whole data set is (from Minitab):
We could also have comparative box plots to compare different groups. Thus, for the weight data set, the comparative box plots for the males and females is depicted below.
Notice that one immediately sees the difference between the two groups by looking at these box plots.
Lecture 05 - Measures of Dispersion
Measures of location do not, however, tell the complete story about a data set, as can be seen by considering the mean and median of the following three data sets:
Data Set 1: 1, 2, 3, 4, 5
Data Set 2: 1, 1, 3, 5, 5
Data Set 3: 3, 3, 3, 3, 3
Clearly, these three data sets are quite different from each other, but their means and medians all coincide. It is obvious that there are differences in the degree of variation among the observations in each of these data sets. Data set 3 has the least variation, in fact, it has no variation at all, while Data set 2 has the most variation. We therefore need an additional measure to augment the measures of location and get a better picture of the data set. Such measures are called measures of variation or dispersion.
Range: the range is a crude measure of variation. It is the difference between the extreme values in the data set.
(Sample) Variance = S² = {Sum of (Xi – XBar)²}/(n – 1) = “average” of the squared deviations of the observations from the sample mean. The reason the divisor is (n – 1) instead of n will become apparent when we use this sample variance to estimate the population variance: by dividing by (n – 1), it becomes an unbiased estimator of the population variance, that is, it is “on target” on average.
Note that the units of measurement of the variance are the squared units of the original observations. Also, clearly, the variance could never be negative, and the only time it equals zero is when all the observations are identical. The larger the value of the variance, the more variation there is in the data set.
(Sample) Standard Deviation = S = SQRT(S²) (take the positive root). Note that this has the same unit of measurement as the observations.
We demonstrate the computation of these quantities using the homerun data set.
Observation Number | Value (Xi) | Deviation from Mean   | Squared Deviation
1                  | 15         | 15 – 32.625 = -17.625 | 310.6406
2                  | 10         | 10 – 32.625 = -22.625 | 511.8906
3                  | 33         | 33 – 32.625 = .375    | .1406
4                  | 25         | 25 – 32.625 = -7.625  | 58.1406
5                  | 36         | 36 – 32.625 = 3.375   | 11.3906
6                  | 40         | 40 – 32.625 = 7.375   | 54.3906
7                  | 36         | 36 – 32.625 = 3.375   | 11.3906
8                  | 66         | 66 – 32.625 = 33.375  | 1113.8906
                   | Sum = 261; Mean = 32.625 | Sum = 0 | Sum = 2071.8748
Therefore, the sample variance is S2 = 2071.8748/(8-1) = 2071.8748/7 = 295.9821.
The standard deviation is S = SQRT(295.9821) = 17.2041.
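The whole computation in the table can be reproduced in a few lines of Python:

```python
homeruns = [15, 10, 33, 25, 36, 40, 36, 66]
n = len(homeruns)

xbar = sum(homeruns) / n                     # sample mean = 32.625
ss = sum((x - xbar) ** 2 for x in homeruns)  # sum of squared deviations
variance = ss / (n - 1)                      # divide by n - 1, not n
std_dev = variance ** 0.5                    # positive square root

print(round(variance, 4))  # 295.9821
print(round(std_dev, 4))   # 17.2041
```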
Remark: There is a simpler formula for computing the variance, called the machine formula, but we will not study this as we will mostly be computing the variance and standard deviation using your calculator or the computer.
Question: Is it always better to have small variability?
Answer: In many situations we would like to have small variability as this implies consistency and precision, especially when measuring certain quantities or when we would like to monitor a production line. However, sometimes it is also important that there be variability, so imagine a world where we all look identical, where every shot in a basketball game is good, etc. There is therefore a “Jekyll and Hyde” quality to how we would want the variation to be!
Before proceeding, we mention an important property of the mean, variance, and standard deviation.
Imagine a situation where you are interested in studying the noontime temperature in Columbia for the month of January. You will then have 31 observations, and let us suppose that you recorded these temperatures in Fahrenheit. With these observations you obtain the mean, variance, and standard deviation in units of Fahrenheit, Fahrenheit², and Fahrenheit, respectively. Suppose, however, that you decide to convert your readings into Centigrade, perhaps because of the requirements of the journal in which you are publishing your work. The question is: do you need to re-compute the mean, variance, and standard deviation, or is there a way to obtain them by simply utilizing the values you already have from the Fahrenheit readings?
Suppose that F is a temperature reading in Fahrenheit. We could convert this reading into Centigrade according to the formula
C = (5/9)(F – 32) = (5/9)(F) – (5/9)(32) = (a)(F) + b with a = 5/9 and b = - (5/9)(32).
This is what we refer to as a linear transformation (or an affine transformation). Here is the important result that we need:
Property Of Mean and Variance: Let X1, X2, ..., Xn be sample observations with mean XBAR and variance S2. Let a and b be constants, and define the transformed observations Y1, Y2, ..., Yn according to the formula
Yi = aXi + b for all i = 1, 2, ..., n.
Then the mean and variance of these new observations are given by
Mean of the Ys = (a)(XBAR) + b
Variance of the Ys = (a2)(S2)
Standard Deviation of the Ys = |a|S.
Consequently, suppose that the mean noontime temperature for the month of January is 45 degrees Fahrenheit, and the standard deviation is 8 degrees Fahrenheit. Without re-computing, the mean and standard deviation in units of Centigrade are equal to
Mean in Centigrade = (5/9)(45) –(5/9)(32) = 7.22
Standard Deviation in Centigrade = |5/9|(8) = 4.44.
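The property can be checked numerically on any data set. Here is a quick sketch with a few assumed Fahrenheit readings (hypothetical values, not actual Columbia temperatures):

```python
fahrenheit = [40.0, 45.0, 50.0, 38.0, 52.0]  # hypothetical noontime readings
a, b = 5 / 9, -(5 / 9) * 32                  # Centigrade = a*F + b

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

celsius = [a * f + b for f in fahrenheit]

# Property: Mean of Ys = a * XBAR + b, and Variance of Ys = a^2 * S^2.
print(abs(mean(celsius) - (a * mean(fahrenheit) + b)) < 1e-9)         # True
print(abs(variance(celsius) - a ** 2 * variance(fahrenheit)) < 1e-9)  # True
```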
Another measure of variation is the inter-quartile range (IQR), which is the difference between the third quartile (Q3) and the first quartile (Q1). Thus
IQR = Q3 – Q1.
Mostly in this course, however, we will be concerned with the variance and the standard deviation. We note that the inter-quartile range has an important role in determining outliers in a data set.
Outliers: Those observations whose values are smaller than Q1 – (1.5)(IQR) or larger than Q3 + (1.5)(IQR) are called mild outliers; while those observations smaller than Q1 – (3)(IQR) or larger than Q3 + (3)(IQR) are referred to as extreme outliers.
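These fences are simple to compute once the quartiles are known. For example, using Q1 = 120 and Q3 = 176 from the Minitab output for the full weight data set:

```python
def outlier_fences(q1, q3):
    """Mild and extreme outlier fences based on the inter-quartile range."""
    iqr = q3 - q1
    mild = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    extreme = (q1 - 3 * iqr, q3 + 3 * iqr)
    return mild, extreme

mild, extreme = outlier_fences(120.0, 176.0)  # weight data: IQR = 56
print(mild)     # (36.0, 260.0)  -> every weight (98 to 215) is inside the fences
print(extreme)  # (-48.0, 344.0)
```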
Imagine the situation where you have a data set with n observations. You are not given all the values of these observations, but you are provided the values of the mean (XBAR) and the standard deviation (S). What information do you get by knowing the mean and the standard deviation?
Case 1: You do NOT have any idea of the shape of the histogram or distribution. If this is the case, then you may use Chebyshev’s Inequality to make the following conclusions:
a) At least 75% [= (1 – (1/2)²) × 100%] of all the observations will have values between XBAR – 2S and XBAR + 2S.
b) At least 88.89% [= (1 – (1/3)²) × 100%] of all the observations will have values between XBAR – 3S and XBAR + 3S.
On the other hand, if you have the additional information that the histogram or distribution is mound-shaped, then you can do better than Chebyshev’s Inequality by invoking what is called the Empirical Rule. This states that, for mound-shaped distributions,
a) Approximately 68% of all observations will be between XBAR – S and XBAR + S.
b) Approximately 95% of all observations will be between XBAR – 2S and XBAR + 2S.
c) Approximately 99.7% (essentially all) of all observations will be between XBAR – 3S and XBAR + 3S.
We now demonstrate this using the WEIGHT data set. Recall that using Minitab, for this data set the sample mean and sample standard deviation are given by XBAR = 150.64 pounds and S = 30.53. We determine the percentages of observations in certain intervals.
Interval                      | Frequency | Percentage | According to Chebyshev | Empirical Rule
[XBAR ± S]  = [120.11, 181.17] | 26        | 52%        | at least 0%            | approx 68%
[XBAR ± 2S] = [89.58, 211.70]  | 49        | 98%        | at least 75%           | approx 95%
[XBAR ± 3S] = [59.05, 242.23]  | 50        | 100%       | at least 88.89%        | approx 99.7%
Observe that the result for the first interval is quite different from that expected under the empirical rule, but this is due to the fact that the histogram or distribution is not mound-shaped as we have observed it to be bi-modal owing to the mixing of the female and male weights.
The implication of these rules is that it is quite unlikely to have observations which are more than three standard deviations from the mean. In statistics, it is therefore customary to measure distances among observations in terms of standard deviation units. In connection with this, we have the notion of the standardized score or the z-score:
z-score = (Observation – Mean)/(Standard Deviation).
The z-score is a unit-less quantity, hence it could be used to compare distances even with data sets using different units of measurement.
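For instance, the heaviest student in the weight data set (215 pounds) is about two standard deviations above the mean:

```python
def z_score(x, mean, std_dev):
    """Distance of an observation from the mean, in standard-deviation units."""
    return (x - mean) / std_dev

# Using XBAR = 150.64 and S = 30.53 from the Minitab output for WEIGHT.
print(round(z_score(215, 150.64, 30.53), 2))  # 2.11
```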
In this lecture we discuss methods for measuring the degree of linear association and determining the relationship between two quantitative variables. To demonstrate the ideas, we utilize the data set regarding current age and years of life remaining on page 141, problem 3.48 of the textbook. A scatter plot of the data set is given below:
To compute the correlation coefficient and the regression coefficient, we implement the machine or computational formulas using Excel.
Computing the Correlation and Regression Coefficients

Data set: page 141, Problem 3.48. The independent variable is current age, while the dependent variable is years of life remaining.

XCurrentAge | YYrsRemain | Xsquared | Ysquared | XY
65          | 16.5       | 4225     | 272.25   | 1072.5
67          | 15.1       | 4489     | 228.01   | 1011.7
69          | 13.7       | 4761     | 187.69   | 945.3
71          | 12.4       | 5041     | 153.76   | 880.4
73          | 11.2       | 5329     | 125.44   | 817.6
75          | 10.1       | 5625     | 102.01   | 757.5
77          | 9          | 5929     | 81       | 693
79          | 8.4        | 6241     | 70.56    | 663.6
81          | 7.1        | 6561     | 50.41    | 575.1
83          | 6.4        | 6889     | 40.96    | 531.2
Sum         | 740 | 109.9 | 55090 | 1312.09 | 7947.9

Xmean  = 74
Ymean  = 10.99
SS(X)  = 330
SS(Y)  = 104.289
SS(XY) = -184.7
r = -0.99561326
b = -0.55969697
a = 52.40757576
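The machine (computational) formulas used above can also be carried out directly in Python as a check:

```python
ages = [65, 67, 69, 71, 73, 75, 77, 79, 81, 83]  # X = current age
years = [16.5, 15.1, 13.7, 12.4, 11.2, 10.1,
         9.0, 8.4, 7.1, 6.4]                     # Y = years of life remaining
n = len(ages)

# Machine formulas for the sums of squares.
ss_x = sum(x * x for x in ages) - sum(ages) ** 2 / n
ss_y = sum(y * y for y in years) - sum(years) ** 2 / n
ss_xy = sum(x * y for x, y in zip(ages, years)) - sum(ages) * sum(years) / n

r = ss_xy / (ss_x * ss_y) ** 0.5        # correlation coefficient
b = ss_xy / ss_x                        # slope (regression coefficient)
a = sum(years) / n - b * sum(ages) / n  # intercept

print(round(r, 4))  # -0.9956
print(round(b, 4))  # -0.5597
print(round(a, 4))  # 52.4076
```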
We could also compute these coefficients using Minitab. To compute the correlation, use the Stat > Basic Statistics menu and choose the correlation option. The results are as follows.
Correlations: CurrentAge, YrsRemain

Pearson correlation of CurrentAge and YrsRemain = -0.996
To compute the regression line, use the Stat > Regression menu in Minitab.
Regression Analysis: YrsRemain versus CurrentAge

The regression equation is
YrsRemain = 52.4 - 0.560 CurrentAge

Predictor        Coef  SE Coef       T      P
Constant       52.408    1.380   37.97  0.000
CurrentAge   -0.55970  0.01860  -30.10  0.000

S = 0.337818   R-Sq = 99.1%   R-Sq(adj) = 99.0%
Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  103.38  103.38  905.84  0.000
Residual Error   8    0.91    0.11
Total            9  104.29
We will study the meaning of this analysis of variance table, as well as the other components of this output, later. The important quantities to note from this output are the Constant coefficient of 52.408, which is the Y-intercept, and the CurrentAge coefficient of -0.55970, which is the regression coefficient (slope). The best fitting line to the data set is therefore:
(Predicted Years Remaining) = 52.408 - .55970(CurrentAge).
Another particular quantity to note at this stage is the R-Sq value of 99.1%. This is called the coefficient of determination, and it is just the square of the correlation coefficient. It measures the predictive ability of the independent variable. More technically, it measures the amount of variation in the dependent variable that could be explained by the independent variable.
Using the Minitab command “Fitted Line Plot” in the Stat > Regression menu, we obtain the scatterplot together with the fitted line.