SCCC 312A
Lectures 03-05 – Tabular and Graphical Presentation of Data, Numerical Summary Measures
Spring 2004 – E. Peña’s Class
A study was conducted regarding the weights of students in a certain school. A sample of n = 50 students was obtained and the resulting raw sample data, which has the variables GENDER and WEIGHT, is given in Table 2.2 of the textbook, and can also be found in the DATASETS folder of the course website (in the MINITAB folder it is Table02-02.mtw). Note that even though the variable GENDER is a qualitative or categorical variable, its responses were coded numerically (0 = Female, 1 = Male). The variable WEIGHT is a quantitative variable. We will use this data set to demonstrate the following graphical methods of data presentation.
a) frequency table and bar graph for qualitative variable;
b) dot plot and comparative dot plots;
c) stem-and-leaf and comparative stem-and-leaf;
d) frequency tables and histograms;
e) distributions.
The first step in constructing a bar graph is to count the frequency of occurrence of each of the distinct values of the qualitative variable. This can be done manually (using a “tally” column) or, if the data set is large, by computer (in Minitab, use STAT > TABLES > TALLY from the menu bar). For the variable GENDER, either manually or via Minitab, we find the following frequencies, together with the relative frequencies (in percent):
Value      | Frequency | Percentage
0 = Female | 20        | 40
1 = Male   | 30        | 60
Notice immediately that this table provides a more informative summary than the raw data. It shows that 20 of the 50 units in the sample were females and the other 30 were males, and the third column tells us that 40% of all respondents were female.
Such a frequency table could be converted into a bar graph, either using the frequency or using the relative frequency. For example, using Minitab, this leads to the bar graph depicted below:
It might be more appropriate in this case to separate the bars as the values are categorical responses.
Another way of depicting the frequency table above is with a pie chart. Again using Minitab, the associated pie chart for GENDER is given below. To compute the angle for a category, multiply 360 degrees by its relative frequency; for instance, the angle for Female is (360 degrees)(.40) = 144 degrees.
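The tally, relative frequencies, and pie-chart angles can also be computed outside Minitab. Here is a minimal Python sketch; the gender list below is constructed to match the 20/30 split in the frequency table above, not read from Table02-02.mtw:

```python
from collections import Counter

# Gender codes matching the frequency table above: 20 females (0), 30 males (1).
gender = [0] * 20 + [1] * 30

counts = Counter(gender)
n = len(gender)
for value, label in [(0, "Female"), (1, "Male")]:
    rel_freq = counts[value] / n
    angle = 360 * rel_freq  # pie-chart slice angle in degrees
    print(f"{label}: frequency = {counts[value]}, "
          f"percentage = {100 * rel_freq:.0f}%, angle = {angle:.0f} degrees")
```

Running this reproduces the table (20, 40%, 144 degrees for Female; 30, 60%, 216 degrees for Male).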
There are several ways of presenting quantitative variables, and you may appeal to your artistic talent in doing so (but don’t overdo it, as simplicity is a virtue!). The crudest way is simply to arrange the observations in ascending (or descending) order, but this is usually not very informative. The first method we study is called a dot plot. The dot plot for the variable WEIGHT is presented below.
In constructing a dot plot, each observation is represented by one dot, and if a value appears more than once, we simply stack the dots. This particular dot plot is not the most enlightening, as not much grouping arose! We could also construct a comparative dot plot if there is a grouping variable, such as GENDER in our example. A comparative dot plot is used to compare the groups in terms of the values of a quantitative variable. Again using Minitab, we have the following comparative dot plot.
And through this we of course immediately see that females tend to be lighter than males!
Stem-and-Leaf Plots
Similar to the dot plot and comparative dot plots is the notion of a stem-and-leaf plot. The basic idea is to subdivide each value into a stem and a leaf, and then to draw a tree with the trunk consisting of the stems, and the leaves attached to the stem. For example, the first value of WEIGHT is 98, and we could split this into a stem taking value 9, and a leaf taking value 8. The stem-and-leaf plot for the WEIGHT variable then looks like (with the first column consisting of cumulative counts):
Leaf Unit = 1.0
1 9 8
4 10 188
11 11 0255688
16 12 00089
19 13 257
23 14 2358
(6) 15 044578
21 16 122578
15 17 00667
10 18 3468
6 19 0155
2 20 5
1 21 5
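The splitting rule behind the display can be sketched in a few lines of Python. This is not Minitab's algorithm, just the stem/leaf idea, demonstrated on the first few weights from the data set:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Split each value into a stem (all digits but the last) and a leaf
    (the last digit), collecting sorted leaves under each stem."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(groups.items())}

# First few weights from the data set: 98 splits into stem 9 and leaf 8, etc.
print(stem_and_leaf([98, 101, 108, 108, 110, 112]))
# {9: '8', 10: '188', 11: '02'}
```

These stems and leaves match the first rows of the Minitab display above.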
The associated comparative stem-and-leaf plot, with GENDER as grouping variable, is:
Character Stem-and-Leaf Display
Stem-and-leaf of Weight Gender = 0 N = 20
Leaf Unit = 1.0
1 9 8
4 10 188
(7) 11 0255688
9 12 00089
4 13 257
1 14 2
Stem-and-leaf of Weight Gender = 1 N = 30
Leaf Unit = 1.0
3 14 358
9 15 044578
15 16 122578
15 17 00667
10 18 3468
6 19 0155
2 20 5
1 21 5
Usually, this will be presented as a side-by-side stem-and-leaf plot so as to make the comparison easier.
The final graphical method that we study is the frequency distribution table and its associated histogram. The basic idea is to subdivide the range of the values of the variable into a small number of (usually equal-length) intervals and to count the frequencies of occurrence of values in each of these intervals. Usually the number of intervals is between 7 and 15, but there is no hard rule; the more observations you have, the more intervals you can afford. The histogram or relative frequency histogram is the graph associated with the frequency table. For the WEIGHT variable, using Minitab and the menu commands GRAPH > HISTOGRAM, the histogram for all observations is given by
Here, notice that the intervals were 90-99, 100-109, and so on. There should always be enough intervals to include all the observations. The histogram above is also referred to as a distribution; in this case it is bi-modal, as there are two “humps” associated with the females and the males, though the histogram alone does not reveal where the humps come from. To see that the two humps arise from the grouping variable, we could construct a comparative histogram. With GENDER as grouping variable, the comparative histogram is provided by:
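The counting step behind a frequency table can be sketched as a short function. As an illustration, we bin the 20 female weights from the appendix into the intervals 90-99, 100-109, and so on:

```python
from collections import Counter

def frequency_table(values, start, width):
    """Count observations falling in [start, start+width-1],
    [start+width, start+2*width-1], and so on."""
    bins = Counter((v - start) // width for v in values)
    return {(start + k * width, start + (k + 1) * width - 1): bins[k]
            for k in sorted(bins)}

# The 20 female weights from the appendix data set.
female_weights = [98, 108, 112, 118, 120, 128, 135, 137, 120, 118,
                  115, 115, 101, 108, 110, 116, 142, 120, 132, 129]
for interval, count in frequency_table(female_weights, start=90, width=10).items():
    print(interval, count)
```

The resulting counts (1, 3, 7, 5, 3, 1) agree with the female stem-and-leaf display shown earlier.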
And again, through this graph we immediately see that females are lighter than males; these plots confirm the saying that “a picture is worth a thousand words”!
When dealing with distributions, we can usually characterize them as symmetric, mound-shaped, left-skewed, right-skewed, bi-modal, or one of some other types that we will encounter later.
Illustration #2: Poverty and PACT in South Carolina
A study was conducted to determine if there is a difference in the PACT scores of SC school districts with respect to the percentage of students who receive subsidies for school lunches (this is supposed to be an indicator of the level of poverty of the school district, with high values indicating a poorer school district; whether it is a good measure is certainly another issue). The data set was obtained from an issue of The State, Columbia’s city newspaper, and is available in the DATASETS folder of the course website under the name pactvspoverty.mpj. This data set contains four variables:
District = school district name
Lunch = percentage of students receiving school lunch subsidy
ActualLang = percent of students who were not proficient on the Language portion of PACT
ActualMath = percent of students who were not proficient on the Math portion of the PACT
There were a total of 84 school districts.
Question: Is this a population or a sample?
For the moment we simply use this data set to demonstrate the graphical techniques of presenting data. Here are the histograms of the three quantitative variables.
What shapes do we observe from these histograms? Are they all symmetric? Is the histogram for the Language Proficiency right-skewed?
More importantly, at this point, we introduce the scatterplot, which is a plot of two quantitative variables, and aids in determining if there is an association or relationship between the two variables. In our case, the question of interest is whether there is a relationship between the proficiency (Math or Language) and the “poverty” level of the school district.
A provocative question is: If there is such a relationship, would the standard exams (SAT, ACT, PACT) be valid criteria for admission to public universities?
Based on these scatterplots, what could we conclude about the relationship between “poverty” and PACT scores? Of course, one might also be interested in whether, within school districts, there is a relationship between language and math scores. Here’s the scatterplot of these two variables.
This lecture demonstrates some basic methods of graphically presenting data, both qualitative and quantitative. We will have occasion to encounter other types of presentation methods, and in thinking of a way to present data graphically, we might add that its only limitation is our imagination!
Appendix: The Data Sets
Weight Data Set | Poverty vs PACT Data Set

Gender | Weight | District | %LunchSubsidy | %NotProficientLanguage | %NotProficientMath
0 | 98 | Abbeville | 59 | 32 | 38
1 | 150 | Aiken | 46 | 26 | 30
0 | 108 | Allendale | 90 | 63 | 67
1 | 158 | Anderson1 | 29 | 17 | 24
1 | 162 | Anderson2 | 41 | 24 | 26
0 | 112 | Anderson3 | 51 | 30 | 41
0 | 118 | Anderson4 | 41 | 25 | 30
1 | 167 | Anderson5 | 43 | 32 | 36
1 | 170 | Bamberg1 | 70 | 33 | 36
0 | 120 | Bamberg2 | 93 | 50 | 66
1 | 177 | Barnwell19 | 84 | 50 | 66
1 | 186 | Barnwell29 | 64 | 27 | 32
1 | 191 | Barnwell45 | 52 | 36 | 43
0 | 128 | Beaufort | 50 | 31 | 43
0 | 135 | Berkeley | 53 | 28 | 35
1 | 195 | Calhoun | 78 | 36 | 41
0 | 137 | Charleston | 57 | 31 | 42
1 | 205 | Cherokee | 51 | 39 | 42
1 | 190 | Chester | 55 | 41 | 53
0 | 120 | Chesterfield | 60 | 37 | 45
1 | 188 | Clarendon1 | 96 | 46 | 66
1 | 176 | Clarendon2 | 75 | 34 | 45
0 | 118 | Clarendon3 | 60 | 29 | 36
1 | 168 | Colleton | 71 | 43 | 53
0 | 115 | Darlington | 68 | 42 | 51
0 | 115 | Dillon1 | 76 | 47 | 52
1 | 162 | Dillon2 | 82 | 49 | 55
1 | 157 | Dillon3 | 73 | 30 | 41
1 | 154 | Dorchester2 | 31 | 24 | 30
1 | 148 | Dorchester4 | 75 | 45 | 57
0 | 101 | Edgefield | 57 | 29 | 40
1 | 143 | Fairfield | 80 | 51 | 63
1 | 145 | Florence1 | 54 | 30 | 44
0 | 108 | Florence2 | 67 | 28 | 33
1 | 155 | Florence3 | 76 | 45 | 50
0 | 110 | Florence4 | 87 | 61 | 61
1 | 154 | Florence5 | 54 | 27 | 33
0 | 116 | Georgetown | 60 | 32 | 41
1 | 161 | Greenville | 35 | 26 | 35
1 | 165 | Greenwood50 | 51 | 29 | 36
0 | 142 | Greenwood51 | 50 | 35 | 42
1 | 184 | Greenwood52 | 43 | 23 | 26
0 | 120 | Hampton1 | 66 | 32 | 44
1 | 170 | Hampton2 | 86 | 63 | 75
1 | 195 | Horry | 54 | 25 | 33
0 | 132 | Jasper | 87 | 60 | 69
0 | 129 | Kershaw | 49 | 29 | 37
1 | 215 | Lancaster | 46 | 38 | 43
1 | 176 | Laurens55 | 50 | 38 | 44
1 | 183 | Laurens56 | 57 | 40 | 50
  |     | Lee | 90 | 60 | 75
  |     | Lexington1 | 26 | 17 | 20
  |     | Lexington2 | 47 | 23 | 27
  |     | Lexington3 | 53 | 37 | 39
  |     | Lexington4 | 58 | 34 | 43
  |     | Lexington5 | 16 | 13 | 15
  |     | Marion1 | 74 | 48 | 54
  |     | Marion2 | 77 | 43 | 55
  |     | Marion3 | 94 | 41 | 62
  |     | Marion4 | 88 | 49 | 62
  |     | Marlboro | 78 | 50 | 59
  |     | McCormick | 79 | 46 | 58
  |     | Newberry | 61 | 41 | 47
  |     | Oconee | 45 | 26 | 34
  |     | Orangeburg3 | 87 | 49 | 62
  |     | Orangeburg4 | 68 | 36 | 52
  |     | Orangeburg5 | 76 | 45 | 56
  |     | Pickens | 32 | 22 | 31
  |     | Richland1 | 63 | 39 | 53
  |     | Richland2 | 33 | 20 | 26
  |     | Saluda | 64 | 44 | 53
  |     | Spartanburg1 | 39 | 20 | 22
  |     | Spartanburg2 | 37 | 21 | 27
  |     | Spartanburg3 | 47 | 23 | 30
  |     | Spartanburg4 | 40 | 29 | 41
  |     | Spartanburg5 | 43 | 25 | 27
  |     | Spartanburg6 | 37 | 24 | 31
  |     | Spartanburg7 | 64 | 37 | 43
  |     | Sumter17 | 59 | 36 | 45
  |     | Sumter2 | 70 | 32 | 41
  |     | Union | 55 | 37 | 46
  |     | Williamsburg | 90 | 38 | 47
  |     | York1 | 45 | 32 | 35
  |     | York2 | 31 | 25 | 24

(The Weight Data Set columns list the 50 sampled students; the Poverty vs PACT columns list all 84 school districts, so the last 34 rows have no Gender/Weight entries.)
Lecture 04 – Measures of Location and Position
Graphical methods of presenting data are important because of their visual impact. However, in summarizing data, we oftentimes would like a few numerical measures that capture the important features of the data set. There are three types of numerical summary measures that we will study for univariate data sets: measures of location, measures of position, and measures of dispersion. We begin our discussion with measures of location.
These summary measures purport to provide a summary of the “center” or central tendency of the data set. The center need not be the value on which the observations cluster, though most often it will be. The first measure of location is the well-known arithmetic average, or the mean. Since we will be dealing with sample data, this will be referred to as the sample mean. It is defined via:
Sample Mean (X-Bar) = (Sum of all observations)/(Number of observations).
For purposes of demonstrating the computation of these summary measures, let us consider the following data set, which represents the number of homeruns for each of 8 seasons of Sammy Sosa, a Chicago Cubs professional baseball player.
15, 10, 33, 25, 36, 40, 36, 66
The sample size for this data set is n = 8. The sample mean number of homeruns is
X-Bar = (15 + 10 + 33 + 25 + 36 + 40 + 36 + 66)/8 = 261/8 = 32.625 homeruns/season.
There are several properties of the sample mean.
a) It represents the “center of gravity” of the data set in the sense that this value will “balance” out the data set.
b) It is the value, A, that will minimize the sum of the squared deviations of each of the observations from A.
c) It is easily affected by outliers or extreme observations, hence it may not be the best measure when dealing with skewed distributions.
d) It uses all the observations in its computation.
The second measure of location or central tendency is the median. This is the value that divides the arranged data set into two equal parts. It is obtained by first arranging the data set into an ascending order, and then determining that value that will split the data set into a 50:50 split. To demonstrate, the arranged data set for Sosa’s homeruns is:
10, 15, 25, 33, 36, 36, 40, 66
Because n = 8, the observation that divides the data set into two equal parts is the (n+1)/2 = (8+1)/2 = 4.5th observation in the arranged data set. Since 4.5 is not a whole number, we take the median as the average of the 4th and 5th observations in the arranged data. Thus,
Median = (33 + 36)/2 = 34.5.
In contrast to the mean, the median is not affected by outliers, hence is suitable to use when dealing with skewed distributions. On the other hand, it does not utilize the magnitudes of all the observations, except for purposes of ranking the observations.
The third measure of location is the mode. This is the observation that occurs most frequently in the data set. For the homerun data set, this is the value of 36 since it appears twice. Most often, we would look for the modal class when given the frequency distribution or histogram, and the modal class is the interval of values which has the highest frequency.
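These three measures are easy to verify for the homerun data using Python's standard library (Minitab is the package used in these notes; this is just an illustrative alternative for small data sets):

```python
from statistics import mean, median, mode

homeruns = [15, 10, 33, 25, 36, 40, 36, 66]  # Sosa's 8 seasons

print(mean(homeruns))    # 32.625, matching the hand computation
print(median(homeruns))  # 34.5, the average of the 4th and 5th ordered values
print(mode(homeruns))    # 36, the only value appearing twice
```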
We now demonstrate these using the weight data, which also has “Gender” as a classification variable. When dealing with large data sets, it is better to use computer packages or your calculator for computing these quantities. In our case, we use Minitab to compute these quantities, using the Stat > Basic Stat > Descriptive Statistics menu commands.
Descriptive Statistics: Weight

Variable   N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
Weight    50   0  150.64     4.32  30.53    98.00  120.00  154.00  176.00   215.00
This tells us that the mean weight of all students is 150.64 pounds and the median weight is 154 pounds. Examining the frequency table or histogram, the modal classes are 157.5-172.5 and 112.5-127.5. The histogram is depicted below.
In this case we would say that the distribution is “bi-modal” of course owing to the fact that there are two groups: males and females.
With a classification variable like “Gender” we may also compute these quantities for each group in order to compare these measures across groups. Again, we may do this using Minitab.
Descriptive Statistics: Weight

Variable  Gender   N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
Weight    0       20   0  119.10     2.64  11.79    98.00  110.50  118.00  128.75   142.00
          1       30   0  171.67     3.37  18.44   143.00  156.50  169.00  186.50   215.00
From these summary measures, we note that the mean and median for the females are 119.10 and 118.00, respectively; whereas for the males they are 171.67 and 169.00, respectively.
In comparing the values of the mean, median, and mode, we have the following general guidelines:
a) for symmetric distributions, the mean, median and mode will generally coincide or be close to each other.
b) for right-skewed distributions (tail on the right), in order of increasing magnitude we have: mode, median, mean. This is because the mean will be affected by the extreme values to the right.
c) for left-skewed distributions (tail on the left), in order of increasing magnitude we have: mean, median, mode. This is because the mean will be affected by the extreme values to the left.
The median is an example of a measure of position as it divides the data set into two equal parts. The quantity that divides the arranged data set into a 25%:75% split is called the first quartile, denoted by Q1; whereas the quantity that divides the arranged data set into a 75%:25% split is called the third quartile, and denoted by Q3. To compute Q1, obtain (n+1)/4, and take the observation in the arranged data set whose index is closest to (n+1)/4. This is not the most precise way of doing it, but this will suffice. To obtain Q3, take the observation in the arranged data set whose index is closest to 3(n+1)/4. Thus, for the homerun data set where n = 8, we have:
(n+1)/4 = (8+1)/4 = 2.25, so Q1 = 2nd observation in arranged data = 15;
3(n+1)/4 = 3(8+1)/4 = 6.75, so Q3 = 7th observation in arranged data = 40.
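The rough "closest index" rule above can be written out as a small function. Note that this is the notes' rule, not Minitab's interpolation method, so Minitab's Q1 and Q3 may differ slightly:

```python
def quartile(data, frac):
    """Observation whose 1-based position in the arranged data is closest
    to frac * (n + 1) -- the rough rule used in these notes."""
    arranged = sorted(data)
    position = round(frac * (len(arranged) + 1))
    return arranged[position - 1]

homeruns = [15, 10, 33, 25, 36, 40, 36, 66]
print(quartile(homeruns, 0.25))  # Q1 = 15 (index closest to 2.25 is 2)
print(quartile(homeruns, 0.75))  # Q3 = 40 (index closest to 6.75 is 7)
```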
When using Minitab, Q1 and Q3 are also computed, as the outputs above demonstrate.
Other measures of position are the percentiles. For example, the 90th percentile is the value that divides the arranged data set into a 90%:10% split; thus 90% of all observations will be smaller than or equal to the 90th percentile. To determine the 90th percentile, one may look for the observation in the arranged data set whose index is closest to (n+1)(.90).
A box plot is another method of presenting data in a pictorial and compact way. In its simplest form, we form a box whose edges are Q1 and Q3, and we also indicate where the median is. Then we construct whiskers, which emanate from the edges and extend to the two extreme values. There are more complicated box plots that one may construct. For the weight data set, the box plot for the whole data set is (from Minitab):
We could also have comparative box plots to compare different groups. Thus, for the weight data set, the comparative box plots for the males and females is depicted below.
Notice that one immediately sees the difference between the two groups by looking at these box plots.
Lecture 05 - Measures of Dispersion
Measures of location do not, however, tell the complete story about a data set, as can be seen by considering the mean and median of the following three data sets:
Data Set 1: 1, 2, 3, 4, 5
Data Set 2: 1, 1, 3, 5, 5
Data Set 3: 3, 3, 3, 3, 3
Clearly, these three data sets are quite different from each other, but their means and medians all coincide. It is obvious that there are differences in the degree of variation among the observations in each of these data sets. Data set 3 has the least variation, in fact, it has no variation at all, while Data set 2 has the most variation. We therefore need an additional measure to augment the measures of location and get a better picture of the data set. Such measures are called measures of variation or dispersion.
Range: the range is a crude measure of variation. It is the difference between the extreme values in the data set.
(Sample) Variance = S² = {Sum of (Xi – XBar)²}/(n – 1) = “average” of the squared deviations of the observations from the sample mean. The reason the divisor is (n – 1) instead of n will become apparent when we use this sample variance to estimate the population variance: by dividing by (n – 1), it becomes an unbiased estimator of the population variance, that is, it is “on target” on average.
Note that the units of measurement of the variance are the squared units of the original observations. Also, clearly, the variance could never be negative, and the only time it equals zero is when all the observations are identical. The larger the value of the variance, the more variation there is in the data set.
(Sample) Standard Deviation = S = SQRT(S²) (take the positive root). Note that this has the same unit of measurement as the observations.
We demonstrate the computation of these quantities using the homerun data set.
Observation Number | Value (Xi) | Deviation from Mean   | Squared Deviation
1                  | 15         | 15 – 32.625 = -17.625 | 310.6406
2                  | 10         | 10 – 32.625 = -22.625 | 511.8906
3                  | 33         | 33 – 32.625 = .375    | .1406
4                  | 25         | 25 – 32.625 = -7.625  | 58.1406
5                  | 36         | 36 – 32.625 = 3.375   | 11.3906
6                  | 40         | 40 – 32.625 = 7.375   | 54.3906
7                  | 36         | 36 – 32.625 = 3.375   | 11.3906
8                  | 66         | 66 – 32.625 = 33.375  | 1113.8906
                   | Sum = 261; Mean = 32.625 | Sum = 0 | Sum = 2071.8748
Therefore, the sample variance is S2 = 2071.8748/(8-1) = 2071.8748/7 = 295.9821.
The standard deviation is S = SQRT(295.9821) = 17.2041.
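The whole computation in the table can be reproduced in a few lines of Python:

```python
homeruns = [15, 10, 33, 25, 36, 40, 36, 66]
n = len(homeruns)

xbar = sum(homeruns) / n                     # sample mean = 32.625
ss = sum((x - xbar) ** 2 for x in homeruns)  # sum of squared deviations
variance = ss / (n - 1)                      # divide by n - 1, not n
std_dev = variance ** 0.5                    # positive square root

print(round(variance, 4))  # 295.9821
print(round(std_dev, 4))   # 17.2041
```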
Remark: There is a simpler formula for computing the variance, called the machine formula, but we will not study this as we will mostly be computing the variance and standard deviation using your calculator or the computer.
Question: Is it always better to have small variability?
Answer: In many situations we would like to have small variability as this implies consistency and precision, especially when measuring certain quantities or when we would like to monitor a production line. However, sometimes it is also important that there be variability, so imagine a world where we all look identical, where every shot in a basketball game is good, etc. There is therefore a “Jekyll and Hyde” quality to how we would want the variation to be!
Before proceeding, we mention an important property of the mean, variance, and standard deviation.
Imagine a situation where you are interested in studying the noontime temperature in Columbia for the month of January. You will then have 31 observations, and let us suppose that you recorded these temperatures in Fahrenheit. With these observations you obtain the mean, variance, and standard deviation in units of Fahrenheit, Fahrenheit², and Fahrenheit, respectively. Suppose, however, that you decide to convert your readings into Centigrade, perhaps because of the requirements of the journal in which you are publishing your work. The question is: do you need to re-compute the mean, variance, and standard deviation, or is there a way to obtain them by simply utilizing the values you already have from the Fahrenheit readings?
Suppose that F is a temperature reading in Fahrenheit. We could convert this reading into Centigrade according to the formula
C = (5/9)(F – 32) = (5/9)(F) – (5/9)(32) = (a)(F) + b with a = 5/9 and b = - (5/9)(32).
This is what we refer to as a linear transformation (or an affine transformation). Here is the important result that we need:
Property Of Mean and Variance: Let X1, X2, ..., Xn be sample observations with mean XBAR and variance S2. Let a and b be constants, and define the transformed observations Y1, Y2, ..., Yn according to the formula
Yi = aXi + b for all i = 1, 2, ..., n.
Then the mean and variance of these new observations are given by
Mean of the Ys = (a)(XBAR) + b
Variance of the Ys = (a2)(S2)
Standard Deviation of the Ys = |a|S.
Consequently, suppose that the mean noontime temperature for the month of January is 45 degrees Fahrenheit, and the standard deviation is 8 degrees Fahrenheit. Without re-computing, the mean and standard deviation in units of Centigrade are equal to
Mean in Centigrade = (5/9)(45) –(5/9)(32) = 7.22
Standard Deviation in Centigrade = |5/9|(8) = 4.44.
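The property can be checked numerically on any data set. Here is a quick sketch with a few assumed Fahrenheit readings (hypothetical values, not actual Columbia temperatures):

```python
fahrenheit = [40.0, 45.0, 50.0, 38.0, 52.0]  # hypothetical noontime readings
a, b = 5 / 9, -(5 / 9) * 32                  # Centigrade = a*F + b

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

celsius = [a * f + b for f in fahrenheit]

# Property: Mean of Ys = a * XBAR + b, and Variance of Ys = a^2 * S^2.
print(abs(mean(celsius) - (a * mean(fahrenheit) + b)) < 1e-9)         # True
print(abs(variance(celsius) - a ** 2 * variance(fahrenheit)) < 1e-9)  # True
```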
Another measure of variation is the inter-quartile range (IQR), which is the difference between the third quartile (Q3) and the first quartile (Q1). Thus
IQR = Q3 – Q1.
Mostly in this course, however, we will be concerned with the variance and the standard deviation. We note that the inter-quartile range has an important role in determining outliers in a data set.
Outliers: Those observations whose values are smaller than Q1 – (1.5)(IQR) or larger than Q3 + (1.5)(IQR) are called mild outliers; while those observations smaller than Q1 – (3)(IQR) or larger than Q3 + (3)(IQR) are referred to as extreme outliers.
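These fences are simple to compute once the quartiles are known. For example, using Q1 = 120 and Q3 = 176 from the Minitab output for the full weight data set:

```python
def outlier_fences(q1, q3):
    """Mild and extreme outlier fences based on the inter-quartile range."""
    iqr = q3 - q1
    mild = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    extreme = (q1 - 3 * iqr, q3 + 3 * iqr)
    return mild, extreme

mild, extreme = outlier_fences(120.0, 176.0)  # weight data: IQR = 56
print(mild)     # (36.0, 260.0)  -> every weight (98 to 215) is inside the fences
print(extreme)  # (-48.0, 344.0)
```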
Imagine the situation where you have a data set with n observations. You are not given all the values of these observations, but you are provided the values of the mean (XBAR) and the standard deviation (S). What information do you get by knowing the mean and the standard deviation?
Case 1: You do NOT have any idea of the shape of the histogram or distribution. If this is the case, then you may use Chebyshev’s Inequality to make the following conclusions:
a) At least 75% [= (1 – (1/2)²) × 100%] of all the observations will have values between XBAR – 2S and XBAR + 2S.
b) At least 88.89% [= (1 – (1/3)²) × 100%] of all the observations will have values between XBAR – 3S and XBAR + 3S.
On the other hand, if you have the additional information that the histogram or distribution is mound-shaped, then you can do better than Chebyshev’s Inequality by invoking what is called the Empirical Rule. This states that, for mound-shaped distributions,
a) Approximately 68% of all observations will be between XBAR – S and XBAR + S.
b) Approximately 95% of all observations will be between XBAR – 2S and XBAR + 2S.
c) Approximately 99.7% (essentially all) of all observations will be between XBAR – 3S and XBAR + 3S.
We now demonstrate this using the WEIGHT data set. Recall that using Minitab, for this data set the sample mean and sample standard deviation are given by XBAR = 150.64 pounds and S = 30.53. We determine the percentages of observations in certain intervals.
Interval                      | Frequency | Percentage | According to Chebyshev | Empirical Rule
[XBAR ± S]  = [120.11, 181.17] | 26        | 52%        | at least 0%            | approx 68%
[XBAR ± 2S] = [89.58, 211.70]  | 49        | 98%        | at least 75%           | approx 95%
[XBAR ± 3S] = [59.05, 242.23]  | 50        | 100%       | at least 88.89%        | approx 99.7%
Observe that the result for the first interval is quite different from that expected under the empirical rule, but this is due to the fact that the histogram or distribution is not mound-shaped as we have observed it to be bi-modal owing to the mixing of the female and male weights.
The implication of these rules is that it is quite unlikely to have observations which are more than three standard deviations from the mean. In statistics, it is therefore customary to measure distances among observations in terms of standard deviation units. In connection with this, we have the notion of the standardized score or the z-score:
z-score = (Observation – Mean)/(Standard Deviation).
The z-score is a unit-less quantity, hence it could be used to compare distances even with data sets using different units of measurement.
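For instance, the heaviest student in the weight data set (215 pounds) is about two standard deviations above the mean:

```python
def z_score(x, mean, std_dev):
    """Distance of an observation from the mean, in standard-deviation units."""
    return (x - mean) / std_dev

# Using XBAR = 150.64 and S = 30.53 from the Minitab output for WEIGHT.
print(round(z_score(215, 150.64, 30.53), 2))  # 2.11
```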
In this lecture we discuss methods for measuring the degree of linear association and determining the relationship between two quantitative variables. To demonstrate the ideas, we utilize the data set regarding current age and years of life remaining on page 141, problem 3.48 of the textbook. A scatter plot of the data set is given below:
To compute the correlation coefficient and the regression coefficient, we implement the machine or computational formulas using Excel.
Computing the Correlation and Regression Coefficients

Data set: page 141, Problem 3.48. The independent variable is current age, while the dependent variable is years of life remaining.

XCurrentAge | YYrsRemain | Xsquared | Ysquared | XY
65          | 16.5       | 4225     | 272.25   | 1072.5
67          | 15.1       | 4489     | 228.01   | 1011.7
69          | 13.7       | 4761     | 187.69   | 945.3
71          | 12.4       | 5041     | 153.76   | 880.4
73          | 11.2       | 5329     | 125.44   | 817.6
75          | 10.1       | 5625     | 102.01   | 757.5
77          | 9          | 5929     | 81       | 693
79          | 8.4        | 6241     | 70.56    | 663.6
81          | 7.1        | 6561     | 50.41    | 575.1
83          | 6.4        | 6889     | 40.96    | 531.2
Sum         | 740 | 109.9 | 55090 | 1312.09 | 7947.9

Xmean  = 74
Ymean  = 10.99
SS(X)  = 330
SS(Y)  = 104.289
SS(XY) = -184.7
r = -0.99561326
b = -0.55969697
a = 52.40757576
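The machine (computational) formulas used above can also be carried out directly in Python as a check:

```python
ages = [65, 67, 69, 71, 73, 75, 77, 79, 81, 83]  # X = current age
years = [16.5, 15.1, 13.7, 12.4, 11.2, 10.1,
         9.0, 8.4, 7.1, 6.4]                     # Y = years of life remaining
n = len(ages)

# Machine formulas for the sums of squares.
ss_x = sum(x * x for x in ages) - sum(ages) ** 2 / n
ss_y = sum(y * y for y in years) - sum(years) ** 2 / n
ss_xy = sum(x * y for x, y in zip(ages, years)) - sum(ages) * sum(years) / n

r = ss_xy / (ss_x * ss_y) ** 0.5        # correlation coefficient
b = ss_xy / ss_x                        # slope (regression coefficient)
a = sum(years) / n - b * sum(ages) / n  # intercept

print(round(r, 4))  # -0.9956
print(round(b, 4))  # -0.5597
print(round(a, 4))  # 52.4076
```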
We could also compute these coefficients using Minitab. To compute the correlation, use the Stat > Basic Statistics menu and choose the correlation option. The results are as follows.
Correlations: CurrentAge, YrsRemain

Pearson correlation of CurrentAge and YrsRemain = -0.996
To compute the regression line, use the Stat > Regression menu in Minitab.
Regression Analysis: YrsRemain versus CurrentAge

The regression equation is
YrsRemain = 52.4 - 0.560 CurrentAge

Predictor        Coef  SE Coef       T      P
Constant       52.408    1.380   37.97  0.000
CurrentAge   -0.55970  0.01860  -30.10  0.000

S = 0.337818   R-Sq = 99.1%   R-Sq(adj) = 99.0%
Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  103.38  103.38  905.84  0.000
Residual Error   8    0.91    0.11
Total            9  104.29
We will study the meaning of this analysis of variance table, as well as the other components of this output, later. The important quantities to note from this output are the Constant coefficient of 52.408, which is the Y-intercept, and the CurrentAge coefficient of -0.55970, which is the regression coefficient (slope). The best fitting line to the data set is therefore:
(Predicted Years Remaining) = 52.408 - .55970(CurrentAge).
Another particular quantity to note at this stage is the R-Sq value of 99.1%. This is called the coefficient of determination, and it is just the square of the correlation coefficient. It measures the predictive ability of the independent variable. More technically, it measures the amount of variation in the dependent variable that could be explained by the independent variable.
Using the Minitab command “Fitted Line Plot” in the Stat > Regression menu, we obtain the scatterplot together with the fitted line.