Download this data set and store it in the folder containing the .qmd file for your homework assignment. The data set contains the self-reported heights (in feet and inches), lengths of index and pinky fingers (in millimeters), shoe size, and shoe size gender (“m”/“w”) of several students.
The code below imports the data into R and converts one “uk” shoe size to a “w” size according to a sizing chart found online. In addition, it converts the heights to centimeters and creates a data frame hg. Run this code to get started.
# import the data
hg0 <- read.table("heights.csv", sep = ",", header = TRUE)
# clean one data point
# https://www.grivetoutdoors.com/pages/shoe-size-chart
hg0$shoe[hg0$shoe_wm == 'uk'] <- 9
hg0$shoe_wm[hg0$shoe_wm == 'uk'] <- 'w'
# create data frame for analysis
hg <- data.frame(height = (hg0$ft * 12 + hg0$in.) * 2.54, # get heights in cm
                 ind_mm = hg0$ind_mm,
                 pnk_mm = hg0$pnk_mm,
                 shoe = hg0$shoe,
                 shoe_wm = hg0$shoe_wm)
# view the first few rows of the data frame
head(hg)
height ind_mm pnk_mm shoe shoe_wm
1 162.56 75 65 8.0 w
2 162.56 70 56 7.0 w
3 172.72 70 52 10.0 w
4 165.10 68 62 7.5 w
5 167.64 71 55 9.5 m
6 193.04 78 65 13.0 m
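As a quick check of the unit conversion used to build hg, the sketch below applies the same formula to two heights. The helper name height_cm and the feet-and-inches inputs are hypothetical, chosen only for illustration; they are not read from heights.csv.

```r
# Convert a height given in feet and inches to centimeters.
# 1 ft = 12 in and 1 in = 2.54 cm, the same formula used to build hg.
height_cm <- function(ft, inches) {
  (ft * 12 + inches) * 2.54
}

# Hypothetical inputs chosen for illustration (not read from heights.csv)
height_cm(5, 4)  # 162.56
height_cm(6, 4)  # 193.04
```

These two values happen to match the heights in rows 1 and 6 of the head(hg) output.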
It is of interest to use the multiple linear regression model to predict the height of a person based on his or her index and pinky finger lengths, shoe size, and shoe size gender.
1.
Make a figure which shows scatterplots for all pairs of variables in the data set. Comment on which pairs of variables appear to be highly correlated.
plot(hg)
Height appears positively linearly related to both finger length
measurements as well as to the shoe size. The heights
appear to differ between shoe size gender. The index
finger length appears highly positively correlated
with the pinky finger length. Shoe size appears positively
correlated with index and pinky finger lengths, and those wearing
men's shoes appear to have longer index and pinky
fingers. Lastly, the shoe sizes reported by those
wearing men's shoes appear to be greater on average than
those reported by those wearing women's shoes.
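The visual impressions above can be backed up with a correlation matrix. The sketch below is a minimal illustration that uses only the six rows shown in the head(hg) output, so its numbers will differ from those for the full data set; with the full data one would call cor() on the numeric columns of hg instead. The names hg6, num_cols and cor_mat are introduced here for illustration.

```r
# First six rows of hg, copied from the head(hg) output above
hg6 <- data.frame(
  height  = c(162.56, 162.56, 172.72, 165.10, 167.64, 193.04),
  ind_mm  = c(75, 70, 70, 68, 71, 78),
  pnk_mm  = c(65, 56, 52, 62, 55, 65),
  shoe    = c(8, 7, 10, 7.5, 9.5, 13),
  shoe_wm = c("w", "w", "w", "w", "m", "m")
)

# Pairwise Pearson correlations among the numeric columns
# (shoe_wm is categorical, so it is left out)
num_cols <- c("height", "ind_mm", "pnk_mm", "shoe")
cor_mat <- cor(hg6[, num_cols])
round(cor_mat, 2)
```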
2.
Fit a multiple linear regression model for predicting height based on all the other variables in the data set—index and pinky finger length, shoe size, and shoe size gender. Then:
2.a
Report the estimated value of the regression coefficient for each covariate.
lm_all <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg)
summary(lm_all)
Call:
lm(formula = height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg)
Residuals:
Min 1Q Median 3Q Max
-10.9786 -2.1592 0.8879 2.6740 7.8550
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 126.41698 17.11668 7.386 5.38e-07 ***
ind_mm 0.05668 0.34801 0.163 0.872332
pnk_mm 0.25891 0.33130 0.781 0.444150
shoe 3.12329 0.66411 4.703 0.000155 ***
shoe_wmw -6.00225 2.79359 -2.149 0.044772 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.19 on 19 degrees of freedom
Multiple R-squared: 0.8493, Adjusted R-squared: 0.8176
F-statistic: 26.77 on 4 and 19 DF, p-value: 1.411e-07
The summary() function applied to the output of the lm()
function gives the estimated coefficients in the first column of
the coefficients table.
2.b
Give the value of the estimated standard error for each of the covariates.
The summary() function applied to the output of the lm()
function gives the estimated standard errors in the second
column of the coefficients table.
2.c
Give the value of the test statistic for each of the covariates.
The summary() function applied to the output of the lm()
function gives the test statistics (t values) in the third
column of the coefficients table.
2.d
Give the p value for testing $H_0$: $\beta_j = 0$ versus $H_1$: $\beta_j \neq 0$ for each of the covariates.
The summary() function applied to the output of the lm()
function gives the p values in the fourth
column of the coefficients table.
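The four columns of the coefficients table are linked: each t value is the estimate divided by its standard error, and each p value is the two-sided tail probability of a t distribution with the residual degrees of freedom. The sketch below recomputes the last two columns from the first two, using numbers copied from the output above; the variable names are introduced here for illustration.

```r
# Estimates and standard errors copied from the summary() output above
est <- c(ind_mm = 0.05668, pnk_mm = 0.25891, shoe = 3.12329, shoe_wmw = -6.00225)
se  <- c(ind_mm = 0.34801, pnk_mm = 0.33130, shoe = 0.66411, shoe_wmw = 2.79359)
df_resid <- 19  # residual degrees of freedom reported in the output

# t statistic for testing whether a coefficient is zero: estimate / standard error
t_val <- est / se

# Two-sided p value from the t distribution with df_resid degrees of freedom
p_val <- 2 * pt(-abs(t_val), df = df_resid)

round(cbind(t_val, p_val), 4)
```

Up to rounding of the copied inputs, the results reproduce the "t value" and "Pr(>|t|)" columns.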
2.e
Give an interpretation to the estimated coefficient for the shoe size covariate.
An increase in shoe size by one size corresponds to an
increase in height of 3.123294 cm, on average,
with all other characteristics held fixed.
2.f
Give an interpretation to the estimated coefficient for the shoe size gender covariate.
Those wearing women's shoes are 6.002247 cm shorter,
on average, than those wearing men's shoes, when all
other characteristics are held fixed.
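One way to see what "with all other characteristics held fixed" means is to compare fitted heights that differ in a single covariate. The sketch below uses the rounded coefficients from the summary output; the function fitted_height and the covariate values passed to it (70 mm, 60 mm, sizes 8 and 9) are hypothetical choices for illustration.

```r
# Coefficients copied (rounded) from the summary() output above
b <- c(intercept = 126.41698, ind_mm = 0.05668, pnk_mm = 0.25891,
       shoe = 3.12329, shoe_wmw = -6.00225)

# Fitted mean height in cm; womens = 1 for a women's shoe size, 0 for a men's
fitted_height <- function(ind_mm, pnk_mm, shoe, womens) {
  unname(b["intercept"] + b["ind_mm"] * ind_mm + b["pnk_mm"] * pnk_mm +
           b["shoe"] * shoe + b["shoe_wmw"] * womens)
}

# One extra shoe size, everything else fixed:
# the difference equals the shoe coefficient
fitted_height(70, 60, 9, 1) - fitted_height(70, 60, 8, 1)

# Women's versus men's sizing, everything else fixed:
# the difference equals the shoe_wmw coefficient
fitted_height(70, 60, 8, 1) - fitted_height(70, 60, 8, 0)
```

Because the model is linear with no interaction terms, these differences do not depend on the values at which the other covariates are held.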
2.g
Do the index and pinky finger lengths appear to be important predictors of height?
In this model, they do not appear to be
statistically significant predictors of
height, as the p values for testing
whether their regression coefficients are
equal to zero are not small.
2.h
Give an estimate of $\sigma$, the standard deviation of the error term in the multiple linear regression model.
n <- nrow(hg)
p <- 4
sigmahat <- sqrt(sum(lm_all$residuals^2) / (n - (p + 1)))
sigmahat
[1] 5.190474
We obtain the estimate 5.190474, which agrees with the residual standard error of 5.19 reported in the summary output.
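The same quantity is available directly from a fitted model through sigma() or summary()$sigma. The sketch below checks this on a small stand-in fit that uses only the six rows from the head(hg) output and two covariates (so p = 2 here, not 4); it illustrates the formula and does not reproduce the full-data estimate.

```r
# First six rows of hg (numeric columns only), copied from the head(hg) output
hg6 <- data.frame(
  height = c(162.56, 162.56, 172.72, 165.10, 167.64, 193.04),
  ind_mm = c(75, 70, 70, 68, 71, 78),
  pnk_mm = c(65, 56, 52, 62, 55, 65)
)

# Stand-in model with two covariates, fitted to six rows only
fit <- lm(height ~ ind_mm + pnk_mm, data = hg6)

n <- nrow(hg6)
p <- 2  # number of covariates in this stand-in model

# Manual estimate: square root of SSE / (n - (p + 1))
sigmahat <- sqrt(sum(residuals(fit)^2) / (n - (p + 1)))

# Built-in equivalents
c(manual = sigmahat, sigma_fn = sigma(fit), from_summary = summary(fit)$sigma)
```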
2.i
Produce a normal quantile-quantile plot of the residuals as well as a residuals versus fitted values plot. Comment on whether you believe the assumptions of the multiple linear regression model to be satisfied.
plot(lm_all,which=1)
plot(lm_all,which=2)
The normal q-q plot does not show any alarming departures
from normality. The residuals versus fitted values plot exhibits some
fanning out from the left to the right, suggesting that the variance of
the heights may not be constant over all covariate values. In
particular, it appears that the variance in height is greater when the
predicted height is greater (at greater values of the finger length and shoe
size covariates).
3.
Fit a simple linear regression model using only the shoe size gender covariate. Then:
3.a
Give an interpretation of the estimated regression coefficient for the shoe size gender covariate.
lm_wm <- lm(height ~ shoe_wm, data = hg)
summary(lm_wm)
Call:
lm(formula = height ~ shoe_wm, data = hg)
Residuals:
Min 1Q Median 3Q Max
-15.4517 -6.0325 -0.9525 4.8683 20.1083
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 183.092 2.504 73.133 < 2e-16 ***
shoe_wmw -17.039 3.541 -4.813 8.3e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.673 on 22 degrees of freedom
Multiple R-squared: 0.5129, Adjusted R-squared: 0.4907
F-statistic: 23.16 on 1 and 22 DF, p-value: 8.303e-05
Those wearing women's shoes are 17.03917 cm shorter,
on average, than those wearing men's shoes. We do not need to
say 'with all other characteristics held fixed', because
we have not included any other covariates in the model.
3.b
Why does this covariate appear to have a different effect when it is the sole covariate in the model?
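When no other covariates are included, the estimated coefficient
for the shoe size gender covariate is simply the difference between
the average height of those wearing women's shoes and
that of those wearing men's shoes, without regard for
any other characteristics. In the model with all four covariates,
much of that difference is already accounted for by shoe size
and finger lengths, so the coefficient is smaller in magnitude.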
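The claim that a single 0/1 covariate reproduces a difference in group means can be verified numerically. The sketch below does so on the six rows from the head(hg) output only, so the numbers differ from the full-data output above; the names hg6, fit_wm, mean_m and mean_w are introduced here for illustration.

```r
# Heights and shoe size gender for the first six rows of hg,
# copied from the head(hg) output
hg6 <- data.frame(
  height  = c(162.56, 162.56, 172.72, 165.10, 167.64, 193.04),
  shoe_wm = c("w", "w", "w", "w", "m", "m")
)

# Regression on the shoe size gender alone ("m" is the baseline level)
fit_wm <- lm(height ~ shoe_wm, data = hg6)

# Group means computed directly
mean_m <- mean(hg6$height[hg6$shoe_wm == "m"])
mean_w <- mean(hg6$height[hg6$shoe_wm == "w"])

coef(fit_wm)
c(mean_m = mean_m, diff_w_minus_m = mean_w - mean_m)
```

The intercept equals the mean height in the baseline group and the shoe_wmw coefficient equals the women's mean minus the men's mean.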
4.
Fit a multiple linear regression model using only the index and pinky finger lengths as predictors of height.
4.a
Does either covariate in this model appear to be significantly related to the height?
lm_fingers <- lm(height ~ ind_mm + pnk_mm, data = hg)
summary(lm_fingers)
Call:
lm(formula = height ~ ind_mm + pnk_mm, data = hg)
Residuals:
Min 1Q Median 3Q Max
-14.9603 -5.5195 -0.2733 5.7353 19.7888
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.5938 23.7968 2.756 0.0118 *
ind_mm 1.1716 0.5250 2.232 0.0367 *
pnk_mm 0.3637 0.5640 0.645 0.5260
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.981 on 21 degrees of freedom
Multiple R-squared: 0.5013, Adjusted R-squared: 0.4538
F-statistic: 10.56 on 2 and 21 DF, p-value: 0.0006715
The index finger length appears to be
significantly related to height, but the pinky finger coefficient
has a large p value, suggesting that this covariate does
not contribute significant information beyond
that contributed by the index finger length.
4.b
What proportion of the total variation in heights does this model explain?
This is the coefficient of determination, appearing as
'Multiple R-squared' in the summary output. The value is 0.5013,
so this model explains about 50% of the total variation in heights.
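The other entries in the last lines of the summary(lm_fingers) output follow from the multiple R-squared. The sketch below recomputes the adjusted R-squared and the overall F statistic from values copied from that output; because the copied R-squared is rounded to four decimals, the F statistic agrees with the reported one only approximately.

```r
# Values copied from the summary(lm_fingers) output above
r2 <- 0.5013  # Multiple R-squared
n  <- 24      # 21 residual df + 3 estimated coefficients
p  <- 2       # covariates: ind_mm and pnk_mm

# Adjusted R-squared penalizes for the number of covariates
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic for testing that all slope coefficients are zero
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
```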
6.
A forensic team analyzes a shoe print and a hand print, presumably left by the same person: The shoe print belongs to a size 8 women’s shoe and the index and pinky fingers measure 70 mm and 60 mm, respectively:
6.a
If the forensic team uses the models fitted above to make guesses about the height of the person who left the prints, will they be extrapolating beyond the range of the observed data? Explain your answer.
From careful study of the scatterplots
produced in the first question, it does not
appear that a person wearing a size 8 women's
shoe and having index and pinky fingers
measuring 70mm and 60mm, respectively, is an outlier
beyond the range of observed data.
6.b
Give an interval such that the forensic team can be 95% certain it contains the average height of the population of all people wearing size 8 women’s shoes and having index and pinky fingers measuring 70 mm and 60 mm, respectively.
Give an interval such that the forensic team can be 95% certain it contains the height of the person who left the prints.
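xnew <- data.frame(ind_mm = 70, pnk_mm = 60, shoe = 8, shoe_wm = 'w')
pi <- predict(lm_all, newdata = xnew, interval = 'prediction')
The prediction interval for the height of the person who left the prints is (153.57,176.237).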
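A prediction interval from predict() is centered at the fitted value, with half-width equal to a t quantile times the standard error of prediction. The sketch below checks the reported interval against the rounded coefficients from the summary of the full model, assuming the default 95% level of predict() and covariate values of 70 mm, 60 mm and a size 8 women's shoe. All variable names are introduced here for illustration.

```r
# Coefficients copied (rounded) from the summary of the four-covariate model
b <- c(126.41698, 0.05668, 0.25891, 3.12329, -6.00225)

# Covariate vector: intercept, ind_mm = 70, pnk_mm = 60, shoe = 8, women's = 1
x <- c(1, 70, 60, 8, 1)

# Point prediction from the fitted equation
point_pred <- sum(b * x)

# Reported prediction interval
lwr <- 153.57
upr <- 176.237
midpoint   <- (lwr + upr) / 2
half_width <- (upr - lwr) / 2

# Implied standard error of prediction, assuming a 95% level and 19 residual df
se_pred <- half_width / qt(0.975, df = 19)

c(point_pred = point_pred, midpoint = midpoint, se_pred = se_pred)
```

The midpoint agrees with the point prediction up to rounding, and the implied standard error of prediction is slightly larger than the estimated error standard deviation of 5.190474, as expected, since it also carries the uncertainty in the fitted mean.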
7.
Suppose there is no shoe print, but only a hand print with index and pinky fingers measuring 70 mm and 60 mm, respectively:
7.a
Give an interval such that the forensic team can be 95% certain it contains the average height of the population of all people having index and pinky fingers measuring 70 mm and 60 mm, respectively.
Give an interval such that the forensic team can be 95% certain it contains the height of the person who left the hand print.
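xnew <- data.frame(ind_mm = 70, pnk_mm = 60)
pi <- predict(lm_fingers, newdata = xnew, interval = 'prediction')
The prediction interval for the height of the person who left the hand print is (150.1608,188.6977).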
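The two kinds of interval asked for here come from the same predict() call with different values of the interval argument. The sketch below shows both on a stand-in fit that uses only the six rows from the head(hg) output, so the numbers are illustrative and do not match the full-data interval above. It also checks that the squared standard error of prediction equals the squared standard error of the fitted mean plus the estimated error variance.

```r
# First six rows of hg (numeric columns only), copied from the head(hg) output
hg6 <- data.frame(
  height = c(162.56, 162.56, 172.72, 165.10, 167.64, 193.04),
  ind_mm = c(75, 70, 70, 68, 71, 78),
  pnk_mm = c(65, 56, 52, 62, 55, 65)
)
fit <- lm(height ~ ind_mm + pnk_mm, data = hg6)
xnew <- data.frame(ind_mm = 70, pnk_mm = 60)

# 95% interval for the mean height of all people with these finger lengths
ci <- predict(fit, newdata = xnew, interval = "confidence", level = 0.95)
# 95% interval for the height of one new person with these finger lengths
pr <- predict(fit, newdata = xnew, interval = "prediction", level = 0.95)

rbind(confidence = ci[1, ], prediction = pr[1, ])

# Standard errors implied by the half-widths of the two intervals
tq <- qt(0.975, df = df.residual(fit))
se_fit  <- unname(ci[1, "upr"] - ci[1, "lwr"]) / (2 * tq)
se_pred <- unname(pr[1, "upr"] - pr[1, "lwr"]) / (2 * tq)

# se_pred^2 and se_fit^2 + sigma^2 should agree
c(se_pred_sq = se_pred^2, se_fit_sq_plus_sigma_sq = se_fit^2 + sigma(fit)^2)
```

Both intervals share the same center; the prediction interval is always the wider of the two because it adds the variability of an individual around the mean.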
8.
Suppose there is no hand print, but only a shoe print belonging to a size 8 women’s shoe:
8.a
Give an interval such that the forensic team can be 95% certain it contains the average height of the population of all people wearing a size 8 women’s shoe.
Give an interval such that the forensic team can be 95% certain it contains the height of the person who left the shoe print.
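xnew <- data.frame(shoe = 8, shoe_wm = 'w')
pi <- predict(lm_shoes, newdata = xnew, interval = 'prediction')
The prediction interval for the height of the person who left the shoe print is (153.6233,175.831).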
9.
Answer the following based on careful study of the preceding model output and confidence and prediction intervals:
9.a
If a shoe print is found, does a hand print provide useful additional accuracy in guessing the height of the person leaving the prints?
The index and pinky finger lengths appear to
contribute very little additional information if the
shoe size and shoe size gender are known. This is seen
in two ways from the above output: The prediction
interval based on the complete information (with finger
lengths) is about 22.7 cm wide, which is no narrower than
the roughly 22.2 cm interval based only on shoe
information. Moreover, the model with all four covariates
has a coefficient of determination scarcely higher than
that of the model with only the shoe covariates.
9.b
If a hand print is found, does a shoe print provide useful additional accuracy in guessing the height of the person leaving the prints?
The shoe print information does indeed allow
the team to obtain a more accurate guess at the
height of the person who left the prints; note
how much narrower the confidence and prediction
intervals became when the shoe size and gender
information was included in the model.
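The comparison can be made concrete by computing the widths of the three prediction intervals reported above. The sketch below uses only those reported endpoints; the variable and function names are introduced here for illustration.

```r
# Prediction interval endpoints (cm) copied from the answers above
pi_all     <- c(lwr = 153.57,   upr = 176.237)   # fingers and shoe information
pi_fingers <- c(lwr = 150.1608, upr = 188.6977)  # fingers only
pi_shoes   <- c(lwr = 153.6233, upr = 175.831)   # shoe information only

# Width of an interval given as c(lwr, upr)
width <- function(int) unname(int["upr"] - int["lwr"])

widths <- c(all = width(pi_all),
            fingers = width(pi_fingers),
            shoes = width(pi_shoes))
round(widths, 2)
```

Adding the shoe information to the fingers-only model shrinks the interval from about 38.5 cm to about 22.7 cm, while the shoes-only interval is about 22.2 cm wide.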
9.c
If only a hand print is found, should the forensic team bother trying to use the index and pinky finger lengths to guess the height of the person who left it?
If no shoe information is available, the hand
print information can still be useful. Even
though the confidence and prediction intervals
are wide when only the hand print information is
used, the intervals still allow the forensic team
to narrow down the range of possible heights for
the person leaving the print (the prediction
interval does not cover the entire range of the
observed data).