STAT 516 hw 3

Author

Karl Gregory

Download this data set and store it in the folder containing the .qmd file for your homework assignment. The data set contains the self-reported heights (in feet and inches), lengths of index and pinky fingers (in millimeters), shoe size, and shoe size gender (“m”/“w”) of several students.

The code below imports the data into R and converts one “uk” shoe size to a “w” size according to a sizing chart found online. It also converts the heights to centimeters and creates a data frame hg. Run this code to get started.

# import the data
hg0 <- read.table("heights.csv", sep = ",", header = TRUE)

# clean one data point
# https://www.grivetoutdoors.com/pages/shoe-size-chart
hg0$shoe[hg0$shoe_wm == 'uk'] <- 9 
hg0$shoe_wm[hg0$shoe_wm == 'uk'] <- 'w'

# create data frame for analysis
hg <- data.frame(height = (hg0$ft*12 + hg0$in.)*2.54, # get heights in cm
                 ind_mm = hg0$ind_mm,
                 pnk_mm = hg0$pnk_mm,
                 shoe = hg0$shoe,
                 shoe_wm = hg0$shoe_wm)

# view the first few rows of the data frame
head(hg)
  height ind_mm pnk_mm shoe shoe_wm
1 162.56     75     65  8.0       w
2 162.56     70     56  7.0       w
3 172.72     70     52 10.0       w
4 165.10     68     62  7.5       w
5 167.64     71     55  9.5       m
6 193.04     78     65 13.0       m
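
As a quick check on the unit conversion used to build hg, the sketch below wraps the same arithmetic (12 inches per foot, 2.54 cm per inch) in a small helper. The function name to_cm is hypothetical and is not part of the assignment code.

```r
# Hypothetical helper: convert a height in feet and inches to centimeters,
# using the same arithmetic as the data frame construction above
to_cm <- function(ft, inches) (ft * 12 + inches) * 2.54

to_cm(5, 4)   # 64 inches -> 162.56 cm, the height shown in rows 1 and 2
to_cm(6, 4)   # 76 inches -> 193.04 cm, the height shown in row 6
```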

It is of interest to use the multiple linear regression model to predict the height of a person based on his or her index and pinky finger lengths, shoe size, and shoe size gender.

1.

Make a figure which shows scatterplots for all pairs of variables in the data set. Comment on which pairs of variables appear to be highly correlated.

plot(hg)

Height appears positively linearly related to both finger length 
measurements as well as to the shoe size. The heights 
appear to differ between the two shoe size genders. The index 
finger length appears highly positively correlated 
with the pinky finger length. Shoe size appears positively 
correlated with index and pinky finger lengths, and those wearing 
men's shoes appear to have longer index and pinky 
fingers. Lastly, the shoe sizes reported by those 
wearing men's shoes appear to be greater on average than 
those reported by those wearing women's shoes.

2.

Fit a multiple linear regression model for predicting height based on all the other variables in the data set—index and pinky finger length, shoe size, and shoe size gender. Then:

2.a

Report the estimated value of the regression coefficient for each covariate.

lm_all <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg)
summary(lm_all)

Call:
lm(formula = height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.9786  -2.1592   0.8879   2.6740   7.8550 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 126.41698   17.11668   7.386 5.38e-07 ***
ind_mm        0.05668    0.34801   0.163 0.872332    
pnk_mm        0.25891    0.33130   0.781 0.444150    
shoe          3.12329    0.66411   4.703 0.000155 ***
shoe_wmw     -6.00225    2.79359  -2.149 0.044772 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.19 on 19 degrees of freedom
Multiple R-squared:  0.8493,    Adjusted R-squared:  0.8176 
F-statistic: 26.77 on 4 and 19 DF,  p-value: 1.411e-07
The summary() function applied to the output of the lm() 
function gives the estimated coefficients in the first column of 
the coefficients table.
2.b

Give the value of the estimated standard error $\widehat{\text{s.e.}}(\hat\beta_j) = \hat\sigma/\sqrt{\Omega_{jj}}$ for each of the covariates.

The summary() function applied to the output of the lm() 
function gives the estimated standard errors in the second 
column of the coefficients table.
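
For readers who want to see where the 'Std. Error' column comes from, here is a minimal sketch that recomputes the standard errors from the design matrix, using the diagonal of the inverse of X'X. Since heights.csv is not reproduced here, it uses simulated stand-in data; hg_sim and all the numbers used to generate it are assumptions, not the course data.

```r
# Simulated stand-in for hg (NOT the course data); same column names
set.seed(516)
n <- 24
shoe_wm <- rep(c("m", "w"), each = n / 2)
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 70 + 2 * (shoe - 9) + rnorm(n, sd = 4)
pnk_mm  <- ind_mm - 12 + rnorm(n, sd = 3)
height  <- 145 + 3.5 * shoe - 6 * (shoe_wm == "w") + rnorm(n, sd = 5)
hg_sim  <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

fit <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg_sim)

# Design matrix (intercept column plus the 0/1 dummy for shoe_wm == "w")
X <- model.matrix(fit)

# Estimate of sigma, then standard errors from the diagonal of (X'X)^{-1}
sigmahat  <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
se_manual <- sigmahat * sqrt(diag(solve(crossprod(X))))

# The two columns should agree
cbind(manual = se_manual, summary = coef(summary(fit))[, "Std. Error"])
```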
2.c

Give the value of the test statistic $T_{\text{test}} = \hat\beta_j / (\hat\sigma/\sqrt{\Omega_{jj}})$ for each of the covariates.

The summary() function applied to the output of the lm() 
function gives the values of the test statistics in the third 
column (labeled 't value') of the coefficients table.
2.d

Give the p value for testing $H_0\colon \beta_j = 0$ versus $H_1\colon \beta_j \neq 0$ for each of the covariates.

The summary() function applied to the output of the lm() 
function gives the p values in the fourth 
column of the coefficients table.
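
The third and fourth columns of the coefficients table can be rebuilt from the first two. The sketch below does this on simulated stand-in data (hg_sim and its generating values are assumptions, not the course data): each test statistic is the estimate divided by its standard error, and each two-sided p value comes from a t distribution with the residual degrees of freedom.

```r
# Simulated stand-in for hg (NOT the course data); same column names
set.seed(516)
n <- 24
shoe_wm <- rep(c("m", "w"), each = n / 2)
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 70 + 2 * (shoe - 9) + rnorm(n, sd = 4)
pnk_mm  <- ind_mm - 12 + rnorm(n, sd = 3)
height  <- 145 + 3.5 * shoe - 6 * (shoe_wm == "w") + rnorm(n, sd = 5)
hg_sim  <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

fit <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg_sim)
tab <- coef(summary(fit))   # the coefficients table as a numeric matrix

# Test statistic: estimate divided by its standard error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p value from the t distribution with n - (p + 1) degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

cbind(t_manual, t_summary = tab[, "t value"],
      p_manual, p_summary = tab[, "Pr(>|t|)"])
```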
2.e

Give an interpretation to the estimated coefficient $\hat\beta_j$ for the shoe size covariate.

An increase in shoe size by one size corresponds to an 
increase in height of 3.123294cm, on average, 
with all other characteristics held fixed.
2.f

Give an interpretation to the estimated coefficient $\hat\beta_j$ for the shoe size gender covariate.

Those wearing women's shoes are 6.002247cm shorter, 
on average, than those wearing men's shoes, when all 
other characteristics are held fixed.
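
One way to make the 'held fixed' interpretations in 2.e and 2.f concrete is to compare predictions for two hypothetical people who differ in a single covariate. The sketch below uses simulated stand-in data (hg_sim and the covariate values in base are assumptions chosen for illustration, not the course data).

```r
# Simulated stand-in for hg (NOT the course data); same column names
set.seed(516)
n <- 24
shoe_wm <- rep(c("m", "w"), each = n / 2)
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 70 + 2 * (shoe - 9) + rnorm(n, sd = 4)
pnk_mm  <- ind_mm - 12 + rnorm(n, sd = 3)
height  <- 145 + 3.5 * shoe - 6 * (shoe_wm == "w") + rnorm(n, sd = 5)
hg_sim  <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

fit <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg_sim)

# A reference person and two variants that each change one covariate only
base        <- data.frame(ind_mm = 70, pnk_mm = 60, shoe = 8, shoe_wm = "w")
one_size_up <- transform(base, shoe = shoe + 1)
mens        <- transform(base, shoe_wm = "m")

# One shoe size larger, everything else fixed: difference equals the shoe coefficient
diff_shoe <- predict(fit, one_size_up) - predict(fit, base)

# Women's versus men's sizing, everything else fixed: equals the shoe_wmw coefficient
diff_wm <- predict(fit, base) - predict(fit, mens)

c(diff_shoe = unname(diff_shoe), coef_shoe = unname(coef(fit)["shoe"]))
c(diff_wm = unname(diff_wm), coef_wm = unname(coef(fit)["shoe_wmw"]))
```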
2.g

Do the index and pinky finger lengths appear to be important predictors of height?

In this model, they do not appear to be 
statistically significant predictors of 
height, as the p values for testing 
whether their regression coefficients are 
equal to zero are not small.
2.h

Give an estimate of σ, the standard deviation of the error term in the multiple linear regression model.

n <- nrow(hg)
p <- 4
sigmahat <- sqrt(sum(lm_all$residuals^2)/(n-(p+1)))
sigmahat
[1] 5.190474
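We obtain the estimate 5.190474, which matches the 
'Residual standard error' of 5.19 on 19 degrees of 
freedom reported in the summary output.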
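
The same estimate is available without writing out the formula. The sketch below, on simulated stand-in data (hg_sim is an assumption, not the course data), compares the manual calculation with sigma() and with the sigma component of the summary.

```r
# Simulated stand-in for hg (NOT the course data); same column names
set.seed(516)
n <- 24
shoe_wm <- rep(c("m", "w"), each = n / 2)
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 70 + 2 * (shoe - 9) + rnorm(n, sd = 4)
pnk_mm  <- ind_mm - 12 + rnorm(n, sd = 3)
height  <- 145 + 3.5 * shoe - 6 * (shoe_wm == "w") + rnorm(n, sd = 5)
hg_sim  <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

fit <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg_sim)

p <- 4                                   # number of covariates
sigma_manual <- sqrt(sum(residuals(fit)^2) / (n - (p + 1)))

# Three routes to the same number
c(manual = sigma_manual, sigma = sigma(fit), summary = summary(fit)$sigma)
```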
2.i

Produce a normal quantile-quantile plot of the residuals as well as a residuals versus fitted values plot. Comment on whether you believe the assumptions of the multiple linear regression model to be satisfied.

plot(lm_all,which=1)

plot(lm_all,which=2)

The normal q-q plot does not show any alarming departures 
from normality. The residuals versus fitted values plot exhibits some 
fanning out from the left to the right, suggesting that the variance of 
the heights may not be constant over all covariate values. In 
particular, it appears that the variance in height is greater when the 
predicted height is greater (at greater values of the finger length and shoe 
size covariates).

3.

Fit a simple linear regression model using only the shoe size gender covariate. Then:

3.a

Give an interpretation of the estimated regression coefficient for the shoe size gender covariate.

lm_wm <- lm(height ~ shoe_wm, data = hg)
summary(lm_wm)

Call:
lm(formula = height ~ shoe_wm, data = hg)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.4517  -6.0325  -0.9525   4.8683  20.1083 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  183.092      2.504  73.133  < 2e-16 ***
shoe_wmw     -17.039      3.541  -4.813  8.3e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.673 on 22 degrees of freedom
Multiple R-squared:  0.5129,    Adjusted R-squared:  0.4907 
F-statistic: 23.16 on 1 and 22 DF,  p-value: 8.303e-05
Those wearing women's shoes are 17.03917cm shorter, 
on average, than those wearing men's shoes. We do not need to 
say 'with all other characteristics held fixed', because 
we have not included any other covariates in the model.
3.b

Why does this covariate appear to have a different effect when it is the sole covariate in the model?

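When no other covariates are included, the coefficient of 
the shoe size gender covariate is equal to the difference 
between the average height of those wearing women's shoes 
and that of those wearing men's shoes, without regard for 
any other characteristics. In the model with all covariates, 
part of this difference is attributed to shoe size, which 
is itself related to shoe size gender, so the coefficient 
is smaller in magnitude (-6.002 rather than -17.039).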
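
The link between the single-covariate coefficient and the group means can be checked numerically. A minimal sketch on simulated stand-in data follows (hg_sim is an assumption, not the course data): the intercept is the mean height in the reference group "m" and the shoe_wmw coefficient is the "w" mean minus the "m" mean.

```r
# Simulated stand-in for hg (NOT the course data); same column names
set.seed(516)
n <- 24
shoe_wm <- rep(c("m", "w"), each = n / 2)
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
height  <- 145 + 3.5 * shoe - 6 * (shoe_wm == "w") + rnorm(n, sd = 5)
hg_sim  <- data.frame(height, shoe, shoe_wm)

fit_wm    <- lm(height ~ shoe_wm, data = hg_sim)
grp_means <- tapply(hg_sim$height, hg_sim$shoe_wm, mean)

# Intercept = mean of the reference group "m"
c(intercept = coef(fit_wm)[["(Intercept)"]], mean_m = grp_means[["m"]])

# Coefficient = mean of "w" minus mean of "m"
c(coef = coef(fit_wm)[["shoe_wmw"]],
  diff = grp_means[["w"]] - grp_means[["m"]])

# Adding shoe size changes the coefficient, because shoe size differs by group
coef(lm(height ~ shoe + shoe_wm, data = hg_sim))[["shoe_wmw"]]
```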

4.

Fit a multiple linear regression model using only the index and pinky finger lengths as predictors of height.

4.a

Does either covariate in this model appear to be significantly related to the height?

lm_fingers <- lm(height ~ ind_mm + pnk_mm, data = hg)
summary(lm_fingers)

Call:
lm(formula = height ~ ind_mm + pnk_mm, data = hg)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.9603  -5.5195  -0.2733   5.7353  19.7888 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  65.5938    23.7968   2.756   0.0118 *
ind_mm        1.1716     0.5250   2.232   0.0367 *
pnk_mm        0.3637     0.5640   0.645   0.5260  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.981 on 21 degrees of freedom
Multiple R-squared:  0.5013,    Adjusted R-squared:  0.4538 
F-statistic: 10.56 on 2 and 21 DF,  p-value: 0.0006715
The index finger length appears to be 
significantly related to height, but the pinky finger coefficient 
has a large p value, suggesting that this covariate does 
not contribute significant information beyond 
that contributed by the index finger length.
4.b

What proportion of the total variation in heights does this model explain?

SSreg <- sum((lm_fingers$fitted.values - mean(hg$height))^2)
SStot <- sum((hg$height - mean(hg$height))^2)
Rsq <- SSreg/SStot
This is the coefficient of determination, 
appearing as 
'Multiple R-squared' in the 
summary output. The value is 0.5013299.
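
As a cross-check on the sums-of-squares calculation, the sketch below computes the coefficient of determination three ways on simulated stand-in data (hg_sim is an assumption, not the course data): as SSreg/SStot, as 1 - SSres/SStot, and from the r.squared component of the summary.

```r
# Simulated stand-in for hg (NOT the course data); same column names
set.seed(516)
n <- 24
shoe    <- rep(c(10.5, 8), each = n / 2) + rnorm(n)
ind_mm  <- 70 + 2 * (shoe - 9) + rnorm(n, sd = 4)
pnk_mm  <- ind_mm - 12 + rnorm(n, sd = 3)
height  <- 145 + 3.5 * shoe + rnorm(n, sd = 5)
hg_sim  <- data.frame(height, ind_mm, pnk_mm)

fit_f <- lm(height ~ ind_mm + pnk_mm, data = hg_sim)

SStot <- sum((hg_sim$height - mean(hg_sim$height))^2)   # total variation
SSreg <- sum((fitted(fit_f) - mean(hg_sim$height))^2)   # explained by the model
SSres <- sum(residuals(fit_f)^2)                        # left unexplained

c(ratio      = SSreg / SStot,
  complement = 1 - SSres / SStot,
  summary    = summary(fit_f)$r.squared)
```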

5.

Fit a multiple linear regression model using only the shoe size and shoe size gender covariates.

5.a

Does either covariate in this model appear to be significantly related to the height?

lm_shoes <- lm(height ~ shoe + shoe_wm, data = hg)
summary(lm_shoes)

Call:
lm(formula = height ~ shoe + shoe_wm, data = hg)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.519  -1.943   0.400   2.647   8.034 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 142.5944     6.4238  22.198 4.62e-16 ***
shoe          3.5343     0.5455   6.478 2.03e-06 ***
shoe_wmw     -6.1417     2.6850  -2.287   0.0326 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.126 on 21 degrees of freedom
Multiple R-squared:  0.8375,    Adjusted R-squared:  0.8221 
F-statistic: 54.13 on 2 and 21 DF,  p-value: 5.162e-09
Both of these covariates appear to be 
significant predictors of height, as each has a 
small p value.
5.b

What proportion of the total variation in heights does this model explain?

SSreg <- sum((lm_shoes$fitted.values - mean(hg$height))^2)
SStot <- sum((hg$height - mean(hg$height))^2)
Rsq <- SSreg/SStot
This is the coefficient of determination, appearing as 
'Multiple R-squared' in the summary output. The value is 0.8375415.

6.

A forensic team analyzes a shoe print and a hand print, presumably left by the same person: The shoe print belongs to a size 8 women’s shoe and the index and pinky fingers measure 70mm and 60mm, respectively:

6.a

If the forensic team uses the models fitted above to make guesses about the height of the person who left the prints, will they be extrapolating beyond the range of the observed data? Explain your answer.

From careful study of the scatterplots 
produced in the first question, a person wearing a size 8 
women's shoe and having index and pinky fingers 
measuring 70mm and 60mm, respectively, does not appear 
to lie outside the range of the observed data, so the 
forensic team will not be extrapolating.
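
The visual check can be complemented numerically. The sketch below is one possible approach and is not part of the original solution: it checks each covariate value against its observed range among the women's-size rows, then compares the leverage of the new point with the largest leverage in the data, which also guards against combinations of values that are jointly unusual. It runs on simulated stand-in data (hg_sim is an assumption, not the course data), so its output says nothing about the real data set.

```r
# Simulated stand-in for hg (NOT the course data); same column names
set.seed(516)
n <- 24
shoe_wm <- rep(c("m", "w"), each = n / 2)
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 70 + 2 * (shoe - 9) + rnorm(n, sd = 4)
pnk_mm  <- ind_mm - 12 + rnorm(n, sd = 3)
height  <- 145 + 3.5 * shoe - 6 * (shoe_wm == "w") + rnorm(n, sd = 5)
hg_sim  <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

fit  <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg_sim)
xnew <- data.frame(ind_mm = 70, pnk_mm = 60, shoe = 8, shoe_wm = "w")

# Marginal check: is each value inside the observed range among "w" rows?
w_rows   <- hg_sim[hg_sim$shoe_wm == "w", c("ind_mm", "pnk_mm", "shoe")]
in_range <- sapply(names(w_rows), function(v) {
  xnew[[v]] >= min(w_rows[[v]]) && xnew[[v]] <= max(w_rows[[v]])
})
in_range

# Joint check: leverage of the new point versus the largest observed leverage.
# x0 follows the coefficient order: intercept, ind_mm, pnk_mm, shoe, shoe_wmw
X     <- model.matrix(fit)
x0    <- c(1, 70, 60, 8, 1)
h_new <- drop(t(x0) %*% solve(crossprod(X)) %*% x0)
c(h_new = h_new, h_max = max(hatvalues(fit)))
```

A new point whose leverage exceeds every observed leverage would be a sign of extrapolation even if each coordinate is individually in range.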
6.b

Give an interval such that the forensic team can be 95% certain it contains the average height of the population of all people wearing size 8 women’s shoes and having index and pinky fingers measuring 70mm and 60mm, respectively.

xnew <- data.frame(ind_mm = 70, pnk_mm = 60, shoe = 8, shoe_wm = 'w')
ci <- predict(lm_all, newdata = xnew, interval = "confidence")
The interval is (161.6745,168.1325).
6.c

Give an interval such that the forensic team can be 95% certain it contains the height of the person who left the prints.

# named pred_int rather than pi to avoid masking R's built-in constant pi
pred_int <- predict(lm_all, newdata = xnew, interval = "prediction")
The interval is (153.57,176.237).
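
To see why the prediction interval is so much wider than the confidence interval, it can help to build both by hand. The sketch below uses simulated stand-in data (hg_sim is an assumption, not the course data) and the standard formulas: both intervals are centered at the fitted value, and the prediction interval adds 1 under the square root to account for the error term of a single new person.

```r
# Simulated stand-in for hg (NOT the course data); same column names
set.seed(516)
n <- 24
shoe_wm <- rep(c("m", "w"), each = n / 2)
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 70 + 2 * (shoe - 9) + rnorm(n, sd = 4)
pnk_mm  <- ind_mm - 12 + rnorm(n, sd = 3)
height  <- 145 + 3.5 * shoe - 6 * (shoe_wm == "w") + rnorm(n, sd = 5)
hg_sim  <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

fit  <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg_sim)
xnew <- data.frame(ind_mm = 70, pnk_mm = 60, shoe = 8, shoe_wm = "w")

# x0 follows the coefficient order: intercept, ind_mm, pnk_mm, shoe, shoe_wmw
x0    <- c(1, 70, 60, 8, 1)
X     <- model.matrix(fit)
yhat  <- sum(x0 * coef(fit))                          # fitted value at x0
h0    <- drop(t(x0) %*% solve(crossprod(X)) %*% x0)   # leverage of x0
tcrit <- qt(0.975, df = df.residual(fit))             # 95% two-sided critical value
s     <- sigma(fit)

ci_manual <- yhat + c(-1, 1) * tcrit * s * sqrt(h0)       # mean height at x0
pi_manual <- yhat + c(-1, 1) * tcrit * s * sqrt(1 + h0)   # one new person's height

ci_predict <- predict(fit, newdata = xnew, interval = "confidence")
pi_predict <- predict(fit, newdata = xnew, interval = "prediction")

rbind(ci_manual, ci_predict = ci_predict[1, c("lwr", "upr")],
      pi_manual, pi_predict = pi_predict[1, c("lwr", "upr")])
```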

7.

Suppose there is no shoe print, but only a hand print with index and pinky fingers measuring 70mm and 60mm, respectively:

7.a

Give an interval such that the forensic team can be 95% certain it contains the average height of the population of all people having index and pinky fingers measuring 70mm and 60mm, respectively.

xnew <- data.frame(ind_mm = 70, pnk_mm = 60)
ci <- predict(lm_fingers, newdata = xnew, interval = "confidence")
The interval is (164.692,174.1665).
7.b

Give an interval such that the forensic team can be 95% certain it contains the height of the person who left the hand print.

# named pred_int rather than pi to avoid masking R's built-in constant pi
pred_int <- predict(lm_fingers, newdata = xnew, interval = "prediction")
The interval is (150.1608,188.6977).

8.

Suppose there is no hand print, but only a shoe print belonging to a size 8 women’s shoe:

8.a

Give an interval such that the forensic team can be 95% certain it contains the average height of the population of all people wearing a size 8 women’s shoe.

xnew <- data.frame(shoe = 8, shoe_wm = 'w')
ci <- predict(lm_shoes, newdata = xnew, interval = "confidence")
The interval is (161.6205,167.8338).
8.b

Give an interval such that the forensic team can be 95% certain it contains the height of the person who left the shoe print.

# named pred_int rather than pi to avoid masking R's built-in constant pi
pred_int <- predict(lm_shoes, newdata = xnew, interval = "prediction")
The interval is (153.6233,175.831).

9.

Answer the following based on careful study of the preceding model output and confidence and prediction intervals:

9.a

If a shoe print is found, does a hand print provide useful additional accuracy in guessing the height of the person leaving the prints?

The index and pinky finger lengths appear to 
contribute very little additional information if the 
shoe size and shoe size gender are known.  This is seen 
in two ways from the above output: The confidence and 
prediction intervals based on the complete information 
(with finger lengths) are no narrower than those 
based only on shoe information; they are in fact slightly 
wider (widths of about 6.46 and 22.67 versus 6.21 and 
22.21).  Moreover, the model with all four covariates has 
a coefficient of determination scarcely higher than that 
of the model with only the shoe covariates (0.8493 versus 
0.8375), and its adjusted R-squared is lower (0.8176 
versus 0.8221).
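
A more formal way to ask whether the finger lengths add anything once the shoe covariates are in the model is a partial F test comparing the two nested fits. This goes beyond what the question requires and is offered only as a sketch, run on simulated stand-in data (hg_sim is an assumption, not the course data).

```r
# Simulated stand-in for hg (NOT the course data); same column names
set.seed(516)
n <- 24
shoe_wm <- rep(c("m", "w"), each = n / 2)
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 70 + 2 * (shoe - 9) + rnorm(n, sd = 4)
pnk_mm  <- ind_mm - 12 + rnorm(n, sd = 3)
height  <- 145 + 3.5 * shoe - 6 * (shoe_wm == "w") + rnorm(n, sd = 5)
hg_sim  <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

fit_shoes <- lm(height ~ shoe + shoe_wm, data = hg_sim)
fit_all   <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg_sim)

# Partial F test: H0 is that both finger-length coefficients are zero
tab <- anova(fit_shoes, fit_all)
tab

# The same statistic by hand, from the two residual sums of squares
rss0 <- deviance(fit_shoes)
rss1 <- deviance(fit_all)
F_manual <- ((rss0 - rss1) / 2) / (rss1 / df.residual(fit_all))
p_manual <- pf(F_manual, 2, df.residual(fit_all), lower.tail = FALSE)
c(F = F_manual, p = p_manual)
```

A large p value here would support dropping the finger lengths when shoe information is available.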
9.b

If a hand print is found, does a shoe print provide useful additional accuracy in guessing the height of the person leaving the prints?

The shoe print information does indeed allow 
the team to obtain a more accurate guess at the 
height of the person who left the prints; note 
how much narrower the confidence and prediction 
intervals became when the shoe size and gender 
information was included in the model.
9.c

If only a hand print is found, should the forensic team bother trying to use the index and pinky finger lengths to guess the height of the person who left it?

If no shoe information is available, the hand 
print information can still be useful.  Even 
though the confidence and prediction intervals 
are wide when only the hand print information is 
used, the intervals still allow the forensic team 
to narrow down the range of possible heights for 
the person leaving the print (the prediction 
interval does not cover the entire range of the 
observed data).
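
The comparisons in 9.a to 9.c rest on interval widths, which can be tabulated directly from the endpoints reported in questions 6 to 8. The sketch below only does arithmetic on those reported numbers; the object names are chosen for illustration.

```r
# Endpoints of the 95% intervals reported in questions 6, 7 and 8
conf_int <- rbind(all     = c(161.6745, 168.1325),
                  fingers = c(164.692,  174.1665),
                  shoes   = c(161.6205, 167.8338))
pred_int <- rbind(all     = c(153.57,   176.237),
                  fingers = c(150.1608, 188.6977),
                  shoes   = c(153.6233, 175.831))

# Width = upper endpoint minus lower endpoint, one row per model
widths <- cbind(confidence = conf_int[, 2] - conf_int[, 1],
                prediction = pred_int[, 2] - pred_int[, 1])
widths
```

The fingers-only intervals are by far the widest, while the intervals from the model with all covariates are slightly wider than those from the shoes-only model.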