STAT 516 hw 4

Author

Karl Gregory

Download this data set and store it in the folder containing the .qmd file for your homework assignment.

The data set contains the self-reported heights (in feet and inches), lengths of index and pinky fingers (in millimeters), shoe size, and shoe size category (“m”/“w”) of several students. The code below imports the data into R and converts one “uk” shoe size to a “w” size according to a sizing chart found online. In addition the heights are converted to centimeter heights and a data frame hg is created containing a version of the data ready for analysis.

# import the data
hg0 <- read.table("heights.csv",sep=",",header=T)

# clean one data point
# https://www.grivetoutdoors.com/pages/shoe-size-chart
hg0$shoe[hg0$shoe_wm == 'uk'] <- 9 
hg0$shoe_wm[hg0$shoe_wm == 'uk'] <- 'w'

# create data frame for analysis
hg <- data.frame(height = (hg0$ft*12 + hg0$in.)*2.54, # get heights in cm
                 ind_mm = hg0$ind_mm,
                 pnk_mm = hg0$pnk_mm,
                 shoe = hg0$shoe,
                 shoe_wm = hg0$shoe_wm)

# view the first few rows of the data frame
head(hg)

  height ind_mm pnk_mm shoe shoe_wm
1 162.56     75     65  8.0       w
2 162.56     70     56  7.0       w
3 172.72     70     52 10.0       w
4 165.10     68     62  7.5       w
5 167.64     71     55  9.5       m
6 193.04     78     65 13.0       m

It is of interest to use the multiple linear regression model to predict the height of a person based on his or her index and pinky finger lengths, shoe size, and shoe size gender “m” or “w”.

1.

Consider the multiple linear regression model which uses all the covariates—the index and pinky finger lengths, shoe size, and shoe size gender—to predict the height of a person.

1.a

Give the critical value for the overall F test at significance level $α = 0.05$ . This the value such that when it is exceeded by the value of the test statistic $F_{test}$ we reject $H_{0}$ at $α = 0.05$ .

n <- nrow(hg)
p <- 4
alpha <- 0.05
Fcrit <- qf(1 - alpha, p, n-(p+1))

The critical value is 2.895107.

1.b

Fit the model and give the value of the test statistic for the overall F test of significance as well as the p-value associated with it. Give an interpretation of these values.

lm_all <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg)
summary(lm_all)


Call:
lm(formula = height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = hg)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.9786  -2.1592   0.8879   2.6740   7.8550 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 126.41698   17.11668   7.386 5.38e-07 ***
ind_mm        0.05668    0.34801   0.163 0.872332    
pnk_mm        0.25891    0.33130   0.781 0.444150    
shoe          3.12329    0.66411   4.703 0.000155 ***
shoe_wmw     -6.00225    2.79359  -2.149 0.044772 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.19 on 19 degrees of freedom
Multiple R-squared:  0.8493,    Adjusted R-squared:  0.8176 
F-statistic: 26.77 on 4 and 19 DF,  p-value: 1.411e-07

Y <- hg$height
Ybar <- mean(Y)
SSE <- sum(lm_all$residuals**2)
SST <- sum((Y - Ybar)**2)
SSR <- SST - SSE
MSR <- SSR / p
MSE <- SSE / (n - (p + 1))
Ftest <- MSR/MSE

pval <- 1 - pf(Ftest,p,n-(p+1))

The value of the test statistic for the overall F 
test is 26.76977, and the p value is 1.411167e-07. We reject the 
null hypothesis that all regression coefficients 
(apart from the intercept) are equal to zero at every significance level 
greater than 1.411167e-07. 
Therefore we conclude (at any such significance level), 
that not all the regression coefficients are equal to zero; that is, 
at least one of the covariates is significantly (linearly) related 
to the response.

1.c

Report the variance inflation factor for each of the four covariates.

slm_ind <- summary(lm(ind_mm ~ pnk_mm + shoe + shoe_wm,data=hg))
slm_pnk <- summary(lm(pnk_mm ~ ind_mm + shoe + shoe_wm,data=hg))
slm_shoe <- summary(lm(shoe ~ ind_mm + pnk_mm + shoe_wm,data=hg))
slm_shoe_wm <- summary(lm(shoe_wm =='w' ~ ind_mm + pnk_mm + shoe,data=hg))

vif_ind <- 1/(1-slm_ind$r.squared)
vif_pnk <- 1/(1-slm_pnk$r.squared)
vif_shoe <- 1/(1-slm_shoe$r.squared)
vif_shoe_wm <- 1/(1-slm_shoe_wm$r.squared)

# Can also run:
# library(car)
# vif(lm_all)

The VIFs were 3.611387 for the index finger 
length, 2.83546 for the pinky finger length 2.37919 for the 
shoe size, and 1.738048 for the shoe size gender. The 
VIFs are smaller in this model than in the model with all 
four covariates.

2.

Fit a model with all covariates except the shoe size gender covariate.

2.a

Use the full-reduced model F test to test whether the shoe size gender covariate has a nonzero regression coefficient in the full model. Give the value of the test statistic as well as the p value. Interpret the result of your test.

lm_nowm <- lm(height ~ ind_mm + pnk_mm + shoe, data = hg)

SSred <- sum(lm_nowm$residuals^2)
SSfull <- sum(lm_all$residuals^2)
s <- 1
n <- nrow(hg)
p <- 4
num <-  (SSred - SSfull) / s
den <- SSfull / (n - (p + 1))
Ftest <- num / den
pval <- 1 - pf(Ftest,s,n-(p+1))

The value of the test statistic is 4.616397 and the 
p value is 0.0447715. There is sufficient evidence at the 0.05 
significance level to conclude that shoe size gender is significantly 
related to height.

2.b

Obtain the value of the test statistic $T_{test}$ and the p value associated with it for testing $H_{0}$ : $β_{j} = 0$ versus $H_{1}$ : $β_{j} \neq 0$ where $j$ is the index of the shoe size gender covariate.

We can obtain this from applying the summary() 
function to the output of the lm() function. We 
find that the T statistic is equal the square root of the 
F statistic, but signed in the direction of the 
estimated regression coefficient. The p value is the same 
as that for the full-reduced model F test.

2.c

Explain the relationship between the value of the test statistic $F_{test}$ of the full-reduced model F test when one considers the removal of a single covariate with the test statistic $T_{test}$ for testing whether a single covariate is significantly related to the response.

The former is the square of the latter.

3.

Fit a model using only the two shoe size covariates.

3.a

Use the full-reduced model F test to test whether either of the finger length variables has a nonzero regression coefficient in the full model. Give the value of the test statistic as well as the p value. Interpret the result of your test.

lm_shoe <- lm(height ~ shoe + shoe_wm, data = hg)

SSred <- sum(lm_shoe$residuals^2)
SSfull <- sum(lm_all$residuals^2)
s <- 2
n <- nrow(hg)
p <- 4
num <-  (SSred - SSfull) / s
den <- SSfull / (n - (p + 1))
Ftest <- num / den
pval <- 1 - pf(Ftest,s,n-(p+1))

The value of the test statistic is 0.7413126 and the 
p value is 0.4897741. There is not sufficient evidence to 
reject the null hypothesis that both regression 
coefficients are equal to zero. So we do not find 
sufficient evidence to claim that index and pinky finger 
lengths, when taken in addition to shoe size and shoe 
size gender, contribute to any knowledge of the height 
of a person.

3.b

Compute the variance inflation factors of the two variables in this model. Comment on how these compare to their counterparts in the model with all four covariates.

slm_shoe <- summary(lm(shoe ~ shoe_wm,data=hg))
slm_shoe_wm <- summary(lm(shoe_wm =='w' ~ shoe,data=hg))

vif_shoe <- 1/(1-slm_shoe$r.squared)
vif_shoe_wm <- 1/(1-slm_shoe_wm$r.squared)

The VIFs were 1.646059 for the shoe 
size and 1.646059 for the shoe size gender.

3.c

Comment on anything else interesting about these two variance inflation factors!

The two VIFs are equal. This will always be 
the case when there are only two covariates. The 
reason for this is that the VIF is based on the 
coefficient of determination in a model in which the 
covariate under consideration is the response and 
all other covariates are used as predictors. If 
there are only two covariates, then when one is put as 
the response, there is only one other predictor. Then, 
when the other is put as the response, the response 
and the single predictor trade places. If one 
regresses 'Y' on a single covariate 'x' and then 
regresses 'x' on 'Y', one will get the same coefficient 
of determination in both regressions.

4.

Fit a model using only the index and pinky finger lengths.

4.a

Give the value of the test statistic and the p value for the full-reduced model F test for testing whether either of the two shoe size covariates has a nonzero regression coefficient in the full model. Interpret the result.

lm_finger <- lm(height ~ ind_mm + pnk_mm, data = hg)

SSred <- sum(lm_finger$residuals^2)
SSfull <- sum(lm_all$residuals^2)
s <- 2
n <- nrow(hg)
p <- 4
num <-  (SSred - SSfull) / s
den <- SSfull / (n - (p + 1))
Ftest <- num / den
pval <- 1 - pf(Ftest,s,n-(p+1))

The value of the test statistic is 21.93594 and the 
p value is 1.155642e-05. There is strong evidence against the null 
hypothesis that both shoe size covariates have regression coefficients 
equal to zero.

4.b

Suppose the index and pinky finger measurements were recorded in centimeters instead of millimeters. Describe the effect this would have on the value of the test statistic $F_{test}$ in the previous part as well as on the p value.

There would be no effect whatsoever 
in the value of the test statistic or the p 
value.  The only change would be in 
the value of the estimated regression 
coefficients. The tests for 
statistical significance of covariates in 
multiple linear regression are invariant to any 
shifting or scaling of the covariate and 
response values.

5.

5.a

Use Mallow’s $C_{p}$ statistic to select the best model among all possible submodels involving the four predictors.

Warning: package 'leaps' was built under R version 4.4.1

Subset selection object
Call: regsubsets.formula(height ~ ind_mm + pnk_mm + shoe + shoe_wm, 
    data = hg)
4 Variables  (and intercept)
         Forced in Forced out
ind_mm       FALSE      FALSE
pnk_mm       FALSE      FALSE
shoe         FALSE      FALSE
shoe_wmw     FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: exhaustive
         ind_mm pnk_mm shoe shoe_wmw
1  ( 1 ) " "    " "    "*"  " "     
2  ( 1 ) " "    " "    "*"  "*"     
3  ( 1 ) " "    "*"    "*"  "*"     
4  ( 1 ) "*"    "*"    "*"  "*"

[1] 5.586182 2.482625 3.026530 5.000000

The output shows the best model of each size 
and the value of the Cp statistic for the best (Rsq-maximizing) 
model of each size. Should probably take the model with two 
predictors, which has Cp equal to 2.482625, since this is 
the smallest model with Cp approximately equal to p + 1.

5.b

Give the model chosen by backward selection based on the AIC criterion.

step_out <- step(lm_all,
                 direction = "backward",
                 scope=formula(lm_all))

Start:  AIC=83.44
height ~ ind_mm + pnk_mm + shoe + shoe_wm

          Df Sum of Sq     RSS    AIC
- ind_mm   1      0.71  512.59 81.474
- pnk_mm   1     16.45  528.33 82.200
<none>                  511.88 83.441
- shoe_wm  1    124.37  636.25 86.661
- shoe     1    595.88 1107.76 99.969

Step:  AIC=81.47
height ~ pnk_mm + shoe + shoe_wm

          Df Sum of Sq     RSS    AIC
- pnk_mm   1     39.23  551.82 81.244
<none>                  512.59 81.474
- shoe_wm  1    135.92  648.52 85.119
- shoe     1    663.34 1175.93 99.402

Step:  AIC=81.24
height ~ shoe + shoe_wm

          Df Sum of Sq     RSS     AIC
<none>                  551.82  81.244
- shoe_wm  1     137.5  689.32  84.584
- shoe     1    1102.9 1654.70 105.600

step_out


Call:
lm(formula = height ~ shoe + shoe_wm, data = hg)

Coefficients:
(Intercept)         shoe     shoe_wmw  
    142.594        3.534       -6.142

Backward selection with AIC chooses 
the model having only the shoe size and shoe 
gender covariates.

5.c

Give the model chosen by forward selection based on the AIC criterion.

lm_intercept <- lm(height ~ 1 , data = hg)
step_out <- step(lm_intercept,
                 scope=formula(lm_all),
                 direction = "forward",
                 trace = 1)

Start:  AIC=120.86
height ~ 1

          Df Sum of Sq    RSS     AIC
+ shoe     1    2707.4  689.3  84.584
+ shoe_wm  1    1742.0 1654.7 105.600
+ ind_mm   1    1669.3 1727.4 106.631
+ pnk_mm   1    1301.2 2095.6 111.268
<none>                 3396.7 120.860

Step:  AIC=84.58
height ~ shoe

          Df Sum of Sq    RSS    AIC
+ shoe_wm  1   137.495 551.82 81.244
<none>                 689.32 84.584
+ ind_mm   1    47.113 642.21 84.885
+ pnk_mm   1    40.803 648.52 85.119

Step:  AIC=81.24
height ~ shoe + shoe_wm

         Df Sum of Sq    RSS    AIC
<none>                551.82 81.244
+ pnk_mm  1    39.229 512.59 81.474
+ ind_mm  1    23.490 528.33 82.200

step_out


Call:
lm(formula = height ~ shoe + shoe_wm, data = hg)

Coefficients:
(Intercept)         shoe     shoe_wmw  
    142.594        3.534       -6.142

Forward selection with AIC also chooses 
the model having only the shoe size and shoe 
gender covariates.

6.

Considering your “best” model, check whether there are any outlying observations.

plot(lm_shoe,which=4)

plot(height ~ shoe, pch = shoe_wm, data = hg)

Considering the model with only the shoe 
size covariates, none of the  Cook's distances are 
alarmingly large.  One can also see from the scatterplot 
that all the observations more or less conform to the same 
pattern. So, there do not appear to be any concerning 
outliers.