Download this data set and store it in the folder containing the .qmd file for your homework assignment.
The data set contains the self-reported heights (in feet and inches), lengths of index and pinky fingers (in millimeters), shoe size, and shoe size category (“m”/“w”) of several students. The code below imports the data into R and converts one “uk” shoe size to a “w” size according to a sizing chart found online. In addition, the heights are converted to centimeters, and a data frame hg is created containing a version of the data ready for analysis.
# import the data
hg0 <- read.table("heights.csv", sep = ",", header = TRUE)

# clean one data point
# https://www.grivetoutdoors.com/pages/shoe-size-chart
hg0$shoe[hg0$shoe_wm == 'uk'] <- 9
hg0$shoe_wm[hg0$shoe_wm == 'uk'] <- 'w'

# create data frame for analysis
hg <- data.frame(height = (hg0$ft * 12 + hg0$in.) * 2.54, # get heights in cm
                 ind_mm = hg0$ind_mm,
                 pnk_mm = hg0$pnk_mm,
                 shoe = hg0$shoe,
                 shoe_wm = hg0$shoe_wm)

# view the first few rows of the data frame
head(hg)
height ind_mm pnk_mm shoe shoe_wm
1 162.56 75 65 8.0 w
2 162.56 70 56 7.0 w
3 172.72 70 52 10.0 w
4 165.10 68 62 7.5 w
5 167.64 71 55 9.5 m
6 193.04 78 65 13.0 m
It is of interest to use the multiple linear regression model to predict the height of a person based on his or her index and pinky finger lengths, shoe size, and shoe size gender “m” or “w”.
1.
Consider the multiple linear regression model which uses all the covariates—the index and pinky finger lengths, shoe size, and shoe size gender—to predict the height of a person.
1.a
Give the critical value for the overall F test at significance level $\alpha = 0.05$. This is the value such that, when it is exceeded by the value of the test statistic, we reject $H_0$ at $\alpha = 0.05$.
n <- nrow(hg)
p <- 4
alpha <- 0.05
Fcrit <- qf(1 - alpha, p, n - (p + 1))
The critical value is 2.895107.
1.b
Fit the model and give the value of the test statistic for the overall F test of significance as well as the p-value associated with it. Give an interpretation of these values.
The value of the test statistic for the overall F
test is 26.76977, and the p value is 1.411167e-07. We reject the
null hypothesis that all regression coefficients
(apart from the intercept) are equal to zero at every significance level
greater than 1.411167e-07.
Therefore we conclude (at any such significance level)
that not all the regression coefficients are equal to zero; that is,
at least one of the covariates is significantly (linearly) related
to the response.
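The values quoted above can be read off a fitted model. The sketch below is only an illustration: because heights.csv is not reproduced here, it uses a synthetic data frame `sim` (made-up values, not the homework data) with the same column names as hg, so its numbers will not match those reported.

```r
# Synthetic stand-in for hg (NOT the homework data); all values are made up.
set.seed(1)
n <- 24
shoe_wm <- factor(rep(c("m", "w"), each = n / 2))
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 60 + 1.2 * shoe + rnorm(n, sd = 3)
pnk_mm  <- 0.8 * ind_mm + rnorm(n, sd = 3)
height  <- 140 + 3 * shoe + 5 * (shoe_wm == "m") + rnorm(n, sd = 4)
sim <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

fit <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = sim)

# summary() stores the overall F statistic together with its degrees of freedom
fstat <- summary(fit)$fstatistic   # named vector: value, numdf, dendf
pval  <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)

# The same statistic from sums of squares: (SSR / p) / (SSE / (n - (p + 1)))
sse <- sum(resid(fit)^2)
sst <- sum((sim$height - mean(sim$height))^2)
F_by_hand <- ((sst - sse) / 4) / (sse / (n - 5))

c(F = unname(fstat["value"]), F_by_hand = F_by_hand, p_value = unname(pval))
```

With the real data, replacing `sim` by `hg` should reproduce the statistic and p value reported in the answer.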
1.c
Report the variance inflation factor for each of the four covariates.
The VIFs were 3.611387 for the index finger
length, 2.83546 for the pinky finger length, 2.37919 for the
shoe size, and 1.738048 for the shoe size gender.
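The source does not show which function produced these VIFs. One way to compute them with base R alone, sketched here on a synthetic stand-in `sim` (made-up values, not the homework data), is to regress each covariate on the others and take 1 / (1 - R²).

```r
# Synthetic stand-in for hg (NOT the homework data); all values are made up.
set.seed(1)
n <- 24
shoe_wm <- factor(rep(c("m", "w"), each = n / 2))
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 60 + 1.2 * shoe + rnorm(n, sd = 3)
pnk_mm  <- 0.8 * ind_mm + rnorm(n, sd = 3)
height  <- 140 + 3 * shoe + 5 * (shoe_wm == "m") + rnorm(n, sd = 4)
sim <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

fit <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = sim)
X <- model.matrix(fit)[, -1]   # covariate columns, intercept dropped

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# covariate j on all the other covariates
vif_by_hand <- sapply(colnames(X), function(j) {
  others <- X[, colnames(X) != j, drop = FALSE]
  r2 <- summary(lm(X[, j] ~ others))$r.squared
  1 / (1 - r2)
})
vif_by_hand
```

The factor shoe_wm enters as a single 0/1 column (named `shoe_wmw` here), so it gets one VIF like the numeric covariates.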
2.
Fit a model with all covariates except the shoe size gender covariate.
2.a
Use the full-reduced model F test to test whether the shoe size gender covariate has a nonzero regression coefficient in the full model. Give the value of the test statistic as well as the p value. Interpret the result of your test.
The value of the test statistic is 4.616397 and the
p value is 0.0447715. There is sufficient evidence at the 0.05
significance level to conclude that shoe size gender is significantly
related to height.
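For reference, the full-reduced model F statistic has the standard form below. The notation is an assumption, since the source does not fix symbols for these quantities: SSE is the residual sum of squares, $n$ the sample size, $p$ the number of covariates in the full model, and $q$ the number of covariates removed to obtain the reduced model.

```latex
F = \frac{\left(\mathrm{SSE}_{\text{reduced}} - \mathrm{SSE}_{\text{full}}\right)/q}
         {\mathrm{SSE}_{\text{full}}/\left(n - (p + 1)\right)}
\;\sim\; F_{q,\; n-(p+1)} \quad \text{under } H_0 .
```

In this part $p = 4$ and $q = 1$, since only the shoe size gender covariate is removed.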
2.b
Obtain the value of the test statistic and the p value associated with it for testing $H_0$: $\beta_j = 0$ versus $H_1$: $\beta_j \neq 0$, where $j$ is the index of the shoe size gender covariate.
We can obtain this from applying the summary()
function to the output of the lm() function. We
find that the T statistic is equal to the square root of the
F statistic, but signed in the direction of the
estimated regression coefficient. The p value is the same
as that for the full-reduced model F test.
2.c
Explain the relationship between the value of the test statistic of the full-reduced model F test when one considers the removal of a single covariate with the test statistic for testing whether a single covariate is significantly related to the response.
The former is the square of the latter.
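This relationship can be checked numerically. The sketch below uses a synthetic stand-in `sim` (made-up values, not the homework data) and compares the F statistic from `anova()` with the squared t statistic from `summary()`. The coefficient name `shoe_wmw` arises because "m" is the reference level of the factor in this synthetic data.

```r
# Synthetic stand-in for hg (NOT the homework data); all values are made up.
set.seed(1)
n <- 24
shoe_wm <- factor(rep(c("m", "w"), each = n / 2))
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 60 + 1.2 * shoe + rnorm(n, sd = 3)
pnk_mm  <- 0.8 * ind_mm + rnorm(n, sd = 3)
height  <- 140 + 3 * shoe + 5 * (shoe_wm == "m") + rnorm(n, sd = 4)
sim <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

full    <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = sim)
reduced <- lm(height ~ ind_mm + pnk_mm + shoe, data = sim)

# Full-reduced model F test for dropping shoe_wm
tab    <- anova(reduced, full)
F_stat <- tab$F[2]
p_F    <- tab$`Pr(>F)`[2]

# t test for the same coefficient in the full model
t_row  <- coef(summary(full))["shoe_wmw", ]
t_stat <- unname(t_row["t value"])
p_t    <- unname(t_row["Pr(>|t|)"])

c(F = F_stat, t_squared = t_stat^2, p_F = p_F, p_t = p_t)
```

The first two entries agree, as do the last two.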
3.
Fit a model using only the two shoe size covariates.
3.a
Use the full-reduced model F test to test whether either of the finger length variables has a nonzero regression coefficient in the full model. Give the value of the test statistic as well as the p value. Interpret the result of your test.
The value of the test statistic is 0.7413126 and the
p value is 0.4897741. There is not sufficient evidence to
reject the null hypothesis that both regression
coefficients are equal to zero. So we do not find
sufficient evidence to claim that index and pinky finger
lengths, when taken in addition to shoe size and shoe
size gender, contribute to any knowledge of the height
of a person.
3.b
Compute the variance inflation factors of the two variables in this model. Comment on how these compare to their counterparts in the model with all four covariates.
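The VIFs were 1.646059 for the shoe
size and 1.646059 for the shoe size gender. Both are
smaller than their counterparts in the model with all
four covariates (2.37919 and 1.738048, respectively).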
3.c
Comment on anything else interesting about these two variance inflation factors!
The two VIFs are equal. This will always be
the case when there are only two covariates. The
reason for this is that the VIF is based on the
coefficient of determination in a model in which the
covariate under consideration is the response and
all other covariates are used as predictors. If
there are only two covariates, then when one is put as
the response, there is only one other predictor. Then,
when the other is put as the response, the response
and the single predictor trade places. If one
regresses 'Y' on a single covariate 'x' and then
regresses 'x' on 'Y', one will get the same coefficient
of determination in both regressions.
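A small numerical check of this symmetry, using two simulated variables (hypothetical values, unrelated to the homework data): both regressions give a coefficient of determination equal to the squared sample correlation, so the two VIFs coincide.

```r
# Two simulated variables; the values are made up for illustration only.
set.seed(2)
x <- rnorm(30)
y <- 1 + 0.5 * x + rnorm(30)

r2_y_on_x <- summary(lm(y ~ x))$r.squared
r2_x_on_y <- summary(lm(x ~ y))$r.squared

# Both equal the squared correlation, so 1 / (1 - R^2) is the same either way
c(r2_y_on_x = r2_y_on_x, r2_x_on_y = r2_x_on_y, cor_squared = cor(x, y)^2)
```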
4.
Fit a model using only the index and pinky finger lengths.
4.a
Give the value of the test statistic and the p value for the full-reduced model F test for testing whether either of the two shoe size covariates has a nonzero regression coefficient in the full model. Interpret the result.
The value of the test statistic is 21.93594 and the
p value is 1.155642e-05. There is strong evidence against the null
hypothesis that both shoe size covariates have regression coefficients
equal to zero.
4.b
Suppose the index and pinky finger measurements were recorded in centimeters instead of millimeters. Describe the effect this would have on the value of the test statistic in the previous part as well as on the p value.
There would be no effect whatsoever
in the value of the test statistic or the p
value. The only change would be in
the value of the estimated regression
coefficients. The tests for
statistical significance of covariates in
multiple linear regression are invariant to any
shifting or scaling of the covariate and
response values.
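The invariance can be seen directly by refitting with rescaled finger lengths. The sketch below uses a synthetic stand-in `sim` (made-up values, not the homework data). The columns `ind_cm` and `pnk_cm` are introduced here purely for illustration.

```r
# Synthetic stand-in for hg (NOT the homework data); all values are made up.
set.seed(1)
n <- 24
shoe_wm <- factor(rep(c("m", "w"), each = n / 2))
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 60 + 1.2 * shoe + rnorm(n, sd = 3)
pnk_mm  <- 0.8 * ind_mm + rnorm(n, sd = 3)
height  <- 140 + 3 * shoe + 5 * (shoe_wm == "m") + rnorm(n, sd = 4)
sim <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

# Same measurements expressed in centimeters (illustrative columns)
sim$ind_cm <- sim$ind_mm / 10
sim$pnk_cm <- sim$pnk_mm / 10

# Full-reduced F test for the two shoe covariates, in each unit system
full_mm <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = sim)
red_mm  <- lm(height ~ ind_mm + pnk_mm, data = sim)
full_cm <- lm(height ~ ind_cm + pnk_cm + shoe + shoe_wm, data = sim)
red_cm  <- lm(height ~ ind_cm + pnk_cm, data = sim)

F_mm <- anova(red_mm, full_mm)$F[2]
F_cm <- anova(red_cm, full_cm)$F[2]

# The test statistic is unchanged; the slope is multiplied by 10
c(F_mm = F_mm, F_cm = F_cm,
  slope_mm = unname(coef(full_mm)["ind_mm"]),
  slope_cm = unname(coef(full_cm)["ind_cm"]))
```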
5.
5.a
Use Mallows' $C_p$ statistic to select the best model among all possible submodels involving the four predictors.
The output shows the best (Rsq-maximizing) model of each size
and the value of the Cp statistic for that model.
We should probably take the model with two
predictors, which has Cp equal to 2.482625, since this is
the smallest model with Cp approximately equal to p + 1,
where p is the number of predictors in the model.
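The source appears to rely on the leaps package for this step. As an alternative that needs only base R, the sketch below computes Cp for every submodel by hand, using Cp = SSE_sub / MSE_full - (n - 2(k + 1)) with k the number of predictors in the submodel. It runs on a synthetic stand-in `sim` (made-up values, not the homework data), so the model it favors need not match the answer above.

```r
# Synthetic stand-in for hg (NOT the homework data); all values are made up.
set.seed(1)
n <- 24
shoe_wm <- factor(rep(c("m", "w"), each = n / 2))
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 60 + 1.2 * shoe + rnorm(n, sd = 3)
pnk_mm  <- 0.8 * ind_mm + rnorm(n, sd = 3)
height  <- 140 + 3 * shoe + 5 * (shoe_wm == "m") + rnorm(n, sd = 4)
sim <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

vars <- c("ind_mm", "pnk_mm", "shoe", "shoe_wm")
full <- lm(reformulate(vars, response = "height"), data = sim)
mse_full <- sum(resid(full)^2) / df.residual(full)

# All 15 non-empty subsets of the four predictors
subsets <- unlist(lapply(seq_along(vars),
                         function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)

# Mallows' Cp for each submodel
cp <- sapply(subsets, function(v) {
  fit <- lm(reformulate(v, response = "height"), data = sim)
  sum(resid(fit)^2) / mse_full - (n - 2 * (length(v) + 1))
})
names(cp) <- sapply(subsets, paste, collapse = " + ")

sort(cp)
```

By construction the full model always has Cp equal to its number of parameters (here 5), which is why it is uninformative on its own and smaller models are compared against k + 1.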
5.b
Give the model chosen by backward selection based on the AIC criterion.
Backward selection with AIC also chooses
the model having only the shoe size and shoe
size gender covariates.
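The source does not show the call used for this step. One common way to run backward selection by AIC is `step()` from the stats package, sketched here on a synthetic stand-in `sim` (made-up values, not the homework data). The selected model may therefore differ from the one reported.

```r
# Synthetic stand-in for hg (NOT the homework data); all values are made up.
set.seed(1)
n <- 24
shoe_wm <- factor(rep(c("m", "w"), each = n / 2))
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
ind_mm  <- 60 + 1.2 * shoe + rnorm(n, sd = 3)
pnk_mm  <- 0.8 * ind_mm + rnorm(n, sd = 3)
height  <- 140 + 3 * shoe + 5 * (shoe_wm == "m") + rnorm(n, sd = 4)
sim <- data.frame(height, ind_mm, pnk_mm, shoe, shoe_wm)

full <- lm(height ~ ind_mm + pnk_mm + shoe + shoe_wm, data = sim)

# Start from the full model and drop one term at a time while AIC improves;
# trace = 0 suppresses the step-by-step printout
backward <- step(full, direction = "backward", trace = 0)

formula(backward)
kept <- attr(terms(backward), "term.labels")
kept
```

Setting `trace = 1` instead prints the AIC of each candidate deletion at every step.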
6.
Considering your “best” model, check whether there are any outlying observations.
plot(lm_shoe, which = 4)  # Cook's distance for each observation
plot(height ~ shoe, pch = shoe_wm, data = hg)  # height vs. shoe size, plotting symbol m/w
Considering the model with only the shoe
size covariates, none of the Cook's distances are
alarmingly large. One can also see from the scatterplot
that all the observations more or less conform to the same
pattern. So, there do not appear to be any concerning
outliers.
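Besides the plot, the Cook's distances can be inspected numerically with `cooks.distance()`. The sketch below uses a synthetic stand-in `sim` (made-up values, not the homework data). The cutoff of 1 is only a commonly quoted rule of thumb, not a threshold taken from the source.

```r
# Synthetic stand-in for hg (NOT the homework data); all values are made up.
set.seed(1)
n <- 24
shoe_wm <- factor(rep(c("m", "w"), each = n / 2))
shoe    <- ifelse(shoe_wm == "m", 10.5, 8) + rnorm(n)
height  <- 140 + 3 * shoe + 5 * (shoe_wm == "m") + rnorm(n, sd = 4)
sim <- data.frame(height, shoe, shoe_wm)

fit <- lm(height ~ shoe + shoe_wm, data = sim)
d <- cooks.distance(fit)

# Three largest Cook's distances, labeled by row number
head(sort(d, decreasing = TRUE), 3)

# Observations exceeding the rule-of-thumb cutoff (assumed here to be 1)
flagged <- which(d > 1)
flagged
```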