hg0 <- read.table(pathtofile,sep=",",header=T) # edit pathtofile
colnames(hg0) <- c("ft","in","ind","pnk","shoe","shoe_wm","class")
keep <- which(hg0$shoe_wm %in% c("m","w")) # remove a "uk" shoe size
hg <- hg0[keep,]
# create 0,1 response
hg$w <- ifelse(hg$shoe_wm == "w", 1, 0)
# convert heights to cm
hg$hgt <- (hg$ft*12 + hg$`in`)*2.54
# make STAT 515 and STAT 516 data frames
hg515 <- hg[hg$class == "STAT_515_sp_2026",]
hg516 <- hg[hg$class == "STAT_516_sp_2026",]STAT 516 hw 10
Students in STAT 515 and STAT 516 courses completed a survey in which they reported their heights (in feet and inches), index and pinky finger lengths (mm), shoe size, and shoe size gender (“w”/“m”). The data are recorded in this file.
Here we will consider using logistic regression to predict the reported shoe size gender (“w”/“m”) based on the other reported measurements. The R code below reads in and cleans the data, constructing a response variable equal to
1.
Plot the responses against the index finger lengths for the STAT 516 students.
plot(w ~ ind, data = hg516)2.
Fit a logistic regression model using the STAT 516 student responses for predicting the gender response “w” based on the index finger length. Report the slope coefficient and make a scatter plot of the responses against the index finger lengths with the fitted probabilities overlaid; overlay also the curve on which these fitted probabilities lie.
# fit logistic regression model
glm_out <- glm(w ~ ind, family = "binomial", data = hg516)
# evaluate estimated probabilities over a range of covariate values
parms <- coef(glm_out)
logistic <- function(x,a,b) exp(a + b*x) / (1 + exp(a + b*x))
x <- seq(min(hg516$ind),max(hg516$ind),length=201)
# make the plot
plot(w ~ ind, data = hg516)
lines(logistic(x,parms[1],parms[2]) ~ x)
points(glm_out$fitted.values ~ hg516$ind,pch = 19)The estimate slope coefficient is -0.2580008.
3.
Comment on the strength of evidence in the data of a relationship between gender and index finger length. Explain your answer and give a careful interpretation of the estimated slope coefficient.
summary(glm_out)
Call:
glm(formula = w ~ ind, family = "binomial", data = hg516)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 18.9646 8.7554 2.166 0.0303 *
ind -0.2580 0.1186 -2.176 0.0296 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 31.841 on 22 degrees of freedom
Residual deviance: 24.391 on 21 degrees of freedom
AIC: 28.391
Number of Fisher Scoring iterations: 4
The p value is fairly small, indicating
fairly strong evidence of a relationship between index
finger length and the reporting of 'w' as the
shoe size gender. An increase in index finger length of one
mm is estimated to multiply the odds of reporting 'w'
by 0.7725946. Put anther way, an
increase in index finger length by one mm
is estimated to reduce
the odds of reporting 'w' by 22.74054
percent.
4.
Make a plot of the ROC curve for classifying STAT 516 students as wearing “w” or “m” shoes with the fitted logistic regression model using index finger length as the predictor.
ROC <- function(y,pihat){
cc <- sort(c(0,pihat,1))
TP <- numeric(length(cc))
FP <- numeric(length(cc))
for(i in 1:length(cc)){
yhat <- as.numeric(pihat >= cc[i])
TP[i] <- sum(yhat == 1 & y == 1) / sum(y == 1)
FP[i] <- sum(yhat == 1 & y == 0) / sum(y == 0)
}
plot(TP~FP,type = "l",bty="l")
abline(0,1,lty = 3)
}
ROC(hg516$w,glm_out$fitted.values)5.
Using the model fit on only the STAT 516 students, obtain predicted probabilities for the STAT 515 students. Then report the proportion of STAT 515 students correctly classified as “w” or “m” under the rule which assigns the class “w” when the predicted probability is at least
pihat515 <- predict(glm_out,newdata = hg515,type = "response")
y515 <- hg515$w
yhat515 <- as.numeric(pihat515 >= 0.5)
correct <- mean(y515 == 1 & yhat515 == 1)The rate of correct classification for
the STAT 515 students based on this model
was 0.5454545.
6.
Now fit a model on the STAT 516 students which uses height to predict the gender response “w”. Then, as in the previous part, obtain predicted probabilities based on this model for the STAT 515 students and report the proportion of these correctly classified as “w” or “m” under the rule which assigns the class “w” when the predicted probability is at least
glm2_out <- glm(w ~ hgt,data = hg516)
pihat515 <- predict(glm2_out,newdata = hg515,type="response")
y515 <- hg515$w
yhat515 <- as.numeric(pihat515 >= 0.5)
correct <- mean(y515 == 1 & yhat515 == 1)The rate of correct classification for
the STAT 515 students based on this model
was 0.6363636.
7.
State whether you believe index finger length or height is a better predictor of gender. Explain your answer.
The model based on height gave a
better out-of-sample rate of correct classification,
so it seems to be a better model.
8.
Investigate whether one should include height, index finger, and pinky finger length all together in one model to predict gender. Report your findings.
glm3_out <- glm(w ~ ind + pnk + hgt, data = hg)
summary(glm3_out)
Call:
glm(formula = w ~ ind + pnk + hgt, data = hg)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.150597 0.817712 7.522 3.05e-09 ***
ind -0.007105 0.016934 -0.420 0.677
pnk 0.011988 0.015285 0.784 0.437
hgt -0.033746 0.006539 -5.161 6.68e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.1245604)
Null deviance: 11.111 on 44 degrees of freedom
Residual deviance: 5.107 on 41 degrees of freedom
AIC: 39.782
Number of Fisher Scoring iterations: 2
It appears that once height is included,
index and pinky finger length do not significantly
contribute to the model.