STAT 516 hw 10

Author

Karl Gregory

Students in STAT 515 and STAT 516 courses completed a survey in which they reported their heights (in feet and inches), index and pinky finger lengths (mm), shoe size, and shoe size gender (“w”/“m”). The data are recorded in this file.

Here we will consider using logistic regression to predict the reported shoe size gender (“w”/“m”) based on the other reported measurements. The R code below reads in and cleans the data, constructing a response variable equal to $1$ for shoe size gender “w” and equal to $0$ for shoe size gender “m”.

hg0 <- read.table(pathtofile,sep=",",header=T) # edit pathtofile
colnames(hg0) <- c("ft","in","ind","pnk","shoe","shoe_wm","class")

keep <- which(hg0$shoe_wm %in% c("m","w")) # remove a "uk" shoe size
hg <- hg0[keep,]

# create 0,1 response
hg$w <- ifelse(hg$shoe_wm == "w", 1, 0)

# convert heights to cm
hg$hgt <- (hg$ft*12 + hg$`in`)*2.54

# make STAT 515 and STAT 516 data frames
hg515 <- hg[hg$class == "STAT_515_sp_2026",]
hg516 <- hg[hg$class == "STAT_516_sp_2026",]

1.

Plot the responses against the index finger lengths for the STAT 516 students.

plot(w ~ ind, data = hg516)

2.

Fit a logistic regression model using the STAT 516 student responses for predicting the gender response “w” based on the index finger length. Report the slope coefficient and make a scatter plot of the responses against the index finger lengths with the fitted probabilities overlaid; overlay also the curve on which these fitted probabilities lie.

# fit logistic regression model
glm_out <- glm(w ~ ind, family = "binomial", data = hg516)

# evaluate estimated probabilities over a range of covariate values
parms <- coef(glm_out)
logistic <- function(x,a,b) exp(a + b*x) / (1 + exp(a + b*x))
x <- seq(min(hg516$ind),max(hg516$ind),length=201)

# make the plot
plot(w ~ ind, data = hg516)
lines(logistic(x,parms[1],parms[2]) ~ x)
points(glm_out$fitted.values ~ hg516$ind,pch = 19)

The estimate slope coefficient is -0.2580008.

3.

Comment on the strength of evidence in the data of a relationship between gender and index finger length. Explain your answer and give a careful interpretation of the estimated slope coefficient.

summary(glm_out)


Call:
glm(formula = w ~ ind, family = "binomial", data = hg516)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  18.9646     8.7554   2.166   0.0303 *
ind          -0.2580     0.1186  -2.176   0.0296 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 31.841  on 22  degrees of freedom
Residual deviance: 24.391  on 21  degrees of freedom
AIC: 28.391

Number of Fisher Scoring iterations: 4

The p value is fairly small, indicating 
fairly strong evidence of a relationship between index 
finger length and the reporting of 'w' as the 
shoe size gender. An increase in index finger length of one 
mm is estimated to multiply the odds of reporting 'w' 
by 0.7725946. Put anther way, an 
increase in index finger length by one mm 
is estimated to reduce 
the odds of reporting 'w' by 22.74054 
percent.

4.

Make a plot of the ROC curve for classifying STAT 516 students as wearing “w” or “m” shoes with the fitted logistic regression model using index finger length as the predictor.

ROC <- function(y,pihat){
  
  cc <- sort(c(0,pihat,1))
  TP <- numeric(length(cc))
  FP <- numeric(length(cc))
  
  for(i in 1:length(cc)){
    
    yhat <- as.numeric(pihat >= cc[i])
    
    TP[i] <- sum(yhat == 1 & y == 1) / sum(y == 1)
    FP[i] <- sum(yhat == 1 & y == 0) / sum(y == 0)
    
  }
  
  plot(TP~FP,type = "l",bty="l")
  abline(0,1,lty = 3)
  
}

ROC(hg516$w,glm_out$fitted.values)

5.

Using the model fit on only the STAT 516 students, obtain predicted probabilities for the STAT 515 students. Then report the proportion of STAT 515 students correctly classified as “w” or “m” under the rule which assigns the class “w” when the predicted probability is at least $1 / 2$ .

pihat515 <- predict(glm_out,newdata = hg515,type = "response")

y515 <- hg515$w
yhat515 <- as.numeric(pihat515 >= 0.5)
correct <- mean(y515 == 1 & yhat515 == 1)

The rate of correct classification for 
the STAT 515 students based on this model 
was 0.5454545.

6.

Now fit a model on the STAT 516 students which uses height to predict the gender response “w”. Then, as in the previous part, obtain predicted probabilities based on this model for the STAT 515 students and report the proportion of these correctly classified as “w” or “m” under the rule which assigns the class “w” when the predicted probability is at least $1 / 2$ .

glm2_out <- glm(w ~ hgt,data = hg516)
pihat515 <- predict(glm2_out,newdata = hg515,type="response")

y515 <- hg515$w
yhat515 <- as.numeric(pihat515 >= 0.5)
correct <- mean(y515 == 1 & yhat515 == 1)

The rate of correct classification for 
the STAT 515 students based on this model 
was 0.6363636.

7.

State whether you believe index finger length or height is a better predictor of gender. Explain your answer.

The model based on height gave a 
better out-of-sample rate of correct classification, 
so it seems to be a better model.

8.

Investigate whether one should include height, index finger, and pinky finger length all together in one model to predict gender. Report your findings.

glm3_out <- glm(w ~ ind + pnk + hgt, data = hg)
summary(glm3_out)


Call:
glm(formula = w ~ ind + pnk + hgt, data = hg)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.150597   0.817712   7.522 3.05e-09 ***
ind         -0.007105   0.016934  -0.420    0.677    
pnk          0.011988   0.015285   0.784    0.437    
hgt         -0.033746   0.006539  -5.161 6.68e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.1245604)

    Null deviance: 11.111  on 44  degrees of freedom
Residual deviance:  5.107  on 41  degrees of freedom
AIC: 39.782

Number of Fisher Scoring iterations: 2

It appears that once height is included, 
index and pinky finger length do not significantly 
contribute to the model.