STAT 516 Lec 12 (accessible)
Logistic regression
Programming task data from Kutner et al. (2005)
Twenty-five people succeeded or failed at a programming task.
Months of programming experience was recorded for each person.
Can we predict probability of success based on experience?
experience <- c(14,29,6,25,18,4,18,12,22,6,30,11,30,5,20,13,9,32,24,13,19,4,28,22,8)
success <- c(0,0,0,1,1,0,0,0,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,1,1)
plot(success ~ experience)

Logistic regression model
Assume $Y_i \sim \text{Bernoulli}(\pi_i)$ for $i = 1, \dots, n$, where
- $Y_i \in \{0, 1\}$ is the response for observation $i$.
- $x_i$ is the value of a predictor/covariate/explanatory variable for obs $i$.
- $\pi_i = P(Y_i = 1)$ is the probability of “success” for observation $i$.
- $\beta_1$ and $\beta_0$ are slope and intercept parameters.
- $\pi_i / (1 - \pi_i)$ is the odds of “success” for obs $i$.
- $\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_i$ is the log-odds for obs $i$.
Logistic regression assumes the log-odds are linear in the predictor.
Odds
Let $\pi$ be the probability of success. Then
$$\frac{\pi}{1 - \pi}$$
is called the odds in favor of success.
- If $\pi = 1/2$, then $\pi/(1-\pi) = 1$: “one-to-one” odds of success.
- If $\pi = 2/3$, then $\pi/(1-\pi) = 2$: success 2x more likely than failure.
- If $\pi = 1/4$, then $\pi/(1-\pi) = 1/3$: failure 3x more likely than success.
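A quick check of these three examples in R (an illustrative snippet, not from the original slides):
p <- c(1/2, 2/3, 1/4)   # success probabilities from the examples above
p / (1 - p)             # corresponding odds: 1, 2, 0.333...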
The logit and logistic transformations
- The transformation $\pi = \dfrac{e^{\eta}}{1 + e^{\eta}}$ is called the logistic transformation.
- Its inverse $\eta = \log\left(\dfrac{\pi}{1 - \pi}\right)$ is called the logit transformation.
- We have logit(logistic($\eta$)) $= \eta$; the code below plots each curve.
x <- seq(-10,10,length=200)
y <- exp(x) / (1 + exp(x))
par(mfrow= c(1,2))
plot(y~x,type = "l")
plot(x~y,type = "l")

Goals in logistic regression
- Estimate $\beta_0$ and $\beta_1$.
- Obtain fitted probabilities $\hat{\pi}_i$.
- Build a CI for $\beta_1$ and test $H_0\colon \beta_1 = 0$.
- Give interpretations of the estimated regression coefficients.
- Check goodness of fit of the logistic regression model.
- Add additional covariates.
Maximum likelihood estimation in logistic regression
- We do not use least squares to estimate $\beta_0$ and $\beta_1$.
- Instead we use maximum likelihood estimators (MLEs).
- The MLEs are the parameter values giving the observed data the highest possible probability.
- Intercept $\beta_0$ and slope $\beta_1$ give to the observed data the probability
$$L(\beta_0, \beta_1) = \prod_{i=1}^n \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}, \quad \text{with } \pi_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \text{ for } i = 1, \dots, n.$$
- The MLEs $\hat{\beta}_0, \hat{\beta}_1$ are the values of $\beta_0, \beta_1$ that maximize $L(\beta_0, \beta_1)$; $L(\beta_0, \beta_1)$ is called the likelihood function.
Computing the MLEs in logistic regression
- There is no “closed-form” expression for $\hat{\beta}_0$ and $\hat{\beta}_1$.
- One must find their values numerically, that is, with an algorithm.
- It is more convenient to work with the log-likelihood $\ell(\beta_0, \beta_1) = \log L(\beta_0, \beta_1)$, which is given by
$$\ell(\beta_0, \beta_1) = \sum_{i=1}^n \left[ Y_i(\beta_0 + \beta_1 x_i) - \log\left(1 + e^{\beta_0 + \beta_1 x_i}\right) \right].$$
- Newton’s method is one way to find the maximizers of $\ell(\beta_0, \beta_1)$.
Programming task data (cont)
Newton-Raphson algorithm for computing $\hat{\beta}_0$ and $\hat{\beta}_1$
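The original slide code is not preserved here; the following is a minimal Newton-Raphson sketch for these data, using the update $\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (X^\top W X)^{-1} X^\top (Y - \boldsymbol{\pi})$ with weights $w_i = \pi_i(1 - \pi_i)$:
X <- cbind(1, experience)   # design matrix: intercept column plus predictor
Y <- success
beta <- c(0, 0)             # starting values
for (t in 1:25) {
  eta <- as.vector(X %*% beta)
  pi_t <- exp(eta) / (1 + exp(eta))                           # current fitted probabilities
  W <- diag(pi_t * (1 - pi_t))                                # Bernoulli variance weights
  delta <- solve(t(X) %*% W %*% X, t(X) %*% (Y - pi_t))       # Newton-Raphson step
  beta <- beta + as.vector(delta)
  if (max(abs(delta)) < 1e-8) break                           # stop when the step is negligible
}
beta   # close to coef(glm(success ~ experience, family = "binomial"))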
Generalized linear models
- The logistic regression model is in a class of models called GLMs.
- GLM stands for generalized linear model.
- Poisson regression and binomial response regression, among others, are GLMs too.
- Use the glm() function in R to obtain $\hat{\beta}_0$ and $\hat{\beta}_1$.
Use the glm() function with the option family = "binomial".
glm_out <- glm(success ~ experience, family = "binomial")
summary(glm_out)
Call:
glm(formula = success ~ experience, family = "binomial")
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.05970 1.25935 -2.430 0.0151 *
experience 0.16149 0.06498 2.485 0.0129 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.296 on 24 degrees of freedom
Residual deviance: 25.425 on 23 degrees of freedom
AIC: 29.425
Number of Fisher Scoring iterations: 4
x <- seq(min(experience),max(experience),length = 200)
pihat_x <- 1/(1 + exp( -(coef(glm_out)[1] + coef(glm_out)[2]*x)))
plot(success ~ experience); lines(pihat_x~x)

Fitted probabilities
- Define the fitted probabilities as
$$\hat{\pi}_i = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x_i}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x_i}}, \quad i = 1, \dots, n.$$
- For any value $x$, we estimate the probability of “success” as
$$\hat{\pi}(x) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x}}.$$
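For example, the estimated success probability at 20 months of experience (an illustrative value, computed with predict() on the glm fit above):
predict(glm_out, newdata = data.frame(experience = 20), type = "response")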
Programming task data (cont)
plot(success ~ experience); lines(pihat_x~x)
points(glm_out$fitted.values~experience,pch = 19)

Asymptotic distribution of slope estimator and CI
- For large enough $n$, $\hat{\beta}_1$ is approximately Normal, such that
$$\frac{\hat{\beta}_1 - \beta_1}{\widehat{\text{se}}(\hat{\beta}_1)} \overset{\text{approx}}{\sim} \text{Normal}(0, 1),$$
where, setting $w_i = \hat{\pi}_i(1 - \hat{\pi}_i)$ for $i = 1, \dots, n$, we may write
$$\widehat{\text{se}}(\hat{\beta}_1) = \left( \sum_{i=1}^n w_i x_i^2 - \frac{\left(\sum_{i=1}^n w_i x_i\right)^2}{\sum_{i=1}^n w_i} \right)^{-1/2}.$$
- We can make an approximate $(1 - \alpha)100\%$ CI for $\beta_1$ as
$$\hat{\beta}_1 \pm z_{\alpha/2} \, \widehat{\text{se}}(\hat{\beta}_1).$$
Programming task data (cont)
# Confidence interval for beta1
pihat <- glm_out$fitted.values
b1hat <- coef(glm_out)[2]
w <- pihat*(1-pihat)
se <- sqrt(1/(sum(w*experience^2) - sum(w*experience)^2/sum(w)))
lo <- b1hat - 1.96 * se
up <- b1hat + 1.96 * se
c(lo,up)
experience experience
0.03412491 0.28884692
# CIs for both beta0 and beta1 automatically from glm_out
confint.default(glm_out)
2.5 % 97.5 %
(Intercept) -5.52797622 -0.5914155
experience 0.03412744 0.2888444
Testing whether the slope coefficient is zero
To test $H_0\colon \beta_1 = 0$ versus $H_1\colon \beta_1 \neq 0$:
Compute $z = \hat{\beta}_1 / \widehat{\text{se}}(\hat{\beta}_1)$.
Reject $H_0$ at level $\alpha$ if $|z| > z_{\alpha/2}$.
Obtain the p-value as $2P(Z > |z|)$, where $Z \sim \text{Normal}(0, 1)$.
The summary() function on the glm() output prints this p value.
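As a check, the z statistic and p-value can be computed by hand from b1hat and se obtained in the CI code above (a minimal sketch):
z <- b1hat / se      # test statistic for H0: beta1 = 0
2 * pnorm(-abs(z))   # two-sided p-value; matches Pr(>|z|) from summary()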
Odds ratios
Let $\pi_{\text{initial}}$ and $\pi_{\text{altered}}$ be success probs under an initial and an altered condition, respectively. Then we call the ratio
$$\text{OR} = \frac{\pi_{\text{altered}} / (1 - \pi_{\text{altered}})}{\pi_{\text{initial}} / (1 - \pi_{\text{initial}})}$$
the odds ratio associated with the change from the initial to the altered condition. The odds ratio is the factor by which the odds are multiplied when the initial condition is changed to the altered condition.
Interpreting the logistic regression parameters
Let $\pi(x)$ and $\pi(x+1)$ be the “success” probabilities at $x$ and $x + 1$. Then we have the two equations
$$\log\frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x \qquad \text{and} \qquad \log\frac{\pi(x+1)}{1 - \pi(x+1)} = \beta_0 + \beta_1 (x + 1).$$
Subtracting the first equation from the second gives
$$\log\left[\frac{\pi(x+1) / (1 - \pi(x+1))}{\pi(x) / (1 - \pi(x))}\right] = \beta_1.$$
The quantity inside the logarithm is called an odds ratio. So $\beta_1$ is the log of the odds ratio associated with a unit increase in $x$.
Odds ratio from a unit increase in x
From the previous slide, we have
$$e^{\beta_1} = \frac{\pi(x+1) / (1 - \pi(x+1))}{\pi(x) / (1 - \pi(x))}.$$
Can build a CI for $e^{\beta_1}$ by exponentiating the CI for $\beta_1$: this gives the CI for $e^{\beta_1}$ as $(e^{\text{lo}}, e^{\text{up}})$.
A unit increase in $x$ multiplies the odds of success by the factor $e^{\beta_1}$.
What if the CI for $e^{\beta_1}$ contains 1?
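For the programming task data, the estimated odds ratio itself is (a minimal sketch):
exp(coef(glm_out)["experience"])   # factor by which one extra month multiplies the odds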
Programming task data (cont)
exp(confint.default(glm_out,parm = "experience"))
2.5 % 97.5 %
experience 1.034716 1.334884
Each additional month of experience increases the odds of completing the programming task by a factor of 1.035 to 1.335, with 95% confidence.
Residuals for logistic regression
- Ordinary residuals $Y_i - \hat{\pi}_i$ cannot be Normally distributed.
- In GLMs, one looks at special residuals called deviance residuals.
- In logistic regression, the deviance residuals are defined as
$$d_i = \text{sign}(Y_i - \hat{\pi}_i)\sqrt{-2\left[ Y_i \log \hat{\pi}_i + (1 - Y_i)\log(1 - \hat{\pi}_i) \right]}$$
for $i = 1, \dots, n$.
- These are not Normal either, but are useful for assessing model fit.
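The definition can be verified against R’s built-in deviance residuals (a minimal sketch reusing pihat from the CI code above):
d <- sign(success - pihat) * sqrt(-2 * (success * log(pihat) + (1 - success) * log(1 - pihat)))
max(abs(d - residuals(glm_out, type = "deviance")))   # essentially zero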
Programming task data (cont)
plot(glm_out,which = 1)

Checking model fit with a simulated envelope
The simulated envelope method is described in Kutner et al. (2005):
- Fit the logistic regression model and obtain $\hat{\pi}_1, \dots, \hat{\pi}_n$.
- Obtain the deviance residuals; sort them as $d_{(1)} \leq \dots \leq d_{(n)}$.
- Generate many new data sets $Y_1^{(b)}, \dots, Y_n^{(b)}$ with $Y_i^{(b)} \sim \text{Bernoulli}(\hat{\pi}_i)$, for $b = 1, \dots, B$.
- For each new data set, obtain sorted deviance residuals $d_{(1)}^{(b)} \leq \dots \leq d_{(n)}^{(b)}$.
- Plot $d_{(1)}, \dots, d_{(n)}$ as well as the 0.025 and 0.975 quantiles and the mean of the $d_{(i)}^{(b)}$ (it doesn’t matter what is chosen as the $x$-axis).
- The quantiles of the $d_{(i)}^{(b)}$ make a band. If the model fits, then the $d_{(i)}$ should lie within the band and close to the mean.
Asks: If the model is correct, how would the deviance residuals behave?
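A hand-rolled version of this procedure for the programming task data (an illustrative sketch; the glmtoolbox function used below is the more convenient route):
B <- 500
d_sim <- matrix(0, B, length(success))
for (b in 1:B) {
  y_b <- rbinom(length(success), 1, pihat)                 # simulate responses from the fitted model
  fit_b <- glm(y_b ~ experience, family = "binomial")
  d_sim[b,] <- sort(residuals(fit_b, type = "deviance"))   # sorted deviance residuals
}
band <- apply(d_sim, 2, quantile, probs = c(0.025, 0.975))
d_obs <- sort(residuals(glm_out, type = "deviance"))
plot(d_obs, ylim = range(band, d_obs), ylab = "sorted deviance residuals")
lines(band[1,]); lines(band[2,]); lines(colMeans(d_sim), lty = 2)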
Programming task data (cont)
library(glmtoolbox) # first time run install.packages("glmtoolbox")
envelope(glm_out,type = "deviance")
experience2 <- (experience - mean(experience))^2
envelope(glm(success ~ experience2, family = "binomial"),type = "deviance")

German credit score data from Hofmann (1994)
Response is credit rating (good/bad), various predictors.
library(foreign) # credit-g dataset from https://www.openml.org/
link <- url("https://people.stat.sc.edu/gregorkb/data/dataset_31_credit-g.arff")
credg <- read.arff(link)
colnames(credg)
 [1] "checking_status" "duration" "credit_history"
[4] "purpose" "credit_amount" "savings_status"
[7] "employment" "installment_commitment" "personal_status"
[10] "other_parties" "residence_since" "property_magnitude"
[13] "age" "other_payment_plans" "housing"
[16] "existing_credits" "job" "num_dependents"
[19] "own_telephone" "foreign_worker" "class"
summary(credg[,1:3])
checking_status duration credit_history
<0 :274 Min. : 4.0 all paid : 49
>=200 : 63 1st Qu.:12.0 critical/other existing credit:293
0<=X<200 :269 Median :18.0 delayed previously : 88
no checking:394 Mean :20.9 existing paid :530
3rd Qu.:24.0 no credits/all paid : 40
Max. :72.0
Logistic multiple regression model
Observe $(x_{i1}, \dots, x_{ip}, Y_i)$ for $i = 1, \dots, n$, and assume
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip},$$
where
- $Y_i \in \{0, 1\}$ is the response for observation $i$.
- $x_{i1}, \dots, x_{ip}$ are the values of the predictors for obs $i$.
- $\pi_i = P(Y_i = 1)$ is the probability of “success” for observation $i$.
- $\beta_0$ is an intercept and $\beta_1, \dots, \beta_p$ are slope parameters.
- $\pi_i / (1 - \pi_i)$ is the odds of “success” for obs $i$.
- $\log(\pi_i / (1 - \pi_i))$ is the log-odds for obs $i$.
So we assume the log-odds are a linear function of the predictors.
Interpreting multiple logistic regression parameters
- Let $\pi$ and $\pi'$ be the “success” probabilities at $x_j$ and $x_j + 1$, but with $x_k$ fixed for all $k \neq j$.
- Then we have the two equations
$$\log\frac{\pi}{1 - \pi} = \beta_0 + \dots + \beta_j x_j + \dots + \beta_p x_p \qquad \text{and} \qquad \log\frac{\pi'}{1 - \pi'} = \beta_0 + \dots + \beta_j (x_j + 1) + \dots + \beta_p x_p.$$
- Subtracting the first equation from the second gives
$$\log\left[\frac{\pi' / (1 - \pi')}{\pi / (1 - \pi)}\right] = \beta_j.$$
- So $\beta_j$ is the log of the odds ratio associated with a unit increase in $x_j$ with all other predictors held fixed.
German credit score data (cont)
glm_out <- glm(class ~ ., family = "binomial", data = credg)
summary(glm_out)
Call:
glm(formula = class ~ ., family = "binomial", data = credg)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.505e+00 1.248e+00 1.206 0.227801
checking_status>=200 9.657e-01 3.692e-01 2.616 0.008905 **
checking_status0<=X<200 3.749e-01 2.179e-01 1.720 0.085400 .
checking_statusno checking 1.712e+00 2.322e-01 7.373 1.66e-13 ***
duration -2.786e-02 9.296e-03 -2.997 0.002724 **
credit_historycritical/other existing credit 1.579e+00 4.381e-01 3.605 0.000312 ***
credit_historydelayed previously 9.965e-01 4.703e-01 2.119 0.034105 *
credit_historyexisting paid 7.295e-01 3.852e-01 1.894 0.058238 .
credit_historyno credits/all paid 1.434e-01 5.489e-01 0.261 0.793921
purposedomestic appliance -2.173e-01 8.041e-01 -0.270 0.786976
purposeeducation -7.764e-01 4.660e-01 -1.666 0.095718 .
purposefurniture/equipment 5.152e-02 3.543e-01 0.145 0.884391
purposenew car -7.401e-01 3.339e-01 -2.216 0.026668 *
purposeother 7.487e-01 7.998e-01 0.936 0.349202
purposeradio/tv 1.515e-01 3.370e-01 0.450 0.653002
purposerepairs -5.237e-01 5.933e-01 -0.883 0.377428
purposeretraining 1.319e+00 1.233e+00 1.070 0.284625
purposeused car 9.264e-01 4.409e-01 2.101 0.035645 *
credit_amount -1.283e-04 4.444e-05 -2.887 0.003894 **
savings_status>=1000 1.339e+00 5.249e-01 2.551 0.010729 *
savings_status100<=X<500 3.577e-01 2.861e-01 1.250 0.211130
savings_status500<=X<1000 3.761e-01 4.011e-01 0.938 0.348476
savings_statusno known savings 9.467e-01 2.625e-01 3.607 0.000310 ***
employment>=7 2.097e-01 2.947e-01 0.712 0.476718
employment1<=X<4 1.159e-01 2.423e-01 0.478 0.632415
employment4<=X<7 7.641e-01 3.051e-01 2.504 0.012271 *
employmentunemployed -6.691e-02 4.270e-01 -0.157 0.875475
installment_commitment -3.301e-01 8.828e-02 -3.739 0.000185 ***
personal_statusmale div/sep -2.755e-01 3.865e-01 -0.713 0.476040
personal_statusmale mar/wid 9.162e-02 3.118e-01 0.294 0.768908
personal_statusmale single 5.406e-01 2.102e-01 2.572 0.010113 *
other_partiesguarantor 1.415e+00 5.685e-01 2.488 0.012834 *
other_partiesnone 4.360e-01 4.101e-01 1.063 0.287700
residence_since -4.776e-03 8.641e-02 -0.055 0.955920
property_magnitudelife insurance -8.690e-02 2.313e-01 -0.376 0.707115
property_magnitudeno known property -5.359e-01 4.017e-01 -1.334 0.182211
property_magnitudereal estate 1.945e-01 2.360e-01 0.824 0.409743
age 1.454e-02 9.222e-03 1.576 0.114982
other_payment_plansnone 6.463e-01 2.391e-01 2.703 0.006871 **
other_payment_plansstores 1.232e-01 4.119e-01 0.299 0.764878
housingown -2.402e-01 4.503e-01 -0.534 0.593687
housingrent -6.839e-01 4.770e-01 -1.434 0.151657
existing_credits -2.721e-01 1.895e-01 -1.436 0.151109
jobskilled -7.524e-02 2.845e-01 -0.264 0.791419
jobunemp/unskilled non res 4.795e-01 6.623e-01 0.724 0.469086
jobunskilled resident -5.666e-02 3.501e-01 -0.162 0.871450
num_dependents -2.647e-01 2.492e-01 -1.062 0.288249
own_telephoneyes 3.000e-01 2.013e-01 1.491 0.136060
foreign_workeryes -1.392e+00 6.258e-01 -2.225 0.026095 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1221.73 on 999 degrees of freedom
Residual deviance: 895.82 on 951 degrees of freedom
AIC: 993.82
Number of Fisher Scoring iterations: 5
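For example, the duration coefficient can be interpreted on the odds scale (a minimal sketch using the fit above):
exp(coef(glm_out)["duration"])   # factor by which each extra month of loan duration multiplies the odds of a good rating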
Note that glm() estimates three coefficients for checking_status.
summary(credg$checking_status)
<0 >=200 0<=X<200 no checking
274 63 269 394
glm() creates numeric 0/1 indicator predictors to encode the levels of the categorical predictor: the first level of checking_status (<0) is the baseline, and one indicator marks each of the remaining three levels.
Likewise for other categorical predictors.
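The indicator encoding can be inspected directly (a minimal sketch):
head(model.matrix(~ checking_status, data = credg))   # baseline level <0 is absorbed into the intercept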
Deviances replace error sums of squares in GLMs
- The deviance is the sum of squared deviance residuals: $\text{DEV} = \sum_{i=1}^n d_i^2$.
- In logistic regression the deviance can be computed as
$$\text{DEV} = -2 \sum_{i=1}^n \left[ Y_i \log \hat{\pi}_i + (1 - Y_i) \log(1 - \hat{\pi}_i) \right].$$
- Full-reduced model test: Reject $H_0\colon \beta_j = 0$ for all $j$ in the discarded set if $\text{DEV}_{\text{reduced}} - \text{DEV}_{\text{full}} > \chi^2_{s,\alpha}$, where $s$ is the number of coefficients set to zero under $H_0$ (need large $n$).
German credit score data (cont)
Test whether any level of checking status is important to the credit score.
credg_red <- credg[,-1] # remove checking_status column
glm_full <- glm(class ~ ., family = "binomial", data = credg)
glm_red <- glm(class ~ ., family = "binomial", data = credg_red)
p <- length(coef(glm_full)) - 1
s <- nlevels(credg$checking_status) - 1
1 - pchisq(glm_red$deviance - glm_full$deviance, s)
[1] 2.731149e-14
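The same test can be run directly with anova() (a minimal sketch):
anova(glm_red, glm_full, test = "Chisq")   # same deviance difference and p-value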
Variable selection in logistic regression
- We may want to discard some of our predictors.
- One way is to add/remove variables stepwise according to AIC.
- Can do this just as we did in multiple linear regression.
- Be cautious about making inferences after selecting a model.
German credit score data (cont)
glm_all <- glm(class ~ ., family = "binomial", data = credg)
step_back <- step(glm_all,
                  direction = "backward",
                  scope = formula(glm_all),
                  k = 2,      # penalty of 2 per parameter, i.e. AIC-based selection
                  trace = 0)  # suppress printed output
summary(step_back)
Call:
glm(formula = class ~ checking_status + duration + credit_history +
purpose + credit_amount + savings_status + installment_commitment +
personal_status + other_parties + age + other_payment_plans +
housing + own_telephone + foreign_worker, family = "binomial",
data = credg)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.838e-01 1.017e+00 0.476 0.634362
checking_status>=200 1.024e+00 3.626e-01 2.824 0.004739 **
checking_status0<=X<200 3.900e-01 2.121e-01 1.839 0.065928 .
checking_statusno checking 1.718e+00 2.281e-01 7.531 5.05e-14 ***
duration -2.568e-02 8.940e-03 -2.872 0.004074 **
credit_historycritical/other existing credit 1.373e+00 4.041e-01 3.397 0.000680 ***
credit_historydelayed previously 7.910e-01 4.488e-01 1.762 0.077985 .
credit_historyexisting paid 7.115e-01 3.788e-01 1.879 0.060305 .
credit_historyno credits/all paid -1.188e-01 5.268e-01 -0.225 0.821612
purposedomestic appliance -2.576e-01 7.763e-01 -0.332 0.740041
purposeeducation -9.262e-01 4.569e-01 -2.027 0.042628 *
purposefurniture/equipment -4.216e-02 3.415e-01 -0.123 0.901748
purposenew car -7.827e-01 3.272e-01 -2.392 0.016752 *
purposeother 6.523e-01 7.832e-01 0.833 0.404946
purposeradio/tv 1.368e-01 3.288e-01 0.416 0.677335
purposerepairs -6.402e-01 5.808e-01 -1.102 0.270365
purposeretraining 1.382e+00 1.240e+00 1.114 0.265228
purposeused car 8.246e-01 4.288e-01 1.923 0.054495 .
credit_amount -1.294e-04 4.221e-05 -3.066 0.002169 **
savings_status>=1000 1.289e+00 5.072e-01 2.542 0.011008 *
savings_status100<=X<500 3.282e-01 2.767e-01 1.186 0.235477
savings_status500<=X<1000 4.304e-01 3.933e-01 1.094 0.273900
savings_statusno known savings 9.628e-01 2.570e-01 3.746 0.000179 ***
installment_commitment -3.299e-01 8.554e-02 -3.857 0.000115 ***
personal_statusmale div/sep -2.872e-01 3.763e-01 -0.763 0.445327
personal_statusmale mar/wid 1.297e-01 3.061e-01 0.424 0.671803
personal_statusmale single 5.356e-01 1.975e-01 2.712 0.006686 **
other_partiesguarantor 1.528e+00 5.581e-01 2.737 0.006193 **
other_partiesnone 4.874e-01 3.997e-01 1.220 0.222648
age 1.309e-02 8.398e-03 1.559 0.118967
other_payment_plansnone 6.995e-01 2.350e-01 2.976 0.002916 **
other_payment_plansstores 7.864e-02 4.033e-01 0.195 0.845394
housingown 2.918e-01 2.887e-01 1.011 0.312127
housingrent -1.497e-01 3.411e-01 -0.439 0.660793
own_telephoneyes 2.794e-01 1.842e-01 1.516 0.129394
foreign_workeryes -1.382e+00 6.207e-01 -2.227 0.025925 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1221.7 on 999 degrees of freedom
Residual deviance: 910.5 on 964 degrees of freedom
AIC: 982.5
Number of Fisher Scoring iterations: 5
envelope(step_back,type="deviance")

Classification with the logistic regression model
Consider classifying the observations as 1 or 0 according to their fitted probabilities.
Choose a threshold $c \in (0, 1)$ and make the classification
$$\hat{Y}_i = \begin{cases} 1 & \text{if } \hat{\pi}_i \geq c \\ 0 & \text{if } \hat{\pi}_i < c. \end{cases}$$
Can compute the misclassification rate $\frac{1}{n}\sum_{i=1}^n \mathbf{1}(\hat{Y}_i \neq Y_i)$.
Can compute observed true positive and false positive rates
$$\text{TP} = \frac{\#\{i\colon \hat{Y}_i = 1,\ Y_i = 1\}}{\#\{i\colon Y_i = 1\}}, \qquad \text{FP} = \frac{\#\{i\colon \hat{Y}_i = 1,\ Y_i = 0\}}{\#\{i\colon Y_i = 0\}}.$$
Plotting TP against FP over all $c \in [0, 1]$ creates the receiver operating characteristic (ROC) curve.
German credit score data (cont)
Compute TP and FP over a range of thresholds
Y <- ifelse(credg$class == "good",1,0)
pi_hat <- step_back$fitted.values
n1 <- sum(Y == 1)
n0 <- sum(Y == 0)
cc <- sort(c(0,pi_hat,1))
TP <- FP <- numeric(length(cc))
for(j in 1:length(cc)){
Yhat <- pi_hat >= cc[j]
TP[j] <- sum(Yhat == 1 & Y == 1) / n1
FP[j] <- sum(Yhat == 1 & Y == 0) / n0
}
plot(TP ~ FP, type = "l", main = "ROC curve")
abline(0,1,lty = 3)

Can use ROC curves to compare models.
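A common one-number summary for such comparisons is the area under the ROC curve (AUC); a minimal trapezoid-rule sketch using the TP and FP vectors above:
ord <- order(FP)   # sort the curve's points by false positive rate
sum(diff(FP[ord]) * (head(TP[ord], -1) + tail(TP[ord], -1)) / 2)   # AUC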
Training and testing data sets
Overfitting is fitting a model so that it performs well on the data used to fit it but performs poorly on new data.
To avoid this, split the data into a training set and a testing set of sizes $n_{\text{train}}$ and $n_{\text{test}}$, where $n_{\text{train}} + n_{\text{test}} = n$. Fit models on the training data; compare their performance on the testing data.
For example, compute misclassification rate or draw ROC curve for testing data.
German credit score data (cont)
set.seed(1)
n <- nrow(credg)
n_train <- ceiling(.5*n)
n_test <- n - n_train
ind_train <- sample(n,n_train) # randomly select a training set
credg_train <- credg[ind_train,]
credg_test <- credg[-ind_train,]
glm_train <- glm(class ~ ., family = "binomial", data = credg_train)
ypred_train <- glm_train$fitted.values >= 0.5
y_train <- as.numeric(credg_train$class == "good")
misclass_train <- mean(ypred_train != y_train)
misclass_train
[1] 0.204
pihat_test <- predict(glm_train,newdata = credg_test,type="response")
ypred_test <- pihat_test >= 0.5
y_test <- as.numeric(credg_test$class == "good")
misclass_test <- mean(ypred_test != y_test)
misclass_test
[1] 0.258
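As suggested above, one can also draw the ROC curve for the testing data (a minimal sketch reusing the threshold-loop logic from earlier):
cc <- sort(c(0, pihat_test, 1))
TP_test <- FP_test <- numeric(length(cc))
for (j in 1:length(cc)) {
  yhat <- pihat_test >= cc[j]
  TP_test[j] <- sum(yhat == 1 & y_test == 1) / sum(y_test == 1)
  FP_test[j] <- sum(yhat == 1 & y_test == 0) / sum(y_test == 0)
}
plot(TP_test ~ FP_test, type = "l", main = "ROC curve for testing data")
abline(0, 1, lty = 3)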