R - Logistic Regression

About

How to fit a logistic regression in R with glm and evaluate its classification performance.

Steps

Model

The model is fitted with a call to glm, to which we give a formula, the data frame, and the binomial family:

logisticRegressionModel = glm(response ~ variable1 + variable2 + ... + variableN, data = dataframe, family = binomial)
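
As a purely illustrative sketch (the column names below are not part of the example above), the same call on the built-in mtcars data set, predicting the transmission type am from hp and wt, would look like this:

# am is a 0/1 response (automatic vs manual transmission)
exampleModel = glm(am ~ hp + wt, data = mtcars, family = binomial)
# Fitted coefficients, on the log-odds scale
coef(exampleModel)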

Summary
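
The report below is the kind of output produced by calling summary on the fitted model:

summary(logisticRegressionModel)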

Call:
glm(formula = Response ~ Variable1 + Variable2, family = binomial, data = dataframe)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.446  -1.203   1.065   1.145   1.326  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Variable1   -0.073074   0.050167  -1.457    0.145
Variable2   -0.042301   0.050086  -0.845    0.398

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3
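
If needed, the pieces of this report can also be extracted programmatically with the standard accessors (a small sketch):

# Fitted coefficients on the log-odds scale
coef(logisticRegressionModel)
# Coefficient table: estimates, standard errors, z values, p-values
summary(logisticRegressionModel)$coefficients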

In this output, the null deviance is the deviance of the intercept-only model and the residual deviance is the deviance of the fitted model. Here the change in deviance is very modest (from 1731.2 to 1727.6), so the predictors add little.
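
The drop in deviance can be tested formally with a chi-square (likelihood-ratio) test. A small sketch, plugging in the deviances and degrees of freedom from the summary output above (anova(logisticRegressionModel, test = "Chisq") gives the same kind of test, term by term):

devianceDrop = 1731.2 - 1727.6
dfDrop = 1249 - 1243
# p-value of the test; a large value (roughly 0.73 here) means no evidence that the predictors help
pchisq(devianceDrop, df = dfDrop, lower.tail = FALSE)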

Prediction Probabilities

# predict gives a vector of fitted probabilities
probabilities = predict(logisticRegressionModel, type = "response")
# Look at the first five
probabilities[1:5]
1         2         3         4         5 
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 

In this case, the fitted probabilities are all very close to 50% (far from 0% or 100%), which indicates that the model makes no strong predictions: the relation between the predictors and the response is weak.
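
A quick way to see this spread is to summarize or plot the fitted probabilities:

# Range and quartiles of the fitted probabilities
summary(probabilities)
# Histogram of the fitted probabilities
hist(probabilities)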

Classification

# probabilities > 0.5 returns a vector of TRUEs and FALSEs
estimatedResponses = ifelse(probabilities > 0.5, "True", "False")

Accuracy

Confusion matrix

The confusion matrix of the estimated responses (estimatedResponses) against the true responses (trueResponses) is built with table:

table(estimatedResponses, trueResponses)
                  trueResponses
estimatedResponses False True
             False   145  141
             True    457  507

Mean classification performance

mean(estimatedResponses == trueResponses)
[1] 0.5216

We do slightly better than chance: 52.16% of the observations are classified correctly.
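
The same number can be recomputed from the confusion matrix above, as a check (storing the table in a variable first):

confusionMatrix = table(estimatedResponses, trueResponses)
# Correct classifications sit on the diagonal: (145 + 507) / 1250 = 0.5216
sum(diag(confusionMatrix)) / sum(confusionMatrix)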

Data Set Splits

Because we may have overfit, we split the data set into a training part and a test part.

train = variable < 2005
logisticTrainRegressionModel = glm(response ~ variable1 + variable2 + ... + variableN, data = dataframe, family = binomial, subset = train)
# predict with newdata gives the predicted probabilities on the held-out (test) observations
testProbabilities = predict(logisticTrainRegressionModel, type = "response", newdata = dataframe[!train, ])
testEstimatedResponses = ifelse(testProbabilities > 0.5, "True", "False")
table(testEstimatedResponses, trueResponses[!train])
mean(testEstimatedResponses == trueResponses[!train])
[1] 0.4801587

On the test data we do worse than the null rate of 50% (only 48.0% correct).
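
The null rate is simply the accuracy obtained by always predicting the most frequent class in the test set; assuming trueResponses holds the true labels as above, it can be checked directly:

# Class proportions among the true test responses
prop.table(table(trueResponses[!train]))
# The null rate is the larger of the two proportions
max(prop.table(table(trueResponses[!train])))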

Doing worse on the test data than on the training data suggests that we might be overfitting.

That doesn't necessarily mean the model can't make any reasonable predictions. It may just mean that these variables are highly correlated. To check that, we can fit a smaller model (i.e. one with fewer variables in the regression), as sketched below.
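
A sketch of that last step, reusing the placeholder names from above and keeping only variable1 as an illustration:

# Check how correlated the predictors are
cor(dataframe[, c("variable1", "variable2")])
# Refit a smaller model on the training data
smallerModel = glm(response ~ variable1, data = dataframe, family = binomial, subset = train)
smallerTestProbabilities = predict(smallerModel, type = "response", newdata = dataframe[!train, ])
smallerTestEstimatedResponses = ifelse(smallerTestProbabilities > 0.5, "True", "False")
# Test-set performance of the smaller model
mean(smallerTestEstimatedResponses == trueResponses[!train])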