Miscellaneous Modeling Topics

I want to spend a little bit of time talking about ways to model non-linear trends within a linear model as well as show an example of conducting a logistic regression within R using the glm function. This set of notes will use the following packages:

library(modelr)
library(broom)
library(forcats)
library(tidyverse)

glm function

The glm function behaves much like the lm function. The only major difference in model logistics is that we will now need to specify a family argument. This family argument will depend on the type of model being fitted. We are going to perform a logistic regression, therefore this family will be binomial.

The data we will use is from Kaggle and is data from Titanic, more specifically the data has characteristics on the passengers and whether they survived the shipwreck or not. You can get a sense of the variables from this website: https://www.kaggle.com/c/titanic/data.

library(titanic)

titanic <- bind_rows(titanic_test,
                     titanic_train)
head(titanic)
##   PassengerId Pclass                                         Name    Sex  Age
## 1         892      3                             Kelly, Mr. James   male 34.5
## 2         893      3             Wilkes, Mrs. James (Ellen Needs) female 47.0
## 3         894      2                    Myles, Mr. Thomas Francis   male 62.0
## 4         895      3                             Wirz, Mr. Albert   male 27.0
## 5         896      3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0
## 6         897      3                   Svensson, Mr. Johan Cervin   male 14.0
##   SibSp Parch  Ticket    Fare Cabin Embarked Survived
## 1     0     0  330911  7.8292              Q       NA
## 2     1     0  363272  7.0000              S       NA
## 3     0     0  240276  9.6875              Q       NA
## 4     0     0  315154  8.6625              S       NA
## 5     1     1 3101298 12.2875              S       NA
## 6     0     0    7538  9.2250              S       NA

Suppose we were interested in fitting a model that predicted whether a passenger survived or not. Using lm is not appropriate as the dependent variable is not continuous, rather it is dichotomous. Using logistic regression is more appropriate here.

surv_mod <- glm(Survived ~ factor(Pclass) + Fare + Sex + Age, 
                data = titanic, 
                family = binomial)
summary(surv_mod)
## 
## Call:
## glm(formula = Survived ~ factor(Pclass) + Fare + Sex + Age, family = binomial, 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7393  -0.6788  -0.3956   0.6486   2.4639  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.7225052  0.4645113   8.014 1.11e-15 ***
## factor(Pclass)2 -1.2765903  0.3126370  -4.083 4.44e-05 ***
## factor(Pclass)3 -2.5415762  0.3277677  -7.754 8.89e-15 ***
## Fare             0.0005226  0.0022579   0.231    0.817    
## Sexmale         -2.5185052  0.2082017 -12.096  < 2e-16 ***
## Age             -0.0367302  0.0077325  -4.750 2.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 647.23  on 708  degrees of freedom
##   (595 observations deleted due to missingness)
## AIC: 659.23
## 
## Number of Fisher Scoring iterations: 5

It is common to interpret these in terms of probability, we can do this with the following bit of code:

prob_mod <- titanic %>%
  select(Survived, Pclass, Fare, Sex, Age) %>%
  na.omit() %>%
  mutate(
    probability = predict(surv_mod, type = 'response')
  )

head(prob_mod, n = 15)
##     Survived Pclass    Fare    Sex Age probability
## 419        0      3  7.2500   male  22  0.10509514
## 420        1      1 71.2833 female  38  0.91404126
## 421        1      3  7.9250 female  26  0.55726903
## 422        1      1 53.1000 female  35  0.92162960
## 423        0      3  8.0500   male  35  0.06793029
## 425        0      1 51.8625   male  54  0.32031425
## 426        0      3 21.0750   male   2  0.19781236
## 427        1      3 11.1333 female  27  0.54860408
## 428        1      2 30.0708 female  14  0.87516354
## 429        1      3 16.7000 female   4  0.73937738
## 430        1      1 26.5500 female  58  0.83285937
## 431        0      3  8.0500   male  20  0.11224887
## 432        0      3 31.2750   male  39  0.05987747
## 433        0      3  7.8542 female  14  0.66168470
## 434        1      2 16.0000 female  55  0.60685621

We could now plot these probabilities to explore the effects.

ggplot(prob_mod, aes(Age, probability, color = Sex, linetype = Sex)) + 
  geom_line(size = 1) + 
  facet_grid(. ~ Pclass) + 
  theme_bw() + 
  xlab("Age") + 
  ylab("Probability of Survival")

Previous
Next