Model Assumptions

I want to spend a little bit of time talking about model assumptions. These are important and should be checked for any analysis. The following packages will be used for this set of notes.

library(modelr)
library(broom)
library(forcats)
library(tidyverse)

We are also going to use this model from the advanced model section to explore model assumptions.

heights2 <- heights %>%
  mutate(
    marital_comb = fct_recode(marital,
          'Other' = 'separated',
          'Other' = 'widowed'
    )
  )

afqt_mod <- lm(afqt ~ marital_comb + height, data = heights2)
summary(afqt_mod)
## 
## Call:
## lm(formula = afqt ~ marital_comb + height, data = heights2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.150 -22.517  -4.248  22.097  78.355 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -28.99932    5.67144  -5.113 3.25e-07 ***
## marital_combmarried   16.13531    0.96385  16.741  < 2e-16 ***
## marital_combOther     -4.63978    1.50160  -3.090  0.00201 ** 
## marital_combdivorced   6.75490    1.11278   6.070 1.35e-09 ***
## height                 0.89847    0.08329  10.787  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.78 on 6739 degrees of freedom
##   (262 observations deleted due to missingness)
## Multiple R-squared:  0.08467,    Adjusted R-squared:  0.08413 
## F-statistic: 155.8 on 4 and 6739 DF,  p-value: < 2.2e-16

plot function with model object

One of the nicest features when model building within R, is that many diagnostic plots are very accessible using the plot function. For example:

plot(afqt_mod)

These plots are shown one at a time in an interactive R session, but allow you to check many common assumptions such as normal residuals, linearity, homogeneity of variance, and even leverage.

You can see from these figures that there is quite a bit to be desired from our model. First, the residuals are not normally distributed (very heavy tails) and more problematic is that the residuals have a trend to them. This suggests that we are missing an important variable in the model.

Suppose we add some additional variables to the model keeping the model additive. Note, I am also mean centering many of the continuous variables to ease interpretation slightly.

heights2 <- heights2 %>%
  mutate(income2 = ifelse(income == 0, .001, income),
         height2 = height - mean(height, na.rm = TRUE),
         education2 = education - mean(education, na.rm = TRUE),
         weight2 = weight - mean(weight, na.rm = TRUE),
         log_income = log(income2))
afqt_alt <- lm(afqt ~ marital_comb + height2 + education2 + 
                 log_income + weight2, data = heights2)
summary(afqt_alt)
## 
## Call:
## lm(formula = afqt ~ marital_comb + height2 + education2 + log_income + 
##     weight2, data = heights2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.466 -16.515  -2.036  16.088 101.730 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          32.645765   0.717463  45.502  < 2e-16 ***
## marital_combmarried   9.312248   0.802952  11.598  < 2e-16 ***
## marital_combOther    -2.537031   1.231340  -2.060   0.0394 *  
## marital_combdivorced  4.790976   0.913836   5.243 1.63e-07 ***
## height2               0.781474   0.077526  10.080  < 2e-16 ***
## education2            5.911616   0.111848  52.854  < 2e-16 ***
## log_income            0.395176   0.038720  10.206  < 2e-16 ***
## weight2              -0.027669   0.007074  -3.911 9.27e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.57 on 6637 degrees of freedom
##   (361 observations deleted due to missingness)
## Multiple R-squared:  0.3969, Adjusted R-squared:  0.3963 
## F-statistic:   624 on 7 and 6637 DF,  p-value: < 2.2e-16
plot(afqt_alt)

We could continue to model this variable by including additional predictors or interactions between our current predictors to attempt to remove this strong trend in the residuals. I’m going to stop here however and move on to another topic.

Exploring individual residuals

Evaluating all of the residuals in a single step using the plot function can be useful initially to explore model quality. However, eventually it is useful to explore residuals for specific values to explore why these are large.

There are a few options to do this, one uses the broom package, the other uses the modelr package. I show both below.

aug_afqt <- augment(afqt_alt)
aug_afqt
## # A tibble: 6,645 × 13
##    .rownames  afqt marital_comb height2 education2 log_income weight2 .fitted
##    <chr>     <dbl> <fct>          <dbl>      <dbl>      <dbl>   <dbl>   <dbl>
##  1 1          6.84 married       -7.10      -0.218       9.85  -33.3     39.9
##  2 2         49.4  married        2.90      -3.22       10.5   -32.3     30.2
##  3 3         99.4  married       -2.10       2.78       11.6     6.70    61.1
##  4 4         44.0  married       -4.10       0.782      10.6     8.70    47.3
##  5 5         59.7  married       -1.10       0.782      11.2     1.70    50.1
##  6 6         98.8  divorced       0.896      4.78       11.5    11.7     70.6
##  7 7         82.3  married        6.90       2.78       -6.91   36.7     60.0
##  8 8         50.3  divorced      -3.10      -1.22       11.2   -28.3     33.0
##  9 9         89.7  divorced       1.90      -1.22       11.0   -26.3     36.8
## 10 10        96.0  divorced       1.90      -0.218      11.9     5.70    42.2
## # … with 6,635 more rows, and 5 more variables: .resid <dbl>, .hat <dbl>,
## #   .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
height_resid <- heights2 %>%
  add_residuals(afqt_alt) %>%
  add_predictions(afqt_alt)
height_resid
## # A tibble: 7,006 × 16
##    income height weight   age marital sex   education  afqt marital_comb income2
##     <int>  <dbl>  <int> <int> <fct>   <fct>     <int> <dbl> <fct>          <dbl>
##  1  19000     60    155    53 married fema…        13  6.84 married      1.9 e+4
##  2  35000     70    156    51 married fema…        10 49.4  married      3.5 e+4
##  3 105000     65    195    52 married male         16 99.4  married      1.05e+5
##  4  40000     63    197    54 married fema…        14 44.0  married      4   e+4
##  5  75000     66    190    49 married male         14 59.7  married      7.5 e+4
##  6 102000     68    200    49 divorc… fema…        18 98.8  divorced     1.02e+5
##  7      0     74    225    48 married male         16 82.3  married      1   e-3
##  8  70000     64    160    54 divorc… fema…        12 50.3  divorced     7   e+4
##  9  60000     69    162    55 divorc… male         12 89.7  divorced     6   e+4
## 10 150000     69    194    54 divorc… male         13 96.0  divorced     1.5 e+5
## # … with 6,996 more rows, and 6 more variables: height2 <dbl>,
## #   education2 <dbl>, weight2 <dbl>, log_income <dbl>, resid <dbl>, pred <dbl>

The benefit to the augment function from the broom package is that it includes more information. The main benefit fo the add_residuals function from the modelr pacakge is that it includes all of the original data, not just those variables included in the final model. Therefore if you are fitting a model and want to explore trends in residuals based on other data characteristics not currently in the model, using the add_residuals function would likely be better for this use.

Filtering and Plotting Residuals

Now that we have these in a data frame, we could easily plot or filter residuals to explore cases in which the model is not adequatly fitting the dependent variable. For example, maybe we want to explore residuals greater than 50 in absolute value. This can easily be done with dplyr and the filter function.

filter(height_resid, abs(resid) > 50)
## # A tibble: 113 × 16
##    income height weight   age marital sex   education  afqt marital_comb income2
##     <int>  <dbl>  <int> <int> <fct>   <fct>     <int> <dbl> <fct>          <dbl>
##  1  60000     69    162    55 divorc… male         12  89.7 divorced      6  e+4
##  2 150000     69    194    54 divorc… male         13  96.0 divorced      1.5e+5
##  3  45000     69    240    53 married male         12  98.8 married       4.5e+4
##  4      0     66    195    50 married fema…        14  99.2 married       1  e-3
##  5  93000     68    182    48 divorc… male         12  86.9 divorced      9.3e+4
##  6  43000     64    145    51 divorc… fema…        12  94.8 divorced      4.3e+4
##  7  52000     72    230    52 married male         12  93.1 married       5.2e+4
##  8  35000     59    165    49 divorc… fema…        12  80.0 divorced      3.5e+4
##  9  85000     61    140    55 married fema…        18  15.7 married       8.5e+4
## 10      0     69    186    55 separa… fema…        12  88.8 Other         1  e-3
## # … with 103 more rows, and 6 more variables: height2 <dbl>, education2 <dbl>,
## #   weight2 <dbl>, log_income <dbl>, resid <dbl>, pred <dbl>

You can also plot residuals directly or compared to other predictors.

ggplot(height_resid, aes(resid)) + 
  geom_histogram() + 
  theme_bw() + 
  xlab("Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 361 rows containing non-finite values (stat_bin).

ggplot(height_resid, aes(resid, education2)) + 
  geom_point(size = 3) + 
  geom_smooth(size = 1, se = FALSE) + 
  xlab("Residuals") + 
  ylab("Education (mean centered)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 361 rows containing non-finite values (stat_smooth).
## Warning: Removed 361 rows containing missing values (geom_point).

Previous
Next