Building Upon Linear Models

The last week has focused on building simple linear models with a single predictor. This week will evaluate these models and build them up with more complexity. Particularly, this week will focus on ways to build models with predictors that have more than two categories, alternative ways to code categorical predictors, mixing categorical and quantitative variables, and interactions.

This section of notes will use the following packages.

library(tidyverse)
library(modelr)
library(broom)
library(fivethirtyeight)
library(forcats)

More than two categorical levels

Last week we explored a linear model framework for a two sample t-test (and the homework has you explore fitting a one-sample t-test in a linear model framework). I now want to generalize this idea to more than two categorical levels. It is traditional to think about these types of models as analysis of variance (ANOVA) models, however, the same model can be fitted in a linear model framework as well.

For this set of notes, we are going to make use of the gss_cat data found in the forcats package. Below are the first few rows of the data:

gss_cat
## # A tibble: 21,483 × 9
##     year marital         age race  rincome        partyid    relig denom tvhours
##    <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
##  1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
##  2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
##  3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
##  4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
##  5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
##  6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
##  7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
##  8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
##  9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
## 10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
## # … with 21,473 more rows

Suppose we were interested in exploring the relationship between the marital status of an individual and how much tv they watch. For example, perhaps married couples watch more tv compared to those that are single or never married. To get an idea of the categories in the marital variable, we could use the count function within dplyr.

gss_cat %>%
  count(marital)
## # A tibble: 6 × 2
##   marital           n
##   <fct>         <int>
## 1 No answer        17
## 2 Never married  5416
## 3 Separated       743
## 4 Divorced       3383
## 5 Widowed        1807
## 6 Married       10117

You’ll notice that there a few responses of “No Answer” and we may wish to treat these as missing values. This can be done with the fct_recode function as follows:

gss_cat <- gss_cat %>%
  mutate(marital_miss = fct_recode(marital,
    NULL = 'No answer'
  ))
gss_cat %>%
  count(marital_miss)
## # A tibble: 6 × 2
##   marital_miss      n
##   <fct>         <int>
## 1 Never married  5416
## 2 Separated       743
## 3 Divorced       3383
## 4 Widowed        1807
## 5 Married       10117
## 6 <NA>             17

We can now fit the model to this new data using the lm function.

anova_mod <- lm(tvhours ~ marital_miss, data = gss_cat)
summary(anova_mod)
## 
## Call:
## lm(formula = tvhours ~ marital_miss, data = gss_cat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9120 -1.6504 -0.6504  0.8948 21.3496 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.10518    0.04679  66.366  < 2e-16 ***
## marital_missSeparated  0.44444    0.13738   3.235  0.00122 ** 
## marital_missDivorced  -0.01977    0.07680  -0.257  0.79687    
## marital_missWidowed    0.80682    0.09352   8.627  < 2e-16 ***
## marital_missMarried   -0.45475    0.05879  -7.735 1.13e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.561 on 11323 degrees of freedom
##   (10155 observations deleted due to missingness)
## Multiple R-squared:  0.02141,    Adjusted R-squared:  0.02107 
## F-statistic: 61.94 on 4 and 11323 DF,  p-value: < 2.2e-16

To explore what the lm function is doing internally, the design matrix is a natural way to do this.

model_matrix(gss_cat, tvhours ~ marital_miss)
## # A tibble: 11,328 × 5
##    `(Intercept)` marital_missSeparated marital_missDivorced marital_missWidowed
##            <dbl>                 <dbl>                <dbl>               <dbl>
##  1             1                     0                    0                   0
##  2             1                     0                    0                   1
##  3             1                     0                    0                   0
##  4             1                     0                    1                   0
##  5             1                     0                    0                   0
##  6             1                     0                    0                   0
##  7             1                     0                    0                   0
##  8             1                     0                    0                   0
##  9             1                     0                    0                   0
## 10             1                     0                    1                   0
## # … with 11,318 more rows, and 1 more variable: marital_missMarried <dbl>

Writing out this model with equations, the model looks like this: \[ tvhours_{i} = \beta_{0} + \beta_{1} Separated_{i} + \beta_{2} Divorced_{I} + \beta_{3} Widowed_{i} + \beta_{4} Married_{i} + \epsilon_{i} \]

If you are more familiar with ANOVA terminology, you can get an ANOVA table using the anova function on the model object.

anova(anova_mod)
## Analysis of Variance Table
## 
## Response: tvhours
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## marital_miss     4   1624  406.12  61.941 < 2.2e-16 ***
## Residuals    11323  74239    6.56                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here you’ll notice that the F statistic is the same from the lm and anova functions showing that these are equivalent model calls.

Exercises

  1. Using the martial_miss variable created above, what are the sample means of the five groups?
  2. How do these sample means relate back to the parameters estimates shown above?
  3. How could you visualize these models results? Attempt to create a visualization that captures the model results above.

Adjusting the reference group

It is often of interest to adjust the reference group to make the intercept represent a specific group of interest. There are two approaches to take for this approach. The first I will show is using the forcats package to change the order of the levels of the variable.

Suppose for example, we wish to make the widowed category the reference group. This is the job of fct_relevel from the forcats package.

gss_cat <- gss_cat %>%
  mutate(marital_m_widow = fct_relevel(
    marital_miss,
    'Widowed'
  ))
levels(gss_cat$marital_miss)
## [1] "Never married" "Separated"     "Divorced"      "Widowed"      
## [5] "Married"
levels(gss_cat$marital_m_widow)
## [1] "Widowed"       "Never married" "Separated"     "Divorced"     
## [5] "Married"

You’ll notice that in the new variable, the widowed category was moved to the beginning and the remaining order was not changed. We can now fit a new model with this newly releveled factor variable.

summary(lm(tvhours ~ marital_m_widow, data = gss_cat))
## 
## Call:
## lm(formula = tvhours ~ marital_m_widow, data = gss_cat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9120 -1.6504 -0.6504  0.8948 21.3496 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   3.91200    0.08097  48.313  < 2e-16 ***
## marital_m_widowNever married -0.80682    0.09352  -8.627  < 2e-16 ***
## marital_m_widowSeparated     -0.36238    0.15245  -2.377   0.0175 *  
## marital_m_widowDivorced      -0.82659    0.10132  -8.159 3.75e-16 ***
## marital_m_widowMarried       -1.26157    0.08845 -14.262  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.561 on 11323 degrees of freedom
##   (10155 observations deleted due to missingness)
## Multiple R-squared:  0.02141,    Adjusted R-squared:  0.02107 
## F-statistic: 61.94 on 4 and 11323 DF,  p-value: < 2.2e-16

The second approach to modifying which group represents the reference group would be to create the indicator (dummy) variables manually. The logic follows from the design matrix above, namely that each variable should have a value of 1 if the marital status equals a specific category or 0 otherwise. For example, this can be created as follows:

gss_cat <- gss_cat %>%
  mutate(
    separated = ifelse(marital_miss == 'Separated', 1, 0),
    never_married = ifelse(marital_miss == 'Never married', 1, 0),
    divorced = ifelse(marital_miss == 'Divorced', 1, 0),
    married = ifelse(marital_miss == 'Married', 1, 0)
  )
summary(lm(tvhours ~ separated + never_married + divorced + married, 
           data = gss_cat))
## 
## Call:
## lm(formula = tvhours ~ separated + never_married + divorced + 
##     married, data = gss_cat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9120 -1.6504 -0.6504  0.8948 21.3496 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.91200    0.08097  48.313  < 2e-16 ***
## separated     -0.36238    0.15245  -2.377   0.0175 *  
## never_married -0.80682    0.09352  -8.627  < 2e-16 ***
## divorced      -0.82659    0.10132  -8.159 3.75e-16 ***
## married       -1.26157    0.08845 -14.262  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.561 on 11323 degrees of freedom
##   (10155 observations deleted due to missingness)
## Multiple R-squared:  0.02141,    Adjusted R-squared:  0.02107 
## F-statistic: 61.94 on 4 and 11323 DF,  p-value: < 2.2e-16

Manually creating the variables has a few advantages, namely that there is a bit more flexibility on how the variables are created, but both approaches lead to the same model.

Exercises

  1. Combine the ‘Never married’ and ‘Divorced’ categories into one category.
  2. Fit a new model that combines these two categories. Does the model fit differ from the models shown above? Is this surprising?

Post Hoc Tests

From the models fitted above, it may be of interest to conduct post hoc tests that compare all pairwise mean differences, particularly as the tests above are all compared to the reference group. This approach will be explored using the multcomp package and with defining linear contrasts.

# install.packages("multcomp")
library(multcomp)

We first need to define linear contrasts based on the levels of the factor variable. For example, using the following model:

gss_cat <- gss_cat %>%
  mutate(marital_m_widow = fct_recode(marital_m_widow,
           "Never_married" = "Never married"                           
  ))
anova_mod <- lm(tvhours ~ marital_m_widow, data = gss_cat)
levels(gss_cat$marital_m_widow)
## [1] "Widowed"       "Never_married" "Separated"     "Divorced"     
## [5] "Married"

We will use these level values to create linear contrasts that test all pairwise categories.

my_contrasts <- c("Widowed - Never_married = 0",
                 "Widowed - Separated = 0",
                 "Widowed - Divorced = 0",
                 "Widowed - Married = 0",
                 "Never_married - Separated = 0",
                 "Never_married - Divorced = 0",
                 "Never_married - Married = 0",
                 "Separated - Divorced = 0",
                 "Separated - Married = 0",
                 "Divorced - Married = 0")
contr_results <- glht(anova_mod, 
                      linfct = mcp(marital_m_widow = my_contrasts))
summary(contr_results)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: lm(formula = tvhours ~ marital_m_widow, data = gss_cat)
## 
## Linear Hypotheses:
##                                Estimate Std. Error t value Pr(>|t|)    
## Widowed - Never_married == 0    0.80682    0.09352   8.627  < 0.001 ***
## Widowed - Separated == 0        0.36238    0.15245   2.377  0.11076    
## Widowed - Divorced == 0         0.82659    0.10132   8.159  < 0.001 ***
## Widowed - Married == 0          1.26157    0.08845  14.262  < 0.001 ***
## Never_married - Separated == 0 -0.44444    0.13738  -3.235  0.00957 ** 
## Never_married - Divorced == 0   0.01977    0.07680   0.257  0.99893    
## Never_married - Married == 0    0.45475    0.05879   7.735  < 0.001 ***
## Separated - Divorced == 0       0.46421    0.14280   3.251  0.00904 ** 
## Separated - Married == 0        0.89919    0.13398   6.711  < 0.001 ***
## Divorced - Married == 0         0.43498    0.07054   6.166  < 0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

You can also specify different adjustment methods, such as the Benjamin-Hochberg method.

summary(contr_results, test = adjusted("BH"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: lm(formula = tvhours ~ marital_m_widow, data = gss_cat)
## 
## Linear Hypotheses:
##                                Estimate Std. Error t value Pr(>|t|)    
## Widowed - Never_married == 0    0.80682    0.09352   8.627  < 2e-16 ***
## Widowed - Separated == 0        0.36238    0.15245   2.377  0.01941 *  
## Widowed - Divorced == 0         0.82659    0.10132   8.159 1.48e-15 ***
## Widowed - Married == 0          1.26157    0.08845  14.262  < 2e-16 ***
## Never_married - Separated == 0 -0.44444    0.13738  -3.235  0.00152 ** 
## Never_married - Divorced == 0   0.01977    0.07680   0.257  0.79687    
## Never_married - Married == 0    0.45475    0.05879   7.735 2.78e-14 ***
## Separated - Divorced == 0       0.46421    0.14280   3.251  0.00152 ** 
## Separated - Married == 0        0.89919    0.13398   6.711 4.04e-11 ***
## Divorced - Married == 0         0.43498    0.07054   6.166 1.20e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- BH method)

Although defining the linear contrasts manually is more flexible, for simple models, the multiple comparisons can be generated a bit more simply.

summary(glht(anova_mod, linfct = mcp(marital_m_widow = "Tukey")), 
        test = adjusted("BH"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = tvhours ~ marital_m_widow, data = gss_cat)
## 
## Linear Hypotheses:
##                                Estimate Std. Error t value Pr(>|t|)    
## Never_married - Widowed == 0   -0.80682    0.09352  -8.627  < 2e-16 ***
## Separated - Widowed == 0       -0.36238    0.15245  -2.377  0.01941 *  
## Divorced - Widowed == 0        -0.82659    0.10132  -8.159 1.48e-15 ***
## Married - Widowed == 0         -1.26157    0.08845 -14.262  < 2e-16 ***
## Separated - Never_married == 0  0.44444    0.13738   3.235  0.00152 ** 
## Divorced - Never_married == 0  -0.01977    0.07680  -0.257  0.79687    
## Married - Never_married == 0   -0.45475    0.05879  -7.735 2.78e-14 ***
## Divorced - Separated == 0      -0.46421    0.14280  -3.251  0.00152 ** 
## Married - Separated == 0       -0.89919    0.13398  -6.711 4.04e-11 ***
## Married - Divorced == 0        -0.43498    0.07054  -6.166 1.20e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- BH method)

You can also generate simultaneous confidence intervals:

ci <- confint(summary(glht(anova_mod, linfct = mcp(marital_m_widow = "Tukey")),
                test = adjusted("BH")))
ci
## 
##   Simultaneous Confidence Intervals
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = tvhours ~ marital_m_widow, data = gss_cat)
## 
## Quantile = 2.6895
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                                Estimate lwr      upr     
## Never_married - Widowed == 0   -0.80682 -1.05834 -0.55531
## Separated - Widowed == 0       -0.36238 -0.77239  0.04762
## Divorced - Widowed == 0        -0.82659 -1.09908 -0.55410
## Married - Widowed == 0         -1.26157 -1.49947 -1.02367
## Separated - Never_married == 0  0.44444  0.07497  0.81392
## Divorced - Never_married == 0  -0.01977 -0.22631  0.18678
## Married - Never_married == 0   -0.45475 -0.61288 -0.29662
## Divorced - Separated == 0      -0.46421 -0.84827 -0.08015
## Married - Separated == 0       -0.89919 -1.25954 -0.53885
## Married - Divorced == 0        -0.43498 -0.62471 -0.24526

These could then be visulized directly.

SCI = data.frame(
  Contrast = 1:nrow(ci$confint),   #Contrast number
  MD = ci$confint[, 1],      #Mean difference
  LL = ci$confint[, 2],      #Lower limit
  UL = ci$confint[, 3],      #Upper limit
  Sig = c("Yes", "No", "Yes", "Yes", 'Yes', 'No', 
          'Yes', 'Yes', 'Yes', 'Yes'),   #Statistically reliable?
  Alpha = c(1, .75, 1, 1, 1, .75, 1, 1, 1, 1),     #Transparency value
  Names = rownames(ci$confint)  # contrast label
)

# Plot of the simultaneous intervals
library(ggplot2)
ggplot(data = SCI, aes(x = Contrast, y = MD, color = Sig)) +
  geom_point(size = 4) +
  geom_segment(aes(x = Contrast, xend = Contrast, y = LL, 
                   yend = UL, alpha = Alpha), lwd = 1.5) +
  geom_hline(yintercept = 0, lty = "dotted") +
  scale_color_manual(values = c("Black", "Gold")) +
  scale_x_continuous(
    name = "",
    breaks = 1:10,
    labels = SCI$Names
  ) +
  ylab("Mean Difference") +
  coord_flip() +
  theme_bw() +
  theme(
    legend.position = "none",
    panel.grid.minor = element_blank(), 
    panel.grid.major = element_blank() 
  )

Exercises

  1. Fit a model that explores mean differences in tvhours by the party affiliation (partyid variable). Do the means differ?
  2. Using the post-hoc tests, run post-hoc tests to test all pairwise differences.
Previous
Next