Model Introduction

This section of the notes is going to introduce you into the world of models in R. For the most part, we are going to stick with simple linear models and build up the various models using one function lm. The lm function is an extremely powerful function that can accommodate many different models in a single framework.

This section of notes is going to make use of four R packages:

library(tidyverse)
library(modelr)
# install.packages("broom")
library(broom)
library(fivethirtyeight)

Simple Linear Regression

First we need some data. We are going to explore the data from the fivethirtyeight package called fandango. Here are the first few rows:

fandango

## # A tibble: 146 × 23
##    film    year rottentomatoes rottentomatoes_… metacritic metacritic_user  imdb
##    <chr>  <dbl>          <int>            <int>      <int>           <dbl> <dbl>
##  1 Aveng…  2015             74               86         66             7.1   7.8
##  2 Cinde…  2015             85               80         67             7.5   7.1
##  3 Ant-M…  2015             80               90         64             8.1   7.8
##  4 Do Yo…  2015             18               84         22             4.7   5.4
##  5 Hot T…  2015             14               28         29             3.4   5.1
##  6 The W…  2015             63               62         50             6.8   7.2
##  7 Irrat…  2015             42               53         53             7.6   6.9
##  8 Top F…  2014             86               64         81             6.8   6.5
##  9 Shaun…  2015             99               82         81             8.8   7.4
## 10 Love …  2015             89               87         80             8.5   7.8
## # … with 136 more rows, and 16 more variables: fandango_stars <dbl>,
## #   fandango_ratingvalue <dbl>, rt_norm <dbl>, rt_user_norm <dbl>,
## #   metacritic_norm <dbl>, metacritic_user_nom <dbl>, imdb_norm <dbl>,
## #   rt_norm_round <dbl>, rt_user_norm_round <dbl>, metacritic_norm_round <dbl>,
## #   metacritic_user_norm_round <dbl>, imdb_norm_round <dbl>,
## #   metacritic_user_vote_count <int>, imdb_user_vote_count <int>,
## #   fandango_votes <int>, fandango_difference <dbl>

These data have 146 rows and 23 columns.

Suppose we were interested in exploring the relationship between ratings from rottentomatoes and metacritic. Note, we will not use the user rating for this exploration. A natural first step may be to look at a scatterplot of these data to explore the shape of the relationship.

ggplot(fandango, aes(rottentomatoes, metacritic)) + 
  theme_bw() + 
  geom_point(size = 3)

To better explore the relationship, including a smoother can be useful:

ggplot(fandango, aes(rottentomatoes, metacritic)) + 
  theme_bw() + 
  geom_point(size = 3) + 
  geom_smooth(method = 'loess', se = FALSE, size = 1.5)

## `geom_smooth()` using formula 'y ~ x'

It may also be useful to calculate a correlation coefficient between these two variables.

with(fandango, cor(rottentomatoes, metacritic))

## [1] 0.9573596

Fit Linear Regression

Now we will attempt to fit a model to these data. Namely, the relationship appears to be mostly linear and suppose we wished to predict the metacritic review score with the rotten tomatoes score. To do this, we will use the lm function and the ~ that we used with facet_wrap and facet_grid.

More concretely, suppose we wished to fit the model: \[ metacritic_{i} = b_{0} + b_{1} rottentomatoes_{i} + \epsilon_{i} \]

In this model, \(metacritic_{i}\) is the dependent or response variable and \(rottentomatoes_{i}\) is the independent, predictor, or covariate. In many traditional statistics courses, \(metacritic_{i}\) would be represented with \(Y\) and \(rottentomatoes_{i}\) would be represented with \(X\). It is often more descriptive to represent these with their variable names instead of \(Y\) or \(X\).

To fit this model, we simply need to replace the \(=\) sign found in the equation above with the ~. For example, the equation above would turn into:

meta_mod <- lm(metacritic ~ rottentomatoes, data = fandango)

To see output from the model, we can take two approaches. One is to use summary and another is to use the tidy function from the broom package. I show each below in turn.

summary(meta_mod)

## 
## Call:
## lm(formula = metacritic ~ rottentomatoes, data = fandango)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7209  -4.1999   0.3855   3.7952  14.6662 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    21.12097    1.05710   19.98   <2e-16 ***
## rottentomatoes  0.61935    0.01558   39.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.658 on 144 degrees of freedom
## Multiple R-squared:  0.9165, Adjusted R-squared:  0.916 
## F-statistic:  1581 on 1 and 144 DF,  p-value: < 2.2e-16

tidy(meta_mod)

## # A tibble: 2 × 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      21.1      1.06        20.0 2.36e-43
## 2 rottentomatoes    0.619    0.0156      39.8 1.54e-79

With the broom package, the results are reported in a tidier framework. We will see additional useful functions using the broom package later on.

You can also directly request confidence intervals with the tidy function:

tidy(meta_mod, conf.int = TRUE)

## # A tibble: 2 × 7
##   term           estimate std.error statistic  p.value conf.low conf.high
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)      21.1      1.06        20.0 2.36e-43   19.0      23.2  
## 2 rottentomatoes    0.619    0.0156      39.8 1.54e-79    0.589     0.650

Exercises

Fit a new model using the fandango data that attempts to explain the metacritic ratings with the imdb rating.
Fit another model using the fandango data that attempts to explain the metacritic ratings with the fandango_ratingvalue scores.
Exploring the predictors of these two new models with the one fitted above with the rottentomatoes scores, which rating score best helps us predict the metacritic scores?

Workings Behind `lm` function

To see what the lm function is doing behind the scenes, we will use the model_matrix function from the modelr package. For example, from the model above:

model_matrix(fandango, metacritic ~ rottentomatoes)

## # A tibble: 146 × 2
##    `(Intercept)` rottentomatoes
##            <dbl>          <dbl>
##  1             1             74
##  2             1             85
##  3             1             80
##  4             1             18
##  5             1             14
##  6             1             63
##  7             1             42
##  8             1             86
##  9             1             99
## 10             1             89
## # … with 136 more rows

This is often referred to as the design matrix in statistics text books and is one of the matrices that are used by lm to calculate the estimated parameters from above. Notice that is automatically included the intercept, normally this is of interest, if it is not, we can omit it directly by including a -1 in the formula. For example:

model_matrix(fandango, metacritic ~ rottentomatoes - 1)

## # A tibble: 146 × 1
##    rottentomatoes
##             <dbl>
##  1             74
##  2             85
##  3             80
##  4             18
##  5             14
##  6             63
##  7             42
##  8             86
##  9             99
## 10             89
## # … with 136 more rows

tidy(lm(metacritic ~ rottentomatoes - 1, data = fandango))

## # A tibble: 1 × 5
##   term           estimate std.error statistic   p.value
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
## 1 rottentomatoes    0.898    0.0134      67.3 3.14e-111

You need to be careful with this syntax as this is commonly not is what is desired when fitting a linear model.

Categorical Predictors

Suppose we were interested in the following research question:

To what extent are there average differences in movie ratings between rottentomatoes and metacritic?

To answer this research question, we would need to transform our data to great a group variable and a rating variable.

meta_rotten <- fandango %>%
  select(film, year, rottentomatoes, metacritic) %>%
  gather(group, rating, rottentomatoes, metacritic)
meta_rotten

## # A tibble: 292 × 4
##    film                     year group          rating
##    <chr>                   <dbl> <chr>           <int>
##  1 Avengers: Age of Ultron  2015 rottentomatoes     74
##  2 Cinderella               2015 rottentomatoes     85
##  3 Ant-Man                  2015 rottentomatoes     80
##  4 Do You Believe?          2015 rottentomatoes     18
##  5 Hot Tub Time Machine 2   2015 rottentomatoes     14
##  6 The Water Diviner        2015 rottentomatoes     63
##  7 Irrational Man           2015 rottentomatoes     42
##  8 Top Five                 2014 rottentomatoes     86
##  9 Shaun the Sheep Movie    2015 rottentomatoes     99
## 10 Love & Mercy             2015 rottentomatoes     89
## # … with 282 more rows

Now we can work with this data to answer the question from above. More specifically, our dependent variable will be the rating variable and the independent variable will be the group (categorical) variable. This can be fitted within a linear model as follows:

tidy(lm(rating ~ factor(group), data = meta_rotten))

## # A tibble: 2 × 5
##   term                        estimate std.error statistic  p.value
##   <chr>                          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                    58.8       2.10    28.0   2.50e-84
## 2 factor(group)rottentomatoes     2.04      2.97     0.686 4.93e- 1

To see exactly what is happening, model_matrix may be useful. First I am going to arrange the data by the films in alphabetical order.

meta_rotten %>%
  arrange(film)

## # A tibble: 292 × 4
##    film                 year group          rating
##    <chr>               <dbl> <chr>           <int>
##  1 '71                  2015 rottentomatoes     97
##  2 '71                  2015 metacritic         83
##  3 5 Flights Up         2015 rottentomatoes     52
##  4 5 Flights Up         2015 metacritic         55
##  5 A Little Chaos       2015 rottentomatoes     40
##  6 A Little Chaos       2015 metacritic         51
##  7 A Most Violent Year  2014 rottentomatoes     90
##  8 A Most Violent Year  2014 metacritic         79
##  9 About Elly           2015 rottentomatoes     97
## 10 About Elly           2015 metacritic         87
## # … with 282 more rows

meta_rotten %>%
  arrange(film) %>%
  model_matrix(rating ~ factor(group))

## # A tibble: 292 × 2
##    `(Intercept)` `factor(group)rottentomatoes`
##            <dbl>                         <dbl>
##  1             1                             1
##  2             1                             0
##  3             1                             1
##  4             1                             0
##  5             1                             1
##  6             1                             0
##  7             1                             1
##  8             1                             0
##  9             1                             1
## 10             1                             0
## # … with 282 more rows

You may be more familiar with using a t-test for this type of design. We can replicate the results above with a t-test using the t.test function.

t.test(rating ~ factor(group), data = meta_rotten, 
       var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  rating by factor(group)
## t = -0.68638, df = 290, p-value = 0.493
## alternative hypothesis: true difference in means between group metacritic and group rottentomatoes is not equal to 0
## 95 percent confidence interval:
##  -7.893918  3.811726
## sample estimates:
##     mean in group metacritic mean in group rottentomatoes 
##                     58.80822                     60.84932

Exercises

Compute descriptive means using the meta_rotten transformed data from above by the group variable. Do these means appear to be descriptively different?
How do these means relate to the parameters estimated from the model above?

Evaulating Model fit

There are many ways to evaluate model fit. Many of these are available using the summary function.

summary(lm(rating ~ factor(group), data = meta_rotten))

## 
## Call:
## lm(formula = rating ~ factor(group), data = meta_rotten)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.849 -19.089   0.671  22.161  39.151 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   58.808      2.103  27.967   <2e-16 ***
## factor(group)rottentomatoes    2.041      2.974   0.686    0.493    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.41 on 290 degrees of freedom
## Multiple R-squared:  0.001622,   Adjusted R-squared:  -0.001821 
## F-statistic: 0.4711 on 1 and 290 DF,  p-value: 0.493

The unfortunate part of this is the fact that these are more difficult to pull out of the table programmatically (i.e. in a reproducible workflow). This is where the broom package helps with the use of the glance function.

glance(lm(rating ~ factor(group), data = meta_rotten))

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1   0.00162      -0.00182  25.4     0.471   0.493     1 -1358. 2722. 2733.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

These are now in a more tidy data frame and if you have multiple models in an exploratory analysis, these could then be much easier compared and combined programmatically to come to a final model.

Another useful function from the broom package is augment. This function will add additional information to the original data such as residuals, fitted (predicted) values, and other diagnostic statistics.

diagnostic <- augment(lm(rating ~ factor(group), data = meta_rotten))
diagnostic

## # A tibble: 292 × 7
##    rating `factor(group)` .fitted    .hat .sigma   .cooksd .std.resid
##     <int> <fct>             <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
##  1     74 rottentomatoes     60.8 0.00685   25.4 0.000930      0.519 
##  2     85 rottentomatoes     60.8 0.00685   25.4 0.00314       0.954 
##  3     80 rottentomatoes     60.8 0.00685   25.4 0.00197       0.756 
##  4     18 rottentomatoes     60.8 0.00685   25.3 0.00988      -1.69  
##  5     14 rottentomatoes     60.8 0.00685   25.3 0.0118       -1.85  
##  6     63 rottentomatoes     60.8 0.00685   25.5 0.0000249     0.0849
##  7     42 rottentomatoes     60.8 0.00685   25.4 0.00191      -0.744 
##  8     86 rottentomatoes     60.8 0.00685   25.4 0.00340       0.993 
##  9     99 rottentomatoes     60.8 0.00685   25.4 0.00783       1.51  
## 10     89 rottentomatoes     60.8 0.00685   25.4 0.00426       1.11  
## # … with 282 more rows

These could then be plotted to explore more information about model fit. For example a histogram of the standardized residuals are often useful.

ggplot(diagnostic, aes(.std.resid)) + 
  geom_histogram(bins = 30, color = 'white') + 
  theme_bw()

For this model, boxplots of the standardized residuals by the two groups can also be informative:

ggplot(diagnostic, aes(`factor(group)`, .std.resid)) + 
  geom_boxplot() + 
  geom_jitter() + 
  coord_flip() +
  theme_bw()

We will explore more details on predicted or fitted values later.

Lastly, the augment function can be useful, however I personally do not like the naming convention used by the function. I want to point you to two additional functions from the modelr package that can be useful for predicted (add_predictions) and residual values (add_residuals).

For example, to add the residuals to the original data:

meta_rotten %>%
  add_residuals(lm(rating ~ factor(group), data = meta_rotten))

## # A tibble: 292 × 5
##    film                     year group          rating  resid
##    <chr>                   <dbl> <chr>           <int>  <dbl>
##  1 Avengers: Age of Ultron  2015 rottentomatoes     74  13.2 
##  2 Cinderella               2015 rottentomatoes     85  24.2 
##  3 Ant-Man                  2015 rottentomatoes     80  19.2 
##  4 Do You Believe?          2015 rottentomatoes     18 -42.8 
##  5 Hot Tub Time Machine 2   2015 rottentomatoes     14 -46.8 
##  6 The Water Diviner        2015 rottentomatoes     63   2.15
##  7 Irrational Man           2015 rottentomatoes     42 -18.8 
##  8 Top Five                 2014 rottentomatoes     86  25.2 
##  9 Shaun the Sheep Movie    2015 rottentomatoes     99  38.2 
## 10 Love & Mercy             2015 rottentomatoes     89  28.2 
## # … with 282 more rows

Exercises

Fit a new model using the fandango data that attempts to explain the metacritic ratings with the imdb rating. Explore the distribution of residuals. Does there appear to be problems with these residuals?
Using the model from #1, create a scatterplot that displays the residuals by the predictor variable. Are there problems with this plot that we should be concerned with?

Last updated on Feb 28, 2022

Model Introduction

Simple Linear Regression

Fit Linear Regression

Exercises

Workings Behind lm function

Categorical Predictors

Exercises

Evaulating Model fit

Exercises

Workings Behind `lm` function