Factors

To date I have ignored factor variables and how these are implemented in R. Much of this is due to the greater flexibility of character vectors instead of factors. Also, if using the readr or readxl packages to read in data, the variables are also read in as character strings instead of factors. However, there are situations when factors are useful. Most of these uses are for readability when creating output formats for a report or paper.

This set of notes will make use of the following three packages:

library(tidyverse)
library(forcats)
library(fivethirtyeight)

Uses for Factors

To see a few of the benefits of a factor, assume we have a variable that represents the levels of a survey question with five possible responses and we only saw three of those response categories.

resp <- c('Disagree', 'Agree', 'Neutral')

This type of variable has a natural order, namely the disagree side of the scale (i.e. strongly disagree) to the agree side of the scale (i.e. strongly agree) with neutral belonging in the middle. However, if we sort this variable, this ordering will not be taken into account with a character string.

sort(resp)

## [1] "Agree"    "Disagree" "Neutral"

Notice, these are actually in alphabetical order, likely not what we wanted. This can be fixed by defining this variable as a factor with levels of the variable specified.

scale_levels <- c('Strongly Disagree', 'Disagree', 
                  'Neutral', 'Agree', 'Strongly Agree')
resp_fact <- factor(resp, levels = scale_levels)
resp_fact

## [1] Disagree Agree    Neutral 
## Levels: Strongly Disagree Disagree Neutral Agree Strongly Agree

sort(resp_fact)

## [1] Disagree Neutral  Agree   
## Levels: Strongly Disagree Disagree Neutral Agree Strongly Agree

Another benefit, if values that are not found in the levels of the factor variable, these will be replaced with NAs. For example,

factor(c('disagree', 'Agree', 'Strongly Agree'), 
       levels = scale_levels)

## [1] <NA>           Agree          Strongly Agree
## Levels: Strongly Disagree Disagree Neutral Agree Strongly Agree

We can also explore valid levels of a variables with the levels function.

levels(resp_fact)

## [1] "Strongly Disagree" "Disagree"          "Neutral"          
## [4] "Agree"             "Strongly Agree"

Exercises

How are factors stored internally by R? To explore this, use the str function on a factor variable and see what it looks like?
To further this idea from #1, what happens when you do each of the following commands? Why is this happening?

as.numeric(resp)
as.numeric(resp_fact)

Common Factor Manipulations

In addition to setting the levels of the variable, there are two common tasks useful with factors.

Reorder factor levels for plotting or table creation
Change the levels of the factor (i.e. collapse levels)

Examples of each of these will be given with the weather_check data from the fivethirtyeight package.

weather_check

## # A tibble: 928 × 9
##    respondent_id ck_weather weather_source     weather_source_… ck_weather_watch
##            <dbl> <lgl>      <chr>              <chr>            <ord>           
##  1    3887201482 TRUE       The default weath… <NA>             Very likely     
##  2    3887159451 TRUE       The default weath… <NA>             Very likely     
##  3    3887152228 TRUE       The default weath… <NA>             Very likely     
##  4    3887145426 TRUE       The default weath… <NA>             Somewhat likely 
##  5    3887021873 TRUE       A specific websit… Iphone app       Very likely     
##  6    3886937140 TRUE       A specific websit… AccuWeather App  Somewhat likely 
##  7    3886923931 TRUE       The Weather Chann… <NA>             Very unlikely   
##  8    3886913587 TRUE       <NA>               <NA>             <NA>            
##  9    3886889048 TRUE       The Weather Chann… <NA>             Very likely     
## 10    3886848806 TRUE       The default weath… <NA>             Very likely     
## # … with 918 more rows, and 4 more variables: age <fct>, female <lgl>,
## #   hhold_income <ord>, region <chr>

Reorder Factor Variables

To show examples of this operation, suppose we calculated the proportion of respondents that checked the weather daily by region of the country. We could use dplyr for this:

prop_check_weather <- weather_check %>%
  group_by(region) %>%
  summarise(prop = mean(ck_weather))
prop_check_weather

## # A tibble: 10 × 2
##    region              prop
##    <chr>              <dbl>
##  1 East North Central 0.858
##  2 East South Central 0.927
##  3 Middle Atlantic    0.885
##  4 Mountain           0.792
##  5 New England        0.942
##  6 Pacific            0.697
##  7 South Atlantic     0.740
##  8 West North Central 0.815
##  9 West South Central 0.904
## 10 <NA>               0.548

This would be a bit easier to view if we plotted this data:

ggplot(prop_check_weather, aes(prop, region)) + 
  geom_point()

This plot is difficult to read, primarily due to the way the points are ordered. Showing the regions in alphabetical order makes it more difficult to discern the trend. Instead, we would likely wish to reorder this variable by the ascending order of the proportion that check the weather. We will use the fct_reorder function from the forcats package. Note, I also omit the NA category here.

ggplot(na.omit(prop_check_weather), 
       aes(prop, fct_reorder(region, prop))) + 
  geom_point()

Need to be a bit careful with this operation however. For example:

weather_check %>%
  group_by(hhold_income) %>%
  summarise(prop = mean(ck_weather)) %>%
  na.omit() %>%
  ggplot(aes(prop, fct_reorder(hhold_income, prop))) + 
  geom_point()

Instead, this is the proper way to show this relationship:

weather_check %>%
  group_by(hhold_income) %>%
  summarise(prop = mean(ck_weather)) %>%
  na.omit() %>%
  ggplot(aes(prop, hhold_income)) + 
  geom_point()

Exercises

Using data from the fivethirtyeight package called flying, explore the proportion of respondents that believe the reclining the chair while flying should be eliminated (the variable is recline_eliminate).
Do these proportions differ by the location?
Create a graphic that captures this relationship, you may wish to reorder the columns to more appropriately represent the relationship.

Rename Factor Levels

These operations are useful to collapse categories or rename levels for publication. The primary function we will use for this operation is fct_recode from the forcats package.

Again, using the weather_check data, suppose we wished to change the levels of the age variable. The levels currently are:

levels(weather_check$age)

## [1] "18 - 29" "30 - 44" "45 - 59" "60+"

Suppose we wished to better represent these as words. We can use this with mutate from dplyr combined with fct_recode:

weather_check %>%
  mutate(age_recode = fct_recode(age,
      '18 to 29' = '18 - 29',
      '30 to 44' = '30 - 44',
      '45 to 59' = '45 - 59'
  )) %>%
  count(age_recode)

## # A tibble: 5 × 2
##   age_recode     n
##   <fct>      <int>
## 1 18 to 29     176
## 2 30 to 44     204
## 3 45 to 59     278
## 4 60+          258
## 5 <NA>          12

We could also collapse categories by assigning many levels to the same new level. For example, suppose we wished to collapse the ck_weather_watch variable to unlikely and likely instead of the very unlikely to very likely.

levels(weather_check$ck_weather_watch)

## [1] "Very unlikely"     "Somewhat unlikely" "Somewhat likely"  
## [4] "Very likely"

weather_check %>%
  mutate(watch_recode = fct_recode(ck_weather_watch,
         'Unlikely' = 'Very unlikely',
         'Unlikely' = 'Somewhat unlikely',
         'Likely' = 'Somewhat likely',
         'Likely' = 'Very likely'
  )) %>%
  count(watch_recode)

## # A tibble: 3 × 2
##   watch_recode     n
##   <ord>        <int>
## 1 Unlikely       281
## 2 Likely         636
## 3 <NA>            11

Finally, one last option that may be useful is to lump together categories that are too small to report independently. This functionality is implemented with the function fct_lump. For example, suppose we want to lump the region variable together to have only 5 regions.

weather_check %>%
  mutate(region = fct_lump(region, n = 5)) %>%
  count(region, sort = TRUE)

## # A tibble: 7 × 2
##   region                 n
##   <fct>              <int>
## 1 Other                219
## 2 Pacific              185
## 3 South Atlantic       154
## 4 East North Central   141
## 5 Middle Atlantic      104
## 6 West South Central    94
## 7 <NA>                  31

Exercises

Again, using the flying data from the fivethirtyeight package, is there a relationship between the proportion of respondents who have a children under 18 years old and if they believe it is rude to bring a baby on a plane? For this question, collapse the baby variable to two levels, no and yes.

Last updated on Feb 26, 2021