Exploratory Data Analysis

Exploratory data analysis (EDA) is an important first step in understanding your data. Unlike inferential statistics, EDA does not suffer from the problems associated with fitting multiple (correlated) models to the same data. Instead, it is a great way to visualize, summarize, and prod your data without any consequences.

For this set of notes, we are going to use the following packages:

library(nycflights13)
library(tidyverse)

We will use a few different data sets, but the one we are going to start with is the flights data from the nycflights13 package. Below are the first 10 rows of the data.

flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

This data contains information on all flights that departed from the three NYC airports in 2013. As you can see, the data has a total of 336,776 rows and 19 columns. For additional information, use ?flights.
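
Another quick way to inspect the structure of the data is the glimpse function from dplyr (loaded with the tidyverse), which prints every variable, its type, and the first few values:

glimpse(flights)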

Exploratory Data Analysis

The general process for exploratory data analysis, as summarized in the R for Data Science text, is:

  1. Ask questions about your data.
  2. Search for answers.
  3. Refine or ask new questions about the data.

There are no bad questions when performing EDA, but some common topics worth exploring are:

  • Missing Data
  • Variation
  • Covariation
  • Rare cases
  • Common cases
  • Distributions

Missing Data

A first step is to explore whether there are any missing data present (there likely are). This is not an easy step, but determining the amount of missing data and, if possible, why those values are missing are important first steps. One quick way to view this information for the entire data is the summary command. An example with the flights data is given below.

summary(flights)
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                  
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00  
## 

This summary can be a bit difficult to digest at first, but it gives useful insight into each variable, including the amount of missing data for each variable (the NA's rows).
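
If you want just the missing data counts, one option is to count the NA values in every column directly. Below is a minimal sketch using dplyr's across function (available in dplyr 1.0 or later):

flights %>%
  summarize(across(everything(), ~ sum(is.na(.x))))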

You can dive into the specific rows with missing values using the filter command and the is.na function. An example pulling out rows with a missing dep_time value is shown below:

filter(flights, is.na(dep_time))
## # A tibble: 8,255 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1       NA           1630        NA       NA           1815
##  2  2013     1     1       NA           1935        NA       NA           2240
##  3  2013     1     1       NA           1500        NA       NA           1825
##  4  2013     1     1       NA            600        NA       NA            901
##  5  2013     1     2       NA           1540        NA       NA           1747
##  6  2013     1     2       NA           1620        NA       NA           1746
##  7  2013     1     2       NA           1355        NA       NA           1459
##  8  2013     1     2       NA           1420        NA       NA           1644
##  9  2013     1     2       NA           1321        NA       NA           1536
## 10  2013     1     2       NA           1545        NA       NA           1910
## # … with 8,245 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We can also build more complex operations to look at data that is missing for one variable, but not another. For instance, if you look at the summary information above, you may notice there are more missing values for the arr_delay variable than for the arr_time variable. To look at these rows, you can use the following command:

filter(flights, is.na(arr_delay) & !is.na(arr_time))
## # A tibble: 717 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     1525           1530        -5     1934           1805
##  2  2013     1     1     1528           1459        29     2002           1647
##  3  2013     1     1     1740           1745        -5     2158           2020
##  4  2013     1     1     1807           1738        29     2251           2103
##  5  2013     1     1     1939           1840        59       29           2151
##  6  2013     1     1     1952           1930        22     2358           2207
##  7  2013     1     2      905            822        43     1313           1045
##  8  2013     1     2     1125            925       120     1445           1146
##  9  2013     1     2     1848           1840         8     2333           2151
## 10  2013     1     2     1849           1724        85     2235           1938
## # … with 707 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

These may actually be data errors: each row has an arr_time value and a sched_arr_time value, but no arr_delay value. The delay could therefore be calculated manually to reduce the number of missing values for this variable.
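
A sketch of that manual calculation is below. Note that arr_time and sched_arr_time are stored as integers in HHMM form, so they are converted to minutes since midnight before subtracting; the arr_min, sched_arr_min, and arr_delay_manual names are just illustrative, and flights that land after midnight would need an additional adjustment.

flights %>%
  filter(is.na(arr_delay) & !is.na(arr_time)) %>%
  mutate(
    # convert HHMM integers to minutes since midnight
    arr_min = (arr_time %/% 100) * 60 + arr_time %% 100,
    sched_arr_min = (sched_arr_time %/% 100) * 60 + sched_arr_time %% 100,
    # rough delay in minutes; ignores flights crossing midnight
    arr_delay_manual = arr_min - sched_arr_min
  ) %>%
  select(arr_time, sched_arr_time, arr_delay_manual)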

Viewing missing data graphically

It can also be useful to view missing data graphically, for example to see whether there are trends in the data related to the missing values. A few approaches are shown below.

First, it is a good rule of thumb to explore missing data in relation to the other variables in the data. If there is evidence that another variable influences whether a value is missing, additional statistical controls may be needed to adjust for this.

For example, we could explore whether the scheduled arrival time is related to whether the actual arrival time is missing.

flights %>%
  mutate(
    miss_arrival = is.na(arr_time)
  ) %>%
  ggplot(mapping = aes(sched_arr_time)) + 
  geom_freqpoly(aes(color = miss_arrival)) + 
  theme_bw()

Notice that the count metric masks much of what is being visualized here, as there are many more flights that arrived than flights with missing arrival times. To adjust for this, we simply change the y-axis from counts to density.

flights %>%
  mutate(
    miss_arrival = is.na(arr_time)
  ) %>%
  ggplot(mapping = aes(x = sched_arr_time, y = ..density..)) + 
  geom_freqpoly(aes(color = miss_arrival)) +
  theme_bw()

The two curves are now standardized so that the area under each curve equals 1, which in turn makes comparison between the two groups easier.

One other special note: for internal EDA, there is no need to spend much time worrying about the formatting of the graphics. However, if the plot above were to be included in a report or manuscript, it would need additional polish, as sketched below.
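
For example, cleaning up the axis and legend labels with the labs function would be a reasonable first step (the label text here is just a suggestion):

flights %>%
  mutate(
    miss_arrival = is.na(arr_time)
  ) %>%
  ggplot(mapping = aes(x = sched_arr_time, y = ..density..)) + 
  geom_freqpoly(aes(color = miss_arrival)) + 
  theme_bw() + 
  labs(
    x = "Scheduled arrival time",
    y = "Density",
    color = "Arrival time missing"
  )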

Variation

Another common EDA question relates to variation. Variation is fundamental to statistics; without variation, there is no need to do statistics. The best way to explore the variation of any variable is through visualization. This section is broken into two subsections, one exploring qualitative variables and another exploring quantitative variables.

Qualitative Variables

Bar graphs (frequency tables) are commonly used to explore variation in qualitative variables. For example, if we wished to explore the number of flights that took off for each month of the year from NYC:

ggplot(flights, aes(factor(month))) + 
  geom_bar() + 
  theme_bw()

One special note about the above code: I used the factor() function so that ggplot adds every value of the variable to the x-axis. Because the month variable is stored as an integer (a number), ggplot would not show all twelve values by default.

These counts can also be calculated directly with dplyr using the count function, which creates a frequency table. More complex tables can be created by passing additional variables to count, as shown after the single-variable example below.

flights %>%
  count(month)
## # A tibble: 12 × 2
##    month     n
##    <int> <int>
##  1     1 27004
##  2     2 24951
##  3     3 28834
##  4     4 28330
##  5     5 28796
##  6     6 28243
##  7     7 29425
##  8     8 29327
##  9     9 27574
## 10    10 28889
## 11    11 27268
## 12    12 28135
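
For example, passing both origin and month to count gives the number of flights for every combination of airport and month:

flights %>%
  count(origin, month)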

Exercises

  1. Copy the code from the bar graph above, but try it without wrapping the month variable in factor. What is different? Extra: using the scale_x_continuous function, can you manually add each of the 12 numeric month values to the plot?
  2. Using dplyr, manually calculate the number of flights that took off for every day of every month. In other words, how many flights took off on each day of the year? Which day had the most flights?

Quantitative Variables

Histograms, frequency polygons, and density curves are three common options for exploring variation in quantitative variables. Within the flights data, suppose we were interested in the variation in distance traveled; this can easily be explored with a histogram.

ggplot(flights, aes(distance)) + 
  geom_histogram() + 
  theme_bw()

We could also use a frequency polygon:

ggplot(flights, aes(distance)) + 
  geom_freqpoly() + 
  theme_bw()

We could also use a density curve:

ggplot(flights, aes(distance)) + 
  geom_density() + 
  theme_bw()

When exploring the variation for a single variable overall, I tend to use histograms. However, when attempting to see if the variation changes across values of a categorical variable, histograms are difficult to read because the groups likely overlap. These are the instances when frequency polygons or density curves are useful. Here are examples of both, exploring differences in variation by month.

ggplot(flights, aes(distance)) + 
  geom_freqpoly(aes(color = factor(month))) + 
  theme_bw()

ggplot(flights, aes(distance)) + 
  geom_density(aes(color = factor(month))) + 
  theme_bw()

You can also calculate the counts plotted in the histograms and frequency polygons using count, just as with qualitative variables. We now need the cut_width function to specify the bins.

flights %>%
  count(cut_width(distance, 100))
## # A tibble: 28 × 2
##    `cut_width(distance, 100)`     n
##    <fct>                      <int>
##  1 [-50,50]                       1
##  2 (50,150]                    2514
##  3 (150,250]                  36839
##  4 (250,350]                  18442
##  5 (350,450]                  15233
##  6 (450,550]                  29149
##  7 (550,650]                  11688
##  8 (650,750]                  33482
##  9 (750,850]                  19482
## 10 (850,950]                  19644
## # … with 18 more rows

Note, these counts may differ from those above, as the binwidth was not explicitly set when creating the histogram or frequency polygon.
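
If you want the histogram to line up with the cut_width table, you could set the binwidth and bin centering explicitly. A minimal sketch; center = 0 places the bins at multiples of 100, matching cut_width's default bins above:

ggplot(flights, aes(distance)) + 
  geom_histogram(binwidth = 100, center = 0) + 
  theme_bw()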

It may also be useful to calculate the variance, standard deviation, or the range. These can be calculated using the summarize function.

flights %>%
  summarize(
    var_dist = var(distance, na.rm = TRUE),
    sd_dist = sd(distance, na.rm = TRUE),
    min_dist = min(distance, na.rm = TRUE),
    max_dist = max(distance, na.rm = TRUE)
  )
## # A tibble: 1 × 4
##   var_dist sd_dist min_dist max_dist
##      <dbl>   <dbl>    <dbl>    <dbl>
## 1  537631.    733.       17     4983

You could pair this with the group_by function to calculate these values for different groups (e.g. by month).
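
For example, a minimal sketch of the spread of distance by month:

flights %>%
  group_by(month) %>%
  summarize(
    var_dist = var(distance, na.rm = TRUE),
    sd_dist = sd(distance, na.rm = TRUE)
  )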

Exercises

  1. Explore variation in the air_time variable.
  2. Does the variation in the air_time variable differ by month?

Distributions

Exploring the distribution of a variable is a very similar process to exploring its variation; the question is just different. Most often we are interested in whether the shape of the distribution is approximately normal. This will become more interesting when we start fitting models and explore potential assumption violations in the residuals, so we leave most of that discussion until then.
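
As a small preview, a quantile-quantile plot is one common way to compare a variable's distribution to a normal distribution. A minimal sketch for arr_delay, assuming ggplot2 3.0 or later for geom_qq_line (the missing values are dropped with a warning):

ggplot(flights, aes(sample = arr_delay)) + 
  geom_qq() + 
  geom_qq_line() + 
  theme_bw()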

Covariation

Covariation refers to how two (or more) variables are related. The most common method for exploring covariation is through scatterplots. However, these are most natural for two continuous variables. Other plots are useful for a mixture of variable types or for two qualitative variables. We will explore each in turn.

Two Qualitative Variables

Covariation between two qualitative variables is more difficult to view visually because each variable takes only a restricted set of values. Suppose, for example, we wished to explore covariation between the origin of the flight and the carrier.

ggplot(flights, aes(origin, carrier)) + 
  geom_count()

This plot is okay; however, I think a more useful option is a tile plot, which explores these differences with color.

flights %>%
  count(origin, carrier) %>%
  ggplot(aes(origin, carrier)) + 
  geom_tile(aes(fill = n)) + 
  theme_bw()

Note that holes indicate missing combinations (i.e., no flights by that carrier from that airport).

Exercises

  1. Explore the covariation between the month and day variables. Note, these are treated as continuous in the data, but in reality they are likely best represented as qualitative.

Two Quantitative Variables

Scatterplots are useful for two quantitative variables. Suppose for example that we wish to explore the relationship between the air_time variable and the arr_delay variable. This could be done with a scatterplot.

ggplot(flights, aes(air_time, arr_delay)) + 
  geom_point() + 
  theme_bw()
## Warning: Removed 9430 rows containing missing values (geom_point).

One problem with the plot above is overplotting. There are a few fixes for this; one is to make the points transparent using the alpha argument to the geom_point function.

ggplot(flights, aes(air_time, arr_delay)) + 
  geom_point(alpha = .05) + 
  theme_bw()
## Warning: Removed 9430 rows containing missing values (geom_point).

Another approach that will simplify the exploration is to use boxplots. This will involve grouping the air_time variable into “bins.”

ggplot(flights, aes(x = air_time, y = arr_delay)) + 
  geom_boxplot(aes(group = cut_width(air_time, 50))) + 
  theme_bw()
## Warning: Removed 9430 rows containing missing values (stat_boxplot).

A more sophisticated graphic is the quantitative analogue of geom_tile. Note, the code below requires the hexbin package; a similar function that does not is geom_bin2d.

# install.packages("hexbin")
library(hexbin)
ggplot(flights, aes(x = air_time, y = arr_delay)) + 
  geom_hex()
## Warning: Removed 9430 rows containing non-finite values (stat_binhex).

Adding a Third Variable

Adding a third variable is often useful, but can be difficult to think about procedurally. The plot that is useful depends on the type of the third variable. If the third variable is also quantitative, the visualization is more difficult; if it is qualitative, there are two main options. These are explored in more detail below.

The two main approaches for adding a qualitative third variable are to use a different color/shape for each of its values or to facet the plot. Suppose we wished to explore the covariation between three variables: air_time, arr_delay, and origin. The two options are shown below.

ggplot(flights, aes(air_time, arr_delay)) + 
  geom_hex() + 
  facet_grid(. ~ origin) + 
  theme_bw()
## Warning: Removed 9430 rows containing non-finite values (stat_binhex).

ggplot(flights, aes(air_time, arr_delay)) + 
  geom_point(aes(color = origin), alpha = .05) + 
  theme_bw()
## Warning: Removed 9430 rows containing missing values (geom_point).

Plotting three quantitative variables commonly involves binning one of the variables to turn it into an ordinal variable with a few levels. Similar to the boxplot example with two variables above, this binned variable can then be used for faceting. Below is a simple example:

ggplot(flights, aes(x = air_time, y = arr_delay)) + 
  geom_hex() + 
  theme_bw() + 
  facet_wrap(~ cut_width(dep_time, 250))
## Warning: Removed 9430 rows containing non-finite values (stat_binhex).

One Quantitative, One Qualitative

This case was already covered in the section on variation, where we explored differences in variation across levels of a qualitative variable (see above). If the variation differs across groups, there is evidence of covariation.
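
For example, a minimal sketch using boxplots to compare arrival delays across the three origin airports:

ggplot(flights, aes(x = origin, y = arr_delay)) + 
  geom_boxplot() + 
  theme_bw()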

Correlations

It is also useful to calculate and visualize raw correlations. For quantitative variables, the cor function calculates these directly.

flights %>%
  select(air_time, arr_delay, dep_time) %>%
  cor(use = 'pairwise.complete.obs')
##              air_time   arr_delay    dep_time
## air_time   1.00000000 -0.03529709 -0.01461948
## arr_delay -0.03529709  1.00000000  0.23230573
## dep_time  -0.01461948  0.23230573  1.00000000

To visualize a correlation matrix, the GGally package is useful. Note, the ggpairs function can take some time to run.

# install.packages("GGally")
library(GGally)
flights %>%
  select(air_time, arr_delay, dep_time) %>%
  na.omit() %>%
  sample_n(1000) %>%
  ggpairs()

Exercises

  1. Explore covariation in the dep_delay and arr_delay variables. What type of relationship, if any, appears to be present?
  2. Explore the relationship between dep_delay, arr_delay, and origin. What type of relationship is present? Does the relationship between dep_delay and arr_delay differ by origin?
  3. Finally, calculate the correlation matrix for dep_delay, arr_delay, and dep_time.

Rare/Common Cases

The last question worth exploring when performing EDA is the presence of rare or common cases: in other words, any outliers and the central tendency of the distribution.

When we explored variation in the distance variable earlier, there may have been extreme values we’d want to explore in more detail.

ggplot(flights, aes(distance)) + 
  geom_histogram() +
  theme_bw()

Notice the large distance values. To get a better view of how many there are, we can use coord_cartesian to zoom in.

ggplot(flights, aes(distance)) + 
  geom_histogram() + 
  theme_bw() + 
  coord_cartesian(ylim = c(0, 5000))

Note that in the above plot, coord_cartesian does not remove any points; it simply changes the coordinates that are plotted. We could also pull these values out using filter.

flights %>%
  filter(distance > 3000) %>%
  arrange(distance)
## # A tibble: 715 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     6     1629           1615        14     1954           1953
##  2  2013     7    13     1618           1615         3     1955           1953
##  3  2013     7    20     1618           1615         3     2003           1953
##  4  2013     7    27     1617           1615         2     1906           1953
##  5  2013     8     3     1615           1615         0     2003           1953
##  6  2013     8    10     1613           1615        -2     1922           1953
##  7  2013     8    17     1740           1625        75     2042           2003
##  8  2013     8    24     1633           1625         8     1959           2003
##  9  2013     1     1     1344           1344         0     2005           1944
## 10  2013     1     2     1344           1344         0     1940           1944
## # … with 705 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Measures of Central Tendency

Exploring measures of central tendency, or simply the most common (repeated) values, can also be important.

ggplot(flights, aes(arr_time)) +
  geom_histogram(binwidth = 50) +
  theme_bw()
## Warning: Removed 8713 rows containing non-finite values (stat_bin).

Measures of central tendency can be calculated directly using the summarise function. For example, here is the central tendency of the arr_delay variable:

flights %>%
  summarise(
    avg_arrdelay = mean(arr_delay, na.rm = TRUE),
    med_arrdelay = median(arr_delay, na.rm = TRUE)
  )
## # A tibble: 1 × 2
##   avg_arrdelay med_arrdelay
##          <dbl>        <dbl>
## 1         6.90           -5

More interesting computations can be performed by adding in the group_by function, as in the sketch below.
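
For example, a minimal sketch of the mean and median arrival delay for each origin airport:

flights %>%
  group_by(origin) %>%
  summarise(
    avg_arrdelay = mean(arr_delay, na.rm = TRUE),
    med_arrdelay = median(arr_delay, na.rm = TRUE)
  )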

Exercises

  1. Using the txhousing data, explore rare/common cases in the median sale price for the following 3 cities: Austin, Dallas, and Houston.
  2. Using the data from #1, explore measures of central tendency in the median sale price of these three cities. How have these changed over time (year)?
  3. Create an effective visualization that explores differences in the median sale price over time for these three cities.