Data Restructuring

Data restructuring is often a useful tool to have. By data restructuring, I mean transforming data from long to wide format or vice versa. For the most part, long format is much easier to use when plotting and computing summary statistics. A related topic, called tidy data, can be read about in more detail here: http://www.jstatsoft.org/v59/i10/paper.

The data we are going to use for this section of notes is called “LongitudinalEx.csv” and can be found on the course website. The data are loaded directly from the web, but these could be loaded from a downloaded data file (see )

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
long_data <- read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/LongitudinalEx.csv")
## Rows: 27 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): id, wave, agegrp, age, piat, agegrp.c, age.c
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Long/Stacked Data

The data read in above is in a format that is commonly referred to as long or stacked data.

long_data
## # A tibble: 27 × 7
##       id  wave agegrp   age  piat agegrp.c  age.c
##    <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>  <dbl>
##  1     4     1    6.5  6       18        0 -0.5  
##  2     4     2    8.5  8.5     31        2  2    
##  3     4     3   10.5 10.7     50        4  4.17 
##  4    27     1    6.5  6.25    19        0 -0.25 
##  5    27     2    8.5  9.17    36        2  2.67 
##  6    27     3   10.5 10.9     57        4  4.42 
##  7    31     1    6.5  6.33    18        0 -0.167
##  8    31     2    8.5  8.83    31        2  2.33 
##  9    31     3   10.5 10.9     51        4  4.42 
## 10    33     1    6.5  6.33    18        0 -0.167
## # … with 17 more rows

These data do not have one individual per row, instead each row is a individual by wave combination and are stacked for each individual (notice the three rows for id = 4). The variables in this case each have there own column in the data and all of them are time varying (change for each wave of data within an individual). This is also an example of “tidy data” from the paper linked to above, where each row is a unique observation (id, wave pair), variables are in the columns, and each cell of the data is a value.

The primary functions within the tidyr package are the following:

  • pivot_longer(): for making the data longer, this replaces the gather() function from tidyr.
  • pivot_wider(): for making the data wider, this replaces the spread() function from tidyr.

These two functions are relatively new and they still may change as they continue to be developed. I believe the old functions will be deprecated, but not removed entirely. This means they will not be actively developed any longer, but will remain in the tidyr package for the forseeable future.

It should also be noted there are mutiple R packages for data restructuring, including reshape and reshape2. The syntax for these has always been difficult for me to fully process and internalize, but they are incredibly powerful. I believe almost every data restructuring task can be accomplished by pivot_longer() and pivot_wider() (with some companion functions), but it may take a few extra steps to get to the desired structure.

Extra Long Data

To progress through data restructuring, we first need to transform this data is extra long format. This format is not entirely useful by itself, however it will help use show the use of a few functions from the tidyr package. To go to extra long data, we will make use of the pivot_longer() and unite functions.

extra_long <- long_data %>%
  pivot_longer(agegrp:age.c, names_to = 'variable', values_to = 'value') %>%
  unite(var_wave, variable, wave)
extra_long 
## # A tibble: 135 × 3
##       id var_wave   value
##    <dbl> <chr>      <dbl>
##  1     4 agegrp_1     6.5
##  2     4 age_1        6  
##  3     4 piat_1      18  
##  4     4 agegrp.c_1   0  
##  5     4 age.c_1     -0.5
##  6     4 agegrp_2     8.5
##  7     4 age_2        8.5
##  8     4 piat_2      31  
##  9     4 agegrp.c_2   2  
## 10     4 age.c_2      2  
## # … with 125 more rows

You’ll notice now that there are only three columns in the data and that there are now 135 rows in data. This extra long data format gathered all of the variables into two columns, one that identify the variable and wave and the other that simply lists the value.

Wide Data

We can now take the extra long data and turn this into wide data. Wide data is characterized by one row per individual with columns representing the variable and wave combinations.

wide <- extra_long %>% 
  pivot_wider(names_from = var_wave, values_from = value)
wide
## # A tibble: 9 × 16
##      id agegrp_1 age_1 piat_1 agegrp.c_1 age.c_1 agegrp_2 age_2 piat_2
##   <dbl>    <dbl> <dbl>  <dbl>      <dbl>   <dbl>    <dbl> <dbl>  <dbl>
## 1     4      6.5  6        18          0  -0.5        8.5  8.5      31
## 2    27      6.5  6.25     19          0  -0.25       8.5  9.17     36
## 3    31      6.5  6.33     18          0  -0.167      8.5  8.83     31
## 4    33      6.5  6.33     18          0  -0.167      8.5  8.92     34
## 5    41      6.5  6.33     18          0  -0.167      8.5  8.75     28
## 6    49      6.5  6.5      19          0   0          8.5  8.75     32
## 7    69      6.5  6.67     26          0   0.167      8.5  9.17     47
## 8    77      6.5  6.83     17          0   0.333      8.5  8.08     19
## 9    87      6.5  6.92     22          0   0.417      8.5  9.42     49
## # … with 7 more variables: agegrp.c_2 <dbl>, age.c_2 <dbl>, agegrp_3 <dbl>,
## #   age_3 <dbl>, piat_3 <dbl>, agegrp.c_3 <dbl>, age.c_3 <dbl>

You’ll notice from the data above, there are now only 9 rows, but now 16 columns in the data. Each variable except for id now also has a number appended to it to represent the wave of the data.

This data structure is common, particularly for users of SPSS or Excel for data entry or processing. Unfortunately, when working with data in R (and in general), data in wide format is often difficult to work with. Therefore it is common to need to restructure the data from wide to long format.

Back to Long Format

Fortunately, we can use the same functions as we used above, but now in inverse to get from wide to long format.

wide %>% 
  pivot_longer(-id, names_to = "variable", values_to = "value") %>%
  separate(variable, into = c('variable', 'wave'), 
           sep = "_") %>%
  arrange(id, wave) %>%
  pivot_wider(names_from = variable, values_from = value)
## # A tibble: 27 × 7
##       id wave  agegrp   age  piat agegrp.c  age.c
##    <dbl> <chr>  <dbl> <dbl> <dbl>    <dbl>  <dbl>
##  1     4 1        6.5  6       18        0 -0.5  
##  2     4 2        8.5  8.5     31        2  2    
##  3     4 3       10.5 10.7     50        4  4.17 
##  4    27 1        6.5  6.25    19        0 -0.25 
##  5    27 2        8.5  9.17    36        2  2.67 
##  6    27 3       10.5 10.9     57        4  4.42 
##  7    31 1        6.5  6.33    18        0 -0.167
##  8    31 2        8.5  8.83    31        2  2.33 
##  9    31 3       10.5 10.9     51        4  4.42 
## 10    33 1        6.5  6.33    18        0 -0.167
## # … with 17 more rows

This now is identical to the first data that we had. I would recommend working through the steps above to see what the data structure looks like in each intermediate step along the way. In addition, it is often not of interest to save the extra long data format, below is the code that would go directly from long to wide.

long_data %>%
  pivot_longer(agegrp:age.c, names_to = 'variable', values_to = 'value') %>%
  unite(var_wave, variable, wave) %>%
  pivot_wider(names_from = var_wave, values_from = value)
## # A tibble: 9 × 16
##      id agegrp_1 age_1 piat_1 agegrp.c_1 age.c_1 agegrp_2 age_2 piat_2
##   <dbl>    <dbl> <dbl>  <dbl>      <dbl>   <dbl>    <dbl> <dbl>  <dbl>
## 1     4      6.5  6        18          0  -0.5        8.5  8.5      31
## 2    27      6.5  6.25     19          0  -0.25       8.5  9.17     36
## 3    31      6.5  6.33     18          0  -0.167      8.5  8.83     31
## 4    33      6.5  6.33     18          0  -0.167      8.5  8.92     34
## 5    41      6.5  6.33     18          0  -0.167      8.5  8.75     28
## 6    49      6.5  6.5      19          0   0          8.5  8.75     32
## 7    69      6.5  6.67     26          0   0.167      8.5  9.17     47
## 8    77      6.5  6.83     17          0   0.333      8.5  8.08     19
## 9    87      6.5  6.92     22          0   0.417      8.5  9.42     49
## # … with 7 more variables: agegrp.c_2 <dbl>, age.c_2 <dbl>, agegrp_3 <dbl>,
## #   age_3 <dbl>, piat_3 <dbl>, agegrp.c_3 <dbl>, age.c_3 <dbl>

Exercises

  1. Using the following data generation code, convert these data to long format.
set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c('control', 'treatment'), each = 2)),
  work.T1 = runif(4),
  home.T1 = runif(4),
  work.T2 = runif(4),
  home.T2 = runif(4)
)
  1. Once successfully converted to long format, convert back to wide format.
Previous
Next