Data Import

So far we have only used data that comes bundled with R packages. In practice, we will want to use our own data, which means importing it into R. We are going to focus on reading in two types of files: text files and Excel files.

The following two packages will be used in this section.

library(tidyverse)
# install.packages("readxl")
library(readxl)

Text Files

Through the readr package, we are going to read in flat text files. In many cases, these text files are saved as csv files. The csv stands for comma separated values, meaning that columns in the data are separated by commas. As a side note, this is the most common way that I save and read in data. The nice aspect of csv files is that, if needed, they can be opened in programs like Excel for viewing, but they are still just text files, which makes them simple and lightweight.
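To make the comma separated structure concrete, here is a small sketch (using only base R and made-up values) that writes a tiny csv file and looks at its raw text:

```r
# Create a tiny csv file in a temporary location (made-up example values)
tmp <- tempfile(fileext = ".csv")
writeLines(c("City,State,Shape",
             "Phoenix,AZ,Chevron",
             "Eugene,OR,Disk"), tmp)

# The raw file is just text: each line is a row, commas mark the columns
readLines(tmp)

# Base R can parse it; readr's read_csv() would return a tibble instead
ufo_small <- read.csv(tmp)
ufo_small
```

Each line of the file is a row and the commas mark the column boundaries; read_csv parses the same structure into a tibble.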

To read in a csv file, we are going to use the read_csv function from the readr package. We are going to read in some UFO data (the data can be found on the course website). The code below reads the data directly from the GitHub repository where the data are currently stored.

ufo <- read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv")
## Rows: 8031 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Date / Time, City, State, Shape, Duration, Summary, Posted
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A special note here: if you downloaded the data, I recommend placing it within an RStudio project in a folder named “Data” or “data”. The updated code (not run here) would look like the following, assuming the data are in the “data” folder within an RStudio project:

ufo <- read_csv('data/ufo.csv')

Note again, as with dplyr, when the data are saved to an object they are not printed. We can view the first 10 rows by typing the object name.

ufo
## # A tibble: 8,031 × 7
##    `Date / Time`  City                     State Shape   Duration Summary Posted
##    <chr>          <chr>                    <chr> <chr>   <chr>    <chr>   <chr> 
##  1 12/12/14 17:30 North Wales              PA    Triang… 5 minut… "I hea… <NA>  
##  2 12/12/14 12:40 Cartersville             GA    Unknown 3.6 min… "Looki… 12/12…
##  3 12/12/14 06:30 Isle of Man (UK/England) <NA>  Light   2 secon… "Over … 12/12…
##  4 12/12/14 01:00 Miamisburg               OH    Changi… <NA>     "Brigh… 12/12…
##  5 12/12/14 00:00 Spotsylvania             VA    Unknown 1 minute "White… 12/12…
##  6 12/11/14 23:25 Kenner                   LA    Chevron ~1 minu… "Stran… 12/12…
##  7 12/11/14 23:15 Eugene                   OR    Disk    2 minut… "Dual … 12/12…
##  8 12/11/14 20:04 Phoenix                  AZ    Chevron 3 minut… "4 Ora… 12/12…
##  9 12/11/14 20:00 Franklin                 NC    Disk    5 minut… "There… 12/12…
## 10 12/11/14 18:30 Longview                 WA    Cylind… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows

By default, the read_csv function uses the first row of the data file as the names of the variables. To override this behavior, set col_names = FALSE or better yet, specify the names with the col_names argument. In addition, if the file has header metadata, rows of the data can be skipped with the skip argument. For example, reading in the same data as above, but skipping the first row and specifying the names manually would look as follows:
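As a small self-contained sketch (the data and column names here are made up), readr can read a literal string wrapped in I() as if it were a file, which makes it easy to see skip and col_names working together:

```r
library(readr)

# A small literal csv; I() tells readr to treat the string as data, not a path
csv_text <- I("Date,City\n12/12/14,Phoenix\n12/11/14,Eugene")

# Skip the original header row and supply the names manually
read_csv(csv_text, skip = 1, col_names = c("date", "city"),
         show_col_types = FALSE)
```

The original header line is discarded by skip = 1, and the names given to col_names become the variable names.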

read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv", skip = 1, 
         col_names = c('Date/Time', 'City', 'State', 
                       'Shape', 'Duration', 'Summary',
                       'Posted'))
## Rows: 8031 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Date/Time, City, State, Shape, Duration, Summary, Posted
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 8,031 × 7
##    `Date/Time`    City                     State Shape   Duration Summary Posted
##    <chr>          <chr>                    <chr> <chr>   <chr>    <chr>   <chr> 
##  1 12/12/14 17:30 North Wales              PA    Triang… 5 minut… "I hea… <NA>  
##  2 12/12/14 12:40 Cartersville             GA    Unknown 3.6 min… "Looki… 12/12…
##  3 12/12/14 06:30 Isle of Man (UK/England) <NA>  Light   2 secon… "Over … 12/12…
##  4 12/12/14 01:00 Miamisburg               OH    Changi… <NA>     "Brigh… 12/12…
##  5 12/12/14 00:00 Spotsylvania             VA    Unknown 1 minute "White… 12/12…
##  6 12/11/14 23:25 Kenner                   LA    Chevron ~1 minu… "Stran… 12/12…
##  7 12/11/14 23:15 Eugene                   OR    Disk    2 minut… "Dual … 12/12…
##  8 12/11/14 20:04 Phoenix                  AZ    Chevron 3 minut… "4 Ora… 12/12…
##  9 12/11/14 20:00 Franklin                 NC    Disk    5 minut… "There… 12/12…
## 10 12/11/14 18:30 Longview                 WA    Cylind… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows

Manually Specifying Column Types

You may have noticed above that we only needed to give the read_csv function the path to the data file; we did not need to tell it the types of the columns. Instead, the function guessed each type from the first 1000 rows. This is convenient for interactive work, but for truly reproducible code it is best to specify the column types manually. There are two ways to do this, one verbose and one compact, but both use the col_types argument.

First the verbose solution:

read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv", 
         col_types = c(
           'Date/Time' = col_character(),
           City = col_character(),
           State = col_character(),
           Shape = col_character(),
           Duration = col_character(),
           Summary = col_character(),
           Posted = col_character()
         ))
## # A tibble: 8,031 × 7
##    `Date / Time`  City                     State Shape   Duration Summary Posted
##    <chr>          <chr>                    <chr> <chr>   <chr>    <chr>   <chr> 
##  1 12/12/14 17:30 North Wales              PA    Triang… 5 minut… "I hea… <NA>  
##  2 12/12/14 12:40 Cartersville             GA    Unknown 3.6 min… "Looki… 12/12…
##  3 12/12/14 06:30 Isle of Man (UK/England) <NA>  Light   2 secon… "Over … 12/12…
##  4 12/12/14 01:00 Miamisburg               OH    Changi… <NA>     "Brigh… 12/12…
##  5 12/12/14 00:00 Spotsylvania             VA    Unknown 1 minute "White… 12/12…
##  6 12/11/14 23:25 Kenner                   LA    Chevron ~1 minu… "Stran… 12/12…
##  7 12/11/14 23:15 Eugene                   OR    Disk    2 minut… "Dual … 12/12…
##  8 12/11/14 20:04 Phoenix                  AZ    Chevron 3 minut… "4 Ora… 12/12…
##  9 12/11/14 20:00 Franklin                 NC    Disk    5 minut… "There… 12/12…
## 10 12/11/14 18:30 Longview                 WA    Cylind… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows

As all variables are being read in as characters, there is a simple shortcut to use.

read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv", 
         col_types = c('ccccccc'))
## # A tibble: 8,031 × 7
##    `Date / Time`  City                     State Shape   Duration Summary Posted
##    <chr>          <chr>                    <chr> <chr>   <chr>    <chr>   <chr> 
##  1 12/12/14 17:30 North Wales              PA    Triang… 5 minut… "I hea… <NA>  
##  2 12/12/14 12:40 Cartersville             GA    Unknown 3.6 min… "Looki… 12/12…
##  3 12/12/14 06:30 Isle of Man (UK/England) <NA>  Light   2 secon… "Over … 12/12…
##  4 12/12/14 01:00 Miamisburg               OH    Changi… <NA>     "Brigh… 12/12…
##  5 12/12/14 00:00 Spotsylvania             VA    Unknown 1 minute "White… 12/12…
##  6 12/11/14 23:25 Kenner                   LA    Chevron ~1 minu… "Stran… 12/12…
##  7 12/11/14 23:15 Eugene                   OR    Disk    2 minut… "Dual … 12/12…
##  8 12/11/14 20:04 Phoenix                  AZ    Chevron 3 minut… "4 Ora… 12/12…
##  9 12/11/14 20:00 Franklin                 NC    Disk    5 minut… "There… 12/12…
## 10 12/11/14 18:30 Longview                 WA    Cylind… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows

To show why the more verbose approach is useful, suppose we wished to convert the ‘Date / Time’ variable to the correct type, a date-time variable.

read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv", 
         col_types = c(
           'Date / Time' = col_datetime(),
           City = col_character(),
           State = col_character(),
           Shape = col_character(),
           Duration = col_character(),
           Summary = col_character(),
           Posted = col_character()
         ))
## Error: Unknown shortcut:

Here we get an error, caused by the fact that the date-time column specification needs a format statement. We can specify this directly.
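The format string passed to col_datetime() uses the standard strptime() codes: %m for month, %d for day, %y for two-digit year, %H for hour (24-hour clock), and %M for minute. A quick check of the format with base R's strptime():

```r
# Parse one of the UFO timestamps with the format used below
x <- strptime("12/12/14 17:30", format = "%m/%d/%y %H:%M", tz = "UTC")
format(x, "%Y-%m-%d %H:%M")

# A value with no time portion does not match the format and returns NA
strptime("12/1/14", format = "%m/%d/%y %H:%M", tz = "UTC")
```

Values that do not match the full format fail to parse, which is exactly what happens to some rows below.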

ufo_date <- read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv", 
         col_types = list(
           'Date / Time' = col_datetime(format = "%m/%d/%y %H:%M"),
           City = col_character(),
           State = col_character(),
           Shape = col_character(),
           Duration = col_character(),
           Summary = col_character(),
           Posted = col_character()
         ))
## Warning: One or more parsing issues, see `problems()` for details
ufo_date
## # A tibble: 8,031 × 7
##    `Date / Time`       City                  State Shape Duration Summary Posted
##    <dttm>              <chr>                 <chr> <chr> <chr>    <chr>   <chr> 
##  1 2014-12-12 17:30:00 North Wales           PA    Tria… 5 minut… "I hea… <NA>  
##  2 2014-12-12 12:40:00 Cartersville          GA    Unkn… 3.6 min… "Looki… 12/12…
##  3 2014-12-12 06:30:00 Isle of Man (UK/Engl… <NA>  Light 2 secon… "Over … 12/12…
##  4 2014-12-12 01:00:00 Miamisburg            OH    Chan… <NA>     "Brigh… 12/12…
##  5 2014-12-12 00:00:00 Spotsylvania          VA    Unkn… 1 minute "White… 12/12…
##  6 2014-12-11 23:25:00 Kenner                LA    Chev… ~1 minu… "Stran… 12/12…
##  7 2014-12-11 23:15:00 Eugene                OR    Disk  2 minut… "Dual … 12/12…
##  8 2014-12-11 20:04:00 Phoenix               AZ    Chev… 3 minut… "4 Ora… 12/12…
##  9 2014-12-11 20:00:00 Franklin              NC    Disk  5 minut… "There… 12/12…
## 10 2014-12-11 18:30:00 Longview              WA    Cyli… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows

Notice that even though I was careful in the column specification, there were still issues parsing this column as a date/time. The data are still returned, but the values that failed to parse are set to NA. These issues can be viewed using the problems function; as shown below, the failing rows contain dates without a time portion, which do not match the specified format.

problems(ufo_date)
## # A tibble: 56 × 5
##      row   col expected                 actual   file 
##    <int> <int> <chr>                    <chr>    <chr>
##  1   120     1 date like %m/%d/%y %H:%M 12/1/14  ""   
##  2   195     1 date like %m/%d/%y %H:%M 11/27/14 ""   
##  3   237     1 date like %m/%d/%y %H:%M 11/24/14 ""   
##  4   408     1 date like %m/%d/%y %H:%M 11/15/14 ""   
##  5   666     1 date like %m/%d/%y %H:%M 10/31/14 ""   
##  6   798     1 date like %m/%d/%y %H:%M 10/25/14 ""   
##  7   947     1 date like %m/%d/%y %H:%M 10/19/14 ""   
##  8  1082     1 date like %m/%d/%y %H:%M 10/14/14 ""   
##  9  1123     1 date like %m/%d/%y %H:%M 10/12/14 ""   
## 10  1124     1 date like %m/%d/%y %H:%M 10/12/14 ""   
## # … with 46 more rows

Other Text Formats

There are other text formats that can be read in. They are listed below with the function used for each type. Note that the function arguments are essentially identical to those shown above.

  • tsv - tab separated files - read_tsv
  • fixed width files - read_fwf
  • white space generally - read_table
  • delimiter generally - read_delim
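read_delim is the general version underlying these: you supply the delimiter yourself. A minimal sketch with a semicolon-delimited literal string (made-up values; I() marks the string as data rather than a path):

```r
library(readr)

# Semicolon-delimited text, as sometimes exported in European locales
read_delim(I("City;State\nPhoenix;AZ\nEugene;OR"), delim = ";",
           show_col_types = FALSE)
```

Swapping delim = ";" for "\t" would reproduce read_tsv, and " " approximates read_table for single-space-separated files.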

Exercises

  1. There is a tsv file posted on icon called “lotr_clean.tsv”. Download this and read this data file into R.
  2. Instead of specifying the path, use the function file.choose(). For example, read_tsv(file.choose()). What does this function use? Would you recommend this to be used in a reproducible document?
  3. Run the getwd() function from the R console. What does this function return?

Excel Files

Although I commonly use text files (e.g., csv files), the reality is that many people still use Excel for storing data. There are good and bad aspects of this, but reading in Excel files may be necessary. The readxl package is useful for this task.

Suppose we wished to read in the Excel file found on the US Census Bureau website related to Education: https://www.census.gov/support/USACdataDownloads.html

We can do this with the read_excel function, using a copy of the data posted to the course website. Note that the read_excel() function does not read data directly from the web; instead, the file needs to be downloaded to a temporary file, which is then loaded into R. If you downloaded the data yourself, I recommend placing it within a “Data” or “data” folder.

tf <- tempfile(fileext = ".xls")
curl::curl_download("https://github.com/lebebr01/psqf-6250-blogdown/blob/main/data/EDU01.xls?raw=true", tf)

read_excel(tf)
## # A tibble: 3,198 × 42
##    Area_name     STCOU EDU010187F EDU010187D EDU010187N1 EDU010187N2 EDU010188F
##    <chr>         <chr>      <dbl>      <dbl> <chr>       <chr>            <dbl>
##  1 UNITED STATES 00000          0   40024299 0000        0000                 0
##  2 ALABAMA       01000          0     733735 0000        0000                 0
##  3 Autauga, AL   01001          0       6829 0000        0000                 0
##  4 Baldwin, AL   01003          0      16417 0000        0000                 0
##  5 Barbour, AL   01005          0       5071 0000        0000                 0
##  6 Bibb, AL      01007          0       3557 0000        0000                 0
##  7 Blount, AL    01009          0       7319 0000        0000                 0
##  8 Bullock, AL   01011          0       2014 0000        0000                 0
##  9 Butler, AL    01013          0       4640 0000        0000                 0
## 10 Calhoun, AL   01015          0      20939 0000        0000                 0
## # … with 3,188 more rows, and 35 more variables: EDU010188D <dbl>,
## #   EDU010188N1 <chr>, EDU010188N2 <chr>, EDU010189F <dbl>, EDU010189D <dbl>,
## #   EDU010189N1 <chr>, EDU010189N2 <chr>, EDU010190F <dbl>, EDU010190D <dbl>,
## #   EDU010190N1 <chr>, EDU010190N2 <chr>, EDU010191F <dbl>, EDU010191D <dbl>,
## #   EDU010191N1 <chr>, EDU010191N2 <chr>, EDU010192F <dbl>, EDU010192D <dbl>,
## #   EDU010192N1 <chr>, EDU010192N2 <chr>, EDU010193F <dbl>, EDU010193D <dbl>,
## #   EDU010193N1 <chr>, EDU010193N2 <chr>, EDU010194F <dbl>, EDU010194D <dbl>, …

By default, the function will read in the first sheet and will treat the first row as the column names. If you wish to read in another sheet, you can use the sheet argument. For example:

read_excel(tf, sheet = 2)
## # A tibble: 3,198 × 42
##    Area_name     STCOU EDU010197F EDU010197D EDU010197N1 EDU010197N2 EDU010198F
##    <chr>         <chr>      <dbl>      <dbl> <chr>       <chr>            <dbl>
##  1 UNITED STATES 00000          0   44534459 0000        0000                 0
##  2 ALABAMA       01000          0     737386 0000        0000                 0
##  3 Autauga, AL   01001          0       8099 0000        0000                 0
##  4 Baldwin, AL   01003          0      21410 0000        0000                 0
##  5 Barbour, AL   01005          0       5100 0000        0000                 0
##  6 Bibb, AL      01007          0       3717 0000        0000                 0
##  7 Blount, AL    01009          0       7816 0000        0000                 0
##  8 Bullock, AL   01011          0       2010 0000        0000                 0
##  9 Butler, AL    01013          0       4119 0000        0000                 0
## 10 Calhoun, AL   01015          0      19721 0000        0000                 0
## # … with 3,188 more rows, and 35 more variables: EDU010198D <dbl>,
## #   EDU010198N1 <chr>, EDU010198N2 <chr>, EDU010199F <dbl>, EDU010199D <dbl>,
## #   EDU010199N1 <chr>, EDU010199N2 <chr>, EDU010200F <dbl>, EDU010200D <dbl>,
## #   EDU010200N1 <chr>, EDU010200N2 <chr>, EDU010201F <dbl>, EDU010201D <dbl>,
## #   EDU010201N1 <chr>, EDU010201N2 <chr>, EDU010202F <dbl>, EDU010202D <dbl>,
## #   EDU010202N1 <chr>, EDU010202N2 <chr>, EDU015203F <dbl>, EDU015203D <dbl>,
## #   EDU015203N1 <chr>, EDU015203N2 <chr>, EDU015204F <dbl>, EDU015204D <dbl>, …
read_excel(tf, sheet = 'EDU01B')
## # A tibble: 3,198 × 42
##    Area_name     STCOU EDU010197F EDU010197D EDU010197N1 EDU010197N2 EDU010198F
##    <chr>         <chr>      <dbl>      <dbl> <chr>       <chr>            <dbl>
##  1 UNITED STATES 00000          0   44534459 0000        0000                 0
##  2 ALABAMA       01000          0     737386 0000        0000                 0
##  3 Autauga, AL   01001          0       8099 0000        0000                 0
##  4 Baldwin, AL   01003          0      21410 0000        0000                 0
##  5 Barbour, AL   01005          0       5100 0000        0000                 0
##  6 Bibb, AL      01007          0       3717 0000        0000                 0
##  7 Blount, AL    01009          0       7816 0000        0000                 0
##  8 Bullock, AL   01011          0       2010 0000        0000                 0
##  9 Butler, AL    01013          0       4119 0000        0000                 0
## 10 Calhoun, AL   01015          0      19721 0000        0000                 0
## # … with 3,188 more rows, and 35 more variables: EDU010198D <dbl>,
## #   EDU010198N1 <chr>, EDU010198N2 <chr>, EDU010199F <dbl>, EDU010199D <dbl>,
## #   EDU010199N1 <chr>, EDU010199N2 <chr>, EDU010200F <dbl>, EDU010200D <dbl>,
## #   EDU010200N1 <chr>, EDU010200N2 <chr>, EDU010201F <dbl>, EDU010201D <dbl>,
## #   EDU010201N1 <chr>, EDU010201N2 <chr>, EDU010202F <dbl>, EDU010202D <dbl>,
## #   EDU010202N1 <chr>, EDU010202N2 <chr>, EDU015203F <dbl>, EDU015203D <dbl>,
## #   EDU015203N1 <chr>, EDU015203N2 <chr>, EDU015204F <dbl>, EDU015204D <dbl>, …
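If you do not know the sheet names ahead of time, the readxl package provides the excel_sheets function to list them (shown here with the tf file downloaded above):

```r
library(readxl)

# List the sheet names contained in the downloaded workbook
excel_sheets(tf)
```

Any of the returned names can then be passed to the sheet argument of read_excel.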

If the file has header metadata or lacks column names, these can be handled in the same fashion as discussed above for read_csv, using the skip and col_names arguments. Finally, note that to work with these data in R, you would want to save them to an object.

edu_data <- read_excel(tf)

Writing Files

Most of the read_* functions have matching write_* functions for writing files out. I am only going to cover the write_csv function; however, there are others that may be of use. Similar to the read functions, the arguments are consistent across the write_* functions.

Suppose we created a new column in the ufo data and wished to save the result to a csv file. This can be accomplished with the following series of commands.

ufo_count <- ufo %>%
  group_by(State) %>%
  mutate(num_state = n())
write_csv(ufo_count, file = 'path/to/save/file.csv')

Notice there are two arguments to the write_csv function: the first is the object you wish to save, and the second is the path to the location where the file should be saved. In current versions of readr this second argument is named file (the older path argument is deprecated).
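As a quick sanity check (a sketch using a temporary file and made-up values), the file written by write_csv can be read straight back in with read_csv:

```r
library(readr)

# Write a small data frame to a temporary csv, then read it back
out <- tempfile(fileext = ".csv")
write_csv(data.frame(State = c("AZ", "OR"), num_state = c(2, 1)), out)
read_csv(out, show_col_types = FALSE)
```

The round trip preserves the rows and column names, which makes write_csv a convenient way to hand data off to collaborators or other programs.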

Other Data Formats

There are still other data formats, particularly from proprietary statistical software such as Stata, SAS, or SPSS. To read these files in, the haven package would be useful (for example, its read_stata, read_sas, and read_spss functions). I leave this as an exercise for you if you have these types of files to read into R.
