Data Import
So far we have only used data that already comes with R by way of packages. Naturally, we will want to use our own data, which involves importing data directly into R. We are going to focus on reading in two types of files: text files and Excel files.
The following two packages will be used in this section.
library(tidyverse)
# install.packages("readxl")
library(readxl)
Text Files
Through the use of the `readr` package, we are going to read in flat text files. In many cases, these text files are saved as csv files, where csv stands for comma separated values, meaning that the columns in the data are separated by commas. As a side note, this is the most common way that I save and read in data. The nice aspect of csv files is that, if needed, they can be opened in programs like Excel for viewing, but they are still just text files, which are simple and lightweight.
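As a quick illustration of the format, here is a minimal sketch (the city names and counts are made up) that reads csv text supplied as a literal string; wrapping the string in `I()` tells `read_csv` to treat it as data rather than as a file path.

```r
library(readr)

# A tiny csv supplied as a literal string (made-up values);
# I() tells read_csv to treat the string as data, not a file path
csv_text <- "name,state,count
Ames,IA,3
Omaha,NE,5"

small <- read_csv(I(csv_text), show_col_types = FALSE)
small
```

The same `read_csv` call works identically whether it is given literal text, a local path, or a URL, which is why the GitHub example below needs nothing special.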
To read in a csv file, we are going to use the `read_csv` function from the `readr` package. We are going to read in some UFO data (the data can be found on the course website). The code below reads the data directly from GitHub, where the data are currently stored.
ufo <- read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv")
## Rows: 8031 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Date / Time, City, State, Shape, Duration, Summary, Posted
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A special note here: if you downloaded the data, I recommend putting it within an RStudio project in a folder named “Data” or “data”. The updated code (not run here) would look like the following (assuming the data are in the “data” folder within an RStudio project):
ufo <- read_csv('data/ufo.csv')
Note again, as with `dplyr`, when saving the data to an object, the data will not be printed. We can now view the first 10 rows by typing the object name.
ufo
## # A tibble: 8,031 × 7
## `Date / Time` City State Shape Duration Summary Posted
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 12/12/14 17:30 North Wales PA Triang… 5 minut… "I hea… <NA>
## 2 12/12/14 12:40 Cartersville GA Unknown 3.6 min… "Looki… 12/12…
## 3 12/12/14 06:30 Isle of Man (UK/England) <NA> Light 2 secon… "Over … 12/12…
## 4 12/12/14 01:00 Miamisburg OH Changi… <NA> "Brigh… 12/12…
## 5 12/12/14 00:00 Spotsylvania VA Unknown 1 minute "White… 12/12…
## 6 12/11/14 23:25 Kenner LA Chevron ~1 minu… "Stran… 12/12…
## 7 12/11/14 23:15 Eugene OR Disk 2 minut… "Dual … 12/12…
## 8 12/11/14 20:04 Phoenix AZ Chevron 3 minut… "4 Ora… 12/12…
## 9 12/11/14 20:00 Franklin NC Disk 5 minut… "There… 12/12…
## 10 12/11/14 18:30 Longview WA Cylind… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows
By default, the `read_csv` function uses the first row of the data file as the names of the variables. To override this behavior, set `col_names = FALSE`, or better yet, specify the names with the `col_names` argument. In addition, if the file has header metadata, rows of the data can be skipped with the `skip` argument. For example, reading in the same data as above, but skipping the first row and specifying the names manually, would look as follows:
read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv", skip = 1,
col_names = c('Date/Time', 'City', 'State',
'Shape', 'Duration', 'Summary',
'Posted'))
## Rows: 8031 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Date/Time, City, State, Shape, Duration, Summary, Posted
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 8,031 × 7
## `Date/Time` City State Shape Duration Summary Posted
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 12/12/14 17:30 North Wales PA Triang… 5 minut… "I hea… <NA>
## 2 12/12/14 12:40 Cartersville GA Unknown 3.6 min… "Looki… 12/12…
## 3 12/12/14 06:30 Isle of Man (UK/England) <NA> Light 2 secon… "Over … 12/12…
## 4 12/12/14 01:00 Miamisburg OH Changi… <NA> "Brigh… 12/12…
## 5 12/12/14 00:00 Spotsylvania VA Unknown 1 minute "White… 12/12…
## 6 12/11/14 23:25 Kenner LA Chevron ~1 minu… "Stran… 12/12…
## 7 12/11/14 23:15 Eugene OR Disk 2 minut… "Dual … 12/12…
## 8 12/11/14 20:04 Phoenix AZ Chevron 3 minut… "4 Ora… 12/12…
## 9 12/11/14 20:00 Franklin NC Disk 5 minut… "There… 12/12…
## 10 12/11/14 18:30 Longview WA Cylind… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows
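As a sketch of the `col_names = FALSE` behavior (using a couple of made-up rows as literal data), readr keeps the first row as data and generates placeholder names `X1`, `X2`, and so on:

```r
library(readr)

# With col_names = FALSE the first row is kept as data and
# readr generates the names X1, X2, X3 automatically
no_names <- read_csv(I("12/12/14 17:30,North Wales,PA\n12/12/14 12:40,Cartersville,GA"),
                     col_names = FALSE, show_col_types = FALSE)
names(no_names)
```

This is why supplying real names via `col_names` is usually preferable to leaving the placeholders in place.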
Manually Specifying Column Types
You may have noticed above that we only needed to give the `read_csv` function the path to the data file; we did not need to tell the function the types of the columns. Instead, the function guessed the types from the first 1000 rows. This can be useful for interactive work, but for truly reproducible code, it is best to specify the column types manually. There are two ways to specify the column types, one verbose and one compact, and both use the `col_types` argument.

First, the verbose solution:
read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv",
  col_types = list(
    'Date / Time' = col_character(),
    City = col_character(),
    State = col_character(),
    Shape = col_character(),
    Duration = col_character(),
    Summary = col_character(),
    Posted = col_character()
  ))
## # A tibble: 8,031 × 7
## `Date / Time` City State Shape Duration Summary Posted
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 12/12/14 17:30 North Wales PA Triang… 5 minut… "I hea… <NA>
## 2 12/12/14 12:40 Cartersville GA Unknown 3.6 min… "Looki… 12/12…
## 3 12/12/14 06:30 Isle of Man (UK/England) <NA> Light 2 secon… "Over … 12/12…
## 4 12/12/14 01:00 Miamisburg OH Changi… <NA> "Brigh… 12/12…
## 5 12/12/14 00:00 Spotsylvania VA Unknown 1 minute "White… 12/12…
## 6 12/11/14 23:25 Kenner LA Chevron ~1 minu… "Stran… 12/12…
## 7 12/11/14 23:15 Eugene OR Disk 2 minut… "Dual … 12/12…
## 8 12/11/14 20:04 Phoenix AZ Chevron 3 minut… "4 Ora… 12/12…
## 9 12/11/14 20:00 Franklin NC Disk 5 minut… "There… 12/12…
## 10 12/11/14 18:30 Longview WA Cylind… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows
As all variables are being read in as characters, there is a simple shortcut to use.
read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv",
  col_types = 'ccccccc')
## # A tibble: 8,031 × 7
## `Date / Time` City State Shape Duration Summary Posted
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 12/12/14 17:30 North Wales PA Triang… 5 minut… "I hea… <NA>
## 2 12/12/14 12:40 Cartersville GA Unknown 3.6 min… "Looki… 12/12…
## 3 12/12/14 06:30 Isle of Man (UK/England) <NA> Light 2 secon… "Over … 12/12…
## 4 12/12/14 01:00 Miamisburg OH Changi… <NA> "Brigh… 12/12…
## 5 12/12/14 00:00 Spotsylvania VA Unknown 1 minute "White… 12/12…
## 6 12/11/14 23:25 Kenner LA Chevron ~1 minu… "Stran… 12/12…
## 7 12/11/14 23:15 Eugene OR Disk 2 minut… "Dual … 12/12…
## 8 12/11/14 20:04 Phoenix AZ Chevron 3 minut… "4 Ora… 12/12…
## 9 12/11/14 20:00 Franklin NC Disk 5 minut… "There… 12/12…
## 10 12/11/14 18:30 Longview WA Cylind… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows
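The shortcut string is not limited to characters; each letter maps to a column type ('c' = character, 'i' = integer, 'd' = double, 'l' = logical, 'D' = date, 'T' = datetime, and '_' or '-' skips a column). A small sketch with made-up data:

```r
library(readr)

# 'cid' = character, integer, double, applied to the columns in order
mixed <- read_csv(I("city,count,avg\nAmes,3,1.5\nOmaha,5,2.5"),
                  col_types = "cid")
sapply(mixed, class)
```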
To show why the more verbose approach is useful, suppose we wished to convert the ‘Date / Time’ variable to the correct type, a date-time variable.
read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv",
col_types = c(
'Date / Time' = col_datetime(),
City = col_character(),
State = col_character(),
Shape = col_character(),
Duration = col_character(),
Summary = col_character(),
Posted = col_character()
))
## Error: Unknown shortcut:
Here we get an error, caused by the fact that the date-time column specification needs a format statement. We can specify this directly.
ufo_date <- read_csv("https://raw.githubusercontent.com/lebebr01/psqf-6250-blogdown/main/data/ufo.csv",
col_types = list(
'Date / Time' = col_datetime(format = "%m/%d/%y %H:%M"),
City = col_character(),
State = col_character(),
Shape = col_character(),
Duration = col_character(),
Summary = col_character(),
Posted = col_character()
))
## Warning: One or more parsing issues, see `problems()` for details
ufo_date
## # A tibble: 8,031 × 7
## `Date / Time` City State Shape Duration Summary Posted
## <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2014-12-12 17:30:00 North Wales PA Tria… 5 minut… "I hea… <NA>
## 2 2014-12-12 12:40:00 Cartersville GA Unkn… 3.6 min… "Looki… 12/12…
## 3 2014-12-12 06:30:00 Isle of Man (UK/Engl… <NA> Light 2 secon… "Over … 12/12…
## 4 2014-12-12 01:00:00 Miamisburg OH Chan… <NA> "Brigh… 12/12…
## 5 2014-12-12 00:00:00 Spotsylvania VA Unkn… 1 minute "White… 12/12…
## 6 2014-12-11 23:25:00 Kenner LA Chev… ~1 minu… "Stran… 12/12…
## 7 2014-12-11 23:15:00 Eugene OR Disk 2 minut… "Dual … 12/12…
## 8 2014-12-11 20:04:00 Phoenix AZ Chev… 3 minut… "4 Ora… 12/12…
## 9 2014-12-11 20:00:00 Franklin NC Disk 5 minut… "There… 12/12…
## 10 2014-12-11 18:30:00 Longview WA Cyli… 10 seco… "Two c… 12/12…
## # … with 8,021 more rows
Notice that even though I was careful with the column specification, there were still issues when parsing this column as a date/time column. The data are still returned, but the parsing issues can be viewed using the `problems` function.
problems(ufo_date)
## # A tibble: 56 × 5
## row col expected actual file
## <int> <int> <chr> <chr> <chr>
## 1 120 1 date like %m/%d/%y %H:%M 12/1/14 ""
## 2 195 1 date like %m/%d/%y %H:%M 11/27/14 ""
## 3 237 1 date like %m/%d/%y %H:%M 11/24/14 ""
## 4 408 1 date like %m/%d/%y %H:%M 11/15/14 ""
## 5 666 1 date like %m/%d/%y %H:%M 10/31/14 ""
## 6 798 1 date like %m/%d/%y %H:%M 10/25/14 ""
## 7 947 1 date like %m/%d/%y %H:%M 10/19/14 ""
## 8 1082 1 date like %m/%d/%y %H:%M 10/14/14 ""
## 9 1123 1 date like %m/%d/%y %H:%M 10/12/14 ""
## 10 1124 1 date like %m/%d/%y %H:%M 10/12/14 ""
## # … with 46 more rows
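To see why those rows failed, the underlying parser can be called directly on a couple of example values; `parse_datetime` applies the format strictly, so a value without a time portion does not match and is returned as `NA` (with a warning):

```r
library(readr)

# The second value has no time portion, so it fails the strict format match
dt <- parse_datetime(c("12/12/14 17:30", "12/1/14"),
                     format = "%m/%d/%y %H:%M")
dt
```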
Other Text Formats
There are other text formats that can be read in as well. They are listed below with the function used to read in each type. Note that the function calls are identical to those specified above.

- tsv (tab separated files): `read_tsv`
- fixed width files: `read_fwf`
- white space generally: `read_table`
- delimiter generally: `read_delim`
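As a sketch of the most general of these, `read_delim` accepts any single-character delimiter through the `delim` argument (the semicolon-separated values here are made up):

```r
library(readr)

# Semicolon-delimited literal data; delim = ";" tells readr how to split columns
delim_data <- read_delim(I("city;state\nAmes;IA\nOmaha;NE"),
                         delim = ";", show_col_types = FALSE)
delim_data
```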
Exercises
- There is a tsv file posted on ICON called “lotr_clean.tsv”. Download this file and read it into R.
- Instead of specifying the path, use the function `file.choose()`; for example, `read_tsv(file.choose())`. What does this function do? Would you recommend using it in a reproducible document?
- Run the `getwd()` function from the R console. What does this function return?
Excel Files
Although I commonly use text files (e.g., csv files), the reality is that many people still use Excel for storing data. There are good and bad aspects of this, but reading in Excel files may be needed. The `readxl` package is useful for this task.
Suppose we wished to read in the Excel file found on the US Census Bureau website related to Education: https://www.census.gov/support/USACdataDownloads.html
To do this, we can use the `read_excel` function with the data already downloaded and posted to the course website. Note, the `read_excel()` function does not allow reading in data from the web; instead, the data need to be downloaded to a temp file, and this file is then loaded into R. If you downloaded the data, I recommend placing it within a “Data” or “data” folder.
tf <- tempfile(fileext = ".xls")
curl::curl_download("https://github.com/lebebr01/psqf-6250-blogdown/blob/main/data/EDU01.xls?raw=true", tf)
read_excel(tf)
## # A tibble: 3,198 × 42
## Area_name STCOU EDU010187F EDU010187D EDU010187N1 EDU010187N2 EDU010188F
## <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 UNITED STATES 00000 0 40024299 0000 0000 0
## 2 ALABAMA 01000 0 733735 0000 0000 0
## 3 Autauga, AL 01001 0 6829 0000 0000 0
## 4 Baldwin, AL 01003 0 16417 0000 0000 0
## 5 Barbour, AL 01005 0 5071 0000 0000 0
## 6 Bibb, AL 01007 0 3557 0000 0000 0
## 7 Blount, AL 01009 0 7319 0000 0000 0
## 8 Bullock, AL 01011 0 2014 0000 0000 0
## 9 Butler, AL 01013 0 4640 0000 0000 0
## 10 Calhoun, AL 01015 0 20939 0000 0000 0
## # … with 3,188 more rows, and 35 more variables: EDU010188D <dbl>,
## # EDU010188N1 <chr>, EDU010188N2 <chr>, EDU010189F <dbl>, EDU010189D <dbl>,
## # EDU010189N1 <chr>, EDU010189N2 <chr>, EDU010190F <dbl>, EDU010190D <dbl>,
## # EDU010190N1 <chr>, EDU010190N2 <chr>, EDU010191F <dbl>, EDU010191D <dbl>,
## # EDU010191N1 <chr>, EDU010191N2 <chr>, EDU010192F <dbl>, EDU010192D <dbl>,
## # EDU010192N1 <chr>, EDU010192N2 <chr>, EDU010193F <dbl>, EDU010193D <dbl>,
## # EDU010193N1 <chr>, EDU010193N2 <chr>, EDU010194F <dbl>, EDU010194D <dbl>, …
By default, the function will read in the first sheet and will treat the first row as the column names. If you wish to read in another sheet, you can use the `sheet` argument, which accepts either a sheet position or a sheet name. For example:
read_excel(tf, sheet = 2)
## # A tibble: 3,198 × 42
## Area_name STCOU EDU010197F EDU010197D EDU010197N1 EDU010197N2 EDU010198F
## <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 UNITED STATES 00000 0 44534459 0000 0000 0
## 2 ALABAMA 01000 0 737386 0000 0000 0
## 3 Autauga, AL 01001 0 8099 0000 0000 0
## 4 Baldwin, AL 01003 0 21410 0000 0000 0
## 5 Barbour, AL 01005 0 5100 0000 0000 0
## 6 Bibb, AL 01007 0 3717 0000 0000 0
## 7 Blount, AL 01009 0 7816 0000 0000 0
## 8 Bullock, AL 01011 0 2010 0000 0000 0
## 9 Butler, AL 01013 0 4119 0000 0000 0
## 10 Calhoun, AL 01015 0 19721 0000 0000 0
## # … with 3,188 more rows, and 35 more variables: EDU010198D <dbl>,
## # EDU010198N1 <chr>, EDU010198N2 <chr>, EDU010199F <dbl>, EDU010199D <dbl>,
## # EDU010199N1 <chr>, EDU010199N2 <chr>, EDU010200F <dbl>, EDU010200D <dbl>,
## # EDU010200N1 <chr>, EDU010200N2 <chr>, EDU010201F <dbl>, EDU010201D <dbl>,
## # EDU010201N1 <chr>, EDU010201N2 <chr>, EDU010202F <dbl>, EDU010202D <dbl>,
## # EDU010202N1 <chr>, EDU010202N2 <chr>, EDU015203F <dbl>, EDU015203D <dbl>,
## # EDU015203N1 <chr>, EDU015203N2 <chr>, EDU015204F <dbl>, EDU015204D <dbl>, …
read_excel(tf, sheet = 'EDU01B')
## # A tibble: 3,198 × 42
## Area_name STCOU EDU010197F EDU010197D EDU010197N1 EDU010197N2 EDU010198F
## <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 UNITED STATES 00000 0 44534459 0000 0000 0
## 2 ALABAMA 01000 0 737386 0000 0000 0
## 3 Autauga, AL 01001 0 8099 0000 0000 0
## 4 Baldwin, AL 01003 0 21410 0000 0000 0
## 5 Barbour, AL 01005 0 5100 0000 0000 0
## 6 Bibb, AL 01007 0 3717 0000 0000 0
## 7 Blount, AL 01009 0 7816 0000 0000 0
## 8 Bullock, AL 01011 0 2010 0000 0000 0
## 9 Butler, AL 01013 0 4119 0000 0000 0
## 10 Calhoun, AL 01015 0 19721 0000 0000 0
## # … with 3,188 more rows, and 35 more variables: EDU010198D <dbl>,
## # EDU010198N1 <chr>, EDU010198N2 <chr>, EDU010199F <dbl>, EDU010199D <dbl>,
## # EDU010199N1 <chr>, EDU010199N2 <chr>, EDU010200F <dbl>, EDU010200D <dbl>,
## # EDU010200N1 <chr>, EDU010200N2 <chr>, EDU010201F <dbl>, EDU010201D <dbl>,
## # EDU010201N1 <chr>, EDU010201N2 <chr>, EDU010202F <dbl>, EDU010202D <dbl>,
## # EDU010202N1 <chr>, EDU010202N2 <chr>, EDU015203F <dbl>, EDU015203D <dbl>,
## # EDU015203N1 <chr>, EDU015203N2 <chr>, EDU015204F <dbl>, EDU015204D <dbl>, …
If there is header metadata or there are no column names, these can be handled in the same fashion as discussed above for the `read_csv` function. Finally, note that to use these data within R, you would want to save them to an object.
edu_data <- read_excel(tf)
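When you are unsure what a workbook's sheets are named, the `excel_sheets` function lists them; the sketch below uses an example workbook that ships with the readxl package rather than the Census file:

```r
library(readxl)

# excel_sheets() returns the sheet names of a workbook; the names can
# then be passed to the sheet argument of read_excel()
path <- readxl_example("datasets.xlsx")
excel_sheets(path)

first_sheet <- read_excel(path, sheet = excel_sheets(path)[1])
```

Reading sheets by name rather than by position is safer, since sheet order can change when a workbook is edited.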
Writing Files
Most of the `read_*` functions come with companion functions that allow you to write out files as well. I am only going to cover the `write_csv` function; however, there are others that may be of use. As with reading in files, the functionality is the same across the `write_*` functions.
Suppose we created a new column in the `ufo` data and wished to save the result to a csv file. This can be accomplished with the following series of commands.
ufo_count <- ufo %>%
group_by(State) %>%
mutate(num_state = n())
write_csv(ufo_count, file = 'path/to/save/file.csv')
Notice there are two arguments to the `write_csv` function: the first argument is the object you wish to save, and the second is the path to the location where the file should be saved. In current versions of readr this second argument is named `file` (the older `path` argument is deprecated).
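A quick round trip makes the pairing concrete; this sketch writes the built-in `mtcars` data to a temporary file and reads it back:

```r
library(readr)

# Write the built-in mtcars data to a temporary csv, then read it back
tmp <- tempfile(fileext = ".csv")
write_csv(mtcars, tmp)
back <- read_csv(tmp, show_col_types = FALSE)
dim(back)   # 32 rows, 11 columns
```

Note that `write_csv` does not save row names, so data kept in row names (as in `mtcars`) would need to be moved into a column first.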
Other Data Formats
There are still other data formats, particularly from proprietary statistical software such as Stata, SAS, or SPSS. To read these files in, the `haven` package is useful. I leave this as an exercise for you if you have these types of files to read into R.
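A brief sketch of what this looks like (not run here; the file paths are placeholders, and the package would need to be installed first):

```r
# install.packages("haven")
# library(haven)
#
# Each function returns a tibble, just like the readr functions:
# spss_data  <- read_sav("data/survey.sav")      # SPSS
# stata_data <- read_dta("data/survey.dta")      # Stata
# sas_data   <- read_sas("data/survey.sas7bdat") # SAS
```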