Joining Data
Another common data manipulation task is to join multiple data sources into a single data file for an analysis. This task is most easily accomplished using a set of join
functions found in the dplyr
package. In this set of notes we are going to focus on mutating joins and filtering joins. There is another class of joins called set operations. I use these much less frequently, but for those interested, see the text in the R for Data Science book http://r4ds.had.co.nz/relational-data.html.
For this set of notes, we are going to make use of two packages:
library(tidyverse)
# install.packages('Lahman')
library(Lahman)
The Lahman
package contains data from the Major League Baseball (MLB), a professional baseball association in the United States. For this section, we are going to focus on the following three data tables, Teams
, Salaries
, and Managers
. I print the first ten rows of the data for each table below.
head(Teams, n = 10)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA> N
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA> N
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA> N
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA> N
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA> N
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA> Y
## 7 1871 NA RC1 ROK <NA> 9 25 NA 4 21 <NA> <NA> N
## 8 1871 NA TRO TRO <NA> 6 29 NA 13 15 <NA> <NA> N
## 9 1871 NA WS3 OLY <NA> 4 32 NA 15 15 <NA> <NA> N
## 10 1872 NA BL1 BLC <NA> 2 58 NA 35 19 <NA> <NA> N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72 32 1 0
## 6 <NA> 376 1281 410 66 27 9 46 23 56 12 NA NA 266 137 4.95 27 0 0
## 7 <NA> 231 1036 274 44 25 3 38 30 53 10 NA NA 287 108 4.30 23 1 0
## 8 <NA> 351 1248 384 51 34 6 49 19 62 24 NA NA 362 153 5.51 28 0 0
## 9 <NA> 310 1353 375 54 26 6 48 13 48 13 NA NA 303 137 4.37 32 0 0
## 10 <NA> 617 2571 753 106 31 14 29 28 53 18 NA NA 434 166 2.90 48 1 1
## IPouts HA HRA BBA SOA E DP FP name
## 1 828 367 2 42 23 243 24 0.834 Boston Red Stockings
## 2 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5 879 373 7 42 22 235 14 0.840 New York Mutuals
## 6 747 329 3 53 16 194 13 0.845 Philadelphia Athletics
## 7 678 315 3 34 16 220 14 0.821 Rockford Forest Citys
## 8 750 431 4 75 12 198 22 0.845 Troy Haymakers
## 9 846 371 4 45 13 218 20 0.850 Washington Olympics
## 10 1548 573 3 63 77 432 22 0.830 Baltimore Canaries
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Jefferson Street Grounds NA 102 98 ATH PH1
## 7 Agricultural Society Fair Grounds NA 97 99 ROK RC1
## 8 Haymakers' Grounds NA 101 100 TRO TRO
## 9 Olympics Grounds NA 94 98 OLY WS3
## 10 Newington Park NA 106 102 BAL BL1
## teamIDretro
## 1 BS1
## 2 CH1
## 3 CL1
## 4 FW1
## 5 NY2
## 6 PH1
## 7 RC1
## 8 TRO
## 9 WS3
## 10 BL1
head(Salaries, n = 10)
## yearID teamID lgID playerID salary
## 1 1985 ATL NL barkele01 870000
## 2 1985 ATL NL bedrost01 550000
## 3 1985 ATL NL benedbr01 545000
## 4 1985 ATL NL campri01 633333
## 5 1985 ATL NL ceronri01 625000
## 6 1985 ATL NL chambch01 800000
## 7 1985 ATL NL dedmoje01 150000
## 8 1985 ATL NL forstte01 483333
## 9 1985 ATL NL garbege01 772000
## 10 1985 ATL NL harpete01 250000
head(Managers, n = 10)
## playerID yearID teamID lgID inseason G W L rank plyrMgr
## 1 wrighha01 1871 BS1 NA 1 31 20 10 3 Y
## 2 woodji01 1871 CH1 NA 1 28 19 9 2 Y
## 3 paborch01 1871 CL1 NA 1 29 10 19 8 Y
## 4 lennobi01 1871 FW1 NA 1 14 5 9 8 Y
## 5 deaneha01 1871 FW1 NA 2 5 2 3 8 Y
## 6 fergubo01 1871 NY2 NA 1 33 16 17 5 Y
## 7 mcbridi01 1871 PH1 NA 1 28 21 7 1 Y
## 8 hastisc01 1871 RC1 NA 1 25 4 21 9 Y
## 9 pikeli01 1871 TRO NA 1 4 1 3 6 Y
## 10 cravebi01 1871 TRO NA 2 25 12 12 6 Y
Inner Join
The most basic join is the inner join. This join takes two tables and returns values if key variables match in both tables. If rows do not match on the key variables, these observations are removed. Suppose for example, we wanted to select the rows that matched between the Teams
and Salaries
data. This would be useful for example if we wished to calculate the average salary of the players for each team for every year.
This join could be done with the inner_join
function.
team_salary <- inner_join(Teams, Salaries)
## Joining, by = c("yearID", "lgID", "teamID")
head(team_salary, n = 10)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 2 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 3 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 4 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 5 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 6 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 7 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 8 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 9 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 10 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 2 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 3 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 4 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 5 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 6 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 7 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 8 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 9 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 10 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## IPouts HA HRA BBA SOA E DP FP name
## 1 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 2 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 3 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 4 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 5 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 6 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 7 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 8 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 9 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 10 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 2 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 3 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 4 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 5 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 6 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 7 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 8 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 9 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 10 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## teamIDretro playerID salary
## 1 ATL barkele01 870000
## 2 ATL bedrost01 550000
## 3 ATL benedbr01 545000
## 4 ATL campri01 633333
## 5 ATL ceronri01 625000
## 6 ATL chambch01 800000
## 7 ATL dedmoje01 150000
## 8 ATL forstte01 483333
## 9 ATL garbege01 772000
## 10 ATL harpete01 250000
You’ll notice that there is only data from 1985 onward, the data in the Teams
data from before 1985 have automatically been removed due to no matching data in the Salaries
data. You may have also noticed, that I did not specify the variables to join by above, for interactive work this can be okay, but to be more reproducible, specifying the variables to join on would be better. The function call above can be modified to include this information.
team_salary <- inner_join(Teams, Salaries, by = c('yearID', 'teamID', 'lgID'))
head(team_salary, n = 10)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 2 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 3 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 4 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 5 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 6 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 7 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 8 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 9 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 10 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 2 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 3 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 4 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 5 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 6 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 7 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 8 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 9 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 10 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## IPouts HA HRA BBA SOA E DP FP name
## 1 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 2 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 3 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 4 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 5 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 6 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 7 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 8 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 9 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 10 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 2 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 3 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 4 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 5 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 6 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 7 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 8 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 9 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 10 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## teamIDretro playerID salary
## 1 ATL barkele01 870000
## 2 ATL bedrost01 550000
## 3 ATL benedbr01 545000
## 4 ATL campri01 633333
## 5 ATL ceronri01 625000
## 6 ATL chambch01 800000
## 7 ATL dedmoje01 150000
## 8 ATL forstte01 483333
## 9 ATL garbege01 772000
## 10 ATL harpete01 250000
We could then use other dplyr
verbs to calculate the average salary for every team by year and plot these.
team_salary %>%
group_by(yearID, teamID) %>%
summarise(avg_salary = mean(salary, na.rm = TRUE)) %>%
ggplot(aes(x = yearID, y = avg_salary)) +
geom_line(size = 1) +
facet_wrap(~teamID)
## `summarise()` has grouped output by 'yearID'. You can override using the
## `.groups` argument.
Below is a diagram of the inner join found in the R for Data Science text, inner join diagram.
Left Join
This is by far the most common join I perform. Left join is more formally part of a group of operations called outer joins. Outer joins are useful when you want to use one data table as a base data set in which variables will be added to this data if the keys match. It is likely best shown with an example.
Suppose we wish to add the salary information to the Teams
data. However, instead of using a inner_join
, let’s use left_join
to see the difference. Note: I print only 10 rows of data with the head()
function. This part of the code below would generally not be used.
left_join(Teams, Salaries) %>%
head(n = 10)
## Joining, by = c("yearID", "lgID", "teamID")
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA> N
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA> N
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA> N
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA> N
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA> N
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA> Y
## 7 1871 NA RC1 ROK <NA> 9 25 NA 4 21 <NA> <NA> N
## 8 1871 NA TRO TRO <NA> 6 29 NA 13 15 <NA> <NA> N
## 9 1871 NA WS3 OLY <NA> 4 32 NA 15 15 <NA> <NA> N
## 10 1872 NA BL1 BLC <NA> 2 58 NA 35 19 <NA> <NA> N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72 32 1 0
## 6 <NA> 376 1281 410 66 27 9 46 23 56 12 NA NA 266 137 4.95 27 0 0
## 7 <NA> 231 1036 274 44 25 3 38 30 53 10 NA NA 287 108 4.30 23 1 0
## 8 <NA> 351 1248 384 51 34 6 49 19 62 24 NA NA 362 153 5.51 28 0 0
## 9 <NA> 310 1353 375 54 26 6 48 13 48 13 NA NA 303 137 4.37 32 0 0
## 10 <NA> 617 2571 753 106 31 14 29 28 53 18 NA NA 434 166 2.90 48 1 1
## IPouts HA HRA BBA SOA E DP FP name
## 1 828 367 2 42 23 243 24 0.834 Boston Red Stockings
## 2 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5 879 373 7 42 22 235 14 0.840 New York Mutuals
## 6 747 329 3 53 16 194 13 0.845 Philadelphia Athletics
## 7 678 315 3 34 16 220 14 0.821 Rockford Forest Citys
## 8 750 431 4 75 12 198 22 0.845 Troy Haymakers
## 9 846 371 4 45 13 218 20 0.850 Washington Olympics
## 10 1548 573 3 63 77 432 22 0.830 Baltimore Canaries
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Jefferson Street Grounds NA 102 98 ATH PH1
## 7 Agricultural Society Fair Grounds NA 97 99 ROK RC1
## 8 Haymakers' Grounds NA 101 100 TRO TRO
## 9 Olympics Grounds NA 94 98 OLY WS3
## 10 Newington Park NA 106 102 BAL BL1
## teamIDretro playerID salary
## 1 BS1 <NA> NA
## 2 CH1 <NA> NA
## 3 CL1 <NA> NA
## 4 FW1 <NA> NA
## 5 NY2 <NA> NA
## 6 PH1 <NA> NA
## 7 RC1 <NA> NA
## 8 TRO <NA> NA
## 9 WS3 <NA> NA
## 10 BL1 <NA> NA
The first thing to notice is that now there are years in the yearID variable from before 1985, this was not the case in the above data joined using inner_join
. If you scroll over to explore variables to the right, there are missing values for the salary variable. What left_join
does when it doesn’t find a match in the table is to produce NA values, so all records within the joined data will be NA before 1985.
This is the major difference between outer joins and inner joins. Outer joins will preserve data in the keyed data that do not match and NA values are returned for non-matching values. For inner joins, any keys that do not match are removed.
Right Join
A right join is similar to a left join, except the keyed table is the second one specified (the rightmost data). For example, if we wished for the salary information to be the keyed table, we could do that same specification as above, but use right_join
instead of left_join
. Note: I print only 10 rows of data with the head()
function. This part of the code below would generally not be used.
right_join(Teams, Salaries) %>%
head(n = 10)
## Joining, by = c("yearID", "lgID", "teamID")
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 2 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 3 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 4 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 5 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 6 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 7 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 8 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 9 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 10 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 2 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 3 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 4 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 5 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 6 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 7 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 8 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 9 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 10 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## IPouts HA HRA BBA SOA E DP FP name
## 1 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 2 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 3 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 4 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 5 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 6 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 7 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 8 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 9 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 10 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 2 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 3 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 4 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 5 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 6 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 7 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 8 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 9 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 10 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## teamIDretro playerID salary
## 1 ATL barkele01 870000
## 2 ATL bedrost01 550000
## 3 ATL benedbr01 545000
## 4 ATL campri01 633333
## 5 ATL ceronri01 625000
## 6 ATL chambch01 800000
## 7 ATL dedmoje01 150000
## 8 ATL forstte01 483333
## 9 ATL garbege01 772000
## 10 ATL harpete01 250000
This data is very similar (although not identical) to the one from the inner join above. Can you spot what is different?
Full Join
Full join is the last type of outer join and this will return all values from both tables and NAs will be given for those keys that do not match. For example,
full_join(Teams, Salaries) %>%
head(n = 10)
## Joining, by = c("yearID", "lgID", "teamID")
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA> N
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA> N
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA> N
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA> N
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA> N
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA> Y
## 7 1871 NA RC1 ROK <NA> 9 25 NA 4 21 <NA> <NA> N
## 8 1871 NA TRO TRO <NA> 6 29 NA 13 15 <NA> <NA> N
## 9 1871 NA WS3 OLY <NA> 4 32 NA 15 15 <NA> <NA> N
## 10 1872 NA BL1 BLC <NA> 2 58 NA 35 19 <NA> <NA> N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72 32 1 0
## 6 <NA> 376 1281 410 66 27 9 46 23 56 12 NA NA 266 137 4.95 27 0 0
## 7 <NA> 231 1036 274 44 25 3 38 30 53 10 NA NA 287 108 4.30 23 1 0
## 8 <NA> 351 1248 384 51 34 6 49 19 62 24 NA NA 362 153 5.51 28 0 0
## 9 <NA> 310 1353 375 54 26 6 48 13 48 13 NA NA 303 137 4.37 32 0 0
## 10 <NA> 617 2571 753 106 31 14 29 28 53 18 NA NA 434 166 2.90 48 1 1
## IPouts HA HRA BBA SOA E DP FP name
## 1 828 367 2 42 23 243 24 0.834 Boston Red Stockings
## 2 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5 879 373 7 42 22 235 14 0.840 New York Mutuals
## 6 747 329 3 53 16 194 13 0.845 Philadelphia Athletics
## 7 678 315 3 34 16 220 14 0.821 Rockford Forest Citys
## 8 750 431 4 75 12 198 22 0.845 Troy Haymakers
## 9 846 371 4 45 13 218 20 0.850 Washington Olympics
## 10 1548 573 3 63 77 432 22 0.830 Baltimore Canaries
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Jefferson Street Grounds NA 102 98 ATH PH1
## 7 Agricultural Society Fair Grounds NA 97 99 ROK RC1
## 8 Haymakers' Grounds NA 101 100 TRO TRO
## 9 Olympics Grounds NA 94 98 OLY WS3
## 10 Newington Park NA 106 102 BAL BL1
## teamIDretro playerID salary
## 1 BS1 <NA> NA
## 2 CH1 <NA> NA
## 3 CL1 <NA> NA
## 4 FW1 <NA> NA
## 5 NY2 <NA> NA
## 6 PH1 <NA> NA
## 7 RC1 <NA> NA
## 8 TRO <NA> NA
## 9 WS3 <NA> NA
## 10 BL1 <NA> NA
Note: I print only 10 rows of data with the head()
function. This part of the code below would generally not be used.
This data is very similar to the left join above, but not identical, can you tell the difference again?
Below is a diagram of the differences between the three outer joins from the R for Data Science text, outer joins diagram.
Filtering Joins
I tend to not use filtering joins, however, these are useful to connect summary data back to the original rows in the data. For example, using the team_salary
data created above, let’s select only the top 10 teams in terms of average salary from the year 2015.
top_salary_15 <- team_salary %>%
group_by(yearID, teamID) %>%
summarise(avg_salary = mean(salary, na.rm = TRUE)) %>%
filter(yearID == 2015) %>%
arrange(desc(avg_salary)) %>%
head(10)
## `summarise()` has grouped output by 'yearID'. You can override using the
## `.groups` argument.
top_salary_15
## # A tibble: 10 × 3
## # Groups: yearID [1]
## yearID teamID avg_salary
## <int> <fct> <dbl>
## 1 2015 LAN 7441103.
## 2 2015 NYA 7336274.
## 3 2015 DET 6891390
## 4 2015 SFN 6100056.
## 5 2015 BOS 5659481.
## 6 2015 WAS 5365085.
## 7 2015 SEA 4888348
## 8 2015 TEX 4791426.
## 9 2015 SLN 4586212.
## 10 2015 SDN 4555435.
Although not impossible, it would be a bit more difficult to connect these teams and years back to the original data in the team_salary
data. This can be done simply with a filtering join, namely a semi join.
team_salary %>%
semi_join(top_salary_15) %>%
head(n = 10)
## Joining, by = c("yearID", "teamID")
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 2015 AL BOS BOS E 5 162 81 78 84 N N N
## 2 2015 AL BOS BOS E 5 162 81 78 84 N N N
## 3 2015 AL BOS BOS E 5 162 81 78 84 N N N
## 4 2015 AL BOS BOS E 5 162 81 78 84 N N N
## 5 2015 AL BOS BOS E 5 162 81 78 84 N N N
## 6 2015 AL BOS BOS E 5 162 81 78 84 N N N
## 7 2015 AL BOS BOS E 5 162 81 78 84 N N N
## 8 2015 AL BOS BOS E 5 162 81 78 84 N N N
## 9 2015 AL BOS BOS E 5 162 81 78 84 N N N
## 10 2015 AL BOS BOS E 5 162 81 78 84 N N N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 N 748 5640 1495 294 33 161 478 1148 71 27 46 42 753 694 4.31 3 10 40
## 2 N 748 5640 1495 294 33 161 478 1148 71 27 46 42 753 694 4.31 3 10 40
## 3 N 748 5640 1495 294 33 161 478 1148 71 27 46 42 753 694 4.31 3 10 40
## 4 N 748 5640 1495 294 33 161 478 1148 71 27 46 42 753 694 4.31 3 10 40
## 5 N 748 5640 1495 294 33 161 478 1148 71 27 46 42 753 694 4.31 3 10 40
## 6 N 748 5640 1495 294 33 161 478 1148 71 27 46 42 753 694 4.31 3 10 40
## 7 N 748 5640 1495 294 33 161 478 1148 71 27 46 42 753 694 4.31 3 10 40
## 8 N 748 5640 1495 294 33 161 478 1148 71 27 46 42 753 694 4.31 3 10 40
## 9 N 748 5640 1495 294 33 161 478 1148 71 27 46 42 753 694 4.31 3 10 40
## 10 N 748 5640 1495 294 33 161 478 1148 71 27 46 42 753 694 4.31 3 10 40
## IPouts HA HRA BBA SOA E DP FP name park
## 1 4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 2 4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 3 4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 4 4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 5 4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 6 4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 7 4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 8 4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 9 4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 10 4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro playerID salary
## 1 2880694 104 107 BOS BOS BOS barnema01 508500
## 2 2880694 104 107 BOS BOS BOS bettsmo01 514500
## 3 2880694 104 107 BOS BOS BOS bogaexa01 543000
## 4 2880694 104 107 BOS BOS BOS bradlja02 528000
## 5 2880694 104 107 BOS BOS BOS breslcr01 2000000
## 6 2880694 104 107 BOS BOS BOS buchhcl01 12000000
## 7 2880694 104 107 BOS BOS BOS castiru01 11271000
## 8 2880694 104 107 BOS BOS BOS cecchga01 508500
## 9 2880694 104 107 BOS BOS BOS craigal01 5500000
## 10 2880694 104 107 BOS BOS BOS hanigry01 3500000
Note: I print only 10 rows of data with the head()
function. This part of the code below would generally not be used.
This operation selected only the rows that had the matching keys from the first table (note that the columns were not touched).
The opposite operation is to use an anti join, in this type of join, the rows that do not match will be returned.
team_salary %>%
anti_join(top_salary_15) %>%
head(n = 10)
## Joining, by = c("yearID", "teamID")
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 2 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 3 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 4 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 5 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 6 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 7 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 8 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 9 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## 10 1985 NL ATL ATL W 5 162 81 66 96 N <NA> N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 2 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 3 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 4 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 5 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 6 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 7 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 8 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 9 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## 10 N 632 5526 1359 213 28 126 553 849 72 52 22 41 781 679 4.19 9 9 29
## IPouts HA HRA BBA SOA E DP FP name
## 1 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 2 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 3 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 4 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 5 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 6 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 7 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 8 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 9 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 10 4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 2 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 3 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 4 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 5 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 6 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 7 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 8 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 9 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## 10 Atlanta-Fulton County Stadium 1350137 105 106 ATL ATL
## teamIDretro playerID salary
## 1 ATL barkele01 870000
## 2 ATL bedrost01 550000
## 3 ATL benedbr01 545000
## 4 ATL campri01 633333
## 5 ATL ceronri01 625000
## 6 ATL chambch01 800000
## 7 ATL dedmoje01 150000
## 8 ATL forstte01 483333
## 9 ATL garbege01 772000
## 10 ATL harpete01 250000
Note: I print only 10 rows of data with the head()
function. This part of the code below would generally not be used.
The sum of the number of rows in these two tables should equal the number of rows from the entire team_salary
data table
anti_rows <- team_salary %>%
anti_join(top_salary_15) %>%
nrow()
## Joining, by = c("yearID", "teamID")
semi_rows <- team_salary %>%
semi_join(top_salary_15) %>%
nrow()
## Joining, by = c("yearID", "teamID")
anti_rows + semi_rows == nrow(team_salary)
## [1] TRUE
Exercises
- Using the
Teams
andManagers
data, join the two tables and only keep the matching observations in both tables. Note, you may need to specify the column names directly you wish to join by. What happens to the columns that have the same names but are not keys? - Using the same data tables from #1, add all the
Managers
variables to theTeams
data while retaining all the rows for theTeams
data.