Joining Data

Another common data manipulation task is to join multiple data sources into a single data file for an analysis. This task is most easily accomplished using a set of join functions found in the dplyr package. In this set of notes we are going to focus on mutating joins and filtering joins. There is another class of joins called set operations. I use these much less frequently, but for those interested, see the text in the R for Data Science book http://r4ds.had.co.nz/relational-data.html.

For this set of notes, we are going to make use of two packages:

library(tidyverse)
# install.packages('Lahman')
library(Lahman)

The Lahman package contains data from the Major League Baseball (MLB), a professional baseball association in the United States. For this section, we are going to focus on the following three data tables, Teams, Salaries, and Managers. I print the first ten rows of the data for each table below.

head(Teams, n = 10)
##    yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1    1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2    1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3    1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4    1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5    1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
## 6    1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
## 7    1871   NA    RC1      ROK  <NA>    9 25    NA  4 21   <NA>  <NA>     N
## 8    1871   NA    TRO      TRO  <NA>    6 29    NA 13 15   <NA>  <NA>     N
## 9    1871   NA    WS3      OLY  <NA>    4 32    NA 15 15   <NA>  <NA>     N
## 10   1872   NA    BL1      BLC  <NA>    2 58    NA 35 19   <NA>  <NA>     N
##    WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1   <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2   <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3   <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4   <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5   <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
## 6   <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
## 7   <NA> 231 1036 274  44  25  3 38 30 53 10  NA NA 287 108 4.30 23   1  0
## 8   <NA> 351 1248 384  51  34  6 49 19 62 24  NA NA 362 153 5.51 28   0  0
## 9   <NA> 310 1353 375  54  26  6 48 13 48 13  NA NA 303 137 4.37 32   0  0
## 10  <NA> 617 2571 753 106  31 14 29 28 53 18  NA NA 434 166 2.90 48   1  1
##    IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1     828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2     753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3     762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4     507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5     879 373   7  42  22 235 14 0.840        New York Mutuals
## 6     747 329   3  53  16 194 13 0.845  Philadelphia Athletics
## 7     678 315   3  34  16 220 14 0.821   Rockford Forest Citys
## 8     750 431   4  75  12 198 22 0.845          Troy Haymakers
## 9     846 371   4  45  13 218 20 0.850     Washington Olympics
## 10   1548 573   3  63  77 432 22 0.830      Baltimore Canaries
##                                 park attendance BPF PPF teamIDBR teamIDlahman45
## 1                South End Grounds I         NA 103  98      BOS            BS1
## 2            Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3       National Association Grounds         NA  96 100      CLE            CL1
## 4                     Hamilton Field         NA 101 107      KEK            FW1
## 5           Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6           Jefferson Street Grounds         NA 102  98      ATH            PH1
## 7  Agricultural Society Fair Grounds         NA  97  99      ROK            RC1
## 8                 Haymakers' Grounds         NA 101 100      TRO            TRO
## 9                   Olympics Grounds         NA  94  98      OLY            WS3
## 10                    Newington Park         NA 106 102      BAL            BL1
##    teamIDretro
## 1          BS1
## 2          CH1
## 3          CL1
## 4          FW1
## 5          NY2
## 6          PH1
## 7          RC1
## 8          TRO
## 9          WS3
## 10         BL1
head(Salaries, n = 10)
##    yearID teamID lgID  playerID salary
## 1    1985    ATL   NL barkele01 870000
## 2    1985    ATL   NL bedrost01 550000
## 3    1985    ATL   NL benedbr01 545000
## 4    1985    ATL   NL  campri01 633333
## 5    1985    ATL   NL ceronri01 625000
## 6    1985    ATL   NL chambch01 800000
## 7    1985    ATL   NL dedmoje01 150000
## 8    1985    ATL   NL forstte01 483333
## 9    1985    ATL   NL garbege01 772000
## 10   1985    ATL   NL harpete01 250000
head(Managers, n = 10)
##     playerID yearID teamID lgID inseason  G  W  L rank plyrMgr
## 1  wrighha01   1871    BS1   NA        1 31 20 10    3       Y
## 2   woodji01   1871    CH1   NA        1 28 19  9    2       Y
## 3  paborch01   1871    CL1   NA        1 29 10 19    8       Y
## 4  lennobi01   1871    FW1   NA        1 14  5  9    8       Y
## 5  deaneha01   1871    FW1   NA        2  5  2  3    8       Y
## 6  fergubo01   1871    NY2   NA        1 33 16 17    5       Y
## 7  mcbridi01   1871    PH1   NA        1 28 21  7    1       Y
## 8  hastisc01   1871    RC1   NA        1 25  4 21    9       Y
## 9   pikeli01   1871    TRO   NA        1  4  1  3    6       Y
## 10 cravebi01   1871    TRO   NA        2 25 12 12    6       Y

Inner Join

The most basic join is the inner join. This join takes two tables and returns values if key variables match in both tables. If rows do not match on the key variables, these observations are removed. Suppose for example, we wanted to select the rows that matched between the Teams and Salaries data. This would be useful for example if we wished to calculate the average salary of the players for each team for every year.

This join could be done with the inner_join function.

team_salary <- inner_join(Teams, Salaries)
## Joining, by = c("yearID", "lgID", "teamID")
head(team_salary, n = 10)
##    yearID lgID teamID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin
## 1    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 2    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 3    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 4    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 5    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 6    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 7    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 8    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 9    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 10   1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
##    WSWin   R   AB    H X2B X3B  HR  BB  SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 2      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 3      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 4      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 5      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 6      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 7      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 8      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 9      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 10     N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
##    IPouts   HA HRA BBA SOA   E  DP    FP           name
## 1    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 2    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 3    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 4    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 5    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 6    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 7    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 8    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 9    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 10   4372 1512 134 642 776 159 197 0.976 Atlanta Braves
##                             park attendance BPF PPF teamIDBR teamIDlahman45
## 1  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 2  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 3  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 4  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 5  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 6  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 7  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 8  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 9  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 10 Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
##    teamIDretro  playerID salary
## 1          ATL barkele01 870000
## 2          ATL bedrost01 550000
## 3          ATL benedbr01 545000
## 4          ATL  campri01 633333
## 5          ATL ceronri01 625000
## 6          ATL chambch01 800000
## 7          ATL dedmoje01 150000
## 8          ATL forstte01 483333
## 9          ATL garbege01 772000
## 10         ATL harpete01 250000

You’ll notice that there is only data from 1985 onward, the data in the Teams data from before 1985 have automatically been removed due to no matching data in the Salaries data. You may have also noticed, that I did not specify the variables to join by above, for interactive work this can be okay, but to be more reproducible, specifying the variables to join on would be better. The function call above can be modified to include this information.

team_salary <- inner_join(Teams, Salaries, by = c('yearID', 'teamID', 'lgID'))
head(team_salary, n = 10)
##    yearID lgID teamID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin
## 1    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 2    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 3    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 4    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 5    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 6    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 7    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 8    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 9    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 10   1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
##    WSWin   R   AB    H X2B X3B  HR  BB  SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 2      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 3      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 4      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 5      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 6      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 7      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 8      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 9      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 10     N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
##    IPouts   HA HRA BBA SOA   E  DP    FP           name
## 1    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 2    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 3    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 4    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 5    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 6    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 7    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 8    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 9    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 10   4372 1512 134 642 776 159 197 0.976 Atlanta Braves
##                             park attendance BPF PPF teamIDBR teamIDlahman45
## 1  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 2  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 3  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 4  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 5  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 6  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 7  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 8  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 9  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 10 Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
##    teamIDretro  playerID salary
## 1          ATL barkele01 870000
## 2          ATL bedrost01 550000
## 3          ATL benedbr01 545000
## 4          ATL  campri01 633333
## 5          ATL ceronri01 625000
## 6          ATL chambch01 800000
## 7          ATL dedmoje01 150000
## 8          ATL forstte01 483333
## 9          ATL garbege01 772000
## 10         ATL harpete01 250000

We could then use other dplyr verbs to calculate the average salary for every team by year and plot these.

team_salary %>%
  group_by(yearID, teamID) %>%
  summarise(avg_salary = mean(salary, na.rm = TRUE)) %>%
  ggplot(aes(x = yearID, y = avg_salary)) + 
  geom_line(size = 1) + 
  facet_wrap(~teamID)
## `summarise()` has grouped output by 'yearID'. You can override using the
## `.groups` argument.

Below is a diagram of the inner join found in the R for Data Science text, inner join diagram.

Left Join

This is by far the most common join I perform. Left join is more formally part of a group of operations called outer joins. Outer joins are useful when you want to use one data table as a base data set in which variables will be added to this data if the keys match. It is likely best shown with an example.

Suppose we wish to add the salary information to the Teams data. However, instead of using a inner_join, let’s use left_join to see the difference. Note: I print only 10 rows of data with the head() function. This part of the code below would generally not be used.

left_join(Teams, Salaries) %>%
  head(n = 10)
## Joining, by = c("yearID", "lgID", "teamID")
##    yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1    1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2    1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3    1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4    1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5    1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
## 6    1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
## 7    1871   NA    RC1      ROK  <NA>    9 25    NA  4 21   <NA>  <NA>     N
## 8    1871   NA    TRO      TRO  <NA>    6 29    NA 13 15   <NA>  <NA>     N
## 9    1871   NA    WS3      OLY  <NA>    4 32    NA 15 15   <NA>  <NA>     N
## 10   1872   NA    BL1      BLC  <NA>    2 58    NA 35 19   <NA>  <NA>     N
##    WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1   <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2   <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3   <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4   <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5   <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
## 6   <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
## 7   <NA> 231 1036 274  44  25  3 38 30 53 10  NA NA 287 108 4.30 23   1  0
## 8   <NA> 351 1248 384  51  34  6 49 19 62 24  NA NA 362 153 5.51 28   0  0
## 9   <NA> 310 1353 375  54  26  6 48 13 48 13  NA NA 303 137 4.37 32   0  0
## 10  <NA> 617 2571 753 106  31 14 29 28 53 18  NA NA 434 166 2.90 48   1  1
##    IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1     828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2     753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3     762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4     507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5     879 373   7  42  22 235 14 0.840        New York Mutuals
## 6     747 329   3  53  16 194 13 0.845  Philadelphia Athletics
## 7     678 315   3  34  16 220 14 0.821   Rockford Forest Citys
## 8     750 431   4  75  12 198 22 0.845          Troy Haymakers
## 9     846 371   4  45  13 218 20 0.850     Washington Olympics
## 10   1548 573   3  63  77 432 22 0.830      Baltimore Canaries
##                                 park attendance BPF PPF teamIDBR teamIDlahman45
## 1                South End Grounds I         NA 103  98      BOS            BS1
## 2            Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3       National Association Grounds         NA  96 100      CLE            CL1
## 4                     Hamilton Field         NA 101 107      KEK            FW1
## 5           Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6           Jefferson Street Grounds         NA 102  98      ATH            PH1
## 7  Agricultural Society Fair Grounds         NA  97  99      ROK            RC1
## 8                 Haymakers' Grounds         NA 101 100      TRO            TRO
## 9                   Olympics Grounds         NA  94  98      OLY            WS3
## 10                    Newington Park         NA 106 102      BAL            BL1
##    teamIDretro playerID salary
## 1          BS1     <NA>     NA
## 2          CH1     <NA>     NA
## 3          CL1     <NA>     NA
## 4          FW1     <NA>     NA
## 5          NY2     <NA>     NA
## 6          PH1     <NA>     NA
## 7          RC1     <NA>     NA
## 8          TRO     <NA>     NA
## 9          WS3     <NA>     NA
## 10         BL1     <NA>     NA

The first thing to notice is that now there are years in the yearID variable from before 1985, this was not the case in the above data joined using inner_join. If you scroll over to explore variables to the right, there are missing values for the salary variable. What left_join does when it doesn’t find a match in the table is to produce NA values, so all records within the joined data will be NA before 1985.

This is the major difference between outer joins and inner joins. Outer joins will preserve data in the keyed data that do not match and NA values are returned for non-matching values. For inner joins, any keys that do not match are removed.

Right Join

A right join is similar to a left join, except the keyed table is the second one specified (the rightmost data). For example, if we wished for the salary information to be the keyed table, we could do that same specification as above, but use right_join instead of left_join. Note: I print only 10 rows of data with the head() function. This part of the code below would generally not be used.

right_join(Teams, Salaries) %>%
  head(n = 10)
## Joining, by = c("yearID", "lgID", "teamID")
##    yearID lgID teamID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin
## 1    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 2    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 3    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 4    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 5    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 6    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 7    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 8    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 9    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 10   1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
##    WSWin   R   AB    H X2B X3B  HR  BB  SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 2      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 3      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 4      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 5      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 6      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 7      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 8      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 9      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 10     N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
##    IPouts   HA HRA BBA SOA   E  DP    FP           name
## 1    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 2    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 3    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 4    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 5    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 6    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 7    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 8    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 9    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 10   4372 1512 134 642 776 159 197 0.976 Atlanta Braves
##                             park attendance BPF PPF teamIDBR teamIDlahman45
## 1  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 2  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 3  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 4  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 5  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 6  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 7  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 8  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 9  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 10 Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
##    teamIDretro  playerID salary
## 1          ATL barkele01 870000
## 2          ATL bedrost01 550000
## 3          ATL benedbr01 545000
## 4          ATL  campri01 633333
## 5          ATL ceronri01 625000
## 6          ATL chambch01 800000
## 7          ATL dedmoje01 150000
## 8          ATL forstte01 483333
## 9          ATL garbege01 772000
## 10         ATL harpete01 250000

This data is very similar (although not identical) to the one from the inner join above. Can you spot what is different?

Full Join

Full join is the last type of outer join and this will return all values from both tables and NAs will be given for those keys that do not match. For example,

full_join(Teams, Salaries) %>%
  head(n = 10)
## Joining, by = c("yearID", "lgID", "teamID")
##    yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1    1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2    1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3    1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4    1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5    1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
## 6    1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
## 7    1871   NA    RC1      ROK  <NA>    9 25    NA  4 21   <NA>  <NA>     N
## 8    1871   NA    TRO      TRO  <NA>    6 29    NA 13 15   <NA>  <NA>     N
## 9    1871   NA    WS3      OLY  <NA>    4 32    NA 15 15   <NA>  <NA>     N
## 10   1872   NA    BL1      BLC  <NA>    2 58    NA 35 19   <NA>  <NA>     N
##    WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1   <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2   <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3   <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4   <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5   <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
## 6   <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
## 7   <NA> 231 1036 274  44  25  3 38 30 53 10  NA NA 287 108 4.30 23   1  0
## 8   <NA> 351 1248 384  51  34  6 49 19 62 24  NA NA 362 153 5.51 28   0  0
## 9   <NA> 310 1353 375  54  26  6 48 13 48 13  NA NA 303 137 4.37 32   0  0
## 10  <NA> 617 2571 753 106  31 14 29 28 53 18  NA NA 434 166 2.90 48   1  1
##    IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1     828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2     753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3     762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4     507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5     879 373   7  42  22 235 14 0.840        New York Mutuals
## 6     747 329   3  53  16 194 13 0.845  Philadelphia Athletics
## 7     678 315   3  34  16 220 14 0.821   Rockford Forest Citys
## 8     750 431   4  75  12 198 22 0.845          Troy Haymakers
## 9     846 371   4  45  13 218 20 0.850     Washington Olympics
## 10   1548 573   3  63  77 432 22 0.830      Baltimore Canaries
##                                 park attendance BPF PPF teamIDBR teamIDlahman45
## 1                South End Grounds I         NA 103  98      BOS            BS1
## 2            Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3       National Association Grounds         NA  96 100      CLE            CL1
## 4                     Hamilton Field         NA 101 107      KEK            FW1
## 5           Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6           Jefferson Street Grounds         NA 102  98      ATH            PH1
## 7  Agricultural Society Fair Grounds         NA  97  99      ROK            RC1
## 8                 Haymakers' Grounds         NA 101 100      TRO            TRO
## 9                   Olympics Grounds         NA  94  98      OLY            WS3
## 10                    Newington Park         NA 106 102      BAL            BL1
##    teamIDretro playerID salary
## 1          BS1     <NA>     NA
## 2          CH1     <NA>     NA
## 3          CL1     <NA>     NA
## 4          FW1     <NA>     NA
## 5          NY2     <NA>     NA
## 6          PH1     <NA>     NA
## 7          RC1     <NA>     NA
## 8          TRO     <NA>     NA
## 9          WS3     <NA>     NA
## 10         BL1     <NA>     NA

Note: I print only 10 rows of data with the head() function. This part of the code below would generally not be used.

This data is very similar to the left join above, but not identical, can you tell the difference again?

Below is a diagram of the differences between the three outer joins from the R for Data Science text, outer joins diagram.

Filtering Joins

I tend to not use filtering joins, however, these are useful to connect summary data back to the original rows in the data. For example, using the team_salary data created above, let’s select only the top 10 teams in terms of average salary from the year 2015.

top_salary_15 <- team_salary %>%
  group_by(yearID, teamID) %>%
  summarise(avg_salary = mean(salary, na.rm = TRUE)) %>%
  filter(yearID == 2015) %>%
  arrange(desc(avg_salary)) %>%
  head(10)
## `summarise()` has grouped output by 'yearID'. You can override using the
## `.groups` argument.
top_salary_15
## # A tibble: 10 × 3
## # Groups:   yearID [1]
##    yearID teamID avg_salary
##     <int> <fct>       <dbl>
##  1   2015 LAN      7441103.
##  2   2015 NYA      7336274.
##  3   2015 DET      6891390 
##  4   2015 SFN      6100056.
##  5   2015 BOS      5659481.
##  6   2015 WAS      5365085.
##  7   2015 SEA      4888348 
##  8   2015 TEX      4791426.
##  9   2015 SLN      4586212.
## 10   2015 SDN      4555435.

Although not impossible, it would be a bit more difficult to connect these teams and years back to the original data in the team_salary data. This can be done simply with a filtering join, namely a semi join.

team_salary %>%
  semi_join(top_salary_15) %>%
  head(n = 10)
## Joining, by = c("yearID", "teamID")
##    yearID lgID teamID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin
## 1    2015   AL    BOS      BOS     E    5 162    81 78 84      N     N     N
## 2    2015   AL    BOS      BOS     E    5 162    81 78 84      N     N     N
## 3    2015   AL    BOS      BOS     E    5 162    81 78 84      N     N     N
## 4    2015   AL    BOS      BOS     E    5 162    81 78 84      N     N     N
## 5    2015   AL    BOS      BOS     E    5 162    81 78 84      N     N     N
## 6    2015   AL    BOS      BOS     E    5 162    81 78 84      N     N     N
## 7    2015   AL    BOS      BOS     E    5 162    81 78 84      N     N     N
## 8    2015   AL    BOS      BOS     E    5 162    81 78 84      N     N     N
## 9    2015   AL    BOS      BOS     E    5 162    81 78 84      N     N     N
## 10   2015   AL    BOS      BOS     E    5 162    81 78 84      N     N     N
##    WSWin   R   AB    H X2B X3B  HR  BB   SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1      N 748 5640 1495 294  33 161 478 1148 71 27  46 42 753 694 4.31  3  10 40
## 2      N 748 5640 1495 294  33 161 478 1148 71 27  46 42 753 694 4.31  3  10 40
## 3      N 748 5640 1495 294  33 161 478 1148 71 27  46 42 753 694 4.31  3  10 40
## 4      N 748 5640 1495 294  33 161 478 1148 71 27  46 42 753 694 4.31  3  10 40
## 5      N 748 5640 1495 294  33 161 478 1148 71 27  46 42 753 694 4.31  3  10 40
## 6      N 748 5640 1495 294  33 161 478 1148 71 27  46 42 753 694 4.31  3  10 40
## 7      N 748 5640 1495 294  33 161 478 1148 71 27  46 42 753 694 4.31  3  10 40
## 8      N 748 5640 1495 294  33 161 478 1148 71 27  46 42 753 694 4.31  3  10 40
## 9      N 748 5640 1495 294  33 161 478 1148 71 27  46 42 753 694 4.31  3  10 40
## 10     N 748 5640 1495 294  33 161 478 1148 71 27  46 42 753 694 4.31  3  10 40
##    IPouts   HA HRA BBA  SOA  E  DP    FP           name           park
## 1    4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 2    4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 3    4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 4    4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 5    4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 6    4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 7    4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 8    4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 9    4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
## 10   4345 1486 178 478 1218 97 148 0.984 Boston Red Sox Fenway Park II
##    attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro  playerID   salary
## 1     2880694 104 107      BOS            BOS         BOS barnema01   508500
## 2     2880694 104 107      BOS            BOS         BOS bettsmo01   514500
## 3     2880694 104 107      BOS            BOS         BOS bogaexa01   543000
## 4     2880694 104 107      BOS            BOS         BOS bradlja02   528000
## 5     2880694 104 107      BOS            BOS         BOS breslcr01  2000000
## 6     2880694 104 107      BOS            BOS         BOS buchhcl01 12000000
## 7     2880694 104 107      BOS            BOS         BOS castiru01 11271000
## 8     2880694 104 107      BOS            BOS         BOS cecchga01   508500
## 9     2880694 104 107      BOS            BOS         BOS craigal01  5500000
## 10    2880694 104 107      BOS            BOS         BOS hanigry01  3500000

Note: I print only 10 rows of data with the head() function. This part of the code below would generally not be used.

This operation selected only the rows that had the matching keys from the first table (note that the columns were not touched).

The opposite operation is to use an anti join, in this type of join, the rows that do not match will be returned.

team_salary %>%
  anti_join(top_salary_15) %>%
  head(n = 10)
## Joining, by = c("yearID", "teamID")
##    yearID lgID teamID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin
## 1    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 2    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 3    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 4    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 5    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 6    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 7    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 8    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 9    1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
## 10   1985   NL    ATL      ATL     W    5 162    81 66 96      N  <NA>     N
##    WSWin   R   AB    H X2B X3B  HR  BB  SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 2      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 3      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 4      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 5      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 6      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 7      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 8      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 9      N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
## 10     N 632 5526 1359 213  28 126 553 849 72 52  22 41 781 679 4.19  9   9 29
##    IPouts   HA HRA BBA SOA   E  DP    FP           name
## 1    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 2    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 3    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 4    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 5    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 6    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 7    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 8    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 9    4372 1512 134 642 776 159 197 0.976 Atlanta Braves
## 10   4372 1512 134 642 776 159 197 0.976 Atlanta Braves
##                             park attendance BPF PPF teamIDBR teamIDlahman45
## 1  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 2  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 3  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 4  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 5  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 6  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 7  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 8  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 9  Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
## 10 Atlanta-Fulton County Stadium    1350137 105 106      ATL            ATL
##    teamIDretro  playerID salary
## 1          ATL barkele01 870000
## 2          ATL bedrost01 550000
## 3          ATL benedbr01 545000
## 4          ATL  campri01 633333
## 5          ATL ceronri01 625000
## 6          ATL chambch01 800000
## 7          ATL dedmoje01 150000
## 8          ATL forstte01 483333
## 9          ATL garbege01 772000
## 10         ATL harpete01 250000

Note: I print only 10 rows of data with the head() function. This part of the code below would generally not be used.

The sum of the number of rows in these two tables should equal the number of rows from the entire team_salary data table

anti_rows <- team_salary %>%
  anti_join(top_salary_15) %>%
  nrow()
## Joining, by = c("yearID", "teamID")
semi_rows <- team_salary %>%
  semi_join(top_salary_15) %>%
  nrow()
## Joining, by = c("yearID", "teamID")
anti_rows + semi_rows == nrow(team_salary)
## [1] TRUE

Exercises

  1. Using the Teams and Managers data, join the two tables and only keep the matching observations in both tables. Note, you may need to specify the column names directly you wish to join by. What happens to the columns that have the same names but are not keys?
  2. Using the same data tables from #1, add all the Managers variables to the Teams data while retaining all the rows for the Teams data.
Previous
Next