/usr/lib/R/site-library/dplyr/doc/two-table.Rmd is in r-cran-dplyr 0.7.4-3.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 | ---
title: "Two-table verbs"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Two-table verbs}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = T, comment = "#>")
options(tibble.print_min = 5)
library(dplyr)
knit_print.tbl_df <- function(x, options) {
knitr::knit_print(trunc_mat(x), options)
}
```
It's rare that a data analysis involves only a single table of data. In practice, you'll normally have many tables that contribute to an analysis, and you need flexible tools to combine them. In dplyr, there are three families of verbs that work with two tables at a time:
* Mutating joins, which add new variables to one table from matching rows in
another.
* Filtering joins, which filter observations from one table based on whether or
not they match an observation in the other table.
* Set operations, which combine the observations in the data sets as if they
were set elements.
(This discussion assumes that you have [tidy data](http://www.jstatsoft.org/v59/i10/), where the rows are observations and the columns are variables. If you're not familiar with that framework, I'd recommend reading up on it first.)
All two-table verbs work similarly. The first two arguments are `x` and `y`, and provide the tables to combine. The output is always a new table with the same type as `x`.
## Mutating joins
Mutating joins allow you to combine variables from multiple tables. For example, take the nycflights13 data. In one table we have flight information with an abbreviation for carrier, and in another we have a mapping between abbreviations and full names. You can use a join to add the carrier names to the flight data:
```{r, warning = FALSE}
library("nycflights13")
# Drop unimportant variables so it's easier to understand the join results.
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
flights2 %>%
left_join(airlines)
```
### Controlling how the tables are matched
As well as `x` and `y`, each mutating join takes an argument `by` that controls which variables are used to match observations in the two tables. There are a few ways to specify it, as I illustrate below with various tables from nycflights13:
* `NULL`, the default. dplyr will will use all variables that appear in
both tables, a __natural__ join. For example, the flights and
weather tables match on their common variables: year, month, day, hour and
origin.
```{r}
flights2 %>% left_join(weather)
```
* A character vector, `by = "x"`. Like a natural join, but uses only
some of the common variables. For example, `flights` and `planes` have
`year` columns, but they mean different things so we only want to join by
`tailnum`.
```{r}
flights2 %>% left_join(planes, by = "tailnum")
```
Note that the year columns in the output are disambiguated with a suffix.
* A named character vector: `by = c("x" = "a")`. This will
match variable `x` in table `x` to variable `a` in table `b`. The
variables from use will be used in the output.
Each flight has an origin and destination `airport`, so we need to specify
which one we want to join to:
```{r}
flights2 %>% left_join(airports, c("dest" = "faa"))
flights2 %>% left_join(airports, c("origin" = "faa"))
```
### Types of join
There are four types of mutating join, which differ in their behaviour when a match is not found. We'll illustrate each with a simple example:
```{r}
(df1 <- data_frame(x = c(1, 2), y = 2:1))
(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))
```
* `inner_join(x, y)` only includes observations that match in both `x` and `y`.
```{r}
df1 %>% inner_join(df2) %>% knitr::kable()
```
* `left_join(x, y)` includes all observations in `x`, regardless of whether
they match or not. This is the most commonly used join because it ensures
that you don't lose observations from your primary table.
```{r}
df1 %>% left_join(df2)
```
* `right_join(x, y)` includes all observations in `y`. It's equivalent to
`left_join(y, x)`, but the columns will be ordered differently.
```{r}
df1 %>% right_join(df2)
df2 %>% left_join(df1)
```
* `full_join()` includes all observations from `x` and `y`.
```{r}
df1 %>% full_join(df2)
```
The left, right and full joins are collectively know as __outer joins__. When a row doesn't match in an outer join, the new variables are filled in with missing values.
### Observations
While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations:
```{r}
df1 <- data_frame(x = c(1, 1, 2), y = 1:3)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))
df1 %>% left_join(df2)
```
## Filtering joins
Filtering joins match obserations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
* `semi_join(x, y)` __keeps__ all observations in `x` that have a match in `y`.
* `anti_join(x, y)` __drops__ all observations in `x` that have a match in `y`.
These are most useful for diagnosing join mismatches. For example, there are many flights in the nycflights13 dataset that don't have a matching tail number in the planes table:
```{r}
library("nycflights13")
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
```
If you're worried about what observations your joins will match, start with a `semi_join()` or `anti_join()`. `semi_join()` and `anti_join()` never duplicate; they only ever remove observations.
```{r}
df1 <- data_frame(x = c(1, 1, 3, 4), y = 1:4)
df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))
# Four rows to start with:
df1 %>% nrow()
# And we get four rows after the join
df1 %>% inner_join(df2, by = "x") %>% nrow()
# But only two rows actually match
df1 %>% semi_join(df2, by = "x") %>% nrow()
```
## Set operations
The final type of two-table verb is set operations. These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:
* `intersect(x, y)`: return only observations in both `x` and `y`
* `union(x, y)`: return unique observations in `x` and `y`
* `setdiff(x, y)`: return observations in `x`, but not in `y`.
Given this simple data:
```{r}
(df1 <- data_frame(x = 1:2, y = c(1L, 1L)))
(df2 <- data_frame(x = 1:2, y = 1:2))
```
The four possibilities are:
```{r}
intersect(df1, df2)
# Note that we get 3 rows, not 4
union(df1, df2)
setdiff(df1, df2)
setdiff(df2, df1)
```
## Coercion rules
When joining tables, dplyr is a little more conservative than base R about the types of variable that it considers equivalent. This is mostly likely to surprise if you're working factors:
* Factors with different levels are coerced to character with a warning:
```{r}
df1 <- data_frame(x = 1, y = factor("a"))
df2 <- data_frame(x = 2, y = factor("b"))
full_join(df1, df2) %>% str()
```
* Factors with the same levels in a different order are coerced to character
with a warning:
```{r}
df1 <- data_frame(x = 1, y = factor("a", levels = c("a", "b")))
df2 <- data_frame(x = 2, y = factor("b", levels = c("b", "a")))
full_join(df1, df2) %>% str()
```
* Factors are preserved only if the levels match exactly:
```{r}
df1 <- data_frame(x = 1, y = factor("a", levels = c("a", "b")))
df2 <- data_frame(x = 2, y = factor("b", levels = c("a", "b")))
full_join(df1, df2) %>% str()
```
* A factor and a character are coerced to character with a warning:
```{r}
df1 <- data_frame(x = 1, y = "a")
df2 <- data_frame(x = 2, y = factor("a"))
full_join(df1, df2) %>% str()
```
Otherwise logicals will be silently upcast to integer, and integer to numeric, but coercing to character will raise an error:
```{r, error = TRUE, purl = FALSE}
df1 <- data_frame(x = 1, y = 1L)
df2 <- data_frame(x = 2, y = 1.5)
full_join(df1, df2) %>% str()
df1 <- data_frame(x = 1, y = 1L)
df2 <- data_frame(x = 2, y = "a")
full_join(df1, df2) %>% str()
```
## Multiple-table verbs
dplyr does not provide any functions for working with three or more tables. Instead use `purrr::reduce()` or `Reduce()`, as described in [Advanced R](http://adv-r.had.co.nz/Functionals.html#functionals-fp), to iteratively combine the two-table verbs to handle as many tables as you need.
|