
/usr/lib/R/site-library/dplyr/doc/dplyr.html is in r-cran-dplyr 0.7.4-3.


<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />

<meta name="viewport" content="width=device-width, initial-scale=1">



<title>Introduction to dplyr</title>



<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
  margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>



<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20both%3B%0Amargin%3A%200%200%2010px%2010px%3B%0Apadding%3A%204px%3B%0Awidth%3A%20400px%3B%0Aborder%3A%201px%20solid%20%23CCCCCC%3B%0Aborder%2Dradius%3A%205px%3B%0Abackground%2Dcolor%3A%20%23f6f6f6%3B%0Afont%2Dsize%3A%2013px%3B%0Aline%2Dheight%3A%201%2E3%3B%0A%7D%0A%23TOC%20%2Etoctitle%20%7B%0Afont%2Dweight%3A%20bold%3B%0Afont%2Dsize%3A%2015px%3B%0Amargin%2Dleft%3A%205px%3B%0A%7D%0A%23TOC%20ul%20%7B%0Apadding%2Dleft%3A%2040px%3B%0Amargin%2Dleft%3A%20%2D1%2E5em%3B%0Amargin%2Dtop%3A%205px%3B%0Amargin%2Dbottom%3A%205px%3B%0A%7D%0A%23TOC%20ul%20ul%20%7B%0Amargin%2Dleft%3A%20%2D2em%3B%0A%7D%0A%23TOC%20li%20%7B%0Aline%2Dheight%3A%2016px%3B%0A%7D%0Atable%20%7B%0Amargin%3A%201em%20auto%3B%0Aborder%2Dwidth%3A%201px%3B%0Aborder%2Dcolor%3A%20%23DDDDDD%3B%0Aborder%2Dstyle%3A%20outset%3B%0Aborder%2Dcollapse%3A%20collapse%3B%0A%7D%0Atable%20th%20%7B%0Aborder%2Dwidth%3A%202px%3B%0Apadding%3A%205px%3B%0Aborder%2Dstyle%3A%20inset%3B%0A%7D%0Atable%20td%20%7B%0Aborder%2Dwidth%3A%201px%3B%0Aborder%2Dstyle%3A%20inset%3B%0Aline%2Dheight%3A%2018px%3B%0Apadding%3A%205px%205px%3B%0A%7D%0Atable%2C%20table%20th%2C%20table%20td%20%7B%0Aborder%2Dleft%2Dstyle%3A%20none%3B%0Aborder%2Dright%2Dstyle%3A%20none%3B%0A%7D%0Atable%20thead%2C%20table%20tr%2Eeven%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0A%7D%0Ap%20%7B%0Amargin%3A%200%2E5em%200%3B%0A%7D%0Ablockquote%20%7B%0Abackground%2Dcolor%3A%20%23f6f6f6%3B%0Apadding%3A%200%2E25em%200%2E75em%3B%0A%7D%0Ahr%20%7B%0Aborder%2Dstyle%3A%20solid%3B%0Aborder%3A%20none%3B%0Aborde
r%2Dtop%3A%201px%20solid%20%23777%3B%0Amargin%3A%2028px%200%3B%0A%7D%0Adl%20%7B%0Amargin%2Dleft%3A%200%3B%0A%7D%0Adl%20dd%20%7B%0Amargin%2Dbottom%3A%2013px%3B%0Amargin%2Dleft%3A%2013px%3B%0A%7D%0Adl%20dt%20%7B%0Afont%2Dweight%3A%20bold%3B%0A%7D%0Aul%20%7B%0Amargin%2Dtop%3A%200%3B%0A%7D%0Aul%20li%20%7B%0Alist%2Dstyle%3A%20circle%20outside%3B%0A%7D%0Aul%20ul%20%7B%0Amargin%2Dbottom%3A%200%3B%0A%7D%0Apre%2C%20code%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0Aborder%2Dradius%3A%203px%3B%0Acolor%3A%20%23333%3B%0Awhite%2Dspace%3A%20pre%2Dwrap%3B%20%0A%7D%0Apre%20%7B%0Aborder%2Dradius%3A%203px%3B%0Amargin%3A%205px%200px%2010px%200px%3B%0Apadding%3A%2010px%3B%0A%7D%0Apre%3Anot%28%5Bclass%5D%29%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0A%7D%0Acode%20%7B%0Afont%2Dfamily%3A%20Consolas%2C%20Monaco%2C%20%27Courier%20New%27%2C%20monospace%3B%0Afont%2Dsize%3A%2085%25%3B%0A%7D%0Ap%20%3E%20code%2C%20li%20%3E%20code%20%7B%0Apadding%3A%202px%200px%3B%0A%7D%0Adiv%2Efigure%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0Aimg%20%7B%0Abackground%2Dcolor%3A%20%23FFFFFF%3B%0Apadding%3A%202px%3B%0Aborder%3A%201px%20solid%20%23DDDDDD%3B%0Aborder%2Dradius%3A%203px%3B%0Aborder%3A%201px%20solid%20%23CCCCCC%3B%0Amargin%3A%200%205px%3B%0A%7D%0Ah1%20%7B%0Amargin%2Dtop%3A%200%3B%0Afont%2Dsize%3A%2035px%3B%0Aline%2Dheight%3A%2040px%3B%0A%7D%0Ah2%20%7B%0Aborder%2Dbottom%3A%204px%20solid%20%23f7f7f7%3B%0Apadding%2Dtop%3A%2010px%3B%0Apadding%2Dbottom%3A%202px%3B%0Afont%2Dsize%3A%20145%25%3B%0A%7D%0Ah3%20%7B%0Aborder%2Dbottom%3A%202px%20solid%20%23f7f7f7%3B%0Apadding%2Dtop%3A%2010px%3B%0Afont%2Dsize%3A%20120%25%3B%0A%7D%0Ah4%20%7B%0Aborder%2Dbottom%3A%201px%20solid%20%23f7f7f7%3B%0Amargin%2Dleft%3A%208px%3B%0Afont%2Dsize%3A%20105%25%3B%0A%7D%0Ah5%2C%20h6%20%7B%0Aborder%2Dbottom%3A%201px%20solid%20%23ccc%3B%0Afont%2Dsize%3A%20105%25%3B%0A%7D%0Aa%20%7B%0Acolor%3A%20%230033dd%3B%0Atext%2Ddecoration%3A%20none%3B%0A%7D%0Aa%3Ahover%20%7B%0Acolor%3A%20%236666ff%3B%20%7D%0Aa%3Avisited%20%7B%0Acolor%3A%20%238000
80%3B%20%7D%0Aa%3Avisited%3Ahover%20%7B%0Acolor%3A%20%23BB00BB%3B%20%7D%0Aa%5Bhref%5E%3D%22http%3A%22%5D%20%7B%0Atext%2Ddecoration%3A%20underline%3B%20%7D%0Aa%5Bhref%5E%3D%22https%3A%22%5D%20%7B%0Atext%2Ddecoration%3A%20underline%3B%20%7D%0A%0Acode%20%3E%20span%2Ekw%20%7B%20color%3A%20%23555%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%0Acode%20%3E%20span%2Edt%20%7B%20color%3A%20%23902000%3B%20%7D%20%0Acode%20%3E%20span%2Edv%20%7B%20color%3A%20%2340a070%3B%20%7D%20%0Acode%20%3E%20span%2Ebn%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Efl%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Ech%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Est%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Eco%20%7B%20color%3A%20%23888888%3B%20font%2Dstyle%3A%20italic%3B%20%7D%20%0Acode%20%3E%20span%2Eot%20%7B%20color%3A%20%23007020%3B%20%7D%20%0Acode%20%3E%20span%2Eal%20%7B%20color%3A%20%23ff0000%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%0Acode%20%3E%20span%2Efu%20%7B%20color%3A%20%23900%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%20code%20%3E%20span%2Eer%20%7B%20color%3A%20%23a61717%3B%20background%2Dcolor%3A%20%23e3d2d2%3B%20%7D%20%0A" rel="stylesheet" type="text/css" />

</head>

<body>




<h1 class="title toc-ignore">Introduction to dplyr</h1>



<p>When working with data you must:</p>
<ul>
<li><p>Figure out what you want to do.</p></li>
<li><p>Describe those tasks in the form of a computer program.</p></li>
<li><p>Execute the program.</p></li>
</ul>
<p>The dplyr package makes these steps fast and easy:</p>
<ul>
<li><p>By constraining your options, it helps you think about your data manipulation challenges.</p></li>
<li><p>It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.</p></li>
<li><p>It uses efficient backends, so you spend less time waiting for the computer.</p></li>
</ul>
<p>This document introduces you to dplyr’s basic set of tools, and shows you how to apply them to data frames. dplyr also supports databases via the dbplyr package; once you’ve installed it, read <code>vignette(&quot;dbplyr&quot;)</code> to learn more.</p>
<div id="data-nycflights13" class="section level2">
<h2>Data: nycflights13</h2>
<p>To explore the basic data manipulation verbs of dplyr, we’ll use <code>nycflights13::flights</code>. This dataset contains all 336776 flights that departed from New York City in 2013. The data comes from the US <a href="http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0">Bureau of Transportation Statistics</a>, and is documented in <code>?nycflights13</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(nycflights13)
<span class="kw">dim</span>(flights)
<span class="co">#&gt; [1] 336776     19</span>
flights
<span class="co">#&gt; # A tibble: 336,776 x 19</span>
<span class="co">#&gt;    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…</span>
<span class="co">#&gt; 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…</span>
<span class="co">#&gt; 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…</span>
<span class="co">#&gt; 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…</span>
<span class="co">#&gt; # ... with 336,772 more rows, and 7 more variables: origin &lt;chr&gt;,</span>
<span class="co">#&gt; #   dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,</span>
<span class="co">#&gt; #   time_hour &lt;dttm&gt;</span></code></pre></div>
<p>Note that <code>nycflights13::flights</code> is a tibble, a modern reimagining of the data frame. It’s particularly useful for large datasets because it only prints the first few rows. You can learn more about tibbles at <a href="http://tibble.tidyverse.org" class="uri">http://tibble.tidyverse.org</a>; in particular you can convert data frames to tibbles with <code>as_tibble()</code>.</p>
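<p>A minimal sketch of that conversion, assuming the tibble package is installed. The built-in <code>mtcars</code> data frame and the name <code>cars_tbl</code> are used here only for illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(tibble)

# mtcars ships with R as a plain data frame
cars_tbl &lt;- as_tibble(mtcars)

# The tibble keeps the data.frame class and adds tbl_df and tbl
class(mtcars)
class(cars_tbl)

# Printing shows only the first rows and the columns that fit on screen
cars_tbl</code></pre></div>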
</div>
<div id="single-table-verbs" class="section level2">
<h2>Single table verbs</h2>
<p>dplyr aims to provide a function for each basic verb of data manipulation:</p>
<ul>
<li><code>filter()</code> to select cases based on their values.</li>
<li><code>arrange()</code> to reorder the cases.</li>
<li><code>select()</code> and <code>rename()</code> to select variables based on their names.</li>
<li><code>mutate()</code> and <code>transmute()</code> to add new variables that are functions of existing variables.</li>
<li><code>summarise()</code> to condense multiple values to a single value.</li>
<li><code>sample_n()</code> and <code>sample_frac()</code> to take random samples.</li>
</ul>
<div id="filter-rows-with-filter" class="section level3">
<h3>Filter rows with <code>filter()</code></h3>
<p><code>filter()</code> allows you to select a subset of rows in a data frame. As with all single table verbs, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is <code>TRUE</code>.</p>
<p>For example, we can select all flights on January 1st with:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">filter</span>(flights, month <span class="op">==</span><span class="st"> </span><span class="dv">1</span>, day <span class="op">==</span><span class="st"> </span><span class="dv">1</span>)
<span class="co">#&gt; # A tibble: 842 x 19</span>
<span class="co">#&gt;    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…</span>
<span class="co">#&gt; 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…</span>
<span class="co">#&gt; 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…</span>
<span class="co">#&gt; 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…</span>
<span class="co">#&gt; # ... with 838 more rows, and 7 more variables: origin &lt;chr&gt;, dest &lt;chr&gt;,</span>
<span class="co">#&gt; #   air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,</span>
<span class="co">#&gt; #   time_hour &lt;dttm&gt;</span></code></pre></div>
<p>This is roughly equivalent to this base R code:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights[flights<span class="op">$</span>month <span class="op">==</span><span class="st"> </span><span class="dv">1</span> <span class="op">&amp;</span><span class="st"> </span>flights<span class="op">$</span>day <span class="op">==</span><span class="st"> </span><span class="dv">1</span>, ]</code></pre></div>
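<p>One reason the equivalence is only rough is the treatment of missing values: <code>filter()</code> keeps only rows where the condition is <code>TRUE</code>, whereas logical subsetting with <code>[</code> also returns a row of <code>NA</code>s wherever the condition is <code>NA</code>. A small sketch, using a hypothetical three-row data frame named <code>toy</code> rather than the flights data:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical toy data with one missing value in month
toy &lt;- data.frame(month = c(1, NA, 2), day = c(1, 1, 1))

# filter() drops the row whose condition evaluates to NA
kept_dplyr &lt;- filter(toy, month == 1, day == 1)
nrow(kept_dplyr)  # 1

# Base subsetting returns the matching row plus an all-NA row
kept_base &lt;- toy[toy$month == 1 &amp; toy$day == 1, ]
nrow(kept_base)   # 2</code></pre></div>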
</div>
<div id="arrange-rows-with-arrange" class="section level3">
<h3>Arrange rows with <code>arrange()</code></h3>
<p><code>arrange()</code> works similarly to <code>filter()</code> except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">arrange</span>(flights, year, month, day)
<span class="co">#&gt; # A tibble: 336,776 x 19</span>
<span class="co">#&gt;    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…</span>
<span class="co">#&gt; 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…</span>
<span class="co">#&gt; 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…</span>
<span class="co">#&gt; 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…</span>
<span class="co">#&gt; # ... with 336,772 more rows, and 7 more variables: origin &lt;chr&gt;,</span>
<span class="co">#&gt; #   dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,</span>
<span class="co">#&gt; #   time_hour &lt;dttm&gt;</span></code></pre></div>
<p>Use <code>desc()</code> to order a column in descending order:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">arrange</span>(flights, <span class="kw">desc</span>(arr_delay))
<span class="co">#&gt; # A tibble: 336,776 x 19</span>
<span class="co">#&gt;    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1  2013     1     9    641   900  1301  1242  1530  1272 HA       51 N384…</span>
<span class="co">#&gt; 2  2013     6    15   1432  1935  1137  1607  2120  1127 MQ     3535 N504…</span>
<span class="co">#&gt; 3  2013     1    10   1121  1635  1126  1239  1810  1109 MQ     3695 N517…</span>
<span class="co">#&gt; 4  2013     9    20   1139  1845  1014  1457  2210  1007 AA      177 N338…</span>
<span class="co">#&gt; # ... with 336,772 more rows, and 7 more variables: origin &lt;chr&gt;,</span>
<span class="co">#&gt; #   dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,</span>
<span class="co">#&gt; #   time_hour &lt;dttm&gt;</span></code></pre></div>
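<p>Tie-breaking and <code>desc()</code> are easier to see on a few rows. The sketch below uses a hypothetical data frame <code>toy</code> whose column names are made up for illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical toy data: two rows tie on both carrier and delay
toy &lt;- data.frame(
  carrier = c(&quot;UA&quot;, &quot;AA&quot;, &quot;UA&quot;, &quot;AA&quot;),
  delay   = c(5, 20, 30, 20),
  dist    = c(300, 100, 200, 50),
  stringsAsFactors = FALSE
)

# Sort by carrier, then by descending delay within each carrier;
# dist breaks the remaining tie between the two AA rows
sorted &lt;- arrange(toy, carrier, desc(delay), dist)
sorted$dist  # 50 100 200 300</code></pre></div>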
</div>
<div id="select-columns-with-select" class="section level3">
<h3>Select columns with <code>select()</code></h3>
<p>Often you work with large datasets with many columns but only a few are actually of interest to you. <code>select()</code> allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Select columns by name</span>
<span class="kw">select</span>(flights, year, month, day)
<span class="co">#&gt; # A tibble: 336,776 x 3</span>
<span class="co">#&gt;    year month   day</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1  2013     1     1</span>
<span class="co">#&gt; 2  2013     1     1</span>
<span class="co">#&gt; 3  2013     1     1</span>
<span class="co">#&gt; 4  2013     1     1</span>
<span class="co">#&gt; # ... with 336,772 more rows</span>
<span class="co"># Select all columns between year and day (inclusive)</span>
<span class="kw">select</span>(flights, year<span class="op">:</span>day)
<span class="co">#&gt; # A tibble: 336,776 x 3</span>
<span class="co">#&gt;    year month   day</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1  2013     1     1</span>
<span class="co">#&gt; 2  2013     1     1</span>
<span class="co">#&gt; 3  2013     1     1</span>
<span class="co">#&gt; 4  2013     1     1</span>
<span class="co">#&gt; # ... with 336,772 more rows</span>
<span class="co"># Select all columns except those from year to day (inclusive)</span>
<span class="kw">select</span>(flights, <span class="op">-</span>(year<span class="op">:</span>day))
<span class="co">#&gt; # A tibble: 336,776 x 16</span>
<span class="co">#&gt;   dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail… orig… dest  air_…</span>
<span class="co">#&gt;    &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;</span>
<span class="co">#&gt; 1    517   515  2.00   830   819  11.0 UA     1545 N142… EWR   IAH     227</span>
<span class="co">#&gt; 2    533   529  4.00   850   830  20.0 UA     1714 N242… LGA   IAH     227</span>
<span class="co">#&gt; 3    542   540  2.00   923   850  33.0 AA     1141 N619… JFK   MIA     160</span>
<span class="co">#&gt; 4    544   545 -1.00  1004  1022 -18.0 B6      725 N804… JFK   BQN     183</span>
<span class="co">#&gt; # ... with 336,772 more rows, and 4 more variables: distance &lt;dbl&gt;,</span>
<span class="co">#&gt; #   hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</span></code></pre></div>
<p>There are a number of helper functions you can use within <code>select()</code>, like <code>starts_with()</code>, <code>ends_with()</code>, <code>matches()</code> and <code>contains()</code>. These let you quickly match larger blocks of variables that meet some criterion. See <code>?select</code> for more details.</p>
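<p>As a rough illustration of these helpers, the sketch below applies each one to a hypothetical one-row data frame <code>toy</code> that borrows a few column names from <code>flights</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical one-row data frame reusing some flights column names
toy &lt;- data.frame(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  carrier = &quot;UA&quot;,
  stringsAsFactors = FALSE
)

# Columns whose names begin with &quot;dep&quot;: dep_time, dep_delay
dep_cols &lt;- names(select(toy, starts_with(&quot;dep&quot;)))

# Columns whose names end with &quot;delay&quot;: dep_delay, arr_delay
delay_cols &lt;- names(select(toy, ends_with(&quot;delay&quot;)))

# Columns whose names contain an underscore: everything except carrier
underscore_cols &lt;- names(select(toy, contains(&quot;_&quot;)))

# Columns whose names match a regular expression: dep_time, arr_time
time_cols &lt;- names(select(toy, matches(&quot;^(dep|arr)_time$&quot;)))</code></pre></div>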
<p>You can rename variables with <code>select()</code> by using named arguments:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">select</span>(flights, <span class="dt">tail_num =</span> tailnum)
<span class="co">#&gt; # A tibble: 336,776 x 1</span>
<span class="co">#&gt;   tail_num</span>
<span class="co">#&gt;   &lt;chr&gt;   </span>
<span class="co">#&gt; 1 N14228  </span>
<span class="co">#&gt; 2 N24211  </span>
<span class="co">#&gt; 3 N619AA  </span>
<span class="co">#&gt; 4 N804JB  </span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>But because <code>select()</code> drops all the variables not explicitly mentioned, it’s not that useful. Instead, use <code>rename()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">rename</span>(flights, <span class="dt">tail_num =</span> tailnum)
<span class="co">#&gt; # A tibble: 336,776 x 19</span>
<span class="co">#&gt;    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…</span>
<span class="co">#&gt; 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…</span>
<span class="co">#&gt; 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…</span>
<span class="co">#&gt; 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…</span>
<span class="co">#&gt; # ... with 336,772 more rows, and 7 more variables: origin &lt;chr&gt;,</span>
<span class="co">#&gt; #   dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,</span>
<span class="co">#&gt; #   time_hour &lt;dttm&gt;</span></code></pre></div>
</div>
<div id="add-new-columns-with-mutate" class="section level3">
<h3>Add new columns with <code>mutate()</code></h3>
<p>Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of <code>mutate()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">mutate</span>(flights,
  <span class="dt">gain =</span> arr_delay <span class="op">-</span><span class="st"> </span>dep_delay,
  <span class="dt">speed =</span> distance <span class="op">/</span><span class="st"> </span>air_time <span class="op">*</span><span class="st"> </span><span class="dv">60</span>
)
<span class="co">#&gt; # A tibble: 336,776 x 21</span>
<span class="co">#&gt;    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…</span>
<span class="co">#&gt; 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…</span>
<span class="co">#&gt; 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…</span>
<span class="co">#&gt; 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…</span>
<span class="co">#&gt; # ... with 336,772 more rows, and 9 more variables: origin &lt;chr&gt;,</span>
<span class="co">#&gt; #   dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,</span>
<span class="co">#&gt; #   time_hour &lt;dttm&gt;, gain &lt;dbl&gt;, speed &lt;dbl&gt;</span></code></pre></div>
<p><code>dplyr::mutate()</code> is similar to the base <code>transform()</code>, but allows you to refer to columns that you’ve just created:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">mutate</span>(flights,
  <span class="dt">gain =</span> arr_delay <span class="op">-</span><span class="st"> </span>dep_delay,
  <span class="dt">gain_per_hour =</span> gain <span class="op">/</span><span class="st"> </span>(air_time <span class="op">/</span><span class="st"> </span><span class="dv">60</span>)
)
<span class="co">#&gt; # A tibble: 336,776 x 21</span>
<span class="co">#&gt;    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…</span>
<span class="co">#&gt; 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…</span>
<span class="co">#&gt; 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…</span>
<span class="co">#&gt; 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…</span>
<span class="co">#&gt; # ... with 336,772 more rows, and 9 more variables: origin &lt;chr&gt;,</span>
<span class="co">#&gt; #   dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,</span>
<span class="co">#&gt; #   time_hour &lt;dttm&gt;, gain &lt;dbl&gt;, gain_per_hour &lt;dbl&gt;</span></code></pre></div>
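<p>To make the contrast with <code>transform()</code> concrete, here is a sketch on a hypothetical two-row data frame <code>toy</code>. It assumes no object called <code>gain</code> exists in the calling environment; the <code>tryCatch()</code> wrapper is only there to capture the error message instead of stopping:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical toy data reusing three flights column names
toy &lt;- data.frame(
  arr_delay = c(11, 20),
  dep_delay = c(2, 4),
  air_time  = c(227, 227)
)

# mutate() evaluates its arguments in order, so gain is
# available when gain_per_hour is computed
with_mutate &lt;- mutate(toy,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# transform() evaluates every argument against the original columns,
# so the same call fails because gain does not exist yet
with_transform &lt;- tryCatch(
  transform(toy,
    gain = arr_delay - dep_delay,
    gain_per_hour = gain / (air_time / 60)
  ),
  error = function(e) conditionMessage(e)
)
with_transform</code></pre></div>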
<p>If you only want to keep the new variables, use <code>transmute()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">transmute</span>(flights,
  <span class="dt">gain =</span> arr_delay <span class="op">-</span><span class="st"> </span>dep_delay,
  <span class="dt">gain_per_hour =</span> gain <span class="op">/</span><span class="st"> </span>(air_time <span class="op">/</span><span class="st"> </span><span class="dv">60</span>)
)
<span class="co">#&gt; # A tibble: 336,776 x 2</span>
<span class="co">#&gt;     gain gain_per_hour</span>
<span class="co">#&gt;    &lt;dbl&gt;         &lt;dbl&gt;</span>
<span class="co">#&gt; 1   9.00          2.38</span>
<span class="co">#&gt; 2  16.0           4.23</span>
<span class="co">#&gt; 3  31.0          11.6 </span>
<span class="co">#&gt; 4 -17.0         - 5.57</span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
</div>
<div id="summarise-values-with-summarise" class="section level3">
<h3>Summarise values with <code>summarise()</code></h3>
<p>The last verb is <code>summarise()</code>. It collapses a data frame to a single row.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summarise</span>(flights,
  <span class="dt">delay =</span> <span class="kw">mean</span>(dep_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)
)
<span class="co">#&gt; # A tibble: 1 x 1</span>
<span class="co">#&gt;   delay</span>
<span class="co">#&gt;   &lt;dbl&gt;</span>
<span class="co">#&gt; 1  12.6</span></code></pre></div>
<p>It’s not that useful until we learn the <code>group_by()</code> verb below.</p>
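<p>As a brief preview, the sketch below compares an ungrouped and a grouped summary on a hypothetical four-row data frame <code>toy</code>. The grouped call returns one row per group rather than one row overall:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical toy data with two carriers and one missing delay
toy &lt;- data.frame(
  carrier   = c(&quot;UA&quot;, &quot;UA&quot;, &quot;AA&quot;, &quot;AA&quot;),
  dep_delay = c(2, 4, 2, NA),
  stringsAsFactors = FALSE
)

# Ungrouped: a single row for the whole table
overall &lt;- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))

# Grouped: one row per carrier (AA has mean 2, UA has mean 3)
by_carrier  &lt;- group_by(toy, carrier)
per_carrier &lt;- summarise(by_carrier, delay = mean(dep_delay, na.rm = TRUE))
per_carrier</code></pre></div>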
</div>
<div id="randomly-sample-rows-with-sample_n-and-sample_frac" class="section level3">
<h3>Randomly sample rows with <code>sample_n()</code> and <code>sample_frac()</code></h3>
<p>You can use <code>sample_n()</code> and <code>sample_frac()</code> to take a random sample of rows: use <code>sample_n()</code> for a fixed number and <code>sample_frac()</code> for a fixed fraction.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">sample_n</span>(flights, <span class="dv">10</span>)
<span class="co">#&gt; # A tibble: 10 x 19</span>
<span class="co">#&gt;    year month   day dep_t… sched_… dep_de… arr_… sched… arr_d… carr… flig…</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt;   &lt;int&gt;   &lt;dbl&gt; &lt;int&gt;  &lt;int&gt;  &lt;dbl&gt; &lt;chr&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1  2013    10     1    822     825  - 3.00   932    935 - 3.00 AA       84</span>
<span class="co">#&gt; 2  2013     8     2    712     715  - 3.00  1015   1010   5.00 VX      399</span>
<span class="co">#&gt; 3  2013     5    10   1309    1315  - 6.00  1502   1501   1.00 US     1895</span>
<span class="co">#&gt; 4  2013    10    28   2002    1930   32.0   2318   2250  28.0  DL      795</span>
<span class="co">#&gt; # ... with 6 more rows, and 8 more variables: tailnum &lt;chr&gt;, origin &lt;chr&gt;,</span>
<span class="co">#&gt; #   dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,</span>
<span class="co">#&gt; #   time_hour &lt;dttm&gt;</span>
<span class="kw">sample_frac</span>(flights, <span class="fl">0.01</span>)
<span class="co">#&gt; # A tibble: 3,368 x 19</span>
<span class="co">#&gt;    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1  2013     8    16    827   830 -3.00   928   950 -22.0 AA     1838 N3CA…</span>
<span class="co">#&gt; 2  2013    11     4   1306  1300  6.00  1639  1610  29.0 VX      411 N641…</span>
<span class="co">#&gt; 3  2013     1    14    929   935 -6.00  1213  1238 -25.0 B6      361 N639…</span>
<span class="co">#&gt; 4  2013    12    28    625   630 -5.00   916  1014 -58.0 US      690 N656…</span>
<span class="co">#&gt; # ... with 3,364 more rows, and 7 more variables: origin &lt;chr&gt;,</span>
<span class="co">#&gt; #   dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,</span>
<span class="co">#&gt; #   time_hour &lt;dttm&gt;</span></code></pre></div>
<p>Use <code>replace = TRUE</code> to perform a bootstrap sample. If needed, you can weight the sample with the <code>weight</code> argument.</p>
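<p>As a minimal sketch of both options, assuming the <code>flights</code> data from the nycflights13 package used throughout this vignette, a bootstrap resample and a weighted sample might look like the following. The object names <code>boot</code> and <code>weighted</code> and the choice of <code>distance</code> as the weight are purely illustrative.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(nycflights13)

set.seed(2013)  # make the random draws reproducible

# Bootstrap: draw as many rows as the data has, with replacement
boot &lt;- sample_n(flights, nrow(flights), replace = TRUE)

# Weighted sample: longer flights are more likely to be drawn
# (weighting by `distance` is an illustrative assumption)
weighted &lt;- sample_n(flights, 10, weight = distance)

nrow(boot) == nrow(flights)
nrow(weighted)</code></pre></div>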
</div>
<div id="commonalities" class="section level3">
<h3>Commonalities</h3>
<p>You may have noticed that the syntax and function of all these verbs are very similar:</p>
<ul>
<li><p>The first argument is a data frame.</p></li>
<li><p>The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using <code>$</code>.</p></li>
<li><p>The result is a new data frame.</p></li>
</ul>
<p>Together these properties make it easy to chain together multiple simple steps to achieve a complex result.</p>
<p>These five functions provide the basis of a language of data manipulation. At the most basic level, you can only alter a tidy data frame in five useful ways: you can reorder the rows (<code>arrange()</code>), pick observations and variables of interest (<code>filter()</code> and <code>select()</code>), add new variables that are functions of existing variables (<code>mutate()</code>), or collapse many values to a summary (<code>summarise()</code>). The remainder of the language comes from applying the five functions to different types of data. For example, I’ll discuss how these functions work with grouped data.</p>
</div>
</div>
<div id="patterns-of-operations" class="section level2">
<h2>Patterns of operations</h2>
<p>The dplyr verbs can be classified by the type of operations they accomplish (we sometimes speak of their <strong>semantics</strong>, i.e., their meaning). The most important and useful distinction is between grouped and ungrouped operations. In addition, it is helpful to have a good grasp of the difference between select and mutate operations.</p>
<div id="grouped-operations" class="section level3">
<h3>Grouped operations</h3>
<p>The dplyr verbs are useful on their own, but they become even more powerful when you apply them to groups of observations within a dataset. In dplyr, you do this with the <code>group_by()</code> function. It breaks down a dataset into specified groups of rows. When you then apply the verbs above on the resulting object they’ll be automatically applied “by group”.</p>
<p>Grouping affects the verbs as follows:</p>
<ul>
<li><p>grouped <code>select()</code> is the same as ungrouped <code>select()</code>, except that grouping variables are always retained.</p></li>
<li><p>grouped <code>arrange()</code> is the same as ungrouped <code>arrange()</code>, unless you set <code>.by_group = TRUE</code>, in which case it orders first by the grouping variables.</p></li>
<li><p><code>mutate()</code> and <code>filter()</code> are most useful in conjunction with window functions (like <code>rank()</code>, or <code>min(x) == x</code>). They are described in detail in <code>vignette(&quot;window-functions&quot;)</code>.</p></li>
<li><p><code>sample_n()</code> and <code>sample_frac()</code> sample the specified number/fraction of rows in each group.</p></li>
<li><p><code>summarise()</code> computes the summary for each group.</p></li>
</ul>
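<p>The effects listed above can be sketched on a single grouped table. This is only an illustration: grouping by <code>carrier</code> and the object names are assumptions made for the example, and <code>flights</code> again comes from nycflights13.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(nycflights13)

by_carrier &lt;- group_by(flights, carrier)

# Two randomly chosen flights from each carrier
two_each &lt;- sample_n(by_carrier, 2)

# Sort by departure delay within each carrier instead of across the whole table
sorted &lt;- arrange(by_carrier, dep_delay, .by_group = TRUE)

# One summary row per carrier
per_carrier &lt;- summarise(by_carrier, flights = n())

nrow(two_each) == 2 * nrow(per_carrier)</code></pre></div>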
<p>In the following example, we split the complete dataset into individual planes and then summarise each plane by counting the number of flights (<code>count = n()</code>) and computing the average distance (<code>dist = mean(distance, na.rm = TRUE)</code>) and arrival delay (<code>delay = mean(arr_delay, na.rm = TRUE)</code>). We then use ggplot2 to display the output.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">by_tailnum &lt;-<span class="st"> </span><span class="kw">group_by</span>(flights, tailnum)
delay &lt;-<span class="st"> </span><span class="kw">summarise</span>(by_tailnum,
  <span class="dt">count =</span> <span class="kw">n</span>(),
  <span class="dt">dist =</span> <span class="kw">mean</span>(distance, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
  <span class="dt">delay =</span> <span class="kw">mean</span>(arr_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))
delay &lt;-<span class="st"> </span><span class="kw">filter</span>(delay, count <span class="op">&gt;</span><span class="st"> </span><span class="dv">20</span>, dist <span class="op">&lt;</span><span class="st"> </span><span class="dv">2000</span>)

<span class="co"># Interestingly, the average delay is only slightly related to the</span>
<span class="co"># average distance flown by a plane.</span>
<span class="kw">ggplot</span>(delay, <span class="kw">aes</span>(dist, delay)) <span class="op">+</span>
<span class="st">  </span><span class="kw">geom_point</span>(<span class="kw">aes</span>(<span class="dt">size =</span> count), <span class="dt">alpha =</span> <span class="dv">1</span><span class="op">/</span><span class="dv">2</span>) <span class="op">+</span>
<span class="st">  </span><span class="kw">geom_smooth</span>() <span class="op">+</span>
<span class="st">  </span><span class="kw">scale_size_area</span>()</code></pre></div>
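<p>[Figure: scatterplot of average arrival delay against average distance for each plane, with point size proportional to the number of flights and a smoothed trend line.]</p>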
<p>You use <code>summarise()</code> with <strong>aggregate functions</strong>, which take a vector of values and return a single number. There are many useful examples of such functions in base R like <code>min()</code>, <code>max()</code>, <code>mean()</code>, <code>sum()</code>, <code>sd()</code>, <code>median()</code>, and <code>IQR()</code>. dplyr provides a handful of others:</p>
<ul>
<li><p><code>n()</code>: the number of observations in the current group.</p></li>
<li><p><code>n_distinct(x)</code>: the number of unique values in <code>x</code>.</p></li>
<li><p><code>first(x)</code>, <code>last(x)</code> and <code>nth(x, n)</code>: these work similarly to <code>x[1]</code>, <code>x[length(x)]</code>, and <code>x[n]</code> but give you more control over the result if the value is missing.</p></li>
</ul>
<p>For example, we could use these to find the number of planes and the number of flights that go to each possible destination:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">destinations &lt;-<span class="st"> </span><span class="kw">group_by</span>(flights, dest)
<span class="kw">summarise</span>(destinations,
  <span class="dt">planes =</span> <span class="kw">n_distinct</span>(tailnum),
  <span class="dt">flights =</span> <span class="kw">n</span>()
)
<span class="co">#&gt; # A tibble: 105 x 3</span>
<span class="co">#&gt;   dest  planes flights</span>
<span class="co">#&gt;   &lt;chr&gt;  &lt;int&gt;   &lt;int&gt;</span>
<span class="co">#&gt; 1 ABQ      108     254</span>
<span class="co">#&gt; 2 ACK       58     265</span>
<span class="co">#&gt; 3 ALB      172     439</span>
<span class="co">#&gt; 4 ANC        6       8</span>
<span class="co">#&gt; # ... with 101 more rows</span></code></pre></div>
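<p>To illustrate the extra control that <code>first()</code>, <code>last()</code> and <code>nth()</code> offer over plain indexing, here is a small sketch on a made-up vector <code>x</code> (not part of the flights data):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

x &lt;- c(10, 20, 30)  # hypothetical example vector

first(x)    # same value as x[1]
last(x)     # same value as x[length(x)]
nth(x, 2)   # same value as x[2]

# Plain indexing past the end of the vector gives NA ...
x[5]
# ... whereas nth() lets you choose the fallback value
nth(x, 5, default = 0)</code></pre></div>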
<p>When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">daily &lt;-<span class="st"> </span><span class="kw">group_by</span>(flights, year, month, day)
(per_day   &lt;-<span class="st"> </span><span class="kw">summarise</span>(daily, <span class="dt">flights =</span> <span class="kw">n</span>()))
<span class="co">#&gt; # A tibble: 365 x 4</span>
<span class="co">#&gt; # Groups:   year, month [?]</span>
<span class="co">#&gt;    year month   day flights</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;   &lt;int&gt;</span>
<span class="co">#&gt; 1  2013     1     1     842</span>
<span class="co">#&gt; 2  2013     1     2     943</span>
<span class="co">#&gt; 3  2013     1     3     914</span>
<span class="co">#&gt; 4  2013     1     4     915</span>
<span class="co">#&gt; # ... with 361 more rows</span>
(per_month &lt;-<span class="st"> </span><span class="kw">summarise</span>(per_day, <span class="dt">flights =</span> <span class="kw">sum</span>(flights)))
<span class="co">#&gt; # A tibble: 12 x 3</span>
<span class="co">#&gt; # Groups:   year [?]</span>
<span class="co">#&gt;    year month flights</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt;   &lt;int&gt;</span>
<span class="co">#&gt; 1  2013     1   27004</span>
<span class="co">#&gt; 2  2013     2   24951</span>
<span class="co">#&gt; 3  2013     3   28834</span>
<span class="co">#&gt; 4  2013     4   28330</span>
<span class="co">#&gt; # ... with 8 more rows</span>
(per_year  &lt;-<span class="st"> </span><span class="kw">summarise</span>(per_month, <span class="dt">flights =</span> <span class="kw">sum</span>(flights)))
<span class="co">#&gt; # A tibble: 1 x 2</span>
<span class="co">#&gt;    year flights</span>
<span class="co">#&gt;   &lt;int&gt;   &lt;int&gt;</span>
<span class="co">#&gt; 1  2013  336776</span></code></pre></div>
<p>However, you need to be careful when progressively rolling up summaries like this: it’s OK for sums and counts, but you need to think about weighting for means and variances (it’s not possible to do this exactly for medians).</p>
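<p>One way to handle the weighting for means is to carry the number of observations along with each partial mean and use it as the weight in the next roll-up. The sketch below assumes the <code>flights</code> data from nycflights13; the column names <code>n_obs</code>, <code>naive</code> and <code>weighted</code> are illustrative choices.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(nycflights13)

daily &lt;- group_by(flights, year, month, day)

# Keep the number of non-missing delays alongside each daily mean
per_day &lt;- summarise(daily,
  n_obs = sum(!is.na(arr_delay)),
  delay = mean(arr_delay, na.rm = TRUE))

# Roll up to months: an unweighted mean of daily means treats every day
# alike, while the weighted mean accounts for how many flights each day had
per_month &lt;- summarise(per_day,
  naive    = mean(delay, na.rm = TRUE),
  weighted = weighted.mean(delay, n_obs, na.rm = TRUE))

# Reference: monthly means computed directly from the raw rows
direct &lt;- summarise(group_by(flights, year, month),
  delay = mean(arr_delay, na.rm = TRUE))

# The weighted roll-up should agree with the direct computation
all.equal(per_month$weighted, direct$delay)</code></pre></div>
<p>The <code>naive</code> column will generally differ from the direct monthly mean, because days with few flights count as much as days with many.</p>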
</div>
<div id="selecting-operations" class="section level3">
<h3>Selecting operations</h3>
<p>One of the appealing features of dplyr is that you can refer to columns from the tibble as if they were regular variables. However, the syntactic uniformity of referring to bare column names hides semantic differences across the verbs. A column symbol supplied to <code>select()</code> does not have the same meaning as the same symbol supplied to <code>mutate()</code>.</p>
<p>Selecting operations expect column names and positions. Hence, when you call <code>select()</code> with bare variable names, they actually represent their own positions in the tibble. The following calls are completely equivalent from dplyr’s point of view:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># `year` represents the integer 1</span>
<span class="kw">select</span>(flights, year)
<span class="co">#&gt; # A tibble: 336,776 x 1</span>
<span class="co">#&gt;    year</span>
<span class="co">#&gt;   &lt;int&gt;</span>
<span class="co">#&gt; 1  2013</span>
<span class="co">#&gt; 2  2013</span>
<span class="co">#&gt; 3  2013</span>
<span class="co">#&gt; 4  2013</span>
<span class="co">#&gt; # ... with 336,772 more rows</span>
<span class="kw">select</span>(flights, <span class="dv">1</span>)
<span class="co">#&gt; # A tibble: 336,776 x 1</span>
<span class="co">#&gt;    year</span>
<span class="co">#&gt;   &lt;int&gt;</span>
<span class="co">#&gt; 1  2013</span>
<span class="co">#&gt; 2  2013</span>
<span class="co">#&gt; 3  2013</span>
<span class="co">#&gt; 4  2013</span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>By the same token, this means that you cannot refer to variables from the surrounding context if they have the same name as one of the columns. In the following example, <code>year</code> still represents 1, not 5:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">year &lt;-<span class="st"> </span><span class="dv">5</span>
<span class="kw">select</span>(flights, year)</code></pre></div>
<p>One useful subtlety is that this only applies to bare names and to selecting calls like <code>c(year, month, day)</code> or <code>year:day</code>. In all other cases, the columns of the data frame are not put in scope. This allows you to refer to contextual variables in selection helpers:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">year &lt;-<span class="st"> &quot;dep&quot;</span>
<span class="kw">select</span>(flights, <span class="kw">starts_with</span>(year))
<span class="co">#&gt; # A tibble: 336,776 x 2</span>
<span class="co">#&gt;   dep_time dep_delay</span>
<span class="co">#&gt;      &lt;int&gt;     &lt;dbl&gt;</span>
<span class="co">#&gt; 1      517      2.00</span>
<span class="co">#&gt; 2      533      4.00</span>
<span class="co">#&gt; 3      542      2.00</span>
<span class="co">#&gt; 4      544     -1.00</span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>These semantics are usually intuitive. But note the subtle difference:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">year &lt;-<span class="st"> </span><span class="dv">5</span>
<span class="kw">select</span>(flights, year, <span class="kw">identity</span>(year))
<span class="co">#&gt; # A tibble: 336,776 x 2</span>
<span class="co">#&gt;    year sched_dep_time</span>
<span class="co">#&gt;   &lt;int&gt;          &lt;int&gt;</span>
<span class="co">#&gt; 1  2013            515</span>
<span class="co">#&gt; 2  2013            529</span>
<span class="co">#&gt; 3  2013            540</span>
<span class="co">#&gt; 4  2013            545</span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>In the first argument, <code>year</code> represents its own position <code>1</code>. In the second argument, <code>year</code> is evaluated in the surrounding context and represents the fifth column.</p>
<p>For a long time, <code>select()</code> understood only column positions. Starting with dplyr 0.6, it understands column names as well. This makes it a bit easier to program with <code>select()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">vars &lt;-<span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;year&quot;</span>, <span class="st">&quot;month&quot;</span>)
<span class="kw">select</span>(flights, vars, <span class="st">&quot;day&quot;</span>)
<span class="co">#&gt; # A tibble: 336,776 x 3</span>
<span class="co">#&gt;    year month   day</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1  2013     1     1</span>
<span class="co">#&gt; 2  2013     1     1</span>
<span class="co">#&gt; 3  2013     1     1</span>
<span class="co">#&gt; 4  2013     1     1</span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>Note that the code above is somewhat unsafe because you might have added a column named <code>vars</code> to the tibble, or you might apply the code to another data frame containing such a column. To avoid this issue, you can wrap the variable in an <code>identity()</code> call as we mentioned above, as this will bypass column names. However, a more explicit and general method that works in all dplyr verbs is to unquote the variable with the <code>!!</code> operator. This tells dplyr to bypass the data frame and to directly look in the context:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Let's create a new `vars` column:</span>
flights<span class="op">$</span>vars &lt;-<span class="st"> </span>flights<span class="op">$</span>year

<span class="co"># The new column won't be an issue if you evaluate `vars` in the</span>
<span class="co"># context with the `!!` operator:</span>
vars &lt;-<span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;year&quot;</span>, <span class="st">&quot;month&quot;</span>, <span class="st">&quot;day&quot;</span>)
<span class="kw">select</span>(flights, <span class="op">!!</span><span class="st"> </span>vars)
<span class="co">#&gt; # A tibble: 336,776 x 3</span>
<span class="co">#&gt;    year month   day</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1  2013     1     1</span>
<span class="co">#&gt; 2  2013     1     1</span>
<span class="co">#&gt; 3  2013     1     1</span>
<span class="co">#&gt; 4  2013     1     1</span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>This operator is very useful when you need to use dplyr within custom functions. You can learn more about it in <code>vignette(&quot;programming&quot;)</code>. However, it is important to understand the semantics of the verbs you are unquoting into, that is, the values they understand. As we have just seen, <code>select()</code> supports names and positions of columns. But that won’t be the case in other verbs like <code>mutate()</code> because they have different semantics.</p>
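<p>As a small sketch of unquoting inside a custom function, consider the hypothetical helper <code>keep_columns()</code> below. Its second argument is deliberately named <code>year</code>, the same as a column of <code>flights</code>, to show that <code>!!</code> makes <code>select()</code> use the value passed in rather than the column.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(nycflights13)

# Hypothetical helper: `year` is an argument holding column names,
# and it clashes on purpose with the `year` column of flights
keep_columns &lt;- function(data, year) {
  # Unquoting looks up `year` in the function, not in the data frame
  select(data, !!year)
}

out &lt;- keep_columns(flights, c("month", "day"))
names(out)
#&gt; [1] "month" "day"</code></pre></div>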
</div>
<div id="mutating-operations" class="section level3">
<h3>Mutating operations</h3>
<p>Mutate semantics are quite different from selection semantics. Whereas <code>select()</code> expects column names or positions, <code>mutate()</code> expects <em>column vectors</em>. Let’s create a smaller tibble for clarity:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">df &lt;-<span class="st"> </span><span class="kw">select</span>(flights, year<span class="op">:</span>dep_time)</code></pre></div>
<p>When we use <code>select()</code>, the bare column names stand for their own positions in the tibble. For <code>mutate()</code> on the other hand, column symbols represent the actual column vectors stored in the tibble. Consider what happens if we give a string or a number to <code>mutate()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">mutate</span>(df, <span class="st">&quot;year&quot;</span>, <span class="dv">2</span>)
<span class="co">#&gt; # A tibble: 336,776 x 6</span>
<span class="co">#&gt;    year month   day dep_time `&quot;year&quot;`   `2`</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt; &lt;chr&gt;    &lt;dbl&gt;</span>
<span class="co">#&gt; 1  2013     1     1      517 year      2.00</span>
<span class="co">#&gt; 2  2013     1     1      533 year      2.00</span>
<span class="co">#&gt; 3  2013     1     1      542 year      2.00</span>
<span class="co">#&gt; 4  2013     1     1      544 year      2.00</span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p><code>mutate()</code> gets length-1 vectors that it interprets as new columns in the data frame. These vectors are recycled so they match the number of rows. That’s why it doesn’t make sense to supply expressions like <code>&quot;year&quot; + 10</code> to <code>mutate()</code>. This amounts to adding 10 to a string! The correct expression is:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">mutate</span>(df, year <span class="op">+</span><span class="st"> </span><span class="dv">10</span>)
<span class="co">#&gt; # A tibble: 336,776 x 5</span>
<span class="co">#&gt;    year month   day dep_time `year + 10`</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;       &lt;dbl&gt;</span>
<span class="co">#&gt; 1  2013     1     1      517        2023</span>
<span class="co">#&gt; 2  2013     1     1      533        2023</span>
<span class="co">#&gt; 3  2013     1     1      542        2023</span>
<span class="co">#&gt; 4  2013     1     1      544        2023</span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>In the same way, you can unquote values from the context if these values represent a valid column. They must be either length 1 (they then get recycled) or have the same length as the number of rows. In the following example we create a new vector that we add to the data frame:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">var &lt;-<span class="st"> </span><span class="kw">seq</span>(<span class="dv">1</span>, <span class="kw">nrow</span>(df))
<span class="kw">mutate</span>(df, <span class="dt">new =</span> <span class="op">!!</span><span class="st"> </span>var)
<span class="co">#&gt; # A tibble: 336,776 x 5</span>
<span class="co">#&gt;    year month   day dep_time   new</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1  2013     1     1      517     1</span>
<span class="co">#&gt; 2  2013     1     1      533     2</span>
<span class="co">#&gt; 3  2013     1     1      542     3</span>
<span class="co">#&gt; 4  2013     1     1      544     4</span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>A case in point is <code>group_by()</code>. While you might think it has select semantics, it actually has mutate semantics. This is quite handy as it allows you to group by a modified column:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">group_by</span>(df, month)
<span class="co">#&gt; # A tibble: 336,776 x 4</span>
<span class="co">#&gt; # Groups:   month [12]</span>
<span class="co">#&gt;    year month   day dep_time</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;</span>
<span class="co">#&gt; 1  2013     1     1      517</span>
<span class="co">#&gt; 2  2013     1     1      533</span>
<span class="co">#&gt; 3  2013     1     1      542</span>
<span class="co">#&gt; 4  2013     1     1      544</span>
<span class="co">#&gt; # ... with 336,772 more rows</span>
<span class="kw">group_by</span>(df, <span class="dt">month =</span> <span class="kw">as.factor</span>(month))
<span class="co">#&gt; # A tibble: 336,776 x 4</span>
<span class="co">#&gt; # Groups:   month [12]</span>
<span class="co">#&gt;    year month    day dep_time</span>
<span class="co">#&gt;   &lt;int&gt; &lt;fctr&gt; &lt;int&gt;    &lt;int&gt;</span>
<span class="co">#&gt; 1  2013 1          1      517</span>
<span class="co">#&gt; 2  2013 1          1      533</span>
<span class="co">#&gt; 3  2013 1          1      542</span>
<span class="co">#&gt; 4  2013 1          1      544</span>
<span class="co">#&gt; # ... with 336,772 more rows</span>
<span class="kw">group_by</span>(df, <span class="dt">day_binned =</span> <span class="kw">cut</span>(day, <span class="dv">3</span>))
<span class="co">#&gt; # A tibble: 336,776 x 5</span>
<span class="co">#&gt; # Groups:   day_binned [3]</span>
<span class="co">#&gt;    year month   day dep_time day_binned</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt; &lt;fctr&gt;    </span>
<span class="co">#&gt; 1  2013     1     1      517 (0.97,11] </span>
<span class="co">#&gt; 2  2013     1     1      533 (0.97,11] </span>
<span class="co">#&gt; 3  2013     1     1      542 (0.97,11] </span>
<span class="co">#&gt; 4  2013     1     1      544 (0.97,11] </span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>This is why you can’t supply a column name as a string to <code>group_by()</code>. Doing so amounts to creating a new column containing the string recycled to the number of rows:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">group_by</span>(df, <span class="st">&quot;month&quot;</span>)
<span class="co">#&gt; # A tibble: 336,776 x 5</span>
<span class="co">#&gt; # Groups:   &quot;month&quot; [1]</span>
<span class="co">#&gt;    year month   day dep_time `&quot;month&quot;`</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt; &lt;chr&gt;    </span>
<span class="co">#&gt; 1  2013     1     1      517 month    </span>
<span class="co">#&gt; 2  2013     1     1      533 month    </span>
<span class="co">#&gt; 3  2013     1     1      542 month    </span>
<span class="co">#&gt; 4  2013     1     1      544 month    </span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>Since grouping with select semantics can sometimes be useful as well, we have added the <code>group_by_at()</code> variant. In dplyr, variants suffixed with <code>_at()</code> support selection semantics in their second argument. You just need to wrap the selection with <code>vars()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">group_by_at</span>(df, <span class="kw">vars</span>(year<span class="op">:</span>day))
<span class="co">#&gt; # A tibble: 336,776 x 4</span>
<span class="co">#&gt; # Groups:   year, month, day [365]</span>
<span class="co">#&gt;    year month   day dep_time</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;</span>
<span class="co">#&gt; 1  2013     1     1      517</span>
<span class="co">#&gt; 2  2013     1     1      533</span>
<span class="co">#&gt; 3  2013     1     1      542</span>
<span class="co">#&gt; 4  2013     1     1      544</span>
<span class="co">#&gt; # ... with 336,772 more rows</span></code></pre></div>
<p>You can read more about the <code>_at()</code> and <code>_if()</code> variants in the <code>?scoped</code> help page.</p>
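<p>According to that help page, the <code>_at()</code> variants also accept a character vector of column names in place of a <code>vars()</code> selection, which can be convenient when the grouping columns are stored in a variable. A minimal sketch, in which the vector <code>cols</code> is an illustrative name:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(nycflights13)

df &lt;- select(flights, year:dep_time)

# Column names held in an ordinary character vector (illustrative)
cols &lt;- c("year", "month")

by_cols &lt;- group_by_at(df, cols)
group_vars(by_cols)
#&gt; [1] "year"  "month"</code></pre></div>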
</div>
</div>
<div id="piping" class="section level2">
<h2>Piping</h2>
<p>The dplyr API is functional in the sense that function calls don’t have side-effects. You must always save their results. This doesn’t lead to particularly elegant code, especially if you want to do many operations at once. You either have to do it step-by-step:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">a1 &lt;-<span class="st"> </span><span class="kw">group_by</span>(flights, year, month, day)
a2 &lt;-<span class="st"> </span><span class="kw">select</span>(a1, arr_delay, dep_delay)
a3 &lt;-<span class="st"> </span><span class="kw">summarise</span>(a2,
  <span class="dt">arr =</span> <span class="kw">mean</span>(arr_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
  <span class="dt">dep =</span> <span class="kw">mean</span>(dep_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>))
a4 &lt;-<span class="st"> </span><span class="kw">filter</span>(a3, arr <span class="op">&gt;</span><span class="st"> </span><span class="dv">30</span> <span class="op">|</span><span class="st"> </span>dep <span class="op">&gt;</span><span class="st"> </span><span class="dv">30</span>)</code></pre></div>
<p>Or if you don’t want to name the intermediate results, you need to wrap the function calls inside each other:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">filter</span>(
  <span class="kw">summarise</span>(
    <span class="kw">select</span>(
      <span class="kw">group_by</span>(flights, year, month, day),
      arr_delay, dep_delay
    ),
    <span class="dt">arr =</span> <span class="kw">mean</span>(arr_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
    <span class="dt">dep =</span> <span class="kw">mean</span>(dep_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)
  ),
  arr <span class="op">&gt;</span><span class="st"> </span><span class="dv">30</span> <span class="op">|</span><span class="st"> </span>dep <span class="op">&gt;</span><span class="st"> </span><span class="dv">30</span>
)
<span class="co">#&gt; Adding missing grouping variables: `year`, `month`, `day`</span>
<span class="co">#&gt; # A tibble: 49 x 5</span>
<span class="co">#&gt; # Groups:   year, month [11]</span>
<span class="co">#&gt;    year month   day   arr   dep</span>
<span class="co">#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span class="co">#&gt; 1  2013     1    16  34.2  24.6</span>
<span class="co">#&gt; 2  2013     1    31  32.6  28.7</span>
<span class="co">#&gt; 3  2013     2    11  36.3  39.1</span>
<span class="co">#&gt; 4  2013     2    27  31.3  37.8</span>
<span class="co">#&gt; # ... with 45 more rows</span></code></pre></div>
<p>This is difficult to read because the order of the operations is from inside to out. Thus, the arguments are a long way away from the function. To get around this problem, dplyr provides the <code>%&gt;%</code> operator from magrittr. <code>x %&gt;% f(y)</code> turns into <code>f(x, y)</code>, so you can use it to rewrite multiple operations so that they read left-to-right, top-to-bottom:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">flights <span class="op">%&gt;%</span>
<span class="st">  </span><span class="kw">group_by</span>(year, month, day) <span class="op">%&gt;%</span>
<span class="st">  </span><span class="kw">select</span>(arr_delay, dep_delay) <span class="op">%&gt;%</span>
<span class="st">  </span><span class="kw">summarise</span>(
    <span class="dt">arr =</span> <span class="kw">mean</span>(arr_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>),
    <span class="dt">dep =</span> <span class="kw">mean</span>(dep_delay, <span class="dt">na.rm =</span> <span class="ot">TRUE</span>)
  ) <span class="op">%&gt;%</span>
<span class="st">  </span><span class="kw">filter</span>(arr <span class="op">&gt;</span><span class="st"> </span><span class="dv">30</span> <span class="op">|</span><span class="st"> </span>dep <span class="op">&gt;</span><span class="st"> </span><span class="dv">30</span>)</code></pre></div>
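<p>The rewriting rule can be checked on a single step. In this sketch, which assumes the <code>flights</code> data from nycflights13 and uses the illustrative names <code>piped</code> and <code>nested</code>, the piped call and the ordinary call should give the same result:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(nycflights13)

# x %&gt;% f(y) is rewritten to f(x, y)
piped  &lt;- flights %&gt;% filter(month == 1, day == 1)
nested &lt;- filter(flights, month == 1, day == 1)

identical(piped, nested)</code></pre></div>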
</div>
<div id="other-data-sources" class="section level2">
<h2>Other data sources</h2>
<p>As well as data frames, dplyr works with data that is stored in other ways, like data tables, databases and multidimensional arrays.</p>
<div id="data-table" class="section level3">
<h3>Data table</h3>
<p>dplyr also provides <a href="http://datatable.r-forge.r-project.org/">data table</a> methods for all verbs through <a href="http://github.com/hadley/dtplyr">dtplyr</a>. If you’re already using data.tables, this lets you use dplyr syntax for data manipulation and data.table for everything else.</p>
<p>For multiple operations, data.table can be faster because you usually use it with multiple verbs simultaneously. For example, with data table you can do a mutate and a select in a single step. It’s smart enough to know that there’s no point in computing the new variable for rows you’re about to throw away.</p>
<p>The advantages of using dplyr with data tables are:</p>
<ul>
<li><p>For common data manipulation tasks, it insulates you from the reference semantics of data.tables, and protects you from accidentally modifying your data.</p></li>
<li><p>Instead of one complex method built on the subscripting operator (<code>[</code>), it provides many simple methods.</p></li>
</ul>
</div>
<div id="databases" class="section level3">
<h3>Databases</h3>
<p>dplyr also allows you to use the same verbs with a remote database. It takes care of generating the SQL for you so that you can avoid the cognitive challenge of constantly switching between languages. To use these capabilities, you’ll need to install the dbplyr package and then read <code>vignette(&quot;dbplyr&quot;)</code> for the details.</p>
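<p>As a rough sketch of what this looks like in practice, the example below uses an in-memory SQLite database. It assumes that the DBI, RSQLite and dbplyr packages are installed, and it copies the built-in <code>mtcars</code> data into the database purely for illustration; see the dbplyr vignette for the authoritative workflow.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Temporary in-memory database (an assumption made for this example)
con &lt;- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

# The same verbs as before, now applied to a remote table
by_cyl &lt;- tbl(con, "mtcars") %&gt;%
  group_by(cyl) %&gt;%
  summarise(mpg = mean(mpg, na.rm = TRUE))

show_query(by_cyl)            # inspect the SQL that dplyr generated
local_result &lt;- collect(by_cyl)  # run the query and pull the rows into R

DBI::dbDisconnect(con)</code></pre></div>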
</div>
<div id="multidimensional-arrays-cubes" class="section level3">
<h3>Multidimensional arrays / cubes</h3>
<p><code>tbl_cube()</code> provides an experimental interface to multidimensional arrays or data cubes. If you’re using this form of data in R, please get in touch so I can better understand your needs.</p>
</div>
</div>
<div id="comparisons" class="section level2">
<h2>Comparisons</h2>
<p>Compared to all existing options, dplyr:</p>
<ul>
<li><p>abstracts away how your data is stored, so that you can work with data frames, data tables and remote databases using the same set of functions. This lets you focus on what you want to achieve, not on the logistics of data storage.</p></li>
<li><p>provides a thoughtful default <code>print()</code> method that doesn’t automatically print pages of data to the screen (this was inspired by data table’s output).</p></li>
</ul>
<p>Compared to base functions:</p>
<ul>
<li><p>dplyr is much more consistent; functions have the same interface. So once you’ve mastered one, you can easily pick up the others.</p></li>
<li><p>base functions tend to be based around vectors; dplyr is based around data frames</p></li>
</ul>
<p>Compared to plyr, dplyr:</p>
<ul>
<li><p>is much faster</p></li>
<li><p>provides a better thought out set of joins</p></li>
<li><p>only provides tools for working with data frames (e.g. most of dplyr is equivalent to <code>ddply()</code> + various functions, <code>do()</code> is equivalent to <code>dlply()</code>)</p></li>
</ul>
<p>Compared to virtual data frame approaches:</p>
<ul>
<li><p>it doesn’t pretend that you have a data frame: if you want to run <code>lm()</code> etc., you’ll still need to manually pull down the data</p></li>
<li><p>it doesn’t provide methods for R summary functions (e.g. <code>mean()</code>, or <code>sum()</code>)</p></li>
</ul>
</div>



<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

</body>
</html>