Good explanation of how to use scoped verbs?

crazybilly · September 14, 2017, 4:08pm

Is there some good documentation lying around somewhere about how to use dplyr's scoped verbs, the ones like summarize_at() and mutate_each()? A vignette sort of thing on these?

I get the general idea of how to use them, but could stand a solid walk through from somebody who actually knows what they're talking about instead of me bumbling around the documentation!

michael · September 14, 2017, 4:18pm

Have you seen what's available via RDocumentation?

crazybilly · September 14, 2017, 4:31pm

Yeah, but that's largely just the documentation--it's fine for technical documentation, but doesn't give me a good sense of how to use vars() and funs(), i.e. when I should and shouldn't, nor how to reference the data within funs(), etc.

I'm really hoping for something more like a vignette that explains when/how to use each and why the syntax is the way it is....

michael · September 14, 2017, 4:56pm

Ah, I see what you're saying. There is a short bit of information in the compatibility vignette - Deprecation of mutate_each() and summarise_each(). Probably not exactly what you're looking for, but could be helpful.

crazybilly · September 14, 2017, 6:08pm

That is helpful. Not comprehensive, but starts to give me a handle on it!

hadley · September 14, 2017, 6:15pm

It would be really great if you could give some examples of what you're having problems with, because to me the examples seem fine (but I'm obviously so steeped in it that I can't see the problem).

hadley · September 14, 2017, 6:16pm

Oh but that reminds me that I did write a page giving more details for the class I teach at Stanford: https://dcl-2017-04.github.io/curriculum/manip-scoped.html

crazybilly · September 15, 2017, 1:56pm

Oh, wow--that page was really useful, pretty much exactly what I was looking for! In terms of things I didn't understand till I read through it, I'd say I was struggling with:

1. when (and why) to use funs() or not
2. when (and why) to use vars()
3. how exactly all the arguments should work (i.e. what is a predicate vs the actual function)
4. how to write a predicate function for one of the _at functions that's more complex than a single function (seems like every time I reach for these functions it's because I've got some crazy, overly complex idea).

If you were going to generalize that doc for broader use, the only feedback I'd give as I read through it was:

Explicitly mention that you can use . within funs() to reference the column in the function
Further explain ways to control how new columns will be named (particularly when using the _at functions)
Is there a way to write predicate functions for the _if() functions using lambda functions? (Ah! a bit of experimentation reveals that you can use funs() and . there too!!)
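
A minimal sketch of that last point, on a made-up tibble (the data and column names are invented for illustration). It assumes a recent dplyr, where the `~` formula shorthand stands in for funs(); inside it, `.` refers to the column being processed, in the same way as inside funs().

```r
library(dplyr)

# Made-up data, purely for illustration
df <- tibble(
  id    = c("a", "b", "c"),
  share = c(0.10, 0.25, 0.65),
  count = c(200, 10, 3000)
)

# Predicate written as a lambda: select numeric columns whose values are all <= 1.
# Function written as a lambda: `.` is the column currently being transformed.
out <- df %>%
  mutate_if(~ is.numeric(.) && all(. <= 1), ~ . * 100)

out
```

Only `share` passes the predicate here, so `id` and `count` come back unchanged.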

Also, filter_all() with all_vars() or any_vars() seems awesome!
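
For anyone who hasn't tried it, a small sketch of filter_all() on invented data. any_vars() keeps a row when the condition holds for at least one column, all_vars() when it holds for every column; the tibble and thresholds below are made up.

```r
library(dplyr)

# Made-up data, purely for illustration
df <- tibble(x = c(1, 5, 9), y = c(2, 6, 1))

# Keep rows where at least one column is greater than 5
any_big <- df %>% filter_all(any_vars(. > 5))

# Keep rows where every column is greater than 1
all_big <- df %>% filter_all(all_vars(. > 1))

any_big
all_big
```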

phil · September 15, 2017, 11:10pm

To echo @crazybilly I have trouble wrapping my head around when I should use those calls or not.

hadley · September 16, 2017, 3:35am

You only need to use them when you want to summarise multiple variables, or to summarise with multiple functions.
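
A rough illustration of that distinction, on invented data. It assumes a recent dplyr, where a named list of functions plays the role that funs() plays elsewhere in this thread.

```r
library(dplyr)

# Made-up data, purely for illustration
df <- tibble(
  g      = c("a", "a", "b"),
  height = c(1, 3, 5),
  weight = c(10, 20, 60)
)

# One variable, one function: plain summarise() is enough
simple <- df %>%
  group_by(g) %>%
  summarise(height = mean(height))

# Several variables and several functions: the scoped verb saves the repetition
res <- df %>%
  group_by(g) %>%
  summarise_at(vars(height, weight), list(mean = mean, max = max))

names(res)
```

The second call produces one column per variable/function pair, named by combining the two (height_mean, weight_max and so on).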

david2 · September 16, 2017, 12:40pm

hey @crazybilly,

re: #1 & #2 when and why to use funs()/ vars() - if it is of any help here is a short example of how some of these things save me lots of time & typing.

In my daily workflow i get data coming in that has many columns, say 50 - and a subset of them (like 10) are dates and datetimes unfortunately coded as characters. Think of them as start date, end date, departure date, return date etc. I want to convert them to POSIX datetimes and maybe further extract days of week, days of month and similar features.

So first approach would be to do this separately for each char-date column

# the format is something like "2017-09-16 15:30:00"
my_format <- "%Y-%m-%d %H:%M:%S"

my_data %>%
  mutate(
    col1_date = as.POSIXct(col1_date, format = my_format),
    col2_date = as.POSIXct(col2_date, format = my_format),
    ...
    col10_date = as.POSIXct(col10_date, format = my_format)
  )

In this case it comes in really handy that one can simply do

my_data %>%
  mutate_at(
    vars(contains("date")),
    funs(posix = as.POSIXct),
    format = my_format
  )

and be done with all of them in one call.

One assumption that makes this possible is that all the char-date columns have "date" in their name, which keeps the vars() call simple. This might not always be the case in general, but it's easy to rename such columns by selecting them by hand and appending "date" to their name, for example.

Note also the convenience that supplying the name posix in the funs() call results in the new columns having "_posix" appended to their original name automatically.
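
A self-contained sketch of the call above, for anyone who wants to run it. The tibble is a hypothetical stand-in for my_data, and tz = "UTC" is added only to make the result reproducible. It assumes a recent dplyr, where funs() is deprecated, so a named list stands in for funs(posix = as.POSIXct).

```r
library(dplyr)

my_format <- "%Y-%m-%d %H:%M:%S"

# Hypothetical stand-in for my_data
my_data <- tibble(
  id         = 1:2,
  start_date = c("2017-09-16 15:30:00", "2017-09-17 08:00:00"),
  end_date   = c("2017-09-18 10:00:00", "2017-09-19 12:45:00")
)

converted <- my_data %>%
  mutate_at(
    vars(contains("date")),    # which columns
    list(posix = as.POSIXct),  # which function; the name becomes the suffix
    format = my_format,        # extra arguments forwarded to as.POSIXct()
    tz = "UTC"
  )

names(converted)
```

Because the function is named, the original character columns are kept and the converted ones are added next to them with the "_posix" suffix.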

Furthermore, to get, say, the day of week for each of the new posix date columns, i could just supply vars(contains("posix")) and funs(wday = lubridate::wday) in another call and get all of their days of week in one go.

Hope this small example helps a bit to show how practical these tools are.

cheers,
david

hadley · September 18, 2017, 12:40pm

Are you sure you mean as.POSIXlt()? That shouldn't work inside mutate, and if it does it's a bug in dplyr.

david2 · September 18, 2017, 1:23pm

ooops sorry. you are right POSIXlt is the list one and doesn't work with tibbles. I corrected the examples to POSIXct and i removed the unnecessary underscore when naming the functions in funs().

Thanks for all the great packages and teaching Hadley - this community website is a great idea, I've learned so many new things already.

crazybilly · September 18, 2017, 2:01pm

My question was, I guess, a bit more in the weeds: why do you have to use vars() instead of just using select()-style calls?

Oh! I just looked at the source code for vars() which is just:

function (...) 
{
  quos(...)
}

So basically, you're just one step out in an NSE sort of thing--you're using vars() so mutate_at() knows where to look for the column names you pass it.

Looks like funs() operates on the same principle: use quos() to wrap the function in a quosure so it's clear where it should be applied (i.e. to the original data frame).

nick · September 18, 2017, 2:15pm

Keep in mind that select uses the ... argument to allow multiple inputs. Since the mutate_at family uses ... for additional arguments to the function(s), multiple inputs to .cols have to be wrapped somehow. Hence, vars. Keep in mind that vars can take any of the same style of arguments that select can, such as starts_with.

hadley · September 18, 2017, 2:18pm

The scoped helpers have three inputs, each of which can be of arbitrary length: variables, functions, and extra arguments. We need some way to disambiguate between them, so we chose to make vars() and funs() explicit.

We could've also used an args() helper to put everything on an equal footing, but chose not to, since the scoped verbs are similar in spirit to apply/map functions, which use ... for extra args.
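
A small sketch of the three inputs side by side, on invented data. It assumes a recent dplyr, where a named list takes the place of funs(); the column names are made up.

```r
library(dplyr)

# Made-up data with some missing values
df <- tibble(
  a     = c(1, NA, 3),
  b     = c(4, 5, NA),
  label = c("x", "y", "z")
)

res <- df %>%
  summarise_at(
    vars(a, b),                      # input 1: variables
    list(mean = mean, total = sum),  # input 2: functions
    na.rm = TRUE                     # input 3: extra arguments, passed via ...
  )

res
```

The na.rm = TRUE argument is forwarded to both mean() and sum(), the same way extra arguments are forwarded by lapply() or purrr::map().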

aleeie · October 11, 2017, 2:03pm

Thanks @hadley! That post explained it very well. Now I'm sure these functions will save me a lot of typing =)

-Alex