Tidy Evaluation and Formulae

bhive01 · January 26, 2018, 9:38pm

Was attending @lionel's talk on Tidy Evaluation this week and asked a question about using tidy eval for substituting columns in lm(). I did some searching and apparently this sort of thing might be coming as mentioned by Lionel here:

The use case is if I want to map() over a list of names and substitute these names as part of the formula in the lm() call.

Such a thing is mentioned in this closed pull request:

I'm curious about programming with formula using Tidy Evaluation and wondering when/if these functions will make it into purrr in the near future. Also very open to thinking about this differently if it means not needing this entirely.

lionel · January 27, 2018, 9:05am

We are currently thinking about how to interface the base modelling functions with tidy eval tools. Stay tuned!

sdrakulich · August 28, 2018, 6:47pm

So, how's that coming along? My use-case is the same as bhive01's.

JohnMount · August 28, 2018, 6:54pm

R formulas already are a fairly complete expression capture system, one just has to use paste() to build up what one wants. You really don't need a lot of extra machinery as formulas are already quite powerful (by design).

Example:

outcome_name <- "mpg"
variable_names <- c("disp", "hp", "wt")
f_str <- paste(outcome_name, "~", paste(variable_names, collapse = " + "))
model <- lm(as.formula(f_str), data = mtcars)
summary(model)

# Call:
#   lm(formula = as.formula(f_str), data = mtcars)
# 
# Residuals:
#   Min     1Q Median     3Q    Max 
# -3.891 -1.640 -0.172  1.061  5.861 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
#   (Intercept) 37.105505   2.110815  17.579  < 2e-16 ***
#   disp        -0.000937   0.010350  -0.091  0.92851    
#   hp          -0.031157   0.011436  -2.724  0.01097 *  
#   wt          -3.800891   1.066191  -3.565  0.00133 ** 
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 2.639 on 28 degrees of freedom
# Multiple R-squared:  0.8268,	Adjusted R-squared:  0.8083 
# F-statistic: 44.57 on 3 and 28 DF,  p-value: 8.65e-11

sdrakulich · August 28, 2018, 7:49pm

The data argument is what the issue is. Here's my example:

variables <- c("y1", "y2",  "y3")

models_test <- map(variables, ~paste0(.x, " ~ x1 + x2 + x3")) %>%
  map(~as.formula(.x)) %>% map(~as.Formula(.x))

Which is then mapped through my data, a list of df's that differs only for x3.

runmodel <- function(df, modelList) {
  map(modelList, ~ lm(modelList, data = df))
}

# Ultimate Goal #
dff_ALL %>%
    imap(~runmodel(modelList = models_test, df = .x))

But $call is lm(formula = ., data = df) which fails down-the-line for my use-case of plotting marginal effects with sjPlot::plot_model.

JohnMount · August 28, 2018, 8:54pm

If your problem is you want to apply the same workflow to a set of different data.frames, That seems solvable by wrapping all of your steps in a function on a single data.frame and then varying the data.frame.

I am not a sjPlot user and haven't tried to following, but it seems like a design like the following might work.

run_one_model <- function(outcome_name, variable_names, data)  {
  f_str <- paste(outcome_name, "~", paste(variable_names, collapse = " + "))
  model <- lm(as.formula(f_str), data = data)
  plt <- sjPlot::plot_model(model)
  list(model = model, plt = plt)
}

dff_ALL <- lapply(list_of_data_frames, function(df) run_one_model(outcome_name, variable_names, df))

sdrakulich · August 28, 2018, 9:08pm

Thanks for trying to help me! I really appreciate it.

Currently, the only difference I see between our approaches is the addition of the plots at the end.
Otherwise, we both have an out-of-scope $call in our model outputs.

Substituting "list_of_data_frames" with dff_ALL doesn't work. The formula only runs one model.

I had to modify it a bit to work for me, so here's what I managed to run. It's only for one dependentVar too...

run_one_model <- function(outcome_name, variable_names, data)  {
  f_str <- paste(outcome_name, "~", paste(variable_names, collapse = " + "))
  model <- lm(as.formula(f_str), data = data)
  plt <- sjPlot::plot_model(model)
  list(model = model, plt = plt)
}

lapply(dff_ALL[[1]], function(df) run_one_model("y4", list("x1", "x2", "x3"), dff_ALL[[1]]))

I don't think it's possible unless quasiquotation is used

JohnMount · August 28, 2018, 9:17pm

The above code would only run one model (if any). Have you tried the following? The code is assuming dff_ALL is a list of data.frames.

sdrakulich · August 28, 2018, 9:21pm

It runs, but trying to do plot_model(model, type = "eff", Terms = "AccelResid) doesn't work, nor for type = "pred".

Ultimately, the model call is still not referenced adequately.

JohnMount · August 28, 2018, 9:51pm

The sort of error messages you are seeing are consistent with having a variable name that is not a column name of the data.frame.

I would say double check your spelling of column names and take the extra steps of creating (possibly temporary) data.frames that have all variables you require ready as columns (don't take any values from the environment). It is a part of R to have to perform data transforms, so copying a value from an environment into a data.frame columns is designed to be very inexpensive (they can in many situations share the same storage). Simple code working over explicit data is a good design principle.

I have run the following, and it works. Any issues with changing parameters to plot_model() are probably details involving the sgPlot package.

library("sjPlot")

run_one_model <- function(outcome_name, variable_names, data)  {
  f_str <- paste(outcome_name, "~", paste(variable_names, collapse = " + "))
  model <- lm(as.formula(f_str), data = data)
  plt <- sjPlot::plot_model(model)
  list(model = model, plt = plt)
}

mk_data <- function() {
  d <- data.frame(x1 = rnorm(100),
                  x2 = rnorm(100),
                  x3 = rnorm(100))
  d$y4 <- d$x1 + d$x2 + d$x3 + rnorm(100)
  d
}

dA <- mk_data()

# two data.frames that differ only in one column
data_list <- list(
  dA,
  transform(dA, x3 = rnorm(nrow(dA)))
)

res <- lapply(data_list, function(df) run_one_model("y4", c("x1", "x2", "x3"), df))

for(i in seq_along(res)) {
  print(res[[i]]$plt + ggplot2::ggtitle(i))
}

lionel · August 29, 2018, 7:37am

What you've shown has nothing to do with formulas, that's just R parsing a string. We don't recommend meta-programming with strings as it's not robust, and not powerful. It's not a tidyverse thing, experimented base R programmers are likely to give the same advice.

Here is an approach to reducing a list of names or expressions to a single expression. First we'll need to define some tools, hopefully they'll live in a package sometime in the future:

syms_reduce <- function(syms, op = "+") {
  exprs_reduce(syms(syms), op = op)
}
exprs_reduce <- function(exprs, op = "+") {
  n <- length(exprs)

  if (length(exprs) == 0) {
    abort("Empty list of expressions")
  }
  if (length(exprs) == 1) {
    return(exprs[[1]])
  }

  op <- sym(op)
  purrr::reduce(exprs, function(x, y) expr((!!op)(!!x, !!y)))
}

We have one function syms_reduce() that takes symbols or strings as inputs, and the other exprs_reduce() takes a list of arbitrary expressions. Let's try syms_reduce() to create a sum, the default operation:

syms_reduce(c("a"))
#> a

syms_reduce(c("a", "b", "c"))
#> a + b + c

This function is robust to any weird names that might come up in messy data:

syms_reduce(c("1()", "`a`e", "_foo"))
#> `1()` + `\`a\`e` + `_foo`

You can specify a different operation:

syms_reduce(c("a", "b", "c"), op = "*")
#> a * b * c

And it even works with regular functions!

syms_reduce(c("a", "b", "c"), op = "plus")
#> plus(plus(a, b), c)

As for exprs_reduce(), it accepts any expression as input, even other sums:

exprs <- exprs(foo(), bar(baz), quux + blip)
exprs_reduce(exprs, "-")
#> foo() - bar(baz) - (quux + blip)

Note how the operator precedence of the last input quux + blip was handled gracefully.

String meta-programming simply doesn't have this kind of flexibility and robustness.

desmonds22 · August 29, 2018, 12:44pm

Your code is illegible to anyone who isnt comfortable with tidyeval while almost anyone can understand whats going on in John's code. More importantly they will be able to write their own quite quickly using the same method.

Tidyeval may have theoretical advantages over meta-programming but evaluating strings is intuitive for most users especially non-programmers. Metaprogramming works really well almost all of the time without the overhead of quoting, unquoting, eval,!!, sym etc.

For daily use tidyeval feels like a solution in search of a problem - maybe if you're building tools for other people it may be more robust? I havent seen a clear case for learning it. I cant be the only one who feels like this - there are at least 2 other people in this world who agree! Maybe even 3. We should meet up and carpool

JohnMount · August 29, 2018, 3:05pm

It is all about formula, just add the following edit and it is even more clear.

The campaign that R is no good (and therefore one must use rlang/tidyeval) is just chasing people away from R.

sdrakulich · August 29, 2018, 3:40pm

I appreciate this discussion, and the help I've received thus far.

@lionel, I'll be honest, I spent yesterday well into the night trying to figure out lazyeval or whatever the proper term would be, to no avail. Maybe I'm just a novice, maybe worse, but point is, I didn't see how what you put applied to my scenario.

Back to John:
At the end of the day, what you wrote does only one outcome variable at a time, and only one dataframe at a time.

I have a list of dataframes. They differ only by one independent variable's column. I don't see how I could recode it without generating a model for each case...whereupon your code scales to meet the task even less for that scenario. Not to bash your code, just stating the limitations for my case.

Finally, the whole reason I'm trying to resolve this lazyeval thing is for the sole purpose of having the $call be appropriate, and not $call: lm(formula = ., data = df).
Which, when trying to use sjPlot, can't reconstruct where the original data came from properly. That's the whole reason for this headache.

My hopes are that I can somehow retain the reference to the right dataframe in my list of df's, such that when I purrr through them to generate my fitted results, the references are all properly maintained. Otherwise, the code I've come up with to generate all my results for all my data is this, if anyone is interested.

dff_ALL <- list(df1, df2, df3) #differ only on "x3" below!
variables <- c("y1", "y2", "y3")
models_test <- map(variables, ~paste0(.x, " ~ x1 + x2 + x3"))

allresults <- map(dff_ALL, function(x) {
    res <- map(models_test, function(z) {
        tmp <- lm(as.formula(z), data = x)
        tmp$call <- (z)
        tmp <- 
        return(tmp)
        })
    })

# Set names from models (otherwise reference from list is lost)
for (i in seq_along(allresults)){
    allresults[[i]] <- set_names(allresults[[i]], models_test)
}

At which point you can just broom::tidy/glance/augment. Yes, I can just emulate sjPlot functionality for marginal effects, but that becomes cumbersome very quickly.

Conceptually. I have a set of dependent variables I want to test a set of predictors on. I want to test those models on a set of dataframes that are, as far as dimensions, cases, and variables, identical. The values in one column are different, I'm stating for a third time. Maybe there's an easier way to approach this, once you know that.
I want to keep it tidy, start to finish, and I want to have the ability to expand my analysis very easily, by adding another string to variables or models_test construction of predicting vars.

I'm actually pretty surprised that I didn't find a single "many models" approach that didn't nest the dataframes. My case doesn't work with that, evidently, and I haven't managed to get the horrid-looking tribble approach, as shown here.

JohnMount · August 29, 2018, 3:53pm

Was just trying to help you debug it in smaller pieces. You said in addition to the problem of multiple dependent variables you were seeing errors and I tried to show how to address the errors. As far as how to vary so many things at once probably you just need one more layer of nesting (be it through a for-loop, lapply, or purrr::map).

lionel · August 29, 2018, 4:10pm

I didn't see how what you put applied to my scenario.

Sorry that was mostly a response to John's confusion between formulas and parsing strings, which are two independent things.

You're right that quasiquotation is the right tool for fixing that lm() call. Here is how you can do it, create a quasiquoted expression and evaluate it right away:

data <- mtcars
f <- disp ~ drat

eval(expr(lm(!!f, data = data)))
#> Call:
#> lm(formula = disp ~ drat, data = data)
#>
#> Coefficients:
#> (Intercept)         drat
#>       822.8       -164.6

We are considering making that eval() + expr() available in one function: Implement bang() by lionel- · Pull Request #591 · r-lib/rlang · GitHub

JohnMount · August 29, 2018, 5:09pm

If one wants to retrieve the model formula structure, there are easy and reliable ways to get it. One can carry the string version around as name-entries on a list of results or as a parallel column in a data.frame. Thus it would be easy to add to reports.

data <- mtcars
f <- disp ~ drat
model <- lm(f, mtcars)
format(model$terms)
# [1] "disp ~ drat"

Or

data <- mtcars
f <- disp ~ drat

eval(bquote(   lm(.(f), data = data)   ))

# Call:
#   lm(formula = disp ~ drat, data = data)
# 
# Coefficients:
#   (Intercept)         drat  
#         822.8       -164.6

tbradley · August 29, 2018, 5:48pm

Can you point to any references where the developers of rlang/tidyeval/tidyverse have ever said "R is no good"? I think there is a big difference between them promoting their packages (and even saying that programming with strings is not recommended - they are entitled to their opnions) and them saying "R is no good". If they truly believed that then why would they spend so much time to build all these tools to make R easier to use?

In addition, you seem to do very similar things by promoting your packages (which is great) as an alternative to the ones provided by the tidyverse/r-lib teams at RStudio. I am struggling to see how you can view their promotion for their packages any different then you view your promotion of your packages.

JohnMount · August 29, 2018, 6:18pm

I am answering essentially assuming you are making an assertion, not asking a question. Plus it seems like calling out more behavior is just going to come off as more negative. I think there are in fact a lot of examples of very good R practices, R training, and R packages being inappropriately ignored and denied respect/priority.

"R is no good" is the gist of the message, not the exact words.

On this very thread I just got called "confused" by an RStudio employee when I used both R strings and R formulas- that seemed pretty negative.

And yes I do promote my packages by comparison. A difference being I admit other packages exist. For example wrapr::let() publicly pre-dates the public release and announcement of rlang (though it was replyr::let() at the time). Yet I never see it given any credit or listed as having some influence.

sdrakulich · August 29, 2018, 8:33pm

Reporting in: the following did the trick!

tmp$call <- expr(lm(formula= !!z))

Fixes the issue in the operation below, regarding future references to the $call for the model formula!

allresults <- map(dff_ALL, function(x) {
    res <- map(models_test, function(z) {
        tmp <- lm(as.formula(z), data = x)
        tmp$call <- (z)
        return(tmp)
        })
    })

PS: Thank you for all the help!