Hi there,
I am trying to migrate over to a more tidy eval way of doing things. As a warm-up I was trying to be more explicit and use the rlang .data pronoun, as outlined here:
We can fix that ambiguity by being more explicit and using the .data pronoun. This will throw an informative error if the variable doesn’t exist:
mutate_y <- function(df) { mutate(df, y = .data$a + .data$x) }
This was an attempt to remove all those ugly R CMD check notes like this:
checking R code for possible problems ... NOTE
function: no visible binding for global variable 'VARIABLE'
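To make that concrete, here is a minimal sketch on a local data frame. The function name mutate_y_bare and the toy data are made up for illustration. It assumes dplyr is installed, and that inside a package you would also import the pronoun (for example with @importFrom rlang .data) so that .data itself is not flagged:

```r
library(dplyr)

# Bare column names: works at run time, but R CMD check reports
# "no visible binding for global variable" for `a` and `x`.
mutate_y_bare <- function(df) {
  mutate(df, y = a + x)
}

# Explicit .data pronoun: no bare symbols for R CMD check to complain about.
mutate_y <- function(df) {
  mutate(df, y = .data$a + .data$x)
}

df <- data.frame(a = 1:3, x = 4:6)
out <- mutate_y(df)

# With a missing column the pronoun raises an error instead of silently
# picking up an object called `x` from the calling environment.
err <- tryCatch(mutate_y(data.frame(a = 1)), error = function(e) e)
```

On a local data frame both versions return the same result. The only difference is what R CMD check sees and what happens when a column is missing.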
However, I've run into a problem when using a dplyr verb on a database connection. I was having trouble creating a reproducible example because, AFAIK, this only manifests itself when running R CMD check on a package. So I decided to actually make a package that includes an internal SQLite database to illustrate the problem. That can be found here:
https://github.com/boshek/testpackage
When using .data$ on a column directly from a database connection before calling collect(), like this:
https://github.com/boshek/testpackage/blob/5189fa5db4559789fbee5a0ef6054ba8dc28609c/R/dummy.R#L8-L17
With the package installed you get an error message like this:
dot_data_not_found()
Error: Column `STATION_NUMBER` not found in `.data`
If, however, you use .data$ on a column from a database connection after calling collect(), like this:
https://github.com/boshek/testpackage/blob/5189fa5db4559789fbee5a0ef6054ba8dc28609c/R/dummy.R#L23-L33
there is no error message and the output is as expected:
using_dot_data()
# A tibble: 1 x 15
  STATION_NUMBER STATION_NAME    PROV_TERR_STATE~ REGIONAL_OFFICE~ HYD_STATUS SED_STATUS LATITUDE LONGITUDE
  <chr>          <chr>           <chr>            <chr>            <chr>      <chr>         <dbl>     <dbl>
1 05AA008        CROWSNEST RIVE~ AB               3                A          D              49.6     -114.
# ... with 7 more variables: DRAINAGE_AREA_GROSS <dbl>, DRAINAGE_AREA_EFFECT <dbl>, RHBN <int>,
# REAL_TIME <int>, CONTRIBUTOR_ID <int>, OPERATOR_ID <int>, DATUM_ID <int>
using_dot_data(), however, has the disadvantage of pulling the entire table out of the database, since the data is only filtered after collect(). That is a deal-breaker on really big databases. The advantage is that it does take care of the "no visible binding..." note in R CMD check.
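To illustrate the trade-off outside the package, here is a minimal sketch against an in-memory SQLite database. The table name STATIONS and the two-row data frame are made up for illustration (they are not the package's actual database), and it assumes DBI, RSQLite, dplyr and dbplyr are installed:

```r
library(dplyr)

# Hypothetical stand-in for the package's internal SQLite database.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
stations <- data.frame(
  STATION_NUMBER = c("05AA008", "08MF005"),
  PROV_TERR_STATE_LOC = c("AB", "BC"),
  stringsAsFactors = FALSE
)
DBI::dbWriteTable(con, "STATIONS", stations)

# Filter AFTER collect(): the whole table is pulled into R first, so
# .data$ is evaluated against an ordinary tibble.
after_collect <- tbl(con, "STATIONS") %>%
  collect() %>%
  filter(.data$STATION_NUMBER == "05AA008")

# Filter BEFORE collect(): the filter is translated to SQL and only the
# matching rows leave the database. The bare name used here is what
# triggers the R CMD check note when this code lives in a package.
before_collect <- tbl(con, "STATIONS") %>%
  filter(STATION_NUMBER == "05AA008")

show_query(before_collect)  # the generated SQL should contain a WHERE clause
result <- collect(before_collect)

DBI::dbDisconnect(con)
```

Both pipelines return the same single row here. The difference is where the filtering happens, which only starts to matter once the table is large.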
The last example is one that does generate the "no visible binding..." note in R CMD check:
https://github.com/boshek/testpackage/blob/5189fa5db4559789fbee5a0ef6054ba8dc28609c/R/dummy.R#L39-L50
with the R CMD check output viewable here:
Question
How do I deal with bare variable names in dplyr verbs that generate a database query (other than like this in my zzz.R file: if(getRversion() >= "2.15.1") utils::globalVariables(c("STATION_NUMBER"))), given that I want to keep the efficiency of filtering before collecting the data?