It's well known that when writing code like the following, RStudio (and R CMD check, and presumably several other tools) will complain about no symbol named 'lat' in scope and no symbol named 'lon' in scope:
weather_points <- weather_data %>%
  distinct(lat, lon)
That's of course just an example using distinct(); it happens frequently with any of the common functions like mutate() or filter(), or anything else that uses non-standard evaluation to place the column names of the data into scope as variable names.
The result is that these warnings generally go ignored, and the noise builds up so that other legitimate warnings also go ignored, and that leads to bugs.
One solution is to disambiguate by explicitly using .$foo syntax:
weather_points <- weather_data %>%
  distinct(.$lat, .$lon)
That has some disadvantages:
- All the mentions of variables will need to be changed in this way;
- Some tools will still complain about . being undefined (it looks like R CMD check still will, and RStudio won't?);
- Most importantly, it changes the behavior when one of the variables is typoed. In the original code, a fatal runtime exception is thrown, but when using .$foo it will silently return NULL. It will also resolve .$la to .$lat, which is different from the original, which requires exact name matching.
So while this gets rid of a warning, it's actually less safe in some important ways than the original.
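To make that last point concrete, here is a small base R sketch of the two behaviors. It assumes weather_data is a plain data.frame (other table classes may treat $ differently), and the values are made up for illustration.

```r
# Hypothetical stand-in for weather_data (values made up for illustration)
weather_data <- data.frame(lat = c(1, 1, 2), lon = c(5, 5, 6))

# A misspelled name gives NULL rather than an error
typo_result <- weather_data$latitude
is.null(typo_result)                        # TRUE

# A unique prefix is silently matched to the full column name
prefix_result <- weather_data$la
identical(prefix_result, weather_data$lat)  # TRUE
```

Neither case raises an error, so a typo inside the pipeline can pass through unnoticed.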
Another option would be to have a function that merely asserts the existence of columns by name, essentially "declaring" them for use later in the pipeline:
weather_points <- weather_data %>%
  vars(lat, lon) %>%
  distinct(lat, lon)
The idea is that it would throw a runtime exception if lat and lon weren't present in weather_data (the same way that the existing distinct() call would have), but also that tools like RStudio could easily parse the vars() call to know that lat and lon are legitimate variables later in the pipeline (and in future pipelines based on the result of this pipeline, etc.).
One slight advantage in the exception-throwing part is that it can explicitly check that the variables are present in the data table rather than just as ambient variables in the namespace, which seems like it could help avoid some errors too.
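A minimal base R sketch of the runtime half of this idea might look like the following. The name declare_vars() is hypothetical; I've used it instead of vars() only because dplyr already exports a vars() with a different purpose. The sample data is made up, and this does nothing for the tooling side, which would still need to learn to read the call.

```r
# Hypothetical helper: assert that the named columns exist, then pass the data on.
declare_vars <- function(data, ...) {
  # Capture the bare names without evaluating them
  wanted <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  # Look only at the columns of the data, not at variables in the calling environment
  missing_cols <- setdiff(wanted, names(data))
  if (length(missing_cols) > 0) {
    stop("Column(s) not found in data: ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  data
}

# Made-up data for illustration
weather_data <- data.frame(lat = c(1, 1, 2), lon = c(5, 5, 6), temp = c(10, 11, 12))

# Columns exist: the data comes back unchanged
checked <- declare_vars(weather_data, lat, lon)

# An ambient variable with the same name does not satisfy the check
lng <- 42
result <- tryCatch(
  declare_vars(weather_data, lat, lng),
  error = function(e) conditionMessage(e)
)
```

Because the data is returned unchanged, the call could sit in the pipeline as weather_data %>% declare_vars(lat, lon) %>% distinct(lat, lon).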
Thoughts? Any other existing technique that I haven't thought of? I know a lot of other people have thought about this too, so let me know if I'm missing something.