I am having some trouble using filtering and tables.
I have a dataset named dataset (not really original here) that includes several variables. One of them, dataset$sex, reflects the gender of a group of patients included in a medical study. To get the sex distribution I can do the following:
> table(dataset$sex)
Male Female
136 50
But I just need the patients with a follow-up of at least 60 months, so, after loading the dplyr library, I use (and get):
> dataset %>%
+ filter(followup_months >= 60) %>%
+ table(sex)
Error: object 'sex' not found
table() needs arguments it can interpret as factors. Those can live inside a data frame, but table() won't go looking for them there. What filter() pipes on to table() is the subset of the data frame containing only the rows that match the filter condition, and that subset becomes table()'s first argument. While {dplyr} functions receiving the pipe treat the incoming data frame as implicit and look up the name sex inside it, table() makes no such effort: it looks for an object called sex in the workspace and fails.
One of the downsides of tidy-flavored R is that it instills habits that lead you to expect the rest of R to share the same mindset. That often isn't a problem, but when it is, the disconnect can be harder to spot, because of course the downstream function in the pipe knows about the variables contained in the data frame it was handed.
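For the specific code in the question, two small bridges between the two worlds would work, assuming the column names followup_months and sex from the question (not run here, so no output shown):

# pull() extracts the column as a plain vector that table() understands
dataset %>%
  filter(followup_months >= 60) %>%
  pull(sex) %>%
  table()

# or with() evaluates table(sex) inside the filtered data frame
dataset %>%
  filter(followup_months >= 60) %>%
  with(table(sex))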
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# mixed metaphor
iris |> filter(Sepal.Length > 5.5) |> select(Species) |> table()
#> Species
#> setosa versicolor virginica
#> 3 39 49
# tidyesque
iris |>
dplyr::filter(Sepal.Length > 5.5) |>
group_by(Species) |>
count()
#> # A tibble: 3 × 2
#> # Groups: Species [3]
#> Species n
#> <fct> <int>
#> 1 setosa 3
#> 2 versicolor 39
#> 3 virginica 49
# base concise, when using bracket subsetting,
# dealing with a handful of variables
iris[iris[1] > 5.5, 5] |> table()
#>
#> setosa versicolor virginica
#> 3 39 49
# base wordy, when keeping positional track
# is harder due to the number of variables
iris[iris$Sepal.Length > 5.5, "Species"] |> table()
#>
#> setosa versicolor virginica
#> 3 39 49
As a novice, the most difficult thing I find in R is that with base R everything is complex... but with packages everything is confusing. And there is always the fact that each function expects a specific type of data.
In this case, for the moment, the easiest way to achieve what I want is to use dplyr to create a subset data frame, store it under a new name, and then use table() (or any other command) on that new subset of the data.
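Something like this minimal sketch, where over_60 is just whatever name I pick for the subset:

# keep only patients with at least 60 months of follow-up, then tabulate
over_60 <- dataset %>%
  filter(followup_months >= 60)
table(over_60$sex)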
But, as I am learning, I hope I will be able to master this in the near future. Thank you for helping me improve my abilities.
Most of us who suffered school trauma with algebra, and didn't overcome it by going on to a STEM field of study at uni, fall into the mind trap of confusing abstract notation with complexity. It may help to escape that trap by reflecting on our common experience: as bad as it was to deal with the syntax and all its punctuation, the “word problems” were even worse.
The tidy dialect is a word-problem analogue that makes for a better class of word problems and aids fluency in composing solutions in an LLM sort of way, because what comes next often seems obvious. For what it attempts, it’s brilliant. But it sometimes doesn’t play well with {base}, as in this example.
My experience has been that the ease of composing solutions in {tidy} can come at the expense of understanding the underlying functional-programming question that the solution is trying to address. When I weaned myself off it, I found rapid improvement in my understanding of the statistical functions.