I am having some trouble using filtering and tables.
I have a dataset named dataset (not really original here) that includes several variables. One of them, dataset$sex, reflects the gender of a group of patients included in a medical study. To get the sex distribution I can do the following:
> table(dataset$sex)
Male Female
136 50
But I just need the patients with a follow-up of at least 60 months, so, after loading the dplyr library, I use (and get):
> dataset %>%
+ filter(followup_months >= 60) %>%
+ table(sex)
Error: object 'sex' not found
table() needs arguments it can interpret as factors. Those can live inside a data frame, but table() won't go looking for them there. What filter() pipes on to table() is the subset of the data frame containing only the rows that match the filter condition, and that subset becomes table()'s first argument. While {dplyr} functions receiving the pipe treat the incoming data frame as implicit and look up the name sex inside it, table() makes no such effort: it looks for an object called sex in the workspace and fails.
One of the downsides of tidy-flavored R is that it instills habits that lead you to expect the rest of R to share the same mindset. That often isn't a problem, but when it is, the disconnect can be harder to spot, because of course the downstream function in the pipe knows about the variables contained in the data frame it was handed.
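For the specific code in the question, two small bridges between the two worlds would work, assuming the column names followup_months and sex from the question (not run here, so no output shown):

# pull() extracts the column as a plain vector that table() understands
dataset %>%
  filter(followup_months >= 60) %>%
  pull(sex) %>%
  table()

# or with() evaluates table(sex) inside the filtered data frame
dataset %>%
  filter(followup_months >= 60) %>%
  with(table(sex))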
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# mixed metaphor
iris |> filter(Sepal.Length > 5.5) |> select(Species) |> table()
#> Species
#> setosa versicolor virginica
#> 3 39 49
# tidyesque
iris |>
dplyr::filter(Sepal.Length > 5.5) |>
group_by(Species) |>
count()
#> # A tibble: 3 × 2
#> # Groups: Species [3]
#> Species n
#> <fct> <int>
#> 1 setosa 3
#> 2 versicolor 39
#> 3 virginica 49
# base concise, when using bracket subsetting,
# dealing with a handful of variables
iris[iris[1] > 5.5, 5] |> table()
#>
#> setosa versicolor virginica
#> 3 39 49
# base wordy, when keeping positional track
# is harder due to the number of variables
iris[iris$Sepal.Length > 5.5, "Species"] |> table()
#>
#> setosa versicolor virginica
#> 3 39 49
As a novice, the most difficult thing I find in R is that with base R everything is complex... but with packages everything is confusing. And there is always the fact that each function expects a specific type of data.
In this case, for the moment, the easiest way to achieve what I want is to use dplyr to create a subset data frame, store it under a new name, and then use table() (or any other command) on that new subset of the data.
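Something like this minimal sketch, where over_60 is just whatever name I pick for the subset:

# keep only patients with at least 60 months of follow-up, then tabulate
over_60 <- dataset %>%
  filter(followup_months >= 60)
table(over_60$sex)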
But, as I am learning, I hope I will be able to master this in the near future. Thank you for helping me improve my abilities.
Most of us who suffered school trauma with algebra, and didn't overcome it by going on to a STEM field of study at uni, fall into the mind trap of confusing abstract notation with complexity. It may help to escape that trap by reflecting on our common experience: as bad as it was to deal with the syntax and all its punctuation, the “word problems” were even worse.
The tidy dialect is a word-problem analogue that makes for a better class of word problems and aids fluency in composing solutions in an LLM sort of way, because what comes next often seems obvious. For what it attempts, it’s brilliant. But it sometimes doesn’t play well with {base}, as in this example.
My experience has been that the ease of composing solutions in {tidy} can come at the expense of understanding the underlying functional-programming question that the solution is trying to address. When I weaned myself off it, I found rapid improvement in my understanding of the statistical functions.