Basic Subsetting help

thmsrs · March 27, 2022, 7:56pm

I've attach()ed my data set so I'm able to refer to my variables directly by name.

The dataset contains people from all over the world and their 100 m times.
The variables are 'country', 'age' and 'time'.

Say I wanted to output the mean 'time' of individuals from 'Canada' & 'US' who are under 20 as its own variable, i.e., I'd like to be able to just call mean() to get this value, if that's possible.

Is filter() the function I'm looking for? I've been playing around with it, but it just seems to output all of the subsetted rows, and I can't figure out how to do this.

JackDavison · March 27, 2022, 9:17pm

Hello!

I've attached some code that does what I think you're looking for!

The first chunk just generates some data as you haven't provided any in your question.

The second does a group_by / summarise workflow, calculating the mean and standard deviation for each combination of country and age.

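The third chunk uses filter() to keep the US & Canada athletes who are 20 or under, uses pull() to extract the time column (turning the column into a vector), then calculates the mean. You asked for under 20, so if 20-year-olds should be excluded, change age <= 20 to age < 20.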

library(dplyr, warn.conflicts = FALSE)

# generate data
set.seed(123)
dat <- tidyr::crossing(country = c("UK", "US", "AU", "CA", "NZ"),
                       age = c(10, 20, 30, 40, 50),
                       runner_id = LETTERS) %>%
  mutate(time = rnorm(n = 650, mean = 200, sd = 15))

# get average times for all categories
avg_times <- dat %>%
  group_by(country, age) %>%
  summarise(time_mean = mean(time),
            time_sd   = sd(time))
#> `summarise()` has grouped output by 'country'. You can override using the
#> `.groups` argument.

head(avg_times)
#> # A tibble: 6 x 4
#> # Groups:   country [2]
#>   country   age time_mean time_sd
#>   <chr>   <dbl>     <dbl>   <dbl>
#> 1 AU         10      199.    14.7
#> 2 AU         20      203.    12.4
#> 3 AU         30      200.    15.1
#> 4 AU         40      204.    11.6
#> 5 AU         50      196.    11.7
#> 6 CA         10      198.    17.8

# get 20 and under from CA and US
na_20_mean_time <- dat %>%
  filter(country %in% c("CA", "US"),
         age <= 20) %>%
  pull(time) %>%
  mean()

na_20_mean_time
#> [1] 200.9996

Created on 2022-03-27 by the reprex package (v2.0.1)
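Since the question mentions attach() and wanting to "just mean()", here is a minimal base R sketch of the same idea using logical indexing. The `runners` data frame and its values are made up for illustration; only the column names (country, age, time) come from the question. The country codes "CA" and "US" follow the example above, so swap in whatever values your own data uses (e.g. "Canada"). This version uses a strict `age < 20`.

```r
# Hypothetical toy data; column names follow the question (country, age, time).
runners <- data.frame(
  country = c("CA", "US", "UK", "CA", "US", "AU"),
  age     = c(18, 19, 17, 20, 25, 16),
  time    = c(11.2, 10.8, 12.0, 11.5, 10.9, 12.4)
)

# Logical index: TRUE for CA/US rows where age is strictly under 20.
keep <- runners$country %in% c("CA", "US") & runners$age < 20

# Subset the time vector with the index, then average it.
under_20_mean <- mean(runners$time[keep])
under_20_mean

# The same calculation in one line, evaluated inside the data frame.
with(runners, mean(time[country %in% c("CA", "US") & age < 20]))
```

If the data frame is already attach()ed, the expression inside with() should work on its own, as long as nothing else in your workspace is also called country, age or time. filter() by itself always returns rows of a data frame, which is why a step like pull() or the indexing above is needed before mean().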
