hello rstudio community,
first ever post so trying to provide reprex as below.
i have a huge data set (3 million+ rows) which has the following format in the tibble named "tribble_test":
dpasta(tribble_test)
tibble::tribble(
~title, ~date, ~bucket, ~score,
"a", "12-07-2020", 1, 86,
"a", "13-07-2020", 1, 22,
"a", "14-07-2020", 1, 24,
"a", "15-07-2020", 4, 54,
"b", "12-07-2020", 9, 66,
"b", "13-07-2020", 7, 76,
"b", "14-07-2020", 10, 43,
"b", "15-07-2020", 10, 97,
"c", "12-07-2020", 8, 9,
"c", "13-07-2020", 5, 53,
"c", "14-07-2020", 5, 45,
"c", "15-07-2020", 5, 40,
"d", "12-07-2020", 8, 21,
"d", "13-07-2020", 10, 99,
"d", "14-07-2020", 9, 91,
"d", "15-07-2020", 10, 90
)
For each title, I would like to analyse the following
- the lowest bucket possible
- the last date when the lowest bucket was detected
- average score when the title was in the lowest bucket
the required outcome should appear like this:
outcome_required <- tibble::tribble(
~title, ~best_bucket_date, ~best_bucket, ~best_bucket_score,
"a", "14-Jul-20", 1, 44,
"b", "13-Jul-20", 7, 76,
"c", "15-Jul-20", 5, 46,
"d", "12-Jul-20", 8, 21
)
I have tried to use slice_min()
but clearly i'm missing something here. attempted code is show below:
outcome_wrong <- tribble_test %>% group_by(title) %>%
slice_min(bucket, n = 1, with_ties = FALSE) %>%
summarise(best_bucket_date = max(date),
best_bucket = bucket,
best_bucket_score = mean(score))
dpasta(outcome_wrong)
tibble::tribble(
~title, ~best_bucket_date, ~best_bucket, ~best_bucket_score,
"a", "12-07-2020", 1, 86,
"b", "13-07-2020", 7, 76,
"c", "13-07-2020", 5, 53,
"d", "12-07-2020", 8, 21
)
clearly, slice_min()
is giving me the minimum bucket value but is not performing the date or score operations correctly. My hunch is that I have to perform some sort of iterative function at row level. or possibly create some nested vector which is beyond my current comprehension. This is to be performed on 3M+ rows so I'd like to know which approach would be optimal.
Thank you for reading this far
edit: added more context