Add a row depending on a condition using tidyverse if possible

joonwoob · October 13, 2018, 5:35pm

I'm working on hockey analytics, specifically modeling the goals scored as a poisson distribution.
So far, I've been able to create a new column, time_diff, that finds the number of minutes passed since the last goal scored.

Now, I'd like to add in a row at the end of very game (represented by game_id) that displays the number of minutes passed in the game from the last goal scored to the end of the game.

For example, in game_id 2007020001, the last goal scored in the game was at 59.56 minutes. So, I'd like to insert a row right after this game with event_type column containing the value, game_end and column time_diff containing the value 0.44

I've seen some data.table solutions on SO, but would love to see a tidyverse solution since I'm not used to data.table

Here is subset of my data:

2007020001L, 2007020001L, 2007020002L, 2007020002L, 2007020002L, 
2007020002L, 2007020002L, 2007020003L, 2007020003L, 2007020003L, 
2007020003L, 2007020003L, 2007020004L, 2007020004L, 2007020004L
), session = c("R", "R", "R", "R", "R", "R", "R", "R", "R", "R", 
"R", "R", "R", "R", "R", "R", "R", "R"), game_seconds = c(515L, 
1815L, 2470L, 3189L, 3574L, 649L, 919L, 1373L, 1450L, 1737L, 
217L, 781L, 2971L, 3168L, 3665L, 586L, 2072L, 2358L), event_type = c("GOAL", 
"GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", 
"GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", 
"GOAL"), time_diff = c(8.58, 21.67, 10.92, 11.98, 6.42, 10.82, 
4.5, 7.57, 1.28, 4.78, 3.62, 9.4, 36.5, 3.28, 8.28, 9.77, 24.77, 
4.77)), row.names = c(NA, -18L), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"), vars = c("game_id", "session"), drop = TRUE, indices = list(
    0:4, 5:9, 10:14, 15:17), group_sizes = c(5L, 5L, 5L, 3L), biggest_group_size = 5L, labels = structure(list(
    game_id = 2007020001:2007020004, session = c("R", "R", "R", 
    "R")), row.names = c(NA, -4L), class = "data.frame", vars = c("game_id", 
"session"), drop = TRUE))

chris.prener · October 13, 2018, 5:58pm

One option is to check out add_row from the tibble package. If you calculate the time_diff ahead of time and store it in an object finalTime, you can include that object in your add_row() call. A simplified version might look something like this:

hockey <- add_row(hockey, event_type = "game end", time_diff = finalTime)

howardbaek · October 15, 2018, 3:08am

I'm going to answer my own question here haha.

A great dplyr solution I found is:


df %>%
 group_by(game_id) %>%
  summarise(time_diff_sum = sum(time_diff)) %>% 
  mutate(time_diff_game_end = 60 - time_diff_sum)

howardbaek · October 15, 2018, 3:14am

I see that my solution only finds the difference between the time of the last goal and the end of the game.

Does anyone have ideas on how to insert this difference into the column time_diff right after the last row of each group of game_id and put down the value game_end in event_type ?

chris.prener · October 16, 2018, 3:01pm

@howardbaek - it sounds like you have multiple games in the same data object? Can you show us a minimal example, of what the output of

df %>%
 group_by(game_id) %>%
  summarise(time_diff_sum = sum(time_diff)) %>% 
  mutate(time_diff_game_end = 60 - time_diff_sum)

looks like?

jdb · October 20, 2018, 2:23pm

Would something like the following work?

df %>%
  group_by(game_id, event_type) %>%
  mutate(time_diff_sum = 60 - sum(time_diff)) %>% 
  ungroup() %>% 
  gather(time_diff, time_diff_sum, key = "key", value = "value") %>% 
  arrange(game_id) %>% 
  distinct(game_id, event_type, key, value)

joonwoob · October 31, 2018, 11:41pm

Looks good to me. Thanks

cderv · November 1, 2018, 1:12pm

If your question's been answered, would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

Thanks!

joonwoob · November 1, 2018, 3:00pm

Solution for future users:

df %>% 
    filter(event_type == event_type_fun) %>% 
    group_by(game_id, session) %>%
    summarise(event_index = last(event_index)) %>%
    ungroup() %>% 
    mutate(event_type = "game_end",
           time_diff = time_diff_game_end) %>%
    bind_rows(df %>% 
                filter(event_type == event_type_fun) %>% 
                group_by(game_id, session) %>% 
                # Time difference in minutes (seconds / 60)
                mutate(time_diff = game_seconds - lag(game_seconds, default = 0),
                       time_diff = round(time_diff / 60, 2)), .) %>% 
    arrange(game_id) %>% 
    select(game_id, session, event_type, time_diff) %>% 
    mutate(event_type = if_else(event_type == event_type_fun, 1, 0))

where event_type_fun == "GOAL" and time_diff_game_end =


df %>% 
    filter(event_type == event_type_fun) %>% 
    # Time difference in minutes (seconds / 60)
    group_by(game_id, session) %>% 
    mutate(time_diff = game_seconds - lag(game_seconds, default = 0),
           time_diff = round(time_diff / 60, 2)) %>% 
    select(game_id, session, game_seconds, event_type, time_diff) %>% 
    group_by(game_id, session) %>%
    summarise(time_diff_sum = sum(time_diff)) %>% 
    mutate(time_diff_game_end = 60 - time_diff_sum) %>% 
    pull(time_diff_game_end)

joonwoob · November 1, 2018, 3:00pm

Thanks for letting me know.

system · November 8, 2018, 3:00pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.