I am trying to select the most recent date in a grouped column, subject to constraints imposed by an third column. Everything works fine when there are no NAs involved, but when the condition column contains NAs unexpected results obtain.
I would like to understand why the results of these two queries differ and how I might fix my query to handle the case when NAs exist.
So, how can I change the query on dat2 so that it's results match those of the query on dat?
dat <- tibble(x = rep(c('a', 'b'), each = 4),
y = rep(seq(as.Date('2019-02-01'), as.Date('2019-02-04'), by = 'days'), 2),
z = c('this', 'that', 'this', 'that', 'this', 'that', 'this', 'that'))
dat2 <- dat %>% mutate(z = case_when(x == 'b' & y == as.Date('2019-02-04') ~ NA_character_,
TRUE ~ z))
dat %>% group_by(x) %>%
arrange(y) %>%
summarize(first = first(y[z == 'this']),
last = last(y[z == 'this']))
# A tibble: 2 x 3
x first last
<chr> <date> <date>
1 a 2019-02-01 2019-02-03
2 b 2019-02-01 2019-02-03
dat2 %>% group_by(x) %>%
arrange(y) %>%
summarize(first = first(y[z == 'this']),
last = last(y[z == 'this']))
# A tibble: 2 x 3
x first last
<chr> <date> <date>
1 a 2019-02-01 2019-02-03
2 b 2019-02-01 NA
Hi @kburnham! I think the problem you're having here is that NA == 'this' doesn't evaluate to FALSE: it evaluates to NA. And—this is speculation on my part—I think the subscripting function [ cuts out values that are FALSE, rather than retaining values that are TRUE.
For example:
test1 <- 1:10
test1[7] <- NA
test1[test1 > 5]
#> [1] 6 NA 8 9 10
set.seed(1)
test2 <- rnorm(10)
test2[7] <- NA
test2[test2 > 0]
#> [1] 0.1836433 1.5952808 0.3295078 NA 0.7383247 0.5757814
Does that make sense? You may need to modify your example to explicitly test for NA as well:
dat2 %>% group_by(x) %>%
arrange(y) %>%
summarize(first = first(y[!is.na(z) & z == 'this']),
last = last(y[!is.na(z) & z == 'this']))
And to confirm that this still respects grouping, let's try a slightly modified data set:
library(tidyverse)
dat3 <- tibble(
x = rep(c('a', 'b'), each = 4),
y = seq(as.Date('2019-02-01'), as.Date('2019-02-08'), by = 'days'),
z = c('this', 'that', 'this', 'that', 'this', 'that', 'this', NA))
dat3 %>% group_by(x) %>%
arrange(y) %>%
summarize(first = first(y[!is.na(z) & z == 'this']),
last = last(y[!is.na(z) & z == 'this']))
#> # A tibble: 2 x 3
#> x first last
#> <chr> <date> <date>
#> 1 a 2019-02-01 2019-02-03
#> 2 b 2019-02-05 2019-02-07
If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it: