Understanding first() and last()

kburnham · February 17, 2019, 6:46am

I am trying to select the most recent date in a grouped column, subject to constraints imposed by an third column. Everything works fine when there are no NAs involved, but when the condition column contains NAs unexpected results obtain.

I would like to understand why the results of these two queries differ and how I might fix my query to handle the case when NAs exist.

So, how can I change the query on dat2 so that it's results match those of the query on dat?

dat <- tibble(x = rep(c('a', 'b'), each = 4),
                   y = rep(seq(as.Date('2019-02-01'), as.Date('2019-02-04'), by = 'days'), 2),
                   z = c('this', 'that', 'this', 'that', 'this', 'that', 'this', 'that'))

dat2 <- dat %>% mutate(z = case_when(x == 'b' & y == as.Date('2019-02-04') ~ NA_character_,
                                     TRUE ~ z))


dat %>% group_by(x) %>% 
  arrange(y) %>% 
  summarize(first = first(y[z == 'this']),
            last = last(y[z == 'this']))


# A tibble: 2 x 3
  x     first      last      
  <chr> <date>     <date>    
1 a     2019-02-01 2019-02-03
2 b     2019-02-01 2019-02-03


dat2 %>% group_by(x) %>% 
  arrange(y) %>% 
  summarize(first = first(y[z == 'this']),
            last = last(y[z == 'this']))

# A tibble: 2 x 3
  x     first      last      
  <chr> <date>     <date>    
1 a     2019-02-01 2019-02-03
2 b     2019-02-01 NA

rensa · February 17, 2019, 7:11am

Hi @kburnham! I think the problem you're having here is that NA == 'this' doesn't evaluate to FALSE: it evaluates to NA. And—this is speculation on my part—I think the subscripting function [ cuts out values that are FALSE, rather than retaining values that are TRUE.

For example:

test1 <- 1:10
test1[7] <- NA

test1[test1 > 5]
#> [1]  6 NA  8  9 10

set.seed(1)
test2 <- rnorm(10)
test2[7] <- NA

test2[test2 > 0]
#> [1] 0.1836433 1.5952808 0.3295078        NA 0.7383247 0.5757814

Created on 2019-02-17 by the reprex package (v0.2.0).

Does that make sense? You may need to modify your example to explicitly test for NA as well:

dat2 %>% group_by(x) %>% 
  arrange(y) %>% 
  summarize(first = first(y[!is.na(z) & z == 'this']),
            last = last(y[!is.na(z) & z == 'this']))

And to confirm that this still respects grouping, let's try a slightly modified data set:

library(tidyverse)

dat3 <- tibble(
  x = rep(c('a', 'b'), each = 4),
  y = seq(as.Date('2019-02-01'), as.Date('2019-02-08'), by = 'days'),
  z = c('this', 'that', 'this', 'that', 'this', 'that', 'this', NA))

dat3 %>% group_by(x) %>% 
  arrange(y) %>% 
  summarize(first = first(y[!is.na(z) & z == 'this']),
            last = last(y[!is.na(z) & z == 'this']))
#> # A tibble: 2 x 3
#>   x     first      last      
#>   <chr> <date>     <date>    
#> 1 a     2019-02-01 2019-02-03
#> 2 b     2019-02-05 2019-02-07

Created on 2019-02-17 by the reprex package (v0.2.0).

kburnham · February 17, 2019, 6:13pm

Thanks @rensa,

That solves the problem. I could have sworn I tried that previously without success, but I must have overlooked something.

kb

andresrcs · February 17, 2019, 6:14pm

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

system · February 24, 2019, 6:19pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.