I often use data %>% filter(is.na(col)) to inspect the rows where a value is missing. There's usually a lot of context that needs investigation before I decide to remove missing data, and I'm always scared of things like na.omit() or complete.cases().
Today something happened that seemed weird, which is why I'm asking, "[a]m I crazy?"
It seems like dplyr::filter is behaving differently; at least some older code is not working the way that it used to. I often use the Interval class from lubridate in my work, and an interval column in a tbl_df usually doesn't throw filter off in this way. Take a look:
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.5
#> ✔ tidyr 0.8.0 ✔ stringr 1.3.0
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ dplyr::vars() masks ggplot2::vars()
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
data <- tribble(
  ~id, ~move_in, ~move_out,
  147, "20110115", "20130521",
  148, "20170222", NA,
  149, NA, NA,
  150, NA, "20170101",
  151, "20160506", "20180125") %>%
  mutate(
    move_in = parse_date(move_in, "%Y%m%d"),
    move_out = parse_date(move_out, "%Y%m%d"),
    length_of_stay = move_in %--% move_out
  )
glimpse(data)
#> Observations: 5
#> Variables: 4
#> $ id <dbl> 147, 148, 149, 150, 151
#> $ move_in <date> 2011-01-15, 2017-02-22, NA, NA, 2016-05-06
#> $ move_out <date> 2013-05-21, NA, NA, 2017-01-01, 2018-01-25
#> $ length_of_stay <S4: Interval> 2011-01-15 UTC--2013-05-21 UTC, 2017-0...
data %>% filter(is.na(move_in))
#> Error in filter_impl(.data, quo): Column `length_of_stay` classes Period and Interval from lubridate are currently not supported.
data %>% filter(!is.na(move_in))
#> Error in filter_impl(.data, quo): Column `length_of_stay` classes Period and Interval from lubridate are currently not supported.
data %>%
  select(-length_of_stay) %>%
  filter(is.na(move_in))
#> # A tibble: 2 x 3
#>      id move_in    move_out
#>   <dbl> <date>     <date>
#> 1  149. NA         NA
#> 2  150. NA         2017-01-01
data %>%
  tally(is.na(move_in))
#> # A tibble: 1 x 1
#>       n
#>   <int>
#> 1     2
data %>%
  count(year(move_in))
#> # A tibble: 4 x 2
#>   `year(move_in)`     n
#>             <dbl> <int>
#> 1           2011.     1
#> 2           2016.     1
#> 3           2017.     1
#> 4             NA      2
It seems like I can't use filter as long as this Interval column is in the data, but if I remove it, things work. The strange thing, though, is that functions like count() and tally() don't seem to be thrown off in this way.
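To make that workaround concrete, here is a minimal sketch that filters first and only builds the interval afterwards, so filter never sees an Interval column. It assumes the interval is not needed for the filtering step itself. The names stays and missing_move_in are made up for illustration, and I don't know whether this is the recommended approach.

```r
library(dplyr)
library(lubridate)

# Hypothetical copy of the example data, with the dates already parsed
stays <- tibble(
  id       = c(147, 148, 149, 150, 151),
  move_in  = as.Date(c("2011-01-15", "2017-02-22", NA, NA, "2016-05-06")),
  move_out = as.Date(c("2013-05-21", NA, NA, "2017-01-01", "2018-01-25"))
)

# Filter while the table holds only plain columns,
# then add the Interval column to the rows that remain
missing_move_in <- stays %>%
  filter(is.na(move_in)) %>%
  mutate(length_of_stay = move_in %--% move_out)
```

The downside is that the interval has to be rebuilt after every filtering step, so this only helps when the filter condition doesn't depend on length_of_stay.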
Maybe I'm not doing something correctly 🤷‍♂️