I've often used data %>% filter(is.na(col)) to inspect the rows where a value is missing. There's often a lot of context that needs investigation before I decide to remove missing data, and I'm always scared of things like na.omit() or complete.cases().
Today something happened that seemed weird, which is why I'm asking, "[a]m I crazy?"
It seems like dplyr::filter is behaving differently; at least some older code is not working the way it used to. I often use the Interval class from lubridate in my work, and usually an Interval column in a tbl_df doesn't throw filter off in this way. Take a look:
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.5
#> ✔ tidyr 0.8.0 ✔ stringr 1.3.0
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ dplyr::vars() masks ggplot2::vars()
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
data <- tribble(
  ~id, ~move_in,   ~move_out,
  147, "20110115", "20130521",
  148, "20170222", NA,
  149, NA,         NA,
  150, NA,         "20170101",
  151, "20160506", "20180125") %>%
  mutate(
    move_in = parse_date(move_in, "%Y%m%d"),
    move_out = parse_date(move_out, "%Y%m%d"),
    length_of_stay = move_in %--% move_out
  )
glimpse(data)
#> Observations: 5
#> Variables: 4
#> $ id             <dbl> 147, 148, 149, 150, 151
#> $ move_in        <date> 2011-01-15, 2017-02-22, NA, NA, 2016-05-06
#> $ move_out       <date> 2013-05-21, NA, NA, 2017-01-01, 2018-01-25
#> $ length_of_stay <S4: Interval> 2011-01-15 UTC--2013-05-21 UTC, 2017-0...
data %>% filter(is.na(move_in))
#> Error in filter_impl(.data, quo): Column `length_of_stay` classes Period and Interval from lubridate are currently not supported.
data %>% filter(!is.na(move_in))
#> Error in filter_impl(.data, quo): Column `length_of_stay` classes Period and Interval from lubridate are currently not supported.
data %>%
  select(-length_of_stay) %>%
  filter(is.na(move_in))
#> # A tibble: 2 x 3
#>      id move_in    move_out
#>   <dbl> <date>     <date>
#> 1  149. NA         NA
#> 2  150. NA         2017-01-01
data %>%
  tally(is.na(move_in))
#> # A tibble: 1 x 1
#>       n
#>   <int>
#> 1     2
data %>%
  count(year(move_in))
#> # A tibble: 4 x 2
#>   `year(move_in)`     n
#>             <dbl> <int>
#> 1           2011.     1
#> 2           2016.     1
#> 3           2017.     1
#> 4             NA      2
It seems like I can't use filter so long as this Interval column is in the data, but if I remove it, things work. The strange thing, though, is that functions like count() and tally() don't seem to be thrown off in the same way.
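For now, here is a workaround sketch based only on what I observed above (filter runs once the Interval column is gone, and mutate is happy to create one): filter on the plain date columns first and compute the interval afterwards. The names stays and missing_move_in are just placeholders for illustration, and I haven't checked how this behaves on other dplyr versions.

```r
library(tidyverse)
library(lubridate)

# Same example rows as above, but without the Interval column yet.
# `stays` is a placeholder name used only for this sketch.
stays <- tribble(
  ~id, ~move_in,   ~move_out,
  147, "20110115", "20130521",
  148, "20170222", NA,
  149, NA,         NA,
  150, NA,         "20170101",
  151, "20160506", "20180125") %>%
  mutate(
    move_in = parse_date(move_in, "%Y%m%d"),
    move_out = parse_date(move_out, "%Y%m%d")
  )

# Filter while the tibble only holds plain columns,
# then build the Interval column on the rows that remain.
missing_move_in <- stays %>%
  filter(is.na(move_in)) %>%
  mutate(length_of_stay = move_in %--% move_out)

missing_move_in
```

This only reorders the steps, so it doesn't explain why filter rejects the Interval column while count() and tally() accept it.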
Maybe I'm not doing something correctly 🤷‍♂️