Select_if predicate function

andrew57jm · June 14, 2018, 8:49pm

I'm using select_if to clean up dataframes that are downloaded from external databases. So for instance if I want to trim whitespace from character columns I can write:

df <- select_if(df, is.character, trimws)

and my text fields are trimmed. Since the date fields come down as text ("2018-01-02"), I'd like to write a similar statement that converts them to dates. So I write:

is_date <- function(x) str_detect(x,"^\\d{4}-\\d{2}-\\d{2}")
df <- select_if (df, is_date, ymd)

but I get the enigmatic error:

Error in selected[[i]] <- .p(.tbl[[vars[[i]]]], ...) :
more elements supplied than there are to replace

This feels like one of those strange R list things but I can't make it go away. Anyone know how?

Thanks!

mishabalyasin · June 14, 2018, 9:08pm

There are a couple of things that need to be changed in your code.

The main one is that a select_if or mutate_if predicate is called once per column, with the column's values (not its name) as x, and it must return a single TRUE or FALSE. str_detect is vectorised, so your is_date returns one logical per row instead. That is why you are getting the error "more elements supplied than there are to replace": the function tries to store a whole vector of results in a slot meant for one value. I agree, it's pretty cryptic, especially the line above it, but it does give you an idea of where you need to look.

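Second, and I guess it's mostly a typo, you need mutate_if, not select_if. select_if only picks columns (its third argument renames them), whereas mutate_if transforms their values.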

Working code is:

df <- tibble::tibble(date = rep("2018-01-01", 5))
is_date <- function(x) all(stringr::str_detect(x,"^\\d{4}-\\d{2}-\\d{2}"))
df <- dplyr::mutate_if(df, is_date, lubridate::ymd)
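To make the two points above concrete, here is a minimal sketch on a made-up two-column tibble (the column names and values are illustrative, not from the original data). It shows that str_detect returns one logical per element, that select_if applies its third argument to column names rather than values, and a predicate variant that tolerates missing values. Ignoring NA and anchoring the pattern with $ are assumptions about what the real data looks like.

```r
library(dplyr)

# Hypothetical example data, not the original poster's
df <- tibble::tibble(
  name = c(" a", "b ", "c"),
  date = c("2018-01-01", "2018-01-02", NA)
)

# str_detect() is vectorised: one logical per element, not one per column
per_element <- stringr::str_detect(df$date, "^\\d{4}-\\d{2}-\\d{2}$")
length(per_element)

# select_if() keeps the matching columns; its third argument renames them
renamed <- select_if(df, is.character, toupper)
names(renamed)

# Predicate that returns a single TRUE/FALSE and ignores NA values
is_date <- function(x) {
  is.character(x) && any(!is.na(x)) &&
    all(stringr::str_detect(x, "^\\d{4}-\\d{2}-\\d{2}$"), na.rm = TRUE)
}

# mutate_if() transforms the values of the columns the predicate accepts
converted <- mutate_if(df, is_date, lubridate::ymd)
```

If you are on dplyr 1.0.0 or later, mutate(df, across(where(is_date), lubridate::ymd)) should be the equivalent call.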

alistaire · June 14, 2018, 9:14pm

It seems like you'd need mutate_if, not select_if. You may also want to check out readr::type_convert; if you pass it a data frame of character vectors, it will convert them to the correct types (including reasonably formatted dates):

df <- tibble::tibble(
    chr = letters[1:5],
    int = as.character(1:5),
    dbl = as.character(1:5 + .1),
    date = as.character(Sys.Date() - 5:1)
)

df
#> # A tibble: 5 x 4
#>   chr   int   dbl   date      
#>   <chr> <chr> <chr> <chr>     
#> 1 a     1     1.1   2018-06-09
#> 2 b     2     2.1   2018-06-10
#> 3 c     3     3.1   2018-06-11
#> 4 d     4     4.1   2018-06-12
#> 5 e     5     5.1   2018-06-13

readr::type_convert(df)
#> Parsed with column specification:
#> cols(
#>   chr = col_character(),
#>   int = col_double(),
#>   dbl = col_double(),
#>   date = col_date(format = "")
#> )
#> # A tibble: 5 x 4
#>   chr     int   dbl date      
#>   <chr> <dbl> <dbl> <date>    
#> 1 a         1   1.1 2018-06-09
#> 2 b         2   2.1 2018-06-10
#> 3 c         3   3.1 2018-06-11
#> 4 d         4   4.1 2018-06-12
#> 5 e         5   5.1 2018-06-13

Well, it's cautious about integers (it parses them as doubles), but otherwise it does the job. If you need to, you can specify column types explicitly through the col_types argument, in the same fashion as any of readr's reading functions.
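As a rough sketch of the explicit route, assuming a cut-down data frame with just an int and a date column (fixed dates are used here so the result is reproducible, and the date format string is an assumption about the input):

```r
# Hypothetical character-only data frame
df <- tibble::tibble(
  int = as.character(1:5),
  date = as.character(as.Date("2018-06-08") + 1:5)
)

# Spell out the column types instead of letting readr guess them
typed <- readr::type_convert(
  df,
  col_types = readr::cols(
    int = readr::col_integer(),
    date = readr::col_date(format = "%Y-%m-%d")
  )
)
```

With the types given this way, int should come back as an integer column rather than a double.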