How to convert only variables with dates stored as text to date class?

Guys, similar issue here.....
While reading excel files which has 100 columns of which 40 are dates, read as text but when tried to convert it to dates using
apply(myFile,2, as.Date) facing an error as each column is affected.
Is there a solution to convert to dates only if they are convertible in spite of reading them as "text".

 myFile <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X, col_names = T, skip = 5
                                                     , col_types = "text"
                                                     ))

You could try this to convert column types:

Thanks for suggestion. But my file has literally 140 columns. and only 14 are dates.
Can we just read all columns as text and then check if there are numbers matching with date formatable and convert them using lubridate or as.Date or something similar.
Or at least can I mention col_types only for these 14 Dates column ?

You can either usecol_types = "guess" which should find the date columns or if you know which columns are dates, use something like this: col_types = c("text", "text", "date", "text", "date") etc.

Actually, currently using "guess" but Shiny has many warnings keep occuring while it reads this 10MB file with 140 columns and 30,000 rows.
But if I declare col_types, then I should create vector of col_types which is 140 values i.e,
col_types = c( "text", "date",................140th value)

Is there is efficient way to do such things ?
Because I knew the dates column number.
For example: can we specifically mention these 14 dates column to read as Dates and rest as text ?

If your variable names follow some pattern you can read them as text and convert them later with mutate_at() and some regular expressions for the selector.
Try to give some sample data to give you more specific help

If you know the column positions you could construct a string and use a pattern like this:
readr::type_convert(col_types = "ccDcD")

There's a cols helper in readr that will let you set type by column name, e.g. read_csv(...., col_types = cols(mpg = col_double(), gear = 'i')) with a .default parameter that defaults to col_guess(). In this context you'd need to know the names of those 14 columns (or find them by reading in a tiny bit) and they'd each have to be mentioned, but the rest could fall to the default, which is probably fine.

For example,

library(readr)

df <- read_csv("https://github.com/tidyverse/readr/raw/master/inst/extdata/mtcars.csv", 
               col_types = cols(mpg = col_double(), 
                                gear = 'i', 
                                disp = col_date('%F')))

df
#> # A tibble: 32 x 11
#>      mpg   cyl disp          hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#>  1  21       6 0160-01-01   110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6 0160-01-01   110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4 0108-01-01    93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6 0258-01-01   110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8 0360-01-01   175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6 0225-01-01   105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8 0360-01-01   245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4 NA            62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4 NA            95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6 NA           123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

Obviously displacement makes lousy dates and fails sometimes, but note that gear is, in fact, an integer, and the unread columns got read and types determined by col_guess().

I think the complicating issue here is that readxl::read_excel() does not have all the neat options that readr::read_csv() has, so hence reading all as text from Excel and then applying readr::type_convert() is my best guess to a solution if only the Dates are to be converted.

thanks for suggestion. I tried this approach. Positive side is "No warnings" in Shiny after reading the file
Flip side is that Even the 14 columns with dates format are also integer :frowning:

Thanks for detailed example.
Actually those "Dates" column names are weird i.e.,
"Start date of the Product (in date)"
"Sales of the Car (based on contract signature)".......

So, mentioning the column names is too clumsy. But it seems there is no better to do this due to naming of the column.
I was wondering something could be done with known column indices.

Yep, column positions are known but
Not sure what does "ccDcD" means.
Could you please elaborate

Thanks for the great idea.
Column names are really large
Eg: "Sales of the Car produced in european (Date)"
"Product designed and Contract signed date"
.......

But the pattern those 14 dates column name follow is they always have "date" mentioned in its name

So, I was wondering, let me read it as "text" and later convert the those columns with pattern "date" as "as:Date". So, mutate_at() could be the solution.
Should I also use stringr package to find those "date" patterns ?

You can define the column types as a shorthand string. "c" is character, "D" is Date, etc.

You can use a tidyselect function

df <- tibble::tibble(other_variable = c("a", "b"),
                     `Sales of the Car produced in european (Date)` = c("2019-03-08", "2019-03-07"),
                     `Product designed and Contract signed date` = c("2018-03-08", "2018-03-07")
)
library(dplyr)

df %>% 
    mutate_at(vars(contains("date")), as.Date)
#> # A tibble: 2 x 3
#>   other_variable `Sales of the Car produced in… `Product designed and Cont…
#>   <chr>          <date>                         <date>                     
#> 1 a              2019-03-08                     2018-03-08                 
#> 2 b              2019-03-07                     2018-03-07

Created on 2019-03-08 by the reprex package (v0.2.1)

1 Like

Thanks @andresrcs for quality description with example.

Since dates are different format, I adapted your code as below but there is an error in interpretation of dates

library(dplyr)
df <- tibble::tibble(other_variable = c("a", "b"),
                     `Sales of the Car produced in european (Date)` = c("19-03-08", "19-03-07"),
                     `Product designed and Contract signed date` = c("18-03-08", "18-03-07")
)

df %>% 
  mutate_at(vars(contains("date")), funs(as.Date(.,format = "%d-%m-%y")))

Incorrectly interpreted dates were as follow:

2008-03-19                                     2008-03-18                                 
2007-03-19                                     2007-03-18

What happens if you try as.Date(.,format = "%y-%m-%d") instead of as.Date(.,format = "%d-%m-%y")? (Without changing the format of the dates that you're using currently)

Thanks @Yarnabrina, you corrected my interpretation. earlier, was confused.

Actually, not those columns had "date" in the column names but few had "since"

library(dplyr)
df <- tibble::tibble(other_variable = c("a", "b"),
                     `Sales of the Car produced in european (Date)` = c("19-03-08", "19-03-07"),
                     `Product designed and Contract signed since` = c("18-03-08", "18-03-07")
)

# Tried
df %>% 
  mutate_at(vars(contains("date","since")), funs(as.Date(.,format = "%d-%m-%y")))
# Error in if (ignore.case) { : argument is not interpretable as logical
df %>% 
  mutate_at(vars(contains(c("date","since"))), funs(as.Date(.,format = "%d-%m-%y")))
# Error: is_string(match) is not TRUE
df %>% 
  mutate_at(vars(contains("date"|"since")), funs(as.Date(.,format = "%d-%m-%y")))

You can use a regular expression with matches() function and for dealing with dates, lubridate functions are easier to work with, take a look at this example

library(dplyr)
library(lubridate)

df <- tibble::tibble(other_variable = c("a", "b"),
                     `Sales of the Car produced in european (Date)` = c("19-03-08", "19-03-07"),
                     `Product designed and Contract signed since` = c("18-03-08", "18-03-07")
)

df %>% 
    mutate_at(vars(matches("date|since")), ~ymd(.x))
#> # A tibble: 2 x 3
#>   other_variable `Sales of the Car produced i… `Product designed and Contr…
#>   <chr>          <date>                        <date>                      
#> 1 a              2019-03-08                    2018-03-08                  
#> 2 b              2019-03-07                    2018-03-07

Im not sure why you were using this format ("%d-%m-%y) but you can do that with the dmy() function from lubridate if that is what you are looking for.

df %>% 
    mutate_at(vars(matches("date|since")), ~dmy(.x))
#> # A tibble: 2 x 3
#>   other_variable `Sales of the Car produced i… `Product designed and Contr…
#>   <chr>          <date>                        <date>                      
#> 1 a              2008-03-19                    2008-03-18                  
#> 2 b              2007-03-19                    2007-03-18
1 Like

As this question started with this code to read excel file as text.

 myFile <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X, col_names = T, skip = 5
                                                     , col_types = "text"
                                                     ))

myFile is read as text
So, all the columns are read as "text" including dates (Format: 11-03-2019) column.

# The columns with date is  **not read** as below:
 `Sales of the Car produced in european (Date)` = c("2019-03-08", "2019-03-07"),
  `Product designed and Contract signed date` = c("2018-03-08", "2018-03-07")
Instead
 `Sales of the Car produced in european (Date)` = c("43427","43116"),
  `Product designed and Contract signed date` = c("42124", "43427")

Now, when I used your code to convert to date format

myFile %>% 
    mutate_at(vars(matches("date|since")), ~dmy(.x))

those dates' columns have
only NA