How to convert only variables with dates stored as text to date class?

AbhishekHP · March 7, 2019, 3:16pm

Guys, similar issue here.....
While reading excel files which has 100 columns of which 40 are dates, read as text but when tried to convert it to dates using
apply(myFile,2, as.Date) facing an error as each column is affected.
Is there a solution to convert to dates only if they are convertible in spite of reading them as "text".

 myFile <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X, col_names = T, skip = 5
                                                     , col_types = "text"
                                                     ))

martin.R · March 7, 2019, 3:44pm

You could try this to convert column types:

AbhishekHP · March 7, 2019, 3:53pm

Thanks for suggestion. But my file has literally 140 columns. and only 14 are dates.
Can we just read all columns as text and then check if there are numbers matching with date formatable and convert them using lubridate or as.Date or something similar.
Or at least can I mention col_types only for these 14 Dates column ?

martin.R · March 7, 2019, 4:03pm

You can either usecol_types = "guess" which should find the date columns or if you know which columns are dates, use something like this: col_types = c("text", "text", "date", "text", "date") etc.

AbhishekHP · March 7, 2019, 4:20pm

Actually, currently using "guess" but Shiny has many warnings keep occuring while it reads this 10MB file with 140 columns and 30,000 rows.
But if I declare col_types, then I should create vector of col_types which is 140 values i.e,
col_types = c( "text", "date",................140th value)

Is there is efficient way to do such things ?
Because I knew the dates column number.
For example: can we specifically mention these 14 dates column to read as Dates and rest as text ?

andresrcs · March 7, 2019, 5:10pm

If your variable names follow some pattern you can read them as text and convert them later with mutate_at() and some regular expressions for the selector.
Try to give some sample data to give you more specific help

martin.R · March 7, 2019, 5:14pm

If you know the column positions you could construct a string and use a pattern like this:
readr::type_convert(col_types = "ccDcD")

alistaire · March 8, 2019, 1:34am

There's a cols helper in readr that will let you set type by column name, e.g. read_csv(...., col_types = cols(mpg = col_double(), gear = 'i')) with a .default parameter that defaults to col_guess(). In this context you'd need to know the names of those 14 columns (or find them by reading in a tiny bit) and they'd each have to be mentioned, but the rest could fall to the default, which is probably fine.

For example,

library(readr)

df <- read_csv("https://github.com/tidyverse/readr/raw/master/inst/extdata/mtcars.csv", 
               col_types = cols(mpg = col_double(), 
                                gear = 'i', 
                                disp = col_date('%F')))

df
#> # A tibble: 32 x 11
#>      mpg   cyl disp          hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#>  1  21       6 0160-01-01   110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6 0160-01-01   110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4 0108-01-01    93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6 0258-01-01   110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8 0360-01-01   175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6 0225-01-01   105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8 0360-01-01   245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4 NA            62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4 NA            95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6 NA           123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

Obviously displacement makes lousy dates and fails sometimes, but note that gear is, in fact, an integer, and the unread columns got read and types determined by col_guess().

martin.R · March 8, 2019, 10:15am

I think the complicating issue here is that readxl::read_excel() does not have all the neat options that readr::read_csv() has, so hence reading all as text from Excel and then applying readr::type_convert() is my best guess to a solution if only the Dates are to be converted.

AbhishekHP · March 8, 2019, 10:49am

thanks for suggestion. I tried this approach. Positive side is "No warnings" in Shiny after reading the file
Flip side is that Even the 14 columns with dates format are also integer

AbhishekHP · March 8, 2019, 10:53am

Thanks for detailed example.
Actually those "Dates" column names are weird i.e.,
"Start date of the Product (in date)"
"Sales of the Car (based on contract signature)".......

So, mentioning the column names is too clumsy. But it seems there is no better to do this due to naming of the column.
I was wondering something could be done with known column indices.

AbhishekHP · March 8, 2019, 10:55am

Yep, column positions are known but
Not sure what does "ccDcD" means.
Could you please elaborate

AbhishekHP · March 8, 2019, 11:00am

Thanks for the great idea.
Column names are really large
Eg: "Sales of the Car produced in european (Date)"
"Product designed and Contract signed date"
.......

But the pattern those 14 dates column name follow is they always have "date" mentioned in its name

So, I was wondering, let me read it as "text" and later convert the those columns with pattern "date" as "as:Date". So, mutate_at() could be the solution.
Should I also use stringr package to find those "date" patterns ?

martin.R · March 8, 2019, 11:14am

You can define the column types as a shorthand string. "c" is character, "D" is Date, etc.

andresrcs · March 8, 2019, 1:36pm

You can use a tidyselect function

df <- tibble::tibble(other_variable = c("a", "b"),
                     `Sales of the Car produced in european (Date)` = c("2019-03-08", "2019-03-07"),
                     `Product designed and Contract signed date` = c("2018-03-08", "2018-03-07")
)
library(dplyr)

df %>% 
    mutate_at(vars(contains("date")), as.Date)
#> # A tibble: 2 x 3
#>   other_variable `Sales of the Car produced in… `Product designed and Cont…
#>   <chr>          <date>                         <date>                     
#> 1 a              2019-03-08                     2018-03-08                 
#> 2 b              2019-03-07                     2018-03-07

^{Created on 2019-03-08 by the reprex package (v0.2.1)}

AbhishekHP · March 11, 2019, 9:32am

Thanks @andresrcs for quality description with example.

Since dates are different format, I adapted your code as below but there is an error in interpretation of dates

library(dplyr)
df <- tibble::tibble(other_variable = c("a", "b"),
                     `Sales of the Car produced in european (Date)` = c("19-03-08", "19-03-07"),
                     `Product designed and Contract signed date` = c("18-03-08", "18-03-07")
)

df %>% 
  mutate_at(vars(contains("date")), funs(as.Date(.,format = "%d-%m-%y")))

Incorrectly interpreted dates were as follow:

2008-03-19                                     2008-03-18                                 
2007-03-19                                     2007-03-18

Yarnabrina · March 11, 2019, 11:12am

What happens if you try as.Date(.,format = "%y-%m-%d") instead of as.Date(.,format = "%d-%m-%y")? (Without changing the format of the dates that you're using currently)

AbhishekHP · March 11, 2019, 11:16am

Thanks @Yarnabrina, you corrected my interpretation. earlier, was confused.

Actually, not those columns had "date" in the column names but few had "since"

library(dplyr)
df <- tibble::tibble(other_variable = c("a", "b"),
                     `Sales of the Car produced in european (Date)` = c("19-03-08", "19-03-07"),
                     `Product designed and Contract signed since` = c("18-03-08", "18-03-07")
)

# Tried
df %>% 
  mutate_at(vars(contains("date","since")), funs(as.Date(.,format = "%d-%m-%y")))
# Error in if (ignore.case) { : argument is not interpretable as logical
df %>% 
  mutate_at(vars(contains(c("date","since"))), funs(as.Date(.,format = "%d-%m-%y")))
# Error: is_string(match) is not TRUE
df %>% 
  mutate_at(vars(contains("date"|"since")), funs(as.Date(.,format = "%d-%m-%y")))

andresrcs · March 11, 2019, 1:07pm

You can use a regular expression with matches() function and for dealing with dates, lubridate functions are easier to work with, take a look at this example

library(dplyr)
library(lubridate)

df <- tibble::tibble(other_variable = c("a", "b"),
                     `Sales of the Car produced in european (Date)` = c("19-03-08", "19-03-07"),
                     `Product designed and Contract signed since` = c("18-03-08", "18-03-07")
)

df %>% 
    mutate_at(vars(matches("date|since")), ~ymd(.x))
#> # A tibble: 2 x 3
#>   other_variable `Sales of the Car produced i… `Product designed and Contr…
#>   <chr>          <date>                        <date>                      
#> 1 a              2019-03-08                    2018-03-08                  
#> 2 b              2019-03-07                    2018-03-07

Im not sure why you were using this format ("%d-%m-%y) but you can do that with the dmy() function from lubridate if that is what you are looking for.

df %>% 
    mutate_at(vars(matches("date|since")), ~dmy(.x))
#> # A tibble: 2 x 3
#>   other_variable `Sales of the Car produced i… `Product designed and Contr…
#>   <chr>          <date>                        <date>                      
#> 1 a              2008-03-19                    2008-03-18                  
#> 2 b              2007-03-19                    2007-03-18

AbhishekHP · March 11, 2019, 1:21pm

As this question started with this code to read excel file as text.

 myFile <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X, col_names = T, skip = 5
                                                     , col_types = "text"
                                                     ))

myFile is read as text
So, all the columns are read as "text" including dates (Format: 11-03-2019) column.

# The columns with date is  **not read** as below:
 `Sales of the Car produced in european (Date)` = c("2019-03-08", "2019-03-07"),
  `Product designed and Contract signed date` = c("2018-03-08", "2018-03-07")
Instead
 `Sales of the Car produced in european (Date)` = c("43427","43116"),
  `Product designed and Contract signed date` = c("42124", "43427")

Now, when I used your code to convert to date format

myFile %>% 
    mutate_at(vars(matches("date|since")), ~dmy(.x))

those dates' columns have
only NA