Date import problem in R

juandmaz · July 24, 2023, 6:00am

I have a column in an excel .xls file where for some reason the dates were saved in 2 different ways and now I can not import them correctly because they are saved in 2 different ways in R and I can not have them in a uniform way.
Any advice?

I show the first 10 observations in excel


12/5/2022 12:53
12/7/2022 08:49
14/03/2023 14:13:00
25/01/2023 08:49:00
12/5/2022 13:54
13/03/2023 11:50:00
1/2/2023 11:01
17/02/2023 09:44:00
23/12/2022 09:59:00
12/5/2022 12:47

And this is how they look in R


<chr>              
 1 44693.536805555559 
 2 44754.367361111108 
 3 14/03/2023 14:13:00
 4 25/01/2023 08:49:00
 5 44693.57916666667  
 6 13/03/2023 11:50:00
 7 44958.459027777775 
 8 17/02/2023 09:44:00
 9 23/12/2022 09:59:00
10 44693.532638888886

MarekGierlinski · July 24, 2023, 8:07am

I'm not sure how you read the Excel file into R, but I'm going to assume here that we read a CSV file and the column with messy dates is read as character. After reading the file, we remove spurious ":00" in some rows and convert character strings into datetime using dmy_hm from lubridate.

# Create text file
"time\n12/5/2022 12:53\n12/7/2022 08:49\n14/03/2023 14:13:00\n25/01/2023 08:49:00\n12/5/2022 13:54\n13/03/2023 11:50:00\n1/2/2023 11:01\n17/02/2023 09:44:00\n23/12/2022 09:59:00\n12/5/2022 12:47"  |> write_lines("test.csv")

library(tidyverse)

# Read text file and convert data
read_tsv("test.csv") |>
  mutate(time = str_remove(time, ":00$")) |>
  mutate(time = dmy_hm(time))

# A tibble: 10 × 1
   time               
   <dttm>             
 1 2022-05-12 12:53:00
 2 2022-07-12 08:49:00
 3 2023-03-14 14:13:00
 4 2023-01-25 08:49:00
 5 2022-05-12 13:54:00
 6 2023-03-13 11:50:00
 7 2023-02-01 11:01:00
 8 2023-02-17 09:44:00
 9 2022-12-23 09:59:00
10 2022-05-12 12:47:00

technocrat · July 24, 2023, 9:18am

The simple case is handled with

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

example <- "12/5/2022 12:53"
convert <- mdy_hm(example)
convert
#> [1] "2022-12-05 12:53:00 UTC"
class(convert)
#> [1] "POSIXct" "POSIXt"
str(convert)
#>  POSIXct[1:1], format: "2022-12-05 12:53:00"

^{Created on 2023-07-24 with reprex v2.0.2}

assuming that MM-DD-YYYY is intended.

This needs, assuming DD-MM-YYYY

example <- "14/03/2023 14:13:00"
(convert <- lubridate::dmy_hms(example))
#> [1] "2023-03-14 14:13:00 UTC"

Sort in this order

1/2/2023 11:01
12/5/2022 12:47
12/5/2022 12:53
12/5/2022 13:54
12/7/2022 08:49
13/03/2023 11:50:00
14/03/2023 14:13:00
17/02/2023 09:44:00
23/12/2022 09:59:00
25/01/2023 08:49:00

and handle separately.

juandmaz · July 24, 2023, 11:00pm

Hi, I am not with reading read_csv but with read_xls since it is an excel .xls file. I tried your method but it doesn't work since half of the observations are imported as numbers and not as dates, see the table I uploaded of how R imports it. * Makes the dmy_hm() method useless.

juandmaz · July 24, 2023, 11:04pm

The problem is that R does not import as different types of dates but as some dates and then the rest of the numbers. I can deal with numbers, the problem is that I don't know how to deal with both types without altering the row order. Your solution was well oriented, but how could it be applied without altering the order of the base? Because I only put 10 rows as an example, but the base has much more.

technocrat · July 24, 2023, 11:28pm

Add an index column to preserve row ordering. Sort by date, process, resort by index. Much easier than doing the logic for dealing with the different patterns.

juandmaz · July 25, 2023, 3:23am

Hi, I used if_else() to discriminate by pattern logic which seemed easier than creating a column of indices. The problem I have is that if you look at the 2 types of date that I have in excel one is numeric (for example, 44693.53263888888886). I use excel_numeric_to_date() to convert it to date but it doesn't do it correctly because it doesn't show the hours and minutes. For example in excel the date is 12/5/2022 12:53:00, in R it appears as 44693.53680555555559 and with excel_numeric_to_date() it appears as 2022-05-12.

EconProf · July 25, 2023, 5:01am

Perhaps convertToDateTime()from the openxlsx package?

MarekGierlinski · July 25, 2023, 6:48am

The problem is that R does not import as different types of dates but as some dates and then the rest of the numbers.

Have you tried forcing the column type to character in read_excel? Then, you should get all data as strings, as in your example.

What happens when you export the Excel file as CSV? Does it still contain a mixture of dates and numbers?

juandmaz · July 25, 2023, 8:21am

Hi, is it possible to specify in read_excel() to read that column as character without specifying the rest of the columns? Or do I have to do it for all of them?

MarekGierlinski · July 25, 2023, 8:46am

You can read one column using col_names and col_types options, just as an experiment to see what happens if you force character type.

juandmaz · July 25, 2023, 9:29am

Hi, I dont know how to read only one column or specify the type of only one column and not the rest

MarekGierlinski · July 25, 2023, 9:52am

df <- read_excel(file_name, range = cell_cols("C"), col_type = "text")

will read column "C" in the sheet.

technocrat · July 25, 2023, 9:55am

numeric_value <- 19393.53263888888886
seconds_in_a_day <- 24 * 60 * 60
datetime_value <- as.POSIXct(numeric_value * seconds_in_a_day, origin = "1970-01-01", tz = "UTC")
datetime_value
#> [1] "2023-02-05 12:47:00 UTC"

^{Created on 2023-07-25 with reprex v2.0.2}

On the other hand, to plan ahead

numeric_value <- 44693.53263888888886
seconds_in_a_day <- 24 * 60 * 60
datetime_value <- as.POSIXct(numeric_value * seconds_in_a_day, origin = "1970-01-01", tz = "UTC")
datetime_value
#> [1] "2092-05-13 12:47:00 UTC"

juandmaz · July 26, 2023, 4:47am

Hi, with the number 44693.53263888888886 (wich in excel is 2/5/2022 12:53:00) your formula returns this output: 2092-05-13 12:47:00 UTC

juandmaz · July 26, 2023, 4:49am

Hi, thanks for the answer but it doesn't works. It read all the columns and returns this warning:

Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In stopifnot(length(ul) == 2L, length(lr) == 2L, length(sheet) ==  :
  NAs introduced by coercion to integer range
4: In stopifnot(length(ul) == 2L, length(lr) == 2L, length(sheet) ==  :
  NAs introduced by coercion to integer range

But,anyways, I check and R imported the column as a character with the previous way

juandmaz · July 26, 2023, 4:50am

Hi, I tried and it works but R adds 3 hours to the excel time, any proposal on how to solve it?

technocrat · July 26, 2023, 6:22am

Excel is where data goes to die. Day 0 was 1970-01-01. 52 years from 2022-02-05. During those 52 years there were leap years in 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020. That 12 leap days plus 52*365 = 18,980 regular days = 18,992 days. This

is not a correct representation of a POSIX date. Instead, it's counting from 1900.

juandmaz · July 26, 2023, 8:10am

I found the solution by using excel_numeric_to_date() and adding the argument include_time = TRUE.

abdgofar · July 31, 2023, 6:14am

Since it's stored in two different file format. Change all to CSV then convert the second date from character to date format.

That's all