Date import problem in R

I have a column in an excel .xls file where for some reason the dates were saved in 2 different ways and now I can not import them correctly because they are saved in 2 different ways in R and I can not have them in a uniform way.
Any advice?

I show the first 10 observations in excel


12/5/2022 12:53
12/7/2022 08:49
14/03/2023 14:13:00
25/01/2023 08:49:00
12/5/2022 13:54
13/03/2023 11:50:00
1/2/2023 11:01
17/02/2023 09:44:00
23/12/2022 09:59:00
12/5/2022 12:47

And this is how they look in R


<chr>              
 1 44693.536805555559 
 2 44754.367361111108 
 3 14/03/2023 14:13:00
 4 25/01/2023 08:49:00
 5 44693.57916666667  
 6 13/03/2023 11:50:00
 7 44958.459027777775 
 8 17/02/2023 09:44:00
 9 23/12/2022 09:59:00
10 44693.532638888886 

I'm not sure how you read the Excel file into R, but I'm going to assume here that we read a CSV file and the column with messy dates is read as character. After reading the file, we remove spurious ":00" in some rows and convert character strings into datetime using dmy_hm from lubridate.

# Create text file
"time\n12/5/2022 12:53\n12/7/2022 08:49\n14/03/2023 14:13:00\n25/01/2023 08:49:00\n12/5/2022 13:54\n13/03/2023 11:50:00\n1/2/2023 11:01\n17/02/2023 09:44:00\n23/12/2022 09:59:00\n12/5/2022 12:47"  |> write_lines("test.csv")

library(tidyverse)

# Read text file and convert data
read_tsv("test.csv") |>
  mutate(time = str_remove(time, ":00$")) |>
  mutate(time = dmy_hm(time))

# A tibble: 10 × 1
   time               
   <dttm>             
 1 2022-05-12 12:53:00
 2 2022-07-12 08:49:00
 3 2023-03-14 14:13:00
 4 2023-01-25 08:49:00
 5 2022-05-12 13:54:00
 6 2023-03-13 11:50:00
 7 2023-02-01 11:01:00
 8 2023-02-17 09:44:00
 9 2022-12-23 09:59:00
10 2022-05-12 12:47:00

The simple case is handled with

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

example <- "12/5/2022 12:53"
convert <- mdy_hm(example)
convert
#> [1] "2022-12-05 12:53:00 UTC"
class(convert)
#> [1] "POSIXct" "POSIXt"
str(convert)
#>  POSIXct[1:1], format: "2022-12-05 12:53:00"

Created on 2023-07-24 with reprex v2.0.2

assuming that MM-DD-YYYY is intended.

This needs, assuming DD-MM-YYYY

example <- "14/03/2023 14:13:00"
(convert <- lubridate::dmy_hms(example))
#> [1] "2023-03-14 14:13:00 UTC"

Sort in this order

1/2/2023 11:01
12/5/2022 12:47
12/5/2022 12:53
12/5/2022 13:54
12/7/2022 08:49
13/03/2023 11:50:00
14/03/2023 14:13:00
17/02/2023 09:44:00
23/12/2022 09:59:00
25/01/2023 08:49:00

and handle separately.

Hi, I am not with reading read_csv but with read_xls since it is an excel .xls file. I tried your method but it doesn't work since half of the observations are imported as numbers and not as dates, see the table I uploaded of how R imports it. * Makes the dmy_hm() method useless.

The problem is that R does not import as different types of dates but as some dates and then the rest of the numbers. I can deal with numbers, the problem is that I don't know how to deal with both types without altering the row order. Your solution was well oriented, but how could it be applied without altering the order of the base? Because I only put 10 rows as an example, but the base has much more.

Add an index column to preserve row ordering. Sort by date, process, resort by index. Much easier than doing the logic for dealing with the different patterns.

Hi, I used if_else() to discriminate by pattern logic which seemed easier than creating a column of indices. The problem I have is that if you look at the 2 types of date that I have in excel one is numeric (for example, 44693.53263888888886). I use excel_numeric_to_date() to convert it to date but it doesn't do it correctly because it doesn't show the hours and minutes. For example in excel the date is 12/5/2022 12:53:00, in R it appears as 44693.53680555555559 and with excel_numeric_to_date() it appears as 2022-05-12.

Perhaps convertToDateTime()from the openxlsx package?

The problem is that R does not import as different types of dates but as some dates and then the rest of the numbers.

Have you tried forcing the column type to character in read_excel? Then, you should get all data as strings, as in your example.

What happens when you export the Excel file as CSV? Does it still contain a mixture of dates and numbers?

Hi, is it possible to specify in read_excel() to read that column as character without specifying the rest of the columns? Or do I have to do it for all of them?

You can read one column using col_names and col_types options, just as an experiment to see what happens if you force character type.

Hi, I dont know how to read only one column or specify the type of only one column and not the rest

df <- read_excel(file_name, range = cell_cols("C"), col_type = "text")

will read column "C" in the sheet.

numeric_value <- 19393.53263888888886
seconds_in_a_day <- 24 * 60 * 60
datetime_value <- as.POSIXct(numeric_value * seconds_in_a_day, origin = "1970-01-01", tz = "UTC")
datetime_value
#> [1] "2023-02-05 12:47:00 UTC"

Created on 2023-07-25 with reprex v2.0.2

On the other hand, to plan ahead

numeric_value <- 44693.53263888888886
seconds_in_a_day <- 24 * 60 * 60
datetime_value <- as.POSIXct(numeric_value * seconds_in_a_day, origin = "1970-01-01", tz = "UTC")
datetime_value
#> [1] "2092-05-13 12:47:00 UTC"

Hi, with the number 44693.53263888888886 (wich in excel is 2/5/2022 12:53:00) your formula returns this output: 2092-05-13 12:47:00 UTC

Hi, thanks for the answer but it doesn't works. It read all the columns and returns this warning:

Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In stopifnot(length(ul) == 2L, length(lr) == 2L, length(sheet) ==  :
  NAs introduced by coercion to integer range
4: In stopifnot(length(ul) == 2L, length(lr) == 2L, length(sheet) ==  :
  NAs introduced by coercion to integer range

But,anyways, I check and R imported the column as a character with the previous way

Hi, I tried and it works but R adds 3 hours to the excel time, any proposal on how to solve it?

Excel is where data goes to die. Day 0 was 1970-01-01. 52 years from 2022-02-05. During those 52 years there were leap years in 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020. That 12 leap days plus 52*365 = 18,980 regular days = 18,992 days. This

is not a correct representation of a POSIX date. Instead, it's counting from 1900.

2 Likes

I found the solution by using excel_numeric_to_date() and adding the argument include_time = TRUE.

1 Like

Since it's stored in two different file format. Change all to CSV then convert the second date from character to date format.

That's all