I have an Excel file (now CSV) that I'm importing to RStudio, however there is a column that's formatted as 37:30:55 that is not recognized in RStudio. For those that don't know what 37:30:55 is, I believe it is a format to calculate time duration over 24hrs. I think RStudio is recognizing it as time and not duration.
The column itself is the duration of a rental in hours. I start to see issues when hours go over 24 hours.
The actual variable is typeof() character. Assuming that it is in the form hours:minutes:seconds and if you are interested in finding those rows with more than 24 hours
I'm not sure is this is responsive to your question, but as @technocrat points out actual comes in as a character. You can use the lubridate package to convert this to hours minutes and seconds. Something like
library(lubridate)
result <- hms(actual)
howManyHours <- hours(result)
Once again, thank you for your input. However I might be a little lost. My knowledge of R is still in its infancy and your solution may be beyond the scope of my knowledge. I can barely understand the R script you wrote.
I'm not interested in finding the rows > 24hrs. I think the issue is that R is finding a problem with rows greater than 24 hrs because of the format and being unable to read and load the data with read_csv.
I tried to view columns with functions like head & glimpse but because of the problems nothing is shown except the name of the original CSV files.
Should or can this column be edited during the read_csv function?
Thanks for your response. Are you suggesting that I change type for column 14 after I read_csv to hms(actual). I'm not trying find a certain amount of hours, I'm just trying to include all the data. I might just have to change the format of Column 14 before or after but don't know how to do that.
My suggestion would be to convert time value result into a numeric value expressing the time in seconds. Then @ Gmillan1219 can do numeric calculations on it
library(lubridate)
result <- hms("37:30:55")
timm <- period_to_seconds(hms(result))
Thus
`timm / 60` gives us rental time in minutes or `tim / (60 * 60)` wil give rental time in hours
I've encountered this very issue before. It seems like the column in your Excel file is displaying the duration in a "days:hours:minutes" format, which is quite typical in Excel for durations longer than 24 hours. However, when imported into RStudio, it's being read as a time, which can cause some confusion.
Here's a potential way to handle this in R:
Import the CSV file as usual. Let's say the problematic column is called "Duration."
Use the strsplit function to split the column into three separate columns (days, hours, and minutes).
Convert these columns into numeric and then combine them into a single duration column in hours.
Here's a bit of code that might help:
Assuming df is your data frame and Duration is the column
The exact command I'm using to read the data frame is:
read_csv(case_study_1)
case_study_1 is a data frame with 12 CSV files assigned to it representing the data for 12 months.
Below is a screen shot of one of the CSV files. The column I am having issues with is column N. Column N represents the time difference between columns D & C. The data is about the time length of rented bicycles.
Some of the time differences exceed 24 hours and could be 108:36:04 (which can be seen in the very first screenshot of this post) meaning 108 hours 36 minutes and 4 seconds. So I believe that once it exceeds 23:59:59 RStudio can't interpret it because its working with a 24 hour time span.
This may be a silly question but can I do this even though the data had problems importing? I'm very new to RStudio and not sure if data could be manipulated even after a failed read_csv.
Also, the data is hh:mm:ss its just that when the hours go over 23:59:59 RStudio cannot interpret it because its working with a 24 hr time span.
Anyway I'm going to try your recommendations. Thanks once again
You may be right. I took one of the files I was working with and provided 4 different time lengths; 3 below 24 hours and 1 above it imported with no problems.
I haven't tried them all individually yet but I did try the first two individually. Which are August 5th, 2022 (contains July 2022 data) & September 8th, 2022 (contains August 2022 data).
I'm using > july <- ("July 2022.csv") > read_csv(july)
But I think i found what may be the problem.
I looked at one of the problem rows in July 2022 and found there was a date in the column 14. In the formula bar it says 1/7/1900 while there aren't dates in the others. I think times over 23:59:59 might be including dates. I did format the whole column as 37:30:55