Converting columns to date and numeric format but I get NA conversion error

I have a large data set around 6 million rows of data. The data set was created by using bindrows from several different spreadsheets all consisting of the same columns.

I want to create new columns for date, month, year, and day of the week so I run the following code:

all_trips$date <- as.Date(all_trips$start_time)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

After the code is complete, I run the str(all_trips) function I get :

tibble [6,444,645 × 16] (S3: tbl_df/tbl/data.frame)
 $ trip_id          : chr [1:6444645] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
 $ bikeid           : chr [1:6444645] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ start_time       : chr [1:6444645] "11/27/21 13:27" "11/27/21 13:38" "11/26/21 22:03" "11/27/21 9:56" ...
 $ end_time         : chr [1:6444645] "11/27/21 13:46" "11/27/21 13:56" "11/26/21 22:05" "11/27/21 10:01" ...
 $ ride length      : 'hms' num [1:6444645] 00:19:00 00:17:45 00:02:22 00:05:01 ...
  ..- attr(*, "units")= chr "secs"
 $ day of the week  : chr [1:6444645] "Saturday" "Saturday" "Friday" "Saturday" ...
 $ from_station_name: chr [1:6444645] NA NA NA NA ...
 $ from_station_id  : chr [1:6444645] NA NA NA NA ...
 $ to_station_name  : chr [1:6444645] NA NA NA NA ...
 $ to_station_id    : chr [1:6444645] NA NA NA NA ...
 $ usertype         : chr [1:6444645] "casual" "casual" "casual" "casual" ...
 $ date             : Date[1:6444645], format: NA NA NA NA ...
 $ month            : chr [1:6444645] NA NA NA NA ...
 $ day              : chr [1:6444645] NA NA NA NA ...
 $ year             : chr [1:6444645] NA NA NA NA ...
 $ day_of_week      : chr [1:6444645] NA NA NA NA ...

however I keep getting NA's for my new columns, I've also attempted to convert start_time and end time to a numeric format using :

all_trips$started_at = as.numeric(as.character(all_trips$started_at))
all_trips$ended_at = as.numeric(as.character(all_trips$ended_at))

The result of the code is NA conversion errors, this is for a portfolio project any help with this would be greatly appreciated.

Hi @antik731 , if you put a reproducible example is better for get a good response.
Like you have larger data set if you put a few rows is well.

See this:

Maybe a option is set the format in all_trips$date column and next try the other codes. lubridate library could help in other way.

The rows of data are in the code that I used as an example, I'm not too sure on what your solution is can you please explain it?

They are not usable as data. We need something that we can read into R.

A handy way to supply some sample data is the dput() function. In the case of a large dataset something like dput(head(mydata, 100)) should supply the data we need. Just do dput(mydata) where mydata is your data. Copy the output and paste it here.

Here is a very simple example of dput() output

dput(dat)
structure(list(cat = c(1L, 1L, 1L, 2L), dog = c(2L, 2L, 1L, 5L
), parrot = c(3L, 6L, 1L, 6L)), row.names = c(NA, 4L), class = "data.frame")

The problem is start_time is not in an unambiguous format, you need to specify the date format when converting it to date class.

all_trips$date <- as.Date(all_trips$start_time, format = "%m/%d/%y")

Note: Next time please provide a proper REPRoducible EXample (reprex) illustrating your issue.

I am also doing the same project and i also encountered the same problem and this is how i solved. I am attaching screenshot of my code.
Basically when you are trying to create the new "Date" column there you have to tell "as.Date" function that your date is in which format and then wrap that "as.Date" function in "format" function and then tell format function you want the dates in which format.

all_trips$Date <- format(as.Date(all_trips$started_at, format = "%d/%m/%Y"), "%Y-%m-%d")

all_trips$Month <- format(as.Date(all_trips$Date),format = "%B")

all_trips$Day <- format(as.Date(all_trips$Date),format = "%A")

all_trips$Year <- format(as.Date(all_trips$Date),format = "%Y")
1 Like

This is great.
But how did you achieve the Difftime? I have been on this project for almost a month now and I have watched countless of YouTube videos and yet I am still stocked. Kindly help

Here is an example of calculating the time difference between the start_time and end_time.

#Invent some data    
all_trips <- data.frame(start_time = c("11/27/21 13:27", "11/27/21 13:38"),
                        end_time = c("11/27/21 13:46", "11/27/21 13:56"))
all_trips
#>       start_time       end_time
#> 1 11/27/21 13:27 11/27/21 13:46
#> 2 11/27/21 13:38 11/27/21 13:56

#str() shows that the columns are characters
str(all_trips)
#> 'data.frame':    2 obs. of  2 variables:
#>  $ start_time: chr  "11/27/21 13:27" "11/27/21 13:38"
#>  $ end_time  : chr  "11/27/21 13:46" "11/27/21 13:56"

#convert the characters to time stamps
all_trips$start_time <- as.POSIXct(all_trips$start_time, format = "%m/%d/%y %H:%M")
all_trips$end_time <- as.POSIXct(all_trips$end_time, format = "%m/%d/%y %H:%M")
str(all_trips)
#> 'data.frame':    2 obs. of  2 variables:
#>  $ start_time: POSIXct, format: "2021-11-27 13:27:00" "2021-11-27 13:38:00"
#>  $ end_time  : POSIXct, format: "2021-11-27 13:46:00" "2021-11-27 13:56:00"

#Calculate time different
all_trips$DiffTime <- all_trips$end_time - all_trips$start_time
all_trips
#>            start_time            end_time DiffTime
#> 1 2021-11-27 13:27:00 2021-11-27 13:46:00  19 mins
#> 2 2021-11-27 13:38:00 2021-11-27 13:56:00  18 mins

Created on 2024-01-27 with reprex v2.0.2

Thank you so much for these insight. I will revert as soon as I try these

I just tried it and it didn't work perfectly as all the data in the new variable created was converted to "NA"

Please see result in the attached

This is my script

It is very hard to see what you did from images.

I don't see the start_time and end_time columns in your images, so I can't tell anything about why you have trouble calculating the time difference.

I can reproduce the creation of dates like 0004-01-20 with code like this

all_trips <- data.frame(start_time = c("4/1/2019 0:02", "4/1/2019 0:03"))
as.Date(all_trips$start_time)
[1] "0004-01-20" "0004-01-20"

This makes me think you made the start_date and end_date columns with as.Date() without specifying the date format.

Please do the following. Get all_trips into the state it has before you run any transformations on the start_time and end_time columns. Then run

dput(head(all_trips))

Post the output of that here, putting a line with three back ticks before and after the output, like this:
```
output of dput() goes here
```
The back tick key is just to the left of the 1 key on a US Keyboard.

  • Please share the code you're using to perform the conversion, including any relevant function calls or commands.

The above is the result I got from using the code "dput(head(all_trips))"

I also ran this code to check "str(all_trips)" and below is what I got

'data.frame':	3820235 obs. of  9 variables:
 $ ride_id            : chr  "22178529" "22178530" "22178531" "22178532" ...
 $ started_at         : chr  "4/1/2019 0:02" "4/1/2019 0:03" "4/1/2019 0:11" "4/1/2019 0:13" ...
 $ ended_at           : chr  "4/1/2019 0:09" "4/1/2019 0:20" "4/1/2019 0:15" "4/1/2019 0:18" ...
 $ rideable_type      : chr  "6251" "6226" "5649" "4151" ...
 $ start_station_id   : int  81 317 283 26 202 420 503 260 211 211 ...
 $ start_station_name : chr  "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
 $ end_station_id     : int  56 59 174 133 129 426 500 499 211 211 ...
 $ end_station_name   : chr  "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
 $ member_casual      : chr  "member" "member" "member" "member" ...

Here is the application of my previous code to the head() of your all_trips data.

all_trips <- structure(list(ride_id = c("22178529", "22178530", "22178531", 
                                        "22178532", "22178533", "22178534"), 
                            started_at = c("4/1/2019 0:02", "4/1/2019 0:03", "4/1/2019 0:11", "4/1/2019 0:13", "4/1/2019 0:19", 
                                                                                            "4/1/2019 0:19"), 
                            ended_at = c("4/1/2019 0:09", "4/1/2019 0:20", "4/1/2019 0:15", "4/1/2019 0:18", "4/1/2019 0:36", "4/1/2019 0:23"
                                                                                            ), 
                            rideable_type = c("6251", "6226", "5649", "4151", "3270","3123"), 
                            start_station_id = c(81L, 317L, 283L, 26L, 202L, 420L), 
                            start_station_name = c("Daley Center Plaza", "Wood St & Taylor St","LaSalle St & Jackson Blvd", "McClurg Ct & Illinois St", "Halsted St & 18th St", 
                                                    "Ellis Ave & 55th St"), 
                            end_station_id = c(56L, 59L, 174L, 133L, 129L, 426L), 
                            end_station_name = c("Desplaines St & Kinzie St","Wabash Ave & Roosevelt Rd", "Canal St & Madison St", "Kingsbury St & Kinzie St", 
                                                  "Blue Island Ave & 18th St", "Ellis Ave & 60th St"), 
                            member_casual = c("member","member", "member", "member", "member", "member")), 
                       row.names = c(NA, 6L), class = "data.frame")
str(all_trips)
#> 'data.frame':    6 obs. of  9 variables:
#>  $ ride_id           : chr  "22178529" "22178530" "22178531" "22178532" ...
#>  $ started_at        : chr  "4/1/2019 0:02" "4/1/2019 0:03" "4/1/2019 0:11" "4/1/2019 0:13" ...
#>  $ ended_at          : chr  "4/1/2019 0:09" "4/1/2019 0:20" "4/1/2019 0:15" "4/1/2019 0:18" ...
#>  $ rideable_type     : chr  "6251" "6226" "5649" "4151" ...
#>  $ start_station_id  : int  81 317 283 26 202 420
#>  $ start_station_name: chr  "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
#>  $ end_station_id    : int  56 59 174 133 129 426
#>  $ end_station_name  : chr  "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
#>  $ member_casual     : chr  "member" "member" "member" "member" ...
all_trips$started_at <- as.POSIXct(all_trips$started_at, format = "%m/%d/%Y %H:%M")
all_trips$ended_at <- as.POSIXct(all_trips$ended_at, format = "%m/%d/%Y %H:%M")
str(all_trips)
#> 'data.frame':    6 obs. of  9 variables:
#>  $ ride_id           : chr  "22178529" "22178530" "22178531" "22178532" ...
#>  $ started_at        : POSIXct, format: "2019-04-01 00:02:00" "2019-04-01 00:03:00" ...
#>  $ ended_at          : POSIXct, format: "2019-04-01 00:09:00" "2019-04-01 00:20:00" ...
#>  $ rideable_type     : chr  "6251" "6226" "5649" "4151" ...
#>  $ start_station_id  : int  81 317 283 26 202 420
#>  $ start_station_name: chr  "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
#>  $ end_station_id    : int  56 59 174 133 129 426
#>  $ end_station_name  : chr  "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
#>  $ member_casual     : chr  "member" "member" "member" "member" ...
all_trips$DiffTime <- all_trips$ended_at - all_trips$started_at
str(all_trips)
#> 'data.frame':    6 obs. of  10 variables:
#>  $ ride_id           : chr  "22178529" "22178530" "22178531" "22178532" ...
#>  $ started_at        : POSIXct, format: "2019-04-01 00:02:00" "2019-04-01 00:03:00" ...
#>  $ ended_at          : POSIXct, format: "2019-04-01 00:09:00" "2019-04-01 00:20:00" ...
#>  $ rideable_type     : chr  "6251" "6226" "5649" "4151" ...
#>  $ start_station_id  : int  81 317 283 26 202 420
#>  $ start_station_name: chr  "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
#>  $ end_station_id    : int  56 59 174 133 129 426
#>  $ end_station_name  : chr  "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
#>  $ member_casual     : chr  "member" "member" "member" "member" ...
#>  $ DiffTime          : 'difftime' num  7 17 4 5 ...
#>   ..- attr(*, "units")= chr "mins"
all_trips[, c("started_at", "ended_at", "DiffTime")]
#>            started_at            ended_at DiffTime
#> 1 2019-04-01 00:02:00 2019-04-01 00:09:00   7 mins
#> 2 2019-04-01 00:03:00 2019-04-01 00:20:00  17 mins
#> 3 2019-04-01 00:11:00 2019-04-01 00:15:00   4 mins
#> 4 2019-04-01 00:13:00 2019-04-01 00:18:00   5 mins
#> 5 2019-04-01 00:19:00 2019-04-01 00:36:00  17 mins
#> 6 2019-04-01 00:19:00 2019-04-01 00:23:00   4 mins

Created on 2024-01-29 with reprex v2.0.2

Thank you so much. It worked

Though I got an error here

all_trips[, c("started_at", "ended_at", "DiffTime")]
Error in `[.data.frame`(all_trips, , c("started_at", "ended_at", "DiffTime")) : 
  undefined columns selected

all_trips[, c("started_at", "ended_at", "DiffTime")]
Error in `[.data.frame`(all_trips, , c("started_at", "ended_at", "DiffTime")) : 
  undefined columns selected

One of the column names is probably misspelled. Is the new column named DiffTime or is it something slightly different like Difftime?

The new column is named "ride_length

But I have used the below codes to create the difference in time. Thank you so much

 all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)