Hello everyone. This is my first time posting here. I am new to R. I am currently working on my capstone project from the google data analytics course. I am trying to convert date/time data from character to date format. Some of my dates are in "POSIXct. POSIxt" and some are in character format. I have been using the following code
df$date <- as.Date(df$started_at)
here is an example of the results. my dates change from 10/08/2021 17:15 (this is the character format) to 0010-08-20 (this is after running the code)
I tried using other functions such as strptime(), as.POSIXct() but they give me errors of their own for example the date column has NAs instead of dates or new dates that don't exist appear such as the year 2030 or 2013. My time frame is between 2021 - 2022 so I don't understand where 2030 is coming from. Here is my code from the beginning. Let me know what I am missing
```{r}
library(tidyverse)
library(lubridate)
library(ggplot2)
```
# importing data
```{r}
aug_2021<-read_csv("202108-divvy-tripdata.csv")
sept_2021<-read_csv("202109-divvy-tripdata.csv")
oct_2021<-read_csv("202110-divvy-tripdata.csv")
nov_2021<-read_csv("202111-divvy-tripdata.csv")
dec_2021<-read_csv("202112-divvy-tripdata.csv")
jan_2022<-read_csv("202201-divvy-tripdata.csv")
feb_2022<-read_csv("202202-divvy-tripdata.csv")
mar_2022<-read_csv("202203-divvy-tripdata.csv")
apr_2022<-read_csv("202204-divvy-tripdata.csv")
may_2022<-read_csv("202205-divvy-tripdata.csv")
jun_2022<-read_csv("202206-divvy-tripdata.csv")
jul_2022<-read_csv("202207-divvy-tripdata.csv")
aug_2022<-read_csv("202208-divvy-tripdata.csv")
```
# wrangle and combine data into a single file
```{r}
colnames(aug_2021)
colnames(sept_2021)
colnames(oct_2021)
colnames(nov_2021)
colnames(dec_2021)
colnames(jan_2022)
colnames(feb_2022)
colnames(mar_2022)
colnames(apr_2022)
colnames(may_2022)
colnames(jun_2022)
colnames(jul_2022)
colnames(aug_2022)
```
# inspecting the dataframes and looking for incongurencies
```{r}
str(aug_2021)
str(sept_2021)
str(oct_2021)
str(nov_2021)
str(dec_2021)
str(jan_2022)
str(feb_2022)
str(mar_2022)
str(apr_2022)
str(may_2022)
str(jun_2022)
str(jul_2022)
str(aug_2022)
```
# combining the data into one big data frame
```{r}
all_trips<-rbind(aug_2021,sept_2021,oct_2021,nov_2021,dec_2021,jan_2022,feb_2022,mar_2022,apr_2022,may_2022,jun_2022,jul_2022,aug_2022)
```
# removing unnecessary columns
```{r}
all_trips<-all_trips%>%
select(-c(start_lat,start_lng,end_lat,end_lng,start_station_id,start_station_name,end_station_name,end_station_id))
```
# step 3 clean up and add data to prepare for analysis
# inspecting the new table that has been created
```{r}
colnames(all_trips)
nrow(all_trips)
dim(all_trips)
head(all_trips)
tail(all_trips)
str(all_trips)
summary(all_trips)
```
#check how many observations fall under each usertype
```{r}
table(all_trips$member_casual)
```
# Add columns that list the date, month, day, and year of each ride
#The default format is yyyy-mm-dd
# This will allow us to aggregate ride data for each month, day, or year. Before completing these operations we could only aggregate at the ride level
```{r}
all_trips$date <- as.Date(all_trips$started_at)
```
If you look at the Help for as.Date, you will find the Usage section that begins with
Usage
as.Date(x, ...)
## S3 method for class 'character'
as.Date(x, format, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"),
optional = FALSE, ...)
You can see that for character input with slashes, the function will guess that the format is %Y/%m/%d. That accounts for the result
> as.Date("10/08/2021 17:15")
[1] "0010-08-20"
where the 10 is treated as the year, the 08 and the month and the first two digits of 2021 as the day. If you give the function the format of the character date, it will return the correct value. If your dates are in day/month/year format,
as.Date("10/08/2021 17:15", format = "%d/%m/%Y %H:%M")
[1] "2021-08-10"
It worked. Thank you so much!! Been struggling with this since yesterday. You helped me realize where I went wrong. Just realized that I have been reading the formats wrong. I thought %Y/%m/%d directly translated to 2021/08/10 and did not know that it was in reverse. Will work on learning how to read the formats correctly. Asante!
I have one more date-related question. The time difference between two columns (start_date and end_date) but I have to convert the columns to date/time format from character format. Tried using ymd_hms() and strptime(). Here are my codes and results: all_trips is my data frame.
This first code gives me dates outside my time frame. My time frame is 2021-2022. It returns 2030,2005,2016 etc
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
What am I missing?
forgot to add, this is my date/time order "19/08/2021 12:41"
The |> symbol is a pipe which conveys the output of the previous function to another function, in which case str() and the purpose here is to show that the result is a datetime object, rather than character. It's similar to the %>% symbol that you will see in the tidyverse packages, but has a slightly different syntax.
I am having a difficult time understanding the lubridate function you suggested. Is there a way that I can still use strptime() or even as_datetime() to convert my DateTime from the character format to the date time format?
Sure. help("strptime") will give you everything needed. But the reason for lubridate is that the strptime syntax is clunky. If you look at the examples for as_datetime you will see that they use the hms syntax (as_datetime redirects to as_date).
The dmy_hm is pretty straightforward. Is the difficulty that you are looking to display in the "19/08/2021 12:41" form? That's a separate problem than being able to do date conversion for purposes of date calculations and should be done on the back end.
I was writing the format part wrong . You are right, strptime() is clunky and lubridate functions are more straightforward. Thanks for walking with me on this issue. Greatly appreciated