Hi there,
My dataset has a column that contains mixed data types, characters and dates.
I will put things into context please bear with me, the question is at the bottom.
> df_all$`Job Revenue Recognition Date`
[1] NA "ARV 02-Apr-17" "ARV 04-Apr-17" NA "ARV 29-Mar-17, DEP 08-Mar-17" "ARV 29-Mar-17, DEP 08-Mar-17"
[7] "ARV 10-Apr-17" "ARV 07-Apr-17" "ARV 30-Mar-17" "ARV 28-Mar-17" "ARV 03-Apr-17" "ARV 09-Apr-17"
[13] "ARV 05-Apr-17" "ARV 10-Apr-17" "ARV 11-Apr-17" "ARV 26-Mar-17" "ARV 06-Apr-17" "ARV 26-Mar-17"
[19] "ARV 22-Mar-17, DEP 05-Mar-17"
In order to fix that, I came up with the following solution while selecting the columns of interest:
# df_all_og is just the original dataset.
df_all <- df_all_og %>%
select(
`Shipment ID`,
Trans,
Mode:`House Ref`,
`Goods Description`:`Destination ETA`,
Added:Direction,
starts_with("Total"),
`Job Revenue Recognition Date`) %>%
separate(`Job Revenue Recognition Date`, into = c("ARV", "DEP"),
sep = ", ", remove = FALSE) %>%
separate(`ARV`,into = c("ARV.type", "ARV.date"), sep= " ") %>%
separate(`DEP`,into = c("DEP.type", "DEP.date"), sep= " ")
Which breaks that single column into 4 columns because the data contained could be:
-
ARV
-
DEP
-
ARV DATE
-
DEP DATE
$ ARV.type <chr> NA, "ARV", "ARV", NA, "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", "ARV", NA, "ARV", "ARV", "ARV"...
$ ARV.date <date> NA, 2017-04-02, 2017-04-04, NA, 2017-03-29, 2017-03-29, 2017-04-10, 2017-04-07, 2017-03-30, 2017-03-28, 2017-04-03, 2017-04-09, 2017-04-05, 2017-04-1...
$ DEP.type <chr> NA, NA, NA, NA, "DEP", "DEP", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "DEP", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "DEP", NA, NA, NA, NA, NA...
$ DEP.date <date> NA, NA, NA, NA, 2017-03-08, 2017-03-08, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2017-03-05, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2017-03-0...
Now I had to get the largest date (newest) from AR.date or DEP.date and store that in another column.
So I left the dataset a side, replicated it and called it x, to play around and find a solution (which has an issue):
# Searching for a possible solution for date comparison.
> for (i in seq_along(x$ARV.date)){
+ if(is.na(x$ARV.date[i]) | is.na(x$DEP.date[i])){
+
+ x$Job.Recog.Date[i] <- NA
+
+ }else if(!is.na(x$ARV.date[i]) & !is.na(x$DEP.date[i])){
+
+ x$Job.Recog.Date[i] <- ymd(max(c(x$ARV.date[i],x$DEP.date[i])))
+
+ }
+ }
Warning message:
Unknown or uninitialised column: 'Job.Recog.Date'.
> glimpse(x)
Observations: 43,856
Variables: 11
$ `Origin ETD` <date> 2017-01-16, 2017-03-02, 2017-03-04, 2017-02-09, 2017-03-08, 2017-03-08, 2017-03-08, 2017-03-08, 2017-03-01, 2017-02-15, 2017-03-06, 2017-03-07, 2017-03-06, 2017-03-17, 2017-03-16...
$ `Destination ETA` <date> 2017-02-21, 2017-04-02, 2017-04-04, 2017-04-20, 2017-03-29, 2017-03-29, 2017-04-10, 2017-04-07, 2017-04-01, 2017-03-28, 2017-04-03, 2017-04-09, 2017-04-05, 2017-04-10, 2017-04-16...
$ Added <date> 2016-12-27, 2017-02-08, 2017-02-08, 2017-02-08, 2017-02-09, 2017-02-09, 2017-02-09, 2017-02-09, 2017-02-10, 2017-02-10, 2017-02-13, 2017-02-13, 2017-02-13, 2017-02-13, 2017-02-13...
$ `Job Opened` <date> 2017-09-18, 2017-03-27, 2017-03-15, 2017-04-12, 2017-03-23, 2017-03-23, 2017-03-22, 2017-03-22, 2017-03-24, 2017-03-06, 2017-03-16, 2017-03-06, 2017-03-13, 2017-03-20, 2017-04-03...
$ `ETD First Load` <date> NA, 2017-03-02, 2017-03-09, 2017-03-22, 2017-03-08, 2017-03-08, 2017-03-08, 2017-03-08, 2017-03-01, 2017-03-06, 2017-03-06, 2017-03-07, 2017-03-13, 2017-03-19, 2017-03-16, 2017-0...
$ `ETA Last Discharge` <date> NA, 2017-04-02, 2017-03-28, 2017-04-20, 2017-03-29, 2017-03-29, 2017-04-10, 2017-04-07, 2017-04-01, 2017-03-28, 2017-04-03, 2017-04-09, 2017-04-03, 2017-04-09, 2017-04-11, 2017-0...
$ `ETD Load` <date> NA, 2017-03-02, 2017-03-16, 2017-03-22, 2017-03-08, 2017-03-08, 2017-03-16, 2017-03-16, 2017-03-01, 2017-03-06, 2017-03-06, 2017-03-07, 2017-03-13, 2017-03-19, 2017-03-16, 2017-0...
$ `ETA Discharge` <date> NA, 2017-03-31, 2017-03-28, 2017-04-10, 2017-03-29, 2017-03-29, 2017-03-28, 2017-03-28, 2017-03-23, 2017-03-21, 2017-04-03, 2017-04-09, 2017-03-28, 2017-04-01, 2017-04-07, 2017-0...
$ ARV.date <date> NA, 2017-04-02, 2017-04-04, NA, 2017-03-29, 2017-03-29, 2017-04-10, 2017-04-07, 2017-03-30, 2017-03-28, 2017-04-03, 2017-04-09, 2017-04-05, 2017-04-10, 2017-04-11, 2017-03-26, 20...
$ DEP.date <date> NA, NA, NA, NA, 2017-03-08, 2017-03-08, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2017-03-05, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2017-03-02, NA, NA, NA, NA, NA, NA, NA...
$ Job.Recog.Date <dbl> NA, NA, NA, NA, 17254, 17254, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17247, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17261, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
Note that on the newly added $Job.Recog.Date column a date was expected yet I got a number.
I've also noted that using the loop increased substantially the time it took to run the code, nothing drastic, but noticable when everything was basically instantaneous. Which leads me to believe this is a possible method, even if I corrected so that the output gave me a correct date, but not an efficient one.
My question is, how should I have approached this problem to avoid this For loop. Which I would still have to create a function around it and figure out how to set have the syntactical col names being generated on the pipe when doing separate() to then manipulate to find which date out of the two is needed. Can this be done all while piping?
Side question: Do people mind long descriptive questions or just put the code there with minimum context? I ask this because I would find it a lot easier to help others when I can clearly understand the goal. Not to mention others that are starting and come across this, I guess it would make "digestible".
Like always, thanks for your time and patience.
LF.