tl;dr: you can use the zoo package to create a time series indexed by date or datetime objects.
But what do you want to happen if two rows have different times within the same hour? Should certain columns be summed up? Averaged?
I'm going to assume summed, since these are counts, and give an example. The steps we want:
- Sum up the number of orders, grouping by hour processed.
- Expand the dataset to include all hours in the range, not just those which had orders.
- Use the zoo() function from the zoo package to make a time series with the hours as the index.
First, I'll make some example data similar to what's in the OP.
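orders <- data.frame(
  # tz = "UTC" keeps these timestamps in the same zone as as_datetime() below,
  # which defaults to UTC; without it as.POSIXct() uses the local time zone
  processed_date = as.POSIXct(c(
    "2014-12-31 16:58:20",
    "2015-01-02 19:36:55",
    "2015-01-09 18:47:37",
    "2015-01-14 18:45:10",
    "2015-01-18 13:51:13",
    "2015-02-09 19:17:16",
    "2015-02-09 19:17:59" # Let's throw in a second order for this hour
  ), tz = "UTC"),
  order_count = 1
)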
Next, summarize the orders we do have, then add in the hours with no orders (setting their order_count to 0).
library(dplyr)
library(lubridate)
hourly_orders <- orders %>%
  mutate(processed_hour = floor_date(processed_date, "hour")) %>%
  group_by(processed_hour) %>%
  summarise(order_count = sum(order_count))
hourly_orders
## A tibble: 6 x 2
# processed_hour order_count
# <dttm> <dbl>
# 1 2014-12-31 16:00:00 1.
# 2 2015-01-02 19:00:00 1.
# 3 2015-01-09 18:00:00 1.
# 4 2015-01-14 18:00:00 1.
# 5 2015-01-18 13:00:00 1.
# 6 2015-02-09 19:00:00 2.
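On the sum-versus-average question: if the data also carried a numeric column that should be averaged rather than summed, the same summarise() call can mix both. The sketch below is only an illustration; the order_value column and its numbers are hypothetical and not part of the OP's data.

```r
library(dplyr)
library(lubridate)

# Hypothetical data: order_value is invented for this illustration
orders_with_value <- data.frame(
  processed_date = as.POSIXct(c(
    "2015-02-09 19:17:16",
    "2015-02-09 19:17:59",
    "2015-02-10 08:05:00"
  ), tz = "UTC"),
  order_count = 1,
  order_value = c(10, 30, 25)
)

hourly_summary <- orders_with_value %>%
  mutate(processed_hour = floor_date(processed_date, "hour")) %>%
  group_by(processed_hour) %>%
  summarise(
    order_count = sum(order_count),       # counts are added up
    mean_order_value = mean(order_value)  # values are averaged
  )

hourly_summary
```

The two orders in the 19:00 hour collapse into one row with a count of 2 and a mean value of 20.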
# Note: right_join() keeps only the hours inside this time frame, so the
# example orders from 2015 fall outside it and are dropped from the result.
time_frame <- as_datetime(c("2009-01-01 00:00:00", "2014-12-31 23:59:59"))
all_hours <- data.frame(
  processed_hour = seq(time_frame[1], time_frame[2], by = "hour")
)
hourly_orders <- hourly_orders %>%
  right_join(all_hours, by = "processed_hour") %>%
  mutate(
    order_count = ifelse(
      test = is.na(order_count),
      yes = 0,
      no = order_count
    )
  )
hourly_orders
## A tibble: 52,584 x 2
# processed_hour order_count
# <dttm> <dbl>
# 1 2009-01-01 00:00:00 0.
# 2 2009-01-01 01:00:00 0.
# 3 2009-01-01 02:00:00 0.
# 4 2009-01-01 03:00:00 0.
# 5 2009-01-01 04:00:00 0.
# 6 2009-01-01 05:00:00 0.
# 7 2009-01-01 06:00:00 0.
# 8 2009-01-01 07:00:00 0.
# 9 2009-01-01 08:00:00 0.
# 10 2009-01-01 09:00:00 0.
# # ... with 52,574 more rows
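If you would rather not hard-code the time frame, one option is to build the hour sequence from the first and last hours that actually appear in the data. This is a minimal base R sketch, not the approach used above; hours_present and all_hours_from_data are illustrative names, and the two timestamps are taken from the example data.

```r
# Two already-floored hours standing in for the processed_hour column
hours_present <- as.POSIXct(
  c("2014-12-31 16:00:00", "2015-01-02 19:00:00"),
  tz = "UTC"
)

# Every hour from the earliest to the latest observed hour, inclusive
all_hours_from_data <- data.frame(
  processed_hour = seq(min(hours_present), max(hours_present), by = "hour")
)

nrow(all_hours_from_data)
# [1] 52
```

The trade-off is that leading and trailing empty hours are not represented, so a fixed time frame is still the better choice when the reporting period is known in advance.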
Finally, the zoo() function makes the time series.
library(zoo)
hourly_ts <- zoo(
  x = hourly_orders[["order_count"]],
  order.by = hourly_orders[["processed_hour"]],
  frequency = 24
)
start(hourly_ts)
# [1] "2009-01-01 UTC"
end(hourly_ts)
# [1] "2014-12-31 23:00:00 UTC"
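Once the data is a zoo series, it is easy to change the resolution. As a usage sketch on a small made-up series (48 hours with a single busy hour, not the data above), aggregate() with as.Date as the grouping function rolls hourly counts up to daily totals.

```r
library(zoo)

# Small illustrative series: 48 consecutive hours, all zero except one
hours <- seq(
  as.POSIXct("2015-02-09 00:00:00", tz = "UTC"),
  by = "hour",
  length.out = 48
)
counts <- rep(0, 48)
counts[20] <- 2  # the 19:00 hour on the first day

small_ts <- zoo(counts, order.by = hours)

# Group the hourly index by calendar date and sum within each day
daily_ts <- aggregate(small_ts, as.Date, sum)
daily_ts
```

The result is a zoo series indexed by Date with one value per day, here 2 for the first day and 0 for the second.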