Convert hourly data to time series

jasminecaur · May 14, 2018, 1:27pm

If I want to convert my hourly data to a time series for forecasting, how do I give start and end in
"y-m-d h:m:s" format when using the ts() function?
Is there any other function that does the same?

This is my "date_time" column.

1 2014-12-31 16:58:20 
2 2015-01-02 19:36:55 
3 2015-01-09 18:47:37 
4 2015-01-14 18:45:10 
5 2015-01-18 13:51:13 
6 2015-02-09 19:17:16

Leon · May 14, 2018, 3:21pm

Can you give an example of your expected output? The lubridate package has many nice functions for working with date/time data.

jasminecaur · May 15, 2018, 6:42am

This is my data frame

data <- data.frame(goal_data$processed_date, goal_data$order_count)
head(data)

 goal_data.processed_date goal_data.order_count
1      2015-09-08 08:38:42                     1
2      2015-10-27 10:28:02                     1
3      2015-10-28 16:05:13                     1
4      2015-12-31 23:21:11                     1
5      2016-01-07 11:22:04                     1
6      2015-12-29 17:49:32                     1

Now I want to convert it to a time series using the ts() function.
For example, this is how ts() is used to create a time series object
from Jan 2009 to Dec 2014:
myts <- ts(myvector, start=c(2009, 1), end=c(2014, 12), frequency=12)

So what 'start' and 'end' should I give for my hourly data?

data_ts <- ts(data, start = , end = )

mara · May 15, 2018, 12:41pm

Could you please turn this into a self-contained reprex (short for minimal reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

Right now the best way to install reprex is:

# install.packages("devtools")
devtools::install_github("tidyverse/reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

nwerth · May 15, 2018, 1:44pm

tl;dr: you can use the zoo package to create a time series indexed by date or datetime objects.

But what do you want to happen if two rows have different times within the same hour? Should certain columns be summed up? Averaged?

I'm going to assume summed, since these are counts, and give an example. The steps we want:

1. Sum up the number of orders, grouping by hour processed.
2. Expand the dataset to include all hours in the range, not just those which had orders.
3. Use the zoo function from the zoo package to make a time series with the hours as the index.

First, I'll make some example data similar to what's in the OP.

orders <- data.frame(
  # Parse in UTC so the hours line up with the UTC time frame built later
  processed_date = as.POSIXct(c(
    "2014-12-31 16:58:20",
    "2015-01-02 19:36:55",
    "2015-01-09 18:47:37",
    "2015-01-14 18:45:10",
    "2015-01-18 13:51:13",
    "2015-02-09 19:17:16",
    "2015-02-09 19:17:59"  # Let's throw in a second order for this hour
  ), tz = "UTC"),
  order_count = 1
)

Next, summarize the orders we do have and add in the hours with no orders (setting their order_count to 0).

library(dplyr)
library(lubridate)

hourly_orders <- orders %>%
  mutate(processed_hour = floor_date(processed_date, "hour")) %>%
  group_by(processed_hour) %>%
  summarise(order_count = sum(order_count))

hourly_orders
# # A tibble: 6 x 2
#   processed_hour      order_count
#   <dttm>                    <dbl>
# 1 2014-12-31 16:00:00          1.
# 2 2015-01-02 19:00:00          1.
# 3 2015-01-09 18:00:00          1.
# 4 2015-01-14 18:00:00          1.
# 5 2015-01-18 13:00:00          1.
# 6 2015-02-09 19:00:00          2.

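# Time frame taken from the OP's ts() example (Jan 2009 to Dec 2014).
# Hours outside it are dropped by the join, so adjust it to cover your own data.
time_frame <- as_datetime(c("2009-01-01 00:00:00", "2014-12-31 23:59:59"))
all_hours <- data.frame(
  processed_hour = seq(time_frame[1], time_frame[2], by = "hour")
)
hourly_orders <- hourly_orders %>%
  right_join(all_hours, by = "processed_hour") %>%
  arrange(processed_hour) %>%  # keep the rows in chronological order
  mutate(
    order_count = ifelse(
      test = is.na(order_count),
      yes  = 0,
      no   = order_count
    )
  )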

hourly_orders
# # A tibble: 52,584 x 2
#    processed_hour      order_count
#    <dttm>                    <dbl>
#  1 2009-01-01 00:00:00          0.
#  2 2009-01-01 01:00:00          0.
#  3 2009-01-01 02:00:00          0.
#  4 2009-01-01 03:00:00          0.
#  5 2009-01-01 04:00:00          0.
#  6 2009-01-01 05:00:00          0.
#  7 2009-01-01 06:00:00          0.
#  8 2009-01-01 07:00:00          0.
#  9 2009-01-01 08:00:00          0.
# 10 2009-01-01 09:00:00          0.
# # ... with 52,574 more rows
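
To see the zero-filling step on a range small enough to read in full, here is a minimal base R sketch. The three order times are made up for illustration, and the range is taken from the data itself rather than from a fixed time frame, so treat it as one possible variant and not a replacement for the dplyr code above.

```r
# Hypothetical order timestamps: one order at 17:xx, two at 19:xx, none at 18:xx
order_times <- as.POSIXct(
  c("2015-02-09 17:05:00", "2015-02-09 19:17:16", "2015-02-09 19:17:59"),
  tz = "UTC"
)

# Round each timestamp down to the start of its hour
order_hours <- as.POSIXct(trunc(order_times, units = "hours"))

# Every hour between the first and last order, including empty ones
all_hours <- seq(min(order_hours), max(order_hours), by = "hour")

# Count orders per hour; hours with no orders get 0
order_count <- tabulate(
  match(as.numeric(order_hours), as.numeric(all_hours)),
  nbins = length(all_hours)
)

hourly <- data.frame(processed_hour = all_hours, order_count = order_count)
hourly
```

The 18:00 row shows up with a count of 0 even though no order was placed in that hour, which is what the join against all_hours achieves in the larger example.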

Finally, the zoo function makes the time series.

library(zoo)

hourly_ts <- zoo(
  x         = hourly_orders[["order_count"]],
  order.by  = hourly_orders[["processed_hour"]],
  frequency = 24
)

start(hourly_ts)
# [1] "2009-01-01 UTC"

end(hourly_ts)
# [1] "2014-12-31 23:00:00 UTC"
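
If a base ts object is still needed (some forecasting functions expect one), here is a rough sketch of one way to go back to the original question. As far as I know, ts() takes a numeric start plus a frequency, not a "y-m-d h:m:s" value, so the sketch treats one day as a cycle of 24 hours and gives start as c(day number, hour position). The counts and the starting hour are made up, and numbering the days from 1 is my own assumption.

```r
# Hypothetical regular hourly series: 3 full days starting at midnight UTC
first_hour  <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")
order_count <- rep(c(0, 1, 2), times = 24)  # 72 made-up hourly counts

# Position of the first observation within its day (hour 00 -> position 1)
start_pos <- as.integer(format(first_hour, "%H")) + 1

# frequency = 24 means 24 observations per cycle (one day);
# start = c(1, start_pos) means day 1, hour position start_pos
hourly_base_ts <- ts(order_count, start = c(1, start_pos), frequency = 24)

start(hourly_base_ts)
end(hourly_base_ts)
```

This only works if the series has no missing hours, which is why the zero-filling step comes first. The actual date-times are not stored in the ts object, so keep the zoo series (or the hour column) around if you need to map positions back to timestamps.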