The function that finds previous values exists: it's called lag() (there is one in base R and another one in dplyr), but it's not a good fit here: it shifts the whole vector rather than returning the n values just before a given position.
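To make the mismatch concrete, here is a minimal illustration of dplyr's lag(): it returns the whole vector shifted by n positions and padded with NA, not a window of previous values:

```r
library(dplyr)

# lag() shifts the full vector by n positions, padding with NA
lag(1:5, n = 2)
# [1] NA NA  1  2  3
```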
We can create our own:
lag_n <- function(x, n = 1){
  nb_elements <- length(x)
  x[(nb_elements - n):(nb_elements - 1)]
}
lag_n(1:10, 1)
# [1] 9
lag_n(1:10, 4)
# [1] 6 7 8 9
Or a slightly more advanced version that lets us choose an end point:
lag_n <- function(x, end = length(x) - 1, n = 1){
  nb_elements <- length(x)
  stopifnot(end <= nb_elements)
  x[(end + 1 - n):end]
}
lag_n(1:10)
# [1] 9
lag_n(1:10, n = 4)
# [1] 6 7 8 9
lag_n(1:10, end=7, n=3)
# [1] 5 6 7
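One caveat (my addition, not part of the original function): if end < n there aren't enough earlier values, and the resulting mix of positive and negative indices produces an error. A slightly more defensive sketch could guard against that:

```r
# hypothetical defensive variant of lag_n
lag_n_safe <- function(x, end = length(x) - 1, n = 1){
  nb_elements <- length(x)
  stopifnot(end <= nb_elements,
            n <= end)  # need at least n values up to 'end'
  x[(end + 1 - n):end]
}

lag_n_safe(1:10, end = 7, n = 3)
# [1] 5 6 7
```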
Now, for your project, I'm worried about one aspect: you seem to assume all years have the same number of days and hours, but 8 of the 34 years are leap years with 366 days. So your split() should give incorrect results (with an error that accumulates with each leap year). The package lubridate provides tools to work with dates and date-times.
Let's suppose you only know that the measures were taken every day between 1986-01-01 and 2019-12-31. We can create a sequence of dates with the appropriate leap days:
library(lubridate)
seq_dates <- seq.Date(from = as_date("1986-01-01"),
                      to   = as_date("2019-12-31"),
                      by   = "day")
View(seq_dates)
If we assume there is no daylight saving time (so every day has 24 hours), we can create a simple vector of hours with 1:24. If you need to take time-zone changes into account, you could also generate date-times directly (that can get tricky though; see the lubridate help and that book chapter for details).
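As a rough sketch of the date-time route (assuming UTC, so no daylight-saving jumps), an hourly sequence over the whole period can be generated directly:

```r
library(lubridate)

# one date-time per hour over the whole period, in UTC
seq_hours <- seq(from = as_datetime("1986-01-01 00:00:00", tz = "UTC"),
                 to   = as_datetime("2019-12-31 23:00:00", tz = "UTC"),
                 by   = "hour")
length(seq_hours)
# [1] 298032
```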
In R, the most powerful data structure is probably the data.frame (or its tidyverse equivalent, the tibble); storing our data in that format gives us access to many useful tools. We use expand_grid() to automatically generate all the combinations of date and hour.
library(tidyverse)
rain_data <- expand_grid(date = seq_dates,
                         hour = 1:24)
Note that we have here 298,032 measurements, vs 34*365*24 = 297,840
if we forget leap days.
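The difference is exactly the leap days; we can double-check with lubridate's leap_year():

```r
library(lubridate)

sum(leap_year(1986:2019))  # leap years in the period
# [1] 8
8 * 24                     # extra hourly rows from leap days
# [1] 192
298032 - 297840
# [1] 192
```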
So now that we have the basic data.frame structure, we can insert the actual measurements (if the dates come with your data, it's better to use them directly) and extract the year from the date.
# dummy measurements, use the ones you have
P <- rpois(n = 298032, lambda = 5)
rain_data <- rain_data %>%
  mutate(rain = P,
         year = year(date))
Finally, with all the data in one place, we can move on to the processing. Rather than explicitly splitting the values, we can use group_by() to perform each computation per year. For example, if we want the max, the day of the max, and the mean for each year:
rain_data %>%
  group_by(year) %>%
  summarize(max_rain = max(rain),
            mean_rain = mean(rain),
            day_of_max = date[which.max(rain)],
            nb_hours_in_year = n()) %>%
  View()
And of course you can plot the whole thing:
rain_data %>%
  group_by(date) %>%
  summarize(mean_per_day = mean(rain)) %>%
  ggplot() +
  geom_line(aes(x = date, y = mean_per_day))
So now we get to the crux of the problem: for each year, we want to find the max and the n values immediately before it. Let's first look at the corresponding hours and days, to check that it works as expected:
# note: end = which.max(rain) - 1, so the window excludes the max itself
rain_data %>%
  group_by(year) %>%
  summarize(day_of_max = date[which.max(rain)],
            hour_of_max = hour[which.max(rain)],
            previous_hours = lag_n(hour,
                                   end = which.max(rain) - 1,
                                   n = 5),
            previous_days = lag_n(date,
                                  end = which.max(rain) - 1,
                                  n = 5)) %>%
  View()
It does look like the result we expect. So we can make the actual computation:
previous <- rain_data %>%
  group_by(year) %>%
  summarize(previous = lag_n(rain,
                             end = which.max(rain) - 1,
                             n = 5))
Alright, we did it. We get a data.frame in "long" format, with the previous observations stacked in one column. We can put it in wide format if we prefer:
previous %>%
  add_column(hour_before_max = paste0("t-", rep(5:1, 34), "h")) %>%
  pivot_wider(names_from = "hour_before_max",
              values_from = "previous")
That format is handy if you want to convert to a matrix, for example:
mat_previous <- previous %>%
  add_column(hour_before_max = paste0("t-", rep(5:1, 34), "h")) %>%
  pivot_wider(names_from = "hour_before_max",
              values_from = "previous") %>%
  ungroup() %>%
  select(-year) %>%
  as.matrix()
dim(mat_previous)
# [1] 34 5
mat_previous[1:3, 1:3]
# t-5h t-4h t-3h
# [1,] 9 8 4
# [2,] 3 4 5
# [3,] 7 3 5
But if you want to model, there is one more advanced trick that is very useful:
previous <- rain_data %>%
  group_by(year) %>%
  summarize(previous = list(lag_n(rain,
                                  end = which.max(rain) - 1,
                                  n = 5)),
            hours_before_max = list(5:1))
Here we store the results as list-columns, so we can pick up the vectors directly from those lists and hand them to a modeling function:
previous %>%
  mutate(mod = map2(hours_before_max, previous, ~ lm(.y ~ .x)),
         coef = map_dbl(mod, ~ .x$coefficients[2]))