Summarize by group the maximum amount of time a range of values continuously occurred in R

Nate_L · June 6, 2022, 11:34pm

Say I have a dataframe that looks like this :

> head(df)
  yr	mo	day	time	   sal	site
2021	8	1	0000	26.614	14
2021	8	1	0015	25.724	14
2021	8	1	0030	25.739	14
2021	8	1	0045	25.831	14
2021	8	1	0100	25.798	14
2021	8	1	0115	25.667	14

The dataframe has +800k rows of continuously measured data in 15 min. increments for the past couple of years at "n" sites.

I'm attempting to make a matrix/table like this in R (that was originally created in Excel). Color is not important, I only hope to re-create the table itself. The values in this plot are the maximum/highest number of days (converted from consecutive minutes) where a range of values (sal) were consistently measured (ex. For each site, season, and year, create a table that shows the longest lasting run of continuous time where the values stayed between, for example, 40-50).

nXHga

This is code I found somewhat useful, but it doesn't account for the fact that consecutive values need to be by an unbroken chain of the time.

data %>% group_by(site, mo, yr) %>% 
  mutate(high_salinity = between(sal, 40, 50), 
         high_salinity_duration  = cumsum(high_salinity) * high_salinity) %>% 
  summarise(longest_high_salinity = max(high_salinity_duration))

To belabor the point: values are only considered consistent if they occur at times, days, months, and years that are in order.

FJCC · June 7, 2022, 2:23am

I hope the function in the snippet below will get you started. I invented some data to work with. I did not invent different sites or seasons because I was too lazy. I your real data, you will want to group_by() all the appropriate variables.
The first thing I do is construct a datetime column so that time differences can be calculated.

What the function does is filter for the desired sal values and then calculate the time between neighboring rows. If that time is 15, then the rows are part of a continuous run. The rle function looks at the values in the Diff column and characterizes the run length of all the values. We are interested in runs of the value 15 and my code finds the longest such run.

library(dplyr)

library(lubridate)
#Invent data. Don't worry about the details here
Sequence <- seq.POSIXt(from = ymd_hm("2022-01-01 00:00"), 
                       to = ymd_hm("2022-02-28 00:00"), by = "15 min")
DATA <- data.frame(year = 2022, month = month(Sequence), day = day(Sequence),
                 time = paste0(formatC(hour(Sequence),width = 2, flag = "0"), 
                               formatC(minute(Sequence), width = 2, flag = "0")),
                 sal = rnorm(5569,40,5))

#Construct a POSIX datetime from the year, month etc.
DATA <- DATA |> mutate(DateTime = make_datetime(year,month,day,
                                           hour=as.numeric(substr(time,1,2)),
                                           min=as.numeric(substr(time,3,4))))
FindRun <- function(DF){
  DF40_50 <- DF |> filter(between(sal, 40,50)) |> 
    mutate(Lag = lag(DateTime),
           Diff = as.numeric(DateTime - Lag)) 
  
  RUNs <- rle(DF40_50$Diff)
  Lengths <- RUNs$lengths[RUNs$values==15]
  MaxRun <- max(Lengths,na.rm = TRUE)
  MaxRun *15/1440
}

FindRun(DATA)
#> [1] 0.07291667

^{Created on 2022-06-06 by the reprex package (v2.0.1)}

Nate_L · June 7, 2022, 1:04pm

Definitely a great start, thank you!

system · June 14, 2022, 1:05pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.