How to enumerate intervals in a sequence, like periods when cryptids come to visit

dromano · April 12, 2024, 7:31pm

Say a local park is popular with cryptids, and can offer room and board for a single cryptid at a time. The park has shared its visitor log with me, and I would like to know how many times Sasquatch has visited the park, where a single visit could be for any number of days.

Here is a visitor log, which currently covers 20 days but can be altered to cover as many as you like:

library(tidyverse)
cryptids <- c('Sasquatch', 'Yeti', 'Chupacabras', 'Nessie')

set.seed(1)
tibble(
  day = 1:20,
  visitor = 
    sample(
      cryptids |> c("crickets"), 
      length(day), 
      prob = c(0.5, rep(0.1, 4)),
      replace = T
    )
) -> visits

visits |> print(n = 6)
#> # A tibble: 20 × 2
#>     day visitor    
#>   <int> <chr>      
#> 1     1 Sasquatch  
#> 2     2 Sasquatch  
#> 3     3 Chupacabras
#> 4     4 Yeti       
#> 5     5 Sasquatch  
#> 6     6 Yeti       
#> # ℹ 14 more rows

^{Created on 2024-04-12 with reprex v2.0.2}

How would you add a visit column to the log that counts Sasquatch's visits by assigning a number to each visit, sort of like this:

#> # A tibble: 20 × 3
#>     day visitor     visit
#>   <int> <chr>       <dbl>
#> 1     1 Sasquatch       1
#> 2     2 Sasquatch       
#> 3     3 Chupacabras     
#> 4     4 Yeti           
#> 5     5 Sasquatch       2
#> 6     6 Yeti            
#> # ℹ 14 more rows

^{Created on 2024-04-12 with reprex v2.0.2}

I would be curious to see alternative solutions, whether they use tidyverse, data.table, or base R functions.
[Edit]: I forgot to mention that I was curious about solutions that don't depend on the value passed to set.seed(). (See below.)]

(Inspired by this question)

Identifying and Determining the Average Rainfall Intensity of Precipitation Events and Associated Increases in Streamflow Part 1

I first want to use the data from this .csv file to identify precipitation events. Precipitation events will be defined as the cumulative total of non-zero precipitation values with less than 4 hours of zero precipitation values in between. For example, considering the data provided above, precipitation began at 2010-05-13 11:30:00 and stopped 2010-05-13 21:00:00. Precipitation then began again at 2010-05-13 22:45:00 and continued until 2010-05-14 00:15:00. Because there is less than 4 hours between the time that the precipitation stopped (2010-05-13 21:00:00) and the time that the precipitation began (2010-05-13 22:45:00), these should be treated as a single event.

prubin · April 12, 2024, 10:31pm

Here is a solution using tidyverse functions (including lag).

# Find which visits are Sasquatch.
w <- which(visits$visitor == "Sasquatch")

# Isolate those visits and add a flag (1 if a new visit, 0 if a continuation of a visit).
s <- visits[w, ] |> mutate(new_visit = ifelse(lag(day) + 1 < day, 1, 0))

# Fix the NA in the first row (caused by lack of a previous row when lagging).
s[1, "new_visit"] <- 1

# Add the "visit" value and drop the new_visit column (no longer needed).
s <- s |> mutate(visit = cumsum(new_visit)) |> select(-new_visit)

# Make a copy of the full data frame and add the visit column.
visits2 <- visits
visits2$visit <- NA

# Replace the visit values for the Sasquatch visits.
visits2[w, "visit"] <- s$visit

# Confirm the results.
visits2

dromano · April 13, 2024, 11:27am

Here is an attempt at a faithful translation of your solution into pure tidyverse, which is interesting to compare:

visits |> 
  # Find which visits are Sasquatch
  mutate(sasquatch = visitor == "Sasquatch") |> 
  # Isolate those visits
  group_by(sasquatch) |> 
  # Add a flag (1 if a new visit, 0 if a continuation of a visit).
  mutate(new_visit = if_else(lag(day) + 1 < day, 1, 0)) |> 
  # Fix the NA in the first row (caused by lack of a previous row when lagging).
  mutate(new_visit = if_else(day == 1, 1, new_visit)) |> 
  # Add the "visit" value and drop the new_visit column (no longer needed).
  mutate(visit = cumsum(new_visit)) |> select(!new_visit) |> 
  # Replace the visit values for the Sasquatch visits.
  ungroup()

What's nice about your original solution is that you don't have to worry about what to do with the non-Sasquatch rows. In my translation, it happens that lag takes care of them in the end because cumsum transmits the initial NA to the rest of the non-Sasquatch rows, but that's just a lucky accident.

The principal trade-off seems to be the clearer focus on Sasquatch rows in your original solution for not having to make a copy of visits in the translation.

The only significant observation I would make is that your solution depends on the argument to set.seed(), which happens to have Sasquatch visit on day 1, but I didn't make that part of the question clear. (I'll make a quick edit to fix that.)

prubin · April 13, 2024, 7:07pm

Actually, my solution does not require Sasquatch to show up on day 1. I just tested it on a modified dataframe where he (?) arrives on day 2. I think what might be throwing you off is that the first row of s (which contains only visits by you-know-who) has to be visit #1 for him (?), regardless of what day that is.

dromano · April 13, 2024, 7:10pm

Oh, right — you had already subsetted from the larger table — sorry not to have spotted that! I'll modify my translation to capture what you actually wrote.

dromano · April 13, 2024, 7:30pm

visits |> 
  # Find which visits are Sasquatch
  mutate(sasquatch = visitor == "Sasquatch") |> 
  # Isolate those visits
  group_by(sasquatch) |> 
  # Add a flag (1 if a new visit, 0 if a continuation of a visit).
  mutate(new_visit = if_else(lag(day) + 1 < day, 1, 0)) |> 
  # Fix the NA in the first row (caused by lack of a previous row when lagging).
  # mutate(new_visit = if_else(day == 1, 1, new_visit)) |> # <-- mistranslation
  mutate(
    new_visit = 
      if_else(sasquatch & day == first(day), 1, new_visit) # <-- corrected
  ) |> 
  # Add the "visit" value and drop the new_visit column (no longer needed).
  mutate(visit = cumsum(new_visit)) |> select(!new_visit) |> 
  # Replace the visit values for the Sasquatch visits.
  ungroup()

dromano · April 15, 2024, 12:26am

This is nice illustration of an operation that doesn't seem to have a succinct translation into tidyverse-only code, as far as I know: changing the values in specific rows of a table by simply supplying a vector of row-indices and a vector of replacement values. A puzzle for a separate topic?

dromano · April 15, 2024, 12:26am

Also, here's an alternative solution that also uses cumsum(), but in a less direct way:

visits |> 
  # flag Sasquatch days
  mutate(sasquatch = visitor == 'Sasquatch') |> 
  # if we assign a value of 0 to Sasquatch days, and 1 to the rest,
  # the cumulative sum will be constant during Sasquatch visits, but 
  # also different for each visit
  mutate(visit = if_else(sasquatch, cumsum(!sasquatch), NA)) |>
  # remove 'sasquatch' column since no longer needed
  select(!sasquatch) |> 
  # convert 'visit' to a factor so corresponding ordinal can be extracted
  mutate(visit = visit |> factor() |> as.numeric())

This:

visit |> factor() |> as.numeric()

is a bit of a hack, but I'm not sure of a way to convert to ordinal values more cleanly.

system · May 27, 2024, 12:27am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.