Say a local park is popular with cryptids, and can offer room and board for a single cryptid at a time. The park has shared its visitor log with me, and I would like to know how many times Sasquatch has visited the park, where a single visit could be for any number of days.
Here is a visitor log, which currently covers 20 days but can be altered to cover as many as you like:
I would be curious to see alternative solutions, whether they use tidyverse, data.table, or base R functions.
[Edit]: I forgot to mention that I was curious about solutions that don't depend on the value passed to set.seed(). (See below.)]
Here is a solution using tidyverse functions (including lag).
# Find which visits are Sasquatch.
w <- which(visits$visitor == "Sasquatch")
# Isolate those visits and add a flag (1 if a new visit, 0 if a continuation of a visit).
s <- visits[w, ] |> mutate(new_visit = ifelse(lag(day) + 1 < day, 1, 0))
# Fix the NA in the first row (caused by lack of a previous row when lagging).
s[1, "new_visit"] <- 1
# Add the "visit" value and drop the new_visit column (no longer needed).
s <- s |> mutate(visit = cumsum(new_visit)) |> select(-new_visit)
# Make a copy of the full data frame and add the visit column.
visits2 <- visits
visits2$visit <- NA
# Replace the visit values for the Sasquatch visits.
visits2[w, "visit"] <- s$visit
# Confirm the results.
visits2
Here is an attempt at a faithful translation of your solution into pure tidyverse, which is interesting to compare:
visits |>
# Find which visits are Sasquatch
mutate(sasquatch = visitor == "Sasquatch") |>
# Isolate those visits
group_by(sasquatch) |>
# Add a flag (1 if a new visit, 0 if a continuation of a visit).
mutate(new_visit = if_else(lag(day) + 1 < day, 1, 0)) |>
# Fix the NA in the first row (caused by lack of a previous row when lagging).
mutate(new_visit = if_else(day == 1, 1, new_visit)) |>
# Add the "visit" value and drop the new_visit column (no longer needed).
mutate(visit = cumsum(new_visit)) |> select(!new_visit) |>
# Replace the visit values for the Sasquatch visits.
ungroup()
What's nice about your original solution is that you don't have to worry about what to do with the non-Sasquatch rows. In my translation, it happens that lag takes care of them in the end because cumsum transmits the initial NA to the rest of the non-Sasquatch rows, but that's just a lucky accident.
The principal trade-off seems to be the clearer focus on Sasquatch rows in your original solution for not having to make a copy of visits in the translation.
The only significant observation I would make is that your solution depends on the argument to set.seed(), which happens to have Sasquatch visit on day 1, but I didn't make that part of the question clear. (I'll make a quick edit to fix that.)
Actually, my solution does not require Sasquatch to show up on day 1. I just tested it on a modified dataframe where he (?) arrives on day 2. I think what might be throwing you off is that the first row of s (which contains only visits by you-know-who) has to be visit #1 for him (?), regardless of what day that is.
Oh, right — you had already subsetted from the larger table — sorry not to have spotted that! I'll modify my translation to capture what you actually wrote.
visits |>
# Find which visits are Sasquatch
mutate(sasquatch = visitor == "Sasquatch") |>
# Isolate those visits
group_by(sasquatch) |>
# Add a flag (1 if a new visit, 0 if a continuation of a visit).
mutate(new_visit = if_else(lag(day) + 1 < day, 1, 0)) |>
# Fix the NA in the first row (caused by lack of a previous row when lagging).
# mutate(new_visit = if_else(day == 1, 1, new_visit)) |> # <-- mistranslation
mutate(
new_visit =
if_else(sasquatch & day == first(day), 1, new_visit) # <-- corrected
) |>
# Add the "visit" value and drop the new_visit column (no longer needed).
mutate(visit = cumsum(new_visit)) |> select(!new_visit) |>
# Replace the visit values for the Sasquatch visits.
ungroup()
This is nice illustration of an operation that doesn't seem to have a succinct translation into tidyverse-only code, as far as I know: changing the values in specific rows of a table by simply supplying a vector of row-indices and a vector of replacement values. A puzzle for a separate topic?
Also, here's an alternative solution that also uses cumsum(), but in a less direct way:
visits |>
# flag Sasquatch days
mutate(sasquatch = visitor == 'Sasquatch') |>
# if we assign a value of 0 to Sasquatch days, and 1 to the rest,
# the cumulative sum will be constant during Sasquatch visits, but
# also different for each visit
mutate(visit = if_else(sasquatch, cumsum(!sasquatch), NA)) |>
# remove 'sasquatch' column since no longer needed
select(!sasquatch) |>
# convert 'visit' to a factor so corresponding ordinal can be extracted
mutate(visit = visit |> factor() |> as.numeric())
This:
visit |> factor() |> as.numeric()
is a bit of a hack, but I'm not sure of a way to convert to ordinal values more cleanly.