Calculated difference in Lagged values

Hi Team,

I have this problem and looking for a code that can help me solve it (find values a,b,c in the attached file - sample data).

I've got a team of data enumerators, they record both start and end time of the survey they are collecting data. Now I would like to calculate the time span (minutes or hours) between the end of one survey to the next start of a survey.
Am attaching this dataset for demonstration:
For enumerator Emma, she finished the first survey at 12:01:17 PM and began the next at 12:39:56 PM (approx 40 mins)
Olive on the other hand finished the first interview at 1:05:05 PM and began the next at 2:06:23 PM (an hr later)
How can I calculate difference in these lagged values in a tidyverse context?

1 Like

Hi @rigs ,

the link gives a 401 Unauthorized. So you will have to make it public - otherwise it is not reachable.

A handy way to supply data is to use the dput() function. Do dput(mydata) where "mydata" is the name of your dataset. For really large datasets probably dput(head(mydata, 100)) will do. Paste the output between
```

```

Hi @vedoa,

I have made the data public, kindly revisit.

You sample data has no times just dates.

``` r
DT 
        start          end     enumerator  time_lapse
       <char>     <char>     <char>     <char>
1: 2024-12-20 2024-12-20      Emmah       <NA>
2: 2024-12-20 2024-12-20      Emmah          a
3: 2024-12-20 2024-12-20      Olive          b
4: 2024-12-20 2024-12-20      Olive          c

<sup>Created on 2025-01-01 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>

It does have date/times but you have to change the format.

To accomplish this task, I see a few steps:

  1. Read in the data (exercise left to you)
  2. Convert the start and end dates to be date-times, not characters
  3. Sort the data by the times as that is necessary to appropriately calculate lags
  4. Group the data by enumerator so you are not combining Emmah's times with Oliv'es
  5. Calculate the difference between a start time and the prior end time
library(tidyverse)

dat <- structure(list(start = c("12/20/2024 11:41:03", "12/20/2024 12:39:56", 
                                "12/20/2024 12:19:09", "12/20/2024 14:06:23"), end = c("12/20/2024 12:01:17", 
                                                                                       "12/20/2024 13:16:04", "12/20/2024 13:05:05", "12/20/2024 14:21:43"
                                ), enumerator = c("Emmah", "Emmah", "Olive", "Olive")), class = c("tbl_df", 
                                                                                                  "tbl", "data.frame"), row.names = c(NA, -4L))

dat %>%
  mutate(across(c(start, end), mdy_hms)) %>%
  arrange(enumerator, start) %>%
  group_by(enumerator) %>%
  mutate(
    InterviewLag=start-lag(end)
  )
#> # A tibble: 4 × 4
#> # Groups:   enumerator [2]
#>   start               end                 enumerator InterviewLag
#>   <dttm>              <dttm>              <chr>      <drtn>      
#> 1 2024-12-20 11:41:03 2024-12-20 12:01:17 Emmah        NA secs   
#> 2 2024-12-20 12:39:56 2024-12-20 13:16:04 Emmah      2319 secs   
#> 3 2024-12-20 12:19:09 2024-12-20 13:05:05 Olive        NA secs   
#> 4 2024-12-20 14:06:23 2024-12-20 14:21:43 Olive      3678 secs

Created on 2025-01-01 with reprex v2.1.0

1 Like

Ah, my bad.

I did not download the file but just did a copy and {datapasta} paste. My apologies.

I see @ StatSteph has given you a solution. Just for the heck of it, here in a different one using {data.table} and {rio}

library(data.table)
library(rio)

DT <- as.data.table(import("sample data.xlsx"))
DT[, "time_lapse" := start - shift(end, n = 1,   fill = NA, type = "lag"), by = enumerator ]
DT

Thanks so much for this, exactly what I needed.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.