R does not show me the condition fulfilled in the corresponding row

juandmaz · December 18, 2023, 5:36am

I have this data set:

head(df)
# A tibble: 6 × 3
  anio  id_mujer fecha_muestra      
  <chr>    <dbl> <dttm>             
1 2015      4807 2015-06-26 00:00:00
2 2017      4807 2017-06-02 00:00:00
3 2018      4807 2018-11-07 00:00:00
4 2018      8029 2018-03-23 00:00:00
5 2019      8029 2019-09-06 00:00:00
6 2021      8029 2021-04-23 00:00:00

Each woman, identified by id_mujer, has 3 observations.
I want to find the women whose second and third observations have 'fecha_muestra' values that are 90 days or less apart.

df %>%
  group_by(id_mujer) %>%
  mutate(distancia = difftime(lead(fecha_muestra, 2), lead(fecha_muestra, 1), units = "days"),
    igual= difftime(lead(fecha_muestra, 2), lead(fecha_muestra, 1), units = "days") <= 90) %>%
  filter(any(igual)==T) %>%
  relocate(igual, fecha_muestra) %>%
  view()

3 women meet this condition. The problem is that when I look at one of those women, the TRUE appears in the first observation instead of the corresponding one.
How can I solve it?

EDIT:
I'm gonna add an example.

df %>%
  filter(id_mujer==79051)
# A tibble: 3 × 3
  anio  id_mujer fecha_muestra      
  <chr>    <dbl> <dttm>             
1 2016     79051 2016-07-19 00:00:00
2 2017     79051 2017-10-23 00:00:00
3 2017     79051 2017-09-12 00:00:00

As we can see, the dates in observations 2 and 3 are less than 90 days apart.
But when I try to do this calculation, R marks it in row 1.

df %>%
  group_by(id_mujer) %>%
  mutate(distancia = difftime(lead(fecha_muestra, 2), 
                              lead(fecha_muestra, 1), 
                              units = "days"),
         igual= difftime(lead(fecha_muestra, 2), 
                         lead(fecha_muestra, 1), 
                         units = "days") <= 90) %>%
  filter(any(igual)==T) %>%
  relocate(igual, fecha_muestra) %>%
  filter(id_mujer==79051)

# A tibble: 3 × 5
# Groups:   id_mujer [1]
  igual fecha_muestra       anio  id_mujer distancia
  <lgl> <dttm>              <chr>    <dbl> <drtn>   
1 TRUE  2016-07-19 00:00:00 2016     79051 -41 days 
2 NA    2017-10-23 00:00:00 2017     79051  NA days 
3 NA    2017-09-12 00:00:00 2017     79051  NA days

nirgrahamuk · December 18, 2023, 11:20am

You say you are interested in comparing the values of the last two entries, but the code you wrote uses lead 2 and lead 1. Think about which rows of your source data both lead 2 and lead 1 can work for: only the first row.
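
A minimal sketch of that point, using a made-up three-element vector (`x` is illustrative and not part of the original data). It shows that within a group of three rows, `lead(x, 1)` and `lead(x, 2)` are both non-missing only at position 1, so any value computed from the two of them can only land on the first row.

```r
library(dplyr)

# Hypothetical stand-in for the three rows of one woman
x <- c("row1", "row2", "row3")

lead(x, 1)  # "row2" "row3" NA
lead(x, 2)  # "row3" NA     NA

# Positions where both shifted vectors have a value
ambos <- which(!is.na(lead(x, 1)) & !is.na(lead(x, 2)))
ambos       # 1
```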

juandmaz · December 18, 2023, 6:31pm

Hi, the code I wrote refers to rows 2 and 3, since lead() is used to move values forward in a vector or column.
For example, if you have a column of data and you apply lead() to that column, you will get a new column where each value is the next one in the original sequence.
I edited the post with an example to visualize my problem.

nirgrahamuk · December 19, 2023, 9:40am

It is as I explained.

library(tidyverse)

tribble(
  ~anio, ~id_mujer, ~fecha_muestra,
  2016,  79051,     "2016-07-19 00:00:00",
  2017,  79051,     "2017-10-23 00:00:00",
  2017,  79051,     "2017-09-12 00:00:00"
) |>
  mutate(fecha_muestra = as_datetime(fecha_muestra)) |>
  group_by(id_mujer) |>
  mutate(
    # Original approach: lead 2 and lead 1 are both non-NA only on row 1
    distancia_original = difftime(lead(fecha_muestra, 2),
                                  lead(fecha_muestra, 1),
                                  units = "days"),
    igual_original = difftime(lead(fecha_muestra, 2),
                              lead(fecha_muestra, 1),
                              units = "days") <= 90,
    # Next row minus current row (lead(x, 0) is x itself)
    distancia_accurate = difftime(lead(fecha_muestra, 1),
                                  lead(fecha_muestra, 0),
                                  units = "days"),
    igual_accurate = difftime(lead(fecha_muestra, 1),
                              lead(fecha_muestra, 0),
                              units = "days") <= 90
  )


# A tibble: 3 × 7
# Groups:   id_mujer [1]
#    anio id_mujer fecha_muestra       distancia_original igual_original distancia_accurate igual_accurate
#   <dbl>    <dbl> <dttm>              <drtn>             <lgl>          <drtn>             <lgl>
# 1  2016    79051 2016-07-19 00:00:00 -41 days           TRUE           461 days           FALSE
# 2  2017    79051 2017-10-23 00:00:00  NA days           NA             -41 days           TRUE
# 3  2017    79051 2017-09-12 00:00:00  NA days           NA              NA days           NA

juandmaz · December 21, 2023, 7:06pm

Thanks for the answer.
I didn't understand at first, but now I see what the problem was.
The problem with the solution you gave me is that it also marks TRUE when the condition is met between rows 1 and 2. I need it to apply only to rows 2 and 3, with the TRUE going in row 2.

nirgrahamuk · December 21, 2023, 7:53pm

I recommend you don't use lead or lag, because what you want to do is not based on the lead and lag concepts. Your concept is based on row numbers, so use the dplyr row_number() function.
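
One way this suggestion could look is sketched below. It is not code from the thread. It assumes every woman has exactly 3 observations, as stated in the question, and it rebuilds the example woman by hand. The names `resultado` and `mujeres` are hypothetical. The use of `abs()` is also an assumption: the example dates are out of order (the gap is -41 days), so the sketch compares the absolute gap with 90. Drop `abs()` to keep the original `<= 90` comparison.

```r
library(dplyr)

# Example rows copied from the thread (one woman, three observations)
df <- tibble(
  anio = c("2016", "2017", "2017"),
  id_mujer = c(79051, 79051, 79051),
  fecha_muestra = as.POSIXct(
    c("2016-07-19", "2017-10-23", "2017-09-12"), tz = "UTC"
  )
)

resultado <- df %>%
  group_by(id_mujer) %>%
  mutate(
    # Days from observation 2 to observation 3, repeated on every row of the group
    distancia = as.numeric(
      difftime(fecha_muestra[3], fecha_muestra[2], units = "days")
    ),
    # TRUE only on row 2, and only when the gap is 90 days or less
    # (abs() is an assumption, see the note above)
    igual = row_number() == 2 & abs(distancia) <= 90
  ) %>%
  ungroup()

# Keep every row of the women that meet the condition
mujeres <- resultado %>%
  group_by(id_mujer) %>%
  filter(any(igual)) %>%
  ungroup()
```

Indexing with `fecha_muestra[2]` and `fecha_muestra[3]` inside a grouped `mutate()` picks the second and third rows of each group, so the comparison never involves row 1. If a group had fewer than 3 rows, `fecha_muestra[3]` would be `NA` and `igual` on row 2 would be `NA` as well, so such groups would need separate handling.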