Filter repeated obs by id

Khoajak · May 18, 2022, 8:31pm

Hi RStudio community,
I have the following data and I would like to drop repeated values of x within each id after sorting by 3 columns. For example, I would like to drop the highlighted obs in the dataset.
I used the code below, but it filters repeated x without considering other columns such as id (e.g., it filters out 18 5 2021-02-18 B). If somebody can help me, I would appreciate it.
Thanks

data <- data.frame(id = c(1L,1L,1L,2L,2L,2L,3L,3L,3L,3L,4L,4L,4L,4L,5L,5L,5L,5L,6L,6L,6L,6L,6L),
                   date = c("2020-01-20", "2021-04-25","2021-08-12","2021-03-15","2021-05-17","2021-07-19","2021-03-15", "2021-05-16","2021-06-17", "2021-08-18", 
                            "2021-08-18","2021-02-11", "2021-08-18", "2021-03-19", "2021-06-11", "2021-06-11", "2021-10-01",
                            "2021-02-18", "2021-04-12", "2021-09-13", "2021-06-07", "2021-08-08", "2021-10-18"),
                   x = factor(c("A", "B", "C", "A", "A", "B", "A", "B","B", "C", "A", "A", "B", "C", "A",
                                "A", "B", "B", "A", "A", "B", "B", "C")),
                   stringsAsFactors = FALSE)

   id       date x
1   1 2020-01-20 A
2   1 2021-04-25 B
3   1 2021-08-12 C
4   2 2021-03-15 A
**5   2 2021-05-17 A**
6   2 2021-07-19 B
7   3 2021-03-15 A
8   3 2021-05-16 B
**9   3 2021-06-17 B**
10  3 2021-08-18 C
12  4 2021-02-11 A
14  4 2021-03-19 C
11  4 2021-08-18 A
13  4 2021-08-18 B
18  5 2021-02-18 B
15  5 2021-06-11 A
**16  5 2021-06-11 A**
17  5 2021-10-01 B
19  6 2021-04-12 A
21  6 2021-06-07 B
**22  6 2021-08-08 B**
20  6 2021-09-13 A
23  6 2021-10-18 C

# filter repeated x by id
library(dplyr)
data2<-data %>% filter(x!= lag(x, default="1"))
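
A minimal sketch of why the ungrouped call misbehaves, using a hypothetical four-row `toy` data frame (not the real data) that is already sorted by id and date. Without `group_by(id)`, `lag()` compares the first row of one id with the last row of the previous id, so a legitimate first observation can be dropped. Here `x` is a character column so that `default = "1"` is safe to use.

```r
library(dplyr)

# Hypothetical toy data: the end of id 4 followed by the start of id 5
toy <- data.frame(
  id   = c(4L, 4L, 5L, 5L),
  date = c("2021-08-18", "2021-08-18", "2021-02-18", "2021-06-11"),
  x    = c("A", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Ungrouped: the first B of id 5 is compared with the last B of id 4 and dropped
ungrouped <- toy |> filter(x != lag(x, default = "1"))

# Grouped: lag() restarts within each id, so all four rows are kept
grouped <- toy |> group_by(id) |> filter(x != lag(x, default = "1")) |> ungroup()

nrow(ungrouped)  # 3
nrow(grouped)    # 4
```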

FJCC · May 18, 2022, 8:50pm

Here is one solution.

library(dplyr)
data <- data.frame(id = c(1L,1L,1L,2L,2L,2L,3L,3L,3L,3L,4L,4L,4L,4L,5L,5L,5L,5L,6L,6L,6L,6L,6L),
                   date = c("2020-01-20", "2021-04-25","2021-08-12","2021-03-15","2021-05-17","2021-07-19","2021-03-15", "2021-05-16","2021-06-17", "2021-08-18", 
                            "2021-08-18","2021-02-11", "2021-08-18", "2021-03-19", "2021-06-11", "2021-06-11", "2021-10-01",
                            "2021-02-18", "2021-04-12", "2021-09-13", "2021-06-07", "2021-08-08", "2021-10-18"),
                   x = factor(c("A", "B", "C", "A", "A", "B", "A", "B","B", "C", "A", "A", "B", "C", "A",
                                "A", "B", "B", "A", "A", "B", "B", "C")),
                   stringsAsFactors = FALSE)
OUT <- data |> arrange(id, x, date) |> group_by(id, x) |> 
  slice(1)
OUT
#> # A tibble: 16 x 3
#> # Groups:   id, x [16]
#>       id date       x    
#>    <int> <chr>      <fct>
#>  1     1 2020-01-20 A    
#>  2     1 2021-04-25 B    
#>  3     1 2021-08-12 C    
#>  4     2 2021-03-15 A    
#>  5     2 2021-07-19 B    
#>  6     3 2021-03-15 A    
#>  7     3 2021-05-16 B    
#>  8     3 2021-08-18 C    
#>  9     4 2021-02-11 A    
#> 10     4 2021-08-18 B    
#> 11     4 2021-03-19 C    
#> 12     5 2021-06-11 A    
#> 13     5 2021-02-18 B    
#> 14     6 2021-04-12 A    
#> 15     6 2021-06-07 B    
#> 16     6 2021-10-18 C

Created on 2022-05-18 by the reprex package (v2.0.1)

Khoajak · May 18, 2022, 9:15pm

Thank you very much, FJCC, for the quick response. I would like to drop a repeated treatment (x) for an id by date. For example, for the ids below, I would like to drop the 4 bold obs (rows 5, 9, 16 and 22), which should leave n = 19 obs at the end. Thank you again,

data
   id       date x
1   1 2020-01-20 A
2   1 2021-04-25 B
3   1 2021-08-12 C
4   2 2021-03-15 A
**5   2 2021-05-17 A**
6   2 2021-07-19 B
7   3 2021-03-15 A
8   3 2021-05-16 B
**9   3 2021-06-17 B**
10  3 2021-08-18 C
12  4 2021-02-11 A
14  4 2021-03-19 C
11  4 2021-08-18 A
13  4 2021-08-18 B
18  5 2021-02-18 B
15  5 2021-06-11 A
**16  5 2021-06-11 A**
17  5 2021-10-01 B
19  6 2021-04-12 A
21  6 2021-06-07 B
**22  6 2021-08-08 B**
20  6 2021-09-13 A
23  6 2021-10-18 C

FJCC · May 18, 2022, 9:19pm

But you do not want to drop row 11 that has id = 4 and x = A just like row 12?

Khoajak · May 18, 2022, 10:00pm

No, because id 4 took A at time 1, then C at time 2, then A again at time 3, and then B at time 4. I would like to drop x only if it is repeated in sequence; if an id took a different x in between, I will keep it. Thanks
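
To make the "repeated in sequence" rule concrete, here is a small base R sketch. The helper name `keep_first_of_run` is hypothetical and only for illustration; it marks an element as kept when it differs from the element immediately before it, and it assumes a non-empty vector that is already in date order.

```r
# Hypothetical helper: TRUE for the first element of each consecutive run
keep_first_of_run <- function(x) {
  c(TRUE, x[-1] != x[-length(x)])
}

# id 4, sorted by date: A, C, A, B (A comes back after C, so nothing is dropped)
keep_first_of_run(c("A", "C", "A", "B"))
#> [1] TRUE TRUE TRUE TRUE

# id 2, sorted by date: A, A, B (the second A repeats the first, so it is dropped)
keep_first_of_run(c("A", "A", "B"))
#> [1]  TRUE FALSE  TRUE
```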

FJCC · May 18, 2022, 10:16pm

Try this

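library(dplyr)
# Sort by date within id, then compare each x with the previous x of the same id
OUT <- data |> arrange(id, date) |> group_by(id) |> 
  mutate(Lag = lag(x)) |>            # previous x within the id (NA on the first row)
  filter(is.na(Lag) | x != Lag) |>   # keep the first row and every change in x
  select(-Lag)                       # drop the helper column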
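
For anyone who wants to check the result, here is a self-contained sketch of the same idea that skips the helper column and confirms the expected 19 rows. The name `OUT2` and the trailing `ungroup()` are additions for illustration, not part of the answer above. The dates are ISO-formatted strings, so sorting them as text happens to give chronological order; converting with `as.Date()` first would be the more robust choice for other formats.

```r
library(dplyr)

# Same data as in the original post
data <- data.frame(id = c(1L,1L,1L,2L,2L,2L,3L,3L,3L,3L,4L,4L,4L,4L,5L,5L,5L,5L,6L,6L,6L,6L,6L),
                   date = c("2020-01-20", "2021-04-25","2021-08-12","2021-03-15","2021-05-17","2021-07-19","2021-03-15", "2021-05-16","2021-06-17", "2021-08-18",
                            "2021-08-18","2021-02-11", "2021-08-18", "2021-03-19", "2021-06-11", "2021-06-11", "2021-10-01",
                            "2021-02-18", "2021-04-12", "2021-09-13", "2021-06-07", "2021-08-08", "2021-10-18"),
                   x = factor(c("A", "B", "C", "A", "A", "B", "A", "B","B", "C", "A", "A", "B", "C", "A",
                                "A", "B", "B", "A", "A", "B", "B", "C")),
                   stringsAsFactors = FALSE)

OUT2 <- data |>
  arrange(id, date) |>
  group_by(id) |>
  # lag() is evaluated within each id, so the first row of every id is NA and kept
  filter(is.na(lag(x)) | x != lag(x)) |>
  ungroup()

nrow(OUT2)  # 19
```

Note that id 4 has two observations on 2021-08-18 (A and B). `arrange()` keeps tied rows in their original order, so the result for such ties depends on the input order unless a further sort key is added.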

Khoajak · May 18, 2022, 10:43pm

Yes, it is exactly what I wanted. Thank you very much again for your time and help.
