R removing duplicates with conditions

chrissy1402 · November 20, 2019, 1:48pm

Hi,

I have the following dataset:

ID Visite 200 other Variables 
1  1
1  2
1  4
1  4
1  5
1  6
2  1
2  2
2  3
2  4
2  4
2  6
2  7
3  1
3  2
3  2
3  3
3  4
3  5
4  1
4  2
4  3

I would like to see which IDs contain duplicates in Visite (here: 1: 2x4; 2: 2x4; 3: 2x3)
I would like to change the duplicates to the lower Visite if e.g. at ID 1 visite 4 or the lower visit is missing, to the higher visite if e.g. at ID 2 visite 5 or the higher visit is missing, or delete one whole row including all other values of a person if there is a duplicate but no adjacent visit is missing.

Could you please help me? I tried some codes but nothing worked.

Thank you.
Best
Christina

mara · November 20, 2019, 2:02pm

This will be easier to discuss with a reproducible example (which allows others to run the code you're running). I've made a shorter version, and added a fake column so you can tell which is the first or second instance of an ID/Visite combo. When you say "lower" and "higher" do you mean lower in the table, or lower number?

This gives you a start,

library(tidyverse)
dummydat <- tibble::tribble(
  ~ID, ~Visite,  ~Pet,
   1L,      1L, "dog",
   1L,      2L, "dog",
   1L,      4L, "cat",
   1L,      4L, "dog",
   1L,      5L, "cat",
   1L,      6L, "cat",
   2L,      1L, "cat",
   2L,      2L, "cat",
   2L,      3L, "dog",
   2L,      4L, "dog",
   2L,      4L, "cat",
   2L,      6L, "dog",
   2L,      7L, "dog"
  )

dummydat %>%
  distinct(ID, Visite, .keep_all = TRUE)
#> # A tibble: 11 x 3
#>       ID Visite Pet  
#>    <int>  <int> <chr>
#>  1     1      1 dog  
#>  2     1      2 dog  
#>  3     1      4 cat  
#>  4     1      5 cat  
#>  5     1      6 cat  
#>  6     2      1 cat  
#>  7     2      2 cat  
#>  8     2      3 dog  
#>  9     2      4 dog  
#> 10     2      6 dog  
#> 11     2      7 dog

dummydat %>%
  group_by(ID, Visite) %>%
  dplyr::mutate(row_number = dplyr::row_number())
#> # A tibble: 13 x 4
#> # Groups:   ID, Visite [11]
#>       ID Visite Pet   row_number
#>    <int>  <int> <chr>      <int>
#>  1     1      1 dog            1
#>  2     1      2 dog            1
#>  3     1      4 cat            1
#>  4     1      4 dog            2
#>  5     1      5 cat            1
#>  6     1      6 cat            1
#>  7     2      1 cat            1
#>  8     2      2 cat            1
#>  9     2      3 dog            1
#> 10     2      4 dog            1
#> 11     2      4 cat            2
#> 12     2      6 dog            1
#> 13     2      7 dog            1

^{Created on 2019-11-20 by the reprex package (v0.3.0.9001)}

These StackOverflow threads should also help point you in the right direction (the first one provides a couple different solutions using different R packages, as well)

chrissy1402 · November 21, 2019, 8:51am

Thank you very much.

When I say "lower" and "higher" I mean lower in the number.
In this example at ID 1L the lower number is missing so Visite = 4L and Pet = cat should switch to Visite = 3L and Pet = cat.
At ID 2L the higher number is missing so Visite = 4L and Pet = cat should switch to Visite = 5L and Pet = cat.

mara · November 21, 2019, 11:14am

Yep, so I think that the answers in here should let you do just that — it's just one more step beyond what I've shown you above. Give it a whirl, and post back your code if you get stuck.

chrissy1402 · November 27, 2019, 11:55am

Thank you very much. I will try this next week. My kid is ill and I can't work this week.

chrissy1402 · December 4, 2019, 9:06am

I tried this:

library(tidyverse)
dummydat <- tibble::tribble(
  ~ID, ~Visite,  ~Pet,
  1L,      1L, "dog",
  1L,      2L, "dog",
  1L,      4L, "cat",
  1L,      4L, "dog",
  1L,      5L, "cat",
  1L,      6L, "cat",
  2L,      1L, "cat",
  2L,      2L, "cat",
  2L,      3L, "dog",
  2L,      4L, "dog",
  2L,      4L, "cat",
  2L,      6L, "dog",
  2L,      7L, "dog",
  3L,      1L, "cat",
  3L,      2L, "cat",
  3L,      3L, "dog",
  3L,      4L, "dog",
  3L,      4L, "cat",
  3L,      5L, "dog",
  3L,      6L, "dog"
)
dummydat

library(dplyr)

dummydat2<- dummydat %>%
  group_by(ID) %>%
  mutate(Visite = case_when(duplicated(Visite, fromLast = TRUE) ~
                              lag(Visite) + 1L, TRUE ~ Visite))
dummydat2

dummydat3<- dummydat2 %>%
  group_by(ID) %>%
  mutate(Visite = case_when(duplicated(Visite, fromLast = FALSE) ~
                              lag(Visite) + 1L, TRUE ~ Visite))
dummydat3

dummydat4<- dummydat3 %>%
  group_by(ID) %>% filter(!duplicated(Visite))  %>% 
  filter(!is.na(Pet))
dummydat4

I really don't get it. Yesterday it worked but today all values suddenly change ....

The result is:

# A tibble: 17 x 3
# Groups:   ID [3]
      ID Visite Pet  
   <int>  <int> <chr>
 1     1     NA dog  
 2     1      2 dog  
 3     1      3 cat  
 4     1      5 dog  
 5     1      6 cat  
 6     2      7 cat  
 7     2      8 cat  
 8     2      3 dog  
 9     2      4 dog  
10     2      5 cat  
11     2      6 dog  
12     3      1 cat  
13     3      2 cat  
14     3      3 dog  
15     3      4 dog  
16     3      5 cat  
17     3      6 dog

The values get completely confused!

chrissy1402 · December 4, 2019, 9:18am

The result should be:

dummydat <- tibble::tribble(
  ~ID, ~Visite,  ~Pet,
  1L,      1L, "dog",
  1L,      2L, "dog",
  1L,      3L, "cat",
  1L,      4L, "dog",
  1L,      5L, "cat",
  1L,      6L, "cat",
  2L,      1L, "cat",
  2L,      2L, "cat",
  2L,      3L, "dog",
  2L,      4L, "dog",
  2L,      5L, "cat",
  2L,      6L, "dog",
  2L,      7L, "dog",
  3L,      1L, "cat",
  3L,      2L, "cat",
  3L,      3L, "dog",
  3L,      4L, "dog",
  3L,      5L, "dog",
  3L,      6L, "dog"
)

mara · December 4, 2019, 2:07pm

library(tidyverse)
dummydat <- tibble::tribble(
  ~ID, ~Visite,  ~Pet,
  1L,      1L, "dog",
  1L,      2L, "dog",
  1L,      4L, "cat",
  1L,      4L, "dog",
  1L,      5L, "cat",
  1L,      6L, "cat",
  2L,      1L, "cat",
  2L,      2L, "cat",
  2L,      3L, "dog",
  2L,      4L, "dog",
  2L,      4L, "cat",
  2L,      6L, "dog",
  2L,      7L, "dog",
  3L,      1L, "cat",
  3L,      2L, "cat",
  3L,      3L, "dog",
  3L,      4L, "dog",
  3L,      4L, "cat",
  3L,      5L, "dog",
  3L,      6L, "dog"
)

dummydat %>%
  group_by(ID, Visite) %>%
  dplyr::mutate(row_number = dplyr::row_number()) %>%
  filter(row_number == max(row_number))
#> # A tibble: 17 x 4
#> # Groups:   ID, Visite [17]
#>       ID Visite Pet   row_number
#>    <int>  <int> <chr>      <int>
#>  1     1      1 dog            1
#>  2     1      2 dog            1
#>  3     1      4 dog            2
#>  4     1      5 cat            1
#>  5     1      6 cat            1
#>  6     2      1 cat            1
#>  7     2      2 cat            1
#>  8     2      3 dog            1
#>  9     2      4 cat            2
#> 10     2      6 dog            1
#> 11     2      7 dog            1
#> 12     3      1 cat            1
#> 13     3      2 cat            1
#> 14     3      3 dog            1
#> 15     3      4 cat            2
#> 16     3      5 dog            1
#> 17     3      6 dog            1

^{Created on 2019-12-04 by the reprex package (v0.3.0.9001)}

If you want to get rid of the row_number column, you can just use select(-row_number) at the end.

system · December 25, 2019, 2:07pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.