I need my dataframe put into items of a list so each potential person is in a 'bucket', based on 2 columns ID and ID2_value . Then I'll perform user validation manually to confirm same person or not.
multiple rows with the same ID2_value is potentially the same person - they need to be put together. multiple rows with the same ID is definitely the same person.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
dat <- structure(list(fname_c1 = structure(c(5L, 3L, 4L, 1L, 4L, 5L, 2L, 2L),
.Label = c("BOB", "GARY", "HANS", "PODRICK", "STEFAN"),
class = "factor"),
ID = structure(c(1L, 4L, 2L, 5L, 2L, 1L, 2L, 3L),
.Label = c("100", "104", "110", "205", "600"),
class = "factor"),
ID2 = structure(c(1L, 1L, 2L, 2L, 1L, 2L, 3L, 3L),
.Label = c("MEDICAL#1", "MEDICAL#2", "MEDICAL#4"),
class = "factor"),
ID2_value = structure(c(4L, 4L, 3L, 2L, 5L, 1L, 6L, 6L),
.Label = c("1001", "2005", "5003", "60", "89", "AAB"),
class = "factor"),
by = c(1957L, 1957L, 1933L, 1960L, 1933L, 1957L, 1933L, 1933L),
bm = c(9L, 6L, 6L, 9L, 9L, 6L, 6L, 6L),
bd = c(2L, 7L, 7L, 2L, 2L, 1L, 1L,1L)
)
)
dat <- as_tibble(dat)
dat %>%
group_split(ID, ID2_value)
#> [[1]]
#> # A tibble: 1 x 7
#> fname_c1 ID ID2 ID2_value by bm bd
#> <fct> <fct> <fct> <fct> <int> <int> <int>
#> 1 STEFAN 100 MEDICAL#2 1001 1957 6 1
#>
#> [[2]]
#> # A tibble: 1 x 7
#> fname_c1 ID ID2 ID2_value by bm bd
#> <fct> <fct> <fct> <fct> <int> <int> <int>
#> 1 STEFAN 100 MEDICAL#1 60 1957 9 2
#>
#> [[3]]
#> # A tibble: 1 x 7
#> fname_c1 ID ID2 ID2_value by bm bd
#> <fct> <fct> <fct> <fct> <int> <int> <int>
#> 1 PODRICK 104 MEDICAL#2 5003 1933 6 7
#>
#> [[4]]
#> # A tibble: 1 x 7
#> fname_c1 ID ID2 ID2_value by bm bd
#> <fct> <fct> <fct> <fct> <int> <int> <int>
#> 1 PODRICK 104 MEDICAL#1 89 1933 9 2
#>
#> [[5]]
#> # A tibble: 1 x 7
#> fname_c1 ID ID2 ID2_value by bm bd
#> <fct> <fct> <fct> <fct> <int> <int> <int>
#> 1 GARY 104 MEDICAL#4 AAB 1933 6 1
#>
#> [[6]]
#> # A tibble: 1 x 7
#> fname_c1 ID ID2 ID2_value by bm bd
#> <fct> <fct> <fct> <fct> <int> <int> <int>
#> 1 GARY 110 MEDICAL#4 AAB 1933 6 1
#>
#> [[7]]
#> # A tibble: 1 x 7
#> fname_c1 ID ID2 ID2_value by bm bd
#> <fct> <fct> <fct> <fct> <int> <int> <int>
#> 1 HANS 205 MEDICAL#1 60 1957 6 7
#>
#> [[8]]
#> # A tibble: 1 x 7
#> fname_c1 ID ID2 ID2_value by bm bd
#> <fct> <fct> <fct> <fct> <int> <int> <int>
#> 1 BOB 600 MEDICAL#2 2005 1960 9 2
However, the output I'm looking for is this :
fname_c1 ID ID2 ID2_value by bm bd
<fct> <fct> <fct> <fct> <int> <int> <int>
1 STEFAN 100 MEDICAL#1 60 1957 9 2
2 HANS 205 MEDICAL#1 60 1957 6 7
3 STEFAN 100 MEDICAL#2 1001 1957 6 1
fname_c1 ID ID2 ID2_value by bm bd
4 BOB 600 MEDICAL#2 2005 1960 9 2
fname_c1 ID ID2 ID2_value by bm bd
5 PODRICK 104 MEDICAL#1 89 1933 9 2
6 PODRICK 104 MEDICAL#2 5003 1933 6 7
7 GARY 104 MEDICAL#4 AAB 1933 6 1
8 GARY 110 MEDICAL#4 AAB 1933 6 1