I'm working with a very large, messy dataset and I need some help with a problem beyond my current knowledge.
There are two variables: col A is a list of numeric IDs, col B is a character description. Sometimes col B will contain a reference to a different ID. I already wrote code to extract IDs from col B into col C and convert them to numeric, so they can be matched with col A.
My goal is to take all of the IDs that relate to each other and combine their descriptions into one.
Current dataset:
A
B
01
Cat 02
02
Dog
03
Rabbit 04
04
Hamster 03
05
Dolphin 01
Goal:
A
B
01
Cat 02 - Dog - Dolphin 01
02
Cat 02 - Dog - Dolphin 01
03
Rabbit 04 - Hamster 03
04
Rabbit 04 - Hamster 03
05
Cat 02 - Dog - Dolphin 01
So far I can extract IDs into col C,
A
B
C
01
Cat 02
02
02
Dog
NA
03
Rabbit 04
04
04
Hamster 03
03
05
Dolphin 01
01
and then col D returns the match in col A
A
B
C
D
01
Cat 02
02
Dog
02
Dog
NA
NA
03
Rabbit 04
04
Hamster 03
04
Hamster 03
03
Rabbit 04
05
Dolphin
01
Cat 02
I've been using unite, group_by %>% mutate str_flatten, and separate_rows to some success. My problem is I'll end up with rows like
A
B
01
Cat 02 - Dog
02
Dog
03
Rabbit 04 - Hamster 03
04
Rabbit 04 - Hamster 03
05
Cat 02 - Dog - Dolphin 01
where I'm not catching all unique descriptions for each item in a group and only some rows will contain all the data. Really I'm having trouble trying to group in general because sometimes there's recursions, sometimes multiple As relate to one C and then multiple Cs will relate to one A...
Does anyone have advice for which functions would be useful in this situation? Or a different approach that would be better? Let me know if I need to provide more information to solve.
This is not the goal, because I am unsure of the logic. This implements constructing a string with the values of the animal, the id of the animal it references, and the animal that references it.
# re-code to clear air; provide meaningful variable names, symbols for id
d <- readr::read_csv("~/Desktop/grist.csv")
#> Rows: 5 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): critter
#> dbl (1): id
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d
#> # A tibble: 5 × 2
#> id critter
#> <dbl> <chr>
#> 1 1 cat 2
#> 2 2 dog
#> 3 3 rabbit 4
#> 4 4 hamster 3
#> 5 5 dolphin 1
# cross referenced critter
pat1 <- "[0-9]"
pat2 <- "[A-Za-z]+"
buddies <- stringr::str_extract(d$critter,pat1) |> as.integer()
crits <- stringr::str_extract(d$critter,pat2)
d$buddy <- buddies
d$critter <- crits
d
#> # A tibble: 5 × 3
#> id critter buddy
#> <dbl> <chr> <int>
#> 1 1 cat 2
#> 2 2 dog NA
#> 3 3 rabbit 4
#> 4 4 hamster 3
#> 5 5 dolphin 1
# record refers to another record
d$refers_to <- d[d$buddy,][2][[1]]
# record referred to by another record
find_referral <- function(x) d$critter[which(d$refers_to %in% x)]
referrals <- unlist(lapply(d$critter,find_referral))
referrals
#> [1] "dolphin" "cat" "hamster" "rabbit"
# record not referred to by another record
orphans <- referrals[!(referrals %in% d$refers_to)]
# create column of referring columns
d$referrals <- c(referrals,orphans)
# eliminate "self-references"
d$referrals[which(d$critter == d$referrals)] <- NA
# create suite variable
d$suite <- paste(d$critter,stringr::str_pad(as.character(d$buddy),2,"left","0"),d$refers_to,d$referrals,sep = "-")
d[,c(1,6)]
#> # A tibble: 5 × 2
#> id suite
#> <dbl> <chr>
#> 1 1 cat-02-dog-dolphin
#> 2 2 dog-NA-NA-cat
#> 3 3 rabbit-04-hamster-hamster
#> 4 4 hamster-03-rabbit-rabbit
#> 5 5 dolphin-01-cat-NA