Help with grouping and mutating a messy dataset

sj-io · December 27, 2021, 10:41am

Hey y'all!

I'm working with a very large, messy dataset and I need some help with a problem beyond my current knowledge.

There are two variables: col A is a list of numeric IDs, col B is a character description. Sometimes col B will contain a reference to a different ID. I already wrote code to extract IDs from col B into col C and convert them to numeric, so they can be matched with col A.

My goal is to take all of the IDs that relate to each other and combine their descriptions into one.

Current dataset:

A	B
01	Cat 02
02	Dog
03	Rabbit 04
04	Hamster 03
05	Dolphin 01

Goal:

A	B
01	Cat 02 - Dog - Dolphin 01
02	Cat 02 - Dog - Dolphin 01
03	Rabbit 04 - Hamster 03
04	Rabbit 04 - Hamster 03
05	Cat 02 - Dog - Dolphin 01

So far I can extract IDs into col C,

A	B	C
01	Cat 02	02
02	Dog	NA
03	Rabbit 04	04
04	Hamster 03	03
05	Dolphin 01	01

and then col D returns the match in col A

A	B	C	D
01	Cat 02	02	Dog
02	Dog	NA	NA
03	Rabbit 04	04	Hamster 03
04	Hamster 03	03	Rabbit 04
05	Dolphin	01	Cat 02

I've been using unite, group_by %>% mutate str_flatten, and separate_rows to some success. My problem is I'll end up with rows like

A	B
01	Cat 02 - Dog
02	Dog
03	Rabbit 04 - Hamster 03
04	Rabbit 04 - Hamster 03
05	Cat 02 - Dog - Dolphin 01

where I'm not catching all unique descriptions for each item in a group and only some rows will contain all the data. Really I'm having trouble trying to group in general because sometimes there's recursions, sometimes multiple As relate to one C and then multiple Cs will relate to one A...

Does anyone have advice for which functions would be useful in this situation? Or a different approach that would be better? Let me know if I need to provide more information to solve.

technocrat · December 28, 2021, 7:05am

This is not the goal, because I am unsure of the logic. This implements constructing a string with the values of the animal, the id of the animal it references, and the animal that references it.

# re-code to clear air; provide meaningful variable names, symbols for id
d <- readr::read_csv("~/Desktop/grist.csv")
#> Rows: 5 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): critter
#> dbl (1): id
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d
#> # A tibble: 5 × 2
#>      id critter  
#>   <dbl> <chr>    
#> 1     1 cat 2    
#> 2     2 dog      
#> 3     3 rabbit 4 
#> 4     4 hamster 3
#> 5     5 dolphin 1
# cross referenced critter
pat1 <- "[0-9]"
pat2 <- "[A-Za-z]+"
buddies <- stringr::str_extract(d$critter,pat1) |> as.integer()
crits <- stringr::str_extract(d$critter,pat2) 
d$buddy <- buddies
d$critter <- crits
d
#> # A tibble: 5 × 3
#>      id critter buddy
#>   <dbl> <chr>   <int>
#> 1     1 cat         2
#> 2     2 dog        NA
#> 3     3 rabbit      4
#> 4     4 hamster     3
#> 5     5 dolphin     1

# record refers to another record
d$refers_to <- d[d$buddy,][2][[1]]

# record referred to by another record

find_referral <- function(x) d$critter[which(d$refers_to %in% x)]
referrals <- unlist(lapply(d$critter,find_referral))
referrals
#> [1] "dolphin" "cat"     "hamster" "rabbit"

# record not referred to by another record
orphans <- referrals[!(referrals %in% d$refers_to)]

# create column of referring columns
d$referrals <- c(referrals,orphans)
# eliminate "self-references"
d$referrals[which(d$critter == d$referrals)] <- NA

# create suite variable
d$suite <- paste(d$critter,stringr::str_pad(as.character(d$buddy),2,"left","0"),d$refers_to,d$referrals,sep = "-")

d[,c(1,6)]
#> # A tibble: 5 × 2
#>      id suite                    
#>   <dbl> <chr>                    
#> 1     1 cat-02-dog-dolphin       
#> 2     2 dog-NA-NA-cat            
#> 3     3 rabbit-04-hamster-hamster
#> 4     4 hamster-03-rabbit-rabbit 
#> 5     5 dolphin-01-cat-NA

system · January 18, 2022, 7:06am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.