Mutate, Map, case_when with nested tibbles (and saddling horses)

rgallego · March 8, 2019, 11:33pm

Hi internet,

What I'm trying to achieve is a bit too hard to explain, so I will assume I am an example of "How do I saddle a horse in a balcony?" from @jennybryan talk. So I will try to explain the broader problem (in case there is an easy way of doing this) and then ask the right way of saddling this horse in case there is no way around it.

The big picture:
I want to join two tables: One contains a list of species with their genus as another column, the other one contains another list of species, genus with their correspondent habitat. What I want to do is a "sequential join" of sorts: try first to join by species, and if there is no match, try by genus, and add the habitat column.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2

dataf1 <- tibble (Genus = LETTERS[c(1:3,3)],
                  Species = c("A.A", "B.D", "C.B", "C.E")) 


dataf2 <- tibble (Genus = LETTERS[c(2,2,3,3,3)],
                 Species = LETTERS[1:5],
                 HET = c(rep("HET",4),"AUT") )  %>% 
  unite(Genus,Species, col = "Species", sep = ".", remove = F)

desired.output <- dataf1 %>% mutate(HET = c(NA, "HET", "undeternined", "AUT"))

dataf1
#> # A tibble: 4 x 2
#>   Genus Species
#>   <chr> <chr>  
#> 1 A     A.A    
#> 2 B     B.D    
#> 3 C     C.B    
#> 4 C     C.E

dataf2
#> # A tibble: 5 x 3
#>   Species Genus HET  
#>   <chr>   <chr> <chr>
#> 1 B.A     B     HET  
#> 2 B.B     B     HET  
#> 3 C.C     C     HET  
#> 4 C.D     C     HET  
#> 5 C.E     C     AUT

desired.output
#> # A tibble: 4 x 3
#>   Genus Species HET         
#>   <chr> <chr>   <chr>       
#> 1 A     A.A     <NA>        
#> 2 B     B.D     HET         
#> 3 C     C.B     undeternined
#> 4 C     C.E     AUT

^{Created on 2019-03-08 by the reprex package (v0.2.1)}

The Problem:

The first join is an unequivocal join: there is only one value per species, so there will be only one line returned per match. So if I do

dataf2 %>% 
  select(Species, HET) %>% 

inner_join(dataf1, ., by = "Species") -> first.join

I will only get those that have a match for species.

For the rest of the entries that didn't have a match for the species level (anti_join? filter?) I will try the join by genus.

dataf1 %>% 
  anti_join(dataf2, by = "Species") -> left.over

dataf2 %>% 
  select(Genus, HET) %>% 
  left_join(left.over,., by = "Genus") 
# And later %>% bind_rows(first.join)

But if joining by genus, the output will have duplicate entries because there is more than entry per genus on df2

In some cases, I can reduce the inflated number of entries by using distinct()

dataf2 %>% 
  select(Genus, HET) %>% 
  left_join(dataf1,., by = "Genus") %>% 
  distinct() -> better.but.not.there.yet

But for species C.B. it returns two entries because there are two values - Is there a way of getting the desired output? i.e. If there is more than one value, return ("undetermined").

What I have tried is

nest the output by genus, calculate the number of rows for each entry and mutate a new column that returns the original if nrow == 1, returns the value "undetermined" if not. This doesn't work, which is weird because it is a nice horse and the balcony is pretty cool.
Update: It does work, apparently the thing you need to do is to spend 1 hr putting your thoughts in order by writing the reprex

better.but.not.there.yet %>% 
  nest(-Genus) %>% 
  mutate(n = map_dbl(data, ~nrow(.x))) %>% 
  mutate(data = map2(n,data, ~.y %>%
                               mutate(HET = case_when(.x == 1 ~ HET,
                                                      .x !=1 ~  "undetermined")) %>% 
                               distinct)) %>% 
  select(-n) %>% 
  unnest(data) %>% 
  bind_rows(first.join) # Bind with the first step join

So I think I did it, but I will welcome any comments to the overall problem.

technocrat · March 9, 2019, 2:29am

I'll offer a philosophical answer, instead of code, because your reprex shows that you understand tidiness and the basis.

What is your unit of analysis?

I nominate critter, which has three attributes: genus, species and habitat. The poor generic critters have to suffer the indignity of being coded as generic for species.

In your position, I would refactor species from A.A to just A. The idea is first to subset the data into critters with unassigned species and inner_join them with habitat. Then subset the data into critters with assigned species and inner_join with habitat. and your desired result should fall out easily

Felis genericus undetermined
Felis catus urban
Panthera leo savanna

When you need to recode to the genus.species a simple mutate/paste should be all you need.

In the dying stages of the Industrial Age, part numbers were taxonomically coded XXXX-aaaa-1111, to be able to give the user an idea of what it was. In the Information Age, we haven't yet shedded the notion that unit names should be decodable. It's just a darned key. The critter could care less if it's called 14203.

I came close to having to get a tattoo on my forearm to remind me that tibbles are not vectors. map is for vectors and works very well on tibble columns and can be made to work on tibble rows, at some expense comparable to just doing it in Haskell. I spent the past three months in that twisty maze of passages.

My generic suggestion remains that when getting the data structure desired seems beyond reach is to rethink the unit of observation in its most general terms before adding attributes into the equation.

Saddle up, cowpoke!

nathania · March 9, 2019, 5:25pm

Hi! If you rearrange your approach a bit, you can do without nest() and unnest().

library(tidyverse)
dataf1 <- tibble(Genus = LETTERS[c(1:3,3)],
                Species = c("A.A", "B.D", "C.B", "C.E")) 
dataf2 <- tibble (Genus = LETTERS[c(2,2,3,3,3)],
                  Species = LETTERS[1:5],
                  HET = c(rep("HET",4),"AUT"))  %>% 
  unite(Genus, Species, col = "Species", sep = ".", remove = F)
desired.output <- dataf1 %>% mutate(HET = c(NA, "HET", "undetermined", "AUT"))

result.S <- dataf1 %>% inner_join(dataf2)
#> Joining, by = c("Genus", "Species")

f.G_to_H <- dataf2 %>% 
  distinct(Genus, HET) %>%
  add_count(Genus) %>% 
  mutate(HET = if_else(n == 1, HET, 'undetermined')) %>% 
  distinct(Genus, HET)
result.G <- anti_join(dataf1, dataf2) %>% 
  left_join(f.G_to_H)
#> Joining, by = c("Genus", "Species")
#> Joining, by = "Genus"

result <- bind_rows(result.G, result.S)
all.equal(result, desired.output)
#> [1] TRUE

^{Created on 2019-03-09 by the reprex package (v0.2.1)}

rgallego · March 11, 2019, 7:59pm

Thanks heaps @technocrat! A lot of food for thought there - Never thought of readjust the unit of observation

rgallego · March 11, 2019, 8:00pm

Thanks Nathania! That sure does the trick

technocrat · March 11, 2019, 8:21pm

I had the luck to come to R from just having learned MySQL where if you had a unique index key, you were pretty much stuck with the tidyway of one observation per row and one variable per column. It's become pretty much a world view for me!

system · March 18, 2019, 8:21pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.