Hi internet,
What I'm trying to achieve is a bit too hard to explain, so I will assume I am an example of "How do I saddle a horse in a balcony?" from @jennybryan talk. So I will try to explain the broader problem (in case there is an easy way of doing this) and then ask the right way of saddling this horse in case there is no way around it.
The big picture:
I want to join two tables: One contains a list of species with their genus as another column, the other one contains another list of species, genus with their correspondent habitat. What I want to do is a "sequential join" of sorts: try first to join by species, and if there is no match, try by genus, and add the habitat column.
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2
dataf1 <- tibble (Genus = LETTERS[c(1:3,3)],
Species = c("A.A", "B.D", "C.B", "C.E"))
dataf2 <- tibble (Genus = LETTERS[c(2,2,3,3,3)],
Species = LETTERS[1:5],
HET = c(rep("HET",4),"AUT") ) %>%
unite(Genus,Species, col = "Species", sep = ".", remove = F)
desired.output <- dataf1 %>% mutate(HET = c(NA, "HET", "undeternined", "AUT"))
dataf1
#> # A tibble: 4 x 2
#> Genus Species
#> <chr> <chr>
#> 1 A A.A
#> 2 B B.D
#> 3 C C.B
#> 4 C C.E
dataf2
#> # A tibble: 5 x 3
#> Species Genus HET
#> <chr> <chr> <chr>
#> 1 B.A B HET
#> 2 B.B B HET
#> 3 C.C C HET
#> 4 C.D C HET
#> 5 C.E C AUT
desired.output
#> # A tibble: 4 x 3
#> Genus Species HET
#> <chr> <chr> <chr>
#> 1 A A.A <NA>
#> 2 B B.D HET
#> 3 C C.B undeternined
#> 4 C C.E AUT
Created on 2019-03-08 by the reprex package (v0.2.1)
The Problem:
The first join is an unequivocal join: there is only one value per species, so there will be only one line returned per match. So if I do
dataf2 %>%
select(Species, HET) %>%
inner_join(dataf1, ., by = "Species") -> first.join
I will only get those that have a match for species.
For the rest of the entries that didn't have a match for the species level (anti_join
? filter
?) I will try the join by genus.
dataf1 %>%
anti_join(dataf2, by = "Species") -> left.over
dataf2 %>%
select(Genus, HET) %>%
left_join(left.over,., by = "Genus")
# And later %>% bind_rows(first.join)
But if joining by genus, the output will have duplicate entries because there is more than entry per genus on df2
In some cases, I can reduce the inflated number of entries by using distinct()
dataf2 %>%
select(Genus, HET) %>%
left_join(dataf1,., by = "Genus") %>%
distinct() -> better.but.not.there.yet
But for species C.B. it returns two entries because there are two values - Is there a way of getting the desired output? i.e. If there is more than one value, return ("undetermined").
What I have tried is
- nest the output by genus, calculate the number of rows for each entry and mutate a new column that returns the original if
nrow == 1
, returns the value "undetermined" if not. This doesn't work, which is weird because it is a nice horse and the balcony is pretty cool.
Update: It does work, apparently the thing you need to do is to spend 1 hr putting your thoughts in order by writing the reprex
better.but.not.there.yet %>%
nest(-Genus) %>%
mutate(n = map_dbl(data, ~nrow(.x))) %>%
mutate(data = map2(n,data, ~.y %>%
mutate(HET = case_when(.x == 1 ~ HET,
.x !=1 ~ "undetermined")) %>%
distinct)) %>%
select(-n) %>%
unnest(data) %>%
bind_rows(first.join) # Bind with the first step join
So I think I did it, but I will welcome any comments to the overall problem.