Taxonomic Ambiguity

Craigdux · August 2, 2023, 7:26pm

Hello,
I have a dataset of algae species covering multiple sites and multiple sampling events, with hundreds of species. Some taxa are identified only to genus (e.g., "Microcystis sp."), whereas others (sometimes in the same sampling event, sometimes not) are identified to species (e.g., "Microcystis speciesA", "Microcystis speciesB", etc.).

I am assuming that the Microcystis sp. cells belong to Microcystis speciesA, Microcystis speciesB, etc., and further that the relative abundance of speciesA and speciesB is similar throughout my dataset.

Therefore, I need to distribute the higher-level parent (Microcystis sp.) among M. speciesA and M. speciesB in proportion to their relative abundance. (This is known as "merge parents with children".)

So, I need to:

1. Determine the relative proportions of M. speciesA, M. speciesB, etc.
2. Multiply "species.cells.ml" of "Microcystis sp." by the relative proportions of speciesA, speciesB, etc.
3. Add these values to speciesA and speciesB where those were recorded in the same sampling event; otherwise, create new rows with speciesA and speciesB for that sampling event.
4. Delete the Microcystis sp. rows.

I have no idea how to do this, and am stuck.

I am hoping someone can help me!

Thanks!

I have included some of my data:

df2 <- data.frame(
  stringsAsFactors = FALSE,
                sampling.date = c("2019-10-22","2020-02-11","2020-02-11","2020-12-07",
                                  "2020-05-27","2020-12-07","2021-03-15",
                                  "2021-06-07","2022-02-22","2022-05-02",
                                  "2019-10-22","2020-02-11","2020-05-27",
                                  "2020-08-18","2020-12-07","2021-03-15","2021-06-07",
                                  "2021-09-08","2022-02-22","2022-05-02"),
              final.taxa.name = c("Microcystis ichthyoblabe","Microcystis ichthyoblabe",
                                  "Microcystis smithii","Microcystis smithii",
                                  "Microcystis sp.","Microcystis sp.",
                                  "Microcystis sp.","Microcystis sp.",
                                  "Microcystis sp.","Microcystis sp.",
                                  "Microcystis wesenbergii","Microcystis wesenbergii",
                                  "Microcystis wesenbergii","Microcystis wesenbergii",
                                  "Microcystis wesenbergii","Microcystis wesenbergii",
                                  "Microcystis wesenbergii",
                                  "Microcystis wesenbergii","Microcystis wesenbergii",
                                  "Microcystis wesenbergii"),
             species.cells.ml = c(1044,
                                  1290,200,10862,4500,37699,760,20617,20944,
                                  320,720,6684,17546,1440,4595,10862,
                                  47124,11488,2841,5640)
           )

Created on 2023-08-02 with reprex v2.0.2

Matthias · August 2, 2023, 7:53pm

Should the proportions be computed over all sampling points, and based on the counts in column "species.cells.ml"?

AlexisW · August 2, 2023, 8:37pm

Going by the counts in column species.cells.ml:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

df2 <- data.frame(...)
df2 |>
  arrange(sampling.date)
#>    sampling.date          final.taxa.name species.cells.ml
#> 1     2019-10-22 Microcystis ichthyoblabe             1044
#> 2     2019-10-22  Microcystis wesenbergii              720
#> 3     2020-02-11 Microcystis ichthyoblabe             1290
#> 4     2020-02-11      Microcystis smithii              200
#> 5     2020-02-11  Microcystis wesenbergii             6684
#> 6     2020-05-27          Microcystis sp.             4500
#> 7     2020-05-27  Microcystis wesenbergii            17546
#> 8     2020-08-18  Microcystis wesenbergii             1440
#> 9     2020-12-07      Microcystis smithii            10862
#> 10    2020-12-07          Microcystis sp.            37699
#> 11    2020-12-07  Microcystis wesenbergii             4595
#> 12    2021-03-15          Microcystis sp.              760
#> 13    2021-03-15  Microcystis wesenbergii            10862
#> 14    2021-06-07          Microcystis sp.            20617
#> 15    2021-06-07  Microcystis wesenbergii            47124
#> 16    2021-09-08  Microcystis wesenbergii            11488
#> 17    2022-02-22          Microcystis sp.            20944
#> 18    2022-02-22  Microcystis wesenbergii             2841
#> 19    2022-05-02          Microcystis sp.              320
#> 20    2022-05-02  Microcystis wesenbergii             5640

# Identify parent and species
df2_species <- df2 |>
  separate_wider_delim(final.taxa.name,
                       delim = " ",
                       names = c("taxa.parent", "taxa.species")) |>
  mutate(species.is.known = taxa.species != "sp.")
df2_species
#> # A tibble: 20 × 5
#>    sampling.date taxa.parent taxa.species species.cells.ml species.is.known
#>    <chr>         <chr>       <chr>                   <dbl> <lgl>           
#>  1 2019-10-22    Microcystis ichthyoblabe             1044 TRUE            
#>  2 2020-02-11    Microcystis ichthyoblabe             1290 TRUE            
#>  3 2020-02-11    Microcystis smithii                   200 TRUE            
#>  4 2020-12-07    Microcystis smithii                 10862 TRUE            
#>  5 2020-05-27    Microcystis sp.                      4500 FALSE           
#>  6 2020-12-07    Microcystis sp.                     37699 FALSE           
#>  7 2021-03-15    Microcystis sp.                       760 FALSE           
#>  8 2021-06-07    Microcystis sp.                     20617 FALSE           
#>  9 2022-02-22    Microcystis sp.                     20944 FALSE           
#> 10 2022-05-02    Microcystis sp.                       320 FALSE           
#> 11 2019-10-22    Microcystis wesenbergii               720 TRUE            
#> 12 2020-02-11    Microcystis wesenbergii              6684 TRUE            
#> 13 2020-05-27    Microcystis wesenbergii             17546 TRUE            
#> 14 2020-08-18    Microcystis wesenbergii              1440 TRUE            
#> 15 2020-12-07    Microcystis wesenbergii              4595 TRUE            
#> 16 2021-03-15    Microcystis wesenbergii             10862 TRUE            
#> 17 2021-06-07    Microcystis wesenbergii             47124 TRUE            
#> 18 2021-09-08    Microcystis wesenbergii             11488 TRUE            
#> 19 2022-02-22    Microcystis wesenbergii              2841 TRUE            
#> 20 2022-05-02    Microcystis wesenbergii              5640 TRUE

# find relative proportion of species in each parent taxa
df2_prop_per_spec <- df2_species |>
  filter(species.is.known) |>
  group_by(sampling.date, taxa.parent, taxa.species) |>
  summarize(species.count = sum(species.cells.ml),
            .groups = "drop") |>
  group_by(sampling.date, taxa.parent) |>
  reframe(taxa.species = taxa.species,
          proportion.species = species.count/sum(species.count))
df2_prop_per_spec
#> # A tibble: 14 × 4
#>    sampling.date taxa.parent taxa.species proportion.species
#>    <chr>         <chr>       <chr>                     <dbl>
#>  1 2019-10-22    Microcystis ichthyoblabe             0.592 
#>  2 2019-10-22    Microcystis wesenbergii              0.408 
#>  3 2020-02-11    Microcystis ichthyoblabe             0.158 
#>  4 2020-02-11    Microcystis smithii                  0.0245
#>  5 2020-02-11    Microcystis wesenbergii              0.818 
#>  6 2020-05-27    Microcystis wesenbergii              1     
#>  7 2020-08-18    Microcystis wesenbergii              1     
#>  8 2020-12-07    Microcystis smithii                  0.703 
#>  9 2020-12-07    Microcystis wesenbergii              0.297 
#> 10 2021-03-15    Microcystis wesenbergii              1     
#> 11 2021-06-07    Microcystis wesenbergii              1     
#> 12 2021-09-08    Microcystis wesenbergii              1     
#> 13 2022-02-22    Microcystis wesenbergii              1     
#> 14 2022-05-02    Microcystis wesenbergii              1


# add back to the df taking only the unidentified species
df2_imputation <- df2_species |>
  filter( ! species.is.known) |>
  select(-taxa.species) |>
  left_join(df2_prop_per_spec,
            by = c("sampling.date", "taxa.parent"),
            relationship = "many-to-many") |>
  mutate(count.imputed = species.cells.ml * proportion.species)
df2_imputation
#> # A tibble: 7 × 7
#>   sampling.date taxa.parent species.cells.ml species.is.known taxa.species
#>   <chr>         <chr>                  <dbl> <lgl>            <chr>       
#> 1 2020-05-27    Microcystis             4500 FALSE            wesenbergii 
#> 2 2020-12-07    Microcystis            37699 FALSE            smithii     
#> 3 2020-12-07    Microcystis            37699 FALSE            wesenbergii 
#> 4 2021-03-15    Microcystis              760 FALSE            wesenbergii 
#> 5 2021-06-07    Microcystis            20617 FALSE            wesenbergii 
#> 6 2022-02-22    Microcystis            20944 FALSE            wesenbergii 
#> 7 2022-05-02    Microcystis              320 FALSE            wesenbergii 
#> # ℹ 2 more variables: proportion.species <dbl>, count.imputed <dbl>

# Re-assemble everything and take sum
df2_species |>
  filter(species.is.known) |>
  select(-species.is.known) |>
  bind_rows(df2_imputation |>
              select(sampling.date, taxa.parent, taxa.species,
                     species.cells.ml = count.imputed)) |>
  group_by(sampling.date, taxa.parent, taxa.species) |>
  summarize(species.cells.ml = sum(species.cells.ml),
            .groups = "drop")
#> # A tibble: 14 × 4
#>    sampling.date taxa.parent taxa.species species.cells.ml
#>    <chr>         <chr>       <chr>                   <dbl>
#>  1 2019-10-22    Microcystis ichthyoblabe            1044 
#>  2 2019-10-22    Microcystis wesenbergii              720 
#>  3 2020-02-11    Microcystis ichthyoblabe            1290 
#>  4 2020-02-11    Microcystis smithii                  200 
#>  5 2020-02-11    Microcystis wesenbergii             6684 
#>  6 2020-05-27    Microcystis wesenbergii            22046 
#>  7 2020-08-18    Microcystis wesenbergii             1440 
#>  8 2020-12-07    Microcystis smithii                37354.
#>  9 2020-12-07    Microcystis wesenbergii            15802.
#> 10 2021-03-15    Microcystis wesenbergii            11622 
#> 11 2021-06-07    Microcystis wesenbergii            67741 
#> 12 2021-09-08    Microcystis wesenbergii            11488 
#> 13 2022-02-22    Microcystis wesenbergii            23785 
#> 14 2022-05-02    Microcystis wesenbergii             5960

Created on 2023-08-02 with reprex v2.0.2

Craigdux · August 3, 2023, 1:49am

@AlexisW --thank you! This is an amazing script!

I was able to get it to run on a larger subset of my data.

Problems:

Some species have four identifiers (e.g., "Aulacoseira granulata v. angustissima"). I get the error below. (I tried adding two more names, but that also threw an error, as it then expected four names.)

Error in separate_wider_delim():
! Expected 2 pieces in each element of final.taxa.name.
! 27 values were too long.
Use too_many = "debug" to diagnose the problem.
Use too_many = "drop"/"merge" to silence this message.

The final dataframe did not drop the "Genus sp." rows (although it properly coded "species.is.known" as "FALSE").

Any ideas?

Thanks

Craigdux · August 3, 2023, 12:48pm

@Matthias yes, it is based on the column "species.cells.ml"

Thank you!

Matthias · August 3, 2023, 12:53pm

Per day or over all sampling points?
So for 2020-05-27 it's 100% M. wesenbergii?

5	2020-05-27	Microcystis sp.	4500
13	2020-05-27	Microcystis wesenbergii	17546

Craigdux · August 3, 2023, 1:09pm

Sorry, I meant to say "overall": compute the proportions over all sampling points, then use them to distribute the Microcystis sp. counts.

(It is a little more complicated, as I have multiple lakes, but to keep it simple, this is just from one lake).

thanks

AlexisW · August 3, 2023, 1:22pm

That depends on how you will want to interpret it. Intuitively, I think all 3 names should be kept together and treated as a single species, so use too_many = "merge" (so "Aulacoseira" is the parent taxon and "granulata v. angustissima" the species). But that's a choice for you to make.

In principle you can change what I put in the group_by() commands to ensure you get the right proportions.
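As a small illustration of the too_many = "merge" option, here is a minimal sketch on three names taken from this thread (the object names taxa and split_taxa are only for the example). It assumes tidyr 1.3.0 or later, where separate_wider_delim() is available.

```r
library(tidyr)

# Three taxon names from the thread; `taxa` and `split_taxa` are example names
taxa <- data.frame(
  final.taxa.name = c("Microcystis sp.",
                      "Microcystis wesenbergii",
                      "Aulacoseira granulata v. angustissima")
)

# Split at the first space only: any extra pieces are merged into taxa.species
split_taxa <- taxa |>
  separate_wider_delim(final.taxa.name,
                       delim = " ",
                       names = c("taxa.parent", "taxa.species"),
                       too_many = "merge")
split_taxa
```

With this choice the variety stays attached to the species name, so "granulata v. angustissima" is treated as one child of "Aulacoseira".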

Matthias · August 3, 2023, 1:50pm

Based on the awesome solution proposed by AlexisW, this could be done as follows, with the proportions computed over all sampling dates:

# Identify parent and species
df2_species <- df2 |>
  # split species names, basically at the first space
  separate_wider_delim(final.taxa.name,
                       delim = " ",
                       names = c("taxa.parent", "taxa.species"),
                       too_many = "merge") |>
  mutate(species.is.known = taxa.species != "sp.")
df2_species


# find relative proportion of species in each parent taxa
df2_prop_per_spec <- df2_species |>
  filter(species.is.known) |>
  group_by(taxa.parent, taxa.species) |>
  # "drop_last" keeps the grouping by taxa.parent, so proportions sum to 1 within each genus
  summarize(species.count = sum(species.cells.ml), .groups = "drop_last") |>
  mutate(proportion.species = species.count/sum(species.count))
df2_prop_per_spec

# add back to the df taking only the unidentified species
df2_imputation <- df2_species |>
  filter(!species.is.known) |>
  select(-taxa.species) |>
  left_join(df2_prop_per_spec,
            by = c("taxa.parent"),
            relationship = "many-to-many") |>
  mutate(count.imputed = round(species.cells.ml * proportion.species,0))
df2_imputation


# Re-assemble everything and take sum
df2_final = df2_species |>
  filter(species.is.known) |>
  select(-species.is.known) |>
  bind_rows(df2_imputation |>
              select(sampling.date, taxa.parent, taxa.species,
                     species.cells.ml = count.imputed)) |>
  group_by(sampling.date, taxa.parent, taxa.species) |>
  summarize(species.cells.ml = sum(species.cells.ml),
            .groups = "drop")
df2_final

Note how the results for 2020-05-27 now include all the possible species, with the largest share added to M. wesenbergii:

   sampling.date taxa.parent taxa.species species.cells.ml
 6 2020-05-27    Microcystis ichthyoblabe               86
 7 2020-05-27    Microcystis smithii                   407
 8 2020-05-27    Microcystis wesenbergii             21553
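The three values above can be reproduced by hand, which also makes it easy to check whether rounding changes the total. This is only a sketch in base R: the totals are summed from the df2 rows posted at the top of the thread, and the object names (known, prop, imputed) are made up for the example.

```r
# Overall cell totals per identified species, summed from the df2 rows above
known <- c(ichthyoblabe = 1044 + 1290,
           smithii      = 200 + 10862,
           wesenbergii  = 720 + 6684 + 17546 + 1440 + 4595 +
                          10862 + 47124 + 11488 + 2841 + 5640)

# Relative proportions over all sampling dates
prop <- known / sum(known)

# Distribute the 4500 cells/ml of "Microcystis sp." recorded on 2020-05-27
sp_count <- 4500
imputed <- round(sp_count * prop)
imputed

# M. wesenbergii was also counted directly on that date (17546 cells/ml)
imputed[["wesenbergii"]] + 17546

# Difference between distributed and original cells; non-zero would mean
# rounding has added or removed cells
sum(imputed) - sp_count
```

Because each share is rounded separately, the distributed counts are not guaranteed to add up to the original "sp." count in general, so a check like the last line may be worth running per sampling event.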

Craigdux · August 3, 2023, 2:19pm

@Matthias --This worked great!

However, to add a further complication: some of the time, the algae are identified only to genus (e.g., "Genus sp."). We do not want to delete these.

How would we preserve these? Would you modify the "imputed" below (maybe look for "NA")?

thanks again.

df2_final = df2_species |>
  filter(species.is.known) |>
  select(-species.is.known) |>
  bind_rows(df2_imputation |>
              select(sampling.date, taxa.parent, taxa.species,
                     species.cells.ml = count.imputed)) |>
  group_by(sampling.date, taxa.parent, taxa.species) |>
  summarize(species.cells.ml = sum(species.cells.ml),
            .groups = "drop")
df2_final

Matthias · August 4, 2023, 11:55am

So you mean that any genus where you only ever found "... sp." should be kept, and whenever we have multiple species, the "... sp." counts should be replaced by their distribution among those species?

The best place to do this is the "species.is.known" step, as the results are filtered based on it. So we need to count the number of distinct species per genus, and kick out "sp." only when there is more than one.

# Identify parent and species
df2_species <- df2 |>
  # split species names, basically at the first space
  separate_wider_delim(final.taxa.name,
                       delim = " ",
                       names = c("taxa.parent", "taxa.species"), too_many = "merge") |>
  group_by(taxa.parent) |> 
  mutate(taxa.count = n_distinct(taxa.species), # count the different species per parent
         species.is.known = case_when(
                  taxa.count == 1 ~ TRUE,        # keep the ones with only 1 species
                  taxa.species != "sp." ~ TRUE,  # keep all others that are not "sp." 
                  TRUE ~ FALSE)) |>
  ungroup()                                # drop the grouping so later steps start ungrouped
df2_species

# find relative proportion of species in each parent taxa
df2_prop_per_spec <- df2_species |>
  filter(species.is.known) |>
  group_by(taxa.parent, taxa.species) |> 
  summarize(species.count = sum(species.cells.ml), .groups = "drop_last") |>
  mutate(proportion.species = species.count/sum(species.count))
df2_prop_per_spec

# add back to the df taking only the unidentified species
df2_imputation <- df2_species |>
  filter(!species.is.known) |>
  select(-taxa.species, -taxa.count) |>
  left_join(df2_prop_per_spec,
            by = c("taxa.parent"),
            relationship = "many-to-many") |>
  mutate(count.imputed = round(species.cells.ml * proportion.species,0))
df2_imputation

The last part remains unchanged.
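If the full dataset covers several lakes, as mentioned earlier in the thread, the same idea should carry over by adding the lake identifier to every grouping and to the join key. The sketch below is untested against the real data: the column name lake and all values in toy are made up for illustration, the genus-only case handled above is left out for brevity, and it assumes dplyr 1.1.0 or later for the .by argument.

```r
library(dplyr)
library(tidyr)

# Toy data: the `lake` column and all counts are hypothetical
toy <- tibble(
  lake = c("A", "A", "A", "B", "B", "B"),
  sampling.date = c("2020-05-27", "2020-05-27", "2020-06-10",
                    "2020-05-27", "2020-05-27", "2020-06-10"),
  final.taxa.name = c("Microcystis smithii", "Microcystis sp.",
                      "Microcystis wesenbergii", "Microcystis wesenbergii",
                      "Microcystis sp.", "Microcystis wesenbergii"),
  species.cells.ml = c(100, 50, 300, 200, 40, 600)
)

toy_species <- toy |>
  separate_wider_delim(final.taxa.name, delim = " ",
                       names = c("taxa.parent", "taxa.species"),
                       too_many = "merge") |>
  mutate(species.is.known = taxa.species != "sp.")

# Proportions per lake and genus, pooled over all sampling dates
toy_prop <- toy_species |>
  filter(species.is.known) |>
  summarize(species.count = sum(species.cells.ml),
            .by = c(lake, taxa.parent, taxa.species)) |>
  mutate(proportion.species = species.count / sum(species.count),
         .by = c(lake, taxa.parent))

# Distribute the "sp." counts within each lake, then re-assemble and sum
toy_final <- toy_species |>
  filter(!species.is.known) |>
  select(-taxa.species, -species.is.known) |>
  left_join(toy_prop, by = c("lake", "taxa.parent"),
            relationship = "many-to-many") |>
  mutate(species.cells.ml = species.cells.ml * proportion.species) |>
  select(lake, sampling.date, taxa.parent, taxa.species, species.cells.ml) |>
  bind_rows(toy_species |>
              filter(species.is.known) |>
              select(-species.is.known)) |>
  summarize(species.cells.ml = sum(species.cells.ml),
            .by = c(lake, sampling.date, taxa.parent, taxa.species))
toy_final
```

Whether proportions should be pooled per lake or across all lakes is a judgment call about the data; dropping lake from the two .by arguments in toy_prop and from the join key would give the pooled version.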

Craigdux · August 11, 2023, 9:24pm

@Matthias today we ran your script, and with a little modification, got it to work!

This has resolved a huge uncertainty in our data.

Thank you!
