em_y
March 31, 2023, 12:48pm
1
Hi,
I have a large dataset containing different fungi species, and one column in each row describes the taxonomy (kingdom, phylum, class, order, family, genus, species). I would like to create a new column in the dataset that includes only the "species" name, not the other information from the taxonomy column. How would I go about isolating this information, given that every species name occurs after s__ in the taxonomy column and the names have different character lengths? I have attempted to use the mutate function with str_extract, substr, and start. ITS_counts is the dataset, taxonomy is the column within the dataset I'm trying to use, and s__ is the marker in taxonomy after which I would like to extract the species name on each row. The code I have tried to use is...
mutate("species" = str_extract(ITS_counts$taxonomy, substr(start=".*s__", 1000, stop = NULL), group = NULL))
with errors...
Error in substr(start = ".*s__", 1000, stop = NULL) :
invalid substring arguments
In addition: Warning message:
In substr(start = ".*s__", 1000, stop = NULL) : NAs introduced by coercion
Thank you.
FJCC
March 31, 2023, 1:44pm
2
Please post the output of
dput(head(ITS_counts$taxonomy, 20))
That will allow us to work with your data.
em_y
March 31, 2023, 1:47pm
3
the output is...
c("k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Capnodiales_fam_Incertae_sedis;g__Vermiconia;s__Vermiconia_calcicola",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_exasperatum",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_halotolerans",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__Mycosphaerella;s__Mycosphaerella_ulmi",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__unidentified;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__unidentified;g__unidentified;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__unidentified;g__unidentified;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideales;f__Dothioraceae;g__Aureobasidium;s__Aureobasidium_pullulans",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Biatriospora;s__Biatriospora_mackinnonii",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Leptospora;s__Leptospora_rubella",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Monodictys;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Septoriella;s__Septoriella_hirta",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Zymoseptoria;s__Zymoseptoria_halophila",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetidae_ord_Incertae_sedis;f__Eremomycetaceae;g__Arthrographis;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Biatriosporaceae;g__Nigrograna;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Corynesporascaceae;g__Corynespora;s__Corynespora_citricola",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Cucurbitariaceae;g__Pyrenochaetopsis;s__Pyrenochaetopsis_leptospora",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Dacampiaceae;g__Teichospora;s__Teichospora_rubriostiolata",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Didymellaceae;g__Neoascochyta;s__Neoascochyta_graminicola"
)
FJCC
March 31, 2023, 2:30pm
4
The regular expression in str_extract uses a lookbehind, (?<=;s__), to find the text ";s__" and then extracts everything after it up to the end of the string.
library(stringr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
DF <- data.frame(taxonomy = c("k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Capnodiales_fam_Incertae_sedis;g__Vermiconia;s__Vermiconia_calcicola",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_exasperatum",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_halotolerans",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__Mycosphaerella;s__Mycosphaerella_ulmi",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__unidentified;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__unidentified;g__unidentified;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__unidentified;g__unidentified;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideales;f__Dothioraceae;g__Aureobasidium;s__Aureobasidium_pullulans",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Biatriospora;s__Biatriospora_mackinnonii",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Leptospora;s__Leptospora_rubella",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Monodictys;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Septoriella;s__Septoriella_hirta",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Zymoseptoria;s__Zymoseptoria_halophila",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetidae_ord_Incertae_sedis;f__Eremomycetaceae;g__Arthrographis;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Biatriosporaceae;g__Nigrograna;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Corynesporascaceae;g__Corynespora;s__Corynespora_citricola",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Cucurbitariaceae;g__Pyrenochaetopsis;s__Pyrenochaetopsis_leptospora",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Dacampiaceae;g__Teichospora;s__Teichospora_rubriostiolata",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Didymellaceae;g__Neoascochyta;s__Neoascochyta_graminicola"
))
DF <- DF |> mutate(Species = str_extract(taxonomy, "(?<=;s__).+$"))
DF$Species
#> [1] "Diplodia_subglobosa" "Vermiconia_calcicola"
#> [3] "Cladosporium_exasperatum" "Cladosporium_halotolerans"
#> [5] "Mycosphaerella_ulmi" "unidentified"
#> [7] "unidentified" "unidentified"
#> [9] "Aureobasidium_pullulans" "Biatriospora_mackinnonii"
#> [11] "Leptospora_rubella" "unidentified"
#> [13] "Septoriella_hirta" "Zymoseptoria_halophila"
#> [15] "unidentified" "unidentified"
#> [17] "Corynespora_citricola" "Pyrenochaetopsis_leptospora"
#> [19] "Teichospora_rubriostiolata" "Neoascochyta_graminicola"
Created on 2023-03-31 with reprex v2.0.2
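For anyone who wants to see what the lookbehind is doing without stringr, here is a minimal base R sketch of the same idea. The three taxonomy strings are shortened, hypothetical stand-ins for the real data, and the third one deliberately has no ";s__" field. Like str_extract, this version leaves NA where nothing matches, whereas the simpler sub() approach returns the original string unchanged in that case.

```r
# Hypothetical, shortened stand-ins for the taxonomy strings above
taxonomy <- c(
  "k__Fungi;p__Ascomycota;g__Diplodia;s__Diplodia_subglobosa",
  "k__Fungi;p__Ascomycota;g__unidentified;s__unidentified",
  "k__Fungi;p__Ascomycota"  # no species field at all
)

# (?<=;s__) is a lookbehind, so base R needs perl = TRUE to understand it.
# regexpr() returns the match position, or -1 where there is no match.
m   <- regexpr("(?<=;s__).+$", taxonomy, perl = TRUE)
hit <- as.integer(m) > 0

# regmatches() returns only the matched strings, so fill the rest with NA
species <- rep(NA_character_, length(taxonomy))
species[hit] <- regmatches(taxonomy, m)

# Alternative: delete everything up to and including ";s__".
# Strings without ";s__" come back unchanged instead of as NA.
stripped <- sub("^.*;s__", "", taxonomy)
```

If the real column could contain rows without a species field, checking sum(is.na(species)) after the extraction is a quick way to find out.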
em_y
March 31, 2023, 2:42pm
5
How would I do this for all 939 rows and make it into a new column within ITS_counts?
FJCC
March 31, 2023, 2:47pm
6
The data frame DF is just a stand-in for your ITS_counts. Your code would be
ITS_counts <- ITS_counts |> mutate(Species = str_extract(taxonomy, "(?<=;s__).+$"))
em_y
March 31, 2023, 2:51pm
7
That has worked!! Thank you so much for your help!
em_y
March 31, 2023, 4:42pm
8
Following on from this, I would like to create fasta files of certain species with the sequences in ITS_counts. I have been able to do this; however, the program I use to align the sequences requires every species to have a unique name, so the multiple "unidentified" species cause an issue. How would I remove all unidentified species from this code...
write.fasta(as.list(seqs2), as.character(ITS_counts2$"Species"), file.out="seqs2fasta")
The output is a file called seqs2fasta containing the species and sequences, but it includes many unidentified species that I would like to leave out of the output file.
Thanks.
FJCC
March 31, 2023, 5:18pm
9
You can use the filter() function from dplyr to remove rows where the Species column is "unidentified".
ITS_filtered <- ITS_counts |> filter(Species != "unidentified")
Then use ITS_filtered to write your fasta file.
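One thing to watch: the sequences and the names passed to write.fasta() have to be subset together, otherwise they will no longer line up. Below is a minimal base R sketch of that step. The toy data frame and its sequence column are hypothetical stand-ins, since the thread does not show how seqs2 was built, and the final call assumes write.fasta() comes from the seqinr package, so it is left commented out.

```r
# Hypothetical stand-in for ITS_counts2; the 'sequence' column is an assumption
ITS_counts2 <- data.frame(
  Species  = c("Diplodia_subglobosa", "unidentified",
               "Leptospora_rubella", "unidentified"),
  sequence = c("ACGT", "GGCC", "TTAA", "CCGG"),
  stringsAsFactors = FALSE
)
seqs2 <- ITS_counts2$sequence

# One logical index, applied to both the names and the sequences.
# The is.na() check also drops rows where no species was extracted.
keep <- !is.na(ITS_counts2$Species) & ITS_counts2$Species != "unidentified"
seqs_kept  <- seqs2[keep]
names_kept <- ITS_counts2$Species[keep]

# If identified species can also repeat, make.unique() appends _1, _2, ...
# to the repeats so that every name is distinct.
names_kept <- make.unique(names_kept, sep = "_")

# Assuming write.fasta() is seqinr's, the call would then look like:
# seqinr::write.fasta(as.list(seqs_kept), names_kept, file.out = "seqs2fasta")
```

If you would rather keep the unidentified rows, calling make.unique() on the full Species column without filtering is another way to satisfy the unique-name requirement.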