And I have a list in Excel of what I want them to change to. I basically want to replace the first identifier (i.e. A0A383WB61 in this example) with another string. I need to do over 2000 replacements, so typing them all out as individual gsub() calls would take too long. Is there a way to do these edits quickly with R code? I'm sure there is, but I'm completely new to R, so I don't know where to start.
A more concise solution probably exists, but the code below captures my approach. The general idea is to:
1. create a tibble of all of the headings in the .fasta file
2. walk through each heading to pinpoint which string needs replacement
3. join the crosswalk of old strings to new strings (this is likely the Excel list you mention)
4. do a string replace
These steps leave you with a column of the newly formed headings. As a final step, set the headings of your .fasta file to the new headings. In the example, I assigned the result to fasta_new so you can compare the two objects (notice the new object's headings contain new_string1 and new_string2).
library(tidyverse)
# this represents the .fasta file (only 2 columns for simplicity)
fasta = tibble(
  `tr|A0A383WB61|A0A383WB61_TETOB Chitin-binding type-2 domain-containing protein (Fragment) OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS14496 PE=4 SV=1` = c(1, 2),
  `heading2=AJRST15|type-2 binding` = c(3, 4)
)
# headings to change
headings = tibble(old = names(fasta))
# crosswalk of old strings to new strings
crosswalk = tibble(
  old_string = c('A0A383WB61', 'AJRST15'),
  new_string = c('new_string1', 'new_string2')
)
# function to check for each old_string in crosswalk:
# if the string is found in a heading, return it, otherwise NA
string_check = function(i) {
  headings %>%
    mutate(old_string = ifelse(str_detect(old, i), i, NA))
}
# walk through each of the old strings (i.e. crosswalk$old_string)
headings = lapply(crosswalk$old_string, string_check) %>%
  # stack into one large tibble
  bind_rows() %>%
  # remove NA results
  filter(!is.na(old_string)) %>%
  # join the crosswalk to bring in the replacement
  left_join(crosswalk) %>%
  # replace old_string in old with new_string
  mutate(new = str_replace(old, old_string, new_string)) %>%
  # keep old and new (if you want to compare)
  select(old, new)
#> Joining, by = "old_string"
# update the fasta file headings
fasta_new = fasta
names(fasta_new) = headings$new
names(fasta)
#> [1] "tr|A0A383WB61|A0A383WB61_TETOB Chitin-binding type-2 domain-containing protein (Fragment) OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS14496 PE=4 SV=1"
#> [2] "heading2=AJRST15|type-2 binding"
names(fasta_new)
#> [1] "tr|new_string1|A0A383WB61_TETOB Chitin-binding type-2 domain-containing protein (Fragment) OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS14496 PE=4 SV=1"
#> [2] "heading2=new_string2|type-2 binding"
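As an aside on the "more concise solution" mentioned above: stringr's str_replace_all() accepts a named vector of pattern = replacement pairs, so once the crosswalk is built, every replacement can happen in one vectorized call. A sketch using the same crosswalk tibble as in the example:

```r
library(tidyverse)

# build a named vector: names are the old strings, values are the new strings
replacements = setNames(crosswalk$new_string, crosswalk$old_string)

# apply every replacement across all headings in a single call
names(fasta_new) = str_replace_all(names(fasta), replacements)
```

Note that str_replace_all() treats the names as regular expressions, which is fine for plain alphanumeric IDs like these.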
So I think the problem with that is I would have to type in each of the 2000 strings, plus all 2000 replacements, right? In the crosswalk tibble? If I had a spreadsheet where one column is the old string and one is the new string, would there be a way to pull that in as a reference to do the substitutions? Kinda like doing a VLOOKUP in Excel.
I typed in values for the example, but you're exactly right. Reading in an Excel document would be much better. Below is how you can do it. Using the data from the example, I saved a .xlsx file to my desktop named crosswalk.xlsx.
Then, using the readxl package, I read it in by inserting the proper file path.
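For example (the file path below is a placeholder; substitute wherever you saved crosswalk.xlsx, and note this assumes the spreadsheet's column headers are old_string and new_string, matching the crosswalk used above):

```r
library(readxl)

# read the two-column spreadsheet into the crosswalk tibble
crosswalk = read_excel("C:/Users/your_name/Desktop/crosswalk.xlsx")
```

From there, the rest of the code runs unchanged, since crosswalk has the same columns as the hand-typed version.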