And I have a list in Excel of what I want them to change to. I basically want to replace the first identifier (i.e. A0A383WB61 in this example) with another string. I need to do over 2000 replacements, so typing them all out as individual gsub() calls would take too long. Is there a way to do these edits quickly with R code? I'm sure there is, but I'm completely new to R, so I don't know where to start.
A more concise solution probably exists, but the code below captures my approach. The general idea is to:
1. create a tibble of all of the headings in the .fasta file
2. walk through each heading to pinpoint which string needs replacement
3. join the crosswalk of old strings to new strings (this is likely the Excel list you mention)
4. do a string replace
These steps leave you with a column of the newly formed headings. As a final step, set the headings of your .fasta file to the new headings. In the example, I assigned the result to fasta_new so you can compare the two objects (notice the new object's headings contain new_string1 and new_string2).
library(tidyverse)
# this represents the .fasta file (only 2 columns for simplicity)
fasta = tibble(
  `tr|A0A383WB61|A0A383WB61_TETOB Chitin-binding type-2 domain-containing protein (Fragment) OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS14496 PE=4 SV=1` = c(1, 2),
  `heading2=AJRST15|type-2 binding` = c(3, 4)
)
# headings to change
headings = tibble(old = names(fasta))
# crosswalk of old strings to new strings
crosswalk = tibble(
  old_string = c('A0A383WB61', 'AJRST15'),
  new_string = c('new_string1', 'new_string2')
)
# function to check for each old_string in crosswalk:
# if the string is found in a heading, return it, otherwise NA
string_check = function(i) {
  headings %>%
    mutate(old_string = ifelse(str_detect(old, i), i, NA))
}
# walk through each of the old strings (i.e. crosswalk$old_string)
headings = lapply(crosswalk$old_string, string_check) %>%
  # stack into one large tibble
  bind_rows() %>%
  # remove NA results
  filter(!is.na(old_string)) %>%
  # join the crosswalk to bring in the replacement
  left_join(crosswalk) %>%
  # replace old_string in old with new_string
  mutate(new = str_replace(old, old_string, new_string)) %>%
  # keep old and new (if you want to compare)
  select(old, new)
#> Joining, by = "old_string"
# update the fasta file headings
fasta_new = fasta
names(fasta_new) = headings$new
names(fasta)
#> [1] "tr|A0A383WB61|A0A383WB61_TETOB Chitin-binding type-2 domain-containing protein (Fragment) OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS14496 PE=4 SV=1"
#> [2] "heading2=AJRST15|type-2 binding"
names(fasta_new)
#> [1] "tr|new_string1|A0A383WB61_TETOB Chitin-binding type-2 domain-containing protein (Fragment) OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS14496 PE=4 SV=1"
#> [2] "heading2=new_string2|type-2 binding"
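As an aside on the "more concise solution" mentioned above: stringr's str_replace_all() accepts a named vector of pattern = replacement pairs, so once the crosswalk is built, every replacement can happen in one vectorized call. A sketch using the same crosswalk tibble as in the example:

```r
library(tidyverse)

# build a named vector: names are the old strings, values are the new strings
replacements = setNames(crosswalk$new_string, crosswalk$old_string)

# apply every replacement across all headings in a single call
names(fasta_new) = str_replace_all(names(fasta), replacements)
```

Note that str_replace_all() treats the names as regular expressions, which is fine for plain alphanumeric IDs like these.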
So I think the problem with that is I would have to type in each of the 2000 strings, plus all 2000 replacements, right? In the crosswalk tibble? If I had a spreadsheet where one column is the old string and one is the new string, would there be a way to pull that in as a reference to do the substitutions? Kinda like doing a VLOOKUP in Excel.
I typed in values for the example, but you're exactly right. Reading in an Excel document would be much better. Below is how you can do it. Using the data from the example, I saved a .xlsx file to my desktop named crosswalk.xlsx.
Then, using the readxl package, I read it in by inserting the proper file path.
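For example (the file path below is a placeholder; substitute wherever you saved crosswalk.xlsx, and note this assumes the spreadsheet's column headers are old_string and new_string, matching the crosswalk used above):

```r
library(readxl)

# read the two-column spreadsheet into the crosswalk tibble
crosswalk = read_excel("C:/Users/your_name/Desktop/crosswalk.xlsx")
```

From there, the rest of the code runs unchanged, since crosswalk has the same columns as the hand-typed version.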