Hi @AsiyaV,
Welcome to the RStudio community!
Let's see if we can help you out here. I think the first thing we'd like to see is a reproducible example (or reprex) as described in this article.
Based on my very limited knowledge of genome sequences, I suspect that what you're looking at is a long character string of some sort, and that you're asking, "How do you identify certain sequences of characters within a value in a dataframe?" If your goal is a De Bruijn graph, I'm not sure I can help you here. But if it's a simpler use case, where you just want to pick out the rows containing a specific set of letters, you could try something like this:
library(tidyverse)
# An example dataset
genome_sequence <- tribble(
  ~sequence,
  "AGTCGTAGATGCTT",
  "AGTCGTGCTGAGAT",
  "AGAGATCGTGCTGA"
)
# create a new dataframe containing the sequences that match "GAGA"
specific_sequence <- genome_sequence %>%
  filter(grepl("GAGA", sequence)) # grepl() is useful for searching character variables in R
# Or write a function that you can pass any sequence to filter by, and output
# a dataframe that matches the input sub-sequence
sequence_checker <- function(sub_sequence) {
  new_sequence <- genome_sequence %>%
    filter(grepl(sub_sequence, sequence))
  return(new_sequence)
}
# And then call that function with your input sequence
GAGA_genomes <- sequence_checker("GAGA")
TT_genomes <- sequence_checker("TT")
# Check how many sequences matched that sub-sequence
nrow(GAGA_genomes)
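If it helps, here is a slightly more general sketch of the same idea. It makes two assumptions that go beyond the code above: that you'd rather pass the dataframe in as an argument than rely on a global `genome_sequence` object, and that you want the sub-sequence matched literally rather than as a regular expression (hence `fixed()` from stringr). The helper names `filter_by_subsequence()` and `count_subsequence()` are made up for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

# Same example dataset as above
genome_sequence <- tribble(
  ~sequence,
  "AGTCGTAGATGCTT",
  "AGTCGTGCTGAGAT",
  "AGAGATCGTGCTGA"
)

# Hypothetical helper: keep only the rows whose sequence contains the
# sub-sequence, treating it as literal text rather than a regex
filter_by_subsequence <- function(data, sub_sequence) {
  data %>%
    filter(str_detect(sequence, fixed(sub_sequence)))
}

# Hypothetical helper: add a column with the number of non-overlapping
# occurrences of the sub-sequence in each sequence
count_subsequence <- function(data, sub_sequence) {
  data %>%
    mutate(n_matches = str_count(sequence, fixed(sub_sequence)))
}

gaga_genomes <- filter_by_subsequence(genome_sequence, "GAGA")
ga_counts <- count_subsequence(genome_sequence, "GA")
```

One thing to keep in mind: `str_count()` counts non-overlapping matches, so a pattern like "GAGA" in "GAGAGA" would be counted once, not twice. If overlapping matches matter for your analysis, you'd need a different approach.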