Calculate the number of k-mers

jonesey441 · October 6, 2022, 5:33am

Welcome to the RStudio community!

Let's see if we can help you out here. I think the first thing we'd like to see is a reproducible example (or reprex) as described in this article.

Based on my very limited knowledge of genome sequences, I suspect you're working with long character strings, and that your question is really: "How do I identify certain sub-sequences within a character column of a dataframe?" If your goal is a De Bruijn graph, I'm not sure I can help you here. But if it's the simpler case of picking out a specific set of letters, you could try something like this:

library(tidyverse)

# An example dataset
genome_sequence <- tribble(
  ~sequence,
  "AGTCGTAGATGCTT",
  "AGTCGTGCTGAGAT",
  "AGAGATCGTGCTGA"
)

# Create a new dataframe containing the sequences that match "GAGA".
# grepl() tests each string for a match; fixed = TRUE treats the pattern
# as literal text rather than a regular expression.
specific_sequence <- genome_sequence %>% 
  filter(grepl("GAGA", sequence, fixed = TRUE))

# Or write a function that takes any sub-sequence and returns a dataframe
# of the rows whose sequence contains it. The data argument defaults to
# the example dataset above, so the function no longer depends silently
# on a global variable.
sequence_checker <- function(sub_sequence, data = genome_sequence) {
  data %>% 
    filter(grepl(sub_sequence, sequence, fixed = TRUE))
}

# And then call that function with your input sequence
GAGA_genomes <- sequence_checker("GAGA")
TT_genomes <- sequence_checker("TT")

# Check how many sequences contain that sub-sequence
# (this counts matching rows, not the total number of occurrences)
nrow(GAGA_genomes)
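
If what you're after is the count of every k-mer rather than a filter on one sub-sequence, here is a minimal base R sketch. The function name `count_kmers` and the choice of `k = 2` are my own for illustration, and it assumes each sequence is a single plain character string. It slides a window of width `k` along the string and tabulates the overlapping pieces:

```r
# Count every overlapping k-mer in a single sequence (base R only).
# count_kmers is a hypothetical helper name, not from any package.
count_kmers <- function(sequence, k) {
  n <- nchar(sequence)
  # No k-mers exist if k is not between 1 and the sequence length
  if (k < 1 || k > n) {
    return(table(character(0)))
  }
  # Start position of each window of width k
  starts <- seq_len(n - k + 1)
  # substring() is vectorised, so this extracts all windows at once
  kmers <- substring(sequence, starts, starts + k - 1)
  table(kmers)
}

# Count the 2-mers in the first example sequence
kmer_counts <- count_kmers("AGTCGTAGATGCTT", k = 2)
sort(kmer_counts, decreasing = TRUE)
```

To run it over a whole column you could try something like `lapply(genome_sequence$sequence, count_kmers, k = 2)`. For real genome-sized data a dedicated package would probably be a better fit than this sketch.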