I have a genome sequence and need to count the k-mers in it.
For example, how many times does each sequence of 4 letters occur? But I need a solution that doesn't use ready-made functions like kcount from ape library.
Let's see if we can help you out here. I think the first thing we'd like to see is a reproducible example (or reprex) as described in this article.
I suspect, based on my very limited knowledge of genome sequences, that what you're looking at is a long character string of some sort, and that you're asking "How do you identify certain sequences of characters within a value from a data frame?" If your goal is a De Bruijn graph, I'm not sure I can help you here. But if it's a simpler use case, where you just want to pick out a specific set of letters, you could try something like this:
library(tidyverse)

# An example dataset
genome_sequence <- tribble(
  ~sequence,
  "AGTCGTAGATGCTT",
  "AGTCGTGCTGAGAT",
  "AGAGATCGTGCTGA"
)

# Create a new dataframe containing the sequences that match "GAGA"
specific_sequence <- genome_sequence %>%
  filter(grepl("GAGA", sequence)) # grepl() is useful for searching character variables in R

# Or write a function that you can pass any sequence to filter by, and output
# a dataframe that matches the input sub-sequence
sequence_checker <- function(sub_sequence) {
  new_sequence <- genome_sequence %>%
    filter(grepl(sub_sequence, sequence))
  return(new_sequence)
}

# And then call that function with your input sequence
GAGA_genomes <- sequence_checker("GAGA")
TT_genomes <- sequence_checker("TT")

# Check how many sequences matched that sub-sequence
nrow(GAGA_genomes)
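Note that the filter above only tells you which rows contain the sub-sequence, not how many times it occurs. If what you want is a count of every k-mer, here is a minimal base R sketch that slides a window of width k along the string and tabulates the pieces. The helper name count_kmers is just something I made up for illustration, and I'm assuming overlapping k-mers on a single strand and one sequence per call:

```r
# Count every overlapping k-mer in a single sequence using only base R.
# count_kmers() is a hypothetical helper, not a function from any package.
count_kmers <- function(sequence, k = 4) {
  n <- nchar(sequence)
  # A sequence shorter than k contains no k-mers at all
  if (n < k) {
    return(table(character(0)))
  }
  # Start position of each window: 1, 2, ..., n - k + 1
  starts <- seq_len(n - k + 1)
  # substring() is vectorised, so this extracts all windows in one call
  kmers <- substring(sequence, starts, starts + k - 1)
  # table() tallies how often each distinct k-mer appears
  table(kmers)
}

counts <- count_kmers("AGTCGTAGATGCTT", k = 4)
counts            # named counts for every 4-mer that occurs
counts[["AGTC"]]  # count for one specific 4-mer
```

To run it over a whole column you could use something like lapply(genome_sequence$sequence, count_kmers). For a real genome this will be slow and memory-hungry compared with a dedicated package, so treat it as a starting point rather than a finished solution.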
It would be interesting to know why you need this; could you motivate your request with some explanation? In general, programmers tend to optimise their time by incorporating the work of others rather than reinventing the wheel.