get sequences from available genome using gff file

MN_AGRICO · September 29, 2023, 9:14am

Hello,

I would like to use Rstudio to retrieve specific sequences by their known positions from a available genome sequence.

I have the complete genome sequence downloaded available in a fasta format, which contains the sequences of all 12 chromosomes. Below you see a representation of this file:

>ST4.03ch01
CCATTTTAAAGGTCAAATGTGACCCAGAGCAGGCAAAACCCAAATTTTATCGATTTTCGTGTGCAATAGT
CTATGGAGTTTTTGGTGATCTGGAATTCCGACATAAGTTATGCTAAAAAATTTTGTGTACGTCTGTTAAG
ACCTTAGCTATGAAGCCAGTTAGCCCTCACGGCCAAAACATACCATTTTGAAGGTCAAATATGCCCCGAA
TCTGGTAAACCCCCCAATTTGCCGATTATCATGTGCTATAGTCCATGGACTTTTTGGTGATCTGGAATTT
CGACATACTTTTTGCCAAAAATTTTCGTACACTTCCGTTAAAACCTTAGCTATGGATCCAGTTAGACTTC
(etc...)
>ST4.03ch02
ACCAACAATTTACCATAATTAAGGCATAGATAATGATGCAATTCAATAACCGCAAAGCAATTTAGACATA
TTCAAATCACAAGCTTCCATAATCAGCAAACCCACTTCATAAAGACACAATTTAGGAATTGAAAGAAATC
ATGGGTTCATAGAGTATTTTCATCATAAATCATTAATTAAACACTAATACAATCATAGTATGACTTTAGT
AACCATTTGAAATCATTTGGAGAAAGAACCCATGAGATTGAGTAAGACCCTAGGTTTTTCATGAACTTGA
AAACTTTGAAAACTTCCTTGAAATTGACTTTAGGGGTGAAAGCTACCCCTAGATGAAGGATCACCATACC
TTTGCTAAGATTTCCCAAGAAATTTGATGAAGAAACGCCTTGAGCTTCAATGGATCCTTTCTTCTTCTTC
(etc...)

I have a tab-delimited GFF3 file with the start and end coordinates of all the sequences I would like to retrieve , see the top of this file below.

|ST4.03ch01|.|PCR_product|5105452|5105605|.|+|.|Name=ARM01-004;Note=validated marker|
|ST4.03ch01|.|PCR_product|5319044|5319174|.|+|.|Name=ARM01-005;Note=validated marker|
|ST4.03ch01|.|PCR_product|5319162|5319313|.|+|.|Name=ARM01-006;Note=validated marker|
|ST4.03ch01|.|PCR_product|5319288|5319529|.|+|.|Name=ARM01-007;Note=validated marker|
|ST4.03ch01|.|PCR_product|5319526|5319779|.|+|.|Name=ARM01-008;Note=validated marker|

Column 1 corresponds to the chromosome numbers of the fasta genome file.
Column 4&5 correspond to the start and end position of the desired sequence.
Column 9 is filled with a name and a note, separated by semicolons according to GFF3 specifications.
(empty columns are filled with "." according to GFF3 specifications.)

I know there are several bio-packages like ape or biostrings, but I have not figured out yet how to make this work.

My desired output would look like this (name and sequence):

>ARM01-004
CCATTTTAAAGGTCAAATGTGACCCAGAGCAGGCAAAACCCAAATTTTATCGATTTTCGTGTGCAATAGT
CTATGGAGTTTTTGGTGATCTGGAATTCCGACATAAGTTATGCTAAAAAATTTTGTGTACGTCTGTTAAG
ACCTTAGCTATGAAGCCAGTTAGCCCTCACGGCCAAAACATACCA

>ARM01-005
TCTGGTAAACCCCCCAATTTGCCGATTATCATGTGCTATAGTCCATGGACTTTTTGGTGATCTGGAATTT
CGACATACTTTTTGCCAAAAATTTTCGTACA

>ARM01-006
CCATTTTAAAGGTCAAATGTGACCCAGAGCAGGCAAAACCCAAATTTTATCGATTTTCGTGTGCAATAGT
CTATGGAGTTTTTGGTGATCTGGAATTCCGACATAAGTTATGCTAAAAAATTTTGTG

etc...

Note that the output sequences shown above are dummies, as for example.

Thanks for any suggestions.

technocrat · September 29, 2023, 9:28pm

Having trouble with the GFF3. Could you provide a reprex (see the FAQ) using the example fasta file and an appropriate GFF3?

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# BiocManager::install("Biostrings")
library(ape)
library(Biostrings)
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#>     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#>     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
#>     table, tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Warning: package 'S4Vectors' was built under R version 4.3.1
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> Loading required package: IRanges
#> Warning: package 'IRanges' was built under R version 4.3.1
#> Loading required package: XVector
#> Loading required package: GenomeInfoDb
#> Warning: package 'GenomeInfoDb' was built under R version 4.3.1
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:ape':
#> 
#>     complement
#> The following object is masked from 'package:base':
#> 
#>     strsplit

## Read a non-compressed FASTA files:
filepath1 <- system.file("extdata", "someORF.fa", package="Biostrings")
fasta.seqlengths(filepath1, seqtype="DNA")
#> YAL001C TFC3 SGDID:S0000001, Chr I from 152168-146596, reverse complement, Verified ORF 
#>                                                                                    5573 
#>                     YAL002W VPS8 SGDID:S0000002, Chr I from 142709-148533, Verified ORF 
#>
g <- read.gff("/Users/ro/projects/demo/examp.gff3", na.strings = c(".", "?"), GFF3 = TRUE)
#> Error in scan(file, w, sep = "\t", quote = "", quiet = TRUE, na.strings = na.strings,  : 
  scan() expected 'an integer', got '|ST4.03ch01|.|PCR_product|5319288|5319529|.|+|.|Name=ARM01-007;Note=validatedmarker|'

MN_AGRICO · October 2, 2023, 9:56am

Dear @technocrat,

Thank you for having a look with me. I am not sure if I understand how to produce a REPREX from my fasta file, since it is not a typical dataframe which I normally use in R.

However, below you find a example to make a working reprex , I hope this is not to much to asked for. If it is, please let me know and I will put some more effort in making a proper reprex.

The genome sequence can be copy pasted into a text editor and is saved with a .fasta file extension. The example here: >ST4.03ch01 and >ST4.03ch02, is a small REPREX genome consisting of two small chromosomes. They can be pasted into a text editor below eachother, just as they are shown below (The > comes for the sequence id).

REPREX GENOME: paste in text editor and save as .fasta

>ST4.03ch01
ATTCTGCTTTCACAAATGAGTTTCAAATGTCAAAAAAAAAATCCGATAAAAAAATTGATTTTCTTATATGTCAAATTCTT
TTGAAAATAAATAATTAAATAGGTAAAAAAAATTAACAAATAAAAGAAACCACCAAACATAATTTTTCTTGACCAAACAC
TTTTTCAAAATAAAAAGTAATAGTTTTTAAAAAGAAAATTTTGTCTTTTTAAATCTGATATTTTCCAAAAACAATAATTT
AATAGTAATTTTCAAAAATATATGTATAAAAAAACTACCAAAAGCACTTCGTCTTGCAAAAATGACTTTCACACAATAAA
AAGAAGTTTTCAAATTGTCTTTTATATTGATTTTTTCCCCCTAAAAATAGATGATAACATCCACCAATATGGGGTGCACG
TGAGTGTGTCGGCTGATGTCACATGAGCGATAGGCGCATAGGCCTGCGCTTGTGCCAGGGCATCGGTCGATGTCACTTGG
CTGATGAGCACAAGAGTATACCTTTGTGTTCGGGCCTTGGCCGATGTTTCATGGTTAATGTGTGTACGGGTCCGTCGATG
TAGCCTTGGTGATATGCATATGGCCCTACCATTGTGCTTGAGTCTCGGTCGATGTAGCTTTTTTAACTTGTGTTGGGCTC
TGCCAATGTTGCCGAAGCGATGCACGTAGATACCTGGCCTTGTAAACGGACACATGTCTCTTTACCGATGTGTGTTGGGA
CCTGCTCTTGAGTGCAGGCCTCGGTCAATGTGGCTTGGTCGATGTGTGAAGGGGTCTGTCGATGTCATTTGGGTAAAGCA
CGTAGGGCCTGCTCAGCTTATGTGTAAAGGGGTCTGCTGATAATGTTTTGACGATGAATGTTAGGATATGCTGTTCTGTG
TGGCCTAGGCCAATATGCATAGGGGTCCACCGATGTTGCTTGGGCGATGTGAATAGCGGCATGCCGTTGTCACTTAGGCA
>ST4.03ch02
ATGTGCACAAGGGCATAACATAGAACATAAGTGATGTAATTCTCAACTAGAATAATCACAACAAACAGCGGAAAACAAGA
AAGATAGAAGAAGAGACACAAGATAAGAAACAATTGGCAACTAAAGGATGTATTAAATCCAATAACCTCCAATCTAGATC
CTAACAAGTACCAAATCCAGATAAATAGATAGACACAAAAAGAGGAAACAAAGAGAAACCTAAGGGTATATTAAACCCAA
AGACCTCACCTCGTTAATCCGTCAAGAAGCACCTAACTCTTACGAAGACCTCGAGGGAAATAATATCCCAAGCAACCCTC
GTTTCTAAACAAGGTTAAACAAAAGCTTACCAAGCTTTCAACACTCACACAAGGAGTTTAGGTTTACAACCACAATATTC
AAAACAAGTCTAACCCTAAAATGAACCTCTAAAGGCCTATTTATAGTTTTTGGCATTATTACATAGTTGGGCCCAAAAGT
GACCCGAAATGGGCCAATGGGTGAGCCAAGTCGGGCTCGCCAAATGGCTCGGCTATGTGCCCAATGAACTGGTGAGGAAC
TCTACTCCTGGCCTTGGCTGGGGCTACTGTAACTTTGGGCGGTCTGGGGACCATCGCCAATCCACTTGGCGATACGCCAA
ATGGTCTTCAGCTCGCCCTTTGTGCAGCTTTCCCTGGTAGCTCCTTTGGGACCTTCGGTGATCTTGGAACCCTGTGCCAA
AAGGGTTTTGGTGAGTTGCCGAAGGGGTCTCCAGCTCTCCAAATGGAGCACACTCTTCACTTGCTCCTCCTCGCTTCGTT
AGGCCTCCCATGCTCTAAGCTATCATCCAAGCATGGTGTTTCCCAACTGGGATCTCCTTAAAGGTGCATATGTTCCCATG
CAAGGCATGTAACTCCCTCCTTAGCATCCAAACATACCAAACAACCTCCATCCTTGTGAACACTCCATCAAAGCATTGAA

The GFF3 file is a tab delimited file in a text editor, saved with a .txt extension. I don't think this file is saved with .gff3 extension. The first line of the gff3 file is a comment that identifies the version, as shown it the example below.
The GFF3 file annotates DNA regions based on coordinates. Below I presented a small reprex with 4 markers from which the sequence needs to be extracted from the genomes (2 markers for each reprex genome).

REPREX GFF3: paste the table in a text editor without column names (1 to 4) and without headers (V1 tot V9) and save as .txt
Important note: when I did a test of copy-pasting this example table in my text editor, I found that the columns were separated by more then one tab (or spaces), which will make the file useless! The table in this reprex is also missing the first comment line referring to the version. See further down below for a proper example (but not produced using the reprex function).

data.frame(
  stringsAsFactors = FALSE,
                V1 = c("ST4.03ch01", "ST4.03ch01", "ST4.03ch02", "ST4.03ch02"),
                V2 = c(".", ".", ".", "."),
                V3 = c("PCR_product","PCR_product",
                       "PCR_product","PCR_product"),
                V4 = c(200L, 500L, 400L, 700L),
                V5 = c(300L, 600L, 500L, 800L),
                V6 = c(".", ".", ".", "."),
                V7 = c("+", "+", "+", "+"),
                V8 = c(".", ".", ".", "."),
                V9 = c("Name=ARM01-001",
                       "Name=ARM01-002","Name=ARM02-001","Name=ARM02-002")
)
#>           V1 V2          V3  V4  V5 V6 V7 V8             V9
#> 1 ST4.03ch01  . PCR_product 200 300  .  +  . Name=ARM01-001
#> 2 ST4.03ch01  . PCR_product 500 600  .  +  . Name=ARM01-002
#> 3 ST4.03ch02  . PCR_product 400 500  .  +  . Name=ARM02-001
#> 4 ST4.03ch02  . PCR_product 700 800  .  +  . Name=ARM02-002

Created on 2023-10-02 with reprex v2.0.2

So here below I show how the GFF3 would look like when copy-pasted from my text editor. But again, the tab-delimitors are replaced by spaces when copy-pasting from my text editor into this forum post, which will make it not comply to GFF3 standards.

##gff-version 3
ST4.03ch01 . PCR_product 200 300 . + . Name=ARM01-001
ST4.03ch01 . PCR_product 500 600 . + . Name=ARM01-002
ST4.03ch02 . PCR_product 400 500 . + . Name=ARM02-001
ST4.03ch02 . PCR_product 700 800 . + . Name=ARM02-002

To illustrate what it should look like (and how it shouldn't), I provide a printssreen of this GFF3 example below.
The 9 columns are all delimited by tabs in the correct version.

We normally use excel to create our GFF3 files. When we have them complete we copy-paste the whole thing into text editor and save it as a .txt. This method works for uploading the file into online 'Genome Browsers', but I don't know how the file should be presented in your script, perhaps as a CSV?

I do understand this might be a unorthodox (and messy?) way of presenting my examples, hope it is clear though. Thanks a lot for your time!

technocrat · October 3, 2023, 8:06pm

Is this what you are trying to get to with the gff?

g <- data.frame(
  stringsAsFactors = FALSE,
  V1 = c("ST4.03ch01", "ST4.03ch01", "ST4.03ch02", "ST4.03ch02"),
  V2 = c(".", ".", ".", "."),
  V3 = c("PCR_product","PCR_product",
         "PCR_product","PCR_product"),
  V4 = c(200L, 500L, 400L, 700L),
  V5 = c(300L, 600L, 500L, 800L),
  V6 = c(".", ".", ".", "."),
  V7 = c("+", "+", "+", "+"),
  V8 = c(".", ".", ".", "."),
  V9 = c("Name=ARM01-001",
         "Name=ARM01-002","Name=ARM02-001","Name=ARM02-002")
)

stick <- function(x) paste0(g$V1,g$V2,g$V3,g$V4,g$V5,g$V6,g$V7,g$V8,g$V9)
stick()
#> [1] "ST4.03ch01.PCR_product200300.+.Name=ARM01-001"
#> [2] "ST4.03ch01.PCR_product500600.+.Name=ARM01-002"
#> [3] "ST4.03ch02.PCR_product400500.+.Name=ARM02-001"
#> [4] "ST4.03ch02.PCR_product700800.+.Name=ARM02-002"

MN_AGRICO · October 5, 2023, 7:24am

Dear @technocrat ,

The GFF3 file is not the product I am trying to produce. I already have it available in the correct format to use it in online Genome Browsers.

Some context: The genome sequence of many organisms are available to be viewed and also to be downloaded online. When viewing online using a free available 'Genome Browser' , one can annotate it's own regions using a GFF, GFF2, GFF3 or BED file. All file types are similar in use: they define a 'region of interest' in the genome by coordinates. So when uploading such a 'coordinate'-file into a genome browser, your 'own' regions visually show up and these items become 'clickable' and the genetic sequence of this specific region can be saved if desired.

I would like to get the sequence from hundreds of such regions, which in the online genome browsers needs to be done by clicking and saving the annotated regions one at a time.

I know this could be done using Rstudio. The input is 1) the downloaded genome and 2) the GFF file with coordinates. Then with the correct package(s) you should be able to retreive all the sequences in one go.

The complete genome is saved in a tekst editor with a .fasta extension. My genome has 12 chromosomes so there are 12 separate sequences in this file, each chromosome starting on a new line in the fasta file with a >(name). The total length is about 850.000.000 characters.

The GFF3 (coordinate) file is the file-type I currently use to do my online genome browsing, but I don't know if the Rstudio package of choice would like to have it in a different format. That doens't really matter, as long as it takes the right coordinates and brings back the sequence of that coordinate together with the name of that specific region.

So the workflow is:

Input 1: .fasta genome in text editor (small reprex below)

>ST4.03ch01
ATTCTGCTTTCACAAATGAGTTTCAAATGTCAAAAAAAAAATCCGATAAAAAAATTGATTTTCTTATATGTCAAATTCTT
TTGAAAATAAATAATTAAATAGGTAAAAAAAATTAACAAATAAAAGAAACCACCAAACATAATTTTTCTTGACCAAACAC
TTTTTCAAAATAAAAAGTAATAGTTTTTAAAAAGAAAATTTTGTCTTTTTAAATCTGATATTTTCCAAAAACAATAATTT
AATAGTAATTTTCAAAAATATATGTATAAAAAAACTACCAAAAGCACTTCGTCTTGCAAAAATGACTTTCACACAATAAA
AAGAAGTTTTCAAATTGTCTTTTATATTGATTTTTTCCCCCTAAAAATAGATGATAACATCCACCAATATGGGGTGCACG
etc...
>ST4.03ch02
ATGTGCACAAGGGCATAACATAGAACATAAGTGATGTAATTCTCAACTAGAATAATCACAACAAACAGCGGAAAACAAGA
AAGATAGAAGAAGAGACACAAGATAAGAAACAATTGGCAACTAAAGGATGTATTAAATCCAATAACCTCCAATCTAGATC
CTAACAAGTACCAAATCCAGATAAATAGATAGACACAAAAAGAGGAAACAAAGAGAAACCTAAGGGTATATTAAACCCAA
AGACCTCACCTCGTTAATCCGTCAAGAAGCACCTAACTCTTACGAAGACCTCGAGGGAAATAATATCCCAAGCAACCCTC
GTTTCTAAACAAGGTTAAACAAAAGCTTACCAAGCTTTCAACACTCACACAAGGAGTTTAGGTTTACAACCACAATATTC
etc...
>ST4.03ch03
>ST4.03ch04
>ST4.03ch05
>ST4.03ch06
>ST4.03ch07
>ST4.03ch08
>ST4.03ch09
>ST4.03ch10
>ST4.03ch11
>ST4.03ch12

Input 2 ( very small but workable reprex below):

##gff-version 3
ST4.03ch01 . PCR_product 1 10 . + . Name=ARM01-001
ST4.03ch01 . PCR_product 20 40 . + . Name=ARM01-002
ST4.03ch02 . PCR_product 5 25 . + . Name=ARM02-001

Note again that I have this GFF3 file in a text editor, with the 9 columns separated by tab's, because that is the GFF3 standard (see Specifications/gff3.md at master · The-Sequence-Ontology/Specifications · GitHub). However, if R needs it to be in a different format, that's OK as mentioned earlier. It just needs to check which chromosome to look in (column1), from which coordinated to get the sequence from (column 4 & 5) and then retreive that sequence together with the name of the feature (column9)

Desired output (of this reprex here) would look like:

>ARM01-001
ATTCTGCTTT
>ARM01-002
GTTTCAAATGTCAAAAAAAAA
>ARM02-001
GCACAAGGGCATAACATAGAA

Thanks again and if it becomes too much of a trouble, just leave it. Thnx!

technocrat · October 5, 2023, 9:44am

It should be obvious by now that my only domain knowledge in genetics is whatever is encoded in my personal DNA, which explains my disconnects. But unless I were an expert, I'd go on approaching it the same way—as a schoolkid working y = f(x), where

y is what is wanted, in this case an object that contains a variable length string of some combination of the four magic letters A,C,G,T and it's associated identifier in the form ARM01-001.

x is what is to hand, in this case is the fasta file which has 12 long sequences of ACGT coded by chromosome in the form ST4.03chl01 and the GFF3 coordinate file that specifies: 1) the chromosome to be retrieved, 2) the index of the start and end of the ACGT sequence and 3) the associated identifier to be assigned, such as ARM01-001.

f is the function (likely a composite chain of functions) to be applied to x to produce y.

So, the task is to select off-the-shelf functions specialized to the task or assemble our own or to use a combination.

The first thing to do is to design y, which is simply a two-column matrix sized to contain the expected number of results, assumed here to be 12

dumb <- rep(NA,12)
y <- cbind(dumb,dumb)
colnames(y) <- c("id","seqn")
head(y)
#>      id seqn
#> [1,] NA   NA
#> [2,] NA   NA
#> [3,] NA   NA
#> [4,] NA   NA
#> [5,] NA   NA
#> [6,] NA   NA

To populate the id column y[1], we need to read in the tab delimited gtt3 file

gtt3 <- data.table::fread("/Users/ro/projects/demo/grist.tsv")
gtt3
#>            V1 V2          V3 V4 V5 V6 V7 V8             V9
#> 1: ST4.03ch01  . PCR_product  1 10  .  +  . Name=ARM01-001
#> 2: ST4.03ch01  . PCR_product 20 40  .  +  . Name=ARM01-002
#> 3: ST4.03ch02  . PCR_product  5 25  .  +  . Name=ARM02-001
gtt3 <- gtt3[,c(1,4,5,9)]
colnames(gtt3) <- c("ch","begin","finish","id")
gtt3$id <- gsub("Name=","",gtt3$id)
gtt3
#>            ch begin finish        id
#> 1: ST4.03ch01     1     10 ARM01-001
#> 2: ST4.03ch01    20     40 ARM01-002
#> 3: ST4.03ch02     5     25 ARM02-001

Now we need a way to extract the substring beginning at position begin and ending at position finish for each id using the related ch. To do that, we locate the corresponding ch in x, so let's make x

lines <- readLines("/Users/ro/projects/demo/faux.fasta")
#> Warning in readLines("/Users/ro/projects/demo/faux.fasta"): incomplete final
#> line found on '/Users/ro/projects/demo/faux.fasta'
# Extract odd and even lines, trim leading > or space and trailing space
ch <- lines[seq(1, length(lines), 2)] |> gsub("^.","",x=_) |> gsub(" $","",x=_) 
seqn <- lines[seq(2, length(lines), 2)] |> gsub("^ ","",x=_) |> gsub(" $","",x=_)
# Combine odd and even lines into a matrix
x <- cbind(ch,seqn)
x
#>       ch          
#>  [1,] "ST4.03ch01"
#>  [2,] "ST4.03ch02"
#>  [3,] "ST4.03ch03"
#>  [4,] "ST4.03ch04"
#>  [5,] "ST4.03ch05"
#>  [6,] "ST4.03ch06"
#>  [7,] "ST4.03ch07"
#>  [8,] "ST4.03ch08"
#>  [9,] "ST4.03ch09"
#>       seqn                                                                                                  
#>  [1,] "AGCAAAGCTTCCTTCGGAAGCTTCTAGCCCATGCATCCCCGTCCGGTCCTCTGGGCGTCGATCCCAAGAGGATGGGGCCTAGGAATACGTGCTGAGAGCC"
#>  [2,] "TGCGGAGTATACGAACTAGTAGATGGGATACTCCGTATGTATGGTATTCAGGAACAGGTGCGGCGTCGTAAGGTTGAGTGTATATTAAAGCCATCCTGCT"
#>  [3,] "CGATAACACAGCCTTTTTTCTAGCACTAGAACGTTCGCTTTATTATCAGAAATAGAGGAACCCGAGATGGGCTGTTTGATCGTGGTCGTCTAACTGGAGG"
#>  [4,] "TATAAACGTCTGATAGTGGTTGCATCCCAATGATCTTTCTAACACAAACTCATTGTGAAGAGGCTAAACGGCTGGTGTGTATGATTCATGGTTAGTCGTA"
#>  [5,] "TTTGGAGTTAGGCATCCCAAACGCTCATGGTCTCTACGCCAGGCAGAGATGCAAGTACGATGGGCGGTTTGTTAAACATGTGCTCAATTGTCAGTTTCAA"
#>  [6,] "TCTACCTAAAACAAGCCTTTCAACGACCCACTCAACCCTGGACCTAGTCCCGGGTCATGGCAAAGTGGTGAGTAACCCACGCCCTGGTCCTGTAACGCTC"
#>  [7,] "TGGAATTTGAATTAATGACGTATCATTAACTTCTCTTGCCATGTCGGGCGAGAAAAACTTCGTCGACACCCGGTTAGCCTTATACGAAGGGGAATTGTCG"
#>  [8,] "CCTTGGTTACCGAGAGATGAGTCTAGTGGTCTGCTTCTGTCCGGACGGATGGCCTGGTTCTCAGTAATCCTATAAGCAAACCGTTAGAGTATGCTCCCCT"
#>  [9,] "AGGCTCGGTGGCACAAACGTCGGTACTAGGTTTCGAATCATTGGCACCACTGCCACGTTTCTGGCGATGCGTTTCGCCTCTGGCTGCTAATTTGGGTAGC"

Now, it's just a matter of extracting the sequences

pull_seg <- function(x,y,z) substr(x,y,z)
pull_seg(x[1,2],10,20)
#>          seqn 
#> "TCCTTCGGAAG"

If that seems right, let me know if you need help gluing this all together.

MN_AGRICO · October 5, 2023, 11:14am

This really seems to get very close to what I 'd like R to do!

Just a few details: you're saying when populating y, you say that you probably expect that there are 12 results. This is not the case.
We indeed have 12 very long sequences (what you call x), but from these 12 long sequences I want to retrieve around 1800 smaller sequences, which are documented in that GFF3-file. So from every chrosome I want to get, on average, about 150 smaller sequences. For some chromosomes we just want 10 sequences, from other 400...

The GFF3 file looks therefore like below, for about 1800 lines.

ST4.03ch01 . PCR_product 1 10 . + . Name=ARM01-001
ST4.03ch01 . PCR_product 20 40 . + . Name=ARM01-002
ST4.03ch01 . PCR_product 80 105 . + . Name=ARM01-003
ST4.03ch01 . PCR_product 135 165 . + . Name=ARM01-004
*(ST4.03ch01 ~150 more lines)*
ST4.03ch02 . PCR_product 5 25 . + . Name=ARM02-001
ST4.03ch02 . PCR_product 155 175 . + . Name=ARM02-002
ST4.03ch02 . PCR_product 180 195 . + . Name=ARM02-003
*(ST4.03ch02 ~150 more lines)*
etc. till ch12...

Hope this makes sense.

And yes, some help gluing it to together would be very appriciated. I am in the middle of a Rstudio course, halfway for my exam. Of course already trying to use it for several tasks. The one you're presenting here I can understand and read, but gluing it together might to be to difficult.

Thnx so far!

MN_AGRICO · October 5, 2023, 12:14pm

@technocrat
also, when I run your code below on my complete .fasta genome file...

lines <- readLines("/Users/ro/projects/demo/faux.fasta")
#> Warning in readLines("/Users/ro/projects/demo/faux.fasta"): incomplete final
#> line found on '/Users/ro/projects/demo/faux.fasta'
# Extract odd and even lines, trim leading > or space and trailing space
ch <- lines[seq(1, length(lines), 2)] |> gsub("^.","",x=_) |> gsub(" $","",x=_) 
seqn <- lines[seq(2, length(lines), 2)] |> gsub("^ ","",x=_) |> gsub(" $","",x=_)
# Combine odd and even lines into a matrix
x <- cbind(ch,seqn)

... I does not separate the 12 chromosome names and sequences. It splits them up into even and uneven lines. Therefore it just ends up in 2 large genome files, each having 50% of the genome and in mixed up way.

To make this easier, and I think it will make it much easier, I think it would be good to split things up into 12 datasets.

So the first input would be just 1 chromosome sequence in a plane.txt file, like so:

(now also there will be NO >ST4.03ch## identifier present at the first line):

ATTCTGCTTTCACAAATGAGTTTCAAATGTCAAAAAAAAAATCCGATAAAAAAATTGATTTTCTTATATGTCAAATTCTT
TTGAAAATAAATAATTAAATAGGTAAAAAAAATTAACAAATAAAAGAAACCACCAAACATAATTTTTCTTGACCAAACAC
TTTTTCAAAATAAAAAGTAATAGTTTTTAAAAAGAAAATTTTGTCTTTTTAAATCTGATATTTTCCAAAAACAATAATTT
AATAGTAATTTTCAAAAATATATGTATAAAAAAACTACCAAAAGCACTTCGTCTTGCAAAAATGACTTTCACACAATAAA
AAGAAGTTTTCAAATTGTCTTTTATATTGATTTTTTCCCCCTAAAAATAGATGATAACATCCACCAATATGGGGTGCACG
TGAGTGTGTCGGCTGATGTCACATGAGCGATAGGCGCATAGGCCTGCGCTTGTGCCAGGGCATCGGTCGATGTCACTTGG
CTGATGAGCACAAGAGTATACCTTTGTGTTCGGGCCTTGGCCGATGTTTCATGGTTAATGTGTGTACGGGTCCGTCGATG
TAGCCTTGGTGATATGCATATGGCCCTACCATTGTGCTTGAGTCTCGGTCGATGTAGCTTTTTTAACTTGTGTTGGGCTC
TGCCAATGTTGCCGAAGCGATGCACGTAGATACCTGGCCTTGTAAACGGACACATGTCTCTTTACCGATGTGTGTTGGGA
CCTGCTCTTGAGTGCAGGCCTCGGTCAATGTGGCTTGGTCGATGTGTGAAGGGGTCTGTCGATGTCATTTGGGTAAAGCA
CGTAGGGCCTGCTCAGCTTATGTGTAAAGGGGTCTGCTGATAATGTTTTGACGATGAATGTTAGGATATGCTGTTCTGTG
TGGCCTAGGCCAATATGCATAGGGGTCCACCGATGTTGCTTGGGCGATGTGAATAGCGGCATGCCGTTGTCACTTAGGC
etc... for ~60.000.000 characters

And the second input, my gff3 file can also be easily sorted per chromsome. So the second input would then be a GFF3 with only the coordinates within that one chromosome. The first column of the GFF3 file could then be ignored. Only column 4 & 5 with the coordinates and column 9 with the ID would be necessary.

The desired output would still be:

>name1
ATGCAGTTCAGG
>name2
AATGGGTACCGTGCAGTC
>name3
etc for all sequences to be retrieved from that one chromosome

If this would work, I could run it 12 times.

MN_AGRICO · October 5, 2023, 8:31pm

Hi @technocrat ,

Your suggestions made me thinking of a solution so I've been a little busy and got it worked out.
When uploading the large genome sequence using the readlines function, each new line became a new element in that value, which was blocking my search using str_sub.

So I first used the paste function to make it one large!! element.

After that is was easy, using the coordinates from the GFF3 file, searching into that one element. Then writing the result back into a new column of the GFF3 file.

See code:

library(stringr)

# load in chromosome seq from plane text file. NO >(name) in first line!
chr01_sequence_FASTA_lines <- readLines("F:/AR_Biotech/Markers in Genome Browser(s)/RSTUDIO/ST4.03_chr01.txt")

# paste all the different elements (breaklines) in one string
chr01_sequence_1_line <- paste0(chr01_sequence_FASTA_lines, collapse = "")

# read in the GFF3 files (CSV), column 4 & 5 should contain the coordinates
chr01_annotation <- read.csv2("potatoMASH_chr01_GFF3.csv", header = FALSE, stringsAsFactors = FALSE)

# make the columns 4 & 5 numeric
# got the Error: 'list' object cannot be coerced to type 'double' when using as.numeric
# google search said: use the unlist function as well because the integer values are list (?)
chr01_annotation[, 4:5] <- as.numeric(unlist(chr01_annotation[, 4:5]))

# get the substring of the "1_line" chromosome sequence by the annotation coordinate
# print this sequence back into the GFF annotation file
chr01_annotation$seq <- str_sub(chr01_sequence_1_line, chr01_annotation$V4, chr01_annotation$V5)

Result like this (table is to wide to display nicely):

          V1 V2          V3      V4      V5 V6 V7 V8        V9
1 ST4.03ch01  . PCR_product  433598  433726  .  +  . Name=C1_1
2 ST4.03ch01  . PCR_product 1489825 1489948  .  +  . Name=C1_2
3 ST4.03ch01  . PCR_product 2549253 2549381  .  +  . Name=C1_3
4 ST4.03ch01  . PCR_product 3767216 3767332  .  +  . Name=C1_4
5 ST4.03ch01  . PCR_product 5388186 5388303  .  +  . Name=C1_5
6 ST4.03ch01  . PCR_product 6218759 6218882  .  +  . Name=C1_6
                                                                                                                                seq
1 ATTGGGTTCCACACTTTTGACTATGCGAGGCACTTCCTCTCGTGTTGCAGTCGGATGTTAGGTATTTCTTATGAGTCGAAAAGGGGTTACATAGGCCTTGAGTACTATGGTAGGACTGTAAGTATTAAA
2      ATTTTTTCCCTTGATATTAATTTATTTAATAACTCCATTATCCTGTGGAATTTGTTGCTTCTGACTACAAACTGGTTTGGGTTCCTCCTCATGGATGAGCAACATTCTGGCGACCTTCTAAAAT
3 TAAAATAGTACTTGCAACAAATATGGCCGAGGCAAGTATTACTATAAATGATGTGGTCTTTGTGGTGGACTGTGGAAAAGCGAAGGAGACGACTTATGATGCTCTAAATAATACTCCGTGCTTATTACC
4             AGGTACATCATCAATTTCAGTCTGCAAAAGTTATTCCTCAATCATCAAATTCTACAATATAGCAACACTTGCAAAACAAAACTTAAAAGAACTACTTACTGGATAAAGCATATGAAT
5            CTCCTGCACTGCTACCTGATTTTTCGGATAAAGTAACAATTTTATGACAATCTAGCATAAAAGCAATGCTAAAAAAAATAGAAAACTGCCAGAAAAGCAGAGTTCTTCACTACCTGAA
6      GTTAAATGGGATTACTTTGGTAGAGGAGGTCTCGGAAAGTAACAAAGATGGTTATATTAGTGAAGGGAAACCACACACACTATTATCTTCTGCTAAGCAGGTTGTGGAATCAAGCAAAGAGTAC
>

So it seems to have worked!! Thanks for thinking with me!!!

system · April 24, 2024, 4:17pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.