Struggling to write an R script that searches BLAST with given DNA sequences

I'm trying to write an RStudio script that takes an excel file with a list of DNA sequences and searches BLASTn from NCBI with them, obtains the top hit for percent identity and accession number, and exports the results as a csv. The problem I am having is that the results are only showing NA for each DNA sequence in the new csv file instead of providing the expected results. I've attached the file that BLAST has been giving me with the NA values. Here is my code:

Read Excel file with DNA sequences

excel_file <- "C:/Users/rbh6/Documents/BLAST TEST START/TEST_BLAST_RSTUDIO.xlsx"
dna_sequences <- readxl::read_excel(excel_file, sheet = "Sheet1", col_names = TRUE)

Print the structure of dna_sequences to understand its format

print(head(dna_sequences))

Read Excel file with DNA sequences

excel_file <- "C:/Users/rbh6/Documents/BLAST TEST START/TEST_BLAST_RSTUDIO.xlsx"
dna_sequences <- readxl::read_excel(excel_file, sheet = "Sheet1", col_names = TRUE)

Print the structure of dna_sequences to understand its format

print(head(dna_sequences))

Perform BLAST search for each sequence

blast_results <- lapply(dna_sequences$DNA Sequence, function(sequence) {
result <- tryCatch({
perform_blast(sequence)
}, error = function(e) {
cat("Error in BLAST search for sequence:", sequence, "\n")
cat("Error message:", conditionMessage(e), "\n")
return(rep(NA, 3)) # Return NAs for columns if no match is found
})

if (length(result) == 0) {
return(rep(NA, 3))
} else {
return(result)
}
})

Determine the length of the longest vector among blast_results

max_length <- max(sapply(blast_results, function(x) max(lengths(x))))

Create a data frame with specified lengths, filling with NA where needed

result_df <- data.frame(
Sequence = rep(dna_sequences$DNA Sequence, each = max_length),
Top_Hit_Accession_Number = unlist(lapply(blast_results, function(x) c(x[[1]], rep(NA, max_length - length(x[[1]]))))),
Query_Coverage = unlist(lapply(blast_results, function(x) c(x[[2]], rep(NA, max_length - length(x[[2]]))))),
Percent_Identity = unlist(lapply(blast_results, function(x) c(x[[3]], rep(NA, max_length - length(x[[3]])))))
)

Export the data frame as a CSV file

setwd("C:/Users/rbh6/Documents/BLAST TEST COMPLETED")
data_Path <- "C:/Users/rbh6/Documents/BLAST TEST COMPLETED"
path <- file.path(data_Path, paste(Sys.Date(), "_BLAST_RESULTS.csv", sep = ""))
write.csv(result_df, file = path)

Display a message indicating successful completion

cat("BLAST search and CSV export completed. Results saved in", path, "\n")

cat("BLAST search completed. Displaying results:\n")
print(blast_results)

Completed File

Where is this perform_blast() function coming from? I don't think it's from a standard R package, did you write it yourself? Can you share the source?

In addition, when running the script, do you see the messages "Error in BLAST search for sequence: ..."?

Hello,

I don't remember which source I got this code from. I've pieced together code from tons of sources at this point and have been making minor progress along the way but I can't get over this obstacle.

For your other question, I have not received that message when running the script.

The source code has to be somewhere, or R can't run it (that could actually be the cause of the problem too). Maybe it's in a file that is read at the beginning of the script, something like:

source("R/utilities.R")

In RStudio, if you put your cursor on perform_blast() and press F2, it should open the right file (you can also try ctrl+click).

You can also directly call perform_blast (without the parentheses), and see if R tells you "Error: object 'perform_blast' not found" or prints its source code.

Is anything printed in the console when you run the script?

> # Manually set paths to the Selenium server and ChromeDriver
> selenium_path <- "path/to/selenium-server-standalone.jar"
> chromedriver_path <- "C:/Users/rbh6/Downloads/chromedriverofficial/chromedriver/chromedriver.exe"
> 
> # Start a Selenium server with Chrome
> library(RSelenium)
Warning message:
package ‘RSelenium’ was built under R version 4.3.2 
> rd <- rsDriver(browser = "chrome")
checking Selenium Server versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
checking chromedriver versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
checking geckodriver versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
checking phantomjs versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
[1] "Connecting to remote server"
Could not open chrome browser.
Client error message:
Undefined error in httr call. httr output: Failed to connect to localhost port 4567 after 2228 ms: Couldn't connect to server
Check server log for further details.
Warning message:
In rsDriver(browser = "chrome") : Could not determine server status.
> remDr <- rd[["client"]]
> 
> library(traits)
Registered S3 method overwritten by 'hoardr':
  method           from
  print.cache_info httr
Warning message:
package ‘traits’ was built under R version 4.3.2 
> library(rentrez)
Warning message:
package ‘rentrez’ was built under R version 4.3.2 
> library(seqinr)
Warning message:
package ‘seqinr’ was built under R version 4.3.2 
> library(readxl)
> library(RSelenium)
> 
> # Read Excel file with DNA sequences
> excel_file <- "C:/Users/user/Documents/BLAST TEST START/TEST_BLAST_RSTUDIO.xlsx"
> dna_sequences <- readxl::read_excel(excel_file, sheet = "Sheet1", col_names = TRUE)
> 
> # Start a Selenium server with Chrome (without specifying port)
> rd <- rsDriver(browser = "chrome", chromever = "latest")
checking Selenium Server versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
checking chromedriver versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
checking geckodriver versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
checking phantomjs versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
[1] "Connecting to remote server"
Could not open chrome browser.
Client error message:
Undefined error in httr call. httr output: Failed to connect to localhost port 4567 after 2239 ms: Couldn't connect to server
Check server log for further details.
Warning message:
In rsDriver(browser = "chrome", chromever = "latest") :
  Could not determine server status.
> remDr <- rd[["client"]]
> 
> # Check server logs
> cat(rd$log$file)
> 
> # Initialize results list
> blast_results <- list()
> 
> # Function to perform BLASTn search using RSelenium
> perform_blastn_search_rselenium <- function(query_sequence, remDr) {
+   # Navigate to the BLAST website
+   remDr$open()
+   remDr$navigate("https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome")
+   
+   
+   # Find the search bar element by its CSS selector and input the sequence
+   seq_input <- remDr$findElement(using = "css selector", value = "#seq") 
+   seq_input$sendKeysToElement(list(query_sequence))
+   
+   # Click the "BLAST" button
+   blast_button <- remDr$findElement(using = "css selector", value = "#blastButton")
+   blast_button$clickElement()
+   
+   # Wait for the search results page to load (adjust the time as needed)
+   Sys.sleep(10)
+   
+   # Extract BLAST results (example, modify as needed)
+   blast_result <- list(
+     Top_Hit_Accession_Number = "example_accession",
+     Query_Coverage = 90,
+     Percent_Identity = 95
+   )
+   
+   return(blast_result)
+ }
> 
> # Example usage
> for (query_sequence in dna_sequences$"DNA Sequence") {
+   # Perform BLASTn search using RSelenium
+   blast_result <- perform_blastn_search_rselenium(query_sequence, remDr)
+   
+   # Add the result to the list
+   blast_results <- c(blast_results, list(blast_result))
+   
+   # Add a delay between queries to avoid overloading the server
+   Sys.sleep(5)
+ }
[1] "Connecting to remote server"
Error in checkError(res) : 
  Undefined error in httr call. httr output: Failed to connect to localhost port 4567 after 2220 ms: Couldn't connect to server
Called from: checkError(res)
Browse[1]>

Is that the content of perform_blast()? A new version of the script?

Based on the error messages, it looks like a problem in the configuration of RSelenium. I'm not familiar with RSelenium, which is its own beast, there are probably solutions, e.g. here.

Why use RSelenium though? I'm pretty sure there are simpler ways to use BLAST than to use the web interface. For example using the REST API that you can call using httr2. Or using a wrapping package such as rBLAST.

1 Like

Good idea, why make it more complicated than it has to be. Thank you for the help!

It might be helpful for you to know how to properly format code and console output that you post here. Using proper code formatting makes the site easier to read, prevents confusion (unformatted code can get garbled by the forum software :anguished:), and is generally considered the polite thing to do. Check out this FAQ to find out how — it's as easy as the click of a button! :grinning::

Key point, which can be generalized to don't try to cobble together a program from disparate sources when problems keep multiplying.

There's a bash script that may be implicated— process_blast. Its author has deprecated it.

I'll take a crack at redefining the problem as a ChatGPT prompt:

In the R programming language, I have an Excel sheet of dna sequences that I want to finds regions of local similarity between sequences by comparing nucleotide or protein sequences to sequence databases provided by the Basic Local Alignment Search Tool of https://blast.ncbi.nlm.nih.gov and to calculate the statistical significance of matches.

Notice a few things:

  1. It starts with the x in y = f(x), what is at hand
  2. It specifies y, what it is being sought
  3. It provides partial specification of f, the function (usually composite) that makes the transformation. It specifies the BLAST databased hosted at nih.gov. Beyond that, it does not get hung up in the details of how.

Here is what you might get as an answer, depending on what service you are using.

To find regions of local similarity between DNA sequences from an Excel file and compare them to sequence databases using the Basic Local Alignment Search Tool (BLAST) in R, you can follow these steps:

  1. Install and Load Necessary Libraries: You'll need to install and load libraries for handling Excel files and interfacing with BLAST. readxl is useful for reading Excel files, and rentrez or seqinr can be used for BLAST queries.

    install.packages("readxl")
    install.packages("rentrez")
    install.packages("seqinr")
    library(readxl)
    library(rentrez)
    library(seqinr)
    
  2. Read DNA Sequences from Excel: Use readxl to import your DNA sequences from the Excel file.

    dna_data <- read_excel("path_to_your_excel_file.xlsx")
    

    Make sure your Excel file is structured in a way that R can easily read the DNA sequences (e.g., each sequence in a separate row).

  3. Prepare Sequences for BLAST: Extract the DNA sequences from the dataframe and prepare them for BLAST querying.

  4. BLAST Query: Use rentrez or seqinr to perform a BLAST search. Here's a basic example using rentrez:

    blast_res <- entrez_search(db="nucleotide", term="your_dna_sequence", rettype="fasta")
    

    This is a simplified example. In practice, you'll need to loop over your sequences and adjust parameters according to your specific needs.

  5. Retrieve and Analyze BLAST Results: After querying, you'll receive BLAST results which you can analyze to find regions of local similarity. The results typically include information about matched sequences, their alignments, and statistical significance.

  6. Statistical Significance of Matches: BLAST results include E-values and other statistics that provide information about the significance of the matches. Lower E-value scores indicate more significant matches.

  7. Handle Multiple Queries and Responses: If you're working with multiple sequences, you'll need to implement a loop or apply a function to handle each sequence individually.

  8. Error Handling: Ensure your code can handle potential issues like network errors, sequences not found in the database, or BLAST timeouts.

  9. Data Export: If necessary, export your results to a file for further analysis or reporting.

Please note that querying BLAST databases can be complex, and the specifics will depend on your exact data and requirements. The NCBI BLAST database also has usage policies and limitations that you should be aware of, especially if you're planning to run a large number of queries. If you're new to BLAST or bioinformatics, you might want to consult additional resources or seek advice from someone with experience in the field.

Added note: visit bioinformatics.org if you are going to be working with DNA in R. There are vast riches there.

And I guess that's a good illustration of the limitations of LLMs: as far as I can tell, neither {rentrez} nor {seqinr} can perform BLAST :slight_smile:

{rentrez} can perform searches in the BLAST database, but from what I gather it's not about running the BLAST algorithm (unless I'm missing something), so we can't look up arbitrary sequences.

1 Like

Yeah, I didn't go on to test it, so it is a good lesson. I think the scaffolding was helpful, though. Probably, if a bioconductor utility isn't there, it would be preferable to crack open the BLAST api. Because web scrapping is ditzy.

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.