Rmd file with embedded SQL chunk. Possible to move the SQL to external file then source?

dougfir · January 16, 2020, 10:06pm

I am working on a Rmd script. This script includes a code chunk for sql:

{r packages, echo=TRUE}
library(tidyverse)
library(rmarkdown)
library(dbplyr)
library(DBI)
library(odbc)
con <- dbConnect(odbc(), "Athena")

some_var <- 7
some_other_var <- 30

Then I get my data via a sql code chunk:

{sql eval=FALSE, connection=con, include=FALSE, output.var="rawd"}
select *
from table
where blah <= ?some_var

The sql chunk uses variables declared further up in the script, in this case some_var referenced in the sql with ?some_var.

That returns a new variable 'rawd' which allows me to move on to reprocessing...

{r preprocessing, echo=TRUE, message=FALSE, warning=FALSE, cache=F}

processed_data <- rawd %>% ... do stuff

But actually, my SQL chunk is large and although I can minimize it with the small expander arrows to the left of the code text, I'd like to move it into it's own file. So I clicked on the new file icon and selected sql script then pasted the sql in there.

But, I have two problems:

On the current sql chunk I can have the output of the query stored in a variable with output.var="rawd", resulting in a new df called 'rawd'. If I import the new sql file, called 'training_data_query.sql', how can I tell the parent file, the working Rmd file to store the results as 'rawd'? I.e. What's the prescribed way to source a .sql file from within a Rmd file?
In the example above I reference variables within the sql chunk by prepending the variable name with a question mark e.g where blah <= ?some_var. How can I do that when the sql is it's own file rather than a sql chunk?

Desired script structure:

{r packages, echo=TRUE}
library(tidyverse)
library(rmarkdown)
library(dbplyr)
library(DBI)
library(odbc)
con <- dbConnect(odbc(), "Athena")

some_var <- 7
some_other_var <- 30

{r preprocessing, echo=TRUE, message=FALSE, warning=FALSE, cache=F}
source(training_data.sql) # ideally results in a new df called 'rawd'
processed_data <- rawd %>% ... do stuff

technocrat · January 18, 2020, 6:27am

Without trying to parse the code in absence of a SQL object, I'll try to describe how I handle this situation in non-user interactive settings.

Save the query object as rawd.Rds as a df and close con. When you need the query result in a chuck read("rawd.Rsa") to get rawds into the current namespace, if necessry. Then chunks can run functions on df until you remove it from namespace.

cderv · January 18, 2020, 10:43am

You can use the power of knitr to fill the content of a chunk programmatically thanks to the code chunk option. That means you can read you sql from a file to the chunk, then it is executed as any other SQL chunk.

Here is a reproducible example to show what I mean. I think it covers what you are looking for. Please tell me otherwise.

You can put the lines below in a Rmd file and render it.

---
title: "Use external SQL file for SQL chunks"
output: html_document
---

In order to provide a reproducible example, we use dplyr and dbplyr with `tbl_memdb` to load a table into a memory db.

```{r}
library(dplyr)
library(dbplyr)
iris_db <- tbl_memdb(iris)
# we isolate the connection object 
con <- src_memdb()$con
```

We check that it works with SQL chunk to filter data into an R object

```{sql, connection=con, output.var="setosa"}
SELECT * FROM iris WHERE Species == "setosa"
```

The object `setosa` exists

```{r}
glimpse(setosa)
```

Know, let see how to load the SQL command from a file to a chunk then execute to save in a R object. We can do it through the power of knitr.

For the demo, we write some SQL in a script `versicolor.sql` using the `cat` engine that can write the content of a chunk to file.

```{cat, engine.opts = list(file = "versicolor.sql", lang = "sql")}
SELECT * FROM iris WHERE Species == "versicolor"
```

and we check that the file exists with some content

```{r}
readLines("versicolor.sql")
```

Now we can use this script to fill the content of the chunk and execute it like any sql chunk. We use the knitr chunk option `code` that allow to fill the content of a chunk programmatically. 

```{sql, connection = con, code=readLines("versicolor.sql"), output.var="versicolor"}
```

The chunk is executed as any other SQL chunk. We check that versicolor data exists

```{r}
glimpse(versicolor)
```

dougfir · January 18, 2020, 4:19pm

This is great! Thank you!

dougfir · January 18, 2020, 10:10pm

I won't be able to check till I get in the office on Tuesday, but curious if this would also work with variables? I currently pass variables from r to sql with ?variable_name within the sql script. Would that work here too?

cderv · January 19, 2020, 12:09am

I need a example here. Tell me when you’ll try.

dougfir · January 19, 2020, 12:31am

In your example you have

```{cat, engine.opts = list(file = "versicolor.sql", lang = "sql")}
SELECT * FROM iris WHERE Species == "versicolor"

Suppose that further up you have a r chunk:

{r set_vars}
filter <- 'versicolor'

Then that you want to reference the variable 'filter' within your sql. If I have a regular sql chunk I can do this:

{sql connection=con, output.var='myvar'}
SELECT * FROM iris WHERE Species == ?filter

Here, ?filter refers to the string variable filter which is 'versicolor', having been set above the sql chunk.

So this: SELECT * FROM iris WHERE Species == ?filter would be interpreted in sql as SELECT * FROM iris WHERE Species == "versicolor"

Is it possible to reference variables in this way when writing the sql in a external .sql file per your example?

cderv · January 19, 2020, 2:05pm

Oh yes this should definitely work. The code chunk option will fill the chunk content programatically, and the chunk is executed as any other. So yes you can.

Did you try ?

It works for me

---
title: "Untitled"
output: html_document
---

In order to provide a reproducible example, we use dplyr and dbplyr with `tbl_memdb` to load a table into a memory db.

```{r}
library(dplyr)
library(dbplyr)
iris_db <- tbl_memdb(iris)
# we isolate the connection object 
con <- src_memdb()$con
```

We check that it works with SQL chunk to filter data into an R object

```{sql, connection=con, output.var="setosa"}
SELECT * FROM iris WHERE Species == "setosa"
```

The object `setosa` exists

```{r}
glimpse(setosa)
```

Know, let see how to load the SQL command from a file to a chunk then execute to save in a R object. We can do it through the power of knitr.

For the demo, we write some SQL in a script `versicolor.sql` and we use a variable from R to have a query with a variable

```{r}
filter_var <- "versicolor"
```


```{cat, engine.opts = list(file = "versicolor.sql", lang = "sql")}
SELECT * FROM iris WHERE Species == ?filter_var
```

and we check that the file exists with some content
```{r}
readLines("versicolor.sql")
```

Now we can use this script to fill the content of the chunk and execute it like any sql chunk. We use the knitr chunk option `code` that allow to fill the content of a chunk programmatically. 

```{sql, connection = con, code=readLines("versicolor.sql"), output.var="versicolor"}
```

The chunk is executed as any other SQL chunk. We check that versicolor data exists

```{r}
glimpse(versicolor)
```

dougfir · January 19, 2020, 2:43pm

Thank you. I'm looking forwards to getting back into the office on Tuesday to try this out!

dougfir · January 21, 2020, 7:07pm

Hi @cderv, as soon as I got in I opened up rstudio to test this out! I'm spinning my wheels and cannot see what I'm missing.

In my working directory I have both my Rmd file and a sql file "training_data_query.sql":

Also, I can see that readLines correctly reads the file:

This query runs sound when I run the sql from within a chunk or when I run sql with the file open and clicking 'SQL' from within the connections tab.

My chunks:

When I attempt to run, it's like the sql chunk does not run. Nothing happens and it appears like no computation takes place. If I take out the code=readLines("training_data_query.sql") and instead paste the SQL code within the chunk, everything works as expected.

Any ideas?

cderv · January 21, 2020, 9:18pm

Can you try to see if it works on a simpler example ?
That way you'll see if it comes from the environment (package versions, knitr option, ...) or from the sql script.

When you paste the SQL inside the chunk, does it includes also the first line with preview or the comments ? You get that when you use readLines.

Another hint would be to check the encoding of the files, and force UTF8 if you are not sure maybe.

As i can't reproduce, it is not easy to search.

dougfir · January 21, 2020, 9:47pm

Hi, thanks for your suggestions. I added a new sql script:

select s, 
       install_dt, 
       split(game_name, '_')[2] as platform
from device_metrics.game_install
where year || '-' || month || '-' || day = '2020-01-10'
limit 100

I removed the --preview... comment at the top.

Same thing, when I try to run the chunk nothing happens.

When you say "Another hint would be to check the encoding of the files, and force UTF8 if you are not sure maybe", how would one do that?

dougfir · January 21, 2020, 10:04pm

And just to verify that the query does indeed work:

(removed s field and added back in the --preview to see the results)

dougfir · January 21, 2020, 11:01pm

It seems to be particularly with the code=readLines() part.

# Works!
```{sql, connection=con, output.var="rawd"}
select 
       install_dt, 
       split(game_name, '_')[2] as platform
from device_metrics.game_install
where year || '-' || month || '-' || day = '2020-01-10'
limit 100

# Doesn't work!

```{sql, code=readLines("example_query.sql"), connection=con, output.var="rawd2"}

After running the second chunk, the console is empty till I type any command, in which case I get a warning message:

1-1
[1] 0
Warning message:
In readLines("example_query.sql") :
  incomplete final line found on 'example_query.sql'

Screen Shot 2020-01-21 at 2.59.30 PM

I tried adding a semi colon at the end following that error message, but that didn't change anything. Also when attempting to use a semi colon at the end of my sql within the chunk itself, when it works, I get an error, so I don't think it's to do with a semi colon.

As far I can tell, the lines are not being read in with code=readLines("example_query.sql")

However, per further up, running readLines("example_query.sql") within a r chunk, those lines are indeed returned.

dougfir · January 22, 2020, 1:38am

Some new information. I tried to copy paste your reprex to see if that worked and it did not:

---
title: "Use external SQL file for SQL chunks"
output: html_document
---

In order to provide a reproducible example, we use dplyr and dbplyr with `tbl_memdb` to load a table into a memory db.

```{r}
library(dplyr)
library(dbplyr)
iris_db <- tbl_memdb(iris)
# we isolate the connection object 
con <- src_memdb()$con

We check that it works with SQL chunk to filter data into an R object

SELECT * FROM iris WHERE Species == "setosa"

The object setosa exists

glimpse(setosa)

Know, let see how to load the SQL command from a file to a chunk then execute to save in a R object. We can do it through the power of knitr.

For the demo, we write some SQL in a script versicolor.sql using the cat engine that can write the content of a chunk to file.

SELECT * FROM iris WHERE Species == "versicolor"

and we check that the file exists with some content

readLines("versicolor.sql")

Now we can use this script to fill the content of the chunk and execute it like any sql chunk. We use the knitr chunk option code that allow to fill the content of a chunk programmatically.

The chunk is executed as any other SQL chunk. We check that versicolor data exists

glimpse(versicolor)

When I try to run the first chunk:

> library(dplyr)
> library(dbplyr)
> iris_db <- tbl_memdb(iris)
Error: Table `iris` exists in database, and both overwrite and append are FALSE

The second chunk works, a new variable 'setosa' appears in my environment after running:

{sql, connection=con, output.var="setosa"}
SELECT * FROM iris WHERE Species == "setosa"

The next chunk, readLines, works:

readLines("versicolor.sql")
[1] "SELECT * FROM iris WHERE Species == \"versicolor\""

The cat chunk works. When I run it a new .sql file 'versicolor.sql' appears:

{cat, engine.opts = list(file = "versicolor.sql", lang = "sql")}
SELECT * FROM iris WHERE Species == "versicolor"

The next chunk is the sql chunk that reads the lines from the newly created versicolor.sql file:

{sql, connection = con, code=readLines("versicolor.sql"), output.var="versicolor"}

This chunk does not run. Same issue as I'm having with the script above.

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux 2

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dbplyr_1.4.2 dplyr_0.8.3 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3       rstudioapi_0.10  knitr_1.27       magrittr_1.5     tidyselect_0.2.5 bit_1.1-15.1     R6_2.4.1         rlang_0.4.2      fansi_0.4.1     
[10] blob_1.2.0       tools_3.6.0      packrat_0.5.0    xfun_0.12        utf8_1.1.4       cli_2.0.1        DBI_1.1.0        bit64_0.9-7      assertthat_0.2.1
[19] digest_0.6.23    tibble_2.1.3     crayon_1.3.4     purrr_0.3.3      vctrs_0.2.1      zeallot_0.1.0    memoise_1.1.0    glue_1.3.1       RSQLite_2.2.0   
[28] compiler_3.6.0   pillar_1.4.3     backports_1.1.5  pkgconfig_2.0.3

cderv · January 22, 2020, 6:16am

When you say does not run, is it run as with the play button in the IDE or run when rendered using the knit button or with rmarkdown::render ?

In my example when I say it is executed it is when you render the document using Knit button or Rmarkdown render. The code chunk option is a knitr feature that the IDE does not seem to handle when playing chunk interactively. It could be a feature request for the IDE I guess.
In RStudio, to play SQL request interactively you have the run button for SQL file as you already know.

Is this what happens ? Did you try to render the all document already ? does it work that way as it should ?

dougfir · January 22, 2020, 6:20pm

Bam! That was it. It does indeed work when I click on knit. Thanks for all of your help with this one

cderv · January 22, 2020, 7:05pm

You're welcome.

Let's note this could be a good feature request. Allow to play interactivly chunk With code option. I Do not know if possible or not. Also I did not check if it exists yet. You may fill one if not existing or give a if one is already there.

Feel free. And if you found or create one link here!

dougfir · January 22, 2020, 8:36pm

Cheers! I will submit a feature request later today.

system · January 29, 2020, 8:36pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.