How to solve the problems of dplyr with connected databases?

technocrat · May 13, 2021, 11:52pm

See the FAQ: How to do a minimal reproducible example reprex for beginners.

The idea behind an R database connection is to allow bringing variables and records into memory selectively, perform some operation and then doing something with the result. That something may include writing it back to the database but doing so can be unwise.

The database libraries

Open a connection
Allow the database to be queried
Close the connection

The {pool} library deals with 1 & 3. It is primarily intended for use within shiny apps and can be replaced with more direct approaches. For example, see the example in help(dbPool).

Other libraries deal with all three aspects. A script would look like

library(DBI)
library(RMariaDB)

con <- dbConnect(RMariaDB::MariaDB(), 
                 username="the_user", 
                 password="the_password", 
                 dbname ="r")

dbListTables(con)
#> [1] "jgr"
dbListFields(con, "jgr")
#>  [1] "recid" "id"    "v1"    "v2"    "v3"    "v4"    "v5"    "v6"    "v7"   
#> [10] "v8"    "v9"    "v10"   "v11"   "v12"   "v13"   "v14"   "v15"   "v16"  
#> [19] "v17"   "v18"   "v19"   "v20"   "v21"   "v22"   "v23"   "v24"   "v25"  
#> [28] "v26"   "v27"   "v28"   "v29"   "v30"

x <- dbGetQuery(con, "SELECT v4 FROM jgr WHERE id = 'NS_000'")
y <- dbGetQuery(con, "SELECT v6 FROM jgr WHERE id = 'NS_022'")

dbDisconnect(con)


dat <- cbind(x,y)
fit <- lm(v6 ~ v4, data = dat)
summary(fit)
#> 
#> Call:
#> lm(formula = v6 ~ v4, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -618.21 -228.26   -6.68   13.13 1950.54 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -3.18906   13.50270  -0.236    0.813    
#> v4           0.59126    0.01733  34.125   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 391.6 on 999 degrees of freedom
#> Multiple R-squared:  0.5383, Adjusted R-squared:  0.5378 
#> F-statistic:  1165 on 1 and 999 DF,  p-value: < 2.2e-16

This does use two SQL statements, like SELECT v4 FROM jgr WHERE id = 'NS_000'. An alternative is provided by {dbplyr}, which relieves even that slight degree of complexity by allowing queries using {dplyr} syntax.