Use purrr::map function for mass-saving data frames to database

rcepka · September 11, 2023, 8:33pm

Dummy data frames:

mydf <- tribble(
    ~col1, ~col2, ~col3,
    50, 500, 5000,
    1000, 2000, 3000
  )

  mydf2 <- tribble(
    ~col1, ~col2, ~col3,
    0, 1, 2,
    666, 777, 888
  )

I want to save these data frames into database. I can do it in usual way, manually saving each one:

schema <- "test_schema"
  
  db_name <- dbplyr::in_schema(schema, "mydf")
  
  dplyr::copy_to(
    con,
    mydf,
    name = db_name,
    overwrite = T,
    temporary = F
  )

... and so fort for all other data frames, this works just fine.

However I want to avoid manual copy-paste for all data frames and use map() function from purrr package instead.
Here is what I created till now:

myschema <- "test_schema"
df_list <- c("mydf", "mydf2")

  save_to_db <- function(df_to_save) {
    
    db_name <- dbplyr::in_schema(myschema, deparse(substitute(df_to_save)))
    
    df_to_save <- get(df_to_save)

    dplyr::copy_to(
      con,
      df_to_save,
      name = db_name,
      overwrite = T,
      temporary = F
    )

  }

  map(df_list, save_to_db)

I am getting errors:

Error in `map()`:
ℹ In index: 1.
Caused by error:
! Nested transactions not supported.
Run `rlang::last_trace()` to see where the error occurred.
Called from: signal_abort(cnd, .file)

I tried many variants of the above code but with no luck. Any advices would be appreciated...thanks in advance.

technocrat · September 11, 2023, 10:18pm

I'm too lazy to set up a db to test this, but it should work.

mydf <- tribble(
  ~col1, ~col2, ~col3,
  50, 500, 5000,
  1000, 2000, 3000
)

mydf2 <- tribble(
  ~col1, ~col2, ~col3,
  0, 1, 2,
  666, 777, 888
)

schema <- "test_schema"

db_name <- dbplyr::in_schema(schema, "mydf")

con = # your connection
  
f <- function(x) {
  dplyr::copy_to(
  con,
  x,
  name = db_name,
  overwrite = T,
  temporary = F)
}

# assuming all your mydf objects are in the global
# environment with consistent names

picklist <- ls(pattern = "^mydf.*")

sapply(picklist, f)

No, it's not purrr::map because I find that approach too difficult.

rcepka · September 12, 2023, 7:29am

@ technocrat : Thank you for response. All my data frame objects are in the global environment but do not have consistent names. Names used here are just an example. Actually I append additional information to data frame names , so I find your way of selecting df by regex pattern difficult to implement (at least for me) and not much error prone.
If you can find few minutes of time, can you please explain in few sentences to R newbies like me why the purrr:map() approach is problematic and difficult?

technocrat · September 12, 2023, 8:31am

If your data frame names are inconsistent, you can still use the ls() entry if you don't have a vector or list of names. Just remove anything else in the environment.

purrr::map() isn't problematic—it works as intended. It's difficult for me because I've never developed a mental model that makes it practical to use. I might do better if I didn't find myself trying to use it in {tidyverse} mode. For me, trying to an all-tidy script swamps my syntax cache and every other function involves parsing the signature. Although base style makes more use of delimiters, which brings back bad memories of school algebra equations, I find it ultimately simpler and more intuitive.

On the other hand, everyone's R experience differs. I started using it in 2007, along with Python and SQL, after a long time away from my brief experience with FORTRAN in 1965 and C in the 80s. At first, I was happy with the tidyverse tools, in part due to the halo effect of {ggplot2}, which is really more of a domain specific language rather than specifically tidy. Ten years in I started working with data that asked for more in the way of preprocessing and I began to see advantages in using {base} tools.

This came about, I think, by realizing that putting numeric data in a vector or matrix makes subsetting and numeric operations far easier. For example, compare the datasets iris and iris3, a data frame and a matrix, respectively. If I wanted to double each of the numeric values, what I'd usually do in dplyr (don't claim it's the best way to do in tidy, just the easiest conceptually for me) compared to how I'd do it with linear algebra

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data(iris)

doubled <- iris |> 
  mutate(
    Sepal.Length = Sepal.Length * 2,
    Sepal.Width  = Sepal.Width  * 2,
    Petal.Length = Petal.Length * 2,
    Petal.Width  = Petal.Width  * 2
  )
head(doubled)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         10.2         7.0          2.8         0.4  setosa
#> 2          9.8         6.0          2.8         0.4  setosa
#> 3          9.4         6.4          2.6         0.4  setosa
#> 4          9.2         6.2          3.0         0.4  setosa
#> 5         10.0         7.2          2.8         0.4  setosa
#> 6         10.8         7.8          3.4         0.8  setosa
data(iris3)
doubled2 <- iris3*2
head(doubled2)
#> , , Setosa
#> 
#>      Sepal L. Sepal W. Petal L. Petal W.
#> [1,]     10.2      7.0      2.8      0.4
#> [2,]      9.8      6.0      2.8      0.4
#> [3,]      9.4      6.4      2.6      0.4
#> [4,]      9.2      6.2      3.0      0.4
#> [5,]     10.0      7.2      2.8      0.4
#> [6,]     10.8      7.8      3.4      0.8
#> 
#> , , Versicolor
#> 
#>      Sepal L. Sepal W. Petal L. Petal W.
#> [1,]     14.0      6.4      9.4      2.8
#> [2,]     12.8      6.4      9.0      3.0
#> [3,]     13.8      6.2      9.8      3.0
#> [4,]     11.0      4.6      8.0      2.6
#> [5,]     13.0      5.6      9.2      3.0
#> [6,]     11.4      5.6      9.0      2.6
#> 
#> , , Virginica
#> 
#>      Sepal L. Sepal W. Petal L. Petal W.
#> [1,]     12.6      6.6     12.0      5.0
#> [2,]     11.6      5.4     10.2      3.8
#> [3,]     14.2      6.0     11.8      4.2
#> [4,]     12.6      5.8     11.2      3.6
#> [5,]     13.0      6.0     11.6      4.4
#> [6,]     15.2      6.0     13.2      4.2

^{Created on 2023-09-12 with reprex v2.0.2}

HanOostdijk · September 12, 2023, 1:46pm

I could not get the dplyr::copy_to (because of the RSQLite dbase?) working so my example of a purrr version is slightly different.

Reading @rcepka I think the problem will not be the difference between purrr and sapply but determining picklist.

library(DBI)
library(RSQLite)
library(tibble)

mydf <- tribble(
  ~col1, ~col2, ~col3,
  50, 500, 5000,
  1000, 2000, 3000
)

mydf2 <- tribble(
  ~col1, ~col2, ~col3,
  0, 1, 2,
  666, 777, 888
)

picklist <- ls(pattern = "^mydf.*")

con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbListTables(con) 
#> character(0)

purrr::walk(picklist,~DBI::dbWriteTable(con, . , get(.)))
DBI::dbListTables(con) 
#> [1] "mydf"  "mydf2"

DBI::dbDisconnect(con) 
Created on 2023-09-12 with reprex v2.0.2

nirgrahamuk · September 12, 2023, 1:53pm

This error is likely not to do with the choice of iteration (map) but rather something about the dplyr::copy_to function call , the data you want to transfer and the database (which does not support nested transactions)
based on my quick reading of Support running copy_to() without transactions · Issue #368 · tidyverse/dbplyr · GitHub
You might try in_transaction = FALSE to see if it helps

dplyr::copy_to(
      con,
      df_to_save,
      name = db_name,
      overwrite = T,
      temporary = F,
      in_transaction = FALSE
    )

rcepka · September 19, 2023, 3:13pm

Many thanks guys for your inputs.

It looks that the problem was in deparse(substitute(df_to_save)) , it did not get processed here in a way I was expecting.

Here is a working solution

save_to_db <- function(df, name) {
  db_name <- dbplyr::in_schema(schema, name)
  
  dplyr::copy_to(
    con,
    df,
    name = db_name,
    overwrite = T,
    temporary = F
  )
}

schema <- "aux"

df_list <- list(mydf = mydf, mydf2 = mydf2)
purrr::imap(df_list, save_to_db)

Full post, credit to mgirlich
https://github.com/tidyverse/dbplyr/issues/1361#issuecomment-1722854675

system · September 26, 2023, 3:13pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.