Data Access Layer in R - best practices

As I find little to no content on data access layers in R, I would like to discuss this topic and hear about your practices and experiences.

Primarily, I would like to discuss the best way(s) to decouple the business/UI logic from the database layer using a data access layer in R.

Here are two approaches I'm already familiar with or am thinking about using.

1. Abstracting the database CRUD operations with R functions
Every DB resource (table, view, stored procedure, ...) gets CRUD R function partners:

  • insert_db_table(...)
  • read_db_table()
  • update_db_table(id, ...)
  • delete_db_table(id)

This approach fits well with R's functional paradigm and takes us a step further in the direction of a "data access layer". What I don't like about it is the decoupling between the data and the operations performed on it.

Let's take the function delete_db_table(id) as an example.
First, this function can only be executed on one particular data table and serves no other use case. This does not follow the functional programming principle that functions should be generalized. Written generally, the function would look like delete(table, id), which in turn would lead to a giant delete function that knows about all the different tables out there.
Second, the name of the data source to be accessed is encoded in the function name. That does not feel right to me.
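For concreteness, here is a minimal sketch of what approach 1 could look like, assuming a DBI backend (an in-memory RSQLite connection here) and a hypothetical table db_table with id and value columns:

```r
library(DBI)

# Hypothetical single-table setup for illustration
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE db_table (id INTEGER PRIMARY KEY, value TEXT)")

# One CRUD partner per operation, each hard-wired to db_table
insert_db_table <- function(value) {
  dbExecute(con, "INSERT INTO db_table (value) VALUES (?)", params = list(value))
}

read_db_table <- function() {
  dbGetQuery(con, "SELECT * FROM db_table")
}

update_db_table <- function(id, value) {
  dbExecute(con, "UPDATE db_table SET value = ? WHERE id = ?", params = list(value, id))
}

delete_db_table <- function(id) {
  dbExecute(con, "DELETE FROM db_table WHERE id = ?", params = list(id))
}
```

Note that con is captured from the enclosing environment here only to keep the sketch short; in a real package you would pass the connection (or a pool) explicitly.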

2. Using R6 classes to abstract DB resources
I searched in vain for anything on this topic (in connection with R). The idea is that every DB resource is hidden within an R6 class that provides data access functions. Let's say we want to access the table "DataTable" in our DB:

data_table <- DataTable$new() # init data table class object
data_table$delete(1) # perform operations

This again gives us a separation of the data access logic and also solves some of the issues I mentioned above. The data and the accessing functions are bundled together in the right scope. We no longer have to encode the resource names in our function names, as the class system solves this problem for us.
There may be issues, though, that I don't see right now.
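A minimal sketch of how such a class could look, assuming the R6 package and a DBI connection (here passed explicitly to new(), unlike the zero-argument snippet above):

```r
library(DBI)
library(R6)

# Each DB resource gets its own class; the table name and connection stay
# bundled with the operations instead of leaking into function names.
DataTable <- R6Class("DataTable",
  public = list(
    con = NULL,
    initialize = function(con) {
      self$con <- con
    },
    read = function() {
      dbGetQuery(self$con, "SELECT * FROM DataTable")
    },
    delete = function(id) {
      dbExecute(self$con, "DELETE FROM DataTable WHERE id = ?", params = list(id))
    }
  )
)
```

Usage then follows the pattern shown above: construct the object once, then call data_table$delete(1) and friends on it.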

Now I'm really interested to hear your opinions on and experiences with this topic!


Why would it be giant?
It could simply ask the DB whether the table it has been asked to delete from is there or not, and then, if it is there, perform the delete.

We had better take the add function, as the problem is more evident there (I see the same issue in delete, though).
Let's say we have 100 DB tables we want to add items to. The R function would look something like this:
add(table, ...), where ... are the named items to add to the table.

As every DB table needs a different set of very specific ...-arguments, the R add function has to check the input for every table separately and then build an SQL query for every table separately (in most cases).
The issue gets no better when we're dealing with multiple DB systems.

That leads to 100 mini-functions (check table-specific args + build query) within the add function. Such a huge function horrifies me somewhat.
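For the plain insert case at least, much of that per-table branching can be delegated to the driver: DBI's generic dbAppendTable() derives the INSERT statement from the column names it is given, so a single add() can serve many tables. A sketch (table and column names are hypothetical):

```r
library(DBI)

# Generic add(): dbAppendTable() builds the INSERT from the names in `...`,
# so no per-table mini-function is needed for simple appends.
add <- function(con, table, ...) {
  dbAppendTable(con, table, data.frame(..., stringsAsFactors = FALSE))
}
```

Validation of table-specific business rules would of course still have to live somewhere.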

If you use naming conventions when defining sub-functions, you should be able to construct their calls programmatically, without needing to register them explicitly in the function body or do joins etc.
Here is just an example (not involving a DB) showing the principle of metaprogramming.

function_that_knows_how_to_do_special_query_on_table_1 <- function(someparam){
  someparam  # table-1-specific logic would go here
}

function_that_knows_how_to_do_special_query_on_table_2 <- function(someparam, andanother){
  andanother + someparam^2
}

mygeneral_query <- function(table, ...){
  if(!table %in% 1:2){
    stop("You asked to operate on a table I don't know about")
  }
  do.call(
    what = paste0("function_that_knows_how_to_do_special_query_on_table_", table),
    args = list(...)
  )
}

mygeneral_query(2, someparam = 3, andanother = 1)  # dispatches to table 2's function; returns 10

Hi nirgrahamuk,

At my work, my organization actually has an ETL data access layer; we wrote our own R package for it way back in 2002. However, it's a bit of an old package, so while it works, none of the maintainers know how to get it working with GitHub Actions or GitLab. We also later had to extend access beyond R and (because plumber didn't exist in 2002) create a front end for the general public to access the same data layer without R.
I am hoping my organization's use case might be similar to your current needs, so I am going to list the 12 basic principles and then ask which ones you would like me to talk about in more detail in terms of best practices:

  1. Ability to Debug The ability to debug data_table$new() using traceback() or an equivalent command.
  2. Parsing Automated fixing of coding errors in SQL, such as variable names using ., $, or _, and fixing the use of "" instead of "/"
  3. Permissions Data Access Layer Ability to manually set up group permissions/policies or integrate group-level policies and permissions (part 1 of solving the delete issue you mentioned)
  4. Admin Security Using a service account and a password-resetting admin account to manage permissions and tables (part 2 of solving the delete issue you mentioned)
  5. Data Security Modern password and database protection through keyrings, MFA, and SAML (part 3 of solving the delete issue you mentioned, but also good ETL practice)
  6. Column Definitions layer
  7. Data Management and Validation Layer
  8. Automation and Bots Automation of the other bullet points. You can use RStudio Connect for this.
  9. Row-Level or Table-Level Permissions After a permission layer is established, you can then set up row-level permissions. More or less a HIPAA requirement.
  10. Metadata Layer Separate from the other layers, you need a layer containing a description of the dataset, a link to any reference or source, date modified, date created, person modified, person created, contact email of the maintainer/creator, and a contact link for the maintainer/creator. Each table requires this layer to be filled out, but the user has the option of picking either email or link.
  11. Alias Layer
  12. Function and Magrittr Layer In addition to built-in functions, users can add their own custom functions based on a set of rules. The functions can then be used to better define the other layers. So instead of storing a column, maybe it just stores a function per row, then calls that function every time it needs to run.

These are the basic principles that govern our data access and are the backbone behind the R package whose public output is this: . There are also private internal outputs, as you might imagine.

So where would you like me to start or go into more detail?

That's an interesting concept. It strongly reminds me of R's S3/S4 class system, which uses the same basic idea, just using object classes to dispatch the right method.

The main point I would like to discuss is whether it is appropriate to use the R6 class system for data access.
Ordinary R functions, as well as the S3/S4 class systems, separate data from the functionality applied to it. I think this is very useful when transforming and working with data.
I'm not sure it is still the best solution for accessing data, as at that point data and functionality come in pairs and are tightly coupled. Here R6 seems to me a better fit.

I had never heard of data access layers; the term seems to be closely associated with Microsoft products and stored procedures? So I might completely misunderstand your question. In the Python world you can use SQLAlchemy to interact with database objects. I don't think there is a similar package in R.

In a more generic way, the DBI package provides a common interface on top of databases, so the same connection and query functions work across backends, with each driver handling the particulars of its database.

On a higher level, dbplyr can be used to work with a database table while pretending it is a data frame.
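A minimal sketch of that lazy approach, assuming dplyr/dbplyr and a hypothetical in-memory RSQLite table named DataTable:

```r
library(DBI)
library(dplyr)  # dplyr verbs; dbplyr translates them to SQL behind the scenes

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "DataTable", data.frame(id = 1:3, value = c("a", "b", "c")))

# tbl() gives a lazy reference to the table; nothing runs on the DB
# until collect() pulls the result back as a regular data frame.
result <- tbl(con, "DataTable") %>%
  filter(id > 1) %>%
  collect()
```

The filter() here is translated to a SQL WHERE clause, so the filtering happens in the database, not in R.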
