Functions with dataframes as input parameters

I am looking for guidance on how to design a package that makes data analysis of a specific database reproducible. The package does not need to generalize beyond this database, and I don't expect the database to change significantly in the next five years. I've inherited copy-pasted scripts that start with SQL queries with different WHERE clauses and then run the same 20-100 lines of dplyr code to transform and merge the data and create a report table.

I am starting with the idea of using dataframes as function inputs, but I'm not sure this is a good approach. I haven't seen many examples of other people doing this, and I'm working solo right now, so I'm hoping to hear from anyone who has tried something similar and whether it worked well for them.

Here's an example of what I am trying. Maybe I should define classes for the input tables to handle the table conformity checks?

foo <- function(iris, cars) {

  # check that iris/cars conform to what I expect them to be (required columns, uniqueness, missing values, etc.)
  if (!("Sepal.Length" %in% names(iris))) stop("iris must contain Sepal.Length")
  if (!("speed" %in% names(cars))) stop("cars must contain speed")

  # long analysis script that uses various columns from the input dataframes
  mean(iris$Sepal.Length) + mean(cars$speed)
}

foo(iris, cars)

There is nothing inherently odd about passing a dataframe (or tibble, if you are wandering through the Tidyverse) as an argument. For instance, the second argument of lm() is a dataframe (or something that can be coerced to be a dataframe).
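Concretely:

# the column names in the formula are looked up in the data frame passed as data
lm(Sepal.Length ~ Sepal.Width, data = iris)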

Right, passing a dataframe to a function is common. The distinction I am drawing is that the functions in my package won't accept just any dataframe, only dataframes that contain specific variables. So while lm() can accept "an optional data frame ... containing the variables in the model", my functions will accept only a specific dataframe. In the example, foo() would only accept iris and cars and would error on other dataframe inputs.

I've written a few Shiny apps that rely on the user to provide current versions of specific tables (as CSV files rather than dataframes, but the logic is similar). For each input table there is a function that takes a string containing the file name/path as input, imports the file to a dataframe and does various cleaning/formatting operations. The processing occurs inside a try-catch statement, and if anything goes wrong I generate an error message telling the user to check that they imported the correct file.
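In rough outline it looks like this (the column names here are made up for illustration, not from a real app):

import_enrollment <- function(path) {
  tryCatch({
    df <- read.csv(path, stringsAsFactors = FALSE)
    # conformity checks on the imported table
    stopifnot("student_id" %in% names(df), !any(duplicated(df$student_id)))
    # cleaning/formatting steps would go here
    df
  },
  error = function(e) {
    stop("Import failed -- check that you selected the correct enrollment file (",
         conditionMessage(e), ")", call. = FALSE)
  })
}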

That logic is similar to what you are doing, which seems fine to me, assuming that whatever code is using your package will process the stops gracefully.

I will point out one difference. Your code assumes that if a particular column is present, the dataframe is correct, meaning it would not detect a modified version of iris or cars (columns added, columns other than the one you check deleted, ...). That may be fine in your planned use cases, but doing the "long analysis script" inside a try-catch might be a tad safer.
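For example, reusing your foo() (just a sketch):

foo <- function(iris, cars) {
  if (!("Sepal.Length" %in% names(iris))) stop("iris must contain Sepal.Length")
  if (!("speed" %in% names(cars))) stop("cars must contain speed")

  tryCatch(
    # the long analysis script goes here
    mean(iris$Sepal.Length) + mean(cars$speed),
    error = function(e) {
      stop("analysis failed -- inputs may not be the tables foo() expects: ",
           conditionMessage(e), call. = FALSE)
    }
  )
}

That way anything unexpected inside the analysis (a renamed column, a type change) surfaces as one clear error instead of a confusing failure halfway through.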

This might not be the most appropriate approach, but there's nothing wrong with it per se as long as you have the necessary checks to ensure you get what you expect. It's hard to tell without seeing the actual code and how you want users to use it, though.

It sounds like your package will handle the connection to the database and fetch the data directly, which should give you more control. If that's the case, you could, for example, give a class to your data frame/tibble and check that the input inherits from that class first. This isn't 100% foolproof, so you should keep your other checks, but it would be sufficient in most cases.
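Roughly like this (a sketch assuming a DBI connection; fetch_enrollment(), the query, and the enrollment_tbl class are made-up names):

fetch_enrollment <- function(con, query = "SELECT * FROM enrollment") {
  df <- DBI::dbGetQuery(con, query)
  # tag the result so downstream functions can recognise where it came from
  class(df) <- c("enrollment_tbl", class(df))
  df
}

check_enrollment <- function(enrollment) {
  if (!inherits(enrollment, "enrollment_tbl")) {
    stop("enrollment must be a table returned by fetch_enrollment()")
  }
  invisible(enrollment)
}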

If you have several related functions interacting with the same data, you could use an R6 class to handle everything. That way you have the option to make the data inaccessible to the user for custom modifications, which will save you many input checks because you'll have full control over the data throughout the process. The downside is that the syntax of R6 classes might confuse some users; I've seen some developers wrap R6 objects in simple function calls to simplify the interface of a package.
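A minimal sketch of that pattern (class, field, and column names are invented):

library(R6)

Analysis <- R6Class("Analysis",
  public = list(
    initialize = function(enrollment, outcomes) {
      # validate once at construction, then keep the data private
      stopifnot(is.data.frame(enrollment), is.data.frame(outcomes))
      private$enrollment <- enrollment
      private$outcomes <- outcomes
    },
    graduation_rate = function(program) {
      # methods work on the validated private copies
      ids <- private$enrollment$student_id[private$enrollment$program == program]
      mean(private$outcomes$graduated[private$outcomes$student_id %in% ids])
    }
  ),
  private = list(
    enrollment = NULL,
    outcomes = NULL
  )
)

Users would then call something like Analysis$new(enrollment_df, outcomes_df)$graduation_rate("A"), or you can hide that behind plain functions.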

@prubin I think the Shiny app with a CSV upload is a pretty similar use case. You're totally right that the column-checking logic needs to be more robust. I've started with a function that merges two tables and checks that all(colsA %in% names(tableA)) && all(colsB %in% names(tableB)) && !any(colsB %in% names(tableA)) && ... (quick pseudocode; a fuller version is sketched below).
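Here's the kind of helper I mean (names are placeholders):

check_cols <- function(df, required, forbidden = character(0),
                       name = deparse(substitute(df))) {
  missing_cols <- setdiff(required, names(df))
  extra_cols <- intersect(forbidden, names(df))
  if (length(missing_cols) > 0) {
    stop(name, " is missing columns: ", paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  if (length(extra_cols) > 0) {
    stop(name, " should not contain columns: ", paste(extra_cols, collapse = ", "), call. = FALSE)
  }
  invisible(df)
}

# before merging
check_cols(tableA, required = colsA, forbidden = colsB)
check_cols(tableB, required = colsB, forbidden = colsA)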

@arangaca While the package will have helper functions to fetch data, I do want the user to be able to modify those queries easily. Also, given the messiness of the database, I definitely can't make the data inaccessible to the user, since they will want to investigate weird cases.

An example function would be something that takes a student enrollment table and a student outcome table and calculates graduation rates for program A. I'd want users (likely myself, in a year) to be able to easily re-run the analysis either for the next cohort of students in program A, or for the students in program B by supplying a different subset of the student enrollment table. I'd rather not hardcode a global year parameter at the top of the script, and I definitely don't want to copy-paste the script into a new file to run it for program B.
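Something like this sketch, where the column names are made up for illustration:

graduation_rate <- function(enrollment, outcomes, program, cohort_year) {
  stopifnot(all(c("student_id", "program", "cohort_year") %in% names(enrollment)),
            all(c("student_id", "graduated") %in% names(outcomes)))

  # restrict to the requested cohort, then look up their outcomes
  cohort <- enrollment[enrollment$program == program &
                         enrollment$cohort_year == cohort_year, ]
  mean(outcomes$graduated[outcomes$student_id %in% cohort$student_id])
}

# same analysis for a different program or cohort, no copy-pasted script
graduation_rate(enrollment, outcomes, program = "A", cohort_year = 2024)
graduation_rate(enrollment, outcomes, program = "B", cohort_year = 2024)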

I'm not sure yet how many different tables I will be pulling from the database. It could be different ones for most analyses, or I may end up using a relatively small subset of tables most of the time. It seems like in the first case I'd be better off with functions to do conformity checks, and in the latter case the investment in class definitions would be worth it.