Efficient use of purrr for multiple pairwise calculations in dataframe

joaquin_ar · February 2, 2018, 12:59am

Hello!

I am trying to use a purrr approach to make pairwise calculations with specific columns of a dataframe and I am wondering if it is a good idea in terms of speed and memory efficiency. The steps I have followed are:

Create a dataframe using expand.grid() that contains the column names I want to use in each calculation.

pairwise_df <- expand.grid(columns1, columns1, stringsAsFactors = FALSE)

Define a function that takes 3 arguments: the names of the two columns and the dataframe with the data of interest.

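do_something <- function(col1, col2, df) {
  # mean of the element-wise sum of the two selected columns;
  # [[ ]] returns a plain vector for both data.frames and tibbles
  value <- mean(df[[col1]] + df[[col2]])
  return(value)
}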

Use mutate() and map2() to add a new column to pairwise_df with the results of the calculation

pairwise_df <- pairwise_df %>%
  mutate(calculation = map2_dbl(.x = Var1, .y = Var2, .f = do_something,
                                df = data))
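
Put together, the three steps can be run end to end as in the sketch below. The toy `data` and the `columns1` vector are made up for illustration only; they are not my real data.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data and column selection (assumptions for illustration)
data <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
columns1 <- c("a", "b")

# Step 1: every combination of the selected column names
pairwise_df <- expand.grid(columns1, columns1, stringsAsFactors = FALSE)

# Step 2: function taking two column names and the data frame
do_something <- function(col1, col2, df) {
  mean(df[[col1]] + df[[col2]])
}

# Step 3: one result per row of pairwise_df
pairwise_df <- pairwise_df %>%
  mutate(calculation = map2_dbl(Var1, Var2, do_something, df = data))

print(pairwise_df)
```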

The previous code is just an example to illustrate the idea. What I am wondering is whether the dataframe is being copied on each iteration, which would make this an inefficient strategy.

Thanks a lot!

mishabalyasin · February 2, 2018, 10:51am

If you can put it into a reprex, it'll help to move things along.
As of now I'm not really certain what exactly you are trying to achieve with map2_dbl. Do you want to take 2 values from each column and combine them in some way? There is the base::rowMeans function, which seems to do what you want without any mapping.

cderv · February 2, 2018, 6:25pm

Not sure it will solve your problem, but it could help you anyway.
For pairwise operation in a tidy workflow, you may find this useful:

joaquin_ar · February 2, 2018, 9:18pm

Hi mishabalyasin, thanks for your answer. I didn't attach a reprex because my question is not about a coding problem but about a more theoretical point. Anyway, I will code a toy example and send it.

Let me try to explain my question better. Let’s say I have a dataframe with 10 columns. I want to make a calculation using specific pairs of columns, but not between all of them. My approach to achieve it is:

Create a custom_function that receives as arguments the names of two columns and a dataframe, and returns the value of the calculation.
Store in two vectors the names of the columns that compose each pair of interest. For example, vectorA[1] and vectorB[1] form the first pair of columns.
Using map2(), pass both vectors as .x and .y, and custom_function as .f, with the dataframe as an additional argument.
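
A minimal sketch of these three steps follows. The data frame `my_df`, its columns, and the use of a correlation as the calculation are placeholders I made up for illustration; the real calculation would go inside custom_function.

```r
library(purrr)

# Hypothetical data frame (assumption for illustration)
my_df <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 6, 8),
  z = c(4, 3, 2, 1)
)

# Step 1: takes two column names and a data frame, returns one number.
# cor() is only a stand-in for the real calculation.
custom_function <- function(col_a, col_b, df) {
  cor(df[[col_a]], df[[col_b]])
}

# Step 2: vectorA[i] and vectorB[i] form the i-th pair of interest
vectorA <- c("x", "x")
vectorB <- c("y", "z")

# Step 3: iterate over the pairs; the data frame is passed through `...`
results <- map2_dbl(vectorA, vectorB, custom_function, df = my_df)

# Label each result with its pair of column names
names(results) <- paste(vectorA, vectorB, sep = "_")
print(results)
```

Only the pairs listed in the two vectors are evaluated, so there is no need to build the full grid of combinations first.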

Although this works, I’m wondering if it is efficient. Is the dataframe being copied in each call to the custom_function? In other words, if there is a dataframe in the current environment and I pass it as an argument to map(), is the dataframe copied in each iteration?
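
One way to check this empirically is base R's tracemem(), which prints a message whenever the traced object is duplicated. R uses copy-on-modify semantics, so passing a dataframe as an argument should not copy it by itself; a copy is only expected if the function modifies its argument. The sketch below uses a made-up `toy_data` and two hypothetical functions, and it only traces when R was built with memory profiling (checked with capabilities("profmem")).

```r
library(purrr)

# Hypothetical toy data (assumption for illustration)
toy_data <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# Only reads from df, so no duplication is expected
read_only <- function(col1, col2, df) {
  mean(df[[col1]] + df[[col2]])
}

# Modifies df locally, which should trigger a (shallow) copy per call
modifies_df <- function(col1, col2, df) {
  df$tmp <- df[[col1]] + df[[col2]]
  mean(df$tmp)
}

first  <- c("a", "a", "b")
second <- c("b", "c", "c")

# tracemem() needs R built with memory profiling support
can_trace <- capabilities("profmem")
if (can_trace) invisible(tracemem(toy_data))

# Expected: no tracemem messages for this call
res_read <- map2_dbl(first, second, read_only, df = toy_data)

# Expected: tracemem reports a duplication for each iteration
res_mod <- map2_dbl(first, second, modifies_df, df = toy_data)

if (can_trace) untracemem(toy_data)

print(res_read)
```

Both calls return the same numbers; the difference is only in whether tracemem reports copies. Even in the modifying case the copy is shallow: the column vectors themselves are shared until one of them is changed.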

Thanks

joaquin_ar · February 2, 2018, 9:22pm

Hi cderv, thanks for your reply!
I wasn't aware of this great package; however, I'm wondering what the best way is to do it with dplyr and purrr.

Best

EconomiCurtis · March 12, 2019, 4:12pm

A post was split to a new topic: memory issues with mutate-map workflow?