Iterating through grouped operations stored in a list

I've got a very challenging iteration puzzle that I could use some advice on.

Here's a toy example of a data frame storing raw data (prices of fruit, say).

prices <- data.frame(seller_id = c("A", "B", "C"),
                     apple_1_a  = c(1, 2, 1),
                     apple_1_b  = c(1, 1, 3),
                     apple_2_a  = c(2, 4, 1),
                     apple_2_b  = c(3, 1, 2),
                     orange_1   = c(5, 5, 5),
                     orange_2   = c(6, 2, 7))
prices
#>   seller_id apple_1_a apple_1_b apple_2_a apple_2_b orange_1 orange_2
#> 1          A         1         1         2         3        5        6
#> 2          B         2         1         4         1        5        2
#> 3          C         1         3         1         2        5        7

I'd like to calculate some summary statistics from this data frame according to operations stored in a list like this:

recipe <- list(list(category = "apples",
                    aggregation = "mean",
                    items = list(list(category = "apple_1",
                                      aggregation = "sum",
                                      items = c("apple_1_a", "apple_1_b")),
                                 list(category = "apple_2",
                                      aggregation = "sum",
                                      items = c("apple_2_a", "apple_2_b")))),
               list(category = "oranges",
                    aggregation = "mean",
                    items = c("orange_1", "orange_2")))

This can be read as: I'd like to calculate the price of apples as the mean of the prices of apple 1 and apple 2, where each of those is itself the sum of the prices of parts a and b of that apple; similarly, I'd like to calculate the price of oranges as the mean of the prices of orange 1 and orange 2.

Those results would look like this:

#>   seller_id apple_1 apple_2 apples oranges
#> 1          A       2       5    3.5     5.5
#> 2          B       3       5    4.0     3.5
#> 3          C       4       3    3.5     6.0

I'm looking for an implementation that:

  1. Is fast. These computations are the key functionality of a larger code base whose utility will depend on this part being fast.
  2. Is sufficiently general to work on arbitrary aggregation functions and computations (recipes) that are nested 2-5 levels deep.

What I've tried so far:

  1. Converting the structure of the recipe from a nested list to a flat list
    recipe_flat <- list(
                   list(category = "apple_1",
                        aggregation = "sum",
                        items = c("apple_1_a", "apple_1_b")),
                   list(category = "apple_2",
                        aggregation = "sum",
                        items = c("apple_2_a", "apple_2_b")),
                   list(category = "apples",
                        aggregation = "mean",
                        items = c("apple_1", "apple_2")),
                   list(category = "oranges",
                        aggregation = "mean",
                        items = c("orange_1", "orange_2")))
    
    Essentially what I want is a bottom-up traversal of recipe, but from what I've read, purrr doesn't have that kind of recursion built in (I've sketched what I mean at the end of this list).
  2. A naive for-loop implementation works just fine but is slow.
    library(microbenchmark)
    microbenchmark({
        for(i in 1:length(recipe_flat)) { 
            for(j in 1:nrow(prices)) {
                prices[j, recipe_flat[[i]]$category] <- get(recipe_flat[[i]]$aggregation)(
                    unlist(prices[j, recipe_flat[[i]]$items]))
            }
        }
    })
    #> Unit: milliseconds
    #>                                                                                                                                                                                                                                              
    #>  {     for (i in 1:length(recipe_flat)) {         for (j in 1:nrow(pric..etc... }
    #>       min       lq     mean   median      uq      max neval
    #>  8.287502 8.418688 8.772099 8.491642 8.63624 11.45223   100
    
  3. Here's a faster version with apply() and walk().
    # utility functions
    agg_items <- function(prices, recipe_item) {
        get(recipe_item$aggregation)(prices[recipe_item$items])
    }
    
    apply_recipe <- function(recipe_item, prices) {
        prices[[recipe_item$category]] <<- apply(prices, 1, 
                                                 agg_items, 
                                                 recipe_item = recipe_item)
    }
    
    # run computation
    microbenchmark({
        purrr::walk(recipe_flat, \(x) apply_recipe(recipe_item = x, prices = prices))
    })
    #> Unit: microseconds
    #>  {     purrr::walk(recipe_flat, function(x) apply_recipe(recipe_i. ... etc..) }
    #>      min      lq     mean  median      uq      max neval
    #>  781.012 789.082 958.4611 826.544 898.657 7672.474   100
    
    This one is faster, but the use of <<- makes me uneasy. I'm stuck with it, though, because I can't figure out how else to make the output of one step available as input to the next (see the Reduce() sketch just below).
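
For reference, here's a rough sketch of the kind of thing I mean: a recursive flattener for the nested recipe, combined with Reduce() so the updated data frame is threaded from one step to the next instead of using <<-. The helper names (flatten_recipe(), apply_step()) are just made up, and I've only eyeballed this against the toy data:

flatten_recipe <- function(recipe) {
  out <- list()
  for (node in recipe) {
    if (is.list(node$items)) {
      # recurse into nested items first so children come before parents,
      # then replace them with their category names
      out <- c(out, flatten_recipe(node$items))
      node$items <- vapply(node$items, function(x) x$category, character(1))
    }
    out <- c(out, list(node))
  }
  out
}

apply_step <- function(prices, step) {
  # subset to the item columns so the character seller_id column stays out of the way
  prices[[step$category]] <- apply(prices[step$items], 1, step$aggregation)
  prices
}

# Reduce() passes the growing data frame from one step to the next,
# so no global assignment is needed
result <- Reduce(apply_step, flatten_recipe(recipe), init = prices)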

Any suggestions?

One thing that's slow in your for loop is that you're looping over the rows one by one. A slightly faster version:

for(r in recipe_flat) { 
  prices[[r$category]] <- apply(prices[r$items], 1, r$aggregation)
}

and for the speed:

bench::mark(
  by_row = {
    for(i in 1:length(recipe_flat)) { 
      for(j in 1:nrow(prices)) {
        prices[j, recipe_flat[[i]]$category] <- get(recipe_flat[[i]]$aggregation)(
          unlist(prices[j, recipe_flat[[i]]$items]))
      }
    }
    prices
  },
  with_apply = {
    for(r in recipe_flat) { 
      prices[[r$category]] <- apply(prices[r$items], 1, r$aggregation)
    }
    prices
  }
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 by_row       5.64ms   5.92ms      166.     117KB     11.1
#> 2 with_apply   2.75ms    2.9ms      312.     171KB     10.5

In any case, there is still some looping over rows, which will be slower in R (it's optimized for column-wise operations).

If you have much bigger datasets, it could be worth extracting just the data frames of interest, i.e. prices[r$items], and using matrix functions such as rowSums() and rowMeans().
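
For example, something along these lines (a rough sketch that only handles the sum and mean cases specially):

for (r in recipe_flat) {
  block <- as.matrix(prices[r$items])
  prices[[r$category]] <- switch(r$aggregation,
    sum  = rowSums(block),
    mean = rowMeans(block),
    # fallback for any other aggregation function
    apply(block, 1, r$aggregation))
}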

Or in that case, if speed is really a concern, writing a helper function in C++ could be the most practical solution.

Another possible solution, though it gets quite tortured and I wouldn't recommend it, would be a mix of mutate() and across(). Technically that's what they're meant for (adding columns based on other ones), but since you're in a pretty unnatural situation, I think that's a recipe for headaches.
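
Just to show what a single step would look like with your flat recipe (not a recommendation):

library(dplyr)

# one hard-coded step; doing this programmatically for every recipe
# element is where it starts to get painful
prices <- prices |>
  mutate(apple_1 = rowSums(across(all_of(c("apple_1_a", "apple_1_b")))))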

Thanks very much for your thoughts! I like the single loop + apply() version; the slight slowdown versus walk() seems worth it to avoid the global assignment.

Indeed, matrix operations would likely be a very fast way to implement this. I might still explore that, but for now I've put it on hold because the aggregations we're doing are a bit more expansive than sum() and mean() suggest (e.g. a weighted mean that drops the n lowest numbers before computing). Reframing those in terms of a sum, a mean, a crossproduct, or similar would take some work to prepare the matrix. We'd also like to leave the door open to future, even more complex operations, and extending that functionality if it's implemented as matrix operations sounds challenging (and might run into cases where the pre-processing costs more than it saves).
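
For example, one of our aggregations looks roughly like this (simplified, names made up):

# weighted mean that ignores the n lowest prices (sketch)
drop_min_wmean <- function(x, w = rep(1, length(x)), n_drop = 1) {
  keep <- if (n_drop > 0) order(x)[-seq_len(n_drop)] else seq_along(x)
  weighted.mean(x[keep], w[keep])
}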

I think we'll explore mutate() and across() and perhaps pmap, which seems better suited given our data structure. If we're still not fast enough, then maybe I'll have to dust off my C++ skills. Haven't done that since my dissertation!

Thanks again for the help.

Great! A couple of additional comments:

With matrix operations, I don't only mean the pre-written functions: if you are willing to store the dataset as a matrix instead of a data.frame, the indexing itself works differently, which might speed up every operation that relies on subsetting columns. The driving principle is that the first step inside apply() is to convert to a matrix, so if you've already converted you gain a bit of time.
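
Roughly what I have in mind, as an untested sketch that reuses your flat recipe and assumes seller_id is the only non-numeric column:

# keep the numeric columns in a matrix so that column subsetting and apply()
# skip the repeated as.matrix() conversion
price_mat <- as.matrix(prices[setdiff(names(prices), "seller_id")])

for (r in recipe_flat) {
  new_col <- apply(price_mat[, r$items, drop = FALSE], 1, r$aggregation)
  price_mat <- cbind(price_mat, new_col)
  colnames(price_mat)[ncol(price_mat)] <- r$category
}

# re-attach the id column if you need a data frame again at the end
result <- cbind(prices["seller_id"], as.data.frame(price_mat))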

Which leads to the second point: if you look at the source code of apply(), it's actually doing a for loop. So I was mistaken before: the slow part of your code wasn't the nested loops as such; it could be the call to unlist() (determining that would take some more work).

As apply() is written in a general way, to handle arrays of any (named) dimensions, I now suspect the easiest way to speed things up might be to write a more limited version of apply() suited specifically to your situation. Here is a basic version:

apply_row <- function(X, FUN){
  FUN <- match.fun(FUN)
  # apply() does this conversion too; if X is already a matrix it's a no-op
  X <- as.matrix(X)
  
  n <- nrow(X)
  res <- double(n)
  # loop over the rows, applying FUN to each one
  for(row in seq_len(n)){
    res[[row]] <- forceAndCall(1, FUN, X[row,])
  }
  res
}

If you test this one, it is only very slightly faster than apply(), but maybe it gives ideas as to how to speed it up.
