I've got a very challenging puzzle in iteration that I could use advice on.
Here's a toy example of a data frame storing raw data (prices of fruit?).
prices <- data.frame(seller_id = c("A", "B", "C"),
apple_1_a = c(1, 2, 1),
apple_1_b = c(1, 1, 3),
apple_2_a = c(2, 4, 1),
apple_2_b = c(3, 1, 2),
orange_1 = c(5, 5, 5),
orange_2 = c(6, 2, 7))
prices
#> seller_id apple_1_a apple_1_b apple_2_a apple_2_b orange_1 orange_2
#> 1 A 1 1 2 3 5 6
#> 2 B 2 1 4 1 5 2
#> 3 C 1 3 1 2 5 7
I'd like to calculate some summary statistics from this data frame according to operations stored in a list like this:
recipe <- list(list(category = "apples",
aggregation = "mean",
items = list(list(category = "apple_1",
aggregation = "sum",
items = c("apple_1_a", "apple_1_b")),
list(category = "apple_2",
aggregation = "sum",
items = c("apple_2_a", "apple_2_b")))),
list(category = "oranges",
aggregation = "mean",
items = c("orange_1", "orange_2")))
This can be read as: I'd like to calculate the price of apples as the mean of the prices of apple 1 and apple 2, where each of those prices is calculated as the sum of the prices of parts a and b of that apple. I'd also like to calculate the price of oranges as the mean of the prices of orange 1 and orange 2.
Those results would look like this:
#> seller_id apple_1 apple_2 apples oranges
#> 1 A 2 5 3.5 5.5
#> 2 B 3 5 4.0 3.5
#> 3 C 4 3 3.5 6.0
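To make the target concrete, here is a hand-written version of the same arithmetic using base R's rowSums() and rowMeans(). It is only a check on the expected table for this toy data, not the general solution I'm after; the name expected is just for illustration.

```r
# Toy data repeated so the snippet runs on its own
prices <- data.frame(seller_id = c("A", "B", "C"),
                     apple_1_a = c(1, 2, 1),
                     apple_1_b = c(1, 1, 3),
                     apple_2_a = c(2, 4, 1),
                     apple_2_b = c(3, 1, 2),
                     orange_1 = c(5, 5, 5),
                     orange_2 = c(6, 2, 7))

# Hard-coded version of the recipe: inner sums first, then outer means
expected <- data.frame(seller_id = prices$seller_id)
expected$apple_1 <- unname(rowSums(prices[c("apple_1_a", "apple_1_b")]))
expected$apple_2 <- unname(rowSums(prices[c("apple_2_a", "apple_2_b")]))
expected$apples  <- unname(rowMeans(expected[c("apple_1", "apple_2")]))
expected$oranges <- unname(rowMeans(prices[c("orange_1", "orange_2")]))
expected
```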
I'm looking for an implementation that:
- Is fast. These computations are the key functionality of a larger code base whose utility will depend on this part being fast.
- Is sufficiently general to work on arbitrary aggregation functions and computations (recipes) that are nested 2-5 levels deep.
What I've tried so far:
- Converting the structure of the recipe from a nested list to a flat list:
recipe_flat <- list(
  list(category = "apple_1",
       aggregation = "sum",
       items = c("apple_1_a", "apple_1_b")),
  list(category = "apple_2",
       aggregation = "sum",
       items = c("apple_2_a", "apple_2_b")),
  list(category = "apples",
       aggregation = "mean",
       items = c("apple_1", "apple_2")),
  list(category = "oranges",
       aggregation = "mean",
       items = c("orange_1", "orange_2")))
Essentially what I want is a bottom-up traversal of recipe, but from what I've read, purrr doesn't build in any of that recursive behavior.
- A naive for-loop implementation works just fine but is slow.
library(microbenchmark)
microbenchmark({
  for(i in 1:length(recipe_flat)) {
    for(j in 1:nrow(prices)) {
      prices[j, recipe_flat[[i]]$category] <-
        get(recipe_flat[[i]]$aggregation)(
          unlist(prices[j, recipe_flat[[i]]$items]))
    }
  }
})
#> Unit: milliseconds
#>
#> { for (i in 1:length(recipe_flat)) { for (j in 1:nrow(pric..etc... }
#> min lq mean median uq max neval
#> 8.287502 8.418688 8.772099 8.491642 8.63624 11.45223 100
- Here's a faster version with apply() and walk().
# utility functions
agg_items <- function(prices, recipe_item) {
  get(recipe_item$aggregation)(prices[recipe_item$items])
}
apply_recipe <- function(recipe_item, prices) {
  prices[[recipe_item$category]] <<- apply(prices, 1, agg_items, recipe_item = recipe_item)
}
# run computation
microbenchmark({
  purrr::walk(recipe_flat, \(x) apply_recipe(recipe_item = x, prices = prices))
})
#> Unit: microseconds
#> { purrr::walk(recipe_flat, function(x) apply_recipe(recipe_i. ... etc..) }
#> min lq mean median uq max neval
#> 781.012 789.082 958.4611 826.544 898.657 7672.474 100
This one is faster, but the use of <<- makes me uneasy. I'm stuck with it, though, because I can't figure out how else to make the output of one loop available as input to the next loop.
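One direction I've been sketching, in case it helps frame answers: flatten the nested recipe with a small recursive function (children before parents), then thread the data frame through the steps with base R's Reduce() so no <<- is needed. The helper names flatten_recipe, apply_step, and row_aggregators are my own, hypothetical ones. The sketch assumes each node's items is either all column names or all nested nodes, and it only special-cases sum and mean; I haven't benchmarked it.

```r
prices <- data.frame(seller_id = c("A", "B", "C"),
                     apple_1_a = c(1, 2, 1), apple_1_b = c(1, 1, 3),
                     apple_2_a = c(2, 4, 1), apple_2_b = c(3, 1, 2),
                     orange_1 = c(5, 5, 5), orange_2 = c(6, 2, 7))

recipe <- list(
  list(category = "apples", aggregation = "mean",
       items = list(
         list(category = "apple_1", aggregation = "sum",
              items = c("apple_1_a", "apple_1_b")),
         list(category = "apple_2", aggregation = "sum",
              items = c("apple_2_a", "apple_2_b")))),
  list(category = "oranges", aggregation = "mean",
       items = c("orange_1", "orange_2")))

# Bottom-up (post-order) flattening: emit a node's children before the node,
# and replace nested children by their category names.
flatten_recipe <- function(recipe) {
  steps <- list()
  for (node in recipe) {
    if (is.list(node$items)) {
      steps <- c(steps, flatten_recipe(node$items))
      item_names <- vapply(node$items, function(child) child$category,
                           character(1))
    } else {
      item_names <- node$items
    }
    steps <- c(steps, list(list(category = node$category,
                                aggregation = node$aggregation,
                                items = item_names)))
  }
  steps
}

# Vectorised row-wise versions of common aggregations (assumed mapping)
row_aggregators <- list(sum = rowSums, mean = rowMeans)

# Takes a data frame and one step, returns the data frame with one new column
apply_step <- function(df, step) {
  block <- df[step$items]
  fast <- row_aggregators[[step$aggregation]]
  values <- if (!is.null(fast)) {
    fast(block)
  } else {
    # Generic fallback for any other function name: slower, row by row
    apply(as.matrix(block), 1, match.fun(step$aggregation))
  }
  df[[step$category]] <- unname(values)
  df
}

# Reduce() passes the result of each step as input to the next one
recipe_steps <- flatten_recipe(recipe)
result <- Reduce(apply_step, recipe_steps, prices)
result[c("seller_id", "apple_1", "apple_2", "apples", "oranges")]
```

The idea is that each step is a pure function from data frame to data frame, so the accumulated result is passed along explicitly instead of being written to the global environment.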
Any suggestions?