Applying functions to a list within a tibble

IanW · March 1, 2025, 11:19pm

I am attempting to manipulate some data to create a spaghetti plot with something like this:

testData =
  tribble(~date, ~close,
          1, 1,
          2, 2,
          3, 3,
          4, 4,
          5, 5,
          6, 6,
          7, 7,
          8, 8
          )

historySize = 30

testData2 =
  testData %>% 
  arrange(desc(date)) %>%
  mutate(
    futureCloseList = accumulate(
      .x = row_number()[-1],    # apply the function on all rows, except the first one.
      .init = close[1],         # initial value for the first row.
      .f = function(last_result, i) {
        c(close[i], head(last_result, historySize))
      }
    ),
    futureCloseList = futureCloseList / close  # <<<--- problem here
    ) %>%
  arrange(date)  

td3 =
  testData2 %>% 
  unnest(cols = c(futureCloseList) ) %>%
  rename( futureClose = futureCloseList ) %>%
  group_by(date) %>%
    mutate( days = seq( from=0, to=n()-1, by=1),
            )

ggplot( data = td3, 
        aes( x = days, y = futureClose, group = date)
        ) + 
  geom_line()

> dput(testData2)
structure(list(date = c(1, 2, 3, 4, 5, 6, 7, 8), close = c(1, 
2, 3, 4, 5, 6, 7, 8), futureCloseList = list(c(1, 2, 3, 4, 5, 
6, 7, 8), c(2, 3, 4, 5, 6, 7, 8), c(3, 4, 5, 6, 7, 8), c(4, 5, 
6, 7, 8), c(5, 6, 7, 8), c(6, 7, 8), c(7, 8), 8)), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

The idea is to take a history of dated values ("close") and build a limited history list ("futureCloseList") in the tibble. This works fine. I then want to scale each list by the row's first value ("futureCloseList / close"), but I'm having trouble converting the list into a format that works. I've tried various calls to map and lapply, but haven't been able to get them to work.

Any suggestions of what to do would be much appreciated.

lorae · March 4, 2025, 5:02pm

TL;DR:
You can delete the problematic line when creating testData2 and instead divide the futureClose column by the close column when you create td3:

library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

testData =
    tribble(~date, ~close,
            1, 1,
            2, 2,
            3, 3,
            4, 4,
            5, 5,
            6, 6,
            7, 7,
            8, 8
    )

historySize = 30

testData2 =
 testData %>% 
 arrange(desc(date)) %>%
 mutate(
   futureCloseList = accumulate(
     .x = row_number()[-1],    # apply the function on all rows, except the first one.
     .init = close[1],         # initial value for the first row.
     .f = function(last_result, i) {
       c(close[i], head(last_result, historySize))
     }
   ) # <-- delete the problem line!
 ) %>%
 arrange(date)  

td3 =
 testData2 %>% 
 unnest(cols = c(futureCloseList) ) %>%
 rename( futureClose = futureCloseList ) %>%
 group_by(date) %>%
 mutate( 
   days = seq( from=0, to=n()-1, by=1),
   future_close_prop = futureClose / close # <-- new helpful line
 )

# Spaghetti plot
ggplot(
    data = td3, 
    aes( x = days, y = future_close_prop, group = date)
) + 
    geom_line()

Long answer:
I've run into similar frustrations when trying to manipulate lists within a data frame in R. While R allows for flexible data structures, (IMO) lists inside tibbles are best suited for cases where the number of elements per row is unpredictable (e.g., web scraping, natural language processing) or as an intermediate way to store data that doesn't require further calculations (e.g. bootstrapped model outputs). In this case, I recommend restructuring the data to a wide format, calculating future values using lead(), then reshaping it into long format for graphing and normalization against close.

library(purrr)
library(dplyr)
library(tidyr)
library(ggplot2)

testData =
  tribble(~date, ~close,
          1, 1,
          2, 2,
          3, 3,
          4, 4,
          5, 5,
          6, 6,
          7, 7,
          8, 8
  )

historySize <- 5  # Number of future periods

# Generate a data frame with leading data
future_cols <- map_dfc(1:historySize, function(i) {
  testData |> 
    transmute(!!paste0("lead.", i) := lead(close, i))
})

# Combine with original testData
testData_wide <- bind_cols(testData, future_cols)

# Make the data long for a spaghetti plot
testData_long <- testData_wide |>
  pivot_longer(
    cols = starts_with("lead."),
    names_to = "days_ahead",
    names_pattern = "lead\\.(\\d+)",  # escape the dot so it matches a literal "."
    values_to = "future_close"
  ) |>
    mutate(
      days_ahead = as.integer(days_ahead),
      future_close_prop = future_close/close
    )

# Spaghetti plot
ggplot(
  data = testData_long, 
  aes( x = days_ahead, y = future_close_prop, group = date)
) + 
  geom_line()

You already have the data structure I described—this approach simply provides a more direct way to generate it without needing embedded vectors. Is there a specific reason you need lists in this case?

arangaca · March 4, 2025, 6:11pm

For completeness, here's how to make it work with your nested tibble. As you correctly assumed, you need a function like map() or lapply():

suppressMessages({
  library(dplyr)
  library(tidyr)
  library(purrr)
})

testData =
  tribble(~date, ~close,
          1, 1,
          2, 2,
          3, 3,
          4, 4,
          5, 5,
          6, 6,
          7, 7,
          8, 8
  ) 

historySize = 30

testData2 =
  testData %>% 
  arrange(desc(date)) %>%
  mutate(
    futureCloseList = accumulate(
      .x = row_number()[-1],    # apply the function on all rows, except the first one.
      .init = close[1],         # initial value for the first row.
      .f = function(last_result, i) {
        c(close[i], head(last_result, historySize))
      }
    )
  ) %>%
  mutate(
    futureCloseList = map(futureCloseList, \(.x) .x / close),
    .by = close
  ) |>
  arrange(date)  

testData2 |>
  unnest(futureCloseList)
#> # A tibble: 36 × 3
#>     date close futureCloseList
#>    <dbl> <dbl>           <dbl>
#>  1     1     1             1  
#>  2     1     1             2  
#>  3     1     1             3  
#>  4     1     1             4  
#>  5     1     1             5  
#>  6     1     1             6  
#>  7     1     1             7  
#>  8     1     1             8  
#>  9     2     2             1  
#> 10     2     2             1.5
#> # ℹ 26 more rows

Created on 2025-03-04 with reprex v2.1.1
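
One caveat about the code above: the `.by = close` grouping only works as intended when every close value is unique, because within a group of several rows `close` is no longer a single number. A variant that does not depend on that is map2(), which walks the list column and `close` in parallel. Here is a minimal sketch on made-up toy data (the `toy` tibble and its duplicate close value are assumptions for illustration, not the original data):

```r
library(dplyr)
library(purrr)

# Toy data: `close` contains a duplicate on purpose (illustrative assumption).
toy <- tibble(
  date  = 1:3,
  close = c(2, 2, 4),
  futureCloseList = list(c(2, 2, 4), c(2, 4), 4)
)

# map2() pairs each list element with the close value of the same row,
# so each vector is divided by its own row's close.
scaled <- toy |>
  mutate(futureCloseList = map2(futureCloseList, close, \(x, y) x / y))

scaled$futureCloseList
```

This also shows why the original `futureCloseList / close` failed: `/` is not defined for a list, so the division has to happen inside a function that is applied to each element.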

If you have many nested objects, it will be more efficient to unnest your data, perform a transformation using vectorized functions and, if necessary, nest your data back, as suggested by @lorae.
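
As a rough sketch of that unnest, transform, re-nest round trip (again on made-up toy data; note that it rebuilds the list column with summarise() and list() rather than nest(), since nest() would give a list of tibbles instead of a list of plain vectors):

```r
library(dplyr)
library(tidyr)

# Toy data for illustration only (not the original testData).
toy <- tibble(
  date  = 1:3,
  close = c(1, 2, 4),
  futureCloseList = list(c(1, 2, 4), c(2, 4), 4)
)

renested <- toy |>
  # one row per list element; date and close are repeated alongside
  unnest(futureCloseList) |>
  # plain vectorized division, no map() needed
  mutate(futureCloseList = futureCloseList / close) |>
  # collapse back to one row per date, with a list column of numeric vectors
  summarise(futureCloseList = list(futureCloseList), .by = c(date, close))

renested
```

The `.by` argument of summarise() needs dplyr 1.1.0 or later; with an older version, group_by(date, close) before summarise() should give the same result.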

When working with nest(), it's often useful to visualize the structure of the nested data using print():

data |> 
  head() |> 
  mutate(var = map(var, print))

system · June 2, 2025, 6:11pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.