Using pmap with 2 tibbles

dmco · September 18, 2020, 3:32pm

I'm fairly new to tidyverse. I have managed to get this code to work with a small example, but I'm sure there must be a better way. Here's the problem I'm trying to code.

I have 2 tibbles, tA and tB (my actual tibbles have 100,000+ rows)

tA <- tibble::tribble(
~name1, ~name2,
"k",  "fw",
"k",  "im",
"k",  "fw",
"g",  "fw",
"g",  "im",
)

tB <- tibble::tribble(
~k_im, ~k_fw, ~g_im, ~g_fw,
0.031, 0.053, 0.000, 0.090,
0.209, 0.105, 0.000, 0.105,
0.274, 0.125, 0.158, 0.132,
0.331, 0.186, 0.199, 0.185,
0.344, 0.205, 0.201, 0.271,
0.367, 0.235, 0.272, 0.308,
0.382, 0.270, 0.295, 0.368,
0.390, 0.285, 0.299, 0.430,
0.397, 0.355, 0.348, 0.443,
0.484, 0.406, 0.419, 0.524,
0.532, 0.470, 0.430, 0.531,
0.557, 0.530, 0.468, 0.601,
0.609, 0.590, 0.530, 0.614,
0.646, 0.646, 0.593, 0.631,
0.712, 0.692, 0.644, 0.687,
0.730, 0.700, 0.652, 0.694,
0.793, 0.725, 0.706, 0.707,
0.845, 0.768, 0.766, 0.772,
0.862, 0.831, 0.814, 0.778,
0.886, 0.876, 0.863, 0.788,
0.887, 0.918, 0.896, 0.835,
1.000, 0.926, 0.918, 0.904,
1.000, 0.976, 0.969, 0.990,
1.000, 1.000, 1.000, 1.000
)

For every row in tA I need to generate a different random number between 0 and 1, then look up the appropriate variable in tB (so for row 1 in tA, that would be column k_fw in tB) and find the first row where the value in tB is greater than the random number.

This function does what I need, e.g. findX(tB, "k_im", 0.5) returns 11.

findX <- function(x,y,z){ match(TRUE,x[y]>z) }

This gives the right output but involves inserting tB into every row. I think I must be missing something!

output <- tA %>%
    mutate(name3 = paste0(name1,"_",name2)) %>%
    mutate(rnm = runif(5)) %>%
    mutate(xx = list(tB)) %>%
    mutate(rr = unlist(purrr::pmap(list(xx, name3, rnm), findX)))

Any thoughts? I think I've been staring at this for so long a fresh approach might be needed. Thanks!

ChrisL · September 18, 2020, 10:15pm

Another solution that's not miles away from yours but doesn't repeat tB:

output2 <- tA %>% 
  dplyr::mutate(rnm = runif(5)) %>%
  dplyr::mutate(rr = purrr::map2_int(paste0(name1, "_", name2), rnm, ~findX(tB, .x, .y)))

nirgrahamuk · September 18, 2020, 10:15pm

On my system this is faster and with a smaller memory footprint

library(tidyverse)
library(microbenchmark)
tA <- tibble::tribble(
  ~name1, ~name2,
  "k", "fw",
  "k", "im",
  "k", "fw",
  "g", "fw",
  "g", "im",
)

tB <- tibble::tribble(
  ~k_im, ~k_fw, ~g_im, ~g_fw,
  0.031, 0.053, 0.000, 0.090,
  0.209, 0.105, 0.000, 0.105,
  0.274, 0.125, 0.158, 0.132,
  0.331, 0.186, 0.199, 0.185,
  0.344, 0.205, 0.201, 0.271,
  0.367, 0.235, 0.272, 0.308,
  0.382, 0.270, 0.295, 0.368,
  0.390, 0.285, 0.299, 0.430,
  0.397, 0.355, 0.348, 0.443,
  0.484, 0.406, 0.419, 0.524,
  0.532, 0.470, 0.430, 0.531,
  0.557, 0.530, 0.468, 0.601,
  0.609, 0.590, 0.530, 0.614,
  0.646, 0.646, 0.593, 0.631,
  0.712, 0.692, 0.644, 0.687,
  0.730, 0.700, 0.652, 0.694,
  0.793, 0.725, 0.706, 0.707,
  0.845, 0.768, 0.766, 0.772,
  0.862, 0.831, 0.814, 0.778,
  0.886, 0.876, 0.863, 0.788,
  0.887, 0.918, 0.896, 0.835,
  1.000, 0.926, 0.918, 0.904,
  1.000, 0.976, 0.969, 0.990,
  1.000, 1.000, 1.000, 1.000
)

findX <- function(x, y, z) {
  match(TRUE, x[y] > z)
}


tA %>%
  mutate(name3 = paste0(name1, "_", name2)) %>%
  mutate(rnm = runif(5)) -> tA2

microbenchmark::microbenchmark(
  
  orig = output <- tA2 %>%
    mutate(xx = list(tB)) %>%
    mutate(rr = unlist(purrr::pmap(list(xx, name3, rnm), findX))),
  
  new = {
    rr <- map_int(
      1:nrow(tA2),
      ~ findX(tB, tA2$name3[.], tA2$rnm[.])
    ) %>% enframe(value = "rr", name = NULL)
    newout <- bind_cols(tA, rr)
  }
)

dmco · September 20, 2020, 2:44pm

Thanks @ChrisL I hadn't realised map2_int would work like that, very useful.

dmco · September 20, 2020, 2:48pm

Many thanks @nirgrahamuk, this is certainly a lot faster than my solution. Both answers work fine and get around the issue of repeating tB, but I have marked this as the solution as it seems more flexible when applied to my 'real' code. (I have to repeat the process several times adding a new 'rr' column each time.)

system · September 27, 2020, 2:48pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.