Simplifying tidy workflow with list of tibbles and purrr

petermacp · December 6, 2018, 2:05pm

My usual project workflow is to read in a bunch of files as a list using map and readr, then do as much cleaning as possible before saving a single minimal table for analysis. I am slowly getting the handle of purrr functions for helping with this approach, but could do with specific advice about:

How to iterate through tibbles in a list using mutate, filter, select and other dplyr functions (see example below)?
Any other hints/tips/suggestions to improve this workflow using tidy principles?

library(tidyverse)

set.seed(4321)

#Make up some data in four separate tibbles
table1 <- tibble(
  id = 1:10,
  age = floor(runif(min=18, max=100, n=10)),
  sex = sample(c("Male", "Female"), 10, replace = TRUE),
  nonsense = sample(letters, 10)
)

table2 <- tibble(
  id = 1:4,
  weight = c("50 kg", "45", "65kg", "67"),
  height = c("141", "133cm", NA, "177 cm")
)

table3 <- tibble(
  id = 1:10,
  outcome = sample(c("Alive", "Dead"), 10, replace = TRUE)
)

useless_table <- tibble(
  no_use = sample(LETTERS, 10)
)

#Add the tables to a list
list_tables <- list(table1, table2, table3, useless_table)
names(list_tables) <- c("table1", "table2", "table3", "useless_table")

#Keep only the tables that we are interested in
kept_tables <- list_tables %>% keep(names(.) %in% 
                         c("table1", "table2", "table3"))

#Iterate through tables, selecting only the variables we wish to keep
keep_vars <- list("id", "age", "sex", "weight", "height", "outcome")
names(keep_vars) <- keep_vars

kept_tables <- map(kept_tables, ~ select(.x, one_of(names(keep_vars))))
#> Warning: Unknown columns: `weight`, `height`, `outcome`
#> Warning: Unknown columns: `age`, `sex`, `outcome`
#> Warning: Unknown columns: `age`, `sex`, `weight`, `height`

#Now how to iterate through tables, mutating to tidy-up some variables?
#For example, parse height and weight as numeric
#Doesn't work...
kept_tables <- kept_tables %>%
  map(.x, ~ mutate_at(.vars = vars(weight, height), .funs = funs(parse_number)))
#> Error in as_mapper(.f, ...): object '.x' not found

#As a final step, would reduce to a single table e.g. for modelling
#This works
out <- kept_tables %>%
  reduce(left_join, by=c("id"))

^{Created on 2018-12-06 by the reprex package (v0.2.1)}

mishabalyasin · December 6, 2018, 4:10pm

Well, the problem is that you don't need .x there since you are piping kept_tables in. So, it should be:

kept_tables <- kept_tables %>%
  map(~ mutate_at(.x, .vars = vars(weight, height), .funs = funs(parse_number)))

However, this would return you:

Error in .f(.x[[i]], ...) : object 'weight' not found

since not every table has weight column in it.

I would say that in this case mapping over ALL tables to clean them is awkward, since structure of tables is so different. So I would probably separate tibbles even further into ones that need cleaning (like table2) and ones that don't. You can then splice them together and continue with a flow as before.

Your last statement with reduce is a perfect example of where this approach makes sense: all of your tibbles have id column in them, so joining them is a perfectly reasonable thing to do.

petermacp · December 6, 2018, 5:18pm

Thanks - that makes sense.

My rationale for mapping over all tables is that there are often multiple similar cleaning functions to be applied to columns across different tables.

For example:

I could have several columns in each table in character format that I want to simultaneously mutate to date columns.
There might be several columns in each table that I want to strip characters from, and parse as numeric (as in the example above)

I realise that I may be pushing my thinking of what an optimal workflow might look like too far (), but to me it makes sense because:

Separating out the tibbles and running these functions on each individually seems inefficient, and often results in large chunks of essentially identical code in my scripts
Each individual tibble is often very large with 100s of columns and tens of thousands of rows. It seems much neater to only extract the data that I really need at the last possible moment

I understand why your modification here doesn't work:

kept_tables <- kept_tables %>% map(~ mutate_at(.x, .vars = vars(weight, height), .funs = funs(parse_number)))

However, say I (perhaps foolishly!) really wanted to try this route... might purrr's possibly function be useful? I have only read a bit about this, and couldn't get to work on this example...

kept_tables <- kept_tables %>%
  map(possibly(~ mutate_at(.x, 
                  .vars = vars(weight, height), 
                  .funs = funs(parse_number)),
               otherwise = "Not found!"))
#> Error in kept_tables %>% map(possibly(~mutate_at(.x, .vars = vars(weight, : could not find function "%>%"

^{Created on 2018-12-06 by the reprex package (v0.2.1)}

Thank you so much once again for the expertise - I am learning lots!

mishabalyasin · December 6, 2018, 7:30pm

Too much magic can be quite dangerous . So I would definitely try to find a way to transform your tibbles in such a way that you only apply the function to them if they have columns you want to modify.

That being said, if you really want to try it, then read on

I would say, the main stumbling block is the fact that sometimes variables you want to modify are not present in the dataset you want to modify. Since you still want your function to succeed, then the most straightforward way to achieve this is to make sure that function is only applied when given column is present in the dataset. To do that, you can do something like that:

mutate_with <- function(x, .f, ...){
  .dots <- rlang::ensyms(...) 
  in_x <- purrr::map_lgl(.dots, ~!is.null(x[[rlang::as_string(.x)]]))
  .dots <- .dots[in_x]

  dplyr::mutate_at(x, .vars = dplyr::vars(!!!.dots), .f)
}

You would then use it instead of mutate_at like that:

kept_tables %>%
  map(~ mutate_with(.x, .f = parse_number, weight, height))

Notice that I'm using rlang::ensyms when I construct .dots. I've done this on purpose to ensure that you can only pass in symbols or strings, not expressions or quosures. This way you can only pass names of columns, nothing more. However, in general, rlang::enquos should be a default to capture raw symbols when using tidy evaluation. To learn more about what it is, you can read here - https://tidyeval.tidyverse.org/.

Finally, your approach with possibly can possibly (heh ) work, but keep in mind that you are modifying your variable kept_tables, meaning that if your function fails (as it would with, for example, tibble_3), it'll overwrite that tibble with just "Not found!" which is not something that you want.

petermacp · December 7, 2018, 5:30am

Thanks so much!!!

Your mutate_with function was exactly what I was looking for - very cool, and always good to live a little dangerously!

Peter

system · December 14, 2018, 5:31am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.