Stumped converting loop to purrr

chris.prener · October 8, 2018, 12:03am

I have a package under development that includes an initial workflow for importing up to 12 .csv files at a time using purrr::map(), validating each of them, and then creating a tibble of the validation results. The number of .csv files is not predictable except that it is 2 <= files <= 12.

I've created a reprex below that implements a very simple version of this process (while also creating some sample data). The workflow itself is rather complex, but I've tried to distill it down here as best I can.

The reprex:

1. Creates two sample data frames named a and b.
2. Writes each to a temporary .csv file.
3. Imports them both back into the session using map() to simulate the actual workflow.
4. Names the first list item (a) red and the second list item (b) blue.
5. Creates a simple validation function.
6. Uses map() to apply the validation function to both a and b.
7. Prints the validation results.

Herein lies the challenge - I want to take the name of each list item (i.e. red and blue) and add them as observations in the validation results. I have the process down as a for loop, which is the last step in the reprex before I print the type of output I ultimately want to create. I cannot for the life of me figure out how to do this final step (of writing list names in as observations) with purrr as opposed to with the loop. Any suggestions would be greatly appreciated!

# load packages
suppressMessages(library(dplyr))
library(purrr)
library(readr)

# create data
a <- data.frame(
  id = c(1, 2, 3, 4, 5),
  group = c("red", "red", "red", "red", "red"),
  outcome = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

b <- data.frame(
  id = c(1, 2, 3, 4, 5),
  group = c("blue", "blue", "blue", "blue", "blue"),
  outcome = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# save as csv to tempdir
a_file <- tempfile(pattern = "", fileext = ".csv")
write_csv(a, path = a_file)

b_file <- tempfile(pattern = "", fileext = ".csv")
write_csv(b, path = b_file)

# create list of files
files <- dir(path = tempdir(), pattern = "\\.csv$")

# combine list of files into single list using map()
files %>%
  map(~ suppressMessages(suppressWarnings(read_csv(file.path(tempdir(), .))))) -> data

# name the two items in data
names(data) <- c("red", "blue")

# validation function
validate <- function(item){
  
  # logic check 1 - does it have 3 cols?
  if (ncol(item) == 3){
    a <- TRUE
  } else {
    a <- FALSE
  }
  
  # logic check 2 - is it a tibble?
  classes <- class(item)
  
  if (classes[1] == "tbl_df"){
    b <- TRUE
  } else {
    b <- FALSE
  }
  
  # concatenate results
  out <- c(a,b)
  
  # return results
  return(out)
  
}

# validate items by iterating over list
data %>%
  purrr::map(validate) -> result

# print results
result
#> $red
#> [1] TRUE TRUE
#> 
#> $blue
#> [1] TRUE TRUE

# add name as observation
for (i in seq_along(result)){
  
  result[[i]] <- c(result[[i]], names(result[i]))
  
}

# print results again
result
#> $red
#> [1] "TRUE" "TRUE" "red" 
#> 
#> $blue
#> [1] "TRUE" "TRUE" "blue"

Created on 2018-10-07 by the reprex package (v0.2.0).

joels · October 8, 2018, 1:41am

If I understand your question, I think imap will do what you need:

imap(result, ~ c(.x, .y) )

$red
[1] "TRUE" "TRUE" "red" 

$blue
[1] "TRUE" "TRUE" "blue"

With imap, .x are the elements of result and .y are the names of the elements of result.
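
To make the .x/.y relationship concrete, here is a minimal sketch using a hand-built stand-in for result (not the reprex output itself). For a named list, imap() should behave like map2() called with the list and its names; for an unnamed list, .y is the element's position instead.

```r
library(purrr)

# stand-in for the validation results in the reprex (built by hand here)
result <- list(red = c(TRUE, TRUE), blue = c(TRUE, TRUE))

# .x is each element, .y is that element's name
with_imap <- imap(result, ~ c(.x, .y))

# the same call spelled out with map2()
with_map2 <- map2(result, names(result), ~ c(.x, .y))

identical(with_imap, with_map2)
```

Note that c() coerces the logical values to character once the name is appended, which is why the output shows "TRUE" in quotes.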

Also, is it necessary to hard-code the list element names or would it be better to grab them from the data? Maybe something like this:

names(data) = map(data, ~ .x$group[1])
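
A possible variation on the same idea, sketched with two small hand-built data frames in place of the imported files: map_chr() returns a character vector rather than a list, so nothing has to be coerced when the names are assigned, and set_names() keeps the step pipe-friendly.

```r
library(purrr)

# hand-built stand-ins for the imported .csv files
data <- list(
  data.frame(id = 1:2, group = c("red", "red"), stringsAsFactors = FALSE),
  data.frame(id = 1:2, group = c("blue", "blue"), stringsAsFactors = FALSE)
)

# map_chr() returns a character vector, which is what names() expects
data <- set_names(data, map_chr(data, ~ .x$group[1]))

names(data)
```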

rensa · October 8, 2018, 2:56am

I like this solution!

Re. hard-coded map result names, one pattern I often use to bring a bunch of files in is list.files() %>% set_names(.) %>% map().

The reason I usually do it is that map_dfr() (or map() %>% bind_rows()) will take those element names and turn them into an additional data frame column. But it'd be useful here too if the inputs aren't fixed.
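
A rough sketch of that pattern, assuming a throwaway folder and two made-up files (the folder name, file names, and the "file" column name are all placeholders). It uses basename() so the names are short, and base read.csv() to keep the example free of extra dependencies.

```r
library(purrr)
library(dplyr)

# write two small sample files into a temporary folder (hypothetical names)
folder <- file.path(tempdir(), "name-demo")
dir.create(folder, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, outcome = c(TRUE, FALSE)),
          file.path(folder, "red.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:2, outcome = c(FALSE, TRUE)),
          file.path(folder, "blue.csv"), row.names = FALSE)

# name each path after its file, so the names survive the call to map()
data <- list.files(folder, pattern = "\\.csv$", full.names = TRUE) %>%
  set_names(basename(.)) %>%
  map(read.csv)

# bind_rows() turns the list names into a column called "file"
combined <- bind_rows(data, .id = "file")
```

Because the names come from the files themselves, nothing needs to change when the number of inputs varies between 2 and 12.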

Also, @chris.prener, with your second logic test ("is it a tibble?"), you might find b <- "tbl_df" %in% class(item) more robust than just checking class(item)[1]. I'm not sure if "tbl_df" ever ends up being anything other than the first class element (maybe if another package used tibbles with modifier classes?), but I figure this would cover all the bases.
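
To illustrate the difference, here is a small sketch that simulates a package prepending its own class to a tibble ("my_subclass" is a made-up name). Base R's inherits() is shown as an equivalent way to write the membership test.

```r
library(tibble)

item <- tibble(id = 1:3)

# simulate a package that prepends its own class (hypothetical class name)
class(item) <- c("my_subclass", class(item))

# position-dependent check: no longer finds "tbl_df" in first place
first_class_check <- class(item)[1] == "tbl_df"

# membership checks: unaffected by the extra class
membership_check <- "tbl_df" %in% class(item)
inherits_check <- inherits(item, "tbl_df")
```

Here first_class_check is FALSE while the other two are TRUE.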

rensa · October 8, 2018, 3:06am

Also, if you have a workflow where you need to validate analysis on a list of inputs, purrr's side-effect-capturing functions are fantastic tools that I want to integrate into my next analysis! Being able to do an analysis of a bunch of things and get a tidy printout of what went wrong and where would be so handy
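
For anyone who hasn't used them, a minimal sketch of one of those functions, safely(), with made-up inputs. It wraps a function so that each call returns a list with a result and an error element instead of stopping the whole map(); possibly() and quietly() follow a similar wrapping idea.

```r
library(purrr)

# made-up inputs: one that works and one that fails
inputs <- list(ok = 10, bad = "ten")

# safely() wraps log() so an error is captured rather than thrown
safe_log <- safely(log)

checked <- map(inputs, safe_log)

# one element per input: NULL where the call succeeded, the condition otherwise
errors <- map(checked, "error")

# names of the inputs that failed
failed <- names(keep(errors, ~ !is.null(.x)))
```

With these inputs, failed contains only "bad", so the names on the input list tell you where things went wrong.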

joels · October 8, 2018, 6:24am

@rensa, I was only vaguely aware of those functions before I read your note and followed the link. Perhaps they should be the title of a book or even a movie: Safely, quietly, possibly: Adventures in tidy validation.

chris.prener · October 8, 2018, 11:35am

Many thanks @joels for the tip on imap() - that does indeed give me exactly what I need! Thanks for the explanation of what .x and .y are as well - not sure I would have figured that out from the docs.
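
Since the end goal is a tibble of validation results, one possible next step is to have imap() build a one-row tibble per item and then bind the rows. This is only a sketch: the result list is built by hand, and the column names (name, three_cols, is_tibble) are invented for illustration.

```r
library(purrr)
library(dplyr)

# hand-built stand-in for the output of map(data, validate)
result <- list(red = c(TRUE, TRUE), blue = c(TRUE, FALSE))

# one row per list item; .y carries the item name, .x the two check results
validation <- result %>%
  imap(~ tibble(name = .y, three_cols = .x[1], is_tibble = .x[2])) %>%
  bind_rows()
```

Unlike c(.x, .y), this keeps the check results as logical columns rather than coercing them to character.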

To answer both your and @rensa's questions - I actually don't hard code the names in. That said, my solution is not quite as elegant as

names(data) = map(data, ~ .x$group[1])

The data in the vector I used for names are not quite as clean as what I presented here, so what I have been doing is:

  # create list of months associated with year list object items
  data %>%
    purrr::map(cs_identifyMonth) -> nameList

  # convert list of months to vector
  nameVector <- unlist(nameList, recursive = TRUE, use.names = TRUE)

  # apply vector to data
  names(data) <- nameVector

cs_identifyMonth() looks at the data in the tibble and processes it so that I can get the name of the month the data are associated with. Then I use unlist() to take that list and convert it to a vector of month names, which become the basis for naming each list element.
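
If cs_identifyMonth() returns a single string per list item (an assumption, since the function isn't shown here), the map() and unlist() steps could probably be collapsed with map_chr(). The sketch below uses a hypothetical stand-in, identify_month(), and made-up data.

```r
library(purrr)

# hypothetical stand-in for cs_identifyMonth(); the real function is not shown
identify_month <- function(item) {
  month.name[item$month[1]]
}

# made-up data: one data frame per month
data <- list(
  data.frame(month = c(1, 1), count = c(4, 7)),
  data.frame(month = c(2, 2), count = c(3, 9))
)

# map_chr() returns the character vector directly, and errors if any
# result is not a single string
data <- set_names(data, map_chr(data, identify_month))

names(data)
```

The error on a non-string result can double as a cheap sanity check on the naming step.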

Also, @rensa - thanks for the tips on checking the class of tibbles and those side-effect-capturing functions! I haven't seen a tibble where "tbl_df" is not the first item but this is a great thought. I'm going to update that part of the workflow as well!