Using the .data pronoun across multiple functions

arangaca · November 12, 2022, 12:14pm

The title might not reflect my issue very well; the issue is more generally about programming design/logic.

I'm developing a package to handle authors and affiliations. The package is primarily designed to be used with Quarto and will allow users to inject author data from a dataset into a yaml header following Quarto's author/affiliations schema. I'd also like to make the package accessible to non-Quarto/Rmarkdown users (or users who don't use a journal template with their qmd documents) by generating author lists and affiliations as character strings.

I'm struggling a lot with the logic of my code to generate the author list. By author list, I mean a list of authors with annotations, e.g. René Descartes^1,2*^†, Blaise Pascal^3^, Antoine Lavoisier^1,4^‡.

I'm building the package around a few R6 classes. My approach to producing author lists is to have a method that takes a format argument as a character string, which is then parsed to inject actual data from a dataset. The format argument consists of keys defining each annotation (a for affiliation, c for correspondence and n for note), the superscript marker ^ and the separator ,. E.g., reusing the example above, "^ac^" would produce René Descartes^1,2*^, whereas "^c,a^n" would produce René Descartes^*,1,2^†.

I made a simplified reproducible example of the part I'm struggling with. The example dataset is the type of dataset generated by that particular method prior to building the author list with the default settings of the class instance.

library(tidyverse)
library(rlang)
library(glue)

example <- structure(list(
  id = 1:3,
  literal_name = c("René Descartes", "Blaise Pascal", "Antoine Lavoisier"),
  corresponding = c(TRUE, FALSE, FALSE),
  affiliation_id = c("1,2", "3", "1,4"),
  note_id = c("†", "", "‡")
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))

This is where the author list is built:

aut <- mutate(example, .authors = !!make_author_str(format = "^a,c^n"))

pull(aut) %>% 
  glue_collapse(", ", last = " and ") %>% 
  cat()
# René Descartes^1,2,\*^†, Blaise Pascal^3,\*^ and Antoine Lavoisier^1,4,\*^‡

Below are the required helper functions:

make_author_str <- function(format) {
  expr({
    env <- environment()
    dict <- list(
      c = .data[["corresponding"]],
      a = .data[["affiliation_id"]],
      n = .data[["note_id"]]
    )
    fmt <- parse_format(!!format)
    assign_to_keys(dict, seps = fmt$seps, env = env)
    pattern <- str_replace_all(fmt$format, "([acn])", "{\\1}")
    suffixes <- glue(pattern)
    paste0(.data[["literal_name"]], suffixes)
  })
}

# build the a, c and n variables prior to parsing from the dict object
# and assign their respective annotations/symbols with separator
assign_to_keys <- function(dict, seps, env) {
  iwalk(dict, ~ {
    symbols <- if (.y == "c") "\\*" else .x
    value <- if_else(
      is_true(.x) | !is.null(.x) | .x != "",
      paste0(seps[[.y]], symbols),
      ""
    )
    assign(.y, value, envir = env)
  })
}

clean_format <- function(x) {
  gsub("([a-z^,])\\K\\1+|,+", "", x, perl = TRUE)
}

extract_keys <- function(x) {
  x <- strsplit(x, split = "")
  x <- unlist(x)
  x[x %in% letters]
}

extract_key_sep <- function(format, key) {
  out <- str_extract(format, paste0("(?!^)(?<=[a-z^]),(?=", key, ")"))
  if (is.na(out)) "" else out
}

# returns key separators and a cleaned 'format' string (without comma)
parse_format <- function(format) {
  keys <- extract_keys(format)
  seps <- map_chr(keys, ~ extract_key_sep(format, .x))
  list(
    seps = set_names(seps, keys),
    format = clean_format(format)
  )
}

The above works (minus the correspondence, which is added to every author here; I'm not sure why, but it works in the class) when the corresponding, affiliation_id and note_id columns exist in the dataset, but it doesn't if any of them is missing (either in format or in example).
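A possible explanation for the correspondence mark showing up on every author, offered as a guess rather than a confirmed diagnosis: in the condition used by assign_to_keys(), `!is.null(.x)` is a single TRUE for any existing column, and combining it with `|` makes the whole condition TRUE for every row. The sketch below reproduces this with base R only. `isTRUE()` stands in for `rlang::is_true()`, and `has_annotation()` is a hypothetical helper, not part of the original code.

```r
corresponding <- c(TRUE, FALSE, FALSE)
note_id <- c("†", "", "‡")

# Same shape as the original condition. isTRUE() is FALSE for a length-3
# vector, but !is.null() is a single TRUE that `|` recycles over every row.
original <- isTRUE(corresponding) | !is.null(corresponding) | corresponding != ""

# Hypothetical replacement: test each element according to the column type.
has_annotation <- function(x) {
  if (is.logical(x)) {
    !is.na(x) & x          # logical flag columns such as `corresponding`
  } else {
    !is.na(x) & x != ""    # character columns such as `note_id`
  }
}

flags <- has_annotation(corresponding)
marks <- ifelse(flags, paste0(",", "\\*"), "")
```

With this check, only the first author would receive the `,\*` annotation, and an empty note stays empty.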

In its current form, the dict object is too constraining. A better approach might be to build the dict object dynamically, like so:

cols <- list(c = "corresponding", a = "affiliation_id", n = "note_id")
dict <- cols[cols %in% names(example)]

But then I can't retrieve the data using the .data pronoun inside assign_to_keys().

Note that a lot of the complexity here comes from dealing with the key separators in format, which is done in parse_format() and assign_to_keys().

Any insights on how I could make it more flexible? I have changed this part quite a bit since the first draft and I might have followed the wrong logic in the process. So if you see a better logic or a simpler way to do it, please share it.

nirgrahamuk · November 13, 2022, 11:54pm

Because you are relying on dplyr, as well as the .data pronoun you can access cur_data() (and cur_data_all()), which are richer objects (data frames).
So you could do something like this, for example, which would build your dictionary even in the absence of corresponding:

    dict <- list(
      c = if ("corresponding" %in% names(cur_data())) {
        .data[["corresponding"]]
      } else {
        character(0)
      },
      a = .data[["affiliation_id"]],
      n = .data[["note_id"]]
    )

arangaca · November 14, 2022, 1:27pm

Thank you for your input.

Yes, I can always add more control flow to make it work, but that adds a level of complexity to what I think is a bad initial design or an overcomplicated solution to the initial problem.

Also note that the way I set the separator between each annotation relies on the keys in both dict and format (paste0(seps[[.y]], symbols)). Consider format = "ac": parse_format() only looks for a possible separator for the a and c keys and returns seps = c(a = "", c = ""). Since assign_to_keys() iterates through dict, I'll get an error when .y is "n" because that key isn't in seps with the current design. I can add more control flow there too, but this is just a sign of a poor design in my opinion.
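One way to avoid the missing-key error without extra control flow, as a rough sketch: restrict the dictionary to keys that appear in format and whose column exists in the data before iterating. The `cols` mapping is taken from the first post; `seps` is the value parse_format("ac") is described as returning, and `data_names` is a made-up set of column names for illustration.

```r
cols <- c(a = "affiliation_id", c = "corresponding", n = "note_id")

# What parse_format("ac") is described as returning for its separators.
seps <- c(a = "", c = "")

# Assumed column names of the dataset (note_id is deliberately absent).
data_names <- c("id", "literal_name", "corresponding", "affiliation_id")

# Keep only keys requested in `format` ...
dict <- cols[intersect(names(seps), names(cols))]
# ... and backed by an existing column.
dict <- dict[dict %in% data_names]
```

Every name left in `dict` is then guaranteed to have an entry in `seps`, so `seps[[.y]]` can no longer fail on a key such as "n".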

A better solution would be to create the dict object dynamically based on the keys in format, which I can do. Then the problem is to retrieve the actual data from the dataset in assign_to_keys() which I don't know how to do.

Something like:

iwalk(dict, ~ {
  .x <- .data[[.x]]
  # rest of the code here
})

I'm getting a Can't subset `.data` outside of a data mask context error though, even when encapsulating the whole thing in expr().

I guess I'd be happy enough if I can make .data work in assign_to_keys() although there might be a better design for all this which I can't think of at the moment.
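As far as I understand it, the .data pronoun is only meaningful while dplyr is evaluating an expression against the data mask, so a helper defined elsewhere cannot use it. A simpler design might be to pass the data frame itself to a plain function and walk through format one character at a time. The sketch below is one possible take under that assumption, written in base R; `build_author_str()` and its `cols` argument are hypothetical names, and the rule for separators (emit the comma only when the key has a value and something precedes it in the same superscript group) is my own choice, not the package's behaviour.

```r
# Hypothetical alternative to make_author_str(): works on a data frame directly.
build_author_str <- function(data, format,
                             cols = c(a = "affiliation_id",
                                      c = "corresponding",
                                      n = "note_id")) {
  tokens <- strsplit(format, "")[[1]]
  n <- nrow(data)
  out <- rep("", n)        # suffix accumulated for each author
  filled <- rep(FALSE, n)  # has the current group produced output yet?
  sep <- ""                # separator waiting for the next key
  for (tok in tokens) {
    if (tok == "^") {
      out <- paste0(out, "^")
      filled <- rep(FALSE, n)
      sep <- ""
    } else if (tok == ",") {
      sep <- ","
    } else {
      # Keys that are unknown or whose column is missing are skipped.
      if (tok %in% names(cols) && cols[[tok]] %in% names(data)) {
        x <- data[[cols[[tok]]]]
        value <- if (is.logical(x)) {
          ifelse(!is.na(x) & x, "\\*", "")
        } else {
          ifelse(is.na(x), "", as.character(x))
        }
        show <- value != ""
        out <- paste0(out, ifelse(show & filled, sep, ""), value)
        filled <- filled | show
      }
      sep <- ""
    }
  }
  out <- gsub("^^", "", out, fixed = TRUE)  # drop empty superscript groups
  paste0(data$literal_name, out)
}

example <- data.frame(
  literal_name = c("René Descartes", "Blaise Pascal", "Antoine Lavoisier"),
  corresponding = c(TRUE, FALSE, FALSE),
  affiliation_id = c("1,2", "3", "1,4"),
  note_id = c("†", "", "‡")
)

authors <- build_author_str(example, "^a,c^n")
no_corr <- build_author_str(example[, names(example) != "corresponding"], "^a,c^n")
```

Because the function receives the data frame as an ordinary argument, no data mask is involved, and a missing column or a key absent from format just produces no annotation.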
