R Style for long Pipes

I often write long data-wrangling pipelines and wonder how best to break them up or make them more readable.
I am aware of the section on pipes in the tidyverse style guide, but it doesn't really go into detail on my question.

So, what do you do?

  • Write one long pipe?
  • Break it up into chunks with reassignment in between?
  • Write one long pipeline but separate it visually with comments in between?

Example

x <-
  x %>%
  mutate(
    vyears =
      case_when(
        year == 2014 ~ vdays / 365,
        year == 2013 ~ vdays / 365,
        TRUE ~ NA_real_
      ),
    vyears = as.integer(round(vyears))
  ) %>%
  select(-vdays) %>%
  relocate(vyears, .after = n_pers) %>%
  left_join(data$smth_smth %>%
    mutate(grp = "Gesamt") %>%
    select(vdat_year, grp, perc),
  by = c("year" = "vdat_year", "grp")
  ) %>%
  group_by(grp) %>%
  mutate(across(c(n_pers, vyears), ~ (.x - lag(.x)) / lag(.x),
    .names = "delta_{.col}"
  )) %>%
  select(
    grp, year, n_pers, delta_n_pers, vyears, delta_vyears,
    n_d, perc
  )


vs. the "cleaner" version

x <-
  x %>%
  mutate(
    vyears =
      case_when(
        year == 2014 ~ vdays / 365,
        year == 2013 ~ vdays / 365,
        TRUE ~ NA_real_
      ),
    vyears = as.integer(round(vyears))
  ) %>%
  select(-vdays) %>%
  relocate(vyears, .after = n_pers)

# prepare join
y_join <-
  data$am_percent %>%
  mutate(grp = "Gesamt") %>%
  select(vdat_year, grp, perc)

# Join Data
x <-
  left_join(x, y_join,
    by = c("year" = "vdat_year", "grp")
  ) %>%
  group_by(grp) %>%
  mutate(across(c(n_pers, vyears), ~ (.x - lag(.x)) / lag(.x),
    .names = "delta_{.col}"
  )) %>%
  select(
    grp, year, n_pers, delta_n_pers, vyears, delta_vyears,
    n_d, perc
  )

That is my preference when writing for my own use. However, my usual problem is editing the code block for different objects. Assuming that I had two data frames, x and Data (I avoid data, df, and the names of other functions. Because sometimes.), with consistent structures, and was interested in Data$SOME_COLUMN and Data$SOME_GROUP, I'd do it as a function, with appropriate comments.

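library(dplyr)

# Argument roles are assumed here; the post does not spell them out:
#   w: data frame to process
#   x: list or data frame holding the lookup table (e.g. Data)
#   y: name of the lookup element inside x, as a string
#   z: group label written to grp before the join
process_df <- function(w, x, y, z) {
  w %>%
    mutate(
      vyears =
        case_when(
          year == 2014 ~ vdays / 365,
          year == 2013 ~ vdays / 365,
          TRUE ~ NA_real_
        ),
      vyears = as.integer(round(vyears))
    ) %>%
    select(-vdays) %>%
    relocate(vyears, .after = n_pers) %>%
    # w arrives through the pipe, so left_join() takes only the lookup table
    left_join(x[[y]] %>%
      mutate(grp = z) %>%
      select(vdat_year, grp, perc),
    by = c("year" = "vdat_year", "grp")
    ) %>%
    group_by(grp) %>%
    mutate(across(c(n_pers, vyears), ~ (.x - lag(.x)) / lag(.x),
      .names = "delta_{.col}"
    )) %>%
    select(
      grp, year, n_pers, delta_n_pers, vyears, delta_vyears,
      n_d, perc
    )
}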
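
To make the "wrap it in a function" idea concrete, here is a much smaller sketch that can be run on its own. The helper add_delta(), its argument names, and both toy data frames are made up for illustration and are not from the thread. It only shows how passing column names as strings lets the same pipe run on differently named objects.

```r
library(dplyr)

# Hypothetical helper: relative change of one column against the previous
# row, computed within each group. Rows are assumed to be sorted already.
add_delta <- function(df, value_col, group_col) {
  df %>%
    group_by(across(all_of(group_col))) %>%
    mutate(
      delta = (.data[[value_col]] - lag(.data[[value_col]])) /
        lag(.data[[value_col]])
    ) %>%
    ungroup()
}

# Two toy data frames with different column names (values are made up).
sales_a <- data.frame(
  grp    = c("A", "A", "A"),
  year   = 2012:2014,
  n_pers = c(100, 110, 121)
)
sales_b <- data.frame(
  region = c("N", "N"),
  year   = 2013:2014,
  visits = c(50, 75)
)

res_a <- add_delta(sales_a, "n_pers", "grp")
res_b <- add_delta(sales_b, "visits", "region")
```

The pipe itself is written once; only the call changes per data frame, which avoids hand-editing a copied block for each object.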

I would go with your second example.
Also, I think it's good advice for comments to focus on the 'why' of the code rather than the 'how', since the code itself is the how.
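
A small sketch of what 'why' comments inside a pipe could look like. The column names mirror the original post, but the data values and the reasons given in the comments are invented purely for illustration.

```r
library(dplyr)

# Toy data; values are made up.
visits <- data.frame(
  year   = c(2012, 2013, 2014),
  n_pers = c(100, 110, 121),
  vdays  = c(400, 730, 1095)
)

result <- visits %>%
  # Assumed reason, for illustration only: vdays is reliable from 2013 on,
  # so earlier years become NA rather than a misleading number.
  mutate(
    vyears = if_else(year %in% c(2013, 2014), vdays / 365, NA_real_),
    vyears = as.integer(round(vyears))
  ) %>%
  # Reports downstream work in whole years, so the raw day count is dropped.
  select(-vdays)
```

A comment such as "# divide vdays by 365" would only restate the code; the comments above try to record the decision behind each step instead.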

That's my way as well. Don't try to catch several magpies by the tail at once; just split the pipe in a logical way, with a comment on why, or on what each part should achieve.

Regards,
Grzegorz
