How do I use rlang to create a helper function to check if input columns are present in a dataframe

ML_Rookie_2021 · March 22, 2024, 7:18pm

I have a set of functions that each take single column names as inputs, as well as one or more group_by columns.

What I'm Trying to Achieve
I want to create an additional function that checks whether all the columns supplied as inputs (whether single column inputs or group_by column names) are present in the dataframe. However, I'm also using the rlang package for tidy evaluation, and I'm a bit confused about quo, enquo, etc.

Reprex Code

# Packages used below (dplyr re-exports tibble() and %>%)
library(dplyr)

# Sample Data
bom_mrr = c(100, 350, 50, 68, 68, 10)
eom_mrr = c(200, 150, 45, 90, 87, 34)
cohort  = c("cohort1", "cohort2", "cohort2", "cohort3", "cohort1", "cohort2")
month   = c(as.Date("2024-01-01"), as.Date("2024-01-01"), as.Date("2024-01-01"),
          as.Date("2024-02-01"), as.Date("2024-02-01"), as.Date("2024-02-01"))

df = tibble(month, cohort, bom_mrr, eom_mrr)


# Function to check if all column inputs are present in the dataframe (from ChatGPT)
check_present_columns <- function(data, ...) {
    
    # Convert the ... arguments into a character vector of column names
    column_names     <- rlang::ensyms(...)
    required_columns <- sapply(column_names, rlang::as_string)
    
    # Check if all required columns are present in the dataframe
    if (!all(required_columns %in% names(data))) {
        missing_cols <- required_columns[!required_columns %in% names(data)]
        stop("The following supplied columns are missing from the dataframe: ", paste(missing_cols, collapse = ", "), call. = FALSE)
    }
    
    invisible(TRUE) # Return invisibly if checks pass
}


# Main function to calculate mrr retention
get_mrr_retention_rate <- function(data, bom_mrr_column, eom_mrr_column, group_column) {

    # rlang setup
    group_cols_expr <- rlang::enquo(group_column)
    bom_mrr_expr    <- rlang::enquo(bom_mrr_column)
    eom_mrr_expr    <- rlang::enquo(eom_mrr_column)

    # checks
    check_present_columns(data, !!bom_mrr_expr, !!eom_mrr_expr)

    # calculation
    if (!missing(group_column)) {
        data <- data %>%
            dplyr::select(!!bom_mrr_expr, !!eom_mrr_expr, !!group_cols_expr) %>%
            dplyr::group_by(dplyr::across(!!group_cols_expr))
    } else {
        data <- data %>% dplyr::select(!!bom_mrr_expr, !!eom_mrr_expr, !!group_cols_expr)
    }


    mrr_retention_tbl <- data %>%
        dplyr::summarise(
            total_bom_mrr = sum(!!bom_mrr_expr),
            total_eom_mrr = sum(!!eom_mrr_expr)
        ) %>%
        dplyr::ungroup() %>%
        dplyr::mutate(mrr_retention_rate = total_eom_mrr / total_bom_mrr)

    return(mrr_retention_tbl)

}

Outputs
When I run the function normally, everything works fine:

df %>%
    get_mrr_retention_rate(
        bom_mrr_column = bom_mrr,
        eom_mrr_column = eom_mrr,
        group_column = c(month, cohort)
    )

# A tibble: 5 × 5
  month      cohort  total_bom_mrr total_eom_mrr mrr_retention_rate
  <date>     <chr>           <dbl>         <dbl>              <dbl>
1 2024-01-01 cohort1           100           200              2    
2 2024-01-01 cohort2           400           195              0.488
3 2024-02-01 cohort1            68            87              1.28 
4 2024-02-01 cohort2            10            34              3.4  
5 2024-02-01 cohort3            68            90              1.32

Now say the user enters a wrong column name for eom_mrr_column. The check_present_columns() function works as expected:

# Wrong eom_mrr_column name
df %>%
    get_mrr_retention_rate(
        bom_mrr_column = bom_mrr,
        eom_mrr_column = mrr_eom,
        group_column = c(month, cohort)
    )

# Expected error message 
Error: The following supplied columns are missing from the dataframe: mrr_eom

However, say the user enters a wrong column name for one of the group_column entries. The check_present_columns() function does not appear to work, as the error message is different:

# wrong spelling for cohort
df %>%
    get_mrr_retention_rate(
        bom_mrr_column = bom_mrr,
        eom_mrr_column = eom_mrr,
        group_column = c(month, cohhort)
    )

# Error message 
Error in `dplyr::select()`:
! Can't subset columns that don't exist.
✖ Column `cohhort` doesn't exist.
Run `rlang::last_trace()` to see where the error occurred.

The function still reports the error correctly, but not in the same format as the second example above.

I feel like this has something to do with my using rlang incorrectly somewhere, but I'm not sure where. Any help will be appreciated. Also, if there is a better or more efficient way to achieve what I'm trying to do, feel free to suggest it. Thanks.

prubin · March 22, 2024, 8:00pm

I don't think the problem lies with your use of rlang as such. In get_mrr_retention_rate, check_present_columns() is only called with the bom and eom columns, so group_column is never checked. A misspelled grouping column is therefore first evaluated by dplyr::select() inside the function, and dplyr raises its own error (which is why the message starts with Error in `dplyr::select()`) instead of your custom one.
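To make that concrete: in the reprex, only the two MRR columns are handed to check_present_columns(), and rlang::ensyms() would reject an expression such as c(month, cohort) in any case, because it is a call and not a symbol. One way to keep the bare-name interface is sketched below. It assumes group_column is always a bare name or c() of bare names (tidyselect helpers such as starts_with() are not handled), and the names quo_column_names(), check_quo_columns() and demo_check() are made up for illustration.

```r
library(rlang)

# Hypothetical helper: collect the variable names referenced by a captured
# argument, whether it is `cohort` or `c(month, cohort)`.
quo_column_names <- function(quo) {
    # An omitted argument contributes no names
    if (rlang::quo_is_missing(quo)) {
        return(character())
    }
    # all.vars() returns the variable names in an expression, skipping
    # function names such as `c`
    all.vars(rlang::quo_get_expr(quo))
}

# Hypothetical replacement for check_present_columns() that accepts
# multi-column expressions as well as single bare names.
check_quo_columns <- function(data, ...) {
    quos <- rlang::enquos(...)
    required_columns <- unique(unlist(lapply(quos, quo_column_names)))
    missing_cols <- setdiff(required_columns, names(data))
    if (length(missing_cols) > 0) {
        stop(
            "The following supplied columns are missing from the dataframe: ",
            paste(missing_cols, collapse = ", "),
            call. = FALSE
        )
    }
    invisible(TRUE)
}

# Forwarding the arguments from a wrapper with the embrace operator
demo_check <- function(data, value_column, group_column) {
    check_quo_columns(data, {{ value_column }}, {{ group_column }})
}

df <- data.frame(month = 1, cohort = "cohort1", bom_mrr = 100)
demo_check(df, bom_mrr, c(month, cohort))     # passes silently
# demo_check(df, bom_mrr, c(month, cohhort))  # stops, naming `cohhort`
```

The trade-off is that all.vars() only inspects the expression syntactically, so anything more elaborate than bare column names would need a different check.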

One possible workaround would be to change the group_column argument of get_mrr_retention_rate to be a list of columns rather than a group_by expression. Check existence of the columns and, if it passes muster, build the grouping expression inside get_mrr_retention_rate.
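A minimal sketch of that workaround follows. It assumes the grouping columns are passed as a character vector; the function names check_present_columns_chr() and get_mrr_retention_rate_chr() and the group_columns argument are hypothetical, chosen to avoid clashing with the originals.

```r
library(dplyr)

# Hypothetical string-based variant of the checker
check_present_columns_chr <- function(data, columns) {
    missing_cols <- setdiff(columns, names(data))
    if (length(missing_cols) > 0) {
        stop(
            "The following supplied columns are missing from the dataframe: ",
            paste(missing_cols, collapse = ", "),
            call. = FALSE
        )
    }
    invisible(TRUE)
}

# Hypothetical variant of the main function: bare names for the two MRR
# columns, a character vector for the grouping columns.
get_mrr_retention_rate_chr <- function(data, bom_mrr_column, eom_mrr_column,
                                       group_columns = character()) {
    bom_name <- rlang::as_name(rlang::ensym(bom_mrr_column))
    eom_name <- rlang::as_name(rlang::ensym(eom_mrr_column))

    # One check covers every supplied column, grouping columns included
    check_present_columns_chr(data, c(bom_name, eom_name, group_columns))

    # Build the grouping only after the check has passed
    if (length(group_columns) > 0) {
        data <- dplyr::group_by(data, dplyr::across(dplyr::all_of(group_columns)))
    }

    data %>%
        dplyr::summarise(
            total_bom_mrr = sum(.data[[bom_name]]),
            total_eom_mrr = sum(.data[[eom_name]]),
            .groups = "drop"
        ) %>%
        dplyr::mutate(mrr_retention_rate = total_eom_mrr / total_bom_mrr)
}

# Same sample data as in the question
df <- tibble(
    month   = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01",
                        "2024-02-01", "2024-02-01", "2024-02-01")),
    cohort  = c("cohort1", "cohort2", "cohort2", "cohort3", "cohort1", "cohort2"),
    bom_mrr = c(100, 350, 50, 68, 68, 10),
    eom_mrr = c(200, 150, 45, 90, 87, 34)
)

result <- get_mrr_retention_rate_chr(df, bom_mrr, eom_mrr, c("month", "cohort"))
```

With this version, a call such as get_mrr_retention_rate_chr(df, bom_mrr, eom_mrr, c("month", "cohhort")) should stop in the custom check and report cohhort in the same format as the other missing-column errors, because dplyr never sees the bad name.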
