How should a function used in summarise handle 0-length input, i.e. from an empty data frame?

mikecrobp · February 1, 2023, 3:37pm

Moving to dplyr threw up warnings I couldn't explain, until I tracked them down to my homemade "mode" function. MostCommon takes the most common (i.e. mode) value in its input and returns the maximum value in the event of a tie. I am very open to being told of a better alternative or how to improve it.

The warning I got was "Returning more (or less) than 1 row per summarise() group was deprecated in dplyr 1.1.0" from simple group_by/summarise. But only when using MostCommon. Same scenario with just min or max as an aggregate function has no error.

MostCommon still gets called but with a 0 length input. And returns that. With dplyr 1.0.0 group_by/summarise didn't worry. With dplyr 1.1.0 it throws the warning.

My current approach (from writing this up) is to change the return value to the below. Is there a better way of returning the right (non) value? Reprex below.

    if(length(ux) == 0) {
      NA
    } else {
      ux   # this returns the NA with the right class. ie that of x
    }

library(tidyverse)
library(reprex)

MostCommon <- function(x) {
  ux <- unique(x)
  uxnotna <- ux[which(!is.na(ux))]
  if(length(uxnotna) > 0) {
    tab <- tabulate(match(x, uxnotna))
    candidates = uxnotna[tab == max(tab)]
    if (is.logical(x)) {
      any(candidates) # return TRUE if any true. max returns an integer
    } else {
      max(candidates) # return highest (ie max) value
    }
  } else {
    ux   # this returns the NA with the right class. ie that of x
  }
}

#####

emptymtcars <- mtcars %>%
  filter(cyl > max(cyl))

emptymtcarscylsummary <- emptymtcars %>%
  group_by(cyl, gear) %>%
  summarise(
    count = n(),
    hp = mean(hp),
    carb = MostCommon(carb)
    )
#> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
#> dplyr 1.1.0.
#> ℹ Please use `reframe()` instead.
#> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
#>   always returns an ungrouped data frame and adjust accordingly.
#> `summarise()` has grouped output by 'cyl', 'gear'. You can override using the
#> `.groups` argument.

emptymtcarscylsummary2 <- emptymtcars %>%
  reframe(
    count = n(),
    hp = mean(hp),
    carb = MostCommon(carb),
    .by = c(cyl, gear)
  )

Created on 2023-02-01 with reprex v2.0.2

davis · February 9, 2023, 2:05pm

The issue with MostCommon() is that it doesn't always return a size 1 result. If the input is empty, then you get an empty result. This might seem intuitive at first glance, but summary functions like this must guarantee that they always return a size 1 result, no matter the input. Examples of valid summary functions are sum() and any().

MostCommon <- function(x) {
  ux <- unique(x)
  uxnotna <- ux[which(!is.na(ux))]
  if(length(uxnotna) > 0) {
    tab <- tabulate(match(x, uxnotna))
    candidates = uxnotna[tab == max(tab)]
    if (is.logical(x)) {
      any(candidates) # return TRUE if any true. max returns an integer
    } else {
      max(candidates) # return highest (ie max) value
    }
  } else {
    ux
  }
}

MostCommon(c(1,1,2))
#> [1] 1

# Should return a size 1 result
MostCommon(integer())
#> integer(0)

# Note that sum() and any() are valid summary functions
sum()
#> [1] 0
any()
#> [1] FALSE
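
One way to check the size 1 guarantee for a candidate summary function is to call it on empty vectors of a few types. This is a minimal sketch in base R; returns_size_one() is a hypothetical helper named only for illustration, not part of dplyr:

```r
# Hypothetical helper: TRUE if f() returns exactly one value
# for every empty input it is given
returns_size_one <- function(f,
                             empty_inputs = list(integer(), double(),
                                                 character(), logical())) {
  all(vapply(empty_inputs, function(x) length(f(x)) == 1L, logical(1)))
}

returns_size_one(length)                        # length() always returns one value
returns_size_one(unique)                        # unique() of empty input is empty
returns_size_one(function(x) x[NA_integer_])    # typed NA of size 1
```

The first and third calls should give TRUE and the second FALSE, which is the same distinction that separates a valid summary function from the original MostCommon().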

If you rewrite the else branch to use x[NA_integer_], which generates a typed missing value (even with size 0 x), then it should work in all of your typical cases:

library(tidyverse)

MostCommon <- function(x) {
  ux <- unique(x)
  uxnotna <- ux[which(!is.na(ux))]
  if(length(uxnotna) > 0) {
    tab <- tabulate(match(x, uxnotna))
    candidates = uxnotna[tab == max(tab)]
    if (is.logical(x)) {
      any(candidates) # return TRUE if any true. max returns an integer
    } else {
      max(candidates) # return highest (ie max) value
    }
  } else {
    x[NA_integer_]
  }
}

MostCommon(c(1, 1, 2))
#> [1] 1

MostCommon(integer())
#> [1] NA
class(MostCommon(integer()))
#> [1] "integer"

MostCommon(character())
#> [1] NA
class(MostCommon(character()))
#> [1] "character"
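
The same subsetting trick carries over to other vector classes. A small sketch, assuming only base R (the variable names are illustrative):

```r
# Empty vectors of several classes
empty_date   <- as.Date(character())
empty_factor <- factor(character(), levels = c("a", "b"))
empty_lgl    <- logical()

# Subsetting with NA_integer_ yields one missing value that keeps the class
na_date   <- empty_date[NA_integer_]
na_factor <- empty_factor[NA_integer_]
na_lgl    <- empty_lgl[NA_integer_]

length(na_date)    # 1
class(na_date)     # "Date"
levels(na_factor)  # "a" "b"
is.na(na_lgl)      # TRUE
```

By contrast, a bare NA is always logical, so it does not carry the class of x.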

This makes MostCommon() consistent with n() and mean(), which are also summary functions, so you don't get any warnings here:

emptymtcars <- mtcars %>%
  filter(cyl > max(cyl))

emptymtcars
#>  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
#> <0 rows> (or 0-length row.names)

emptymtcars %>%
  summarise(
    count = n(),
    hp = mean(hp),
    carb = MostCommon(carb)
  )
#>   count  hp carb
#> 1     0 NaN   NA

emptymtcars %>%
  summarise(
    count = n(),
    hp = mean(hp),
    carb = MostCommon(carb),
    .by = c(cyl, gear)
  )
#> [1] cyl   gear  count hp    carb 
#> <0 rows> (or 0-length row.names)

In the ungrouped summarise(), we have 1 group containing all of the rows in the data frame, so we expect 1 output row.

In the grouped summarise(), there are technically 0 groups because there is no data to make up a combination of c(cyl, gear). We expect 1 row per group, but there are 0 groups, so we get 0 rows total. What happens under the hood is that each of the expressions is evaluated, giving size 1 results, and those are then recycled to size 0. This allows us to create the columns with the right types, even if there is no data there.
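
The recycling step can be imitated in base R. This is only an analogy under that assumption, not dplyr's actual implementation, and carb_result is an illustrative name:

```r
# A size 1, typed result, like the one MostCommon() returns for empty double input
carb_result <- double()[NA_integer_]

# Recycling size 1 down to size 0 leaves an empty vector of the same type
recycled <- rep_len(carb_result, 0)

length(recycled)  # 0
typeof(recycled)  # "double"
```

The empty result still knows it is a double, which is how a column can get the right type with no rows in it.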

The difference between the grouped and ungrouped summarise() is admittedly a little tricky, but we've convinced ourselves it is correct and consistent.

mikecrobp · February 9, 2023, 6:23pm

Thank you for the explanation
And the trick with x[NA_integer_]. New one on me.

What I still don't understand is why there is not a standard library function for this. It can't be a rare requirement, i.e. mode with a tie-break on min or max.
