I would like to take a dataframe (or tibble) with columns of multiple types and apply an arbitrary number of functions on each column, returning a tibble where each row is a column in the original dataframe, and each column is the result of a different summarizing function.
Having the final output as a tibble allows easy identification of columns in the original input that meet particular criteria; e.g., all columns in the dataframe input with greater than a certain proportion of NA.
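As a minimal sketch of the kind of filtering I have in mind (base R only, with made-up data; `df`, `col_summary`, and `threshold` are illustrative names, not part of my actual code):

```r
# Hypothetical example data, only for illustration.
df <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# One row per original column, one column per summary statistic.
col_summary <- data.frame(
  column  = names(df),
  na_frac = vapply(df, function(v) mean(is.na(v)), numeric(1)),
  stringsAsFactors = FALSE
)

# Pick out the original columns whose NA proportion exceeds a cutoff.
threshold <- 0.3  # assumed cutoff, chosen arbitrarily
high_na <- col_summary$column[col_summary$na_frac > threshold]
high_na
```

With the summary in this shape, selecting columns by any criterion is an ordinary row filter.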
The following code works, but I would describe it as having not just code smell but code reek. Any suggestions on cleaning it up? I like working in the tidyverse, but base R would also be acceptable.
library(tidyverse) # dplyr for funs(), tibble for rownames_to_column(), tidyr for unnest()
num_unique = function(v) { length(unique(v)) }
fxns = funs( # use funs so can use '.' and include additional arguments, like na.rm = TRUE
  typeof,
  num_unique, # can't define a named function in here; need to define outside of funs()
  mean(., na.rm = TRUE),
  na_frac = mean(is.na(.)),
  na_num = sum(is.na(.))
)
x <- tibble(
  ints = 1:10,
  char = letters[1:10],
  fac = as.factor(letters[1:10]),
  lgl = c(rep(TRUE, 5), rep(FALSE, 5))
)
x[5, ] <- NA # to test that na.rm is working
x
# suppressWarnings, so that applying numeric functions to non-numeric columns is quieter
suppressWarnings(sapply(fxns, function(fn) {x %>% summarise_all(fn)})) %>% # returns list, with dim and dimnames
  as.data.frame %>% # convert list to dataframe; if convert to tibble, lose rownames
  rownames_to_column(var = 'column') %>% # requires a dataframe
  as.tibble() %>% # have column of rownames now, but each column is still a list-column
  unnest() # simplify list-columns back to vectors
Thus, the input x is:
## # A tibble: 10 x 4
##     ints char  fac   lgl
##    <int> <chr> <fct> <lgl>
##  1     1 a     a     TRUE
##  2     2 b     b     TRUE
##  3     3 c     c     TRUE
##  4     4 d     d     TRUE
##  5    NA <NA>  <NA>  NA
##  6     6 f     f     FALSE
##  7     7 g     g     FALSE
##  8     8 h     h     FALSE
##  9     9 i     i     FALSE
## 10    10 j     j     FALSE
... and the output is:
## # A tibble: 4 x 6
##   column typeof    num_unique   mean na_frac na_num
##   <chr>  <chr>          <int>  <dbl>   <dbl>  <int>
## 1 ints   integer           10  5.56    0.100      1
## 2 char   character         10 NA       0.100      1
## 3 fac    integer           10 NA       0.100      1
## 4 lgl    logical            3  0.444   0.100      1
Things I'd like to clean up, if possible:
- be able to define the function within the funs() list of functions
- not have to go through the convoluted transformations from:
  - an sapply output of a list with dim and dimnames attributes,
  - to a data.frame (with columns all list-columns; can't use tibble, as would lose the rownames)
  - to a tibble (so I can more easily simplify the list-columns, now with a column made from rownames)
  - to a tibble, simplified back to normal vectors for columns
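To make that intermediate shape concrete, here is a small base-R sketch of how sapply ends up returning a list with dim and dimnames. The data and the names `df`, `fns`, and `m` are hypothetical stand-ins, not the code above, and it uses lapply in place of summarise_all:

```r
# Illustrative data and functions (hypothetical names).
df <- data.frame(n = c(1, 2, NA), s = c("a", NA, "c"), stringsAsFactors = FALSE)
fns <- list(
  type   = typeof,
  na_num = function(v) sum(is.na(v))
)

# Each call returns a list with one element per column of df, so sapply()
# simplifies the results into a list-matrix:
# rows = original columns, columns = summary functions.
m <- sapply(fns, function(fn) lapply(df, fn))

is.list(m)   # the matrix cells are list elements, not atomic values
dim(m)
dimnames(m)

# Individual cells have to be pulled out with [[ ]].
m[["n", "type"]]
m[["s", "na_num"]]

# as.data.frame() is the step that keeps the row names; as far as I can
# tell, each resulting column is still a list-column at this point.
d <- as.data.frame(m)
```

This is why the later steps are needed: the row names carry the original column names, and every cell is still wrapped in a list.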
I guess I'm happy that this Frankenstein code works, but can't help but think there should be a more elegant way to do it.