Automated summary statistics for many numerical variables

juandmaz · July 11, 2023, 8:13am

I have a table with 3 numerical variables and 1 logical variable with 2 values.
I need to get the minimum, maximum, mean, median, and first and third quartiles of all the numerical variables for both values of the logical variable.
How can I do that with a single piece of code instead of writing separate code for each variable?
Here is the data frame:

# A tibble: 10 × 4
       a     b c         d
   <int> <int> <lgl> <int>
 1     1     1 TRUE      1
 2     2     2 FALSE     2
 3     3     3 TRUE      3
 4     4     4 FALSE     4
 5     5     5 TRUE      5
 6     6     6 FALSE     6
 7     7     7 TRUE      7
 8     8     8 FALSE     8
 9     9     9 TRUE      9
10    10    10 FALSE    10

structure(list(a = 1:10, b = 1:10, c = c(TRUE, FALSE, TRUE, FALSE, 
TRUE, FALSE, TRUE, FALSE, TRUE, FALSE), d = 1:10), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -10L))

Comede_way · July 11, 2023, 9:15am

Hi @juandmaz, I found a possible solution. Is this what you want? I mainly relied on the following package:
doBy: Groupwise Statistics, LSmeans, Linear Estimates, Utilities (r-project.org)

install.packages("doBy")
library(doBy)
#> Warning: package 'doBy' was built under R version 4.2.3

data188 <- structure(list(a = 1:10, b = 1:10, c = c(TRUE, FALSE, TRUE, FALSE,
TRUE, FALSE, TRUE, FALSE, TRUE, FALSE), d = 1:10), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
# summary() returns Min., 1st Qu., Median, Mean, 3rd Qu. and Max. for each variable
data189 <- summaryBy(a + b + d ~ c, data = data188, FUN = summary)
data189
#> # A tibble: 2 × 19
#>   c     a.Min. `a.1st Qu.` a.Median a.Mean a.3rd…¹ a.Max. b.Min. b.1st…² b.Med…³
#>   <lgl>  <dbl>       <dbl>    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
#> 1 FALSE      2           4        6      6       8     10      2       4       6
#> 2 TRUE       1           3        5      5       7      9      1       3       5
#> # … with 9 more variables: b.Mean <dbl>, `b.3rd Qu.` <dbl>, b.Max. <dbl>,
#> #   d.Min. <dbl>, `d.1st Qu.` <dbl>, d.Median <dbl>, d.Mean <dbl>,
#> #   `d.3rd Qu.` <dbl>, d.Max. <dbl>, and abbreviated variable names
#> #   ¹`a.3rd Qu.`, ²`b.1st Qu.`, ³b.Median

Created on 2023-07-11 with reprex v2.0.2
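For readers who would rather not install an extra package, the same grouped summary can be sketched in base R with `aggregate()`. This is only an illustration of the idea, not part of the answer above: the helper name `six_stats` and the object names `df` and `res` are made up for this example, and the column names it produces (`a.min`, `a.q1`, and so on) differ from the ones `summaryBy()` generates.

```r
# Same data as in the question, rebuilt as a plain data frame
df <- data.frame(a = 1:10, b = 1:10, c = rep(c(TRUE, FALSE), 5), d = 1:10)

# Hypothetical helper: the six statistics asked for, as a named vector
six_stats <- function(x) {
  c(min    = min(x),
    q1     = unname(quantile(x, 0.25)),
    median = median(x),
    mean   = mean(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x))
}

# One row per value of c; each of a, b, d comes back as a matrix column
res <- aggregate(cbind(a, b, d) ~ c, data = df, FUN = six_stats)

# Flatten the matrix columns into ordinary columns (a.min, a.q1, ...)
res <- do.call(data.frame, res)
res
```

Adding another numeric variable only requires listing it inside `cbind()`; the helper itself does not change.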

nirgrahamuk · July 11, 2023, 11:27am

library(tidyverse)

some_data <- structure(list(a = 1:10, b = 1:10, c = c(
  TRUE, FALSE, TRUE, FALSE,
  TRUE, FALSE, TRUE, FALSE, TRUE, FALSE
), d = 1:10), class = c(
  "tbl_df",
  "tbl", "data.frame"
), row.names = c(NA, -10L))


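# One row per value of c, one column per variable/statistic pair (a_min, a_max, ...)
(wide_version <- some_data |>
  group_by(c) |>
  summarise(
    across(where(is.numeric),
      .fns = list(
        min = min,
        max = max,
        mean = mean,
        median = median,
        q_low = ~ quantile(.x, probs = .25),
        q_high = ~ quantile(.x, probs = .75)
      )
    )
  ))

# One row per variable/statistic pair, one column per value of c (FALSE, TRUE)
(long_version <- wide_version |>
  pivot_longer(cols = -c) |>
  pivot_wider(names_from = c))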
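If the variable name and the statistic are wanted in separate columns, one possible variation is to change the separator that `across()` uses when naming its output and then split on it in `pivot_longer()`. The sketch below is an assumption-laden illustration rather than part of the answer above: it switches the separator to a dot through `.names` (because `q_low` and `q_high` already contain an underscore), and the names `tidy_stats`, `variable`, and `statistic` are invented for the example.

```r
library(dplyr)
library(tidyr)

some_data <- tibble(a = 1:10, b = 1:10, c = rep(c(TRUE, FALSE), 5), d = 1:10)

tidy_stats <- some_data |>
  group_by(c) |>
  summarise(
    across(where(is.numeric),
      .fns = list(
        min = min,
        q_low = ~ unname(quantile(.x, probs = .25)),
        median = median,
        mean = mean,
        q_high = ~ unname(quantile(.x, probs = .75)),
        max = max
      ),
      # name columns "a.min", "a.q_low", ... so the dot can serve as the split point
      .names = "{.col}.{.fn}"
    )
  ) |>
  # split "a.min" into variable = "a" and statistic = "min"
  pivot_longer(
    cols = -c,
    names_to = c("variable", "statistic"),
    names_sep = "\\."
  )

tidy_stats
```

This layout has one row per group, variable, and statistic, which tends to be convenient for filtering or plotting. It relies on the variable names themselves containing no dots.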

juandmaz · July 12, 2023, 5:13am

Yes, it works! Thanks

Comede_way · July 12, 2023, 5:36am

@juandmaz You are welcome! If I solved your problem, would you mind giving my post a like or clicking its solution button? Thanks a lot.
