Group_By function not giving me the summary stats for each group, just the overall

Jennski · March 25, 2021, 9:03pm

Hi all, apologies if this seems like a simple question. I am extremely new to R.

I am trying to compute summary statistics by group, so that I can do a Kruskal-Wallis test. The data consist of academic years (the groups) and course fee amounts. I want to get a descriptive summary of the course fee amounts within each academic year.

I use the following code:

>group_by(dataframe, dataframe$ï..Course.start.academic.year) %>% 
summarise(count = n(), 
mean = mean(dataframe$Fee.amount, na.rm = TRUE), 
sd = sd(dataframe$Fee.amount, na.rm = TRUE), 
median = median(dataframe$Fee.amount, na.rm = TRUE), 
IQR = IQR(dataframe$Fee.amount, na.rm = TRUE))

and get the following output:

# A tibble: 6 x 6
  `dataframe$ï..Course.start.academic.year` count   mean    sd median   IQR
  <fct>                                     <int>  <dbl> <dbl>  <dbl> <dbl>
1 2016-17                                    1930 13909. 8799.  11460 11547
2 2017-18                                    3486 13909. 8799.  11460 11547
3 2018-19                                    3123 13909. 8799.  11460 11547
4 2019-20                                    2767 13909. 8799.  11460 11547
5 2020-21                                    1989 13909. 8799.  11460 11547
6 2021-22                                    1014 13909. 8799.  11460 11547

Obviously it is getting the grouping and the count correct, but it is not giving me the individual descriptive statistics, just the ones for the dataset as a whole!

I already have tidyverse, ggpubr, dplyr and rstatix packages installed.

What am I doing wrong?

Many thanks

technocrat · March 25, 2021, 9:54pm

Hi, and welcome. See the FAQ "How to do a minimal reproducible example (reprex) for beginners" for tips on how to attract more answers. The data in this case were easy enough to synthesize, but having to do so creates friction.

Two suggestions are embedded in the code:

Use short variable names (easier to type); when the time comes to present results, the headings can easily be changed.
Use whitespace freely; it makes inconsistencies easier to spot. (And prefer spaces over tabs, and never mix them.)

suppressPackageStartupMessages({
  library(dplyr)
})

# create synthetic data
set.seed(42)
year_basket <- sample(2000:2020, 100, replace = TRUE)
set.seed(137)
fee_basket <- sample(6000:9000, 100)
synthetic <- tibble(Year = year_basket, Fee = fee_basket)

# group by Year and summarize stats

synthetic %>% 
  arrange(Year) %>%
  group_by(Year) %>% summarize(
    Count = n(), 
    Mean = mean(Fee), 
    SD = sd(Fee), 
    Median = median(Fee), 
    IQR = IQR(Fee))
#> # A tibble: 21 x 6
#>     Year Count  Mean    SD Median   IQR
#>    <int> <int> <dbl> <dbl>  <dbl> <dbl>
#>  1  2000     4 7827  1033.  8144. 1186 
#>  2  2001     5 7641.  855.  8112   486 
#>  3  2002     5 7601.  997.  7714    75 
#>  4  2003     9 7160.  803.  7317  1220 
#>  5  2004    10 7806.  717.  7864  1085 
#>  6  2005     4 7480.  308.  7548.  321.
#>  7  2006     3 6692.  455.  6735   453 
#>  8  2007     6 7471.  471.  7586.  579 
#>  9  2008     5 7146. 1049.  7477  1482 
#> 10  2009     5 7190.  451.  6930   322 
#> # … with 11 more rows
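
For anyone landing here with the same symptom: the likely cause in the original code is that `dataframe$Fee.amount` always pulls the entire column, so `summarise()` computes the same overall value for every group. Referring to the column by its bare name lets dplyr evaluate it within each group. Here is a minimal sketch of the difference; the data frame `fees` and its columns `year` and `fee` are made-up stand-ins, not the poster's real names.

```r
library(dplyr)

# Hypothetical stand-in data; `fees`, `year` and `fee` are assumed names
fees <- tibble(
  year = c("2016-17", "2016-17", "2017-18", "2017-18", "2017-18"),
  fee  = c(100, 300, 1000, 2000, NA)
)

# Problem: fees$fee is the whole column, so the grouping is ignored
# and both rows get the overall mean (850)
wrong <- fees %>%
  group_by(year) %>%
  summarise(count = n(), mean = mean(fees$fee, na.rm = TRUE))

# Fix: bare column names are evaluated per group
# (group means here are 200 and 1500)
right <- fees %>%
  group_by(year) %>%
  summarise(
    count  = n(),
    mean   = mean(fee, na.rm = TRUE),
    sd     = sd(fee, na.rm = TRUE),
    median = median(fee, na.rm = TRUE),
    IQR    = IQR(fee, na.rm = TRUE)
  )
```

The same applies to `group_by()` itself: `group_by(year)` rather than `group_by(fees$year)`. The `ï..` prefix on the original grouping column usually comes from a byte-order mark picked up when the CSV was read, so renaming that column first would probably make the code easier to type.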
