Applying summary across multiple column

dobrowski · July 23, 2024, 6:09pm

I feel like this should be obvious but I'm struggling. Across function doesn't work since I'm not trying to mutate.

I want the rate to be calculated on a df of a bunch of students based on different demographic details. I know this is not a reproducible example but It should be clear enough like this.

The first function will give the rate for a demographic detail when you give the column name and the df. The second function, goes through a series of demographic columns and binds them together. I really want a function where I don't need to list all the demographic columns and it will just go through All:Homeless and do all of them and skip any missing ones.

Thanks.

car.school <- function(df,students) {
    
    ddff <-     deparse(substitute(df)) 
    studentsss <-     deparse(substitute(students))
    
    holder <-  df %>% 
        ungroup() %>%
        filter({{students}} == "Yes")  %>%
        
        mutate( # dist.standard = ScaleScore - MeetStandard,
            chronic.rate = 100*mean(chronic),
            count = n())  %>%
        select(chronic.rate, count) %>%
        distinct() %>%
        mutate(district = ddff,
               students = studentsss
        )
    
    # sheet_append(ss = sheet,
    #              sheet = "Distance from Standard Group",
    #              data = holder )
    holder
    
}


add.school.car <- function(df) {
    
    namer <- unique(df$SchoolName)
    coder <- unique(df$SchoolCode)
    
    waiting.room <- car.school(df,All) %>%
        bind_rows(  car.school(df,White) ) %>%
        bind_rows(  car.school(df,EnglishLearner) ) %>%
        bind_rows( car.school(df,Asian) )  %>%
        bind_rows( car.school(df,Filipino) )  %>%
        bind_rows( car.school(df,Multiple) )  %>%
        bind_rows( car.school(df,`Black/African Am`) )  %>%
        bind_rows( car.school(df,`Am Indian/Alskn Nat`) )  %>%
        bind_rows( car.school(df,`Nat Hwiin/Othr Pac Islndr`) )  %>%
        bind_rows( car.school(df,Hispanic) )  %>%
        bind_rows( car.school(df,StudentswithDisabilities) )  %>%
        bind_rows( car.school(df,SocioEconomicallyDisadvantaged) )  %>%
        bind_rows( car.school(df,Homeless) ) %>%
        mutate(SchoolName = namer,
               SchoolCode = coder
        )
    
    waiting.room
    
    
}

pcall · July 23, 2024, 7:26pm

Could you provide an example of what your data looks like? I have the impression that across() with summarise() could be a viable solution here. Check out the page for across: Apply a function (or functions) across multiple columns — across • dplyr.

jrkrideau · July 23, 2024, 8:14pm

May I second @ pcall's request. I suspect he is correct and that an lapply solution may also work but we really need to see some data. If your data is confidential some mock data should be fine.

A handy way to supply some sample data is the dput() function. In the case of a large dataset something like dput(head(mydata, 100)) should supply the data we need. Just do dput(mydata) where mydata is your data. Copy the output and paste it here between
```

```

dobrowski · July 23, 2024, 8:32pm

Here is example data from using dput and changing identifying information.


structure(list(SSID = c(100123123123, 100123123124, 100123123125, 100123123126, 
100123123127, 100123123128), StudentName = c("Last, First Middle", 
"Last, First Middle", "Last, First Middle", "Last, First Middle", 
"Last, First Middle", "Last, First Middle"), SchoolName = c("School Elementary", 
"School Elementary", "School Elementary", "School Middle", 
"School Middle", "School Elementary"), SchoolCode = c(6021111, 
6021111, 6021111, 6021112, 6021112, 6021111), Grade = c("01", 
"02", "KN", "06", "06", "01"), DaysExpectedA = c(180, 180, 180, 
180, 180, 180), ADAGeneratingIndependentStudyDays = c(0, 0, 0, 
0, 0, 0), DaysAttendedB = c(180, 171, 178, 177, 177, 174), DaysAbsentduetoOutOfSchoolSuspensionsC = c(0, 
0, 0, 0, 0, 0), InSchoolSuspensionDaysD = c(0, 0, 0, 0, 0, 0), 
    DaysExcusedAbsencesE = c(0, 9, 0, 3, 3, 6), DaysUnexcusedAbsencesF = c(0, 
    0, 2, 0, 0, 0), NonADAGeneratingIndependentStudyDaysG = c(0, 
    0, 0, 0, 0, 0), DaysAbsentCEFG = c(0, 9, 2, 3, 3, 6), AbsenceRate2 = c(0, 
    5, 1.11111111111111, 1.66666666666667, 1.66666666666667, 
    3.33333333333333), chronic = c(FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE), dupes = c(FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE), Homeless = c("N", "N", "N", "N", "N", "N"), StudentswithDisabilities = c("Yes", 
    "N", "N", "N", "Yes", "N"), EnglishLearner = c("N", "N", 
    "N", "N", "N", "N"), SocioEconomicallyDisadvantaged = c("Yes", 
    "N", "N", "N", "N", "N"), Hispanic = c("Yes", NA, NA, NA, 
    NA, NA), White = c(NA, "Yes", "Yes", "Yes", NA, "Yes"), Multiple = c(NA, 
    NA, NA, NA, "Yes", NA), Filipino = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), Asian = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), Missing = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), `Am Indian/Alskn Nat` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), All = c("Yes", "Yes", "Yes", "Yes", "Yes", 
    "Yes")), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L),

here is what the output looks like from the add.school.car function

structure(list(chronic.rate = c(12.9793510324484, 14.84375, 28.5714285714286, 
0, 0, 14.2857142857143, 12.6436781609195, 31.4285714285714, 17.1875, 
33.3333333333333), count = c(339L, 128L, 7L, 13L, 3L, 21L, 174L, 
35L, 64L, 3L), district = c("df", "df", "df", "df", "df", "df", 
"df", "df", "df", "df"), students = c("All", "White", "EnglishLearner", 
"Asian", "Filipino", "Multiple", "Hispanic", "StudentswithDisabilities", 
"SocioEconomicallyDisadvantaged", "Homeless"), SchoolName = c("School Middle", 
"School Middle", "School Middle", "School Middle", 
"School Middle", "School Middle", "School Middle", 
"School Middle", "School Middle", "School Middle"
), SchoolCode = c(6021112, 6021112, 6021112, 6021112, 6021112, 
6021112, 6021112, 6021112, 6021112, 6021112)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

dromano · July 23, 2024, 9:02pm

It looks like part of the dput() output is missing.

pcall · July 23, 2024, 9:08pm

It looks like the class arg was cut off from the first structure() call--easy enough to add back in.

Edit: however, I can't recreate the posted output from the original data set using the add.school.car function.

nirgrahamuk · July 24, 2024, 9:05am

you have a use case for purrr::map* functions;
make a function, call it over a range of inputs, aggregate the results.
your process involves binding by rows at the end, presumably a map_dfr would be appropriate.

dromano · July 24, 2024, 12:00pm

Are you interested in per district or per school values of the chronic rate? I ask because your second table includes SchoolName and SchoolCode columns even though there is a single rate for StudentswithDisabilities, which suggests students from the elementary and middle schools are being aggregated to calculate the rate.

For example, here is code that calculates (incorrect) per school values of AbsentRate2 (the only column that allows interesting calculations from the sample table):

original data, saved as `demo`

structure(list(SSID = c(100123123123, 100123123124, 100123123125, 100123123126, 
100123123127, 100123123128), StudentName = c("Last, First Middle", 
"Last, First Middle", "Last, First Middle", "Last, First Middle", 
"Last, First Middle", "Last, First Middle"), SchoolName = c("School Elementary", 
"School Elementary", "School Elementary", "School Middle", 
"School Middle", "School Elementary"), SchoolCode = c(6021111, 
6021111, 6021111, 6021112, 6021112, 6021111), Grade = c("01", 
"02", "KN", "06", "06", "01"), DaysExpectedA = c(180, 180, 180, 
180, 180, 180), ADAGeneratingIndependentStudyDays = c(0, 0, 0, 
0, 0, 0), DaysAttendedB = c(180, 171, 178, 177, 177, 174), DaysAbsentduetoOutOfSchoolSuspensionsC = c(0, 
0, 0, 0, 0, 0), InSchoolSuspensionDaysD = c(0, 0, 0, 0, 0, 0), 
    DaysExcusedAbsencesE = c(0, 9, 0, 3, 3, 6), DaysUnexcusedAbsencesF = c(0, 
    0, 2, 0, 0, 0), NonADAGeneratingIndependentStudyDaysG = c(0, 
    0, 0, 0, 0, 0), DaysAbsentCEFG = c(0, 9, 2, 3, 3, 6), AbsenceRate2 = c(0, 
    5, 1.11111111111111, 1.66666666666667, 1.66666666666667, 
    3.33333333333333), chronic = c(FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE), dupes = c(FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE), Homeless = c("N", "N", "N", "N", "N", "N"), StudentswithDisabilities = c("Yes", 
    "N", "N", "N", "Yes", "N"), EnglishLearner = c("N", "N", 
    "N", "N", "N", "N"), SocioEconomicallyDisadvantaged = c("Yes", 
    "N", "N", "N", "N", "N"), Hispanic = c("Yes", NA, NA, NA, 
    NA, NA), White = c(NA, "Yes", "Yes", "Yes", NA, "Yes"), Multiple = c(NA, 
    NA, NA, NA, "Yes", NA), Filipino = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), Asian = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), Missing = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), `Am Indian/Alskn Nat` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), All = c("Yes", "Yes", "Yes", "Yes", "Yes", 
    "Yes")), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame")) -> demo

library(tidyverse)
demo |> 
  pivot_longer(
    # pick out demographics columns based on values; 
    # create new columns with their names and values
    where(\(x) all(x %in% c('Yes', 'N', NA))) 
    ) |> 
  # group by combination of demographic column name, value, and school
  group_by(name, value, SchoolName, SchoolCode) |> 
  # calculate average value of `AbsenceRate2` and total student per group
  summarise(mean = mean(AbsenceRate2), total = n()) |> 
  # retain only values corresponding to actual demographic populations
  filter(value == 'Yes') |> 
  # move calculated columns to front
  relocate(mean, total) |> 
  # undo grouping used to calculate values
  ungroup()
#> # A tibble: 9 × 6
#>    mean total name                           value SchoolName        SchoolCode
#>   <dbl> <int> <chr>                          <chr> <chr>                  <dbl>
#> 1  2.36     4 All                            Yes   School Elementary    6021111
#> 2  1.67     2 All                            Yes   School Middle        6021112
#> 3  0        1 Hispanic                       Yes   School Elementary    6021111
#> 4  1.67     1 Multiple                       Yes   School Middle        6021112
#> 5  0        1 SocioEconomicallyDisadvantaged Yes   School Elementary    6021111
#> 6  0        1 StudentswithDisabilities       Yes   School Elementary    6021111
#> 7  1.67     1 StudentswithDisabilities       Yes   School Middle        6021112
#> 8  3.15     3 White                          Yes   School Elementary    6021111
#> 9  1.67     1 White                          Yes   School Middle        6021112

^{Created on 2024-07-24 with reprex v2.0.2}

keithn · July 24, 2024, 12:15pm

Not clear on exactly what output you are after, but would this method get you close?

library(tidyr)
library(dplyr)

district1 <- structure(list(SSID = c(100123123123, 100123123124, 100123123125, 100123123126, 
100123123127, 100123123128), StudentName = c("Last, First Middle", 
"Last, First Middle", "Last, First Middle", "Last, First Middle", 
"Last, First Middle", "Last, First Middle"), SchoolName = c("School Elementary", 
"School Elementary", "School Elementary", "School Middle", 
"School Middle", "School Elementary"), SchoolCode = c(6021111, 
6021111, 6021111, 6021112, 6021112, 6021111), Grade = c("01", 
"02", "KN", "06", "06", "01"), DaysExpectedA = c(180, 180, 180, 
180, 180, 180), ADAGeneratingIndependentStudyDays = c(0, 0, 0, 
0, 0, 0), DaysAttendedB = c(180, 171, 178, 177, 177, 174), DaysAbsentduetoOutOfSchoolSuspensionsC = c(0, 
0, 0, 0, 0, 0), InSchoolSuspensionDaysD = c(0, 0, 0, 0, 0, 0), 
    DaysExcusedAbsencesE = c(0, 9, 0, 3, 3, 6), DaysUnexcusedAbsencesF = c(0, 
    0, 2, 0, 0, 0), NonADAGeneratingIndependentStudyDaysG = c(0, 
    0, 0, 0, 0, 0), DaysAbsentCEFG = c(0, 9, 2, 3, 3, 6), AbsenceRate2 = c(0, 
    5, 1.11111111111111, 1.66666666666667, 1.66666666666667, 
    3.33333333333333), chronic = c(FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE), dupes = c(FALSE, FALSE, FALSE, FALSE, FALSE, 
    FALSE), Homeless = c("N", "N", "N", "N", "N", "N"), StudentswithDisabilities = c("Yes", 
    "N", "N", "N", "Yes", "N"), EnglishLearner = c("N", "N", 
    "N", "N", "N", "N"), SocioEconomicallyDisadvantaged = c("Yes", 
    "N", "N", "N", "N", "N"), Hispanic = c("Yes", NA, NA, NA, 
    NA, NA), White = c(NA, "Yes", "Yes", "Yes", NA, "Yes"), Multiple = c(NA, 
    NA, NA, NA, "Yes", NA), Filipino = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), Asian = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), Missing = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), `Am Indian/Alskn Nat` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), All = c("Yes", "Yes", "Yes", "Yes", "Yes", 
    "Yes")), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L),class = c("tbl_df", "tbl", "data.frame"))

car.school <- function(df) {
    district <- deparse(substitute(df))
    df %>%
        pivot_longer(Homeless:All, names_to = "students", values_to = "bg_status") %>%
        filter(bg_status == "Yes") %>%
        group_by(students) %>%
        summarise(chronic.rate = 100*mean(chronic),
                  count = n()) %>%
        mutate(district = district)
}
car.school(district1)

to give

# A tibble: 6 × 4
  students                       chronic.rate count district 
  <chr>                                 <dbl> <int> <chr>    
1 All                                       0     6 district1
2 Hispanic                                  0     1 district1
3 Multiple                                  0     1 district1
4 SocioEconomicallyDisadvantaged            0     1 district1
5 StudentswithDisabilities                  0     2 district1
6 White                                     0     4 district1

system · August 8, 2024, 11:31pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.