dplyr and across

jfca283 · August 13, 2021, 2:26pm

Hi,
I know how to use across() when I want to summarise several variables.
What I don't know is how to summarise one variable conditional on each of several others.
For example:

Here I try, using the api data from the survey package, to obtain the sum of the variable fpc, grouped by cname, for each of the dummies s1 to s3:

library(survey)
library(dplyr)  # provides %>%, mutate(), group_by(), summarise(), across()
data(api)

apistrat %>% 
mutate(s1=if_else(comp.imp=="Yes",1,0),
       s2=if_else(stype=="E",1,0),
       s3=if_else(dnum>=500,1,0)) %>% 
group_by(cname) %>%
summarise(across(s1:s3,~sum(fpc)))

Here I could do it, but only by filtering and repeating the code three times:


apistrat %>% 
mutate(s1=if_else(comp.imp=="Yes",1,0)) %>% 
filter(s1==1) %>% 
group_by(cname) %>%
summarise(population=sum(fpc))

apistrat %>% 
mutate(s2=if_else(stype=="E",1,0)) %>% 
filter(s2==1) %>% 
group_by(cname) %>%
summarise(population=sum(fpc))

apistrat %>% 
mutate(s3=if_else(dnum>=500,1,0)) %>% 
filter(s3==1) %>% 
group_by(cname) %>%
summarise(population=sum(fpc))

As always, thanks for your time and interest

jfca283 · August 13, 2021, 2:46pm

I think I solved it, but not the way I wanted:

library(survey)
library(dplyr)
data(api)

apistrat %>% 
mutate(s1=if_else(comp.imp=="Yes",fpc,0),
       s2=if_else(stype=="E",fpc,0),
       s3=if_else(dnum>=500,fpc,0)) %>% 
group_by(cname) %>%
summarise(across(s1:s3,sum))

I had to edit the dummy. In fact, it's not a dummy anymore.
How can I keep the 0/1 dummies and refer to the fpc variable inside the summarise() call?
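
For anyone reading along, here is a minimal sketch of why the first attempt returned the same number in every column. The toy data frame and its values are made up for illustration and are not taken from apistrat. The lambda `~sum(fpc)` never refers to the current column (`.x`), so across() just repeats the group total of fpc. Indexing fpc by the dummy is one possible way to bring the column into the calculation.

```r
library(dplyr)

# Hypothetical toy data: two counties, a weight-like column, two 0/1 dummies
toy <- tibble(
  cname = c("A", "A", "B", "B"),
  fpc   = c(10, 20, 30, 40),
  s1    = c(1, 0, 1, 1),
  s2    = c(0, 0, 1, 0)
)

# The lambda ignores .x, so every column receives the plain group total of fpc
ignores_dummy <- toy %>%
  group_by(cname) %>%
  summarise(across(s1:s2, ~ sum(fpc)))

# Referring to .x restricts the sum to rows where the current dummy equals 1
uses_dummy <- toy %>%
  group_by(cname) %>%
  summarise(across(s1:s2, ~ sum(fpc[.x == 1])))

print(ignores_dummy)
print(uses_dummy)
```

In the second call the dummies stay 0/1 and fpc appears only inside summarise(). A group with no rows flagged 1 gets a sum of 0, because sum() of an empty vector is 0.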

StatSteph · August 13, 2021, 3:12pm

I think you need to use the srvyr package instead of the survey package if you want tidy syntax. Referring to my own materials here: GitHub - szimmer/tidy-survey-aapor-2021: Tidy Survey Analysis in R using the srvyr Package: AAPOR 2021 Sho

You first need to make a design object and then do the analysis to use the fpc. I think this is what you want.

library(survey)
#> Loading required package: grid
#> Loading required package: Matrix
#> Loading required package: survival
#> 
#> Attaching package: 'survey'
#> The following object is masked from 'package:graphics':
#> 
#>     dotchart
library(srvyr)
#> 
#> Attaching package: 'srvyr'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(tidyverse)
data(api)

dstrata <- apistrat %>%
   as_survey_design(strata = stype, weights = pw, fpc=fpc)


dstrata %>% 
   mutate(s1=if_else(comp.imp=="Yes",fpc,0),
          s2=if_else(stype=="E",fpc,0),
          s3=if_else(dnum>=500,fpc,0)) %>% 
   group_by(cname) %>%
   summarise(
      across(s1:s3, survey_total)
   )
#> # A tibble: 40 x 7
#>    cname              s1   s1_se       s2   s2_se      s3   s3_se
#>    <chr>           <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Alameda       195452. 193229.  781810. 380558. 607084. 331901.
#>  2 Amador             0       0        0       0       0       0 
#>  3 Butte              0       0        0       0       0       0 
#>  4 Colusa         20726.  20211.       0       0   20726.  20211.
#>  5 Contra Costa  671338. 333411.  586357. 331285. 238980. 194894.
#>  6 El Dorado      32127.  23019.       0       0   20726.  20211.
#>  7 Fresno       1193441. 461649. 1563619. 526859. 195452. 193229.
#>  8 Humboldt           0       0        0       0       0       0 
#>  9 Inyo          390905. 271884.  390905. 271884.      0       0 
#> 10 Kern          802536. 381094.  977262. 423255. 411631. 272634.
#> # ... with 30 more rows

Created on 2021-08-13 by the reprex package (v2.0.0)

jfca283 · August 13, 2021, 7:12pm

Thanks, StatSteph.
I already knew how to do the task using srvyr, but I still don't know how to do it using dplyr and across(). The only solution I found was modifying the dummies.
Thanks for your code by the way.

StatSteph · August 13, 2021, 7:51pm

I see, you want to reproduce what your 3 filter statements do. I did it in one step, creating the object called solution below, and then show that it's nearly the same as the result of the 3 filter statements you had:

library(survey)
#> Loading required package: grid
#> Loading required package: Matrix
#> Loading required package: survival
#> 
#> Attaching package: 'survey'
#> The following object is masked from 'package:graphics':
#> 
#>     dotchart
library(tidyverse)
data(api)

apistrat_mod <- apistrat %>% 
   mutate(s1=if_else(comp.imp=="Yes",1,0),
          s2=if_else(stype=="E",1,0),
          s3=if_else(dnum>=500,1,0)) %>%
   select(fpc, cname, s1:s3)

summary(apistrat_mod)
#>       fpc            cname                 s1             s2     
#>  Min.   : 755.0   Length:200         Min.   :0.00   Min.   :0.0  
#>  1st Qu.: 952.2   Class :character   1st Qu.:0.00   1st Qu.:0.0  
#>  Median :2719.5   Mode  :character   Median :1.00   Median :0.5  
#>  Mean   :2653.8                      Mean   :0.58   Mean   :0.5  
#>  3rd Qu.:4421.0                      3rd Qu.:1.00   3rd Qu.:1.0  
#>  Max.   :4421.0                      Max.   :1.00   Max.   :1.0  
#>        s3       
#>  Min.   :0.000  
#>  1st Qu.:0.000  
#>  Median :0.000  
#>  Mean   :0.425  
#>  3rd Qu.:1.000  
#>  Max.   :1.000

solution <- apistrat_mod %>% 
   group_by(cname) %>%
   summarise(across(s1:s3,~sum(fpc*.))) #need to multiply by s1:s3 which are indicators

# these were your attempts
t1 <- apistrat_mod %>% 
   filter(s1==1) %>% 
   group_by(cname) %>%
   summarise(s1=sum(fpc))

t2 <- apistrat_mod %>% 
   filter(s2==1) %>% 
   group_by(cname) %>%
   summarise(s2=sum(fpc))

t3 <-apistrat_mod %>% 
   filter(s3==1) %>% 
   group_by(cname) %>%
   summarise(s3=sum(fpc))

solution_filter <- t1 %>%
   full_join(t2, by="cname") %>%
   full_join(t3, by="cname") %>%
   replace_na(list(s1=0, s2=0, s3=0))

# these are the same except my solution has some rows yours doesn't
solution %>%
   full_join(solution_filter, by="cname") %>%
   filter(!near(s1.x, s1.y)|
             is.na(near(s1.x, s1.y)),
          !near(s2.x, s2.y)|
             is.na(near(s2.x, s2.y)),
          !near(s3.x, s3.y)|
             is.na(near(s3.x, s3.y)),
   )
#> # A tibble: 6 x 7
#>   cname       s1.x  s2.x  s3.x  s1.y  s2.y  s3.y
#>   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Amador         0     0     0    NA    NA    NA
#> 2 Butte          0     0     0    NA    NA    NA
#> 3 Humboldt       0     0     0    NA    NA    NA
#> 4 Mariposa       0     0     0    NA    NA    NA
#> 5 Napa           0     0     0    NA    NA    NA
#> 6 Stanislaus     0     0     0    NA    NA    NA

Created on 2021-08-13 by the reprex package (v2.0.0)
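
To make the "same except for some rows" point concrete, here is a small sketch on made-up data (the toy values are hypothetical, not from apistrat). It compares multiplying by the indicator with filtering first. The two agree wherever a group has at least one flagged row. A group with no flagged rows shows up as 0 in the first result and is absent from the second.

```r
library(dplyr)

# Hypothetical toy data: county "B" has no rows with s1 == 1
toy <- tibble(
  cname = c("A", "A", "B", "C"),
  fpc   = c(10, 20, 30, 40),
  s1    = c(1, 1, 0, 1)
)

# Multiply by the 0/1 indicator: every group is kept, unflagged rows add 0
by_product <- toy %>%
  group_by(cname) %>%
  summarise(across(s1, ~ sum(fpc * .x)))

# Filter first: groups without any flagged row disappear entirely
by_filter <- toy %>%
  filter(s1 == 1) %>%
  group_by(cname) %>%
  summarise(s1 = sum(fpc))

# Rows present only in the multiplication result
only_in_product <- anti_join(by_product, by_filter, by = "cname")

print(by_product)
print(by_filter)
print(only_in_product)
```

Which behaviour is preferable depends on whether a county with nothing to sum should be reported as 0 or left out.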

jfca283 · August 13, 2021, 8:11pm

Thanks, StatSteph.
I'll study your code.
Obviously, it works flawlessly.

arthur.t · August 13, 2021, 9:04pm

I think you can accomplish this with a pivot_longer() of s1, s2, s3, followed by a group_by() and summarize().

I personally find across() pretty confusing and never use it. You can accomplish the same thing with either pivots or a row-wise apply(, 1, ).

library(survey)
#> Warning: package 'survey' was built under R version 4.0.5
#> Loading required package: grid
#> Loading required package: Matrix
#> Loading required package: survival
#> 
#> Attaching package: 'survey'
#> The following object is masked from 'package:graphics':
#> 
#>     dotchart
library(tidyverse)
  
data(api)

apistrat %>% 
  mutate(s1=if_else(comp.imp=="Yes",1,0),
         s2=if_else(stype=="E",1,0),
         s3=if_else(dnum>=500,1,0)) %>%
  rename(name0 = name) %>% # prevent conflict
  pivot_longer(c(s1, s2, s3)) %>%
  filter(value == 1) %>%
  group_by(cname, name) %>%
  summarize(fpc = sum(fpc)) %>%
  ungroup() %>%
  pivot_wider(values_from = fpc)
#> `summarise()` has grouped output by 'cname'. You can override using the `.groups` argument.
#> # A tibble: 34 x 4
#>    cname           s1     s2    s3
#>    <chr>        <dbl>  <dbl> <dbl>
#>  1 Alameda       4421  17684 14281
#>  2 Colusa        1018     NA  1018
#>  3 Contra Costa 17827  13263  6949
#>  4 El Dorado     1773     NA  1018
#>  5 Fresno       27544  35368  4421
#>  6 Inyo          8842   8842    NA
#>  7 Kern         18702  22105  9860
#>  8 Kings         4421   4421    NA
#>  9 Los Angeles  77422 110525 24633
#> 10 Marin         1773     NA   755
#> # ... with 24 more rows

Created on 2021-08-13 by the reprex package (v1.0.0)
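
The NA cells in the output above come from county and dummy combinations that had no rows left after the filter. As a hedged variation on the same idea, the sketch below uses made-up toy data and the `values_fill` argument of pivot_wider() to show zeros instead. The column names `dummy` and `flag` are my own choice for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: county "C" has no flagged rows at all
toy <- tibble(
  cname = c("A", "A", "B", "C"),
  fpc   = c(10, 20, 30, 40),
  s1    = c(1, 0, 1, 0),
  s2    = c(1, 1, 0, 0)
)

wide <- toy %>%
  # Custom names_to / values_to avoid clashing with existing columns
  pivot_longer(c(s1, s2), names_to = "dummy", values_to = "flag") %>%
  filter(flag == 1) %>%
  group_by(cname, dummy) %>%
  summarise(fpc = sum(fpc), .groups = "drop") %>%
  # values_fill replaces missing county/dummy combinations with 0
  pivot_wider(names_from = dummy, values_from = fpc, values_fill = 0)

print(wide)
```

Choosing `names_to = "dummy"` sidesteps the rename of the existing `name` column. A county with no flagged rows for any dummy (here "C") is still dropped by the filter, so the result can have fewer rows than the across() approach.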

jfca283 · August 13, 2021, 9:40pm

Thanks, arthur.t
Your code is very much in the tidyverse style.
It ran without errors.
