Any easier way to summarize across multiple variables, grouped by id?

veda · March 8, 2022, 2:41am

Dear R experts,

I have a data frame as below:
df=data.frame(cluster_id=c(1,1,2,2,3),
datetime0=c('2022-01-03','2022-02-03','2021-11-14','2022-01-18','2022-01-27'),
var1=c(10,0,0,0,9),
var2=c(10,0,0,0,9),
var3=c(0,1,0,3,9))

I used the following commands to summarize the data frame.

setDT(df)[,list(var1=n_distinct(ifelse(var1>0 , datetime0, NA), na.rm=T),
var2=n_distinct(ifelse(var2>0 , datetime0, NA), na.rm=T),
var3=n_distinct(ifelse(var3>0 , datetime0, NA), na.rm=T)),
by=.(cluster_id)]

However, var1, var2, var3 could be a much longer list (e.g., var1-var20), and I wonder whether there is any way to get the same results without writing var*=n_distinct(ifelse(var*>0 , datetime0, NA), na.rm=T) for each of var1-var20.

Your suggestions will be appreciated.

Sincerely,
Veda

FJCC · March 8, 2022, 5:05am

You can use the across() function from dplyr.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
df=data.frame(cluster_id=c(1,1,2,2,3),
              datetime0=c('2022-01-03','2022-02-03','2021-11-14','2022-01-18','2022-01-27'),
              var1=c(10,0,0,0,9),
              var2=c(10,0,0,0,9),
              var3=c(0,1,0,3,9))


setDT(df)[,list(var1=n_distinct(ifelse(var1>0 , datetime0, NA), na.rm=T),
                var2=n_distinct(ifelse(var2>0 , datetime0, NA), na.rm=T),
                var3=n_distinct(ifelse(var3>0 , datetime0, NA), na.rm=T)),
                by=.(cluster_id)]
#>    cluster_id var1 var2 var3
#> 1:          1    1    1    1
#> 2:          2    0    0    1
#> 3:          3    1    1    1

df |> group_by(cluster_id) |> 
  summarize(across(.cols = var1:var3, ~n_distinct(ifelse(. >0, datetime0,NA), 
                                                  na.rm = TRUE)))
#> # A tibble: 3 x 4
#>   cluster_id  var1  var2  var3
#>        <dbl> <int> <int> <int>
#> 1          1     1     1     1
#> 2          2     0     0     1
#> 3          3     1     1     1

Created on 2022-03-07 by the reprex package (v2.0.1)
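Since the original attempt used data.table, the same summary can probably also be written without dplyr. The following is only a sketch under a few assumptions: the columns to summarize are exactly those whose names start with "var", the helper name count_dates is made up for illustration, and base R's length(unique(...)) is swapped in for n_distinct().

```r
library(data.table)

df <- data.table(
  cluster_id = c(1, 1, 2, 2, 3),
  datetime0  = c("2022-01-03", "2022-02-03", "2021-11-14", "2022-01-18", "2022-01-27"),
  var1 = c(10, 0, 0, 0, 9),
  var2 = c(10, 0, 0, 0, 9),
  var3 = c(0, 1, 0, 3, 9)
)

# Assumption: every column to summarize is named var<something>
vars <- grep("^var", names(df), value = TRUE)

# Hypothetical helper: number of distinct dates on which v is positive.
# which() drops NA comparisons, so no separate na.rm handling is needed.
count_dates <- function(v, dates) length(unique(dates[which(v > 0)]))

# .SD holds only the columns listed in .SDcols; datetime0 is still visible in j
res <- df[, lapply(.SD, count_dates, dates = datetime0),
          by = cluster_id, .SDcols = vars]
print(res)
```

This should give the same counts as the two versions above (1/1/1, 0/0/1, 1/1/1 for clusters 1 to 3). With var1 to var20 nothing needs to change, because vars is built from the column names; on the dplyr side, across(starts_with("var"), ...) plays the same role.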
