Comparing the Effect of a Variable Being Absent/Present?

swaheera · June 7, 2022, 4:01am

I am working with the R programming language.

I have the following data:

set.seed(123)

var1 = sample(0:1, 10000, replace=T)
var2 = sample(0:1, 10000, replace=T)
var3 = sample(0:1, 10000, replace=T)
var4 = sample(0:1, 10000, replace=T)
score = rnorm(10000,10,5)

my_data = data.frame(var1,var2, var3,var4, score)

We can see the summary of unique rows for this data with the following command:

# https://stackoverflow.com/questions/34312324/r-count-all-combinations
> library(data.table)
> dt = my_data[,c(1,2,3,4)]
> setDT(dt)[,list(Count=.N) ,names(dt)]
    var1 var2 var3 var4 Count
 1:    0    0    0    0   667
 2:    0    1    0    0   601
 3:    1    1    1    1   651
 4:    0    1    1    1   608
 5:    1    0    1    1   613
 6:    1    1    0    1   588
 7:    0    1    1    0   607
 8:    0    0    1    1   607
 9:    1    0    1    0   625
10:    0    1    0    1   661
11:    1    1    1    0   635
12:    0    0    1    0   640
13:    1    1    0    0   608
14:    1    0    0    0   607
15:    0    0    0    1   626
16:    1    0    0    1   656
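
For readers without data.table, the same counts can probably be obtained in base R. This is only a minimal sketch and not part of the original post: `combo_counts` is a name chosen here for illustration, and the rows come out sorted rather than in the order shown above.

```r
# Recreate the data from the post (same seed and same order of random draws)
set.seed(123)
my_data <- data.frame(
  var1  = sample(0:1, 10000, replace = TRUE),
  var2  = sample(0:1, 10000, replace = TRUE),
  var3  = sample(0:1, 10000, replace = TRUE),
  var4  = sample(0:1, 10000, replace = TRUE),
  score = rnorm(10000, 10, 5)
)

# Count the rows in every observed combination of the four binary variables.
# length() of the score values in each group is simply the group size.
combo_counts <- aggregate(score ~ var1 + var2 + var3 + var4,
                          data = my_data, FUN = length)
names(combo_counts)[names(combo_counts) == "score"] <- "Count"

combo_counts
```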

I want to compare the average value of "score" when a variable is "present" (1) with the average when the same variable is "absent" (0), while the other variables are held fixed. For example:

Contribution for Var4 : Average score for (var1 = 1, var2= 1, var3 = 1, var4 = 1) - Average score for (var1 = 1, var2= 1, var3 = 1, var4 = 0)
Contribution for Var2 : Average score for (var1 = 1, var2= 1, var3 = 1, var4 = 1) - Average score for (var1 = 1, var2= 0, var3 = 1, var4 = 1)
etc.

I found a very "clumsy" way to do this:

var1_present <- my_data[which(my_data$var1 == 1 & my_data$var2 == 1 & my_data$var3 == 1 & my_data$var4 == 1 ), ]

var1_present_score = mean(var1_present$score)

var1_absent <- my_data[which(my_data$var1 == 0 & my_data$var2 == 1 & my_data$var3 == 1 & my_data$var4 == 1 ), ]

var1_absent_score = mean(var1_absent$score)

var1_contribution = var1_present_score - var1_absent_score
var1_contribution

[1] 0.1288283

Is there some way to write a function that can look at the "contribution" of different variables to the "score"? I understand that even for 4 variables there can be many different combinations to compare, e.g. row 14 vs. row 16: (1,0,0,0) vs. (1,0,0,1). But even for just some "contributions", is it possible to write a function that evaluates the "contribution" of variables being absent/present?

Can someone please show me how to do this?

Thanks!

pieterjanvc · June 7, 2022, 10:08am

Hello,

I think I have a way of finding out the individual contributions of each variable.

library(tidyverse)

set.seed(123)

var1 = sample(0:1, 10000, replace=T)
var2 = sample(0:1, 10000, replace=T)
var3 = sample(0:1, 10000, replace=T)
var4 = sample(0:1, 10000, replace=T, prob = c(0.2,0.8))
score = rnorm(10000,10,5)

my_data = data.frame(var1,var2, var3,var4, score)


#Transform the data into long format (create Id to keep track)
my_data = my_data %>% 
  mutate(id = 1:n()) %>%  
  pivot_longer(-c(score, id), names_to = "var", values_to = "present") %>% 
  #Calculate the percentage of contribution to each value
  group_by(id) %>% 
  mutate(
    contrPerc = present / max(sum(present), 1),
    contrVal = score * contrPerc)

head(my_data, 8)
#> # A tibble: 8 × 6
#> # Groups:   id [2]
#>   score    id var   present contrPerc contrVal
#>   <dbl> <int> <chr>   <int>     <dbl>    <dbl>
#> 1  5.82     1 var1        0       0       0   
#> 2  5.82     1 var2        0       0       0   
#> 3  5.82     1 var3        0       0       0   
#> 4  5.82     1 var4        1       1       5.82
#> 5  8.90     2 var1        0       0       0   
#> 6  8.90     2 var2        1       0.5     4.45
#> 7  8.90     2 var3        0       0       0   
#> 8  8.90     2 var4        1       0.5     4.45

#Summarise the contribution per variable
my_data %>% group_by(var) %>% 
  summarise(contrPerc = mean(contrPerc), 
            contrVal = mean(contrVal[present == 1]))
#> # A tibble: 4 × 3
#>   var   contrPerc contrVal
#>   <chr>     <dbl>    <dbl>
#> 1 var1      0.198     3.98
#> 2 var2      0.197     3.96
#> 3 var3      0.198     3.94
#> 4 var4      0.381     4.69

Created on 2022-06-07 by the reprex package (v2.0.1)

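I converted the data into long format and then added a few statistics to calculate the contributions. Note that I changed the probability of variable 4 to be 80% '1' to showcase that it ends up with a higher score. 'contrPerc' is the average share of the total score attributed to a variable: within each row the score is split equally among the variables that are present, so the shares in a row sum to 1 whenever at least one variable is present. The per-variable averages shown above sum to slightly less than 1 (about 0.974) because rows where all variables are 0 contribute nothing. 'contrVal' is the average amount contributed to the total score, counting only the rows where the variable is present.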
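
One way to check what the contrPerc values add up to is to rebuild the shares in base R. This is a rough sketch under the same data-generating assumptions as the reprex above; `vars`, `n_present`, `contr_perc` and `row_totals` are names made up for this check only.

```r
# Recreate the four binary variables used in the reprex (var4 is 1 about 80% of the time)
set.seed(123)
vars <- data.frame(
  var1 = sample(0:1, 10000, replace = TRUE),
  var2 = sample(0:1, 10000, replace = TRUE),
  var3 = sample(0:1, 10000, replace = TRUE),
  var4 = sample(0:1, 10000, replace = TRUE, prob = c(0.2, 0.8))
)

# Number of variables present in each row
n_present <- rowSums(vars)

# Equal share for every present variable; rows with nothing present stay all zero.
# Dividing a matrix by a vector of length nrow() divides each row by its own count.
contr_perc <- as.matrix(vars) / pmax(n_present, 1)

# Shares sum to 1 in every row that has at least one variable present
row_totals <- rowSums(contr_perc)
all(abs(row_totals[n_present > 0] - 1) < 1e-12)

# The per-variable averages therefore add up to the fraction of such rows
sum(colMeans(contr_perc))
mean(n_present > 0)
```

The last two values should agree, which is why the summed averages can fall a little short of 1 when some rows have every variable absent.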

I don't know if this is exactly what you want, but it might get you there.
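
If the literal present-minus-absent difference from the question is what you are after, something along these lines might work as well. It is a minimal base R sketch, not a tested solution: `contribution()` is a hypothetical helper written for this reply, and it assumes every column other than `score` is a 0/1 variable. It uses the data as generated in the question (no changed probability for var4).

```r
# Hypothetical helper (not from any package): for one binary variable, compare the
# mean score when it is 1 vs. 0 within every combination of the other variables.
contribution <- function(data, var, score = "score") {
  others <- setdiff(names(data), c(var, score))
  # Mean score for every observed combination of all binary variables
  means <- aggregate(data[[score]], by = data[c(others, var)], FUN = mean)
  names(means)[ncol(means)] <- "mean_score"
  present <- means[means[[var]] == 1, c(others, "mean_score")]
  absent  <- means[means[[var]] == 0, c(others, "mean_score")]
  # Pair up the profiles that differ only in `var`
  out <- merge(present, absent, by = others,
               suffixes = c("_present", "_absent"))
  out$contribution <- out$mean_score_present - out$mean_score_absent
  out
}

# Data as in the original question
set.seed(123)
my_data <- data.frame(
  var1  = sample(0:1, 10000, replace = TRUE),
  var2  = sample(0:1, 10000, replace = TRUE),
  var3  = sample(0:1, 10000, replace = TRUE),
  var4  = sample(0:1, 10000, replace = TRUE),
  score = rnorm(10000, 10, 5)
)

# One row per combination of var2, var3 and var4
res <- contribution(my_data, "var1")
res

# Average difference per variable across all profiles of the other variables
sapply(paste0("var", 1:4),
       function(v) mean(contribution(my_data, v)$contribution))
```

The row of `res` with var2 = 1, var3 = 1 and var4 = 1 is the same comparison as the manual calculation in the question. Averaging over profiles, as in the last call, is just one possible way to summarise a variable; a weighted average or a linear model with interactions would be alternatives.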

Hope this helps,
PJ
