All possible combinations of factors of a data frame

yellowcab74 · February 5, 2024, 2:59am

I'm trying to figure out the simplest way to do the following. I have a data frame df with colnames(df) <- c("A", "B", "C", "D", "E") where all the variables are encoded as factors. I want to find all possible combinations of the columns of this data frame. For example, the number of all possible non-empty combinations of the columns of "df" is 2^5-1=31, and these will include data frames with the following column names
"A", "B"
"A", "C"
"A", "D"
"A", "E"
"A", "B", "C"
etc., with the last data frame being "A", "B", "C", "D", "E"
Is there a package to combine all these subsets in one large data frame?
Thanks

FJCC · February 5, 2024, 4:07am

I don't understand how you want to combine data frames with varying numbers of columns. I think the following code makes a list of the 31 individual data frames that you want to get.

Combinations <- expand.grid(A=0:1,B=0:1,C=0:1,D=0:1,E=0:1)
Combinations <- sapply(Combinations, as.logical)
Combinations <- Combinations[2:32,] #drop the first row that is all FALSE
DF <- data.frame(A=1:3,B=11:13,C=21:23, D=31:33, E=41:43) #invent data
DF_List <- vector(mode = "list", length = 31) #Make list to store results

for(i in 1:31) {
  DF_List[[i]] <- DF[, Combinations[i, ], drop = FALSE] #drop = FALSE keeps single-column subsets as data frames
}

Combinations[6,]
#>     A     B     C     D     E 
#> FALSE  TRUE  TRUE FALSE FALSE
DF_List[[6]]
#>    B  C
#> 1 11 21
#> 2 12 22
#> 3 13 23

Created on 2024-02-04 with reprex v2.0.2
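An alternative way to enumerate the same 31 subsets is to build the column-name sets directly with combn(). This is only a sketch on the same invented data. The names col_sets and DF_List are illustrative, and the subsets come out grouped by size, so the ordering differs from the expand.grid() version above.

```r
# Invented data, same shape as in the reply above
DF <- data.frame(A = 1:3, B = 11:13, C = 21:23, D = 31:33, E = 41:43)

# For each subset size k = 1..5, list every set of k column names
col_sets <- unlist(
  lapply(seq_along(DF), function(k) combn(names(DF), k, simplify = FALSE)),
  recursive = FALSE
)

# drop = FALSE keeps one-column subsets as data frames rather than vectors
DF_List <- lapply(col_sets, function(cols) DF[, cols, drop = FALSE])

length(DF_List)       # 31, i.e. 2^5 - 1
names(DF_List[[6]])   # "A" "B": the five single-column subsets come first
```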

alexcray · February 5, 2024, 4:58am

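If you want specific combinations based on conditions on factor levels or interactions, you might need to loop or build the model formulas programmatically. Note that patsy is a Python library; in R, the built-in formula interface and model.matrix() cover the same ground.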

yellowcab74 · February 5, 2024, 5:44am

Thanks so much. This solves my problem. One more question, is there no way of including the full data frame with all variables A B C D E ?

nirgrahamuk · February 5, 2024, 11:08am

Isn't that what you began with?
So you would make the list length 32,
and after you have looped over the first 31, add

 DF_List[[32]] <- DF
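Before appending a 32nd element, it may be worth checking the 31st. In the expand.grid() construction from the earlier reply, the last row appears to be all TRUE, so the full data frame should already be in the list. A quick check on the same invented data (the name full is illustrative):

```r
Combinations <- expand.grid(A = 0:1, B = 0:1, C = 0:1, D = 0:1, E = 0:1)
Combinations <- sapply(Combinations, as.logical)
Combinations <- Combinations[2:32, ]   # drop the all-FALSE first row
DF <- data.frame(A = 1:3, B = 11:13, C = 21:23, D = 31:33, E = 41:43)

all(Combinations[31, ])                # TRUE: every column is selected
full <- DF[, Combinations[31, ], drop = FALSE]
identical(names(full), names(DF))      # TRUE: all five columns are present
```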

yellowcab74 · February 5, 2024, 2:36pm

Thanks. One more question: could I fit 32 different logistic models (outcome not included in that list), one for each data frame?
Thanks

nirgrahamuk · February 5, 2024, 2:57pm

map() from purrr (tidyverse) is a general approach to iteration



#example of a list of 3 datasets
list_of_3 <- split(iris,~Species) 

(some_outcome_vec <- rep(0:1,25))

library(tidyverse)

#fit all the lm's for each of the list of 3 

(list_of_models <- map(list_of_3,\(df_){
  
  temp_df <- bind_cols(enframe(name=NULL,
                               value="outcome",
                               x=some_outcome_vec),df_) |> select(-Species)
  
    lm(outcome ~ .,
       data=temp_df)  
  
}))

yellowcab74 · February 5, 2024, 3:28pm

Thanks so much. On the other hand, is it possible to use the previous code to fit a logistic model on all possible combinations of the variables in the iris data, e.g. a model with
Sepal length only
Sepal width only
Petal length only
Petal width only
Sepal length and sepal width, etc.?

nirgrahamuk · February 5, 2024, 3:33pm

yes, I encourage you to try to do so.

itssimonpro · February 5, 2024, 5:20pm

I am facing some trouble issues with my client's work names [Bear Names]
How do I list variables in a dataset in R?

yellowcab74 · February 5, 2024, 6:02pm

I am having difficulty modifying the code, please help!

nirgrahamuk · February 5, 2024, 7:08pm

I expect I can find some time to review code that you might share on this

yellowcab74 · February 5, 2024, 8:48pm

See the code I wrote. I get an error saying that the data must be a data.frame, along with some additional warnings.
Thanks

library(tidyverse)
x_dat<- subset(iris, select=c(Sepal.Length,Sepal.Width, Petal.Length,Petal.Width))
Combinations <- expand.grid(Sepal.Length =0:1,Sepal.Width=0:1, Petal.Length=0:1,Petal.Width=0:1)
Combinations <- sapply(Combinations, as.logical)
Combinations <- Combinations[2:15,] #drop the first row that is all FALSE
x_List <- vector(mode = "list", length = 14) #Make list to store results
(some_outcome_vec <- rep(0:1,75))
for(i in 1:14) {
x_List[[i]] <- x_dat[, Combinations[i, ]]
#fit all the lm's for each of the list of 3
(list_of_models <- map(x_List[[i]],\(df_){
x_List[[i]] <- bind_cols(enframe(name=NULL,
value="outcome",
x=some_outcome_vec),df_) |> #select(-Species)
lm(outcome ~ .,
data=x_List[[i]])
}))
}
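One possible reading of why the attempt above fails: map() is iterating over the columns of a single subset rather than over the subsets; one-column selections collapse to plain vectors; and with select() commented out, the pipe feeds the bound data frame into lm() as its first argument, where a formula is expected. The following is a hedged base-R sketch of one way to fit a logistic model per subset. It is not the thread's own solution, and the alternating 0/1 outcome is only the placeholder from the post.

```r
x_dat <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

Combinations <- expand.grid(Sepal.Length = 0:1, Sepal.Width = 0:1,
                            Petal.Length = 0:1, Petal.Width = 0:1)
Combinations <- sapply(Combinations, as.logical)
Combinations <- Combinations[-1, ]   # drop the all-FALSE row; 15 rows remain

some_outcome_vec <- rep(0:1, 75)     # placeholder 0/1 outcome, one per iris row

list_of_models <- lapply(seq_len(nrow(Combinations)), function(i) {
  # drop = FALSE keeps a one-column selection as a data frame
  temp_df <- cbind(outcome = some_outcome_vec,
                   x_dat[, Combinations[i, ], drop = FALSE])
  glm(outcome ~ ., data = temp_df, family = binomial)
})

length(list_of_models)               # 15
names(coef(list_of_models[[1]]))     # "(Intercept)" "Sepal.Length"
```

Note that four variables give 2^4 - 1 = 15 non-empty subsets, so the row range 2:15 in the attempt above leaves one subset out.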

yellowcab74 · February 6, 2024, 4:52am

Sorry guys, I was able to get the code to work. I have pasted it below. Is it possible to get the overall minimum of the model BIC, as well as the position of this minimum, across all the iterations?

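Combinations <- expand.grid(Sepal.Length=0:1,Sepal.Width=0:1,Petal.Length=0:1,Petal.Width=0:1)
Combinations <- sapply(Combinations, as.logical)
Combinations <- Combinations[2:16,] #drop the all-FALSE first row; 15 subsets remain

x_List <- vector(mode = "list", length = 15) #Make list to store results
bic <- numeric(15) #Make vector to store the BIC values
list_of_3 <- split(iris,~Species)
(some_outcome_vec <- rep(0:1,25))

for(i in 1:15) {
  #assumption: the first species subset is the one being modelled
  #drop = FALSE keeps single-column subsets as data frames
  x_List[[i]] <- list_of_3[[1]][, Combinations[i, ], drop = FALSE]
}

for(i in 1:15) {
  bic[i] <- BIC(glm(some_outcome_vec~., data = x_List[[i]], family = "binomial"))
}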

nirgrahamuk · February 6, 2024, 9:47am

library(tidyverse)

Combinations <- expand.grid(Sepal.Length=0:1,Sepal.Width=0:1,Petal.Length=0:1,Petal.Width=0:1)
Combinations <- sapply(Combinations, as.logical)

max_comb_len <- 15

Combinations <- Combinations[2:(max_comb_len+1),] #drop the first row that is all FALSE

DF_List <- vector(mode = "list", length = max_comb_len) #Make list to store results
lm_list <- vector(mode = "list", length = max_comb_len) #Make list to store results
BIC_list <- vector(mode = "list", length = max_comb_len) #Make list to store results


(DF <- tibble(iris) |> mutate(outcome= Species=="virginica"))

min_bic <- min_bic_pos <- Inf  

for(i in 1:max_comb_len) {
  vars_in_subset <- names(which(Combinations[i,]))
  print(vars_in_subset)
  DF_List[[i]] <- DF[, c("outcome", vars_in_subset)]
  lm_list[[i]] <- lm(outcome ~ . , data=DF_List[[i]])
  BIC_list[[i]] <- BIC(lm_list[[i]])
  if(BIC_list[[i]] < min_bic){
    min_bic <- BIC_list[[i]]
    min_bic_pos <- i
  }
}

min_bic
min_bic_pos
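The running-minimum bookkeeping in the loop can also be replaced by which.min() once all BIC values have been collected. The sketch below is a base-R variant, not the code from the reply: it enumerates subsets with combn() (so positions differ from the expand.grid() ordering), codes the outcome as 0/1, and uses illustrative names such as subsets and bic_values.

```r
DF <- transform(iris, outcome = as.numeric(Species == "virginica"))
predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Every non-empty subset of the predictor names
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# One BIC per subset; reformulate() builds e.g. outcome ~ Sepal.Length + Petal.Width
bic_values <- vapply(subsets, function(vars) {
  BIC(lm(reformulate(vars, response = "outcome"), data = DF))
}, numeric(1))

min_bic_pos <- which.min(bic_values)    # position of the smallest BIC
min_bic     <- bic_values[min_bic_pos]  # the smallest BIC itself
subsets[[min_bic_pos]]                  # variables in the best-scoring subset
```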

yellowcab74 · February 6, 2024, 1:56pm

This is wonderful. It gives exactly what I needed
Thanks

yellowcab74 · February 6, 2024, 8:14pm

I am getting errors when grouping this sample data into percentiles and adding the group to the data frame. Both of the following code snippets give me errors. Is there something I am not doing right?
Thanks much

df <- data.frame(ID = 1:10, Score1 = c(78, 82, 65, 90, 72, 88, 55, 67, 92, 81),
Score2 = c(89, 95, 76, 82, 91, 85, 72, 68, 97, 88))

Calculating the percentiles

df$quartile <- with(df, factor(
findInterval (score1, c(-Inf,
quantile(score1, probs=c(.2,.4,.6,.8)),Inf),na.rm=TRUE),
labels=c('Q1','q2','Q3','Q4'.'Q5')
))

df$quartile <- with(df, cut(score1, breaks=quantile(score1, probs=seq(0,1, by .2),na.rm=TRUE), include.lowest=TRUE
))
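A few things in the two snippets above look like likely causes of the errors: R is case-sensitive, so score1 does not match the column Score1; the labels vector has a period where a comma is needed; `by .2` is missing its `=`; and na.rm belongs to quantile() rather than findInterval(). The following is a corrected sketch under those assumptions. The column names quintile and quintile_cut are illustrative choices, since five groups at 20% steps are quintiles rather than quartiles.

```r
df <- data.frame(ID = 1:10,
                 Score1 = c(78, 82, 65, 90, 72, 88, 55, 67, 92, 81),
                 Score2 = c(89, 95, 76, 82, 91, 85, 72, 68, 97, 88))

# Version 1: findInterval() with the inner 20/40/60/80% cut points
inner_cuts <- quantile(df$Score1, probs = c(.2, .4, .6, .8), na.rm = TRUE)
df$quintile <- factor(findInterval(df$Score1, c(-Inf, inner_cuts, Inf)),
                      levels = 1:5,
                      labels = c("Q1", "Q2", "Q3", "Q4", "Q5"))

# Version 2: cut() with breaks at 0%, 20%, ..., 100%
df$quintile_cut <- cut(df$Score1,
                       breaks = quantile(df$Score1, probs = seq(0, 1, by = .2),
                                         na.rm = TRUE),
                       include.lowest = TRUE)

table(df$quintile)   # counts per group
```

Setting levels = 1:5 explicitly keeps the five labels aligned even if one of the groups happens to be empty.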

yellowcab74 · February 7, 2024, 11:39pm

Is it possible to keep other variables in each combination that will not be used in the model, for example an ID number?
Thanks
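One way this might be done is to carry the extra column along in each stored data frame, but name the predictors explicitly in the formula instead of using `outcome ~ .`, so the extra column never enters the model. This is a sketch only: the ID column here is a hypothetical identifier added for illustration, and a single subset stands in for the loop.

```r
DF <- data.frame(ID = seq_len(nrow(iris)), iris)   # ID is a hypothetical identifier
DF$outcome <- as.numeric(DF$Species == "virginica")

vars_in_subset <- c("Sepal.Length", "Petal.Width")  # one example subset

# Keep ID in the stored data frame...
sub_df <- DF[, c("ID", "outcome", vars_in_subset)]

# ...but list the predictors explicitly so that ID is not used in the fit
fit <- lm(reformulate(vars_in_subset, response = "outcome"), data = sub_df)

names(sub_df)      # "ID" "outcome" "Sepal.Length" "Petal.Width"
names(coef(fit))   # "(Intercept)" "Sepal.Length" "Petal.Width"
```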

nirgrahamuk · February 8, 2024, 10:22am

I'm lost. It seems the original issue was addressed, and you are now on to other, unrelated issues?
Or is there something you want to vary in the original request?
