Adding "NA" factors to the "levels" function

omario · September 26, 2021, 10:40pm

I am working with the R programming language. In this example, I have the following data:

library("dplyr")

df <- data.frame(b = rnorm(100,5,5), d = rnorm(100,2,2),
                 c = rnorm(100,10,10))

a <- c("a", "b", "c", "d", "e")
a <- sample(a, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))

a<- as.factor(a)
df$a = a


f <- c("a", "b", "c", "d", "e")
f <- sample(f, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))

f<- as.factor(f)
df$f = f

 head(df)
          b        d         c a f
1  6.896434 2.037835  2.867707 e a
2 -3.314758 2.681726 20.038918 d d
3  2.018130 2.229342 -8.341578 c a
4  9.738082 1.127069 18.337212 c c
5  2.442182 3.475735 27.875924 c c
6  5.061937 1.098709  6.166077 a e

I then have the following function ("my_subset_mean") that computes the mean of df$c over different subsets defined by conditions on "a, b, d, f":

my_subset_mean <- function(r1, r2, r3, r4){  
  subset <- df %>% filter(a %in% r1, f %in% r4, b > r2, d < r3 )
  return(mean(subset$c))
}

Here is a loop that evaluates the function "my_subset_mean" at random subsets of "a, b, d, f":

create_output <- function() {
  uv <- levels(df$a)
  r1 <- sample(uv, sample(length(uv)))
  uv1 <- levels(df$f)
  r4 <- sample(uv1, sample(length(uv1)))
  rgb <- range(df$b)
  rgd <- range(df$d)
  r2 <- runif(1, rgb[1], rgb[2])
  r3 <- runif(1, rgd[1], rgd[2])
  my_subset_mean <- my_subset_mean(r1, r2, r3, r4)
  data.frame(r1 = toString(r1), r4 = toString(r4), r2, r3, my_subset_mean)
}

out <- do.call(rbind, replicate(100, create_output(), simplify = FALSE))

head(out)

             r1         r4        r2         r3 my_subset_mean
1 a, c, b, e, d          d 14.560821  3.4251138            NaN
2          d, e e, d, b, c  9.027482 -1.7108754            NaN
3             d e, b, a, d  1.447395  0.4279652      18.019990
4 a, e, b, c, d          e -6.807861  2.6301878       7.424415
5          a, d          d  8.307980 -1.8923647            NaN
6             a    b, c, a  7.180056 -0.4022791            NaN

Question: Is it possible to write this loop ("create_output") so that sometimes, values of "r1, r2, r3, r4" are not considered? E.g.

             r1         r4        r2         r3     my_subset_mean
1            NA          d     14.56    3.4251138            5
2          d, e, d, b,   NA    NA        -1.7108754         3.1
3             e, b,  d         1.447         NA           18.019990

I was thinking that maybe this can be specified within the "levels" statement:

uv <- levels(df$a)
  r1 <- sample(uv, sample(length(uv)))

Here, we can see the values of "uv":

uv
[1] "a" "b" "c" "d" "e"

Can something be done so that the function "my_subset_mean" sometimes ignores some of the subset conditions on "a, b, d, f"? E.g. the "mean" is only calculated using the subset conditions on "a, d"?

Thanks

nirgrahamuk · September 26, 2021, 11:02pm

You can make r1 sometimes come out as character(0) with:

r1 <- sample(uv, sample(length(uv)+1)-1)
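
To see why this can produce an empty selection, here is a minimal sketch. It assumes the same five levels as in the question, and it draws the subset size as a single number from 0 to length(uv) rather than reusing the exact expression above; the names `sizes` and `empty` are only for illustration.

```r
uv <- c("a", "b", "c", "d", "e")

# Draw a subset size between 0 and length(uv), inclusive, many times over.
set.seed(1)
sizes <- replicate(1000, sample(0:length(uv), 1))
table(sizes)

# A size of 0 selects nothing, which gives an empty character vector.
empty <- sample(uv, 0)
identical(empty, character(0))  # TRUE
```

A size of 0 is what lets the filter on a be skipped later on.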

If you want conditional behaviour, such as skipping or applying a filter, you can try:

my_subset_mean <- function(r1, r2, r3, r4){
  if(identical(r1, character(0))){
    # r1 is empty: skip the filter on a
    subset <- df %>% filter(f %in% r4, b > r2, d < r3)
  } else {
    subset <- df %>% filter(a %in% r1, f %in% r4, b > r2, d < r3)
  }
  return(mean(subset$c))
}
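
The same idea could be extended so that any of the four conditions is optional, which is closer to what the question asks for. The following is only a sketch under assumptions that are not in the thread: an empty vector for r1 or r4 and an NA for r2 or r3 are taken to mean "no condition", the function name `my_subset_mean_optional` is made up, and base R logical masks are used instead of dplyr so the example runs on its own.

```r
set.seed(42)
df <- data.frame(
  a = factor(sample(letters[1:5], 100, replace = TRUE)),
  f = factor(sample(letters[1:5], 100, replace = TRUE)),
  b = rnorm(100, 5, 5),
  d = rnorm(100, 2, 2),
  c = rnorm(100, 10, 10)
)

# Hypothetical variant: each condition is applied only when it is supplied.
# Assumed convention: length-0 r1/r4 and NA r2/r3 mean "skip this condition".
my_subset_mean_optional <- function(r1, r2, r3, r4, data = df) {
  keep <- rep(TRUE, nrow(data))
  if (length(r1) > 0) keep <- keep & data$a %in% r1
  if (length(r4) > 0) keep <- keep & data$f %in% r4
  if (!is.na(r2))     keep <- keep & data$b > r2
  if (!is.na(r3))     keep <- keep & data$d < r3
  mean(data$c[keep])
}

# No conditions at all: the mean over every row.
all_rows <- my_subset_mean_optional(character(0), NA, NA, character(0))

# Only a condition on a.
only_a <- my_subset_mean_optional("a", NA, NA, character(0))
```

Building one logical mask avoids writing a separate filter() call for every combination of skipped conditions.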

omario · September 27, 2021, 1:23am

@nirgrahamuk: thank you for your answer!

Is this how the final code should look?

### generate data ####

library("dplyr")

df <- data.frame(b = rnorm(100,5,5), d = rnorm(100,2,2),
                 c = rnorm(100,10,10))

a <- c("a", "b", "c", "d", "e")
a <- sample(a, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))

a<- as.factor(a)
df$a = a


f <- c("a", "b", "c", "d", "e")
f <- sample(f, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))

f<- as.factor(f)
df$f = f


#### define function ####


my_subset_mean <- function(r1, r2, r3, r4){  
if(identical(r1,character(0))){
 subset <- df %>% filter(f %in% r4, a %in% r1, b > r2, d < r3 )
} else{
  subset <- df %>% filter(a %in% r1, f %in% r4, b > r2, d < r3 )}
  return(mean(subset$c))
}


### run loop ####

create_output <- function() {
  uv <- levels(df$a)
  r1 <- sample(uv, sample(length(uv)))
  uv1 <- levels(df$f)
  r4 <- sample(uv1, sample(length(uv1)))
  rgb <- range(df$b)
  rgd <- range(df$d)
  r2 <- runif(1, rgb[1], rgb[2])
  r3 <- runif(1, rgd[1], rgd[2])
  my_subset_mean <- my_subset_mean(r1, r2, r3, r4)
  data.frame(r1 = toString(r1), r4 = toString(r4), r2, r3, my_subset_mean)
}

out <- do.call(rbind, replicate(100, create_output(), simplify = FALSE))

head(out)

Thank you!

nirgrahamuk · September 27, 2021, 7:42am

You didn't change the r1 line, so it still always returns at least one level and never character(0).

omario · September 28, 2021, 2:43am

Thank you for your reply! I am still a bit confused - I tried to edit parts of the code but I still didn't quite understand. If you have time, can you please show me what you meant?

Thank you so much!

nirgrahamuk · September 28, 2021, 8:55am

library("dplyr")

df <- data.frame(b = rnorm(100,5,5), d = rnorm(100,2,2),
                 c = rnorm(100,10,10))

a <- c("a", "b", "c", "d", "e")
a <- sample(a, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))

a<- as.factor(a)
df$a = a


f <- c("a", "b", "c", "d", "e")
f <- sample(f, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))

f<- as.factor(f)
df$f = f

my_subset_mean <- function(r1, r2, r3, r4){  
  if(identical(r1,character(0))){
    subset <- df %>% filter(f %in% r4, b > r2, d < r3 )
  } else {
    subset <- df %>% filter(a %in% r1, f %in% r4, b > r2, d < r3)
  }
  return(mean(subset$c))
}

create_output <- function() {
  uv <- levels(df$a)
  r1 <- sample(uv, sample(length(uv)+1)-1)
  uv1 <- levels(df$f)
  r4 <- sample(uv1, sample(length(uv1)))
  rgb <- range(df$b)
  rgd <- range(df$d)
  r2 <- runif(1, rgb[1], rgb[2])
  r3 <- runif(1, rgd[1], rgd[2])
  my_subset_mean <- my_subset_mean(r1, r2, r3, r4)
  data.frame(r1 = toString(r1), r4 = toString(r4), r2, r3, my_subset_mean)
}

out <- do.call(rbind, replicate(100, create_output(), simplify = FALSE))
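
With the code above, an empty r1 shows up in the output as an empty string, because toString(character(0)) returns "". If the goal is a table like the one in the question, with NA wherever a condition was skipped, one possible extension is sketched below. Everything beyond the thread is an assumption: the helper names `sample_levels`, `sample_cutoff`, `subset_mean` and `create_output_optional` are hypothetical, the 25% skip probability for the numeric cutoffs is arbitrary, and base R masks replace dplyr so the block is self-contained.

```r
set.seed(123)
df <- data.frame(
  a = factor(sample(letters[1:5], 100, replace = TRUE)),
  f = factor(sample(letters[1:5], 100, replace = TRUE)),
  b = rnorm(100, 5, 5),
  d = rnorm(100, 2, 2),
  c = rnorm(100, 10, 10)
)

# Hypothetical helper: pick between 0 and all levels; 0 means "no condition".
sample_levels <- function(lv) sample(lv, sample(0:length(lv), 1))

# Hypothetical helper: return NA ("no condition") with probability p_skip,
# otherwise a uniform draw from the observed range of x.
sample_cutoff <- function(x, p_skip = 0.25) {
  if (runif(1) < p_skip) NA_real_ else runif(1, min(x), max(x))
}

# Mean of c over the rows that satisfy only the supplied conditions.
subset_mean <- function(r1, r2, r3, r4) {
  keep <- rep(TRUE, nrow(df))
  if (length(r1) > 0) keep <- keep & df$a %in% r1
  if (length(r4) > 0) keep <- keep & df$f %in% r4
  if (!is.na(r2))     keep <- keep & df$b > r2
  if (!is.na(r3))     keep <- keep & df$d < r3
  mean(df$c[keep])
}

create_output_optional <- function() {
  r1 <- sample_levels(levels(df$a))
  r4 <- sample_levels(levels(df$f))
  r2 <- sample_cutoff(df$b)
  r3 <- sample_cutoff(df$d)
  data.frame(
    # Record skipped level conditions as NA rather than as an empty string.
    r1 = if (length(r1) > 0) toString(r1) else NA_character_,
    r4 = if (length(r4) > 0) toString(r4) else NA_character_,
    r2 = r2,
    r3 = r3,
    my_subset_mean = subset_mean(r1, r2, r3, r4)
  )
}

out <- do.call(rbind, replicate(100, create_output_optional(), simplify = FALSE))
head(out)
```

Rows can still show NaN for my_subset_mean when no observations satisfy the conditions that were applied.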
