Possible to use 'sample()' inside 'case_when()' for conditional distributions?

I'm quite new to R and I'd like to ask if it's possible to use sample() inside of case_when() when generating a synthetic sample.

Here's an example that hopefully conveys - even though it fails - what I have in mind:

library(tidyverse)
set.seed(1000)
type = sample(c("car", "truck"), 10, replace = TRUE, prob = c(0.6, 0.4))
test <- data.frame(type)

test %>%
  mutate(
    motor = case_when(
      type == "car" ~ sample(c("petrol", "diesel"),
                             length(which(type == "car")),
                             replace = TRUE,
                             prob = c(2/3, 1/3)),
      type == "truck" ~ sample(c("petrol", "diesel"),
                               length(which(type == "truck")),
                               replace = TRUE,
                               prob = c(0.1, 0.9)),
      TRUE ~ NA_character_
    )
  )

This doesn't run through. Is the code wrong, or can you think of different approaches to achieve this?

Can you please confirm that what I understand is correct?

You are generating type of vehicles (cars and trucks) randomly. Then, depending on the number of cars generated, you are generating type of fuel (petrol and diesel), equal to the number of cars. You want to do the same for trucks.

Is that it?

I think so.

Once a random sample of cars and trucks has been drawn, I'd like to assign randomly the fuel type to every observation/vehicle, but with different probabilities depending on whether the vehicle is a car or a truck.

I think this does what you want.

set.seed(seed = 1000)

type <- sample(x = c("car", "truck"),
               size = 10,
               replace = TRUE,
               prob = c(0.6, 0.4))

motor <- ifelse(test = (type == "car"),
                yes = sample(x = c("petrol", "diesel"),
                             size = 1,
                             prob = c(2/3, 1/3)),
                no = sample(x = c("petrol", "diesel"),
                            size = 1,
                            prob = c(0.1, 0.9)))

(test <- data.frame(type, motor))
#>     type  motor
#> 1    car petrol
#> 2  truck diesel
#> 3    car petrol
#> 4  truck diesel
#> 5    car petrol
#> 6    car petrol
#> 7  truck diesel
#> 8    car petrol
#> 9    car petrol
#> 10   car petrol

Created on 2019-03-13 by the reprex package (v0.2.1)

I'm not very comfortable with tidyverse, but I think the problem is in the size argument. I think what mutate does is that it tries to perform your function for each of the elements, but since you are generating more than one, you're having problems. But I'm not sure.

What happens if you try this?

library(tidyverse)
set.seed(1000)
type = sample(c("car", "truck"), 10, replace = TRUE, prob = c(0.6, 0.4))
test <- data.frame(type)

test %>%
  mutate(
    motor = case_when(
      type == "car" ~ sample(c("petrol", "diesel"),
                             1,
                             replace = TRUE,
                             prob = c(2/3, 1/3)),
      type == "truck" ~ sample(c("petrol", "diesel"),
                               1,
                               replace = TRUE,
                               prob = c(0.1, 0.9)),
      TRUE ~ NA_character_
    )
  )

Unfortunately not, because with size = 1 inside the sample function, there's no variation - if you generate 100 vehicles, the cross table is

> table(test$type, test$motor)
       
        diesel petrol
  car        0     64
  truck     36      0

Yes, and now I think I realized that what I was trying was rather stupid: The sample function will produce a vector of length size, but I can't simply allocate its elements to the appropriate lines of a larger vector of the same name...

I think the first code chunk here works.

If you want to keep within Tidyverse, I the following should get you there.

library(tidyverse)
set.seed(1000)
type = sample(c("car", "truck"), 10, replace = TRUE, prob = c(0.6, 0.4))
test <- data.frame(type)

test %>%
  mutate(
    motor = ifelse(type == "car", 
                   sample(c("petrol", "diesel"),
                          sum(type == "car"),
                          replace = TRUE,
                          prob = c(2/3, 1/3)), 
                   sample(c("petrol", "diesel"),
                          sum(type == "truck"),
                          replace = TRUE,
                          prob = c(0.1, 0.9))))
1 Like

You can do it by using nest and list-columns. It would look like this:

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2
#> Warning: package 'dplyr' was built under R version 3.5.2
#> Warning: package 'stringr' was built under R version 3.5.2
set.seed(1000)
type = sample(c("car", "truck"), 10, replace = TRUE, prob = c(0.6, 0.4))
test <- data.frame(type)

test %>% 
  mutate(car_type = type) %>% 
  group_by(type) %>% 
  nest() %>% 
  mutate(
    motor = map2(data, type, ~{
      if (.y == "car"){
        sample(c("petrol", "diesel"),
               nrow(.x),
               replace = TRUE,
               prob = c(2/3, 1/3))
      } else {
        sample(c("petrol", "diesel"),
               nrow(.x),
               replace = TRUE,
               prob = c(0.1, 0.9))
      }
    })
  ) %>% 
  unnest(motor)
#> # A tibble: 10 x 2
#>    type  motor 
#>    <fct> <chr> 
#>  1 car   petrol
#>  2 car   diesel
#>  3 car   petrol
#>  4 car   diesel
#>  5 car   diesel
#>  6 car   petrol
#>  7 car   petrol
#>  8 truck diesel
#>  9 truck diesel
#> 10 truck diesel

Created on 2019-03-13 by the reprex package (v0.2.0).

This creates a dummy variable that is the same as type and then groups and nest by type. Then you can run sample for each of the car types to return a vector of the same length as each car type.

1 Like

You are right, and my code was wrong. Sorry.

The following works:

set.seed(seed = 1000)

type <- sample(x = c("car", "truck"),
               size = 100,
               replace = TRUE,
               prob = c(0.6, 0.4))

motor <- sapply(X = type,
                FUN = function(t)
                  ifelse(test = (t == "car"),
                         yes = sample(x = c("petrol", "diesel"),
                                      size = 1,
                                      prob = c(2/3, 1/3)),
                         no = sample(x = c("petrol", "diesel"),
                                     size = 1,
                                     prob = c(0.1, 0.9))))

test <- data.frame(type, motor)

table(test$type, test$motor)
#>        
#>         diesel petrol
#>   car       24     40
#>   truck     31      5
1 Like

Hi @Tristan,

The reason my first implementation was wrong was because of the following:

If yes or no are too short, their elements are recycled. yes will be evaluated if and only if any element of test is true, and analogously for no.

Being lazy enough, I ended up asking regarding this fact on SO about such a trivial topic, instead of going through the documentation.

Sorry, once again.

1 Like

@Tristan. Your original code works save for length conflicts. "type" needs to stay length 10, as below.

set.seed(1000)
type = sample(c("car", "truck"), 10, replace = TRUE, prob = c(0.6, 0.4))
test <- data.frame(type)

test %>%
  mutate(
    motor = case_when(
      type == "car" ~ sample(c("petrol", "diesel"),
                             10,
                             replace = TRUE,
                             prob = c(2/3, 1/3)),
      type == "truck" ~ sample(c("petrol", "diesel"),
                               10,
                               replace = TRUE,
                               prob = c(0.1, 0.9)),
      TRUE ~ NA_character_
    )
  )

1 Like

Please don't apologise! To the contrary, thank you for your efforts, from which I learned something new and useful. I should really start digging into base R, see sapply, etc.

Thanks to everyone else in this thread, too! I took away a lot from the different implementations.

1 Like

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.