Dataframe Building (Specific Parameters)

jeremyz · November 5, 2018, 9:34pm

Hello, All!

I would love help with the dataframe I am trying to build. I need to create a dataframe that has 500 rows and 3 columns. Each column is a "Day" (defined below) and each row is an individual entry. The values must be random and CANNOT repeat within a row, but may repeat within a column. I have been able to build this piecemeal below. Is there a way to do this in a clean, fast way? THANK YOU FOR ANY HELP.

Days <- c("Monday Morning", "Monday Afternoon", "Monday Evening", "Tuesday Morning", "Tuesday Afternoon", "Tuesday Evening", "Wednesday Morning", "Wednesday Afternoon", "Wednesday Evening", "Thursday Morning", "Thursday Afternoon", "Thursday Evening", "Friday Morning", "Friday Afternoon", "Friday Evening", "Saturday Morning", "Saturday Afternoon", "Sunday")

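# Draw three distinct entries and turn them into a one-row data frame
A <- sample(Days, 3, replace = FALSE)
B <- as.data.frame(t(A))
colnames(B) <- c("A","B","C")

C <- sample(Days, 3, replace = FALSE)
D <- as.data.frame(t(C))
colnames(D) <- c("A","B","C")

E <- sample(Days, 3, replace = FALSE)
F <- as.data.frame(t(E))  # note: this masks the built-in shorthand F, hence FALSE above
colnames(F) <- c("A","B","C")

# Stack the one-row data frames
rbind(B,D,F)
View(rbind(B,D,F))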

jcblum · November 5, 2018, 10:17pm

Hi @jeremyz! Welcome!

I'd be inclined to approach this as a combinatorics problem, in which case I'd need to know:

Considering each row as a set, do the sets need to be unique?
If so, are these two rows considered distinct from each other (i.e. does order matter)?
{Monday Morning, Monday Evening, Tuesday Morning}
{Monday Morning, Tuesday Morning, Monday Evening}

I'm also not sure what you have in mind when you say "the values must be random". For instance, if you generated a large population of sets (rows) and randomly sampled 500 sets from that population without replacement, would that be the right structure of randomness for your purposes? If not, what are your constraints?

Edited to add: if you want to start playing with code along these lines, my favorite package for combinations and permutations is arrangements.
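
For a sense of scale, the two readings of "unique" give populations of different sizes. A quick sketch in base R, assuming the 18 day/time slots from the original post and 3 picks per row (the variable names are just for illustration):

```r
# Number of distinct day/time slots in the thread's `Days` vector
n_slots <- 18
k <- 3

# Ordered rows (order matters): 18 * 17 * 16
n_ordered <- prod(n_slots:(n_slots - k + 1))

# Unordered rows (order does not matter): 18 choose 3
n_unordered <- choose(n_slots, k)

n_ordered    # 4896
n_unordered  # 816
```

Either way the population is larger than 500, so sampling 500 rows without replacement would be possible under both readings.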

jeremyz · November 5, 2018, 10:25pm

Thank you for replying @jcblum!

Each row as a set does need to be unique.
{Monday Morning, Monday Evening, Tuesday Morning} is OKAY.
{Monday Morning, Monday Morning, Monday Evening} is NOT OKAY. Monday Morning repeats.

Order does not matter. The 3 terms as a set can appear in any order, so long as they do not repeat.

How I am defining random:
You are almost correct in your assessment. If a large population of sets were generated, and 500 of those were sampled, then that would work! Replacement is okay with sets.

Example:
OKAY
{Monday Evening, Monday Morning, Tuesday Morning}
{Monday Evening, Monday Morning, Tuesday Morning}
{Saturday Morning, Wednesday Morning, Monday Morning}

The idea is to mimic 500 people choosing their top 3 preferred day/time option. A person would not logically choose the same day/time more than once, but we may see more than one person choosing the same top 3 days/times.

Hope this clarifies. I am happy to add more if needed.
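
The constraints above can be sketched in base R as follows. This is only one possible reading, not necessarily what any poster had in mind: the seed is arbitrary, `Days` is rebuilt compactly but holds the same 18 values, and the helper names (`picks`, `row_key`, and so on) are made up for the example.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Same 18 values as the `Days` vector in the original post
Days <- c(
  paste(rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"), each = 3),
        c("Morning", "Afternoon", "Evening")),
  "Saturday Morning", "Saturday Afternoon", "Sunday"
)

# replicate() returns a 3 x 500 matrix (one sample per column), so transpose it
picks <- t(replicate(500, sample(Days, 3, replace = FALSE)))
df <- as.data.frame(picks, stringsAsFactors = FALSE)
colnames(df) <- c("A", "B", "C")

# Check 1: no value repeats within a row
no_repeat_in_row <- all(apply(df, 1, function(r) anyDuplicated(r) == 0))

# Check 2: rows may coincide as sets; sort each row so order is ignored
row_key <- apply(df, 1, function(r) paste(sort(r), collapse = " | "))
n_distinct_sets <- length(unique(row_key))

no_repeat_in_row
n_distinct_sets
```

Because each row is drawn independently, two people can end up with the same three slots, which matches the "replacement is okay with sets" requirement.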

mfherman · November 5, 2018, 10:32pm

Here's one approach to doing this that uses functions from purrr to create the data frame. It takes your original construction of a random sample of 3 days, repeats that sampling 500 times, and then combines the results by row into a single data frame. Probably not the most efficient, but if you only need 500 samples, it should be fine.

library(tidyverse)

Days <- c("Monday Morning", "Monday Afternoon", "Monday Evening", "Tuesday Morning", "Tuesday Afternoon", "Tuesday Evening", "Wednesday Morning", "Wednesday Afternoon", "Wednesday Evening", "Thursday Morning", "Thursday Afternoon", "Thursday Evening", "Friday Morning", "Friday Afternoon", "Friday Evening", "Saturday Morning", "Saturday Afternoon", "Sunday")

rerun(500, sample(Days, 3, replace = FALSE) %>% set_names("A", "B", "C")) %>%
  map_dfr(as_tibble)
#> # A tibble: 500 x 3
#>    A                  B                 C                 
#>    <chr>              <chr>             <chr>             
#>  1 Thursday Evening   Tuesday Afternoon Friday Afternoon  
#>  2 Thursday Evening   Tuesday Evening   Saturday Afternoon
#>  3 Thursday Afternoon Friday Morning    Monday Evening    
#>  4 Thursday Evening   Saturday Morning  Wednesday Evening 
#>  5 Monday Evening     Monday Afternoon  Wednesday Evening 
#>  6 Thursday Evening   Thursday Morning  Wednesday Morning 
#>  7 Monday Afternoon   Friday Evening    Thursday Evening  
#>  8 Tuesday Morning    Monday Morning    Monday Evening    
#>  9 Monday Afternoon   Sunday            Thursday Afternoon
#> 10 Tuesday Evening    Friday Evening    Thursday Afternoon
#> # … with 490 more rows

Created on 2018-11-05 by the reprex package (v0.2.1)

jeremyz · November 5, 2018, 10:38pm

This looks great @mfherman!
Exactly what I want. However, when I try to run your script, it makes my tibble 1500 x 1.
Any suggestions on how to transform the tibble into a 500 x 3 like yours?

I have tidyverse and purrr installed and selected in my library.
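
Without diagnosing why the shapes differ, one possible workaround is to fold the long result back into rows. The sketch below is base R and rests on an assumption: the 1500 values are stacked in sampling order, so each consecutive run of three is one person's picks. The `long` data frame and its `value` column are hypothetical stand-ins for the 1500 x 1 result.

```r
set.seed(2)  # arbitrary seed, only so the sketch is reproducible

# Same 18 values as the `Days` vector in the original post
Days <- c(
  paste(rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"), each = 3),
        c("Morning", "Afternoon", "Evening")),
  "Saturday Morning", "Saturday Afternoon", "Sunday"
)

# Hypothetical stand-in for the 1500 x 1 result: 500 samples of 3, stacked
long <- data.frame(
  value = unlist(lapply(1:500, function(i) sample(Days, 3, replace = FALSE))),
  stringsAsFactors = FALSE
)

# Fill row by row so each consecutive triple becomes one row
wide <- as.data.frame(matrix(long$value, ncol = 3, byrow = TRUE),
                      stringsAsFactors = FALSE)
colnames(wide) <- c("A", "B", "C")

dim(wide)
```

The `byrow = TRUE` argument matters here: the default column-wise fill would scatter each person's three picks across different rows.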

jcblum · November 5, 2018, 10:40pm

And here's how I'd do it with arrangements:

library(tidyverse)
library(arrangements)

Days <- c("Monday Morning", "Monday Afternoon", "Monday Evening", "Tuesday Morning", "Tuesday Afternoon", "Tuesday Evening", "Wednesday Morning", "Wednesday Afternoon", "Wednesday Evening", "Thursday Morning", "Thursday Afternoon", "Thursday Evening", "Friday Morning", "Friday Afternoon", "Friday Evening", "Saturday Morning", "Saturday Afternoon", "Sunday")

dfr <- permutations(Days, k = 3, nsample = 500) %>% 
  as_tibble() %>% 
  set_names(nm = LETTERS[1:3]) 

head(dfr)
#> # A tibble: 6 x 3
#>   A                  B                   C               
#>   <chr>              <chr>               <chr>           
#> 1 Thursday Evening   Sunday              Monday Afternoon
#> 2 Thursday Afternoon Monday Evening      Thursday Morning
#> 3 Wednesday Evening  Monday Evening      Friday Afternoon
#> 4 Sunday             Wednesday Afternoon Thursday Evening
#> 5 Friday Evening     Wednesday Evening   Thursday Morning
#> 6 Wednesday Evening  Friday Evening      Thursday Evening

Created on 2018-11-05 by the reprex package (v0.2.1)

@mfherman's solution is a really nice example of how to use purrr to go from a one-piece-at-a-time solution to a full solution! Not all problems happen to coincide so neatly with an area of mathematics that people like to write packages for, so ultimately learning to use tools like purrr is more generally useful, I think.

mfherman · November 5, 2018, 10:43pm

Hmmm. Not sure why it isn't working for you. Did you save the result to a new object?

library(tidyverse)

Days <- c("Monday Morning", "Monday Afternoon", "Monday Evening", "Tuesday Morning", "Tuesday Afternoon", "Tuesday Evening", "Wednesday Morning", "Wednesday Afternoon", "Wednesday Evening", "Thursday Morning", "Thursday Afternoon", "Thursday Evening", "Friday Morning", "Friday Afternoon", "Friday Evening", "Saturday Morning", "Saturday Afternoon", "Sunday")

days_sample <- rerun(500, sample(Days, 3, replace = FALSE) %>% set_names("A", "B", "C")) %>%
  map_dfr(as_tibble)
days_sample
#> # A tibble: 500 x 3
#>    A                B                   C                  
#>    <chr>            <chr>               <chr>              
#>  1 Tuesday Morning  Monday Morning      Wednesday Morning  
#>  2 Monday Afternoon Tuesday Afternoon   Wednesday Morning  
#>  3 Tuesday Evening  Monday Afternoon    Wednesday Morning  
#>  4 Friday Afternoon Saturday Morning    Tuesday Evening    
#>  5 Saturday Morning Thursday Morning    Wednesday Afternoon
#>  6 Saturday Morning Sunday              Tuesday Morning    
#>  7 Saturday Morning Tuesday Morning     Tuesday Afternoon  
#>  8 Monday Afternoon Tuesday Morning     Wednesday Morning  
#>  9 Tuesday Morning  Wednesday Afternoon Tuesday Afternoon  
#> 10 Friday Evening   Thursday Evening    Saturday Afternoon 
#> # … with 490 more rows

Created on 2018-11-05 by the reprex package (v0.2.1)

jeremyz · November 5, 2018, 10:43pm

Thank you both, @mfherman and @jcblum. You have both been very helpful. Your advice is much appreciated. I got the dataframe to generate the way I wanted. Yay!

This is a great community. Thank you for helping people like me!

mfherman · November 5, 2018, 10:44pm

Cool! I didn't know about arrangements.

jcblum · November 5, 2018, 10:57pm

Not only does it have a lovely, logical UI, it's fast, too! https://stackoverflow.com/a/47983855/4024810

(Warning: following links from that SO page can send you down a deep rabbit hole of combinatoric fun… I can lose an afternoon to this stuff if I'm not careful.)

jcblum · November 5, 2018, 11:00pm

If your question's been answered, would you mind choosing a solution? There are no imaginary internet points involved, so just choose the one that you used/liked best/whatever; nobody will get mad. Choosing a solution helps other people see which questions still need help, or find solutions if they have similar problems.

jeremyz · November 5, 2018, 11:04pm

@jcblum done! As a follow-up question, would there be a way to assign a probability weight to each of the elements that make up the object "Days"?

(e.g. Monday Morning prob = 0.03, Tuesday Evening prob = 0.43, etc)

jcblum · November 5, 2018, 11:13pm

If you use @mfherman's approach, you can pass a vector of weights to sample() using its prob parameter.
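
A minimal base R sketch of that idea. The weights are illustrative only: the two values come from the question above, and spreading the remaining probability evenly over the other 16 slots is an assumption made for the example. `w` and `weighted` are made-up names.

```r
set.seed(3)  # arbitrary seed, only so the sketch is reproducible

# Same 18 values as the `Days` vector in the original post
Days <- c(
  paste(rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"), each = 3),
        c("Morning", "Afternoon", "Evening")),
  "Saturday Morning", "Saturday Afternoon", "Sunday"
)

# Illustrative weights: two values from the question, the rest shared evenly
w <- rep((1 - 0.03 - 0.43) / 16, length(Days))
names(w) <- Days
w["Monday Morning"] <- 0.03
w["Tuesday Evening"] <- 0.43

# `prob` must line up element by element with the vector being sampled
weighted <- t(replicate(500, sample(Days, 3, replace = FALSE, prob = w)))
colnames(weighted) <- c("A", "B", "C")

head(weighted)
```

Keeping the weights as a named vector makes it easy to check that each weight is attached to the intended slot before sampling.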

mfherman · November 5, 2018, 11:32pm

Just to keep it all neat, I'd probably create a data frame with the days and their corresponding probabilities and then, as @jcblum wrote, add the weights to the sample() function.

As an example, I generated a random probability for each day. You could replace these with the real probabilities and run the following code:

library(tidyverse)

days <- c("Monday Morning", "Monday Afternoon", "Monday Evening", "Tuesday Morning", "Tuesday Afternoon", "Tuesday Evening", "Wednesday Morning", "Wednesday Afternoon", "Wednesday Evening", "Thursday Morning", "Thursday Afternoon", "Thursday Evening", "Friday Morning", "Friday Afternoon", "Friday Evening", "Saturday Morning", "Saturday Afternoon", "Sunday")

# generate random sample of numbers between 0 and 1
prob <- runif(length(days), 0, 1)

# normalize probs so sum of all probs == 1
prob_norm <- prob / sum(prob)

# create data frame with days and probabilities
day_prob <- tibble(
  days = days,
  prob = prob_norm
  ) %>% 
  print()
#> # A tibble: 18 x 2
#>    days                   prob
#>    <chr>                 <dbl>
#>  1 Monday Morning      0.0341 
#>  2 Monday Afternoon    0.00182
#>  3 Monday Evening      0.0613 
#>  4 Tuesday Morning     0.117  
#>  5 Tuesday Afternoon   0.0508 
#>  6 Tuesday Evening     0.0311 
#>  7 Wednesday Morning   0.111  
#>  8 Wednesday Afternoon 0.0873 
#>  9 Wednesday Evening   0.0536 
#> 10 Thursday Morning    0.0994 
#> 11 Thursday Afternoon  0.0417 
#> 12 Thursday Evening    0.0898 
#> 13 Friday Morning      0.0244 
#> 14 Friday Afternoon    0.0854 
#> 15 Friday Evening      0.0131 
#> 16 Saturday Morning    0.0697 
#> 17 Saturday Afternoon  0.00405
#> 18 Sunday              0.0241

# generate 500 weighted 3-day samples
days_sample <- rerun(
  .n = 500,
  sample(
    x = day_prob$days,
    size = 3,
    replace = FALSE,
    prob = day_prob$prob
    ) %>%
    set_names(LETTERS[1:3])
  ) %>%
  map_dfr(as_tibble) %>% 
  print()
#> # A tibble: 500 x 3
#>    A                 B                   C               
#>    <chr>             <chr>               <chr>           
#>  1 Wednesday Morning Wednesday Afternoon Monday Evening  
#>  2 Monday Evening    Thursday Evening    Thursday Morning
#>  3 Monday Evening    Thursday Afternoon  Tuesday Morning 
#>  4 Tuesday Evening   Thursday Morning    Thursday Evening
#>  5 Wednesday Morning Friday Afternoon    Saturday Morning
#>  6 Thursday Evening  Monday Evening      Thursday Morning
#>  7 Friday Morning    Monday Morning      Thursday Morning
#>  8 Thursday Evening  Monday Evening      Tuesday Morning 
#>  9 Thursday Morning  Wednesday Morning   Saturday Morning
#> 10 Thursday Morning  Saturday Morning    Friday Afternoon
#> # … with 490 more rows

Created on 2018-11-05 by the reprex package (v0.2.1)
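
One caveat that may be worth checking: with replace = FALSE, sample() applies the weights sequentially to whatever items remain, so only the first pick follows `prob` directly. The sketch below compares the weights with observed shares. It is base R, uses illustrative weights (the two values from the follow-up question, the rest spread evenly, which is an assumption), and draws 5000 rows instead of 500 so the shares are steadier.

```r
set.seed(4)  # arbitrary seed, only so the sketch is reproducible

# Same 18 values as the `Days` vector in the original post
Days <- c(
  paste(rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"), each = 3),
        c("Morning", "Afternoon", "Evening")),
  "Saturday Morning", "Saturday Afternoon", "Sunday"
)

# Illustrative weights: two values from the question, the rest shared evenly
w <- rep((1 - 0.03 - 0.43) / 16, length(Days))
w[Days == "Monday Morning"] <- 0.03
w[Days == "Tuesday Evening"] <- 0.43

n <- 5000
picks <- t(replicate(n, sample(Days, 3, replace = FALSE, prob = w)))

# Share of each slot as the first pick vs. its share among all three picks
first_share <- as.vector(table(factor(picks[, 1], levels = Days))) / n
any_share   <- as.vector(table(factor(picks, levels = Days))) / (3 * n)

comparison <- data.frame(
  slot = Days,
  weight = round(w, 3),
  first_pick = round(first_share, 3),
  any_pick = round(any_share, 3)
)
comparison
```

A heavily weighted slot can account for at most one third of all picks, because it can appear only once per row. If the weights are meant to describe overall popularity, the first-pick column is the one to compare against them.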

system · November 12, 2018, 11:32pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.