From one column of comma separated string, create multiple dummies named after comma separated values

pathos · October 7, 2023, 2:25pm

Let's say we have two columns and many rows of similar data as the following:

df = data.frame(id = c('a', 'b', 'c', 'd'),
                column1 = c('c3',  'ccc2,         c3', 'c1,c3', 'c1,ccc2,c3,  c4g'))

  id      column1
1  a           c3
2  b        c2,c3
3  c        c1,c3
4  d c1,c2,c3,c4g

My goals is something like this:

df_wider = data.frame(id = c('a', 'b', 'c', 'd'),
                      c1 = c(0, 0, 1, 1),
                      ccc2 = c(0, 1, 0, 1),
                      c3 = c(1, 1, 1, 1),
                      c4g = c(0, 0, 0, 1))

  id c1 ccc2 c3 c4g
1  a  0    0  1   0
2  b  0    1  1   0
3  c  1    0  1   0
4  d  1    1  1   1

Where the column names c1, ccc2, c3, c4g (and whatever infinite other distinct values there may be) are automatically generated from the data.

This most likely means that I will have to use something along the lines of df_wider = df |> mutate(??? := expression), as well as stringr to remove those random number of white spaces (the latter shouldn't be a problem with str_replace_all()).

I thought maybe I can start by storing unique comma parsed values somewhere, but then I'm lost on how I can get that into the goal dataframe using tidytable only (or let's say dplyr+).

FJCC · October 7, 2023, 4:01pm

Here is a stream of consciousness solution. There may well be a more elegant one.

df = data.frame(id = c('a', 'b', 'c', 'd'),
                column1 = c('c3',  'ccc2,         c3', 'c1,c3', 'c1,ccc2,c3,  c4g'))
library(stringr)
library(tidyr)
library(dplyr)

df <- df |> mutate(column1 = str_replace_all(column1, " ", ""))
MaxComma <- max(str_count(df$column1, ","))
Sep_df <- df |> separate_wider_delim(column1, delim = ",", 
                           names = paste0("X", 1:(MaxComma+1)), too_few = "align_start") 
Tall_df <- Sep_df |> pivot_longer(starts_with("X"), names_to = "dummy", values_to = "Value") |> 
  filter(!is.na(Value)) |> 
  mutate(dummy = 1)
IDs <- unique(Tall_df$id)
Values <- unique(Tall_df$Value)
All_Comb <- expand.grid(IDs, Values)
All_Comb <- left_join(All_Comb, Tall_df, by = c(Var1 = "id", Var2 = "Value")) |> 
  mutate(dummy = ifelse(is.na(dummy), 0, dummy))
All_Comb |> pivot_wider(names_from = "Var2", values_from = "dummy")
#> # A tibble: 4 × 5
#>   Var1     c3  ccc2    c1   c4g
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a         1     0     0     0
#> 2 b         1     1     0     0
#> 3 c         1     0     1     0
#> 4 d         1     1     1     1

^{Created on 2023-10-07 with reprex v2.0.2}

pathos · October 8, 2023, 6:04pm

That's a very interesting approach! Thanks a million! Great to learn about separate_wider_delim().

Maybe as a side note for future readers, I will probably use Tall_df |> distinct(id) instead of unique().

Edit: Figured out a much shorter way!

df = data.frame(id = c('a', 'b', 'c', 'd'),
                column1 = c('c3',  'ccc2,         c3', 'c1,c3', 'c1,ccc2,c3,  c4g'))

df |>
  mutate(column1 = str_replace_all(column1, " ", "")) |>
  distinct(id, column1) |>
  separate_longer_delim(column1,
                        delim = ',') |>
  mutate(dummy = 1) |>
  pivot_wider(names_from = column1,
              values_from = dummy,
              values_fill = 0)

system · October 15, 2023, 6:05pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.