Let's say we have two columns and many rows of similar data as the following:
df = data.frame(id = c('a', 'b', 'c', 'd'),
column1 = c('c3', 'ccc2, c3', 'c1,c3', 'c1,ccc2,c3, c4g'))
id column1
1 a c3
2 b c2,c3
3 c c1,c3
4 d c1,c2,c3,c4g
My goals is something like this:
df_wider = data.frame(id = c('a', 'b', 'c', 'd'),
c1 = c(0, 0, 1, 1),
ccc2 = c(0, 1, 0, 1),
c3 = c(1, 1, 1, 1),
c4g = c(0, 0, 0, 1))
id c1 ccc2 c3 c4g
1 a 0 0 1 0
2 b 0 1 1 0
3 c 1 0 1 0
4 d 1 1 1 1
Where the column names c1
, ccc2
, c3
, c4g
(and whatever infinite other distinct values there may be) are automatically generated from the data.
This most likely means that I will have to use something along the lines of df_wider = df |> mutate(??? := expression)
, as well as stringr
to remove those random number of white spaces (the latter shouldn't be a problem with str_replace_all()
).
I thought maybe I can start by storing unique comma parsed values somewhere, but then I'm lost on how I can get that into the goal dataframe using tidytable
only (or let's say dplyr
+).