Assign the value '1' to many dummy variables at once, by matching column name to row values

notgoodatr · July 8, 2022, 2:58pm

I have the following (sample) dput() for a dataframe consisting of non-unique IDs, countries, and columns representing country names:

structure(list(id = c(1, 1, 2, 3, 4, 4), country = c("USA", "Japan", 
"Germany", "Austria", "Japan", "Austria"), USA = c(0, 0, 0, 0, 
0, 0), Germany = c(0, 0, 0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0, 
0), Austria = c(0, 0, 0, 0, 0, 0), Belgium = c(0, 0, 0, 0, 0, 
0)), class = "data.frame", row.names = c(NA, -6L))

I would like to alter the above dataframe such that, for every id i, each column with a country name "I.e., df$Germany, df$Japan" has a value == 1 if df$country == 1. I.e., rows 1 and 2 should have df$Japan == 1 and df$USA == 1; rows 5 and 6 should have df$Japan == 1 and df$Austria == 1.

Are there any ways to do this without using for loops or pivot? (notice my current data has a "Belgium" column but no Belgium value in df$country -- pivot will not work here as far as I understand). For loops take much too long--I have 30+ datasets of 100k+ observations each that I need to rapidly apply this too.

Someone on a different thread suggested: df[levels(factor(df$country))] = model.matrix(~country - 1, df) but this gives an erroneous structure, the dput being given below:

structure(list(id = c(1, 1, 2, 3, 4, 4), country = c("USA", "Japan", 
"Germany", "Austria", "Japan", "Austria"), USA = c(1, 0, 0, 0, 
0, 0), Germany = c(0, 0, 1, 0, 0, 0), Japan = c(0, 1, 0, 0, 1, 
0), Austria = c(0, 0, 0, 1, 0, 1), Belgium = c(0, 0, 0, 0, 0, 
0)), row.names = c(NA, -6L), class = "data.frame")

I've also thought about using mutate() with cur_column() but I'm not sure as to how to even write that code.

AlexisW · July 10, 2022, 11:59pm

There is actually a trick, if you make country a factor with all the possible countries as level, then you can use the names_expand option of pivot_wider() to ensure all countries are represented:

library(tidyverse)


df <- structure(list(id = c(1, 1, 2, 3, 4, 4),
                     country = c("USA", "Japan", "Germany", "Austria", "Japan", "Austria"),
                     USA = c(0, 0, 0, 0, 0, 0),
                     Germany = c(0, 0, 0, 0, 0, 0),
                     Japan = c(0, 0, 0, 0, 0,0),
                     Austria = c(0, 0, 0, 0, 0, 0),
                     Belgium = c(0, 0, 0, 0, 0, 0)),
                class = "data.frame", row.names = c(NA, -6L)) 


df$country <- factor(df$country, levels = names(df)[-(1:2)])

df |>
  select(1, 2) |>
  mutate(id2 = id) |>
  pivot_wider(id2,
              names_from = "country",
              values_from = "id",
              names_expand = TRUE,
              values_fill = 0) |>
  select(-id2)
#> # A tibble: 4 × 5
#>     USA Germany Japan Austria Belgium
#>   <dbl>   <dbl> <dbl>   <dbl>   <dbl>
#> 1     1       0     1       0       0
#> 2     0       2     0       0       0
#> 3     0       0     0       3       0
#> 4     0       0     4       4       0

Note that pivot_wider() needs an "id" column to specify the final number of rows, and a separate "values" column to specify what to write inside. So here I duplicated the id column and removed id2 at the end. Another possibility if you want 1 as the value is to create a column of 1s:

df |>
  select(1, 2) |>
  mutate(vals = 1) |>
  pivot_wider(id,
              names_from = "country",
              values_from = vals,
              names_expand = TRUE,
              values_fill = 0)

If you prefer to loop, it's also possible to use nest() to create a small df for each id, then, within each id, fill in the values for countries that are represented, then unnest. In that case, at the end I had as many output rows as input rows, so I removed the country column and used distinct() to keep one row per id.

df |>
  group_by(id) |>
  nest() |>
  mutate(data = map(data,
                     \(subdf){
                       subdf[subdf$country] <- 1
                       subdf
                     })) |>
  unnest(data) |>
  select(-id,-country) |>
  distinct()
#> Adding missing grouping variables: `id`
#> # A tibble: 4 × 6
#> # Groups:   id [4]
#>      id   USA Germany Japan Austria Belgium
#>   <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>
#> 1     1     1       0     1       0       0
#> 2     2     0       1     0       0       0
#> 3     3     0       0     0       1       0
#> 4     4     0       0     1       1       0

^{Created on 2022-07-10 by the reprex package (v2.0.1)}

system · August 1, 2022, 12:00am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.