Using dummy variables for categorical data

How do I convert the data below using dummy variables?

Class : chr "no-recurrence-events" "recurrence-events" "recurrence-events" "no-recurrence-events" ...
PostMeno : chr "premeno" "It40" "premeno" "ge40" ...
NodeCaps : chr "no" "yes" "no" "no" "yes" "yes" ...
Breast : chr "left" "right" "left" "right" ...
Quadrant : chr "left_low" "right_up" "central" "left_up" "right_low" ...
Radiation: chr "no" "yes" "no" "yes" ...

Class : has 2 levels ----- "no-recurrence-events" "recurrence-events"
PostMeno : has 3 levels ----- "It40" "premeno" "ge40"
NodeCaps : has 2 levels ----- "no" "yes"
Breast : has 2 levels ----- "left" "right"
Quadrant : has 5 levels ----- "left_low" "right_up" "central" "left_up" "right_low"...
Radiation: has 2 levels ----- "no" "yes"

Check out fct_recode() in the forcats package.

Also, some good info on recoding dummy variables using ifelse() here:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/QuantCore/PH717_MultipleVariableRegression/PH717_MultipleVariableRegression4.html
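
For a two-level column, the ifelse() approach might look like the sketch below. The data frame cancer_sample and the new column names are made up for illustration; only the "no"/"yes" values come from the question.

```r
# Toy data using the NodeCaps and Radiation values from the question
# (cancer_sample is a hypothetical name, not the asker's real data)
cancer_sample <- data.frame(
  NodeCaps  = c("no", "yes", "no", "no", "yes", "yes"),
  Radiation = c("no", "yes", "no", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# One 0/1 indicator per two-level column: 1 when the value is "yes", else 0
cancer_sample$NodeCaps_yes  <- ifelse(cancer_sample$NodeCaps == "yes", 1, 0)
cancer_sample$Radiation_yes <- ifelse(cancer_sample$Radiation == "yes", 1, 0)

cancer_sample
```

A column with more than two levels would need one such indicator per level (or per level minus one, if a reference level is dropped).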

And a package specifically for recoding (though I haven't personally used it), fastDummies.
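
A minimal sketch of what that could look like, assuming fastDummies is installed. The toy data frame is hypothetical; dummy_cols() and its select_columns argument are the package's own.

```r
library(fastDummies)

# Hypothetical toy data with two of the columns from the question
toy <- data.frame(
  NodeCaps = c("no", "yes", "no", "no"),
  Breast   = c("left", "right", "left", "right"),
  stringsAsFactors = FALSE
)

# dummy_cols() appends one 0/1 column per level of each selected column,
# named <column>_<level>; the original columns are kept by default
toy_dummies <- dummy_cols(toy, select_columns = c("NodeCaps", "Breast"))

names(toy_dummies)
```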

I have seen all this online. I have a large data set with 286 rows and 10 columns. My problem is finding a single way to go about it. Out of the 10 columns, I want to create dummy variables for 9 of them. Any suggestions on how to do that, please?

Could you please turn this into a self-contained reprex (short for reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

There's also a nice FAQ on how to do a minimal reprex for beginners, below:

What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ.

Can you please explain what you mean by this? Can you please provide the expected output for a copy-paste friendly sample dataset? As Mara has noted, a reprex will be very helpful.

If you meant something like coding c("A", "B", "A", "A", "B", "C") as c(1, 2, 1, 1, 2, 3), then you can use the as.integer function on a factor. Or, if you want to recode with some other labels, you can use the labels argument of the factor function.
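
As a quick base R sketch of both options (the labels "low", "mid" and "high" are invented purely for illustration):

```r
x <- c("A", "B", "A", "A", "B", "C")

# factor() sorts the distinct values into levels; as.integer() returns the level codes
codes <- as.integer(factor(x))
codes

# The labels argument renames the levels instead, matched to `levels` by position
relabelled <- factor(x, levels = c("A", "B", "C"), labels = c("low", "mid", "high"))
relabelled
```

Note that as.integer() applied directly to the character vector would give NA values, which is why the conversion goes through factor() first.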

Here, I'm providing an example, where I've recoded to integers but through the factor function. I'm recoding all columns except one particular column.

set.seed(seed = 28127)

suppressPackageStartupMessages(expr = library(package = "dplyr"))

dataset <- data.frame(Class = sample(x = c("no-recurrence-events", "recurrence-events"),
                                     size = 20,
                                     replace = TRUE),
                      PostMeno = sample(x = c("It40", "premeno", "ge40"),
                                        size = 20,
                                        replace = TRUE),
                      NodeCaps = sample(x = c("no", "yes"),
                                        size = 20,
                                        replace = TRUE),
                      Breast = sample(x = c("left", "right"),
                                      size = 20,
                                      replace = TRUE),
                      Quadrant = sample(x = c("left_low", "right_up", "central", "left_up", "right_low"),
                                        size = 20,
                                        replace = TRUE),
                      Radiation = sample(x = c("no", "yes"),
                                         size = 20,
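                                         replace = TRUE),
                      # R >= 4.0 no longer converts strings to factors by default,
                      # and nlevels() below needs factor columns
                      stringsAsFactors = TRUE)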

dataset %>%
  mutate_at(.vars = vars(-Radiation),
            .funs = function(y) factor(x = y,
                                       labels = seq_len(length.out = nlevels(x = y))))
#>    Class PostMeno NodeCaps Breast Quadrant Radiation
#> 1      1        3        1      1        1       yes
#> 2      1        3        1      1        1        no
#> 3      2        2        1      1        1        no
#> 4      1        2        1      2        5       yes
#> 5      1        1        1      1        5        no
#> 6      1        2        1      2        2       yes
#> 7      1        1        1      2        3       yes
#> 8      1        3        2      2        3        no
#> 9      2        2        1      2        1        no
#> 10     2        1        1      1        3       yes
#> 11     2        1        2      2        2       yes
#> 12     1        2        2      2        5       yes
#> 13     2        1        2      1        4       yes
#> 14     2        2        2      2        5        no
#> 15     2        2        2      2        1       yes
#> 16     1        2        1      2        4        no
#> 17     1        2        2      2        5       yes
#> 18     1        1        1      2        1        no
#> 19     2        3        2      1        1        no
#> 20     1        2        1      1        1       yes

Created on 2019-04-09 by the reprex package (v0.2.1)

Hope this helps.

Also, keep in mind that recoding your factor variables as integers (e.g. 1, 2, 3) introduces an order into your data, which may or may not be desirable for your model. If you want to avoid this, you have to create "one-hot encoded" dummy variables (i.e. columns with only 1 or 0 values). One easy way of doing this is with the caret package; see this example.

df <- data.frame(stringsAsFactors = FALSE,
                 age = as.factor(c("75+", "55-74", "35-54", "25-34", "15-24", "5-14")),
                 value = 1:6)

library(caret)

dmy <- dummyVars(" ~ .", data = df)
recoded <- data.frame(predict(dmy, newdata = df))
recoded
#>   age.15.24 age.25.34 age.35.54 age.5.14 age.55.74 age.75. value
#> 1         0         0         0        0         0       1     1
#> 2         0         0         0        0         1       0     2
#> 3         0         0         1        0         0       0     3
#> 4         0         1         0        0         0       0     4
#> 5         1         0         0        0         0       0     5
#> 6         0         0         0        1         0       0     6
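
The same kind of 0/1 columns can also be sketched without caret, using base R's model.matrix(). This is an alternative for comparison, not what the example above used, and the column names it produces differ from caret's.

```r
df <- data.frame(age = factor(c("75+", "55-74", "35-54", "25-34", "15-24", "5-14")),
                 value = 1:6)

# "- 1" drops the intercept, so every level of age gets its own 0/1 column
onehot <- model.matrix(~ age - 1, data = df)

# Columns are named "age" followed by the level; each row contains exactly one 1
colnames(onehot)
rowSums(onehot)
```

Without the "- 1", model.matrix() would drop the first level and keep an intercept column, which is the usual reference-level coding for regression.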

Just to defend my proposed solution, I'd like to add that although this is often correct, it isn't always the case. For example, the columns that I recoded above are not ordered.

> sapply(X = dataset, FUN = is.ordered)
    Class  PostMeno  NodeCaps    Breast  Quadrant Radiation 
    FALSE     FALSE     FALSE     FALSE     FALSE     FALSE

I like your code, but I have a large data set with 286 rows and 10 columns. The name of the data set is "Cancer". Do I replace "x" with "Cancer"? How do I put that into your code? I am new to R.
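
In the earlier snippet, x and y are argument names, so they stay as they are; it is the data frame name (dataset) that would be swapped for the name of your own data. A base R sketch of the same idea is below. The Cancer data here is a made-up stand-in, and the Age column is hypothetical, standing for whichever column you don't want recoded.

```r
# Stand-in for the real data; replace this with your own Cancer data frame
Cancer <- data.frame(
  Age      = c(45, 52, 38, 61),   # hypothetical column to leave unchanged
  Class    = c("no-recurrence-events", "recurrence-events",
               "recurrence-events", "no-recurrence-events"),
  NodeCaps = c("no", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Every column except the one(s) you want to keep as they are
to_recode <- setdiff(names(Cancer), "Age")

# factor() builds the levels, as.integer() turns them into codes 1, 2, ...
Cancer[to_recode] <- lapply(Cancer[to_recode], function(y) as.integer(factor(y)))

str(Cancer)
```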

Thank you for adding this. But I want each age group to be replaced with the midpoint of its range. For example, "55-74" should be replaced with 64.5 and "35-54" with 44.5. How do I write such code?
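
One possible base R sketch for this: split each label on the dash and average the two bounds, so "55-74" gives 64.5 and "35-54" gives 44.5. The open-ended "75+" group has no upper bound, so this sketch leaves it as NA; what value it should get is an assumption left to you.

```r
age <- c("75+", "55-74", "35-54", "25-34", "15-24", "5-14")

# Split "lo-hi" labels on the dash
bounds <- strsplit(age, "-", fixed = TRUE)

# Average the two bounds; labels without two bounds (e.g. "75+") become NA
midpoint <- vapply(bounds, function(b) {
  if (length(b) == 2) mean(as.numeric(b)) else NA_real_
}, numeric(1))

data.frame(age, midpoint)
```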
