Undo dummy variables.

Some machine learning algorithms need string data in the form of dummy variables: instead of, for example, "Germany" in the column "Origin", we get a 1 in the column "Origin_Germany" and a 0 in every other column that starts with "Origin_". Naturally, this operation creates a lot of new columns, one for each distinct country in the dataframe. The information in the data remains the same.
Additionally, some machine learning algorithms do not tolerate linear dependency (redundancy) in the data. In the following example, the column "Fruit" contains the strings "Apple" and "Banana". To remove the redundant data after applying the dummy operation, I need to drop the column "Fruit" itself and also exactly one of its dummy columns, since a "0" in "Fruit_Banana" implies a "1" in the other column, in this case "Fruit_Apple". Assuming I remember the names of the columns I dropped, the information still remains the same.
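To make the redundancy concrete, here is a minimal base R sketch. The vector names are made up for illustration and are not part of my real data. It only shows that, with two levels, either dummy column can be rebuilt from the other.

```r
# Illustrative vectors only; these names are assumptions for this sketch.
Fruit <- c("Banana", "Apple", "Banana", "Apple", "Apple")
Fruit_Apple  <- as.integer(Fruit == "Apple")
Fruit_Banana <- as.integer(Fruit == "Banana")

# Every row has exactly one 1 across the two dummy columns ...
stopifnot(all(Fruit_Apple + Fruit_Banana == 1L))

# ... so a dropped column can be rebuilt from the one that was kept.
Fruit_Banana_rebuilt <- 1L - Fruit_Apple
```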
Now the hard part: how do I invert these operations? My actual data has almost 100 columns and a four-digit number of dummy variables, so I will need to undo the transformation after obtaining the result from the machine learning algorithms in order to get readable data. The following is a simpler example that illustrates the problem. How do I get the dataframe "Fruits" back from the information given in "FruitsDummy" and knowledge of the dummy columns I dropped?

library(fastDummies)

Fruit<-c("Banana", "Apple", "Banana", "Apple", "Apple")
Origin<-c("New Guinea", "China", "Germany", "USA", "Germany")
Quality<-c("Good", "Bad", "Good", "Very bad", "Decent")
Value<-c(50,75,80,60,30) #cents
Price<-c(1,2,1,3,1)     #euros

Fruits<-data.frame(Fruit, Origin, Quality, Value, Price)
Fruits$Fruit<-as.character(Fruits$Fruit)
Fruits$Origin<-as.character(Fruits$Origin)
Fruits$Quality<-as.character(Fruits$Quality)
Fruits$Value<-as.numeric(Fruits$Value)
Fruits$Price<-as.numeric(Fruits$Price)

FruitsDummy<-subset(dummy_cols(Fruits), select=c(-`Fruit`, -`Fruit_Banana`, -`Origin`,
                                                 -`Origin_China`, -`Quality`, -`Quality_Good`))

The following code will recover the Origin column. If it suits your needs, you can easily replicate it with suitable tweaks to recover the Fruit and Quality columns.

library(dplyr)
library(stringr)
library(tidyr)

# List the origin columns in the data frame.
origin_cols <- names(FruitsDummy) %>% str_subset("^Origin")
# Add the missing dummy column for Origin_China.
temp <- FruitsDummy %>% mutate(Origin_China = 1 - rowSums(across(all_of(origin_cols))))
# Include Origin_China in the origin column list.
origin_cols <- c(origin_cols, "Origin_China")
# Pivot the origin column names and values into two columns.
temp <- temp %>% pivot_longer(cols = all_of(origin_cols), names_to = "Origin", values_to = "OV")
# Select the rows where the dummy value is 1. (The others are noise.) Then drop the OV column.
temp <- temp %>% filter(OV == 1) %>% select(-OV)
# Clean up the origin names by removing the prefix.
# (The group argument of str_extract() requires stringr 1.5.0 or later.)
temp <- temp %>% mutate(Origin = str_extract(Origin, "^Origin_(.*)", group = 1))
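With many dummy-encoded columns, repeating those steps by hand gets tedious, so here is one possible way to wrap the same idea in a function. This is a base R sketch, not a tested solution for your real data. The name undo_dummy and its arguments are hypothetical. It assumes that the dummy values are exactly 0 or 1, that you know which level was dropped for each column, and that no unrelated column name starts with the same prefix plus an underscore. The small data frame below is a hand-built stand-in for FruitsDummy so that the snippet runs on its own.

```r
# Hypothetical helper: rebuild one character column from its dummy columns.
#   df      : data frame holding the dummy columns
#   prefix  : name of the original column, e.g. "Origin"
#   dropped : level whose dummy column was removed, e.g. "China"
undo_dummy <- function(df, prefix, dropped) {
  stem <- paste0(prefix, "_")
  dummy_names <- names(df)[startsWith(names(df), stem)]
  kept_levels <- substring(dummy_names, nchar(stem) + 1)
  m <- as.matrix(df[dummy_names])
  # Position of the 1 in each row gives the level ...
  value <- kept_levels[max.col(m, ties.method = "first")]
  # ... and a row of all zeros belongs to the dropped level.
  value[rowSums(m) == 0] <- dropped
  df[dummy_names] <- NULL
  df[[prefix]] <- value
  df
}

# Hand-built stand-in for FruitsDummy from the question.
FruitsDummy <- data.frame(
  Value = c(50, 75, 80, 60, 30),
  Price = c(1, 2, 1, 3, 1),
  Fruit_Apple = c(0, 1, 0, 1, 1),
  Origin_Germany = c(0, 0, 1, 0, 1),
  `Origin_New Guinea` = c(1, 0, 0, 0, 0),
  Origin_USA = c(0, 0, 0, 1, 0),
  Quality_Bad = c(0, 1, 0, 0, 0),
  Quality_Decent = c(0, 0, 0, 0, 1),
  `Quality_Very bad` = c(0, 0, 0, 1, 0),
  check.names = FALSE
)

# One entry per original column: name = column, value = dropped level.
dropped <- c(Fruit = "Banana", Origin = "China", Quality = "Good")

restored <- FruitsDummy
for (p in names(dropped)) {
  restored <- undo_dummy(restored, p, dropped[[p]])
}
```

The rebuilt columns are appended at the end of the data frame, so the column order differs from the original Fruits. If the output of your model is not exactly 0 or 1, you would need to decide on a rounding rule before applying something like this.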
