I am trying to reproduce what the one_hot function does manually in R for an assignment.
Here is a sample of my dataset:
a <- c('red','red','green')
b <- c('large', 'medium', 'small')
c <- c('wide','narrow','narrow')
df <- data.frame(a, b, c)
Calling the one_hot function from the scorecard package
one_hot(df)
returns this output:
   a_green a_red b_large b_medium b_small c_narrow c_wide
1:       0     1       1        0       0        0      1
2:       0     1       0        1       0        1      0
3:       1     0       0        0       1        1      0
I would like to create the same output without using the function. So far I have done these steps:
- converted the categorical columns to factors
for (i in colnames(df)) {
  df[[i]] <- as.factor(df[[i]])
}
- found the number of levels (k) of each column, using this function
to.encode <- c('a', 'b', 'c')
one.hot <- function(df, to.encode) {
  # lapply always returns a list, even if every column has the same number of levels
  k <- lapply(df[to.encode], levels)
  for (i in k) {
    if (!is.null(i)) {
      len <- length(i) - 1
      print(len)
    }
  }
}
The output is the number of levels minus 1 (k-1) for each column:
> one.hot(df, to.encode)
[1] 1
[1] 2
[1] 1
Now I want to create (k-1) new columns for each categorical column, setting the value to 1 if the original variable's value corresponds to that column's level and 0 otherwise.
Any advice on how to take this to the next step? Thank you
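For illustration, the step described above could be sketched roughly as follows. This is only one possible approach, not how scorecard implements one_hot. The name manual_one_hot and the drop.first argument are hypothetical, and the sketch assumes every encoded column has at least two levels. It compares each column against its own levels and returns only the encoded columns.

```r
# Hypothetical helper (not part of the scorecard package): builds one 0/1
# indicator column per factor level by comparing values against the levels.
manual_one_hot <- function(df, to.encode = colnames(df), drop.first = FALSE) {
  encoded <- lapply(to.encode, function(col) {
    values <- as.factor(df[[col]])
    lev <- levels(values)
    if (drop.first) lev <- lev[-1]  # keep k-1 levels by dropping the first one
    # One integer indicator vector per remaining level
    out <- lapply(lev, function(l) as.integer(values == l))
    names(out) <- paste(col, lev, sep = "_")
    as.data.frame(out)
  })
  # Bind the per-column indicator data frames side by side
  do.call(cbind, encoded)
}

df <- data.frame(a = c('red', 'red', 'green'),
                 b = c('large', 'medium', 'small'),
                 c = c('wide', 'narrow', 'narrow'))

full <- manual_one_hot(df)                        # k columns per variable
reduced <- manual_one_hot(df, drop.first = TRUE)  # k-1 columns per variable
```

With drop.first = FALSE the result has k columns per variable, like the one_hot output shown earlier; with drop.first = TRUE the first level of each variable becomes the reference and gets no column, which gives the (k-1) layout.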