Hello, I am a postgrad student and I have an assignment where I am asked to do one-hot encoding on the categorical columns of a dataframe. The process must be completed manually, without applying the function one_hot. I would like to understand what steps the function applies so that I can apply them manually to my dataframe.
one_hot()
is not a base R function, so you should specify which package you are referring to. Ideally, you should present your issue as a REPRoducible EXample (reprex).
Please take a look at our homework policy to learn how to properly ask homework-inspired questions here.
I honestly don't know. That is why I am asking.
You can see the code of a function by typing its name, without parentheses, in the console.
What @FJCC is suggesting here, @myusername, is that you type the following at the console:
> one_hot
I.e. without the parentheses. Then you can see how the function works. The likely reason people are reluctant to give you the answer is that you will learn nothing from simply getting the code. The whole process of thinking the solution through, implementing it and checking that it works is where you learn something.
Give it a shot. If you get stuck, return here with what you did and where you're stuck, and I'm certain that people will gladly chip in.
I am trying to apply the function one_hot manually in R for an assignment.
Sample of my dataset:
a <- c('red','red','green')
b <- c('large', 'medium', 'small')
c <- c('wide','narrow','narrow')
df <- data.frame(a, b, c)
Using the one_hot function from the package scorecard:
one_hot(df)
returns this output:
   a_green a_red b_large b_medium b_small c_narrow c_wide
1:       0     1       1        0       0        0      1
2:       0     1       0        1       0        1      0
3:       1     0       0        0       1        1      0
I would like to create the same output without using the function. So far I did those steps:
- converted the categorical columns to factors
# convert every column to a factor; df[] keeps the data.frame structure
df[] <- lapply(df, as.factor)
- found the length of the levels (k). I wrote this function
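to.encode <- c('a', 'b', 'c')
one.hot <- function(df, to.encode) {
  # lapply always returns a list, even if all columns have the same number of levels
  k <- lapply(df[to.encode], levels)
  for (i in k) {
    if (!is.null(i)) {
      len <- length(i) - 1  # number of levels minus one
      print(len)
    }
  }
}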
output is the length of the levels minus 1 (k-1)
> one.hot(df, to.encode)
[1] 1
[1] 2
[1] 1
Now I want to create (k-1) new columns for each categorical column. I want to set the value to 1 if the original variable's value corresponded to the column, and 0 otherwise.
Any advice on how to take this to the next step? Thank you
Please refrain from re-posting your question, as it clutters the discussion board. I recommend continuing in the original thread.
Hi @myusername,
Here is a bit of generic code for one-hot encoding to get you started:
set.seed(859315)
n = 10
categories = sample(x = seq(from = 1, to = 3), size = n, replace = TRUE)
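# sapply returns one column per element, so transpose to get one row per element
one_hot = t(sapply(X = categories, FUN = function(x_i){
  v = c(0, 0, 0)  # one slot per category (3 categories here)
  v[x_i] = 1      # switch on the slot matching this category
  return(v)
}))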
Yielding:
> categories
[1] 1 3 1 2 1 2 1 1 1 1
> one_hot
      [,1] [,2] [,3]
 [1,]    1    0    0
 [2,]    0    0    1
 [3,]    1    0    0
 [4,]    0    1    0
 [5,]    1    0    0
 [6,]    0    1    0
 [7,]    1    0    0
 [8,]    1    0    0
 [9,]    1    0    0
[10,]    1    0    0
See if you can formalise it and apply it to your data
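For anyone landing here later, here is one possible way to carry the same idea over to the sample data frame from the question. It is only a sketch, not how scorecard implements one_hot: the helper name manual_one_hot and its to_encode argument are made up for illustration, and it assumes the columns contain no NA values. For each column it compares the values against every distinct level and turns the TRUE/FALSE results into 1/0.

```r
# Sample data from the question
df <- data.frame(
  a = c("red", "red", "green"),
  b = c("large", "medium", "small"),
  c = c("wide", "narrow", "narrow")
)

# Hypothetical helper (the name is illustrative, not from any package):
# builds one 0/1 indicator column per distinct value of each column.
manual_one_hot <- function(df, to_encode = names(df)) {
  pieces <- lapply(to_encode, function(col) {
    x  <- as.character(df[[col]])
    lv <- sort(unique(x))            # the distinct categories, sorted
    m  <- outer(x, lv, "==") + 0L    # logical comparison matrix turned into 0/1
    colnames(m) <- paste(col, lv, sep = "_")
    m
  })
  # bind the per-column indicator matrices side by side
  as.data.frame(do.call(cbind, pieces))
}

encoded <- manual_one_hot(df)
print(encoded)
```

On the sample data this should give the same column names and 0/1 values as the one_hot output posted in the question. If you need k-1 columns per variable instead of k, one option is to drop one indicator column per variable (for example the first level) before binding.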