Hi, I'm using the SMOTE function, but I don't understand the perc.over and perc.under arguments or the logic behind them. Thanks for your attention.
Yes, I did. But I didn't understand the logic. Is the associated number a percentage? How many new observations are generated if, for example, I set perc.over = 200? Thanks for your help.
First of all, I want to make clear what SMOTE, the method, does. SMOTE generates artificial data points for the minority class by placing them on straight lines between each minority observation and its nearest neighbors within the minority class.
The SMOTE() function doesn't just perform SMOTE; it also undersamples the majority class by randomly selecting which of its observations to keep.
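To make the "straight paths between nearest neighbors" idea concrete, here is a minimal base R sketch of the interpolation step for numeric features. The helper smote_point() and the two example points are made up for illustration and are not part of DMwR; the real function also has to find the neighbors and handle non-numeric columns.

```r
# Hypothetical helper: place a synthetic point on the line segment
# between a minority observation and one of its minority-class neighbors.
smote_point <- function(x, neighbor, gap = runif(1)) {
  # gap = 0 returns x itself, gap = 1 returns the neighbor
  x + gap * (neighbor - x)
}

x        <- c(5.1, 3.5)  # a minority observation (illustrative values)
neighbor <- c(4.9, 3.0)  # one of its nearest minority neighbors

# Halfway between the two points: 5.0 and 3.25
smote_point(x, neighbor, gap = 0.5)
```

With the default gap, each call draws a new random position along the segment, which is why repeated SMOTE runs give different synthetic points.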
I'll create an example data set to help here. It has the majority class "common" and minority class "rare".
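library(DMwR)   # provides SMOTE()
library(dplyr)  # provides %>% and count()
# Drop 27 of the 50 setosa rows so that class becomes the minority
new_iris <- iris[-(1:27), ]
new_iris$Species <- factor(ifelse(new_iris$Species == "setosa", "rare", "common"))
new_iris %>% count(Species)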
#> Species n
#> 1 common 100
#> 2 rare 23
I'm going over the two arguments, perc.over and perc.under, one at a time to clarify what is happening, starting with perc.over. To make things clearer, I am going to set perc.under = 0 for the following examples. This eliminates the majority class so we can focus on what happens to the minority class.
If we set perc.over = 100 we get 100% new SMOTEd observations. In other words, each of the "rare" observations gets one synthetic counterpart, so we end up with 23 + 23 = 46 observations.
SMOTE(Species ~ ., new_iris, perc.over = 100, perc.under = 0) %>%
count(Species)
#> Species n
#> 1 rare 46
This pattern continues as you increase perc.over, such that perc.over = 200 gives 23 + 23 * 2 = 69
SMOTE(Species ~ ., new_iris, perc.over = 200, perc.under = 0) %>%
count(Species)
#> Species n
#> 1 rare 69
and perc.over = 600 gives 23 + 23 * 6 = 161.
SMOTE(Species ~ ., new_iris, perc.over = 600, perc.under = 0) %>%
count(Species)
#> Species n
#> 1 rare 161
It is worth noting here that perc.over is rounded down to the nearest multiple of 100. So 200, 201, 250, and 299.999 all return the same number of observations, because SMOTE() generates the same whole number of new points for each observation in the minority class.
SMOTE(Species ~ ., new_iris, perc.over = 290, perc.under = 0) %>%
count(Species)
#> Species n
#> 1 rare 69
Lastly, if perc.over is between 0 and 100, only a proportion of the minority observations get a synthetic counterpart. Here we set perc.over = 25, which means we generate synthetic observations based on 25% of the minority observations: floor(23 + 23 * 0.25) = 28.
SMOTE(Species ~ ., new_iris, perc.over = 25, perc.under = 0) %>%
count(Species)
#> Species n
#> 1 rare 28
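The perc.over arithmetic from the examples above can be collected into one small base R helper. This is only a sketch that reproduces the counts shown in this thread, not the DMwR source; expected_minority() is a name invented here.

```r
# Hypothetical helper: predicted size of the minority class after SMOTE(),
# based on the behaviour observed in the examples above.
expected_minority <- function(n_min, perc.over) {
  if (perc.over < 100) {
    # Only a fraction of the minority observations get one synthetic point
    n_new <- floor(n_min * perc.over / 100)
  } else {
    # Every minority observation gets floor(perc.over / 100) synthetic points
    n_new <- n_min * floor(perc.over / 100)
  }
  n_min + n_new
}

sapply(c(25, 100, 200, 290, 600), expected_minority, n_min = 23)
#> [1]  28  46  69  69 161
```

The values match the counts of "rare" returned by the SMOTE() calls above, including the identical result for 200 and 290.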
Now we bring back perc.under. perc.under is a percentage of the number of observations that were created for the minority class. In the following example, 2 * 23 = 46 observations were added to the minority class "rare", so perc.under = 100 means that we keep 100% of that number, i.e. 46 observations, from the majority class.
SMOTE(Species ~ ., new_iris, perc.over = 200, perc.under = 100) %>%
count(Species)
#> Species n
#> 1 common 46
#> 2 rare 69
perc.under isn't rounded to a multiple of 100, so you can use any amount you want here. You just have to remember that it is a percentage. With perc.under = 180 we get floor(46 * 1.8) = 82.
SMOTE(Species ~ ., new_iris, perc.over = 200, perc.under = 180) %>%
count(Species)
#> Species n
#> 1 common 82
#> 2 rare 69
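This means that the number of majority cases is determined by both arguments. For perc.over of 100 or more, the formula is floor(number of minority * floor(perc.over/100) * perc.under/100). Note that the result can exceed the 100 "common" observations in the data, as it does below, so the majority class must be sampled with replacement.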
SMOTE(Species ~ ., new_iris, perc.over = 400, perc.under = 180) %>%
count(Species)
#> Species n
#> 1 common 165
#> 2 rare 115
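As a quick check, the formula can be written as a small base R function and compared with the three perc.under examples above. This is a sketch of the arithmetic only; expected_counts() is a made-up name, and it assumes perc.over is at least 100.

```r
# Hypothetical helper: predicted class sizes after SMOTE(),
# using the formula above (valid here for perc.over >= 100 only).
expected_counts <- function(n_min, perc.over, perc.under) {
  stopifnot(perc.over >= 100)
  # Synthetic minority observations that get created
  n_new <- n_min * floor(perc.over / 100)
  c(common = floor(n_new * perc.under / 100),  # majority cases kept
    rare   = n_min + n_new)                    # original + synthetic
}

expected_counts(23, perc.over = 200, perc.under = 100)  # common 46,  rare 69
expected_counts(23, perc.over = 200, perc.under = 180)  # common 82,  rare 69
expected_counts(23, perc.over = 400, perc.under = 180)  # common 165, rare 115
```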
Thanks. You were very clear and explanatory!