Hey guys, I need your help with a university project. The main task is to analyze the effects of over-/under-sampling on an imbalanced dataset. But before we can even start with that, our task sheet says that we 1) have to find/create imbalanced datasets and 2) fit those with a binary classification model like CART. So my questions would be: where do I find such imbalanced datasets? How do I fit those datasets with CART, and how does that help with regard to over-/under-sampling?
That's my whole first try so far:
# CART - load the dataset
library(dplyr)    # for mutate()
library(caTools)  # for sample.split()
library(rpart)    # for rpart(), printcp(), prune()
setwd("C:\\Users\\..\\Dropbox\\Uni\\Präsentation\\Datensätze")
add <- "data1.csv"
df <- read.csv(add)
head(df) # first 6 rows
nrow(df) # number of rows in the dataset
# CART - select/convert the relevant columns
df <- mutate(df, x = as.numeric(x), y = as.numeric(y), label = factor(label))
set.seed(123)
sample <- sample.split(df$label, SplitRatio = 0.70) # split on the outcome so train and test keep the same class ratio
train <- subset(df, sample == TRUE)
test <- subset(df, sample == FALSE)
# grow the tree
fit <- rpart(label ~ ., data = train, method = "class") # the outcome (label) goes on the left of the formula, not x
printcp(fit)
plotcp(fit)
summary(fit)
# plot tree
plot(fit, uniform = TRUE, main="Bla Bla Bla")
# text(fit, use.n=TRUE, all=TRUE, cex=.08)
# prune the tree --> to avoid overfitting the data
pfit<- prune(fit, cp= fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
plot(pfit, uniform=TRUE,
main="Pruned Classification Tree for Us")
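Once the tree is fitted, evaluating it on the held-out test set with a confusion matrix (rather than raw accuracy) is what ties CART to the over-/under-sampling question: on imbalanced data, a tree can score high accuracy by mostly predicting the majority class. A self-contained sketch, using a synthetic imbalanced dataset as a stand-in for your data1.csv and assuming `label` is the outcome column:

```r
library(rpart)

# Synthetic stand-in for data1.csv: roughly 90% class "A", 10% class "B".
set.seed(123)
n <- 500
label <- factor(sample(c("A", "B"), n, replace = TRUE, prob = c(0.9, 0.1)))
df <- data.frame(x = rnorm(n, mean = ifelse(label == "B", 2, 0)),
                 y = rnorm(n),
                 label = label)
train <- df[1:350, ]
test  <- df[351:500, ]

# Outcome on the left of the formula (label ~ .), not x.
fit <- rpart(label ~ ., data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix: shows whether the minority class is ever predicted.
conf <- table(actual = test$label, predicted = pred)
print(conf)

# Raw accuracy vs. the "always predict the majority class" baseline.
accuracy <- sum(diag(conf)) / sum(conf)
baseline <- max(table(test$label)) / nrow(test)
cat("accuracy:", accuracy, "baseline:", baseline, "\n")
```

If accuracy barely beats the baseline, the model is leaning on the class imbalance; that is exactly the effect over-/under-sampling the training set is meant to change.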
Why do I need to build such a decision tree, and how does it help with over-/under-sampling?
Since your question is homework-related, we can give you tips but no full answers, as that would take the fun out of it.
For more info, check the policy here:
Tips
Imbalanced datasets have an unequal number of outcomes in the training set (often a reflection of the real-life distribution).
Imagine you would like to predict the colour of a car based on other car characteristics and the driver's personality. If you just took a random sample of all the cars you see on the road, most would be white, gray, or black.
If you do machine learning on an imbalanced dataset, it will lead to poor performance, as the model will most likely choose the over-represented category as the answer (e.g. guess that a car is white instead of yellow, since white is more likely). To overcome this issue, you need to balance the data or put checks in place.
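Balancing itself can be done in a few lines of base R, which is the core of your assignment. A minimal sketch of random under- and over-sampling, assuming a data frame with a factor column named `label` (the column name is an assumption; adapt it to your data):

```r
# Random under-sampling: shrink every class to the minority-class count.
under_sample <- function(df, label_col = "label") {
  n_min <- min(table(df[[label_col]]))
  idx <- unlist(lapply(levels(df[[label_col]]), function(cl) {
    sample(which(df[[label_col]] == cl), n_min)
  }))
  df[idx, ]
}

# Random over-sampling: grow every class to the majority-class count,
# drawing minority rows with replacement.
over_sample <- function(df, label_col = "label") {
  n_max <- max(table(df[[label_col]]))
  idx <- unlist(lapply(levels(df[[label_col]]), function(cl) {
    sample(which(df[[label_col]] == cl), n_max, replace = TRUE)
  }))
  df[idx, ]
}

# Demo on a skewed toy frame (90 "A" rows vs. 10 "B" rows):
set.seed(1)
toy <- data.frame(x = rnorm(100),
                  label = factor(rep(c("A", "B"), c(90, 10))))
table(under_sample(toy)$label)  # 10 of each class
table(over_sample(toy)$label)   # 90 of each class
```

Apply these only to the training set, then refit the tree and compare the confusion matrices; comparing those is the actual analysis your task sheet asks for.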
So if you'd like to create an imbalanced dataset yourself, you can do that by coming up with skewed distributions of whatever outcome (and input, too) you can think of. A great function in R is sample(), which lets you pick from a vector with user-defined probabilities: http://www.rexamples.com/14/Sample()
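For instance, a hedged sketch of that idea: sample() with `prob = c(0.95, 0.05)` makes one class rare, and shifting the feature means by class (an assumption so the model has signal to learn) turns it into a usable classification dataset:

```r
# Build a synthetic imbalanced binary dataset with sample():
# prob = c(0.95, 0.05) makes class "B" the rare outcome.
set.seed(42)
n <- 1000
lab <- sample(c("A", "B"), size = n, replace = TRUE, prob = c(0.95, 0.05))

# Shift the feature means by class so there is signal to learn.
sim <- data.frame(x = rnorm(n, mean = ifelse(lab == "B", 2, 0)),
                  y = rnorm(n, mean = ifelse(lab == "B", -1, 0)),
                  label = factor(lab))

table(sim$label)  # heavily skewed toward "A"
```

From there the workflow in your code above applies unchanged: split, fit rpart, and compare results with and without resampling.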