CART classification for imbalanced datasets with R

pieterjanvc · March 23, 2020, 3:52pm

Hi,

Welcome to the RStudio community!

Since your question is homework-related, we can give you tips but no answers, as that would take the fun out of it.

For more info, check the policy here:

Tips

Imbalanced datasets have an unequal number of observations per outcome class in the training set (often a reflection of the real-life distribution).

Imagine you would like to predict the colour of a car based on other car characteristics and the driver's personality. If you just took a random sample of all the cars you see on the road, most would be white, gray or black.

If you do machine learning on an imbalanced dataset, it will often perform badly on the rare categories, as the model will most likely choose the more heavily represented category as the answer (e.g. guess the car is white instead of yellow since that's more likely). The overall accuracy can still look good, because the common categories make up most of the data. To overcome this issue, you need to balance the dataset or put in checks to detect the problem.
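To make that concrete, here is a small sketch of what "always guess the most common category" looks like. The 950/50 split and the variable names are made up for illustration; this is not a CART model, just a baseline to compare one against.

```r
# Made-up, heavily imbalanced outcome: 950 white cars, 50 yellow cars
outcome <- c(rep("white", 950), rep("yellow", 50))

# A "model" that always predicts the most frequent category
majority <- names(which.max(table(outcome)))
baseline_pred <- rep(majority, length(outcome))

# Overall accuracy looks great ...
accuracy <- mean(baseline_pred == outcome)

# ... but not a single yellow car is ever identified
recall_yellow <- mean(baseline_pred[outcome == "yellow"] == "yellow")

print(accuracy)       # 0.95
print(recall_yellow)  # 0
```

So a per-category check (such as the table() of predicted vs. actual outcomes) tells you much more than overall accuracy alone.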

So if you would like to create an imbalanced dataset yourself, you can do that by coming up with skewed distributions for whatever outcome (and inputs too) you can think of. A great function for this in R is sample(), which lets you draw from a vector with user-defined probabilities through its prob argument.
http://www.rexamples.com/14/Sample()
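As a rough sketch of that idea, the snippet below draws car colours with unequal probabilities. The colours, the probabilities, the sample size and the seed are all invented for the example, so pick your own.

```r
set.seed(42)  # arbitrary seed, only so the draw is reproducible

# Hypothetical categories and skewed probabilities (they sum to 1)
colours <- c("white", "gray", "black", "red", "yellow")
probs   <- c(0.35, 0.30, 0.25, 0.07, 0.03)

# Draw 1000 outcomes with replacement, weighted by probs
car_colour <- sample(colours, size = 1000, replace = TRUE, prob = probs)

# Inspect the resulting class counts and proportions
print(table(car_colour))
print(prop.table(table(car_colour)))
```

Note that replace = TRUE is needed here, since you are drawing more values than there are elements in the vector.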

Hope this helps already,

PJ