Handling Class Imbalance for Large Dataset

iFeanyi · March 23, 2022, 10:40am

Please, what is the best way to handle the class imbalance of a large dataset? I have a dataset of over 300k rows, whose target variable has imbalanced classes. I have tried using ROSE to balance out the training dataset, after an 80/20 split, but it keeps returning an empty table of classes. This is my code:

library(ROSE)
library(DMwR)
library(caret)

ind <- createDataPartition(heart_df$HeartDisease,p = 0.8,list = F)
train_heart <- heart_df[ind,]
test_heart <- heart_df[-ind,]
nrow(train_heart)
nrow(test_heart)

set.seed(111)
trainUp <- ROSE(HeartDisease ~.,data = train_heart)$heart_df
table(trainUp$HeartDisease)

Here is a screenshot of the data:

[screenshot: heart]

There are more "No" than "Yes", and so I want to balance out the training data. But the table(trainUp$HeartDisease) code returns the following output in my console: < table of extent 0 > instead of the adjusted classes. Please, I will appreciate your help, thank you.

nirgrahamuk · March 23, 2022, 10:57am

Hello, this is not quite a reprex, as it seems to rely both on unshared data (heart_df) and functions not declared by the listed library calls (createDataPartition). Could you review these elements?

iFeanyi · March 23, 2022, 11:25am

I didn't find a way to upload the .csv file, so I shared a screenshot.

nirgrahamuk · March 23, 2022, 11:41am

I'm sure you shared this image with the best intentions, but perhaps you didn't realise what it implies.
If someone wished to use example data to test code against, they would have to type it out from your screenshot...

This is very unlikely to happen, and so it reduces the likelihood you will receive the help you desire.
Therefore, please see this guide on how to reprex data. The key is to use either datapasta or dput() to share your data as code.

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items:

A minimal dataset, necessary to reproduce the issue
The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages.

Let's quickly go over each one of these with examples:

Minimal Dataset (Sample Data)

You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.…
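As a rough sketch of the dput() route mentioned above: the data frame below is a made-up stand-in, since the real heart_df was never shared, and the column names BMI and Smoking are assumptions rather than anything confirmed in this thread. It only illustrates how dput() turns a small sample into code that can be pasted into a post.

```r
# Hypothetical sample standing in for a few rows of heart_df
# (column names other than HeartDisease are assumptions)
heart_sample <- data.frame(
  HeartDisease = factor(c("No", "No", "Yes", "No")),
  BMI          = c(16.6, 20.3, 26.6, 24.2),
  Smoking      = factor(c("Yes", "No", "Yes", "No"))
)

# dput() prints R code that rebuilds the object; paste that output into the post
dput(heart_sample)

# Round trip: capture the printed code as text, evaluate it, and compare
code    <- paste(capture.output(dput(heart_sample)), collapse = "\n")
rebuilt <- eval(parse(text = code))
all.equal(heart_sample, rebuilt)
```

On the real data, something like dput(head(heart_df, 20)) would usually be enough, provided the sample contains both classes of the target.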

iFeanyi · March 23, 2022, 2:09pm

If you don't mind, can I email you a sample of the data? Even with datapasta, the table is not looking nice here.

nirgrahamuk · March 23, 2022, 2:40pm

I'm going to make an educated guess that

 ROSE(HeartDisease ~.,data = train_heart)

runs OK and shows you output.

My guess is that you are accessing $heart_df from the result, and no element with that name exists there.

Rather, I'd expect

trainUp <- ROSE(HeartDisease ~.,data = train_heart)$data

to pull out the relevant content
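To illustrate the guess above without the original data, here is a minimal sketch on synthetic data. The toy_train data frame and its BMI and SleepTime columns are invented for the example, and it assumes the ROSE package is installed. It shows how to inspect what ROSE() returns, and why asking for a component that does not exist leads to an empty table.

```r
library(ROSE)

# Hypothetical stand-in for train_heart: 1,000 rows, roughly 10% "Yes"
set.seed(111)
n <- 1000
toy_train <- data.frame(
  HeartDisease = factor(sample(c("No", "Yes"), n, replace = TRUE,
                               prob = c(0.9, 0.1))),
  BMI          = rnorm(n, mean = 28, sd = 6),
  SleepTime    = rnorm(n, mean = 7, sd = 1.5)
)
table(toy_train$HeartDisease)   # imbalanced class counts

# Keep the whole returned object first instead of indexing it straight away
rose_fit <- ROSE(HeartDisease ~ ., data = toy_train, seed = 111)

# List the components of the returned object
names(rose_fit)

# A component name that is not in the list gives NULL,
# and table(NULL) prints "< table of extent 0 >"
is.null(rose_fit$heart_df)
table(rose_fit$heart_df)

# The resampled rows are in the data component
train_up <- rose_fit$data
table(train_up$HeartDisease)
```

Checking names() on an unfamiliar return value before indexing into it is a quick way to catch this kind of slip, because $ on a list returns NULL silently instead of raising an error.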
