Hi,
I have a dataset whose class distribution is heavily imbalanced. There are 143,759 instances in total, with the following class distribution:
Class 1 - 77624 (54%)
Class 2 - 16254 (11%)
Class 3 - 30181 (21%)
Class 4 - 15070 (10%)
Class 5 - 4630 (3%)
The percentages show each class's share of the total.
I want to write a function that takes the whole data set, divides it into 3 subsets using 'Class' as the dividing feature, and then uses ROSE to balance each subset to my desired distribution. The subsets would then be compiled back into one reasonably balanced data set. Each subset contains Class 1 plus one of the minority classes (Class 2, 4, or 5). Class 3 can be left out because it is not under-represented.
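To make the subsetting step concrete, here is a minimal sketch in base R. It assumes the data is a data frame with a column named `Class` holding the values 1 to 5; the helper name `make_subsets`, its arguments, and the toy data are all hypothetical and only stand in for the real data set.

```r
# Hypothetical helper: build one two-class subset per minority class.
# Assumes `df` has a column (default name "Class") with values 1..5.
make_subsets <- function(df, majority = 1, minorities = c(2, 4, 5),
                         class_col = "Class") {
  subsets <- lapply(minorities, function(m) {
    # Keep only the majority class and the current minority class.
    sub <- df[df[[class_col]] %in% c(majority, m), , drop = FALSE]
    # Rebuild the factor so only these two classes remain as levels.
    sub[[class_col]] <- factor(sub[[class_col]], levels = c(majority, m))
    sub
  })
  names(subsets) <- paste0("class", majority, "_vs_", minorities)
  subsets
}

# Toy data standing in for the real data set (not the actual data).
set.seed(1)
toy <- data.frame(
  Class = sample(1:5, 1000, replace = TRUE,
                 prob = c(0.54, 0.11, 0.21, 0.10, 0.04)),
  x1 = rnorm(1000),
  x2 = rnorm(1000)
)

subsets <- make_subsets(toy)
# Class counts per subset, to check that each one holds exactly two classes.
lapply(subsets, function(s) table(s$Class))
```

Rebuilding the factor inside each subset matters because a two-class resampler is likely to reject a response that still carries unused levels from the other classes.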
Each subset needs to be passed to ROSE to over-sample the minority class, and I also need the ROSE code for doing so. With p=0.5, the call should return a data set in which the minority class makes up about 50% of the rows.
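For a single subset, the call could look roughly like the sketch below. It assumes the ROSE package is installed and that the response is a two-level factor; the toy data frame `sub` and its predictors `x1` and `x2` are made up for illustration. As far as I understand the package, `ROSE()` generates a fully synthetic sample of `N` rows (by default the same size as the input) and assigns each row to the minority class with probability `p`, so the resulting share is close to 50% rather than exactly 50%.

```r
library(ROSE)

# Toy two-class subset standing in for "Class 1 plus one minority class".
set.seed(42)
sub <- data.frame(
  Class = factor(c(rep(1, 900), rep(5, 100))),
  x1 = c(rnorm(900, 0), rnorm(100, 2)),
  x2 = c(rnorm(900, 0), rnorm(100, -1))
)

# p = 0.5 asks for the minority class to make up about half of the result.
# The synthetic data frame is returned in the `data` element of the result.
balanced <- ROSE(Class ~ ., data = sub, p = 0.5, seed = 1)$data

# Class shares after resampling.
prop.table(table(balanced$Class))
```

If the original Class 1 rows should be kept as they are and only the minority rows duplicated, `ovun.sample()` from the same package with `method = "over"` may be closer to that than `ROSE()`, which replaces both classes with synthetic rows.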
So essentially, I need to create the subsets, apply ROSE to each one to get synthetic but balanced samples, and compile the results into a new data set.
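Put together, the whole workflow might be sketched as follows. This is only an outline under the same assumptions as before (a `Class` column with values 1 to 5, numeric or categorical predictors, ROSE installed); `balance_with_rose` and the toy data are hypothetical names and values, not an established recipe.

```r
library(ROSE)

# Hypothetical end-to-end helper; the name and arguments are illustrative.
balance_with_rose <- function(df, majority = 1, minorities = c(2, 4, 5),
                              p = 0.5, seed = 1) {
  parts <- lapply(minorities, function(m) {
    # Subset: majority class plus one minority class.
    sub <- df[df$Class %in% c(majority, m), , drop = FALSE]
    sub$Class <- factor(sub$Class, levels = c(majority, m))  # exactly 2 levels
    # Resample this subset so the minority share is about p.
    out <- ROSE(Class ~ ., data = sub, p = p, seed = seed)$data
    out$Class <- as.character(out$Class)  # plain labels, easy to stack
    out
  })
  # Compile the resampled subsets back into one data set.
  combined <- do.call(rbind, parts)
  combined$Class <- factor(combined$Class)
  combined
}

# Toy data standing in for the real data set (not the actual data).
set.seed(7)
toy <- data.frame(
  Class = sample(1:5, 2000, replace = TRUE,
                 prob = c(0.54, 0.11, 0.21, 0.10, 0.04)),
  x1 = rnorm(2000),
  x2 = rnorm(2000)
)

new_data <- balance_with_rose(toy)
table(new_data$Class)
```

Two things to keep in mind with this layout: Class 1 appears in all three subsets, so it still makes up roughly half of the combined data, and Class 3 is dropped entirely here, so its original rows would have to be appended separately if they should be part of the final data set.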
Any help to accomplish this will be appreciated.
Thanks,