Can I "subtract" the sub dataset from a certain dataset?

ang · July 18, 2022, 4:01am

I built a dataset for statistical analysis and want to split it into two subsets.
One is for calibrating (training) the model and the other is for validation (testing).
Using a sampling function, I drew a sample to use as the calibration set. The problem is that I don't know how to get the second subset from the original dataset just by subtracting the calibration set from it.
My rough idea is to use a primary key value and do something like this:
original dataset - calibration dataset => validation dataset
But I haven't figured out how to do that yet.
I think I could build the dataset in several steps, but I'd rather not write many lines of code. I'd like something simple and intuitive (or, better, a function that fits this purpose exactly). I would really appreciate your help.

rwalker · July 18, 2022, 4:18am

I would suggest using dplyr: filter plus an anti_join.

Start with the total data. Filter by whatever determines training versus test,
then anti-join the training data against the total data, which leaves the test data.

To do it randomly, you can use something like:

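library(dplyr)
# Toy data: 5 rows, two columns
total.data <- data.frame(x = 1:5, y = 6:10)
# Flag each row at random: 0 = training (about 70%), 1 = test (about 30%)
total.data <- mutate(total.data, Pick.Me = sample(c(0, 1), size = nrow(total.data), replace = TRUE, prob = c(0.7, 0.3)))
training.data <- filter(total.data, Pick.Me == 0)
# Keep the rows of total.data that have no match in training.data
test.data <- anti_join(total.data, training.data)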
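
A variation on the snippet above, as a minimal sketch: if the data has a unique key column (here a hypothetical `id`, which is not in the example above), passing it to `by` makes the match explicit, so rows that merely share the same values in other columns are not dropped by accident. The fixed-size draw with `slice_sample` is also just an assumption for illustration.

```r
library(dplyr)

set.seed(1)  # only to make the sketch reproducible

# Hypothetical data with a unique key column `id`
total.data <- data.frame(id = 1:10, x = 11:20)

# Draw 7 of the 10 rows (without replacement) as the training set
training.data <- slice_sample(total.data, n = 7)

# Rows of total.data whose id does not appear in training.data
test.data <- anti_join(total.data, training.data, by = "id")

# Sanity check: the two sets are disjoint and together cover every row
stopifnot(nrow(training.data) + nrow(test.data) == nrow(total.data))
stopifnot(length(intersect(training.data$id, test.data$id)) == 0)
```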

ang · July 18, 2022, 4:46am

Thank you very much. But I can't use this exact code, because I already extracted a training dataset with the "strata" function: total.data can be divided into categories, and I wanted to avoid sampling bias.
In this situation, how can I apply your code or combine it with mine?

This is my code for stratified random sampling:

library(sampling)  # provides strata() and getdata()
training.data <- strata(data = total.data, stratanames = "exposure", size = c(100, 100), method = "srswr")
training.data <- getdata(total.data, training.data)

"setdiff" seems close to what I want, since it computes a set difference, but I guess it works on vectors, not data frames...

rwalker · July 18, 2022, 5:13am

If training.data is what you say it is, then the last line that I wrote above using anti_join is what is needed.
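
One caveat worth sketching: with method = "srswr" the sample is drawn with replacement, so the training set can contain the same row more than once, and the training and validation row counts will not necessarily add up to the total. What matters is which original rows were picked. Below is a minimal base R sketch of the same "subtraction". It assumes a hypothetical unique `id` column and uses a hand-rolled stratified draw as a stand-in for strata(), so it is an illustration rather than the exact code from this thread.

```r
set.seed(42)  # only to make the sketch reproducible

# Hypothetical data: unique key `id`, stratification variable `exposure`
total.data <- data.frame(
  id = 1:20,
  exposure = rep(c("high", "low"), each = 10),
  y = rnorm(20)
)

# Stand-in for strata(): draw 5 row numbers per stratum, with replacement
rows.by.stratum <- split(seq_len(nrow(total.data)), total.data$exposure)
picked <- unlist(lapply(rows.by.stratum, function(rows) {
  rows[sample.int(length(rows), size = 5, replace = TRUE)]
}))
training.data <- total.data[picked, ]

# "Subtract": keep every row whose id was never picked
validation.data <- total.data[!(total.data$id %in% training.data$id), ]
```

With dplyr loaded, anti_join(total.data, training.data, by = "id") should select the same rows. The setdiff idea also works on the key itself: setdiff(total.data$id, training.data$id) returns the ids of the validation rows as a vector.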

ang · July 18, 2022, 5:16am

I'll try with it.
Thank you again.
