I built a dataset for statistical analysis and want to split it into two subsets:
one for calibrating (training) the model and one for validation (testing).
I used a sampling function to draw the calibration set. The problem is that I don't know how to obtain the validation set by subtracting the calibration set from the original dataset.
My rough idea is to use a primary key column and do something like this:
original dataset - calibration dataset => validation dataset
but I haven't figured out how to do that yet.
I could probably build it in several steps, but I would rather not write many lines of code. I'm looking for something simple and intuitive (or, better, a function made exactly for this purpose). I would really appreciate any help.
I would suggest using dplyr: filter() together with anti_join().
Start with the total data and filter by whatever determines training versus test. Then anti-join the training data against the total data, which leaves the test data.
To do it randomly, you can use something like
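library(dplyr)
total.data <- data.frame(x = 1:5, y = 6:10)
# Flag each row at random: 0 = training (about 70%), 1 = test (about 30%)
total.data <- mutate(total.data, Pick.Me = sample(c(0, 1), size = nrow(total.data), replace = TRUE, prob = c(0.7, 0.3)))
training.data <- filter(total.data, Pick.Me == 0)
# Rows of total.data with no match in training.data form the test set
test.data <- anti_join(total.data, training.data)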
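When anti_join() is called without `by`, it matches on every column the two data frames share, so an unsampled row that happens to be identical to a sampled one is dropped as well. A minimal sketch of joining on an explicit key instead, assuming the data has a unique `id` column (hypothetical here, as are the toy values):

```r
library(dplyr)

set.seed(42)  # fixed seed so the split is reproducible

# Hypothetical data: `id` is an assumed primary key;
# rows 1 and 2 are identical apart from their id
total.data <- data.frame(
  id = 1:6,
  x  = c(1, 1, 2, 3, 4, 5),
  y  = c(6, 6, 7, 8, 9, 10)
)

# Draw three rows at random as the training set
training.data <- slice_sample(total.data, n = 3)

# Keep only the rows whose id does not appear in the training set
test.data <- anti_join(total.data, training.data, by = "id")
```

Because the match is on `id` only, extra columns in training.data do not affect the result.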
Thank you very much. However, I can't use this exact code, because I have already extracted the training set with the "strata" function. The data (total.data) can be grouped into categories, and I wanted to avoid sampling bias.
In this situation, how can I combine your code with mine?
This is my code for stratified random sampling:
training.data <- strata(data = total.data, stratanames = "exposure", size = c(100, 100), method = "srswr")
training.data <- getdata(total.data, training.data)
"setdiff" seems close to what I want, since it computes a complement, but I think it works on vectors, not data frames.
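For what it's worth, base R's setdiff() can still do the job if it is applied to a key vector rather than to the data frame itself. A minimal sketch in base R only, assuming a unique `id` column (the column and the toy data are made up for illustration, and a plain random sample stands in for the real training set):

```r
set.seed(1)

# Hypothetical data with an assumed primary key `id`
total.data <- data.frame(
  id       = 101:110,
  exposure = rep(c("high", "low"), each = 5),
  y        = round(rnorm(10), 2)
)

# Stand-in training set: six rows drawn at random
training.data <- total.data[sample(nrow(total.data), 6), ]

# setdiff() on the key vectors gives the ids that were not sampled
validation.ids <- setdiff(total.data$id, training.data$id)

# Use those ids to pull the remaining rows out of the original data
validation.data <- total.data[total.data$id %in% validation.ids, ]
```

The last two steps can be collapsed into `total.data[!total.data$id %in% training.data$id, ]`.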
If training.data is what you say it is, then the last line that I wrote above using anti_join is what is needed.
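If the data has no key column to join on, another option is the `ID_unit` column that strata() returns, which (as far as I understand the sampling package) holds the row positions of the selected units. A rough sketch under that assumption, with simulated data standing in for the real total.data:

```r
library(sampling)

set.seed(1)

# Simulated stand-in for total.data (column names are assumptions)
total.data <- data.frame(
  exposure = rep(c("high", "low"), each = 200),
  y        = rnorm(400)
)

# strata() expects the data sorted by the stratification variable
total.data <- total.data[order(total.data$exposure), ]

# Draw 100 units per stratum, with replacement, as in the code above
s <- strata(data = total.data, stratanames = "exposure",
            size = c(100, 100), method = "srswr")
training.data <- getdata(total.data, s)

# Drop every sampled row position; what remains is the validation set
validation.data <- total.data[-unique(s$ID_unit), ]
```

Since "srswr" samples with replacement, the same unit can be drawn more than once, so the validation set will usually hold more than 200 rows here.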
I'll try it.
Thank you again.