I am working with tidymodels at the moment, and I want to update both the training and test sets in my resamples. I have a function that does some calculations based on profile data, and I'm trying to avoid data leakage.
I pass the type of resample into the function, which determines which calculation is applied to which portion of the resample.
I know we can access the analysis or assessment portions with the accessor functions analysis() and assessment(). Is there a method to update the analysis and assessment portions of the resamples?
A rough idea of what I'm trying to achieve is below:
update_resamples <- function(splits, data_set){
  # If the dataset is the train resample, calculate the trend from the data
  if (data_set == 'train') {
    analysis_splits <- analysis(splits)
    enhanced_data <- calculate_trend_foo(analysis_splits)
    analysis(splits) <- enhanced_data   # pseudocode: a replacement function like this is what I'm after
  } else {
    assess_splits <- assessment(splits)
    # Apply the trend calculated from the train resample in the first map() pass
    enhanced_data <- apply_trend_foo(assess_splits)
    assessment(splits) <- enhanced_data # pseudocode, as above
  }
  splits
}
# cv_slices is a fictitious set of resamples, where the splits column holds the rsplit objects
cv_slices %>%
  mutate(splits = map(splits, update_resamples, 'train')) %>% # generates the trend and applies it to the training set
  mutate(splits = map(splits, update_resamples, 'test'))      # applies the trend from the previous step to the test set of the resamples
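For what it's worth, the closest working version I can sketch relies on rsample's make_splits(), which as I understand it can rebuild an rsplit from analysis and assessment data frames; calculate_trend_foo() and apply_trend_foo() remain hypothetical placeholders, and I've given apply_trend_foo() a second argument for the precomputed trend:

library(rsample)
library(dplyr)
library(purrr)

update_resamples2 <- function(split){
  train <- analysis(split)
  test <- assessment(split)
  trend <- calculate_trend_foo(train)  # trend is estimated from the analysis data only
  make_splits(
    apply_trend_foo(train, trend),     # the same trend is attached to both portions,
    apply_trend_foo(test, trend)       # so the assessment data never informs it
  )
}

cv_slices %>% mutate(splits = map(splits, update_resamples2))

This would also collapse the two map() passes into one, since the trend only has to be computed once per split.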
I hope you can help. I have been scratching my head all morning.
I probably wasn't very clear on why I wanted to update it (new infant in the house).
I can give a proxy example and maybe it will make things clearer, and you can let me know if it's a terrible idea or if I have confused the issue further.
I have a fictitious dataset with thousands of sensors that send out requests to engineers to manually check them for faults. These status messages go back years and vary considerably. In the vast majority of cases it's a false alarm (roughly a 70/30 split), so in essence this is a binary classification task.
I have a laptop which couldn't possibly tokenize all the status messages going back years to predict the outcome, but it can tokenize maybe 5 months' worth of data. My plan with the above question was to separate the dataset into a profile dataset plus a training set and a test set as part of a resampling scheme.
The profile dataset would simply be a row for each sensor with its rate of true positives per phone-home.
More concretely, if I have a dataset for 2020, I would take the first six months and assign this rate to each sensor. I guess you could call it an informed confidence in the sensor.
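A minimal sketch of that profile step, assuming hypothetical columns sensor_id, msg_date and a binary outcome (1 = genuine fault):

library(dplyr)

sensor_profile <- sensor_data %>%
  filter(msg_date < as.Date("2020-07-01")) %>%              # January to June only
  group_by(sensor_id) %>%
  summarise(confidence = mean(outcome), .groups = "drop")   # per-sensor true-positive rate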
The second part would be to take all the other information for July to December and use something like sliding_period to partition the data. Let's say, for argument's sake, each resample uses one month of training data and two weeks in the test set (see the sketch after this paragraph). Within each partition, on the training side I can tokenize the messages as well as model other non-text data. I can also enhance the first resample with my first-six-months confidence in each sensor; the time ranges for the profile data and this second part are separate. I would also allow the test set access to the same sensor confidence the train set has. My thinking here is that since I am using Jan-June to generate my confidence, that confidence is a feature in the training set for July (training set) as well as in the first two weeks of August (test set), so I am not aware of any bleed-over.
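A sketch of that partitioning, assuming the msg_date column again; a two-week test window doesn't line up exactly with monthly periods, so this approximates it with weekly periods, four weeks of analysis followed by two weeks of assessment:

library(rsample)

cv_slices <- sliding_period(
  operational_data,   # the July-to-December data
  index = msg_date,
  period = "week",
  lookback = 3,       # the current week plus the 3 prior, roughly one month of training
  assess_start = 1,
  assess_stop = 2     # the following two weeks form the test set
)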
For the second resample, the test set from the first resample is now being used in the train portion of slice 2, so July and August are now in the analysis resample of slice 2. However, I want to recalculate my confidence in the sensors: it is now calculated from the January-to-July data and added as a feature to the train set, and all other records pertaining to July (the normal tokenized data) are removed. Slice 2 then trains on the profile data for January to July plus the tokenized data for August, and the test set is the first two weeks of September (which also has access to the January-to-July profile data).
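Put together, the per-slice update might look like this sketch (same hypothetical columns as above): the confidence is recomputed from everything before each slice's training window and joined onto both portions, so the expanding profile never overlaps the tokenized data.

library(rsample)
library(dplyr)
library(purrr)

add_confidence <- function(split, history){
  train <- analysis(split)
  test <- assessment(split)
  cutoff <- min(train$msg_date)                             # start of this slice's training window
  profile <- history %>%
    filter(msg_date < cutoff) %>%                           # everything before the training window
    group_by(sensor_id) %>%
    summarise(confidence = mean(outcome), .groups = "drop")
  make_splits(
    left_join(train, profile, by = "sensor_id"),            # both portions get the same
    left_join(test, profile, by = "sensor_id")              # pre-window confidence feature
  )
}

cv_slices %>% mutate(splits = map(splits, add_confidence, history = sensor_data))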
It's possible this is not a particularly useful feature, and if you think it will leak information and give overly optimistic results, it might be better to discard the profile data completely and work with just the normal 5 months of operational data, without the sensor confidence.