step_impute_knn() - how to train and then apply for scaling up?

While I'm not doing formal machine learning but instead something more descriptive, I am having issues scaling up step_impute_knn().

For now, I am using:

f_wide_form_knn_imputation <-
    function(wide_form,
             nthread = parallelly::availableCores(omit = 1),
             ...) {
        impute_rec_bps <-
            recipe(~ ., data = wide_form) %>%
            step_impute_knn(
                all_predictors(),
                options = list(nthread = nthread)
            )
        wide_form_imputed <-
            prep(impute_rec_bps) %>% juice()
        wide_form_imputed
    }

And given my data set (I'm having to fill in NAs in a couple of hundred columns), the imputation takes about 8 hours on our hardware. We're about to scale up our data set about 30x, and I can't figure out how to train step_impute_knn() first and then apply it to the larger data set once trained ... there is something I'm just not understanding in the recipes documentation.

I have tried prep() %>% bake(new_data = NULL) and prep() %>% bake(new_data = head(wide_form)), and the NAs are not filled in like they are with prep() %>% juice().
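For what it's worth, the train-once, apply-later pattern in recipes is to prep() the recipe on a training set and then bake() the trained recipe on new data. A minimal sketch, assuming wide_form_sample (a smaller training subset) and wide_form_full are placeholders for your own data:

```r
library(recipes)

# Train the imputation once on the smaller data set.
# prep() estimates everything the step needs (here, the
# training rows that serve as the neighbor pool).
trained_rec <-
    recipe(~ ., data = wide_form_sample) %>%
    step_impute_knn(all_predictors()) %>%
    prep(training = wide_form_sample)

# Apply the already-trained recipe to the larger data set
# without re-estimating anything.
wide_form_imputed <- bake(trained_rec, new_data = wide_form_full)
```

Note that bake(new_data = NULL) returns the processed training set (equivalent to juice()), so passing the new, larger data via new_data is what applies the trained step to unseen rows.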

How big are your data (in terms of rows)? At some point searching for nearest neighbors is not efficient.

We're scaling from about 8k rows to about 300k. I'm going to try step_impute_bag() and see how it does ... but all of the other imputations cause issues ... step_impute_linear() doesn't work due to sparseness, and the others wreak havoc with the distributions.

I think that you have reached "at some point". Depending on how many variables are being imputed, step_impute_bag() will also take a while.

I am starting to realize that :) ... if you have any ideas or directions to punt me in, that would be great.

With that much data, the model variance of an individual unpruned tree should be pretty low. Maybe try using 10ish trees when bagging.
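In recipes, the number of bagged trees is controlled by the trees argument of step_impute_bag() (default 25). A sketch of that suggestion, with wide_form standing in for your data:

```r
library(recipes)

impute_rec <-
    recipe(~ ., data = wide_form) %>%
    # fewer trees per bagged model keeps fitting tractable
    # on large data; with ~300k rows each tree's variance
    # should already be low
    step_impute_bag(all_predictors(), trees = 10)

wide_form_imputed <-
    prep(impute_rec) %>%
    bake(new_data = NULL)
```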
