Hopefully, some users have encountered this before... Or @Max has some advice...

I have a fairly simple goal. I am looking to predict a numeric value based on 3 numeric variables: coordinates (lat, lon) and day of the year (1:365). Simple enough, and caret's knnreg() is a perfect solution for my needs. It performs great (on the tiny chunks that I feed it) and logically makes sense for the task (I'd do just that manually if I had a tiny dataset: find the closest neighbors and average their values).

One problem is that I was never able to run it in full. knnreg executes, but predict() can't handle the amount of data.

My full dataset is 1.9M rows.

52K data points are missing and require prediction (the final goal).

A 25% test set would be about 480K rows.

The fitted knnreg object is large (5 elements, 198.7 Mb).

I can run it only on tiny sets of up to 5,000 rows.

So, I'm stalling on this first step. And that's before I even get to cross-validation, finding a proper k, and, on top of that, predicting at least 4 more variables based on the same 3 predictors.

Is data size my problem? Should I pick a different algorithm for the job?
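For reference, a minimal sketch of what I'm running (synthetic stand-in data here; the real table has the dimensions listed above):

```r
# Minimal sketch of the workflow that fails at scale.
# Assumes the caret package; the data below is a synthetic stand-in
# for the real lat/lon/day-of-year table.
library(caret)

set.seed(1)
n <- 5000  # roughly the largest size that runs for me
train_df <- data.frame(
  lat  = runif(n, 40, 50),
  lon  = runif(n, -90, -80),
  yday = sample(1:365, n, replace = TRUE)
)
train_df$temp <- 10 + 0.5 * train_df$lat + rnorm(n)

# Fit the k-nearest-neighbors regression and predict on a small chunk
fit   <- knnreg(temp ~ lat + lon + yday, data = train_df, k = 5)
preds <- predict(fit, newdata = train_df[1:10, ])  # fine on small newdata
```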

If there is one algorithm that I wouldn't use in this case, it is knn.

You might try the kknn package (via caret or directly), but the storage problem will remain.

There's also LVQ, which you can get through caret, but only for classification. There's no apparent reason (that I could see) that it wouldn't be good for regression, but you'd need to re-implement it based on class::lvq1.

Also, I think that you can just set aside some percentage of data to use as a single holdout for tuning.
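That can be as simple as a random index split in base R (a sketch; `full_df` below is a small stand-in for the real 1.9M-row table):

```r
# Single holdout split for tuning: sample row indices once,
# then subset the data frame by inclusion/exclusion.
set.seed(123)
full_df <- data.frame(x = rnorm(1000), y = rnorm(1000))  # stand-in data

holdout_frac <- 0.1
idx <- sample(nrow(full_df), size = floor(holdout_frac * nrow(full_df)))

holdout_df  <- full_df[idx, ]   # used only to evaluate candidate k values
training_df <- full_df[-idx, ]  # used to fit each candidate model
```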

Thank you. I figured knn might be the wrong tool for the job (given the amount of data), but I don't do enough predictive modeling to really know which way is better. Logically, knnreg makes a lot of sense: I have weather readings for various coordinates at different times, and I need to fill in holes in the data; what's better than taking the neighbors' data (neighbors geographically and/or time-wise)?

BTW, I was following your rstudio::conf workshop process to tackle the problem. The first stop was a simple lm(), which gave terrible performance (e.g. R^2 under 0.3). That was followed by knnreg(), which was always over 0.85 R-squared on the tiny samples I was able to run, which of course got me excited.

Hi @Max, while I sort of got you here... Is there any easy way to run predictions for several outcomes in a data frame? I.e. there are several predictors and several outcomes, each outcome is predicted on the same set of predictors.

In recipes, I can assign outcome and predictor roles accordingly, but when it comes to the actual model, I've never seen any syntax that would allow handling multiple outcomes. Am I doomed to writing a separate function for this, or is there a faster solution?

Example: if we have lat, lon, date, temp_mean, temp_max, temp_min, and want to predict temperatures based on coordinates and day.

To get the data in/out of recipes, it's pretty easy:

library(recipes)
rec <- recipe(Sepal.Width + Sepal.Length ~ Petal.Width + Petal.Length, data = iris) %>%
step_center(all_outcomes()) %>%
prep(training = iris, retain = TRUE)
outcomes <- juice(rec, all_outcomes()) # optionally use `composition` for data type

The modeling bit depends on the model. For a lot of models that use the formula method, it would be something like:

lm(cbind(Sepal.Width, Sepal.Length) ~ Petal.Width + Petal.Length, data = iris)

but for others it is foo(x, y) so it depends. Getting predictions is equally complex and package dependent.
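For the lm() case above, for example, predict() on the multivariate fit returns a matrix with one column per outcome:

```r
# Multivariate lm: cbind() on the left-hand side fits one model per
# outcome sharing the same predictors; predict() gives a matrix back.
fit <- lm(cbind(Sepal.Width, Sepal.Length) ~ Petal.Width + Petal.Length,
          data = iris)

preds <- predict(fit, newdata = head(iris))
dim(preds)  # 6 rows, 2 columns (one per outcome)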

If the model only takes one outcome at a time, then you are doing some sort of iteration over the columns.
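A sketch of that iteration, using knnreg() and the iris columns from the earlier example (reformulate() builds one formula per outcome):

```r
# Fit one single-outcome model per column, sharing the same predictors.
# Assumes the caret package; column names follow the iris example above.
library(caret)

outcomes   <- c("Sepal.Width", "Sepal.Length")
predictors <- c("Petal.Width", "Petal.Length")

fits <- lapply(outcomes, function(y) {
  f <- reformulate(predictors, response = y)  # e.g. Sepal.Width ~ ...
  knnreg(f, data = iris, k = 5)
})
names(fits) <- outcomes

# Predictions land in a matrix: one column per outcome
preds <- sapply(fits, predict, newdata = head(iris))
```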

(I may have never said these words before) Your best bet might be a basic neural network. Using keras, you could set the batch size reasonably and get a pretty compact prediction equation.
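I won't sketch the full keras setup inline, but the same idea at toy scale, with the recommended nnet package standing in for keras (one hidden layer, linear outputs, and all outcome columns fit at once):

```r
# Small multi-output neural network via nnet (stand-in for keras).
# A matrix response with linout = TRUE gives several numeric outputs
# from one shared hidden layer.
library(nnet)

x <- as.matrix(iris[, c("Petal.Width", "Petal.Length")])
y <- as.matrix(iris[, c("Sepal.Width", "Sepal.Length")])

set.seed(42)
fit <- nnet(x, y, size = 5, linout = TRUE, trace = FALSE)

preds <- predict(fit, x)  # matrix: one column per outcome
```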

OK, so x + y ~ a + b notation is allowed in recipes, that's good to know. I think I got it from here.
And oh no! I might actually try a neural net now? Well, if you suggest it...

@Max, I am stubborn, and therefore I stuck with knnreg() through thick and thin.
I used the RANN package's nn2() function to find the 10 closest neighbors for each missing data point, and only kept those (the other data points would be too far removed to be considered). This reduced the amount of existing data from 1.9M to ~200K records and made it possible to run knnreg().
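Roughly like this (toy matrices standing in for my real 1.9M-row and 52K-row tables):

```r
# Reduce the training set to only the rows that are among the 10 nearest
# neighbors of some missing point. Assumes the RANN package.
library(RANN)

set.seed(7)
known   <- matrix(runif(300), ncol = 3)  # rows with observed values
missing <- matrix(runif(30),  ncol = 3)  # rows needing prediction

# nn2() returns the indices (and distances) of the k nearest known rows
# for each query row
nn <- nn2(data = known, query = missing, k = 10)

# Keep only the known rows referenced by at least one missing point
keep_idx <- sort(unique(as.vector(nn$nn.idx)))
reduced  <- known[keep_idx, , drop = FALSE]
```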

I may be breaking many rules here, as I'm not an ML expert, but my primary goal is a one-time hole-plugging in my data. If I could do it by hand, I would.

On the subject of many outcomes, I couldn't get the x + y ~ a + b or cbind(x, y) ~ a + b syntax to work: it succeeds in the knnreg() call but breaks on predict():

Error in knnregTrain(train = c(25, 37, 38, 26, 27, 43, 32, 32, 31, 46, : 'train' and 'class' have different lengths