Random Forest - Variable lengths differ

Hello you all!

I'm trying to run a random forest and then use the predict function to assess the accuracy of the model.
I have a train database with 7397 rows x 13 features
and a validation database with 2468 rows x 13 features.

I first ran the randomForest function on the train database without any problem, but when I try to predict and assess the accuracy on the validation database I get this error:

Error in model.frame.default(Terms, newdata, na.action = na.omit) : 
  variable lengths differ (found for 'Administrative')
In addition: Warning message:
'newdata' has 2468 rows but the variable found has 7397 rows

So I fitted the model on a subset of the train db, a sample of 2468 rows (the same length as the validation db), but I still got the same error.

train_2 = sample(1:nrow(online_shoppers_intention_train), n_v)

OSI.ran.forest.3 <- randomForest(Revenue~., data= online_shoppers_intention_train, subset=train_2, mtry=12,importance=TRUE)

yhat.OSI = predict(OSI.ran.forest.3, newdata=validation_db)

Neither database has any missing values; I have already checked.

It's hard to debug/offer advice without a reproducible example / access to your data.

The general pattern you are attempting should work. I'm not sure what I would have to alter in my example data to reproduce the error you quote.

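library(randomForest)

n_v <- 100  # assumed sample size; the original post does not show how n_v was set
train_2 = sample(1:nrow(iris), n_v)

iris.rf <- randomForest(Species ~ ., data = iris, mtry = 3,
                        importance = TRUE, subset = train_2)

# Hold out the rows that were not sampled for training
validation <- iris[setdiff(1:nrow(iris), train_2), ]

(yhat = predict(iris.rf, newdata = validation))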

Maybe it is because the validation database comes from a separate file (it was given to me by my professor) rather than being taken from the train database? Both come from the same original dataset, which the professor split in 3: train, validation and test.
Train data: about 60% of the units of the original dataset
Validation data: about 20% of the units of the original dataset
Test data: about 20% of the units of the original dataset
I found that a column name in the validation database was different from the one in the training database. I fixed it but I still get the same error: now the "variable lengths differ" message is reported for "Month", and I can't really understand what is going on. This is driving me crazy, I've been trying to fix it for hours.
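Since fixing one column name only moved the error on to "Month", it may be worth comparing all column names at once instead of one at a time. The sketch below uses two small made-up data frames (`train_df` and `valid_df` are hypothetical stand-ins, not the real data) and base R's `setdiff()` to list names that appear in one data frame but not the other. One possible explanation, offered as an assumption and not something confirmed in this thread, is that when a column is missing from `newdata`, R looks for a variable of that name elsewhere and finds a 7397-row copy, which would match the warning quoted above.

```r
# Toy stand-ins for the real data (hypothetical names and values)
train_df <- data.frame(Administrative = c(0, 1, 2),
                       Month = c("Feb", "Mar", "May"),
                       Revenue = c(TRUE, FALSE, TRUE))
valid_df <- data.frame(Administrative = c(3, 4),
                       month = c("Feb", "Nov"),  # note the lower-case name
                       Revenue = c(FALSE, TRUE))

# Columns the training data has but the validation data lacks
missing_in_valid <- setdiff(names(train_df), names(valid_df))

# Columns the validation data has but the training data lacks
extra_in_valid <- setdiff(names(valid_df), names(train_df))

print(missing_in_valid)
print(extra_in_valid)
```

If both results are empty, the names match exactly (the comparison is case-sensitive), and the problem probably lies somewhere else, for example in the column types.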




Can you get textual output describing each of your two datasets and share it here?
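As a rough illustration of what such a textual description could look like, here is a minimal sketch using base R's `str()` plus a side-by-side table of column classes. The data frames `train_df` and `valid_df` are invented for the example; with the real data you would substitute `online_shoppers_intention_train` and `validation_db`.

```r
# Toy stand-ins for the real data (hypothetical names and values)
train_df <- data.frame(Administrative = c(0, 1, 2),
                       Month = factor(c("Feb", "Mar", "May")),
                       Revenue = c(TRUE, FALSE, TRUE))
valid_df <- data.frame(Administrative = c(3, 4),
                       Month = c("Feb", "Nov"),
                       Revenue = c(FALSE, TRUE),
                       stringsAsFactors = FALSE)

# Compact textual description of each data frame, easy to paste into a post
str(train_df)
str(valid_df)

# Side-by-side view of the first class of each shared column
col_classes <- data.frame(
  column = names(train_df),
  train = sapply(train_df, function(x) class(x)[1]),
  valid = sapply(valid_df[names(train_df)], function(x) class(x)[1]),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(col_classes)
```

In this toy case the table shows `Month` stored as a factor in one data frame and as character in the other, the kind of difference that is easy to miss when two files are read in separately.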
