error in random forest building using rpart()

utk_tripathi · February 27, 2020, 7:03am

Hello
I am working on a data set which has some binary-encoded factor variables and some numerical variables. I was using the train subset (3500 rows and 17 variables). Of these 17 variables, 4 are numerical and 13 are binary-encoded factor variables.

NOTE: these are the only two types of variables. All categorical variables have been dummified into binary-encoded variables.

I was successful in doing the CART analysis and plotting the tree but when I tried building the random forest tree, it kept showing this error and I don't understand what it's trying to tell me. I have been stuck on this for quite some time now and would very much appreciate help on this.

NOTE: "CCAvg" is one of the variables in the data set, which is a numerical variable. In my attempt to fix the issue I just tried removing the variable mentioned but the error shows for all the variables one by one which would basically lead to removing every variable.

rndforest= randomForest(train$`Personal Loan`~., data = train, ntree=501, mtry = 13, nodesize = 1, importance = TRUE)

Error in model.frame.default(terms(reformulate(attributes(Terms)$term.labels)), :
  variable lengths differ (found for 'CCAvg')

(Also, how should I use reprex in a case like this, where I can't find a similar dataset?)

technocrat · February 27, 2020, 7:27am

Hi @utk_tripathi

Remember that the reprex does need data that reproduces the problem, but 1) it doesn't have to be all the data and 2) it's often possible to massage mtcars or another standard data set into the form of the data being used.

If a sample of, say, n = 50 rows reproduces the problem, consider running dput() on it and copy-pasting the output into an assignment statement in the reprex, along with the code that throws the error.
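
A minimal sketch of that dput() workflow, with mtcars standing in for the real data. The object names `small` and `rebuilt` are made up for illustration, and the paste step is simulated by capturing the printed text and evaluating it again:

```r
# Take a small slice of the data; mtcars stands in for the real data set.
small <- head(mtcars[, c("mpg", "cyl", "am")], 5)

# dput() prints R code that recreates the object.
dput(small)

# In a reprex you would paste that output after `small <- `.
# Here the paste step is simulated: capture the text, parse it, evaluate it.
rebuilt <- eval(parse(text = capture.output(dput(small))))

# The rebuilt object should match the original slice.
all.equal(small, rebuilt)
```

Anyone who copies the pasted assignment then has the same 5 rows without needing the original file.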

I'm going to have to clear some brush out from under my random forest, but will look to see if anything pops out.

Just to eliminate one possibility: is the data free of NAs?
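
A quick way to check, sketched on a toy data frame (`df` is a stand-in for the real training set). As far as I know, randomForest() stops on missing values by default, since its na.action defaults to na.fail, so this is worth ruling out first:

```r
# Toy data with one missing value; `df` is a stand-in for the training set.
df <- data.frame(a = c(1, 2, NA), b = c("x", "y", "z"))

# Is anything missing at all?
anyNA(df)

# How many missing values per column?
na_per_col <- colSums(is.na(df))
na_per_col

# One simple option: drop incomplete rows before modelling.
df_complete <- na.omit(df)
nrow(df_complete)
```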

utk_tripathi · February 27, 2020, 7:33am

@technocrat I will surely keep the reprex solution in mind. Thank you for that.
Awaiting your response on the random forest problem.

nirgrahamuk · February 27, 2020, 9:40am

I don't believe it will be possible to diagnose your issue without the data, as it seems data driven.

Side comment, it will be best to amend your formula to be

rndforest= randomForest(`Personal Loan`~., data = train, ntree=501, mtry = 13, nodesize = 1, importance = TRUE)

because train is already passed in data, so the formula doesn't need to repeat it, and repeating it can introduce mistakes. Also, variable names with spaces are a recipe for headaches: you need backticks (not single or double quotes) to reference them in a formula.

Now, what we would like to do is see your data, but apparently it has too many rows to share.
I suggest you try the following to find the smallest number of rows that still triggers the error, by wrapping the data you pass in a head() call.
For example, to see what happens when you only use the first 100 records:
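
A small sketch of the two ways to deal with a column name that contains a space. The toy `train` data frame below is invented for illustration and only borrows the column names from the question:

```r
# Toy stand-in for the training data; check.names = FALSE keeps the space.
train <- data.frame(`Personal Loan` = c(0, 1, 0, 1),
                    CCAvg = c(1.2, 3.4, 0.5, 2.2),
                    check.names = FALSE)

# Option 1: keep the name and quote it with backticks in the formula.
f <- `Personal Loan` ~ .
all.vars(f)

# Option 2: convert every column name to a syntactic one up front,
# so no quoting is needed anywhere later.
train_clean <- train
names(train_clean) <- make.names(names(train_clean))
names(train_clean)
```

Option 2 tends to be less error-prone, because some modelling functions rebuild formulas or data frames internally and may not preserve non-syntactic names.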

rndforest= randomForest(`Personal Loan`~., 
                        data = head(train,n=100),
                        ntree=501, mtry = 13, nodesize = 1, importance = TRUE)
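
The same idea can be automated. The sketch below uses a hypothetical helper, `smallest_failing_n()`, and a toy lm() fit that is set up to fail on an NA, purely to show the mechanics; in practice `fit` would wrap the randomForest() call:

```r
# Hypothetical helper: returns the smallest n in `sizes` for which
# fit(head(data, n)) throws an error, or NA if none of them fail.
smallest_failing_n <- function(data, fit, sizes) {
  for (n in sizes) {
    failed <- tryCatch({
      fit(head(data, n))
      FALSE
    }, error = function(e) TRUE)
    if (failed) return(n)
  }
  NA_integer_
}

# Toy data: row 4 has a missing predictor value.
toy <- data.frame(y = c(1, 2, 3, 4, 5, 6),
                  x = c(1, 2, 3, NA, 5, 6))

# Toy fit that errors as soon as the data contains an NA.
fit <- function(d) lm(y ~ x, data = d, na.action = na.fail)

first_bad <- smallest_failing_n(toy, fit, sizes = c(2, 3, 4, 6))
first_bad
```

The rows up to `first_bad` are then a good candidate for the dput() sample in a reprex.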

Max · February 27, 2020, 4:05pm

@utk_tripathi Often the process of coming up with a smaller example for a reprex will show you what the issue is. It can feel like "why do I have to do all this?" but is worth the time and effort (esp if volunteers are going to try to answer your question).

utk_tripathi · February 27, 2020, 7:10pm

@nirgrahamuk @technocrat @Max
I made the changes and yet the function still does not seem to work.
I have recreated an example using the imports85 dataset. I can't get it to render properly with reprex, which I will attend to later, but for now the code can simply be copied and run, since imports85 is a standard dataset in R.

*data(imports85)*
*str(imports85)*
*mydata = imports85[c(4:5, 8:9, 10:13)]*

*fuelType.matrix= model.matrix(~fuelType - 1, mydata)*
*mydata=cbind(mydata, fuelType.matrix)*

*aspiration.matrix= model.matrix(~aspiration - 1, mydata)*
*mydata=cbind(mydata, aspiration.matrix)*

*driveWheels.matrix= model.matrix(~driveWheels - 1, mydata)*
*mydata=cbind(mydata, driveWheels.matrix)*

*engineLocation.matrix= model.matrix(~engineLocation - 1, mydata)*
*mydata=cbind(mydata, engineLocation.matrix)*

*mydata = mydata[, -c(1:4)]*
*View(mydata)*
*str(mydata)*

mydata[,c(5:13)] <- lapply(mydata[,c(5:13)], factor)

*##randomforest building####*
*library(randomForest)*
*seed=1000*
*set.seed(seed)*
*rndforest= randomForest(formula = aspirationstd~., data = mydata, ntree=51, mtry = 8, nodesize = 1, importance = TRUE)*

*print(rndforest)*
*print(rndforest$err.rate)*
*plot(rndforest)*
*importance(rndforest)*

*##tuning*
*set.seed(seed)*
*trndforest = tuneRF(x = mydata[,-c(10, 11:13)], y = mydata$aspirationstd, mtryStart = 3,* stepFactor  *1.5,improve = 0.0001, trace = TRUE, plot = TRUE, doBest = TRUE, nodesize = 10, importance = TRUE)*

You can simply run this code in R to see the behaviour.

The dataset is similar, with only numerical and dummy factor variables.
While randomForest() succeeded on this data, I still can't understand why it's not working on my dataset.
A new error also shows up when I try to tune the random forest, which can be seen by running the code.

nirgrahamuk · February 27, 2020, 7:32pm

Hello, the tuning error arises because your selection of the X variables includes the Y variable being predicted. The very first random forest therefore has 0 error, leaving no room to improve, and this anomaly causes the function to fail.
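
A hedged sketch of building the predictor set by name so the response cannot slip in, using iris rather than the poster's data. The tuneRF() argument values here are arbitrary choices for the example:

```r
library(randomForest)

# Name the response once and derive x and y from it.
response <- "Species"
x <- iris[, setdiff(names(iris), response)]
y <- iris[[response]]

# Sanity check: the response must not appear among the predictors.
stopifnot(!response %in% names(x))

set.seed(1000)
tuned <- tuneRF(x = x, y = y,
                mtryStart = 2, ntreeTry = 51,
                stepFactor = 1.5, improve = 0.01,
                trace = FALSE, plot = FALSE)
tuned
```

Selecting columns by name instead of by position (as in `mydata[,-c(10, 11:13)]`) makes it much harder to leave the response in by accident.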

technocrat · February 27, 2020, 7:34pm

The reprex format is not essential when all the pieces, including a standard data set, are in a code block. Thanks!

Max · February 27, 2020, 7:36pm

Part of the reason is that x and y have the same column in them, so there is always perfect accuracy. Also, you should probably use the same interface for the model fit (which uses a formula in your code) and the tuneRF() call (which uses the x/y interface).
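
For illustration, here is the same model specified through both interfaces, again on iris rather than the poster's data. Defining `x` and `y` once and reusing them for both the fit and tuneRF() keeps the two calls consistent:

```r
library(randomForest)

x <- iris[, setdiff(names(iris), "Species")]
y <- iris$Species

# x/y interface: the same objects can be handed to tuneRF() unchanged.
set.seed(1000)
fit_xy <- randomForest(x = x, y = y, ntree = 51, mtry = 2,
                       importance = TRUE)

# Formula interface: the response is named in the formula instead.
set.seed(1000)
fit_formula <- randomForest(Species ~ ., data = iris, ntree = 51, mtry = 2,
                            importance = TRUE)

fit_xy$type
fit_formula$type
```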

technocrat · February 27, 2020, 7:50pm

Here's a reprex

library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.

# create representative data to illustrate problem
data(imports85)
str(imports85)
#> 'data.frame':    205 obs. of  26 variables:
#>  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
#>  $ normalizedLosses: int  NA NA NA 164 164 NA 158 NA 158 NA ...
#>  $ make            : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
#>  $ fuelType        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ aspiration      : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
#>  $ numOfDoors      : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
#>  $ bodyStyle       : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
#>  $ driveWheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
#>  $ engineLocation  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ wheelBase       : num  88.6 88.6 94.5 99.8 99.4 ...
#>  $ length          : num  169 169 171 177 177 ...
#>  $ width           : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
#>  $ height          : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
#>  $ curbWeight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
#>  $ engineType      : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
#>  $ numOfCylinders  : Ord.factor w/ 7 levels "two"<"three"<..: 3 3 5 3 4 4 4 4 4 4 ...
#>  $ engineSize      : int  130 130 152 109 136 136 136 136 131 131 ...
#>  $ fuelSystem      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
#>  $ bore            : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
#>  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
#>  $ compressionRatio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
#>  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
#>  $ peakRpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
#>  $ cityMpg         : int  21 21 19 24 18 19 19 19 17 16 ...
#>  $ highwayMpg      : int  27 27 26 30 22 25 25 25 20 22 ...
#>  $ price           : int  13495 16500 16500 13950 17450 15250 17710 18920 23875 NA ...
mydata = imports85[c(4:5, 8:9, 10:13)]
fuelType.matrix= model.matrix(~fuelType - 1, mydata)
mydata=cbind(mydata, fuelType.matrix)
aspiration.matrix= model.matrix(~aspiration - 1, mydata)
mydata=cbind(mydata, aspiration.matrix)
driveWheels.matrix= model.matrix(~driveWheels - 1, mydata)
mydata=cbind(mydata, driveWheels.matrix)
engineLocation.matrix= model.matrix(~engineLocation - 1, mydata)
mydata=cbind(mydata, engineLocation.matrix)
mydata = mydata[, -c(1:4)]

# pick a continuous variable as the response and create model
rndforest= randomForest(length ~., data = mydata, ntree=501, mtry = 13, nodesize = 1, importance = TRUE)
#> Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
#> range
# warns rather than fails: mtry = 13 exceeds the 12 predictors, so it is reset to 12
rndforest
#> 
#> Call:
#>  randomForest(formula = length ~ ., data = mydata, ntree = 501,      mtry = 13, nodesize = 1, importance = TRUE) 
#>                Type of random forest: regression
#>                      Number of trees: 501
#> No. of variables tried at each split: 12
#> 
#>           Mean of squared residuals: 9.741354
#>                     % Var explained: 93.57
# choose minimal mtry
rndforest= randomForest(length ~., data = mydata, ntree=501, mtry = 3, nodesize = 1, importance = TRUE)
# returns value
rndforest
#> 
#> Call:
#>  randomForest(formula = length ~ ., data = mydata, ntree = 501,      mtry = 3, nodesize = 1, importance = TRUE) 
#>                Type of random forest: regression
#>                      Number of trees: 501
#> No. of variables tried at each split: 3
#> 
#>           Mean of squared residuals: 14.59681
#>                     % Var explained: 90.36
# choose intermediate
rndforest= randomForest(length ~., data = mydata, ntree=501, mtry = 8, nodesize = 1, importance = TRUE)
# returns value
rndforest
#> 
#> Call:
#>  randomForest(formula = length ~ ., data = mydata, ntree = 501,      mtry = 8, nodesize = 1, importance = TRUE) 
#>                Type of random forest: regression
#>                      Number of trees: 501
#> No. of variables tried at each split: 8
#> 
#>           Mean of squared residuals: 9.448357
#>                     % Var explained: 93.76
# narrow in again
rndforest= randomForest(length ~., data = mydata, ntree=501, mtry = 10, nodesize = 1, importance = TRUE)
# still good
rndforest
#> 
#> Call:
#>  randomForest(formula = length ~ ., data = mydata, ntree = 501,      mtry = 10, nodesize = 1, importance = TRUE) 
#>                Type of random forest: regression
#>                      Number of trees: 501
#> No. of variables tried at each split: 10
#> 
#>           Mean of squared residuals: 10.03625
#>                     % Var explained: 93.37
# push to one less than failure point mtry
rndforest= randomForest(length ~., data = mydata, ntree=501, mtry = 12, nodesize = 1, importance = TRUE)
# conclusion: for this data set mtry cannot exceed 12
rndforest
#> 
#> Call:
#>  randomForest(formula = length ~ ., data = mydata, ntree = 501,      mtry = 12, nodesize = 1, importance = TRUE) 
#>                Type of random forest: regression
#>                      Number of trees: 501
#> No. of variables tried at each split: 12
#> 
#>           Mean of squared residuals: 9.964372
#>                     % Var explained: 93.42
# also, the "variable lengths differ" error does not arise

Created on 2020-02-27 by the reprex package (v0.3.0)
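
The limit of 12 seen above is simply the number of predictors (13 columns minus the response). A small sketch of guarding against that, with iris as a stand-in and made-up variable names:

```r
# iris stands in for the data; one column (Species) is the response.
dat <- iris
n_predictors <- ncol(dat) - 1

# Cap the requested mtry at the number of available predictors.
requested_mtry <- 13
safe_mtry <- min(requested_mtry, n_predictors)
safe_mtry
```

Passing `safe_mtry` instead of a hard-coded value avoids the "invalid mtry" warning when the number of columns changes.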

utk_tripathi · February 27, 2020, 7:53pm

@Max
Using the same interface worked! Thank you very much for your response.

utk_tripathi · February 27, 2020, 7:55pm

@nirgrahamuk
I removed the response variable from x and the tuning worked too. Thanks a lot for your help!

utk_tripathi · February 27, 2020, 7:59pm

@technocrat Thanks a lot for helping me out. I can understand where I was going wrong through your example.

nirgrahamuk · March 2, 2020, 12:55pm

Some advice for the future: when sharing code, it should be copy-and-pasteable for others. I see here that the code has many * in it.

It's good practice to open a new R session, paste in the code you are planning to share on the forum, and check that it runs and shows what you intend, since that is what we will be facing when we look at your code.

Good news that the random forest issues got worked out.
