Error in gbm.fit

Hello guys,

I'm trying to fit a boosted regression tree to a dataset of abundance and environmental variables with the following code:

dados.tc3.lr003 <- gbm.step(data=Dados, gbm.x = 2:14, gbm.y = 1,
                            family = "gaussian", tree.complexity = 3,
                            learning.rate = 0.003, bag.fraction = 0.5)

But when I run the code, the following error appears:

Error in gbm.fit(x = x, y = y, offset = offset, distribution = distribution,  : 
  The data set is too small or the subsampling rate is too large: 
`nTrain * bag.fraction <= n.minobsinnode`

I'm not doing any subsampling; I'm using the full dataset of 46 observations.
I already tried to reduce the bag.fraction, but the same error appeared.
Is there any way to decrease the subsampling rate, or is the data set just too small to perform this analysis?

Thank you in advance!

Welcome to the community!

Can you please provide a REPRoducible EXample of your problem?

In case you don't know how to make a reprex, here's a great link:


Yes of course!

Dataset:

Abundance_s AnnualPrec AnnualMeanTemp MeanDiuRange MaxTempWM LandCover Altitude
1   0.5555556        854       22.12500     11.33333      32.0        DB       42
2   0.4444444        846       22.00833     11.35000      31.9        DB       61
3   0.6666667        844       22.03333     11.43333      32.0        DB       56
4   0.0000000        843       22.10000     11.41667      32.1        DB       53
5   0.3333333        834       22.02500     11.50000      32.1         G       62
6   0.4444444        832       21.97917     11.52500      32.0         G       61
  BHerbaceous NDVI TreeCover Soil Grassland Trees BareSoil
1    57.83030  196        37  LVk         0     7        0
2    38.14339  206        44  LVk         0    30        0
3    65.35818  200        41  ARh         0    30        0
4    62.20603  193        41  LVk         0     0        0
5    56.22468  185        16  LVk         0     0        0
6    71.29130  199        16  ARh         0    21        0

Code:

#Load the dismo package
library(dismo)

#Define categorical variables as factors
Dados$Soil <- factor(Dados$Soil, levels=c("ARh", "LPm", "LPq","LVk", "PHh", "O"))
Dados$LandCover <- factor(Dados$LandCover, levels=c("DB", "G", "O", "WD"))

#BRT
dados.tc3.lr003 <- gbm.step(data=Dados, gbm.x = 2:14, gbm.y = 1,
                            family = "gaussian", tree.complexity = 3,
                            learning.rate = 0.003, bag.fraction = 0.5)

Tell me if you need anything else, and thank you!

Your sample data is not in a copy/paste-friendly format; please follow the link Anirban gave you and try to make a proper reproducible example.
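For example, one copy/paste-friendly option is dput() from base R (just a sketch, assuming your data frame is called Dados):

#Print a plain-text representation of the first rows that anyone can paste straight back into R
dput(head(Dados, 10))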

Hope everything is ok now!

library(dismo)

Dados <- data.frame(stringsAsFactors=FALSE,
                                        Abundance_b = c(0.444444444444444, 0.5, 0.333333333333333,
                                                        0.777777777777778,
                                                        0.875),
                                       MeanDiuRange = c(10.7916667461395, 10.7916666666667, 10.9999999205271,
                                                        10.9916664759318,
                                                        10.7666664918264),
                                         AnnualPrec = c(881, 889, 885, 880, 882),
                                     AnnualMeanTemp = c(22.1958333253861, 22.2041665712992, 22.2416666746139,
                                                        22.0624998410543,
                                                        22.3083331982295),
                                          MaxTempWM = c(31.7000007629394, 31.7000007629394, 31.8999996185303,
                                                        31.6000003814697,
                                                        31.8999996185303),
                                          LandCover = c("G", "UV", "UV", "UV", "G"),
                                           Altitude = c(35, 31, 22, 54, 6),
                                        BHerbaceous = c(72.0861129760742, 72.0861129760742, 64.076545715332,
                                                        73.3464584350586,
                                                        62.7008323669434),
                                               NDVI = c(192, 194, 192, 186, 192),
                                          TreeCover = c(39, 16, 16, 20, 16),
                                               Soil = c("ARh", "ARh", "LVk", "ARh", "LVk"),
                                          Grassland = c(0, 0, 0, 0, 0),
                                              Trees = c(0, 0, 0, 0, 9),
                                           BareSoil = c(0, 2, 87, 100, 53)

Dados$Soil <- factor(Dados$Soil, levels=c("ARa", "ARh", "LVk", "O"))
Dados$LandCover <- factor(Dados$LandCover, levels=c("CS", "DB", "G", "O","UV"))

dados.tc3.lr003 <- gbm.step(data=Dados, gbm.x = 2:14, gbm.y = 1,
                            family = "gaussian", tree.complexity = 3,
                            learning.rate = 0.003, bag.fraction = 0.5)



You were nearly perfect with your reprex. You just missed a ) at the end of the dataset.

I just want to add that you didn't need to use stringsAsFactors=FALSE when creating the dataset, since you're converting those columns to factors later.
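Just to illustrate (a minimal sketch with made-up values, not your data): with or without stringsAsFactors=FALSE, the column ends up identical once you convert it with factor().

#Build the same column with and without stringsAsFactors=FALSE
x1 <- data.frame(Soil = c("ARh", "LVk"), stringsAsFactors = FALSE)
x2 <- data.frame(Soil = c("ARh", "LVk"))

#After the explicit conversion the argument makes no difference
x1$Soil <- factor(x1$Soil, levels = c("ARh", "LVk"))
x2$Soil <- factor(x2$Soil, levels = c("ARh", "LVk"))
identical(x1, x2)
#> [1] TRUE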

You said that you tried lower values of bag.fraction, but the error suggests that you should try increasing it. Actually, for this example, you can use bag.fraction values from 6 onwards, but you'll get a lot of warnings (11, to be specific).

Warnings

Warning messages:
1: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 11: Grassland has no variation.
2: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 11: Grassland has no variation.
3: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 12: Trees has no variation.
4: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 11: Grassland has no variation.
5: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 11: Grassland has no variation.
6: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 11: Grassland has no variation.
7: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 11: Grassland has no variation.
8: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 11: Grassland has no variation.
9: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 11: Grassland has no variation.
10: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 11: Grassland has no variation.
11: In gbm.fit(x = x, y = y, offset = offset, distribution = distribution, ... :
variable 11: Grassland has no variation.

Whether you'll use that or not is up to you. I had boosting in my coursework, but forgot almost everything, so I'll refrain from commenting further. But you may go through the documentation to better understand what this error means.

from `gbm.fit` documentation

Now, here are the relevant entries from the documentation of gbm.fit:

nTrain
An integer representing the number of cases on which to train. This is the preferred way of specification for gbm.fit; The option train.fraction in gbm.fit is deprecated and only maintained for backward compatibility. These two parameters are mutually exclusive. If both are unspecified, all data is used for training.

bag.fraction
The fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomnesses into the model fit. If bag.fraction < 1 then running the same model twice will result in similar but different fits. gbm uses the R random number generator so set.seed can ensure that the model can be reconstructed. Preferably, the user can save the returned gbm.object using save. Default is 0.5.

n.minobsinnode
Integer specifying the minimum number of observations in the trees terminal nodes. Note that this is the actual number of observations not the total weight.
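To make the connection concrete, here's a rough back-of-the-envelope version of the check (a sketch only: it assumes nTrain is simply the number of rows and that gbm's default n.minobsinnode = 10 applies; inside gbm.step the cross-validation folds mean each model is trained on fewer rows, so the exact cut-off will differ):

#The condition the error message quotes, spelled out for the 5-row reprex
nTrain         <- 5    #rows in the example data (assumed equal to the training size)
n.minobsinnode <- 10   #gbm's default minimum observations per terminal node

nTrain * 0.5 <= n.minobsinnode   #bag.fraction = 0.5: 2.5 <= 10, so the error is raised
#> [1] TRUE
nTrain * 6   <= n.minobsinnode   #bag.fraction = 6: 30 > 10, so the check passes
#> [1] FALSE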

I'm sure someone else will provide you a better explanation, or better, you may come up with your own.

Good luck!


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.