I am trying to fit a LightGBM regression model to my data, using the quantile loss function, but the results are strange.
In particular, I want to estimate the median of my response variable and its prediction intervals for different 1 - \alpha levels. I am adapting the idea from the IBM Developer article Prediction intervals explained: A LightGBM tutorial. It is a tutorial for Python, but the idea generalizes to R.
The main idea is the following: if you want the median prediction and its (1 - \alpha) prediction interval for the response variable, you need to run three different regressions with the quantile loss:
- One regression for the 0.5 quantile
- One regression for the \alpha/2 quantile
- One regression for the 1 - \alpha/2 quantile
For example, to find the 95% prediction interval (\alpha = 0.05), you need to run these three regressions:
- One regression for the 0.5 quantile
- One regression for the 0.025 quantile
- One regression for the 0.975 quantile
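To make the idea concrete, here is a minimal sketch of those three fits in R for the 95% interval. It mirrors the full example further down; the quantile_fit() helper and the .pred_* column names are just illustrative, not something from the article:
# three quantile regressions for a 95% prediction interval (alpha = 0.05)
library(lightgbm)
library(tidymodels)
library(bonsai)

# illustrative helper: same model spec as in the full example below,
# only the quantile level (LightGBM's alpha parameter) changes
quantile_fit <- function(q, data) {
  boost_tree(trees = 100, min_n = 10, tree_depth = 3) |>
    set_engine("lightgbm", objective = "quantile", alpha = q) |>
    set_mode("regression") |>
    fit(bill_length_mm ~ ., data = data)
}

fit_lower  <- quantile_fit(0.025, penguins)  # lower bound
fit_median <- quantile_fit(0.5,   penguins)  # point estimate (median)
fit_upper  <- quantile_fit(0.975, penguins)  # upper bound

intervals <- tibble(
  .pred_lower  = predict(fit_lower,  new_data = penguins)$.pred,
  .pred_median = predict(fit_median, new_data = penguins)$.pred,
  .pred_upper  = predict(fit_upper,  new_data = penguins)$.pred
)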
However, the predicted values I am obtaining are not what I expected. I won't share my data here, but I was able to reproduce the problem with the penguins data. In the following code, I fit a quantile regression for every quantile level \alpha \in \{ 0.1, 0.2, 0.3, \ldots, 0.8, 0.9\} (LightGBM's alpha parameter). I tried to keep it simple, with no feature engineering, recipe specifications or resampling techniques. Also, all regressions use the same trees, min_n and tree_depth, to make the comparison fair:
# packages needed
library(lightgbm)
library(tidymodels)
library(bonsai)

# seed for reproducibility
set.seed(1234)

# list to save my simulation results
result <- list()

# counter to update list position
i <- 0

# loop to fit one quantile regression for each
# alpha in {0.1, 0.2, 0.3, ..., 0.8, 0.9}
for (j in seq(from = 0.1, to = 0.9, by = 0.1)) {
  i <- i + 1
  fit <-
    boost_tree(trees = 100,
               min_n = 10,
               tree_depth = 3) |>
    set_engine("lightgbm", objective = "quantile", alpha = j) |>
    set_mode("regression") |>
    fit(bill_length_mm ~ ., data = penguins)
  # store the quantile level and its predictions
  result[[i]] <- tibble(j = j, .pred = predict(fit, new_data = penguins)$.pred)
}
# combine the list into a single tibble
final <- bind_rows(result)

# put the tibble in wide format for easier comparison
final |>
  group_by(j) |>
  mutate(row = row_number()) |>
  pivot_wider(
    names_from = j,
    values_from = .pred
  )
#> # A tibble: 344 × 10
#> row `0.1` `0.2` `0.3` `0.4` `0.5` `0.6` `0.7` `0.8` `0.9`
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 37.7 38.6 38.9 39.0 39.2 39.1 40.0 40.7 40.7
#> 2 2 35.6 36.0 36.5 37.4 37.6 37.9 39.3 39.5 39.7
#> 3 3 35.9 35.8 36.6 38.0 39.2 39.7 40.1 40.3 40.2
#> 4 4 34.0 36.2 36.6 36.4 36.5 37.0 37.6 39.0 38.6
#> 5 5 34.8 35.9 36.1 36.7 37.5 38.4 38.4 38.1 40.3
#> 6 6 37.2 36.8 37.7 38.4 39.3 39.4 40.4 39.5 43.0
#> 7 7 35.8 37.0 37.5 38.3 38.6 38.9 40.4 40.7 41.3
#> 8 8 37.0 39.2 39.3 40.7 40.6 41.1 41.4 41.8 42.7
#> 9 9 35.9 35.8 36.3 37.8 38.4 39.1 38.9 39.9 40.2
#> 10 10 36.0 35.6 37.2 37.8 38.9 39.0 40.3 40.0 45.1
#> # ℹ 334 more rows
Notice the first row of my output, in particular the predicted values for the following quantiles:
- 0.4: 39.0
- 0.5: 39.2
- 0.6: 39.1
How so? As far as I understand, these values should be monotonic in the quantile level: the prediction for the 0.6 quantile should not be smaller than the prediction for the 0.5 quantile.
Something similar occurs in row 3:
- 0.7: 40.1
- 0.8: 40.3
- 0.9: 40.2
Here, the prediction for the 0.9 quantile should not be smaller than the prediction for the 0.8 quantile.
These are just two examples; the same thing happens in many other rows of the predictions for this dataset.
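To quantify "many other rows", here is a quick check on the wide table above. This snippet is just illustrative (it saves the pivoted table as wide, which I did not do in the original code) and flags the rows where the predicted quantiles cross, i.e. where the sequence of predictions is not non-decreasing:
# save the wide table and flag rows whose quantile predictions cross
wide <- final |>
  group_by(j) |>
  mutate(row = row_number()) |>
  pivot_wider(names_from = j, values_from = .pred) |>
  ungroup()

crossings <- wide |>
  rowwise() |>
  mutate(crossed = any(diff(c_across(`0.1`:`0.9`)) < 0)) |>
  ungroup()

sum(crossings$crossed)   # number of rows with at least one crossing
mean(crossings$crossed)  # share of affected rows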
In my real dataset, the one I didn't share, I sometimes get 80% prediction intervals that are wider than 95% prediction intervals, or the prediction for the 0.5 quantile falls outside the 90% prediction interval.
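The penguins example above shows the same kind of behaviour. Still assuming the wide table from the previous snippet, this illustrative check compares the nominal 80% band (predictions for the 0.1 and 0.9 quantiles) with the nominal 60% band (0.2 and 0.8), and counts the rows where the 0.5 prediction falls outside the 80% band:
# interval widths and median-outside-band checks on the wide table
checks <- wide |>
  mutate(
    width_80 = `0.9` - `0.1`,
    width_60 = `0.8` - `0.2`,
    narrower_than_nominal = width_80 < width_60,    # 80% band narrower than 60% band
    median_outside = `0.5` < `0.1` | `0.5` > `0.9`  # 0.5 prediction outside 80% band
  )

sum(checks$narrower_than_nominal)
sum(checks$median_outside)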
What could be happening in my modeling? Am I misunderstanding the quantile loss function in LightGBM? Am I implementing something wrong in the code above (and, therefore, in my real code as well)? Is this something one should expect from the quantile objective in the lightgbm package? Should I search for different hyperparameters for each regression I run?