Quantile loss function in LightGBM: unexpected values for prediction intervals

I am trying to fit a LightGBM regression model to my data, using the quantile loss function, but the results are strange.

In particular, I want to estimate the median of my response variable and its prediction intervals for different 1 - \alpha levels. I am adapting the idea from the IBM Developer article called "Prediction intervals explained: A LightGBM tutorial". It is a tutorial for Python, but the idea can be carried over to R.

The main idea is the following: if you want to find a (1 - \alpha) prediction interval for the response variable, together with an estimate of its median, you need to run three different regressions with quantile loss:

  1. One regression for the 0.5 quantile
  2. One regression for the \alpha/2 quantile
  3. One regression for the 1 - \alpha/2 quantile

For example, to find the 95% prediction interval, you need to run these three regressions (see the sketch right after this list):

  1. One regression for the 0.5 quantile
  2. One regression for the 0.025 quantile
  3. One regression for the 0.975 quantile
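
Concretely, the 95% case could look like this (a minimal sketch using the penguins data and the same hyperparameters as in the full example below; fit_quantile is just a helper name I made up here):

library(tidymodels)
library(bonsai)

# helper: fit one lightgbm quantile regression at level q
fit_quantile <- function(q, data) {
  boost_tree(trees = 100, min_n = 10, tree_depth = 3) |>
    set_engine("lightgbm", objective = "quantile", alpha = q) |>
    set_mode("regression") |>
    fit(bill_length_mm ~ ., data = data)
}

# one fit each for the lower bound, the median and the upper bound
fits <- lapply(c(0.025, 0.5, 0.975), fit_quantile, data = penguins)

# assemble the 95% prediction interval around the estimated median
interval_95 <- tibble(
  lower  = predict(fits[[1]], new_data = penguins)$.pred,
  median = predict(fits[[2]], new_data = penguins)$.pred,
  upper  = predict(fits[[3]], new_data = penguins)$.pred
)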

However, the predicted values I am obtaining are not what I was expecting. I won't share my data here, but I was able to use the penguins data to reproduce the problem I am facing. In the following code, I fit quantile regressions for every \alpha \in \{ 0.1, 0.2, 0.3, \ldots, 0.8, 0.9\}. I tried to keep it simple, without feature engineering, recipe specifications or resampling techniques. Also, all regressions have the same number of trees, min_n and tree_depth, to make a fair comparison:

# packages needed

library(lightgbm)
library(tidymodels)
library(bonsai)

# seed for reproducibility

set.seed(1234)

# list to save my simulation results

result <- list()

# counter to update list position

i <- 0

# loop to fit quantile regressions for each
# alpha \in {0.1, 0.2, 0.3, ..., 0.8, 0.9}

for (j in seq(from = 0.1, to = 0.9, by = 0.1)) {
  
  i <- i + 1
  
  fit <-
    boost_tree(trees = 100,
               min_n = 10,
               tree_depth = 3) |>
    set_engine("lightgbm", objective = "quantile", alpha = j) |>
    set_mode("regression") |>
    fit(bill_length_mm ~ ., data = penguins)
  
  result[[i]] <- tibble(j, predict(fit, new_data = penguins))
}

# transform the list in a tibble

final <- tibble(do.call(rbind, result))

# put the tibble in wide format for easier comparison

final |>
  group_by(j) |> 
  mutate(row = row_number()) |> 
  pivot_wider(
    names_from = j,
    values_from = .pred
  )
#> # A tibble: 344 × 10
#>      row `0.1` `0.2` `0.3` `0.4` `0.5` `0.6` `0.7` `0.8` `0.9`
#>    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1  37.7  38.6  38.9  39.0  39.2  39.1  40.0  40.7  40.7
#>  2     2  35.6  36.0  36.5  37.4  37.6  37.9  39.3  39.5  39.7
#>  3     3  35.9  35.8  36.6  38.0  39.2  39.7  40.1  40.3  40.2
#>  4     4  34.0  36.2  36.6  36.4  36.5  37.0  37.6  39.0  38.6
#>  5     5  34.8  35.9  36.1  36.7  37.5  38.4  38.4  38.1  40.3
#>  6     6  37.2  36.8  37.7  38.4  39.3  39.4  40.4  39.5  43.0
#>  7     7  35.8  37.0  37.5  38.3  38.6  38.9  40.4  40.7  41.3
#>  8     8  37.0  39.2  39.3  40.7  40.6  41.1  41.4  41.8  42.7
#>  9     9  35.9  35.8  36.3  37.8  38.4  39.1  38.9  39.9  40.2
#> 10    10  36.0  35.6  37.2  37.8  38.9  39.0  40.3  40.0  45.1
#> # ℹ 334 more rows

Notice the first row of my output, in particular the estimated values for the following quantiles:

  • 0.4: 39.0
  • 0.5: 39.2
  • 0.6: 39.1

How so? As far as I understand, these values should be monotonically non-decreasing in the quantile level: the estimated value for quantile 0.6 should not be smaller than the estimated value for quantile 0.5.

Something similar occurs in row 3:

  • 0.7: 40.1
  • 0.8: 40.3
  • 0.9: 40.2

Here, the estimated value for quantile 0.9 should not be smaller than the estimated value for quantile 0.8.

These are only two examples; the same thing happens in many more rows of the predictions for this dataset.
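
Just to show that these are not isolated cases, one way to count the affected observations from the long tibble final built above (base R's is.unsorted() returns TRUE when a vector is not in non-decreasing order) would be:

# how many observations have predicted quantiles that cross somewhere in the grid?
final |>
  group_by(j) |>
  mutate(row = row_number()) |>
  group_by(row) |>
  summarise(crossed = is.unsorted(.pred[order(j)])) |>
  summarise(n_crossed = sum(crossed))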

In my real dataset (the one I didn't share), I sometimes get 80% prediction intervals that are wider than 95% prediction intervals, or the estimated 0.5 quantile falls outside the 90% prediction interval, for example.
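
The same kind of check can be made for interval widths in the penguins example: with the grid above, the 60% interval is [0.2, 0.8] and the 80% interval is [0.1, 0.9], and the inner one should never be the wider of the two. A sketch of that check:

# compare widths of the nested 60% and 80% intervals built from the fitted quantiles
final |>
  group_by(j) |>
  mutate(row = row_number()) |>
  ungroup() |>
  pivot_wider(names_from = j, values_from = .pred) |>
  mutate(width_60 = `0.8` - `0.2`,
         width_80 = `0.9` - `0.1`) |>
  summarise(n_inverted = sum(width_60 > width_80))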

What could be happening in my modeling? Am I misunderstanding the quantile loss function in LightGBM? Am I implementing something wrong in the code above (and, therefore, in my real code as well)? Is this something one would expect from the quantile loss function in the lightgbm package? Should I tune different hyperparameters for each regression I run?

Hi @mnunes,

You are right about your assumption in general, but unfortunately you would have to enforce that monotonicity while fitting, since by default LightGBM does not guarantee it: each quantile regression is fitted independently, so the predicted quantiles can cross.

Looking at https://lightgbm.readthedocs.io/en/latest/Parameters.html, you can search for

Note: the output cannot be monotonically constrained with respect to a categorical feature

which is a problem with the penguins dataset, since it contains categorical features. Also, NAs are not handled in the R version the way they are in the Python version, so you would need to take care of that too:

# Omit NA rows
data <- na.omit(penguins)
# Keep only numeric columns
numeric_data <- data[sapply(data, is.numeric)]

So there is an option to set the constraint via monotone_constraints, where 1 means the prediction should increase as the corresponding feature increases. However,

set_engine("lightgbm", objective = "quantile", monotone_constraints = rep(1, ncol(numeric_data) - 1), alpha = j)

leads to

[LightGBM] [Fatal] Cannot use monotone_constraints in quantile objective, please disable it.
Error in initialize(...) :

So that's why you are seeing this behavior: there is no monotonicity enforcement across the separately fitted quantile regressions.

Thank you, @vedoa. It's a bummer that I cannot use the quantile regression solution for my problem, at least not in the way I thought. I guess I'll either have to use it anyway or implement a bootstrap prediction interval instead.
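
For the record, here is roughly what I have in mind for the bootstrap alternative: refit the same lightgbm spec on resampled training rows, add resampled in-bag residuals, and take empirical quantiles of the resulting predictions. This is only a sketch of one possible bootstrap scheme, not a polished implementation:

library(tidymodels)
library(bonsai)

set.seed(1234)

data <- na.omit(penguins)
B <- 200  # number of bootstrap resamples (kept modest for speed)

spec <- boost_tree(trees = 100, min_n = 10, tree_depth = 3) |>
  set_engine("lightgbm") |>
  set_mode("regression")

boot_preds <- lapply(seq_len(B), function(b) {
  idx <- sample(nrow(data), replace = TRUE)
  fit_b <- fit(spec, bill_length_mm ~ ., data = data[idx, ])
  # in-bag residuals, resampled to mimic the noise around new predictions
  res <- data$bill_length_mm[idx] - predict(fit_b, new_data = data[idx, ])$.pred
  predict(fit_b, new_data = data)$.pred + sample(res, nrow(data), replace = TRUE)
})

pred_mat <- do.call(cbind, boot_preds)

# 95% bootstrap prediction interval and a point estimate per observation
intervals <- tibble(
  lower  = apply(pred_mat, 1, quantile, probs = 0.025),
  median = apply(pred_mat, 1, quantile, probs = 0.5),
  upper  = apply(pred_mat, 1, quantile, probs = 0.975)
)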