Struggling with Income Prediction: Non-Linearities, Interactions, and Persistent Bias Patterns

Hi everyone,

I’m working on an income prediction project where I compare several model families (LASSO, Random Forest, XGBoost, LightGBM) within a single prediction pipeline. I’d really appreciate feedback on methodological improvements and whether my current approach makes sense given the issues below.

In model evaluation plots, lower actual incomes tend to be over-predicted and higher incomes under-predicted, and the same pattern shows up in the residual plots. This suggests to me that the issue is not just a missing transformation but possibly a deeper structural limitation of the linear setup.
Beyond standard transformations, what approaches would be reasonable next steps here?
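To make the pattern concrete, here is a minimal sketch of the diagnostic I'm describing, on synthetic stand-in data (the real pipeline uses the held-out set instead). The over-prediction-at-low / under-prediction-at-high pattern shows up as a positive correlation between residuals and actual values:

```python
# Minimal sketch: checking whether residuals trend with the actual values,
# i.e. the over-prediction-at-low / under-prediction-at-high pattern.
# Synthetic stand-in data; feature names and model settings are placeholders.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
# Income generated with a mild non-linearity a linear model cannot capture.
y = np.exp(1.0 + 0.4 * X[:, 0] + 0.2 * X[:, 1] ** 2 + 0.3 * rng.normal(size=n))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = Lasso(alpha=0.1).fit(X_tr, y_tr)
resid = y_te - model.predict(X_te)

# A clearly positive correlation between actuals and residuals is the
# signature of predictions being shrunk toward the mean.
corr = np.corrcoef(y_te, resid)[0, 1]
print(round(corr, 2))
```

Plotting `resid` against `y_te` shows the same thing graphically; the correlation is just a compact summary of the tilt in that plot.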

The feature importance heatmap across models highlights a recurring issue for me: tree-based models consistently emphasize limit- and installment-related variables, while LASSO either downplays them or replaces them with transformed proxies. This made me question whether I’m compensating for linearity too aggressively with transformations instead of thinking more carefully about interactions and collinearity.
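One thing I've considered (sketched below with placeholder synthetic data) is using permutation importance instead of each model family's built-in importance, since it puts linear and tree models on the same footing and might make the heatmap comparison fairer:

```python
# Sketch: permutation importance as a model-agnostic comparison, so the
# LASSO vs. tree-model disagreement isn't an artifact of different
# importance definitions. Synthetic data; column roles are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 4))
# Feature 0 carries twice the signal of feature 1; features 2-3 are noise.
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for model in (Lasso(alpha=0.05),
              RandomForestRegressor(n_estimators=100, random_state=0)):
    model.fit(X_tr, y_tr)
    imp = permutation_importance(model, X_te, y_te,
                                 n_repeats=10, random_state=0)
    results[type(model).__name__] = imp.importances_mean

for name, means in results.items():
    print(name, np.round(means, 2))
```

If the disagreement persists under a common importance measure, that points more strongly toward genuine structural differences (interactions the trees exploit) rather than a measurement artifact.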

Overall, I realize I leaned heavily on transformations as a way to “force” linear models to behave better, and I’m no longer confident that this is always the right instinct. I’m trying to step back and think more in terms of:

  • when a transformation is genuinely justified,
  • when an interaction would be more appropriate,
  • and when the correct conclusion is simply that a linear model has reached its limits for this problem.
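For the second bullet in particular, the comparison I have in mind can be sketched as a cross-validated head-to-head between a plain linear fit and one with explicit interaction terms. This is synthetic illustration data where the true signal contains an interaction, so the interaction model should win here; in the real data the outcome is of course the open question:

```python
# Sketch: comparing a plain linear baseline against a linear model with
# explicit pairwise interactions, via cross-validated R^2.
# Synthetic data built to contain a true interaction.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
n = 1500
X = rng.normal(size=(n, 2))
y = (1.0 + X[:, 0] + X[:, 1]
     + 1.5 * X[:, 0] * X[:, 1]          # the interaction the baseline misses
     + rng.normal(scale=0.5, size=n))

baseline = LinearRegression()
interact = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)

r2_base = cross_val_score(baseline, X, y, cv=5, scoring="r2").mean()
r2_int = cross_val_score(interact, X, y, cv=5, scoring="r2").mean()
print(round(r2_base, 2), round(r2_int, 2))
```

The same harness with a log-transformed target in place of the interaction features would cover the first bullet, and a persistent gap to the tree models after both would speak to the third.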

If anyone has experience balancing interpretability, diagnostics, and practical performance in income or credit-related prediction tasks, I’d really appreciate hearing how you approach these trade-offs.
