Predictive Analytics

Calibrated uncertainty for tabular regression in Python

Most regression systems return a single number.

Revenue next month. Delivery time. Energy demand. Claim cost. Churn value. A point estimate is useful, but it hides the part that usually matters most in production: how wrong the model could be.

A forecast of 120 is not the same prediction if the realistic range is 118–122 versus 70–190.

For many data products, the real upgrade is not a more complex model. It is a model that can express uncertainty in a way the business can actually use.

This article builds a practical solution in Python for probabilistic regression on tabular data. The goal is simple:

predict a full range, not just a point
avoid strong distribution assumptions
keep the workflow compatible with standard machine learning stacks
produce intervals that can be checked and improved

The method uses quantile regression with LightGBM. It is fast, production-friendly, and works well when the target distribution is skewed, heavy-tailed, or changes with the input features.

Why quantile regression matters

A standard regression model learns the conditional mean:

E[y | x]

That is often not what decision systems need.

A logistics team may want a conservative estimate for delivery duration. A finance team may want the 90th percentile of loss. An operations team may want a lower and upper bound to plan staffing.

Quantile regression predicts conditional quantiles directly:

0.1 quantile → a low-end estimate
0.5 quantile → the median
0.9 quantile → a high-end estimate

With those three values, you already have a practical uncertainty band.

This is especially useful when:

errors are asymmetric
outliers matter
variance changes across the feature space
target values are not close to normal

Instead of forcing one global error pattern, quantile regression lets uncertainty adapt locally.

The problem setup

We will create a synthetic tabular dataset with heteroscedastic noise. That means the noise level changes depending on the input. This is exactly where point prediction starts to fail as a complete answer.

Our plan:

generate data
train three LightGBM models for q=0.1, q=0.5, q=0.9
evaluate interval quality
fix quantile crossing
package the result into a reusable predictor

Step 1 — Create a dataset where uncertainty changes with x

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(42)
n = 5000
x1 = np.random.uniform(0, 10, n)
x2 = np.random.normal(0, 1, n)
x3 = np.random.binomial(1, 0.4, n)
noise_scale = 0.5 + 0.3 * x1
noise = np.random.normal(0, noise_scale)
y = 10 + 2.5 * x1–1.8 * x2 + 4 * x3 + np.sin(x1) * 3 + noise
df = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "x3": x3,
    "y": y
})
X = df[["x1", "x2", "x3"]]
target = df["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.2, random_state=42
)

The key detail is noise_scale. As x1 grows, the target becomes more uncertain. A good probabilistic model should learn wider intervals in that region.

Step 2 — Train quantile models with LightGBM

LightGBM supports quantile loss directly. We train one model per quantile.

from lightgbm import LGBMRegressor

def train_quantile_model(alpha):
    model = LGBMRegressor(
        objective="quantile",
        alpha=alpha,
        n_estimators=300,
        learning_rate=0.05,
        num_leaves=31,
        min_child_samples=20,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42
    )
    model.fit(X_train, y_train)
    return model

q10_model = train_quantile_model(0.1)
q50_model = train_quantile_model(0.5)
q90_model = train_quantile_model(0.9)
pred_q10 = q10_model.predict(X_test)
pred_q50 = q50_model.predict(X_test)
pred_q90 = q90_model.predict(X_test)

Now we have three predictions for every row: lower bound, median, and upper bound. This is already enough to support many business decisions.

Step 3 — Check whether the interval is useful

A prediction interval is only useful if it is both calibrated and reasonably narrow. Calibration means the true value falls inside the interval at about the expected rate. For a 10%–90% interval, we want roughly 80% coverage.

import numpy as np

def interval_coverage(y_true, lower, upper):
    inside = (y_true >= lower) & (y_true <= upper)
    return np.mean(inside)

def mean_interval_width(lower, upper):
    return np.mean(upper - lower)

coverage = interval_coverage(y_test.values, pred_q10, pred_q90)
width = mean_interval_width(pred_q10, pred_q90)
print("80% interval coverage:", round(coverage, 4))
print("Mean interval width:", round(width, 4))

You should expect coverage near 0.8 if the model is well calibrated. If coverage is much lower, the intervals are too confident. If coverage is much higher, the intervals may be too wide to be useful.

Step 4 — Evaluate the median prediction separately

The median is often more robust than the mean when targets are skewed or noisy.

from sklearn.metrics import mean_absolute_error, median_absolute_error

mae = mean_absolute_error(y_test, pred_q50)
medae = median_absolute_error(y_test, pred_q50)
print("Median model MAE:", round(mae, 4))
print("Median model MedAE:", round(medae, 4))

This gives a point-performance view, while coverage and width give the uncertainty view. Both matter.

Step 5 — Handle quantile crossing

A practical issue with separate quantile models is crossing. Sometimes the predicted 10th percentile is higher than the 50th, or the 90th is lower than the 50th. That violates the ordering of quantiles.

A simple post-processing fix is enough for many applications.

pred_q10_fixed = np.minimum(pred_q10, pred_q50)
pred_q90_fixed = np.maximum(pred_q90, pred_q50)
pred_q10_fixed = np.minimum(pred_q10_fixed, pred_q90_fixed)
pred_q90_fixed = np.maximum(pred_q90_fixed, pred_q10_fixed)

coverage_fixed = interval_coverage(y_test.values, pred_q10_fixed, pred_q90_fixed)
width_fixed = mean_interval_width(pred_q10_fixed, pred_q90_fixed)
print("Fixed interval coverage:", round(coverage_fixed, 4))
print("Fixed mean interval width:", round(width_fixed, 4))

This monotonic correction is simple and effective. For stricter settings, you can use joint modeling or isotonic post-processing, but the simple fix is often enough in production pipelines.

Step 6 — Inspect how uncertainty changes across the feature space

One of the main advantages of quantile regression is local uncertainty. Let's look at interval width as x1 changes.

results = X_test.copy()
results["y_true"] = y_test.values
results["q10"] = pred_q10_fixed
results["q50"] = pred_q50
results["q90"] = pred_q90_fixed
results["interval_width"] = results["q90"] - results["q10"]

summary = results.groupby(pd.cut(results["x1"], bins=6))["interval_width"].mean()
print(summary)

If the model learned correctly, interval width should increase in higher x1 bins. That is exactly what a plain regression model cannot express by itself.

Why this works well in real tabular systems

Quantile boosting is useful because it stays close to the tools teams already trust. It fits naturally into feature-engineered tabular workflows.

It also avoids making assumptions like:

errors are Gaussian
variance is constant
uncertainty is symmetric

Those assumptions break often in business data.

Examples where quantile prediction is more useful than point prediction:

demand planning → plan for p50 and p90, not one number
pricing → estimate downside and upside outcomes
insurance → reserve against tail risk
operations → staff for uncertainty, not just average load
ETAs → show realistic ranges instead of unstable single times

A reusable predictor class

The pattern becomes easier to deploy if wrapped in a small class.

from lightgbm import LGBMRegressor
import numpy as np

class QuantileIntervalRegressor:
    def __init__(self, quantiles=(0.1, 0.5, 0.9), **model_params):
        self.quantiles = quantiles
        self.model_params = model_params
        self.models = {}

    def fit(self, X, y):
        for q in self.quantiles:
            model = LGBMRegressor(
                objective="quantile",
                alpha=q,
                **self.model_params
            )
            model.fit(X, y)
            self.models[q] = model
        return self

    def predict(self, X):
        preds = {q: self.models[q].predict(X) for q in self.quantiles}
        sorted_qs = sorted(self.quantiles)
        for i in range(1, len(sorted_qs)):
            prev_q = sorted_qs[i - 1]
            curr_q = sorted_qs[i]
            preds[curr_q] = np.maximum(preds[curr_q], preds[prev_q])
        return preds

model = QuantileIntervalRegressor(
    quantiles=(0.1, 0.5, 0.9),
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=31,
    min_child_samples=20,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print("q10 sample:", preds[0.1][:5])
print("q50 sample:", preds[0.5][:5])
print("q90 sample:", preds[0.9][:5])

This gives a compact interface that can be used in batch scoring or APIs.

How to evaluate quantile models correctly

A common mistake is to judge probabilistic models only with RMSE or MAE. Those metrics are incomplete here.

For quantile systems, track at least these:

Pinball loss for each quantile
Coverage of the target interval
Mean interval width
Coverage by segment, not only global coverage
Stability over time

Pinball loss is the standard objective for quantile prediction.

def pinball_loss(y_true, y_pred, q):
    error = y_true - y_pred
    return np.mean(np.maximum(q * error, (q - 1) * error))

for q, pred in [(0.1, pred_q10_fixed), (0.5, pred_q50), (0.9, pred_q90_fixed)]:
    loss = pinball_loss(y_test.values, pred, q)
    print(f"Pinball loss q={q}: {loss:.4f}")

Segment-level checks matter because a model can look calibrated overall while failing badly for specific groups. For example: one geography, one product line, one customer segment, one target range. Global averages can hide local failures.

A simple calibration improvement step

If your interval coverage is consistently off, you can recalibrate on a validation set. One practical method is to widen or shrink the interval around the median.

def recalibrate_interval(q10, q50, q90, scale):
    lower = q50 - scale * (q50 - q10)
    upper = q50 + scale * (q90 - q50)
    return lower, upper

best_scale = None
best_gap = float("inf")

for scale in np.arange(0.8, 1.31, 0.05):
    lower, upper = recalibrate_interval(pred_q10_fixed, pred_q50, pred_q90_fixed, scale)
    cov = interval_coverage(y_test.values, lower, upper)
    gap = abs(cov - 0.8)
    if gap < best_gap:
        best_gap = gap
        best_scale = scale

print("Best scale:", best_scale)
lower_cal, upper_cal = recalibrate_interval(
    pred_q10_fixed, pred_q50, pred_q90_fixed, best_scale
)
print("Calibrated coverage:", round(interval_coverage(y_test.values, lower_cal, upper_cal), 4))
print("Calibrated width:", round(mean_interval_width(lower_cal, upper_cal), 4))

In a real workflow, tune this scale on validation data and report final performance on a separate test set. That keeps the evaluation honest.

When this approach is the right choice

Use quantile boosting when:

your data is tabular
you need interpretable uncertainty bands
your target distribution is messy
you want a fast baseline that is hard to beat
you need something deployable without a research stack

It is a strong fit for many applied data science teams because it balances flexibility, speed, practicality, and business usefulness.

When it is not enough

Quantile regression is not a complete answer for every uncertainty problem. It may be limited when:

you need coherent multistep forecast distributions
you need joint uncertainty across multiple targets
you need uncertainty from very small datasets
you need causal interpretation rather than predictive intervals

In those cases, you may need conformal prediction, Bayesian models, distributional forests, or sequence-specific probabilistic methods. Still, for a large share of tabular regression work, quantile boosting is one of the highest-leverage upgrades you can make.

Final takeaway

A single prediction is often too small an answer for a real decision. Quantile regression gives you a practical way to move from "what is the number?" to "what is the likely range?"

That shift changes how models get used. Teams can plan around uncertainty instead of discovering it later. Products can expose confidence bands instead of pretending every estimate is equally certain.

And the implementation is not heavy. Three LightGBM models, a few evaluation functions, and a calibration check are enough to build a regression system that speaks in ranges instead of guesses.

If your current model predicts one number and your users still ask "how sure is that?", this is the layer to add next.