Bayesian left-censored quantile regression is a specialized statistical approach used when the response variable has a detection limit or a threshold below which values are not observed exactly. In these models, data points below a certain threshold (left-censored) are not discarded; instead, they are incorporated by explicitly modeling their probability of being in the censored region. Stan, as a probabilistic programming language, allows for flexible representation of such models by combining standard regression modeling with techniques that handle censoring.
The typical implementation in Stan involves defining three primary sections:
In the data block, you explicitly declare the variables needed for modeling, such as the number of observations, the observed responses (with censoring), the predictor matrix, the censoring threshold, and the quantile level (denoted typically by tau). This block sets up the foundation by formatting your dataset for Stan's algorithm.
The parameters block defines the regression coefficients and any other parameters such as scale parameters or dispersion metrics. These coefficients represent the effect of predictors on the conditional quantile of the response variable.
The model block is where the likelihood is defined for both censoring and non-censoring scenarios. Special functions like normal_lcdf (for left-censoring) or equivalent log-likelihood adaptations are employed to handle data points that fall below the censoring threshold. Prior distributions are also specified here, which help guide the Bayesian estimation process, especially when data are limited.
Below is an illustrative Stan template for a left-censored regression model with K predictors. It uses a normal likelihood for clarity and can be extended or modified as necessary.
```stan
// Stan model for Bayesian left-censored quantile regression
data {
  int<lower=0> N;               // Number of observations
  int<lower=0> K;               // Number of predictors
  vector[N] y;                  // Response variable (with censoring)
  matrix[N, K] X;               // Predictor matrix
  real c;                       // Left-censoring threshold
  real<lower=0, upper=1> tau;   // Quantile to estimate (e.g., 0.5 for median)
}
transformed data {
  // Create an indicator for non-censored observations
  // 1 if y > c (observed), 0 if y <= c (censored)
  array[N] int delta;           // array syntax requires Stan 2.26 or later
  for (n in 1:N)
    delta[n] = (y[n] > c) ? 1 : 0;
}
parameters {
  vector[K] beta;               // Regression coefficients
  real<lower=0> sigma;          // Scale parameter
  // Additional parameters (e.g., for asymmetric distributions) can be added here
}
model {
  // Prior distributions for the parameters to incorporate existing beliefs
  beta ~ normal(0, 10);         // Diffuse prior for regression coefficients
  sigma ~ cauchy(0, 5);         // Half-Cauchy prior (sigma is constrained to be positive)
  // Likelihood specification for every observation
  for (n in 1:N) {
    if (delta[n] == 1) {
      // For non-censored observations, use the regular likelihood
      // Here, a normal distribution is used; in practice, an asymmetric likelihood might be more appropriate
      y[n] ~ normal(dot_product(X[n], beta), sigma);
      // Note: tau does not enter this normal likelihood. A quantile-specific
      // formulation replaces it with an asymmetric Laplace log density.
    } else {
      // For censored observations, add the log cumulative probability instead
      // Indicates that the actual value lies below the censoring threshold
      target += normal_lcdf(c | dot_product(X[n], beta), sigma);
    }
  }
}
```
In this example:
The data block defines the inputs, including the left-censoring threshold c and the quantile level tau.
The transformed data block creates an indicator delta that flags whether an observation lies above the censoring threshold.
The parameters block declares the regression coefficients beta and a scale parameter sigma.
In the model block, non-censored observations are modeled directly with a normal likelihood, while censored observations contribute the log CDF (normal_lcdf), which accounts for the probability mass below the censoring point.
When the response variable is left-censored, you know that the true value is somewhere below the threshold c but not the exact value. The Stan model reflects this by using the normal_lcdf function to compute the cumulative probability up to c. This approach ensures the likelihood incorporates the uncertainty associated with censored observations.
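Written out, the likelihood that the template's model block encodes has one factor per observation. The notation below is an assumption introduced for illustration: $\delta_i$ is the indicator from the transformed data block, $x_i$ is the $i$-th row of X, and $\phi$ and $\Phi$ are the standard normal density and CDF.

$$
L(\beta, \sigma) = \prod_{i=1}^{N}
\left[ \frac{1}{\sigma}\, \phi\!\left( \frac{y_i - x_i^\top \beta}{\sigma} \right) \right]^{\delta_i}
\left[ \Phi\!\left( \frac{c - x_i^\top \beta}{\sigma} \right) \right]^{1 - \delta_i}
$$

The logarithm of the second factor is the quantity that normal_lcdf(c | mu, sigma) adds to the target for each censored observation.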
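This treatment marginalizes over the unknown censored values: their contribution is integrated out analytically through the CDF rather than sampled. If the censored values themselves are of interest, the alternative is to declare them as parameters bounded above by c, in which case they are effectively “filled in” during the Markov Chain Monte Carlo (MCMC) process.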
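A minimal sketch of a variant that samples the censored values explicitly, rather than integrating them out with normal_lcdf, might look as follows. It assumes the data have been split beforehand into observed and censored rows; the names N_obs, N_cens, y_obs, X_obs, X_cens, and y_cens are hypothetical and do not appear in the template.

```stan
data {
  int<lower=0> N_obs;                // Number of fully observed responses (assumed pre-split)
  int<lower=0> N_cens;               // Number of left-censored responses
  int<lower=0> K;                    // Number of predictors
  vector[N_obs] y_obs;               // Observed responses (all above c)
  matrix[N_obs, K] X_obs;            // Predictors for observed rows
  matrix[N_cens, K] X_cens;          // Predictors for censored rows
  real c;                            // Left-censoring threshold
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
  vector<upper=c>[N_cens] y_cens;    // Latent censored values, constrained to lie below c
}
model {
  beta ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  y_obs ~ normal(X_obs * beta, sigma);
  // The upper bound on y_cens truncates these draws to the censored region
  y_cens ~ normal(X_cens * beta, sigma);
}
```

Both formulations target the same posterior for beta and sigma. The latent-value version adds one parameter per censored observation, which can slow sampling when many values are censored, but it yields posterior draws of the censored responses directly.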
Standard linear regression estimates the conditional mean of the response variable given predictor variables. In contrast, quantile regression estimates a specific quantile (e.g., median) of the conditional distribution. One common method is to employ an asymmetric likelihood, such as the asymmetric Laplace distribution, which directly targets the quantile of interest.
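For reference, the asymmetric Laplace density is built from the check (pinball) loss used in classical quantile regression. The symbols $\mu$, $\sigma$, and $\rho_\tau$ below are notation assumed for illustration:

$$
\rho_\tau(u) = u \,\bigl( \tau - \mathbb{1}\{u < 0\} \bigr),
\qquad
f(y \mid \mu, \sigma, \tau) = \frac{\tau (1 - \tau)}{\sigma}
\exp\!\left\{ -\rho_\tau\!\left( \frac{y - \mu}{\sigma} \right) \right\}
$$

Under this density $P(Y \le \mu) = \tau$, so setting $\mu = x^\top \beta$ makes the linear predictor the $\tau$-th conditional quantile, and maximizing the likelihood in $\beta$ for fixed $\sigma$ is equivalent to minimizing the summed check loss.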
In the provided example, a normal likelihood is used for clarity, so the model targets the conditional mean and the declared tau does not enter the likelihood. A quantile-specific implementation substitutes an asymmetric likelihood such as the asymmetric Laplace, in which tau determines which conditional quantile the linear predictor represents. Estimating the median or another quantile in this way is less sensitive to outliers and skewness than estimating the mean.
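One way to make the template quantile-specific is sketched below, under the assumption that the asymmetric Laplace distribution is used as the working likelihood. The functions ald_lpdf and ald_lcdf are hypothetical user-defined helpers written for this illustration, not part of the original template, and tau is assumed to lie strictly between 0 and 1.

```stan
functions {
  // Log density of the asymmetric Laplace distribution; mu is its tau-th quantile
  real ald_lpdf(real y, real mu, real sigma, real tau) {
    real u = (y - mu) / sigma;
    real rho = (u < 0) ? (tau - 1) * u : tau * u;   // Check loss
    return log(tau) + log1m(tau) - log(sigma) - rho;
  }
  // Log CDF of the same distribution, needed for left-censored observations
  real ald_lcdf(real y, real mu, real sigma, real tau) {
    real u = (y - mu) / sigma;
    if (u <= 0)
      return log(tau) + (1 - tau) * u;
    return log1m_exp(log1m(tau) - tau * u);
  }
}
data {
  int<lower=0> N;
  int<lower=0> K;
  vector[N] y;
  matrix[N, K] X;
  real c;
  real<lower=0, upper=1> tau;       // Assumed strictly inside (0, 1)
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu = X * beta;          // Conditional tau-th quantile
  beta ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  for (n in 1:N) {
    if (y[n] > c)
      target += ald_lpdf(y[n] | mu[n], sigma, tau);   // Observed value
    else
      target += ald_lcdf(c | mu[n], sigma, tau);      // Probability mass below c
  }
}
```

Here tau enters both the density and the CDF, so changing it shifts which conditional quantile beta describes. The asymmetric Laplace is a working likelihood chosen to target that quantile, not a claim about the true error distribution.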
Bayesian frameworks allow the introduction of prior information. Diffuse normal priors for regression coefficients and heavy-tailed half-Cauchy priors for scale parameters (a Cauchy prior on a parameter constrained to be positive) are popular choices when there is limited prior information. These priors regularize the estimation process, which matters most with small or noisy datasets.
In practice, these priors can be modified based on domain expertise or empirical evidence from past research.
The following table summarizes the structure and responsibilities of each section in the Stan code for left-censored quantile regression:
| Block | Purpose | Key Components |
|---|---|---|
| Data Block | Input variables and censoring details | N (observations), K (predictors), y (response), X (predictor matrix), c (censoring threshold), tau (quantile) |
| Transformed Data Block | Pre-processing data for censoring | Indicator variable delta to flag if observation is censored or not |
| Parameters Block | Model parameters to be estimated | Regression coefficients beta, scale parameter sigma |
| Model Block | Likelihood and prior specification | Censored observations using normal_lcdf, non-censored via standard likelihood, prior distributions |
To simplify the process of writing and fitting these models, you might consider using the R package brms. This package provides a more intuitive formula interface for specifying Bayesian models with Stan under the hood. It supports censored responses through the cens() term in the model formula and offers an asymmetric Laplace family, asym_laplace(), for quantile regression.
The basic template provided can be extended to more complex scenarios:
Extend the predictor matrix X and, where needed, the parameters block.
Allow the scale parameter sigma to vary with predictors.
Implementing these extensions may require nuanced modifications to the likelihood and prior specifications, but the fundamental structure of data, parameters, and model blocks remains consistent.
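As an illustration of letting sigma vary with predictors, the sketch below models the log of the scale as a linear function of a second design matrix. The names J, Z, and omega are hypothetical, the normal likelihood of the basic template is kept, and the prior on omega is a placeholder, not a recommendation.

```stan
data {
  int<lower=0> N;
  int<lower=0> K;
  int<lower=1> J;                   // Number of scale predictors (hypothetical)
  vector[N] y;
  matrix[N, K] X;
  matrix[N, J] Z;                   // Design matrix for the log scale (hypothetical)
  real c;
}
parameters {
  vector[K] beta;
  vector[J] omega;                  // Coefficients of the log-scale model
}
model {
  vector[N] mu = X * beta;
  vector[N] sigma = exp(Z * omega); // The log link keeps every scale positive
  beta ~ normal(0, 10);
  omega ~ normal(0, 2);             // Placeholder prior on the log-scale coefficients
  for (n in 1:N) {
    if (y[n] > c)
      y[n] ~ normal(mu[n], sigma[n]);
    else
      target += normal_lcdf(c | mu[n], sigma[n]);
  }
}
```

Including a column of ones in Z gives the scale model an intercept, and with Z consisting only of that column the model reduces to the constant-sigma template.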