Stan Code for Bayesian Left-Censored Quantile Regression

Understanding the components & implementation in Stan for left-censored data

Key Highlights

Model Specification: The Stan code is organized into data, parameters, and model blocks, and handles left-censoring by adding the log-CDF of the response distribution, evaluated at the threshold, to the log density.
Censoring Implementation: Each left-censored observation contributes the probability that the response falls at or below the threshold (a CDF term) instead of a density term, so censored points inform the fit without being treated as exact values.
Quantile Regression: Bayesian quantile regression usually relies on an asymmetric likelihood, most often the asymmetric Laplace distribution, whose location parameter is the tau-th conditional quantile. A normal likelihood, as used in the example here for simplicity, targets the conditional mean instead.

Conceptual Overview

Bayesian left-censored quantile regression is a specialized statistical approach used when the response variable has a detection limit or a threshold below which values are not observed exactly. In these models, data points below a certain threshold (left-censored) are not discarded; instead, they are incorporated by explicitly modeling their probability of being in the censored region. Stan, as a probabilistic programming language, allows for flexible representation of such models by combining standard regression modeling with techniques that handle censoring.

A typical implementation in Stan has three primary blocks (the example further down also uses a small transformed data block):

Data Block

In the data block, you declare the inputs the model needs: the number of observations, the responses (with censored values recorded at or below the threshold), the predictor matrix, the censoring threshold, and the quantile level (typically denoted tau). Stan checks the supplied data against these declarations and their bounds before sampling starts.

Parameters Block

The parameters block defines the regression coefficients and any other parameters such as scale parameters or dispersion metrics. These coefficients represent the effect of predictors on the conditional quantile of the response variable.

Model Block

The model block is where the likelihood is defined for both censoring and non-censoring scenarios. Special functions like normal_lcdf (for left-censoring) or equivalent log-likelihood adaptations are employed to handle data points that fall below the censoring threshold. Prior distributions are also specified here, which help guide the Bayesian estimation process, especially when data are limited.

Stan Code Example

Below is an illustrative Stan code example that synthesizes ideas from various templates and examples. This template is a basic framework for a left-censored regression model with K predictors and a single, known censoring threshold, and it can be extended or modified as necessary.


// Stan model for Bayesian left-censored quantile regression
data {
  int<lower=0> N;                // Number of observations
  int<lower=0> K;                // Number of predictors
  vector[N] y;                   // Response variable (with censoring)
  matrix[N, K] X;                // Predictor matrix
  real c;                        // Left-censoring threshold
  real<lower=0, upper=1> tau;     // Quantile to estimate (e.g., 0.5 for median)
}

transformed data {
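  // Indicator for non-censored observations:
  // 1 if y > c (observed), 0 if y <= c (censored at the threshold)
  // Uses the current array syntax; recent Stan releases no longer accept `int delta[N];`
  array[N] int<lower=0, upper=1> delta;
  for (n in 1:N)
    delta[n] = (y[n] > c) ? 1 : 0;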
}

parameters {
  vector[K] beta;             // Regression coefficients
  real<lower=0> sigma;         // Scale parameter
  // Additional parameters (e.g., for asymmetric distributions) can be added here
}

model {
  // Prior distributions for the parameters
  beta ~ normal(0, 10);        // Diffuse prior for regression coefficients
  sigma ~ cauchy(0, 5);        // Half-Cauchy prior on the scale (sigma is constrained positive)
  
  // Likelihood specification for every observation
  for (n in 1:N) {
    if (delta[n] == 1) {
      // Non-censored observations contribute the density of the observed value.
      // A normal likelihood is used here for simplicity; it does not use tau and
      // targets the conditional mean. To target the tau-th quantile, replace it
      // with an asymmetric Laplace likelihood (and its CDF in the censored branch).
      y[n] ~ normal(dot_product(X[n], beta), sigma);
    } else {
      // For censored observations, add the log cumulative probability instead
      // Indicates that the actual value lies below the censoring threshold
      target += normal_lcdf(c | dot_product(X[n], beta), sigma);
    }
  }
}

In this example:

The data block defines the inputs, including the left-censoring threshold c and the quantile level tau (declared, but not used by the normal likelihood in this simplified version).
The transformed data block creates an indicator delta to flag whether an observation is above the censoring threshold.
The parameters block declares the regression coefficients beta and a scale parameter sigma.
In the model block, non-censored observations are modeled directly with a normal likelihood, while censored observations have their contribution represented by the log CDF (normal_lcdf) to account for the probability mass below the censoring point.

Detailed Explanation of Core Components

Handling Censored Data

When the response variable is left-censored, you know that the true value lies at or below the threshold c, but not what it is. The Stan model reflects this by using the normal_lcdf function to add the log of the cumulative probability up to c to the target. Each censored observation therefore contributes the probability of the event "y is at or below c" rather than the density at a specific value.
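
Written out, the likelihood that the example's model block builds up looks roughly as follows. This is a sketch under the example's own assumptions (normal errors, one known threshold c); the symbols x_n (the n-th row of X), \phi and \Phi (standard normal density and CDF), and \delta_n (the indicator delta from the transformed data block) are notation introduced here for illustration.

```latex
\log L(\beta, \sigma)
  = \sum_{n=1}^{N} \left[
      \delta_n \, \log\!\left\{ \frac{1}{\sigma}\,
          \phi\!\left( \frac{y_n - x_n^{\top}\beta}{\sigma} \right) \right\}
      + (1 - \delta_n) \, \log \Phi\!\left( \frac{c - x_n^{\top}\beta}{\sigma} \right)
    \right],
\qquad
\delta_n = \mathbb{1}[\, y_n > c \,].
```

The first term corresponds to the sampling statement for observed values, the second to the normal_lcdf increment for censored ones.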

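This treatment integrates the unknown censored values out of the likelihood: they are not sampled during Markov chain Monte Carlo (MCMC); only beta and sigma are. An alternative that does fill in the censored values is to declare them as parameters constrained to lie below c, which yields posterior draws for each of them.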
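
If posterior draws of the censored values themselves are wanted, one option is to treat them as latent parameters bounded above by c. The following is a minimal sketch of that variant, not a drop-in replacement for the example: it assumes the data have been split beforehand into observed and censored rows, and the names N_obs, N_cens, y_obs, X_obs, X_cens, and y_cens are hypothetical.

```stan
// Sketch: left-censored normal regression with the censored values sampled
// as latent parameters (data augmentation). All *_obs / *_cens names are
// illustrative assumptions, not part of the original example.
data {
  int<lower=0> N_obs;             // Number of fully observed responses
  int<lower=0> N_cens;            // Number of left-censored responses
  int<lower=0> K;                 // Number of predictors
  vector[N_obs] y_obs;            // Observed responses (all above c)
  matrix[N_obs, K] X_obs;         // Predictors for observed rows
  matrix[N_cens, K] X_cens;       // Predictors for censored rows
  real c;                         // Left-censoring threshold
}

parameters {
  vector[K] beta;                 // Regression coefficients
  real<lower=0> sigma;            // Scale parameter
  vector<upper=c>[N_cens] y_cens; // Latent censored values, constrained to lie below c
}

model {
  beta ~ normal(0, 10);
  sigma ~ cauchy(0, 5);

  // Observed and latent responses share the same regression model
  y_obs ~ normal(X_obs * beta, sigma);
  y_cens ~ normal(X_cens * beta, sigma);
}
```

Both formulations target the same posterior for beta and sigma. The latent-value version adds N_cens parameters, so the log-CDF version is usually the cheaper choice when the censored values are not of interest in themselves.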

Quantile Regression Considerations

Standard linear regression estimates the conditional mean of the response variable given predictor variables. In contrast, quantile regression estimates a specific quantile (e.g., median) of the conditional distribution. One common method is to employ an asymmetric likelihood, such as the asymmetric Laplace distribution, which directly targets the quantile of interest.
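
For reference, the asymmetric Laplace distribution (ALD) in the parameterization commonly used for Bayesian quantile regression can be written as below. The symbols \mu (location), \sigma (scale), and \rho_\tau (the check loss) are notation introduced here, and other scale conventions exist, so treat this as one consistent choice rather than the only one.

```latex
\rho_\tau(u) = u \,\bigl(\tau - \mathbb{1}[u < 0]\bigr),
\qquad
f(y \mid \mu, \sigma, \tau)
  = \frac{\tau (1 - \tau)}{\sigma}
    \exp\!\left\{ -\rho_\tau\!\left( \frac{y - \mu}{\sigma} \right) \right\},

F(y \mid \mu, \sigma, \tau) =
\begin{cases}
  \tau \exp\!\left\{ (1 - \tau)\, \dfrac{y - \mu}{\sigma} \right\}, & y \le \mu, \\[2ex]
  1 - (1 - \tau) \exp\!\left\{ -\tau\, \dfrac{y - \mu}{\sigma} \right\}, & y > \mu.
\end{cases}
```

Since F(\mu) = \tau, the location \mu is the tau-th quantile of the distribution. Setting \mu_n = x_n^{\top}\beta therefore makes the linear predictor the tau-th conditional quantile, and maximizing the ALD likelihood in \beta is equivalent to minimizing the summed check loss.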

In the provided example, a normal likelihood is used for clarity, so tau is declared but has no effect on the fit. A quantile regression implementation substitutes an asymmetric likelihood in which tau appears explicitly. The linear predictor X * beta is then the tau-th conditional quantile, and the resulting estimates are less sensitive to outliers and skewness than conditional-mean estimates.
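
One way to make the example an actual quantile regression is to swap the normal density and CDF for their asymmetric Laplace counterparts. The sketch below assumes the common ALD parameterization with density tau * (1 - tau) / sigma * exp(-rho_tau((y - mu) / sigma)); the helper functions ald_lpdf and ald_lcdf are hypothetical, user-defined names written for this illustration.

```stan
// Sketch: left-censored quantile regression with an asymmetric Laplace
// likelihood. ald_lpdf and ald_lcdf are hand-written helpers (assumed
// names), not built-in Stan functions.
functions {
  // Log density of the asymmetric Laplace distribution
  real ald_lpdf(real y, real mu, real sigma, real tau) {
    real u = (y - mu) / sigma;
    // Check loss: rho_tau(u) = u * (tau - 1) if u < 0, else u * tau
    real rho = (u < 0) ? u * (tau - 1) : u * tau;
    return log(tau) + log1m(tau) - log(sigma) - rho;
  }

  // Log CDF of the asymmetric Laplace distribution
  real ald_lcdf(real y, real mu, real sigma, real tau) {
    real u = (y - mu) / sigma;
    if (u <= 0) {
      return log(tau) + (1 - tau) * u;
    }
    return log1m((1 - tau) * exp(-tau * u));
  }
}

data {
  int<lower=0> N;                 // Number of observations
  int<lower=0> K;                 // Number of predictors
  vector[N] y;                    // Response (censored values recorded at or below c)
  matrix[N, K] X;                 // Predictor matrix
  real c;                         // Left-censoring threshold
  real<lower=0, upper=1> tau;     // Quantile level, strictly inside (0, 1) in practice
}

parameters {
  vector[K] beta;                 // Coefficients of the tau-th conditional quantile
  real<lower=0> sigma;            // ALD scale parameter
}

model {
  vector[N] mu = X * beta;        // tau-th conditional quantile for each row

  beta ~ normal(0, 10);
  sigma ~ cauchy(0, 5);

  for (n in 1:N) {
    if (y[n] > c) {
      // Observed: ALD log density at the observed value
      target += ald_lpdf(y[n] | mu[n], sigma, tau);
    } else {
      // Censored: log probability that the response is at or below c
      target += ald_lcdf(c | mu[n], sigma, tau);
    }
  }
}
```

Here tau enters both branches of the likelihood, so fitting the model at tau = 0.25 and again at tau = 0.75, for example, gives different coefficient estimates. The ALD is a working likelihood chosen for its link to the check loss, so posterior intervals from it should be interpreted with some care.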

Incorporating Priors

Bayesian frameworks allow the introduction of prior information. Diffuse normal priors for regression coefficients and heavy-tailed half-Cauchy priors for scale parameters (a Cauchy prior on a parameter constrained to be positive) are popular choices when there is limited prior information. How diffuse a prior such as normal(0, 10) really is depends on the scale of the predictors and the response. These priors regularize the estimation process, which matters most with small or noisy datasets.

In practice, these priors can be modified based on domain expertise or empirical evidence from past research.

Comprehensive Model Structure Comparison

The following table summarizes the structure and responsibilities of each section in the Stan code for left-censored quantile regression:

Block	Purpose	Key Components
Data Block	Input variables and censoring details	N (observations), K (predictors), y (response), X (predictor matrix), c (censoring threshold), tau (quantile)
Transformed Data Block	Pre-processing data for censoring	Indicator variable `delta` to flag if observation is censored or not
Parameters Block	Model parameters to be estimated	Regression coefficients `beta`, scale parameter `sigma`
Model Block	Likelihood and prior specification	Censored observations using `normal_lcdf`, non-censored via standard likelihood, prior distributions

Practical Considerations and Extensions

Utilizing Packages like brms

To simplify the process of writing and fitting these models, you might consider using the R package brms. It generates and fits Stan code from an R model formula, supports censored responses through the cens() term in the formula, and offers an asymmetric Laplace family (asym_laplace) for quantile regression.

Model Extensions

The basic template provided can be extended to more complex scenarios:

Including additional predictors or interactions, which only requires adding columns to the predictor matrix X and increasing K; beta is already declared as a vector of length K.
Adjusting for heteroscedasticity by allowing sigma to vary with predictors.
Utilizing different distributions (e.g., asymmetric Laplace) to more directly target quantiles.
Incorporating hierarchical structures if data are grouped or clustered, which can be implemented using multilevel models in Stan.

Implementing these extensions may require nuanced modifications to the likelihood and prior specifications, but the fundamental structure of data, parameters, and model blocks remains consistent.
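
As an illustration of one of these extensions, the sketch below lets the scale vary with the predictors through a log-linear model, keeping the normal likelihood of the basic template. The coefficient vector gamma and the choice to reuse X for the scale model are assumptions made for this example; a separate design matrix for the scale would work the same way.

```stan
// Sketch: left-censored normal regression with heteroscedastic errors.
// gamma and the log-linear scale model are illustrative assumptions.
data {
  int<lower=0> N;                 // Number of observations
  int<lower=0> K;                 // Number of predictors
  vector[N] y;                    // Response (censored values recorded at or below c)
  matrix[N, K] X;                 // Predictor matrix (include an intercept column)
  real c;                         // Left-censoring threshold
}

parameters {
  vector[K] beta;                 // Coefficients for the location
  vector[K] gamma;                // Coefficients for the log scale
}

model {
  vector[N] mu = X * beta;            // Location for each observation
  vector[N] sigma = exp(X * gamma);   // Positive scale for each observation

  beta ~ normal(0, 10);
  gamma ~ normal(0, 1);

  for (n in 1:N) {
    if (y[n] > c) {
      // Observed: density at the observed value
      target += normal_lpdf(y[n] | mu[n], sigma[n]);
    } else {
      // Censored: log probability of falling at or below c
      target += normal_lcdf(c | mu[n], sigma[n]);
    }
  }
}
```

The data, parameters, and model blocks keep the same roles as in the basic template; only the scale changes from a single parameter to a per-observation quantity.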