Flow matching generative models represent a significant advance in generative modeling, offering a powerful and efficient framework for transforming a simple prior distribution into a complex target distribution. The transformation is an invertible flow, rooted in the principles of normalizing flows, but flow matching introduces novel training techniques that improve the efficiency, convergence, and sample quality of the learning process.
At their core, flow matching generative models learn a transformation that carries samples from a simple base distribution to samples that closely resemble those from a complex target distribution. The transformation is parameterized by a neural network, which is trained by regressing onto a target velocity field rather than by directly minimizing a divergence between the transformed base distribution and the target distribution.
Generative models aim to learn the underlying probability distribution p(x) of a dataset and generate new samples that resemble the data. Normalizing flows achieve this by transforming a simple base distribution pz(z) (e.g., Gaussian) into a complex target distribution p(x) through a series of invertible transformations f. The key property of normalizing flows is that they allow for exact likelihood computation via the change of variables formula:
\[ p(x) = p_z(f^{-1}(x)) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right| \]
Here, f⁻¹(x) maps the data point x back to the latent space, and the Jacobian determinant accounts for the change in volume under the transformation.
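As a concrete illustration of the change of variables formula, consider a one-dimensional affine flow x = f(z) = az + b with a standard Gaussian base. The following sketch (NumPy/SciPy; the affine parameters are arbitrary choices for illustration) evaluates p(x) via the formula and checks it against the known closed form N(x; b, a²):

```python
import numpy as np
from scipy.stats import norm

# Toy invertible flow: x = f(z) = a*z + b, with standard Gaussian base p_z(z).
a, b = 2.0, 1.0                    # arbitrary affine parameters (illustration only)
f_inv = lambda x: (x - b) / a      # f^{-1}(x)
jac_f_inv = 1.0 / a                # d f^{-1}(x) / dx  (the 1-D "Jacobian")

x = 3.0
# Change of variables: p(x) = p_z(f^{-1}(x)) * |det d f^{-1}(x)/dx|
p_x = norm.pdf(f_inv(x)) * abs(jac_f_inv)

# For an affine map of a standard Gaussian, x ~ N(b, a^2), so the result can be verified:
print(p_x, norm.pdf(x, loc=b, scale=abs(a)))   # both print ~0.1210
```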
Continuous Normalizing Flows (CNFs) extend normalizing flows by replacing the discrete sequence of transformations with a continuous-time process that maps a simple distribution (e.g., a standard Gaussian) to a complex data distribution. The transformation is defined by an ordinary differential equation (ODE), which guarantees the continuity and invertibility of the flow: a time-dependent vector field fθ(x, t) governs the evolution of data points:
\[ \frac{dx}{dt} = f_\theta(x, t) \]
This leads to the following probability density evolution, governed by the instantaneous change of variables formula:
\[ \frac{\partial \log p(x, t)}{\partial t} = -\nabla_x \cdot f_\theta(x, t) \]
CNFs are trained by minimizing the negative log-likelihood (NLL) of the data, which requires integrating the ODE, together with the divergence term above, for both the forward and backward transformations; this makes likelihood-based training computationally expensive.
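To make the two equations above concrete, the following sketch integrates a toy CNF with forward Euler steps. The vector field is linear, f(x, t) = Ax, so its divergence is simply trace(A) and the instantaneous change of variables can be accumulated exactly; the matrix A and the step count are arbitrary illustrative choices, not part of any published model.

```python
import numpy as np

# Toy linear vector field f(x, t) = A @ x; its divergence is trace(A).
A = np.array([[-0.5, 0.3],
              [ 0.0, -0.2]])       # arbitrary illustrative dynamics

def euler_cnf(x0, log_p0, n_steps=100, t0=0.0, t1=1.0):
    """Integrate dx/dt = f(x, t) and d(log p)/dt = -div f with forward Euler."""
    dt = (t1 - t0) / n_steps
    x, log_p = x0.copy(), log_p0
    for _ in range(n_steps):
        x = x + dt * (A @ x)                 # dx/dt = f(x, t)
        log_p = log_p - dt * np.trace(A)     # d log p / dt = -div f(x, t)
    return x, log_p

# Push a base sample forward in time and track its log-density.
rng = np.random.default_rng(0)
z = rng.standard_normal(2)
log_pz = -0.5 * (z @ z) - np.log(2 * np.pi)  # log N(z; 0, I) in two dimensions
x1, log_px1 = euler_cnf(z, log_pz)
print(x1, log_px1)
```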
The core of flow matching lies in the concept of a velocity field. The velocity field vt(x) defines the direction and speed at which the points in the distribution move over time. The evolution of the distribution is governed by the following ODE:
\[ \frac{d\mathbf{x}(t)}{dt} = \mathbf{v}_t(\mathbf{x}(t)) \]
where x(t) is the position at time t, and vt(x(t)) is the velocity field at time t and position x(t).
The flow matching objective is to learn a velocity field vt(x) that matches a target velocity field ut(x), which generates the desired probability path pt(x). This is formulated as a regression problem, where the loss function is defined as:
\[ L_{\text{FM}}(\theta) = \mathbb{E}_{t \sim U[0, 1], \mathbf{x} \sim p_t(\mathbf{x})} \left[ \|\mathbf{v}_t(\mathbf{x}; \theta) - \mathbf{u}_t(\mathbf{x})\|^2 \right] \]
Here, θ represents the learnable parameters of the neural network that approximates the velocity field vt(x; θ), and U[0, 1] is a uniform distribution over the time interval [0, 1].
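In practice, vt(x; θ) is usually a single neural network that takes the current state x and the time t as inputs. The sketch below shows one minimal way to parameterize it in PyTorch (an MLP applied to the concatenation of x and t); the architecture and layer sizes are illustrative assumptions rather than a prescribed design.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Minimal v_t(x; theta): an MLP applied to the concatenation [x, t]."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # x: (batch, dim), t: (batch,) with values in [0, 1]
        return self.net(torch.cat([x, t[:, None]], dim=-1))
```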
Flow matching can be extended to conditional distributions using Conditional Flow Matching (CFM). In CFM, each data sample x1 is associated with a conditional probability path pt(x | x1). This path starts from a simple distribution (e.g., a standard Gaussian) at t = 0 and converges to a distribution concentrated around x1 at t = 1:
\[ p_t(\mathbf{x} | \mathbf{x}_1) = \mathcal{N}(\mathbf{x} | \mu_t(\mathbf{x}_1), \sigma_t^2(\mathbf{x}_1) \mathbf{I}) \]
where μt(x1) and σt(x1) are time-dependent mean and standard deviation, respectively. The CFM loss is then defined as:
\[ L_{\text{CFM}}(\theta) = \mathbb{E}_{t \sim U[0, 1], \mathbf{x} \sim p_t(\mathbf{x} | \mathbf{x}_1)} \left[ \|\mathbf{v}_t(\mathbf{x}; \theta) - \mathbf{u}_t(\mathbf{x} | \mathbf{x}_1)\|^2 \right] \]
This formulation allows the model to learn conditional velocity fields that generate conditional distributions.
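For Gaussian conditional paths of this form, the conditional velocity field is available in closed form; this is the standard result from the flow matching literature that makes the CFM regression target tractable:

\[ \mathbf{u}_t(\mathbf{x} \mid \mathbf{x}_1) = \frac{\sigma_t'(\mathbf{x}_1)}{\sigma_t(\mathbf{x}_1)} \big( \mathbf{x} - \mu_t(\mathbf{x}_1) \big) + \mu_t'(\mathbf{x}_1) \]

where primes denote derivatives with respect to t. Unlike the marginal field ut(x) appearing in the FM loss, this conditional target can be evaluated exactly for every sampled pair (t, x1).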
Flow Matching operates on a predefined probability path pt(x) that interpolates between the base distribution p0(x) and the target distribution p1(x). A common choice is the path induced by a (nearly) linear interpolation of samples, given by the conditional flow map:
\[ \phi_t(x \mid x_1) = (1 - (1 - \sigma_{\text{min}})t)x + tx_1 \]
Here, σmin is a small constant setting the residual standard deviation retained around the target at t = 1, so the path ends in a narrow Gaussian around x1 rather than a point mass, and x1 represents a sample from the target distribution.
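The following sketch evaluates this interpolation and its time derivative, which serves as the regression target in the conditional loss below; for this path the derivative simplifies to x1 − (1 − σmin)x. The tensor shapes and the value of σmin are illustrative assumptions.

```python
import torch

sigma_min = 1e-2   # illustrative value for the minimum noise level

def interpolate(x0, x1, t):
    """phi_t(x0 | x1) = (1 - (1 - sigma_min) * t) * x0 + t * x1."""
    t = t[:, None]                                   # broadcast over feature dims
    return (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1

def target_velocity(x0, x1):
    """d/dt phi_t(x0 | x1) = x1 - (1 - sigma_min) * x0 (independent of t)."""
    return x1 - (1.0 - sigma_min) * x0
```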
The vector field fθ(x, t) is trained to match the true vector field ut(x) that generates the probability path pt(x). This vector field satisfies the continuity equation, which ensures conservation of probability mass; in practice it is specified through its conditional counterpart, which along the conditional flow map equals the time derivative of the flow:
\[ u_t(\phi_t(x \mid x_1) \mid x_1) = \frac{d}{dt}\, \phi_t(x \mid x_1) \]
The Flow Matching loss is defined as the mean squared error (MSE) between the learned and true vector fields:
\[ L_{\text{FM}}(\theta) = \mathbb{E}_{t, p_t(x)} \left[ \| f_\theta(x, t) - u_t(x) \|^2 \right] \]
In conditional generative modeling, the goal is to model p(x | y), where y is a conditional variable. The conditional probability path is defined as:
\[ p_t(x \mid y) = \int p_t(x \mid x_1)p_D(x_1 \mid y) dx_1 \]
The corresponding conditional Flow Matching loss is:
\[ L_{\text{CFM}}(\theta) = \mathbb{E}_{t,\, p_D(x_1 \mid y),\, p_t(x \mid x_1)} \left[ \| f_\theta(x, t) - u_t(x \mid x_1) \|^2 \right] \]
This loss ensures that the learned vector field aligns with the conditional vector field ut(x | x1), enabling conditional generation.
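Putting the pieces together, a single conditional Flow Matching training step might look like the sketch below, using the linear path and target from the earlier snippet and a velocity network assumed, purely for illustration, to take the state, the time, and the condition y as separate arguments; the batching, conditioning mechanism, and optimizer usage are likewise illustrative assumptions rather than a fixed recipe.

```python
import torch

def cfm_training_step(model, optimizer, x1, y, sigma_min=1e-2):
    """One CFM step: regress model(x_t, t, y) onto u_t(x_t | x_1) along the linear path."""
    x0 = torch.randn_like(x1)                        # sample from the base distribution
    t = torch.rand(x1.shape[0], device=x1.device)    # t ~ U[0, 1]

    # x_t = phi_t(x0 | x1) and its time derivative (the regression target).
    xt = (1.0 - (1.0 - sigma_min) * t[:, None]) * x0 + t[:, None] * x1
    target = x1 - (1.0 - sigma_min) * x0

    pred = model(xt, t, y)                           # conditional velocity prediction
    loss = ((pred - target) ** 2).mean()             # mean-squared CFM loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```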
The key mechanism in flow matching is therefore regression of the vector field: a neural network vt(x; θ) is trained to approximate the target velocity field ut(x) by minimizing the FM loss (or, in practice, its conditional variant).
Flow matching allows for the construction of various probability paths, including Gaussian paths and Optimal Transport (OT) paths. For Gaussian paths, the marginal probability path is a mixture of simple conditional paths of the form:
\[ p_t(\mathbf{x} | \mathbf{x}_1) = \mathcal{N}(\mathbf{x} | \mu_t(\mathbf{x}_1), \sigma_t^2(\mathbf{x}_1) \mathbf{I}) \]
where μt(x1) and σt(x1) are time-dependent mean and standard deviation, respectively. For OT paths, the velocity field corresponds to an OT displacement interpolant, which results in straight-line trajectories and faster training.
Flow Matching often employs Optimal Transport (OT) paths to define the probability interpolation. OT paths minimize the transportation cost between the base and target distributions, leading to straighter trajectories and faster convergence. In the limit σmin → 0 of the interpolation above, the OT path reduces to:
\[ \phi_t(x \mid x_1) = (1 - t)x + tx_1 \]
Flow Matching improves efficiency by avoiding the need to simulate the ODE and its density evolution during training. Instead, it directly optimizes the vector field using the Flow Matching loss, which results in faster convergence during training and reduced computational overhead during sampling.
Flow Matching allows for efficient sampling by parameterizing the vector field directly. This reduces the number of function evaluations (NFE) required for sampling, making it competitive with state-of-the-art methods like diffusion models.
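Once the velocity field is trained, sampling reduces to integrating the ODE from t = 0 to t = 1, and the number of integration steps is exactly the NFE. The sketch below uses plain Euler steps with the VelocityField interface assumed earlier; the step count is an arbitrary illustrative choice, and higher-order solvers can trade accuracy against NFE.

```python
import torch

@torch.no_grad()
def sample(model, n_samples, dim, n_steps=50, device="cpu"):
    """Generate samples by Euler-integrating dx/dt = v_t(x; theta) from t = 0 to t = 1."""
    x = torch.randn(n_samples, dim, device=device)   # x(0) ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((n_samples,), i * dt, device=device)
        x = x + dt * model(x, t)                     # one Euler step; NFE = n_steps
    return x
```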
Local Flow Matching (LFM) is an extension of FM that learns a sequence of FM sub-models, each matching a diffusion process up to a certain step size. This approach allows for the use of smaller models with faster training and is particularly effective for unconditional and conditional generation tasks. LFM also enables the use of distillation techniques to speed up generation.
Flow matching models have been successfully applied to large-scale generative modeling tasks. They offer computational efficiency and greater theoretical clarity compared to other methods like diffusion models. For instance, FM models have been used in tasks such as generating images on datasets like ImageNet, demonstrating faster training and better performance.
Flow Matching has been successfully applied to image generation tasks, demonstrating competitive performance on datasets like CIFAR-10 and ImageNet. By leveraging OT paths, Flow Matching achieves high sample quality (low FID scores) while maintaining efficiency.
Conditional Flow Matching has been used for super-resolution tasks, where the goal is to generate high-resolution images from low-resolution inputs. The conditional probability paths enable the model to focus on relevant features, improving the quality of generated images.
Flow Matching is well-suited for conditional generative modeling, such as text-to-image generation or class-conditional image synthesis. The conditional Flow Matching loss ensures that the generated samples align with the given conditions.
Flow Matching has been explored for modeling structured data, such as graphs or time series. The flexibility of the probability paths and vector fields allows it to adapt to diverse data modalities.
In addition to image and tabular data generation, flow matching models have been applied to the conditional generation of robotic manipulation policies. The stepwise structure of LFM makes it natural for distillation, which can significantly speed up the generation process.
Flow matching models have also been analyzed in terms of their memorization and generalization behavior. It has been shown that, under the exactly optimal velocity field, the generated samples memorize the real data points, so the model faithfully represents the subspace spanned by the sample data. This analysis provides insight into the geometry of the generation paths induced by the velocity field.
Flow Matching is typically evaluated using metrics such as negative log-likelihood (NLL), Fréchet Inception Distance (FID) for sample quality, and the number of function evaluations (NFE) required for sampling.
Studies have shown that Flow Matching with OT paths consistently outperforms alternative training strategies in terms of NLL, FID, and NFE.
Flow matching generative models offer a powerful and efficient framework for transforming simple prior distributions into complex target distributions. By regressing a velocity field that drives an ODE, they provide a principled and efficient way to train CNFs, achieving computational efficiency, theoretical clarity, and robust training. Extensions such as Conditional Flow Matching and Local Flow Matching broaden their applicability, and the use of OT paths further improves sample quality and sampling speed, making flow matching a competitive alternative to traditional methods like GANs, VAEs, and diffusion models.
For a comprehensive understanding, readers should consult the detailed mathematical formulations and the specific applications developed in the flow matching literature.