The Mirror Descent algorithm is a fundamental optimization technique that extends traditional gradient-based methods, such as Gradient Descent, to accommodate non-Euclidean geometries. Introduced to address the limitations of standard optimization approaches in structured or constrained spaces, Mirror Descent offers enhanced flexibility and efficiency in minimizing convex objective functions.
Originally proposed by Nemirovski and Yudin in 1983, Mirror Descent has since become a cornerstone in the field of convex optimization and online learning. Its ability to adapt to various geometrical structures makes it a versatile tool in both theoretical and practical applications.
Mirror Descent operates within two interconnected spaces:

- **Primal space:** where the iterates \( x_t \) and the feasible set \( \mathcal{X} \) live.
- **Dual space:** the space of gradients, where the actual gradient step is taken before mapping back to the primal space.
The mirror map is a crucial component that defines the transformation between the primal and dual spaces. It is a strictly convex and differentiable function denoted as \( R(x) \), which shapes the geometry of the optimization landscape.
Instead of the traditional Euclidean distance, Mirror Descent measures proximity between iterates with the Bregman divergence. Defined with respect to the mirror map, it provides a distance-like measure tailored to non-Euclidean geometries.
The Bregman divergence between two points \( x \) and \( y \) is given by:
$$ D_R(x, y) = R(x) - R(y) - \langle \nabla R(y), x - y \rangle $$
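As a minimal illustrative sketch (the callables `R` and `grad_R` here are generic placeholders, not part of any particular library), the divergence can be computed directly from this definition:

```python
import numpy as np

def bregman_divergence(R, grad_R, x, y):
    """D_R(x, y) = R(x) - R(y) - <grad_R(y), x - y>."""
    return R(x) - R(y) - np.dot(grad_R(y), x - y)

# Sanity check: for R(x) = 0.5 * ||x||^2, the divergence is half the
# squared Euclidean distance.
R = lambda x: 0.5 * np.dot(x, x)
grad_R = lambda x: x
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman_divergence(R, grad_R, x, y))  # 1.0, i.e. 0.5 * ||x - y||^2
```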
To minimize a convex function \( f(x) \) over a convex set \( \mathcal{X} \), Mirror Descent follows an iterative procedure:
For each iteration \( t \), perform the following steps:

1. Compute the gradient \( g_t = \nabla f(x_t) \).
2. Map the current iterate into the dual space: \( y_t = \nabla R(x_t) \).
3. Take a gradient step in the dual space: \( y_{t+1} = y_t - \eta_t g_t \).
4. Map back to the primal space via the conjugate map: \( \tilde{x}_{t+1} = \nabla R^*(y_{t+1}) \).
5. Project onto \( \mathcal{X} \) with the Bregman projection: \( x_{t+1} = \arg\min_{x \in \mathcal{X}} D_R(x, \tilde{x}_{t+1}) \).
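These steps can be folded into a single proximal update, which is the form most analyses use:

$$ x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\{ \eta_t \langle g_t, x \rangle + D_R(x, x_t) \right\} $$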
Repeat the iterative steps until convergence criteria are met, such as a predetermined number of iterations or a minimal change in successive iterates.
Viewing Mirror Descent through the lens of duality provides deeper insights: the gradient \( \nabla f(x) \) naturally lives in the dual space, so adding it directly to a primal point (as Gradient Descent does) implicitly identifies the two spaces. Mirror Descent makes this identification explicit, using \( \nabla R \) to carry a primal iterate into the dual space, performing the gradient step there, and returning via \( \nabla R^* \), the gradient of the convex conjugate of \( R \).
The choice of the mirror map influences the algorithm's performance by tailoring the optimization process to the problem's geometry. An appropriate mirror map can lead to faster convergence and better handling of constraints.
For standard Gradient Descent scenarios, the mirror map is half the squared Euclidean norm:
$$ R(x) = \frac{1}{2} \|x\|_2^2 $$
With this choice, the Bregman divergence simplifies to the squared Euclidean distance:
$$ D_R(x, y) = \frac{1}{2} \|x - y\|_2^2 $$
In this case, Mirror Descent reduces to traditional Gradient Descent.
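Indeed, with \( \nabla R(x) = x \) and \( \nabla R^*(y) = y \), the dual-space step collapses to the familiar update:

$$ x_{t+1} = x_t - \eta_t \nabla f(x_t) $$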
For optimization over the probability simplex or similar constrained domains, the negative entropy function is a popular choice:
$$ R(x) = \sum_{i=1}^n x_i \log(x_i) $$
The corresponding Bregman divergence is the Kullback–Leibler (KL) divergence, making it suitable for probability distributions and leading to algorithms like the exponentiated gradient.
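Concretely, the resulting update has a closed multiplicative form, the exponentiated gradient update:

$$ x_{t+1,i} = \frac{x_{t,i}\, e^{-\eta_t g_{t,i}}}{\sum_{j=1}^n x_{t,j}\, e^{-\eta_t g_{t,j}}} $$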
Depending on the problem's structure, other mirror maps can be employed, such as:

- **Squared \( p \)-norms**, \( R(x) = \frac{1}{2} \|x\|_p^2 \) with \( 1 < p \leq 2 \), suited to high-dimensional problems with sparsity structure.
- **Matrix (von Neumann) entropy**, \( R(X) = \sum_i \lambda_i(X) \log \lambda_i(X) \), for optimization over density matrices and spectral constraints.
| Feature | Gradient Descent | Mirror Descent |
|---|---|---|
| Geometry | Euclidean | Non-Euclidean |
| Distance measure | Euclidean distance | Bregman divergence |
| Projection handling | Potentially inefficient, depends on problem | Natural handling via mirror map |
| Adaptability | Less adaptable to structured domains | Highly adaptable to various geometries |
| Applications | General convex optimization | Structured convex optimization, online learning |
This table highlights the key differences between Gradient Descent and Mirror Descent, underscoring the latter's flexibility and suitability for a broader range of optimization problems.
Consider the task of optimizing a function over the probability simplex, where each variable represents a probability, and the sum equals one. Traditional Gradient Descent may struggle with such constraints, but Mirror Descent excels by leveraging the negative entropy mirror map.
Let \( \mathcal{X} \) be the probability simplex:
$$ \mathcal{X} = \left\{ x \in \mathbb{R}^n : x_i \geq 0, \sum_{i=1}^n x_i = 1 \right\} $$
Choose the mirror map as the negative entropy:
$$ R(x) = \sum_{i=1}^n x_i \log(x_i) $$
The corresponding Bregman divergence is the KL divergence:
$$ D_R(x, y) = \sum_{i=1}^n x_i \log\left(\frac{x_i}{y_i}\right) - \sum_{i=1}^n x_i + \sum_{i=1}^n y_i $$

When both \( x \) and \( y \) lie on the simplex, the last two sums cancel and \( D_R \) reduces to the standard KL divergence \( \sum_i x_i \log(x_i / y_i) \).
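As a quick numerical sanity check (a minimal sketch, not part of the algorithm itself), one can verify that this generalized KL expression agrees with the Bregman definition \( D_R(x, y) = R(x) - R(y) - \langle \nabla R(y), x - y \rangle \) for the negative entropy:

```python
import numpy as np

R = lambda x: np.sum(x * np.log(x))   # negative entropy
grad_R = lambda x: 1 + np.log(x)      # its gradient

def generalized_kl(x, y):
    return np.sum(x * np.log(x / y)) - np.sum(x) + np.sum(y)

rng = np.random.default_rng(0)
x = rng.random(5); x /= x.sum()       # random point on the simplex
y = rng.random(5); y /= y.sum()

bregman = R(x) - R(y) - np.dot(grad_R(y), x - y)
print(np.isclose(bregman, generalized_kl(x, y)))  # True
```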
In online learning scenarios where data arrives sequentially, Mirror Descent is adept at minimizing regret, the difference between the algorithm's cumulative loss and that of the best fixed decision in hindsight. Its adaptability to changing data distributions enhances its effectiveness in dynamic environments.
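Under the standard assumption that the mirror map \( R \) is 1-strongly convex with respect to a norm \( \|\cdot\| \), online Mirror Descent with a fixed step size \( \eta \) enjoys the classical regret bound

$$ \sum_{t=1}^T f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^T f_t(x) \;\leq\; \frac{D_R(x^\star, x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|_*^2 $$

where \( \|\cdot\|_* \) is the dual norm and \( x^\star \) is the best fixed decision in hindsight; tuning \( \eta \) yields the familiar \( O(\sqrt{T}) \) rate.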
Mirror Descent is widely used in large-scale convex optimization problems, especially those with structured constraints. Its ability to handle non-Euclidean geometries makes it suitable for diverse applications, from resource allocation to signal processing.
Within machine learning, Mirror Descent facilitates training models with complex constraints, such as those involving probability distributions or sparsity. It's particularly useful in scenarios where traditional optimization methods falter due to the problem's geometric intricacies.
By incorporating specific regularizers through the mirror map, Mirror Descent can enforce desirable properties in the learned models, such as smoothness or sparsity, enhancing model generalization and interpretability.
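For example, a squared \( p \)-norm mirror map with \( p \) close to 1 yields updates better suited to sparse, high-dimensional problems than the Euclidean setup. Below is a minimal sketch under stated assumptions; the helper name `make_pnorm_mirror` is illustrative, and nonzero iterates are assumed:

```python
import numpy as np

def make_pnorm_mirror(p):
    """Gradient maps for R(x) = 0.5 * ||x||_p^2 and its conjugate.

    Assumes 1 < p <= 2 and nonzero vectors (the maps are undefined at 0).
    """
    q = p / (p - 1)  # dual exponent, satisfying 1/p + 1/q = 1

    def grad_R(x):
        norm = np.linalg.norm(x, p)
        return np.sign(x) * np.abs(x) ** (p - 1) * norm ** (2 - p)

    def grad_R_star(y):
        norm = np.linalg.norm(y, q)
        return np.sign(y) * np.abs(y) ** (q - 1) * norm ** (2 - q)

    return grad_R, grad_R_star

# grad_R and grad_R_star are mutual inverses:
grad_R, grad_R_star = make_pnorm_mirror(p=1.5)
x = np.array([0.5, -0.2, 0.1])
print(np.allclose(grad_R_star(grad_R(x)), x))  # True
```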
The general algorithm can be summarized in pseudocode:

```python
# Mirror Descent pseudocode
def mirror_descent(grad_f, grad_R, grad_R_star, project, X, eta, T, x0):
    x = x0
    for t in range(1, T + 1):
        g_t = grad_f(x)          # gradient at the current iterate
        y = grad_R(x)            # map to the dual space
        y_new = y - eta * g_t    # gradient step in the dual space
        x = grad_R_star(y_new)   # map back to the primal space
        x = project(X, x)        # Bregman projection onto X
    return x
```
A concrete NumPy instantiation for the probability simplex with the negative entropy mirror map:

```python
import numpy as np

def mirror_descent(f_grad, mirror_map_grad, mirror_map_star_grad, project, x0, eta, T):
    x = x0
    for t in range(T):
        g_t = f_grad(x)                   # gradient of the objective
        y = mirror_map_grad(x)            # map iterate to the dual space
        y_new = y - eta * g_t             # dual-space gradient step
        x = mirror_map_star_grad(y_new)   # map back to the primal space
        x = project(x)                    # project onto the feasible set
    return x

# Example usage for the simplex with negative entropy
def f_grad(x):
    # Gradient of the example objective f(x) = sum_i x_i * (log x_i - 1);
    # the small constant guards against log(0)
    return np.log(x + 1e-8)

def mirror_map_grad(x):
    # Gradient of R(x) = sum_i x_i * log x_i
    return 1 + np.log(x + 1e-8)

def mirror_map_star_grad(y):
    # Gradient of the conjugate R*(y) = sum_i exp(y_i - 1)
    return np.exp(y - 1)

def project(x):
    # The KL (Bregman) projection of a positive vector onto the simplex
    # is simple normalization
    return x / np.sum(x)

x0 = np.ones(10) / 10   # uniform starting point
eta = 0.1
T = 100
optimal_x = mirror_descent(f_grad, mirror_map_grad, mirror_map_star_grad, project, x0, eta, T)
print(optimal_x)
```
This Python snippet demonstrates how to implement the Mirror Descent algorithm for optimizing over a probability simplex using the negative entropy mirror map. The functions define the gradient of the objective function, the mirror map, its conjugate, and the projection onto the simplex.
The flexibility in choosing different mirror maps allows Mirror Descent to seamlessly adapt to various problem geometries, making it effective for a wide range of applications.
By utilizing the Bregman divergence, Mirror Descent naturally incorporates constraints into the optimization process, often replacing costly Euclidean projections with cheap Bregman projections (on the simplex, for instance, a simple normalization).
For certain classes of problems, especially those with structured constraints, Mirror Descent achieves better dependence on the problem dimension than traditional Gradient Descent; over the \( n \)-dimensional simplex, for instance, the negative entropy setup improves the bound's dimension factor from \( O(\sqrt{n}) \) to \( O(\sqrt{\log n}) \).
In online learning and iterative decision-making scenarios, Mirror Descent effectively minimizes regret, ensuring that the algorithm's performance remains competitive over time.
The Mirror Descent algorithm stands out as a versatile and powerful optimization tool, particularly suited for scenarios where traditional Gradient Descent methods fall short due to geometric constraints or the need for adaptability. Its foundation on the concepts of mirror maps and Bregman divergence allows it to navigate complex optimization landscapes with efficiency and precision. Whether applied to online learning, constrained convex optimization, or machine learning model training, Mirror Descent offers a robust framework for achieving optimal solutions in diverse and structured environments.