
Mirror Descent Algorithm: A Comprehensive Guide

Unlocking the Potential of Non-Euclidean Optimization


Key Takeaways

  • Adaptable Geometry: Mirror Descent leverages non-Euclidean geometries through the use of mirror maps, making it suitable for diverse optimization landscapes.
  • Bregman Divergence: It utilizes Bregman divergence instead of traditional Euclidean distance, enabling more efficient handling of constraints and structured domains.
  • Versatile Applications: Widely applicable in online learning, constrained convex optimization, and machine learning, offering improved convergence rates and flexibility.

1. Introduction to Mirror Descent

The Mirror Descent algorithm is a fundamental optimization technique that extends traditional gradient-based methods, such as Gradient Descent, to accommodate non-Euclidean geometries. Introduced to address the limitations of standard optimization approaches in structured or constrained spaces, Mirror Descent offers enhanced flexibility and efficiency in minimizing convex objective functions.

1.1. Origins and Evolution

Originally proposed by Nemirovski and Yudin in 1983, Mirror Descent has since become a cornerstone in the field of convex optimization and online learning. Its ability to adapt to various geometrical structures makes it a versatile tool in both theoretical and practical applications.


2. Fundamental Concepts

2.1. Primal and Dual Spaces

Mirror Descent operates within two interconnected spaces:

  • Primal Space: The original domain where the optimization problem is defined.
  • Dual Space: A transformed space where gradient-based updates are performed, facilitating more efficient optimization.

2.2. Mirror Map

The mirror map is a crucial component that defines the transformation between the primal and dual spaces. It is a strictly convex and differentiable function denoted as \( R(x) \), which shapes the geometry of the optimization landscape.

2.3. Bregman Divergence

Instead of using the traditional Euclidean distance, Mirror Descent employs Bregman divergence to measure the distance between points in the optimization space. Defined with respect to the mirror map, it provides a more nuanced measure suitable for non-Euclidean geometries.

The Bregman divergence between two points \( x \) and \( y \) is given by:

$$ D_R(x, y) = R(x) - R(y) - \langle \nabla R(y), x - y \rangle $$
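
As a quick numerical illustration, the sketch below evaluates a Bregman divergence directly from this definition (the function bregman_divergence and the lambda-based mirror map are illustrative helpers, not from any particular library):

import numpy as np

def bregman_divergence(R, grad_R, x, y):
    # D_R(x, y) = R(x) - R(y) - <grad R(y), x - y>
    return R(x) - R(y) - np.dot(grad_R(y), x - y)

# Squared Euclidean mirror map: the divergence reduces to 0.5 * ||x - y||^2
R = lambda x: 0.5 * np.dot(x, x)
grad_R = lambda x: x

x = np.array([1.0, 2.0])
y = np.array([0.0, 1.0])
print(bregman_divergence(R, grad_R, x, y))  # prints 1.0 = 0.5 * ||x - y||_2^2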


3. The Mirror Descent Algorithm

3.1. Algorithm Steps

To minimize a convex function \( f(x) \) over a convex set \( \mathcal{X} \), Mirror Descent follows an iterative procedure:

Step 1: Initialization

  • Select an initial point \( x_1 \in \mathcal{X} \).
  • Choose an appropriate mirror map \( R(x) \) based on the problem's geometry.

Step 2: Iterative Updates

For each iteration \( t \), perform the following:

  1. Gradient Computation: Calculate the gradient of the objective function at the current point: \( g_t = \nabla f(x_t) \).
  2. Dual Space Mapping: Map the current primal point to the dual space using the mirror map: \( y_t = \nabla R(x_t) \).
  3. Dual Update: Perform a gradient descent step in the dual space: \( y_{t+1} = y_t - \eta g_t \), where \( \eta > 0 \) is the learning rate.
  4. Primal Space Mapping: Map back to the primal space using the gradient of the convex conjugate: \( \tilde{x}_{t+1} = \nabla R^*(y_{t+1}) \), where \( R^* \) is the convex conjugate of \( R \).
  5. Projection: If \( \tilde{x}_{t+1} \) falls outside \( \mathcal{X} \), project it back via the Bregman projection: \( x_{t+1} = \arg \min_{x \in \mathcal{X}} D_R(x, \tilde{x}_{t+1}) \).

Step 3: Convergence

Repeat the iterative steps until convergence criteria are met, such as a predetermined number of iterations or a minimal change in successive iterates.
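
Equivalently, steps 2 through 5 collapse into a single proximal-style update, the form most commonly used in convergence analyses:

$$ x_{t+1} = \arg \min_{x \in \mathcal{X}} \left\{ \eta \langle g_t, x \rangle + D_R(x, x_t) \right\} $$

Minimizing a linearization of \( f \) plus a Bregman proximity term recovers standard Gradient Descent when \( R(x) = \frac{1}{2} \|x\|_2^2 \) and the exponentiated gradient method when \( R \) is the negative entropy.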

3.2. Duality Perspective

Viewing Mirror Descent through the lens of duality provides deeper insights:

  • Dual Gradient Step: The algorithm performs gradient descent in the dual space, leveraging the mirror map's geometry.
  • Primal Recovery: After the dual update, the solution is mapped back to the primal space, ensuring feasibility and adherence to constraints.

4. Mirror Map Selection

4.1. Importance of the Mirror Map

The choice of the mirror map influences the algorithm's performance by tailoring the optimization process to the problem's geometry. An appropriate mirror map can lead to faster convergence and better handling of constraints.

4.2. Common Mirror Maps

Euclidean Mirror Map

For standard Gradient Descent scenarios, the mirror map is typically the squared Euclidean norm:

$$ R(x) = \frac{1}{2} \|x\|_2^2 $$

With this choice, the Bregman divergence simplifies to the squared Euclidean distance:

$$ D_R(x, y) = \frac{1}{2} \|x - y\|_2^2 $$

In this case, Mirror Descent reduces to traditional Gradient Descent.
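
To verify this, substitute \( \nabla R(x) = x \) into the update steps; the squared Euclidean norm is self-conjugate, so \( \nabla R^* \) is also the identity:

$$ y_t = x_t, \qquad y_{t+1} = x_t - \eta g_t, \qquad x_{t+1} = \nabla R^*(y_{t+1}) = x_t - \eta g_t $$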

Negative Entropy Mirror Map

For optimization over the probability simplex or similar constrained domains, the negative entropy function is a popular choice:

$$ R(x) = \sum_{i=1}^n x_i \log(x_i) $$

The corresponding Bregman divergence is the Kullback–Leibler (KL) divergence, making it suitable for probability distributions and leading to algorithms like the exponentiated gradient.
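
Concretely, composing the mirror descent steps with this map yields the multiplicative (exponentiated gradient) update:

$$ x_{t+1,i} = \frac{x_{t,i} \, e^{-\eta g_{t,i}}}{\sum_{j=1}^n x_{t,j} \, e^{-\eta g_{t,j}}}, \qquad i = 1, \dots, n $$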

Other Mirror Maps

Depending on the problem's structure, other mirror maps can be employed, such as:

  • Logarithmic Barrier: Useful for interior point methods in constrained optimization.
  • Mahalanobis Distance: Adapted for problems with specific covariance structures, as sketched below.
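
For instance, a quadratic mirror map \( R(x) = \frac{1}{2} x^\top A x \) with a symmetric positive definite matrix \( A \) yields the Mahalanobis-type divergence and, in the unconstrained case, a preconditioned gradient step:

$$ D_R(x, y) = \frac{1}{2} (x - y)^\top A (x - y), \qquad x_{t+1} = x_t - \eta A^{-1} g_t $$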

5. Comparative Analysis: Mirror Descent vs. Gradient Descent

| Feature | Gradient Descent | Mirror Descent |
|---|---|---|
| Geometry | Euclidean | Non-Euclidean |
| Distance measure | Euclidean distance | Bregman divergence |
| Projection handling | Can be costly, depends on the problem | Handled naturally via the mirror map |
| Adaptability | Less adaptable to structured domains | Highly adaptable to various geometries |
| Applications | General convex optimization | Structured convex optimization, online learning |

This table highlights the key differences between Gradient Descent and Mirror Descent, underscoring the latter's flexibility and suitability for a broader range of optimization problems.


6. Practical Example: Optimization over a Probability Simplex

Consider the task of optimizing a function over the probability simplex, where each variable represents a probability, and the sum equals one. Traditional Gradient Descent may struggle with such constraints, but Mirror Descent excels by leveraging the negative entropy mirror map.

6.1. Setup

Let \( \mathcal{X} \) be the probability simplex:

$$ \mathcal{X} = \left\{ x \in \mathbb{R}^n : x_i \geq 0, \sum_{i=1}^n x_i = 1 \right\} $$

6.2. Negative Entropy Mirror Map

Choose the mirror map as the negative entropy:

$$ R(x) = \sum_{i=1}^n x_i \log(x_i) $$

The corresponding Bregman divergence is the generalized KL divergence:

$$ D_R(x, y) = \sum_{i=1}^n x_i \log\left(\frac{x_i}{y_i}\right) - \sum_{i=1}^n x_i + \sum_{i=1}^n y_i $$

When both \( x \) and \( y \) lie on the simplex, the last two sums cancel and this reduces to the standard KL divergence.

6.3. Mirror Descent Update Steps

  1. Gradient Computation: Calculate \( g_t = \nabla f(x_t) \).
  2. Dual Mapping: Map to dual space: \( y_t = \nabla R(x_t) = 1 + \log(x_t) \).
  3. Dual Update: Perform gradient step: \( y_{t+1} = y_t - \eta g_t \).
  4. Primal Mapping: Map back to primal space: \( x_{t+1} = \exp(y_{t+1} - 1) \) (element-wise exponentiation).
  5. Projection: Normalize so that the entries of \( x_{t+1} \) sum to one, placing it back on the simplex (the steps combine into the compact update sketched after this list).
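
Composing steps 2 through 5 gives the compact update promised above; here is a minimal NumPy sketch (simplex_md_step is an illustrative name):

import numpy as np

def simplex_md_step(x, g, eta):
    # One mirror descent step on the simplex with the negative entropy map:
    # exp(grad R(x) - eta * g - 1) = x * exp(-eta * g), then normalize
    w = x * np.exp(-eta * g)
    return w / w.sum()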

6.4. Advantages in This Context

  • Natural Constraint Handling: The exponential update keeps every coordinate strictly positive, so feasibility requires only a cheap normalization rather than a Euclidean projection onto the simplex.
  • Efficient Updates: Each iteration is a closed-form multiplicative update followed by a normalization, costing \( O(n) \) per step.
  • Improved Convergence: The entropic geometry can improve the dependence on the dimension \( n \) from polynomial to logarithmic relative to Euclidean projected gradient descent.

7. Applications of Mirror Descent

7.1. Online Learning and Prediction

In online learning scenarios where data arrives sequentially, Mirror Descent is adept at minimizing regret, the difference between the algorithm's cumulative loss and that of the best fixed decision in hindsight. Its adaptability to changing data distributions enhances its effectiveness in dynamic environments.
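
Under standard assumptions (convex losses and a mirror map that is 1-strongly convex with respect to a norm \( \|\cdot\| \)), mirror descent satisfies a regret bound of the form

$$ \sum_{t=1}^T f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^T f_t(x) \;\leq\; \frac{D_R(x^*, x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|_*^2 $$

where \( \|\cdot\|_* \) denotes the dual norm and \( x^* \) the best fixed comparator; tuning \( \eta \) yields the familiar \( O(\sqrt{T}) \) regret.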

7.2. Convex Optimization

Mirror Descent is widely used in large-scale convex optimization problems, especially those with structured constraints. Its ability to handle non-Euclidean geometries makes it suitable for diverse applications, from resource allocation to signal processing.

7.3. Machine Learning

Within machine learning, Mirror Descent facilitates training models with complex constraints, such as those involving probability distributions or sparsity. It's particularly useful in scenarios where traditional optimization methods falter due to the problem's geometric intricacies.

7.4. Regularized Learning

By incorporating specific regularizers through the mirror map, Mirror Descent can enforce desirable properties in the learned models, such as smoothness or sparsity, enhancing model generalization and interpretability.


8. Pseudocode and Implementation

8.1. Pseudocode


# Mirror Descent Pseudocode
def mirror_descent(grad_f, grad_R, grad_R_star, project, x0, eta, T):
    x = x0
    for t in range(1, T + 1):
        g_t = grad_f(x)            # gradient at the current point
        y = grad_R(x)              # map primal point to the dual space
        y_new = y - eta * g_t      # gradient step in the dual space
        x = grad_R_star(y_new)     # map back to the primal space
        x = project(x)             # Bregman projection onto X
    return x

8.2. Python Implementation Example


import numpy as np

def mirror_descent(f_grad, mirror_map_grad, mirror_map_star_grad, project, x0, eta, T):
    x = x0
    for t in range(T):
        g_t = f_grad(x)                  # gradient of the objective at x
        y = mirror_map_grad(x)           # map primal point to the dual space
        y_new = y - eta * g_t            # gradient step in the dual space
        x = mirror_map_star_grad(y_new)  # map back to the primal space
        x = project(x)                   # project onto the feasible set
    return x

# Example usage for the simplex with the negative entropy mirror map
def f_grad(x):
    # Gradient of the example objective f(x) = sum_i x_i * (log(x_i) - 1);
    # the small constant guards against log(0)
    return np.log(x + 1e-8)

def mirror_map_grad(x):
    # grad R(x) = 1 + log(x) for R(x) = sum_i x_i * log(x_i)
    return 1 + np.log(x + 1e-8)

def mirror_map_star_grad(y):
    # grad R*(y) = exp(y - 1), the inverse of grad R
    return np.exp(y - 1)

def project(x):
    # Normalization suffices here: the exponential map keeps every entry positive
    return x / np.sum(x)

x0 = np.ones(10) / 10  # uniform distribution as the starting point
eta = 0.1              # step size
T = 100                # number of iterations
optimal_x = mirror_descent(f_grad, mirror_map_grad, mirror_map_star_grad, project, x0, eta, T)
print(optimal_x)

8.3. Explanation

This Python snippet demonstrates how to implement the Mirror Descent algorithm for optimizing over a probability simplex using the negative entropy mirror map. The helper functions supply the gradient of the objective, the gradient of the mirror map, the gradient of its convex conjugate (the inverse map), and the normalization that projects back onto the simplex.


9. Advantages of Mirror Descent

9.1. Adaptability

The flexibility in choosing different mirror maps allows Mirror Descent to seamlessly adapt to various problem geometries, making it effective for a wide range of applications.

9.2. Efficient Constraint Handling

By utilizing Bregman divergence, Mirror Descent naturally incorporates constraints into the optimization process, often eliminating the need for explicit and potentially costly projection steps.

9.3. Improved Convergence Rates

For certain classes of problems, especially those with structured constraints, Mirror Descent can achieve faster convergence rates compared to traditional Gradient Descent, enhancing computational efficiency.

9.4. Regret Minimization in Online Settings

In online learning and iterative decision-making scenarios, Mirror Descent effectively minimizes regret, ensuring that the algorithm's performance remains competitive over time.


10. Conclusion

The Mirror Descent algorithm stands out as a versatile and powerful optimization tool, particularly suited for scenarios where traditional Gradient Descent methods fall short due to geometric constraints or the need for adaptability. Its foundation on the concepts of mirror maps and Bregman divergence allows it to navigate complex optimization landscapes with efficiency and precision. Whether applied to online learning, constrained convex optimization, or machine learning model training, Mirror Descent offers a robust framework for achieving optimal solutions in diverse and structured environments.

