Optimization and Gradient Descent on Riemannian Manifolds

Differential geometry can be seen as a generalization of calculus on Riemannian manifolds. Objects in calculus such as gradient, Jacobian, and Hessian on are adapted on arbitrary Riemannian manifolds. This fact let us also generalize one of the most ubiquitous problem in calculus: the optimization problem. The implication of this generalization is far-reaching: We can make a more general and thus flexible assumption regarding the domain of our optimization, which might fit real-world problems better or has some desirable properties.

In this article, we will focus on the most popular optimization there is, esp. in machine learning: the gradient descent method. We will begin with a review on the optimization problem of a real-valued function on , which we should have been familiar with. Next, we will adapt the gradient descent method to make it work in optimization problem of a real-valued function on an arbitrary Riemannian manifold . Lastly, we will discuss how the natural gradient descent method can be seen from this perspective, instead of purely from the second-order optimization point-of-view.

Optimization problem and the gradient descent

Let be the usual Euclidean space (i.e. a Riemannian manifold where ) and let be a real-valued function. An (unconstrained) optimization problem on this space has the form

That is we would like to find a point such that is the minimum of .

One of the most popular numerical method for solving this problem is the gradient descent method. Its algorithm is as follows.

Algorithm 1 (Euclidean gradient descent).

Pick arbitrary and let with
While the stopping criterion is not satisfied:
1. Compute the gradient of at , i.e.
2. Move in the direction of , i.e.
Return

The justification of the gradient descent method is because of the fact that the gradient is the direction in which is increasing fastest. Its negative therefore points to the direction of steepest descent.

Proposition 1. Let be a real-valued function on and . Among all unit vector , the gradient of at is the direction in which the directional derivative is greatest. Furthermore, equals to the value of the directional derivative in that direction.

Proof. First, note that, by our assumption, . By definition of the directional derivative and dot product on ,

where is the angle between and . As and , the above expression is maximized whenever . This implies that the particular vector that maximizes the directional derivative points in the same direction as . Furthermore, plugging in into the above equation, we get

Thus, the length of is equal to the value of .

Gradient descent on Riemannian manifolds

Remark. _We shall use the Einstein summation convention: Repeated indices above and below are implied to be summed, e.g. and . By convention the index in is thought to be a lower index._

We now want to break the confine of the Euclidean space. We would like to generalize the gradient descent algorithm on a function defined on a Riemannian manifold. Based on Algorithm 1, at least there are two parts of the algorithm that we need to adapt, namely, (i) the gradient of and (ii) the way we move between points on .

Suppose is a -dimensional Riemannian manifold. Let be a real-valued function (scalar field) defined on . Then, the optimization problem on simply has the form

Although it seems innocent enough (we only replace with from the Euclidean version), some difficulties exist.

First, we shall discuss about the gradient of on . By definition, is a vector field on , i.e. and at each , is a tangent vector in . Let the differential of be a one one-form, which, in given coordinates , has the form

Then, the gradient of is obtained by raising an index of . That is,

and in coordinates, it has the expression

At any , given , it is characterized by the following equation:

That is, pointwise, the inner product of the gradient and any tangent vector is the action of derivation on . We can think of this action as taking directional derivative of in the direction . Thus, we have the analogue of Proposition 1 on Riemannian manifolds.

Proposition 2. Let be a Riemannian manifold and be a real-valued function on and . Among all unit vector , the gradient of at is the direction in which the directional derivative is greatest. Furthermore, equals to the value of the directional derivative in that direction.

Proof. We simply note that by definition of inner product induced by , we have

where is again the angle between and . Using the characteristic of we have discussed above and by substituting for in the proof of Proposition 1, we immediately get the desired result.

Proposition 2 therefore provides us with a justification of just simply substituting the Euclidean gradient with the Riemannian gradient in Algorithm 1.

To make this concrete, we do the computation in coordinates. In coordinates, we can represent by a row vector (i.e. a sequence of numbers in the sense of linear algebra) containing all partial derivatives of :

Given the matrix representation of the metric tensor in coordinates, the gradient of is represented by a column vector , such that

Example 1. (Euclidean gradient in coordinates). Notice that in the Euclidean case, , thus it is represented by an identity matrix , in coordinates. Therefore the Euclidean gradient is simply

The second modification to Algorithm 1 that we need to find the analogue of is the way we move between points on . Notice that, at each , the way we move between points in the Euclidean gradient descent is by following a straight line in the direction . We know by triangle inequality that straight line is the path with shortest distance between points in .

On Riemannian manifolds, we move between points by the means of curves. There exist a special kind of curve , where is an interval, that are “straight” between two points on , in the sense that the covariant derivative of the velocity vector along the curve itself, at any time is . The intuition is as follows: Although if we look at on from the outsider’s point-of-view, it is not straight (i.e. it follows the curvature of ), as far as the inhabitants of are concerned, is straight, as its velocity vector (its direction and length) is the same everywhere along . Thus, geodesics are the generalization of straight lines on Riemannian manifolds.

For any and , we can show that there always exists a geodesic starting at with initial velocity , denoted by . Furthermore, if we can rescale any geodesic by

and thus we can define a map by

called the exponential map. The exponential map is the generalization of “moving straight in the direction ” on Riemannian manifolds.

Example 2. (Exponential map on a sphere). Let be a sphere embedded in with radius . The shortest path between any pair of points on the sphere can be found by following the great circle connecting them.

Let and be arbitrary. The curve given by

is a geodesic, as its image is the great circle formed by the intersection of with the linear subspace of spanned by . Therefore the exponential map on is given by

Given the exponential map, our modification to Algorithm 1 is complete, which we show in Algorithm 2. The new modifications from Algorithm 1 are in blue.

Algorithm 2 (Riemannian gradient descent).

Pick arbitrary . Let with
While the stopping criterion is not satisfied:
1. Compute the gradient of at , i.e.
2. Move in the direction , i.e.
Return

Approximating the exponential map

In general, the exponential map is difficult to compute, as to compute a geodesic, we have to solve a system of second-order ODE. Therefore, for a computational reason, we would like to approximate the exponential map with cheaper alternative.

Let be arbitrary. We define a map called the retraction map, by the following properties:

The second property is called the local rigidity condition and it preserves gradients at . In particular, the exponential map is a retraction. Furthermore, if denotes the Riemannian distance and , retraction can be seen as a first-order approximation of the exponential map, in the sense that

On an arbitrary embedded submanifold , if and , viewing to be a point on the ambient manifold and to be a point on the ambient tangent space , we can compute by (i) moving along to get and then (ii) project the point back to .

Example 3. (Retraction on a sphere). Let be a sphere embedded in with radius . The retraction on any and is defined by

Therefore, the Riemannian gradient descent in Algorithm 2 can be modified to be

Algorithm 3 (Riemannian gradient descent with retraction).

Pick arbitrary . Let with
While the stopping criterion is not satisfied:
1. Compute the gradient of at , i.e.
2. Move in the direction , i.e.
Return

Natural gradient descent

One of the most important applications of the Riemannian gradient descent in machine learning is for doing optimization of statistical manifolds. We define a statistical manifold to be the set corresponding to the set of parameter of a statistical model , equipped with metric tensor which is the Fisher information metric, given by

The most common objective function in the optimization problem on a statistical manifold is the expected log-likelihood function of our statistical model. That is, given a dataset , the objective is given by .

The metric tensor is represented by matrix , called the Fisher information matrix. The Riemannian gradient in this manifold is therefore can be represented by a column vector . Furthermore, as the manifold is , the construction of the retraction map we have discussed previously tells us that we can simply do addition for any and . This is well defined as there is a natural isomorphism between and . All in all, the gradient descent in this manifold is called natural gradient descent and is presented in Algorithm 4 below.

Algorithm 4 (Natural gradient descent).

Pick arbitrary . Let with
While the stopping criterion is not satisfied:
1. Compute the gradient of at , i.e.
2. Move in the direction , i.e.
Return

Conclusion

Optimization in Riemannian manifold is an interesting and important application in the field of geometry. It generalizes the optimization methods from Euclidean spaces onto Riemannian manifolds. Specifically, in the gradient descent method, adapting it to a Riemannian manifold requires us to use the Riemannian gradient as the search direction and the exponential map or retraction to move between points on the manifold.

One major difficulty exists: Computing and storing the matrix representation of the metric tensor are very expensive. Suppose the manifold is -dimensional. Then, the size of is in and the complexity of inverting it is in . In machine learning, could be in the order of million, so a naive implementation is infeasible. Thankfully, many approximations of the metric tensor, especially for the Fisher information metric exist (e.g. [7]). Thus, even with these difficulties, the Riemannian gradient descent or its variants have been successfully applied on many areas, such as in inference problems [8], word or knowledge graph embeddings [9], etc.

References

Lee, John M. “Smooth manifolds.” Introduction to Smooth Manifolds. Springer, New York, NY, 2013. 1-31.
Lee, John M. Riemannian manifolds: an introduction to curvature. Vol. 176. Springer Science & Business Media, 2006.
Fels, Mark Eric. “An Introduction to Differential Geometry through Computation.” (2016).
Absil, P-A., Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
Boumal, Nicolas. Optimization and estimation on manifolds. Diss. Catholic University of Louvain, Louvain-la-Neuve, Belgium, 2014.
Graphics: https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane.
Martens, James, and Roger Grosse. “Optimizing neural networks with kronecker-factored approximate curvature.” International conference on machine learning. 2015.
Patterson, Sam, and Yee Whye Teh. “Stochastic gradient Riemannian Langevin dynamics on the probability simplex.” Advances in neural information processing systems. 2013.
Suzuki, Atsushi, Yosuke Enokida, and Kenji Yamanishi. “Riemannian TransE: Multi-relational Graph Embedding in Non-Euclidean Space.” (2018).