Agustinus Kristiadi

Volume Forms and Probability Density Functions Under Change of Variables

Wed, 13 Dec 2023 00:00:00 -0500

Suppose we have $\R^n$ equipped with the Cartesian coordinates, which represent a point in $\R^n$ by an $n$-tuple of numbers $x = (x^1, \dots, x^n)$ via the identity function $\mathrm{Id}_{\R^n}$. This works because $\R^n$ itself is already defined as the space of tuples of $n$ numbers. (Note that the superscript in $x^i$ is an index, not a power; we write e.g. $(x^i)^2$ if we need to take the power.)

Here are some interesting objects to study in this setting.

Riemannian Metrics

In $\R^n$, we usually have the standard Euclidean inner product $\inner{v, w} = \sum_{i=1}^n v^i w^i$ where $v = (v^1, \dots, v^n)$ and $w = (w^1, \dots, w^n)$ are two vectors. More generally, we can write an inner product in terms of an inner product matrix $G$ as $\inner{v, w} = v^\top G w$.

The matrix $G$, which is symmetric positive definite, is called (the matrix representation of) a Riemannian metric. In the case of the Euclidean inner product, we have $G = I$, the identity $n \times n$ matrix.

Volume Forms

Another interesting object is the volume form $dx$. This is a differential form of degree $n$, meaning that it takes $n$ vectors as arguments and returns a number. There is a deeper meaning in the notation, but for the purpose of this post, it suffices to say that $dx$ measures the volume of a parallelepiped spanned by $n$ vectors. Indeed, the evaluation $dx(v_1, \dots, v_n)$ on vectors $v_1, \dots, v_n$ is obtained by computing the determinant of the matrix resulting from stacking the tuples $v_1, \dots, v_n$. An important fact is that if $f: \R^n \to \R$ is any continuous function on $\R^n$, then $f \, dx$ is also a volume form.

The Riemannian metric $G$ and the volume form $dx$ can be combined to obtain a special volume form

\[dV_G = \sqrt{\abs{\det{G}}} \, dx\]

called the Riemannian volume form. In the case of $\R^n$ with the Cartesian coordinates and the standard dot product, $G = I$, so, $dV_G = dx$ is a special case. The idea here is that non-identity $G$’s “distort” the Cartesian grids and thus the volume changes proportionally to the distortion. For this reason, $dV_G$ is the natural volume form for any choice of metric and any manifold in general. Indeed, technically speaking, it is the unique volume form that evaluates to one on parallelepipeds spanned by orthonormal basis vectors.

Volume Forms and Measures

A non-negative volume form $f \, dx$ induces a measure via $\mu(A) = \int_A f \, dx$ for any Borel-measurable subset $A$ of $\R^n$. One can then see that $dx$ is the volume form corresponding to the Lebesgue measure $\mu(A) = \int_A dx$.

Suppose we have a probability measure (with support $\R^n$) and assume that it can be expressed as $P(A) = \int_A p \, dx$. Then, $p$ is the probability density function (pdf) of $P$ under the reference measure $dx$, i.e., it is positive everywhere $p > 0$ and it integrates to one under $dx$, that is, $\int_{\R^n} p \, dx = 1$.

Another way to define $p$ as a pdf is via the Radon-Nikodym derivative

\[p = \frac{p \, dx}{dx} .\]

Then it’s clear that we can take any volume form as the reference measure, not just $dx$. E.g., we can take

\[p_G := \frac{p \, dx}{dV_G} = \frac{p \, dx}{\sqrt{\abs{\det{G}}} \, dx} = p \, \abs{\det{G}}^{-\frac{1}{2}} ,\]

which is a pdf under $dV_G$ since it’s still positive (note that $G$ is positive-definite) and

\[\require{cancel} \int_{\R^n} p \, \abs{\det{G}}^{-\frac{1}{2}} \, dV_G = \int_{\R^n} p \, \cancel{\abs{\det{G}}^{-\frac{1}{2}}} \, \cancel{\abs{\det{G}}^{\frac{1}{2}}} \, dx = 1 ,\]

i.e., it integrates to one under $dV_G$.

Change of Variables

Now, assume that we have another coordinate system for $\R^n$, say, representing each element of $\R^n$ with $y = (y^1, \dots, y^n)$ instead. The change-of-coordinates function mapping $x \mapsto y$ is a diffeomorphism, i.e., a differentiable function with a differentiable inverse. Let’s call it $\varphi$, and call its $n \times n$ Jacobian matrix $J = [\partial y^i / \partial x^j]_{i,j=1}^n$, with inverse $J^{-1} = [\partial x^i / \partial y^j]_{i,j=1}^n$.

Here are some rules for transforming a metric and a volume form.

If $G$ is a matrix representation of a Riemannian metric in $x$-coordinates, then

\[\widehat{G} = (J^{-1})^\top G J^{-1}\]

is the matrix representation of the same metric in $y$-coordinates. Consequently, the determinant of the metric $\abs{\det G}$ transforms into $\abs{\det G} \, \abs{\det J^{-1}}^2$. This transformation rule ensures that if $\hat{v}, \hat{w}$ are the representations of $v, w$ in $y$-coordinates, then $\hat{v}^\top \widehat{G} \hat{w} = v^\top G w$. That is, the value of the inner product is independent of the choice of coordinates. In other words, this rule makes sure we are referring to the same abstract object (in this case the inner product, which is an abstract function) even when we use a different representation.
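The rule above can be checked numerically. The following is a minimal sketch for the $2 \times 2$ case, assuming a linear change of coordinates $y = Jx$ so that the Jacobian is constant; the particular matrices and vectors are arbitrary illustrative values, not taken from the text.

```python
# Sketch: check that G_hat = J^{-T} G J^{-1} preserves inner products (2x2 case).
# All numeric values below are illustrative assumptions.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]

def inner(v, G, w):
    # v^T G w for vectors given as flat lists
    Gw = matvec(G, w)
    return sum(v[i] * Gw[i] for i in range(2))

G = [[2.0, 0.5], [0.5, 1.0]]   # symmetric positive-definite metric in x-coordinates
J = [[1.0, 2.0], [0.0, 3.0]]   # Jacobian of the linear map y = J x
J_inv = inv2(J)

# Metric in y-coordinates
G_hat = matmul(matmul(transpose(J_inv), G), J_inv)

# Vectors transform with the Jacobian
v, w = [1.0, -1.0], [0.5, 2.0]
v_hat, w_hat = matvec(J, v), matvec(J, w)

lhs = inner(v_hat, G_hat, w_hat)        # inner product computed in y-coordinates
rhs = inner(v, G, w)                    # inner product computed in x-coordinates
det_lhs = det2(G_hat)
det_rhs = det2(G) * det2(J_inv) ** 2    # det G * (det J^{-1})^2
```

Up to floating-point error, `lhs` agrees with `rhs` and `det_lhs` agrees with `det_rhs`.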

Now, if $f \, dx$ is a volume form in $x$-coordinates, then

\[(f \circ \varphi^{-1}) \, \abs{\det J^{-1}} \, dy\]

is the same volume form in $y$-coordinates. In particular, we have the relation $dx = \abs{\det J^{-1}} \, dy$ [2, Corollary 14.21]. Again, this rule is to ensure coordinate independence.

As a consequence, integrals are also invariant under a change of coordinates:

\[\int_{\varphi(A)} (f \circ \varphi^{-1}) \, \underbrace{\abs{\det{J^{-1}}} \, dy}_{=dx} = \int_A f \, dx \,,\]

where $A \subseteq \R^n$. Notice that this is just the standard change-of-variable rule in calculus. But one thing to keep in mind is that the Jacobian-determinant term is part of the transformation of $dx$, not the function $f$ itself.
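To make the invariance of the integral concrete, here is a small one-dimensional sketch. The choices $f(x) = x^2$, $A = [0, 1]$, and $\varphi(x) = e^x$ are assumptions made purely for illustration, and the integrals are approximated with a simple midpoint rule.

```python
import math

def midpoint_integral(g, lo, hi, n=100_000):
    # Midpoint-rule approximation of the integral of g over [lo, hi]
    h = (hi - lo) / n
    return h * sum(g(lo + (k + 0.5) * h) for k in range(n))

f = lambda x: x ** 2             # integrand in x-coordinates (illustrative)
phi = math.exp                   # change of coordinates y = phi(x)
phi_inv = math.log               # inverse map x = phi^{-1}(y)
jac_inv = lambda y: 1.0 / y      # |det J^{-1}| = |dx/dy|

a, b = 0.0, 1.0                  # A = [a, b], so phi(A) = [phi(a), phi(b)]

integral_x = midpoint_integral(f, a, b)
integral_y = midpoint_integral(lambda y: f(phi_inv(y)) * jac_inv(y), phi(a), phi(b))
```

Both `integral_x` and `integral_y` approximate $1/3$. The factor `jac_inv(y)` enters as part of the transformed $dx$, while the function itself is only composed with $\varphi^{-1}$.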

Pdfs Under Change of Variables

From elementary probability theory, we have the transformation of a pdf $p_x$ (defined w.r.t. $dx$):

\[p_y = (p_x \circ \varphi^{-1}) \, \abs{\det{J^{-1}}} \, ,\]

and this is known to be problematic because of the additional Jacobian-determinant term.

For instance, the mode $\argmax p_y$ of $p_y$ doesn’t correspond to the mode $\argmax p_x$ of $p_x$. That is, modes of pdfs are not coordinate-independent. Maximum a posteriori (MAP) estimation, which is the standard estimation method for neural networks, thus looks pathological, since an arbitrary reparametrization/change of variables will yield a different MAP estimate; see e.g. [1, Sec. 5.2.1.4]. Or is it?

The reason for the above transformation rule between $p_x$ and $p_y$ is to ensure invariance in the integration, to ensure that $p_y$ is a valid pdf w.r.t. $dy$:

\[\begin{align} \int_{\varphi(\R^n)} p_y \, dy &= \int_{\varphi(\R^n)} (p_x \circ \varphi^{-1}) \, \underbrace{\abs{\det J^{-1}} \, dy}_{= dx} \\ &= \int_{\R^n} p_x \, dx \\[5pt] &= 1 . \end{align}\]

However, as we have seen before, $\abs{\det J^{-1}}$ is part of the transformation of $dx$, i.e. $dx = \abs{\det J^{-1}} dy$! So, the problem in pdf maximization is actually because we attribute the Jacobian-determinant to the wrong part of the volume measure $p_x \, dx$. This can only be detected if we see things holistically as the transformation of the whole volume form, and not just view it as the transformation of the function $p_x$ independently.

This leads to a very straightforward solution to the non-invariance problem. Simply transform $p_x$ into $p_y = (p_x \circ \varphi^{-1})$. This is just the transformation rule of a standard function, so its extrema will always be coordinate-independent. It is still a valid pdf in $y$-coordinates, as long as the Jacobian-determinant term is carried by the volume form: the reference measure is the transformed $dx$, i.e. $\abs{\det J^{-1}} \, dy$, and not $dy$ alone.
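The difference between the two transformation rules can be seen in a one-dimensional sketch. We assume, for illustration only, that $p_x$ is a Gaussian with mean $\mu = 1$ and standard deviation $\sigma = 0.5$, and that $\varphi(x) = e^x$; the modes are located by a plain grid search.

```python
import math

mu, sigma = 1.0, 0.5   # illustrative parameters of p_x

def p_x(x):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# y = phi(x) = exp(x), so phi^{-1}(y) = log(y) and |det J^{-1}| = 1 / y

def p_y_with_jacobian(y):
    # "Elementary" rule: the Jacobian determinant is attached to the function
    return p_x(math.log(y)) / y

def p_y_function_only(y):
    # Rule of a standard function: only compose with phi^{-1}
    return p_x(math.log(y))

ys = [0.5 + 0.0005 * k for k in range(10_000)]   # grid over part of phi(R) = (0, inf)

mode_with_jacobian = max(ys, key=p_y_with_jacobian)
mode_function_only = max(ys, key=p_y_function_only)
image_of_mode_x = math.exp(mu)                   # phi(argmax p_x)
```

Here `mode_function_only` lands at $\varphi(\mu) = e^{\mu}$, the image of the mode of $p_x$, whereas `mode_with_jacobian` lands near $e^{\mu - \sigma^2}$, the well-known mode of the log-normal density.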

Riemannian Pdfs Under Change of Variables

What about a Riemannian pdf $p_G = p_x \, \abs{\det{G}}^{-\frac{1}{2}}$ under the Riemannian volume form $dV_G$? First, recall that $\abs{\det{\widehat{G}}} = \abs{\det{G}} \, \abs{\det J^{-1}}^2$. So,

\[p_{\widehat{G}} = (p_x \circ \varphi^{-1}) \, \abs{\det{G}}^{-\frac{1}{2}} \, \abs{\det J^{-1}}^{-1} .\]

This seems problematic since now we have the Jacobian determinant term again, just like the “incorrect” transformation of pdf in the previous section. It actually is! Just look at the following integral that attempts to show that $p_{\widehat{G}}$ integrates to one under $dV_{\widehat{G}}$.

\[\begin{align} \int_{\varphi(\R^n)} p_{\widehat{G}} \, dV_{\widehat{G}} &= \int_{\varphi(\R^n)} (p_x \circ \varphi^{-1}) \, \cancel{\abs{\det{G}}^{-\frac{1}{2}}} \, \cancel{\abs{\det J^{-1}}^{-1}} \, \cancel{\abs{\det J^{-1}}} \, \cancel{\abs{\det G}^{\frac{1}{2}}} \, dy \\ &= \int_{\varphi(\R^n)} (p_x \circ \varphi^{-1}) \, dy . \end{align}\]

We now don’t have the $\abs{\det{J^{-1}}}$ term anymore. So we can’t apply the relation $dx = \abs{\det{J^{-1}}} \, dy$ to complete the steps. What gives?

This is actually because there is a Jacobian-determinant term that we forget about because we don’t see things as a whole. The complete way to see a pdf is in terms of the Radon-Nikodym derivative. So, let’s see, in $x$-coordinates, we have:

\[p_G = \frac{p_x \, dx}{\abs{\det{G}}^{\frac{1}{2}} \, dx} .\]

Now in $y$-coordinates, we have the following by transforming both the volume forms in the numerator and the denominator:

\[p_{\widehat{G}} = \frac{(p_x \circ \varphi^{-1}) \, \cancel{\abs{\det{J^{-1}}} \, dy}}{(\abs{\det{G}}^{\frac{1}{2}} \circ \varphi^{-1}) \, \cancel{\abs{\det{J^{-1}}} \, dy}} = (p_x \circ \varphi^{-1}) \, \abs{\det{G}}^{-\frac{1}{2}} .\]

The key is to view $\abs{\det{G}}^{\frac{1}{2}}$ as a function in front of $dx$, which, by the transformation rule discussed previously, transforms into $\abs{\det{G}}^{\frac{1}{2}} \circ \varphi^{-1}$. For brevity, we might as well write it down as $\abs{\det{G}}^{\frac{1}{2}}$, just remember that the domain of this function is the $y$-coordinates.

Compare this to before: we now don’t have the Jacobian-determinant term! Performing the integration as before:

\[\begin{align} \int_{\varphi(\R^n)} p_{\widehat{G}} \, dV_{\widehat{G}} &= \int_{\varphi(\R^n)} (p_x \circ \varphi^{-1}) \, \cancel{\abs{\det{G}}^{-\frac{1}{2}}} \, \cancel{\abs{\det G}^{\frac{1}{2}}} \, \underbrace{\abs{\det J^{-1}} \, dy}_{=dx} \\ &= \int_{\R^n} p_x \, dx \\[5pt] &= 1 . \end{align}\]

And therefore, we have shown that $(p_x \circ \varphi^{-1}) \, \abs{\det{G}}^{-\frac{1}{2}}$ is the correct transformation of $p_G$. Notice that this is again just the transformation of a standard function, and so the modes are coordinate-independent.

Conclusion

Two take-aways from this post. First, be aware of the correct transformation of objects. In particular, for a volume form $f \, dx$, the Jacobian-determinant is part of the transformation of $dx$, not the function $f$. This way, we don’t have any problem with MAP estimation.

Second, it’s best to see things as a whole to avoid confusion. For pdfs, write them holistically as Radon-Nikodym derivatives. Then, the correct transformations can easily be applied without confusion.

References

[1] Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[2] Lee, John M. Introduction to Smooth Manifolds. 2003.

The Invariance of the Hessian and Its Eigenvalues, Determinant, and Trace

Thu, 09 Feb 2023 00:00:00 -0500

Let $f: \mathcal{X} \times \Theta \to \R^k$ be a neural network, defined by $(x, \theta) \mapsto f(x; \theta) = f_\theta(x)$. Suppose $\L: \Theta \to \R$ is a loss function defined on the $d$-dimensional parameter space $\Theta$ of $f$ and let $\theta^*$ be a minimum of $\L$. Suppose further $\varphi: \Theta \to \Psi$ is a reparametrization, i.e., a differentiable map with a differentiable inverse, mapping $\theta \mapsto \psi$.

Suppose we transform $\theta^*$ into $\psi^* = \varphi(\theta^*)$. The consensus in the deep learning field regarding the Hessian matrix $H(\theta^*)$ of $\L$ at $\theta^*$ is that:

The eigenvalues of $H(\theta^*)$ are not invariant.
The determinant of $H(\theta^*)$ is not invariant.
The trace of $H(\theta^*)$ is not invariant.
Seen as a bilinear map, the Hessian is not invariant outside the critical points of $\L$.

In this post, we shall see that these quantities are actually invariant under reparametrization! Although the argument comes from Riemannian geometry, it will also hold even if we use the default assumption found in calculus—the standard setting assumed by deep learning algorithms and practitioners.

Note. Throughout this post, we use the Einstein summation convention. That is, we sum over an index whenever it appears once as an upper index and once as a lower index, while omitting the summation symbol. For example, $v^i w_i$ corresponds to $\sum_i v^i w_i$ and $v^i w^j H_{ij} = \sum_i \sum_j v^i w^j H_{ij}$. Meanwhile, the index $i$ in the partial derivative $\partial f/\partial \theta^i$ counts as a lower index.

The Hessian as a Bilinear Map

In calculus, the Hessian matrix $H(\theta^*)$ at $\theta^*$ is defined by

\[H_{ij}(\theta^*) = \frac{\partial^2 \L}{\partial \theta^i \partial \theta^j}(\theta^*) \qquad\qquad \text{for all} \qquad i,j = 1, \dots, d .\]

The Hessian matrix defines a bilinear function, i.e., given arbitrary vectors $v, w$ in $\R^d$, we can write a function $B(v, w) = v^i w^j H_{ij}(\theta^*)$. For example, this term comes up in the 2nd-order Taylor expansion of $\L$ at $\theta^*$:

\[\begin{align} \L(\theta) &\approx \L(\theta^*) + (\nabla \L \vert_{\theta^*})^\top d + \frac{1}{2} \underbrace{d^\top H(\theta^*) d}_{=B(d, d)} , \end{align}\]

where we have defined $d = (\theta - \theta^*)$.

Under the reparametrization $\varphi: \theta \mapsto \psi$ with $\psi^* = \varphi(\theta^*)$, we have $\L \mapsto \L \circ \varphi^{-1}$. Thus, by the chain and product rules, the Hessian $H_{ij}$ becomes

\[\begin{align} \tilde{H}_{ij} &= \frac{\partial^2 (\L \circ \varphi^{-1})}{\partial \psi^i \partial \psi^j} = \frac{\partial}{\partial \psi^j}\left( \frac{\partial \L}{\partial \theta^m} \frac{\partial \theta^m}{\partial \psi^i} \right) \\ &= \frac{\partial^2 \L}{\partial \theta^m \partial \theta^n} \frac{\partial \theta^m}{\partial \psi^i} \frac{\partial \theta^n}{\partial \psi^j} + \frac{\partial \L}{\partial \theta^o} \frac{\partial^2 \theta^o}{\partial \psi^i \partial \psi^j} . \end{align}\]

However, notice that if we evaluate $\tilde{H}_{ij}$ at a minimum $\psi^* = \varphi(\theta^*)$, the second term vanishes, since the gradient $\partial \L / \partial \theta^o$ is zero at a critical point. And so, we have

\[\tilde{H}_{ij}(\psi^*) = \frac{\partial^2 \L}{\partial \theta^m \partial \theta^n}(\varphi^{-1}(\psi^*)) \frac{\partial \theta^m}{\partial \psi^i}(\psi^*) \frac{\partial \theta^n}{\partial \psi^j}(\psi^*) .\]

Meanwhile, if $v = (v^1, \dots, v^d)$ and $w = (w^1, \dots, w^d)$ are vectors at $\theta^* \in \Theta$, their components become

\[\tilde{v}^i = v^m \frac{\partial \psi^i}{\partial \theta^m}(\theta^*) \qquad \text{and} \qquad \tilde{w}^j = w^n \frac{\partial \psi^j}{\partial \theta^n}(\theta^*) ,\]

because the Jacobian of the reparametrization (i.e. change of coordinates) $\varphi: \theta \mapsto \psi$ defines a change of basis.

Notice that $\frac{\partial \theta^m}{\partial \psi^i}(\psi^*)$ is the inverse of $\frac{\partial \psi^i}{\partial \theta^m}(\theta^*) = \frac{\partial \psi^i}{\partial \theta^m}(\varphi^{-1}(\psi^*))$. Considering the transformed $H$, $v$, and $w$, the bilinear map $B$ then becomes

\[\require{cancel} \begin{align} \tilde{B}(\tilde{v}, \tilde{w}) &= \tilde{v}^i \tilde{w}^j \tilde{H}_{ij}(\psi^*) \\ &= v^m \cancel{\frac{\partial \psi^i}{\partial \theta^m}(\varphi^{-1}(\psi^*))} w^n \cancel{\frac{\partial \psi^j}{\partial \theta^n}(\varphi^{-1}(\psi^*))} \frac{\partial^2 \L}{\partial \theta^m \partial \theta^n}(\varphi^{-1}(\psi^*)) \cancel{\frac{\partial \theta^m}{\partial \psi^i}(\psi^*)} \cancel{\frac{\partial \theta^n}{\partial \psi^j}(\psi^*)} \\ &= v^m w^n H_{mn}(\varphi^{-1}(\psi^*)) \end{align}\]

under the reparametrization $\varphi$. Since the indices $m$, $n$ are simply dummy indices, the last expression is equivalent to $v^i w^j H_{ij}(\theta^*)$. Since $v$, $w$, and $\varphi$ are arbitrary, this implies that, seen as a bilinear map, the Hessian at a minimum $\theta^*$ is invariant under reparametrization.
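This invariance can be checked numerically in one dimension. The sketch below assumes the toy loss $\L(\theta) = (\theta - 1)^2$ with minimum $\theta^* = 1$ and the reparametrization $\psi = e^\theta$, both chosen only for illustration, and estimates second derivatives with central finite differences.

```python
import math

def second_derivative(f, x, h=1e-4):
    # Central finite-difference estimate of f''(x)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

loss = lambda theta: (theta - 1.0) ** 2      # toy loss, minimum at theta* = 1
phi, phi_inv = math.exp, math.log            # reparametrization psi = exp(theta)
loss_psi = lambda psi: loss(phi_inv(psi))    # L composed with phi^{-1}

theta_star = 1.0
psi_star = phi(theta_star)

H = second_derivative(loss, theta_star)          # Hessian in theta-coordinates
H_tilde = second_derivative(loss_psi, psi_star)  # Hessian in psi-coordinates

dpsi_dtheta = math.exp(theta_star)               # Jacobian of phi at theta*
v, w = 0.7, -1.3                                 # arbitrary vectors at theta*
v_tilde, w_tilde = v * dpsi_dtheta, w * dpsi_dtheta

B = v * w * H                                    # B(v, w)
B_tilde = v_tilde * w_tilde * H_tilde            # B~(v~, w~)
```

The raw numbers `H` and `H_tilde` differ, but `B` and `B_tilde` agree up to discretization error, because the Jacobian factors in the vectors undo those in the Hessian.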

The Non-Invariance of the Hessian

While the Hessian, as a bilinear map at a minimum, is (functionally) invariant, some of its downstream quantities are not. Let us illustrate this using the determinant—one can also easily show similar results for trace and eigenvalues.

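First, recall that the components $H_{ij}(\theta^*)$ of the Hessian transform into the following under a reparametrization $\varphi$:

\[\tilde{H}_{ij}(\psi^*) = \frac{\partial \theta^m}{\partial \psi^i}(\psi^*) \frac{\partial \theta^n}{\partial \psi^j}(\psi^*) H_{mn}(\varphi^{-1}(\psi^*)) .\]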

In matrix notation, this is $\tilde{\mathbf{H}} = (\mathbf{J}^{-1})^\top \mathbf{H} \mathbf{J}^{-1}$. (The dependency on $\psi^*$ is omitted for simplicity.) Then, the determinant of $\tilde{\mathbf{H}}$ is

\[\det \tilde{\mathbf{H}} = (\det \mathbf{J}^{-1})^2 \det \mathbf{H} .\]

Thus, in general, $\det \tilde{\mathbf{H}} \neq \det \mathbf{H}$. Hence the determinant of the Hessian is not invariant. This causes problems in deep learning: For instance, Dinh et al. 2017 argue that one cannot study the connection between flatness and generalization performance at the minimum of $\L$.

The Riemannian Hessian

From the Riemannian-geometric perspective, the component $H_{ij}$ of the Hessian of $\L$ is defined under $\theta$ coordinates/parametrization as:

\[H_{ij} = \frac{\partial^2 \L}{\partial \theta^i \partial \theta^j} - \Gamma^k_{ij} \frac{\partial \L}{\partial \theta^k} ,\]

where $\Gamma^k_{ij}$ is a three-dimensional array that represents the Levi-Civita connection (or any connection) on the tangent spaces of $\Theta$, seen as a Riemannian manifold. In the calculus case, where the Euclidean metric and the Cartesian coordinates are assumed by default, $\Gamma^k_{ij}$ vanishes identically; hence the previous definition of the Hessian. This also shows that the Riemannian Hessian is a generalization of the standard Hessian.

Under a reparametrization $\varphi: \theta \to \psi$, the connection coefficient $\Gamma$ transforms as follows:

\[\tilde\Gamma_{ij}^k = \Gamma_{mn}^o \frac{\partial \psi^k}{\partial \theta^o} \frac{\partial \theta^m}{\partial \psi^i} \frac{\partial \theta^n}{\partial \psi^j} + \frac{\partial^2 \theta^o}{\partial \psi^i \partial \psi^j} \frac{\partial \psi^k}{\partial \theta^o} .\]

And thus, combined with the transformation of the “calculus Hessian” (i.e. second partial derivatives) from the previous section, the Riemannian Hessian transforms as:

\[\begin{align*} \tilde{H}_{ij} &= \frac{\partial^2 (\L \circ \varphi^{-1})}{\partial \psi^i \partial \psi^j} - \tilde\Gamma^k_{ij} \frac{\partial (\L \circ \varphi^{-1})}{\partial \psi^k} \\ &= \frac{\partial^2 \L}{\partial \theta^m \partial \theta^n} \frac{\partial \theta^m}{\partial \psi^i} \frac{\partial \theta^n}{\partial \psi^j} + \frac{\partial \L}{\partial \theta^o} \frac{\partial^2 \theta^o}{\partial \psi^i \partial \psi^j} - \left( \Gamma_{mn}^o \frac{\partial \psi^k}{\partial \theta^o} \frac{\partial \theta^m}{\partial \psi^i} \frac{\partial \theta^n}{\partial \psi^j} + \frac{\partial^2 \theta^o}{\partial \psi^i \partial \psi^j} \frac{\partial \psi^k}{\partial \theta^o} \right) \frac{\partial \L}{\partial \theta^o} \frac{\partial \theta^o}{\partial \psi^k} \\ &= \frac{\partial^2 \L}{\partial \theta^m \partial \theta^n} \frac{\partial \theta^m}{\partial \psi^i} \frac{\partial \theta^n}{\partial \psi^j} + \frac{\partial \L}{\partial \theta^o} \frac{\partial^2 \theta^o}{\partial \psi^i \partial \psi^j} - \Gamma_{mn}^o \cancel{\frac{\partial \psi^k}{\partial \theta^o}} \frac{\partial \theta^m}{\partial \psi^i} \frac{\partial \theta^n}{\partial \psi^j} \frac{\partial \L}{\partial \theta^o} \cancel{\frac{\partial \theta^o}{\partial \psi^k}} - \frac{\partial^2 \theta^o}{\partial \psi^i \partial \psi^j} \cancel{\frac{\partial \psi^k}{\partial \theta^o}} \frac{\partial \L}{\partial \theta^o} \cancel{\frac{\partial \theta^o}{\partial \psi^k}} \\ &= \frac{\partial^2 \L}{\partial \theta^m \partial \theta^n} \frac{\partial \theta^m}{\partial \psi^i} \frac{\partial \theta^n}{\partial \psi^j} \cancel{+ \frac{\partial \L}{\partial \theta^o} \frac{\partial^2 \theta^o}{\partial \psi^i \partial \psi^j}} - \Gamma_{mn}^o \frac{\partial \theta^m}{\partial \psi^i} \frac{\partial \theta^n}{\partial \psi^j} \frac{\partial \L}{\partial \theta^o} \cancel{- \frac{\partial^2 \theta^o}{\partial \psi^i \partial \psi^j} \frac{\partial \L}{\partial \theta^o}} \\ &= \frac{\partial \theta^m}{\partial \psi^i} \frac{\partial \theta^n}{\partial \psi^j} \left( \frac{\partial^2 \L}{\partial \theta^m \partial \theta^n} - \Gamma_{mn}^o \frac{\partial \L}{\partial \theta^o} \right) \\ &= \frac{\partial \theta^m}{\partial \psi^i} \frac{\partial \theta^n}{\partial \psi^j} H_{mn} . \end{align*}\]

Note that while this transformation rule is very similar to the transformation of the “calculus Hessian” at a critical point, the transformation rule of the Riemannian Hessian applies everywhere on $\Theta$.

This means, seen as a bilinear map, the Hessian is invariant everywhere on $\Theta$. (Not just at the critical points as before.) How does this discrepancy happen? This is because we ignore $\Gamma^k_{ij}$ in calculus! This is, of course, justified since $\Gamma^k_{ij} \equiv 0$. But as can be seen in its transformation rule, under a reparametrization $\varphi$, this quantity is non-zero in general in $\psi$ parametrization—this is already true for a simple, common transformation between the Cartesian and polar coordinates.
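As a worked instance of the Cartesian-to-polar remark, here is a sketch under the assumption that $\Theta = \R^2$ with Cartesian coordinates $\theta = (x, y)$ and polar coordinates $\psi = (r, \alpha)$, where the symbol $\alpha$ for the angle is our choice. Setting $\Gamma \equiv 0$ in the transformation rule of the connection coefficients leaves only its second term. With $x = r \cos\alpha$ and $y = r \sin\alpha$, two of the resulting coefficients are:

\[\begin{align} \tilde\Gamma^{r}_{\alpha\alpha} &= \frac{\partial^2 x}{\partial \alpha^2} \frac{\partial r}{\partial x} + \frac{\partial^2 y}{\partial \alpha^2} \frac{\partial r}{\partial y} = (-r\cos\alpha)\cos\alpha + (-r\sin\alpha)\sin\alpha = -r , \\ \tilde\Gamma^{\alpha}_{r\alpha} &= \frac{\partial^2 x}{\partial r \, \partial \alpha} \frac{\partial \alpha}{\partial x} + \frac{\partial^2 y}{\partial r \, \partial \alpha} \frac{\partial \alpha}{\partial y} = (-\sin\alpha)\left(-\frac{\sin\alpha}{r}\right) + \cos\alpha \, \frac{\cos\alpha}{r} = \frac{1}{r} . \end{align}\]

So the connection coefficients are non-zero in polar coordinates even though they vanish identically in Cartesian coordinates.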

The Invariance of the Hessian Eigenvalues, Determinant, and Trace

Let us focus on the determinant of the Hessian. As discussed above, it is not invariant. This is true even if the Riemannian Hessian above is used. How do we make sense of this?

To make sense of this, we need to fully understand the object we care about when we talk about the determinant of the Hessian as a measure of the flatness of the loss landscape of $\L$.

The loss landscape of $\L$ is the graph $\{ (\theta, \L(\theta)) \in \R^{d+1}: \theta \in \Theta \}$ of $\L$. This is actually a $d$-dimensional hypersurface embedded in $\R^{d+1}$. In particular, a hypersurface is a manifold. Meanwhile, the concept of “sharpness” or “flatness” of the loss landscape of $\L$ is nothing but the curvatures of the above manifold, particularly the principal curvatures, Gaussian curvature, and mean curvature. See this previous post for intuition.

These curvatures can actually be derived from the Hessian of $\L$ since this Hessian is the second fundamental form of that manifold. (See that previous post!) However, to obtain those curvatures, we must first derive the shape operator with the help of the metric. (The shape operator is a linear operator, mapping a vector to a vector.) Suppose the matrix representation of the metric on $\Theta$ is $\mathbf{G}$. Then, the shape operator $\mathbf{E}$ is given by

\[\mathbf{E} := \mathbf{G}^{-1} \mathbf{H} .\]

The principal, Gaussian, and mean curvatures of the loss landscape are then the eigenvalues, determinant, and trace of $\mathbf{E}$, respectively. The reason why we can simply take eigenvalues or determinant or trace of the Hessian $\mathbf{H}$ in calculus is because, by default, $\mathbf{G}$ is assumed to be the $d \times d$ identity matrix $\mathbf{I}$, i.e. the Euclidean metric. That is $\mathbf{E} = \mathbf{H}$ and we can ignore the $\mathbf{G}^{-1}$ term above.

But notice that under a reparametrization $\varphi: \theta \to \psi$, we have

\[\mathbf{G} \mapsto (\mathbf{J}^{-1})^\top \mathbf{G} \mathbf{J}^{-1} .\]

So, even when $\mathbf{G} \equiv \mathbf{I}$ in the $\theta$ parametrization, the matrix representation of the metric is different than $\mathbf{I}$ in the $\psi$ parametrization! That is, we must not ignore the metric in the shape operator, however trivial it might be, if we care about reparametrization. This is the cause of the non-invariance of the Hessian’s eigenvalues, determinant, and trace observed in deep learning!

First, let us see the transformation of the shape operator by combining the transformation rules of $\mathbf{G}$ and $\mathbf{H}$:

\[\begin{align} \tilde{\mathbf{E}} &= \tilde{\mathbf{G}}^{-1} \tilde{\mathbf{H}} \\ &= ((\mathbf{J}^{-1})^\top \mathbf{G} \mathbf{J}^{-1})^{-1} (\mathbf{J}^{-1})^\top \mathbf{H} \mathbf{J}^{-1} \\ &= \mathbf{J} \mathbf{G}^{-1} \cancel{\mathbf{J}^\top} \cancel{\mathbf{J}^{-\top}} \mathbf{H} \mathbf{J}^{-1} \\ &= \mathbf{J} \mathbf{G}^{-1} \mathbf{H} \mathbf{J}^{-1} \\ &= \mathbf{J} \mathbf{E} \mathbf{J}^{-1} . \end{align}\]

If we take the determinant of both sides, we have:

\[\det \tilde{\mathbf{E}} = \cancel{(\det \mathbf{J})} \cancel{(\det \mathbf{J})^{-1}} (\det \mathbf{E}) = \det \mathbf{E} .\]

That is, the determinant of the Hessian, seen as a shape operator, is invariant!

What about the trace of $\mathbf{E}$? Recall that $\tr{\mathbf{A}\mathbf{B}} = \tr{\mathbf{B}\mathbf{A}}$. Using this property and the transformation of $\tilde{\mathbf{E}}$ above, we have:

\[\begin{align} \mathrm{tr}\, \tilde{\mathbf{E}} &= \tr{\mathbf{J} \mathbf{E} \mathbf{J}^{-1}} = \tr{\mathbf{J}^{-1} \mathbf{J} \mathbf{E}} = \mathrm{tr}\, \mathbf{E} , \end{align}\]

and so the trace is also invariant.

Finally, we can also show a general invariance result for eigenvalues. Recall that $\lambda$ is an eigenvalue of the linear operator $\mathbf{E}$ if $\mathbf{E} \mathbf{v} = \lambda \mathbf{v}$ for an eigenvector $\mathbf{v}$.

Let $(\lambda, \mathbf{v})$ be an eigenpair in the $\theta$ parametrization and $(\tilde{\lambda}, \tilde{\mathbf{v}})$ be the corresponding eigenpair in the $\psi$ parametrization. We want to show that $\lambda = \tilde{\lambda}$. Recall that vectors are transformed by multiplying them with the Jacobian of $\varphi$. So, $\tilde{\mathbf{v}} = \mathbf{J} \mathbf{v}$. Therefore:

\[\begin{align} \tilde{\mathbf{E}} \tilde{\mathbf{v}} &= \tilde{\lambda} \tilde{\mathbf{v}} \\ \mathbf{J} \mathbf{E} \cancel{\mathbf{J}^{-1}} \cancel{\mathbf{J}} \mathbf{v} &= \tilde{\lambda} \mathbf{J} \mathbf{v} \\ \mathbf{J} \mathbf{E} \mathbf{v} &= \tilde{\lambda} \mathbf{J} \mathbf{v} \\ \mathbf{E} \mathbf{v} &= \tilde{\lambda} \mathbf{v} , \end{align}\]

where the last step is done by multiplying both sides by the inverse of the Jacobian—recall that $\varphi$ is invertible.

Therefore, we identify that $\lambda = \tilde\lambda$. Since $\lambda$ is an arbitrary eigenvalue, we conclude that all eigenvalues of $\mathbf{E}$ are invariant.
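The invariance results of this section can be checked on a small example. The following sketch assumes a $2 \times 2$ Hessian at a minimum, the Euclidean metric in the $\theta$ parametrization, and a linear reparametrization $\psi = \mathbf{J}\theta$ with constant Jacobian; all numeric values are illustrative.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]

def det(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def trace(A):
    return A[0][0] + A[1][1]

def inv(A):
    d = det(A)
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def eigenvalues(A):
    # Roots of lambda^2 - tr(A) lambda + det(A) = 0 (real for the matrices used here)
    t, d = trace(A), det(A)
    r = math.sqrt(t * t - 4.0 * d)
    return sorted([(t - r) / 2.0, (t + r) / 2.0])

H = [[3.0, 1.0], [1.0, 2.0]]   # Hessian at a minimum, theta parametrization
G = [[1.0, 0.0], [0.0, 1.0]]   # Euclidean metric, theta parametrization
J = [[2.0, 1.0], [0.0, 3.0]]   # Jacobian of the linear map psi = J theta
J_inv = inv(J)

# Both H and G transform as covariant 2-tensors
H_tilde = matmul(matmul(transpose(J_inv), H), J_inv)
G_tilde = matmul(matmul(transpose(J_inv), G), J_inv)

# Shape operators in both parametrizations
E = matmul(inv(G), H)
E_tilde = matmul(inv(G_tilde), H_tilde)
```

In this example `det(H_tilde)` differs from `det(H)` by the factor $(\det \mathbf{J}^{-1})^2$, while `det`, `trace`, and `eigenvalues` of `E_tilde` match those of `E` up to floating-point error.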

Non-Invariance from the Tensor Analysis Viewpoint

In tensor analysis, this issue is very easy to identify. First, the Hessian represents a bilinear map, so it is a covariant 2-tensor. Meanwhile, when we talk about eigenvalues, we refer to the spectral theorem and this theorem applies to linear maps. So, there is a type mismatch here.

To apply the spectral theorem to the Hessian, we need to express it as a linear map. This can be done by viewing the Hessian as a linear map from the tangent space to itself, which is a 1-contravariant 1-covariant tensor. That is, we need to “raise” one of the indices of $H$. How do we do this? You guessed it: Multiply $H$ with the inverse of the metric.

Conclusion

The reason why “flatness measures” derived from the calculus version of the Hessian are not invariant is simply that we compute them from an incorrect object. The correct object we should use is the shape operator, which is obtained with the help of the metric (even when the latter is Euclidean).

Moreover, the reason why Newton’s method is not invariant (see Sec. 12 of Martens, 2020) is that we ignore the second term involving the connection coefficient $\Gamma$.

Ignoring those geometric quantities is totally justified in calculus and deep learning since we always assume a Euclidean metric along with the Cartesian coordinates. But this simplification makes us “forget” about the correct transformation of the Hessian, giving rise to the pathological non-invariance issues observed in deep learning.

Convolution of Gaussians and the Probit Integral

Sat, 25 Jun 2022 00:00:00 -0400

Gaussian distributions are very useful in Bayesian inference due to their (many!) convenient properties. In this post we take a look at two of them: the convolution of two Gaussian pdfs and the integral of the probit function w.r.t. a Gaussian measure.

Convolution and the Predictive Distribution of Gaussian Regression

Let’s start with the convolution $\N(z_1 \mid \mu_1, \sigma^2_1) * \N(z_2 \mid \mu_2, \sigma^2_2)$ of two Gaussians $\N(z_1 \mid \mu_1, \sigma^2_1)$ and $\N(z_2 \mid \mu_2, \sigma^2_2)$ on $\R$:

\[\N(z_1 \mid \mu_1, \sigma^2_1) * \N(z_2 \mid \mu_2, \sigma^2_2) := \int_{\R} \N(z_1 - z_2 \mid \mu_1, \sigma^2_1) \, \N(z_2 \mid \mu_2, \sigma^2_2) \,dz_2 .\]

Proposition 1 (Convolution of Gaussians) Let $\N(z_1 \mid \mu_1, \sigma^2_1)$ and $\N(z_2 \mid \mu_2, \sigma^2_2)$ be two Gaussians on $\R$.

\[\N(z_1 \mid \mu_1, \sigma^2_1) * \N(z_2 \mid \mu_2, \sigma^2_2) = \N(z_1 \mid \mu_1+\mu_2, \sigma^2_1+\sigma^2_2) .\]

Proof. By the convolution theorem, the Fourier transform of the convolution of two functions is the product of the functions’ Fourier transforms. The Fourier transform of a density function is given by its characteristic function (here with the sign convention $e^{-iux}$ in the transform). For a Gaussian $f(x) := \N(x \mid \mu, \sigma^2)$, it is $\varphi(u) := \exp\left(-iu\mu - \frac{1}{2}u^2\sigma^2\right)$. Therefore, if $\varphi_1$ and $\varphi_2$ are the characteristic functions of $\N(z_1 \mid \mu_1, \sigma^2_1)$ and $\N(z_2 \mid \mu_2, \sigma^2_2)$, respectively, then

\[\begin{align} (\varphi_1 \varphi_2)(u) &= \exp\left(-iu\mu_1 - \frac{1}{2}u^2\sigma_1^2\right) \exp\left(-iu\mu_2 - \frac{1}{2}u^2\sigma_2^2\right) \\[5pt] &= \exp\left(-iu(\mu_1+\mu_2) - \frac{1}{2}u^2(\sigma_1^2 + \sigma_2^2)\right) , \end{align}\]

which we can immediately identify as the characteristic function of a Gaussian with mean $\mu_1 + \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$.

$ \square $
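Proposition 1 can also be sanity-checked numerically. The sketch below evaluates the convolution integral at a single point $z_1$ with a midpoint rule and compares it with the closed form; the parameter values, the integration range, and the helper names are assumptions made for illustration.

```python
import math

def normal_pdf(z, mean, var):
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def convolve_at(z1, mu1, var1, mu2, var2, lo=-30.0, hi=30.0, n=60_000):
    # Midpoint-rule approximation of
    #   int N(z1 - z2 | mu1, var1) N(z2 | mu2, var2) dz2
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        z2 = lo + (k + 0.5) * h
        total += normal_pdf(z1 - z2, mu1, var1) * normal_pdf(z2, mu2, var2)
    return h * total

mu1, var1 = 0.5, 1.0     # illustrative parameters
mu2, var2 = -1.0, 2.0
z1 = 0.3

numeric = convolve_at(z1, mu1, var1, mu2, var2)
closed_form = normal_pdf(z1, mu1 + mu2, var1 + var2)
```

The two values `numeric` and `closed_form` agree to high precision, since the integrand is smooth and decays quickly inside the chosen range.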

This result is very useful in Bayesian machine learning, especially to obtain the predictive distribution of a Bayesian regression model. For instance, it applies when the distribution over the regressor’s output is a Gaussian $\N(f \mid \mu, \sigma^2)$ and the observation is assumed to be noisy, i.e. $\N(y \mid f, s^2)$.

Corollary 2 (Gaussian Regression). Let $p(y \mid f) = \N(y \mid f, s^2)$ and $p(f) = \N(f \mid \mu, \sigma^2)$ be Gaussians on $\R$. Then,

\[p(y) = \int_\R p(y \mid f) \, p(f) \,df = \N(y \mid \mu, \sigma^2 + s^2) .\]

Proof. First, notice that a Gaussian satisfies the following shift property:

\[\begin{align} \N(x - a \mid \mu, \sigma^2) &= \frac{1}{Z} \exp\left(-\frac{1}{2\sigma^2} ((x-a)-\mu)^2\right) \\[5pt] &= \frac{1}{Z} \exp\left(-\frac{1}{2\sigma^2} (x-(\mu+a))^2\right) \\[5pt] &= \N(x \mid \mu + a, \sigma^2) , \end{align}\]

for $x, a \in \R$, where $Z$ is the normalizing constant. Using this, we can write the integral above as a convolution:

\[\begin{align} \int_\R \N(y \mid f, s^2) \, \N(f \mid \mu, \sigma^2) \,df &= \int_\R \N(y \mid 0+f, s^2) \, \N(f \mid \mu, \sigma^2) \,df \\[5pt] &= \N(y \mid 0, s^2) * \N(f \mid \mu, \sigma^2) . \end{align}\]

Thus, by Proposition 1, we have $p(y) = \N(y \mid 0 + \mu, s^2 + \sigma^2) = \N(y \mid \mu, s^2 + \sigma^2)$.

$ \square $
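A quick Monte Carlo sketch illustrates the predictive distribution: sampling $f \sim \N(\mu, \sigma^2)$ and then $y \sim \N(f, s^2)$ should give samples of $y$ with mean close to $\mu$ and variance close to $\sigma^2 + s^2$. The parameter values, sample size, and seed below are arbitrary assumptions.

```python
import random

random.seed(0)                       # fixed seed so the run is reproducible

mu, sigma2, s2 = 1.0, 4.0, 1.0       # illustrative parameters
n = 200_000

ys = []
for _ in range(n):
    f = random.gauss(mu, sigma2 ** 0.5)      # f ~ N(mu, sigma^2)
    ys.append(random.gauss(f, s2 ** 0.5))    # y | f ~ N(f, s^2)

sample_mean = sum(ys) / n
sample_var = sum((y - sample_mean) ** 2 for y in ys) / n
predicted_var = sigma2 + s2
```

This only checks the first two moments of $p(y)$, not Gaussianity itself, so it is a sanity check rather than a verification of the corollary.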

The Probit Integral and the Probit Approximation

The probit function $\Phi$ is the cumulative distribution function of the standard Normal distribution $\N(x \mid 0, 1)$ on $\R$, i.e., $\Phi(z) := \int_{-\infty}^z \N(x \mid 0, 1) \,dx$. It can conveniently be written in terms of the error function

\[\mathrm{erf}(z) := \frac{2}{\sqrt{\pi}} \int_0^z \exp(-x^2) \,dx\]

as

\[\Phi(z) = \frac{1}{2} \left( 1 + \mathrm{erf}\left(\frac{z}{\sqrt{2}}\right) \right) .\]

Proposition 3 (The Probit Integral). If $\N(x \mid \mu, \sigma^2)$ is a Gaussian on $\R$ and $a, b \in \R$, then

\[\int_{\R} \Phi(ax + b) \, \N(x \mid \mu, \sigma^2) \,dx = \Phi\left(\frac{a\mu + b}{\sqrt{1 + a^2 \sigma^2}}\right).\]

Proof. The standard property of the error function [2] says that

\[\int_{\R} \mathrm{erf}(ax + b) \, \N(x \mid \mu, \sigma^2) \, dx = \mathrm{erf}\left(\frac{a\mu+b}{\sqrt{1 + 2 a^2 \sigma^2}}\right) .\]

So,

\[\require{cancel} \begin{align} \int_{\R} &\left(\frac{1}{2} + \frac{1}{2} \mathrm{erf}\left(\frac{ax+b}{\sqrt{2}}\right)\right) \, \N(x \mid \mu, \sigma^2) \,dx \\[5pt] % &= \frac{1}{2} + \frac{1}{2} \int_{\R} \mathrm{erf}\left(\left(\frac{a}{\sqrt{2}}\right)x+\left(\frac{b}{\sqrt{2}}\right)\right) \, \N(x \mid \mu, \sigma^2) \,dx \\[5pt] % &= \frac{1}{2} + \frac{1}{2} \mathrm{erf}\left(\frac{(a\mu+b)/\sqrt{2}}{\sqrt{1 + \cancel{2} (a/\cancel{\sqrt{2}})^2 \sigma^2}}\right) \\[5pt] % &= \frac{1}{2} \left(1 + \mathrm{erf}\left(\frac{a\mu+b}{\sqrt{2} \sqrt{1 + a^2 \sigma^2}}\right) \right) \\[5pt] % &= \Phi\left(\frac{a\mu+b}{\sqrt{1 + a^2 \sigma^2}}\right) . \end{align}\]

$ \square $
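Proposition 3 is easy to check numerically. The sketch below, which assumes nothing beyond the Python standard library, evaluates the left-hand side with a trapezoidal rule and compares it to the closed form; the function names and the values of $a$, $b$, $\mu$, and $\sigma^2$ are illustrative assumptions.

```python
import math


def Phi(z):
    """Standard Normal CDF, written via the error function as in the text."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(x, mean, var):
    """Density of N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def probit_integral_numeric(a, b, mu, sigma2, n_grid=20001, width=12.0):
    """Trapezoidal estimate of the integral of Phi(a x + b) N(x | mu, sigma2) over x."""
    lo = mu - width * math.sqrt(sigma2)
    hi = mu + width * math.sqrt(sigma2)
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        x = lo + k * h
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * Phi(a * x + b) * normal_pdf(x, mu, sigma2)
    return total * h


def probit_integral_closed(a, b, mu, sigma2):
    """Right-hand side of Proposition 3."""
    return Phi((a * mu + b) / math.sqrt(1.0 + a * a * sigma2))


# Arbitrary illustrative values (assumptions, not from the text).
a, b, mu, sigma2 = 1.5, -0.4, 0.3, 2.0
lhs = probit_integral_numeric(a, b, mu, sigma2)
rhs = probit_integral_closed(a, b, mu, sigma2)
print(lhs, rhs)
```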

This integral is very useful for Bayesian inference since it enables us to approximate the following integral that is ubiquitous in Bayesian binary classifications

\[\int_{\R} \sigma(z) \, \N(z \mid m, s^2) \,dz ,\]

where $\sigma(z) := 1/(1 + \exp(-z))$ is the logistic function.

The key idea is to notice that the probit and logistic functions are both sigmoid functions. That is, their graphs have a similar “S-shape”. Moreover, their images are both $(0, 1)$. However, they are a bit different: the probit function is steeper around zero, i.e. it is more horizontally “contracted” than the logistic function.

So, the strategy to approximate the integral above is as follows: (i) horizontally “stretch” the probit function so that it matches the logistic function and then (ii) use Proposition 3 to get an analytic approximation to the integral.

The first step can be done by a simple change of coordinates: rescale the domain of the probit function with a constant $\lambda$, i.e., $z \mapsto \lambda z$. There are several “good” values for $\lambda$, but commonly it is chosen to be $\lambda = \sqrt{\pi/8} \approx 0.63$, which makes the probit function have the same derivative as the logistic function at zero: $\lambda \, \Phi'(0) = \lambda/\sqrt{2\pi} = 1/4 = \sigma'(0)$. That is, we have the approximation $\sigma(z) \approx \Phi(\lambda z) = \Phi(\sqrt{\pi/8} \, z)$.

Corollary 4. If $\N(z \mid m, s^2)$ is a Gaussian on $\R$, then

\[\int_{\R} \Phi(\lambda z) \, \N(z \mid m, s^2) \, dz = \Phi\left( \frac{m}{\sqrt{\lambda^{-2} + s^2}} \right) .\]

Proof. By Proposition 3, we have

\[\begin{align} \int_{\R} \Phi(\lambda \, z) \, \N(z \mid m, s^2) \, dz &= \Phi\left( \frac{\lambda m}{\sqrt{1 + \lambda^2 s^2}} \right) \\[5pt] &= \Phi\left( \frac{\cancel{\lambda} m}{\cancel{\lambda} \sqrt{\lambda^{-2} + s^2}} \right) . \end{align}\]

$ \square $

Now we are ready to obtain the final approximation, often called the probit approximation.

Proposition 5 (Probit Approximation). If $\N(z \mid m, s^2)$ is a Gaussian on $\R$ and $\sigma(z) \approx \Phi\left(\sqrt{\pi/8} \, z\right)$, then

\[\int_{\R} \sigma(z) \, \N(z \mid m, s^2) \, dz \approx \sigma\left( \frac{m}{\sqrt{1 + \pi/8 \, s^2}} \right) .\]

Proof. Let $\lambda = \sqrt{\pi/8}$. Using Corollary 4 and substituting $\Phi(z) \approx \sigma\left(\lambda^{-1} \, z\right)$:

\[\begin{align} \int_{\R} \sigma(z) \, \N(z \mid m, s^2) \,dz &\approx \Phi\left( \frac{m}{\sqrt{\lambda^{-2} + s^2}} \right) \\[5pt] % &= \sigma\left( \frac{\lambda^{-1} \, m}{\sqrt{\lambda^{-2} + s^2}} \right) \\[5pt] % &= \sigma\left( \frac{\cancel{\lambda^{-1}} \, m}{\cancel{\lambda^{-1}} \, \sqrt{1 + \lambda^2 \, s^2}} \right) . \end{align}\]

Substituting $\lambda^2 = \pi/8$ into the last equation yields the desired result.

$ \square $
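To get a feel for how good the probit approximation is, here is a small sketch that compares it against a brute-force quadrature of the logistic-Gaussian integral. It also checks the slope-matching property of $\lambda = \sqrt{\pi/8}$. Everything is standard-library Python; the function names and the values of $m$ and $s^2$ are assumptions for illustration only.

```python
import math

LAMBDA = math.sqrt(math.pi / 8.0)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def normal_pdf(x, mean, var):
    """Density of N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def logistic_gaussian_integral(m, s2, n_grid=20001, width=12.0):
    """Trapezoidal estimate of the integral of sigmoid(z) N(z | m, s2) over z."""
    lo = m - width * math.sqrt(s2)
    hi = m + width * math.sqrt(s2)
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        z = lo + k * h
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * sigmoid(z) * normal_pdf(z, m, s2)
    return total * h


def probit_approximation(m, s2):
    """Right-hand side of Proposition 5."""
    return sigmoid(m / math.sqrt(1.0 + math.pi / 8.0 * s2))


# Derivative of Phi(lambda z) at zero versus the derivative of sigmoid at zero.
slope_probit = LAMBDA / math.sqrt(2.0 * math.pi)
slope_logistic = 0.25

# Arbitrary illustrative values (assumptions, not from the text).
m, s2 = 2.0, 2.25
numeric = logistic_gaussian_integral(m, s2)
approx = probit_approximation(m, s2)
print(numeric, approx, sigmoid(m))
```

The last printed value is the “plug-in” prediction $\sigma(m)$ that ignores the variance; both the quadrature and the probit approximation should be closer to $1/2$ than it.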

The probit approximation can also be used to obtain an approximation to the following integral, ubiquitous in multi-class classifications:

\[\int_{\R^k} \mathrm{softmax}(z) \, \N(z \mid \mu, \varSigma) \, dz ,\]

where the Gaussian is defined on $\R^k$ and the softmax function is identified by its components $\exp(z_i)/\sum_{j=1}^k \exp(z_j)$ for $i = 1, \dots, k$.

Proposition 6 (Multiclass Probit Approximation; Gibbs, 1998). If $\N(z \mid \mu, \varSigma)$ is a Gaussian on $\R^k$ and $\sigma(z) \approx \Phi(\sqrt{\pi/8}\,z)$, then

\[\int_{\R^k} \mathrm{softmax}(z) \, \N(z \mid \mu, \varSigma) \, dz \approx \mathrm{softmax}\left( \frac{\mu}{\sqrt{1 + \pi/8 \, \diag \varSigma}} \right) ,\]

where the division in the r.h.s. is component-wise.

Proof. The proof is based on [3]. Notice that we can write the $i$-th component of $\mathrm{softmax}(z)$ as $1/(1 + \sum_{j \neq i} \exp(-(z_i - z_j)))$. So, for each $i = 1, \dots, k$, using $z_{ij} := z_i - z_j$, we can write

\[\begin{align} \frac{1}{1 + \sum_{j \neq i} \exp(-z_{ij})} &= \frac{1}{1 - (k-1) + \sum_{j \neq i} \frac{1}{\frac{1}{1 + \exp(-z_{ij})}}} \\[5pt] &= \frac{1}{2-k+\sum_{j \neq i} \frac{1}{\sigma(z_{ij})}} . \end{align}\]

Then, we use the following approximations (which admittedly might be quite loose):

$\E(f(x)) \approx f(\E(x))$,
the mean-field approximation $\N(z \mid \mu, \varSigma) \approx \N(z \mid \mu, \diag{\varSigma})$, and thus we have $z_i - z_j \sim \N(z_{ij} \mid \mu_i - \mu_j, \varSigma_{ii} + \varSigma_{jj})$, and
using the probit approximation (Proposition 5), with a further approximation

\[\begin{align} \int_{\R} \sigma(z_{ij}) \, \N(z_{ij} \mid \mu_i - \mu_j, \varSigma_{ii} + \varSigma_{jj}) \, dz_{ij} &\approx \sigma \left( \frac{\mu_i - \mu_j}{\sqrt{1 + \pi/8 \, (\varSigma_{ii} + \varSigma_{jj})}} \right) \\[5pt] &\approx \sigma \left( \frac{\mu_i}{\sqrt{1 + \pi/8 \, \varSigma_{ii}}} - \frac{\mu_j}{\sqrt{1 + \pi/8 \, \varSigma_{jj}}} \right) , \end{align}\]

we obtain

\[\begin{align} \int_{\R^k} \mathrm{softmax}_i(z) \, \N(z \mid \mu, \varSigma) \, dz &\approx \frac{1}{2-k+\sum_{j \neq i} \frac{1}{\E \sigma(z_{ij})}} \\[5pt] &\approx \frac{1}{2-k+\sum_{j \neq i} \frac{1}{\sigma \left( \frac{\mu_i}{\sqrt{1 + \pi/8 \, \varSigma_{ii}}} - \frac{\mu_j}{\sqrt{1 + \pi/8 \, \varSigma_{jj}}} \right)}} \\[5pt] &= \frac{1}{1 + \sum_{j \neq i} \exp\left( -\left(\frac{\mu_i}{\sqrt{1 + \pi/8 \, \varSigma_{ii}}} - \frac{\mu_j}{\sqrt{1 + \pi/8 \, \varSigma_{jj}}} \right)\right)} \\[5pt] &= \frac{\exp\left(\mu_i/\sqrt{1 + \pi/8 \, \varSigma_{ii}}\right)}{\sum_{j=1}^k \exp\left(\mu_j/\sqrt{1 + \pi/8 \, \varSigma_{jj}}\right)} . \end{align}\]

We identify the last equation above as the $i$-th component of $\mathrm{softmax}\left( \frac{\mu}{\sqrt{1 + \pi/8 \, \diag \varSigma}} \right)$.

$ \square $
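The multiclass formula is cheap to implement. Below is a minimal sketch in standard-library Python that evaluates it and, for comparison, a plain Monte Carlo estimate of the same integral. The example assumes a diagonal covariance (so the mean-field step is exact here), and the function names, the seed, and the values of $\mu$ and $\diag \varSigma$ are made up for illustration.

```python
import math
import random


def softmax(z):
    """Softmax with the usual max-subtraction for numerical stability."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_probit_approx(mu, var_diag):
    """softmax(mu / sqrt(1 + pi/8 * diag(Sigma))), with component-wise division."""
    scaled = [m / math.sqrt(1.0 + math.pi / 8.0 * v) for m, v in zip(mu, var_diag)]
    return softmax(scaled)


def monte_carlo_softmax(mu, var_diag, n_samples=20000, seed=0):
    """Average softmax(z) over samples z ~ N(mu, diag(var_diag))."""
    rng = random.Random(seed)
    acc = [0.0] * len(mu)
    for _ in range(n_samples):
        z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var_diag)]
        for i, p in enumerate(softmax(z)):
            acc[i] += p
    return [a / n_samples for a in acc]


# Arbitrary illustrative values (assumptions, not from the text).
mu = [2.0, 0.5, -1.0]
var_diag = [1.0, 4.0, 0.25]

approx = multiclass_probit_approx(mu, var_diag)
mc = monte_carlo_softmax(mu, var_diag)
print(approx)
print(mc)
```

Since the approximations in the proof are admittedly loose, one should expect the two vectors to be similar but not identical. With zero variance the formula reduces to the ordinary softmax of $\mu$.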

References

1. Ng, Edward W., and Murray Geller. “A table of integrals of the error functions.” Journal of Research of the National Bureau of Standards B 73, no. 1 (1969): 1-20.
2. Gibbs, Mark N. Bayesian Gaussian processes for regression and classification. Dissertation, University of Cambridge, 1998.
3. Lu, Zhiyun, Eugene Ie, and Fei Sha. “Mean-Field Approximation to Gaussian-Softmax Integral with Application to Uncertainty Estimation.” arXiv preprint arXiv:2006.07584 (2020).

The Last Mile of Creating Publication-Ready Plots

Sun, 01 May 2022 00:00:00 -0400

Let’s start with a side-by-side comparison. Which one of the following two plots is more aesthetically pleasing? Left or right? (Taken from one of my papers [1]. The code for generating it is in [2].)

Hopefully, you agree with me that the answer is the one on the right. In that case, we can start our journey in transforming the l.h.s. figure to the r.h.s. one.

Elements of publication-ready plots

Over the years of writing papers, I’ve come to realize some patterns in creating publication-ready plots. Note, I’m not even talking about the content of the plot itself; this is more about how to make your plots fit your paper well. This is essentially the “last mile” of making publication-ready plots, which, sadly, is something that many people ignore.

Anyway, those elements are:

Must be a vector graphic (pdf, svg, etc.).
Should fill the entire \linewidth (or \textwidth) of the page.
Must not be stretched disproportionally.
The font face must be the same as the text’s font face.
The font size can be smaller than the text’s font size, but must still be legible and consistent.

One way to tell that one’s plot is not publication-ready is if one uses Matplotlib without further touching its rcParams, and simply “pastes” it to the paper’s .tex file with \includegraphics.

Below, I show how to ensure the elements above by leveraging the powerful TikZ. Note that one can also do this by modifying the rcParams of Matplotlib, but I only do this in a pinch—I will talk about this in a future post.

TikZ-ing your Matplotlib plots: A basic workflow

TikZ is great because it’s tightly coupled to LaTeX, which we already use for writing the paper. So, TikZ plots will respect the styling of the paper by default, making them aesthetically pleasing out of the box. However, TikZ is notoriously difficult to learn.

But, what if I told you that you don’t need to understand TikZ to use it for making publication-ready plots? The Tikzplotlib library will do all the hard work for you, and all you need is to customize the styling once. Then the resulting plot can be reused over and over again, e.g. in slides and posters, without modification.

So, here’s my workflow for creating a publication-ready plot, from start to finish.

Create a Matplotlib plot as usual.

Instead of plt.savefig(figname), do:

 import tikzplotlib as tpl

 # Create a matplotlib plot
 ...

 # Save as TikZ plot
 tpl.save('figname.tex', axis_width=r'\figwidth', axis_height=r'\figheight')

Here’s an example file you can use to follow this tutorial along: download.

Copy figname.tex to the figs directory in your paper’s LaTeX project.

In the preamble of your paper’s LaTeX file, add:

 \usepackage{pgfplots}
 \pgfplotsset{compat=newest}
 \pgfplotsset{scaled y ticks=false}
 \usepgfplotslibrary{groupplots}
 \usepgfplotslibrary{dateplot}

 \usepackage{tikz}

In your .tex file, do the following to add the figure:

 \begin{figure}
     \def\figwidth{\linewidth}
     \def\figheight{0.15\textheight} % Feel free to change

     \input{figs/figname}
 \end{figure}

Note that \figwidth and \figheight are local variables, so their values will only be used for figname.

At this point, you will already have a quite aesthetically pleasing figure, cf. below. Notice that the font face and size are consistent with the paper’s text. However, notice that we need to improve the plot further, e.g. by unhiding the x- and y-tick labels.

Open figname.tex. You will see the following code:

 \begin{axis}[
     width=\figwidth,
     height=\figheight,
     axis line style={lightgray204},
     tick align=outside,
     unbounded coords=jump,
     x grid style={lightgray204},
     xlabel=\textcolor{darkslategray38}{Dataset},
     xmajorticks=false,
     xmin=-0.5, xmax=3.5,
     xtick style={color=darkslategray38},
     xtick={0,1,2,3},
     xticklabels={MNIST,SVHN,CIFAR-10,CIFAR-100},
     y grid style={lightgray204},
     ylabel=\textcolor{darkslategray38}{Mean Confidence},
     ymajorgrids,
     ymajorticks=false,
     ymin=0, ymax=102.714733242989,
     ytick style={color=darkslategray38}
 ]

You can think of this as the “CSS” of your plot.

First, add the line \tikzstyle{every node}=[font=\scriptsize] before \begin{axis}. This will scale all the font in the plot to \scriptsize, which I think is more pleasing, while still legible.
To unhide the x- and y-tick labels, simply change xmajorticks and ymajorticks to true.
Moreover, notice that we don’t have much space for the legend. So, we need to customize it. Change xmax to 4.1 and add the following option:
```
\begin{axis}[
...
    legend style={nodes={scale=0.75, transform shape}, at={(1,0)}, anchor=south east, draw=black},
...
]
```
The change in xmax will make some room to the right of the plot, while the legend style option will scale down the legend and move it to the lower-right portion of the plot.
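If you regenerate the plot often, the three manual edits above can also be scripted. The following is only a sketch under the assumption that the generated file looks like the axis block shown earlier (options written as `xmajorticks=false`, `xmax=...`, and a single `\begin{axis}[`). The function name `restyle_tikz` and the abbreviated `sample` input are hypothetical and are not part of Tikzplotlib.

```python
import re


def restyle_tikz(source, font=r"\scriptsize", xmax=4.1):
    """Apply the manual styling edits described above to TikZ source text."""
    # 1. Unhide the x- and y-tick labels.
    source = source.replace("xmajorticks=false", "xmajorticks=true")
    source = source.replace("ymajorticks=false", "ymajorticks=true")
    # 2. Widen the x-range to make room for the legend.
    source = re.sub(r"xmax=[-+0-9.eE]+", f"xmax={xmax}", source)
    # 3. Set the font size before the axis and add the legend style option.
    header = "\\tikzstyle{every node}=[font=" + font + "]\n"
    legend = (
        "legend style={nodes={scale=0.75, transform shape}, "
        "at={(1,0)}, anchor=south east, draw=black},"
    )
    source = source.replace(
        "\\begin{axis}[", header + "\\begin{axis}[\n    " + legend, 1
    )
    return source


# Abbreviated stand-in for the generated figname.tex (hypothetical input).
sample = r"""\begin{axis}[
    width=\figwidth,
    height=\figheight,
    xmajorticks=false,
    xmin=-0.5, xmax=3.5,
    ymajorticks=false,
    ymin=0, ymax=102.714733242989,
]"""

restyled = restyle_tikz(sample)
print(restyled)
```

To use it on a real file, read the text of figs/figname.tex, pass it through the function, and write the result back. Editing by hand is just as good if the plot is generated only once.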

Here’s the final result:

Looks much more pleasing than the standard Matplotlib output, doesn’t it? Note that we didn’t change many things other than refining the styling options of the TikZ axis. We didn’t even touch the plot content itself!

If you noticed, at this point, we have already pretty much fulfilled all of the elements of publication-ready plots we discussed previously. I personally think that this level of aesthetic is more than acceptable for publication.

But, to me, the plot can still be refined further.

First, notice that the plot still doesn’t fill the whole text/column’s width. This can be fixed by increasing \figwidth to e.g. 1.04\linewidth.
Second, the y-axis is too tall: it exceeds the maximum value of the data (100). To fix this, simply set ymax=100 in the axis option in figname.tex.
Furthermore, the ticks on the axes (not to be confused with the tick labels) are unnecessary. We can hide them by setting xtick style={draw=none} and ytick style={draw=none}.
Last, the legend looks ugly to me: for some reason by default TikZ uses two bars in the legend. The fix is to add the following before \begin{axis} or in the preamble of main.tex to make it global:

\pgfplotsset{compat=1.11,
 /pgfplots/ybar legend/.style={
 /pgfplots/legend image code/.code={
 \draw[##1,/tikz/.cd,yshift=-0.25em]
 (0cm,0cm) rectangle (3pt,0.8em);},
 },
}

Putting everything together, here’s the final result:

Looks great to me! As a bonus, this plot (i.e. the figname.tex) is highly portable. For example, when you want to reuse this plot in a Beamer presentation or poster, you can simply copy-and-paste figname.tex and include it in your presentation’s .tex file as above; you only need to change the values of \figwidth and \figheight. All the refinement done previously will carry over and the plot’s style will automatically adapt to the style (e.g. font face and size) of your presentation!

Miscellaneous Tips

Suppose you have two plots that you want to show side-by-side in a figure:

\begin{figure}[t]
    \def\figwidth{0.5\linewidth}
    \def\figheight{0.15\textheight}

    \centering

    \subfloat{\input{figs/fig1a}}
    \hfill
    \subfloat{\input{figs/fig1b}}
\end{figure}

How do you make sure that they are perfectly aligned? Easy: simply add the baseline option at the \begin{tikzpicture} line of both fig1a.tex and fig1b.tex, i.e.,

\begin{tikzpicture}[baseline]

...

\begin{axis}[
...

There are also trim axis left and trim axis right options for tikzpicture. As the names suggest, they can be used to tell LaTeX to ignore the left and right axes of the plot when computing the plot’s width. They might be useful in some niche situations.

Faster compilation

If your paper has many complex TikZ plots, your LaTeX compilation can become slow. To mitigate this, we can “cache” the compiled TikZ plots with TikZ’s external library. In your LaTeX preamble, add the following.

\usepackage{tikz}
\usetikzlibrary{external}
\tikzexternalize[prefix=tikz/, figure name=output-figure]

Then, create a directory called tikz/ in your main project directory. This is where external will cache your compiled TikZ plots. Note that this “trick” is fully compatible with Overleaf.

In case you want to disable externalization for one of your plots, e.g. for debugging, you can “surround” your TikZ plot with \tikzexternaldisable and \tikzexternalenable.

\begin{figure}[t]
    \def\figwidth{\linewidth}
    \def\figheight{0.15\textheight}

    \centering

    \tikzexternaldisable
    \input{figs/figname}
    \tikzexternalenable
\end{figure}

Final remark

Last but not least, my final tip is: utilize Google search and Stack Overflow if you need more advanced styling. You will more often than not find your questions already answered there.

References

1. Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. “Being a Bit Frequentist Improves Bayesian Neural Networks.” AISTATS 2022.
2. https://github.com/wiseodd/bayesian_ood_training/blob/master/notebooks/plot_uniform.ipynb.

Modern Arts of Laplace Approximations

Wed, 27 Oct 2021 00:00:00 -0400

Let $f: X \times \Theta \to Y$ defined by $(x, \theta) \mapsto f_\theta(x)$ be a neural network, where $X \subseteq \R^n$, $\Theta \subseteq \R^d$, and $Y \subseteq \R^c$ are the input, parameter, and output spaces, respectively. Given a dataset $\D := \{ (x_i, y_i) : x_i \in X, y_i \in Y \}_{i=1}^m$, we define the likelihood $p(\D \mid \theta) := \prod_{i=1}^m p(y_i \mid f_\theta(x_i))$. Then, given a prior $p(\theta)$, we can obtain the posterior via an application of Bayes’ rule: $p(\theta \mid \D) = 1/Z \,\, p(\D \mid \theta) p(\theta)$. But, the exact computation of $p(\theta \mid \D)$ is intractable in general due to the need to compute the normalization constant

\[Z = \int_\Theta p(\D \mid \theta) p(\theta) \,d\theta .\]

We must then approximate $p(\theta \mid \D)$. One simple way to do this is by simply finding one single likeliest point under the posterior, i.e. the mode of $p(\theta \mid \D)$. This can be done via optimization, instead of integration:

\[\theta_\map := \argmax_\theta \sum_{i=1}^m \log p(y_i \mid f_\theta(x_i)) + \log p(\theta) =: \argmax_\theta \L(\theta; \D) .\]

The estimate $\theta_\map$ is referred to as the maximum a posteriori (MAP) estimate. However, the MAP estimate does not capture the uncertainty around $\theta$. Thus, it often (and in some cases, e.g. [1], almost always) leads to overconfidence.

In the context of Bayesian neural networks, the Laplace approximation (LA) is a family of methods for obtaining a Gaussian approximate posterior distribution of networks’ parameters. The fact that it produces a Gaussian approximation is a step up from the MAP estimation: particularly, it conveys some notion of uncertainty in $\theta$. LA stems from the early work of Pierre-Simon Laplace in 1774 [2] and it was first adapted for Bayesian neural networks (BNNs) by David MacKay in 1992 [3]. The method goes as follows.

Given the MAP estimate $\theta_\map$, let us Taylor-expand $\L$ around $\theta_\map$ up to the second-order:

\[\L(\theta; \D) \approx \L(\theta_\map; \D) + \frac{1}{2} (\theta - \theta_\map)^\top \left(\nabla^2_\theta \L\vert_{\theta_\map}\right) (\theta - \theta_\map) .\]

(Note that the gradient $\nabla_\theta \L$ is zero at $\theta_\map$ since $\theta_\map$ is a critical point of $\L$ and thus the first-order term in the above is also zero.) Now, recall that $\L$ is the log-numerator of the posterior $p(\theta \mid \D)$. Thus, the r.h.s. of the above can be used to approximate the true numerator, by simply exponentiating it:

\[\begin{align*} p(\D \mid \theta)p(\theta) &\approx \exp\left( \L(\theta_\map; \D) + \frac{1}{2} (\theta - \theta_\map)^\top \left(\nabla^2_\theta \L\vert_{\theta_\map}\right) (\theta - \theta_\map) \right) \\[5pt] % &= \exp(\L(\theta_\map; \D)) \exp\left(\frac{1}{2} (\theta - \theta_\map)^\top \left(\nabla^2_\theta \L\vert_{\theta_\map}\right) (\theta - \theta_\map) \right) . \end{align*}\]

For simplicity, let $\varSigma := -\left(\nabla^2_\theta \L\vert_{\theta_\map}\right)^{-1}$. Then, using this approximation, we can also obtain an approximation of $Z$:

\[\begin{align*} Z &\approx \exp(\L(\theta_\map; \D)) \int_\theta \exp\left(-\frac{1}{2} (\theta - \theta_\map)^\top \varSigma^{-1} (\theta - \theta_\map) \right) \,d\theta \\[5pt] % &= \exp(\L(\theta_\map; \D)) (2\pi)^{d/2} (\det \varSigma)^{1/2} , \end{align*}\]

where the equality follows from the fact that the integral above is the famous, tractable Gaussian integral. Combining both approximations, we obtain

\[\begin{align*} p(\theta \mid \D) &\approx \frac{1}{(2\pi)^{d/2} (\det \varSigma)^{1/2}} \exp\left(-\frac{1}{2} (\theta - \theta_\map)^\top \varSigma^{-1} (\theta - \theta_\map) \right) \\[5pt] % &= \N(\theta \mid \theta_\map, \varSigma) . \end{align*}\]

That is, we obtain a tractable, easy-to-work-with Gaussian approximation to the intractable posterior via a simple second-order Taylor expansion! Moreover, this is not just any Gaussian approximation: Notice that this Gaussian is fully determined once we have the MAP estimate $\theta_\map$. Considering that the MAP estimation is the standard procedure for training NNs, the LA is nothing but a simple post-training step on top of it. This means the LA, unlike other approximate inference methods, is a post-hoc method that can be applied to virtually any pre-trained NN, without the need of re-training!
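To make the recipe concrete, here is a one-dimensional toy sketch in standard-library Python. It assumes a coin-flip model with 6 heads and 3 tails and a uniform prior on $\theta \in (0, 1)$, so that $\L(\theta) = a \log \theta + b \log (1 - \theta)$ with $a = 6$, $b = 3$. This model, the variable names, and the use of Newton’s method for finding $\theta_\map$ are all choices made for illustration. In this toy case, the exact normalization constant is a Beta function, so the quality of the approximation of $Z$ can be inspected directly.

```python
import math

# Hypothetical toy model: L(theta) = a * log(theta) + b * log(1 - theta).
a, b = 6.0, 3.0


def log_joint(theta):
    return a * math.log(theta) + b * math.log(1.0 - theta)


def grad(theta):
    return a / theta - b / (1.0 - theta)


def hess(theta):
    return -a / theta ** 2 - b / (1.0 - theta) ** 2


# Step 1: find the MAP estimate (here with Newton's method).
theta = 0.5
for _ in range(50):
    theta -= grad(theta) / hess(theta)
theta_map = theta

# Step 2: the covariance is the negative inverse Hessian at the MAP estimate.
sigma2 = -1.0 / hess(theta_map)

# Step 3: the Laplace approximation of Z, with d = 1.
z_laplace = math.exp(log_joint(theta_map)) * math.sqrt(2.0 * math.pi * sigma2)

# For comparison: the exact normalization constant is the Beta function B(a+1, b+1).
z_exact = math.exp(
    math.lgamma(a + 1.0) + math.lgamma(b + 1.0) - math.lgamma(a + b + 2.0)
)

print(theta_map, sigma2, z_laplace, z_exact)
```

The approximate posterior is then $\N(\theta \mid \theta_\map, \varSigma)$ with the printed mean and variance. The two values of $Z$ are close but not equal, as expected from a second-order approximation of a non-Gaussian posterior.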

Given this approximation, we can then use it as a proxy to the true posterior. For instance, we can use it to obtain the predictive distribution

\[\begin{align*} p(y \mid x, \D) &\approx \int_\theta p(y \mid f_\theta(x)) \, \N(\theta \mid \theta_\map, \varSigma) \,d\theta \\ &\approx \frac{1}{s} \sum_{i=1}^s p(y \mid f_{\theta_i}(x)) \qquad \text{where} \enspace \theta_i \sim \N(\theta \mid \theta_\map, \varSigma) , \end{align*}\]

which in general is less overconfident compared to the MAP-estimate-induced predictive distribution [3].
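The Monte Carlo estimate of the predictive distribution can be sketched in a few lines. The toy below assumes a one-parameter logistic model $p(y = 1 \mid x, \theta) = \sigma(\theta x)$ and made-up values for $\theta_\map$ and $\varSigma$; it is standard-library Python and is not tied to any particular library’s API.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mc_predictive(x, theta_map, var, n_samples=20000, seed=0):
    """Average p(y=1 | f_theta(x)) over samples theta_i ~ N(theta_map, var)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta_i = rng.gauss(theta_map, math.sqrt(var))
        total += sigmoid(theta_i * x)
    return total / n_samples


# Hypothetical values standing in for a fitted Laplace approximation.
theta_map, var = 2.0, 2.25
x = 1.0

p_map = sigmoid(theta_map * x)              # plug-in (MAP) prediction
p_laplace = mc_predictive(x, theta_map, var)  # Laplace-averaged prediction
print(p_map, p_laplace)
```

In this toy setting, the averaged prediction sits between $1/2$ and the MAP prediction, which is the “less overconfident” behavior mentioned above.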

What we have seen is the most general framework of the LA. One can make a specific design decision, such as by imposing a special structure to the Hessian $\nabla^2_\theta \L$, and thus the covariance $\varSigma$.

The laplace-torch library

The simplicity of the LA is not without a drawback. Recall that the parameter $\theta$ is in $\Theta \subseteq \R^d$. In neural networks (NNs), $d$ is often in the order of millions or even billions. Naively computing the Hessian $\nabla^2_\theta \L$ is thus often infeasible since it scales like $O(d^2)$. Together with the fact that the LA is an old method (and thus not “trendy” in the (Bayesian) deep learning community), this might be the reason why the LA is not as popular as other BNN posterior approximation methods such as variational Bayes (VB) and Markov Chain Monte Carlo (MCMC).

Motivated by this observation, in our NeurIPS 2021 paper titled “Laplace Redux – Effortless Bayesian Deep Learning”, we showcase that (i) the Hessian can be obtained cheaply, thanks to recent advances in second-order optimization, and (ii) even the simplest LA can be competitive with more sophisticated VB and MCMC methods, while being much cheaper than them. Of course, numbers alone are not sufficient to promote the goodness of the LA. So, in that paper, we also propose an extensible, easy-to-use software library for PyTorch called laplace-torch, which is available at https://github.com/AlexImmer/Laplace.

laplace-torch is a simple library for, essentially, “turning standard NNs into BNNs”. The main class of this library is the class Laplace, which can be used to transform a standard PyTorch model into a Laplace-approximated BNN. Here is an example.

from laplace import Laplace

# load_pretrained_model() is a placeholder for your own model-loading code
model = load_pretrained_model()
la = Laplace(model, 'regression')

# Compute the Hessian
la.fit(train_loader)

# Hyperparameter tuning
la.optimize_prior_precision()

# Make prediction
pred_mean, pred_var = la(x_test)

The resulting object, la, is a fully-functioning BNN, yielding the following prediction. (Notice the identical regression curves: the LA essentially imbues MAP predictions with uncertainty estimates.)

Of course, laplace-torch is flexible: the Laplace class has almost all state-of-the-art features in Laplace approximations. Those features, along with the corresponding options in laplace-torch, are summarized in the following flowchart. (The options 'subnetwork' for subset_of_weights and 'lowrank' for hessian_structure were still in the works at the time this post was first published.)

The laplace-torch library uses a very cheap yet highly-performant flavor of LA by default, based on [4]:

def Laplace(model, likelihood, subset_of_weights='last_layer', hessian_structure='kron', ...)

That is, by default the Laplace class will fit a last-layer Laplace with a Kronecker-factored Hessian for approximating the covariance. Let us see how this default flavor of LA performs compared to the more sophisticated, recent (all-layer) Bayesian baselines in classification.

Here we can see that Laplace, with default options, improves the calibration (in terms of expected calibration error (ECE)) of the MAP model. Moreover, it is guaranteed to preserve the accuracy of the MAP model—something that cannot be said for other baselines. Ultimately, this improvement is cheap: laplace-torch only incurs little overhead relative to the MAP model—far cheaper than other Bayesian baselines.

Hyperparameter Tuning

Hyperparameter tuning, especially for the prior variance/precision, is crucial in modern Laplace approximations for BNNs. laplace-torch provides several options: (i) cross-validation and (ii) marginal-likelihood maximization (MLM, also known as empirical Bayes and type-II maximum likelihood).

Cross-validation is simple but needs a validation dataset. In laplace-torch, this can be done via the following.

la.optimize_prior_precision(method='CV', val_loader=val_loader)

A more sophisticated and interesting tuning method is MLM. Recall that by taking the second-order Taylor expansion over the log-posterior, we obtain an approximate normalization constant $Z$ of the Gaussian approximate posterior. This object is called the marginal likelihood: it is a probability over the dataset $\D$ and crucially, it is a function of the hyperparameter since the parameter $\theta$ is marginalized out. Thus, we can find the best values for our hyperparameters by maximizing this function.

In laplace-torch, the marginal likelihood can be accessed via

ml = la.log_marginal_likelihood(prior_precision)

This function is compatible with PyTorch’s autograd, so we can backpropagate through it to obtain the gradient of $Z$ w.r.t. the prior precision hyperparameter:

ml.backward()  # Works!

Thus, MLM can easily be done in laplace-torch. By extension, recent methods such as online MLM [5], can also easily be applied using laplace-torch.
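To illustrate the idea behind MLM without any deep learning machinery, here is a toy sketch in standard-library Python. It assumes a scalar linear model $y = w x + \text{noise}$ with known unit noise variance and prior $w \sim \N(0, 1/\alpha)$, and it uses a plain grid search over the prior precision $\alpha$ as a stand-in for the gradient-based optimization described above. The data and all names are made up, and none of this is laplace-torch’s API. For this linear-Gaussian model the Laplace approximation of $\log Z$ happens to be exact.

```python
import math

# Hypothetical toy data for a scalar linear model y = w * x + noise.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-3.1, -1.4, 0.2, 1.6, 2.9]
noise_var = 1.0


def log_marginal_likelihood(prior_precision):
    """Laplace estimate of log Z as a function of the prior precision."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Negative Hessian of the log-joint w.r.t. w (a scalar here).
    post_precision = sxx / noise_var + prior_precision
    w_map = (sxy / noise_var) / post_precision
    log_lik = sum(
        -0.5 * math.log(2.0 * math.pi * noise_var)
        - 0.5 * (y - w_map * x) ** 2 / noise_var
        for x, y in zip(xs, ys)
    )
    log_prior = (
        0.5 * math.log(prior_precision / (2.0 * math.pi))
        - 0.5 * prior_precision * w_map ** 2
    )
    # log Z = L(w_map) + (d/2) log(2 pi) + (1/2) log det(Sigma), with d = 1.
    return (
        log_lik + log_prior + 0.5 * math.log(2.0 * math.pi)
        - 0.5 * math.log(post_precision)
    )


# Grid search over prior precisions from 1e-3 to 1e3 (log-spaced).
grid = [10.0 ** (k / 10.0) for k in range(-30, 31)]
best_precision = max(grid, key=log_marginal_likelihood)
print(best_precision, log_marginal_likelihood(best_precision))
```

The selected precision lies strictly inside the grid, i.e. the marginal likelihood prefers neither an extremely broad nor an extremely narrow prior for this data.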

Outlooks

The laplace-torch library is continuously developed. Support for more likelihood functions and priors, subnetwork Laplace, etc. is on the way.

In any case, we hope to see the revival of the LA in the Bayesian deep learning community. So, please try out our library at https://github.com/AlexImmer/Laplace!

References

1. Hein, Matthias, Maksym Andriushchenko, and Julian Bitterwolf. “Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem.” CVPR 2019.
2. Laplace, Pierre Simon. “Mémoires de Mathématique et de Physique, Tome Sixieme” 1774.
3. MacKay, David JC. “The evidence framework applied to classification networks.” Neural computation 4.5 (1992).
4. Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. “Being Bayesian, even just a bit, fixes overconfidence in ReLU networks.” ICML 2020.
5. Immer, Alexander, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Mohammad Emtiyaz Khan. “Scalable marginal likelihood estimation for model selection in deep learning.” ICML, 2021.

Chentsov's Theorem

Tue, 20 Jul 2021 00:00:00 -0400

Let $p_\theta(x)$ be a probability density on $\R^n$, parametrized by $\theta \in \R^d$. The Fisher information is defined by

\[\I_{ij}(\theta) := \E_{p_\theta(x)} \left( \partial_i \log p_\theta(x) \, \partial_j \log p_\theta(x) \right)\]

where $\partial_i := \partial/\partial \theta^i$ for each $i = 1, \dots, d$. Note that $\I(\theta)$ is positive semi-definite because one can see it as the (expected) outer-product of the gradient of the log-density.
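As a small worked example of the definition, the sketch below computes the Fisher information of a Bernoulli density $p_\theta(x) = \theta^x (1 - \theta)^{1 - x}$ by taking the exact expectation over $x \in \{0, 1\}$ of the squared score, where the score is obtained by central finite differences. The Bernoulli model, the function names, and the value of $\theta$ are assumptions for illustration. The known closed form for this model is $1/(\theta (1 - \theta))$.

```python
import math


def log_p(x, theta):
    """Log-density of a Bernoulli with success probability theta."""
    return x * math.log(theta) + (1 - x) * math.log(1.0 - theta)


def score(x, theta, eps=1e-6):
    """Central finite-difference estimate of d/dtheta log p_theta(x)."""
    return (log_p(x, theta + eps) - log_p(x, theta - eps)) / (2.0 * eps)


def fisher_information(theta):
    """Exact expectation over x in {0, 1} of the squared score."""
    return sum(math.exp(log_p(x, theta)) * score(x, theta) ** 2 for x in (0, 1))


theta = 0.3  # arbitrary illustrative value
fisher_numeric = fisher_information(theta)
fisher_closed = 1.0 / (theta * (1.0 - theta))
print(fisher_numeric, fisher_closed)
```

Being an expectation of a square, the result is non-negative, in line with the positive semi-definiteness noted above.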

The Fisher Information under Sufficient Statistics

Let $T : \R^n \to \R^n$ with $x \mapsto y$ be a bijective transformation of the r.v. $x \sim p_\theta(x)$. By the Fisher-Neyman factorization, $T$ is a sufficient statistic for the parameter $\theta$ if there exist non-negative functions $g_\theta$ and $h$, where $g_\theta$ depends on $\theta$ while $h$ does not, such that we can write the density $p_\theta(x)$ as follows:

\[p_\theta(x) = g_\theta(T(x)) h(x) .\]

The following proposition shows the behavior of $\I$ under sufficient statistics.

Proposition 1. The Fisher information is invariant under sufficient statistics.

Proof. Let $T$ be a sufficient statistic and so $p_\theta(x) := g_\theta(T(x)) h(x)$. Notice that this implies

\[\partial_i \log g_\theta(T(x)) = \partial_i \log p_\theta(x) .\]

So, the Fisher information $\I(\theta; T)$ under $T$ is

\[\begin{align} \I(\theta; T) &= \E \left( \partial_i \log (g_\theta(T(x)) h(x)) \, \partial_j \log (g_\theta(T(x)) h(x)) \right) \\ % &= \E \left( \partial_i \log g_\theta(T(x)) \, \partial_j \log g_\theta(T(x)) \right) \\ % &= \E \left( \partial_i \log p_\theta(x) \, \partial_j \log p_\theta(x) \right) \\ % &= \I(\theta) . \end{align}\]

We conclude that $\I$ is invariant under sufficient statistics.

$ \square $

The Fisher Information as a Riemannian Metric

Let

\[M := \{ p_\theta(x) : \theta \in \R^d \}\]

be the set of the parametric densities $p_\theta(x)$. We can treat $M$ as a smooth $d$-manifold by imposing a global coordinate chart $p_\theta(x) \mapsto \theta$. Thus, we can identify a point $p_\theta(x)$ on $M$ with its parameter $\theta$ interchangeably.

Let us assume that $\I$ is positive-definite everywhere, and each $\I_{ij}$ is smooth. Then we can use it as (the coordinate representation of) a Riemannian metric for $M$. This is because $\I$ is a covariant 2-tensor. (Recall the definition of a Riemannian metric.)

Proposition 2. The component functions $\I_{ij}$ of $\I$ follow the covariant transformation rule.

Proof. Let $\theta \mapsto \varphi$ be a change of coordinates and let $\ell(\varphi) := \log p_\varphi(x)$. The component function $\I_{ij}(\theta)$ in the “old” coordinates is expressed in terms of the “new” ones, as follows:

\[\begin{align} \I_{ij}(\theta) &= \E \left( \frac{\partial \ell}{\partial \theta^i} \, \frac{\partial \ell}{\partial \theta^j} \right) \\ &= \E \left( \sum_{k=1}^d \frac{\partial \ell}{\partial \varphi^k} \frac{\partial \varphi^k}{\partial \theta^i} \, \sum_{l=1}^d \frac{\partial \ell}{\partial \varphi^l} \frac{\partial \varphi^l}{\partial \theta^j} \right) \\ &= \sum_{k=1}^d \sum_{l=1}^d \frac{\partial \varphi^k}{\partial \theta^i} \frac{\partial \varphi^l}{\partial \theta^j} \E \left( \frac{\partial \ell}{\partial \varphi^k} \, \frac{\partial \ell}{\partial \varphi^l} \right) \\ &= \sum_{k=1}^d \sum_{l=1}^d \frac{\partial \varphi^k}{\partial \theta^i} \frac{\partial \varphi^l}{\partial \theta^j} \I_{kl}(\varphi) , \end{align}\]

where the second equality follows from the standard chain rule and the third from the linearity of the expectation (the Jacobian does not depend on $x$). We conclude that $\I$ is covariant since the Jacobian $\partial \varphi/\partial \theta$ of the transformation multiplies the “new” component functions $\I_{kl}(\varphi)$ of $\I$ to obtain the “old” ones.

$ \square $
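The transformation rule can be verified numerically in one dimension. The sketch below assumes a Bernoulli density with the “old” coordinate $\theta \in (0, 1)$ and the “new” coordinate $\varphi = \log(\theta/(1 - \theta))$, the logit. Both Fisher informations are computed from the definition by an exact expectation over $x \in \{0, 1\}$. The choice of model and coordinates, and all names, are illustrative assumptions.

```python
import math


def fisher_in_theta(theta):
    """E[(d/dtheta log p)^2] for a Bernoulli, by summing over x in {0, 1}."""
    total = 0.0
    for x, prob in ((0, 1.0 - theta), (1, theta)):
        score = x / theta - (1 - x) / (1.0 - theta)
        total += prob * score ** 2
    return total


def fisher_in_phi(phi):
    """E[(d/dphi log p)^2] for the same Bernoulli in logit coordinates."""
    theta = 1.0 / (1.0 + math.exp(-phi))
    total = 0.0
    for x, prob in ((0, 1.0 - theta), (1, theta)):
        score = x - theta  # derivative of x * phi - log(1 + exp(phi)) w.r.t. phi
        total += prob * score ** 2
    return total


theta = 0.3  # arbitrary illustrative value
phi = math.log(theta / (1.0 - theta))
jacobian = 1.0 / (theta * (1.0 - theta))  # d(phi)/d(theta)

old_component = fisher_in_theta(theta)
transformed = jacobian * jacobian * fisher_in_phi(phi)
print(old_component, transformed)
```

The two printed numbers coincide: the “old” component equals the “new” one multiplied twice by the Jacobian, as Proposition 2 states.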

Chentsov's Theorem

The previous two results are useful since the Fisher information metric is invariant under sufficient statistics. In this sense, $\I$ has a statistical invariance property. But this is not a strong enough reason for arguing that $\I$ is a “natural” or “the best” metric for $M$.

Here, we shall see a stronger statement, due to Chentsov in 1972, about the Fisher metric: It is the unique statistically-invariant metric for $M$ (up to a scaling constant). This makes $\I$ stand out over any other metric for $M$.

Originally, Chentsov’s theorem is described on the space of Categorical probability distributions over the sample space $\Omega := \{ 1, \dots, n \}$, i.e. the probability simplex. We use the result of Campbell (1986) as a stepping stone. To do so, we need to define the so-called Markov embeddings.

Let $\{ A_1, \dots, A_n \}$ be a partition of $\{ 1, \dots, m \}$, where $2 \leq n \leq m$. We define a conditional probability table $Q$ of size $n \times m$ where

\[\begin{align} q_{ij} &= 0 \quad \text{if } j \not\in A_i \\ q_{ij} &> 0 \quad \text{if } j \in A_i \\ & {\textstyle\sum_{j=1}^m} q_{ij} = 1 . \end{align}\]

That is, the $i$-th row of $Q$ gives probabilities signifying the membership of each $j \in \{ 1, \dots, m \}$ in $A_i$. Based on this, we define a map $f: \R^n_{> 0} \to \R^m_{>0}$ by

\[y_j := \sum_{i=1}^n q_{ij} x^i \qquad \forall\enspace j = 1, \dots, m .\]

We call this map a Markov embedding. The name suggests that $f$ embeds $\R^n_{> 0}$ in a higher-dimensional space $\R^m_{> 0}$.
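Here is a tiny concrete instance of a Markov embedding with $n = 2$ and $m = 3$, written in plain Python. It assumes the blocks $A_1 = \{1\}$ and $A_2 = \{2, 3\}$ and a table $Q$ whose second row splits its mass as $0.3$ and $0.7$. These numbers and the function name are made up for illustration.

```python
# Rows are indexed by i = 1..n, columns by j = 1..m.
# Row i is supported on the block A_i and sums to one.
Q = [
    [1.0, 0.0, 0.0],  # A_1 = {1}
    [0.0, 0.3, 0.7],  # A_2 = {2, 3}
]


def markov_embedding(x, Q):
    """Compute y_j = sum_i q_ij x^i for j = 1..m."""
    n, m = len(Q), len(Q[0])
    return [sum(Q[i][j] * x[i] for i in range(n)) for j in range(m)]


x = [0.4, 0.6]  # a point in R^2 with positive entries
y = markov_embedding(x, Q)
print(y, sum(x), sum(y))
```

Because every row of $Q$ sums to one, the embedding preserves the total mass, i.e. the sum of the components of $y$ equals the sum of the components of $x$. In particular, a point on the probability simplex is mapped to a point on the higher-dimensional probability simplex.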

The result of Campbell (1986) characterizes the form of the Riemannian metric in $\R^n_{>0}$ that is invariant under any Markov embedding.

Lemma 3 (Campbell, 1986). Let $g$ be a Riemannian metric on $\R^n_{>0}$ where $n \geq 2$. Suppose that every Markov embedding on $(\R^n_{>0}, g)$ is an isometry. Then

\[g_{ij}(x) = A(\abs{x}) + \delta_{ij} \frac{\abs{x} B(\abs{x})}{x^i} ,\]

where $\abs{x} = \sum_{i=1}^n x^i$, $\delta_{ij}$ is the Kronecker delta, and $A, B \in C^\infty(\R_{>0})$ satisfy $B > 0$ and $A + B > 0$.

Proof. See Campbell (1986) and Amari (2016, Sec. 3.5).

$ \square $

Lemma 3 is a general statement about the invariant metric in $\R^n_{>0}$ and it does not say anything about sufficient statistics and probability distributions. To get the main result, we restrict ourselves to the $(n-1)$-dimensional probability simplex $\Delta^{n-1} \subset \R^n_{>0}$, which is the space of (Categorical) probability distributions.

The fact that the Fisher information is the unique invariant metric under sufficient statistics follows from the fact that when $n = m$, the Markov embedding reduces to a permutation of the components of $x \in \R^n_{>0}$—i.e. the permutation of $\Omega$. This is because permutations of $\Omega$ are sufficient statistics for Categorical distribution.

Let us, therefore, connect the result in Lemma 3 with the Fisher information on $\Delta^{n-1}$. We give the latter in the following lemma.

Lemma 4. The Fisher information of a Categorical distribution $p_\theta(z)$ where $z$ takes values in $\Omega = \{ 1, \dots, n \}$ and $\theta = ( \theta^1, \dots, \theta^n ) \in \Delta^{n-1}$ is given by

\[\I_{ij}(\theta) = \delta_{ij} \frac{1}{\theta^i} .\]

That is, $\I(\theta)$ is an $(n \times n)$ diagonal matrix with $i$-th entry $1/\theta^i$.

Proof. By definition,

\[p_\theta(z) = \prod_{i=1}^n \left(\theta^i\right)^{(z^i)} ,\]

where we assume that $z$ is one-hot encoded. Its score function is given by

\[\partial_i \log p_\theta(z) = \partial_i \sum_{j=1}^n z^j \log \theta^j = \sum_{j=1}^n z^j \frac{1}{\theta^j} \delta_{ij} = \frac{z^i}{\theta^i} ,\]

for each $i = 1, \dots, n$. Hence, using the fact that $z$ is one-hot, so that $(z^i)^2 = z^i$ and $\E(z^i) = \theta^i$:

\[\begin{align} \I_{ii}(\theta) &= \E \left( \frac{z^i}{\theta^i} \, \frac{z^i}{\theta^i} \right) \\ &= \frac{1}{(\theta^i)^2} \E \left( (z^i)^2 \right) \\ &= \frac{1}{(\theta^i)^2} \theta^i \\ &= \frac{1}{\theta^i} . \end{align}\]

Using a similar step, we can show that $\I_{ij}(\theta) = 0$ for $i \neq j$ because $z^i z^j$ is always zero when $z$ is one-hot.

$ \square $
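Lemma 4 can be checked numerically by evaluating the expectation exactly over the $n$ one-hot outcomes. The following is a small sketch; the value of `theta` is an arbitrary example.

```python
import numpy as np


def categorical_fisher(theta):
    """Exact Fisher information E[score score^T] of a Categorical distribution."""
    n = len(theta)
    fisher = np.zeros((n, n))
    for k in range(n):
        z = np.eye(n)[k]   # one-hot outcome, occurring with probability theta[k]
        score = z / theta  # i-th entry is z^i / theta^i
        fisher += theta[k] * np.outer(score, score)
    return fisher


theta = np.array([0.2, 0.3, 0.5])
fisher = categorical_fisher(theta)
print(fisher)              # diagonal matrix
print(np.diag(fisher))     # entries 1 / theta^i
```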

Now we are ready to state the main result.

Theorem 5 (Chentsov, 1972). The Fisher information is the unique Riemannian metric on $\Delta^{n-1}$ that is invariant under sufficient statistics, up to a multiplicative constant.

Proof. By Lemma 3, the invariant metric under Markov embeddings in $\R^n_{> 0}$ is given by

\[g_{ij}(x) = A(\abs{x}) + \delta_{ij} \frac{\abs{x} B(\abs{x})}{x^i} ,\]

for any $x \in \R^n_{> 0}$. Therefore, this is the form of the invariant metric under sufficient statistics in $\Delta^{n-1} \subset \R^n_{>0}$, i.e. when $n=m$ in the Markov embedding.

Let us therefore restrict $g$ to $\Delta^{n-1}$. For each $\theta \in \Delta^{n-1}$, the tangent space $T_\theta \Delta^{n-1}$ is orthogonal, w.r.t. the Euclidean inner product, to the line $x^1 = x^2 = \dots = x^n$, whose direction is given by the vector $\mathbf{1} = (1, \dots, 1) \in \R^n_{>0}$. This is a vector normal to $\Delta^{n-1}$, implying that any $v \in T_\theta \Delta^{n-1}$ satisfies $\inner{\mathbf{1}, v} = 0$, i.e. $\sum_{i=1}^n v^i = 0$.

Moreover, if $\theta \in \Delta^{n-1}$, then $\abs{\theta} = \sum_{i=1}^n \theta^i = 1$ by definition. Thus, $A(1)$ and $B(1)$ are constants. So, if $v, w \in T_\theta \Delta^{n-1}$, we have:

\[\begin{align} \inner{v, w}_{\theta} &= \sum_{i=1}^n \sum_{j=1}^n g_{ij} v^i w^j = A(1) \sum_{i = 1}^n \sum_{j = 1}^n v^i w^j + B(1) \sum_{i=1}^n \frac{v^i w^i}{\theta^i} \\ % &= A(1) \underbrace{\left(\sum_{i = 1}^n v^i\right)}_{=0} \underbrace{\left(\sum_{j = 1}^n w^j\right)}_{=0} + B(1) \sum_{i=1}^n \frac{v^i w^i}{\theta^i} . \end{align}\]

Therefore $A(1)$ does not contribute to the inner product and we may, w.l.o.g., write the metric as a diagonal matrix:

\[g_{ij}(\theta) = \delta_{ij} \frac{B(1)}{\theta^i} .\]

Recalling that $B(1)$ is a constant, by Lemma 4, we have $g_{ij}(\theta) \propto \I_{ij}(\theta)$.

$ \square $
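The key step of the proof, namely that the $A$ term drops out on tangent vectors of the simplex, can be verified numerically. This is only a sketch: the functions `A` and `B` below are hypothetical choices satisfying $B > 0$ and $A + B > 0$, and the vectors are arbitrary examples whose components sum to zero.

```python
import numpy as np


def campbell_metric(x, A, B):
    """Matrix of g_ij(x) = A(|x|) + delta_ij |x| B(|x|) / x^i."""
    s = x.sum()
    return A(s) * np.ones((len(x), len(x))) + np.diag(s * B(s) / x)


# Hypothetical choices of A and B, for illustration only.
A = lambda s: 3.0 * s
B = lambda s: 2.0 + s

theta = np.array([0.1, 0.6, 0.3])  # a point on the simplex
v = np.array([0.5, -0.2, -0.3])    # tangent vectors: components sum to zero
w = np.array([-1.0, 0.4, 0.6])

g = campbell_metric(theta, A, B)
fisher = np.diag(1.0 / theta)

lhs = v @ g @ w                   # inner product under the invariant metric
rhs = B(1.0) * (v @ fisher @ w)   # B(1) times the Fisher inner product
print(lhs, rhs)
```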

Generalizations of this (original) version of Chentsov’s theorem exist. For instance, Ay et al. (2015) showed Chentsov’s theorem for arbitrary, parametric probability distributions. Dowty (2018) stated Chentsov’s theorem for exponential family distributions.

References

Chentsov, N. N. “Statistical Decision Rules and Optimal Deductions.” (1972).
Campbell, L. Lorne. “An extended Čencov characterization of the information metric.” Proceedings of the American Mathematical Society 98, no. 1 (1986): 135-141.
Amari, Shun-ichi. Information geometry and its applications. Vol. 194. Springer, 2016.
Ay, Nihat, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer. “Information geometry and sufficient statistics.” Probability Theory and Related Fields 162, no. 1-2 (2015): 327-364.
Dowty, James G. “Chentsov’s theorem for exponential families.” Information Geometry 1, no. 1 (2018): 117-135.

The Curvature of the Manifold of Gaussian Distributions

Mon, 21 Jun 2021 08:00:00 -0400

The (univariate) Gaussian distribution is defined by the following p.d.f.:

\[\N(x \mid \mu, \sigma) := \frac{1}{\sigma \sqrt{2 \pi}} \exp\left( - \frac{(x-\mu)^2}{2 \sigma^2} \right) .\]

Let $M := \{ \N(x \mid \mu, \sigma) : (\mu, \sigma) \in \R \times \R_{> 0} \}$ be the set of all Gaussian p.d.f.s. We would like to treat this set as a smooth manifold and then, additionally, as a Riemannian manifold.

First, let’s define a coordinate chart for $M$. Let $\theta : M \to \R \times \R_{>0}$, defined by $\N(x \mid \mu, \sigma) \mapsto (\mu, \sigma)$, be such a chart. That is, the coordinate chart $\theta$ maps $M$ to the open Euclidean upper half-plane $\{ (x, y) : y > 0 \}$. Note that $\theta$ is a global chart since the Gaussian distribution is uniquely identified by its location and scale (i.e. its mean and standard deviation). Thus, we can interchangeably write $p \in M$ or $\theta := (\mu, \sigma) \in \R \times \R_{>0}$ with a slight abuse of notation. From here, it is clear that $M$ is of dimension $2$ because $\theta$ gives a homeomorphism from $M$ to $\R \times \R_{>0} \simeq \R^2$.

Now let us equip the smooth manifold $M$ with a Riemannian metric, say $g$. The standard choice for $g$ for probability distributions is the Fisher information metric. I.e., in coordinates, it is defined by

\[\begin{align} g_{ij} &= g_{ij}(\theta) := \E_{\N(x \mid \mu, \sigma)} \left( \frac{\partial \log \N(x \mid \mu, \sigma)}{\partial \theta^i} \, \frac{\partial \log \N(x \mid \mu, \sigma)}{\partial \theta^j} \right) \\ % &= -\E_{\N(x \mid \mu, \sigma)} \left( \frac{\partial^2 \log \N(x \mid \mu, \sigma)}{\partial \theta^i \, \partial \theta^j} \right) . \end{align}\]

In matrix form, it is (see here)

\[G := (g_{ij}) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2} \end{pmatrix} .\]

Its inverse, denoted by upper indices, $(g^{ij}) = G^{-1}$ is given by

\[g^{ij} = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \frac{\sigma^2}{2} \end{pmatrix} .\]

Note in particular that the matrix $G$ is positive definite for any $(\mu, \sigma)$ and thus gives a notion of inner product in the tangent bundle of $M$. Therefore, the tuple $(M, g)$ is a Riemannian manifold.
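The matrix $G$ above can be sanity-checked by approximating the expectation in the definition of $g_{ij}$ numerically. The sketch below uses a plain Riemann sum on a wide grid; the grid size, the integration width, and the test point $(\mu, \sigma) = (0.5, 2)$ are arbitrary choices.

```python
import numpy as np


def gaussian_fisher(mu, sigma, num=200_001, width=12.0):
    """Approximate E[score_i score_j] for N(mu, sigma) with a Riemann sum."""
    x = np.linspace(mu - width * sigma, mu + width * sigma, num)
    dx = x[1] - x[0]
    pdf = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    score = np.stack([
        (x - mu) / sigma**2,                      # d/d mu of log N
        -1.0 / sigma + (x - mu) ** 2 / sigma**3,  # d/d sigma of log N
    ])
    outer = score[:, None, :] * score[None, :, :]  # shape (2, 2, num)
    return (outer * pdf).sum(axis=-1) * dx


G = gaussian_fisher(mu=0.5, sigma=2.0)
print(G)  # should be close to diag(1 / sigma^2, 2 / sigma^2)
```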

One more structure is needed for computing the curvature(s) of $M$. We need to equip $(M, g)$ with an affine connection. Here, we will use the Levi-Civita connection $\nabla$ of $g$.

Note. We will use the Einstein summation convention from now on. For example, $\Gamma^k_{ij} \Gamma^l_{km} = \sum_k \Gamma^k_{ij} \Gamma^l_{km}$.

Christoffel Symbols

The first order of business is to determine the connection coefficients of $\nabla$, i.e. the Christoffel symbols of the second kind. In coordinates, they are represented by the $3$-dimensional array $(\Gamma^k_{ij}) \in \R^{2 \times 2 \times 2}$, and are given by the following formula

\[\Gamma^k_{ij} := \frac{1}{2} g^{kl} \left( \frac{\partial g_{jl}}{\partial \theta^i} + \frac{\partial g_{il}}{\partial \theta^j} - \frac{\partial g_{ij}}{\partial \theta^l} \right) .\]

Moreover, due to the symmetry of the Levi-Civita connection, the lower indices of $\Gamma$ are symmetric, i.e. $\Gamma^k_{ij} = \Gamma^k_{ji}$ for all $i, j, k = 1, 2$.

Let us begin with $k = 1$. For $i,j = 1$, we have

\[\begin{align} \Gamma^1_{11} &= \frac{1}{2} g^{11} \left( \frac{\partial g_{11}}{\partial \theta^1} + \frac{\partial g_{11}}{\partial \theta^1} - \frac{\partial g_{11}}{\partial \theta^1} \right) + \frac{1}{2} \underbrace{g^{12}}_{=0} \left( \frac{\partial g_{12}}{\partial \theta^1} + \frac{\partial g_{12}}{\partial \theta^1} - \frac{\partial g_{11}}{\partial \theta^2} \right) \\ % &= \frac{1}{2} \sigma^2 \frac{\partial}{\partial \mu} \left( \frac{1}{\sigma^2} \right) = 0 . \end{align}\]

Similarly, we have $\Gamma^1_{22} = 0$. For $\Gamma^1_{12} = \Gamma^1_{21}$, we have

\[\require{cancel} % \begin{align} \Gamma^1_{12} = \Gamma^1_{21} &= \frac{1}{2} g^{11} \left( \cancel{\frac{\partial g_{21}}{\partial \theta^1}} + \frac{\partial g_{11}}{\partial \theta^2} - \cancel{\frac{\partial g_{12}}{\partial \theta^1}} \right) + \frac{1}{2} \underbrace{g^{12}}_{=0} \dots \\ % &= \frac{1}{2} \sigma^2 \frac{\partial}{\partial \sigma} \left( \frac{1}{\sigma^2} \right) \\ % &= -\frac{1}{\sigma} . \end{align}\]

Note that in the above, we can immediately cross out partial derivatives that depend on $\theta^1 = \mu$ since we know that $g_{ij}$ does not depend on $\mu$ for all $i, j = 1, 2$. Meanwhile, we know immediately that the second term is zero because $g$ is diagonal—in particular $g^{ij} = 0$ for $i \neq j$.

Now, for $k=2$, we can easily show (the hardest part is to keep track of the indices) that $\Gamma^2_{12} = \Gamma^2_{21} = 0$. Meanwhile,

\[\begin{align} \Gamma^2_{11} &= \frac{1}{2} \underbrace{g^{21}}_{0} \dots + \frac{1}{2} g^{22} \left( \underbrace{\frac{\partial g_{12}}{\partial \theta^1}}_{0} + \underbrace{\frac{\partial g_{12}}{\partial \theta^1}}_{0} - \frac{\partial g_{11}}{\partial \theta^2} \right) \\ % &= -\frac{1}{2} \frac{\sigma^2}{2} \frac{\partial}{\partial \sigma} \left( \frac{1}{\sigma^2} \right) \\ % &= -\frac{1}{\cancel{2}} \frac{\cancel{\sigma^2}}{2} \left(-\cancel{2} \frac{1}{\sigma^{\cancel{3}}}\right) \\ % &= \frac{1}{2\sigma} , \end{align}\]

and similar computation gives $\Gamma^2_{22} = -\frac{1}{\sigma}$.

So, all in all, $\Gamma$ is given by

\[\Gamma^k = \begin{cases} \begin{pmatrix} 0 & -\frac{1}{\sigma} \\ -\frac{1}{\sigma} & 0 \end{pmatrix} & \text{if } k = 1 \\[3pt] % \begin{pmatrix} \frac{1}{2\sigma} & 0 \\ 0 & -\frac{1}{\sigma} \end{pmatrix} & \text{if } k = 2 . \end{cases}\]
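The hand computation above can be cross-checked numerically. The sketch below evaluates the Christoffel formula with central finite differences of the metric; the step size `eps` and the test point are arbitrary choices, and indices are 0-based in the code.

```python
import numpy as np


def metric(theta):
    """Fisher metric of the Gaussian manifold in coordinates (mu, sigma)."""
    mu, sigma = theta
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])


def metric_derivs(theta, eps=1e-6):
    """dg[l, i, j] = d g_ij / d theta^l via central differences."""
    dg = np.zeros((2, 2, 2))
    for l in range(2):
        step = np.zeros(2)
        step[l] = eps
        dg[l] = (metric(theta + step) - metric(theta - step)) / (2 * eps)
    return dg


def christoffel(theta):
    """Gamma[k, i, j] = 1/2 g^{kl} (d_i g_jl + d_j g_il - d_l g_ij)."""
    g_inv = np.linalg.inv(metric(theta))
    dg = metric_derivs(theta)
    # term[l, i, j] = d_i g_jl + d_j g_il - d_l g_ij
    term = np.einsum("ijl->lij", dg) + np.einsum("jil->lij", dg) - dg
    return 0.5 * np.einsum("kl,lij->kij", g_inv, term)


theta = np.array([0.3, 2.0])  # (mu, sigma)
Gamma = christoffel(theta)
print(Gamma[0])  # compare with the k = 1 matrix above
print(Gamma[1])  # compare with the k = 2 matrix above
```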

Sectional Curvature

Now we are ready to compute the curvature of $M$. There are different notions of curvature, e.g. the Riemann and Ricci curvature tensors, or the scalar curvature. In this post, we focus on the sectional curvature, which is a generalization of the Gaussian curvature in classical surface geometry (i.e. the study of embedded $2$-dimensional surfaces in $\R^3$).

Let $v, w$ in $T_pM$ be two basis vectors for $T_pM$. The formula of the sectional curvature $\text{sec}(v, w)$ under $v, w$ is as follows:

\[\text{sec}(v, w) := \frac{Rm(v, w, w, v)}{\inner{v, v} \inner{w, w} - \inner{v, w}^2} ,\]

where $Rm$ is the Riemann curvature tensor, and $\inner{\cdot, \cdot}$ denotes the inner product w.r.t. $g$. Note that $\text{sec}(v, w)$ is independent of the choice of $(v,w)$, i.e. given another pair of basis vectors $(v_0, w_0)$ of $T_pM$, we have that $\text{sec}(v_0, w_0) = \text{sec}(v, w)$.

The partial derivative operators $\frac{\partial}{\partial \theta^1} =: \partial_1$ and $\frac{\partial}{\partial \theta^2} =: \partial_2$ under the coordinates $\theta$ form a basis for $T_pM$. So, let us use them to compute the sectional curvature of $M$. In this case, the formula reads as

\[\text{sec}(\partial_1, \partial_2) = \frac{Rm(\partial_1, \partial_2, \partial_2, \partial_1)}{\inner{\partial_1, \partial_1} \inner{\partial_2, \partial_2} - \inner{\partial_1, \partial_2}^2} .\]

But the definition of $Rm$ implies that $Rm(\partial_1, \partial_2, \partial_2, \partial_1) = R_{1221}$, i.e. the element $1,2,2,1$ of the multidimensional array representation of $Rm$ in coordinates. Moreover, by definition, $g_{ij} = \inner{\partial_i, \partial_j}$. And so:

\[\text{sec}(\partial_1, \partial_2) = \frac{R_{1221}}{g_{11} g_{22} - (g_{12})^2} = \frac{R_{1221}}{\det g} ,\]

since $g$ is symmetric. Note that this is but the definition of the Gaussian curvature—indeed, in dimension $2$, the sectional and the Gaussian curvatures coincide.

We are now ready to compute $\text{sec}(\partial_1, \partial_2)$. The denominator is easy from our definition of $G$ at the beginning of this post:

\[\det g = \frac{1}{\sigma^2} \frac{2}{\sigma^2} = \frac{2}{\sigma^4} .\]

For the numerator, we can compute $R_{ijkl}$ via the metric and the Christoffel symbols:

\[R_{ijkl} = g_{lm} \left( \frac{\partial \Gamma^m_{jk}}{\partial \theta^i} - \frac{\partial \Gamma^m_{ik}}{\partial \theta^j} + \Gamma^p_{jk} \Gamma^m_{ip} - \Gamma^p_{ik} \Gamma^m_{jp} \right) .\]

So, we have

\[\begin{align} R_{1221} &= g_{1m} \left( \frac{\partial \Gamma^m_{22}}{\partial \theta^1} - \frac{\partial \Gamma^m_{12}}{\partial \theta^2} + \Gamma^p_{22} \Gamma^m_{1p} - \Gamma^p_{12} \Gamma^m_{2p} \right) \\ % &= g_{1m} \left( \frac{\partial \Gamma^m_{22}}{\partial \mu} - \frac{\partial \Gamma^m_{12}}{\partial \sigma} + \left( \Gamma^1_{22} \Gamma^m_{11} + \Gamma^2_{22} \Gamma^m_{12} \right) - \left( \Gamma^1_{12} \Gamma^m_{21} + \Gamma^2_{12} \Gamma^m_{22} \right) \right) \\ % &= g_{11} \left( \frac{\partial \Gamma^1_{22}}{\partial \mu} - \frac{\partial \Gamma^1_{12}}{\partial \sigma} + \left( \Gamma^1_{22} \Gamma^1_{11} + \Gamma^2_{22} \Gamma^1_{12} \right) - \left( \Gamma^1_{12} \Gamma^1_{21} + \Gamma^2_{12} \Gamma^1_{22} \right) \right) + \underbrace{g_{12}}_{=0} \dots . \end{align}\]

Now, we can cross out the partial derivative term w.r.t. $\mu$ since we know already that none of the $\Gamma^k_{ij}$ depend on $\mu$. Moreover, recall that the Christoffel symbols are given by $\Gamma^1_{12} = \Gamma^1_{21} = -\frac{1}{\sigma}$, $\Gamma^2_{11} = \frac{1}{2\sigma}$, and $\Gamma^2_{22} = -\frac{1}{\sigma}$, and $0$ otherwise. Hence,

\[\begin{align} R_{1221} &= g_{11} \left( - \frac{\partial \Gamma^1_{12}}{\partial \sigma} + \Gamma^2_{22} \Gamma^1_{12} - \Gamma^1_{12} \Gamma^1_{21} \right) \\ % &= \frac{1}{\sigma^2} \left( -\frac{\partial}{\partial \sigma} \left( -\frac{1}{\sigma} \right) + \cancel{\left( -\frac{1}{\sigma} \right)^2} - \cancel{\left( -\frac{1}{\sigma} \right)^2} \right) \\ % &= \frac{1}{\sigma^2} \left( -\frac{1}{\sigma^2} \right) \\ % &= -\frac{1}{\sigma^4} . \end{align}\]

Thus, the sectional curvature is given by

\[\text{sec}(\partial_1, \partial_2) = \frac{-\frac{1}{\sigma^4}}{\frac{2}{\sigma^4}} = -\frac{1}{2} .\]

Note in particular that this sectional curvature depends on neither $\mu$ nor $\sigma$, i.e. it is constant. Hence, $M$ is a manifold of constant negative curvature. I.e., locally we can think of $M$ as a saddle surface.
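As a cross-check of the value $-\frac{1}{2}$, the sketch below plugs the Christoffel symbols derived above into the formula for $R_{ijkl}$, differentiating them w.r.t. $\sigma$ by central finite differences. The step size and the test values of $\sigma$ are arbitrary, and indices are 0-based in the code.

```python
import numpy as np


def christoffel(sigma):
    """Gamma[k, i, j] of the Gaussian manifold, as derived in the text."""
    Gamma = np.zeros((2, 2, 2))
    Gamma[0, 0, 1] = Gamma[0, 1, 0] = -1.0 / sigma
    Gamma[1, 0, 0] = 1.0 / (2.0 * sigma)
    Gamma[1, 1, 1] = -1.0 / sigma
    return Gamma


def sectional_curvature(sigma, eps=1e-5):
    g = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
    Gamma = christoffel(sigma)
    # dGamma[a, m, j, k] = d Gamma^m_jk / d theta^a; nothing depends on mu (a = 0)
    dGamma = np.zeros((2, 2, 2, 2))
    dGamma[1] = (christoffel(sigma + eps) - christoffel(sigma - eps)) / (2 * eps)
    # inner[i, j, k, m] = d_i G^m_jk - d_j G^m_ik + G^p_jk G^m_ip - G^p_ik G^m_jp
    inner = (
        np.einsum("imjk->ijkm", dGamma)
        - np.einsum("jmik->ijkm", dGamma)
        + np.einsum("pjk,mip->ijkm", Gamma, Gamma)
        - np.einsum("pik,mjp->ijkm", Gamma, Gamma)
    )
    R = np.einsum("lm,ijkm->ijkl", g, inner)  # lower the last index with g
    return R[0, 1, 1, 0] / np.linalg.det(g)   # R_1221 / det g


for sigma in (0.5, 1.0, 3.0):
    print(sigma, sectional_curvature(sigma))  # approximately -0.5 each time
```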

Visualization

Thanks to the amazing geomstats package, we can visualize $M$ in coordinates easily. The idea is to visualize the contours of the distances from points in $\R \times \R_{>0}$ to $(0, 1)$, i.e. the point corresponding to the standard normal $\N(x \mid 0, 1)$.

Above, red points are the discretized steps of geodesics from $\N(x \mid 0, 1)$ to other Gaussians with different means and variances. Indeed, geodesics of $M$ behave similarly to those in the Poincaré half-space model, one of the poster children of hyperbolic geometry.
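The distances behind such a contour plot can also be sketched without any library support. The closed form below is not stated in this post; it is an assumption obtained by substituting $\mu = \sqrt{2} u$, which turns the metric $(d\mu^2 + 2 \, d\sigma^2)/\sigma^2$ into twice the Poincaré half-plane metric, and then using the standard half-plane distance. The function name and the sample points are hypothetical.

```python
import numpy as np


def fisher_rao_distance(p, q):
    """Distance between N(mu1, s1) and N(mu2, s2) under the Fisher metric.

    Assumed closed form, derived from the Poincare half-plane distance.
    """
    (mu1, s1), (mu2, s2) = p, q
    arg = 1.0 + (0.5 * (mu1 - mu2) ** 2 + (s1 - s2) ** 2) / (2.0 * s1 * s2)
    return np.sqrt(2.0) * np.arccosh(arg)


origin = (0.0, 1.0)  # the standard normal
for point in [(0.0, np.e), (1.0, 1.0), (3.0, 0.5)]:
    print(point, fisher_rao_distance(origin, point))
```

Along the vertical line $\mu = 0$ the metric reduces to $\sqrt{2} \, d\sigma / \sigma$, so the distance from $(0, 1)$ to $(0, e)$ should be $\sqrt{2}$, which gives a simple check of the formula.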

Hessian and Curvatures in Machine Learning: A Differential-Geometric View

Mon, 02 Nov 2020 12:00:00 -0500

In machine learning, especially in neural networks, the Hessian matrix is often treated synonymously with curvatures, in the following sense. Suppose $f: \R^n \times \R^d \to \R$ defined by $(x, \theta) \mapsto f(x; \theta) =: f_\theta(x)$ is a (real-valued) neural network, mapping an input $x$ to the output $f(x; \theta)$ under the parameter $\theta$. Given a dataset $\D$, we can define a loss function $\ell: \R^d \to \R$ by $\theta \mapsto \ell(\theta)$ such as the mean-squared-error or cross-entropy loss. (We do not explicitly show the dependence of $\ell$ on $f$ and $\D$ for brevity.) Assuming the standard basis for $\R^d$, from calculus we know that the second partial derivatives of $\ell$ at a point $\theta \in \R^d$ form a matrix called the Hessian matrix at $\theta$.

Often, one calls the Hessian matrix the “curvature matrix” of $\ell$ at $\theta$ [1, 2, etc.]. Indeed, this is well-justified since, as we have learned in calculus, the eigenspectrum of this Hessian matrix represents the curvatures of the loss landscape of $\ell$ at $\theta$. It is, however, not clear from calculus alone what the precise geometric meaning of these curvatures is. In this post, we will use tools from differential geometry, especially hypersurface theory, to study the geometric interpretation of the Hessian matrix.

Loss Landscapes as Hypersurfaces

We begin by formalizing what exactly is a loss landscape via the Euclidean hypersurface theory. We call an $n$-dimensional manifold $M$ a (Euclidean) hypersurface of $\R^{n+1}$ if $M$ is a subset of $\R^{n+1}$ (equipped with the standard basis) and the inclusion $\iota: M \hookrightarrow \R^{n+1}$ is a smooth topological embedding. Since $\R^{n+1}$ is equipped with a metric in the form of the standard dot product, we can equip $M$ with an induced metric characterized at each point $p \in M$ by

\[\langle v, w\rangle_p = (d\iota)_p(v) \cdot (d\iota)_p(w) ,\]

for all tangent vectors $v, w \in T_pM$. Here, $\cdot$ represents the dot product and $(d\iota)_p: T_pM \to T_{\iota(p)}\R^{n+1} \simeq \R^{n+1}$ is the differential of $\iota$ at $p$ which is represented by the Jacobian matrix of $\iota$ at $p$. In matrix notation this is

\[\inner{v, w}_p = (J_p v)^\top (J_p w) .\]

Intuitively, the induced inner product on $M$ at $p$ is obtained by “pushing forward” tangent vectors $v$ and $w$ using the Jacobian $J_p$ at $p$ and computing their dot product on $\R^{n+1}$.

If $g: U \to \R$ is a smooth real-valued function over an open subset $U \subseteq \R^n$, then the graph of $g$ is the subset $M := \{ (u, g(u)) : u \in U \} \subseteq \R^{n+1}$, which is a hypersurface in $\R^{n+1}$. In this case, we can describe $M$ via the so-called graph parametrization, which is a function $X: U \to \R^{n+1}$ defined by $X(u) := (u, g(u))$.

Coming back to our neural network setting, assuming that the loss $\ell$ is smooth, the graph $L := \{ (\theta, \ell(\theta)) : \theta \in \R^d \}$ is a Euclidean hypersurface of $\R^{d+1}$ with parametrization $Z: \R^d \to \R^{d+1}$ defined by $Z(\theta) := (\theta, \ell(\theta))$. Furthermore, the metric of $L$ is given by the Jacobian of the parametrization $Z$ and the standard dot product on $\R^{d+1}$, as before. Thus, the loss landscape of $\ell$ can indeed be amenable to geometric analysis.
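For a graph parametrization the induced metric has a simple explicit form, which the sketch below makes concrete. The loss $\ell(\theta) = (\theta^1)^2 + 3 (\theta^2)^2$ and the evaluation point are hypothetical and serve only as an example.

```python
import numpy as np


def graph_jacobian(grad):
    """Jacobian of Z(theta) = (theta, loss(theta)): identity stacked on the gradient row."""
    d = len(grad)
    return np.vstack([np.eye(d), grad[None, :]])


def induced_metric(grad):
    """Matrix J^T J of the metric induced by the dot product on R^{d+1}."""
    J = graph_jacobian(grad)
    return J.T @ J


# Hypothetical loss: l(theta) = theta_1^2 + 3 theta_2^2, with gradient (2 theta_1, 6 theta_2).
theta = np.array([1.0, -0.5])
grad = np.array([2.0 * theta[0], 6.0 * theta[1]])

metric = induced_metric(grad)
print(metric)  # equals I + grad grad^T
```

In particular, the metric reduces to the identity exactly where the gradient of the loss vanishes.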

The Second Fundamental Form and Shape Operator

Consider vector fields $X$ and $Y$ on the hypersurface $L \subseteq \R^{d+1}$. We can view them as vector fields on $\R^{d+1}$ and thus the directional derivative $\nabla_X Y$ on $\R^{d+1}$ is well-defined at all points in $L$. That is, at every $p \in L$, $\nabla_X Y$ is a $(d+1)$-dimensional vector “rooted” at $p$. This vector can be decomposed as follows:

\[\nabla_X Y = (\nabla_X Y)^\top + (\nabla_X Y)^\perp ,\]

where $(\cdot)^\top$ and $(\cdot)^\perp$ are the orthogonal projection operators onto the tangent and normal spaces of $L$ at $p$, respectively. We define the second fundamental form as the map $\mathrm{II}$ that takes two vector fields on $L$ and yields a normal vector field of $L$, as follows:

\[\mathrm{II}(X,Y) := (\nabla_X Y)^\perp .\]

See the following figure for an intuition.

Since $L$ is a $d$-dimensional hypersurface of $(d+1)$-dimensional Euclidean space, the normal space $N_pL$ at each point $p$ of $L$ has dimension one, and there exist only two ways of choosing a unit vector field normal to $L$. Any choice of the unit vector field thus automatically gives a basis for $N_pL$ for all $p \in L$. One of the choices is the following normal vector field which is oriented outward relative to $L$.

Another choice is the same unit normal field but oriented inward relative to $L$.

Fix a unit normal field $N$. We can replace the vector-valued second fundamental form $\mathrm{II}$ with a simpler scalar-valued form. We define the scalar second fundamental form of $L$ to be

\[h(X, Y) := \inner{N, \mathrm{II}(X,Y)} .\]

Furthermore, we define the shape operator of $L$ as the map $s$, mapping a vector field to another vector field on $L$, characterized by

\[\inner{s(X), Y} = h(X,Y) .\]

Based on the characterization above, we can alternatively view $s$ as an operator obtained by raising an index of $h$, i.e. multiplying the matrix of $h$ with the inverse-metric.

Note that, at each point $p \in L$, the shape operator at $p$ is a linear endomorphism of $T_p L$, i.e. it defines a map from the tangent space to itself. Furthermore, we can show that $\mathrm{II}(X,Y) = \mathrm{II}(Y,X)$ and thus $h(X,Y)$ is symmetric. This implies that $s$ is self-adjoint since we can write

\[\inner{s(X), Y} = h(X,Y) = h(Y,X) = \inner{s(Y), X} = \inner{X, s(Y)} .\]

Altogether, this means that at each $p \in L$, the shape operator at $p$ can be represented by a symmetric $d \times d$ matrix.

Principal Curvatures

The previous fact about the matrix of $s$ says that we can apply eigendecomposition on $s$ and obtain $d$ real eigenvalues $\kappa_1, \dots, \kappa_d$ and an orthonormal basis for $T_p L$ formed by the eigenvectors $(b_1, \dots, b_d)$ corresponding to these eigenvalues. We call these eigenvalues the principal curvatures of $L$ at $p$ and the corresponding eigenvectors the principal directions. Moreover, we also define the Gaussian curvature as $\det s = \prod_{i=1}^d \kappa_i$ and the mean curvature as $\frac{1}{d} \mathrm{tr}\,s = \frac{1}{d} \sum_{i=1}^d \kappa_i$.

The intuition of the principal curvatures and directions in $\R^3$ is shown in the preceding figure. Suppose $M$ is a surface in $\R^3$. Choose a tangent vector $v \in T_pM$. Together with the choice of our unit normal vector $N_p$ at $p$, we obtain a plane $\varPi$ passing through $p$. The intersection of $\varPi$ and the neighborhood of $p$ in $M$ is a plane curve $\gamma \subseteq \varPi$ containing $p$. We can now compute the curvature of this curve at $p$ as usual, in the calculus sense (the reciprocal of the radius of the osculating circle at $p$). Then, the principal curvatures of $M$ at $p$ are the minimum and maximum curvatures obtained this way. The corresponding vectors in $T_p M$ that attain the minimum and maximum are the principal directions.

Principal and mean curvatures are not intrinsic to a hypersurface. There are two hypersurfaces that are isometric, but have different principal curvatures and hence different mean curvatures. Consider the following two surfaces.

The first (left) surface is the plane described by the parametrization $(x,y) \mapsto (x, y, 0)$ for $0 < x < \pi$ and $0 < y < \pi$. The second one is the half-cylinder $(x,y) \mapsto (x, y, \sqrt{1-y^2})$ for $0 < x < \pi$ and $\abs{y} < 1$. It is clear that they have different principal curvatures since the plane is flat while the half-cylinder is “curvy”. Indeed, assuming a downward pointing normal, we can see that $\kappa_1 = \kappa_2 = 0$ for the plane and $\kappa_1 = 0, \kappa_2 = 1$ for the half-cylinder and thus their mean curvatures differ. However, they are actually isometric to each other: from the point of view of Riemannian geometry, they are the same. Thus, both principal and mean curvatures depend on the choice of the parametrization and are not intrinsic.

Remarkably, the Gaussian curvature is intrinsic: All isometric hypersurfaces of dimension $\geq 2$ have the same Gaussian curvature (up to sign). Using the previous example: the plane and half-cylinder have the same Gaussian curvature of $0$. In 2D surfaces, this is a classic result which Gauss named Theorema Egregium. For hypersurfaces with dimension $> 2$, it can be shown that the Gaussian curvature is intrinsic up to sign [5, Ch. 7, Cor. 23].
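The numbers in the plane versus half-cylinder example can be checked numerically. The sketch below assumes the standard formulas for a graph $z = g(x, y)$, which are not derived at this point in the post: the metric is $I + \nabla g \nabla g^\top$, the scalar second fundamental form is $\pm \mathrm{Hess}\, g / \sqrt{1 + \norm{\nabla g}^2}$ depending on the orientation of the normal, and the shape operator is the inverse metric times the latter. The function name and the evaluation point $y = 0.3$ are arbitrary.

```python
import numpy as np


def principal_curvatures(grad, hess, normal_sign=1.0):
    """Eigenvalues of the shape operator of a graph hypersurface.

    normal_sign = +1 picks the upward unit normal, -1 the downward one.
    """
    d = len(grad)
    metric = np.eye(d) + np.outer(grad, grad)
    h = normal_sign * hess / np.sqrt(1.0 + grad @ grad)
    shape = np.linalg.solve(metric, h)  # s = g^{-1} h
    return np.sort(np.linalg.eigvals(shape).real)


# Plane z = 0: gradient and Hessian vanish everywhere.
k_plane = principal_curvatures(np.zeros(2), np.zeros((2, 2)), normal_sign=-1.0)

# Half-cylinder z = sqrt(1 - y^2), evaluated at y = 0.3.
y = 0.3
grad = np.array([0.0, -y / np.sqrt(1.0 - y**2)])
hess = np.diag([0.0, -((1.0 - y**2) ** -1.5)])
k_cyl = principal_curvatures(grad, hess, normal_sign=-1.0)

print(k_plane, k_plane.prod())  # principal curvatures and Gaussian curvature of the plane
print(k_cyl, k_cyl.prod())      # same for the half-cylinder
```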

The Loss Landscape's Hessian

Now we are ready to draw a geometric connection between principal curvatures and the Hessian of $\ell$. Let $Z: \R^d \to \R^{d+1}$ be the graph parametrization of the loss landscape $L$. The coordinates $(\theta^1, \dots, \theta^d) \in \R^d$ thus give local coordinates for $L$. The coordinate vector fields $\partial/\partial \theta^1, \dots, \partial/\partial \theta^d$ push forward to vector fields $dZ(\partial/\partial \theta^1), \dots, dZ(\partial/\partial \theta^d)$ on $\R^{d+1}$, via the Jacobian of $Z$. At each $p \in L$, these vector fields form a basis for $T_p L$, viewed as a collection of $d$ vectors in $\R^{d+1}$.

If we think of $Z(\theta) = (Z^1(\theta), \dots, Z^{d+1}(\theta))$ as a vector-valued function of $\theta$, then by definition of the Jacobian, these pushed-forward coordinate vector fields can be written for every $\theta \in \R^d$ as

\[dZ_\theta \left( \frac{\partial}{\partial \theta^i} \right) = \frac{\partial Z}{\partial \theta^i} (\theta) =: \partial_i Z(\theta) ,\]

for each $i = 1, \dots, d$.

Let us suppose we are given a unit normal field to $L$. Then we have the following result.

Proposition 1. Suppose $L \subseteq \R^{d+1}$ is the loss landscape of $\ell$, $Z: \R^d \to \R^{d+1}$ is the graph parametrization of $L$. Suppose further that $\partial_1 Z, \dots, \partial_d Z$ are the vector fields determined by $Z$ whose restriction at each $p \in L$ is a basis for $T_pL$, and suppose $N$ is a unit normal field on $L$. Then the scalar second fundamental form is given by

\[h(\partial_i Z, \partial_j Z) = \left\langle \frac{\partial^2 Z}{\partial \theta^i \partial \theta^j} , N \right\rangle = N^{d+1} \frac{\partial^2 \ell}{\partial \theta^i \partial \theta^j} ,\]

where $N^{d+1}$ is the last component of the unit normal field.

Proof. To show the first equality, one can refer to Proposition 8.23 in [3], which works for any parametrization and not just the graph parametrization. Now recall that $Z(\theta) = (\theta^1, \dots, \theta^d, \ell(\theta^1, \dots, \theta^d))$. Therefore for each $i = 1, \dots, d$:

\[\frac{\partial Z}{\partial \theta^i} = \left( 0, \dots, 1, \dots, \frac{\partial \ell}{\partial \theta^i} \right) ,\]

and thus

\[\frac{\partial^2 Z}{\partial \theta^i \partial \theta^j} = \left( 0, \dots, 0, \frac{\partial^2 \ell}{\partial \theta^i \partial \theta^j} \right) .\]

Taking the inner product with the unit normal field $N$, we obtain

\[h(\partial_i Z, \partial_j Z) = 0 + \dots + 0 + N^{d+1} \frac{\partial^2 \ell}{\partial \theta^i \partial \theta^j} = N^{d+1} \frac{\partial^2 \ell}{\partial \theta^i \partial \theta^j} ,\]

where $N^{d+1}$ is the $(d+1)$-st component function (it is a function $\R^{d+1} \to \R$) of the normal field $N$. At each $p \in L$, the matrix of $h$ is therefore $N^{d+1}(p)$ times the Hessian matrix of $\ell$ at $p$.

$ \square $
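Proposition 1 can be made concrete with a small sketch. It assumes the upward unit normal $N = (-\nabla \ell, 1) / \sqrt{1 + \norm{\nabla \ell}^2}$, a standard choice for graphs that is not spelled out in the post, and a hypothetical quadratic loss $\ell(\theta) = (\theta^1)^2 + \theta^1 \theta^2 + 2 (\theta^2)^2$.

```python
import numpy as np


def unit_normal(grad):
    """Upward unit normal of the graph of a function with gradient `grad`."""
    n = np.append(-grad, 1.0)
    return n / np.linalg.norm(n)


def scalar_second_fundamental_form(grad, hess):
    """Matrix of h in the basis d_1 Z, ..., d_d Z: last normal component times the Hessian."""
    N = unit_normal(grad)
    return N[-1] * hess


# Hypothetical loss l(theta) = theta_1^2 + theta_1 theta_2 + 2 theta_2^2 at theta = (1, -1).
theta = np.array([1.0, -1.0])
grad = np.array([2.0 * theta[0] + theta[1], theta[0] + 4.0 * theta[1]])
hess = np.array([[2.0, 1.0], [1.0, 4.0]])

N = unit_normal(grad)
J = np.vstack([np.eye(2), grad[None, :]])  # columns are d_1 Z and d_2 Z
h = scalar_second_fundamental_form(grad, hess)

print(J.T @ N)  # N is orthogonal to both tangent vectors
print(h)        # a rescaled copy of the Hessian
```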

Finally, we show the connection between the principal curvatures with the scalar second fundamental form, and hence the principal curvatures with the Hessian. The following proposition says that at a critical point, the unit normal vector can be chosen as $(0, \dots, 0, 1)$ and thus the scalar second fundamental form coincides with the Hessian of $\ell$. Furthermore, by orthonormalizing the basis for the tangent space at that point, we can show that the matrix of the scalar second fundamental form in this case is exactly the matrix of the shape operator at $p$ and thus the Hessian encodes the principal curvatures at that point.

Proposition 2. Suppose $L \subseteq \R^{d+1}$ is a loss landscape with its graph parametrization and let $\theta_* \in \R^d$ be a critical point of $\ell$ and $p_* := (\theta_*^1, \dots, \theta_*^d, \ell(\theta_*)) \in L$. Then the matrix of the shape operator $s$ of $L$ at $p_*$ is equal to the Hessian matrix of $\ell$ at $\theta_*$.

Proof. We can assume w.l.o.g. that the basis $(E_1, \dots, E_d)$ for $T_{p_*} L$ is orthonormal by applying the Gram-Schmidt algorithm on $d$ linearly independent tangent vectors in $T_{p_*} L$. Furthermore pick $(0, \dots, 0, 1) \in \R^{d+1}$ as the choice of the unit normal $N$ at $p_*$. We can do so since by hypothesis $p_*$ is a critical point and therefore $(0, \dots, 0, 1)$ is perpendicular to $T_{p_*} L$.

It follows by Proposition 1 that the matrix of the scalar second fundamental form $h$ of $L$ at $p_*$ is equal to the Hessian matrix of $\ell$ at $\theta_*$. Moreover, since we have an orthonormal basis for $T_{p_*} L$, the metric of $L$ at $p_*$ is represented by the $d \times d$ identity matrix. This implies that the matrix of the shape operator at $p_*$ is equal to the matrix of the second fundamental form and the claim follows directly.

$ \square $
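A numerical sketch of Proposition 2, under the same assumptions as the graph formulas used above (induced metric $I + \nabla \ell \nabla \ell^\top$ and upward unit normal): at a critical point the shape operator coincides with the Hessian, while away from it the two differ. The quadratic loss is again a hypothetical example with its critical point at the origin.

```python
import numpy as np


def shape_operator(grad, hess):
    """Matrix of the shape operator of a graph, in the basis d_1 Z, ..., d_d Z."""
    d = len(grad)
    metric = np.eye(d) + np.outer(grad, grad)  # induced metric
    h = hess / np.sqrt(1.0 + grad @ grad)      # scalar second fundamental form
    return np.linalg.solve(metric, h)          # raise an index: s = g^{-1} h


# Hypothetical loss l(theta) = theta_1^2 + theta_1 theta_2 + 2 theta_2^2.
hess = np.array([[2.0, 1.0], [1.0, 4.0]])

s_critical = shape_operator(np.zeros(2), hess)       # at theta = 0, where the gradient vanishes
s_other = shape_operator(np.array([1.0, -3.0]), hess)  # at theta = (1, -1)

print(s_critical)  # equal to the Hessian
print(s_other)     # differs from the Hessian
```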

As a side note, we can actually have a more general statement: At any point in a hypersurface with any parametrization, the principal curvatures give a concise description of the local shape of the hypersurface by approximating it with the graph of a quadratic function. See Prop. 8.24 in [3] for a detailed discussion.

Flatness and Generalization

In deep learning, there have been interesting works connecting the “flatness” of the loss landscape’s local minima with the generalization performance of an NN. The conjecture is that the flatter a minimum is, the better the network generalizes. “Flatness” here often refers to the eigenvalues or trace of the Hessian matrix at the minima. However, this has been disputed by e.g. [4] and rightly so.

As we have seen previously, at a minimum, the principal and mean curvatures (the eigenvalues and trace of the Hessian of $\ell$, resp.) are not intrinsic. Different parametrizations of $L$ can yield different principal and mean curvatures. Just like the illustration with the plane and the half-cylinder above, [4] illustrates this directly in the loss landscape. In particular, we can apply a bijective transformation $\varphi$ to the original parameter space $\R^d$ s.t. the resulting loss landscape is isometric to the original loss landscape and the particular minimum $\theta_*$ does not change, i.e. $\varphi(\theta_*) = \theta_*$. See the following figure for an illustration (we assume that the length of the red curves below is the same).

It is clear that the principal curvatures change even though functionally, the NN still represents the same function. Thus, we cannot actually connect the notions of “flatness” that are common in the literature to the generalization ability of the NN. A definitive connection between them must start with some intrinsic notion of flatness. For starters, one could use the Gaussian curvature, which can be easily computed since it is just the determinant of the Hessian at the minimum.
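A toy example, in the spirit of the rescaling argument of [4] but not taken from it or from this post, shows how Hessian-based flatness can change while the represented function stays the same. Assume a hypothetical two-parameter model $x \mapsto a b x$ with loss $(ab - 1)^2$; every pair $(\alpha, 1/\alpha)$ is a minimum realizing the same function.

```python
import numpy as np


def hessian(a, b):
    """Hessian of the hypothetical loss (a * b - 1)^2 with respect to (a, b)."""
    return np.array([
        [2.0 * b**2, 4.0 * a * b - 2.0],
        [4.0 * a * b - 2.0, 2.0 * a**2],
    ])


for alpha in (1.0, 10.0):
    a, b = alpha, 1.0 / alpha  # a * b = 1, so the loss is zero at every such point
    H = hessian(a, b)
    print(alpha, np.trace(H), np.linalg.det(H))
```

The trace of the Hessian grows from $4$ to about $200$ as $\alpha$ goes from $1$ to $10$, whereas its determinant stays at zero in both cases.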

References

1. Martens, James. “New Insights and Perspectives on the Natural Gradient Method.” arXiv preprint arXiv:1412.1193 (2014).
2. Dangel, Felix, Stefan Harmeling, and Philipp Hennig. “Modular Block-diagonal Curvature Approximations for Feedforward Architectures.” AISTATS. 2020.
3. Lee, John M. Riemannian manifolds: an introduction to curvature. Vol. 176. Springer Science & Business Media, 2006.
4. Dinh, Laurent, et al. “Sharp Minima can Generalize for Deep Nets.” ICML, 2017.
5. Spivak, Michael D. A comprehensive introduction to differential geometry. Publish or perish, 1970.

Optimization and Gradient Descent on Riemannian Manifolds

Fri, 22 Feb 2019 12:00:00 -0500

Differential geometry can be seen as a generalization of calculus on Riemannian manifolds. Objects in calculus such as gradient, Jacobian, and Hessian on $\R^n$ are adapted on arbitrary Riemannian manifolds. This fact let us also generalize one of the most ubiquitous problem in calculus: the optimization problem. The implication of this generalization is far-reaching: We can make a more general and thus flexible assumption regarding the domain of our optimization, which might fit real-world problems better or has some desirable properties.

In this article, we will focus on the most popular optimization method there is, esp. in machine learning: the gradient descent method. We will begin with a review of the optimization problem of a real-valued function on $\R^n$, which should already be familiar. Next, we will adapt the gradient descent method to make it work for the optimization problem of a real-valued function on an arbitrary Riemannian manifold $(\M, g)$. Lastly, we will discuss how the natural gradient descent method can be seen from this perspective, instead of purely from the second-order optimization point-of-view.

Optimization problem and the gradient descent

Let $\R^n$ be the usual Euclidean space (i.e. a Riemannian manifold $(\R^n, \bar{g})$ where $\bar{g}_{ij} = \delta_{ij}$) and let $f: \R^n \to \R$ be a real-valued function. An (unconstrained) optimization problem on this space has the form

\[\min_{x \in \R^n} f(x) \, .\]

That is, we would like to find a point $\hat{x} \in \R^n$ such that $f(\hat{x})$ is the minimum of $f$.

One of the most popular numerical methods for solving this problem is the gradient descent method. Its algorithm is as follows.

Algorithm 1 (Euclidean gradient descent).

Pick arbitrary $x_{(0)} \in \R^n$ and let $\alpha \in \R$ with $\alpha > 0$
While the stopping criterion is not satisfied:
1. Compute the gradient of $f$ at $x_{(t)}$, i.e. $h_{(t)} := \gradat{f}{x_{(t)}}$
2. Move in the direction of $-h_{(t)}$, i.e. $x_{(t+1)} = x_{(t)} - \alpha h_{(t)}$
3. $t = t+1$
Return $x_{(t)}$
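Algorithm 1 can be sketched in a few lines of NumPy. This is only a minimal illustration: the stopping criterion (a small gradient norm or an iteration cap), the function name `gradient_descent`, and the quadratic test objective are all assumptions made for the example.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Algorithm 1; stops when ||grad f|| < tol (an assumed stopping criterion)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = grad_f(x)                  # step 1: gradient at the current point
        if np.linalg.norm(h) < tol:    # stopping criterion
            break
        x = x - alpha * h              # step 2: move in the direction of -h
    return x

# Hypothetical test problem: f(x) = 0.5 * ||x - c||^2, whose gradient is x - c.
c = np.array([1.0, -2.0, 3.0])
x_hat = gradient_descent(lambda x: x - c, x0=np.zeros(3))
```

For this objective the iterates contract towards `c` by a factor of $1 - \alpha$ per step, so `x_hat` should agree with `c` up to the tolerance.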

The gradient descent method is justified by the fact that the gradient is the direction in which $f$ increases fastest. Its negative therefore points in the direction of steepest descent.

Proposition 1. Let $f: \R^n \to \R$ be a real-valued function on $\R^n$ and $x \in \R^n$. Among all unit vectors $v \in \R^n$, the gradient $\grad f \, \vert_x$ of $f$ at $x$ is the direction in which the directional derivative $D_v \, f \, \vert_x$ is greatest. Furthermore, $\norm{\gradat{f}{x}}$ equals the value of the directional derivative in that direction.

Proof. First, note that, by our assumption, $\norm{v} = 1$. By definition of the directional derivative and dot product on $\R^n$,

\[\begin{align} D_v \, f \, \vert_x &= \grad f \, \vert_x \cdot v \\ &= \norm{\gradat{f}{x}} \norm{v} \cos \theta \\ &= \norm{\gradat{f}{x}} \cos \theta \, , \end{align}\]

where $\theta$ is the angle between $\gradat{f}{x}$ and $v$. As $\norm{\cdot} \geq 0$ and $-1 \leq \cos \theta \leq 1$, the above expression is maximized whenever $\cos \theta = 1$. This implies that the particular vector $\hat{v}$ that maximizes the directional derivative points in the same direction as $\gradat{f}{x}$. Furthermore, plugging in $\hat{v}$ into the above equation, we get

\[D_{\hat{v}} \, f \, \vert_x = \norm{\gradat{f}{x}} \, .\]

Thus, the length of $\gradat{f}{x}$ is equal to the value of $D_{\hat{v}} \, f \, \vert_x$.

$\square$

Gradient descent on Riemannian manifolds

Remark. These notes about Riemannian geometry are useful as references. We shall use the Einstein summation convention: Repeated indices above and below are implied to be summed, e.g. $v^i w_i \implies \sum_i v^i w_i$ and $g_{ij} v^i v^j \implies \sum_{ij} g_{ij} v^i v^j$. By convention the index in $\partder{}{x^i}$ is thought to be a lower index.

We now want to break out of the confines of Euclidean space. We would like to generalize the gradient descent algorithm to a function defined on a Riemannian manifold. Based on Algorithm 1, there are at least two parts of the algorithm that we need to adapt, namely, (i) the gradient of $f$ and (ii) the way we move between points on $\M$.

Suppose $(\M, g)$ is an $n$-dimensional Riemannian manifold. Let $f: \M \to \R$ be a real-valued function (scalar field) defined on $\M$. Then, the optimization problem on $\M$ simply has the form

\[\min_{p \in \M} f(p) \, .\]

Although it seems innocent enough (we only replace $\R^n$ with $\M$ from the Euclidean version), some difficulties exist.

First, we shall discuss the gradient of $f$ on $\M$. By definition, $\grad{f}$ is a vector field on $\M$, i.e. $\grad{f} \in \mathfrak{X}(\M)$ and at each $p \in \M$, $\gradat{f}{p}$ is a tangent vector in $T_p \M$. Let the differential $df$ of $f$ be a one-form, which, in given coordinates $\vx_p := (x^1(p), \dots, x^n(p))$, has the form

\[df = \partder{f}{x^i} dx^i \, .\]

Then, the gradient of $f$ is obtained by raising an index of $df$. That is,

\[\grad{f} = (df)^\sharp \, ,\]

and in coordinates, it has the expression

\[\grad{f} = g^{ij} \partder{f}{x^i} \partder{}{x^j} \, .\]

At any $p \in \M$, given $v \in T_p \M$, it is characterized by the following equation:

\[\inner{\gradat{f}{p}, v}_g = df(v) = vf \, .\]

That is, pointwise, the inner product of the gradient and any tangent vector is the action of derivation $v$ on $f$. We can think of this action as taking directional derivative of $f$ in the direction $v$. Thus, we have the analogue of Proposition 1 on Riemannian manifolds.

Proposition 2. Let $(\M, g)$ be a Riemannian manifold and $f: \M \to \R$ be a real-valued function on $\M$ and $p \in \M$. Among all unit vectors $v \in T_p \M$, the gradient $\gradat{f}{p}$ of $f$ at $p$ is the direction in which the directional derivative $vf$ is greatest. Furthermore, $\norm{\gradat{f}{p}}_g$ equals the value of the directional derivative in that direction.

Proof. We simply note that by definition of inner product induced by $g$, we have

\[\inner{u, w}_g = \norm{u}_g \norm{w}_g \cos \theta \qquad \forall \, u, w \in T_p \M \, ,\]

where $\theta$ is again the angle between $u$ and $w$. Using the characteristic of $\gradat{f}{p}$ we have discussed above and by substituting $vf$ for $D_v \, f \, \vert_p$ in the proof of Proposition 1, we immediately get the desired result.

$\square$

Proposition 2 therefore provides us with a justification for simply substituting the Riemannian gradient for the Euclidean gradient in Algorithm 1.

To make this concrete, we do the computation in coordinates. In coordinates, we can represent $df$ by a row vector $d$ (i.e. a sequence of numbers in the sense of linear algebra) containing all partial derivatives of $f$:

\[d := \left( \partder{f}{x^1}, \dots, \partder{f}{x^n} \right) \, .\]

Given the matrix representation $G$ of the metric tensor $g$ in coordinates, the gradient of $f$ is represented by a column vector $h$, such that

\[h = G^{-1} d^\T \, .\]

Example 1. (Euclidean gradient in coordinates). Notice that in the Euclidean case, $\bar{g}_{ij} = \delta_{ij}$, thus it is represented by an identity matrix $I$, in coordinates. Therefore the Euclidean gradient is simply

\[h = I^{-1} d^\T = d^\T \, .\]
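The coordinate formula $h = G^{-1} d^\T$ and the characterization $\inner{\gradat{f}{p}, v}_g = df(v)$ can be checked numerically. In the sketch below, the metric matrix `G` is just a randomly generated symmetric positive definite matrix and `d` is a random vector of partial derivatives; both are placeholders for illustration, not quantities from a particular manifold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

A = rng.normal(size=(n, n))
G = A @ A.T + n * np.eye(n)    # arbitrary SPD matrix standing in for the metric
d = rng.normal(size=n)         # components of df (partial derivatives of f)

# Riemannian gradient h = G^{-1} d^T, computed by solving G h = d
# rather than forming the inverse explicitly.
h = np.linalg.solve(G, d)

v = rng.normal(size=n)         # an arbitrary tangent vector
lhs = h @ G @ v                # <grad f, v>_g
rhs = d @ v                    # df(v)

# Euclidean special case: G = I gives h = d^T.
h_euclid = np.linalg.solve(np.eye(n), d)
```

Here `lhs` and `rhs` should agree up to floating-point error for any choice of `v`.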

The second part of Algorithm 1 that we need to find an analogue of is the way we move between points on $\M$. Notice that, at each $x \in \R^n$, the way we move between points in the Euclidean gradient descent is by following a straight line in the direction $-\gradat{f}{x}$. We know by the triangle inequality that a straight line is the shortest path between points in $\R^n$.

On Riemannian manifolds, we move between points by means of curves. There exists a special kind of curve $\gamma: I \to \M$, where $I$ is an interval, called a geodesic, that is “straight” between two points on $\M$, in the sense that the covariant derivative $D_t \gamma'$ of the velocity vector along the curve itself is $0$ at any time $t$. The intuition is as follows: Although $\gamma$ does not look straight from an outsider’s point-of-view (i.e. it follows the curvature of $\M$), as far as the inhabitants of $\M$ are concerned, $\gamma$ is straight, as its velocity vector (its direction and length) is the same everywhere along $\gamma$. Thus, geodesics are the generalization of straight lines on Riemannian manifolds.

For any $p \in \M$ and $v \in T_p \M$, we can show that there always exists a geodesic starting at $p$ with initial velocity $v$, denoted by $\gamma_v$. Furthermore, if $c, t \in \R$ we can rescale any geodesic $\gamma_v$ by

\[\gamma_{cv}(t) = \gamma_v (ct) \, ,\]

and thus we can define a map $\exp_p: T_p \M \to \M$ by

\[\exp_p(v) = \gamma_v(1) \, ,\]

called the exponential map. The exponential map is the generalization of “moving straight in the direction $v$” on Riemannian manifolds.

Example 2. (Exponential map on a sphere). Let $\mathbb{S}^n(r)$ be a sphere embedded in $\R^{n+1}$ with radius $r$. The shortest path between any pair of points on the sphere can be found by following the great circle connecting them.

Let $p \in \mathbb{S}^n(r)$ and $0 \neq v \in T_p \mathbb{S}^n(r)$ be arbitrary. The curve $\gamma_v: \R \to \R^{n+1}$ given by

\[\gamma_v(t) = \cos \left( \frac{t\norm{v}}{r} \right) p + \sin \left( \frac{t\norm{v}}{r} \right) r \frac{v}{\norm{v}} \, ,\]

is a geodesic, as its image is the great circle formed by the intersection of $\mathbb{S}^n(r)$ with the linear subspace of $\R^{n+1}$ spanned by $\left\{ p, r \frac{v}{\norm{v}} \right\}$. Therefore the exponential map on $\mathbb{S}^n(r)$ is given by

\[\exp_p(v) = \cos \left( \frac{\norm{v}}{r} \right) p + \sin \left( \frac{\norm{v}}{r} \right) r \frac{v}{\norm{v}} \, .\]
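The closed form in Example 2 translates directly into code. The function name `sphere_exp` and the handling of $v = 0$ (returning $p$ itself) are choices made for this sketch.

```python
import numpy as np

def sphere_exp(p, v, r=1.0):
    """Exponential map on the sphere of radius r, following Example 2."""
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:          # exp_p(0) = p
        return p.copy()
    return np.cos(norm_v / r) * p + np.sin(norm_v / r) * r * v / norm_v

r = 2.0
p = np.array([0.0, 0.0, r])                   # "north pole" of the sphere
v = np.array([np.pi * r / 2.0, 0.0, 0.0])     # tangent at p; length = quarter of a great circle
q = sphere_exp(p, v, r)
```

Travelling a quarter of a great circle from the north pole in the $x^1$ direction should land on the equator at $(r, 0, 0)$, and the result should stay on the sphere.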

Given the exponential map, our modification to Algorithm 1 is complete, which we show in Algorithm 2. The changes from Algorithm 1 are the Riemannian gradient in step 1 and the exponential map in step 2.

Algorithm 2 (Riemannian gradient descent).

Pick arbitrary $p_{(0)} \in \M$. Let $\alpha \in \R$ with $\alpha > 0$
While the stopping criterion is not satisfied:
1. Compute the gradient of $f$ at $p_{(t)}$, i.e. $h_{(t)} := \gradat{f}{p_{(t)}} = (df \, \vert_{p_{(t)}})^\sharp$
2. Move in the direction $-h_{(t)}$, i.e. $p_{(t+1)} = \exp_{p_{(t)}}(-\alpha h_{(t)})$
3. $t = t+1$
Return $p_{(t)}$

Approximating the exponential map

In general, the exponential map is difficult to compute, as to compute a geodesic, we have to solve a system of second-order ODEs. Therefore, for computational reasons, we would like to approximate the exponential map with a cheaper alternative.

Let $p \in \M$ be arbitrary. We define a map $R_p: T_p \M \to \M$, called the retraction map, by the following properties:

$R_p(0) = p$
$dR_p(0) = \text{Id}_{T_p \M}$.

The second property is called the local rigidity condition and it preserves gradients at $p$. In particular, the exponential map is a retraction. Furthermore, if $d_g$ denotes the Riemannian distance and $t \in \R$, retraction can be seen as a first-order approximation of the exponential map, in the sense that

\[d_g(\exp_p(tv), R_p(tv)) = O(t^2) \, .\]

On an arbitrary embedded submanifold $\S \subseteq \R^{n+1}$, if $p \in \S$ and $v \in T_p \S$, viewing $p$ as a point in the ambient manifold and $v$ as a vector in the ambient tangent space $T_p \R^{n+1}$, we can compute $R_p(v)$ by (i) moving along $v$ to get $p + v$ and then (ii) projecting the point $p+v$ back to $\S$.

Example 3. (Retraction on a sphere). Let $\mathbb{S}^n(r)$ be a sphere embedded in $\R^{n+1}$ with radius $r$. The retraction on any $p \in \mathbb{S}^n(r)$ and $v \in T_p \mathbb{S}^n(r)$ is defined by

\[R_p(v) = r \frac{p + v}{\norm{p + v}} \, .\]

Therefore, the Riemannian gradient descent in Algorithm 2 can be modified to be

Algorithm 3 (Riemannian gradient descent with retraction).

Pick arbitrary $p_{(0)} \in \M$. Let $\alpha \in \R$ with $\alpha > 0$
While the stopping criterion is not satisfied:
1. Compute the gradient of $f$ at $p_{(t)}$, i.e. $h_{(t)} := \gradat{f}{p_{(t)}} = (df \, \vert_{p_{(t)}})^\sharp$
2. Move in the direction $-h_{(t)}$, i.e. $p_{(t+1)} = R_{p_{(t)}}(-\alpha h_{(t)})$
3. $t = t+1$
Return $p_{(t)}$
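The retraction of Example 3 and its agreement with the exponential map for small steps can be checked numerically. The sketch below uses the chordal (ambient Euclidean) distance as a stand-in for the Riemannian distance $d_g$, which is an assumption made to keep the example short; for small gaps the two distances are nearly identical.

```python
import numpy as np

def sphere_exp(p, v, r=1.0):
    """Exponential map on the sphere of radius r (Example 2)."""
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:
        return p.copy()
    return np.cos(norm_v / r) * p + np.sin(norm_v / r) * r * v / norm_v

def sphere_retract(p, v, r=1.0):
    """Projection retraction on the sphere of radius r (Example 3)."""
    w = p + v
    return r * w / np.linalg.norm(w)

p = np.array([0.0, 0.0, 1.0])    # base point on the unit sphere
v = np.array([1.0, 0.0, 0.0])    # unit tangent vector at p

steps = [1e-1, 1e-2, 1e-3]
# Chordal distance between exp_p(tv) and R_p(tv), used as a proxy for d_g.
gaps = [np.linalg.norm(sphere_exp(p, t * v) - sphere_retract(p, t * v)) for t in steps]
```

Each entry of `gaps` should be bounded by the corresponding $t^2$, consistent with the $O(t^2)$ statement above. Swapping `sphere_exp` for `sphere_retract` in an implementation of Algorithm 2 gives Algorithm 3.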

Natural gradient descent

One of the most important applications of the Riemannian gradient descent in machine learning is optimization on statistical manifolds. We define a statistical manifold $(\R^n, g)$ to be the set $\R^n$ corresponding to the set of parameters of a statistical model $p_\theta(z)$, equipped with the metric tensor $g$ given by the Fisher information metric,

\[g_{ij} = \E_{z \sim p_\theta} \left[ \partder{\log p_\theta(z)}{\theta^i} \partder{\log p_\theta(z)}{\theta^j} \right] \, .\]

The most common objective function $f$ in the optimization problem on a statistical manifold is the negative log-likelihood function of our statistical model. That is, given a dataset $\D = \{ z_i \}$, the objective is given by $f(\theta) = -\sum_{z \in \D} \log p_\theta(z)$, so that minimizing $f$ maximizes the likelihood.

The metric tensor $g$ is represented by an $n \times n$ matrix $F$, called the Fisher information matrix. The Riemannian gradient on this manifold can therefore be represented by a column vector $h = F^{-1} d^\T$. Furthermore, as the manifold is $\R^n$, the construction of the retraction map we have discussed previously tells us that we can simply do addition $p + v$ for any $p \in \R^n$ and $v \in T_p \R^n$. This is well defined as there is a natural isomorphism between $\R^n$ and $T_p \R^n$. All in all, the gradient descent on this manifold is called the natural gradient descent and is presented in Algorithm 4 below.

Algorithm 4 (Natural gradient descent).

Pick arbitrary $\theta_{(0)} \in \R^n$. Let $\alpha \in \R$ with $\alpha > 0$
While the stopping criterion is not satisfied:
1. Compute the gradient of $f$ at $\theta_{(t)}$, i.e. $h_{(t)} := F^{-1} d^\T$
2. Move in the direction $-h_{(t)}$, i.e. $\theta_{(t+1)} = \theta_{(t)} - \alpha h_{(t)}$
3. $t = t+1$
Return $\theta_{(t)}$
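To make Algorithm 4 concrete, here is a small sketch for fitting a univariate Gaussian. The setup is an assumption of the example, not part of the algorithm: the model is $\mathcal{N}(\mu, \sigma^2)$ in the coordinates $\theta = (\mu, \log \sigma)$, in which the Fisher information matrix has the closed form $\mathrm{diag}(1/\sigma^2, 2)$, and the objective is the average negative log-likelihood of a synthetic dataset. The function names, step size, and number of iterations are likewise illustrative.

```python
import numpy as np

def fisher(theta):
    """Fisher matrix of N(mu, sigma^2) in the coordinates theta = (mu, log sigma)."""
    _, s = theta
    return np.diag([np.exp(-2.0 * s), 2.0])

def grad_nll(theta, z):
    """Partial derivatives d of the average negative log-likelihood."""
    mu, s = theta
    inv_var = np.exp(-2.0 * s)
    m2 = np.mean((z - mu) ** 2)
    return np.array([-(np.mean(z) - mu) * inv_var, 1.0 - m2 * inv_var])

def natural_gradient_descent(z, theta0, alpha=0.5, n_steps=200):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        d = grad_nll(theta, z)
        h = np.linalg.solve(fisher(theta), d)   # step 1: h = F^{-1} d^T
        theta = theta - alpha * h               # step 2: plain addition in R^n
    return theta

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=1000)   # synthetic data
mu_hat, s_hat = natural_gradient_descent(z, theta0=[0.0, 0.0])
sigma_hat = np.exp(s_hat)
```

The result can be compared with the maximum-likelihood estimates `z.mean()` and `z.std()`. In this parametrization the natural-gradient update for $\mu$ reduces to $\mu \leftarrow \mu + \alpha(\bar{z} - \mu)$, independent of the current $\sigma$, which illustrates how the Fisher metric removes the dependence on the scale of the parameters.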

Conclusion

Optimization on Riemannian manifolds is an interesting and important application in the field of geometry. It generalizes the optimization methods from Euclidean spaces onto Riemannian manifolds. Specifically, in the gradient descent method, adapting it to a Riemannian manifold requires us to use the Riemannian gradient as the search direction and the exponential map or retraction to move between points on the manifold.

One major difficulty exists: Computing and storing the matrix representation $G$ of the metric tensor are very expensive. Suppose the manifold is $n$-dimensional. Then, the size of $G$ is in $O(n^2)$ and the complexity of inverting it is in $O(n^3)$. In machine learning, $n$ could be in the order of millions, so a naive implementation is infeasible. Thankfully, many approximations of the metric tensor, especially for the Fisher information metric, exist (e.g. [7]). Thus, even with these difficulties, the Riemannian gradient descent or its variants have been successfully applied in many areas, such as in inference problems [8], word or knowledge graph embeddings [9], etc.

References

Lee, John M. “Smooth manifolds.” Introduction to Smooth Manifolds. Springer, New York, NY, 2013. 1-31.
Lee, John M. Riemannian manifolds: an introduction to curvature. Vol. 176. Springer Science & Business Media, 2006.
Fels, Mark Eric. “An Introduction to Differential Geometry through Computation.” (2016).
Absil, P-A., Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
Boumal, Nicolas. Optimization and estimation on manifolds. Diss. Catholic University of Louvain, Louvain-la-Neuve, Belgium, 2014.
Graphics: https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane.
Martens, James, and Roger Grosse. “Optimizing neural networks with kronecker-factored approximate curvature.” International conference on machine learning. 2015.
Patterson, Sam, and Yee Whye Teh. “Stochastic gradient Riemannian Langevin dynamics on the probability simplex.” Advances in neural information processing systems. 2013.
Suzuki, Atsushi, Yosuke Enokida, and Kenji Yamanishi. “Riemannian TransE: Multi-relational Graph Embedding in Non-Euclidean Space.” (2018).

Notes on Riemannian Geometry

Fri, 22 Feb 2019 12:00:00 -0500

Recently I have been studying differential geometry, including Riemannian geometry. When studying this subject, a lot of aha moments came up due to my previous (albeit informal) exposure to the geometric point-of-view of the natural gradient method. I found the argument from this point-of-view to be very elegant, which motivates me further to study geometry in depth. This writing is a collection of small notes (largely from Lee’s Introduction to Smooth Manifolds and Introduction to Riemannian Manifolds) that I find useful as a reference on this subject. Note that this is by no means a completed article. I will update it as I study further.

Manifolds

We are interested in generalizing the notion of Euclidean space to arbitrary smooth curved spaces, called smooth manifolds. Intuitively speaking, a topological $n$-manifold $\M$ is a topological space that locally resembles $\R^n$. A smooth $n$-manifold is a topological $n$-manifold equipped with a locally smooth map $\phi_p: \M \to \R^n$ around each point $p \in \M$, called the local coordinate chart.

Example 1 (Euclidean spaces). For each $n \in \mathbb{N}$, the Euclidean space $\R^n$ is a smooth $n$-manifold with a single chart $\phi := \text{Id}_{\R^n}$, the identity map, for all $p \in \R^n$. Thus, $\phi$ is a global coordinate chart.

Example 2 (Spaces of matrices). Let $\text{M}(m \times n, \R)$ denote the set of $m \times n$ matrices with real entries. We can identify it with $\R^{mn}$ and as before, this is a smooth $mn$-dimensional manifold. Some of its subsets, e.g. the general linear group $\text{GL}(n, \R)$ and the space of full rank matrices, are smooth manifolds.

Remark 1. We will drop $n$ when referring to a smooth $n$-manifold from now on, for brevity's sake. Furthermore, we will start to use the Einstein summation convention: repeated indices above and below are implied to be summed, e.g. $v_i w^i := \sum_i v_i w^i$.

Tangent vectors and covectors

At each point $p \in \M$, there exists a vector space $T_p \M$, called the tangent space at $p$. An element $v \in T_p \M$ is called a tangent vector. Let $f: \M \to \R$ be a smooth function. In local coordinates $\{x^1, \dots, x^n\}$ defined around $p$, the coordinate vectors $\{ \partial/\partial x^1, \dots, \partial/\partial x^n \}$ form a coordinate basis for $T_p \M$.

A tangent vector $v \in T_p \M$ can also be seen as a derivation, a linear map $C^\infty(\M) \to \R$ that follows Leibniz rule (product rule of derivative), i.e.

\[v(fg) = f(p)vg + g(p)vf \enspace \enspace \forall f, g \in C^\infty(\M) \, .\]

Thus, we can also see $T_p \M$ to be the set of all derivations of $C^\infty(\M)$ at $p$.

For each $p \in \M$ there also exists the dual space $T_p^* \M$ of $T_p \M$, called the cotangent space at $p$. Each element $\omega \in T_p^* \M$ is called a tangent covector, which is a linear functional $\omega: T_p \M \to \R$ acting on tangent vectors at $p$. Given the same local coordinates as above, the basis for the cotangent space at $p$ is called the dual coordinate basis and is given by $\{ dx^1, \dots, dx^n \}$, such that $dx^i(\partial/\partial x^j) = \delta^i_j$, the Kronecker delta. Note that this implies that if $v := v^i \, \partial/\partial x^i$, then $dx^i(v) = v^i$.

Tangent vectors and covectors follow different transformation rules. We say that an object with a lower index, e.g. the components $\omega_i$ of a tangent covector and the coordinate basis $\partial/\partial x^i =: \partial_i$, follows the covariant transformation rule. Meanwhile, an object with an upper index, e.g. the components $v^i$ of a tangent vector and the dual coordinate basis $dx^i$, follows the contravariant transformation rule. These stem from how an object transforms w.r.t. a change of coordinates. Recall that when all the basis vectors are scaled up by a factor of $k$, the coefficients of a vector in its linear combination are scaled by $1/k$; thus it is said that a vector transforms contra-variantly (the opposite way to the basis). Analogously, we can show that under the same transformation of the basis, the coefficients of a covector are scaled by $k$; thus it transforms the same way as the basis (co-variantly).

The partial derivatives of a scalar field (real valued function) on $\M$ can be interpreted as the components of a covector field in a coordinate-independent way. Let $f$ be such scalar field. We define a covector field $df: \M \to T^* \M$, called the differential of $f$, by

\[df_p(v) := vf \enspace \enspace \text{for} \, v \in T_p\M \, .\]

Concretely, in smooth coordinates $\{ x^i \}$ around $p$, we can show that it can be written as

\[df_p := \frac{\partial f}{\partial x^i} (p) \, dx^i \, \vert_p \, ,\]

or as an equation between covector fields instead of covectors:

\[df := \frac{\partial f}{\partial x^i} \, dx^i \, .\]

The disjoint union of the tangent spaces at all points of $\M$ is called the tangent bundle of $\M$

\[T\M := \coprod_{p \in \M} T_p \M \, .\]

Meanwhile, analogously for the cotangent spaces, we define the cotangent bundle of $\M$ as

\[T^*\M := \coprod_{p \in \M} T^*_p \M \, .\]

If $\M$ and $\mathcal{N}$ are smooth manifolds and $F: \M \to \mathcal{N}$ is a smooth map, for each $p \in \M$ we define a map

\[dF_p : T_p \M \to T_{F(p)} \mathcal{N} \, ,\]

called the differential of $F$ at $p$, as follows. Given $v \in T_p \M$:

\[dF_p (v)(f) := v(f \circ F) \, .\]

Moreover, for any $v \in T_p \M$, we call $dF_p (v)$ the pushforward of $v$ by $F$ at $p$. It differs from the previous definition of differential in the sense that this map is a linear map between tangent spaces of two manifolds. Furthermore the differential of $F$ can be seen as the generalization of the total derivative in Euclidean spaces, in which $dF_p$ is represented by the Jacobian matrix.
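The defining identity $dF_p(v)(f) = v(f \circ F)$ and the Jacobian representation can be checked numerically in Euclidean spaces. In the sketch below, $F$ is taken to be the polar-to-Cartesian map and $f(x, y) = x^2 y$; both, along with the point `p` and vector `v`, are arbitrary choices for illustration. The right-hand side is approximated by a central finite difference.

```python
import numpy as np

def F(q):
    """Polar-to-Cartesian map (r, phi) -> (x, y)."""
    r, phi = q
    return np.array([r * np.cos(phi), r * np.sin(phi)])

def jacobian_F(q):
    """Jacobian matrix representing dF at q."""
    r, phi = q
    return np.array([[np.cos(phi), -r * np.sin(phi)],
                     [np.sin(phi),  r * np.cos(phi)]])

def f(x):
    return x[0] ** 2 * x[1]

def grad_f(x):
    return np.array([2.0 * x[0] * x[1], x[0] ** 2])

p = np.array([2.0, np.pi / 6.0])
v = np.array([0.5, -1.0])

pushforward = jacobian_F(p) @ v               # dF_p(v), a tangent vector at F(p)
lhs = grad_f(F(p)) @ pushforward              # dF_p(v) acting on f

eps = 1e-6                                    # finite-difference step
rhs = (f(F(p + eps * v)) - f(F(p - eps * v))) / (2.0 * eps)   # v acting on f o F
```

The two sides agree up to the finite-difference error, which is just the chain rule in disguise.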

Vector fields

If $\M$ is a smooth $n$-manifold, a vector field on $\M$ is a continuous map $X: \M \to T\M$, written as $p \mapsto X_p$, such that $X_p \in T_p \M$ for each $p \in \M$. If $(U, (x^i))$ is any smooth chart for $\M$, we write the value of $X$ at any $p \in U \subset \M$ as

\[X_p = X^i(p) \, \frac{\partial}{\partial x^i} \vert_p \, .\]

This defines $n$ functions $X^i: U \to \R$, called the component functions of $X$. The restriction of $X$ to $U$ is a smooth vector field if and only if its component functions w.r.t. the chart are smooth.

Example 3 (Coordinate vector fields). If $(U, (x^i))$ is any smooth chart on $\M$, then $p \mapsto \partial/\partial x^i \vert_p$ is a vector field on $U$, called the $i$-th coordinate vector field. It is smooth as its component functions are constant. These vector fields define a basis of the tangent space at each point.

Example 4 (Gradient). If $f \in C^\infty(\M)$ is a real-valued function on $\M$, then the gradient of $f$ is a vector field on $\M$. See the corresponding section below for more detail.

We denote $\mathfrak{X}(\M)$ to be the set of all smooth vector fields on $\M$. It is a vector space under pointwise addition and scalar multiplication, i.e. $(aX + bY)_p = aX_p + bY_p$. The zero element is the zero vector field, whose value is $0 \in T_p \M$ for all $p \in \M$. If $f \in C^\infty(\M)$ and $X \in \mathfrak{X}(\M)$, then we define $fX: \M \to T\M$ by $(fX)_p = f(p)X_p$. Note that this defines a multiplication of a vector field with a smooth real-valued function. Furthermore, if in addition, $g \in C^\infty(\M)$ and $Y \in \mathfrak{X}(\M)$, then $fX + gY$ is also a smooth vector field.

A local frame for $\M$ is an ordered $n$-tuple of vector fields $(E_1, \dots, E_n)$ defined on an open subset $U \subseteq \M$ that is linearly independent and spans the tangent bundle, i.e. $(E_1 \vert_p, \dots, E_n \vert_p)$ form a basis for $T_p \M$ for each $p \in U$. It is called a global frame if $U = \M$, and a smooth frame if each $E_i$ is smooth.

If $X \in \mathfrak{X}(\M)$ and $f \in C^\infty(U)$, we define $Xf: U \to \R$ by $(Xf)(p) = X_p f$. $X$ also defines a map $C^\infty(\M) \to C^\infty(\M)$ by $f \mapsto Xf$ which is linear and Leibniz, thus it is a derivation. Moreover, derivations of $C^\infty(\M)$ can be identified with smooth vector fields, i.e. $D: C^\infty(\M) \to C^\infty(\M)$ is a derivation if and only if it is of the form $Df = Xf$ for some $X \in \mathfrak{X}(\M)$.

Tensors

Let $\{ V_k \}$ and $U$ be real vector spaces. A map $F: V_1 \times \dots \times V_k \to U$ is said to be multilinear if it is linear as a function of each variable separately when the others are held fixed. That is, it is a generalization of the familiar linear and bilinear maps. Furthermore, we write the vector space of all multilinear maps $ V_1 \times \dots \times V_k \to U $ as $ \text{L}(V_1, \dots, V_k; U) $.

Example 5 (Multilinear functions). Some examples of familiar multilinear functions are

The dot product in $ \R^n $ is a scalar-valued bilinear function of two vectors. E.g. for any $ v, w \in \R^n $, the dot product between them is $ v \cdot w := \sum_{i=1}^n v^i w^i $, which is linear in both $ v $ and $ w $.
The determinant is a real-valued multilinear function of $ n $ vectors in $ \R^n $.

Let $\{ W_l \}$ also be real vector spaces and suppose

\[\begin{align} F&: V_1 \times \dots \times V_k \to \R \\ G&: W_1 \times \dots \times W_l \to \R \end{align}\]

be multilinear maps. Define a function

\[\begin{align} F \otimes G &: V_1 \times \dots \times V_k \times W_1 \times \dots \times W_l \to \R \\ F \otimes G &(v_1, \dots, v_k, w_1, \dots, w_l) = F(v_1, \dots, v_k) G(w_1, \dots, w_l) \, . \end{align}\]

From the multilinearity of $ F $ and $ G $ it follows that $ F \otimes G $ is also multilinear, and is called the tensor product of $ F $ and $ G $. I.e. tensors and tensor products are multilinear maps with codomain $ \R $.

Example 6 (Tensor products of covectors). Let $ V $ be a vector space and $ \omega, \eta \in V^* $. Recall that they are both linear maps from $ V $ to $ \R $. Therefore the tensor product between them is

\[\begin{align} \omega \otimes \eta &: V \times V \to \R \\ \omega \otimes \eta &(v_1, v_2) = \omega(v_1) \eta(v_2) \, . \end{align}\]

Example 7 (Tensor products of dual basis). Let $ \epsilon^1, \epsilon^2 $ be the standard dual basis for $ (\R^2)^* $. Then, the tensor product $ \epsilon^1 \otimes \epsilon^2: \R^2 \times \R^2 \to \R $ is the bilinear function defined by

\[\epsilon^1 \otimes \epsilon^2((w, x), (y, z)) := wz \, .\]
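Examples 6 and 7 are easy to reproduce with arrays. In this sketch, covectors on $\R^2$ are stored as arrays of components, and the helper name `tensor_product` is made up for illustration. The component matrix of $\omega \otimes \eta$ is the outer product of the two component arrays.

```python
import numpy as np

def tensor_product(omega, eta):
    """Return the bilinear map (v1, v2) -> omega(v1) * eta(v2)."""
    return lambda v1, v2: (omega @ v1) * (eta @ v2)

eps1 = np.array([1.0, 0.0])    # standard dual basis covector epsilon^1
eps2 = np.array([0.0, 1.0])    # standard dual basis covector epsilon^2
T = tensor_product(eps1, eps2)

a = np.array([3.0, 5.0])       # plays the role of (w, x)
b = np.array([7.0, 2.0])       # plays the role of (y, z)
value = T(a, b)                # w * z = 3 * 2

components = np.outer(eps1, eps2)   # matrix of epsilon^1 (x) epsilon^2
same_value = a @ components @ b
```

Both `value` and `same_value` evaluate to $wz = 6$, and `T` is linear in each argument separately.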

We use the notation $ V_1^* \otimes \dots \otimes V_k^* $ to denote the space $ \text{L}(V_1, \dots, V_k; \R) $. Let $ V $ be a finite-dimensional vector space. If $ k \in \mathbb{N} $, a covariant $ k $-tensor on $ V $ is an element of the $ k $-fold tensor product $ V^* \otimes \dots \otimes V^* $, which is a real-valued multilinear function of $ k $ elements of $ V $ to $ \R $. The number $ k $ is called the rank of the tensor.

Analogously, we define a contravariant $ k $-tensor on $ V $ to be an element of the $ k $-fold tensor product $ V \otimes \dots \otimes V $. We can mix the two types of tensors together: For any $ k, l \in \mathbb{N} $, we define a mixed tensor on $ V $ of type $ (k, l) $ to be an element of the tensor product of $ k $ copies of $ V $ and $ l $ copies of $ V^* $.

Riemannian metrics

So far we have no mechanism to measure the length of (tangent) vectors like we do in standard Euclidean geometry, where the length of a vector $v$ is measured in terms of the dot product $ \sqrt{v \cdot v} $. Thus, we would like to add a structure that enables us to do just that to our smooth manifold $\M$.

A Riemannian metric $ g $ on $ \M $ is a smooth symmetric covariant 2-tensor field on $ \M $ that is positive definite at each point. Furthermore, for each $ p \in \M $, $ g_p $ defines an inner product on $ T_p \M $, written $ \inner{v, w}_g = g_p(v, w) $ for all $ v, w \in T_p \M $. We call a tuple $(\M, g)$ to be a Riemannian manifold.

In any smooth local coordinate $\{x^i\}$, a Riemannian metric can be written as tensor product

\[g = g_{ij} \, dx^i \otimes dx^j \, ,\]

such that

\[g(v, w) = g_{ij} \, dx^i \otimes dx^j(v, w) = g_{ij} \, dx^i(v) dx^j(w) = g_{ij} \, v^i w^j \, .\]

That is, we can represent $ g $ by a symmetric, positive definite matrix $ G $ acting on two tangent vectors: $ \inner{v, w}_g = v^\text{T} G w $. Furthermore, we can define a norm w.r.t. $g$ by $\norm{v}_g := \sqrt{\inner{v, v}_g}$ for any $v \in T_p \M$.

Example 8 (The Euclidean Metric). The simplest example of a Riemannian metric is the familiar Euclidean metric $g$ of $\R^n$ using the standard coordinate. It is defined by

\[g = \delta_{ij} \, dx^i \otimes dx^j \, ,\]

which, if applied to vectors $v, w \in T_p \R^n$, yields

\[g_p(v, w) = \delta_{ij} \, v^i w^j = \sum_{i=1}^n v^i w^i = v \cdot w \, .\]

Note that above, $\delta_{ij}$ is the Kronecker delta. Thus, the Euclidean metric can be represented by the $n \times n$ identity matrix.
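In matrix form, the inner product and norm induced by a metric take one line each. The sketch below first uses the Euclidean metric of Example 8 and then, as an additional assumed example, the Euclidean metric written in polar coordinates $(r, \phi)$, whose matrix is $\mathrm{diag}(1, r^2)$. The function names are illustrative.

```python
import numpy as np

def inner_g(v, w, G):
    """Inner product <v, w>_g = v^T G w."""
    return v @ G @ w

def norm_g(v, G):
    """Norm ||v||_g = sqrt(<v, v>_g)."""
    return np.sqrt(inner_g(v, v, G))

# Euclidean metric: G is the identity, so the norm is the usual length.
v = np.array([3.0, 4.0])
euclid_len = norm_g(v, np.eye(2))

# Polar coordinates at radius r: G = diag(1, r^2).
r = 2.0
G_polar = np.diag([1.0, r ** 2])
d_phi = np.array([0.0, 1.0])          # the coordinate vector d/dphi
polar_len = norm_g(d_phi, G_polar)
```

Here `euclid_len` is $5$, while `polar_len` is $r = 2$: the coordinate vector $\partial/\partial\phi$ has components $(0, 1)$ everywhere, but its length grows with the radius.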

The tangent-cotangent isomorphism

Riemannian metrics also provide an isomorphism between the tangent and cotangent space: They allow us to convert vectors to covectors and vice versa. Let $(\M, g)$ be a Riemannian manifold. We define an isomorphism $\hat{g}: T_p \M \to T_p^* \M$ as follows. For each $p \in \M$ and each $v \in T_p \M$

\[\hat{g}(v) = \inner{v, \cdot}_g \, .\]

Notice that $\hat{g}(v)$ is in $T_p^* \M$ as it is a linear functional over $T_p \M$. In any smooth coordinate $\{x^i\}$, by definition we can write $g = g_{ij} \, dx^i dx^j$. Thus we can write the isomorphism above as

\[\hat{g}(v) = (g_{ij} \, v^i) \, dx^j =: v_j \, dx^j \, .\]

Notice that we transform the contravariant components $v^i$ (denoted by the upper index) into the covariant components $v_j = g_{ij} \, v^i$ (denoted by the lower index), with the help of the metric tensor $g$. Because of this, we say that we obtain a covector from a tangent vector by lowering an index. Note that we can also denote this by the “flat” symbol from musical notation: $\hat{g}(v) =: v^\flat$.

As a Riemannian metric can be represented by a symmetric positive definite matrix $G$, it has an inverse $G^{-1}$, whose entries we denote by $g^{ij}$ (moving the indices to the top), such that $g^{ij} \, g_{jk} = g_{kj} \, g^{ji} = \delta^i_k$. We can then define the inverse map of the above isomorphism as $\hat{g}^{-1}: T_p^* \M \to T_p \M$, where

\[\hat{g}^{-1}(\omega) = (g^{ij} \, \omega_j) \, \frac{\partial}{\partial x^i} =: \omega^i \, \frac{\partial}{\partial x^i} \, ,\]

for all $\omega \in T_p^* \M$. In correspondence with the previous operation, we are now looking at the components $\omega^i := g^{ij} \, \omega_j$, hence this operation is called raising an index, which we can also denote by the “sharp” musical symbol: $\hat{g}^{-1}(\omega) =: \omega^\sharp$. Putting these two maps together, we call the isomorphism between the tangent and cotangent space the musical isomorphism.
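In components, lowering and raising an index are a matrix product and a linear solve, respectively. The metric matrix `G` below is an arbitrary symmetric positive definite matrix, and the function names `flat` and `sharp` are chosen for the sketch.

```python
import numpy as np

def flat(v, G):
    """Lower an index: v_j = g_ij v^i (G is symmetric, so G @ v suffices)."""
    return G @ v

def sharp(omega, G):
    """Raise an index: omega^i = g^ij omega_j, computed by solving G x = omega."""
    return np.linalg.solve(G, omega)

G = np.array([[2.0, 0.5],
              [0.5, 1.0]])       # arbitrary SPD matrix standing in for the metric
v = np.array([1.0, -3.0])
w = np.array([0.5, 2.0])

v_flat = flat(v, G)              # components of the covector <v, .>_g
pairing = v_flat @ w             # v_flat applied to w
round_trip = sharp(v_flat, G)    # sharp undoes flat
```

`pairing` should equal $\inner{v, w}_g = v^\T G w$, and `round_trip` should recover `v`, reflecting that the two maps are inverses of each other.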

The Riemannian gradient

Let $(\M, g)$ be a Riemannian manifold, and let $f: \M \to \R$ be a real-valued function over $\M$ (i.e. a scalar field on $\M)$. Recall that $df$ is a covector field, which in coordinates has partial derivatives of $f$ as its components. We define a vector field called the gradient of $f$ by

\[\begin{align} \grad{f} := (df)^\sharp = \hat{g}^{-1}(df) \, . \end{align}\]

For any $p \in \M$ and for any $v \in T_p \M$, the gradient satisfies

\[\inner{\grad{f}, v}_g = vf \, .\]

That is, for each $p \in \M$ and for any $v \in T_p \M$, $\grad{f}$ is a vector in $T_p \M$ such that the inner product with $v$ is the derivation of $f$ by $v$. Observe the compatibility of this definition with standard multi-variable calculus: the directional derivative of a function in the direction of a vector is the dot product of its gradient and that vector.

In any smooth coordinate $\{x^i\}$, $\grad{f}$ has the expression

\[\grad{f} = g^{ij} \frac{\partial f}{\partial x^i} \frac{\partial}{\partial x^j} \, .\]

Example 9 (Euclidean gradient). On $\R^n$ with the Euclidean metric with the standard coordinate, the gradient of $f: \R^n \to \R$ is

\[\grad{f} = \delta^{ij} \, \frac{\partial f}{\partial x^i} \frac{\partial}{\partial x^j} = \sum_{i=1}^n \frac{\partial f}{\partial x^i} \frac{\partial}{\partial x^i} \, .\]

Thus, again, it coincides with the definition we are familiar with from calculus.

All in all then, given a basis, in matrix notation, let $G$ be the matrix representation of $g$ and let $d$ be the matrix representation of $df$ (i.e. as a row vector containing all partial derivatives of $f$), then: $\grad{f} = G^{-1} d^\T$.
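As a minimal numerical sketch of the matrix formula above, the following checks that $G^{-1} d^\T$ satisfies $\inner{\grad{f}, v}_g = vf$. The polar-coordinate metric $G = \mathrm{diag}(1, r^2)$, the test function $f(r, \theta) = r^2 \sin\theta$, and the helper names are illustrative assumptions, not part of the text above.

```python
import math

def riemannian_gradient(G_inv, d):
    """Return the components of G^{-1} d^T (G_inv is a list of rows)."""
    n = len(d)
    return [sum(G_inv[i][j] * d[j] for j in range(n)) for i in range(n)]

def inner(G, u, v):
    """Inner product <u, v>_g = u^T G v."""
    n = len(u)
    return sum(u[i] * G[i][j] * v[j] for i in range(n) for j in range(n))

# Assumed example: polar coordinates (r, theta) on the plane, G = diag(1, r^2).
r, theta = 2.0, 0.5
G = [[1.0, 0.0], [0.0, r**2]]
G_inv = [[1.0, 0.0], [0.0, 1.0 / r**2]]

# Assumed test function f(r, theta) = r^2 sin(theta); d holds (df/dr, df/dtheta).
d = [2 * r * math.sin(theta), r**2 * math.cos(theta)]
grad_f = riemannian_gradient(G_inv, d)

# Defining property: <grad f, v>_g equals the directional derivative df(v).
v = [0.3, -1.2]
lhs = inner(G, grad_f, v)
rhs = d[0] * v[0] + d[1] * v[1]
```

Here the $\theta$-component of the gradient is $\cos\theta$ rather than $r^2 \cos\theta$: the inverse metric rescales the partial derivative to account for the stretched coordinate grid.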

The interpretation of the gradient on a Riemannian manifold is analogous to the one in Euclidean space: wherever it is nonzero, its direction is the direction of steepest ascent of $f$ and it is orthogonal to the level sets of $f$; and its length is the maximum directional derivative of $f$ over all unit vectors.

Connections

Let $(\M, g)$ be a Riemannian manifold and let $X, Y: \M \to T \M$ be vector fields. Applying the usual definition of the directional derivative, we would differentiate $Y$ in the direction of $X$ by

\[D_X \vert_p Y = \lim_{h \to 0} \frac{Y_{p+hX_p} - Y_p}{h} \, .\]

However, this runs into problems: we have not defined what the expression $p+hX_p$ means. Furthermore, as $Y_{p+hX_p}$ and $Y_p$ live in different vector spaces $T_{p+hX_p} \M$ and $T_p \M$, it does not make sense to subtract them, unless there is a natural isomorphism between each $T_p \M$ and $\M$ itself, as in Euclidean spaces. Hence, we need to add an additional structure, called a connection, that allows us to compare tangent vectors from different tangent spaces of nearby points.

Specifically, we define the affine connection to be a connection in the tangent bundle of $\M$. Let $\mathfrak{X}(\M)$ be the space of vector fields on $\M$; $X, Y, Z \in \mathfrak{X}(\M)$; $f, g \in C^\infty(\M)$; and $a, b \in \R$. The affine connection is given by the map

\[\begin{align} \nabla: \mathfrak{X}(\M) \times \mathfrak{X}(\M) &\to \mathfrak{X}(\M) \\ (X, Y) &\mapsto \nabla_X Y \, , \end{align}\]

which satisfies the following properties

$C^\infty(\M)$-linearity in $X$, i.e., $\nabla_{fX+gY} Z = f \, \nabla_X Z + g \, \nabla_Y Z$
$\R$-linearity in $Y$, i.e., $\nabla_X (aY + bZ) = a \, \nabla_X Y + b \, \nabla_X Z$
Leibniz rule, i.e., $\nabla_X (fY) = (Xf) Y + f \, \nabla_X Y$ .

We call $\nabla_X Y$ the covariant derivative of $Y$ in the direction $X$. Note that the notation $Xf$ means $Xf(p) := D_{X_p} \vert_p f$ for all $p \in \M$, i.e. the directional derivative (it is a scalar field). Furthermore, notice that the covariant derivative and the connection are the same thing: they generalize the notion of directional derivative to vector fields.

In any smooth local frame $(E_i)$ in $T \M$ on an open subset $U \subseteq \M$, we can expand the vector field $\nabla_{E_i} E_j$ in terms of this frame

\[\nabla_{E_i} E_j = \Gamma^k_{ij} E_k \,.\]

The $n^3$ smooth functions $\Gamma^k_{ij}: U \to \R$ are called the connection coefficients or the Christoffel symbols of $\nabla$.

Example 10 (Covariant derivative in Euclidean spaces). Let $\R^n$ with the Euclidean metric be a Riemannian manifold. Then

\[(\nabla_X Y)_p = \lim_{h \to 0} \frac{Y_{p+hX_p} - Y_p}{h} \enspace \enspace \forall p \in \R^n \, ,\]

the usual directional derivative, is a covariant derivative.

There exists a unique affine connection for every Riemannian manifold $(\M, g)$ that satisfies

Symmetry, i.e., $\nabla_X Y - \nabla_Y X = [X, Y]$
Metric compatible, i.e., $Z \inner{X, Y}_g = \inner{\nabla_Z X, Y}_g + \inner{X, \nabla_Z Y}_g$,

for all $X, Y, Z \in \mathfrak{X}(\M)$. It is called the Levi-Civita connection. Note that $[\cdot, \cdot]$ is the Lie bracket, defined by $[X, Y]f = X(Yf) - Y(Xf)$ for all $f \in C^\infty(\M)$. Note also that the connection shown in Example 10 is the Levi-Civita connection for Euclidean spaces with the Euclidean metric.
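For the Levi-Civita connection in a coordinate frame, the Christoffel symbols can be computed from the metric via the standard formula $\Gamma^k_{ij} = \frac{1}{2} g^{kl} (\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij})$, which is quoted here rather than derived. Below is a rough sketch, assuming polar coordinates on the plane and finite-difference derivatives; all function names are hypothetical.

```python
def metric_polar(x):
    """Assumed example metric: polar coordinates (r, theta), G = diag(1, r^2)."""
    r, _ = x
    return [[1.0, 0.0], [0.0, r * r]]

def inverse_2x2(G):
    (a, b), (c, d) = G
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def metric_derivatives(metric, x, eps=1e-5):
    """dG[l][i][j] approximates the partial derivative of g_ij along x^l."""
    dG = []
    for l in range(2):
        xp, xm = list(x), list(x)
        xp[l] += eps
        xm[l] -= eps
        Gp, Gm = metric(xp), metric(xm)
        dG.append([[(Gp[i][j] - Gm[i][j]) / (2 * eps) for j in range(2)]
                   for i in range(2)])
    return dG

def christoffel(metric, x):
    """Gamma[k][i][j] = 1/2 g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij})."""
    G_inv = inverse_2x2(metric(x))
    dG = metric_derivatives(metric, x)
    return [[[0.5 * sum(G_inv[k][l] * (dG[i][j][l] + dG[j][i][l] - dG[l][i][j])
                        for l in range(2))
              for j in range(2)] for i in range(2)] for k in range(2)]

# Index 0 is r, index 1 is theta; evaluate at r = 2.
Gamma = christoffel(metric_polar, [2.0, 0.7])
```

In polar coordinates the only nonzero symbols are $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = \Gamma^\theta_{\theta r} = 1/r$, which the sketch reproduces up to discretization error.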

Riemannian Hessians

Let $(\M, g)$ be a Riemannian manifold equipped with the Levi-Civita connection $\nabla$. Given a scalar field $f: \M \to \R$ and any $X, Y \in \mathfrak{X}(\M)$, the Riemannian Hessian of $f$ is the covariant 2-tensor field $\text{Hess} \, f := \nabla^2 f := \nabla \nabla f$, defined by

\[\text{Hess} \, f(X, Y) := X(Yf) - (\nabla_X Y)f = \inner{\nabla_X \, \grad{f}, Y}_g \, .\]

Another way to define the Riemannian Hessian is to treat it as a linear map $T_p \M \to T_p \M$, defined by

\[\text{Hess}_{v} \, f = \nabla_v \, \grad{f} \, ,\]

for every $p \in \M$ and $v \in T_p \M$.

In any local coordinates $\{x^i\}$, it is given by

\[\text{Hess} \, f = f_{; i,j} \, dx^i \otimes dx^j := \left( \frac{\partial^2 f}{\partial x^i \partial x^j} - \Gamma^k_{ij} \frac{\partial f}{\partial x^k} \right) \, dx^i \otimes dx^j \, .\]

Example 11 (Euclidean Hessian). Let $\R^n$ be a Euclidean space with the Euclidean metric and standard Euclidean coordinates. We can show that the connection coefficients of the Levi-Civita connection are all $0$. Thus the Hessian is given by

\[\text{Hess} \, f = \left( \frac{\partial^2 f}{\partial x^i \partial x^j} \right) \, dx^i \otimes dx^j \, .\]

It is the same Hessian as we have seen in calculus.
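To see the correction term in action, here is a worked sketch in polar coordinates $(r, \theta)$ on $\R^2$, where the metric is $g = dr \otimes dr + r^2 \, d\theta \otimes d\theta$. We take as given the standard Levi-Civita connection coefficients for this chart, which are not derived in this post: $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = \Gamma^\theta_{\theta r} = 1/r$, with all others zero. The coordinate formula for the Hessian then reads

\[\begin{align} f_{; r,r} &= \frac{\partial^2 f}{\partial r^2} \, , \\ f_{; r,\theta} = f_{; \theta,r} &= \frac{\partial^2 f}{\partial r \partial \theta} - \frac{1}{r} \frac{\partial f}{\partial \theta} \, , \\ f_{; \theta,\theta} &= \frac{\partial^2 f}{\partial \theta^2} + r \frac{\partial f}{\partial r} \, . \end{align}\]

As a sanity check, the linear function $f = x = r \cos\theta$ has zero Euclidean Hessian, and indeed $f_{; r,r} = 0$, $f_{; r,\theta} = -\sin\theta + \sin\theta = 0$, and $f_{; \theta,\theta} = -r\cos\theta + r\cos\theta = 0$, whereas the matrix of plain second partial derivatives in $(r, \theta)$ is not zero.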

Geodesics

Let $(\M, g)$ be a Riemannian manifold and let $\nabla$ be a connection on $T\M$. Given a smooth curve $\gamma: I \to \M$, a vector field along $\gamma$ is a smooth map $V: I \to T\M$ such that $V(t) \in T_{\gamma(t)}\M$ for every $t \in I$. We denote the space of all such vector fields $\mathfrak{X}(\gamma)$. A vector field $V$ along $\gamma$ is said to be extendible if there exists another vector field $\tilde{V}$ on a neighborhood of $\gamma(I)$ such that $V = \tilde{V} \circ \gamma$.

For each smooth curve $\gamma: I \to \M$, the connection determines a unique operator

\[D_t: \mathfrak{X}(\gamma) \to \mathfrak{X}(\gamma) \, ,\]

called the covariant derivative along $\gamma$, satisfying (i) linearity over $\R$, (ii) the Leibniz rule, and (iii) if $V \in \mathfrak{X}(\gamma)$ is extendible, then for every extension $\tilde{V}$ of $V$, we have $D_t V(t) = \nabla_{\gamma'(t)} \tilde{V}$.

For every smooth curve $\gamma: I \to \M$, we define the acceleration of $\gamma$ to be the vector field $D_t \gamma'$ along $\gamma$. A smooth curve $\gamma$ is called a geodesic with respect to $\nabla$ if its acceleration is zero, i.e. $D_t \gamma' = 0 \enspace \forall t \in I$. In terms of smooth coordinates $\{x^i\}$, if we write $\gamma$ in terms of its components $\gamma(t) := (x^1(t), \dots, x^n(t))$, then it follows that $\gamma$ is a geodesic if and only if its component functions satisfy the following geodesic equation:

\[\ddot{x}^k(t) + \dot{x}^i(t) \, \dot{x}^j(t) \, \Gamma^k_{ij}(x(t)) = 0 \, ,\]

where we use $x(t)$ as an abbreviation for $(x^1(t), \dots, x^n(t))$. Observe that this tells us that, to compute a geodesic, we need to solve a system of second-order ODEs for the real-valued functions $x^1, \dots, x^n$.
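The geodesic equation can be integrated numerically like any other second-order ODE system. The following is a minimal sketch under stated assumptions: the plane in polar coordinates $(r, \theta)$, whose nonzero Levi-Civita coefficients are $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = \Gamma^\theta_{\theta r} = 1/r$ (standard values, not derived here), and a classical fourth-order Runge-Kutta integrator chosen purely for illustration. Since the plane is flat, the resulting curve should be a straight line when converted back to Cartesian coordinates.

```python
import math

def geodesic_rhs(state):
    """First-order form of the geodesic equation in polar coordinates."""
    r, th, dr, dth = state
    # r'' = -Gamma^r_{theta theta} theta'^2 = r theta'^2
    ddr = r * dth * dth
    # theta'' = -2 Gamma^theta_{r theta} r' theta' = -2 r' theta' / r
    ddth = -2.0 * dr * dth / r
    return [dr, dth, ddr, ddth]

def rk4_step(f, state, h):
    """One classical Runge-Kutta step of size h."""
    k1 = f(state)
    k2 = f([s + 0.5 * h * k for s, k in zip(state, k1)])
    k3 = f([s + 0.5 * h * k for s, k in zip(state, k2)])
    k4 = f([s + h * k for s, k in zip(state, k3)])
    return [s + h / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def solve_geodesic(state0, t_end=1.0, steps=1000):
    """Integrate from t = 0 to t_end; state = (r, theta, r', theta')."""
    state = list(state0)
    h = t_end / steps
    for _ in range(steps):
        state = rk4_step(geodesic_rhs, state, h)
    return state

# Start at (x, y) = (1, 0) with Cartesian velocity (0, 1),
# i.e. r = 1, theta = 0, r' = 0, theta' = 1.
r1, th1, _, _ = solve_geodesic([1.0, 0.0, 0.0, 1.0])
x1, y1 = r1 * math.cos(th1), r1 * math.sin(th1)
```

With these initial conditions the exact geodesic is the vertical line $t \mapsto (1, t)$ in Cartesian coordinates, so the endpoint at $t = 1$ should be close to $(1, 1)$.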

Suppose $\gamma: [a, b] \to \M$ is a smooth curve segment with domain in the interval $[a, b]$. The length of $\gamma$ is

\[L_g (\gamma) := \int_a^b \norm{\gamma'(t)}_g \, dt \, ,\]

where $\gamma'$ is the derivative (the velocity vector) of $\gamma$. We can then use curve segments as “measuring tapes” to measure the Riemannian distance from $p$ to $q$ for any $p, q \in \M$:

\[d_g(p, q) := \inf \, \{L_g(\gamma) \, \vert \, \gamma: [a, b] \to \M \enspace \text{s.t.} \enspace \gamma(a) = p, \, \gamma(b) = q\} \, ,\]

where the infimum is taken over all curve segments $\gamma$ which have endpoints at $p$ and $q$. We call a curve $\gamma$ such that $L_g(\gamma) = d_g(p, q)$ a length-minimizing curve. For the Levi-Civita connection, we can show that all geodesics are locally length-minimizing, and all length-minimizing curves are geodesics once they are given a unit-speed parametrization.
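To make the length functional concrete, here is a small sketch that approximates $L_g(\gamma)$ with the midpoint rule. It assumes the unit sphere in spherical coordinates $(\theta, \phi)$ with the round metric $\mathrm{diag}(1, \sin^2\theta)$, and compares two curves joining the same pair of points: a circle of latitude, and the route over the north pole along two meridians. The function names and the choice of example are illustrative.

```python
import math

def sphere_speed(theta, dtheta, dphi):
    """Norm of the velocity (dtheta, dphi) under the metric diag(1, sin^2 theta)."""
    return math.sqrt(dtheta**2 + (math.sin(theta) * dphi)**2)

def curve_length(curve, velocity, a, b, steps=2000):
    """Midpoint-rule approximation of the integral of ||gamma'(t)||_g over [a, b]."""
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        t = a + (k + 0.5) * h
        theta, _ = curve(t)
        dtheta, dphi = velocity(t)
        total += sphere_speed(theta, dtheta, dphi) * h
    return total

theta0 = math.pi / 3  # colatitude of both endpoints

# Curve 1: along the latitude circle, phi running from 0 to pi.
L_lat = curve_length(lambda t: (theta0, t), lambda t: (0.0, 1.0), 0.0, math.pi)

# Curve 2: one meridian from the pole down to colatitude theta0; the route
# over the pole consists of two such segments.
L_mer = curve_length(lambda t: (t, 0.0), lambda t: (1.0, 0.0), 0.0, theta0)
L_over_pole = 2 * L_mer
```

The route over the pole is a great-circle arc and is shorter than the latitude circle, even though the latter looks like a "straight" line in the $(\theta, \phi)$ chart.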

Parallel transport

Let $(\M, g)$ be a Riemannian manifold with affine connection $\nabla$. A smooth vector field $V$ along a smooth curve $\gamma: I \to \M$ is said to be parallel along $\gamma$ if $D_t V = 0$ for all $t \in I$. Notice that a geodesic can then be said to be a curve whose velocity vector field is parallel along the curve.

Given $t_0 \in I$ and $v \in T_{\gamma(t_0)} \M$, we can show there exists a unique parallel vector field $V$ along $\gamma$ such that $V(t_0) = v$. This vector field is called the parallel transport of $v$ along $\gamma$. Now, for each $t_0, t_1 \in I$, we define a map

\[\begin{align} &P^\gamma_{t_0 t_1} : T_{\gamma(t_0)} \M \to T_{\gamma(t_1)} \M \\ &P^\gamma_{t_0 t_1}(v) = V(t_1) \, , \end{align}\]

called the parallel transport map. We can picture the concept of parallel transport by imagining that we are “sliding” a tangent vector $v$ along $\gamma$ such that the direction and the magnitude of $v$ are preserved.

Note that, the parallel transport map is a linear map with inverse $P^\gamma_{t_1 t_0}$, hence it is an isomorphism between two tangent spaces $T_{\gamma(t_0)} \M$ and $T_{\gamma(t_1)} \M$. We can therefore determine the covariant derivative along $\gamma$ using parallel transport:

\[D_t V(t_0) = \lim_{t_1 \to t_0} \frac{P^\gamma_{t_1 t_0} \, V(t_1) - V(t_0)}{t_1 - t_0} \, .\]

Moreover, we can also determine the connection $\nabla$ via parallel transport:

\[\nabla_X Y \, \vert_p = \lim_{h \to 0} \frac{P^\gamma_{h 0} Y_{\gamma(h)} - Y_p}{h} \, ,\]

for every $p \in \M$, where $\gamma$ is any smooth curve with $\gamma(0) = p$ and $\gamma'(0) = X_p$.

Finally, if $A$ is a smooth vector field on $\M$, then $A$ is parallel on $\M$ if and only if $\nabla A = 0$.
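In coordinates, the condition $D_t V = 0$ becomes the linear ODE $\dot{V}^k + \Gamma^k_{ij} \, \dot{x}^i \, V^j = 0$ (a standard coordinate expression, quoted here without derivation). The sketch below integrates it along a circle of latitude on the unit sphere, assuming spherical coordinates $(\theta, \phi)$ with $\Gamma^\theta_{\phi\phi} = -\sin\theta\cos\theta$ and $\Gamma^\phi_{\theta\phi} = \Gamma^\phi_{\phi\theta} = \cot\theta$; the integrator and names are illustrative choices.

```python
import math

def transport_rhs(V, theta0):
    """dV/dt for transport along the curve theta = theta0, phi = t."""
    V_theta, V_phi = V
    # V'^k = -Gamma^k_{ij} x'^i V^j with velocity x' = (0, 1).
    return [math.sin(theta0) * math.cos(theta0) * V_phi,
            -math.cos(theta0) / math.sin(theta0) * V_theta]

def parallel_transport(V0, theta0, t_end, steps=4000):
    """Integrate the transport ODE with classical fourth-order Runge-Kutta."""
    V = list(V0)
    h = t_end / steps
    f = lambda W: transport_rhs(W, theta0)
    for _ in range(steps):
        k1 = f(V)
        k2 = f([a + 0.5 * h * b for a, b in zip(V, k1)])
        k3 = f([a + 0.5 * h * b for a, b in zip(V, k2)])
        k4 = f([a + h * b for a, b in zip(V, k3)])
        V = [a + h / 6.0 * (b + 2 * c + 2 * d + e)
             for a, b, c, d, e in zip(V, k1, k2, k3, k4)]
    return V

def norm_g(V, theta0):
    """Length of V under the round metric diag(1, sin^2 theta)."""
    return math.sqrt(V[0]**2 + (math.sin(theta0) * V[1])**2)

theta0 = math.pi / 3          # colatitude of the circle
V0 = [1.0, 0.0]               # start pointing along d/dtheta
V1 = parallel_transport(V0, theta0, 2 * math.pi)  # once around the circle
```

The length of the vector is preserved, but after a full loop it does not return to itself: at this colatitude it comes back rotated by the angle $2\pi \cos\theta_0 = \pi$, i.e. pointing the opposite way. This is a hint of curvature, discussed later.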

The exponential map

Geodesics with proportional initial velocities are related in a simple way. Let $(\M, g)$ be a Riemannian manifold equipped with the Levi-Civita connection. For every $p \in \M$, $v \in T_p \M$, and $c, t \in \R$,

\[\gamma_{cv} (t) = \gamma_{v} (ct) \, ,\]

whenever either side is defined. This fact is compatible with our intuition on how speed and time are related to distance.

From the fact above, we can define a map from the tangent bundle to $\M$ itself, which sends each line through the origin in $T_p \M$ to a geodesic. Define a subset $\mathcal{E} \subseteq T\M$, the domain of the exponential map by

\[\mathcal{E} := \{ v \in T\M : \gamma_v \text{ is defined on an interval containing } [0, 1] \} \, ,\]

and then define the exponential map

\[\begin{align} &\text{exp}: \mathcal{E} \to \M \\ &\text{exp}(v) = \gamma_v(1) \, . \end{align}\]

For each $p \in \M$, the restricted exponential map at $p$, denoted $\text{exp}_p$ is the restriction of $\text{exp}$ to the set $\mathcal{E}_p := \mathcal{E} \cap T_p \M$.

The interpretation of the (restricted) exponential map is that, given a point $p$ and a tangent vector $v$, we follow the geodesic $\gamma$ with $\gamma(0) = p$ and $\gamma'(0) = v$ for one unit of time. This can be seen as a generalization of moving around Euclidean space by following a straight line in the direction of the velocity vector.
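On some manifolds the exponential map has a closed form. As an illustrative sketch, assume the unit sphere embedded in $\R^3$, where geodesics are great circles and $\text{exp}_p(v) = \cos(\norm{v}) \, p + \sin(\norm{v}) \, v / \norm{v}$ for a unit vector $p$ and a tangent vector $v$ orthogonal to $p$ (a standard formula, not derived in this post). The function names are hypothetical.

```python
import math

def sphere_exp(p, v):
    """exp_p(v) on the unit sphere; assumes |p| = 1 and v orthogonal to p."""
    speed = math.sqrt(sum(c * c for c in v))
    if speed == 0.0:
        return list(p)  # zero velocity: stay at p
    return [math.cos(speed) * a + math.sin(speed) * b / speed
            for a, b in zip(p, v)]

def geodesic(p, v, t):
    """gamma_v(t) = exp_p(t v)."""
    return sphere_exp(p, [t * c for c in v])

north_pole = [0.0, 0.0, 1.0]
v = [math.pi / 2, 0.0, 0.0]     # tangent vector at the pole of length pi/2
q = sphere_exp(north_pole, v)   # a quarter of a great circle away

# Rescaling property: gamma_{cv}(t) = gamma_v(ct), here with c = 2, t = 0.25.
a = geodesic(north_pole, [2 * c for c in v], 0.25)
b = geodesic(north_pole, v, 0.5)
```

Following the geodesic from the north pole for a distance of $\pi/2$ lands on the equator, here at the point $(1, 0, 0)$ up to floating-point error.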

Curvature

Let $(\M, g)$ be a Riemannian manifold. Recall that an isometry is a map that preserves distance. Now, $\M$ is said to be flat if it is locally isometric to a Euclidean space, that is, every point in $\M$ has a neighborhood that is isometric to an open set in $\R^n$. We say that a connection $\nabla$ on $\M$ satisfies the flatness criterion if whenever $X, Y, Z$ are smooth vector fields defined on an open subset of $\M$, the following identity holds:

\[\nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z = \nabla_{[X, Y]} Z \, .\]

Furthermore, we can show that if $(\M, g)$ is a flat Riemannian manifold, then its Levi-Civita connection satisfies the flatness criterion.

Example 12 (Euclidean space is flat). Let $\R^n$ with the Euclidean metric be a Riemannian manifold, equipped with the Euclidean connection $\nabla$. Then, given $X, Y, Z$ smooth vector fields:

\[\begin{align} \nabla_X \nabla_Y Z &= \nabla_X (Y(Z^k) \partial_k) = XY(Z^k) \partial_k \\ \nabla_Y \nabla_X Z &= \nabla_Y (X(Z^k) \partial_k) = YX(Z^k) \partial_k \, . \end{align}\]

The difference between them is

\[(XY(Z^k) - YX(Z^k)) \partial_k = \nabla_{[X, Y]}Z \, ,\]

by definition. Thus

\[\nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z = \nabla_{[X, Y]}Z \, .\]

Therefore, the Euclidean connection (which is the Levi-Civita connection on Euclidean space) satisfies the flatness criterion, as expected of a flat space.

Based on the flatness criterion above, we can define a measure of how far a manifold is from being flat:

\[\begin{align} &R: \mathfrak{X}(\M) \times \mathfrak{X}(\M) \times \mathfrak{X}(\M) \to \mathfrak{X}(\M) \\ &R(X, Y)Z = \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X, Y]} Z \, , \end{align}\]

which is a multilinear map over $C^\infty (\M)$, and is therefore a $(1, 3)$-tensor field on $\M$.

We can then define a covariant 4-tensor called the (Riemann) curvature tensor to be the $(0, 4)$-tensor field $Rm := R^\flat$, by lowering the contravariant index of $R$. Its action on vector fields is given by

\[Rm(X, Y, Z, W) := \inner{R(X, Y)Z, W}_g \, .\]

In any local coordinates, it is written

\[Rm = R_{ijkl} \, dx^i \otimes dx^j \otimes dx^k \otimes dx^l \, ,\]

where $R_{ijkl} = g_{lm} \, {R_{ijk}}^m$. We can show that $Rm$ is a local isometry invariant. Furthermore, compatible with our intuition of the role of the curvature tensor, a Riemannian manifold is flat if and only if its curvature tensor vanishes identically.

Working with $4$-tensors is complicated, thus we want to construct simpler tensors that summarize some of the information contained in the curvature tensor. For that, we first need to define the trace operator for tensors. Let $T^{(k,l)}(V)$ denote the space of tensors with $k$ contravariant and $l$ covariant components on a vector space $V$. The trace operator is:

\[\begin{align} &\text{tr}: T^{(k+1, l+1)}(V) \to T^{(k,l)}(V) \\ &(\text{tr} \, F)(\omega^1, \dots, \omega^k, v_1, \dots, v_l) = \text{tr}(F(\omega^1, \dots, \omega^k, \cdot, v_1, \dots, v_l, \cdot)) \, , \end{align}\]

where the trace operator in the right hand side is the usual trace operator, as $F(\omega^1, \dots, \omega^k, \cdot, v_1, \dots, v_l, \cdot) \in T^{(1,1)}(V)$ is a $(1,1)$-tensor, which can be represented by a matrix. We can extend this operator to covariant tensors in Riemannian manifolds: If $h$ is any covariant $k$-tensor field with $k \geq 2$, we can raise one of its indices and obtain $(1, k-1)$-tensor $h^\sharp$. The trace of $h^\sharp$ is thus a covariant $(k-2)$-tensor field. All in all, we define the trace of $h$ w.r.t. $g$ as

\[\text{tr}_g \, h := \text{tr}(h^\sharp) \, .\]

In coordinates, it is

\[\text{tr}_g \, h = {h_i}^i = g^{ij} h_{ij} \, ,\]

which, in an orthonormal frame, is given by the ordinary trace of the matrix $(h_{ij})$.

We now define the Ricci curvature or Ricci tensor $Rc$ which is the covariant 2-tensor field defined as follows:

\[Rc(X, Y) := \text{tr}(Z \mapsto R(Z, X)Y) \, ,\]

for any vector fields $X, Y$. In local coordinates, its components are

\[R_{ij} := {R_{kij}}^k = g^{km} \, R_{kijm} \, .\]

We can simplify it further: we define the scalar curvature to be the function $S$ given by the trace of the Ricci tensor:

\[S := \text{tr}_g \, Rc = {R_i}^i = g^{ij} \, R_{ij} \, .\]

Thus the scalar curvature is a scalar field on $\M$.
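As a concrete sketch, consider the round $2$-sphere of radius $\rho$ (written $\rho$ here to avoid a clash with the curvature symbols) in spherical coordinates $(\theta, \phi)$, with metric $g = \rho^2 \, d\theta \otimes d\theta + \rho^2 \sin^2\theta \, d\phi \otimes d\phi$. We quote the standard result, without deriving it here, that its Ricci tensor is $Rc = \rho^{-2} g$, i.e. $R_{\theta\theta} = 1$, $R_{\phi\phi} = \sin^2\theta$, and $R_{\theta\phi} = 0$. Taking the trace with respect to $g$ gives

\[S = g^{ij} \, R_{ij} = \frac{1}{\rho^2} \cdot 1 + \frac{1}{\rho^2 \sin^2\theta} \cdot \sin^2\theta = \frac{2}{\rho^2} \, .\]

So the scalar curvature is constant and shrinks as the sphere gets larger, matching the intuition that a big sphere looks nearly flat.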

Submanifolds

Let $\M$ be a smooth manifold. An embedded or regular submanifold of $\M$ is a subset $\mathcal{S} \subset \M$ that is a manifold in the subspace topology, endowed with a smooth structure w.r.t. which the inclusion map $\mathcal{S} \hookrightarrow \M$ is a smooth embedding. We call the difference $\text{dim} \, \M - \text{dim} \, \mathcal{S}$ the codimension of $\mathcal{S}$ in $\M$, and $\M$ the ambient manifold. An embedded hypersurface is an embedded submanifold of codimension 1.

Example 13 (Graphs as submanifolds). Suppose $\M$ is a smooth $m$-manifold, $\mathcal{N}$ is a smooth $n$-manifold, $U \subset \M$ is open, and $f: U \to \mathcal{N}$ is a smooth map. Let $\Gamma(f) \subseteq \M \times \mathcal{N}$ denote the graph of $f$, i.e.

\[\Gamma(f) := \{ (x, y) \in \M \times \mathcal{N} : x \in U, y = f(x) \} \, .\]

Then $\Gamma(f)$ is an embedded $m$-submanifold of $\M \times \mathcal{N}$.

Furthermore, if $f: \M \to \mathcal{N}$ is a smooth map (notice that we have defined $f$ globally here), then $\Gamma(f)$ is properly embedded in $\M \times \mathcal{N}$, i.e. the inclusion map is a proper map.

Suppose $\M$ and $\N$ are smooth manifolds. Let $F: \M \to \N$ be a smooth map and $p \in \M$. We define the rank of $F$ at $p$ to be the rank of the linear map $dF_p: T_p\M \to T_{F(p)}\N$, i.e. the rank of the Jacobian matrix of $F$ in coordinates. If $F$ has the same rank $r$ at every point, we say that it has constant rank, written $\rank{F} = r$. Note that the rank is bounded by $\min \{ \dim{\M}, \dim{\N} \}$, and if it is equal to this bound, we say $F$ has full rank at $p$.

A smooth map $F: \M \to \N$ is called a smooth submersion if $dF$ is surjective at each point ($\rank{F} = \dim{\N}$). It is called a smooth immersion if $dF$ is injective at each point ($\rank{F} = \dim{\M}$).

Example 14 (Submersions and immersions).

Suppose $\M_1, \dots, \M_k$ are smooth manifolds. Then each projection map $\pi_i: \M_1 \times \dots \times \M_k \to \M_i$ is a smooth submersion. In particular, $\pi: \R^{n+k} \to \R^n$ is a smooth submersion.
If $\gamma: I \to \M$ is a smooth curve in a smooth manifold $\M$, then $\gamma$ is a smooth immersion if and only if $\gamma'(t) \neq 0$ for all $t \in I$.

Suppose $\M$ and $\N$ are smooth manifolds. A diffeomorphism from $\M$ to $\N$ is a smooth bijective map $F: \M \to \N$ that has a smooth inverse; if such a map exists, $\M$ and $\N$ are said to be diffeomorphic. $F$ is called a local diffeomorphism if every point $p \in \M$ has a neighborhood $U$ such that $F(U)$ is open in $\N$ and $F\vert_U: U \to F(U)$ is a diffeomorphism. We can show that $F$ is a local diffeomorphism if and only if it is both a smooth immersion and a smooth submersion. Furthermore, if $\dim{\M} = \dim{\N}$ and $F$ is either a smooth immersion or a smooth submersion, then it is a local diffeomorphism.

The global rank theorem says that if $\M$ and $\N$ are smooth manifolds and $F: \M \to \N$ is a smooth map of constant rank, then it is (a) a smooth submersion if it is surjective, (b) a smooth immersion if it is injective, and (c) a diffeomorphism if it is bijective.

If $\M$ and $\N$ are smooth manifolds, a smooth embedding of $\M$ into $\N$ is a smooth immersion $F: \M \to \N$ that is also a topological embedding (homeomorphism onto its image in the subspace topology).

Example 15 (Smooth embeddings). If $\M$ is a smooth manifold and $U \subseteq \M$ is an open submanifold, the inclusion $U \hookrightarrow \M$ is a smooth embedding.

Let $F: \M \to \N$ be an injective smooth immersion. If any of the following conditions holds, then $F$ is a smooth embedding: (a) $F$ is an open or closed map, (b) $F$ is a proper map, (c) $\M$ is compact, or (d) $\M$ has empty boundary and $\dim{\M} = \dim{\N}$.

The second fundamental form

Let $(\M, g)$ be a Riemannian submanifold of a Riemannian manifold $(\tilde{\M}, \tilde{g})$. Then, $g$ is the induced metric $g = \iota_\M^* \tilde{g}$, where $\iota_\M: \M \hookrightarrow \tilde{\M}$ is the inclusion map. Note that, the expression $\iota^*_\M \tilde{g}$ is called the pullback metric or the induced metric of $\tilde{g}$ by $\iota_\M$ and is defined by

\[\iota_\M^* \tilde{g}(u, v) := \tilde{g}(d\iota_\M(u), d\iota_\M(v)) \, ,\]

for any $u, v \in T_p \M$. Also, recall that $d\iota_\M$ is the pushforward (tangent map) by $\iota_\M$. Intuitively, we map the tangent vectors $u, v$ of $T_p \M$ to some tangent vectors of $T_{\iota_\M(p)} \tilde{\M}$ and use $\tilde{g}$ as the metric.
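The pullback formula can be evaluated directly once the pushforwards of the coordinate basis vectors are known. As a minimal sketch, assume the sphere of radius $R$ parametrized by spherical angles $(\theta, \phi)$ and embedded in $\R^3$ with the Euclidean metric; the function names are illustrative.

```python
import math

def sphere_pushforward(theta, phi, R=1.0):
    """Images in R^3 of d/dtheta and d/dphi under the inclusion of the sphere."""
    d_theta = [R * math.cos(theta) * math.cos(phi),
               R * math.cos(theta) * math.sin(phi),
               -R * math.sin(theta)]
    d_phi = [-R * math.sin(theta) * math.sin(phi),
             R * math.sin(theta) * math.cos(phi),
             0.0]
    return [d_theta, d_phi]

def pullback_metric(basis_images):
    """g_ij = <d iota(e_i), d iota(e_j)> using the Euclidean inner product."""
    return [[sum(a * b for a, b in zip(u, v)) for v in basis_images]
            for u in basis_images]

R, theta, phi = 2.0, 1.0, 0.4
g = pullback_metric(sphere_pushforward(theta, phi, R))
```

The result is the familiar round metric $\mathrm{diag}(R^2, R^2 \sin^2\theta)$, obtained purely from the ambient Euclidean inner product.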

In this section, we will denote any geometric object of the ambient manifold with tilde, e.g. $\tilde{\nabla}, \tilde{Rm}$, etc. Note also that, we can use the inner product notation $\inner{u, v}$ to refer to $g$ or $\tilde{g}$, since $g$ is just the restriction of $\tilde{g}$ to pairs of tangent vectors in $T \M$.

We would like to compare the Levi-Civita connection of $\M$ with that of $\tilde{\M}$. First, we define orthogonal projection maps, called tangential and normal projections by

\[\begin{align} \pi^\top &: T \tilde{\M} \vert_\M \to T\M \\ \pi^\perp &: T \tilde{\M} \vert_\M \to N\M \, , \end{align}\]

where $N\M$ is the normal bundle of $\M$, i.e. the set of all vectors normal to $\M$. If $X$ is a section of $T\tilde{\M}\vert_\M$, we use the shorthand notations $X^\top = \pi^\top X$ and $X^\perp = \pi^\perp X$.

Given $X, Y \in \mathfrak{X}(\M)$, we can extend them to vector fields on an open subset of $\tilde{\M}$, apply the covariant derivative $\tilde{\nabla}$, and then decompose at $p \in \M$ to get

\[\tilde{\nabla}_X Y = (\tilde{\nabla}_X Y)^\top + (\tilde{\nabla}_X Y)^\perp \, .\]

Let $\Gamma(E)$ be the space of smooth sections of a bundle $E$. For the normal part, we define the second fundamental form of $\M$ to be the map $\two: \mathfrak{X}(\M) \times \mathfrak{X}(\M) \to \Gamma(N\M)$ defined by

\[\two(X, Y) = (\tilde{\nabla}_X Y)^\perp \, .\]

Meanwhile, we can show that the tangential part is the covariant derivative w.r.t. the Levi-Civita connection of the induced metric on $\M$. All in all, the above decomposition can be written as the Gauss formula:

\[\tilde{\nabla}_X Y = \nabla_X Y + \two(X, Y) \, .\]

The second fundamental form can also be used to evaluate extrinsic covariant derivatives of normal vector fields (instead of tangent ones above). For each normal vector field $N \in \Gamma(N\M)$, we define a scalar-valued symmetric bilinear form $\two_N: \mathfrak{X}(\M) \times \mathfrak{X}(\M) \to \R$ by

\[\two_N(X, Y) = \inner{N, \two(X, Y)} \, .\]

Let $W_N: \mathfrak{X}(\M) \to \mathfrak{X}(\M)$ denote the self-adjoint linear map associated with this bilinear form, characterized by

\[\inner{W_N(X), Y} = \two_N(X, Y) = \inner{N, \two(X, Y)} \, .\]

The map $W_N$ is called the Weingarten map in the direction of $N$. Furthermore we can show that the equation $(\tilde{\nabla}_X N)^\top = -W_N(X)$ holds and is called the Weingarten equation.

In addition to describing the difference between the intrinsic and extrinsic connections, the second fundamental form describes the difference between the curvature tensors of $\tilde{\M}$ and $\M$. The explicit formula is called the Gauss equation and is given by

\[\tilde{Rm}(W, X, Y, Z) = Rm(W, X, Y, Z) - \inner{\two(W, Z), \two(X, Y)} + \inner{\two(W, Y), \two(X, Z)} \, .\]

To give a geometric interpretation of the second fundamental form, we study the curvatures of curves. Let $\gamma: I \to \M$ be a smooth unit-speed curve. We define the curvature of $\gamma$ as the length of the acceleration vector field, i.e. the function $\kappa: I \to \R$ given by $\kappa(t) := \norm{D_t \gamma'(t)}$. We can see this curvature of the curve as a quantitative measure of how far the curve deviates from being a geodesic. Note that if $\M = \R^n$, the curvature agrees with the one defined in calculus.

Now, suppose that $\M$ is a submanifold in the ambient manifold $\tilde{\M}$. Every regular curve $\gamma: I \to \M$ has two distinct curvatures: its intrinsic curvature $\kappa$ as a curve in $\M$ and its extrinsic curvature $\tilde{\kappa}$ as a curve in $\tilde{\M}$. The second fundamental form can then be used to compute the relationship between the two: for $p \in \M$ and $v \in T_p \M$, (i) $\two(v, v)$ is the $\tilde{g}$-acceleration at $p$ of the $g$-geodesic $\gamma_v$, and (ii) if $v$ is a unit vector, then $\norm{\two(v, v)}$ is the $\tilde{g}$-curvature of $\gamma_v$ at $p$.

The intrinsic and extrinsic accelerations of a curve are usually different. A Riemannian submanifold $(\M, g)$ of $(\tilde{\M}, \tilde{g})$ is said to be totally geodesic if every $\tilde{g}$-geodesic that is tangent to $\M$ at some time $t_0$ stays in $\M$ for all $t \in (t_0 - \epsilon, t_0 + \epsilon)$, for some $\epsilon > 0$.

Riemannian hypersurfaces

We focus on the case when $(\M, g)$ is an embedded $n$-dimensional Riemannian submanifold of an $(n+1)$-dimensional Riemannian manifold $(\tilde{\M}, \tilde{g})$. That is, $\M$ is a hypersurface of $\tilde{\M}$.

In this situation, at each point of $\M$, there are exactly two unit normal vectors. We choose one of them at each point, in a smooth way, and call the resulting unit normal vector field $N$. We can then replace the vector-valued second fundamental form above by a simpler scalar-valued form. The scalar second fundamental form of $\M$ is the symmetric covariant $2$-tensor field $h = \two_N$, i.e.

\[h(X, Y) := \inner{N, \two(X, Y)} \enspace \enspace \enspace \text{for all } X, Y \in \mathfrak{X}(\M) \, .\]

By the Gauss formula $\tilde{\nabla}_X Y = \nabla_X Y + \two(X, Y)$ and noting that $\nabla_X Y$ is orthogonal to $N$, we can rewrite the definition as $h(X, Y) = \inner{N, \tilde{\nabla}_X Y}$. Furthermore, since $N$ is a unit vector spanning $N\M$, we can write $\two(X, Y) = h(X, Y)N$. Note that the sign of $h$ depends on the normal vector field chosen.

The choice of $N$ also determines a Weingarten map $W_N: \mathfrak{X}(\M) \to \mathfrak{X}(\M)$. In this special case of a hypersurface, we use the notation $s = W_N$ and call it the shape operator of $\M$. We can think of $s$ as the $(1, 1)$-tensor field on $\M$ obtained from $h$ by raising an index. It is characterized by

\[\inner{sX, Y} = h(X, Y) \enspace \enspace \enspace \text{for all } X, Y \in \mathfrak{X}(\M) \, .\]

As with $h$, the choice of $N$ determines the sign of $s$.

Note that at every $p \in \M$, $s$ is a self-adjoint linear endomorphism of the tangent space $T_p \M$. From linear algebra, we know that there is a unit vector $v_0 \in T_p \M$ at which $v \mapsto \inner{sv, v}$ achieves its maximum among all unit vectors. Every such vector is an eigenvector of $s$ with eigenvalue $\lambda_0 = \inner{s v_0, v_0}$. Furthermore, $T_p \M$ has an orthonormal basis $(b_1, \dots, b_n)$ formed by eigenvectors of $s$, and all of the eigenvalues $(\kappa_1, \dots, \kappa_n)$ are real. (Note that this means that for each $i$, $s b_i = \kappa_i b_i$.) In this basis, both $h$ and $s$ are represented by diagonal matrices.

The eigenvalues of $s$ at $p \in \M$ are called the principal curvatures of $\M$ at $p$, and the corresponding eigenvectors are called the principal directions. Note that the signs of the principal curvatures depend on the choice of $N$. Otherwise, both the principal curvatures and the principal directions are independent of the choice of coordinates.

From the principal curvatures, we can compute other quantities: The Gaussian curvature which is defined as $K := \text{det}(s)$ and the mean curvature $H := (1/n) \text{tr}(s)$. In other words, $K = \prod_i \kappa_i$ and $H = (1/n) \sum_i \kappa_i$, since $s$ can be represented by a symmetric matrix.

The Gaussian curvature, which is a local isometry invariant, is connected to a global topological invariant, the Euler characteristic, through the Gauss-Bonnet theorem. If $(\M, g)$ is a smoothly triangulated compact Riemannian 2-manifold, then

\[\int_\M K \, dA = 2 \pi \, \chi(\M) \, ,\]

where $dA$ is its Riemannian density.
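As a quick sanity check, take the round sphere $\mathbb{S}^2(R)$. Its Gaussian curvature is the constant $K = 1/R^2$ and its area is $4 \pi R^2$ (both standard facts, assumed here rather than derived), so

\[\int_{\mathbb{S}^2(R)} K \, dA = \frac{1}{R^2} \cdot 4 \pi R^2 = 4 \pi = 2 \pi \cdot 2 \, ,\]

which agrees with $\chi(\mathbb{S}^2) = 2$. Note that the radius cancels: rescaling the sphere changes $K$ pointwise but not its integral.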

Hypersurfaces of Euclidean space

Assume that $\M \subseteq \R^{n+1}$ is an embedded Riemannian $n$-submanifold (with the induced metric from the Euclidean metric). We denote geometric objects on $\R^{n+1}$ with bar, e.g. $\bar{g}$, $\overline{Rm}$, etc. Observe that $\overline{Rm} \equiv 0$, which implies that the Riemann curvature tensor of a hypersurface in $\R^{n+1}$ is completely determined by the second fundamental form.

In this setting we can give some very concrete geometric interpretations of quantities on hypersurfaces. First, for curves: for every $v \in T_p \M$, let $\gamma = \gamma_v : I \to \M$ be the $g$-geodesic in $\M$ with initial velocity $v$. The Gauss formula shows that the Euclidean acceleration of $\gamma$ at $0$ is $\gamma^{\prime\prime}(0) = \overline{D}_t \gamma'(0) = h(v, v)N_p$; thus, for a unit vector $v$, $\abs{h(v, v)}$ is the Euclidean curvature of $\gamma$ at $0$. Furthermore, $h(v,v) = \inner{\gamma^{\prime\prime}(0), N_p} > 0$ if and only if $\gamma^{\prime\prime}(0)$ points in the same direction as $N_p$. That is, $h(v, v)$ is positive if $\gamma$ is curving in the direction of $N_p$ and negative if it is curving away from $N_p$.

We can show that the above Euclidean curvature can be interpreted in terms of the radius of the “best circular approximation”, just as in calculus. Suppose $\gamma: I \to \R^m$ is a unit-speed curve, $t_0 \in I$, and $\kappa(t_0) \neq 0$. We define the osculating circle at $\gamma(t_0)$ to be the unique unit-speed parametrized circle $c: \R \to \R^m$ with the property that $c$ and $\gamma$ have the same position, velocity, and acceleration at $t=t_0$. Then, the Euclidean curvature of $\gamma$ at $t_0$ is $\kappa(t_0) = 1/R$, where $R$ is the radius of the osculating circle.

As mentioned before, to compute the curvature of a hypersurface in Euclidean space, we can compute the second fundamental form. Suppose $X: U \to \M$ is a smooth local parametrization of $\M$, $(X_1, \dots, X_n)$ is the local frame for $T \M$ determined by $X$, and $N$ is a unit normal field on $\M$. Then, the scalar second fundamental form is given by

\[h(X_i, X_j) = \innerbig{\frac{\partial^2 X}{\partial u^i \partial u^j}, N} \, .\]

This formula also shows how the principal curvatures give a concise description of the local shape of the hypersurface, by approximating the surface with the graph of a quadratic function. That is, we can show that there is an isometry $\phi: \R^{n+1} \to \R^{n+1}$ that takes $p \in \M$ to the origin and takes a neighborhood of it to a graph of the form $x^{n+1} = f(x^1, \dots, x^n)$, where

\[f(x) = \frac{1}{2} \sum_{i=1}^n\kappa_i (x^i)^2 + O(\abs{x}^3) \, .\]

We can write down a smooth vector field $N = N^i \partial_i$ on an open subset of $\R^{n+1}$ that restricts to a unit normal vector field along $\M$. Then, the shape operator can be computed straightforwardly using the Weingarten equation and observing that the Euclidean covariant derivatives of $N$ are just ordinary directional derivatives in Euclidean space. Thus, for every vector $X = X^j \partial_j$ tangent to $\M$, we have

\[sX = -\bar{\nabla}_X N = -\sum_{i,j=1}^{n+1} X^j (\partial_j N^i) \partial_i \, .\]

One common way to get such smooth vector field is to work with a local defining function $F$ for $\M$, i.e. a smooth scalar field defined on some open subset $U \subseteq \R^{n+1}$ s.t. $U \cap \M$ is a regular level set of $F$. Then, we can take

\[N = \frac{\grad{F}}{\norm{\grad{F}}} \, .\]

This works because the gradient is always normal to the level sets.

Example 16 (Shape operators of spheres). The function $F: \R^{n+1} \to \R$ with $F(x) := \norm{x}^2$ is a smooth defining function of any sphere $\mathbb{S}^{n}(R)$ of radius $R$ centered at the origin. Thus, the normalized gradient vector field

\[N = \frac{1}{R} \sum_{i=1}^{n+1} x^i \partial_i\]

is an (outward-pointing) unit normal vector field along $\mathbb{S}^n(R)$. The shape operator is

\[sX = -\frac{1}{R} \sum_{i,j=1}^{n+1} X^j (\partial_j x^i) \partial_i = -\frac{1}{R} X \, ,\]

where we recall that $\partial_j x^i = \partial x^i / \partial x^j = \delta_{ij}$. We can therefore write $s$ as a matrix $s = (-1/R) \mathbf{I}$ where $\mathbf{I}$ is the identity matrix. The principal curvatures are then all equal to $-1/R$, the mean curvature is $H = -1/R$, and the Gaussian curvature is $K = (-1/R)^n$. Note that these curvatures are constant. This reflects the fact that the sphere bends in exactly the same way at every point.

Lastly, for surfaces in $\R^3$, given a parametrization $X$, the unit normal vector field can be computed via the cross product:

\[N = \frac{X_1 \times X_2}{\norm{X_1 \times X_2}} \, ,\]

where $X_1 := \partial_1 X$ and $X_2 := \partial_2 X$, which together form a basis of the tangent space at each point on the surface.
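Putting the pieces together for a surface in $\R^3$: from a parametrization $X$ one can compute the induced metric $g_{ij} = \inner{X_i, X_j}$, the unit normal via the cross product, the scalar second fundamental form $h_{ij} = \inner{\partial_i \partial_j X, N}$, and then $K = \det(h) / \det(g)$ and $H = \frac{1}{2} \text{tr}(g^{-1} h)$, which are the determinant and half the trace of $s = g^{-1} h$ in the coordinate frame. The sketch below does this with central finite differences; the sphere parametrization, step size, and function names are illustrative assumptions.

```python
import math

def sphere(u, v, R=2.0):
    """Assumed example: sphere of radius R (u: polar angle, v: azimuth)."""
    return [R * math.sin(u) * math.cos(v),
            R * math.sin(u) * math.sin(v),
            R * math.cos(u)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def partials(X, u, v, e=1e-3):
    """Central-difference first and second partial derivatives of X at (u, v)."""
    c = X(u, v)
    up, um, vp, vm = X(u + e, v), X(u - e, v), X(u, v + e), X(u, v - e)
    pp, pm = X(u + e, v + e), X(u + e, v - e)
    mp, mm = X(u - e, v + e), X(u - e, v - e)
    X1 = [(a - b) / (2 * e) for a, b in zip(up, um)]
    X2 = [(a - b) / (2 * e) for a, b in zip(vp, vm)]
    X11 = [(a - 2 * m + b) / e**2 for a, m, b in zip(up, c, um)]
    X22 = [(a - 2 * m + b) / e**2 for a, m, b in zip(vp, c, vm)]
    X12 = [(a - b - p + q) / (4 * e**2) for a, b, p, q in zip(pp, pm, mp, mm)]
    return X1, X2, X11, X12, X22

def curvatures(X, u, v):
    """Return (K, H) at X(u, v), with N = X_1 x X_2 / |X_1 x X_2|."""
    X1, X2, X11, X12, X22 = partials(X, u, v)
    n = cross(X1, X2)
    length = math.sqrt(dot(n, n))
    N = [x / length for x in n]
    g11, g12, g22 = dot(X1, X1), dot(X1, X2), dot(X2, X2)
    h11, h12, h22 = dot(X11, N), dot(X12, N), dot(X22, N)
    det_g = g11 * g22 - g12**2
    K = (h11 * h22 - h12**2) / det_g                            # det(g^{-1} h)
    H = 0.5 * (g22 * h11 - 2 * g12 * h12 + g11 * h22) / det_g   # tr(g^{-1} h) / 2
    return K, H

K, H = curvatures(sphere, 1.0, 0.3)
```

For this parametrization the cross product points outward, so with $R = 2$ the values should be close to $K = 1/R^2 = 0.25$ and $H = -1/R = -0.5$, consistent with Example 16.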

Although the Gaussian curvature is defined in terms of a particular embedding of a submanifold in the Euclidean space (i.e. it is an extrinsic quantity), it is actually an intrinsic invariant of the submanifold. Gauss showed in his Theorema Egregium that in an embedded $2$-dimensional Riemannian submanifold $(\M, g)$ of $\R^3$, for every point $p \in \M$, the Gaussian curvature of $\M$ at $p$ is equal to one-half the scalar curvature of $g$ at $p$, and thus it is a local isometry invariant of $(\M, g)$.

Suppose $\M$ is a Riemannian $n$-manifold with $n \geq 2$, $p \in \M$, and $V \subset T_p \M$ is a star-shaped neighborhood of zero on which $\text{exp}_p$ is a diffeomorphism onto an open set $U \subset \M$. Let $\Pi$ be any $2$-dimensional linear subspace of $T_p \M$. Since $\Pi \cap V$ is an embedded $2$-dim submanifold of $V$, it follows that $\mathcal{S}_\Pi = \text{exp}_p(\Pi \cap V)$ is an embedded $2$-dim submanifold of $U \subset \M$ containing $p$, called the plane section determined by $\Pi$. We define the sectional curvature of $\Pi$, denoted by $\text{sec}(\Pi)$, to be the intrinsic Gaussian curvature at $p$ of the surface $\mathcal{S}_\Pi$ with the metric induced from the embedding $\mathcal{S}_\Pi \subseteq \M$. If $v, w \in T_p \M$ are linearly independent vectors, the sectional curvature of the plane they span is given by

\[\text{sec}(v, w) := \frac{Rm_p(v, w, w, v)}{\norm{v \wedge w}^2} \, ,\]

where

\[\norm{v \wedge w} := \sqrt{\norm{v}^2 \norm{w}^2 - \inner{v, w}^2} \, .\]

The sectional curvature is related to the Ricci and scalar curvatures as follows. For any unit vector $v \in T_p \M$, $Rc_p(v, v)$ is the sum of the sectional curvatures of the $2$-planes spanned by $(v, b_2), \dots, (v, b_n)$, where $(b_1, \dots, b_n)$ is any orthonormal basis for $T_p \M$ with $b_1 = v$. Furthermore, the scalar curvature at $p$ is the sum of all sectional curvatures of the $2$-planes spanned by ordered pairs of distinct basis vectors in any orthonormal basis.
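To make these formulas concrete, here is a minimal sketch for the round sphere of radius $R$. It assumes (this is not derived in the text) that the curvature tensor of the round sphere is $Rm(w, x, y, z) = \frac{1}{R^2} \left( \inner{w, z} \inner{x, y} - \inner{w, y} \inner{x, z} \right)$, and it identifies $T_p \M$ with $\R^n$ via an orthonormal basis. All names are illustrative.

```python
import math

R = 2.0   # radius of the round sphere
n = 3     # dimension of the sphere; T_p M is identified with R^n

def inner(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

def Rm(w, x, y, z):
    """Assumed curvature tensor of the round sphere of radius R."""
    return (inner(w, z) * inner(x, y) - inner(w, y) * inner(x, z)) / R**2

def wedge_norm(v, w):
    """||v ^ w|| = sqrt(||v||^2 ||w||^2 - <v, w>^2)."""
    return math.sqrt(inner(v, v) * inner(w, w) - inner(v, w) ** 2)

def sec(v, w):
    """Sectional curvature of the plane spanned by v and w."""
    return Rm(v, w, w, v) / wedge_norm(v, w) ** 2

# Orthonormal basis b_1, ..., b_n of the tangent space.
basis = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Any two linearly independent vectors give the same value 1/R^2.
sec_vw = sec([1.0, 2.0, 0.5], [-0.3, 1.0, 4.0])

# Rc(b_1, b_1) as the sum of sec(b_1, b_k) for k = 2, ..., n.
ricci_b1 = sum(sec(basis[0], basis[k]) for k in range(1, n))

# Scalar curvature as the sum over ordered pairs of distinct basis vectors.
scalar = sum(sec(basis[j], basis[k])
             for j in range(n) for k in range(n) if j != k)
```

Under this assumption the sketch yields $\text{sec} = 1/R^2$, $Rc(b_1, b_1) = (n-1)/R^2$, and scalar curvature $n(n-1)/R^2$. For $n = 2$ the sectional curvature matches the Gaussian curvature $K = (-1/R)^2$ from Example 16.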

Lie groups

A Lie group is a smooth manifold $\G$ that is also a group in the algebraic sense, with the property that the multiplication map $m: \G \times \G \to \G$ and inversion map $i: \G \to \G$, given by

\[m(g, h) := gh \, , \qquad i(g) := g^{-1} \, ,\]

are both smooth. We denote the identity element of $\G$ by $e$.

Example 17 (Lie groups). The following manifolds are Lie groups.

The general linear group $\GL(n, \R)$ is the set of invertible $n \times n$ matrices with real entries. It is a group under matrix multiplication and it is an open submanifold of the vector space $\text{M}(n, \R)$, the space of $n \times n$ matrices.
The real number field $\R$ and the Euclidean space $\R^n$ are Lie groups under addition.

If $\G$ and $\mathcal{H}$ are Lie groups, a Lie group homomorphism from $\G$ to $\mathcal{H}$ is a smooth map $F: \G \to \mathcal{H}$ that is also a group homomorphism. If $F$ is also a diffeomorphism, then it is a Lie group isomorphism, and we say that $\G$ and $\mathcal{H}$ are isomorphic Lie groups.
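A standard example of a Lie group homomorphism, not discussed above but easy to check, is the determinant $\det: \GL(n, \R) \to \GL(1, \R)$, where $\GL(1, \R)$ is the group of nonzero reals under multiplication. The sketch below verifies the homomorphism property numerically for $n = 2$; the helper names and the sample matrices are arbitrary choices for illustration.

```python
def matmul(A, B):
    """Matrix product, i.e. the group multiplication m(A, B) in GL(n, R)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def det2(A):
    """Determinant of a 2 x 2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inverse2(A):
    """Inverse of an invertible 2 x 2 matrix, i.e. the inversion map i(A)."""
    d = det2(A)
    return [[A[1][1] / d, -A[0][1] / d],
            [-A[1][0] / d, A[0][0] / d]]

A = [[2.0, 1.0], [0.5, 3.0]]    # det = 5.5, so A is in GL(2, R)
B = [[-1.0, 4.0], [2.0, 0.5]]   # det = -8.5, so B is in GL(2, R)

AB = matmul(A, B)
A_inv = inverse2(A)

# Homomorphism property: det(AB) = det(A) det(B) and det(A^{-1}) = 1 / det(A).
det_of_product = det2(AB)
product_of_dets = det2(A) * det2(B)
```

The determinant is a polynomial in the matrix entries, which is why it is smooth. It is not a Lie group isomorphism for $n \geq 2$, since it is not injective.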

If $G$ is a group and $M$ is a set, a left action of $G$ on $M$ is a map $G \times M \to M$ defined by $(g, p) \mapsto g \cdot p$ that satisfies

\[\begin{alignat}{2} g_1 \cdot (g_2 \cdot p) &= (g_1 g_2) \cdot p \qquad &&\text{for all } g_1, g_2 \in G, p \in M \, ; \\ e \cdot p &= p &&\text{for all } p \in M \, . \end{alignat}\]

Analogously, a right action is defined as a map $M \times G \to M$ satisfying

\[\begin{alignat}{2} (p \cdot g_1) \cdot g_2 &= p \cdot (g_1 g_2) \qquad &&\text{for all } g_1, g_2 \in G, p \in M \, ; \\ p \cdot e &= p &&\text{for all } p \in M \, . \end{alignat}\]

If $M$ is a smooth manifold, $G$ is a Lie group, and the defining map is smooth, then the action is said to be a smooth action.

We can also give a name to an action, e.g. $\theta: G \times M \to M$ with $(g, p) \mapsto \theta_g (p)$. In this notation, the above conditions for the left action read

\[\begin{align} \theta_{g_1} \circ \theta_{g_2} &= \theta_{g_1 g_2} \, , \\ \theta_e &= \Id_M \, , \end{align}\]

while for a right action the first equation is replaced by $\theta_{g_2} \circ \theta_{g_1} = \theta_{g_1 g_2}$. For a smooth action, each map $\theta_g : M \to M$ is a diffeomorphism.

For each $p \in M$, the orbit of $p$, denoted by $G \cdot p$, is the set of all images of $p$ under the action by elements of $G$:

\[G \cdot p := \{ g \cdot p : g \in G \} \, .\]

The isotropy group or stabilizer of $p$, denoted by $G_p$, is the set of elements of $G$ that fix $p$, which forms a subgroup of $G$:

\[G_p := \{ g \in G : g \cdot p = p \} \, .\]

A group action is said to be transitive if for every pair of points $p, q \in M$, there exists $g \in G$ such that $g \cdot p = q$, i.e. if the only orbit is all of $M$. The action is said to be free if the only element of $G$ that fixes any element of $M$ is the identity: $g \cdot p = p$ for some $p \in M$ implies $g = e$, i.e. if every isotropy group is trivial.

Example 18 (Lie group actions).

If $\G$ is a Lie group and $\M$ is a smooth manifold, the trivial action of $\G$ on $\M$ is defined by $g \cdot p = p$ for all $g \in \G$ and $p \in \M$.
The natural action of $\GL(n, \R)$ on $\R^n$ is the left action given by matrix multiplication $(\b{A}, \vx) \mapsto \b{A} \vx$.
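The natural action of $\GL(2, \R)$ on $\R^2$ gives a quick way to see the definitions above at work. The following sketch checks the two left-action axioms on sample data and looks at orbits and stabilizers; the helper names and the sample matrices are assumptions made for illustration.

```python
import math

def matvec(A, x):
    """Natural left action (A, x) -> A x of a matrix on a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix product, i.e. the group multiplication in GL(n, R)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rot(t):
    """Rotation of the plane by angle t, an element of GL(2, R)."""
    return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]

identity = [[1.0, 0.0], [0.0, 1.0]]
g1 = rot(0.4)
g2 = [[2.0, 1.0], [0.0, 1.0]]   # invertible, since its determinant is 2
p = [1.0, 2.0]

# Left-action axioms: g1 . (g2 . p) = (g1 g2) . p and e . p = p.
lhs = matvec(g1, matvec(g2, p))
rhs = matvec(matmul(g1, g2), p)
e_dot_p = matvec(identity, p)

# Every matrix fixes the origin, so its orbit is {0} and its isotropy
# group is all of GL(2, R): the action is neither transitive nor free.
origin_image = matvec(g2, [0.0, 0.0])

# Restricting to rotations, sampled points of the orbit of p all lie on
# the circle of radius ||p||.
orbit_sample = [matvec(rot(2.0 * math.pi * k / 8), p) for k in range(8)]
radii = [math.hypot(q[0], q[1]) for q in orbit_sample]
```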

Let $\G$ be a Lie group, $\M$ and $\N$ be smooth manifolds endowed with smooth left or right $\G$-actions. A map $F: \M \to \N$ is equivariant w.r.t. the given actions if for each $g \in \G$,

\[\begin{alignat}{2} F(g \cdot p) &= g \cdot F(p) \qquad &&\text{for left actions} \, , \\ F(p \cdot g) &= F(p) \cdot g &&\text{for right actions} \, . \end{alignat}\]

If $F: \M \to \N$ is a smooth map that is equivariant w.r.t. a transitive smooth $\G$-action on $\M$ and any smooth $\G$-action on $\N$, then $F$ has constant rank, meaning that its rank is the same for all $p \in \M$. Thus, if $F$ is surjective, it is a smooth submersion; if it is injective, it is a smooth immersion; and if it is bijective, it is a diffeomorphism.
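The equivariance condition for left actions can be checked numerically on a simple example. The sketch below uses the (hypothetical, chosen for illustration) map $F(x) = \norm{x}^2 x$ on $\R^2$ together with the action of rotation matrices on both the domain and the codomain.

```python
import math

def matvec(A, x):
    """Left action of a matrix on a vector by matrix multiplication."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rot(t):
    """Rotation of the plane by angle t."""
    return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]

def F(x):
    """Illustrative map F(x) = ||x||^2 x."""
    sq_norm = sum(xi * xi for xi in x)
    return [sq_norm * xi for xi in x]

g = rot(0.9)
p = [1.5, -0.5]

# Equivariance for left actions: F(g . p) = g . F(p).
lhs = F(matvec(g, p))
rhs = matvec(g, F(p))
```

The two sides agree because rotations preserve the norm. Note that this action is not transitive (the origin is an orbit on its own), so the constant-rank statement above does not apply here; indeed, the differential of this $F$ vanishes at the origin but is invertible everywhere else.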

Example 19 (The orthogonal group). A real $n \times n$ matrix $\b{A}$ is said to be orthogonal if it preserves the Euclidean dot product as a linear map:

\[(\b{A} \vx) \cdot (\b{A} \vy) = \vx \cdot \vy \qquad \text{for all} \, \vx, \vy \in \R^n \, .\]

The set of all orthogonal $n \times n$ matrices $\text{O}(n)$ is a subgroup of $\GL(n, \R)$, called the orthogonal group of degree $n$.
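Preserving the dot product is equivalent to the matrix condition $\b{A}^\top \b{A} = \b{I}$, which is easy to test. The following sketch checks it for a rotation, a reflection, and their product, and contrasts these with a shear, which is invertible but not orthogonal. The helper names and the sample matrices are illustrative assumptions.

```python
import math

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def is_orthogonal(A, tol=1e-12):
    """Check A^T A = I, which is equivalent to preserving the dot product."""
    n = len(A)
    AtA = matmul(transpose(A), A)
    return all(abs(AtA[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(n) for j in range(n))

t = 0.8
rotation = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
reflection = [[1.0, 0.0], [0.0, -1.0]]
shear = [[1.0, 1.0], [0.0, 1.0]]   # in GL(2, R) but not in O(2)

x, y = [1.0, 2.0], [-3.0, 0.5]
preserved = dot(matvec(rotation, x), matvec(rotation, y))   # matches dot(x, y) up to rounding
distorted = dot(matvec(shear, x), matvec(shear, y))         # differs from dot(x, y)
```

The product `matmul(rotation, reflection)` is again orthogonal, consistent with $\text{O}(n)$ being closed under multiplication.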

We would also like to study the theory of group representations, which asks whether all Lie groups can be realized as Lie subgroups of $\GL(n, \R)$ or $\GL(n, \C)$. If $\G$ is a Lie group, a representation of $\G$ is a Lie group homomorphism from $\G$ to $\GL(V)$ for some finite-dimensional vector space $V$. Note that $\GL(V)$ denotes the group of invertible linear transformations of $V$, which is a Lie group isomorphic to $\GL(n, \R)$ when $V$ is a real vector space of dimension $n$. If a representation is injective, it is said to be faithful.

There is a close connection between representations and group actions. An action of $\G$ on $V$ is said to be a linear action if for each $g \in \G$, the map $V \to V$ defined by $x \mapsto g \cdot x$ is linear.

Example 20 (Linear action). If $\rho: \G \to \GL(V)$ is a representation of $\G$, there is an associated smooth linear action of $\G$ on $V$ given by $g \cdot x = \rho(g) x$. In fact, every smooth linear action of $\G$ on $V$ arises in this way from some representation.
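As a concrete instance of Example 20, consider the Lie group $(\R, +)$ and the map $\rho(t)$ sending $t$ to the rotation of $\R^2$ by angle $t$. This particular choice of $\rho$ is an assumption made for illustration. The sketch below checks that $\rho$ is a homomorphism and that the associated action $t \cdot x = \rho(t) x$ is linear in $x$.

```python
import math

def rho(t):
    """Illustrative representation of (R, +) on R^2: rotation by angle t."""
    return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def act(t, x):
    """Associated linear action t . x = rho(t) x."""
    return matvec(rho(t), x)

s, t = 0.3, 1.1
x, y = [1.0, -2.0], [0.5, 4.0]
a, b = 2.0, -3.0

# Homomorphism property: rho(s + t) = rho(s) rho(t).
hom_lhs = rho(s + t)
hom_rhs = matmul(rho(s), rho(t))

# Linearity of the action: t . (a x + b y) = a (t . x) + b (t . y).
combo = [a * xi + b * yi for xi, yi in zip(x, y)]
lin_lhs = act(t, combo)
lin_rhs = [a * u + b * v for u, v in zip(act(t, x), act(t, y))]
```

This representation is not faithful, since $\rho(2\pi) = \rho(0)$ is the identity matrix.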

References

Lee, John M. “Smooth manifolds.” Introduction to Smooth Manifolds. Springer, New York, NY, 2013. 1-31.
Lee, John M. Riemannian manifolds: an introduction to curvature. Vol. 176. Springer Science & Business Media, 2006.
Fels, Mark Eric. “An Introduction to Differential Geometry through Computation.” (2016).
Absil, P-A., Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
Boumal, Nicolas. Optimization and estimation on manifolds. Diss. Catholic University of Louvain, Louvain-la-Neuve, Belgium, 2014.
Graphics: https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane.