Agustinus Kristiadi's Blog
https://agustinus.kristia.de/
Sun, 21 Nov 2021 12:22:52 +0100Sun, 21 Nov 2021 12:22:52 +0100Jekyll v3.9.0Modern Arts of Laplace Approximations<p>Let $f: X \times \Theta \to Y$ defined by $(x, \theta) \mapsto f_\theta(x)$ be a neural network, where $X \subseteq \R^n$, $\Theta \subseteq \R^d$, and $Y \subseteq \R^c$ be the input, parameter, and output spaces, respectively.
Given a dataset $\D := \{ (x_i, y_i) : x_i \in X, y_i \in Y \}_{i=1}^m$, we define the likelihood $p(\D \mid \theta) := \prod_{i=1}^m p(y_i \mid f_\theta(x_i))$.
Then, given a prior $p(\theta)$, we can obtain the posterior via an application of Bayes’ rule: $p(\theta \mid \D) = 1/Z \,\, p(\D \mid \theta) p(\theta)$.
But, the exact computation of $p(\theta \mid \D)$ is intractable in general due to the need of computing the normalization constant</p>
\[Z = \int_\Theta p(\D \mid \theta) p(\theta) \,d\theta .\]
<p>We must then approximate $p(\theta \mid \D)$.
One simple way to do this is by simply finding one single likeliest point under the posterior, i.e. the mode of $p(\theta \mid \D)$.
This can be done via optimization, instead of integration:</p>
\[\theta_\map := \argmax_\theta \sum_{i=1}^m \log p(y_i \mid f_\theta(x_i)) + \log p(\theta) =: \argmax_\theta \L(\theta; \D) .\]
<p>The estimate $\theta_\map$ is referred to as the <em>maximum a posteriori</em> (MAP) estimate.
However, the MAP estimate does not capture the uncertainty around $\theta$.
Thus, it often (and in some cases, e.g. [1], almost always) leads to overconfidence.</p>
<p>In the context of Bayesian neural networks, the Laplace approximation (LA) is a family of methods for obtaining a Gaussian approximate posterior distribution of networks’ parameters.
The fact that it produces a Gaussian approximation is a step up from the MAP estimation: particularly, it conveys some notion of uncertainty in $\theta$.
LA stems from the early work of Pierre-Simon Laplace in 1774 [2] and it was first adapted for Bayesian neural networks (BNNs) by David MacKay in 1992 [3].
The method goes as follows.</p>
<p>Given the MAP estimate $\theta_\map$, let us Taylor-expand $\L$ around $\theta_\map$ up to the second-order:</p>
\[\L(\theta; \D) \approx \L(\theta_\map; \D) + \frac{1}{2} (\theta - \theta_\map)^\top \left(\nabla^2_\theta \L\vert_{\theta_\map}\right) (\theta - \theta_\map) .\]
<p>(Note that the gradient $\nabla_\theta \L$ is zero at $\theta_\map$ since $\theta_\map$ is a critical point of $\L$ and thus the first-order term in the above is also zero.)
Now, recall that $\L$ is the log-numerator of the posterior $p(\theta \mid \D)$.
Thus, the r.h.s. of the above can be used to approximate the true numerator, by simply exponentiating it:</p>
\[\begin{align*}
p(\D \mid \theta)p(\theta) &\approx \exp\left( \L(\theta_\map; \D) + \frac{1}{2} (\theta - \theta_\map)^\top \left(\nabla^2_\theta \L\vert_{\theta_\map}\right) (\theta - \theta_\map) \right) \\[5pt]
%
&= \exp(\L(\theta_\map; \D)) \exp\left(\frac{1}{2} (\theta - \theta_\map)^\top \left(\nabla^2_\theta \L\vert_{\theta_\map}\right) (\theta - \theta_\map) \right) .
\end{align*}\]
<p>For simplicity, let $\varSigma := -\left(\nabla^2_\theta \L\vert_{\theta_\map}\right)^{-1}$. Then, using this approximation, we can also obtain an approximation of $Z$:</p>
\[\begin{align*}
Z &\approx \exp(\L(\theta_\map; \D)) \int_\theta \exp\left(-\frac{1}{2} (\theta - \theta_\map)^\top \varSigma^{-1} (\theta - \theta_\map) \right) \,d\theta \\[5pt]
%
&= \exp(\L(\theta_\map; \D)) (2\pi)^{d/2} (\det \varSigma)^{1/2} ,
\end{align*}\]
<p>where the equality follows from the fact the integral above is the famous, tractable <a href="https://en.wikipedia.org/wiki/Gaussian_integral">Gaussian integral</a>.
Combining both approximations, we obtain</p>
\[\begin{align*}
p(\theta \mid \D) &\approx \frac{1}{(2\pi)^{d/2} (\det \varSigma)^{1/2}} \exp\left(-\frac{1}{2} (\theta - \theta_\map)^\top \varSigma^{-1} (\theta - \theta_\map) \right) \\[5pt]
%
&= \N(\theta \mid \theta_\map, \varSigma) .
\end{align*}\]
<p>That is, we obtain a tractable, easy-to-work-with Gaussian approximation to the intractable posterior via a simple second-order Taylor expansion!
Moreover, this is not just any Gaussian approximation: Notice that this Gaussian is fully determined once we have the MAP estimate $\theta_\map$.
Considering that the MAP estimation is <em>the</em> standard procedure for training NNs, the LA is nothing but a simple post-training step on top of it.
This means the LA, unlike other approximate inference methods, is a <em>post-hoc</em> method that can be applied to virtually any pre-trained NN, without the need of re-training!</p>
<p>Given this approximation, we can then use it as a proxy to the true posterior.
For instance, we can use it to obtain the predictive distribution</p>
\[\begin{align*}
p(y \mid x, \D) &\approx \int_\theta p(y \mid f_\theta(x)) \, \N(\theta \mid \theta_\map, \varSigma) \,d\theta \\
%
&\approx \frac{1}{s} \sum_{i=1}^s p(y \mid f_\theta(x)) \qquad \text{where} \enspace \theta_s \sim \N(\theta \mid \theta_\map, \varSigma) ,
\end{align*}\]
<p>which in general is less overconfident compared to the MAP-estimate-induced predictive distribution [3].</p>
<p>What we have seen is the most general framework of the LA.
One can make a specific design decision, such as by imposing a special structure to the Hessian $\nabla^2_\theta \L$, and thus the covariance $\varSigma$.</p>
<h2 class="section-heading">The <span style="font-family: monospace; font-size: 15pt">laplace-torch</span> library</h2>
<p>The simplicity of the LA is not without a drawback.
Recall that the parameter $\theta$ is in $\Theta \subseteq \R^d$.
In neural networks (NNs), $d$ is often in the order of millions or even billions.
Naively computing the Hessian $\nabla^2_\theta \L$ is thus often infeasible since it scales like $O(d^2)$.
Together with the fact that the LA is an old method (and thus not “trendy” in the (Bayesian) deep learning community), this might be the reason why the LA is not as popular as other BNN posterior approximation methods such as variational Bayes (VB) and Markov Chain Monte Carlo (MCMC).</p>
<p>Motivated by this observation, in our NeurIPS 2021 paper titled <a href="https://arxiv.org/abs/2106.14806">“Laplace Redux – Effortless Bayesian Deep Learning”</a>, we showcase that (i) the Hessian can be obtained cheaply, thanks to recent advances in second-order optimization, and (ii) even the simplest LA can be competitive to more sophisticated VB and MCMC methods, while only being much cheaper than them.
Of course, numbers alone are not sufficient to promote the goodness of the LA.
So, in that paper, we also propose an extendible, easy-to-use software library for PyTorch called <span style="font-family: monospace; font-size: 12pt">laplace-torch</span>, which is available at <a href="https://github.com/AlexImmer/Laplace">https://github.com/AlexImmer/Laplace</a>.</p>
<p>The <span style="font-family: monospace; font-size: 12pt">laplace-torch</span> is a simple library for, essentially, “turning standard NNs into BNNs”.
The main class of this library is the class <code class="language-plaintext highlighter-rouge">Laplace</code>, which can be used to transform a standard PyTorch model into a Laplace-approximated BNN.
Here is an example.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">laplace</span> <span class="kn">import</span> <span class="n">Laplace</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">load_pretrained_model</span><span class="p">()</span>
<span class="n">la</span> <span class="o">=</span> <span class="n">Laplace</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="s">'regression'</span><span class="p">)</span>
<span class="c1"># Compute the Hessian
</span>
<span class="n">la</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_loader</span><span class="p">)</span>
<span class="c1"># Hyperparameter tuning
</span>
<span class="n">la</span><span class="p">.</span><span class="n">optimize_prior_precision</span><span class="p">()</span>
<span class="c1"># Make prediction
</span>
<span class="n">pred_mean</span><span class="p">,</span> <span class="n">pred_var</span> <span class="o">=</span> <span class="n">la</span><span class="p">(</span><span class="n">x_test</span><span class="p">)</span>
</code></pre></div></div>
<p>The resulting object, <code class="language-plaintext highlighter-rouge">la</code> is a fully-functioning BNN, yielding the following prediction.
(Notice the identical regression curves—the LA essentially imbues MAP predictions with uncertainty estimates.)</p>
<p><img src="/img/2021-10-27-laplace/regression_example.png" alt="Regression" width="50%" /></p>
<p>Of course, <span style="font-family: monospace; font-size: 12pt">laplace-torch</span> is flexible: the <code class="language-plaintext highlighter-rouge">Laplace</code> class has almost all state-of-the-art features in Laplace approximations.
Those features, along with the corresponding options in <span style="font-family: monospace; font-size: 12pt">laplace-torch</span>, are summarized in the following flowchart.
(The options <code class="language-plaintext highlighter-rouge">'subnetwork'</code> for <code class="language-plaintext highlighter-rouge">subset_of_weights</code> and <code class="language-plaintext highlighter-rouge">'lowrank'</code> for <code class="language-plaintext highlighter-rouge">hessian_structure</code> are in the work, by the time this post is first published.)</p>
<p><img src="/img/2021-10-27-laplace/flowchart.png" alt="Laplace Flowchart" width="100%" /></p>
<p>The <span style="font-family: monospace; font-size: 12pt">laplace-torch</span> library uses a very cheap yet highly-performant flavor of LA by default, based on [4]:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">Laplace</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">likelihood</span><span class="p">,</span> <span class="n">subset_of_weights</span><span class="o">=</span><span class="s">'last_layer'</span><span class="p">,</span> <span class="n">hessian_structure</span><span class="o">=</span><span class="s">'kron'</span><span class="p">,</span> <span class="p">...)</span>
</code></pre></div></div>
<p>That is, by default the <code class="language-plaintext highlighter-rouge">Laplace</code> class will fit a last-layer Laplace with a Kronecker-factored Hessian for approximating the covariance.
Let us see how this default flavor of LA performs compared to the more sophisticated, recent (all-layer) Bayesian baselines in classification.</p>
<p><img src="/img/2021-10-27-laplace/classification.png" alt="Classification" width="100%" /></p>
<p>Here we can see that <code class="language-plaintext highlighter-rouge">Laplace</code>, with default options, improves the calibration (in terms of expected calibration error (ECE)) of the MAP model.
Moreover, it is guaranteed to preserve the accuracy of the MAP model—something that cannot be said for other baselines.
Ultimately, this improvement is cheap: <span style="font-family: monospace; font-size: 12pt">laplace-torch</span> only incurs little overhead relative to the MAP model—far cheaper than other Bayesian baselines.</p>
<h2 class="section-heading">Hyperparameter Tuning</h2>
<p>Hyperparameter tuning, especially for the prior variance/precision, is crucial in modern Laplace approximations for BNNs.
<span style="font-family: monospace; font-size: 12pt">laplace-torch</span> provides several options: (i) cross-validation and (ii) marginal-likelihood maximization (MLM, also known as empirical Bayes and type-II maximum likelihood).</p>
<p>Cross-validation is simple but needs a validation dataset.
In <span style="font-family: monospace; font-size: 12pt">laplace-torch</span>, this can be done via the following.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">la</span><span class="p">.</span><span class="n">optimize_prior_precision</span><span class="p">(</span><span class="n">method</span><span class="o">=</span><span class="s">'CV'</span><span class="p">,</span> <span class="n">val_loader</span><span class="o">=</span><span class="n">val_loader</span><span class="p">)</span>
</code></pre></div></div>
<p>A more sophisticated and interesting tuning method is MLM.
Recall that by taking the second-order Taylor expansion over the log-posterior, we obtain an approximate normalization constant $Z$ of the Gaussian approximate posterior.
This object is called the marginal likelihood: it is a probability over the dataset $\D$ and crucially, it is a function of the hyperparameter since the parameter $\theta$ is marginalized out.
Thus, we can find the best values for our hyperparameters by maximizing this function.</p>
<p>In <span style="font-family: monospace; font-size: 12pt">laplace-torch</span>, the marginal likelihood can be accessed via</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ml</span> <span class="o">=</span> <span class="n">la</span><span class="p">.</span><span class="n">log_marginal_likelihood</span><span class="p">(</span><span class="n">prior_precision</span><span class="p">)</span>
</code></pre></div></div>
<p>This function is compatible with PyTorch’s autograd, so we can backpropagate through it to obtain the gradient of $Z$ w.r.t. the prior precision hyperparameter:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ml</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span> <span class="c1"># Works!
</span></code></pre></div></div>
<p>Thus, MLM can easily be done in <span style="font-family: monospace; font-size: 12pt">laplace-torch</span>.
By extension, recent methods such as online MLM [5], can also easily be applied using <span style="font-family: monospace; font-size: 12pt">laplace-torch</span>.</p>
<h2 class="section-heading">Outlooks</h2>
<p>The <span style="font-family: monospace; font-size: 12pt">laplace-torch</span> library is continuously developed.
Support for more likelihood functions and priors, subnetwork Laplace, etc. are on the way.</p>
<p>In any case, we hope to see the revival of the LA in the Bayesian deep learning community.
So, please try out our library at <a href="https://github.com/AlexImmer/Laplace">https://github.com/AlexImmer/Laplace</a>!</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Hein, Matthias, Maksym Andriushchenko, and Julian Bitterwolf. “Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem.” CVPR 2019.</li>
<li>Laplace, Pierre Simon. “Mémoires de Mathématique et de Physique, Tome Sixieme” 1774.</li>
<li>MacKay, David JC. “The evidence framework applied to classification networks.” Neural computation 4.5 (1992).</li>
<li>Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. “Being Bayesian, even just a bit, fixes overconfidence in ReLU networks.” ICML 2020.</li>
<li>Immer, Alexander, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Mohammad Emtiyaz Khan. “Scalable marginal likelihood estimation for model selection in deep learning.” ICML, 2021.</li>
</ol>
Wed, 27 Oct 2021 00:00:00 +0200
https://agustinus.kristia.de/techblog/2021/10/27/laplace/
https://agustinus.kristia.de/techblog/2021/10/27/laplace/bayestechblogChentsov's Theorem<p>Let $p_\theta(x)$ be a probability density on $\R^n$, parametrized by $\theta \in \R^d$. The <strong><em>Fisher information</em></strong> is defined by</p>
\[\I_{ij}(\theta) := \E_{p_\theta(x)} \left( \partial_i \log p_\theta(x) \, \partial_j \log p_\theta(x) \right)\]
<p>where $\partial_i := \partial/\partial \theta^i$ for each $i = 1, \dots, d$. Note that $\I(\theta)$ is positive semi-definite because one can see it as the (expected) outer-product of the gradient of the log-density.</p>
<h2 class="section-heading">The Fisher Information under Sufficient Statistics</h2>
<p>Let $T : \R^n \to \R^n$ with $x \mapsto y$ be a bijective transformation of the r.v. $x \sim p_\theta(x)$. By the <a href="https://en.wikipedia.org/wiki/Sufficient_statistic#Fisher%E2%80%93Neyman_factorization_theorem">Fisher-Neyman factorization</a>, $T$ is a <strong><em>sufficient statistic</em></strong> for the parameter $\theta$ if there exist non-negative functions $g_\theta$ and $h$, where $g_\theta$ depends on $\theta$ while $h$ does not, such that we can write the density $p_\theta(x)$ as follows:</p>
\[p_\theta(x) = g_\theta(T(x)) h(x) .\]
<p>The following proposition shows the behavior of $\I$ under sufficient statistics.</p>
<p><strong>Proposition 1.</strong> <em>The Fisher information is invariant under sufficient statistics.</em></p>
<p><em>Proof.</em> Let $T$ be a sufficient statistic and so $p_\theta(x) := g_\theta(T(x)) h(x)$. Notice that this implies</p>
\[\partial_i \log g_\theta(T(x)) = \partial_i \log p_\theta(x) .\]
<p>So, the Fisher information $\I(\theta; T)$ under $T$ is</p>
\[\begin{align}
\I(\theta; T) &= \E \left( \partial_i \log (g_\theta(T(x)) h(x)) \, \partial_j \log (g_\theta(T(x)) h(x)) \right) \\
%
&= \E \left( \partial_i \log g_\theta(T(x)) \, \partial_j \log g_\theta(T(x)) \right) \\
%
&= \E \left( \partial_i \log p_\theta(x) \, \partial_j \log p_\theta(x) \right) \\
%
&= \I(\theta) .
\end{align}\]
<p>We conclude that $\I$ is invariant under sufficient statistics.</p>
<p class="right">\( \square \)</p>
<h2 class="section-heading">The Fisher Information as a Riemannian Metric</h2>
<p>Let</p>
\[M := \{ p_\theta(x) : \theta \in \R^d \}\]
<p>be the set of the parametric densities $p_\theta(x)$. We can treat $M$ as a smooth $d$-manifold by imposing a global coordinate chart $p_\theta(x) \mapsto \theta$. Thus, we can identify a point $p_\theta(x)$ on $M$ with its parameter $\theta$ interchangeably.</p>
<p>Let us assume that $\I$ is positive-definite everywhere, and each $\I_{ij}$ is smooth. Then we can use it as (the coordinates representation of) a Riemannian metric for $M$. This is because $\I$ is a covariant 2-tensor. (Recall the definition of a Riemannian metric.)</p>
<p><strong>Proposition 2.</strong> <em>The component functions $\I_{ij}$ of $\I$ follows the covariant transformation rule.</em></p>
<p><em>Proof.</em> Let $\theta \mapsto \varphi$ be a change of coordinates and let $\ell(\varphi) := \log p_\varphi(x)$. The component function $\I_{ij}(\theta)$ in the “old” coordinates is expressed in terms of the “new” ones, as follows:</p>
\[\begin{align}
\I_{ij}(\theta) &= \E \left( \frac{\partial \ell}{\partial \theta^i} \, \frac{\partial \ell}{\partial \theta^j} \right) \\
%
&= \E \left( \frac{\partial \ell}{\partial \varphi^i} \frac{\partial \varphi^i}{\partial \theta^i} \, \frac{\partial \ell}{\partial \varphi^j} \frac{\partial \varphi^j}{\partial \theta^j} \right) \\
%
&= \frac{\partial \varphi^i}{\partial \theta^i} \frac{\partial \varphi^j}{\partial \theta^j} \E \left( \frac{\partial \ell}{\partial \varphi^i} \, \frac{\partial \ell}{\partial \varphi^j} \right) \\
%
&= \frac{\partial \varphi^i}{\partial \theta^i} \frac{\partial \varphi^j}{\partial \theta^j} \I_{ij}(\varphi) ,
\end{align}\]
<p>where the second equality follows from the standard chain rule. We conclude that $\I$ is covariant since the Jacobian $\partial \varphi/\partial \theta$ of the transformation multiplies the “new” component functions $\I_{ij}(\varphi)$ of $\I$ to obtain the “old” ones.</p>
<p class="right">\( \square \)</p>
<h2 class="section-heading">Chentsov's Theorem</h2>
<p>The previous two results are useful since the Fisher information metric is invariant under sufficient statistics. In this sense, $\I$ has a statistical invariance property. But this is not a strong enough reason for arguing that $\I$ is a “natural” or “the best” metric for $M$.</p>
<p>Here, we shall see a stronger statement, due to Chentsov in 1972, about the Fisher metric: It is the <em>unique</em> statistically-invariant metric for $M$ (up to a scaling constant). This makes $\I$ stands out over any other metric for $M$.</p>
<p>Originally, Chentsov’s theorem is described on the space of Categorical probability distributions over the sample space $\Omega := \{ 1, \dots, n \}$, i.e. the probability simplex. We use the result of Campbell (1986) as a stepping stone. To do so, we need to define the so-called <em>Markov embeddings</em>.</p>
<p>Let $\{ A_1, \dots, A_m \}$ be a partition $\Omega$, where $2 \leq n \leq m$. We define a conditional probability table $Q$ of size $n \times m$ where</p>
\[\begin{align}
q_{ij} &= 0 \quad \text{if } j \not\in A_i \\
q_{ij} &> 0 \quad \text{if } j \in A_i \\
& {\textstyle\sum_{j=1}^m} q_{ij} = 1 .
\end{align}\]
<p>That is, the $i$-th row of $Q$ gives probabilities signifying the membership of each $j \in \Omega$ in $A_i$. Based on this, we define a map $f: \R^n_{> 0} \to \R^m_{>0}$ by</p>
\[y_j := \sum_{i=1}^n q_{ij} x^i \qquad \forall\enspace j = 1, \dots, m .\]
<p>We call this map a <strong><em>Markov embedding</em></strong>. The name suggests that $f$ embeds $\R^n_{> 0}$ in a higher-dimensional space $\R^m_{> 0}$.</p>
<p>The result of Campbell (1986) characterizes the form of the Riemannian metric in $\R^n_{>0}$ that is invariant under any Markov embedding.</p>
<p><strong>Lemma 3 (Campbell, 1986).</strong> <em>Let $g$ be a Riemannian metric on $\R^n_{>0}$ where $n \geq 2$. Suppose that every Markov embedding on $(\R^n_{>0}, g)$ is an isometry. Then</em></p>
\[g_{ij}(x) = A(\abs{x}) + \delta_{ij} \frac{\abs{x} B(\abs{x})}{x^i} ,\]
<p><em>where $\abs{x} = \sum_{i=1}^n x^i$, $\delta_{ij}$ is the Kronecker delta, and $A, B \in C^\infty(\R_{>0})$ satisfying $B > 0$ and $A + B > 0$.</em></p>
<p><em>Proof.</em> See Campbell (1986) and Amari (2016, Sec. 3.5).</p>
<p class="right">\( \square \)</p>
<p>Lemma 3 is a general statement about the invariant metric in $\R^n_{>0}$ and it does not say anything about sufficient statistics and probability distributions. To get the main result, we restrict ourselves to the $(n-1)$-<strong><em>probability simplex</em></strong> $\Delta^{n-1} \subset \R^n_{>0}$, which is the space of (Categorical) probability distribution.</p>
<p>The fact that the Fisher information is the unique invariant metric under sufficient statistics follows from the fact that when $n = m$, the Markov embedding reduces to a permutation of the components of $x \in \R^n_{>0}$—i.e. the permutation of $\Omega$. This is because permutations of $\Omega$ are sufficient statistics for Categorical distribution.</p>
<p>Let us, therefore, connect the result in Lemma 3 with the Fisher information on $\Delta^{n-1}$. We give the latter in the following lemma.</p>
<p><strong>Lemma 4.</strong> <em>The Fisher information of a Categorical distribution $p_\theta(z)$ where $z$ takes values in $\Omega = \{ 1, \dots, n \}$ and $\theta = \{ \theta^1, \dots, \theta^n \} \in \Delta^{n-1}$ is given by</em></p>
\[\I_{ij}(\theta) = \delta_{ij} \frac{1}{\theta^i} .\]
<p><em>That is, $\I(\theta)$ is an $(n \times n)$ diagonal matrix with $i$-th entry $1/\theta^i$.</em></p>
<p><em>Proof.</em> By definition,</p>
\[p_\theta(z) = \prod_{i=1}^n \left(\theta^i\right)^{(z^i)} ,\]
<p>where we assume that $z$ is one-hot encoded. Its score function is given by</p>
\[\partial_i \log p_\theta(x) = \partial_i \sum_{i=1}^n z^i \log \theta^i = \sum_{i=1}^n z^i \frac{1}{\theta^i} \delta_{ij} = \frac{z^i}{\theta^i} ,\]
<p>for each $i = 1, \dots n$. Hence, using the fact that $z$ is one-hot:</p>
\[\begin{align}
\I_{ii}(\theta) &= \E \left( \frac{z^i}{\theta^i} \, \frac{z^i}{\theta^i} \right) \\
%
&= \frac{1}{(\theta^i)^2} \sum_{i=1}^n (z^i)^2 \theta^i \\
%
&= \frac{1}{(\theta^i)^2} \theta^i \\
%
&= \frac{1}{\theta^i} .
\end{align}\]
<p>Using similar step, we can show that $\I_{ij}(\theta) = 0$ for $i \neq j$ because $z^i z^j$ is always zero.</p>
<p class="right">\( \square \)</p>
<p>Now we are ready to state the main result.</p>
<p><strong>Theorem 5 (Chentsov, 1972).</strong> <em>The Fisher information is the unique Riemannian metric on $\Delta^{n-1}$ that is invariant under sufficient statistics, up to a multiplicative constant.</em></p>
<p><em>Proof.</em> By Lemma 3, the invariant metric under Markov embeddings in $\R^n_{> 0}$ is given by</p>
\[g_{ij}(x) = A(\abs{x}) + \delta_{ij} \frac{\abs{x} B(\abs{x})}{x^i} ,\]
<p>for any $x \in \R^n_{> 0}$. Therefore, this is the form of the invariant metric under sufficient statistics in $\Delta^{n-1} \subset \R^n_{>0}$, i.e. when $n=m$ in the Markov embedding.</p>
<p>Let us therefore restrict $g$ to $\Delta^{n-1}$. For each $\theta \in \Delta^{n-1}$, the tangent space $T_\theta \Delta^{n-1}$ is orthogonal to the line $x^1 = x^2 = \dots = x^n$, which direction is given by the vector $\mathbf{1} = (1, \dots, 1) \in \R^n_{>0}$. This is a vector normal to $\Delta^{n-1}$, implying that any $v \in T_\theta \Delta^{n-1}$ satisfies $\inner{\mathbf{1}, v}_g = 0$, i.e. $\sum_{i=1}^n v^i = 0$.</p>
<p>Moreover, if $\theta \in \Delta^{n-1}$, then $\abs{\theta} = \sum_{i=1}^n \theta^i = 1$ by definition. Thus, $A(1)$ and $B(1)$ are constants. So, if $v, w \in T_\theta \Delta^{n-1}$, we have:</p>
\[\begin{align}
\inner{v, w}_{\theta} &= \sum_{i=1}^n \sum_{j=1}^n g_{ij} v^i w^j = A(1) \sum_{i = 1}^n \sum_{j = 1}^n v^i w^j + B(1) \sum_{i=1}^n \frac{v^i w^i}{\theta^i} \\
%
&= A(1) \underbrace{\left(\sum_{i = 1}^n v^i\right)}_{=0} \underbrace{\left(\sum_{j = 1}^n w^j\right)}_{=0} + B(1) \sum_{i=1}^n \frac{v^i w^i}{\theta^i} .
\end{align}\]
<p>Therefore $A(1)$ does not contribute to the inner product and we may, w.l.o.g., write the metric as a diagonal matrix:</p>
\[g_{ij}(\theta) = \delta_{ij} \frac{B(1)}{\theta^i} .\]
<p>Recalling that $B(1)$ is a constant, by Lemma 4, we have $g_{ij}(\theta) \propto \I_{ij}(\theta)$.</p>
<p class="right">\( \square \)</p>
<p>Generalizations to this (original) version Chentsov’s theorem exists. For instance, Ay et al. (2015) showed Chentsov’s theorem for arbitrary, parametric probability distributions. Dowty (2018) stated Chentsov’s theorem for exponential family distributions.</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Chentsov, N. N. “Statistical Decision Rules and Optimal Deductions.” (1972).</li>
<li>Campbell, L. Lorne. “An extended Čencov characterization of the information metric.” Proceedings of the American Mathematical Society 98, no. 1 (1986): 135-141.</li>
<li>Amari, Shun-ichi. Information geometry and its applications. Vol. 194. Springer, 2016.</li>
<li>Ay, Nihat, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer. “Information geometry and sufficient statistics.” Probability Theory and Related Fields 162, no. 1-2 (2015): 327-364.</li>
<li>Dowty, James G. “Chentsov’s theorem for exponential families.” Information Geometry 1, no. 1 (2018): 117-135.</li>
</ol>
Tue, 20 Jul 2021 00:00:00 +0200
https://agustinus.kristia.de/techblog/2021/07/20/chentsov-theorem/
https://agustinus.kristia.de/techblog/2021/07/20/chentsov-theorem/mathtechblogThe Curvature of the Manifold of Gaussian Distributions<p>The (univariate) Gaussian distribution is defined by the following p.d.f.:</p>
\[\N(x \mid \mu, \sigma) := \frac{1}{\sigma \sqrt{2 \pi}} \exp\left( - \frac{(x-\mu)^2}{2 \sigma^2} \right) .\]
<p>Let $M := \{ \N(x \mid \mu, \sigma) : (\mu, \sigma) \in \R \times \R_{> 0} \}$ be the set of all Gaussian p.d.f.s. We would like to treat this set as a smooth manifold and then, additionally, as a Riemannian manifold.</p>
<p>First, let’s define a coordinate chart for $M$. Let $\theta : M \to \R \times \R_{>0}$, defined by $\N(x \mid \mu, \sigma) \mapsto (\mu, \sigma)$ be such a chart. That is, the coordinate chart $\theta$ maps $M$ to the open Euclidean upper half-plane $\{ (x, y) : y > 0 \}$. Note that $\theta$ is a <em>global</em> chart since the Gaussian distribution is uniquely identified by its location and scale (i.e. its mean and standard-deviation). Thus, we can interchangeably write $p \in M$ or $\theta := (\mu, \sigma) \in \R \times \R_{>0}$ with a slight abuse of notation. From here, it is clear that $M$ is of dimension $2$ because $\theta$ gives a homeomorphism from $M$ to $\R \times \R_{>0} \simeq \R^2$.</p>
<p>Now let us equip the smooth manifold $M$ with a Riemannian metric, say $g$. The standard choice for $g$ for probability distributions is the Fisher information metric. I.e., in coordinates, it is defined by</p>
\[\begin{align}
g_{ij} &= g_{ij}(\theta) := \E_{\N(x \mid \mu, \sigma)} \left( \frac{\partial \log \N(x \mid \mu, \sigma)}{\partial \theta^i} \, \frac{\partial \log \N(x \mid \mu, \sigma)}{\partial \theta^j} \right) \\
%
&= -\E_{\N(x \mid \mu, \sigma)} \left( \frac{\partial^2 \log \N(x \mid \mu, \sigma)}{\partial \theta^i \, \partial \theta^j} \right) .
\end{align}\]
<p>In a matrix form, it is (see <a href="https://en.wikipedia.org/wiki/Normal_distribution">here</a>)</p>
\[G := (g_{ij}) = \begin{pmatrix}
\frac{1}{\sigma^2} & 0 \\
0 & \frac{2}{\sigma^2}
\end{pmatrix} .\]
<p>Its inverse, denoted by upper indices, $(g^{ij}) = G^{-1}$ is given by</p>
\[g^{ij} = \begin{pmatrix}
\sigma^2 & 0 \\
0 & \frac{\sigma^2}{2}
\end{pmatrix} .\]
<p>Note in particular that the matrix $G$ is positive definite for any $(\mu, \sigma)$ and thus gives a notion of inner product in the tangent bundle of $M$. Therefore, the tuple $(M, g)$ is a Riemannian manifold.</p>
<p>One more structure is needed for computing the curvature(s) of $M$. We need to equip $(M, g)$ with an affine connection. Here, we will use the Levi-Civita connection $\nabla$ of $g$.</p>
<p><strong>Note.</strong> <em>We will use the Einstein summation convention from now on. For example, $\Gamma^k_{ij} \Gamma^l_{km} = \sum_k \Gamma^k_{ij} \Gamma^l_{km}$.</em></p>
<h2 class="section-heading">Christoffel Symbols</h2>
<p>The first order of business is to determine the connection coefficients of $\nabla$—the Christoffel symbols of the second kind. In coordinates, it is represented by the $3$-dimensional array $(\Gamma^k_{ij}) \in \R^{2 \times 2 \times 2}$, and is given by the following formula</p>
\[\Gamma^k_{ij} := \frac{1}{2} g^{kl} \left( \frac{\partial g_{jl}}{\partial \theta^i} + \frac{\partial g_{il}}{\partial \theta^j} - \frac{\partial g_{ij}}{\partial \theta^l} \right) .\]
<p>Moreover, due to the symmetric property of the Levi-Civita connection, the lower indices of $\Gamma$ is symmetric, i.e. $\Gamma^k_{ij} = \Gamma^k_{ji}$ for all $i, j, k = 1, 2$.</p>
<p>Let us begin with $k = 1$. For $i,j = 1$, we have</p>
\[\begin{align}
\Gamma^1_{11} &= \frac{1}{2} g^{11} \left( \frac{\partial g_{11}}{\partial \theta^1} + \frac{\partial g_{11}}{\partial \theta^1} - \frac{\partial g_{11}}{\partial \theta^1} \right) + \frac{1}{2} \underbrace{g^{12}}_{=0} \left( \frac{\partial g_{12}}{\partial \theta^1} + \frac{\partial g_{12}}{\partial \theta^1} - \frac{\partial g_{11}}{\partial \theta^2} \right) \\
%
&= \frac{1}{2} \sigma^2 \frac{\partial}{\partial \mu} \left( \frac{1}{\sigma^2} \right) = 0 .
\end{align}\]
<p>Similarly, we have $\Gamma^1_{22} = 0$. For $\Gamma^1_{12} = \Gamma^1_{21}$, we have</p>
\[\require{cancel}
%
\begin{align}
\Gamma^1_{12} = \Gamma^1_{21} &= \frac{1}{2} g^{11} \left( \cancel{\frac{\partial g_{21}}{\partial \theta^1}} + \frac{\partial g_{11}}{\partial \theta^2} - \cancel{\frac{\partial g_{12}}{\partial \theta^1}} \right) + \frac{1}{2} \underbrace{g^{12}}_{=0} \dots \\
%
&= \frac{1}{2} \sigma^2 \frac{\partial}{\partial \sigma} \left( \frac{1}{\sigma^2} \right) \\
%
&= -\frac{1}{\sigma} .
\end{align}\]
<p>Note that in the above, we can immediately cross out partial derivatives that depend on $\theta^1 = \mu$ since we know that $g_{ij}$ does not depend on $\mu$ for all $i, j = 1, 2$. Meanwhile, we know immediately that the second term is zero because $g$ is diagonal—in particular $g^{ij} = 0$ for $i \neq j$.</p>
<p>Now, for $k=2$, we can easily show (the hardest part is to keep track the indices) that $\Gamma^2_{12} = \Gamma^2_{21} = \Gamma^2_{22} = 0$. Meanwhile,</p>
\[\begin{align}
\Gamma^2_{11} &= \frac{1}{2} \underbrace{g^{21}}_{0} \dots + \frac{1}{2} g^{22} \left( \cancel{\frac{\partial g_{12}}{\partial \theta^1}} + \cancel{\frac{\partial g_{12}}{\partial \theta^1}} - \frac{\partial g_{11}}{\partial \theta^2} \right) \\
%
&= -\frac{1}{2} \frac{\sigma^2}{2} \frac{\partial}{\partial \sigma} \left( \frac{1}{\sigma^2} \right) \\
%
&= \frac{1}{2\sigma} .
\end{align}\]
<p>So, all in all, $\Gamma$ is given by</p>
\[\Gamma^k = \begin{cases}
\begin{pmatrix}
0 & -\frac{1}{\sigma} \\
-\frac{1}{\sigma} & 0
\end{pmatrix} & \text{if } k = 1 \\[3pt]
%
\begin{pmatrix}
\frac{1}{2\sigma} & 0 \\
0 & 0
\end{pmatrix} & \text{if } k = 2 .
\end{cases}\]
<h2 class="section-heading">Sectional Curvature</h2>
<p>Now we are ready to compute the curvature of $M$. There are different notions of curvatures, e.g. the Riemann, Ricci curvature tensor, or the scalar curvature. In this post, we focus on the sectional curvature, which is a generalization of the Gaussian curvature in classical surface geometry (i.e. the study of embedded $2$-dimensional surfaces in $\R^3$).</p>
<p>Let $v, w$ in $T_pM$ be two basis vectors for $T_pM$. The formula of the sectional curvature $\text{sec}(v, w)$ under $v, w$ is as follows:</p>
\[\text{sec}(v, w) := \frac{Rm(v, w, w, v)}{\inner{v, v} \inner{w, w} - \inner{v, w}^2} ,\]
<p>where $Rm$ is the Riemann curvature tensor, and $\inner{\cdot, \cdot}$ denotes the inner product w.r.t. $g$. Note that $\text{sec}(v, w)$ is independent of the choice of $(v,w)$, i.e. given another pair of basis vectors $(v_0, w_0)$ of $T_pM$, we have that $\text{sec}(v_0, w_0) = \text{sec}(v, w)$.</p>
<p>The partial derivative operators $\frac{\partial}{\partial \theta^1} =: \partial_1$ and $\frac{\partial}{\partial \theta^2} =: \partial_2$ under the coordinates $\theta$ form a basis for $T_pM$. So, let us use them to compute the sectional curvature of $M$. In this case, the formula reads as</p>
\[\text{sec}(\partial_1, \partial_2) = \frac{Rm(\partial_1, \partial_2, \partial_2, \partial_1)}{\inner{\partial_1, \partial_1} \inner{\partial_2, \partial_2} - \inner{\partial_1, \partial_2}^2} .\]
<p>But the definition of $Rm$ implies that $Rm(\partial_1, \partial_2, \partial_2, \partial_1) = R_{1221}$, i.e. the element $1,2,2,1$ of the multidimensional array representation of $Rm$ in coordinates. Moreover, by definition, $g_{ij} = \inner{\partial_1, \partial_2}$. And so:</p>
\[\text{sec}(\partial_1, \partial_2) = \frac{R_{1221}}{g_{11} g_{22} - (g_{12})^2} = \frac{R_{1221}}{\det g} ,\]
<p>since $g$ is symmetric. Note that this is but the definition of the Gaussian curvature—indeed, in dimension $2$, the sectional and the Gaussian curvatures coincide.</p>
<p>We are now ready to compute $\text{sec}(\partial_1, \partial_2)$. The denominator is easy from our definition of $G$ at the beginning of this post:</p>
\[\det g = \frac{1}{\sigma^2} \frac{2}{\sigma^2} = \frac{2}{\sigma^4} .\]
<p>For the numerator, we can compute $R_{ijkl}$ via the metric and the Christoffel symbols:</p>
\[R_{ijkl} = g_{lm} \left( \frac{\partial \Gamma^m_{jk}}{\partial \theta^i} - \frac{\partial \Gamma^m_{ik}}{\partial \theta^j} + \Gamma^p_{jk} \Gamma^m_{ip} - \Gamma^p_{ik} \Gamma^m_{jp} \right) .\]
<p>So, we have</p>
\[\begin{align}
R_{1221} &= g_{1m} \left( \frac{\partial \Gamma^m_{22}}{\partial \theta^1} - \frac{\partial \Gamma^m_{12}}{\partial \theta^2} + \Gamma^p_{22} \Gamma^m_{1p} - \Gamma^p_{12} \Gamma^m_{2p} \right) \\
%
&= g_{1m} \left( \frac{\partial \Gamma^m_{22}}{\partial \mu} - \frac{\partial \Gamma^m_{12}}{\partial \sigma} + \left( \Gamma^1_{22} \Gamma^m_{11} + \Gamma^2_{22} \Gamma^m_{12} \right) - \left( \Gamma^1_{12} \Gamma^m_{21} + \Gamma^2_{12} \Gamma^m_{22} \right) \right) \\
%
&= g_{11} \left( \frac{\partial \Gamma^1_{22}}{\partial \mu} - \frac{\partial \Gamma^1_{12}}{\partial \sigma} + \left( \Gamma^1_{22} \Gamma^1_{11} + \Gamma^2_{22} \Gamma^1_{12} \right) - \left( \Gamma^1_{12} \Gamma^1_{21} + \Gamma^2_{12} \Gamma^1_{22} \right) \right) + \underbrace{g_{12}}_{=0} \dots .
\end{align}\]
<p>Now, we can cross out the partial derivative term w.r.t. $\mu$ since we know already that none of the $\Gamma^k_{ij}$ depend on $\mu$. Moreover, recall that the Christoffel symbols are given by $\Gamma^1_{12} = \Gamma^1_{21} = -\frac{1}{\sigma}$ and $\Gamma^2_{11} = \frac{1}{2\sigma}$, and $0$ otherwise. Hence,</p>
\[\begin{align}
R_{1221} &= g_{11} \left( - \frac{\partial \Gamma^1_{12}}{\partial \sigma} - \Gamma^1_{12} \Gamma^1_{21} \right) \\
%
&= \frac{1}{\sigma^2} \left( -\frac{\partial}{\partial \sigma} \left( -\frac{1}{\sigma} \right) - \left( -\frac{1}{\sigma} \right)^2 \right) \\
%
&= \frac{1}{\sigma^2} \left( -\frac{2}{\sigma^2} - \frac{1}{\sigma^2} \right) \\
%
&= -\frac{2}{\sigma^4} .
\end{align}\]
<p>Thus, the sectional curvature is given by</p>
\[\text{sec}(\partial_1, \partial_2) = \frac{-\frac{2}{\sigma^4}}{\frac{2}{\sigma^4}} = -1 .\]
<p>Note in particular that this sectional curvature does not depend on both $\mu$ and $\sigma$, i.e. it is constant. Hence, $M$ is a manifold of constant negative curvature. I.e., we can think of $M$ as a saddle surface.</p>
<h2 class="section-heading">Visualization</h2>
<p>Thanks to the amazing <a href="https://github.com/geomstats/geomstats"><code class="language-plaintext highlighter-rouge">geomstats</code></a> package, we can visualize $M$ in coordinates easily. The idea is by visualizing the contours of the distances from points in $\R \times \R_{>0}$ to $(0, 1)$, i.e. corresponding to $\N(x \mid 0, 1)$—the standard normal.</p>
<p><img src="/img/2021-06-21-manifold-gaussians/gaussians_geo.png" alt="Manifold of Gaussians" width="80%" /></p>
<p>Above, red points are the discretized steps of geodesics from $\N(x \mid 0, 1)$ to other Gaussians with different mean and variance. Indeed, geodesics of $M$ behave similarly like in the Poincaré half-space model—one of the poster children of the hyperbolic geometry.</p>
Mon, 21 Jun 2021 08:00:00 +0200
https://agustinus.kristia.de/techblog/2021/06/21/manifold-gaussians/
https://agustinus.kristia.de/techblog/2021/06/21/manifold-gaussians/mathtechblogHessian and Curvatures in Machine Learning: A Differential-Geometric View<p>In machine learning, especially in neural networks, the Hessian matrix is often treated synonymously with curvatures, in the following sense. Suppose $f: \R^n \times \R^d \to \R$ defined by $(x, \theta) \mapsto f(x; \theta) =: f_\theta(x)$ is a (real-valued) neural network, mapping an input $x$ to the output $f(x; \theta)$ under the parameter $\theta$. Given a dataset $\D$, we can define a loss function $\ell: \R^d \to \R$ by $\theta \mapsto \ell(\theta)$ such as the mean-squared-error or cross-entropy loss. (We do not explicitly show the dependency of $\ell$ to $f$ and $\D$ for brevity.) Assuming the standard basis for $\R^d$, from calculus we know that the second partial derivatives of $\ell$ at a point $\theta \in \R^d$ form a matrix called the Hessian matrix at $\theta$.</p>
<p>Often, one calls the Hessian matrix the “curvature matrix” of $L$ at $\theta$ [1, 2, etc.]. Indeed, it is well-justified since as we have learned in calculus, the eigenspectrum of this Hessian matrix represents the curvatures of the <em>loss landscape</em> of $\ell$ at $\theta$. It is, however, not clear from calculus alone what is the precise geometric meaning of these curvatures. In this post, we will use tools from differential geometry—especially the hypersurface theory—to study the geometric interpretation of the Hessian matrix.</p>
<h2 class="section-heading">Loss Landscapes as Hypersurfaces</h2>
<p>We begin by formalizing what exactly is a <em>loss landscape</em> via the Euclidean hypersurface theory. We call an $n$-dimensional manifold $M$ a <strong><em>(Euclidean) hypersurface</em></strong> of $\R^{n+1}$ if $M$ is a subset of $\R^{n+1}$ (equipped with the standard basis) and the inclusion $\iota: M \hookrightarrow \R^{n+1}$ is a smooth topological embedding. Since $\R^{n+1}$ is equipped with a metric in the form of the standard dot product, we can equip $M$ with an induced metric characterized at each point $p \in M$ by</p>
\[\langle v, w\rangle_p = (d\iota)_p(v) \cdot (d\iota)_p(w) ,\]
<p>for all tangent vectors $v, w \in T_pM$. Here, $\cdot$ represents the dot product and $(d\iota)_p: T_pM \to T_{\iota(p)}\R^{n+1} \simeq \R^{n+1}$ is the differential of $\iota$ at $p$ which is represented by the Jacobian matrix of $\iota$ at $p$. In matrix notation this is</p>
\[\inner{v, w}_p = (J_p v)^\top (J_p w) .\]
<p>Intuitively, the induced inner product on $M$ at $p$ is obtained by “pushing forward” tangent vectors $v$ and $w$ using the Jacobian $J_p$ at $p$ and compute their dot product on $\R^{n+1}$.</p>
<p><img src="/img/2020-11-01-hessian-curvatures/pushforward.png" alt="Pushforward" width="80%" /></p>
<p>Let $g: U \to \R$ is a smooth real-valued function over an open subset $U \subseteq \R^n$, then the <strong><em>graph</em></strong> of $g$ is the subset $M := \{ (u, g(u)) : u \in U \} \subseteq \R^{n+1}$ which is a hypersurface in $\R^{n+1}$. In this case, we can describe $M$ via the so-called <strong><em>graph parametrization</em></strong> which is a function $X: U \to \R^{n+1}$ defined by $X(u) := (u, g(u))$.</p>
<p>Coming back to our neural network setting, assuming that the loss $\ell$ is smooth, the graph $L := \{ (\theta, \ell(\theta)) : \theta \in \R^d \}$ is a Euclidean hypersurface of $\R^{d+1}$ with parametrization $Z: \R^d \to \R^{d+1}$ defined by $Z(\theta) := (\theta, \ell(\theta))$. Furthermore, the metric of $L$ is given by the Jacobian of the parametrization $Z$ and the standard dot product on $\R^{d+1}$, as before. Thus, the loss landscape of $\ell$ can indeed be amenable to geometric analysis.</p>
<p><img src="/img/2020-11-01-hessian-curvatures/graph_hypersurface.png" alt="Loss landscape" width="80%" /></p>
<h2 class="section-heading">The Second Fundamental Form and Shape Operator</h2>
<p>Consider vector fields $X$ and $Y$ on the hypersurface $L \subseteq \R^{d+1}$. We can view them as vector fields on $\R^{d+1}$ and thus the directional derivative $\nabla_X Y$ on $\R^{d+1}$ is well-defined at all points in $L$. That is, at every $p \in L$, $\nabla_X Y$ is a $(d+1)$-dimensional vector “rooted” at $p$. This vector can be decomposed as follows:</p>
\[\nabla_X Y = (\nabla_X Y)^\top + (\nabla_X Y)^\perp ,\]
<p>where $(\cdot)^\top$ and $(\cdot)^\perp$ are the orthogonal projection operators onto the tangent/normal space of $L$ at $p$. We define the <strong><em>second fundamental form</em></strong> as the map $\mathrm{II}$ that takes two vector fields on $L$ and yielding normal vector fields of $L$, as follows:</p>
\[\mathrm{II}(X,Y) := (\nabla_X Y)^\perp .\]
<p>See the following figure for an intuition.</p>
<p><img src="/img/2020-11-01-hessian-curvatures/II.png" alt="The second fundamental form" width="60%" /></p>
<p>Since $L$ is a $d$-dimensional hypersurface of $(d+1)$-dimensional Euclidean space, the normal space $N_pL$ at each point $p$ of $L$ has dimension one, and there exist only two ways of choosing a unit vector field normal to $L$. Any choice of the unit vector field thus automatically gives a basis for $N_pL$ for all $p \in L$. One of the choices is the following normal vector field which is oriented <em>outward</em> relative to $L$.</p>
<p><img src="/img/2020-11-01-hessian-curvatures/unit_normal.png" alt="Unit normal field" width="60%" /></p>
<p>Another choice is the same unit normal field but oriented <em>inward</em> relative to $L$.</p>
<p>Fix a unit normal field $N$. We can replace the vector-valued second fundamental form $\mathrm{II}$ with a simpler scalar-valued form. We define the <strong><em>scalar second fundamental form</em></strong> of $M$ to be</p>
\[h(X, Y) := \inner{N, \mathrm{II}(X,Y)} .\]
<p>Furthermore, we define the <strong><em>shape operator</em></strong> of $L$ as the map $s$, mapping a vector field to another vector field on $L$, characterized by</p>
\[\inner{s(X), Y} = h(X,Y) .\]
<p>Based on the characterization above, we can alternatively view $s$ as an operator obtained by raising an index of $h$, i.e. multiplying the matrix of $h$ with the inverse-metric.</p>
<p>Note that, at each point $p \in L$, the shape operator at $p$ is a linear endomorphism of $T_p L$, i.e. it defines a map from the tangent space to itself. Furthermore, we can show that $\mathrm{II}(X,Y) = \mathrm{II}(Y,X)$ and thus $h(X,Y)$ is symmetric. This implies that $s$ is self-adjoint since we can write</p>
\[\inner{s(X), Y} = h(X,Y) = h(Y,X) = \inner{s(Y), X} = \inner{X, s(Y)} .\]
<p>Altogether, this means that at each $p \in L$, the shape operator at $p$ can be represented by a symmetric $d \times d$ matrix.</p>
<h2 class="section-heading">Principal Curvatures</h2>
<p>The previous fact about the matrix of $s$ says that we can apply eigendecomposition on $s$ and obtain $n$ real eigenvalues $\kappa_1, \dots, \kappa_n$ and an orthonormal basis for $T_p L$ formed by the eigenvectors $(b_1, \dots, b_n)$ corresponding to these eigenvalues. We call these eigenvalues the <strong><em>principal curvatures</em></strong> of $L$ at $p$ and the corresponding eigenvectors the <strong><em>principal directions</em></strong>. Moreover, we also define the <strong><em>Gaussian curvature</em></strong> as $\det s = \prod_{i=1}^d \kappa_i$ and the <strong><em>mean curvature</em></strong> as $\frac{1}{d} \mathrm{tr}\,s = \frac{1}{d} \sum_{i=1}^d \kappa_i$.</p>
<p><img src="/img/2020-11-01-hessian-curvatures/curvature_plane_curv.png" alt="Unit normal field" width="60%" /></p>
<p>The intuition of the principal curvatures and directions in $\R^3$ is shown in the preceding figure. Suppose $M$ is a surface in $\R^3$. Choose a tangent vector $v \in T_pM$. Together with the choice of our unit normal vector $N_p$ at $p$, we obtain a plane $\varPi$ passing through $p$. The intersection of $\varPi$ and the neighborhood of $p$ in $M$ is a plane curve $\gamma \subseteq \varPi$ containing $p$. We can now compute the curvature of this curve at $p$ as usual, in the calculus sense (the reciprocal of the radius of the osculating circle at $p$). Then, the principal curvatures of $M$ at $p$ are the minimum and maximum curvatures obtained this way. The corresponding vectors in $T_p M$ that attain the minimum and maximum are the principal directions.</p>
<p>Principal and mean curvatures are not intrinsic to a hypersurface. There are two hypersurfaces that are isometric, but have different principal curvatures and hence different mean curvatures. Consider the following two surfaces.</p>
<p><img src="/img/2020-11-01-hessian-curvatures/principal_curvatures_extrinsic.png" alt="Unit normal field" width="80%" /></p>
<p>The first (left) surface is the plane described by the parametrization $(x,y) \mapsto \{ x, y, 0 \}$ for $0 < x < \pi$ and $0 < y < \pi$. The second one is the half cylinder $(x,y) \mapsto \{ x, y, \sqrt{1-y^2} \}$ for $0 < x < \pi$ and $\abs{y} < 1$. It is clear that they have different principal curvatures since the plane is a flat while the half-cylinder is “curvy”. Indeed, assuming a downward pointing normal, we can see that $\kappa_1 = \kappa_2 = 0$ for the plane and $\kappa_1 = 0, \kappa_2 = 1$ for the half-cylinder and thus their mean curvatures differ. However, they are actually isometric to each other—from the point of view of Riemannian geometry, they are the same. Thus, both principal and mean curvatures depend on the choice of the parametrization and not intrinsic.</p>
<p>Remarkably, the Gaussian curvature is intrinsic: All isometric hypersurfaces of dimension $\geq 2$ have the same Gaussian curvature (up to sign). Using the previous example: the plane and half-cylinder have the same Gaussian curvature of $0$. In 2D surfaces, this is a classic result which Gauss named <em>Theorema Egregium</em>. For hypersurfaces with dimension $> 2$, it can be shown that the Gaussian curvature is intrinsic up to sign [5, Ch. 7, Cor. 23].</p>
<h2 class="section-heading">The Loss Landscape's Hessian</h2>
<p>Now we are ready to draw a geometric connection between principal curvatures and the Hessian of $\ell$. Let $Z: \R^d \to \R^{d+1}$ be graph parametrization of the loss landscape $L$. The coordinates $(\theta^1, \dots, \theta^d) \in \R^d$ thus give local coordinates for $L$. The coordinate vector field $\partial/\partial \theta^1, \dots, \partial/\partial \theta^d$, push forward to vector fields $dZ(\partial/\partial \theta^1), \dots, dZ(\partial/\partial \theta^d)$ on $\R^{d+1}$, via the Jacobian of $Z$. At each $p \in L$, these vector fields form a basis for $T_p L$, viewed as a collection of $d$ vectors in $\R^{d+1}$.</p>
<p>If we think of $Z(\theta) = (Z^1(\theta), \dots, Z^{d+1}(\theta))$ as a vector-valued function of $\theta$, then by definition of Jacobian, these push-forwarded coordinate vector fields can be written for every $\theta \in \R^d$ as</p>
\[dZ_\theta \left( \frac{\partial}{\partial \theta^i} \right) = \frac{\partial Z}{\partial \theta^i} (\theta) =: \partial_i Z(\theta) ,\]
<p>for each $i = 1, \dots, d$.</p>
<p>Let us suppose we are given a unit normal field to $L$. Then we have the following result.</p>
<p><strong>Proposition 1.</strong> <em>Suppose $L \subseteq \R^{d+1}$ is the loss landscape of $\ell$, $Z: \R^d \to \R^{d+1}$ is the graph parametrization of $L$. Suppose further that $\partial_1 Z, \dots, \partial_d Z$ are the vector fields determined by $Z$ which restriction at each $p \in L$ is a basis for $T_pL$, and suppose $N$ is a unit normal field on $L$. Then the scalar second fundamental form is given by</em></p>
\[h(\partial_i Z, \partial_j Z) = \left\langle \frac{\partial^2 Z}{\partial \theta^i \partial \theta^d} , N \right\rangle = N^{d+1} \frac{\partial^2 \ell}{\partial \theta^i \partial \theta^j}.\]
<p><em>Where $N^{d+1}$ is the last component of the unit normal field.</em></p>
<p><em>Proof.</em> To show the first equality, one can refer to Proposition 8.23 in [1], which works for any parametrization and not just the graph parametrization. Now recall that $Z(\theta) = (\theta^1, \dots, \theta^d, \ell(\theta^1, \dots, \theta^d))$. Therefore for each $i = 1, \dots, d$:</p>
\[\frac{\partial Z}{\partial \theta^i} = \left( 0, \dots, 1, \dots, \frac{\partial \ell}{\partial \theta^i} \right) ,\]
<p>and thus</p>
\[\frac{\partial^2 Z}{\partial \theta^i \partial \theta^j} = \left( 0, \dots, 0, \frac{\partial^2 \ell}{\partial \theta^i \partial \theta^j} \right) .\]
<p>Taking the inner product with the unit normal field $N$, we obtain</p>
\[h(\partial_i Z, \partial_j Z) = 0 + \dots + 0 + N^{d+1} \frac{\partial^2 \ell}{\partial \theta^i \partial \theta^j} = N^{d+1} \frac{\partial^2 \ell}{\partial \theta^i \partial \theta^j} ,\]
<p>where $N^{d+1}$ is the $(d+1)$-st component function (it is a function $\R^{d+1} \to \R$) of the normal field $N$. At each $p \in L$, the matrix of $h$ is therefore $N^{d+1}(p)$ times the Hessian matrix of $\ell$ at $p$.</p>
<p class="right">\( \square \)</p>
<p>Finally, we show the connection between the principal curvatures with the scalar second fundamental form, and hence the principal curvatures with the Hessian. The following proposition says that at a critical point, the unit normal vector can be chosen as $(0, \dots, 0, 1)$ and thus the scalar second fundamental form coincides with the Hessian of $\ell$. Furthermore, by orthonormalizing the basis for the tangent space at that point, we can show that the matrix of the scalar second fundamental form in this case is exactly the matrix of the shape operator at $p$ and thus the Hessian encodes the principal curvatures at that point.</p>
<p><strong>Proposition 2.</strong> <em>Suppose $L \subseteq \R^{d+1}$ is a loss landscape with its graph parametrization and let $\theta_* \in \R^d$ be a critical point of $\ell$ and $p_* := (\theta_*^1, \dots, \theta_*^d, \ell(\theta_*)) \in L$. Then the matrix of the shape operator $s$ of $L$ at $p_*$ is equal to the Hessian matrix of $\ell$ at $\theta_*$.</em></p>
<p><em>Proof.</em> We can assume w.l.o.g. that the basis $(E_1, \dots, E_d)$ for $T_{p_*} L$ is orthonormal by applying the Gram-Schmidt algorithm on $d$ linearly independent tangent vectors in $T_{p_*} L$. Furthermore pick $(0, \dots, 0, 1) \in \R^{d+1}$ as the choice of the unit normal $N$ at $p_*$. We can do so since by hypothesis $p_*$ is a critical point and therefore $(0, \dots, 0, 1)$ is perpendicular to $T_{p_*} L$.</p>
<p>It follows by Proposition 1 that the matrix of the scalar second fundamental form $h$ of $L$ at $p_*$ is equal to the Hessian matrix of $\ell$ at $\theta_*$. Moreover, since we have an orthonormal basis for $T_{p_*} L$, the metric of $L$ at $p_*$ is represented by the $d \times d$ diagonal matrix. This implies that the matrix of the shape operator at $p_*$ is equal to the matrix of the second fundamental form and the claim follows directly.</p>
<p class="right">\( \square \)</p>
<p>As a side note, we can actually have a more general statement: At any point in a hypersurface with any parametrization, the principal curvatures give a concise description of the local shape of the hypersurface by approximating it with the graph of a quadratic function. See Prop. 8.24 in [3] for a detailed discussion.</p>
<h2 class="section-heading">Flatness and Generalization</h2>
<p>In deep learning, there have been interesting works connecting the “flatness” of the loss landscape’s local minima with the generalization performance of an NN. The conjecture is that the flatter a minimum is, the better the network generalizes. “Flatness” here often refers to the eigenvalues or trace of the Hessian matrix at the minima. However, this has been disputed by e.g. [4] and rightly so.</p>
<p>As we have seen previously, at a minimum, the principal and mean curvature (the eigenvalues and trace of the Hessian of $\ell$, resp.) are not intrinsic. Different parametrization of $L$ can yield different principal and mean curvatures. Just like the illustration with the plane and the half-cylinder above, [4] illustrates this directly in the loss landscape. In particular, we can apply a bijective transformation $\varphi$ to the original parameter space $\R^d$ s.t. the resulting loss landscape is isometric to the original loss landscape and the particular minimum $\theta_*$ does not change, i.e. $\varphi(\theta_*) = \theta_*$. See the following figure for an illustration (we assume that the length of the red curves below is the same).</p>
<p><img src="/img/2020-11-01-hessian-curvatures/reparametrization_curvatures.png" alt="Unit normal field" width="80%" /></p>
<p>It is clear that the principal curvature changes even though functionally, the NN still represents the same function. Thus, we cannot actually connect the notion of “flatness” that are common in literature to the generalization ability of the NN. A definitive connection between them must start with some intrinsic notion of flatness—for starter, the Gaussian curvature, which can be easily computed since it is just the determinant of the Hessian at the minima.</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Martens, James. “New Insights and Perspectives on the Natural Gradient Method.” arXiv preprint arXiv:1412.1193 (2014).</li>
<li>Dangel, Felix, Stefan Harmeling, and Philipp Hennig. “Modular Block-diagonal Curvature Approximations for Feedforward Architectures.” AISTATS. 2020.</li>
<li>Lee, John M. Riemannian manifolds: an introduction to curvature. Vol. 176. Springer Science & Business Media, 2006.</li>
<li>Dinh, Laurent, et al. “Sharp Minima can Generalize for Deep Nets.” ICML, 2017.</li>
<li>Spivak, Michael D. A comprehensive introduction to differential geometry. Publish or perish, 1970.</li>
</ol>
Mon, 02 Nov 2020 12:00:00 +0100
https://agustinus.kristia.de/techblog/2020/11/02/hessian-curvatures/
https://agustinus.kristia.de/techblog/2020/11/02/hessian-curvatures/mathtechblogOptimization and Gradient Descent on Riemannian Manifolds<p>Differential geometry can be seen as a generalization of calculus on Riemannian manifolds. Objects in calculus such as gradient, Jacobian, and Hessian on $\R^n$ are adapted on arbitrary Riemannian manifolds. This fact let us also generalize one of the most ubiquitous problem in calculus: the optimization problem. The implication of this generalization is far-reaching: We can make a more general and thus flexible assumption regarding the domain of our optimization, which might fit real-world problems better or has some desirable properties.</p>
<p>In this article, we will focus on the most popular optimization there is, esp. in machine learning: the gradient descent method. We will begin with a review on the optimization problem of a real-valued function on $\R^n$, which we should have been familiar with. Next, we will adapt the gradient descent method to make it work in optimization problem of a real-valued function on an arbitrary Riemannian manifold $(\M, g)$. Lastly, we will discuss how <a href="/techblog/2018/03/14/natural-gradient/">natural gradient descent</a> method can be seen from this perspective, instead of purely from the second-order optimization point-of-view.</p>
<h2 class="section-heading">Optimization problem and the gradient descent</h2>
<p>Let $\R^n$ be the usual Euclidean space (i.e. a Riemannian manifold $(\R^n, \bar{g})$ where $\bar{g}_{ij} = \delta_{ij}$) and let $f: \R^n \to \R$ be a real-valued function. An (unconstrained) optimization problem on this space has the form</p>
\[\min_{x \in \R^n} f(x) \, .\]
<p>That is we would like to find a point $\hat{x} \in \R^n$ such that $f(\hat{x})$ is the minimum of $f$.</p>
<p>One of the most popular numerical method for solving this problem is the gradient descent method. Its algorithm is as follows.</p>
<p><strong>Algorithm 1 (Euclidean gradient descent).</strong></p>
<ol>
<li>Pick arbitrary $x_{(0)} \in \R^n$ and let $\alpha \in \R$ with $\alpha > 0$</li>
<li>While the stopping criterion is not satisfied:
<ol>
<li>Compute the gradient of $f$ at $x_{(t)}$, i.e. $h_{(t)} := \gradat{f}{x_{(t)}}$</li>
<li>Move in the direction of $-h_{(t)}$, i.e. $x_{(t+1)} = x_{(t)} - \alpha h_{(t)}$</li>
<li>$t = t+1$</li>
</ol>
</li>
<li>Return $x_{(t)}$</li>
</ol>
<p class="right">//</p>
<p>The justification of the gradient descent method is because of the fact that the gradient is the direction in which $f$ is increasing fastest. Its negative therefore points to the direction of steepest descent.</p>
<p><strong>Proposition 1.</strong> <em>Let $f: \R^n \to \R$ be a real-valued function on $\R^n$ and $x \in \R^n$. Among all unit vector $v \in \R^n$, the gradient $\grad f \, \vert_x$ of $f$ at $x$ is the direction in which the directional derivative $D_v \, f \, \vert_x$ is greatest. Furthermore, $\norm{\gradat{f}{x}}$ equals to the value of the directional derivative in that direction.</em></p>
<p><em>Proof.</em> First, note that, by our assumption, $\norm{v} = 1$. By definition of the directional derivative and dot product on $\R^n$,</p>
\[\begin{align}
D_v \, f \, \vert_x &= \grad f \, \vert_x \cdot v \\
&= \norm{\gradat{f}{x}} \norm{v} \cos \theta \\
&= \norm{\gradat{f}{x}} \cos \theta \, ,
\end{align}\]
<p>where $\theta$ is the angle between $\gradat{f}{x}$ and $v$. As $\norm{\cdot} \geq 0$ and $-1 \leq \cos \theta \leq 1$, the above expression is maximized whenever $\cos \theta = 1$. This implies that the particular vector $\hat{v}$ that maximizes the directional derivative points in the same direction as $\gradat{f}{x}$. Furthermore, plugging in $\hat{v}$ into the above equation, we get</p>
\[D_{\hat{v}} \, f \, \vert_x = \norm{\gradat{f}{x}} \, .\]
<p>Thus, the length of $\gradat{f}{x}$ is equal to the value of $D_{\hat{v}} \, f \, \vert_x$.</p>
<p class="right">$\square$</p>
<h2 class="section-heading">Gradient descent on Riemannian manifolds</h2>
<p><strong>Remark.</strong> <em>These <a href="/techblog/2019/02/22/riemannian-geometry/">notes about Riemannian geometry</a> are useful as references. We shall use the Einstein summation convention: Repeated indices above and below are implied to be summed, e.g. $v^i w_i \implies \sum_i v^i w_i$ and $g_{ij} v^i v^j \implies \sum_{ij} g_{ij} v^i v^j$. By convention the index in $\partder{}{x^i}$ is thought to be a lower index.</em></p>
<p>We now want to break the confine of the Euclidean space. We would like to generalize the gradient descent algorithm on a function defined on a Riemannian manifold. Based on Algorithm 1, at least there are two parts of the algorithm that we need to adapt, namely, (i) the gradient of $f$ and (ii) the way we move between points on $\M$.</p>
<p>Suppose $(\M, g)$ is a $n$-dimensional Riemannian manifold. Let $f: \M \to R$ be a real-valued function (scalar field) defined on $\M$. Then, the optimization problem on $\M$ simply has the form</p>
\[\min_{p \in \M} f(p) \, .\]
<p>Although it seems innocent enough (we only replace $\R^n$ with $\M$ from the Euclidean version), some difficulties exist.</p>
<p>First, we shall discuss about the gradient of $f$ on $\M$. By definition, $\grad{f}$ is a vector field on $\M$, i.e. $\grad{f} \in \mathfrak{X}(\M)$ and at each $p \in \M$, $\gradat{f}{p}$ is a tangent vector in $T_p \M$. Let the differential $df$ of $f$ be a one one-form, which, in given coordinates $\vx_p := (x^1(p), \dots, x^n(p))$, has the form</p>
\[df = \partder{f}{x^i} dx^i \, .\]
<p>Then, the gradient of $f$ is obtained by raising an index of $df$. That is,</p>
\[\grad{f} = (df)^\sharp \, ,\]
<p>and in coordinates, it has the expression</p>
\[\grad{f} = g^{ij} \partder{f}{x^i} \partder{}{x^j} \, .\]
<p>At any $p \in \M$, given $v \in T_x \M$, it is characterized by the following equation:</p>
\[\inner{\gradat{f}{p}, v}_g = df(v) = vf \, .\]
<p>That is, pointwise, the inner product of the gradient and any tangent vector is the action of derivation $v$ on $f$. We can think of this action as taking directional derivative of $f$ in the direction $v$. Thus, we have the analogue of Proposition 1 on Riemannian manifolds.</p>
<p><strong>Proposition 2.</strong> <em>Let $(\M, g)$ be a Riemannian manifold and $f: \M \to \R$ be a real-valued function on $\M$ and $p \in \M$. Among all unit vector $v \in T_p \M$, the gradient $\gradat{f}{p}$ of $f$ at $p$ is the direction in which the directional derivative $vf$ is greatest. Furthermore, $\norm{\gradat{f}{p}}$ equals to the value of the directional derivative in that direction.</em></p>
<p><em>Proof.</em> We simply note that by definition of inner product induced by $g$, we have</p>
\[\inner{u, w}_g = \norm{u}_g \norm{w}_g \cos \theta \qquad \forall \, u, w \in T_p \M \, ,\]
<p>where $\theta$ is again the angle between $u$ and $w$. Using the characteristic of $\gradat{f}{p}$ we have discussed above and by substituting $vf$ for $D_v \, f \, \vert_p$ in the proof of Proposition 1, we immediately get the desired result.</p>
<p class="right">$\square$</p>
<p>Proposition 2 therefore provides us with a justification of just simply substituting the Euclidean gradient with the Riemannian gradient in Algorithm 1.</p>
<p>To make this concrete, we do the computation in coordinates. In coordinates, we can represent $df$ by a row vector $d$ (i.e. a sequence of numbers in the sense of linear algebra) containing all partial derivatives of $f$:</p>
\[d := \left( \partder{f}{x^1}, \dots, \partder{f}{x^n} \right) \, .\]
<p>Given the matrix representation $G$ of the metric tensor $g$ in coordinates, the gradient of $f$ is represented by a column vector $h$, such that</p>
\[h = G^{-1} d^\T \, .\]
<p><strong>Example 1. (Euclidean gradient in coordinates).</strong> Notice that in the Euclidean case, $\bar{g}_{ij} = \delta_{ij}$, thus it is represented by an identity matrix $I$, in coordinates. Therefore the Euclidean gradient is simply</p>
\[h = I^{-1} d^\T = d^\T \, .\]
<p class="right">//</p>
<p>The second modification to Algorithm 1 that we need to find the analogue of is the way we move between points on $\M$. Notice that, at each $x \in \R^n$, the way we move between points in the Euclidean gradient descent is by following a straight line in the direction $\gradat{f}{x}$. We know by triangle inequality that straight line is the path with shortest distance between points in $\R^n$.</p>
<p>On Riemannian manifolds, we move between points by the means of curves. There exist a special kind of curve $\gamma: I \to \M$, where $I$ is an interval, that are “straight” between two points on $\M$, in the sense that the covariant derivative $D_t \gamma’$ of the velocity vector along the curve itself, at any time $t$ is $0$. The intuition is as follows: Although if we look at $\gamma$ on $\M$ from the outsider’s point-of-view, it is not straight (i.e. it follows the curvature of $\M$), as far as the inhabitants of $\M$ are concerned, $\gamma$ is straight, as its velocity vector (its direction and length) is the same everywhere along $\gamma$. Thus, geodesics are the generalization of straight lines on Riemannian manifolds.</p>
<p>For any $p \in \M$ and $v \in T_p \M$, we can show that there always exists a geodesic starting at $p$ with initial velocity $v$, denoted by $\gamma_v$. Furthermore, if $c, t \in \R$ we can rescale any geodesic $\gamma_v$ by</p>
\[\gamma_{cv}(t) = \gamma_v (ct) \, ,\]
<p>and thus we can define a map $\exp_p(v): T_p \M \to \M$ by</p>
\[\exp_p(v) = \gamma_v(1) \, ,\]
<p>called the exponential map. The exponential map is the generalization of “moving straight in the direction $v$” on Riemannian manifolds.</p>
<p><strong>Example 2. (Exponential map on a sphere).</strong> Let $\mathbb{S}^n(r)$ be a sphere embedded in $\R^{n+1}$ with radius $r$. The shortest path between any pair of points on the sphere can be found by following the <a href="https://en.wikipedia.org/wiki/Great_circle">great circle</a> connecting them.</p>
<p>Let $p \in \mathbb{S}^n(r)$ and $0 \neq v \in T_p \mathbb{S}^n(r)$ be arbitrary. The curve $\gamma_v: \R \to \R^{n+1}$ given by</p>
\[\gamma_v(t) = \cos \left( \frac{t\norm{v}}{r} \right) p + \sin \left( \frac{t\norm{v}}{r} \right) r \frac{v}{\norm{v}} \, ,\]
<p>is a geodesic, as its image is the great circle formed by the intersection of $\mathbb{S}^n(r)$ with the linear subspace of $\R^{n+1}$ spanned by $\left\{ p, r \frac{v}{\norm{v}} \right\}$. Therefore the exponential map on $\mathbb{S}^n(r)$ is given by</p>
\[\exp_p(v) = \cos \left( \frac{\norm{v}}{r} \right) p + \sin \left( \frac{\norm{v}}{r} \right) r \frac{v}{\norm{v}} \, .\]
<p class="right">//</p>
<p>Given the exponential map, our modification to Algorithm 1 is complete, which we show in Algorithm 2. The new modifications from Algorithm 1 are in <span style="color:blue">blue</span>.</p>
<p><strong>Algorithm 2 (Riemannian gradient descent).</strong></p>
<ol>
<li>Pick arbitrary <span style="color:blue">$p_{(0)} \in \M$</span>. Let $\alpha \in \R$ with $\alpha > 0$</li>
<li>While the stopping criterion is not satisfied:
<ol>
<li>Compute the gradient of $f$ at $p_{(t)}$, i.e. <span style="color:blue">$h_{(t)} := \gradat{f}{p_{(t)}} = (df \, \vert_{p_{(t)}})^\sharp$</span></li>
<li>Move in the direction $-h_{(t)}$, i.e. <span style="color:blue">$p_{(t+1)} = \exp_{p_{(t)}}(-\alpha h_{(t)})$</span></li>
<li>$t = t+1$</li>
</ol>
</li>
<li>Return $p_{(t)}$</li>
</ol>
<h2 class="section-heading">Approximating the exponential map</h2>
<p>In general, the exponential map is difficult to compute, as to compute a geodesic, we have to solve a system of second-order ODE. Therefore, for a computational reason, we would like to approximate the exponential map with cheaper alternative.</p>
<p>Let $p \in \M$ be arbitrary. We define a map $R_p: T\M \to \M$ called the <strong><em>retraction</em></strong> map, by the following properties:</p>
<ol>
<li>$R_p(0) = p$</li>
<li>$dR_p(0) = \text{Id}_{T_p \M}$.</li>
</ol>
<p>The second property is called the <strong><em>local rigidity</em></strong> condition and it preserves gradients at $p$. In particular, the exponential map is a retraction. Furthermore, if $d_g$ denotes the Riemannian distance and $t \in \R$, retraction can be seen as a first-order approximation of the exponential map, in the sense that</p>
\[d_g(\exp_p(tv), R_p(tv)) = O(t^2) \, .\]
<p>On an arbitrary embedded submanifold $\S \in \R^{n+1}$, if $p \in \S$ and $v \in T_p \S$, viewing $p$ to be a point on the ambient manifold and $v$ to be a point on the ambient tangent space $T_p \R^{n+1}$, we can compute $R_p(v)$ by (i) moving along $v$ to get $p + v$ and then (ii) project the point $p+v$back to $\S$.</p>
<p><strong>Example 3. (Retraction on a sphere).</strong> Let $\mathbb{S}^n(r)$ be a sphere embedded in $\R^{n+1}$ with radius $r$. The retraction on any $p \in \mathbb{S}^n(r)$ and $v \in T_p \mathbb{S}^n(r)$ is defined by</p>
\[R_p(v) = r \frac{p + v}{\norm{p + v}}\]
<p class="right">//</p>
<p>Therefore, the Riemannian gradient descent in Algorithm 2 can be modified to be</p>
<p><strong>Algorithm 3 (Riemannian gradient descent with retraction).</strong></p>
<ol>
<li>Pick arbitrary $p_{(0)} \in \M$. Let $\alpha \in \R$ with $\alpha > 0$</li>
<li>While the stopping criterion is not satisfied:
<ol>
<li>Compute the gradient of $f$ at $p_{(t)}$, i.e. $h_{(t)} := \gradat{f}{p_{(t)}} = (df \, \vert_{p_{(t)}})^\sharp$</li>
<li>Move in the direction $-h_{(t)}$, i.e. <span style="color:blue">$p_{(t+1)} = R_{p_{(t)}}(-\alpha h_{(t)})$</span></li>
<li>$t = t+1$</li>
</ol>
</li>
<li>Return $p_{(t)}$</li>
</ol>
<h2 class="section-heading">Natural gradient descent</h2>
<p>One of the most important applications of the Riemannian gradient descent in machine learning is for doing optimization of statistical manifolds. We define a statistical manifold $(\R^n, g)$ to be the set $\R^n$ corresponding to the set of parameter of a statistical model $p_\theta(z)$, equipped with metric tensor $g$ which is the Fisher information metric, given by</p>
\[g_{ij} = \E_{z \sim p_\theta} \left[ \partder{\log p_\theta(z)}{\theta^i} \partder{\log p_\theta(z)}{\theta^j} \right] \, .\]
<p>The most common objective function $f$ in the optimization problem on a statistical manifold is the expected log-likelihood function of our statistical model. That is, given a dataset $\D = \{ z_i \}$, the objective is given by $f(\theta) = \sum_{z \in \D} \log p_\theta(z)$.</p>
<p>The metric tensor $g$ is represented by $n \times n$ matrix $F$, called the <a href="/techblog/2018/03/11/fisher-information/"><em>Fisher information matrix</em></a>. The Riemannian gradient in this manifold is therefore can be represented by a column vector $h = F^{-1} d^\T$. Furthermore, as the manifold is $\R^n$, the construction of the retraction map we have discussed previously tells us that we can simply do addition $p + v$ for any $p \in \R^n$ and $v \in T_p \R^n$. This is well defined as there is a natural isomorphism between $\R^n$ and $T_p \R^n$. All in all, the gradient descent in this manifold is called the <a href="/techblog/2018/03/14/natural-gradient/"><em>natural gradient descent</em></a> and is presented in Algorithm 4 below.</p>
<p><strong>Algorithm 4 (Natural gradient descent).</strong></p>
<ol>
<li>Pick arbitrary $\theta_{(0)} \in \R^n$. Let $\alpha \in \R$ with $\alpha > 0$</li>
<li>While the stopping criterion is not satisfied:
<ol>
<li>Compute the gradient of $f$ at $\theta_{(t)}$, i.e. $h_{(t)} := F^{-1} d^\T$</li>
<li>Move in the direction $-h_{(t)}$, i.e. $\theta_{(t+1)} = \theta_{(t)} - \alpha h_{(t)}$</li>
<li>$t = t+1$</li>
</ol>
</li>
<li>Return $\theta_{(t)}$</li>
</ol>
<h2 class="section-heading">Conclusion</h2>
<p>Optimization in Riemannian manifold is an interesting and important application in the field of geometry. It generalizes the optimization methods from Euclidean spaces onto Riemannian manifolds. Specifically, in the gradient descent method, adapting it to a Riemannian manifold requires us to use the Riemannian gradient as the search direction and the exponential map or retraction to move between points on the manifold.</p>
<p>One major difficulty exists: Computing and storing the matrix representation $G$ of the metric tensor are very expensive. Suppose the manifold is $n$-dimensional. Then, the size of $G$ is in $O(n^2)$ and the complexity of inverting it is in $O(n^3)$. In machine learning, $n$ could be in the order of million, so a naive implementation is infeasible. Thankfully, many approximations of the metric tensor, especially for the Fisher information metric exist (e.g. [7]). Thus, even with these difficulties, the Riemannian gradient descent or its variants have been successfully applied on many areas, such as in inference problems [8], word or knowledge graph embeddings [9], etc.</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Lee, John M. “Smooth manifolds.” Introduction to Smooth Manifolds. Springer, New York, NY, 2013. 1-31.</li>
<li>Lee, John M. Riemannian manifolds: an introduction to curvature. Vol. 176. Springer Science & Business Media, 2006.</li>
<li>Fels, Mark Eric. “An Introduction to Differential Geometry through Computation.” (2016).</li>
<li>Absil, P-A., Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.</li>
<li>Boumal, Nicolas. Optimization and estimation on manifolds. Diss. Catholic University of Louvain, Louvain-la-Neuve, Belgium, 2014.</li>
<li>Graphics: <a href="https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane">https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane</a>.</li>
<li>Martens, James, and Roger Grosse. “Optimizing neural networks with kronecker-factored approximate curvature.” International conference on machine learning. 2015.</li>
<li>Patterson, Sam, and Yee Whye Teh. “Stochastic gradient Riemannian Langevin dynamics on the probability simplex.” Advances in neural information processing systems. 2013.</li>
<li>Suzuki, Atsushi, Yosuke Enokida, and Kenji Yamanishi. “Riemannian TransE: Multi-relational Graph Embedding in Non-Euclidean Space.” (2018).</li>
</ol>
Fri, 22 Feb 2019 12:00:00 +0100
https://agustinus.kristia.de/techblog/2019/02/22/optimization-riemannian-manifolds/
https://agustinus.kristia.de/techblog/2019/02/22/optimization-riemannian-manifolds/mathtechblogNotes on Riemannian Geometry<p>Recently I have been studying differential geometry, including Riemannian geometry. When studying this subject, a lot of <em>aha</em> moments came up due to my previous (albeit informal) exposure to the geometric point-of-view of natural gradient method. I found that the argument from this point-of-view to be very elegant, which motivates me further to study geometry in depth. This writing is a collection of small notes (largely from Lee’s Introduction to Smooth Manifolds and Introduction to Riemannian Manifolds) that I find useful as a reference on this subject. Note that, this is by no means a completed article. I will update it as I study further.</p>
<h2 class="section-heading">Manifolds</h2>
<p>We are interested in generalizing the notion of Euclidean space into arbitrary smooth curved space, called smooth manifold. Intuitively speaking, a <strong><em>topological $n$-manifold</em></strong> $\M$ is a topological space that locally resembles $\R^n$. A <strong><em>smooth $n$-manifold</em></strong> is a topological $n$-manifold equipped with locally smooth map $\phi_p: \M \to \R^n$ around each point $p \in \M$, called the <strong><em>local coordinate chart</em></strong>.</p>
<p><strong>Example 1 (Euclidean spaces).</strong> For each $n \in \mathbb{N}$, the Euclidean space $\R^n$ is a smooth $n$-manifold with a single chart $\phi := \text{Id}_{\R^n}$, the identity map, for all $p \in \M$. Thus, $\phi$ is a <em>global coordinate chart</em>.</p>
<p class="right">//</p>
<p><strong>Example 2 (Spaces of matrices).</strong> Let $\text{M}(m \times n, \R)$ denote the set of $m \times n$ matrices with real entries. We can identify it with $\R^{mn}$ and as before, this is a smooth $mn$-dimensional manifold. Some of its subsets, e.g. the general linear group $\text{GL}(n, \R)$ and the space of full rank matrices, are smooth manifolds.</p>
<p class="right">//</p>
<p><strong>Remark 1.</strong> We will drop $n$ when referring a smooth $n$-manifold from now on, for brevity sake. Furthermore, we will start to use the <strong><em>Einstein summation convention</em></strong>: repeated indexes above and below are implied to be summed, e.g. $v_i w^i := \sum_i v_i w^i$.</p>
<p class="right">//</p>
<h2 class="section-heading">Tangent vectors and covectors</h2>
<p>At each point $p \in \M$, there exists a vector space $T_p \M$, called the <strong><em>tangent space</em></strong> of $p$. An element $v \in T_p \M$ is called the <strong><em>tangent vector</em></strong>. Let $f: \M \to \R$ be a smooth function. In local coordinate $\{x^1, \dots, x^n\}$ defined around $p$, the coordinate vectors $\{ \partial/\partial x^1, \dots, \partial/\partial x^n \}$ form a <strong><em>coordinate basis</em></strong> for $T_p \M$.</p>
<p>A tangent vector $v \in T_p \M$ can also be seen as a <strong><em>derivation</em></strong>, a linear map $C^\infty(\M) \to \R$ that follows Leibniz rule (product rule of derivative), i.e.</p>
\[v(fg) = f(p)vg + g(p)vf \enspace \enspace \forall f, g \in C^\infty(\M) \, .\]
<p>Thus, we can also see $T_p \M$ to be the set of all derivations of $C^\infty(\M)$ at $p$.</p>
<p>For each $p \in \M$ there also exists the dual space $T_p^* \M$ of $T_p \M$, called the <strong><em>cotangent space</em></strong> at $p$. Each element $\omega \in T_p^* \M$ is called the <strong><em>tangent covector</em></strong>, which is a linear functional $\omega: T_p \M \to \R$ acting on tangent vectors at $p$. Given the same local coordinate as above, the basis for the cotangent space at $p$ is called the <strong><em>dual coordinate basis</em></strong> and is given by $\{ dx^1, \dots, dx^n \}$, such that $dx^i(\partial/\partial x^j) = \delta^i_j$ the Kronecker delta. Note that, this implies that if $v := v^i \, \partial/\partial x^i$, then $dx^i(v) = v^i$.</p>
<p>Tangent vectors and covectors follow different transformation rules. We call an object with lower index, e.g. the components of tangent covector $\omega_i$ and the coordinate basis $\partial/\partial x^i =: \partial_i$, to be following the <strong><em>covariant</em></strong> transformation rule. Meanwhile an object with upper index, e.g. the components a tangent vector $v^i$ and the dual coordinate basis $dx^i$, to be following the <strong><em>contravariant</em></strong> transformation rule. These stem from how an object transform w.r.t. change of coordinate. Recall that a vector, when all the basis vectors are scaled up by a factor of $k$, the coefficients in its linear combination will be scaled by $1/k$, thus it is said that a vector transforms <em>contra</em>-variantly (the opposite way to the basis). Analogously, we can show that when we apply the same transformation to the dual basis, the covectors coefficients will be scaled by $k$, thus it transforms the same way to the basis (<em>co</em>-variantly).</p>
<p>The partial derivatives of a scalar field (real valued function) on $\M$ can be interpreted as the components of a covector field in a coordinate-independent way. Let $f$ be such scalar field. We define a covector field $df: \M \to T^* \M$, called the <strong><em>differential</em></strong> of $f$, by</p>
\[df_p(v) := vf \enspace \enspace \text{for} \, v \in T_p\M \, .\]
<p>Concretely, in smooth coordinates $\{ x^i \}$ around $p$, we can show that it can be written as</p>
\[df_p := \frac{\partial f}{\partial x^i} (p) \, dx^i \, \vert_p \, ,\]
<p>or as an equation between covector fields instead of covectors:</p>
\[df := \frac{\partial f}{\partial x^i} \, dx^i \, .\]
<p>The disjoint union of the tangent spaces at all points of $\M$ is called the <strong><em>tangent bundle</em></strong> of $\M$</p>
\[TM := \coprod_{p \in \M} T_p \M \, .\]
<p>Meanwhile, analogously for the cotangent spaces, we define the <strong><em>cotangent bundle</em></strong> of $\M$ as</p>
\[T^*M := \coprod_{p \in \M} T^*_p \M \, .\]
<p>If $\M$ and $\mathcal{N}$ are smooth manifolds and $F: \M \to \mathcal{N}$ is a smooth map, for each $p \in \M$ we define a map</p>
\[dF_p : T_p \M \to T_{F(p)} \mathcal{N} \, ,\]
<p>called the <strong><em>differential</em></strong> of $F$ at $p$, as follows. Given $v \in T_p \M$:</p>
\[dF_p (v)(f) := v(f \circ F) \, .\]
<p>Moreover, for any $v \in T_p \M$, we call $dF_p (v)$ the <strong><em>pushforward</em></strong> of $v$ by $F$ at $p$. It differs from the previous definition of differential in the sense that this map is a linear map between tangent spaces of two manifolds. Furthermore the differential of $F$ can be seen as the generalization of the total derivative in Euclidean spaces, in which $dF_p$ is represented by the Jacobian matrix.</p>
<h2 class="section-heading">Vector fields</h2>
<p>If $\M$ is a smooth $n$-manifold, a <strong><em>vector field</em></strong> on $\M$ is a continuous map $X: \M \to T\M$, written as $p \mapsto X_p$, such that $X_p \in T_p \M$ for each $p \in \M$. If $(U, (x^i))$ is any smooth chart for $\M$, we write the value of $X$ at any $p \in U \subset \M$ as</p>
\[X_p = X^i(p) \, \frac{\partial}{\partial x^i} \vert_p \, .\]
<p>This defines $n$ functions $X^i: U \to \R$, called the <strong><em>component functions</em></strong> of $X$. The restriction of $X$ to $U$ is a smooth vector field if and only if its component functions w.r.t. the chart are smooth.</p>
<p><strong>Example 3 (Coordinate vector fields).</strong> If $(U, (x^i))$ is any smooth chart on $\M$, then $p \mapsto \partial/\partial x^i \vert_p$ is a vector field on $U$, called the <strong><em>i-th coordinate vector field</em></strong>. It is smooth as its component functions are constant. This vector fields defines a basis of the tangent space at each point.</p>
<p class="right">//</p>
<p><strong>Example 4 (Gradient).</strong> If $f \in C^\infty(\M)$ is a real-valued function on $\M$, then the gradient of $f$ is a vector field on $\M$. See the corresponding section below for more detail.</p>
<p class="right">//</p>
<p>We denote $\mathfrak{X}(\M)$ to be the set of all smooth vector fields on $\M$. It is a vector space under pointwise addition and scalar multiplication, i.e. $(aX + bY)_p = aX_p + bY_p$. The zero element is the zero vector field, whose value is $0 \in T_p \M$ for all $p \in \M$. If $f \in C^\infty(\M)$ and $X \in \mathfrak{X}(\M)$, then we define $fX: \M \to T\M$ by $(fX)_p = f(p)X_p$. Note that this defines a multiplication of a vector field with a smooth real-valued function. Furthermore, if in addition, $g \in C^\infty(\M)$ and $Y \in \mathfrak{X}(\M)$, then $fX + gY$ is also a smooth vector field.</p>
<p>A <strong><em>local frame</em></strong> for $\M$ is an ordered $n$-tuple of vector fields $(E_1, \dots, E_n)$ defined on an open subset $U \subseteq M$ that is linearly independent and spans the tangent bundle, i.e. $(E_1 \vert_p, \dots, E_n \vert_p)$ form a basis for $T_p \M$ for each $p \in \M$. It is called a <strong><em>global frame</em></strong> if $U = M$, and a <strong><em>smooth frame</em></strong> if each $E_i$ is smooth.</p>
<p>If $X \in \mathfrak{X}(\M)$ and $f \in C^\infty(U)$, we define $Xf: U \to \R$ by $(Xf)(p) = X_p f$. $X$ also defines a map $C^\infty(\M) \to C^\infty(\M)$ by $f \mapsto Xf$ which is linear and Leibniz, thus it is a derivation. Moreover, derivations of $C^\infty(\M)$ can be identified with smooth vector fields, i.e. $D: C^\infty(\M) \to C^\infty(\M)$ is a derivation if and only if it is of the form $Df = Xf$ for some $X \in \mathfrak{X}(\M)$.</p>
<h2 class="section-heading">Tensors</h2>
<p>Let $\{ V_k \}$ and $U$ be real vector spaces. A map $F: V_1 \times \dots \times V_k \to U$ is said to be <strong><em>multilinear</em></strong> if it is linear as a function of each variable separately when the others are held fixed. That is, it is a generalization of the familiar linear and bilinear maps. Furthermore, we write the vector space of all multilinear maps $ V_1 \times \dots \times V_k \to U $ as $ \text{L}(V_1, \dots, V_k; U) $.</p>
<p><strong>Example 5 (Multilinear functions).</strong> Some examples of familiar multilinear functions are</p>
<ol>
<li>The <em>dot product</em> in $ \R^n $ is a scalar-valued bilinear function of two vectors. E.g. for any $ v, w \in \R^n $, the dot product between them is $ v \cdot w := \sum_i^n v^i w^i $, which is linear on both $ v $ and $ w $.</li>
<li>The <em>determinant</em> is a real-valued multilinear function of $ n $ vectors in $ \R^n $.</li>
</ol>
<p class="right">//</p>
<p>Let $\{ W_l \}$ also be real vector spaces and suppose</p>
\[\begin{align}
F&: V_1 \times \dots \times V_k \to \R \\
G&: W_1 \times \dots \times W_l \to \R
\end{align}\]
<p>be multilinear maps. Define a function</p>
\[\begin{align}
F \otimes G &: V_1 \times \dots \times V_k \times W_1 \times \dots \times W_l \to \R \\
F \otimes G &(v_1, \dots, v_k, w_1, \dots, w_k) = F(v_1, \dots, v_k) G(w_1, \dots, w_l) \, .
\end{align}\]
<p>From the multilinearity of $ F $ and $ G $ it follows that $ F \otimes G $ is also multilinear, and is called the <strong><em>tensor product of $ F $ and $ G $</em></strong>. I.e. tensors and tensor products are multilinear map with codomain in $ \R $.</p>
<p><strong>Example 6 (Tensor products of covectors).</strong> Let $ V $ be a vector space and $ \omega, \eta \in V^* $. Recall that they both a linear map from $ V $ to $ \R $. Therefore the tensor product between them is</p>
\[\begin{align}
\omega \otimes \eta &: V \times V \to \R \\
\omega \otimes \eta &(v_1, v_2) = \omega(v_1) \eta(v_2) \, .
\end{align}\]
<p class="right">//</p>
<p><strong>Example 7 (Tensor products of dual basis).</strong> Let $ \epsilon^1, \epsilon^2 $ be the standard dual basis for $ (\R^2)^* $. Then, the tensor product $ \epsilon^1 \otimes \epsilon^2: \R^2 \times \R^2 \to \R $ is the bilinear function defined by</p>
\[\epsilon^1 \otimes \epsilon^2(x, y) = \epsilon^1 \otimes \epsilon^2((w, x), (y, z)) := wz \, .\]
<p class="right">//</p>
<p>We use the notation $ V_1^* \otimes \dots \otimes V_k^* $ to denote the space $ \text{L}(V_1, \dots, V_k; \R) $. Let $ V $ be a finite-dimensional vector space. If $ k \in \mathbb{N} $, a <strong><em>covariant</em> $ k $-tensor on $ V $</strong> is an element of the $ k $-fold tensor product $ V^* \otimes \dots \otimes V^* $, which is a real-valued multilinear function of $ k $ elements of $ V $ to $ \R $. The number $ k $ is called the <strong><em>rank</em></strong> of the tensor.</p>
<p>Analogously, we define a <strong><em>contravariant $ k $-tensor on $ V $</em></strong> to be an element of the element of the $ k $-fold tensor product $ V \otimes \dots \otimes V $. We can mixed the two types of tensors together: For any $ k, l \in \mathbb{N} $, we define a <strong><em>mixed tensor on $ V $ of type $ (k, l) $</em></strong> to be the tensor product of $ k $ such $ V $ and $ l $ such $ V^* $.</p>
<h2 class="section-heading">Riemannian metrics</h2>
<p>So far we have no mechanism to measure the length of (tangent) vectors like we do in standard Euclidean geometry, where the length of a vector $v$ is measured in term of the dot product $ \sqrt{v \cdot v} $. Thus, we would like to add a structure that enables us to do just that to our smooth manifold $\M$.</p>
<p>A <strong><em>Riemannian metric</em></strong> $ g $ on $ \M $ is a smooth symmetric covariant 2-tensor field on $ \M $ that is positive definite at each point. Furthermore, for each $ p \in \M $, $ g_p $ defines an inner product on $ T_p \M $, written $ \inner{v, w}_g = g_p(v, w) $ for all $ v, w \in T_p \M $. We call a tuple $(\M, g)$ to be a <strong><em>Riemannian manifold</em></strong>.</p>
<p>In any smooth local coordinate $\{x^i\}$, a Riemannian metric can be written as tensor product</p>
\[g = g_{ij} \, dx^i \otimes dx^j \, ,\]
<p>such that</p>
\[g(v, w) = g_{ij} \, dx^i \otimes dx^j(v, w) = g_{ij} \, dx^i(v) dx^j(w) = g_{ij} \, v^i w^j \, .\]
<p>That is we can represent $ g $ as a symmetric, positive definite matrix $ G $ taking two tangent vectors as its arguments: $ \inner{v, w}_g = v^\text{T} G w $. Furthermore, we can define a norm w.r.t. $g$ to be $\norm{\cdot}_g := \inner{v, v}_g$ for any $v \in T_p \M$.</p>
<p><strong>Example 8 (The Euclidean Metric).</strong> The simplest example of a Riemannian metric is the familiar <strong><em>Euclidean metric</em></strong> $g$ of $\R^n$ using the standard coordinate. It is defined by</p>
\[g = \delta_{ij} \, dx^i \otimes dx^j \, ,\]
<p>which, if applied to vectors $v, w \in T_p \R^n$, yields</p>
\[g_p(v, w) = \delta_{ij} \, v^i w^j = \sum_{i=1}^n v^i w^i = v \cdot w \, .\]
<p>Note that above, $\delta_{ij}$ is the Kronecker delta. Thus, the Euclidean metric can be represented by the $n \times n$ identity matrix.</p>
<p class="right">//</p>
<h2 class="section-heading">The tangent-cotangent isomorphism</h2>
<p>Riemannian metrics also provide an isomorphism between the tangent and cotangent space: They allow us to convert vectors to covectors and vice versa. Let $(\M, g)$ be a Riemannian manifold. We define an isomorphism $\hat{g}: T_p \M \to T_p^* \M$ as follows. For each $p \in \M$ and each $v \in T_p \M$</p>
\[\hat{g}(v) = \inner{v, \cdot}_g \, .\]
<p>Notice that that $\hat{g}(v)$ is in $T_p^* \M$ as it is a linear functional over $T_p \M$. In any smooth coordinate $\{x^i\}$, by definition we can write $g = g_{ij} \, dx^i dx^j$. Thus we can write the isomorphism above as</p>
\[\hat{g}(v) = (g_{ij} \, v^i) \, dx^j =: v_i \, dx^j \, .\]
<p>Notice that we transform a contravariant component $v^i$ (denoted by the upper index component $i$) to a covariant component $v_i = g_{ij} \, v^i$ (denoted by the lower index component $j$), with the help of the metric tensor $g$. Because of this, we say that we obtain a covector from a tangent vector by <strong><em>lowering an index</em></strong>. Note that, we can also denote this by “flat” symbol in musical sheets: $\hat{g}(v) =: v^\flat$.</p>
<p>As Riemannian metric can be seen as a symmetric positive definite matrix, it has an inverse $g^{ij} := g_{ij}^{-1}$, which we denote by moving the index to the top, such that $g^{ij} \, g_{jk} = g_{kj} \, g^{ji} = \delta^i_k$. We can then define the inverse map of the above isomorphism as $\hat{g}^{-1}: T_p^* \M \to T_p \M$, where</p>
\[\hat{g}^{-1}(\omega) = (g^{ij} \, \omega_j) \, \frac{\partial}{\partial x^i} =: \omega^i \, \frac{\partial}{\partial x^i} \, ,\]
<p>for all $\omega \in T_p^* \M$. In correspondence with the previous operation, we are now looking at the components $\omega^i := g^{ij} \, \omega_j$, hence this operation is called <strong><em>raising an index</em></strong>, which we can also denote by “sharp” musical symbol: $\hat{g}^{-1}(\omega) =: \omega^\sharp$. Putting these two map together, we call the isomorphism between the tangent and cotangent space as the <strong><em>musical isomorphism</em></strong>.</p>
<h2 class="section-heading">The Riemannian gradient</h2>
<p>Let $(\M, g)$ be a Riemannian manifold, and let $f: \M \to \R$ be a real-valued function over $\M$ (i.e. a scalar field on $\M)$. Recall that $df$ is a covector field, which in coordinates has partial derivatives of $f$ as its components. We define a vector field called the <strong><em>gradient</em></strong> of $f$ by</p>
\[\begin{align}
\grad{f} := (df)^\sharp = \hat{g}^{-1}(df) \, .
\end{align}\]
<p>For any $p \in \M$ and for any $v \in T_p \M$, the gradient satisfies</p>
\[\inner{\grad{f}, v}_g = vf \, .\]
<p>That is, for each $p \in \M$ and for any $v \in T_p \M$, $\grad{f}$ is a vector in $T_p \M$ such that the inner product with $v$ is the derivation of $f$ by $v$. Observe the compatibility of this definition with standard multi-variable calculus: the directional derivative of a function in the direction of a vector is the dot product of its gradient and that vector.</p>
<p>In any smooth coordinate $\{x^i\}$, $\grad{f}$ has the expression</p>
\[\grad{f} = g^{ij} \frac{\partial f}{\partial x^i} \frac{\partial}{\partial x^j} \, .\]
<p><strong>Example 9 (Euclidean gradient).</strong> On $\R^n$ with the Euclidean metric with the standard coordinate, the gradient of $f: \R^n \to \R$ is</p>
\[\grad{f} = \delta^{ij} \, \frac{\partial f}{\partial x^i} \frac{\partial}{\partial x^j} = \sum_{i=1}^n \frac{\partial f}{\partial x^i} \frac{\partial}{\partial x^i} \, .\]
<p>Thus, again it is coincide with the definition we are familiar with form calculus.</p>
<p class="right">//</p>
<p>All in all then, given a basis, in matrix notation, let $G$ be the matrix representation of $g$ and let $d$ be the matrix representation of $df$ (i.e. as a row vector containing all partial derivatives of $f$), then: $\grad{f} = G^{-1} d^\T$.</p>
<p>The interpretation of the gradient in Riemannian manifold is analogous to the one in Euclidean space: its direction is the direction of steepest ascent of $f$ and it is orthogonal to the level sets of $f$; and its length is the maximum directional derivative of $f$ in any direction.</p>
<h2 class="section-heading">Connections</h2>
<p>Let $(\M, g)$ be a Riemannian manifold and let $X, Y: \M \to T \M$ be a vector field. Applying the usual definition for directional derivative, the way we differentiate $X$ is by</p>
\[D_X \vert_p Y = \lim_{h \to 0} \frac{Y_{p+hX_p} - Y_p}{h} \, .\]
<p>However, we will have problems: We have not defined what this expression $p+hX_p$ means. Furthermore, as $Y_{p+hX_p}$ and $Y_p$ live in different vector spaces $T_{p+hX_p} \M$ and $T_p \M$, it does not make sense to subtract them, unless there is a natural isomorphism between each $T_p \M$ and $\M$ itself, as in Euclidean spaces. Hence, we need to add an additional structure, called <strong><em>connection</em></strong> that allows us to compare different tangent vectors from different tangent spaces of nearby points.</p>
<p>Specifically, we define the <strong><em>affine connection</em></strong> to be a connection in the tangent bundle of $\M$. Let $\mathfrak{X}(\M)$ be the space of vector fields on $\M$; $X, Y, Z \in \mathfrak{X}(\M)$; $f, g \in C^\infty(\M)$; and $a, b \in \R$. The affine connection is given by the map</p>
\[\begin{align}
\nabla: \mathfrak{X}(\M) \times \mathfrak{X}(\M) &\to \mathfrak{X}(\M) \\
(X, Y) &\mapsto \nabla_X Y \, ,
\end{align}\]
<p>which satisfies the following properties</p>
<ol>
<li>$C^\infty(\M)$-linearity in $X$, i.e., $\nabla_{fX+gY} Z = f \, \nabla_X Z + g \, \nabla_Y Z$</li>
<li>$\R$-linearity in Y, i.e., $\nabla_X (aY + bZ) = a \, \nabla_X Y + b \, \nabla_X Z$</li>
<li>Leibniz rule, i.e., $\nabla_X (fY) = (Xf) Y + f \, \nabla_X Y$ .</li>
</ol>
<p>We call $\nabla_X Y$ the <strong><em>covariant derivative</em></strong> of $Y$ in the direction $X$. Note that the notation $Xf$ means $Xf(p) := D_{X_p} \vert_p f$ for all $p \in \M$, i.e. the directional derivative (it is a scalar field). Furthermore, notice that, covariant derivative and connection are the same thing and they are useful to generalize the notion of directional derivative to vector fields.</p>
<p>In any smooth local frame $(E_i)$ in $T \M$ on an open subset $U \in \M$, we can expand the vector field $\nabla_{E_i} E_j$ in terms of this frame</p>
\[\nabla_{E_i} E_j = \Gamma^k_{ij} E_k \,.\]
<p>The $n^3$ smooth functions $\Gamma^k_{ij}: U \to \R$ is called the <strong><em>connection coefficients</em></strong> or the <strong><em>Christoffel symbols</em></strong> of $\nabla$.</p>
<p><strong>Example 10 (Covariant derivative in Euclidean spaces).</strong> Let $\R^n$ with the Euclidean metric be a Riemannian manifold. Then</p>
\[(\nabla_Y X)_p = \lim_{h \to 0} \frac{Y_{p+hX_p} - Y_p}{h} \enspace \enspace \forall p \in \M \, ,\]
<p>the usual directional derivative, is a covariant derivative.</p>
<p class="right">//</p>
<p>There exists a unique affine connection for every Riemannian manifold $(\M, g)$ that satisfies</p>
<ol>
<li>Symmetry, i.e., $\nabla_X Y - \nabla_Y X = [X, Y]$</li>
<li>Metric compatible, i.e., $Z \inner{X, Y}_g = \inner{\nabla_Z X, Y}_g + \inner{X, \nabla_Z Y}_g$,</li>
</ol>
<p>for all $X, Y, Z \in \mathfrak{X}(\M)$. It is called the <strong><em>Levi-Civita connection</em></strong>. Note that, $[\cdot, \cdot]$ is the <strong>Lie bracket</strong>, defined by $[X, Y]f = X(Yf) - Y(Xf)$ for all $f \in C^\infty(\M)$. Note also that, the connection shown in Example 10 is the Levi-Civita connection for Euclidean spaces with the Euclidean metric.</p>
<h2 class="section-heading">Riemannian Hessians</h2>
<p>Let $(\M, g)$ be a Riemannian manifold equipped by the Levi-Civita connection $\nabla$. Given a scalar field $f: \M \to \R$ and any $X, Y \in \mathfrak{X}(\M)$, the <strong><em>Riemannian Hessian</em></strong> of $f$ is the covariant 2-tensor field $\text{Hess} \, f := \nabla^2 f := \nabla \nabla f$, defined by</p>
\[\text{Hess} \, f(X, Y) := X(Yf) - (\nabla_X Y)f = \inner{\nabla_X \, \grad{f}, Y}_g \, .\]
<p>Another way to define Riemannian Hessian is to treat is a linear map $T_p \M \to T_p \M$, defined by</p>
\[\text{Hess}_{v} \, f = \nabla_v \, \grad{f} \, ,\]
<p>for every $p \in \M$ and $v \in T_p \M$.</p>
<p>In any local coordinate $\{x^i\}$, it is defined by</p>
\[\text{Hess} \, f = f_{; i,j} \, dx^i \otimes dx^j := \left( \frac{\partial f}{\partial x^i \partial x^j} - \Gamma^k_{ij} \frac{\partial f}{\partial x^k} \right) \, dx^i \otimes dx^j \, .\]
<p><strong>Example 11 (Euclidean Hessian).</strong> Let $\R^n$ be a Euclidean space with the Euclidean metric and standard Euclidean coordinate. We can show that connection coefficients of the Levi-Civita connection are all $0$. Thus the Hessian is defined by</p>
\[\text{Hess} \, f = \left( \frac{\partial f}{\partial x^i \partial x^j} \right) \, dx^i \otimes dx^j \, .\]
<p>It is the same Hessian as we have seen in calculus.</p>
<p class="right">//</p>
<h2 class="section-heading">Geodesics</h2>
<p>Let $(\M, g)$ be a Riemannian manifold and let $\nabla$ be a connection on $T\M$. Given a smooth curve $\gamma: I \to \M$, a <strong><em>vector field along $\gamma$</em></strong> is a smooth map $V: I \to T\M$ such that $V(t) \in T_{\gamma(t)}\M$ for every $t \in I$. We denote the space of all such vector fields $\mathfrak{X}(\gamma)$. A vector field $V$ along $\gamma$ is said to be <strong><em>extendible</em></strong> if there exists another vector field $\tilde{V}$ on a neighborhood of $\gamma(I)$ such that $V = \tilde{V} \circ \gamma$.</p>
<p>For each smooth curve $\gamma: I \to \M$, the connection determines a unique operator</p>
\[D_t: \mathfrak{X}(\gamma) \to \mathfrak{X}(\gamma) \, ,\]
<p>called the <strong><em>covariant derivative along $\gamma$</em></strong>, satisfying (i) linearity over $\R$, (ii) Leibniz rule, and (iii) if it $V \in \mathfrak{X}(\gamma)$ is extendible, then for all $\tilde{V}$ of $V$, we have that $ D_t V(t) = \nabla_{\gamma’(t)} \tilde{V}$.</p>
<p>For every smooth curve $\gamma: I \to \M$, we define the <strong><em>acceleration</em></strong> of $\gamma$ to be the vector field $D_t \gamma’$ along $\gamma$. A smooth curve $\gamma$ is called a <strong><em>geodesic</em></strong> with respect to $\nabla$ if its acceleration is zero, i.e. $D_t \gamma’ = 0 \enspace \forall t \in I$. In term of smooth coordinates $\{x^i\}$, if we write $\gamma$ in term of its components $\gamma(t) := \{x^1(t), \dots, x^n(t) \}$, then it follows that $\gamma$ is a geodesic if and only if its component functions satisfy the following <strong><em>geodesic equation</em></strong>:</p>
\[\ddot{x}^k(t) + \dot{x}^i(t) \, \dot{x}^j(t) \, \Gamma^k_{ij}(x(t)) = 0 \, ,\]
<p>where we use $x(t)$ as an abbreviation for $\{x^1(t), \dots, x^n(t)\}$. Observe that, this gives us a hint that to compute a geodesic we need to solve a system of second-order ODE for the real-valued functions $x^1, \dots, x^n$.</p>
<p>Suppose $\gamma: [a, b] \to \M$ is a smooth curve segment with domain in the interval $[a, b]$. The <strong><em>length</em></strong> of $\gamma$ is</p>
\[L_g (\gamma) := \int_a^b \norm{\gamma'(t)}_g \, dt \, ,\]
<p>where $\gamma’$ is the derivative (the velocity vector) of $\gamma$. We can then use curve segments as “measuring tapes” to measure the <strong><em>Riemannian distance</em></strong> from $p$ to $q$ for any $p, q \in \M$$</p>
\[d_g(p, q) := \inf \, \{L_g(\gamma) \, \vert \, \gamma: [a, b] \to \M \enspace \text{s.t.} \enspace \gamma(a) = p, \, \gamma(b) = q\} \, ,\]
<p>over all curve segments $\gamma$ which have endpoints at $p$ and $q$. We call the particular $\gamma$ such that $L_g(\gamma) = d_g(p, q)$ as the <strong><em>length-minimizing curve</em></strong>. We can show that all geodesics are locally length-minimizing, and all length-minimizing curves are geodesics.</p>
<h2 class="section-heading">Parallel transport</h2>
<p>Let $(\M, g)$ be a Riemannian manifold with affine connection $\nabla$. A smooth vector field $V$ along a smooth curve $\gamma: I \to \M$ is said to be <strong><em>parallel</em></strong> along $\gamma$ if $D_t V = 0$ for all $t \in I$. Notice that a geodesic can then be said to be a curve whose velocity vector field is parallel along the curve.</p>
<p>Given $t_0 \in I$ and $v \in T_{\gamma(t_0)} \M$, we can show there exists a unique parallel vector field $V$ along $\gamma$ such that $V(t_0) = v$. This vector field is called the <strong><em>parallel transport</em></strong> of $v$ along $\gamma$. Now, for each $t_0, t_1 \in I$, we define a map</p>
\[\begin{align}
&P^\gamma_{t_0 t_1} : T_{\gamma(t_0)} \M \to T_{\gamma(t_1)} \M \\
&P^\gamma_{t_0 t_1}(v) = V(t_1) \, ,
\end{align}\]
<p>called the <strong><em>parallel transport map</em></strong>. We can picture the concept of parallel transport by imagining that we are “sliding” a tangent vector $v$ along $\gamma$ such that the direction and the magnitude of $v$ is preserved.</p>
<p>Note that, the parallel transport map is a linear map with inverse $P^\gamma_{t_1 t_0}$, hence it is an isomorphism between two tangent spaces $T_{\gamma(t_0)} \M$ and $T_{\gamma(t_1)} \M$. We can therefore determine the covariant derivative along $\gamma$ using parallel transport:</p>
\[D_t V(t_0) = \lim_{t_1 \to t_0} \frac{P^\gamma_{t_1 t_0} \, V(t_1) - V(t_0)}{t_1 - t_0} \, ,\]
<p>Moreover, we can also determine the connection $\nabla$ via parallel transport:</p>
\[\nabla_X Y \, \vert_p = \lim_{h \to 0} \frac{P^\gamma_{h 0} Y_{\gamma(h)} - Y_p}{h} \, ,\]
<p>for every $p \in \M$.</p>
<p>Finally, if $A$ is a smooth vector field on $\M$, then $A$ is parallel on $\M$ if and only if $\nabla A = 0$.</p>
<h2 class="section-heading">The exponential map</h2>
<p>Geodesics with proportional initial velocities are related in a simple way. Let $(\M, g)$ be a Riemannian manifold equipped with the Levi-Civita connection. For every $p \in \M$, $v \in T_p \M$, and $c, t \in \R$,</p>
\[\gamma_{cv} (t) = \gamma_{v} (ct) \, ,\]
<p>whenever either side is defined. This fact is compatible with our intuition on how speed and time are related to distance.</p>
<p>From the fact above, we can define a map from the tangent bundle to $\M$ itself, which sends each line through the origin in $T_p \M$ to a geodesic. Define a subset $\mathcal{E} \subseteq T\M$, the <strong><em>domain of the exponential map</em></strong> by</p>
\[\mathcal{E} := \{ v \in T\M : \gamma_v \text{ is defined on an interval containing } [0, 1] \} \, ,\]
<p>and then define the <strong><em>exponential map</em></strong></p>
\[\begin{align}
&\text{exp}: \mathcal{E} \to \M \\
&\text{exp}(v) = \gamma_v(1) \, .
\end{align}\]
<p>For each $p \in \M$, the <strong><em>restricted exponential map</em></strong> at $p$, denoted $\text{exp}_p$ is the restriction of $\text{exp}$ to the set $\mathcal{E}_p := \mathcal{E} \cap T_p \M$.</p>
<p>The interpretation of the (restricted) exponential maps is that, given a point $p$ and tangent vector $v$, we follow a geodesic which has the property $\gamma(0) = p$ and $\gamma’(0) = v$. This is then can be seen as the generalization of moving around the Euclidean space by following straight line in the direction of velocity vector.</p>
<h2 class="section-heading">Curvature</h2>
<p>Let $(\M, g)$ be a Riemannian manifold. Recall that an <strong><em>isometry</em></strong> is a map that preserves distance. Now, $\M$ is said to be <strong><em>flat</em></strong> if it is locally isometric to a Euclidean space, that is, every point in $\M$ has a neighborhood that is isometric to an open set in $\R^n$. We say that a connection $\nabla$ on $\M$ satisfies the <strong><em>flatness criterion</em></strong> if whenever $X, Y, Z$ are smooth vector fields defined on an open subset of $\M$, the following identity holds:</p>
\[\nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z = \nabla_{[X, Y]} Z \, .\]
<p>Furthermore, we can show that $(\M, g)$ is a flat Riemannian manifold, then its Levi-Civita connection satisfies the flatness criterion.</p>
<p><strong>Example 12 (Euclidean space is flat).</strong> Let $\R^n$ with the Euclidean metric be a Riemannian manifold, equipped with the Euclidean connection $\nabla$. Then, given $X, Y, Z$ smooth vector fields:</p>
\[\begin{align}
\nabla_X \nabla_Y Z &= \nabla_X (Y(Z^k) \partial_k) = XY(Z^k) \partial_k \\
\nabla_Y \nabla_X Z &= \nabla_Y (X(Z^k) \partial_k) = YX(Z^k) \partial_k \, .
\end{align}\]
<p>The difference between them is</p>
\[(XY(Z^k) - YX(Z^k)) \partial_k = \nabla_{[X, Y]}Z \, ,\]
<p>by definition. Thus</p>
\[\nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z = \nabla_{[X, Y]}Z \, .\]
<p>Therefore, the Euclidean space with the Euclidean connection (which is the Levi-Civita connection on Euclidean space) is flat.</p>
<p class="right">//</p>
<p>Based on the above definition of the flatness criterion, then we can define a measure on how far away a manifold to be flat:</p>
\[\begin{align}
&R: \mathfrak{X}(\M) \times \mathfrak{X}(\M) \times \mathfrak{X}(\M) \to \mathfrak{X}(\M) \\
&R(X, Y)Z = \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X, Y]} Z \, ,
\end{align}\]
<p>which is a multilinear map over $C^\infty (\M)$, and is therefore a $(1, 3)$-tensor field on $\M$.</p>
<p>We can then define a covariant 4-tensor called the <strong><em>(Riemann) curvature tensor</em></strong> to be the $(0, 4)$-tensor field $Rm := R^\flat$, by lowering the contravariant index of $R$. Its action on vector fields is given by</p>
\[Rm(X, Y, Z, W) := \inner{R(X, Y)Z, W}_g \, .\]
<p>In any local coordinates, it is written</p>
\[Rm = R_{ijkl} \, dx^i \otimes dx^j \otimes dx^k \otimes dx^l \, ,\]
<p>where $R_{ijkl} = g_{lm} \, {R_{ijkl}}^m$. We can show that $Rm$ is a local isometry invariant. Furthermore, compatible with our intuition of the role of the curvature tensor, a Riemannian manifold is flat if and only if its curvature tensor vanishes identically.</p>
<p>Working with $4$-tensors are complicated, thus we want to construct simpler tensors that summarize some of the information contained in the curvature tensor. For that, first we need to define the trace operator for tensors. Let $T^{(k,l)}(V)$ denotes the space of tensors with $k$ covariant and $l$ contravariant components of a vector space $V$, the trace operator is:</p>
\[\begin{align}
&\text{tr}: T^{(k+1, l+1)}(V) \to T^{(k,l)}(V) \\
&(\text{tr} \, F)(\omega^1, \dots, \omega^k, v_1, \dots, v_l) = \text{tr}(F(\omega^1, \dots, \omega^k, \cdot, v_1, \dots, v_l, \cdot)) \, ,
\end{align}\]
<p>where the trace operator in the right hand side is the usual trace operator, as $F(\omega^1, \dots, \omega^k, \cdot, v_1, \dots, v_l, \cdot) \in T^{(1,1)}(V)$ is a $(1,1)$-tensor, which can be represented by a matrix. We can extend this operator to covariant tensors in Riemannian manifolds: If $h$ is any covariant $k$-tensor field with $k \geq 2$, we can raise one of its indices and obtain $(1, k-1)$-tensor $h^\sharp$. The trace of $h^\sharp$ is thus a covariant $(k-2)$-tensor field. All in all, we define the <strong><em>trace</em></strong> of $h$ w.r.t. $g$ as</p>
\[\text{tr}_g \, h := \text{tr}(h^\sharp) \, .\]
<p>In coordinates, it is</p>
\[\text{tr}_g \, h = {h_i}^i = g^{ij} h_{ij} \, ,\]
<p>which, in an orthonormal frame, it is given by the ordinary trace of the matrix $(h_{ij})$.</p>
<p>We now define the <strong><em>Ricci curvature</em></strong> or <strong><em>Ricci tensor</em></strong> $Rc$ which is the covariant 2-tensor field defined as follows:</p>
\[Rc(X, Y) := \text{tr}(Z \mapsto R(Z, X)Y) \, ,\]
<p>for any vector fields $X, Y$. In local coordinates, its components are</p>
\[R_{ij} := {R_{kij}}^k = g^{km} \, R_{kijm} \, .\]
<p>We can simplify it further: We define the <strong><em>scalar curvature</em></strong> to be the function $S$ to be the trace of the Ricci tensor:</p>
\[S := \text{tr}_g \, Rc = {R_i}^i = g^{ij} \, R_{ij} \, .\]
<p>Thus the scalar curvature is a scalar field on $\M$.</p>
<h2 class="section-heading">Submanifolds</h2>
<p>Let $\M$ be a smooth manifold. An <strong><em>embedded or regular submanifold</em></strong> of $\M$ is a subset $\mathcal{S} \subset \M$ that is a manifold in the subspace topology, endowed with a smooth structure w.r.t. which the inclusion map $\mathcal{S} \hookrightarrow \M$ is a smooth embedding. We call the difference $\text{dim} \, \M - \text{dim} \, \mathcal{S}$ to be the <strong><em>codimension</em></strong> of $\mathcal{S}$ in $\M$, and $\M$ to be the <strong><em>ambient manifold</em></strong>. An <strong><em>embedded hypersurface</em></strong> is an embedded submanifold of codimension 1.</p>
<p><strong>Example 13 (Graphs as submanifolds).</strong> Suppose $\M$ is a smooth $m$-manifold, $\mathcal{N}$ is a smooth $n$-manifold, $U \subset \M$ is open, and $f: U \to \mathcal{N}$ is a smooth map. Let $\Gamma(f) \subseteq \M \times \mathcal{N}$ denote the graph of $f$, i.e.</p>
\[\Gamma(f) := \{ (x, y) \in \M \times \mathcal{N} : x \in U, y = f(x) \} \, .\]
<p>Then $\Gamma(f)$ is an embedded $m$-submanifold of $\M \times \mathcal{N}$.</p>
<p>Furthermore, if $f: \M \to \mathcal{N}$ is a smooth map (notice that we have defined $f$ globally here), then $\Gamma(f)$ is <strong><em>properly embedded</em></strong> in $\M \times \mathcal{N}$, i.e. the inclusion map is a <a href="https://en.wikipedia.org/wiki/Proper_map">proper map</a>.</p>
<p class="right">//</p>
<p>Suppose $\M$ and $\N$ are smooth manifolds. Let $F: \M \to \N$ be a smooth map and $p \in \M$. We define the rank of $F$ at $p$ to be the <strong><em>rank</em></strong> of the linear map $dF_p: T_p\M \to T_{F(p)\N}$, i.e. the rank of the Jacobian matrix of $F$ in coordinates. If $F$ has the same rank $r$ at any point, we say that it has <strong><em>constant rank</em></strong>, written $\rank{F} = r$. Note that it is bounded by $\min \{ \dim{\M}, \dim{\N} \}$ and if it is equal to this bound, we say $F$ has <strong><em>full rank</em></strong> at $p$.</p>
<p>A smooth map $F: \M \to \N$ is called a <strong><em>smooth submersion</em></strong> if $dF$ is surjective at each point ($\rank{F} = \dim{\N}$). It is called a <strong><em>smooth immersion</em></strong> if $dF$ is injective at each point ($\rank{F} = \dim{\M}$).</p>
<p><strong>Example 14 (Submersions and immersions).</strong></p>
<ol>
<li>Suppose $\M_1, \dots, \M_k$ are smooth manifolds. Then each projection maps $\pi_i: \M_1 \times \dots \times \M_k \to \M_i$ is a smooth submersion. In particular $\pi: \R^{n+k} \to \R^n$ is a smooth submersion.</li>
<li>If $\gamma: I \to \M$ is a smooth curve in a smooth manifold $\M$, then $\gamma$ is a smooth immersion if and only if $\gamma’(t) \neq 0$ for all $t \in I$.</li>
</ol>
<p class="right">//</p>
<p>If $\M$ and $\N$ are smooth manifolds. A <strong><em>diffeomorphism</em></strong> from $\M$ to $\N$ is a smooth bijective map $F: \M \to \N$ that has a smooth inverse, and $\M$ and $\N$ are said to be <strong><em>diffeomorphic</em></strong>. $F$ is called a <strong><em>local diffeomorphism</em></strong> if every point $p \in \M$ has a neighborhood $U$ such that $F(U)$ is open in $\N$ and $F\vert_U: U \to F(U)$ is a diffeomorphism. We can show that $F$ is a local diffeomorphism if and only if it is both a smooth immersion and submersion. Furthermore, if $\dim{\M} = \dim{\N}$ and $F$ is either a smooth immersion or submersion, then it is a local diffeomorphism.</p>
<p>The <em>Global rank theorem</em> says that if $\M$ and $\N$ are smooth manifolds, and suppose $F: \M \to \N$ is a smooth map of constant rank, then it is (a) a smooth submersion if it is injective, (b) a smooth immersion if it is injective, and (c) a diffeomorphism if it is bijective.</p>
<p>If $\M$ and $\N$ are smooth manifolds, a <strong><em>smooth embedding</em></strong> of $\M$ into $\N$ is a smooth immersion $F: \M \to \N$ that is also a topological embedding (homeomorphism onto its image in the subspace topology).</p>
<p><strong>Example 15 (Smooth embeddings).</strong> If $\M$ is a smooth manifold and $U \subseteq \M$ is an open submanifold, the inclusion $U \hookrightarrow \M$ is a smooth embedding.</p>
<p class="right">//</p>
<p>Let $F: \M \to \N$ be an injective smooth immersion. If any of these condition holds, then $F$ is a smooth embedding: (a) $F$ is an open or closed map, (b) $F$ is a proper map, (c) $\M$ is compact, and (d) $\M$ has empty boundary and $\dim{\M} = \dim{\N}$.</p>
<h2 class="section-heading">The second fundamental form</h2>
<p>Let $(\M, g)$ be a Riemannian submanifold of a Riemannian manifold $(\tilde{\M}, \tilde{g})$. Then, $g$ is the induced metric $g = \iota_\M^* \tilde{g}$, where $\iota_\M: \M \hookrightarrow \tilde{\M}$ is the inclusion map. Note that, the expression $\iota^*_\M \tilde{g}$ is called the <strong><em>pullback metric</em></strong> or the <strong><em>induced metric</em></strong> of $\tilde{g}$ by $\iota_\M$ and is defined by</p>
\[\iota_\M^* \tilde{g}(u, v) := \tilde{g}(d\iota_\M(u), d\iota_\M(v)) \, ,\]
<p>for any $u, v \in T_p \M$. Also, recall that $d\iota_\M$ is the pushforward (tangent map) by $\iota_\M$. Intuitively, we map the tangent vectors $u, v$ of $T_p \M$ to some tangent vectors of $T_{\iota_\M(p)} \tilde{\M}$ and use $\tilde{g}$ as the metric.</p>
<p>In this section, we will denote any geometric object of the ambient manifold with tilde, e.g. $\tilde{\nabla}, \tilde{Rm}$, etc. Note also that, we can use the inner product notation $\inner{u, v}$ to refer to $g$ or $\tilde{g}$, since $g$ is just the restriction of $\tilde{g}$ to pairs of tangent vectors in $T \M$.</p>
<p>We would like to compare the Levi-Civita connection of $\M$ with that of $\tilde{\M}$. First, we define orthogonal projection maps, called <strong><em>tangential</em></strong> and <strong><em>normal projections</em></strong> by</p>
\[\begin{align}
\pi^\top &: T \tilde{\M} \vert_\M \to T\M \\
\pi^\perp &: T \tilde{\M} \vert_\M \to N\M \, ,
\end{align}\]
<p>where $N\M$ is the <strong><em>normal bundle</em></strong> of $\M$, i.e. the set of all vectors normal to $\M$. If $X$ is a section of $T\tilde{\M}\vert_\M$, we use the shorthand notations $X^\top = \pi^\top X$ and $X^\perp = \pi^\perp X$.</p>
<p>Given $X, Y \in \mathfrak{X}(\M)$, we can extend them to vector fields on an open subset of $\tilde{\M}$, apply the covariant derivative $\tilde{\nabla}$, and then decompose at $p \in \M$ to get</p>
\[\tilde{\nabla}_X Y = (\tilde{\nabla}_X Y)^\top + (\tilde{\nabla}_X Y)^\perp \, .\]
<p>Let $\Gamma(E)$ be the space of smooth sections of bundle $E$. For the second part, we define the <strong><em>second fundamental form</em></strong> of $\M$ to be a map $\two: \mathfrak{X}(\M) \times \mathfrak{X}(\M) \to \Gamma(N\M)$ defined by</p>
\[\two(X, Y) = (\tilde{\nabla}_X Y)^\perp \, .\]
<p>Meanwhile, we can show that, the first part is the covariant derivative w.r.t. the Levi-Civita connection of the induced metric on $\M$. All in all, the above equation can be written as the <strong><em>Gauss formula</em></strong>:</p>
\[\tilde{\nabla}_X Y = \nabla_X Y + \two(X, Y) \, .\]
<p>The second fundamental form can also be used to evaluate extrinsic covariant derivatives of <em>normal</em> vector fields (instead of <em>tangent</em> ones above). For each normal vector field $N \in \Gamma(N\M)$, we define a scalar-valued symmetric bilinear form $\two_N: \mathfrak{X}(\M) \times \mathfrak{X}(\M) \to \R$ by</p>
\[\two_N(X, Y) = \inner{N, \two(X, Y)} \, .\]
<p>Let $W_N: \mathfrak{X}(\M) \to \mathfrak{X}(\M)$ denote the self-adjoint linear map associated with this bilinear form, characterized by</p>
\[\inner{W_N(X), Y} = \two_N(X, Y) = \inner{N, \two(X, Y)} \, .\]
<p>The map $W_N$ is called the <strong><em>Weingarten map</em></strong> in the direction of $N$. Furthermore we can show that the equation $(\tilde{\nabla}_X N)^\top = -W_N(X)$ holds and is called the <strong><em>Weingarten equation</em></strong>.</p>
<p>In addition to describing the difference between the intrinsic and extrinsic connections, the second fundamental form describes the difference between the curvature tensors of $\tilde{\M}$ and $\M$. The explicit formula is called the <strong><em>Gauss equation</em></strong> and is given by</p>
\[\tilde{Rm}(W, X, Y, Z) = Rm(W, X, Y, Z) - \inner{\two(W, Z), \two(X, Y)} + \inner{\two(W, Y), \two(X, Z)} \, .\]
<p>To give a geometric interpretation of the second fundamental form, we study the curvatures of curves. Let $\gamma: I \to \M$ be a smooth unit-speed curve. We define the <strong><em>curvature</em></strong> of $\gamma$ as the length of the acceleration vector field, i.e. the function $\kappa: I \to \R$ given by $\kappa(t) := \norm{D_t \gamma’(t)}$. We can see this curvature of the curve as a quantitative measure of how far the curve deviates from being a geodesic. Note that, if $\M = \R^n$ the curvature agrees with the one defined in calculus.</p>
<p>Now, suppose that $\M$ is a submanifold in the ambient manifold $\tilde{\M}$. Every regular curve $\gamma: I \to \M$ has two distinct curvature: its <strong><em>intrinsic curvature</em></strong> $\kappa$ as a curve in $\M$ and its <strong><em>extrinsic curvature</em></strong> $\tilde{\kappa}$ as a curve in $\tilde{\M}$. The second fundamental form can then be used to compute the relationship between the two: For $p \in \M$ and $v \in T_p \M$, (i) $\two(v, v)$ is the $\tilde{g}$-acceleration at $p$ of the $g$-geodesic $\gamma_v$, and (ii) if $v$ is a unit vector, then $\norm{\two(v, v)}$ is the $\tilde{g}$-curvature of $\gamma_v$ at $p$.</p>
<p>The intrinsic and extrinsic accelerations of a curve are usually different. A Riemannian submanifold $(\M, g)$ of $(\tilde{\M}, \tilde{g})$ is said to be <strong><em>totally geodesic</em></strong> if every $\tilde{g}$-geodesic that is tangent to $\M$ at some time $t_0$ stays in $\M$ for all $t \in (t_0 - \epsilon, t_0 + \epsilon)$.</p>
<h2 class="section-heading">Riemannian hypersurfaces</h2>
<p>We focus on the case when $(\M, g)$ is an embedded $n$-dimensional Riemannian submanifold of an $(n+1)$-dimensional Riemannian manifold $(\tilde{\M}, \tilde{g})$. That is, $\M$ is a hypersurface of $\tilde{\M}$.</p>
<p>In this situation, at each point of $\M$, there are exactly two unit normal vectors. We choose one of these normal vector fields and call it $N$. We can replace the vector-valued second fundamental form above by a simpler scalar-valued form. The <strong><em>scalar second fundamental form</em></strong> of $\M$ is the symmetric covariant $2$-tensor field $h = \two_N$, i.e.</p>
\[h(X, Y) := \inner{N, \two(X, Y)} \enspace \enspace \enspace \text{for all } X, Y \in \mathfrak{X}(\M) \, .\]
<p>By the Gauss formula $\tilde{\nabla}_X Y = \nabla_X Y + \two(X, Y)$ and noting that $\nabla_X Y$ is orthogonal to $N$, we can rewrite the definition as $h(X, Y) = \inner{N, \tilde{\nabla}_X Y}$. Furthermore, since $N$ is a unit vector spanning $N\M$, we can write $\two(X, Y) = h(X, Y)N$. Note that the sign of $h$ depends on the normal vector field chosen.</p>
<p>The choice of $N$ also determines a Weingarten map $W_N: \mathfrak{X}(\M) \to \mathfrak{X}(\M)$. In this special case of a hypersurface, we use the notation $s = W_N$ and call it the <strong><em>shape operator</em></strong> of $\M$. We can think of $s$ as the $(1, 1)$-tensor field on $\M$ obtained from $h$ by raising an index. It is characterized by</p>
\[\inner{sX, Y} = h(X, Y) \enspace \enspace \enspace \text{for all } X, Y \in \mathfrak{X}(\M) \, .\]
<p>As with $h$, the choice of $N$ determines the sign of $s$.</p>
<p>Note that at every $p \in \M$, $s$ is a self-adjoint linear endomorphism of the tangent space $T_p \M$. Let $v \in T_p \M$. From linear algebra, we know that there is a unit vector $v_0 \in T_p \M$ such that $v \mapsto \inner{sv, v}$ achieve its maximum among all unit vectors. Every such vector is an eigenvector of $s$ with eigenvalue $\lambda_0 = \inner{s v_0, v_0}$. Furthermore, $T_p \M$ has an orthonormal basis $(b_1, \dots, b_n)$ formed by the eigenvectors of $s$ and all of the eigenvalues $(\kappa_1, \dots \kappa_n)$ are real. (Note that this means for each $i$, $s b_i = \kappa_i b_i)$.) In this basis, both $h$ and $s$ are represented by diagonal matrices.</p>
<p>The eigenvalues of $s$ at $p \in \M$ are called the <strong><em>principal curvatures</em></strong> of $\M$ at $p$, and the corresponding eigenvectors are called the <strong><em>principal directions</em></strong>. Note that the sign of the principal curvatures depend on the choice of $N$. But otherwise both the principal curvatures and directions are independent of the choice of coordinates.</p>
<p>From the principal curvatures, we can compute other quantities: The <strong><em>Gaussian curvature</em></strong> which is defined as $K := \text{det}(s)$ and the <strong><em>mean curvature</em></strong> $H := (1/n) \text{tr}(s)$. In other words, $K = \prod_i \kappa_i$ and $H = (1/n) \sum_i \kappa_i$, since $s$ can be represented by a symmetric matrix.</p>
<p>The Gaussian curvature, which is a local isometric invariant, is connected to a global topological invariant, the <a href="https://en.wikipedia.org/wiki/Euler_characteristic">Euler characteristic</a>, through the <strong><em>Gauss-Bonnet theorem</em></strong>. Let $(\M, g)$ be a smoothly triangulated compact Riemannian 2-manifold, then</p>
\[\int_\M K \, dA = 2 \pi \, \chi(\M) \, ,\]
<p>where $dA$ is its Riemannian density.</p>
<h2 class="section-heading">Hypersurfaces of Euclidean space</h2>
<p>Assume that $\M \subseteq \R^{n+1}$ is an embedded Riemannian $n$-submanifold (with the induced metric from the Euclidean metric). We denote geometric objects on $\R^{n+1}$ with bar, e.g. $\bar{g}$, $\overline{Rm}$, etc. Observe that $\overline{Rm} \equiv 0$, which implies that the Riemann curvature tensor of a hypersurface in $\R^{n+1}$ is completely determined by the second fundamental form.</p>
<p>In this setting we can give some very concrete geometric interpretation about quantities in hypersurfaces. First is for curves. For every $v \in T_p \M$, let $\gamma = \gamma_v : I \to \M$ be the $g$-geodesic in $\M$ with initial velocity $v$. The Gauss formula shows that the Euclidean acceleration of $\gamma$ at $0$ is $\gamma^{\prime\prime}(0) = \overline{D}_t \gamma’(0) = h(v, v)N_p$, thus $\norm{h(v, v)}$ is the Euclidean curvature of $\gamma$ at $0$. Furthermore, $h(v,v) = \inner{\gamma^{\prime\prime}(0), N_p} > 0$ iff. $\gamma^{\prime\prime}(0)$ points in the same direction as $N_p$. That is $h(v, v)$ is positive if $\gamma$ is curving in the direction of $N_p$ and negative if it is curving away from $N_p$.</p>
<p>We can show that the above Euclidean curvature can be interpreted in terms f the radius of the “best circular approximation”, just in Calculus. Suppose $\gamma: I \to \R^m$ is a unit-speed curve, $t_0 \in I$, and $\kappa(t_0) \neq 0$. We define a unique unit-speed parametrized circle $c: \R \to \R^m$ as the <strong><em>osculating circle</em></strong> at $\gamma(t_0)$, with the property that $c$ and $\gamma$ have the same position, velocity, and acceleration at $t=t_0$. Then, the Euclidean curvature of $\gamma$ at $t_0$ is $\kappa(t_0) = 1/R$ where $R$ is the radius of the osculating circle.</p>
<p>As mentioned before, to compute the curvature of a hypersurface in Euclidean space, we can compute the second fundamental form. Suppose $X: U \to \M$ is a smooth local parametrization of $\M$, $(X_1, \dots, X_n)$ is the local frame for $T \M$ determined by $X$, and $N$ is a unit normal field on $\M$. Then, the scalar second fundamental form is given by</p>
\[h(X_i, X_j) = \innerbig{\frac{\partial^2 X}{\partial u^i \partial u^j}, N} \, .\]
<p>The implication of this is that it shows how the principal curvatures give a concise description of the local shape of the hypersurface by approximating the surface with the graph of a quadratic function. That is, we can show that there is an isometry $\phi: \R^{n+1} \to \R^{n+1}$ that takes $p \in \M$ to the origin and takes a neighborhood of it to a graph of the form $x^{n+1} = f(x^1, \dots, x^n)$, where</p>
\[f(x) = \frac{1}{2} \sum_{i=1}^n\kappa_i (x^i)^2 + O(\abs{x}^3) \, .\]
<p>We can write down a smooth vector field $N = N^i \partial_i$ on an open subset of $\R^{n+1}$ that restricts to a unit normal vector field along $\M$. Then, the shape operator can be computed straightforwardly using the Weingarten equation and observing that the Euclidean covariant derivatives of $N$ are just ordinary directional derivatives in Euclidean space. Thus, for every vector $X = X^i \partial_j$ tangent to $\M$, we have</p>
\[sX = -\bar{\nabla}_X N = -\sum_{i,j=1}^{n+1} X^j (\partial_j N^i) \partial_i \, .\]
<p>One common way to get such smooth vector field is to work with a local defining function $F$ for $\M$, i.e. a smooth scalar field defined on some open subset $U \subseteq \R^{n+1}$ s.t. $U \cap \M$ is a regular level set of $F$. Then, we can take</p>
\[N = \frac{\grad{F}}{\norm{\grad{F}}} \, .\]
<p>Because we know that the gradient is always normal to the level set.</p>
<p><strong>Example 16 (Shape operators of spheres).</strong> The function $F: \R^{n+1} \to \R$ with $F(x) := \norm{x}^2$ is a smooth defining function of any sphere in $\mathbb{S}^{n}(R)$. Thus, the normalized gradient vector field</p>
\[N = \frac{1}{R} \sum_{i,j=1}^{n+1} x^i \partial_i\]
<p>is a (outward pointing) unit normal vector field along $\mathbb{S}^n(R)$. The shape operator is</p>
\[sX = -\frac{1}{R} \sum_{i,j=1}^{n+1} X^j (\partial_j x^i) \partial_i = -\frac{1}{R} X \, ,\]
<p>where recall that, $\partial_j x^i = \partial x^i / \partial x^j = \delta_{ij}$. We can therefore write $s$ as a matrix $s = (-1/R) \mathbf{I}$ where $\mathbf{I}$ is the identity matrix. The principal curvatures are then all equal to $-1/R$, the mean curvature is $H = -1/R$, and the Gaussian curvature is $K = (-1/R)^n$. Note that, these curvatures are constant. These reflects the fact that the sphere bends the exact same way at every point.</p>
<p class="right">//</p>
<p>Lastly, for surfaces in $\R^3$, given a parametrization of $X$, the normal vector field can be computed via the cross product:</p>
\[N = \frac{X_1 \times X_2}{\norm{X_1 \times X_2}} \, ,\]
<p>where $X_1 := \partial_1 X$ and $X_2 := \partial_2 X$, which together form a basis of the tangent space at each point on the surface.</p>
<p>Although the Gaussian curvature is defined in terms of a particular embedding of a submanifold in the Euclidean space (i.e. it is an extrinsic quantity), it is actually an intrinsic invariant of the submanifold. Gauss showed in his <strong><em>Theorema Egregium</em></strong> that in an embedded $2$-dimensional Riemannian submanifold $(\M, g)$ of $\R^3$, for every point $p \in \M$, the Gaussian curvature of $\M$ at $p$ is equal to one-half the scalar curvature of $g$ at $p$, and thus it is a local isometry invariant of $(\M, g)$.</p>
<p>Suppose $\M$ is a Riemannian $n$-manifold with $n \geq 2$, $p \in \M$, and $V \subset T_p \M$ is a <a href="https://en.wikipedia.org/wiki/Star_domain">star-shaped neighborhood</a> of zero on which $\text{exp}_p$ is a diffeomorphism onto an open set $U \subset \M$. Let $\Pi$ be any $2$-dimensional linear subspace of $T_p \M$. Since $\Pi \cap V$ is an embedded $2$-dim submanifold of $V$, it follows that $\mathcal{S}_\Pi = \text{exp}_p(\Pi \cup V)$ is an embedded $2$-dim submmanifold of $U \subset \M$ containing $p$, called the <strong><em>plane section</em></strong> determined by $\Pi$. We define the <strong><em>sectional curvature</em></strong> of $\Pi$, denoted by $\text{sec}(\Pi)$, to be the intrinsic Gaussian curvature at $p$ of the surface $\mathcal{S}_\Pi$ with the metric induced from the embedding $\mathcal{S}_\Pi \subseteq \M$. If $v, w \in T_p \M$ are linearly independent vectors, the sectional curvature’s formula is given by</p>
\[\text{sec}(v, w) := \frac{Rm_p(v, w, w, v)}{\norm{v \wedge w}^2} \, ,\]
<p>where</p>
\[\norm{v \wedge w} := \sqrt{\norm{v}^2 \norm{w}^2 - \inner{v, w}^2} \, .\]
<p>We can show the connection between the sectional curvature and Ricci and scalar curvatures. $Rc_p(v, v)$ is the sum of the sectional curvatures of the $2$-planes spanned by $(v, b_2), \dots, (v, b_n)$, where $(b_1, \dots, b_n)$ is any orthonormal basis for $T_p \M$ with $b_1 = v$. Furthermore, the scalar curvature at $p$ is the sum of all sectional curvatures of the $2$-planes spanned by ordered pairs of distinct basis vectors in any orthonormal basis.</p>
<h2 class="section-heading">Lie groups</h2>
<p>A <strong><em>Lie group</em></strong> is a smooth manifold $\G$ that is also a group in the algebraic sense, with the property that the multiplication map $m: \G \times \G \to \G$ and inversion map $i: \G \to \G$, given by</p>
\[m(g, h) := gh \, , \qquad i(g) := g^{-1} \, ,\]
<p>are both smooth for arbitrary $g, h \in \G$. We denote the identity element of $G$ by $e$.</p>
<p><strong>Example 17 (Lie groups).</strong> The following manifolds are Lie groups.</p>
<ol>
<li>
<p>The <strong><em>general linear group</em></strong> $\GL(n, \R)$ is the set of invertible $n \times n$ matrices with real elements. It is a group under matrix multiplication and it is a submanifold of the vector space $\text{M}(n, \R)$, the space of $n \times n$ matrices.</p>
</li>
<li>
<p>The real number field $\R$ and the Euclidean space $\R^n$ are Lie groups under addition.</p>
</li>
</ol>
<p class="right">//</p>
<p>If $\G$ and $\mathcal{H}$ are Lie groups, a <strong><em>Lie group homomorphism</em></strong> from $\G$ to $\mathcal{H}$ is a smooth map $F: \G \to \mathcal{H}$ that is also a group homomorphism. If $F$ is also a diffeomorphism, then it is a <strong><em>Lie group isomorphism</em></strong>. We say that $\G$ and $\mathcal{H}$ are <strong><em>isomorphic Lie groups</em></strong>.</p>
<p>If $G$ is a group and $M$ is a set, a <strong><em>left action</em></strong> of $G$ on $M$ is a map $G \times M \to M$ defined by $(g, p) \mapsto g \cdot p$ that satisfies</p>
\[\begin{alignat}{2}
g_1 \cdot (g_2 \cdot p) &= (g_1 g_2) \cdot p \qquad &&\text{for all } g_1, g_2 \in G, p \in M \, ; \\
e \cdot p &= p &&\text{for all } p \in M \, .
\end{alignat}\]
<p>Analogously, a <strong><em>right action</em></strong> is defined as a map $M \times G \to M$ satisfying</p>
\[\begin{alignat}{2}
(p \cdot g_1) \cdot g_2 &= p \cdot (g_1 g_2) \qquad &&\text{for all } g_1, g_2 \in G, p \in M \, ; \\
p \cdot e &= p &&\text{for all } p \in M \, .
\end{alignat}\]
<p>If $M$ is a smooth manifold, $G$ is a Lie group, and the defining map is smooth, then the action is said to be <strong><em>smooth action</em></strong>.</p>
<p>We can also give a name to an action, e.g. $\theta: G \times M \to M$ with $(g, p) \mapsto \theta_g (p)$. In this notation, the above conditions for the left action read</p>
\[\begin{align}
\theta_{g_1} \circ \theta_{g_2} &= \theta_{g_1 g_2} \, , \\
\theta_e &= \Id_M \, ,
\end{align}\]
<p>while for a right action the first equation is replaced by $\theta_{g_2} \circ \theta_{g_1} = \theta_{g_1 g_2}$. For a smooth action, each map $\theta_g : M \to M$ is a diffeomorphism.</p>
<p>For each $p \in M$, the <strong><em>orbit</em></strong> of $p$, denoted by $G \cdot p$, is the set of all images of $p$ under the action by elements of $G$:</p>
\[G \cdot p := \{ g \cdot p : g \in G \} \, .\]
<p>The <strong><em>isotropy group</em></strong> or <strong><em>stabilizer</em></strong> of $p$, denoted by $G_p$, is the set of elements of $G$ that fix $p$ (implying $G_p$ is a subgroup of $G$):</p>
\[G_p := \{ g \in G : g \cdot p = p \} \, .\]
<p>A group action is said to be <strong><em>transitive</em></strong> if for every pair of points $p, q \in M$, there exists $g \in G$ such that $g \cdot p = q$, i.e. if the only orbit is all of $M$. The action is said to be <strong><em>free</em></strong> if the only element of $G$ that fixes any element of $M$ is the identity: $g \cdot p$ for some $p \in M$ implies $g = e$, i.e. if every isotropy group is trivial.</p>
<p><strong>Example 18 (Lie group actions).</strong></p>
<ol>
<li>
<p>If $\G$ is a Lie group and $\M$ is a smooth manifold, the <strong><em>trivial action</em></strong> of $\G$ on $\M$ is defined by $g \cdot p = p$ for all $g \in \G$ and $p \in \M$.</p>
</li>
<li>
<p>The <strong><em>natural action</em></strong> of $\GL(n, \R)$ on $\R^n$ is the left action given by matrix multiplication $(\b{A}, \vx) \mapsto \b{A} \vx$.</p>
</li>
</ol>
<p class="right">//</p>
<p>Let $\G$ be a Lie group, $\M$ and $\N$ be smooth manifolds endowed with smooth left or right $\G$-actions. A map $F: \M \to \N$ is <strong><em>equivariant</em></strong> w.r.t. the given actions if for each $g \in G$,</p>
\[\begin{alignat}{2}
F(g \cdot p) &= g \cdot F(p) \qquad &&\text{for left actions} \, , \\
F(p \cdot g) &= F(p) \cdot g &&\text{for right actions} \, .
\end{alignat}\]
<p>If $F: \M \to \N$ is a smooth map that is equivariant w.r.t. a transitive smooth $\G$-action on $\M$ and any smooth $\G$-action on $\N$, then $F$ has <strong><em>constant rank</em></strong>, meaning that its rank is the same for all $p \in \M$. Thus, if $F$ is surjective, it is a smooth submersion; if it is injective, it is a smooth immersion; and if it is bijective, it is a diffeomorphism.</p>
<p><strong>Example 19 (The orthogonal group).</strong> A real $n \times n$ matrix $\b{A}$ is said to be <strong><em>orthogonal</em></strong> if it preserves the Euclidean dot product as a linear map:</p>
\[(\b{A} \vx) \cdot (\b{A} \vx) = \vx \cdot \vy \qquad \text{for all} \, \vx, \vy \in \R^n \, .\]
<p>The set of all orthogonal $n \times n$ matrices $\text{O}(n)$ is a subgroup of $\GL(n, \R)$, called the <strong><em>orthogonal group</em></strong> of degree $n$.</p>
<p class="right">//</p>
<p>We would like to also study the theory of <strong><em>group representations</em></strong>, i.e. asking the question whether all Lie groups can be realized as Lie subgroups of $\GL(n, \R)$ or $\GL(n, \C)$. If $\G$ is a Lie group, a <strong><em>representation</em></strong> of $\G$ is a Lie group homomorphism from $\G$ to $\GL(V)$ for some finite-dimensional vector space $V$. Note that, $\GL(V)$ denotes the group of invertible linear transformations of $V$ which is a Lie group isomorphic to $\GL(n, \R)$. If a representation is injective, it is said to be <strong><em>faithful</em></strong>.</p>
<p>There is a close connection between representations and group actions. An action of $\G$ on $V$ is said to be a <strong><em>linear action</em></strong> if for each $g \in \G$, the map $V \to V$ defined by $x \mapsto g \cdot x$ is linear.</p>
<p><strong>Example 20 (Linear action).</strong> If $\rho: \G \to \GL(V)$ is a representation of $\G$, there is an associated smooth linear action of $\G$ on $V$ given by $g \cdot x = \rho(g) x$. In fact, this holds for every linear action.</p>
<p class="right">//</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Lee, John M. “Smooth manifolds.” Introduction to Smooth Manifolds. Springer, New York, NY, 2013. 1-31.</li>
<li>Lee, John M. Riemannian manifolds: an introduction to curvature. Vol. 176. Springer Science & Business Media, 2006.</li>
<li>Fels, Mark Eric. “An Introduction to Differential Geometry through Computation.” (2016).</li>
<li>Absil, P-A., Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.</li>
<li>Boumal, Nicolas. Optimization and estimation on manifolds. Diss. Catholic University of Louvain, Louvain-la-Neuve, Belgium, 2014.</li>
<li>Graphics: <a href="https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane">https://tex.stackexchange.com/questions/261408/sphere-tangent-to-plane</a>.</li>
</ol>
Fri, 22 Feb 2019 12:00:00 +0100
https://agustinus.kristia.de/techblog/2019/02/22/riemannian-geometry/
https://agustinus.kristia.de/techblog/2019/02/22/riemannian-geometry/mathtechblogMinkowski's, Dirichlet's, and Two Squares Theorem<p><img src="/img/2018-07-24-minkowski-dirichlet/forest.svg" alt="Forest" height="250px" width="250px" /></p>
<p>Suppose we are standing at the origin of bounded regular forest in \( \mathbb{R}^2 \), with diameter of \(26\)m, and all the trees inside have diameter of \(0.16\)m. Can we see outside this forest? This problem can be solved using Minkowski’s Theorem. We will see the theorem itself first, and we shall see how can we answer that question. Furthermore, Minkowski’s Theorem can also be applied to answer two other famous theorems, Dirichlet’s Approximation Theorem, and Two Squares Theorem.</p>
<p><strong>Theorem 1 (Minkowski’s Theorem)</strong><br />
Let \( C \subseteq \mathbb{R}^d \) be symmetric around the origin, convex, and bounded set. If \( \text{vol}(C) > 2^d \) then \( C \) contains at least one lattice point different from the origin.</p>
<p><em>Proof.</em> Let \( C’ := \frac{1}{2} C = \{ \frac{1}{2} c \, \vert \, c \in C \} \). Assume that there exists non-zero integer \( v \in \mathbb{Z}^d \setminus \{ 0 \} \), such that the intersection between \( C’ \) and its translation wrt. \( v \) is non-empty.</p>
<p>Pick arbitrary \( x \in C’ \cap (C’ + v) \). Then \( x - v \in C’ \) by construction. By symmetry, \( v - x \in C’ \). As \( C’ \) is convex, then line segment between \( x \) and \( v - x \) is in \( C’ \). We particularly consider the midpoint of the line segment: \( \frac{1}{2}x + \frac{1}{2} (v - x) = \frac{1}{2} v \in C’ \). This immediately implies that \( v \in C \) by the definition of \( C’ \), which proves the theorem.</p>
<p class="right">\( \square \)</p>
<p>The claim that there exists non-zero integer \( v \in \mathbb{Z}^d \setminus \{ 0 \} \), such that \( C’ \cap (C’ + v) \neq \emptyset \) is not proven in this post. One can refer to Matoušek’s book for the proof.</p>
<p><img src="/img/2018-07-24-minkowski-dirichlet/forest_minkowski.svg" alt="Minkowsi_forest" height="250px" width="250px" /></p>
<p>Given Minkowski’s Theorem, now we can answer our original question. We assume the trees are just lattice points, and our visibility line is now a visibility strip, which has wide of \( 0.16 \)m and length of \( 26 \)m. We note that the preconditions of Minkowski’s Theorem are satisfied by this visibility strip, which has the volume of \( \approx 4.16 > 4 = 2^d \). Therefore, there exists a lattice point other than the origin inside our visibility strip. Thus our vision outside is blocked by the tree.</p>
<p>Now we look at two theorems that can be proven using Minkowski’s Theorem. The first one is about approximation of real number with a rational.</p>
<p><strong>Theorem 2 (Dirichlet’s Approximation Theorem)</strong><br />
Let \( \alpha \in \mathbb{R} \). Then for all \( N \in \mathbb{N} \), there exists \( m \in \mathbb{Z}, n \in \mathbb{N} \) with \( n \leq N \) such that:</p>
\[\left \vert \, \alpha - \frac{m}{n} \right \vert \lt \frac{1}{nN} \enspace .\]
<p><em>Proof.</em> Consider \( C := \{ (x, y) \in \mathbb{R}^2 \, \vert \, -N-\frac{1}{2} \leq x \leq N+\frac{1}{2}, \vert \alpha x - y \vert \lt \frac{1}{N} \} \). By inspection on the figure below, we can observe that \( C \) is convex, bounded, and symmetric around the origin.</p>
<p><img src="/img/2018-07-24-minkowski-dirichlet/dirichlet.svg" alt="Dirichlet" height="400px" width="400px" /></p>
<p>Observe also that the area of \( C \) is \( \text{vol}(C) = \frac{2}{N} (2N + 1) = 4 + \frac{2}{N} \gt 4 = 2^d \). Thus this construction satisfied the Minkowski’s Theorem’s preconditions. Therefore there exists lattice point \( (n, m) \neq (0, 0) \). As \( C \) is symmetric, we can always assume \( n \gt 0 \) thus \( n \in \mathbb{N} \). By definition of \( C \), \( n \leq N+\frac{1}{2} \implies n \leq N \) as \( N \in \mathbb{N} \). Futhermore, we have \( \vert \alpha n - m \vert \lt \frac{1}{N} \). This implies \( \left\vert \alpha - \frac{m}{n} \right\vert \lt \frac{1}{nN} \) which conclude the proof.</p>
<p class="right">\( \square \)</p>
<p>Our second application is the theorem saying that prime number \( p \equiv 1 \, (\text{mod } 4) \) can be written as a sum of two squares. For this we need the General Minkowski’s Theorem, which allows us to use arbitrary basis for our lattice.</p>
<p><strong>Theorem 3 (General Minkowski’s Theorem)</strong>
Let \( C \subseteq \mathbb{R}^d \) be symmetric around the origin, convex, and bounded set. Let \( \Gamma \) be the lattice in \( \mathbb{R}^d \). If \( \text{vol}(C) > 2^d \,\text{vol}(\Gamma) = 2^d \det \Gamma \), then \( C \) contains at least one lattice point in \( \Gamma \) different from the origin.</p>
<p class="right">\( \square \)</p>
<p><strong>Theorem 4 (Two Squares Theorem)</strong><br />
Every prime number \( p \equiv 1 \, (\text{mod } 4) \) can be written by the sum of two squares \( p = a^2 + b^2 \) where \( a, b \in \mathbb{Z} \).</p>
<p><em>Proof.</em> We need intermediate result which will not be proven here (refer to [1] for the proof): \( -1 \) is a quadratic residue modulo \( p \), that is, there exists \( q \lt p \) such that \( q^2 \equiv -1 \, (\text{mod } p) \).</p>
<p>Fix \( q \) and take the following basis for our lattice: \( z_1 := (1, q), \, z_2 := (0, p) \). The volume of this lattice is: \( \det \Gamma = \det \begin{bmatrix} 1 & 0 \\ q & p \end{bmatrix} = p \).</p>
<p>Define a convex, symmetric, and bounded body \( C := \{ (x, y) \in \mathbb{R}^2 \, \vert \, x^2 + y^2 \lt 2p \} \), i.e. \( C \) is an open ball around the origin with radius \( \sqrt{2p} \). Note:</p>
\[\text{vol}(C) = \pi r^2 \approx 6.28p \gt 4p = 2^2 p = 2^d \det \Gamma \enspace ,\]
<p>thus General Minkowski’s Theorem applies and there exists a lattice point \( (a, b) = i z_1 + j z_2 = (i, iq + jp) \neq (0, 0) \). Notice:</p>
\[\begin{align}
a^2 + b^2 &= i^2 + i^2 q^2 + 2ijpq + j^2 p^2 \\
&\equiv i^2 + i^2q^2 \, (\text{mod } p) \\
&\equiv i^2(1+q^2) \, (\text{mod } p) \\
&\equiv i^2(1-1) \, (\text{mod } p) \\
&\equiv 0 \, (\text{mod } p) \enspace .
\end{align}\]
<p>To go from 3rd to 4th line, we use our very first assumption, i.e. \( q^2 \equiv -1 \, (\text{mod } p) \). Therefore \( a^2 + b^2 \) has to be divisible by \( p \). Also, as \( (a, b) \) is in \( C \) this implies \( a^2 + b^2 \lt 2p \) by definition. Thus the only choice is \( a^2 + b^2 = p \). This proves the theorem.</p>
<p class="right">\( \square \)</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Matoušek, Jiří. Lectures on discrete geometry. Vol. 212. New York: Springer, 2002.</li>
</ol>
Tue, 24 Jul 2018 20:30:00 +0200
https://agustinus.kristia.de/techblog/2018/07/24/minkowski-dirichlet/
https://agustinus.kristia.de/techblog/2018/07/24/minkowski-dirichlet/mathtechblogReduced Betti number of sphere: Mayer-Vietoris Theorem<p>In the <a href="/techblog/2018/07/18/brouwers-fixed-point/">previous post</a> about Brouwer’s Fixed Point Theorem, we used two black boxes. In this post we will prove the slight variation of those black boxes. We will start with the simplest lemma first: the reduced homology of balls.</p>
<p><strong>Lemma 2 (Reduced homology of balls)</strong><br />
Given a \( d \)-ball \( \mathbb{B}^d \), then its reduced \( p \)-th homology space is trivial, i.e. \(\tilde{H}_p(\mathbb{B}^d) = 0 \), for any \( d \) and \( p \).</p>
<p><em>Proof.</em> Observe that \( \mathbb{B}^d \) is contractible, i.e. homotopy equivalent to a point. Assuming we use coefficient \( \mathbb{Q} \), we know the zero-th homology space of point is \( H_0(\, \cdot \,, \mathbb{Q}) = \mathbb{Q} \), and trivial otherwise, i.e. \( H_p (\, \cdot \,, \mathbb{Q}) = 0 \enspace \forall p \geq 1 \).</p>
<p>In the reduced homology, therefore \( \tilde{H}_0(\, \cdot \,, \mathbb{Q}) = 0 \). Thus the reduced homology of balls is trivial for all \( d, p \).</p>
<p class="right">\( \square \)</p>
<p><strong>Corollary 1 (Reduced Betti numbers of balls)</strong><br />
The \( p \)-th reduced Betti numbers of \( \mathbb{B}^d \) is zero for all \(d, p\).</p>
<p class="right">\( \square \)</p>
<p>Now, we are ready to prove the main theme of this post.</p>
<p><strong>Lemma 1 (Reduced Betti numbers of spheres)</strong> <br />
Given a \( d \)-sphere \( \mathbb{S}^d \), then its \( p \)-th reduced Betti number is:</p>
\[\tilde{\beta}_p(\mathbb{S}^d) = \begin{cases} 1, & \text{if } p = d \\ 0, & \text{otherwise} \enspace . \end{cases}\]
<p><em>Proof.</em> We use “divide-and-conquer” approach to apply Mayer-Vietoris Theorem. We cut the sphere along the equator and note that the upper and lower portion of the sphere is just a disk, and the intersection between those two parts is a circle (sphere one dimension down), as shown in the figure below.</p>
<p><img src="/img/2018-07-23-mayer-vietoris-sphere/sphere.svg" alt="Sphere" height="350px" width="350px" /></p>
<p>By Mayer-Vietoris Theorem, we have a long exact sequence in the form of:</p>
\[\dots \longrightarrow \tilde{H}_p(\mathbb{S}^{d-1}) \longrightarrow \tilde{H}_p(\mathbb{B}^d) \oplus \tilde{H}_p(\mathbb{B}^d) \longrightarrow \tilde{H}_p(\mathbb{S}^d) \longrightarrow \tilde{H}_{p-1}(\mathbb{S}^{d-1}) \longrightarrow \dots \enspace .\]
<p>By Corollary 1, \( \tilde{H}_p(\mathbb{B}^d) \oplus \tilde{H}_p(\mathbb{B}^d) = \tilde{H}_{p-1}(\mathbb{B}^d) \oplus \tilde{H}_{p-1}(\mathbb{B}^d) = 0 \). As the sequence is exact, therefore \( \tilde{H}_p(\mathbb{S}^d) \longrightarrow \tilde{H}_{p-1}(\mathbb{S}^{d-1}) \) is a bijection, and thus an isomorphism. Then by induction with base case of \( \mathbb{S}^0 \), we conclude that the claim holds.</p>
<p class="right">\( \square \)</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Hatcher, Allen. “Algebraic topology.” (2001).</li>
</ol>
Mon, 23 Jul 2018 10:00:00 +0200
https://agustinus.kristia.de/techblog/2018/07/23/mayer-vietoris-sphere/
https://agustinus.kristia.de/techblog/2018/07/23/mayer-vietoris-sphere/mathtechblogBrouwer's Fixed Point Theorem: A Proof with Reduced Homology<p>This post is about the proof I found very interesting during the Topology course I took this semester. It highlights the application of Reduced Homology, which is a modification of Homology theory in Algebraic Topology. We will use two results from Reduced Homology as black-boxes for the proof. Everywhere, we will assume \( \mathbb{Q} \) is used as the coefficient of the Homology space.</p>
<p><strong>Lemma 1 (Reduced Homology of spheres)</strong>
Given a \( d \)-sphere \( \mathbb{S}^d \), then its reduced \( p \)-th Homology space is:</p>
\[\tilde{H}_p(\mathbb{S}^d) = \begin{cases} \mathbb{Q}, & \text{if } p = d \\ 0, & \text{otherwise} \enspace . \end{cases}\]
<p class="right">\( \square \)</p>
<p><strong>Lemma 2 (Reduced Homology of balls)</strong>
Given a \( d \)-ball \( \mathbb{B}^d \), then its reduced \( p \)-th Homology space is trivial, i.e. \(\tilde{H}_p(\mathbb{B}^d) = 0 \), for any \( d \) and \( p \).</p>
<p class="right">\( \square \)</p>
<p>Equipped with these lemmas, we are ready to prove the special case of Brouwer’s Fixed Point Theorem, where we consider map from a ball to itself.</p>
<p><strong>Brouwer’s Fixed Point Theorem</strong>
Given \( f: \mathbb{B}^{d+1} \to \mathbb{B}^{d+1} \) continuous, then there exists \( x
\in \mathbb{B}^{d+1} \) such that \( f(x) = x \).</p>
<p><em>Proof.</em> For contradiction, assume \( \forall x \in \mathbb{B}^{d+1}: f(x) \neq x \). We construct a map \( r: \mathbb{B}^{d+1} \to \mathbb{S}^d \), casting ray from the ball to its shell by extending the line segment between \( x \) and \( f(x) \).</p>
<p><img src="/img/2018-07-18-brouwers-fixed-point/map_r.svg" alt="Map r" height="200px" width="200px" /></p>
<p>Observe that \( r(x) \) is continuous because \( f(x) \) is. Also, \( x \in \mathbb{S}^d \implies r(x) = x \). Therefore we have the following commutative diagram.</p>
<p><img src="/img/2018-07-18-brouwers-fixed-point/comm_diag.svg" alt="Commutative Diagram" height="200px" width="200px" /></p>
<p>Above, \( i \) is inclusion map, and \( id \) is identity map. We then look of the Reduced Homology of the above, and this gives us the following commutative diagram.</p>
<p><img src="/img/2018-07-18-brouwers-fixed-point/comm_diag_hom.svg" alt="Commutative Diagram Homology" height="275px" width="275px" /></p>
<p>As the diagram commute, then \( \tilde{H}_d(\mathbb{S}^d) \xrightarrow{i^*} \tilde{H}_d(\mathbb{B}^{d+1}) \xrightarrow{r^*} \tilde{H}_d(\mathbb{S}^d) \) should be identity map on \( \tilde{H}_d(\mathbb{S}^d) \). By Lemma 2, \( \tilde{H}_d(\mathbb{B}^{d+1}) = 0 \). This implies \( \tilde{H}_d(\mathbb{S}^d) = 0 \). But this is a contradiction, as By Lemma 1, \( \tilde{H}_d(\mathbb{S}^d) = \mathbb{Q} \). Therefore there must be a fixed point.</p>
<p class="right">\( \square \)</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Hatcher, Allen. “Algebraic topology.” (2001).</li>
</ol>
Wed, 18 Jul 2018 10:00:00 +0200
https://agustinus.kristia.de/techblog/2018/07/18/brouwers-fixed-point/
https://agustinus.kristia.de/techblog/2018/07/18/brouwers-fixed-point/mathtechblogNatural Gradient Descent<p><a href="/techblog/2018/03/11/fisher-information/">Previously</a>, we looked at the Fisher Information Matrix. We saw that it is equal to the negative expected Hessian of log likelihood. Thus, the immediate application of Fisher Information Matrix is as drop-in replacement of Hessian in second order optimization algorithm. In this article, we will look deeper at the intuition on what excatly is the Fisher Information Matrix represents and what is the interpretation of it.</p>
<h2 class="section-heading">Distribution Space</h2>
<p>As per previous article, we have a probabilistic model represented by its likelihood \( p(x \vert \theta) \). We want to maximize this likelihood function to find the most likely parameter \( \theta \). Equivalent formulation would be to minimize the loss function \( \mathcal{L}(\theta) \), which is the negative log likelihood.</p>
<p>Usual way to solve this optimization is to use gradient descent. In this case, we are taking step which direction is given by \( -\nabla_\theta \mathcal{L}(\theta) \). This is the steepest descent direction around the local neighbourhood of the current value of \( \theta \) in the parameter space. Formally, we have</p>
\[\frac{-\nabla_\theta \mathcal{L}(\theta)}{\lVert \nabla_\theta \mathcal{L}(\theta) \rVert} = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \mathop{\text{arg min}}_{d \text{ s.t. } \lVert d \rVert \leq \epsilon} \mathcal{L}(\theta + d) \, .\]
<p>The above expression is saying that the steepest descent direction in parameter space is to pick a vector \( d \), such that the new parameter \( \theta + d \) is within the \( \epsilon \)-neighbourhood of the current parameter \( \theta \), and we pick \( d \) that minimize the loss. Notice the way we express this neighbourhood is by the means of Euclidean norm. Thus, the optimization in gradient descent is dependent to the Euclidean geometry of the parameter space.</p>
<p>Meanwhile, if our objective is to minimize the loss function (maximizing the likelihood), then it is natural that we taking step in the space of all possible likelihood, realizable by parameter \( \theta \). As the likelihood function itself is a probability distribution, we call this space distribution space. Thus it makes sense to take the steepest descent direction in this distribution space instead of parameter space.</p>
<p>Which metric/distance then do we need to use in this space? A popular choice would be KL-divergence. KL-divergence measure the “closeness” of two distributions. Although as KL-divergence is non-symmetric and thus not a true metric, we can use it anyway. This is because as \( d \) goes to zero, KL-divergence is asymptotically symmetric. So, within a local neighbourhood, KL-divergence is approximately symmetric [1].</p>
<p>We can see the problem when using only Euclidean metric in parameter space from the illustrations below. Consider a Gaussian parameterized by only its mean and keep the variance fixed to 2 and 0.5 for the first and second image respectively:</p>
<p><img src="/img/2018-03-14-natural-gradient/param_space_dist.png" alt="Param1" /></p>
<p><img src="/img/2018-03-14-natural-gradient/param_space_dist2.png" alt="Param2" /></p>
<p>In both images, the distance of those Gaussians are the same, i.e. 4, according to Euclidean metric (red line). However, clearly in distribution space, i.e. when we are taking into account the shape of the Gaussians, the distance is different in the first and second image. In the first image, the KL-divergence should be lower as there is more overlap between those Gaussians. Therefore, if we only work in parameter space, we cannot take into account this information about the distribution realized by the parameter.</p>
<!-- The other nice property of working in distribution space instead of parameter space is that in distribution space, it is invariant to parameterization of the distribution. As an illustration, consider a Gaussian. We can parametrize it with its covariance matrix or precision matrix. Covariance and precision matrix are different to each other (up to special condition, e.g. identity matrix), even though it induces the same Gaussian. Thus, a single point in distribution space are possibly mapped into two different points in the parameter space. If we work in distribution space, then we only care about the resulting Gaussian. -->
<h2 class="section-heading">Fisher Information and KL-divergence</h2>
<p>One question still needs to be answered is what exactly is the connection between Fisher Information Matrix and KL-divergence? It turns out, Fisher Information Matrix defines the local curvature in distribution space for which KL-divergence is the metric.</p>
<p><strong>Claim:</strong>
Fisher Information Matrix \( \text{F} \) is the Hessian of KL-divergence between two distributions \( p(x \vert \theta) \) and \( p(x \vert \theta’) \), with respect to \( \theta’ \), evaluated at \( \theta’ = \theta \).</p>
<p><em>Proof.</em> KL-divergence can be decomposed into entropy and cross-entropy term, i.e.:</p>
\[\text{KL} [p(x \vert \theta) \, \Vert \, p(x \vert \theta')] = \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \log p(x \vert \theta) ] - \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \log p(x \vert \theta') ] \, .\]
<p>The first derivative wrt. \( \theta’ \) is:</p>
\[\begin{align}
\nabla_{\theta'} \text{KL}[p(x \vert \theta) \, \Vert \, p(x \vert \theta')] &= \nabla_{\theta'} \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \log p(x \vert \theta) ] - \nabla_{\theta'} \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \log p(x \vert \theta') ] \\[5pt]
&= - \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \nabla_{\theta'} \log p(x \vert \theta') ] \\[5pt]
&= - \int p(x \vert \theta) \nabla_{\theta'} \log p(x \vert \theta') \, \text{d}x \, .
\end{align}\]
<p>The second derivative is:</p>
\[\begin{align}
\nabla_{\theta'}^2 \, \text{KL}[p(x \vert \theta) \, \Vert \, p(x \vert \theta')] &= - \int p(x \vert \theta) \, \nabla_{\theta'}^2 \log p(x \vert \theta') \, \text{d}x \\[5pt]
\end{align}\]
<p>Thus, the Hessian wrt. \( \theta’ \) evaluated at \( \theta’ = \theta \) is:</p>
\[\begin{align}
\text{H}_{\text{KL}[p(x \vert \theta) \, \Vert \, p(x \vert \theta')]} &= - \int p(x \vert \theta) \, \left. \nabla_{\theta'}^2 \log p(x \vert \theta') \right\vert_{\theta' = \theta} \, \text{d}x \\[5pt]
&= - \int p(x \vert \theta) \, \text{H}_{\log p(x \vert \theta)} \, \text{d}x \\[5pt]
&= - \mathop{\mathbb{E}}_{p(x \vert \theta)} [\text{H}_{\log p(x \vert \theta)}] \\[5pt]
&= \text{F} \, .
\end{align}\]
<p>The last line follows from <a href="/techblog/2018/03/11/fisher-information/">the previous article about Fisher Information Matrix</a>, in which we showed that the negative expected Hessian of log likelihood is the Fisher Information Matrix.</p>
<p class="right">\( \square \)</p>
<h2 class="section-heading">Steepest Descent in Distribution Space</h2>
<p>Now we are ready to use the Fisher Information Matrix to enhance the gradient descent. But first, we need to derive the Taylor series expansion for KL-divergence around \( \theta \).</p>
<p><strong>Claim:</strong>
Let \( d \to 0 \). The second order Taylor series expansion of KL-divergence is \( \text{KL}[p(x \vert \theta) \, \Vert \, p(x \vert \theta + d)] \approx \frac{1}{2} d^\text{T} \text{F} d \).</p>
<p><em>Proof.</em> We will use \( p_{\theta} \) as a notational shortcut for \( p(x \vert \theta) \). By definition, the second order Taylor series expansion of KL-divergence is:</p>
\[\begin{align}
\text{KL}[p_{\theta} \, \Vert \, p_{\theta + d}] &\approx \text{KL}[p_{\theta} \, \Vert \, p_{\theta}] + (\left. \nabla_{\theta'} \text{KL}[p_{\theta} \, \Vert \, p_{\theta'}] \right\vert_{\theta' = \theta})^\text{T} d + \frac{1}{2} d^\text{T} \text{F} d \\[5pt]
&= \text{KL}[p_{\theta} \, \Vert \, p_{\theta}] - \mathop{\mathbb{E}}_{p(x \vert \theta)} [ \nabla_\theta \log p(x \vert \theta) ]^\text{T} d + \frac{1}{2} d^\text{T} \text{F} d \\[5pt]
\end{align}\]
<p>Notice that the first term is zero as it is the same distribution. Furthermore, from the <a href="/techblog/2018/03/11/fisher-information/">previous article</a>, we saw that the expected value of the gradient of log likelihood, which is exactly the gradient of KL-divergence as shown in the previous proof, is also zero. Thus the only thing left is:</p>
\[\begin{align}
\text{KL}[p(x \vert \theta) \, \Vert \, p(x \vert \theta + d)] &\approx \frac{1}{2} d^\text{T} \text{F} d \, .
\end{align}\]
<p class="right">\( \square \)</p>
<p>Now, we would like to know what is update vector \( d \) that minimizes the loss function \( \mathcal{L} (\theta) \) in distribution space, so that we know in which direction decreases the KL-divergence the most. This is analogous to the method of steepest descent, but in distribution space with KL-divergence as metric, instead of the usual parameter space with Euclidean metric. For that, we do this minimization:</p>
\[d^* = \mathop{\text{arg min}}_{d \text{ s.t. } \text{KL}[p_\theta \Vert p_{\theta + d}] = c} \mathcal{L} (\theta + d) \, ,\]
<p>where \( c \) is some constant. The purpose of fixing the KL-divergence to some constant is to make sure that we move along the space with constant speed, regardless the curvature. Further benefit is that this makes the algorithm more robust to the reparametrization of the model, i.e. the algorithm does not care how the model is parametrized, it only cares about the distribution induced by the parameter [3].</p>
<p>If we write the above minimization in Lagrangian form, with constraint KL-divergence approximated by its second order Taylor series expansion and approximate \( \mathcal{L}(\theta + d) \) with its first order Taylor series expansion, we get:</p>
\[\begin{align}
d^* &= \mathop{\text{arg min}}_d \, \mathcal{L} (\theta + d) + \lambda \, (\text{KL}[p_\theta \Vert p_{\theta + d}] - c) \\
&\approx \mathop{\text{arg min}}_d \, \mathcal{L}(\theta) + \nabla_\theta \mathcal{L}(\theta)^\text{T} d + \frac{1}{2} \lambda \, d^\text{T} \text{F} d - \lambda c \, .
\end{align}\]
<p>To solve this minimization, we set its derivative wrt. \( d \) to zero:</p>
\[\begin{align}
0 &= \frac{\partial}{\partial d} \mathcal{L}(\theta) + \nabla_\theta \mathcal{L}(\theta)^\text{T} d + \frac{1}{2} \lambda \, d^\text{T} \text{F} d - \lambda c \\[5pt]
&= \nabla_\theta \mathcal{L}(\theta) + \lambda \, \text{F} d \\[5pt]
\lambda \, \text{F} d &= -\nabla_\theta \mathcal{L}(\theta) \\[5pt]
d &= -\frac{1}{\lambda} \text{F}^{-1} \nabla_\theta \mathcal{L}(\theta) \\[5pt]
\end{align}\]
<p>Up to constant factor of \( \frac{1}{\lambda} \), we get the optimal descent direction, i.e. the opposite direction of gradient while taking into account the local curvature in distribution space defined by \( \text{F}^{-1} \). We can absorb this constant factor into the learning rate.</p>
<p><strong>Definition:</strong>
Natural gradient is defined as</p>
\[\tilde{\nabla}_\theta \mathcal{L}(\theta) = \text{F}^{-1} \nabla_\theta \mathcal{L}(\theta) \, .\]
<p class="right">\( \square \)</p>
<p>As corollary, we have the following algorithm:</p>
<p><strong>Algorithm: Natural Gradient Descent</strong></p>
<ol>
<li>Repeat:
<ol>
<li>Do forward pass on our model and compute loss \( \mathcal{L}(\theta) \).</li>
<li>Compute the gradient \( \nabla_\theta \mathcal{L}(\theta) \).</li>
<li><a href="/techblog/2018/03/11/fisher-information/">Compute the Fisher Information Matrix</a> \( \text{F} \), or its empirical version (wrt. our training data).</li>
<li>Compute the natural gradient \( \tilde{\nabla}_\theta \mathcal{L}(\theta) = \text{F}^{-1} \nabla_\theta \mathcal{L}(\theta) \).</li>
<li>Update the parameter: \( \theta = \theta - \alpha \, \tilde{\nabla}_\theta \mathcal{L}(\theta) \), where \( \alpha \) is the learning rate.</li>
</ol>
</li>
<li>Until convergence.</li>
</ol>
<!-- <h2 class="section-heading">Simple Implementation Example</h2>
**Remark.** _The implementation below is based on the empirical FIM, thus does not reflect the true natural gradient. (See <https://arxiv.org/abs/1905.12558>.) To use the true FIM, one need to take the expectation of the outer product of the gradient w.r.t. the predictive distribution. For example, one can do Monte Carlo approximation by drawing random labels to compute the loss and subsequently compute the FIM. Note that the calculation of the vanilla gradient remains unchanged._
Let's consider logistic regression problem. The training data is drawn from a mixture of Gaussians centered at \\( (-1, -1) \\) and \\( (1, 1) \\). We assign different labels for each mode. The code is as follows:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.utils</span> <span class="kn">import</span> <span class="n">shuffle</span>
<span class="n">X0</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">X1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">X0</span><span class="p">,</span> <span class="n">X1</span><span class="p">])</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">([</span><span class="mi">100</span><span class="p">,</span> <span class="mi">1</span><span class="p">]),</span> <span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">([</span><span class="mi">100</span><span class="p">,</span> <span class="mi">1</span><span class="p">])])</span>
<span class="n">X</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="n">shuffle</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">t</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:</span><span class="mi">150</span><span class="p">],</span> <span class="n">X</span><span class="p">[</span><span class="mi">150</span><span class="p">:]</span>
<span class="n">t_train</span><span class="p">,</span> <span class="n">t_test</span> <span class="o">=</span> <span class="n">t</span><span class="p">[:</span><span class="mi">150</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">150</span><span class="p">:]</span></code></pre></figure>
Next, we consider our model. It is a simple linear model (without bias) with sigmoid output. Thus naturally, we use binary cross entropy loss:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Initialize weight
</span><span class="n">W</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.01</span>
<span class="k">def</span> <span class="nf">sigm</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="mi">1</span><span class="o">/</span><span class="p">(</span><span class="mi">1</span><span class="o">+</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">NLL</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">t</span><span class="p">):</span>
<span class="k">return</span> <span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">t</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">t</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">y</span><span class="p">))</span></code></pre></figure>
Inside the training loop, the forward pass looks like:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Forward
</span><span class="n">z</span> <span class="o">=</span> <span class="n">X_train</span> <span class="o">@</span> <span class="n">W</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">sigm</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">NLL</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">t_train</span><span class="p">)</span>
<span class="c1"># Loss
</span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Loss: </span><span class="si">{</span><span class="n">loss</span><span class="p">:.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">'</span><span class="p">)</span></code></pre></figure>
The gradient of the loss function wrt. parameter \\( w \\) is then as follows:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">dy</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="o">-</span><span class="n">t_train</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">m</span> <span class="o">*</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">))</span>
<span class="n">dz</span> <span class="o">=</span> <span class="n">sigm</span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">sigm</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>
<span class="n">dW</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="p">(</span><span class="n">dz</span> <span class="o">*</span> <span class="n">dy</span><span class="p">)</span></code></pre></figure>
At this point we are ready to do update step for vanilla gradient descent:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">W</span> <span class="o">=</span> <span class="n">W</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">dW</span></code></pre></figure>
For natural gradient descent, we need some extra works. Firstly we need to compute the gradient of log likelihood wrt. \\( w \\), without summing, as we will do this when we compute the covariance.
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">grad_loglik_z</span> <span class="o">=</span> <span class="p">(</span><span class="n">t_train</span><span class="o">-</span><span class="n">y</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">dz</span>
<span class="n">grad_loglik_W</span> <span class="o">=</span> <span class="n">grad_loglik_z</span> <span class="o">*</span> <span class="n">X_train</span></code></pre></figure>
The Empirical Fisher is given by the empirical covariance matrix of the gradient of log likelihood wrt. our training data:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">F</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">cov</span><span class="p">(</span><span class="n">grad_loglik_W</span><span class="p">.</span><span class="n">T</span><span class="p">)</span></code></pre></figure>
To do the update step, we need to take the product of \\( \text{F}^{-1} \\) with the gradient of loss:
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">W</span> <span class="o">=</span> <span class="n">W</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">F</span><span class="p">)</span> <span class="o">@</span> <span class="n">dW</span></code></pre></figure>
The complete script to reproduce this can be found at:
<https://gist.github.com/wiseodd/1c9f5006310f5ee03bd4682b4c03020a>.
How good is natural gradient descent compared to the vanilla gradient descent? Below are the comparison of loss value after five iterations, averaged over 100 repetitions.
| Method | Mean loss | Std. loss |
|:---|:---:|:---:|
| Natural Gradient Descent | **0.1823** | **0.0814** |
| Vanilla Gradient Descent | 0.4058 | 0.106 |
{:.table-bordered}
At least in this very simple setting, natural gradient descent converges twice as fast as the vanilla counterpart. Furthermore, it converges faster consistently, as shown by the standard deviation. -->
<h2 class="section-heading">Discussion</h2>
<p>In the above very simple model with low amount of data, we saw that we can implement natural gradient descent easily. But how easy is it to do this in the real world? As we know, the number of parameters in deep learning models is very large, within millions of parameters. The Fisher Information Matrix for these kind of models is then infeasible to compute, store, or invert. This is the same problem as why second order optimization methods are not popular in deep learning.</p>
<p>One way to get around this problem is to approximate the Fisher/Hessian instead. Method like ADAM [4] computes the running average of first and second moment of the gradient. First moment can be seen as momentum which is not our interest in this article. The second moment is approximating the Fisher Information Matrix, but constrainting it to be diagonal matrix. Thus in ADAM, we only need \( O(n) \) space to store (the approximation of) \( \text{F} \) instead of \( O(n^2) \) and the inversion can be done in \( O(n) \) instead of \( O(n^3) \). In practice ADAM works really well and is currently the <em>de facto</em> standard for optimizing deep neural networks.</p>
<h2 class="section-heading">References</h2>
<ol>
<li>Martens, James. “New insights and perspectives on the natural gradient method.” arXiv preprint arXiv:1412.1193 (2014).</li>
<li>Ly, Alexander, et al. “A tutorial on Fisher information.” Journal of Mathematical Psychology 80 (2017): 40-55.</li>
<li>Pascanu, Razvan, and Yoshua Bengio. “Revisiting natural gradient for deep networks.” arXiv preprint arXiv:1301.3584 (2013).</li>
<li>Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).</li>
</ol>
Wed, 14 Mar 2018 07:00:00 +0100
https://agustinus.kristia.de/techblog/2018/03/14/natural-gradient/
https://agustinus.kristia.de/techblog/2018/03/14/natural-gradient/machine learningtechblog