
The Invariance of the Hessian and Its Eigenvalues, Determinant, and Trace

Let $f_\theta$ be a neural network, defined by its parameters $\theta \in \mathbb{R}^d$. Suppose $\mathcal{L}$ is a loss function defined on the $d$-dimensional parameter space of $f_\theta$ and let $\theta^*$ be a minimum of $\mathcal{L}$. Suppose further that $\varphi$ is a reparametrization, i.e., a differentiable map with a differentiable inverse, mapping $\theta \mapsto \psi := \varphi(\theta)$.

Suppose we transform $\theta$ into $\psi = \varphi(\theta)$. The consensus in the deep learning field regarding the Hessian matrix $H$ of $\mathcal{L}$ at $\theta^*$ is that:

  1. The eigenvalues of $H$ are not invariant.
  2. The determinant of $H$ is not invariant.
  3. The trace of $H$ is not invariant.
  4. Seen as a bilinear map, the Hessian is not invariant outside the critical points of $\mathcal{L}$.

In this post, we shall see that these quantities are actually invariant under reparametrization! Although the argument comes from Riemannian geometry, it holds even under the default assumptions of calculus, which is the standard setting assumed by deep learning algorithms and practitioners.

Note. Throughout this post, we use the Einstein summation convention. That is, we sum over an index that appears once as an upper index and once as a lower index, while omitting the summation symbol. For example, $H_{ij} v^i w^j$ corresponds to $\sum_i \sum_j H_{ij} v^i w^j$; meanwhile, the index $i$ in the partial derivative $\partial \mathcal{L} / \partial \theta^i$ counts as a lower index.

The Hessian as a Bilinear Map

In calculus, the Hessian matrix at $\theta$ is defined by

$$H_{ij} := \frac{\partial^2 \mathcal{L}}{\partial \theta^i \, \partial \theta^j} .$$

The Hessian matrix defines a bilinear function, i.e., given arbitrary vectors $v, w$ in $\mathbb{R}^d$, we can write a function $H(v, w) = H_{ij} v^i w^j$. For example, this term comes up in the 2nd-order Taylor expansion of $\mathcal{L}$ at $\theta^*$:

$$\mathcal{L}(\theta) \approx \mathcal{L}(\theta^*) + \frac{1}{2} H_{ij} \, d^i d^j ,$$

where we have defined $d := \theta - \theta^*$ (the first-order term vanishes because $\theta^*$ is a minimum).
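To make this concrete, here is a minimal numerical sketch (the toy loss and every name in it are my own, not from the post), using JAX to evaluate the Hessian at a minimum and apply it as a bilinear map:

```python
# A minimal sketch with a toy loss (my own example): jax.hessian gives H_ij at theta*,
# and the bilinear map H(v, w) = H_ij v^i w^j is just v^T H w.
import jax
import jax.numpy as jnp

def loss(theta):
    # Toy convex loss with a minimum at theta* = (1, 2); its Hessian there is [[2, 1], [1, 4]].
    d0, d1 = theta[0] - 1.0, theta[1] - 2.0
    return d0**2 + d0 * d1 + 2.0 * d1**2 + d1**4

theta_star = jnp.array([1.0, 2.0])
H = jax.hessian(loss)(theta_star)   # H_ij = second partial derivatives of the loss

v = jnp.array([0.3, -1.2])
w = jnp.array([2.0, 0.5])
print(v @ H @ w)                    # the bilinear map H(v, w)
```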

Under the reparametrization $\psi = \varphi(\theta)$ with $\theta = \varphi^{-1}(\psi)$, we have $\tilde{\mathcal{L}}(\psi) := \mathcal{L}(\varphi^{-1}(\psi)) = \mathcal{L}(\theta)$. Thus, by the chain and product rules, the Hessian becomes

$$\tilde{H}_{kl} = \frac{\partial^2 \tilde{\mathcal{L}}}{\partial \psi^k \, \partial \psi^l} = \frac{\partial \theta^i}{\partial \psi^k} \frac{\partial \theta^j}{\partial \psi^l} \frac{\partial^2 \mathcal{L}}{\partial \theta^i \, \partial \theta^j} + \frac{\partial^2 \theta^i}{\partial \psi^k \, \partial \psi^l} \frac{\partial \mathcal{L}}{\partial \theta^i} .$$

However, notice that if we evaluate at a minimum $\theta^*$, the gradient $\partial \mathcal{L} / \partial \theta^i$ is zero and the second term vanishes. And so, we have

$$\tilde{H}_{kl} = \frac{\partial \theta^i}{\partial \psi^k} \frac{\partial \theta^j}{\partial \psi^l} H_{ij} .$$

Meanwhile, if $v$ and $w$ are vectors at $\theta^*$, their components become

$$\tilde{v}^k = \frac{\partial \psi^k}{\partial \theta^i} v^i , \qquad \tilde{w}^l = \frac{\partial \psi^l}{\partial \theta^j} w^j ,$$

because the Jacobian of the reparametrization (i.e. the change of coordinates) defines a change of basis.

Notice that $\partial \psi^k / \partial \theta^i$ is the inverse of $\partial \theta^i / \partial \psi^k$. Considering the transformed $\tilde{H}$, $\tilde{v}$, and $\tilde{w}$, the bilinear map then becomes

$$\tilde{H}(\tilde{v}, \tilde{w}) = \tilde{H}_{kl} \, \tilde{v}^k \tilde{w}^l = \frac{\partial \theta^i}{\partial \psi^k} \frac{\partial \theta^j}{\partial \psi^l} H_{ij} \, \frac{\partial \psi^k}{\partial \theta^a} v^a \, \frac{\partial \psi^l}{\partial \theta^b} w^b = \delta^i_a \delta^j_b \, H_{ij} \, v^a w^b = H_{ab} \, v^a w^b$$

under the reparametrization $\varphi$. Since all those indices $a$, $b$ are simply dummy indices, the last expression is equivalent to $H(v, w) = H_{ij} v^i w^j$. Since $\varphi$, $v$, and $w$ are arbitrary, this implies that, seen as a bilinear map, the Hessian at a minimum is invariant under reparametrization.
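This claim can be checked numerically. The following sketch (toy loss and reparametrization of my own choosing, not from the post) builds $\tilde{\mathcal{L}} = \mathcal{L} \circ \varphi^{-1}$, transforms $v$ and $w$ with the Jacobian of $\varphi$, and verifies that $\tilde{H}(\tilde{v}, \tilde{w}) = H(v, w)$ at the minimum:

```python
# Numerical check (toy example): at the minimum theta*, the Hessian acts as an
# invariant bilinear map under a reparametrization psi = phi(theta).
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)   # double precision so the comparison is clean

def loss(theta):
    d0, d1 = theta[0] - 1.0, theta[1] - 2.0
    return d0**2 + d0 * d1 + 2.0 * d1**2 + d1**4

def phi(theta):        # reparametrization psi = phi(theta)
    return jnp.array([jnp.exp(theta[0]), theta[0] + theta[1] ** 3])

def phi_inv(psi):      # its differentiable inverse theta = phi^{-1}(psi)
    theta0 = jnp.log(psi[0])
    return jnp.array([theta0, (psi[1] - theta0) ** (1.0 / 3.0)])  # fine here: argument > 0

loss_tilde = lambda psi: loss(phi_inv(psi))   # L~(psi) = L(phi^{-1}(psi))

theta_star = jnp.array([1.0, 2.0])
psi_star = phi(theta_star)

H = jax.hessian(loss)(theta_star)
H_tilde = jax.hessian(loss_tilde)(psi_star)

v, w = jnp.array([0.3, -1.2]), jnp.array([2.0, 0.5])
J_phi = jax.jacobian(phi)(theta_star)         # d psi / d theta: pushes vector components forward
v_tilde, w_tilde = J_phi @ v, J_phi @ w

print(v @ H @ w)                     # H(v, w)
print(v_tilde @ H_tilde @ w_tilde)   # H~(v~, w~): agrees up to floating-point error
```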

The Non-Invariance of the Hessian

While the Hessian, as a bilinear map at a minimum, is (functionally) invariant, some of its downstream quantities are not. Let us illustrate this using the determinant—one can also easily show similar results for trace and eigenvalues.

First, recall that the components of the Hessian transform as follows under a reparametrization $\psi = \varphi(\theta)$:

$$\tilde{H}_{kl} = \frac{\partial \theta^i}{\partial \psi^k} \frac{\partial \theta^j}{\partial \psi^l} H_{ij} .$$

In matrix notation, this is $\tilde{H} = J^\top H J$, where $J := \partial \theta / \partial \psi$ is the Jacobian of $\varphi^{-1}$. (The dependence on $\theta^*$ is omitted for simplicity.) Then, the determinant of $\tilde{H}$ is

$$\det \tilde{H} = (\det J)^2 \det H .$$

Thus, in general, $\det \tilde{H} \neq \det H$. Hence the determinant of the Hessian is not invariant. This causes problems in deep learning: for instance, Dinh et al. 2017 argue that one cannot study the connection between flatness and generalization performance at a minimum of $\mathcal{L}$.
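Numerically, the mismatch is exactly the $(\det J)^2$ factor. Here is a sketch with the same kind of toy loss as before and a simple invertible reparametrization of my own choosing:

```python
# Numerical sketch (toy example): det H is not invariant; it scales by (det J)^2.
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def loss(theta):
    d0, d1 = theta[0] - 1.0, theta[1] - 2.0
    return d0**2 + d0 * d1 + 2.0 * d1**2 + d1**4

def phi(theta):        # psi = phi(theta) = (exp(theta_0), 2 * theta_1)
    return jnp.array([jnp.exp(theta[0]), 2.0 * theta[1]])

def phi_inv(psi):      # theta = phi^{-1}(psi)
    return jnp.array([jnp.log(psi[0]), 0.5 * psi[1]])

theta_star = jnp.array([1.0, 2.0])
psi_star = phi(theta_star)

H = jax.hessian(loss)(theta_star)
H_tilde = jax.hessian(lambda psi: loss(phi_inv(psi)))(psi_star)
J = jax.jacobian(phi_inv)(psi_star)               # J = d theta / d psi

print(jnp.linalg.det(H))                          # det H  (= 7 here)
print(jnp.linalg.det(H_tilde))                    # det H~ : a different number
print(jnp.linalg.det(J) ** 2 * jnp.linalg.det(H)) # equals det H~, i.e. (det J)^2 det H
```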

The Riemannian Hessian

From the Riemannian-geometric perspective, the components of the Hessian of $\mathcal{L}$ are defined under coordinates/parametrization $\theta$ as:

$$H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial \theta^i \, \partial \theta^j} - \Gamma^k_{ij} \frac{\partial \mathcal{L}}{\partial \theta^k} ,$$

where $\Gamma^k_{ij}$ is a three-dimensional array that represents the Levi-Civita connection (or any other connection) on the tangent spaces of $\mathbb{R}^d$, seen as a Riemannian manifold. In the calculus case, where the Euclidean metric and the Cartesian coordinates are assumed by default, $\Gamma^k_{ij}$ vanishes identically; hence the previous definition of the Hessian. This also shows that the Riemannian Hessian is a generalization of the standard Hessian.

Under a reparametrization $\psi = \varphi(\theta)$, the connection coefficients transform as follows:

$$\tilde{\Gamma}^m_{kl} = \frac{\partial \psi^m}{\partial \theta^c} \frac{\partial \theta^a}{\partial \psi^k} \frac{\partial \theta^b}{\partial \psi^l} \Gamma^c_{ab} + \frac{\partial \psi^m}{\partial \theta^c} \frac{\partial^2 \theta^c}{\partial \psi^k \, \partial \psi^l} .$$

And thus, combined with the transformation of the “calculus Hessian” (i.e. the second partial derivatives) from the previous section, the Riemannian Hessian transforms as (the inhomogeneous second-derivative terms from the two transformations cancel exactly):

$$\tilde{H}_{kl} = \frac{\partial \theta^i}{\partial \psi^k} \frac{\partial \theta^j}{\partial \psi^l} H_{ij} .$$

Note that while this transformation rule is very similar to the transformation of the “calculus Hessian” at a critical point, the transformation rule of the Riemannian Hessian applies everywhere on $\mathbb{R}^d$.

This means that, seen as a bilinear map, the Riemannian Hessian is invariant everywhere on $\mathbb{R}^d$, not just at the critical points as before. How does this discrepancy happen? It happens because we ignore $\Gamma^k_{ij}$ in calculus! This is, of course, justified since $\Gamma^k_{ij} = 0$ under the Euclidean metric in Cartesian coordinates. But as can be seen from its transformation rule, under a reparametrization $\psi = \varphi(\theta)$, this quantity is non-zero in general in the $\psi$ parametrization; this is already true for the simple, common transformation between Cartesian and polar coordinates, as the sketch below illustrates.
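Here is a sketch of that polar-coordinate example (the helper names are mine; the coefficients are computed numerically from the pullback metric): the connection coefficients are identically zero in Cartesian coordinates but not in polar coordinates, even though both describe the same flat plane.

```python
# Sketch: Levi-Civita connection coefficients Gamma^k_ij of the flat plane, computed
# from the metric via Gamma^k_ij = 1/2 G^{kl} (d_i G_lj + d_j G_li - d_l G_ij).
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def polar_to_cartesian(u):          # u = (r, angle) -> (x, y)
    r, a = u
    return jnp.array([r * jnp.cos(a), r * jnp.sin(a)])

def metric(u):                      # pullback of the Euclidean metric: G = J^T J = diag(1, r^2)
    J = jax.jacobian(polar_to_cartesian)(u)
    return J.T @ J

def christoffel(u):
    G_inv = jnp.linalg.inv(metric(u))
    dG = jax.jacobian(metric)(u)    # dG[i, j, m] = d G_ij / d u^m
    return 0.5 * (jnp.einsum("kl,lji->kij", G_inv, dG)
                  + jnp.einsum("kl,lij->kij", G_inv, dG)
                  - jnp.einsum("kl,ijl->kij", G_inv, dG))

Gamma = christoffel(jnp.array([2.0, 0.3]))   # evaluate at r = 2
print(Gamma[0, 1, 1])   # Gamma^r_{aa} = -r  = -2.0
print(Gamma[1, 0, 1])   # Gamma^a_{ra} = 1/r = 0.5
# Running the same code with the identity map in place of polar_to_cartesian gives all zeros.
```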

The Invariance of the Hessian Eigenvalues, Determinant, and Trace

Let us focus on the determinant of the Hessian. As discussed above, it is not invariant. This is true even if the Riemannian Hessian above is used. How do we make sense of this?

To make sense of this, we need to fully understand the object we care about when we talk about the determinant of the Hessian as a measure of the flatness of the loss landscape of $\mathcal{L}$.

The loss landscape of $\mathcal{L}$ is the graph of $\mathcal{L}$, i.e., the set $\{ (\theta, \mathcal{L}(\theta)) : \theta \in \mathbb{R}^d \}$. This is actually a $d$-dimensional hypersurface embedded in $\mathbb{R}^{d+1}$. In particular, a hypersurface is a manifold. Meanwhile, the concept of “sharpness” or “flatness” of the loss landscape of $\mathcal{L}$ is nothing but the curvatures of this manifold, particularly the principal curvatures, the Gaussian curvature, and the mean curvature.

These curvatures can actually be derived from the Hessian of $\mathcal{L}$ since this Hessian is the second fundamental form of that manifold. (See that previous post!) However, to obtain those curvatures, we must first derive the shape operator with the help of the metric. (The shape operator is a linear operator, mapping a vector to a vector.) Suppose the matrix representation of the metric on $\mathbb{R}^d$ is $G$. Then, the shape operator is given by

$$S := G^{-1} H .$$

The principal, Gaussian, and mean curvatures of the loss landscape are then the eigenvalues, determinant, and trace of $S$, respectively. The reason why we can simply take the eigenvalues, determinant, or trace of the Hessian in calculus is because, by default, $G$ is assumed to be the identity matrix $I$, i.e. the Euclidean metric. That is, $S = H$ and we can ignore the factor $G^{-1}$ above.

But notice that under a reparametrization $\psi = \varphi(\theta)$, the metric transforms as

$$\tilde{G} = J^\top G J .$$

So, even when $G = I$ in the $\theta$ parametrization, the matrix representation $\tilde{G} = J^\top J$ of the metric is in general different from $I$ in the $\psi$ parametrization! That is, we must not ignore the metric in the shape operator, however trivial it might be, if we care about reparametrization. This is the cause of the non-invariance of the Hessian’s eigenvalues, determinant, and trace observed in deep learning!
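As a quick numerical illustration (same toy reparametrization style as before, my own example): starting from the Euclidean metric $G = I$ in the $\theta$ parametrization, the induced metric $\tilde{G} = J^\top J$ in the $\psi$ parametrization is far from the identity.

```python
# Sketch: even if G = I in the theta parametrization, the metric G~ = J^T J in the
# psi parametrization is not the identity (toy reparametrization of my own).
import jax
import jax.numpy as jnp

def phi(theta):                         # psi = phi(theta) = (exp(theta_0), 2 * theta_1)
    return jnp.array([jnp.exp(theta[0]), 2.0 * theta[1]])

def phi_inv(psi):                       # theta = phi^{-1}(psi)
    return jnp.array([jnp.log(psi[0]), 0.5 * psi[1]])

psi_star = phi(jnp.array([1.0, 2.0]))   # image of theta* = (1, 2) under phi
J = jax.jacobian(phi_inv)(psi_star)     # J = d theta / d psi
G_tilde = J.T @ jnp.eye(2) @ J          # G~ = J^T G J with G = I
print(G_tilde)                          # approximately diag(1/e^2, 0.25): clearly not the identity
```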

First, let us see the transformation of the shape operator by combining the transformation rules of $H$ and $G$:

$$\tilde{S} = \tilde{G}^{-1} \tilde{H} = (J^\top G J)^{-1} (J^\top H J) = J^{-1} G^{-1} J^{-\top} J^\top H J = J^{-1} G^{-1} H J = J^{-1} S J .$$

If we take the determinant of both sides, we have:

$$\det \tilde{S} = \det(J^{-1}) \det S \det J = \det S .$$

That is, the determinant of the Hessian, seen as a shape operator, is invariant!

What about the trace of $S$? Recall that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$. Using this property and the transformation of $S$ above, we have:

$$\operatorname{tr} \tilde{S} = \operatorname{tr}(J^{-1} S J) = \operatorname{tr}(S J J^{-1}) = \operatorname{tr} S ,$$

and so the trace is also invariant.

Finally, we can also show a general invariance result for the eigenvalues. Recall that $\lambda$ is an eigenvalue of the linear operator $S$ if $S v = \lambda v$ for an eigenvector $v \neq 0$.

Let $(\lambda, v)$ be an eigenpair of $S$ in the $\theta$ parametrization and $(\tilde{\lambda}, \tilde{v})$ be the corresponding eigenpair of $\tilde{S}$ in the $\psi$ parametrization. We want to show that $\tilde{\lambda} = \lambda$. Recall that vectors are transformed by multiplying them with the Jacobian of $\varphi$, so $\tilde{v} = J^{-1} v$. Therefore:

$$\tilde{S} \tilde{v} = (J^{-1} S J)(J^{-1} v) = J^{-1} S v = J^{-1} \lambda v = \lambda \tilde{v} ,$$

where we have used $J J^{-1} = I$ and $S v = \lambda v$; recall that $J$ is invertible since $\varphi$ is.

Therefore, we identify that $\tilde{\lambda} = \lambda$. Since $\lambda$ is an arbitrary eigenvalue, we conclude that all eigenvalues of $S$ are invariant.
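Putting it all together numerically (again with my own toy loss and reparametrization, not the post's): the eigenvalues, determinant, and trace of the shape operator $S = G^{-1} H$ computed in the $\psi$ parametrization match those computed in the original $\theta$ parametrization, where $G = I$.

```python
# Numerical check (toy example): eigenvalues, determinant, and trace of the shape
# operator S = G^{-1} H agree between the theta and psi parametrizations.
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def loss(theta):
    d0, d1 = theta[0] - 1.0, theta[1] - 2.0
    return d0**2 + d0 * d1 + 2.0 * d1**2 + d1**4

def phi(theta):        # psi = phi(theta)
    return jnp.array([jnp.exp(theta[0]), 2.0 * theta[1]])

def phi_inv(psi):      # theta = phi^{-1}(psi)
    return jnp.array([jnp.log(psi[0]), 0.5 * psi[1]])

theta_star = jnp.array([1.0, 2.0])
psi_star = phi(theta_star)

# theta parametrization: Euclidean metric, so S = H.
H = jax.hessian(loss)(theta_star)
S = H

# psi parametrization: transform both the Hessian and the metric, then form S~ = G~^{-1} H~.
H_tilde = jax.hessian(lambda psi: loss(phi_inv(psi)))(psi_star)
J = jax.jacobian(phi_inv)(psi_star)          # J = d theta / d psi
G_tilde = J.T @ J                            # G~ = J^T I J
S_tilde = jnp.linalg.solve(G_tilde, H_tilde) # S~ = G~^{-1} H~

print(jnp.linalg.eigvals(S), jnp.linalg.eigvals(S_tilde))   # same eigenvalues (order may differ)
print(jnp.linalg.det(S), jnp.linalg.det(S_tilde))           # same determinant
print(jnp.trace(S), jnp.trace(S_tilde))                     # same trace
```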

Non-Invariance from the Tensor Analysis Viewpoint

In tensor analysis, this issue is very easy to identify. First, the Hessian represents a bilinear map, so it is a covariant 2-tensor. Meanwhile, when we talk about eigenvalues, we refer to the spectral theorem and this theorem applies to linear maps. So, there is a type mismatch here.

To apply the spectral theorem to the Hessian, we need to express it as a linear map. This can be done by viewing the Hessian as a linear map from the tangent space to itself, which is a 1-contravariant 1-covariant tensor. That is, we need to “raise” one of the indices of $H_{ij}$. How do we do this? You guessed it: multiply with the inverse of the metric, which gives exactly the shape operator $S = G^{-1} H$ from before.

Conclusion

The reason why “flatness measures” derived from the calculus version of the Hessian are not invariant is simply that we compute them from the wrong object. The correct object is the shape operator, which is obtained with the help of the metric (even when the latter is Euclidean).

Moreover, the reason why Newton’s method is not invariant (see Sec. 12 of Martens, 2020) is that we ignore the second term involving the connection coefficients $\Gamma^k_{ij}$.

Ignoring those geometric quantities is totally justified in calculus and deep learning since we always assume the Euclidean metric along with Cartesian coordinates. But this simplification makes us “forget” the correct transformation rules of the Hessian, giving rise to the pathological non-invariance issues observed in deep learning.