Let $p(x; \theta)$ be a probability density on a sample space $\mathcal{X}$, parametrized by $\theta \in \Theta \subseteq \mathbb{R}^d$. The Fisher information is defined by

$$
F_{ij}(\theta) := \mathbb{E}_{p(x; \theta)}\!\left[ \partial_i \log p(x; \theta)\, \partial_j \log p(x; \theta) \right] ,
$$

where $\partial_i := \partial / \partial \theta^i$ for each $i = 1, \dots, d$. Note that $F(\theta)$ is positive semi-definite because one can see it as the (expected) outer product of the gradient of the log-density with itself.
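As a quick sanity check of this definition, here is a minimal numpy sketch (a Monte Carlo estimate, with an arbitrary Bernoulli model as the running example) that approximates $F(\theta)$ as the expected outer product of the score and compares it against the closed form $1/(\theta(1-\theta))$:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3
x = rng.binomial(1, theta, size=1_000_000)   # samples from Bernoulli(theta)

# Score: d/dtheta log p(x; theta) for p(x; theta) = theta^x (1 - theta)^(1 - x)
score = x / theta - (1 - x) / (1 - theta)

fisher_mc = np.mean(score**2)                # E[score^2]; the 1x1 "outer product"
fisher_exact = 1 / (theta * (1 - theta))

print(fisher_mc, fisher_exact)               # both approximately 4.76
```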
Let $y = s(x)$, with $s: \mathcal{X} \to \mathcal{Y}$ bijective, be a transformation of the r.v. $x$. By the Fisher-Neyman factorization, $s(x)$ is a sufficient statistic for the parameter $\theta$ if there exist non-negative functions $g_\theta$ and $h$, where $g_\theta$ depends on $\theta$ while $h$ does not, such that we can write the density as follows:

$$
p(x; \theta) = g_\theta(s(x))\, h(x) .
$$

The following proposition shows the behavior of $F$ under sufficient statistics.
Proposition 1. _The Fisher information is invariant under sufficient statistics._
Proof. Let $y = s(x)$ be a sufficient statistic and so $p(x; \theta) = g_\theta(s(x))\, h(x)$. Since $s$ is bijective, the change-of-variables formula gives the density of $y$ as $q(y; \theta) = g_\theta(y)\, \tilde{h}(y)$, where $\tilde{h}$ collects $h$ and the Jacobian term, neither of which depends on $\theta$. Notice that this implies

$$
\partial_i \log q(y; \theta) = \partial_i \log g_\theta(y) = \partial_i \log g_\theta(s(x)) = \partial_i \log p(x; \theta) .
$$

So, the Fisher information under $y$ is

$$
\tilde{F}_{ij}(\theta) = \mathbb{E}_{q(y; \theta)}\!\left[ \partial_i \log q(y; \theta)\, \partial_j \log q(y; \theta) \right] = \mathbb{E}_{p(x; \theta)}\!\left[ \partial_i \log p(x; \theta)\, \partial_j \log p(x; \theta) \right] = F_{ij}(\theta) .
$$

We conclude that $F$ is invariant under sufficient statistics.
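To see this invariance numerically, the sketch below (an illustrative example with a non-bijective but still sufficient statistic) compares the Fisher information of $N$ i.i.d. Bernoulli($\theta$) observations with that of the count $T = \sum_i x_i \sim \mathrm{Binomial}(N, \theta)$; both should be approximately $N / (\theta(1-\theta))$:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
theta, N, n_mc = 0.3, 10, 500_000

# Fisher information of the full sample x_1, ..., x_N (i.i.d. Bernoulli(theta))
x = rng.binomial(1, theta, size=(n_mc, N))
score_x = (x / theta - (1 - x) / (1 - theta)).sum(axis=1)   # score of the joint density
fisher_x = np.mean(score_x**2)

# Fisher information of the sufficient statistic T ~ Binomial(N, theta),
# with the score obtained by a central finite difference of the log-pmf
t = x.sum(axis=1)
eps = 1e-6
score_t = (binom.logpmf(t, N, theta + eps) - binom.logpmf(t, N, theta - eps)) / (2 * eps)
fisher_t = np.mean(score_t**2)

print(fisher_x, fisher_t, N / (theta * (1 - theta)))   # all approximately 47.6
```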
Let

$$
\mathcal{M} := \{ p(x; \theta) : \theta \in \Theta \subseteq \mathbb{R}^d \}
$$

be the set of the parametric densities $p(x; \theta)$. We can treat $\mathcal{M}$ as a smooth $d$-manifold by imposing a global coordinate chart $p(x; \theta) \mapsto \theta$. Thus, we can identify a point $p(x; \theta)$ on $\mathcal{M}$ with its parameter $\theta$ interchangeably.
Let us assume that $F(\theta)$ is positive-definite everywhere, and each component function $F_{ij}$ is smooth in $\theta$. Then we can use $F$ as (the coordinate representation of) a Riemannian metric for $\mathcal{M}$. This is because $F$ is a covariant 2-tensor. (Recall the definition of a Riemannian metric: a smooth, symmetric, positive-definite covariant 2-tensor field.)
Proposition 2. _The component functions of $F$ follow the covariant transformation rule._
Proof. Let $\theta \mapsto \xi(\theta)$ be a change of coordinates and let $\ell := \log p(x; \theta)$. The component function $F_{ij}$ in the “old” coordinates $\theta$ is expressed in terms of the “new” ones $\xi$, as follows:

$$
F_{ij}(\theta) = \mathbb{E}\!\left[ \frac{\partial \ell}{\partial \theta^i} \frac{\partial \ell}{\partial \theta^j} \right] = \mathbb{E}\!\left[ \left( \sum_a \frac{\partial \ell}{\partial \xi^a} \frac{\partial \xi^a}{\partial \theta^i} \right)\! \left( \sum_b \frac{\partial \ell}{\partial \xi^b} \frac{\partial \xi^b}{\partial \theta^j} \right) \right] = \sum_{a, b} \frac{\partial \xi^a}{\partial \theta^i} \frac{\partial \xi^b}{\partial \theta^j}\, \tilde{F}_{ab}(\xi) ,
$$

where the second equality follows from the standard chain rule. We conclude that $F$ is covariant since the Jacobian of the transformation multiplies the “new” component functions $\tilde{F}_{ab}$ to obtain the “old” ones $F_{ij}$.
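For a concrete instance of this transformation rule, the sketch below (an arbitrary one-dimensional example) checks it for a Bernoulli model in two coordinate systems, the mean $\theta$ and the logit $\xi = \log(\theta / (1 - \theta))$, where the rule reduces to $F(\theta) = (\mathrm{d}\xi / \mathrm{d}\theta)^2\, \tilde{F}(\xi)$:

```python
import numpy as np

theta = 0.3
xi = np.log(theta / (1 - theta))      # the "new" coordinate (logit)

F_theta = 1 / (theta * (1 - theta))   # Fisher information in the "old" coordinate theta
F_xi = theta * (1 - theta)            # Fisher information in the "new" coordinate xi

# Covariant transformation rule in one dimension: F(theta) = (dxi/dtheta)^2 * F(xi)
dxi_dtheta = 1 / (theta * (1 - theta))
print(F_theta, dxi_dtheta**2 * F_xi)  # both equal 1 / (theta * (1 - theta)), about 4.76
```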
Chentsov’s Theorem
The previous two results are useful: the Fisher information is a Riemannian metric for $\mathcal{M}$ and it is invariant under sufficient statistics. In this sense, $F$ has a statistical invariance property. But this alone is not a strong enough reason for arguing that $F$ is a “natural” or “the best” metric for $\mathcal{M}$.

Here, we shall see a stronger statement about the Fisher metric, due to Chentsov in 1972: It is the unique statistically-invariant metric for $\mathcal{M}$, up to a scaling constant. This makes $F$ stand out over any other metric for $\mathcal{M}$.
Originally, Chentsov’s theorem is stated on the space of Categorical probability distributions over the sample space $\{1, \dots, n\}$, i.e. the probability simplex. We use the result of Campbell (1986) as a stepping stone. To do so, we need to define the so-called Markov embeddings.
Let $A = \{A_1, \dots, A_n\}$ be a partition of $\{1, \dots, m\}$, where $m \geq n$. We define a conditional probability table $Q$ of size $n \times m$ where

$$
Q_{ij} > 0 \enspace \text{if} \enspace j \in A_i , \qquad Q_{ij} = 0 \enspace \text{if} \enspace j \notin A_i , \qquad \text{and} \qquad \sum_{j \in A_i} Q_{ij} = 1 .
$$

That is, the $i$-th row of $Q$ gives probabilities signifying the membership of each $j \in \{1, \dots, m\}$ in $A_i$. Based on this, we define a map $f: \mathbb{R}^n_{>0} \to \mathbb{R}^m_{>0}$ by

$$
f(x)_j := \sum_{i=1}^n x_i\, Q_{ij} , \qquad j = 1, \dots, m .
$$

We call this map $f$ a Markov embedding. The name suggests that $f$ embeds $\mathbb{R}^n_{>0}$ in a higher-dimensional space $\mathbb{R}^m_{>0}$.
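To make this concrete, here is a small sketch of a Markov embedding for a hypothetical partition of $\{1, \dots, 5\}$ into $n = 2$ blocks; note that the map preserves the total mass $\sum_i x_i$ since each row of $Q$ sums to one:

```python
import numpy as np

# Partition of {1, ..., 5} (0-indexed here) into n = 2 blocks: A_1 = {1, 2}, A_2 = {3, 4, 5}
partition = [[0, 1], [2, 3, 4]]
n, m = len(partition), 5

# Conditional probability table Q: row i is supported on A_i and sums to one
Q = np.zeros((n, m))
Q[0, [0, 1]] = [0.4, 0.6]
Q[1, [2, 3, 4]] = [0.2, 0.3, 0.5]

def markov_embedding(x, Q):
    """f(x)_j = sum_i x_i Q_ij, mapping R^n_{>0} into R^m_{>0}."""
    return x @ Q

x = np.array([2.0, 3.0])
y = markov_embedding(x, Q)
print(y)                  # [0.8 1.2 0.6 0.9 1.5]
print(x.sum(), y.sum())   # total mass is preserved: 5.0 5.0
```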
The result of Campbell (1986) characterizes the form of the Riemannian metric in $\mathbb{R}^n_{>0}$ that is invariant under any Markov embedding.
Lemma 3 (Campbell, 1986). _Let $g$ be a Riemannian metric on $\mathbb{R}^n_{>0}$ where $n \geq 2$. Suppose that every Markov embedding on $\mathbb{R}^n_{>0}$ is an isometry. Then_

$$
g_{ij}(x) = A(|x|) + \delta_{ij}\, \frac{|x|}{x_i}\, B(|x|) ,
$$

_where $|x| := \sum_{i=1}^n x_i$, $\delta_{ij}$ is the Kronecker delta, and $A, B$ are smooth functions of $|x|$ satisfying $B > 0$ and $A + B > 0$._
Proof. See Campbell (1986) and Amari (2016, Sec. 3.5).
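Before specializing to the simplex, it may help to verify Lemma 3 numerically in one direction: a metric of the stated form is indeed preserved by Markov embeddings. The sketch below (with arbitrary choices of $A$, $B$, and the partition) checks that the pullback of $g$ along a random Markov embedding agrees with $g$ on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

A = lambda s: 0.7 + s    # an arbitrary smooth function of |x|
B = lambda s: 1.3 * s    # an arbitrary smooth, positive function of |x|

def metric(x):
    """g_ij(x) = A(|x|) + delta_ij * (|x| / x_i) * B(|x|)."""
    s = x.sum()
    return A(s) * np.ones((len(x), len(x))) + np.diag(s / x) * B(s)

# A random Markov embedding R^3_{>0} -> R^6_{>0} from a partition into 3 blocks
partition = [[0, 1], [2, 3, 4], [5]]
Q = np.zeros((3, 6))
for i, block in enumerate(partition):
    w = rng.random(len(block)) + 0.1
    Q[i, block] = w / w.sum()            # each row sums to one on its block

x = rng.random(3) + 0.1                  # a point in R^3_{>0}
u, v = rng.standard_normal(3), rng.standard_normal(3)

# f(x) = x @ Q is linear, so its differential acts on tangent vectors the same way
lhs = u @ metric(x) @ v                  # g_x(u, v)
rhs = (u @ Q) @ metric(x @ Q) @ (v @ Q)  # pullback (f* g)_x(u, v)
print(lhs, rhs)                          # agree up to floating-point error
```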
Lemma 3 is a general statement about the invariant metric in $\mathbb{R}^n_{>0}$ and it does not say anything about sufficient statistics and probability distributions. To get the main result, we restrict ourselves to the $(n-1)$-probability simplex $\Delta^{n-1} := \{ x \in \mathbb{R}^n_{>0} : \sum_{i=1}^n x_i = 1 \}$, which is the space of (Categorical) probability distributions.

The fact that the Fisher information is the unique invariant metric under sufficient statistics follows from the fact that, when $m = n$, a Markov embedding reduces to a permutation of the components of $x$, i.e. a permutation of the sample space $\{1, \dots, n\}$. This is because permutations of $\{1, \dots, n\}$ are sufficient statistics for the Categorical distribution.

Let us, therefore, connect the result in Lemma 3 with the Fisher information on $\Delta^{n-1}$. We give the latter in the following lemma.
Lemma 4. _The Fisher information of a Categorical distribution $p(x; \theta) = \prod_{i=1}^n \theta_i^{x_i}$, where $x$ takes values in the one-hot vectors $\{e_1, \dots, e_n\}$ and $\theta \in \Delta^{n-1}$, is given by_

$$
F(\theta) = \operatorname{diag}\!\left( \frac{1}{\theta_1}, \dots, \frac{1}{\theta_n} \right) .
$$

That is, $F(\theta)$ is an $n \times n$ diagonal matrix with $i$-th diagonal entry $1/\theta_i$.
Proof. By definition,

$$
\log p(x; \theta) = \sum_{i=1}^n x_i \log \theta_i ,
$$

where we assume that $x$ is one-hot encoded. Its score function is given by

$$
\frac{\partial \log p(x; \theta)}{\partial \theta_i} = \frac{x_i}{\theta_i}
$$

for each $i = 1, \dots, n$. Hence, using the fact that $x$ is one-hot, so that $x_i^2 = x_i$ and $\mathbb{E}[x_i] = \theta_i$:

$$
F_{ii}(\theta) = \mathbb{E}\!\left[ \frac{x_i^2}{\theta_i^2} \right] = \frac{\mathbb{E}[x_i]}{\theta_i^2} = \frac{\theta_i}{\theta_i^2} = \frac{1}{\theta_i} .
$$

Using a similar step, we can show that $F_{ij}(\theta) = 0$ for $i \neq j$ because the product $x_i x_j$ is always zero.
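Lemma 4 is also easy to check by simulation: draw one-hot samples from a Categorical($\theta$), average the outer products of the score vectors $x_i / \theta_i$, and compare with $\operatorname{diag}(1/\theta_i)$. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, 0.3, 0.5])
n_mc = 1_000_000

# One-hot samples from Categorical(theta)
idx = rng.choice(len(theta), size=n_mc, p=theta)
x = np.eye(len(theta))[idx]

score = x / theta                    # score vector: d log p / d theta_i = x_i / theta_i
fisher_mc = score.T @ score / n_mc   # Monte Carlo estimate of E[score score^T]

print(np.round(fisher_mc, 2))        # approximately diag(5, 3.33, 2)
print(np.diag(1 / theta))
```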
Now we are ready to state the main result.
Theorem 5 (Chentsov, 1972). _The Fisher information is the unique Riemannian metric on $\Delta^{n-1}$ that is invariant under sufficient statistics, up to a multiplicative constant._
Proof. By Lemma 3, the invariant metric under Markov embeddings in $\mathbb{R}^n_{>0}$ is given by

$$
g_{ij}(x) = A(|x|) + \delta_{ij}\, \frac{|x|}{x_i}\, B(|x|)
$$

for any $x \in \mathbb{R}^n_{>0}$. Therefore, this is the form of the invariant metric under sufficient statistics in $\mathbb{R}^n_{>0}$, i.e. when $m = n$ in the Markov embedding.

Let us therefore restrict $g$ to $\Delta^{n-1}$. For each $x \in \Delta^{n-1}$, the tangent space $T_x \Delta^{n-1}$ is orthogonal to the line $x_1 = \dots = x_n$, whose direction is given by the vector $(1, \dots, 1)^\top$. This is a vector normal to $\Delta^{n-1}$, implying that any $v \in T_x \Delta^{n-1}$ satisfies $\langle v, (1, \dots, 1)^\top \rangle = 0$, i.e. $\sum_{i=1}^n v_i = 0$.

Moreover, if $x \in \Delta^{n-1}$, then $|x| = 1$ by definition. Thus, $A(|x|) = A(1)$ and $B(|x|) = B(1)$ are constants. So, if $v, w \in T_x \Delta^{n-1}$, we have:

$$
g_x(v, w) = \sum_{i, j} \left( A(1) + \delta_{ij}\, \frac{B(1)}{x_i} \right) v_i w_j = A(1) \sum_i v_i \sum_j w_j + B(1) \sum_i \frac{v_i w_i}{x_i} = B(1) \sum_i \frac{v_i w_i}{x_i} ,
$$

since $\sum_i v_i = \sum_j w_j = 0$. Therefore $A(1)$ does not contribute to the inner product and we may, w.l.o.g., write the metric restricted to $\Delta^{n-1}$ as a diagonal matrix:

$$
g_{ij}(x) = \delta_{ij}\, \frac{B(1)}{x_i} .
$$

Recalling that $B(1)$ is a constant, by Lemma 4 we have $g_{ij}(x) = B(1)\, F_{ij}(x)$, i.e. the metric $g$ is the Fisher information up to a multiplicative constant.
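As a final numerical illustration of this last step (with arbitrary values standing in for the constants $A(1)$ and $B(1)$), one can check that on tangent vectors of the simplex the invariant metric of Lemma 3 collapses to $B(1)$ times the Fisher metric of Lemma 4:

```python
import numpy as np

rng = np.random.default_rng(0)

A1, B1 = 2.5, 0.8                      # arbitrary stand-ins for A(1) and B(1)
x = rng.dirichlet(np.ones(4))          # a point on the simplex, so |x| = 1

# Tangent vectors of the simplex: components sum to zero
v = rng.standard_normal(4)
v -= v.mean()
w = rng.standard_normal(4)
w -= w.mean()

g = A1 * np.ones((4, 4)) + np.diag(1 / x) * B1   # Campbell's form at |x| = 1
fisher = np.diag(1 / x)                          # Fisher metric of the Categorical (Lemma 4)

print(v @ g @ w, B1 * (v @ fisher @ w))          # equal: A(1) drops out on tangent vectors
```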
Generalizations of this (original) version of Chentsov’s theorem exist. For instance, Ay et al. (2015) proved Chentsov’s theorem for arbitrary parametric probability distributions, and Dowty (2018) proved it for exponential family distributions.
References
- Chentsov, N. N. “Statistical Decision Rules and Optimal Inference.” (1972).
- Campbell, L. Lorne. “An extended Čencov characterization of the information metric.” Proceedings of the American Mathematical Society 98, no. 1 (1986): 135-141.
- Amari, Shun-ichi. Information geometry and its applications. Vol. 194. Springer, 2016.
- Ay, Nihat, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer. “Information geometry and sufficient statistics.” Probability Theory and Related Fields 162, no. 1-2 (2015): 327-364.
- Dowty, James G. “Chentsov’s theorem for exponential families.” Information Geometry 1, no. 1 (2018): 117-135.