Let $p(x; \theta)$ be a probability density on a sample space $\mathcal{X}$, parametrized by $\theta = (\theta_1, \dots, \theta_d) \in \Theta \subseteq \mathbb{R}^d$. The Fisher information is defined by

$$
\mathbf{F}(\theta) := \mathbb{E}_{p(x; \theta)} \left[ \nabla_\theta \log p(x; \theta) \, \nabla_\theta \log p(x; \theta)^\top \right] ,
$$

i.e. its components are $F_{ij}(\theta) = \mathbb{E}_{p(x; \theta)} \left[ \partial_i \log p(x; \theta) \, \partial_j \log p(x; \theta) \right]$, where $\partial_i := \partial / \partial \theta_i$ for each $i = 1, \dots, d$. Note that $\mathbf{F}(\theta)$ is positive semi-definite because one can see it as the (expected) outer product of the gradient of the log-density.
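As a quick sanity check of this definition, here is a minimal numerical sketch (assuming only NumPy; the Bernoulli model and the variable names are illustrative choices, not from the text) that estimates the Fisher information of a Bernoulli distribution by Monte Carlo and compares it to the closed form $1 / (\theta (1 - \theta))$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=1_000_000)        # samples from Bernoulli(theta)

# Score function: d/dtheta log p(x; theta) = x/theta - (1 - x)/(1 - theta)
score = x / theta - (1 - x) / (1 - theta)

# Fisher information as the expected squared score (a 1x1 "outer product")
F_mc = np.mean(score ** 2)
F_exact = 1.0 / (theta * (1 - theta))
print(F_mc, F_exact)                              # both approximately 4.76
```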
Let $y = t(x)$ with $t: \mathcal{X} \to \mathcal{Y}$ be a bijective transformation of the r.v. $x$. By the Fisher-Neyman factorization, $t(x)$ is a sufficient statistic for the parameter $\theta$ if there exist non-negative functions $g$ and $h$, where $g$ depends on $\theta$ while $h$ does not, such that we can write the density $p(x; \theta)$ as follows:

$$
p(x; \theta) = g(t(x); \theta) \, h(x) .
$$
The following proposition shows the behavior of $\mathbf{F}(\theta)$ under sufficient statistics.
Proposition 1. The Fisher information is invariant under sufficient statistics.
Proof. Let $y = t(x)$ be a sufficient statistic and so $p(x; \theta) = g(t(x); \theta) \, h(x)$. Notice that this implies

$$
\nabla_\theta \log p(x; \theta) = \nabla_\theta \log g(t(x); \theta) .
$$

Moreover, since $t$ is bijective and $h$ does not depend on $\theta$, the density of $y$ factorizes as $p(y; \theta) = g(y; \theta) \, \tilde{h}(y)$ for some function $\tilde{h}$ free of $\theta$, and hence $\nabla_\theta \log p(y; \theta) = \nabla_\theta \log g(y; \theta)$.
So, the Fisher information $\mathbf{F}_y(\theta)$ under $y = t(x)$ is

$$
\begin{aligned}
\mathbf{F}_y(\theta) &= \mathbb{E}_{p(y; \theta)} \left[ \nabla_\theta \log p(y; \theta) \, \nabla_\theta \log p(y; \theta)^\top \right] \\
&= \mathbb{E}_{p(x; \theta)} \left[ \nabla_\theta \log g(t(x); \theta) \, \nabla_\theta \log g(t(x); \theta)^\top \right] \\
&= \mathbb{E}_{p(x; \theta)} \left[ \nabla_\theta \log p(x; \theta) \, \nabla_\theta \log p(x; \theta)^\top \right] = \mathbf{F}(\theta) .
\end{aligned}
$$
We conclude that $\mathbf{F}(\theta)$ is invariant under sufficient statistics.
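Here is a small numerical illustration of Proposition 1 (a sketch assuming NumPy; the Gaussian model and the particular bijection $t(x) = 2x + 1$ are my own illustrative choices). For $x \sim \mathcal{N}(\theta, 1)$ the Fisher information is $1$, and the same value is recovered from the transformed variable $y = t(x) \sim \mathcal{N}(2\theta + 1, 4)$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.7
x = rng.normal(theta, 1.0, size=1_000_000)    # x ~ N(theta, 1), so F_x(theta) = 1
y = 2 * x + 1                                 # bijective (hence sufficient) statistic; y ~ N(2 theta + 1, 4)

score_x = x - theta                           # d/dtheta log N(x; theta, 1)
score_y = (y - (2 * theta + 1)) / 2.0         # d/dtheta log N(y; 2 theta + 1, 4)

print(np.mean(score_x ** 2), np.mean(score_y ** 2))   # both approximately 1
```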
Let

$$
M := \{\, p(x; \theta) : \theta \in \Theta \subseteq \mathbb{R}^d \,\}
$$
be the set of the parametric densities $p(x; \theta)$. We can treat $M$ as a smooth $d$-manifold by imposing a global coordinate chart $\varphi: M \to \Theta$ given by $p(x; \theta) \mapsto \theta$. Thus, we can identify a point $p(x; \theta)$ on $M$ with its parameter $\theta$ interchangeably.
Let us assume that $\mathbf{F}(\theta)$ is positive-definite everywhere and that each component function $F_{ij}(\theta)$ is smooth in $\theta$. Then we can use it as (the coordinate representation of) a Riemannian metric for $M$. This is because $\mathbf{F}$ transforms as a covariant 2-tensor, as the next proposition shows. (Recall the definition of a Riemannian metric: a smooth, symmetric, positive-definite covariant 2-tensor field.)
Proposition 2. _The component functions $F_{ij}$ of $\mathbf{F}$ follow the covariant transformation rule._
Proof. Let $\psi: \theta \mapsto \eta$ be a change of coordinates and let $\tilde{F}_{kl}(\eta)$ denote the components of the Fisher information in the new coordinates $\eta$. The component function $F_{ij}(\theta)$ in the “old” coordinates is expressed in terms of the “new” ones, as follows:

$$
F_{ij}(\theta) = \mathbb{E} \left[ \frac{\partial \log p}{\partial \theta_i} \frac{\partial \log p}{\partial \theta_j} \right]
= \mathbb{E} \left[ \sum_{k, l} \frac{\partial \eta_k}{\partial \theta_i} \frac{\partial \log p}{\partial \eta_k} \, \frac{\partial \eta_l}{\partial \theta_j} \frac{\partial \log p}{\partial \eta_l} \right]
= \sum_{k, l} \frac{\partial \eta_k}{\partial \theta_i} \frac{\partial \eta_l}{\partial \theta_j} \, \tilde{F}_{kl}(\eta) ,
$$

where the second equality follows from the standard chain rule. We conclude that $\mathbf{F}$ is covariant since the Jacobian $\partial \eta_k / \partial \theta_i$ of the transformation multiplies the “new” component functions $\tilde{F}_{kl}$ of $\mathbf{F}$ to obtain the “old” ones $F_{ij}(\theta)$.
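A one-dimensional illustration of Proposition 2 (a sketch assuming NumPy; the Bernoulli model and the log-odds reparametrization are my own illustrative choices, and in one dimension the Jacobian is just a scalar): the Fisher information in the mean parameter $\theta$ is recovered from the Fisher information in $\eta = \log(\theta / (1 - \theta))$ via the covariant rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = 0.3
eta = np.log(theta / (1 - theta))            # "new" coordinate: the log-odds

F_theta = 1.0 / (theta * (1 - theta))        # Bernoulli Fisher information in theta-coordinates
F_eta = sigmoid(eta) * (1 - sigmoid(eta))    # Bernoulli Fisher information in eta-coordinates

jac = 1.0 / (theta * (1 - theta))            # d eta / d theta
print(F_theta, jac * F_eta * jac)            # covariant rule: F(theta) = (d eta/d theta)^2 F~(eta)
```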
Chentsov’s Theorem
The previous two results are useful: together they show that the Fisher information is a bona fide Riemannian metric for $M$ and that it is invariant under sufficient statistics. In this sense, $\mathbf{F}$ has a statistical invariance property. But this alone is not a strong enough reason for arguing that $\mathbf{F}$ is a “natural” or “the best” metric for $M$.
Here, we shall see a stronger statement, due to Chentsov in 1972, about the Fisher metric: It is the unique statistically-invariant metric for $M$ (up to a scaling constant). This makes $\mathbf{F}$ stand out over any other metric for $M$.
Originally, Chentsov’s theorem was stated on the space of Categorical probability distributions over the sample space $\{ 1, \dots, n \}$, i.e. the probability simplex. We use the result of Campbell (1986) as a stepping stone. To do so, we need to define the so-called Markov embeddings.
Let $A = \{ A_1, \dots, A_n \}$ be a partition of $\{ 1, \dots, m \}$, where $m \geq n$. We define a conditional probability table $Q = (q_{ij})$ of size $n \times m$ where

$$
q_{ij} \geq 0, \qquad q_{ij} = 0 \enspace \text{if} \enspace j \notin A_i, \qquad \sum_{j \in A_i} q_{ij} = 1 .
$$
That is, the $i$-th row of $Q$ gives probabilities signifying the membership of each $j \in \{ 1, \dots, m \}$ in $A_i$. Based on this, we define a map $f: \mathbb{R}^n_{>0} \to \mathbb{R}^m_{>0}$ by

$$
f(x)_j := \sum_{i=1}^n q_{ij} \, x_i , \qquad j = 1, \dots, m .
$$
We call this map a Markov embedding. The name suggests that $f$ embeds $\mathbb{R}^n_{>0}$ in the higher-dimensional space $\mathbb{R}^m_{>0}$.
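For concreteness, here is a minimal sketch of a Markov embedding from $n = 2$ to $m = 3$ (assuming NumPy; the partition $A_1 = \{1, 2\}$, $A_2 = \{3\}$ and the entries of $Q$ are arbitrary illustrative choices).

```python
import numpy as np

# Conditional probability table Q for the partition A_1 = {1, 2}, A_2 = {3}:
# row i is supported on A_i and sums to 1.
Q = np.array([
    [0.25, 0.75, 0.0],
    [0.0,  0.0,  1.0],
])

def f(x):
    """Markov embedding f(x)_j = sum_i q_ij x_i, mapping R^2_{>0} into R^3_{>0}."""
    return Q.T @ x

x = np.array([0.4, 0.6])      # a point on the 1-simplex
print(f(x), f(x).sum())       # [0.1, 0.3, 0.6]; the total mass is preserved
```

Note that $f$ preserves the total mass $\sum_i x_i$, so in particular it maps the probability simplex in $\mathbb{R}^2_{>0}$ into the probability simplex in $\mathbb{R}^3_{>0}$.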
The result of Campbell (1986) characterizes the form of the Riemannian metric on $\mathbb{R}^n_{>0}$ that is invariant under any Markov embedding.
Lemma 3 (Campbell, 1986). _Let $g$ be a Riemannian metric on $\mathbb{R}^n_{>0}$ where $n \geq 2$. Suppose that every Markov embedding on $\mathbb{R}^n_{>0}$ is an isometry. Then_

$$
g_{ij}(x) = A(\lvert x \rvert) + \delta_{ij} \, \frac{\lvert x \rvert}{x_i} \, B(\lvert x \rvert) ,
$$

_where $\lvert x \rvert := \sum_{i=1}^n x_i$, $\delta_{ij}$ is the Kronecker delta, and $A, B$ are smooth functions satisfying $B > 0$ and $A + B > 0$._
Proof. See Campbell (1986) and Amari (2016, Sec. 3.5).
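Lemma 3 is only cited here, but its invariance claim is easy to probe numerically. The sketch below (assuming NumPy; the particular $A$, $B$, the table $Q$, and all points and vectors are arbitrary illustrative choices) checks that a metric of Campbell’s form assigns the same inner product to tangent vectors before and after a Markov embedding.

```python
import numpy as np

def campbell_metric(x, u, v, A, B):
    """Inner product sum_ij g_ij(x) u_i v_j with g_ij(x) = A(|x|) + delta_ij (|x|/x_i) B(|x|)."""
    s = x.sum()
    return A(s) * u.sum() * v.sum() + B(s) * np.sum(s * u * v / x)

A = lambda s: np.sin(s) ** 2        # arbitrary smooth A, B with B > 0 and A + B > 0
B = lambda s: 1.0 + s ** 2

# A Markov embedding R^2_{>0} -> R^3_{>0}: every column of Q has a single nonzero entry.
Q = np.array([
    [0.25, 0.75, 0.0],
    [0.0,  0.0,  1.0],
])

x = np.array([1.3, 0.7])            # any point in R^2_{>0}, not necessarily on the simplex
u = np.array([0.2, -0.5])
v = np.array([-0.1, 0.4])

g_before = campbell_metric(x, u, v, A, B)
g_after = campbell_metric(Q.T @ x, Q.T @ u, Q.T @ v, A, B)
print(g_before, g_after)            # equal: the Markov embedding is an isometry for this metric
```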
Lemma 3 is a general statement about the invariant metric on $\mathbb{R}^n_{>0}$ and it does not say anything about sufficient statistics and probability distributions. To get the main result, we restrict ourselves to the $(n-1)$-probability simplex $\Delta^{n-1} := \{ x \in \mathbb{R}^n_{>0} : \sum_{i=1}^n x_i = 1 \}$, which is the space of Categorical probability distributions.
The fact that the Fisher information is the unique invariant metric under sufficient statistics follows from the fact that when $m = n$, the Markov embedding reduces to a permutation of the components of $x$, i.e. a permutation of the sample space $\{ 1, \dots, n \}$. This is because permutations of the sample space are sufficient statistics for the Categorical distribution.
Let us, therefore, connect the result in Lemma 3 with the Fisher information on $\Delta^{n-1}$. We give the latter in the following lemma.
Lemma 4. _The Fisher information of a Categorical distribution $p(x; \theta)$, where $x$ takes values in $\{ 1, \dots, n \}$ and $\theta = (\theta_1, \dots, \theta_n) \in \Delta^{n-1}$, is given by_

$$
F_{ij}(\theta) = \frac{\delta_{ij}}{\theta_i} .
$$

That is, $\mathbf{F}(\theta)$ is an $n \times n$ diagonal matrix with $i$-th diagonal entry $1 / \theta_i$.
Proof. By definition,

$$
p(x; \theta) = \prod_{i=1}^n \theta_i^{x_i} ,
$$

where we assume that $x$ is one-hot encoded, i.e. $x_i \in \{0, 1\}$ and $\sum_{i=1}^n x_i = 1$. Its score function is given by

$$
\frac{\partial \log p(x; \theta)}{\partial \theta_i} = \frac{x_i}{\theta_i}
$$

for each $i = 1, \dots, n$. Hence, using the fact that $x$ is one-hot, so that $x_i^2 = x_i$ and $\mathbb{E}[x_i] = \theta_i$:

$$
F_{ii}(\theta) = \mathbb{E} \left[ \frac{x_i^2}{\theta_i^2} \right] = \frac{\mathbb{E}[x_i]}{\theta_i^2} = \frac{\theta_i}{\theta_i^2} = \frac{1}{\theta_i} .
$$

Using a similar step, we can show that $F_{ij}(\theta) = 0$ for $i \neq j$ because $x_i x_j$ is always zero.
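A quick Monte Carlo check of Lemma 4 (a sketch assuming NumPy; the specific $\theta$ is an arbitrary illustrative choice, and $\theta$ is treated as $n$ free coordinates exactly as in the proof above).

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([0.2, 0.3, 0.5])
X = rng.multinomial(1, theta, size=1_000_000)    # one-hot samples of a Categorical distribution

scores = X / theta                               # component i of the score is x_i / theta_i

F_mc = scores.T @ scores / len(X)                # Monte Carlo estimate of E[score score^T]
print(np.round(F_mc, 2))                         # approximately diag(5, 3.33, 2)
print(np.diag(1.0 / theta))                      # the claimed diag(1 / theta_i)
```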
Now we are ready to state the main result.
Theorem 5 (Chentsov, 1972). The Fisher information is the unique Riemannian metric on $\Delta^{n-1}$ that is invariant under sufficient statistics, up to a multiplicative constant.
Proof. By Lemma 3, the metric invariant under Markov embeddings on $\mathbb{R}^n_{>0}$ is given by

$$
g_{ij}(x) = A(\lvert x \rvert) + \delta_{ij} \, \frac{\lvert x \rvert}{x_i} \, B(\lvert x \rvert)
$$

for any $x \in \mathbb{R}^n_{>0}$. Therefore, this is also the form of the metric invariant under sufficient statistics on $\Delta^{n-1}$, i.e. when $m = n$ in the Markov embedding.
Let us therefore restrict $x$ to $\Delta^{n-1}$. For each $x \in \Delta^{n-1}$, the tangent space $T_x \Delta^{n-1}$ is orthogonal to the line $\{ t \, (1, \dots, 1) : t \in \mathbb{R} \}$, whose direction is given by the vector $(1, \dots, 1)$. This is a vector normal to $\Delta^{n-1}$, implying that any $v \in T_x \Delta^{n-1}$ satisfies $\langle v, (1, \dots, 1) \rangle = 0$, i.e. $\sum_{i=1}^n v_i = 0$.
Moreover, if $x \in \Delta^{n-1}$, then $\lvert x \rvert = 1$ by definition. Thus, $A(\lvert x \rvert) = A(1)$ and $B(\lvert x \rvert) = B(1)$ are constants. So, if $v, w \in T_x \Delta^{n-1}$, we have:

$$
\sum_{i, j} g_{ij}(x) \, v_i w_j = A(1) \sum_i v_i \sum_j w_j + B(1) \sum_i \frac{v_i w_i}{x_i} = B(1) \sum_i \frac{v_i w_i}{x_i} .
$$

Therefore $A(1)$ does not contribute to the inner product and we may, w.l.o.g., write the metric as a diagonal matrix:

$$
g_{ij}(x) = \delta_{ij} \, \frac{B(1)}{x_i} .
$$
Recalling that $B(1)$ is a constant, by Lemma 4 (with $x$ playing the role of the parameter $\theta$), we have $g_{ij}(x) = B(1) \, F_{ij}(x)$, i.e. the invariant metric is the Fisher information up to a multiplicative constant.
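To connect the theorem back to the Markov-embedding machinery, here is a small numerical sketch (assuming NumPy and reusing the illustrative $Q$ from the Markov-embedding example above) checking that a Markov embedding is an isometry of the Fisher metric $g_{ij}(x) = \delta_{ij} / x_i$ on the simplex: the squared length of a tangent vector is the same before and after the embedding.

```python
import numpy as np

def fisher_norm_sq(x, v):
    """Squared length of a tangent vector v at x under the Fisher metric: sum_i v_i^2 / x_i."""
    return np.sum(v ** 2 / x)

Q = np.array([                    # the Markov embedding from the earlier example
    [0.25, 0.75, 0.0],
    [0.0,  0.0,  1.0],
])

x = np.array([0.4, 0.6])          # a point on the 1-simplex
v = np.array([0.1, -0.1])         # a tangent vector (its components sum to zero)

x_emb = Q.T @ x                   # embedded point on the 2-simplex
v_emb = Q.T @ v                   # pushforward of v under the (linear) embedding

print(fisher_norm_sq(x, v), fisher_norm_sq(x_emb, v_emb))   # both approximately 0.0417
```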
Generalizations of this (original) version of Chentsov’s theorem exist. For instance, Ay et al. (2015) proved Chentsov’s theorem for arbitrary parametric probability distributions, and Dowty (2018) stated Chentsov’s theorem for exponential family distributions.
References
- Chentsov, N. N. “Statistical Decision Rules and Optimal Inference.” (1972).
- Campbell, L. Lorne. “An extended Čencov characterization of the information metric.” Proceedings of the American Mathematical Society 98, no. 1 (1986): 135-141.
- Amari, Shun-ichi. Information geometry and its applications. Vol. 194. Springer, 2016.
- Ay, Nihat, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer. “Information geometry and sufficient statistics.” Probability Theory and Related Fields 162, no. 1-2 (2015): 327-364.
- Dowty, James G. “Chentsov’s theorem for exponential families.” Information Geometry 1, no. 1 (2018): 117-135.