Suppose we have a model parameterized by parameter vector $\theta$ that models a distribution $p(x \vert \theta)$. We define the score function:

$$
s(\theta) = \nabla_\theta \log p(x \vert \theta) \, ,
$$

that is, the score function is the gradient of the log likelihood function. The result about the score function below is an important building block in our discussion.
Claim: The expected value of the score wrt. our model is zero.
Proof. Below, the gradient is wrt. $\theta$:

$$
\begin{aligned}
\mathbb{E}_{p(x \vert \theta)} \left[ s(\theta) \right] &= \mathbb{E}_{p(x \vert \theta)} \left[ \nabla \log p(x \vert \theta) \right] \\
    &= \int \nabla \log p(x \vert \theta) \, p(x \vert \theta) \, \mathrm{d}x \\
    &= \int \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)} \, p(x \vert \theta) \, \mathrm{d}x \\
    &= \int \nabla p(x \vert \theta) \, \mathrm{d}x \\
    &= \nabla \int p(x \vert \theta) \, \mathrm{d}x \\
    &= \nabla 1 \\
    &= 0 . \qquad \blacksquare
\end{aligned}
$$
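As a quick numerical sanity check, here is a minimal NumPy sketch of this claim for a univariate Gaussian model with $\theta = (\mu, \sigma)$; the model choice and all names below are illustrative assumptions, not part of the derivation above:

```python
# Monte Carlo check that the expected score is ~0, assuming a univariate
# Gaussian model p(x | theta) with theta = (mu, sigma). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0

def score(x, mu, sigma):
    """Gradient of log N(x | mu, sigma^2) wrt. (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu)**2 - sigma**2) / sigma**3
    return np.stack([d_mu, d_sigma], axis=-1)

# Sample x from the model itself, then average the score over the samples.
x = rng.normal(mu, sigma, size=1_000_000)
print(score(x, mu, sigma).mean(axis=0))  # ~ [0, 0], up to Monte Carlo noise
```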
But how certain are we of our estimate? We can define an uncertainty measure around the expected estimate, namely the covariance of the score of our model. Taking the result from above, the covariance is:

$$
\mathbb{E}_{p(x \vert \theta)} \left[ (s(\theta) - 0) \, (s(\theta) - 0)^{\mathrm{T}} \right] .
$$
We can then see this as information. The covariance of the score function above is the definition of Fisher Information. As we assume $\theta$ to be a vector, the Fisher Information is in matrix form, called the Fisher Information Matrix:

$$
\mathrm{F} = \mathbb{E}_{p(x \vert \theta)} \left[ \nabla \log p(x \vert \theta) \, \nabla \log p(x \vert \theta)^{\mathrm{T}} \right] .
$$
However, usually our likelihood function is complicated and computing the expectation is intractable. We can approximate the expectation in $\mathrm{F}$ using the empirical distribution $\hat{q}(x)$, which is given by our training data $X = \{ x_1, x_2, \dots, x_N \}$. In this form, $\mathrm{F}$ is called the Empirical Fisher:

$$
\mathrm{F} = \frac{1}{N} \sum_{i=1}^{N} \nabla \log p(x_i \vert \theta) \, \nabla \log p(x_i \vert \theta)^{\mathrm{T}} .
$$
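To make this concrete, here is a small sketch, using the same illustrative Gaussian model as before, that builds the Empirical Fisher from per-sample scores and compares it to the exact Fisher, which for this model happens to be known in closed form:

```python
# Empirical Fisher as the average outer product of per-sample scores,
# for the illustrative Gaussian model p(x | mu, sigma) used above.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
X = rng.normal(mu, sigma, size=10_000)   # "training data", here from the model

def score(x, mu, sigma):
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu)**2 - sigma**2) / sigma**3
    return np.stack([d_mu, d_sigma], axis=-1)

s = score(X, mu, sigma)                  # (N, 2) per-sample scores
F_empirical = s.T @ s / len(X)           # (2, 2) empirical Fisher
print(F_empirical)
# For this Gaussian the exact Fisher is diag(1/sigma^2, 2/sigma^2):
print(np.diag([1 / sigma**2, 2 / sigma**2]))
```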
Fisher and Hessian
One interesting property of $\mathrm{F}$ is that it can be interpreted as the negative expected Hessian of our model's log likelihood.
Claim: The negative expected Hessian of the log likelihood is equal to the Fisher Information Matrix $\mathrm{F}$.
Proof. The Hessian of the log likelihood is given by the Jacobian of its gradient:

$$
\begin{aligned}
\mathrm{H}_{\log p(x \vert \theta)} &= \mathrm{J} \left( \nabla \log p(x \vert \theta) \right) = \mathrm{J} \left( \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)} \right) \\
    &= \frac{\mathrm{H}_{p(x \vert \theta)} \, p(x \vert \theta) - \nabla p(x \vert \theta) \, \nabla p(x \vert \theta)^{\mathrm{T}}}{p(x \vert \theta) \, p(x \vert \theta)} \\
    &= \frac{\mathrm{H}_{p(x \vert \theta)}}{p(x \vert \theta)} - \frac{\nabla p(x \vert \theta)}{p(x \vert \theta)} \frac{\nabla p(x \vert \theta)^{\mathrm{T}}}{p(x \vert \theta)} \, ,
\end{aligned}
$$
where the second line is a result of applying the quotient rule of derivatives. Taking the expectation wrt. our model, we have:

$$
\begin{aligned}
\mathbb{E}_{p(x \vert \theta)} \left[ \mathrm{H}_{\log p(x \vert \theta)} \right] &= \mathbb{E}_{p(x \vert \theta)} \left[ \frac{\mathrm{H}_{p(x \vert \theta)}}{p(x \vert \theta)} \right] - \mathbb{E}_{p(x \vert \theta)} \left[ \nabla \log p(x \vert \theta) \, \nabla \log p(x \vert \theta)^{\mathrm{T}} \right] \\
    &= \int \frac{\mathrm{H}_{p(x \vert \theta)}}{p(x \vert \theta)} \, p(x \vert \theta) \, \mathrm{d}x - \mathrm{F} \\
    &= \mathrm{H}_{\int p(x \vert \theta) \, \mathrm{d}x} - \mathrm{F} \\
    &= \mathrm{H}_{1} - \mathrm{F} \\
    &= -\mathrm{F} .
\end{aligned}
$$

Thus we have $\mathrm{F} = -\mathbb{E}_{p(x \vert \theta)} \left[ \mathrm{H}_{\log p(x \vert \theta)} \right]$. $\qquad \blacksquare$
Indeed, knowing this result, we can see the role of $\mathrm{F}$: it measures the curvature of the log likelihood function around $\theta$.
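We can also verify this connection numerically. The sketch below, again under the illustrative Gaussian assumption from the earlier snippets, averages the closed-form Hessian of the log likelihood over samples and compares its negative to the exact Fisher:

```python
# Numerical check of F = -E[H_{log p}] for the illustrative Gaussian model;
# the closed-form Hessian below is an assumption for this example only.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

def hessian(x, mu, sigma):
    """Hessian of log N(x | mu, sigma^2) wrt. (mu, sigma), shape (N, 2, 2)."""
    h = np.empty(x.shape + (2, 2))
    h[..., 0, 0] = -1 / sigma**2
    h[..., 0, 1] = h[..., 1, 0] = -2 * (x - mu) / sigma**3
    h[..., 1, 1] = 1 / sigma**2 - 3 * (x - mu)**2 / sigma**4
    return h

print(-hessian(x, mu, sigma).mean(axis=0))    # ~ diag(1/sigma^2, 2/sigma^2)
print(np.diag([1 / sigma**2, 2 / sigma**2]))  # exact Fisher, for comparison
```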
Conclusion
The Fisher Information Matrix is defined as the covariance of the score function. It is a curvature matrix and can be interpreted as the negative expected Hessian of the log likelihood function. Thus, the immediate application of $\mathrm{F}$ is as a drop-in replacement for the Hessian $\mathrm{H}$ in second-order optimization methods.
One of the most exciting results about $\mathrm{F}$ is its connection to the KL-divergence, which gives rise to the natural gradient method (Martens, 2014).
References
- Martens, James. “New insights and perspectives on the natural gradient method.” arXiv preprint arXiv:1412.1193 (2014).
- Ly, Alexander, et al. “A tutorial on Fisher information.” Journal of Mathematical Psychology 80 (2017): 40-55.