Maximizing likelihood is equivalent to minimizing KL-Divergence
When reading Kevin Murphy’s book, I came across this statement:
… maximizing likelihood is equivalent to minimizing $D_{KL}(P \,\|\, Q_\theta)$, where $P$ is the true distribution and $Q_\theta$ is our estimate …
So here is an attempt to prove that.
First, expand the KL-Divergence between the true distribution $P$ and our estimate $Q_\theta$:

$$
D_{KL}(P \,\|\, Q_\theta) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q_\theta(x)}\right] = \mathbb{E}_{x \sim P}[\log P(x)] - \mathbb{E}_{x \sim P}[\log Q_\theta(x)] \, .
$$

If it looks familiar, the left term is the negative entropy of $P$, i.e. $-H(P)$, and it does not depend on $\theta$ at all.

Suppose we sample $N$ i.i.d. points $x_1, \dots, x_N$ from $P$. Then, by Monte Carlo approximation,

$$
\mathbb{E}_{x \sim P}[\log Q_\theta(x)] \approx \frac{1}{N} \sum_{i=1}^{N} \log Q_\theta(x_i) \, ,
$$

which is the right term of the above KL-Divergence. Notice that:

$$
-\frac{1}{N} \sum_{i=1}^{N} \log Q_\theta(x_i) = \frac{1}{N} \, \mathrm{NLL}(\theta) \, ,
$$

where NLL is the negative log-likelihood of the data under $Q_\theta$, and minimizing it is the same as maximizing the likelihood.

Then, if we minimize $D_{KL}(P \,\|\, Q_\theta)$ with respect to $\theta$, the entropy term is a constant, so all that remains is minimizing $-\mathbb{E}_{x \sim P}[\log Q_\theta(x)] \approx \frac{1}{N}\mathrm{NLL}(\theta)$. In other words, minimizing the KL-Divergence is equivalent to minimizing the NLL, i.e. maximizing the likelihood.
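As a sanity check, here is a minimal NumPy sketch under toy assumptions of my own choosing: the true distribution $P$ is a categorical over three outcomes, the model family $Q_\theta$ is a softmax over unconstrained parameters, and the helper names (`q`, `nll`, `kl`) are purely illustrative. Over the same pool of candidate parameters, minimizing the empirical NLL and minimizing the exact KL pick out (approximately) the same $Q_\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy true distribution P over 3 outcomes (unknown to us in practice).
true_probs = np.array([0.2, 0.5, 0.3])
# Samples x_i ~ P, which is all MLE gets to see.
samples = rng.choice(3, size=5_000, p=true_probs)

def q(theta):
    """Model distribution Q_theta: softmax of unconstrained parameters."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def nll(theta):
    """Average negative log-likelihood of the samples under Q_theta."""
    return -np.log(q(theta)[samples]).mean()

def kl(theta):
    """Exact KL(P || Q_theta), computable here only because P is known."""
    qt = q(theta)
    return np.sum(true_probs * (np.log(true_probs) - np.log(qt)))

# Crude random search over theta; compare the two objectives.
candidates = [rng.normal(size=3) for _ in range(2_000)]
best_by_nll = min(candidates, key=nll)
best_by_kl = min(candidates, key=kl)

print("Q chosen by minimizing NLL:", q(best_by_nll))
print("Q chosen by minimizing KL :", q(best_by_kl))
print("True P                    :", true_probs)
# Both objectives land on (roughly) the same Q, and both sit close to P.
```

With enough samples the empirical NLL is a good Monte Carlo estimate of the cross-entropy term in the KL, which is exactly why the two selections agree.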
Why does this matter, though? Because this gives MLE a nice interpretation: maximizing the likelihood of the data under our estimate is the same as minimizing the divergence between our estimate and the real data distribution. We can see MLE as a proxy for fitting our estimate to the real distribution, which cannot be done directly as the real distribution is unknown to us.