
Maximizing likelihood is equivalent to minimizing KL-Divergence

When reading Kevin Murphy’s book, I came across this statement:

… maximizing likelihood is equivalent to minimizing $\mathrm{KL}(p \,\|\, q)$, where $p$ is the true distribution and $q$ is our estimate …

So here is an attempt to prove that. Write out the definition of the KL divergence between the true distribution $p(x)$ and our estimate $q_\theta(x)$, parameterized by $\theta$:

$$\mathrm{KL}(p \,\|\, q_\theta) = \mathbb{E}_{p(x)}\!\left[\log \frac{p(x)}{q_\theta(x)}\right] = \mathbb{E}_{p(x)}[\log p(x)] - \mathbb{E}_{p(x)}[\log q_\theta(x)] \, .$$

If it looks familiar, the left term is the negative entropy of $p(x)$. However, it does not depend on the estimated parameter $\theta$, so we will ignore it.
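
Before moving on, here is a quick numerical sanity check of this decomposition. It is a minimal sketch of my own (not from Murphy's book), assuming two made-up discrete distributions and plain NumPy:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # true distribution p(x), made up for illustration
q = np.array([0.4, 0.4, 0.2])  # our estimate q(x), also made up

kl = np.sum(p * np.log(p / q))           # KL(p || q)
entropy = -np.sum(p * np.log(p))         # H(p): does not depend on q
cross_entropy = -np.sum(p * np.log(q))   # E_p[-log q(x)]

# The decomposition above: KL(p || q) = cross-entropy(p, q) - H(p)
print(kl, cross_entropy - entropy)       # the two printed values agree
```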

Suppose we sample $N$ of these $x_i$ from $p(x)$. Then, the Law of Large Numbers says that, as $N$ goes to infinity,

$$-\frac{1}{N} \sum_{i=1}^{N} \log q_\theta(x_i) \;\longrightarrow\; -\mathbb{E}_{p(x)}[\log q_\theta(x)] \, ,$$

which is the right term of the above KL-Divergence. Notice that

$$-\frac{1}{N} \sum_{i=1}^{N} \log q_\theta(x_i) = \frac{1}{N} \, \mathrm{NLL}(\theta) \, ,$$

where NLL is the negative log-likelihood of the samples under $q_\theta$ and $\frac{1}{N}$ is a constant with respect to $\theta$.
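
To make the Law of Large Numbers step concrete, here is a small sketch, assuming a made-up pair of Gaussians for $p(x)$ and $q_\theta(x)$ (an illustration of mine, not part of the original derivation). The per-sample NLL of draws from $p$ under $q_\theta$ approaches $-\mathbb{E}_{p(x)}[\log q_\theta(x)]$ as $N$ grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu_p, sigma_p = 0.0, 1.0    # true distribution p(x) = N(0, 1), assumed for illustration
mu_q, sigma_q = 0.5, 1.2    # our estimate q_theta(x) = N(0.5, 1.2^2), also assumed
q = stats.norm(loc=mu_q, scale=sigma_q)

for n in [100, 10_000, 1_000_000]:
    x = rng.normal(mu_p, sigma_p, size=n)   # x_i ~ p(x)
    avg_nll = -np.mean(q.logpdf(x))         # (1/N) * NLL of the sample under q_theta
    print(n, avg_nll)

# Closed-form cross-entropy E_p[-log q(x)] between two Gaussians, for comparison
limit = np.log(sigma_q * np.sqrt(2 * np.pi)) \
        + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
print("E_p[-log q(x)] =", limit)
```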

Putting these together, minimizing $\mathrm{KL}(p \,\|\, q_\theta)$ over $\theta$ is equivalent to minimizing the NLL. In other words, it is equivalent to maximizing the log-likelihood.
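
To see the equivalence end to end, here is one more hedged sketch. It assumes a one-parameter family $q_\theta = \mathcal{N}(\theta, 1)$ and a true distribution $p = \mathcal{N}(2, 1)$, both chosen only for illustration: the average NLL and the estimated KL differ by the constant entropy term, so the same $\theta$ minimizes both.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50_000)        # samples from the true p(x) = N(2, 1)
entropy_p = 0.5 * np.log(2 * np.pi * np.e)   # H(p) for a unit-variance Gaussian

thetas = np.linspace(0.0, 4.0, 401)
avg_nll = np.array([-np.mean(stats.norm(t, 1.0).logpdf(x)) for t in thetas])
kl_est = avg_nll - entropy_p                 # estimated KL(p || q_theta): cross-entropy minus H(p)

# Both curves are minimized at the same theta (close to the true mean, 2)
print("theta minimizing avg NLL:", thetas[np.argmin(avg_nll)])
print("theta minimizing est. KL:", thetas[np.argmin(kl_est)])
```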

Why does this matter, though? Because it gives MLE a nice interpretation: maximizing the likelihood of the data under our estimate is the same as minimizing the divergence between our estimate and the true data distribution. We can see MLE as a proxy for fitting our estimate to the true distribution, which cannot be done directly, as the true distribution is unknown to us.