# Maximizing likelihood is equivalent to minimizing KL-Divergence

When reading Kevin Murphy’s book, I came across this statement:

… maximizing likelihood is equivalent to minimizing $KL(p \,\|\, q)$, where $p$ is the true distribution and $q$ is our estimate …

So here is an attempt to prove that.

Expanding the definition of the KL-divergence:

$$KL(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x)$$

If it looks familiar, the left term is the negative entropy of $p$, that is $-H(p) = \sum_x p(x) \log p(x)$, which does not depend on $q$.

Suppose we sample $N$ i.i.d. points $x_1, \dots, x_N$ from $p$. By the law of large numbers, as $N \to \infty$:

$$\frac{1}{N} \sum_{i=1}^{N} \log q(x_i) \longrightarrow \mathbb{E}_{x \sim p}[\log q(x)] = \sum_x p(x) \log q(x)$$

which, up to sign, is the right term of the above KL-Divergence. Notice that:

$$\frac{1}{N} \sum_{i=1}^{N} \log q(x_i) = -\frac{1}{N} \, \text{NLL}$$

where NLL is the negative log-likelihood of the samples under $q$ and $N$ is the number of samples.

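The law-of-large-numbers step can be sanity-checked numerically. A minimal sketch, assuming a made-up categorical true distribution `p` and a made-up estimate `q` over four symbols:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true distribution p and candidate estimate q.
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.4, 0.3, 0.2, 0.1])

# Monte Carlo average of log q(x) over samples drawn from p ...
xs = rng.choice(4, size=200_000, p=p)
mc = np.mean(np.log(q[xs]))

# ... approaches the exact expectation E_{x~p}[log q(x)].
exact = np.sum(p * np.log(q))
print(mc, exact)  # the two values agree to a few decimal places
```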
Then, if we minimize the NLL (that is, maximize the likelihood), we maximize $\sum_x p(x) \log q(x)$ in the limit; since the entropy term $\sum_x p(x) \log p(x)$ is constant with respect to $q$, this is exactly minimizing $KL(p \,\|\, q)$.

Why does this matter, though? Because this gives MLE a nice interpretation: maximizing the likelihood of data under our estimate is equal to minimizing the difference between our estimate and the real data distribution. We can see MLE as a proxy for fitting our estimate to the real distribution, which cannot be done directly as the real distribution is unknown to us.
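To see the equivalence end to end, here is a small sketch with a hypothetical categorical $p$ and two hypothetical candidate estimates: the candidate with the lower NLL on samples from $p$ is also the one with the lower KL divergence, and the per-sample NLL differs from the KL only by the constant entropy of $p$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true categorical distribution p over 4 symbols.
p = np.array([0.1, 0.2, 0.3, 0.4])
samples = rng.choice(4, size=100_000, p=p)

def nll(q, xs):
    """Negative log-likelihood of samples xs under candidate q."""
    return -np.sum(np.log(q[xs]))

def kl(p, q):
    """KL(p || q) for categorical distributions."""
    return np.sum(p * np.log(p / q))

# Two made-up candidate estimates of p: one close, one far.
q_good = np.array([0.12, 0.18, 0.32, 0.38])
q_bad = np.array([0.4, 0.3, 0.2, 0.1])

# The candidate with the lower NLL also has the lower KL from p.
print(nll(q_good, samples) < nll(q_bad, samples))  # True
print(kl(p, q_good) < kl(p, q_bad))                # True

# Per-sample NLL ≈ H(p) + KL(p || q): they differ only by the
# entropy of p, which is constant with respect to q.
entropy = -np.sum(p * np.log(p))
print(nll(q_good, samples) / len(samples), entropy + kl(p, q_good))
```

This is also why the per-sample NLL can never go below the entropy of the data distribution: the best any estimate can do is drive the KL term to zero.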