Kullback-Leibler Divergence, or KL Divergence is a measure on how “off” two probability distributions

For example, if we have two gaussians,

The KL Divergence could be computed as follows:

that is, for all random variable

## KL Divergence in optimization

In optimization setting, we assume that

Just like any other distance functions (e.g. euclidean distance), we can use KL Divergence as a loss function in an optimization setting, especially in a probabilistic setting. For example, in Variational Bayes, we are trying to fit an approximate to the true posterior, and the process to make sure that

However, we have to note this important property about KL Divergence: it is not symmetric. Formally,

## Forward KL

In forward KL, the difference between

Consider

Reversely, if

Let’s see some visual examples.

In the example above, the right hand side mode is not covered by

In the above example,

Although there are still some area that are wrongly covered by

Those are the reason why, Forward KL is known as *zero avoiding*, as it is avoiding

## Reverse KL

In Reverse KL, as we switch the two distributions’ position in the equation, now

First, what happen if

Second, what happen if

Therefore, the failure case example above for Forward KL, is the desireable outcome for Reverse KL. That is, for Reverse KL, it is better to fit just some portion of

Consequently, Reverse KL will try avoid spreading the approximate. Now, there would be some portion of

As those properties suggest, this form of KL Divergence is know as *zero forcing*, as it forces

## Conclusion

So, what’s the best KL?

As always, the answer is “it depends”. As we have seen above, both has its own characteristic. So, depending on what we want to do, we choose which KL Divergence mode that’s suitable for our problem.

In Bayesian Inference, esp. in Variational Bayes, Reverse KL is widely used. As we could see at the derivation of Variational Autoencoder, VAE also uses Reverse KL (as the idea is rooted in Variational Bayes!).

## References

- Blei, David M. “Variational Inference.” Lecture from Princeton, https://www. cs. princeton. edu/courses/archive/fall11/cos597C/lectures/variational-inference-i. pdf (2011).
- Fox, Charles W., and Stephen J. Roberts. “A tutorial on variational Bayesian inference.” Artificial intelligence review 38.2 (2012): 85-95.