Latent Dirichlet Allocation (LDA) [1] is a mixed membership model for topic modeling. Given a set of documents in bag of words representation, we want to infer the underlying topics those documents represent. To get a better intuition, we shall look at LDA’s generative story. Note, the full code is available at https://github.com/wiseodd/mixture-models.
Given $D$ documents, each containing $N$ words drawn from a vocabulary of size $V$, and $K$ topics, LDA's generative process is:

$$
\begin{aligned}
\theta_d &\sim \mathrm{Dir}(\alpha) && \text{for } d = 1, \dots, D \\
\phi_k &\sim \mathrm{Dir}(\beta) && \text{for } k = 1, \dots, K \\
z_{dn} &\sim \mathrm{Cat}(\theta_d) && \text{for each word } n \text{ in document } d \\
w_{dn} &\sim \mathrm{Cat}(\phi_{z_{dn}})
\end{aligned}
$$

where $\alpha$ and $\beta$ are the hyperparameters of the Dirichlet priors, $\theta_d$ is the topic distribution of document $d$, $\phi_k$ is the word distribution of topic $k$, $z_{dn}$ is the topic assignment of the $n$-th word in document $d$, and $w_{dn}$ is the word itself.
Details of the above generative process in words (a code sketch follows the list):

- Assume each document is generated by first selecting its topics. Thus, we sample $\theta_d$, the topic distribution for the $d$-th document.
- Assume each word in the $d$-th document comes from one of those topics. Therefore, we sample $z_{dn}$, the topic for the $n$-th word in document $d$.
- Assume each topic is composed of words, e.g. the topic “computer” consists of words like “cpu”, “gpu”, etc. Therefore, we sample $\phi_k$, the distribution over those words for a particular topic $k$.
- Finally, to actually generate the word, given that we already know it comes from topic $z_{dn} = k$, we sample the word $w_{dn}$ from the $k$-th topic's word distribution $\phi_k$.
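To make the story concrete, here is a minimal simulation of this generative process in NumPy; the sizes and hyperparameters below are illustrative choices, not values taken from the post's code.

```python
import numpy as np

N_D, N_W, N_V, N_K = 6, 5, 5, 2   # documents, words per document, vocab size, topics
alpha, beta = 1.0, 1.0            # Dirichlet hyperparameters (illustrative)

# phi_k ~ Dir(beta): word distribution of each topic
phi = np.random.dirichlet(beta * np.ones(N_V), size=N_K)

docs = []
for d in range(N_D):
    # theta_d ~ Dir(alpha): topic distribution of document d
    theta = np.random.dirichlet(alpha * np.ones(N_K))
    words = []
    for n in range(N_W):
        z = np.random.choice(N_K, p=theta)             # z_dn ~ Cat(theta_d)
        words.append(np.random.choice(N_V, p=phi[z]))  # w_dn ~ Cat(phi_{z_dn})
    docs.append(words)

print(np.array(docs))  # one sampled bag-of-words corpus
```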
Inference
The goal of inference in LDA is, given a corpus, to infer the underlying topics that explain those documents, according to the generative process above. Essentially, given the observed words $w$, we want to infer the latent variables: the topic assignments $z$, the document-topic distributions $\theta$, and the topic-word distributions $\phi$.
We will infer those variables using the Gibbs sampling algorithm. In short, it works by repeatedly sampling each variable from its distribution given all the other variables (its full conditional distribution). Because of the Dirichlet-categorical conjugacy, the full conditionals are as follows:

$$
\begin{aligned}
p(z_{dn} = k \mid \theta_d, \phi, w_{dn}) &\propto \theta_{dk} \, \phi_{k, w_{dn}} \\
\theta_d \mid z &\sim \mathrm{Dir}(\alpha + c_d), \quad c_{dk} = \textstyle\sum_n \mathbb{1}[z_{dn} = k] \\
\phi_k \mid z, w &\sim \mathrm{Dir}(\beta + c_k), \quad c_{kv} = \textstyle\sum_d \sum_n \mathbb{1}[w_{dn} = v, \, z_{dn} = k]
\end{aligned}
$$
Essentially, what we are doing is counting the assignments of words and documents to particular topics: $c_{dk}$ counts how many words in document $d$ are assigned to topic $k$, and $c_{kv}$ counts how often vocabulary word $v$ is assigned to topic $k$. Those counts are the sufficient statistics for the full conditionals above.
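For concreteness, here is a minimal sketch of gathering those counts, using hypothetical toy arrays `Z` (topic assignments) and `X` (word ids); the names are illustrative, not taken from the post's code.

```python
import numpy as np

# Z[d, n]: topic assigned to the n-th word of document d
# X[d, n]: vocabulary id of the n-th word of document d
Z = np.array([[0, 0, 1],
              [1, 1, 0]])
X = np.array([[0, 2, 1],
              [2, 2, 1]])
N_K, N_V = 2, 3  # number of topics, vocabulary size

# c_dk: how many words in document d are assigned to topic k (for theta_d)
doc_topic_counts = np.array([[np.sum(Z[d] == k) for k in range(N_K)]
                             for d in range(Z.shape[0])])

# c_kv: how often vocabulary word v is assigned to topic k (for phi_k)
topic_word_counts = np.array([[np.sum((X == v) & (Z == k)) for v in range(N_V)]
                              for k in range(N_K)])

print(doc_topic_counts)   # shape (N_D, N_K)
print(topic_word_counts)  # shape (N_K, N_V)
```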
Given those full conditionals, the rest is as easy as plugging those into the Gibbs Sampling framework, as we shall discuss in the next section.
Implementation
We begin by randomly initializing the topic assignment matrix $Z$, along with the document-topic distributions and the topic-word distributions drawn from their Dirichlet priors:
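Below is a minimal sketch of that initialization, assuming the corpus is an $N_D \times N_W$ matrix `X` of word ids (as in the example later on); the function and variable names here are illustrative and need not match those in the linked repository.

```python
import numpy as np

def init_lda(X, N_K, alpha=1.0, beta=1.0):
    """Randomly initialize Z, the document-topic dists Pi, and the topic-word dists B."""
    N_D, N_W = X.shape        # number of documents, words per document
    N_V = int(X.max()) + 1    # vocabulary size

    # Z[d, n]: topic of the n-th word in document d, chosen uniformly at random
    Z = np.random.randint(N_K, size=(N_D, N_W))

    # Pi[d]: topic distribution of document d, drawn from its Dirichlet prior
    Pi = np.random.dirichlet(alpha * np.ones(N_K), size=N_D)

    # B[k]: word distribution of topic k, drawn from its Dirichlet prior
    B = np.random.dirichlet(beta * np.ones(N_V), size=N_K)

    return Z, Pi, B
```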
Then we sample the new values for each of those variables from the full conditionals in the previous section, and iterate:
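Continuing the sketch above, each iteration resamples $Z$, then the document-topic distributions (`Pi`), then the topic-word distributions (`B`) from the full conditionals; again, this is a sketch under the same assumptions rather than the repository's exact code.

```python
def lda_gibbs(X, N_K, n_iters=1000, alpha=1.0, beta=1.0):
    """Uncollapsed Gibbs sampler for LDA using the full conditionals above."""
    N_D, N_W = X.shape
    N_V = int(X.max()) + 1
    Z, Pi, B = init_lda(X, N_K, alpha, beta)

    for _ in range(n_iters):
        # 1. Resample each word's topic: p(z_dn = k | ...) ∝ Pi[d, k] * B[k, x_dn]
        for d in range(N_D):
            for n in range(N_W):
                p = Pi[d] * B[:, X[d, n]]
                Z[d, n] = np.random.choice(N_K, p=p / p.sum())

        # 2. Resample each document's topic distribution: Dir(alpha + c_d)
        for d in range(N_D):
            c_d = np.array([np.sum(Z[d] == k) for k in range(N_K)])
            Pi[d] = np.random.dirichlet(alpha + c_d)

        # 3. Resample each topic's word distribution: Dir(beta + c_k)
        for k in range(N_K):
            c_k = np.array([np.sum((X == v) & (Z == k)) for v in range(N_V)])
            B[k] = np.random.dirichlet(beta + c_k)

    return Z, Pi, B
```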
And basically we are done. We can inspect the result by looking at those variables after the algorithm has run for a number of iterations.
Example
Let’s say we have these data:
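For illustration, here is one such toy corpus: six documents over a five-word vocabulary, where each row is a document and each entry is a word id.

```python
X = np.array([
    [0, 0, 1, 2, 2],   # documents 1-3 draw mostly from words {0, 1, 2}
    [0, 0, 1, 1, 1],
    [0, 1, 2, 2, 2],
    [4, 4, 4, 4, 4],   # documents 4-6 draw mostly from words {3, 4}
    [3, 3, 4, 4, 4],
    [3, 4, 4, 4, 4],
])
```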
Those data are already in bag-of-words representation, so they look a little abstract at a glance. However, if we look closely, we can see two big clusters of documents based on their words: the first three documents mostly share one set of words, while the last three documents mostly share another.
We can run the sampler and inspect the inferred document-topic distribution:
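A minimal sketch, reusing the `lda_gibbs` function sketched earlier; the seed and iteration count are arbitrary choices, so the exact numbers will differ from run to run.

```python
np.random.seed(0)  # arbitrary seed for reproducibility

Z, Pi, B = lda_gibbs(X, N_K=2, n_iters=1000)

# Pi[d] is the inferred topic distribution of document d;
# its argmax is the document's dominant topic.
print(np.round(Pi, 2))
print(Pi.argmax(axis=1))
```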
Indeed, documents 1, 2, and 3 tend to end up in the same cluster, and the same can be said for documents 4, 5, and 6.
References
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022.
- Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.