$$ \newcommand{\dint}{\mathrm{d}} \newcommand{\vphi}{\boldsymbol{\phi}} \newcommand{\vpi}{\boldsymbol{\pi}} \newcommand{\vpsi}{\boldsymbol{\psi}} \newcommand{\vomg}{\boldsymbol{\omega}} \newcommand{\vsigma}{\boldsymbol{\sigma}} \newcommand{\vzeta}{\boldsymbol{\zeta}} \renewcommand{\vx}{\mathbf{x}} \renewcommand{\vy}{\mathbf{y}} \renewcommand{\vz}{\mathbf{z}} \renewcommand{\vh}{\mathbf{h}} \renewcommand{\b}{\mathbf} \renewcommand{\vec}{\mathrm{vec}} \newcommand{\vecemph}{\mathrm{vec}} \newcommand{\mvn}{\mathcal{MN}} \newcommand{\G}{\mathcal{G}} \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\S}{\mathcal{S}} \newcommand{\I}{\mathcal{I}} \newcommand{\diag}[1]{\mathrm{diag}(#1)} \newcommand{\diagemph}[1]{\mathrm{diag}(#1)} \newcommand{\tr}[1]{\text{tr}(#1)} \renewcommand{\C}{\mathbb{C}} \renewcommand{\R}{\mathbb{R}} \renewcommand{\E}{\mathbb{E}} \newcommand{\D}{\mathcal{D}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerbig}[1]{\left \langle #1 \right \rangle} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\two}{\mathrm{II}} \newcommand{\GL}{\mathrm{GL}} \newcommand{\Id}{\mathrm{Id}} \newcommand{\grad}[1]{\mathrm{grad} \, #1} \newcommand{\gradat}[2]{\mathrm{grad} \, #1 \, \vert_{#2}} \newcommand{\Hess}[1]{\mathrm{Hess} \, #1} \newcommand{\T}{\text{T}} \newcommand{\dim}[1]{\mathrm{dim} \, #1} \newcommand{\partder}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\rank}[1]{\mathrm{rank} \, #1} \newcommand{\inv}1 \newcommand{\map}{\text{MAP}} \newcommand{\L}{\mathcal{L}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} $$

Gibbs Sampler for LDA

Latent Dirichlet Allocation (LDA) [1] is a mixed membership model for topic modeling. Given a set of documents in bag-of-words representation, we want to infer the underlying topics that those documents represent. To get a better intuition, we shall look at LDA's generative story. Note that the full code is available at https://github.com/wiseodd/mixture-models.

Let \( i \in \{1, \dots, N_D\} \) index the documents, \( v \in \{1, \dots, N_W\} \) index the words, and \( k \in \{1, \dots, N_K\} \) index the topics. LDA assumes:

\[\begin{align} \mathbf{\pi}_i &\sim \text{Dir}(\mathbf{\pi}_i \, \vert \, \alpha) \\[10pt] z_{iv} &\sim \text{Cat}(z_{iv} \, \vert \, \mathbf{\pi}_i) \\[10pt] \mathbf{b}_k &\sim \text{Dir}(\mathbf{b}_k \, \vert \, \gamma) \\[10pt] y_{iv} &\sim \text{Cat}(y_{iv} \, \vert \, z_{iv} = k, \mathbf{B}) \end{align}\]

where \( \alpha \) and \( \gamma \) are the parameters of the Dirichlet priors. They control how concentrated or spread out the document-topic and topic-word distributions are.
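As a quick illustration (using arbitrary concentration values, not the ones we use later), small Dirichlet parameters tend to produce peaked probability vectors, while large ones tend to produce near-uniform vectors:

import numpy as np

# Symmetric Dirichlet over 3 components with different concentrations.
# Small concentration -> most of the mass on a few components (peaked draw);
# large concentration -> roughly uniform draw.
print(np.random.dirichlet(0.1 * np.ones(3)))
print(np.random.dirichlet(10.0 * np.ones(3)))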

In words, the generative process above is as follows (a short simulation sketch follows the list):

  1. Assume each document is generated by first selecting its topics. Thus, we sample \( \mathbf{\pi}_i \), the topic distribution of the \( i \)-th document.
  2. Assume each word in the \( i \)-th document comes from one of the topics. Therefore, we sample \( z_{iv} \), the topic for each word \( v \) in document \( i \).
  3. Assume each topic is composed of words, e.g. topic “computer” consists of words like “cpu”, “gpu”, etc. Therefore, we sample \( \mathbf{b}_k \), the distribution over words for a particular topic \( k \).
  4. Finally, to actually generate the word, given that we already know it comes from topic \( k \), we sample the word \( y_{iv} \) from the \( k \)-th topic-word distribution.
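To make this story concrete, here is a minimal simulation sketch of the generative process (not part of the inference code). The corpus sizes, hyperparameter values, and the fixed document length below are made up purely for illustration:

import numpy as np

np.random.seed(1)

# Illustrative sizes and hyperparameters (made up for this sketch)
N_D, N_W, N_K = 6, 5, 2   # num of docs, vocabulary size, num of topics
alpha, gamma = 1, 1       # Dirichlet hyperparameters
doc_len = 5               # words per document, assumed fixed for simplicity

# Topic word distributions: one Dirichlet draw per topic
B = np.random.dirichlet(gamma * np.ones(N_W), size=N_K)

docs = []
for i in range(N_D):
    pi_i = np.random.dirichlet(alpha * np.ones(N_K))  # document topic distribution
    words = []
    for v in range(doc_len):
        z_iv = np.random.choice(N_K, p=pi_i)     # pick a topic for this word position
        y_iv = np.random.choice(N_W, p=B[z_iv])  # pick a word from that topic
        words.append(y_iv)
    docs.append(words)

print(np.array(docs))  # each row is a synthetic document as a list of word IDs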

Inference

The goal of inference in LDA is, given a corpus, to infer the underlying topics that explain those documents according to the generative process above. Essentially, given \( y_{iv} \), we are inverting the above process to find \( z_{iv} \), \( \mathbf{\pi}_i \), and \( \mathbf{b}_k \).

We will infer those variables using the Gibbs sampling algorithm. In short, it works by repeatedly sampling each variable from its distribution given all the other variables (its full conditional distribution). Because of the Dirichlet-categorical conjugacy, the full conditionals are as follows:

\[\begin{align} p(z_{iv} = k \, \vert \, \mathbf{\pi}_i, \mathbf{B}) &\propto \exp(\log \pi_{ik} + \log b_{k, y_{iv}}) \\[10pt] p(\mathbf{\pi}_i \, \vert \, \mathbf{z}_i) &= \text{Dir}(\alpha + \mathbf{m}_i), \quad \text{where} \quad m_{ik} = \sum_l \mathbb{I}(z_{il} = k) \\[10pt] p(\mathbf{b}_k \, \vert \, \mathbf{z}, \mathbf{y}) &= \text{Dir}(\gamma + \mathbf{n}_k), \quad \text{where} \quad n_{kv} = \sum_i \sum_l \mathbb{I}(y_{il} = v, z_{il} = k) \end{align}\]

Essentially, what we are doing is counting how often words and documents are assigned to particular topics. Those counts are the sufficient statistics of the full conditionals.
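For concreteness, consider a small made-up case with \( N_K = 2 \) topics: if the current assignments of the five words in document \( i \) are \( \mathbf{z}_i = (1, 1, 2, 1, 2) \), then the count vector is \( \mathbf{m}_i = (3, 2) \), and we resample \( \mathbf{\pi}_i \sim \text{Dir}(\alpha + 3, \alpha + 2) \).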

Given those full conditionals, the rest is as easy as plugging them into the Gibbs sampling framework, as we shall see in the next section.

Implementation

We begin by randomly initializing the topic assignment matrix \( \mathbf{Z}_{N_D \times N_W} \) and by sampling initial values of \( \mathbf{\Pi}_{N_D \times N_K} \) and \( \mathbf{B}_{N_K \times N_W} \) from their priors. (In the toy example below, each document happens to contain exactly \( N_W \) words, which is why \( N_W \) doubles as the document length in \( \mathbf{Z} \).)

import numpy as np

# Dirichlet priors
alpha = 1
gamma = 1

# Z := word topic assignment
Z = np.zeros(shape=[N_D, N_W])

for i in range(N_D):
    for l in range(N_W):
        Z[i, l] = np.random.randint(N_K)  # randomly assign word's topic

# Pi := document topic distribution
Pi = np.zeros([N_D, N_K])

for i in range(N_D):
    Pi[i] = np.random.dirichlet(alpha*np.ones(N_K))

# B := topic word distribution
B = np.zeros([N_K, N_W])

for k in range(N_K):
    B[k] = np.random.dirichlet(gamma*np.ones(N_W))

Then we repeatedly sample new values for each of those variables from the full conditionals derived in the previous section:

for it in range(1000):
    # Sample from full conditional of Z
    # ---------------------------------
    for i in range(N_D):
        for v in range(N_W):
            # Calculate params for Z
            p_iv = np.exp(np.log(Pi[i]) + np.log(B[:, X[i, v]]))
            p_iv /= np.sum(p_iv)

            # Resample word topic assignment Z
            Z[i, v] = np.random.multinomial(1, p_iv).argmax()

    # Sample from full conditional of Pi
    # ----------------------------------
    for i in range(N_D):
        m = np.zeros(N_K)

        # Gather sufficient statistics
        for k in range(N_K):
            m[k] = np.sum(Z[i] == k)

        # Resample doc topic dist.
        Pi[i, :] = np.random.dirichlet(alpha + m)

    # Sample from full conditional of B
    # ---------------------------------
    for k in range(N_K):
        n = np.zeros(N_W)

        # Gather sufficient statistics
        for v in range(N_W):
            for i in range(N_D):
                # Count occurrences of word v in doc i that are assigned to topic k
                n[v] += np.sum((X[i] == v) & (Z[i] == k))

        # Resample topic word dist.
        B[k, :] = np.random.dirichlet(gamma + n)

And we are basically done. We can inspect the result by looking at those variables after the sampler has run for some iterations.
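For example, here is a minimal sketch of how one might inspect the inferred quantities, assuming the Pi, B, and N_K variables from the snippets above are in scope:

print('Document topic distribution:')
print(Pi)

# Show the most probable words under each topic
for k in range(N_K):
    top_words = np.argsort(B[k])[::-1][:3]  # word IDs with highest probability in topic k
    print('Topic {}: {}'.format(k, top_words))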

Example

Let’s say we have these data:

# W := vocabulary of word IDs
W = np.array([0, 1, 2, 3, 4])

# X := document words
X = np.array([
    [0, 0, 1, 2, 2],
    [0, 0, 1, 1, 1],
    [0, 1, 2, 2, 2],
    [4, 4, 4, 4, 4],
    [3, 3, 4, 4, 4],
    [3, 4, 4, 4, 4]
])

N_D = X.shape[0]  # num of docs
N_W = W.shape[0]  # num of words
N_K = 2  # num of topics

Those data are already in bag-of-words representation, so they are a little abstract at a glance. However, if we look at them, we can see two big clusters of documents based on their words: documents \( \{ 1, 2, 3 \} \) and \( \{ 4, 5, 6 \} \). Therefore, we expect that after our sampler converges to the posterior, the topic distributions of those documents will follow this intuition.

Here is the result:

Document topic distribution:
----------------------------
[[ 0.81960751  0.18039249]
 [ 0.8458758   0.1541242 ]
 [ 0.78974177  0.21025823]
 [ 0.20697807  0.79302193]
 [ 0.05665149  0.94334851]
 [ 0.15477016  0.84522984]]

As we can see, documents 1, 2, and 3 do indeed tend to fall into the same cluster, and the same can be said for documents 4, 5, and 6.

References

  1. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022.
  2. Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.