$$ \newcommand{\dint}{\mathrm{d}} \newcommand{\vphi}{\boldsymbol{\phi}} \newcommand{\vpi}{\boldsymbol{\pi}} \newcommand{\vpsi}{\boldsymbol{\psi}} \newcommand{\vomg}{\boldsymbol{\omega}} \newcommand{\vsigma}{\boldsymbol{\sigma}} \newcommand{\vzeta}{\boldsymbol{\zeta}} \renewcommand{\vx}{\mathbf{x}} \renewcommand{\vy}{\mathbf{y}} \renewcommand{\vz}{\mathbf{z}} \renewcommand{\vh}{\mathbf{h}} \renewcommand{\b}{\mathbf} \renewcommand{\vec}{\mathrm{vec}} \newcommand{\vecemph}{\mathrm{vec}} \newcommand{\mvn}{\mathcal{MN}} \newcommand{\G}{\mathcal{G}} \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\S}{\mathcal{S}} \newcommand{\I}{\mathcal{I}} \newcommand{\diag}[1]{\mathrm{diag}(#1)} \newcommand{\diagemph}[1]{\mathrm{diag}(#1)} \newcommand{\tr}[1]{\text{tr}(#1)} \renewcommand{\C}{\mathbb{C}} \renewcommand{\R}{\mathbb{R}} \renewcommand{\E}{\mathbb{E}} \newcommand{\D}{\mathcal{D}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerbig}[1]{\left \langle #1 \right \rangle} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\two}{\mathrm{II}} \newcommand{\GL}{\mathrm{GL}} \newcommand{\Id}{\mathrm{Id}} \newcommand{\grad}[1]{\mathrm{grad} \, #1} \newcommand{\gradat}[2]{\mathrm{grad} \, #1 \, \vert_{#2}} \newcommand{\Hess}[1]{\mathrm{Hess} \, #1} \newcommand{\T}{\text{T}} \newcommand{\dim}[1]{\mathrm{dim} \, #1} \newcommand{\partder}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\rank}[1]{\mathrm{rank} \, #1} \newcommand{\inv}1 \newcommand{\map}{\text{MAP}} \newcommand{\L}{\mathcal{L}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} $$

Deriving Contractive Autoencoder and Implementing it in Keras

In the last post, we saw many different flavors of a family of methods called Autoencoders. There is one more autoencoding method to add on top of them: the Contractive Autoencoder (Rifai et al., 2011).

The idea of the Contractive Autoencoder is to make the learned representation robust to small perturbations of the training examples. It achieves this by imposing a different penalty term on the representation.

The reconstruction term of the loss is the same as in the previous Autoencoders we have seen, i.e. an \( \ell_2 \) loss. The penalty term, however, is more involved: we need to compute the Jacobian of the hidden representation with respect to the input.

Hence, the loss function is as follows:

\[L = \lVert X - \hat{X} \rVert_2^2 + \lambda \lVert J_h(X) \rVert_F^2\]

in which

\[\lVert J_h(X) \rVert_F^2 = \sum_{ij} \left( \frac{\partial h_j(X)}{\partial X_i} \right)^2\]

that is, the penalty term is the squared Frobenius norm of the Jacobian matrix, i.e. the sum of squares of all of its elements. We can think of the Frobenius norm as the generalization of the Euclidean norm to matrices.
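Concretely, here is a tiny NumPy sketch (with a made-up Jacobian, just for illustration) showing that summing the squared entries gives the same value as NumPy's built-in Frobenius norm:

import numpy as np

J = np.random.randn(64, 786)  # a made-up "Jacobian": 64 hidden units, 786 inputs

frob_sq = np.sum(J**2)  # sum of squares of all elements
assert np.isclose(frob_sq, np.linalg.norm(J, 'fro')**2)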

In the loss above, it is clearly the Jacobian that is not straightforward to compute. Calculating the Jacobian of the hidden layer with respect to the input is similar to a gradient calculation. Recall that the Jacobian is the generalization of the gradient: when a function is vector valued, its matrix of partial derivatives is called the Jacobian.

Let’s calculate the jacobian of the hidden layer of our autoencoder then. Let’s say:

\[\begin{align} Z_j &= W_i X_i \\[10pt] h_j &= \phi(Z_j) \end{align}\]

where \( \phi \) is the sigmoid nonlinearity. That is, to get the \( j\text{-th} \) hidden unit, we take the dot product between the input features and the corresponding weights (the summation over \( i \) is implicit in the notation above). Then, using the chain rule:

\[\begin{align} \frac{\partial h_j}{\partial X_i} &= \frac{\partial \phi(Z_j)}{\partial X_i} \\[10pt] &= \frac{\partial \phi(W_i X_i)}{\partial W_i X_i} \frac{\partial W_i X_i}{\partial X_i} \\[10pt] &= [\phi(W_i X_i)(1 - \phi(W_i X_i))] \, W_{i} \\[10pt] &= [h_j(1 - h_j)] \, W_i \end{align}\]

It looks familiar, doesn't it? That is because it is exactly how we compute a gradient. The difference, however, is that we treat \( h(X) \) as a vector valued function, i.e. we treat each \( h_j(X) \) as a separate output. Intuitively, if we have, say, 64 hidden units, then we have 64 function outputs, and hence a gradient vector for each of those 64 hidden units. So when we differentiate the hidden layer, what we get is a Jacobian matrix. And as we now know how to calculate that Jacobian, we can calculate the penalty term in our loss.
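Before moving on, we can quickly verify the key ingredient of that chain rule, namely \( \phi'(z) = \phi(z)(1 - \phi(z)) \), with a finite-difference check (a standalone sketch, not part of the autoencoder code):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

z = 0.5
eps = 1e-6

numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference derivative
analytic = sigmoid(z) * (1 - sigmoid(z))  # the identity used in the derivation above

assert np.isclose(numeric, analytic)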

Let \( \mathrm{diag}(x) \) denote the diagonal matrix whose diagonal entries are the elements of \( x \). The matrix form of the above derivative is then:

\[\frac{\partial h}{\partial X} = \mathrm{diag}[h(1 - h)] \, W^T\]

We need the diagonal matrix because, looking at the elementwise derivative above, the first factor \( h_j(1 - h_j) \) does not depend on \( i \). Hence each row of \( W^T \), which corresponds to one hidden unit \( j \), must be scaled by its own \( h_j(1 - h_j) \), and left-multiplying by a diagonal matrix is the neat way to do exactly that.
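To see that this matrix form is correct, here is a small sketch (made-up shapes, single example) that builds the Jacobian with an explicit diagonal matrix and compares it against a finite-difference approximation of the encoder:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def encode(x, W):
    return sigmoid(np.dot(x, W))  # x: (786,), W: (786, 64) -> h: (64,)

W = 0.1 * np.random.randn(786, 64)
x = np.random.randn(786)
h = encode(x, W)

# Analytic Jacobian: diag[h(1 - h)] W^T, shape (64, 786)
J_analytic = np.dot(np.diag(h * (1 - h)), W.T)

# Finite-difference Jacobian, one input dimension at a time
eps = 1e-6
J_numeric = np.zeros((64, 786))
for i in range(786):
    dx = np.zeros(786)
    dx[i] = eps
    J_numeric[:, i] = (encode(x + dx, W) - encode(x - dx, W)) / (2 * eps)

assert np.allclose(J_analytic, J_numeric, atol=1e-6)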

As our main objective is to calculate the norm, we can simplify the computation so that we do not need to construct the diagonal matrix explicitly:

\[\begin{align} \lVert J_h(X) \rVert_F^2 &= \sum_{ij} \left( \frac{\partial h_j}{\partial X_i} \right)^2 \\[10pt] &= \sum_i \sum_j [h_j(1 - h_j)]^2 (W_{ji}^T)^2 \\[10pt] &= \sum_j [h_j(1 - h_j)]^2 \sum_i (W_{ji}^T)^2 \end{align}\]

Translated to code:

import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# Let's say we have a minibatch of 32, and 64 hidden units
# Our input is a 786-element vector
X = np.random.randn(32, 786)
W = np.random.randn(786, 64)

Z = np.dot(X, W)  # 32x64
h = sigmoid(Z)  # 32x64

Wj_sqr = np.sum(W.T**2, axis=1)  # Sum over the input dimension i (note the transpose), shape (64,)
dhj_sqr = (h * (1 - h))**2  # Squared derivative of h w.r.t. its pre-activation, 32x64
J_norm = np.sum(dhj_sqr * Wj_sqr, axis=1)  # shape (32,), i.e. one Jacobian norm per data point
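To double-check the simplification (again just a sketch, continuing the snippet above), the per-example norm computed this way agrees with the squared Frobenius norm of the Jacobian built explicitly via the diagonal-matrix form, e.g. for the first data point:

dh0 = h[0] * (1 - h[0])  # (64,)
J0 = np.dot(np.diag(dh0), W.T)  # explicit Jacobian of the first example, 64x786
assert np.isclose(np.sum(J0**2), J_norm[0])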

Putting all of those together, we have our full Contractive Autoencoder implemented in Keras:

from keras.layers import Input, Dense
from keras.models import Model
import keras.backend as K


lam = 1e-4

N = 786        # input dimensionality, as in the NumPy example above
N_hidden = 64  # number of hidden units
N_batch = 32   # minibatch size

inputs = Input(shape=(N,))
encoded = Dense(N_hidden, activation='sigmoid', name='encoded')(inputs)
outputs = Dense(N, activation='linear')(encoded)

model = Model(inputs=inputs, outputs=outputs)

def contractive_loss(y_true, y_pred):
    mse = K.mean(K.square(y_true - y_pred), axis=1)

    W = model.get_layer('encoded').kernel  # N x N_hidden; use the layer's weight tensor so the penalty tracks training
    W = K.transpose(W)  # N_hidden x N
    h = model.get_layer('encoded').output
    dh = h * (1 - h)  # N_batch x N_hidden

    # (N_batch x N_hidden) * (N_hidden,) -> sum over hidden units -> (N_batch,)
    contractive = lam * K.sum(dh**2 * K.sum(W**2, axis=1), axis=1)

    return mse + contractive

model.compile(optimizer='adam', loss=contractive_loss)

# X is the training data, of shape (num_examples, N)
model.fit(X, X, batch_size=N_batch, epochs=5)
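Once the model is trained, we usually want the learned representation itself. A simple way to get it (a usage sketch, assuming the code above has run) is to wrap the shared encoding layer in its own model:

encoder = Model(inputs=inputs, outputs=encoded)
H = encoder.predict(X)  # hidden representations, shape (num_examples, N_hidden)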

And that is it! The full code can be found in my GitHub repository: https://github.com/wiseodd/hipsternet.

References

  1. Rifai, Salah, et al. “Contractive Auto-Encoders: Explicit Invariance During Feature Extraction.” Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011.