
Many flavors of Autoencoder

Consider a neural net. Usually we use it for classification and regression tasks, that is, given an input vector \( X \), we want to predict \( y \). In other words, we want the neural net to learn a mapping \( y = f(X) \).

Now, what happens if we use the input data itself as the target? That is, we want to learn a mapping \( X = f(X) \). The neural net will then learn an identity mapping of \( X \). We might ask: how is that useful?

It turns out that the hidden layer(s) of the neural net learn a very interesting representation of the data. We can use this hidden representation for many things, for example data compression, dimensionality reduction, and feature learning. This was in fact a key idea of Deep Learning in the last decade: by stacking Autoencoders and training them greedily, layer by layer, to learn a representation of the data, one could hope to train deep nets effectively.

Vanilla Autoencoder

In its simplest form, an Autoencoder is a two-layer net, i.e. a neural net with one hidden layer. The input and output are the same, and we learn how to reconstruct the input, for example by minimizing the squared \( \ell_{2} \) reconstruction error.
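Concretely, writing the network as an encoder \( enc \) followed by a decoder \( dec \), the objective is the reconstruction loss

$$ \mathcal{L}(X) = \lVert X - dec(enc(X)) \rVert_2^2 , $$

which is what the mse loss in the code below minimizes, averaged over pixels and samples.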

from keras.datasets import mnist
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
from keras import regularizers

import numpy as np
import matplotlib.pyplot as plt


# Load MNIST and flatten each 28x28 image into a 784-dimensional vector in [0, 1]
(X, _), _ = mnist.load_data()
X = X.reshape(-1, 784).astype('float32') / 255.

# Vanilla Autoencoder: 784 -> 64 -> 784, trained to reconstruct its own input
inputs = Input(shape=(784,))
h = Dense(64, activation='sigmoid')(inputs)
outputs = Dense(784)(h)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mse')
model.fit(X, X, batch_size=64, epochs=5)

One question that might surface is: if we are essentially learning an identity mapping, why bother with a fancy algorithm at all? Isn't the identity mapping trivial? The point is that we are trying to learn the identity mapping under some constraints, which makes it non-trivial. The constraints might arise from the architectural decisions of the neural net.

Consider our implementation above: the hidden layer has dimension 64, while the data are vectors of dimension 784. By forcing the network to pass through this bottleneck, we impose a constraint such that it must learn a compressed representation of the data.
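As a quick check, here is a minimal sketch of how we could pull out that 64-dimensional code from the trained model, reusing the inputs and h tensors defined above:

encoder = Model(inputs=inputs, outputs=h)   # maps a 784-d input to its 64-d hidden code
codes = encoder.predict(X)
print(codes.shape)                          # (60000, 64): each image is now 64 numbers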

Sparse Autoencoder

Another way to constrain the Autoencoder is through its loss. We could, for example, add a regularization term to the loss function. Doing this will make our Autoencoder learn a sparse representation of the data.
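With an \( \ell_1 \) activity regularizer on the hidden layer, the objective becomes

$$ \mathcal{L}(X) = \lVert X - dec(enc(X)) \rVert_2^2 + \lambda \lVert enc(X) \rVert_1 , $$

where \( \lambda \) is the regularization strength (1e-5 in the code below). The \( \ell_1 \) penalty pushes many of the hidden activations towards zero, which is what makes the representation sparse.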

# Same architecture, but with an L1 penalty on the hidden activations
inputs = Input(shape=(784,))
h = Dense(64, activation='sigmoid',
          activity_regularizer=regularizers.l1(1e-5))(inputs)
outputs = Dense(784)(h)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mse')
model.fit(X, X, batch_size=64, epochs=5)

Notice that in our hidden layer, we added an \( \ell_{1} \) activity penalty. As a result, the representation is sparser than that of the vanilla Autoencoder. We can see this by looking at the statistics of the hidden layer: the mean hidden activation is 0.512477 for the vanilla Autoencoder, versus 0.148664 for the Sparse Autoencoder.
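For reference, here is a sketch of how such a statistic could be computed. The names inputs_vanilla, h_vanilla, inputs_sparse, and h_sparse are hypothetical: in the snippets above both hidden tensors are simply called h, so they would need to be kept around under separate names.

enc_vanilla = Model(inputs=inputs_vanilla, outputs=h_vanilla)  # hidden layer of the vanilla AE
enc_sparse = Model(inputs=inputs_sparse, outputs=h_sparse)     # hidden layer of the sparse AE

print(np.mean(enc_vanilla.predict(X)))  # mean hidden activation, vanilla
print(np.mean(enc_sparse.predict(X)))   # mean hidden activation, sparse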

Multilayer Autoencoder

One natural thought that might arise is to extend the Autoencoder beyond a single hidden layer.

# Stacked Autoencoder: 784 -> 128 -> 64 -> 128 -> 784
inputs = Input(shape=(784,))
h = Dense(128, activation='relu')(inputs)
encoded = Dense(64, activation='relu',
                activity_regularizer=regularizers.l1(1e-5))(h)
h = Dense(128, activation='relu')(encoded)
outputs = Dense(784)(h)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mse')
model.fit(X, X, batch_size=64, epochs=5)

Now our implementation uses three hidden layers instead of just one. We could pick any layer as the feature representation, but to keep things simple, let's make the architecture symmetrical and use the middle-most layer (the encoded tensor above) as the representation.

Convolutional Autoencoder

We can then extend the idea further: can we use convolutional layers instead of fully connected ones?

# Reshape the flat vectors back into 28x28 single-channel images
X_img = X.reshape(-1, 28, 28, 1)

inputs = Input(shape=(28, 28, 1))
h = Conv2D(4, (3, 3), activation='relu', padding='same')(inputs)
encoded = MaxPooling2D((2, 2))(h)                            # 28x28x4 -> 14x14x4
h = Conv2D(4, (3, 3), activation='relu', padding='same')(encoded)
h = UpSampling2D((2, 2))(h)                                  # 14x14x4 -> 28x28x4
outputs = Conv2D(1, (3, 3), activation='relu', padding='same')(h)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mse')
model.fit(X_img, X_img, batch_size=64, epochs=5)

Above, instead of fully connected layers, we use convolution, pooling, and upsampling layers, as in a convnet. Note that the input must now be shaped as 28x28x1 images rather than flat 784-dimensional vectors.
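As a quick sanity check, here is a minimal sketch (assuming the model above has been trained) that plots a digit next to its reconstruction:

recon = model.predict(X_img[:1])

plt.subplot(1, 2, 1)
plt.imshow(X_img[0].reshape(28, 28), cmap='gray')   # original digit
plt.subplot(1, 2, 2)
plt.imshow(recon[0].reshape(28, 28), cmap='gray')   # reconstruction
plt.show()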

Conclusion

In this post we looked at several types of Autoencoder: vanilla, sparse, multilayer, and convolutional. Each has intriguing properties that come from the imposed constraints, whether architectural choices or additional penalty terms in the loss function.

The learned representation of an Autoencoder can be used for dimensionality reduction or compression, or as features for another task, analogous to using something like PCA to transform the features. It has been shown empirically that using the learned features of an Autoencoder can give a significant boost in classification performance [3].
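For illustration, here is a minimal sketch of that last use case, under the assumption that the encoder model from the vanilla Autoencoder section is still available: the MNIST labels (discarded when we loaded the data) are reloaded, and a small softmax classifier is trained on the 64-dimensional codes instead of the raw pixels.

from keras.utils import to_categorical

(_, y), _ = mnist.load_data()               # recover the labels we discarded earlier
y = to_categorical(y, 10)

codes = encoder.predict(X)                  # 64-d features from the trained encoder

clf_in = Input(shape=(64,))
clf_out = Dense(10, activation='softmax')(clf_in)
clf = Model(inputs=clf_in, outputs=clf_out)
clf.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
clf.fit(codes, y, batch_size=64, epochs=5)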

References

  1. https://en.wikipedia.org/wiki/Autoencoder
  2. https://blog.keras.io/building-autoencoders-in-keras.html
  3. Rifai, Salah, et al. “Contractive auto-encoders: Explicit invariance during feature extraction.” Proceedings of the 28th international conference on machine learning (ICML-11). 2011.