In the last post, we looked at many different flavors of a family of methods called Autoencoders. There is one more autoencoding method on top of them, however, dubbed the Contractive Autoencoder (Rifai et al., 2011).
The idea of the Contractive Autoencoder is to make the learned representation robust towards small changes around the training examples. It achieves this by imposing a different penalty term on the representation.
The loss function for the reconstruction term is similar to that of the previous Autoencoders we have seen, i.e. the L2 loss. The penalty term, however, is more involved: we need to calculate the Jacobian matrix of the representation with respect to the training data.
Hence, the loss function is as follows:

$$ \mathcal{L}(x, \hat{x}) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert J_h(x) \rVert_F^2 $$

in which

$$ \lVert J_h(x) \rVert_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2 $$
That is, the penalty term is the (squared) Frobenius norm of the Jacobian matrix, which is the sum of squares of all the elements in the matrix. We can think of the Frobenius norm as the generalization of the Euclidean norm to matrices.
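As a quick sanity check, here is a small NumPy sketch (the toy matrix and variable names are purely for illustration) showing that the squared Frobenius norm is just the sum of squared entries:

```python
import numpy as np

# A toy "Jacobian": 3 hidden units, 4 input features
J = np.array([[ 0.1, -0.2,  0.0,  0.3],
              [ 0.5,  0.1, -0.4,  0.2],
              [-0.3,  0.2,  0.1,  0.0]])

# Squared Frobenius norm: sum of squares of all entries
frob_sq = np.sum(J**2)

# Same thing via NumPy's built-in matrix norm
assert np.isclose(frob_sq, np.linalg.norm(J, 'fro')**2)
print(frob_sq)
```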
In the loss above, it is clearly the calculation of the Jacobian that is not straightforward. Calculating the Jacobian of the hidden layer with respect to the input is similar to calculating a gradient. Recall that the Jacobian is the generalization of the gradient: when a function is vector valued, its partial derivatives form a matrix called the Jacobian.
Let’s calculate the Jacobian of the hidden layer of our autoencoder then. Let’s say:

$$ h_j = \sigma \Bigl( \sum_i W_{ij} x_i \Bigr) $$

where $\sigma$ is the sigmoid nonlinearity. That is, to get the $j$-th hidden unit, we take the dot product of the input features with the corresponding column of weights. Then, using the chain rule:

$$ \frac{\partial h_j}{\partial x_i} = \sigma \Bigl( \sum_k W_{kj} x_k \Bigr) \Bigl( 1 - \sigma \Bigl( \sum_k W_{kj} x_k \Bigr) \Bigr) W_{ij} = h_j (1 - h_j) \, W_{ij} $$
It looks familiar, doesn’t it? That is because it is exactly how we calculate a gradient. The difference, however, is that we treat $h$ as a vector valued function, i.e. we treat each $h_j$ as a separate output. Intuitively, if we have, say, 64 hidden units, then we have 64 function outputs, and so we get a gradient vector for each of those 64 hidden units. Hence, when we take the derivative of the hidden layer, what we get is a Jacobian matrix. And as we now know how to calculate the Jacobian, we can calculate the penalty term in our loss.
Letting $\mathrm{diag}(\cdot)$ denote a diagonal matrix, the matrix form of the above derivative is as follows:

$$ \frac{\partial h}{\partial x} = \mathrm{diag}\bigl[ h (1 - h) \bigr] \, W^T $$
We need to form a diagonal matrix out of the sigmoid’s derivative $h(1 - h)$ because, if we look carefully at the derivative above, the first term $h_j(1 - h_j)$ does not depend on $i$. Hence, for every weight $W_{ij}$ sharing the same $j$, we want to multiply it with the corresponding $h_j(1 - h_j)$, and the nice way to do that is with a diagonal matrix.
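To convince ourselves that this matrix form is correct, here is a small NumPy check (the shapes and variable names are illustrative) comparing the analytical Jacobian against a finite difference approximation:

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

rng = np.random.RandomState(0)
W = rng.randn(4, 3)          # (N_input, N_hidden), so that h = sigmoid(W.T @ x)
x = rng.randn(4)

h = sigmoid(W.T @ x)

# Analytical Jacobian: diag[h(1 - h)] W^T, shape (N_hidden, N_input)
J_analytic = np.diag(h * (1 - h)) @ W.T

# Numerical Jacobian via central finite differences
eps = 1e-6
J_numeric = np.zeros_like(J_analytic)
for i in range(x.shape[0]):
    e = np.zeros_like(x)
    e[i] = eps
    J_numeric[:, i] = (sigmoid(W.T @ (x + e)) - sigmoid(W.T @ (x - e))) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-6))  # True
```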
As our main objective is to calculate the norm, we can simplify the computation in our implementation so that we do not need to construct the diagonal matrix at all:

$$ \lVert J_h(x) \rVert_F^2 = \sum_{ij} \bigl[ h_j (1 - h_j) \bigr]^2 W_{ij}^2 = \sum_j \bigl[ h_j (1 - h_j) \bigr]^2 \sum_i W_{ij}^2 $$
Translated to code:
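A minimal sketch of that penalty with the Keras backend, assuming a sigmoid encoder; the helper name `contractive_penalty`, the tensor shapes, and the default `lam` are just for illustration:

```python
from keras import backend as K

def contractive_penalty(h, W, lam=1e-4):
    """Squared Frobenius norm of the Jacobian of a sigmoid hidden layer.

    h:   hidden activations, shape (N_batch, N_hidden)
    W:   encoder weights as stored by a Dense layer, shape (N_input, N_hidden)
    lam: weight of the penalty term (assumed value)
    """
    dh = h * (1 - h)                                 # sigmoid derivative, (N_batch, N_hidden)
    w_sq = K.sum(K.square(K.transpose(W)), axis=1)   # sum_i W_ij^2, shape (N_hidden,)
    return lam * K.sum(K.square(dh) * w_sq, axis=1)  # one penalty value per example
```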
Putting all of those together, we have our full Contractive Autoencoder implemented in Keras:
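Here is a minimal end-to-end sketch, assuming MNIST-sized inputs, a single sigmoid Dense encoder named 'encoded', a penalty weight of 1e-4, and the older graph-mode Keras API (all of these choices are illustrative, not the only way to set it up):

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense
from keras.datasets import mnist
from keras import backend as K

N_input = 784      # flattened 28x28 MNIST images (assumed input size)
N_hidden = 64      # number of hidden units (assumed)
lam = 1e-4         # weight of the contractive penalty (assumed)

inputs = Input(shape=(N_input,))
encoded = Dense(N_hidden, activation='sigmoid', name='encoded')(inputs)
outputs = Dense(N_input, activation='sigmoid')(encoded)

model = Model(inputs, outputs)

def contractive_loss(y_true, y_pred):
    # Reconstruction term: plain L2 (mean squared error) loss
    mse = K.mean(K.square(y_true - y_pred), axis=1)

    # Penalty term: ||J_h(x)||_F^2 = sum_j [h_j(1 - h_j)]^2 * sum_i W_ij^2
    W = K.transpose(model.get_layer('encoded').kernel)  # (N_hidden, N_input)
    h = model.get_layer('encoded').output               # (N_batch, N_hidden)
    dh = h * (1 - h)                                     # sigmoid derivative

    contractive = lam * K.sum(K.square(dh) * K.sum(K.square(W), axis=1), axis=1)

    return mse + contractive

model.compile(optimizer='adam', loss=contractive_loss)

# Train as a plain autoencoder: the input is also the target
(X_train, _), _ = mnist.load_data()
X_train = X_train.reshape(-1, N_input).astype('float32') / 255.
model.fit(X_train, X_train, epochs=10, batch_size=128)
```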
Rifai, Salah, et al. “Contractive auto-encoders: Explicit invariance during feature extraction.” Proceedings of the 28th international conference on machine learning (ICML-11). 2011.