Implementing Dropout in Neural Net
Dropout is one of the recent advancements in Deep Learning that enables us to train deeper and deeper networks. Essentially, Dropout acts as a regularizer: it makes the network less prone to overfitting.
As we already know, the deeper the network is, the more parameters it has. For example, VGGNet from the ImageNet 2014 competition has some 148 million parameters. That’s a lot. With that many parameters, the network can easily overfit, especially on a small dataset.
Enter Dropout.
In the training phase, with Dropout, we randomly kill neurons at each hidden layer: each neuron is kept with probability p and killed otherwise. ‘Killing’ a neuron means setting its value to 0. As a neural net is a collection of multiplicative operations, those zeroed neurons won’t propagate anything to the rest of the network.
Let n be the number of neurons in a hidden layer. Then the expected number of neurons that remain active under Dropout is p*n, as we keep each neuron independently with probability p. Concretely, if a hidden layer has 1024 neurons and we set p = 0.5, we can expect only about half of the neurons (512) to be active at any given time.
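As a quick sanity check of that expectation, we can simulate one Dropout mask (this snippet is just an illustration, not part of the network code):

```python
import numpy as np

p, n = 0.5, 1024
mask = np.random.binomial(1, p, size=n)  # 1 = neuron kept, 0 = neuron killed
print(mask.sum())                        # roughly 512 active neurons
```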
Because we force the network to train with only a random subset of roughly p*n neurons at a time, we intuitively force it to learn the data with many different subsets of neurons. The only way the network can perform well under that constraint is to adapt to it and learn a more general representation of the data.
It’s easy to memorize the training data when the network has a lot of parameters (and hence to overfit), but it’s hard when the network effectively has only a fraction of its parameters to work with at any given time. Hence, the network must learn to generalize more to match the performance it would get by memorizing.
So, that’s why Dropout improves test-time performance: it improves generalization and reduces the risk of overfitting.
Let’s see the concrete code for Dropout:
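Here is a minimal sketch of that training-time code, with h1 standing in for a hidden layer and p for the keep probability; the exact implementation lives in the repository linked below:

```python
import numpy as np

p = 0.5                                       # probability of keeping a neuron
h1 = np.random.randn(8)                       # a tiny hidden layer, for illustration

# Sample a Bernoulli mask: 1 keeps the neuron, 0 kills it
u1 = np.random.binomial(1, p, size=h1.shape)
h1 *= u1                                      # killed neurons become exactly 0
```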
First, we sample an array of independent Bernoulli random variables, which is just a collection of zeros and ones indicating whether we keep each neuron or not. For example, the value of u1 might be np.array([1, 0, 0, 1, 1, 0, 1, 0]). If we multiply our hidden layer element-wise by this array, we get the original value of a neuron wherever the array element is 1, and 0 wherever the array element is 0.
For example, after Dropout we compute h2 = np.dot(h1, W2), which is built from multiplications. What is zero times x? It’s zero. A killed neuron therefore adds nothing to h2, and nothing from it reaches the subsequent layers either. That’s why those zeroed neurons won’t contribute anything to the rest of the forward propagation.
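A tiny numeric check of that claim (the values of h1 and W2 here are made up purely for illustration):

```python
import numpy as np

h1 = np.array([0.0, 2.0])      # the first neuron has been killed by Dropout
W2 = np.array([[3.0, 4.0],     # weights going out of the killed neuron
               [5.0, 6.0]])    # weights going out of the surviving neuron

h2 = np.dot(h1, W2)            # -> [10., 12.]
# The first row of W2 is multiplied by 0, so the killed neuron adds nothing to h2.
```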
Now, because we’re only using about p*n of the neurons, the layer output has expectation p*x, where x is the expected output when we use all the neurons (without Dropout).
As we don’t use Dropout at test time, the expected output of the layer is x. That doesn’t match the training phase. To make it match the training-phase expectation, we scale the test-time layer output by p.
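A minimal sketch of that test-time scaling, assuming a ReLU hidden layer computed from some input X and weights W1 (the names and shapes are illustrative, not the repo’s exact code):

```python
import numpy as np

p = 0.5                              # keep probability used during training
X = np.random.randn(4, 8)            # a small batch of inputs (illustrative)
W1 = np.random.randn(8, 16)          # first-layer weights (illustrative)

# Test time: no Dropout mask is applied; instead we scale the activations by p
# so their expected magnitude matches what the next layer saw during training.
h1 = np.maximum(np.dot(X, W1), 0)    # hidden layer as usual (ReLU here)
h1 *= p
```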
In practice, it’s better to simplify things: it’s cumbersome to maintain the scaling in two places. So, we move that scaling into the Dropout code at training time itself.
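This is the ‘inverted Dropout’ formulation. A sketch along the lines of the earlier snippet (same assumed h1 and keep probability p):

```python
import numpy as np

p = 0.5
h1 = np.random.randn(16)             # illustrative hidden layer

# Inverted Dropout: scale the mask by 1/p at training time,
# so the layer's expected output stays at x and test time needs no change.
u1 = np.random.binomial(1, p, size=h1.shape) / p
h1 *= u1
```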
With that code, we make the expectation of the layer output x instead of p*x, because we scale it back up by 1/p. Hence, at test time we don’t need to do anything, as the expected output of the layer is already the same as during training.
Dropout backprop
During backprop, all we need to do is account for Dropout. The killed neurons didn’t contribute anything to the forward pass, so we don’t flow any gradient through them either.
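A sketch of that backward step, reusing the same mask u1 that was sampled in the forward pass (variable names follow the sketches above; dh1 stands for the gradient arriving at h1):

```python
import numpy as np

p = 0.5
u1 = np.random.binomial(1, p, size=16) / p  # mask sampled in the forward pass
dh1 = np.random.randn(16)                   # upstream gradient w.r.t. h1 (illustrative)

# Killed neurons produced no output, so they receive no gradient:
# multiplying by the same mask zeroes their gradient and keeps the 1/p scaling.
dh1 *= u1
```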
For the full example, please refer to: https://github.com/wiseodd/hipsternet/blob/master/hipsternet/neuralnet.py.
Test and Comparison
Test time! But first, let’s declare what kind of network we will use for testing.
We’re using a three-layer network with 256 neurons in each hidden layer. The weights are initialized with Xavier initialization divided by 2, as proposed by He et al., 2015. The data is MNIST, with 55,000 training examples and 10,000 test examples. The optimization algorithm is RMSprop run for 1000 iterations; each run is repeated 5 times and the test accuracy is averaged.
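The ‘Xavier divided by 2’ initialization mentioned above corresponds roughly to the following (a sketch, not necessarily the repo’s exact code):

```python
import numpy as np

n_in, n_out = 784, 256   # e.g. MNIST input into a 256-unit hidden layer

# Xavier initialization divided by 2, i.e. He et al. (2015) initialization:
# the weights get standard deviation sqrt(2 / n_in) instead of sqrt(1 / n_in).
W1 = np.random.randn(n_in, n_out) / np.sqrt(n_in / 2.0)
```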
Looking at the result, the model that uses Dropout yields better accuracy on the test set. The difference of 0.005 might look negligible, but considering we have 10,000 test examples, that’s quite a few additional correct predictions.
The standard deviation of the test accuracy tells a different story, though. The network trained with Dropout performs consistently well at test time, while the non-Dropout network is an order of magnitude worse in terms of consistency. We can see this by comparing the standard deviations: 0.0006 vs 0.0071.
However, when we look at convergence during training, the non-Dropout network converges better and faster. We can see this in the loss reported every 100 iterations.
This indicates that the network without Dropout performs better in the training phase while the Dropout network performs worse. The tables are turned at test time: the Dropout network doesn’t just perform better, it performs consistently better. One could interpret this as a sign that the non-Dropout network is overfitting. So, really, we can see that Dropout regularizes our network and makes it more robust to overfitting.
Conclusion
We looked at one of the driving forces of the recent advancement of Deep Learning: Dropout. It’s a relatively new technique, but it has already made a very big impact in the field. Dropout acts as a regularizer by stochastically killing neurons in the hidden layers, which in turn forces the network to generalize more.
We also implemented Dropout in our model. Implementing Dropout in a neural net is just a matter of a few lines of code, which makes it a very simple method to adopt.
We then compared the Dropout network with a non-Dropout network. The result is nice: the Dropout network performs consistently better at test time than the non-Dropout network.
To see more, check the full example on my GitHub page: https://github.com/wiseodd/hipsternet.