In training deep networks, it is helpful to reduce the learning rate as the number of training epochs increases. The intuition is that with a high learning rate, the model has too much kinetic energy: its parameter vector bounces around chaotically and is unable to settle down into the deeper, narrower parts of the loss function (local minima). If, on the other hand, the learning rate is very small, the system has low kinetic energy and settles into shallow, narrower parts of the loss function (false minima).
Two popular and easy-to-use learning rate schedules are as follows:
- Decrease the learning rate gradually based on the epoch.
- Decrease the learning rate using punctuated large drops at specific epochs (both schedules are sketched below).
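Both schedules can be expressed as a function from the epoch index to a learning rate and passed to Keras's LearningRateScheduler callback. The sketch below is a minimal illustration; the initial rate, decay constant, drop factor, and drop interval are assumed values chosen for the example, not prescribed ones.

```python
import math
from keras.callbacks import LearningRateScheduler

def time_based_decay(epoch):
    # Gradual decay: shrink the rate a little more each epoch.
    initial_lr, decay = 0.1, 0.01  # assumed values
    return initial_lr / (1.0 + decay * epoch)

def step_decay(epoch):
    # Punctuated drops: halve the rate every 10 epochs.
    initial_lr, drop, epochs_drop = 0.1, 0.5, 10  # assumed values
    return initial_lr * math.pow(drop, math.floor(epoch / epochs_drop))

# Attach one schedule when training, e.g.:
# model.fit(X_train, y_train, epochs=50,
#           callbacks=[LearningRateScheduler(step_decay)])
```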
Constant Learning Rate
A constant learning rate is the default learning rate schedule in the SGD optimizer in Keras; momentum and decay rate are both set to zero by default. Choosing the right learning rate is tricky. After experimenting with a range of learning rates in our example, lr=0.1 gives relatively good performance to start with. This can serve as a baseline for experimenting with different learning rate strategies.
import keras
sgd = keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
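For context, here is one way this constant-rate baseline might be wired into a model. The toy model and its input/output sizes are illustrative placeholders, not values from this example:

```python
import keras
from keras.models import Sequential
from keras.layers import Dense

# Toy model; the shapes (20 inputs, 10 classes) are assumed for illustration.
model = Sequential([
    Dense(64, activation='relu', input_shape=(20,)),
    Dense(10, activation='softmax'),
])

# Constant learning rate: no momentum, no decay.
sgd = keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
```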