The learning rate or step size in machine learning is a hyperparameter that determines to what extent newly acquired information overrides old information.[1] It is the most important hyperparameter to tune when training deep neural networks, because it controls both the speed of convergence and the ultimate performance of the network. We usually select the learning rate by trial and error, by drawing on previous experience, or with methods such as the LR finder. A learning rate that is too high will make the optimization jump over minima, while one that is too low will either take too long to converge or get stuck in undesirable local minima.
Finding a decent learning rate for a neural network is like fishing. The selection of the learning rate is one of those things that makes deep learning look like magic. One of the simplest learning rate strategies is to keep a fixed learning rate throughout the training process.[2] During earlier iterations, a faster learning rate leads to faster convergence, while during later epochs a slower learning rate produces better accuracy. Changing the learning rate over time can overcome this tradeoff.
Schedules define how the learning rate changes over time and are typically specified for each epoch or iteration (i.e. batch) of training. The main benefits of learning rate schedules are faster convergence and higher accuracy. They differ from adaptive methods (such as AdaDelta and Adam) because:[2]
The theory of stochastic approximation gives us many types of schedules, but these are not the ones usually used in contemporary deep learning models and frameworks. The theoretical basis of why the schedules used in practice work well is an active area of research.[3] Here, we will look closely at the most commonly used of these schedules:
In step-wise decay, the learning rate is decayed after a fixed number of steps (intervals) by a fixed factor. This fixed factor is called the decay factor, usually represented by \(\gamma\) (gamma).
\(\eta_{n} = \eta_{0} \cdot \gamma^{\lfloor n/k \rfloor}\), where \(\eta_{n}\) is the learning rate at the \(n^{th}\) epoch, \(\gamma\) is the decay rate, and \(k\) is the step size.
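As a minimal sketch, step-wise decay maps onto PyTorch's built-in StepLR scheduler; the toy model, the step_size, and the gamma value below are arbitrary choices for illustration.

import torch

model = torch.nn.Linear(10, 1)                             # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Multiply the learning rate by gamma = 0.5 every k = 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

for epoch in range(90):
    # train_one_epoch(model, optimizer)                    # placeholder training step
    optimizer.step()
    scheduler.step()                                       # lr: 0.1 -> 0.05 -> 0.025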
Tips [5]
Stepwise schedules and the discontinuities they introduce may sometimes lead to instability in the optimization, so in some cases smoother schedules are preferred.[6] In polynomial decay, the learning rate is decayed after every epoch based on a polynomial function, giving a smoother decay that reaches a learning rate of 0 after max_update iterations.[6]
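As a rough sketch, a polynomial schedule can be built on PyTorch's LambdaLR; the names max_update and power follow the description above, and the form (1 - t/max_update)^power is just one common choice of polynomial.

import torch

model = torch.nn.Linear(10, 1)                             # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

max_update, power = 100, 2.0                               # assumed illustrative values
poly = lambda epoch: (1 - min(epoch, max_update) / max_update) ** power
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)

for epoch in range(max_update):
    # train_one_epoch(model, optimizer)                    # placeholder training step
    optimizer.step()
    scheduler.step()                                       # lr reaches 0 at max_update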
The two important quantities in polynomial decay are:
Tips
Like the polynomial decay given above, exponential decay gives a smoother decay, solving the instability issues of step-wise scheduling. Here, however, the learning rate is decayed after every epoch based on an exponential function.
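In PyTorch this corresponds to the built-in ExponentialLR scheduler; the decay rate gamma = 0.95 below is an arbitrary illustrative value.

import torch

model = torch.nn.Linear(10, 1)                             # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Multiply the learning rate by gamma = 0.95 after every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(100):
    # train_one_epoch(model, optimizer)                    # placeholder training step
    optimizer.step()
    scheduler.step()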
The important parameters in exponential decay are:
Tips
All the decay methods above, such as step-wise, polynomial, and exponential decay, reduce the learning rate according to a pre-defined rule. The learning rate may change after a few steps or with every step, but the change is inevitable. Consider a situation where a learning-rate value is performing well; decaying it prematurely may not be a wise idea. Similarly, continuing with a stale learning-rate value while waiting for the next decay step is not helpful either. None of these scheduling methods take into account how the loss is actually behaving at that moment.
So, a better idea may be to decrease the learning rate only when the loss plateaus. This is exactly what we do in ‘Reduce on Loss Plateau Decay’: the decay occurs only after no improvement in the loss value is found. The plateau condition is checked using a fixed value called patience, which determines the number of epochs to wait before changing the learning rate. For example, if patience = 2, then we will ignore the first 2 epochs with no improvement and will only decrease the LR after the 3rd epoch if the loss still hasn’t improved by then.[4]
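A minimal sketch using PyTorch's built-in ReduceLROnPlateau; the toy model, the factor and patience values, and the dummy validation loss are placeholders for illustration.

import torch

model = torch.nn.Linear(10, 1)                             # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate if the monitored loss has not improved for 2 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=2)

for epoch in range(100):
    # val_loss = validate(model)                           # placeholder validation step
    val_loss = 1.0                                         # constant dummy loss, so the LR drops after patience runs out
    optimizer.step()
    scheduler.step(val_loss)                               # pass the monitored metric to step()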
The two important quantities in loss plateau decay are:
Tips
Cosine annealing was proposed in SGDR: Stochastic Gradient Descent with Warm Restarts by Ilya Loshchilov & Frank Hutter. We will only be talking about the cosine annealing part here and leave the warm restarts for a later time. In cosine annealing, the learning rate follows one half-period of the cosine function, i.e. the cosine is evaluated over the range \([0, \pi]\). This is particularly useful for us because in the early iterations it gives us a relatively large learning rate to quickly approach a local minimum (faster convergence), and towards the end it gives us many iterations with a small learning rate (better loss/accuracy).
Important parameters in cosine annealing are:
Tips
Along with all these common LR scheduling methods, we can also make our own schedules. So, let’s make a schedule that decays according to the function \(\log(\frac{1}{x})\).
import math
from torch.optim.lr_scheduler import _LRScheduler

class LogAnnealingLR(_LRScheduler):
    def __init__(self, optimizer, T_max, eta_min=0, last_epoch=-1):
        self.T_max = T_max          # total number of scheduled epochs
        self.eta_min = eta_min      # lower bound on the learning rate
        super(LogAnnealingLR, self).__init__(optimizer, last_epoch)

    def get_lr(self):
        # Map the current epoch to x in (0, 1], then scale log(1/x) so the
        # multiplier falls from 1 at epoch 0 down to 0 at epoch T_max.
        x = (self.last_epoch + 1) / (self.T_max + 1)
        factor = math.log(1 / x) / math.log(self.T_max + 1)
        return [self.eta_min + (base_lr - self.eta_min) * factor
                for base_lr in self.base_lrs]
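To use it, we instantiate it like any other PyTorch scheduler and call step() once per epoch; the toy model and the T_max and eta_min values below are placeholders for illustration.

import torch

model = torch.nn.Linear(10, 1)                             # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = LogAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

for epoch in range(100):
    # train_one_epoch(model, optimizer)                    # placeholder training step
    optimizer.step()
    scheduler.step()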
A better way to understand the scheduling methods is by dividing them based on these two criteria:
By ‘frequency’, I mean the intervals at which the learning rate changes. Across the schedules we have seen above, the learning rate is changed at different intervals. It could change discretely, after a fixed interval (step-wise decay). It could reduce continuously, as in polynomial decay, or only when the loss plateaus (reduce on loss plateau decay). If we think in terms of the frequency of decay, step-wise decay is just a type of exponential or polynomial decay where the decay happens after a fixed interval (a number of steps).
By ‘quantity’, I mean the way the value of the learning rate changes. It could change exponentially, as in step-wise decay, or follow some function, as in cosine annealing decay. The change in learning rate does not have to move only downwards (decay); we have also seen some cyclic methods, which usually outperform these uni-directional decay schedules.