Theory Behind 1Cycle Scheduling and Other Learning Rate Schedules
Introduction to 1Cycle Scheduling
The 1Cycle Learning Rate Policy was introduced by Leslie Smith in 2018 as a way to improve the speed and performance of training deep neural networks, particularly when using large batch sizes. The core idea behind the 1Cycle policy is to help the model find an optimal solution more efficiently by increasing and decreasing the learning rate throughout the training process. This approach contrasts with traditional methods like constant or exponentially decaying learning rates, which often lead to slower convergence and less optimal model performance.
Smith’s work demonstrated that neural networks can achieve super-convergence, meaning the model reaches a near-optimal solution significantly faster, often in half the number of epochs. The key innovation of 1Cycle is the combination of an exploratory phase (high learning rate) and a fine-tuning phase (low learning rate), which lets the model move quickly through less promising regions of the loss landscape and then settle carefully into a good minimum.
—
How 1Cycle Works
1. Phase 1: Increasing the Learning Rate
The learning rate starts at a small value $\eta_{\min}$ and increases linearly up to a high value $\eta_{\max}$ during the first portion of the training. This large learning rate forces the model to explore the parameter space more aggressively, helping it escape local minima and reach better areas of the loss function.
Mathematically, the learning rate during this phase can be described as:
$$\eta_t = \eta_{\min} + (\eta_{\max} - \eta_{\min}) \cdot \frac{t}{T_1}$$
where $t$ is the current training step and $T_1$ is the total number of steps for this phase.
2. Phase 2: Decreasing the Learning Rate
After the peak learning rate is reached, the learning rate is gradually reduced to a much smaller value during the second half of the training process. This phase fine-tunes the model, allowing it to converge to a solution with smaller, more precise steps.
The learning rate decreases linearly as:
$$\eta_t = \eta_{\max} - (\eta_{\max} - \eta_{\min}) \cdot \frac{t}{T_2}$$
where $t$ now counts steps from the start of this phase and $T_2$ is the number of steps in this second phase.
3. Momentum Adjustments
Momentum is inversely adjusted in the 1Cycle schedule, meaning it decreases when the learning rate increases and vice versa. This helps smooth out the updates during high learning rate phases and stabilizes the training process during lower learning rate periods.
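To make the two phases and the inverse momentum cycle concrete, here is a minimal, framework-agnostic sketch in plain Python. The function name, the linear interpolation, and the default values (including the 0.85–0.95 momentum range, which mirrors common 1Cycle settings) are illustrative choices, not a reference implementation:
[code language="python"]
def one_cycle(step, total_steps, lr_min=0.001, lr_max=0.01,
              mom_min=0.85, mom_max=0.95):
    """Illustrative 1Cycle schedule: linear warm-up then linear decay,
    with momentum cycled in the opposite direction of the learning rate."""
    half = total_steps // 2
    if step < half:
        # Phase 1: learning rate rises, momentum falls
        frac = step / half
        lr = lr_min + (lr_max - lr_min) * frac
        momentum = mom_max - (mom_max - mom_min) * frac
    else:
        # Phase 2: learning rate falls, momentum rises
        frac = (step - half) / half
        lr = lr_max - (lr_max - lr_min) * frac
        momentum = mom_min + (mom_max - mom_min) * frac
    return lr, momentum

# Example: inspect the schedule at a few points of a 100-step run
for s in (0, 25, 50, 75, 99):
    print(s, one_cycle(s, 100))
[/code]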
—
Mathematical Justification of 1Cycle
The success of 1Cycle scheduling lies in its ability to escape poor local minima early in training by using high learning rates, then fine-tune with lower learning rates once the optimizer is near a good minimum. The underlying intuition is that a high learning rate injects stochasticity into the updates, allowing the optimizer to avoid poor local minima; when paired with momentum cycling, this leads to more stable and faster convergence.
—
Complementary Learning Rate Schedules
1. Exponential Decay
In exponential decay, the learning rate decreases exponentially with each epoch, following:
$$\eta_t = \eta_0 \, e^{-k t}$$
where $\eta_0$ is the initial learning rate, $k$ is the decay rate, and $t$ is the epoch number. This schedule helps balance exploration (higher learning rates early) with precision (lower learning rates later).
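As a quick illustration (with arbitrarily chosen values $\eta_0 = 0.01$ and $k = 0.1$, used only for this example), the formula can be evaluated directly:
[code language="python"]
import numpy as np

initial_lr = 0.01   # eta_0, chosen for illustration
decay_rate = 0.1    # k, chosen for illustration

# Learning rate over the first few epochs under eta_t = eta_0 * exp(-k * t)
for epoch in range(5):
    lr = initial_lr * np.exp(-decay_rate * epoch)
    print(f"epoch {epoch}: lr = {lr:.5f}")
[/code]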
2. Piecewise Constant Scheduling
In piecewise constant scheduling, the learning rate is held constant for specific intervals and dropped sharply after each interval:
- Epochs 1-10: $\eta = 0.01$
- Epochs 11-20: $\eta = 0.005$
- Epochs 21+: $\eta = 0.001$
This method is often used in simpler training tasks or when the model benefits from more structured learning rate reductions.
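As a concrete sketch, this pattern maps onto Keras's built-in tf.keras.optimizers.schedules.PiecewiseConstantDecay (a manual callback version follows in the implementation section). Note that the built-in schedule counts optimizer steps rather than epochs, so the boundary values below are illustrative step counts:
[code language="python"]
import tensorflow as tf

# Illustrative boundaries in optimizer steps (not epochs)
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[10000, 20000],     # drop the rate after these step counts
    values=[0.01, 0.005, 0.001])   # one more value than boundaries

# The schedule object is passed directly to the optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
[/code]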
3. Cyclical Learning Rates (CLR)
Cyclical learning rates allow the learning rate to oscillate between a minimum and maximum value during training, promoting exploration of the loss landscape. This method was designed to eliminate the need for fine-tuning a single learning rate.
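Concretely, the triangular CLR variant implemented in the Keras example later in this article can be written as:
$$c = \left\lfloor 1 + \frac{t}{2s} \right\rfloor, \qquad x = \left| \frac{t}{s} - 2c + 1 \right|, \qquad \eta_t = \eta_{\min} + (\eta_{\max} - \eta_{\min}) \cdot \max(0,\, 1 - x)$$
where $t$ is the epoch (or step) index, $s$ is the half-cycle length (step size), and $c$ is the current cycle number.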
Practical Implementation of 1Cycle and Learning Rate Schedules
In this part, we’ll explore how to implement various learning rate schedules such as 1Cycle, Exponential Decay, Piecewise Constant, and Cyclical Learning Rates (CLR) in popular deep learning frameworks like Keras and PyTorch.
—
1Cycle Learning Rate Schedule in Keras
The following example demonstrates how to implement the 1Cycle learning rate policy in Keras. Keras has no built-in 1Cycle schedule, but we can approximate it with a custom function passed to the LearningRateScheduler callback.
[code language="python"]
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import LearningRateScheduler

# Function to implement the 1Cycle schedule
def one_cycle_schedule(epoch, max_lr=0.01, base_lr=0.001, total_epochs=100):
    if epoch < total_epochs // 2:
        lr = base_lr + (max_lr - base_lr) * (epoch / (total_epochs // 2))
    else:
        lr = max_lr - (max_lr - base_lr) * ((epoch - total_epochs // 2) / (total_epochs // 2))
    return lr

# Callback for Keras training
lr_scheduler = LearningRateScheduler(lambda epoch: one_cycle_schedule(epoch))

# Define your model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train with the 1Cycle schedule
model.fit(X_train, y_train, epochs=100, callbacks=[lr_scheduler])
[/code]

—

Exponential Decay in Keras

Here’s how you can implement **exponential decay** in Keras using the LearningRateScheduler callback:

[code language="python"]
# Exponential decay function
def exponential_decay(epoch, initial_lr=0.01, decay_rate=0.1, decay_epochs=20):
    return initial_lr * np.exp(-decay_rate * epoch / decay_epochs)

# Callback for exponential decay
lr_scheduler = LearningRateScheduler(lambda epoch: exponential_decay(epoch))

# Train your model
model.fit(X_train, y_train, epochs=100, callbacks=[lr_scheduler])
[/code]
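Alternatively, TensorFlow ships a built-in schedule object, tf.keras.optimizers.schedules.ExponentialDecay, which is passed directly to the optimizer instead of going through a callback. A minimal sketch, assuming TensorFlow 2.x (its decay formula is multiplicative per decay_steps rather than the exp form above):

[code language="python"]
import tensorflow as tf

# Built-in exponential decay: lr = initial_lr * decay_rate ** (step / decay_steps)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=10000,     # illustrative step count
    decay_rate=0.9)        # multiplicative decay factor

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
[/code]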
—
Piecewise Constant Scheduling in Keras
With **Piecewise Constant Scheduling**, the learning rate is kept constant for a certain period, then reduced. This can be achieved as follows:
[code language="python"]
# Piecewise constant learning rate function
def piecewise_constant(epoch):
    if epoch < 10:
        return 0.01
    elif epoch < 20:
        return 0.005
    else:
        return 0.001

# Learning rate scheduler callback
lr_scheduler = LearningRateScheduler(piecewise_constant)

# Training the model
model.fit(X_train, y_train, epochs=30, callbacks=[lr_scheduler])
[/code]

—

Cyclical Learning Rates (CLR) in Keras

**Cyclical learning rates (CLR)** can be implemented by making the learning rate oscillate between a lower and upper bound during training:

[code language="python"]
# Cyclical learning rate function
def cyclical_lr(epoch, base_lr=0.001, max_lr=0.006, step_size=10):
    cycle = np.floor(1 + epoch / (2 * step_size))
    x = np.abs(epoch / step_size - 2 * cycle + 1)
    lr = base_lr + (max_lr - base_lr) * np.maximum(0, (1 - x))
    return lr

# CLR scheduler
lr_scheduler = LearningRateScheduler(lambda epoch: cyclical_lr(epoch))

# Training the model with CLR
model.fit(X_train, y_train, epochs=50, callbacks=[lr_scheduler])
[/code]
—
1Cycle Learning Rate Schedule in PyTorch
For PyTorch, 1Cycle scheduling is available out of the box via the torch.optim.lr_scheduler.OneCycleLR class:
[code language="python"]
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR

# Define model, optimizer, and loss function
model = MyModel()
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

# Set up the 1Cycle learning rate scheduler
# (total steps = epochs * batches per epoch, since we step once per batch)
scheduler = OneCycleLR(optimizer, max_lr=0.01,
                       steps_per_epoch=len(train_loader), epochs=100)

# Training loop
for epoch in range(100):
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        scheduler.step()  # Update learning rate after each batch
[/code]
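Note that OneCycleLR also cycles momentum in the opposite direction of the learning rate by default (cycle_momentum=True), so the momentum behaviour described in the theory section comes for free when the optimizer supports momentum, as SGD does here.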
—
Exponential Decay in PyTorch
In PyTorch, **exponential decay** is implemented using the ExponentialLR scheduler:
[code language="python"]
from torch.optim.lr_scheduler import ExponentialLR

# Set up the exponential decay scheduler
scheduler = ExponentialLR(optimizer, gamma=0.9)  # gamma is the decay rate

# Training loop with learning rate scheduler
for epoch in range(100):
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    scheduler.step()  # Decay the learning rate after each epoch
[/code]
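—

Cyclical Learning Rates in PyTorch

For completeness, PyTorch also provides a cyclical schedule out of the box, torch.optim.lr_scheduler.CyclicLR. A minimal sketch, assuming the same model and training loop structure as above (the bounds and step size are illustrative):

[code language="python"]
import torch.optim as optim
from torch.optim.lr_scheduler import CyclicLR

# Reuse (or recreate) an optimizer with momentum, since CyclicLR cycles momentum by default
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Triangular cyclical learning rate between base_lr and max_lr
scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.006,
                     step_size_up=2000,   # batches in the increasing half-cycle
                     mode='triangular')

# As with OneCycleLR, call scheduler.step() once per batch inside the training loop
[/code]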
—
Conclusion
These examples show how to implement various learning rate schedules like **1Cycle**, **exponential decay**, **piecewise constant scheduling**, and **cyclical learning rates** in both Keras and PyTorch. Each method has its own advantages depending on the task, dataset, and model architecture, and experimenting with these schedules can lead to faster convergence and improved performance in your deep learning projects.