
Theory Behind 1Cycle Learning Rate Scheduling and Learning Rate Schedules – Day 43






Learning Rate Schedules: 1Cycle and Beyond



Introduction to 1Cycle Scheduling

The 1Cycle Learning Rate Policy was introduced by Leslie Smith in 2018 as a way to improve the speed and performance of training deep neural networks, particularly when using large batch sizes. The core idea is to help the model find a good solution more efficiently by first increasing and then decreasing the learning rate over the course of training. This approach contrasts with traditional methods like constant or exponentially decaying learning rates, which can converge more slowly and settle on worse solutions.

Smith’s work demonstrated that neural networks could achieve super-convergence, meaning the model reaches a near-optimal solution significantly faster, often in half the number of epochs. The key innovation of 1Cycle is its exploratory phase (high learning rate) followed by its fine-tuning phase (low learning rate), which lets the model move quickly through unimportant regions of the loss landscape and then take small, careful steps as it approaches a good minimum.

How 1Cycle Works

1. Phase 1: Increasing the Learning Rate

The learning rate starts at a small value (\eta_0) and increases linearly up to a high value (\eta_{\text{max}}) during the first portion of the training. This large learning rate forces the model to explore the parameter space more aggressively, helping it escape local minima and reach better areas of the loss function.

Mathematically, the learning rate during this phase can be described as:

\eta(t) = \eta_0 + (\eta_{\text{max}} - \eta_0) \cdot \frac{t}{T_1}

where t is the current training step, and T_1 is the total number of steps for this phase.

2. Phase 2: Decreasing the Learning Rate

After the peak learning rate is reached, the learning rate is gradually reduced to a much smaller value (\eta_{\text{min}}) during the second portion of the training process. This phase fine-tunes the model, allowing it to converge to a solution with smaller, more precise steps.

The learning rate decreases linearly as:

\eta(t) = \eta_{\text{max}} - (\eta_{\text{max}} - \eta_{\text{min}}) \cdot \frac{t - T_1}{T_2}

where T_2 is the number of steps in this second phase, so the formula applies for T_1 \le t \le T_1 + T_2.
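The two phase formulas can be combined into one function of the training step. The following is a minimal sketch, not a reference implementation: the function name and the default values for eta_0, eta_max, eta_min, T1, and T2 are illustrative assumptions rather than values taken from Smith's work.

```python
def one_cycle_lr(t, eta_0=0.001, eta_max=0.01, eta_min=0.0001, T1=50, T2=50):
    """Learning rate at step t for a two-phase linear 1Cycle schedule.

    All default values are illustrative assumptions.
    """
    if t <= T1:
        # Phase 1: linear increase from eta_0 to eta_max over T1 steps
        return eta_0 + (eta_max - eta_0) * t / T1
    # Phase 2: linear decrease from eta_max to eta_min over T2 steps.
    # Steps past the end of the schedule are clamped so they stay at eta_min.
    t = min(t, T1 + T2)
    return eta_max - (eta_max - eta_min) * (t - T1) / T2


# Print the learning rate at the start, the peak, and the end of the cycle
for step in (0, 25, 50, 75, 100):
    print(step, round(one_cycle_lr(step), 6))
```

The rate starts at eta_0, peaks at eta_max when t = T1, and ends at eta_min when t = T1 + T2.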

3. Momentum Adjustments

Momentum is inversely adjusted in the 1Cycle schedule, meaning it decreases when the learning rate increases and vice versa. This helps smooth out the updates during high learning rate phases and stabilizes the training process during lower learning rate periods.
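As a rough illustration, momentum can follow the mirror image of the learning rate schedule. This sketch assumes a linear schedule and a momentum range of 0.85 to 0.95; that range is a commonly used choice rather than something specified in the text, and the function name is hypothetical.

```python
def one_cycle_momentum(t, m_max=0.95, m_min=0.85, T1=50, T2=50):
    """Momentum at step t, moving in the opposite direction to the learning rate.

    The 0.85-0.95 range and the phase lengths are illustrative assumptions.
    """
    if t <= T1:
        # Learning rate is rising, so momentum falls from m_max to m_min
        return m_max - (m_max - m_min) * t / T1
    # Learning rate is falling, so momentum rises back from m_min to m_max
    t = min(t, T1 + T2)
    return m_min + (m_max - m_min) * (t - T1) / T2


# Momentum is lowest exactly where the learning rate peaks (t = T1)
for step in (0, 50, 100):
    print(step, round(one_cycle_momentum(step), 4))
```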

Mathematical Justification of 1Cycle

The success of 1Cycle scheduling is usually attributed to its ability to escape poor local minima early in training by using high learning rates, then fine-tuning near a good minimum using lower learning rates. The intuition is that a high learning rate makes each stochastic gradient step larger and noisier, which makes the optimizer less likely to settle into a poor local minimum. When paired with momentum cycling, this tends to give more stable and faster convergence.

Complementary Learning Rate Schedules

1. Exponential Decay

In exponential decay, the learning rate decreases exponentially with each epoch, following:

\eta(t) = \eta_0 \cdot e^{-\lambda t}

where \lambda is the decay rate and t is the epoch number. This schedule helps balance exploration (higher learning rates early) with precision (lower learning rates later).
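To get a feel for the decay rate, it can help to compute the half-life, the number of epochs after which the learning rate has halved. Setting e^{-\lambda t} = 1/2 gives t = \ln 2 / \lambda. The sketch below evaluates this; the values of eta_0 and lam are arbitrary assumptions chosen for illustration.

```python
import math


def exponential_decay_lr(t, eta_0=0.01, lam=0.05):
    """Learning rate at epoch t: eta_0 * exp(-lam * t). Defaults are illustrative."""
    return eta_0 * math.exp(-lam * t)


def half_life(lam):
    """Number of epochs after which the learning rate has halved."""
    return math.log(2) / lam


# With lam = 0.05 the rate halves roughly every 14 epochs
print(round(half_life(0.05), 2))
print(exponential_decay_lr(half_life(0.05)))
```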

2. Piecewise Constant Scheduling

In piecewise constant scheduling, the learning rate is held constant for specific intervals and dropped sharply after each interval:

  • Epochs 1-10: \eta = 0.01
  • Epochs 11-20: \eta = 0.005
  • Epochs 21+: \eta = 0.001

This method is often used in simpler training tasks or when the model benefits from more structured learning rate reductions.
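The schedule in the list above can be expressed as a lookup over interval boundaries. This is one possible sketch using the standard-library bisect module; the names BOUNDARIES, VALUES, and piecewise_constant_lr are hypothetical, and epochs are counted from 1 to match the list.

```python
from bisect import bisect_left

# Last epoch (1-based) of each interval except the final, open-ended one
BOUNDARIES = [10, 20]
# Learning rate used within each interval
VALUES = [0.01, 0.005, 0.001]


def piecewise_constant_lr(epoch, boundaries=BOUNDARIES, values=VALUES):
    """Learning rate for a 1-based epoch number."""
    # bisect_left counts how many boundaries lie strictly below this epoch,
    # which is the index of the interval the epoch falls into
    return values[bisect_left(boundaries, epoch)]


for epoch in (1, 10, 11, 20, 21, 50):
    print(epoch, piecewise_constant_lr(epoch))
```

Keeping the boundaries and values in lists makes it easy to add more drops without adding more branches.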

3. Cyclical Learning Rates (CLR)

Cyclical learning rates allow the learning rate to oscillate between a minimum and maximum value during training, promoting exploration of the loss landscape. This method was designed to eliminate the need for fine-tuning a single learning rate.
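One common form of this oscillation is a triangular wave: the rate climbs linearly from the minimum to the maximum over step_size steps, then falls back over the next step_size steps, and repeats. The sketch below assumes this triangular form; the bounds and step_size are illustrative values only.

```python
import math


def triangular_clr(t, eta_min=0.001, eta_max=0.006, step_size=10):
    """Triangular cyclical learning rate at step t. Defaults are illustrative."""
    # Index of the current cycle (1, 2, 3, ...); one cycle spans 2 * step_size steps
    cycle = math.floor(1 + t / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to the range [0, 1]
    x = abs(t / step_size - 2 * cycle + 1)
    return eta_min + (eta_max - eta_min) * max(0.0, 1 - x)


# The rate bottoms out at steps 0, 20, 40 and peaks at steps 10, 30
for step in (0, 5, 10, 20, 30, 40):
    print(step, round(triangular_clr(step), 6))
```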






Practical Implementation of 1Cycle and Learning Rate Schedules



In this part, we’ll explore how to implement various learning rate schedules such as 1Cycle, Exponential Decay, Piecewise Constant, and Cyclical Learning Rates (CLR) in popular deep learning frameworks like Keras and PyTorch.

1Cycle Learning Rate Schedule in Keras

The following example demonstrates how to implement the 1Cycle learning rate policy in Keras. Keras has no built-in 1Cycle function, but we can customize it using the LearningRateScheduler callback.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import LearningRateScheduler

# Function to implement the 1Cycle schedule
def one_cycle_schedule(epoch, max_lr=0.01, base_lr=0.001, total_epochs=100):
    half = total_epochs // 2
    if epoch < half:
        # First half: ramp up linearly from base_lr to max_lr
        lr = base_lr + (max_lr - base_lr) * (epoch / half)
    else:
        # Second half: ramp back down linearly from max_lr towards base_lr
        lr = max_lr - (max_lr - base_lr) * ((epoch - half) / half)
    return lr

# Callback for Keras training (the lambda passes only the epoch index)
lr_scheduler = LearningRateScheduler(lambda epoch: one_cycle_schedule(epoch))

# Define your model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer=tf.keras.optimizers.SGD(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train with the 1Cycle schedule
# (X_train and y_train stand for your own training inputs and integer labels)
model.fit(X_train, y_train, epochs=100, callbacks=[lr_scheduler])
```

Exponential Decay in Keras

Here’s how you can implement **exponential decay** in Keras using the LearningRateScheduler callback:

```python
# Exponential decay function
def exponential_decay(epoch, initial_lr=0.01, decay_rate=0.1, decay_epochs=20):
    return initial_lr * np.exp(-decay_rate * epoch / decay_epochs)

# Callback for exponential decay
lr_scheduler = LearningRateScheduler(lambda epoch: exponential_decay(epoch))

# Train your model
model.fit(X_train, y_train, epochs=100, callbacks=[lr_scheduler])
```

Piecewise Constant Scheduling in Keras

With **Piecewise Constant Scheduling**, the learning rate is kept constant for a certain period, then reduced. This can be achieved as follows:

```python
# Piecewise constant learning rate function
def piecewise_constant(epoch):
    # Keras passes a 0-based epoch index, so epoch < 10 covers epochs 1-10
    if epoch < 10:
        return 0.01
    elif epoch < 20:
        return 0.005
    else:
        return 0.001

# Learning rate scheduler callback
lr_scheduler = LearningRateScheduler(piecewise_constant)

# Training the model
model.fit(X_train, y_train, epochs=30, callbacks=[lr_scheduler])
```

Cyclical Learning Rates (CLR) in Keras

**Cyclical learning rates (CLR)** can be implemented by making the learning rate oscillate between a lower and upper bound during training:

```python
# Cyclical learning rate function
def cyclical_lr(epoch, base_lr=0.001, max_lr=0.006, step_size=10):
    # Index of the current cycle; one full cycle spans 2 * step_size epochs
    cycle = np.floor(1 + epoch / (2 * step_size))
    # Scaled distance from the peak of the current cycle
    x = np.abs(epoch / step_size - 2 * cycle + 1)
    lr = base_lr + (max_lr - base_lr) * np.maximum(0, (1 - x))
    return lr

# CLR scheduler
lr_scheduler = LearningRateScheduler(lambda epoch: cyclical_lr(epoch))

# Training the model with CLR
model.fit(X_train, y_train, epochs=50, callbacks=[lr_scheduler])
```

1Cycle Learning Rate Schedule in PyTorch

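For PyTorch, 1Cycle scheduling is available out of the box through the torch.optim.lr_scheduler.OneCycleLR scheduler. Because the scheduler is stepped after every batch, its total number of steps must cover every batch of every epoch:

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR

# Define model, optimizer, and loss function
# (MyModel and train_loader stand for your own model class and DataLoader)
model = MyModel()
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

# Set up the 1Cycle learning rate scheduler.
# epochs * steps_per_epoch must equal the number of scheduler.step() calls;
# stepping past that total raises an error.
epochs = 100
scheduler = OneCycleLR(optimizer, max_lr=0.01, epochs=epochs, steps_per_epoch=len(train_loader))

# Training loop
for epoch in range(epochs):
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        scheduler.step()  # Update learning rate after each batch
```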

Exponential Decay in PyTorch

In PyTorch, **exponential decay** is implemented using the ExponentialLR scheduler:

```python
from torch.optim.lr_scheduler import ExponentialLR

# Set up the exponential decay scheduler
# (assumes model, criterion, train_loader, and a freshly created optimizer as in the previous example)
scheduler = ExponentialLR(optimizer, gamma=0.9)  # lr is multiplied by gamma after each epoch

# Training loop with learning rate scheduler
for epoch in range(100):
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    scheduler.step()  # Decay the learning rate after each epoch
```
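ExponentialLR is parameterized by a multiplicative factor gamma rather than by the decay rate \lambda used in the formula \eta(t) = \eta_0 \cdot e^{-\lambda t}. Multiplying by gamma once per epoch gives \eta_0 \cdot \gamma^t, so the two forms agree when \gamma = e^{-\lambda}. The short sketch below checks this conversion in plain Python; the helper names are hypothetical and are not part of PyTorch.

```python
import math


def gamma_from_lambda(lam):
    """Per-epoch multiplicative factor equivalent to a decay rate lam."""
    return math.exp(-lam)


def lambda_from_gamma(gamma):
    """Decay rate equivalent to a per-epoch multiplicative factor gamma."""
    return -math.log(gamma)


def lr_after(epochs, eta_0, gamma):
    """Learning rate after multiplying by gamma once per epoch."""
    return eta_0 * gamma ** epochs


# gamma = 0.9 corresponds to a decay rate of roughly 0.105 per epoch
print(round(lambda_from_gamma(0.9), 4))
```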

Conclusion

These examples show how to implement **1Cycle**, **exponential decay**, **piecewise constant scheduling**, and **cyclical learning rates** in Keras, along with **1Cycle** and **exponential decay** in PyTorch. Each method has its own advantages depending on the task, dataset, and model architecture, and experimenting with these schedules can lead to faster convergence and improved performance in your deep learning projects.