
Learning rate – part 4: 1-Cycle Scheduling, Exponential Decay, and Cyclic Exponential Decay (CED) – day 45






Advanced Learning Rate Scheduling Methods for Machine Learning: A Detailed Guide with Mathematical Proofs and Implementations

Learning rate scheduling is critical in optimizing machine learning models, helping them converge faster and avoid pitfalls such as getting stuck in local minima. In this guide, we explore three key learning rate schedules: Exponential Decay, Cyclic Exponential Decay (CED), and 1-Cycle Scheduling, providing mathematical proofs, code implementations, and theory behind each method.


1. Exponential Decay Learning Rate

Exponential Decay reduces the learning rate by a factor of e^{-kt}, allowing larger updates early in training and smaller, more refined updates as the model approaches convergence.

Formula:

\text{lr}_t = \text{lr}_0 \cdot e^{-kt}

Where:

  • \text{lr}_t is the learning rate at time step t,
  • \text{lr}_0 is the initial learning rate,
  • k is the decay rate, controlling how fast the learning rate decreases,
  • t represents the current time step (or epoch).
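
For example, with an initial learning rate \text{lr}_0 = 0.1 and decay rate k = 0.01 (illustrative values), the learning rate after 100 steps has shrunk to roughly a third of its starting value:

\text{lr}_{100} = 0.1 \cdot e^{-0.01 \cdot 100} = 0.1 \cdot e^{-1} \approx 0.037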

Mathematical Proof of Exponential Decay

The core idea of exponential decay is that the learning rate decreases over time. Let’s sketch why this drives the parameter updates toward zero and lets the model settle.

The parameter update rule for gradient descent is:

\theta_{t+1} = \theta_t - \text{lr}_t \cdot \nabla L(\theta_t)

Substituting the exponentially decayed learning rate:

\theta_{t+1} = \theta_t - \text{lr}_0 \cdot e^{-kt} \cdot \nabla L(\theta_t)

As t \to \infty, the decay factor e^{-kt} \to 0, meaning that the updates to \theta become smaller and smaller, allowing the model to settle into a minimum.
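
More precisely, suppose the gradient norm is bounded, say \|\nabla L(\theta_t)\| \leq G (an assumption not stated above, but standard for this kind of argument). Then the total distance the parameters can travel is bounded by a convergent geometric series:

\sum_{t=0}^{\infty} \text{lr}_0 \cdot e^{-kt} \cdot G = \frac{\text{lr}_0 \, G}{1 - e^{-k}}

so the iterates remain within a bounded region even if training runs forever.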

TensorFlow/Keras Implementation:

import tensorflow as tf

# Minimal model and synthetic data so the example runs end to end (illustrative only)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='softmax', input_shape=(20,))
])
x_train = tf.random.normal((256, 20))
y_train = tf.random.uniform((256,), maxval=10, dtype=tf.int32)

initial_learning_rate = 0.01
decay_steps = 100000
decay_rate = 0.96

# Multiplies the learning rate by decay_rate every decay_steps steps;
# staircase=True applies the decay in discrete jumps rather than continuously
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate, decay_steps, decay_rate, staircase=True
)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, epochs=5)
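
Note that Keras’ ExponentialDecay uses a power form rather than e^{-kt} directly; with staircase=False the two are equivalent if you set k = -\ln(\text{decay\_rate}) / \text{decay\_steps}:

\text{lr}_t = \text{lr}_0 \cdot \text{decay\_rate}^{\,t/\text{decay\_steps}} = \text{lr}_0 \cdot e^{-kt}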

PyTorch Implementation:

import torch
import torch.optim as optim

model = torch.nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Multiplies the learning rate by gamma after every scheduler.step() call
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)

for epoch in range(100):
    optimizer.zero_grad()
    output = model(torch.randn(10))               # dummy input
    loss = (output - torch.ones(1)).pow(2).sum()  # simple squared-error loss
    loss.backward()
    optimizer.step()
    scheduler.step()                              # decay the learning rate once per epoch
    print(f'Epoch {epoch+1}: Learning Rate = {scheduler.get_last_lr()[0]}')
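
Here gamma plays the role of e^{-k} with one decay step per epoch, so gamma = 0.96 corresponds to

k = -\ln(0.96) \approx 0.041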

MLX Implementation:

import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def exponential_decay_lr(epoch, initial_lr=0.01, decay_rate=0.96):
    return initial_lr * (decay_rate ** epoch)

model = nn.Linear(10, 1)
optimizer = optim.SGD(learning_rate=0.01)

def loss_fn(model, x, y):
    return mx.mean((model(x) - y) ** 2)

# MLX computes gradients functionally rather than via backward()
loss_and_grad_fn = nn.value_and_grad(model, loss_fn)

for epoch in range(100):
    lr = exponential_decay_lr(epoch)
    optimizer.learning_rate = lr      # set the decayed rate before the update
    loss, grads = loss_and_grad_fn(model, mx.random.normal((1, 10)), mx.ones((1, 1)))
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
    print(f'Epoch {epoch}: Learning Rate = {lr}')

2. Cyclic Exponential Decay (CED)

Cyclic Exponential Decay (CED) extends exponential decay by multiplying it with a periodic factor. This allows the model to escape local minima, because the learning rate periodically rises back toward its decayed maximum instead of shrinking monotonically.

Formula:

\text{lr}_t = \text{lr}_0 \cdot e^{-kt} \cdot \frac{1}{2}\left(1 + \cos\left(\frac{2\pi t}{T}\right)\right)

Where:

  • T is the cycle length,
  • k is the decay rate.
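
With this form, the cyclic factor equals 1 at the start of every cycle (t = 0, T, 2T, \ldots), so the learning rate touches its decayed envelope \text{lr}_0 \cdot e^{-kt}, and it drops to 0 halfway through each cycle:

\frac{1}{2}\left(1 + \cos(0)\right) = 1, \qquad \frac{1}{2}\left(1 + \cos(\pi)\right) = 0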

Mathematical Proof of Cyclic Exponential Decay

The cyclic component of CED ensures periodic exploration of the parameter space, while the exponential decay guarantees eventual convergence.

The shifted cosine term introduces the cyclic behavior. A raw cosine would make the learning rate negative for half of every cycle (turning gradient descent into ascent), so the factor is shifted and scaled to stay between 0 and 1:

\frac{1}{2}\left(1 + \cos\left(\frac{2\pi t}{T}\right)\right)

The learning rate still decays over time because of the e^{-kt} envelope, but the periodic rises back toward that envelope prevent the optimizer from settling into a poor local minimum too early.
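
Since the cyclic factor never exceeds 1, the schedule always stays below the plain exponential-decay envelope, which is what guarantees the overall decay:

0 \leq \text{lr}_t \leq \text{lr}_0 \cdot e^{-kt}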

PyTorch Implementation:

import math

def cyclic_exponential_decay(epoch, lr_init=0.1, decay_rate=0.96, cycle_length=10):
    cycle = math.floor(epoch / cycle_length)
    # Shifted cosine keeps the cyclic factor in [0, 1], so the learning rate never goes negative
    cyclic = 0.5 * (1 + math.cos((2 * math.pi * (epoch % cycle_length)) / cycle_length))
    return lr_init * (decay_rate ** cycle) * cyclic

# Reuses `model` and `optimizer` from the PyTorch example above
for epoch in range(100):
    lr = cyclic_exponential_decay(epoch)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr            # PyTorch stores learning rates in param_groups
    optimizer.zero_grad()
    loss = (model(torch.randn(10)) - torch.ones(1)).pow(2).sum()
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch+1}: Learning Rate = {lr}')

MLX Implementation:

def cyclic_exponential_decay(epoch, lr_init=0.1, decay_rate=0.96, cycle_length=10):
    cycle = math.floor(epoch / cycle_length)
    # Same shifted cosine factor as the PyTorch version, kept in [0, 1]
    cyclic = 0.5 * (1 + math.cos((2 * math.pi * (epoch % cycle_length)) / cycle_length))
    return lr_init * (decay_rate ** cycle) * cyclic

# Reuses `model`, `loss_and_grad_fn`, and `optimizer` from the MLX example above
for epoch in range(100):
    lr = cyclic_exponential_decay(epoch)
    optimizer.learning_rate = lr
    loss, grads = loss_and_grad_fn(model, mx.random.normal((1, 10)), mx.ones((1, 1)))
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
    print(f'Epoch {epoch}: Learning Rate = {lr}')

3. 1-Cycle Learning Rate Scheduling

1-Cycle Scheduling is a powerful technique that increases the learning rate in the first half of training and decreases it in the second half. This helps the model explore the parameter space early on and converge smoothly later.

Formula:

\text{lr}_t = \begin{cases}  \text{lr}_{\min} + \frac{t}{t_{\max}/2} \cdot (\text{lr}_{\max} - \text{lr}_{\min}) & \text{if } t \leq t_{\max}/2 \\ \text{lr}_{\max} - \frac{t - t_{\max}/2}{t_{\max}/2} \cdot (\text{lr}_{\max} - \text{lr}_{\min}) & \text{if } t > t_{\max}/2 \end{cases}

Where:

  • t_{\max} is the total number of iterations,
  • \text{lr}_{\min} is the starting (minimum) learning rate,
  • \text{lr}_{\max} is the peak learning rate, reached at the midpoint of training.

Mathematical Proof of 1-Cycle Scheduling

The 1-Cycle method is based on the idea that increasing the learning rate early in training allows the model to explore the parameter space, while decreasing the learning rate later encourages the model to converge smoothly.

During the first half of training, the learning rate increases linearly:

\text{lr}_t = \text{lr}_{\min} + \frac{t}{t_{\max}/2} \cdot (\text{lr}_{\max} - \text{lr}_{\min})

This encourages larger parameter updates, which helps the optimizer escape local minima.

During the second half, the learning rate decreases linearly:

\text{lr}_t = \text{lr}_{\max} - \frac{t - t_{\max}/2}{t_{\max}/2} \cdot (\text{lr}_{\max} - \text{lr}_{\min})

This fine-tunes the model as it approaches a solution, ensuring smoother convergence.
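
As a quick numerical check with illustrative values \text{lr}_{\min} = 0.01, \text{lr}_{\max} = 0.1, and t_{\max} = 1000, a quarter of the way through training (t = 250) the learning rate sits halfway up the ramp:

\text{lr}_{250} = 0.01 + \frac{250}{1000/2} \cdot (0.1 - 0.01) = 0.01 + 0.5 \cdot 0.09 = 0.055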

PyTorch Implementation:

import torch
import torch.optim as optim

model = torch.nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# OneCycleLR ramps the learning rate up to max_lr and back down over
# steps_per_epoch * epochs total steps (by default with cosine annealing
# rather than the linear ramps described above)
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, steps_per_epoch=100, epochs=10)

for epoch in range(10):
    for step in range(100):
        optimizer.zero_grad()
        output = model(torch.randn(10))
        loss = (output - torch.ones(1)).pow(2).sum()
        loss.backward()
        optimizer.step()
        scheduler.step()                 # OneCycleLR is stepped once per batch, not per epoch
        print(f'Epoch {epoch+1}, Step {step+1}: Learning Rate = {scheduler.get_last_lr()[0]}')
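
MLX Implementation (manual):

MLX does not, to my knowledge, ship a dedicated 1-cycle scheduler, but the piecewise-linear schedule above is easy to write by hand. The following is a minimal sketch that reuses the model, loss_and_grad_fn, and optimizer objects from the MLX examples earlier; lr_min, lr_max, and total_steps are illustrative assumptions.

def one_cycle_lr(step, lr_min=0.01, lr_max=0.1, total_steps=1000):
    # Linear ramp up for the first half of training, linear ramp down for the second half
    half = total_steps / 2
    if step <= half:
        return lr_min + (step / half) * (lr_max - lr_min)
    return lr_max - ((step - half) / half) * (lr_max - lr_min)

# Reuses `model`, `loss_and_grad_fn`, and `optimizer` from the MLX examples above
for step in range(1000):
    lr = one_cycle_lr(step)
    optimizer.learning_rate = lr
    loss, grads = loss_and_grad_fn(model, mx.random.normal((1, 10)), mx.ones((1, 1)))
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
    print(f'Step {step}: Learning Rate = {lr:.4f}')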

Conclusion

Each learning rate scheduling method has its strengths:

  • Exponential Decay: Ideal for gradual convergence, particularly with well-behaved loss surfaces.
  • Cyclic Exponential Decay (CED): Introduces periodic exploration, helping escape local minima while still ensuring convergence due to the exponential decay.
  • 1-Cycle Scheduling: Effectively combines exploration and exploitation, allowing models to explore large parameter spaces before refining their solution as training nears completion.

Each method can be implemented across various platforms, including TensorFlow, PyTorch, and MLX for Apple Silicon. Selecting the right learning rate schedule is critical to achieving fast and stable convergence in your machine learning models.







Comparison of Learning Rate Scheduling Methods

Exponential Decay

  • Purpose: Gradual convergence with decreasing updates over time.
  • Formula: \text{lr}_t = \text{lr}_0 \cdot e^{-kt}
  • Strengths: Ensures smaller updates as training progresses; works well for stable, smaller models.
  • Platform Implementations: TensorFlow, PyTorch, MLX
  • How to Decide: Best for stable, smaller models with smooth convergence needs. Use when you need a consistent, gradual reduction in learning rate for fine-tuning.

Cyclic Exponential Decay (CED)

  • Purpose: Combines exponential decay with periodic increases to escape local minima.
  • Formula: \text{lr}_t = \text{lr}_0 \cdot e^{-kt} \cdot \frac{1}{2}\left(1 + \cos\left(\frac{2\pi t}{T}\right)\right)
  • Strengths: Periodically increases the learning rate, helping escape local minima and explore the parameter space more thoroughly.
  • Platform Implementations: PyTorch, MLX
  • How to Decide: Ideal for non-convex optimization problems where the risk of getting stuck in local minima is high. Good for more complex models or rugged loss surfaces.

1-Cycle Scheduling

  • Purpose: Explores the parameter space early in training and refines the solution smoothly in the second half.
  • Formula: \text{lr}_t = \begin{cases}  \text{lr}_{\min} + \frac{t}{t_{\max}/2} \cdot (\text{lr}_{\max} - \text{lr}_{\min}) & \text{if } t \leq t_{\max}/2 \\ \text{lr}_{\max} - \frac{t - t_{\max}/2}{t_{\max}/2} \cdot (\text{lr}_{\max} - \text{lr}_{\min}) & \text{if } t > t_{\max}/2 \end{cases}
  • Strengths: Balances exploration and refinement, making it especially useful for large datasets and models that need longer training.
  • Platform Implementations: PyTorch, DeepSpeed
  • How to Decide: Recommended for large datasets or models that require both exploration early on and refinement later. Works well with large batch sizes and long training periods.