Machine Learning Overview

How Does Transfer Learning Work in Deep Learning Models – with an Example – day 30

Understanding Transfer Learning – The Challenges and Opportunities

Introduction to Transfer Learning

Transfer learning is a technique in machine learning where a model developed for one task is reused as the starting point for a model on a second task. This method is particularly useful when the second task has limited data, as it allows the model to leverage the knowledge it gained during the first task, thereby reducing the training time and improving performance. However, applying transfer learning effectively requires a deep understanding of both the original task and the new task, as well as how the model’s learned features will transfer.

The Challenge of Transfer Learning for Small Tasks

When dealing with small tasks (tasks that are simple or have limited data), transfer learning may not always yield the expected benefits. Let’s explore why this is the case by breaking down the main issues:

1. Initial Setup and Model A:

Imagine you have a neural network (Model A) trained on a multi-class classification problem using the Fashion MNIST dataset. This dataset might include various classes of clothing items, such as T-shirts, trousers, pullovers, dresses, etc. Model A, trained on these classes, performs well, achieving over 90% accuracy.

2. New Task with Model B:

Now, suppose you want to adapt Model A for a simpler task—distinguishing between just two categories, such as T-shirts and sandals. You might think that applying transfer learning by simply modifying the last layer of Model A (to output two classes instead of eight) would be enough.

3. Why Transfer Learning Might Not Work Well for Small Tasks:

  • Feature Specificity: Model A’s layers are trained to recognize patterns specific to the original eight classes. Those patterns may not be general enough to be useful for distinguishing just T-shirts from sandals; features that are too task-specific lead to poor generalization on the new task.
  • Overfitting Risk: In small tasks, especially with limited data, the model can easily overfit to the new task if training is not properly managed. The fine-tuning process needs to be carefully controlled to prevent this (a minimal sketch of one way to do so follows this list).
  • Trial and Error: Finding a good configuration for the new model is rarely straightforward; several setups usually have to be tried before one works. When improvements only appear after extensive tweaking, they may not be robust and can reflect overfitting rather than true generalization.
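
To make the overfitting point concrete, here is a minimal sketch (added for illustration, not part of the original example) of one common way to keep training on a small task under control: a low learning rate combined with early stopping on a validation set. It trains a small classifier on a two-class Fashion MNIST subset; the 1,000-example cap is an arbitrary choice to simulate limited data.

import tensorflow as tf

# Build a small two-class task (T-shirts = 0, sandals = 5) with deliberately little data
(train_images, train_labels), (val_images, val_labels) = tf.keras.datasets.fashion_mnist.load_data()
keep_train = (train_labels == 0) | (train_labels == 5)
keep_val = (val_labels == 0) | (val_labels == 5)
x_train = train_images[keep_train][:1000] / 255.0
y_train = (train_labels[keep_train][:1000] == 5).astype("int32")
x_val = val_images[keep_val] / 255.0
y_val = (val_labels[keep_val] == 5).astype("int32")

# A small binary classifier; in a transfer-learning setting these would be reused layers plus a new head
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Low learning rate plus early stopping: stop as soon as the validation loss stops improving
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)
model.fit(x_train, y_train, epochs=20, validation_data=(x_val, y_val), callbacks=[early_stop])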

How Transfer Learning Can Be Effective

Despite these challenges, transfer learning can still be a powerful tool when applied correctly. Here’s how to make it work effectively:

1. Starting with a Generalized Model:

A successful transfer learning setup starts with a well-trained model (like Model A) on a broad and diverse dataset. This model should have learned general features—such as edges, textures, and shapes—that are applicable across various tasks.

2. Adapting the Model Carefully:

For the new task, modify the model’s architecture appropriately. In our example, this means replacing the final output layer to suit the binary classification task.

  • Layer Freezing: Initially, freeze the layers of the pre-trained model, preventing them from being updated during the early stages of training. The idea is to retain the valuable features the model has already learned.
  • Fine-Tuning: After training the new output layer, you can unfreeze some or all of the earlier layers and continue training with a lower learning rate. This lets the model make slight adjustments, tailoring the pre-learned features to the new task without losing the foundational knowledge.

3. Using a Lower Learning Rate:

During fine-tuning, using a lower learning rate is crucial. This slows down the weight updates, allowing the model to adjust its pre-learned features more subtly, thereby improving the transfer of knowledge without drastically changing the learned features.
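
In Keras this usually just means constructing the optimizer with a smaller learning rate when you recompile for fine-tuning. The values below (1e-3 versus 1e-5) are illustrative defaults, not figures prescribed by this post.

from tensorflow.keras.optimizers import Adam

# Phase 1: train only the new output layer with a typical learning rate
head_optimizer = Adam(learning_rate=1e-3)

# Phase 2: after unfreezing, fine-tune with a much smaller learning rate,
# so the pre-trained features are only nudged rather than overwritten
fine_tune_optimizer = Adam(learning_rate=1e-5)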

A Small Example: Applying Transfer Learning

Let’s walk through a very simple example to demonstrate how transfer learning is applied.

Problem:
You have a VGG16 model pre-trained on the large ImageNet dataset, and you want to adapt it to classify images of cats and dogs only.

Step 1: Load Pre-Trained Model

from tensorflow.keras.applications import VGG16

# Load the VGG16 model pre-trained on ImageNet
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(64, 64, 3))

Step 2: Modify the Model for the New Task

from tensorflow.keras import layers, models

# Add new layers on top of the pre-trained base
model = models.Sequential()
model.add(base_model) # Add the pre-trained base model
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid')) # Binary classification output layer

Step 3: Freeze the Base Model Layers

# Freeze all layers in the base model
for layer in base_model.layers:
    layer.trainable = False

Step 4: Compile the Model

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Step 5: Train the Model on New Data

# Assuming X_train/y_train and X_val/y_val hold your cat-vs-dog training and validation data
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))

Step 6: Fine-Tuning

# Unfreeze the last few layers of the base model for fine-tuning
for layer in base_model.layers[-4:]:
    layer.trainable = True

# Re-compile with a lower learning rate (1e-5 here is a typical choice) so fine-tuning
# only nudges the pre-trained weights instead of overwriting them
from tensorflow.keras.optimizers import Adam
model.compile(optimizer=Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

# Continue training
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))

The Mathematical Intuition Behind Transfer Learning

The mathematical foundation of transfer learning can be understood through the concept of feature reuse. In deep learning, the lower layers of a model generally capture more generic features (such as edges, textures), while the higher layers capture more task-specific features. When you apply transfer learning:

  • Feature Reuse: The lower layers (capturing general features) are assumed to be transferable across tasks. Mathematically, if \( f(x) \) represents the learned function in the source task, transfer learning assumes that \( f(x) \) can be partially reused in the target task, where \( f'(x) = g(f(x)) \) for some new function \( g \).
  • Optimization: The pre-trained model minimizes a loss function \( L(f(x)) \) over the source task’s data. In transfer learning, you aim to minimize a new loss function \( L'(f'(x)) = L'(g(f(x))) \) over the target task’s data. By freezing and fine-tuning, you’re essentially finding a new function \( g \) that minimally adjusts \( f \) to suit the new task, ensuring that the new loss is minimized effectively.
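
Putting the two bullets above in symbols, the frozen phase and the fine-tuning phase can be written informally as two optimization problems over the target-task data \( \{(x_i, y_i)\}_{i=1}^{N} \), where \( f_{\text{pre}} \) denotes the pre-trained feature extractor. This notation is a sketch added for illustration, not a formal derivation.

Frozen phase (only the new head \( g \) is trained):

\[ \hat{g} = \arg\min_{g} \frac{1}{N} \sum_{i=1}^{N} L'\big(g(f_{\text{pre}}(x_i)),\, y_i\big) \]

Fine-tuning phase (\( f \) is also updated, initialized at \( f_{\text{pre}} \) and trained with a small learning rate so it stays close to it):

\[ (\hat{g}, \hat{f}) = \arg\min_{g,\, f} \frac{1}{N} \sum_{i=1}^{N} L'\big(g(f(x_i)),\, y_i\big) \]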

Transfer learning, when used properly, can significantly boost model performance on new tasks, particularly when data is limited. The key is in understanding when and how to apply it. For small tasks or those that are quite different from the original task, it requires careful handling—through freezing layers, fine-tuning, and adjusting the output structure—to ensure that the model can transfer the learned features effectively.

Implementing Transfer Learning – From Model A to Model B

So far in today’s blog post, we have discussed the challenges and benefits of transfer learning, especially when adapting a model trained on one task to a new, simpler task. Now, let’s dive into the implementation with complete, runnable code examples for both Model A and Model B. We’ll walk through the entire process, explaining each line of code.

Step 1: Building and Training Model A

Model A will be trained on a multi-class classification task, such as classifying images in the Fashion MNIST dataset (with some classes potentially excluded).

Complete Code for Model A

import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.optimizers import SGD

# Load the Fashion MNIST dataset
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Normalize the images to the [0, 1] range
train_images = train_images / 255.0
test_images = test_images / 255.0

# Filter out two classes for this example (assuming 'sandal' is label 5 and 'shirt' is label 6)
train_filter = (train_labels < 5) | (train_labels > 6)
test_filter = (test_labels < 5) | (test_labels > 6)

train_images_A = train_images[train_filter]
train_labels_A = train_labels[train_filter]
test_images_A = test_images[test_filter]
test_labels_A = test_labels[test_filter]

# Remap the remaining labels (0-4 and 7-9) onto the contiguous range 0-7 so they match the 8-unit output layer
train_labels_A = np.where(train_labels_A > 6, train_labels_A - 2, train_labels_A)
test_labels_A = np.where(test_labels_A > 6, test_labels_A - 2, test_labels_A)

# Build Model A - a simple neural network
model_A = Sequential([
    Flatten(input_shape=(28, 28)),   # Flatten the 28x28 images into vectors
    Dense(128, activation='relu'),   # First fully connected layer with 128 units and ReLU activation
    Dropout(0.2),                    # Dropout layer for regularization
    Dense(8, activation='softmax')   # Output layer for the 8 remaining classes, using softmax for multi-class classification
])

# Compile Model A
model_A.compile(optimizer=SGD(learning_rate=0.01),                     # SGD optimizer with a learning rate of 0.01
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),  # from_logits stays False because the softmax layer already outputs probabilities
                metrics=['accuracy'])                                  # Track accuracy during training

# Train Model A
model_A.fit(train_images_A, train_labels_A, epochs=10,        # Train for 10 epochs
            validation_data=(test_images_A, test_labels_A))   # Validate the model on the filtered test set

# Save Model A for later use
model_A.save("model_A.h5") # Save the trained model to disk

Explanation of Model A Code

  • Loading and Preprocessing the Data: The Fashion MNIST dataset is loaded and normalized to values between 0 and 1. The sandal and shirt classes are filtered out and the remaining labels are remapped to 0–7, creating a smaller eight-class task.
  • Model Architecture: The model is built using a `Sequential` API. It flattens the input images, applies a dense layer with ReLU activation, adds a dropout layer for regularization, and finally outputs predictions across 8 classes using softmax.
  • Compilation and Training: The model is compiled using the SGD optimizer and trained for 10 epochs on the filtered dataset.
  • Saving the Model: The trained model is saved to disk, so it can be reused later for transfer learning.

Step 2: Building Model B Using Transfer Learning

Model B will reuse the layers of Model A to perform a new, simpler task—binary classification (e.g., distinguishing between T-shirts and sandals).

Complete Code for Model B

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Load the pre-trained Model A
model_A = tf.keras.models.load_model("model_A.h5") # Load the saved model from disk

# Reload the raw Fashion MNIST dataset so this script can run on its own
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()

# Prepare the data for Model B - focusing only on two classes (T-shirts and sandals)
train_filter_B = (train_labels == 0) | (train_labels == 5) # Keep only T-shirts (label 0) and sandals (label 5)
test_filter_B = (test_labels == 0) | (test_labels == 5)

train_images_B = train_images[train_filter_B] / 255.0 # Normalize the training images
train_labels_B = (train_labels[train_filter_B] == 5).astype(int) # Convert labels to 0 and 1
test_images_B = test_images[test_filter_B] / 255.0 # Normalize the test images
test_labels_B = (test_labels[test_filter_B] == 5).astype(int) # Convert labels to 0 and 1

# Create Model B by reusing layers from Model A, except the last layer
model_B_on_A = Sequential(model_A.layers[:-1]) # Reuse all layers except the output layer
model_B_on_A.add(Dense(1, activation='sigmoid')) # Add a new output layer with sigmoid activation for binary classification

# Freeze the reused layers initially
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False # Freeze all layers except the newly added one

# Compile Model B
optimizer = SGD(learning_rate=0.001) # Using SGD optimizer with a lower learning rate for fine-tuning
model_B_on_A.compile(loss="binary_crossentropy",   # Loss function for binary classification
                     optimizer=optimizer,
                     metrics=["accuracy"])          # Track accuracy during training

# Train Model B initially with frozen layers
history = model_B_on_A.fit(train_images_B, train_labels_B, epochs=4,  # Train for 4 epochs with frozen layers
                           validation_data=(test_images_B, test_labels_B))

# Unfreeze all layers and fine-tune the model
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True # Unfreeze all layers for fine-tuning

# Re-compile the model so the change in trainable layers takes effect
model_B_on_A.compile(loss="binary_crossentropy",          # Same loss as before
                     optimizer=SGD(learning_rate=0.001),  # Fresh optimizer instance with the same low learning rate
                     metrics=["accuracy"])

# Continue training the model with all layers trainable
history = model_B_on_A.fit(train_images_B, train_labels_B, epochs=16, # Train for an additional 16 epochs
                           validation_data=(test_images_B, test_labels_B))

# Evaluate Model B
test_loss, test_acc = model_B_on_A.evaluate(test_images_B, test_labels_B)
print(f"Test accuracy of Model B: {test_acc:.4f}")

Explanation of Model B Code

  • Loading Model A: The pre-trained Model A is loaded from disk, bringing all its learned features and weights.
  • Preparing Data for the New Task: The dataset is filtered to focus only on two classes—T-shirts (label 0) and sandals (label 5). Labels are converted to binary (0 or 1) for the new binary classification task.
  • Reusing Layers: Model B is created by reusing all but the last layer of Model A. A new output layer with a single neuron and sigmoid activation is added for binary classification.
  • Freezing Layers: Initially, the reused layers are frozen to prevent their weights from being updated. This allows the new output layer to train on its own first.
  • Compiling and Training: The model is compiled with a binary cross-entropy loss and trained with the frozen layers. After a few epochs, the layers are unfrozen, and the model is fine-tuned with all layers trainable.
  • Evaluating the Model: After training, the model is evaluated on the test data, and the accuracy is printed.

Summary of Key Changes from Model A to Model B

  • Task Focus: Model A is designed for multi-class classification, while Model B is adapted for binary classification.
  • Output Layer: The output layer is changed from 8 neurons with softmax activation (in Model A) to 1 neuron with sigmoid activation (in Model B).
  • Layer Freezing: Layers inherited from Model A are initially frozen to preserve the learned features. They are later unfrozen for fine-tuning, allowing the model to adjust to the new task.
  • Training Strategy: Model B uses a phased training strategy—starting with frozen layers and moving to fine-tuning, which helps to adapt the pre-trained features to the binary classification task.

By following this detailed guide, you should now have a clear understanding of how to implement transfer learning by reusing and adapting a pre-trained model for a new task. This approach not only saves training time but also leverages the powerful features learned by the original model, making it easier to achieve high accuracy on the new task.

A Final Note: Understanding Layer Freezing in Transfer Learning

When applying transfer learning, a crucial concept is the idea of “freezing” layers in a pre-trained model. Freezing layers means that we keep the weights of those layers unchanged during the initial training of the new model. This process leverages the knowledge the model has already learned from a previous task, allowing it to be applied effectively to a new, often smaller, task.

What Does Freezing Layers Mean?

Frozen Layers: When we freeze a layer, we set its trainable attribute to False. This means that during the training process of the new model, the weights of the frozen layers are not updated—they remain as they were initialized from the pre-trained model.

for layer in model_B.layers[:-1]:
    layer.trainable = False # Freezing the layers

Why Do We Freeze Layers?

  • Retain Pre-Trained Knowledge: The layers of the pre-trained model have already learned useful features from the original task. By freezing these layers, we retain this knowledge and apply it directly to the new task.
  • Focus on New Layers: Initially, freezing allows the new layers (such as a new output layer) to learn to map the existing features to the new task. This can be especially important if the new task has a different output structure (e.g., binary classification instead of multi-class classification).
  • Prevent Overfitting: Freezing layers can also help prevent overfitting, especially when the new dataset is small. This is because the model won’t adjust the complex feature representations it already learned for the original task, which might be overfitting to the small new dataset.
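
A quick way to see what freezing actually does is to compare the number of trainable parameters before and after setting trainable = False. The snippet below is a small illustrative sketch (not code from this post’s main example) that rebuilds a simplified stand-in for Model A with the same layer sizes and counts its trainable weights.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# A simplified stand-in for the pre-trained Model A (dropout omitted since it has no weights)
base = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation="relu"),
    layers.Dense(8, activation="softmax"),
])

# Reuse everything except the output layer and add a new binary head
model_b = models.Sequential(base.layers[:-1])
model_b.add(layers.Dense(1, activation="sigmoid"))
model_b.build(input_shape=(None, 28, 28))  # make sure the new head is built before counting

def trainable_param_count(model):
    # Sum the number of scalar values across all trainable weight tensors
    return int(sum(np.prod(w.shape) for w in model.trainable_weights))

print("Trainable parameters before freezing:", trainable_param_count(model_b))  # 100,609

# Freeze the reused layers: their weights keep their pre-trained values during training
for layer in model_b.layers[:-1]:
    layer.trainable = False

print("Trainable parameters after freezing:", trainable_param_count(model_b))   # 129 (just the new head)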

How Do We Use These Initialized Weights in the New Model?

When you load a pre-trained model and freeze its layers, the new model starts with these pre-trained weights rather than random initializations. Here’s how this works:

  1. Load Pre-Trained Model: The pre-trained model’s weights, which have been trained on a large dataset (e.g., ImageNet, Fashion MNIST), are loaded into the model.
  2. Freeze the Layers: By setting trainable = False, these weights are retained as is during the training on the new task.
  3. Train New Layers: Only the weights of the unfrozen layers (often just the final few layers, including the new output layer) are updated during the initial training on the new dataset.
  4. Fine-Tuning (Optional): After training the new layers, you can unfreeze some or all of the frozen layers and continue training at a very low learning rate. This allows the model to slightly adjust the pre-trained weights to better fit the specifics of the new task without drastically changing the learned features.
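
To see steps 2 and 3 in action, you can compare a frozen layer’s weights before and after a round of training and confirm they have not moved. The snippet below is a self-contained sketch on random placeholder data; the toy model and layer names are invented for illustration only.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Toy stand-in for a pre-trained model: one "feature" layer plus a new output head
model = models.Sequential([
    layers.Dense(16, activation="relu", input_shape=(8,), name="reused_feature_layer"),
    layers.Dense(1, activation="sigmoid", name="new_output_layer"),
])

# Freeze the reused layer (before compiling) so its weights keep their current values
model.get_layer("reused_feature_layer").trainable = False
model.compile(optimizer="adam", loss="binary_crossentropy")

weights_before = model.get_layer("reused_feature_layer").get_weights()[0].copy()

# Train briefly on random placeholder data
x = np.random.rand(256, 8).astype("float32")
y = np.random.randint(0, 2, size=(256, 1)).astype("float32")
model.fit(x, y, epochs=2, batch_size=32, verbose=0)

weights_after = model.get_layer("reused_feature_layer").get_weights()[0]
print("Frozen layer unchanged:", np.allclose(weights_before, weights_after))  # True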

Example:

Suppose the pre-trained model has learned to detect edges, textures, and other patterns that are common across many types of images. When you apply this model to a new task (like distinguishing between cats and dogs), freezing the layers ensures these features are used as is. Later, if you unfreeze the layers, the model can make minor adjustments to these features to improve performance on the new task.

Summary:

Freezing layers means retaining the pre-trained weights in the new model. This lets you leverage the knowledge the model has already gained, which can lead to faster training and often better performance, especially when the new task has limited data.