The Rise of Transformers in Vision and Multimodal Models
In this first part of our blog series, we’ll explore how transformers, originally created for Natural Language Processing (NLP), have expanded into Computer Vision (CV) and even multimodal tasks, handling text, images, and video in a unified way. This will set the stage for Part 2, where we will dive into using Hugging Face and code examples for practical implementations.
1. The Journey of Transformers from NLP to Vision
The introduction of transformers in 2017 revolutionized NLP, but researchers soon realized their potential for tasks beyond text. At first, transformers were used alongside Convolutional Neural Networks (CNNs): in image captioning, for example, a CNN encoded the image while a transformer replaced the older Recurrent Neural Network (RNN) that generated the caption.
How Transformers Replaced RNNs
Transformers replaced RNNs because they capture long-range dependencies well and process inputs in parallel rather than sequentially, as RNNs do. This made them faster and more efficient, especially for image-based tasks where many features need to be processed at once.
2. The Emergence of Vision Transformers (ViT)
In 2020, researchers at Google proposed a completely transformer-based model for vision tasks, named the Vision Transformer (ViT). ViT treats an image in a way similar to text data—by splitting it into smaller image patches and feeding these patches into a transformer model.
How ViT Works:
- Splitting Images into Patches: Instead of processing the whole image with convolutional filters, ViT divides the image into small patches (16×16 pixels in the original model; see the sketch after this list).
- Embedding Patches: Each patch is flattened and linearly projected into a vector, which, together with a positional embedding, is treated like a word in a sentence.
- Processing Through Self-Attention: The transformer processes these patch embeddings with self-attention, which models the relationships between all patches simultaneously.
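To make the patching step concrete, here is a minimal sketch in TensorFlow (the same framework used in the code later in this post). The image size, patch size, and embedding dimension are illustrative assumptions, not taken from a specific checkpoint:

```python
import tensorflow as tf

# Illustrative sketch of ViT-style patching (all sizes are assumptions).
image = tf.random.uniform((1, 224, 224, 3))  # a dummy 224x224 RGB image
patch_size = 16

# Cut the image into non-overlapping 16x16 patches.
patches = tf.image.extract_patches(
    images=image,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)

# Flatten each 16x16x3 patch into a 768-dimensional vector,
# giving a "sentence" of 14 * 14 = 196 patch tokens.
patches = tf.reshape(patches, (1, -1, patch_size * patch_size * 3))
print(patches.shape)  # (1, 196, 768)

# A learned linear projection then maps each flattened patch to the model's
# embedding size, much like a word embedding in NLP.
patch_embeddings = tf.keras.layers.Dense(768)(patches)
```

In the real ViT, a positional embedding is added to each patch embedding before the sequence is passed to the transformer encoder.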
The table below contrasts CNNs with the Vision Transformer:

| Feature | CNN | Vision Transformer (ViT) |
|---|---|---|
| Input | Entire image (processed with convolutional filters) | Image patches |
| Processing Style | Local (focuses on specific regions) | Global (attends to the entire image at once) |
| Inductive Bias | Strong (assumes local relationships) | Weak (learns global relationships) |
| Best Use Cases | Small to medium datasets | Large datasets, such as ImageNet |
Inductive Bias in CNNs vs. Transformers
CNNs build in the assumption that nearby pixels are related; this kind of built-in assumption is known as inductive bias, and it makes CNNs very effective at image recognition even on smaller datasets. Transformers make no such assumption, which lets them capture long-range dependencies across the whole image, but they need much more data to learn these relationships effectively.
3. Multimodal Transformers: Perceiver and GATO
The power of transformers in processing sequences has inspired the development of multimodal models like Perceiver and GATO. These models can handle text, images, video, and even audio in one go.
Perceiver: Efficient Multimodal Transformer
Perceiver, introduced by DeepMind in 2021, can process many types of input by mapping them into a small, fixed-size latent representation. Because attention operates on this compact latent array rather than on the raw input, Perceiver remains efficient even for very long input sequences, which makes it scalable for multimodal tasks.
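To illustrate the core idea, here is a minimal, hypothetical sketch of Perceiver-style cross-attention using Keras layers; it is not DeepMind's implementation, and all sizes are made up for the example:

```python
import tensorflow as tf

# A small, learned latent array cross-attends to a long multimodal input,
# so the expensive attention scales with the latent size, not the input length.
num_latents, latent_dim = 64, 256
long_inputs = tf.random.uniform((1, 10_000, latent_dim))  # e.g. 10,000 input tokens/patches
latents = tf.Variable(tf.random.normal((1, num_latents, latent_dim)))

cross_attention = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=32)
# The latents are the queries; the long input supplies the keys and values.
compressed = cross_attention(query=latents, value=long_inputs, key=long_inputs)
print(compressed.shape)  # (1, 64, 256): fixed size, regardless of input length
```

Because later self-attention layers operate only on this fixed-size latent array, the cost no longer grows quadratically with the raw input length.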
The table below compares the two models (GATO is covered in more detail in the next section):

| Model | Modality Support | Key Features |
|---|---|---|
| Perceiver | Text, images, video, audio | Latent representations, scalable attention |
| GATO | Text, images, Atari games, etc. | Handles multiple task types |
4. Advanced Multimodal Models: Flamingo and GATO
Flamingo and GATO, both introduced by DeepMind in 2022, represent a significant leap forward in multimodal models.
- Flamingo: Capable of handling text, images, and video, Flamingo is pre-trained across multiple modalities to work on tasks like question answering and image captioning.
- GATO: A versatile transformer model that can be applied to a variety of tasks, including playing Atari games, handling text input, and image recognition. GATO integrates several capabilities into one unified model.
| Model | Task Capabilities | Special Features |
|---|---|---|
| Flamingo | Question answering, captioning | Trained on multiple modalities simultaneously |
| GATO | Image classification, game playing | Unified model for different types of tasks |
Multimodal AI has advanced significantly, with models evolving to handle text, images, audio, and video seamlessly. Early breakthroughs like Flamingo and GATO set the foundation, followed by Gemini models enhancing long-context understanding and multimodal output. GPT-4o improved real-time interactions across multiple modalities, while Nvidia’s Cosmos advanced video generation and robotics training.
Meanwhile, DeepSeek emerged as a strong competitor, introducing Janus Pro for image generation and R1, an open-source reasoning model, challenging existing AI leaders. These advancements have reshaped industries, leading to increased integration in smart wearables, autonomous systems, and decision-making AI.
With companies rapidly innovating and competing, multimodal AI continues to improve year by year, unlocking more sophisticated and efficient applications across various domains.
5. Video Generation and Diffusion Models
Companies like Haiper are advancing with models based on DiT (Diffusion Transformer) architectures. Haiper 2.0 allows users to generate ultra-realistic videos from prompts, combining diffusion models with Transformer components to increase speed and efficiency. This breakthrough has implications for video generation and the creative content industries.
6. Robotics and Transformer Efficiency
In robotics, Google’s SARA-RT system is refining the Transformer models used for robotic tasks, making them faster and more efficient. This leads to improved real-time decision-making in robots, which is critical for practical applications such as autonomous driving and other real-world robotics tasks.
7. New Releases of LLMs
OpenAI and Meta have been at the forefront of developing large language models (LLMs), continually pushing the boundaries of natural language processing.
OpenAI’s GPT-5 Development
OpenAI has been working on GPT-5, aiming to enhance reasoning capabilities and address limitations observed in previous models. However, the development has faced challenges, including delays and substantial costs, leading to an anticipated release in early 2025.
Meta’s Llama 3 Series
Meta has made significant strides with its Llama series, culminating in the release of Llama 3.1. This model boasts 405 billion parameters, supporting multiple languages and demonstrating notable improvements in coding and complex mathematics. Despite its size, Llama 3.1 competes closely with other leading models in performance.
These developments underscore the rapid evolution of LLMs, with each iteration bringing enhanced capabilities and performance, thereby intensifying the competitive landscape in AI research and application. These were just some examples of developments up to 2025, but they illustrate how quickly AI keeps improving year after year.
Hugging Face Transformers – A Step-by-Step Guide with Code and Explanations
In this part of the blog post, I will guide you through using Hugging Face’s Transformers library, explaining what each code block does, why you need it, and how it works in real-world scenarios. This way, you won’t just copy and paste code—you’ll understand its purpose and how to use it effectively.
1. Getting Started with Hugging Face Pipelines
The easiest way to start with Hugging Face is by using the pipeline() function. A pipeline is a high-level abstraction that allows you to quickly run pretrained models for various tasks such as sentiment analysis, text generation, or text classification.
Why Use a Pipeline?
Pipelines are great when you want to solve a problem quickly without worrying about the details of model architecture, tokenization, or input processing. For example, if you want to analyze customer reviews for positive or negative sentiment, pipelines allow you to do this in seconds.
Example: Sentiment Analysis Pipeline
Code:
# Import the pipeline from Hugging Face
from transformers import pipeline
# Load the sentiment-analysis pipeline
classifier = pipeline("sentiment-analysis")
# Test the classifier with a sample sentence
result = classifier("The actors were very convincing.")
print(result)
Explanation:
This code loads a pretrained model for sentiment analysis and classifies the sentence as either positive or negative. It’s useful in real-world applications like analyzing product reviews, feedback, or social media comments.
Output:
[{'label': 'POSITIVE', 'score': 0.9998071789741516}]
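As a small follow-up, pipelines also accept a list of texts, which is handy for the review-analysis use case mentioned above. The reviews below are made-up examples, and classifier is the pipeline created in the previous block:

```python
# Classify several (made-up) reviews in one call; the pipeline returns one result per text.
reviews = [
    "Great battery life, totally worth the price.",
    "The screen cracked after two days. Very disappointed.",
]
results = classifier(reviews)
for review, result in zip(reviews, results):
    print(f"{result['label']} ({result['score']:.3f}) - {review}")
```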
2. Using Pretrained Models for Text Classification
Why Use Pretrained Models?
Pretrained models save you time. Instead of training a model from scratch, which can take days or even weeks, you can download and use a model that’s already been trained on massive datasets. This is especially useful when working on tight deadlines or without extensive computational resources.
Example: Text Classification with DistilBERT
Code:
# Specify the model for text classification (the full Hub id includes the org prefix)
model_name = "huggingface/distilbert-base-uncased-finetuned-mnli"
# Load the text classification pipeline
classifier_mnli = pipeline("text-classification", model=model_name)
# Test with two sentences
result = classifier_mnli("She loves me. [SEP] She loves me not.")
print(result)
Explanation:
This code compares two sentences and checks for contradiction, entailment, or neutrality. The [SEP] token separates the two sentences for the model. You could use this in applications like legal document review, where you need to check for inconsistencies between two statements.
Output:
[{'label': 'CONTRADICTION', 'score': 0.9790192246437073}]
3. Customizing Tokenization for Input Control
Why Customize Tokenization?
Sometimes, you need more control over how text is tokenized. Tokenization breaks a sentence into smaller pieces (tokens), which are later converted into numerical IDs for the model. This is crucial for fine-tuning models or working with custom datasets.
Example: Custom Tokenization
Code:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
# Tokenize the input sentences
token_ids = tokenizer(["I like soccer. [SEP] We all love soccer!",
"Joe lived for a very long time. [SEP] Joe is old."],
padding=True, return_tensors="tf")
# Display the tokenized input
print(token_ids)
Explanation:
This code takes raw text, tokenizes it, and converts it into a format the model can process. You can use this in applications where the input text format is critical, like chatbots or translation systems.
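If you want to see what the tokenizer is doing under the hood, you can inspect the intermediate steps yourself. This small sketch reuses the tokenizer loaded above; the example sentence is arbitrary:

```python
# Break a sentence into subword tokens, map them to IDs, and decode them back.
tokens = tokenizer.tokenize("Transformers are surprisingly versatile!")
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)                 # the subword pieces the tokenizer produced
print(ids)                    # the numerical IDs the model actually consumes
print(tokenizer.decode(ids))  # the IDs mapped back to (normalized) text
```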
4. Processing Model Outputs with Softmax
Why Use Softmax?
When a model outputs logits (raw scores), these aren’t immediately understandable. The softmax function converts these scores into probabilities. For instance, in sentiment analysis, you might want to know how likely it is that a sentence is positive or negative.
Example: Applying Softmax to Model Outputs
Code:
import tensorflow as tf

# Pass the tokenized input to the model
outputs = model(token_ids)
# Apply the softmax activation function to convert the logits into probabilities
y_probas = tf.keras.activations.softmax(outputs.logits)
# Display the probabilities
print(y_probas)
Explanation:
Here, the model’s output logits are converted to class probabilities using softmax. This is crucial when making a final prediction in any classification task, from sentiment analysis to topic classification.
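A common next step, sketched below, is to pick the most likely class for each input and map it back to a human-readable label via the model's config (the exact label names depend on the checkpoint):

```python
import tensorflow as tf

# Pick the highest-probability class for each input...
predicted_ids = tf.argmax(y_probas, axis=-1).numpy()
# ...and translate the class indices into the label names stored in the model config.
predicted_labels = [model.config.id2label[int(i)] for i in predicted_ids]
print(predicted_labels)
```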
5. Fine-Tuning a Pretrained Model
Why Fine-Tune a Model?
Sometimes the general-purpose pretrained models don’t work well for specific domains (like legal or medical texts). By fine-tuning, you can train the model on a smaller, domain-specific dataset and achieve better performance.
Example: Fine-Tuning with Keras
Code:
# Prepare the training data
sentences = [("Sky is blue", "Sky is red"), ("I love her", "She loves me")]
X_train = tokenizer(sentences, padding=True, return_tensors="tf").data
# Prepare labels for the sentences (0 = contradiction, 2 = entailment)
y_train = tf.constant([0, 2])
# Define the loss function and optimizer
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss, optimizer="adam", metrics=["accuracy"])
# Fine-tune the model
history = model.fit(X_train, y_train, epochs=2)
Explanation:
This code fine-tunes a model using a custom dataset. Fine-tuning is essential in cases where general pretrained models don’t perform well on niche domains like legal or medical text.
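After fine-tuning, you will usually want to persist the model and tokenizer so they can be reloaded later. A minimal sketch, using an arbitrary local directory name:

```python
# Save the fine-tuned weights and the tokenizer to a local directory (name is arbitrary).
model.save_pretrained("./my-finetuned-model")
tokenizer.save_pretrained("./my-finetuned-model")

# Later, reload them with the same Auto classes used earlier:
# model = TFAutoModelForSequenceClassification.from_pretrained("./my-finetuned-model")
# tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")
```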
6. Using Hugging Face Datasets for Training
Why Use Hugging Face Datasets?
Hugging Face offers a datasets library, which allows you to quickly download and use datasets for training, evaluation, or fine-tuning. This saves time and resources in gathering and cleaning data.
Example: Loading and Using the IMDb Dataset
Code:
from datasets import load_dataset
# Load the IMDb dataset
dataset = load_dataset("imdb")
# Display a sample from the dataset
print(dataset["train"][0])
Explanation:
This code loads the IMDb dataset, which is a collection of movie reviews labeled as positive or negative for sentiment analysis. You can use this dataset to train or evaluate your own sentiment analysis models.
Output:
{'text': 'This is a great movie...', 'label': 1}
The IMDb dataset contains movie reviews (in the text field) and the corresponding sentiment labels (where 1 indicates positive sentiment).
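A typical next step, sketched below, is to tokenize the whole dataset with the library's map() method so it is ready for training or evaluation. This assumes a tokenizer is already loaded (for example the one from section 3) and that it matches the model you intend to fine-tune:

```python
# Tokenize the full dataset in batches; map() adds input_ids and attention_mask columns.
def tokenize_batch(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized_dataset = dataset.map(tokenize_batch, batched=True)
print(tokenized_dataset["train"][0].keys())
```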
Conclusion
In this part, we’ve covered how to:
- Use Hugging Face pipelines for quick, easy-to-use NLP tasks like sentiment analysis and text classification.
- Leverage pretrained models to save time and resources for common tasks.
- Control the input data with tokenizers and convert model outputs to meaningful probabilities with softmax.
- Fine-tune a model on a specific domain using custom data for better accuracy.
- Utilize the Hugging Face datasets library for quick access to standard datasets like IMDb.
Each code block is designed to solve a real-world problem, whether you’re building a simple sentiment analysis tool or working on a custom text classification system. Hugging Face makes it incredibly easy to get started with state-of-the-art NLP models without needing to build them from scratch.
If you’re ready to take your NLP projects to the next level, Hugging Face’s documentation and community are great resources to explore. Stay tuned for more tutorials and deep dives into advanced features!