
The Rise of Transformers in Vision and Multimodal Models – Hugging Face – day 72

In this first part of our blog series, we’ll explore how transformers, originally created for Natural Language Processing (NLP), have expanded into Computer Vision (CV) and even multimodal tasks, handling text, images, and video in a unified way. This will set the stage for Part 2, where we will dive into using Hugging Face and code examples for practical implementations.

1. The Journey of Transformers from NLP to Vision

The introduction of transformers in 2017 revolutionized NLP, but researchers soon realized their potential extended beyond text. At first, transformers were used alongside Convolutional Neural Networks (CNNs) for tasks such as image captioning, where they replaced older sequence models like Recurrent Neural Networks (RNNs).

How Transformers Replace RNNs

Transformers replaced RNNs because they capture long-range dependencies well and process their inputs in parallel rather than sequentially, as RNNs do. This made transformers faster and more efficient, especially for image-based tasks where many features need to be processed simultaneously.

2. The Emergence of Vision Transformers (ViT)

In 2020, researchers at Google proposed a completely transformer-based model for vision tasks, named the Vision Transformer (ViT). ViT treats an image in a way similar to text data—by splitting it into smaller image patches and feeding these patches into a transformer model.

How ViT Works:

  • Splitting Images into Patches: Instead of feeding an entire image into a CNN, the ViT divides an image into 16×16 pixel patches.
  • Embedding Patches: Each patch is flattened into a vector, which is then treated like a word in a sentence for the transformer model.
  • Processing Through Self-Attention: The transformer processes these patch vectors through a self-attention mechanism, which looks at the relationships between all patches simultaneously.

Feature          | CNN                                      | Vision Transformer (ViT)
Input            | Entire image (filtered by convolutions)  | Image patches
Processing Style | Local (focus on specific parts)          | Global (entire image at once)
Inductive Bias   | Strong (assumes local relationships)     | Weak (learns global relationships)
Best Use Cases   | Small to medium datasets                 | Large datasets, such as ImageNet
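
To make the patch-and-embed step described above concrete, here is a minimal TensorFlow sketch. It is illustrative only: the 16×16 patch size and 768-dimensional embedding follow the ViT paper, but a real ViT also adds positional embeddings and a learnable classification token before the transformer encoder.

    import tensorflow as tf

    patch_size = 16
    image = tf.random.uniform((1, 224, 224, 3))   # one 224x224 RGB image

    # Extract non-overlapping 16x16 patches -> shape (1, 14, 14, 16*16*3)
    patches = tf.image.extract_patches(
        images=image,
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, patch_size, patch_size, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )

    # Flatten the 14x14 grid into a sequence of 196 patch vectors
    num_patches = (224 // patch_size) ** 2
    patches = tf.reshape(patches, (1, num_patches, patch_size * patch_size * 3))

    # Linearly project each patch to an embedding, like a "word" vector
    patch_embeddings = tf.keras.layers.Dense(768)(patches)
    print(patch_embeddings.shape)   # (1, 196, 768)

These patch embeddings form the sequence that the transformer encoder's self-attention operates on, exactly as it would operate on a sequence of word embeddings.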

Inductive Bias in CNNs vs. Transformers

CNNs build in the assumption that nearby pixels are related, a form of inductive bias. This built-in assumption makes CNNs very effective at image recognition tasks. Transformers do not make these assumptions, which allows them to capture long-range dependencies better, but they need more data to do so effectively.

3. Multimodal Transformers: Perceiver and GATO

The power of transformers in processing sequences has inspired the development of multimodal models like Perceiver and GATO. These models can handle text, images, video, and even audio in one go.

Perceiver: Efficient Multimodal Transformer

Perceiver, introduced by DeepMind in 2021, can process various types of input by converting them into a compressed latent representation. The Perceiver model is much more efficient when processing long sequences of data, which makes it scalable for multimodal tasks.
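
The core trick can be sketched in a few lines: a small, fixed-size latent array attends to the (possibly very long) input via cross-attention, so the expensive attention step scales with the latent size rather than the input length. The sketch below is a simplified TensorFlow illustration, not DeepMind's implementation; the sizes are arbitrary.

    import tensorflow as tf

    batch, input_len, dim = 2, 10_000, 64    # a long, flattened multimodal input
    num_latents = 128                         # fixed-size latent bottleneck

    inputs = tf.random.normal((batch, input_len, dim))
    latents = tf.random.normal((batch, num_latents, dim))

    # Cross-attention: the latents are the queries, the raw inputs are keys/values
    cross_attention = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=dim)
    compressed = cross_attention(query=latents, value=inputs, key=inputs)

    print(compressed.shape)   # (2, 128, 64): the long input is compressed into the latents

Subsequent self-attention layers then operate only on the 128 latents, which is what makes this design scalable to long sequences.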

Model     | Modality Support                | Key Features
Perceiver | Text, Images, Video, Audio      | Latent representations, scalable attention
GATO      | Text, Images, Atari games, etc. | Handles multiple task types

4. Advanced Multimodal Models: Flamingo and GATO

Flamingo and GATO, both introduced by DeepMind in 2022, represent a significant leap forward in multimodal models.

  • Flamingo: Capable of handling text, images, and video, Flamingo is pre-trained across multiple modalities to work on tasks like question answering and image captioning.
  • GATO: A versatile transformer model that can be applied to a variety of tasks, including playing Atari games, processing text, and recognizing images. GATO integrates several capabilities into one unified model.

Model    | Task Capabilities                  | Special Features
Flamingo | Question answering, captioning     | Trained on multiple modalities simultaneously
GATO     | Image classification, game playing | Unified model for different types of tasks

1. Gemini 1.5 and Long-Context Understanding

In 2024, Google introduced Gemini 1.5, a next-generation AI model focused on efficiency and long-context understanding, capable of processing up to 1 million tokens. Its Mixture-of-Experts (MoE) architecture improves performance while reducing computational demands, making Gemini 1.5 one of the most powerful models for long-context tasks and enabling more sophisticated applications across many domains.

2. Video Generation and Diffusion Models

Companies like Haiper are advancing video generation with models based on the DiT (Diffusion Transformer) architecture. Haiper 2.0 lets users generate ultra-realistic videos from prompts, combining diffusion models with Transformer components to increase speed and efficiency. This breakthrough has implications for video generation and the creative content industries.

3. Robotics and Transformer Efficiency

In robotics, Google’s SARA-RT system is refining Transformer models used for robotic tasks, making them faster
and more efficient. This leads to improved real-time decision-making in robots, critical for practical applications
such as autonomous driving and general real-world robotics tasks.

4. New Releases of LLMs

OpenAI and Meta continue to innovate with upcoming releases such as GPT-5 and Llama 3. These models are expected to bring improved coherence and performance in natural language processing, with Meta's open-source Llama releases in particular intensifying the competitive AI landscape.






Hugging Face Transformers – A Step-by-Step Guide with Code and Explanations

In this part of the blog post, I will guide you through using Hugging Face’s Transformers library, explaining what each code block does, why you need it, and how it works in real-world scenarios. This way, you won’t just copy and paste code—you’ll understand its purpose and how to use it effectively.

1. Getting Started with Hugging Face Pipelines

The easiest way to start with Hugging Face is by using the pipeline() function. A pipeline is a high-level abstraction that allows you to quickly run pretrained models for various tasks such as sentiment analysis, text generation, or text classification.

Why Use a Pipeline?

Pipelines are great when you want to solve a problem quickly without worrying about the details of model architecture, tokenization, or input processing. For example, if you want to analyze customer reviews for positive or negative sentiment, pipelines allow you to do this in seconds.

Example: Sentiment Analysis Pipeline

Code:

    # Import the pipeline from Hugging Face
    from transformers import pipeline

    # Load the sentiment-analysis pipeline
    classifier = pipeline("sentiment-analysis")

    # Test the classifier with a sample sentence
    result = classifier("The actors were very convincing.")
    print(result)
    

Explanation:

This code loads a pretrained model for sentiment analysis and classifies the sentence as either positive or negative. It’s useful in real-world applications like analyzing product reviews, feedback, or social media comments.

Output:

[{'label': 'POSITIVE', 'score': 0.9998071789741516}]
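
Pipelines also accept a list of texts, so you can classify many reviews in one call. The snippet below is a small illustrative extension of my own, not part of the original example:

    # Classify several reviews at once
    results = classifier(["I loved this product!", "Terrible customer service."])
    print(results)   # one dict with a 'label' and a 'score' per input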

2. Using Pretrained Models for Text Classification

Why Use Pretrained Models?

Pretrained models save you time. Instead of training a model from scratch, which can take days or even weeks, you can download and use a model that’s already been trained on massive datasets. This is especially useful when working on tight deadlines or without extensive computational resources.

Example: Text Classification with DistilBERT

Code:

    # Specify the model for text classification
    model_name = "huggingface/distilbert-base-uncased-finetuned-mnli"

    # Load the text classification pipeline
    classifier_mnli = pipeline("text-classification", model=model_name)

    # Test with two sentences
    result = classifier_mnli("She loves me. [SEP] She loves me not.")
    print(result)
    

Explanation:

This code compares two sentences and checks for contradiction, entailment, or neutrality. You could use this in applications like legal document review, where you need to check for inconsistencies between two statements.

Output:

[{'label': 'CONTRADICTION', 'score': 0.9790192246437073}]

3. Customizing Tokenization for Input Control

Why Customize Tokenization?

Sometimes, you need more control over how text is tokenized. Tokenization breaks a sentence into smaller pieces (tokens), which are later converted into numerical IDs for the model. This is crucial for fine-tuning models or working with custom datasets.

Example: Custom Tokenization

Code:

    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

    # Tokenize the input sentences
    token_ids = tokenizer(["I like soccer. [SEP] We all love soccer!", 
                           "Joe lived for a very long time. [SEP] Joe is old."],
                          padding=True, return_tensors="tf")

    # Display the tokenized input
    print(token_ids)
    

Explanation:

This code takes raw text, tokenizes it, and converts it into a format the model can process. You can use this in applications where the input text format is critical, like chatbots or translation systems.
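
As an optional sanity check (not part of the original snippet), you can map the IDs back to tokens to see exactly how the tokenizer split and padded each sentence pair:

    # Inspect the tokens behind the IDs of the first sentence pair
    tokens = tokenizer.convert_ids_to_tokens(token_ids["input_ids"][0].numpy().tolist())
    print(tokens)   # e.g. ['[CLS]', 'i', 'like', 'soccer', '.', '[SEP]', ...]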

4. Processing Model Outputs with Softmax

Why Use Softmax?

When a model outputs logits (raw scores), these aren’t immediately understandable. The softmax function converts these scores into probabilities. For instance, in sentiment analysis, you might want to know how likely it is that a sentence is positive or negative.
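
If you want to see what softmax does numerically, here is a tiny, self-contained illustration with made-up logit values (an assumption for demonstration, not model output):

    import tensorflow as tf

    logits = tf.constant([[2.0, 0.5, -1.0]])       # raw, unnormalized scores
    probs = tf.keras.activations.softmax(logits)
    print(probs.numpy(), probs.numpy().sum())      # non-negative values summing to 1.0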

Example: Applying Softmax to Model Outputs

Code:

    import tensorflow as tf

    # Pass the tokenized input to the model
    outputs = model(token_ids)

    # Apply the softmax activation function to get probabilities
    y_probas = tf.keras.activations.softmax(outputs.logits)

    # Display the probabilities
    print(y_probas)
    

Explanation:

Here, the model’s output logits are converted to class probabilities using softmax. This is crucial when making a final prediction in any classification task, from sentiment analysis to topic classification.
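
From the probabilities, a likely next step (my own addition, not part of the original code) is to pick the highest-scoring class for each input and map it to its label name via the model's config:

    # Choose the most probable class for each input pair
    predicted_ids = tf.argmax(y_probas, axis=-1).numpy()
    print([model.config.id2label[int(i)] for i in predicted_ids])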

5. Fine-Tuning a Pretrained Model

Why Fine-Tune a Model?

Sometimes the general-purpose pretrained models don’t work well for specific domains (like legal or medical texts). By fine-tuning, you can train the model on a smaller, domain-specific dataset and achieve better performance.

Example: Fine-Tuning with Keras

Code:

    # Prepare the training data
    sentences = [("Sky is blue", "Sky is red"), ("I love her", "She loves me")]
    X_train = tokenizer(sentences, padding=True, return_tensors="tf").data

    # Prepare labels for the sentence pairs (0 = contradiction, 1 = neutral, 2 = entailment)
    y_train = tf.constant([0, 2])  # first pair is a contradiction, second an entailment

    # Define the loss function and optimizer
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(loss=loss, optimizer="adam", metrics=["accuracy"])

    # Fine-tune the model
    history = model.fit(X_train, y_train, epochs=2)
    

Explanation:

This code fine-tunes a model using a custom dataset. Fine-tuning is essential in cases where general pretrained models don’t perform well on niche domains like legal or medical text.
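
After fitting, you can sanity-check the fine-tuned model on a new, held-out sentence pair. This is my own illustrative addition, reusing the tokenizer and softmax steps shown earlier:

    # Run the fine-tuned model on a new sentence pair
    test_ids = tokenizer([("The sky is clear", "It is cloudy")],
                         padding=True, return_tensors="tf").data
    test_probas = tf.keras.activations.softmax(model(test_ids).logits)
    print(test_probas)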

6. Using Hugging Face Datasets for Training

Why Use Hugging Face Datasets?

Hugging Face offers a datasets library, which allows you to quickly download and use datasets for training, evaluation, or fine-tuning. This saves time and resources in gathering and cleaning data.

Example: Loading and Using the IMDb Dataset

Code:

    from datasets import load_dataset

    # Load the IMDb dataset
    dataset = load_dataset("imdb")

    # Display a sample from the dataset
    print(dataset["train"][0])
    

Explanation:

This code loads the IMDb dataset, which is a collection of movie reviews labeled as positive or negative for sentiment analysis. You can use this dataset to train or evaluate your own sentiment analysis models.

Output:

{'text': 'This is a great movie...', 'label': 1}

The IMDb dataset contains movie reviews (in the text field) and the corresponding sentiment labels (where 1 indicates positive sentiment).
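
A natural next step (a sketch based on my own assumptions, not code from the original post) is to tokenize the dataset with the datasets library's map() so it can be fed to a model. In practice you would load the tokenizer that matches the model you plan to fine-tune; here the tokenizer from the earlier steps is reused purely for illustration:

    # Tokenize every review in batches
    def tokenize_batch(batch):
        return tokenizer(batch["text"], truncation=True, max_length=256)

    tokenized_train = dataset["train"].map(tokenize_batch, batched=True)
    print(tokenized_train[0]["input_ids"][:10])   # first token IDs of the first review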

Conclusion

In this part, we’ve covered how to:

  • Use Hugging Face pipelines for quick, easy-to-use NLP tasks like sentiment analysis and text classification.
  • Leverage pretrained models to save time and resources for common tasks.
  • Control the input data with tokenizers and convert model outputs to meaningful probabilities with softmax.
  • Fine-tune a model on a specific domain using custom data for better accuracy.
  • Utilize the Hugging Face datasets library for quick access to standard datasets like IMDb.

Each code block is designed to solve a real-world problem, whether you’re building a simple sentiment analysis tool or working on a custom text classification system. Hugging Face makes it incredibly easy to get started with state-of-the-art NLP models without needing to build them from scratch.

If you’re ready to take your NLP projects to the next level, Hugging Face’s documentation and community are great resources to explore. Stay tuned for more tutorials and deep dives into advanced features!






