Understanding Tokens in Deep Learning: Types, Examples, and Use Cases
Tokens are the fundamental building blocks of Natural Language Processing (NLP) and deep learning models. They represent the chunks of text a model actually processes: words, subwords, characters, or bytes. Choosing the right tokenization strategy is critical to model performance. In this post, we will explore the different types of tokens, their applications, and how they compare to each other, using a small German example vocabulary throughout.
What Are Tokens?
Tokens are pieces of text converted into numerical representations so that a machine learning model can process them. These could be:
- Words: Whole words treated as single tokens.
- Subwords: Parts of words like prefixes, suffixes, or root forms.
- Characters: Each character treated as an individual token.
- Bytes: Byte-level tokens for language-agnostic processing.
- Special Tokens: Markers like <PAD> (padding), <UNK> (unknown), or <CLS> (classification).
Types of Tokenization Schemes
1. Predefined Fixed Vocabulary (Word-Level Tokenization)
In this approach, every word in a predefined vocabulary is mapped to a unique token ID. Any word not in the vocabulary is replaced with a special <UNK> token.
Example Vocabulary:
{ "<PAD>": 0, "<UNK>": 1, "!": 2, "Apfelsaft": 12, "Berlin": 19, "Auto": 15, "Bahnhof": 17 }
How It Works:
- Input: “Apfelsaft ist lecker!”
- Tokens: [12, 29, 30, 2] (assuming 29 and 30 are the IDs for “ist” and “lecker”)
- Out-of-vocabulary (OOV) words are replaced with <UNK>.
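A minimal sketch of this lookup is shown below; the extra IDs for “ist” and “lecker”, the punctuation splitting, and the fixed padding length are illustrative assumptions, not part of any particular library.

```python
# Minimal word-level tokenizer: whitespace split + dictionary lookup.
# The vocabulary and IDs are illustrative, matching the example above.
vocab = {"<PAD>": 0, "<UNK>": 1, "!": 2, "Apfelsaft": 12, "Auto": 15,
         "Bahnhof": 17, "Berlin": 19, "ist": 29, "lecker": 30}

def encode(text: str, max_len: int = 8) -> list[int]:
    # Split trailing punctuation off words so "lecker!" becomes ["lecker", "!"].
    words = []
    for word in text.split():
        if len(word) > 1 and word[-1] in "!?.,":
            words.extend([word[:-1], word[-1]])
        else:
            words.append(word)
    # Look up each word; unknown words fall back to <UNK>.
    ids = [vocab.get(w, vocab["<UNK>"]) for w in words]
    # Pad (or truncate) to a fixed length so sequences can be batched.
    return ids[:max_len] + [vocab["<PAD>"]] * max(0, max_len - len(ids))

print(encode("Apfelsaft ist lecker!"))    # [12, 29, 30, 2, 0, 0, 0, 0]
print(encode("Orangensaft ist lecker!"))  # [1, 29, 30, 2, 0, 0, 0, 0] -> OOV word became <UNK>
```

Note how the unknown word collapses to ID 1 (<UNK>): this is exactly where word-level tokenization loses information.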
Advantages:
- Simple and interpretable.
- Efficient for small datasets or domain-specific tasks.
Disadvantages:
- OOV words lose information.
- Large vocabulary sizes are memory-intensive.
2. Subword Tokenization (e.g., Byte-Pair Encoding, WordPiece)
Instead of mapping entire words, this approach splits text into smaller subword units. For example, “Apfelsaft” might be split into “Apf”, “els”, and “aft.”
How It Works:
- Input: “Apfelsaft ist lecker!”
- Tokens: ["Apf", "els", "aft", "ist", "lecker", "!"]
- Each subword has its own token ID, reducing vocabulary size while maintaining flexibility.
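In practice, the subword vocabulary is learned from a corpus rather than written by hand. The sketch below uses the Hugging Face tokenizers package; the toy corpus and the vocab_size of 200 are assumptions for illustration, and the exact splits depend on the merges the trainer learns.

```python
# Sketch: train a small byte-pair-encoding (BPE) tokenizer on a toy corpus.
# Requires the Hugging Face `tokenizers` package (pip install tokenizers).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = [                      # toy corpus; in practice this would be your dataset
    "Apfelsaft ist lecker!",
    "Der Bahnhof in Berlin ist gross.",
    "Das Auto steht am Bahnhof.",
]

tokenizer = Tokenizer(BPE(unk_token="<UNK>"))
tokenizer.pre_tokenizer = Whitespace()   # split on whitespace/punctuation before learning merges
trainer = BpeTrainer(vocab_size=200, special_tokens=["<PAD>", "<UNK>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("Apfelsaft ist lecker!")
print(encoding.tokens)  # learned subword pieces (exact splits depend on the training corpus)
print(encoding.ids)     # the corresponding token IDs
```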
Advantages:
- Handles OOV words gracefully by breaking them into known subwords.
- Smaller vocabulary size compared to word-level tokenization.
Disadvantages:
- More complex preprocessing.
- Subwords can be harder to interpret.
3. Character-Level Tokenization
Here, each character is treated as a token. This is useful for languages with rich morphology or when handling noisy data (e.g., social media text).
How It Works:
- Input: “Apfelsaft ist lecker!”
- Tokens: ["A", "p", "f", "e", "l", "s", "a", "f", "t", " ", "i", "s", "t", " ", "l", "e", "c", "k", "e", "r", "!"]
Advantages:
- Very small vocabulary.
- Handles OOV words and typos seamlessly.
Disadvantages:
- Long sequences can increase computational costs.
- May lose semantic understanding.
4. Byte-Level Tokenization
This approach treats each byte of text as a token, making it language-agnostic.
How It Works:
- Input: “Apfelsaft ist lecker!”
- Tokens: the UTF-8 byte values of the text (one or more bytes per character).
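In Python, the byte-level view of a string is simply its UTF-8 encoding, as sketched below; the umlaut example is included to show why byte sequences can be longer than character sequences.

```python
# Byte-level "tokenization": the token IDs are simply the UTF-8 bytes (0-255).
text = "Apfelsaft ist lecker!"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)        # [65, 112, 102, 101, 108, ...] -- 21 bytes for 21 ASCII characters
print(len(byte_ids))   # 21

# Non-ASCII characters expand to multiple bytes, so sequences grow:
umlaut = "Äpfel sind süß"
print(len(umlaut), len(umlaut.encode("utf-8")))  # 14 characters vs. 17 bytes
```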
Advantages:
- Minimal preprocessing.
- Handles any language or special characters.
Disadvantages:
- Sequences can become very long.
- Harder to interpret.
Comparison Table
Feature | Word-Level Tokens | Subword Tokens | Character Tokens | Byte Tokens |
---|---|---|---|---|
Vocabulary Size | Large | Moderate | Small | Very Small |
Out-of-Vocabulary Handling | Replaced with <UNK> | Split into subwords | N/A (all characters covered) | N/A (all bytes covered) |
Flexibility | Limited | High | Very High | Very High |
Efficiency | High | Moderate | Low | Low |
Interpretability | Easy | Moderate | Easy | Hard |
Use Cases | Small datasets, domain-specific | Large datasets, pretrained models | Morphologically rich languages | Language-agnostic tasks |
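To make the efficiency row concrete, the short sketch below counts how long the same sentence becomes under word-, character-, and byte-level tokenization (subword length is omitted because it depends on the learned merges).

```python
# Rough sequence-length comparison for the same sentence.
text = "Apfelsaft ist lecker!"

word_tokens = text.split()               # naive whitespace words
char_tokens = list(text)                 # one token per character
byte_tokens = list(text.encode("utf-8")) # one token per UTF-8 byte

print("words:", len(word_tokens))        # 3
print("chars:", len(char_tokens))        # 21
print("bytes:", len(byte_tokens))        # 21 (ASCII text; more for umlauts etc.)
```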
Examples of Fine-Tuning and Building Models from Scratch
Fine-Tuning Pretrained Models
When fine-tuning, always use the tokenizer provided with the pretrained model to ensure compatibility. Here are some examples:
Model | Tokenization Method | Dataset Size Recommendation | Why? |
---|---|---|---|
GPT (OpenAI) | Byte-Level BPE (subword over bytes) | Small to Medium | Pretrained on large-scale data; byte-level BPE handles multilingual input without <UNK>. |
OpenLLaMA | Subword Tokens (BPE) | Medium to Large | Efficient for general-purpose language tasks with subword splitting. |
BERT (Hugging Face) | WordPiece Tokens | Small to Medium | BERT was pretrained with a WordPiece vocabulary, so fine-tuning must reuse the same tokenizer. |
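Loading the tokenizer that ships with a checkpoint typically looks like the sketch below; it assumes the Hugging Face transformers package and uses the bert-base-uncased checkpoint purely for illustration.

```python
# Sketch: reuse the tokenizer that ships with a pretrained model.
# Requires the Hugging Face `transformers` package (pip install transformers).
from transformers import AutoTokenizer

# Load the exact tokenizer the checkpoint was pretrained with.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Apfelsaft ist lecker!"
print(tokenizer.tokenize(text))   # WordPiece pieces, e.g. ['ap', '##fel', ...] (exact split may vary)
batch = tokenizer(text, padding="max_length", max_length=16, truncation=True)
print(batch["input_ids"])         # IDs including special tokens such as [CLS] and [SEP]
```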
Building Models from Scratch
When building a model from scratch, the choice of tokenization depends on the dataset size and task:
Dataset Size | Recommended Token Type | Model Type | Why? |
---|---|---|---|
Small (<1M samples) | Character-Level Tokens | RNNs, LSTMs | Handles small vocabularies and avoids OOV issues. |
Medium (1-10M) | Subword Tokens (BPE) | Transformers, Custom Models | Balances vocabulary size and sequence length efficiently. |
Large (>10M) | Subword or Byte-Level | Transformers (GPT, BERT, etc.) | Handles multilingual and large datasets effectively. |
When to Use Each Type of Tokenization
If You Have a Small Dataset
- Use fixed vocabulary tokens.
- Keep the vocabulary small and domain-specific.
- Example: the small vocabulary above with words like “Apfelsaft” and “Berlin.”
If You’re Training a Large Model
- Use subword tokenization (e.g., Byte-Pair Encoding, SentencePiece).
- This balances vocabulary size and sequence length.
If Your Data Is Noisy or Morphologically Rich
- Use character-level tokenization for robustness against typos or complex word forms.
If You’re Pretraining on Multilingual Data
- Use byte-level tokenization to handle multiple languages without requiring language-specific vocabularies.
What If Your Data Isn’t as Large as GPT’s Dataset?
- Use Pretrained Models:
- Fine-tune a model like GPT or BERT on your dataset. These models come with their own tokenizers, so you don’t need to define tokens.
- Optimize Vocabulary:
- If building from scratch, use subword tokenization to reduce vocabulary size while covering more text.
Final Thoughts
The choice of tokenization strategy depends on your dataset size, language complexity, and model requirements. Here’s a summary:
- Use fixed vocabulary for simplicity in small, domain-specific tasks.
- Use subword tokenization for most modern NLP applications.
- Use character-level or byte-level tokenization for highly flexible or multilingual tasks.
- For fine-tuning, always use the tokenizer provided with the pretrained model to ensure compatibility.
Understanding these options helps you select the tokenization approach best suited to your deep learning model.