Let's first go through the official paper:
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
What Is DeepSeek-R1?
DeepSeek-R1 is a new method for training large language models (LLMs) so they can solve
tough reasoning problems (like math and coding challenges) more reliably. It starts with a base model
(“DeepSeek-V3”) and then applies Reinforcement Learning (RL) in a way that
makes the model teach itself to reason step by step, without relying on huge amounts of labeled examples.
In simpler terms:
- They take an existing language model.
- They let it practice solving problems on its own, rewarding it when it reaches a correct, properly formatted answer.
- Over many practice rounds, it gets really good at giving detailed, logical responses.
Two Main Versions
DeepSeek-R1-Zero
They begin by training the model purely with RL, giving it no extra “teacher” data
(no big supervised datasets). Surprisingly, this alone makes the model much better at step-by-step
reasoning—almost like how a human can get better at math by practicing a bunch of problems and
checking answers.
DeepSeek-R1
Although DeepSeek-R1-Zero improves reasoning, sometimes it produces messy or mixed-language answers.
To fix that, they:
- Gather a small amount of supervised “cold-start” data to clean up its style and correctness.
- Do another round of training that blends reinforcement learning with some curated examples.
This final DeepSeek-R1 version is more understandable to humans (better readability)
and maintains strong performance.
Why Is This Cool?
- Big Accuracy Gains: On the AIME math benchmark, DeepSeek-R1-Zero jumps from around 15.6% correct up to 71%, and even higher (86.7%) with techniques like majority voting (a small voting sketch follows this list); it also improves sharply on coding challenges. That's comparable to powerful existing models from big AI labs.
- Uses Less Supervision: Unlike many methods that rely on huge labeled datasets of "chain-of-thought" (step-by-step) answers, DeepSeek-R1-Zero first learns through pure RL. It explores solutions, checks correctness, and refines its own reasoning with minimal human labeling.
- Readable Solutions: DeepSeek-R1 then takes that self-taught reasoning ability and spruces it up with a small supervised dataset, so the final answers are both correct and clear.
- Distilling into Smaller Models: They also show they can transfer ("distill") these improved reasoning skills into smaller models that still perform well. That's good news for efficiency: smaller models can run faster on regular hardware.
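Majority voting here just means sampling many independent answers to the same question and keeping the most common final answer. Below is a minimal sketch, assuming a hypothetical `generate(prompt)` helper that returns one sampled final answer as a string:

```python
from collections import Counter

def majority_vote(generate, prompt, n_samples=64):
    """Sample several answers and return the most frequent one plus its vote share.

    `generate` is a hypothetical helper that samples one completion and
    extracts its final answer (e.g., the boxed value for a math problem).
    """
    answers = [generate(prompt) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples

# Usage with a stub generator that always answers "42":
best, share = majority_vote(lambda p: "42", "What is 6 * 7?", n_samples=8)
print(best, share)  # -> 42 1.0
```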
How Does It Work?
- Cold-Start Data: They collect a few thousand examples to fine-tune the base model so that it can start answering in a more structured, chain-of-thought style.
- Reinforcement Learning (RL):
  - A reward system checks if the final answer is correct and properly formatted (e.g., for math, you box the final solution).
  - Each time the model gets a correct, well-formatted solution, it gets a reward. If it's incorrect or messy, it doesn't.
  - Over time, the model "figures out" how to think through problems step by step (a reward sketch follows this list).
- Some Extra Fine-Tuning: After RL, they gather new data via rejection sampling (accepting only good answers) and combine it with existing supervised data from DeepSeek-V3. They do one more round of training on that mixture, which further cleans up the model's style and correctness.
- Optional Second RL Pass: A final tune-up with RL, this time focusing on how the model handles prompts from different scenarios (math, writing, factual QA, etc.). The end result is DeepSeek-R1.
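The reward in the paper is rule-based rather than learned: one part checks the final answer, another checks that the response follows the expected template. Here is a minimal sketch; the `<think>`/`<answer>` tags and the 1.0/0.5 weights are illustrative choices, not the paper's exact values:

```python
import re

def reward(completion: str, reference_answer: str) -> float:
    """Rule-based reward: accuracy check plus format check (illustrative weights)."""
    # Format reward: reasoning wrapped in <think>...</think>, answer in <answer>...</answer>.
    format_ok = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                               completion, flags=re.DOTALL))

    # Accuracy reward: pull out the boxed final answer and compare to the reference.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    correct = match is not None and match.group(1).strip() == reference_answer.strip()

    return (1.0 if correct else 0.0) + (0.5 if format_ok else 0.0)
```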
Strengthening the Model with Reasoning‑Oriented Training
Early versions of DeepSeek‑R1 already showed surprising gains simply by practicing problem‐solving through RL. However, the authors identified areas to improve:
- Taming “Mixed” Language Responses: Sometimes the chain of thought jumped between English and another language. They introduced a consistency requirement so that the model would stick to one language in its reasoning, improving clarity and readability (an illustrative check follows this list).
- Fine‑Tuning Output Style: They designed additional patterns for how answers should be formatted. This included adding special tokens to clearly mark the solution steps and summaries, making the final responses more user‐friendly.
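The paper implements this as a language-consistency reward: the proportion of the chain of thought written in the target language. The character-level heuristic below is only an illustrative stand-in for that measurement:

```python
import re

def language_consistency(chain_of_thought: str, target="english") -> float:
    """Crude heuristic: fraction of alphabetic characters in the target script.

    A real implementation would use proper word-level language identification.
    """
    latin = len(re.findall(r"[A-Za-z]", chain_of_thought))
    cjk = len(re.findall(r"[\u4e00-\u9fff]", chain_of_thought))
    total = latin + cjk
    if total == 0:
        return 1.0
    return (latin if target == "english" else cjk) / total

print(language_consistency("Solve x+1=2, 所以 x=1"))  # mixed-language reasoning scores < 1.0
```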
Polishing Answers via Rejection Sampling and Fine‑Tuning
Even with RL, the model could produce confusing or incorrect answers at times. To address this:
- Multiple Generations: For each prompt, the authors asked the model to produce several answers.
- Filtering Out Bad Answers: They kept only the best responses (e.g., correct, neatly formatted) and threw away the others.
- Supervised Fine‑Tuning: These high‐quality responses formed a new training set. By retraining the model on these “handpicked” answers, they improved both accuracy and clarity.
This process effectively taught DeepSeek‑R1 to learn from its own best attempts, reinforcing good reasoning and presentation habits.
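A minimal sketch of that rejection-sampling loop, reusing the hypothetical `generate` and `reward` helpers sketched earlier (the sampling count and threshold are illustrative):

```python
def build_sft_dataset(prompts, references, generate, reward,
                      n_samples=16, threshold=1.0):
    """Sample several completions per prompt, keep only those whose rule-based
    reward clears the threshold, and return prompt/completion pairs for
    supervised fine-tuning."""
    dataset = []
    for prompt, ref in zip(prompts, references):
        candidates = [generate(prompt) for _ in range(n_samples)]
        scored = [(reward(c, ref), c) for c in candidates]
        good = [c for score, c in scored if score >= threshold]
        if good:
            # Keep the single best-scoring completion per prompt; keeping all
            # passing completions is an equally valid variant.
            best = max(scored, key=lambda sc: sc[0])[1]
            dataset.append({"prompt": prompt, "completion": best})
    return dataset
```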
Extending Reasoning Skills to Smaller Models
One challenge in the world of large language models is how resource‐intensive they can be. To tackle this:
- Distillation Approach: The trained DeepSeek‑R1 (the “teacher”) transfers its knowledge into a smaller “student” model—like a 14B or 32B parameter version.
- Performance Gains in a Leaner Package: Even though these student models are physically smaller and cheaper to run, they retain a good deal of DeepSeek‑R1’s reasoning power. This means developers and organizations can deploy more efficient systems without sacrificing too much accuracy.
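In practice the distillation step is plain supervised fine-tuning of the student on reasoning traces generated by the teacher. Below is a rough sketch with Hugging Face `transformers`; the `teacher_outputs.jsonl` file, the chosen student checkpoint, and the hyperparameters are placeholder assumptions (the paper's recipe uses roughly 800k curated samples):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Hypothetical file of teacher-generated records: {"prompt": ..., "response": ...}
data = load_dataset("json", data_files="teacher_outputs.jsonl")["train"]

student_name = "Qwen/Qwen2.5-14B"  # one of the student sizes mentioned in the paper
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

def tokenize(example):
    # Concatenate prompt and teacher response; the student learns to imitate
    # both the reasoning trace and the final answer.
    return tokenizer(example["prompt"] + example["response"],
                     truncation=True, max_length=4096)

tokenized = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=student,
    args=TrainingArguments(output_dir="r1-distilled-student", num_train_epochs=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # labels = input_ids
)
trainer.train()
```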
Putting DeepSeek‑R1 to the Test
The authors ran DeepSeek‑R1 on a variety of benchmarks that measure both general and specialized skills:
- Math and Science: Tasks that require detailed reasoning steps (like advanced exams or problem sets).
- Coding Benchmarks: Challenges in multiple programming languages, where the model’s solutions were checked for correctness.
- Factual Question‑Answering and Reading Comprehension: Tests that evaluate how accurately a model can answer knowledge‐based queries or summarize text.
Results showed:
- DeepSeek‑R1 typically outperformed the original base model in nearly every category—especially math, coding, and long‑form reasoning.
- Distilled versions stayed impressively close to the performance of the larger model, making them practical for real‑world applications.
Reflections on What Worked (and What Didn’t)
In the course of refining DeepSeek‑R1, the authors tried some approaches that didn’t pan out:
- Process Reward Model (PRM): Attempting to score each intermediate step rather than just the final result proved too complex and resource‐heavy.
- Monte Carlo Tree Search (MCTS): Although this strategy powers game‑playing AIs (like AlphaGo), it struggled with the huge range of possible text sequences in language tasks.
Ultimately, straightforward RL with a clear focus on correctness and format was more effective and efficient.
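That "straightforward RL" is the paper's GRPO (Group Relative Policy Optimization): instead of training a separate critic, it samples a group of answers per prompt and scores each one relative to the group. A minimal sketch of the group-relative advantage follows; the full objective (clipped policy ratios, KL penalty) is omitted:

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sampled answer's reward, normalized by the
    mean and standard deviation of its group (all answers to the same prompt)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids division by zero

# Example: four sampled answers to one prompt, scored by the rule-based reward.
print(group_relative_advantages([1.5, 0.0, 1.0, 0.0]))
```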
Now let's answer some questions regarding this paper:
Why Was DeepSeek’s Training Much Cheaper than ChatGPT’s?
Question: With reported training costs of around $5.6 million for DeepSeek‑V3, contrasted against estimates of $41 to $78 million for OpenAI's GPT‑4 (the model behind ChatGPT), how did DeepSeek keep expenses so much lower?
Answer: Several overlapping factors contributed to DeepSeek’s lower reported training costs:
- Algorithmic Efficiency: DeepSeek focused on optimizing its training processes and reward structures to require fewer computational resources. This emphasis on streamlined algorithms reduced the hours (and thus cost) of large-scale model training. According to an Investopedia note, these optimization techniques were a major driver behind DeepSeek's lower cost profile.
- Open-Source Collaboration: By leveraging the open-source community, DeepSeek had access to shared innovations and pre-built libraries without paying for everything from scratch. Adopting tools and code from a collaborative ecosystem saved on development overhead and allowed the team to iterate faster. As cited by Investopedia, open-source methodologies helped DeepSeek avoid steep licensing fees.
- Resourceful Hardware Utilization: Despite facing U.S. export restrictions on advanced AI chips, DeepSeek developed workarounds to maximize its available infrastructure, using existing GPUs or custom hardware solutions more efficiently. This adaptive approach kept overall hardware spending down. Again, per Investopedia, DeepSeek's engineering team found creative ways to stretch compute budgets.
- Smaller Supervised Dataset & Distillation: DeepSeek reduced reliance on massive labeled datasets. Instead, it focused on reinforcement learning to guide reasoning, supplemented by a small "cold-start" dataset. After that, distillation allowed them to compress knowledge into smaller models, lowering training and deployment costs over time. ChatGPT, by contrast, typically requires huge amounts of labeled data and large-scale human feedback at multiple stages, which is both time-consuming and expensive.
- Reported vs. Full Costs: Although DeepSeek's stated $5.6 million training expense is far below the $41–$78 million range estimated for ChatGPT's training, these figures mainly reflect direct computational outlays (e.g., GPU usage). Other expenses, such as research staff salaries, data acquisition beyond the main dataset, or infrastructure investments, may not be fully captured. This caveat is emphasized in the Forbes coverage of AI training costs.
In Short
DeepSeek capitalized on more efficient algorithms, open‐source collaboration, creative hardware usage, and a leaner supervised dataset to keep its training bills modest. ChatGPT, meanwhile, likely incurred higher costs due to a broader set of training data, extensive human feedback loops, and a longer training timeline on expensive hardware.
Despite the stark difference in reported budget, it’s worth remembering that these figures only tell part of the story. Full costs (including research, development, and infrastructure) can be harder to measure—and might narrow the gap between DeepSeek and ChatGPT. Nonetheless, by focusing on algorithmic efficiency and resourcefulness, DeepSeek carved a path to advanced AI capabilities without the massive price tag typically associated with top‐tier language models.
Below is a comparison table that brings together ChatGPT, LLaMA, and DeepSeek, reflecting both publicly available information and details from the DeepSeek‑R1 paper. Please note that exact numbers (especially for ChatGPT and LLaMA) can be speculative, as official figures are often not fully disclosed.
| Aspect | ChatGPT | LLaMA | DeepSeek |
|---|---|---|---|
| Developer | OpenAI | Meta (Facebook AI Research) | DeepSeek (independent / open-source collaboration) |
| First Release (Approx.) | 2022 (GPT-3.5) / 2023 (GPT-4) | 2023 | 2024 (DeepSeek-V3) / 2025 (DeepSeek-R1) |
| Parameter Range | ~100B+ (unconfirmed for GPT-4) | 7B, 13B, 33B, 65B | Multiple sizes: from large (≥60B) down to smaller distilled models (e.g., 14B–32B) |
| Training Approach | Large-scale supervised pre-training on web data; extensive Reinforcement Learning from Human Feedback (RLHF) | Large-scale supervised pre-training on curated text; little to no official RLHF in the original release | Base model pre-trained (DeepSeek-V3); reinforcement learning with correctness & format rewards; minimal supervised "cold-start" data; final polishing with rejection sampling + fine-tuning |
| Open Source / License | Closed source | Partially open (research license) | Leveraged open-source collaboration and tools; final licensing not fully detailed, but more open involvement than ChatGPT |
| Reported Training Cost | $41M–$78M (estimates for GPT-4) | Not officially published (likely tens of millions) | ~$5.6M (DeepSeek-V3, per publicly mentioned figures); doesn't include all R&D or infrastructure costs |
| Key Differentiators | Extremely wide range of capabilities; highly polished chat interface; strong RLHF alignment | Multiple model sizes for different hardware; generally strong performance, especially in research contexts | Emphasis on algorithmic efficiency (lower training budget); open-source collaboration to reduce development overhead; resourceful hardware use to work around chip constraints; reinforcement-learning focus for "chain-of-thought" reasoning |
| Use of Distillation | Internal details not public; possibly some models are compressed, but not a main highlight | Some community projects have distilled LLaMA to smaller versions, but not an official primary method from Meta | Heavily used: DeepSeek transfers "self-taught" reasoning into smaller student models for cheaper, faster deployment |
| Performance Highlights | Consistently top-tier across QA, code generation, and general tasks; GPT-4 has advanced reasoning capabilities | Competitive with GPT-3.5 in many benchmarks; larger LLaMA variants (65B) can perform on par with some commercial models | Strong results on math/coding tasks; surpasses its baseline in chain-of-thought reasoning; distilled versions remain robust |
| Data Requirements | Massive curated datasets, plus large-scale human annotations for RLHF | Large curated text datasets (books, Wikipedia, etc.) | Smaller "cold-start" dataset + self-improvement through RL; less reliant on large supervised corpora |
| Known Limitations | Expensive to train and deploy; closed source; can be conservative or refuse answers too readily (due to safety filters) | Original license restricts some commercial use; lacks chat fine-tuning in the official release | Exact final licensing unclear; smaller scale than GPT-4 in certain domains; additional cost details (R&D, data acquisition) not fully published |
Let's do a quick comparison of ChatGPT, DeepSeek, and LLaMA for 2024–2025 as an example:
| Feature | ChatGPT (GPT-4o) | DeepSeek-V3 | LLaMA 3.1 |
|---|---|---|---|
| Developer | OpenAI | DeepSeek | Meta AI |
| Release Date | May 2024 | December 2024 | July 2024 |
| Model Architecture | Transformer-based (parameter count undisclosed) | Mixture of Experts (671B total, 37B activated per token) | Transformer-based (405B parameters) |
| Training Data | Extensive dataset across diverse domains | 14.8T tokens (English, Chinese, focus on math & programming) | 15.6T tokens across multiple languages |
| Performance | English MMLU: 87.2%; HumanEval-Mul: 80.5% | English MMLU: 88.5%; HumanEval-Mul: 82.6% | Performance metrics not specified |
| Training Cost | $41M–$78M (estimated) | ~$5.6M (2.788M GPU hours on Nvidia H800) | ~$60M (30.8M GPU hours on Nvidia H100) |
| Licensing | Proprietary | Open-source (DeepSeek License) | Open-source (Llama 3.1 Community License) |
| Key Strengths | Versatile across NLP tasks, widely adopted | High efficiency with MoE, strong in language & coding, cost-effective | Robust multilingual support, open-source, suitable for diverse applications |
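For context, the ~$5.6M figure follows almost directly from the GPU-hour count: the DeepSeek-V3 report assumes a rental price of about $2 per H800 GPU-hour (an assumed rental rate, not an audited total cost):

```python
gpu_hours = 2.788e6      # reported total H800 GPU-hours for DeepSeek-V3 training
usd_per_gpu_hour = 2.0   # assumed H800 rental price from the V3 report
print(f"${gpu_hours * usd_per_gpu_hour / 1e6:.2f}M")  # -> $5.58M, i.e. the ~$5.6M headline
```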