Let's go through the paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" – Day 80

Let's first go through the official paper: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning".

What Is DeepSeek-R1?

DeepSeek-R1 is a new method for training large language models (LLMs) so they can solve tough reasoning problems (like math and coding challenges) more reliably. It starts with a base model (DeepSeek-V3) and then applies Reinforcement Learning (RL) in a way that makes the model teach itself to reason step by step, without relying on a huge amount of labeled examples.

In simpler terms: they take an existing language model and let it practice solving problems on its own, rewarding it when it reaches a correct, properly formatted answer. Over many practice rounds, it gets really good at giving detailed, logical responses.

Two Main Versions

DeepSeek-R1-Zero
They begin by training the model purely with RL, giving it no extra "teacher" data (no big supervised datasets). Surprisingly, this alone makes the model much better at step-by-step reasoning, almost like how a human gets better at math by practicing a bunch of problems and checking the answers.

DeepSeek-R1
Although DeepSeek-R1-Zero improves reasoning, it sometimes produces messy or mixed-language answers. To fix that, they:
- Gather a small amount of supervised "cold-start" data to clean up its style and correctness.
- Do another round of training that blends reinforcement learning with some curated examples.
This final DeepSeek-R1 version is more understandable to humans (better readability) and maintains strong performance.

Why Is This Cool?

- Big Accuracy Gains: On tricky math tests (like AIME) and programming challenges, DeepSeek-R1-Zero jumps from around 15.6% correct up to 71%, and even higher with techniques like majority voting. That's comparable to powerful existing models from big AI labs.
- Uses Less Supervision: Unlike many methods that rely on huge labeled datasets of "chain-of-thought" (step-by-step) answers, DeepSeek-R1-Zero first learns through pure RL. It explores solutions, checks correctness, and refines its own reasoning with minimal human labeling.
- Readable Solutions: DeepSeek-R1 then takes that self-taught reasoning ability and spruces it up with a small supervised dataset, so the final answers are both correct and clear.
- Distilling into Smaller Models: They also show they can transfer ("distill") these improved reasoning skills into smaller models that still do really well. That's good news for efficiency, since smaller models can run faster on regular hardware.

How Does It Work?

1. Cold-Start Data: They collect a few thousand examples to fine-tune the base model so that it can start answering in a more structured, chain-of-thought style.
2. Reinforcement Learning (RL): A reward system checks whether the final answer is correct and properly formatted (e.g., for math, the final solution goes in a box). Each time the model produces a correct, well-formatted solution, it gets a reward; if the answer is incorrect or messy, it doesn't. Over time, the model "figures out" how to think through problems step by step (a minimal sketch of such a rule-based reward appears right after this section).
3. Some Extra Fine-Tuning: After RL, they gather new data via rejection sampling (accepting only good answers) and combine it with existing supervised data from DeepSeek-V3. They do one more round of training on that mixture, which further cleans up the model's style and correctness.
4. Optional Second RL Pass: A final tune-up with RL, this time focusing on how the model handles prompts from different scenarios (math, writing, factual QA, etc.).

The end result is DeepSeek-R1.
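To make the reward idea concrete, here is a minimal Python sketch of a rule-based reward of the kind described above. It assumes, purely for illustration, that responses follow a `<think>...</think><answer>...</answer>` template and that math answers are wrapped in `\boxed{}`; the tag names, helper names, and weighting are assumptions, not the authors' actual implementation.

```python
import re

# Illustrative rule-based reward, loosely following the idea of an accuracy reward
# (is the final answer correct?) plus a format reward (does the response follow the
# expected reasoning/answer structure?). Tags, the boxed-answer convention, and the
# 0.5 weighting are assumptions made for this sketch.

THINK_ANSWER_PATTERN = re.compile(
    r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL
)
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")

def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed <think>...</think><answer>...</answer> template."""
    return 1.0 if THINK_ANSWER_PATTERN.search(response) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the boxed final answer matches the reference answer exactly."""
    match = BOXED_PATTERN.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Correctness dominates; formatting adds a smaller bonus.
    return accuracy_reward(response, ground_truth) + 0.5 * format_reward(response)

# Example: a correct, well-formatted response earns the full reward.
resp = "<think>2 + 2 = 4</think> <answer>The result is \\boxed{4}.</answer>"
print(total_reward(resp, "4"))  # 1.5
```

A reward like this needs no learned judge model: the training loop simply scores each sampled response and reinforces the ones that check out, which is what lets the model improve with so little labeled data.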
Strengthening the Model with Reasoning-Oriented Training

Early versions of DeepSeek-R1 already showed surprising gains simply by practicing problem-solving through RL. However, the authors identified areas to improve:

- Taming "Mixed" Language Responses: Sometimes the chain of thought jumped between English and another language. They introduced a language-consistency requirement so that the model would stick to one language in its reasoning, improving clarity and readability.
- Fine-Tuning Output Style: They designed additional patterns for how answers should be formatted, including special tokens that clearly mark the solution steps and summaries, making the final responses more user-friendly.

Polishing Answers via Rejection Sampling and Fine-Tuning

Even with RL, the model could produce confusing or incorrect answers at times. To address this:

- Multiple Generations: For each prompt, the authors asked the model to produce several answers.
- Filtering Out Bad Answers: They kept only the best responses (e.g., correct, neatly formatted) and threw away the others.
- Supervised Fine-Tuning: These high-quality responses formed a new training set. By retraining the model on these "handpicked" answers, they improved both accuracy and clarity.

This process effectively taught DeepSeek-R1 to learn from its own best attempts, reinforcing good reasoning and presentation habits (a small code sketch of this filtering idea appears after the "Reflections" section below).

Extending Reasoning Skills to Smaller Models

One challenge in the world of large language models is how resource-intensive they can be. To tackle this:

- Distillation Approach: The trained DeepSeek-R1 (the "teacher") transfers its knowledge into a smaller "student" model, such as a 14B or 32B parameter version.
- Performance Gains in a Leaner Package: Even though these student models are smaller and cheaper to run, they retain a good deal of DeepSeek-R1's reasoning power. This means developers and organizations can deploy more efficient systems without sacrificing too much accuracy.

Putting DeepSeek-R1 to the Test

The authors ran DeepSeek-R1 on a variety of benchmarks that measure both general and specialized skills:

- Math and Science: Tasks that require detailed reasoning steps (like advanced exams or problem sets).
- Coding Benchmarks: Challenges in multiple programming languages, where the model's solutions were checked for correctness.
- Factual Question-Answering and Reading Comprehension: Tests that evaluate how accurately a model can answer knowledge-based queries or summarize text.

Results showed:

- DeepSeek-R1 typically outperformed the original base model in nearly every category, especially math, coding, and long-form reasoning.
- Distilled versions stayed impressively close to the performance of the larger model, making them practical for real-world applications.

Reflections on What Worked (and What Didn't)

In the course of refining DeepSeek-R1, the authors tried some approaches that didn't pan out:

- Process Reward Model (PRM): Attempting to score each intermediate step rather than just the final result proved too complex and resource-heavy.
- Monte Carlo Tree Search (MCTS): Although this strategy powers game-playing AIs (like AlphaGo), it struggled with the huge range of possible text sequences in language tasks.

Ultimately, straightforward RL with a clear focus on correctness and format was more effective and efficient.
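Circling back to the rejection-sampling step described above, here is a minimal sketch of the idea: sample several candidates per prompt, keep only those that pass correctness and formatting checks, and reuse the survivors as supervised fine-tuning data. The `generate` and `passes_checks` helpers are placeholders assumed for this sketch, not the authors' actual pipeline.

```python
from typing import Callable, List, Tuple

def rejection_sample(
    model: object,
    prompts_with_refs: List[Tuple[str, str]],
    generate: Callable[[object, str], str],
    passes_checks: Callable[[str, str], bool],
    samples_per_prompt: int = 8,
) -> List[Tuple[str, str]]:
    """Collect (prompt, completion) pairs that survive correctness/format filtering.

    `generate` samples one completion from the model; `passes_checks` applies the
    correctness and formatting filters described above. Both are assumed helpers.
    """
    kept: List[Tuple[str, str]] = []
    for prompt, reference in prompts_with_refs:
        for _ in range(samples_per_prompt):
            completion = generate(model, prompt)      # sample one candidate answer
            if passes_checks(completion, reference):  # keep only correct, well-formatted ones
                kept.append((prompt, completion))
    return kept

# The surviving pairs would then be mixed with curated supervised data and used
# for another round of fine-tuning, as described in the write-up above.
```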
So now, let's answer some questions about this paper.

Why Was DeepSeek's Training Much Cheaper than ChatGPT's?

Question: With reported training costs of around $5.6 million for DeepSeek-V3, contrasted against estimates ranging from $41 to $78 million for OpenAI's GPT-4, how did DeepSeek keep expenses so much lower?

Answer: Several overlapping factors contributed to DeepSeek's lower reported training costs:

- Algorithmic Efficiency: DeepSeek focused on optimizing its training processes and reward structures to require fewer computational resources. This emphasis on streamlined algorithms reduced the hours (and thus cost) of large-scale model training. According to an Investopedia note, these optimization techniques were a major driver behind DeepSeek's lower cost profile.
- Open-Source Collaboration: By leveraging the open-source community, DeepSeek had access to…
