ByteDance advances DeepSeek work in AI reasoning with open-source project led by intern
DAPO is a scalable reinforcement learning algorithm that helps a large language model achieve better complex reasoning behaviour

TikTok owner ByteDance, which has invested heavily in artificial intelligence (AI), has unveiled a new training method that its researchers say improves on the work done by DeepSeek in training AI reasoning models.
DAPO, or Decoupled Clip and Dynamic Sampling Policy Optimisation, is a scalable reinforcement learning algorithm that helps a large language model (LLM) develop more complex reasoning behaviours, such as self-verification and iterative refinement, according to a research paper published earlier this week by ByteDance and Tsinghua University’s Institute for AI Industry Research.
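Broadly, the “decoupled clip” and “dynamic sampling” in the name refer to two ideas: using separate lower and upper bounds when clipping the size of each policy update, and skipping prompts whose sampled answers are all correct or all wrong, since those offer nothing to compare. The short Python sketch below is one illustrative reading of those two ideas; the function names, clipping values and toy numbers are assumptions for illustration, not code from the paper.

import numpy as np

def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # PPO-style surrogate with separate lower and upper clip bounds on the
    # policy-update ratio; a single epsilon would clip symmetrically, while a
    # looser upper bound leaves more room to boost low-probability tokens.
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

def keep_prompt(rewards):
    # Dynamic-sampling filter: drop prompts whose sampled answers are all
    # correct or all wrong, as they carry no comparative learning signal.
    mean = float(np.mean(rewards))
    return 0.0 < mean < 1.0

# Toy usage with made-up numbers.
ratios = np.array([0.8, 1.1, 1.4])
advantages = np.array([1.0, -0.5, 1.0])
print(decoupled_clip_objective(ratios, advantages))
print(keep_prompt([1, 0, 1, 1]), keep_prompt([1, 1, 1, 1]))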
The algorithm outperformed the reinforcement learning approach in DeepSeek’s R1 reasoning model, scoring 50 points in the American Invitational Mathematics Examination (AIME) 2024 using Alibaba Group Holding’s Qwen2.5-32B base model, compared with the 47 points attained by R1’s approach applied to the same Alibaba model, the paper showed. Alibaba owns the South China Morning Post.
Notably, DAPO achieved the better result with 50 per cent fewer training steps.

The result drew praise from academics and industry figures. Google DeepMind engineer Philipp Schmid, who shared the project on X, said the new method was “better than” DeepSeek’s “group relative policy optimisation (GRPO)” in reinforcement learning. GRPO is one of DeepSeek’s training methods; it has the model generate a group of candidate answers for each prompt and learn by comparing each answer’s reward against the rest of the group.
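As a rough illustration of that “group” idea, the snippet below scores each sampled answer against the mean and spread of rewards within its own group, so above-average answers are reinforced and below-average ones discouraged. It is a simplified sketch with made-up reward values, not DeepSeek’s implementation.

import numpy as np

def group_relative_advantages(rewards):
    # Score each sampled answer relative to its own group: above-average
    # answers get positive advantages, below-average answers negative ones.
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std == 0:  # identical rewards carry no comparative signal
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# Eight answers sampled for one prompt, scored 1 if correct and 0 otherwise (toy data).
print(group_relative_advantages([1, 0, 0, 1, 1, 0, 1, 1]))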