ByteDance unveils DAPO, enhancing AI reasoning capabilities
ByteDance has introduced a new system aimed at enhancing reasoning in artificial intelligence (AI). The system, named DAPO (Decoupled Clip and Dynamic Sampling Policy Optimisation), is a scalable reinforcement learning algorithm that helps a large language model (LLM) improve at complex reasoning tasks, such as self-verification and iterative refinement, and builds on methods popularised by DeepSeek.

ByteDance's research paper, published with Tsinghua University, reports that DAPO outperformed DeepSeek's earlier R1 model. In recent tests, DAPO scored 50 points on the American Invitational Mathematics Examination (AIME) 2024 using Alibaba's Qwen2.5-32B base model, compared with 47 points for R1 on the same base model. Notably, DAPO achieved this higher score with 50% fewer training steps.

The results have drawn acclaim from academics and industry experts alike. Google DeepMind engineer Philipp Schmid praised DAPO on social media, stating it is "better than" DeepSeek's method, group relative policy optimisation (GRPO). GRPO lets a model learn by sampling a group of responses to each prompt and scoring each response relative to its peers in the group.
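To illustrate the group-relative idea behind GRPO, the sketch below shows how rewards for a group of sampled responses might be normalised against the group's own mean and standard deviation. This is a hypothetical simplification for illustration, not ByteDance's or DeepSeek's actual implementation; the function name and the 0/1 reward scheme are assumptions.

```python
# Hypothetical sketch of group-relative scoring (the idea behind GRPO).
# For one prompt, several responses are sampled; each response's reward
# is normalised against the group's mean and standard deviation, so the
# model learns from how each attempt compares with its peers rather
# than from a separately trained value estimate.
from statistics import mean, stdev


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each sampled response relative to its sampling group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # Every response scored the same: this group carries no signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Example: four sampled answers to one maths problem, scored 0/1 for
# correctness. Correct answers get positive advantages, wrong ones negative.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Normalising within the group means the learning signal depends only on relative quality: a group where every answer is wrong (or every answer is right) contributes nothing, which is one motivation the article's description of DAPO's "dynamic sampling" hints at.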