DeepSeek’s updated R1 AI model matches coding ability of Google, Anthropic models in new benchmark
The updated reasoning model, released in May, performed well against leading US models in the real-time WebDev Arena tests

The updated version of DeepSeek-R1 tied for first place with Google’s Gemini-2.5 and Anthropic’s Claude Opus 4 on the WebDev Arena leaderboard, which evaluates large language models (LLMs) on their ability to solve coding tasks quickly and accurately. The Hangzhou-based company’s R1 scored 1,408.84, close to Opus 4’s 1,405.51 and Gemini-2.5’s 1,433.16.
Scores on the leaderboard are determined by human evaluators who judge the quality of the models’ output. DeepSeek’s reasoning model has consistently performed at levels close to those of leading models in various benchmark tests since it was unveiled in January, despite significantly lower training costs.
The R1 update attracted attention from the developer community amid widespread anticipation for DeepSeek’s next-generation reasoning model, R2. The company has said little about when it might release its big follow-up.