Popular AI model performance benchmark may be flawed, Meta researchers warn

‘We’ve identified multiple loopholes with SWE-bench Verified,’ the manager at Meta Platforms’ AI research lab Fair says

The shortcomings and limitations of widely used third-party benchmarks for AI models are under increased scrutiny. Photo: Shutterstock

Published: 10:00am, 9 Sep 2025

A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers warned, raising fresh questions about the reliability of evaluations of major AI systems.

“We’ve identified multiple loopholes with SWE-bench Verified,” wrote Jacob Kahn, manager at Meta AI research lab Fair, in a post last week on the developer platform GitHub.

The post from Fair, which stands for Fundamental AI Research, found several prominent AI models – including Anthropic’s Claude and Alibaba Cloud’s Qwen – had “cheated” on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of Alibaba Group Holding, owner of the South China Morning Post.

OpenAI-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models based on how well they fix hundreds of real-world software issues collected from GitHub, a Microsoft subsidiary.

Fair’s post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.

The AI models found to have shown such behaviour included Anthropic’s Claude 4 Sonnet, Z.ai’s GLM-4.5 and Alibaba Cloud’s Qwen3-Coder-30B-A3B – with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.

“We’re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,” Kahn wrote.
