-
Advertisement
Artificial intelligence
TechTech Trends

Popular AI model performance benchmark may be flawed, Meta researchers warn

‘We’ve identified multiple loopholes with SWE-bench Verified,’ the manager at Meta Platforms’ AI research lab Fair says

Reading Time:2 minutes
Why you can trust SCMP
The shortcomings and limitations of widely used third-party benchmarks for AI models are under increased scrutiny. Photo: Shutterstock
Vincent Chow
A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers warned, raising fresh questions on the veracity of evaluations that have been made on major AI systems.
“We’ve identified multiple loopholes with SWE-bench Verified,” wrote Jacob Kahn, manager at Meta AI research lab Fair, in a post last week on the developer platform GitHub.
The post from Fair, which stands for Fundamental AI Research, found several prominent AI models – including Anthropic’s Claude and Alibaba Cloud’s Qwen – had “cheated” on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of Alibaba Group Holding, owner of the South China Morning Post.
Advertisement
OpenAI-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models based on how these systems fix hundreds of real-world software issues collected from GitHub, a Microsoft subsidiary.

Fair’s post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.

Advertisement
The AI models found to have shown such behaviour included Anthropic’s Claude 4 Sonnet, Z.ai’s GLM-4.5 and Alibaba Cloud’s Qwen3-Coder-30B-A3B – with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.

“We’re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,” Kahn wrote.

Advertisement
Select Voice
Select Speed
1.00x