Advertisement
Artificial intelligence
TechTech Trends

Popular AI model performance benchmark may be flawed, Meta researchers warn

‘We’ve identified multiple loopholes with SWE-bench Verified,’ the manager at Meta Platforms’ AI research lab Fair says

Reading Time:2 minutes
Why you can trust SCMP
The shortcomings and limitations of widely used third-party benchmarks for AI models are under increased scrutiny. Photo: Shutterstock
Vincent Chow
A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers warned, raising fresh questions on the veracity of evaluations that have been made on major AI systems.
“We’ve identified multiple loopholes with SWE-bench Verified,” wrote Jacob Kahn, manager at Meta AI research lab Fair, in a post last week on the developer platform GitHub.
The post from Fair, which stands for Fundamental AI Research, found several prominent AI models – including Anthropic’s Claude and Alibaba Cloud’s Qwen – had “cheated” on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of Alibaba Group Holding, owner of the South China Morning Post.
Advertisement
OpenAI-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models based on how these systems fix hundreds of real-world software issues collected from GitHub, a Microsoft subsidiary.

Fair’s post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.

Advertisement
The AI models found to have shown such behaviour included Anthropic’s Claude 4 Sonnet, Z.ai’s GLM-4.5 and Alibaba Cloud’s Qwen3-Coder-30B-A3B – with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.

“We’re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,” Kahn wrote.

Advertisement
Select Voice
Choose your listening speed
Get through articles 2x faster
1.25x
250 WPM
Slow
Average
Fast
1.25x