Is DeepSeek’s AI ‘distillation’ theft? OpenAI seeks answers over China’s breakthrough

Experts say AI model distillation is likely widespread and hard to detect, but DeepSeek has not admitted to using it on its full models

OpenAI says it has evidence that Chinese AI start-up DeepSeek distilled its closed-source models through authorised access. Photo: Reuters

Published: 8:00pm, 30 Jan 2025

Since Chinese artificial intelligence (AI) start-up DeepSeek rattled Silicon Valley and Wall Street with its cost-effective models, the company has been accused of data theft through a practice that is common across the industry.

OpenAI said it has evidence that DeepSeek used “distillation” of its GPT models to train the open-source V3 and R1 models at a fraction of what Western tech giants are spending on their own models, the Financial Times reported on Wednesday. OpenAI and Microsoft, the ChatGPT maker’s biggest backer, have started investigating whether a group linked to DeepSeek exfiltrated large amounts of data through an application programming interface (API) in the autumn, Bloomberg reported, citing people familiar with the matter.

Distillation is a means of training smaller models to mimic the behaviour of larger, more sophisticated models. The practice is common internally at many companies looking to shrink their models while offering users similar performance. This, combined with the fact that model training often relies on large amounts of data of questionable provenance, has led some experts to question the sincerity of OpenAI’s accusations of intellectual property infringement.

“Distillation will violate most terms of service, yet it’s ironic – or even hypocritical – that Big Tech is calling it out. Training ChatGPT on Forbes or New York Times content also violated their terms of service,” Lutz Finger, a senior visiting lecturer at Cornell University who has worked in AI at tech companies including Google and LinkedIn, said in an emailed statement. “Knowledge is free and hard to protect.”

OpenAI co-founder and CEO Sam Altman (right) stands next to (right to left) SoftBank Group chairman and CEO Masayoshi Son and Oracle executive chairman Larry Ellison as US President Donald Trump announced a new AI initiative called Stargate on January 21. Photo: AFP

DeepSeek has its own distilled models that are based on other open-source models, such as Meta Platforms’ Llama and Alibaba Group Holding’s Qwen. Alibaba owns the South China Morning Post.

However, OpenAI alleges that DeepSeek used API access to the closed-source GPT models to distil them without authorisation. DeepSeek has not admitted to using distillation in training its main models, V3 and R1.
