Is DeepSeek’s AI ‘distillation’ theft? OpenAI seeks answers over China’s breakthrough

Experts say AI model distillation is likely widespread and hard to detect, but DeepSeek has not admitted to using it on its full models

OpenAI says it has evidence that Chinese AI start-up DeepSeek distilled its closed-source models through authorised access. Photo: Reuters
Since Chinese artificial intelligence (AI) start-up DeepSeek rattled Silicon Valley and Wall Street with its cost-effective models, the company has been accused of data theft through a practice that is common across the industry.
OpenAI said it has evidence that DeepSeek used “distillation” of its GPT models to train the open-source V3 and R1 models at a fraction of what Western tech giants are spending on their own models, the Financial Times reported on Wednesday. OpenAI and Microsoft, the ChatGPT maker’s biggest backer, have started investigating whether a group linked to DeepSeek exfiltrated large amounts of data through an application programming interface (API) in the autumn, Bloomberg reported, citing people familiar with the matter.

Distillation is a means of training smaller models to mimic the behaviour of larger, more sophisticated models. The practice is common internally at many companies looking to scale down the size of their models while offering similar performance to users. This, combined with the fact that model training often relies on large amounts of data of questionable provenance, has led some experts to question OpenAI’s sincerity in its accusations of intellectual property infringement.
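
For readers unfamiliar with the technique, the core idea of "mimicking" a larger model can be illustrated with a short sketch. It assumes a textbook form of distillation in which the smaller "student" model is penalised when its predicted probabilities differ from those of the larger "teacher" model. The function names, the example scores and the temperature value are illustrative assumptions, not details from any company's training pipeline. The sketch also assumes direct access to the teacher's raw output scores, which a commercial API may not expose.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.

    The value is zero when the student reproduces the teacher exactly and
    grows as the two distributions diverge.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


# Hypothetical scores over three candidate answers.
teacher = [4.0, 1.0, 0.5]
good_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
poor_student = [0.5, 1.0, 4.0]  # prefers a different answer

print(distillation_loss(good_student, teacher))
print(distillation_loss(poor_student, teacher))
```

Training would repeatedly adjust the student to reduce this loss over many examples. In the sketch, the student that roughly agrees with the teacher receives the smaller loss.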

“Distillation will violate most terms of service, yet it’s ironic – or even hypocritical – that Big Tech is calling it out. Training ChatGPT on Forbes or New York Times content also violated their terms of service,” Lutz Finger, a senior visiting lecturer at Cornell University who has worked in AI at tech companies including Google and LinkedIn, said in an emailed statement. “Knowledge is free and hard to protect.”
OpenAI co-founder and CEO Sam Altman (right) stands next to (right to left) SoftBank Group chairman and CEO Masayoshi Son and Oracle executive chairman Larry Ellison as US President Donald Trump announced a new AI initiative called Stargate on January 21. Photo: AFP
DeepSeek has its own distilled models that use other open-source models such as Meta Platforms’ Llama and Alibaba Group Holding’s Qwen. Alibaba owns the South China Morning Post.
However, OpenAI is alleging that DeepSeek used API access to the closed-source GPT models to distil those in an unauthorised manner. DeepSeek has not admitted to using distillation in training its main models, V3 and R1.