
Tsinghua and Microsoft researchers train AI model using synthetic data, Nvidia chips

The team’s SynthSmith data pipeline develops a coding model that overcomes scarcity of real-world data to improve AI models

Synthetic data mimicking real-world data is generated by AI algorithms. Photo: Shutterstock
Vincent Chow

Tsinghua University and Microsoft researchers have developed a synthetic data pipeline for training artificial intelligence models without the need for real-world data, using chips from leading US chip designer Nvidia.

The pipeline, called SynthSmith, produced a small coding model that outperformed a model twice its size, potentially easing the scarcity of real-world data that has become a key bottleneck for improving AI models, according to a paper published on the open-access repository arXiv on January 11.

“In-depth analysis reveals that scaling laws hold on our synthetic dataset,” said the researchers from Tsinghua University, Microsoft Research Asia and Wuhan University.


Synthetic data mimicking real-world data is generated by AI algorithms. As new real-world data becomes scarce, AI researchers are experimenting with synthetic data to continue improving AI models.

Nvidia’s H20 and H200 chips provided computational power for the experiment. Photo: Nvidia

Using SynthSmith, the researchers trained an X-Coder model with 7 billion parameters that scored higher than models with 14 billion parameters on major coding benchmarks despite using less data and none from the real world, the paper said.
