Tsinghua and Microsoft researchers train AI model using synthetic data, Nvidia chips
The team’s SynthSmith data pipeline produces a coding model trained without real-world data, addressing a scarcity that limits improvements to AI models

Tsinghua University and Microsoft researchers have developed a synthetic data pipeline for training artificial intelligence models without the need for real-world data, using chips from leading US chip designer Nvidia.
The pipeline, called SynthSmith, was used to train a small coding model that outperformed models twice its size, potentially addressing the scarcity of real-world data, a key bottleneck in improving AI models, according to a paper published on the open-access repository arXiv on January 11.
“In-depth analysis reveals that scaling laws hold on our synthetic dataset,” said the researchers from Tsinghua University, Microsoft Research Asia and Wuhan University.
Synthetic data is generated by AI algorithms to mimic real-world data. As new real-world data becomes scarce, AI researchers are experimenting with synthetic data to continue improving AI models.

Using SynthSmith, the researchers trained an X-Coder model with 7 billion parameters that scored higher than models with 14 billion parameters on major coding benchmarks despite using less data and none from the real world, the paper said.