Tsinghua and Microsoft researchers train AI model using synthetic data, Nvidia chips

The team’s SynthSmith data pipeline develops a coding model that overcomes scarcity of real-world data to improve AI models


Synthetic data mimicking real-world data is generated by AI algorithms. Photo: Shutterstock

Published: 3:45pm, 26 Jan 2026

Tsinghua University and Microsoft researchers have developed a synthetic data pipeline for training artificial intelligence models without the need for real-world data, using chips from leading US chip designer Nvidia.

The pipeline, called SynthSmith, produced a small coding model that outperformed a model twice its size, potentially easing the scarcity of real-world data that has become a key bottleneck for improving AI models, according to a paper published on the open-access repository arXiv on January 11.

“In-depth analysis reveals that scaling laws hold on our synthetic dataset,” said the researchers from Tsinghua University, Microsoft Research Asia and Wuhan University.

Synthetic data mimicking real-world data is generated by AI algorithms. As new real-world data becomes scarce, AI researchers are experimenting with synthetic data to continue improving AI models.

Nvidia’s H20 and H200 chips provided computational power for the experiment. Photo: Nvidia

Using SynthSmith, the researchers trained an X-Coder model with 7 billion parameters that scored higher than models with 14 billion parameters on major coding benchmarks despite using less data and none from the real world, the paper said.
