
Alibaba challenges OpenAI’s GPT-4o and Google’s Nano Banana with new multimodal AI model

Two variants of Qwen3-Omni outperform GPT-4o and Gemini-2.5-Flash in audio, image and video comprehension, developers say

Alibaba’s logo seen at a trade fair in Beijing. Photo: AP
Ben Jiang in Beijing
Alibaba Group Holding on Tuesday unveiled a suite of new artificial intelligence models, including a multimodal system rivalling OpenAI’s GPT-4o and Google’s popular “Nano Banana” image editor, intensifying both domestic and international competition in the field.

Chief among the new releases was Qwen3-Omni, a flagship multimodal model akin to OpenAI’s GPT-4o launched in May 2024. The Alibaba model is designed to process a combination of text, audio, image and video inputs and respond with text and audio.

Qwen3-Omni was the first native end-to-end multimodal system that “unifies text, images, audio and video in one model”, the development team said on social media. Alibaba owns the Post.


The model competes with similar offerings already available outside China, including OpenAI’s GPT-4o and Google’s Gemini 2.5 Flash, also known as “Nano Banana” – an image editing and generation tool that has been making waves recently.


Citing benchmark tests on audio recognition and comprehension, as well as image and video understanding, developers said two variants of Qwen3-Omni outperformed their predecessor, Qwen2.5-Omni-7B, as well as GPT-4o and Gemini-2.5-Flash.


Lin Junyang, a researcher on the Qwen team under Alibaba’s cloud unit, attributed the improvements to various foundational projects related to audio and images.
