Alibaba challenges OpenAI’s GPT-4o and Google’s Nano Banana with new multimodal AI model
Two variants of Qwen3-Omni outperform GPT-4o and Gemini-2.5-Flash in audio, image and video comprehension, developers say

Chief among the new releases was Qwen3-Omni, a flagship multimodal model akin to OpenAI’s GPT-4o, which launched in May 2024. The Alibaba model is designed to process a combination of text, audio, image and video inputs and respond with text and audio.
Qwen3-Omni is the first native end-to-end multimodal system that “unifies text, images, audio and video in one model”, the development team said on social media. Alibaba owns the Post.
The model competes with similar offerings already available outside China, including OpenAI’s GPT-4o and Google’s Gemini 2.5 Flash, whose image-editing and generation variant, known as “Nano Banana”, has been making waves recently.
Citing benchmark tests on audio recognition and comprehension, as well as image and video understanding, developers said two variants of Qwen3-Omni outperformed their predecessor, Qwen2.5-Omni-7B, along with GPT-4o and Gemini 2.5 Flash.
Lin Junyang, a researcher on the Qwen team under Alibaba’s cloud unit, attributed the improvements to various foundational projects related to audio and images.
