Alibaba launches AI model that can process images and video on phones and laptops
The multimodal Qwen2.5-Omni-7B model is designed to run locally on mobile devices and tops rivals in some benchmarks

Alibaba launched Qwen2.5-Omni-7B on Thursday as the latest addition to its Qwen family of models. With just 7 billion parameters, it is designed to run on mobile phones, tablets and laptops, making advanced AI capabilities more accessible to everyday users.
The company highlighted potential use cases such as assisting visually impaired users with real-time audio descriptions and providing step-by-step cooking guidance by analysing ingredients. The model's versatility reflects the growing demand for AI systems that go beyond text generation.
Qwen2.5-Omni-7B has posted strong results in benchmark tests. It scored 56.1 on OmniBench, surpassing the 42.9 achieved by Google's Gemini-1.5-Pro. On the CV15 audio benchmark it reached 92.4, one point higher than Alibaba's earlier Qwen2-Audio model. For image-related tasks, it achieved 59.2 on the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark (MMMU), beating the Qwen2.5-VL vision-language model.