Only a few days ago, the viral AI company DeepSeek released DeepSeek-R1, and the internet went wild over its stability and performance.
Today, DeepSeek has announced Janus-Pro, an advanced multimodal AI model that sets new benchmarks in image generation and visual understanding capabilities.
The latest iteration, particularly the 7B parameter version, has outperformed industry leaders, including OpenAI's DALL-E 3 and Stable Diffusion 3, on key image generation benchmarks.
According to the research paper, Janus-Pro-7B achieved an impressive 80% accuracy on GenEval, a challenging benchmark for image generation, surpassing DALL-E 3's 67% and Stable Diffusion 3 Medium's 74%.
The model also excelled in the DPG-Bench test, scoring 84.19 and demonstrating superior capabilities in following detailed image generation instructions.
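To make the GenEval gap concrete, the short sketch below computes Janus-Pro-7B's lead over each competitor from the figures quoted above. The dictionary and variable names are illustrative only, and the scores are simply the percentages reported in this article.

```python
# GenEval scores as quoted in the article (percent).
geneval = {
    "Janus-Pro-7B": 80,
    "Stable Diffusion 3 Medium": 74,
    "DALL-E 3": 67,
}

baseline = geneval["Janus-Pro-7B"]

# Lead of Janus-Pro-7B over each other model, in percentage points.
margins = {
    name: baseline - score
    for name, score in geneval.items()
    if name != "Janus-Pro-7B"
}

for name, gap in margins.items():
    print(f"Janus-Pro-7B leads {name} by {gap} percentage points")
```

These are differences in percentage points, not relative improvements.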
Janus-Pro is now available for download from the AI dev platform Hugging Face; it is part of a new DeepSeek model family.
What sets Janus-Pro apart is its novel approach to visual processing. The model employs a unique architecture that decouples visual encoding for multimodal understanding and generation tasks, allowing it to excel in both areas simultaneously.
This architectural innovation helps resolve the traditional conflict between these two objectives, a conflict that has challenged previous multimodal models.
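The idea of decoupled visual encoding can be illustrated with a toy sketch: each task gets its own encoder, and both feed the same downstream model. Everything here is hypothetical and chosen for illustration, including the two encoder functions, the shared_backbone stand-in, and the assumption that a single shared model consumes both token streams. It does not reproduce DeepSeek's actual implementation.

```python
from typing import Callable, Dict, List

Image = List[List[float]]  # toy grayscale image: rows of pixel values in [0, 1]


def understanding_encoder(image: Image) -> List[float]:
    """Hypothetical semantic encoder: one coarse feature (the mean) per row."""
    return [sum(row) / len(row) for row in image]


def generation_encoder(image: Image) -> List[float]:
    """Hypothetical generation tokenizer: quantize every pixel to levels 0..3."""
    return [float(round(pixel * 3)) for row in image for pixel in row]


def shared_backbone(tokens: List[float]) -> float:
    """Stand-in for the shared model: here it just averages its input tokens."""
    return sum(tokens) / len(tokens)


# Decoupling: the task decides which encoder runs; the backbone is the same.
ENCODERS: Dict[str, Callable[[Image], List[float]]] = {
    "understanding": understanding_encoder,
    "generation": generation_encoder,
}


def run(task: str, image: Image) -> float:
    tokens = ENCODERS[task](image)
    return shared_backbone(tokens)


image = [[0.0, 1.0], [1.0, 1.0]]
print(run("understanding", image))
print(run("generation", image))
```

Because each task has its own encoder, one representation does not have to serve both the coarse semantics that understanding needs and the fine detail that generation needs.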
The improvements in Janus-Pro stem from three key enhancements over its predecessor.
First, the team optimized the training strategy by extending the initial training phase and refining the data distribution across different training stages.
Second, they significantly expanded the training dataset, incorporating approximately 90 million new samples for multimodal understanding and 72 million samples of synthetic aesthetic data for image generation.
Finally, they scaled up the model size from 1.5B to 7B parameters, demonstrating improved convergence speeds and overall performance.
[Figure: Janus-Pro visual generation results]
In multimodal understanding benchmarks, Janus-Pro-7B achieved a score of 79.2 on MMBench, surpassing previous unified multimodal models, including the original Janus (69.4) and competing models like TokenFlow (68.9) and MetaMorph (75.2).
The model can process images up to 384×384 pixels in resolution, though the researchers acknowledge this as a current limitation they plan to address in future iterations.
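As a rough illustration of what a 384×384 ceiling means in practice, the sketch below computes the size a larger image would need to be scaled to in order to fit. The article does not say how the model's preprocessing handles oversized inputs, so aspect-preserving downscaling is an assumption here, and fit_within is a hypothetical helper rather than part of any Janus-Pro tooling.

```python
from typing import Tuple

MAX_SIDE = 384  # resolution limit quoted in the article


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> Tuple[int, int]:
    """Scale (width, height) down to fit a max_side x max_side box.

    The aspect ratio is preserved and images that already fit are unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Never upscale: the factor is capped at 1.0.
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1920, 1080))
print(fit_within(300, 200))
```

A 1920×1080 frame, for example, would shrink to a fifth of its original width and height, which hints at why fine detail can be lost at this resolution.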
The model's ability to handle complex instructions and produce high-quality images while maintaining strong understanding capabilities suggests a promising direction for future developments in multimodal AI technology.
Janus-Pro is released under an MIT license, which permits commercial use.