Democratizing GPT-4o Level Image Generation: The Janus-4o And ShareGPT-4o-Image Challenge

24/07/2025

3 main points
✔️ ShareGPT-4o-Image, a synthetic dataset of 91K images that mimics GPT-4o's image generation capabilities
✔️ New Janus-4o model fine-tuned with this data supports both image generation and image editing
✔️ Outperforms existing high-performance models in image generation with a small amount of data and a short training time

ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
written by Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, Benyou Wang
(Submitted on 22 Jun 2025)
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

In this paper, the authors construct a new large-scale synthetic dataset called "ShareGPT-4o-Image" to transfer the advanced image generation capability of GPT-4o to an open-source multimodal model. They also propose "Janus-4o," a multimodal large language model trained on this dataset.

ShareGPT-4o-Image consists of 45K text-to-image samples and 46K text-and-image-to-image (image editing) samples, all of which are high-quality samples generated using GPT-4o-Image. Fine-tuning the existing Janus-Pro model on this data yields Janus-4o, which can not only generate images from text but also edit images (image generation from text + image input). Notably, with only 91K samples and 6 hours of training, Janus-4o outperforms its base model, Janus-Pro, on text-to-image benchmarks and matches or exceeds earlier image editing models.

This research contributes to the democratization of high-performance image generation techniques and is an important step toward accelerating open multimodal research.

Proposed Method

ShareGPT-4o-Image is a synthetic dataset designed to distill the capabilities of GPT-4o-Image. Its text-to-image data was created using two generation schemes.

One is "prompt-driven": attributes (objects, backgrounds, styles, etc.) are defined first, an LLM writes a natural language prompt from them, and GPT-4o-Image generates the image for that prompt. The other is "image-driven": an LLM writes a detailed description of an existing image, and the description is paired with the image to form a sample.

The image editing data consists of triplets of the original image, an editing instruction, and the edited image. It spans 14 different editing tasks, including a wide range of style transformations and element additions.
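The prompt-driven scheme and the editing triplet can be sketched as follows. This is only an illustration under stated assumptions: the attribute values, the function names `sample_attributes` and `build_prompt_request`, and the `EditSample` fields are hypothetical and are not taken from the paper. The article names only the attribute categories (objects, backgrounds, styles) and the three parts of an editing sample. The calls to the LLM and to GPT-4o-Image are deliberately left out.

```python
import random
from dataclasses import dataclass

# Hypothetical attribute pools. Only the category names come from the article;
# the concrete values are made up for illustration.
ATTRIBUTE_POOLS = {
    "object": ["a red bicycle", "a ceramic teapot", "a golden retriever"],
    "background": ["a snowy street", "a sunlit kitchen", "a pine forest"],
    "style": ["watercolor", "photorealistic", "pixel art"],
}


def sample_attributes(rng: random.Random) -> dict:
    """Pick one value per attribute category."""
    return {name: rng.choice(values) for name, values in ATTRIBUTE_POOLS.items()}


def build_prompt_request(attrs: dict) -> str:
    """Build the instruction that would be sent to an LLM, which in turn
    would write the natural language image prompt (assumed wording)."""
    listed = "; ".join(f"{name}: {value}" for name, value in attrs.items())
    return f"Write one natural image-generation prompt using these attributes. {listed}"


@dataclass(frozen=True)
class EditSample:
    """One image editing sample: original image, instruction, edited image."""
    source_image: str   # path to the original image (hypothetical field name)
    instruction: str    # natural language editing instruction
    edited_image: str   # path to the edited image
    task: str           # one of the 14 editing task labels


rng = random.Random(0)
attrs = sample_attributes(rng)
request = build_prompt_request(attrs)
sample = EditSample("cat.png", "Turn the photo into a watercolor painting",
                    "cat_watercolor.png", "style transfer")
```

Keeping attributes separate from the final prompt wording makes it easy to control coverage of the attribute space while still letting the LLM produce varied, natural phrasing.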

Using this dataset, the authors fine-tuned the existing Janus-Pro model to develop Janus-4o. Janus-4o is structured to accept both text-only input and combined text + image input, and is designed to learn appropriate representations for each.
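To make the two input modes concrete, here is a minimal sketch of how a request could be routed to either task. It assumes a hypothetical `GenerationRequest` type and a hypothetical segment ordering (image first, then text); neither is taken from the Janus-4o implementation, whose actual input format is not described in this article.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GenerationRequest:
    text: str                          # prompt or editing instruction
    input_image: Optional[str] = None  # path to a reference image, if any


def task_type(req: GenerationRequest) -> str:
    """Text-only requests are text-to-image; text + image requests are editing."""
    if req.input_image is None:
        return "text-to-image"
    return "text-and-image-to-image"


def conditioning_segments(req: GenerationRequest) -> List[Tuple[str, str]]:
    """List the conditioning inputs in the order they would be fed to the model.
    The ordering is an assumption made for this sketch."""
    segments: List[Tuple[str, str]] = []
    if req.input_image is not None:
        segments.append(("image", req.input_image))
    segments.append(("text", req.text))
    return segments


generate = GenerationRequest("a lighthouse at dusk, watercolor")
edit = GenerationRequest("add a sailboat on the horizon", input_image="lighthouse.png")
```

Here `task_type(generate)` returns `"text-to-image"` and `task_type(edit)` returns `"text-and-image-to-image"`, so a single model interface covers both generation and editing.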

Experiments

The performance of Janus-4o was evaluated in two tasks: image generation from text and image editing.

For image generation from text, the benchmarks GenEval and DPG-Bench were used to measure compositional consistency and visual fidelity. Compared to Janus-Pro, Janus-4o improved by 4 points on GenEval and 1.6 points on DPG-Bench.

Meanwhile, image editing capabilities were evaluated on the ImgEdit-Bench benchmark, where Janus-4o recorded high scores in specific editing categories such as motion changes and style transfer. Particularly noteworthy is that, with only a small amount of training data (91K samples), its performance was comparable to or even exceeded that of prior models trained on more than 4M samples.
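The data-efficiency claim can be put into numbers with a short calculation. The sample counts are the rounded figures quoted in this article, and 4M is used as a lower bound for the larger training sets, so the resulting ratio is an upper bound.

```python
# Rounded sample counts quoted in this article.
TEXT_TO_IMAGE_SAMPLES = 45_000
EDITING_SAMPLES = 46_000

# "More than 4M" is treated as a lower bound for the prior models' data.
PRIOR_MODEL_SAMPLES_LOWER_BOUND = 4_000_000

total = TEXT_TO_IMAGE_SAMPLES + EDITING_SAMPLES
ratio = total / PRIOR_MODEL_SAMPLES_LOWER_BOUND

print(f"ShareGPT-4o-Image total: {total:,} samples")
print(f"Share of a 4M-sample dataset: {ratio:.1%}")
```

In other words, Janus-4o was fine-tuned on at most roughly 2.3% of the data used by those prior models.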

In addition, human evaluation experiments showed a clear preference for Janus-4o over Janus-Pro and UltraEdit in terms of the visual appeal and instruction fidelity of the generated images. These results support the high quality of ShareGPT-4o-Image and its effectiveness as training data.

nakata
