
Skywork UniPic: Next-generation Multimodal Model That Integrates Image Understanding, Generation, And Editing With High Efficiency
3 main points
✔️ Proposes a highly efficient 1.5B-parameter multimodal model that integrates image understanding, generation, and editing
✔️ Decoupled encoding strategy that pairs a MAR encoder for image quality with a SigLIP2 encoder for semantic understanding
✔️ Achieves strong scores such as GenEval 0.86 and GEdit-Bench 5.83, and enables high-resolution generation even on commodity GPUs
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation
written by Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, Yahui Zhou
(Submitted on 5 Aug 2025)
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
The paper proposes Skywork UniPic, a 1.5B-parameter autoregressive model that integrates image understanding, text-to-image generation, and image editing in a single architecture.
Traditionally, many multimodal AI systems handle understanding, generation, and editing with separate models and adapters, which fragments performance across tasks and raises inference costs.
UniPic employs a "decoupled encoding strategy" that connects a generation-focused Masked Autoregressive (MAR) encoder and an understanding-focused SigLIP2 encoder to a shared LLM backbone, enabling both task-specific optimization and mutual knowledge transfer. It further combines a 100M-scale high-quality dataset, reward-model-based data quality control, and progressive resolution training from 256 to 1024 pixels to achieve high-definition image generation on commodity GPUs such as the RTX 4090.
Evaluations show strong results of 0.86 on GenEval, 85.5 on DPG-Bench, and 5.83 on GEdit-Bench, making it competitive with existing unified models of similar or larger size.
Proposed Methodology
The main feature of Skywork UniPic is its "decoupled encoding strategy", which integrates image understanding, generation, and editing efficiently in a single model.
The generation task uses a MAR encoder-decoder to support high-resolution synthesis while maintaining pixel-level fidelity.
The understanding task, by contrast, employs a SigLIP2 encoder to extract semantically rich features.
Both encoders are connected to a 1.5B-parameter Qwen2.5 backbone through separate MLP projection layers, so the tasks share knowledge within one unified autoregressive process.
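The routing described above can be sketched in a few lines of plain Python. This is a toy illustration only, not the paper's implementation: the dimensions, the `MLPProjector` class, and the `to_llm_space` helper are all hypothetical names and sizes chosen for readability, and random vectors stand in for real MAR and SigLIP2 features.

```python
import random

# Toy sizes chosen only for illustration (assumptions, not the model's real dimensions).
MAR_DIM, SIGLIP_DIM, LLM_DIM, HIDDEN = 6, 4, 8, 5


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class MLPProjector:
    """Hypothetical two-layer MLP: linear -> ReLU -> linear."""

    def __init__(self, in_dim, out_dim, rng):
        self.w1 = make_matrix(HIDDEN, in_dim, rng)
        self.w2 = make_matrix(out_dim, HIDDEN, rng)

    def __call__(self, token):
        hidden = [max(0.0, h) for h in matvec(self.w1, token)]
        return matvec(self.w2, hidden)


rng = random.Random(0)

# One projector per encoder: the weights are separate, the target space is shared.
projectors = {
    "generation": MLPProjector(MAR_DIM, LLM_DIM, rng),        # MAR features
    "understanding": MLPProjector(SIGLIP_DIM, LLM_DIM, rng),  # SigLIP2 features
}


def to_llm_space(task, encoder_tokens):
    """Map encoder tokens into the backbone's embedding space for the given task."""
    projector = projectors[task]
    return [projector(tok) for tok in encoder_tokens]


# Random stand-ins for encoder outputs (3 MAR tokens, 2 SigLIP2 tokens).
mar_tokens = [[rng.random() for _ in range(MAR_DIM)] for _ in range(3)]
siglip_tokens = [[rng.random() for _ in range(SIGLIP_DIM)] for _ in range(2)]

gen_embeds = to_llm_space("generation", mar_tokens)
und_embeds = to_llm_space("understanding", siglip_tokens)

print(len(gen_embeds), len(gen_embeds[0]))  # 3 8
print(len(und_embeds), len(und_embeds[0]))  # 2 8
```

The point of the sketch is that the two encoders can have different feature widths and independently trained projections, yet both end up as tokens of the same width that a single backbone can consume.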
Training follows a four-stage curriculum: MAR pre-training, MAR-LLM alignment, joint optimization across tasks, and supervised fine-tuning (SFT) that makes use of a reward model.
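As a way to picture the curriculum, the four stages can be written down as an ordered configuration. The sketch below is only an illustration: the `Stage` class, the stage identifiers, and the goal strings are paraphrases invented here, and the doubling resolution schedule is an assumption, since the article states only that resolution grows gradually from 256 to 1024 pixels and does not say which stage uses which resolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the training curriculum (hypothetical structure)."""
    name: str
    goal: str


# Stage order follows the article; names and goal strings are paraphrases.
CURRICULUM = (
    Stage("mar_pretraining", "train the MAR encoder-decoder for generation"),
    Stage("mar_llm_alignment", "align MAR features with the LLM backbone"),
    Stage("joint_optimization", "optimize understanding, generation, and editing together"),
    Stage("reward_guided_sft", "supervised fine-tuning that uses a reward model"),
)


def resolution_schedule(start=256, end=1024):
    """Return a doubling schedule from start to end pixels.

    Doubling is an assumption made for this sketch; the article gives only
    the endpoints (256 and 1024).
    """
    sizes = []
    size = start
    while size <= end:
        sizes.append(size)
        size *= 2
    return sizes


for index, stage in enumerate(CURRICULUM, start=1):
    print(f"Stage {index}: {stage.name} -> {stage.goal}")

print(resolution_schedule())  # [256, 512, 1024]
```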
For data quality assurance, two reward models are used: Skywork-ImgReward, trained with GRPO, and Skywork-EditReward, which specializes in editing accuracy. They filter out low-quality samples while building datasets that cover a wide variety of editing and generation scenarios.
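Reward-based filtering of this kind can be sketched as a simple threshold pass over candidate samples. Everything concrete below is hypothetical: `filter_by_reward`, the `stub_reward` scorer, the sample fields, and the 0.5 threshold are placeholders for illustration, whereas the real reward models are learned networks whose interfaces and cutoffs the article does not describe.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]


def filter_by_reward(
    samples: List[Sample],
    score_fn: Callable[[Sample], float],
    threshold: float,
) -> List[Sample]:
    """Keep only the samples whose reward score reaches the threshold."""
    return [sample for sample in samples if score_fn(sample) >= threshold]


def stub_reward(sample: Sample) -> float:
    """Stand-in scorer: reads a precomputed score instead of running a model."""
    return float(sample["score"])


# Hypothetical candidate pool with made-up scores.
candidates = [
    {"prompt": "a red cube on a table", "score": 0.91},
    {"prompt": "blurry, off-topic image", "score": 0.12},
    {"prompt": "two cats on a sofa", "score": 0.67},
]

kept = filter_by_reward(candidates, stub_reward, threshold=0.5)
print([sample["prompt"] for sample in kept])
# ['a red cube on a table', 'two cats on a sofa']
```

Passing the scorer in as a function keeps the filter independent of any particular reward model, so a generation scorer and an editing scorer could be swapped in without changing the filtering logic.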
Experiments
Experiments were conducted in three domains: image generation, image editing, and image understanding.
For the generation task, the authors used GenEval (compositional understanding) and DPG-Bench (long-prompt instruction following); UniPic scored 0.86 on GenEval and 85.5 on DPG-Bench.
It showed particularly high accuracy in single-object generation, multi-object composition, and positional understanding.
For editing tasks, it scored 5.83 on GEdit-Bench and 3.49 on ImgEdit-Bench, performing especially well in specific categories such as action editing and style modification.
Comparisons covered unified models such as OmniGen2 and BAGEL as well as specialized editing models such as ICEdit and Step1X-Edit, and UniPic remained competitive despite its small parameter count.
Furthermore, it generated 1024 x 1024 images on an RTX 4090 using less than 15 GB of GPU memory, confirming its usefulness as a unified multimodal foundation model that balances performance, efficiency, and versatility.
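A rough back-of-the-envelope calculation helps put the memory figure in context. The sketch below rests on assumptions the article does not confirm: it counts only the 1.5B parameters quoted above, assumes 16-bit weights (2 bytes per parameter), and treats everything else (activations, caches, decoder buffers) as a single remainder. The 24 GB figure is the RTX 4090's memory capacity.

```python
# Assumptions for this estimate (not stated in the article):
# 16-bit weights, and only the quoted 1.5B parameters are counted.
PARAMS = 1_500_000_000
BYTES_PER_PARAM = 2        # fp16 / bf16
REPORTED_PEAK_GB = 15      # upper bound quoted in the article
RTX_4090_MEMORY_GB = 24    # memory capacity of the card

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
other_gb = REPORTED_PEAK_GB - weights_gb
unused_gb = RTX_4090_MEMORY_GB - REPORTED_PEAK_GB

print(f"weights: {weights_gb:.1f} GB")                      # weights: 3.0 GB
print(f"budget left for everything else: {other_gb:.1f} GB")  # 12.0 GB
print(f"unused on a 24 GB card: {unused_gb} GB")            # 9 GB
```

Under these assumptions the weights account for only a small share of the reported peak, which is consistent with the claim that 1024 x 1024 generation fits comfortably on a single consumer card.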