

New Method "USO" Combines Disentangled Learning and Reward Learning: The Frontier of Image Generation Unifying Style and Subject

3 main points
✔️ Proposes the USO model and a framework for generating triplet data, handling style-driven and subject-driven generation in a unified manner
✔️ Combines style-alignment learning, content-style disentanglement learning, and style reward learning to achieve highly accurate generation
✔️ Validated on the newly constructed USO-Bench benchmark, outperforming conventional methods in both style fidelity and subject consistency

USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning
written by Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian He
(Submitted on 26 Aug 2025)
Comments: Project page: this https URL Code and model: this https URL

Subjects:  Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

In recent years, "style-driven generation" and "subject-driven generation" have been studied as separate problems in image generation.
The former emphasizes reproducing the style of a reference image, while the latter focuses on maintaining the consistency of a person or object, and the two have generally been treated as opposing goals.

However, this paper argues that it is possible to treat them in a unified manner.
The reason is that both problems ultimately reduce to the same task: separating and recombining "content" and "style."

Therefore, the authors propose the USO (Unified Style-Subject Optimized) model.
USO builds a large triplet dataset (content image, style image, and style-applied image) and combines style-alignment learning with content-style disentanglement learning.

In addition, style reward learning (SRL) is introduced to enhance style fidelity.
The authors also constructed a new benchmark, USO-Bench, to evaluate style similarity and subject consistency simultaneously.

Experimental results report that USO outperforms conventional methods and achieves state-of-the-art performance in both style and subject consistency.

Proposed Methodology

The central idea of USO is to learn the style-driven and subject-driven tasks simultaneously as "complementary tasks."

First, the authors constructed a Cross-Task Triplet Curation Framework.
This framework automatically generates data triplets (style reference image, subject reference image, and stylized result image) using a stylization-specialized model and a de-stylization model.
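The curation pipeline can be sketched as follows. This is a minimal illustration of the triplet idea only: the functions `stylize` and `destylize` are hypothetical stand-ins for the paper's stylization-specialized and de-stylization models, not a real API.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    subject_image: str    # subject (content) reference, style removed
    style_image: str      # style reference
    stylized_image: str   # subject rendered in the reference style

def stylize(content: str, style: str) -> str:
    """Placeholder: a stylization model applies `style` to `content`."""
    return f"stylized({content},{style})"

def destylize(image: str) -> str:
    """Placeholder: a de-stylization model recovers the plain subject."""
    return f"destylized({image})"

def curate_triplet(raw_image: str, style_ref: str) -> Triplet:
    stylized = stylize(raw_image, style_ref)
    subject = destylize(stylized)  # strip the style back out of the result
    return Triplet(subject_image=subject,
                   style_image=style_ref,
                   stylized_image=stylized)
```

The point of the round trip is that each triplet pairs the same underlying subject in both a "plain" and a "stylized" form, which is what lets the model later learn to separate the two factors.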

Next, the Unified Customization Framework (USO) is introduced.
Learning is done in two phases.

In the first stage, Style Alignment Training uses SigLIP encoders and a hierarchical projector to extract style features accurately.
In the second stage, content and style images are fed into separate encoders for Content-Style Disentanglement Training, which prevents unwanted feature contamination between the two.
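The conditioning structure behind the second stage can be sketched in a few lines. Everything here is a toy stand-in: `encode` fakes an image encoder (such as SigLIP or a content encoder) with a deterministic random vector, and `hierarchical_project` is reduced to a single linear map for illustration.

```python
import zlib
import numpy as np

DIM = 8

def encode(image_id: str, branch: str) -> np.ndarray:
    """Toy deterministic 'encoder' (stand-in for SigLIP or a content encoder)."""
    seed = zlib.crc32(f"{branch}:{image_id}".encode())
    return np.random.default_rng(seed).normal(size=DIM)

def hierarchical_project(style_feats: np.ndarray) -> np.ndarray:
    """Stand-in for the hierarchical projector: a single fixed linear map here."""
    W = np.eye(DIM)  # identity, for illustration only
    return W @ style_feats

def build_conditioning(content_img: str, style_img: str) -> np.ndarray:
    # Disentanglement idea: content and style pass through SEPARATE encoders,
    # so the style branch cannot leak subject details and vice versa.
    content_tokens = encode(content_img, "content")
    style_tokens = hierarchical_project(encode(style_img, "style"))
    return np.concatenate([content_tokens, style_tokens])
```

The design choice the sketch highlights is routing: because the two branches never share an encoder, the generator receives style and subject information in disjoint slots of the conditioning vector.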

In addition, Style Reward Learning (SRL) is introduced, which uses how closely the generated result matches the reference style as a reward signal.
This allows the model to increase both style fidelity and subject consistency at the same time.
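The reward-learning idea can be caricatured as follows. Cosine similarity stands in for the paper's style reward, and the "finetuning" loop simply moves generated features toward the reference; this is an assumption-laden sketch of the mechanism, not the authors' training procedure.

```python
import numpy as np

def style_reward(gen_feats: np.ndarray, ref_feats: np.ndarray) -> float:
    """Toy reward: cosine similarity between generated and reference
    style features (a stand-in for a learned style metric)."""
    return float(gen_feats @ ref_feats /
                 (np.linalg.norm(gen_feats) * np.linalg.norm(ref_feats)))

def reward_finetune(gen_feats: np.ndarray, ref_feats: np.ndarray,
                    lr: float = 0.1, steps: int = 50) -> np.ndarray:
    """Caricature of reward learning: repeatedly nudge the 'generated'
    features in the direction that increases the style reward."""
    for _ in range(steps):
        gen_feats = gen_feats + lr * (ref_feats - gen_feats)
    return gen_feats
```

After a few dozen steps the reward approaches its maximum, which is the intended effect: training pressure that directly optimizes style similarity rather than only a reconstruction loss.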

Experiments

To verify the effectiveness of the proposed method, the authors conducted large-scale experiments using the newly constructed USO-Bench and the existing DreamBench.

USO-Bench is a benchmark that combines 50 types of content images and 50 types of style images and can comprehensively evaluate subject-driven, style-driven, and both-integrated tasks.
The evaluation metrics were CLIP-I and DINO for subject consistency, CSD for style similarity, and CLIP-T for text-image alignment.
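All four metrics reduce to a cosine similarity in some embedding space, which can be shown with dummy vectors. In practice the embeddings would come from CLIP, DINO, or a CSD-style encoder; the vectors below are placeholders.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(gen_subject, ref_subject, gen_style, ref_style, gen_text, ref_text):
    """Benchmark-style scoring: each metric is a cosine similarity
    in its own embedding space."""
    return {
        "CLIP-I": cosine(gen_subject, ref_subject),  # subject consistency
        "CSD":    cosine(gen_style, ref_style),      # style similarity
        "CLIP-T": cosine(gen_text, ref_text),        # prompt adherence
    }

# Dummy embeddings: identical subject/text vectors, opposite style vector.
v = np.array([1.0, 0.2, 0.0])
metrics = score(v, v, v, -v, v, v)
```

A score of 1.0 means the generated embedding matches the reference perfectly in that space, and -1.0 means it points in the opposite direction.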

Experimental results showed that USO outperformed existing methods in both subject-driven and style-driven tasks, and showed significant performance improvement, especially in complex tasks where style and subject are handled simultaneously.
In the quantitative evaluation, CSD and CLIP-T scores were the highest, and in the qualitative evaluation, USO faithfully reproduced a variety of painting styles while preserving the appearance of the subject.

Furthermore, ablation experiments confirmed that style reward learning and hierarchical projection contributed significantly to the performance improvement.
Overall, it can be concluded that USO is a state-of-the-art unified generative model that achieves both style fidelity and subject consistency.

