

New Method "USO" Combines Disentangled Learning and Reward Learning: The Frontier of Image Generation Unifying Style and Subject

3 main points
✔️ Proposes the USO model and a framework for generating triplet data, handling style-driven and subject-driven generation in a unified manner
✔️ Combines style-alignment learning, content-style disentanglement learning, and style reward learning to achieve highly accurate generation
✔️ Validated on the newly constructed USO-Bench benchmark, outperforming conventional methods in both style fidelity and subject consistency

USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning
written by Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian He
(Submitted on 26 Aug 2025)
Comments: Project page: this https URL Code and model: this https URL

Subjects:  Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

In recent years, "style-driven generation" and "subject-driven generation" have been studied as separate problems in image generation.
The former emphasizes reproducing the style of a reference image, while the latter focuses on maintaining the consistency of a person or object, and the two have generally been treated as opposing goals.

However, this paper argues that it is possible to treat them in a unified manner.
The reason is that both problems ultimately reduce to the same task: separating and recombining "content" and "style."

Therefore, the authors propose the USO (Unified Style-Subject Optimized) model.
USO builds a large triplet dataset (content image, style image, and style-applied image) and combines style-alignment learning with content-style disentanglement learning.

In addition, style reward learning (SRL) is introduced to enhance style fidelity.
The authors also constructed a new benchmark, USO-Bench, to evaluate style similarity and subject consistency simultaneously.

Experimental results report that USO outperforms conventional methods and achieves state-of-the-art performance in both style and subject consistency.

Proposed Methodology

The central idea of USO is to learn the style-driven and subject-driven tasks simultaneously as "complementary tasks."

First, the authors constructed a Cross-Task Triplet Curation Framework.
This framework automatically generates data triplets (style reference image, subject reference image, and stylized result image) using a stylization-specialized model and a de-stylization model.
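The curation pipeline can be sketched as follows. This is a minimal illustration of the triplet idea only: the functions `stylize` and `destylize` are hypothetical stand-ins for the paper's stylization-specialized and de-stylization models, not a real API.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    subject_image: str    # subject (content) reference, style removed
    style_image: str      # style reference
    stylized_image: str   # subject rendered in the reference style

def stylize(content: str, style: str) -> str:
    """Placeholder: a stylization model applies `style` to `content`."""
    return f"stylized({content},{style})"

def destylize(image: str) -> str:
    """Placeholder: a de-stylization model recovers the plain subject."""
    return f"destylized({image})"

def curate_triplet(raw_image: str, style_ref: str) -> Triplet:
    stylized = stylize(raw_image, style_ref)
    subject = destylize(stylized)  # strip the style back out of the result
    return Triplet(subject_image=subject,
                   style_image=style_ref,
                   stylized_image=stylized)
```

The point of the round trip is that each triplet pairs the same underlying subject in both a "plain" and a "stylized" form, which is what lets the model later learn to separate the two factors.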

Next, the Unified Customization Framework (USO) is introduced.
Learning is done in two phases.

In the first stage, Style Alignment Training uses SigLIP encoders and a hierarchical projector to extract style features accurately.
In the second stage, content and style images are fed into separate encoders for Content-Style Disentanglement Training, which prevents unwanted feature contamination between the two.
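The conditioning structure behind the second stage can be sketched in a few lines. Everything here is a toy stand-in: `encode` fakes an image encoder (such as SigLIP or a content encoder) with a deterministic random vector, and `hierarchical_project` is reduced to a single linear map for illustration.

```python
import zlib
import numpy as np

DIM = 8

def encode(image_id: str, branch: str) -> np.ndarray:
    """Toy deterministic 'encoder' (stand-in for SigLIP or a content encoder)."""
    seed = zlib.crc32(f"{branch}:{image_id}".encode())
    return np.random.default_rng(seed).normal(size=DIM)

def hierarchical_project(style_feats: np.ndarray) -> np.ndarray:
    """Stand-in for the hierarchical projector: a single fixed linear map here."""
    W = np.eye(DIM)  # identity, for illustration only
    return W @ style_feats

def build_conditioning(content_img: str, style_img: str) -> np.ndarray:
    # Disentanglement idea: content and style pass through SEPARATE encoders,
    # so the style branch cannot leak subject details and vice versa.
    content_tokens = encode(content_img, "content")
    style_tokens = hierarchical_project(encode(style_img, "style"))
    return np.concatenate([content_tokens, style_tokens])
```

The design choice the sketch highlights is routing: because the two branches never share an encoder, the generator receives style and subject information in disjoint slots of the conditioning vector.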

In addition, Style Reward Learning (SRL) is introduced, which uses how closely the generated result matches the reference style as a reward signal.
This allows the model to increase both style fidelity and subject consistency at the same time.
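The reward-learning idea can be caricatured as follows. Cosine similarity stands in for the paper's style reward, and the "finetuning" loop simply moves generated features toward the reference; this is an assumption-laden sketch of the mechanism, not the authors' training procedure.

```python
import numpy as np

def style_reward(gen_feats: np.ndarray, ref_feats: np.ndarray) -> float:
    """Toy reward: cosine similarity between generated and reference
    style features (a stand-in for a learned style metric)."""
    return float(gen_feats @ ref_feats /
                 (np.linalg.norm(gen_feats) * np.linalg.norm(ref_feats)))

def reward_finetune(gen_feats: np.ndarray, ref_feats: np.ndarray,
                    lr: float = 0.1, steps: int = 50) -> np.ndarray:
    """Caricature of reward learning: repeatedly nudge the 'generated'
    features in the direction that increases the style reward."""
    for _ in range(steps):
        gen_feats = gen_feats + lr * (ref_feats - gen_feats)
    return gen_feats
```

After a few dozen steps the reward approaches its maximum, which is the intended effect: training pressure that directly optimizes style similarity rather than only a reconstruction loss.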

Experiments

To verify the effectiveness of the proposed method, the authors conducted large-scale experiments using the newly constructed USO-Bench and the existing DreamBench.

USO-Bench is a benchmark that combines 50 types of content images and 50 types of style images and can comprehensively evaluate subject-driven, style-driven, and both-integrated tasks.
The evaluation metrics were CLIP-I and DINO for subject consistency, CSD for style similarity, and CLIP-T for text-image alignment.
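All four metrics reduce to a cosine similarity in some embedding space, which can be shown with dummy vectors. In practice the embeddings would come from CLIP, DINO, or a CSD-style encoder; the vectors below are placeholders.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(gen_subject, ref_subject, gen_style, ref_style, gen_text, ref_text):
    """Benchmark-style scoring: each metric is a cosine similarity
    in its own embedding space."""
    return {
        "CLIP-I": cosine(gen_subject, ref_subject),  # subject consistency
        "CSD":    cosine(gen_style, ref_style),      # style similarity
        "CLIP-T": cosine(gen_text, ref_text),        # prompt adherence
    }

# Dummy embeddings: identical subject/text vectors, opposite style vector.
v = np.array([1.0, 0.2, 0.0])
metrics = score(v, v, v, -v, v, v)
```

A score of 1.0 means the generated embedding matches the reference perfectly in that space, and -1.0 means it points in the opposite direction.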

Experimental results showed that USO outperformed existing methods in both subject-driven and style-driven tasks, and showed significant performance improvement, especially in complex tasks where style and subject are handled simultaneously.
In the quantitative evaluation, CSD and CLIP-T scores were the highest, and in the qualitative evaluation, USO faithfully reproduced a variety of painting styles while preserving the appearance of the subject.

Furthermore, ablation experiments confirmed that style reward learning and hierarchical projection contributed significantly to the performance improvement.
Overall, it can be concluded that USO is a state-of-the-art unified generative model that achieves both style fidelity and subject consistency.

