LAMIC: A Learning-free, Layout-controllable, Multi-reference Image Generation Method

28/08/2025

3 main points
✔️ LAMIC combines multiple reference images with layout control to generate images without any training
✔️ Group Isolation Attention and Region-Modulated Attention prevent interference between entities and enable precise placement control
✔️ Experimental results show superiority over existing methods in terms of ID retention, background consistency, and layout accuracy

LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer
written by Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang
(Submitted on 1 Aug 2025)
Comments: 8 pages, 5 figures, 3 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper proposes "LAMIC," a new method for controllable image generation that composes multiple reference images into a single high-quality image according to layout information.

Conventional diffusion models are strong at generation from a single reference image, but when dealing with multiple references they suffer from issues such as "inconsistent identity" and "layout collapse."
In addition, many existing methods require additional training and large datasets, which limits their versatility and scalability.

LAMIC, built on the Multimodal Diffusion Transformer (MMDiT), is a training-free, zero-shot method that generates images from a combination of multiple reference images, text, and region specifications (bounding boxes and masks).
In particular, the authors introduce entity separation using Group Isolation Attention (GIA) and layout control using Region-Modulated Attention (RMA) to faithfully reproduce the spatial layout while preventing semantic confusion.

In evaluation experiments, LAMIC outperformed existing methods in terms of identity preservation, background consistency, and layout accuracy, and performed particularly well with multiple references and complex compositions.
With an efficient framework that requires no additional training, this research shows great potential for real-world applications such as video production and narrative generation.

Proposed Methodology

The central idea of LAMIC is to construct a token representation that integrates reference images, textual descriptions, and layout information, which is then input to MMDiT to enable consistent synthesis of multi-reference images.

First, each reference is defined as a VTS triplet consisting of visual (V), textual (T), and spatial (S) elements. On top of these triplets, the input also describes the relationships among entities (Cross-Entity Interaction, CEI) and the uncontrolled regions (U).
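As a rough illustration of this input structure, the sketch below models a VTS triplet and a composition request as plain Python dataclasses. All class, field, and file names are hypothetical and chosen for this example; the paper's actual token construction is not reproduced here. The helper lists the grid cells that fall outside every entity region, which is one simple way to picture the uncontrolled region U.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed convention: (x0, y0, x1, y1), normalized to [0, 1].
BBox = Tuple[float, float, float, float]

@dataclass
class VTS:
    visual: str    # V: path or id of the reference image (placeholder type)
    textual: str   # T: short description of the entity
    spatial: BBox  # S: target region for the entity

@dataclass
class Composition:
    entities: List[VTS]
    cei: str = ""  # CEI: text describing relationships among entities

def uncontrolled_cells(comp: Composition, grid: int = 4) -> List[Tuple[int, int]]:
    """Return (row, col) grid cells whose centre lies in no entity region (U)."""
    free = []
    for r in range(grid):
        for c in range(grid):
            cx, cy = (c + 0.5) / grid, (r + 0.5) / grid
            covered = any(
                x0 <= cx < x1 and y0 <= cy < y1
                for (x0, y0, x1, y1) in (e.spatial for e in comp.entities)
            )
            if not covered:
                free.append((r, c))
    return free

# Hypothetical request: a woman on the left half, a dog in the lower right.
comp = Composition(
    entities=[
        VTS("woman.png", "a woman in a red coat", (0.0, 0.0, 0.5, 1.0)),
        VTS("dog.png", "a small white dog", (0.5, 0.5, 1.0, 1.0)),
    ],
    cei="the woman is walking the dog",
)
print(uncontrolled_cells(comp))  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```

In this example the upper-right quadrant is covered by no entity, so its four cells are the uncontrolled part of the canvas.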

These elements are then converted into a single integrated token sequence and processed as one consistent representation within MMDiT.
In doing so, Group Isolation Attention (GIA) blocks unnecessary interference between VTS groups, preventing features of different entities from mixing.
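The isolation idea can be pictured as a boolean attention mask. The following is a minimal sketch, assuming that every token carries a group id, that only pairs of tokens from two different VTS groups are blocked, and that shared tokens (marked `None` here) stay visible to all tokens. The exact masking rule used in LAMIC may differ, and the function name is hypothetical.

```python
from typing import List, Optional

def group_isolation_mask(group_ids: List[Optional[int]]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    group_ids[i] is the VTS group of token i, or None for shared tokens.
    Assumption: only pairs from two *different* VTS groups are blocked.
    """
    n = len(group_ids)
    mask = [[True] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            gq, gk = group_ids[q], group_ids[k]
            if gq is not None and gk is not None and gq != gk:
                mask[q][k] = False
    return mask

# Two tokens for entity 0, two for entity 1, two shared tokens.
ids = [0, 0, 1, 1, None, None]
mask = group_isolation_mask(ids)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In an attention layer, a mask like this would typically be applied by setting the blocked logits to negative infinity before the softmax.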

In addition, Region-Modulated Attention (RMA) is applied in the early stages of generation to keep each spatial region independent, while the regions are integrated at a later stage.
This enables accurate reproduction of entity placement and background consistency, and can accommodate complex layouts.
Importantly, the method requires no additional training or fine-tuning and extends directly from existing single-reference models, making it both efficient and versatile in practice.
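One way to picture the region-wise independence is a mask over the latent image tokens. This sketch assumes a square grid of latent tokens in row-major order, assigns each token to the first region containing its centre, and, while the mask is active, lets a token attend only to tokens with the same region label. These rules and all names are illustrative assumptions, not the paper's exact formulation.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) in [0, 1]

def region_of(cx: float, cy: float, regions: List[BBox]) -> Optional[int]:
    """Index of the first region containing the point, or None if uncontrolled."""
    for i, (x0, y0, x1, y1) in enumerate(regions):
        if x0 <= cx < x1 and y0 <= cy < y1:
            return i
    return None

def region_modulated_mask(regions: List[BBox], grid: int, active: bool) -> List[List[bool]]:
    """Attention mask over grid*grid latent tokens (row-major order).

    active=True  : tokens only see tokens with the same region label (early steps).
    active=False : full attention, so regions can blend (later steps).
    """
    labels = [
        region_of((c + 0.5) / grid, (r + 0.5) / grid, regions)
        for r in range(grid)
        for c in range(grid)
    ]
    n = len(labels)
    if not active:
        return [[True] * n for _ in range(n)]
    return [[labels[q] == labels[k] for k in range(n)] for q in range(n)]

# Left half and right half of a 2x2 latent grid.
regions = [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)]
early = region_modulated_mask(regions, grid=2, active=True)
late = region_modulated_mask(regions, grid=2, active=False)
```

With `active=True`, token 0 (top left) can see token 2 (bottom left) but not token 1 (top right); with `active=False`, every token sees every other token.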

Experiments

To evaluate the proposed method LAMIC, the authors extended the existing XVerseBench dataset with a variety of reference images (people, animals, objects, clothing, and scenes) and corresponding layout information.

In the experimental setting, two, three, and four reference images were used as input and performance was compared on several metrics, including ID retention (ID-S), background similarity (BG-S), appearance consistency (IP-S), and aesthetic evaluation (AES).
In addition, the newly proposed Inclusion Ratio (IN-R) and Fill Ratio (FI-R) were used to quantitatively evaluate how well the generated images adhered to the layout instructions.
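This article does not give the formulas for IN-R and FI-R. As one plausible reading of the names, the sketch below assumes that IN-R is the share of a detected entity box that lies inside its target region, and that FI-R is the share of the target region covered by the detected box. Both definitions and all function names are assumptions for illustration and should be checked against the paper.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)

def area(b: BBox) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a: BBox, b: BBox) -> float:
    """Area of the overlap between two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)

def inclusion_ratio(detected: BBox, target: BBox) -> float:
    """Assumed IN-R: fraction of the detected box that lies inside the target."""
    return intersection(detected, target) / area(detected) if area(detected) else 0.0

def fill_ratio(detected: BBox, target: BBox) -> float:
    """Assumed FI-R: fraction of the target region covered by the detected box."""
    return intersection(detected, target) / area(target) if area(target) else 0.0

target = (0.0, 0.0, 0.5, 1.0)            # requested region: left half
detected = (0.125, 0.25, 0.375, 0.75)    # hypothetical detector output
print(inclusion_ratio(detected, target))  # 1.0
print(fill_ratio(detected, target))       # 0.25
```

Under these assumed definitions, an entity that stays inside its region but is drawn too small scores high on inclusion and low on fill, which is why two complementary ratios are useful.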
The results showed that LAMIC outperformed the existing methods in all settings in terms of average score (AVG), especially in ID retention and background consistency.

Even in the more difficult three- and four-reference tasks, LAMIC achieved an average improvement of 4 to 8 points over the existing methods.
In addition, ablation experiments showed that removing GIA and RMA significantly reduced performance, confirming the effectiveness of both mechanisms.
Furthermore, adjusting the RMA application ratio revealed a tradeoff between layout accuracy and overall visual smoothness, and the authors conclude that a ratio of 0.05 is optimal.
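To make the ratio concrete, here is a small sketch under the assumption that it denotes the fraction of denoising steps, counted from the start, on which RMA is active. The article does not spell out this definition, and the step count of 100 is a hypothetical value chosen for the example.

```python
from typing import List

def rma_steps(num_steps: int, ratio: float) -> List[int]:
    """Indices of denoising steps (0 = first) on which RMA would be active.

    Assumption: step t uses RMA when t / num_steps < ratio.
    """
    return [t for t in range(num_steps) if t / num_steps < ratio]

early = rma_steps(100, 0.05)
print(early)  # [0, 1, 2, 3, 4]
```

Under this reading, a ratio of 0.05 confines the region-wise restriction to the first few steps, which fix the coarse layout, and leaves the remaining steps free to blend the regions.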

Overall, LAMIC was shown to achieve state-of-the-art performance without additional training and could become the new standard in multi-reference image synthesis.

nakata
