
PosterLlama: Leveraging the Design Ability of Language Models for Content-Aware Layout Generation


3 main points
✔️ Visual layout in advertising, posters, web UI design, etc. is critical, and traditional methods often miss semantic details
✔️ PosterLlama represents layouts as HTML and leverages the design knowledge of large language models, aiming to generate visually and textually consistent layouts
✔️ Experimental results show that PosterLlama outperforms existing methods and is a versatile tool for a wide variety of conditions

PosterLlama: Bridging Design Ability of Langauge Model to Content-Aware Layout Generation
written by Jaejung Seol, Seojun Kim, Jaejun Yoo
(Submitted on 1 Apr 2024 (v1), last revised 28 Jul 2024 (this version, v3))
Comments: ECCV 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Layout is critical in graphic design to effectively position elements such as logos and text to attract visual attention and convey information. It is essential for applications such as web UI, posters, document typesetting, area-controlled image generation and layout-guided video generation. Layout generation models offer the potential for cost savings by reducing manual labor and increasing aesthetic and functional efficiency.

In layout generation, it is important to ensure readability and visual balance of text; following ContentGAN, prior models such as CGL-GAN, DS-GAN, and RADM have improved layout generation by incorporating visual and textual content. However, these methods tend to treat layout elements as simple numbers and do not adequately capture semantic relationships.

Recent models such as LayoutPrompter, LayoutGPT, and LayoutNUWA can generate high-quality layouts using language models, but struggle with fine-grained visual content. The paper introduced here proposes PosterLlama, a model that integrates visual and textual content to generate poster layouts. It translates layout elements into HTML code and leverages the design knowledge of the language model. A two-step training process connects the visual encoder to the LLM and trains the model to generate HTML sequences.

To address the challenges of the dataset, we also propose a data augmentation technique that focuses on salient objects within the poster. We will also introduce a pipeline for generating advertising posters using a scene text generation module.

PosterLlama achieves state-of-the-art performance on almost all metrics; by leveraging the LLM's design knowledge, its outputs are nearly indistinguishable in quality from real layouts. PosterLlama is the first model that can handle all types of content-aware layout generation tasks and is expected to be useful in many situations.

Proposed Method

Input/Output Sequence Format

Layout Format

The goal of content-aware layout generation is to generate a layout based on a given content condition $C$. In poster layout generation, $C$ is defined as multimodal content such as the poster canvas and a text description. A layout is represented by $N$ elements $\{e_i\}_{i=1}^N$, where each element $e_i = (t_i, s_i, c_i)$ contains:

  • Bounding box position $t_i = (x_i, y_i)$
  • Size $s_i = (w_i, h_i)$
  • Category $c_i$.

In addition to content conditions, a subset of layout elements may also serve as constraints.
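As a minimal sketch of the element definition above, the snippet below models one layout element $e_i = (t_i, s_i, c_i)$ with coordinates normalized to the $[0, 1]$ canvas. The class and field names are our own illustrative choices, not identifiers from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class LayoutElement:
    # t_i: top-left position of the bounding box
    x: float
    y: float
    # s_i: size of the bounding box
    w: float
    h: float
    # c_i: element category, e.g. "logo", "text", "underlay"
    category: str

# A logo occupying the upper-left region of the canvas.
logo = LayoutElement(x=0.1, y=0.05, w=0.3, h=0.1, category="logo")
```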

HTML format

To leverage the extensive knowledge contained in LLM for layout generation, layout is represented in the form of HTML sequences. This approach allows us to leverage prior design knowledge embedded in LLM training data, such as web UI design, and provides a more powerful representation capability than expressing layout attributes as numbers.

Building on prior approaches, the model's input sequence is constructed from a template that combines a task definition, HTML formatting, and mask tokens to produce content-aware layouts. The template's components are as follows.

  1. Task Definition: Specifies the condition of the input sequence identified by {Task Condition} (e.g., {"according to the categories and image"} in the prior study Gen-IT).
  2. HTML formatting: encapsulates layout elements using HTML tags such as <rect> and takes advantage of the variety of tags that characterize Web UI layouts.
  3. Mask tokens: introduces <M> mask tokens to encourage the LLM to predict the masked attributes, facilitating conditional layout generation.

Since layout elements have no inherent order, fixing the order of mask tokens during training can easily lead to overfitting when data is limited or conditions are diverse. To address this, the layout order is randomly permuted while keeping the input and output elements synchronized.
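The random reordering described above can be sketched as follows: one permutation is applied to both the masked input elements and the ground-truth output so the two stay aligned. This is a minimal illustration with made-up element strings, not the paper's code:

```python
import random

def permute_in_sync(inputs, targets, seed=None):
    """Apply a single random permutation to both sequences so that
    input element i still corresponds to target element i."""
    assert len(inputs) == len(targets)
    rng = random.Random(seed)
    order = list(range(len(inputs)))
    rng.shuffle(order)
    return [inputs[i] for i in order], [targets[i] for i in order]

# Masked input constraints and their ground-truth completions.
masked = ["<rect data-category='logo' x='<M>'/>",
          "<rect data-category='text' x='<M>'/>"]
full   = ["<rect data-category='logo' x='12'/>",
          "<rect data-category='text' x='40'/>"]
shuffled_in, shuffled_out = permute_in_sync(masked, full, seed=0)
```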

In addition, the attributes of each element are discretized, as in previous work, for efficient training and a shorter overall token length. This allows the model to train effectively and generate high-quality content-aware layouts.
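As a concrete sketch of the two points above, the snippet below discretizes normalized coordinates into integer bins and serializes elements into an HTML-style `<rect>` sequence. The tag and attribute names, and the bin count of 128, are illustrative assumptions; the paper's exact template may differ:

```python
def discretize(v, num_bins=128):
    """Map a normalized coordinate in [0, 1] to an integer bin,
    shrinking the vocabulary needed to express positions."""
    return min(int(v * num_bins), num_bins - 1)

def to_html(elements, num_bins=128):
    """Serialize (category, x, y, w, h) tuples into an HTML-like layout."""
    rects = []
    for cat, x, y, w, h in elements:
        rects.append(
            f'<rect data-category="{cat}" x="{discretize(x, num_bins)}" '
            f'y="{discretize(y, num_bins)}" width="{discretize(w, num_bins)}" '
            f'height="{discretize(h, num_bins)}"/>'
        )
    return "<html><body>" + "".join(rects) + "</body></html>"

html = to_html([("logo", 0.1, 0.05, 0.3, 0.1),
                ("text", 0.1, 0.20, 0.8, 0.15)])
```

Masked variants of such sequences (with attributes replaced by `<M>`) then serve as the conditional inputs described above.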

Learning Methods

Figure 1: Overview of PosterLlama and Learning

The entire training process is shown in Figure 1. For poster layout generation, a two-step training approach is employed, inspired by Mini-GPT4's efficient visual question answering method and instruction tuning.

Phase 1: Adapter Training

  • Adapter tuning: a linear layer serves as an adapter to align the image encoder with the LLM. All other parts of the model are frozen, and only the adapter is trained.
  • Training data: an extensive collection of image-text pairs.
  • Image feature encoding: image features from the encoder are wrapped in an <img> token and combined with a text instruction: "<img><ImageFeature></img> Describe this image in detail."
  • Visual encoder: uses the recent visual embedding model DINOv2.

Phase 2: Layout generation fine tuning

  • Fixing the adapter: Fix the visual adapter and fine-tune the LLM.
  • Dataset in HTML format: Use a dataset in HTML format for layout generation (data format described in the previous section).
  • Preventing catastrophic forgetting: LoRA (Low-Rank Adaptation) is used to make fine-tuning efficient and to prevent catastrophic forgetting.
  • Objective function: cross-entropy loss
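The Phase 2 objective is the standard next-token cross-entropy over the HTML sequence. As a toy illustration of the computation for a single token position (not the training code; the 4-token distribution is made up):

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct next token."""
    return -math.log(probs[target_index])

# Suppose the model assigns these probabilities to 4 candidate tokens
# and the ground-truth next token is index 2.
probs = [0.1, 0.2, 0.6, 0.1]
loss = cross_entropy(probs, 2)  # -ln(0.6) ≈ 0.511
```

During training this quantity is averaged over all target tokens of the HTML sequence and minimized with respect to the LoRA parameters.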

This two-step approach leverages the model's visual and linguistic capabilities, ensuring effective alignment and fine-tuning for high-quality poster layout generation.

Data Augmentation

Although the performance of the generative model improves with diverse and rich data, poster data sets are limited in quantity, and copyright issues make it difficult to collect large data sets.

To address this, we propose a new poster data augmentation method using depth-based generation and top-k similarity selection. An overview is shown in Figure 2(a). The method leverages ControlNet-Depth, a generative model conditioned on text and depth maps. Captions are generated with BLIP-2, and depth maps are estimated with an off-the-shelf network.

Despite ControlNet's high quality synthesis, artifacts can occur in the diffusion-generated images, especially for prominent objects, affecting the correlation between layout and image canvas. To mitigate this, we use DreamSIM to select the top k samples from N generated samples (N = 10, k = 3) using a similarity measure sensitive to layout and semantic content.
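The top-k filtering step can be sketched as follows: given DreamSIM-style distances between the original poster and each of the N generated candidates (lower means more similar), keep the k closest. The distance values below are made up for illustration:

```python
def select_top_k(distances, k=3):
    """Return indices of the k candidates most similar to the original
    (smallest distance), ordered from most to least similar."""
    return sorted(range(len(distances)), key=lambda i: distances[i])[:k]

# Hypothetical DreamSIM distances for N = 10 generated candidates.
distances = [0.42, 0.18, 0.55, 0.23, 0.61, 0.19, 0.48, 0.33, 0.27, 0.52]
kept = select_top_k(distances, k=3)  # indices of the 3 closest candidates
```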

This process produces high-quality synthetic data with minimal changes while preserving composition and salient objects. Figure 2(b) shows augmented examples and illustrates the effectiveness of the method.

Figure 2: Overview and examples of the proposed data augmentation method

Experiment

Quantitative Evaluation

Table 1: Quantitative comparison with baselines on the content-aware layout generation task

This section compares the performance of the PosterLlama model with DS-GAN, LayoutPrompter, and RADM. These are all advanced layout generation methods. Eight different metrics are used in the evaluation.

Since the PKU dataset lacks text annotations, RADM is compared only on the CGL dataset. Table 1 summarizes the quantitative results on the annotated test split without user constraints; PosterLlama achieved the highest scores on five metrics on the CGL dataset and the second-highest on FD, Rea, and Occ. It also achieved the highest score on all metrics on the PKU dataset except FD.

Qualitative Evaluation

Figure 3: Qualitative evaluation results

This section provides a qualitative comparison of PosterLlama and the baseline methodology based on the details presented in Table 1 and Figure 3.

  • DS-GAN: Elements tend to be concentrated in the upper-left corner, often overlapping or poorly aligned. This is because empty (padding) elements are placed at the upper left as (0, 0, 0, 0).
  • Layout Prompter: Alignment is good, but content recognition is lacking, resulting in large overlaps.
  • RADM: Generates a structure similar to real data for all samples.
  • PosterLlama: Demonstrates the ability to generate appropriate and sensible layouts without overfitting real data.

Overall, PosterLlama can be seen to outperform the baseline method by generating layouts that are well aligned and content-aware, and avoiding common problems such as misalignment, overlap, and concealment.

Summary

This article introduced PosterLlama, a new method for visual- and textual-content-aware layout generation. An efficient visual question answering training method introduces visual recognition into the LLM, and layouts are processed in a code format well suited to language models. To overcome data shortages, a depth-guided augmentation using an off-the-shelf generative model is proposed, which mitigates inpainting artifacts and allows fair evaluation.

Extensive experiments show that PosterLlama outperforms existing approaches, handles diverse generation conditions by processing them in text format, and is robust against learning shortcuts caused by inpainting artifacts. Thanks to this robustness and its augmentation method, PosterLlama is highly effective on small datasets and adaptable to real-world applications.

