
Object Background Generation Using Text-2-Image Diffusion Model


Image Generation

3 main points
✔️ Background generation plays an important role in creative design, e-commerce, and other areas, including improving user experience and advertising efficiency.
✔️ When used for background generation, current text-guided inpainting models often extend the boundaries of the salient object and change its identity, a problem we call "object expansion".
✔️ This paper proposes a model that uses Stable Diffusion and the ControlNet architecture to adapt inpainting diffusion models to background generation, and finds that object expansion is reduced by a factor of 3.6 on average without compromising standard visual metrics across multiple datasets.

Salient Object-Aware Background Generation using Text-Guided Diffusion Models
written by Amir Erfan Eshratifar, Joao V. B. Soares, Kapil Thadani, Shaunak Mishra, Mikhail Kuznetsov, Yueh-Ning Ku, Paloma de Juan
(Submitted on 15 Apr 2024)
Comments: Accepted for publication at CVPR 2024's Generative Models for Computer Vision workshop

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In areas such as creative design and e-commerce, generating a background scene for an object is very important: it helps highlight the subject and provide context by placing it in a custom environment. This process is called "text-conditional outpainting": starting from an object on a blank background, the image content is extended outward to fill in the surrounding scene.

Popular text-guided inpainting models can be applied to outpainting by inverting the mask, but they are designed to fill in missing regions rather than to integrate the object into a scene. As a result, these models often extend the boundaries of the object, changing its identity. This problem is called "object expansion", and Figure 1 shows an example.

This article introduces a new model that uses the Stable Diffusion and ControlNet architectures to adapt an inpainting diffusion model to outpainting around the salient object.

Compared to Stable Diffusion 2.0 Inpainting, this approach reduces object expansion by an average of 3.6x without compromising standard visual quality metrics.

Figure 1: Example of object expansion by Stable Inpainting and results of the proposed method.

Proposed Method

ControlNet for Background Generation

The paper uses Stable Inpainting 2.0 (SI2) as the base model and adds a ControlNet on top of it to adapt it to the task of outpainting around salient objects. An overview of the entire model is shown in Figure 2.

As shown in Figure 2, all SI2 weights are frozen and only the ControlNet is trained. The inputs to the model are as follows.

  • Mask: a binary matrix with a value of 1 for object pixels and 0 elsewhere
  • Masked image: the input image with all non-object pixels set to 0
  • Prompt: Desired background description
  • Time: Current time step of the diffusion process
Figure 2: Overview of the proposed method
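
To make these inputs concrete, here is a minimal sketch of how they might be assembled from an image and a salient-object mask. The function and variable names are illustrative and not taken from the paper's code.

```python
import numpy as np

def make_inputs(image: np.ndarray, object_mask: np.ndarray, prompt: str, t: int) -> dict:
    """Assemble the four conditioning inputs described above.

    image: H x W x 3 RGB array; object_mask: H x W array, nonzero on the object.
    """
    mask = (object_mask > 0).astype(np.float32)   # 1 for object pixels, 0 elsewhere
    masked_image = image * mask[..., None]        # zero out every non-object pixel
    return {
        "mask": mask,                  # binary object mask
        "masked_image": masked_image,  # object kept, background blanked out
        "prompt": prompt,              # desired background description
        "timestep": t,                 # current diffusion time step
    }
```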

To reduce computational cost, SI2 runs the diffusion process in a 64 × 64 × 4 latent space rather than in pixel space, using an encoder to compress the image. The ControlNet branch therefore also requires the conditioning image to be converted into this 64 × 64 × 4 latent space. Specifically, the image is encoded into a feature map by a small neural network consisting of four convolutional layers with the following settings (a minimal sketch follows the list).

  • Kernel size: 4 x 4
  • Stride: 2 x 2
  • Activation function: ReLU
  • Channel dimensions: 16, 32, 64, 128 (one per convolution layer)
  • Weight initialization: Gaussian
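
A minimal PyTorch sketch of such a conditioning encoder, built only from the settings listed above, might look like the following. The input channel count, padding, and input resolution are not specified in the article and are assumptions here.

```python
import torch
import torch.nn as nn

class ConditioningEncoder(nn.Module):
    """Four stride-2 convolutions (kernel 4x4, ReLU) with 16/32/64/128 channels and
    Gaussian-initialized weights, mapping the conditioning image to a feature map
    that is passed to the ControlNet branch."""

    def __init__(self, in_channels: int = 4):  # e.g. masked image + mask; channel count assumed
        super().__init__()
        layers, prev = [], in_channels
        for ch in (16, 32, 64, 128):
            conv = nn.Conv2d(prev, ch, kernel_size=4, stride=2, padding=1)
            nn.init.normal_(conv.weight, std=0.02)  # Gaussian weight initialization
            nn.init.zeros_(conv.bias)
            layers += [conv, nn.ReLU(inplace=True)]
            prev = ch
        self.net = nn.Sequential(*layers)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # Each stride-2 convolution halves the spatial resolution; for example,
        # a 1024x1024 conditioning input becomes a 64x64 feature map.
        return self.net(cond)
```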

This network is trained jointly with the ControlNet model, and its output feature map is then passed to ControlNet's U-Net.

ControlNet uses several zero convolution layers to gradually modify the outputs of the U-Net decoder, as shown in Figure 2. Mathematically, let $x \in \mathbb{R}^{h \times w \times c}$ be a feature map with height $h$, width $w$, and $c$ channels, let $E(\cdot\,; \Theta_e)$ be a U-Net encoder block with parameters $\Theta_e$, let $D(\cdot\,; \Theta_d)$ be a U-Net decoder block, and let $Z(\cdot\,; \Theta_z)$ denote a zero convolution. The ControlNet structure used by the proposed method is then defined as follows.
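
The equation itself appears only as an image in the original article. Based on the definitions above and the standard ControlNet formulation, it plausibly takes a form like the following, where $c$ is the conditioning feature map, $E_c(\cdot\,;\Theta_{e_c})$ is the trainable copy of the encoder block, and $\Theta_{z_1}, \Theta_{z_2}$ are the two zero convolutions; these symbols are assumptions, not notation reproduced from the paper:

$$ y = D\big(E(x;\Theta_e);\,\Theta_d\big) + Z\Big(E_c\big(x + Z(c;\Theta_{z_1});\,\Theta_{e_c}\big);\,\Theta_{z_2}\Big) $$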

where $y$ represents the output of the decoder layer as modulated by the ControlNet structure. Because the parameters of the zero convolution layers are initialized to zero, $Z(x; \Theta_z) = 0$ at the first gradient descent step, so the original output of the decoder layer remains unchanged. As a result, all inputs and outputs of both the trainable and the frozen copies of the U-Net behave as if ControlNet did not exist. In other words, before any gradient updates are applied, attaching the ControlNet structure to a layer does not affect its intermediate features.
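
As a concrete illustration of why the branch starts out inert, here is a minimal PyTorch sketch of a zero convolution: as in the original ControlNet, it is an ordinary 1×1 convolution whose weights and bias are initialized to zero.

```python
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """A 'zero convolution': its weights and bias start at zero, so its output is zero
    before the first gradient update and the ControlNet branch initially contributes
    nothing to the frozen decoder features."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```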

The training loss is the same noise-prediction objective used by standard diffusion models, as follows.
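
The loss is shown only as an image in the original article; the standard conditioned noise-prediction objective used in ControlNet-style training has the following form, where $z_t$ is the noisy latent at time step $t$, $\epsilon_\theta$ is the network's noise prediction, and the symbols $c_t$ (text prompt) and $c_i$ (conditioning image) are assumed here rather than copied from the paper:

$$ \mathcal{L} = \mathbb{E}_{z_0,\, t,\, c_t,\, c_i,\, \epsilon \sim \mathcal{N}(0,1)} \Big[ \big\| \epsilon - \epsilon_\theta(z_t, t, c_t, c_i) \big\|_2^2 \Big] $$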

Metrics for Evaluating Object Expansion

Figure 3: Computational pipeline for measuring object expansion

The main challenge for text-guided inpainting models when outpainting around a given object is their inability to preserve the object's boundaries. A quantitative measure is therefore needed to capture object expansion. To avoid expensive human labeling, salient object segmentation (SOS) models were first used to create masks for the input and outpainted images; however, these models performed poorly on outpainted images, probably due to distribution shift.

The Segment Anything Model (SAM) has been found to work well on outpainted images; SAM is not an SOS model, but it can segment objects using positive and negative point prompts.

Point prompts are selected from the mask of the original image produced by the InSPyReNet SOS model; SAM then uses them to segment the object in the outpainted image and produce a mask. The same process is applied to the input image, allowing the two masks to be compared directly. Figure 3 shows the detailed pipeline for obtaining these masks.
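
As a rough illustration of this step, the sketch below uses the segment-anything library's SamPredictor with positive points sampled inside the SOS mask and negative points sampled outside it. The checkpoint path, point counts, and sampling strategy are assumptions, not details from the paper.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # checkpoint path is an assumption
predictor = SamPredictor(sam)

def sam_object_mask(image: np.ndarray, sos_mask: np.ndarray, n_points: int = 10) -> np.ndarray:
    """Segment the salient object in `image` using point prompts drawn from `sos_mask`."""
    rng = np.random.default_rng(0)
    pos = np.argwhere(sos_mask > 0)    # pixels inside the object -> positive prompts
    neg = np.argwhere(sos_mask == 0)   # pixels outside the object -> negative prompts
    pos = pos[rng.choice(len(pos), n_points, replace=False)]
    neg = neg[rng.choice(len(neg), n_points, replace=False)]
    coords = np.concatenate([pos, neg])[:, ::-1].astype(np.float32)  # (row, col) -> (x, y)
    labels = np.concatenate([np.ones(n_points), np.zeros(n_points)])
    predictor.set_image(image)
    masks, _, _ = predictor.predict(point_coords=coords, point_labels=labels,
                                    multimask_output=False)
    return masks[0]

# The same function is applied to both the input image and the outpainted image,
# and the two resulting masks are compared to measure object expansion.
```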

With these masks, the object expansion metric can be computed as in the following equation, where AREA denotes the fraction of the image occupied by the object.
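
The equation is shown only as an image in the original article; based on the description, a plausible form is the relative growth of the object's area (the mask symbols $M_{\text{in}}$ and $M_{\text{out}}$ for the input and outpainted images are assumptions, and the paper may additionally clamp the value at zero):

$$ \text{expansion} = \frac{\text{AREA}(M_{\text{out}}) - \text{AREA}(M_{\text{in}})}{\text{AREA}(M_{\text{in}})} $$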

Experiment

Comparison Targets and Evaluation Indicators

To test the effectiveness of the proposed method, it is compared with representative models such as Blended Diffusion, GLIDE, and Stable Inpainting on the ImageNet-1k, ABO, COCO, DAVIS, and Pascal datasets. The evaluation metrics are as follows.

  • FID: Evaluates perceptual quality by measuring the distribution distance between the generated image and the real image.
  • Learned Perceptual Image Patch Similarity (LPIPS): evaluates the diversity of the generated backgrounds by calculating the average LPIPS score between pairs of outpainted images for the same object image.
  • CLIP Score: CLIP-ViT-L/14 is used to measure text-image consistency as the cosine distance between the text prompt and the generated image embedding (see the sketch after this list).
  • Object Similarity: measures how well the object's identity is preserved after background generation, calculated as the cosine distance between the BLIP-2 embeddings of the outpainted image and the input object-only image.
  • Object Expansion: quantifies the degree of expansion of the main object in pixel space, as described above.
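
As an illustration of how the CLIP score in particular might be computed, here is a hedged sketch using the Hugging Face transformers implementation of CLIP-ViT-L/14; the exact preprocessing and aggregation used in the paper may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between the CLIP embeddings of the generated image and the prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))
```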

Comparison Results with Previous Studies

Table 1: Comparison results with previous studies.

The results in Table 1 show that, compared to the state-of-the-art SI2 model, the proposed method reduces object expansion by an average of 3.6x. While SI2, which was trained on the LAION dataset, struggles with unrealistic images, the proposed method, trained on real-image datasets, achieves better FID and LPIPS scores.

GLIDE ranks slightly higher on LPIPS, but performs poorly on FID and CLIP score, indicating object expansion. SD2 achieves the highest CLIP score because it is less constrained by the object.

The proposed method scores slightly lower than SI2 on CLIP score, due to its reliance on the distribution of its training images and on synthetic BLIP-2 captions. However, its architecture allows the strength of the ControlNet to be adjusted at inference time, providing flexibility in controlling the output.

In addition, the proposed method achieves the highest object similarity score, indicating better preservation of object identity. The object expansion metric shows a 3.6x improvement over SI2, which is attributable to the proposed architecture and its training data.

Object Expansion by Category

Figure 4 plots object expansion across the 12 COCO supercategories.

The ordering of supercategories by expansion score is roughly the same across the benchmarked models. For every model, the highest expansion scores occur for indoor categories whose prominent objects have many fine details and no clearly defined boundaries, such as FOOD, KITCHEN, and FURNITURE.

Similarly, the lowest expansion scores occur in outdoor scenes with objects that contrast well with the background, such as SPORTS and ANIMAL.

Figure 4: Evaluation of object expansion by category

Summary

This article introduced a diffusion-model-based approach for generating backgrounds without altering the object's boundaries. Preserving object identity is critical in applications such as design and e-commerce. The paper identified the problem of object expansion and proposed a metric to capture it.

Background generation for less salient objects remains an open challenge and may require high-quality instance or panoptic segmentation masks. In addition, alternatives to ControlNet such as T2I-Adapter, which modulates the U-Net encoder, or new control architectures designed specifically for object-aware background generation, could further improve the accuracy and quality of the generated images.

