Catch up on the latest AI articles

E-commerce Background Image Generation Based On Product Category And Brand Style

3 main points
✔️ E-commerce images require consideration of product categories and brand styles, which significantly increases the time and cost of image generation
✔️ This paper presents the first background generation dataset and aims to solve this challenge by integrating category commonality and individual styles into a diffusion model
✔️ Experimental results show that the proposed method generates high-quality backgrounds across categories and preserves individual styles from reference images

Generate E-commerce Product Background by Integrating Category Commonality and Personalized Style
written by Haohan Wang, Wei Feng, Yang Lu, Yaoyu Li, Zheng Zhang, Jingjing Lv, Xin Zhu, Junjie Shen, Zhangang Lin, Lixing Bo, Jingping Shao
(Submitted on 20 Dec 2023)
Comments: 12 pages, 11 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

code: 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Background generation for e-commerce products is of practical value in image generation research: the goal is to create natural, realistic backgrounds for specific products in order to improve online performance metrics such as click-through rate (CTR). Traditionally, advertisers have hired professional designers to create suitable backgrounds for their products, but because the required backgrounds differ greatly across categories and brands, this process is time-consuming and financially burdensome.

To reduce costs, traditional methods utilizing image generation models often pre-generate complete images, which are then combined with products and other visual elements to create advertising images. However, the final image often lacks realism because the background is generated independently of the appearance of the product.

Although text-based inpainting methods have recently been applied to generate backgrounds with the product taken into account, designing appropriate prompts for each product remains time-consuming and inefficient for large-scale background generation. In addition, certain branded products require backgrounds with a detailed, consistent, individual style that is difficult to describe effectively with text alone, further complicating the generation process.

The paper introduced in this commentary aims to address these issues by integrating category commonality and individual styles into a diffusion model. To improve overall quality, the authors also collected a dataset specific to e-commerce products.

Extensive experimentation has demonstrated that the proposed method significantly outperforms state-of-the-art inpainting methods in both background similarity and quality.

Proposed Method

Overview

Figure 1: Overview of the proposed method

As shown in Figure 1, the proposed method consists of the following three components:

  • Stable Diffusion Model ($SD$)
  • Category-Wise Generator ($CG$)
  • Personality-Wise Generator ($PG$)

The $CG$ and $PG$ are built upon and modified from the ControlNet architecture. During training, given an advertising image $I$ and a product mask $M$, $CG$ takes $I ⊗ M$ (the product region) as input to integrate general category knowledge, while $PG$ captures individual styles from $I ⊗ (1 - M)$ (the background region).
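As a concrete illustration, these masked inputs can be built as below. This is a minimal sketch that reads $⊗$ as element-wise multiplication; the tensor shapes and values are our assumptions, not from the paper.

```python
import torch

# Advertising image I: (batch, 3, H, W), values in [0, 1]
I = torch.rand(1, 3, 512, 512)
# Product mask M: (batch, 1, H, W), 1 on product pixels, 0 elsewhere
M = (torch.rand(1, 1, 512, 512) > 0.5).float()

cg_input = I * M        # I ⊗ M: product only, fed to the Category-Wise Generator
pg_input = I * (1 - M)  # I ⊗ (1 - M): background only, fed to the Personality-Wise Generator
```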

Generation Utilizing Commonality of Categories ($CG$)

Although e-commerce platforms handle a wide variety of products, products within the same category have much in common. It is therefore natural to share the same prompt among products in the same category, for example by injecting the category name into the template "Photo of [category]". However, simply storing category knowledge in the category name while inheriting ControlNet's original architecture is not optimal: the category name should only drive the background, yet during training the foreground is generated from the same prompt, so foreground knowledge is inevitably encoded into it as well.

This problem is addressed in two steps:

  1. Input prompt: "Photo of [category], with [D] background", where [D] is a specific identifier.
  2. Mask-guided cross-attention layer: the layer attends to the two prompts separately and combines the results under the product mask, $y = M \odot \text{Attn}(x, P_{fg}) + (1 - M) \odot \text{Attn}(x, P_{bg})$, where $M$, $P_{fg}$, and $P_{bg}$ denote the product mask, the encoded product prompt ("Photo of [category]"), and the encoded background prompt ("with [D] background"), respectively, and $\odot$ is element-wise multiplication.

This approach forces the subnetwork to generate backgrounds based solely on background prompts. See the left half of Figure 1 for details.
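A minimal sketch of how such a mask-guided cross-attention layer might look in PyTorch is given below. The attention module, tensor shapes, and mask handling are our assumptions; only the idea of combining two prompt-conditioned attention outputs under the product mask comes from the paper.

```python
import torch
from torch import nn


class MaskGuidedCrossAttention(nn.Module):
    """Attend to the product prompt and the background prompt separately,
    then combine the two outputs under the (token-level) product mask M."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, p_fg, p_bg, mask):
        # x:    (B, N, C) spatial feature tokens
        # p_fg: (B, L, C) encoded product prompt ("Photo of [category]")
        # p_bg: (B, L, C) encoded background prompt ("with [D] background")
        # mask: (B, N, 1) product mask flattened to tokens (1 = product)
        y_fg, _ = self.attn(x, p_fg, p_fg)  # queries from x, keys/values from prompt
        y_bg, _ = self.attn(x, p_bg, p_bg)
        # Product tokens see only the product prompt; background tokens see
        # only the background prompt, so [D] encodes background style without
        # absorbing foreground (product) knowledge.
        return mask * y_fg + (1 - mask) * y_bg


# Usage on dummy tensors:
layer = MaskGuidedCrossAttention(dim=320)
x = torch.rand(1, 64 * 64, 320)
p_fg = torch.rand(1, 77, 320)
p_bg = torch.rand(1, 77, 320)
mask = (torch.rand(1, 64 * 64, 1) > 0.5).float()
y = layer(x, p_fg, p_bg, mask)  # (1, 4096, 320)
```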

Generation Utilizing Personalized Styles ($PG$)

While category-specific backgrounds are suitable for most products, well-known brands require backgrounds with a consistent signature style. To meet this need, the paper proposes the Personality-Wise Generator ($PG$), which generates personalized backgrounds that mimic the layout and elements of a reference image.

The proposed method utilizes an architecture similar to ControlNet, which preserves semantic and spatial information by maintaining a high-resolution feature map. See the right half of Figure 1 for details.

To ensure that the personalized style affects only the generated background, the proposed method masks the output $y_i$ of the $i$-th cross-attention layer with the product mask, i.e. $y_i \leftarrow y_i \odot (1 - M)$, so that reference-style features are injected only into background regions.

Because sufficient training pairs of reference images and corresponding generated images are not available, $PG$ is trained in a self-supervised manner: an advertising image is sampled, its product background is extracted, and that background is used as the reference to reconstruct the original image.

The problem here is that the original background acts as the ground truth, which invites a shortcut in which $PG$ simply pastes the product directly onto the reference background. To prevent this, the data is augmented by applying dilation, random masking, and translation to the mask $M$.

Perturbations that mix in another randomly sampled advertising image $I_{rand}$ are also added to the image $I$, so that $PG$ cannot reconstruct the target simply by copying its input. Together, these operations produce the new input data. See the right half of Figure 1 for details.
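The sketch below shows one plausible form of these augmentations. The kernel size, rectangle size, shift range, and mixup-style blending are all our assumptions; the article only names dilation, random masking, translation, and perturbation with $I_{rand}$.

```python
import torch
import torch.nn.functional as F


def augment_mask(M: torch.Tensor) -> torch.Tensor:
    """Perturb the product mask M (B, 1, H, W): dilation, a random
    rectangular mask, and a random translation."""
    # Dilation: max-pooling with stride 1 grows the masked (product) region.
    M = F.max_pool2d(M, kernel_size=7, stride=1, padding=3)
    # Random mask: mark an extra random rectangle as "product".
    B, _, H, W = M.shape
    y = torch.randint(0, H // 2, (1,)).item()
    x = torch.randint(0, W // 2, (1,)).item()
    M[:, :, y:y + H // 4, x:x + W // 4] = 1.0
    # Translation: shift the mask by a small random offset.
    dy, dx = torch.randint(-16, 17, (2,)).tolist()
    return torch.roll(M, shifts=(dy, dx), dims=(2, 3))


def perturb_image(I: torch.Tensor, I_rand: torch.Tensor, alpha: float = 0.9) -> torch.Tensor:
    """Blend I with another randomly sampled ad image I_rand (a mixup-style
    rule; the exact perturbation used in the paper is not specified here)."""
    return alpha * I + (1 - alpha) * I_rand


# PG's training input: background of the perturbed image under the augmented mask.
I = torch.rand(1, 3, 512, 512)
I_rand = torch.rand(1, 3, 512, 512)
M = (torch.rand(1, 1, 512, 512) > 0.5).float()
pg_input = perturb_image(I, I_rand) * (1 - augment_mask(M))
```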

BG60k: Dataset for E-commerce Product Background Generation

The LAION dataset, commonly used to train traditional image generation models, was not designed specifically for e-commerce scenarios, and many training images do not meet the requirements of advertising images. Figure 2 shows some examples.

Figure 2: Examples from the LAION dataset that are not suited to e-commerce products

This paper addresses the problem by collecting BG60k, a dataset for e-commerce product background generation. BG60k was collected from a well-known e-commerce platform and contains 63,293 advertising images from 2,032 categories, with each image labeled with its category.

The data was cleaned according to the following requirements (a rough sketch of such a filter follows the list):

  • The image must be visually attractive.
  • The image must not contain text.
  • The image must not contain people.
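As a rough illustration only, the two automatable rules could be enforced with a filter like the following. `contains_text` and `contains_person` are hypothetical placeholders for an OCR model and a person detector, and attractiveness is assumed to be judged separately (e.g., by human review).

```python
from PIL import Image


def contains_text(img: Image.Image) -> bool:
    """Hypothetical placeholder: in practice, run an OCR model here."""
    raise NotImplementedError


def contains_person(img: Image.Image) -> bool:
    """Hypothetical placeholder: in practice, run a person detector here."""
    raise NotImplementedError


def passes_cleaning(path: str) -> bool:
    """Keep an advertising image only if it contains no text and no people;
    visual attractiveness is assumed to be judged in a separate pass."""
    img = Image.open(path).convert("RGB")
    return not contains_text(img) and not contains_person(img)
```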

Two test sets were also created to evaluate the proposed method:

  • BG1K: 1,000 product images from over 200 categories, together with their original backgrounds
  • BG-pair: 1,600 pairs of product images and reference images, used to evaluate generation in an individual style

Experiment

Comparison of $CG$ and Previous Studies

In this experiment, the effectiveness of the proposed method is tested by comparing it with state-of-the-art methods from previous studies: LaMa, Stable Diffusion, and ControlNet. The input prompt for the proposed method ($CG$ only) is "A photo of [category], in the background of [D]", while the prompt for the previous methods is "A photo of [category]". The results are summarized in Table 1, where $CG$ performs better on both CLIP similarity and FID.

Table 1: Comparison results with previous studies.
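For reference, a CLIP similarity between two images can be computed along the following lines, here using the Hugging Face `transformers` CLIP API; the paper's exact evaluation code is not given in this article.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between the CLIP image embeddings of two images,
    e.g., a generated background and its reference."""
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()
```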

The tSNE visualization in Figure 3 also shows that the background features generated by $CG$ cluster more compactly around the corresponding category centers than those of ControlNet, whose features are more scattered and contain more outliers. In the qualitative comparison on the "refrigerator" category, $CG$ produces more consistent and photorealistic indoor backgrounds, whereas ControlNet tends to produce less relevant or unrealistic ones.

Figure 3: (a) tSNE visualization. Circles represent cluster centers in the training data; each triangle/rectangle represents the background embedding of one image generated by CG/ControlNet. (b) Comparison on the refrigerator category; compared to ControlNet, CG generates more realistic and complex backgrounds.

Personalized Background Generation

Table 2 shows the comparison between the proposed method and previous studies. The proposed method significantly outperforms the other methods, with a CLIP similarity of 4.75 and an FID of 1.23. The high CLIP similarity indicates that it successfully mimics the background features of the reference image, while the low FID indicates that it can generate new backgrounds that consistently follow the distribution of advertising images.

In addition, a simpler scenario was evaluated in which the product in the reference image is the same as the product requiring a background. In this case, all elements of the generated background spatially match those of the reference image, reducing how deeply the model must understand the reference. As the "Self → Self" row of Table 2 shows, the proposed method also achieves the best CLIP similarity and FID in this scenario. These results suggest that selecting reference images containing products of similar shape may further improve the quality of the generated background.

Table 2: Experimental results of personalized background generation.

Figure 4 shows some generated examples. The proposed method generates backgrounds similar to the reference image in style, layout, and elements.

Figure 4: Examples of personalized background generation

Summary

This commentary covered a paper addressing several practical challenges in background generation for e-commerce products. First, a category-wise generator ($CG$) improves the efficiency of large-scale generation, with a mask-guided cross-attention layer that maps the common style of each category to a unique identifier.

In addition, a personality-wise generator ($PG$) is proposed to maintain the individual style of a particular brand from a reference image, together with background data augmentation that prevents copy-and-paste shortcuts. Finally, the paper presents BG60k, the first large-scale dataset for product background generation.

Experimental results show that the proposed method can generate high-quality backgrounds for different categories of products and can generate backgrounds that resemble individual styles given a reference image.


If you have any suggestions for improvement of the content of the article, please contact the AI-SCHOLAR editorial team through the contact form.
