
MimicBrush, A New Image Editing Method "Imitative Editing" Is Proposed


3 main points
✔️ Proposes a new editing pipeline that takes a masked source image and a reference image as input, imitates the corresponding part of the reference, and naturally fills in the masked region
✔️ Builds the "MimicBrush" framework, which uses two U-Nets and self-supervised learning to recover masked regions in source images
✔️ Systematically evaluates the proposed method by constructing a high-quality benchmark covering two tasks: part composition and texture transfer

Zero-shot Image Editing with Reference Imitation
written by Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, Hengshuang Zhao
(Submitted on 11 Jun 2024)
Comments: this https URL.
Subjects: Computer Vision and Pattern Recognition (cs.CV)

code: 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Image editing is used to create new content for a variety of purposes, including adding new objects, changing attributes, and converting image styles. Recently, large-scale pre-trained text-to-image diffusion models have evolved rapidly, and with this evolution, the ability to edit images has been greatly enhanced. This has raised expectations for the ability to freely edit entire images, or portions of them, to meet a variety of user requirements.

However, existing image editing models have difficulty handling complex edits and present various practical challenges. For example, edits such as changing a shoe design by referring to the sole of another shoe, or attaching a specific pattern to a mug, are important in realistic applications such as product design, character creation, and special effects. For such localized edits, the source image is usually edited using a binary mask, and it is difficult to achieve the desired result using text alone.

Conventional composition-based methods take a reference image as input, indicate the reference area with a mask or box, and insert "individual objects" from the reference image into the source image. However, they have difficulty handling local elements such as shoe soles or hair, or local patterns such as logos or textures, because the reference region must be accurately extracted from the image. In addition, local elements are intertwined with the overall context, so separating them makes it impossible to properly understand their information.

To solve these problems, this paper proposes a new editing pipeline called "Imitative Editing." This method takes a masked source image and a reference image as input, and automatically finds and mimics the corresponding parts of the reference image to fill in the masked regions. This allows for more flexible interaction without strictly separating the referenced elements from the rest of the image.

To achieve this imitative editing, the paper further designs a framework called "MimicBrush." MimicBrush uses two U-Nets (diffusion models), an "Imitative U-Net" and a "Reference U-Net," to process the source and reference images. It is trained with self-supervised learning: using two frames of a video as the source and reference images, it learns to automatically discover reference regions and blend them naturally into the source image. MimicBrush handles differences in orientation, lighting, and category, and the regions it generates are harmonized with the background while retaining the visual details of the reference image.
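
To make the two-branch design more concrete, below is a minimal, hypothetical PyTorch sketch of how an imitative branch could attend to features injected from a reference branch. The layer sizes, token shapes, and the exact injection mechanism are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossBranchAttention(nn.Module):
    """Attention whose queries come from the imitative (source) branch,
    while keys/values also include tokens injected from a reference branch.
    This is only a conceptual analogue of how the Imitative U-Net could
    read features produced by the Reference U-Net."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, ref=None):
        # x:   (B, N, C) tokens from the imitative branch
        # ref: (B, M, C) tokens from the reference branch, or None
        kv = x if ref is None else torch.cat([x, ref], dim=1)
        out, _ = self.attn(x, kv, kv)
        return x + out  # residual connection


# Toy usage with made-up shapes: masked source tokens can attend to
# reference tokens, so masked regions can borrow visual details.
src_tokens = torch.randn(2, 64, 320)
ref_tokens = torch.randn(2, 256, 320)
block = CrossBranchAttention(dim=320)
print(block(src_tokens, ref_tokens).shape)  # torch.Size([2, 64, 320])
```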

In addition, this paper also builds a high-quality benchmark to evaluate the proposed method. This benchmark includes two main tasks, part composition and texture transfer, and includes subtracks for real-world applications such as fashion and product design.

MimicBrush Architecture

The figure below shows an overview of MimicBrush. The framework uses two U-Nets (diffusion models), the Imitative U-Net and the Reference U-Net, trained with self-supervised learning. A video naturally contains visual changes, such as the same dog changing posture, while its content remains consistent. MimicBrush uses two randomly selected frames as a training sample: one frame is used as the source image and part of it is masked, while the other frame is used as the reference image that helps restore the masked source image.

In this way, MimicBrush learns to identify corresponding visual information (e.g., a dog's face) and naturally redraw the masked areas of the source image. It also learns to transfer visual content across changes in posture, lighting, and viewpoint. This training uses raw video clips and can easily be scaled up, as no text or tracking annotations are required.

MimicBrush Learning Strategies

To maximize MimicBrush's ability to mimic images, the paper also suggests ways to find suitable training samples. The paper states that for this purpose, it is important to focus on two points: the existence of a correspondence between the source and reference images, and the existence of significant variation between the source and reference images.

In training, two frames from the same video are sampled. SSIM is then used as a measure of similarity between video frames, and pairs of frames with too great or too little similarity are excluded to ensure that the selected image pairs contain both semantic correspondence and visual variability.
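
As an illustration of this filtering step, the following sketch uses scikit-image's SSIM to keep only frame pairs whose similarity falls inside a band. The thresholds are placeholders, not the values used in the paper.

```python
import cv2
from skimage.metrics import structural_similarity as ssim


def is_valid_pair(frame_a, frame_b, low=0.2, high=0.8):
    """Keep a frame pair only if it is neither too different (no usable
    correspondence) nor too similar (no visual variation). The thresholds
    are placeholders, not the values used in the paper."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    score = ssim(gray_a, gray_b)
    return low < score < high
```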

Strong data augmentation is used to increase the variability between the source and reference images. In addition to aggressively applying color jitter, rotation, resizing, and flipping, random projection transformations are implemented to simulate stronger deformations.
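
A hypothetical torchvision pipeline along these lines might look as follows; the specific magnitudes are assumptions, and RandomPerspective stands in for the random projection transformation described above.

```python
from torchvision import transforms

# Hypothetical augmentation pipeline for the reference image; the exact
# magnitudes used in the paper are not given here, so these are assumptions.
reference_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1),
    transforms.RandomRotation(degrees=30),
    transforms.RandomResizedCrop(size=512, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    # RandomPerspective applies a random projective warp, standing in for
    # the "random projection transformation" described in the text.
    transforms.RandomPerspective(distortion_scale=0.4, p=1.0),
])
```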

For masking, the source image is divided into an N × N grid and each cell is masked at random. However, simple random masking tends to produce trivial training samples. For example, if a large portion of the image is background (e.g., grassland or sky) with repetitive content or texture, no guidance from the reference image is needed to restore those areas. To find more useful regions, SIFT matching is applied between the source and reference images to obtain a set of matching points. The paper states that while the matching results are not perfect, they are sufficient to build better training samples: the probability of masking grid cells that contain matched feature points is increased, as sketched below.
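
The sketch below illustrates this idea with OpenCV's SIFT: grid cells that contain matched keypoints receive a higher masking probability. The grid size and probabilities are illustrative, not the paper's settings.

```python
import cv2
import numpy as np


def sift_guided_mask(source, reference, n=8, base_p=0.3, matched_p=0.8):
    """Build an n x n grid mask over the source image, raising the masking
    probability of grid cells that contain SIFT keypoints matched to the
    reference image. Grid size and probabilities are illustrative only."""
    sift = cv2.SIFT_create()
    kp_s, des_s = sift.detectAndCompute(cv2.cvtColor(source, cv2.COLOR_BGR2GRAY), None)
    kp_r, des_r = sift.detectAndCompute(cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY), None)

    matches = cv2.BFMatcher().knnMatch(des_s, des_r, k=2)
    # Lowe's ratio test keeps only reasonably reliable matches.
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]

    h, w = source.shape[:2]
    probs = np.full((n, n), base_p)
    for m in good:
        x, y = kp_s[m.queryIdx].pt
        probs[min(int(y / h * n), n - 1), min(int(x / w * n), n - 1)] = matched_p

    cell_mask = (np.random.rand(n, n) < probs).astype(np.uint8)
    return cv2.resize(cell_mask, (w, h), interpolation=cv2.INTER_NEAREST)
```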

Since it is easier to collect images than video, the training is also extended to still images: object segmentation results are used to mask the source image, constructing a pseudo-frame pair. Segmentation-based masking also improves the robustness of MimicBrush because it helps mask more free-form shapes.

MimicBrush does not rely on annotation of training data. It obtains sufficient information from the consistency and variability of the videos and leverages images to extend the diversity, making the learning pipeline more scalable.

MimicBrush Evaluation Benchmarks

Since imitative editing is a new task, we have constructed our own benchmark to systematically evaluate its performance. As shown in the figure below, we have divided the applications into two tasks, "Part Composition" and "Texture Transfer," with Inter-ID and Inner-ID tracks for each.

The first task, Part Composition, evaluates the ability to find semantic correspondences between source and reference images and to composite local parts; the Inter-ID track aims to composite local parts across different instances and categories. Data is collected from a variety of topics (fashion, animals, products, scenarios). For each topic, 30 samples are manually collected from Pexels, for a total of 120 samples. Each sample contains a source image and reference image pair. Source masks are manually drawn and the compositing requirements are defined. Since there is no ground truth for the generated results, we annotated the reference region and wrote text prompts describing the expected results. This allows the DINO and CLIP image similarity between the generated region and the annotated reference region to be computed, following DreamBooth. In addition, we also report the CLIP text similarity between the edited image and the text prompt.
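
For reference, image similarity of this kind is typically computed as the cosine similarity between image embeddings. The sketch below uses Hugging Face's CLIP model as an example backbone; the DINO score would follow the same pattern with a DINO/DINOv2 encoder instead.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Example backbone; a DINO/DINOv2 image encoder would be used the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_image_similarity(generated_path, reference_path):
    """Cosine similarity between CLIP image embeddings of the generated
    region and the annotated reference region."""
    images = [Image.open(generated_path).convert("RGB"),
              Image.open(reference_path).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```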

An Inner-ID track was also set up, collecting 30 image pairs from DreamBooth, manually masking identifiable regions of the source images, and using reference images to complete them. Reference images are images that contain the same instances in different scenarios. This allows us to calculate SSIM, PSNR, and LPIPS using the unmasked source images as ground truth.
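
A minimal sketch of these Inner-ID metrics, assuming predictions and ground truth are float RGB arrays in [0, 1], could look like this (using scikit-image and the lpips package):

```python
import lpips
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network


def inner_id_metrics(pred, gt):
    """pred/gt: float RGB arrays of shape (H, W, 3) with values in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    lpips_val = float(lpips_fn(to_tensor(pred), to_tensor(gt)))
    return psnr, ssim, lpips_val
```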

The second task, Texture Transfer, requires that the shape of the source object be strictly preserved and that only the textures and patterns of the reference image be transferred. For this task, a depth map is used as an additional condition. Unlike Part Composition, which looks for semantic correspondences, this task masks the full object and lets the model find correspondences between the texture (reference) and the shape (source). It also has Inter-ID and Inner-ID tracks.

Experiment

Here we compare MimicBrush with other similar methods. Because imitative editing is a new task, existing methods cannot fully address it. Therefore, we allow additional input for other methods. For example, AnyDoor and Paint-by-Example allow additional input of masks and boxes to indicate reference areas. We also give a detailed text description to the state-of-the-art inpainting tool Firefly.

The qualitative results are shown in the figure below. While Firefly can follow the instructions precisely and produce high-quality images, text prompts have difficulty capturing the details of patterns such as logos and tattoos.

Also, Paint-by-Example (PbE) requires a cropped reference image centered on the reference area, but because this model represents the reference as a single token, it cannot guarantee fidelity between the generated region and the reference region. AnyDoor can take cropped reference regions as input, but it cannot composite them properly. This may be because local parts are difficult to understand when taken out of context, and because many of AnyDoor's training samples are whole objects. On the other hand, Ours (MimicBrush) avoids this problem by having the model itself learn correspondences within the full context without using paired masks, and shows excellent performance in completing any part from a complete reference image.

The quantitative results for the Part Composition benchmark are shown in the table below. For Inner-ID, which has ground truth, MimicBrush performs better even when additional conditions are given to the other methods. MimicBrush also shows competitive performance compared to AnyDoor; however, AnyDoor has the advantage of being given a reference mask, whereas MimicBrush must identify the reference region on its own.

Because the evaluation metrics may not fully match human preferences, the paper also conducts a user study in which 10 annotators are asked to rank the generated results of each model on the benchmark proposed in the paper. Each sample is evaluated in terms of fidelity, harmony, and quality: fidelity evaluates how well the generated region preserves the distinctive appearance of the reference region, harmony evaluates whether the generated region blends naturally with the background, and quality evaluates whether the generated region is detailed and of high quality. The evaluation results are shown in the table below, indicating that MimicBrush received significantly higher ratings than the other methods.

The paper also includes an ablation study to validate the effectiveness of the various components. MimicBrush utilizes two U-Nets (diffusion models) to extract features from the source and reference images, respectively. Previous studies have shown that pre-trained diffusion models can capture semantic correspondence. Therefore, the authors test whether a self-supervised learning pipeline with an asymmetric structure, for example one that replaces the reference U-Net with an image encoder such as CLIP or DINOv2, can also learn this semantic correspondence.

As can be seen from the visual comparison in the figure below, CLIP and DINOv2 can also identify the reference region reasonably well, but the U-Net gives superior results in preserving detail.

The table below examines the effectiveness of the video-based learning pipeline. The performance of every task degrades significantly when only static images are used, which suggests that the deformation and variation of objects in video are important for achieving imitative editing. We also observed that removing color jitter, resizing, and projection transformations degraded the performance of Part Composition, especially on the Inter-ID track. This indicates that data augmentation is important for robust semantic matching.


Different masking strategies for the source images are also examined. Simple or purely random masking strategies can lead to many low-quality training samples; better performance is achieved by leveraging SIFT matching to guide the masking.

In addition, the paper presents more visual examples and discusses different applications. As shown in the figure below, it is clear that MimicBrush can handle images from a wide variety of topics and domains.

The first example shows its application to product design. The next example shows a jewelry fitting; the third example demonstrates its high versatility, showing that it can be used for background and natural effects.

Summary

This paper introduces a new image editing method called "Imitative Editing" that can be implemented with simple interactions. In this technique, the user simply marks the areas to be edited in the source image and provides a reference image containing the desired visual elements, and MimicBrush automatically finds the corresponding reference areas to complete the source image.

To achieve imitative editing, we designed a self-supervised learning pipeline that takes full advantage of the consistency and variability of video, using one frame to restore the masked areas of another. MimicBrush delivers superior performance in a variety of editing tasks and can be used in a wide range of applications. We have also built benchmarks to comprehensively evaluate imitative editing. This new imitative editing technology is expected to help many people expand their creativity even further.

However, although MimicBrush shows robust performance, it may fail to find the reference area accurately when that area is too small or when there are multiple candidates in the reference image. In such cases, the user must crop the reference image to enlarge the desired area. In addition, because MimicBrush can handle such a wide range of images, there is a risk of it generating harmful content. The authors state that they will add the ability to filter out harmful content to the code and demos they publish.

