
MVANet: The Most Powerful Model For Background Removal

Neural Network

3 main points
✔️ The main challenge of the foreground extraction (background removal) task is to capture fine, high-resolution detail in small regions without losing accuracy over larger regions.
✔️ Inspired by human vision, MVANet treats foreground extraction as the problem of viewing an object from multiple angles.
✔️ This new method outperforms the current SOTA on the DIS-5K dataset in both accuracy and speed by improving long-range visual interactions and attention to detail.

Multi-view Aggregation Network for Dichotomous Image Segmentation
written by Qian Yu, Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu
(Submitted on 11 Apr 2024)
Comments: Accepted by CVPR2024 as Highlight

Subjects: Computer Vision and Pattern Recognition (cs.CV)

code:
 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Foreground extraction (background removal), one of the key challenges in modern computer vision, is becoming increasingly important in a wide variety of applications. Effective background removal in image editing and video production not only increases aesthetic value, but also enables a more efficient workflow. Background removal also plays an important role in fields that require accuracy, such as medical image analysis and object recognition in automated driving technology. The main challenge is to capture fine details in small areas in high-resolution images while maintaining accuracy in large areas. Until now, there has been a lack of methods that combine detail reproduction with global accuracy. However, a new approach called MVANet offers an innovative solution to this challenge.

MVANet employs a unique method inspired by human vision. Just as humans observe an object from multiple angles, MVANet analyzes an object from multiple perspectives. This approach improves overall accuracy without losing detail. Furthermore, the integration of multiple viewpoints allows for long-range visual interactions that are difficult to achieve with traditional methods.

Background removal technology is in increasing demand in a variety of industries, including marketing, entertainment, healthcare, and security. In online shopping, it is expected to increase purchase intent by making products stand out from the background. It is also important for videoconferencing applications that use virtual backgrounds and as an alternative to green screens in video production. As these applications gain more attention, improved foreground extraction performance will have a significant impact on the industry as a whole.

This new method has already demonstrated its effectiveness. In particular, on the DIS-5K dataset, it has outperformed the current SOTA in both accuracy and speed; MVANet has the potential to become the new standard for foreground extraction tasks and is expected to have a wider range of applications in the future.

Proposed Method

Summary

Figure 1: Overview of MVANet

The overall architecture of MVANet is similar to UNet, as shown in Figure 1. The encoder takes as input a distant (global) view $G$ and close-up (local) views $L_m$, consisting of $M$ non-overlapping local patches ($M=4$ in this paper).
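To make this input construction concrete, here is a minimal PyTorch sketch of how a high-resolution image could be turned into the batched multi-view sequence of one distant view $G$ and $M=4$ close-up patches. The function name and the 2×2 grid layout are illustrative assumptions, not taken from the official implementation.

```python
import torch
import torch.nn.functional as F

def build_multiview_batch(image: torch.Tensor, patch_grid: int = 2) -> torch.Tensor:
    """Turn one high-resolution image (3, H, W) into a multi-view batch.

    The distant view G is the whole image resized to the patch size; the
    close-up views L_m are M = patch_grid**2 non-overlapping crops (M = 4 for
    the 2x2 grid used in the paper). All views are stacked along the batch
    dimension so a single backbone pass processes every view at once.
    """
    _, H, W = image.shape
    ph, pw = H // patch_grid, W // patch_grid

    # Distant (global) view: the full image downsampled to the patch resolution.
    global_view = F.interpolate(image.unsqueeze(0), size=(ph, pw),
                                mode="bilinear", align_corners=False)

    # Close-up (local) views: non-overlapping patches in row-major order.
    local_views = [
        image[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw].unsqueeze(0)
        for i in range(patch_grid) for j in range(patch_grid)
    ]

    # Multi-view patch sequence [G, L_1, ..., L_M] along the batch dimension.
    return torch.cat([global_view] + local_views, dim=0)

# Example: a 1024 x 1024 input yields a (1 + 4, 3, 512, 512) multi-view batch.
views = build_multiview_batch(torch.rand(3, 1024, 1024))
print(views.shape)  # torch.Size([5, 3, 512, 512])
```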

$G$ and $L_m$ constitute a multi-view patch sequence that is fed to the feature extractor in a single batch to produce multi-level feature maps $E_i\ (i=1,2,3,4,5)$. Each $E_i$ contains representations of both the distant and close-up views. The highest-level feature map $E_5$ is split into two distinct sets of global and local features along the batch dimension, which are input to the Multi-view Complementary Localization Module (MCLM, Figure 2-a) to highlight positional information about the object in the global representation.
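The batch-dimension split of $E_5$ and the localization step can be pictured with the sketch below, in which pooled tokens from the local views serve as keys and values for cross-attention over the global features. The pooling size, attention configuration, and module name are assumptions that only approximate the spirit of MCLM, not its exact design.

```python
import torch
import torch.nn as nn

class GlobalLocalization(nn.Module):
    """Rough analogue of the MCLM idea: compress the local-view features into
    a few pooled tokens and let the global-view features attend to them, so
    object-position cues from the close-ups are injected into the distant view.
    Layer sizes and pooling are illustrative, not the paper's exact design."""

    def __init__(self, dim: int = 256, num_heads: int = 8, pooled: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pooled)  # local maps -> pooled tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, e5: torch.Tensor) -> torch.Tensor:
        # e5: (1 + M, C, h, w), global view first, then the M local views.
        g, l = e5[:1], e5[1:]
        _, C, h, w = g.shape

        # Queries: the global features flattened to a token sequence (1, h*w, C).
        q = g.flatten(2).transpose(1, 2)

        # Keys/values: pooled tokens from all local views, concatenated (1, M*p*p, C).
        kv = self.pool(l).flatten(2).transpose(1, 2).reshape(1, -1, C)

        out, _ = self.attn(q, kv, kv)
        g_enhanced = self.norm(q + out).transpose(1, 2).reshape(1, C, h, w)
        return torch.cat([g_enhanced, l], dim=0)  # enhanced multi-view features

# Example with C = 256 channels and 32 x 32 top-level feature maps.
e5 = torch.rand(5, 256, 32, 32)
print(GlobalLocalization(dim=256)(e5).shape)  # torch.Size([5, 256, 32, 32])
```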

The decoder is similar to the FPN (Lin et al., 2017) architecture, but a Multi-view Complementary Refinement Module (MCRM, Figure 2-b) is inserted at each decoding stage. The output of each stage is used to reconstruct the foreground segmentation map and to compute the loss. Multi-view integration is shown in the lower-right portion of Figure 1: the local features are merged and passed through a convolutional head for refinement, then concatenated with the global features.
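The fusion in the lower right of Figure 1 can likewise be sketched as follows: the decoded local maps are tiled back into their original 2×2 layout, refined by a small convolutional head, and concatenated with the upsampled global map. The tiling order, conv-head width, and function name are assumptions for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_views(features: torch.Tensor, patch_grid: int = 2) -> torch.Tensor:
    """Re-assemble decoded multi-view features into one map.

    `features` holds the decoded global view followed by the M local views,
    shaped (1 + M, C, h, w). The local maps are tiled back into their original
    2x2 layout, refined by a small convolutional head, and concatenated along
    the channel dimension with the upsampled global map.
    """
    g, l = features[:1], features[1:]
    C = g.shape[1]

    # Tile the M local feature maps back into a (1, C, 2h, 2w) canvas.
    rows = torch.chunk(l, patch_grid, dim=0)  # patch_grid groups of patch_grid maps
    canvas = torch.cat([torch.cat(list(r), dim=-1) for r in rows], dim=-2).unsqueeze(0)

    # Refine the combined local representation with a lightweight conv head.
    conv_head = nn.Sequential(nn.Conv2d(C, C, 3, padding=1), nn.ReLU())
    refined = conv_head(canvas)

    # Upsample the global map to the same size and concatenate with it.
    g_up = F.interpolate(g, size=refined.shape[-2:], mode="bilinear", align_corners=False)
    return torch.cat([refined, g_up], dim=1)  # (1, 2C, 2h, 2w)

# Example: five decoded 64 x 64 views merge into one (1, 128, 128, 128) map.
print(merge_views(torch.rand(5, 64, 64, 64)).shape)  # torch.Size([1, 128, 128, 128])
```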

Figure 2: MCLM and MCRM architecture

Learning Loss Function

As shown in Figure 1, supervision is applied to the output of each decoder layer and to the final prediction.

Specifically, the former consists of three parts, $l_l$, $l_g$, and $l_a$, representing the combined local representation, the global representation, and the token attention map in the refinement module. Each of these side outputs requires a separate convolution layer to obtain a single-channel prediction. The latter is represented as $l_f$. These components use a combination of binary cross-entropy (BCE) loss and weighted IoU loss, commonly used in most segmentation tasks.

The final training loss is the weighted sum of these components, $\mathcal{L} = l_l + \lambda_g l_g + \lambda_a l_a + l_f$, with the side-output terms accumulated over the decoder stages. In this paper, the weights are set to $\lambda_g = 0.3$ and $\lambda_a = 0.3$.
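As a reference, the sketch below shows how such a combined loss might be assembled in PyTorch. It uses a plain (unweighted) IoU term as a stand-in for the paper's weighted variant, and the helper names and the per-stage supervision loop are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def bce_iou_loss(pred: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """BCE + IoU loss on raw logits `pred` and ground-truth `mask`, (B, 1, H, W).
    A plain IoU term stands in for the weighted IoU used in the paper."""
    if mask.shape[-2:] != pred.shape[-2:]:
        mask = F.interpolate(mask, size=pred.shape[-2:], mode="bilinear",
                             align_corners=False)
    bce = F.binary_cross_entropy_with_logits(pred, mask)

    prob = torch.sigmoid(pred)
    inter = (prob * mask).sum(dim=(2, 3))
    union = (prob + mask - prob * mask).sum(dim=(2, 3))
    iou = 1.0 - (inter + 1.0) / (union + 1.0)
    return bce + iou.mean()

def total_loss(side_outputs, final_pred, mask, lam_g=0.3, lam_a=0.3):
    """Total loss: the final prediction l_f plus, for every decoder stage, the
    local, global, and token-attention side outputs (a hypothetical grouping)."""
    loss = bce_iou_loss(final_pred, mask)
    for l_l, l_g, l_a in side_outputs:  # one (local, global, attention) triple per stage
        loss = (loss + bce_iou_loss(l_l, mask)
                     + lam_g * bce_iou_loss(l_g, mask)
                     + lam_a * bce_iou_loss(l_a, mask))
    return loss
```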

Experiment

Data Sets and Evaluation Indicators

Data Set

This paper runs its experiments on the DIS5K benchmark dataset, which contains 5,470 high-resolution images (2K, 4K, or larger) across 225 categories. The dataset is divided into three parts:

  • DIS-TR: 3,000 training images
  • DIS-VD: 470 validation images
  • DIS-TE: 2,000 test images, divided into four subsets (DIS-TE1, 2, 3, 4) of 500 images each, with increasing shape complexity

The DIS5K dataset is more challenging than other segmentation datasets due to its high resolution images, detailed structure, and excellent annotation quality, requiring advanced models to capture complex details.

Evaluation Indicators

To evaluate performance, the following metrics were used:

  • Maximum F-measure: the best F score over all binarization thresholds, balancing precision and recall with β² set to 0.3.
  • Weighted F-measure: a variant of the F-measure that weights precision and recall by the location and importance of errors.
  • Structure Measure (Sm): evaluates the structural similarity between the prediction and the ground truth, considering both region-aware and object-aware similarity.
  • E-measure: jointly evaluates pixel-level matching and image-level statistics.
  • Mean Absolute Error (MAE): the average absolute error between the prediction map and the ground truth.

These metrics help to understand the performance of the model in identifying and segmenting objects with complex structures in the DIS5K dataset.
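For readers who want to compute these numbers themselves, the sketch below gives simplified, single-image implementations of two of the metrics (MAE and the maximum F-measure with β² = 0.3); the remaining metrics have more involved definitions and are omitted here.

```python
import torch

def mae(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Mean absolute error between a prediction map and the ground truth, both in [0, 1]."""
    return (pred - gt).abs().mean().item()

def max_f_measure(pred: torch.Tensor, gt: torch.Tensor,
                  beta2: float = 0.3, steps: int = 255) -> float:
    """Maximum F-measure: binarize the prediction at many thresholds, compute
    precision and recall at each, and keep the best F score (beta^2 = 0.3)."""
    gt = (gt > 0.5).float()
    best = 0.0
    for t in torch.linspace(0.0, 1.0, steps):
        binary = (pred >= t).float()
        tp = (binary * gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f.item())
    return best

# Example on a random 1024 x 1024 prediction/ground-truth pair.
p, g = torch.rand(1024, 1024), (torch.rand(1024, 1024) > 0.5).float()
print(mae(p, g), max_f_measure(p, g))
```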

Experimental Results

Quantitative Evaluation

In Table 1, we compare the proposed MVANet with 11 other well-known related models (F3Net, GCPANet, PFNet, BSANet, ISDNet, IFA, IS-Net, FPDIS, UDUN, PGNet, InSPyReNet). For a fair comparison, we standardized the input size to 1024 × 1024. The results show that MVANet significantly outperforms the other models on all metrics across all datasets. In particular, MVANet outperformed InSPyReNet on F, Em, Sm, and MAE by 2.5%, 2.1%, 0.5%, and 0.4%, respectively.

We also evaluated the inference speed of InSPyReNet and MVANet. Both were tested on NVIDIA RTX 3090 GPUs. Thanks to its simple single-stream design, MVANet achieved 4.6 FPS compared to InSPyReNet's 2.2 FPS.

Table 1. Quantitative evaluation on DIS5K

Qualitative Evaluation

To intuitively demonstrate the accuracy of the proposed method's predictions, we visualized its output on selected images from the test set. As shown in Figure 3, the proposed method accurately localizes objects and captures edge details even in complex scenes. In particular, it cleanly segments the complete chair and the interior of each lattice opening, whereas other methods suffer interference from the conspicuous yellow gauze and shadows (see bottom row).

Figure 3. Qualitative evaluation on DIS5K

Summary

The paper introduced in this article models high-accuracy foreground extraction (background removal) as a multi-view object recognition problem and provides an efficient and simple multi-view aggregation network, aiming for a better balance between model design, accuracy, and inference speed.

To address the problem of aligning the target across multiple views, a Multi-view Complementary Localization Module is proposed to jointly compute the co-attention regions of the target. Furthermore, the proposed Multi-view Complementary Refinement Module is embedded in each decoder block to fully integrate complementary local information and compensate for the lack of semantics in a single-view patch. This allows the final fusion of views to be accomplished with only a single convolutional layer.

Extensive experiments have shown that the proposed method performs well. In particular, on the DIS-5K dataset, the proposed method outperforms the current SOTA in both accuracy and speed; MVANet has the potential to become a new standard for foreground extraction tasks and is expected to have a wider range of applications in the future.

