
MVANet: The Most Powerful Model For Background Removal
3 main points
✔️ The main challenge of the foreground extraction (background removal) task is to capture high-resolution details in small areas and not lose accuracy in larger areas.
✔️ Inspired by human vision, MVANet treats foreground extraction as the problem of viewing an object from multiple angles.
✔️ This new method outperforms current SOTA on the DIS-5K dataset in both accuracy and speed by improving long-range visual interaction and focusing on details.
Multi-view Aggregation Network for Dichotomous Image Segmentation
written by Qian Yu, Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu
(Submitted on 11 Apr 2024)
Comments: Accepted by CVPR2024 as Highlight
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Foreground extraction (background removal), one of the key challenges in modern computer vision, is becoming increasingly important in a wide variety of applications. Effective background removal in image editing and video production not only increases aesthetic value, but also enables a more efficient workflow. Background removal also plays an important role in fields that require accuracy, such as medical image analysis and object recognition in automated driving technology. The main challenge is to capture fine details in small areas in high-resolution images while maintaining accuracy in large areas. Until now, there has been a lack of methods that combine detail reproduction with global accuracy. However, a new approach called MVANet offers an innovative solution to this challenge.
MVANet employs a unique method inspired by human vision. Just as humans observe an object from multiple angles, MVANet analyzes an object from multiple perspectives. This approach improves overall accuracy without losing detail. Furthermore, the integration of multiple viewpoints allows for long-range visual interactions that are difficult to achieve with traditional methods.
Background removal technology is in increasing demand in a variety of industries, including marketing, entertainment, healthcare, and security. In online shopping, it is expected to boost purchase intent by making products stand out from their backgrounds. It is also important for videoconferencing applications that use virtual backgrounds, and as an alternative to green screens in video production. As these applications gain more attention, improved foreground extraction performance will have a significant impact on the industry as a whole.
This new method has already demonstrated its effectiveness. In particular, on the DIS-5K dataset, it has outperformed the current SOTA in both accuracy and speed; MVANet has the potential to become the new standard for foreground extraction tasks and is expected to have a wider range of applications in the future.
Proposed Method
Summary

The overall architecture of MVANet is similar to UNet, as shown in Figure 1. The encoder takes as input a distant (global) view $G$ and close-up (local) views $L_m$, obtained by splitting the image into $M$ non-overlapping patches ($M=4$ in this paper).
$G$ and the $L_m$ together form a multi-view patch sequence that is fed to the feature extractor as a single batch to produce multi-level feature maps $E_i\ (i=1,2,3,4,5)$. Each $E_i$ contains representations of both the distant and close-up views. The highest-level feature map $E_5$ is split along the batch dimension into separate global and local feature sets, which are fed to the Multi-view Complementary Localization Module (MCLM, Figure 2-a) to highlight positional information about objects in the global representation.
The decoder follows an FPN-like architecture (Lin et al., 2017), but a Multi-view Complementary Refinement Module (MCRM, Figure 2-b) is inserted at each decoding stage. The output of each stage is used to reconstruct a side-output segmentation map (a foreground-only mask) and compute a loss. Multi-view integration is shown in the lower-right portion of Figure 1: the local features are merged, passed through a Conv Head for refinement, and concatenated with the global features.
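To make the multi-view input concrete, below is a minimal PyTorch-style sketch (not the authors' code) of how a high-resolution image could be turned into one global view plus $M=4$ non-overlapping local patches and stacked along the batch dimension before passing through a shared encoder. The function name, the 1024 × 1024 working resolution, and the bilinear resizing are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def build_multiview_batch(image: torch.Tensor, size: int = 1024, m_per_side: int = 2):
    """Turn one high-resolution image into a multi-view patch sequence (sketch).

    image: (1, 3, H, W) tensor.
    Returns a (1 + M, 3, size, size) batch: the resized global (distant) view
    followed by M = m_per_side**2 non-overlapping local (close-up) patches.
    """
    # Global view G: the whole image resized to the working resolution.
    global_view = F.interpolate(image, size=(size, size),
                                mode="bilinear", align_corners=False)

    # Local views L_m: resize to (m_per_side * size) and cut into a grid of
    # non-overlapping patches, so each patch keeps more native detail than G.
    hi_res = F.interpolate(image, size=(m_per_side * size, m_per_side * size),
                           mode="bilinear", align_corners=False)
    patches = []
    for i in range(m_per_side):
        for j in range(m_per_side):
            patches.append(hi_res[:, :,
                                  i * size:(i + 1) * size,
                                  j * size:(j + 1) * size])

    # Stack global + local views along the batch dimension so a single
    # shared encoder can process all views in one forward pass.
    return torch.cat([global_view] + patches, dim=0)


# Example: a 2048x2048 image becomes a (5, 3, 1024, 1024) multi-view batch.
x = torch.randn(1, 3, 2048, 2048)
views = build_multiview_batch(x)
print(views.shape)  # torch.Size([5, 3, 1024, 1024])
```

A shared backbone applied to this $(1+M)$-image batch then yields the multi-level features $E_i$, which can later be split back into global and local sets along the batch dimension, as described above.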

Training Loss Function
As shown in Figure 1, supervision is added to the output and final prediction of each layer of the decoder.
Specifically, the former consists of three parts, $l_l$, $l_g$, and $l_a$, which supervise the combined local representation, the global representation, and the token attention map in the refinement module, respectively. Each of these side outputs is passed through a separate convolution layer to obtain a single-channel prediction. The latter is denoted $l_f$. All of these terms use a combination of binary cross-entropy (BCE) loss and weighted IoU loss, as is common in segmentation tasks.
The final training loss can be written as
$$\mathcal{L} = \sum_{i} \left( l_l^{(i)} + \lambda_g\, l_g^{(i)} + \lambda_a\, l_a^{(i)} \right) + l_f,$$
where the sum runs over the decoder stages. In this paper, $\lambda_g = 0.3$ and $\lambda_a = 0.3$.
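As a rough illustration of how this supervision is typically implemented, here is a minimal sketch of a BCE + weighted-IoU loss and of the weighted sum over side outputs. The boundary-weighting scheme inside `structure_loss` follows the common F3Net-style formulation and, like the function names, is an assumption rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """BCE + weighted IoU, as commonly used in segmentation (F3Net-style sketch).

    pred: (B, 1, H, W) logits; mask: (B, 1, H, W) binary ground truth.
    Pixels near object boundaries receive larger weights.
    """
    # Boundary-aware weights: large where the local neighborhood disagrees with the pixel.
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)

    # Weighted binary cross-entropy.
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    # Weighted IoU.
    prob = torch.sigmoid(pred)
    inter = (prob * mask * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3)) - inter
    wiou = 1 - (inter + 1) / (union + 1)

    return (wbce + wiou).mean()


def total_loss(local_preds, global_preds, attn_preds, final_pred, mask,
               lambda_g: float = 0.3, lambda_a: float = 0.3) -> torch.Tensor:
    """Weighted sum of side-output losses plus the final prediction loss (sketch).

    In practice each side output would be supervised against a ground truth
    resized to its own resolution; that resizing is omitted here for brevity.
    """
    loss = structure_loss(final_pred, mask)                      # l_f
    for l_pred, g_pred, a_pred in zip(local_preds, global_preds, attn_preds):
        loss = loss + structure_loss(l_pred, mask)               # l_l
        loss = loss + lambda_g * structure_loss(g_pred, mask)    # lambda_g * l_g
        loss = loss + lambda_a * structure_loss(a_pred, mask)    # lambda_a * l_a
    return loss
```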
Experiment
Datasets and Evaluation Metrics
Dataset
This paper experimented with the DIS5K benchmark dataset. This dataset contains 5,470 high-resolution images (2K, 4K, or larger) across 225 categories. The dataset is divided into three parts:
- DIS-TR: 3,000 training images
- DIS-VD: 470 validation images
- DIS-TE: 2,000 test images, divided into four subsets (DIS-TE1, 2, 3, 4) of 500 images each, ordered by increasing shape complexity
The DIS5K dataset is more challenging than other segmentation datasets due to its high resolution images, detailed structure, and excellent annotation quality, requiring advanced models to capture complex details.
Evaluation Metrics
To evaluate performance, the following metrics were used:
- Maximum F-measure: the maximum F-score over all binarization thresholds, combining precision and recall; β² is set to 0.3.
- Weighted F-measure: a variant of the F-measure that weights precision and recall by pixel importance, giving a more reliable overall score.
- Structure measure (Sm): evaluates the structural similarity between the prediction and the ground truth, considering both region-aware and object-aware similarity.
- E-measure: jointly evaluates pixel-level matching and image-level statistics between the prediction and the ground truth.
- Mean Absolute Error (MAE): the average per-pixel absolute difference between the prediction map and the ground truth.
These metrics help to understand the performance of the model in identifying and segmenting objects with complex structures in the DIS5K dataset.
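For reference, the following is a minimal NumPy sketch of the two simplest of these metrics, MAE and the maximum F-measure. The sweep over 256 evenly spaced thresholds is a common convention and an assumption here, not necessarily the benchmark's exact evaluation code.

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a [0, 1] prediction map and a binary ground truth."""
    return float(np.abs(pred - gt).mean())

def max_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """Maximum F-measure over a sweep of binarization thresholds (beta^2 = 0.3)."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0, 1, 256):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best

# Example on random maps (real evaluation averages these scores over the test set).
pred = np.random.rand(512, 512)
gt = (np.random.rand(512, 512) > 0.5).astype(np.float32)
print(mae(pred, gt), max_f_measure(pred, gt))
```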
Experimental Results
Quantitative Evaluation
In Table 1, we compare the proposed MVANet with 11 other well-known related models (F3Net, GCPANet, PFNet, BSANet, ISDNet, IFA, IS-Net, FPDIS, UDUN, PGNet, InSPyReNet). For a fair comparison, the input size was standardized to 1024 × 1024. The results show that MVANet significantly outperforms the other models across metrics on all datasets. In particular, MVANet outperformed InSPyReNet on F, Em, Sm, and MAE by 2.5%, 2.1%, 0.5%, and 0.4%, respectively.
We also evaluated the inference speed of InSPyReNet and MVANet, with both tested on an NVIDIA RTX 3090 GPU. Thanks to its simple single-stream design, MVANet achieved 4.6 FPS compared to InSPyReNet's 2.2 FPS.
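As a side note on how such throughput figures are usually obtained, the sketch below times repeated forward passes on the GPU with warm-up iterations and explicit synchronization; the model, input size, and iteration counts are placeholders rather than the authors' benchmarking script.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, input_size=(1, 3, 1024, 1024),
                warmup: int = 10, iters: int = 50) -> float:
    """Average frames per second for a single-image forward pass on the GPU."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)

    for _ in range(warmup):          # warm-up runs to exclude startup overhead
        model(x)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()         # wait for all queued GPU work to finish
    return iters / (time.time() - start)
```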

Qualitative Evaluation
To intuitively demonstrate the accuracy of the proposed method's predictions, we visualized its output on selected images from the test set. As shown in Figure 3, the proposed method accurately locates objects and captures edge details even in complex scenes. In particular, it cleanly segments the entire chair and the interior of each lattice opening, while other methods suffer interference from the conspicuous yellow gauze and shadows (see the bottom row).

Summary
In this commentary article, we introduced MVANet, which models high-accuracy foreground extraction (background removal) as a multi-view object recognition problem and provides an efficient and simple multi-view aggregation network. In doing so, it aims for a better balance between model design, accuracy, and inference speed.
To address the problem of aligning the target across multiple views, the paper proposes a Multi-view Complementary Localization Module that jointly computes the co-attention regions of the target. Furthermore, the proposed Multi-view Complementary Refinement Module is embedded in each decoder block to fully integrate complementary local information and compensate for the limited semantics of individual single-view patches. This allows the final view refinement to be accomplished with only a single convolutional layer.
Extensive experiments have shown that the proposed method performs well. In particular, on the DIS-5K dataset, the proposed method outperforms the current SOTA in both accuracy and speed; MVANet has the potential to become a new standard for foreground extraction tasks and is expected to have a wider range of applications in the future.