CMP-NAS, A Neural Architecture Search For Compatibility
3 main points
✔️ Proposed a neural architecture search (NAS) method that considers compatibility
✔️ Showed the impact of architecture on compatibility and the effectiveness of the proposed method
✔️ Maximized efficiency with minimal accuracy degradation in image retrieval systems!
Compatibility-aware Heterogeneous Visual Search
written by Rahul Duggal, Hao Zhou, Shuo Yang, Yuanjun Xiong, Wei Xia, Zhuowen Tu, Stefano Soatto
(Submitted on 13 May 2021)
Comments: Accepted by CVPR 2021.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The images used in this article are either from the paper or created based on it.
Introduction
This paper was accepted at CVPR 2021. In image retrieval systems deployed as real services, the gallery set and the queries commonly have different requirements for embedding: queries are usually processed under resource constraints and, on top of that, basically have to be handled in real time. However, image retrieval systems are generally homogeneous systems where the same embedding model is used for both the gallery set and the queries.
There are two possible designs for this case: if we use a large model tailored to the gallery set (orange in the figure above), we sacrifice efficiency in exchange for high accuracy. On the contrary, if we use a smaller model for the query (green in the figure above), we sacrifice accuracy for high efficiency. Thus, there is a trade-off between accuracy and efficiency in configuring an image retrieval system.
Here, the authors aim for a Heterogeneous system composed of different embedding models, where the gallery set is embedded with a large model and the queries are embedded with a compact model. This configuration can deliver both accuracy and efficiency. For it to work, the embedding models must be compatible with each other, which is where the authors' previous work, Backward-Compatible Training (BCT), comes into play. In addition, the authors show that BCT can be combined with Neural Architecture Search (NAS) to discover architectures that achieve state-of-the-art compatibility accuracy.
BCT is a method for backward-compatible representation learning, and it is the basic technology underlying this work. I have introduced it in an earlier article, so reading that first will make this one easier to follow.
The background is as follows. When the embedding model in an image (vector) retrieval system is updated, the separately trained old and new models are usually not compatible and do not necessarily share the same embedding space (the spaces are usually completely different), so embeddings cannot be compared before and after the update, as shown in the figure below. BCT solves this by learning representations so that the embeddings are compatible. The details are omitted here.
The same problem arises when creating a compact embedding model for queries from a large embedding model for the gallery set, and this paper addresses it by using BCT as the basic technology. The subsequent sections also judge compatibility by the same criterion as the BCT paper.
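To make the incompatibility problem concrete, here is a minimal sketch in plain Python. It is not the authors' evaluation code: `accuracy` is a toy top-1 nearest-neighbour stand-in for the accuracy measure, `model_a` and `model_b` are made-up two-dimensional "embedding models", and the form of the compatibility rule in `is_compatible` (heterogeneous accuracy should exceed the homogeneous accuracy of the query model) is my reading of the criterion, so treat it as an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Embedder = Callable[[Vector], Vector]
Sample = Tuple[Vector, int]  # (raw input, identity label)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def accuracy(embed_query: Embedder, embed_gallery: Embedder,
             queries: List[Sample], gallery: List[Sample]) -> float:
    """Toy M(query model, gallery model): top-1 nearest-neighbour accuracy."""
    index = [(embed_gallery(x), label) for x, label in gallery]
    hits = 0
    for x, label in queries:
        q = embed_query(x)
        _, predicted = max(index, key=lambda item: cosine(q, item[0]))
        hits += int(predicted == label)
    return hits / len(queries)


def is_compatible(acc_hetero: float, acc_homo_query: float) -> bool:
    """Compatibility rule (assumed form): M(phi_q, phi_g) > M(phi_q, phi_q)."""
    return acc_hetero > acc_homo_query


# Two hypothetical models: each works on its own, but their spaces differ.
def model_a(x: Vector) -> Vector:
    return (x[0], x[1])


def model_b(x: Vector) -> Vector:
    return (x[1], x[0])  # same information, axes swapped


gallery = [((1.0, 0.0), 0), ((0.0, 1.0), 1)]
queries = [((0.9, 0.1), 0), ((0.2, 0.8), 1)]

homo_a = accuracy(model_a, model_a, queries, gallery)  # both sides use model_a
homo_b = accuracy(model_b, model_b, queries, gallery)  # both sides use model_b
hetero = accuracy(model_b, model_a, queries, gallery)  # query: b, gallery: a
```

In this toy data both homogeneous settings retrieve every query correctly, while the mixed setting retrieves none, so `is_compatible(hetero, homo_b)` is false. Closing exactly this kind of gap is what compatible training is for.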
Proposed method: CMP-NAS
This section introduces what is new in the proposed method compared with the previous work described above. As a further application of BCT, the authors realize the Heterogeneous image retrieval system discussed so far and, in addition, propose applying BCT to Neural Architecture Search (NAS). This keeps the query model as compact as possible with as little accuracy loss as possible.
Considering and verifying the concept of compatibility in architecture optimization is the novelty and a major contribution of this paper. The authors discuss two levels of compatibility, which I explain first below.
Given a gallery set embedding model $\phi_g$ and its classifier $K_g$, weight-level compatibility aims to learn the weights $w_q$ of the query model $\phi_q$ such that the compatibility rule is satisfied. In this case, the following objective function can be considered. Intuitively, it adds to the usual training loss a term computed with the gallery model's classifier $K_g$.
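As a sketch of what such an objective looks like, assuming the standard BCT formulation: the loss symbol $\mathcal{L}$ and the weighting factor $\lambda$ are notation I introduce here, $T$ is the training set, and the exact form in the paper may differ.

```latex
w_q^*, K_q^* = \arg\min_{w_q, K_q}
  \Big[ \mathcal{L}(w_q, K_q; T) + \lambda \, \mathcal{L}(w_q, K_g; T) \Big]
```

The first term is the ordinary classification loss of the query model with its own classifier $K_q$. The second term passes the query model's embeddings through the fixed gallery classifier $K_g$, which pulls them toward the gallery model's embedding space.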
As already discussed, the best compatibility performance is achieved with BCT. Some weight-level compatibility can also be achieved by fine-tuning from an existing model, but this has the disadvantage of placing tighter constraints on the architecture of the query model.
Next, we have the architecture-level compatibility. Given a gallery model $\phi_g$ and a classifier $K_g$, for a query model $\phi_q$, we aim to search for the architecture $a_q$ that is most compatible with the fixed gallery model.
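Written as a formula, this search could be sketched as follows. This is my own shorthand rather than an equation quoted from the paper: $\omega$ is the search space, $\phi_q(a_q)$ denotes the query model with architecture $a_q$, and $M(\cdot, \cdot)$ is the accuracy with the query model as the first argument and the gallery model as the second.

```latex
a_q^* = \arg\max_{a_q \in \omega} \; M\big(\phi_q(a_q), \phi_g\big)
```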
The need for architecture-level compatibility is motivated by the following two questions and the conclusions the authors draw from their experiments.
- Q1: To what extent does architecture affect compatibility?
- Q2: Can traditional NAS find a compatible architecture?
To answer these questions, the authors ran experiments on 40 architectures randomly sampled from the ShuffleNet search space, each with a cost of about 300 million FLOPs. Their conclusions are as follows.
- A1: Figure (a) above plots the Heterogeneous-setting accuracy of these architectures, trained with BCT, against their FLOPs. The large spread in accuracy at the same FLOPs shows that the architecture does have a measurable impact on accuracy in the Heterogeneous setting, that is, on compatibility.
- A2: Figure (b) above compares, for the same architectures, normal training with BCT training: the accuracy of the Homogeneous setting is plotted against the accuracy of the Heterogeneous setting. The correlation between the two accuracies is low, indicating that traditional NAS may not succeed in finding the most compatible architecture.
Also, from the above figure (c), we can see that the correlation between the accuracy of Homogeneous and Heterogeneous settings is higher when using BCT. From this, I think we can expect that applying BCT to conventional NAS will enable compatible architecture exploration.
Algorithm of CMP-NAS
As mentioned, it is desirable to ensure both weight-level and architecture-level compatibility when creating a compact model for queries. Conventional NAS and distillation, which are commonly used to make models lightweight, ensure neither kind of compatibility, but the authors solve this problem by combining NAS with BCT. CMP-NAS is described in detail below, but the algorithm itself is simple.
- First step: for query architecture $a_q$, use training set $T$ and perform training with BCT to obtain optimized weights $w^*_q$ and classifier $K^*_q$.
- Second step: find the optimal query architecture in the search space $\omega$ by maximizing the reward $R$, evaluated on the validation set using the $w^*_q$ and $K^*_q$ obtained in the first step.
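The two steps above can be sketched as a small search loop. The following is a generic evolutionary search in plain Python, not the authors' implementation: the tuple encoding of an architecture, the hyperparameters (population size, mutation probability, and so on), and `toy_reward` are all assumptions made for illustration. In the real method, `reward_fn` would stand for "obtain BCT-trained weights for $a_q$, then measure $R$ on the validation set".

```python
import random
from typing import Callable, List, Tuple

Arch = Tuple[int, ...]  # hypothetical encoding: one block choice per layer


def evolutionary_search(
    reward_fn: Callable[[Arch], float],
    num_layers: int,
    num_choices: int,
    population_size: int = 20,
    generations: int = 10,
    top_k: int = 5,
    mutation_prob: float = 0.1,
    seed: int = 0,
) -> Tuple[Arch, float]:
    """Return the best architecture evaluated during the search and its reward."""
    rng = random.Random(seed)

    def random_arch() -> Arch:
        return tuple(rng.randrange(num_choices) for _ in range(num_layers))

    def mutate(arch: Arch) -> Arch:
        # Resample each layer's choice independently with probability mutation_prob.
        return tuple(
            rng.randrange(num_choices) if rng.random() < mutation_prob else gene
            for gene in arch
        )

    def crossover(a: Arch, b: Arch) -> Arch:
        # Pick each layer's choice from one of the two parents at random.
        return tuple(rng.choice(pair) for pair in zip(a, b))

    population: List[Arch] = [random_arch() for _ in range(population_size)]
    best_arch, best_reward = population[0], float("-inf")
    for _ in range(generations):
        # Second step of the text: score every candidate with the reward R.
        scored = sorted(((reward_fn(a), a) for a in population), reverse=True)
        if scored[0][0] > best_reward:
            best_reward, best_arch = scored[0]
        parents = [arch for _, arch in scored[:top_k]]
        children: List[Arch] = []
        while len(children) < population_size - len(parents):
            if rng.random() < 0.5:
                children.append(mutate(rng.choice(parents)))
            else:
                children.append(crossover(rng.choice(parents), rng.choice(parents)))
        population = parents + children
    return best_arch, best_reward


# Toy stand-in for "train a_q with BCT, then measure R on the validation set":
# the reward is higher the closer the architecture is to a made-up target.
TOY_TARGET: Arch = (1, 0, 2, 1)


def toy_reward(arch: Arch) -> float:
    return -float(sum(g != t for g, t in zip(arch, TOY_TARGET)))


best_arch, best_reward = evolutionary_search(toy_reward, num_layers=4, num_choices=3)
```

Keeping the top candidates as parents in every generation means the best reward seen never decreases, which is a common design choice for this kind of search.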
In addition, the following three reward functions $R$ are examined.
Here, $M(\cdot, \cdot)$ denotes accuracy, where the first argument is the model used for query embedding and the second is the model used for gallery embedding. $R_1$ is the baseline reward in the Homogeneous setting, $R_2$ is the accuracy in the Heterogeneous setting, and $R_3$ is designed to include both accuracies.
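As a minimal sketch, the three rewards could be written as below. `acc_homo` stands for $M(\phi_q, \phi_q)$ and `acc_hetero` for $M(\phi_q, \phi_g)$. Combining the two accuracies in $R_3$ by a product is my assumption about how "including both" is realized, so check the paper for the exact definition.

```python
def reward(kind: str, acc_homo: float, acc_hetero: float) -> float:
    """Reward for one candidate query architecture.

    acc_homo   = M(phi_q, phi_q): query model on both sides (Homogeneous).
    acc_hetero = M(phi_q, phi_g): query model vs. gallery model (Heterogeneous).
    """
    if kind == "R1":
        return acc_homo
    if kind == "R2":
        return acc_hetero
    if kind == "R3":
        # Assumed combination: the product is high only when both are high.
        return acc_homo * acc_hetero
    raise ValueError(f"unknown reward: {kind!r}")
```

For example, `reward("R3", 0.5, 0.25)` evaluates to `0.125` under this assumed product form.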
Incidentally, the authors' experimental results show that $R_3$ is the best. The authors also use a ShuffleNet-based super-network as the search space and an evolutionary algorithm (EA) as the search strategy. For more details, please refer to the paper.
Now, we will briefly introduce the experiments conducted by the authors. Experiments are conducted to show the effectiveness of the Heterogeneous system proposed by the authors on two tasks: face retrieval and fashion item retrieval.
First, figure (a) below compares the architecture obtained by CMP-NAS with other lightweight architectures on face retrieval. The Paragon setting is the accuracy when both the gallery and the queries are embedded with a huge model that ignores resource constraints (ResNet-101 is used in this paper). The Homogeneous setting uses the architecture shown on the horizontal axis for both the gallery and the queries. The Heterogeneous setting embeds the gallery set with ResNet-101 and the queries with the architecture on the horizontal axis.
The results show that the architecture obtained by CMP-NAS is the closest to Paragon and that the compatibility of the architectures significantly improves the accuracy. It also shows the effectiveness of the Heterogeneous system as it outperforms the Homogeneous system in all architecture cases.
Figure (b) above shows the experimental results for fashion item retrieval, where the proposed method again performs best.
The metrics below for the face retrieval task also take inference cost into account. They show that the searched architecture achieves the same efficiency as the baseline (MobileNetV2) while significantly improving accuracy over it.
As I have mentioned, this is a very effective method for dealing with the trade-off between accuracy and efficiency in vector search systems. Personally, I find the idea of resolving such a trade-off very interesting, and the paper is all the more appealing because of its great merit in practical applications.