
GAIA, A Transfer Learning System That Can Handle Any Downstream Task

Transfer Learning

3 main points
✔️ Focusing on the field of object detection, we propose a transfer learning system called GAIA
✔️ Combining transfer learning and weight-sharing training, we discover an efficient and reliable approach that simultaneously generates powerful pre-trained models for various architectures and finds suitable architectures for downstream tasks
✔️ Confirmed promising results on UODB, which includes datasets such as COCO, Objects365, Open Images, Caltech, CityPersons, KITTI, VOC, WiderFace, DOTA, Clipart, and Comic

GAIA: A Transfer Learning System of Object Detection that Fits Your Needs
written by Xingyuan Bu, Junran Peng, Junjie Yan, Tieniu Tan, Zhaoxiang Zhang
(Submitted on 21 Jun 2021)
Comments: CVPR2021.

Subjects: Computer Vision and Pattern Recognition(cs.CV)

code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, transfer learning built on pre-training with large datasets has played an important role in computer vision and natural language processing.
However, many application scenarios have unique requirements, such as latency constraints and special data distributions, which make running large-scale pre-training for each task-specific requirement prohibitively expensive.

Hence, this paper focuses on the field of object detection and proposes a transfer learning system called GAIA that automatically and efficiently generates customized solutions for downstream tasks.

GAIA provides strong pre-trained weights, selects a model that fits downstream requirements such as latency constraints and specified data domains, and can collect relevant data when there are very few data points for a task.

The specific contributions of this paper are as follows.

  • We show that a successful combination of transfer learning and weight-sharing training can simultaneously produce powerful pre-trained models for a variety of architectures.
  • We also propose an efficient and reliable approach to finding a suitable architecture for a downstream task.
    Through this pre-training and task-specific architecture selection, GAIA achieves surprisingly good results on 10 downstream tasks without dedicated hyperparameter tuning.
  • GAIA can discover relevant data from as few as two images per category in a downstream task to support fine-tuning.
    This increases the usefulness of GAIA in data-poor environments.

GAIA has obtained promising results on UODB, a benchmark that includes COCO, Objects365, Open Images, Caltech, CityPersons, KITTI, VOC, WiderFace, DOTA, Clipart, and Comic.
Taking COCO as an example, GAIA can efficiently produce models covering a wide latency range from 16 ms to 53 ms, with APs from 38.2 to 46.5.

In the following sections, we briefly review transfer learning as background, and then describe the proposed method, the experiments, and the results.

What is transfer learning?

In transfer learning, we reuse a trained model by transferring parameters from a model that has already been trained on some data.
The data learned in advance is called the source data, and the model trained on it is called the source model.
The data to be learned next is called the target data, and the model to be trained on it is called the target model.
Through training, the source model learns to detect features of the source data; by reusing it, training on the target data starts from a state in which features shared with the source data can already be detected, which makes it possible to build a highly accurate model with only a small amount of training.

As shown in the figure below, there are two styles of transfer learning: one freezes the parameters of the source model and trains only the newly added layers on the target data, while the other retrains all layers on the target data.
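To make the two styles concrete, here is a minimal PyTorch sketch (our own illustration, not code from the GAIA paper); torchvision's ImageNet-pretrained ResNet-50 stands in for the source model, and `num_target_classes` is the number of classes in the target task.

```python
import torch.nn as nn
from torchvision import models

def build_frozen_transfer(num_target_classes):
    """Style 1: keep the source model's parameters fixed, train only the new head."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    for p in model.parameters():
        p.requires_grad = False                 # source features stay frozen
    # Replace the classifier head; only this layer is updated on the target data.
    model.fc = nn.Linear(model.fc.in_features, num_target_classes)
    return model

def build_full_finetune(num_target_classes):
    """Style 2: retrain all layers on the target data."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_target_classes)
    return model  # every parameter keeps requires_grad=True
```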

Proposed Method

We introduce GAIA, a transfer learning framework, and its detailed implementation.

GAIA consists of two main components: task-agnostic unification and task-specific adaptation.
In task-agnostic unification, we collect data from multiple sources and build a large data pool with a unified label space.
Then, by using a technique called weight-sharing learning for training supernets, we can collectively optimize models from various architectures.

In task-specific adaptation, GAIA searches for the optimal architecture for a given downstream task, initializes the network with weights extracted from the pre-trained supernet, and fine-tunes it with downstream data.
This process is called "Task-Specific Architecture Selection" (TSAS).

Also, when a task has only a small amount of data, GAIA can collect the data most correlated with that task as relevant data from the large data pool.
This process is called "Task-Specific Data Selection" (TSDS).

Unification of label space and construction of a huge data pool

The proposed method builds a huge data pool with a unified label space $\cup L$ that merges multiple independent datasets, so that when a downstream task has few data points, the data most correlated with the task can be collected from this pool as relevant data.

Let $N$ be the number of datasets. We define the datasets $D$ and their corresponding label spaces $L$ as $D=\{d_1,d_2,\dots,d_N\}$ and $L=\{l_1,l_2,\dots,l_N\}$.
The $i$-th element of $L$ is $l_i=\{c_{i1},c_{i2},\dots,c_{i|l_i|}\}$,
where $c_{ij}$ denotes the $j$-th category of dataset $d_i$.


To construct the unified label space $\cup L$, we initialize it with the largest label space among $L$.
Next, we map the other label spaces onto $\cup L$.
For the $p$-th category $c_{ip}$ of dataset $d_i$, if its word2vec similarity to an existing category in $\cup L$ exceeds the threshold of 0.8, we mark the two as the same category.
If the similarity is below the threshold, we mark $c_{ip}$ as a new category and add it to $\cup L$.
Applying this procedure to all datasets yields a huge data pool.
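As a concrete illustration, the following is a minimal sketch of this merging step (our own simplification, not the authors' code). It assumes `word_vec` is a dictionary mapping each category name to a pre-computed word2vec embedding.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def unify_label_spaces(label_spaces, word_vec, threshold=0.8):
    """label_spaces: one list of category names per dataset."""
    # Initialize the unified space with the largest label space.
    label_spaces = sorted(label_spaces, key=len, reverse=True)
    unified = list(label_spaces[0])
    mapping = {}  # (dataset index, category) -> category in the unified space
    for i, space in enumerate(label_spaces[1:], start=1):
        for c in space:
            # Most similar category already present in the unified space.
            best_sim, best_cat = max((cosine(word_vec[c], word_vec[u]), u) for u in unified)
            if best_sim > threshold:
                mapping[(i, c)] = best_cat   # treat as the same category
            else:
                unified.append(c)            # register as a new category
                mapping[(i, c)] = c
    return unified, mapping
```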

Task-Specific Architecture Selection (TSAS)

TSAS searches for the optimal architecture for a given downstream task, initializes the network with weights extracted from the pre-trained supernet, and fine-tunes it with downstream data.
The supernet itself is trained on the huge data pool constructed by the procedure described in the previous section.

The specific TSAS algorithm is as follows.

  • Randomly sample five models for each combination of input scale and depth, keep those that satisfy the target-domain and computational-cost constraints, and directly evaluate the selected models
  • Among them, the models ranked in the top 50% of the direct evaluation are fast fine-tuned (a 1-epoch warm-up followed by 2 epochs of training), and the optimal architecture is selected based on these results (a sketch of this two-stage procedure follows the list)
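The sketch below only illustrates the control flow of the two stages; all helpers (`sample_subnet`, `meets_constraints`, `direct_eval`, `fast_finetune`) are hypothetical placeholders standing in for GAIA's supernet machinery, not its actual API.

```python
def tsas_select(input_scales, depths, sample_subnet, meets_constraints,
                direct_eval, fast_finetune, n_per_cell=5, keep_ratio=0.5):
    # Stage 1: sample candidates per (input scale, depth) combination, keep those
    # that satisfy the constraints, and evaluate them directly with weights
    # inherited from the supernet.
    candidates = []
    for scale in input_scales:
        for depth in depths:
            for _ in range(n_per_cell):
                net = sample_subnet(scale, depth)
                if meets_constraints(net):
                    candidates.append((direct_eval(net), net))
    candidates.sort(key=lambda pair: pair[0], reverse=True)

    # Stage 2: fast fine-tune the top 50% (1-epoch warm-up + 2 epochs) and
    # return the architecture with the best fine-tuned score.
    top = candidates[: max(1, int(len(candidates) * keep_ratio))]
    scores = [(fast_finetune(net, warmup_epochs=1, epochs=2), net) for _, net in top]
    return max(scores, key=lambda pair: pair[0])[1]
```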

We select representative models for each combination of input scale and depth first because, as shown in Fig. 1, models with similar input scales and depths achieve similar accuracy.

Figure 1: Performance of models with similar input scales and depths on the COCO dataset.

Task-Specific Data Selection (TSDS)

TSDS collects the data most correlated with a given task as relevant data from the huge data pool when the task has only a small amount of data.

Data selection can be posed as follows: given a large upstream dataset $D_s$ and a task-specific (downstream) dataset $D_t$, we look for the subset $D_s^* \in P(D_s)$ (where $P(D_s)$ is the power set of $D_s$) that minimizes the risk of the model $\mathcal{F}$ on $D_t$. This can be written as Equation 1 below.

$D_s^* = \underset{D_s^* \in P(D_s)}{\arg \min} \ \mathbb{E}_{D_t}\left[\mathcal{F}\left(D_t \cup D_s^*\right)\right] \qquad (1)$

where $\mathcal{F}(D_t \cup D_s^*)$ is the model trained on $D_t$ and $D_s^*$, and $\mathbb{E}_{D_t}$ denotes the risk evaluated on the validation set of $D_t$.

The specific data selection algorithm is as follows.

  • For each image in $D_s=\left\{I_{s_1}, I_{s_2}, \cdots, I_{s_P}\right\}$ and $D_t=\left\{I_{t_1}, I_{t_2}, \cdots, I_{t_Q}\right\}$, compute a representation vector for each class
  • The representation vector $V(I_{s_i},c_{\cup q})$ is obtained by averaging the outputs of the model's fully connected layers for each class of the image
  • Find the most relevant data for each class using the cosine similarity between $V(I_{s_i},c_{\cup q})$ and $V(I_{t_j},c_{\cup q})$
  • For each $I_{t_j}$, either select the top $k$ images from $D_s$ (top-k) or collect the most similar images over all $P\times Q$ pairs (most-similar)
  • Continue the top-k or most-similar selection until the target number of images is collected, e.g. $|D_s^*|=1000$ (a sketch of the top-k variant follows this list)
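Below is a minimal sketch of the top-k variant (our own illustration, not GAIA's implementation). It assumes `repr_vec` is a dictionary mapping each (image id, class name) pair to its representation vector $V(\cdot,\cdot)$ and that it is defined for every pair the loop touches.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def tsds_topk(source_ids, target_ids, classes, repr_vec, k=5, budget=1000):
    """Collect up to `budget` upstream images most similar to the downstream data."""
    selected = []
    for t in target_ids:
        for c in classes:
            # Rank upstream images of class c by similarity to the downstream
            # image's class representation, and keep the top k.
            ranked = sorted(source_ids,
                            key=lambda s: cosine(repr_vec[(s, c)], repr_vec[(t, c)]),
                            reverse=True)
            selected.extend(ranked[:k])
            unique = list(dict.fromkeys(selected))   # de-duplicate, keep order
            if len(unique) >= budget:
                return unique[:budget]
    return list(dict.fromkeys(selected))[:budget]
```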

Experimental Setup

GAIA was trained under a unified label space using Open Images, Objects365, MS-COCO, Caltech, and CityPersons.
Open Images, Objects365, and MS-COCO are common detection datasets with 500, 365, and 80 classes, respectively.

Each dataset is used in this paper as follows.

  • Open Images 2019 challenge: 1.7M images for training, 40k images for validation
  • Objects365: 600k images for training, 30k images for validation
  • COCO: 115k-image subset for training, 5k images for validation
  • Caltech: 42k images for training, 4k images for validation
  • CityPersons: 3k images for training, 0.5k images for validation

The unification of these upstream datasets resulted in a unified label space of 700 classes.

As a downstream task, we also performed extensive experiments on the Universal Object Detection Benchmark (UODB).
We followed the standard data partitioning (training/validation splits) and metrics for all 10 diverse sub-datasets that make up UODB.

Results and Discussion

Using the COCO dataset, we conducted experiments to confirm that GAIA can generate high-quality models through data selection and through the search for and use of the optimal architecture.
First, we compare the results of training ResNet50 and ResNet101 with different weight initializations.

As shown in Figure 2, the model with GAIA pre-training achieved a significant improvement over the model with ImageNet pre-training, 5.83% for ResNet50 and 6.66% for ResNet101.
Also, since the COCO data is included in the supernet's data pool, for fairness we also compared GAIA against models trained three times longer (3x schedule) from ImageNet pre-training (results marked with *).
Even in this comparison, GAIA showed significant improvements of 3.23% and 4.22%, indicating that data from sources other than COCO is beneficial.

Figure 2: Accuracy comparison of GAIA pre-trained models with ImageNet pre-trained models.

GAIA can also efficiently generate models with a wide latency range.
Apart from ResNet50 and ResNet101, customized architectures have no pre-trained weights and would otherwise have to be trained from scratch.
As shown in Figure 3, models trained with GAIA outperform models trained from scratch in similar training times by an average of 12.67%.

Figure 3: Accuracy of GAIA on the COCO dataset per training time vs. accuracy when learning from scratch.

To evaluate the generality of GAIA, we conducted a transfer learning experiment to UODB.
Figure 4 shows the results (average accuracy).
From Fig. 4, we can see that GAIA shows excellent accuracy on all datasets.

We also find that COCO pre-training improves by 2.9%.
These results are sufficient to validate the effectiveness of GAIA.
GAIA achieves an average performance improvement of 4.4% with pre-training with a unified label space and large data set.
In addition, GAIA's TSAS yields a further 2.5% improvement overall.

Figure 4: Accuracy of the transfer learning experiments on UODB.

As shown in Figure 5, the average accuracy of GAIA without TSDS exceeded the COCO pre-trained baseline by 5.6%.
In data selection, it is important to pick relevant data.
Randomly selected data can actually hurt accuracy, because it includes a lot of out-of-domain data that inhibits learning in the target domain.

In addition, top-k and most-similar lead to average accuracy improvements of 0.8% to 2.3% over the case without TSDS, confirming the effectiveness of GAIA's data selection.

Figure 5: Comparison of the average accuracy of GAIA with and without TSDS (top-k and most-similar) and the accuracy of COCO's pre-trained baseline

Summary

In this study, we reconsidered the effective generalizability of transfer learning and its adaptation to downstream tasks and proposed a transfer learning system called GAIA, which can automatically and efficiently generate customized solutions according to downstream tasks.

GAIA comprises two methods: "Task-Specific Architecture Selection" (TSAS), which searches for the optimal architecture for a downstream task, and "Task-Specific Data Selection" (TSDS), which collects the data most correlated with the task as relevant data from the data pool when the task has little data. Their effectiveness was confirmed in experiments under various conditions.

The specific contributions of this paper are as follows.

  • We show that a successful combination of transfer learning and weight-sharing training can simultaneously produce powerful pre-trained models for a variety of architectures.
  • We also propose an efficient and reliable approach to finding a suitable architecture for a downstream task.
    Through pre-training and task-specific architecture selection, GAIA achieves excellent results on 10 downstream tasks without dedicated hyperparameter tuning.
  • GAIA can discover relevant data from as few as two images per class in a downstream task to support fine-tuning.
    This increases the usefulness of GAIA in data-poor environments.
