# EXT5: Extreme Multi-Task Scaling for Transfer Learning

3 main points
✔️ Examines the effect of large-scale multi-task learning on natural language processing models
✔️ Proposes EXMIX, a diverse set of supervised tasks
✔️ Proposes EXT5, a model combining supervised multi-task pre-training with self-supervised pre-training

Written by Vamsi, Yi, and Donald
(Submitted on 22 Nov 2021 (v1), last revised 29 Jan 2022 (this version, v2))

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

## Introduction

Multi-task learning and transfer learning have been used successfully in natural language processing, but the effect that the choice of pre-training tasks has on model performance remains unclear.

For example, does a larger number of tasks during pre-training improve downstream performance? Or do tasks need to be carefully selected during pre-training to perform well on a particular downstream task?

In the paper presented in this article, we address these questions by introducing EXMIX (EXtreme MIXture), a collection of 107 supervised natural language processing tasks, and investigate various aspects of the effectiveness of multi-task pre-training. We also propose EXT5, a model trained using EXMIX that outperforms T5 on various tasks.

An important goal of the paper is to investigate the effects of the number of tasks and other factors in multi-task pre-training.

For this purpose, we introduce a collection called EXMIX (EXtreme MIXture), consisting of 107 diverse English NLP tasks with a total of 18M examples. The breakdown is shown in the table below.

The figure below shows the datasets arranged in ascending order of size; when sampling from EXMIX, the sampling rate for each dataset is determined by its size.

Note that to keep large datasets from dominating, each dataset's size is capped at $3 \times 10^5$ examples when computing its sampling rate.
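As a minimal sketch of this capped, size-proportional sampling (the dataset names and sizes below are hypothetical, not values from the paper):

```python
# Sketch of size-proportional sampling with a cap, as used for EXMIX:
# each dataset's effective size is limited to 3 * 10**5 examples so
# that very large datasets cannot dominate the mixture.
CAP = 3 * 10**5

def sampling_rates(dataset_sizes, cap=CAP):
    """Return per-dataset sampling probabilities proportional to min(size, cap)."""
    effective = {name: min(size, cap) for name, size in dataset_sizes.items()}
    total = sum(effective.values())
    return {name: size / total for name, size in effective.items()}

# Hypothetical dataset sizes for illustration:
rates = sampling_rates({"tiny_task": 10_000, "mid_task": 200_000, "huge_task": 5_000_000})
# huge_task counts as only 300k examples, so it no longer dwarfs the rest
```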

In the following sections, we use this EXMIX to perform a variety of experiments on multitask learning.

## Diverse experiments on multi-task learning

The ultimate goal is to answer questions such as "Are there tasks that negatively affect downstream performance (and should therefore be excluded from multi-task pre-training)?" and "Are there task sets within EXMIX that are especially effective for obtaining better representations?" Nevertheless, it is impractical to experiment with every combination of task sets during pre-training and transfer learning, so the paper addresses these questions through a variety of experiments.

First, we create eight task families from the tasks in EXMIX and investigate the relationships between task families during transfer learning.

We then test whether performance on one task family improves or degrades when it is trained simultaneously with another. The task families are as follows.

Each of the eight task families consists of three representative datasets. For every pair of task families, we fine-tune a pre-trained model on the six datasets they contain and evaluate its performance.

The two task families in each pair are sampled at a 1:1 ratio, and fine-tuning runs for a total of 200k steps. The results are as follows.

The entry in row $i$ and column $j$ of the table shows the average performance on task family $j$ of a model fine-tuned on the task-family pair $(i, j)$.

The rightmost column shows how much, on average, training one task family simultaneously with another improves the other's performance.
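As a minimal sketch of how this table can be summarized (the family names are a subset of the paper's, but all scores are made-up placeholders, not the paper's results):

```python
# Hypothetical illustration of reading the co-training table: entry
# table[(i, j)] is the average score on family j after fine-tuning on
# the pair {i, j}; diagonal entries are the single-family baselines.
# All scores are made-up placeholders.
families = ["NLI", "CMNS", "SUM"]
table = {
    ("NLI", "NLI"): 80.0, ("NLI", "CMNS"): 71.0, ("NLI", "SUM"): 40.0,
    ("CMNS", "NLI"): 81.0, ("CMNS", "CMNS"): 70.0, ("CMNS", "SUM"): 41.0,
    ("SUM", "NLI"): 78.0, ("SUM", "CMNS"): 68.0, ("SUM", "SUM"): 42.0,
}

def average_gain(co_family):
    """Mean change in the other families' scores when co-trained with
    co_family, relative to their single-family diagonals (i.e., the
    rightmost column of the table)."""
    others = [f for f in families if f != co_family]
    return sum(table[(co_family, f)] - table[(f, f)] for f in others) / len(others)
```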

The diagonal entries (training on a single task family) are reported both for 100k steps (constant data budget) and for 200k steps (constant computation budget). The experiments show that while certain task-family pairs improve performance (e.g., co-training with NLI often helps), performance typically degrades overall.

Specifically, compared to training on a single task family, performance deteriorated in 21/56 cases under the same data budget and in 38/56 cases under the same computation budget. We also found that relationships differ across task families; for example, the summarization (SUM) family often degrades the performance of the others.

In addition, the correlations between the three datasets within each task family are as follows.

As the figure shows, although the correlations are positive overall, some are negative even between datasets belonging to the same task family. These results indicate that multi-task transfer learning on pre-trained models does not necessarily improve performance.

Next, we consider whether the relationships between tasks observed in the fine-tuning experiments above can be used to find effective task sets for multi-task pre-training. The previous experiment showed that some task families, e.g., NLI and CMNS, improved performance when trained simultaneously with other task families (see the rightmost column of the table).

Here, we select the 48 tasks classified as NLI, CMNS, CLS, and CBWA, the families that improve the performance of other tasks, and use them for pre-training. The results are as follows.

This result appears in the Best-effort row of the table; it is worse than the average over randomly selected task subsets (Random-55) and than EXMIX (all tasks).

This suggests that multi-task transfer learning and multi-task pre-training are different problems, and that including more diverse tasks during pre-training gives better results, even when some of those tasks have a negative effect in multi-task transfer learning.

### Multi-task pre-training vs. pre-finetuning

There is also a method called pre-finetuning, which uses multi-task learning as an intermediate step between pre-training and fine-tuning.

We now consider pre-finetuning with EXMIX and then fine-tuning on SuperGLUE, starting from standard pre-trained T5 checkpoints. The results are as follows.

As a result, we find that multi-task pre-training is significantly better when compared at the same overall computational cost (Compute in the table: the number of tokens processed).
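As a rough sketch of this kind of token-based compute accounting (the step counts, batch size, and sequence length below are made-up illustrative numbers, not the paper's settings):

```python
# Token-based compute accounting: total tokens processed is roughly
# steps * batch_size * sequence_length, summed over training stages.
# All numbers below are illustrative only.
def tokens_processed(stages):
    """Each stage is a (steps, batch_size, seq_len) tuple."""
    return sum(steps * batch * seq for steps, batch, seq in stages)

# Multi-task pre-training as a single stage vs. pre-training followed by
# an extra pre-finetuning stage: the second pipeline costs more compute.
one_stage = tokens_processed([(1_000_000, 128, 512)])
two_stage = tokens_processed([(1_000_000, 128, 512), (200_000, 128, 512)])
```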

### On Mixing Labeled Data with Self-Supervised Pre-Training

Next, we experiment with mixing the labeled EXMIX data into the Colossal Clean Crawled Corpus (C4) used for the self-supervised pre-training of the T5 model. The results are as follows.

This figure shows results for the EXT5 model (details in later sections) as the hyperparameter R is varied, where the mixture contains R C4 examples for every EXMIX example.

Here R → ∞ corresponds to pure C4 (dashed line in the figure) and R = 0 to pure EXMIX. Overall, we find that in some cases mixing EXMIX into self-supervised pre-training improves performance.
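A minimal sketch of this ratio-R mixing, with infinite generators standing in for the real C4 and EXMIX datasets (the function name and data are hypothetical):

```python
import itertools
import random

# Mix self-supervised "C4" examples with supervised "EXMIX" examples at
# ratio R: on average, R C4 examples per EXMIX example.
def mixed_examples(c4_iter, exmix_iter, R, seed=0):
    """Yield examples, drawing from C4 with probability R / (R + 1).
    R -> infinity recovers pure C4; R = 0 recovers pure EXMIX."""
    rng = random.Random(seed)
    p_c4 = R / (R + 1)
    while True:
        yield next(c4_iter) if rng.random() < p_c4 else next(exmix_iter)

# Hypothetical stand-ins for the real datasets:
c4 = (("c4", i) for i in itertools.count())
exmix = (("exmix", i) for i in itertools.count())
stream = mixed_examples(c4, exmix, R=2)
sample = [next(stream)[0] for _ in range(3000)]
# with R = 2, roughly two thirds of the drawn examples come from C4
```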

However, performance is significantly worse at R = 0, which also shows the importance of self-supervised learning.

### Does a larger number of tasks during pre-training improve performance?

Next, we investigate how much model performance varies with the number of tasks during multitask pre-training.

Below is the average performance (over 3 random seeds) when pre-training on 30, 55, and 80 randomly selected tasks and fine-tuning on SuperGLUE.

The results show that with large batch sizes, performance improves as the number of tasks grows.

However, this tendency is less pronounced with small batch sizes, possibly because multi-task learning introduces noise.
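The subset-drawing part of this setup can be sketched as follows (ALL_TASKS is a hypothetical stand-in for the EXMIX task list, and the seed values are illustrative):

```python
import random

# Sketch of the "number of tasks" ablation setup: draw random subsets of
# 30, 55, and 80 tasks, each with 3 seeds, then average results per size.
ALL_TASKS = [f"task_{i}" for i in range(107)]

def random_subsets(sizes=(30, 55, 80), seeds=(0, 1, 2)):
    """Return one task subset per (size, seed) pair; downstream scores
    would be averaged over the seeds for each subset size."""
    return {(n, s): random.Random(s).sample(ALL_TASKS, n)
            for n in sizes for s in seeds}

subsets = random_subsets()
```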

### Increased sample efficiency with EXMIX

We also investigate the sample efficiency of pre-training with EXMIX.

Here, we pre-train for 200k steps on EXMIX excluding SuperGLUE and fine-tune on SuperGLUE from intermediate checkpoints during pre-training.

The comparison results between EXT5 and T5 are shown below.

As shown in the figure, large-scale multi-task learning leads to improved sample efficiency compared to self-supervised pre-training.

## EXT5 Model

Finally, we discuss the EXT5 model, which is based on the T5 model and combines multi-task learning with EXMIX.

When pre-training the EXT5 model, we combine the labeled EXMIX data (as discussed in the previous experiment) with C4 (the Colossal Clean Crawled Corpus), which was used for the self-supervised pre-training of the T5 model. The hyperparameter $R$ controls the mixture so that it contains $R$ C4 examples for every EXMIX example.

The total number of pre-training steps is the same as for the T5 model. The learning rate during fine-tuning is set to $10^{-4}$ ($10^{-3}$ for T5).

### EXT5 experimental setup

In our experiments, we evaluate both tasks that are included in EXMIX and tasks that are not.

For the former, the goal is to investigate how the extreme number of pre-training tasks benefits performance on those tasks; for the latter, it is to measure generalization to unseen tasks.

### Experimental results

To begin, the results for tasks included in EXMIX are as follows.

In general, EXT5 performed as well as or better than T5.

In addition, the results for tasks not included within EXMIX are as follows.

The experimental results show that EXT5 outperforms T5 on all tasks. This held even for the NER and machine translation tasks, for which EXMIX contains no similar datasets.