Fine-Tuning To Enhance Zero-Shot Performance Of Language Models

3 main points
✔️ Proposes a simple method to improve the zero-shot performance of large language models
✔️ Proposes Instruction Tuning, which fine-tunes a model on natural language tasks phrased as simple instructions
✔️ The proposed FLAN model outperforms GPT-3's zero-shot performance on many tasks

Finetuned Language Models Are Zero-Shot Learners
written by Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
(Submitted on 3 Sep 2021 (v1), last revised 8 Feb 2022 (this version, v5))
Comments: Published on arXiv.

Subjects: Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Few-shot learning works surprisingly well with large language models such as GPT-3. Zero-shot learning, however, is much less successful: on tasks such as reading comprehension, question answering, and natural language inference, zero-shot performance is far below few-shot performance.

The paper presented in this article proposes a model called Finetuned LAnguage Net (FLAN), built with a simple method for improving the zero-shot performance of large language models. By applying a method called Instruction Tuning to a pre-trained model with 137B parameters, FLAN outperforms the 175B-parameter GPT-3 on various datasets.

FLAN: Improving Zero-Shot Learning Performance with Instruction Tuning

The proposed method, FLAN, uses a simple idea called Instruction Tuning.

Instruction Tuning fine-tunes a pre-trained language model to perform tasks that are described by natural language instructions. The aim is that, when an unseen task is given together with an instruction describing it, the model carries out the task according to that instruction.

As shown in (C) in the figure, by learning to follow instructions on a variety of tasks (B, C, D), the model can solve even an unseen task (A) from its instruction alone.

Tasks and Templates for Instruction Tuning

Instruction Tuning requires a dataset covering a variety of tasks, and templates for creating the instructions that describe each task.

About Tasks

First, 62 text datasets publicly available in TensorFlow Datasets are used for Instruction Tuning. They cover both language understanding and language generation tasks and are summarized in the following figure.

Here, each dataset is classified into 12 different task clusters.

During evaluation, these task clusters determine whether a task counts as unseen by the model: a dataset is treated as unseen only if no dataset from its cluster was used during Instruction Tuning.

For example, to evaluate NLI (natural language inference: upper left in the figure), a checkpoint is instruction tuned on all datasets except those in the NLI cluster, and that checkpoint is then evaluated on the NLI tasks.
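The held-out-cluster split described above can be sketched in a few lines. This is only an illustration: the `TASK_CLUSTERS` mapping is a small hypothetical subset, and the function name `split_by_held_out_cluster` is an assumption rather than anything taken from the paper.

```python
# Hypothetical cluster -> dataset mapping (a small illustrative subset only).
TASK_CLUSTERS = {
    "nli": ["rte", "anli", "cb"],
    "sentiment": ["imdb", "sst2"],
    "translation": ["wmt16_en_de"],
}


def split_by_held_out_cluster(clusters, held_out):
    """Return (train_datasets, eval_datasets) for one held-out cluster.

    Every dataset in the held-out cluster is excluded from training, so the
    evaluated tasks are unseen at the cluster level.
    """
    if held_out not in clusters:
        raise KeyError(f"unknown cluster: {held_out}")
    train = [
        dataset
        for name, datasets in clusters.items()
        if name != held_out
        for dataset in datasets
    ]
    return train, list(clusters[held_out])


train_sets, eval_sets = split_by_held_out_cluster(TASK_CLUSTERS, "nli")
print("instruction tune on:", train_sets)
print("evaluate on:", eval_sets)
```

Note that this implies one instruction-tuned checkpoint per evaluated cluster, since the training mixture changes with the held-out cluster.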

About Templates

To perform Instruction Tuning on these datasets, each example must be converted into text that includes an instruction describing the task. In the paper, 10 manually written templates are prepared for each dataset for this purpose. For example, they include the following.

In this figure, the template for the Natural language inference task is shown as an example.

These templates describe the task in forms such as "Can you conclude <Hypothesis> from the above <Premise>?" and "Read the following and determine if <Hypothesis> can be inferred from <Premise>." To increase the diversity of the dataset, some of the 10 templates reverse the task.

For example, for the task of determining whether a movie review is positive or negative, there is also a template that asks the model to generate a positive or negative review.
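As a rough sketch of how such templates might be applied, the following turns one NLI example into several instruction-style prompts and shows a reversed sentiment template. The template wording, the field names, and both helper functions are assumptions made for illustration, not the paper's exact text or code.

```python
# Illustrative templates only; the wording is an assumption, not the paper's exact text.
NLI_TEMPLATES = [
    "{premise}\n\nCan we conclude that {hypothesis}?",
    "Read the following and determine if the hypothesis can be inferred "
    "from the premise.\n\nPremise: {premise}\n\nHypothesis: {hypothesis}",
]


def render_instructions(example, templates):
    """Fill every template with the fields of one example."""
    return [template.format(**example) for template in templates]


def render_reversed_sentiment(example):
    """Reversed task: the label moves into the prompt, the review becomes the target."""
    prompt = f"Write a {example['label']} movie review."
    return prompt, example["review"]


nli_example = {
    "premise": "The cat is asleep on the sofa.",
    "hypothesis": "the cat is awake",
}
prompts = render_instructions(nli_example, NLI_TEMPLATES)
for prompt in prompts:
    print(prompt, end="\n---\n")

reversed_prompt, target = render_reversed_sentiment(
    {"label": "positive", "review": "A warm, funny film."}
)
print(reversed_prompt, "->", target)
```

Each training example can then be paired with one of its rendered prompts, so the same underlying data is seen under several phrasings.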

Classification and Generation Tasks

For a given task, the output space of the model is either one of several classes (classification task) or free text (generation task). For classification tasks, the model is trained with an explicit list of the output classes for that task, represented as <options> in the templates above.
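A minimal sketch of attaching the class list to a classification prompt is shown below. The `OPTIONS:` label, the bullet layout, and the helper name `add_options` are assumptions for illustration; the source only states that the output classes are listed explicitly.

```python
def add_options(prompt, classes):
    """Append the list of allowed output classes to a classification prompt."""
    if not classes:
        raise ValueError("a classification task needs at least one class")
    option_lines = "\n".join(f"- {name}" for name in classes)
    return f"{prompt}\n\nOPTIONS:\n{option_lines}"


prompt = add_options(
    "Is the following movie review positive or negative?\n\nA warm, funny film.",
    ["positive", "negative"],
)
print(prompt)
```

Listing the classes tells the model which answers are admissible, which matters because a free-form language model could otherwise phrase the same answer in many ways.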

Finally, Instruction Tuning is summarized in the following figure.

FLAN is built on LaMDA-PT, a pre-trained language model with 137B parameters.

During Instruction Tuning, the number of training examples per dataset is limited to 30k, and training takes about 60 hours on a TPUv3 with 128 cores. For evaluation, results are reported for the final checkpoint, trained for 30k steps. (See the original paper for details on other hyperparameters.)
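The per-dataset cap can be sketched as follows. How the examples are chosen when a dataset exceeds the cap is not stated in this article, so the seeded random subsampling here is an assumption, as are the names `cap_dataset` and `MAX_EXAMPLES_PER_DATASET`.

```python
import random

MAX_EXAMPLES_PER_DATASET = 30_000  # the 30k cap mentioned in the text


def cap_dataset(examples, limit=MAX_EXAMPLES_PER_DATASET, seed=0):
    """Return at most `limit` examples.

    Assumption: oversized datasets are subsampled uniformly at random with a
    fixed seed; small datasets are returned unchanged.
    """
    if len(examples) <= limit:
        return list(examples)
    rng = random.Random(seed)
    return rng.sample(examples, limit)


large = list(range(50_000))
small = ["a", "b", "c"]
print(len(cap_dataset(large)), len(cap_dataset(small)))
```

A cap like this keeps very large datasets from dominating the training mixture.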

Experimental Results

In the experiment, the following tasks are evaluated.

  • Natural language inference (NLI)
  • Reading comprehension
  • Closed-book QA
  • Translation
  • Commonsense reasoning
  • Coreference resolution
  • Struct-to-text

As mentioned earlier, the datasets are grouped into task clusters. Instruction Tuning is performed on all clusters other than the one being evaluated, and the experiments are then run on the unseen tasks in the evaluation cluster. The reported score for each dataset is the average of the performance over all of its templates.
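The per-dataset averaging over templates amounts to a simple mean. In the sketch below, the scores are placeholder numbers invented for illustration (they are not results from the paper), and `dataset_score` is a hypothetical helper name.

```python
from statistics import mean


def dataset_score(per_template_scores):
    """Average one dataset's metric over all of its evaluation templates."""
    if not per_template_scores:
        raise ValueError("need at least one template score")
    return mean(per_template_scores)


# Placeholder accuracies for three templates of one dataset (not real results).
placeholder_scores = [50.0, 60.0, 70.0]
print(dataset_score(placeholder_scores))
```

Averaging over templates reduces the dependence of the reported score on any single phrasing of the instruction.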

First, the zero-shot performance results in Natural language inference (NLI), Reading comprehension, Closed-book QA, and Translation are as follows.

FLAN showed strong overall results across all four of these task clusters. More detailed results including the other tasks (Commonsense reasoning, Coreference resolution, and Struct-to-text) are shown below.

Compared to the previous results, FLAN was not very effective on commonsense reasoning and coreference resolution, outperforming LaMDA-PT on only three of the seven tasks. This may be because Instruction Tuning is largely redundant when the downstream task closely resembles the original pre-training objective.

Impact of the number of task clusters

As an additional experiment, the paper first examines how the number of task clusters and tasks used for Instruction Tuning affects performance.

Here, NLI, Closed-book QA, and Commonsense reasoning are held out as the evaluation clusters, and the remaining seven clusters are used for Instruction Tuning. Clusters are added one at a time, from one up to seven, ordered by the number of tasks per cluster. The results are as follows.

Overall, performance improves as more clusters are added (the Sentiment analysis cluster is the exception).

The results also suggest that adding even more task clusters may improve performance further, since performance on the evaluation clusters does not appear to have saturated.
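The cluster-count experiment above can be sketched as building nested training sets, one per number of clusters. The cluster sizes below are hypothetical, and sorting in decreasing order of size is an assumption about the direction of "in order of the number of tasks per cluster".

```python
def incremental_cluster_sets(cluster_sizes):
    """Return nested lists of clusters: the first 1, the first 2, ... up to all.

    Assumption: clusters are added in decreasing order of their number of tasks.
    """
    ordered = sorted(cluster_sizes, key=cluster_sizes.get, reverse=True)
    return [ordered[:k] for k in range(1, len(ordered) + 1)]


# Hypothetical task counts per cluster, for illustration only.
sizes = {"summarization": 3, "translation": 5, "sentiment": 1}
for training_clusters in incremental_cluster_sets(sizes):
    print(len(training_clusters), training_clusters)
```

Each nested set would correspond to one instruction-tuning run, evaluated on the same held-out clusters.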

About Model Scale

Next, to investigate the relationship between Instruction Tuning and model scale, we conduct experiments on models with parameter sizes 422M, 2B, 8B, 68B, and 137B.

The results are as follows.

Interestingly, for models with 8B parameters or fewer, Instruction Tuning actually degrades performance on unseen tasks. This may be because, when model capacity is insufficient, the capacity is used up fitting the tasks seen during Instruction Tuning, leaving the model less able to generalize to new tasks.

About the Instruction Effect

Next, to investigate the effect of the instructions themselves, two ablations are examined: fine-tuning on inputs and outputs alone, with no instruction template, and fine-tuning with the task and dataset name prepended to the input. The results are as follows.

In both cases, performance was substantially lower than that of FLAN, indicating that training with instructions is important for zero-shot performance on unseen tasks.
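To make the three input formats concrete, the sketch below renders the same sentence under each setting. The bracketed prefix style, the instruction wording, and all function names are assumptions for illustration.

```python
def format_no_template(text):
    """Ablation 1: the raw input only, with no instruction."""
    return text


def format_dataset_name(text, dataset_name):
    """Ablation 2: task and dataset name prepended to the input."""
    return f"[{dataset_name}] {text}"


def format_instruction(text, template):
    """FLAN-style input: a natural language instruction wraps the input."""
    return template.format(text=text)


sentence = "The dog runs."
variants = {
    "no_template": format_no_template(sentence),
    "dataset_name": format_dataset_name(sentence, "Translation: WMT'14 to French"),
    "instruction": format_instruction(
        sentence, "Please translate this sentence to French: '{text}'"
    ),
}
for name, rendered in variants.items():
    print(f"{name}: {rendered}")
```

Only the last variant tells the model in natural language what to do, which is the property the ablation isolates.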

About the results in the Few-Shot setting

So far the focus has been on the Zero-Shot setting, but the paper also examines the Few-Shot setting, in which a small number of exemplars are given to the model. The results are as follows.

The Few-Shot setting improves performance on all task clusters, and the gains are especially large for tasks with a large and complex output space (e.g., Struct-to-text, Translation, and Closed-book QA). In contrast, for some task clusters such as Reading comprehension and Commonsense reasoning, Zero-shot and Few-shot performance are almost the same.
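A few-shot prompt of this kind can be sketched as alternating instruction-formatted exemplar inputs with their targets, followed by the instruction-formatted query. The separator, the translation instruction, and the function names are assumptions made for illustration.

```python
def few_shot_prompt(exemplars, new_input, instruct, separator="\n\n"):
    """Concatenate instruct(x_1), y_1, ..., instruct(x_k), y_k, instruct(x)."""
    parts = []
    for source, target in exemplars:
        parts.append(instruct(source))
        parts.append(target)
    parts.append(instruct(new_input))
    return separator.join(parts)


def translate_instruction(text):
    """Hypothetical instruction format for a translation task."""
    return f"Translate to French: {text}"


prompt = few_shot_prompt(
    [("The dog runs.", "Le chien court."), ("The cat sleeps.", "Le chat dort.")],
    "The bird sings.",
    translate_instruction,
)
print(prompt)
```

Exemplars show the model the expected output format, which is one plausible reason the gains are largest on tasks with complex output spaces.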


This article introduced a paper that proposes Instruction Tuning, a simple method for improving task performance in the Zero-Shot setting, together with FLAN, a model based on it. The model outperformed GPT-3 on many tasks, demonstrating the potential of large language models to carry out tasks described by instructions.
