CoOp: How Do I Decide on a Text Template for CLIP?

Self-supervised Learning 22/09/2021

3 main points
✔️ Prompt engineering of CLIP is a major practical challenge
✔️ We propose CoOp to automate Prompt engineering by learning Prompts in an end-to-end manner
✔️ We show the effectiveness and robustness of CoOp on 11 datasets

Learning to Prompt for Vision-Language Models
written by Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu
(Submitted on 2 Sep 2021 (v1), last revised 21 Sep 2021 (this version, v2))
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The images used in this article are from the paper or created based on it.

Introduction

In this article, we introduce the paper which proposed CoOp (Context Optimization).

OpenAI's CLIP is a multimodal model trained on a large amount of data from the Internet, consisting of images paired with language captions.

CLIP has two encoders: an image encoder and a language encoder. The encoders project images and language captions (called Prompts here) into a shared representation space, and an image is classified by its similarity to each Prompt.

However, the Prompts in the CLIP paper were hand-crafted by humans based on experience. Different Prompts have to be prepared for different datasets, and tuning them requires computational resources. Moreover, small changes in the Prompt have a large impact on classification accuracy. For example (Figure 1), on the Caltech101 dataset the only difference between 'a photo of [CLASS]' and 'a photo of a [CLASS]' is the presence of 'a', yet this alone changes accuracy by about 5%.

Therefore, from a practical point of view, Prompt engineering is a major issue for CLIP. One answer to the question "How do I decide the text template (Prompt) for CLIP?" is CoOp, a method that performs Prompt engineering automatically. A key feature of CoOp is that it learns Prompts in an end-to-end manner using an already pre-trained CLIP.

Proposed method: CoOp

Since the CoOp proposed in this paper is based on CLIP, we first formulate zero-shot inference for CLIP.
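Written out from the description that follows, Equation (1) takes roughly the form below. This is a reconstruction: the symbol K for the number of classes and x for the input image are notation assumed here.

```latex
p(y = i \mid x) =
  \frac{\exp\left(\langle w_i, f \rangle / \tau\right)}
       {\sum_{j=1}^{K} \exp\left(\langle w_j, f \rangle / \tau\right)}
\tag{1}
```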

In Equation (1), f and w_i are the image and the Prompt projected into the representation space by the image encoder and the language encoder, respectively; w_i is the Prompt representation obtained for CLASS i. We compute the cosine similarity <·,·> between f and each w_i in the representation space, multiply it by the coefficient 1/τ, and apply the softmax function to obtain the class probabilities. Note that the image encoder, the language encoder, and the coefficient 1/τ are all taken from the pre-trained CLIP.
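The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not CLIP's actual code: the toy 3-dimensional vectors stand in for real encoder outputs, and the value tau = 0.01 is an assumption chosen only for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity <a, b> between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probs(f, class_embeddings, tau=0.01):
    """Softmax over cosine similarities scaled by 1/tau.

    f: image feature; class_embeddings: one Prompt embedding w_i per class.
    tau is an illustrative value (assumption), not read from a real model.
    """
    logits = [cosine_similarity(f, w) / tau for w in class_embeddings]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins for encoder outputs (hypothetical values).
f = [0.9, 0.1, 0.0]
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_probs(f, w)
predicted_class = probs.index(max(probs))
```

Because 1/τ is large, small differences in cosine similarity turn into sharp differences in probability, which is one reason the wording of the Prompt matters so much.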

Figure 2 shows an overview of the proposed method. CoOp learns the Prompt that is input to the language encoder, while the rest of CLIP is kept as pre-trained.

We now describe how CoOp learns the Prompt.

We extend the Prompt input to the language encoder as in Equation (2). Each [V]_m (m = 1, 2, ..., M) is a vector with the same dimension as the language embedding (512 for CLIP), and M is a hyperparameter that determines the number of tokens in the Prompt to be learned, which is set to 16 in the experiments.
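Based on the description above, Equation (2) can be written as follows. This is a reconstruction in which the class token is placed at the end of the Prompt.

```latex
t = [V]_1 \, [V]_2 \, \dots \, [V]_M \, [\mathrm{CLASS}]
\tag{2}
```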

When the Prompt t is input to the language encoder g(⋅), CoOp is formulated as in Equation (3).
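Combining the Prompt t with the zero-shot formulation, Equation (3) presumably has the form below. This is a reconstruction; t_i, denoting the Prompt built for CLASS i, and K, the number of classes, are notation assumed here.

```latex
p(y = i \mid x) =
  \frac{\exp\left(\langle g(t_i), f \rangle / \tau\right)}
       {\sum_{j=1}^{K} \exp\left(\langle g(t_j), f \rangle / \tau\right)}
\tag{3}
```

The only change from Equation (1) is that the fixed representation w_i is replaced by g(t_i), which depends on the learnable vectors [V]_m.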

CoOp allows two design choices: the word indicating the CLASS can be placed in the middle of the Prompt instead of at the end, and the vectors V can either be shared by all classes or have different parameters for each CLASS. The paper experiments with the four variants of CoOp obtained by combining these two choices.
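The four variants can be sketched as a small Prompt-building routine. This is only an illustration under stated assumptions: the function names are hypothetical, the class-name embeddings are made-up placeholder vectors, and the Gaussian initialization with standard deviation 0.02 is an assumption rather than something stated in this article.

```python
import random

def init_context(m, dim, rng, std=0.02):
    """Create M learnable context vectors [V]_1..[V]_M (assumed Gaussian init)."""
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(m)]

def build_prompt(context, class_vec, position="end"):
    """Place the class token at the end or in the middle of the context."""
    if position == "end":
        return context + [class_vec]
    if position == "middle":
        half = len(context) // 2
        return context[:half] + [class_vec] + context[half:]
    raise ValueError("position must be 'end' or 'middle'")

def make_prompts(class_vecs, m, dim, position="end", class_specific=False, seed=0):
    """Build one Prompt per class with shared or class-specific context."""
    rng = random.Random(seed)
    if class_specific:
        contexts = [init_context(m, dim, rng) for _ in class_vecs]
    else:
        shared = init_context(m, dim, rng)
        contexts = [shared for _ in class_vecs]
    return [build_prompt(ctx, c, position) for ctx, c in zip(contexts, class_vecs)]

DIM, M = 512, 16
# Hypothetical class-name embeddings; a real system would take them from CLIP.
class_vecs = [[1.0] * DIM, [-1.0] * DIM]
variants = {
    (position, class_specific): make_prompts(class_vecs, M, DIM, position, class_specific)
    for position in ("end", "middle")
    for class_specific in (False, True)
}
```

Each entry of `variants` holds one Prompt per class with M + 1 token vectors. With shared context the same vectors appear in every class's Prompt, while the class-specific variant multiplies the number of learnable parameters by the number of classes.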

Experiment: Comparison with CLIP

Few-Shot Learning

For a new dataset, CoOp can learn the Prompt of CLIP from a small number of samples. In the experiments, Prompts learned by CoOp in the few-shot setting are compared with two baselines on the 11 image classification datasets used in the CLIP paper. The first baseline is Zero-Shot CLIP, which uses the Prompts designed in the CLIP paper. The second is Linear Probe CLIP, proposed in the CLIP paper, which adds a linear classifier to CLIP.
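To make the few-shot learning of the Prompt concrete, the toy sketch below updates only the context vectors while the class embeddings and image features stay fixed. Several parts are assumptions made for illustration: the cross-entropy objective, the mean-pooling stand-in for the frozen language encoder, the temperature value, and the made-up 4-dimensional data. Gradients are estimated by finite differences with a simple step-halving rule, swapped in here in place of the backpropagation a real implementation would use.

```python
import math
import random

TAU = 0.1  # assumed temperature; CoOp takes 1/tau from the pre-trained CLIP

def encode(context, class_vec):
    """Hypothetical frozen language encoder g: mean-pools the Prompt tokens."""
    tokens = context + [class_vec]
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(len(class_vec))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cross_entropy(context, class_vecs, samples):
    """Average negative log-probability of the correct class."""
    total = 0.0
    for feature, label in samples:
        logits = [cosine(encode(context, c), feature) / TAU for c in class_vecs]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[label]
    return total / len(samples)

def numerical_gradient(context, class_vecs, samples, eps=1e-5):
    """Central finite differences with respect to the context vectors only."""
    grad = [[0.0] * len(row) for row in context]
    for i, row in enumerate(context):
        for j in range(len(row)):
            original = row[j]
            row[j] = original + eps
            up = cross_entropy(context, class_vecs, samples)
            row[j] = original - eps
            down = cross_entropy(context, class_vecs, samples)
            row[j] = original
            grad[i][j] = (up - down) / (2 * eps)
    return grad

def train_context(context, class_vecs, samples, steps=20, lr=0.5):
    """Gradient descent that accepts a step only if the loss goes down."""
    loss = cross_entropy(context, class_vecs, samples)
    for _ in range(steps):
        grad = numerical_gradient(context, class_vecs, samples)
        candidate = [[v - lr * g for v, g in zip(row, g_row)]
                     for row, g_row in zip(context, grad)]
        new_loss = cross_entropy(candidate, class_vecs, samples)
        if new_loss < loss:
            context, loss = candidate, new_loss
        else:
            lr *= 0.5  # step was too large; retry with a smaller one
    return context, loss

rng = random.Random(0)
class_vecs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # frozen, made-up
samples = [([0.8, 0.1, 0.3, 0.0], 0), ([0.1, 0.9, 0.3, 0.0], 1)]  # (image feature, label)
context = [[rng.gauss(0.0, 0.02) for _ in range(4)] for _ in range(2)]
initial_loss = cross_entropy(context, class_vecs, samples)
learned_context, final_loss = train_context(context, class_vecs, samples)
```

The point of the sketch is the division of roles: `class_vecs`, the encoder, and `TAU` never change, and only `context` is returned in updated form.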

From the results (Figure 3), we can see that CoOp performs well in few-shot learning. The first panel shows the average over the 11 datasets. On average, Prompts learned from as few as two samples already give better results than the hand-designed CLIP Prompts. It can also be seen that CoOp outperforms Linear Probe CLIP on most of the datasets.

Next, we compare CoOp with Zero-Shot CLIP and Linear Probe CLIP in turn. Figure 4 shows the accuracy improvement of CoOp over Zero-Shot CLIP at 16 shots. An accuracy improvement of 10% or more is achieved on 7 of the 11 datasets. The improvement of about 5% on ImageNet is also significant, because ImageNet is a 1000-class classification task.

In the comparison with Linear Probe CLIP, CoOp is superior on 8 of the 11 datasets. In particular, when both methods are trained with 8 shots, the accuracy of CoOp is on average more than 10% higher than that of Linear Probe CLIP. From these results, the authors claim that CoOp can learn Prompts relevant to the dataset.

Robustness to Distribution Shift

Unlike CLIP, CoOp is trained on a specific data distribution. There is therefore a risk that it learns spurious correlations and fails to generalize to unseen data distributions. Here, CoOp trained on ImageNet is compared with Zero-Shot CLIP and Linear Probe CLIP on the target datasets ImageNetV2, ImageNet-Sketch, ImageNet-A, and ImageNet-R, whose data distributions differ from that of ImageNet.

It can be seen that CoOp shows better accuracy on the target datasets than Zero-Shot CLIP and Linear Probe CLIP. However, since the accuracy of CoOp on the source dataset ImageNet is already about 5% better than the baseline, the authors consider the results in Table 1 to be reasonable.

Further Analysis

Finally, we introduce some additional comparison experiments. As mentioned in the CLIP paper, the Prompt of CLIP is not uniquely determined, so that paper proposed Prompt Ensembling, which uses multiple Prompts simultaneously. Table 2 shows that CoOp is better than Prompt Ensembling.
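As a rough picture of what using multiple Prompts simultaneously can mean, the sketch below averages the normalized embeddings of several Prompts for one class into a single class representation. Averaging in embedding space is an assumption made for this illustration, and the 2-dimensional vectors are made-up stand-ins for language encoder outputs.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ensemble_embedding(prompt_embeddings):
    """Average unit-normalized Prompt embeddings, then re-normalize."""
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[d] for e in unit) / len(unit) for d in range(dim)]
    return normalize(mean)

# Hypothetical embeddings of one class under three hand-written templates.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_ensemble = ensemble_embedding(embeddings)
```

The resulting vector can be used in place of a single w_i when computing cosine similarities, so the ensemble costs nothing extra at classification time.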

Figure 5 shows the accuracy of CoOp for M = 4, 8, and 16, together with a comparison of the vision backbones used in CoOp. It can be seen that M = 16 is the best.

Summary

We believe that image-language multimodal pre-trained models will grow significantly in the wake of CLIP and ALIGN, both proposed in 2021. When dealing with such models, designing the Prompt input to the language encoder is unavoidable. Prompt design is also a popular topic in natural language processing (NLP). Especially since GPT-3, research on Prompts has been growing (see "How to extract the true value of GPT-3"). This is because Prompts make it possible to build a system that learns about downstream tasks at the pre-training stage, which leads to the development of more general pre-trained models.

With this background in mind, this article introduced CoOp, a method that learns the Prompt for CLIP.