Assessing The Robustness Of Zero-shot Image Understanding Models Through CLIP


3 main points
✔️ A comprehensive benchmark built around CLIP was used to investigate the zero-shot robustness of multimodal foundation models.
✔️ The pilot study revealed reduced robustness, especially to synthetic distribution shifts and adversarial attacks, and a data-duplication analysis suggests that part of the reported robustness may stem from overlap between training and test data.
✔️ Looking ahead, improving the robustness of CLIP and other multimodal models will require new training strategies, greater data diversity, new evaluation metrics, real-world deployment experience, and international collaboration and sharing.

Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study
written by Chenguang Wang, Ruoxi Jia, Xin Liu, Dawn Song
(Submitted on 15 Mar 2024)
Comments: Published on arxiv.

Subjects:  Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

By pre-training image representations directly from raw text, it is possible to build image understanding models that transfer to specific tasks without task-specific training. Multimodal foundation models such as CLIP, for example, are trained on hundreds of millions of image-text pairs collected from the Internet and subsequently perform well zero-shot, without additional task-specific training. These models show performance comparable to models trained on ImageNet and have been reported to be robust to natural distribution shifts. Such robustness is essential for safety-critical applications.

This paper provides a comprehensive assessment of robustness against various distribution shifts and attacks. The pilot study using CLIP reveals reduced robustness, especially against synthetic distribution shifts and adversarial attacks, and a data-duplication analysis suggests that part of the reported robustness may be due to overlap between training and test data. In short, the paper emphasizes the importance of comprehensively assessing, and improving, the robustness of zero-shot multimodal models.

Introduction

Assessing robustness is important, and it should cover not only natural distribution shifts but also robustness to noise and adversarial attacks. This study comprehensively evaluates the robustness of zero-shot image classification using CLIP and introduces a new set of robustness tests.

This demonstrates the importance of robustness in multimodal applications and aids the evaluation of other models. It also underscores the need to improve the robustness of zero-shot multimodal foundation models.

Proposed Method

The ROZ benchmark is a comprehensive test suite for measuring the robustness of multimodal foundation models. It adds new test sets to the existing suite of robustness datasets to enable a more extensive robustness assessment.

The benchmark's main components are general robustness test sets and adversarial attacks. The test sets are divided into two categories, natural distribution shifts and synthetic distribution shifts, each containing different datasets. The natural distribution shifts include seven shifts, such as ImageNetV2 and ObjectNet. The synthetic distribution shifts include datasets such as ImageNet-C and Stylized-ImageNet. In addition, robustness against adversarial attacks is tested using a variety of techniques, including targeted and transfer-based attacks.
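For intuition about what a gradient-based attack does, the sketch below implements FGSM, a classic one-step adversarial perturbation. This is not the paper's exact attack suite (which uses targeted and transfer-based attacks); it is only a minimal illustration of the mechanics, assuming unnormalized pixel values in [0, 1]:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=8 / 255):
    """One-step FGSM: nudge every pixel by epsilon in the direction
    that increases the classification loss."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()
    # Assumes pixels live in [0, 1]; clamp keeps them a valid image.
    return adv.clamp(0, 1).detach()
```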

The benchmark primarily targets zero-shot image classifiers and evaluates them using the CLIP model. CLIP processes both images and text and performs classification by matching an image against text prompts, including automatically generated ones, built from the class names. The robustness of CLIP is then also compared against existing standard models.
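As a concrete illustration, here is a minimal zero-shot classification sketch using OpenAI's clip package. The class names and the prompt template "a photo of a {c}" are illustrative assumptions, not the paper's exact prompts:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Build one text prompt per class name (template is illustrative).
classes = ["dog", "cat", "car"]
text = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each class prompt.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(classes[probs.argmax().item()])  # predicted class, no task-specific training
```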

Finally, model robustness is assessed with two types of metrics, effective robustness and relative robustness, giving a comprehensive picture of model robustness.
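As a hedged sketch of one common formulation of these metrics (in the style of Taori et al.; the paper's exact definitions may differ): relative robustness tracks the raw accuracy change under a shift or attack, while effective robustness measures accuracy beyond what a baseline trend of standard models would predict from in-distribution accuracy:

```python
def relative_robustness(acc_clean, acc_shifted):
    """Accuracy drop caused by the shift or attack (smaller is better)."""
    return acc_clean - acc_shifted

def effective_robustness(acc_in_dist, acc_shifted, baseline_fit):
    """Accuracy under shift beyond what standard models achieve at the
    same in-distribution accuracy. baseline_fit maps ID accuracy to the
    shifted accuracy predicted by the standard-model trend line."""
    return acc_shifted - baseline_fit(acc_in_dist)

# Illustrative numbers and a toy linear baseline, not results from the paper.
print(relative_robustness(0.76, 0.61))                      # ~0.15 accuracy drop
print(effective_robustness(0.76, 0.61, lambda a: a - 0.2))  # positive = above trend
```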

Experiment

This study focused on three aspects: natural distribution shifts, synthetic distribution shifts, and adversarial attacks.

Natural distribution shifts are changes in the data that a model encounters in everyday environments; for example, whether an image classifier can still classify an image correctly against a new background or under new lighting conditions. Synthetic distribution shifts, in contrast, are artificially generated changes that the model did not encounter during training, testing whether it can adapt to new environments and conditions. Finally, adversarial attacks evaluate whether the model is vulnerable to intentionally crafted, misleading inputs, that is, whether it still performs correctly on adversarial data.

First, for natural distribution shifts, CLIP was shown to improve robustness over standard models; in particular, an improvement in CLIP's effective robustness was observed under natural distribution shifts. This means that CLIP outperforms standard models on the image classification task under these shifts. However, the results differed for synthetic distribution shifts and adversarial attacks.

(Figure legend: red = standard ImageNet model, blue = zero-shot CLIP model, purple = CLIP-Auto model)

For the synthetic distribution shifts in the figure above, CLIP showed a trend toward decreasing robustness. The results also showed that CLIP is vulnerable to attacks that add text to images: because CLIP is trained to respond to both images and text, adversarial changes to the text can fool the model.

Furthermore, CLIP was shown to be less robust than standard models against adversarial attacks. In particular, CLIP was vulnerable to typographic attacks, where significant performance degradation was observed. This suggests that, because CLIP relies on text as well as image representations, adversarial manipulation of text in the image can degrade the model's performance.
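A typographic attack can be as simple as overlaying a misleading word on the image; the well-known example is an apple labeled "iPod", which CLIP then misclassifies. The Pillow sketch below is a minimal illustration of the mechanics, with file names and positions as hypothetical placeholders:

```python
from PIL import Image, ImageDraw

def typographic_attack(image_path, text="iPod", position=(10, 10)):
    """Overlay a misleading label on the image, mimicking the
    pasted-text attacks that have been shown to fool CLIP."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.text(position, text, fill="black")  # default font, for simplicity
    return img

# e.g., label an apple image "iPod", then re-run zero-shot classification.
typographic_attack("apple.jpg").save("apple_typographic.jpg")
```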

In summary, this study provides a comprehensive assessment of the robustness of CLIP, a multimodal model, showing that it is fairly robust to natural distribution shifts but less robust to synthetic distribution shifts and to adversarial attacks. This offers important insights for future model design and training strategies.

Data Duplication Analysis

The study also examines CLIP's robustness from a new perspective: data duplication.

The pre-training dataset may contain portions of the test datasets, which could inflate the reported performance under natural distribution shifts. The original CLIP work included a duplicate-data analysis, but that analysis was found not to be rigorous. The approach proposed here is to remove from the test set any images that are similar to training samples, creating a clean test set on which robustness is reassessed.

This approach uses the ResNet50x16 image encoder to detect duplicates and excludes images whose similarity exceeds a threshold. Focusing on natural versus synthetic distribution shifts, the impact of data duplication on robustness was investigated. The results indicate that cleaning up data duplicates is important for assessing robustness.
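A minimal sketch of this de-duplication step, assuming precomputed features from CLIP's ResNet50x16 image encoder; the similarity threshold here is a placeholder, not the paper's value:

```python
import torch

def deduplicate(test_feats, train_feats, threshold=0.9):
    """Return indices of test images whose maximum cosine similarity to
    any training image stays at or below the threshold (the clean set).
    Features are assumed to come from CLIP's ResNet50x16 image encoder."""
    test_feats = test_feats / test_feats.norm(dim=-1, keepdim=True)
    train_feats = train_feats / train_feats.norm(dim=-1, keepdim=True)
    max_sim = (test_feats @ train_feats.T).max(dim=1).values
    return (max_sim <= threshold).nonzero(as_tuple=True)[0]
```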

Conclusion

This study investigated the zero-shot robustness of multimodal foundation models through a comprehensive benchmark built around CLIP. The results show that CLIP is not robust to synthetic distribution shifts or adversarial attacks, and that the previously reported robustness to natural distribution shifts may be partly due to data duplication; this differs from the results described in the original CLIP paper. A comprehensive robustness assessment is important for real-world applications and has implications for use in safety-critical areas.

Looking ahead, improving the robustness of CLIP and other multimodal models will require new training strategies, greater data diversity, new evaluation metrics, real-world deployment experience, and international collaboration and sharing.

 