
Open Vocabulary Object Detection Enabled By OWL-ViT


3 main points
✔️ The overall model architecture is simple, reproducible, and easy to incorporate into other frameworks
✔️ The image and text encoders are separate, so queries can be generated from images as well as from text
✔️ Similar objects can be detected simply by providing a query image, which is effective even for objects that are difficult to describe in words

Simple Open-Vocabulary Object Detection with Vision Transformers
written by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby
(Submitted on 12 May 2022)
Comments: ECCV 2022 camera-ready version

Subjects: Computer Vision and Pattern Recognition (cs.CV)

code:  

 The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Conventional object detection models have been closed-vocabulary: they can recognize only a limited, fixed set of classes, and adding new classes requires large amounts of annotated data. However, the nearly infinite number of object categories in the real world calls for open-vocabulary models that can detect previously unseen categories. Contrastive learning on paired image and text data has attracted attention as a way to address this issue. CLIP is a well-known example, but applying it to object detection, for instance handling classes not seen during training, has remained a challenge.

This paper builds an open-vocabulary object detection model, Vision Transformer for Open-World Localization (OWL-ViT), from a standard Vision Transformer (ViT) with minimal modifications. The model is first pre-trained contrastively on large-scale image-text pairs and then fine-tuned end-to-end for detection. In particular, embedding class names with the text encoder enables zero-shot detection of categories not seen during training.
OWL-ViT is also strong at one-shot detection because it can use image embeddings as queries in addition to text. It offers a significant performance improvement over previous state-of-the-art models on unseen categories of the COCO dataset. This capability is useful for detecting objects that are hard to describe in words, such as specialized parts.

We have also shown that longer pre-training and larger models consistently improve detection performance. In particular, the gains in open-vocabulary detection continue even when the number of image-text pairs seen during pre-training exceeds 20 billion. In addition, high zero-shot and one-shot detection performance can be achieved with a simple training recipe through appropriate data augmentation and regularization during fine-tuning for detection.

Proposed Method

OWL-ViT is trained in the following two phases:

  1. Contrastive pre-training on large image-text pairs
  2. Transfer learning to the detection task
OWL-ViT Overview

Contrastive Pre-Training with Large Image-Text Pairs

The goal here is to map the visual and language modalities into a unified representation space. Image and text encoders process their respective modalities, and training pulls the embeddings of related image-text pairs closer together while pushing unrelated pairs apart.
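This objective is essentially a CLIP-style symmetric contrastive loss. The following is a minimal PyTorch sketch of that idea; the variable names and the temperature value are our own assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric (CLIP-style) contrastive loss for a batch of paired embeddings.

    img_emb, txt_emb: (B, D) outputs of the independent image / text encoders,
    where row i of each tensor comes from the same image-text pair.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs (the diagonal) are pulled together, all other pairs pushed apart.
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```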

The image encoder uses the Vision Transformer (ViT) architecture, which is scalable and has strong representational capacity. Images are split into patches, and each patch is processed as a token, enabling feature extraction that takes spatial relationships into account. ViT's tokenization converts the image into a fixed-length sequence of tokens, and the relationships between patches are learned through the Transformer layers. The text encoder, on the other hand, processes tokenized sentences and produces an embedding that condenses the meaning of the whole sentence. The textual representation is typically taken from the output of the end-of-sentence (EOS) token in the final Transformer layer.
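For intuition, ViT's patch tokenization can be sketched as a strided convolution that maps an image to a fixed-length token sequence. The patch size and embedding dimension below are illustrative defaults, not OWL-ViT's actual configuration.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to a token."""
    def __init__(self, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```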

An important design feature of OWL-ViT's pre-training is that the image and text encoders are independent. This allows query embeddings, whether from text or images, to be pre-computed, greatly improving computational efficiency at inference time. It also means the same architecture can handle both text and image queries.
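As a concrete usage illustration, the publicly available Hugging Face transformers port of OWL-ViT exposes text-conditioned detection directly. The snippet below follows that port's documented interface; note that the post-processing method name may differ between library versions, and the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg").convert("RGB")       # placeholder image path
texts = [["a photo of a cat", "a remote control"]]     # free-form text queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes/scores/labels in the original image resolution.
target_sizes = torch.tensor([image.size[::-1]])        # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(texts[0][int(label)], round(score.item(), 3), box.tolist())
```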

Transfer Learning to the Detection Task

Here we first remove ViT's final token pooling layer (normally used to extract a representation of the whole image). Instead, a small classification head and a box regression head are attached directly to each output token. With this design, each ViT output token corresponds to a different spatial location in the image and represents a potential object candidate: the classification head predicts the object class and the box regression head estimates the location of the corresponding bounding box.

While traditional object detection models learn fixed per-class weights in their classification layer, OWL-ViT has no fixed classification layer. Instead, object class names are fed into the text encoder, and the resulting text embeddings are used directly as the weights of the classification head. This allows the model to detect objects of classes it was never trained on, as long as their class names are given, as sketched below.
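Putting the last two paragraphs together, a minimal sketch of such a head could look like the following. The dimensions, projection layer, and normalization are simplifying assumptions, not the exact OWL-ViT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenVocabDetectionHead(nn.Module):
    """Per-token heads: each ViT output token becomes one object candidate."""
    def __init__(self, dim=768, embed_dim=512):
        super().__init__()
        self.box_head = nn.Linear(dim, 4)          # predicts (cx, cy, w, h) per token
        self.img_proj = nn.Linear(dim, embed_dim)  # projects tokens into the shared space

    def forward(self, tokens, text_embeds):
        # tokens:      (B, N, dim)    -- ViT output tokens, final pooling removed
        # text_embeds: (Q, embed_dim) -- encoded class names / text queries
        boxes = self.box_head(tokens).sigmoid()     # (B, N, 4) normalized boxes
        img_embeds = F.normalize(self.img_proj(tokens), dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        # Text embeddings act as the classifier weights: no fixed class layer.
        logits = img_embeds @ text_embeds.t()       # (B, N, Q) per-token class scores
        return boxes, logits
```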

Transfer learning employs the bipartite matching loss used in DEtection TRansformer (DETR) to learn object localization. An optimal one-to-one assignment is computed between the bounding boxes predicted by the model and the ground-truth boxes, and the loss is calculated for each matched pair. This trains the model so that predicted and actual object positions agree.
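A simplified sketch of such a DETR-style matching step is shown below, assuming an L1 box cost plus a classification cost; the actual loss also uses additional terms (e.g. generalized IoU) and different weightings.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, pred_logits, gt_boxes, gt_labels, box_weight=5.0):
    """One-to-one matching between predicted tokens and ground-truth objects.

    pred_boxes: (N, 4), pred_logits: (N, Q), gt_boxes: (M, 4), gt_labels: (M,)
    Returns indices (pred_idx, gt_idx) of the lowest-cost assignment.
    """
    # Classification cost: negative predicted probability of the true class.
    class_cost = -pred_logits.sigmoid()[:, gt_labels]      # (N, M)
    # Localization cost: L1 distance between predicted and ground-truth boxes.
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)      # (N, M)
    cost = class_cost + box_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx
```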

For classification, we use a focal sigmoid cross-entropy to account for the class imbalance in datasets with long-tailed distributions. This loss down-weights easy, frequent examples so that rare, hard classes contribute more to training, improving their detection performance.
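For reference, a standard sigmoid focal loss can be written as below; the alpha and gamma values are common defaults, not necessarily those used in the paper.

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal sigmoid cross-entropy: down-weights easy examples so that rare,
    hard classes contribute more to the gradient.

    logits, targets: tensors of the same shape; targets are 0/1 per class.
    """
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)   # probability of the true label
    loss = ce * (1 - p_t) ** gamma                      # focus on hard examples
    if alpha >= 0:
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss
    return loss.mean()
```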

Federated datasets are datasets in which not every image is annotated for every class; each image is annotated only for a limited set of classes. For each training image, the classes annotated as present (positive examples) and the classes explicitly marked as absent (negative examples) are used as queries. This lets the model learn from explicitly verified information and reduces the impact of incorrect negative examples. To further avoid misinterpreting unannotated classes, classes are randomly sampled during training and added as "pseudo-negatives" so that each image has at least 50 negative queries.
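The per-image query construction can be pictured roughly as follows. The class sets and the minimum of 50 negatives follow the description above; everything else, including the function name, is an illustrative assumption.

```python
import random

def build_queries(positive_classes, negative_classes, all_classes, min_negatives=50):
    """Assemble the per-image query set for a federated detection dataset.

    positive_classes: classes annotated as present in this image
    negative_classes: classes explicitly annotated as absent
    all_classes:      full label space of the dataset
    """
    negatives = set(negative_classes)
    # Pad with randomly chosen "pseudo-negatives" that are neither annotated
    # positive nor already negative, until at least `min_negatives` are available.
    candidates = [c for c in all_classes
                  if c not in positive_classes and c not in negatives]
    random.shuffle(candidates)
    while len(negatives) < min_negatives and candidates:
        negatives.add(candidates.pop())
    return list(positive_classes) + list(negatives)
```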

Experiment

The experiments use multiple datasets. For training, the main datasets are OpenImages V4 (approximately 1.7 million images, over 600 classes), Visual Genome (about 84.5 thousand images with rich object-relationship annotations), and Objects365 (a large detection dataset with 365 classes). For evaluation, LVIS v1.0, with its long-tailed distribution, was primarily used, particularly to validate zero-shot performance. Additionally, COCO 2017 was used to compare standard object detection performance, and Objects365 to verify general detection capability.

The evaluation of open-vocabulary object detection focuses on performance on the LVIS classes not seen during detection training. In this experiment, OWL-ViT achieved an AP of 31.2% on the rare classes under zero-shot conditions, significantly outperforming existing state-of-the-art methods. This indicates that pre-training on image-text pairs allows OWL-ViT to effectively extract semantic features of objects from class names and descriptions. In particular, text-conditioned detection can detect unseen classes with high accuracy simply from a textual query of the class name, which is a key differentiator from previous methods.

Experiments on image-conditioned (one-shot) detection evaluated detection with image queries on the COCO dataset. OWL-ViT achieved up to a 72% improvement over the previous state of the art, raising the AP50 score on unseen categories from 26.0 to 41.8. These results show that OWL-ViT leverages its unified visual-language representation to detect unknown objects even when no name is given. In particular, image-conditioned detection effectively found visually similar objects by using embeddings extracted from an image containing the target object as queries.

Analysis of scaling properties confirmed that increasing the number of image-text pairs used in pre-training and the model size consistently improves detection performance. In particular, zero-shot detection performance continued to improve noticeably even beyond 20 billion image-text pairs seen during pre-training. This suggests that large-scale pre-training also pays off when transferring to object detection tasks. It also shows that Vision Transformer-based models scale better than other architectures, especially at large model sizes.

Summary

Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT) is a groundbreaking study that leverages unified visual and language pre-training to solve a key challenge in open-vocabulary object detection simply and effectively. Its greatest contribution is that large-scale contrastive pre-training on images and text enables highly accurate zero-shot and one-shot detection of unknown classes. In particular, the design of using the pre-trained text encoder's output directly as class embeddings, rather than a fixed classification layer, represents a major advance in flexibility and scalability.
