
PictSure: A New Method That Takes On Few-Shot Classification With The Power Of Visual Embeddings
3 main points
✔️ PictSure is an ICL method that uses only images, and it shows that pre-training the embedding model is essential for high accuracy
✔️ Achieves strong few-shot classification performance by freezing and reusing pre-trained ResNet and ViT models
✔️ Demonstrates better generalization than conventional methods, especially in fields where linguistic information is scarce, such as medicine and agriculture
PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers
written by Lukas Schiesser, Cornelius Wolff, Sophie Haas, Simon Pukrop
(Submitted on 16 Jun 2025)
Comments: 15 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
In recent years, Few-Shot Image Classification (FSIC), which identifies new classes from a small number of samples, has attracted much attention in the field of image classification.
In particular, the In-Context Learning (ICL) approach, which allows inference without training by providing only a few labeled images at test time, is promising in terms of flexibility and efficiency. However, previous studies have not fully investigated how the pre-training and training strategy of image features (embedding vectors) affect classification accuracy in ICL.
In this study, the authors propose an ICL model called "PictSure" that uses only visual information, and systematically analyze how factors such as the architecture of the embedding model, its pre-training method, and the timing of its training affect FSIC performance. The results show that PictSure significantly outperforms conventional methods in accuracy, especially when pre-trained embedding models are used.
This suggests that highly accurate classification is possible based on the quality of visual features alone, without relying on semantic linguistic information.
Proposed Method
The proposed PictSure is an ICL model for few-shot classification that uses only images. Its Transformer-based architecture predicts the query label from the context formed by the input support set (images and their labels) and the query image. The Transformer's input is a sequence of tokens that combines image and label embeddings, and a specially designed asymmetric attention mask enforces a structure in which the query depends on the support set, as in the sketch below.
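The following is a minimal sketch of how such a token sequence and asymmetric mask might be assembled in PyTorch. All names, dimensions, and the zero "unknown label" slot for the query are illustrative assumptions, not the authors' actual implementation.

```python
import torch

def build_tokens_and_mask(support_emb, support_labels, query_emb, n_classes):
    """support_emb: (S, d) image embeddings, support_labels: (S,) class ids,
    query_emb: (1, d). Returns a token sequence (S+1, 2d) and a boolean
    attention mask (S+1, S+1) where True marks a blocked connection."""
    S, d = support_emb.shape  # assumes d >= n_classes
    # Pair each support image embedding with a (padded) one-hot label embedding.
    label_emb = torch.nn.functional.one_hot(support_labels, n_classes).float()
    label_emb = torch.nn.functional.pad(label_emb, (0, d - n_classes))
    support_tokens = torch.cat([support_emb, label_emb], dim=-1)      # (S, 2d)
    # The query gets a zero "unknown label" slot instead of a real label.
    query_token = torch.cat([query_emb, torch.zeros(1, d)], dim=-1)   # (1, 2d)
    tokens = torch.cat([support_tokens, query_token], dim=0)          # (S+1, 2d)

    # Asymmetric mask: the query attends to all supports, but no support
    # token is allowed to attend to the query.
    mask = torch.zeros(S + 1, S + 1, dtype=torch.bool)
    mask[:S, S] = True
    return tokens, mask
```

A mask of this form can be passed as the `mask` argument of `torch.nn.TransformerEncoder`, where `True` entries block attention.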
Another feature of this model is its strong dependence on the quality of the visual embedding. ResNet and Vision Transformer (ViT) models pre-trained on ImageNet or similar datasets are used as embedding models and are then either frozen or fine-tuned (a frozen setup is sketched below).
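As an illustration, loading an ImageNet-pretrained ResNet with torchvision and freezing it as an embedding model might look as follows; the exact backbone and checkpoint used in the paper may differ.

```python
import torch
import torchvision.models as models

# Load an ImageNet-pretrained ResNet-18 and drop its classification head,
# so the model outputs 512-dimensional feature vectors.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()

# Freeze: no gradients flow into the embedding while the ICL model trains.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

with torch.no_grad():
    emb = backbone(torch.randn(1, 3, 224, 224))  # shape: (1, 512)
```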
In particular, the authors confirmed that ViT obtains a more structured embedding space when a triplet loss is introduced in addition to the usual classification loss. This places visually similar images close to each other, which helps stabilize label predictions in ICL; a sketch of such a combined loss follows.
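Below is a minimal sketch of combining the two losses during embedding pre-training; the margin, the weighting `alpha`, and the function names are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

triplet = torch.nn.TripletMarginLoss(margin=1.0)  # margin is an assumed value

def embedding_loss(model, head, anchor, positive, negative, labels, alpha=0.5):
    """anchor/positive share a class, negative comes from a different class.
    `model` maps images to embeddings; `head` is a linear classifier."""
    za, zp, zn = model(anchor), model(positive), model(negative)
    cls_loss = F.cross_entropy(head(za), labels)  # usual classification loss
    tri_loss = triplet(za, zp, zn)                # pulls same-class pairs together
    return cls_loss + alpha * tri_loss
```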
PictSure is designed to provide strong generalization performance not only for natural images, but also for domains with limited linguistic information, such as medicine and agriculture.
Experiments
To test the effectiveness of PictSure, the authors conducted few-shot classification tasks on in-domain (close to the training data) and out-of-domain (differently distributed) datasets. Experiments were run in 5-way 1-shot and 5-way 5-shot settings, and PictSure's performance was compared to a KNN baseline and the CLIP-based ICL method CAML; episode sampling for such settings is sketched below.
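For readers unfamiliar with the N-way K-shot protocol, the following generic sketch samples one such episode from a labeled dataset; it is an assumption about the setup, not the paper's evaluation code.

```python
import random
from collections import defaultdict

def sample_episode(samples, n_way=5, k_shot=1, n_query=1):
    """samples: list of (image, label) pairs. Returns support and query lists,
    with labels remapped to 0..n_way-1 within the episode."""
    by_class = defaultdict(list)
    for img, lbl in samples:
        by_class[lbl].append(img)
    classes = random.sample(list(by_class), n_way)  # pick N classes
    support, query = [], []
    for new_lbl, cls in enumerate(classes):
        imgs = random.sample(by_class[cls], k_shot + n_query)  # K+Q per class
        support += [(img, new_lbl) for img in imgs[:k_shot]]
        query += [(img, new_lbl) for img in imgs[k_shot:]]
    return support, query
```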
The results showed that PictSure with pre-trained, frozen ResNet and ViT embedding models significantly outperformed the other methods, especially on medical and other specialized imaging datasets (e.g., Brain Tumor, OrganCMNIST).
In contrast, CAML, which uses CLIP, achieved high accuracy on natural images, but lower accuracy in the medical domain. This suggests that CLIP's language-based pre-training is not suitable for identifying visual details.
In addition, PictSure showed stable training behavior that was robust to overfitting, and its accuracy improved as the size of the support set (context length) increased, although the gains plateaued beyond eight examples.