https://visinf.github.io/emat
Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation
Abstract
Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-$5^i$ and COCO-$20^i$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.
Keywords:
Few-shot learning · Efficiency · Segmentation · Classification
1 Introduction
Recently, data-intensive methods have been introduced for various deep learning applications [5, 32, 23, 34, 41, 25, 8]. These methods rely on large training datasets, making them impractical in fields where collecting extensive datasets is challenging or costly [14, 65, 13]. Consequently, few-shot learning (FSL) methods have gained significant attention for their ability to learn from just a few examples and quickly adapt to new classes [44, 52, 56, 1]. In computer vision, FSL has been mostly applied to image classification (FS-C) [3, 40, 43, 18] and segmentation (FS-S) [11, 30, 62, 55, 63].
FS-C and FS-S often co-occur in real-world applications, e.g., in agriculture, where crops must be segmented and classified by type or health status. Hence, recent works [19, 20] integrate multi-label classification and multi-class segmentation into a single few-shot classification and segmentation (FS-CS) task. While FS-CS addresses some limitations of FS-C (e.g., assuming the query image contains only one class) and FS-S (e.g., assuming the target class is always present in the query image), it also increases the task difficulty by simultaneously tackling classification and segmentation. Moreover, some applications, e.g., medical imaging, rely on precise small-object analysis [13, 65, 16]. Thus, achieving high accuracy on small objects is a desired property for FS-CS methods. Yet, as shown in Fig. 1, the current state-of-the-art (SOTA) FS-CS method [20] struggles with small objects, a limitation we address in this work.
[Figure 1: Qualitative comparison on small objects. Columns: GT Mask | CST* [20] | EMAT.]
To better align the evaluation of FS-CS models with practical scenarios, FS-CS uses the N-way K-shot configuration, where the model learns N classes from NK examples (K per class). However, the current evaluation setting [19, 20] discards available annotations, which is not ideal given the cost of data annotation. To address this, we introduce two new evaluation settings.
1.0.1 Contributions.
(1) Building on the current SOTA FS-CS method [20], we propose an efficient masked attention transformer (EMAT), which enhances classification and segmentation accuracy, particularly for small objects, while using approximately four times fewer trainable parameters. (2) Our EMAT outperforms all FS-CS methods on the PASCAL-$5^i$ and COCO-$20^i$ datasets, supports the N-way K-shot configuration, and can generate empty segmentation masks when no target objects are present. (3) Finally, we introduce two new FS-CS evaluation settings that better utilize available annotations during inference.
2 Related Work
2.0.1 Few-shot Classification (FS-C)
methods can be categorized into three groups based on what the model learns. Representation-based approaches learn class-agnostic, discriminative embeddings [48, 39, 3, 21, 60, 17, 46]. Optimization-based approaches learn the optimal set of weights that allow the model to adapt to new classes in just a few optimization steps [15, 40, 4, 35]. Transfer-based approaches adapt large pre-trained [6, 10, 43, 24, 28] or foundation models [64, 18, 37]. A major limitation of most FS-C methods is the assumption of a single label per image [2, 38], limiting them in multi-label settings.
2.0.2 Few-shot Segmentation (FS-S)
methods can also be categorized into three groups: prototype matching, which aligns support embeddings with query features [11, 50, 45, 58, 26, 51]; dense correlation, which constructs support–query correlation tensors [33, 29, 53, 30, 7, 54]; and model-adaptation, which fine-tunes large pre-trained models [61, 62, 49, 27, 57]. Despite the advancements in FS-S, most methods have two main limitations: (1) they target only the 1-way K-shot configuration and (2) they assume the query image contains the target class, preventing the models from predicting empty segmentation masks. Only a few recent works [42, 59] address the more general N-way K-shot configuration.
2.0.3 Few-shot Classification and Segmentation (FS-CS)
focuses on jointly predicting the multi-label classification vector and multi-class segmentation mask without assuming support classes are present in the query image [19]. The current SOTA FS-CS method, the classification-segmentation transformer (CST) [20], uses a memory-intensive masked-attention mechanism that requires significant downsampling of the correlation features, reducing its accuracy on small objects. In this work, we enhance CST by proposing an efficient masked-attention formulation and adding further refinements, resulting in a more memory- and parameter-efficient method with improved accuracy, especially for small objects.
3 Problem Definition
This work focuses on the few-shot classification and segmentation (FS-CS) task [19], formulated as an N-way K-shot learning problem [48]. We assume two disjoint class sets: $\mathcal{C}_{\text{train}}$ for training and $\mathcal{C}_{\text{test}}$ for testing. Accordingly, training tasks are sampled from $\mathcal{C}_{\text{train}}$, and testing tasks from $\mathcal{C}_{\text{test}}$. Each task consists of a support set $\mathcal{S}$ and a query image $I_q$, where $\mathcal{S}$ contains N classes ($\mathcal{C}_s \subseteq \mathcal{C}_{\text{train}}$ or $\mathcal{C}_s \subseteq \mathcal{C}_{\text{test}}$), each represented by K examples:
$$\mathcal{S} = \left\{ \left\{ \left( I_s^{n,k}, M_s^{n,k}, y_s^{n,k} \right) \right\}_{k=1}^{K} \right\}_{n=1}^{N}, \tag{1}$$
where $I_s^{n,k}$, $M_s^{n,k}$, and $y_s^{n,k}$ denote the support image, segmentation mask, and class label for the $k$-th example of the $n$-th class. Although $|y_s^{n,k}| = 1$ in Eq. 1, we use this notation for compatibility with multi-label settings where $|y_s^{n,k}|$ can vary.
The goal of FS-CS is to learn from $\mathcal{S}$ such that, given $I_q$, the model can (i) identify which support classes are present (multi-label classification), and (ii) segment those classes (multi-class segmentation). Moreover, FS-CS allows $I_q$ to contain a subset of the support classes. Thus, when $N > 1$, $I_q$ can contain: (1) none of the support classes ($y_q = \emptyset$), (2) a subset of them ($y_q \subset \mathcal{C}_s$), or (3) all support classes ($y_q = \mathcal{C}_s$), where $y_q$ denotes the set of support classes present in $I_q$. Note that case (1) is important in real-world applications where query images may not contain relevant classes, requiring models to predict empty segmentation masks when necessary.
The drawback of the current FS-CS setting is that each support image is assumed to contain only one annotated class ($|y_s^{n,k}| = 1$). If $I_s^{n,k}$ includes multiple support classes ($|y_s^{n,k}| > 1$), its label vector and segmentation mask need to be adjusted before constructing $\mathcal{S}$. This adjustment discards available annotations, as illustrated in Fig. 2, where the first support image contains both support classes (person, bike). However, since it is an example of the 1st class (person), the annotations of the 2nd class (bike) are removed in the original setting.
[Figure 2: Support annotations under each evaluation setting. Columns: Available Data | Original (Eq. 1) | Partially Aug. (Eq. 2) | Fully Aug. (Eq. 3). Class labels of the first support image: 1 2 | 1 | 1 2 | 1 2. Class labels of the second support image: 2 3 | 2 | 2 | 2 3.]
3.1 Proposed Evaluation Settings
To better utilize available annotations and reflect more realistic evaluation scenarios, we introduce two novel FS-CS evaluation settings.
3.1.1 Partially Augmented Setting.
3.1.2 Fully Augmented Setting.
This setting keeps all available annotations for each support image, regardless of whether the corresponding classes are part of the support classes:
(3) |
For example, in Fig. 2, the mask and label of the first support image include annotations for both support classes (person, bike), while those of the second support image are augmented with annotations of a non-support class (TV), which can belong to either the training or the testing class set. In this setting, the model is expected to classify and segment all support and augmented classes present in the query image. Additionally, this setting aligns closely with the generalized few-shot learning (GFSL) setting [44], which also evaluates on base classes. However, unlike standard GFSL, which evaluates on all base classes seen during training, our setting restricts evaluation to only those classes present in the support set.
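The difference between the evaluation settings can be illustrated as a filter over the class labels of each support image. The following is a minimal sketch, not the authors' implementation: the function name, the setting names, and the string class labels are hypothetical, and the rule used for the partially augmented setting (keep annotations of any support class present in the image) is an assumption inferred from the class labels shown in Fig. 2.

```python
from typing import FrozenSet, Set


def support_labels(available: Set[str], own_class: str,
                   support_classes: Set[str], setting: str) -> FrozenSet[str]:
    """Return the class labels kept for one support image.

    available:       every class annotated in the image
    own_class:       the class this image was sampled to represent
    support_classes: the N classes of the task
    setting:         "original", "partial", or "full" (hypothetical names)
    """
    if own_class not in available:
        raise ValueError("a support image must contain its own class")
    if setting == "original":
        # Original setting: keep only the class the example stands for.
        kept = {own_class}
    elif setting == "partial":
        # Assumed rule: keep annotations of any support class in the image.
        kept = available & support_classes
    elif setting == "full":
        # Fully augmented setting: keep every available annotation.
        kept = set(available)
    else:
        raise ValueError(f"unknown setting: {setting}")
    return frozenset(kept)


# The two support images of Fig. 2 (person = 1st class, bike = 2nd class).
task_classes = {"person", "bike"}
first_image = {"person", "bike"}
second_image = {"bike", "tv"}

for name in ("original", "partial", "full"):
    print(name,
          sorted(support_labels(first_image, "person", task_classes, name)),
          sorted(support_labels(second_image, "bike", task_classes, name)))
```

In practice the segmentation mask would be filtered with the same rule, so that labels and masks stay consistent.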
4 Efficient Masked Attention Transformer
Fig. 3 illustrates the pipeline used by our proposed efficient masked attention transformer (EMAT), which builds upon the classification-segmentation transformer (CST) [20]. Both methods share the same feature extraction process: support and query images are processed by a frozen, pre-trained ViT [12] with a fixed patch size, producing support and query image tokens and a support class token, each with the token dimension of a single ViT head. The support image tokens are downsampled via bilinear interpolation and reshaped. Similarly, the query image tokens are reshaped. Next, the support image tokens and the support class token are concatenated. Finally, cosine similarity between the query tokens and the concatenated support tokens is computed across all ViT layers and attention heads, resulting in the correlation tokens.
EMAT differs from CST in the two-layer transformer (purple blocks in Fig. 3) that processes the correlation tokens and feeds task-specific heads for multi-label classification and multi-class segmentation. EMAT enhances this transformer with three key improvements: (1) a novel memory-efficient masked attention formulation (see Sec. 4.1) that allows using higher-resolution correlation tokens, (2) a learnable downscaling strategy (see Sec. 4.2) that avoids reliance on large pooling kernels, and (3) additional modifications for improved parameter efficiency (see Sec. 4.3), which can help to reduce overfitting to a small support set.
Following CST, EMAT is trained using the 1-way 1-shot configuration. Since EMAT uses task-specific heads, it is trained with two losses:
(4) | ||||
(5) |
where and are the ground-truth classification and segmentation labels, and , are the corresponding predictions. The final loss function jointly optimizes both losses using a balancing hyperparameter :
(6) |
Inference on an N-way K-shot task is performed as in CST [20], by treating each class as an independent 1-way K-shot task: class-wise logits and segmentation masks are averaged over the K examples, producing N predictions. Logits above a threshold form the multi-label vector, and each pixel is assigned to the class with the highest score, or to background if all scores fall below a threshold, thereby allowing empty masks.
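The aggregation step described above can be sketched as follows. This is an illustrative example under stated assumptions rather than the CST or EMAT code: the function and variable names are hypothetical, scores are assumed to be probabilities in [0, 1], pixels are flattened into a list, and separate thresholds are exposed for classification and segmentation because the text does not say whether they share one value.

```python
from typing import List, Sequence, Tuple

BACKGROUND = -1  # hypothetical index meaning "no support class"


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def predict(class_logits: List[List[float]],
            pixel_scores: List[List[List[float]]],
            cls_threshold: float,
            seg_threshold: float) -> Tuple[List[bool], List[int]]:
    """class_logits[n][k]: classification score of class n for shot k.
    pixel_scores[n][k][p]: foreground score of pixel p for class n, shot k.
    Returns the multi-label vector and one class index per pixel."""
    n_classes = len(class_logits)
    # Each class is an independent 1-way K-shot task: average over its K shots.
    avg_logits = [mean(shots) for shots in class_logits]
    avg_scores = [[mean(per_pixel) for per_pixel in zip(*shots)]
                  for shots in pixel_scores]
    labels = [logit > cls_threshold for logit in avg_logits]
    segmentation = []
    for p in range(len(avg_scores[0])):
        best = max(range(n_classes), key=lambda n: avg_scores[n][p])
        # Background if even the best class stays below the threshold.
        is_foreground = avg_scores[best][p] >= seg_threshold
        segmentation.append(best if is_foreground else BACKGROUND)
    return labels, segmentation


# Toy 2-way 2-shot task with three pixels.
toy_logits = [[0.9, 0.7], [0.2, 0.4]]
toy_scores = [
    [[0.9, 0.6, 0.1], [0.7, 0.2, 0.3]],  # class 0, shots 1 and 2
    [[0.2, 0.8, 0.1], [0.4, 0.6, 0.1]],  # class 1, shots 1 and 2
]
pred_labels, pred_mask = predict(toy_logits, toy_scores, 0.5, 0.5)
print(pred_labels, pred_mask)
```

If every averaged score stays below the segmentation threshold, all pixels map to `BACKGROUND`, which corresponds to an empty mask.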

4.1 Memory-Efficient Masked Attention
The SOTA FS-CS method, CST [20], uses a masked attention mechanism based on self-attention [47, 12] to process the correlation tokens. Due to the high memory cost of self-attention, CST significantly downsamples the support tokens when computing the correlation tokens, sacrificing fine-grained spatial details (see Fig. 5). To address this, we propose a novel memory-efficient masked attention formulation that allows EMAT to use high-resolution correlation tokens.
Given the correlation tokens, let $Q$, $K$, and $V$ denote the query, key, and value matrices, whose last dimension is the embedding size. The dimensions of $Q$ differ from those of $K$ and $V$ because the query matrix is progressively downscaled (see Sec. 4.2). Additionally, as shown in Fig. 3, the segmentation mask enters directly into the attention mechanism without being processed by the feature extractor. Thus, it is resized and flattened, and a trailing “1” is appended to obtain the attention mask. This appended “1” ensures that the support class token is never masked in Eqs. 7 and 8.
For one attention head, CST computes:
(7) |
where , , , and . Here , , and , the “” accounts for the support class token, and denotes a downscaled value. The operator represents element-wise multiplication.
In CST, the support dimension is 145 ($12 \times 12 + 1$) in the first attention layer and 10 ($3 \times 3 + 1$) in the second. Consequently, in the second layer most values of the mask are zero (see Fig. 5). As a result, Eq. 7 zeros out most attention values, but they are still processed in all intermediate computations, leading to a memory-inefficient formulation. To overcome this, we introduce a memory-efficient reformulation that excludes the masked-out entries:
(8) |
where is our element-wise masking operator:
(9) |
Entries whose mask value is zero are excluded rather than multiplied by zero. This exclusion of elements results in the reduced set of indices used in Eq. 8, which equals the full index set only if the mask contains no zeros. Excluding masked-out tokens reduces memory usage, allowing EMAT to increase the support dimension to 401 ($20 \times 20 + 1$) in the first attention layer (about 2.8 times more than CST) and 101 ($10 \times 10 + 1$) in the second (about 10 times more than CST). Note that the index set varies across images in a batch; thus, Eq. 8 is computed sequentially for each image. However, batch processing is still used both before and after this step because the input and output tensors have the same dimensions for every image. By computing attention only over unmasked entries, EMAT still achieves a runtime comparable to CST.
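The idea of skipping masked-out support tokens can be illustrated with a small single-head example. This is a sketch of the general principle only, not a reproduction of Eq. 7 or Eq. 8: it assumes that masked support tokens should receive zero attention weight and that the remaining weights are normalized over the unmasked tokens, and all function names are hypothetical. Under that assumption, gathering the unmasked keys and values before scoring gives the same output as scoring every key and masking afterwards, while the intermediate score matrix shrinks from one column per support token to one column per unmasked token.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_dense(q: Matrix, k: Matrix, v: Matrix, mask: List[int]) -> Matrix:
    """Reference: score every key, then push masked scores to -inf."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [
            scale * sum(a * b for a, b in zip(q_row, k_row)) if keep else -math.inf
            for k_row, keep in zip(k, mask)
        ]
        weights = softmax(scores)
        out.append([sum(w * row[j] for w, row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


def attend_gathered(q: Matrix, k: Matrix, v: Matrix, mask: List[int]) -> Matrix:
    """Memory-saving variant: drop masked keys and values before scoring."""
    kept = [i for i, keep in enumerate(mask) if keep]
    k_small = [k[i] for i in kept]
    v_small = [v[i] for i in kept]
    return attend_dense(q, k_small, v_small, [1] * len(kept))


# Two (downscaled) query rows, four support tokens; the trailing 1 in the
# mask plays the role of the support class token, which is never masked.
q = [[1.0, 0.0], [0.5, -0.5]]
k = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
mask = [0, 1, 0, 1]

dense = attend_dense(q, k, v, mask)
gathered = attend_gathered(q, k, v, mask)
same = all(math.isclose(a, b, abs_tol=1e-12)
           for row_a, row_b in zip(dense, gathered)
           for a, b in zip(row_a, row_b))
print(same)
```

Because the number of kept tokens depends on each image's mask, a batched implementation along these lines would loop over images for this step, which matches the sequential computation described above.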
4.2 Learnable Downscaling
As mentioned in Sec.˜4.1, the query matrix is progressively downscaled. This downscaling occurs before computing the masked-attention in each layer and it keeps and fixed, while shrinking the support spatial dimensions and , which reduces the support dimension (). Unlike CST, which uses only average pooling, EMAT introduces a lightweight, learnable strategy that combines small convolutions with pooling. This hybrid design removes the need for the large pooling kernels that would otherwise be required to handle the higher-resolution correlation tokens used by EMAT.
In the first attention layer, EMAT splits into support image tokens and a single class token . After reshaping to its support spatial layout, a 3D convolution followed by another reshape produces . Simultaneously, a 2D convolution transforms into . These outputs are concatenated to form the downscaled query matrix used in Eq.˜8, where .
The second attention layer repeats the process, but before the concatenation is collapsed to a single spatial token by a 3D average pool, producing . Concatenating with gives the downscaled query matrix used during the attention computation.
4.3 Modifications for Parameter Efficiency
Few-shot models with too many parameters risk overfitting the small support set, thereby reducing their ability to adapt to new classes. Therefore, EMAT reduces the number of channels across all operations: its two attention layers use 64 and 32 channels, versus 32 and 128 in CST, and its two task-specific heads use 32 and 16 channels, versus 128 and 64 in CST. These channel reductions significantly decrease the number of trainable parameters in EMAT.
5 Experiments
5.0.1 Datasets.
We evaluated our EMAT on the widely used PASCAL-$5^i$ [36] and COCO-$20^i$ [31] datasets. Although they were designed for few-shot segmentation, both can also be used for few-shot classification and segmentation [19]. PASCAL-$5^i$ comprises 20 classes and COCO-$20^i$ 80 classes, each partitioned into four non-overlapping folds.
5.0.2 Implementation Details.
EMAT uses a frozen ViT-S encoder [12] pre-trained with DINOv2 [32]. The two-layer transformer uses our memory-efficient masked attention with 8 heads. We train for 80 epochs with a batch size of 9 using the Adam optimizer [22] with learning rate . Following [20], we use 1-way 1-shot tasks with the original setting (see Eq. 1) and set the loss weight in Eq.˜6 to 0.1. Moreover, we re-train CST [20] with the same DINOv2 backbone used by EMAT and denote it as CST*. All training was conducted on three NVIDIA RTX A6000 GPUs, with evaluation performed on a single GPU.
5.1 Comparison to SOTA FS-CS
To evaluate the effectiveness of our EMAT, we compare it with CST [20] and other state-of-the-art few-shot classification and segmentation (FS-CS) methods. Tab. 1 shows the mean classification accuracy (Acc.) and mean Intersection over Union (mIoU) over the four folds of PASCAL-$5^i$ [36] and COCO-$20^i$ [31], for 2-way 1-shot tasks across all evaluation settings (see Sec. 3). Although DINOv2 pre-training [32] already significantly improves CST* over its original version, EMAT consistently outperforms all methods across all settings. These results validate the benefit of processing higher-resolution correlation tokens enabled by our memory-efficient masked attention (see Sec. 4.1). Moreover, EMAT requires at least four times fewer parameters than CST, making it the most parameter-efficient method among SOTA FS-CS models. The supplementary material provides per-fold results for Tab. 1 and additional results on 5-shot tasks, demonstrating the scalability of our method.
The results in Tab. 1 also show that our partially augmented setting slightly improves accuracy and mIoU for most methods, confirming the benefit of better exploiting the available annotations. However, the improvement is marginal, likely because only 242 and 106 out of 4000 tasks are augmented for PASCAL-$5^i$ and COCO-$20^i$, respectively. In contrast, our fully augmented setting lowers accuracy and mIoU for every method, although less significantly for mIoU. As discussed in Sec. 3.1, this setting augments not only the support examples but also includes every class present in the support images, making the tasks harder. This setting augments 1243 and 1515 out of the 4000 tasks for PASCAL-$5^i$ and COCO-$20^i$, respectively. The augmentations in both of our proposed settings highlight that the original evaluation setting fails to use available annotations.
Table 1: Comparison on 2-way 1-shot tasks, averaged over the four folds. Acc. and mIoU in %; trainable parameters in millions.

| Dataset | Method | Train. Params. (M) | Original Acc. | Original mIoU | Partially Aug. Acc. | Partially Aug. mIoU | Fully Aug. Acc. | Fully Aug. mIoU |
|---|---|---|---|---|---|---|---|---|
| PASCAL-$5^i$ | PANet [50] | 23.51 | 56.53 | 37.20 | 56.93 | 37.49 | 55.75 | 37.25 |
| PASCAL-$5^i$ | PFENet [45] | 31.96 | 39.35 | 35.57 | 39.48 | 35.61 | 36.88 | 35.08 |
| PASCAL-$5^i$ | HSNet [29] | 2.57 | 67.27 | 44.85 | 67.75 | 44.72 | 65.92 | 44.40 |
| PASCAL-$5^i$ | ASNet [19] | 1.32 | 68.30 | 47.87 | 68.62 | 47.78 | 66.40 | 47.58 |
| PASCAL-$5^i$ | CST [20] | 0.37 | 70.37 | 53.78 | 70.60 | 53.81 | 68.45 | 53.76 |
| PASCAL-$5^i$ | CST* | 0.37 | 80.58 | 63.28 | 80.60 | 63.23 | 78.57 | 63.08 |
| PASCAL-$5^i$ | EMAT | 0.09 | 82.70 | 63.38 | 82.92 | 63.32 | 81.23 | 63.24 |
| COCO-$20^i$ | PANet [50] | 23.51 | 51.30 | 23.64 | 51.32 | 23.78 | 45.07 | 23.17 |
| COCO-$20^i$ | PFENet [45] | 31.96 | 36.45 | 23.37 | 36.50 | 23.39 | 29.33 | 21.61 |
| COCO-$20^i$ | HSNet [29] | 2.57 | 62.43 | 30.58 | 62.40 | 30.66 | 55.15 | 29.44 |
| COCO-$20^i$ | ASNet [19] | 1.32 | 63.05 | 31.62 | 63.03 | 31.64 | 55.47 | 30.47 |
| COCO-$20^i$ | CST [20] | 0.37 | 64.02 | 36.23 | 64.10 | 36.20 | 56.30 | 35.60 |
| COCO-$20^i$ | CST* | 0.37 | 78.70 | 51.47 | 78.87 | 51.53 | 71.18 | 50.76 |
| COCO-$20^i$ | EMAT | 0.09 | 80.07 | 52.81 | 80.25 | 52.82 | 73.00 | 51.99 |
5.2 Qualitative Results
As explained in Sec. 3, when handling N-way K-shot tasks with $N > 1$, the query image can contain (1) none, (2) some, or (3) all of the support classes. The top part of Fig. 4 shows that EMAT produces more accurate segmentation masks than CST* in these three scenarios, confirming that EMAT can predict empty masks and masks with one or multiple classes. The bottom part of Fig. 4 illustrates the same task across all few-shot evaluation settings. In the original setting, both models segment the 1st class (orange) correctly but make errors on the 2nd class (glass), mistakenly segmenting visually similar objects (salt shaker). In the partially augmented setting, the additional annotations for the 2nd class degrade CST, while EMAT maintains a precise segmentation, though both still incorrectly segment non-target objects. In the fully augmented setting, both methods correctly segment the additional class (book).
[Figure 4: Qualitative results. Columns: Support | Support | Query | GT Mask | CST* [20] | EMAT. Top: query images containing none, some, or all of the support classes. Bottom: the same task under the original, partially augmented, and fully augmented settings.]
To illustrate the effect of higher-resolution tokens, Fig. 5 compares the segmentation masks used in the masked-attention layers of CST and EMAT. Thanks to our memory-efficient formulation (see Sec. 4.1), EMAT preserves more details across layers. The difference is most visible in the second layer, where the mask used by CST barely contains any information since it has a resolution of $3 \times 3$, whereas EMAT, with a $10 \times 10$ resolution, retains meaningful details and structure. For instance, the person in the bottom-right corner of the last row of Fig. 5 remains visible in both layers of EMAT but vanishes in the second layer of CST. This ability to preserve fine details explains why EMAT produces more accurate masks, especially for small objects, e.g., the dog in the last row of the top part of Fig. 4, the handle of the beer glass in the bottom part of Fig. 4, and the boat and person in Fig. 1.
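A toy example makes the resolution effect concrete. The sketch below is illustrative only: it assumes the mask is resized by block averaging followed by thresholding at 0.5, which is not necessarily the resizing used by CST or EMAT, and the 60 × 60 mask and the function name are hypothetical. A small object in the bottom-right corner survives on a 10 × 10 grid but disappears on a 3 × 3 grid.

```python
from typing import List

Mask = List[List[int]]


def downsample_mask(mask: Mask, size: int) -> Mask:
    """Block-average a square binary mask to size x size, threshold at 0.5.

    Assumes the side length of `mask` is divisible by `size`.
    """
    side = len(mask)
    if side % size:
        raise ValueError("mask side must be divisible by the target size")
    block = side // size
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            covered = sum(mask[y][x]
                          for y in range(r * block, (r + 1) * block)
                          for x in range(c * block, (c + 1) * block))
            # Integer form of covered / block**2 >= 0.5.
            row.append(1 if 2 * covered >= block * block else 0)
        out.append(row)
    return out


# 60 x 60 mask with a 6 x 6 object in the bottom-right corner.
side = 60
small_object = [[1 if y >= 54 and x >= 54 else 0 for x in range(side)]
                for y in range(side)]

coarse = downsample_mask(small_object, 3)   # second-layer grid of CST
fine = downsample_mask(small_object, 10)    # second-layer grid of EMAT
print(sum(map(sum, coarse)), sum(map(sum, fine)))
```

On the 3 × 3 grid the object covers only a small fraction of its cell and is thresholded away, whereas on the 10 × 10 grid it fills one cell completely.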
[Figure 5: Segmentation masks used in the masked-attention layers. Columns: Full Res. | Max Res. | CST (l=1) | CST (l=2) | EMAT (l=1) | EMAT (l=2).]
5.3 Analysis of Small Objects
To further analyze the impact of higher-resolution correlation tokens on small objects, we filter each fold of PASCAL-$5^i$ [36] and COCO-$20^i$ [31] based on object size, creating three splits: objects occupying 0–5 %, 5–10 %, and 10–15 % of the image (see the supplementary material for details on how these splits were defined). Fig. 6 shows the average accuracy and mIoU of CST* and the corresponding improvement achieved by EMAT across the three splits for both datasets. The results indicate that accuracy and mIoU increase with object size, and EMAT provides the largest improvement over CST* for the smallest objects, gradually decreasing as object size increases. The enhanced classification and segmentation accuracy of EMAT is likely due to better localization enabled by its higher-resolution correlation tokens (see Fig. 5).
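As a rough illustration of such a size-based filter, the sketch below assigns a binary object mask to one of the three splits from the fraction of image pixels it covers. The exact split definition is given in the supplementary material, so the details here are assumptions: object size is taken as the covered pixel fraction, each interval is treated as open on the left and closed on the right, and the function names are hypothetical.

```python
from typing import List, Optional

# Split boundaries in percent of the image area, as listed in the text.
SPLITS = [(0, 5), (5, 10), (10, 15)]


def size_split(mask: List[List[int]]) -> Optional[str]:
    """Return the split name for a binary object mask, or None if the
    object is absent or covers more than 15 % of the image."""
    total = sum(len(row) for row in mask)
    covered = sum(sum(row) for row in mask)
    for low, high in SPLITS:
        # Integer form of low % < covered / total <= high %.
        if low * total < 100 * covered <= high * total:
            return f"{low}-{high} %"
    return None


def toy_mask(side: int, object_pixels: int) -> List[List[int]]:
    """Square mask whose first `object_pixels` pixels (row-major) are 1."""
    flat = [1] * object_pixels + [0] * (side * side - object_pixels)
    return [flat[r * side:(r + 1) * side] for r in range(side)]


for pixels in (3, 8, 12, 40):
    print(pixels, size_split(toy_mask(10, pixels)))
```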

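Table 2: Ablation study on PASCAL-$5^i$. ME: memory-efficient masked attention; LD: learnable downscaling; PE: parameter-efficiency modifications. Memory usage in GB, trainable parameters in thousands, Acc. and mIoU in %.

| Method | ME | LD | PE | Mem. Usage | Train. Params. | All Dataset Acc. | All Dataset mIoU | Small Objects Acc. | Small Objects mIoU |
|---|---|---|---|---|---|---|---|---|---|
| CST* (original support dimension) | – | – | – | 8.68 | 366.00 | 80.58 | 63.28 | 88.96 | 58.16 |
| CST* (largest support dimension that fits in memory) | – | – | – | 39.22 | 366.00 | 82.23 | 63.31 | 89.65 | 58.75 |
| CST* (same support dimension as EMAT) | – | – | – | 63 | 366.00 | N/A | N/A | N/A | N/A |
| EMAT | ✓ | – | – | 36.92 | 366.00 | 81.95 | 62.97 | 87.99 | 58.06 |
| EMAT | ✓ | ✓ | – | 36.53 | 404.48 | 82.17 | 63.36 | 90.49 | 58.73 |
| EMAT | ✓ | ✓ | ✓ | 38.31 | 86.02 | 82.70 | 63.38 | 91.74 | 59.17 |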
5.4 Ablation Study
Tab. 2 first reports the results of CST* with its original support dimension per layer (145 and 10). For fair comparison, we increased the support dimension of CST* to match that of EMAT, but it required about 63 GB of GPU memory, which exceeded the 48 GB capacity of our GPUs, so we instead use the largest support dimension that fits in our memory. For EMAT we progressively integrated: (1) memory-efficient masked attention (see Sec. 4.1), (2) learnable downscaling of the query matrix (see Sec. 4.2), and (3) parameter-efficiency modifications (see Sec. 4.3). Although Tab. 2 includes results on the full PASCAL-$5^i$ test set, the discussion below focuses on the small-object subset to highlight the effect of each modification introduced by EMAT.
Adding our memory-efficient masked attention alone lowers memory usage by 26 GB (≈41 %) but does not improve accuracy or mIoU compared to either variant of CST*, likely because the model relies on large pooling windows for processing the higher-resolution correlation tokens. Incorporating our learnable downscaling removes those large windows and yields absolute accuracy gains of +1.53 % over the original CST* and of +0.84 % over the variant with the larger support dimension. It also achieves an absolute mIoU gain of +0.57 % compared with the original CST*, while matching the mIoU of the variant with the larger support dimension.
Because our learnable downscaling increases the number of trainable parameters, we next apply our parameter-efficiency modifications that remove 318 K parameters (≈79 %), while still saving about 39 % of the memory CST* would need for using the same support dimension as EMAT. These modifications result in absolute accuracy gains of +2.78 % over the original CST* and +2.09 % over the variant with the larger support dimension; mIoU improves by +1.01 % and +0.42 %, respectively. EMAT also slightly improves accuracy and mIoU on the full test set, but its largest gains appear on images containing small objects.
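The gains and savings quoted in this ablation follow directly from the entries of Tab. 2, as the short check below shows. All numbers are copied from the table (small-object columns, memory in GB, parameters in thousands); the dictionary keys and helper names are hypothetical.

```python
# Values copied from Tab. 2 (small-object subset).
ablation = {
    "cst_original": {"mem": 8.68, "params": 366.00, "acc": 88.96, "miou": 58.16},
    "cst_larger": {"mem": 39.22, "params": 366.00, "acc": 89.65, "miou": 58.75},
    "emat_me": {"mem": 36.92, "params": 366.00, "acc": 87.99, "miou": 58.06},
    "emat_me_ld": {"mem": 36.53, "params": 404.48, "acc": 90.49, "miou": 58.73},
    "emat_full": {"mem": 38.31, "params": 86.02, "acc": 91.74, "miou": 59.17},
}
CST_SAME_DIM_MEM = 63.0  # memory CST* needs with EMAT's support dimension


def gain(row: str, base: str, metric: str) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(ablation[row][metric] - ablation[base][metric], 2)


def relative_drop(before: float, after: float) -> int:
    """Relative reduction in percent, rounded to the nearest integer."""
    return round(100 * (before - after) / before)


# Memory-efficient masked attention alone versus CST* at the same dimension.
print(round(CST_SAME_DIM_MEM - ablation["emat_me"]["mem"], 2),
      relative_drop(CST_SAME_DIM_MEM, ablation["emat_me"]["mem"]))

# Learnable downscaling: gains over both CST* variants.
print(gain("emat_me_ld", "cst_original", "acc"),
      gain("emat_me_ld", "cst_larger", "acc"),
      gain("emat_me_ld", "cst_original", "miou"))

# Parameter-efficiency modifications: removed parameters and final gains.
print(round(ablation["emat_me_ld"]["params"] - ablation["emat_full"]["params"], 2),
      relative_drop(ablation["emat_me_ld"]["params"], ablation["emat_full"]["params"]),
      relative_drop(CST_SAME_DIM_MEM, ablation["emat_full"]["mem"]))
print(gain("emat_full", "cst_original", "acc"), gain("emat_full", "cst_larger", "acc"),
      gain("emat_full", "cst_original", "miou"), gain("emat_full", "cst_larger", "miou"))
```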
6 Limitations
While the results of Tab. 2 validate that our proposed EMAT is memory and parameter efficient for high-resolution correlation tokens, its memory efficiency is constrained to datasets where the segmentation masks contain unlabeled areas. This limitation arises because our memory-efficient masked attention mechanism is equivalent to self-attention [47, 12] when applied to dense semantic-segmentation datasets like Cityscapes [9], where each pixel in the segmentation mask corresponds directly to one of the semantic classes in the image.
7 Conclusion
In this work, we propose EMAT, an enhancement over CST, the state-of-the-art method for few-shot classification and segmentation (FS-CS). EMAT incorporates our novel memory-efficient masked attention mechanism that allows our model to process high-resolution correlation tokens while maintaining memory and parameter efficiency. Our results demonstrate that EMAT consistently outperforms all FS-CS methods across all evaluation settings while requiring at least four times fewer trainable parameters. Moreover, our qualitative results highlight that EMAT is capable of correctly generating empty segmentation masks when necessary and capturing finer details more accurately, which improves accuracy when dealing with small objects. Additionally, we introduce two novel few-shot evaluation settings designed to maximize the use of the available annotations during inference, reflecting practical few-shot scenarios.
Acknowledgments. This work was funded by the Hessian Ministry of Science and Research, Arts and Culture (HMWK) through the project “The Third Wave of Artificial Intelligence – 3AI”. The work was further supported by the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) under Germany’s Excellence Strategy (EXC 3057/1 “Reasonable Artificial Intelligence”, Project No. 533677015). Stefan Roth acknowledges support by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 866008).
References
- [1] Aggarwal, P., Deshpande, A., Narasimhan, K.R.: SemSup-XC: Semantic supervision for zero and few-shot extreme classification. In: ICML. vol. 202, pp. 228–247 (2023)
- [2] Alfassy, A., Karlinsky, L., Aides, A., Shtok, J., Harary, S., Feris, R.S., Giryes, R., Bronstein, A.M.: LaSO: Label-set operations networks for multi-label few-shot learning. In: CVPR. pp. 6548–6557 (2019)
- [3] Allen, K.R., Shelhamer, E., Shin, H., Tenenbaum, J.B.: Infinite mixture prototypes for few-shot learning. In: ICML. vol. 97, pp. 232–241 (2019)
- [4] Antoniou, A., Edwards, H., Storkey, A.J.: How to train your MAML. In: ICLR (2019)
- [5] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV. pp. 9650–9660 (2021)
- [6] Carrión-Ojeda, D., Alam, M., Escalera, S., et al.: NeurIPS’22 Cross-Domain MetaDL Challenge: Results and lessons learned. In: NeurIPS Competition Track. vol. 220, pp. 50–72 (2022)
- [7] Chen, H., Dong, Y., Lu, Z., Yu, Y., Han, J.: Pixel matching network for cross-domain few-shot segmentation. In: WACV. pp. 978–987 (2024)
- [8] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., Dai, J.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR. pp. 24185–24198 (2024)
- [9] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR. pp. 3213–3223 (2016)
- [10] Dhillon, G.S., Chaudhari, P., Ravichandran, A., Soatto, S.: A baseline for few-shot image classification. In: ICLR (2020)
- [11] Dong, N., Xing, E.P.: Few-shot semantic segmentation with prototype learning. In: BMVC (2018)
- [12] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
- [13] Fan, X., Wang, X., Gao, J., Wang, J., Luo, Z., Liu, R.: Bi-level learning of task-specific decoders for joint registration and one-shot medical image segmentation. In: CVPR. pp. 11726–11735 (2024)
- [14] Fang, Z., Wang, X., Li, H., Liu, J., Hu, Q., Xiao, J.: FastRecon: Few-shot industrial anomaly detection via fast feature reconstruction. In: ICCV. pp. 17481–17490 (2023)
- [15] Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML. vol. 70, pp. 1126–1135 (2017)
- [16] Gong, X., Xia, X., Zhu, W., Zhang, B., Doermann, D., Zhuo, L.: Deformable Gabor feature networks for biomedical image classification. In: WACV. pp. 4004–4012 (2021)
- [17] Hao, F., He, F., Liu, L., Wu, F., Tao, D., Cheng, J.: Class-aware patch embedding adaptation for few-shot image classification. In: ICCV. pp. 18905–18915 (2023)
- [18] Herzog, J.: Adapt before comparison: A new perspective on cross-domain few-shot segmentation. In: CVPR. pp. 23605–23615 (2024)
- [19] Kang, D., Cho, M.: Integrative few-shot learning for classification and segmentation. In: CVPR. pp. 9979–9990 (2022)
- [20] Kang, D., Koniusz, P., Cho, M., Murray, N.: Distilling self-supervised vision transformers for weakly-supervised few-shot classification & segmentation. In: CVPR. pp. 19627–19638 (2023)
- [21] Kang, D., Kwon, H., Min, J., Cho, M.: Relational embedding for few-shot classification. In: ICCV. pp. 8822–8833 (2021)
- [22] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
- [23] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: ICCV. pp. 4015–4026 (2023)
- [24] Li, W.H., Liu, X., Bilen, H.: Cross-domain few-shot learning with task-specific adapters. In: CVPR. pp. 7161–7170 (2022)
- [25] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
- [26] Liu, J., Bao, Y., Xie, G.S., Xiong, H., Sonke, J.J., Gavves, E.: Dynamic prototype convolution network for few-shot semantic segmentation. In: CVPR. pp. 11553–11562 (2022)
- [27] Liu, Y., Zhu, M., Li, H., Chen, H., Wang, X., Shen, C.: Matcher: Segment anything with one shot using all-purpose feature matching. In: ICLR (2024)
- [28] Ma, T., Sun, Y., Yang, Z., Yang, Y.: ProD: Prompting-to-disentangle domain knowledge for cross-domain few-shot image classification. In: CVPR. pp. 19754–19763 (2023)
- [29] Min, J., Kang, D., Cho, M.: Hypercorrelation squeeze for few-shot segmentation. In: ICCV. pp. 6941–6952 (2021)
- [30] Moon, S., Sohn, S.S., Zhou, H., Yoon, S., Pavlovic, V., Khan, M.H., Kapadia, M.: MSI: Maximize support-set information for few-shot segmentation. In: ICCV. pp. 19266–19276 (2023)
- [31] Nguyen, K., Todorovic, S.: Feature weighting and boosting for few-shot segmentation. In: ICCV. pp. 622–631 (2019)
- [32] Oquab, M., Darcet, T., Moutakanni, T., et al.: DINOv2: Learning robust visual features without supervision. Trans. Mach. Learn. Res. (2024)
- [33] Peng, B., Tian, Z., Wu, X., Wang, C., Liu, S., Su, J., Jia, J.: Hierarchical dense correlation distillation for few-shot segmentation. In: CVPR. pp. 23641–23651 (2023)
- [34] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML. vol. 139, pp. 8748–8763 (2021)
- [35] Raghu, A., Raghu, M., Bengio, S., Vinyals, O.: Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In: ICLR (2020)
- [36] Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B.: One-shot learning for semantic segmentation. In: BMVC (2017)
- [37] Silva-Rodríguez, J., Hajimiri, S., Ben Ayed, I., Dolz, J.: A closer look at the few-shot adaptation of large vision-language models. In: CVPR. pp. 23681–23690 (2024)
- [38] Simon, C., Koniusz, P., Harandi, M.: Meta-learning for multi-label few-shot classification. In: WACV. pp. 346–355 (2022)
- [39] Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning. In: NIPS. pp. 4077–4087 (2017)
- [40] Sun, Q., Liu, Y., Chua, T.S., Schiele, B.: Meta-transfer learning for few-shot learning. In: CVPR. pp. 403–412 (2019)
- [41] Gemini Team, Google: Gemini: A family of highly capable multimodal models. arXiv:2312.11805 [cs.CL] (2023)
- [42] Tian, P., Wu, Z., Qi, L., Wang, L., Shi, Y., Gao, Y.: Differentiable meta-learning model for few-shot semantic segmentation. In: AAAI. pp. 12087–12094 (2020)
- [43] Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J.B., Isola, P.: Rethinking few-shot image classification: A good embedding is all you need? In: ECCV. vol. 12359, pp. 266–282 (2020)
- [44] Tian, Z., Lai, X., Jiang, L., Liu, S., Shu, M., Zhao, H., Jia, J.: Generalized few-shot semantic segmentation. In: CVPR. pp. 11563–11572 (2022)
- [45] Tian, Z., Zhao, H., Shu, M., Yang, Z., Li, R., Jia, J.: Prior guided feature enrichment network for few-shot segmentation. IEEE T. Pattern Anal. Mach. Intell. 44(2), 1050–1065 (2022)
- [46] Ullah, I., Carrión-Ojeda, D., Escalera, S., Guyon, I., Huisman, M., Mohr, F., van Rijn, J.N., Sun, H., Vanschoren, J., Vu, P.A.: Meta-Album: Multi-domain meta-dataset for few-shot image classification. In: NeurIPS. vol. 35, pp. 3232–3247 (2022)
- [47] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)
- [48] Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., Wierstra, D.: Matching networks for one shot learning. In: NIPS. pp. 3630–3638 (2016)
- [49] Wang, J., Zhang, B., Pang, J., Chen, H., Liu, W.: Rethinking prior information generation with CLIP for few-shot segmentation. In: CVPR. pp. 3941–3951 (2024)
- [50] Wang, K., Liew, J.H., Zou, Y., Zhou, D., Feng, J.: PANet: Few-shot image semantic segmentation with prototype alignment. In: ICCV. pp. 9196–9205 (2019)
- [51] Wang, Y., Luo, N., Zhang, T.: Focus on query: Adversarial mining transformer for few-shot segmentation. In: NeurIPS. vol. 36, pp. 31524–31542 (2023)
- [52] Wu, A., Han, Y., Zhu, L., Yang, Y.: Universal-prototype enhancing for few-shot object detection. In: ICCV. pp. 9567–9576 (2021)
- [53] Xie, G.S., Xiong, H., Liu, J., Yao, Y., Shao, L.: Few-shot semantic segmentation with cyclic memory network. In: ICCV. pp. 7293–7302 (2021)
- [54] Xu, Q., Zhao, W., Lin, G., Long, C.: Self-calibrated cross attention network for few-shot segmentation. In: ICCV. pp. 655–665 (2023)
- [55] Yang, Y., Chen, Q., Feng, Y., Huang, T.: MIANet: Aggregating unbiased instance and general information for few-shot semantic segmentation. In: CVPR. pp. 7131–7140 (2023)
- [56] Ye, C., Zhu, H., Liao, Y., Zhang, Y., Chen, T., Fan, J.: What makes for effective few-shot point cloud classification? In: WACV. pp. 1829–1838 (2022)
- [57] Zhang, A., Gao, G., Jiao, J., Liu, C., Wei, Y.: Bridge the points: Graph-based few-shot segment anything semantically. In: NeurIPS (2024)
- [58] Zhang, B., Xiao, J., Qin, T.: Self-guided and cross-guided learning for few-shot segmentation. In: CVPR. pp. 8312–8321 (2021)
- [59] Zhang, M., Shi, M., Li, L.: MFNet: Multiclass few-shot segmentation network with pixel-wise metric learning. IEEE T. Circuits Syst. Video Tech. 32(12), 8586–8598 (2022)
- [60] Zhou, F., Wang, P., Zhang, L., Wei, W., Zhang, Y.: Revisiting prototypical network for cross domain few-shot learning. In: CVPR. pp. 20061–20070 (2023)
- [61] Zhou, Z., Xu, H.M., Shu, Y., Liu, L.: Unlocking the potential of pre-trained vision transformers for few-shot semantic segmentation through relationship descriptors. In: CVPR. pp. 3817–3827 (2024)
- [62] Zhu, L., Chen, T., Ji, D., Ye, J., Liu, J.: LLaFS: When large language models meet few-shot segmentation. In: CVPR. pp. 3065–3075 (2024)
- [63] Zhu, L., Chen, T., Yin, J., See, S., Liu, J.: Addressing background context bias in few-shot segmentation through iterative modulation. In: CVPR. pp. 3370–3379 (2024)
- [64] Zhu, X., Zhang, R., He, B., Zhou, A., Wang, D., Zhao, B., Gao, P.: Not all features matter: Enhancing few-shot clip with adaptive prior refinement. In: ICCV. pp. 2605–2615 (2023)
- [65] Zhu, Y., Wang, S., Xin, T., Zhang, H.: Few-shot medical image segmentation via a region-enhanced prototypical transformer. In: MICCAI. pp. 271–280 (2023)
Appendix A Few-shot Evaluation Settings
As explained in Sec. 3.1, we introduce two novel few-shot evaluation settings, partially and fully augmented, to better utilize available annotations and represent realistic scenarios. Fig. 7 illustrates how the support set of a 2-way 2-shot (base task configuration) task is modified under each setting.
In the original setting [19], additional annotations are discarded, leaving the base task configuration unchanged. The drawback of this setting is that it ignores available annotations. For example, in Fig. 7, the second image for the 1st class (dog) also contains annotations for the 2nd class (horse), but these are removed.
To address the loss of available annotations, our partially augmented setting retains annotations for the support classes (e.g., dog and horse in Fig. 7) while removing those for non-support classes (e.g., person in Fig. 7). Since few-shot classification and segmentation (FS-CS) methods treat N-way K-shot tasks as N separate 1-way K-shot tasks, each image (shot) must contain only one class (way). Therefore, if an image contains multiple support classes, we duplicate it so that each copy includes only one. This leads to a task configuration that no longer follows the N-way K-shot definition [48], as the number of shots per class can vary. Consequently, the classes with fewer shots are randomly augmented using the available support data. For instance, in Fig. 7, the partially augmented setting converts the 2-way 2-shot into a 2-way 3-shot task. It is important to note that in this setting, the number of ways remains fixed but the number of shots can increase up to .
On the other hand, our fully augmented setting incorporates all available annotations, including those for non-support classes (e.g., person in Fig. 7). The task is then processed similarly to the partially augmented setting. However, in this case, both the number of ways and the number of shots can increase, with shots still bounded by . Regardless of the setting, models are evaluated only on the final support classes. For example, in Fig. 7, models in the original and partially augmented settings are evaluated on dogs and horses, while in the fully augmented setting, they are also evaluated on people.
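The construction of both augmented settings can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `augment_support_set`, the `(image_id, classes)` input format, and the image identifiers are hypothetical, and the re-balancing step assumes that "randomly augmented using the available support data" means re-sampling shots the class already has.

```python
import random
from collections import defaultdict


def augment_support_set(images, support_classes, fully=False, seed=0):
    """Build a one-class-per-shot support set (illustrative sketch).

    images: list of (image_id, set_of_annotated_classes); hypothetical format.
    Returns {class_name: [image_id, ...]} with equally many shots per class.
    Assumes every retained class is annotated in at least one image.
    """
    rng = random.Random(seed)
    if fully:
        # Fully augmented: every annotated class becomes a way.
        ways = sorted({c for _, classes in images for c in classes})
    else:
        # Partially augmented: only the original support classes are kept.
        ways = sorted(support_classes)

    shots = defaultdict(list)
    for image_id, classes in images:
        # Duplicate the image once per retained class, so that each copy
        # (shot) carries the annotation of exactly one class (way).
        for c in classes:
            if c in ways:
                shots[c].append(image_id)

    # Assumption: classes with fewer shots re-sample their own shots at
    # random until all classes have the same number of shots.
    target = max(len(shots[c]) for c in ways)
    for c in ways:
        while len(shots[c]) < target:
            shots[c].append(rng.choice(shots[c]))
    return dict(shots)


# Hypothetical 2-way 2-shot task in the spirit of the dog/horse example.
task = [
    ("dog_1", {"dog"}),
    ("dog_2", {"dog", "horse"}),
    ("horse_1", {"horse"}),
    ("horse_2", {"horse", "person"}),
]
partial = augment_support_set(task, {"dog", "horse"})
full = augment_support_set(task, {"dog", "horse"}, fully=True)
print({c: len(v) for c, v in partial.items()})
print({c: len(v) for c, v in full.items()})
```

In this toy task the partially augmented variant yields three shots for each of the two support classes, while the fully augmented variant additionally introduces `person` as a third way.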

Practical Considerations.
Because annotating data is time- and resource-consuming, few-shot evaluation settings should aim to maximize the use of existing annotations. Our partially augmented setting achieves this without increasing task difficulty and may even enhance FS-CS performance by providing more examples per class. On the other hand, our fully augmented setting incorporates all available annotations, increasing both the number of ways and the number of shots, making the task more challenging by requiring the model to classify and segment additional classes. Nevertheless, neither setting requires extra annotations beyond those already available; instead, they use existing labels discarded by the original setting to better represent realistic scenarios.
Although the loss of annotations in the original setting may not significantly affect training, our partially augmented setting can safely be used in that phase. In contrast, our fully augmented setting should be used only for evaluation to prevent data leakage, as it mixes training and test classes within the support set. For this reason, we recommend using both settings exclusively for evaluation.
Object Size Filtering.
Very small objects can add noise during task augmentation in the partially and fully augmented settings, so we filter out any object that occupies less than a threshold of the image area. We set for PASCAL-5i [36] and COCO-20i [31], values that correspond to the 25th percentile of object sizes in the training set of each dataset. These different thresholds reflect the dataset-specific object size distributions.
Fig. 8 shows the impact of applying these thresholds on the classification accuracy and segmentation mIoU of our efficient masked attention transformer (EMAT) on COCO-20i for -way 1-shot tasks. Filtering out small objects consistently improves accuracy and mIoU in both augmented settings, with a larger gain in the fully augmented setting because it augments more tasks. Therefore, all results were obtained using the proposed thresholding.
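The filtering rule described in this subsection can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released code: the helper names are hypothetical, objects are assumed to be given as binary masks of the full image, object size is assumed to be the fraction of image pixels covered, and NumPy's default linear-interpolation percentile is assumed.

```python
import numpy as np


def relative_areas(masks):
    """Fraction of image pixels covered by each binary object mask."""
    return np.array([np.asarray(m, dtype=bool).mean() for m in masks])


def percentile_threshold(train_masks, q=25.0):
    """Size threshold taken as the q-th percentile of training object sizes."""
    return float(np.percentile(relative_areas(train_masks), q))


def filter_small_objects(masks, threshold):
    """Keep objects whose relative area is at least the threshold.

    Objects occupying less than the threshold are dropped before the
    task is augmented.
    """
    return [m for m in masks if np.asarray(m, dtype=bool).mean() >= threshold]
```

Computing the threshold once per dataset from its training split mirrors the dataset-specific thresholds mentioned above; the same threshold would then be applied whenever a support set is augmented.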

Appendix B FS-CS Per-Fold Results
Method | Fold-0 Acc. | Fold-0 mIoU | Fold-1 Acc. | Fold-1 mIoU | Fold-2 Acc. | Fold-2 mIoU | Fold-3 Acc. | Fold-3 mIoU | Average Acc. | Average mIoU |
Original | ||||||||||
PANet [50] | 62.30 | 33.31 | 53.30 | 45.94 | 49.50 | 31.20 | 61.00 | 38.34 | 56.53 | 37.20 |
PFENet [45] | 23.10 | 31.48 | 53.80 | 46.60 | 40.60 | 31.15 | 39.90 | 33.03 | 39.35 | 35.57 |
HSNet [29] | 68.20 | 43.97 | 73.00 | 55.12 | 56.90 | 35.19 | 71.00 | 45.14 | 67.27 | 44.85 |
ASNet [19] | 68.20 | 48.44 | 76.20 | 58.19 | 58.80 | 36.45 | 70.00 | 48.41 | 68.30 | 47.87 |
CST [20] | 70.10 | 53.90 | 75.20 | 59.98 | 61.70 | 46.30 | 74.50 | 54.93 | 70.37 | 53.78 |
CST* | 89.40 | 64.61 | 80.90 | 68.16 | 71.40 | 55.50 | 80.60 | 64.84 | 80.58 | 63.28 |
EMAT | 90.70 | 66.68 | 83.50 | 67.61 | 72.00 | 54.15 | 84.60 | 65.07 | 82.70 | 63.38 |
Partially Augmented ( tasks) | ||||||||||
PANet [50] | 62.30 | 33.31 | 53.00 | 46.00 | 51.70 | 32.35 | 60.70 | 38.32 | 56.93 | 37.49 |
PFENet [45] | 23.10 | 31.48 | 54.10 | 46.68 | 40.70 | 31.28 | 40.00 | 33.02 | 39.48 | 35.61 |
HSNet [29] | 68.20 | 43.97 | 72.90 | 54.85 | 59.00 | 34.90 | 70.90 | 45.18 | 67.75 | 44.72 |
ASNet [19] | 68.20 | 48.44 | 76.00 | 57.88 | 60.50 | 36.28 | 69.80 | 48.51 | 68.62 | 47.78 |
CST [20] | 70.10 | 53.90 | 75.30 | 60.04 | 62.40 | 46.32 | 74.60 | 54.97 | 70.60 | 53.81 |
CST* | 89.40 | 64.61 | 80.90 | 68.10 | 71.50 | 55.44 | 80.60 | 64.77 | 80.60 | 63.23 |
EMAT | 90.70 | 66.68 | 83.30 | 67.48 | 73.00 | 54.03 | 84.70 | 65.10 | 82.92 | 63.32 |
Fully Augmented ( tasks) | ||||||||||
PANet [50] | 62.30 | 33.31 | 51.70 | 45.95 | 49.60 | 31.65 | 59.40 | 38.11 | 55.75 | 37.25 |
PFENet [45] | 23.10 | 31.48 | 52.60 | 46.14 | 33.70 | 30.33 | 38.10 | 32.35 | 36.88 | 35.08 |
HSNet [29] | 68.20 | 43.97 | 72.20 | 54.55 | 53.80 | 33.87 | 69.50 | 45.19 | 65.92 | 44.40 |
ASNet [19] | 68.20 | 48.44 | 75.00 | 57.66 | 54.30 | 35.90 | 68.10 | 48.31 | 66.40 | 47.58 |
CST [20] | 70.10 | 53.90 | 74.50 | 59.50 | 56.70 | 46.62 | 72.50 | 55.01 | 68.45 | 53.76 |
CST* | 89.40 | 64.61 | 80.20 | 67.43 | 66.40 | 55.80 | 78.30 | 64.47 | 78.57 | 63.08 |
EMAT | 90.70 | 66.68 | 82.60 | 66.93 | 68.30 | 54.37 | 83.30 | 65.00 | 81.23 | 63.24 |
Sec. 5.1 compares our EMAT with current state-of-the-art (SOTA) FS-CS methods, reporting results averaged over the four folds of the PASCAL-5i and COCO-20i datasets using 2-way 1-shot tasks. To complement those results, Tabs. 3 and 4 present per-fold results, showing that EMAT outperforms all FS-CS methods on most folds across both datasets and all evaluation settings. These tables also report the number of augmented tasks in our proposed evaluation settings. On PASCAL-5i, the partially and fully augmented settings augment 242 and 694 tasks, respectively, meaning that about 6 % of tasks contain extra support-class annotations (partially augmented), and 11 % contain extra annotations for non-support classes (fully augmented). On COCO-20i, the corresponding augmentation rates are roughly 3 % and 35 %. These results confirm that our evaluation settings use annotations discarded by the original setting, thereby creating more realistic evaluation scenarios.
Method | Fold-0 Acc. | Fold-0 mIoU | Fold-1 Acc. | Fold-1 mIoU | Fold-2 Acc. | Fold-2 mIoU | Fold-3 Acc. | Fold-3 mIoU | Average Acc. | Average mIoU |
Original | ||||||||||
PANet [50] | 46.60 | 24.91 | 52.70 | 24.98 | 55.90 | 23.31 | 50.00 | 21.36 | 51.30 | 23.64 |
PFENet [45] | 35.60 | 23.99 | 34.30 | 24.57 | 43.10 | 20.99 | 32.80 | 23.93 | 36.45 | 23.37 |
HSNet [29] | 57.70 | 29.77 | 62.30 | 30.94 | 67.10 | 31.31 | 62.60 | 30.31 | 62.43 | 30.58 |
ASNet [19] | 59.50 | 29.75 | 61.50 | 32.99 | 68.80 | 33.41 | 62.40 | 30.35 | 63.05 | 31.62 |
CST [20] | 61.00 | 34.74 | 66.40 | 37.14 | 68.20 | 36.76 | 60.50 | 36.29 | 64.02 | 36.23 |
CST* | 74.00 | 49.38 | 79.90 | 53.88 | 81.30 | 51.46 | 79.60 | 51.15 | 78.70 | 51.47 |
EMAT | 74.00 | 50.54 | 83.10 | 55.44 | 83.10 | 53.05 | 80.10 | 52.19 | 80.07 | 52.81 |
Partially Augmented ( tasks) | ||||||||||
PANet [50] | 46.40 | 25.27 | 52.80 | 25.10 | 55.90 | 23.27 | 50.20 | 21.48 | 51.32 | 23.78 |
PFENet [45] | 35.80 | 24.08 | 34.30 | 24.52 | 43.00 | 21.00 | 32.90 | 23.96 | 36.50 | 23.39 |
HSNet [29] | 57.50 | 29.88 | 62.60 | 31.02 | 67.00 | 31.30 | 62.50 | 30.44 | 62.40 | 30.66 |
ASNet [19] | 59.30 | 29.87 | 61.70 | 32.95 | 68.80 | 33.39 | 62.30 | 30.36 | 63.03 | 31.64 |
CST [20] | 61.30 | 34.59 | 66.20 | 37.08 | 68.40 | 36.76 | 60.50 | 36.36 | 64.10 | 36.20 |
CST* | 74.60 | 49.27 | 79.80 | 53.98 | 81.50 | 51.68 | 79.60 | 51.19 | 78.87 | 51.53 |
EMAT | 74.30 | 50.59 | 83.20 | 55.42 | 83.30 | 53.06 | 80.20 | 52.22 | 80.25 | 52.82 |
Fully Augmented ( tasks) | ||||||||||
PANet [50] | 30.90 | 24.43 | 49.30 | 24.44 | 53.50 | 22.98 | 46.60 | 20.81 | 45.07 | 23.17 |
PFENet [45] | 19.60 | 20.33 | 29.10 | 23.79 | 40.70 | 20.40 | 27.90 | 21.92 | 29.33 | 21.61 |
HSNet [29] | 40.20 | 27.73 | 57.50 | 29.82 | 64.30 | 30.96 | 58.60 | 29.26 | 55.15 | 29.44 |
ASNet [19] | 41.40 | 27.23 | 55.70 | 32.10 | 66.60 | 33.14 | 58.20 | 29.43 | 55.47 | 30.47 |
CST [20] | 44.70 | 33.74 | 59.70 | 36.32 | 65.60 | 36.87 | 55.20 | 35.46 | 56.30 | 35.60 |
CST* | 54.40 | 48.25 | 74.80 | 52.86 | 79.70 | 51.69 | 75.80 | 50.25 | 71.18 | 50.76 |
EMAT | 56.30 | 49.53 | 78.50 | 54.46 | 81.40 | 53.06 | 75.80 | 50.90 | 73.00 | 51.99 |
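The Average columns of the per-fold tables above are consistent with a plain arithmetic mean over the four folds, rounded to two decimals. The short check below illustrates this for the EMAT row of the first table in the Original setting; the helper name `fold_average` is hypothetical, and the unweighted mean is an assumption that happens to match the reported values.

```python
from statistics import mean


def fold_average(per_fold_values):
    """Unweighted mean over folds, rounded as in the tables."""
    return round(mean(per_fold_values), 2)


# EMAT, Original setting, first per-fold table (Fold-0 to Fold-3).
emat_acc = [90.70, 83.50, 72.00, 84.60]
emat_miou = [66.68, 67.61, 54.15, 65.07]

print(fold_average(emat_acc), fold_average(emat_miou))
```

The same helper can be applied to any other row to cross-check a reported average against its per-fold entries.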
To evaluate the scalability across different N-way K-shot configurations, Fig. 9 shows the accuracy and mIoU of all FS-CS methods on PASCAL-5i and COCO-20i in the fully augmented setting using -way 5-shot tasks. EMAT consistently achieves higher classification accuracy and mIoU compared to other FS-CS methods. Notably, while increasing the number of classes (N) typically raises task difficulty, EMAT maintains stable segmentation performance, further validating the fully augmented setting as a challenging yet realistic and effective benchmark for FS-CS evaluation.

Appendix C Analysis of Object Size Distribution in PASCAL-5i

Sec. 5.3 analyzes the impact of higher-resolution correlation tokens on small objects by filtering each fold of PASCAL-5i and COCO-20i based on object size. This filtering results in three splits: objects occupying 0–5 %, 5–10 %, and 10–15 % of the image. These intervals were chosen after examining the object size distribution in the PASCAL-5i test set, which has fewer test examples than COCO-20i.
Fig. 10 presents frequency histograms of object sizes (up to 50 % of the image) for all four PASCAL-5i test folds. Based on the fold with the most examples (Fold 2), we selected the size intervals that ensure each split contains at least 100 examples, i.e., 0–5 %, 5–10 %, and 10–15 %. For consistency and comparability, we apply the same splits to COCO-20i. Furthermore, Fig. 10 shows that the majority of objects in PASCAL-5i are very small, highlighting the importance of accurately classifying and segmenting small objects, a challenge that our EMAT effectively addresses.
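The size-based splitting can be sketched as a simple binning step. The function name `split_by_size` and the half-open interval convention (an object covering exactly 5 % falls into the 5-10 % split) are assumptions made for illustration; the text does not specify how boundary cases are handled.

```python
import numpy as np

# Split boundaries as fractions of the image area.
BIN_EDGES = np.array([0.0, 0.05, 0.10, 0.15])
BIN_LABELS = ["0-5%", "5-10%", "10-15%"]


def split_by_size(rel_areas):
    """Assign object indices to the three size splits.

    rel_areas: per-object fraction of the image covered (values in [0, 1]).
    Objects covering 15 % of the image or more are left out of all splits.
    """
    rel_areas = np.asarray(rel_areas, dtype=float)
    # np.digitize returns i with BIN_EDGES[i-1] <= x < BIN_EDGES[i].
    idx = np.digitize(rel_areas, BIN_EDGES) - 1
    splits = {label: [] for label in BIN_LABELS}
    for obj_id, i in enumerate(idx):
        if 0 <= i < len(BIN_LABELS):
            splits[BIN_LABELS[i]].append(obj_id)
    return splits


def has_enough_examples(splits, minimum=100):
    """Check the 'at least 100 examples per split' criterion."""
    return all(len(ids) >= minimum for ids in splits.values())
```

Running `has_enough_examples` on the fold with the most examples is one way to confirm that a candidate set of intervals satisfies the minimum-count criterion before reusing it elsewhere.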
Appendix D Comparison to SOTA FS-S
Method | Venue | Backbone | PASCAL-5i 1-shot | PASCAL-5i 5-shot | COCO-20i 1-shot | COCO-20i 5-shot |
MIANet [55] | CVPR’23 | ResNet50 | 68.7 | 71.6 | 47.7 | 51.7 |
HDMNet [33] | CVPR’23 | ResNet50 | 69.4 | 71.8 | 50.0 | 56.0 |
VAT + MSI [30] | ICCV’23 | ResNet101 | 70.1 | 72.2 | 49.8 | 54.0 |
SCCAN [54] | ICCV’23 | ResNet101 | 68.3 | 71.5 | 48.2 | 57.0 |
AMFormer [51] | NeurIPS’23 | ResNet101 | 70.7 | 73.6 | 51.0 | 57.3 |
PMNet [7] | WACV’24 | ResNet101 | 68.1 | 73.9 | 43.7 | 53.1 |
ABCB [63] | CVPR’24 | ResNet101 | 72.0 | 74.9 | 51.5 | 58.8 |
EMATs | – | DINOv2-S | 72.5 | 75.9 | 59.8 | 65.0 |
LLaFS [62] | CVPR’24 | ResNet50 + CodeLlama-7B | 73.5 | 75.6 | 53.9 | 60.0 |
PI-CLIP [49] | CVPR’24 | ResNet50 + CLIP-B | 76.8 | 77.2 | 56.8 | 59.1 |
Matcher [27] | ICLR’24 | DINOv2-L + SAM | 68.1 | 74.0 | 52.7 | 60.7 |
GF-SAM [57] | NeurIPS’24 | DINOv2-L + SAM | 72.1 | 82.6 | 58.7 | 66.8 |
EMATs | – | DINOv2-S | 72.5 | 75.9 | 59.8 | 65.0 |
To compare our EMAT with SOTA few-shot segmentation (FS-S) methods, we removed its classification head and denote this adapted version as EMATs. This adaptation ensures a fair comparison to existing FS-S methods, as simultaneously handling classification and segmentation is more challenging than focusing only on segmentation. Furthermore, we adopt the 1-way K-shot formulation used by the FS-S methods, where the query image always contains the support class.
Tab. 5 presents a comparison between EMATs and SOTA FS-S methods, divided into two groups: methods based on ResNet backbones and those based on foundation models. Although EMATs is based on a foundation model (DINOv2), it uses the small variant (DINOv2-S), whose model size is comparable to ResNet50. Therefore, comparing EMATs with FS-S methods based on ResNet backbones is still meaningful.
The top part of Tab. 5 shows that EMATs consistently outperforms all ResNet-based FS-S methods across both PASCAL-5i and COCO-20i datasets and across all task configurations, even surpassing methods based on a significantly larger backbone, ResNet101. Notably, the improvement in mIoU is more pronounced on COCO-20i, with absolute gains of +8.3 % and +6.2 % for 1-shot and 5-shot settings, respectively.
On the other hand, comparing EMATs to FS-S methods based on foundation models (bottom part of Tab. 5) is less straightforward for two main reasons: (1) all baseline methods combine two models, and (2) the foundation models they use are significantly larger than the one used by EMATs. For example, both Matcher [27] and GF-SAM [57] use the large version of DINOv2 (i.e., DINOv2-L), while EMATs uses the small variant (i.e., DINOv2-S). Nevertheless, our method still achieves an absolute improvement of +1.1 % mIoU on COCO-20i in the 1-way 1-shot setting and overall obtains competitive mIoU compared to other SOTA FS-S methods based on large foundation models.
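The absolute gains quoted above can be reproduced directly from the COCO columns of the comparison table. The snippet below is only a worked arithmetic check; the dictionary layout and the helper name `gain_over_best` are hypothetical, and each gain is measured against the strongest baseline of the group in the respective column.

```python
# (1-shot mIoU, 5-shot mIoU) on the COCO columns, copied from the table.
resnet_based = {
    "MIANet": (47.7, 51.7),
    "HDMNet": (50.0, 56.0),
    "VAT + MSI": (49.8, 54.0),
    "SCCAN": (48.2, 57.0),
    "AMFormer": (51.0, 57.3),
    "PMNet": (43.7, 53.1),
    "ABCB": (51.5, 58.8),
}
foundation_based = {
    "LLaFS": (53.9, 60.0),
    "PI-CLIP": (56.8, 59.1),
    "Matcher": (52.7, 60.7),
    "GF-SAM": (58.7, 66.8),
}
emat_s = (59.8, 65.0)


def gain_over_best(ours, baselines):
    """Absolute difference to the best baseline, per shot setting."""
    return tuple(
        round(score - max(b[k] for b in baselines.values()), 1)
        for k, score in enumerate(ours)
    )


print(gain_over_best(emat_s, resnet_based))      # gains vs. ResNet-based methods
print(gain_over_best(emat_s, foundation_based))  # gains vs. foundation-model methods
```

Against the ResNet-based group this gives +8.3 (1-shot) and +6.2 (5-shot); against the foundation-model group it gives +1.1 in the 1-shot column and a negative value in the 5-shot column, where GF-SAM reports the higher mIoU.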