1 Department of Computer Science, Technical University of Darmstadt, Germany
2 Hessian Center for AI (hessian.AI), Germany
Email: {dustin.carrion,stefan.roth,simone.schaub}@visinf.tu-darmstadt.de
https://visinf.github.io/emat

Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation

Dustin Carrión-Ojeda (1, 2; corresponding author; ORCID 0000-0001-5322-9130), Stefan Roth (1, 2; ORCID 0000-0001-9002-9832), Simone Schaub-Meyer (1, 2; ORCID 0000-0001-8644-1074)
Abstract

Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-$5^{i}$ and COCO-$20^{i}$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.

Keywords: Few-shot learning · Efficiency · Segmentation · Classification.

1 Introduction

Recently, data-intensive methods have been introduced for various deep learning applications [5, 32, 23, 34, 41, 25, 8]. These methods rely on large training datasets, making them impractical in fields where collecting extensive datasets is challenging or costly [14, 65, 13]. Consequently, few-shot learning (FSL) methods have gained significant attention for their ability to learn from just a few examples and quickly adapt to new classes [44, 52, 56, 1]. In computer vision, FSL has been mostly applied to image classification (FS-C) [3, 40, 43, 18] and segmentation (FS-S) [11, 30, 62, 55, 63].

FS-C and FS-S often co-occur in real-world applications, e.g., in agriculture, where crops must be segmented and classified by type or health status. Hence, recent works [19, 20] integrate multi-label classification and multi-class segmentation into a single few-shot classification and segmentation (FS-CS) task. While FS-CS addresses some limitations of FS-C (e.g., assuming the query image contains only one class) and FS-S (e.g., assuming the target class is always present in the query image), it also increases the task difficulty by simultaneously tackling classification and segmentation. Moreover, some applications, e.g., medical imaging, rely on precise small-object analysis [13, 65, 16]. Thus, achieving high accuracy on small objects is a desired property for FS-CS methods. Yet, as shown in Fig. 1, the current state-of-the-art (SOTA) FS-CS method [20] struggles with small objects, a limitation we address in this work.

[Figure 1 image panels, shown for two examples: GT Mask, CST* [20], EMAT]
Figure 1: Qualitative comparison of small objects (i.e., objects that occupy less than 15% of the image) between the current SOTA FS-CS method (CST) [20] and our proposed EMAT. CST* uses the same backbone as EMAT (i.e., DINOv2 [32]). By processing high-resolution correlation tokens, EMAT preserves finer details, yielding more accurate segmentation masks.

To better align the evaluation of FS-CS models with practical scenarios, FS-CS uses the N-way K-shot configuration, where the model learns N classes from N×K examples (K per class). However, the current evaluation setting [19, 20] discards available annotations, which is not ideal given the cost of data annotation. To address this, we introduce two new evaluation settings.

1.0.1 Contributions.

(1) Building on the current SOTA FS-CS method [20], we propose an efficient masked attention transformer (EMAT), which enhances classification and segmentation accuracy, particularly for small objects, while using approximately four times fewer trainable parameters. (2) Our EMAT outperforms all FS-CS methods on the PASCAL-$5^{i}$ and COCO-$20^{i}$ datasets, supports the N-way K-shot configuration, and can generate empty segmentation masks when no target objects are present. (3) Finally, we introduce two new FS-CS evaluation settings that better utilize available annotations during inference.

2 Related Work

2.0.1 Few-shot Classification (FS-C)

methods can be categorized into three groups based on what the model learns. Representation-based approaches learn class-agnostic, discriminative embeddings [48, 39, 3, 21, 60, 17, 46]. Optimization-based approaches learn the optimal set of weights that allow the model to adapt to new classes in just a few optimization steps [15, 40, 4, 35]. Transfer-based approaches adapt large pre-trained [6, 10, 43, 24, 28] or foundation models [64, 18, 37]. A major limitation of most FS-C methods is the assumption of a single label per image [2, 38], limiting them in multi-label settings.

2.0.2 Few-shot Segmentation (FS-S)

methods can also be categorized into three groups: prototype matching, which aligns support embeddings with query features [11, 50, 45, 58, 26, 51]; dense correlation, which constructs support–query correlation tensors [33, 29, 53, 30, 7, 54]; and model-adaptation, which fine-tunes large pre-trained models [61, 62, 49, 27, 57]. Despite the advancements in FS-S, most methods have two main limitations: (1) they target only the 1-way K-shot configuration and (2) they assume the query image contains the target class, preventing the models from predicting empty segmentation masks. Only a few recent works [42, 59] address the more general N-way K-shot configuration.

2.0.3 Few-shot Classification and Segmentation (FS-CS)

focuses on jointly predicting the multi-label classification vector and multi-class segmentation mask without assuming support classes are present in the query image [19]. The current SOTA FS-CS method, the classification-segmentation transformer (CST) [20], uses a memory-intensive masked-attention mechanism that requires significant downsampling of the correlation features, reducing its accuracy on small objects. In this work, we enhance CST by proposing an efficient masked-attention formulation and adding further refinements, resulting in a more memory- and parameter-efficient method with improved accuracy, especially for small objects.

3 Problem Definition

This work focuses on the few-shot classification and segmentation (FS-CS) task [19], formulated as an N-way K-shot learning problem [48]. We assume two disjoint class sets: $\mathcal{C}_{\text{train}}$ for training and $\mathcal{C}_{\text{test}}$ for testing. Accordingly, training tasks are sampled from $\mathcal{C}_{\text{train}}$, and testing tasks from $\mathcal{C}_{\text{test}}$. Each task consists of a support set $\mathcal{S}$ and a query image $\mathbf{I}_{q}$, where $\mathcal{S}$ contains N classes $\mathcal{C}_{\text{s}}$ ($\mathcal{C}_{\text{s}}\subseteq\mathcal{C}_{\text{train}}$ or $\mathcal{C}_{\text{s}}\subseteq\mathcal{C}_{\text{test}}$), each represented by K examples:

$$\mathcal{S}=\left\{\left\{(\mathbf{I}^{i}_{j},\mathbf{M}^{i}_{j},y^{i}_{j})\mid y^{i}_{j}\in\mathcal{C}_{\text{s}}\right\}_{j}^{K}\right\}_{i}^{N}, \tag{1}$$

where $\mathbf{I}^{i}_{j}$, $\mathbf{M}^{i}_{j}$, and $y^{i}_{j}$ denote the support image, segmentation mask, and class label for the $j^{\text{th}}$ example of the $i^{\text{th}}$ class. Although $y^{i}_{j}=i\ \forall j$ in Eq. 1, we use this notation for compatibility with multi-label settings where $\mathbf{y}^{i}_{j}$ can vary.

The goal of FS-CS is to learn from $\mathcal{S}$ such that, given $\mathbf{I}_{q}$, the model can (i) identify which support classes are present (multi-label classification), and (ii) segment those classes (multi-class segmentation). Moreover, FS-CS allows $\mathbf{I}_{q}$ to contain a subset of the support classes. Thus, when $N>1$, $\mathbf{I}_{q}$ can contain: (1) none of the support classes ($\mathcal{C}_{\text{q}}=\varnothing$), (2) a subset of them ($\mathcal{C}_{\text{q}}\subset\mathcal{C}_{\text{s}}$), or (3) all support classes ($\mathcal{C}_{\text{q}}=\mathcal{C}_{\text{s}}$). Note that case (1) is important in real-world applications where query images may not contain relevant classes, requiring models to predict empty segmentation masks when necessary.

The drawback of the current FS-CS setting is that each support image $\mathbf{I}^{i}_{j}$ is assumed to contain only one annotated class ($y^{i}_{j}$). If $\mathbf{I}^{i}_{j}$ includes multiple support classes ($\mathbf{y}^{i}_{j}\subseteq\mathcal{C}_{\text{s}}$), its label vector and segmentation mask need to be adjusted before constructing $\mathcal{S}$. This adjustment discards available annotations, as illustrated in Fig. 2, where $\mathbf{I}_{1}^{1}$ contains both support classes (person, bike). However, since it is an example of the 1st class (person), the annotations of the 2nd class (bike) are removed in the original setting.

[Figure 2 image panels omitted. Columns: Available Data, Original (Eq. 1), Partially Aug. (Eq. 2), Fully Aug. (Eq. 3). Labels $\mathbf{y}$ per column for the first support image: {1, 2}, {1}, {1, 2}, {1, 2}; for the second support image: {2, 3}, {2}, {2}, {2, 3}.]
Figure 2: Example of a 2-way 1-shot (base configuration) support set across different few-shot evaluation settings. $\mathbf{I}$, $\mathbf{M}$, and $\mathbf{y}$ represent the images, segmentation masks, and labels, respectively.

3.1 Proposed Evaluation Settings

To better utilize available annotations and reflect more realistic evaluation scenarios, we introduce two novel FS-CS evaluation settings.

3.1.1 Partially Augmented Setting.

This setting keeps all annotations from the support classes:

$$\mathcal{S}=\left\{\left\{(\mathbf{I}^{i}_{j},\mathbf{M}^{i}_{j},\mathbf{y}^{i}_{j})\mid\mathbf{y}^{i}_{j}\subseteq\mathcal{C}_{\text{s}}\right\}_{j}^{K}\right\}_{i}^{N}. \tag{2}$$

Note that when $N=1$, this setting is equivalent to Eq. 1. Fig. 2 shows an example for this setting, where $\mathbf{M}_{1}^{1}$ and $\mathbf{y}_{1}^{1}$ keep the annotations of the 2nd class (bike), even though the image is selected as an example of the 1st class (person).

3.1.2 Fully Augmented Setting.

This setting keeps all available annotations for each support image, regardless of whether the corresponding classes are part of the support classes:

$$\mathcal{S}=\left\{\left\{(\mathbf{I}^{i}_{j},\mathbf{M}^{i}_{j},\mathbf{y}^{i}_{j})\mid\mathbf{y}^{i}_{j}\subseteq\mathcal{C}_{\text{train}}\cup\mathcal{C}_{\text{test}}\right\}_{j}^{K}\right\}_{i}^{N}. \tag{3}$$

For example, in Fig. 2, $\mathbf{M}_{1}^{1}$ and $\mathbf{y}_{1}^{1}$ include annotations for both support classes (person, bike), while $\mathbf{M}_{1}^{2}$ and $\mathbf{y}_{1}^{2}$ are augmented with annotations of a non-support class (TV), which can belong to either $\mathcal{C}_{\text{train}}$ or $\mathcal{C}_{\text{test}}$. In this setting, the model is expected to classify and segment all support and augmented classes present in $\mathbf{I}_{q}$. Additionally, this setting aligns closely with the generalized few-shot setting (GFSL) [44], which also evaluates on base classes. However, unlike standard GFSL, which evaluates on all base classes seen during training, our setting restricts evaluation to only those classes present in the support set.
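The difference between the three settings can be illustrated with a minimal sketch that only tracks class labels; the segmentation masks would be filtered in the same way. The function name `support_labels` and the integer class identifiers are hypothetical and chosen to mirror the example of Fig. 2 (1 = person, 2 = bike, 3 = TV).

```python
from typing import FrozenSet, Set


def support_labels(available: Set[int], target: int,
                   support_classes: Set[int], setting: str) -> FrozenSet[int]:
    """Labels kept for one support image under a given evaluation setting.

    available:       all classes annotated in the image
    target:          the class this image was sampled to exemplify
    support_classes: the N support classes of the task
    """
    if target not in available:
        raise ValueError("a support image must contain its target class")
    if setting == "original":   # Eq. 1: keep only the exemplified class
        return frozenset({target})
    if setting == "partial":    # Eq. 2: keep every annotated support class
        return frozenset(available & support_classes)
    if setting == "full":       # Eq. 3: keep every available annotation
        return frozenset(available)
    raise ValueError(f"unknown setting: {setting}")


# 2-way 1-shot task of Fig. 2 with support classes {1, 2}.
support_classes = {1, 2}
image_1 = {1, 2}  # example of class 1 that also shows class 2
image_2 = {2, 3}  # example of class 2 that also shows non-support class 3

for setting in ("original", "partial", "full"):
    first = sorted(support_labels(image_1, 1, support_classes, setting))
    second = sorted(support_labels(image_2, 2, support_classes, setting))
    print(setting, first, second)
```

With a single support class, the `partial` branch returns the same labels as the `original` branch, matching the equivalence noted for $N=1$.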

4 Efficient Masked Attention Transformer

Fig. 3 illustrates the pipeline used by our proposed efficient masked attention transformer (EMAT), which builds upon the classification-segmentation transformer (CST) [20]. Both methods share the same feature extraction process: support and query images $\mathbf{I}_{j}^{i},\mathbf{I}_{q}\in\mathbb{R}^{H\times W\times 3}$ are processed by a frozen, pre-trained ViT [12] with patch size $p$, producing support and query image tokens $\mathbf{T}_{s_{\text{i}}},\mathbf{T}_{q_{\text{i}}}\in\mathbb{R}^{h\times w\times d}$, and a support class token $\mathbf{T}_{s_{\text{c}}}\in\mathbb{R}^{1\times d}$, where $h=H/p$, $w=W/p$, and $d$ is the token dimension of a single ViT head.
The support tokens $\mathbf{T}_{s_{\text{i}}}$ are downsampled via bilinear interpolation and reshaped to $\mathbf{T}^{f}_{s_{\text{i}}}\in\mathbb{R}^{(h^{\prime}\cdot w^{\prime})\times d}$. Similarly, query image tokens $\mathbf{T}_{q_{\text{i}}}$ are reshaped to $\mathbf{T}^{f}_{q_{\text{i}}}\in\mathbb{R}^{(h\cdot w)\times d}$. Next, $\mathbf{T}^{f}_{s_{\text{i}}}$ and $\mathbf{T}_{s_{\text{c}}}$ are concatenated to form $\mathbf{T}^{c}_{s}$.
Finally, cosine similarity between $\mathbf{T}^{c}_{s}$ and $\mathbf{T}^{f}_{q_{\text{i}}}$ is computed across all ViT layers $l$ and attention heads $g$, resulting in the correlation tokens $\mathbf{C}\in\mathbb{R}^{t_{s}\times t_{q}\times(l\cdot g)}$, where $t_{s}=h^{\prime}\cdot w^{\prime}+1$ and $t_{q}=h\cdot w$.
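The shape bookkeeping of the correlation step can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function name `correlation_tokens`, the `(layers, heads, tokens, d)` input layout, the small `eps` stabilizer, and the toy sizes are all assumptions made for the example.

```python
import numpy as np


def correlation_tokens(support: np.ndarray, query: np.ndarray,
                       eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between support and query tokens.

    support: (l, g, t_s, d) support tokens, class token included
    query:   (l, g, t_q, d) query image tokens
    returns: (t_s, t_q, l * g) correlation tokens
    """
    # L2-normalize along the token dimension d so the dot product is a cosine.
    s = support / (np.linalg.norm(support, axis=-1, keepdims=True) + eps)
    q = query / (np.linalg.norm(query, axis=-1, keepdims=True) + eps)
    corr = np.einsum("lgsd,lgqd->sqlg", s, q)  # (t_s, t_q, l, g)
    t_s, t_q, n_layers, n_heads = corr.shape
    return corr.reshape(t_s, t_q, n_layers * n_heads)


rng = np.random.default_rng(0)
n_layers, n_heads, d = 2, 3, 8   # toy values, not those of the actual backbone
t_s = 4 * 4 + 1                  # h' = w' = 4 plus the support class token
t_q = 5 * 5                      # h = w = 5
support = rng.normal(size=(n_layers, n_heads, t_s, d))
query = rng.normal(size=(n_layers, n_heads, t_q, d))
C = correlation_tokens(support, query)
print(C.shape)  # (17, 25, 6)
```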

EMAT differs from CST in the two-layer transformer (purple blocks in Fig. 3) that processes the correlation tokens $\mathbf{C}$ and feeds task-specific heads for multi-label classification and multi-class segmentation. EMAT enhances this transformer with three key improvements: (1) a novel memory-efficient masked attention formulation (see Sec. 4.1) that allows using higher-resolution correlation tokens, (2) a learnable downscaling strategy (see Sec. 4.2) that avoids reliance on large pooling kernels, and (3) additional modifications for improved parameter efficiency (see Sec. 4.3), which can help to reduce overfitting to a small support set.

Following CST, EMAT is trained using the 1-way 1-shot configuration. Since EMAT uses task-specific heads, it is trained with two losses:

\begin{align}
\mathcal{L}_{\text{clf}} &= -y\log\widehat{y}, \tag{4}\\
\mathcal{L}_{\text{seg}} &= -\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\mathbf{M}_{ij}\log\widehat{\mathbf{M}}_{ij}, \tag{5}
\end{align}

where $y\in\{0,1\}$ and $\mathbf{M}_{ij}\in\{0,1\}$ are the ground-truth classification and segmentation labels, and $\widehat{y}$, $\widehat{\mathbf{M}}_{ij}$ are the corresponding predictions. The final loss function jointly optimizes both losses using a balancing hyperparameter $\lambda$:

$$\mathcal{L}=\lambda\mathcal{L}_{\text{clf}}+\mathcal{L}_{\text{seg}}. \tag{6}$$
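As a quick numerical check, Eqs. 4 to 6 can be transcribed literally in NumPy. This is only a sketch of the formulas as written: the function names, the `eps` term added for numerical stability, and the value of `lam` are assumptions, and an actual training loop may use a library cross-entropy instead.

```python
import numpy as np


def clf_loss(y: float, y_hat: float, eps: float = 1e-8) -> float:
    """Eq. 4: classification loss for one 1-way task."""
    return float(-y * np.log(y_hat + eps))


def seg_loss(M: np.ndarray, M_hat: np.ndarray, eps: float = 1e-8) -> float:
    """Eq. 5: pixel-wise loss averaged over the H x W grid."""
    return float(-np.mean(M * np.log(M_hat + eps)))


def total_loss(y: float, y_hat: float, M: np.ndarray, M_hat: np.ndarray,
               lam: float) -> float:
    """Eq. 6: weighted sum with balancing hyperparameter lambda."""
    return lam * clf_loss(y, y_hat) + seg_loss(M, M_hat)


M = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy 2 x 2 ground-truth mask
M_hat = np.full((2, 2), 0.5)             # uninformative prediction
lam = 0.1                                # arbitrary value for illustration
print(total_loss(1.0, 0.5, M, M_hat, lam))
```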

Inference on an N-way K-shot task is performed as in CST [20], by treating each class as an independent 1-way K-shot task: class-wise logits and segmentation masks are averaged over the K examples, producing N predictions. Logits above a threshold $\delta=0.5$ form the multi-label vector, and each pixel $\widehat{\mathbf{M}}_{ij}$ is assigned to the class with the highest score, or to background if all scores fall below $\delta$, thereby allowing empty masks.
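The aggregation described above can be sketched in a few lines of NumPy. The function name `fscs_inference`, the array layouts, and the convention of encoding background as 0 and class $n$ as $n+1$ are assumptions made for this illustration; the scores are assumed to lie in $[0, 1]$ so that they are comparable with $\delta$.

```python
import numpy as np


def fscs_inference(cls_scores: np.ndarray, seg_scores: np.ndarray,
                   delta: float = 0.5):
    """Combine N independent 1-way K-shot predictions.

    cls_scores: (N, K) class-wise scores, one per support example
    seg_scores: (N, K, H, W) per-pixel foreground scores
    returns:    (N,) boolean multi-label vector and an (H, W) map
                with 0 = background and n + 1 = n-th support class
    """
    cls_mean = cls_scores.mean(axis=1)        # average over the K shots
    seg_mean = seg_scores.mean(axis=1)        # (N, H, W)
    labels = cls_mean > delta
    best_class = seg_mean.argmax(axis=0)      # highest-scoring class per pixel
    best_score = seg_mean.max(axis=0)
    seg_map = np.where(best_score >= delta, best_class + 1, 0)
    return labels, seg_map


# Toy 2-way 2-shot task on a 1 x 2 image.
cls_scores = np.array([[0.9, 0.7], [0.2, 0.4]])
seg_scores = np.array([
    [[[0.8, 0.1]], [[0.6, 0.3]]],   # class 1, two shots
    [[[0.2, 0.4]], [[0.4, 0.2]]],   # class 2, two shots
])
labels, seg_map = fscs_inference(cls_scores, seg_scores)
print(labels, seg_map)
```

When every averaged score is below $\delta$, the returned map contains only zeros, which corresponds to an empty segmentation mask.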

Figure 3: FS-CS pipeline used by our EMAT. A frozen, pre-trained ViT [12] extracts image and class tokens from support and query images, which are correlated via cosine similarity. The resulting correlation tokens are processed by a two-layer transformer equipped with our masked attention mechanism, learnable downscaling, and parameter-efficient design (see Secs. 4.1, 4.2 and 4.3). Task-specific heads then predict the multi-label classification vector and multi-class segmentation mask.

4.1 Memory-Efficient Masked Attention

The SOTA FS-CS method, CST [20], uses a masked attention mechanism based on self-attention [47, 12] to process the correlation tokens $\mathbf{C}\in\mathbb{R}^{t_{s}\times t_{q}\times(l\cdot g)}$. Due to the high memory cost of self-attention, CST significantly downsamples the support tokens $\mathbf{T}_{s_{\text{i}}}$ when computing $\mathbf{C}$, sacrificing fine-grained spatial details (see Fig. 5). To address this, we propose a novel memory-efficient masked attention formulation that allows EMAT to use high-resolution correlation tokens.

Given $\mathbf{C}$, let $\mathbf{Q}^{d}\in\mathbb{R}^{t_{d}\times t_{q}\times e}$ and $\mathbf{K},\mathbf{V}\in\mathbb{R}^{t_{s}\times t_{q}\times e}$ denote the query, key, and value matrices, where $e$ is the embedding size. The dimensions of $\mathbf{Q}^{d}$ differ from those of $\mathbf{K}$ and $\mathbf{V}$ because the query matrix is progressively downscaled (see Sec. 4.2). Additionally, as shown in Fig. 3, the segmentation mask $\mathbf{M}\in\mathbb{R}^{H\times W}$ enters directly into the attention mechanism without being processed by the feature extractor. Thus, it is resized and flattened, and a trailing “1” is appended to obtain $\mathbf{M}^{f}\in\mathbb{R}^{h^{\prime}\cdot w^{\prime}+1}$. This appended “1” ensures that the support class token is never masked in Eqs. 7 and 8.

For one attention head, CST computes:

$$\mathbf{O}_{ijk}=\sum_{p}\left[\mathrm{softmax}\left(\left(\mathbf{Q}_{ijk}^{d}\cdot\mathbf{K}_{:jk}\right)\odot\mathbf{M}^{f}_{:}\right)\right]_{p}\odot\mathbf{V}_{pjk}, \tag{7}$$

where $i\in\{1,\dots,t_{d}\}$, $j\in\{1,\dots,t_{q}\}$, $k\in\{1,\dots,e\}$, and $p\in\{1,\dots,t_{s}\}$. Here, $t_{d}=h^{\prime\prime}\cdot w^{\prime\prime}+1$, $t_{q}=h\cdot w$, and $t_{s}=h^{\prime}\cdot w^{\prime}+1$, where the “$+1$” accounts for the support class token and $(\cdot)^{\prime}$ denotes a downscaled value. The operator $\odot$ represents element-wise multiplication.

In CST, the support dimension $t_{s}$ is 145 ($h^{\prime}=w^{\prime}=12$) in the first attention layer and 10 ($h^{\prime}=w^{\prime}=3$) in the second. Consequently, in the second layer most values of $\mathbf{M}^{f}$ are zero (see Fig. 5). As a result, Eq. 7 zeros out most attention values, but they are still processed in all intermediate computations, leading to a memory-inefficient formulation. To overcome this, we introduce a memory-efficient reformulation that excludes the masked-out entries:

$$\mathbf{O}_{ijk}=\sum_{p^{\oslash}}\left[\mathrm{softmax}\left(\mathbf{Q}_{ijk}^{d}\cdot\left(\mathbf{K}_{:jk}\oslash\mathbf{M}^{f}_{:}\right)\right)\right]_{p^{\oslash}}\odot\left(\mathbf{V}_{:jk}\oslash\mathbf{M}^{f}_{:}\right)_{p^{\oslash}}, \tag{8}$$

where $\oslash$ is our element-wise masking operator:

$$(\mathbf{Z}_{pjk}\oslash\mathbf{M}^{f}_{p})=\begin{cases}\mathbf{Z}_{pjk}&\text{if}\;\mathbf{M}^{f}_{p}=1,\\ \varnothing&\text{otherwise},\end{cases}\qquad\forall p\in\{1,\dots,t_{s}\}, \tag{9}$$

with $\mathbf{Z}\in\mathbb{R}^{t_{s}\times t_{q}\times e}$ and $\varnothing$ indicating that the corresponding entry is excluded. This exclusion of elements results in the reduced set of indices $p^{\oslash}\subseteq p$ used in Eq. 8, where $p^{\oslash}=p$ only if $\mathbf{M}^{f}$ contains no zeros. Excluding masked-out tokens reduces memory usage, allowing EMAT to increase the support dimension $t_{s}$ to 401 ($h^{\prime}=w^{\prime}=20$) in the first attention layer ($\approx 2.7$ times more than CST) and 101 ($h^{\prime}=w^{\prime}=10$) in the second ($\approx 11$ times more than CST). Note that the index set $p^{\oslash}$ varies across images in a batch; thus, Eq. 8 is computed sequentially for each image. However, batch processing is still used both before and after this step because the input and output tensors ($\mathbf{C}$ and $\mathbf{O}$) have the same dimensions for every image. By computing attention only over unmasked entries, EMAT still achieves a runtime comparable to CST.
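To make Eqs. 8 and 9 concrete, the following is a minimal sketch for a single fixed $(j,k)$ slice, written in plain Python rather than with batched tensors. It is not the released implementation: the function names are hypothetical, attention heads and any score scaling are omitted, and at least one unmasked support token is assumed. The second function is a conventional reference that keeps all $t_{s}$ tokens and masks the scores with $-\infty$, so the two can be compared.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [x / total for x in exps]


def efficient_masked_attention(q, k, v, mask):
    """Eq. 8 for one fixed (j, k) slice (hypothetical helper).

    q:    t_d query scalars      (Q^d_{:jk})
    k, v: t_s key/value scalars  (K_{:jk}, V_{:jk})
    mask: t_s flags in {0, 1}    (M^f)
    """
    # Eq. 9: drop masked-out support tokens instead of storing them.
    kept = [p for p, flag in enumerate(mask) if flag == 1]
    if not kept:
        raise ValueError("sketch assumes at least one unmasked support token")
    k_kept = [k[p] for p in kept]
    v_kept = [v[p] for p in kept]

    out = []
    for q_i in q:
        # Softmax runs only over the reduced index set.
        weights = softmax([q_i * k_p for k_p in k_kept])
        out.append(sum(w * v_p for w, v_p in zip(weights, v_kept)))
    return out


def dense_masked_attention(q, k, v, mask):
    """Reference: keep all t_s tokens and mask the scores with -inf."""
    out = []
    for q_i in q:
        scores = [q_i * k_p if flag == 1 else float("-inf")
                  for k_p, flag in zip(k, mask)]
        weights = softmax(scores)
        out.append(sum(w * v_p for w, v_p in zip(weights, v)))
    return out
```

Both functions return the same values; the difference is that the first never materializes scores for excluded tokens, which is where the memory saving described above would come from.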

4.2 Learnable Downscaling

As mentioned in Sec. 4.1, the query matrix $\mathbf{Q}\in\mathbb{R}^{t_{s}\times t_{q}\times e}$ is progressively downscaled. This downscaling occurs before computing the masked attention in each layer, and it keeps $t_{q}$ and $e$ fixed while shrinking the support spatial dimensions $h^{\prime}$ and $w^{\prime}$, which reduces the support dimension ($t_{s}=h^{\prime}\cdot w^{\prime}+1$). Unlike CST, which uses only average pooling, EMAT introduces a lightweight, learnable strategy that combines small convolutions with pooling. This hybrid design removes the need for the large pooling kernels that would otherwise be required to handle the higher-resolution correlation tokens used by EMAT.

In the first attention layer, EMAT splits $\mathbf{Q}$ into support image tokens $\mathbf{Q}_{i}$ and a single class token $\mathbf{Q}_{c}$. After reshaping $\mathbf{Q}_{i}$ to its $h^{\prime}\times w^{\prime}$ support spatial layout, a 3D convolution followed by another reshape produces $\mathbf{Q}^{r}_{i}\in\mathbb{R}^{(h^{\prime\prime}\cdot w^{\prime\prime})\times t_{q}\times e}$. Simultaneously, a 2D convolution transforms $\mathbf{Q}_{c}$ into $\mathbf{Q}^{r}_{c}\in\mathbb{R}^{1\times t_{q}\times e}$. These outputs are concatenated to form the downscaled query matrix $\mathbf{Q}^{d}\in\mathbb{R}^{t_{d}\times t_{q}\times e}$ used in Eq. 8, where $t_{d}=h^{\prime\prime}\cdot w^{\prime\prime}+1$.

The second attention layer repeats the process, but before the concatenation $\mathbf{Q}^{r}_{i}$ is collapsed to a single spatial token by a 3D average pool, producing $\mathbf{Q}^{p}_{i}\in\mathbb{R}^{1\times t_{q}\times e}$. Concatenating $\mathbf{Q}^{p}_{i}$ with $\mathbf{Q}^{r}_{c}$ gives the downscaled query matrix $\mathbf{Q}^{d}\in\mathbb{R}^{2\times t_{q}\times e}$ used during the attention computation.
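The token bookkeeping of the two downscaling steps can be traced with a few lines of Python. This sketch only tracks the support dimension and performs no convolution or pooling. The function names are hypothetical. The first-layer output layout $h^{\prime\prime}=w^{\prime\prime}=10$ is an assumption inferred from the second-layer support dimension of 101, and the $5\times 5$ layout passed for the second layer is a placeholder, since pooling collapses it to a single token regardless.

```python
def support_dim(h, w):
    """t_s = h' * w' + 1: spatial support tokens plus one class token."""
    return h * w + 1


def downscaled_dims(h_in, w_in, h_out, w_out, pool_to_single_token=False):
    """Track the support dimension of Q through one downscaling step.

    Returns (t_s, t_d): the support dimension before and after downscaling.
    """
    t_s = support_dim(h_in, w_in)
    # Split Q into image tokens Q_i and a single class token Q_c.
    n_image_tokens, n_class_tokens = t_s - 1, 1
    # The convolution maps the h' x w' layout to h'' x w'' (Q_i^r).
    n_image_tokens = h_out * w_out
    if pool_to_single_token:
        # Second layer: average pooling collapses Q_i^r to one token (Q_i^p).
        n_image_tokens = 1
    # Concatenate image tokens and class token to obtain Q^d.
    return t_s, n_image_tokens + n_class_tokens


layer1 = downscaled_dims(20, 20, 10, 10)                           # assumed h'' = w'' = 10
layer2 = downscaled_dims(10, 10, 5, 5, pool_to_single_token=True)  # 5 x 5 is a placeholder
```

Under these assumptions the first layer goes from 401 to 101 support tokens and the second from 101 to 2, matching the shapes of $\mathbf{Q}^{d}$ given in the text.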

4.3 Modifications for Parameter Efficiency

Few-shot models with too many parameters risk overfitting the small support set, thereby reducing their ability to adapt to new classes. Therefore, EMAT reduces the number of channels across all operations: its two attention layers use 64 and 32 channels, versus 32 and 128 in CST, and its two task-specific heads use 32 and 16 channels, versus 128 and 64 in CST. These channel reductions significantly decrease the number of trainable parameters in EMAT.

5 Experiments

5.0.1 Datasets.

We evaluated our EMAT on the widely used PASCAL-$5^{i}$ [36] and COCO-$20^{i}$ [31] datasets. Although they were designed for few-shot segmentation, both can also be used for few-shot classification and segmentation [19]. PASCAL-$5^{i}$ comprises 20 classes and COCO-$20^{i}$ 80 classes, each partitioned into four non-overlapping folds.

5.0.2 Implementation Details.

EMAT uses a frozen ViT-S encoder [12] pre-trained with DINOv2 [32]. The two-layer transformer uses our memory-efficient masked attention with 8 heads. We train for 80 epochs with a batch size of 9 using the Adam optimizer [22] with learning rate $10^{-3}$. Following [20], we use 1-way 1-shot tasks with the original setting (see Eq. 1) and set the loss weight $\lambda$ in Eq. 6 to 0.1. Moreover, we re-train CST [20] with the same DINOv2 backbone used by EMAT and denote it as CST*. All training was conducted on three NVIDIA RTX A6000 GPUs, with evaluation performed on a single GPU.
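For reference, the hyperparameters listed above can be collected in a single configuration object. The sketch below is illustrative only: the class and field names are hypothetical and do not come from the released code, and the even split of the batch across GPUs is an assumption that the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the settings reported in the text."""
    encoder: str = "ViT-S (DINOv2, frozen)"
    attention_layers: int = 2
    attention_heads: int = 8
    epochs: int = 80
    batch_size: int = 9
    optimizer: str = "Adam"
    learning_rate: float = 1e-3
    loss_weight_lambda: float = 0.1  # lambda in Eq. 6
    train_way: int = 1
    train_shot: int = 1
    num_gpus: int = 3

    def per_gpu_batch(self) -> int:
        # Assumes the batch is split evenly across GPUs (not stated in the text).
        if self.batch_size % self.num_gpus:
            raise ValueError("batch size is not divisible by the number of GPUs")
        return self.batch_size // self.num_gpus


cfg = TrainConfig()
```

Keeping the values in a frozen dataclass makes it harder to change them accidentally between the training runs of CST* and EMAT, which share the same backbone.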

5.1 Comparison to SOTA FS-CS

To evaluate the effectiveness of our EMAT, we compare it with CST [20] and other state-of-the-art few-shot classification and segmentation (FS-CS) methods. Tab. 1 shows the mean classification accuracy (Acc.) and mean Intersection over Union (mIoU) over the four folds of PASCAL-$5^{i}$ [36] and COCO-$20^{i}$ [31], for 2-way 1-shot tasks across all evaluation settings (see Sec. 3). Although DINOv2 pre-training [32] already significantly improves CST* over its original version, EMAT consistently outperforms all methods across all settings. These results validate the benefit of processing higher-resolution correlation tokens enabled by our memory-efficient masked attention (see Sec. 4.1). Moreover, EMAT requires at least four times fewer parameters than CST, making it the most parameter-efficient method among SOTA FS-CS models. The supplementary material provides per-fold results for Tab. 1 and additional results on $\{1,\dots,5\}$-way 5-shot tasks, demonstrating the scalability of our method.

The results in Tab. 1 also show that our partially augmented setting slightly improves accuracy and mIoU for most methods, confirming the benefit of better exploiting the available annotations. However, the improvement is marginal, likely because only 242 and 106 out of 4000 tasks are augmented for PASCAL-$5^{i}$ and COCO-$20^{i}$, respectively. In contrast, our fully augmented setting lowers accuracy and mIoU for every method, although less significantly for mIoU. As discussed in Sec. 3.1, this setting augments not only the support examples but also includes every class present in the support images, making the tasks harder. This setting augments 1243 and 1515 out of the 4000 tasks for PASCAL-$5^{i}$ and COCO-$20^{i}$, respectively. The augmentations in both of our proposed settings highlight that the original evaluation setting fails to use available annotations.
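As a quick check of how many tasks each setting actually changes, the counts above can be converted to shares of the 4000 evaluation tasks. The snippet is a simple worked calculation; the variable and function names are illustrative.

```python
TOTAL_TASKS = 4000

# Number of augmented tasks per setting and dataset, as reported in the text.
augmented = {
    ("partially", "PASCAL-5i"): 242,
    ("partially", "COCO-20i"): 106,
    ("fully", "PASCAL-5i"): 1243,
    ("fully", "COCO-20i"): 1515,
}


def augmented_share(count, total=TOTAL_TASKS):
    """Percentage of evaluation tasks that an augmented setting changes."""
    return 100.0 * count / total


shares = {key: augmented_share(n) for key, n in augmented.items()}
```

The partially augmented setting touches roughly 6 % (PASCAL-$5^{i}$) and under 3 % (COCO-$20^{i}$) of the tasks, whereas the fully augmented setting touches roughly 31 % and 38 %, which is consistent with the marginal effect of the former and the larger effect of the latter.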

Table 1: Comparison of FS-CS methods on PASCAL-$5^{i}$ and COCO-$20^{i}$ across all evaluation settings: original, partially augmented, and fully augmented, using 2-way 1-shot tasks (base configuration). CST* and EMAT were trained and evaluated, while other methods were only evaluated using the checkpoints from [19]. CST* uses the same backbone as EMAT (i.e., DINOv2 [32]). All values, except the number of trainable parameters (in millions), are percentages (higher is better). Highlight indicates our proposed method. Bold and underlined values indicate the best and second best results.
| Dataset | Method | Train. Params. (M) | Original Acc. | Original mIoU | Partially Augmented Acc. | Partially Augmented mIoU | Fully Augmented Acc. | Fully Augmented mIoU |
|---|---|---|---|---|---|---|---|---|
| PASCAL-$5^{i}$ | PANet [50] | 23.51 | 56.53 | 37.20 | 56.93 | 37.49 | 55.75 | 37.25 |
| PASCAL-$5^{i}$ | PFENet [45] | 31.96 | 39.35 | 35.57 | 39.48 | 35.61 | 36.88 | 35.08 |
| PASCAL-$5^{i}$ | HSNet [29] | 2.57 | 67.27 | 44.85 | 67.75 | 44.72 | 65.92 | 44.40 |
| PASCAL-$5^{i}$ | ASNet [19] | 1.32 | 68.30 | 47.87 | 68.62 | 47.78 | 66.40 | 47.58 |
| PASCAL-$5^{i}$ | CST [20] | 0.37 | 70.37 | 53.78 | 70.60 | 53.81 | 68.45 | 53.76 |
| PASCAL-$5^{i}$ | CST* | 0.37 | 80.58 | 63.28 | 80.60 | 63.23 | 78.57 | 63.08 |
| PASCAL-$5^{i}$ | EMAT | 0.09 | 82.70 | 63.38 | 82.92 | 63.32 | 81.23 | 63.24 |
| COCO-$20^{i}$ | PANet [50] | 23.51 | 51.30 | 23.64 | 51.32 | 23.78 | 45.07 | 23.17 |
| COCO-$20^{i}$ | PFENet [45] | 31.96 | 36.45 | 23.37 | 36.50 | 23.39 | 29.33 | 21.61 |
| COCO-$20^{i}$ | HSNet [29] | 2.57 | 62.43 | 30.58 | 62.40 | 30.66 | 55.15 | 29.44 |
| COCO-$20^{i}$ | ASNet [19] | 1.32 | 63.05 | 31.62 | 63.03 | 31.64 | 55.47 | 30.47 |
| COCO-$20^{i}$ | CST [20] | 0.37 | 64.02 | 36.23 | 64.10 | 36.20 | 56.30 | 35.60 |
| COCO-$20^{i}$ | CST* | 0.37 | 78.70 | 51.47 | 78.87 | 51.53 | 71.18 | 50.76 |
| COCO-$20^{i}$ | EMAT | 0.09 | 80.07 | 52.81 | 80.25 | 52.82 | 73.00 | 51.99 |

5.2 Qualitative Results

As explained in Sec. 3, when handling N-way K-shot tasks with $N>1$, the query image can contain (1) none, (2) some, or (3) all of the support classes. The top part of Fig. 4 shows that EMAT produces more accurate segmentation masks than CST* in these three scenarios, confirming that EMAT can predict empty masks and masks with one or multiple classes. The bottom part of Fig. 4 illustrates the same task across all few-shot evaluation settings. In the original setting, both models segment the 1st class (orange) correctly but make errors on the 2nd class (glass), mistakenly segmenting visually similar objects (salt shaker). In the partially augmented setting, the additional annotations for the 2nd class degrade CST*, while EMAT maintains a precise segmentation, though both still incorrectly segment non-target objects. In the fully augmented setting, both methods correctly segment the additional class (book).

[Figure 4 panels, columns left to right: Support ($\mathbf{I}_{1}^{1}$), Support ($\mathbf{I}_{1}^{2}$), Query ($\mathbf{I}_{q}$), GT Mask, CST* [20], EMAT; six rows of images omitted.]
Figure 4: Qualitative comparison of CST* vs. our EMAT on COCO-$20^{i}$ [31] using 2-way 1-shot tasks. (Top) Row 1: Query w/o support classes, Row 2: Query w/ subset of support classes, Row 3: Query w/ all support classes. (Bottom) Row 1: Original setting, Row 2: Partially augmented setting, Row 3: Fully augmented setting.

To illustrate the effect of higher-resolution tokens, Fig. 5 compares the segmentation masks used in the masked-attention layers of CST and EMAT. Thanks to our memory-efficient formulation (see Sec. 4.1), EMAT preserves more details across layers. The difference is most visible in the second layer, where the mask used by CST barely contains any information since it has a resolution of $3\times 3$, whereas EMAT, with a $10\times 10$ resolution, retains meaningful details and structure. For instance, the person in the bottom-right corner of the last row of Fig. 5 remains visible in both layers of EMAT but vanishes in the second layer of CST. This ability to preserve fine details explains why EMAT produces more accurate masks, especially for small objects, e.g., the dog in the last row of the top part of Fig. 4, the handle of the beer glass in the bottom part of Fig. 4, and the boat and person in Fig. 1.

[Figure 5 panels, columns left to right: Full Res. ($700\times 700$), Max Res. ($50\times 50$), CST$_{l=1}$ ($12\times 12$), CST$_{l=2}$ ($3\times 3$), EMAT$_{l=1}$ ($20\times 20$), EMAT$_{l=2}$ ($10\times 10$); two rows of images omitted.]
Figure 5: Segmentation masks used by CST [20] and EMAT. “Full Res.” shows the mask at full resolution, while “Max Res.” corresponds to the highest resolution compatible with the masked-attention layers of both methods. “CST$_{l=(\cdot)}$” and “EMAT$_{l=(\cdot)}$” indicate the mask resolution used in layer $l\in\{1,2\}$ of CST and EMAT, respectively.

5.3 Analysis of Small Objects

To further analyze the impact of higher-resolution correlation tokens on small objects, we filter each fold of PASCAL-$5^{i}$ [36] and COCO-$20^{i}$ [31] based on object size, creating three splits: objects occupying 0–5 %, 5–10 %, and 10–15 % of the image (see the supplementary material for details on how these splits were defined). Fig. 6 shows the average accuracy and mIoU of CST* and the corresponding improvement achieved by EMAT across the three splits for both datasets. The results indicate that accuracy and mIoU increase with object size, and EMAT provides the largest improvement over CST* for the smallest objects, gradually decreasing as object size increases. The enhanced classification and segmentation accuracy of EMAT is likely due to better localization enabled by its higher-resolution correlation tokens (see Fig. 5).
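The size-based filtering can be pictured as computing the fraction of image pixels covered by the object mask and assigning it to one of the three splits. The sketch below is only an illustration: the exact split definition is given in the supplementary material, so the half-open bin boundaries and the function names used here are assumptions.

```python
def object_fraction(mask):
    """Share of image pixels covered by the object, for a binary 2D mask."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("empty mask")
    return sum(sum(row) for row in mask) / total


def size_split(fraction):
    """Map an area fraction to one of the three splits, or None otherwise.

    Assumption: half-open bins (lo, hi]; the actual boundary handling is
    defined in the supplementary material, not here.
    """
    bins = ((0.0, 0.05, "0-5%"), (0.05, 0.10, "5-10%"), (0.10, 0.15, "10-15%"))
    for lo, hi, name in bins:
        if lo < fraction <= hi:
            return name
    return None


def make_mask(n_object_pixels, side=10):
    """Toy helper: a side x side mask whose first pixels belong to the object."""
    flat = [1] * n_object_pixels + [0] * (side * side - n_object_pixels)
    return [flat[r * side:(r + 1) * side] for r in range(side)]
```

For example, a toy $10\times 10$ mask with 3 object pixels falls into the first split, one with 8 into the second, and one with 20 is excluded from the small-object analysis.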

Figure 6: Analysis of small objects on PASCAL-$5^{i}$ [36] and COCO-$20^{i}$ [31]. Each bar represents the average across the four folds of each dataset, filtered by object size, using 1-way 1-shot tasks. To enable a more controlled analysis, we modified the original setting (see Sec. 3) to ensure that the query image always contains the class of the support image. CST* uses the same backbone as EMAT (i.e., DINOv2 [32]).
Table 2: Ablation study of EMAT on PASCAL-$5^{i}$ [36] under the original evaluation setting. “$t^{l}_{s}$” indicates the value of $t_{s}$ for each layer $l\in\{1,2\}$. The memory efficiency (ME), learnable downscaling (LD), and parameter efficiency (PE) columns correspond to the modifications described in Secs. 4.1, 4.2 and 4.3, respectively. “Mem. Usage” reports the average per-GPU memory used during training. “All Dataset” refers to 2-way 1-shot evaluation on the full test set, while “Small Objects” restricts evaluation to objects occupying less than 15 % of the image, using 1-way 1-shot tasks in which the query always contains the support class. CST* uses the same backbone as EMAT (i.e., DINOv2 [32]). (Top) CST* with its original support dimension per layer $t^{l}_{s}$. (Middle) CST* with the largest $t^{l}_{s}$ that fits in our 48 GB GPUs. (Bottom) Successive modifications introduced by EMAT. Highlight indicates our complete model. Bold and underlined values indicate the best and second best results.
| $t^{l}_{s}$ per layer | Method | ME | LD | PE | Mem. Usage (GB) | Train. Params. (K) | All Dataset Acc. | All Dataset mIoU | Small Objects Acc. | Small Objects mIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| $t^{1}_{s}=145$, $t^{2}_{s}=10$ | CST* | | | | 8.68 | 366.00 | 80.58 | 63.28 | 88.96 | 58.16 |
| $t^{1}_{s}=325$, $t^{2}_{s}=37$ | CST* | | | | 39.22 | 366.00 | 82.23 | 63.31 | 89.65 | 58.75 |
| $t^{1}_{s}=401$, $t^{2}_{s}=101$ | CST* | | | | $\approx 63$ | 366.00 | N/A | N/A | N/A | N/A |
| $t^{1}_{s}=401$, $t^{2}_{s}=101$ | EMAT | ✓ | | | 36.92 | 366.00 | 81.95 | 62.97 | 87.99 | 58.06 |
| $t^{1}_{s}=401$, $t^{2}_{s}=101$ | EMAT | ✓ | ✓ | | 36.53 | 404.48 | 82.17 | 63.36 | 90.49 | 58.73 |
| $t^{1}_{s}=401$, $t^{2}_{s}=101$ | EMAT | ✓ | ✓ | ✓ | 38.31 | 86.02 | 82.70 | 63.38 | 91.74 | 59.17 |

5.4 Ablation Study

Tab. 2 first reports the results of CST* with its original support dimension per layer $t^{l}_{s}$. For a fair comparison, we tried to increase the $t^{l}_{s}$ of CST* to match that of EMAT, but this required about 63 GB of GPU memory, which exceeds the 48 GB capacity of our GPUs, so we instead use the largest $t^{l}_{s}$ that fits in memory. For EMAT, we progressively integrated: (1) memory-efficient masked attention (see Sec. 4.1), (2) learnable downscaling of the query matrix (see Sec. 4.2), and (3) parameter-efficiency modifications (see Sec. 4.3). Although Tab. 2 includes results on the full PASCAL-$5^{i}$ test set, the discussion below focuses on the small-object subset to highlight the effect of each modification introduced by EMAT.

Adding our memory-efficient masked attention alone lowers memory usage by 26 GB ($\approx$ 41 %) but does not improve accuracy or mIoU compared to either variant of CST*, likely because the model relies on large pooling windows for processing the higher-resolution correlation tokens. Incorporating our learnable downscaling removes those large windows and yields absolute accuracy gains of +1.53 % over the original CST* and of +0.84 % over the variant with the larger $t^{l}_{s}$. It also achieves an absolute mIoU gain of +0.57 % compared with the original CST*, while matching the mIoU of the variant with the larger $t^{l}_{s}$.

Because our learnable downscaling increases the number of trainable parameters, we next apply our parameter-efficiency modifications, which remove 318 K parameters ($\approx$ 79 %) while still saving about 39 % of the memory CST* would need for using the same $t^{l}_{s}$ as EMAT. These modifications result in absolute accuracy gains of +2.78 % over the original CST* and +2.09 % over the variant with the larger $t^{l}_{s}$; mIoU improves by +1.01 % and +0.42 %, respectively. EMAT also slightly improves accuracy and mIoU on the full test set, but its largest gains appear on images containing small objects.
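The gains and savings quoted in this section follow directly from the small-object columns of Tab. 2. The short script below recomputes them as a sanity check; the variable names are illustrative, and the 63 GB figure is the approximate value reported for CST* at the support dimensions of EMAT.

```python
# Small-object columns of Tab. 2: (accuracy, mIoU) in percent.
cst_original = (88.96, 58.16)  # CST*, t_s = (145, 10)
cst_larger = (89.65, 58.75)    # CST*, t_s = (325, 37)
emat_me_ld = (90.49, 58.73)    # EMAT with ME + LD
emat_full = (91.74, 59.17)     # EMAT with ME + LD + PE


def gains(model, baseline):
    """Absolute (accuracy, mIoU) differences in percentage points."""
    return tuple(round(m - b, 2) for m, b in zip(model, baseline))


# Trainable parameters (in thousands) before and after the PE step.
params_before, params_after = 404.48, 86.02
removed = params_before - params_after
removed_share = 100.0 * removed / params_before

# Memory (GB): CST* at the t_s of EMAT (approximate) vs. the full EMAT model.
mem_cst_same_ts, mem_emat = 63.0, 38.31
memory_saving = 100.0 * (mem_cst_same_ts - mem_emat) / mem_cst_same_ts

ld_vs_original = gains(emat_me_ld, cst_original)
ld_vs_larger = gains(emat_me_ld, cst_larger)
full_vs_original = gains(emat_full, cst_original)
full_vs_larger = gains(emat_full, cst_larger)
```

Note that the reported gains are absolute differences between percentages, i.e., percentage points, rather than relative improvements.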

6 Limitations

While the results of Tab. 2 validate that our proposed EMAT is memory and parameter efficient for high-resolution correlation tokens, its memory efficiency is constrained to datasets where the segmentation masks contain unlabeled areas. This limitation arises because our memory-efficient masked attention mechanism is equivalent to self-attention [47, 12] when applied to dense semantic-segmentation datasets like Cityscapes [9], where each pixel in the segmentation mask corresponds directly to one of the semantic classes in the image.
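This limitation can be illustrated by counting how many attention scores remain per $(j,k)$ slice once masked-out tokens are excluded. The sketch is a rough illustration, not a memory model: it assumes that the score count scales with the number of kept support tokens, the function names are hypothetical, and the choice of 40 foreground tokens and $t_{d}=101$ is made up for the example.

```python
def kept_support_tokens(mask):
    """Size of the reduced index set for a flattened binary mask M^f."""
    return sum(1 for flag in mask if flag == 1)


def score_entries(mask, t_d):
    """Attention scores per (j, k) slice: t_d queries times kept keys."""
    return t_d * kept_support_tokens(mask)


t_s, t_d = 401, 101                        # t_d is an assumed value
sparse_mask = [1] * 40 + [0] * (t_s - 40)  # hypothetical small object
dense_mask = [1] * t_s                     # every token labeled

sparse_scores = score_entries(sparse_mask, t_d)
dense_scores = score_entries(dense_mask, t_d)
```

With the sparse mask only about a tenth of the scores are computed, whereas with the dense mask nothing is excluded and the count equals that of unmasked attention, which is the situation described above for densely annotated datasets.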

7 Conclusion

In this work, we propose EMAT, an enhancement over CST, the state-of-the-art method for few-shot classification and segmentation (FS-CS). EMAT incorporates our novel memory-efficient masked attention mechanism that allows our model to process high-resolution correlation tokens while maintaining memory and parameter efficiency. Our results demonstrate that EMAT consistently outperforms all FS-CS methods across all evaluation settings while requiring at least four times fewer trainable parameters. Moreover, our qualitative results highlight that EMAT is capable of correctly generating empty segmentation masks when necessary and capturing finer details more accurately, which improves accuracy when dealing with small objects. Additionally, we introduce two novel few-shot evaluation settings designed to maximize the use of the available annotations during inference, reflecting practical few-shot scenarios.

Acknowledgments.  This work was funded by the Hessian Ministry of Science and Research, Arts and Culture (HMWK) through the project “The Third Wave of Artificial Intelligence – 3AI”. The work was further supported by the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) under Germany’s Excellence Strategy (EXC 3057/1 “Reasonable Artificial Intelligence”, Project No. 533677015). Stefan Roth acknowledges support by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 866008).

References

  • [1] Aggarwal, P., Deshpande, A., Narasimhan, K.R.: SemSup-XC: Semantic supervision for zero and few-shot extreme classification. In: ICML. vol. 202, pp. 228–247 (2023)
  • [2] Alfassy, A., Karlinsky, L., Aides, A., Shtok, J., Harary, S., Feris, R.S., Giryes, R., Bronstein, A.M.: LaSO: Label-set operations networks for multi-label few-shot learning. In: CVPR. pp. 6548–6557 (2019)
  • [3] Allen, K.R., Shelhamer, E., Shin, H., Tenenbaum, J.B.: Infinite mixture prototypes for few-shot learning. In: ICML. vol. 97, pp. 232–241 (2019)
  • [4] Antoniou, A., Edwards, H., Storkey, A.J.: How to train your MAML. In: ICLR (2019)
  • [5] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV. pp. 9650–9660 (2021)
  • [6] Carrión-Ojeda, D., Alam, M., Escalera, S., et al.: NeurIPS’22 Cross-Domain MetaDL Challenge: Results and lessons learned. In: NeurIPS Competition Track. vol. 220, pp. 50–72 (2022)
  • [7] Chen, H., Dong, Y., Lu, Z., Yu, Y., Han, J.: Pixel matching network for cross-domain few-shot segmentation. In: WACV. pp. 978–987 (2024)
  • [8] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., Dai, J.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR. pp. 24185–24198 (2024)
  • [9] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR. pp. 3213–3223 (2016)
  • [10] Dhillon, G.S., Chaudhari, P., Ravichandran, A., Soatto, S.: A baseline for few-shot image classification. In: ICLR (2020)
  • [11] Dong, N., Xing, E.P.: Few-shot semantic segmentation with prototype learning. In: BMVC (2018)
  • [12] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  • [13] Fan, X., Wang, X., Gao, J., Wang, J., Luo, Z., Liu, R.: Bi-level learning of task-specific decoders for joint registration and one-shot medical image segmentation. In: CVPR. pp. 11726–11735 (2024)
  • [14] Fang, Z., Wang, X., Li, H., Liu, J., Hu, Q., Xiao, J.: FastRecon: Few-shot industrial anomaly detection via fast feature reconstruction. In: ICCV. pp. 17481–17490 (2023)
  • [15] Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML. vol. 70, pp. 1126–1135 (2017)
  • [16] Gong, X., Xia, X., Zhu, W., Zhang, B., Doermann, D., Zhuo, L.: Deformable Gabor feature networks for biomedical image classification. In: WACV. pp. 4004–4012 (2021)
  • [17] Hao, F., He, F., Liu, L., Wu, F., Tao, D., Cheng, J.: Class-aware patch embedding adaptation for few-shot image classification. In: ICCV. pp. 18905–18915 (2023)
  • [18] Herzog, J.: Adapt before comparison: A new perspective on cross-domain few-shot segmentation. In: CVPR. pp. 23605–23615 (2024)
  • [19] Kang, D., Cho, M.: Integrative few-shot learning for classification and segmentation. In: CVPR. pp. 9979–9990 (2022)
  • [20] Kang, D., Koniusz, P., Cho, M., Murray, N.: Distilling self-supervised vision transformers for weakly-supervised few-shot classification & segmentation. In: CVPR. pp. 19627–19638 (2023)
  • [21] Kang, D., Kwon, H., Min, J., Cho, M.: Relational embedding for few-shot classification. In: ICCV. pp. 8822–8833 (2021)
  • [22] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
  • [23] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: ICCV. pp. 4015–4026 (2023)
  • [24] Li, W.H., Liu, X., Bilen, H.: Cross-domain few-shot learning with task-specific adapters. In: CVPR. pp. 7161–7170 (2022)
  • [25] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
  • [26] Liu, J., Bao, Y., Xie, G.S., Xiong, H., Sonke, J.J., Gavves, E.: Dynamic prototype convolution network for few-shot semantic segmentation. In: CVPR. pp. 11553–11562 (2022)
  • [27] Liu, Y., Zhu, M., Li, H., Chen, H., Wang, X., Shen, C.: Matcher: Segment anything with one shot using all-purpose feature matching. In: ICLR (2024)
  • [28] Ma, T., Sun, Y., Yang, Z., Yang, Y.: ProD: Prompting-to-disentangle domain knowledge for cross-domain few-shot image classification. In: CVPR. pp. 19754–19763 (2023)
  • [29] Min, J., Kang, D., Cho, M.: Hypercorrelation squeeze for few-shot segmentation. In: ICCV. pp. 6941–6952 (2021)
  • [30] Moon, S., Sohn, S.S., Zhou, H., Yoon, S., Pavlovic, V., Khan, M.H., Kapadia, M.: MSI: Maximize support-set information for few-shot segmentation. In: ICCV. pp. 19266–19276 (2023)
  • [31] Nguyen, K., Todorovic, S.: Feature weighting and boosting for few-shot segmentation. In: ICCV. pp. 622–631 (2019)
  • [32] Oquab, M., Darcet, T., Moutakanni, T., et al.: DINOv2: Learning robust visual features without supervision. Trans. Mach. Learn. Res. (2024)
  • [33] Peng, B., Tian, Z., Wu, X., Wang, C., Liu, S., Su, J., Jia, J.: Hierarchical dense correlation distillation for few-shot segmentation. In: CVPR. pp. 23641–23651 (2023)
  • [34] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML. vol. 139, pp. 8748–8763 (2021)
  • [35] Raghu, A., Raghu, M., Bengio, S., Vinyals, O.: Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In: ICLR (2020)
  • [36] Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B.: One-shot learning for semantic segmentation. In: BMVC (2017)
  • [37] Silva-Rodríguez, J., Hajimiri, S., Ben Ayed, I., Dolz, J.: A closer look at the few-shot adaptation of large vision-language models. In: CVPR. pp. 23681–23690 (2024)
  • [38] Simon, C., Koniusz, P., Harandi, M.: Meta-learning for multi-label few-shot classification. In: WACV. pp. 346–355 (2022)
  • [39] Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning. In: NIPS. pp. 4077–4087 (2017)
  • [40] Sun, Q., Liu, Y., Chua, T.S., Schiele, B.: Meta-transfer learning for few-shot learning. In: CVPR. pp. 403–412 (2019)
  • [41] Team, G.G.: Gemini: A family of highly capable multimodal models. arXiv:2312.11805 [cs.CL] (2023)
  • [42] Tian, P., Wu, Z., Qi, L., Wang, L., Shi, Y., Gao, Y.: Differentiable meta-learning model for few-shot semantic segmentation. In: AAAI. pp. 12087–12094 (2020)
  • [43] Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J.B., Isola, P.: Rethinking few-shot image classification: A good embedding is all you need? In: ECCV. vol. 12359, pp. 266–282 (2020)
  • [44] Tian, Z., Lai, X., Jiang, L., Liu, S., Shu, M., Zhao, H., Jia, J.: Generalized few-shot semantic segmentation. In: CVPR. pp. 11563–11572 (2022)
  • [45] Tian, Z., Zhao, H., Shu, M., Yang, Z., Li, R., Jia, J.: Prior guided feature enrichment network for few-shot segmentation. IEEE T. Pattern Anal. Mach. Intell. 44(2), 1050–1065 (2022)
  • [46] Ullah, I., Carrión-Ojeda, D., Escalera, S., Guyon, I., Huisman, M., Mohr, F., van Rijn, J.N., Sun, H., Vanschoren, J., Vu, P.A.: Meta-Album: Multi-domain meta-dataset for few-shot image classification. In: NeurIPS. vol. 35, pp. 3232–3247 (2022)
  • [47] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)
  • [48] Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., Wierstra, D.: Matching networks for one shot learning. In: NIPS. pp. 3630–3638 (2016)
  • [49] Wang, J., Zhang, B., Pang, J., Chen, H., Liu, W.: Rethinking prior information generation with CLIP for few-shot segmentation. In: CVPR. pp. 3941–3951 (2024)
  • [50] Wang, K., Liew, J.H., Zou, Y., Zhou, D., Feng, J.: PANet: Few-shot image semantic segmentation with prototype alignment. In: ICCV. pp. 9196–9205 (2019)
  • [51] Wang, Y., Luo, N., Zhang, T.: Focus on query: Adversarial mining transformer for few-shot segmentation. In: NeurIPS. vol. 36, pp. 31524–31542 (2023)
  • [52] Wu, A., Han, Y., Zhu, L., Yang, Y.: Universal-prototype enhancing for few-shot object detection. In: ICCV. pp. 9567–9576 (2021)
  • [53] Xie, G.S., Xiong, H., Liu, J., Yao, Y., Shao, L.: Few-shot semantic segmentation with cyclic memory network. In: ICCV. pp. 7293–7302 (2021)
  • [54] Xu, Q., Zhao, W., Lin, G., Long, C.: Self-calibrated cross attention network for few-shot segmentation. In: ICCV. pp. 655–665 (2023)
  • [55] Yang, Y., Chen, Q., Feng, Y., Huang, T.: MIANet: Aggregating unbiased instance and general information for few-shot semantic segmentation. In: CVPR. pp. 7131–7140 (2023)
  • [56] Ye, C., Zhu, H., Liao, Y., Zhang, Y., Chen, T., Fan, J.: What makes for effective few-shot point cloud classification? In: WACV. pp. 1829–1838 (2022)
  • [57] Zhang, A., Gao, G., Jiao, J., Liu, C., Wei, Y.: Bridge the points: Graph-based few-shot segment anything semantically. In: NeurIPS (2024)
  • [58] Zhang, B., Xiao, J., Qin, T.: Self-guided and cross-guided learning for few-shot segmentation. In: CVPR. pp. 8312–8321 (2021)
  • [59] Zhang, M., Shi, M., Li, L.: MFNet: Multiclass few-shot segmentation network with pixel-wise metric learning. IEEE T. Circuits Syst. Video Tech. 32(12), 8586–8598 (2022)
  • [60] Zhou, F., Wang, P., Zhang, L., Wei, W., Zhang, Y.: Revisiting prototypical network for cross domain few-shot learning. In: CVPR. pp. 20061–20070 (2023)
  • [61] Zhou, Z., Xu, H.M., Shu, Y., Liu, L.: Unlocking the potential of pre-trained vision transformers for few-shot semantic segmentation through relationship descriptors. In: CVPR. pp. 3817–3827 (2024)
  • [62] Zhu, L., Chen, T., Ji, D., Ye, J., Liu, J.: LLaFS: When large language models meet few-shot segmentation. In: CVPR. pp. 3065–3075 (2024)
  • [63] Zhu, L., Chen, T., Yin, J., See, S., Liu, J.: Addressing background context bias in few-shot segmentation through iterative modulation. In: CVPR. pp. 3370–3379 (2024)
  • [64] Zhu, X., Zhang, R., He, B., Zhou, A., Wang, D., Zhao, B., Gao, P.: Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement. In: ICCV. pp. 2605–2615 (2023)
  • [65] Zhu, Y., Wang, S., Xin, T., Zhang, H.: Few-shot medical image segmentation via a region-enhanced prototypical transformer. In: MICCAI. pp. 271–280 (2023)

Appendix A Few-shot Evaluation Settings

As explained in Sec. 3.1, we introduce two novel few-shot evaluation settings, partially and fully augmented, to better utilize available annotations and represent realistic scenarios. Fig. 7 illustrates how the support set of a 2-way 2-shot task (the base task configuration) is modified under each setting.

In the original setting [19], additional annotations are discarded, leaving the base task configuration unchanged. The drawback of this setting is that it ignores available annotations. For example, in Fig. 7, the second image for the 1st class (dog) also contains annotations for the 2nd class (horse), but these are removed.

To address the loss of available annotations, our partially augmented setting retains annotations for the support classes (e.g., dog and horse in Fig. 7) while removing those for non-support classes (e.g., person in Fig. 7). Since few-shot classification and segmentation (FS-CS) methods treat N-way K-shot tasks as N separate 1-way K-shot tasks, each image (shot) must contain only one class (way). Therefore, if an image contains multiple support classes, we duplicate it so each copy includes only one. This leads to a task configuration that no longer follows the N-way K-shot definition [48], as the number of shots per class can vary. Consequently, the classes with fewer shots are randomly augmented using the available support data. For instance, in Fig. 7, the partially augmented setting converts the 2-way 2-shot task into a 2-way 3-shot task. Note that in this setting, the number of ways remains fixed, but the number of shots can increase up to $N \times K$.

On the other hand, our fully augmented setting incorporates all available annotations, including those for non-support classes (e.g., person in Fig. 7). The task is then processed similarly to the partially augmented setting. However, in this case, both the number of ways and the number of shots can increase, with shots still bounded by $N \times K$. Regardless of the setting, models are evaluated only on the final support classes. For example, in Fig. 7, models in the original and partially augmented settings are evaluated on dogs and horses, while in the fully augmented setting, they are also evaluated on people.
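The support-set modifications described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function name `augment_support`, the dictionary-based task representation, and the image identifiers are hypothetical, and re-balancing by repeating randomly chosen shots is only one possible reading of "randomly augmented using the available support data". The annotation layout of the example task is likewise assumed, chosen so that the resulting task shapes match those reported for Fig. 7.

```python
import random


def augment_support(base_shots, annotations, mode, seed=0):
    """Return {class: [image ids]} for one evaluation setting.

    base_shots:  {class: [image ids]} of the base N-way K-shot task.
    annotations: {image id: set of classes annotated in that image}.
    mode:        "original", "partial", or "full" (hypothetical names).
    """
    if mode not in {"original", "partial", "full"}:
        raise ValueError(f"unknown mode: {mode}")

    shots = {cls: list(imgs) for cls, imgs in base_shots.items()}
    if mode == "original":
        return shots  # extra annotations are discarded

    support_classes = set(base_shots)
    for cls, imgs in base_shots.items():
        for img in imgs:
            for other in sorted(annotations[img]):
                if other == cls:
                    continue  # already counted as a shot of its own class
                if mode == "partial" and other not in support_classes:
                    continue  # non-support annotations are removed
                # Duplicate the image as a single-class shot of `other`.
                if img not in shots.setdefault(other, []):
                    shots[other].append(img)

    # Assumption: classes with fewer shots are padded by repeating
    # randomly chosen shots of the same class.
    rng = random.Random(seed)
    target = max(len(imgs) for imgs in shots.values())
    for imgs in shots.values():
        available = list(imgs)
        while len(imgs) < target:
            imgs.append(rng.choice(available))
    return shots


# Hypothetical 2-way 2-shot task: image "d2" also shows a horse and a person.
base = {"dog": ["d1", "d2"], "horse": ["h1", "h2"]}
labels = {
    "d1": {"dog"},
    "d2": {"dog", "horse", "person"},
    "h1": {"horse"},
    "h2": {"horse"},
}

for mode in ("original", "partial", "full"):
    task = augment_support(base, labels, mode)
    print(mode, {cls: len(imgs) for cls, imgs in task.items()})
# original {'dog': 2, 'horse': 2}
# partial {'dog': 3, 'horse': 3}
# full {'dog': 3, 'horse': 3, 'person': 3}
```

In this toy task, the partially augmented setting keeps two ways but yields three shots per class, while the fully augmented setting additionally introduces a third way.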

Figure 7: Task augmentation pipeline for each few-shot evaluation setting. The 2-way 2-shot (base configuration) task is extracted from the PASCAL-$5^{i}$ dataset [36]. The original setting leaves this configuration unchanged, discarding available annotations. The partially augmented setting incorporates unused support-class annotations, yielding a 2-way 3-shot task. The fully augmented setting includes all annotations, increasing complexity to 3-way 3-shot.

Practical Considerations.

Because annotating data is time- and resource-consuming, few-shot evaluation settings should aim to maximize the use of existing annotations. Our partially augmented setting achieves this without increasing task difficulty and may even enhance FS-CS performance by providing more examples per class. On the other hand, our fully augmented setting incorporates all available annotations, increasing both the number of ways and the number of shots, making the task more challenging by requiring the model to classify and segment additional classes. Nevertheless, neither setting requires extra annotations beyond those already available; instead, they use existing labels discarded by the original setting to better represent realistic scenarios.

Our partially augmented setting can safely be used during training, although the loss of annotations in the original setting may not significantly affect that phase. In contrast, our fully augmented setting should be used only for evaluation to prevent data leakage, as it mixes training and test classes within the support set. Since the expected benefit during training is small and the fully augmented setting must not be used there, we recommend using both settings exclusively for evaluation.

Object Size Filtering.

Very small objects can add noise during task augmentation in the partially and fully augmented settings, so we filter out any object that occupies less than a threshold $\theta$ of the image area. We set $\theta = 7\,\%$ for PASCAL-$5^{i}$ [36] and $\theta = 3\,\%$ for COCO-$20^{i}$ [31], values that correspond to the 25th percentile of object sizes in the training set of each dataset. The different thresholds reflect the dataset-specific object size distributions.
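The filtering rule can be illustrated with a short, self-contained sketch. It assumes that object size is measured as the fraction of foreground pixels in a binary mask and that the percentile is computed with inclusive linear interpolation; neither detail is specified above, and the helper names (`relative_area`, `percentile_threshold`, `keep_object`) as well as the toy data are hypothetical.

```python
import statistics


def relative_area(mask):
    """Fraction of foreground pixels in a binary mask given as a list of rows."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def percentile_threshold(train_areas, q=25):
    """q-th percentile of the training object sizes (inclusive interpolation)."""
    cuts = statistics.quantiles(train_areas, n=100, method="inclusive")
    return cuts[q - 1]  # cuts[0] is the 1st percentile


def keep_object(mask, theta):
    """An object is filtered out when it occupies less than theta of the image."""
    return relative_area(mask) >= theta


# Toy 10x10 mask in which the object covers 5 of 100 pixels (5 % of the image).
mask = [[1] * 5 + [0] * 5] + [[0] * 10 for _ in range(9)]

print(keep_object(mask, theta=0.07))  # False: removed with the PASCAL threshold
print(keep_object(mask, theta=0.03))  # True: kept with the COCO threshold
```

With real data, `percentile_threshold` would be applied once to the object sizes of each training set to obtain the dataset-specific value of $\theta$.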

Fig. 8 shows the impact of applying these thresholds on the classification accuracy and segmentation mIoU of our efficient masked attention transformer (EMAT) on COCO-$20^{i}$ for $\{1,2\}$-way 1-shot tasks. Filtering out small objects consistently improves accuracy and mIoU in both augmented settings, with a larger gain in the fully augmented setting because it augments more tasks. Therefore, all results were obtained using the proposed thresholding.

Figure 8: Impact of object size filtering on task augmentation. “P. Aug.” and “F. Aug.” denote the partially and fully augmented settings, respectively, while $\theta$ indicates whether object size filtering was applied. The bars show the accuracy or mIoU of EMAT on the COCO-$20^{i}$ dataset [31] for $\{1,2\}$-way 1-shot tasks. Results for the 1-way 1-shot configuration in the partially augmented setting are omitted because, when $N=1$, this setting is equivalent to the original one. In both augmented settings, removing small objects increases accuracy and mIoU.

Appendix B FS-CS Per-Fold Results

Table 3: Per-fold comparison of FS-CS methods on PASCAL-$5^{i}$ [36]. Results are reported for all evaluation settings (original, partially augmented, and fully augmented) using 1000 tasks per fold with a base configuration of 2-way 1-shot. The number in parentheses in the partially and fully augmented settings indicates the total number of augmented tasks. CST* and EMAT were trained and evaluated, while other methods were only evaluated using the checkpoints from [19]. CST* uses the same backbone as EMAT (i.e., DINOv2 [32]). All values are percentages (higher is better). Highlight indicates our proposed method. Bold and underlined values represent the best and second best results.
Method Fold-0 (Acc. mIoU) Fold-1 (Acc. mIoU) Fold-2 (Acc. mIoU) Fold-3 (Acc. mIoU) Average (Acc. mIoU)
Original
PANet [50] 62.30 33.31 53.30 45.94 49.50 31.20 61.00 38.34 56.53 37.20
PFENet [45] 23.10 31.48 53.80 46.60 40.60 31.15 39.90 33.03 39.35 35.57
HSNet [29] 68.20 43.97 73.00 55.12 56.90 35.19 71.00 45.14 67.27 44.85
ASNet [19] 68.20 48.44 76.20 58.19 58.80 36.45 70.00 48.41 68.30 47.87
CST [20] 70.10 53.90 75.20 59.98 61.70 46.30 74.50 54.93 70.37 53.78
CST* 89.40 64.61 80.90 68.16 71.40 55.50 80.60 64.84 80.58 63.28
EMAT 90.70 66.68 83.50 67.61 72.00 54.15 84.60 65.07 82.70 63.38
Partially Augmented (↑ 242 tasks)
PANet [50] 62.30 33.31 53.00 46.00 51.70 32.35 60.70 38.32 56.93 37.49
PFENet [45] 23.10 31.48 54.10 46.68 40.70 31.28 40.00 33.02 39.48 35.61
HSNet [29] 68.20 43.97 72.90 54.85 59.00 34.90 70.90 45.18 67.75 44.72
ASNet [19] 68.20 48.44 76.00 57.88 60.50 36.28 69.80 48.51 68.62 47.78
CST [20] 70.10 53.90 75.30 60.04 62.40 46.32 74.60 54.97 70.60 53.81
CST* 89.40 64.61 80.90 68.10 71.50 55.44 80.60 64.77 80.60 63.23
EMAT 90.70 66.68 83.30 67.48 73.00 54.03 84.70 65.10 82.92 63.32
Fully Augmented (↑ 694 tasks)
PANet [50] 62.30 33.31 51.70 45.95 49.60 31.65 59.40 38.11 55.75 37.25
PFENet [45] 23.10 31.48 52.60 46.14 33.70 30.33 38.10 32.35 36.88 35.08
HSNet [29] 68.20 43.97 72.20 54.55 53.80 33.87 69.50 45.19 65.92 44.40
ASNet [19] 68.20 48.44 75.00 57.66 54.30 35.90 68.10 48.31 66.40 47.58
CST [20] 70.10 53.90 74.50 59.50 56.70 46.62 72.50 55.01 68.45 53.76
CST* 89.40 64.61 80.20 67.43 66.40 55.80 78.30 64.47 78.57 63.08
EMAT 90.70 66.68 82.60 66.93 68.30 54.37 83.30 65.00 81.23 63.24

Sec. 5.1 compares our EMAT with current state-of-the-art (SOTA) FS-CS methods, reporting results averaged over the four folds of the PASCAL-$5^{i}$ and COCO-$20^{i}$ datasets using 2-way 1-shot tasks. To complement those results, Tabs. 3 and 4 present per-fold results, showing that EMAT outperforms all FS-CS methods across most folds, datasets, and evaluation settings. These tables also report the number of augmented tasks in our proposed evaluation settings. On PASCAL-$5^{i}$, the partially and fully augmented settings augment 242 and 694 tasks, respectively, meaning that about 6 % of tasks contain extra support-class annotations (partially augmented), and 11 % contain extra annotations for non-support classes (fully augmented). On COCO-$20^{i}$, the corresponding augmentation rates are roughly 3 % and 35 %. These results confirm that our evaluation settings use annotations discarded by the original setting, thereby creating more realistic evaluation scenarios.
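The reported augmentation rates can be reproduced from the task counts in the table headers. The sketch below assumes 4000 tasks per dataset (1000 per fold over four folds) and assumes that the fully augmented count includes the tasks already augmented in the partially augmented setting, so that the non-support rate is obtained from the difference of the two counts. The second assumption is not stated explicitly in the text; it is the reading under which the percentages match. The function name `augmentation_rates` is hypothetical.

```python
def augmentation_rates(n_partial, n_full, n_tasks=4000):
    """Return (support-class rate, non-support-class rate) in percent."""
    support_rate = 100 * n_partial / n_tasks
    # Assumption: fully augmented tasks are a superset of partially augmented ones.
    non_support_rate = 100 * (n_full - n_partial) / n_tasks
    return support_rate, non_support_rate


for name, n_partial, n_full in [("PASCAL", 242, 694), ("COCO", 106, 1515)]:
    support, non_support = augmentation_rates(n_partial, n_full)
    print(f"{name}: {support:.1f} % support, {non_support:.1f} % non-support")
# PASCAL: 6.0 % support, 11.3 % non-support
# COCO: 2.6 % support, 35.2 % non-support
```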

Table 4: Per-fold comparison of FS-CS methods on COCO-$20^{i}$ [31]. Results are reported for all evaluation settings (original, partially augmented, and fully augmented) using 1000 tasks per fold with a base configuration of 2-way 1-shot. The number in parentheses in the partially and fully augmented settings indicates the total number of augmented tasks. CST* and EMAT were trained and evaluated, while other methods were only evaluated using the checkpoints from [19]. CST* uses the same backbone as EMAT (i.e., DINOv2 [32]). All values are percentages (higher is better). Highlight indicates our proposed method. Bold and underlined values represent the best and second best results.
Method Fold-0 (Acc. mIoU) Fold-1 (Acc. mIoU) Fold-2 (Acc. mIoU) Fold-3 (Acc. mIoU) Average (Acc. mIoU)
Original
PANet [50] 46.60 24.91 52.70 24.98 55.90 23.31 50.00 21.36 51.30 23.64
PFENet [45] 35.60 23.99 34.30 24.57 43.10 20.99 32.80 23.93 36.45 23.37
HSNet [29] 57.70 29.77 62.30 30.94 67.10 31.31 62.60 30.31 62.43 30.58
ASNet [19] 59.50 29.75 61.50 32.99 68.80 33.41 62.40 30.35 63.05 31.62
CST [20] 61.00 34.74 66.40 37.14 68.20 36.76 60.50 36.29 64.02 36.23
CST* 74.00 49.38 79.90 53.88 81.30 51.46 79.60 51.15 78.70 51.47
EMAT 74.00 50.54 83.10 55.44 83.10 53.05 80.10 52.19 80.07 52.81
Partially Augmented (↑ 106 tasks)
PANet [50] 46.40 25.27 52.80 25.10 55.90 23.27 50.20 21.48 51.32 23.78
PFENet [45] 35.80 24.08 34.30 24.52 43.00 21.00 32.90 23.96 36.50 23.39
HSNet [29] 57.50 29.88 62.60 31.02 67.00 31.30 62.50 30.44 62.40 30.66
ASNet [19] 59.30 29.87 61.70 32.95 68.80 33.39 62.30 30.36 63.03 31.64
CST [20] 61.30 34.59 66.20 37.08 68.40 36.76 60.50 36.36 64.10 36.20
CST* 74.60 49.27 79.80 53.98 81.50 51.68 79.60 51.19 78.87 51.53
EMAT 74.30 50.59 83.20 55.42 83.30 53.06 80.20 52.22 80.25 52.82
Fully Augmented (↑ 1515 tasks)
PANet [50] 30.90 24.43 49.30 24.44 53.50 22.98 46.60 20.81 45.07 23.17
PFENet [45] 19.60 20.33 29.10 23.79 40.70 20.40 27.90 21.92 29.33 21.61
HSNet [29] 40.20 27.73 57.50 29.82 64.30 30.96 58.60 29.26 55.15 29.44
ASNet [19] 41.40 27.23 55.70 32.10 66.60 33.14 58.20 29.43 55.47 30.47
CST [20] 44.70 33.74 59.70 36.32 65.60 36.87 55.20 35.46 56.30 35.60
CST* 54.40 48.25 74.80 52.86 79.70 51.69 75.80 50.25 71.18 50.76
EMAT 56.30 49.53 78.50 54.46 81.40 53.06 75.80 50.90 73.00 51.99

To evaluate the scalability across different N-way K-shot configurations, Fig. 9 shows the accuracy and mIoU of all FS-CS methods on PASCAL-$5^{i}$ and COCO-$20^{i}$ in the fully augmented setting using $\{1,\dots,5\}$-way 5-shot tasks. EMAT consistently achieves higher classification accuracy and mIoU compared to other FS-CS methods. Notably, while increasing the number of classes (N) typically raises task difficulty, EMAT maintains stable segmentation performance, further validating the fully augmented setting as a challenging yet realistic and effective benchmark for FS-CS evaluation.

Figure 9: Comparison of FS-CS methods on PASCAL-$5^{i}$ [36] and COCO-$20^{i}$ [31] under the fully augmented setting. CST* uses the same backbone as EMAT (i.e., DINOv2 [32]). Results are averaged over 4000 tasks (1000 per fold) using a base task configuration of $\{1,\dots,5\}$-way 5-shot. The number of augmented tasks for each value of N in PASCAL-$5^{i}$ is 1243, 1758, 2094, 2319, and 2522, respectively. For COCO-$20^{i}$, the corresponding values are 2136, 2982, 3432, 3674, and 3806.

Appendix C Analysis of Object Size Distribution in PASCAL-$5^{i}$

Figure 10: Per-fold object size distribution in the PASCAL-$5^{i}$ [36] test set. Each histogram shows the frequency of objects based on their relative size (as a percentage of the image area), up to 50 %.

Sec. 5.3 analyzes the impact of higher-resolution correlation tokens on small objects by filtering each fold of PASCAL-$5^{i}$ and COCO-$20^{i}$ based on object size. This filtering results in three splits: objects occupying 0–5 %, 5–10 %, and 10–15 % of the image. These intervals were chosen after examining the object size distribution in the PASCAL-$5^{i}$ test set, which has fewer test examples than COCO-$20^{i}$.

Fig. 10 presents frequency histograms of object sizes (up to 50 % of the image) for all four PASCAL-$5^{i}$ test folds. Based on the fold with the most examples (Fold 2), we selected the size intervals that ensure each split contains at least 100 examples, i.e., 0–5 %, 5–10 %, and 10–15 %. For consistency and comparability, we apply the same splits to COCO-$20^{i}$. Furthermore, Fig. 10 shows that the majority of objects in PASCAL-$5^{i}$ are very small, highlighting the importance of accurately classifying and segmenting small objects, a challenge that our EMAT effectively addresses.
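The split construction can be sketched as a simple binning step followed by a count check. The example below is illustrative only: it assumes half-open intervals (an object at exactly 5 % falls into the 5–10 % split), which the text does not specify, and it uses synthetic object sizes rather than the actual PASCAL-$5^{i}$ statistics. The function names are hypothetical.

```python
def split_by_size(areas_pct, edges=(0, 5, 10, 15)):
    """Group object sizes (in % of the image area) into bins [lo, hi)."""
    bins = list(zip(edges, edges[1:]))
    splits = {f"{lo}-{hi} %": [] for lo, hi in bins}
    for area in areas_pct:
        for lo, hi in bins:
            if lo <= area < hi:
                splits[f"{lo}-{hi} %"].append(area)
                break  # objects of 15 % or more belong to no split
    return splits


def all_splits_large_enough(splits, min_count=100):
    """Check that every split contains at least `min_count` examples."""
    return all(len(objs) >= min_count for objs in splits.values())


# Synthetic object sizes for a single fold (not real dataset statistics).
areas = [1.0] * 150 + [7.5] * 120 + [12.0] * 100 + [40.0] * 30
splits = split_by_size(areas)

print({name: len(objs) for name, objs in splits.items()})
# {'0-5 %': 150, '5-10 %': 120, '10-15 %': 100}
print(all_splits_large_enough(splits))  # True
```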

Appendix D Comparison to SOTA FS-S

Table 5: Comparison of FS-S methods on PASCAL-$5^{i}$ [36] and COCO-$20^{i}$ [31] using 1-way $\{1,5\}$-shot tasks. For fairness, EMATs exclude the classification head. All values correspond to mIoU in percentage (higher is better). The mIoU values for EMATs were obtained through our own training and evaluation, while values for other methods are taken directly from their respective papers. The top part of the table shows methods based on ResNet backbones, and the bottom part shows those based on foundation models. Highlight indicates our proposed method. Bold and underlined values represent the best and second best results.
Method Venue Backbone PASCAL-$5^{i}$ (1-shot 5-shot) COCO-$20^{i}$ (1-shot 5-shot)
MIANet [55] CVPR’23 ResNet50 68.7 71.6 47.7 51.7
HDMNet [33] CVPR’23 ResNet50 69.4 71.8 50.0 56.0
VAT + MSI [30] ICCV’23 ResNet101 70.1 72.2 49.8 54.0
SCCAN [54] ICCV’23 ResNet101 68.3 71.5 48.2 57.0
AMFormer [51] NeurIPS’23 ResNet101 70.7 73.6 51.0 57.3
PMNet [7] WACV’24 ResNet101 68.1 73.9 43.7 53.1
ABCB [63] CVPR’24 ResNet101 72.0 74.9 51.5 58.8
EMATs DINOv2-S 72.5 75.9 59.8 65.0
LLaFS [62] CVPR’24 ResNet50 + CodeLlama-7B 73.5 75.6 53.9 60.0
PI-CLIP [49] CVPR’24 ResNet50 + CLIP-B 76.8 77.2 56.8 59.1
Matcher [27] ICLR’24 DINOv2-L + SAM 68.1 74.0 52.7 60.7
GF-SAM [57] NeurIPS’24 DINOv2-L + SAM 72.1 82.6 58.7 66.8
EMATs DINOv2-S 72.5 75.9 59.8 65.0

To compare our EMAT with SOTA few-shot segmentation (FS-S) methods, we removed its classification head and denote this adapted version as EMATs. This adaptation ensures a fair comparison to existing FS-S methods, as simultaneously handling classification and segmentation is more challenging than focusing only on segmentation. Furthermore, we adopt the 1-way K-shot formulation used by the FS-S methods, where the query image always contains the support class.

Tab. 5 presents a comparison between EMATs and SOTA FS-S methods, divided into two groups: methods based on ResNet backbones and those based on foundation models. Although EMATs is based on a foundation model (DINOv2), it uses the small variant (DINOv2-S), whose model size is comparable to ResNet50. Therefore, comparing EMATs with FS-S methods based on ResNet backbones is still meaningful.

The top part of Tab. 5 shows that EMATs outperforms all ResNet-based FS-S methods on both the PASCAL-$5^{i}$ and COCO-$20^{i}$ datasets and in all task configurations, even surpassing methods based on a significantly larger backbone, ResNet101. Notably, the improvement in mIoU is more pronounced on COCO-$20^{i}$, with absolute gains of +8.3 % and +6.2 % for 1-shot and 5-shot settings, respectively.

On the other hand, comparing EMATs to FS-S methods based on foundation models (bottom part of Tab. 5) is less straightforward for two main reasons: (1) all baseline methods combine two models, and (2) the foundation models they use are significantly larger than the one used by EMATs. For example, both Matcher [27] and GF-SAM [57] use the large version of DINOv2 (i.e., DINOv2-L), while EMATs uses the small variant (i.e., DINOv2-S). Nevertheless, our method still achieves an absolute improvement of +1.1 % mIoU on COCO-$20^{i}$ in the 1-way 1-shot setting and overall obtains competitive mIoU compared to other SOTA FS-S methods based on large foundation models.
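The absolute gains quoted for COCO-$20^{i}$ follow directly from the values in Tab. 5. As a small worked check, the sketch below copies the COCO-$20^{i}$ columns of the table and computes, for each shot setting, the difference between EMATs and the best baseline in each group. The variable and function names are hypothetical, and differences are rounded to one decimal to match the precision of the table.

```python
# COCO-20^i mIoU as (1-shot, 5-shot), copied from Tab. 5.
RESNET_BASED = {
    "MIANet": (47.7, 51.7),
    "HDMNet": (50.0, 56.0),
    "VAT + MSI": (49.8, 54.0),
    "SCCAN": (48.2, 57.0),
    "AMFormer": (51.0, 57.3),
    "PMNet": (43.7, 53.1),
    "ABCB": (51.5, 58.8),
}
FOUNDATION_BASED = {
    "LLaFS": (53.9, 60.0),
    "PI-CLIP": (56.8, 59.1),
    "Matcher": (52.7, 60.7),
    "GF-SAM": (58.7, 66.8),
}
EMAT_S = (59.8, 65.0)


def gains_over_best(ours, baselines):
    """Per-column difference between our mIoU and the best baseline mIoU."""
    best = [max(column) for column in zip(*baselines.values())]
    return tuple(round(o - b, 1) for o, b in zip(ours, best))


print(gains_over_best(EMAT_S, RESNET_BASED))      # (8.3, 6.2)
print(gains_over_best(EMAT_S, FOUNDATION_BASED))  # (1.1, -1.8)
```

The negative 5-shot value in the second group reflects that GF-SAM, which relies on a much larger backbone, remains ahead in that configuration, in line with the description of the results as competitive.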