Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation

Haoran Chen, Zexiao Wang, Haidong Cao, Zuxuan Wu, and Yu-Gang Jiang

H. Chen, Z. Wang, H. Cao, Z. Wu, Y-G. Jiang are with the Institute of Trustworthy Embodied AI, Fudan University. E-mail: {chenhran21, zexiaowang25, hdcao24}@m.fudan.edu.cn, {zxwu, ygj}@fudan.edu.cn

Abstract

Large Vision-Language Models such as CLIP have become a powerful foundation for Unsupervised Domain Adaptation due to their strong zero-shot generalization. State-of-the-art methods typically leverage CLIP to generate pseudo-labels for the target domain, then fine-tune the model to learn domain-invariant features. However, these methods attempt to align source and target domains using all pseudo-labeled data simultaneously. This one-shot alignment struggles with noisy, hard-to-classify samples, leading to error propagation and suboptimal feature learning. The problem is further amplified in the multi-source scenario, where diverse domain gaps and varying noise levels across multiple source domains further destabilize the alignment process. To address this issue, we propose a progressive alignment strategy for adapting CLIP to unlabeled downstream tasks. Our method begins by training the model on a high-confidence subset of target samples, allowing it to first learn a well-aligned representation from the most reliable data. As training progresses, it gradually incorporates more challenging samples, guiding the model to refine its understanding without being overwhelmed by initial label noise. This progressive approach effectively mitigates confirmation bias and promotes more robust convergence, allowing for the learning of genuinely domain-invariant features. We name our approach MP$^{2}$A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. Experiments show that MP$^{2}$A achieves state-of-the-art performance compared with the most recent CLIP-based MS-UDA approaches, demonstrating the effectiveness of our approach.

Index Terms:
Multi-source unsupervised domain adaptation, transfer learning, vision-language models, CLIP.

I Introduction

Recent years have witnessed a massive explosion in the capabilities of artificial intelligence[1, 2, 3], driven by the advent of large-scale models pre-trained on vast, web-scale datasets. These models have demonstrated remarkable abilities to generalize across a wide variety of tasks, often achieving human-level performance when the test data closely resembles their training distribution. However, when deployed in real-world scenarios, these systems frequently encounter data from new environments, sensors, or styles, a phenomenon known as domain shift[4, 5, 6]. This distributional mismatch between training and deployment data can cause a drastic and unexpected drop in accuracy[7, 8, 9]. While one might consider labeling deployment data for downstream adaptation, the manual annotation of data for every new domain is both prohibitively expensive and unscalable. As such, enabling models to autonomously adapt to new, unlabeled domains has become a central and fundamental challenge.

To address this challenge, Unsupervised Domain Adaptation (UDA) has emerged as a key research area, aiming to transfer knowledge from a labeled source domain to an unlabeled target domain[10, 11, 12]. A more challenging yet realistic extension of this setting is Multi-Source Unsupervised Domain Adaptation (MS-UDA)[13], where labeled data from multiple, heterogeneous source domains is available for transfer. This scenario is particularly relevant as real-world data is often aggregated from a variety of sources[12, 14]. However, the multi-source setting introduces significant complexities: the model must now bridge multiple, potentially conflicting domain gaps and handle varying data quality across sources. This can lead to an unstable alignment process and the risk of negative transfer, where a poorly matched source domain hinders rather than helps adaptation to the target.

Figure 1: Illustration of our progressive domain alignment pipeline. The adaptation process starts with the learning phase, where the model is trained on the current cluster. Following this, we refine the pseudo-labels of the current cluster using the updated model. As training advances, increasingly challenging classes are gradually introduced. A subset of high-confidence samples from the refined cluster is then rehearsed, i.e., transferred into the next cluster’s training set. This progressive learning strategy enables stable adaptation from easy to hard samples while mitigating error accumulation and catastrophic forgetting.

Recently, researchers have leveraged large Vision-Language Models such as CLIP[15] to address the MS-UDA problem[16, 17, 18]. With its strong zero-shot capability, CLIP can generate pseudo-labels for unlabeled target data by aligning images with text-based class descriptions. A common strategy then is to fine-tune the model on the entire pseudo-labeled target set, aiming to align the source and target feature distributions. However, this “one-shot” alignment approach has a critical drawback: it treats all pseudo-labels with equal importance. Noisy predictions on ambiguous or hard samples are immediately incorporated into training, resulting in confirmation bias, error propagation, and suboptimal feature learning[19]. While some studies explore iterative pseudo-label refinement through self-training[20, 21, 22], these methods are computationally expensive, as each round processes the full training set. This challenge is further exacerbated in MS-UDA, where the compounded domain shifts and varying noise levels across multiple source domains make the initial alignment process even less reliable, demanding both greater robustness and computational efficiency.

To address the limitations of existing methods, we propose Multi-Source Progressive Prompt Alignment (MP$^{2}$A), a simple yet effective framework built on a “learn, refine, and rehearse” cycle that progressively adapts CLIP to the target domain through an easy-to-hard curriculum. This cycle is designed to stabilize learning by starting from reliable supervision and gradually incorporating more ambiguous data.

Before detailing the learning cycle, we perform several key preparatory steps. First, we structure the problem space using a balanced KMeans[23, 24] algorithm to cluster classes based on their CLIP visual embeddings. To estimate the difficulty of each cluster, we compute the average cosine similarity between the image embeddings and their corresponding text-based class prototypes. Clusters with higher average similarity are deemed easier, enabling us to sort the clusters from easy to hard and establish a progressive training schedule.

Additionally, unlike prior CLIP-based UDA approaches that directly adopt zero-shot pseudo-labels from the raw CLIP model[16, 25, 18], we argue that this overlooks valuable information from the labeled source domains, particularly in the multi-source setting. To overcome this, we first pre-train CLIP-based models individually on each source domain using supervised data. We then generate pseudo-labels on the target domain by ensembling the predictions from all source-specific models, employing strategies such as confidence averaging or majority voting to produce more reliable pseudo-labels for downstream training.

Within each curriculum stage, we begin with the learning phase, where we independently optimize a set of learnable textual prompts[26, 27] and a shared set of lightweight visual PEFT[28, 29, 30] modules for each source–target domain pair. These components allow for efficient and expressive domain adaptation in both modalities. To consolidate and align the learned textual prompts, we follow prior work[17] by passing them through a shallow auto-encoder, which projects the high-dimensional prompt vectors into a shared, denoised latent space[31, 27]. To further encourage consistency across source domains, we introduce an $\mathcal{L}_{1}$ regularization loss on the predictions from different prompt sets, guiding them toward a unified representation for the target domain.

The core innovation of our approach lies in the refine and rehearse phase. After training on a given cluster, the model regenerates pseudo-labels for that cluster, reflecting its improved understanding. From these, we extract only the most confident samples, i.e., those exceeding a predefined probability threshold, and incorporate them into the training set for the next, more challenging cluster. This selective rehearsal mechanism serves a dual purpose: it anchors future learning with reliable examples to prevent catastrophic forgetting[7], while also constraining the training set size for improved computational efficiency. Crucially, in contrast to conventional self-training methods that indiscriminately reuse all past data[21], our confidence-aware strategy reduces error propagation and mitigates overfitting to noisy labels. By progressively expanding the training data with only well-aligned samples, MP$^{2}$A achieves more stable convergence and learns domain-invariant representations in a robust and scalable manner. The whole pipeline of our training cycle is illustrated in Figure 1.

We validate the effectiveness of our proposed framework through extensive experiments on three widely-used MS-UDA benchmarks: ImageCLEF [32], Office-Home [33], and DomainNet [34]. These datasets encompass a broad range of domains, visual styles, and adaptation challenges, providing a comprehensive testbed for evaluating multi-source domain adaptation methods. Across all benchmarks, MP$^{2}$A achieves average accuracies of 94.3%, 91.8%, and 64.1%, respectively, consistently surpassing recent CLIP-based UDA approaches by a notable margin. These results demonstrate the robustness, scalability, and generalization capability of our progressive alignment framework, particularly under the challenging multi-source scenarios.

A preliminary version of this work, MPA, appeared in [17], where we were among the first to explore the use of CLIP for domain adaptation. Since then, the research landscape has evolved rapidly, with numerous follow-up studies advancing CLIP-based adaptation[18, 35]. To remain competitive and push the frontier further, this journal version introduces several substantial improvements over the original conference paper.

First, while the earlier version primarily emphasized textual prompt learning, the current work fully exploits both the textual and visual modalities of CLIP by incorporating PEFT modules. This enables the model to learn more expressive and adaptable visual representations. Second, we introduce a novel source pretraining strategy that leverages the supervision from labeled source domains to generate more robust and accurate pseudo-labels for the target domain. Third, we propose a curriculum-based alignment framework that progressively adapts the model from easy to hard samples, effectively mitigating confirmation bias and improving training stability.

In addition, we significantly expand the experimental scope: we re-implement and fairly compare our method against a comprehensive set of strong CLIP-based baselines using modern ViT-based backbones, aligning with current trends in large-scale vision models. Finally, we provide in-depth empirical analysis to validate the robustness and compatibility of our proposed progressive alignment strategy across diverse domain adaptation benchmarks.

II Related Works

II-A Multi-Source Unsupervised Domain Adaptation

First introduced by Yang et al.[36], MS-UDA extends the standard UDA setting by leveraging multiple labeled source domains to improve generalization to an unlabeled target domain. Classic approaches often focus on aligning distributions across domains. Adversarial methods like MDAN[37] and DCTN [13] train domain discriminators to encourage feature alignment, while discrepancy-based methods such as MFSAN[38] and M$^{3}$SDA [34] align each source-target pair by minimizing statistical distances like maximum mean discrepancy or moment distance.

With the emergence of large-scale pretrained networks, self-training has become a dominant strategy in UDA, where pseudo-labels generated on target data are used to iteratively refine the model. For example, CST[39] employs a bi-model framework where two networks alternately generate pseudo-labels for each other, reducing overfitting to self-predictions. Extending this to the multi-source case, CSR[40] introduces an instance-level ensemble of source-specific models and a domain-ensemble network, refined in a mutual loop via cross-model supervision. These self-training strategies demonstrate the effectiveness of leveraging target structure and inter-model consistency to enhance multi-source adaptation.

II-B CLIP based MS-UDA

Recent advances in vision-language models[41, 42, 43], especially CLIP[15], have sparked new research in domain adaptation by leveraging rich multimodal semantics. A pioneering effort in this direction is DAPL[16], which introduces prompt learning into UDA by disentangling domain and category representations. It optimizes domain-specific prompts to maintain semantic structures while avoiding the limitations of hard domain alignment. Building on this idea, subsequent works have proposed various strategies to enhance prompt learning. AD-CLIP[35] and DAMP[18] extend this direction by designing domain-invariant prompts and integrating visual context into the prompting process. PDA[25] further incorporates domain knowledge into a two-branch prompt tuning framework, aligning class and domain representations via feature banks and contrastive losses. However, most existing CLIP-based UDA methods are limited to the single-source setting. Addressing this gap, MPA[17] is the first to tackle multi-source UDA, introducing a lightweight and modular multi-prompt alignment strategy that learns separate prompts per source-target pair and aligns them via autoencoding and consensus optimization. Inspired by these insights, PGA[44] formulates UDA as a multi-objective optimization problem and aligns prompt gradients across domains to encourage consistent learning, achieving robust adaptation under both single and multi-source settings. Collectively, these methods demonstrate the promise of prompt-based CLIP adaptation in both efficiency and accuracy for domain adaptation.

II-C Curriculum Learning

The concept of Curriculum Learning, formally introduced by [45], advocates for training models on a structured sequence of examples from easy to hard. This paradigm, inspired by human learning, aims to guide optimization towards more robust solutions and improve generalization by first building a foundation on simpler concepts[46, 47, 48].

Despite its potential, curriculum learning remains underutilized in UDA. One notable effort is C-SFDA[49], which applies a dynamic, self-paced curriculum based on prediction confidence and uncertainty. To determine difficulty, C-SFDA measures variance across multiple test-time augmentations and adaptively thresholds sample reliability. While effective, this process requires a substantial amount of hand-tuned scheduling and heuristic design including augmentation policies, uncertainty statistics, and carefully calibrated thresholds, making the method complex and sensitive to hyperparameter settings.

In contrast, our approach introduces a static, data-driven curriculum that is both simpler and more robust. We predefine the learning schedule by clustering classes using CLIP visual embeddings and ranking them by image-text alignment difficulty, independent of model predictions. Our “learn, refine, and rehearse” framework progressively introduces harder classes while anchoring training with high-confidence examples from earlier stages. This avoids the need for online confidence computation and extensive scheduling logic, resulting in a more interpretable and computationally efficient curriculum learning strategy.

Figure 2: Illustration of our architectural design. The left panel shows the text encoder, where textual prompts consist of class-specific context tokens $v_{i}^{k}$ and domain-specific tokens $d_{j}^{d}$. Only these prompt tokens are updated during training, while the rest of the text encoder remains frozen. The right panel presents three visual PEFT modules used to adapt the CLIP visual encoder to the target domain: (1) Visual Prompt, where learnable tokens are inserted at the input of each transformer[50, 51] layer; (2) LoRA, where trainable low-rank matrices $A$ and $B$ are merged with the query and value projections during self-attention; and (3) Adapter, where lightweight adapter blocks with down-projection, non-linearity, and up-projection are inserted after the feed-forward network.

III Method

To achieve more robust alignment on the target domain, we propose a simple yet effective “learn, refine, and rehearse” framework built upon the pre-trained CLIP model, leveraging its strong zero-shot capability. The core idea is to partition the learning data into subsets and train the model in a sequential, curriculum-based manner that progresses from easy to hard examples. In this way, the adaptation is gradually enhanced and the impact of noisy supervision is mitigated. In the following sections, we begin by introducing the design of textual prompts and visual parameter-efficient tuning modules in Section III-A. The procedure for generating initial pseudo-labels is then detailed in Section III-B. Next, we describe our data partitioning strategy based on CLIP embeddings in Section III-C. The core learning phase is presented in Section III-D, followed by the refine-and-rehearse phase in Section III-E.

III-A Architectural design

Let $N$ denote the total number of domains, where the first $N-1$ domains are source domains and the $N$-th domain is the target domain. Let $K$ denote the number of classes shared across all domains. In the multi-source UDA setting, our objective is to learn a domain-invariant latent space such that both the inter-source domain gaps and the discrepancies between each source and the target domain are minimized. To this end, we design textual prompts that incorporate both domain-invariant and domain-specific features, and introduce visual Parameter-Efficient Fine-Tuning modules to enhance visual adaptation to the target domain.

III-A1 Textual Prompt Design

For the textual prompts, we construct two types of learnable tokens: class-specific context tokens $v_{i}^{k}$, where $i\in\{1,2,\ldots,M_{1}\}$ and $k\in\{1,2,\ldots,K\}$, and domain-specific tokens shared across all classes, denoted as $d_{j}^{d}$, where $j\in\{1,2,\ldots,M_{2}\}$ and $d\in\{s,t\}$. Here, $M_{1}$ and $M_{2}$ represent the number of tokens of each type, and $s$ and $t$ indicate the source and target domains, respectively. Combining these tokens, we construct prompts for $2K$ categories (source and target for each class) used in contrastive training. Specifically, for each source-target pair, we define the prompt embedding matrix as:

$$P_{i}=[t_{1}^{s_{i}},\ldots,t_{K}^{s_{i}},\ t_{1}^{t_{i}},\ldots,t_{K}^{t_{i}}]^{\top},\quad i\in\{1,2,\ldots,N-1\}.$$

An illustration of this design is shown in the left panel of Figure 2.
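
As a rough illustration of how the $2K$ prompts arise from the two token types, the following NumPy sketch stacks $K$ source prompts and $K$ target prompts. The token layout (class context tokens first, then domain tokens), the omission of any class-name token, and the function name `build_prompt_tokens` are assumptions made for illustration; the text above does not fix these details, and the sketch stops before the text encoder.

```python
import numpy as np

def build_prompt_tokens(class_ctx, src_dom, tgt_dom):
    """Stack 2K prompts: K source prompts followed by K target prompts.

    class_ctx: (K, M1, dim) class-specific context tokens v_i^k
    src_dom:   (M2, dim)    source-domain tokens d_j^s
    tgt_dom:   (M2, dim)    target-domain tokens d_j^t
    returns:   (2K, M1 + M2, dim)
    """
    K = class_ctx.shape[0]
    prompts = []
    for dom in (src_dom, tgt_dom):
        # Domain tokens are shared across classes, so repeat them K times.
        dom_rep = np.broadcast_to(dom, (K,) + dom.shape)
        # Assumed layout: class context tokens first, then domain tokens.
        prompts.append(np.concatenate([class_ctx, dom_rep], axis=1))
    return np.concatenate(prompts, axis=0)

rng = np.random.default_rng(0)
K, M1, M2, dim = 3, 4, 2, 8          # illustrative sizes only
tokens = build_prompt_tokens(rng.normal(size=(K, M1, dim)),
                             rng.normal(size=(M2, dim)),
                             rng.normal(size=(M2, dim)))
print(tokens.shape)  # (6, 6, 8)
```

One such stack would be built per source-target pair, giving $N-1$ prompt sets in total.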

III-A2 Visual PEFT Modules

The original MPA framework utilizes the raw visual embeddings directly from the frozen CLIP vision encoder. In contrast, for MP$^{2}$A we take a step further and adapt the visual encoder to the target domain in a parameter-efficient manner. Specifically, we investigate three lightweight fine-tuning strategies: prompt tuning[52], LoRA[30], and adapter[28] modules. These modules are inserted into each Transformer block, and only their parameters are updated, while the rest of the vision encoder remains frozen.

Prompt tuning prepends a set of learnable prompt tokens to the patch embeddings at each Transformer layer. Given an input image represented as patch embeddings $X\in\mathbb{R}^{n\times d}$, where $n$ is the number of patches and $d$ is the embedding dimension, we introduce learnable tokens $E\in\mathbb{R}^{M_{3}\times d}$ and modify the input as:

$$\tilde{X}=[E;X]\in\mathbb{R}^{(M_{3}+n)\times d}$$
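
In array terms, the prompt-tuning input is a concatenation along the token axis. The sketch below is only a shape check under assumed sizes (`n`, `d`, and `M3` are illustrative values, and `prepend_visual_prompts` is a hypothetical helper name).

```python
import numpy as np

def prepend_visual_prompts(X, E):
    """Return [E; X]: M3 learnable tokens stacked on top of n patch embeddings."""
    assert X.shape[1] == E.shape[1], "prompt and patch dims must match"
    return np.concatenate([E, X], axis=0)

n, d, M3 = 196, 768, 8          # illustrative sizes, not taken from the paper
X = np.zeros((n, d))            # patch embeddings
E = np.ones((M3, d))            # learnable prompt tokens
X_tilde = prepend_visual_prompts(X, E)
print(X_tilde.shape)  # (204, 768)
```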

LoRA modifies the weight matrices in the attention layers by injecting low-rank updates. For a given weight matrix $W\in\mathbb{R}^{d\times d}$, two trainable matrices $A\in\mathbb{R}^{r_{1}\times d}$ and $B\in\mathbb{R}^{d\times r_{1}}$ with $r_{1}\ll d$ are introduced, and the updated matrix becomes:

$$W^{\prime}=W+\Delta W,\quad\Delta W=BA$$

In our design, LoRA is applied to the query and value projection matrices in the self-attention mechanism:

$$Q=(W_{q}+B_{q}A_{q})X,\quad V=(W_{v}+B_{v}A_{v})X$$
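
A minimal NumPy sketch of the low-rank update on the query projection may make the shapes concrete. It stores tokens as rows, so the projection is applied through a transpose; the zero initialization of $B$ is the usual LoRA convention and is an assumption here, as are the sizes and the helper name `lora_project`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r1 = 5, 16, 2                          # illustrative sizes with r1 << d

X = rng.normal(size=(n, d))                  # one row per token
W_q = rng.normal(size=(d, d))                # frozen pre-trained query projection
A_q = rng.normal(scale=0.01, size=(r1, d))   # trainable, small random init
B_q = np.zeros((d, r1))                      # trainable, zero init so that BA = 0 at start

def lora_project(X, W, A, B):
    """Apply (W + B A) to every token; rows of X are tokens, hence the transpose."""
    return X @ (W + B @ A).T

Q0 = lora_project(X, W_q, A_q, B_q)
assert np.allclose(Q0, X @ W_q.T)            # unchanged while B is zero

B_q = rng.normal(scale=0.01, size=(d, r1))   # stand-in for a trained B
Q1 = lora_project(X, W_q, A_q, B_q)
# Same result without forming the d x d update: two thin matrix products.
assert np.allclose(Q1, X @ W_q.T + (X @ A_q.T) @ B_q.T)
```

The value projection is handled identically with its own pair of low-rank matrices.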

Adapter modules are lightweight residual layers inserted within each Transformer block. A typical adapter is composed of a down-projection, a nonlinearity, and an up-projection:

$$\text{Adapter}(h)=h+W_{\text{up}}\,\text{ReLU}(W_{\text{down}}h)$$

where $W_{\text{up}}\in\mathbb{R}^{d\times r_{2}}$ and $W_{\text{down}}\in\mathbb{R}^{r_{2}\times d}$ with $r_{2}\ll d$. The adapter output is added back residually and layer normalized:

$$h^{\prime}=\text{LayerNorm}(h+\text{Adapter}(h))$$
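
The adapter computation can be sketched in a few lines of NumPy. The sketch follows the down-projection, nonlinearity, up-projection order described in the text, with the $r_{2}\times d$ matrix applied first; the parameter-free LayerNorm, the zero initialization of the up-projection, and all sizes are simplifying assumptions for illustration.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """LayerNorm over the last axis, without learnable scale and shift."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def adapter(h, w_down, w_up):
    """Bottleneck with an internal skip connection: h + up(ReLU(down(h)))."""
    z = np.maximum(h @ w_down.T, 0.0)    # (n, r2): down-projection + ReLU
    return h + z @ w_up.T                # (n, d):  up-projection + skip

rng = np.random.default_rng(0)
n, d, r2 = 4, 16, 2                      # illustrative sizes with r2 << d
h = rng.normal(size=(n, d))
w_down = rng.normal(scale=0.1, size=(r2, d))
w_up = np.zeros((d, r2))                 # zero init: the adapter starts as identity

# Residual addition followed by LayerNorm, as written in the text.
h_prime = layer_norm(h + adapter(h, w_down, w_up))
print(h_prime.shape)  # (4, 16)
```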

These visual PEFT modules are designed to guide the image feature extraction process at different hierarchical levels, enabling the model to focus more on task-relevant semantic regions and thereby generate more discriminative visual features. It is worth noting that, for simplicity, we employ a single shared set of visual PEFT modules across all source–target domain pairs.

III-B Initial Pseudo-Label Generation

Most existing CLIP-based domain adaptation methods generate pseudo-labels by directly applying CLIP’s zero-shot predictions on the target domain. However, this approach fails to leverage the rich semantic knowledge embedded in the labeled source domains, which can provide valuable guidance for more accurate pseudo-label generation.

To address this limitation, we propose a supervised ensemble-based strategy that exploits the source domain annotations. Specifically, for each source domain $\mathcal{D}_{s}^{i}$ where $i\in\{1,2,\ldots,N-1\}$, we fine-tune the CLIP model using the textual prompts and visual PEFT modules described in Section III-A, leveraging only the labeled data from that source. This results in $N-1$ independently trained models. Each model is then used to generate class predictions on the unlabeled target domain samples, producing a set of logits:

$$\{\mathbf{z}_{i}(x_{t})\}_{i=1}^{N-1},\quad\text{for each }x_{t}\in\mathcal{D}_{t}.$$

To aggregate these predictions into a final pseudo-label $\hat{y}_{t}$ for each target sample $x_{t}$, we consider two voting strategies:

  • Average Confidence Voting: Average the softmax outputs across all models and select the most confident class:

    $$\hat{y}_{t}=\arg\max_{k}\ \frac{1}{N-1}\sum_{i=1}^{N-1}\text{softmax}(\mathbf{z}_{i}(x_{t}))_{k}.$$
  • Majority Voting: Take the class that receives the most votes from the $N-1$ models:

    $$\hat{y}_{t}=\text{mode}\left(\{\arg\max_{k}\ \mathbf{z}_{i}(x_{t})_{k}\}_{i=1}^{N-1}\right).$$
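
The two voting rules can be sketched in NumPy as follows. The function names are hypothetical, and the tie-breaking rule in majority voting (smallest class index wins) is an assumption, since the text does not specify one. The toy example is chosen so that the two rules disagree: one very confident model outweighs two mildly confident ones under averaging, but not under majority voting.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def average_confidence_vote(logits):
    """logits: (N-1, n_target, K). Average the softmax outputs, then argmax over classes."""
    return softmax(logits).mean(axis=0).argmax(axis=-1)

def majority_vote(logits):
    """Most frequent per-model argmax; ties go to the smallest class index (assumption)."""
    votes = logits.argmax(axis=-1)          # (N-1, n_target)
    K = logits.shape[-1]
    counts = np.stack([(votes == k).sum(axis=0) for k in range(K)], axis=-1)
    return counts.argmax(axis=-1)

# Three source models, one target sample, three classes.
logits = np.array([[[4.0, 0.0, 0.0]],
                   [[0.0, 1.0, 0.0]],
                   [[0.0, 1.0, 0.0]]])
print(average_confidence_vote(logits))  # [0]
print(majority_vote(logits))            # [1]
```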

The voting strategy is selected empirically. Importantly, since the source-trained models are prone to overfitting their respective source domains, they are not used in the subsequent adaptation process. That is, this training step is performed solely to provide a more informed and stable initialization of pseudo-labels for the target domain.

III-C Curriculum-Oriented Data Grouping

Traditional approaches in unsupervised domain adaptation often align source and target domains using the entire dataset at once. However, this practice can be suboptimal due to the presence of noisy or hard-to-classify target samples, particularly when pseudo-labels are involved. Noisy labels may introduce confirmation bias and hinder model convergence. To address this, we propose a curriculum-style strategy that partitions the data into subsets and performs alignment progressively, starting from easier samples and moving toward more difficult ones.

We begin by generating pseudo-labels $\{\hat{y}_{t}\}_{t=1}^{|\mathcal{D}_{t}|}$ for the unlabeled target samples $\mathcal{D}_{t}=\{x_{t}^{(1)},x_{t}^{(2)},\ldots,x_{t}^{(n)}\}$ using the method described in Section III-B. For each class $k\in\{1,2,\ldots,K\}$, we compute its class-wise feature centroid $\mathbf{c}_{k}$ as the average of CLIP visual embeddings $\phi(x)$ for target samples assigned to class $k$, where $\phi(\cdot)$ denotes the CLIP vision encoder:

$$\mathbf{c}_{k}=\frac{1}{|\mathcal{X}_{k}|}\sum_{x\in\mathcal{X}_{k}}\phi(x),\quad\mathcal{X}_{k}=\{x\in\mathcal{D}_{t}\mid\hat{y}_{t}=k\}.$$
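
Given precomputed visual embeddings and pseudo-labels, the centroid computation is a per-class mean. The sketch below assumes the embeddings are already extracted into an array; marking classes with no assigned samples as NaN is an illustrative choice, since the text does not discuss empty classes.

```python
import numpy as np

def class_centroids(features, pseudo_labels, num_classes):
    """Mean feature per pseudo-class. features: (n, dim); pseudo_labels: (n,) ints."""
    dim = features.shape[1]
    centroids = np.full((num_classes, dim), np.nan)   # NaN marks an empty class
    for k in range(num_classes):
        members = features[pseudo_labels == k]
        if len(members) > 0:
            centroids[k] = members.mean(axis=0)
    return centroids

feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
cents = class_centroids(feats, labels, num_classes=3)
print(cents)
```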

Next, we apply balanced K-Means clustering to the set of class centroids $\{\mathbf{c}_{1},\mathbf{c}_{2},\ldots,\mathbf{c}_{K}\}$ to partition them into $T$ clusters $\{\mathcal{C}_{1},\ldots,\mathcal{C}_{T}\}$. Unlike standard K-Means, the balanced variant enforces that each cluster contains approximately the same number of class centroids. This is achieved by initializing cluster centers using standard K-Means and then reassigning centroids while enforcing the size constraint. If a centroid's nearest cluster has already reached the target capacity, it is assigned to the next-nearest cluster that still has space. The cluster centers are updated after each full assignment round, and the process is repeated until convergence. The process is presented in Algorithm 1.

Once the class centroids are grouped, we assess the difficulty of each cluster. For each class $k$ in a cluster $\mathcal{C}_{t}$, we compute the cosine similarity between the visual embedding $\phi(x)$ of every target sample $x\in\mathcal{X}_{k}$ and the textual embedding $\psi(t_{k})$ of that class, where $\psi(\cdot)$ denotes the CLIP text encoder. The average similarity score across all classes in a cluster serves as a proxy for how confidently CLIP aligns images and text for those classes:

$$s_{t}=\frac{1}{|\mathcal{C}_{t}|}\sum_{k\in\mathcal{C}_{t}}\frac{1}{|\mathcal{X}_{k}|}\sum_{x\in\mathcal{X}_{k}}\cos(\phi(x),\psi(t_{k})).$$

We then sort the clusters $\{\mathcal{C}_{1},\ldots,\mathcal{C}_{T}\}$ in descending order of $s_{t}$ to form a progressive curriculum from easy to hard, since a higher average similarity score indicates that CLIP is already more confident about the alignment of the classes in that cluster.
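The scoring and ordering step can be sketched as follows. The helper names (`cosine`, `cluster_score`) and the toy embeddings are illustrative assumptions; the score mirrors the definition of $s_{t}$, a mean over classes of the per-class mean image-text cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cluster_score(cluster, samples_by_class, text_emb):
    """s_t: mean over classes in the cluster of the mean image-text similarity."""
    per_class = []
    for k in cluster:
        sims = [cosine(x, text_emb[k]) for x in samples_by_class[k]]
        per_class.append(sum(sims) / len(sims))
    return sum(per_class) / len(per_class)

# Toy stand-ins: text_emb[k] plays psi(t_k), samples_by_class[k] plays X_k.
text_emb = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
samples_by_class = {
    0: [[1.0, 0.0], [2.0, 0.0]],  # both perfectly aligned with class 0
    1: [[0.0, 3.0]],              # perfectly aligned with class 1
    2: [[0.0, 1.0], [1.0, 0.0]],  # one orthogonal, one aligned
}
clusters = [[2], [0, 1]]
scores = [cluster_score(c, samples_by_class, text_emb) for c in clusters]
# Easy-to-hard curriculum: cluster indices sorted by descending score.
curriculum = sorted(range(len(clusters)), key=lambda t: scores[t], reverse=True)
print(scores, curriculum)
```

Here the second cluster scores higher than the first, so it would be visited first in the curriculum.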

Algorithm 1 Balanced K-Means Clustering
Input: Feature set $\mathcal{C}=\{\mathbf{c}_{1},\ldots,\mathbf{c}_{K}\}$, cluster number $T$
Output: Cluster assignments $\{\mathcal{C}_{1},\ldots,\mathcal{C}_{T}\}$
Initialize centroids $\{\mu_{1},\ldots,\mu_{T}\}$ using K-Means
Set cluster size budget $B\leftarrow\lceil K/T\rceil$
repeat
  Initialize empty clusters: $\mathcal{C}_{1}\leftarrow\emptyset,\ldots,\mathcal{C}_{T}\leftarrow\emptyset$
  for all $\mathbf{c}_{k}\in\mathcal{C}$ do
    Compute distances $d_{j}=\|\mathbf{c}_{k}-\mu_{j}\|_{2}$
    Sort clusters by increasing $d_{j}$
    for all cluster $j$ in sorted order do
      if $|\mathcal{C}_{j}|<B$ then
        Assign $\mathbf{c}_{k}$ to $\mathcal{C}_{j}$
        break
      end if
    end for
  end for
  for $j=1$ to $T$ do
    Update $\mu_{j}\leftarrow\frac{1}{|\mathcal{C}_{j}|}\sum_{\mathbf{c}\in\mathcal{C}_{j}}\mathbf{c}$
  end for
until convergence
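A minimal Python sketch of the procedure in Algorithm 1 is given below. Several details are assumptions rather than part of the algorithm as stated: the initial centers are passed in as `init_centers` (the standard K-Means initialization is not reproduced), convergence is taken to mean that the assignments stop changing, `max_iter` is a safety cap, and an empty cluster keeps its previous center.

```python
import math

def balanced_kmeans(points, init_centers, max_iter=100):
    """Capacity-constrained assignment following the structure of Algorithm 1.

    points: list of K feature vectors (the class centroids).
    init_centers: list of T starting centers (assumed to come from a
        standard K-Means run, which is not reproduced here).
    Returns a list of T lists holding the point indices of each cluster.
    """
    K, T = len(points), len(init_centers)
    dim = len(points[0])
    budget = math.ceil(K / T)  # B = ceil(K / T)
    centers = [list(c) for c in init_centers]
    previous = None
    for _ in range(max_iter):
        clusters = [[] for _ in range(T)]
        for idx, p in enumerate(points):
            # Visit clusters from nearest to farthest; take the first with room.
            order = sorted(range(T), key=lambda j: math.dist(p, centers[j]))
            for j in order:
                if len(clusters[j]) < budget:
                    clusters[j].append(idx)
                    break
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old center (assumption)
                centers[j] = [sum(points[i][d] for i in members) / len(members)
                              for d in range(dim)]
        if clusters == previous:  # assignments stopped changing
            break
        previous = clusters
    return clusters

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
clusters = balanced_kmeans(points, init_centers=[[0.0, 0.0], [10.0, 10.0]])
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

In the toy run, the point (1, 1) is closest to the first center, but that cluster has already reached its budget of three, so the point is pushed to the second cluster. Because assignment is greedy, the result depends on the order in which points are visited.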
Figure 3: Example of the Multi-Prompt Alignment phase on the Office-Home dataset. Here, $P_{1}$, $P_{2}$, and $P_{3}$ denote prompts obtained from the Initial Feature Adaptation phase for the Art–Real World, Clipart–Real World, and Product–Real World pairs, respectively. All prompts are projected into a domain-invariant subspace for alignment using an auto-encoder architecture. During this phase, we apply a cross-entropy loss on pseudo-labeled target samples, a reconstruction loss to preserve information between the original and reconstructed prompts, and an alignment loss across all generated prompts.

III-D Cluster Learning Phase

With all components in place, we now describe how each cluster is progressively trained. Following the MPA framework, this process is divided into two steps: an initial feature adaptation step and a combined prompt alignment step.

III-D1 Initial Feature Adaptation

To enable better adaptation to the target domain, we train individual sets of textual prompts for each source–target domain pair. In addition, we jointly optimize a shared visual PEFT module across all domains. To reduce the impact of noisy pseudo-labels, we apply a confidence threshold $\tau$ and include only target samples with predicted class probability greater than $\tau$ during training. Let $P_{i}$ denote the textual prompts for the $i$-th source–target pair, where $i\in\{1,2,\ldots,N-1\}$, and let $\mathcal{V}$ represent the shared visual PEFT module. Given a labeled image $x_{s}$ from a source domain $\mathcal{D}_{s}$ with label $y_{s}$, and an unlabeled image $x_{t}$ from the target domain $\mathcal{D}_{t}$ with pseudo-label $\hat{y}_{t}$, the optimization objective is defined as:

$$\min_{P_{i},\mathcal{V}}\;-\frac{1}{n_{s}}\sum_{x^{s}\sim\mathcal{D}_{s}}\log P(y=y_{s}\mid x^{s};P_{i},\mathcal{V})-\frac{1}{n_{t}}\sum_{x^{t}\sim\mathcal{D}_{t}}\log P(y=\hat{y}_{t}\mid x^{t};P_{i},\mathcal{V})\tag{1}$$
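As an illustration of the objective in Eq. (1), the sketch below evaluates the two negative log-likelihood terms on precomputed class probabilities. The function name `adaptation_loss` and the data layout are hypothetical, and the code assumes that $n_{t}$ counts only the target samples that pass the threshold $\tau$, which is one possible reading.

```python
import math

def adaptation_loss(source, target, tau):
    """Two-term negative log-likelihood in the spirit of Eq. (1).

    source: list of (probs, label) pairs for labeled source images.
    target: list of (probs, pseudo_label, confidence) triples.
    tau: target samples with confidence <= tau are skipped.
    """
    src_term = -sum(math.log(p[y]) for p, y in source) / len(source)
    kept = [(p, y) for p, y, conf in target if conf > tau]
    # Assumption: n_t counts only the retained target samples.
    tgt_term = -sum(math.log(p[y]) for p, y in kept) / len(kept) if kept else 0.0
    return src_term + tgt_term

source = [([0.5, 0.5], 0)]
target = [([0.25, 0.75], 1, 0.9),  # confident pseudo-label, kept
          ([0.9, 0.1], 0, 0.3)]    # below tau, ignored
loss = adaptation_loss(source, target, tau=0.6)
print(round(loss, 4))
```

In the toy call, only the first target sample contributes, so the loss equals $\log 2+\log(4/3)$.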

Let $\mathcal{T}$ denote the temperature. The class probability for an image $x^{d}$ (where $d\in\{s,t\}$) in Eq. (1) is computed as a softmax over the temperature-scaled cosine similarities between the visual and textual embeddings:

$$P(y=k\mid x^{d})=\frac{\exp(\langle\psi(t^{d}_{k}),\phi(x^{d})\rangle/\mathcal{T})}{\sum_{d^{\prime}\in\{s,t\}}\sum_{i=1}^{K}\exp(\langle\psi(t^{d^{\prime}}_{i}),\phi(x^{d})\rangle/\mathcal{T})}\tag{2}$$
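A small sketch of Eq. (2) follows. Note that the denominator runs over both the source and target prompt sets, so the softmax is taken over $2K$ similarities. The function names, the dictionary layout of `text_embs`, and the temperature value used in the example are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def class_probability(image_emb, text_embs, d, k, temperature):
    """P(y = k | x^d) as in Eq. (2).

    text_embs[d_prime][i] stands for psi(t_i^{d'}), with d_prime in {"s", "t"}.
    """
    logits = {(dp, i): cosine(emb, image_emb) / temperature
              for dp, embs in text_embs.items() for i, emb in enumerate(embs)}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {key: math.exp(v - top) for key, v in logits.items()}
    return exps[(d, k)] / sum(exps.values())

# Toy embeddings for K = 2 classes; the temperature is a placeholder value.
text_embs = {"s": [[1.0, 0.0], [0.0, 1.0]], "t": [[0.8, 0.6], [0.6, -0.8]]}
image = [0.8, 0.6]
probs = {(d, k): class_probability(image, text_embs, d, k, temperature=0.1)
         for d in ("s", "t") for k in range(2)}
print(probs)
```

The image embedding coincides with the target-domain prompt for class 0, so that entry receives the largest probability, and the $2K$ probabilities sum to one.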

III-D2 Multi-Prompt Alignment

The above training yields a distinct textual prompt $P_{i}$ for each source–target pair. To further consolidate the learned knowledge and eliminate redundancy, we perform multi-prompt alignment. Although directly using the learned prompts can yield decent results, the high-dimensional nature of prompt vectors may encode redundant or domain-specific noise [53]. Inspired by the manifold hypothesis [54], we employ a denoising autoencoder to extract compact, domain-invariant representations of the prompts.

Specifically, let $\textbf{Proj}(\cdot)$ be a projection function implemented as a one-layer feed-forward network, and let $\textbf{Proj}_{b}(\cdot)$ be a two-layer non-linear decoder. Each learned prompt $P_{i}$ is projected into a latent space:

$$\textbf{Proj}(P_{i})=\tilde{P}_{i}=W_{1}P_{i}+b_{1}\tag{3}$$

and then reconstructed back to $\hat{P}_{i}$ via:

$$\textbf{Proj}_{b}(\tilde{P}_{i})=W_{3}(\tanh(W_{2}\tilde{P}_{i}+b_{2}))+b_{3}\tag{4}$$

The reconstruction objective is:

$$\mathcal{L}_{AE}=\frac{1}{N-1}\sum_{i=1}^{N-1}\left\|\hat{P}_{i}-P_{i}\right\|_{2}^{2}\tag{5}$$
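The projection, reconstruction, and loss in Eqs. (3)–(5) can be sketched in plain Python. For simplicity each prompt is treated as a single flat vector, which is an assumption; the helper names and toy dimensions are likewise illustrative.

```python
import math
import random

def affine(W, b, x):
    """Return W x + b for a row-major weight matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def encode(P, W1, b1):
    """Eq. (3): linear projection into the latent space."""
    return affine(W1, b1, P)

def decode(P_tilde, W2, b2, W3, b3):
    """Eq. (4): two-layer decoder with a tanh non-linearity."""
    hidden = [math.tanh(h) for h in affine(W2, b2, P_tilde)]
    return affine(W3, b3, hidden)

def reconstruction_loss(prompts, params):
    """Eq. (5): mean squared L2 reconstruction error over all prompts."""
    W1, b1, W2, b2, W3, b3 = params
    total = 0.0
    for P in prompts:
        P_hat = decode(encode(P, W1, b1), W2, b2, W3, b3)
        total += sum((a - b) ** 2 for a, b in zip(P_hat, P))
    return total / len(prompts)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, e1, e2 = 8, 3, 5  # toy sizes: prompt dim, latent dim, decoder hidden dim
params = (random_matrix(e1, d, rng), [0.0] * e1,
          random_matrix(e2, e1, rng), [0.0] * e2,
          random_matrix(d, e2, rng), [0.0] * d)
prompts = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
loss_ae = reconstruction_loss(prompts, params)
print(loss_ae)
```

The sketch only evaluates the loss; in training, the weights would be updated by gradient descent alongside the prompts.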

While this has proven to be an effective alignment strategy [55], differences in data volume and domain gap across sources may still result in inconsistent prompt behavior. To promote deeper alignment among prompts, we introduce an $\mathcal{L}_{1}$ consistency loss based on predictions over the same target sample $x_{t}$:

$$\mathcal{L}_{1}=\frac{2}{(N-1)(N-2)}\sum_{j=1}^{N-2}\sum_{i=j+1}^{N-1}\left|P(y=k^{t}\mid x_{t};P_{i})-P(y=k^{t}\mid x_{t};P_{j})\right|\tag{6}$$
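The normalizing factor in Eq. (6) is the reciprocal of the number of unordered prompt pairs, $\binom{N-1}{2}=\frac{(N-1)(N-2)}{2}$, so the loss is a mean absolute gap over pairs. A minimal sketch follows, reading $k^{t}$ as a single class index for the sample $x_{t}$, which is an assumption; the function name is hypothetical.

```python
from itertools import combinations

def l1_consistency(probs_per_prompt, k):
    """Mean absolute gap in P(y = k | x_t; P_i) over all unordered prompt pairs.

    probs_per_prompt[i] is the class-probability vector produced with prompt P_i.
    Requires at least two prompts, i.e. at least two source domains.
    """
    pairs = list(combinations(range(len(probs_per_prompt)), 2))
    gap = sum(abs(probs_per_prompt[i][k] - probs_per_prompt[j][k]) for j, i in pairs)
    return gap / len(pairs)

# Three prompts (N - 1 = 3) scoring the same target sample.
probs_per_prompt = [[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]]
value = l1_consistency(probs_per_prompt, k=0)
print(value)
```

For the toy inputs the three pairwise gaps are 0.2, 0.2, and 0.4, giving a mean of about 0.267; identical predictions across prompts give zero.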

Putting these together, the total training loss in this step is:

$$\mathcal{L}=\mathcal{L}_{CLS}+\mathcal{L}_{AE}+\alpha\mathcal{L}_{1}\tag{7}$$

where $\mathcal{L}_{CLS}$ is a regular cross-entropy loss, and $\alpha$ is a hyperparameter controlling the weight of the alignment term $\mathcal{L}_{1}$. During this alignment stage, the visual PEFT module $\mathcal{V}$ is kept fixed; only the textual prompts and autoencoder parameters are updated. An overview of the multi-prompt learning and alignment procedure is illustrated in Fig. 3.

III-E Refine and Rehearse Phase

After completing the training on each curriculum stage (i.e., cluster), we refine the pseudo-labels for that stage using the updated model. Specifically, we re-evaluate all target samples in the current cluster and regenerate their pseudo-labels based on the latest model predictions. To ensure reliability and improve computational efficiency, we apply a confidence threshold $\beta$ and retain only the most confident samples, namely those with predicted class probabilities exceeding $\beta$.

These high-confidence samples serve as anchor points and are carried forward to the training set of the next cluster. This mechanism enables a “refine and rehearse” strategy: refined pseudo-labels from previously learned easier clusters are used to stabilize training on subsequent harder clusters. This gradual accumulation of trusted samples helps mitigate noise, reinforce previously learned knowledge, prevent catastrophic forgetting, and promote smoother convergence. The refinement and rehearsal process continues iteratively across all $T$ clusters until the entire target dataset has been covered.
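The control flow of the refine-and-rehearse cycle can be sketched as below. The callables `train_step` and `predict_proba` are hypothetical placeholders for the cluster learning phase and the current model, and the choice to accumulate anchors across all earlier stages is one reading of the description above.

```python
def refine(samples, predict_proba, beta):
    """Relabel samples with the current model; keep those with max probability > beta."""
    anchors = []
    for x in samples:
        probs = predict_proba(x)
        k = max(range(len(probs)), key=probs.__getitem__)
        if probs[k] > beta:
            anchors.append((x, k))
    return anchors

def progressive_training(curriculum, train_step, predict_proba, beta=0.8):
    """curriculum: list of per-stage sample lists, ordered from easy to hard."""
    rehearsal = []  # trusted (sample, label) pairs carried forward
    for stage_samples in curriculum:
        train_step(stage_samples, rehearsal)  # learn on the stage plus anchors
        rehearsal = rehearsal + refine(stage_samples, predict_proba, beta)
    return rehearsal

# Toy stand-ins: a "sample" is a number x and the model outputs [x, 1 - x].
calls = []
def toy_train_step(stage_samples, rehearsal):
    calls.append((len(stage_samples), len(rehearsal)))

toy_predict = lambda x: [x, 1 - x]
anchors = progressive_training([[0.95, 0.6], [0.1, 0.85]], toy_train_step, toy_predict)
print(anchors)  # [(0.95, 0), (0.1, 1), (0.85, 0)]
```

In the toy run, the sample 0.6 is dropped because its top probability does not exceed 0.8, and the second stage is trained with one anchor carried over from the first.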

Finally, after training on all clusters is complete, we obtain the final model. During inference, we aggregate the predictions from all domain-specific prompt sets and use the averaged output as the final prediction for each test sample.
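The inference-time aggregation amounts to averaging the per-prompt outputs and taking the arg-max. A minimal sketch follows, assuming the averaged quantities are class probabilities (the text does not state whether probabilities or logits are averaged); the function name is hypothetical.

```python
def ensemble_predict(probs_per_prompt):
    """Average class probabilities across prompt sets and return (label, averaged probs)."""
    n = len(probs_per_prompt)
    avg = [sum(col) / n for col in zip(*probs_per_prompt)]
    label = max(range(len(avg)), key=avg.__getitem__)
    return label, avg

# Predictions for one test sample from three domain-specific prompt sets.
label, avg = ensemble_predict([[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]])
print(label, avg)
```

Here the first prompt set alone would predict class 0, but the averaged output favors class 1.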

TABLE I: Accuracy (%) on ImageCLEF and Office-Home. We re-implement all compared approaches using the CLIP ViT-B/16 backbone for a fair comparison.

| Method | Venue | ImageCLEF → C | ImageCLEF → I | ImageCLEF → P | ImageCLEF Avg | Office-Home → Ar | Office-Home → Cl | Office-Home → Pr | Office-Home → Rw | Office-Home Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-Shot | | | | | | | | | | |
| CLIP[15] | ICML’21 | 97.0 | 96.8 | 82.7 | 92.2 | 83.4 | 65.3 | 89.0 | 89.4 | 81.8 |
| Non-CLIP Based | | | | | | | | | | |
| ICON[56] | NeurIPS’23 | 97.8 | 97.4 | 83.2 | 92.8 | 88.2 | 81.1 | 93.5 | 93.9 | 89.2 |
| CSR[40] | AAAI’24 | 98.3 | 95.5 | 82.0 | 91.9 | 86.2 | 79.2 | 92.4 | 91.3 | 87.3 |
| CLIP Based | | | | | | | | | | |
| AD-CLIP[35] | ICCV’23 | 97.8 | 96.7 | 82.3 | 92.3 | 84.2 | 72.6 | 92.9 | 91.2 | 85.2 |
| MPA[17] | NeurIPS’23 | 97.8 | 97.4 | 81.2 | 92.1 | 85.1 | 76.6 | 91.3 | 91.2 | 86.1 |
| DAPL[16] | TNNLS’23 | 97.8 | 97.2 | 83.5 | 92.8 | 84.3 | 72.4 | 92.6 | 91.4 | 85.2 |
| PDA[25] | AAAI’24 | 97.7 | 97.3 | 81.3 | 92.1 | 87.3 | 73.9 | 92.2 | 92.4 | 86.5 |
| PGA[44] | NeurIPS’24 | 97.4 | 96.5 | 82.5 | 92.1 | 83.6 | 76.9 | 92.9 | 91.2 | 86.2 |
| DAMP[18] | CVPR’24 | 98.5 | 97.7 | 81.8 | 92.7 | 87.3 | 77.3 | 94.1 | 92.7 | 87.8 |
| MP$^2$A-Adapter | - | 97.7 | 92.3 | 78.2 | 89.4 | 90.3 | 83.2 | 92.9 | 89.1 | 88.9 |
| MP$^2$A-Prompt | - | 97.8 | 97.1 | 83.7 | 92.9 | 90.2 | 80.6 | 94.8 | 92.5 | 89.5 |
| MP$^2$A-LoRA | - | 98.8 | 99.0 | 85.0 | 94.3 | 93.6 | 83.2 | 94.8 | 95.6 | 91.8 |

IV Experiments

IV-A Experimental Setup

To comprehensively assess the effectiveness of the proposed MP$^2$A method, we perform experiments on three widely used benchmark datasets in the field of MS-UDA: ImageCLEF, Office-Home, and DomainNet. ImageCLEF is a relatively small dataset comprising 1,800 images from 12 object categories across three distinct domains: ImageNet ILSVRC 2012 (I), Pascal VOC 2012 (P), and Caltech-256 (C). Office-Home represents a medium-scale dataset containing approximately 15,500 images from 65 categories, drawn from four diverse domains: Art, Clipart, Product, and Real-World. In contrast, DomainNet is currently the largest and most challenging UDA benchmark, encompassing around 600,000 images spanning 345 categories over six heterogeneous domains: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch.

We use top-1 accuracy as our evaluation metric and compare our method against three groups of representative baselines. (1) Zero-shot CLIP inference. (2) CLIP-based UDA approaches, including DAMP[18], DAPL[16], MPA[17], PDA[25], PGA[44], and AD-CLIP[35]. Among these, PDA, DAMP, and DAPL were originally designed for single-source UDA; to adapt them to the multi-source setting, we follow the common practice of merging all source domains into a single source domain before applying these methods. The remaining approaches were developed specifically for multi-source UDA, but they rely on ResNet-based CLIP image encoders. To ensure a fair comparison, we re-implement all methods using the pre-trained ViT-B/16 image encoder from CLIP. (3) Non-CLIP-based UDA methods, including ICON[56] and CSR[40]. For consistency, we also replace their original image encoders with the CLIP ViT-B/16 backbone.

TABLE II: Accuracy (%) on DomainNet. We re-implement all compared approaches using the CLIP ViT-B/16 backbone for a fair comparison.

| Method | Venue | → Clp | → Inf | → Pnt | → Qdr | → Rel | → Skt | Avg |
|---|---|---|---|---|---|---|---|---|
| Zero-Shot | | | | | | | | |
| CLIP[15] | ICML’21 | 69.4 | 45.8 | 64.3 | 13.7 | 82.7 | 62.3 | 56.4 |
| Non-CLIP Based | | | | | | | | |
| ICON[56] | NeurIPS’23 | 76.2 | 33.6 | 76.1 | 12.1 | 84.1 | 63.9 | 57.7 |
| CSR[40] | AAAI’24 | 80.1 | 36.2 | 68.4 | 38.0 | 80.4 | 68.0 | 61.9 |
| CLIP Based | | | | | | | | |
| AD-CLIP[35] | ICCV’23 | 73.7 | 11.5 | 66.4 | 11.5 | 84.8 | 66.3 | 52.4 |
| MPA[17] | NeurIPS’23 | 75.4 | 49.2 | 69.4 | 15.5 | 83.7 | 65.3 | 59.8 |
| DAPL[16] | TNNLS’23 | 75.2 | 49.4 | 68.8 | 0.3 | 83.7 | 67.6 | 57.5 |
| PDA[25] | AAAI’24 | 76.1 | 50.0 | 70.6 | 17.0 | 84.1 | 67.1 | 60.8 |
| PGA[44] | NeurIPS’24 | 75.1 | 51.1 | 69.4 | 14.5 | 84.5 | 67.4 | 60.3 |
| DAMP[18] | CVPR’24 | 75.7 | 54.1 | 71.7 | 20.1 | 84.9 | 68.2 | 62.5 |
| MP$^2$A-Prompt | - | 78.2 | 55.1 | 71.4 | 17.9 | 86.1 | 70.0 | 63.1 |
| MP$^2$A-LoRA | - | 80.2 | 54.0 | 72.6 | 18.8 | 85.7 | 71.7 | 63.8 |
| MP$^2$A-Adapter | - | 81.0 | 52.8 | 73.2 | 19.3 | 85.7 | 72.4 | 64.1 |

IV-B Implementation Details

We adopt CLIP ViT-B/16 as the backbone architecture for all methods. The textual prompts, visual PEFT modules, and autoencoder components are optimized using the Adam optimizer with a cosine annealing scheduler. For the autoencoder, the layers are configured as follows: the input projection layer $W_{1}$ produces an embedding of dimension $e_{1}$, which is 150 for ImageCLEF and Office-Home and 300 for DomainNet, while $W_{2}$ and $W_{3}$ generate intermediate and final embeddings of dimensions $e_{2}=384$ and $e_{3}=512$, respectively.

The confidence threshold $\beta$ for the refine-and-rehearse phase is fixed at 0.8 for all datasets. The pseudo-label confidence threshold $\tau$ used during initial prompt learning is dataset-specific: we choose $\tau=0.4$ for DomainNet and $\tau=0.6$ for ImageCLEF and Office-Home. For initial pseudo-label generation, we adopt an average-confidence voting strategy for ImageCLEF and Office-Home, and majority voting for DomainNet due to its larger number of domains. The number of curriculum clusters is set to $T=3$ for all datasets. Additional hyperparameter settings are detailed in the subsequent sections.
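For reference, the settings stated above can be collected into a single configuration mapping. The key names and the dictionary layout are hypothetical; only the values are taken from the text.

```python
# Shared settings (values from the text; key names are illustrative).
COMMON = {
    "backbone": "CLIP ViT-B/16",
    "optimizer": "Adam",
    "scheduler": "cosine annealing",
    "e2": 384,             # intermediate decoder dimension (W_2)
    "e3": 512,             # final decoder dimension (W_3)
    "beta": 0.8,           # refine-and-rehearse confidence threshold
    "num_clusters": 3,     # T
}

# Dataset-specific settings.
PER_DATASET = {
    "ImageCLEF":   {"e1": 150, "tau": 0.6, "voting": "average-confidence"},
    "Office-Home": {"e1": 150, "tau": 0.6, "voting": "average-confidence"},
    "DomainNet":   {"e1": 300, "tau": 0.4, "voting": "majority"},
}

def config_for(dataset):
    """Merge the shared and dataset-specific settings into one dictionary."""
    return {**COMMON, **PER_DATASET[dataset]}

print(config_for("DomainNet"))
```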

IV-C Comparison to State-of-the-Art

We present a comprehensive evaluation of our proposed method, MP$^2$A, on three standard domain adaptation benchmarks. The performance on ImageCLEF and Office-Home is detailed in Table I, while the results for the more challenging DomainNet are in Table II. For all experiments, we report the performance of MP$^2$A integrated with three distinct PEFT modules: Visual Prompts, LoRA, and Adapters, as discussed in Section III-A.

On the ImageCLEF dataset, MP$^2$A with LoRA sets a new state of the art, achieving an average accuracy of 94.3%. This is a 1.5-point improvement over the next-best methods, ICON and DAPL (both 92.8%). It attains the highest accuracy on every target domain: 98.8% on domain C, 99.0% on domain I, and 85.0% on domain P. Given the small scale of ImageCLEF and the inherent strength of the CLIP backbone, these gains are particularly noteworthy. The MP$^2$A-Prompt variant also narrowly surpasses existing methods, with an average accuracy of 92.9%. In contrast, MP$^2$A-Adapter shows a marked performance degradation on this dataset, with its average accuracy dropping to 89.4%, suggesting potential optimization challenges on smaller-scale benchmarks.

Moving to the more complex Office-Home dataset, MP$^2$A-LoRA again performs best, achieving an average accuracy of 91.8%. This result outperforms all prior approaches by a margin of at least 2.6 points. The method attains the highest accuracy on all four target domains, with 93.6% on Art, 83.2% on Clipart (matched by MP$^2$A-Adapter), 94.8% on Product (matched by MP$^2$A-Prompt), and 95.6% on Real World. These consistent gains underscore the robustness of our progressive alignment strategy in learning domain-invariant features across diverse visual styles. Following a similar trend to ImageCLEF, MP$^2$A-Prompt also delivers strong results, outperforming the next-best method, ICON, by 0.3 points. Notably, while MP$^2$A-Adapter still does not achieve the top performance, its accuracy of 88.9% is competitive, falling short only of ICON while outperforming all other baselines.

Finally, on the most challenging benchmark, DomainNet, we observe a notable shift in the efficacy of the PEFT modules. Here, MP$^2$A-Adapter achieves the highest overall accuracy at 64.1%, surpassing all previous state-of-the-art methods. It improves upon DAMP, a strong CLIP-based baseline, by 1.6 points and outperforms zero-shot CLIP by 7.7 points (64.1% vs. 56.4%). This result suggests that the Adapter module's architectural complexity may require more data for effective optimization, which is consistent with its weaker performance on the smaller ImageCLEF dataset and its better performance on the mid-scale Office-Home. While not the top performer in this instance, MP$^2$A-LoRA still demonstrates strong results with an accuracy of 63.8%, outperforming DAMP by 1.3 points. Furthermore, MP$^2$A-Prompt also surpasses other state-of-the-art methods with an average accuracy of 63.1%. The fact that all three PEFT variants of our method outperform existing approaches in average accuracy on this difficult benchmark highlights the effectiveness of our proposed method in handling severe domain shifts, noisy pseudo-labels, and the inherent complexity of large-scale, multi-source adaptation.

Figure 4: Empirical analysis on the impact of the progressive alignment strategy using (a) Visual Prompt, (b) LoRA, and (c) Adapter.
TABLE III: Ablation Study on Cluster Center.

| Cluster Number | Visual Prompt → Ar | Visual Prompt → Cl | Visual Prompt → Pr | Visual Prompt → Rw | Visual Prompt Avg | LoRA → Ar | LoRA → Cl | LoRA → Pr | LoRA → Rw | LoRA Avg | Adapter → Ar | Adapter → Cl | Adapter → Pr | Adapter → Rw | Adapter Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $T=2$ | 89.3 | 79.8 | 94.0 | 93.2 | 88.9 | 93.4 | 83.0 | 94.4 | 94.8 | 91.4 | 90.6 | 82.4 | 92.2 | 89.5 | 88.7 |
| $T=5$ | 90.1 | 79.0 | 94.8 | 92.2 | 89.0 | 92.9 | 83.5 | 94.2 | 95.4 | 91.5 | 90.4 | 80.6 | 92.2 | 88.9 | 88.0 |
| $T=10$ | 89.4 | 78.7 | 94.0 | 92.2 | 88.6 | 92.0 | 82.7 | 94.1 | 94.7 | 90.9 | 88.3 | 81.7 | 93.0 | 89.4 | 88.1 |
| $T=3$ | 90.2 | 80.6 | 94.8 | 92.5 | 89.5 | 93.4 | 83.6 | 94.8 | 95.6 | 91.9 | 90.3 | 83.2 | 92.9 | 89.5 | 88.9 |
Figure 5: t-SNE visualization of features extracted by the visual encoder using Adapter on the Office-Home dataset, with Art as the source domain and Real World as the target domain. (a) Without Progressive Alignment. (b) With Progressive Alignment.

IV-D Analysis on Progressive Alignment

To rigorously evaluate the efficacy of our progressive alignment strategy, we conduct a detailed empirical analysis on the Office-Home dataset. This analysis first systematically compares the performance of all three PEFT modules with and without the progressive alignment component. The results are shown in Figure 4. As observed, incorporating progressive alignment consistently leads to performance improvements across all PEFT modules, demonstrating the broad applicability of the proposed strategy.

Specifically, we observe average performance gains of 1.4% for Visual Prompt, 0.9% for LoRA, and 2.1% for Adapter, respectively. These consistent gains show the value of the proposed progressive alignment. By guiding the model through a structured, easy-to-hard curriculum, our strategy effectively mitigates the impact of noisy pseudo-labels, particularly during the critical early stages of training, thereby enhancing the model’s robustness.

A closer inspection of the domain-wise performance offers further insight into the mechanism of our approach. For the Real World domain, which more closely resembles the source domains visually, the performance improvement is minimal, averaging only 0.2%. In contrast, the Clipart domain, characterized by its abstract and stylized visual properties, exhibits a substantial average improvement of 2.4%. This gain suggests that our progressive alignment strategy is particularly advantageous for target domains with a large distributional gap from the source, as it bridges this gap by starting with easier, more transferable features before progressing to harder, more domain-specific ones.

To qualitatively illustrate the effect of our progressive alignment strategy, we visualize the feature space alignment between the source and target domains. Using the t-SNE algorithm, we plot the features extracted by the visual encoder for the Office-Home dataset, with the Art domain serving as the source and Real World as the target. As shown in Figure 5, a baseline model without our strategy exhibits a clear separation between the source and target feature clusters. In contrast, after applying our proposed method, the features from both domains become closely aligned. This significant increase in alignment demonstrates that our strategy successfully bridges the domain gap, confirming its effectiveness in facilitating the learning of robust, domain-invariant representations.

Figure 6: Empirical analysis on the impact of the source pretrain strategy using (a) Visual Prompt, (b) LoRA, and (c) Adapter.
Figure 7: MP$^2$A-Prompt ablation studies. (a) threshold parameter $\tau$ (b) threshold parameter $\beta$ (c) textual prompt length $M_{1}$ and $M_{2}$ (d) visual prompt length $M_{3}$. Panel titles: (a) Ablation study on $\tau$; (b) Ablation study on $\beta$; (c) Ablation study on $M_{1}$ and $M_{2}$; (d) Ablation study on $M_{3}$.
Figure 8: MP$^2$A-LoRA ablation studies. (a) threshold parameter $\beta$ (b) threshold parameter $\tau$ (c) prompt length $M_{1}$ and $M_{2}$ (d) LoRA rank $r_{1}$. Panel titles: (a) Ablation study on $\tau$; (b) Ablation study on $\beta$; (c) Ablation study on $M_{1}$ and $M_{2}$; (d) Ablation study on $r_{1}$.
Figure 9: MP$^2$A-Adapter ablation studies. (a) threshold parameter $\beta$ (b) threshold parameter $\tau$ (c) prompt length $M_{1}$ and $M_{2}$ (d) inner dimension $r_{2}$. Panel titles: (a) Ablation study on $\tau$; (b) Ablation study on $\beta$; (c) Ablation study on $M_{1}$ and $M_{2}$; (d) Ablation study on $r_{2}$.

We also perform an ablation study to analyze the effect of the number of clusters, with the results summarized in Table III. As shown, using three clusters consistently yields the best overall performance across all PEFT modules. However, we note that the exact number of clusters is not a critical hyperparameter, as the performance variations remain within a narrow 1% margin, indicating the robustness of our method to this choice. Interestingly, a general downward trend in performance is observed as the number of clusters increases. This suggests that using too many learning cycles may lead to over-fragmentation of the data, causing the model to overfit to small, isolated subsets of samples.
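To make the role of the cluster count concrete, the following is a minimal sketch of one way to split pseudo-labeled target samples into class-balanced, easy-to-hard stages. It is a simplified stand-in and not the clustering procedure used in our experiments: it assumes that difficulty is ranked by pseudo-label confidence within each pseudo-class, and the function name `easy_to_hard_stages` and its arguments are hypothetical.

```python
import numpy as np


def easy_to_hard_stages(labels, conf, num_stages=3):
    """Assign each sample a stage id (0 = easiest), balanced per pseudo-class.

    Assumption: "difficulty" is approximated by pseudo-label confidence,
    and every pseudo-class is split into `num_stages` near-equal groups.
    """
    labels = np.asarray(labels)
    conf = np.asarray(conf, dtype=float)
    stage = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        # Sort the members of this class from most to least confident.
        order = members[np.argsort(-conf[members], kind="stable")]
        # Split the ranked members into num_stages near-equal chunks.
        for s, chunk in enumerate(np.array_split(order, num_stages)):
            stage[chunk] = s
    return stage


labels = [0, 0, 0, 1, 1, 1]
conf = [0.9, 0.5, 0.7, 0.6, 0.95, 0.8]
stages = easy_to_hard_stages(labels, conf, num_stages=3)
print(stages.tolist())  # [0, 2, 1, 2, 0, 1]
```

Under this sketch, a larger `num_stages` leaves fewer samples per class in each stage, which is consistent with the over-fragmentation effect discussed above.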

Overall, these comprehensive experiments highlight the clear advantage of learning from a structured, easy-to-hard curriculum rather than treating all target samples uniformly. The consistent gains across different modules and datasets demonstrate that our “learn, refine, and rehearse” cycle not only stabilizes the training process but also enhances generalization across diverse and challenging domain shifts.

IV-E Analysis on Source Pretraining

To validate the impact of our source pretraining strategy, we conduct a targeted ablation study comparing the performance of all three PEFT modules with and without this initial step. The results, detailed in Figure 6, reveal that this pretraining is crucial for achieving optimal performance.

As shown, its inclusion consistently yields higher accuracy across all modules, with substantial performance gains of 2.4% for Visual Prompt, 2.3% for LoRA, and 1.7% for Adapter. Unlike the progressive alignment strategy, source pretraining provides a universal performance lift across all domains. This demonstrates that leveraging supervised signals from the source domains enables a more robust initial adaptation and, critically, more reliable target label estimation. The resulting improvement in pseudo-label quality serves as a stronger learning signal during the early stages of training, leading to more stable convergence and superior downstream performance. This confirms that our simple source pretraining strategy provides a valuable foundation for generating high-confidence pseudo-labels, which are crucial for the success of our progressive learning strategy.
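As an illustration of how several source-pretrained models could be combined into a single set of target pseudo-labels, the sketch below averages the per-model softmax probabilities. Averaging is an assumption made here for simplicity, since the combination rule is not restated in this section. The names `softmax` and `ensemble_pseudo_labels` and the array layout (sources, samples, classes) are likewise hypothetical.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def ensemble_pseudo_labels(logits_per_source):
    """Average class probabilities over source-pretrained models.

    logits_per_source: array of shape (S, N, C) with one set of logits
    per source-pretrained model (assumed layout).
    Returns (probs, labels, confidences) for the N target samples.
    """
    logits = np.asarray(logits_per_source, dtype=float)
    probs = softmax(logits, axis=-1).mean(axis=0)
    return probs, probs.argmax(axis=1), probs.max(axis=1)


# Two hypothetical source models, two target samples, two classes.
logits = np.array([
    [[2.0, 0.0], [3.0, 0.0]],  # model pretrained on source 1
    [[0.0, 2.0], [3.0, 0.0]],  # model pretrained on source 2
])
probs, labels, conf = ensemble_pseudo_labels(logits)
print(labels.tolist(), conf.round(3).tolist())
```

In this toy case the two models disagree on the first sample, so its ensembled confidence falls to 0.5, while the second sample, on which they agree, keeps a high confidence. This is the sense in which ensembling can yield a more reliable confidence signal for later threshold-based selection.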

IV-F Hyperparameter Ablation Study

A comprehensive analysis of key hyperparameters is conducted to evaluate the sensitivity and robustness of our framework. The experimental results are summarized in Figure 7 for Visual Prompt, Figure 8 for LoRA, and Figure 9 for Adapter, respectively. We investigate the effect of four key factors: the rehearse threshold $\beta$, the initial pseudo-label selection threshold $\tau$, the textual prompt lengths $M_{1}$ and $M_{2}$, and PEFT-module-specific parameters.

We first analyze the impact of the initial pseudo-label selection threshold $\tau$ on model performance. As shown, the best results for all PEFT modules are obtained at $\tau=0.6$. We observe a clear trend where performance improves as $\tau$ increases from 0.4 to 0.6, followed by a slight decline when $\tau$ reaches 0.7. Notably, while Visual Prompt and LoRA remain relatively stable across different $\tau$ values, we find larger fluctuations for Adapter, suggesting that its optimization is more sensitive to the pseudo-label quality.
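The effect of the selection threshold can be illustrated with a minimal sketch. It assumes that a target sample is kept when its highest predicted class probability is at least the threshold; the exact selection rule is not restated in this section, and the function name `select_confident` is hypothetical.

```python
import numpy as np


def select_confident(probs, tau=0.6):
    """Keep samples whose top class probability is >= tau (assumed rule).

    probs: array of shape (N, C) with per-class probabilities.
    Returns (indices, pseudo_labels, confidences) of the kept samples.
    """
    probs = np.asarray(probs, dtype=float)
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = np.flatnonzero(conf >= tau)
    return keep, labels[keep], conf[keep]


probs = np.array([
    [0.90, 0.05, 0.05],
    [0.50, 0.30, 0.20],
    [0.10, 0.65, 0.25],
    [0.34, 0.33, 0.33],
])
for tau in (0.4, 0.6, 0.7):
    idx, labels, _ = select_confident(probs, tau)
    print(tau, idx.tolist(), labels.tolist())
# 0.4 [0, 1, 2] [0, 0, 1]
# 0.6 [0, 2] [0, 1]
# 0.7 [0] [0]
```

A lower threshold admits more, but noisier, pseudo-labels, whereas a higher one keeps fewer and cleaner samples. This trade-off is consistent with the performance peak at an intermediate value.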

Next, we examine the rehearse threshold $\beta$. Performance remains relatively stable for $\beta$ values of 0.6, 0.7, and 0.8 across all PEFT modules. However, at $\beta=0.9$, performance drops sharply. This is because such a high threshold filters out too many samples for subsequent stages, causing catastrophic forgetting of previously learned knowledge. For smaller $\beta$ values, while more samples are retained, the additional training data does not lead to notable performance improvements but requires greater computational resources. We find that $\beta=0.8$ provides the best trade-off between performance and efficiency. This is particularly important for large-scale datasets such as DomainNet, where the initial $\tau$ is set to 0.4. Our experiments show that this setting improves training efficiency by approximately 30%.
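The filtering behavior described above can be sketched as follows. The sketch assumes that samples from earlier stages are re-labeled with the updated model and retained for rehearsal only if their new confidence is at least the rehearse threshold; the function name `refine_and_rehearse` and the toy probabilities are hypothetical.

```python
import numpy as np


def refine_and_rehearse(prev_idx, probs_updated, beta=0.8):
    """Re-label earlier-stage samples and keep the confident ones.

    prev_idx: indices of samples used in previous stages.
    probs_updated: (N, C) probabilities from the updated model.
    Returns (kept_indices, refined_labels). The ">= beta" rule is an
    assumption made for this sketch.
    """
    prev_idx = np.asarray(prev_idx)
    p = np.asarray(probs_updated, dtype=float)[prev_idx]
    conf = p.max(axis=1)
    labels = p.argmax(axis=1)  # refined pseudo-labels
    mask = conf >= beta
    return prev_idx[mask], labels[mask]


probs_updated = np.array([
    [0.95, 0.05],
    [0.85, 0.15],
    [0.30, 0.70],
    [0.55, 0.45],
    [0.08, 0.92],
])
prev_idx = np.array([0, 1, 2, 4])  # sample 3 was never selected
for beta in (0.6, 0.8, 0.9):
    kept, refined = refine_and_rehearse(prev_idx, probs_updated, beta)
    print(beta, kept.tolist(), refined.tolist())
# 0.6 [0, 1, 2, 4] [0, 0, 1, 1]
# 0.8 [0, 1, 4] [0, 0, 1]
# 0.9 [0, 4] [0, 1]
```

In this toy example, raising the threshold from 0.8 to 0.9 shrinks the rehearsal set further, which mirrors the observation that an overly strict threshold leaves too few earlier samples to rehearse.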

We then study the influence of the textual prompt lengths $M_{1}$ and $M_{2}$, where we set $M_{1}=M_{2}$ for simplicity. The best performance is achieved at $M_{1}=M_{2}=12$. We hypothesize that shorter prompts lack sufficient semantic capacity to guide adaptation effectively, while overly long prompts introduce redundancy or noise, which may hinder training. This result highlights the importance of selecting an optimal prompt length to balance expressiveness and regularization.

Finally, we examine module-specific hyperparameters: visual prompt length $M_{3}$, LoRA rank $r_{1}$, and Adapter inner dimension $r_{2}$. Both $M_{3}$ and $r_{1}$ yield stable performance across tested values, with peak performance at $M_{3}=20$ and $r_{1}=8$. For $r_{2}$, however, we observe that increasing its value tends to decrease overall accuracy, and thus we choose $r_{2}=16$. This finding once again underscores that the Adapter module, while powerful, presents greater optimization challenges and is more sensitive to its hyperparameter configuration.
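For intuition on what the LoRA rank and the Adapter inner dimension control, the sketch below counts trainable parameters per adapted layer using the standard formulations: LoRA adds two low-rank factors to a weight matrix, and a bottleneck adapter adds a down-projection and an up-projection with biases. The hidden width of 768 is an assumption chosen for illustration, as are the function names; the actual per-layer placement in our modules may differ.

```python
def lora_params(d_in, d_out, r):
    """Parameters of one LoRA update B @ A, with A: (r, d_in), B: (d_out, r)."""
    return r * (d_in + d_out)


def adapter_params(d, r, bias=True):
    """Parameters of one bottleneck adapter: d -> r -> d."""
    n = 2 * d * r
    if bias:
        n += r + d  # biases of the down- and up-projection
    return n


d = 768  # assumed hidden width, for illustration only
print(lora_params(d, d, r=8))    # 12288
print(adapter_params(d, r=16))   # 25360
```

Both counts grow linearly in the rank or inner dimension, so larger values add capacity in proportion.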

V Conclusion

In this paper, we presented MP$^{2}$A, a progressive framework for multi-source unsupervised domain adaptation that adapts CLIP through a structured “learn, refine, and rehearse” cycle. By organizing the adaptation process using balanced class clustering and gradually expanding from easy to hard samples, MP$^{2}$A improves training stability and robustness. Specifically, during the learning phase, we train domain-specific textual prompts and a shared visual PEFT module, aligning prompts through a denoising autoencoder with alignment regularization. In the refine-and-rehearse phase, confident samples from previous stages are refined with the updated model and reused to guide training in subsequent stages, enabling efficient knowledge transfer. Furthermore, to enhance pseudo-label quality, we ensemble source-pretrained CLIP models for more reliable target supervision. Experiments on ImageCLEF, Office-Home, and DomainNet show that MP$^{2}$A achieves state-of-the-art performance across all benchmarks. These results highlight the effectiveness of progressive alignment and parameter-efficient adaptation in vision-language models under domain shift.

References

  • [1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.
  • [2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” NeurIPS, 2020.
  • [3] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.
  • [4] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Eds., Dataset Shift in Machine Learning. The MIT Press, 2008.
  • [5] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in CVPR, 2011.
  • [6] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang, “Domain adaptation under target and conditional shift,” in ICML, 2013.
  • [7] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defying forgetting in classification tasks,” TPAMI, 2021.
  • [8] H. Chen, M. Goldblum, Z. Wu, and Y.-G. Jiang, “Adaptive retention & correction: Test-time training for continual learning,” in ICLR, 2025.
  • [9] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister, “Learning to prompt for continual learning,” in CVPR, 2022.
  • [10] S. J. Pan and Q. Yang, “A survey on transfer learning,” TKDE, 2009.
  • [11] G. Csurka, “Domain adaptation for visual applications: A comprehensive survey,” arXiv preprint arXiv:1702.05374, 2017.
  • [12] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, 2018.
  • [13] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin, “Deep cocktail network: Multi-source unsupervised domain adaptation with category shift,” in CVPR, 2018.
  • [14] P. Singhal, R. Walambe, S. Ramanna, and K. Kotecha, “Domain adaptation: challenges, methods, datasets, and applications,” IEEE access, 2023.
  • [15] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  • [16] C. Ge, R. Huang, M. Xie, Z. Lai, S. Song, S. Li, and G. Huang, “Domain adaptation via prompt learning,” TNNLS, 2023.
  • [17] H. Chen, X. Han, Z. Wu, and Y.-G. Jiang, “Multi-prompt alignment for multi-source unsupervised domain adaptation,” NeurIPS, 2023.
  • [18] Z. Du, X. Li, F. Li, K. Lu, L. Zhu, and J. Li, “Domain-agnostic mutual prompting for unsupervised domain adaptation,” in CVPR, 2024.
  • [19] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness, “Pseudo-labeling and confirmation bias in deep semi-supervised learning,” in IJCNN, 2020.
  • [20] K. Mei, C. Zhu, J. Zou, and S. Zhang, “Instance adaptive self-training for unsupervised domain adaptation,” in ECCV, 2020.
  • [21] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves imagenet classification,” in CVPR, 2020.
  • [22] Y. Zou, Z. Yu, B. Kumar, and J. Wang, “Unsupervised domain adaptation for semantic segmentation via class-balanced self-training,” in ECCV, 2018.
  • [23] P. S. Bradley, K. P. Bennett, and A. Demiriz, “Constrained k-means clustering,” Microsoft Research, Redmond, 2000.
  • [24] M. I. Malinen and P. Fränti, “Balanced k-means for clustering,” in Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR), 2014.
  • [25] S. Bai, M. Zhang, W. Zhou, S. Huang, Z. Luan, D. Wang, and B. Chen, “Prompt-based distribution alignment for unsupervised domain adaptation,” in AAAI, 2024.
  • [26] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” IJCV, 2022.
  • [27] ——, “Conditional prompt learning for vision-language models,” in CVPR, 2022.
  • [28] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in ICML, 2019.
  • [29] R. Zhang, R. Fang, W. Zhang, P. Gao, K. Li, J. Dai, Y. Qiao, and H. Li, “Tip-adapter: Training-free clip-adapter for better vision-language modeling,” arXiv preprint arXiv:2111.03930, 2021.
  • [30] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models.” ICLR, 2022.
  • [31] J. Deng, Z. Zhang, F. Eyben, and B. Schuller, “Autoencoder-based unsupervised domain adaptation for speech emotion recognition,” IEEE Signal Processing Letters, 2014.
  • [32] B. Caputo, H. Müller, J. Martinez-Gomez, M. Villegas, B. Acar, N. Patricia, N. Marvasti, S. Üsküdarlı, R. Paredes, M. Cazorla et al., “Imageclef 2014: Overview and analysis of the results,” in International conference of the cross-language evaluation forum for European languages, 2014.
  • [33] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan, “Deep hashing network for unsupervised domain adaptation,” in CVPR, 2017.
  • [34] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang, “Moment matching for multi-source domain adaptation,” in ICCV, 2019.
  • [35] M. Singha, H. Pal, A. Jha, and B. Banerjee, “Ad-clip: Adapting domains in prompt space using clip,” in ICCV, 2023.
  • [36] J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept detection using adaptive svms,” in ACM MM, 2007.
  • [37] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon, “Adversarial multiple source domain adaptation,” NeurIPS, 2018.
  • [38] Y. Zhu, F. Zhuang, and D. Wang, “Aligning domain-specific distribution and classifier for cross-domain classification from multiple sources,” in AAAI, 2019.
  • [39] H. Liu, J. Wang, and M. Long, “Cycle self-training for domain adaptation,” NeurIPS, 2021.
  • [40] C. Zhou, Z. Wang, B. Du, and Y. Luo, “Cycle self-refinement for multi-source domain adaptation,” in AAAI, 2024.
  • [41] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in ICML, 2021.
  • [42] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022.
  • [43] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” NeurIPS, 2022.
  • [44] V. H. Phan, T. L. Tran, Q. Tran, and T. Le, “Enhancing domain adaptation through prompt gradient alignment,” NeurIPS, 2024.
  • [45] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in ICML, 2009.
  • [46] Y. Zhang, P. David, and B. Gong, “Curriculum domain adaptation for semantic segmentation of urban scenes,” in ICCV, 2017.
  • [47] T. Zhou and J. Bilmes, “Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity,” in ICLR, 2018.
  • [48] C. Chen, W. Xie, W. Huang, Y. Rong, X. Ding, Y. Huang, T. Xu, and J. Huang, “Progressive feature alignment for unsupervised domain adaptation,” in CVPR, 2019.
  • [49] N. Karim, N. C. Mithun, A. Rajvanshi, H.-p. Chiu, S. Samarasekera, and N. Rahnavard, “C-sfda: A curriculum learning aided self-training framework for efficient source free domain adaptation,” in CVPR, 2023.
  • [50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, 2017.
  • [51] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [52] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in ECCV, 2022.
  • [53] W. Wang, Y. Huang, Y. Wang, and L. Wang, “Generalized autoencoder: A neural network framework for dimensionality reduction,” in CVPR Workshop, 2014.
  • [54] C. Fefferman, S. Mitter, and H. Narayanan, “Testing the manifold hypothesis,” Journal of the American Mathematical Society, 2016.
  • [55] M. Ilse, J. M. Tomczak, C. Louizos, and M. Welling, “Diva: Domain invariant variational autoencoders,” in Medical Imaging with Deep Learning, 2020.
  • [56] Z. Yue, Q. Sun, and H. Zhang, “Make the u in uda matter: Invariant consistency learning for unsupervised domain adaptation,” NeurIPS, 2023.

VI Biography Section

Haoran Chen is currently working toward the Ph.D. degree in Computer Science at Fudan University, advised by Prof. Zuxuan Wu and Prof. Yu-Gang Jiang. His research interests include transfer learning, continual learning, and multimodal large language models.
Zexiao Wang is a first-year Master's student in Computer Science at Fudan University, advised by Prof. Zuxuan Wu. His research interests include transfer learning and continual learning.
Haidong Cao is a second-year Master's student in Computer Science at Fudan University, advised by Prof. Zuxuan Wu. His research interests include computer vision and robot learning.
Zuxuan Wu received his Ph.D. in Computer Science from the University of Maryland with Prof. Larry Davis in 2020. He is currently an Associate Professor at the Institute of Trustworthy Embodied AI, Fudan University. His research interests are in computer vision and deep learning. His work has been recognized by an AI 2000 Most Influential Scholars Honorable Mention in 2021, a Microsoft Research PhD Fellowship (10 people worldwide) in 2019, and a Snap PhD Fellowship (10 people worldwide) in 2017.
Yu-Gang Jiang (Fellow, IEEE) received the Ph.D. degree in Computer Science from City University of Hong Kong in 2009 and worked as a Postdoctoral Research Scientist at Columbia University, New York, during 2009-2011. He is currently a Distinguished Professor at the Institute of Trustworthy Embodied AI, Fudan University, Shanghai, China. His research lies in the areas of multimedia, computer vision, embodied AI, and trustworthy AI. His research has led to the development of innovative AI tools that have been used in many practical applications, such as defect detection for high-speed railway infrastructure. His open-source video analysis toolkits and datasets, such as CU-VIREO374, CCV, THUMOS, FCVID, and WildDeepfake, have been widely used in both academia and industry. He currently serves as Chair of the ACM Shanghai Chapter and Associate Editor of several international journals. For contributions to large-scale and trustworthy video analysis, he was elected a Fellow of IEEE, IAPR, and CCF.