Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation

Haoran Chen, Zexiao Wang, Haidong Cao, Zuxuan Wu, and Yu-Gang Jiang

H. Chen, Z. Wang, H. Cao, Z. Wu, and Y-G. Jiang are with the Institute of Trustworthy Embodied AI, Fudan University. E-mail: {chenhran21, zexiaowang25, hdcao24}@m.fudan.edu.cn, {zxwu, ygj}@fudan.edu.cn

Abstract

Large Vision-Language Models like CLIP have become a powerful foundation for Unsupervised Domain Adaptation due to their strong zero-shot generalization. State-of-the-art methods typically leverage CLIP to generate pseudo-labels for the target domain, then fine-tune the model to learn domain-invariant features. However, these methods attempt to align source and target domains using all pseudo-labeled data simultaneously. This one-shot alignment struggles with noisy, hard-to-classify samples, leading to error propagation and suboptimal feature learning. The problem is amplified in the multi-source scenario, where diverse domain gaps and varying noise levels across multiple source domains further destabilize the alignment process. To address this issue, we propose a progressive alignment strategy for adapting CLIP to unlabeled downstream tasks. Our method begins by training the model on a high-confidence subset of target samples, allowing it to first learn a well-aligned representation from the most reliable data. As training progresses, it gradually incorporates more challenging samples, guiding the model to refine its understanding without being overwhelmed by initial label noise. This progressive approach effectively mitigates confirmation bias and promotes more robust convergence, allowing for the learning of genuinely domain-invariant features. We name our approach M $\text{P}^{2}$ A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. Experiments show that M $\text{P}^{2}$ A achieves state-of-the-art performance compared with the most recent CLIP-based MS-UDA approaches, demonstrating the effectiveness of our approach.

Index Terms:

Multi-source unsupervised domain adaptation, transfer learning, vision-language models, CLIP.

I Introduction

Recent years have witnessed a massive explosion in the capabilities of artificial intelligence[1, 2, 3], driven by the advent of large-scale models pre-trained on vast, web-scale datasets. These models have demonstrated remarkable abilities to generalize across a wide variety of tasks, often achieving human-level performance when the test data closely resembles their training distribution. However, when deployed in real-world scenarios, these systems frequently encounter data from new environments, sensors, or styles, a phenomenon known as domain shift[4, 5, 6]. This distributional mismatch between training and deployment data can cause a drastic and unexpected drop in accuracy[7, 8, 9]. While one might consider labeling deployment data for downstream adaptation, the manual annotation of data for every new domain is both prohibitively expensive and unscalable. As such, enabling models to autonomously adapt to new, unlabeled domains has become a central and fundamental challenge.

To address this challenge, Unsupervised Domain Adaptation (UDA) has emerged as a key research area, aiming to transfer knowledge from a labeled source domain to an unlabeled target domain[10, 11, 12]. A more challenging yet realistic extension of this setting is Multi-Source Unsupervised Domain Adaptation (MS-UDA)[13], where labeled data from multiple, heterogeneous source domains is available for transfer. This scenario is particularly relevant as real-world data is often aggregated from a variety of sources[12, 14]. However, the multi-source setting introduces significant complexities: the model must now bridge multiple, potentially conflicting domain gaps and handle varying data quality across sources. This can lead to an unstable alignment process and the risk of negative transfer, where a poorly matched source domain hinders rather than helps adaptation to the target.

Figure 1: Illustration of our progressive domain alignment pipeline. The adaptation process starts with the learning phase, where the model is trained on the current cluster. Following this, we refine the pseudo-labels of the current cluster using the updated model. As training advances, increasingly challenging classes are gradually introduced. A subset of high-confidence samples from the refined cluster is then rehearsed, i.e., transferred into the next cluster’s training set. This progressive learning strategy enables stable adaptation from easy to hard samples while mitigating error accumulation and catastrophic forgetting.

Recently, researchers have leveraged large Vision-Language Models such as CLIP[15] to address the MS-UDA problem[16, 17, 18]. With its strong zero-shot capability, CLIP can generate pseudo-labels for unlabeled target data by aligning images with text-based class descriptions. A common strategy then is to fine-tune the model on the entire pseudo-labeled target set, aiming to align the source and target feature distributions. However, this “one-shot” alignment approach has a critical drawback: it treats all pseudo-labels with equal importance. Noisy predictions on ambiguous or hard samples are immediately incorporated into training, resulting in confirmation bias, error propagation, and suboptimal feature learning[19]. While some studies explore iterative pseudo-label refinement through self-training[20, 21, 22], these methods are computationally expensive, as each round processes the full training set. This challenge is further exacerbated in MS-UDA, where the compounded domain shifts and varying noise levels across multiple source domains make the initial alignment process even less reliable, demanding both greater robustness and computational efficiency.

To address the limitations of existing methods, we propose Multi-Source Progressive Prompt Alignment (M $\text{P}^{2}$ A), a simple yet effective framework built on a “learn, refine, and rehearse” cycle that progressively adapts CLIP to the target domain through an easy-to-hard curriculum. This cycle is designed to stabilize learning by starting from reliable supervision and gradually incorporating more ambiguous data.

Before detailing the learning cycle, we perform several key preparatory steps. First, we structure the problem space using a balanced KMeans[23, 24] algorithm to cluster classes based on their CLIP visual embeddings. To estimate the difficulty of each cluster, we compute the average cosine similarity between the image embeddings and their corresponding text-based class prototypes. Clusters with higher average similarity are deemed easier, enabling us to sort the clusters from easy to hard and establish a progressive training schedule.

Additionally, unlike prior CLIP-based UDA approaches that directly adopt zero-shot pseudo-labels from the raw CLIP model[16, 25, 18], we argue that this overlooks valuable information from the labeled source domains, particularly in the multi-source setting. To overcome this, we first pre-train CLIP-based models individually on each source domain using supervised data. We then generate pseudo-labels on the target domain by ensembling the predictions from all source-specific models, employing strategies such as confidence averaging or majority voting to produce more reliable pseudo-labels for downstream training.

Within each curriculum stage, we begin with the learning phase, where we independently optimize a set of learnable textual prompts[26, 27] and a shared set of lightweight visual PEFT[28, 29, 30] modules for each source–target domain pair. These components allow for efficient and expressive domain adaptation in both modalities. To consolidate and align the learned textual prompts, we follow prior work[17] by passing them through a shallow auto-encoder, which projects the high-dimensional prompt vectors into a shared, denoised latent space[31, 27]. To further encourage consistency across source domains, we introduce an $\mathcal{L}_{1}$ regularization loss on the predictions from different prompt sets, guiding them toward a unified representation for the target domain.

The core innovation of our approach lies in the refine and rehearse phase. After training on a given cluster, the model regenerates pseudo-labels for that cluster, reflecting its improved understanding. From these, we extract only the most confident samples, i.e., those exceeding a predefined probability threshold, and incorporate them into the training set for the next, more challenging cluster. This selective rehearsal mechanism serves a dual purpose: it anchors future learning with reliable examples to prevent catastrophic forgetting[7], while also constraining the training set size for improved computational efficiency. Crucially, in contrast to conventional self-training methods that indiscriminately reuse all past data[21], our confidence-aware strategy reduces error propagation and mitigates overfitting to noisy labels. By progressively expanding the training data with only well-aligned samples, M $\text{P}^{2}$ A achieves more stable convergence and learns domain-invariant representations in a robust and scalable manner. The whole pipeline of our training cycle is illustrated in Figure 1.

We validate the effectiveness of our proposed framework through extensive experiments on three widely-used MS-UDA benchmarks: ImageCLEF [32], Office-Home [33], and DomainNet [34]. These datasets encompass a broad range of domains, visual styles, and adaptation challenges, providing a comprehensive testbed for evaluating multi-source domain adaptation methods. Across all benchmarks, M $\text{P}^{2}$ A achieves average accuracies of 94.3%, 91.8%, and 64.1%, respectively, consistently surpassing recent CLIP-based UDA approaches by a notable margin. These results demonstrate the robustness, scalability, and generalization capability of our progressive alignment framework, particularly under the challenging multi-source scenarios.

A preliminary version of this work, MPA, appeared in [17], where we were among the first to explore the use of CLIP for domain adaptation. Since then, the research landscape has evolved rapidly, with numerous follow-up studies advancing CLIP-based adaptation[18, 35]. To remain competitive and push the frontier further, this journal version introduces several substantial improvements over the original conference paper.

First, while the earlier version primarily emphasized textual prompt learning, the current work fully exploits both the textual and visual modalities of CLIP by incorporating PEFT modules. This enables the model to learn more expressive and adaptable visual representations. Second, we introduce a novel source pretraining strategy that leverages the supervision from labeled source domains to generate more robust and accurate pseudo-labels for the target domain. Third, we propose a curriculum-based alignment framework that progressively adapts the model from easy to hard samples, effectively mitigating confirmation bias and improving training stability.

In addition, we significantly expand the experimental scope: we re-implement and fairly compare our method against a comprehensive set of strong CLIP-based baselines using modern ViT-based backbones, aligning with current trends in large-scale vision models. Finally, we provide in-depth empirical analysis to validate the robustness and compatibility of our proposed progressive alignment strategy across diverse domain adaptation benchmarks.

II Related Works

II-A Multi-Source Unsupervised Domain Adaptation

First introduced by Yang et al.[36], MS-UDA extends the standard UDA setting by leveraging multiple labeled source domains to improve generalization to an unlabeled target domain. Classic approaches often focus on aligning distributions across domains. Adversarial methods like MDAN[37] and DCTN [13] train domain discriminators to encourage feature alignment, while discrepancy-based methods such as MFSAN[38] and $\text{M}^{3}$ SDA [34] align each source-target pair by minimizing statistical distances like maximum mean discrepancy or moment distance.

With the emergence of large-scale pretrained networks, self-training has become a dominant strategy in UDA, where pseudo-labels generated on target data are used to iteratively refine the model. For example, CST[39] employs a bi-model framework where two networks alternately generate pseudo-labels for each other, reducing overfitting to self-predictions. Extending this to the multi-source case, CSR[40] introduces an instance-level ensemble of source-specific models and a domain-ensemble network, refined in a mutual loop via cross-model supervision. These self-training strategies demonstrate the effectiveness of leveraging target structure and inter-model consistency to enhance multi-source adaptation.

II-B CLIP-based MS-UDA

Recent advances in vision-language models[41, 42, 43], especially CLIP[15], have sparked new research in domain adaptation by leveraging rich multimodal semantics. A pioneering effort in this direction is DAPL[16], which introduces prompt learning into UDA by disentangling domain and category representations. It optimizes domain-specific prompts to maintain semantic structures while avoiding the limitations of hard domain alignment. Building on this idea, subsequent works have proposed various strategies to enhance prompt learning. AD-CLIP[35] and DAMP[18] extend this direction by designing domain-invariant prompts and integrating visual context into the prompting process. PDA[25] further incorporates domain knowledge into a two-branch prompt tuning framework, aligning class and domain representations via feature banks and contrastive losses. However, most existing CLIP-based UDA methods are limited to the single-source setting. Addressing this gap, MPA[17] is the first to tackle multi-source UDA, introducing a lightweight and modular multi-prompt alignment strategy that learns separate prompts per source-target pair and aligns them via autoencoding and consensus optimization. Inspired by these insights, PGA[44] formulates UDA as a multi-objective optimization problem and aligns prompt gradients across domains to encourage consistent learning, achieving robust adaptation under both single and multi-source settings. Collectively, these methods demonstrate the promise of prompt-based CLIP adaptation in both efficiency and accuracy for domain adaptation.

II-C Curriculum Learning

The concept of Curriculum Learning, formally introduced by [45], advocates for training models on a structured sequence of examples from easy to hard. This paradigm, inspired by human learning, aims to guide optimization towards more robust solutions and improve generalization by first building a foundation on simpler concepts[46, 47, 48].

Despite its potential, curriculum learning remains underutilized in UDA. One notable effort is C-SFDA[49], which applies a dynamic, self-paced curriculum based on prediction confidence and uncertainty. To determine difficulty, C-SFDA measures variance across multiple test-time augmentations and adaptively thresholds sample reliability. While effective, this process requires a substantial amount of hand-tuned scheduling and heuristic design including augmentation policies, uncertainty statistics, and carefully calibrated thresholds, making the method complex and sensitive to hyperparameter settings.

In contrast, our approach introduces a static, data-driven curriculum that is both simpler and more robust. We predefine the learning schedule by clustering classes using CLIP visual embeddings and ranking them by image-text alignment difficulty, independent of model predictions. Our “learn, refine, and rehearse” framework progressively introduces harder classes while anchoring training with high-confidence examples from earlier stages. This avoids the need for online confidence computation and extensive scheduling logic, resulting in a more interpretable and computationally efficient curriculum learning strategy.

III Method

To achieve more robust alignment on the target domain, we propose a simple yet effective “learn, refine, and rehearse” framework built upon the pre-trained CLIP model, leveraging its strong zero-shot capability. The core idea is to partition the learning data into subsets and train the model in a sequential, curriculum-based manner that progresses from easy to hard examples. In this way, the adaptation is gradually enhanced and the impact of noisy supervision is mitigated. In the following sections, we begin by introducing the design of textual prompts and visual parameter-efficient tuning modules in Section III-A. The procedure for generating initial pseudo-labels is then detailed in Section III-B. Next, we describe our data partitioning strategy based on CLIP embeddings in Section III-C. The core learning phase is presented in Section III-D, followed by the refine-and-rehearse phase in Section III-E.

III-A Architectural design

Let $N$ denote the total number of domains, where the first $N{-}1$ domains are source domains and the $N$ -th domain is the target domain. Let $K$ denote the number of classes shared across all domains. In the multi-source UDA setting, our objective is to learn a domain-invariant latent space such that both the inter-source domain gaps and the discrepancies between each source and the target domain are minimized. To this end, we design textual prompts that incorporate both domain-invariant and domain-specific features, and introduce visual Parameter-Efficient Fine-Tuning modules to enhance visual adaptation to the target domain.

III-A1 Textual Prompt Design

For the textual prompts, we construct two types of learnable tokens: class-specific context tokens $v_{i}^{k}$ , where $i\in\{1,2,\ldots,M_{1}\}$ and $k\in\{1,2,\ldots,K\}$ , and domain-specific tokens shared across all classes, denoted as $d_{j}^{d}$ , where $j\in\{1,2,\ldots,M_{2}\}$ and $d\in\{s,t\}$ . Here, $M_{1}$ and $M_{2}$ represent the number of tokens in each category, and $s$ and $t$ indicate the source and target domains, respectively. Combining these tokens, we construct prompts for $2K$ categories (source and target for each class) used in contrastive training. Specifically, for each source-target pair, we define the prompt embedding matrix as:

P_{i}=[t_{1}^{s_{i}},\ldots,t_{K}^{s_{i}},\ t_{1}^{t_{i}},\ldots,t_{K}^{t_{i}}]^{\top},\quad i\in\{1,2,\ldots,N{-}1\}.

An illustration of this design is shown on the left of Figure 2.

III-A2 Visual PEFT Modules

The original MPA framework utilizes the raw visual embeddings directly from the frozen CLIP vision encoder. In contrast, for M $\text{P}^{2}$ A we take a step further and adapt the visual encoder to the target domain in a parameter-efficient manner. Specifically, we investigate three lightweight fine-tuning strategies: prompt tuning[52], LoRA[30], and adapter[28] modules. These modules are inserted into each Transformer block, and only their parameters are updated, while the rest of the vision encoder remains frozen.

Prompt tuning prepends a set of learnable prompt tokens to the patch embeddings at each Transformer layer. Given an input image represented as patch embeddings $X\in\mathbb{R}^{n\times d}$ , where $n$ is the number of patches and $d$ is the embedding dimension, we introduce learnable tokens $E\in\mathbb{R}^{M_{3}\times d}$ and modify the input as:

\tilde{X}=[E;X]\in\mathbb{R}^{(M_{3}+n)\times d}
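The concatenation above can be sketched in a few lines of NumPy. This is only an illustration of the shape bookkeeping: the helper name `prepend_prompts`, the sizes, and the random initialisation of $E$ are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def prepend_prompts(X, E):
    """Form X_tilde = [E; X] by stacking prompt tokens above the patch tokens."""
    if X.shape[1] != E.shape[1]:
        raise ValueError("prompt and patch tokens must share the embedding dim d")
    return np.concatenate([E, X], axis=0)

rng = np.random.default_rng(0)
n, d, M3 = 196, 768, 8                   # illustrative sizes (assumption)
X = rng.normal(size=(n, d))              # patch embeddings of one image
E = rng.normal(scale=0.02, size=(M3, d)) # learnable prompt tokens (random init assumed)
X_tilde = prepend_prompts(X, E)          # shape (M3 + n, d)
```

Only `E` would receive gradients in training; the patch embeddings and the encoder weights stay frozen.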

LoRA modifies the weight matrices in the attention layers by injecting low-rank updates. For a given weight matrix $W\in\mathbb{R}^{d\times d}$ , two trainable matrices $A\in\mathbb{R}^{r_{1}\times d}$ and $B\in\mathbb{R}^{d\times r_{1}}$ with $r_{1}\ll d$ are introduced, and the updated matrix becomes:

W^{\prime}=W+\Delta W,\quad\Delta W=BA

In our design, LoRA is applied to the query and value projection matrices in the self-attention mechanism:

Q=(W_{q}+B_{q}A_{q})X,\quad V=(W_{v}+B_{v}A_{v})X
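As a rough sketch of the low-rank update, the class below wraps a frozen weight $W$ with trainable factors $A$ and $B$ of the shapes given above. The class name `LoRALinear`, the zero initialisation of $B$ (a common LoRA convention, assumed here), and the row-per-token layout of `X` are choices made for this example only.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update delta_W = B @ A."""

    def __init__(self, W, r, rng):
        d_out, d_in = W.shape
        self.W = W                                        # frozen, shape (d, d)
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, shape (r1, d)
        self.B = np.zeros((d_out, r))                     # trainable, shape (d, r1); zero init assumed

    def delta(self):
        return self.B @ self.A

    def __call__(self, X):
        # X stores one token per row, so "W X" in the text becomes X @ W'^T here.
        return X @ (self.W + self.delta()).T

rng = np.random.default_rng(0)
d, r1, n = 16, 2, 5                       # illustrative sizes (assumption)
W_q = rng.normal(size=(d, d))
q_proj = LoRALinear(W_q, r1, rng)
X = rng.normal(size=(n, d))

Q0 = q_proj(X)                            # identical to the frozen projection while B is zero
q_proj.B = rng.normal(size=(d, r1))       # stand-in for a gradient update to B
Q1 = q_proj(X)                            # now differs by a rank-r1 correction
```

With this initialisation the adapted projection starts exactly at the pre-trained one, and the update adds only $2dr_{1}$ trainable parameters per projection instead of $d^{2}$.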

Adapter modules are lightweight residual layers inserted within each Transformer block. A typical adapter is composed of a down-projection, a nonlinearity, and an up-projection:

\text{Adapter}(h)=h+W_{\text{up}}(\text{ReLU}(W_{\text{down}}h))

where $W_{\text{up}}\in\mathbb{R}^{d\times r_{2}}$ and $W_{\text{down}}\in\mathbb{R}^{r_{2}\times d}$ with $r_{2}\ll d$ . The adapter output is added back residually and layer normalized:

h^{\prime}=\text{LayerNorm}(h+\text{Adapter}(h))

These visual PEFT modules are designed to guide the image feature extraction process at different hierarchical levels, enabling the model to focus more on task-relevant semantic regions and thereby generate more discriminative visual features. It is worth noting that, for simplicity, we employ a single shared set of visual PEFT modules across all source–target domain pairs.

III-B Initial Pseudo-Label Generation

Most existing CLIP-based domain adaptation methods generate pseudo-labels by directly applying CLIP’s zero-shot predictions on the target domain. However, this approach fails to leverage the rich semantic knowledge embedded in the labeled source domains, which can provide valuable guidance for more accurate pseudo-label generation.

To address this limitation, we propose a supervised ensemble-based strategy that exploits the source domain annotations. Specifically, for each source domain $\mathcal{D}_{s}^{i}$ where $i\in\{1,2,\ldots,N{-}1\}$ , we fine-tune the CLIP model using the textual prompts and visual PEFT modules described in Section III-A, leveraging only the labeled data from that source. This results in $N{-}1$ independently trained models. Each model is then used to generate class predictions on the unlabeled target domain samples, producing a set of logits:

\{\mathbf{z}_{i}(x_{t})\}_{i=1}^{N-1},\quad\text{for each }x_{t}\in\mathcal{D}_{t}.

To aggregate these predictions into a final pseudo-label $\hat{y}_{t}$ for each target sample $x_{t}$ , we consider two voting strategies:

•

Average Confidence Voting: Average the softmax outputs across all models and select the most confident class:

\hat{y}_{t}=\arg\max_{k}\ \frac{1}{N-1}\sum_{i=1}^{N-1}\text{softmax}(\mathbf{z}_{i}(x_{t}))_{k}.

•

Majority Voting: Take the class that receives the most votes from the $N{-}1$ models:

\hat{y}_{t}=\text{mode}\left(\{\arg\max_{k}\ \mathbf{z}_{i}(x_{t})_{k}\}_{i=1}^{N-1}\right).
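The two aggregation rules can be compared on a toy example. The sketch below assumes the logits of the $N{-}1$ source-specific models are stacked into one array; the function names and the tie-breaking rule for majority voting (smallest class index wins) are assumptions, since the text does not specify how ties are resolved.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def average_confidence_vote(logits):
    """logits: (N-1, n_samples, K). Returns pseudo-labels and their mean confidence."""
    probs = softmax(logits, axis=-1).mean(axis=0)
    return probs.argmax(axis=1), probs.max(axis=1)

def majority_vote(logits):
    """Mode of the per-model argmax; ties go to the smallest class index (assumption)."""
    votes = logits.argmax(axis=-1)                      # (N-1, n_samples)
    K = logits.shape[-1]
    counts = np.stack([(votes == k).sum(axis=0) for k in range(K)], axis=1)
    return counts.argmax(axis=1)

# Three source-specific models, two target samples, three classes (toy values).
logits = np.array([
    [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]],
    [[0.0, 1.0, 0.0], [3.0, 0.0, 0.0]],
])
avg_labels, avg_conf = average_confidence_vote(logits)  # labels [0, 1]
maj_labels = majority_vote(logits)                      # labels [1, 1]
```

On the first sample the two rules disagree: one very confident model outweighs two mildly confident ones under averaging, while majority voting ignores confidence. This is one reason the choice between them may depend on the dataset.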

The voting strategy is chosen empirically. Importantly, since the source-trained models are prone to overfitting their respective source domains, they are not used in the subsequent adaptation process. That is, this training step is performed solely to provide a more informed and stable initialization of pseudo-labels for the target domain.

III-C Curriculum-Oriented Data Grouping

Traditional approaches in unsupervised domain adaptation often align source and target domains using the entire dataset at once. However, this practice can be suboptimal due to the presence of noisy or hard-to-classify target samples, particularly when pseudo-labels are involved. Noisy labels may introduce confirmation bias and hinder model convergence. To address this, we propose a curriculum-style strategy that partitions the data into subsets and performs alignment progressively, starting from easier samples and moving toward more difficult ones.

We begin by generating pseudo-labels $\{\hat{y}_{t}\}_{t=1}^{|\mathcal{D}_{t}|}$ for the unlabeled target samples $\mathcal{D}_{t}=\{x_{t}^{(1)},x_{t}^{(2)},\ldots,x_{t}^{(n)}\}$ using the method described in Section III-B. For each class $k\in\{1,2,\ldots,K\}$ , we compute its class-wise feature centroid $\mathbf{c}_{k}$ as the average of CLIP visual embeddings $\phi(x)$ for target samples assigned to class $k$ , where $\phi(\cdot)$ denotes the CLIP vision encoder:

\mathbf{c}_{k}=\frac{1}{|\mathcal{X}_{k}|}\sum_{x\in\mathcal{X}_{k}}\phi(x),\quad\mathcal{X}_{k}=\{x\in\mathcal{D}_{t}\mid\hat{y}_{t}=k\}.

Next, we apply balanced K-Means clustering to the set of class centroids $\{\mathbf{c}_{1},\mathbf{c}_{2},\ldots,\mathbf{c}_{K}\}$ to partition them into $T$ clusters $\{\mathcal{C}_{1},\ldots,\mathcal{C}_{T}\}$ . Unlike standard K-Means, the balanced variant enforces that each cluster contains approximately the same number of class centroids. This is achieved by initializing cluster centers using standard K-Means and then reassigning centroids while enforcing the size constraint. If a centroid’s nearest cluster has already reached the target capacity, it is assigned to the next nearest available cluster with space. The cluster centers are updated after each full assignment round, and the process is repeated until convergence. The process is presented in Algorithm 1.

Once the class centroids are grouped, we assess the difficulty of each cluster. For each class $k$ in a cluster $\mathcal{C}_{t}$ , we compute the cosine similarity between its visual embedding $\phi(x)$ and its corresponding textual embedding $\psi(t_{k})$ . Here $\psi(\cdot)$ denotes the CLIP text encoder. The average similarity score across all classes in a cluster serves as a proxy for how confidently CLIP aligns images and text for those classes:

s_{t}=\frac{1}{|\mathcal{C}_{t}|}\sum_{k\in\mathcal{C}_{t}}\frac{1}{|\mathcal{X}_{k}|}\sum_{x\in\mathcal{X}_{k}}\cos(\phi(x),\psi(t_{k})).

We then sort the clusters $\{\mathcal{C}_{1},\ldots,\mathcal{C}_{T}\}$ in descending order of $s_{t}$ to form a progressive curriculum from easy to hard, since a higher average similarity score indicates that CLIP is already more confident about the alignment of the classes in that cluster.
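A minimal sketch of the centroid and difficulty computations follows, assuming precomputed CLIP image and text features are available as NumPy arrays. The function names are hypothetical, the toy features are hand-made, and every class is assumed to have at least one pseudo-labeled target sample.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def class_centroids(img_feats, pseudo_labels, K):
    """c_k: mean visual embedding of the target samples pseudo-labeled as class k."""
    return np.stack([img_feats[pseudo_labels == k].mean(axis=0) for k in range(K)])

def cluster_scores(img_feats, pseudo_labels, txt_feats, clusters):
    """s_t: mean over classes in a cluster of the mean cos(phi(x), psi(t_k))."""
    img, txt = l2_normalize(img_feats), l2_normalize(txt_feats)
    scores = []
    for classes in clusters:
        per_class = [(img[pseudo_labels == k] @ txt[k]).mean() for k in classes]
        scores.append(float(np.mean(per_class)))
    return np.array(scores)

# Toy data: 5 target samples, 4 classes, 2-D features (all values are illustrative).
img_feats = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 0.0]])
pseudo_labels = np.array([0, 0, 1, 2, 3])
txt_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
clusters = [[2, 3], [0, 1]]               # class indices per cluster

centroids = class_centroids(img_feats, pseudo_labels, K=4)
scores = cluster_scores(img_feats, pseudo_labels, txt_feats, clusters)
order = np.argsort(-scores)               # descending similarity: easy first
curriculum = [clusters[i] for i in order]
```

Here the cluster containing classes 0 and 1 has perfect image-text agreement and is scheduled first, while the cluster with classes 2 and 3 is treated as harder.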

Algorithm 1 Balanced K-Means Clustering

Input: Feature set $\mathcal{C}=\{\mathbf{c}_{1},\ldots,\mathbf{c}_{K}\}$ , cluster number $T$
Output: Cluster assignments $\{\mathcal{C}_{1},\ldots,\mathcal{C}_{T}\}$
Initialize centroids $\{\mu_{1},\ldots,\mu_{T}\}$ using K-Means
Set cluster size budget $B\leftarrow\lceil K/T\rceil$
repeat
    Initialize empty clusters: $\mathcal{C}_{1}\leftarrow\emptyset,\ldots,\mathcal{C}_{T}\leftarrow\emptyset$
    for all $\mathbf{c}_{k}\in\mathcal{C}$ do
        Compute distances $d_{j}=\|\mathbf{c}_{k}-\mu_{j}\|_{2}$
        Sort clusters by increasing $d_{j}$
        for all cluster $j$ in sorted order do
            if $|\mathcal{C}_{j}|<B$ then
                Assign $\mathbf{c}_{k}$ to $\mathcal{C}_{j}$
                break
            end if
        end for
    end for
    for $j=1$ to $T$ do
        Update $\mu_{j}\leftarrow\frac{1}{|\mathcal{C}_{j}|}\sum_{\mathbf{c}\in\mathcal{C}_{j}}\mathbf{c}$
    end for
until convergence
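The procedure in Algorithm 1 could be implemented roughly as follows. This is a sketch under stated assumptions rather than the authors' code: initialisation uses a few plain Lloyd iterations as a stand-in for "using K-Means", convergence is taken to mean that assignments stop changing, class centroids are visited in index order, and empty clusters keep their previous center.

```python
import numpy as np

def balanced_kmeans(C, T, n_init_iters=10, max_iters=100, seed=0):
    """Capacity-constrained K-Means over class centroids C of shape (K, dim)."""
    rng = np.random.default_rng(seed)
    K = len(C)
    B = int(np.ceil(K / T))                              # cluster size budget
    # Initialisation: a few unconstrained Lloyd iterations (assumed stand-in).
    mu = C[rng.choice(K, size=T, replace=False)].copy()
    for _ in range(n_init_iters):
        near = np.linalg.norm(C[:, None] - mu[None], axis=2).argmin(axis=1)
        for j in range(T):
            if np.any(near == j):
                mu[j] = C[near == j].mean(axis=0)
    assign = np.full(K, -1)
    for _ in range(max_iters):
        new_assign = np.full(K, -1)
        sizes = np.zeros(T, dtype=int)
        for k in range(K):
            d = np.linalg.norm(C[k] - mu, axis=1)
            for j in np.argsort(d):                      # nearest cluster with free capacity
                if sizes[j] < B:
                    new_assign[k] = j
                    sizes[j] += 1
                    break
        for j in range(T):                               # update centers after a full round
            if sizes[j] > 0:
                mu[j] = C[new_assign == j].mean(axis=0)
        if np.array_equal(new_assign, assign):           # assignments stable: converged
            break
        assign = new_assign
    return assign

rng = np.random.default_rng(1)
centroids = rng.normal(size=(12, 4))                     # 12 toy class centroids
assign = balanced_kmeans(centroids, T=3)
cluster_sizes = np.bincount(assign, minlength=3)         # each cluster holds 4 classes
```

Because $T\cdot B\geq K$, every class centroid always finds a cluster with free capacity, so no centroid is left unassigned.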

III-D Cluster Learning Phase

With all components in place, we now describe how each cluster is progressively trained. Following the MPA framework, this process is divided into two steps: an initial feature adaptation step and a combined prompt alignment step.

III-D1 Initial Feature Adaptation

To enable better adaptation to the target domain, we train individual sets of textual prompts for each source–target domain pair. In addition, we jointly optimize a shared visual PEFT module across all domains. To reduce the impact of noisy pseudo-labels, we apply a confidence threshold $\tau$ and include only target samples with predicted class probability greater than $\tau$ during training. Let $P_{i}$ denote the textual prompts for the $i$ -th source–target pair, where $i\in\{1,2,\ldots,N-1\}$ , and let $\mathcal{V}$ represent the shared visual PEFT module. Given a labeled image $x_{s}$ from a source domain $\mathcal{D}_{s}$ with label $y_{s}$ , and an unlabeled image $x_{t}$ from the target domain $\mathcal{D}_{t}$ with pseudo-label $\hat{y}_{t}$ , the optimization objective is defined as:

$\displaystyle\min_{P_{i},\mathcal{V}}-\frac{1}{n_{s}}\sum_{x^{s}\sim\mathcal{D}_{s}}\log P(y=y_{s}\mid x^{s};P_{i},\mathcal{V})-\frac{1}{n_{t}}\sum_{x^{t}\sim\mathcal{D}_{t}}\log P(y=\hat{y}_{t}\mid x^{t};P_{i},\mathcal{V})$

(1)

Let $\mathcal{T}$ denote the temperature. The class probability for image $x^{d}$ (where $d\in\{s,t\}$ ) in Eq. (1) is computed as a softmax over the temperature-scaled cosine similarities between the visual and textual embeddings:

P(y=k\mid x^{d})=\frac{\exp(\langle\psi(t^{d}_{k}),\phi(x^{d})\rangle/\mathcal{T})}{\sum_{d^{\prime}\in\{s,t\}}\sum_{i=1}^{K}\exp(\langle\psi(t^{d^{\prime}}_{i}),\phi(x^{d})\rangle/\mathcal{T})}

(2)
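To make Eqs. (1) and (2) concrete, the sketch below evaluates the $2K$-way probability and the resulting loss for one source–target pair on random features. It assumes the $2K$ prompt embeddings are ordered as in $P_{i}$ ( $K$ source prompts followed by $K$ target prompts), so a target pseudo-label $k$ indexes row $K+k$. The function names and the temperature value 0.01 are assumptions for illustration.

```python
import numpy as np

def prompt_probs(img_feat, txt_feats, temperature=0.01):
    """Eq. (2): softmax over all 2K prompts (K source rows, then K target rows)."""
    img = img_feat / np.linalg.norm(img_feat)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = txt @ img / temperature                 # cosine similarity / temperature
    logits = logits - logits.max()                   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def pair_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, txt_feats, K, temperature=0.01):
    """Eq. (1): mean NLL on source labels plus mean NLL on target pseudo-labels."""
    src = [-np.log(prompt_probs(x, txt_feats, temperature)[y])
           for x, y in zip(src_feats, src_labels)]
    tgt = [-np.log(prompt_probs(x, txt_feats, temperature)[K + y])
           for x, y in zip(tgt_feats, tgt_pseudo)]
    return float(np.mean(src) + np.mean(tgt))

rng = np.random.default_rng(0)
K, dim = 3, 8                                        # illustrative sizes (assumption)
txt_feats = rng.normal(size=(2 * K, dim))            # psi(t_k^s) and psi(t_k^t)
src_feats, src_labels = rng.normal(size=(4, dim)), np.array([0, 1, 2, 1])
tgt_feats, tgt_pseudo = rng.normal(size=(5, dim)), np.array([2, 0, 1, 1, 0])

p = prompt_probs(tgt_feats[0], txt_feats)
loss = pair_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, txt_feats, K)
```

Normalising over both source and target prompts means an image must match its class and its domain, which is what lets the domain-specific tokens absorb domain information.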

III-D2 Multi-Prompt Alignment

The above training yields a distinct textual prompt $P_{i}$ for each source–target pair. To further consolidate the learned knowledge and eliminate redundancy, we perform multi-prompt alignment. Although directly using the learned prompts can yield decent results, the high-dimensional nature of prompt vectors may encode redundant or domain-specific noise[53]. Inspired by the manifold hypothesis[54], we employ a denoising autoencoder to extract compact, domain-invariant representations of the prompts.

Specifically, let $\textbf{Proj}(\cdot)$ be a projection function implemented as a one-layer feed-forward network, and let $\textbf{Proj}_{b}(\cdot)$ be a two-layer non-linear decoder. Each learned prompt $P_{i}$ is projected into a latent space:

\textbf{Proj}(P_{i})=\tilde{P}_{i}=W_{1}P_{i}+b_{1}

(3)

and then reconstructed back to $\hat{P}_{i}$ via:

\textbf{Proj}_{b}(\tilde{P}_{i})=W_{3}(\tanh(W_{2}\tilde{P}_{i}+b_{2}))+b_{3}

(4)

The reconstruction objective is:

\mathcal{L}_{AE}=\frac{1}{N-1}\sum_{i=1}^{N-1}\left\|\hat{P}_{i}-P_{i}\right\|_{2}^{2}

(5)
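The forward pass of Eqs. (3) to (5) can be sketched as below. The class name, layer widths, and initialisation scale are hypothetical, and the norm in Eq. (5) is interpreted as a squared Frobenius norm over each prompt matrix, which is an assumption.

```python
import numpy as np

class PromptAutoEncoder:
    """One-layer linear encoder (Eq. 3) and two-layer tanh decoder (Eq. 4)."""

    def __init__(self, d_in, d_latent, d_hidden, rng, scale=0.02):
        self.W1, self.b1 = rng.normal(scale=scale, size=(d_latent, d_in)), np.zeros(d_latent)
        self.W2, self.b2 = rng.normal(scale=scale, size=(d_hidden, d_latent)), np.zeros(d_hidden)
        self.W3, self.b3 = rng.normal(scale=scale, size=(d_in, d_hidden)), np.zeros(d_in)

    def encode(self, P):
        # P stores one prompt embedding per row, so W1 P + b1 becomes P @ W1^T + b1.
        return P @ self.W1.T + self.b1

    def decode(self, P_tilde):
        return np.tanh(P_tilde @ self.W2.T + self.b2) @ self.W3.T + self.b3

def reconstruction_loss(ae, prompts):
    """Eq. (5): squared reconstruction error averaged over the N-1 prompt sets."""
    errors = [np.sum((ae.decode(ae.encode(P)) - P) ** 2) for P in prompts]
    return float(np.mean(errors))

rng = np.random.default_rng(0)
K, d = 3, 32                                          # illustrative sizes (assumption)
prompts = [rng.normal(size=(2 * K, d)) for _ in range(3)]   # N-1 = 3 prompt matrices
ae = PromptAutoEncoder(d_in=d, d_latent=8, d_hidden=16, rng=rng)

latent = ae.encode(prompts[0])                        # shape (2K, 8)
recon = ae.decode(latent)                             # shape (2K, d)
loss_ae = reconstruction_loss(ae, prompts)
```

All prompt sets pass through the same encoder and decoder, so the shared bottleneck is what pushes them toward a common latent representation.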

While this has proven to be an effective alignment strategy [55], differences in data volume and domain gap across sources may still result in inconsistent prompt behavior. To promote deeper alignment among prompts, we introduce an $\mathcal{L}_{1}$ consistency loss based on predictions over the same target sample $x_{t}$ :

$\displaystyle\mathcal{L}_{1}=\frac{2}{(N-1)(N-2)}\sum_{j=1}^{N-2}\sum_{i=j+1}^{N-1}\left|P(y=k^{t}\mid x_{t};P_{i})-P(y=k^{t}\mid x_{t};P_{j})\right|$

(6)
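A small numerical check of Eq. (6) for a single target sample: the prefactor $\frac{2}{(N-1)(N-2)}$ is the reciprocal of the number of unordered prompt pairs, so the loss is the mean absolute difference over pairs. The function name and the probability values are illustrative only.

```python
import numpy as np
from itertools import combinations

def l1_consistency(probs):
    """probs[i] = P(y = k^t | x_t; P_i) for each of the N-1 prompt sets (N-1 >= 2)."""
    probs = np.asarray(probs, dtype=float)
    pairs = list(combinations(range(len(probs)), 2))   # (N-1)(N-2)/2 unordered pairs
    total = sum(abs(probs[i] - probs[j]) for j, i in pairs)
    return float(total / len(pairs))

# Three source-target prompt sets (N = 4) scoring the same target sample.
loss_l1 = l1_consistency([0.9, 0.7, 0.6])   # (0.2 + 0.3 + 0.1) / 3
loss_same = l1_consistency([0.8, 0.8, 0.8]) # identical predictions give zero loss
```

The loss vanishes only when every prompt set assigns the same probability to the sample, which is the consistency the alignment step encourages.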

Putting these together, the total training loss in this step is:

\mathcal{L}=\mathcal{L}_{CLS}+\mathcal{L}_{AE}+\alpha\mathcal{L}_{1}

(7)

where $\mathcal{L}_{CLS}$ is the standard cross-entropy loss, and $\alpha$ is a hyperparameter controlling the weight of the alignment term $\mathcal{L}_{1}$ . During this alignment stage, the visual PEFT module $\mathcal{V}$ is kept fixed; only the textual prompts and autoencoder parameters are updated. An overview of the multi-prompt learning and alignment procedure is illustrated in Figure 3.

III-E Refine and Rehearse Phase

After completing the training on each curriculum stage (i.e., cluster), we refine the pseudo-labels for that stage using the updated model. Specifically, we re-evaluate all target samples in the current cluster and regenerate their pseudo-labels based on the latest model predictions. To ensure reliability and improve computational efficiency, we apply a confidence threshold $\beta$ and retain only the most confident samples—those with predicted class probabilities exceeding $\beta$ .

These high-confidence samples serve as anchor points and are carried forward to the training set of the next cluster. This mechanism enables a “refine and rehearse” strategy: refined pseudo-labels from previously learned easier clusters are used to stabilize training on subsequent harder clusters. This gradual accumulation of trusted samples helps mitigate noise, reinforce previously learned knowledge, prevent catastrophic forgetting, and promote smoother convergence. The refinement and rehearsal process continues iteratively across all $T$ clusters until the entire target dataset has been covered.
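The selection step can be sketched as a simple filter over the updated model's predictions. The function name, the sample identifiers, and the value of $\beta$ below are hypothetical, and the way rehearsed samples are merged into the next stage (plain concatenation) is an assumption about the bookkeeping.

```python
import numpy as np

def refine_and_rehearse(probs, sample_ids, beta):
    """Regenerate pseudo-labels and keep samples whose confidence exceeds beta.

    probs: (n, K) class probabilities from the updated model for the target
    samples of the current cluster.
    """
    labels = probs.argmax(axis=1)          # refined pseudo-labels
    conf = probs.max(axis=1)
    keep = conf > beta
    return sample_ids[keep], labels[keep]

# Toy predictions for three samples of the current cluster.
probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]])
sample_ids = np.array([10, 11, 12])
rehearsed_ids, rehearsed_labels = refine_and_rehearse(probs, sample_ids, beta=0.8)

# Rehearsed anchors join the target samples of the next, harder cluster.
next_cluster_ids = np.array([20, 21])
next_train_ids = np.concatenate([next_cluster_ids, rehearsed_ids])
```

Because only samples above the threshold are carried forward, the training set for each stage grows by the number of retained anchors rather than by the full size of earlier clusters.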

Finally, after training on all clusters is complete, we obtain the final model. During inference, we aggregate the predictions from all domain-specific prompt sets and use the averaged output as the final prediction for each test sample.

TABLE I: Accuracy (%) on ImageCLEF and Office-Home. We re-implement all compared approaches using the CLIP ViT-B/16 backbone for a fair comparison.

		ImageCLEF				Office-Home
Method	Venue	$\rightarrow$ C	$\rightarrow$ I	$\rightarrow$ P	Avg	$\rightarrow$ Ar	$\rightarrow$ Cl	$\rightarrow$ Pr	$\rightarrow$ Rw	Avg
Zero-Shot
CLIP[15]	ICML’21	97.0	96.8	82.7	92.2	83.4	65.3	89.0	89.4	81.8
Non-CLIP Based
ICON[56]	NeurIPS’23	97.8	97.4	83.2	92.8	88.2	81.1	93.5	93.9	89.2
CSR[40]	AAAI’24	98.3	95.5	82.0	91.9	86.2	79.2	92.4	91.3	87.3
CLIP Based
AD-CLIP[35]	ICCV’23	97.8	96.7	82.3	92.3	84.2	72.6	92.9	91.2	85.2
MPA[17]	NeurIPS’23	97.8	97.4	81.2	92.1	85.1	76.6	91.3	91.2	86.1
DAPL[16]	TNNLS’23	97.8	97.2	83.5	92.8	84.3	72.4	92.6	91.4	85.2
PDA[25]	AAAI’24	97.7	97.3	81.3	92.1	87.3	73.9	92.2	92.4	86.5
PGA[44]	NeurIPS’24	97.4	96.5	82.5	92.1	83.6	76.9	92.9	91.2	86.2
DAMP[18]	CVPR’24	98.5	97.7	81.8	92.7	87.3	77.3	94.1	92.7	87.8
M $\text{P}^{2}$ A-Adapter	-	97.7	92.3	78.2	89.4	90.3	83.2	92.9	89.1	88.9
M $\text{P}^{2}$ A-Prompt	-	97.8	97.1	83.7	92.9	90.2	80.6	94.8	92.5	89.5
M $\text{P}^{2}$ A-LoRA	-	98.8	99.0	85.0	94.3	93.6	83.2	94.8	95.6	91.8

IV Experiments

IV-A Experimental Setup

To comprehensively assess the effectiveness of the proposed M $\text{P}^{2}$ A method, we perform experiments on three widely used benchmark datasets in the field of MS-UDA: ImageCLEF, Office-Home, and DomainNet. ImageCLEF is a relatively small dataset comprising 1,800 images from 12 object categories across three distinct domains: ImageNet ILSVRC 2012 (I), Pascal VOC 2012 (P), and Caltech-256 (C). Office-Home is a medium-scale dataset containing approximately 15,500 images from 65 categories, drawn from four diverse domains: Art, Clipart, Product, and Real-World. In contrast, DomainNet is currently the largest and most challenging UDA benchmark, encompassing around 600,000 images spanning 345 categories over six heterogeneous domains: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch.

We use top-1 accuracy as our evaluation metric and compare our method against three groups of representative baselines. (1) Zero-shot CLIP inference. (2) CLIP-based UDA approaches, including DAMP[18], DAPL[16], MPA[17], PDA[25], PGA[44], and AD-CLIP[35]. Among these, PDA, DAMP, and DAPL were originally designed for single-source UDA; to adapt them to the multi-source setting, we follow the common practice of merging all source domains into a single source domain before applying them. The remaining approaches are specifically developed for multi-source UDA, but they rely on ResNet-based CLIP image encoders. To ensure a fair comparison, we re-implement all methods using the pre-trained ViT-B/16 image encoder from CLIP. (3) Non-CLIP-based UDA methods, including ICON[56] and CSR[40]. For consistency, we also replace their original image encoders with the CLIP ViT-B/16 backbone.

TABLE II: Accuracy (%) on DomainNet. We re-implement all compared approaches using the CLIP ViT-B/16 backbone for a fair comparison.

		DomainNet
Method	Venue	$\rightarrow$ Clp	$\rightarrow$ Inf	$\rightarrow$ Pnt	$\rightarrow$ Qdr	$\rightarrow$ Rel	$\rightarrow$ Skt	Avg
Zero-Shot
CLIP[15]	ICML’21	69.4	45.8	64.3	13.7	82.7	62.3	56.4
Non-CLIP Based
ICON[56]	NeurIPS’23	76.2	33.6	76.1	12.1	84.1	63.9	57.7
CSR[40]	AAAI’24	80.1	36.2	68.4	38.0	80.4	68.0	61.9
CLIP Based
AD-CLIP[35]	ICCV’23	73.7	11.5	66.4	11.5	84.8	66.3	52.4
MPA[17]	NeurIPS’23	75.4	49.2	69.4	15.5	83.7	65.3	59.8
DAPL[16]	TNNLS’23	75.2	49.4	68.8	0.3	83.7	67.6	57.5
PDA[25]	AAAI’24	76.1	50.0	70.6	17.0	84.1	67.1	60.8
PGA[44]	NeurIPS’24	75.1	51.1	69.4	14.5	84.5	67.4	60.3
DAMP[18]	CVPR’24	75.7	54.1	71.7	20.1	84.9	68.2	62.5
M $\text{P}^{2}$ A-Prompt	-	78.2	55.1	71.4	17.9	86.1	70.0	63.1
M $\text{P}^{2}$ A-LoRA	-	80.2	54.0	72.6	18.8	85.7	71.7	63.8
M $\text{P}^{2}$ A-Adapter	-	81.0	52.8	73.2	19.3	85.7	72.4	64.1

IV-B Implementation Details

We adopt CLIP ViT-B/16 as the backbone architecture for all methods. The textual prompts, visual PEFT modules, and autoencoder components are optimized using the Adam optimizer with a cosine annealing learning-rate schedule. For the autoencoder, the projection layers are configured as follows: $W_{2}$ and $W_{3}$ generate intermediate and final embeddings of dimensions $e_{2}=384$ and $e_{3}=512$, while the input projection layer $W_{1}$ produces an embedding of dimension $e_{1}$, which is 150 for ImageCLEF and Office-Home, and 300 for DomainNet.

The confidence threshold $\beta$ for the refine-and-rehearse phase is fixed at 0.8 for all datasets. The pseudo-label confidence threshold $\tau$ used during initial prompt learning is dataset-specific. We choose $\tau=0.4$ for DomainNet, and $\tau=0.6$ for ImageCLEF and Office-Home. For initial pseudo-label generation, we adopt an average-confidence voting strategy for ImageCLEF and Office-Home, and majority voting for DomainNet due to its larger number of domains. The number of curriculum clusters is set to $T=3$ for all datasets. Additional hyperparameter settings are detailed in the subsequent sections.
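To make the two voting strategies concrete, the sketch below contrasts them on a single target sample. Only the thresholds $\tau$ and the per-dataset choice of strategy come from the text. Everything else is an assumption: each voter is taken to be one source-pretrained model emitting class probabilities, the confidence of a majority vote is taken to be the mean probability that all voters assign to the winning label, a sample is kept when its confidence is at least $\tau$, and all function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Settings stated in the text.
TAU = {"ImageCLEF": 0.6, "Office-Home": 0.6, "DomainNet": 0.4}
VOTING = {"ImageCLEF": "average", "Office-Home": "average", "DomainNet": "majority"}

Probs = Sequence[Sequence[float]]  # one probability vector per voting model


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


def average_confidence_vote(model_probs: Probs) -> Tuple[int, float]:
    """Average the probability vectors, then take the most likely class."""
    n = len(model_probs)
    mean = [sum(p[c] for p in model_probs) / n for c in range(len(model_probs[0]))]
    label = argmax(mean)
    return label, mean[label]


def majority_vote(model_probs: Probs) -> Tuple[int, float]:
    """Each model votes for its top class; ties go to the earliest vote."""
    votes = Counter(argmax(p) for p in model_probs)
    label, _ = votes.most_common(1)[0]
    # Assumption: confidence is the mean probability given to the winning label.
    confidence = sum(p[label] for p in model_probs) / len(model_probs)
    return label, confidence


def pseudo_label(model_probs: Probs, dataset: str) -> Optional[int]:
    """Return a pseudo-label, or None if the sample falls below tau."""
    vote = average_confidence_vote if VOTING[dataset] == "average" else majority_vote
    label, confidence = vote(model_probs)
    return label if confidence >= TAU[dataset] else None


# Hypothetical predictions from three source-pretrained models on one sample.
probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
print(pseudo_label(probs, "Office-Home"))  # None
print(pseudo_label(probs, "DomainNet"))    # 1
```

On this example the two strategies disagree: averaging favors class 0 but with a confidence below 0.6, so the sample is discarded, whereas the majority vote selects class 1 and passes the lower DomainNet threshold.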

IV-C Comparison to State-of-the-Art

We present a comprehensive evaluation of our proposed method, M $\text{P}^{2}$ A, on three standard domain adaptation benchmarks. The performance on ImageCLEF and Office-Home is detailed in Table I, while the results for the more challenging DomainNet are in Table II. For all experiments, we report the performance of M $\text{P}^{2}$ A integrated with three distinct PEFT modules: Visual Prompts, LoRA, and Adapters, as discussed in Section III-A.

On the ImageCLEF dataset, M $\text{P}^{2}$ A with LoRA establishes a new state of the art, achieving an average accuracy of 94.3%, a 1.5-point improvement over the next-best methods, ICON and DAPL (both 92.8%). It also attains the highest accuracy on every target domain: 98.8% on C, 99.0% on I, and 85.0% on P. Given the small scale of ImageCLEF and the inherent strength of the CLIP backbone, these gains are particularly noteworthy. The M $\text{P}^{2}$ A-Prompt variant also edges past existing methods with an average accuracy of 92.9%. In contrast, M $\text{P}^{2}$ A-Adapter degrades markedly on this dataset: its average accuracy of 89.4% falls below the 92.2% of zero-shot CLIP, suggesting potential optimization challenges on smaller-scale benchmarks.

Moving to the more complex Office-Home dataset, M $\text{P}^{2}$ A-LoRA continues to perform strongly, achieving an average accuracy of 91.8% and outperforming all prior approaches by at least 2.6 points. It matches or exceeds every other method on all four domains, with 93.6% on Art, 83.2% on Clipart, 94.8% on Product, and 95.6% on Real World. These consistent gains underscore the robustness of our progressive alignment strategy in learning domain-invariant features across diverse visual styles. As on ImageCLEF, M $\text{P}^{2}$ A-Prompt also delivers strong results, outperforming the next-best method, ICON, by 0.3 points. Although M $\text{P}^{2}$ A-Adapter still does not achieve the top performance, its accuracy of 88.9% is competitive, falling short only of ICON among the baselines.

Finally, on the most challenging benchmark, DomainNet, we observe a notable shift in the efficacy of the PEFT modules. Here, M $\text{P}^{2}$ A-Adapter achieves the highest overall accuracy at 64.1%, surpassing all previous state-of-the-art methods. It improves upon DAMP, a strong CLIP-based baseline, by 1.6 points and outperforms zero-shot CLIP by 7.7 points. This superior performance suggests that the Adapter module’s architectural complexity may require more training data for effective optimization. This finding is consistent with its weaker performance on the smaller ImageCLEF dataset and better performance on the mid-scale Office-Home. While not the top performer in this instance, M $\text{P}^{2}$ A-LoRA still demonstrates strong results with an accuracy of 63.8%, outperforming DAMP by 1.3 points. Furthermore, M $\text{P}^{2}$ A-Prompt also surpasses other state-of-the-art methods with an average accuracy of 63.1%. The fact that all three PEFT variants of our method outperform existing approaches on this difficult benchmark highlights the effectiveness of our proposed method in handling severe domain shifts, noisy pseudo-labels, and the inherent complexity of large-scale, multi-source adaptation.

TABLE III: Ablation study on the number of clusters $T$. Accuracy (%) on Office-Home.

	Visual Prompt					LoRA					Adapter
Cluster Number	$\rightarrow$ Ar	$\rightarrow$ Cl	$\rightarrow$ Pr	$\rightarrow$ Rw	Avg	$\rightarrow$ Ar	$\rightarrow$ Cl	$\rightarrow$ Pr	$\rightarrow$ Rw	Avg	$\rightarrow$ Ar	$\rightarrow$ Cl	$\rightarrow$ Pr	$\rightarrow$ Rw	Avg
$T=2$	89.3	79.8	94.0	93.2	88.9	93.4	83.0	94.4	94.8	91.4	90.6	82.4	92.2	89.5	88.7
$T=5$	90.1	79.0	94.8	92.2	89.0	92.9	83.5	94.2	95.4	91.5	90.4	80.6	92.2	88.9	88.0
$T=10$	89.4	78.7	94.0	92.2	88.6	92.0	82.7	94.1	94.7	90.9	88.3	81.7	93.0	89.4	88.1
$T=3$	90.2	80.6	94.8	92.5	89.5	93.4	83.6	94.8	95.6	91.9	90.3	83.2	92.9	89.5	88.9

IV-D Analysis on Progressive Alignment

To rigorously evaluate the efficacy of our progressive alignment strategy, we conduct a detailed empirical analysis on the Office-Home dataset. This analysis first systematically compares the performance of all three PEFT modules with and without the progressive alignment component. The results are shown in Figure 4. As observed, incorporating progressive alignment consistently leads to performance improvements across all PEFT modules, demonstrating the broad applicability of the proposed strategy.

Specifically, we observe average performance gains of 1.4% for Visual Prompt, 0.9% for LoRA, and 2.1% for Adapter, respectively. These consistent gains show the value of the proposed progressive alignment. By guiding the model through a structured, easy-to-hard curriculum, our strategy effectively mitigates the impact of noisy pseudo-labels, particularly during the critical early stages of training, thereby enhancing the model’s robustness.

A closer inspection of the domain-wise performance offers further insight into the mechanism of our approach. For the Real World domain, which bears a closer visual resemblance to the source domains, the improvement is minimal, averaging only 0.2%. In contrast, the Clipart domain, characterized by its abstract and stylized visual properties, exhibits a substantial average improvement of 2.4%. This gain suggests that our progressive alignment strategy is particularly advantageous for target domains with a large distributional gap from the source, as it bridges this gap by starting with easier, more transferable features before progressing to harder, more domain-specific ones.

To qualitatively illustrate the effect of our progressive alignment strategy, we visualize the feature space alignment between the source and target domains. Using the t-SNE algorithm, we plot the features extracted by the visual encoder for the Office-Home dataset, with the Art domain serving as the source and Real World as the target. As shown in Figure 5, a baseline model without our strategy exhibits a clear separation between the source and target feature clusters. In contrast, after applying our proposed method, the features from both domains become closely aligned. This significant increase in alignment demonstrates that our strategy successfully bridges the domain gap, confirming its effectiveness in facilitating the learning of robust, domain-invariant representations.

We also perform an ablation study to analyze the effect of the number of clusters, with the results summarized in Table III. Using three clusters yields the best average accuracy for all three PEFT modules. However, the exact number of clusters is not a critical hyperparameter: the average accuracy of each module varies by at most 1% across the tested values, indicating the robustness of our method to this choice. Interestingly, performance generally declines as the number of clusters grows beyond three. This suggests that using too many learning cycles may over-fragment the data, causing the model to overfit to small, isolated subsets of samples.
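The sensitivity to the number of clusters can be checked directly from the average-accuracy columns of Table III. The short script below is only a convenience for reproducing that arithmetic; the dictionary layout and the helper name `summarize` are our own.

```python
# Average accuracy (%) on Office-Home from Table III, keyed by cluster count T.
TABLE_III_AVG = {
    "Visual Prompt": {2: 88.9, 3: 89.5, 5: 89.0, 10: 88.6},
    "LoRA": {2: 91.4, 3: 91.9, 5: 91.5, 10: 90.9},
    "Adapter": {2: 88.7, 3: 88.9, 5: 88.0, 10: 88.1},
}


def summarize(avg_by_t):
    """Return the best-performing T and the max-min spread in points."""
    best_t = max(avg_by_t, key=avg_by_t.get)
    spread = round(max(avg_by_t.values()) - min(avg_by_t.values()), 1)
    return best_t, spread


summary = {module: summarize(avgs) for module, avgs in TABLE_III_AVG.items()}
for module, (best_t, spread) in summary.items():
    print(f"{module}: best T={best_t}, spread={spread} points")
```

For every module the best setting is $T=3$, and the spread between the best and worst tested setting is 0.9, 1.0, and 0.9 points for Visual Prompt, LoRA, and Adapter, respectively.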

Overall, these comprehensive experiments highlight the clear advantage of learning from a structured, easy-to-hard curriculum rather than treating all target samples uniformly. The consistent gains across different modules and datasets demonstrate that our “learn, refine, and rehearse” cycle not only stabilizes the training process but also enhances generalization across diverse and challenging domain shifts.

IV-E Analysis on Source Pretraining

To validate the impact of our source pretraining strategy, we conduct a targeted ablation study comparing the performance of all three PEFT modules with and without this initial step. The results, detailed in Figure 6, reveal that this pretraining is crucial for achieving optimal performance.

As shown, its inclusion consistently yields higher accuracy across all modules, with substantial performance gains of 2.4% for Visual Prompt, 2.3% for LoRA, and 1.7% for Adapter. Unlike the progressive alignment strategy, source pretraining provides a universal performance lift across all domains. This demonstrates that leveraging supervised signals from the source domains enables a more robust initial adaptation and, critically, more reliable target label estimation. The resulting improvement in pseudo-label quality serves as a stronger learning signal during the early stages of training, leading to more stable convergence and superior downstream performance. This confirms that our simple source pretraining strategy provides a valuable foundation for generating high-confidence pseudo-labels, which are crucial for the success of our progressive learning strategy.

IV-F Hyperparameter Ablation Study

A comprehensive analysis of key hyperparameters is conducted to evaluate the sensitivity and robustness of our framework. The experimental results are summarized in Figure 7 for Visual Prompt, Figure 8 for LoRA, and Figure 9 for Adapter, respectively. We investigate the effect of four key factors: the rehearse threshold $\beta$ , the initial pseudo-label selection threshold $\tau$ , the textual prompt lengths $M_{1}$ and $M_{2}$ , and PEFT-module-specific parameters.

We first analyze the impact of the initial pseudo-label selection threshold $\tau$ on model performance. As shown, the best results for all PEFT modules are obtained at $\tau=0.6$ . We observe a clear trend where performance improves as $\tau$ increases from 0.4 to 0.6, followed by a slight decline when $\tau$ reaches 0.7. Notably, while Visual Prompt and LoRA remain relatively stable across different $\tau$ values, we find larger fluctuations for Adapter, suggesting that its optimization is more sensitive to the pseudo-label quality.

Next, we examine the rehearse threshold $\beta$ . Performance remains relatively stable for $\beta$ values of 0.6, 0.7, and 0.8 across all PEFT modules. However, at $\beta=0.9$ , performance drops sharply. This is because such a high threshold filters out too many samples for subsequent stages, causing catastrophic forgetting of previously learned knowledge. For smaller $\beta$ values, while more samples are retained, the additional training data does not lead to notable performance improvements but requires greater computational resources. We find that $\beta=0.8$ provides the best trade-off between performance and efficiency. This is particularly important for large-scale datasets such as DomainNet, where the initial $\tau$ is set to 0.4. Our experiments show that this setting improves training efficiency by approximately 30%.

We then study the influence of the textual prompt lengths $M_{1}$ and $M_{2}$ , where we set $M_{1}=M_{2}$ for simplicity. The best performance is achieved at $M_{1}=M_{2}=12$ . We hypothesize that shorter prompts lack sufficient semantic capacity to guide adaptation effectively, while overly long prompts introduce redundancy or noise, which may hinder training. This result highlights the importance of selecting an optimal prompt length to balance expressiveness and regularization.

Finally, we examine module-specific hyperparameters: visual prompt length $M_{3}$ , LoRA rank $r_{1}$ , and Adapter inner dimension $r_{2}$ . Both $M_{3}$ and $r_{1}$ yield stable performance across tested values, with peak performance at $M_{3}=20$ and $r_{1}=8$ . For $r_{2}$ , however, we observe that increasing its value tends to decrease overall accuracy, and thus we choose $r_{2}=16$ . This finding once again underscores that the Adapter module, while powerful, presents greater optimization challenges and is more sensitive to its hyperparameter configuration.

V Conclusion

In this paper, we presented M $\text{P}^{2}$ A, a progressive framework for multi-source unsupervised domain adaptation that adapts CLIP through a structured “learn, refine, and rehearse” cycle. By organizing the adaptation process using balanced class clustering and gradually expanding from easy to hard samples, M $\text{P}^{2}$ A improves training stability and robustness. Specifically, during the learning phase, we train domain-specific textual prompts and a shared visual PEFT module, aligning prompts through a denoising autoencoder with alignment regularization. In the refine-and-rehearse phase, confident samples from previous stages are refined with the updated model and reused to guide training in subsequent stages, enabling efficient knowledge transfer. Furthermore, to enhance pseudo-label quality, we ensemble source-pretrained CLIP models for more reliable target supervision. Experiments on ImageCLEF, Office-Home, and DomainNet show that M $\text{P}^{2}$ A achieves state-of-the-art performance across all benchmarks. These results highlight the effectiveness of progressive alignment and parameter-efficient adaptation in vision-language models under domain shift.

References

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.
[2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” NeurIPS, 2020.
[3] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.
[4] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Eds., Dataset Shift in Machine Learning. The MIT Press, 2008.
[5] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in CVPR, 2011.
[6] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang, “Domain adaptation under target and conditional shift,” in ICML, 2013.
[7] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defying forgetting in classification tasks,” TPAMI, 2021.
[8] H. Chen, M. Goldblum, Z. Wu, and Y.-G. Jiang, “Adaptive retention & correction: Test-time training for continual learning,” in ICLR, 2025.
[9] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister, “Learning to prompt for continual learning,” in CVPR, 2022.
[10] S. J. Pan and Q. Yang, “A survey on transfer learning,” TKDE, 2009.
[11] G. Csurka, “Domain adaptation for visual applications: A comprehensive survey,” arXiv preprint arXiv:1702.05374, 2017.
[12] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, 2018.
[13] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin, “Deep cocktail network: Multi-source unsupervised domain adaptation with category shift,” in CVPR, 2018.
[14] P. Singhal, R. Walambe, S. Ramanna, and K. Kotecha, “Domain adaptation: challenges, methods, datasets, and applications,” IEEE Access, 2023.
[15] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[16] C. Ge, R. Huang, M. Xie, Z. Lai, S. Song, S. Li, and G. Huang, “Domain adaptation via prompt learning,” TNNLS, 2023.
[17] H. Chen, X. Han, Z. Wu, and Y.-G. Jiang, “Multi-prompt alignment for multi-source unsupervised domain adaptation,” NeurIPS, 2023.
[18] Z. Du, X. Li, F. Li, K. Lu, L. Zhu, and J. Li, “Domain-agnostic mutual prompting for unsupervised domain adaptation,” in CVPR, 2024.
[19] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness, “Pseudo-labeling and confirmation bias in deep semi-supervised learning,” in IJCNN, 2020.
[20] K. Mei, C. Zhu, J. Zou, and S. Zhang, “Instance adaptive self-training for unsupervised domain adaptation,” in ECCV, 2020.
[21] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves imagenet classification,” in CVPR, 2020.
[22] Y. Zou, Z. Yu, B. Kumar, and J. Wang, “Unsupervised domain adaptation for semantic segmentation via class-balanced self-training,” in ECCV, 2018.
[23] P. S. Bradley, K. P. Bennett, and A. Demiriz, “Constrained k-means clustering,” Microsoft Research, Redmond, 2000.
[24] M. I. Malinen and P. Fränti, “Balanced k-means for clustering,” in Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR), 2014.
[25] S. Bai, M. Zhang, W. Zhou, S. Huang, Z. Luan, D. Wang, and B. Chen, “Prompt-based distribution alignment for unsupervised domain adaptation,” in AAAI, 2024.
[26] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” IJCV, 2022.
[27] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in CVPR, 2022.
[28] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in ICML, 2019.
[29] R. Zhang, R. Fang, W. Zhang, P. Gao, K. Li, J. Dai, Y. Qiao, and H. Li, “Tip-adapter: Training-free clip-adapter for better vision-language modeling,” arXiv preprint arXiv:2111.03930, 2021.
[30] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models,” in ICLR, 2022.
[31] J. Deng, Z. Zhang, F. Eyben, and B. Schuller, “Autoencoder-based unsupervised domain adaptation for speech emotion recognition,” IEEE Signal Processing Letters, 2014.
[32] B. Caputo, H. Müller, J. Martinez-Gomez, M. Villegas, B. Acar, N. Patricia, N. Marvasti, S. Üsküdarlı, R. Paredes, M. Cazorla et al., “Imageclef 2014: Overview and analysis of the results,” in International conference of the cross-language evaluation forum for European languages, 2014.
[33] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan, “Deep hashing network for unsupervised domain adaptation,” in CVPR, 2017.
[34] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang, “Moment matching for multi-source domain adaptation,” in ICCV, 2019.
[35] M. Singha, H. Pal, A. Jha, and B. Banerjee, “Ad-clip: Adapting domains in prompt space using clip,” in ICCV, 2023.
[36] J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept detection using adaptive svms,” in ACM MM, 2007.
[37] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon, “Adversarial multiple source domain adaptation,” NeurIPS, 2018.
[38] Y. Zhu, F. Zhuang, and D. Wang, “Aligning domain-specific distribution and classifier for cross-domain classification from multiple sources,” in AAAI, 2019.
[39] H. Liu, J. Wang, and M. Long, “Cycle self-training for domain adaptation,” NeurIPS, 2021.
[40] C. Zhou, Z. Wang, B. Du, and Y. Luo, “Cycle self-refinement for multi-source domain adaptation,” in AAAI, 2024.
[41] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in ICML, 2021.
[42] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022.
[43] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” NeurIPS, 2022.
[44] V. H. Phan, T. L. Tran, Q. Tran, and T. Le, “Enhancing domain adaptation through prompt gradient alignment,” NeurIPS, 2024.
[45] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in ICML, 2009.
[46] Y. Zhang, P. David, and B. Gong, “Curriculum domain adaptation for semantic segmentation of urban scenes,” in ICCV, 2017.
[47] T. Zhou and J. Bilmes, “Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity,” in ICLR, 2018.
[48] C. Chen, W. Xie, W. Huang, Y. Rong, X. Ding, Y. Huang, T. Xu, and J. Huang, “Progressive feature alignment for unsupervised domain adaptation,” in CVPR, 2019.
[49] N. Karim, N. C. Mithun, A. Rajvanshi, H.-p. Chiu, S. Samarasekera, and N. Rahnavard, “C-sfda: A curriculum learning aided self-training framework for efficient source free domain adaptation,” in CVPR, 2023.
[50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, 2017.
[51] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[52] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in ECCV, 2022.
[53] W. Wang, Y. Huang, Y. Wang, and L. Wang, “Generalized autoencoder: A neural network framework for dimensionality reduction,” in CVPR Workshop, 2014.
[54] C. Fefferman, S. Mitter, and H. Narayanan, “Testing the manifold hypothesis,” Journal of the American Mathematical Society, 2016.
[55] M. Ilse, J. M. Tomczak, C. Louizos, and M. Welling, “Diva: Domain invariant variational autoencoders,” in Medical Imaging with Deep Learning, 2020.
[56] Z. Yue, Q. Sun, and H. Zhang, “Make the u in uda matter: Invariant consistency learning for unsupervised domain adaptation,” NeurIPS, 2023.
