UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration

Zihan Cheng¹, Liangtai Zhou¹, Dian Chen¹, Ni Tang¹, Xiaotong Luo¹, Yanyun Qu¹

Abstract

All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address its core challenges, we propose a novel unified image restoration framework based on latent diffusion models (LDMs). Our approach structurally integrates low-quality visual priors into the diffusion process, unlocking the powerful generative capacity of diffusion models for diverse degradations. Specifically, we design a Degradation-Aware Feature Fusion (DAFF) module to enable adaptive handling of diverse degradation types. Furthermore, to mitigate detail loss caused by the high compression and iterative sampling of LDMs, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.

Introduction

Traditional image restoration (IR) focuses on a specific degradation type such as denoising, deblurring, dehazing, deraining, or low-light enhancement. While task-specific models have achieved impressive results in their respective domains, they typically lack generalization capability. Such single-purpose models are difficult to scale in practical applications, as switching between restoration tasks often requires retraining or task-specific adaptation, resulting in high deployment overhead and limited flexibility. This has spurred increasing interest in All-in-One Image Restoration (AiOIR), which aims to develop unified frameworks capable of handling diverse degradations under a single model.

Recent AiOIR approaches have explored a variety of strategies to achieve generality, including task-specific architectures (Li et al. 2022), task-specific priors (Valanarasu, Yasarla, and Patel 2022), frequency-aware modulation (Cui et al. 2024), contrastive learning (Chen et al. 2022b), and prompt-driven guidance (Zeng et al. 2025). However, these methods typically rely on predefined degradation types (Potlapalli et al. 2023) and rigid priors, making them less effective in handling real-world images where degradations are often compound, spatially heterogeneous, and lack explicit degradation type labels. Such limitations highlight the need for a more flexible and powerful generative framework.

Figure 1: MUSIQ-based comparison across five restoration tasks. Our method achieves consistently superior results.

In this context, diffusion models, particularly Latent Diffusion Models (LDMs), have emerged as promising candidates for unified image restoration, owing to their strong generative priors and success in high-fidelity image synthesis and cross-modal tasks (Chen, Pan, and Dong 2025; Jiang et al. 2024a). However, current diffusion-based AiOIR methods commonly rely on global textual or visual prompts, each with intrinsic limitations: textual prompts offer only global semantic cues and lack spatial specificity (Jiang et al. 2024b), while visual prompts depend on pre-trained degradation encoders or modality-specific tokens, assuming known and spatially uniform degradation types (Luo et al. 2023). As a result, prompt-based conditioning constrains the model’s capacity to localize and respond to fine-grained degradations, limiting its adaptability despite the inherent generative strength of diffusion models. Additionally, LDMs often suffer from detail loss due to the high compression of VAE encoders (Kingma, Welling et al. 2013) and the progressive nature of iterative sampling (Dai et al. 2023), resulting in blurred textures and incomplete structures.

To address these challenges, we propose UniLDiff, a novel LDM-based unified image restoration framework that explicitly integrates degradation-aware and detail-aware mechanisms into the diffusion process. Specifically, we introduce the Degradation-Aware Feature Fusion (DAFF) module into the early layers of the diffusion UNet to inject low-quality (LQ) features into every denoising step. By leveraging double-stream attention and adaptive modulation, DAFF provides dynamic, spatially adaptive guidance, enabling the model to perceive and adapt to diverse and heterogeneous degradations throughout the generative trajectory, without relying on predefined degradation assumptions. Furthermore, we design a Detail-Aware Expert Module (DAEM) within the decoder, which leverages decoder features and skip-connected encoder cues to dynamically activate expert branches, thereby enhancing high-frequency details and suppressing structural artifacts. A preview of our results is shown in Figure 1. UniLDiff consistently outperforms previous methods across five representative tasks in terms of perceptual quality, demonstrating its robustness and generalization in unified multi-task restoration.

Our contributions can be summarized as follows:

• We propose UniLDiff, a unified LDM-based framework that integrates restoration priors into the denoising process for flexible and high-fidelity image restoration.
• We design the DAFF module, which performs step-wise alignment between low-quality features and evolving latent variables, enabling the model to implicitly adapt to diverse degradations without predefined labels.
• We develop the DAEM, which dynamically routes decoder features through expert branches guided by encoder cues, enabling targeted restoration of fine textures and structural details.
• Extensive experiments across multi-task and composite restoration benchmarks demonstrate that UniLDiff achieves superior perceptual quality and generalization compared to state-of-the-art methods.

Related Work

Non-Diffusion-Based All-in-One Image Restoration

Traditional AiOIR methods are primarily built upon Convolutional Neural Networks (CNNs) or Transformer architectures, aiming to improve adaptability to diverse degradation types through the incorporation of task-specific structures and priors. In recent years, various strategies have been proposed to enhance model generalization, including task-specific architectural designs (Li et al. 2022), explicit prior modeling (Valanarasu, Yasarla, and Patel 2022), frequency-aware modulation (Cui et al. 2024), contrastive learning (Chen et al. 2022b), and MoE-based frameworks (Zamfir et al. 2025). Among these, prompt learning has emerged as a dominant paradigm in AiOIR. It enables task-conditioned modeling through visual prompts (Zeng et al. 2025; Tian et al. 2025; Cui et al. 2024), textual prompts (Yan et al. 2025), or multimodal prompts (Duan et al. 2024). While these approaches improve modularity and task awareness, they often rely on predefined degradation types and rigid priors (Potlapalli et al. 2023), limiting their effectiveness in real-world scenarios where degradations are compound, spatially variable, and label-agnostic.

Diffusion-Based All-in-One Image Restoration

Recently, diffusion models have demonstrated strong potential in AiOIR, leveraging their powerful generative priors and high-fidelity image synthesis capabilities. AutoDIR (Jiang et al. 2024b) pioneered text-guided diffusion by using natural language prompts for restoration. MPerceiver (Ai et al. 2024) extended this approach with multimodal CLIP features to handle generalized degradations. DiffUIR (Zheng et al. 2024) further introduced a selective hourglass mapping strategy that combines strong condition guidance with shared distribution modeling for unified multi-task restoration. DaCLIP (Luo et al. 2023) leverages contrastive language-image pretraining to align degradation-aware prompts with visual features, enabling adaptive restoration under diverse degradation types. These works demonstrate the versatility of diffusion models in AiOIR, particularly in scenarios involving mixed or unknown degradations. However, existing approaches often rely on global textual or visual prompts, which provide limited spatial specificity and typically assume known, uniform degradation types. This restricts the model’s ability to localize and respond to compound or fine-grained degradations that are common in real-world scenarios.

Method

We present UniLDiff, a unified image restoration framework that unlocks the power of diffusion priors. As illustrated in Figure 2, UniLDiff employs a pre-trained VAE encoder to project both low-quality (LQ) and high-quality (HQ) images into a latent space, extracting multi-scale features $f^{LQ}$ and $f^{HQ}$ . The forward diffusion process adds noise to $f^{HQ}$ to produce latent variables $X_{t}^{HQ}$ at each timestep $t$ . To inject degradation-aware guidance into the denoising trajectory, the DAFF module is integrated into the early layers of the diffusion UNet, aligning $f^{LQ}$ with $X_{t}^{HQ}$ at every step. Additionally, the DAEM is applied in the decoder to enhance high-frequency textures and preserve fine-grained structural details.

Degradation-Aware Feature Fusion (DAFF)

In diffusion-based image restoration, a common practice is to simply concatenate or add the LQ features $f^{LQ}$ with the latent variable $X_{t}^{HQ}$ at each timestep $t$ . However, this naive fusion can be suboptimal: as $X_{t}^{HQ}$ progressively approaches the clean HQ representation, static $f^{LQ}$ may introduce interference, undermining structural consistency in later stages.

To mitigate this, we introduce the DAFF module, which dynamically aligns LQ features with evolving latent representations at each diffusion timestep, providing degradation-aware modulation to condition the denoising trajectory according to the inferred degradation type. Inspired by FLUX (Labs et al. 2025), we adopt a cascaded fusion design combining double-stream and single-stream pathways (Fig. 3), leveraging joint and disentangled feature modeling.

Double Stream Block.

We apply LayerNorm and conditional modulation to $f^{LQ}$ and $X_{t}^{HQ}$ independently. Each branch computes its own query, key, and value matrices, which are then concatenated and passed through a shared attention module:

$$Q_{t}^{D},K_{t}^{D},V_{t}^{D}=\text{concat}[\text{QKV}(f^{LQ}),\text{QKV}(X_{t}^{HQ})] \tag{1}$$
$$A_{t}^{D}=\text{softmax}\left(\frac{Q_{t}^{D}(K_{t}^{D})^{T}+PE}{\sqrt{d}}\right)V_{t}^{D} \tag{2}$$

The attention output $A_{t}^{D}$ is projected back via a gating mechanism to obtain $X_{t,gate}^{LQ}$ and $X_{t,gate}^{HQ}$ , enabling degradation-aware guidance and structural preservation. This bidirectional interaction surpasses naive fusion by reducing inconsistency and structural artifacts.
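The joint attention of Eqs. (1) and (2) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the LayerNorm, conditional modulation, and gating steps are omitted, the projection matrices are random, and the names `double_stream_attention`, `w_lq`, and `w_hq` are hypothetical. PE is assumed here to be an additive bias on the score matrix, as Eq. (2) suggests.

```python
import math
import random

def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Random matrix used as a stand-in for learned weights or features."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def double_stream_attention(f_lq, x_hq, w_lq, w_hq, pe=None):
    """Per-stream Q/K/V projections, then one attention over all tokens."""
    # Each stream projects its own tokens; list "+" concatenates along tokens.
    q = matmul(f_lq, w_lq["q"]) + matmul(x_hq, w_hq["q"])
    k = matmul(f_lq, w_lq["k"]) + matmul(x_hq, w_hq["k"])
    v = matmul(f_lq, w_lq["v"]) + matmul(x_hq, w_hq["v"])
    n, d = len(q), len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = matmul(q, k_t)                       # Q K^T, shape n x n
    if pe is None:
        pe = [[0.0] * n for _ in range(n)]        # assumption: no positional bias
    weights = [softmax([(s + p) / math.sqrt(d) for s, p in zip(srow, prow)])
               for srow, prow in zip(scores, pe)]
    out = matmul(weights, v)
    n_lq = len(f_lq)
    # Split the shared output back into the LQ and HQ streams.
    return out[:n_lq], out[n_lq:], weights

rng = random.Random(0)
dim = 4
w_lq = {name: rand_matrix(dim, dim, rng) for name in "qkv"}
w_hq = {name: rand_matrix(dim, dim, rng) for name in "qkv"}
f_lq = rand_matrix(3, dim, rng)   # 3 LQ tokens
x_hq = rand_matrix(5, dim, rng)   # 5 noisy HQ latent tokens
a_lq, a_hq, weights = double_stream_attention(f_lq, x_hq, w_lq, w_hq)
```

Because the softmax runs over the concatenated sequence, every HQ token can attend to every LQ token and vice versa, which is the bidirectional interaction described above.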

Single Stream Block.

The gated features are concatenated as $f_{t}^{cat}=\text{concat}[X_{t,gate}^{LQ},X_{t,gate}^{HQ}]$ , then normalized and linearly projected to produce the attention triplet and an auxiliary vector:

$$[Q_{t}^{S},K_{t}^{S},V_{t}^{S}],M=\text{Linear}_{1}(f_{t}^{cat}) \tag{3}$$

The attention is computed via scaled dot-product:

$$A_{t}^{S}=\text{softmax}\left(\frac{Q_{t}^{S}(K_{t}^{S})^{T}+PE}{\sqrt{d}}\right)V_{t}^{S} \tag{4}$$

Here, PE denotes the positional embedding. We concatenate $A_{t}^{S}$ with $\phi(M)$ and feed the result through a second linear layer. The output is combined with the input via gating to yield $f_{t}^{align}$ :

$$f_{t}^{align}=f_{t}^{cat}+g\cdot\text{Linear}_{2}(A_{t}^{S},\phi(M)) \tag{5}$$

where $\phi(\cdot)$ is a nonlinear activation and $g$ controls fusion strength.

This unified stream improves feature coherence and adaptively modulates degradation influence throughout diffusion, mitigating guidance mismatch and preserving structure.
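The gated residual update of Eq. (5) can be sketched per token as follows. This is only an illustration under stated assumptions: $\phi$ is taken to be GELU (the text only says it is a nonlinear activation), $g$ is a scalar, and the function names and the weight matrix `w2` are hypothetical.

```python
import math

def gelu(v):
    """tanh approximation of GELU, standing in for the unspecified phi."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def linear(vec, weight):
    """Apply a weight matrix (list of output rows) to one token vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def gated_residual(f_cat, attn, aux, w2, g):
    """f_align = f_cat + g * Linear2([A, phi(M)]), applied token by token."""
    aligned = []
    for token, a, m in zip(f_cat, attn, aux):
        merged = a + [gelu(v) for v in m]     # concatenate A with phi(M)
        update = linear(merged, w2)
        aligned.append([t + g * u for t, u in zip(token, update)])
    return aligned

f_cat = [[1.0, 2.0]]                 # one token with two channels
attn = [[0.5, -0.5]]                 # attention output A for that token
aux = [[0.0, 0.0]]                   # auxiliary vector M
w2 = [[1.0, 0.0, 0.0, 0.0],          # hypothetical Linear2 weights
      [0.0, 1.0, 0.0, 0.0]]
f_align = gated_residual(f_cat, attn, aux, w2, g=2.0)
```

With $g=0$ the block reduces to the identity, so the gate directly controls how strongly the fused signal perturbs the incoming features.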

To enhance semantic alignment, we also incorporate a content embedding $c$ from a pre-trained text encoder, fused with $f_{t}^{align}$ via a lightweight cross-attention mechanism. In summary, given the latent HQ feature $X_{t}^{HQ}$ at diffusion step $t$ , the degradation feature $f^{LQ}$ , and the content embedding $c$ , the model predicts the latent HQ feature $X_{t-1}^{HQ}$ at the previous timestep as follows:

$$X_{t-1}^{HQ}=\frac{1}{\sqrt{\alpha_{t}}}\left(X_{t}^{HQ}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\hat{\epsilon}_{\theta}(f_{t}^{align},c,t)\right)+\sigma_{t}z \tag{6}$$

where $\hat{\epsilon}_{\theta}$ is the predicted noise from the denoising UNet, conditioned on $f_{t}^{align}$ , $c$ , and $t$ . The variable $z\sim\mathcal{N}(0,I)$ adds stochasticity, while $\alpha_{t}$ , $\sigma_{t}$ , and $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$ are diffusion schedule parameters controlling denoising and noise levels.
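A single reverse step of Eq. (6) can be sketched in a few lines. The linear beta schedule, the function names, and the toy latent below are assumptions for illustration; the noise prediction is passed in directly rather than produced by a UNet.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumption). Lists are 1-indexed by timestep."""
    alphas, alpha_bars, running = [None], [None], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alphas.append(1.0 - beta)
        alpha_bars.append(running)          # cumulative product of alphas
    return alphas, alpha_bars

def reverse_step(x_t, eps_hat, t, alphas, alpha_bars, sigma_t=0.0, rng=None):
    """One ancestral sampling step from the latent at t to the latent at t-1."""
    rng = rng or random.Random()
    coef = (1.0 - alphas[t]) / math.sqrt(1.0 - alpha_bars[t])
    return [(x - coef * e) / math.sqrt(alphas[t]) + sigma_t * rng.gauss(0.0, 1.0)
            for x, e in zip(x_t, eps_hat)]

alphas, alpha_bars = make_schedule(10)
x0 = [0.3, -1.2, 0.7]                       # toy clean latent
eps = [0.5, -0.1, 1.0]                      # noise used in the forward process
# Forward-noise to t = 1, then undo it with the exact noise and sigma_t = 0.
x1 = [math.sqrt(alpha_bars[1]) * a + math.sqrt(1.0 - alpha_bars[1]) * e
      for a, e in zip(x0, eps)]
recovered = reverse_step(x1, eps, 1, alphas, alpha_bars)
```

At $t=1$, where $\bar{\alpha}_{1}=\alpha_{1}$, a perfect noise estimate with $\sigma_{t}=0$ recovers the clean latent exactly, which makes a convenient sanity check for an implementation of the update.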

Detail-Aware Expert Module (DAEM)

Although the DAFF module improves degradation perception and adaptation, LDMs still struggle to recover fine details, often resulting in texture blurring and structural loss in small objects. These limitations arise from the high compression ratio of the VAE encoder, which discards high-frequency signals, and the progressive sampling process, which introduces texture hallucinations (Dai et al. 2023; Zhu et al. 2023). Furthermore, in multi-degradation scenarios, spatially heterogeneous degradations challenge the expressiveness of a single reconstruction path.

To mitigate these issues, we introduce the DAEM—a specialized decoder component designed to selectively enhance high-frequency structures and texture fidelity across spatially diverse regions (see Fig. 4). Unlike conventional decoders that operate uniformly, DAEM leverages a Mixture-of-Experts (MoE) architecture to dynamically specialize its response based on local degradation context. More importantly, DAEM integrates encoder features via skip connections to retrieve early-stage, high-resolution cues that are otherwise lost in the VAE compression process. By fusing these fine-grained encoder features with the decoder’s latent representations, DAEM explicitly injects detail priors into the reconstruction pipeline—enabling targeted enhancement of textures, edges, and small structures.

Specifically, for each input $x$ , a lightweight routing function computes the expert assignment as:

$$\text{Router}(x)=\text{top-}k\left(\text{Softmax}(Wx+\boldsymbol{\xi})\right), \tag{7}$$

where $W$ denotes the router weight, $\boldsymbol{\xi}\sim\mathcal{N}(0,\sigma^{2})$ is sampled Gaussian noise, and $k$ is the number of activated experts. In our experiments, we empirically set $k=1$ . Each expert $E^{i}$ is built using lightweight NAFBlocks (Chen et al. 2022a) with varied receptive fields, allowing for multi-scale perception of both fine and coarse structures.
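The routing rule of Eq. (7) can be sketched as below. The function name, the toy router weights, and the choice to return both the selected indices and their gate probabilities are assumptions for illustration, not details taken from the paper.

```python
import math
import random

def noisy_topk_router(x, weight, k=1, sigma=0.0, rng=None):
    """Select the top-k experts from Softmax(Wx + xi), with xi ~ N(0, sigma^2)."""
    rng = rng or random.Random()
    # One logit per expert, perturbed by Gaussian noise.
    logits = [sum(w * v for w, v in zip(row, x)) + rng.gauss(0.0, sigma)
              for row in weight]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return chosen, [probs[i] for i in chosen]

weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three hypothetical experts
x = [0.2, 0.9]
experts, gates = noisy_topk_router(x, weight, k=1, sigma=0.0)
```

With $k=1$, as in the paper's experiments, each input activates a single expert. Setting `sigma` above zero during training perturbs the logits so that experts with similar scores are occasionally explored.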

To ensure global semantic coherence, a shared expert branch $S(\cdot)$ with transposed self-attention (Zamir et al. 2022) is added. The output of each expert is modulated by this global branch as:

$$\hat{y}_{E}^{i}=E^{i}(x)\otimes S(x), \tag{8}$$

where $\otimes$ denotes element-wise multiplication. This design effectively fuses localized, detail-specific refinement with global contextual guidance.
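Eq. (8) amounts to an element-wise product between the selected expert's output and the shared branch. In the toy sketch below, the experts and the shared branch are placeholder functions rather than NAFBlocks or transposed self-attention, and `daem_forward` is a hypothetical name.

```python
def daem_forward(x, experts, shared, expert_index):
    """Modulate the chosen expert's output by the shared branch, element-wise."""
    expert_out = experts[expert_index](x)
    shared_out = shared(x)
    return [e * s for e, s in zip(expert_out, shared_out)]

# Placeholder branches (assumptions): real experts would be learned modules.
experts = [
    lambda feats: [2.0 * v for v in feats],
    lambda feats: [v + 1.0 for v in feats],
]
shared = lambda feats: [0.5 for _ in feats]

y = daem_forward([1.0, 2.0], experts, shared, expert_index=1)
```

Since the shared branch multiplies every expert's output, it acts as a global mask on whatever local refinement the routed expert proposes.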

By bridging early encoder features and dynamic expert processing, DAEM equips the model with robust fine-detail recovery capabilities and enhanced adaptability to complex, spatially variant degradations.

Unified Training Strategy

Our framework comprises five core components: a pre-trained VAE encoder, a trainable VAE encoder for extracting degradation features, the DAFF module, a diffusion-based denoising network $\epsilon_{\theta}$ , and a VAE decoder integrated with the DAEM. In this work, we adopt Stable Diffusion XL (Podell et al. 2023) as our underlying latent diffusion model, leveraging its pre-trained components to initialize the encoder, decoder, and denoising UNet. The training procedure is organized into two stages: (1) degradation modeling with DAFF, and (2) detail refinement with DAEM.

Stage 1: Degradation Modeling

We first freeze the VAE encoder and denoising UNet, and train only the DAFF module. This allows DAFF to learn how to align LQ features $f^{LQ}$ with the noisy latent variable $X_{t}^{HQ}$ at each timestep, capturing degradation-aware priors to guide the denoising process. Once DAFF converges, we unfreeze the trainable LQ encoder and jointly optimize it with DAFF and the denoising network $\epsilon_{\theta}$ to improve coordination among feature extraction, fusion, and noise prediction. The training loss follows the standard diffusion noise estimation objective:

$$\mathcal{L}_{\text{stage-1}}=\left\|\epsilon-\hat{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_{t}}X_{0}^{HQ}+\sqrt{1-\bar{\alpha}_{t}}\epsilon,f^{LQ},c,t\right)\right\|_{1}, \tag{9}$$

where $X_{0}^{HQ}$ is the clean HQ latent, $\epsilon\sim\mathcal{N}(0,I)$ is Gaussian noise, and $c$ is the auxiliary text embedding.
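The Stage 1 objective can be illustrated with a small sketch, assuming the standard DDPM noising $\sqrt{\bar{\alpha}_{t}}X_{0}^{HQ}+\sqrt{1-\bar{\alpha}_{t}}\epsilon$ and an L1 norm taken as a sum of absolute differences. The function names are hypothetical, and an oracle that inverts the noising stands in for the denoising network.

```python
import math
import random

def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e
            for a, e in zip(x0, eps)]

def stage1_loss(eps, eps_hat):
    """L1 distance between the true and the predicted noise."""
    return sum(abs(a - b) for a, b in zip(eps, eps_hat))

def oracle_predictor(x_t, x0, alpha_bar_t):
    """Stand-in for the UNet: recovers the noise exactly from x_t and x0."""
    return [(xt - math.sqrt(alpha_bar_t) * a) / math.sqrt(1.0 - alpha_bar_t)
            for xt, a in zip(x_t, x0)]

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(6)]      # toy clean latent
eps = [rng.gauss(0.0, 1.0) for _ in range(6)]     # sampled Gaussian noise
alpha_bar_t = 0.7                                 # hypothetical schedule value
x_t = add_noise(x0, eps, alpha_bar_t)
loss_oracle = stage1_loss(eps, oracle_predictor(x_t, x0, alpha_bar_t))
loss_zero = stage1_loss(eps, [0.0] * len(eps))    # predictor that outputs zeros
```

The oracle drives the loss to numerical zero, while a predictor that always outputs zeros pays the full L1 norm of the noise. A trained network should land between these two extremes.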

Stage 2: Detail Refinement

After the degradation modeling stage, we fine-tune the VAE decoder together with the DAEM to enhance the recovery of fine-grained details. This phase is supervised by a composite loss that combines pixel-wise reconstruction loss $\mathcal{L}_{\text{recon}}$ and structural similarity loss $\mathcal{L}_{\text{ssim}}$ , ensuring both low-level accuracy and structural consistency. To fully leverage the MoE architecture and prevent expert collapse, we introduce an auxiliary load-balancing loss $\mathcal{L}_{\text{aux}}$ (Riquelme et al. 2021), which encourages uniform expert utilization across batches. Details of its formulation are provided in the Appendix.

The total objective for this stage is:

$$\mathcal{L}_{\text{stage-2}}=\mathcal{L}_{\text{recon}}+\lambda_{1}\mathcal{L}_{\text{ssim}}+\lambda_{2}\mathcal{L}_{\text{aux}}, \tag{10}$$

where $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters controlling the trade-off between structural preservation and expert diversity.
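The weighted sum in Eq. (10) can be sketched as follows. The paper defers the exact form of $\mathcal{L}_{\text{aux}}$ to its Appendix, so the load-balancing term here is an assumption: the squared coefficient of variation of per-expert importance, a common choice in the MoE literature. The $\lambda$ values and function names are likewise hypothetical.

```python
def load_balance_penalty(gate_probs):
    """Assumed auxiliary loss: squared coefficient of variation of importance.

    gate_probs holds one row of router probabilities per sample.
    """
    num_experts = len(gate_probs[0])
    importance = [sum(row[i] for row in gate_probs) for i in range(num_experts)]
    mean = sum(importance) / num_experts
    var = sum((v - mean) ** 2 for v in importance) / num_experts
    return var / (mean ** 2 + 1e-12)

def stage2_loss(l_recon, l_ssim, l_aux, lambda1=0.1, lambda2=0.01):
    """Composite objective; lambda1 and lambda2 are illustrative values only."""
    return l_recon + lambda1 * l_ssim + lambda2 * l_aux

balanced = load_balance_penalty([[0.5, 0.5], [0.5, 0.5]])    # equal usage
collapsed = load_balance_penalty([[1.0, 0.0], [1.0, 0.0]])   # one expert only
total = stage2_loss(l_recon=1.0, l_ssim=0.5, l_aux=collapsed)
```

Under this assumed form the penalty vanishes when all experts receive equal total gate mass and grows as routing collapses onto a single expert, which is the behavior the load-balancing loss is meant to encourage.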

Type	Method	Venue	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	DISTS $\downarrow$	CLIPIQA $\uparrow$	NIQE $\downarrow$	MUSIQ $\uparrow$	MANIQA $\uparrow$
Non-Diff	AirNet	CVPR2022	31.11	0.9068	0.0925	0.0950	0.6405	3.3911	66.98	0.6470
	IDR	CVPR2023	30.72	0.8960	0.0949	0.0920	0.6613	3.5283	66.63	0.6443
	PromptIR	NeurIPS23	32.18	0.9124	0.0859	0.0864	0.6409	4.0061	67.21	0.6569
	VLU-Net	CVPR2025	32.55	0.9157	0.0796	0.0785	0.6433	3.6559	67.09	0.6698
	DFPIR	CVPR2025	32.75	0.9162	0.0758	0.0758	0.6626	3.5938	67.34	0.6679
	AdaIR	ICLR2025	32.98	0.9155	0.0825	0.0820	0.6435	3.6464	67.50	0.6685
Diff	DA-CLIP	ICLR2024	30.27	0.8780	0.0664	0.0650	0.6580	3.2783	67.31	0.6663
	DiffUIR	CVPR2024	31.89	0.9010	0.0959	0.0964	0.6163	3.7371	67.51	0.6570
	Ours	-	32.18	0.9105	0.0651	0.0639	0.6653	3.6022	68.89	0.7038

Table 1: Quantitative comparison on average performance across three restoration tasks: dehazing, deraining, and Gaussian denoising ($\sigma=15, 25, 50$). Red and blue indicate the best and second-best results, respectively.

Experiment

Type	Method	Dehazing MUSIQ	Dehazing MANIQA	Deraining MUSIQ	Deraining MANIQA	Denoising MUSIQ	Denoising MANIQA	Deblurring MUSIQ	Deblurring MANIQA	Low-light MUSIQ	Low-light MANIQA
Non-Diff	PromptIR	65.43	0.6829	68.00	0.6679	67.29	0.6295	36.60	0.4250	64.76	0.5922
	InstructIR	59.75	0.6622	68.01	0.6659	68.35	0.6598	38.83	0.4007	67.34	0.6061
	AdaIR	65.83	0.6923	68.31	0.6720	67.65	0.6593	33.41	0.3763	69.43	0.6309
	VLU-Net	65.81	0.6920	68.45	0.6740	68.38	0.6669	32.48	0.3568	65.21	0.5948
	DFPIR	65.61	0.6898	68.29	0.6700	68.06	0.6563	36.93	0.3978	69.19	0.6234
Diff	DA-CLIP	67.88	0.6524	68.60	0.6695	67.88	0.6524	36.89	0.4298	72.59	0.6619
	DiffUIR	64.84	0.6864	68.40	0.6682	68.19	0.6383	33.63	0.3529	71.13	0.6246
	Ours	66.42	0.6948	68.69	0.6903	68.49	0.6700	38.91	0.4440	72.77	0.6528

Table 2: Quantitative comparison of no-reference metrics MUSIQ ($\uparrow$) and MANIQA ($\uparrow$) across five restoration tasks: dehazing, deraining, denoising ($\sigma=25$), deblurring, and low-light enhancement. Red and blue indicate the best and second-best results.

Method	L+H PSNR	L+H SSIM	L+R PSNR	L+R SSIM	L+S PSNR	L+S SSIM	H+R PSNR	H+R SSIM	H+S PSNR	H+S SSIM	L+H+R PSNR	L+H+R SSIM	L+H+S PSNR	L+H+S SSIM
AirNet	23.23	0.779	23.21	0.768	23.06	0.763	25.04	0.883	25.07	0.884	22.85	0.751	22.61	0.748
PromptIR	24.49	0.789	24.33	0.773	24.12	0.770	27.93	0.939	28.02	0.940	23.87	0.763	23.55	0.759
WGWSNet	23.95	0.772	23.70	0.764	23.54	0.759	27.53	0.890	27.60	0.891	23.30	0.748	23.07	0.745
WeatherDiff	22.36	0.756	22.25	0.750	22.10	0.745	23.82	0.867	23.91	0.869	21.87	0.732	21.60	0.730
OneRestore	25.79	0.822	25.65	0.811	25.44	0.808	29.81	0.960	29.92	0.961	25.23	0.797	24.93	0.793
MoCE-IR-S	26.24	0.817	26.05	0.804	26.04	0.802	29.93	0.961	30.19	0.962	25.41	0.789	25.39	0.787
Ours	26.50	0.820	26.40	0.808	26.10	0.805	30.10	0.963	30.50	0.964	25.80	0.794	25.50	0.790

Table 3: Comparison on composite degradation types. PSNR ($\uparrow$) and SSIM ($\uparrow$) are reported. L: Low-light, H: Haze, R: Rain, S: Snow. Red and blue indicate the best and second-best results, respectively.

Following prior work in general-purpose image restoration (Jiang et al. 2024a), we evaluate our model under two distinct settings: (i) Universal Degradation Restoration, where a single model is trained to handle multiple degradation types, including combinations of three and five types; and (ii) Compositional Degradation Restoration, where the model is assessed on both individual and mixed degradation scenarios (up to three types) using the same unified architecture.

Experimental Settings

Datasets.

We aggregate a diverse set of widely-used datasets across different degradation types. For denoising, we use BSD400 (Arbelaez et al. 2010) and WED (Ma et al. 2016) with synthetic Gaussian noise ( $\sigma=15$ , $25$ , $50$ ) for training, and BSD68 (Martin et al. 2001) for testing. Rain100L (Yang et al. 2017), RESIDE-SOTS (Li et al. 2018), GoPro (Nah, Hyun Kim, and Mu Lee 2017), and LOL-v1 (Wei et al. 2018) are used for deraining, dehazing, deblurring, and low-light enhancement, respectively. To support unified modeling, we construct multi-task settings with three and five degradation types, denoted as AIO-3 and AIO-5. For compositional degradations, we additionally evaluate on CDD11 (Guo et al. 2024).

Metrics.

To comprehensively evaluate restoration performance, we adopt a suite of image quality assessment metrics covering both distortion-based fidelity and perceptual quality. For full-reference evaluation, PSNR and SSIM (Wang et al. 2004) are used to assess pixel-level and structural similarity. To better align with human perception, we employ LPIPS (Zhang et al. 2018) and DISTS (Ding et al. 2020). Additionally, to support reference-free evaluation in real-world scenarios, we include no-reference metrics such as NIQE (Zhang, Zhang, and Bovik 2015), MANIQA (Yang et al. 2022), MUSIQ (Ke et al. 2021), and CLIPIQA (Wang, Chan, and Loy 2023).
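As a concrete reference for the fidelity metrics, PSNR can be computed from the mean squared error as in the sketch below. It assumes images flattened to lists of intensities in $[0,1]$; the function name and the toy pixel values are illustrative.

```python
import math

def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf                   # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [0.0, 0.5, 1.0]
restored = [0.1, 0.6, 0.9]                # every pixel is off by 0.1
score = psnr(reference, restored)         # MSE = 0.01, i.e. about 20 dB
```

Because PSNR is logarithmic in the MSE, a gain of 0.3 dB corresponds to roughly a 7% reduction in mean squared error.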

Comparison with State-of-the-Art Methods

Universal Degradation Restoration.

To validate the generalization and restoration performance of our method across diverse degradation types, we first conduct systematic comparisons under a three-task joint training setup involving dehazing, deraining, and denoising. As shown in Table 1, although our PSNR and SSIM are slightly lower than some state-of-the-art non-diffusion methods (e.g., DFPIR, AdaIR), our method achieves the best scores on perceptual metrics, including LPIPS, DISTS, and CLIPIQA. Notably, we obtain the highest MUSIQ and MANIQA scores, reflecting superior perceptual quality. Compared with other diffusion-based methods, our model outperforms them on every metric except NIQE, where DA-CLIP scores better (3.2783 vs. 3.6022), demonstrating enhanced fidelity and perceptual performance.

We further extend the evaluation to a five-task setup by including deblurring and low-light enhancement, where task conflict becomes more pronounced. As shown in Table 2, our method achieves the top MUSIQ score on four of the five tasks (all except dehazing) and the top MANIQA score on four of the five (all except low-light enhancement), confirming its robustness and adaptability under complex degradation scenarios. In contrast, other diffusion-based methods suffer from inconsistent task-wise performance. Figure 5 presents representative visual comparisons across five distinct restoration tasks, further demonstrating the capability of our method to recover fine structural details under a unified multi-task setting. More detailed results, including runtime and parameter statistics, are provided in the Appendix.

Composite Degradations.

To better simulate real-world restoration demands, we adopt a unified setting based on the CDD11 dataset, which includes 11 degradation types spanning haze, rain, snow, low-light, and their various combinations. This comprehensive setup imposes higher demands on generalization and robustness. As shown in Table 3, our method delivers strong performance across the composite settings: it attains the highest PSNR in all seven reported combinations and the highest SSIM on H+R and H+S, ranking second in SSIM (behind OneRestore) on the remaining five. Even under challenging three-fold degradations (L+H+R and L+H+S), our model achieves the best PSNR, with PSNR/SSIM of 25.80/0.794 and 25.50/0.790, respectively, demonstrating its robust cross-degradation modeling capability. On average, our approach achieves 29.35 dB PSNR and 0.886 SSIM across all 11 tasks, outperforming the representative MoCE-IR baseline by 0.30 dB and 0.005. While Table 3 reports only objective metrics, our method also excels in perceptual quality and visual consistency, further validating its unified modeling capabilities under complex degradation.

Ablation Study

Structural Design of the DAFF

To evaluate the necessity of the proposed design in the DAFF module, we conduct a comprehensive ablation study across four configurations: (1) no DAFF fusion (baseline); (2) single-stream fusion only; (3) double-stream fusion only; and (4) our full DAFF design combining both paths. In addition, we compare our method with a representative residual self-attention (RSA) fusion strategy (Radford et al. 2021), which replaces DAFF with a standard attention block.

As reported in Table 4, both single-stream and double-stream branches offer partial benefits—single-stream fusion improves integration but lacks representation disentanglement, while double-stream enhances alignment but suffers from fusion instability. Our full DAFF achieves the best performance by combining their respective strengths. Moreover, RSA performs worse than our method due to its lack of timestep-aware modulation. While RSA captures spatial attention, it fails to adaptively adjust the influence of degradation cues throughout the diffusion process. In contrast, DAFF leverages timestep $t$ as a dynamic conditioning signal, enabling consistent and adaptive fusion across evolving latent states.

Fusion Strategy	PSNR $\uparrow$	MUSIQ $\uparrow$
No fusion	23.12	41.31
Single-stream only	25.39	49.26
Double-stream only	25.84	51.02
Residual Self-Attention (RSA)	24.08	48.11
Ours (Full DAFF)	27.14	61.35

Table 4: DAFF ablation results averaged over five tasks. Fusion of double-stream and single-stream modules with timestep-aware modulation achieves the best performance.

Effectiveness of the DAFF

To further validate the overall effectiveness of the DAFF module, we compare different configurations with and without DAFF under both task prompt-enabled and prompt-free settings. As shown in Table 5, incorporating task prompts alone provides semantic guidance and improves performance in globally homogeneous degradations. However, it remains insufficient for handling spatially localized degradations such as rain or shadow. By contrast, adding DAFF enables dynamic interaction between degradation cues and evolving latent features, resulting in significantly improved restoration fidelity. Notably, the combination of DAFF and task prompt yields the best performance, demonstrating their complementarity: task prompts offer global semantics, while DAFF contributes fine-grained degradation awareness. These results highlight that DAFF is not only effective independently, but also synergizes well with prompt-based guidance to improve robustness and quality across diverse degradation scenarios.

DAFF	Task Prompt	DAEM	PSNR $\uparrow$	MUSIQ $\uparrow$
✗	✗	✗	23.12	41.31
✗	✓	✗	25.87	50.12
✓	✗	✗	27.14	61.35
✓	✓	✗	27.36	62.77
✓	✓	✓	30.27	63.06

Table 5: Component-wise ablation study, averaged over five restoration tasks.

Effectiveness of the DAEM

While DAFF improves alignment and global consistency, fine details—such as text and facial contours—are often inaccurately restored. To address this, we incorporate the DAEM, which selectively enhances high-frequency structures using a dynamic mixture-of-experts design. Each expert branch captures distinct local patterns via diverse receptive fields, and a sparse gating mechanism routes features to the most suitable expert based on local context. As shown in Table 5, introducing DAEM on top of DAFF and prompt guidance yields a significant boost in PSNR and MUSIQ. Figure 6 further shows that DAEM visibly reduces residual artifacts and sharpens fine structures. These results confirm the critical role of DAEM in refining details and complementing DAFF’s structural fusion.

Conclusion

In this paper, we present a novel All-in-One image restoration framework built upon LDMs. To address the limitations of prompt-conditioned diffusion in modeling diverse degradation types, we introduce a structure-aware guidance mechanism through the DAFF module, enabling implicit adaptation to various degradations. Additionally, the proposed DAEM enhances fine-detail recovery and structural consistency in the decoding stage. Extensive experiments across multi-task and mixed degradation settings demonstrate the superiority of our method.

References

Ai et al. (2024) Ai, Y.; Huang, H.; Zhou, X.; Wang, J.; and He, R. 2024. Multimodal prompt perceiver: Empower adaptiveness generalizability and fidelity for all-in-one image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 25432–25444.
Arbelaez et al. (2010) Arbelaez, P.; Maire, M.; Fowlkes, C.; and Malik, J. 2010. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5): 898–916.
Chen, Pan, and Dong (2025) Chen, J.; Pan, J.; and Dong, J. 2025. Faithdiff: Unleashing diffusion priors for faithful image super-resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, 28188–28197.
Chen et al. (2022a) Chen, L.; Chu, X.; Zhang, X.; and Sun, J. 2022a. Simple baselines for image restoration. In European conference on computer vision, 17–33. Springer.
Chen et al. (2022b) Chen, W.-T.; Huang, Z.-K.; Tsai, C.-C.; Yang, H.-H.; Ding, J.-J.; and Kuo, S.-Y. 2022b. Learning multiple adverse weather removal via two-stage knowledge learning and multi-contrastive regularization: Toward a unified model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 17653–17662.
Cui et al. (2024) Cui, Y.; Zamir, S. W.; Khan, S.; Knoll, A.; Shah, M.; and Khan, F. S. 2024. Adair: Adaptive all-in-one image restoration via frequency mining and modulation. arXiv preprint arXiv:2403.14614.
Dai et al. (2023) Dai, X.; Hou, J.; Ma, C.-Y.; Tsai, S.; Wang, J.; Wang, R.; Zhang, P.; Vandenhende, S.; Wang, X.; Dubey, A.; et al. 2023. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807.
Ding et al. (2020) Ding, K.; Ma, K.; Wang, S.; and Simoncelli, E. P. 2020. Image quality assessment: Unifying structure and texture similarity. IEEE transactions on pattern analysis and machine intelligence, 44(5): 2567–2581.
Duan et al. (2024) Duan, H.; Min, X.; Wu, S.; Shen, W.; and Zhai, G. 2024. Uniprocessor: a text-induced unified low-level image processor. In European Conference on Computer Vision, 180–199. Springer.
Guo et al. (2024) Guo, Y.; Gao, Y.; Lu, Y.; Zhu, H.; Liu, R. W.; and He, S. 2024. Onerestore: A universal restoration framework for composite degradation. In European conference on computer vision, 255–272. Springer.
Jiang et al. (2024a) Jiang, J.; Zuo, Z.; Wu, G.; Jiang, K.; and Liu, X. 2024a. A survey on all-in-one image restoration: Taxonomy, evaluation and future trends. arXiv preprint arXiv:2410.15067.
Jiang et al. (2024b) Jiang, Y.; Zhang, Z.; Xue, T.; and Gu, J. 2024b. AutoDIR: Automatic all-in-one image restoration with latent diffusion. In European Conference on Computer Vision, 340–359. Springer.
Ke et al. (2021) Ke, J.; Wang, Q.; Wang, Y.; Milanfar, P.; and Yang, F. 2021. MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, 5148–5157.
Kingma, Welling et al. (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Labs et al. (2025) Black Forest Labs; Batifol, S.; Blattmann, A.; Boesel, F.; Consul, S.; Diagne, C.; Dockhorn, T.; English, J.; English, Z.; Esser, P.; Kulal, S.; Lacey, K.; Levi, Y.; Li, C.; Lorenz, D.; Müller, J.; Podell, D.; Rombach, R.; Saini, H.; Sauer, A.; and Smith, L. 2025. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742.
Li et al. (2022) Li, B.; Liu, X.; Hu, P.; Wu, Z.; Lv, J.; and Peng, X. 2022. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 17452–17462.
Li et al. (2018) Li, B.; Ren, W.; Fu, D.; Tao, D.; Feng, D.; Zeng, W.; and Wang, Z. 2018. Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing, 28(1): 492–505.
Luo et al. (2023) Luo, Z.; Gustafsson, F. K.; Zhao, Z.; Sjölund, J.; and Schön, T. B. 2023. Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018.
Ma et al. (2016) Ma, K.; Duanmu, Z.; Wu, Q.; Wang, Z.; Yong, H.; Li, H.; and Zhang, L. 2016. Waterloo exploration database: New challenges for image quality assessment models. IEEE Transactions on Image Processing, 26(2): 1004–1016.
Martin et al. (2001) Martin, D.; Fowlkes, C.; Tal, D.; and Malik, J. 2001. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings eighth IEEE international conference on computer vision. ICCV 2001, volume 2, 416–423. IEEE.
Nah, Hyun Kim, and Mu Lee (2017) Nah, S.; Hyun Kim, T.; and Mu Lee, K. 2017. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3883–3891.
Podell et al. (2023) Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; and Rombach, R. 2023. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
Potlapalli et al. (2023) Potlapalli, V.; Zamir, S. W.; Khan, S. H.; and Shahbaz Khan, F. 2023. PromptIR: Prompting for all-in-one image restoration. Advances in Neural Information Processing Systems, 36: 71275–71293.
Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
Riquelme et al. (2021) Riquelme, C.; Puigcerver, J.; Mustafa, B.; Neumann, M.; Jenatton, R.; Susano Pinto, A.; Keysers, D.; and Houlsby, N. 2021. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34: 8583–8595.
Tian et al. (2025) Tian, X.; Liao, X.; Liu, X.; Li, M.; and Ren, C. 2025. Degradation-aware feature perturbation for all-in-one image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, 28165–28175.
Valanarasu, Yasarla, and Patel (2022) Valanarasu, J. M. J.; Yasarla, R.; and Patel, V. M. 2022. TransWeather: Transformer-based restoration of images degraded by adverse weather conditions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2353–2363.
Wang, Chan, and Loy (2023) Wang, J.; Chan, K. C.; and Loy, C. C. 2023. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence, volume 37, 2555–2563.
Wang et al. (2004) Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4): 600–612.
Wei et al. (2018) Wei, C.; Wang, W.; Yang, W.; and Liu, J. 2018. Deep Retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560.
Yan et al. (2025) Yan, Q.; Jiang, A.; Chen, K.; Peng, L.; Yi, Q.; and Zhang, C. 2025. Textual prompt guided image restoration. Engineering Applications of Artificial Intelligence, 155: 110981.
Yang et al. (2022) Yang, S.; Wu, T.; Shi, S.; Lao, S.; Gong, Y.; Cao, M.; Wang, J.; and Yang, Y. 2022. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 1191–1200.
Yang et al. (2017) Yang, W.; Tan, R. T.; Feng, J.; Liu, J.; Guo, Z.; and Yan, S. 2017. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1357–1366.
Zamfir et al. (2025) Zamfir, E.; Wu, Z.; Mehta, N.; Tan, Y.; Paudel, D. P.; Zhang, Y.; and Timofte, R. 2025. Complexity experts are task-discriminative learners for any image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, 12753–12763.
Zamir et al. (2022) Zamir, S. W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F. S.; and Yang, M.-H. 2022. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5728–5739.
Zeng et al. (2025) Zeng, H.; Wang, X.; Chen, Y.; Su, J.; and Liu, J. 2025. Vision-language gradient descent-driven all-in-one deep unfolding networks. In Proceedings of the Computer Vision and Pattern Recognition Conference, 7524–7533.
Zhang, Zhang, and Bovik (2015) Zhang, L.; Zhang, L.; and Bovik, A. C. 2015. A feature-enriched completely blind image quality evaluator. IEEE Transactions on Image Processing, 24(8): 2579–2591.
Zhang et al. (2018) Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang, O. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, 586–595.
Zheng et al. (2024) Zheng, D.; Wu, X.-M.; Yang, S.; Zhang, J.; Hu, J.-F.; and Zheng, W.-S. 2024. Selective hourglass mapping for universal image restoration based on diffusion model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 25445–25455.
Zhu et al. (2023) Zhu, Z.; Feng, X.; Chen, D.; Bao, J.; Wang, L.; Chen, Y.; Yuan, L.; and Hua, G. 2023. Designing a better asymmetric VQGAN for StableDiffusion. arXiv preprint arXiv:2306.04632.