UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration

Zihan Cheng1, Liangtai Zhou1, Dian Chen1, Ni Tang1, Xiaotong Luo1, Yanyun Qu1
Abstract

All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address its core challenges, we propose a novel unified image restoration framework based on latent diffusion models (LDMs). Our approach structurally integrates low-quality visual priors into the diffusion process, unlocking the powerful generative capacity of diffusion models for diverse degradations. Specifically, we design a Degradation-Aware Feature Fusion (DAFF) module to enable adaptive handling of diverse degradation types. Furthermore, to mitigate detail loss caused by the high compression and iterative sampling of LDMs, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.

Introduction

Traditional image restoration (IR) focuses on a specific degradation type such as denoising, deblurring, dehazing, deraining, or low-light enhancement. While task-specific models have achieved impressive results in their respective domains, they typically lack generalization capability. Such single-purpose models are difficult to scale in practical applications, as switching between restoration tasks often requires retraining or task-specific adaptation, resulting in high deployment overhead and limited flexibility. This has spurred increasing interest in All-in-One Image Restoration (AiOIR), which aims to develop unified frameworks capable of handling diverse degradations under a single model.

Recent AiOIR approaches have explored a variety of strategies to achieve generality, including task-specific architectures (Li et al. 2022), task-specific priors (Valanarasu, Yasarla, and Patel 2022), frequency-aware modulation (Cui et al. 2024), contrastive learning (Chen et al. 2022b), and prompt-driven guidance (Zeng et al. 2025). However, these methods typically rely on predefined degradation types (Potlapalli et al. 2023) and rigid priors, making them less effective in handling real-world images where degradations are often compound, spatially heterogeneous, and lack explicit degradation type labels. Such limitations highlight the need for a more flexible and powerful generative framework.

Figure 1: MUSIQ-based comparison across five restoration tasks. Our method achieves consistently superior results.
Figure 2: Overall architecture of the proposed UniLDiff framework.

In this context, diffusion models, particularly Latent Diffusion Models (LDMs), have emerged as promising candidates for unified image restoration, owing to their strong generative priors and success in high-fidelity image synthesis and cross-modal tasks (Chen, Pan, and Dong 2025; Jiang et al. 2024a). However, current diffusion-based AiOIR methods commonly rely on global textual or visual prompts, each with intrinsic limitations: textual prompts offer only global semantic cues and lack spatial specificity (Jiang et al. 2024b), while visual prompts depend on pre-trained degradation encoders or modality-specific tokens, assuming known and spatially uniform degradation types (Luo et al. 2023). As a result, prompt-based conditioning constrains the model’s capacity to localize and respond to fine-grained degradations, limiting its adaptability despite the inherent generative strength of diffusion models. Additionally, LDMs often suffer from detail loss due to the high compression of VAE encoders (Kingma, Welling et al. 2013) and the progressive nature of iterative sampling (Dai et al. 2023), resulting in blurred textures and incomplete structures.

To address these challenges, we propose UniLDiff, a novel LDM-based unified image restoration framework that explicitly integrates degradation-aware and detail-aware mechanisms into the diffusion process. Specifically, we introduce the Degradation-Aware Feature Fusion (DAFF) module into the early layers of the diffusion UNet to inject low-quality (LQ) features into every denoising step. By leveraging double-stream attention and adaptive modulation, DAFF provides dynamic, spatially adaptive guidance, enabling the model to perceive and adapt to diverse and heterogeneous degradations throughout the generative trajectory, without relying on predefined degradation assumptions. Furthermore, we design a Detail-Aware Expert Module (DAEM) within the decoder, which leverages decoder features and skip-connected encoder cues to dynamically activate expert branches, thereby enhancing high-frequency details and suppressing structural artifacts. A preview of our results is shown in Figure 1. UniLDiff consistently outperforms previous methods across five representative tasks in terms of perceptual quality, demonstrating its robustness and generalization in unified multi-task restoration.

Our contributions can be summarized as follows:

  • We propose UniLDiff, a unified LDM-based framework that integrates restoration priors into the denoising process for flexible and high-fidelity image restoration.

  • We design the DAFF module, which performs step-wise alignment between low-quality features and evolving latent variables, enabling the model to implicitly adapt to diverse degradations without predefined labels.

  • We develop the DAEM, which dynamically routes decoder features through expert branches guided by encoder cues, enabling targeted restoration of fine textures and structural details.

  • Extensive experiments across multi-task and composite restoration benchmarks demonstrate that UniLDiff achieves superior perceptual quality and generalization compared to state-of-the-art methods.

Related Work

Non-Diffusion-Based All-in-One Image Restoration

Traditional AiOIR methods are primarily built upon Convolutional Neural Networks (CNNs) or Transformer architectures, aiming to improve adaptability to diverse degradation types through the incorporation of task-specific structures and priors. In recent years, various strategies have been proposed to enhance model generalization, including task-specific architectural designs (Li et al. 2022), explicit prior modeling (Valanarasu, Yasarla, and Patel 2022), frequency-aware modulation (Cui et al. 2024), contrastive learning (Chen et al. 2022b), and MoE-based frameworks (Zamfir et al. 2025). Among these, prompt learning has emerged as a dominant paradigm in AiOIR. It enables task-conditioned modeling through visual prompts (Zeng et al. 2025; Tian et al. 2025; Cui et al. 2024), textual prompts (Yan et al. 2025), or multimodal prompts (Duan et al. 2024). While these approaches improve modularity and task awareness, they often rely on predefined degradation types and rigid priors (Potlapalli et al. 2023), limiting their effectiveness in real-world scenarios where degradations are compound, spatially variable, and unlabeled.

Diffusion-Based All-in-One Image Restoration

Recently, diffusion models have demonstrated strong potential in AiOIR, leveraging their powerful generative priors and high-fidelity image synthesis capabilities. AutoDIR (Jiang et al. 2024b) pioneered text-guided diffusion by using natural language prompts for restoration. MPerceiver (Ai et al. 2024) extended this approach with multimodal CLIP features to handle generalized degradations. DiffUIR (Zheng et al. 2024) further introduced a selective hourglass mapping strategy that combines strong condition guidance with shared distribution modeling for unified multi-task restoration. DaCLIP (Luo et al. 2023) leverages contrastive language-image pretraining to align degradation-aware prompts with visual features, enabling adaptive restoration under diverse degradation types. These works demonstrate the versatility of diffusion models in AiOIR, particularly in scenarios involving mixed or unknown degradations. However, existing approaches often rely on global textual or visual prompts, which provide limited spatial specificity and typically assume known, uniform degradation types. This restricts the model’s ability to localize and respond to compound or fine-grained degradations that are common in real-world scenarios.

Method

We present UniLDiff, a unified image restoration framework that unlocks the power of diffusion priors. As illustrated in Figure 2, UniLDiff employs a pre-trained VAE encoder to project both low-quality (LQ) and high-quality (HQ) images into a latent space, extracting multi-scale features $f^{LQ}$ and $f^{HQ}$. The forward diffusion process adds noise to $f^{HQ}$ to produce latent variables $X_t^{HQ}$ at each timestep $t$. To inject degradation-aware guidance into the denoising trajectory, the DAFF module is integrated into the early layers of the diffusion UNet, aligning $f^{LQ}$ with $X_t^{HQ}$ at every step. Additionally, the DAEM is applied in the decoder to enhance high-frequency textures and preserve fine-grained structural details.

Degradation-Aware Feature Fusion (DAFF)

Figure 3: Architecture of the proposed Degradation-Aware Feature Fusion (DAFF).

In diffusion-based image restoration, a common practice is to simply concatenate or add the LQ features $f^{LQ}$ with the latent variable $X_t^{HQ}$ at each timestep $t$. However, this naive fusion can be suboptimal: as $X_t^{HQ}$ progressively approaches the clean HQ representation, static $f^{LQ}$ may introduce interference, undermining structural consistency in later stages.

To mitigate this, we introduce the DAFF module, which dynamically aligns LQ features with evolving latent representations at each diffusion timestep, providing degradation-aware modulation to condition the denoising trajectory according to the inferred degradation type. Inspired by FLUX (Labs et al. 2025), we adopt a cascaded fusion design combining double-stream and single-stream pathways (Fig. 3), leveraging joint and disentangled feature modeling.

Double Stream Block.

We apply LayerNorm and conditional modulation to $f^{LQ}$ and $X_t^{HQ}$ independently. Each branch computes its own query, key, and value matrices, which are then concatenated and passed through a shared attention module:

$$Q_t^D, K_t^D, V_t^D = \text{concat}[QKV(f^{LQ}), QKV(X_t^{HQ})] \tag{1}$$
$$A_t^D = \text{softmax}\left(\frac{Q_t^D (K_t^D)^T + PE}{\sqrt{d}}\right) V_t^D \tag{2}$$

The attention output $A_t^D$ is projected back via a gating mechanism to obtain $X_{t,gate}^{LQ}$ and $X_{t,gate}^{HQ}$, enabling degradation-aware guidance and structural preservation. This bidirectional interaction surpasses naive fusion by reducing inconsistency and structural artifacts.
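As a rough illustration of Eqs. (1) and (2), the sketch below concatenates the per-stream queries, keys, and values along the token axis and runs one softmax attention over the joint sequence, so that LQ tokens can attend to HQ latent tokens and vice versa. It is written in plain Python on nested lists. The function names, the toy shapes, and the default zero positional term `pe` are assumptions made for illustration, and the subsequent gating step is omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(qkv_lq, qkv_hq, pe=None):
    """Eqs. (1)-(2): concatenate per-stream (Q, K, V) along the token axis,
    then compute softmax((Q K^T + PE) / sqrt(d)) V over the joint sequence.
    Each argument is a (Q, K, V) tuple of token lists (hypothetical layout)."""
    q = qkv_lq[0] + qkv_hq[0]  # list concatenation = token-axis concat
    k = qkv_lq[1] + qkv_hq[1]
    v = qkv_lq[2] + qkv_hq[2]
    n, d = len(q), len(q[0])
    if pe is None:
        pe = [[0.0] * n for _ in range(n)]  # assumption: no positional bias
    out = []
    for i in range(n):
        logits = [
            (sum(q[i][c] * k[j][c] for c in range(d)) + pe[i][j]) / math.sqrt(d)
            for j in range(n)
        ]
        w = softmax(logits)
        out.append([sum(w[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))])
    n_lq = len(qkv_lq[0])
    # Split the joint output back into the LQ and HQ token groups.
    return out[:n_lq], out[n_lq:]
```

Because the attention matrix spans both token groups, every output token mixes values from both streams, which is the bidirectional interaction described above.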

Single Stream Block.

The gated features are concatenated as $f_t^{cat} = \text{concat}[X_{t,gate}^{LQ}, X_{t,gate}^{HQ}]$, then normalized and linearly projected to produce the attention triplet and an auxiliary vector:

$$[Q_t^S, K_t^S, V_t^S], M = \text{Linear}_1(f_t^{cat}) \tag{3}$$

The attention is computed via scaled dot-product:

$$A_t^S = \text{softmax}\left(\frac{Q_t^S (K_t^S)^T + PE}{\sqrt{d}}\right) V_t^S \tag{4}$$

Here, PE denotes the positional embedding. We concatenate $A_t^S$ with $\phi(M)$ and feed the result through a second linear layer. The output is combined with the input via gating to yield $f_t^{align}$:

$$f_t^{align} = f_t^{cat} + g \cdot \text{Linear}_2(A_t^S, \phi(M)) \tag{5}$$

where $\phi(\cdot)$ is a nonlinear activation and $g$ controls fusion strength.
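The gated residual update of Eq. (5) can be sketched for a single token as follows. This is a minimal illustration under stated assumptions: the activation is taken to be GELU (the text only specifies a nonlinear activation), the gate $g$ is treated as a scalar, and the helper names and weight layout are hypothetical.

```python
import math

def gelu(x):
    """Tanh approximation of GELU; an assumed choice for phi."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, weight, bias):
    """y = W x + b for one token; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def single_stream_update(f_cat, attn, m, weight2, bias2, g):
    """Eq. (5) for one token: f_align = f_cat + g * Linear2([A ; phi(M)])."""
    fused = attn + [gelu(v) for v in m]   # concatenate A with phi(M)
    delta = linear(fused, weight2, bias2)
    return [f + g * dl for f, dl in zip(f_cat, delta)]
```

With $g = 0$ the block reduces to the identity, so the gate directly controls how strongly the attended features modify the concatenated input.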

This unified stream improves feature coherence and adaptively modulates degradation influence throughout diffusion, mitigating guidance mismatch and preserving structure.

To enhance semantic alignment, we also incorporate a content embedding $c$ from a pre-trained text encoder, fused with $f_t^{align}$ via a lightweight cross-attention mechanism. In summary, given the latent HQ feature $X_t^{HQ}$ at diffusion step $t$, the degradation feature $f^{LQ}$, and the content embedding $c$, the model predicts the latent HQ feature $X_{t-1}^{HQ}$ at the previous timestep as follows:

$$x_{t-1}^{HQ} = \frac{1}{\sqrt{\alpha_t}}\left(x_t^{HQ} - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}_\theta(f_t^{align}, c, t)\right) + \sigma_t z \tag{6}$$

where $\hat{\epsilon}_\theta$ is the predicted noise from the denoising UNet, conditioned on $f_t^{align}$, $c$, and $t$. The variable $z \sim \mathcal{N}(0, I)$ adds stochasticity, while $\alpha_t$, $\sigma_t$, and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ are diffusion schedule parameters controlling denoising and noise levels.
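For concreteness, one reverse step of Eq. (6) can be sketched element-wise as below. The predicted noise is passed in as a plain list rather than produced by a network, and the function name and argument layout are assumptions for illustration only.

```python
import math
import random

def reverse_step(x_t, eps_hat, alpha_t, alpha_bar_t, sigma_t, rng=random):
    """One ancestral sampling step following Eq. (6), applied element-wise.
    x_t and eps_hat are flat lists of floats; rng must provide gauss()."""
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    out = []
    for x, e in zip(x_t, eps_hat):
        mean = (x - coef * e) / math.sqrt(alpha_t)
        out.append(mean + sigma_t * rng.gauss(0.0, 1.0))
    return out
```

A quick sanity check on the formula: at $t = 1$ (where $\bar{\alpha}_1 = \alpha_1$), with $\sigma_1 = 0$ and a perfect noise estimate, the step returns the clean latent $x_0$ exactly.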

Detail-Aware Expert Module (DAEM)

Figure 4: Architecture of the DAEM, which uses experts with varied receptive fields to adaptively refine details from skip-connected encoder features.

Although the DAFF module improves degradation perception and adaptation, LDMs still struggle to recover fine details, often resulting in texture blurring and structural loss in small objects. These limitations arise from the high compression ratio of the VAE encoder, which discards high-frequency signals, and the progressive sampling process, which introduces texture hallucinations (Dai et al. 2023; Zhu et al. 2023). Furthermore, in multi-degradation scenarios, spatially heterogeneous degradations challenge the expressiveness of a single reconstruction path.

To mitigate these issues, we introduce the DAEM—a specialized decoder component designed to selectively enhance high-frequency structures and texture fidelity across spatially diverse regions (see Fig. 4). Unlike conventional decoders that operate uniformly, DAEM leverages a Mixture-of-Experts (MoE) architecture to dynamically specialize its response based on local degradation context. More importantly, DAEM integrates encoder features via skip connections to retrieve early-stage, high-resolution cues that are otherwise lost in the VAE compression process. By fusing these fine-grained encoder features with the decoder’s latent representations, DAEM explicitly injects detail priors into the reconstruction pipeline—enabling targeted enhancement of textures, edges, and small structures.

Specifically, for each input $x$, a lightweight routing function computes the expert assignment as:

$$\text{Router}(x) = \text{top-}k\left(\text{Softmax}(Wx + \boldsymbol{\xi})\right), \tag{7}$$

where $W$ denotes the router weight, $\boldsymbol{\xi} \sim \mathcal{N}(0, \sigma^2)$ is sampled Gaussian noise, and $k$ is the number of activated experts. In our experiments, we empirically set $k = 1$. Each expert $E^i$ is built using lightweight NAFBlocks (Chen et al. 2022a) with varied receptive fields, allowing for multi-scale perception of both fine and coarse structures.
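The routing rule in Eq. (7) can be sketched as a noisy softmax followed by a top-$k$ selection. The code below is a minimal stand-alone illustration. The function name, the return format, and the toy router weights are assumptions, and setting `noise_std` to zero gives a deterministic router of the kind one might use at inference.

```python
import math
import random

def route(x, weight, k=1, noise_std=0.0, rng=random):
    """Eq. (7): top-k over Softmax(W x + xi), with xi ~ N(0, noise_std^2).
    Returns (expert_index, gate_probability) pairs, highest probability first."""
    logits = [
        sum(w * xi for w, xi in zip(row, x)) + rng.gauss(0.0, noise_std)
        for row in weight
    ]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]
```

With $k = 1$, as used in the experiments, only the single highest-probability expert is activated for each input.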

To ensure global semantic coherence, a shared expert branch $S(\cdot)$ with transposed self-attention (Zamir et al. 2022) is added. The output of each expert is modulated by this global branch as:

$$\hat{y}_E^i = E^i(x) \otimes S(x), \tag{8}$$

where $\otimes$ denotes element-wise multiplication. This design effectively fuses localized, detail-specific refinement with global contextual guidance.
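Eq. (8) amounts to an element-wise product between the routed expert's output and the shared branch's output. The sketch below uses toy callables as stand-ins for the NAFBlock experts and the transposed-attention branch. All names are hypothetical and chosen only for illustration.

```python
def modulated_expert_output(x, experts, shared, expert_index):
    """Eq. (8): y = E^i(x) * S(x), element-wise, for the routed expert i.
    experts is a list of callables and shared is a callable; each maps a
    flat list of floats to a list of the same length."""
    local = experts[expert_index](x)   # detail-specific refinement
    global_ctx = shared(x)             # global contextual modulation
    return [a * b for a, b in zip(local, global_ctx)]

# Toy stand-ins: two "experts" and a constant shared branch.
toy_experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
toy_shared = lambda x: [0.5] * len(x)
y = modulated_expert_output([1.0, 2.0], toy_experts, toy_shared, expert_index=0)
```

Since the shared branch multiplies every expert's output, it acts as a global gate that is applied identically regardless of which expert the router selects.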

By bridging early encoder features and dynamic expert processing, DAEM equips the model with robust fine-detail recovery capabilities and enhanced adaptability to complex, spatially variant degradations.

Figure 5: Visual results. In dehazing, our approach produces more natural color restoration and sharper details, in some cases surpassing the visual quality of the ground truth. For low-light enhancement, it better preserves structural edges and surface textures while avoiding over-smoothing. In deraining, our method reconstructs continuous grid structures that other methods fail to recover, demonstrating superior fidelity and generalization across diverse degradations.

Unified Training Strategy

Our framework comprises five core components: a pre-trained VAE encoder, a trainable VAE encoder for extracting degradation features, the DAFF module, a diffusion-based denoising network $\epsilon_\theta$, and a VAE decoder integrated with the DAEM. In this work, we adopt Stable Diffusion XL (Podell et al. 2023) as our underlying latent diffusion model, leveraging its pre-trained components to initialize the encoder, decoder, and denoising UNet. The training procedure is organized into two stages: (1) degradation modeling with DAFF, and (2) detail refinement with DAEM.

Stage 1: Degradation Modeling

We first freeze the VAE encoder and denoising UNet, and train only the DAFF module. This allows DAFF to learn how to align LQ features $f^{LQ}$ with the noisy latent variable $x_t^{HQ}$ at each timestep, capturing degradation-aware priors to guide the denoising process. Once DAFF converges, we unfreeze the trainable LQ encoder and jointly optimize it with DAFF and the denoising network $\epsilon_\theta$ to improve coordination among feature extraction, fusion, and noise prediction. The training loss follows the standard diffusion noise estimation objective:

$$\mathcal{L}_{\text{stage-1}} = \left\| \epsilon - \hat{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0^{HQ} + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; f^{LQ}, c, t\right) \right\|_1, \tag{9}$$

where $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise, and $c$ is the auxiliary text embedding.
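The stage-1 objective can be sketched as follows: sample noise, form the noisy latent by the standard forward process, query a noise predictor, and take the L1 distance. The predictor is passed in as a callable standing in for the conditioned UNet. Its signature, the flat-list representation, and the function names are assumptions for illustration.

```python
import math
import random

def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def stage1_loss(x0, f_lq, c, t, alpha_bar, predict_noise, rng=random):
    """Eq. (9): L1 norm between the sampled noise and the predicted noise.
    predict_noise(x_t, f_lq, c, t) is a hypothetical stand-in for the UNet."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = add_noise(x0, eps, alpha_bar[t])
    eps_hat = predict_noise(x_t, f_lq, c, t)
    return sum(abs(e - p) for e, p in zip(eps, eps_hat))
```

A predictor that recovers the injected noise exactly drives this loss to zero, whereas a predictor that always outputs zeros yields the L1 norm of the sampled noise.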

Stage 2: Detail Refinement

After the degradation modeling stage, we fine-tune the VAE decoder together with the DAEM to enhance the recovery of fine-grained details. This phase is supervised by a composite loss that combines pixel-wise reconstruction loss $\mathcal{L}_{\text{recon}}$ and structural similarity loss $\mathcal{L}_{\text{ssim}}$, ensuring both low-level accuracy and structural consistency. To fully leverage the MoE architecture and prevent expert collapse, we introduce an auxiliary load-balancing loss $\mathcal{L}_{\text{aux}}$ (Riquelme et al. 2021), which encourages uniform expert utilization across batches. Details of its formulation are provided in the Appendix.

The total objective for this stage is:

$$\mathcal{L}_{\text{stage-2}} = \mathcal{L}_{\text{recon}} + \lambda_1 \mathcal{L}_{\text{ssim}} + \lambda_2 \mathcal{L}_{\text{aux}}, \tag{10}$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters controlling the trade-off between structural preservation and expert diversity.
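To make the weighting in Eq. (10) concrete, the sketch below combines precomputed reconstruction and SSIM loss values with a load-balancing term. Since the exact form of the auxiliary loss is deferred to the Appendix, the squared coefficient of variation of per-expert load is used here purely as an assumed stand-in (a common choice in the MoE literature). The function names and the example weights are likewise hypothetical.

```python
def cv_squared(loads):
    """Squared coefficient of variation of per-expert load.
    Equals 0 when all experts receive the same load (assumed aux loss)."""
    n = len(loads)
    mean = sum(loads) / n
    if mean == 0:
        return 0.0
    var = sum((v - mean) ** 2 for v in loads) / n
    return var / (mean ** 2)

def stage2_loss(l_recon, l_ssim, expert_loads, lambda1, lambda2):
    """Eq. (10): L_recon + lambda1 * L_ssim + lambda2 * L_aux."""
    return l_recon + lambda1 * l_ssim + lambda2 * cv_squared(expert_loads)

# Hypothetical values: balanced routing adds no penalty, collapsed routing does.
balanced = stage2_loss(0.1, 0.2, [5, 5, 5, 5], lambda1=0.5, lambda2=0.01)
collapsed = stage2_loss(0.1, 0.2, [20, 0, 0, 0], lambda1=0.5, lambda2=0.01)
```

Under this stand-in, routing every sample to one expert increases the total loss relative to balanced routing, which is the behavior the auxiliary term is meant to encourage against.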

| Type | Method | Venue | PSNR↑ | SSIM↑ | LPIPS↓ | DISTS↓ | CLIPIQA↑ | NIQE↓ | MUSIQ↑ | MANIQA↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Non-Diff | AirNet | CVPR2022 | 31.11 | 0.9068 | 0.0925 | 0.0950 | 0.6405 | 3.3911 | 66.98 | 0.6470 |
| Non-Diff | IDR | CVPR2023 | 30.72 | 0.8960 | 0.0949 | 0.0920 | 0.6613 | 3.5283 | 66.63 | 0.6443 |
| Non-Diff | PromptIR | NeurIPS23 | 32.18 | 0.9124 | 0.0859 | 0.0864 | 0.6409 | 4.0061 | 67.21 | 0.6569 |
| Non-Diff | VLU-Net | CVPR2025 | 32.55 | 0.9157 | 0.0796 | 0.0785 | 0.6433 | 3.6559 | 67.09 | 0.6698 |
| Non-Diff | DFPIR | CVPR2025 | 32.75 | 0.9162 | 0.0758 | 0.0758 | 0.6626 | 3.5938 | 67.34 | 0.6679 |
| Non-Diff | AdaIR | ICLR2025 | 32.98 | 0.9155 | 0.0825 | 0.0820 | 0.6435 | 3.6464 | 67.50 | 0.6685 |
| Diff | DA-CLIP | ICLR2024 | 30.27 | 0.8780 | 0.0664 | 0.0650 | 0.6580 | 3.2783 | 67.31 | 0.6663 |
| Diff | DiffUIR | CVPR2024 | 31.89 | 0.9010 | 0.0959 | 0.0964 | 0.6163 | 3.7371 | 67.51 | 0.6570 |
| Diff | Ours | - | 32.18 | 0.9105 | 0.0651 | 0.0639 | 0.6653 | 3.6022 | 68.89 | 0.7038 |

Table 1: Quantitative comparison on average performance across three restoration tasks: dehazing, deraining, and Gaussian denoising ($\sigma = 15, 25, 50$). Red and blue indicate the best and second-best results, respectively.

Experiment

| Type | Method | Dehazing MUSIQ | Dehazing MANIQA | Deraining MUSIQ | Deraining MANIQA | Denoising MUSIQ | Denoising MANIQA | Deblurring MUSIQ | Deblurring MANIQA | Low-light MUSIQ | Low-light MANIQA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-Diff | PromptIR | 65.43 | 0.6829 | 68.00 | 0.6679 | 67.29 | 0.6295 | 36.60 | 0.4250 | 64.76 | 0.5922 |
| Non-Diff | InstructIR | 59.75 | 0.6622 | 68.01 | 0.6659 | 68.35 | 0.6598 | 38.83 | 0.4007 | 67.34 | 0.6061 |
| Non-Diff | AdaIR | 65.83 | 0.6923 | 68.31 | 0.6720 | 67.65 | 0.6593 | 33.41 | 0.3763 | 69.43 | 0.6309 |
| Non-Diff | VLU-Net | 65.81 | 0.6920 | 68.45 | 0.6740 | 68.38 | 0.6669 | 32.48 | 0.3568 | 65.21 | 0.5948 |
| Non-Diff | DFPIR | 65.61 | 0.6898 | 68.29 | 0.6700 | 68.06 | 0.6563 | 36.93 | 0.3978 | 69.19 | 0.6234 |
| Diff | DA-CLIP | 67.88 | 0.6524 | 68.60 | 0.6695 | 67.88 | 0.6524 | 36.89 | 0.4298 | 72.59 | 0.6619 |
| Diff | DiffUIR | 64.84 | 0.6864 | 68.40 | 0.6682 | 68.19 | 0.6383 | 33.63 | 0.3529 | 71.13 | 0.6246 |
| Diff | Ours | 66.42 | 0.6948 | 68.69 | 0.6903 | 68.49 | 0.6700 | 38.91 | 0.4440 | 72.77 | 0.6528 |

Table 2: Quantitative comparison of no-reference metrics MUSIQ (↑) and MANIQA (↑) across five restoration tasks: dehazing, deraining, denoising ($\sigma = 25$), deblurring, and low-light enhancement. Red and blue indicate the best and second-best results.

| Method | L+H PSNR | L+H SSIM | L+R PSNR | L+R SSIM | L+S PSNR | L+S SSIM | H+R PSNR | H+R SSIM | H+S PSNR | H+S SSIM | L+H+R PSNR | L+H+R SSIM | L+H+S PSNR | L+H+S SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AirNet | 23.23 | 0.779 | 23.21 | 0.768 | 23.06 | 0.763 | 25.04 | 0.883 | 25.07 | 0.884 | 22.85 | 0.751 | 22.61 | 0.748 |
| PromptIR | 24.49 | 0.789 | 24.33 | 0.773 | 24.12 | 0.770 | 27.93 | 0.939 | 28.02 | 0.940 | 23.87 | 0.763 | 23.55 | 0.759 |
| WGWSNet | 23.95 | 0.772 | 23.70 | 0.764 | 23.54 | 0.759 | 27.53 | 0.890 | 27.60 | 0.891 | 23.30 | 0.748 | 23.07 | 0.745 |
| WeatherDiff | 22.36 | 0.756 | 22.25 | 0.750 | 22.10 | 0.745 | 23.82 | 0.867 | 23.91 | 0.869 | 21.87 | 0.732 | 21.60 | 0.730 |
| OneRestore | 25.79 | 0.822 | 25.65 | 0.811 | 25.44 | 0.808 | 29.81 | 0.960 | 29.92 | 0.961 | 25.23 | 0.797 | 24.93 | 0.793 |
| MoCE-IR-S | 26.24 | 0.817 | 26.05 | 0.804 | 26.04 | 0.802 | 29.93 | 0.961 | 30.19 | 0.962 | 25.41 | 0.789 | 25.39 | 0.787 |
| Ours | 26.50 | 0.820 | 26.40 | 0.808 | 26.10 | 0.805 | 30.10 | 0.963 | 30.50 | 0.964 | 25.80 | 0.794 | 25.50 | 0.790 |

Table 3: Comparison on composite degradation types. PSNR (↑) and SSIM (↑) are reported. L: Low, H: Haze, R: Rain, S: Snow. Red and blue indicate the best and second-best results, respectively.

Following prior work in general-purpose image restoration  (Jiang et al. 2024a), we evaluate our model under two distinct settings: (i) Universal Degradation Restoration, where a single model is trained to handle multiple degradation types, including combinations of three and five types; and (ii) Compositional Degradation Restoration, where the model is assessed on both individual and mixed degradation scenarios (up to three types) using the same unified architecture.

Experimental Settings

Datasets.

We aggregate a diverse set of widely-used datasets across different degradation types. For denoising, we use BSD400 (Arbelaez et al. 2010) and WED (Ma et al. 2016) with synthetic Gaussian noise ($\sigma = 15, 25, 50$) for training, and BSD68 (Martin et al. 2001) for testing. Rain100L (Yang et al. 2017), RESIDE-SOTS (Li et al. 2018), GoPro (Nah, Hyun Kim, and Mu Lee 2017), and LOL-v1 (Wei et al. 2018) are used for deraining, dehazing, deblurring, and low-light enhancement, respectively. To support unified modeling, we construct multi-task settings with three and five degradation types, denoted as AIO-3 and AIO-5. For compositional degradations, we additionally evaluate on CDD11 (Guo et al. 2024).
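Synthetic noisy inputs of the kind used for the denoising task can be generated as in the sketch below. It assumes that $\sigma$ is expressed on the 0-255 intensity scale and that noisy values are clipped to the valid range. Both are common conventions but are assumptions here, as is the function name.

```python
import random

def add_gaussian_noise(pixels, sigma, rng=random):
    """Add zero-mean Gaussian noise with standard deviation sigma
    (0-255 scale, assumed) to a flat list of intensities, then clip
    the result to [0, 255]."""
    return [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

# One noisy copy per noise level used in training.
clean = [128.0] * 16
noisy_by_sigma = {s: add_gaussian_noise(clean, s, random.Random(0)) for s in (15, 25, 50)}
```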

Metrics.

To comprehensively evaluate restoration performance, we adopt a suite of image quality assessment metrics covering both distortion-based fidelity and perceptual quality. For full-reference evaluation, PSNR and SSIM (Wang et al. 2004) are used to assess pixel-level and structural similarity. To better align with human perception, we employ LPIPS (Zhang et al. 2018) and DISTS (Ding et al. 2020). Additionally, to support reference-free evaluation in real-world scenarios, we include no-reference metrics such as NIQE (Zhang, Zhang, and Bovik 2015), MANIQA (Yang et al. 2022), MUSIQ (Ke et al. 2021), and CLIPIQA (Wang, Chan, and Loy 2023).
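For reference, the PSNR used in the full-reference evaluation follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below computes it for two equal-length pixel lists; the function name `psnr` and the default peak value of 255 (8-bit images) are assumptions made for illustration.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [100, 120, 140, 160]
restored = [101, 119, 141, 159]  # every pixel is off by one, so MSE = 1
score = psnr(reference, restored)
```

Because PSNR depends only on the mean squared error, it rewards pixel-level fidelity and is insensitive to the perceptual qualities that LPIPS, DISTS, and the no-reference metrics are meant to capture.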

Comparison with State-of-the-Art Methods

Universal Degradation Restoration.

To validate the generalization and restoration performance of our method across diverse degradation types, we first conduct systematic comparisons under a three-task joint training setup involving dehazing, deraining, and denoising. As shown in Table 1, although our PSNR and SSIM are slightly lower than some state-of-the-art non-diffusion methods (e.g., DFPIR, AdaIR), our method achieves the best scores on perceptual metrics, including LPIPS, DISTS, and CLIPIQA. Notably, we obtain the highest MUSIQ and MANIQA scores, reflecting superior perceptual quality. Compared with other diffusion-based methods, our model consistently outperforms them across all metrics, demonstrating enhanced fidelity and perceptual performance.

We further extend the evaluation to a five-task setup by including deblurring and low-light enhancement, where task conflict becomes more pronounced. As shown in Table 2, our method achieves the top MUSIQ score in most tasks and ranks among the best in MANIQA, confirming its robustness and adaptability under complex degradation scenarios. In contrast, other diffusion-based methods suffer from inconsistent task-wise performance. Figure 5 presents representative visual comparisons across five distinct restoration tasks, further demonstrating the capability of our method to recover fine structural details under a unified multi-task setting. More detailed results, including runtime and parameter statistics, are provided in the Appendix.

Composite Degradations.

To better simulate real-world restoration demands, we adopt a unified setting based on the CDD11 dataset, which includes 11 degradation types spanning haze, rain, snow, low-light, and their various combinations. This comprehensive setup imposes higher demands on generalization and robustness. As shown in Table 3, our method consistently delivers strong performance across all tasks. We often rank first or second in PSNR and SSIM, achieving state-of-the-art results on two-fold composite degradations such as L+H, L+S, and H+S. Even under challenging three-fold degradations (L+H+R and L+H+S), our model attains the highest PSNR (25.80 dB and 25.50 dB, respectively) with competitive SSIM (0.794 and 0.790), demonstrating its robust cross-degradation modeling capability. On average, our approach achieves 29.35 dB PSNR and 0.886 SSIM across all 11 tasks, outperforming the representative MoCE-IR baseline by 0.30 dB and 0.005. While Table 3 reports only objective metrics, our method also excels in perceptual quality and visual consistency, further validating its unified modeling capabilities under complex degradation.

Ablation Study

Structural Design of the DAFF

To evaluate the necessity of the proposed design in the DAFF module, we conduct a comprehensive ablation study across four configurations: (1) no DAFF fusion (baseline); (2) single-stream fusion only; (3) double-stream fusion only; and (4) our full DAFF design combining both paths. In addition, we compare our method with a representative residual self-attention (RSA) fusion strategy (Radford et al. 2021), which replaces DAFF with a standard attention block.

As reported in Table 4, both single-stream and double-stream branches offer partial benefits: single-stream fusion improves integration but lacks representation disentanglement, while double-stream fusion enhances alignment but suffers from fusion instability. Our full DAFF achieves the best performance by combining their respective strengths. Moreover, RSA performs worse than our method due to its lack of timestep-aware modulation. While RSA captures spatial attention, it fails to adaptively adjust the influence of degradation cues throughout the diffusion process. In contrast, DAFF leverages the timestep $t$ as a dynamic conditioning signal, enabling consistent and adaptive fusion across evolving latent states.
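To make the idea of timestep-aware modulation concrete, the following toy sketch scales and shifts degradation features with a signal derived from a sinusoidal timestep embedding before adding them to the latent. It is a FiLM-style stand-in under stated assumptions, not the DAFF implementation: the names `timestep_embedding` and `modulated_fusion`, the per-channel linear projections `w_scale` and `w_shift`, and the additive fusion rule are all hypothetical.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a diffusion timestep (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def modulated_fusion(latent, degradation, t, w_scale, w_shift):
    """Fuse degradation features into the latent with a timestep-dependent
    scale and shift (hypothetical FiLM-style stand-in, not the actual DAFF)."""
    emb = timestep_embedding(t, len(w_scale[0]))
    fused = []
    for c, (z, d) in enumerate(zip(latent, degradation)):
        # Per-channel linear projections of the timestep embedding.
        scale = sum(w * e for w, e in zip(w_scale[c], emb))
        shift = sum(w * e for w, e in zip(w_shift[c], emb))
        fused.append(z + (1.0 + scale) * d + shift)
    return fused


latent = [0.5, -1.0]        # two latent channels
degradation = [0.2, 0.4]    # matching degradation features
w_scale = [[0.1] * 4, [0.1] * 4]
w_shift = [[0.0] * 4, [0.0] * 4]
fused_early = modulated_fusion(latent, degradation, 10, w_scale, w_shift)
fused_late = modulated_fusion(latent, degradation, 900, w_scale, w_shift)
```

The same degradation features contribute differently at timesteps 10 and 900, which is the property a timestep-agnostic attention block such as RSA lacks.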

Fusion Strategy PSNR $\uparrow$ MUSIQ $\uparrow$
No fusion 23.12 41.31
Single-stream only 25.39 49.26
Double-stream only 25.84 51.02
Residual Self-Attention (RSA) 24.08 48.11
Ours (Full DAFF) 27.14 61.35
Table 4: DAFF ablation results averaged over five tasks. Fusion of double-stream and single-stream modules with timestep-aware modulation achieves the best performance.
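The gains over the no-fusion baseline can be read directly from Table 4; the short script below simply performs that subtraction. The values are copied from the table, while the dictionary layout and variable names are illustrative.

```python
# Table 4 values (averaged over five tasks): strategy -> (PSNR in dB, MUSIQ)
table4 = {
    "No fusion": (23.12, 41.31),
    "Single-stream only": (25.39, 49.26),
    "Double-stream only": (25.84, 51.02),
    "Residual Self-Attention (RSA)": (24.08, 48.11),
    "Ours (Full DAFF)": (27.14, 61.35),
}

base_psnr, base_musiq = table4["No fusion"]
gains = {
    name: (round(p - base_psnr, 2), round(m - base_musiq, 2))
    for name, (p, m) in table4.items()
    if name != "No fusion"
}

for name, (d_psnr, d_musiq) in gains.items():
    print(f"{name}: +{d_psnr:.2f} dB PSNR, +{d_musiq:.2f} MUSIQ")
```

The full DAFF improves PSNR by 4.02 dB over the baseline, compared with 2.27 dB and 2.72 dB for the single-stream and double-stream variants, so the combined design gains more than either branch alone.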

Effectiveness of the DAFF

To further validate the overall effectiveness of the DAFF module, we compare different configurations with and without DAFF under both task prompt-enabled and prompt-free settings. As shown in Table 5, incorporating task prompts alone provides semantic guidance and improves performance in globally homogeneous degradations. However, it remains insufficient for handling spatially localized degradations such as rain or shadow. By contrast, adding DAFF enables dynamic interaction between degradation cues and evolving latent features, resulting in significantly improved restoration fidelity. Notably, the combination of DAFF and task prompt yields the best performance, demonstrating their complementarity: task prompts offer global semantics, while DAFF contributes fine-grained degradation awareness. These results highlight that DAFF is not only effective independently, but also synergizes well with prompt-based guidance to improve robustness and quality across diverse degradation scenarios.

DAFF Task Prompt DAEM PSNR $\uparrow$ MUSIQ $\uparrow$
23.12 41.31
25.87 50.12
27.14 61.35
27.36 62.77
30.27 63.06
Table 5: Component-wise ablation study, averaged over five restoration tasks.
Figure 6: Effectiveness of the DAEM.

Effectiveness of the DAEM

While DAFF improves alignment and global consistency, fine details such as text and facial contours are often inaccurately restored. To address this, we incorporate the DAEM, which selectively enhances high-frequency structures using a dynamic mixture-of-experts design. Each expert branch captures distinct local patterns via diverse receptive fields, and a sparse gating mechanism routes features to the most suitable expert based on local context. As shown in Table 5, introducing DAEM on top of DAFF and prompt guidance yields a significant boost in PSNR and MUSIQ. Figure 6 further shows that DAEM visibly reduces residual artifacts and sharpens fine structures. These results confirm the critical role of DAEM in refining details and complementing DAFF’s structural fusion.
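The sparse routing idea can be illustrated with a generic top-k mixture-of-experts sketch. This is not the DAEM implementation: the scalar feature, the toy expert functions, the hand-set gate scores, and the names `softmax` and `sparse_moe` are assumptions chosen only to show how a gate selects and mixes a subset of experts.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_moe(feature, gate_scores, experts, top_k=1):
    """Route one local feature to its top-k experts and mix their outputs.

    Only the selected experts are evaluated; the gate weights are
    renormalized over the selected subset.
    """
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])
    mixed = sum(w * experts[i](feature) for w, i in zip(weights, chosen))
    return mixed, chosen


# Toy experts standing in for branches with different receptive fields.
experts = [lambda x: x, lambda x: 2.0 * x, lambda x: x + 1.0]
gate_scores = [0.1, 2.0, -1.0]  # in a real model these come from a learned gate
output, chosen = sparse_moe(3.0, gate_scores, experts, top_k=1)
output2, chosen2 = sparse_moe(3.0, gate_scores, experts, top_k=2)
```

With `top_k=1` the feature is handled entirely by the highest-scoring expert, whereas `top_k=2` blends the two best experts in proportion to their renormalized gate weights.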

Conclusion

In this paper, we present a novel All-in-One image restoration framework built upon LDMs. To address the limitations of prompt-conditioned diffusion in modeling diverse degradation types, we introduce a structure-aware guidance mechanism through the DAFF module, enabling implicit adaptation to various degradations. Additionally, the proposed DAEM enhances fine-detail recovery and structural consistency in the decoding stage. Extensive experiments across multi-task and mixed degradation settings demonstrate the superiority of our method.

References

  • Ai et al. (2024) Ai, Y.; Huang, H.; Zhou, X.; Wang, J.; and He, R. 2024. Multimodal prompt perceiver: Empower adaptiveness generalizability and fidelity for all-in-one image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 25432–25444.
  • Arbelaez et al. (2010) Arbelaez, P.; Maire, M.; Fowlkes, C.; and Malik, J. 2010. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5): 898–916.
  • Chen, Pan, and Dong (2025) Chen, J.; Pan, J.; and Dong, J. 2025. Faithdiff: Unleashing diffusion priors for faithful image super-resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, 28188–28197.
  • Chen et al. (2022a) Chen, L.; Chu, X.; Zhang, X.; and Sun, J. 2022a. Simple baselines for image restoration. In European conference on computer vision, 17–33. Springer.
  • Chen et al. (2022b) Chen, W.-T.; Huang, Z.-K.; Tsai, C.-C.; Yang, H.-H.; Ding, J.-J.; and Kuo, S.-Y. 2022b. Learning multiple adverse weather removal via two-stage knowledge learning and multi-contrastive regularization: Toward a unified model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 17653–17662.
  • Cui et al. (2024) Cui, Y.; Zamir, S. W.; Khan, S.; Knoll, A.; Shah, M.; and Khan, F. S. 2024. Adair: Adaptive all-in-one image restoration via frequency mining and modulation. arXiv preprint arXiv:2403.14614.
  • Dai et al. (2023) Dai, X.; Hou, J.; Ma, C.-Y.; Tsai, S.; Wang, J.; Wang, R.; Zhang, P.; Vandenhende, S.; Wang, X.; Dubey, A.; et al. 2023. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807.
  • Ding et al. (2020) Ding, K.; Ma, K.; Wang, S.; and Simoncelli, E. P. 2020. Image quality assessment: Unifying structure and texture similarity. IEEE transactions on pattern analysis and machine intelligence, 44(5): 2567–2581.
  • Duan et al. (2024) Duan, H.; Min, X.; Wu, S.; Shen, W.; and Zhai, G. 2024. Uniprocessor: a text-induced unified low-level image processor. In European Conference on Computer Vision, 180–199. Springer.
  • Guo et al. (2024) Guo, Y.; Gao, Y.; Lu, Y.; Zhu, H.; Liu, R. W.; and He, S. 2024. Onerestore: A universal restoration framework for composite degradation. In European conference on computer vision, 255–272. Springer.
  • Jiang et al. (2024a) Jiang, J.; Zuo, Z.; Wu, G.; Jiang, K.; and Liu, X. 2024a. A survey on all-in-one image restoration: Taxonomy, evaluation and future trends. arXiv preprint arXiv:2410.15067.
  • Jiang et al. (2024b) Jiang, Y.; Zhang, Z.; Xue, T.; and Gu, J. 2024b. Autodir: Automatic all-in-one image restoration with latent diffusion. In European Conference on Computer Vision, 340–359. Springer.
  • Ke et al. (2021) Ke, J.; Wang, Q.; Wang, Y.; Milanfar, P.; and Yang, F. 2021. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, 5148–5157.
  • Kingma, Welling et al. (2013) Kingma, D. P.; Welling, M.; et al. 2013. Auto-encoding variational bayes.
  • Labs et al. (2025) Labs, B. F.; Batifol, S.; Blattmann, A.; Boesel, F.; Consul, S.; Diagne, C.; Dockhorn, T.; English, J.; English, Z.; Esser, P.; Kulal, S.; Lacey, K.; Levi, Y.; Li, C.; Lorenz, D.; Müller, J.; Podell, D.; Rombach, R.; Saini, H.; Sauer, A.; and Smith, L. 2025. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. arXiv:2506.15742.
  • Li et al. (2022) Li, B.; Liu, X.; Hu, P.; Wu, Z.; Lv, J.; and Peng, X. 2022. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 17452–17462.
  • Li et al. (2018) Li, B.; Ren, W.; Fu, D.; Tao, D.; Feng, D.; Zeng, W.; and Wang, Z. 2018. Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing, 28(1): 492–505.
  • Luo et al. (2023) Luo, Z.; Gustafsson, F. K.; Zhao, Z.; Sjölund, J.; and Schön, T. B. 2023. Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018.
  • Ma et al. (2016) Ma, K.; Duanmu, Z.; Wu, Q.; Wang, Z.; Yong, H.; Li, H.; and Zhang, L. 2016. Waterloo exploration database: New challenges for image quality assessment models. IEEE Transactions on Image Processing, 26(2): 1004–1016.
  • Martin et al. (2001) Martin, D.; Fowlkes, C.; Tal, D.; and Malik, J. 2001. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings eighth IEEE international conference on computer vision. ICCV 2001, volume 2, 416–423. IEEE.
  • Nah, Hyun Kim, and Mu Lee (2017) Nah, S.; Hyun Kim, T.; and Mu Lee, K. 2017. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3883–3891.
  • Podell et al. (2023) Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; and Rombach, R. 2023. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
  • Potlapalli et al. (2023) Potlapalli, V.; Zamir, S. W.; Khan, S. H.; and Shahbaz Khan, F. 2023. Promptir: Prompting for all-in-one image restoration. Advances in Neural Information Processing Systems, 36: 71275–71293.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
  • Riquelme et al. (2021) Riquelme, C.; Puigcerver, J.; Mustafa, B.; Neumann, M.; Jenatton, R.; Susano Pinto, A.; Keysers, D.; and Houlsby, N. 2021. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34: 8583–8595.
  • Tian et al. (2025) Tian, X.; Liao, X.; Liu, X.; Li, M.; and Ren, C. 2025. Degradation-Aware Feature Perturbation for All-in-One Image Restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, 28165–28175.
  • Valanarasu, Yasarla, and Patel (2022) Valanarasu, J. M. J.; Yasarla, R.; and Patel, V. M. 2022. Transweather: Transformer-based restoration of images degraded by adverse weather conditions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2353–2363.
  • Wang, Chan, and Loy (2023) Wang, J.; Chan, K. C.; and Loy, C. C. 2023. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence, volume 37, 2555–2563.
  • Wang et al. (2004) Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4): 600–612.
  • Wei et al. (2018) Wei, C.; Wang, W.; Yang, W.; and Liu, J. 2018. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560.
  • Yan et al. (2025) Yan, Q.; Jiang, A.; Chen, K.; Peng, L.; Yi, Q.; and Zhang, C. 2025. Textual prompt guided image restoration. Engineering Applications of Artificial Intelligence, 155: 110981.
  • Yang et al. (2022) Yang, S.; Wu, T.; Shi, S.; Lao, S.; Gong, Y.; Cao, M.; Wang, J.; and Yang, Y. 2022. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 1191–1200.
  • Yang et al. (2017) Yang, W.; Tan, R. T.; Feng, J.; Liu, J.; Guo, Z.; and Yan, S. 2017. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1357–1366.
  • Zamfir et al. (2025) Zamfir, E.; Wu, Z.; Mehta, N.; Tan, Y.; Paudel, D. P.; Zhang, Y.; and Timofte, R. 2025. Complexity experts are task-discriminative learners for any image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, 12753–12763.
  • Zamir et al. (2022) Zamir, S. W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F. S.; and Yang, M.-H. 2022. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5728–5739.
  • Zeng et al. (2025) Zeng, H.; Wang, X.; Chen, Y.; Su, J.; and Liu, J. 2025. Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks. In Proceedings of the Computer Vision and Pattern Recognition Conference, 7524–7533.
  • Zhang, Zhang, and Bovik (2015) Zhang, L.; Zhang, L.; and Bovik, A. C. 2015. A feature-enriched completely blind image quality evaluator. IEEE Transactions on Image Processing, 24(8): 2579–2591.
  • Zhang et al. (2018) Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang, O. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, 586–595.
  • Zheng et al. (2024) Zheng, D.; Wu, X.-M.; Yang, S.; Zhang, J.; Hu, J.-F.; and Zheng, W.-S. 2024. Selective hourglass mapping for universal image restoration based on diffusion model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 25445–25455.
  • Zhu et al. (2023) Zhu, Z.; Feng, X.; Chen, D.; Bao, J.; Wang, L.; Chen, Y.; Yuan, L.; and Hua, G. 2023. Designing a better asymmetric vqgan for stablediffusion. arXiv preprint arXiv:2306.04632.