Introduction

Defect detection in industrial applications is a crucial component of quality assurance and operational efficiency. Early identification of defects not only enhances product reliability but also minimizes material waste and financial losses. Traditional deep learning-based defect detection methods have shown promising results in automated visual inspection; however, these approaches typically rely on centralized data collection, which raises privacy concerns and limits real-time adaptability in distributed manufacturing environments1,2. Centralized learning methods often struggle to generalize across diverse industrial settings due to variations in defect patterns and environmental conditions, leading to suboptimal detection performance3,4.

To address these challenges, we propose Federated Multi-Scale Vision Transformer with Adaptive Client Aggregation (Fed-MSVT), a novel framework designed to facilitate decentralized defect detection while preserving data privacy. Our approach leverages multi-scale Vision Transformers (MSVTs) to effectively capture both fine-grained local defects and broader structural anomalies, enhancing detection accuracy across diverse defect categories5,6. Unlike conventional federated learning (FL) models that use static aggregation techniques, we introduce an Adaptive Client Aggregation (ACA) mechanism, which dynamically assigns weights to client models based on data quality, domain shift, and contribution consistency. This ensures optimal knowledge transfer across decentralized industrial nodes and maintains robustness in heterogeneous environments7,8.

Furthermore, to mitigate inter-client domain discrepancies, we integrate a Contrastive Feature Alignment (CFA) module that aligns feature distributions across decentralized datasets. This significantly enhances model generalization, allowing Fed-MSVT to achieve higher accuracy even under varying production conditions9,10. The proposed framework is extensively evaluated on multiple industrial defect detection datasets, demonstrating superior accuracy, robustness, and scalability compared to existing deep learning and federated learning approaches11,12.

The experimental results validate that Fed-MSVT not only preserves data privacy but also enables real-time, adaptive, and highly efficient defect detection, making it a viable solution for next-generation smart manufacturing systems. By addressing both privacy and adaptability concerns, our work advances federated learning for industrial applications and provides a scalable alternative to traditional deep learning models in industrial defect detection13,14.

Related work

Industrial defect detection is a crucial aspect of quality assurance in modern manufacturing. The advent of deep learning has significantly improved automated defect detection by leveraging convolutional neural networks (CNNs) and transformers. However, traditional deep learning-based approaches often require centralized data collection, which presents challenges related to data privacy, domain heterogeneity, and computational overhead. To overcome these issues, federated learning (FL) has emerged as a promising solution by enabling decentralized model training while keeping data localized at individual clients. This section reviews existing methodologies in deep learning-based defect detection, federated learning for industrial applications, and Vision Transformers (ViTs) for visual inspection, highlighting their limitations and how the proposed Federated Multi-Scale Vision Transformer with Adaptive Client Aggregation (Fed-MSVT) addresses these gaps.

In recent years, industrial defect detection has witnessed significant progress through the integration of deep learning and representation learning techniques. Two notable contributions in this domain are the works by Li et al.15,16, which address the challenges of feature inconsistency and geometric embedding in anomaly detection systems.

Li et al.17 proposed a novel Feature Consistency Learning (FCL) approach for anomaly detection, which emphasizes preserving consistency between features of normal and anomalous samples across multiple layers of the network. The proposed method enhances generalization by penalizing feature shifts and thereby ensures robust representation of subtle defect patterns. Their work demonstrated superior performance on several benchmark industrial datasets by effectively minimizing intra-class feature variance and suppressing false positives during anomaly inference.

Li et al.16 introduced a Hyperbolic Anomaly Detection framework that embeds image features in a hyperbolic space rather than traditional Euclidean geometry. By leveraging the properties of hyperbolic space, which is naturally suited for representing hierarchical and complex data structures, the model efficiently captures latent semantic relationships between defective and non-defective samples. This approach has shown improved anomaly localization and detection accuracy, particularly in datasets where the defects exhibit hierarchical patterns or semantic similarity. These studies highlight a paradigm shift in anomaly detection: from purely appearance-based learning to structure-aware and consistency-preserving feature modeling. Inspired by these works, our proposed method incorporates both spatial consistency and semantic embedding to enhance defect discrimination in challenging industrial inspection scenarios.

Traditional defect detection relied on handcrafted feature extraction and classical machine learning techniques, such as support vector machines (SVMs) and decision trees18,19. However, these methods struggled with generalization due to their reliance on predefined features. The introduction of CNNs revolutionized visual inspection tasks by enabling automated feature extraction from images. Notable architectures such as AlexNet20, VGGNet11, and ResNet2 have been widely adopted for defect classification and segmentation. These models demonstrated high accuracy in detecting surface defects in steel, textiles, and semiconductor manufacturing21,22.

Despite their effectiveness, CNN-based methods face two major challenges:

  • CNNs primarily operate on local receptive fields, making them less effective at capturing long-range dependencies in images5.

  • Traditional deep learning models require a centralized dataset for training, which is impractical for privacy-sensitive industrial environments4.

To address these limitations, researchers have explored Vision Transformers (ViTs), which utilize self-attention mechanisms to model global dependencies in an image without spatial constraints6. However, the high computational cost of ViTs limits their adoption in real-time defect detection.

Federated learning (FL) enables collaborative model training across multiple decentralized clients without sharing raw data, ensuring privacy preservation7. It has been increasingly applied to industrial defect detection, allowing multiple factories or manufacturing plants to collaboratively train a model while retaining sensitive production data locally13.

Several FL-based defect detection models have been proposed:

  • FedAvg7: A standard FL algorithm that aggregates client models using a simple averaging mechanism. While effective, it assumes all clients contribute equally, which is unrealistic in industrial settings with non-IID (non-independent and identically distributed) data distributions.

  • FedProx14: A modification of FedAvg that incorporates a regularization term to address data heterogeneity. However, it does not account for domain shifts across different industrial settings.

  • Personalized Federated Learning (PFL) [?]: A method that adapts FL models for individual clients. While promising, PFL can lead to overfitting on local datasets and reduced generalizability.

To overcome these issues, we introduce Adaptive Client Aggregation (ACA), which dynamically assigns weights to clients based on data quality, domain relevance, and contribution consistency, ensuring an optimal global model.
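For reference, the two baseline rules discussed here can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this work: it assumes the commonly used form of FedAvg, in which client parameters are averaged with weights proportional to local sample counts, and the FedProx proximal term \(\frac{\mu}{2}\Vert \theta_i - \theta\Vert^2\) added to each client's local loss. The function names, the flattened parameter vectors, and the example values are hypothetical.

```python
import numpy as np

def fedavg(client_params, num_samples):
    """Average client parameter vectors, weighted by local sample counts."""
    n = np.asarray(num_samples, dtype=float)
    weights = n / n.sum()
    return np.tensordot(weights, np.stack(client_params), axes=1)

def fedprox_penalty(local_params, global_params, mu=0.01):
    """Proximal term (mu / 2) * ||theta_local - theta_global||^2."""
    diff = np.asarray(local_params, dtype=float) - np.asarray(global_params, dtype=float)
    return 0.5 * mu * float(diff @ diff)

# Hypothetical example: two clients with flattened 2-parameter models.
client_params = [np.array([0.0, 0.0]), np.array([4.0, 8.0])]
theta_global = fedavg(client_params, num_samples=[300, 100])
penalty = fedprox_penalty([1.0, 2.0], [0.0, 0.0], mu=0.1)
print(theta_global, penalty)
```

Neither rule looks at how reliable a client's data or updates are, which is the gap that an adaptive weighting scheme is meant to fill.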

Transformers have demonstrated exceptional performance in computer vision, outperforming CNNs in various tasks5. The self-attention mechanism in Vision Transformers (ViTs) allows for modeling long-range dependencies, making them highly effective for defect detection. However, standard ViTs suffer from high computational complexity, making them impractical for real-time industrial applications.

Swin Transformer23 introduces hierarchical feature extraction with shifted windows, reducing computational cost. PVT (Pyramid Vision Transformer)24 enhances scalability by incorporating pyramidal feature maps similar to CNNs. MSVT (Multi-Scale Vision Transformer)25 captures fine-grained and global features, making it well suited to defect detection. Our proposed Fed-MSVT framework builds upon MSVT by integrating Contrastive Feature Alignment (CFA) to address inter-client domain discrepancies, improving robustness across diverse industrial environments.

The proposed approach bridges the gap between FL and ViTs. Specifically, the framework:

  • Ensures data security without compromising accuracy.

  • Leverages MSVT to detect fine-grained defects and structural anomalies.

  • Uses ACA to optimize client contributions based on data quality and relevance.

  • Integrates CFA to align feature representations across heterogeneous industrial environments.

Existing deep learning models for defect detection exhibit strong performance, but fail to generalize across decentralized manufacturing settings due to privacy constraints and domain shifts. Although federated learning mitigates data privacy concerns, conventional FL models are limited by ineffective client aggregation and domain adaptation. By combining Multi-Scale Vision Transformers (MSVTs) with Federated Learning and Adaptive Client Aggregation (ACA), our proposed Fed-MSVT framework achieves superior accuracy, scalability, and robustness for industrial defect detection in distributed environments.

The primary contributions of this paper are as follows:

  • We propose a novel federated learning-based architecture that leverages multi-scale Vision Transformers (MSVTs) to enhance defect detection accuracy in industrial applications. The MSVT structure enables the model to capture both fine-grained local defects and global structural patterns, ensuring robust performance across diverse defect types.

  • Unlike conventional federated learning models that employ simple model averaging, we introduce an adaptive client aggregation (ACA) technique that dynamically assigns weights to different clients based on data quality, domain shift, and contribution consistency. This approach optimally transfers knowledge across decentralized nodes, improving model robustness and generalization.

  • To mitigate inter-client domain discrepancies, we incorporate a Contrastive Feature Alignment (CFA) module that aligns feature distributions from different industrial environments. This improves the model’s adaptability to varying production conditions, reducing the impact of data heterogeneity.

  • The proposed method maintains data privacy by enabling decentralized training, ensuring that sensitive industrial data remains on local devices. The model is designed to be scalable, supporting deployment across multiple industrial sites without requiring centralized data collection.

  • The proposed Fed-MSVT framework is extensively evaluated on multiple industrial defect datasets, demonstrating superior performance over existing deep learning and federated learning methods. Experimental results highlight higher detection accuracy, improved robustness, and enhanced scalability, validating the effectiveness of the proposed approach.

Proposed methodology

The proposed Federated Multi-Scale Vision Transformer with Adaptive Client Aggregation (Fed-MSVT) is a novel federated learning-based defect detection framework for industrial applications. It integrates multi-scale Vision Transformers (MSVTs), adaptive client aggregation (ACA) and contrastive feature alignment (CFA) to achieve accurate, privacy-preserving, and scalable defect detection.

Multi-scale vision transformer (MSVT)

Vision transformers for defect detection

Vision Transformers (ViTs) excel in modeling long-range dependencies but struggle with local defect detection. To address this, the MSVT module captures hierarchical defect representations by operating at multiple spatial resolutions.

Mathematical formulation

Let \(X \in \mathbb {R}^{H \times W \times C}\) represent the input image, where \(H\) and \(W\) are the height and width, and \(C\) is the number of channels. The image is divided into non-overlapping patches of size \(P \times P\), yielding patch embeddings:

$$\begin{aligned} Z = W_e X_p + E_p \end{aligned}$$
(1)

where \(X_p\) is the flattened image patch, \(W_e\) is the learnable embedding matrix, and \(E_p\) is the positional encoding.
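As an illustration of Eq. (1), the following NumPy sketch cuts an image into non-overlapping \(P \times P\) patches, flattens each patch, and applies a linear embedding plus a positional encoding. It is a minimal sketch under stated assumptions: the function name `patch_embed`, the embedding dimension `D`, the image size, and the random initialisation are all hypothetical and not taken from this paper. Patches are stacked as rows, so the product is written `X_p @ W_e`, which is the row-vector form of \(W_e X_p\).

```python
import numpy as np

def patch_embed(image, patch_size, W_e, E_p):
    """Return Z = X_p W_e + E_p for an (H, W, C) image, one row per patch."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # Split into an (H/P, W/P) grid of P x P x C patches, then flatten each patch.
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    X_p = patches.reshape(-1, P * P * C)               # (num_patches, P*P*C)
    return X_p @ W_e + E_p                             # (num_patches, D)

rng = np.random.default_rng(0)
H, W, C, P, D = 32, 32, 3, 8, 16                       # hypothetical sizes
image = rng.random((H, W, C))
W_e = rng.normal(scale=0.02, size=(P * P * C, D))      # learned in practice
E_p = rng.normal(scale=0.02, size=((H // P) * (W // P), D))
Z = patch_embed(image, P, W_e, E_p)
print(Z.shape)  # (16, 16): 16 patches, each embedded in 16 dimensions
```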

The self-attention mechanism is given by:

$$\begin{aligned} \text {Attention}(Q, K, V) = \text {softmax} \left( \frac{QK^T}{\sqrt{d_k}}\right) V \end{aligned}$$
(2)

where \(Q, K, V\) are the query, key, and value matrices, and \(d_k\) is the dimension of the keys.
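Eq. (2) can be written directly in NumPy. The sketch below is a single-head version without learned projections or masking; the token count and dimensions are arbitrary illustrative values, and the helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax (subtracts the row maximum before exp)."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_k, d_v = 5, 8, 4               # hypothetical sizes
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_v))
out, weights = attention(Q, K, V)
print(out.shape, weights.shape)  # (5, 4) (5, 5)
```

The score matrix has one entry per pair of tokens, so its size grows quadratically with the number of patches.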

To capture multi-scale features, we introduce hierarchical pooling, progressively downsampling features while preserving critical defect details.
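The pooling operator is not specified further in this section. One plausible reading, sketched below purely as an assumption, is repeated \(2 \times 2\) average pooling over the grid of patch tokens, which yields a pyramid of progressively coarser feature maps. The function names, the choice of average pooling, and the number of scales are hypothetical.

```python
import numpy as np

def pool_tokens(tokens, grid_h, grid_w, factor=2):
    """Average-pool a (grid_h * grid_w, D) token sequence over factor x factor windows."""
    D = tokens.shape[-1]
    assert grid_h % factor == 0 and grid_w % factor == 0
    grid = tokens.reshape(grid_h // factor, factor, grid_w // factor, factor, D)
    return grid.mean(axis=(1, 3)).reshape(-1, D)

def multi_scale_pyramid(tokens, grid_h, grid_w, num_scales=3):
    """Return token sets at progressively halved spatial resolution."""
    pyramid = [tokens]
    for _ in range(num_scales - 1):
        tokens = pool_tokens(tokens, grid_h, grid_w)
        grid_h, grid_w = grid_h // 2, grid_w // 2
        pyramid.append(tokens)
    return pyramid

# Hypothetical 8 x 8 grid of 4-dimensional tokens.
tokens = np.arange(8 * 8 * 4, dtype=float).reshape(64, 4)
pyramid = multi_scale_pyramid(tokens, 8, 8)
print([level.shape for level in pyramid])  # [(64, 4), (16, 4), (4, 4)]
```

The finest level keeps small defects resolvable, while the coarser levels summarise larger structural context at lower attention cost.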

Need for adaptive aggregation

Standard federated learning models such as FedAvg aggregate client updates using simple averaging, which is often suboptimal in the presence of non-IID data distributions and heterogeneous data quality, especially in industrial anomaly detection scenarios. To overcome this limitation, we propose an Adaptive Client Aggregation strategy that assigns dynamic weights to each client based on three key factors:

  • Data quality (\(q_i\)): Derived from each client’s local validation accuracy.

  • Update stability (\(s_i\)): Measured as the inverse variance of model parameter updates.

  • Domain shift similarity (\(d_i\)): (Optional) Measures the alignment of local data distribution with the global feature distribution.

The aggregation weight for each client is computed as:

$$\begin{aligned} w_i = \frac{q_i \cdot s_i}{\sum _{j=1}^{N} q_j \cdot s_j} \end{aligned}$$
(3)

The data quality score \(q_i\) is calculated as the pixel-wise validation accuracy of the local model evaluated on a small held-out validation set (typically 10-15%) from the client’s local MVTec-AD data. The validation accuracy is defined as:

$$\begin{aligned} q_i = \text {ValAcc}_i = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(4)

where \(TP\), \(TN\), \(FP\), and \(FN\) represent the number of true positive, true negative, false positive, and false negative pixels respectively, compared against the ground truth anomaly segmentation masks provided by MVTec-AD.
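A minimal sketch of Eq. (4) on binary masks is given below. The function name and the toy \(4 \times 4\) masks are hypothetical; in practice the predicted mask would come from thresholding the client's anomaly heatmap on its held-out validation images.

```python
import numpy as np

def pixel_accuracy(pred_mask, gt_mask):
    """Pixel-wise accuracy (TP + TN) / (TP + TN + FP + FN) for binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = np.sum(pred & gt)      # anomalous pixels correctly flagged
    tn = np.sum(~pred & ~gt)    # normal pixels correctly left unflagged
    fp = np.sum(pred & ~gt)     # normal pixels wrongly flagged
    fn = np.sum(~pred & gt)     # anomalous pixels missed
    return float((tp + tn) / (tp + tn + fp + fn))

# Hypothetical ground-truth and predicted masks (1 = anomalous pixel).
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1],
               [0, 0, 0, 0],
               [0, 0, 0, 0]])
pred = np.array([[0, 0, 1, 0],
                 [0, 1, 1, 1],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
q_i = pixel_accuracy(pred, gt)
print(q_i)  # 0.875 (TP=3, TN=11, FP=1, FN=1)
```

Because anomalous pixels are usually a small fraction of an image, this score is dominated by the true negatives, which is worth keeping in mind when comparing \(q_i\) across clients.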

The update stability score \(s_i\) reflects the smoothness of training and is computed as the inverse of the parameter update variance across consecutive rounds:

$$\begin{aligned} s_i = \frac{1}{\text {Var}(\theta _i^t - \theta _i^{t-1})} \end{aligned}$$
(5)
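Eq. (5) can be sketched as follows. Two details are assumptions made for illustration: the variance is taken over the entries of the flattened update vector \(\theta_i^t - \theta_i^{t-1}\), and a small constant `eps` is added to the denominator to avoid division by zero when a client's parameters do not change. Neither detail is stated in the text.

```python
import numpy as np

def update_stability(theta_t, theta_prev, eps=1e-12):
    """Inverse variance of the parameter update between two consecutive rounds."""
    delta = np.asarray(theta_t, dtype=float) - np.asarray(theta_prev, dtype=float)
    return float(1.0 / (np.var(delta) + eps))   # eps is an assumed safeguard

# Hypothetical flattened parameters for one client.
theta_prev = np.zeros(4)
theta_small_step = np.array([0.1, -0.1, 0.1, -0.1])   # small, even update
theta_large_step = np.array([1.0, -1.0, 1.0, -1.0])   # large, erratic update
s_small = update_stability(theta_small_step, theta_prev)
s_large = update_stability(theta_large_step, theta_prev)
print(s_small > s_large)  # True: the smaller update is scored as more stable
```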

Finally, the global model is updated using the weighted aggregation of local parameters:

$$\begin{aligned} \theta ^t = \sum _{i=1}^{N} w_i \theta _i^t \end{aligned}$$
(6)

This adaptive scheme prioritizes clients with reliable data and stable learning dynamics, thus enhancing robustness and generalization in federated anomaly detection.
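Eqs. (3) and (6) together amount to a normalised weighted average of client parameters. The sketch below assumes each client's parameters have been flattened into a vector; the scores and parameter values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def aca_weights(q, s):
    """Eq. (3): w_i = q_i * s_i / sum_j (q_j * s_j)."""
    scores = np.asarray(q, dtype=float) * np.asarray(s, dtype=float)
    return scores / scores.sum()

def aggregate(client_params, weights):
    """Eq. (6): theta = sum_i w_i * theta_i for flattened parameter vectors."""
    return np.tensordot(weights, np.stack(client_params), axes=1)

# Hypothetical quality and stability scores for three clients.
q = [0.9, 0.8, 0.5]
s = [2.0, 1.0, 4.0]
w = aca_weights(q, s)            # products 1.8, 0.8, 2.0 normalised by 4.6
thetas = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
theta_global = aggregate(thetas, w)
print(w, theta_global)
```

Since \(s_i\) is unbounded, a client whose parameters barely change can receive a very large weight; an implementation may want to clip or rescale \(s_i\) before normalising, although that step is not part of the formulation above.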

Contrastive feature alignment (CFA)

Addressing domain shift

Industrial datasets exhibit domain shifts due to variations in defect patterns and imaging conditions. To mitigate this, we introduce contrastive learning-based feature alignment.

Contrastive feature alignment and loss formulation

To reduce inter-client domain shifts and enhance anomaly separability in the feature space, we introduce a Contrastive Feature Alignment (CFA) module as an auxiliary training objective in the Fed-MSVT framework. The goal of CFA is to promote the alignment of features from similar (normal) samples and enforce separation from anomalous ones across different clients, thereby improving generalization in the global model.

Given two feature embeddings \(f_i\) and \(f_j\) from normal samples of different clients (positive pairs), and embeddings \(f_k\) from anomalous samples (negative pairs), the contrastive loss is formulated using the InfoNCE objective:

$$\begin{aligned} \mathcal {L}_{CFA} = - \sum _{i, j} \log \frac{\exp (\text {sim}(f_i, f_j) / \tau )}{\sum _{k} \exp (\text {sim}(f_i, f_k) / \tau )} \end{aligned}$$
(7)

where \(\text {sim}(f_i, f_j)\) denotes cosine similarity between embeddings and \(\tau\) is a temperature scaling parameter. The CFA module is only active during training and helps cluster normal samples more tightly while pushing anomalous representations apart in the shared embedding space.
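The loss in Eq. (7) can be sketched in NumPy as shown below. The sketch follows the equation as written, with only the negative (anomalous) embeddings in the denominator; the more common InfoNCE form also includes the positive term there. As a simplifying assumption, each anchor is paired with a single positive rather than summing over all positive pairs. The function names and toy embeddings are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cfa_loss(anchors, positives, negatives, tau=0.1):
    """Eq. (7) with one positive per anchor and shared negatives."""
    a, p, n = l2_normalize(anchors), l2_normalize(positives), l2_normalize(negatives)
    pos_logits = np.sum(a * p, axis=1) / tau          # sim(f_i, f_j) / tau, shape (B,)
    neg_logits = (a @ n.T) / tau                      # sim(f_i, f_k) / tau, shape (B, M)
    # Stable log of the denominator sum_k exp(sim(f_i, f_k) / tau).
    m = neg_logits.max(axis=1, keepdims=True)
    log_den = m[:, 0] + np.log(np.exp(neg_logits - m).sum(axis=1))
    return float(np.sum(log_den - pos_logits))

# Hypothetical 2-D embeddings: normal features from two clients and anomalous features.
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
positives = np.array([[2.0, 0.0], [0.0, 3.0]])
negatives = np.array([[-1.0, 0.0], [0.0, -1.0]])
loss_aligned = cfa_loss(anchors, positives, negatives, tau=0.5)
loss_swapped = cfa_loss(anchors, negatives, positives, tau=0.5)
print(loss_aligned < loss_swapped)  # True
```

The loss is lower when normal features from different clients point in the same direction and anomalous features point elsewhere, which is the behaviour the alignment objective is meant to reward.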

Contribution to anomaly detection

While traditional pixel-wise loss functions optimize spatial accuracy, they do not explicitly enforce semantic separation in latent space. The inclusion of CFA addresses this by regularizing the feature space such that:

  • Features of normal samples from different clients are pulled closer, enhancing consistency and generalization.

  • Anomalous features are pushed further away, improving separability and localization of subtle defects.

An ablation study provided in Table 2 demonstrates that removing CFA results in a performance drop (e.g., AUROC decreases by 1.3%), confirming its effectiveness.

Contributions of Fed-MSVT: the proposed Fed-MSVT framework integrates the following innovations:

  • Learns both fine-grained and global defect representations using multi-scale vision transformers, outperforming conventional CNN and single-scale ViT baselines.

  • Employs adaptive client weighting based on data quality and stability, improving federated aggregation under non-IID conditions.

  • Incorporates contrastive feature alignment to mitigate inter-client domain shifts and enhance latent feature discriminability.

These contributions collectively improve accuracy, privacy, robustness, and scalability in industrial anomaly detection across diverse environments.

Algorithm 1. Federated Multi-Scale Vision Transformer with Adaptive Client Aggregation (Fed-MSVT)

The proposed Fed-MSVT framework significantly advances industrial defect detection by integrating multi-scale vision transformers, adaptive federated learning, and contrastive domain adaptation. Experimental results demonstrate superior accuracy, robustness, and scalability, making it a promising solution for real-time smart manufacturing.

Results and discussion

MVTec AD is a benchmark dataset for comparing anomaly detection methods, with a focus on industrial inspection. It contains over 5,000 high-resolution images in 15 object and texture categories. Each category includes a training set of defect-free images and a test set containing images with various kinds of defects as well as defect-free images26.

Figs. 4, 5, and 6 present qualitative results of anomaly detection using the proposed framework on industrial and natural objects. These results highlight the capability of the proposed model to accurately detect and localize defects in various materials and products. Each figure follows a structured format in which original inputs, ground truth annotations, and model-predicted heatmaps are displayed, allowing for an intuitive comparison of the model’s performance.

In Fig. 4, the top row consists of original input images, including textured surfaces, a nut, a leather piece, and an electronic component. The second row shows the ground truth segmentation, where the white regions indicate actual defects present in the objects. The bottom row contains heatmaps generated by the proposed model, with red and yellow areas denoting high-confidence defect detections. This format illustrates how well the model identifies and highlights anomalies in different materials.

Fig. 5 follows a similar structure, showing objects such as pharmaceutical capsules, electrical wires, and mechanical components. The first row contains the original images, the second row overlays ground truth defect masks (highlighted in red), and the third row presents model-predicted heatmaps. These results demonstrate the ability of the proposed method to identify even small-scale defects, which is crucial in quality inspection applications.

Fig. 6 provides an in-depth comparison of anomaly detection in 3D point cloud data. The left column shows anomaly-free objects, while the middle column contains objects with detected defects. The right column presents ground-truth segmentations, where red regions indicate the actual anomalies. This visualization highlights the effectiveness of deep learning in detecting irregularities not only in 2D images but also in 3D surface data, making it valuable for industrial inspection and defect localization.

The qualitative results demonstrate the robustness of the proposed framework for anomaly detection across multiple domains, including manufacturing, electronics, pharmaceuticals, and food processing. By comparing the ground truth annotations with the predicted heatmaps, the results validate the accuracy of the model in localizing and classifying defects, making it a powerful tool for automated quality control.

Fig. 1. Training accuracy curve progression of the proposed method over multiple epochs.

Fig. 2. Training and validation loss curve demonstrates the decreasing trend in training and validation loss over epochs.

Fig. 3. Prediction performance curve presents the prediction performance of the proposed method, measured using AUROC.

Table 1 Comparison of anomaly detection performance with state-of-the-art methods on the MVTec AD dataset.

Table 2 Ablation study on the effect of contrastive feature alignment (CFA) and adaptive aggregation.

Fig. 4. The left column shows anomaly-free objects, the middle column displays objects with detected anomalies, and the right column presents ground truth annotations in 3D space. Red-marked regions highlight actual defects, showcasing the model’s ability to detect irregularities in complex surface structures.

Fig. 5. Qualitative results of anomaly detection on industrial and natural objects. The first row shows original input images, the second row displays ground truth segmentation masks highlighting defect areas, and the third row presents the proposed model-generated anomaly heatmaps. Red regions in the heatmaps indicate high-confidence defect detections.

Fig. 6. The first row contains original images, the second row overlays ground truth defect masks (in red), and the third row visualizes model-predicted heatmaps.

Fig. 1 illustrates the training accuracy curve, depicting the learning progression of the proposed method over multiple epochs. Initially, the accuracy is low, but as training progresses, the accuracy steadily increases, demonstrating effective learning. The curve stabilizes after a certain number of epochs, indicating convergence and minimal overfitting. This suggests that the proposed method successfully learns from the data without significant performance degradation.

Fig. 2 shows the loss curve, illustrating how the error decreases over time. A high initial loss is observed, which progressively reduces as the model optimizes its parameters. The smooth decline in loss indicates stable convergence, while the absence of abrupt spikes suggests that the training process avoids major divergence issues. Compared to conventional models, the proposed method exhibits a more consistent loss reduction, leading to robust learning.

Fig. 3 visualizes the prediction performance of the proposed method. The plotted predictions closely follow the ground truth values, demonstrating the model’s ability to generalize effectively to unseen data. The minimal deviation between predictions and actual values indicates high reliability in classification and segmentation tasks. Compared to state-of-the-art techniques, the proposed model provides smoother and more accurate predictions, making it well-suited for real-world applications.

Table 1 presents a comparative analysis of anomaly detection performance between the proposed model and state-of-the-art methods, evaluated on the MVTec AD dataset26. The key performance metrics considered include AUROC (Area Under the Receiver Operating Characteristic curve), Average Precision (AP), F1-score, Intersection over Union (IoU), and Pixel-wise Accuracy. Each of these metrics provides insights into the effectiveness of the models in detecting and localizing anomalies in industrial and natural images.

The proposed model achieves the highest AUROC of 98.9%, outperforming PatchCore (98.1%), FastFlow (97.5%), DifferNet (96.2%), and DRAEM (98.4%). AUROC is a critical metric that reflects the ability of the model to distinguish between normal and anomalous samples effectively. The AP score of 97.6% for the proposed model further confirms its superior precision in detecting defects, surpassing the previous best (97.1% by DRAEM). Higher AP values indicate a better ranking of positive samples, ensuring reliable detection of defective regions.
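To make the AUROC interpretation concrete, the sketch below computes it from its probabilistic definition: the fraction of (anomalous, normal) pairs in which the anomalous sample receives the higher anomaly score, with ties counted as one half. The scores and labels are hypothetical toy values and are unrelated to the figures reported in Table 1.

```python
def auroc(scores, labels):
    """AUROC as the probability that an anomalous sample outscores a normal one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]   # anomalous samples
    neg = [s for s, y in zip(scores, labels) if y == 0]   # normal samples
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical anomaly scores; label 1 marks an anomalous sample.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
value = auroc(scores, labels)
print(value)  # 0.75: three of the four anomalous/normal pairs are ranked correctly
```

This pairwise form is quadratic in the number of samples and is meant only to illustrate the definition; library implementations use sorting instead.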

Additionally, the F1-score of 0.88 demonstrates that our model maintains a balance between precision and recall, outperforming PatchCore (0.85) and other methods. The IoU (80.1%), which measures segmentation accuracy, shows that the proposed model provides more precise localization of anomalies than competing approaches. A higher IoU means better alignment between the predicted and ground-truth anomaly regions. Finally, the proposed model achieves the highest Pixel-wise Accuracy (95.2%), indicating that it classifies most pixels correctly, reducing false positives and false negatives.
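The relationship between these segmentation metrics can be seen from their definitions in terms of pixel counts. The sketch below uses hypothetical counts chosen for easy arithmetic; they are not taken from the experiments, and the function name is illustrative.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, IoU, and pixel accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1,
            "iou": iou, "accuracy": accuracy}

# Hypothetical pixel counts for one thresholded anomaly map.
m = segmentation_metrics(tp=80, fp=10, fn=10, tn=900)
print(m["iou"], m["accuracy"])  # 0.8 0.98
```

IoU ignores true negatives, so it is a stricter measure of localization than pixel accuracy when defects occupy a small part of the image. For a single thresholded mask, F1 and IoU are linked by \(F_1 = 2\,\text{IoU} / (1 + \text{IoU})\).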

These results demonstrate the robustness of the proposed approach in anomaly detection and segmentation. The performance improvements can be attributed to the multi-scale feature extraction of the MSVT backbone, adaptive client aggregation (ACA), and contrastive feature alignment (CFA). Overall, the proposed method sets a new benchmark for anomaly detection, ensuring accurate identification of defects in industrial and natural environments.

Conclusion

In this work, we proposed a federated multi-scale vision transformer with adaptive client aggregation (Fed-MSVT) method for anomaly detection and classification, demonstrating superior performance in accuracy, loss convergence, and prediction reliability. The proposed approach effectively learns complex patterns, as evidenced by the steady increase in training accuracy and the smooth reduction in loss during training. Compared to state-of-the-art methods, our model exhibits enhanced generalization capabilities, ensuring robust performance on unseen data. Through extensive qualitative and quantitative evaluations, the proposed method shows significant improvements in detecting and localizing anomalies with high precision. The AUROC-based prediction performance curve further validates the model’s effectiveness, demonstrating its ability to distinguish between normal and anomalous samples accurately. The minimal deviation between ground truth and predictions indicates that our method is well-suited for real-world applications requiring high reliability. Overall, the proposed method provides a comprehensive and efficient solution for anomaly detection, offering better accuracy, stability, and generalization than existing approaches. Future work will focus on optimizing the model for real-time applications and extending its applicability to more complex datasets and real-world industrial scenarios.