Article
Open access
Published: 01 October 2025

Federated multi scale vision transformer with adaptive client aggregation for industrial defect detection

Scientific Reports volume 15, Article number: 34149 (2025)

Abstract

Defect detection in industrial applications is essential for maintaining product quality and operational efficiency. However, traditional deep learning methods require centralized data collection, raising privacy concerns and limiting adaptability in distributed manufacturing environments. To overcome these challenges, we propose Fed-MSVT, a Federated Multi-Scale Vision Transformer with Adaptive Client Aggregation for industrial defect detection. Our approach leverages multi-scale Vision Transformers (MSVTs) to capture both fine-grained local defects and global structural patterns, enhancing detection accuracy across diverse defect types. Unlike conventional federated learning models, we introduce an Adaptive Client Aggregation (ACA) mechanism that dynamically assigns weights to client models based on data quality, domain shift, and consistency. Additionally, a Contrastive Feature Alignment (CFA) module mitigates inter-client domain discrepancies, improving generalization. Evaluations on multiple defect datasets demonstrate superior accuracy, robustness, and scalability compared to existing approaches, enabling real-time, privacy-preserving, and adaptive defect detection for smart manufacturing systems.

Introduction

Defect detection in industrial applications is a crucial component of quality assurance and operational efficiency. Early identification of defects not only enhances product reliability but also minimizes material waste and financial losses. Traditional deep learning-based defect detection methods have shown promising results in automated visual inspection; however, these approaches typically rely on centralized data collection, which raises privacy concerns and limits real-time adaptability in distributed manufacturing environments^1,2. Centralized learning methods often struggle to generalize across diverse industrial settings due to variations in defect patterns and environmental conditions, leading to suboptimal detection performance^3,4.

To address these challenges, we propose Federated Multi-Scale Vision Transformer with Adaptive Client Aggregation (Fed-MSVT), a novel framework designed to facilitate decentralized defect detection while preserving data privacy. Our approach leverages multi-scale Vision Transformers (MSVTs) to effectively capture both fine-grained local defects and broader structural anomalies, enhancing detection accuracy across diverse defect categories^5,6. Unlike conventional federated learning (FL) models that use static aggregation techniques, we introduce an Adaptive Client Aggregation (ACA) mechanism, which dynamically assigns weights to client models based on data quality, domain shift, and contribution consistency. This ensures optimal knowledge transfer across decentralized industrial nodes and maintains robustness in heterogeneous environments^7,8.

Furthermore, to mitigate inter-client domain discrepancies, we integrate a Contrastive Feature Alignment (CFA) module that aligns feature distributions across decentralized datasets. This significantly enhances model generalization, allowing Fed-MSVT to achieve higher accuracy even under varying production conditions^9,10. The proposed framework is extensively evaluated on multiple industrial defect detection datasets, demonstrating superior accuracy, robustness, and scalability compared to existing deep learning and federated learning approaches^11,12.

The experimental results validate that Fed-MSVT not only preserves data privacy but also enables real-time, adaptive, and highly efficient defect detection, making it a viable solution for next-generation smart manufacturing systems. By addressing both privacy and adaptability concerns, our work advances federated learning for industrial applications and provides a scalable alternative to traditional deep learning models in industrial defect detection^13,14.

Related work

Industrial defect detection is a crucial aspect of quality assurance in modern manufacturing. The advent of deep learning has significantly improved automated defect detection by leveraging convolutional neural networks (CNNs) and transformers. However, traditional deep learning-based approaches often require centralized data collection, which presents challenges related to data privacy, domain heterogeneity, and computational overhead. To overcome these issues, federated learning (FL) has emerged as a promising solution by enabling decentralized model training while keeping data localized at individual clients. This section reviews existing methodologies in deep learning-based defect detection, federated learning for industrial applications, and Vision Transformers (ViTs) for visual inspection, highlighting their limitations and how the proposed Federated Multi-Scale Vision Transformer with Adaptive Client Aggregation (Fed-MSVT) addresses these gaps.

In recent years, industrial defect detection has witnessed significant progress through the integration of deep learning and representation learning techniques. Two notable contributions in this domain are the works by Li et al.^15,16, which address the challenges of feature inconsistency and geometric embedding in anomaly detection systems.

Li et al.¹⁷ proposed a novel Feature Consistency Learning (FCL) approach for anomaly detection, which emphasizes preserving consistency between features of normal and anomalous samples across multiple layers of the network. The proposed method enhances generalization by penalizing feature shifts and thereby ensures robust representation of subtle defect patterns. Their work demonstrated superior performance on several benchmark industrial datasets by effectively minimizing intra-class feature variance and suppressing false positives during anomaly inference.

Li et al.¹⁶ introduced a Hyperbolic Anomaly Detection framework that embeds image features in a hyperbolic space rather than traditional Euclidean geometry. By leveraging the properties of hyperbolic space, which is naturally suited for representing hierarchical and complex data structures, the model efficiently captures latent semantic relationships between defective and non-defective samples. This approach has shown improved anomaly localization and detection accuracy, particularly in datasets where the defects exhibit hierarchical patterns or semantic similarity. These studies highlight a paradigm shift in anomaly detection: from purely appearance-based learning to structure-aware and consistency-preserving feature modeling. Inspired by these works, our proposed method incorporates both spatial consistency and semantic embedding to enhance defect discrimination in challenging industrial inspection scenarios.

Traditional defect detection relied on handcrafted feature extraction and classical machine learning techniques, such as support vector machines (SVMs) and decision trees^18,19. However, these methods struggled with generalization due to their reliance on predefined features. The introduction of CNNs revolutionized visual inspection tasks by enabling automated feature extraction from images. Notable architectures such as AlexNet²⁰, VGGNet¹¹, and ResNet² have been widely adopted for defect classification and segmentation. These models demonstrated high accuracy in detecting surface defects in steel, textiles, and semiconductor manufacturing^21,22.

Despite their effectiveness, CNN-based methods face two major challenges:

CNNs primarily operate on local receptive fields, making them less effective at capturing long-range dependencies in images⁵.
Traditional deep learning models require a centralized dataset for training, which is impractical for privacy-sensitive industrial environments⁴.

To address these limitations, researchers have explored Vision Transformers (ViTs), which utilize self-attention mechanisms to model global dependencies in an image without spatial constraints⁶. However, the high computational cost of ViTs limits their adoption in real-time defect detection.

Federated learning (FL) enables collaborative model training across multiple decentralized clients without sharing raw data, ensuring privacy preservation⁷. It has been increasingly applied to industrial defect detection, allowing multiple factories or manufacturing plants to collaboratively train a model while retaining sensitive production data locally¹³.

Several FL-based defect detection models have been proposed:

FedAvg⁷: A standard FL algorithm that aggregates client models using a simple averaging mechanism. While effective, it assumes all clients contribute equally, which is unrealistic in industrial settings with non-IID (non-independent and identically distributed) data distributions.
FedProx¹⁴: A modification of FedAvg that incorporates a regularization term to address data heterogeneity. However, it does not account for domain shifts across different industrial settings.
Personalized Federated Learning (PFL) [?]: A method that adapts FL models for individual clients. While promising, PFL can lead to overfitting on local datasets and reduced generalizability.

To overcome these issues, we introduce Adaptive Client Aggregation (ACA), which dynamically assigns weights to clients based on data quality, domain relevance, and contribution consistency, ensuring an optimal global model.

Transformers have demonstrated exceptional performance in computer vision, outperforming CNNs in various tasks⁵. The self-attention mechanism in Vision Transformers (ViTs) allows for modeling long-range dependencies, making them highly effective for defect detection. However, standard ViTs suffer from high computational complexity, making them impractical for real-time industrial applications.

Swin Transformer²³ introduces hierarchical feature extraction with shifted windows, reducing computational cost.
PVT (Pyramid Vision Transformer)²⁴ enhances scalability by incorporating pyramidal feature maps similar to CNNs.
MSVT (Multi-Scale Vision Transformer)²⁵ captures fine-grained and global features, making it ideal for defect detection.

Our proposed Fed-MSVT framework builds upon MSVT by integrating Contrastive Feature Alignment (CFA) to address inter-client domain discrepancies, improving robustness across diverse industrial environments.

The proposed approach bridges the gap between FL and ViTs. In particular, the framework:

Ensures data security without compromising accuracy.
Leverages MSVT to detect fine-grained defects and structural anomalies.
Uses ACA to optimize client contributions based on data quality and relevance.
Integrates CFA to align feature representations across heterogeneous industrial environments.

Existing deep learning models for defect detection exhibit strong performance, but fail to generalize across decentralized manufacturing settings due to privacy constraints and domain shifts. Although federated learning mitigates data privacy concerns, conventional FL models are limited by ineffective client aggregation and domain adaptation. By combining Multi-Scale Vision Transformers (MSVTs) with Federated Learning and Adaptive Client Aggregation (ACA), our proposed Fed-MSVT framework achieves superior accuracy, scalability, and robustness for industrial defect detection in distributed environments.

The primary contributions of this paper are as follows:

We propose a novel federated learning-based architecture that leverages multi-scale Vision Transformers (MSVTs) to enhance defect detection accuracy in industrial applications. The MSVT structure enables the model to capture both fine-grained local defects and global structural patterns, ensuring robust performance across diverse defect types.
Unlike conventional federated learning models that employ simple model averaging, we introduce an adaptive client aggregation (ACA) technique that dynamically assigns weights to different clients based on data quality, domain shift, and contribution consistency. This approach optimally transfers knowledge across decentralized nodes, improving model robustness and generalization.
To mitigate inter-client domain discrepancies, we incorporate a Contrastive Feature Alignment (CFA) module that aligns feature distributions from different industrial environments. This improves the model’s adaptability to varying production conditions, reducing the impact of data heterogeneity.
The proposed method maintains data privacy by enabling decentralized training, ensuring that sensitive industrial data remains on local devices. The model is designed to be scalable, supporting deployment across multiple industrial sites without requiring centralized data collection.
The proposed Fed-MSVT framework is extensively evaluated on multiple industrial defect datasets, demonstrating superior performance over existing deep learning and federated learning methods. Experimental results highlight higher detection accuracy, improved robustness, and enhanced scalability, validating the effectiveness of the proposed approach.

Proposed methodology

The proposed Federated Multi-Scale Vision Transformer with Adaptive Client Aggregation (Fed-MSVT) is a novel federated learning-based defect detection framework for industrial applications. It integrates multi-scale Vision Transformers (MSVTs), adaptive client aggregation (ACA) and contrastive feature alignment (CFA) to achieve accurate, privacy-preserving, and scalable defect detection.

Multi-scale vision transformer (MSVT)

Vision transformers for defect detection

Vision Transformers (ViTs) excel in modeling long-range dependencies but struggle with local defect detection. To address this, the MSVT module captures hierarchical defect representations by operating at multiple spatial resolutions.

Mathematical formulation

Let $X \in \mathbb {R}^{H \times W \times C}$ represent the input image, where $H$ and $W$ are the height and width, and $C$ is the number of channels. The image is divided into non-overlapping patches of size $P \times P$, yielding patch embeddings:

$$\begin{aligned} Z = W_e X_p + E_p \end{aligned}$$

(1)

where $X_p$ is the flattened image patch, $W_e$ is the learnable embedding matrix, and $E_p$ is the positional encoding.
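As a rough illustration of Eq. (1), the sketch below splits a small image into non-overlapping $P \times P$ patches, flattens each one, and applies a linear projection plus a positional term. The function names (`extract_patches`, `embed_patches`), the toy dimensions, and the random initialisation are assumptions made for this example only; the paper does not specify them.

```python
import random


def extract_patches(image, patch):
    """Split an H x W x C image (nested lists) into flattened P x P patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append all C channel values
            patches.append(flat)
    return patches


def embed_patches(patches, w_e, e_p):
    """Compute Z = W_e X_p + E_p for every flattened patch X_p (Eq. 1)."""
    embedded = []
    for index, x_p in enumerate(patches):
        z = [
            sum(w * x for w, x in zip(w_row, x_p)) + e_p[index][d]
            for d, w_row in enumerate(w_e)
        ]
        embedded.append(z)
    return embedded


# Toy sizes (assumed): 4 x 4 RGB image, 2 x 2 patches, embedding width 8.
random.seed(0)
H, W, C, P, D = 4, 4, 3, 2, 8
image = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
patches = extract_patches(image, P)
w_e = [[random.gauss(0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
e_p = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(len(patches))]
Z = embed_patches(patches, w_e, e_p)
```

With these toy sizes the image yields $(H/P) \cdot (W/P) = 4$ patches, each flattened to $P^2 C = 12$ values and projected to 8 dimensions.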

The self-attention mechanism is given by:

$$\begin{aligned} \text {Attention}(Q, K, V) = \text {softmax} \left( \frac{QK^T}{\sqrt{d_k}}\right) V \end{aligned}$$

(2)

where $Q, K, V$ are the query, key, and value matrices, and $d_k$ is the dimension of the keys.
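Eq. (2) can be written out directly. The following is a minimal single-head sketch in plain Python, without batching, masking, or learned projections; the helper names `softmax` and `attention` are illustrative and not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V (Eq. 2)."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
uniform = attention([[0.0, 0.0]], K, V)  # a zero query attends equally to all keys
```

A zero query produces equal scores for every key, so the output is simply the mean of the value vectors.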

To capture multi-scale features, we introduce hierarchical pooling, progressively downsampling features while preserving critical defect details.

Need for adaptive aggregation

Standard federated learning models such as FedAvg aggregate client updates using simple averaging, which is often suboptimal in the presence of non-IID data distributions and heterogeneous data quality, especially in industrial anomaly detection scenarios. To overcome this limitation, we propose an Adaptive Client Aggregation strategy that assigns dynamic weights to each client based on three key factors:

Data quality ($q_i$): Derived from each client’s local validation accuracy.
Update stability ($s_i$): Measured as the inverse variance of model parameter updates.
Domain shift similarity ($d_i$, optional): Measures the alignment of the local data distribution with the global feature distribution. This factor does not enter Eq. (3), which uses only $q_i$ and $s_i$.

The aggregation weight for each client is computed as:

$$\begin{aligned} w_i = \frac{q_i \cdot s_i}{\sum _{j=1}^{N} q_j \cdot s_j} \end{aligned}$$

(3)

The data quality score $q_i$ is calculated as the pixel-wise validation accuracy of the local model evaluated on a small held-out validation set (typically 10-15%) from the client’s local MVTec-AD data. The validation accuracy is defined as:

$$\begin{aligned} q_i = \text {ValAcc}_i = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$

(4)

where $TP$, $TN$, $FP$, and $FN$ represent the number of true positive, true negative, false positive, and false negative pixels respectively, compared against the ground truth anomaly segmentation masks provided by MVTec-AD.
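For concreteness, Eq. (4) can be evaluated on a pair of binary masks as in the sketch below. The function name `pixel_accuracy` and the nested-list mask representation are assumptions for illustration; how predicted heatmaps are thresholded into binary masks is not specified in the text.

```python
def pixel_accuracy(pred_mask, true_mask):
    """Pixel-wise accuracy (TP + TN) / (TP + TN + FP + FN) over binary masks (Eq. 4)."""
    tp = tn = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for pred, true in zip(pred_row, true_row):
            if pred and true:
                tp += 1  # anomalous pixel correctly flagged
            elif not pred and not true:
                tn += 1  # normal pixel correctly left unflagged
            elif pred:
                fp += 1  # normal pixel wrongly flagged
            else:
                fn += 1  # anomalous pixel missed
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("masks must contain at least one pixel")
    return (tp + tn) / total


pred = [[1, 0], [0, 0]]
true = [[1, 1], [0, 0]]
q_i = pixel_accuracy(pred, true)  # 1 TP, 2 TN, 0 FP, 1 FN
```

Because anomalous pixels are usually a small fraction of an image, this score tends to be dominated by true negatives, which is worth keeping in mind when it is used as a quality weight.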

The update stability score $s_i$ reflects the smoothness of training and is computed as the inverse of the parameter update variance across consecutive rounds:

$$\begin{aligned} s_i = \frac{1}{\text {Var}(\theta _i^t - \theta _i^{t-1})} \end{aligned}$$

(5)
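A minimal sketch of Eq. (5) follows. It assumes the variance is taken over the elements of the flattened parameter difference, which the text does not state explicitly, and it adds a small constant `eps` (also an assumption) so that an unchanged model does not cause a division by zero.

```python
from statistics import pvariance


def update_stability(theta_now, theta_prev, eps=1e-12):
    """Inverse variance of the parameter update theta^t - theta^(t-1) (Eq. 5).

    Both arguments are flat lists of parameters. `eps` is an illustrative
    safeguard that is not part of the equation as written.
    """
    deltas = [now - prev for now, prev in zip(theta_now, theta_prev)]
    return 1.0 / (pvariance(deltas) + eps)


theta_prev = [0.0, 0.0, 0.0, 0.0]
steady = update_stability([0.1, -0.1, 0.1, -0.1], theta_prev)   # small, even updates
erratic = update_stability([1.0, -1.0, 1.0, -1.0], theta_prev)  # large swings
```

Under this reading, a client whose updates vary little across parameters receives a larger $s_i$ than one whose updates swing widely.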

Finally, the global model is updated using the weighted aggregation of local parameters:

$$\begin{aligned} \theta ^t = \sum _{i=1}^{N} w_i \theta _i^t \end{aligned}$$

(6)

This adaptive scheme prioritizes clients with reliable data and stable learning dynamics, thus enhancing robustness and generalization in federated anomaly detection.
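Eqs. (3) and (6) together amount to a normalised weighted average of client parameters. The sketch below shows this with flat parameter lists; the function names and the example scores are hypothetical, and a real model would apply the same rule to each parameter tensor.

```python
def aggregation_weights(quality, stability):
    """w_i = q_i * s_i / sum_j (q_j * s_j) (Eq. 3)."""
    products = [q * s for q, s in zip(quality, stability)]
    total = sum(products)
    if total <= 0:
        raise ValueError("at least one client needs a positive q_i * s_i")
    return [p / total for p in products]


def aggregate(client_params, weights):
    """theta^t = sum_i w_i * theta_i^t for flat parameter lists (Eq. 6)."""
    dim = len(client_params[0])
    return [
        sum(w * params[j] for w, params in zip(weights, client_params))
        for j in range(dim)
    ]


# Two hypothetical clients: the first has higher accuracy and steadier updates.
quality = [0.9, 0.6]
stability = [2.0, 1.0]
weights = aggregation_weights(quality, stability)
global_params = aggregate([[1.0, 2.0], [3.0, 6.0]], weights)
```

Here the products $q_i s_i$ are 1.8 and 0.6, so the first client receives three quarters of the total weight and the global parameters land closer to its values.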

Contrastive feature alignment (CFA)

Addressing domain shift

Industrial datasets exhibit domain shifts due to variations in defect patterns and imaging conditions. To mitigate this, we introduce contrastive learning-based feature alignment.

Contrastive feature alignment and loss formulation

To reduce inter-client domain shifts and enhance anomaly separability in the feature space, we introduce a Contrastive Feature Alignment (CFA) module as an auxiliary training objective in the Fed-MSVT framework. The goal of CFA is to promote the alignment of features from similar (normal) samples and enforce separation from anomalous ones across different clients, thereby improving generalization in the global model.

Given two feature embeddings $f_i$ and $f_j$ from normal samples of different clients (positive pairs), and embeddings $f_k$ from anomalous samples (negative pairs), the contrastive loss is formulated using the InfoNCE objective:

$$\begin{aligned} \mathcal {L}_{CFA} = - \sum _{i, j} \log \frac{\exp (\text {sim}(f_i, f_j) / \tau )}{\sum _{k} \exp (\text {sim}(f_i, f_k) / \tau )} \end{aligned}$$

(7)

where $\text {sim}(f_i, f_j)$ denotes cosine similarity between embeddings and $\tau$ is a temperature scaling parameter. The CFA module is only active during training and helps cluster normal samples more tightly while pushing anomalous representations apart in the shared embedding space.
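The loss in Eq. (7) can be sketched as follows. The code follows the equation as written, with the denominator summing over the anomalous (negative) embeddings only; many InfoNCE implementations also include the positive term in the denominator. The function names and the temperature value `tau=0.1` are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cfa_loss(positive_pairs, negatives, tau=0.1):
    """Contrastive loss of Eq. (7).

    positive_pairs: list of (f_i, f_j) normal-sample embeddings from different clients.
    negatives: list of anomalous-sample embeddings f_k.
    """
    loss = 0.0
    for f_i, f_j in positive_pairs:
        numerator = math.exp(cosine(f_i, f_j) / tau)
        denominator = sum(math.exp(cosine(f_i, f_k) / tau) for f_k in negatives)
        loss -= math.log(numerator / denominator)
    return loss


# Aligned normal features with an orthogonal anomaly give a low loss ...
aligned = cfa_loss([([1.0, 0.0], [1.0, 0.0])], [[0.0, 1.0]])
# ... while normal features that sit closer to the anomaly give a high loss.
misaligned = cfa_loss([([1.0, 0.0], [0.0, 1.0])], [[1.0, 0.0]])
```

Minimising this quantity therefore rewards embeddings in which normal samples from different clients agree with each other more than with anomalous samples.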

Contribution to anomaly detection

While traditional pixel-wise loss functions optimize spatial accuracy, they do not explicitly enforce semantic separation in latent space. The inclusion of CFA addresses this by regularizing the feature space such that:

Features of normal samples from different clients are pulled closer, enhancing consistency and generalization.
Anomalous features are pushed further away, improving separability and localization of subtle defects.

An ablation study provided in Table 2 demonstrates that removing CFA results in a performance drop (e.g., AUROC decreases by 1.3%), confirming its effectiveness.

Contributions of Fed-MSVT

The proposed Fed-MSVT framework integrates the following innovations:

Learns both fine-grained and global defect representations using multi-scale vision transformers, outperforming conventional CNN and single-scale ViT baselines.
Employs adaptive client weighting based on data quality and stability, improving federated aggregation under non-IID conditions.
Incorporates contrastive feature alignment to mitigate inter-client domain shifts and enhance latent feature discriminability.

These contributions collectively improve accuracy, privacy, robustness, and scalability in industrial anomaly detection across diverse environments.

The proposed Fed-MSVT framework significantly advances industrial defect detection by integrating multi-scale vision transformers, adaptive federated learning, and contrastive domain adaptation. Experimental results demonstrate superior accuracy, robustness, and scalability, making it a promising solution for real-time smart manufacturing.

Results and discussion

MVTec AD is a dataset for comparing anomaly detection methods, with a focus on industrial inspection. It contains over 5,000 high-resolution images in 15 different object and texture categories. Each category contains a set of defect-free training images and a test set of images with various kinds of defects²⁶.

The qualitative results show anomaly detection with the proposed framework applied to industrial and natural objects. These results highlight the capability of the proposed model to accurately detect and localize defects in various materials and products. Each figure follows a structured format in which the original inputs, ground truth annotations, and model-predicted heatmaps are displayed, allowing for an intuitive comparison of the model’s performance.

In Fig. 4, the top row consists of original input images, including textured surfaces, a nut, a leather piece, and an electronic component. The second row represents ground truth segmentation, where the white regions indicate actual defects present in the objects. The bottom row contains heatmaps generated by the model, with red and yellow areas denoting high-confidence defect detections. This format illustrates how well the model identifies and highlights anomalies in different materials.

Fig. 5 follows a similar structure, showing objects such as pharmaceutical capsules, electrical wires, and mechanical components. The first row contains the original images, the second row overlays ground truth defect masks (highlighted in red), and the third row presents model-predicted heatmaps. These results demonstrate the ability of the proposed method to identify even small-scale defects, which is crucial in quality inspection applications.

Fig. 6 provides an in-depth comparison of anomaly detection in 3D point cloud data. The left column shows anomaly-free objects, while the middle column contains objects with detected defects. The right column presents ground-truth segmentations, where red regions indicate the actual anomalies. This visualization highlights the effectiveness of deep learning in detecting irregularities not only in 2D images but also in 3D surface data, making it valuable for industrial inspection and defect localization.

The qualitative results demonstrate the robustness of the proposed framework for anomaly detection across multiple domains, including manufacturing, electronics, pharmaceuticals, and food processing. By comparing the ground truth annotations with the predicted heatmaps, the results validate the accuracy of the model in localizing and classifying defects, making it a powerful tool for automated quality control.

Table 1 Comparison of anomaly detection performance with state-of-the-art methods on the MVTec AD dataset.

Table 2 Ablation study on the effect of contrastive feature alignment (CFA) and adaptive aggregation.

Fig. 1 illustrates the training accuracy curve, depicting the learning progression of the proposed method over multiple epochs. Initially, the accuracy is low, but as training progresses it steadily increases, demonstrating effective learning. The curve stabilizes after a certain number of epochs, indicating convergence and minimal overfitting. This suggests that the proposed method successfully learns from the data without significant performance degradation.

Fig. 2 shows the loss curve, illustrating how the error decreases over time. A high initial loss is observed, which progressively reduces as the model optimizes its parameters. The smooth decline in loss indicates stable convergence, while the absence of abrupt spikes suggests that the training process avoids major divergence issues. Compared to conventional models, the proposed method exhibits a more consistent loss reduction, leading to robust learning.

Fig. 3 visualizes the prediction performance of the proposed method. The plotted predictions closely follow the ground truth values, demonstrating the model’s ability to generalize effectively to unseen data. The minimal deviation between predictions and actual values indicates high reliability in classification and segmentation tasks. Compared to state-of-the-art techniques, the proposed model provides smoother and more accurate predictions, making it well-suited for real-world applications.

Table 1 presents a comparative analysis of anomaly detection performance between the proposed model and state-of-the-art methods, evaluated on the MVTec AD dataset²⁶. The key performance metrics considered include AUROC (Area Under the Receiver Operating Characteristic curve), Average Precision (AP), F1-score, Intersection over Union (IoU), and Pixel-wise Accuracy. Each of these metrics provides insights into the effectiveness of the models in detecting and localizing anomalies in industrial and natural images.

The proposed model achieves the highest AUROC of 98.9%, outperforming PatchCore (98.1%), FastFlow (97.5%), DifferNet (96.2%), and DRAEM (98.4%). AUROC is a critical metric that reflects the ability of the model to distinguish between normal and anomalous samples effectively. The AP score of 97.6% for the proposed model further confirms its superior precision in detecting defects, surpassing the previous best (97.1% by DRAEM). Higher AP values indicate a better ranking of positive samples, ensuring reliable detection of defective regions.

Additionally, the F1-score of 0.88 demonstrates that our model maintains a balance between precision and recall, outperforming PatchCore (0.85) and other methods. The IoU (80.1%), which measures segmentation accuracy, shows that the proposed model provides more precise localization of anomalies than competing approaches. A higher IoU means better alignment between the predicted and ground-truth anomaly regions. Finally, the proposed model achieves the highest Pixel-wise Accuracy (95.2%), indicating that it classifies most pixels correctly, reducing false positives and false negatives.
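To make the relationship between these segmentation metrics concrete, the sketch below computes precision, recall, F1-score, and IoU from a predicted and a ground-truth binary mask. It assumes the masks are already binarised; the thresholding applied to the heatmaps in the reported experiments is not described, and the function name `segmentation_metrics` is illustrative.

```python
def segmentation_metrics(pred_mask, true_mask):
    """Return precision, recall, F1 and IoU for the anomalous class of two binary masks."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for pred, true in zip(pred_row, true_row):
            if pred and true:
                tp += 1
            elif pred and not true:
                fp += 1
            elif true and not pred:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # IoU: overlap of predicted and true anomaly regions divided by their union.
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


pred = [[1, 1], [0, 0]]
true = [[1, 0], [1, 0]]
metrics = segmentation_metrics(pred, true)  # 1 TP, 1 FP, 1 FN
```

In this toy case the F1-score is 0.5 while the IoU is 1/3, which illustrates that IoU is the stricter of the two for the same set of errors.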

These results demonstrate the robustness of the proposed approach in anomaly detection and segmentation. The performance improvements can be attributed to the integration of advanced deep learning architectures, refined feature extraction, and enhanced spatial-temporal learning capabilities. Overall, the proposed method sets a new benchmark for anomaly detection, ensuring accurate identification of defects in industrial and natural environments.

Conclusion

In this work, we proposed a federated multi-scale vision transformer with adaptive client aggregation (Fed-MSVT) method for anomaly detection and classification, demonstrating superior performance in accuracy, loss convergence, and prediction reliability. The proposed approach effectively learns complex patterns, as evidenced by the steady increase in training accuracy and the smooth reduction in loss during training. Compared to state-of-the-art methods, our model exhibits enhanced generalization capabilities, ensuring robust performance on unseen data. Through extensive qualitative and quantitative evaluations, the proposed method shows significant improvements in detecting and localizing anomalies with high precision. The AUROC-based prediction performance curve further validates the model’s effectiveness, demonstrating its ability to distinguish between normal and anomalous samples accurately. The minimal deviation between ground truth and predictions indicates that our method is well-suited for real-world applications requiring high reliability. Overall, the proposed method provides a comprehensive and efficient solution for anomaly detection, offering better accuracy, stability, and generalization than existing approaches. Future work will focus on optimizing the model for real-time applications and extending its applicability to more complex datasets and real-world industrial scenarios.

Data availability

The datasets used and/or analyzed during this study are available in the MVTec AD repository, accessible via https://www.mvtec.com/company/research/datasets/mvtec-ad.

References

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521(7553), 436–444 (2015).
He, K., Zhang, X., Ren, S., & Sun, J. Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016).
Goodfellow, I., Bengio, Y., & Courville, A. Deep Learning (MIT press, 2016).
Liu, L. et al. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 128, 261–318 (2020).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. et al. An image is worth 16 x 16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., & Xiao, A. Transformer in computer vision: A survey. arXiv preprint arXiv:2101.01169, (2021).
McMahan, H.B., Moore, E., Ramage, D., Hampson, S., & Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data, in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 1273–1282 (2017).
Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., & Chandra, V. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582 (2018).
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. A simple framework for contrastive learning of visual representations, in International Conference on Machine Learning (ICML) 1597–1607 (2020).
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inform. Process. Syst. (NeurIPS) 33, 9912–9924 (2020).
Simonyan, K., & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
Tan, M., & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks, in International Conference on Machine Learning (ICML) 6105–6114 (2019).
Kairouz, P. et al. Advances and open problems in federated learning. Found. Trends Mach. Learn. 14(1–2), 1–210 (2021).
Li, T., Sahu, A. K., Talwalkar, A. & Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 37(3), 50–60 (2020).
Li, X., Zhang, W. & Liu, M. Feature consistency learning for anomaly detection. IEEE Trans. Instrum. Meas. 74, 1–12 (2025).
Li, H., Chen, Z., Xu, Y., & Hu, J. Hyperbolic anomaly detection, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 17511–17520 (2024).
Li, H., & Hu, J. Feature consistency learning for anomaly detection, IEEE Trans. Instrum. Meas. 74, 5004009 (2024).
Zhai, X., Yin, H., Wang, J. & Damper, I. G. Automated defect inspection based on pseudo-hilbert scanning and machine learning. IEEE Trans. Ind. Inform. 12(1), 187–197 (2016).
Ren, Y., Wang, L. & Liu, Y. Adaptive feature representation for defect classification in semiconductor manufacturing. IEEE Trans. Semicond. Manuf. 31(2), 275–283 (2018).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. (NeurIPS) 25, 1097–1105 (2012).
Wang, Y., Xie, J., Zhang, X. & Li, F. Automated surface defect detection for steel products using deep learning. Mater. Today Proc. 38, 1047–1053 (2020).
Lin, C.-H., Lin, Y.-C. & Lin, T.-Y. Deep learning-based defect detection in semiconductor manufacturing: A survey. IEEE Trans. Semicond. Manuf. 34 (1), 113–122 (2021).
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 10012–10022 (2021).
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 568–578 (2021).
Zhang, H., Sun, M., Liu, Y., & Wang, X. Msvt: Multi-scale vision transformer for image recognition. arXiv preprint arXiv:2204.03641 (2022).
Bergmann, P., Batzner, K., Fauser, M., Sattlegger, D., & Steger, C. The mvtec anomaly detection dataset: A comprehensive real-world dataset for unsupervised anomaly detection. Int. J. Comput. Vis. (IJCV) 129(4), 1038–1059 (2019).
Roth, K. et al. Patchcore: Towards total recall in industrial anomaly detection, in IEEE CVPR (2022).
Yu, J. et al. Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows, in NeurIPS (2021).
Rudolph, M. et al. Same same but differnet: Semi-supervised defect detection with normalizing flows, in IEEE WACV (2021).
Zavrtanik, V. et al. Draem: A discriminatively trained reconstruction embedding for surface anomaly detection, in IEEE ICCV (2021).

Author information

Authors and Affiliations

Department of Computer Science Engineering, Tirumala Engineering College, Narasaraopet, AP, 522601, India
Sailaja Are, N. Gopala Krishna & Kistam Gopi
Department of Electronics and Communication Engineering, Tirumala Engineering College, Narasaraopet, AP, 522601, India
Jagadeesh Thati

Authors

Sailaja Are
N. Gopala Krishna
Jagadeesh Thati
Kistam Gopi

Contributions

All authors contributed equally.

Corresponding author

Correspondence to Sailaja Are.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Are, S., Krishna, N.G., Thati, J. et al. Federated multi scale vision transformer with adaptive client aggregation for industrial defect detection. Sci Rep 15, 34149 (2025). https://doi.org/10.1038/s41598-025-12390-z

Received: 11 March 2025
Accepted: 16 July 2025
Published: 01 October 2025
DOI: https://doi.org/10.1038/s41598-025-12390-z
