Abstract
Plagiarism is one of the most serious issues in the academic domain because of its effect on the credibility of scientific research. One recognized form of plagiarism involves translating text from one language into another. Such cross-lingual plagiarism continues to proliferate owing to the abundance of online information and the availability of translation tools, especially those based on artificial intelligence. Multilingual transformers offer a promising approach to detecting this kind of plagiarism. In this paper, we propose a new methodology for cross-language plagiarism detection based on the multilingual pretrained model mT5 combined with a multi-head attention mechanism (MHAM). The approach comprises text preprocessing, embedding with mT5, attention layers, and a sigmoid output function. To evaluate its efficiency, we conducted a comparative analysis demonstrating that our method outperforms other pretrained models, namely XLM-RoBERTa, Multilingual BERT, mBART, and M2M-100. Experiments show that the proposed approach based on the mT5 transformer and attention layers achieves high results for all three language pairs, with a plagdet score of 98.38% for English-French, 98.03% for English-Spanish, and 98.73% for English-German, and a granularity of 1.00 for every language pair.
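The pipeline described above (contextual embeddings → multi-head attention → mean pooling → sigmoid score) can be illustrated with a minimal numpy sketch. Random vectors stand in for the mT5 sentence-pair embeddings, and all dimensions, the pooling choice, and the randomly initialized projection weights are illustrative assumptions, not the paper's trained parameters (d_model=512 merely mirrors mT5-small's hidden size):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Scaled dot-product self-attention over X: (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Randomly initialized per-head projections (illustrative only).
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
                      for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = softmax(Q @ K.T / np.sqrt(d_head))  # (seq_len, seq_len)
        heads.append(scores @ V)
    return np.concatenate(heads, axis=-1)  # back to (seq_len, d_model)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Placeholder "mT5" embeddings for a concatenated source/suspicious pair.
pair_embedding = rng.standard_normal((32, 512))
attended = multi_head_attention(pair_embedding, num_heads=8, rng=rng)
pooled = attended.mean(axis=0)              # mean pooling over tokens
w = rng.standard_normal(512) / np.sqrt(512)
plagiarism_prob = sigmoid(pooled @ w)       # binary plagiarism score in (0, 1)
```

In the actual system the attention and output weights would be learned end-to-end on labeled plagiarism pairs; this sketch only shows how the layers compose.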
Data Availability
The datasets are accessible and referenced in this article.
Funding
No funding was received.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interest
All authors declare that they have no conflict of interest.
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent
Not applicable.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bouaine, C., Benabbou, F. & Zaoui, C. Unlocking the Potential of Transformers with mT5 and Attention Mechanisms in Multilingual Plagiarism Detection. SN COMPUT. SCI. 6, 849 (2025). https://doi.org/10.1007/s42979-025-04379-2
Received:
Accepted:
Published: