Unlocking the Potential of Transformers with mT5 and Attention Mechanisms in Multilingual Plagiarism Detection

  • Original Research
  • Published in SN Computer Science

Abstract

Plagiarism is one of the biggest issues in the academic domain because of its effect on the credibility of scientific research. One recognized form of plagiarism involves translating texts from one language into another. Cross-lingual plagiarism is an unethical practice that continues to proliferate due to the abundance of online information and the availability of translation tools, especially those based on artificial intelligence. To detect this kind of plagiarism, multilingual transformers offer a promising avenue. In this paper, we propose a new methodology for cross-language plagiarism detection based on the multilingual pretrained model mT5 combined with a Multi-Head Attention Mechanism (MHAM) and Attention Mechanisms (AM). The approach comprises text preprocessing, embedding with mT5, attention layers, and a sigmoid output function. To evaluate the efficiency of our approach, we conducted a comparative analysis demonstrating that our method surpasses other pretrained models, namely XLM-RoBERTa, Multilingual BERT, mBART, and M2M-100. Experiments show that the proposed approach based on the mT5 transformer and attention layers achieved the highest results for the three language pairs, with a plagdet score of 98.38% for English-French, 98.03% for English-Spanish, and 98.73% for English-German, and a granularity of 1.00 for all language pairs.
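The pipeline described in the abstract (mT5 embeddings → attention layers → sigmoid scoring) can be sketched in broad strokes as follows. This is a minimal illustration, not the authors' implementation: random vectors stand in for mT5 token embeddings, and the pooling and scoring choices (`plagiarism_score`, mean pooling, dot-product logit) are assumptions made purely for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention computed independently per head slice."""
    seq_q, d_model = q.shape
    d_head = d_model // num_heads
    out = np.empty_like(q)
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, s] @ k[:, s].T / np.sqrt(d_head)  # (seq_q, seq_k)
        out[:, s] = softmax(scores) @ v[:, s]
    return out

def plagiarism_score(emb_a, emb_b, num_heads=4):
    """Cross-attend sentence A over sentence B, mean-pool, squash with a sigmoid."""
    attended = multi_head_attention(emb_a, emb_b, emb_b, num_heads)
    pooled_a = attended.mean(axis=0)
    pooled_b = emb_b.mean(axis=0)
    logit = pooled_a @ pooled_b / np.sqrt(len(pooled_a))
    return 1.0 / (1.0 + np.exp(-logit))  # probability-like score in (0, 1)

# Stand-ins for mT5 token embeddings of a source and a suspicious sentence.
emb_src = rng.standard_normal((7, 16))
emb_sus = rng.standard_normal((9, 16))
score = plagiarism_score(emb_src, emb_sus)
```

In the real system the embeddings would come from the mT5 encoder and the attention and classification weights would be learned; here the score merely shows how a cross-attention layer followed by a sigmoid yields a plagiarism probability per sentence pair.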


Data Availability

The datasets are accessible and referenced in this article.


Funding

No funding was received.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Chaimaa Bouaine.

Ethics declarations

Conflict of Interest

All authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Not applicable.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bouaine, C., Benabbou, F. & Zaoui, C. Unlocking the Potential of Transformers with mT5 and Attention Mechanisms in Multilingual Plagiarism Detection. SN COMPUT. SCI. 6, 849 (2025). https://doi.org/10.1007/s42979-025-04379-2

