Abstract
ELOQUENT is a CLEF lab for evaluating the quality of generative language models, focusing on aspects of quality that current standard test suites and test collections do not capture, and for developing and promoting new test regimes and methods suited to multilingual application scenarios for generative artificial intelligence.
This is the second year of ELOQUENT, and the experiment tracks have evolved from the first year: we continue to challenge the capability of classifiers to distinguish machine-generated from human-authored text; we explore how consistently language models respond to value-oriented questions across languages and system settings; we test how accurately language models predict human preferences between variants of generated material; and we investigate how well language models are able to provide sensible topical quizzes to fit given target texts.
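To make the preference prediction setup concrete, the scoring it calls for can be sketched as follows. This is an illustrative outline under assumed field names ("variant_a", "human_choice", "predicted"), not the lab's actual evaluation code.

```python
# Sketch: score a preference-prediction submission by measuring how often
# a model's predicted preference between two generated variants matches
# the human-annotated choice. Field names are hypothetical.

def preference_accuracy(items: list[dict]) -> float:
    """Fraction of items where the predicted preference matches the human label."""
    if not items:
        return 0.0
    correct = sum(1 for item in items if item["predicted"] == item["human_choice"])
    return correct / len(items)

items = [
    {"variant_a": "text A1", "variant_b": "text B1", "human_choice": "a", "predicted": "a"},
    {"variant_a": "text A2", "variant_b": "text B2", "human_choice": "b", "predicted": "a"},
]
print(f"accuracy: {preference_accuracy(items):.2f}")  # accuracy: 0.50
```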
Notes
- 1.
This is ongoing work on multilingual human preference prediction and explanation. The English subset is created as part of this shared task. Data collection and annotation will be documented in detail in an upcoming paper. To be available at github.com/Toloka/primeape.
- 5.
Note that we introduced at least some separation between the baseline Teacher system and its evaluation by using two different versions of GPT: gpt-4.1-nano-2025-04-14 and gpt-4.1-mini-2025-04-14, respectively. We later reran the evaluation with Google's Gemma and obtained very similar results.
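As a hedged illustration of this separation (one GPT-4.1 variant generates, another evaluates, so the same model does not grade its own output), the call pattern might look as follows. This sketch assumes the official OpenAI Python client and uses illustrative prompts; it is not the lab's actual harness.

```python
# Sketch: gpt-4.1-nano acts as the quiz-generating "Teacher" and
# gpt-4.1-mini acts as the evaluator, avoiding self-evaluation.
# Assumes the official OpenAI Python client; prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_quiz(text: str) -> str:
    """Have the Teacher model write a short quiz about the target text."""
    response = client.chat.completions.create(
        model="gpt-4.1-nano-2025-04-14",
        messages=[{"role": "user",
                   "content": f"Write a short topical quiz about:\n{text}"}],
    )
    return response.choices[0].message.content

def evaluate_quiz(text: str, quiz: str) -> str:
    """Have a different model variant rate the quiz for topical fit."""
    response = client.chat.completions.create(
        model="gpt-4.1-mini-2025-04-14",
        messages=[{"role": "user",
                   "content": f"Given the source text:\n{text}\n\n"
                              f"Rate this quiz for topical fit:\n{quiz}"}],
    )
    return response.choices[0].message.content
```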
Acknowledgments
The authors acknowledge the support of the European Commission through the project DeployAI (grant number 101146490), of the National Recovery Plan-funded project MPO 60273/24/21300/21000 CEDMO 2.0 NPO, and the funding from the Project OP JAK Mezisektorová spolupráce Nr. CZ.02.01.01/00/23_020/0008518, named “Jazykověda, umělá inteligence a jazykové a řečové technologie: od výzkumu k aplikacím” (“Linguistics, Artificial Intelligence, and Language and Speech Technologies: From Research to Applications”).
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2026 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Karlgren, J. et al. (2026). Overview of ELOQUENT 2025: Shared Tasks for Evaluating Generative Language Model Quality. In: Carrillo-de-Albornoz, J., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2025. Lecture Notes in Computer Science, vol 16089. Springer, Cham. https://doi.org/10.1007/978-3-032-04354-2_14
DOI: https://doi.org/10.1007/978-3-032-04354-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-032-04353-5
Online ISBN: 978-3-032-04354-2
eBook Packages: Computer Science; Computer Science (R0)