
Overview of ELOQUENT 2025: Shared Tasks for Evaluating Generative Language Model Quality

  • Conference paper
  • In: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2025)

Abstract

ELOQUENT is a CLEF lab for evaluating generative language model quality, with a focus on aspects of quality that do not come to the fore in current standard test suites and test collections, and on developing and promoting new test regimes and methods that fit a multilingual application scenario for generative artificial intelligence.

This is the second year of ELOQUENT, and the experiment tracks have evolved from the first year: we continue to challenge the capability of classifiers to distinguish machine-generated from human-authored text; we explore how consistent language models are in responding to value-oriented questions across languages and system settings; we test how accurately language models are able to predict human preferences between variants of generated material; and we investigate how well language models are able to provide sensible topical quizzes for given target texts.


Notes

  1. This is ongoing work on multilingual human preference prediction and explanation. The English subset is created as part of this shared task. Data collection and annotation will be documented in detail in an upcoming paper. To be available at github.com/Toloka/primeape.

  2. hf.co/datasets/Eloquent/preference_prediction (a minimal loading sketch follows these notes).

  3. openai.com/gpt-4o-mini-advancing-cost-efficient-intelligence.

  4. github.com/eloquent-lab/eloquent-lab.github.io/task-preference-prediction.

  5. Note that we tried to introduce at least some difference between the baseline Teacher system and its evaluation by using two different versions of GPT: gpt-4.1-nano-2025-04-14 and gpt-4.1-mini-2025-04-14, respectively. We also later reran the evaluation with Google's Gemma and obtained very similar results.
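
For readers who want to inspect the preference prediction data (note 2), the following is a minimal sketch of how it could be loaded with the Hugging Face datasets library. Only the dataset identifier comes from the note above; the split and field names are not specified there, so the sketch prints them rather than assuming them.

    # Minimal sketch: load the ELOQUENT preference prediction data from the
    # Hugging Face Hub. The dataset ID is taken from note 2; split and column
    # names are not documented here, so we inspect them instead of assuming them.
    from datasets import load_dataset

    dataset = load_dataset("Eloquent/preference_prediction")

    print(dataset)                     # available splits and their sizes
    split_name = next(iter(dataset))   # first available split, e.g. "train"
    print(dataset[split_name][0])      # first example with its actual fields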


Acknowledgments

The authors acknowledge the support of the European Commission through the project DeployAI (grant number 101146490), of the National Recovery Plan funded project MPO 60273/24/21300/21000 CEDMO 2.0 NPO, and of the OP JAK project Mezisektorová spolupráce No. CZ.02.01.01/00/23_020/0008518, named "Jazykověda, umělá inteligence a jazykové a řečové technologie: od výzkumu k aplikacím" ("Linguistics, Artificial Intelligence, and Language and Speech Technologies: from Research to Applications").

Author information


Corresponding author

Correspondence to Jussi Karlgren.


Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.


Copyright information

© 2026 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Karlgren, J. et al. (2026). Overview of ELOQUENT 2025: Shared Tasks for Evaluating Generative Language Model Quality. In: Carrillo-de-Albornoz, J., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2025. Lecture Notes in Computer Science, vol 16089. Springer, Cham. https://doi.org/10.1007/978-3-032-04354-2_14


  • DOI: https://doi.org/10.1007/978-3-032-04354-2_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-032-04353-5

  • Online ISBN: 978-3-032-04354-2

  • eBook Packages: Computer Science (R0)

