
Overview of ELOQUENT 2025: Shared Tasks for Evaluating Generative Language Model Quality

  • Conference paper
  • In: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2025)

Abstract

ELOQUENT is a CLEF lab for evaluating generative language model quality, with a focus on aspects of quality that do not come to the fore in current standard test suites and test collections, and on developing and promoting new test regimes and methods that fit a multilingual application scenario for generative artificial intelligence.

This is the second year of ELOQUENT, and the experiment tracks have evolved from the first year: we continue to challenge the capability of classifiers to distinguish machine-generated from human-authored text; we explore how consistent language models are in responding to value-oriented questions across languages and system settings; we test how accurately language models are able to predict human preferences between variants of generated material; and we investigate how well language models are able to provide sensible topical quizzes for given target texts.


Notes

  1. This is ongoing work on multilingual human preference prediction and explanation. The English subset is created as part of this shared task. Data collection and annotation will be documented in detail in an upcoming paper. To be available at github.com/Toloka/primeape.

  2. hf.co/datasets/Eloquent/preference_prediction (a minimal loading sketch follows these notes).

  3. openai.com/gpt-4o-mini-advancing-cost-efficient-intelligence.

  4. github.com/eloquent-lab/eloquent-lab.github.io/task-preference-prediction.

  5. Note that we tried to introduce at least some difference between the baseline Teacher system and its evaluation by using two different versions of GPT: gpt-4.1-nano-2025-04-14 and gpt-4.1-mini-2025-04-14, respectively. We also later reran the evaluation with Google's Gemma and obtained very similar results.
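
For readers who want to inspect the preference prediction data (note 2), the following is a minimal sketch of how it could be loaded with the Hugging Face datasets library. Only the dataset identifier comes from the note above; the split and field names are not specified there, so the sketch prints them rather than assuming them.

    # Minimal sketch: load the ELOQUENT preference prediction data from the
    # Hugging Face Hub. The dataset ID is taken from note 2; split and column
    # names are not documented here, so we inspect them instead of assuming them.
    from datasets import load_dataset

    dataset = load_dataset("Eloquent/preference_prediction")

    print(dataset)                     # available splits and their sizes
    split_name = next(iter(dataset))   # first available split, e.g. "train"
    print(dataset[split_name][0])      # first example with its actual fields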


Acknowledgments

The authors acknowledge the support of the European Commission through the project DeployAI (grant number 101146490), of the National Recovery Plan funded project MPO 60273/24/21300/21000 CEDMO 2.0 NPO, and of the OP JAK project Mezisektorová spolupráce No. CZ.02.01.01/00/23_020/0008518, named "Jazykověda, umělá inteligence a jazykové a řečové technologie: od výzkumu k aplikacím" ("Linguistics, Artificial Intelligence, and Language and Speech Technologies: from Research to Applications").

Author information


Corresponding author

Correspondence to Jussi Karlgren.


Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.


Copyright information

© 2026 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Karlgren, J. et al. (2026). Overview of ELOQUENT 2025: Shared Tasks for Evaluating Generative Language Model Quality. In: Carrillo-de-Albornoz, J., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2025. Lecture Notes in Computer Science, vol 16089. Springer, Cham. https://doi.org/10.1007/978-3-032-04354-2_14


  • DOI: https://doi.org/10.1007/978-3-032-04354-2_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-032-04353-5

  • Online ISBN: 978-3-032-04354-2

  • eBook Packages: Computer Science (R0)

