The fourth iteration of the Generation, Evaluation & Metrics (GEM) Workshop will be held as part of ACL 2025 on July 31, 2025.
The workshop will be held in hybrid mode, with sessions both in person and via the conference portal.
Schedule
All times are in Vienna local time.
Start | End | Session |
---|---|---|
9:00 | 10:25 | Opening Remarks, Keynotes by Barbara Plank and Leshem Choshen |
10:25 | 10:55 | Coffee Break |
10:55 | 11:30 | Talk Session 1 |
11:30 | 12:30 | Poster Session Part 1 |
12:30 | 14:00 | Lunch Break |
14:00 | 15:00 | Poster Session Part 2 |
15:00 | 15:30 | Talk Session 2 |
15:30 | 16:00 | Coffee Break |
16:00 | 16:15 | Talk Session 3 |
16:15 | 16:55 | Keynote by Ehud Reiter |
16:55 | 17:40 | Panel Discussion |
17:40 | 17:50 | Closing Remarks |
Keynotes
Keynote 1 - Barbara Plank
Ambiguity, Consistency and Reasoning in LLMs
ABSTRACT
Large Language Models (LLMs) are powerful yet fallible tools, often struggling with ambiguity, inconsistency, and flawed reasoning. This talk explores some of our recent research exposing these limitations in text and language-vision models. We examine how they misinterpret ambiguous entities, fail to maintain self-consistency, and exhibit biases when these issues remain unresolved. Using insights from controlled studies and new benchmarks, we dissect how models “know” but often cannot “apply” or “verify” that knowledge. We also highlight a promising intervention — vector ablation — to surgically address false refusals without sacrificing model accuracy. Together, these findings reveal the critical need for more work on nuanced evaluation and fine-grained control mechanisms in future LLM development.
BIO
Barbara Plank is Full Professor for AI and Computational Linguistics at LMU Munich, where she directs the MaiNLP lab and co-directs the Center for Information and Language Processing. She is also an ELLIS Fellow and Visiting Full Professor at the IT University of Copenhagen. Her research lab focuses on human-facing NLP: making NLP models and evaluation more robust and inclusive, so that NLP deals better with shifts in data due to language variation, is fairer, and embraces human label variation.
Keynote 2 - Leshem Choshen
Evaluation at the Heart of the AI Wave
ABSTRACT
The AI wind also blows in the sails of evaluation, creating a mass of evaluation work. In this talk, Leshem will present some of the most pressing open problems in evaluation and illustrate them with efforts they have participated in. These "blue sea" problems include pretraining evaluation, unified evaluation, multicultural evaluation, and contamination.
BIO
Leshem Choshen is a postdoctoral researcher at MIT and MIT-IBM, studying communal LLMs, from community-built LLMs to LLMs for humans and communities. They co-created model merging, TIES merging, and the BabyLM pretraining challenge. They work constantly with the community on efforts such as gathering chats (please contribute), LLM games in TextArena, and other initiatives that call for community involvement. Throughout this work, they emphasize evaluation aspects, including reliable and efficient evaluation, tinyBenchmarks, benchmark agreement testing, and pretraining evaluation.
Keynote 3 - Ehud Reiter
We Should Evaluate Real-World Impact
ABSTRACT
The ACL community has shown very little interest in evaluating the real-world impact of deployed NLP systems. This limits the usefulness and rate of adoption of NLP in areas such as medicine. I will discuss various ways of evaluating real-world impact, and then share the results of a structured survey of the ACL Anthology, which suggests that perhaps 0.1% of its papers evaluate real-world impact; furthermore, most Anthology papers which include impact evaluations present them very sketchily and instead focus on metric evaluations. I will conclude with a discussion of when impact evaluation is appropriate, and steps the community could take to encourage it.
BIO
Ehud Reiter is a Professor of Computing Science at the University of Aberdeen and was formerly Chief Scientist of Arria NLG (a spinout he cofounded). He has been working on Natural Language Generation for 35 years, and in recent years has focused on healthcare applications and the evaluation of language generation. He is one of the most cited researchers in NLG, and his awards include an INLG Test of Time award for his work on data-to-text. He writes a widely read blog on NLG and evaluation (ehudreiter.com), and wrote a book on NLG, which was published in November 2024.
Panelists
Douwe Kiela
Thiago Castro Ferreira
Pushkar Mishra
Sessions and Papers
Talk Session 1
Title | Authors |
---|---|
ReproNLP Shared Task Overview | Anya Belz for the ReproNLP Team (https://repronlp.github.io/) |
Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs | Minsuh Joo, Hyunsoo Cho |
Talk Session 2
Title | Authors |
---|---|
CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization | Brihi Joshi, Sriram Venkatapathy, Mohit Bansal, Nanyun Peng, Haw-Shiuan Chang |
PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory | Junho Myung, Yeon Su Park, Sunwoo Kim, Shin Yoo, Alice Oh |
Talk Session 3
Title | Authors |
---|---|
Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans | Javier Conde, Miguel Gonzalez, Maria Grandury, Pedro Reviriego, Gonzalo Martinez, Marc Brysbaert |
Poster Session - In-Person
All posters can be presented during both parts of the split poster session (with a lunch break in between).
- Does Biomedical Training Lead to Better Medical Performance?, Amin Dada, Marie Bauer, Jean-Philippe Corbeil, Amanda Butler, Osman Alperen, Constantin Marc, Kaleb E, Julian Friedrich, Jens Kleesiek
- HEDS 3.0: The Human Evaluation Data Sheet Version 3.0, Anya Belz, Craig Thomson
- ARGENT: Automatic Reference-free Evaluation for Open-Ended Text Generation without Source Inputs, Xinyue Zhang, Agathe Zecevic, Sebastian Zeki, Angus Roberts
- Are LLMs (Really) Ideological? An IRT-based Analysis and Alignment Tool for Perceived Socio-Economic Bias in LLMs, Jasmin Wachter, Michael Radloff, Maja Smolej, Katharina Kinder-Kurlanda
- Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons, Isik Baran, Tu Anh, Jan Niehues
- Free-text Rationale Generation under Readability Level Control, Yi-Sheng Hsu, Nils Feldhus, Sherzod Hakimov
- Selective Shot Learning for Code Explanation, Paheli Bhattacharya, Rishabh Gupta
- Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?, Evangelia Gogoulou, Shorouq Zahra, Liane Guillou, Luise Dürlich, Joakim Nivre
- Evaluating LLMs with Multiple Problems at once, Zhengxiang Wang, Jordan Kodner, Owen Rambow
- Learning and Evaluating Factual Clarification Question Generation Without Examples, Matthew Toles, Yukun Huang, Zhou Yu
- SECQUE: A Benchmark for Evaluating Real-World Financial Analysis Capabilities, Noga BenYoash, Menachem Brief, Oded Ovadia, Gil Shenderovitz, Moshik Mishaeli, Rachel Lemberg, Eitam Sheetrit
- One ruler to measure them all: Benchmarking multilingual long-context language models, Yekyung Kim, Jenna Russell, Marzena Karpinska, Mohit Iyyer
- Measure only what is measurable: towards conversation requirements for evaluating task-oriented dialogue systems, Emiel van, Anouck Braggaar, Emmelyn Croes, Florian Kunneman, Christine Liebrecht, Gabriëlla Martijn
- Are Bias Evaluation Methods Biased?, Lina Berrayana, Sean Rooney, Luis Garcés-Erice, Ioana Giurgiu
- IRSum: One Model to Rule Summarization and Retrieval, Sotaro Takeshita, Simone Paolo, Kai Eckert
- Metric assessment protocol in the context of answer fluctuation on MCQ tasks, Ekaterina Goliakova, Xavier Renard, Marie-Jeanne Lesot, Thibault Laugel, Christophe Marsala, Marcin Detyniecki
- CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset, Jindřich Helcl, Andrei-Alexandru Manea, Gianluca Vico, Jindřich Libovický
- Using LLM-as-judge Evaluation for Sanity-checking Results and Reproducibility of Human Evaluations of NLP Systems, Rudali Huidrom, Anya Belz
- HuGME: A benchmark system for evaluating Hungarian generative LLMs, Noémi Ligeti-Nagy, Gábor Madarász, Flóra Földesi, Péter Hatvani, Mariann Lengyel, Mátyás Osváth, Bence Sárossy, Kristóf Varga, Győző Zijian, Enikő Héja, Tamás Váradi, Gábor Prószéky
- CacheSaver: A Modular Framework for Efficient, Cost-Effective, and Reproducible LLM Inference, Nearchos Potamitis, Lars Henning, Laurent Bindschaedler, Niket Tandon, Akhil Arora
- OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs, Ivan Kartáč, Mateusz Lango, Ondrej Dusek
- Investigating the Robustness of Retrieval-Augmented Generation at the Query Level, Sezen Perçin, Xin Su, Qutub Sha, Phillip Howard, Aleksei Kuvshinov, Leo Schwinn, Kay-Ulrich Scholl
- Sourcing Fresh Resources for Table-to-Text Generation Evaluation, Kristýna Onderková, Ondrej Platek, Zdeněk Kasner, Ondrej Dusek
- Big Escape Benchmark: Evaluating Human-Like Reasoning in Language Models via Real-World Escape Room Challenges, Zinan Tang, QiYao Sun
- Event-based evaluation of abstractive news summarization, Huiling You, Samia Touileb, Lilja Øvrelid, Erik Velldal
- Prompt-Based Evolution for Diverse and Generalizable Toxic Language Datasets, Iago Alves, Julia Soares, Fernanda Bufon, Diogo Fernandes, Arlindo Rodrigues
- Faithfulness Metrics Do Not Generalize Well: A Case Study in Summarization, Patrícia Schmidtová, Ondrej Dusek, Saad Mahamood
- Fine-Tune on the Format: First Improving Multiple-Choice Evaluation for Intermediate LLM Checkpoints, Alec Bunn, Ben Bogin, Sarah Wiegreffe
- Improving Large Language Model Confidence Estimates using Extractive Rationales for Classification, Jane Arleth, Iris Hendrickx, Martha Larson
- ReproHum #0729-04: Human Evaluation Reproduction Report for "MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes", Simeon Junker
- ReproHum #0031-01: Reproducing the Human Evaluation of Readability from "It is AI’s Turn to Ask Humans a Question", Daniel Braun
- ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective, Andra-Maria Florescu, Marius Micluța-Câmpeanu, Stefana Arina, Liviu P
- ReproHum: #0744-02: Investigating the Reproducibility of Semantic Preservation Human Evaluations, Mohammad Arvan, Natalie Parde
- ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation, Kristýna Onderková, Mateusz Lango, Patrícia Schmidtová, Ondrej Dusek
- ReproHum #0729-04: Partial reproduction of the human evaluation of the MemSum and NeuSum summarisation systems, Simon Mille, Michela Lorandi
- Curse of bilinguality: Evaluating monolingual and bilingual language models on Chinese linguistic benchmarks, Yuwen Zhou, Yevgen Matusevych
- Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework, Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
- Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring, Kezia Oketch, John P. Lalor, Yi Yang, Ahmed Abbasi
- Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages, Christopher Toukmaji, Jeffrey Flanigan
- Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents, Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang
- Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models, Sherzod Hakimov, Lara Pfennigschmidt, David Schlangen
- Evaluating Grounded Reasoning by Code-Assisted LLMs for Mathematics, Zena Al Khalili, Nick Howell, Dietrich Klakow
- From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks, Andreas Stephan, Dawei Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth
- PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins, Sihan Chen, John P. Lalor, Yi Yang, Ahmed Abbasi
- Coreference as an indicator of context scope in multimodal narrative, Nikolai Ilinykh, Shalom Lappin, Asad B. Sayeed, Sharid Loáiciga
- PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics, Qixiang Fang, Daniel Oberski, Dong Nguyen
- MCQFormatBench: Robustness Tests for Multiple-Choice Questions, Hiroo Takizawa, Saku Sugawara, Akiko Aizawa
- (Dis)improved?! How Simplified Language Affects Large Language Model Performance across Languages, Miriam Anschütz, Anastasiya Damaratskaya, Chaeeun Joy Lee, Arthur Schmalz, Edoardo Mosca, Georg Groh
- Finance Language Model Evaluation (FLaME), Glenn Matlin, Mika Okamoto, Huzaifa Pardawala, Yang Yang, Sudheer Chava
- sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting, Sanchit Ahuja, Kumar Tanmay, Hardik Hansrajbhai Chauhan, Barun Patra, Kriti Aggarwal, Luciano Del Corro, Arindam Mitra, Tejas Indulal Dhamecha, Ahmed Hassan Awadallah, Monojit Choudhury, Vishrav Chaudhary, Sunayana Sitaram
- Single- vs. Dual-Prompt Dialogue Generation With LLMs For Job Interviews In Human Resources, Joachim De Baer, A. Seza Doğruöz, Thomas Demeester, Chris Develder
- U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs, Konstantin Chernyshev, Vitaliy Polshkov, Vlad Stepanov, Alex Myasnikov, Ekaterina Artemova, Alexei Miasnikov, Sergei Tilga
- [Findings] Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?, Sohee Yang, Nora Kassner, Elena Gribovskaya, Sebastian Riedel, Mor Geva
- [Findings] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?, Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt
- [Findings] Structured Discourse Representation for Factual Consistency Verification, Kun Zhang, Oana Balalau, Ioana Manolescu
- [Findings] Evaluating LLMs’ Assessment of Mixed-Context Hallucination Through the Lens of Summarization, Siya Qi, Rui Cao, Yulan He, Zheng Yuan
- [Findings] Assessing the Reasoning Capabilities of LLMs in the context of Evidence-based Claim Verification, John Dougrez-Lewis, Mahmud Elahi Akhter, Federico Ruggeri, Sebastian Löbbers, Yulan He, Maria Liakata
Poster Session - Virtual
- Multi-Dimensional Evaluation of Open-Source Language Models: Based on Machine Learning and Bayesian Optimization, Qingchen Yu
- Spatial Representation of Large Language Models in 2D Scene, Wenya Wu, Weihong Deng
- The Fellowship of the LLMs: Multi-Model Workflows for Synthetic Preference Optimization Dataset Generation, Samee Arif, Sualeha Farid, Abdul Hameed, Awais Athar, Agha Ali
- Leveraging LLM-based sentiment analysis for portfolio optimization with proximal policy optimization, Kemal Kirtac, Guido Germano
- Evaluating LLMs Beyond Standard Text: A Benchmark on Non-Traditional Text Variations, Jihyun Kim, Yejee Kim, HyunJeong Kang, Sumyeong Kim, Minji Son, Hyeseung Han, Kyungwoo Song
- Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation, Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai
- Can Perplexity Predict Finetuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali, Nishant Luitel, Nirajan Bekoju, Anand Kumar, Subarna Shakya
- Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs, Jing Yang, Kong Aik, Woon-Seng Gan
- (Towards) Scalable Reliable Automated Evaluation with Large Language Models, Bertil Braun, Martin Forell
- Clustering Zero-Shot Uncertainty Estimations to Assess LLM Response Accuracy for Yes/No Q&A, Christopher T., Amy Vennos, W. Graham, Daniel Dakota
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, Aman Singh, Kartik Choudhary, Venkat Srinik, Sankaran Vaidyanathan, Dieuwke Hupkes
- Are AI Datasets Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts, German Gritsai, Anastasia Voznyuk, Andrey Grabovoy, Yury Chekhovich
- Analyzing the Sensitivity of Vision Language Models in Visual Question Answering, Monika shah, Sudarshan Balaji, Somdeb Sarkhel, Sanorita Dey, Deepak Venugopal
- ELAB: Extensive LLM Alignment Benchmark in Persian Language, Zahra Pourbahman, Fatemeh rajabi, Mohammadhossein Sadeghi, Omid Ghahroodi, Somayeh Bakhshaei, Arash Amini, Reza Kazemi, Mahdieh Soleymani
- Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish, Elif Ecem, Ayşe Aysu, Ahmet Kaan, Seyma Erdem, Burak Aytan, Busra Tufan, Abdullah Topraksoy, Esra Darici, Cagri Toraman
- Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?, Xuan Qi, Jiahao Qiu, Xinzhe Juan, Yue Wu, Mengdi Wang
- ReproHum #0744-02: A Reproduction of the Human Evaluation of Meaning Preservation in "Factorising Meaning and Form for Intent-Preserving Paraphrasing", Julius Steen, Katja Markert
- ReproHum #0067-01: A Reproduction of the Evaluation of Cross-Lingual Summarization, Supryadi, Chuang Liu, Deyi Xiong
- Fine-Grained Constraint Generation-Verification for Improved Instruction-Following, Zhixiang Liang, Zhenyu Hou, Xiao Wang
- Natural Language Counterfactual Explanations in Financial Text Classification: A Comparison of Generators and Evaluation Metrics, Karol Dobiczek, Patrick Altmeyer, Cynthia C. S. Liem
- An Analysis of Datasets, Metrics and Models in Keyphrase Generation, Florian Boudin, Akiko Aizawa
Important Dates
- July 31, 2025: Workshop Date
Organizers
- Sebastian Gehrmann, Bloomberg
- Gabriel Stanovsky, Hebrew University of Jerusalem
- Simon Mille, Dublin City University
- Enrico Santus, Bloomberg
- Miruna Clinciu, Heriot-Watt University
- Kaustubh Dhole, Emory University
- Yotam Perlitz, IBM Research
- Rotem Dror, University of Haifa
- Itay Itzhak, Hebrew University of Jerusalem
- Ofir Arviv, IBM Research
- Eliya Habba, Hebrew University of Jerusalem
- Michal Shmueli-Scheuer, IBM Research
- João Sedoc, New York University
- Oyvind Tafjord, Allen Institute for Artificial Intelligence