The fourth iteration of the Generation, Evaluation & Metrics (GEM) Workshop will be held as part of ACL 2025 on July 31, 2025.
The workshop will be held in hybrid mode, with sessions both in person and via the conference portal.
Schedule
All times are in Vienna local time.
Start | End | Session |
---|---|---|
9:00 | 10:25 | Opening Remarks, Keynotes by Barbara Plank and Leshem Choshen |
10:25 | 10:55 | Coffee Break |
10:55 | 11:30 | Talk Session 1 |
11:30 | 12:30 | Poster Session Part 1 |
12:30 | 14:00 | Lunch Break |
14:00 | 15:00 | Poster Session Part 2 |
15:00 | 15:30 | Talk Session 2 |
15:30 | 16:00 | Coffee Break |
16:00 | 16:15 | Talk Session 3 |
16:15 | 16:55 | Keynote by Ehud Reiter |
16:55 | 17:40 | Panel Discussion |
17:40 | 17:50 | Closing Remarks |
Keynotes
Keynote 1 - Barbara Plank
Ambiguity, Consistency and Reasoning in LLMs
ABSTRACT
Large Language Models (LLMs) are powerful yet fallible tools, often struggling with ambiguity, inconsistency, and flawed reasoning. This talk explores some of our recent research exposing these limitations in text and language-vision models. We examine how they misinterpret ambiguous entities, fail to maintain self-consistency, and exhibit biases when these issues remain unresolved. Using insights from controlled studies and new benchmarks, we dissect how models “know” but often cannot “apply” or “verify” that knowledge. We also highlight a promising intervention — vector ablation — to surgically address false refusals without sacrificing model accuracy. Together, these findings reveal the critical need for more work on nuanced evaluation and fine-grained control mechanisms in future LLM development.
BIO
Barbara Plank is Full Professor for AI and Computational Linguistics at LMU Munich, where she directs the MaiNLP lab and co-directs the Center for Information and Language Processing. She is also an ELLIS Fellow and Visiting Full Professor at the IT University of Copenhagen. Her research lab focuses on human-facing NLP: making NLP models and evaluation more robust and inclusive, so that NLP deals better with shifts in data due to language variation, is fairer, and embraces human label variation.
Keynote 2 - Leshem Choshen
Evaluation at the Heart of the AI Wave
ABSTRACT
The AI wind also blows in the sails of evaluation, creating a mass of evaluation work. In this talk, Leshem will present some of the most pressing open problems in evaluation and illustrate them with efforts they have participated in. These "blue sea" problems include pretraining evaluation, unified evaluation, multicultural evaluation, and contamination.
BIO
Leshem Choshen is a postdoctoral researcher at MIT and MIT-IBM, studying communal LLMs, from community-built LLMs to LLMs for humans and communities. They co-created model merging, TIES merging, and the BabyLM pretraining challenge. They work constantly with the community on efforts such as gathering chats (please contribute), LLM games in TextArena, and other initiatives that call for community involvement. Throughout this work, they emphasize evaluation aspects, including reliable and efficient evaluation, tinyBenchmarks, benchmark agreement testing, and pretraining evaluation.
Keynote 3 - Ehud Reiter
We Should Evaluate Real-World Impact
ABSTRACT
The ACL community has shown very little interest in evaluating the real-world impact of deployed NLP systems. This limits the usefulness and rate of adoption of NLP in areas such as medicine. I will discuss various ways of evaluating real-world impact, and then share the results of a structured survey of the ACL Anthology, which suggests that perhaps 0.1% of its papers evaluate real-world impact; furthermore, most Anthology papers which include impact evaluations present them very sketchily and instead focus on metric evaluations. I will conclude with a discussion of when impact evaluation is appropriate, and steps the community could take to encourage it.
BIO
Ehud Reiter is a Professor of Computing Science at the University of Aberdeen and was formerly Chief Scientist of Arria NLG (a spinout he cofounded). He has been working on Natural Language Generation for 35 years, and in recent years has focused on healthcare applications and the evaluation of language generation. He is one of the most cited researchers in NLG, and his awards include an INLG Test of Time award for his work on data-to-text. He writes a widely read blog on NLG and evaluation (ehudreiter.com), and wrote a book on NLG, which was published in November 2024.
Panelists
Douwe Kiela
Thiago Castro Ferreira
Pushkar Mishra
Sessions and Papers
Talk Session 1
Title | Authors |
---|---|
ReproNLP Shared Task Overview | Anya Belz for the ReproNLP Team (https://repronlp.github.io/) |
Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs | Minsuh Joo, Hyunsoo Cho |
Talk Session 2
Title | Authors |
---|---|
CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization | Brihi Joshi, Sriram Venkatapathy, Mohit Bansal, Nanyun Peng, Haw-Shiuan Chang |
PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory | Junho Myung, Yeon Su Park, Sunwoo Kim, Shin Yoo, Alice Oh |
Talk Session 3
Title | Authors |
---|---|
Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans | Javier Conde, Miguel Gonzalez, Maria Grandury, Pedro Reviriego, Gonzalo Martinez, Marc Brysbaert |
Poster Session - In-Person
All posters can be presented during both parts of the split poster session (with a lunch break in between).
- Does Biomedical Training Lead to Better Medical Performance?, Amin Dada, Marie Bauer, Jean-Philippe Corbeil, Amanda Butler, Osman Alperen, Constantin Marc, Kaleb E, Julian Friedrich, Jens Kleesiek
- HEDS 3.0: The Human Evaluation Data Sheet Version 3.0, Anya Belz, Craig Thomson
- ARGENT: Automatic Reference-free Evaluation for Open-Ended Text Generation without Source Inputs, Xinyue Zhang, Agathe Zecevic, Sebastian Zeki, Angus Roberts
- Are LLMs (Really) Ideological? An IRT-based Analysis and Alignment Tool for Perceived Socio-Economic Bias in LLMs, Jasmin Wachter, Michael Radloff, Maja Smolej, Katharina Kinder-Kurlanda
- Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons, Isik Baran, Tu Anh, Jan Niehues
- Free-text Rationale Generation under Readability Level Control, Yi-Sheng Hsu, Nils Feldhus, Sherzod Hakimov
- Selective Shot Learning for Code Explanation, Paheli Bhattacharya, Rishabh Gupta
- Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?, Evangelia Gogoulou, Shorouq Zahra, Liane Guillou, Luise Dürlich, Joakim Nivre
- Evaluating LLMs with Multiple Problems at once, Zhengxiang Wang, Jordan Kodner, Owen Rambow
- Learning and Evaluating Factual Clarification Question Generation Without Examples, Matthew Toles, Yukun Huang, Zhou Yu
- SECQUE: A Benchmark for Evaluating Real-World Financial Analysis Capabilities, Noga BenYoash, Menachem Brief, Oded Ovadia, Gil Shenderovitz, Moshik Mishaeli, Rachel Lemberg, Eitam Sheetrit
- One ruler to measure them all: Benchmarking multilingual long-context language models, Yekyung Kim, Jenna Russell, Marzena Karpinska, Mohit Iyyer
- Measure only what is measurable: towards conversation requirements for evaluating task-oriented dialogue systems, Emiel van, Anouck Braggaar, Emmelyn Croes, Florian Kunneman, Christine Liebrecht, Gabriëlla Martijn
- Are Bias Evaluation Methods Biased?, Lina Berrayana, Sean Rooney, Luis Garcés-Erice, Ioana Giurgiu
- IRSum: One Model to Rule Summarization and Retrieval, Sotaro Takeshita, Simone Paolo, Kai Eckert
- Metric assessment protocol in the context of answer fluctuation on MCQ tasks, Ekaterina Goliakova, Xavier Renard, Marie-Jeanne Lesot, Thibault Laugel, Christophe Marsala, Marcin Detyniecki
- CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset, Jindřich Helcl, Andrei-Alexandru Manea, Gianluca Vico, Jindřich Libovický
- Using LLM-as-judge Evaluation for Sanity-checking Results and Reproducibility of Human Evaluations of NLP Systems, Rudali Huidrom, Anya Belz
- HuGME: A benchmark system for evaluating Hungarian generative LLMs, Noémi Ligeti-Nagy, Gábor Madarász, Flóra Földesi, Péter Hatvani, Mariann Lengyel, Mátyás Osváth, Bence Sárossy, Kristóf Varga, Győző Zijian, Enikő Héja, Tamás Váradi, Gábor Prószéky
- CacheSaver: A Modular Framework for Efficient, Cost-Effective, and Reproducible LLM Inference, Nearchos Potamitis, Lars Henning, Laurent Bindschaedler, Niket Tandon, Akhil Arora
- OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs, Ivan Kartáč, Mateusz Lango, Ondrej Dusek
- Investigating the Robustness of Retrieval-Augmented Generation at the Query Level, Sezen Perçin, Xin Su, Qutub Sha, Phillip Howard, Aleksei Kuvshinov, Leo Schwinn, Kay-Ulrich Scholl
- Sourcing Fresh Resources for Table-to-Text Generation Evaluation, Kristýna Onderková, Ondrej Platek, Zdeněk Kasner, Ondrej Dusek
- Big Escape Benchmark: Evaluating Human-Like Reasoning in Language Models via Real-World Escape Room Challenges, Zinan Tang, QiYao Sun
- Event-based evaluation of abstractive news summarization, Huiling You, Samia Touileb, Lilja Øvrelid, Erik Velldal
- Prompt-Based Evolution for Diverse and Generalizable Toxic Language Datasets, Iago Alves, Julia Soares, Fernanda Bufon, Diogo Fernandes, Arlindo Rodrigues
- Faithfulness Metrics Do Not Generalize Well: A Case Study in Summarization, Patrícia Schmidtová, Ondrej Dusek, Saad Mahamood
- Fine-Tune on the Format: First Improving Multiple-Choice Evaluation for Intermediate LLM Checkpoints, Alec Bunn, Ben Bogin, Sarah Wiegreffe
- Improving Large Language Model Confidence Estimates using Extractive Rationales for Classification, Jane Arleth, Iris Hendrickx, Martha Larson
- ReproHum #0729-04: Human Evaluation Reproduction Report for "MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes", Simeon Junker
- ReproHum #0031-01: Reproducing the Human Evaluation of Readability from "It is AI’s Turn to Ask Humans a Question", Daniel Braun
- ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective, Andra-Maria Florescu, Marius Micluța-Câmpeanu, Stefana Arina, Liviu P
- ReproHum: #0744-02: Investigating the Reproducibility of Semantic Preservation Human Evaluations, Mohammad Arvan, Natalie Parde
- ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation, Kristýna Onderková, Mateusz Lango, Patrícia Schmidtová, Ondrej Dusek
- ReproHum #0729-04: Partial reproduction of the human evaluation of the MemSum and NeuSum summarisation systems, Simon Mille, Michela Lorandi
- Curse of bilinguality: Evaluating monolingual and bilingual language models on Chinese linguistic benchmarks, Yuwen Zhou, Yevgen Matusevych
- Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework, Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
- Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring, Kezia Oketch, John P. Lalor, Yi Yang, Ahmed Abbasi
- Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages, Christopher Toukmaji, Jeffrey Flanigan
- Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents, Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang
- Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models, Sherzod Hakimov, Lara Pfennigschmidt, David Schlangen
- Evaluating Grounded Reasoning by Code-Assisted LLMs for Mathematics, Zena Al Khalili, Nick Howell, Dietrich Klakow
- From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks, Andreas Stephan, Dawei Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth
- PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins, Sihan Chen, John P. Lalor, Yi Yang, Ahmed Abbasi
- Coreference as an indicator of context scope in multimodal narrative, Nikolai Ilinykh, Shalom Lappin, Asad B. Sayeed, Sharid Loáiciga
- PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics, Qixiang Fang, Daniel Oberski, Dong Nguyen
- MCQFormatBench: Robustness Tests for Multiple-Choice Questions, Hiroo Takizawa, Saku Sugawara, Akiko Aizawa
- (Dis)improved?! How Simplified Language Affects Large Language Model Performance across Languages, Miriam Anschütz, Anastasiya Damaratskaya, Chaeeun Joy Lee, Arthur Schmalz, Edoardo Mosca, Georg Groh
- Finance Language Model Evaluation (FLaME), Glenn Matlin, Mika Okamoto, Huzaifa Pardawala, Yang Yang, Sudheer Chava
- sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting, Sanchit Ahuja, Kumar Tanmay, Hardik Hansrajbhai Chauhan, Barun Patra, Kriti Aggarwal, Luciano Del Corro, Arindam Mitra, Tejas Indulal Dhamecha, Ahmed Hassan Awadallah, Monojit Choudhury, Vishrav Chaudhary, Sunayana Sitaram
- Single- vs. Dual-Prompt Dialogue Generation With LLMs For Job Interviews In Human Resources, Joachim De Baer, A. Seza Doğruöz, Thomas Demeester, Chris Develder
- U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs, Konstantin Chernyshev, Vitaliy Polshkov, Vlad Stepanov, Alex Myasnikov, Ekaterina Artemova, Alexei Miasnikov, Sergei Tilga
- [Findings] Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?, Sohee Yang, Nora Kassner, Elena Gribovskaya, Sebastian Riedel, Mor Geva
- [Findings] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?, Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt
- [Findings] Structured Discourse Representation for Factual Consistency Verification, Kun Zhang, Oana Balalau, Ioana Manolescu
- [Findings] Evaluating LLMs’ Assessment of Mixed-Context Hallucination Through the Lens of Summarization, Siya Qi, Rui Cao, Yulan He, Zheng Yuan
- [Findings] Assessing the Reasoning Capabilities of LLMs in the context of Evidence-based Claim Verification, John Dougrez-Lewis, Mahmud Elahi Akhter, Federico Ruggeri, Sebastian Löbbers, Yulan He, Maria Liakata
Poster Session - Virtual
- Multi-Dimensional Evaluation of Open-Source Language Models: Based on Machine Learning and Bayesian Optimization, Qingchen Yu
- Spatial Representation of Large Language Models in 2D Scene, Wenya Wu, Weihong Deng
- The Fellowship of the LLMs: Multi-Model Workflows for Synthetic Preference Optimization Dataset Generation, Samee Arif, Sualeha Farid, Abdul Hameed, Awais Athar, Agha Ali
- Leveraging LLM-based sentiment analysis for portfolio optimization with proximal policy optimization, Kemal Kirtac, Guido Germano
- Evaluating LLMs Beyond Standard Text: A Benchmark on Non-Traditional Text Variations, Jihyun Kim, Yejee Kim, HyunJeong Kang, Sumyeong Kim, Minji Son, Hyeseung Han, Kyungwoo Song
- Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation, Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai
- Can Perplexity Predict Finetuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali, Nishant Luitel, Nirajan Bekoju, Anand Kumar, Subarna Shakya
- Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs, Jing Yang, Kong Aik, Woon-Seng Gan
- (Towards) Scalable Reliable Automated Evaluation with Large Language Models, Bertil Braun, Martin Forell
- Clustering Zero-Shot Uncertainty Estimations to Assess LLM Response Accuracy for Yes/No Q&A, Christopher T., Amy Vennos, W. Graham, Daniel Dakota
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, Aman Singh, Kartik Choudhary, Venkat Srinik, Sankaran Vaidyanathan, Dieuwke Hupkes
- Are AI Datasets Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts, German Gritsai, Anastasia Voznyuk, Andrey Grabovoy, Yury Chekhovich
- Analyzing the Sensitivity of Vision Language Models in Visual Question Answering, Monika shah, Sudarshan Balaji, Somdeb Sarkhel, Sanorita Dey, Deepak Venugopal
- ELAB: Extensive LLM Alignment Benchmark in Persian Language, Zahra Pourbahman, Fatemeh rajabi, Mohammadhossein Sadeghi, Omid Ghahroodi, Somayeh Bakhshaei, Arash Amini, Reza Kazemi, Mahdieh Soleymani
- Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish, Elif Ecem, Ayşe Aysu, Ahmet Kaan, Seyma Erdem, Burak Aytan, Busra Tufan, Abdullah Topraksoy, Esra Darici, Cagri Toraman
- Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?, Xuan Qi, Jiahao Qiu, Xinzhe Juan, Yue Wu, Mengdi Wang
- ReproHum #0744-02: A Reproduction of the Human Evaluation of Meaning Preservation in "Factorising Meaning and Form for Intent-Preserving Paraphrasing", Julius Steen, Katja Markert
- ReproHum #0067-01: A Reproduction of the Evaluation of Cross-Lingual Summarization, Supryadi, Chuang Liu, Deyi Xiong
- Fine-Grained Constraint Generation-Verification for Improved Instruction-Following, Zhixiang Liang, Zhenyu Hou, Xiao Wang
- Natural Language Counterfactual Explanations in Financial Text Classification: A Comparison of Generators and Evaluation Metrics, Karol Dobiczek, Patrick Altmeyer, Cynthia C. S. Liem
- An Analysis of Datasets, Metrics and Models in Keyphrase Generation, Florian Boudin, Akiko Aizawa
Important Dates
- July 31, 2025: Workshop Date
Organizers
- Sebastian Gehrmann, Bloomberg
- Gabriel Stanovsky, Hebrew University of Jerusalem
- Simon Mille, Dublin City University
- Enrico Santus, Bloomberg
- Miruna Clinciu, Heriot-Watt University
- Kaustubh Dhole, Emory University
- Yotam Perlitz, IBM Research
- Rotem Dror, University of Haifa
- Itay Itzhak, Hebrew University of Jerusalem
- Ofir Arviv, IBM Research
- Eliya Habba, Hebrew University of Jerusalem
- Michal Shmueli-Scheuer, IBM Research
- João Sedoc, New York University
- Oyvind Tafjord, Allen Institute for Artificial Intelligence