4th Table Representation Learning Workshop @ ACL 2025

July 31st 2025, Vienna, Austria.



Mailing list: sign up here!
Follow on Bluesky: @trl-research
New: Join the TRL Discord!

Tables are a promising modality for representation learning and generative models, with too much application potential to ignore. However, tables have long been overlooked despite their dominant presence in the data landscape, e.g. in data management and analysis pipelines. The majority of datasets in Google Dataset Search, for example, resemble typical tabular file formats like CSVs. Similarly, the top-3 most-used database management systems are all intended for relational data. Representation learning for tables, possibly combined with other modalities such as code and text, has shown impressive performance for tasks like semantic parsing, question answering, table understanding, data preparation, and data analysis (e.g. text-to-SQL). The pre-training paradigm has also been shown to be effective for tabular ML (classification/regression). More recently, we also observe promising potential in applying and enhancing LLMs in the domain of structured data, improving how we process and derive insights from it.

The Table Representation Learning (TRL) workshop is the premier venue in this emerging research area and has three main goals:

  • (1) Motivate structured data (e.g. tables) as a primary modality for representation and generative models and advance the area further.
  • (2) Showcase impactful applications of pretrained table models and identify open challenges for future research, with a particular focus on progress in NLP for this edition at ACL in 2025.
  • (3) Foster discussion and collaboration across the NLP, ML, IR and DB communities.

When: July 31st 2025
Where: Room 2.15, Austria Center Vienna, Vienna, Austria

Sponsored by:


Call for Papers


Important Dates


Submission Open March 1st, 2025
Submission Deadline (extended) April 21st, 2025 (11:59PM AoE)
Notifications (extended) May 19th, 2025 (11:59PM AoE)
Camera-ready May 30th, 2025 (11:59PM AoE)
Slides for selected contributed talks July 28th, 2025 (11:59PM AoE)
Workshop Date July 31st, 2025

Scope

We invite submissions on any of, or related to, the following topics on machine learning for tabular data:

  • Representation Learning for (semi-)Structured Data such as spreadsheets, tables, and full relational databases. Example contributions are new model architectures, data encoding techniques, tailored tokenization methods, pre-training and fine-tuning techniques, etc.
  • Generative Models and LLMs for Structured Data, including Large Language Models (LLMs) and diffusion models, and specialized techniques for prompt engineering, single-task and multi-task fine-tuning, LLM-driven interfaces and multi-agent systems, retrieval-augmented generation, etc.
  • Multimodal Learning where structured data is jointly embedded or combined with other modalities such as text, code (e.g., SQL), knowledge graphs, and visualizations/images.
  • Applications of TRL models for tasks like data preparation (e.g. data cleaning, validation, integration, cataloging, feature engineering), retrieval (e.g. data search, fact-checking/QA, KG alignment), analysis (e.g. text-to-SQL and visualization), tabular data generation, (end-to-end) tabular machine learning, table extraction (e.g. parsers/extraction for unstructured data), and query optimization (e.g. cardinality estimation).
  • Challenges of TRL models in production: work addressing the challenges of maintaining and managing TRL models in fast-evolving contexts, e.g., data updating, error correction and monitoring, data privacy, personalization, performance, etc.
  • Domain-specific challenges for learned table models, which often arise in domains such as enterprise, finance, medicine, and law. These challenges pertain to table content, table structure, privacy and security limitations, and other factors that necessitate tailored solutions.
  • Benchmarks, analyses, and datasets for TRL including assessing LLMs and other generative models as base models versus alternative approaches, analysis of model robustness with respect to large, messy, and heterogeneous tabular data, etc.
  • Other contributions such as surveys, demonstrations, visions, and reflections on table representation learning and generative models for structured data.

Organization

Workshop Chairs


Qian Liu
ByteDance

Wenhu Chen
University of Waterloo

Huan Sun
The Ohio State University


Program


Invited Speakers


Dan Roth
University of Pennsylvania, Oracle

Tao Yu
HKU

Edward Choi
KAIST

Julian Eisenschlos
Google DeepMind


Schedule



09:00 AM Opening
09:05 AM Session 1 (Retrieval & Conversational Analysis):
09:05 AM - 09:40 AM Invited talk by Dan Roth: On Retrieving & Reasoning LLMs: Myths, Merits, and How to Move Forward
[Abstract]
The rapid progress made over the last few years in generating linguistically coherent natural language has blurred, in the minds of many, the difference between natural language generation, understanding, knowledge retrieval and use, and the ability to reason with respect to the world. Nevertheless, reliably and consistently supporting high-level decisions that depend on natural language understanding and heterogeneous information retrieval is still difficult, mostly, but not only, because most of these tasks are computationally more complex than language models can support. I will discuss some of the challenges underlying reasoning and information access and argue that we should exploit what LLMs do well while delegating responsibility to special-purpose models and solvers for decision making. I will present some of our work in this space, focusing on supporting reasoning and information access.
[Speaker Bio]
Dan Roth is the Eduardo D. Glandt Distinguished Professor at the University of Pennsylvania and Chief AI Scientist at Oracle. Until June 2024, Dan was a VP/Distinguished Scientist at AWS AI, where he led the scientific effort behind Amazon's first-generation GenAI products, including Titan Models, Amazon Q, and Amazon Bedrock. Dan is a Fellow of the AAAS, ACM, AAAI, and ACL, and a recipient of the IJCAI John McCarthy Award "for major conceptual and theoretical advances in the modeling of natural language understanding, machine learning, and reasoning." He has published broadly in natural language processing, machine learning, knowledge representation and reasoning, and learning theory, was the Editor-in-Chief of the Journal of Artificial Intelligence Research (JAIR), and has served as a Program Chair and Conference Chair for the major conferences in his research areas. Roth has been involved in several ML/NLP/GenAI startups in domains that range from legal and compliance to health care. Dan received his B.A. summa cum laude in Mathematics from the Technion, Israel, and his Ph.D. in Computer Science from Harvard University in 1995.
09:40 AM - 09:55 AM Can we Retrieve Everything All at Once? ARM: An Alignment-Oriented LLM-based Retrieval Method
Peter Baile Chen, Yi Zhang, Mike Cafarella, Dan Roth
09:55 AM - 10:10 AM Something's Fishy in the Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks
Allaa Boutaleb, Bernd Amann, Hubert Naacke, Rafael Angarita
10:10 AM - 10:25 AM In-context learning of Soft Nearest Neighbor Classifiers for Intelligible Tabular Machine Learning
Mykhailo Koshil, Matthias Feurer, Katharina Eggensperger
10:25 AM Coffee Break
11:00 AM Session 2 (Agentic Data Science & Text-to-SQL):
11:00 AM - 11:35 AM Invited Talk by Tao Yu (HKU): Advancing Data Science with AI Agents
[Abstract]
TBC
[Speaker Bio]
Tao Yu is an Assistant Professor of Computer Science at The University of Hong Kong and a director of the XLANG Lab (part of the HKU NLP Group). He spent one year in the UW NLP Group working with Noah Smith, Luke Zettlemoyer, and Mari Ostendorf. He completed his Ph.D. in Computer Science at Yale University, advised by Dragomir Radev, and his master's at Columbia University, advised by Owen Rambow and Kathleen McKeown. Tao has received Google and Amazon faculty research awards (Google Research Scholar Award 2023, Amazon Research Award 2022). His main research interest is in Natural Language Processing. His research aims to develop embodied AI agents that empower users to use language to interact with digital and physical environments to carry out real-world tasks. Such systems need to ground language and perception into code and actions executable in the corresponding embodied environment, helping people perform data science, control computers, and collaborate with robots.
11:35 AM - 11:50 AM R^3: This is My SQL, Are You With Me?
Hanchen Xia, Feng Jiang, Naihao Deng, Cunxiang Wang, Guojiang Zhao, Rada Mihalcea, Yue Zhang
11:50 AM - 12:30 PM Poster session 1 (unfold)
  • Can we Retrieve Everything All at Once? ARM: An Alignment-Oriented LLM-based Retrieval Method
  • Something's Fishy in the Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks
  • In-context learning of Soft Nearest Neighbor Classifiers for Intelligible Tabular Machine Learning
  • R^3: This is My SQL, Are You With Me?
  • Theme-Explanation Structure for Table Summarization using Large Language Models: A Case Study on Korean Tabular Data
  • SQLong: Enhanced NL2SQL for Longer Contexts with LLMs
  • RITT: A Retrieval-Assisted Framework with Image and Text Table Representations for Table Question Answering
  • Ask Me Like I'm Human: LLM-based Evaluation with For-Human Instructions Correlates Better with Human Evaluations than Human Judges
  • Perspective: Leveraging Domain Knowledge for Tabular Machine Learning in the Medical Domain
  • Embeddings for Numerical Features Using tanh Activation
  • Retrieval-Augmented Forecasting with Tabular Time Series Data
  • Improving Table Retrieval with Question Generation from Partial Tables
12:30 PM Lunch Break
01:30 PM Session 3 (Text-to-SQL & Domain Challenges):
01:30 PM - 02:05 PM Invited Talk by Edward Choi (KAIST): Text-to-SQL on Electronic Health Records: From simple QA to multi-modal and interactive QA
[Abstract]
Electronic health records (EHR) contain a vast number of medical records that can be leveraged for improving personal care as well as the operational aspects of medical institutions. Thanks to the introduction of LLMs and their remarkable advancement in recent years, we can now implement agents/services that can have complex interactions with the EHR via text-to-SQL, empowering users to make more informed decisions than before. This, however, means the evaluation of such agents must also be complex and interactive, even multi-modal, especially in the domain of medicine. In this talk, I will first briefly describe my view on AI evaluation, then introduce the basics of text-to-SQL for EHR, expand to multi-modal text-to-SQL, and conclude with interactive text-to-SQL for EHR.
[Speaker Bio]
Edward Choi is an associate professor at the Kim Jaechul Graduate School of AI, KAIST. He received his PhD from Georgia Tech under the supervision of Prof. Jimeng Sun, focusing on interpretable deep learning methods for electronic health records. Prior to joining KAIST, he worked on developing and analyzing medical prediction models at Google. His current research interests include question answering for multi-modal medical records, domain-specific reasoning LLMs, and user modeling with LLMs.
02:05 PM - 02:20 PM Resolution-Alignment-Completion of Tabular Electronic Health Records via Meta-Path Generative Sampling
S Mehryar
02:20 PM - 02:35 PM Sparks of Tabular Reasoning via Text2SQL Reinforcement Learning
Josefa Lia Stoisser, Marc Boubnovski Martell, Julien Fauqueur
02:35 PM - 02:50 PM BEAVER: An Enterprise Benchmark for Text-to-SQL
Peter Baile Chen, Fabian Wenz, Yi Zhang, Devin Yang, Justin Choi, Nesime Tatbul, Mike Cafarella, Çağatay Demiralp, Michael Stonebraker
02:50 PM - 03:30 PM Poster session 2 (unfold)
  • Resolution-Alignment-Completion of Tabular Electronic Health Records via Meta-Path Generative Sampling
  • Sparks of Tabular Reasoning via Text2SQL Reinforcement Learning
  • BEAVER: An Enterprise Benchmark for Text-to-SQL
  • Table Understanding and Multimodal (LLMs): A Cross-Domain Case Study on Scientific vs. Non-Scientific Data
  • How Well Do LLMs Reason over Tabular Data, Really?
  • iTBLS: A Dataset of Interactive Conversations Over Tabular Information
  • Generating Synthetic Relational Tabular Data via Structural Causal Models
  • Tables as Thought: Exploring Structured Thoughts in LLM Reasoning
  • Rethinking Table Instruction Tuning
  • LLM-Mixer: Multiscale Mixing in LLMs for Time Series Forecasting
  • TableKV: KV Cache Compression for In-Context Table Processing
  • OrQA – Open Data Retrieval for Question Answering Dataset Generation
03:30 PM Coffee Break
04:00 PM Session 4 (Tabular and Multimodal Reasoning):
04:00 PM - 04:35 PM Invited Talk by Julian Eisenschlos: How generation can drive understanding in visually situated language
[Abstract]
Large amounts of content, both online and offline, rely on structure to organize and communicate information more effectively. While natural image and language understanding and generation have been studied extensively, visually situated language such as tables, charts, plots, and infographics continues to be a challenge for models large and small. In this talk, we will show how teaching models to generate visually situated language can improve downstream reading and reasoning on this data modality for tasks such as question answering, entailment, and summarization through multi-step reasoning and tool use.
[Speaker Bio]
Julian Eisenschlos is a Staff Research Scientist at Google DeepMind tackling problems in visual language understanding and generation. He is a member of the Gemini core team, was previously a co-founder of Botmaker, and worked on ML at Meta and ASAPP.
04:35 PM - 04:50 PM Table Understanding and Multimodal (LLMs): A Cross-Domain Case Study on Scientific vs. Non-Scientific Data
Ekaterina Borisova, Fabio Barth, Nils Feldhus, Raia Abu Ahmad, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Sebastian Möller
04:50 PM - 05:05 PM How Well Do LLMs Reason over Tabular Data, Really?
Cornelius Wolff, Madelon Hulsebos
05:05 PM - 05:20 PM iTBLS: A Dataset of Interactive Conversations Over Tabular Information
Anirudh Sundar, Christopher Gordon Richardson, Larry Heck, Adar Avsian
05:20 PM Closing & Awards


Submission Guidelines

Submission link

Submit your (anonymized) paper through OpenReview at: https://openreview.net/group?id=aclweb.org/ACL/2025/Workshop/TRL
Please be aware that accepted papers are expected to be presented at the workshop in person.

Formatting guidelines

The workshop accepts regular research papers and industrial papers of the following types:
  • Short paper: 4 pages + references and appendix.
  • Regular paper: 8 pages + references and appendix.

Submissions should be anonymized and follow the ACL style files, but can exclude the checklist. Non-anonymous preprints are not a problem, and artifacts do not have to be anonymized; simply submitting the paper without author names/affiliations is sufficient. Supplementary material, if any, may be added in the appendix. We expect authors to adopt an inclusive and diverse writing style. The "Diversity and Inclusion in Writing" guide by the DE&I in DB Conferences effort is a good resource.

Review process

Papers will receive light reviews in a double-anonymous manner. All accepted submissions will be included in the workshop proceedings in the ACL Anthology, published on the website, and made public on OpenReview.

Novelty and conflicts

We encourage submissions of recent work. The workshop does not accept submissions that have been published at ACL or other NLP venues as-is. We do, however, welcome submissions that have been published in, for example, data management or machine learning venues. We rely on OpenReview for handling conflicts, so please ensure that the conflicts in every author's OpenReview profile are complete, in particular with respect to the organization and program committees.

Camera-ready instructions

Camera-ready papers are expected to include the author names and affiliations on the first page and to state "Table Representation Learning Workshop at ACL 2025" in the footer. (The conference proceedings will add the appropriate footer.) The camera-ready version may exceed the page limit by up to 1 additional page for acknowledgements or small content changes, but revision is not required (for short papers: please be aware of the novelty requirements of archival venues, e.g. SIGMOD, CVPR). The camera-ready version should be submitted through OpenReview (submission -> edit -> revision) and will be published on OpenReview, in the ACL Anthology, and on this website. Please make sure that all metadata is correct as well, as it will be imported to the ACL website.

Presentation instructions

All accepted papers will be presented as a poster during one of the poster sessions (the schedule per poster session will be released soon). For poster formatting, please refer to the poster instructions on the ACL site (template, upload, etc.). You can print and bring the poster yourself or print it locally through the facilities offered by ACL.
Optional: authors of poster submissions are also invited to send a teaser video of approx. 3 minutes (.mp4) to madelon@berkeley.edu, which will be hosted on the website and YouTube channel of the workshop.
Authors of papers selected for oral presentation are also asked to prepare a talk of 9 minutes (+1 min Q&A) and to upload their slides through the "slides" field in OpenReview (pdf) or share a link to Google Slides with madelon@cwi.nl. The schedule for the oral talks will be published soon. The recordings of the oral talks will be published afterwards.

Accepted Papers

2025

The full proceedings can be found at: https://aclanthology.org/2025.trl-1.0.pdf
2024 (unfold)

Oral

Poster

2023 (unfold)

Oral

Poster

2022 (unfold)

Oral

Poster