
Simulating large systems with Regression Language Models

July 29, 2025

Yash Akhauri, Student Researcher, Google Research, and Xingyou (Richard) Song, Research Scientist, Google DeepMind

We propose text-to-text regression with language models as a general approach to numeric prediction problems.

Large language models (LLMs) often improve by learning from human preferences and ratings, a process where a reward model is trained to take prompts and responses as input in order to guide further model training. This focus on subjective human feedback has dramatically improved their ability to generate helpful, harmless, and coherent text and has been transformative for conversational assistants (e.g., Gemini).

Another pathway extends the reward model beyond human subjectivity: processing raw, diverse operational data and treating the observed numerical outcome as a reward signal. This ability can open doors to predicting the performance of vast software infrastructures, the efficiency of industrial processes, or the results of scientific experiments. Fundamentally, we want LLMs to perform regression (i.e., predict a metric y given an input x). Traditionally, regression methods have relied on tabular inputs, i.e., fixed-length numeric vectors that can be aggregated into a single table. However, converting complex, unstructured data into a tabular format can be very laborious, and the sheer diversity and dynamic nature of real-world data (e.g., intricate configuration files, system logs, and ever-evolving hardware or workload patterns) make this task even more challenging. When new data types emerge, the process often has to be restarted from scratch.

In “Performance Prediction for Large Systems via Text-to-Text Regression”, we describe a simple, general, and scalable approach, based on our earlier work on universal regression, OmniPred. This approach enables a Regression Language Model (RLM) to read a string representation of the input and output the numeric outcome as a structured text string. For example, we can represent the state (x) of an industrial system — including all its configurations, parameters, and contextual information — as a structured text string, and the RLM then writes the performance metric (y) as a string. The RLM can be pre-trained or even randomly initialized, and when handling a new regression task, it can be trained using next-token prediction via cross-entropy loss, with (x) as the prompt and (y) as the target. We describe how this new paradigm has several advantages, such as avoiding feature engineering and normalization, few-shot adaptation to new tasks, and universal approximation of output probability distributions. We apply the RLM in the context of predicting resource efficiency on Borg, Google’s large-scale compute infrastructure for cluster management. We also released an open-source library that the research community can leverage for any use case.
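To make the training recipe concrete, below is a minimal sketch of a single training step, using an off-the-shelf T5 encoder-decoder from Hugging Face Transformers as a stand-in for the RLM; the example state and metric strings are illustrative.

```python
# Minimal text-to-text regression sketch: x and y are plain strings, and the
# model is trained with ordinary next-token cross-entropy. The T5 checkpoint
# and the example strings are stand-ins, not the paper's actual RLM or data.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

x = "cell: cell-a\njobs:\n- name: websearch\n  priority: 120"  # system state
y = "123.45"  # observed metric, serialized as text (no normalization needed)

inputs = tokenizer(x, return_tensors="pt", truncation=True, max_length=8192)
targets = tokenizer(y, return_tensors="pt")

# Cross-entropy loss over the target tokens, exactly as in standard seq2seq.
loss = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=targets.input_ids,
).loss
loss.backward()
```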

Predicting efficiency in Google's compute clusters

Millions of Instructions Per Second per Google Compute Unit (MIPS per GCU) is a key efficiency metric for Google's Borg system. Accurately forecasting the MIPS per GCU of candidate configurations is crucial for optimizing resource allocation and scheduling across thousands of machines. We applied the text-to-text regression method to predict the MIPS per GCU of Google’s digital twin of Borg, a sophisticated backtesting framework that replicates the state of real-world clusters. The goal is to predict the numeric result of a specialized bin-packing algorithm used to efficiently allocate tasks to resources.

Our approach uses an RLM that requires only a two-layer encoder-decoder with 60 million parameters. For training, we collect large amounts of data from multiple regression tasks as (x, y) pairs, where the system state (x) is represented using YAML or JSON and contains the lists of active jobs, execution traces, and textual metadata. Each data point (x) can take up to 1M tokens if we include every feature (i.e., all detailed information) about that data point. Since the RLM has a token limit of 8k, we pre-process the data by reordering features so that the most important ones appear at the beginning of the text string. When the string is truncated to fit the token limit, only the less important features are lost.
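As an illustration of this pre-processing step, the sketch below serializes a state dictionary with the highest-priority features first and then truncates at the token level; the feature names and priority list here are hypothetical.

```python
# Hypothetical pre-processing: emit the most important features first, so
# that truncating to the 8k-token budget only drops the least important tail.
import yaml

FEATURE_PRIORITY = ["cell", "machines", "active_jobs", "traces", "metadata"]

def serialize_state(state: dict, tokenizer, max_tokens: int = 8192) -> str:
    # Prioritized features first, then any remaining ones in original order.
    ordered = {k: state[k] for k in FEATURE_PRIORITY if k in state}
    ordered.update({k: v for k, v in state.items() if k not in ordered})
    text = yaml.safe_dump(ordered, sort_keys=False)
    # Truncate in token space so the string fits the RLM's context window.
    token_ids = tokenizer.encode(text)[:max_tokens]
    return tokenizer.decode(token_ids)
```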

We pre-train the RLM on the pre-processed data so that the model can more easily adapt, via few-shot gradient updates, to new types of input data from new tasks. Since numbers are represented as text, they can be used as-is without normalization. Sampling decoded outputs multiple times also effectively captures the density of the y-values, which is important for modeling stochastic or noisy situations.
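Since y is decoded as text, drawing many samples and parsing them back into floats yields an empirical approximation of the predictive distribution. A sketch, continuing from the T5 stand-in above:

```python
# Sample the decoder repeatedly and parse each output string into a float;
# the spread of `samples` approximates the predictive density p(y | x).
outputs = model.generate(
    input_ids=inputs.input_ids,
    do_sample=True,            # temperature sampling instead of greedy decoding
    num_return_sequences=64,
    max_new_tokens=16,
)
samples = []
for seq in outputs:
    text = tokenizer.decode(seq, skip_special_tokens=True)
    try:
        samples.append(float(text))
    except ValueError:
        pass  # skip the occasional malformed numeric string
```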


Our method uses RLMs to directly regress numerical performance metrics (y) from complex, textually represented system states (x), such as those from Google's compute clusters across diverse workloads (Gmail, YouTube, Maps, etc.) and hardware (CPUs and TPUs).

Below, we demonstrate three resulting capabilities of RLMs that serve as important components for universal regression.

Density capture

By sampling the RLM’s output multiple times, we can capture probability distributions (i.e., densities) of y-values remarkably well even across different time durations. This density estimation is useful because it goes beyond simple point predictions. By modeling the full distribution of possible outcomes, we gain insight into the inherent variability and potential range of MIPS per GCU values. This capability allows us to capture both aleatoric uncertainty (inherent randomness in the system, like stochastic load demand) and potentially identify epistemic indicators (uncertainty due to limited observation or features), giving us a more complete understanding of system behavior.


The RLM provides density estimates that align remarkably well with the target instructions per second distribution across time durations, as shown by the regressor density curves (3D) and the target kernel density estimate (KDE) plot (XY plane).
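One simple way to construct such density curves, assuming the float samples from the earlier sketch, is to fit a kernel density estimate over them:

```python
# Fit a KDE over the sampled y-values to recover the predicted density,
# which can then be compared against the empirical target distribution.
import numpy as np
from scipy.stats import gaussian_kde

density = gaussian_kde(np.asarray(samples))
grid = np.linspace(min(samples), max(samples), 200)
pdf = density(grid)  # predicted density of MIPS per GCU over the grid
```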

Uncertainty quantification

The RLM’s prediction uncertainty is correlated with residual squared error, allowing us to quantify the model’s confidence in its predictions. When the model is uncertain, the predicted distribution is broader, signaling that its predictions should be treated with more caution. This tells us when to rely more heavily on the regressor, and when to fall back to slower but more accurate bin-packing simulations when managing the compute clusters.


Left: Prediction uncertainty is correlated with regressor error. Right: A KDE plot of RLM predictions effectively captures the target points.
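One possible decision rule, shown here as a hedged sketch rather than the production policy: treat the standard deviation of the sampled predictions as the confidence signal and fall back to the simulator when it is too large. The threshold value and the `run_simulation` helper are hypothetical.

```python
# Hypothetical fallback policy: trust the RLM when its samples agree,
# otherwise run the slower but more accurate bin-packing simulation.
import numpy as np

UNCERTAINTY_THRESHOLD = 0.1  # would be tuned on held-out data in practice

mean, std = float(np.mean(samples)), float(np.std(samples))
if std > UNCERTAINTY_THRESHOLD:
    prediction = run_simulation(x)  # hypothetical call into the digital twin
else:
    prediction = mean  # fast RLM point estimate
```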

Near-perfect, low-cost regression

Beyond density and uncertainty quantification, our RLM is a low-resource, efficient model that achieves very precise pointwise regression on a diverse set of tasks. The scatter plots below show near-perfect Spearman rank correlation, demonstrating strong alignment between predicted and actual MIPS per GCU rankings. The model can few-shot adapt to diverse prediction tasks on distinct servers, serving as an adaptable, universal predictor for Borg.


Scatter plot of RLM predictions (x-axis) against true target y-values (y-axis) over multiple regression tasks. The legend displays the Spearman rank correlation (ρ).
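For reference, the rank correlation shown in the legend can be computed directly from per-task predictions and targets; `preds` and `targets` below are assumed to be arrays of floats.

```python
# Spearman rank correlation between predicted and true MIPS per GCU values.
from scipy.stats import spearmanr

rho, _ = spearmanr(preds, targets)  # preds/targets: per-task float arrays
print(f"Spearman rho = {rho:.3f}")  # near 1.0 means near-perfect ranking
```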

Resources and future directions

We demonstrate that our relatively simple encoder-decoder RLM effectively trains on rich, non-tabular inputs to deliver highly accurate predictions and adapt to new tasks efficiently. This robust and scalable approach predicts metric outcomes directly from raw text, significantly reducing reliance on manual feature engineering and paving the way for both universal system simulators and sophisticated reward mechanisms. By modeling diverse numerical feedback, RLMs operationalize “experience” in a manner that will enable future breakthroughs in reinforcement learning for language models. See the paper and open-source code for more information.

Acknowledgements

This research was conducted by core members Yash Akhauri (Cornell University and Google Research), Bryan Lewandowski (Google Platforms), and Xingyou Song (Google DeepMind), with contributions from Cheng-Hsi Lin, Adrian Reyes, Grant C. Forbes, Arissa Wongpanich, Bangding Yang, Mohamed S. Abdelfattah, and Sagi Perel.

We would like to thank previous collaborators throughout this broad research arc: Oscar Li, Chansoo Lee, Daiyi Peng, Yutian Chen, Tung Nguyen, Qiuyi Zhang, Jorg Bornschein, Yingjie Miao, Eric Tang, Dara Bahri, and Mangpo Phothilimthana. We further thank Michal Lukasik, Uri Alon, Amir Yazdanbakhsh, Shao-Hua Sun, Kuang-Huei Lee, Zi Wang, Xinyun Chen, Jiyoun Ha, Aviral Kumar, Jonathan Lai, Ke Xue, Rong-Xi Tan, and David Smalling for useful discussions. We also thank Olena Bogdanov for designing the animation for this post. Lastly, we thank Yili Zheng, Safeen Huda, Asaf Aharoni, Srinadh Bhojanapalli, David Lo, Martin Dixon, Daniel Golovin, Denny Zhou, Claire Cui, Ed Chi, and Benoit Schillings for continued support.