Yannis Assael1,*, Thea Sommerschield2,*, Alison Cooley3, Brendan Shillingford1, John Pavlopoulos4, Priyanka Suresh1, Bailey Herms5, Justin Grayston5, Benjamin Maynard5, Nicholas Dietrich1, Robbe Wulgaert6, Jonathan Prag7, Alex Mullen2, Shakir Mohamed1
1 Google DeepMind
2 University of Nottingham, UK
3 University of Warwick, UK
4 Athens University of Economics and Business, Greece
5 Google
6 Sint-Lievenscollege, Belgium
7 University of Oxford, UK
*Authors contributed equally to this work.
When using any of the source code or outputs of this project, please cite:
@article{asssome2025contextualising,
  title={Contextualising ancient texts with generative neural networks},
  author={Assael*, Yannis and Sommerschield*, Thea and Cooley, Alison and Pavlopoulos, John and Shillingford, Brendan and Herms, Bailey and Suresh, Priyanka and Maynard, Benjamin and Grayston, Justin and Wulgaert, Robbe and Prag, Jonathan and Mullen, Alex and Mohamed, Shakir},
  journal={Nature},
  volume={643},
  number={8073},
  year={2025},
  publisher={Nature Publishing}
}
Human history is born in writing. Inscriptions, among the earliest written forms, offer direct insights into the thought, language, and history of ancient civilisations. Historians capture these insights by identifying parallels, that is, inscriptions with shared phrasing, function, or cultural setting. Parallels enable texts to be contextualised within broader historical frameworks and support key tasks such as restoration and geographical or chronological attribution. However, current digital methods are restricted to literal matches and narrow historical scopes. We introduce Aeneas, the first generative neural network for contextualising ancient texts. Aeneas retrieves textual and contextual parallels, leverages visual inputs, handles arbitrary-length text restoration, and advances the state of the art in key tasks.
Fragment of a bronze military diploma from Sardinia, issued by the Emperor Trajan to a sailor on a warship, 113/14 CE (CIL XVI, 60, The Metropolitan Museum of Art, Public Domain).
To evaluate its impact, we conduct the largest historian-AI study to date: historians judged Aeneas’ retrieved parallels useful research starting points in 90% of cases, and the parallels improved their confidence in key tasks by 44%. On restoration and geographical attribution, historians paired with Aeneas outperformed both historians and the model working alone. For dating, Aeneas achieved a 13-year distance from ground-truth ranges. We demonstrate Aeneas’ contribution to historical workflows through an analysis of key traits in the Res Gestae Divi Augusti, the most renowned Roman inscription, showing how integrating science and the humanities can create transformative tools to assist historians and advance our understanding of the past.
Given the image and textual transcription of an inscription (with damaged sections of unknown length marked with the "#" character), Aeneas uses a transformer-based decoder, the "torso", to process the text. Specialised networks, called "heads", handle character restoration, date attribution, and geographical attribution (the latter also incorporating visual features). The torso's intermediate representations are merged into a unified, historically enriched embedding used to retrieve similar inscriptions from the LED, ranked by relevance.
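To make this design concrete, here is a minimal, purely illustrative sketch of the torso-and-heads pattern: a shared encoder maps an inscription to a single embedding, and separate task heads read predictions off that embedding. The toy mean-pooling "torso", the layer sizes, and all names below are assumptions made for illustration; this is not the predictingthepast implementation.

# A toy torso-and-heads model, for illustration only. The real Aeneas torso is
# a transformer-based decoder; here a character-embedding mean-pool stands in
# so the sketch stays self-contained. All dimensions are invented.
import numpy as np

rng = np.random.default_rng(0)

ALPHABET = "abcdefghijklmnopqrstuvwxyz #"   # '#' marks unknown-length damage
CHAR_TO_ID = {c: i for i, c in enumerate(ALPHABET)}
EMB_DIM, NUM_REGIONS = 64, 62               # invented sizes

char_table = rng.normal(size=(len(ALPHABET), EMB_DIM))

def torso(text):
    # Map an inscription to one shared embedding (toy stand-in).
    ids = [CHAR_TO_ID.get(c, CHAR_TO_ID[" "]) for c in text.lower()]
    return char_table[ids].mean(axis=0)

# Each "head" is a linear readout from the shared torso embedding.
date_head = rng.normal(size=(EMB_DIM,))                 # chronological attribution
region_head = rng.normal(size=(EMB_DIM, NUM_REGIONS))   # geographical attribution

def contextualise(text):
    h = torso(text)                            # shared embedding
    year = float(h @ date_head)                # date estimate (toy)
    region = int((h @ region_head).argmax())   # most likely region (toy)
    return year, region, h                     # h can also serve retrieval

year, region, emb = contextualise("imp caesari divi f # pontifici maximo")
print(f"year={year:.1f} region={region} embedding_dim={emb.shape[0]}")

In the full model, the geographical head additionally consumes visual features from the image, and the shared embedding is compared against embeddings of the LED corpus to rank retrieved parallels.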
To aid further research in the field, we created an interactive online Python notebook where researchers can query one of our trained models to generate text restorations, visualise attention weights, and more.
Advanced users who want to perform inference with the trained model can use the predictingthepast library directly.
First, to install the predictingthepast library and its dependencies, run:
pip install .
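A quick import check can confirm the installation succeeded; this assumes the installed module shares the library's name, predictingthepast:

# Post-install check; adjust the import if the package exposes a different
# module name than assumed here.
import predictingthepast
print("imported from:", predictingthepast.__file__)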
Then, download the model files:
curl --output aeneas_117149994_2.pkl \
https://storage.googleapis.com/ithaca-resources/models/aeneas_117149994_2.pkl
curl --output led.json \
https://storage.googleapis.com/ithaca-resources/models/led.json
curl --output led_emb_xid117149994.pkl \
https://storage.googleapis.com/ithaca-resources/models/led_emb_xid117149994.pkl
curl --output ithaca_153143996_2.pkl \
https://storage.googleapis.com/ithaca-resources/models/ithaca_153143996_2.pkl
curl --output iphi.json \
https://storage.googleapis.com/ithaca-resources/models/iphi.json
curl --output iphi_emb_xid153143996.pkl \
https://storage.googleapis.com/ithaca-resources/models/iphi_emb_xid153143996.pkl
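Before running inference, it can help to confirm that all six files arrived intact; the small check below uses only the file names from the curl commands above:

# Sanity-check the downloaded files: print each file's size, or MISSING.
from pathlib import Path

FILES = [
    "aeneas_117149994_2.pkl", "led.json", "led_emb_xid117149994.pkl",
    "ithaca_153143996_2.pkl", "iphi.json", "iphi_emb_xid153143996.pkl",
]
for name in FILES:
    path = Path(name)
    print(f"{name}: {path.stat().st_size} bytes" if path.exists() else f"{name}: MISSING")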
An example of using the library can be run via:
python inference_example.py \
--input_file="example_input.txt" \
--checkpoint_path="aeneas_117149994_2.pkl" \
--dataset_path="led.json" \
--retrieval_path="led_emb_xid117149994.pkl" \
--language="latin"
This will run restoration and attribution on the text in example_input.txt.
To run it with different input text, use the --input argument:
python inference_example.py \
--input="..." \
--checkpoint_path="aeneas_117149994_2.pkl" \
--dataset_path="led.json" \
--retrieval_path="led_emb_xid117149994.pkl" \
--language="latin"
Or use text in a UTF-8 encoded text file:
python inference_example.py \
--input_file="some_other_input_file.txt" \
--checkpoint_path="aeneas_117149994_2.pkl" \
--dataset_path="led.json" \
--retrieval_path="led_emb_xid117149994.pkl" \
--language="latin"
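Such a file can also be created programmatically; the Latin text below is invented purely to illustrate the "#" convention for damaged sections of unknown length, and the repository's own example_input.txt remains the authoritative format reference:

# Write a toy UTF-8 input file for inference_example.py. The inscription text
# is invented for illustration; '#' marks unknown-length damaged sections.
text = "dis manibus # vixit annos # hic situs est"
with open("some_other_input_file.txt", "w", encoding="utf-8") as f:
    f.write(text)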
The restoration and attribution outputs can be saved to JSON files:
python inference_example.py \
--input_file="example_input.txt" \
--checkpoint_path="aeneas_117149994_2.pkl" \
--dataset_path="led.json" \
--retrieval_path="led_emb_xid117149994.pkl" \
--language="latin" \
--attribute_json="attribute.json" \
--restore_json="restore.json"
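The saved files can then be loaded back like any other JSON. The snippet below prints only their top-level structure, since the exact field names inside are not assumed here:

# Load the saved outputs and show their top-level structure. Only the file
# names come from the command above; no specific JSON fields are assumed.
import json

for name in ("restore.json", "attribute.json"):
    with open(name, encoding="utf-8") as f:
        data = json.load(f)
    summary = sorted(data) if isinstance(data, dict) else type(data).__name__
    print(name, "->", summary)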
For full help, run:
python inference_example.py --help
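For batch work, the same command-line flags can be driven in a loop; the input file stems below are hypothetical, while the flags are those documented above:

# Batch driver for inference_example.py: one subprocess call per input file,
# saving per-file JSON outputs. File stems are hypothetical examples.
import subprocess

for stem in ("inscription_a", "inscription_b"):
    subprocess.run(
        [
            "python", "inference_example.py",
            f"--input_file={stem}.txt",
            "--checkpoint_path=aeneas_117149994_2.pkl",
            "--dataset_path=led.json",
            "--retrieval_path=led_emb_xid117149994.pkl",
            "--language=latin",
            f"--attribute_json={stem}_attribute.json",
            f"--restore_json={stem}_restore.json",
        ],
        check=True,
    )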
For Latin, Aeneas was trained on data from:
- Epigraphic Database Roma (EDR) [1]: Made available pursuant to a Creative Commons Attribution 4.0 International License (CC-BY) on Zenodo. EDR is also available at edr-edr.it.
- Epigraphic Database Heidelberg (EDH) [2]: Made available pursuant to a Creative Commons Attribution-ShareAlike 4.0 International License (CC-BY-SA) on Zenodo. EDH is also available at edh.ub.uni-heidelberg.de.
- ETL repository for Epigraphic Database Clauss Slaby (EDCS_ETL) [3]: Made available pursuant to a Creative Commons Attribution 4.0 International License (CC-BY) on Zenodo. EDCS_ETL is also available at manfredclauss.de and github.com/sdam-au/EDCS_ETL.
For ancient Greek, Aeneas was trained on the Searchable Greek Inscriptions corpus of The Packard Humanities Institute. The processed version is available as the I.PHI dataset.
See train/README.md for instructions.
Copyright 2025 Google LLC
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
The dataset contains modified data from the Epigraphic Database Heidelberg dataset. That data is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC-BY-SA). You may obtain a copy of the CC-BY-SA license at: https://creativecommons.org/licenses/by-sa/4.0/legalcode.en
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0, CC-BY-SA or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.
Footnotes
[1] Silvio Panciera, Giuseppe Camodeca, Giovanni Cecconi, Silvia Orlandi, Lanfranco Fabriani, & Silvia Evangelisti. (2019). EDR - Epigraphic Database Roma EpiDoc files [Data set]. Zenodo.
[2] James M.S. Cowey, Francisca Feraudi-Gruénais, Brigitte Gräf, Frank Grieshaber, Regine Klar, & Jonas Osnabrügge. (2019). Epigraphic Database Heidelberg EpiDoc files [Data set]. Zenodo.
[3] Heřmánková, P. (2022). EDCS (2.0) [Data set]. Zenodo.