Sobhan Asasi (s.asasi@surrey.ac.uk), Mohamed Ilyas Lakhal (m.lakhal@surrey.ac.uk), Özge Mercanoğlu Sincan (o.mercanoglusincan@surrey.ac.uk), Richard Bowden (r.bowden@surrey.ac.uk)

CVSSP, University of Surrey, Guildford, UK

Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

Abstract

Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.

1 Introduction

Sign languages use hand and body movement alongside facial expressions, mouth gestures and the complex use of space for communication [Rastgoo et al., 2022; Stokoe, 2005]. Sign language translation (SLT) aims to convert these visual elements into spoken sentences, enabling communication with non-signers. However, due to the inherent complexity of natural signing and its fundamental differences from spoken languages, SLT faces challenges in accurately recognizing the components of signs and in bridging the gap between visual and textual representations [Duarte, 2019; Zhao et al., 2022]. SLT approaches typically fall into two categories: gloss-based and gloss-free. Gloss-based methods [Chen et al., 2022a; Zhang et al., 2023a; Chen et al., 2022b] rely on intermediate representations: sign videos are first translated into a sequence of glosses (written representations of signs using corresponding words in a spoken language) before spoken sentences are generated. In contrast, gloss-free methods map videos directly to text without an explicit gloss representation [Camgoz et al., 2020; Voskou et al., 2021; Chen et al., 2024b]. While gloss-based approaches benefit from structured linguistic information, gloss-free methods bypass the need for gloss annotations, making them more flexible but often more challenging. The high cost and effort required for data collection and meticulous gloss annotation have led to increased interest in gloss-free SLT [Wong et al.; Gong et al., 2024; Liang et al., 2024].

Figure 1: Generated hand motion description from our Sign Video Descriptor.

The increasing use of Large Language Models (LLMs) has had a significant impact on this area [Achiam et al., 2024]. As a result, the latest gloss-free models exploit the language prediction capabilities of LLMs by converting sign language videos into text-like representations that LLMs can process effectively [Chen et al., 2024b; Wong et al.; Zhou et al., 2023]. However, when LLMs are applied directly to visual features, they act as little more than transformers with pre-trained weights, because of the large shift between the textual representations on which they were trained and the new visual inputs. To bridge this gap, recent studies have introduced a pre-training phase before the SLT task, which aims to align the visual and textual representations [Wong et al.; Zhou et al., 2023]. Given the central role of hand shape and motion in conveying meaning in sign language, our framework explores the benefit of emphasizing hand-centric features. We achieve this by applying feature distillation from Hand Mesh Recovery (HaMeR) [Pavlakos et al., 2024], which guides the model to focus more on hand poses and orientations in individual video frames. We then align the video features with a textual description of the entire video, generated using VideoLLMs, which we employ in the context of SLT. These descriptions provide high-level semantic guidance that is more directly aligned with the visual input than conventional spoken language targets. Finally, to enhance performance in the downstream SLT task, we adopt an encoder-decoder architecture based on language models. To further improve the alignment between the encoder's output embeddings and the spoken language targets, we incorporate a contrastive learning strategy inspired by the CLIP framework [Radford et al., 2021], projecting both video and text features into a shared embedding space.

Our primary contributions can be summarized as: (i) Our proposed framework, BeyondGloss, achieves state-of-the-art results in gloss-free sign language translation on two widely used benchmark datasets. (ii) We introduce a novel method for generating hand-centric textual descriptions of sign language videos using VideoLLMs, offering a new form of semantic supervision aligned with visual content. (iii) We are the first to leverage HaMeR’s transformer-based hand representations in the SLT task. Through feature distillation, we guide the visual encoder to focus on hand pose and orientation, demonstrating the positive impact of this strategy on translation quality.

The rest of this paper is organised as follows: In Section 2, we review the literature in sign language translation. In Section 3, we outline the proposed BeyondGloss approach. Section 4 presents quantitative and qualitative evaluation, whilst Section 5 concludes.

2 Related Works

Gloss-free Sign Language Translation. Gloss-based SLT methods achieve strong performance by using annotated glosses as intermediate supervision [Zhang et al., 2023a; Chen et al., 2022b]. However, glosses are labor-intensive to create and not available for all languages. Gloss-free SLT avoids this by directly translating sign videos into spoken sentences using only video-text pairs. While more scalable, it faces a larger modality gap and typically performs worse than gloss-based methods. To address this gap, CSGCR [Zhao et al., 2022] predicts possible language words, generates sentences based on these predictions, and selects the most appropriate sentence using cosine similarity with sign language images. Recently, GFSLT-VLP [Zhou et al., 2023] introduced a pre-training method that aligns sign language images with spoken sentences, converting them into text-like features familiar to LLMs. Building on this advance, subsequent gloss-free models focus on generating features that are more comprehensible to LLMs. FLa-LLM [Chen et al., 2024b] introduces a two-step approach: first training on visual information from sign language images using a lightweight translation model, then fine-tuning the LLM for SLT. Sign2GPT [Wong et al.] pre-trains a sign encoder by aligning visual features with prototypes under the supervision of pseudo-glosses, i.e. words extracted from spoken sentences via part-of-speech (POS) tagging, and then uses it for SLT. SignLLM [Gong et al., 2024] designs a vector-quantization technique to convert sign videos into discrete sign tokens. Lastly, LLaVA-SLT [Liang et al., 2024] integrates linguistic pretraining, visual contrastive learning, and visual-language tuning within a Large Multimodal Model to bridge the gap between sign language videos and textual translation.
Video Large Language Models. Early VideoLLMs like Video-LLaMA [Zhang et al., 2023b] and Video-ChatGPT [Maaz et al., 2024] extended large language models to the video domain by incorporating temporal and multimodal context. Initially limited to short clips and basic temporal understanding, recent models have advanced by combining large-scale video-language data with autoregressive decoding [Vaswani et al., 2017; Graves, 2013]. These systems integrate spatiotemporal encoders with LLMs (e.g. LLaMA, GPT) to handle longer sequences and support open-ended video understanding, including tasks like action recognition and captioning [Touvron et al., 2023; Brown et al., 2020]. Although sign language falls under video understanding, it poses unique challenges due to its reliance on fine-grained, frame-by-frame details, especially in hand shapes, movements, and facial expressions. Current VideoLLMs [Zhang et al., 2023b; Maaz et al., 2024; Li et al., 2023; Lin et al., 2024] often lose this crucial information due to frame sampling or summarization, and processing all frames at full resolution is computationally costly [Weng et al., 2024]. To address this, we propose an approach that leverages VideoLLMs for generating detailed hand-centric descriptions suited for sign language understanding.
Vision-Language Pre-training. Existing visual-language pre-training (VLP) models fall into two types: single-stream models, which fuse visual and textual inputs in one encoder using self-attention [Kim et al., 2021; Li et al., 2019; Kamath et al., 2021], and dual-stream models, which use separate encoders for each modality and align them via contrastive learning or cross-attention [Radford et al., 2021; Jia et al., 2021; Alayrac et al., 2022]. These pre-trained models have demonstrated remarkable performance in downstream tasks [Radford et al., 2021; Jia et al., 2021]. GFSLT-VLP [Zhou et al., 2023] is the first study to use CLIP [Radford et al., 2021], a pioneering approach in VLP, to tackle the modality gap in gloss-free SLT. Like CLIP, GFSLT-VLP adopts a dual-stream structure, pre-training the alignment between sign videos and spoken sentences before performing the translation.
HaMeR. The typical input to SLT is either RGB video [Camgoz et al., 2018; Liang et al., 2024] or a skeleton-based representation [Gan et al., 2021; Jiang et al., 2021]. However, recent approaches such as HaMeR [Pavlakos et al., 2024] provide potential alternatives on which SLT can be built. HaMeR (Hand Mesh Recovery) is a fully transformer-based framework for 3D hand reconstruction from monocular images, offering enhanced robustness and accuracy by scaling both model capacity and training data [Pavlakos et al., 2024]. It uses a large Vision Transformer and integrates multiple datasets with 2D and 3D hand annotations, achieving state-of-the-art results on standard 3D hand pose benchmarks. Given the central role of hand shapes and movements in sign language communication, HaMeR has recently been adopted for sign language understanding tasks. Notably, Low et al. [2025] were the first to apply HaMeR to sign language segmentation. Inspired by this, our approach similarly leverages HaMeR to extract detailed hand features from video frames, aiding more accurate sign interpretation.

3 Methodology

This section presents our proposed BeyondGloss framework. We first describe how high-quality hand features are extracted via distillation in Sec. 3.1, followed by a detailed explanation of our novel Sign Video Descriptor for generating hand motion descriptions in Sec. 3.2. Finally, Sec. 3.3 outlines the overall architecture for SLT.

3.1 Distillation

Distillation [Heo et al., 2019; Zagoruyko and Komodakis, 2017; Romero et al., 2014] is a learning technique where a smaller model (student) is trained to replicate the behavior of a larger, well-trained model (teacher). In feature distillation [Romero et al., 2014; Hinton et al., 2015], instead of only imitating final predictions, the student learns to match the intermediate feature representations of the teacher. This helps the student model learn richer internal patterns, improving performance and generalization with fewer parameters. In our framework, we apply feature distillation to emphasize hand modeling in sign language videos. Specifically, we align the visual encoder's features with those from the HaMeR transformer head to enhance sensitivity to hand shapes and orientations. To ensure that broader spatial information is not lost, we combine the hand-focused features with the original ones, maintaining a balance between detailed hand information and overall scene context.

3.2 Sign Language Video Description

To encourage our framework to focus on hands and their movements, key components of sign language, we aim to align sign video representations with hand-centric descriptions extracted from the videos. However, due to the complexity of hand motions and the inability of off-the-shelf VideoLLMs to capture fine-grained temporal details (as well as memory constraints) [Weng et al., 2024], we cannot directly use VideoLLMs by simply feeding them full sign videos. To address these challenges, we propose a novel sign video descriptor that leverages the strengths of both VideoLLMs and LLMs, as illustrated in Fig. 2. An example of a generated description is provided in Fig. 1. As shown in Fig. 1, we begin by segmenting each video into non-overlapping clips of 16 frames. This length is chosen as a balance between temporal resolution and semantic content: segments shorter than 16 frames may lack sufficient motion or variation to be meaningfully described, while longer segments often lead VideoLLMs to produce less accurate or overly generalized descriptions. For each segment, we prompt the VideoLLM to describe hand movements, shapes, and trajectories. After obtaining a separate description for each segment, we merge them into a single sequence and pass it to an LLM that refines the content by removing redundant or irrelevant information. This final description preserves the temporal order and captures the most important aspects of hand motion across the entire video. See Appendix for more details and examples.
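The control flow of the descriptor can be sketched in a few lines of Python. This is only an illustration of the segment, describe, merge and refine steps: the callables `describe_clip` and `refine` are hypothetical stand-ins for the VideoLLM and LLM calls, the "Segment k:" merge format is invented for the example, and keeping a shorter final clip is an assumption, since the text does not say how a remainder is handled.

```python
from typing import Callable, List, Sequence

CLIP_LEN = 16  # frames per segment, as stated in the text


def split_into_clips(frames: Sequence, clip_len: int = CLIP_LEN) -> List[Sequence]:
    """Split a frame sequence into consecutive, non-overlapping clips.

    The last clip may be shorter than clip_len (assumption of this sketch).
    """
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def describe_video(
    frames: Sequence,
    describe_clip: Callable[[Sequence], str],  # hypothetical VideoLLM wrapper
    refine: Callable[[str], str],              # hypothetical LLM wrapper
) -> str:
    """Describe each clip, merge the texts in temporal order, then refine."""
    clip_texts = [describe_clip(clip) for clip in split_into_clips(frames)]
    merged = "\n".join(
        f"Segment {idx + 1}: {text}" for idx, text in enumerate(clip_texts)
    )
    return refine(merged)


# Stand-ins for the real models, used only to exercise the control flow.
demo_frames = list(range(40))
demo_description = describe_video(
    demo_frames,
    describe_clip=lambda clip: f"hands move over frames {clip[0]}-{clip[-1]}",
    refine=lambda text: text,  # identity "refinement"
)
print(demo_description)
```

Because the merge step walks the clips in order, the temporal ordering of the per-segment descriptions is preserved before the refinement model sees them.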

3.3 Architecture

Figure 2: Overview of our approach. (a) Our main framework is divided into two main phases: pre-training (dashed line) and fine-tuning (solid line). (b) Overview of the proposed sign video descriptor in our framework.

Visual Encoder. Our video-based sign language translation framework is built around an image encoder that extracts spatial features $F$ from input video frames $V=\{v_{0},v_{1},\dots,v_{T^{*}}\}$, where $T^{*}$ represents the total number of frames. These features, with a dimensionality of $D^{*}$, are structured as $F_{\text{video}}\in\mathbb{R}^{T^{*}\times D^{*}}$. We use and fine-tune DINOv2 [Oquab et al., 2024] (ViT-S/14 variant), a self-supervised Vision Transformer, as our image encoder. The class token's output serves as a feature vector, which is transformed linearly and batch-normalized to obtain $F_{\text{video}}$.
Feature Distillation. In parallel to the visual encoder, sign videos are input to the HaMeR model to extract hand shape features. Specifically, we use the transformer head of HaMeR, which was originally used to regress hand and camera parameters. In our work, we repurpose the transformer head to extract hand pose and global orientation parameters, as they proved to be highly effective hand shape descriptors for our task. For each frame, the hand pose features are represented as a tensor $H\in\mathbb{R}^{2\times 15\times 3\times 3}$, while the global orientation is parameterized as $G\in\mathbb{R}^{2\times 3\times 3}$. For simplicity and computational efficiency, we flatten these parameters, resulting in a 1D feature vector $H^{\prime}\in\mathbb{R}^{270}$ for the hand pose and $G^{\prime}\in\mathbb{R}^{18}$ for the global orientation. The flattened features are then fused to form $F_{\text{HaMeR}}\in\mathbb{R}^{T^{*}\times 288}$ for each input sign video, where $T^{*}$ represents the total number of frames.
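The per-frame dimensions follow from simple counting: $2\times 15\times 3\times 3 = 270$, $2\times 3\times 3 = 18$, and $270 + 18 = 288$. The short Python sketch below checks this arithmetic on placeholder tensors. Fusing by concatenation is an assumption (the text only says the flattened features are "fused"), and the helper names `flatten`, `zeros` and `hamer_frame_feature` are hypothetical.

```python
from itertools import chain
from math import prod

HAND_POSE_SHAPE = (2, 15, 3, 3)   # shape of H per frame, from the text
GLOBAL_ORIENT_SHAPE = (2, 3, 3)   # shape of G per frame, from the text


def flatten(nested):
    """Recursively flatten nested lists into a flat list of scalars."""
    if not isinstance(nested, list):
        return [nested]
    return list(chain.from_iterable(flatten(item) for item in nested))


def zeros(shape):
    """Build a nested list of zeros with the given shape (placeholder data)."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]


def hamer_frame_feature(hand_pose, global_orient):
    """Flatten H and G and concatenate them into one per-frame vector.

    Concatenation is an assumed fusion operator, not confirmed by the text.
    """
    return flatten(hand_pose) + flatten(global_orient)


feature = hamer_frame_feature(zeros(HAND_POSE_SHAPE), zeros(GLOBAL_ORIENT_SHAPE))
print(prod(HAND_POSE_SHAPE), prod(GLOBAL_ORIENT_SHAPE), len(feature))  # 270 18 288
```

Stacking one such 288-dimensional vector per frame gives the $T^{*}\times 288$ matrix used as the distillation target.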
To guide the visual encoder to extract hand features, we map $F_{\text{video}}$ to the space of $F_{\text{HaMeR}}$ and then combine the mapped features with the original features, so as not to lose other important information captured by the visual encoder:

$$F^{*}_{\text{video}}=\texttt{mapper}_{1}(F_{\text{video}})\in\mathbb{R}^{T^{*}\times 288},\quad F^{\prime}_{\text{video}}=\texttt{mapper}_{2}(F^{*}_{\text{video}})\in\mathbb{R}^{T^{*}\times D}. \tag{1}$$

$$\hat{F}_{\text{video}}=\texttt{concat}(F^{\prime}_{\text{video}},F_{\text{video}})\in\mathbb{R}^{T^{*}\times 2D}. \tag{2}$$

For distilling hand features into the visual encoder, a simple $L_{2}$ loss suffices. The distillation loss between $F^{*}_{\text{video}}$ and $F_{\text{HaMeR}}$ is given by:

$$\mathcal{L}_{\text{distill}}=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\left\|F^{*}_{i,j}-F_{i,j}\right\|_{2}, \tag{3}$$

where $F^{*}_{i,j}$ and $F_{i,j}$ denote the entries of $F^{*}_{\text{video}}$ and $F_{\text{HaMeR}}$ at frame $i$ and feature index $j$, and $H$ and $W$ are the number of frames, $T^{*}$, and the HaMeR feature length, 288, respectively. The feature $\hat{F}_{\text{video}}$ is then passed into the temporal encoder for further processing.
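As a rough illustration, the distillation objective can be written as an element-wise average over a frames-by-features matrix. The sketch below uses the mean squared difference, which is one common reading of an "$L_{2}$ loss"; whether the authors square the differences is not stated, so this choice, and the function name `distill_loss`, are assumptions.

```python
def distill_loss(student, teacher):
    """Mean squared difference between two (frames x features) matrices.

    `student` plays the role of the mapped visual features and `teacher`
    the HaMeR features; both are lists of equal-length rows. Squaring the
    differences is an assumption of this sketch.
    """
    if len(student) != len(teacher) or not student:
        raise ValueError("inputs must be non-empty with the same number of frames")
    total, count = 0.0, 0
    for s_row, t_row in zip(student, teacher):
        if len(s_row) != len(t_row):
            raise ValueError("feature lengths differ")
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            count += 1
    return total / count


# Two frames with two features each: (0 + 4 + 9 + 0) / 4 = 3.25
print(distill_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 4.0]]))  # 3.25
```

In a real training loop the teacher features would be treated as constants, so that gradients flow only into the visual encoder and the mapper.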
Temporal Encoder. We use the features $\hat{F}_{\text{video}}$ as input to our temporal encoder to learn spatio-temporal sign representations. Since sign language sequences often contain hundreds of frames, we design a spatio-temporal transformer inspired by prior SLT approaches, incorporating two key modifications for efficiency and effectiveness. We integrate local self-attention with a window size of seven, a proven technique for SLT [Yin et al., 2023]. Combined with downsampling, this extends the model's temporal receptive field while reducing redundancy, ensuring compatibility with the LLM encoder. The resulting features are represented as $Z_{\text{video}}\in\mathbb{R}^{T\times D}$, where $T$ is the downsampled sequence length and $D$ is the new feature dimension. To effectively model both temporal and spatial information, we use Rotary Position Embedding (RoPE) [Su et al., 2024] for positional encoding.
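Local self-attention is usually implemented by masking the attention matrix so that each position only sees its neighbours. The sketch below builds such a mask for a window of seven, assuming a centred window (three positions on each side); the exact windowing scheme is not specified in the text, and `local_attention_mask` is a hypothetical helper name.

```python
def local_attention_mask(seq_len: int, window: int = 7):
    """Return a seq_len x seq_len boolean mask; True means attention is allowed.

    Assumes a centred window: position i may attend to position j
    when |i - j| <= window // 2.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


mask = local_attention_mask(10)
# Edge positions see fewer neighbours than interior ones.
print(sum(mask[0]), sum(mask[5]))  # 4 7
```

With this kind of mask, the cost of attention grows linearly with sequence length for a fixed window, which is why it suits sequences of several hundred frames.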
Video-Description Alignment. To focus on hand-centric temporal dynamics and distinguish signs more effectively, we introduce a contrastive alignment between video and description representations after the temporal encoder. Rather than relying on a computationally expensive VideoLLM at inference time or discarding temporal information, we project the encoded video features into the description space and then combine them with the original video features to preserve both global and temporal context. As illustrated in Fig. 2, each sign video is processed by a video description module, which generates a textual description. This description is then passed through a text encoder to obtain token-level features $\hat{L}\in\mathbb{R}^{N\times K}$, where $N$ is the number of tokens and $K$ is the embedding dimension. Simultaneously, the video features $Z_{\text{video}}$ are mapped to the same feature space using a two-stage projection:

$$Z^{*}_{\text{video}}=\texttt{mapper}_{3}(Z_{\text{video}})\in\mathbb{R}^{T\times D^{\prime}},\quad Z^{\prime}_{\text{video}}=\texttt{mapper}_{4}(Z^{*}_{\text{video}})\in\mathbb{R}^{T\times D}. \tag{4}$$

$$\hat{Z}_{\text{video}}=\texttt{concat}(Z^{\prime}_{\text{video}},Z_{\text{video}})\in\mathbb{R}^{T\times D^{\prime\prime}}. \tag{5}$$

The resulting features $\hat{Z}_{\text{video}}$ are forwarded to subsequent network components. To align video and description features for contrastive learning, we apply a linear projection if $D^{\prime\prime}\neq K$, mapping both to a common dimension $\hat{D}$. We then use average pooling across the temporal and token dimensions to obtain global representations $\tilde{Z}\in\mathbb{R}^{\hat{D}}$ for the video and $\tilde{L}\in\mathbb{R}^{\hat{D}}$ for the description. The contrastive loss is computed as:

$$\mathcal{L}_{\text{desc}}=-\frac{1}{2\times B}\Big(\sum_{j=1}^{B}\log\frac{\exp(\texttt{sim}(\tilde{Z}_{j},\tilde{L}_{j})/\tau_{1})}{\sum_{k=1}^{B}\exp(\texttt{sim}(\tilde{Z}_{j},\tilde{L}_{k})/\tau_{1})}+\sum_{j=1}^{B}\log\frac{\exp(\texttt{sim}(\tilde{L}_{j},\tilde{Z}_{j})/\tau_{1})}{\sum_{k=1}^{B}\exp(\texttt{sim}(\tilde{L}_{j},\tilde{Z}_{k})/\tau_{1})}\Big), \tag{6}$$

where $\texttt{sim}(x,y)$ is the cosine similarity, $B$ is the batch size, and $\tau_{1}$ is a learnable temperature parameter.
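To make the loss concrete, the following pure-Python sketch average-pools token-level features, computes cosine similarities scaled by a temperature, and averages the video-to-text and text-to-video log-softmax terms over the batch. It is a minimal illustration under stated assumptions rather than the authors' implementation: the temperature is passed as a fixed number instead of being learned, all vectors are assumed non-zero, and the function names are hypothetical.

```python
import math


def mean_pool(rows):
    """Average a list of equal-length vectors over the first axis."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine(x, y):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def sum_log_softmax_diag(logits):
    """Sum over rows j of log(exp(l[j][j]) / sum_k exp(l[j][k]))."""
    total = 0.0
    for j, row in enumerate(logits):
        peak = max(row)  # subtract the row maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += row[j] - log_denominator
    return total


def symmetric_contrastive_loss(video_vecs, text_vecs, tau):
    """Symmetric InfoNCE-style loss over a batch of matched (video, text) pairs."""
    batch = len(video_vecs)
    v2t = [[cosine(v, t) / tau for t in text_vecs] for v in video_vecs]
    t2v = [[cosine(t, v) / tau for v in video_vecs] for t in text_vecs]
    return -(sum_log_softmax_diag(v2t) + sum_log_softmax_diag(t2v)) / (2 * batch)


# Matched pairs should give a lower loss than deliberately swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
matched = symmetric_contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]], tau=1.0)
swapped = symmetric_contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]], tau=1.0)
print(matched < swapped)  # True
```

Lowering `tau` sharpens the softmax, so matched pairs are rewarded more strongly relative to the other items in the batch.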
Video-Target Alignment. To improve the effectiveness of converting video features into output text, we employ an encoder-decoder-based LLM. The previously constructed feature representation Z^video\hat{Z}_{video}over^ start_ARG italic_Z end_ARG start_POSTSUBSCRIPT italic_v italic_i italic_d italic_e italic_o end_POSTSUBSCRIPT is provided as input to the encoder, producing the transformed video features YT×KY\in\mathbb{R}^{T\times K}italic_Y ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_K end_POSTSUPERSCRIPT, which are used for subsequent alignment. To address the modality gap between video and language, we adopt a contrastive learning strategy inspired by CLIP [Radford et al.(2021)Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al.]. This approach aligns the video features with their corresponding spoken sentence representations, encouraging similarity between matched pairs while distinguishing them from unrelated ones. Given a target sentence TSTSitalic_T italic_S associated with a video, we encode it into textual features MN¯×KM\in\mathbb{R}^{\overline{N}\times K}italic_M ∈ blackboard_R start_POSTSUPERSCRIPT over¯ start_ARG italic_N end_ARG × italic_K end_POSTSUPERSCRIPT using a frozen text encoder, where N¯\overline{N}over¯ start_ARG italic_N end_ARG is the number of tokens and KKitalic_K is the embedding dimension. 
Average pooling is applied to both $Y_{video}\in\mathbb{R}^{T\times K}$ and $M$ across the temporal and token dimensions, respectively, resulting in global video and sentence features: $\tilde{Y}\in\mathbb{R}^{K}$ and $\tilde{M}\in\mathbb{R}^{K}$. We apply a symmetric contrastive loss as follows:

$$\mathcal{L}_{\text{align}}=-\frac{1}{2\times B}\Big(\sum_{j=1}^{B}\log\frac{\exp(\texttt{sim}(\tilde{Y}_{j},\tilde{M}_{j})/\tau_{2})}{\sum_{k=1}^{B}\exp(\texttt{sim}(\tilde{Y}_{j},\tilde{M}_{k})/\tau_{2})}+\sum_{j=1}^{B}\log\frac{\exp(\texttt{sim}(\tilde{M}_{j},\tilde{Y}_{j})/\tau_{2})}{\sum_{k=1}^{B}\exp(\texttt{sim}(\tilde{M}_{j},\tilde{Y}_{k})/\tau_{2})}\Big). \tag{7}$$

where $\texttt{sim}(x,y)$ is the cosine similarity, and $\tau_{2}$ is a learnable temperature parameter.
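The pooling step that produces the global video and sentence features can be sketched as follows. This is a toy illustration under the assumption of unpadded sequences; `mean_pool` is a hypothetical helper, and the small matrices stand in for the encoder output (T time steps) and the frozen text encoder output (N tokens), both of width K.

```python
def mean_pool(sequence):
    """Average a list of K-dimensional vectors into a single K-dimensional vector."""
    length = len(sequence)
    dim = len(sequence[0])
    return [sum(vec[d] for vec in sequence) / length for d in range(dim)]

# Toy example: T = 3 video time steps and N = 2 text tokens, both with K = 2.
Y = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]  # video features, shape T x K
M = [[0.0, 2.0], [4.0, 2.0]]              # sentence features, shape N x K

Y_tilde = mean_pool(Y)  # global video feature, length K
M_tilde = mean_pool(M)  # global sentence feature, length K
```

Because both sequences are reduced to vectors of the same length K, videos and sentences of different lengths become directly comparable with cosine similarity.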
The final pre-training loss is:

$$\mathcal{L}_{\text{pre-train}}=\lambda_{\text{align}}\times\mathcal{L}_{\text{align}}+\lambda_{\text{desc}}\times\mathcal{L}_{\text{desc}}+\lambda_{\text{distill}}\times\mathcal{L}_{\text{distill}} \tag{8}$$

where $\lambda_{\text{align}}$, $\lambda_{\text{desc}}$ and $\lambda_{\text{distill}}$ control the balance between these three losses. See the Appendix for a comprehensive evaluation conducted to determine the optimal weighting in the pre-training loss.
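The weighted combination in Eq. (8) amounts to a single line of arithmetic. The sketch below uses a hypothetical function name, and its default weights of 1.0 are placeholders, not the values selected in the Appendix.

```python
def pretrain_loss(l_align, l_desc, l_distill,
                  w_align=1.0, w_desc=1.0, w_distill=1.0):
    """Weighted sum of the three pre-training losses (weights are placeholders)."""
    return w_align * l_align + w_desc * l_desc + w_distill * l_distill

# Example with made-up loss values and made-up weights.
total = pretrain_loss(2.0, 1.0, 0.5, w_align=1.0, w_desc=0.5, w_distill=0.1)
```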
Sign Language Translation. As illustrated in Fig. 2, during the fine-tuning stage, the feature distillation, video-description alignment, and video-target alignment components are removed. The only module not initialized from the pre-training stage is the LLM decoder, which is attached to the pre-trained LLM encoder.

Given a sign language video $V_{i}$, the decoder aims to generate the corresponding spoken sentence $\widehat{TS}_{i}=(\widehat{TS}_{i,1},\dots,\widehat{TS}_{i,T^{\prime}})$, where $T^{\prime}$ denotes the number of tokens in the target spoken sentence. The decoder operates autoregressively: it begins with the special <bos> token and predicts tokens one by one until the <eos> token is reached. The model is trained by minimizing the cross-entropy loss between the predicted tokens $\widehat{TS}_{i,j}$ and the ground truth tokens $TS_{i,j}$ as follows:

$$\mathcal{L}_{SLT_{i}}=-\sum_{j=1}^{T^{\prime}}\log p(\widehat{TS}_{i,j}\mid TS_{i,1:j-1},V_{i}). \tag{9}$$
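To make Eq. (9) concrete, the sketch below computes the loss for one sample from per-step output distributions. It assumes teacher forcing, meaning each distribution is already conditioned on the ground-truth prefix and the video; `sequence_nll`, the four-token vocabulary and the probabilities are all hypothetical.

```python
import math

def sequence_nll(step_probs, target_ids):
    """Sum of -log p(target token) over decoding steps for one sample.

    step_probs[j] is the decoder distribution over the vocabulary at step j.
    """
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, target_ids))

# Toy vocabulary of 4 tokens; the target is token ids [2, 0, 3], with 3 as <eos>.
step_probs = [
    [0.10, 0.10, 0.70, 0.10],
    [0.50, 0.20, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]
loss = sequence_nll(step_probs, [2, 0, 3])
```

The loss equals the negative log of the product 0.7 × 0.5 × 0.8, so it shrinks as the decoder assigns more probability to each ground-truth token.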

4 Experiments

4.1 Datasets and Evaluation Metrics

Datasets. We conduct experiments on the Phoenix2014T [Camgoz et al.(2018)Camgoz, Hadfield, Koller, Ney, and Bowden] (or Phoenix14T) and CSL-Daily [Zhou et al.(2021)Zhou, Zhou, Qi, Pu, and Li] datasets for SLT, evaluating performance on their development and test sets. Phoenix14T is a German sign language dataset containing 2,887 unique German words, with 7,096 training samples, 519 validation samples, and 642 test samples. CSL-Daily is a Chinese sign language dataset with a vocabulary of 2,343 Chinese words, comprising 18,401 training samples, 1,077 validation samples, and 1,176 test samples.
Evaluation Metrics. We use BLEU [Papineni et al.(2002)Papineni, Roukos, Ward, and Zhu] and ROUGE-L [Lin(2004)] to evaluate translation quality. BLEU-n measures n-gram precision, and we report BLEU-1 to BLEU-4 scores. ROUGE-L (or ROUGE) evaluates the F1 score based on the longest common subsequence between predicted and reference translations. For simplicity, we denote BLEU-1 to BLEU-4 as B-1, B-2, B-3, and B-4, and ROUGE as R.
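The two building blocks behind these metrics can be sketched in a few lines. This is a simplified illustration with hypothetical function names, not the evaluation script used in the paper: full BLEU additionally combines the n-gram precisions with a geometric mean and a brevity penalty, and scores are usually computed at corpus level. The example sentences are the reference and the GFSLT-VLP output from the qualitative example, split on whitespace.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the core quantity of BLEU-n."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """F1 score of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

ref = "am freitag scheint abseits der nebelgebiete haufig die sonne".split()
hyp = "am freitag sobald der nebel weg ist sonnenschein".split()

p1 = ngram_precision(hyp, ref, 1)  # 3 of 8 unigrams match
p2 = ngram_precision(hyp, ref, 2)  # 1 of 7 bigrams matches
rl = rouge_l_f1(hyp, ref)          # LCS is "am freitag der"
```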

4.2 Implementation details

Model. To address memory constraints, we apply LoRA [Hu et al.(2021)Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, and Chen] adapters to the top three layers of the visual encoder. The adapters are configured with a rank of 4, a dropout rate of 0.1, and a scaling factor of 4.0, and are applied to both self-attention and feed-forward weights. The temporal encoder is a four-layer transformer with a hidden dimension of 512, 8 attention heads and an intermediate size of 2048, with temporal downsampling applied after the second layer. We adopt the mBART-large-cc25 model [Liu et al.(2020)Liu, Gu, Goyal, Li, Edunov, Ghazvininejad, Lewis, and Zettlemoyer], which consists of 12 layers in both the encoder and decoder, as the backbone of our LLM-based encoder-decoder architecture. LoRA [Hu et al.(2021)Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, and Chen] is applied to the LLM encoder and decoder modules, targeting the $q_{\text{proj}}$ and $v_{\text{proj}}$ layers with a rank of 16, a scaling factor (LoRA alpha) of 32 and a dropout rate of 0.1. All mappers in Fig. 2 are trainable and consist of a linear layer followed by a GELU activation function. Training is optimized with the XFormers [Lefaudeux et al.(2022)Lefaudeux, Massa, Liskovich, Xiong, Caggiano, Naren, Xu, Hu, Tintore, Zhang, Labatut, Haziza, Wehrstedt, Reizenstein, and Sizov] library with Flash Attention [Dao(2023)] for improved efficiency. In pre-training, we use mBART-large-50 [Liu et al.(2020)Liu, Gu, Goyal, Li, Edunov, Ghazvininejad, Lewis, and Zettlemoyer] as the text encoder for video-description alignment due to its broader multilingual coverage and robustness to noisy, diverse descriptions. 
For video-target alignment, we choose mBART-large-cc25 [Liu et al.(2020)Liu, Gu, Goyal, Li, Edunov, Ghazvininejad, Lewis, and Zettlemoyer] as the text encoder as it offers more stable embeddings for clean, well-structured target sentences in high-resource languages. We use ShareGPT4Video-8B [Chen et al.(2024a)Chen, Wei, Li, Dong, Zhang, Zang, Chen, Duan, Lin, Tang, Yuan, Qiao, Lin, Zhao, and Wang] as the video LLM in the pre-training stage to generate detailed and semantically rich hand-focused descriptions from sign language video segments, where each segment consists of 16 frames. We also use the GPT-4o-mini API [OpenAI(2023)] to generate a single coherent description, maintaining the temporal order of events and removing irrelevant or unnecessary information.
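For readers unfamiliar with LoRA, the sketch below shows the low-rank update it applies to a frozen weight matrix, following the standard formulation W + (alpha / r) · B A from the LoRA paper. It is a toy pure-Python illustration with hypothetical helper names, not the training code; the 1024-dimensional projection size used in the parameter count is an assumption about mBART-large.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, rank, alpha):
    """Return W + (alpha / rank) * B @ A, the weight a LoRA-adapted layer applies."""
    scale = alpha / rank
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_param_count(d_in, d_out, rank):
    """Number of trainable parameters added by one LoRA adapter."""
    return rank * (d_in + d_out)

# Toy 2 x 2 layer with rank 1 and alpha 2 (scale = 2.0).
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight
A = [[1.0, 2.0]]              # r x d_in
B = [[0.5], [0.0]]            # d_out x r
W_adapted = lora_effective_weight(W, A, B, rank=1, alpha=2)

# With rank 16 and alpha 32, as configured for the LLM, the scale is 32 / 16 = 2.0.
# Assumed 1024 x 1024 projection: 32,768 trainable values instead of 1,048,576.
added = lora_param_count(1024, 1024, 16)
```

Only A and B are trained while W stays frozen, which is why the adapters reduce memory use during training.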
Implementation. The model is pre-trained for 100 epochs and fine-tuned for 200 epochs, with a batch size of 16 on one A100 GPU. The input frames are first resized to $256\times 256$, and then randomly cropped (training) or centrally cropped (inference) to $224\times 224$. We use the AdamW [Loshchilov and Hutter(2019)] optimizer with a learning rate of $3\times 10^{-4}$ and a weight decay of $0.001$.
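The cropping scheme can be illustrated by computing the crop window only. In this sketch `crop_box` is a hypothetical helper, and applying one window to all frames of a clip is an assumption, since the text does not say whether the random offset is shared across frames.

```python
import random

def crop_box(size=256, crop=224, training=False, rng=random):
    """Return (top, left, bottom, right) of a crop x crop window in a size x size frame."""
    if training:
        # Random crop: any offset that keeps the window inside the frame.
        top = rng.randint(0, size - crop)
        left = rng.randint(0, size - crop)
    else:
        # Central crop: equal margins on every side.
        top = left = (size - crop) // 2
    return top, left, top + crop, left + crop

center = crop_box()  # central crop used at inference
train_box = crop_box(training=True, rng=random.Random(0))  # seeded for repeatability
```

With the sizes stated above, the central crop leaves a 16-pixel margin on each side.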

Table 1: Experimental results on Phoenix14T and CSL-Daily datasets. The best results for gloss-free models are highlighted in bold, while the second-best results are underlined.

| Method | Phoenix14T B-1 | Phoenix14T B-2 | Phoenix14T B-3 | Phoenix14T B-4 | Phoenix14T R | CSL-Daily B-1 | CSL-Daily B-2 | CSL-Daily B-3 | CSL-Daily B-4 | CSL-Daily R |
|---|---|---|---|---|---|---|---|---|---|---|
| Gloss-based | | | | | | | | | | |
| MMTLB [Chen et al.(2022a)Chen, Wei, Sun, Wu, and Lin] | 53.97 | 41.75 | 33.84 | 28.39 | 52.65 | 53.31 | 40.41 | 30.87 | 23.92 | 53.25 |
| SLTUNET [Zhang et al.(2023a)Zhang, Müller, and Sennrich] | 52.92 | 41.76 | 33.99 | 28.47 | 52.11 | 54.98 | 41.44 | 31.84 | 25.01 | 54.08 |
| TS-SLT [Chen et al.(2022b)Chen, Zuo, Wei, Wu, Liu, and Mak] | 54.90 | 42.43 | 34.46 | 28.95 | 53.48 | 55.44 | 42.59 | 32.87 | 25.79 | 55.72 |
| Gloss-free | | | | | | | | | | |
| NSLT+Luong [Luong et al.(2015)Luong, Pham, and Manning] | 29.86 | 17.52 | 11.96 | 9.00 | 30.70 | 34.16 | 19.57 | 11.84 | 7.56 | 34.54 |
| CSGCR [Zhao et al.(2022)Zhao, Qi, Zhou, Duan, Zhou, and Li] | 36.71 | 25.40 | 18.86 | 15.18 | 38.85 | - | - | - | - | - |
| GFSLT-VLP [Zhou et al.(2023)Zhou, Chen, Clapés, Wan, Liang, Escalera, Lei, and Zhang] | 43.71 | 33.18 | 26.11 | 21.44 | 42.49 | 39.37 | 24.93 | 16.26 | 11.00 | 36.44 |
| Sign2GPT [Wong et al.()Wong, Camgöz, and Bowden] | 49.54 | 35.96 | 28.83 | 22.52 | 48.90 | 41.75 | 28.73 | 20.60 | 15.40 | 42.36 |
| SignCL [Ye et al.()Ye, Wang, Jiao, Liang, and Xiong] | 49.76 | 36.85 | 29.97 | 22.74 | 49.07 | 47.47 | 32.53 | 22.62 | 16.16 | 48.92 |
| FLa-LLM [Chen et al.(2024b)Chen, Zhou, Li, Wan, Lei, Jiang, Lu, and Zhao] | 46.29 | 35.33 | 28.03 | 23.09 | 45.27 | 37.13 | 25.12 | 18.38 | 14.20 | 37.25 |
| SignLLM [Gong et al.(2024)Gong, Foo, He, Rahmani, and Liu] | 45.21 | 34.78 | 28.05 | 23.40 | 44.49 | 39.55 | 28.13 | 20.07 | 15.75 | 39.91 |
| LLaVA-SLT [Liang et al.(2024)Liang, Huang, Xu, Tang, Ye, Zhang, Chen, Yu, and Xu] | 51.20 | 37.51 | 29.39 | 23.43 | 50.44 | 52.15 | 36.24 | 26.47 | 20.42 | 51.26 |
| BeyondGloss | 52.38 | 38.57 | 30.74 | 25.49 | 52.89 | 53.12 | 38.63 | 27.82 | 21.53 | 53.46 |

4.3 Results

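Results on Phoenix14T and CSL-Daily. Tab. 1 compares gloss-based and gloss-free methods on the Phoenix14T and CSL-Daily test sets. Our method outperforms the other gloss-free SLT approaches on every reported metric, reaching a BLEU-4 of 25.49 and a ROUGE of 52.89 on Phoenix14T, and a BLEU-4 of 21.53 and a ROUGE of 53.46 on CSL-Daily.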
Qualitative Results. Tab. 2 shows translation examples from Phoenix14T. We compare the reference translations with outputs from GFSLT-VLP [Zhou et al.(2023)Zhou, Chen, Clapés, Wan, Liang, Escalera, Lei, and Zhang], the only other publicly available gloss-free model, and our method. While GFSLT-VLP often captures only a few correct words or produces unrelated sentences, our method generates more accurate translations. These qualitative examples illustrate the effectiveness of our approach. See the Appendix for more comparisons.

Table 2: Qualitative results on Phoenix14T. Green indicates text that is identical to the reference.
Reference: am freitag scheint abseits der nebelgebiete haufig die sonne
(On Friday the sun often shines outside of the foggy areas)
GFSLT-VLP: am freitag sobald der nebel weg ist sonnenschein.
(On Friday as soon as the fog is gone there will be sunshine)
BeyondGloss: am freitag scheint abseits der nebelgebiete haufig die sonne.
(On Friday the sun often shines outside of the foggy areas)

4.4 Ablation studies

Analysis of Pre-Training Component Contributions. We conduct experiments to evaluate the impact of the core components of our proposed method. Tab. 3 reports the performance obtained by progressively incorporating each component. As shown in the first row, omitting pre-training and alignment between sign videos and output texts results in significantly lower downstream performance (16.31 B-4), highlighting the importance of pre-training for SLT. The second row demonstrates that video-target alignment contributes substantially to performance improvement (21.87 B-4). Rows three and four further show the positive effect of our proposed video-description alignment. In rows four and five, feature distillation also contributes to modest improvements in performance. Finally, combining all components (row five) yields the best overall performance, 25.68 B-4 and 53.85 ROUGE, achieving state-of-the-art results.
Evaluating Different Visual Encoders. We conduct an additional ablation study to examine the impact of different visual encoders on final performance. As shown in Tab. 5, using DINO V2 [Oquab et al.(2024)Oquab, Darcet, Moutakanni, Vo, Szafraniec, Khalidov, Fernandez, Haziza, Massa, El-Nouby, Assran, Ballas, Galuba, Howes, Huang, Li, Misra, Rabbat, Sharma, Synnaeve, Xu, Jegou, Mairal, Labatut, Joulin, and Bojanowski], a transformer-based visual encoder, yields better results compared to ResNet-18 [He et al.(2016)He, Zhang, Ren, and Sun], a convolutional encoder. The best performance is achieved when applying LoRA to the last three layers of DINO V2, leading to our state-of-the-art results.
Evaluating Different LLMs. As shown in Tab. 5, we assess the impact of using various LLMs within our framework. Both MT5-Base [Xue et al.(2021)Xue, Constant, Roberts, Kale, Al-Rfou, Siddhant, Barua, and Raffel] and MBART (24-layer) [Liu et al.(2020)Liu, Gu, Goyal, Li, Edunov, Ghazvininejad, Lewis, and Zettlemoyer], which have a comparable number of parameters, yield strong results. MBART slightly outperforms MT5-Base, likely due to its stronger multilingual prior knowledge. These findings highlight the generalizability of our approach. Additionally, we evaluate a 6-layer version of MBART, similar to the configuration used in GFSLT-VLP [Zhou et al.(2023)Zhou, Chen, Clapés, Wan, Liang, Escalera, Lei, and Zhang], which achieves moderate performance. All ablation experiments are conducted on the Phoenix14T validation set.

Table 3: Ablation study on key elements in our framework.
Feature Distillation Video-Description Alignment Video-Target Alignment B-4 R
(1) 16.31 36.33
(2) 21.87 44.30
(3) 23.47 47.17
(4) 22.47 44.87
(5) 25.68 53.85

5 Conclusion

In this work, we introduced BeyondGloss, a novel gloss-free Sign Language Translation framework that emphasizes hand-centric modeling and leverages the capabilities of VideoLLMs. By generating fine-grained, temporally-aware textual descriptions of hand motion and aligning them with sign video features through contrastive learning, our approach effectively bridges the modality gap between visual and linguistic representations. Additionally, the integration of detailed hand features distilled from HaMeR further enriches the sign video representations. Extensive experiments on Phoenix14T and CSL-Daily demonstrate that BeyondGloss achieves state-of-the-art performance, underscoring the importance of structured hand modeling and multimodal alignment in advancing gloss-free SLT.

6 Acknowledgement

This work was supported by the SNSF project ‘SMILE II’ (CRSII5 193686), the Innosuisse IICT Flagship (PFFS-21-47), EPSRC grant APP24554 (SignGPT-EP/Z535370/1) and through funding from Google.org via the AI for Global Goals scheme. This work reflects only the authors’ views and the funders are not responsible for any use that may be made of the information it contains.

References

  • [Achiam et al.(2024)Achiam, Adler, Agarwal, Ahmad, Akkaya, Aleman, Almeida, Altenschmidt, Altman, Anadkat, and others.] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and others. Gpt-4 technical report, 2024.
  • [Alayrac et al.(2022)Alayrac, Donahue, Luc, Miech, Barr, Hasson, Lenc, Mensch, Millican, Reynolds, et al.] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • [Brown et al.(2020)Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess, Clark, Berner, McCandlish, Radford, Sutskever, and Amodei] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
  • [Camgoz et al.(2018)Camgoz, Hadfield, Koller, Ney, and Bowden] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. Neural sign language translation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7784–7793, 2018.
  • [Camgoz et al.(2020)Camgoz, Koller, Hadfield, and Bowden] Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10023–10033, 2020.
  • [Chen et al.(2024a)Chen, Wei, Li, Dong, Zhang, Zang, Chen, Duan, Lin, Tang, Yuan, Qiao, Lin, Zhao, and Wang] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. Sharegpt4video: Improving video understanding and generation with better captions, 2024a.
  • [Chen et al.(2022a)Chen, Wei, Sun, Wu, and Lin] Yutong Chen, Fangyun Wei, Xiao Sun, Zhirong Wu, and Stephen Lin. A simple multi-modality transfer learning baseline for sign language translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5120–5130, 2022a.
  • [Chen et al.(2022b)Chen, Zuo, Wei, Wu, Liu, and Mak] Yutong Chen, Ronglai Zuo, Fangyun Wei, Yu Wu, Shujie Liu, and Brian Mak. Two-stream network for sign language recognition and translation. Advances in Neural Information Processing Systems, 35:17043–17056, 2022b.
  • [Chen et al.(2024b)Chen, Zhou, Li, Wan, Lei, Jiang, Lu, and Zhao] Zhigang Chen, Benjia Zhou, Jun Li, Jun Wan, Zhen Lei, Ning Jiang, Quan Lu, and Guoqing Zhao. Factorized learning assisted with large language model for gloss-free sign language translation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7071–7081, 2024b.
  • [Dao(2023)] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023.
  • [Duarte(2019)] Amanda Cardoso Duarte. Cross-modal neural sign language translation. In Proceedings of the 27th ACM International Conference on Multimedia, MM ’19, page 1650–1654, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450368896.
  • [Gan et al.(2021)Gan, Yin, Jiang, Xie, and Lu] Shiwei Gan, Yafeng Yin, Zhiwei Jiang, Lei Xie, and Sanglu Lu. Skeleton-aware neural sign language translation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 4353–4361, 2021.
  • [Gong et al.(2024)Gong, Foo, He, Rahmani, and Liu] Jia Gong, Lin Geng Foo, Yixuan He, Hossein Rahmani, and Jun Liu. Llms are good sign language translators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18362–18372, 2024.
  • [Graves(2013)] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [Heo et al.(2019)Heo, Lee, Yun, and Choi] Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3779–3787, 2019.
  • [Hinton et al.(2015)Hinton, Vinyals, and Dean] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [Hu et al.(2021)Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, and Chen] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
  • [Jia et al.(2021)Jia, Yang, Xia, Chen, Parekh, Pham, Le, Sung, Li, and Duerig] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
  • [Jiang et al.(2021)Jiang, Sun, Wang, Bai, Li, and Fu] Songyao Jiang, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li, and Yun Fu. Skeleton aware multi-modal sign language recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3413–3423, 2021.
  • [Kamath et al.(2021)Kamath, Singh, LeCun, Synnaeve, Misra, and Carion] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1780–1790, 2021.
  • [Kim et al.(2021)Kim, Son, and Kim] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In International conference on machine learning, pages 5583–5594. PMLR, 2021.
  • [Lefaudeux et al.(2022)Lefaudeux, Massa, Liskovich, Xiong, Caggiano, Naren, Xu, Hu, Tintore, Zhang, Labatut, Haziza, Wehrstedt, Reizenstein, and Sizov] Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, Luca Wehrstedt, Jeremy Reizenstein, and Grigory Sizov. xformers: A modular and hackable transformer modelling library, 2022.
  • [Li et al.(2023)Li, He, Wang, Li, Wang, Luo, Wang, Wang, and Qiao] Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. CoRR, 2023.
  • [Li et al.(2019)Li, Yatskar, Yin, Hsieh, and Chang] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
  • [Liang et al.(2024)Liang, Huang, Xu, Tang, Ye, Zhang, Chen, Yu, and Xu] Han Liang, Chengyu Huang, Yuecheng Xu, Cheng Tang, Weicai Ye, Juze Zhang, Xin Chen, Jingyi Yu, and Lan Xu. Llava-slt: Visual language tuning for sign language translation, 2024.
  • [Lin et al.(2024)Lin, Ye, Zhu, Cui, Ning, Jin, and Yuan] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024.
  • [Lin(2004)] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
  • [Liu et al.(2020)Liu, Gu, Goyal, Li, Edunov, Ghazvininejad, Lewis, and Zettlemoyer] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation, 2020.
  • [Loshchilov and Hutter(2019)] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
  • [Low et al.(2025)Low, Walsh, Sincan, and Bowden] Jianhe Low, Harry Walsh, Ozge Mercanoglu Sincan, and Richard Bowden. Hands-on: Segmenting individual signs from continuous sequences. 2025.
  • [Luong et al.(2015)Luong, Pham, and Manning] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation, 2015.
  • [Maaz et al.(2024)Maaz, Rasheed, Khan, and Khan] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024.
  • [OpenAI(2023)] OpenAI. Gpt-4 technical report. Technical report, OpenAI, 2023.
  • [Oquab et al.(2024)Oquab, Darcet, Moutakanni, Vo, Szafraniec, Khalidov, Fernandez, Haziza, Massa, El-Nouby, Assran, Ballas, Galuba, Howes, Huang, Li, Misra, Rabbat, Sharma, Synnaeve, Xu, Jegou, Mairal, Labatut, Joulin, and Bojanowski] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2024.
  • [Papineni et al.(2002)Papineni, Roukos, Ward, and Zhu] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA, 2002. Association for Computational Linguistics.
  • [Pavlakos et al.(2024)Pavlakos, Shan, Radosavovic, Kanazawa, Fouhey, and Malik] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024.
  • [Radford et al.(2021)Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al.] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [Rastgoo et al.(2022)Rastgoo, Kiani, Escalera, Athitsos, and Sabokrou] Razieh Rastgoo, Kourosh Kiani, Sergio Escalera, Vassilis Athitsos, and Mohammad Sabokrou. All you need in sign language production, 2022.
  • [Romero et al.(2014)Romero, Ballas, Kahou, Chassang, Gatta, and Bengio] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • [Stokoe(2005)] Jr. Stokoe, William C. Sign language structure: An outline of the visual communication systems of the american deaf. The Journal of Deaf Studies and Deaf Education, 10(1):3–37, 01 2005. ISSN 1081-4159.
  • [Su et al.(2024)Su, Ahmed, Lu, Pan, Bo, and Liu] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  • [Touvron et al.(2023)Touvron, Lavril, Izacard, Martinet, Lachaux, Lacroix, Rozière, Goyal, Hambro, Azhar, et al.] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [Voskou et al.(2021)Voskou, Panousis, Kosmopoulos, Metaxas, and Chatzis] Andreas Voskou, Konstantinos P Panousis, Dimitrios Kosmopoulos, Dimitris N Metaxas, and Sotirios Chatzis. Stochastic transformer networks with linear competing units: Application to end-to-end sl translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11946–11955, 2021.
  • [Weng et al.(2024)Weng, Han, He, Chang, and Zhuang] Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video understanding via large language models. In European Conference on Computer Vision, pages 453–470. Springer, 2024.
  • [Wong et al.()Wong, Camgöz, and Bowden] Ryan Cameron Wong, Necati Cihan Camgöz, and Richard Bowden. Sign2gpt: leveraging large language models for gloss-free sign language translation. In ICLR 2024: The Twelfth International Conference on Learning Representations.
  • [Xue et al.(2021)Xue, Constant, Roberts, Kale, Al-Rfou, Siddhant, Barua, and Raffel] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mt5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, 2021.
  • [Ye et al.()Ye, Wang, Jiao, Liang, and Xiong] Jinhui Ye, Xing Wang, Wenxiang Jiao, Junwei Liang, and Hui Xiong. Improving gloss-free sign language translation by reducing representation density. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
  • [Yin et al.(2023)Yin, Zhong, Tang, Jin, Jin, and Zhao] Aoxiong Yin, Tianyun Zhong, Li Tang, Weike Jin, Tao Jin, and Zhou Zhao. Gloss attention for gloss-free sign language translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2551–2562, 2023.
  • [Zagoruyko and Komodakis(2017)] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations, 2017.
  • [Zhang et al.(2023a)Zhang, Müller, and Sennrich] Biao Zhang, Mathias Müller, and Rico Sennrich. Sltunet: A simple unified model for sign language translation, 2023a.
  • [Zhang et al.(2023b)Zhang, Li, and Bing] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, 2023b.
  • [Zhao et al.(2022)Zhao, Qi, Zhou, Duan, Zhou, and Li] Jian Zhao, Weizhen Qi, Wengang Zhou, Nan Duan, Ming Zhou, and Houqiang Li. Conditional sentence generation and cross-modal reranking for sign language translation. IEEE Transactions on Multimedia, 24:2662–2672, 2022.
  • [Zhou et al.(2023)Zhou, Chen, Clapés, Wan, Liang, Escalera, Lei, and Zhang] Benjia Zhou, Zhigang Chen, Albert Clapés, Jun Wan, Yanyan Liang, Sergio Escalera, Zhen Lei, and Du Zhang. Gloss-free sign language translation: Improving from visual-language pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20871–20881, 2023.
  • [Zhou et al.(2021)Zhou, Zhou, Qi, Pu, and Li] Hao Zhou, Wengang Zhou, Weizhen Qi, Junfu Pu, and Houqiang Li. Improving sign language translation with monolingual data by sign back-translation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1316–1325. IEEE, 2021.

7 Sign Video Descriptor

7.1 Video Segmentor Algorithm

Algorithm 1 Video Segmentation into 16-Frame Segments
Require: Video file $V$
Ensure: List of 16-frame segments $S$
Open video $V$ and get total frames $N$
Initialize empty list $S \leftarrow [\,]$
$segment \leftarrow [\,]$ {Current segment buffer}
for $i \leftarrow 1$ to $N$ do
  $frame \leftarrow$ read next frame from $V$
  Append $frame$ to $segment$
  if $\text{length}(segment) == 16$ then
    Append $segment$ to $S$
    Reset $segment \leftarrow [\,]$ {Start new segment}
  end if
end for
if $\text{length}(segment) > 0$ then
  Append remaining $segment$ to $S$ {Handle leftover frames}
end if
return $S$
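
The segmentation in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration and not the released implementation: the function name `segment_frames` and the parameter `segment_len` are hypothetical, and frames are taken from any iterable (for example, frames already decoded by a video reader) so that the sketch carries no dependency on a particular video library.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def segment_frames(
    frames: Iterable[Frame], segment_len: int = 16
) -> Iterator[List[Frame]]:
    """Yield consecutive, non-overlapping segments of `segment_len` frames.

    Hypothetical helper mirroring Algorithm 1: the last segment holds the
    leftover frames and may therefore be shorter than `segment_len`.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    frame_iter = iter(frames)
    while True:
        # Take up to `segment_len` frames from the stream.
        segment = list(islice(frame_iter, segment_len))
        if not segment:
            return  # stream exhausted, nothing left to emit
        yield segment


# Integers stand in for decoded frames: 40 frames give 16 + 16 + 8.
segments = list(segment_frames(range(40)))
print([len(s) for s in segments])  # [16, 16, 8]
```

Writing the routine as a generator avoids holding every segment in memory at once, which may matter for long videos; collecting the result with `list(...)` recovers the list $S$ of Algorithm 1.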

7.2 More Examples

Fig. 3 shows further examples. Each rectangle enclosing a sequence of frames shares its color with the corresponding description in the final video description.

Figure 3: More examples of video descriptions generated by the proposed Sign Video Descriptor on Phoenix14T and CSL-Daily.

8 Qualitative Results

Table 6: Qualitative results on Phoenix14T. Green indicates text identical to the reference.
Reference: am tag wechseln sonne und wolken einander ab teilweise ist es auch langere zeit sonnig.
(During the day sun and clouds alternate partly it is sunny for a long time)
GFSLT-VLP: am tag wechseln sonne und wolken einander ab es bilden sich langere zeit viel sonnenschein.
(During the day, sun and clouds alternate. There is a lot of sunshine for a long time)
BeyondGloss: am tag wechseln sonne und wolken einander ab zeigt sich die auch langere zeit sonnig.
(During the day, sun and clouds alternate, and the weather remains sunny for longer periods.)
Reference: am tag nur hier und da einige sonnige momente vor allem an den alpen.
(During the day only here and there some sunny moments, especially in the Alps)
GFSLT-VLP: morgen zeigt sich mal die sonne wenn dann vor allem an den alpen.
(Tomorrow the sun will show up, especially in the Alps)
BeyondGloss: und morgen scheint und da einige die sonnige vor allem an den alpen.
(Tomorrow will shine and there will be some sunny weather, especially in the Alps.)
Reference: am freitag scheint abseits der nebelgebiete haufig die sonne
(On Friday the sun often shines outside of the foggy areas)
GFSLT-VLP: am freitag sobald der nebel weg ist sonnenschein.
(On Friday as soon as the fog is gone there will be sunshine)
BeyondGloss: am freitag scheint abseits der nebelgebiete haufig die sonne.
(On Friday the sun often shines outside of the foggy areas)

9 Ablation Studies

9.1 Impact of Loss Weighting in Pre-training for SLT

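Tab. 7 reports SLT performance for different weightings of the terms in the pre-training loss. The best result (B-4 of 25.68, R of 53.85) is obtained in row (7) with $\lambda_{\text{distill}}=0.3$, $\lambda_{\text{desc}}=0.5$ and $\lambda_{\text{align}}=1.0$, while setting all three weights to zero in row (1) gives the lowest scores (B-4 of 16.41, R of 36.33).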

Table 7: Impact of Loss Weighting in Pre-training for SLT.
Setting $\lambda_{\text{distill}}$ $\lambda_{\text{desc}}$ $\lambda_{\text{align}}$ B-4 R
(1) 0.0 0.0 0.0 16.41 36.33
(2) 0.0 0.0 1.0 21.87 44.30
(3) 0.0 1.0 1.0 22.13 45.14
(4) 0.0 0.5 1.0 23.47 47.17
(5) 1.0 1.0 1.0 22.85 46.32
(6) 0.5 1.0 1.0 24.56 50.21
(7) 0.3 0.5 1.0 25.68 53.85
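
To make the role of the three weights concrete, the sketch below assumes that the pre-training objective is a plain weighted sum of the distillation, description-alignment and language-alignment losses. This form is an assumption made for illustration, and the names `LossWeights` and `pretraining_loss` are hypothetical. The default weights are those of row (7) in Tab. 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical container for the weights ablated in Tab. 7."""
    distill: float = 0.3  # lambda_distill, row (7)
    desc: float = 0.5     # lambda_desc, row (7)
    align: float = 1.0    # lambda_align, row (7)


def pretraining_loss(
    loss_distill: float,
    loss_desc: float,
    loss_align: float,
    weights: LossWeights = LossWeights(),
) -> float:
    """Combine the three loss terms as a weighted sum (assumed form)."""
    return (
        weights.distill * loss_distill
        + weights.desc * loss_desc
        + weights.align * loss_align
    )


# Dummy loss values, used only to show how the weights scale each term.
total = pretraining_loss(loss_distill=1.0, loss_desc=2.0, loss_align=3.0)

# Row (2) of Tab. 7 keeps only the language-alignment term.
align_only = pretraining_loss(1.0, 2.0, 3.0, LossWeights(0.0, 0.0, 1.0))
```

Under this assumed form, setting a weight to zero removes the corresponding term from the objective, which is how rows (1) to (4) of Tab. 7 would disable individual components.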