Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Bowen Zhang1∗  Sicheng Xu2  Chuxin Wang1  Jiaolong Yang2
Feng Zhao1† Dong Chen2†  Baining Guo2
1University of Science and Technology of China  2Microsoft Research Asia
Abstract

In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: GVFDiffusion.github.io.

∗ Intern at Microsoft Research Asia. † Corresponding authors.
Figure 1: Our model is capable of creating high-fidelity 4D objects from in-the-wild video inputs. Best viewed with zoom-in.

1 Introduction

Recent advances in generative models have demonstrated remarkable capabilities across various modalities, including image [30, 31, 90, 45, 23, 55, 21], video [2, 3, 19], and 3D content [66, 95, 81, 93, 68, 24]. While these achievements mark important milestones, they naturally lead to the next frontier: 4D generation, which aims to create dynamic 3D content. Generating such content, a fundamental aspect of representing our inherently four-dimensional world, remains largely unexplored. This gap is particularly significant given that real-world phenomena inherently combine spatial and temporal dynamics, from subtle object movements to complex character articulations.

Despite diffusion models [15, 45, 59] having demonstrated strong modeling capabilities in both 2D and 3D domains, training a robust 4D diffusion model for dynamic 3D content generation presents two main technical challenges. First, obtaining a large-scale 4D dataset is time-consuming. A straightforward approach involves fitting individual dynamic Gaussian Splatting (4DGS) representations [76] for each 3D animation sequence, but this solution typically requires tens of minutes per instance, making it computationally expensive and less scalable as the number of instances increases. Second, the higher-dimensional nature of the problem necessitates a large number of parameters (usually exceeding 100K tokens) to represent 3D shape, appearance, and motion simultaneously, making direct modeling with diffusion approaches extremely challenging. These limitations have significantly hindered the development of efficient and high-quality 4D generative models.

Motivated by the effectiveness of diffusion models applied to compact latent spaces in recent 2D and 3D generation works [59, 3, 81, 95, 58], we present a novel framework for 4D generative modeling that comprises a Direct 4DMesh-to-GS Variation Field VAE and a Gaussian Variation Field diffusion model. Our VAE framework encodes the canonical 3D Gaussian Splatting (3DGS) of objects and compresses each Gaussian’s attribute variations (i.e., Gaussian Variation Fields) into a compact latent space from 4D mesh data, thereby bypassing costly per-instance reconstructions. Inspired by previous works [91, 5], we employ a perceiver-style transformer network [27, 26, 72] with displacements of mesh points to effectively encode motion information. To bridge the gap between the Gaussian Splatting representation and mesh-based ground-truth motion, we introduce a mesh-guided loss that aligns the motion of Gaussian points with the corresponding mesh vertices. Our VAE is trained end-to-end with this mesh-guided loss and an image-level loss, enabling faithful compression of complex Gaussian Variation Fields. This approach compresses high-dimensional motion sequences into a compact latent space with sequence length 512, thus facilitating efficient diffusion modeling for 4D content generation.

Following the construction of our VAE, the 4D generative modeling naturally decomposes into canonical 3DGS generation and Gaussian Variation Field modeling. We leverage state-of-the-art 3D generative models [81] for the canonical component while focusing on modeling the Gaussian Variation Fields. To achieve this, we train a diffusion model to learn the latent space distribution of variation fields conditioned on the input video and canonical 3DGS, enabling controlled 4D content generation. Leveraging the compact nature of our latent space, we employ the Diffusion Transformer (DiT) architecture [52], augmented with temporal self-attention layers to capture smooth temporal dynamics across animations. The video frame features [48] and the canonical 3DGS are used as conditions for the diffusion model via cross-attention layers. Additionally, we incorporate positional priors into the diffusion model, enhancing its awareness of correspondences between canonical GS and their variation fields during the denoising process, thereby improving generation quality.

We train our model on a carefully curated, diverse collection of animatable 3D objects from the Objaverse [13] and Objaverse-XL [12] datasets. Extensive evaluations demonstrate the superior video-to-4D generation quality of our method compared to existing approaches. Despite being trained on synthetic data, our model generalizes remarkably well to in-the-wild video inputs, producing high-quality animations from such sequences. We believe that our approach represents a notable step toward narrowing the gap between static 3D generation and 4D content creation, paving the way for generating high-quality 4D content.

2 Related Work

3D generation. Early GAN-based 3D generation approaches [77, 100, 14, 7, 17, 64, 98, 80] laid the foundation for 3D content synthesis, while diffusion-based methods [74, 92, 93, 69, 44, 22, 6, 62, 29, 86, 9] advanced generation quality. Recent approaches have focused on latent space generation, either separating geometric modeling and appearance synthesis [97, 71, 91, 95, 37, 78, 99, 58] or jointly modeling both [81, 20, 83, 29, 35, 46, 84, 10]. Alternative methods [67, 53, 65, 8, 40, 75] leverage pretrained 2D models [59] through optimization techniques. Recent works [81, 95] have achieved high-quality 3D asset generation with detailed geometry and appearance, establishing a foundation for 4D content creation.

4D reconstruction. Early 4D reconstruction methods [50, 51, 16, 54, 4] extended neural volumetric techniques for dynamic scenes, while recent Gaussian Splatting-based approaches [76, 42, 11, 36, 25, 32] offer improved efficiency. Typical 4D reconstruction methods often require significant optimization time per instance (e.g., 6 minutes for 4DGaussians [76] and over 30 minutes for K-planes [16]), making them impractical to use as a preliminary step in fitting 4D representations for generation. In this paper, we explore an efficient approach to directly encode 4D mesh data for generative modeling in a single pass.

Figure 2: Framework of 4DMesh-to-GS Variation Field VAE. Our VAE directly encodes 3D animation data into Gaussian Variation Fields within a compact latent space, optimized through image-level reconstruction loss and the proposed mesh-guided loss.

Video-to-4D generation. Early attempts [28, 1, 56, 63] at video-to-4D generation predominantly relied on optimization-based approaches, utilizing pre-trained generative priors [40, 60, 61, 73] as guidance. These methods typically employ score distillation [53] techniques to optimize either neural volumetric representations [43] or 3D Gaussian Splatting [32]. They suffer from lengthy optimization times and SDS-related issues [75, 39] such as spatial-temporal inconsistency or poor alignment with the input. While some works [88, 87] use pseudo-labels for better consistency, recent approaches [57, 82, 49, 85, 94, 38] directly reconstruct 4D content from multiview images or videos. Notably, the introduction of large-scale 4D reconstruction models [57] has significantly reduced the generation time from hours to seconds. However, most of these approaches struggle to maintain consistent quality across temporal sequences due to inherent multiview inconsistency in 2D generation results.

3 Method

Given an input video sequence $\bm{\mathcal{I}}=\{I_{t}\}_{t=1}^{T}$ of an object, our goal is to generate a sequence of 3DGS models $\bm{\mathcal{G}}=\{G_{t}\}_{t=1}^{T}$ that captures the shape, appearance, and motion of the object. We decompose this task into canonical GS $G_{1}$ creation (using the first frame as canonical) and Gaussian Variation Fields $\bm{\mathcal{V}}=\{\Delta G_{t}\}_{t=1}^{T}$ generation, where $\bm{\mathcal{V}}$ describes each Gaussian’s attribute variations relative to $G_{1}$ over time. Our framework comprises two main components: (1) a direct 4DMesh-to-GS Variation Field VAE that efficiently encodes 3D animation sequences into a compact latent space, and (2) a Gaussian Variation Field diffusion model that learns the latent distribution of variation fields conditioned on the input video and canonical GS. The following sections detail each component.

3.1 Direct 4DMesh-to-GS Variation Field VAE

Extending 3DGS to generative modeling of dynamic content presents significant challenges. Fitting individual dynamic 3DGS representations for each animation instance is computationally expensive and scales poorly. Additionally, directly modeling temporal deformation of GS sequences with diffusion models is challenging due to the high dimensionality of both Gaussian quantities (e.g., typically over 100K in [32]) and the time dimension. Therefore, we propose an efficient autoencoding framework that directly encodes 3D animation data into Gaussian Variation Fields with a compact latent space, facilitating subsequent diffusion modeling.

Gaussian Variation Field encoding. Given a sequence of mesh animations $\bm{\mathcal{M}}=\{M_{t}\}_{t=1}^{T}$, we first convert them to point clouds $\bm{\mathcal{P}}=\{P_{t}\mid P_{t}\in\mathbb{R}^{N\times 3}\}_{t=1}^{T}$ through uniform surface sampling, where each point cloud contains $N$ points. The displacement fields $\{\Delta P_{t}\mid\Delta P_{t}\in\mathbb{R}^{N\times 3}\}_{t=1}^{T}$ are computed as the differences between corresponding points in each frame and in the canonical frame:

$$\Delta P_{t}=P_{t}-P_{1}, \tag{1}$$

where $P_{1}$ is the canonical frame’s point cloud. We then leverage the pretrained mesh-to-GS autoencoder from [81], with encoder $\mathcal{E}_{GS}$ and decoder $\mathcal{D}_{GS}$, to obtain the canonical GS representation from the canonical mesh $M_{1}$:

$$\begin{split}S_{1}&=\mathcal{E}_{GS}(M_{1}),\\ G_{1}&=\mathcal{D}_{GS}(S_{1}),\end{split} \tag{2}$$

where $G_{1}\in\mathbb{R}^{N_{G}\times 14}$ denotes the Gaussian parameters including positions $\bm{p}_{1}$, scales $\bm{s}_{1}$, rotation $\bm{q}_{1}$, colors $\bm{c}_{1}$, and opacity $\alpha_{1}$, with $N_{G}$ being the total number of canonical Gaussians. $S_{1}$ is the structured latent (SLat) representation for the canonical GS (more details are included in the supplementary). We finetune $\mathcal{D}_{GS}$ to ensure coherent canonical GS reconstruction with their variation fields, while keeping $\mathcal{E}_{GS}$ frozen to leverage pretrained canonical GS diffusion models.
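The point-cloud construction and Eq. (1) can be sketched as follows. This is a minimal illustration, assuming that point correspondence across frames is obtained by reusing the same face indices and barycentric coordinates on every deformed mesh; the function names and the toy translating quad are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_surface(vertices, faces, n_points, rng):
    """Area-weighted surface sampling; returns face ids and barycentric coords."""
    tri = vertices[faces]                                    # [F, 3, 3]
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    face_ids = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    u, v = rng.random(n_points), rng.random(n_points)
    flip = u + v > 1.0                                       # fold back into the triangle
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    bary = np.stack([1.0 - u - v, u, v], axis=1)             # [N, 3]
    return face_ids, bary

def points_at_frame(vertices_t, faces, face_ids, bary):
    """Evaluate the same surface samples on the deformed mesh of one frame."""
    tri = vertices_t[faces[face_ids]]                        # [N, 3, 3]
    return np.einsum("nk,nkd->nd", bary, tri)                # [N, 3]

rng = np.random.default_rng(0)
faces = np.array([[0, 1, 2], [0, 2, 3]])
verts_1 = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.]])
# Toy animation (assumption): frame index t translates the quad by 0.1 * t along z.
frames = [verts_1 + np.array([0., 0., 0.1 * t]) for t in range(3)]

face_ids, bary = sample_surface(verts_1, faces, n_points=8192, rng=rng)
P = np.stack([points_at_frame(v, faces, face_ids, bary) for v in frames])  # [T, N, 3]
delta_P = P - P[0:1]                                         # Eq. (1): ΔP_t = P_t - P_1
```

Because the samples are tied to fixed surface locations, `delta_P[0]` is zero by construction and every later frame stores only motion relative to the canonical frame.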

Inspired by 3DShape2VecSet [91], we employ a cross-attention layer to aggregate motion information from 3D animation sequences into a fixed-length latent representation. While directly using $G_{1}$ as query vectors is a straightforward approach, we find it leads to poor motion awareness. To enhance the network’s sensitivity to mesh deformation, we introduce a mesh-guided interpolation mechanism that generates motion-aware query vectors based on the spatial correspondence between $G_{1}$ and $P_{1}$.

Specifically, for each canonical Gaussian position $\bm{p}_{1}^{i}$, we identify its $K$ nearest neighbors in the canonical point cloud $P_{1}$ and compute their distances $\bm{d}_{i,k}$. To handle varying point densities across the mesh-sampled point cloud, we introduce an adaptive radius $r_{i}$ that adjusts the influence region based on the local point distribution. The interpolation weight $\bm{w}_{i,k}$ and adaptive radius $r_{i}$ are formulated as:

$$\bm{w}_{i,k}=\exp\left(-\frac{\beta\bm{d}_{i,k}}{r_{i}^{2}}\right),\quad r_{i}=\sqrt{\frac{1}{K}\sum_{k=1}^{K}\bm{d}_{i,k}}, \tag{3}$$

where $\beta$ is a hyperparameter controlling the decay rate of interpolation weights with distance, with larger values producing more localized influence regions. We set $\beta=7.0$ in this paper.
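The weights of Eq. (3) can be sketched as below. This is an illustrative reading rather than the authors' implementation: $\bm{d}_{i,k}$ is taken to be the plain Euclidean distance, the neighbor search is brute force, and `k=8` is a placeholder because the value of $K$ is not stated in this section. Only $\beta=7.0$ comes from the text.

```python
import numpy as np

def knn_weights(gauss_pos, points, k=8, beta=7.0):
    """Return neighbor indices n(i,k), unnormalized weights w_{i,k}, and radii r_i."""
    # Brute-force pairwise distances [N_G, N]; adequate for a small illustration.
    dist = np.linalg.norm(gauss_pos[:, None, :] - points[None, :, :], axis=-1)
    nbr = np.argsort(dist, axis=1)[:, :k]                    # [N_G, K], nearest first
    d = np.take_along_axis(dist, nbr, axis=1)                # [N_G, K]
    r = np.sqrt(d.mean(axis=1))                              # r_i = sqrt(mean_k d_{i,k})
    w = np.exp(-beta * d / (r[:, None] ** 2 + 1e-12))        # Eq. (3)
    return nbr, w, r

rng = np.random.default_rng(0)
points = rng.random((200, 3))        # stand-in for the canonical point cloud P_1
gauss_pos = rng.random((50, 3))      # stand-in for canonical Gaussian positions p_1
nbr, w, r = knn_weights(gauss_pos, points)
```

Since $r_{i}^{2}$ is the mean neighbor distance, the exponent depends only on the ratio $\bm{d}_{i,k}/\bar{d}_{i}$. Under this reading the weights are unchanged when the whole scene is rescaled, which is one way the adaptive radius accommodates varying point densities.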

We then interpolate the displacement fields $\Delta P_{t}$ for the $i$-th Gaussian at time $t$:

$$\Delta\bm{p}^{interp}_{t,i}=\sum_{k=1}^{K}\frac{\bm{w}_{i,k}}{\sum_{k}\bm{w}_{i,k}}\Delta P_{t,n(i,k)} \tag{4}$$

where $n(i,k)$ denotes the index of the $k$-th nearest neighbor. We apply farthest point sampling to $\Delta\bm{p}^{interp}_{t}$ based on the canonical positions to form our motion-aware query $\Delta\bm{p}^{fps}_{t}\in\mathbb{R}^{L\times 3}$ with reduced sequence length. The point cloud displacement fields $\Delta P_{t}$ serve as keys and values in the cross-attention encoder. To preserve spatial relationships, we incorporate a positional embedding $PE(\cdot)$ based on the canonical positions:

$$\begin{split}Q_{e}&=f_{disp}(\Delta\bm{p}^{fps}_{t})+PE(G_{1}),\\ K_{e}&=V_{e}=f_{disp}(\Delta P_{t})+PE(P_{1}),\\ \bm{z}&=\text{CrossAttn}(Q_{e},K_{e},V_{e}),\end{split} \tag{5}$$

where $f_{disp}$ is the displacement embedding layer. This process yields a latent representation $\bm{z}\in\mathbb{R}^{T\times L\times C}$, where $T$ is the number of temporal frames, $L$ is the latent size, and $C$ is the feature dimension. Notably, our encoding procedure compresses the sequence length from $N=8192$ to $L=512$, significantly reducing the subsequent diffusion modeling space.
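The encoding path of Eqs. (4) and (5) for a single frame can be sketched as follows. Several pieces are stand-ins chosen only for illustration: random linear maps replace the learned $f_{disp}$ and $PE(\cdot)$, attention is single-head without learned projections, and the sizes are reduced from the paper's $N=8192$ and $L=512$ to keep the example light.

```python
import numpy as np

def interpolate_displacement(delta_P_t, nbr, w):
    """Eq. (4): normalized weighted average of the neighbors' displacements."""
    w_norm = w / w.sum(axis=1, keepdims=True)                # [N_G, K]
    return np.einsum("gk,gkd->gd", w_norm, delta_P_t[nbr])   # [N_G, 3]

def farthest_point_sampling(pos, n_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [start]
    min_dist = np.linalg.norm(pos - pos[start], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(pos - pos[nxt], axis=1))
    return np.array(chosen)

def cross_attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections)."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ v

rng = np.random.default_rng(0)
N, N_G, L, K, C, beta = 1024, 256, 64, 8, 32, 7.0            # reduced toy sizes
P1 = rng.random((N, 3))                                      # canonical point cloud
p1 = rng.random((N_G, 3))                                    # canonical Gaussian positions
delta_P_t = 0.05 * rng.standard_normal((N, 3))               # displacements at frame t

dist = np.linalg.norm(p1[:, None] - P1[None], axis=-1)
nbr = np.argsort(dist, axis=1)[:, :K]
d = np.take_along_axis(dist, nbr, axis=1)
w = np.exp(-beta * d / d.mean(axis=1, keepdims=True))        # Eq. (3), r_i^2 = mean_k d_{i,k}

delta_p_interp = interpolate_displacement(delta_P_t, nbr, w) # [N_G, 3]
idx = farthest_point_sampling(p1, L)                         # FPS on canonical positions
W_disp = rng.standard_normal((3, C))                         # stand-in for f_disp
W_pe = rng.standard_normal((3, C))                           # stand-in for PE
Q_e = delta_p_interp[idx] @ W_disp + p1[idx] @ W_pe          # [L, C]
KV_e = delta_P_t @ W_disp + P1 @ W_pe                        # [N, C]
z_t = cross_attention(Q_e, KV_e, KV_e)                       # [L, C]
```

The output length is set by the number of queries, so `z_t` has `L` rows no matter how many points `N` feed the keys and values. This is how the cross-attention encoder shortens the sequence.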

Figure 3: Architecture of the Gaussian Variation Field diffusion model. Our model is built upon a Diffusion Transformer, which takes the noised latent as input and gradually denoises it conditioned on the video sequence and canonical GS.

Gaussian Variation Field decoding. The decoding procedure first transforms the latent representation through $n$ layers of self-attention blocks to enable thorough motion information exchange. The decoder then maps this processed latent to a Gaussian Variation Field $\bm{\mathcal{V}}$, defined by the variations of Gaussian attributes $\Delta G_{t}=\{\Delta\bm{p}_{t},\Delta\bm{s}_{t},\Delta\bm{q}_{t},\Delta\bm{c}_{t},\Delta\alpha_{t}\}_{t=1}^{T}$. To ensure the decoder is aware of all canonical Gaussian attributes, we use all parameters of $G_{1}$ to query the latent output through a cross-attention layer:

$$\begin{split}Q_{d}&=f_{gs}(G_{1})+PE(G_{1}),\quad K_{d}=V_{d}=\bm{z}_{n},\\ \Delta G_{t}&=\text{CrossAttn}(Q_{d},K_{d},V_{d}),\end{split} \tag{6}$$

where $f_{gs}$ is the embedding layer for the canonical Gaussians and $\bm{z}_{n}$ is the output of the final self-attention layer. The final 3DGS sequence is obtained by:

$$\bm{\mathcal{G}}=\{G_{t}\}_{t=1}^{T}=\{G_{1}+\Delta G_{t}\}_{t=1}^{T}. \tag{7}$$
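Eq. (7) amounts to a broadcast addition over the time axis, as the sketch below shows. The column ordering of the 14 Gaussian parameters (3 position, 3 scale, 4 rotation, 3 color, 1 opacity) is an assumed layout for illustration; the text fixes only the total of 14 and the attribute names. Any post-processing of the summed attributes is not described here and is omitted.

```python
import numpy as np

# Assumed (hypothetical) column layout: 3 + 3 + 4 + 3 + 1 = 14.
SLICES = {
    "position": slice(0, 3),
    "scale": slice(3, 6),
    "rotation": slice(6, 10),
    "color": slice(10, 13),
    "opacity": slice(13, 14),
}

def apply_variation_field(G1, delta_G):
    """Eq. (7): G_t = G_1 + ΔG_t for every frame, via broadcasting."""
    return G1[None, :, :] + delta_G                          # [T, N_G, 14]

rng = np.random.default_rng(0)
T, N_G = 4, 100
G1 = rng.standard_normal((N_G, 14))                          # canonical Gaussians
delta_G = 0.01 * rng.standard_normal((T, N_G, 14))           # decoded variation field
delta_G[0] = 0.0              # toy choice: zero variation at the canonical frame
G_seq = apply_variation_field(G1, delta_G)
positions = G_seq[:, :, SLICES["position"]]                  # [T, N_G, 3]
```

The canonical Gaussians are stored once and only per-frame offsets vary, so all frames share the same set of $N_{G}$ Gaussians.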
Table 1: Quantitative comparison of video-to-4D generation results. Our method demonstrates consistent performance improvements across all metrics while maintaining efficient generation speed. The generation time is measured on a single A100 GPU.

| Method | PSNR ↑ | LPIPS ↓ | SSIM ↑ | CLIP ↑ | FVD ↓ | Time ↓ |
|---|---|---|---|---|---|---|
| Consistent4D [28] | 16.20 | 0.146 | 0.880 | 0.910 | 935.19 | ~1.5 hr |
| SC4D [79] | 15.93 | 0.164 | 0.872 | 0.870 | 833.15 | ~20 min |
| STAG4D [88] | 16.85 | 0.144 | 0.887 | 0.893 | 1008.40 | ~1 hr |
| DreamGaussian4D [56] | 15.24 | 0.162 | 0.868 | 0.904 | 799.56 | ~15 min |
| L4GM [57] | 17.03 | 0.128 | 0.891 | 0.930 | 529.10 | 3.5 s |
| Ours | 18.47 | 0.114 | 0.901 | 0.935 | 476.83 | 4.5 s |

Training objective. Our training objective consists of three main components. First, we employ an image-level reconstruction loss between the images $I_{t}^{render}$ rendered from the final predicted Gaussians and the ground-truth images $I_{t}^{gt}$:

$$\mathcal{L}_{img}=\mathcal{L}_{1}+\lambda_{lpips}\mathcal{L}_{lpips}+\lambda_{ssim}\mathcal{L}_{ssim}, \tag{8}$$

where $\lambda_{lpips}$ and $\lambda_{ssim}$ are the loss weights for the perceptual loss [96] and the SSIM loss, respectively. To ensure faithful motion reconstruction, we introduce a mesh-guided loss that aligns the predicted Gaussian displacements with the pseudo ground truth $\Delta\bm{p}_{t}^{interp}$ obtained through mesh-guided interpolation:

$$\mathcal{L}_{mg}=\sum_{t=1}^{T}\|\Delta\bm{p}_{t}-\Delta\bm{p}_{t}^{interp}\|_{2}^{2}, \tag{9}$$

which we find is crucial for motion reconstruction quality. Finally, to facilitate subsequent diffusion training, we also regularize the latent distribution with a KL divergence loss $\mathcal{L}_{kl}$. The total loss is $\mathcal{L}_{total}=\mathcal{L}_{img}+\lambda_{mg}\mathcal{L}_{mg}+\lambda_{kl}\mathcal{L}_{kl}$, where $\lambda_{mg}$ and $\lambda_{kl}$ are the respective loss weights.
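The way the loss terms combine can be sketched as below. The KL term assumes a diagonal-Gaussian posterior regularized toward a standard normal, which is a common VAE choice but is not spelled out in this section. The weight values are placeholders, and the image loss is passed in as a precomputed scalar because rendering is outside the scope of the example.

```python
import numpy as np

def mesh_guided_loss(delta_p_pred, delta_p_interp):
    """Eq. (9): squared L2 error summed over frames, Gaussians, and xyz."""
    return float(np.sum((delta_p_pred - delta_p_interp) ** 2))

def kl_to_standard_normal(mu, logvar):
    """Assumed form: KL(N(mu, exp(logvar)) || N(0, I)), averaged over entries."""
    return float(0.5 * np.mean(mu ** 2 + np.exp(logvar) - 1.0 - logvar))

def total_loss(l_img, l_mg, l_kl, lambda_mg=1.0, lambda_kl=1e-6):
    """L_total = L_img + lambda_mg * L_mg + lambda_kl * L_kl (placeholder weights)."""
    return l_img + lambda_mg * l_mg + lambda_kl * l_kl

rng = np.random.default_rng(0)
T, N_G, L, C = 4, 100, 16, 8                                 # toy sizes
delta_p_interp = rng.standard_normal((T, N_G, 3))            # pseudo ground truth
delta_p_pred = delta_p_interp + 0.01 * rng.standard_normal((T, N_G, 3))
mu = 0.1 * rng.standard_normal((T, L, C))
logvar = np.zeros((T, L, C))

l_mg = mesh_guided_loss(delta_p_pred, delta_p_interp)
l_kl = kl_to_standard_normal(mu, logvar)
loss = total_loss(l_img=0.25, l_mg=l_mg, l_kl=l_kl)          # l_img is a placeholder value
```

The mesh-guided term supervises Gaussian positions directly in 3D, while the image term supervises them only through rendering.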

3.2 Gaussian Variation Field Diffusion

The diffusion process can be formalized as the inversion of a discrete-time Markov forward process. Let $\bm{z}^{0}\in\mathbb{R}^{T\times L\times C}$ denote our initial latent of the Gaussian Variation Field, drawn from the distribution $p(\bm{z})$. During the forward phase, we progressively corrupt this latent sequence by adding Gaussian noise over diffusion steps $s\in[0,S]$, following $\bm{z}^{s}:=\alpha_{s}\bm{z}^{0}+\sigma_{s}\bm{\epsilon}$, where $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\bm{I})$, and $\alpha_{s},\sigma_{s}$ define the noise schedule. After sufficient diffusion steps, $\bm{z}^{S}$ approaches pure Gaussian noise. Generation is achieved by reversing this process, starting from random Gaussian noise $\bm{z}^{S}\sim\mathcal{N}(\mathbf{0},\bm{I})$ and progressively denoising it to recover $\bm{z}^{0}$.
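The forward corruption $\bm{z}^{s}=\alpha_{s}\bm{z}^{0}+\sigma_{s}\bm{\epsilon}$ can be sketched as follows. The specific noise schedule is not given in this section, so the example assumes a variance-preserving cosine schedule with $\alpha_{s}^{2}+\sigma_{s}^{2}=1$. The feature dimension `C=16` and the step count `S=1000` are likewise placeholders.

```python
import numpy as np

def cosine_schedule(s, S):
    """Assumed schedule: alpha_s = cos(pi*s/2S), sigma_s = sin(pi*s/2S)."""
    angle = 0.5 * np.pi * s / S
    return np.cos(angle), np.sin(angle)

def forward_noise(z0, s, S, rng):
    """Sample z^s = alpha_s * z^0 + sigma_s * eps with eps ~ N(0, I)."""
    alpha, sigma = cosine_schedule(s, S)
    eps = rng.standard_normal(z0.shape)
    return alpha * z0 + sigma * eps, eps

rng = np.random.default_rng(0)
T, L, C, S = 4, 512, 16, 1000                                # C and S are placeholders
z0 = 3.0 * rng.standard_normal((T, L, C))                    # stand-in for a VAE latent
z_start, _ = forward_noise(z0, s=0, S=S, rng=rng)            # no noise at s = 0
z_mid, _ = forward_noise(z0, s=S // 2, S=S, rng=rng)         # partially corrupted
z_end, _ = forward_noise(z0, s=S, S=S, rng=rng)              # close to pure noise
```

At $s=0$ the latent is returned unchanged, and at $s=S$ the signal coefficient vanishes, so `z_end` is distributed like the standard normal noise from which generation starts.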

The compact latent space enables us to build our diffusion model upon the powerful Diffusion Transformer (DiT) architecture [52]. As illustrated in Figure 3, the model takes the noise-corrupted latent as input and processes it through a series of transformer blocks for denoising. Each transformer block incorporates diffusion timestep information through adaptive layer normalization (adaLN) and a gating mechanism. Beyond the standard spatial self-attention layers, we introduce dedicated temporal self-attention layers to ensure coherent motion generation across the sequence.
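The difference between spatial and temporal self-attention on a latent of shape $T\times L\times C$ comes down to which axis the tokens are taken along, as the sketch below illustrates. It is a simplified sketch: the attention has no learned projections, the ordering "spatial, then temporal" with residual connections is an assumption, and adaLN, gating, and cross-attention are omitted.

```python
import numpy as np

def self_attention(x):
    """Batched single-head self-attention over axis -2 (no learned weights)."""
    logits = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ x

def spatial_attention(z):
    """Tokens attend to the other L tokens of the same frame; z is [T, L, C]."""
    return z + self_attention(z)

def temporal_attention(z):
    """Each latent token attends to itself across the T frames."""
    z_t = np.swapaxes(z, 0, 1)                               # [L, T, C]
    z_t = z_t + self_attention(z_t)
    return np.swapaxes(z_t, 0, 1)                            # back to [T, L, C]

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 16, 8))                          # toy [T, L, C]
z_perturbed = z.copy()
z_perturbed[1] += 1.0                                        # change frame 1 only

spatial_a = spatial_attention(z)
spatial_b = spatial_attention(z_perturbed)
full_a = temporal_attention(spatial_a)
full_b = temporal_attention(spatial_b)
```

With spatial attention alone, perturbing frame 1 leaves the output for frame 0 untouched. Only after the temporal layer does information flow between frames, which is why a dedicated temporal layer is needed for coherent motion.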

To condition the generation process, we inject two types of features through cross-attention layers: (1) visual features $\bm{\mathcal{C}}^{v}=\{\bm{C}^{v}_{t}\}_{t=1}^{T}$ extracted from input video frames using DINOv2 [48], and (2) geometric features $\bm{\mathcal{C}}^{GS}=G_{1}^{fps}$ obtained by farthest point sampling of the canonical GS. We further incorporate positional embeddings based on the canonical GS positions $\bm{p}_{1}^{fps}$ in our diffusion transformer, which strengthens the model’s awareness of correspondences between canonical GS and their variation fields during the denoising process, thereby improving the generation quality.

We parameterize our diffusion model $\hat{\bm{v}}_{\theta}$ to predict the velocity $\bm{v}^{s}:=\alpha_{s}\bm{\epsilon}-\sigma_{s}\bm{z}^{0}$ at each diffusion step $s$. The diffusion model is trained using:

$$\mathcal{L}_{\text{simple}}=\mathbb{E}_{s,\bm{z}^{0},\bm{\epsilon}}\left[\left\|\hat{\bm{v}}_{\theta}\left(\alpha_{s}\bm{z}^{0}+\sigma_{s}\bm{\epsilon},s,\bm{\mathcal{C}}\right)-\bm{v}^{s}\right\|_{2}^{2}\right], \tag{10}$$

where $\bm{\mathcal{C}}=\{\bm{\mathcal{C}}^{v},\bm{\mathcal{C}}^{GS}\}$ represents the conditional features of both $\bm{\mathcal{C}}^{v}$ and $\bm{\mathcal{C}}^{GS}$.
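The velocity target and its relation to the clean latent can be checked numerically. The sketch below assumes a variance-preserving schedule ($\alpha_{s}^{2}+\sigma_{s}^{2}=1$), under which $\bm{z}^{0}=\alpha_{s}\bm{z}^{s}-\sigma_{s}\bm{v}^{s}$ and $\bm{\epsilon}=\sigma_{s}\bm{z}^{s}+\alpha_{s}\bm{v}^{s}$. The helper names are hypothetical, and the loss is averaged over elements rather than summed, which differs from the squared norm in the training objective only by a constant factor.

```python
import numpy as np

def velocity_target(z0, eps, alpha_s, sigma_s):
    """v^s = alpha_s * eps - sigma_s * z^0."""
    return alpha_s * eps - sigma_s * z0

def v_loss(v_pred, v_true):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_true) ** 2))

def recover_clean_and_noise(z_s, v, alpha_s, sigma_s):
    """Invert the parameterization; valid only when alpha_s^2 + sigma_s^2 = 1."""
    z0 = alpha_s * z_s - sigma_s * v
    eps = sigma_s * z_s + alpha_s * v
    return z0, eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((24, 512, 16))
eps = rng.standard_normal(z0.shape)
alpha_s, sigma_s = np.cos(0.3), np.sin(0.3)   # an arbitrary point on the assumed schedule

z_s = alpha_s * z0 + sigma_s * eps
v = velocity_target(z0, eps, alpha_s, sigma_s)
z0_rec, eps_rec = recover_clean_and_noise(z_s, v, alpha_s, sigma_s)
loss_perfect = v_loss(v, v)                   # a perfect prediction gives zero loss
```

In a sampler, the same inversion lets a predicted velocity be converted into an estimate of $\bm{z}^{0}$ at every denoising step.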

Figure 4: Qualitative comparison with previous state-of-the-art methods. Our model directly learns the distribution of Gaussian Variation Fields, enabling high-fidelity 4D generation with coherent temporal dynamics.

3.3 Inference Pipeline

During inference, our framework operates in a sequential pipeline. First, we obtain the canonical GS $G_{1}$ for the first frame using a pretrained 3D diffusion model [81]. Given an input video sequence $\{I_{t}\}_{t=1}^{T}$, we extract visual features and combine them with the farthest sampled canonical Gaussians as conditioning signals for our diffusion model. The diffusion model generates latent codes $\bm{z}$, which are subsequently decoded to obtain the Gaussian Variation Field $\bm{\mathcal{V}}$. The final animated Gaussian representation $G_{t}$ for each frame is obtained by applying these variations to the canonical Gaussians, effectively creating high-fidelity temporally coherent 4D animations.
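The last step of the pipeline, applying a decoded variation field to the canonical Gaussians, can be sketched as follows. This section does not state how variations are composed with the canonical attributes, so the purely additive rule used here is an assumption for illustration (rotations in particular might instead be composed as quaternion products). The `GaussianAttrs` container and all function names are hypothetical.

```python
from dataclasses import dataclass, fields
import numpy as np

@dataclass
class GaussianAttrs:
    """Per-Gaussian attributes, or their per-frame deltas, for N Gaussians."""
    position: np.ndarray  # (N, 3)
    scale: np.ndarray     # (N, 3)
    rotation: np.ndarray  # (N, 4) quaternion
    color: np.ndarray     # (N, 3)
    opacity: np.ndarray   # (N, 1)

def zero_attrs(n):
    return GaussianAttrs(np.zeros((n, 3)), np.zeros((n, 3)), np.zeros((n, 4)),
                         np.zeros((n, 3)), np.zeros((n, 1)))

def apply_variation(canonical, delta):
    """ASSUMPTION: every attribute is updated additively, G_t = G_1 + delta_t."""
    return GaussianAttrs(**{
        f.name: getattr(canonical, f.name) + getattr(delta, f.name)
        for f in fields(GaussianAttrs)
    })

def animate(canonical, variation_field):
    """Produce one set of Gaussians per frame from the per-frame deltas."""
    return [apply_variation(canonical, d) for d in variation_field]

n = 8
canonical = zero_attrs(n)
canonical.rotation[:, 0] = 1.0            # identity quaternion
variation_field = []
for t in range(3):                        # toy motion: translate along x over time
    d = zero_attrs(n)
    d.position[:, 0] = 0.1 * t
    variation_field.append(d)
frames = animate(canonical, variation_field)
```

Whatever the exact composition rule, the structure is the same: a single canonical set of Gaussians is shared by all frames, and only the per-frame variations carry motion.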

4 Experiments

4.1 Dataset and Metrics

We conduct our experiments on Objaverse-V1 and Objaverse-XL [13], following previous work in 4D content generation. After filtering for objects with high-quality animations, we utilize 34K objects for training. To evaluate video-to-4D generation quality, we construct a test set of 100 objects by combining 7 instances from the widely used Consistent4D [28] test set with 93 additional test instances from Objaverse-XL, enabling a thorough comparison with previous works. We render 4 novel views at each timestep for each instance. We assess the generation quality using multiple metrics: PSNR, LPIPS [96], and SSIM for frame-wise quality, and FVD [70] for temporal consistency of the generated sequences. All evaluations are performed on renderings at $512\times 512$ resolution.
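Of the frame-wise metrics, PSNR is the simplest to state explicitly: $\mathrm{PSNR}=10\log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$. The sketch below computes it for images with values in $[0,1]$ and averages over a set of rendered views. The function names and the plain averaging over views are assumptions, since the exact evaluation protocol is not spelled out in this section.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mean_psnr(preds, targets):
    """Average PSNR over paired renderings (e.g. all views and timesteps)."""
    return float(np.mean([psnr(p, t) for p, t in zip(preds, targets)]))

# Toy example: four 512x512 RGB views, each off by a constant 0.1.
targets = [np.zeros((512, 512, 3)) for _ in range(4)]
preds = [t + 0.1 for t in targets]
score = mean_psnr(preds, targets)   # MSE = 0.01, i.e. about 20 dB
```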

4.2 Implementation Details

For our VAE implementation, the canonical Mesh-to-GS autoencoder builds upon Trellis [81], with training conducted in two stages: finetuning the sparse GS decoder $\mathcal{D}_{GS}$ on canonical 3D data only for 150K iterations, followed by joint training with the other modules for 200K iterations on 4D animation data. The VAE architecture employs point cloud size $N=8192$, latent size $L=512$, and feature dimension $C=16$. We optimize the VAE using AdamW with learning rates of $5\times 10^{-6}$ and $5\times 10^{-5}$ for $\mathcal{D}_{GS}$ and the other modules, respectively, using a batch size of 32. The diffusion model is trained on 24-frame sequences using the AdamW optimizer with identical learning rate and batch size over 1300K iterations. We apply the cosine noise schedule [45] with 1000 timesteps for training the diffusion model. We set $T=24$ for training and $T=32$ during inference to compare with prior works. For more implementation details, please refer to the supplementary materials.
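For reference, the hyperparameters above can be collected in a small configuration object, which also makes the size of the latent sequence explicit. The class and field names are hypothetical; only the numeric values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    num_points: int = 8192           # N, point cloud size
    latent_tokens: int = 512         # L, latent size
    latent_channels: int = 16        # C, feature dimension
    train_frames: int = 24           # T during training
    infer_frames: int = 32           # T during inference
    lr_gs_decoder: float = 5e-6      # learning rate for the sparse GS decoder
    lr_other_modules: float = 5e-5   # learning rate for the remaining VAE modules
    batch_size: int = 32
    diffusion_timesteps: int = 1000

    def latent_numel(self, frames: int) -> int:
        """Number of scalars in a T x L x C latent sequence."""
        return frames * self.latent_tokens * self.latent_channels

cfg = TrainingConfig()
train_numel = cfg.latent_numel(cfg.train_frames)   # 24 * 512 * 16 = 196,608
infer_numel = cfg.latent_numel(cfg.infer_frames)   # 32 * 512 * 16 = 262,144
```

A 24-frame training sequence is therefore represented by fewer than 200K latent values, which is what makes a transformer-based diffusion model over the whole sequence tractable.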

Figure 5: More generation results of our model including in-the-wild videos (left) and videos from test set (right).

4.3 Main Results

Quantitative comparisons. We compare the video-to-4D generation results of our model with previous state-of-the-art methods, including both optimization-based approaches [28, 88, 79, 56] and a feedforward approach [57]. As shown in Table 1, our method consistently outperforms existing approaches across all quality metrics, demonstrating both superior reconstruction fidelity and better temporal coherence. Unlike the optimization-based works [28, 88, 79, 56], which require minutes to hours of optimization, our approach is also more efficient, taking 4.5 seconds to generate a 4D animation sequence (3.0 seconds for canonical GS creation and 1.5 seconds for Gaussian Variation Field diffusion), only slightly slower than the feedforward reconstruction method L4GM [57]. These quantitative results collectively validate both the effectiveness and efficiency of our proposed method.

Qualitative comparisons. We also provide qualitative comparisons with previous state-of-the-art methods in Figure 4. The SDS-based approaches [28, 88] tend to generate results with blurry textures and poor geometry. The feedforward method L4GM leverages multiview images generated from a 2D generative prior [61] to reconstruct the 4DGS sequences. However, the results of L4GM suffer from 3D inconsistency among the generated multiview images. In contrast, our model directly generates the canonical GS and the Gaussian Variation Fields, and is capable of creating high-fidelity, 3D-consistent animations with coherent temporal dynamics.

More visualization of generated results. Figure 5 presents additional generation results from our method, including examples conditioned on both in-the-wild videos (left two cases) and test set videos (right two cases). Our model demonstrates high-quality generation capability with faithful motion reproduction. Despite being trained on synthetic data, the model exhibits strong generalization capability by effectively capturing motion patterns from in-the-wild video inputs. Furthermore, the model successfully handles challenging multi-object scenarios, highlighting the robustness of our approach.

Table 2: Ablation study of key factors in our VAE.

| Config. | Encoder Query Type | Mesh-guided Loss | Variation Attrs. | PSNR↑ | LPIPS↓ | SSIM↑ |
|---|---|---|---|---|---|---|
| A. | $\bm{p}_{t}^{fps}$ | ✗ | $\Delta\bm{p}_{t},\Delta\bm{s}_{t},\Delta\bm{q}_{t}$ | 23.25 | 0.0678 | 0.936 |
| B. | $\bm{p}_{t}^{fps}$ | ✓ | $\Delta\bm{p}_{t},\Delta\bm{s}_{t},\Delta\bm{q}_{t}$ | 26.17 | 0.0544 | 0.950 |
| C. | $\Delta\bm{p}_{t}^{fps}$ | ✓ | $\Delta\bm{p}_{t},\Delta\bm{s}_{t},\Delta\bm{q}_{t}$ | 28.58 | 0.0478 | 0.958 |
| D. (Ours) | $\Delta\bm{p}_{t}^{fps}$ | ✓ | $\Delta\bm{p}_{t},\Delta\bm{s}_{t},\Delta\bm{q}_{t},\Delta\bm{c}_{t},\Delta\alpha_{t}$ | 29.28 | 0.0439 | 0.964 |

Figure 6: Visual ablation of VAE.
Figure 7: Our model is also capable of creating animations for existing 3D assets with conditional videos.
Table 3: Ablation study of our diffusion model.

| Method | PSNR↑ | LPIPS↓ | SSIM↑ | CLIP↑ | FVD↓ |
|---|---|---|---|---|---|
| Ours w/o Pos. Emb. | 17.86 | 0.121 | 0.897 | 0.931 | 547.20 |
| Ours | 18.47 | 0.114 | 0.901 | 0.935 | 476.83 |

4.4 Ablation Study

Ablation of our VAE. In Table 2 and Figure 6, we analyze key components of our VAE. Our baseline (Config. A) uses the positions of the farthest sampled canonical GS $\bm{p}_{t}^{fps}$ as the query for the encoder’s cross-attention layer, with variation attributes limited to position $\Delta\bm{p}_{t}$, scaling $\Delta\bm{s}_{t}$, and rotation $\Delta\bm{q}_{t}$, following previous 4DGS works [76]. Since we do not have ground-truth GS motion for explicit supervision, the VAE initially struggles with motion learning. Once equipped with our mesh-guided loss, the motion reconstruction capability is effectively improved through pseudo displacement supervision (Config. B). We then replace the encoding query with the motion-aware $\Delta\bm{p}_{t}^{fps}$ obtained using mesh-guided interpolation, which successfully handles most of the motion sequences (Config. C). Finally, to give the model more flexibility to handle complex motion sequences, we add the color $\Delta\bm{c}_{t}$ and opacity $\Delta\alpha_{t}$ attributes to the variation fields, which further enhances the reconstruction capability of the VAE (Config. D).

Ablation of our diffusion model. We examine the importance of positional embeddings in our diffusion model training in Table 3. By incorporating a positional prior based on canonical GS positions $\bm{p}_{1}^{fps}$, the diffusion transformer better captures the correspondence between spatial positions and their variations. Removing these positional embeddings degrades every metric (e.g., PSNR drops from 18.47 to 17.86 and FVD rises from 476.83 to 547.20), demonstrating their crucial role in achieving high-quality results.

4.5 Application

Despite being trained on single video inputs, our model is effective at animating existing 3D models according to the motions depicted in conditioning videos. More details are included in the supplementary materials. As demonstrated in Figure 7, this approach produces high-quality animations that faithfully reproduce the target motions. For real-world applications, users can therefore first generate 2D animations from the rendered images of their 3D models using off-the-shelf video diffusion models [47, 33, 34, 18], then employ our model to create the corresponding 4D animations.

5 Conclusion

In this paper, we introduce a novel framework to address the challenging task of 4D generative modeling. To efficiently construct the large-scale training dataset and reduce the modeling difficulty for diffusion, we first introduce a Direct 4DMesh-to-GS Variation Field VAE, which is able to efficiently compress complex motion information into a compact latent space without requiring costly per-instance fitting. We then train a Gaussian Variation Field diffusion model that generates high-quality dynamic variation fields conditioned on input videos and canonical 3DGS. By decomposing 4D generation into canonical 3DGS generation and Gaussian Variation Field modeling, our method significantly reduces computational complexity while maintaining high fidelity. Quantitative and qualitative evaluations demonstrate that our approach consistently outperforms existing methods. Furthermore, our model exhibits remarkable generalization capability with in-the-wild video inputs, advancing the state of high-quality animated 3D content generation.

Acknowledgments. We extend our gratitude to all the reviewers for their constructive feedback. We also appreciate Jiaqi Lou for the assistance with chart refinement and the production of the supplementary video.

References

  • Bahmani et al. [2024] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024.
  • Blattmann et al. [2023a] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
  • Blattmann et al. [2023b] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023b.
  • Cao and Johnson [2023] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.
  • Cao et al. [2024] Wei Cao, Chang Luo, Biao Zhang, Matthias Nießner, and Jiapeng Tang. Motion2vecsets: 4d latent vector set diffusion for non-rigid shape reconstruction and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20496–20506, 2024.
  • Cao et al. [2023] Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, and Ziwei Liu. Large-vocabulary 3d diffusion model with transformer. arXiv preprint arXiv:2309.07920, 2023.
  • Chan et al. [2022] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.
  • Chen et al. [2023a] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:2303.13873, 2023a.
  • Chen et al. [2023b] Zhaoxi Chen, Fangzhou Hong, Haiyi Mei, Guangcong Wang, Lei Yang, and Ziwei Liu. Primdiffusion: Volumetric primitives diffusion for 3d human generation. Advances in Neural Information Processing Systems, 36:13664–13677, 2023b.
  • Chen et al. [2024] Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, et al. 3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion. arXiv preprint arXiv:2409.12957, 2024.
  • Cotton and Peyton [2024] R James Cotton and Colleen Peyton. Dynamic gaussian splatting from markerless motion capture reconstruct infants movements. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 60–68, 2024.
  • Deitke et al. [2023a] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023a.
  • Deitke et al. [2023b] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023b.
  • Deng et al. [2022] Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. Gram: Generative radiance manifolds for 3d-aware image generation. In IEEE/CVF International Conference on Computer Vision, 2022.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Fridovich-Keil et al. [2023] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023.
  • Gao et al. [2022] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. arXiv preprint arXiv:2209.11163, 2022.
  • Girdhar et al. [2024] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Factorizing text-to-video generation by explicit image conditioning. In European Conference on Computer Vision, pages 205–224. Springer, 2024.
  • Guo et al. [2023] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  • Gupta et al. [2023] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023.
  • Hang et al. [2023] Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. arXiv preprint arXiv:2303.09556, 2023.
  • He et al. [2024] Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yangguang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. Gvgen: Text-to-3d generation with volumetric representation. arXiv preprint arXiv:2403.12957, 2024.
  • Ho et al. [2022] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  • Hong et al. [2023] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.
  • Huang et al. [2024] Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4220–4230, 2024.
  • Jaegle et al. [2021a] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795, 2021a.
  • Jaegle et al. [2021b] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pages 4651–4664. PMLR, 2021b.
  • Jiang et al. [2023] Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, and Yao Yao. Consistent4d: Consistent 360° dynamic object generation from monocular video. arXiv preprint arXiv:2311.02848, 2023.
  • Jun and Nichol [2023] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  • Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
  • Kong et al. [2025] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Dax Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, and Caesar Zhong. Hunyuanvideo: A systematic framework for large video generative models, 2025.
  • Kuaishou [2024] Kuaishou. Kling. https://klingai.kuaishou.com, 2024.
  • Lan et al. [2024] Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, and Chen Change Loy. Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation. In ECCV, 2024.
  • Li et al. [2024a] Mengtian Li, Shengxiang Yao, Zhifeng Xie, Keyu Chen, and Yu-Gang Jiang. Gaussianbody: Clothed human reconstruction via 3d gaussian splatting. arXiv preprint arXiv:2401.09720, 2024a.
  • Li et al. [2024b] Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner. arXiv preprint arXiv:2405.14979, 2024b.
  • Liang et al. [2024a] Hanwen Liang, Yuyang Yin, Dejia Xu, Hanxue Liang, Zhangyang Wang, Konstantinos N Plataniotis, Yao Zhao, and Yunchao Wei. Diffusion4d: Fast spatial-temporal consistent 4d generation via video diffusion models. arXiv preprint arXiv:2405.16645, 2024a.
  • Liang et al. [2024b] Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, and Yingcong Chen. Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6517–6526, 2024b.
  • Liu et al. [2023] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.
  • Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  • Luiten et al. [2023] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
  • Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Müller et al. [2023] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023.
  • Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  • Ntavelis et al. [2023] Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc Van Gool, and Sergey Tulyakov. Autodecoding latent 3d diffusion models. arXiv preprint arXiv:2307.05445, 2023.
  • OpenAI [2024] OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators, 2024.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • Pan et al. [2024] Zijie Pan, Zeyu Yang, Xiatian Zhu, and Li Zhang. Efficient4d: Fast dynamic 3d object generation from a single-view video. arXiv preprint arXiv:2401.08742, 2024.
  • Park et al. [2021a] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021a.
  • Park et al. [2021b] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228, 2021b.
  • Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  • Poole et al. [2022] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  • Pumarola et al. [2021] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Ren et al. [2023] Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142, 2023.
  • Ren et al. [2025] Jiawei Ren, Cheng Xie, Ashkan Mirzaei, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, Huan Ling, et al. L4gm: Large 4d gaussian reconstruction model. Advances in Neural Information Processing Systems, 37:56828–56858, 2025.
  • Ren et al. [2024] Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4209–4219, 2024.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Shi et al. [2023a] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023a.
  • Shi et al. [2023b] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023b.
  • Shue et al. [2023] J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20875–20886, 2023.
  • Singer et al. [2023] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dynamic scene generation. arXiv preprint arXiv:2301.11280, 2023.
  • Skorokhodov et al. [2023] Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, and Sergey Tulyakov. 3d generation on imagenet. arXiv preprint arXiv:2303.01416, 2023.
  • Sun et al. [2023] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. arXiv preprint arXiv:2310.16818, 2023.
  • Tang et al. [2023a] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023a.
  • Tang et al. [2023b] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22819–22829, 2023b.
  • Tang et al. [2024] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. arXiv preprint arXiv:2402.05054, 2024.
  • Tang et al. [2023c] Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459, 2023c.
  • Unterthiner et al. [2019] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019.
  • Vahdat et al. [2022] Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis, et al. Lion: Latent point diffusion models for 3d shape generation. Advances in Neural Information Processing Systems, 35:10021–10039, 2022.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • Wang and Shi [2023] Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint arXiv:2312.02201, 2023.
  • Wang et al. [2023a] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4573, 2023a.
  • Wang et al. [2023b] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023b.
  • Wu et al. [2024a] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20310–20320, 2024a.
  • Wu et al. [2016] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. Advances in Neural Information Processing Systems, 29, 2016.
  • Wu et al. [2024b] Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer. arXiv preprint arXiv:2405.14832, 2024b.
  • Wu et al. [2024c] Zijie Wu, Chaohui Yu, Yanqin Jiang, Chenjie Cao, Fan Wang, and Xiang Bai. Sc4d: Sparse-controlled video-to-4d generation and motion transfer. In European Conference on Computer Vision, pages 361–379. Springer, 2024c.
  • Xiang et al. [2022] Jianfeng Xiang, Jiaolong Yang, Yu Deng, and Xin Tong. Gram-hd: 3d-consistent image generation at high resolution with generative radiance manifolds. arXiv preprint arXiv:2206.07255, 2022.
  • Xiang et al. [2024] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024.
  • Xie et al. [2024] Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, and Varun Jampani. Sv4d: Dynamic 3d content generation with multi-frame and multi-view consistency. arXiv preprint arXiv:2407.17470, 2024.
  • Xiong et al. [2024] Bojun Xiong, Si-Tong Wei, Xin-Yang Zheng, Yan-Pei Cao, Zhouhui Lian, and Peng-Shuai Wang. Octfusion: Octree-based diffusion models for 3d shape generation. arXiv preprint arXiv:2408.14732, 2024.
  • Yang et al. [2024a] Haitao Yang, Yuan Dong, Hanwen Jiang, Dejia Xu, Georgios Pavlakos, and Qixing Huang. Atlas gaussians diffusion for 3d generation with infinite number of points. arXiv preprint arXiv:2408.13055, 2024a.
  • Yang et al. [2024b] Zeyu Yang, Zijie Pan, Chun Gu, and Li Zhang. Diffusion 2: Dynamic 3d content generation via score composition of video and multi-view diffusion models. arXiv preprint arXiv:2404.02148, 2024b.
  • Yariv et al. [2024] Lior Yariv, Omri Puny, Oran Gafni, and Yaron Lipman. Mosaic-sdf for 3d generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4630–4639, 2024.
  • Yin et al. [2023] Yuyang Yin, Dejia Xu, Zhangyang Wang, Yao Zhao, and Yunchao Wei. 4dgen: Grounded 4d content generation with spatial-temporal consistency. arXiv preprint arXiv:2312.17225, 2023.
  • Zeng et al. [2024] Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, and Yao Yao. Stag4d: Spatial-temporal anchored generative 4d gaussians. In European Conference on Computer Vision, pages 163–179. Springer, 2024.
  • Zhang and Sennrich [2019] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
  • Zhang et al. [2022] Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, and Baining Guo. Styleswin: Transformer-based gan for high-resolution image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11304–11314, 2022.
  • Zhang et al. [2023] Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023.
  • Zhang et al. [2024a] Bowen Zhang, Yiji Cheng, Chunyu Wang, Ting Zhang, Jiaolong Yang, Yansong Tang, Feng Zhao, Dong Chen, and Baining Guo. Rodinhd: High-fidelity 3d avatar generation with diffusion models. arXiv preprint arXiv:2407.06938, 2024a.
  • Zhang et al. [2024b] Bowen Zhang, Yiji Cheng, Jiaolong Yang, Chunyu Wang, Feng Zhao, Yansong Tang, Dong Chen, and Baining Guo. Gaussiancube: A structured and explicit radiance representation for 3d generative modeling. arXiv preprint arXiv:2403.19655, 2024b.
  • Zhang et al. [2025] Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. 4diffusion: Multi-view video diffusion model for 4d generation. Advances in Neural Information Processing Systems, 37:15272–15295, 2025.
  • Zhang et al. [2024c] Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024c.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  • Zhao et al. [2024] Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in Neural Information Processing Systems, 36, 2024.
  • Zheng et al. [2022] Xinyang Zheng, Yang Liu, Pengshuai Wang, and Xin Tong. Sdf-stylegan: Implicit sdf-based stylegan for 3d shape generation. In Computer Graphics Forum, pages 52–63, 2022.
  • Zheng et al. [2023] Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. Locally attentional sdf diffusion for controllable 3d shape generation. ACM Transactions on Graphics (TOG), 42(4):1–13, 2023.
  • Zhu et al. [2018] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum, and Bill Freeman. Visual object networks: Image generation with disentangled 3d representations. Advances in Neural Information Processing Systems, 31, 2018.

Appendix A Additional Implementation Details

Table 4: Detailed configuration of model architecture. SW and FFN denote “Shifted Window” and “FeedForward Net”. MSA, MSSA, MTSA, and MCA stand for “Multihead Self-Attention”, “Multihead Spatial Self-Attention”, “Multihead Temporal Self-Attention”, and “Multihead Cross-Attention”, respectively.
| Network | #Layer | #Dim. | #Head | Block Arch. | Special Modules | #Param. |
|---|---|---|---|---|---|---|
| $\bm{\mathcal{E}}_{GS}$ | 12 | 768 | 12 | 3D-SW-MSA + FFN | 3D Swin Attn. | 85.8M |
| $\bm{\mathcal{D}}_{GS}$ | 12 | 768 | 12 | 3D-SW-MSA + FFN | 3D Swin Attn. | 85.1M |
| VAE Transformer | 12 | 768 | 12 | MSA / MCA + FFN | - | 125.21M |
| Diffusion | 12 | 512 | 16 | MSSA + MTSA + MCA + FFN | QK Norm. | 105.51M |

A.1 Model Architecture

We detail the architecture of each model below; Table 4 summarizes the configurations.

A.1.1 Gaussian Variation Field Encoder

Our encoder mainly comprises two parts: the canonical GS autoencoder, consisting of $\mathcal{E}_{GS}$ and $\mathcal{D}_{GS}$, and a cross-attention layer that creates the latent space for Gaussian Variation Fields.

For the canonical GS autoencoder, we adopt the model architecture from [81], which introduces a Structured Latent (SLat) representation for static 3D assets. This representation defines a set of local latents on a 3D grid, where each latent is associated with an active voxel intersecting the surface of the 3D asset. The SLat representation captures both the overall structure through active voxels and fine details through local latent codes. The canonical GS autoencoder is built using a transformer-based architecture. It first aggregates visual features from multiview images using a pre-trained DINOv2 [48] encoder to create voxelized features. These features are then processed by a sparse transformer encoder that handles variable-length tokens corresponding to active voxels. The transformer incorporates shifted window attention in 3D space to enhance local information interaction while maintaining computational efficiency. The encoder outputs structured latents that follow a distribution regularized through KL-divergence penalties, which are then decoded to various representations. For this work, we only leverage its Gaussian representation decoder for our canonical GS autoencoding. $\mathcal{D}_{GS}$ is set to resolution 64 and decodes 8 Gaussians per voxel. We finetune the decoder $\mathcal{D}_{GS}$ while keeping the encoder $\mathcal{E}_{GS}$ frozen.

For the cross-attention layer, we adopt the vanilla full attention [72] implementation. We use the motion-aware $\Delta\bm{p}^{fps}_{t}\in\mathbb{R}^{512\times 3}$, obtained with the proposed mesh-guided interpolation mechanism, as the query, and the point displacement fields $\Delta P_{t}\in\mathbb{R}^{8192\times 3}$ from the mesh as keys and values. The latent representation $\bm{z}\in\mathbb{R}^{512\times 16}$ is then obtained after the cross-attention layer.
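The query/key/value roles described above can be illustrated with a minimal single-head cross-attention sketch in plain Python. It is not the authors' implementation: the projection matrices `Wq`, `Wk`, `Wv` are random placeholders, the sizes are scaled down (the text uses 512 queries, 8192 mesh points, and a 16-dimensional latent), and the sketch projects directly to the latent width rather than going through a 768-dimensional hidden layer.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head cross-attention: every query attends over all key/value points."""
    q = [matvec(Wq, x) for x in queries]
    k = [matvec(Wk, x) for x in keys_values]
    v = [matvec(Wv, x) for x in keys_values]
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        attn = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(attn, v)) for d in range(len(v[0]))])
    return out


rng = random.Random(0)
N_QUERY, N_POINTS, LATENT_DIM = 8, 64, 16  # toy sizes; the text uses 512, 8192, 16


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


delta_p_fps = rand_matrix(N_QUERY, 3)  # stand-in for the motion-aware queries
delta_P = rand_matrix(N_POINTS, 3)     # stand-in for the mesh displacement field
Wq, Wk, Wv = (rand_matrix(LATENT_DIM, 3) for _ in range(3))  # placeholder weights

z = cross_attention(delta_p_fps, delta_P, Wq, Wk, Wv)
print(len(z), len(z[0]))  # one latent row per query
```

The output has one row per query, which is why the latent inherits the query count (512 in the text) rather than the number of mesh points.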

A.1.2 Gaussian Variation Field Decoder

For the Gaussian Variation Field decoder, we first adopt 12 layers of vanilla self-attention for thorough information exchange. A final cross-attention layer then decodes the Gaussian Variation Fields $\Delta G_{t}\in\mathbb{R}^{N_{G}\times 14}$: the output features of the last self-attention layer serve as keys and values, and all parameters of $G_{1}\in\mathbb{R}^{N_{G}\times 14}$ serve as the query, where $N_{G}$ is the total number of canonical GS.

A.1.3 Canonical GS Generation Model

We adopt the model architecture from [81] to generate the structured latent representation that is further decoded to canonical GS. Generation follows a two-stage process. First, a structure generator creates the sparse structure by denoising a low-resolution feature grid using transformer blocks with adaptive layer normalization and cross-attention for condition injection. Then, a latent generator $\bm{\mathcal{G}}_{\mathrm{L}}$ generates local latents for the given structure using a sparse transformer architecture with downsampling and upsampling blocks. Both generators apply RMSNorm [89] to the queries and keys (QK Norm.) in diffusion training. They are conditioned on image conditions through CLIP and DINOv2 features, respectively, and are trained separately using a continuous flow matching objective. Since we freeze $\mathcal{E}_{GS}$, we can directly leverage the pretrained image-to-3D model [81] to create canonical GS.

A.1.4 Gaussian Variation Field Diffusion Model

Our Gaussian Variation Field diffusion model builds upon the diffusion transformer architecture [52]. To enable temporal coherence in generation, we introduce a temporal self-attention layer that complements the existing cross-attention, spatial self-attention, and feedforward layers. For video sequence conditioning, we extract frame-wise features using DINOv2 [48] and incorporate the farthest-sampled canonical GS to maintain awareness of the canonical 3D model. To enhance spatial consistency, we incorporate positional priors into the generation process. During training, we encode the Gaussian Variation Field latents along with their corresponding canonical GS positions to formulate positional embeddings. During inference, we directly use the positions of the farthest-sampled canonical GS to compute the positional embeddings.
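To make the difference between the spatial and temporal self-attention layers concrete, the sketch below lists which tokens attend to each other in each layer. It assumes the common factorized design, in which spatial attention mixes the latent tokens of one frame and temporal attention mixes the same latent index across frames; the text above does not spell out this grouping, so treat it as an illustration rather than the model's exact behavior. The function names and toy sizes are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[int, int]  # (frame index t, latent index n)


def spatial_groups(num_frames: int, num_latents: int) -> List[List[Token]]:
    """Spatial self-attention: tokens of the same frame attend to each other."""
    return [[(t, n) for n in range(num_latents)] for t in range(num_frames)]


def temporal_groups(num_frames: int, num_latents: int) -> List[List[Token]]:
    """Temporal self-attention: the same latent index attends across frames."""
    return [[(t, n) for t in range(num_frames)] for n in range(num_latents)]


T, N = 4, 3  # toy sizes; the latent described in A.1.1 has 512 tokens per frame
spatial = spatial_groups(T, N)
temporal = temporal_groups(T, N)
print(len(spatial), len(spatial[0]))    # T groups of N tokens each
print(len(temporal), len(temporal[0]))  # N groups of T tokens each
```

Under this assumption each layer attends over sequences of length N or T rather than N·T, which keeps the attention cost manageable for long sequences.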

A.2 Additional Training and Inference Details

In this paper, we designate the first frame of each video as the canonical frame. For our Direct 4DMesh-to-GS Variation Field VAE training, we set the loss weights as follows: $\lambda_{lpips}=0.2$, $\lambda_{ssim}=0.2$, $\lambda_{mg}=1.0$, and $\lambda_{kl}=10^{-6}$. Computationally, the VAE training requires one day on 32 Nvidia Tesla V100 GPUs (32GB) for the first stage and two days on 8 Nvidia Tesla A100 GPUs (40GB) for joint training, while the diffusion model training spans approximately one week on 8 Nvidia Tesla A100 GPUs (80GB). During inference, we adopt the adaptive mode of DPM-Solver [41] with order 2, requiring approximately 18 steps per instance.
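As a rough illustration of how these weights combine, the sketch below forms a weighted sum of per-term loss values. The four weights come from the text; the unweighted base term (keyed `"l1"` here) and the example loss values are assumptions made only for this illustration, since the exact form of the objective is given in the main paper.

```python
# Weights stated in the text for the VAE training objective.
LOSS_WEIGHTS = {"lpips": 0.2, "ssim": 0.2, "mg": 1.0, "kl": 1e-6}


def total_loss(terms, weights=LOSS_WEIGHTS, base_key="l1"):
    """Weighted sum of loss terms; `base_key` names an assumed unweighted base term."""
    return terms[base_key] + sum(weights[name] * terms[name] for name in weights)


# Made-up per-term values, purely to show the arithmetic.
example_terms = {"l1": 0.05, "lpips": 0.1, "ssim": 0.2, "mg": 0.01, "kl": 1000.0}
loss = total_loss(example_terms)
print(loss)
```

The very small KL weight means that even a large raw KL value contributes little to the total, so the latent distribution is only lightly regularized.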

During inference, we address potential orientation misalignment between the generated canonical GS and input images through an azimuth alignment process similar to [57]. Specifically, we render the canonical GS from multiple azimuth angles and compute image-level losses between these renders and the first video frame. We then transform the canonical GS according to the azimuth angle that yields the minimal loss, ensuring better alignment with the input video.
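The azimuth search described above amounts to an argmin over candidate angles. The sketch below shows that loop with a toy stand-in renderer; `toy_render`, the mean-squared-error loss, and the 36 candidate angles are all assumptions, since the text specifies neither the image-level loss nor the number of angles.

```python
import math
from typing import Callable, List, Tuple

Image = List[float]  # flattened pixel values, standing in for a rendered frame


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two equally sized images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def align_azimuth(render: Callable[[float], Image], first_frame: Image,
                  num_angles: int = 36) -> Tuple[float, float]:
    """Return the candidate azimuth (degrees) with minimal loss, and that loss."""
    candidates = [360.0 * i / num_angles for i in range(num_angles)]
    losses = [mse(render(az), first_frame) for az in candidates]
    best = min(range(num_angles), key=losses.__getitem__)
    return candidates[best], losses[best]


def toy_render(azimuth_deg: float) -> Image:
    """Hypothetical renderer: an 8-pixel 'image' that varies with azimuth."""
    a = math.radians(azimuth_deg)
    return [math.cos(a + k) for k in range(8)]


first_frame = toy_render(90.0)  # pretend the input video was captured at 90 degrees
best_az, best_loss = align_azimuth(toy_render, first_frame)
print(best_az, best_loss)
```

In practice `render` would rasterize the canonical GS at the given azimuth, and the canonical GS would then be rotated by the returned angle before the variation fields are generated.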

The in-the-wild conditional videos shown in the teaser (Figure 1 in the main paper) are created by Kling [34]. The walking astronaut and boxing rat video frames in Figure 5 of the main paper are sourced from Consistent4D and Emu Video [18], respectively.

A.3 Additional Details of Creating Animation for Existing 3D Model

To animate existing 3D models using our approach, users follow a simple pipeline: First, their 3D assets are rendered as multiview images. These images are then processed to extract and aggregate DINOv2 features. Using these features, we construct a canonical Gaussian Splatting representation through our $\mathcal{E}_{GS}$ encoder and $\mathcal{D}_{GS}$ decoder. Finally, animations are generated by our diffusion model, which takes both the canonical GS and a conditional video as input. Users can create these conditional videos using state-of-the-art video diffusion models [47, 33, 34, 18] to specify their desired motion for the 3D model.

Appendix B Data Preparation Details

Our training dataset consists of 34K 3D mesh animations sourced from Objaverse-V1 [13] and Objaverse-XL [12]. For Objaverse-V1, we utilize the curated set of 9K high-quality 3D animations from [38]. For Objaverse-XL, we apply two filtering criteria: first, following [81], we filter out samples whose average aesthetic score (computed with https://github.com/christophschuhmann/improved-aesthetic-predictor) across 4 rendered views of the first frame falls below 5.5; second, we remove sequences with minimal motion. This filtering process yields 25K additional animations from Objaverse-XL.
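The two Objaverse-XL filtering criteria can be sketched as a simple predicate. The 5.5 aesthetic threshold and the 4-view average come from the text; the motion statistic `max_vertex_displacement` and its threshold `MIN_MOTION` are hypothetical, because the text does not say how "minimal motion" is measured.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

AESTHETIC_THRESHOLD = 5.5  # stated in the text
MIN_MOTION = 0.01          # hypothetical; the text gives no numeric motion threshold


@dataclass
class AnimationSample:
    uid: str
    first_frame_view_scores: List[float]  # aesthetic scores of the 4 rendered views
    max_vertex_displacement: float        # hypothetical motion statistic


def passes_filters(sample: AnimationSample) -> bool:
    """Keep a sample only if it is aesthetic enough and actually moves."""
    aesthetic_ok = mean(sample.first_frame_view_scores) >= AESTHETIC_THRESHOLD
    motion_ok = sample.max_vertex_displacement >= MIN_MOTION
    return aesthetic_ok and motion_ok


samples = [
    AnimationSample("a", [6.0, 5.8, 5.2, 5.4], 0.30),   # passes both criteria
    AnimationSample("b", [5.0, 5.2, 4.9, 5.1], 0.30),   # average score too low
    AnimationSample("c", [6.5, 6.1, 6.3, 6.0], 0.001),  # nearly static
]
kept = [s.uid for s in samples if passes_filters(s)]
print(kept)
```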

Appendix C Additional Ablation

Ablation of $\mathcal{D}_{GS}$ Joint Finetuning. We investigate the importance of jointly finetuning the canonical GS decoder during our Direct 4DMesh-to-GS Variation Field VAE training. Starting from a pretrained canonical 3D $\mathcal{D}_{GS}$ checkpoint, we compare two settings: freezing $\mathcal{D}_{GS}$ while training other modules, and jointly training all modules (our approach). As shown in Table 5, joint training improves all three reconstruction metrics, since it allows $\mathcal{D}_{GS}$ to receive feedback from animation reconstruction rather than being limited to static data only. This keeps the canonical GS reconstruction coherent with its corresponding variation fields.

Ablation of hyper-parameters in mesh-guided interpolation. We ablate the hyper-parameters of the interpolation, namely the number of nearest neighbors $K$ and the distance decay rate $\beta$, in Table 6. Our setting ($K=8$, $\beta=7.0$) yields optimal results. Performance is relatively stable for other values, showing reasonable robustness.
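To show what $K$ and $\beta$ control, the sketch below interpolates a displacement for one query point from its $K$ nearest mesh vertices using normalized exponential distance weights. The weight formula, including the normalization of distances by the largest neighbor distance, is an assumption made for illustration; the exact definition is given in the main paper and may differ.

```python
import math
from typing import List, Sequence


def interpolate_displacement(query: Sequence[float],
                             vertices: List[Sequence[float]],
                             displacements: List[Sequence[float]],
                             k: int = 8, beta: float = 7.0) -> List[float]:
    """Blend the displacements of the k nearest vertices with exp(-beta * d / r) weights."""
    nearest = sorted((math.dist(query, v), i) for i, v in enumerate(vertices))[:k]
    radius = max(d for d, _ in nearest) or 1.0  # avoid division by zero
    weights = [math.exp(-beta * d / radius) for d, _ in nearest]
    total = sum(weights)
    return [
        sum(w * displacements[i][axis] for w, (_, i) in zip(weights, nearest)) / total
        for axis in range(3)
    ]


# Toy mesh: vertices on a line, each moving upward by its own x coordinate.
vertices = [(float(x), 0.0, 0.0) for x in range(10)]
displacements = [(0.0, float(x), 0.0) for x in range(10)]

near_vertex_3 = interpolate_displacement((3.1, 0.0, 0.0), vertices, displacements, k=4)
print(near_vertex_3)
```

Under this assumed form, a larger $\beta$ concentrates the weight on the closest vertices, while a larger $K$ draws on a wider neighborhood, which is the trade-off the ablation in Table 6 explores.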

Table 5: Additional ablation of our proposed VAE.
| Model | PSNR$\uparrow$ | LPIPS$\downarrow$ | SSIM$\uparrow$ |
|---|---|---|---|
| Ours w/o $\mathcal{D}_{GS}$ Finetuning | 28.80 | 0.0460 | 0.962 |
| Ours | 29.28 | 0.0439 | 0.964 |

Table 6: Additional ablation of hyper-parameters in our mesh-guided interpolation. Left: varying $K$ with $\beta=7.0$. Right: varying $\beta$ with $K=8$.
| $K$ | $\beta$ | PSNR$\uparrow$ | LPIPS$\downarrow$ | SSIM$\uparrow$ | $K$ | $\beta$ | PSNR$\uparrow$ | LPIPS$\downarrow$ | SSIM$\uparrow$ |
|---|---|---|---|---|---|---|---|---|---|
| 16 | 7.0 | 28.38 | 0.0464 | 0.960 | 8 | 10.0 | 28.55 | 0.0462 | 0.961 |
| 8 | 7.0 | 29.28 | 0.0439 | 0.964 | 8 | 7.0 | 29.28 | 0.0439 | 0.964 |
| 4 | 7.0 | 28.94 | 0.0451 | 0.963 | 8 | 4.0 | 29.04 | 0.0446 | 0.963 |
| 1 | 7.0 | 28.22 | 0.0465 | 0.960 | 8 | 1.0 | 28.64 | 0.0457 | 0.962 |
Figure 8: Sample of our autoregressive generation result.

Appendix D More Results

D.1 Autoregressive Generation Results for Temporal Generalization

Temporal generalization is a known challenge in 2D/3D video generation. In our case, we can employ an autoregressive approach during inference for videos exceeding our training length: the GS from the last frame of a generated segment serves as the canonical GS for inferring the next segment’s variation fields, which allows for coherent long animations. We show a 120-frame generated sample using such an approach in Figure 8.
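The autoregressive scheme described above can be sketched as a loop that re-seeds each segment with the last generated frame. Here `generate_segment` is a hypothetical stand-in for the diffusion model plus VAE decoder, and the segment length of 24 is an assumed value, since the training length is not stated in this appendix; only the 120-frame total comes from the text.

```python
from typing import Callable, List, Sequence


def generate_long_animation(canonical, video_frames: Sequence,
                            generate_segment: Callable, segment_len: int) -> List:
    """Generate segment by segment, re-using each segment's last frame as the next canonical."""
    frames: List = []
    for start in range(0, len(video_frames), segment_len):
        clip = video_frames[start:start + segment_len]
        segment = generate_segment(canonical, clip)
        frames.extend(segment)
        canonical = segment[-1]  # last generated GS seeds the next segment
    return frames


def toy_generate_segment(canonical: int, clip: Sequence[int]) -> List[int]:
    """Toy model: the 'GS' is one integer position and each video frame is a step."""
    out, state = [], canonical
    for step in clip:
        state += step
        out.append(state)
    return out


video = [1] * 120  # 120 conditioning frames, as in the example in the text
animation = generate_long_animation(0, video, toy_generate_segment, segment_len=24)
print(len(animation), animation[0], animation[-1])
```

In the toy model the result matches what a single pass over all 120 frames would produce, which mirrors the continuity that passing the last frame forward is meant to provide across segment boundaries.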

D.2 VAE Reconstruction Results

As illustrated in Figure 11, we demonstrate the reconstruction capabilities of our proposed Direct 4DMesh-to-GS Variation Field VAE. Our method efficiently encodes both canonical GS and their temporal variations from 4D meshes in a single pass, eliminating the need for time-consuming per-instance fitting procedures. The results demonstrate our model’s effectiveness in preserving both geometric fidelity and motion dynamics.

Figure 9: More generation results of real-world input videos.

D.3 More Visual Comparison with SOTA Methods

As illustrated in Figure 12, we provide extensive visual comparisons with state-of-the-art methods. Our approach demonstrates consistent superiority across diverse test cases, achieving better results in terms of both visual fidelity and temporal motion coherence.

D.4 More Results of Animating Existing 3D Models

As shown in Figure 13, we demonstrate additional results showcasing our method’s capability to animate existing 3D models using conditional videos. Our approach successfully extracts and transfers motion patterns from the input videos, generating high-fidelity animations that faithfully preserve both geometric and temporal characteristics.

D.5 Additional Results on Real-World Video Inputs

Although our model is trained on synthetic data, it effectively generalizes to real-world video inputs. Figure 9 presents additional results, demonstrating the model’s robust generalization capabilities.

Appendix E Broader Impact

Like all generative models, our video-to-4D generation framework requires careful consideration of societal implications. While we mitigate certain ethical concerns by training exclusively on synthetic 3D animations from the Objaverse dataset, thus avoiding privacy and copyright issues associated with real-world data, we acknowledge potential risks. The ability to generate animated 3D content from videos could be misused for creating misleading content. We therefore emphasize the importance of establishing clear guidelines for the responsible deployment of video-to-4D generation technology.

Appendix F Limitation Discussion and Future Work

[Figure 10 panels: video condition frame 0; video condition frame 1; canonical GS frame 0; our animation frame 1]
Figure 10: Failure case. When the pretrained static 3D generative model produces canonical GS that are not well-aligned with the conditional video frames, our Gaussian Variation Field diffusion model struggles to bridge this inconsistency, resulting in suboptimal animations.

While our model demonstrates impressive results in video-to-4D generation, it has certain limitations. Our two-stage generation process first employs a pretrained static 3D generative model to create canonical Gaussian Splatting representations, which then serve as conditions for our diffusion model to generate Gaussian Variation Fields. A notable limitation arises when the static 3D generative model [81] produces canonical GS that exhibit discrepancies with the conditional video, such as the mismatched head pose and the incorrect eyes or lighting effects shown in Figure 10, potentially creating inconsistencies in the final animation. To address this limitation, future work could explore either fine-tuning the static 3D model to ensure better image alignment or developing an end-to-end 4D diffusion framework that jointly generates both the canonical representation and its temporal variations.

Figure 11: Additional visual results of VAE reconstruction.
Figure 12: More visual comparison with SOTA methods.
Figure 13: More results of animating existing 3D model input with conditional videos.