MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri∗, Deva Ramanan
Carnegie Mellon University
∗ Equal contribution.

Abstract

We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on Panoptic Studio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on GitHub.

Figure 1: Dynamic Scene Reconstruction from Sparse Views. MonoFusion reconstructs dynamic human behaviors, such as playing the piano or performing CPR, from four equidistant inward-facing static cameras. We visualize the RGB and depth renderings of a $45^{\circ}$ novel view between two training views. Training views are shown below for reference.

1 Introduction

Accurately reconstructing dynamic 3D scenes from multi-view videos is of great interest to the vision community, with applications in AR/VR [49, 36], autonomous driving [37], and robotics [70, 57]. Prior work often studies this problem in the context of dense multi-view videos, which require dedicated capture studios that are prohibitively expensive to build and are difficult to scale to diverse scenes in-the-wild. In this paper, we aim to strike a balance between the ease and informativeness of multi-view data collection by reconstructing skilled human behaviors (e.g., playing a piano and performing CPR) from four equidistant inward-facing static cameras (Fig. 1).

Problem setup.

Despite recent advances in dynamic scene reconstruction [17, 16, 18, 4], current approaches often require dozens of calibrated cameras [38, 23], are category specific [61], or struggle to generate multi-view consistent geometry [33]. We study the problem of reconstructing dynamic human behaviors from an in-the-wild capture studio: a small set of (4) portable cameras with limited overlap but complete scene coverage, such as in the large-scale Ego-Exo4D dataset [20]. We argue that sparse-view limited-overlap reconstruction presents unique challenges not found in dense multi-view setups and typical “sparse view” captures with large covisibility (Fig. 2). For dense multi-view captures, it is often sufficient to rely solely on geometric and photometric cues for reconstruction, often making use of classic techniques from (non-rigid) structure from motion [13]. As a result, these methods fail in sparse-view settings with limited cross-view correspondences.

Figure 2: Problem Setup. Our sparse-view setup (middle) strikes a balance between ill-posed reconstructions from casual monocular captures [18, 43] and well-constrained reconstructions from dense multi-view studio captures [23]. Unlike existing “sparse-view” datasets like DTU [22] and LLFF [39], our setup is more challenging because input views are $90^{\circ}$ apart with limited cross-view correspondences.

Key insights.

We find that initializing sparse-view reconstructions with monocular geometry estimators like MoGe [55] produces higher quality results. However, naively merging independent monocular geometry estimates often yields inconsistent geometry across views (e.g. duplicate structures), which traps the 3D optimization in poor local minima. Instead, we carefully align monocular reconstructions (that are independently predicted for each view and time) to a global reference frame that is learned from a static multi-view reconstructor (like DUSt3R [56]). Furthermore, many challenges in inferring view-consistent and time-consistent depth become dramatically simplified when working with fixed cameras with known poses (inherent to the in-the-wild capture setup that we target). For example, temporally consistent background geometry can be enforced by simply averaging predictions over time.

Contributions.

We present three major contributions.

• We highlight the challenge of reconstructing skilled human behaviors in dynamic environments from sparse-view cameras in-the-wild.
• We demonstrate that monocular reconstruction methods can be extended to the sparse-view setting by carefully incorporating monocular depth and foundational priors.
• We extensively ablate our design choices and show that we achieve state-of-the-art performance on Panoptic Studio and challenging sequences from Ego-Exo4D.

2 Related Work

Dynamic scene reconstruction.

Dynamic scene reconstruction [4] has received significant interest in recent years. While classical work [41, 9] often relies on RGB-D sensors or strong domain knowledge [2, 7], recent approaches [33] based on neural radiance fields [40] have progressed towards reconstructing dynamic scenes in-the-wild from RGB video alone. However, such methods are computationally heavy, can only reconstruct short video clips with limited dynamic movement, and struggle with extreme novel view synthesis. Recently, 3D Gaussian Splatting [26, 38] has accelerated radiance field training and rendering via an efficient rasterization process. Follow-up works [64, 58, 35] repurpose 3DGS to reconstruct dynamic scenes, often by optimizing a fixed set of Gaussians in canonical space and modeling their motion with deformation fields. However, as Gao et al. [18] point out, such methods often struggle to reconstruct realistic videos. Many works address this shortcoming by relying on 2D point tracking priors [54], fusing Gaussians from many timesteps [30], modeling isotropic Gaussians [50], or exploiting domain knowledge such as human body priors [52]. However, these approaches study the reconstruction problem in the monocular setting. As 4D reconstruction from a single viewpoint is under-constrained, practical robotics setups for manipulation [27] and hand-object interaction [29, 11, 53] adopt camera rigs where a sparse set of cameras captures the scene of interest. Similarly, datasets like Ego-Exo4D [20], DROID [27] and H2O [29] explore sparse-view capture for dynamic scenes in-the-wild.

Novel-view synthesis from sparse views.

Both NeRF and 3D Gaussian Splatting require dense input view coverage, which hinders their real-world applicability. Recent works aim to reduce the number of required input views by adding additional supervision and regularization, such as depth [8] or semantics [65]. FSGS [72] builds on Gaussian splatting, producing faithful static geometry from as few as three views by unpooling existing Gaussians and adopting extra depth supervision. In contrast, [60] adds noise to Gaussian attributes and relies on a pre-trained ControlNet [68] to repair low-quality rendered images. Other works such as MVSplat [5] build a cost volume representation and predict Gaussian attributes in a feed-forward manner. However, they can only synthesize novel views with small deviations from the nearest training view. For methods that rely on learned priors, high-quality novel view synthesis is often limited to images within the training distribution. Such methods cannot handle diverse real-world geometry. Diffusion-based reconstruction methods [19, 59, 69] try to generate additional views consistent with the sparse input views, but often produce artifacts. In our case, the four sparse-view cameras are roughly $90^{\circ}$ apart, posing unique challenges.

Feed-forward geometry estimation.

Learning-based methods, such as monocular depth networks, are able to reconstruct 3D objects and scenes by learning strong priors from training data. While early works [10, 14] focus on in-domain depth estimation, recent works build foundational depth models by scaling up training data [46, 63, 55], resolving metric ambiguity from various camera models [21, 44], or relying on priors such as Stable Diffusion [48, 24, 15]. Unfortunately, monocular depth networks are not scale or view consistent, and often require extensive alignment against ground-truth to produce meaningful metric outputs. To address these shortcomings, DUSt3R [56] and MonST3R [67] propose the task of point map estimation, which aims to recover scene geometry as well as camera intrinsics and extrinsics given a pair of input images. These methods unify single-view and multi-view geometry estimation, and enable consistent depth estimation across either time or space.

3 Towards Sparse-View 4D Reconstruction

Given sparse-view (i.e. 3 – 4) videos from stationary cameras as input, our method recovers the geometry and motion of a dynamic 3D scene (Fig. 3). We model the scene as canonical 3D Gaussians (Sec. 3.1), which translate and rotate via a linear combination of motion bases. We initialize consistent scene geometry by carefully aligning geometry predictions from multiple views (Sec. 3.2), and initialize motion trajectories by clustering per-point 3D semantic features distilled from 2D foundation models (Sec. 3.3). We formulate a joint optimization which simultaneously recovers geometry and motion (Sec. 3.4).

3.1 3D Gaussian Scene Representation

We represent the geometry and appearance of dynamic 3D scenes using 3D Gaussian Splatting [26], due to its efficient optimization and rendering. Each Gaussian in the canonical frame $t_{0}$ is parameterized by $(\mathbf{x}_{0},\mathbf{R}_{0},\mathbf{s},\alpha,\mathbf{c})$ , where $\mathbf{x}_{0}\in\mathbb{R}^{3}$ is the Gaussian’s position in the canonical frame, $\mathbf{R}_{0}\in\mathbb{SO}(3)$ is the orientation, $\mathbf{s}\in\mathbb{R}^{3}$ is the scale, $\alpha\in\mathbb{R}$ is the opacity, and $\mathbf{c}\in\mathbb{R}^{3}$ is the color. The position and orientation are time-dependent, while the scale, opacity, and color are persistent over time. We additionally assign a semantic feature $\mathbf{f}\in\mathbb{R}^{N}$ to each Gaussian (Sec. 3.3), where $N=32$ is an arbitrary number representing the embedding dimension of the feature. Empirically, we find that fixing the color and opacity of Gaussians results in better performance. In summary, for the $i$ -th 3D Gaussian, the optimizable attributes are given by $\Theta^{(i)}=\{\mathbf{x}^{(i)}_{0},\mathbf{R}^{(i)}_{0},\mathbf{s}^{(i)},\mathbf{f}^{(i)}\}$ . Following [71], the optimized Gaussians are rendered from a given camera to produce an RGB image and a feature map using a tile-based rasterization procedure.

3.2 Space-Time Consistent Depth Initialization

Similar to recent methods [51, 54], we rely on data-driven monocular depth priors to initialize the position and appearance of 3D Gaussians over time. Given the success of initializing 3DGS with monocular depth in single-view settings [54], one might think to naturally extend this to multi-view settings by independently initializing from monocular depth for each view. However, this yields conflicting geometry signals, as monocular depth estimators commonly predict up to an unknown scale and shift factor. Thus, the unprojected monocular depths from separate views are often inconsistent, resulting in duplicated object parts.

Multi-view pointmap prediction.

DUSt3R [56] predicts multi-view consistent pointmaps across $K$ input images by first inferring pairwise pointmaps, followed by a global 3D optimization that searches for per-image pointmaps and pairwise similarity transforms (rotation, translation, and scale) that best align all pointmaps with each other.

We run DUSt3R on the multiview images at time $t$ , but constrain the global optimization to be consistent with the $K$ known stationary camera extrinsics $\{\mathbf{P}_{k}\}$ and intrinsics $\{\mathbf{K}_{k}\}$ . This produces per-image global pointmaps $\{\chi^{t}_{k}\}$ in metric coordinates. One can then compute a depth map by simply projecting each pointmap back to each image with the known cameras:

$$d^{t}_{k}(u,v)\begin{bmatrix}u&v&1\end{bmatrix}^{T}=\mathbf{K}_{k}\mathbf{P}_{k}{\chi}^{t}_{k}(u,v) \tag{1}$$

This produces metric-scale multi-view consistent depth maps $d^{t}_{k}(u,v)$ , which are still not consistent over time.
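As a minimal NumPy sketch of the projection in Eq. (1): we assume $\mathbf{P}_{k}$ is a $3\times 4$ world-to-camera matrix $[\mathbf{R}\mid\mathbf{t}]$ applied to homogeneous points, and that the pointmap stores one world-space point per pixel. The function name `pointmap_to_depth` is hypothetical and not part of any released code.

```python
import numpy as np

def pointmap_to_depth(pointmap, K, P):
    """Per-pixel depth from a world-space pointmap of shape (H, W, 3).

    K: (3, 3) intrinsics. P: (3, 4) world-to-camera extrinsics [R | t]
    (assumed convention; the paper does not spell it out).
    """
    H, W, _ = pointmap.shape
    pts = pointmap.reshape(-1, 3)
    # Homogeneous coordinates so that the 3x4 extrinsics can be applied.
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    cam = pts_h @ P.T       # (N, 3) points in the camera frame
    proj = cam @ K.T        # (N, 3), equal to d * [u, v, 1]
    depth = proj[:, 2]      # third component is the depth d
    return depth.reshape(H, W)
```

Because the last row of a pinhole intrinsics matrix is $[0,0,1]$, the recovered depth is simply the camera-frame $z$ coordinate of each point.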

Spatio-temporal alignment of monocular depth with multi-view consistent pointmaps.

In fact, even beyond temporal inconsistency, such multiview predictors tend to underperform on humans since they are trained on multiview data where dynamic humans are treated as outliers. In contrast, we find monocular depth estimators such as MoGe [55] to be far more accurate, but such predictions are not metric (since they are accurate only up to an affine transformation) and are not guaranteed to be consistent across views or times. Our strategy is therefore to use the multiview depth maps from DUSt3R as a metric target to align monocular depth predictions, which we write as $m^{t}_{k}(u,v)$ . Specifically, we search for scale and shift factors $a_{k}^{t}$ and $b_{k}^{t}$ that minimize the following error:

$$\operatorname*{arg\,min}_{\{a^{t}_{k},b^{t}_{k}\}}\sum_{t=1}^{T}\sum_{k=1}^{K}\sum_{(u,v)\in\text{BG}^{t}_{k}}\left\lVert(a^{t}_{k}m^{t}_{k}(u,v)+b^{t}_{k})-d^{t}_{k}(u,v)\right\rVert^{2} \tag{2}$$

where $\text{BG}^{t}_{k}$ refers to a pixelwise background mask for camera $k$ at frame $t$ . This objective uses metric background points as a target for aligning all monodepth predictions. It can be solved quite efficiently since each time $t$ and view $k$ can be optimized independently with a simple least-squares solver (implying our approach will easily scale to long videos). However, the resulting scale factors are still not temporally consistent, since the targets are temporally inconsistent as well. We can instead exploit the constraint that background points should be static across time for stationary cameras. To do so, we replace $d^{t}_{k}(u,v)$ with a static target $d_{k}(u,v)$ obtained by averaging depth maps over time or selecting a canonical reference timestamp. The final set of scaled time- and view-consistent depthmaps are then unprojected back to 3D pointmaps. Note that this tends to produce accurate predictions for static background points, but the dynamic foreground may remain noisy because it cannot be naively denoised by simple temporal averaging. Rather, we rely on motion-based 3DGS optimization to enforce smoothness of the foreground, described next.
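The per-view, per-frame alignment of Eq. (2) with a static target can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the static target is the plain temporal mean of the multi-view depth (one of the two options mentioned above), and depth maps for a single camera are stacked as arrays of shape $(T, H, W)$.

```python
import numpy as np

def align_scale_shift(mono, target, bg_mask):
    """Solve min_{a,b} sum_{BG} (a * mono + b - target)^2 for one view and frame."""
    m = mono[bg_mask].ravel()
    d = target[bg_mask].ravel()
    A = np.stack([m, np.ones_like(m)], axis=1)      # (N, 2) design matrix
    (a, b), *_ = np.linalg.lstsq(A, d, rcond=None)  # closed-form least squares
    return a, b

def align_sequence(mono_seq, mv_depth_seq, bg_masks):
    """Align every frame of one camera to a time-averaged metric target.

    mono_seq, mv_depth_seq: (T, H, W) floats; bg_masks: (T, H, W) booleans.
    """
    # Static target d_k(u, v): temporal average of the multi-view depth maps.
    static_target = mv_depth_seq.mean(axis=0)
    aligned = np.empty_like(mono_seq, dtype=np.float64)
    for t in range(mono_seq.shape[0]):
        a, b = align_scale_shift(mono_seq[t], static_target, bg_masks[t])
        aligned[t] = a * mono_seq[t] + b
    return aligned
```

Each frame requires only a two-parameter least-squares solve, which is why the alignment scales linearly with video length.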

During our experiments, we identified two additional limitations that significantly impact visual quality.
(1) Scale initialization: We observed that initializing 3D Gaussian scales with $k$ -nearest neighbors often results in poor appearance, such as extremely large Gaussians filling empty space and blurring the background. To address this, we follow SplaTAM [25] and initialize each Gaussian scale based on its projected pixel area: $\textrm{scale}=\frac{d}{0.5(f_{x}+f_{y})}$ , where $d$ is a pixel’s depth and $f_{x},f_{y}$ are focal lengths.
(2) Insufficient Gaussian density: Using only one Gaussian per input pixel fails to adequately capture fine details. We instead initialize 5 Gaussians per input pixel, providing better representation of fine details.
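As a quick illustration of the scale rule in (1), the following NumPy sketch evaluates it per pixel. The function name is hypothetical, and the placement of the 5 Gaussians per pixel from (2) is not specified in the text, so it is left out here.

```python
import numpy as np

def init_gaussian_scale(depth, fx, fy):
    """Initial Gaussian scale from projected pixel area: d / (0.5 * (fx + fy))."""
    depth = np.asarray(depth, dtype=np.float64)
    return depth / (0.5 * (fx + fy))
```

For example, a pixel at depth 2 with focal lengths $f_{x}=f_{y}=500$ pixels gets an initial scale of $2/500=0.004$ in the same units as the depth, so more distant pixels receive proportionally larger Gaussians.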

3.3 Grouping-based Motion Initialization

Beyond initializing time- and view-consistent geometry in the canonical frame, we also aim to initialize reasonable estimates of the scene motion. We model a dynamic 3D scene as a set of $\mathcal{N}$ canonical 3D Gaussians, along with time-varying rigid transformations $\mathbf{T}_{0\to t}=[\mathbf{R}_{0\to t}\mid\mathbf{t}_{0\to t}]\in\mathbb{SE}(3)$ that warp from canonical space to time $t$ :

$$\mathbf{x}_{t}=\mathbf{R}_{0\to t}\mathbf{x}_{0}+\mathbf{t}_{0\to t},\quad\mathbf{R}_{t}=\mathbf{R}_{0\to t}\mathbf{R}_{0} \tag{3}$$

Motion bases.

Similar to Shape of Motion [54], we make the observation that in most dynamic scenes, the underlying 3D motion is often low-dimensional, and composed of simpler units of rigid motion. For example, the forearms tend to move together as one rigid unit, despite being composed of thousands of distinct 3D Gaussians. Rather than storing independent 3D motion trajectories for each 3D Gaussian $i$ , we define a set of $B$ learnable basis trajectories $\{\mathbf{T}^{(b)}_{0\to t}\}_{b=1}^{B}$ shared by all Gaussians. The time-varying rigid transform of Gaussian $i$ is written as a weighted combination of basis trajectories, using fixed per-point basis coefficients $\{\mathbf{w}^{(i,b)}\}_{b=1}^{B}$ :

$$\mathbf{T}^{(i)}_{0\to t}=\sum_{b=1}^{B}\mathbf{w}^{(i,b)}\mathbf{T}_{0\to t}^{(b)} \tag{4}$$
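To make Eqs. (3) and (4) concrete, here is a minimal NumPy sketch for a single timestep. It rests on assumptions the text does not confirm: rotations and translations of the bases are blended linearly with the per-point weights, and the blended rotation is then projected back onto $\mathbb{SO}(3)$ with an SVD (a linear blend of rotation matrices is generally not a rotation). All function names are hypothetical.

```python
import numpy as np

def blend_bases(weights, R_bases, t_bases):
    """Per-Gaussian rigid transform from B shared bases at one timestep.

    weights: (N, B), R_bases: (B, 3, 3), t_bases: (B, 3).
    """
    R_lin = np.einsum("nb,bij->nij", weights, R_bases)  # linear blend, Eq. (4)
    t = weights @ t_bases
    # Assumption: project each blended matrix to the nearest rotation via SVD.
    U, _, Vt = np.linalg.svd(R_lin)
    D = np.tile(np.eye(3), (len(weights), 1, 1))
    D[:, 2, 2] = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    R = U @ D @ Vt
    return R, t

def warp_gaussians(x0, R0, R, t):
    """Eq. (3): move canonical positions x0 (N, 3) and orientations R0 (N, 3, 3)."""
    x_t = np.einsum("nij,nj->ni", R, x0) + t
    R_t = R @ R0
    return x_t, R_t
```

A Gaussian whose weights are one-hot simply follows a single basis, while a Gaussian with mixed weights moves with an interpolated rigid transform.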

Motion bases via feature clustering.

Unlike Shape of Motion which initializes motion bases by clustering 3D tracks, our key insight is that semantically grouping similar scene parts together can help regularize dynamic scene motion, without ever initializing trajectories from noisy 3D track predictions. Inspired by the success of robust and universal feature descriptors [42], we obtain pixel-level features for each input image by evaluating DINOv2 on an image pyramid. We average features across pyramid levels and reduce the dimension to 32 via PCA [1]. We choose the small DINOv2 model with registers, as it produces fewer peaky feature artifacts [6].
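The PCA reduction step can be sketched as follows, assuming the pyramid-averaged DINOv2 features have already been flattened to an array of shape (number of pixels, feature dimension). The function name is hypothetical and feature extraction itself is omitted.

```python
import numpy as np

def pca_reduce(feats, out_dim=32):
    """Project features (N, C) onto their top `out_dim` principal components."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    # Rows of Vt are principal directions, sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:out_dim].T
```

In practice one would fit the projection once on features pooled over all views and frames, so that the same 32-dimensional basis is used everywhere; that choice is our assumption rather than a detail stated in the text.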

Given the consistent pixel-aligned pointmaps $\chi_{t,k}^{\text{(time+view)}}$ , we associate each pointmap with the 32-dim feature map $\mathbf{f}_{t,k}$ computed from the corresponding image. We perform k-means clustering on per-point features $\mathbf{f}$ to produce $B$ initial clusters of 3D points. After initializing 3D Gaussians from pointmaps, we set the motion basis weight $\mathbf{w}^{(i,b)}$ to be the L2 distance between the cluster center and 3D Gaussian center. We initialize the basis trajectories $\mathbf{T}^{(b)}_{0\to t}$ to be identity, and optimize them via differentiable rendering.
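The clustering step above can be sketched with a small NumPy k-means. This is a simplified illustration under our own assumptions: centers are initialized deterministically from evenly spaced points rather than with a standard seeding scheme, the 3D cluster center is taken as the mean position of each cluster's points, and the function returns the raw point-to-center distances without any normalization, since the text does not say how distances are turned into final weights. All names are hypothetical.

```python
import numpy as np

def kmeans(feats, num_clusters, iters=20):
    """Plain Lloyd's k-means on features (N, C); returns a label per point."""
    idx = np.linspace(0, len(feats) - 1, num_clusters).astype(int)
    centers = feats[idx].astype(float)  # copy, so updates do not touch feats
    for _ in range(iters):
        d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for c in range(num_clusters):
            if np.any(labels == c):
                centers[c] = feats[labels == c].mean(axis=0)
    return labels

def init_basis_distances(points, feats, num_bases):
    """Cluster by feature, then measure 3D distance to each cluster center.

    points: (N, 3) Gaussian centers, feats: (N, C) per-point features.
    Returns labels (N,) and distances (N, num_bases). Assumes no empty cluster.
    """
    labels = kmeans(feats, num_bases)
    centers3d = np.stack([points[labels == b].mean(axis=0) for b in range(num_bases)])
    dists = np.linalg.norm(points[:, None, :] - centers3d[None, :, :], axis=-1)
    return labels, dists
```

Note that grouping happens in feature space while distances are measured in 3D, so semantically similar points that are spatially close to a cluster center are the ones most strongly tied to that motion basis.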

Dataset	Method	Full Frame PSNR $\uparrow$	Full Frame SSIM $\uparrow$	Full Frame LPIPS $\downarrow$	Full Frame AbsRel $\downarrow$	Dynamic Only PSNR $\uparrow$	Dynamic Only SSIM $\uparrow$	Dynamic Only LPIPS $\downarrow$	Dynamic Only IoU $\uparrow$
Panoptic Studio	SOM [54]	17.86	0.687	0.460	0.491	18.75	0.701	0.236	0.358
Panoptic Studio	Dyn3D-GS [38]	25.37	0.831	0.266	0.207	26.11	0.862	0.129	—
Panoptic Studio	MV-SOM [54]	26.28	0.858	0.241	0.331	26.80	0.883	0.161	0.886
Panoptic Studio	MonoFusion	28.01	0.899	0.117	0.149	27.52	0.944	0.022	0.965
ExoRecon	SOM [54]	14.73	0.535	0.482	0.843	15.63	0.559	0.450	0.294
ExoRecon	Dyn3D-GS [38]	24.28	0.692	0.539	0.612	24.61	0.673	0.384	—
ExoRecon	MV-SOM-DS [54]	28.37	0.906	0.079	0.398	28.23	0.931	0.063	0.872
ExoRecon	MV-SOM [54]	26.91	0.890	0.138	0.474	27.31	0.919	0.078	0.845
ExoRecon	MonoFusion	30.43	0.927	0.061	0.290	29.71	0.947	0.017	0.963

Table 1: Quantitative analysis of held-out view synthesis. We benchmark our method against state-of-the-art approaches by evaluating the novel-view rendering and geometric quality on both the dynamic foreground region and the entire scene, across the held-out frames from input videos. MV-SOM is a multi-view version of Shape of Motion [54] that we construct by instantiating four instances of single-view Shape of Motion and optimizing them together. On Panoptic Studio, ground-truth depth for computing the AbsRel metric is obtained from 27-view optimization of the original Dynamic 3DGS, and for ExoRecon, we project the released point clouds obtained via SLAM from Aria glasses. When evaluating the single-view baseline, SOM [54], we naively aggregate its predictions from the four views and evaluate this aggregated prediction against the evaluation cameras.

3.4 Optimization

As observed in prior work [32, 17], using photometric supervision alone is insufficient to avoid bad local minima in a sparse-view setting. Our final optimization procedure is a combination of photometric losses, data-driven priors, and regularizations on the learned geometry and motions.

During each training step, we sample a random timestep $t$ and camera $k$ . We render the image $\mathbf{\hat{I}}_{t,k}$ , mask $\mathbf{\hat{M}}_{t,k}$ , features $\mathbf{\hat{F}}_{t,k}$ , and depth $\mathbf{\hat{D}}_{t,k}$ . We compute reconstruction loss by comparing to off-the-shelf estimates:

$$\mathcal{L}_{\text{recon}}=\left\lVert\mathbf{\hat{I}}-\mathbf{I}\right\rVert_{1}+\lambda_{\text{m}}\left\lVert\mathbf{\hat{M}}-\mathbf{M}\right\rVert_{1}+\lambda_{\text{f}}\left\lVert\mathbf{\hat{F}}-\mathbf{F}\right\rVert_{1}+\lambda_{\text{d}}\left\lVert\mathbf{\hat{D}}-\mathbf{D}\right\rVert_{1} \tag{5}$$
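A minimal NumPy sketch of Eq. (5) follows. The dictionary keys, the default weights, and the use of a mean-normalized L1 (rather than an unnormalized norm) are our assumptions for illustration; the paper does not report the values of $\lambda_{\text{m}}$, $\lambda_{\text{f}}$, or $\lambda_{\text{d}}$ in this section.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays of the same shape."""
    return float(np.abs(a - b).mean())

def recon_loss(pred, target, lam_m=1.0, lam_f=1.0, lam_d=1.0):
    """Eq. (5): weighted sum of L1 terms on image, mask, features, and depth.

    pred and target are dicts with keys "rgb", "mask", "feat", "depth"
    (hypothetical layout); the lambda defaults are placeholders.
    """
    return (
        l1(pred["rgb"], target["rgb"])
        + lam_m * l1(pred["mask"], target["mask"])
        + lam_f * l1(pred["feat"], target["feat"])
        + lam_d * l1(pred["depth"], target["depth"])
    )
```

In an actual training loop these terms would be computed on differentiable tensors from the rasterizer; the NumPy version only shows how the four supervision signals are combined.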

We additionally enforce a rigidity loss between randomly sampled dynamic Gaussians and their $k$ nearest neighbors. Let $\mathbf{\hat{X}}_{t}$ denote the location of a 3D Gaussian at time $t$ , and let $\mathbf{\hat{X}}_{t^{\prime}}$ denote its location at time $t^{\prime}$ . Over neighboring 3D Gaussians $i$ , we define:

$$\mathcal{L}_{\text{rigid}}=\sum_{\text{neighbors }i}\left(\left\lVert\mathbf{\hat{X}}_{t}-\mathbf{\hat{X}}^{(i)}_{t}\right\rVert_{2}^{2}-\left\lVert\mathbf{\hat{X}}_{t^{\prime}}-\mathbf{\hat{X}}^{(i)}_{t^{\prime}}\right\rVert_{2}^{2}\right) \tag{6}$$
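The rigidity term can be sketched in NumPy as below. Two points are our assumptions rather than details from the text: neighbors are found once, at time $t$, with a brute-force search, and we take the absolute value of each difference so that the penalty is non-negative (Eq. (6) as printed has no absolute value). Function names are hypothetical.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbors of each point in x (N, 3), excluding itself."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def rigidity_loss(x_t, x_tp, k=2):
    """Penalize changes in squared neighbor distances between times t and t'.

    x_t, x_tp: (N, 3) float positions of the same Gaussians at two timesteps.
    """
    nbrs = knn_indices(x_t, k)                               # (N, k)
    d_t = ((x_t[:, None, :] - x_t[nbrs]) ** 2).sum(-1)       # (N, k)
    d_tp = ((x_tp[:, None, :] - x_tp[nbrs]) ** 2).sum(-1)    # (N, k)
    # Assumption: absolute difference, so stretching and shrinking both cost.
    return float(np.abs(d_t - d_tp).sum())
```

Any rigid motion of the whole point set leaves all pairwise distances unchanged and therefore gives zero loss, while stretching or shearing a local neighborhood is penalized.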

4 Experimental Results

Implementation details.

We optimize our representation with Adam [28]. We use 18k Gaussians for the foreground and 1.2M for the background. We fix the number of $\mathbb{SE}(3)$ motion bases to $28$ and obtain these from feature clustering (Sec. 3.3). For the depth alignment, we use points above the confidence threshold of 95%. We show results on 7 sequences, each 10 seconds long at 30fps with a resolution of 512 $\times$ 288. Training takes about 30 minutes on a single NVIDIA A6000 GPU. Our rendering speed is about 30fps.

Datasets.

We conduct qualitative and numerical evaluation on Panoptic Studio [23] and a subset of Ego-Exo4D [20] which we call ExoRecon.

Panoptic Studio is a massively multi-view capture system which consists of 480 video streams of humans performing skilled activities. Out of these 480 views, we manually select 4 camera views, $90^{\circ}$ apart, to simulate the same exocentric camera setup as Ego-Exo4D. Given these 4 training view cameras, we find 4 other intermediate cameras $45^{\circ}$ apart from the training views, and use these for evaluating novel view synthesis from $45^{\circ}$ camera views.

For in-the-wild evaluation of sparse-view reconstruction, we repurpose Ego-Exo4D [20], which includes sparse-view videos of skilled human activities. While many Ego-Exo4D scenarios are out of scope for dynamic reconstruction with existing methods (due to fine-grained object motion, specular surfaces, or excessive scene clutter), we find one scene each from the 6 different scenarios in Ego-Exo4D with considerable object motion: dance, sports, bike repair, cooking, music, healthcare. For each scene, we extract 300 frames of synchronized RGB video streams, captured from 4 different cameras with known parameters. We remove fisheye distortions from all RGB videos and assume a simple pinhole camera model after undistortion. We call this subset ExoRecon, and show results on these sequences. Please see the appendix for more visuals.

Method	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	IoU $\uparrow$	AbsRel $\downarrow$
SOM	16.73	0.554	0.491	0.287	0.578
Dyn3D-GS	23.31	0.776	0.316	—	0.273
MV-SOM	21.56	0.541	0.433	0.482	0.413
MonoFusion	25.73	0.847	0.158	0.943	0.188

Table 2: Quantitative analysis of $45^{\circ}$ novel-view synthesis on Panoptic Studio. We benchmark our method against state-of-the-art approaches by evaluating both the dynamic foreground region and the entire scene. Notably, the evaluation is conducted on novel views where the cameras are at least $45^{\circ}$ apart from all training views. We additionally evaluate the geometric reconstruction quality with absolute relative (AbsRel) error in rendered depth.

Metrics.

We follow prior work [38, 62] in evaluating the perceptual and geometric quality of our reconstructions using PSNR, SSIM, LPIPS and absolute relative (AbsRel) error in depth. We compute these metrics on the entire image, and also on only the foreground region of interest. We additionally evaluate the quality of the dynamic foreground silhouette by reporting mask IoU, computed as $|\mathbf{\hat{M}}\cap\mathbf{M}|/|\mathbf{\hat{M}}\cup\mathbf{M}|$ . Similar to prior work [62], our evaluation views are a set of held-out frames, subsampled from the input videos from 4 exocentric cameras, in both Panoptic Studio and ExoRecon.

Note that since the cameras in our setup are stationary, the above evaluation only analyzes the interpolation quality of different methods. To test novel-view synthesis more explicitly, we also benchmark on Panoptic Studio with evaluation cameras placed $45^{\circ}$ away from the training view cameras. Since such ground-truth evaluation cameras are not available in ExoRecon, we only show qualitative results there.
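For reference, three of the metrics above have short closed forms, sketched here in NumPy. The function names are hypothetical, images are assumed to be scaled to $[0,1]$, and AbsRel is averaged only over pixels with positive ground-truth depth, which is a common convention that we assume rather than take from the text. SSIM and LPIPS are omitted because they require windowed statistics and a pretrained network, respectively.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def abs_rel(pred_depth, gt_depth):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    valid = gt_depth > 0
    return float(np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid]))

def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks."""
    pred_mask = pred_mask.astype(bool)
    gt_mask = gt_mask.astype(bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)
```

Restricting the image metrics to the dynamic region amounts to evaluating the same functions on pixels selected by the foreground mask.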

Baselines.

We compare our method with prior work on dynamic scene reconstruction from single or multiple views. Among methods that operate on monocular videos, we run Shape of Motion [54] on 8 scenes from Panoptic Studio, following the setup of Dynamic 3D Gaussians [38], and on our curated dataset ExoRecon that covers 6 diverse scenes. We also consider two multi-view dynamic reconstruction baselines, Dynamic 3D Gaussians [38], and a naive multi-view extension of Shape of Motion (MV-SOM). To construct the latter baseline, we simply concatenate the Gaussians and motion bases from four independently-initialized instances of single-view SOM, and optimize all four instances jointly. Finally, we evaluate a variant of MV-SOM with globally-consistent depth (denoted MV-SOM-DS), obtained by running per-frame DUSt3R on the 4 input views and fixing camera poses to ground-truth during DUSt3R’s global alignment. Despite using the same hyperparameters as our method, MV-SOM-DS has more visual artifacts due to reduced depth quality, suggesting the importance of our DUSt3R+MoGe design. In the appendix, we verify that all baselines reconstruct reasonable training views.

4.1 Comparison to State-of-the-Art

Evaluation on held-out views. In Tab. 1, we compare our method to recent dynamic scene reconstruction baselines [67, 38, 54], following evaluation protocols from prior work [54, 62]. Our method beats prior art on both Panoptic Studio and ExoRecon (Fig. 4) datasets, when evaluated on held-out views across photometric (PSNR, SSIM, LPIPS) and geometric error (AbsRel) metrics. Note that when initializing Dynamic 3DGS [38] with 4 views we find that COLMAP fails, and so the point cloud initialization for this baseline is from a 27-view COLMAP optimization.

Interestingly, we find that although the monocular 4D reconstruction method Shape of Motion (SOM) [54] often fails to output accurate metric depth, it is robust to a limited camera shift. We hypothesize that the foundational priors of Shape of Motion allow it to produce reasonable results in under-constrained scenarios, while test-time optimization methods, especially ones that do not always rely on data-driven priors [38], can more easily fall into local optima (e.g. those caused by poor initialization) which are difficult to optimize out of via rendering losses alone.

Evaluation on a $45^{\circ}$ novel-view. On Panoptic Studio, we use the four evaluation cameras (placed $45^{\circ}$ apart from the training views) to evaluate our method’s novel-view rendering capability. We also evaluate the novel-view rendered depth against a ‘pseudo-groundtruth’ depth obtained from optimizing Dynamic 3DGS [38] with all 27 training views. In Tab. 2 and Fig. 5, we find that our method outperforms all baselines, achieving state-of-the-art $45^{\circ}$ novel-view synthesis. Qualitative results on ExoRecon are in Fig. 6 & 7.

4.2 Ablation Study

We ablate the design decisions in our pipeline in Tab. 3. Our proposed space-time consistent depth plays a crucial role in learning accurate scene geometry and appearance (yielding a 3.4 PSNR improvement, Row 1 vs 3). Next, we find that the feature-metric loss $\mathcal{L}_{\text{feat}}=\left\lVert\mathbf{\hat{F}}-\mathbf{F}\right\rVert$ provides a trade-off between learning photometric properties vs. learning foreground motion and silhouette. Although the PSNR decreases, we see an increase in mask IoU (Row 1 vs 2 and Row 3 vs 4). Freezing the color of all Gaussians across frames aids learning the motion mask, as measured by mask IoU. Finally, our motion bases from feature-clustering improve overall scene optimization (final row).

Method	$\mathcal{L}_{\text{feat}}$	$\mathbf{d}_{n}$	$\mathbf{T}^{(b)}_{0\to t}$	$\uparrow$ PSNR	$\uparrow$ SSIM	$\downarrow$ LPIPS	$\uparrow$ IoU
Baseline	✗	✗	✗	26.19	0.915	0.077	0.60
+ $\mathcal{L}_{\text{feat}}$	✓	✗	✗	25.39	0.933	0.087	0.63
+ Our depth / no $\mathcal{L}_{\text{feat}}$	✗	✓	✗	29.55	0.944	0.037	0.73
+ Our depth / $\mathcal{L}_{\text{feat}}$	✓	✓	✗	29.31	0.941	0.041	0.75
+ Motion bases (Ours)	✓	✓	✓	30.40	0.947	0.037	0.81

Table 3: Ablation study of pipeline components. We ablate our choice of feature-metric loss, space-time consistent depth, and feature-based motion bases. While the proposed depth and feature-based motion bases considerably improve 4D reconstruction (evaluated by photometric errors), we find that our feature loss helps learn better motion masks (evaluated by IoU).

Velocity-based vs. feature-based motion bases. In the monocular setting, we empirically found that both designs performed equally well. However, in our 4-camera sparse-view setting, we found that feature-based motion bases perform much better than velocity-based motion bases. The reason is that for velocity-based motion bases, we infer 3D velocity by querying the 2D tracking results plus depth per frame, following Shape of Motion [54]. Thus, noisy foreground depth estimates, where the estimated depth of the person flickers between foreground and background, negatively influence the quality of velocity-based motion bases, causing rigid body parts to move erratically. In contrast, feature-based motion bases, where features are initialized from more reliable image-level observations, are more robust to noisy 3D initialization and force semantically-similar parts to move in similar ways. To validate this, in Fig. 8 we use PCA to visualize the inferred features and find that they are consistent not only along the temporal axis but also across cameras.

Effect of different number of motion bases. When the number of motion bases is not expressive enough (in our experience, fewer than 20 motion bases), there are often obvious flaws in the reconstruction, such as missing arms or the two legs joining together into a single leg. In practice, we do not observe that further increasing the number of motion bases hurts performance. Empirically, the capacity of our design (28 motion bases) can effectively handle different scene dynamics.

5 Conclusion

We address the problem of sparse-view 4D reconstruction of dynamic scenes. Existing multi-view 4D reconstruction methods are designed for dense multi-view setups (e.g. Panoptic Studio). In contrast, we aim to strike a balance between the ease and informativeness of multi-view data capture by reconstructing skilled human behaviors from four equidistant inward-facing static cameras. Our key insight is that carefully incorporating priors, in the form of monocular depth and feature-based motion clustering, is crucial. Our empirical analysis shows that on challenging scenes with object dynamics, we achieve state-of-the-art performance on novel space-time synthesis compared to prior art.

References

Amir et al. [2021] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep vit features as dense visual descriptors. arXiv preprint arXiv:2112.05814, 2021.
Carranza et al. [2003] Joel Carranza, Christian Theobalt, Marcus Magnor, and Hans-Peter Seidel. Free-viewpoint video of human actors. ACM Trans. Graph., 22:569–577, 2003.
Cen et al. [2023] Jiazhong Cen, Zanwei Zhou, Jiemin Fang, Wei Shen, Lingxi Xie, Xiaopeng Zhang, and Qi Tian. Segment anything in 3d with nerfs. arXiv preprint arXiv:2304.12308, 2023.
Chen and Wang [2024] Guikun Chen and Wenguan Wang. A survey on 3d gaussian splatting. arXiv preprint, 2024.
Chen et al. [2024] Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024.
Darcet et al. [2023] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers, 2023.
de Aguiar et al. [2008] Edilson de Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. SIGGRAPH, 2008.
Deng et al. [2021] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. arXiv preprint arXiv:2107.02791, 2021.
Dou et al. [2016] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, Pushmeet Kohli, Vladimir Tankovich, and Shahram Izadi. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 2016, 35, 2016.
Eigen et al. [2014] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. NeurIPS, pages 2366–2374, 2014.
Fan et al. [2023] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. Arctic: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12943–12954, 2023.
Fan et al. [2024] Zhiwen Fan, Kairun Wen, Wenyan Cong, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, and Yue Wang. Instantsplat: Sparse-view gaussian splatting in seconds, 2024.
Forsyth and Ponce [2003] David A Forsyth and Jean Ponce. A modern approach. Computer vision: a modern approach, 17:21–48, 2003.
Fu et al. [2018] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. CVPR, pages 2002–2011, 2018.
Fu et al. [2024] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In ECCV, 2024.
Gao et al. [2021a] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In IEEE International Conference on Computer Vision (ICCV), 2021a.
Gao et al. [2021b] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5712–5721, 2021b.
Gao et al. [2022] Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
Gao et al. [2024] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint 2405.10314, 2024.
Grauman et al. [2023] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, and Michael Wray. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives, 2023.
Hu et al. [2024] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. arXiv preprint arXiv:2404.15506, 2024.
Jensen et al. [2014] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engil Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413. IEEE, 2014.
Joo et al. [2017] Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Godisart, Bart Nabbe, Iain Matthews, et al. Panoptic studio: A massively multiview system for social interaction capture. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(1):190–204, 2017.
Ke et al. [2024] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Keetha et al. [2024] Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. Splatam: Splat, track & map 3d gaussians for dense rgb-d slam, 2024.
Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 42(4):1–14, 2023.
Khazatsky et al. [2024] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kwon et al. [2021] Taein Kwon, Bugra Tekin, Jan Stühmer, Federica Bogo, and Marc Pollefeys. H2o: Two hands manipulating objects for first person interaction recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10138–10148, 2021.
Lei et al. [2024] Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. arXiv preprint arXiv:2405.17421, 2024.
Li et al. [2024a] Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20775–20785, 2024a.
Li et al. [2021] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6498–6508, 2021.
Li et al. [2023] Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. Dynibar: Neural dynamic image-based rendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Li et al. [2024b] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos. arXiv preprint arXiv:2412.04463, 2024b.
Liang et al. [2023] Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, and Lei Xiao. Gaufre: Gaussian deformation fields for real-time dynamic novel view synthesis. arXiv preprint, 2023.
Lin et al. [2025] Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, and Deva Ramanan. Towards understanding camera motions in any video. arXiv preprint arXiv:2504.15376, 2025.
Lu et al. [2024] Hao Lu, Tianshuo Xu, Wenzhao Zheng, Yunpeng Zhang, Wei Zhan, Dalong Du, Masayoshi Tomizuka, Kurt Keutzer, and Yingcong Chen. Drivingrecon: Large 4d gaussian reconstruction model for autonomous driving. arXiv preprint arXiv:2412.09043, 2024.
Luiten et al. [2024] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In 3DV, 2024.
Mildenhall et al. [2019] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.
Mildenhall et al. [2020] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020.
Newcombe et al. [2015] Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2023.
Perazzi et al. [2016] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Piccinelli et al. [2024] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Piccinelli et al. [2025] Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler, 2025.
Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
Ravi et al. [2024] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
Rombach et al. [2021] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
Song et al. [2023] Chonghyuk Song, Gengshan Yang, Kangle Deng, Jun-Yan Zhu, and Deva Ramanan. Total-recon: Deformable scene reconstruction for embodied view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17671–17682, 2023.
Stearns et al. [2024] Colton Stearns, Adam W. Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, and Leonidas Guibas. Dynamic gaussian marbles for novel view synthesis of casual monocular videos. arXiv preprint arXiv:2406.18717, 2024.
Szymanowicz et al. [2024] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction. 2024.
Tan et al. [2024] Jeff Tan, Donglai Xiang, Shubham Tulsiani, Deva Ramanan, and Gengshan Yang. Dressrecon: Freeform 4d human reconstruction from monocular video. arXiv preprint arXiv:2409.20563, 2024.
Wang et al. [2024a] Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, and Yu Xiang. Ho-cap: A capture system and dataset for 3d reconstruction and pose tracking of hand-object interaction. arXiv preprint arXiv:2406.06843, 2024a.
Wang et al. [2024b] Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. arXiv preprint arXiv:2407.13764, 2024b.
Wang et al. [2024c] Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. arXiv preprint arXiv:2410.19115, 2024c.
Wang et al. [2024d] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024d.
Wang et al. [2024e] Zihan Wang, Bowen Li, Chen Wang, and Sebastian Scherer. Airshot: Efficient few-shot detection for autonomous exploration. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11654–11661, 2024e.
Wu et al. [2023] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Wang Xinggang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint, 2023.
Wu et al. [2024] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors. In CVPR, 2024.
Yang et al. [2024a] Chen Yang, Sikuang Li, Jiemin Fang, Ruofan Liang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. Gaussianobject: High-quality 3d object reconstruction from four views with gaussian splatting. SIGGRAPH Asia, 2024a.
Yang et al. [2022] Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, and Hanbyul Joo. Banmo: Building animatable 3d neural models from many casual videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Yang et al. [2024b] Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Apoorva Sharma, Maximilian Igl, Peter Karkus, Danfei Xu, et al. Storm: Spatio-temporal reconstruction model for large-scale outdoor scenes. arXiv preprint arXiv:2501.00602, 2024b.
Yang et al. [2024c] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv:2406.09414, 2024c.
Yang et al. [2023] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint, 2023.
Yu et al. [2021] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 4578–4587, 2021.
Yu et al. [2024] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024.
Zhang et al. [2024] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.
Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In IEEE/CVF International Conference on Computer Vision, ICCV, pages 3813–3824, 2023.
Zhao et al. [2025] Qitao Zhao, Amy Lin, Jeff Tan, Jason Y. Zhang, Deva Ramanan, and Shubham Tulsiani. Diffusionsfm: Predicting structure and motion via ray origin and endpoint diffusion, 2025.
Zhou et al. [2025] Changshi Zhou, Rong Jiang, Feng Luan, Shaoqiang Meng, Zhipeng Wang, Yanchao Dong, Yanmin Zhou, and Bin He. Dual-arm robotic fabric manipulation with quasi-static and dynamic primitives for rapid garment flattening. IEEE/ASME Transactions on Mechatronics, pages 1–11, 2025.
Zhou et al. [2024] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. CVPR, 2024.
Zhu et al. [2023] Zehao Zhu, Zhiwen Fan, Yifan Jiang, and Zhangyang Wang. FSGS: real-time few-shot view synthesis using gaussian splatting. CoRR, abs/2312.00451, 2023.

Appendix

Appendix A Additional Baseline Comparisons

To highlight the challenge of sparse-view 4D reconstruction, we compare with five additional baselines for monocular 4D reconstruction and sparse-view reconstruction. Our method remains state-of-the-art.

Method	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	Description
MV-MonST3R [67]	12.64	0.475	0.574	Monocular 4D reconstruction
MV-MegaSAM [34]	14.25	0.578	0.412	Monocular 4D reconstruction
ViewCrafter [66]	24.97	0.834	0.135	Multi-view diffusion
DNGaussian [31]	27.34	0.897	0.103	Sparse-view reconstruction
InstantSplat [12]	29.11	0.909	0.082	Sparse-view reconstruction
MonoFusion	30.64	0.930	0.055	Ours

Baseline: Monocular 4D reconstruction. We compare with MonST3R, as well as MegaSAM, which claims better results than MonST3R. For each method, we report the best result among using a single-view video, concatenating the 4 views into a longer video, or interleaving the 4 views (simulating a “rotating” camera). MonST3R and MegaSAM fail in sparse-view scenarios, especially with large viewpoint shifts. When given concatenated or interleaved sparse-view videos as input, both models simply copy the input frames onto flat planes: notice the image corners in the visual results. Given a single input video, both models miss large regions due to occlusions and unseen geometry past the video boundary. Without non-trivial handling of images from multiple wide-baseline input views, MonST3R and MegaSAM alone cannot produce view-consistent reconstructions.

Baseline: Multi-view diffusion. We use ViewCrafter, a video diffusion method for novel-view synthesis, to render novel views at each timestep. ViewCrafter’s rendered views are worse than our method’s outputs, due to grid-like rendering artifacts, missing black regions of the scene, additional hallucinations, and imperfect alignment between different input views.

Baseline: Sparse-view methods. We compare with the alternative sparse-view methods InstantSplat and DNGaussian. Although they do not handle dynamic scenes, we run them independently per frame. Both methods suffer from blurry reconstructions and missing details, although InstantSplat benefits from the cross-view consistency provided by DUSt3R.

A qualitative analysis of all baselines for 5^∘ and 45^∘ novel view synthesis on the ExoRecon dataset is available in Fig. 10 and 11.
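For the monocular baselines, the concatenated and interleaved inputs differ only in how the frames of the four videos are ordered into a single pseudo-monocular sequence. The sketch below shows one plausible reading of the two orderings; the function names and the choice to visit every view at every timestep are assumptions, since the exact interleaving scheme is not specified here.

```python
def concatenate_views(num_views, num_frames):
    """All frames of view 0, then all frames of view 1, and so on."""
    return [(v, t) for v in range(num_views) for t in range(num_frames)]

def interleave_views(num_views, num_frames):
    """Visit every view at each timestep, imitating a camera that
    rotates around the scene while time advances."""
    return [(v, t) for t in range(num_frames) for v in range(num_views)]

# Each entry is a (view index, frame index) pair.
print(concatenate_views(2, 3))  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(interleave_views(2, 3))   # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```

In both orderings, consecutive frames can come from cameras with a wide baseline, which is the regime where the monocular methods are reported to fail.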

Appendix B Training View Renderings

To build confidence in our implementations, we validate every baseline we run by verifying that each method looks reasonable at training views (Fig. 9). It is worth noting that in each optimization iteration, we sample a batch of frames from the video to optimize the overall loss. Because the loss is minimized on average over all frames, some artifacts may remain in individual frames.

Appendix C Training Details

In Tab. 4, we report the learning rates and loss weights of Gaussians in our optimization process. These hyperparameters are shared across every scene that we evaluated on. Specifically, $\mathcal{L}_{\text{smooth\_bases}}$ enforces smooth motion bases by penalizing high accelerations in rotations and translations. $\mathcal{L}_{\text{smooth\_tracks}}$ promotes smooth object tracks by penalizing large accelerations in object positions across frames. $\mathcal{L}_{\text{depth\_grad}}$ aligns the gradients of the predicted and ground truth depth maps to preserve structural details. $\mathcal{L}_{\text{z\_accel}}$ penalizes high accelerations along the depth axis to reduce jitter in depth estimation. $\mathcal{L}_{\text{scale\_var}}$ constrains the variance of scale parameters of Gaussians to achieve consistent representations.
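Several of these terms penalize acceleration, which in discrete time is commonly measured with a second-order finite difference, $x_{t+1} - 2x_t + x_{t-1}$. The following is a minimal sketch of such a penalty, not the released implementation: the function name `acceleration_penalty` and the use of a mean absolute value (rather than a squared norm) are assumptions.

```python
import numpy as np

def acceleration_penalty(x):
    """Mean absolute second-order finite difference along the time axis.

    x: array of shape (T, ...), e.g. (T, 3) positions of one track.
    """
    accel = x[2:] - 2.0 * x[1:-1] + x[:-2]
    return float(np.abs(accel).mean())

t = np.arange(5, dtype=float)
smooth = np.stack([t, 2.0 * t, np.zeros(5)], axis=1)  # constant velocity
jitter = smooth.copy()
jitter[2, 2] += 0.3                                   # one-frame bump along z

print(acceleration_penalty(smooth))        # 0.0 for constant-velocity motion
print(acceleration_penalty(jitter))        # > 0, bump is penalized
print(acceleration_penalty(jitter[:, 2]))  # depth axis only, as in a z-acceleration term
```

A constant-velocity trajectory incurs no penalty, so the term discourages frame-to-frame jitter without discouraging motion itself.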

Table 4: Training hyper-parameters: learning rates (left) and loss weights (right) shown row-by-row

Learning Rates				Loss Weights
Parameter	FG LR	BG LR	Motion LR	Loss Param.	Weight
means	$1.6\times 10^{-4}$	$1.6\times 10^{-4}$	–	$w_{\text{rgb}}$	$7.0$
opacities	$1\times 10^{-2}$	$1\times 10^{-2}$	–	$w_{\text{mask}}$	$5.0$
scales	$5\times 10^{-3}$	$1\times 10^{-3}$	–	$w_{\text{feat}}$	$7.0$
quats	$1\times 10^{-3}$	$1\times 10^{-3}$	–	$w_{\text{smooth\_bases}}$	$0.1$
colors	$0$	$1\times 10^{-2}$	–	$w_{\text{depth\_reg}}$	$1.0$
feats	$1\times 10^{-3}$	$1\times 10^{-3}$	–	$w_{\text{smooth\_tracks}}$	$2.0$
motion_coefs	$1\times 10^{-3}$	–	–	$w_{\text{scale\_var}}$ (originally set to $0.01$)	$0.01$
rots	–	–	$1.6\times 10^{-4}$	$w_{\text{z\_accel}}$	$1.0$
transls	–	–	$1.6\times 10^{-4}$	$w_{\text{track}}$	$2.0$
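As a sketch of how the loss weights in Tab. 4 might be used, the snippet below combines per-term loss values into a single objective. The assumption that the total objective is a plain weighted sum, as well as the names `LOSS_WEIGHTS` and `total_loss`, are illustrative and not taken from the released code; only the numeric weights come from the table.

```python
# Loss weights copied from Tab. 4.
LOSS_WEIGHTS = {
    "rgb": 7.0,
    "mask": 5.0,
    "feat": 7.0,
    "smooth_bases": 0.1,
    "depth_reg": 1.0,
    "smooth_tracks": 2.0,
    "scale_var": 0.01,
    "z_accel": 1.0,
    "track": 2.0,
}

def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of individual loss values (assumed form of the objective)."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)

# Toy example: every term evaluates to 1.0, so the total equals the sum of weights.
unit_terms = {name: 1.0 for name in LOSS_WEIGHTS}
print(total_loss(unit_terms))
```

Because the same dictionary is shared across scenes, changing a weight in one place changes it for every scene, which mirrors the statement that the hyperparameters are not tuned per scene.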

Appendix D Alternative design choice

We explore higher-quality metric-scale depth initialization by replacing UniDepth in MV-SOM with improved metric depth estimators. Our results suggest that simply improving the quality of metric-scale depth in MV-SOM is not enough.

Method	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	IOU $\uparrow$	AbsRel $\downarrow$
w/ UniDepth [44]	26.91	0.890	0.138	0.845	0.474
w/ Metric3Dv2 [21]	27.29	0.894	0.132	0.847	0.462
w/ UniDepthV2 [45]	27.65	0.900	0.103	0.872	0.356
MonoFusion	30.64	0.930	0.055	0.963	0.290
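For reference, the standard definitions of three of the reported metrics can be sketched as follows. These are textbook formulas under stated assumptions: images are in $[0, 1]$, AbsRel is averaged over pixels with positive ground-truth depth, and IoU is computed on binary masks. The exact evaluation protocol used for the table (masking, depth scale alignment, averaging over frames) is not specified here and may differ.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB (higher is better)."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def abs_rel(pred_depth, gt_depth, valid=None):
    """Mean absolute relative depth error |d_pred - d_gt| / d_gt (lower is better)."""
    if valid is None:
        valid = gt_depth > 0  # ignore pixels without ground-truth depth
    return float(np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid]))

def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two boolean masks (higher is better)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union > 0 else 1.0

# Toy inputs, chosen only to make the arithmetic easy to follow.
print(psnr(np.zeros((2, 2)), np.full((2, 2), 0.1)))           # about 20 dB
print(abs_rel(np.array([1.0, 5.0]), np.array([2.0, 4.0])))    # (0.5 + 0.25) / 2
print(mask_iou(np.array([1, 1, 0, 0], dtype=bool),
               np.array([0, 1, 1, 0], dtype=bool)))           # 1 / 3
```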

Appendix E Limitations and Future Work

We discuss two key limitations of our work. First, like previous methods, we rely heavily on 2D foundation models to estimate priors (e.g. depth and dynamic masks) for gradient-based differentiable rendering optimization. Thus, imprecise priors can harm the downstream rendering process. In addition, the current pipeline requires a user prompt to specify dynamic masks for each moving object [3], which can be labor-intensive for complex scenes and may fail in some cases (Fig. 12). To solve this, distilling dynamic masks from foundation models or inferring dynamic masks from image-level priors (as in [67]) could be beneficial.

Second, most off-the-shelf feed-forward depth estimation networks are trained on simple scene-level datasets, with few dynamic movers (e.g. people) in the foreground. In practice, we observe that the depth of humans in dynamic scenes is often incorrect when observed from other views (Fig. 13). For example, DUSt3R often estimates the depth of a human to be the same as the depth of surrounding walls, causing the human to blend into the background. We believe that these fundamental problems with the depth predictions cannot be solved by any alignment in the output space. To mitigate this issue, we plan to further fine-tune DUSt3R or MonST3R on existing dynamic human datasets.