MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, Deva Ramanan
Carnegie Mellon University
Equal Contribution.
Abstract

We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on GitHub.

Figure 1: Dynamic Scene Reconstruction from Sparse Views. MonoFusion reconstructs dynamic human behaviors, such as playing the piano or performing CPR, from four equidistant inward-facing static cameras. We visualize the RGB and depth renderings of a 45° novel view between two training views. Training views are shown below for reference.

1 Introduction

Accurately reconstructing dynamic 3D scenes from multi-view videos is of great interest to the vision community, with applications in AR/VR [49, 36], autonomous driving [37], and robotics [70, 57]. Prior work often studies this problem in the context of dense multi-view videos, which require dedicated capture studios that are prohibitively expensive to build and are difficult to scale to diverse scenes in-the-wild. In this paper, we aim to strike a balance between the ease and informativeness of multi-view data collection by reconstructing skilled human behaviors (e.g., playing a piano and performing CPR) from four equidistant inward-facing static cameras (Fig. 1).

Problem setup.

Despite recent advances in dynamic scene reconstruction  [17, 16, 18, 4], current approaches often require dozens of calibrated cameras [38, 23], are category specific [61], or struggle to generate multi-view consistent geometry [33]. We study the problem of reconstructing dynamic human behaviors from an in-the-wild capture studio: a small set of (4) portable cameras with limited overlap but complete scene coverage, such as in the large-scale Ego-Exo4D dataset [20]. We argue that sparse-view limited-overlap reconstruction presents unique challenges not found in dense multi-view setups and typical “sparse view” captures with large covisibility (Fig. 2). For dense multi-view captures, it is often sufficient to rely solely on geometric and photometric cues for reconstruction, often making use of classic techniques from (non-rigid) structure from motion [13]. As a result, these methods fail in sparse-view settings with limited cross-view correspondences.

Figure 2: Problem Setup. Our sparse-view setup (middle) strikes a balance between ill-posed reconstructions from casual monocular captures [18, 43] and well-constrained reconstructions from dense multi-view studio captures [23]. Unlike existing “sparse-view” datasets like DTU [22] and LLFF [39], our setup is more challenging because input views are 90° apart with limited cross-view correspondences.

Key insights.

We find that initializing sparse-view reconstructions with monocular geometry estimators like MoGe [55] produces higher quality results. However, naively merging independent monocular geometry estimates often yields inconsistent geometry across views (e.g. duplicate structures), which leads to poor local minima during 3D optimization. Instead, we carefully align monocular reconstructions (that are independently predicted for each view and time) to a global reference frame that is learned from a static multi-view reconstructor (like DUSt3R [56]). Furthermore, many challenges in inferring view-consistent and time-consistent depth become dramatically simpler when working with fixed cameras with known poses (inherent to the in-the-wild capture setup that we target). For example, temporally consistent background geometry can be enforced by simply averaging predictions over time.

Contributions.

We present three major contributions.

  • We highlight the challenge of reconstructing skilled human behaviors in dynamic environments from sparse-view cameras in-the-wild.

  • We demonstrate that monocular reconstruction methods can be extended to the sparse-view setting by carefully incorporating monocular depth and foundational priors.

  • We extensively ablate our design choices and show that we achieve state-of-the-art performance on PanopticStudio and challenging sequences from Ego-Exo4D.

2 Related Work

Dynamic scene reconstruction.

Dynamic scene reconstruction [4] has received significant interest in recent years. While classical work [41, 9] often relies on RGB-D sensors, or strong domain knowledge [2, 7], recent approaches [33] based on neural radiance fields [40] have progressed towards reconstructing dynamic scenes in-the-wild from RGB video alone. However, such methods are computationally heavy, can only reconstruct short video clips with limited dynamic movement, and struggle with extreme novel view synthesis. Recently, 3D Gaussian Splatting [26, 38] has accelerated radiance field training and rendering via an efficient rasterization process. Follow-up works [64, 58, 35] repurpose 3DGS to reconstruct dynamic scenes, often by optimizing a fixed set of Gaussians in canonical space and modeling their motion with deformation fields. However, as Gao et al. [18] point out, such methods often struggle to reconstruct realistic videos. Many works address this shortcoming by relying on 2D point tracking priors [54], fusing Gaussians from many timesteps [30], modeling isotropic Gaussians [50], or exploiting domain knowledge such as human body priors [52]. However, these approaches study the reconstruction problem in the monocular setting. As 4D reconstruction from a single viewpoint is under-constrained, practical robotics setups for manipulation [27] and hand-object interaction [29, 11, 53] adopt camera rigs where a sparse set of cameras capture the scene of interest. Similarly, datasets like Ego-Exo4D [20], DROID [27] and H2O [29] explore sparse-view capture for dynamic scenes in-the-wild.

Novel-view synthesis from sparse views.

Both NeRF and 3D Gaussian Splatting require dense input view coverage, which hinders their real-world applicability. Recent works aim to reduce the number of required input views by adding additional supervision and regularization, such as depth [8] or semantics [65]. FSGS [72] builds on Gaussian splatting by producing faithful static geometry from as few as three views by unpooling existing Gaussians and adopting extra depth supervision. Other recent work [60] instead adds noise to Gaussian attributes and relies on a pre-trained ControlNet [68] to repair low-quality rendered images. Other works such as MVSplat [5] build a cost volume representation and predict Gaussian attributes in a feed-forward manner. However, they can only synthesize novel views with small deviations from the nearest training view. For methods that rely on learned priors, high-quality novel view synthesis is often limited to images within the training distribution. Such methods cannot handle diverse real-world geometry. Diffusion-based reconstruction methods [19, 59, 69] try to generate additional views consistent with the sparse input views, but often produce artifacts. In our case, the four sparse-view cameras are separated by around 90°, posing unique challenges.

Figure 3: Approach. Given sparse-view video sequences of a scene (left), we aim to optimize a 3D gaussian representation over time. We begin by running DUSt3R [56], a static multi-view reconstruction method, on the sparse views of a given reference timestamp. This generates a global reference frame that connects all views. Next, we use MoGe [55] to independently predict depth maps for each camera. Since these depth predictions are only defined up to an affine transformation, we must estimate a scale and shift for each predicted depth map across all views and time instants. To achieve this, we leverage the fact that background pixels remain static over time. Specifically, for each time instant and each view, we align the background regions of each camera’s depth map to the global reference frame by adjusting the scale and shift parameters accordingly (middle, top). This process requires a foreground-background mask for all input videos (which can be obtained using off-the-shelf tools like SAM 2 [47]). To reduce occlusions and noisy depth predictions, we concatenate all aligned background depth points, and average corresponding background points (where correspondence across time is trivially given by the 2D pixel index of the unprojected pointmap) across time. Lastly, we find that motion bases constructed from feature-clustering form a more geometrically consistent set of bases (middle, bottom), than those initialized by noisy 3D tracks [54]. Our optimization yields a 4D scene representation from which we can rasterize RGB frames, depth maps, a foreground silhouette, and object features from novel views (right).

Feed-forward geometry estimation.

Learning-based methods, such as monocular depth networks, are able to reconstruct 3D objects and scenes by learning strong priors from training data. While early works [10, 14] focus on in-domain depth estimation, recent works build foundational depth models by scaling up training data [46, 63, 55], resolving metric ambiguity from various camera models [21, 44], or relying on priors such as Stable Diffusion [48, 24, 15]. Unfortunately, monocular depth networks are not scale or view consistent, and often require extensive alignment against ground-truth to produce meaningful metric outputs. To address these shortcomings, DUSt3R [56] and MonST3R [67] propose the task of point map estimation, which aims to recover scene geometry as well as camera intrinsics and extrinsics given a pair of input images. These methods unify single-view and multi-view geometry estimation, and enable consistent depth estimation across either time or space.

3 Towards Sparse-View 4D Reconstruction

Given sparse-view (i.e. 3 – 4) videos from stationary cameras as input, our method recovers the geometry and motion of a dynamic 3D scene (Fig. 3). We model the scene as canonical 3D Gaussians (Sec. 3.1), which translate and rotate via a linear combination of motion bases. We initialize consistent scene geometry by carefully aligning geometry predictions from multiple views (Sec. 3.2), and initialize motion trajectories by clustering per-point 3D semantic features distilled from 2D foundation models (Sec. 3.3). We formulate a joint optimization which simultaneously recovers geometry and motion (Sec. 3.4).

3.1 3D Gaussian Scene Representation

We represent the geometry and appearance of dynamic 3D scenes using 3D Gaussian Splatting [26], due to its efficient optimization and rendering. Each Gaussian in the canonical frame $t_0$ is parameterized by $(\mathbf{x}_0, \mathbf{R}_0, \mathbf{s}, \alpha, \mathbf{c})$, where $\mathbf{x}_0 \in \mathbb{R}^3$ is the Gaussian’s position in the canonical frame, $\mathbf{R}_0 \in \mathbb{SO}(3)$ is the orientation, $\mathbf{s} \in \mathbb{R}^3$ is the scale, $\alpha \in \mathbb{R}$ is the opacity, and $\mathbf{c} \in \mathbb{R}^3$ is the color. The position and orientation are time-dependent, while the scale, opacity, and color are persistent over time. We additionally assign a semantic feature $\mathbf{f} \in \mathbb{R}^N$ to each Gaussian (Sec. 3.3), where $N = 32$ is an arbitrary number representing the embedding dimension of the feature. Empirically, we find that fixing the color and opacity of Gaussians results in better performance.
In summary, for the $i$-th 3D Gaussian, the optimizable attributes are given by $\Theta^{(i)} = \{\mathbf{x}^{(i)}_0, \mathbf{R}^{(i)}_0, \mathbf{s}^{(i)}, \mathbf{f}^{(i)}\}$. Following [71], the optimized Gaussians are rendered from a given camera to produce an RGB image and a feature map using a tile-based rasterization procedure.

3.2 Space-Time Consistent Depth Initialization

Similar to recent methods [51, 54], we rely on data-driven monocular depth priors to initialize the position and appearance of 3D Gaussians over time. Given the success of initializing 3DGS with monocular depth in single-view settings [54], one might think to naturally extend this to multi-view settings by independently initializing from monocular depth for each view. However, this yields conflicting geometry signals, as monocular depth estimators commonly predict up to an unknown scale and shift factor. Thus, the unprojected monocular depths from separate views are often inconsistent, resulting in duplicated object parts.

Multi-view pointmap prediction.

DUSt3R [56] predicts multi-view consistent pointmaps across $K$ input images by first inferring pairwise pointmaps, followed by a global 3D optimization that searches for per-image pointmaps and pairwise similarity transforms (rotation, translation, and scale) that best align all pointmaps with each other.

We run DUSt3R on the multi-view images at time $t$, but constrain the global optimization to be consistent with the $K$ known stationary camera extrinsics $\{\mathbf{P}_k\}$ and intrinsics $\{\mathbf{K}_k\}$. This produces per-image global pointmaps $\{\chi^t_k\}$ in metric coordinates. One can then compute a depth map by simply projecting each pointmap back to each image with the known cameras:

$$ d^t_k(u,v) \begin{bmatrix} u & v & 1 \end{bmatrix}^T = \mathbf{K}_k \mathbf{P}_k \chi^t_k(u,v) \tag{1} $$

This produces metric-scale multi-view consistent depth maps $d^t_k(u,v)$, which are still not consistent over time.
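As a rough illustration of Eq. 1, the NumPy sketch below projects a world-space pointmap into a camera and reads off per-pixel depth, and also shows the inverse unprojection. The function names are hypothetical, and the sketch assumes that $\mathbf{P}_k$ is a 3×4 world-to-camera matrix $[\mathbf{R} \mid \mathbf{t}]$; the convention used in the released code may differ.

```python
import numpy as np

def pointmap_to_depth(pointmap, K, P):
    """Depth map (H, W) from a world-space pointmap (H, W, 3), following Eq. 1.

    K is the (3, 3) intrinsics matrix; P is assumed to be a (3, 4)
    world-to-camera matrix [R | t] (a convention chosen for this sketch).
    """
    H, W, _ = pointmap.shape
    ones = np.ones((H * W, 1))
    pts_h = np.concatenate([pointmap.reshape(-1, 3), ones], axis=1)  # (N, 4)
    proj = (K @ P @ pts_h.T).T  # each row is d * [u, v, 1]
    return proj[:, 2].reshape(H, W)  # third component is the depth d


def depth_to_pointmap(depth, K, P):
    """Inverse operation: unproject a depth map (H, W) to world points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # u indexes columns, v rows
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)  # camera-space points
    R, t = P[:, :3], P[:, 3]
    world = (cam - t) @ R  # row-vector form of R^T (x_cam - t)
    return world.reshape(H, W, 3)
```

Unprojecting a depth map and projecting it again with the same camera should return the original depth, which is a convenient sanity check for the chosen convention.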

Spatio-temporal alignment of monocular depth with multi-view consistent pointmaps.

In fact, beyond temporal inconsistency, such multi-view predictors tend to underperform on humans, since they are trained on multi-view data where dynamic humans are treated as outliers. In contrast, we find monocular depth estimators such as MoGe [55] to be far more accurate, but their predictions are not metric (they are accurate only up to an affine transformation) and are not guaranteed to be consistent across views or times. Our strategy is therefore to use the multi-view depth maps from DUSt3R as a metric target to align monocular depth predictions, which we write as $m^t_k(u,v)$. Specifically, we search for scale and shift factors $a^t_k$ and $b^t_k$ that minimize the following error:

$$ \operatorname*{arg\,min}_{\{a^t_k, b^t_k\}} \sum_{t=1}^{T} \sum_{k=1}^{K} \sum_{u,v \in \text{BG}^t_k} \left\lVert (a^t_k m^t_k(u,v) + b^t_k) - d^t_k(u,v) \right\rVert^2 \tag{2} $$

where $\text{BG}^t_k$ refers to a pixelwise background mask for camera $k$ at frame $t$. The above uses metric background points as a target for aligning all monocular depth predictions. This optimization can be solved efficiently, since each time $t$ and view $k$ can be optimized independently with a simple least-squares solver (implying that our approach scales easily to long videos). However, it still produces scale factors that are not temporally consistent, since the targets are themselves temporally inconsistent. We therefore exploit the constraint that background points should be static across time for stationary cameras: we replace $d^t_k(u,v)$ with a static target $d_k(u,v)$ obtained by averaging depth maps over time or by selecting a canonical reference timestamp. The final set of scaled time- and view-consistent depth maps is then unprojected back to 3D pointmaps. Note that this tends to produce accurate predictions for static background points, but the dynamic foreground may remain noisy because it cannot be denoised by simple temporal averaging. Instead, we rely on motion-based 3DGS optimization to enforce smoothness of the foreground, described next.
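Because Eq. 2 decouples over $t$ and $k$, each scale and shift pair has a closed-form least-squares solution. The sketch below illustrates one such per-frame solve and a simple temporal average for the static target; the function names are hypothetical and the masking details are assumptions of this sketch, not a description of the released implementation.

```python
import numpy as np

def align_scale_shift(mono, target, bg_mask):
    """Solve min over (a, b) of sum_{bg} (a * mono + b - target)^2 for one frame and view."""
    m = mono[bg_mask].ravel()
    d = target[bg_mask].ravel()
    A = np.stack([m, np.ones_like(m)], axis=1)  # design matrix with columns [m, 1]
    (a, b), *_ = np.linalg.lstsq(A, d, rcond=None)
    return a, b


def static_background_target(depths, bg_masks):
    """Average metric depth over time at background pixels.

    depths, bg_masks: arrays of shape (T, H, W). Pixels that are never
    background are set to 0 here (an arbitrary choice for this sketch).
    """
    counts = bg_masks.sum(axis=0)
    summed = (depths * bg_masks).sum(axis=0)
    return np.where(counts > 0, summed / np.maximum(counts, 1), 0.0)
```

In use, one would compute the static target once per camera and then call `align_scale_shift(mono_t, static_target, bg_mask_t)` for every frame, applying `a * mono_t + b` to the full monocular depth map (foreground included).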

During our experiments, we identified two additional limitations that significantly impact visual quality.
(1) Scale initialization: We observed that initializing 3D Gaussian scales with $k$-nearest neighbors often results in poor appearance, such as extremely large Gaussians filling empty space and blurring the background. To address this, we follow SplaTAM [25] and initialize each Gaussian scale based on its projected pixel area: $\text{scale} = \frac{d}{0.5 (f_x + f_y)}$, where $d$ is a pixel’s depth and $f_x, f_y$ are focal lengths.
(2) Insufficient Gaussian density: Using only one Gaussian per input pixel fails to adequately capture fine details. We instead initialize 5 Gaussians per input pixel, providing better representation of fine details.
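The scale rule in (1) can be read as the metric footprint of one pixel at depth $d$: a pixel subtends roughly $d / f$ units at that depth. A minimal sketch follows (the function name is hypothetical, and how the five Gaussians per pixel are placed is not covered here):

```python
import numpy as np

def init_gaussian_scale(depth, fx, fy):
    """Initial Gaussian scale from pixel depth and focal lengths (in pixels)."""
    return np.asarray(depth, dtype=np.float64) / (0.5 * (fx + fy))

# Example: a pixel at depth 2.0 with fx = fy = 500 gets scale 2.0 / 500 = 0.004.
print(init_gaussian_scale(2.0, 500.0, 500.0))
```

Distant background pixels therefore start with proportionally larger Gaussians, while nearby surfaces start with small ones.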

Figure 4: Qualitative analysis of held-out view synthesis on ExoRecon. We show qualitative results of held-out view synthesis (left) and a 5° deviation from the static camera position at the held-out timestamp (right). As compared to other multi-view baselines, our method does dramatically better at interpolating the motion of dynamic foreground (left), even from new camera views (right). We posit that Dynamic 3DGS suffers because of lack of geometric constraints and MV-SOM has duplicate foreground artifacts because of conflicting depth initialization from the four views.

3.3 Grouping-based Motion Initialization

Beyond initializing time- and view-consistent geometry in the canonical frame, we also aim to initialize reasonable estimates of the scene motion. We model a dynamic 3D scene as a set of $\mathcal{N}$ canonical 3D Gaussians, along with time-varying rigid transformations $\mathbf{T}_{0\to t} = [\mathbf{R}_{0\to t} \; \mathbf{t}_{0\to t}] \in \mathbb{SE}(3)$ that warp from canonical space to time $t$:

$$ \mathbf{x}_t = \mathbf{R}_{0\to t} \mathbf{x}_0 + \mathbf{t}_{0\to t}, \quad \mathbf{R}_t = \mathbf{R}_{0\to t} \mathbf{R}_0 \tag{3} $$

Motion bases.

Similar to Shape of Motion [54], we make the observation that in most dynamic scenes, the underlying 3D motion is often low-dimensional, and composed of simpler units of rigid motion. For example, the forearms tend to move together as one rigid unit, despite being composed of thousands of distinct 3D Gaussians. Rather than storing independent 3D motion trajectories for each 3D Gaussian $(i)$, we define a set of $B$ learnable basis trajectories $\{\mathbf{T}^{(i,b)}_{0\to t}\}_{b=1}^{B}$. The time-varying rigid transforms are written as a weighted combination of basis trajectories, using fixed per-point basis coefficients $\{w^{(i,b)}\}_{b=1}^{B}$:

$$ \mathbf{T}^{(i)}_{0\to t} = \sum_{b=1}^{B} \mathbf{w}^{(i,b)} \mathbf{T}^{(i,b)}_{0\to t} \tag{4} $$
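To make Eqs. 3 and 4 concrete, the sketch below blends a shared set of basis transforms with per-Gaussian weights and then warps canonical Gaussians to time $t$. It is a literal evaluation of the two equations under stated assumptions: transforms are stored as 3×4 matrices $[\mathbf{R} \mid \mathbf{t}]$, the bases are shared across Gaussians, and the blended rotation block is not re-projected onto $\mathbb{SO}(3)$, which a practical implementation would likely need to handle. All names are hypothetical.

```python
import numpy as np

def blend_transforms(weights, bases):
    """Eq. 4 for one timestep: weights (N, B), bases (B, 3, 4) -> per-Gaussian (N, 3, 4)."""
    return np.einsum("nb,bij->nij", weights, bases)


def warp_gaussians(x0, R0, T):
    """Eq. 3: move canonical positions x0 (N, 3) and orientations R0 (N, 3, 3) with T (N, 3, 4)."""
    R, t = T[:, :, :3], T[:, :, 3]
    xt = np.einsum("nij,nj->ni", R, x0) + t   # x_t = R x_0 + t
    Rt = np.einsum("nij,njk->nik", R, R0)     # R_t = R R_0
    return xt, Rt
```

For example, with one identity basis and one basis that translates by one unit along x, a Gaussian with weights (0.5, 0.5) moves by half a unit along x.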

Motion bases via feature clustering.

Unlike Shape of Motion which initializes motion bases by clustering 3D tracks, our key insight is that semantically grouping similar scene parts together can help regularize dynamic scene motion, without ever initializing trajectories from noisy 3D track predictions. Inspired by the success of robust and universal feature descriptors [42], we obtain pixel-level features for each input image by evaluating DINOv2 on an image pyramid. We average features across pyramid levels and reduce the dimension to 32 via PCA [1]. We choose the small DINOv2 model with registers, as it produces fewer peaky feature artifacts [6].

Given the consistent pixel-aligned pointmaps $\chi_{t,k}^{\text{(time+view)}}$, we associate each pointmap with the 32-dim feature map $\mathbf{f}_{t,k}$ computed from the corresponding image. We perform k-means clustering on per-point features $\mathbf{f}$ to produce $b$ initial clusters of 3D points. After initializing 3D Gaussians from pointmaps, we set the motion basis weight $\mathbf{w}^{(i,b)}$ to be the L2 distance between the cluster center and 3D Gaussian center. We initialize the basis trajectories $\mathbf{T}^{(b)}_{0\to t}$ to be identity, and optimize them via differentiable rendering.
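The grouping step can be sketched as follows: cluster per-point features with k-means, compute each cluster's 3D centroid, and measure the L2 distance from every Gaussian center to every centroid. This is only an illustration. The k-means below is a plain Lloyd's iteration with a deterministic farthest-point initialization (an arbitrary choice), and `distances_to_weights` is a purely hypothetical normalization (a softmax over negative distances) that is not taken from the text, which only states that the weight is set from the L2 distance.

```python
import numpy as np

def kmeans(features, num_clusters, iters=20):
    """Lloyd's k-means on (N, D) features with farthest-point initialization."""
    features = np.asarray(features, dtype=np.float64)
    centers = [features[0]]
    for _ in range(num_clusters - 1):
        d = np.linalg.norm(features[:, None] - np.stack(centers)[None], axis=-1).min(axis=1)
        centers.append(features[d.argmax()])  # next center: point farthest from existing ones
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.linalg.norm(features[:, None] - centers[None], axis=-1).argmin(axis=1)
        for c in range(num_clusters):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = features[labels == c].mean(axis=0)
    labels = np.linalg.norm(features[:, None] - centers[None], axis=-1).argmin(axis=1)
    return labels, centers


def cluster_distances(points, labels, num_clusters):
    """L2 distance (N, B) from each 3D point to each cluster's 3D centroid.

    Assumes every cluster is non-empty.
    """
    centroids = np.stack([points[labels == c].mean(axis=0) for c in range(num_clusters)])
    return np.linalg.norm(points[:, None] - centroids[None], axis=-1)


def distances_to_weights(dist):
    """Hypothetical normalization: softmax over negative distances (rows sum to 1)."""
    logits = -dist
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)
```

Clustering in feature space rather than on 3D tracks means that no trajectory estimates are needed at initialization, which matches the motivation given above.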

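| Dataset | Method | Full Frame: PSNR ↑ | Full Frame: SSIM ↑ | Full Frame: LPIPS ↓ | Full Frame: AbsRel ↓ | Dynamic Only: PSNR ↑ | Dynamic Only: SSIM ↑ | Dynamic Only: LPIPS ↓ | Dynamic Only: IOU ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Panoptic Studio | SOM [54] | 17.86 | 0.687 | 0.460 | 0.491 | 18.75 | 0.701 | 0.236 | 0.358 |
| Panoptic Studio | Dyn3D-GS [38] | 25.37 | 0.831 | 0.266 | 0.207 | 26.11 | 0.862 | 0.129 | - |
| Panoptic Studio | MV-SOM [54] | 26.28 | 0.858 | 0.241 | 0.331 | 26.80 | 0.883 | 0.161 | 0.886 |
| Panoptic Studio | MonoFusion | 28.01 | 0.899 | 0.117 | 0.149 | 27.52 | 0.944 | 0.022 | 0.965 |
| ExoRecon | SOM [54] | 14.73 | 0.535 | 0.482 | 0.843 | 15.63 | 0.559 | 0.450 | 0.294 |
| ExoRecon | Dyn3D-GS [38] | 24.28 | 0.692 | 0.539 | 0.612 | 24.61 | 0.673 | 0.384 | - |
| ExoRecon | MV-SOM-DS [54] | 28.37 | 0.906 | 0.079 | 0.398 | 28.23 | 0.931 | 0.063 | 0.872 |
| ExoRecon | MV-SOM [54] | 26.91 | 0.890 | 0.138 | 0.474 | 27.31 | 0.919 | 0.078 | 0.845 |
| ExoRecon | MonoFusion | 30.43 | 0.927 | 0.061 | 0.290 | 29.71 | 0.947 | 0.017 | 0.963 |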
Table 1: Quantitative analysis of held-out view synthesis. We benchmark our method against state-of-the-art approaches by evaluating the novel-view rendering and geometric quality on both the dynamic foreground region and the entire scene, across the held-out frames from input videos. MV-SOM is a multi-view version of Shape of Motion [54] that we construct by instantiating four instances of single-view Shape of Motion and optimizing them together. On Panoptic Studio, ground-truth depth for computing the AbsRel metric is obtained from 27-view optimization of the original Dynamic 3DGS, and for ExoRecon, we project the released point clouds obtained via SLAM from Aria glasses. When evaluating the single-view baseline SOM [54], we naively aggregate its predictions from the four views and evaluate this aggregated prediction against the evaluation cameras.

3.4 Optimization

As observed in prior work [32, 17], using photometric supervision alone is insufficient to avoid bad local minima in a sparse-view setting. Our final optimization procedure is a combination of photometric losses, data-driven priors, and regularizations on the learned geometry and motions.

During each training step, we sample a random timestep $t$ and camera $k$. We render the image $\mathbf{\hat{I}}_{t,k}$, mask $\mathbf{\hat{M}}_{t,k}$, features $\mathbf{\hat{F}}_{t,k}$, and depth $\mathbf{\hat{D}}_{t,k}$. We compute reconstruction loss by comparing to off-the-shelf estimates:

$$ \mathcal{L}_{\text{recon}} = \left\lVert \mathbf{\hat{I}} - \mathbf{I} \right\rVert_1 + \lambda_{\text{m}} \left\lVert \mathbf{\hat{M}} - \mathbf{M} \right\rVert_1 + \lambda_{\text{f}} \left\lVert \mathbf{\hat{F}} - \mathbf{F} \right\rVert_1 + \lambda_{\text{d}} \left\lVert \mathbf{\hat{D}} - \mathbf{D} \right\rVert_1 \tag{5} $$
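A direct transcription of Eq. 5 might look like the sketch below. The dictionary layout and function names are hypothetical, the $\lambda$ weights are left as arguments because their values are not given here, and the L1 norm is taken as a plain sum of absolute differences (a mean-reduced variant would differ only by a constant factor per term).

```python
import numpy as np

def l1(a, b):
    """L1 norm of the difference between two arrays of equal shape."""
    return np.abs(a - b).sum()


def reconstruction_loss(pred, target, lam_m, lam_f, lam_d):
    """Eq. 5. pred and target are dicts with keys 'image', 'mask', 'features', 'depth'."""
    return (l1(pred["image"], target["image"])
            + lam_m * l1(pred["mask"], target["mask"])
            + lam_f * l1(pred["features"], target["features"])
            + lam_d * l1(pred["depth"], target["depth"]))
```

Here `target["mask"]`, `target["features"]`, and `target["depth"]` would hold the off-the-shelf estimates mentioned above, while `target["image"]` is the observed frame.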

We additionally enforce a rigidity loss between randomly sampled dynamic Gaussians and their $k$ nearest neighbors. Let $\mathbf{\hat{X}}_t$ denote the location of a 3D Gaussian at time $t$, and let $\mathbf{\hat{X}}_{t'}$ denote its location at time $t'$. Over neighboring 3D Gaussians $i$, we define:

$$ \mathcal{L}_{\text{rigid}} = \sum_{\text{neighbors } i} \left\lVert \mathbf{\hat{X}}_t - \mathbf{\hat{X}}^{(i)}_t \right\rVert_2^2 - \left\lVert \mathbf{\hat{X}}_{t'} - \mathbf{\hat{X}}^{(i)}_{t'} \right\rVert_2^2 \tag{6} $$
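The sketch below illustrates a distance-preservation penalty in the spirit of Eq. 6. It makes two assumptions that go beyond the text: neighbors are found by brute force at time $t$, and the absolute value of each difference is taken so that the penalty is non-negative (Eq. 6 as written sums signed differences). It also evaluates all Gaussians rather than a random sample, and the dense N×N distance matrix is only practical for small point sets.

```python
import numpy as np

def rigidity_loss(x_t, x_tp, k=4):
    """Penalize changes in squared distance between each point and its k nearest neighbors.

    x_t, x_tp: (N, 3) positions of the same Gaussians at times t and t'.
    """
    d_t = np.sum((x_t[:, None] - x_t[None]) ** 2, axis=-1)     # (N, N) squared distances at t
    d_tp = np.sum((x_tp[:, None] - x_tp[None]) ** 2, axis=-1)  # (N, N) squared distances at t'
    nn = np.argsort(d_t, axis=1)[:, 1:k + 1]                   # k nearest neighbors, skipping self
    rows = np.arange(len(x_t))[:, None]
    return np.abs(d_t[rows, nn] - d_tp[rows, nn]).sum()
```

A rigid motion of the whole point set (rotation plus translation) leaves all pairwise distances unchanged, so the penalty is zero up to floating-point error, whereas stretching the point set is penalized.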

4 Experimental Results

Implementation details.

We optimize our representation with Adam [28]. We use 18k Gaussians for the foreground and 1.2M for the background. We fix the number of $\mathbb{SE}(3)$ motion bases to 28 and obtain these from feature clustering (Sec. 3.3). For the depth alignment, we use points above the confidence threshold of 95%. We show results on seven 10-second sequences at 30 fps with a resolution of 512 × 288. Training takes about 30 minutes on a single NVIDIA A6000 GPU. Our rendering speed is about 30 fps.

Datasets.

We conduct qualitative and numerical evaluation on Panoptic Studio [23] and a subset of Ego-Exo4D [20] which we call ExoRecon.

Panoptic Studio is a massively multi-view capture system which consists of 480 video streams of humans performing skilled activities. Out of these 480 views, we manually select 4 camera views 90° apart to simulate the same exocentric camera setup as Ego-Exo4D. Given these 4 training cameras, we find 4 other intermediate cameras 45° apart from the training views, and use these for evaluating novel-view synthesis from 45° camera views.
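The train/evaluation split described above can be illustrated with a small sketch. The paper selects cameras manually; the code below is only a hypothetical automated stand-in that assumes each camera is summarized by a single azimuth angle around the scene. The helper `select_cameras` and the example rig of 24 evenly spaced cameras are assumptions made for illustration.

```python
import numpy as np

def select_cameras(azimuths_deg, spacing=90.0, offset=0.0):
    """For each target angle (offset, offset + spacing, ...), return the
    index of the camera whose azimuth is closest on the circle."""
    az = np.asarray(azimuths_deg, dtype=float) % 360.0
    targets = (offset + np.arange(0.0, 360.0, spacing)) % 360.0
    # Circular angular difference in [0, 180] degrees, shape (targets, cameras).
    diff = np.abs((az[None, :] - targets[:, None] + 180.0) % 360.0 - 180.0)
    return diff.argmin(axis=1)

# Hypothetical rig: 24 cameras spaced 15 degrees apart.
azimuths = np.arange(0, 360, 15)
train_idx = select_cameras(azimuths, spacing=90.0, offset=0.0)   # 4 training views
eval_idx = select_cameras(azimuths, spacing=90.0, offset=45.0)   # 4 views, 45 degrees off
```

With this toy rig, every evaluation camera sits exactly halfway between two adjacent training cameras, which mirrors the 45° novel-view setting used for evaluation.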

For in-the-wild evaluation of sparse-view reconstruction, we repurpose Ego-Exo4D [20], which includes sparse-view videos of skilled human activities. While many Ego-Exo4D scenarios are out of scope for dynamic reconstruction with existing methods (due to fine-grained object motion, specular surfaces, or excessive scene clutter), we find one scene each from the 6 different scenarios in Ego-Exo4D with considerable object motion: dance, sports, bike repair, cooking, music, healthcare. For each scene, we extract 300 frames of synchronized RGB video streams, captured from 4 different cameras with known parameters. We remove fisheye distortions from all RGB videos and assume a simple pinhole camera model after undistortion. We call this subset ExoRecon, and show results on these sequences. Please see the appendix for more visuals.

Method PSNR ↑ SSIM ↑ LPIPS ↓ IoU ↑ AbsRel ↓
SOM 16.73 0.554 0.491 0.287 0.578
Dyn3D-GS 23.31 0.776 0.316 0.273
MV-SOM 21.56 0.541 0.433 0.482 0.413
MonoFusion 25.73 0.847 0.158 0.943 0.188
Table 2: Quantitative analysis of 45° novel-view synthesis on Panoptic Studio. We benchmark our method against state-of-the-art approaches by evaluating both the dynamic foreground region and the entire scene. Notably, the evaluation is conducted on novel views where the cameras are at least 45° apart from all training views. We additionally evaluate the geometric reconstruction quality with absolute relative (AbsRel) error in rendered depth.
Figure 5: Qualitative results of 45° novel-view synthesis on Panoptic Studio. We compare our method to baselines on the softball (left) and tennis (right) sequences. The ground-truth RGB image for the 45° novel view is shown at the top. Our rendered extreme novel-view RGB image closely matches the ground truth, whereas all other baselines struggle to generalize to extreme novel views.
Figure 6: Qualitative results of 45° extreme novel-view synthesis on ExoRecon (1/2). We visualize the rasterized RGB image and depth map from each method for 4 diverse Ego-Exo4D sequences. Existing monocular methods (Row 2, “SOM”) and their extension to multi-view (Row 3, “MV-SOM”) produce poor results when rendered from a drastically different novel view. MV-SOM improves upon SOM by optimizing a 4D scene representation with four view constraints, but it still suffers from duplication artifacts. Our method’s careful point cloud initialization and feature-based motion bases further improve on MV-SOM. Even after running MV-SOM with multi-view-consistent depth from DUSt3R (Row 4, “MV-SOM-DS”), we find that it still fails due to reduced depth quality, often caused by suboptimal pairwise depth predictions on humans. Please see the appendix for more baseline comparisons: we find that multi-view diffusion methods contain additional hallucinations and imperfect alignment between different input views, and per-frame sparse-view 3D reconstruction methods suffer from temporal inconsistency, blurry reconstructions, and missing details.

Metrics.

We follow prior work [38, 62] in evaluating the perceptual and geometric quality of our reconstructions using PSNR, SSIM, LPIPS and absolute relative (AbsRel) error in depth. We compute these metrics on the entire image, and also on only the foreground region of interest. We additionally evaluate the quality of the dynamic foreground silhouette by reporting mask IoU, computed as $(\mathbf{\hat{M}}\,\&\,\mathbf{M})/(\mathbf{\hat{M}}\,||\,\mathbf{M})$. Similar to prior work [62], our evaluation views are a set of held-out frames, subsampled from the input videos from 4 exocentric cameras, in both Panoptic Studio and ExoRecon.
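For concreteness, three of these metrics can be sketched in a few lines of NumPy. This is a minimal sketch using the standard textbook definitions, not the authors' evaluation code: the function names, the optional `mask` argument for foreground-only evaluation, and the default validity rule (`gt_depth > 0`) are assumptions. SSIM and LPIPS are omitted because they need a windowed filter and a pretrained network, respectively.

```python
import numpy as np

def psnr(pred, gt, mask=None, max_val=1.0):
    """Peak signal-to-noise ratio; images in [0, max_val].
    If a boolean (H, W) mask is given, only those pixels are scored."""
    if mask is not None:
        pred, gt = pred[mask], gt[mask]
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union > 0 else 1.0

def abs_rel(pred_depth, gt_depth, valid=None):
    """Mean of |d_pred - d_gt| / d_gt over valid pixels."""
    if valid is None:
        valid = gt_depth > 0  # assumption: non-positive depth marks invalid pixels
    err = np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid]
    return float(err.mean())
```

Passing the foreground mask to `psnr` gives the foreground-only variant, while omitting it scores the whole image.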

Note that since the cameras in our setup are stationary, the above evaluation only analyzes the interpolation quality of different methods. To test extrapolation more explicitly, we also benchmark novel-view synthesis on Panoptic Studio with an evaluation camera placed 45° away from the training cameras. Since such a ground-truth evaluation camera is not available in ExoRecon, we only show qualitative results.

Baselines.

We compare our method with prior work on dynamic scene reconstruction from single or multiple views. Among methods that operate on monocular videos, we run Shape of Motion [54] on 8 scenes from Panoptic Studio, following the setup of Dynamic 3D Gaussians [38], and on our curated ExoRecon dataset, which covers 6 diverse scenes. Finally, we consider two multi-view dynamic reconstruction baselines: Dynamic 3D Gaussians [38] and a naive multi-view extension of Shape of Motion (MV-SOM). To construct the latter baseline, we simply concatenate the Gaussians and motion bases from four independently initialized instances of single-view SOM, and optimize all four instances jointly. We also evaluate a variant of MV-SOM with globally consistent depth (denoted MV-SOM-DS), obtained by running per-frame DUSt3R on the 4 input views and fixing camera poses to ground truth during DUSt3R’s global alignment. Despite using the same hyperparameters as our method, MV-SOM-DS has more visual artifacts due to reduced depth quality, suggesting the importance of our DUSt3R+MoGe design. In the appendix, we verify that all baselines reconstruct reasonable training views.

4.1 Comparison to State-of-the-Art

Evaluation on held-out views. In Tab. 1, we compare our method to recent dynamic scene reconstruction baselines [67, 38, 54], following evaluation protocols from prior work [54, 62]. Our method beats prior art on both Panoptic Studio and ExoRecon (Fig. 4) datasets, when evaluated on held-out views across photometric (PSNR, SSIM, LPIPS) and geometric error (AbsRel) metrics. Note that when initializing Dynamic 3DGS [38] with 4 views we find that COLMAP fails, and so the point cloud initialization for this baseline is from a 27-view COLMAP optimization.

Interestingly, we find that although the monocular 4D reconstruction method Shape of Motion (SOM) [54] often fails to output accurate metric depth, it is robust to a limited camera shift. We hypothesize that the foundational priors of Shape of Motion allow it to produce reasonable results in under-constrained scenarios, while test-time optimization methods, especially ones that do not always rely on data-driven priors [38], can more easily fall into local optima (e.g. those caused by poor initialization) which are difficult to optimize out of via rendering losses alone.

Evaluation on a 45° novel view. On Panoptic Studio, we use the four evaluation cameras (placed 45° apart from the training views) to evaluate our method’s novel-view rendering capability. We also evaluate the novel-view rendered depth against a ‘pseudo-groundtruth’ depth obtained from optimizing Dynamic 3DGS [38] with all 24 training views. In Tab. 2 and Fig. 5, we find that our method outperforms all baselines, achieving state-of-the-art 45° novel-view synthesis. Qualitative results on ExoRecon are in Fig. 6 & 7.

4.2 Ablation Study

We ablate the design decisions in our pipeline in Tab. 3. Our proposed space-time consistent depth plays a crucial role in learning accurate scene geometry and appearance (yielding a 3.4 PSNR improvement, Row 1 vs 3). Next, we find that the feature-metric loss $\mathcal{L}_{\text{feat}}=\left\lVert\mathbf{\hat{F}}-\mathbf{F}\right\rVert$ provides a trade-off between learning photometric properties vs. learning foreground motion and silhouette. Although the PSNR decreases, we see an increase in mask IoU (Row 1 vs 2 and Row 3 vs 4). Freezing the color of all Gaussians across frames aids learning the motion mask, as measured by mask IoU. Finally, our motion bases from feature-clustering improve overall scene optimization (final row).

Method $\mathcal{L}_{\text{feat}}$ $\mathbf{d}_{n}$ $\mathbf{T}^{(b)}_{0\to t}$ PSNR ↑ SSIM ↑ LPIPS ↓ IoU ↑
Baseline 26.19 0.915 0.077 0.60
+ $\mathcal{L}_{\text{feat}}$ 25.39 0.933 0.087 0.63
+ Our depth / no $\mathcal{L}_{\text{feat}}$ 29.55 0.944 0.037 0.73
+ Our depth / $\mathcal{L}_{\text{feat}}$ 29.31 0.941 0.041 0.75
+ Motion bases (Ours) 30.40 0.947 0.037 0.81
Table 3: Ablation study of pipeline components. We ablate our choice of feature-metric loss, spacetime consistent depth, and feature-based motion bases. While the proposed depth and feature-based motion bases considerably improve 4D reconstruction (evaluated by photometric errors), we find that our feature loss helps learn better motion masks (evaluated by IoU).

Velocity-based vs. feature-based motion bases. In the monocular setting, we empirically found that both designs performed equally well. However, in our four-camera sparse-view setting, we found that feature-based motion bases perform much better than velocity-based motion bases. The reason is that for velocity-based motion bases, we infer 3D velocity by querying the 2D tracking results plus per-frame depth, following Shape of Motion [54]. Thus, noisy foreground depth estimates, where the estimated depth of the person flickers between foreground and background, degrade the quality of velocity-based motion bases and cause rigid body parts to move erratically. In contrast, feature-based motion bases, where features are initialized from more reliable image-level observations, are more robust to noisy 3D initialization and force semantically similar parts to move in similar ways. To validate this, in Fig. 8 we use PCA to visualize the inferred features and find that they are consistent not only along the temporal axis but also across cameras.
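To make the idea of feature-based grouping concrete, the sketch below clusters per-point feature vectors so that each cluster could seed one motion basis. It is a minimal sketch under stated assumptions: this section does not specify the clustering algorithm, so plain k-means (Lloyd's iterations) is our choice, and the function name `cluster_features` and its parameters are hypothetical. The 32-dimensional features and the 28 clusters mentioned in the usage note come from the text.

```python
import numpy as np

def cluster_features(features, num_clusters, num_iters=20, seed=0):
    """Plain k-means on (N, D) features.
    Returns (labels of shape (N,), centers of shape (num_clusters, D))."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    # Initialize centers from randomly chosen, distinct data points.
    centers = features[rng.choice(n, size=num_clusters, replace=False)].copy()

    def assign(c):
        # Squared distance from every point to every center, shape (N, K).
        d2 = ((features[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    for _ in range(num_iters):
        labels = assign(centers)
        for j in range(num_clusters):
            members = features[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return assign(centers), centers
```

For the setting described in this paper, one would call something like `cluster_features(feats, num_clusters=28)` on 32-dimensional features. Because the assignment depends only on the feature vector and not on 3D position, points with the same features always land in the same cluster, regardless of how noisy their estimated depth is.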

Figure 7: Qualitative results of 45° extreme novel-view synthesis on ExoRecon (2/2). We compare our method to baselines on challenging ExoRecon sequences: a highly dynamic, large scene with a small foreground (football, left) and a complex, highly occluded scene (bike repair, right). MonoFusion produces noticeably higher-quality renderings than the other baselines.

Effect of the number of motion bases. When the number of motion bases is not expressive enough (in our experience, fewer than 20), there are often obvious flaws in the reconstruction, such as missing arms or the two legs merging into a single leg. In practice, we do not observe that increasing the number of motion bases further hurts performance. Empirically, the capacity of our design (28 motion bases) can effectively handle different scene dynamics.

Figure 8: Spatial-Temporal Visualization of feature PCA. We perform PCA analysis and transform the 32-dim features from Sec. 3.3 down to 3 dimensions for visualization purposes. We find that the features are consistent across views and across time. Notably, when the person turns around between $t_{0}$ and $t_{1}$ in observations from $cam_{1}$ and $cam_{2}$, the features remain robust and consistent. The semantic consistency of features aids explainability, provides a strong visual cue for tracking, and gives confidence in our feature-guided motion bases.
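A visualization of this kind can be approximated with a short PCA routine. The sketch below is an illustrative assumption rather than the authors' plotting code: the function name `pca_to_rgb` and the min-max normalization to [0, 1] are ours. Colors are only comparable across views and timesteps if a single PCA is fit jointly on features pooled from all of them, so the sketch expects one stacked array.

```python
import numpy as np

def pca_to_rgb(features, out_dim=3):
    """Project (N, D) features onto their top `out_dim` principal
    components and rescale each component to [0, 1] for display."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:out_dim].T
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / np.maximum(hi - lo, 1e-8)
```

To color per-view feature maps, one would flatten and concatenate the features from every camera and timestep, call `pca_to_rgb` once, and then split the result back into per-view images.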

5 Conclusion

We address the problem of sparse-view 4D reconstruction of dynamic scenes. Existing multi-view 4D reconstruction methods are designed for dense multi-view setups (e.g. Panoptic Studio). In contrast, we aim to strike a balance between the ease and informativeness of multi-view data capture by reconstructing skilled human behaviors from four equidistant inward-facing static cameras. Our key insight is that carefully incorporating priors, in the form of monocular depth and feature-based motion clustering, is crucial. Our empirical analysis shows that on challenging scenes with object dynamics, we achieve state-of-the-art performance on novel space-time synthesis compared to prior art.

References

  • Amir et al. [2021] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep vit features as dense visual descriptors. arXiv preprint arXiv:2112.05814, 2021.
  • Carranza et al. [2003] Joel Carranza, Christian Theobalt, Marcus Magnor, and Hans-Peter Seidel. Free-viewpoint video of human actors. ACM Trans. Graph., 22:569–577, 2003.
  • Cen et al. [2023] Jiazhong Cen, Zanwei Zhou, Jiemin Fang, Wei Shen, Lingxi Xie, Xiaopeng Zhang, and Qi Tian. Segment anything in 3d with nerfs. arXiv preprint arXiv:2304.12308, 2023.
  • Chen and Wang [2024] Guikun Chen and Wenguan Wang. A survey on 3d gaussian splatting. arXiv preprint, 2024.
  • Chen et al. [2024] Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024.
  • Darcet et al. [2023] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers, 2023.
  • de Aguiar et al. [2008] Edilson de Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. SIGGRAPH, 2008.
  • Deng et al. [2021] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. arXiv preprint arXiv:2107.02791, 2021.
  • Dou et al. [2016] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, Pushmeet Kohli, Vladimir Tankovich, and Shahram Izadi. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 2016, 35, 2016.
  • Eigen et al. [2014] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. NeurIPS, pages 2366–2374, 2014.
  • Fan et al. [2023] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. Arctic: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12943–12954, 2023.
  • Fan et al. [2024] Zhiwen Fan, Kairun Wen, Wenyan Cong, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, and Yue Wang. Instantsplat: Sparse-view gaussian splatting in seconds, 2024.
  • Forsyth and Ponce [2003] David A Forsyth and Jean Ponce. A modern approach. Computer vision: a modern approach, 17:21–48, 2003.
  • Fu et al. [2018] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. CVPR, pages 2002–2011, 2018.
  • Fu et al. [2024] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In ECCV, 2024.
  • Gao et al. [2021a] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In IEEE International Conference on Computer Vision (ICCV), 2021a.
  • Gao et al. [2021b] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5712–5721, 2021b.
  • Gao et al. [2022] Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • Gao et al. [2024] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint 2405.10314, 2024.
  • Grauman et al. [2023] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, and Michael Wray. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives, 2023.
  • Hu et al. [2024] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. arXiv preprint arXiv:2404.15506, 2024.
  • Jensen et al. [2014] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engil Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413. IEEE, 2014.
  • Joo et al. [2017] Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Godisart, Bart Nabbe, Iain Matthews, et al. Panoptic studio: A massively multiview system for social interaction capture. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(1):190–204, 2017.
  • Ke et al. [2024] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • Keetha et al. [2024] Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. Splatam: Splat, track & map 3d gaussians for dense rgb-d slam, 2024.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 42(4):1–14, 2023.
  • Khazatsky et al. [2024] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kwon et al. [2021] Taein Kwon, Bugra Tekin, Jan Stühmer, Federica Bogo, and Marc Pollefeys. H2o: Two hands manipulating objects for first person interaction recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10138–10148, 2021.
  • Lei et al. [2024] Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. arXiv preprint arXiv:2405.17421, 2024.
  • Li et al. [2024a] Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20775–20785, 2024a.
  • Li et al. [2021] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6498–6508, 2021.
  • Li et al. [2023] Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. Dynibar: Neural dynamic image-based rendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Li et al. [2024b] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos. arXiv preprint arXiv:2412.04463, 2024b.
  • Liang et al. [2023] Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, and Lei Xiao. Gaufre: Gaussian deformation fields for real-time dynamic novel view synthesis. arXiv preprint, 2023.
  • Lin et al. [2025] Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, and Deva Ramanan. Towards understanding camera motions in any video. arXiv preprint arXiv:2504.15376, 2025.
  • Lu et al. [2024] Hao Lu, Tianshuo Xu, Wenzhao Zheng, Yunpeng Zhang, Wei Zhan, Dalong Du, Masayoshi Tomizuka, Kurt Keutzer, and Yingcong Chen. Drivingrecon: Large 4d gaussian reconstruction model for autonomous driving. arXiv preprint arXiv:2412.09043, 2024.
  • Luiten et al. [2024] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In 3DV, 2024.
  • Mildenhall et al. [2019] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020.
  • Newcombe et al. [2015] Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2023.
  • Perazzi et al. [2016] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Piccinelli et al. [2024] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • Piccinelli et al. [2025] Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler, 2025.
  • Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
  • Ravi et al. [2024] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
  • Rombach et al. [2021] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  • Song et al. [2023] Chonghyuk Song, Gengshan Yang, Kangle Deng, Jun-Yan Zhu, and Deva Ramanan. Total-recon: Deformable scene reconstruction for embodied view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17671–17682, 2023.
  • Stearns et al. [2024] Colton Stearns, Adam W. Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, and Leonidas Guibas. Dynamic gaussian marbles for novel view synthesis of casual monocular videos. arXiv preprint arXiv:2406.18717, 2024.
  • Szymanowicz et al. [2024] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction. 2024.
  • Tan et al. [2024] Jeff Tan, Donglai Xiang, Shubham Tulsiani, Deva Ramanan, and Gengshan Yang. Dressrecon: Freeform 4d human reconstruction from monocular video. arXiv preprint arXiv:2409.20563, 2024.
  • Wang et al. [2024a] Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, and Yu Xiang. Ho-cap: A capture system and dataset for 3d reconstruction and pose tracking of hand-object interaction. arXiv preprint arXiv:2406.06843, 2024a.
  • Wang et al. [2024b] Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. arXiv preprint arXiv:2407.13764, 2024b.
  • Wang et al. [2024c] Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. arXiv preprint arXiv:2410.19115, 2024c.
  • Wang et al. [2024d] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024d.
  • Wang et al. [2024e] Zihan Wang, Bowen Li, Chen Wang, and Sebastian Scherer. Airshot: Efficient few-shot detection for autonomous exploration. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11654–11661, 2024e.
  • Wu et al. [2023] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Wang Xinggang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint, 2023.
  • Wu et al. [2024] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors. In CVPR, 2024.
  • Yang et al. [2024a] Chen Yang, Sikuang Li, Jiemin Fang, Ruofan Liang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. Gaussianobject: High-quality 3d object reconstruction from four views with gaussian splatting. SIGGRAPH Asia, 2024a.
  • Yang et al. [2022] Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, and Hanbyul Joo. Banmo: Building animatable 3d neural models from many casual videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Yang et al. [2024b] Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Apoorva Sharma, Maximilian Igl, Peter Karkus, Danfei Xu, et al. Storm: Spatio-temporal reconstruction model for large-scale outdoor scenes. arXiv preprint arXiv:2501.00602, 2024b.
  • Yang et al. [2024c] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv:2406.09414, 2024c.
  • Yang et al. [2023] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint, 2023.
  • Yu et al. [2021] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 4578–4587, 2021.
  • Yu et al. [2024] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024.
  • Zhang et al. [2024] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.
  • Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In IEEE/CVF International Conference on Computer Vision, ICCV, pages 3813–3824, 2023.
  • Zhao et al. [2025] Qitao Zhao, Amy Lin, Jeff Tan, Jason Y. Zhang, Deva Ramanan, and Shubham Tulsiani. Diffusionsfm: Predicting structure and motion via ray origin and endpoint diffusion, 2025.
  • Zhou et al. [2025] Changshi Zhou, Rong Jiang, Feng Luan, Shaoqiang Meng, Zhipeng Wang, Yanchao Dong, Yanmin Zhou, and Bin He. Dual-arm robotic fabric manipulation with quasi-static and dynamic primitives for rapid garment flattening. IEEE/ASME Transactions on Mechatronics, pages 1–11, 2025.
  • Zhou et al. [2024] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. CVPR, 2024.
  • Zhu et al. [2023] Zehao Zhu, Zhiwen Fan, Yifan Jiang, and Zhangyang Wang. FSGS: real-time few-shot view synthesis using gaussian splatting. CoRR, abs/2312.00451, 2023.

Appendix

Appendix A Additional Baseline Comparisons

To highlight the challenge of sparse-view 4D reconstruction, we compare with five additional baselines for monocular 4D reconstruction and sparse-view reconstruction. Our method remains state-of-the-art.

Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Description
MV-MonST3R [67] | 12.64 | 0.475 | 0.574 | Monocular 4D reconstruction
MV-MegaSAM [34] | 14.25 | 0.578 | 0.412 | Monocular 4D reconstruction
ViewCrafter [66] | 24.97 | 0.834 | 0.135 | Multi-view diffusion
DNGaussian [31] | 27.34 | 0.897 | 0.103 | Sparse-view reconstruction
InstantSplat [12] | 29.11 | 0.909 | 0.082 | Sparse-view reconstruction
MonoFusion | 30.64 | 0.930 | 0.055 | Ours

Baseline: Monocular 4D reconstruction. We compare with MonST3R and with MegaSAM, which claims better results than MonST3R. For each method, we report the best result among three input settings: a single-view video, the 4 views concatenated into one longer video, and the 4 views interleaved (simulating a “rotating” camera). MonST3R and MegaSAM fail in sparse-view scenarios, especially with large viewpoint shifts. When given concatenated or interleaved sparse-view videos as input, both models simply copy the input frames onto flat planes: notice the image corners in the visual results. Given a single input video, both models miss large regions due to occlusions and unseen geometry past the video boundary. Without non-trivial handling of images from multiple wide-baseline input views, MonST3R and MegaSAM alone cannot produce view-consistent reconstructions.

Baseline: Multi-view diffusion. We use ViewCrafter, a video diffusion method for novel-view synthesis, to render novel views at each timestep. ViewCrafter’s rendered views are worse than our method’s outputs due to grid-like rendering artifacts, black missing regions of the scene, additional hallucinations, and imperfect alignment between different input views.

Baseline: Sparse-view methods. We compare with the alternative sparse-view methods InstantSplat and DNGaussian. Although they do not handle dynamic scenes, we run them independently per frame. Both methods suffer from blurry reconstructions and missing details, although InstantSplat benefits from the cross-view consistency provided by DUSt3R.

A qualitative analysis of all baselines for 5° and 45° novel view synthesis on the ExoRecon dataset is available in Figs. 10 and 11.
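The three input settings used for the monocular baselines (single view, concatenated views, interleaved views) can be illustrated with a minimal sketch. The function names and the toy (camera, time) frame labels are hypothetical, and the exact frame ordering used in the experiments is an assumption.

```python
from itertools import chain


def single_view(videos, view=0):
    """Return the frames of one camera only."""
    return list(videos[view])


def concatenate_views(videos):
    """Play the cameras back to back: all of view 0, then all of view 1, ..."""
    return list(chain.from_iterable(videos))


def interleave_views(videos):
    """Cycle through the cameras at every timestep ('rotating' camera)."""
    return [frame for frames_at_t in zip(*videos) for frame in frames_at_t]


# Toy example: 4 cameras, 3 timesteps; each "frame" is a (camera, time) label.
videos = [[(cam, t) for t in range(3)] for cam in range(4)]
```

In the interleaved setting, consecutive frames come from different cameras at the same timestep, so a monocular method sees a wide-baseline jump between every pair of adjacent frames.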

Appendix B Training View Renderings

To build confidence in our implementations, we validate every baseline we run by verifying that each method produces reasonable renderings at the training views (Fig. 9). Note that in each optimization iteration, we sample a batch of frames from the video and optimize the overall loss. Because this loss is averaged over all frames, the optimum is a global compromise, and some artifacts may remain in individual frames.
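A small sketch shows why a low averaged loss does not rule out artifacts in individual frames. The per-frame loss values, batch size, and function names below are hypothetical and chosen only for illustration.

```python
import random


def sample_frame_batch(num_frames, batch_size, rng):
    """Draw a batch of distinct frame indices for one optimization step."""
    return rng.sample(range(num_frames), batch_size)


def batch_loss(per_frame_loss, batch):
    """Average the per-frame losses over the sampled batch."""
    return sum(per_frame_loss[i] for i in batch) / len(batch)


# Hypothetical per-frame losses: frame 7 is an outlier with visible artifacts.
per_frame_loss = [0.01] * 10
per_frame_loss[7] = 0.20

rng = random.Random(0)
batch = sample_frame_batch(num_frames=10, batch_size=4, rng=rng)

mean_all = sum(per_frame_loss) / len(per_frame_loss)  # about 0.029
worst = max(per_frame_loss)                           # 0.20
```

The mean over all frames stays small even though one frame is roughly seven times worse than the average, which is the situation described above.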

Figure 9: Training view results. We visualize the rasterized RGB image and depth map from each method for the dancing (left) and bike repair (right) sequences. All methods are capable of producing reasonable training views and depth maps.
Figure 10: Qualitative results for extreme 45° novel view synthesis on ExoRecon. For each baseline considered in the main paper and in the appendix, we visualize the rasterized RGB image rendered at a novel view 45° away from one of the training views. The results suggest that prior work tuned for a dense camera setup struggles in the sparse-view case, and that methods tuned for monocular input cannot naively handle a multi-view setup. Our proposed MonoFusion successfully uses priors in the form of monocular depth and feature-based motion clustering to achieve the best of both worlds.

Appendix C Training Details

In Tab. 4, we report the learning rates and loss weights used in our Gaussian optimization. These hyperparameters are shared across every scene that we evaluate on. Specifically, $\mathcal{L}_{\text{smooth\_bases}}$ enforces smooth motion bases by penalizing high accelerations in rotations and translations. $\mathcal{L}_{\text{smooth\_tracks}}$ promotes smooth object tracks by penalizing large accelerations in object positions across frames. $\mathcal{L}_{\text{depth\_grad}}$ aligns the gradients of the predicted and ground truth depth maps to preserve structural details. $\mathcal{L}_{\text{z\_accel}}$ penalizes high accelerations along the depth axis to reduce jitter in depth estimation. $\mathcal{L}_{\text{scale\_var}}$ constrains the variance of the scale parameters of Gaussians to achieve consistent representations.
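To make the acceleration-based and variance-based regularizers concrete, the sketch below assumes that "acceleration" means a second-order finite difference over frames and that the penalty is a mean of squares. The exact norms used in the paper are not stated in this section, so both function names and both formulas are illustrative assumptions.

```python
def acceleration_penalty(track):
    """Mean squared second-order finite difference of a 1D trajectory.

    track[t] is a scalar value (e.g. the depth z of a point) at frame t.
    The acceleration at frame t is track[t-1] - 2*track[t] + track[t+1].
    """
    accels = [track[t - 1] - 2.0 * track[t] + track[t + 1]
              for t in range(1, len(track) - 1)]
    return sum(a * a for a in accels) / len(accels)


def scale_variance(scales):
    """Population variance of a list of per-Gaussian scale values."""
    mean = sum(scales) / len(scales)
    return sum((s - mean) ** 2 for s in scales) / len(scales)


linear = [0.0, 1.0, 2.0, 3.0, 4.0]   # constant velocity, zero acceleration
jittery = [0.0, 1.0, 0.0, 1.0, 0.0]  # frame-to-frame oscillation
```

Under these assumptions, the constant-velocity track incurs no penalty while the oscillating track is penalized heavily, which is the behavior expected of a term meant to reduce jitter.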

Table 4: Training hyperparameters: learning rates (left) and loss weights (right), shown row by row.
Parameter | FG LR | BG LR | Motion LR | Loss Param. | Weight
means | $1.6\times10^{-4}$ | $1.6\times10^{-4}$ | - | $w_{\text{rgb}}$ | 7.0
opacities | $1\times10^{-2}$ | $1\times10^{-2}$ | - | $w_{\text{mask}}$ | 5.0
scales | $5\times10^{-3}$ | $1\times10^{-3}$ | - | $w_{\text{feat}}$ | 7.0
quats | $1\times10^{-3}$ | $1\times10^{-3}$ | - | $w_{\text{smooth\_bases}}$ | 0.1
colors | 0 | $1\times10^{-2}$ | - | $w_{\text{depth\_reg}}$ | 1.0
feats | $1\times10^{-3}$ | $1\times10^{-3}$ | - | $w_{\text{smooth\_tracks}}$ | 2.0
motion_coefs | $1\times10^{-3}$ | - | - | $w_{\text{scale\_var}}$ (1) | 0.01
rots | - | - | $1.6\times10^{-4}$ | $w_{\text{z\_accel}}$ | 1.0
transls | - | - | $1.6\times10^{-4}$ | $w_{\text{track}}$ | 2.0
(1) Originally set to 0.01.
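The loss weights in Tab. 4 can be collected into a configuration and combined with the individual loss terms. The sketch below assumes the overall objective is a plain weighted sum; the dictionary layout, the `total_loss` helper, and the example loss values are hypothetical.

```python
# Loss weights copied from Table 4; the dictionary layout is illustrative.
LOSS_WEIGHTS = {
    "rgb": 7.0,
    "mask": 5.0,
    "feat": 7.0,
    "smooth_bases": 0.1,
    "depth_reg": 1.0,
    "smooth_tracks": 2.0,
    "scale_var": 0.01,
    "z_accel": 1.0,
    "track": 2.0,
}


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss terms (assumed form of the objective)."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Hypothetical loss values for one optimization step (all set to 0.1).
terms = {name: 0.1 for name in LOSS_WEIGHTS}
```

With these weights, the photometric, mask, and feature terms dominate the objective, while the scale-variance term acts as a weak regularizer.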

Appendix D Alternative design choice

We explore higher-quality, metric-scale depth initialization by replacing UniDepth in MV-SOM with improved metric depth estimators. Our results suggest that simply improving the quality of metric-scale depth in MV-SOM is not enough: all three variants remain below MonoFusion on every metric.

Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | IoU ↑ | AbsRel ↓
w/ UniDepth [44] | 26.91 | 0.890 | 0.138 | 0.845 | 0.474
w/ Metric3Dv2 [21] | 27.29 | 0.894 | 0.132 | 0.847 | 0.462
w/ UniDepthV2 [45] | 27.65 | 0.900 | 0.103 | 0.872 | 0.356
MonoFusion | 30.64 | 0.930 | 0.055 | 0.963 | 0.290
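For reference, three of the reported metrics have standard closed-form definitions, sketched below on flat lists of pixel values. The evaluation protocol behind the table (masking, resolution, any depth scale alignment) is not described in this section, so the helper names and the validity rule `gt > 0` are assumptions.

```python
import math


def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, max_val]."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks."""
    inter = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    return inter / union if union else 1.0


def abs_rel(pred_depth, gt_depth):
    """Mean absolute relative depth error over pixels with valid ground truth."""
    valid = [(p, g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)
```

Higher is better for PSNR and IoU, while lower is better for AbsRel, matching the arrows in the table header.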
Figure 11: Novel view synthesis results from more video sequences. In each row, we visualize the rasterized RGB image, depth map, and foreground mask from our method for diverse scenes including music (top), cooking (middle), and healthcare (bottom). We include results for 5° (left) and 45° (right) novel view synthesis. Notably, the rendered RGB images and depth maps show consistent reconstructions and plausible geometry.

Appendix E Limitations and Future Work

Figure 12: Failure examples of SAM-V2. We qualitatively inspect the SAM-V2 dynamic foreground masks on the kitchen (top) and bike repair (bottom) scenes. The dynamic mask is highlighted in purple, and failures in dynamic mask estimation are circled in red. We observe that SAM-V2 can miss important body parts (e.g. the person’s hands) or get confused by the background (top row). Long-term occlusion also leads to tracking failure (bottom row). These failure cases suggest that dynamic mask tracking in complex scenes remains an open challenge.

We discuss two key limitations of our work. First, like previous methods, we rely heavily on 2D foundation models to estimate priors (e.g. depth and dynamic masks) for gradient-based differentiable rendering optimization. Thus, imprecise priors can harm the downstream rendering process. In addition, the current pipeline requires a user prompt to specify dynamic masks for each moving object [3], which can be labor-intensive for complex scenes and may fail in some cases (Fig. 12). To address this, distilling dynamic masks from foundation models or inferring dynamic masks from image-level priors (as in [67]) could be beneficial.

Figure 13: Visualization of foreground projection for different checkpoints. Here we show the projection by known cameras and ground-truth foreground masks, using the point cloud from DUSt3R (top row) and MonST3R (bottom row) for two selected cameras (each column represents one camera). Notably, although MonST3R is fine-tuned on temporal frame sequences instead of multi-view information, MonST3R benefits from the presence of dynamic foreground movers in its fine-tuning dataset and thus gives a better foreground result.

Second, most off-the-shelf feed-forward depth estimation networks are trained on simple scene-level datasets, with few dynamic movers (e.g. people) in the foreground. In practice, we observe that the depth of humans in dynamic scenes is often incorrect when observed from other views (Fig. 13). For example, DUSt3R often estimates the depth of a human to be the same as the depth of surrounding walls, causing the human to blend into the background. We believe that these fundamental problems with the depth predictions cannot be solved by any alignment in the output space. To mitigate this issue, we plan to further fine-tune DUSt3R or MonST3R on existing dynamic human datasets.
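One way to run this kind of cross-view check is to project a predicted foreground point cloud into another camera with known parameters and measure how many points land on the ground-truth foreground mask. The sketch below assumes a pinhole camera with a world-to-camera rotation and translation; the function names and the nearest-pixel rounding are hypothetical choices, not the procedure used for Fig. 13.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    rotation is a 3x3 world-to-camera matrix (list of rows) and translation
    a 3-vector, so that X_cam = R @ X_world + t. Returns None if the point
    lies behind the camera.
    """
    x, y, z = (sum(r * p for r, p in zip(row, point_world)) + t
               for row, t in zip(rotation, translation))
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def fraction_inside_mask(points_world, mask, **camera):
    """Fraction of points whose projection lands on a foreground pixel."""
    height, width = len(mask), len(mask[0])
    hits = 0
    for point in points_world:
        pixel = project_point(point, **camera)
        if pixel is None:
            continue
        u, v = int(round(pixel[0])), int(round(pixel[1]))
        if 0 <= u < width and 0 <= v < height and mask[v][u]:
            hits += 1
    return hits / len(points_world)
```

If the estimated human depth collapses onto the background wall, the projected points drift away from the person's silhouette in the second view and this fraction drops.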