The key idea of our method is to leverage a generative 3D avatar model as a human prior and to fit this generative model to synthetic images produced by a multi-view diffusion model.
Our 3D human generator creates appearance and geometry in canonical space, represented by 3D Gaussians, and leverages an efficient deformation module to deform them into posed space for Gaussian rasterization. To learn this generative avatar model from a 3D human dataset, we adopt a single-stage training pipeline that jointly optimizes a Gaussian auto-decoder (consisting of per-subject latent codes and a shared decoder) and a latent diffusion model.
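To make the single-stage training concrete, below is a minimal PyTorch sketch of how a Gaussian auto-decoder (per-subject latent codes plus a shared decoder), a pose-conditioned deformation module, and a latent diffusion model could be optimized jointly. All module names, dimensions, batch keys, and the rasterize callable are illustrative assumptions, not the actual implementation; the real decoder, deformer, rasterizer, and diffusion model are considerably richer.

import torch
import torch.nn as nn

class GaussianAutoDecoder(nn.Module):
    """Per-subject latent codes plus a shared decoder producing canonical 3D Gaussians."""
    def __init__(self, num_subjects, latent_dim=128, num_gaussians=1024, param_dim=14):
        super().__init__()
        # One learnable latent code per training subject (auto-decoder setting).
        self.latents = nn.Embedding(num_subjects, latent_dim)
        # Toy shared decoder: latent code -> per-Gaussian parameters
        # (3 mean + 3 scale + 4 rotation + 3 color + 1 opacity = 14 values per Gaussian).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, num_gaussians * param_dim),
        )
        # Toy deformation module: predicts a pose-conditioned offset for Gaussian means.
        self.deformer = nn.Linear(72, 3)
        self.num_gaussians, self.param_dim = num_gaussians, param_dim

    def decode(self, z):
        return self.decoder(z).view(-1, self.num_gaussians, self.param_dim)

    def deform(self, gaussians, body_pose):
        posed = gaussians.clone()
        posed[..., :3] = posed[..., :3] + self.deformer(body_pose).unsqueeze(1)
        return posed

class LatentDiffusion(nn.Module):
    """Tiny DDPM-style denoiser over latent codes (epsilon prediction)."""
    def __init__(self, latent_dim=128, steps=1000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
        self.register_buffer("alphas_bar",
                             torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, steps), dim=0))

    def denoising_loss(self, z):
        t = torch.randint(0, self.alphas_bar.shape[0], (z.shape[0],), device=z.device)
        a_bar = self.alphas_bar[t].unsqueeze(-1)
        noise = torch.randn_like(z)
        z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * noise
        t_emb = t.float().unsqueeze(-1) / self.alphas_bar.shape[0]
        return (self.net(torch.cat([z_t, t_emb], dim=-1)) - noise).pow(2).mean()

def training_step(auto_decoder, diffusion, rasterize, batch, lambda_diff=1.0):
    # Single-stage objective: photometric reconstruction + diffusion prior on latent codes.
    z = auto_decoder.latents(batch["subject_id"])
    posed = auto_decoder.deform(auto_decoder.decode(z), batch["body_pose"])
    rendered = rasterize(posed, batch["camera"])     # differentiable Gaussian rasterizer (assumed)
    recon_loss = (rendered - batch["image"]).abs().mean()
    return recon_loss + lambda_diff * diffusion.denoising_loss(z)

Because the latent codes, the shared decoder, and the diffusion denoiser all receive gradients from this single loss, one training stage suffices; there is no separate auto-decoder pretraining followed by diffusion training in this sketch.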
At test time, we fit the learned generative avatar to synthetic images generated by a pretrained multi-view diffusion model. During this process, we first initialize the latent code via image-guided sampling and then perform model inversion to refine it while keeping the decoder and the learned diffusion model frozen. The avatar model and the camera/human pose parameters are optimized in an alternating manner to correct abnormal poses.
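The test-time fitting could look roughly like the sketch below, reusing the toy modules from the previous snippet. The guided_sample call standing in for image-guided initialization, the camera/pose tensor layout, and the simple photometric loss are assumptions for illustration; the shared decoder and the diffusion prior stay frozen while the latent code and the camera/human pose parameters are updated in alternation.

def fit_avatar(auto_decoder, diffusion, rasterize, synthetic_views,
               init_cameras, init_pose, num_rounds=10, steps_per_round=50):
    # Freeze the shared decoder and the learned diffusion prior.
    for p in auto_decoder.decoder.parameters():
        p.requires_grad_(False)
    for p in diffusion.parameters():
        p.requires_grad_(False)

    # Initialize the latent code by image-guided sampling from the diffusion prior
    # (guided_sample is a hypothetical image-conditioned sampler).
    z = diffusion.guided_sample(synthetic_views).detach().requires_grad_(True)
    cameras = init_cameras.clone().detach().requires_grad_(True)
    body_pose = init_pose.clone().detach().requires_grad_(True)

    opt_latent = torch.optim.Adam([z], lr=1e-2)
    opt_pose = torch.optim.Adam([cameras, body_pose], lr=1e-3)

    def photometric_loss():
        gaussians = auto_decoder.deform(auto_decoder.decode(z), body_pose)
        return (rasterize(gaussians, cameras) - synthetic_views).abs().mean()

    for _ in range(num_rounds):
        # (a) Model inversion: update the latent code against the synthetic views.
        for _ in range(steps_per_round):
            opt_latent.zero_grad()
            photometric_loss().backward()
            opt_latent.step()
        # (b) Refine camera and human pose parameters to correct abnormal poses.
        for _ in range(steps_per_round):
            opt_pose.zero_grad()
            photometric_loss().backward()
            opt_pose.step()

    return z.detach(), cameras.detach(), body_pose.detach()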
@inproceedings{dong2025moga,
  title={{MoGA}: 3D {G}enerative {A}vatar {P}rior for {M}onocular {G}aussian {A}vatar {R}econstruction},
  author={Dong, Zijian and Duan, Longteng and Song, Jie and Black, Michael J. and Geiger, Andreas},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2025}
}