The key idea of our method is to leverage a generative 3D avatar model as a human prior and to fit this generative model to synthetic images produced by a multi-view diffusion model.
Our 3D human generator creates appearance and geometry in canonical space, represented by 3D Gaussians, and leverages an efficient deformation module to deform them into posed space for Gaussian rasterization. To learn this generative avatar model from a 3D human dataset, we adopt a single-stage training pipeline that jointly optimizes a Gaussian auto-decoder (consisting of per-subject latent codes and a shared decoder) and a latent diffusion model.
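To make the single-stage training concrete, below is a minimal PyTorch sketch of how a Gaussian auto-decoder (per-subject latent codes plus a shared decoder), a pose-conditioned deformation module, and a latent diffusion model could be optimized jointly. All module names, dimensions, batch keys, and the rasterize callable are illustrative assumptions, not the actual implementation; the real decoder, deformer, rasterizer, and diffusion model are considerably richer.

import torch
import torch.nn as nn

class GaussianAutoDecoder(nn.Module):
    """Per-subject latent codes plus a shared decoder producing canonical 3D Gaussians."""
    def __init__(self, num_subjects, latent_dim=128, num_gaussians=1024, param_dim=14):
        super().__init__()
        # One learnable latent code per training subject (auto-decoder setting).
        self.latents = nn.Embedding(num_subjects, latent_dim)
        # Toy shared decoder: latent code -> per-Gaussian parameters
        # (3 mean + 3 scale + 4 rotation + 3 color + 1 opacity = 14 values per Gaussian).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, num_gaussians * param_dim),
        )
        # Toy deformation module: predicts a pose-conditioned offset for Gaussian means.
        self.deformer = nn.Linear(72, 3)
        self.num_gaussians, self.param_dim = num_gaussians, param_dim

    def decode(self, z):
        return self.decoder(z).view(-1, self.num_gaussians, self.param_dim)

    def deform(self, gaussians, body_pose):
        posed = gaussians.clone()
        posed[..., :3] = posed[..., :3] + self.deformer(body_pose).unsqueeze(1)
        return posed

class LatentDiffusion(nn.Module):
    """Tiny DDPM-style denoiser over latent codes (epsilon prediction)."""
    def __init__(self, latent_dim=128, steps=1000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
        self.register_buffer("alphas_bar",
                             torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, steps), dim=0))

    def denoising_loss(self, z):
        t = torch.randint(0, self.alphas_bar.shape[0], (z.shape[0],), device=z.device)
        a_bar = self.alphas_bar[t].unsqueeze(-1)
        noise = torch.randn_like(z)
        z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * noise
        t_emb = t.float().unsqueeze(-1) / self.alphas_bar.shape[0]
        return (self.net(torch.cat([z_t, t_emb], dim=-1)) - noise).pow(2).mean()

def training_step(auto_decoder, diffusion, rasterize, batch, lambda_diff=1.0):
    # Single-stage objective: photometric reconstruction + diffusion prior on latent codes.
    z = auto_decoder.latents(batch["subject_id"])
    posed = auto_decoder.deform(auto_decoder.decode(z), batch["body_pose"])
    rendered = rasterize(posed, batch["camera"])     # differentiable Gaussian rasterizer (assumed)
    recon_loss = (rendered - batch["image"]).abs().mean()
    return recon_loss + lambda_diff * diffusion.denoising_loss(z)

Because the latent codes, the shared decoder, and the diffusion denoiser all receive gradients from this single loss, one training stage suffices; there is no separate auto-decoder pretraining followed by diffusion training in this sketch.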
At test time, we fit the learned generative avatar to synthetic images generated by a pretrained multi-view diffusion model. During this process, we first initialize the latent code via image-guided sampling and then perform model inversion to refine it while keeping the decoder and the learned diffusion model frozen. The avatar model and the camera/human pose parameters are optimized in an alternating manner to correct abnormal poses.
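The test-time fitting could look roughly like the sketch below, reusing the toy modules from the previous snippet. The guided_sample call standing in for image-guided initialization, the camera/pose tensor layout, and the simple photometric loss are assumptions for illustration; the shared decoder and the diffusion prior stay frozen while the latent code and the camera/human pose parameters are updated in alternation.

def fit_avatar(auto_decoder, diffusion, rasterize, synthetic_views,
               init_cameras, init_pose, num_rounds=10, steps_per_round=50):
    # Freeze the shared decoder and the learned diffusion prior.
    for p in auto_decoder.decoder.parameters():
        p.requires_grad_(False)
    for p in diffusion.parameters():
        p.requires_grad_(False)

    # Initialize the latent code by image-guided sampling from the diffusion prior
    # (guided_sample is a hypothetical image-conditioned sampler).
    z = diffusion.guided_sample(synthetic_views).detach().requires_grad_(True)
    cameras = init_cameras.clone().detach().requires_grad_(True)
    body_pose = init_pose.clone().detach().requires_grad_(True)

    opt_latent = torch.optim.Adam([z], lr=1e-2)
    opt_pose = torch.optim.Adam([cameras, body_pose], lr=1e-3)

    def photometric_loss():
        gaussians = auto_decoder.deform(auto_decoder.decode(z), body_pose)
        return (rasterize(gaussians, cameras) - synthetic_views).abs().mean()

    for _ in range(num_rounds):
        # (a) Model inversion: update the latent code against the synthetic views.
        for _ in range(steps_per_round):
            opt_latent.zero_grad()
            photometric_loss().backward()
            opt_latent.step()
        # (b) Refine camera and human pose parameters to correct abnormal poses.
        for _ in range(steps_per_round):
            opt_pose.zero_grad()
            photometric_loss().backward()
            opt_pose.step()

    return z.detach(), cameras.detach(), body_pose.detach()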
@inproceedings{dong2025moga,
  title={{MoGA}: 3D {G}enerative {A}vatar {P}rior for {M}onocular {G}aussian {A}vatar {R}econstruction},
  author={Dong, Zijian and Duan, Longteng and Song, Jie and Black, Michael J. and Geiger, Andreas},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2025}
}