DiffuMatch: Category-Agnostic Spectral Diffusion Priors for
Robust Non-rigid Shape Matching

Emery Pierson¹      Lei Li²      Angela Dai²      Maks Ovsjanikov¹
¹LIX, Ecole Polytechnique      ²Technical University of Munich
Abstract

Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/

Figure 1: We learn diffusion priors in the spectral domain from a large collection of functional maps computed on registered human shapes. The learned spectral diffusion priors are category-agnostic and generalize robustly to unseen shape categories, enabling accurate zero-shot non-rigid shape matching.

1 Introduction

Shape matching is a fundamental problem in geometry processing, as it is a necessary step for many applications such as shape interpolation [2], texture transfer [19], and statistical shape analysis [8, 9].

A particularly appealing approach to non-rigid shape matching is the recent deep functional maps framework [42, 20]. It consists of two main blocks: (1) a deep feature extractor that computes descriptor functions approximately preserved across a pair of input shapes, and (2) a differentiable functional map solver that computes the matching in the spectral domain under axiomatic regularizations. Functional maps [51] provide a flexible and compact representation of shape correspondences as small-sized matrices, and have been successfully applied not only to shape matching but also to other tasks such as shape classification [31] or representation alignment [27]. However, existing methods rely heavily on axiomatic modeling, such as near isometry and local area preservation constraints [15, 67], for formulating the training loss or for functional map regularization inside the networks. This limits their accuracy and generalization to unseen shape pairs where these assumptions may not hold.

This motivates us to explore replacing the axiomatic regularizations in deep functional maps with structural priors learned directly from data. Recently, diffusion models [69, 33] have demonstrated strong capabilities in modeling complex data distributions, achieving impressive results in tasks such as image generation and editing [65, 49]. The rich priors learned by these models have also proven useful in many other tasks, such as 3D generation and reconstruction [57, 76, 78]. Inspired by these successes, we propose to leverage diffusion models to capture structural properties of functional maps directly in the spectral domain. With the spectral diffusion priors as a powerful regularizer, we aim to enhance the robustness and accuracy of functional map estimation for non-rigid shape matching.

In this work, we leverage large-scale datasets of registered non-rigid 3D shapes, primarily human bodies [9], to learn informative structural priors of functional maps. We first construct a large collection of high-quality functional maps from the non-rigid 3D registrations in [9], and then train an unconditional diffusion model in the spectral domain to capture the distribution of these functional maps. Building on this spectral diffusion model, we propose a novel zero-shot deep functional map pipeline, which incorporates a novel data-driven regularization to promote structural properties consistent with those observed in the training data. Specifically, we distill a mask from the trained spectral diffusion model. This mask encodes learned structural priors, replacing conventional axiomatic regularizations, such as Laplacian commutativity or orthogonality, commonly used in deep functional map pipelines. Remarkably, our learned spectral diffusion priors, though trained on human shapes, are category-agnostic and demonstrate strong generalization to unseen shape categories, including humanoids and animals.

In summary, our contributions are:

  • We introduce a spectral diffusion model to learn the distribution of functional maps, effectively capturing their structural characteristics in the spectral domain.

  • We distill the learned spectral diffusion priors into a mask that serves as a data-driven regularizer, replacing axiomatic regularizations and improving robustness in zero-shot deep functional map pipelines.

  • Our spectral diffusion priors, learned from human shapes, demonstrate remarkable adaptability, generalizing effectively to diverse and previously unseen shape categories.

2 Related Work

Functional Maps and Regularizations.  Since the seminal functional maps work [51], in which orthogonality and Laplacian commutativity penalties are derived to encourage near-isometric maps, many axiomatic approaches have been proposed to improve the computation of functional maps. One of the most common penalties encourages bijectivity of the functional maps [52], thereby encouraging bijectivity of the corresponding pointwise correspondence. Ren et al. [58] propose to encourage orientation preservation and continuity of maps, along with a new iterative algorithm to improve the quality of maps. Panine et al. [54] propose to promote conformality of maps with a new penalty and functional basis for the map computation.

Recently, Zoomout [48] has shown that alternating between functional map and pointwise correspondence representation is a remarkably efficient method to regularize the final map quality automatically. One of the key components is the projection of maps to the proper map space [61, 40], the space of functional maps corresponding to valid point correspondences, from which it is easier to optimize spatial energies like the Dirichlet energy [46], or elastic energies with a separate deformation network [16]. This has lately been used to improve the training of deep functional maps, with different losses encouraging functional maps to be proper [39, 15, 17]. Some approaches propose to use geometric information to refine maps using geometric consistency either in the training of deep functional maps [17], or at test time with precomputed deep features [29, 64, 63], which can be useful in partial shape matching [22, 23]. Another way to mitigate errors is to match multiple shapes at the same time rather than a pair [26, 28], with an increased computational cost. Another direction aims at exploring an alternative to the Laplace-Beltrami operator for constructing the shape basis [10, 7, 29, 77]; however, it is not straightforward to incorporate these new approaches in a deep functional maps pipeline.

Only a few works have focused on improving mask regularization. In partial functional maps [62], the authors propose a slanted regularization to encourage the map to follow Weyl’s law. Ren et al. improved the original Laplacian commutativity by using the Resolvent operator [59], which has better theoretical properties.

To perform well, most of these approaches require a good quality initialization as input, which is given by the mask regularization [4], or by large-scale pre-trained features [81]. In contrast, we propose a data-driven mask computation, which allows for a better initialization of the maps and a distilled loss to optimize the maps.

Knowledge Distillation of Score-based Generative Models.  Denoising diffusion models [33] are a class of generative models that learn the mapping between an (unknown) data distribution and a Gaussian distribution. They have shown great generalization capability, surpassing GANs for image generation [18]. Moreover, the distribution learned by diffusion models has numerous desirable properties, as the model learns the gradient of the log density (the score) of the (noised) data distribution [70]. The learned score can be distilled in various ways depending on the downstream tasks, such as image inpainting [43], denoising [71], and more recently, text-to-3D generation.

Score distillation sampling (SDS) [57] has indeed quickly become a preferred approach to zero-shot text-to-3D generation using 2D image-based diffusion models. The authors of [57] propose to use the learned score of the diffusion model as the gradient of a desired image given a user text prompt. Coupled with a differentiable scene representation and rendering, the proposed loss allows for the accurate generation of new 3D scenes. However, the approach exhibits undesired properties, such as almost deterministic generation (convergence towards the mean image corresponding to the prompt), low-quality shapes, or color saturation of the generated shapes. To overcome these limitations, HiFA [82] proposes a strategy to mitigate those effects by using negative prompts and forcing realistic images by emphasizing SDS steps with low levels of noise. ProlificDreamer [76] instead proposes to learn a fine-tuned diffusion model for the prompt, to avoid deterministic generation. Those approaches, however, require a conditional diffusion model to work properly. Lukoianov et al. [44] propose DDIM inversion to follow the score and thereby avoid wrong gradient directions. However, the inversion process is approximate and can be slow.

Most methods presented here are designed for image generation. As we will see in Sec. 4, their generalization to the regularization of functional maps is not straightforward, and we therefore propose to adapt the distillation strategy for robust non-rigid shape matching.

3D Generative Modeling for Shape Matching.  Generative modeling has proven to be a powerful tool for solving shape-matching tasks. In 3D-CODED [30], the authors train a point-cloud auto-encoder on a large synthetic human dataset and register human scans in a zero-shot approach. The auto-encoder approach has been improved with geometric regularization to avoid degenerate reconstructions. Neural Jacobian Fields [1] improve the overall quality of reconstructions by predicting the Jacobian of deformations instead of directly predicting the deformation field, which implicitly regularizes the final shape. ARAPReg [34] learns a geometrically regularized latent space by penalizing directions that increase the ARAP energy. This strategy has been extended to learn correspondences on unregistered datasets [80], but it requires a two-step training strategy. Finally, some works directly learn to generate the matching of shapes, whether directly in the spatial domain [25, 74, 50] or in the spectral domain using deep functional maps [42, 32]. In particular, deep functional methods have shown great success for intra-category training of shape matching approaches [66, 40, 14, 15, 73]. A concurrent work [83] proposes to learn a conditional distribution of functional maps along with shape descriptors. All the works mentioned above need training on specific categories before being used at test time: a model trained on humans is useful for registering other humans but often fails to generalize well to different categories, such as animals.

We overcome this limitation by training an unconditional diffusion model on the space of functional maps. We propose a novel distillation strategy adapted to the specific problem of functional map regularization. Our approach generalizes to new categories of shapes (e.g., animals).

3 Background & Motivation

Figure 2: Our diffusion-based zero-shot shape matching pipeline. Given two shapes, we distill a mask using the spectral diffusion denoiser applied to the estimated “raw” functional map from features. We then use the distilled mask in a regularized functional map solver (FMReg) and apply Zoomout [48] to obtain a proper, regularized map. We minimize both score distillation and L2 distance to the proper map.

3.1 Deep Functional Maps

The objective of deep functional maps is to learn shape descriptors to compute high-quality correspondences on pairs of shapes $(\mathcal{S}_{1},\mathcal{S}_{2})$, represented as triangular meshes. Let $n_{1},n_{2}$ be their respective number of vertices. The pipeline generally consists of the following steps:

  • Compute the first $k$ eigenfunctions of an intrinsic surface operator – usually the Laplace-Beltrami operator – on each shape, serving as a basis of functions on these shapes. The Laplacian is discretized as $S^{-1}W$, where $S$ is the diagonal matrix of mesh vertex areas, and $W$ is the cotangent weight matrix. The eigenfunctions are stored as matrices in the form of $\Phi_{1}\in\mathbb{R}^{n_{1}\times k}$ and $\Phi_{2}\in\mathbb{R}^{n_{2}\times k}$.

  • A set of $d$ descriptor functions (approximately preserved by the unknown map), $F_{1}=f_{\theta}(\mathcal{S}_{1})\in\mathbb{R}^{n_{1}\times d}$ and $F_{2}=f_{\theta}(\mathcal{S}_{2})\in\mathbb{R}^{n_{2}\times d}$, is extracted using a neural network $f_{\theta}:\mathcal{S}\mapsto\mathbb{R}^{d}$. After projecting them onto the respective eigenfunctions, the resulting descriptor coefficients are stored as matrices $A_{1},A_{2}\in\mathbb{R}^{k\times d}$, respectively.

  • The functional map matrix $C$ between $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ is computed by solving the following:

    $$C=\underset{C}{\operatorname{argmin}}\ \|CA_{1}-A_{2}\|^{2}+\alpha\|M_{\text{reg}}C\|^{2}, \tag{1}$$

    where the first term is a data preservation term between descriptors, and the second term regularizes the map structure by using a sparsity-promoting mask $M_{\text{reg}}$, derived from Laplacian or Resolvent operator commutativity. The whole pipeline $(\mathcal{S}_{1},\mathcal{S}_{2})\mapsto C$, also called the FMReg layer [20], is fully differentiable with respect to $\theta$.

  • The weights $\theta$ are optimized during training with axiomatic regularization terms, like area preservation, which is formulated as an orthogonality penalty:

    $$\mathcal{L}_{\text{ortho}}(C)=\|CC^{T}-I\|^{2}, \tag{2}$$

    or other regularizations, such as orientation preservation.

  • At test time, we use the learned pipeline to extract $C$. Post-processing algorithms such as Zoomout can be used to increase the accuracy of the map, before extracting the point-to-point map using the aligned eigenfunctions $\Phi_{1}C^{T}$ and $\Phi_{2}$ and nearest neighbor search or more accurate techniques [53].
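
The solver step (Eq. 1) and the test-time conversion to a point-to-point map can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of [20]: the product $M_{\text{reg}}C$ is read as an elementwise product, both shapes use $k$ eigenfunctions, the bases are random orthonormal matrices rather than Laplace-Beltrami eigenfunctions, and the names `fmreg_solve` and `pointwise_map` are hypothetical.

```python
import numpy as np


def fmreg_solve(A1, A2, mask, alpha):
    """Solve Eq. 1 row by row, reading M_reg C as an elementwise product (assumption)."""
    k = A1.shape[0]
    gram = A1 @ A1.T          # (k, k), shared by every row
    rhs = A1 @ A2.T           # (k, k), column i is the right-hand side of row i
    C = np.empty((k, k))
    for i in range(k):
        # Normal equations of row i: (A1 A1^T + alpha * diag(M_i^2)) c_i = A1 a2_i^T
        lhs = gram + alpha * np.diag(mask[i] ** 2)
        C[i] = np.linalg.solve(lhs, rhs[:, i])
    return C


def pointwise_map(Phi1, Phi2, C):
    """For each vertex of shape 2, index of the nearest vertex of shape 1,
    comparing rows of Phi2 with rows of the aligned basis Phi1 C^T."""
    aligned = Phi1 @ C.T                                            # (n1, k)
    sq_dist = ((Phi2[:, None, :] - aligned[None, :, :]) ** 2).sum(-1)  # (n2, n1)
    return sq_dist.argmin(axis=1)


# Toy data (assumption): shape 2 is shape 1 with permuted vertices and
# sign-flipped basis functions, so the exact functional map is diag(signs).
rng = np.random.default_rng(0)
n, k, d = 40, 6, 16
Phi1, _ = np.linalg.qr(rng.normal(size=(n, k)))   # orthonormal toy basis
perm = rng.permutation(n)                          # vertex j of shape 2 <-> vertex perm[j] of shape 1
signs = np.array([1.0, -1.0, 1.0, -1.0, -1.0, 1.0])
Phi2 = Phi1[perm] * signs

F1 = rng.normal(size=(n, d))                       # descriptors on shape 1
F2 = F1[perm]                                      # the same descriptors on shape 2
A1 = Phi1.T @ F1                                   # pseudoinverse = transpose for orthonormal columns
A2 = Phi2.T @ F2

idx = np.arange(k)
mask = np.abs(idx[:, None] - idx[None, :]).astype(float)   # penalizes off-diagonal entries
C = fmreg_solve(A1, A2, mask, alpha=1.0)
T = pointwise_map(Phi1, Phi2, C)
print(np.allclose(C, np.diag(signs)), np.array_equal(T, perm))
```

Because the mask couples only entries within a row, the problem decouples into $k$ small linear systems, which is what makes the layer cheap to differentiate through.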

Orthogonality and commutativity penalties are essential to ensure that the correspondences are plausible. We propose to replace those axiomatic penalties in deep functional maps with data-driven penalties by distilling priors from trained spectral diffusion models in a zero-shot manner.
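
To make the two axiomatic penalties concrete, the sketch below evaluates the orthogonality penalty of Eq. 2 and checks that a Laplacian commutativity penalty can be written as a mask applied elementwise to $C$. The mask entries $|\lambda^{2}_{i}-\lambda^{1}_{j}|$ are one common choice and are an assumption here, as are the toy eigenvalues and the function names.

```python
import numpy as np


def orthogonality_penalty(C):
    """Eq. 2: squared Frobenius norm of C C^T - I."""
    k = C.shape[0]
    return float(np.sum((C @ C.T - np.eye(k)) ** 2))


def commutativity_mask(evals1, evals2):
    """Mask with entries |lambda2_i - lambda1_j|, so that
    ||M * C||_F^2 equals ||C diag(evals1) - diag(evals2) C||_F^2."""
    return np.abs(evals2[:, None] - evals1[None, :])


rng = np.random.default_rng(0)
k = 5
evals1 = np.sort(rng.uniform(0.0, 10.0, size=k))   # toy eigenvalues of shape 1
evals2 = np.sort(rng.uniform(0.0, 10.0, size=k))   # toy eigenvalues of shape 2
C = rng.normal(size=(k, k))

mask = commutativity_mask(evals1, evals2)
masked_energy = np.sum((mask * C) ** 2)
commutator_energy = np.sum((C @ np.diag(evals1) - np.diag(evals2) @ C) ** 2)
print(np.isclose(masked_energy, commutator_energy))

Q, _ = np.linalg.qr(rng.normal(size=(k, k)))        # an orthogonal matrix
print(orthogonality_penalty(Q), orthogonality_penalty(2.0 * Q))
```

The identity holds because entry $(i,j)$ of $C\Lambda_{1}-\Lambda_{2}C$ is $C_{ij}(\lambda^{1}_{j}-\lambda^{2}_{i})$; the mask therefore vanishes where eigenvalues agree and grows away from that band.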

3.2 Score-based Generative Modeling

In this section, we follow the formalism of [37] to present denoising score models. The general objective of generative modeling is to learn a distribution $p_{\psi}(x)$ (where $\psi$ are the learned parameters) corresponding to an (unknown) data distribution using the available samples of this distribution. Denoising score matching [35], and in particular, denoising diffusion models, are a specific class of models that learn to model the score function, defined as:

$$s(x)=\nabla_{x}\log p(x), \tag{3}$$

instead of the density $p$. This formulation overcomes the problem of normalizing constants of the data density. Knowing the score function is equivalent to knowing the data distribution, as one can sample from it using Langevin dynamics [55]. Since the score is unknown in parts of the sample space without data, denoising score matching learns the score functions $s_{\theta}(x_{\sigma},\sigma)=\nabla_{x_{\sigma}}\log q(x_{\sigma},\sigma)$ at different noise scales [75], where $x_{\sigma}=x+n_{\sigma}$, with $n_{\sigma}\sim\mathcal{N}(0,\sigma^{2}I)$. This is done by learning a denoiser network $D_{\psi}(x+n_{\sigma},\sigma)$ by minimizing the following loss:

$$\mathbb{E}_{x\sim p_{\text{data}}}\mathbb{E}_{n_{\sigma}\sim\mathcal{N}(0,\sigma^{2}I)}\|D_{\psi}(x+n_{\sigma},\sigma)-x\|^{2}, \tag{4}$$

where the optimized parameters are the parameters $\psi$ of the denoiser. We drop the denoiser parameter symbol $\psi$ to avoid confusion with other learnable parameters, since for the rest of the paper, the denoiser is considered as trained. After training, new samples are generated by progressively denoising random samples, following a probability flow ordinary differential equation [70]. Moreover, the score at noise level $\sigma$ can be estimated from the denoiser output and its noisy input $x_{\sigma}$ using:

$$\nabla_{x_{\sigma}}\log p(x_{\sigma};\sigma)=(D(x_{\sigma};\sigma)-x_{\sigma})/\sigma^{2}. \tag{5}$$
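
As a sanity check of the denoiser-to-score relation, the sketch below uses a toy case in which everything is known in closed form: data drawn from $\mathcal{N}(0,s^{2}I)$, whose MSE-optimal denoiser is $\frac{s^{2}}{s^{2}+\sigma^{2}}x_{\sigma}$ and whose noised score is $-x_{\sigma}/(s^{2}+\sigma^{2})$. The residual is taken between the denoiser output and its noisy input. The Gaussian data model and the function names are assumptions made only for illustration.

```python
import numpy as np


def score_from_denoiser(denoiser, x_sigma, sigma):
    """Score estimate of the noised density from a denoiser output."""
    return (denoiser(x_sigma, sigma) - x_sigma) / sigma**2


# Toy data distribution (assumption): x ~ N(0, s^2 I).
s = 2.0


def gaussian_denoiser(x_sigma, sigma):
    """MSE-optimal denoiser for Gaussian data: the posterior mean of x given x_sigma."""
    return (s**2 / (s**2 + sigma**2)) * x_sigma


rng = np.random.default_rng(0)
sigma = 0.7
x = s * rng.normal(size=(4, 4))
x_sigma = x + sigma * rng.normal(size=x.shape)

estimated = score_from_denoiser(gaussian_denoiser, x_sigma, sigma)
exact = -x_sigma / (s**2 + sigma**2)    # score of N(0, (s^2 + sigma^2) I)
print(np.allclose(estimated, exact))
```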

Score Distillation Sampling.  Score distillation sampling (SDS) [57] is a generic way of transferring knowledge from a diffusion model learned on a source domain $\Omega$, to regularize or generate samples $y$ in a target domain. It can be summarized as follows: (1) obtain a trained denoiser $D(x_{\sigma},\sigma)$, with $x\in\Omega$, the source domain, on which it is easy to train the denoiser; (2) differentiably extract $x=g(y_{\theta})$ from the target to the source domain, where the representation $y_{\theta}$ of samples in the target domain is optimizable, and $g$ is a differentiable mapping from the target to the source domain. In the original work [57], the source domain is images, and the target domain is 3D scenes. 3D scenes are parameterized using Neural Radiance Fields $y_{\theta}$, and the function $g(y_{\theta})$ is simply the differentiable rendering of a novel view.

At each iteration, SDS consists in sampling $x=g(y_{\theta})\in\Omega$ and perturbing $x$ with noise $n_{\sigma}\sim\mathcal{N}(0,\sigma^{2}I)$. Then, SDS guides the target representation $y_{\theta}$ by applying the following gradient to the parameters $\theta$:

$$\nabla_{\theta}\mathcal{L}_{\text{SDS}}=\mathbb{E}_{\sigma,\,x_{\sigma}\sim\mathcal{N}(x,\sigma^{2}I)}\left[(x_{\sigma}-D(x_{\sigma},\sigma))/\sigma\right]\frac{\partial g}{\partial\theta}. \tag{6}$$

The gradient is not backpropagated through the denoiser as it is costly and unstable due to the noising step. In practice, only a single denoising step is applied for better performance.
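
The SDS update can be illustrated on a toy problem. In the sketch below, the source distribution is $\mathcal{N}(0,s^{2}I)$ with its closed-form denoiser, and $g(\theta)=W\theta$ is a linear map so that $\partial g/\partial\theta=W$ is explicit; all of these choices, the step size, and the noise range are assumptions for illustration only. The denoiser output is treated as a constant, mirroring the fact that the gradient is not backpropagated through the denoiser.

```python
import numpy as np

s = 1.0  # std of the toy source distribution N(0, s^2 I) (assumption)


def denoiser(x_sigma, sigma):
    """Closed-form MSE-optimal denoiser for the toy Gaussian source distribution."""
    return (s**2 / (s**2 + sigma**2)) * x_sigma


def sds_gradient(theta, W, sigma, rng):
    """One-sample estimate of the SDS gradient for the linear map g(theta) = W @ theta."""
    x = W @ theta                                     # x = g(y_theta)
    x_sigma = x + sigma * rng.normal(size=x.shape)    # noised sample
    residual = (x_sigma - denoiser(x_sigma, sigma)) / sigma
    # The denoiser is not differentiated: only dg/dtheta = W enters the gradient.
    return W.T @ residual


rng = np.random.default_rng(0)
W = np.eye(4)
theta = np.array([5.0, -5.0, 5.0, -5.0])
start_norm = np.linalg.norm(W @ theta)
for _ in range(300):
    sigma = rng.uniform(0.5, 1.0)                     # random noise level per step
    theta = theta - 0.1 * sds_gradient(theta, W, sigma, rng)
end_norm = np.linalg.norm(W @ theta)
print(start_norm, end_norm)
```

In this toy setting the iterates are pulled towards the mode of the source distribution (the origin), which also illustrates the mode-seeking behavior of SDS.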

3.3 Motivation

As we can generate training functional maps easily (Sec. 4.1), our goal is to leverage the distribution of functional maps learned by score models to match unseen shapes.

A first solution could be to learn a conditional distribution by conditioning the functional map diffusion model on point descriptors to generate the target map, as in recent diffusion-based rigid shape-matching approaches [36]. However, the learned model would be category-specific, as with classic deep functional maps methods.

A second solution could be to use the learned probability likelihood of the score model [70] as a proxy for measuring map quality. A similar idea, based on axiomatic constraints instead of learned penalties, has been explored in MapTree [60]. However, such an approach outputs a set of candidate maps, including symmetry-swapping maps, as purely intrinsic approaches do not differentiate between them. Moreover, evaluating the likelihood of a diffusion model is costly due to the integration of the generation trajectory.

A third potential solution is to adapt Score Distillation Sampling to the deep functional maps setting. Indeed, the recent Shape-Non-Rigid Kinematics (SNK) [5] work has shown that deep functional maps networks can be used in a zero-shot setting. In that work, the authors exploit Neural Correspondence Priors [4] to obtain good initializations and optimize the map using spectral and spatial axiomatic constraints.

We now discuss the adaptation of SDS to our problem. Given two shapes $\mathcal{S}_{1},\mathcal{S}_{2}$, the source domain is shape correspondences, represented as functional maps (Sec. 3.1). The target domain is descriptor functions. The pointwise descriptors of shapes $F_{i}$ are parameterized by a deep neural network $F_{i}=f_{\theta}(\mathcal{S}_{i})$, from which we estimate the functional map $C_{12}$ with the FMReg layer [20]. In the SDS notation, $x$ is the functional map $C$, $\theta$ corresponds to the weights of the feature extractor, and $g$ is the FMReg layer.

Figure 3: Comparison between vanilla SDS and our approach. The same diagonal structure is encouraged between the two approaches. However, using our proposed mask allows for fixing misalignments (sign flips in the functional maps, highlighted at the bottom) of the initialization. In contrast, the vanilla SDS converges towards the closest map with the same diagonal.

In Fig. 3, we perform a preliminary experiment with SDS. Notably, the recovered map shows significant mismatches (left and right legs are reversed) compared to the ground truth. We also observe that the structure promoted by the approach is nearly diagonal, as commonly observed in functional maps. The wrong signs along the diagonal cause the observed inconsistencies. We argue that the sign ambiguity of functional maps affects the performance (more discussions in the first section of supp. material). To better capture the underlying structure of the functional maps from data, we adopt a sign-agnostic approach in the following section.

4 Method

We first train a spectral diffusion model of functional maps (Sec. 4.1). Second, we devise a zero-shot training approach by distilling the knowledge learned by the trained model (Sec. 4.2), and the pipeline is illustrated in Fig. 2.

4.1 Training

Given a dataset of registered shapes, our objective is to train a spectral functional maps diffusion model $D(C_{\sigma},\sigma)$. Using the registered shapes, we first build a dataset of ground-truth functional maps $C_{\text{gt}}$. Given two registered shapes $\mathcal{S}_{1},\mathcal{S}_{2}$ and their respective eigenfunctions $\Phi_{1},\Phi_{2}$, the ground-truth map between the two shapes is given by:

$$C_{12\text{-gt}}=\Phi_{2}^{\dagger}\Phi_{1},$$

where $(\cdot)^{\dagger}$ is the Moore-Penrose pseudoinverse. We extract functional maps of fixed size $n\times n$ of template-to-shape correspondences.

The extracted functional maps are thus matrices $C\in\mathcal{M}_{n}(\mathbb{R})$, which are analogous to images. We thus build upon the available image-based architectures and use the Diffusion Transformer architecture [56], which has shown great capabilities for image generation.

Our spectral denoiser $D_{\psi}(C_{\sigma},\sigma)$ takes as input a matrix $C_{\sigma}\in\mathcal{M}_{n}(\mathbb{R})$ and noise level $\sigma$. To be sign-agnostic, we train a diffusion model on absolute functional maps, with input training data as the set of $|C_{\text{gt}}|$ (which is $C_{\text{gt}}$ with $x\to|x|$ applied on each element).
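
The construction of a training sample can be sketched as follows. This is a toy illustration under stated assumptions, not the data pipeline of the paper: a weighted path graph stands in for a triangle mesh and its cotangent matrix, the two "registered shapes" share vertex order but have different weights and areas, and the function names are hypothetical. The sketch also shows why the absolute value makes the sample sign-agnostic: flipping the sign of any eigenfunction flips rows or columns of $C_{\text{gt}}$ but leaves $|C_{\text{gt}}|$ unchanged.

```python
import numpy as np


def path_laplacian(edge_weights):
    """Stiffness matrix W of a weighted path graph (toy stand-in for cotangent weights)."""
    w = np.asarray(edge_weights, dtype=float)
    degree = np.concatenate([w, [0.0]]) + np.concatenate([[0.0], w])
    return np.diag(degree) - np.diag(w, 1) - np.diag(w, -1)


def spectral_basis(W, areas, k):
    """First k eigenpairs of S^{-1} W with S = diag(areas), computed through
    the symmetric matrix S^{-1/2} W S^{-1/2}."""
    inv_sqrt = 1.0 / np.sqrt(areas)
    sym = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    evals, vecs = np.linalg.eigh(sym)                    # ascending eigenvalues
    return evals[:k], inv_sqrt[:, None] * vecs[:, :k]


rng = np.random.default_rng(0)
n, k = 30, 8
W1, areas1 = path_laplacian(rng.uniform(0.5, 1.5, n - 1)), rng.uniform(0.5, 1.5, n)
W2, areas2 = path_laplacian(rng.uniform(0.5, 1.5, n - 1)), rng.uniform(0.5, 1.5, n)
evals1, Phi1 = spectral_basis(W1, areas1, k)
evals2, Phi2 = spectral_basis(W2, areas2, k)

# Ground-truth map for registered shapes (identity vertex correspondence).
C_gt = np.linalg.pinv(Phi2) @ Phi1                       # (k, k)

# Flip the sign of some eigenfunctions on each shape and recompute the map.
flip1 = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0, 1.0])
flip2 = np.array([1.0, 1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0])
C_flipped = np.linalg.pinv(Phi2 * flip2) @ (Phi1 * flip1)

print(np.allclose(C_flipped, C_gt))                      # signed maps differ
print(np.allclose(np.abs(C_flipped), np.abs(C_gt)))      # absolute maps agree
```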

4.2 Zero-shot shape matching

In this section, we are now given as input two unseen shapes $\mathcal{S}_{1},\mathcal{S}_{2}$. We aim to estimate the functional map $C_{12}$ between the two shapes. We use the deep functional maps framework [20] to differentiably estimate the functional map $C_{12}$, from pointwise descriptors $F_{i}=f_{\theta}(\mathcal{S}_{i})$, parameterized by neural network weights $\theta$.

We optimize the parameters $\theta$ by applying a distillation loss to the functional maps. The remainder of this section discusses the construction of our distillation loss.

Mask regularization is a sign-agnostic regularization that has proven essential in deep functional maps as it provides reliable initialization towards the final solution [4]. In this section, we seek to provide a masked regularization in the form of Eq. 1. We search for sparsity-promoting masks $M_{\sigma}$, such that given a ground truth map $C_{\text{gt}}$, we have:

$$\|M_{\sigma}\cdot C_{\text{gt}}\|\simeq 0. \tag{7}$$

This is equivalent to saying that $C_{\text{gt}}$ maximizes the likelihood

$$p(C_{\sigma};\sigma)\propto\exp(-\|M_{\sigma}\cdot C_{\sigma}\|^{2}). \tag{8}$$

Under this hypothesis, the score function is:

$$s(C_{\sigma};\sigma)=\nabla_{C_{\sigma}}\log p(C_{\sigma};\sigma)=-2M_{\sigma}^{2}\cdot C_{\sigma}. \tag{9}$$

Using Eq. 5, we obtain:

$$M_{\sigma}^{2}\cdot C_{\sigma}=(C_{\sigma}-D(C_{\sigma};\sigma))/(2\sigma^{2}), \tag{10}$$

which yields the following formula for computing $M_{\sigma}$ (dividing element-wise by $C_{\sigma}$ and taking the mean over the noise distribution):

$$M_{\sigma}^{2}=\mathbb{E}_{n_{\sigma}\sim\mathcal{N}(0,\sigma^{2}I)}\left[(C_{\sigma}-D(C_{\sigma};\sigma))/(2\sigma^{2}C_{\sigma})\right]. \tag{11}$$
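For completeness, the steps from Eq. 8 to Eq. 11 can be spelled out entry by entry. This sketch assumes that Eq. 5 is the usual relation between a denoiser and the score, $s(x;\sigma)=(D(x;\sigma)-x)/\sigma^{2}$, and that products and divisions involving $M_{\sigma}$ and $C_{\sigma}$ are element-wise; both are our reading of the text rather than statements taken from it.

```latex
\begin{align*}
\log p(C_\sigma;\sigma)
  &= -\sum_{i,j} (M_\sigma)_{ij}^{2}\,(C_\sigma)_{ij}^{2} + \text{const}
  && \text{(Eq. 8)}\\
\frac{\partial \log p(C_\sigma;\sigma)}{\partial (C_\sigma)_{ij}}
  &= -2\,(M_\sigma)_{ij}^{2}\,(C_\sigma)_{ij}
  && \text{(Eq. 9)}\\
-2\,(M_\sigma)_{ij}^{2}\,(C_\sigma)_{ij}
  &= \frac{D(C_\sigma;\sigma)_{ij} - (C_\sigma)_{ij}}{\sigma^{2}}
  && \text{(assumed form of Eq. 5)}\\
(M_\sigma)_{ij}^{2}\,(C_\sigma)_{ij}
  &= \frac{(C_\sigma)_{ij} - D(C_\sigma;\sigma)_{ij}}{2\sigma^{2}}
  && \text{(Eq. 10)}\\
(M_\sigma)_{ij}^{2}
  &= \frac{(C_\sigma)_{ij} - D(C_\sigma;\sigma)_{ij}}{2\sigma^{2}\,(C_\sigma)_{ij}}
  && \text{(integrand of Eq. 11)}
\end{align*}
```

The last line is only well defined where $(C_\sigma)_{ij}\neq 0$, which is the numerical issue addressed next.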

Applying this formula directly would cause numerical instabilities: if arbitrary noise values are sampled, $C_{\sigma}$ can contain zeros, and we divide by it. We avoid this by sampling only $n_{\sigma}>0$, which ensures strictly positive values when working with absolute functional maps $|C|$. The final formula for the mask is:

$$M_{\sigma}^{2}=\mathbb{E}_{n_{\sigma}\sim\mathcal{N}(0,\sigma^{2}I),\,n_{\sigma}>0}\left[(|C|_{\sigma}-D(|C|_{\sigma};\sigma))/(2\sigma^{2}|C|_{\sigma})\right]. \tag{12}$$
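The estimator of Eq. 12 can be sketched as a Monte Carlo average over positive noise samples. The code below is a minimal NumPy illustration, not the authors' implementation: `distill_mask_sq` and `toy_denoiser` are hypothetical names, the denoiser is any callable `(batch, sigma) -> batch`, and the toy check assumes the denoiser-score relation $D(x;\sigma)=x+\sigma^{2}s(x;\sigma)$ for Eq. 5.

```python
import numpy as np

def distill_mask_sq(C_abs, denoiser, sigma=1.0, n_samples=100, rng=None):
    """Monte Carlo estimate of M_sigma^2 following Eq. 12.

    C_abs    : (n, n) array, absolute functional map |C|.
    denoiser : callable (batch, sigma) -> batch, a stand-in for D(.; sigma).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Positive noise only: |N(0, sigma^2)| keeps the entries of |C|_sigma positive.
    noise = np.abs(rng.normal(0.0, sigma, size=(n_samples,) + C_abs.shape))
    C_noisy = C_abs[None] + noise
    denoised = denoiser(C_noisy, sigma)  # one batched denoising call
    ratio = (C_noisy - denoised) / (2.0 * sigma**2 * C_noisy)
    return ratio.mean(axis=0)

# Toy check: for the Gaussian model of Eq. 8, the matching denoiser is
# D(x) = x + sigma^2 * s(x) = (1 - 2 sigma^2 M^2) * x, so M^2 is recovered.
n = 30
idx = np.arange(n)
M_true_sq = 0.1 * (np.abs(idx[:, None] - idx[None, :]) / n) ** 2

def toy_denoiser(x, sigma):
    return (1.0 - 2.0 * sigma**2 * M_true_sq) * x

M_sq = distill_mask_sq(np.eye(n), toy_denoiser, sigma=1.0,
                       n_samples=100, rng=np.random.default_rng(0))
```

With a trained network in place of `toy_denoiser`, the same call would produce a data-driven mask; the number of samples and the noise level here mirror the values reported in the experimental details.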

We can thus distill the learned structure of the spectral diffusion model into a mask $M_{\sigma}$ at different noise levels, given any functional map matrix $C_{\text{init}}$. Fig. 4 shows the estimated masks for different noise levels.

[Figure 4 panels. Top: slanted mask, Laplacian mask, resolvent mask. Bottom: distilled masks at $\sigma=0.1$, $\sigma=0.5$, $\sigma=1$, $\sigma=3$, $\sigma=10$.]
Figure 4: Top: usual masks for functional map regularization. Bottom: estimated distillation masks at different noise levels.

We can incorporate this mask into the functional map computation: (1) we estimate a “raw” functional map $C_{\text{raw}}$ from the input descriptors using Eq. 1 with $\alpha=0$; (2) we then estimate a mask $M_{\sigma}$ from $C_{\text{raw}}$ and solve Eq. 1 with $M_{\sigma}$ to obtain a mask-regularized map $C_{\text{reg}}$.
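The two solves above can be sketched as follows. Since Eq. 1 is stated earlier in the paper, this example assumes it has the common mask-regularized least-squares form $\min_{C}\|CA-B\|^{2}+\alpha\|M\cdot C\|^{2}$, where $A$ and $B$ hold the spectral coefficients of the descriptors; `solve_fmap` and the band-shaped `band_mask_sq` are hypothetical stand-ins (in the method, the mask would come from the diffusion model).

```python
import numpy as np

def solve_fmap(A, B, mask_sq=None, alpha=0.0):
    """Solve min_C ||C A - B||_F^2 + alpha * ||M . C||_F^2 row by row.

    A, B    : (k, d) spectral coefficients of the descriptors on each shape.
    mask_sq : (k, k) element-wise squared mask M^2, or None when alpha = 0.
    """
    k = A.shape[0]
    AAt = A @ A.T  # (k, k), shared by all rows
    BAt = B @ A.T  # (k, k), row i is the right-hand side for row i of C
    if mask_sq is None or alpha == 0.0:
        return np.linalg.solve(AAt, BAt.T).T
    C = np.empty((k, k))
    for i in range(k):
        # Row i has its own diagonal penalty alpha * diag(M_i^2).
        C[i] = np.linalg.solve(AAt + alpha * np.diag(mask_sq[i]), BAt[i])
    return C

rng = np.random.default_rng(0)
k, d = 30, 128
A = rng.standard_normal((k, d))
C_true = np.diag(rng.choice([-1.0, 1.0], size=k))  # toy diagonal map
B = C_true @ A

C_raw = solve_fmap(A, B)  # step (1): alpha = 0

# Step (2): placeholder band mask that penalizes off-diagonal entries.
idx = np.arange(k)
band_mask_sq = ((idx[:, None] - idx[None, :]) / k) ** 2
C_reg = solve_fmap(A, B, band_mask_sq, alpha=1.0)
```

Each row of $C$ decouples because the penalty is element-wise, so the regularized problem reduces to $k$ small linear systems.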

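Column groups: Humans = FAUST, SCAPE, SHREC19; Humanoids = DT4D-Intra, DT4D-Inter; Animals = SMAL, TOSCA.

| Type | Method | FAUST | SCAPE | SHREC19 | DT4D-Intra | DT4D-Inter | SMAL | TOSCA |
|---|---|---|---|---|---|---|---|---|
| Axiom. | Ini + Zoomout (Laplacian) | 3.8 | 7.5 | 13.1 | 1.8 | 16.5 | 18.3 | 8.1 |
| Axiom. | Ini + Zoomout (Resolvent) | 3.2 | 5.7 | 12.4 | 1.6 | 13.4 | 19.1 | 5.4 |
| Axiom. | Smooth shells [24] | 2.5 | 4.7 | 12.2 | / | / | 16.3 | / |
| Learned | 3D-CODED [30] | 7.5 | 17.2 | 13.4 | 45.0 | 61.4 | 54.6 | 32.8 |
| Learned | Neural Jacobian Fields [1] | 5.9 | 11.7 | 9.6 | 43.4 | 32.8 | 49.2 | 50.2 |
| Learned | Simplified Fmaps [45] | 1.7 | 2.3 | 3.4 | 2.0 | 8.9 | 42.1 | 5.1 |
| Zero-shot | SNK [5] | 1.8 | 4.7 | 5.8 | 2.0 | 9.0 | 9.1 | 3.6 |
| Zero-shot | Ini + Zoomout (our mask) | 2.4 | 6.6 | 8.3 | 2.1 | 11.7 | 12.9 | 8.3 |
| Zero-shot | Ours | 1.9 | 4.4 | 3.9 | 1.8 | 8.6 | 10.1 | 2.9 |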

Table 1: Comparison of matching accuracy of axiomatic, learned, and zero-shot shape matching methods. The learning-based methods are trained on human shapes from Dynamic FAUST. The lower the better.

Proper SDS  As in SDS, we do not backpropagate through the mask estimation during optimization. Moreover, it has been shown that projecting the functional map onto the space of “proper” maps [61, 6] (maps that arise from a pointwise map) is necessary for convergence. We apply Zoomout [48] to the regularized functional map $C_{\text{reg}}$ and obtain a proper regularized map $C_{\text{proper}}$. We minimize the $L_{2}$ distance between the raw map and the proper map:

$$\mathcal{L}_{\text{proper}}(C_{\text{raw}})=\|C_{\text{raw}}-C_{\text{proper}}\|^{2}, \tag{13}$$

where we only backpropagate to the feature extractor weights through $C_{\text{raw}}$. Finally, we also apply SDS to the absolute raw map. Thus, our total loss is:

$$\mathcal{L}_{\text{total}}(C_{\text{raw}})=\mathcal{L}_{\text{proper}}(C_{\text{raw}})+\mathcal{L}_{\text{SDS}}(|C_{\text{raw}}|) \tag{14}$$
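The stop-gradient in Eq. 13 can be made explicit with a small example: the proper map acts as a fixed target, so the gradient with respect to the raw map is simply twice the residual. The NumPy sketch below covers only the properness term (the SDS term is defined elsewhere in the paper and is not reproduced here), uses random matrices as stand-ins, and assumes both maps are expressed at the same spectral size; `proper_loss_and_grad` is a hypothetical helper. In an autodiff framework, the same effect would be obtained by detaching the proper map from the graph.

```python
import numpy as np

def proper_loss_and_grad(C_raw, C_proper):
    """Return L_proper (Eq. 13) and its gradient with respect to C_raw only."""
    residual = C_raw - C_proper        # C_proper is treated as a constant target
    loss = float(np.sum(residual**2))  # squared Frobenius norm
    grad_raw = 2.0 * residual          # nothing flows back through C_proper
    return loss, grad_raw

rng = np.random.default_rng(0)
C_raw = rng.standard_normal((30, 30))     # stand-in for the raw map
C_proper = rng.standard_normal((30, 30))  # stand-in for the Zoomout output
loss, grad = proper_loss_and_grad(C_raw, C_proper)

# One plain gradient step on C_raw pulls it towards the proper map.
loss_after, _ = proper_loss_and_grad(C_raw - 0.1 * grad, C_proper)
```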

Summary  Given a pair of shapes and a trained spectral diffusion model, at test time we optimize a pointwise feature extractor, $F_{i}=f_{\theta}(\mathcal{S}_{i})$. (1) Given features on the two shapes, we first estimate a functional map $C^{12}_{\text{raw}}$ between the shapes by solving Eq. 1 with $\alpha=0$. (2) We use this map to distill a mask $M_{\sigma}$ from the diffusion model, and solve Eq. 1 a second time to obtain a regularized map $C^{12}_{\text{reg}}$, to which we apply Zoomout to obtain $C^{12}_{\text{proper}}$. (3) This map, along with the diffusion model, is used to compute $\mathcal{L}_{\text{total}}$ (Eq. 14). (4) The parameters $\theta$ of the feature extractor $f_{\theta}$ are optimized through back-propagation. (5) We convert the optimized functional map $C^{12}$ to a point-to-point map with the standard approach [51]. Note that our pipeline differs from previous deep functional maps approaches in that we avoid axiomatic priors, such as Laplacian commutativity or orthogonality, both at mask estimation and during training. Instead, all of our regularization and objective terms, except for the basic properness term, are derived solely from available training data.

5 Experiments

Figure 5: Texture transfer on different examples. We observe that deformation-based models such as 3D-CODED or Neural Jacobian Fields struggle with challenging poses and fail to generalize to unseen meshes. Meanwhile, the learned pointwise descriptors of Simplified Fmaps generalize well to humanoids but struggle when confronted with new categories such as animals. The SNK zero-shot approach can match shapes from different categories but struggles with challenging poses. Our approach infers good-quality shape correspondences on each of these challenging examples.

5.1 Experimental Details

Functional Map Diffusion Model.  We train our diffusion model on $30\times 30$ maps, using template-to-shape maps on the D-FAUST dataset, for a total of $\sim 40{,}000$ maps. The architecture is the DiT-S [56] Diffusion Transformer with a patch size of 5. We train our model with EDM [37] for 1000 epochs with the variance preserving loss.
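As a small illustration of the tokenization implied by a patch size of 5: a DiT-style patchify step splits a $30\times 30$ map into $(30/5)^{2}=36$ non-overlapping patches of 25 entries each. The sketch below assumes the map is treated as a single-channel image; `patchify` is an illustrative helper and not the authors' code.

```python
import numpy as np

def patchify(C, patch=5):
    """Split an (n, n) map into (n / patch)^2 flattened patches."""
    n = C.shape[0]
    g = n // patch  # patches per side
    blocks = C.reshape(g, patch, g, patch).transpose(0, 2, 1, 3)
    return blocks.reshape(g * g, patch * patch)

C = np.arange(30 * 30, dtype=float).reshape(30, 30)  # dummy map
tokens = patchify(C, patch=5)                        # one row per patch
```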

Zero-shot Optimization.  We use DiffusionNet [68] as our feature extractor. The estimated functional map size is $30\times 30$. We follow the zero-shot experimental settings from SNK [5]. We set $\sigma=1$ for our distilled mask, as this value gave the best results, and use $N=100$ noisy samples, which can be denoised in a single batch. Iterating the process did not provide significant improvement. We apply Zoomout to increase the map size from $30\times 30$ to $40\times 40$ to compute $\mathcal{L}_{\text{proper}}$.
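The spectral upsampling from $30\times 30$ to $40\times 40$ can be sketched in the spirit of Zoomout [48]: alternate between converting the functional map to a point-to-point map by nearest-neighbor search in the spectral embedding, and re-encoding that map in a slightly larger basis. The NumPy code below is a simplified illustration under stated assumptions: it uses brute-force nearest neighbors and an unweighted least-squares projection instead of area-weighted inner products, and all function names are hypothetical.

```python
import numpy as np

def fmap_to_p2p(C, Phi1, Phi2):
    """Point-to-point map T: S2 -> S1 from a (k2 x k1) functional map."""
    k2, k1 = C.shape
    emb1 = Phi1[:, :k1]      # (n1, k1) spectral embedding of S1
    emb2 = Phi2[:, :k2] @ C  # (n2, k1) embedding of S2 mapped through C
    # Brute-force nearest neighbor; a KD-tree would be used on real meshes.
    d2 = ((emb2**2).sum(1)[:, None] - 2.0 * emb2 @ emb1.T
          + (emb1**2).sum(1)[None, :])
    return d2.argmin(axis=1)  # (n2,) vertex indices on S1

def p2p_to_fmap(T, Phi1, Phi2, k):
    """Functional map of size k x k induced by T (least-squares projection)."""
    return np.linalg.lstsq(Phi2[:, :k], Phi1[T, :k], rcond=None)[0]

def zoomout(C, Phi1, Phi2, k_final, step=1):
    """Grow a square functional map to size k_final, one step at a time."""
    k = C.shape[0]
    while k < k_final:
        k = min(k + step, k_final)
        T = fmap_to_p2p(C, Phi1, Phi2)
        C = p2p_to_fmap(T, Phi1, Phi2, k)
    return C

# Toy check: S2 is a vertex permutation of S1, so the true map is the identity.
rng = np.random.default_rng(0)
Phi1 = np.linalg.qr(rng.standard_normal((200, 40)))[0]  # orthonormal toy basis
perm = rng.permutation(200)
Phi2 = Phi1[perm]
C40 = zoomout(np.eye(30), Phi1, Phi2, k_final=40)
T40 = fmap_to_p2p(C40, Phi1, Phi2)
```

Because each iterate is computed from a pointwise map, the output is a “proper” functional map by construction, which is the property the properness term relies on.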

5.2 Datasets and Comparison

Near-isometric Shape Matching.  We first test the generalization of our approach on human data. Specifically, we test on the oriented versions of the remeshed FAUST [8], SCAPE [3], and SHREC [47] datasets, on the usual test sets from commonly used train/test splits.

Non-isometric Shape Matching.  We then test our approach on unseen data types, namely humanoids and animals. For humanoids, we use the remeshed split of the DeformingThings4D dataset (DT4D) [41], with the intra-category and inter-category test sets from [40]. For animals, we test our approach on the SMAL remeshed dataset [21] and animal shape pairs from the TOSCA dataset [13], as done in [73].

Baselines.  We compare our method with axiomatic [24], learned [1, 30, 45] and zero-shot [5] baselines. A detailed description is provided in the supplementary material.

5.3 Results

We follow the Princeton benchmark evaluation protocol [38] and evaluate the accuracy of the maps using the geodesic error of the computed correspondence. We present the results on near-isometric data on the left of Tab. 1. We outperform both axiomatic and other zero-shot approaches on this task. Notably, we are close to the state-of-the-art deep functional map approach Simplified Fmaps [45]. Also, our distilled mask provides, in general, a good-quality initialization that is competitive with other approaches and outperforms the Laplacian and Resolvent masks (Ini + Zoomout).

The results on non-isometric data are on the right of Tab. 1. Notably, our approach outperforms other approaches on the DT4D-Inter challenge and the TOSCA dataset, including Simplified Fmaps. 3D-CODED and Neural Jacobian Fields, which are based on learned deformation models, perform poorly on humanoids and animals, as the learned deformations are category-specific. Moreover, Simplified Fmaps fails on the SMAL dataset, suggesting that learned descriptors do not generalize well to animals. We also outperform SNK on most datasets, and our distilled mask provides a better initialization than the traditional Laplacian and Resolvent masks (Ini + Zoomout).
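For reference, a geodesic error of this kind can be computed as sketched below: for every vertex, take the geodesic distance between the predicted and the ground-truth match, normalize by the square root of the surface area, and average. This is a minimal NumPy illustration on a toy 5-vertex path; `mean_geodesic_error` is a hypothetical helper, and any rescaling used for reporting (for example multiplying by 100) is not assumed here.

```python
import numpy as np

def mean_geodesic_error(T_pred, T_gt, geo_dist, area):
    """Mean geodesic error of a predicted map, normalized by sqrt(area).

    T_pred, T_gt : (n,) predicted and ground-truth vertex indices on the
                   shape that the map points to.
    geo_dist     : (m, m) geodesic distance matrix on that shape.
    area         : total surface area of that shape.
    """
    errors = geo_dist[T_pred, T_gt] / np.sqrt(area)
    return float(errors.mean())

# Toy example: 5 vertices on a path, geodesic distance |i - j|, area 4.
idx = np.arange(5)
geo_dist = np.abs(idx[:, None] - idx[None, :]).astype(float)
T_gt = np.array([0, 1, 2, 3, 4])
T_pred = np.array([0, 1, 3, 3, 2])  # two wrong matches, off by 1 and 2
err = mean_geodesic_error(T_pred, T_gt, geo_dist, area=4.0)
```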

5.4 Ablation study

We ablate the different components of our approach in Table 2. As stated in Sec. 3.3, vanilla SDS fails to correct misalignments and shows poor results. Using $\mathcal{L}_{\text{proper}}$ alone is effective but remains far from state-of-the-art performance. The best results are obtained by combining $\mathcal{L}_{\text{proper}}$ and $\mathcal{L}_{\text{SDS}}$. Finally, we added axiomatic penalties (orthogonality, bijectivity, and Laplacian losses) to DiffuMatch. We find that the final accuracy is nearly the same, indicating that our formulation already encompasses these axiomatic regularizations.

| Approach | Geod. Error (SHREC) |
|---|---|
| Vanilla SDS | 57.3 |
| Mask + Zoomout | 8.3 |
| $\mathcal{L}_{\text{proper}}$ | 7.7 |
| Mask + $\mathcal{L}_{\text{SDS}}$ | 7.1 |
| Mask + $\mathcal{L}_{\text{proper}}$ | 6.7 |
| Mask + $\mathcal{L}_{\text{proper}}$ + $\mathcal{L}_{\text{SDS}}$ (ours) | 4.4 |
| Ours + Axiomatic | 4.3 |
Table 2: Ablation study of the different components of our approach

6 Limitations and future work

Despite its effectiveness, our method may not handle highly non-isometric or partial shapes [62] well, which is a known issue of functional map-based methods [11, 12]. A potential direction to mitigate this problem is to jointly learn the basis along with the spectral regularization. Moreover, our diffusion model is trained only on human shapes with limited diversity. Using or generating more registered training data will be crucial towards a unified model for functional maps.

7 Conclusion

In this work, we presented a score-based generative model of functional maps, trained on registered human shapes, to learn the structure induced by the distribution of functional maps. Based on our model, we proposed a novel functional map penalty and a zero-shot training pipeline to match shape pairs at test time. Our experiments demonstrate state-of-the-art results for zero-shot shape matching on diverse benchmarks, including categories unseen during training. We believe our approach will serve as a first step towards foundation models for shape matching.

Acknowledgements

This work was performed using HPC resources from GENCI-IDRIS (Grant 2025-AD010613760R2). Parts of this work were supported by the ERC Consolidator Grant 101087347 (VEGA), as well as gifts from Ansys and Adobe Research, and the ERC Starting Grant SpatialSem (101076253).

References

  • Aigerman et al. [2022] Noam Aigerman, Kunal Gupta, Vladimir G. Kim, Siddhartha Chaudhuri, Jun Saito, and Thibault Groueix. Neural jacobian fields: learning intrinsic mappings of arbitrary meshes. ACM Trans. Graph., 41(4), 2022.
  • Alexa et al. [2000] Marc Alexa, Daniel Cohen-Or, and David Levin. As-rigid-as-possible shape interpolation. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, page 157–164, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
  • Anguelov et al. [2023] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. Scape: Shape completion and animation of people. 2023.
  • Attaiki and Ovsjanikov [2022] Souhaib Attaiki and Maks Ovsjanikov. Ncp: Neural correspondence prior for effective unsupervised shape matching. Advances in Neural Information Processing Systems, 35:28842–28857, 2022.
  • Attaiki and Ovsjanikov [2023a] Souhaib Attaiki and Maks Ovsjanikov. Shape non-rigid kinematics (snk): A zero-shot method for non-rigid shape matching via unsupervised functional map regularized reconstruction. Advances in Neural Information Processing Systems, 36:70012–70032, 2023a.
  • Attaiki and Ovsjanikov [2023b] Souhaib Attaiki and Maks Ovsjanikov. Understanding and improving features learned in deep functional maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1316–1326, 2023b.
  • Bensaïd et al. [2023] David Bensaïd, Amit Bracha, and Ron Kimmel. Partial shape similarity by multi-metric hamiltonian spectra matching. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 717–729. Springer, 2023.
  • Bogo et al. [2014] Federica Bogo, Javier Romero, Matthew Loper, and Michael J Black. Faust: Dataset and evaluation for 3d mesh registration. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3794–3801, 2014.
  • Bogo et al. [2017] Federica Bogo, Javier Romero, Gerard Pons-Moll, and Michael J Black. Dynamic faust: Registering human bodies in motion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6233–6242, 2017.
  • Bracha et al. [2020] Amit Bracha, Oshri Halim, and Ron Kimmel. Shape Correspondence by Aligning Scale-invariant LBO Eigenfunctions. In Eurographics Workshop on 3D Object Retrieval. The Eurographics Association, 2020.
  • Bracha et al. [2024] Amit Bracha, Thomas Dagès, and Ron Kimmel. On unsupervised partial shape correspondence. In Proceedings of the Asian Conference on Computer Vision, pages 4488–4504, 2024.
  • Bracha et al. [2025] Amit Bracha, Thomas Dagès, and Ron Kimmel. Wormhole loss for partial shape matching. In Proceedings of the 38th International Conference on Neural Information Processing Systems, 2025.
  • Bronstein et al. [2008] Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Numerical geometry of non-rigid shapes. Springer Science & Business Media, 2008.
  • Cao and Bernard [2022] Dongliang Cao and Florian Bernard. Unsupervised deep multi-shape matching. In European Conference on Computer Vision, pages 55–71. Springer, 2022.
  • Cao et al. [2023] Dongliang Cao, Paul Roetzer, and Florian Bernard. Unsupervised learning of robust spectral shape matching. ACM Transactions on Graphics (TOG), 42:1 – 15, 2023.
  • Cao et al. [2024a] Dongliang Cao, Marvin Eisenberger, Nafie El Amrani, Daniel Cremers, and Florian Bernard. Spectral meets spatial: Harmonising 3d shape matching and interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3658–3668, 2024a.
  • Cao et al. [2024b] Dongliang Cao, Zorah Lähner, and Florian Bernard. Synchronous diffusion for unsupervised smooth non-rigid 3d shape matching. In European Conference on Computer Vision, pages 262–281. Springer, 2024b.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Dinh et al. [2005] Huong Quynh Dinh, Anthony Yezzi, and Greg Turk. Texture transfer during shape transformation. ACM Transactions on Graphics (ToG), 24(2):289–310, 2005.
  • Donati et al. [2020] Nicolas Donati, Abhishek Sharma, and Maks Ovsjanikov. Deep geometric functional maps: Robust feature learning for shape correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8592–8601, 2020.
  • Donati et al. [2022] Nicolas Donati, Etienne Corman, Simone Melzi, and Maks Ovsjanikov. Complex functional maps: A conformal link between tangent bundles. In Computer Graphics Forum, pages 317–334. Wiley Online Library, 2022.
  • Ehm et al. [2024a] Viktoria Ehm, Maolin Gao, Paul Roetzer, Marvin Eisenberger, Daniel Cremers, and Florian Bernard. Partial-to-partial shape matching with geometric consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27488–27497, 2024a.
  • Ehm et al. [2024b] Viktoria Ehm, Paul Roetzer, Marvin Eisenberger, Maolin Gao, Florian Bernard, and Daniel Cremers. Geometrically consistent partial shape matching. In 2024 International Conference on 3D Vision (3DV), pages 914–922. IEEE, 2024b.
  • Eisenberger et al. [2020a] Marvin Eisenberger, Zorah Lahner, and Daniel Cremers. Smooth shells: Multi-scale shape registration with functional maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12265–12274, 2020a.
  • Eisenberger et al. [2020b] Marvin Eisenberger, Aysim Toker, Laura Leal-Taixé, and Daniel Cremers. Deep shells: Unsupervised shape correspondence with optimal transport. Advances in Neural information processing systems, 33:10491–10502, 2020b.
  • Eisenberger et al. [2023] Marvin Eisenberger, Aysim Toker, Laura Leal-Taixé, and Daniel Cremers. G-msm: Unsupervised multi-shape matching with graph-based affinity priors. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • [27] Marco Fumero, Marco Pegoraro, Valentino Maiorca, Francesco Locatello, and Emanuele Rodolà. Latent functional maps: a spectral framework for representation alignment. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
  • Gao et al. [2021] Maolin Gao, Zorah Lahner, Johan Thunberg, Daniel Cremers, and Florian Bernard. Isometric multi-shape matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14183–14193, 2021.
  • Gao et al. [2023] Maolin Gao, Paul Roetzer, Marvin Eisenberger, Zorah Lähner, Michael Moeller, Daniel Cremers, and Florian Bernard. Sigma: Scale-invariant global sparse shape matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 645–654, 2023.
  • Groueix et al. [2018] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan Russell, and Mathieu Aubry. 3d-coded: 3d correspondences by deep deformation. In ECCV, 2018.
  • Halimi and Kimmel [2018] Oshri Halimi and Ron Kimmel. Self functional maps. In 2018 International Conference on 3D Vision (3DV), pages 710–718. IEEE, 2018.
  • Halimi et al. [2019] Oshri Halimi, Or Litany, Emanuele Rodola, Alex M Bronstein, and Ron Kimmel. Unsupervised learning of dense shape correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4370–4379, 2019.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Huang et al. [2021] Q. Huang, X. Huang, B. Sun, Z. Zhang, J. Jiang, and C. Bajaj. Arapreg: An as-rigid-as possible regularization loss for learning deformable shape generators. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5795–5805, Los Alamitos, CA, USA, 2021. IEEE Computer Society.
  • Hyvärinen and Dayan [2005] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  • Jiang et al. [2023] Haobo Jiang, Mathieu Salzmann, Zheng Dang, Jin Xie, and Jian Yang. Se (3) diffusion model-based point cloud registration for robust 6d object pose estimation. Advances in Neural Information Processing Systems, 36:21285–21297, 2023.
  • Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022.
  • Kim et al. [2011] Vladimir G Kim, Yaron Lipman, and Thomas Funkhouser. Blended intrinsic maps. ACM transactions on graphics (TOG), 30(4):1–12, 2011.
  • Li et al. [2022a] Lei Li, Souhaib Attaiki, and Maks Ovsjanikov. Srfeat: Learning locally accurate and globally consistent non-rigid shape correspondence. In 2022 International Conference on 3D Vision (3DV), pages 144–154. IEEE, 2022a.
  • Li et al. [2022b] Lei Li, Nicolas Donati, and Maks Ovsjanikov. Learning multi-resolution functional maps with spectral attention for robust shape matching. Advances in Neural Information Processing Systems, 35:29336–29349, 2022b.
  • Li et al. [2021] Yang Li, Hikari Takehara, Takafumi Taketomi, Bo Zheng, and Matthias Nießner. 4dcomplete: Non-rigid motion estimation beyond the observable surface. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12706–12716, 2021.
  • Litany et al. [2017] Or Litany, Tal Remez, Emanuele Rodola, Alex Bronstein, and Michael Bronstein. Deep functional maps: Structured prediction for dense shape correspondence. In Proceedings of the IEEE international conference on computer vision, pages 5659–5667, 2017.
  • Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471, 2022.
  • Lukoianov et al. [2024] Artem Lukoianov, Haitz Sáez de Ocáriz Borde, Kristjan Greenewald, Vitor Campagnolo Guizilini, Timur Bagautdinov, Vincent Sitzmann, and Justin Solomon. Score distillation via reparametrized DDIM. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  • Magnet and Ovsjanikov [2024] Robin Magnet and Maks Ovsjanikov. Memory-scalable and simplified functional map learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4041–4050, 2024.
  • Magnet et al. [2022] Robin Magnet, Jing Ren, Olga Sorkine-Hornung, and Maks Ovsjanikov. Smooth non-rigid shape matching via effective dirichlet energy optimization. In International Conference on 3D Vision (3DV), 2022.
  • Melzi et al. [2019a] Simone Melzi, Riccardo Marin, Emanuele Rodolà, Umberto Castellani, Jing Ren, Adrien Poulenard, P Ovsjanikov, et al. Shrec’19: matching humans with different connectivity. In Eurographics Workshop on 3D Object Retrieval, pages 1–8. The Eurographics Association, 2019a.
  • Melzi et al. [2019b] Simone Melzi, Jing Ren, Emanuele Rodolà, Abhishek Sharma, Peter Wonka, and Maks Ovsjanikov. Zoomout: spectral upsampling for efficient shape correspondence. ACM Trans. Graph., 38(6), 2019b.
  • Meng et al. [2022] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. 2022.
  • Merrouche et al. [2023] Aymen Merrouche, Joao Pedro Cova Regateiro, Stefanie Wuhrer, and Edmond Boyer. Deformation-guided unsupervised non-rigid shape matching. In 34th British Machine Vision Conference 2023, BMVC 2023, Aberdeen, UK, November 20-24, 2023. BMVA, 2023.
  • Ovsjanikov et al. [2012] Maks Ovsjanikov, Mirela Ben-Chen, Justin Solomon, Adrian Butscher, and Leonidas Guibas. Functional maps: a flexible representation of maps between shapes. ACM Trans. Graph., 31(4), 2012.
  • Ovsjanikov et al. [2016] Maks Ovsjanikov, Etienne Corman, Michael Bronstein, Emanuele Rodolà, Mirela Ben-Chen, Leonidas Guibas, Frederic Chazal, and Alex Bronstein. Computing and processing correspondences with functional maps. In SIGGRAPH ASIA 2016 Courses, New York, NY, USA, 2016. Association for Computing Machinery.
  • Pai et al. [2021] Gautam Pai, Jing Ren, Simone Melzi, Peter Wonka, and Maks Ovsjanikov. Fast sinkhorn filters: Using matrix scaling for non-rigid shape correspondence with functional maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 384–393, 2021.
  • Panine et al. [2022] Mikhail Panine, Maxime Kirgo, and Maks Ovsjanikov. Non-isometric shape matching via functional maps on landmark-adapted bases. In Computer graphics forum, pages 394–417. Wiley Online Library, 2022.
  • Parisi [1981] Giorgio Parisi. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378–384, 1981.
  • Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
  • Poole et al. [2023] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations, 2023.
  • Ren et al. [2018] Jing Ren, Adrien Poulenard, Peter Wonka, and Maks Ovsjanikov. Continuous and orientation-preserving correspondences via functional maps. ACM Transactions on Graphics (ToG), 37(6):1–16, 2018.
  • Ren et al. [2019] Jing Ren, Mikhail Panine, Peter Wonka, and Maks Ovsjanikov. Structured regularization of functional map computations. In Computer Graphics Forum, pages 39–53. Wiley Online Library, 2019.
  • Ren et al. [2020] Jing Ren, Simone Melzi, Maks Ovsjanikov, and Peter Wonka. Maptree: Recovering multiple solutions in the space of maps. ACM Transactions on Graphics, 39(6):1–17, 2020.
  • Ren et al. [2021] Jing Ren, Simone Melzi, Peter Wonka, and Maks Ovsjanikov. Discrete optimization for shape matching. In Computer Graphics Forum, pages 81–96. Wiley Online Library, 2021.
  • Rodolà et al. [2017] Emanuele Rodolà, Luca Cosmo, Michael M Bronstein, Andrea Torsello, and Daniel Cremers. Partial functional correspondence. In Computer graphics forum, pages 222–236. Wiley Online Library, 2017.
  • Roetzer and Bernard [2024] Paul Roetzer and Florian Bernard. Spidermatch: 3d shape matching with global optimality and geometric consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14543–14553, 2024.
  • Roetzer et al. [2024] Paul Roetzer, Ahmed Abbas, Dongliang Cao, Florian Bernard, and Paul Swoboda. Discomatch: Fast discrete optimisation for geometrically consistent 3d shape matching. In European Conference on Computer Vision, pages 443–460. Springer, 2024.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Roufosse et al. [2019] Jean-Michel Roufosse, Abhishek Sharma, and Maks Ovsjanikov. Unsupervised deep learning for structured shape matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1617–1627, 2019.
  • Sharma and Ovsjanikov [2020] Abhishek Sharma and Maks Ovsjanikov. Weakly supervised deep functional maps for shape matching. Advances in Neural Information Processing Systems, 33:19264–19275, 2020.
  • Sharp et al. [2022] Nicholas Sharp, Souhaib Attaiki, Keenan Crane, and Maks Ovsjanikov. Diffusionnet: Discretization agnostic learning on surfaces. ACM Transactions on Graphics (TOG), 41(3):1–16, 2022.
  • Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Song et al. [2021] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  • Song et al. [2022] Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging with score-based generative models. In International Conference on Learning Representations, 2022.
  • Sorkine and Alexa [2007] Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry processing, pages 109–116. Citeseer, 2007.
  • Sun et al. [2023] Mingze Sun, Shiwei Mao, Puhua Jiang, Maks Ovsjanikov, and Ruqi Huang. Spatially and spectrally consistent deep functional maps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14497–14507, 2023.
  • Trappolini et al. [2021] Giovanni Trappolini, Luca Cosmo, Luca Moschella, Riccardo Marin, Simone Melzi, and Emanuele Rodolà. Shape registration in the time of transformers. Advances in Neural Information Processing Systems, 34:5731–5744, 2021.
  • Vincent [2011] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Wang et al. [2024] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in Neural Information Processing Systems, 36, 2024.
  • Weber et al. [2024] Simon Weber, Thomas Dages, Maolin Gao, and Daniel Cremers. Finsler-laplace-beltrami operators with application to shape analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3131–3140, 2024.
  • Wu et al. [2024] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21551–21561, 2024.
  • Yan et al. [2014] Dong-Ming Yan, Guanbo Bao, Xiaopeng Zhang, and Peter Wonka. Low-resolution remeshing using the localized restricted voronoi diagram. IEEE Transactions on Visualization and Computer Graphics (TVCG), 2014.
  • Yang et al. [2024] Haitao Yang, Xiangru Huang, Bo Sun, Chandrajit L. Bajaj, and Qixing Huang. Gencorres: Consistent shape matching via coupled implicit-explicit shape generative models. In The Twelfth International Conference on Learning Representations, 2024.
  • Yona et al. [2025] Gal Yona, Roy Velich, Ehud Rivlin, and Ron Kimmel. Neural descriptors: Self-supervised learning of robust local surface descriptors using polynomial patches. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 218–230. Springer, 2025.
  • Zhu et al. [2024] Junzhe Zhu, Peiye Zhuang, and Sanmi Koyejo. HIFA: High-fidelity text-to-3d generation with advanced diffusion guidance. In The Twelfth International Conference on Learning Representations, 2024.
  • Zhuravlev et al. [2025] Aleksei Zhuravlev, Zorah Lähner, and Vladislav Golyanik. Denoising functional maps: Diffusion models for shape correspondence. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26899–26909, 2025.
  • Zuffi et al. [2017] Silvia Zuffi, Angjoo Kanazawa, David W Jacobs, and Michael J Black. 3d menagerie: Modeling the 3d shape and pose of animals. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6365–6373, 2017.

Supplementary Material

In this supplementary material, we first provide insights on the sign ambiguity problem of functional maps in Sec. 8. We provide more experimentation details about datasets (Sec. 9), baselines (Sec. 10), and implementation (Sec. 11) for the experiments in Sec. 5 of the paper. Next, we show the behavior of our method with a plot of the loss during zero-shot optimization in Sec. 12, a visualization of denoising trajectories in Sec. 13, and finally an analysis of descriptors in Sec. 14. We also show that the matching provided by our method allows competitive reconstruction of input shapes by combining it with the ARAP energy in Sec. 15. Finally, in Sec. 16, we provide a simple experiment providing insights on sparsity-promoting mask efficiency.

8 Sign Ambiguity of Functional Maps

This phenomenon occurs because functional maps are nearly discrete at low frequencies. Indeed, it has been observed that the ground truth maps at low frequency follow a diagonal structure [51], where the values of the diagonal elements are $\pm 1$ (modulo volume changes). This affects the overall trajectory of generation, where the signs of the diagonal elements remain unchanged (Fig. 6), and thus the capacity of diffusion models to provide efficient spectral regularization. To better capture the underlying structure of the functional maps from data, we therefore chose to adopt a sign-agnostic approach.
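The sign ambiguity can be reproduced on a toy example. The sketch below is not taken from the paper's code: it assumes the usual convention $C = \Phi_N^{+} \Pi \Phi_M$ with orthonormal bases, and it uses random orthonormal matrices as stand-ins for Laplacian eigenbases. Flipping the sign of individual basis functions, which is equally valid for an eigendecomposition, flips the signs of rows and columns of $C$ while leaving the entrywise absolute value unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 10  # toy sizes: n vertices, k basis functions (hypothetical values)

# Stand-ins for truncated eigenbases: orthonormal columns from a QR factorization.
phi_src, _ = np.linalg.qr(rng.standard_normal((n, k)))
perm = rng.permutation(n)   # toy ground-truth map: target vertex i <-> source vertex perm[i]
phi_tgt = phi_src[perm]     # target basis = source basis with reordered vertices


def functional_map(phi_tgt, phi_src, perm):
    """C = pinv(phi_tgt) @ Pi @ phi_src, with Pi given as an index array."""
    return np.linalg.pinv(phi_tgt) @ phi_src[perm]


C = functional_map(phi_tgt, phi_src, perm)

# A basis function is only defined up to sign: flip random columns of both bases.
s_src = rng.choice([-1.0, 1.0], size=k)
s_tgt = rng.choice([-1.0, 1.0], size=k)
C_flipped = functional_map(phi_tgt * s_tgt, phi_src * s_src, perm)

# Entry (i, j) is multiplied by s_tgt[i] * s_src[j]; magnitudes are untouched.
sign_pattern = np.outer(s_tgt, s_src)
print(np.allclose(C_flipped, sign_pattern * C))
print(np.allclose(np.abs(C_flipped), np.abs(C)))
```

In this idealized setting the map is exactly diagonal with entries $\pm 1$ after the flip, which mirrors the low-frequency structure described above; a model trained on $|C|$ does not have to choose among these equally valid sign patterns.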

Refer to caption
Figure 6: Generation process of a functional map using a diffusion model. For low-frequency elements (green square), the sign of diagonal elements at the Gaussian noise step never changes during the denoising process. This explains why spectral regularization with SDS fails to correct misalignments effectively.

9 Datasets

Near-isometric Shape Matching.  The remeshed version of FAUST [58, 8] contains 10 individuals in 10 different poses. SCAPE [3] contains 50 challenging poses of one individual. SHREC [47] contains 50 humans from different datasets, with 407 pairs annotated using an automatic human registration algorithm (partial shape matching pairs are excluded).

Non-isometric Shape Matching.  The matching version of the DT4D dataset [46] contains more than 400 shapes, with more than 1000 annotated pairs remeshed using the LRVD algorithm [79], from which we use the intra-category and inter-category test sets from [40]. The SMAL remeshed dataset [21] contains around 400 animal pairs extracted from real images using the SMAL deformation model [84]. The animal shape pairs from the TOSCA dataset come from the cat, dog, horse, and wolf categories.

10 Baselines

We compare our method against several baselines for shape matching. 3D-CODED [30] is an autoencoder trained specifically for shape matching. The shape latent vectors are computed and refined by optimizing the obtained registrations. Neural Jacobian Fields [1] is a model that predicts the Jacobian of deformation instead of vertex positions and generalizes to unregistered meshes. Smooth shells [24] is an axiomatic approach that refines functional maps in a coarse-to-fine approach to obtain plausible final correspondences. Shape-Non-Rigid-Kinematics (SNK) [5] is a state-of-the-art zero-shot algorithm to train deep feature extractors on pairs of shapes. We also compare to a state-of-the-art deep functional maps approach, Simplified Fmaps [45]. All trainable models are trained on the D-FAUST dataset. Finally, we also show the results of using a feature extractor with random weights combined with different masks.

11 Experimental Details

Feature Extractor.  We follow the zero-shot experimental settings from SNK [5]. The feature extractor consists of four DiffusionNet blocks of dimension 256, and we use 128 eigenvectors for the heat diffusion. The input features of the feature extractor are XYZ features on the oriented versions of each dataset [67]. We set $\lambda = 0.1$ for humans and $\lambda = 10^{-3}$ for the other datasets. For the Ini+Zoomout scenario with our mask, we set $\lambda = 1$.

Diffusion Model Training.  We train our spectral diffusion model for 1000 steps. The training setting is the same as in [37], with optimal reweighting of the losses and using the variance-preserving SDE, which reproduces the trajectory of DDPM [33]. No normalization of functional maps is applied, as the values inside the matrices range from -1 to 1 already.
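For readers unfamiliar with the variance-preserving SDE of Song et al. [2021], the sketch below perturbs a matrix with its closed-form transition kernel. The linear noise schedule and the values of `beta_min` and `beta_max` are illustrative assumptions (common defaults in the score-based literature), not the settings used for the spectral diffusion model in the paper, and the identity matrix is only a stand-in for a functional map.

```python
import numpy as np


def vp_marginal(t, beta_min=0.1, beta_max=20.0):
    """Mean scale and std of x_t given x_0 under the VP SDE with a linear beta(t).

    beta_min and beta_max are assumed, illustrative values.
    """
    # Integral of beta(s) = beta_min + s * (beta_max - beta_min) over [0, t].
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
    mean_scale = np.exp(-0.5 * integral)
    std = np.sqrt(1.0 - np.exp(-integral))
    return mean_scale, std


def perturb(x0, t, rng):
    """Sample x_t ~ N(mean_scale * x0, std^2 * I) and return it with the noise used."""
    mean_scale, std = vp_marginal(t)
    noise = rng.standard_normal(x0.shape)
    return mean_scale * x0 + std * noise, noise


rng = np.random.default_rng(0)
C0 = np.eye(30)  # stand-in for a 30 x 30 functional map with entries in [-1, 1]
for t in (0.01, 0.5, 1.0):
    Ct, _ = perturb(C0, t, rng)
    m, s = vp_marginal(t)
    print(f"t={t:.2f}  mean_scale={m:.3f}  std={s:.3f}  m^2+s^2={m**2 + s**2:.3f}")
```

The identity `mean_scale**2 + std**2 == 1` is what "variance-preserving" refers to: data with entries of roughly unit scale keeps that scale along the whole noising trajectory, which is consistent with leaving the functional maps unnormalized.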

Zero-Shot Training.  We train our deep functional map approach for 1000 gradient steps using the Adam optimizer. The overall training on a single pair takes approximately 180 seconds on an NVIDIA L40S GPU.

Evaluation.  For the evaluation, we refine our optimized maps using Zoomout to obtain a final map of dimension $150 \times 150$, as commonly done in deep functional map approaches [5, 45].
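Zoomout-style refinement alternates between converting the current functional map to a point-to-point map and re-estimating a slightly larger functional map from it. The following is a minimal sketch of that idea, assuming orthonormal bases, brute-force nearest-neighbor search, and a spectral step size of one. The function names are hypothetical and this is not the implementation used in the experiments.

```python
import numpy as np


def fmap_to_p2p(C, phi_src, phi_tgt):
    """For each target vertex, index of the nearest source vertex in spectral space."""
    k_tgt, k_src = C.shape
    queries = phi_tgt[:, :k_tgt] @ C   # (n_tgt, k_src): target embedding pushed through C
    database = phi_src[:, :k_src]      # (n_src, k_src): source embedding
    d2 = ((queries[:, None, :] - database[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)


def p2p_to_fmap(p2p, phi_src, phi_tgt, k):
    """Least-squares k x k functional map induced by a point-to-point map."""
    return np.linalg.pinv(phi_tgt[:, :k]) @ phi_src[p2p, :k]


def zoomout(C, phi_src, phi_tgt, k_final):
    """Grow a square functional map one spectral dimension at a time."""
    k = C.shape[0]
    while k < k_final:
        p2p = fmap_to_p2p(C, phi_src, phi_tgt)
        k += 1
        C = p2p_to_fmap(p2p, phi_src, phi_tgt, k)
    return C, fmap_to_p2p(C, phi_src, phi_tgt)


# Toy check: the target is the source with permuted vertices, so the map is exact.
rng = np.random.default_rng(0)
n, k_init, k_final = 300, 5, 20
phi_src, _ = np.linalg.qr(rng.standard_normal((n, k_final)))
perm = rng.permutation(n)
phi_tgt = phi_src[perm]
C0 = p2p_to_fmap(perm, phi_src, phi_tgt, k_init)
C, p2p = zoomout(C0, phi_src, phi_tgt, k_final)
print(C.shape, (p2p == perm).mean())
```

On real meshes the bases are orthonormal with respect to the mass matrix rather than the identity, and a spatial search structure would replace the dense distance matrix; the toy sizes above stand in for the $150 \times 150$ target used in the evaluation.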

12 Loss Behavior

We plot the loss behavior during optimization in Figure 7. The loss is smoothly optimized and converges rapidly.

Refer to caption
Figure 7: Loss and other penalties during optimization of the matching.

13 Generating Functional Maps and Absolute Functional Maps

We show two example denoising trajectories, from the original and absolute spectral diffusion models in Figure 8.

Refer to caption
Refer to caption
Figure 8: Example generation trajectories using spectral diffusion models on functional maps (top) and absolute functional maps (bottom)

14 Quality of Learned Descriptors

Learned descriptors using our approach are meaningful thanks to our proper loss. Indeed, it has been shown that when properness is encouraged, the extracted correspondence is approximately the same whether it is extracted from the functional map or by nearest neighbor search [6]. We visually verify this in Figure 9, where we show the nearest points to a selected point using nearest neighbor in the feature space (after projection on the space spanned by the first 30 eigenfunctions – the only ones used in the map computation), showing that our method enables meaningful descriptor learning in addition to the quality of the shape matching.
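The distance computation behind this visualization can be sketched as follows. The snippet is illustrative only: it uses random stand-ins for the learned descriptors and for the eigenbasis, assumes the basis is orthonormal with respect to the identity (on a real mesh the mass matrix would enter the projection), and treats a single shape, whereas the figure applies the same steps to both source and target shapes. The helper names are hypothetical.

```python
import numpy as np


def project_on_basis(features, phi):
    """Project per-vertex features onto span(phi), assuming orthonormal columns."""
    return phi @ (phi.T @ features)


def descriptor_distances(features, index):
    """Euclidean distance in descriptor space from vertex `index` to every vertex."""
    return np.linalg.norm(features - features[index], axis=1)


rng = np.random.default_rng(0)
n, d, k = 500, 16, 30                                # toy sizes; k = 30 follows the text
phi, _ = np.linalg.qr(rng.standard_normal((n, k)))   # stand-in eigenbasis
features = rng.standard_normal((n, d))               # stand-in for learned descriptors

band_limited = project_on_basis(features, phi)
dist = descriptor_distances(band_limited, index=0)
print(dist.argmin(), dist[:5])
```

Plotting `dist` as a per-vertex scalar field gives the kind of heat map shown in the figure; with meaningful descriptors, small values concentrate around the selected point and its match.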

Refer to caption
Figure 9: After applying DiffuMatch, we select a point on the source shape and compute the distance of this point to all points on both target and source shapes, in the descriptor space. We plot the obtained distances on both shapes. The closest points are points that are geodesically close to the selected point.

15 Comparison of Reconstruction of Deformation Models

As stated in the paper, deformation models are not suitable for generalization to new types of categories. In this section, we provide reconstructions from 3D-CODED and NJF of the source shape in Section 4.2. Moreover, as SNK provides a shape reconstruction as output, we also show the reconstruction provided by SNK. Finally, we extract the shape correspondence $\Pi$ from DiffuMatch and reconstruct the vertex positions of the shape in the target mesh topology by solving for the closest solution that minimizes the As-Rigid-As-Possible (ARAP) energy [72]. Let $X$ be the vertex positions of the source mesh; the reconstruction $Y_{rec}$ in the target mesh topology is given by:

$$Y_{rec} = \underset{Y}{\text{argmin}} \; \|Y - \Pi X\|^2 + E_{arap}(Y).$$
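For reference, the ARAP energy of [72] can be written as follows, taking the target mesh as the rest pose, as the caption of Figure 10 suggests. The symbols $t_i$ (target vertex positions), $y_i$ (rows of $Y$), $\mathcal{N}(i)$ (one-ring neighbors), $R_i$ (per-vertex rotations), and $w_{ij}$ (edge weights, cotangent weights in [72]) are introduced here for illustration and are not notation from the paper:

```latex
E_{arap}(Y) = \sum_{i} \min_{R_i \in SO(3)} \sum_{j \in \mathcal{N}(i)}
    w_{ij} \, \big\| (y_i - y_j) - R_i \, (t_i - t_j) \big\|^2 .
```

Under this reading, the data term pulls $Y$ toward the matched source positions $\Pi X$, while $E_{arap}$ penalizes local deviations from rigidity relative to the target mesh.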

As our matching is nearly perfect, the resulting reconstruction, shown in Figure 10, is visually better than those given by other approaches, up to some artifacts due to our matching being computed on the first 30 eigenfunctions only. The capabilities of our model can also be extended to the reconstruction of input meshes in new topologies.

Refer to caption
Figure 10: Reconstruction of source shape using different approaches. For our reconstruction, we solve for the closest vertex positions to the matched shape minimizing the ARAP energy from the target shape.

16 Importance of Mask Regularization

Mask regularization plays a key role in most (deep) functional map pipelines [4]. We run a simple experiment to show that the functional map space is particularly well-suited for this type of penalty. Multiplying a ground truth functional map matrix $C \in \mathcal{M}_n(\mathbb{R})$ by any scalar $0 < \lambda < 1$ yields approximately the same pointwise correspondence as the original map $C$. We observed the same phenomenon after applying Zoomout [48], where the obtained correspondences are the same. This phenomenon is illustrated in Figure 11.

Refer to caption
Figure 11: Given a ground truth functional map $C \in \mathcal{M}_n(\mathbb{R})$ and a scalar $0 < \lambda < 1$, the matrix $\lambda C$ represents approximately the same pointwise correspondence as $C$. By applying Zoomout to both $C$ and $\lambda C$, we obtain the same map. The observation does not always hold when $\lambda > 1$. We plot the geodesic errors of $\lambda C$ for different values of $\lambda$.

As most masks are sparsity-promoting masks, their mask penalties admit multiple minimizers of the form $\lambda X$, where $X$ is any solution. As we observed, optimized maps can be proportional to the ground truth solution and still output a correct pointwise correspondence. Based on this insight and the efficiency of mask regularization in functional map computation, we proposed to distill the knowledge of our trained diffusion model by extracting a sign-agnostic mask that promotes structures seen in the training set.
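To make the scaling argument concrete, suppose the mask penalty takes the common form of a squared Frobenius norm of the masked map, with $M$ a fixed non-negative mask and $\odot$ the entrywise product. This form is an assumption made here for illustration, since the text above does not fix a specific penalty. The penalty is then homogeneous of degree two:

```latex
E_{mask}(\lambda C) = \| M \odot (\lambda C) \|_F^2
                    = \lambda^2 \, \| M \odot C \|_F^2
                    = \lambda^2 \, E_{mask}(C).
```

Hence any $X$ with $E_{mask}(X) = 0$ also satisfies $E_{mask}(\lambda X) = 0$, and for $0 < \lambda < 1$ shrinking a map never increases the penalty. The mask term alone therefore cannot distinguish a map from its rescaled versions, which is harmless as long as the rescaled map still yields the same pointwise correspondence.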

17 Computation Time

A single run of DiffuMatch takes approximately 150 seconds on an NVIDIA L40S GPU. When computation time is a bottleneck, the Ini (feature extractor with random weights) + Zoomout scenario with our distilled mask is a competitive alternative, as it requires little computation time. We provide a comparison of computation time with other competing methods in Tab. 3.

| Method | Computation time |
| --- | --- |
| 3D-CODED | 160 s |
| Neural Jacobian Fields | 3.26 s |
| Simplified Fmaps | 1.08 s |
| SNK | 130 s |
| Ini + Zoomout (our mask) | 0.75 s |
| Ours full | 150 s |

Table 3: Computation costs for different methods.
Refer to caption
Figure 12: DiffuMatch result on a cactus pair.
Refer to caption
Figure 13: Partial matching results on SHREC16.

18 Generalization

Non-articulated shapes.  We show that DiffuMatch can perform well on a pair of cactus meshes in Fig. 12.

Partial shape matching.  We show some partial matching results in Fig. 13. DiffuMatch can work on pairs where the partiality is moderate. However, when the partiality becomes significant, DiffuMatch is prone to failure, with errors of 19.8 and 23.4 on the SHREC16 cuts and holes partial shape matching challenges [62], respectively. This is to be expected, as functional maps have a different structure between full and partial correspondence [62], and methods applied to partial shape matching often rely on modified losses [14, 15] or require feature pre-training [15].