\affiliations

1 Faculty of Electrical Engineering, Czech Technical University in Prague, {firstname.lastname}@cvut.cz
2 Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague
3 NAVER LABS Europe, {firstname.lastname}@naverlabs.com
\contributions \websiteref \website \teaserfig[Uncaptioned image] \teasercaption We extract 2D feature ${\bf F}^{\text{2D}}$ or segmentation ${\bf S}^{\text{2D}}$ maps from a query image with unknown pose. We estimate the pose by aligning the maps with feature ${\bf F}^{\text{3D}}$ or segmentation ${\bf S}^{\text{3D}}$ maps rendered from our Gaussian Splatting Feature Fields (GSFFs), which are learned in a self-supervised way. Using segmentations instead of features makes our localization pipeline privacy-preserving.

Gaussian Splatting Feature Fields for (Privacy-Preserving) Visual Localization

Abstract

Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. We localize via pose refinement, which aligns either feature maps or segmentations from a query image with those rendered from the GSFFs scene representation. The resulting privacy- and non-privacy-preserving localization pipelines, evaluated on multiple real-world datasets, show state-of-the-art performance.

1 Introduction

Visual localization (VL), a core part of self-driving cars [27] and autonomous robots [38], is the task of estimating the 6DoF camera pose from which an image was captured.

VL methods can be distinguished by how they represent scenes and how they estimate the pose of a query image w.r.t. the scene representation. Popular choices for the former include 3D Structure-from-Motion (SfM) point clouds [29, 60, 24, 30, 61], databases of images with known intrinsic and extrinsic parameters [97, 2, 93, 63], the weights of neural networks [33, 7, 48, 4, 6], or dense renderable representations such as meshes [54, 74], neural radiance fields [49, 14, 56, 99], or 3D Gaussian Splatting [41, 68, 90, 3]. Common VL approaches are based on establishing 2D-3D correspondences via feature matching [60, 71, 54, 49, 99] or scene coordinate regression [4, 6], relative pose estimation from feature matches [97, 96, 2, 93] or regression [50, 97], absolute pose regression via neural networks [78, 12, 13, 48], or feature-based pose refinement [74, 61, 55] of an initial pose estimate. While local feature matching-based approaches provide the most accurate camera pose estimates, scale to large scenes [62, 30], handle changing conditions [72] and fit onto mobile devices [44, 38], they suffer from potential privacy-related issues as image details can be recovered from the feature descriptors [58, 9].

Since VL systems are often deployed through cloud-based solutions, preserving the privacy of user-uploaded images and scenes is a critical aspect [70, 69]. An interesting approach to both query and scene privacy is SegLoc [55], which is a pose refinement-based method that represents the scene as a sparse SfM point cloud. Most refinement-based methods project features associated with the scene geometry into the query image [61, 76, 77, 56, 49, 74]. An initial pose estimate, e.g., obtained via image retrieval, is then optimized by aligning the projected features with features extracted from the query image. To increase privacy while decreasing memory requirements for storing high-dimensional features, SegLoc [55] proposes to quantize the features into integer values, which is equivalent to assigning segmentation labels to pixels in the query image and the 3D scene points. These quantized representations lead to better privacy preservation: only coarse image information without any details can be recovered from both 2D segmentation images and 3D point clouds with associated labels. Final poses are refined by maximizing label consistency between the projected points and pixel labels.

The segmentations used by SegLoc are learned purely in 2D, providing no guarantee that the predicted labels are consistent between viewpoints. In contrast, [56] jointly trains a dense scene representation (in the form of a NeRF) together with an implicit feature field, ensuring multi-view consistency and state-of-the-art feature-based pose refinement results. Yet, [56] is not privacy-preserving. In this work, we investigate learning a feature field that can be used for privacy-preserving visual localization jointly with a dense scene representation. We choose 3D Gaussian Splatting (3DGS) [34] due to its fast rendering time, which makes it particularly suited for dense pose refinement. Compared to SegLoc, it allows us to ground representation learning in 3D. Additionally, the explicit and finite nature of 3DGS is better suited for privacy-preserving localization than NeRFs as, similarly to [55], we can simply quantize the features associated with the Gaussians.

Some prior [3] and concurrent works [41, 68, 90] also use 3DGS in the context of visual localization. However, their pipelines use matching-based solutions, whereas we take advantage of the ability to densely render from any viewpoint within the scene to perform feature-metric or segmentation-based pose refinement. We explicitly backpropagate through the rasterizer on the $\mathfrak{se}(3)$ Lie algebra, which yields more accurate pose estimates. Furthermore, instead of relying on pre-trained features as [74], we learn our features in a self-supervised manner, defining a Gaussian Splatting Feature Field (GSFFs), which associates 3D Gaussians with volumetric features extracted with a kernel-based encoding based on the covariance of the 3D Gaussians. These features are rendered and aligned to features provided by the 2D encoder (which is jointly trained with the GSFFs) through contrastive losses. We apply further regularization by leveraging the geometry of the 3DGS model and by spatially clustering the GSFFs. These clusters enable converting features into segmentations that can be used to perform effective privacy-preserving pose refinement (cf. the teaser figure).

To summarize, our contributions are threefold: 1) We introduce Gaussian Splatting Feature Fields (GSFFs), a novel representation for VL, jointly learnt with the image feature encoder, enabling pose refinement by aligning rendered and extracted features. 2) The explicit nature of the 3D representation in GSFFs allows us to spatially cluster the Gaussian cloud and to segment the feature field based on cluster centers, yielding a privacy-preserving 3D scene representation. Similar to [55], the resulting method is based on aligning discrete segmentation labels, thus leading to similar privacy-preserving properties. 3) We demonstrate the accuracy of both GSFFs-based localization pipelines (based on features and on segmentations, respectively) on multiple real-world datasets.

Our approaches outperform prior and concurrent work for privacy- and non-privacy-preserving visual localization.

2 Related Works

Gaussian Splatting. 3D Gaussian Splatting [34] represents a scene as a set of Gaussian primitives and renders images by rasterizing Gaussians using depth ordering and alpha-blending, thus allowing efficient training and high-resolution real-time rendering. Thanks to these properties, 3D Gaussians have been applied to numerous tasks including SLAM [32, 46], dynamic scene modeling [43, 84], scene segmentation [31, 37], and surface reconstruction [28, 26, 89]. Extensions have tackled anti-aliasing [88], sparse view setups [82, 87], removing reliance on estimated poses [85], and regularization [16, 15, 94]. In this paper, we adopt the Gaussian Opacity Fields model [89] as our 3D representation, which integrates the anti-aliasing module from [88] and offers highly accurate geometry through regularization, critical for visual localization.

Camera pose refinement-based visual localization. Pose refinement approaches minimize, with respect to the pose, the difference between the query image and a rendering obtained by projecting the scene representation into the image using the current pose estimate [1, 21, 22, 65, 64]. The pose is iteratively refined from an initial estimate. L2 distances between deep features are often used to measure the differences between the query image and the projected features from the scene [76, 77, 61, 25, 40, 83]. [74] integrates pre-trained features with a particle filter, thus removing the need to learn representations per scene. [55] replaces deep features with segmentation labels and aligns them with 2D segmentation maps. In contrast, our GSFFs Pose Refinement pipeline (GSFFs-PR) associates a deep feature or a segmentation label to each 3D Gaussian primitive and performs pose refinement by 1) rasterizing these quantities and 2) explicitly backpropagating the feature-metric or segmentation errors through the rasterizer with respect to the camera pose.

Rendering-based visual localization. Leveraging the capability of novel view synthesis of NeRFs, iNeRF [86, 39] is the first to perform pose refinement through photometric alignment, where the optimization is performed by direct gradient descent back-propagation through the neural field. [45, 39, 74] parallelize this process with faster neural radiance fields or particle filters. NeRF-based representations have also been used to match [49, 99] or align [14, 56] features rendered from an implicit field.

Multiple visual localization works have emerged leveraging the efficient 3DGS models instead of NeRFs. [3] matches rays emerging from ellipsoids with pixels through an attention mechanism and estimates the pose with a closed-form solution. [41] performs matching between the query and a rendered image, lifting the matches to 3D with the rendered depth to refine the pose by solving a PnP problem. Similar to NeRF-based approaches, [68, 90, 91] distill pretrained features into a Gaussian-based 3D feature field. These methods estimate poses by matching these features followed by PnP+RANSAC. In contrast, we do not rely on pretrained features but learn our representations in a self-supervised manner. Furthermore, we solely estimate query poses through feature-metric refinement by explicitly backpropagating through the rasterizer, which yields higher accuracy even without an additional PnP+RANSAC step. Finally, contrary to concurrent works, our 3D features take scale into account.

Privacy-preserving visual localization. Pittaluga et al. [58] have shown that detailed and recognizable images of a scene can be obtained from sparse 3D point clouds associated with local descriptors. While [70, 69, 66] tried to address this by transforming 3D point clouds into 3D line clouds, it was shown in [8, 36] that these line clouds still preserve a significant amount of information about the scene geometry that can be used to recover 3D point clouds, hence enabling the inversion attack [58]. Instead, [36] lifts pairs of points to lines and [47] relies on a few spatial anchors, resulting in 3D lines with non-uniformly sampled distributions. Another strategy is to swap coordinates between random pairs of points [53] or to perform partial pose estimation against distributed partial maps [23], but as shown in [9], the point positions may be recovered by identifying point neighborhoods from co-occurring descriptors.

Another line of work studies privacy at the feature representation level. [20, 57] embed descriptors in subspaces designed to withstand different privacy attacks. Closer to our work, [98, 79, 55] remove the need for high-dimensional features. [98, 79] match keypoints and 3D points without descriptors and [55] replaces high-dimensional descriptors with segmentation labels. Similar to [55], we also rely on labels to ensure privacy, but our dense label rendering through 3DGS achieves better localization accuracy than [55], which relies on reprojected sparse labels.

Figure 1: Training pipeline. We extract a feature map ${\bf F}^{\text{2D}}$ and segmentation map ${\bf S}^{\text{2D}}$ from a training image with known pose (left). For each 3D Gaussian ${\mathcal{G}}_{i}$ (right), a scale-aware feature ${\bf g}_{i}$ is extracted from a triplane representation. Spectral clustering is applied on the Delaunay graph derived from the Gaussian cloud, yielding a set of prototypes $P$. A label is associated with each 3D Gaussian by assigning the volumetric features to the prototypes. The features and labels are then rendered to obtain the feature map ${\bf F}^{\text{3D}}$ and the segmentation map ${\bf S}^{\text{3D}}$, which are respectively aligned with their encoder counterparts ${\bf F}^{\text{2D}}$ and ${\bf S}^{\text{2D}}$ through $L_{NCE}$, $L_{PRO}$ and $L_{CE}$.

3 3DGS Feature Fields (GSFFs)

First, in Section 3.1, we introduce our Gaussian Splatting Feature Field (GSFFs), which associates a scale-aware feature to each 3D Gaussian. We describe how to jointly optimize it along with a 2D feature extractor by aligning rendered and 2D extracted features in a self-supervised manner. In Section 3.2, we propose to further improve the features’ discriminativeness and their 3D awareness by clustering the 3D Gaussian cloud and encouraging both features in a pixel-aligned pair of 2D/3D features to be close to the same prototype through an auxiliary contrastive loss. These prototypes not only help distill spatial information into the feature space, but also allow for a smooth transition from feature maps to segmentation maps, enabling privacy-preserving localization (as described in Section 4). Finally, in Section 3.3, we describe the GSFFs-PR Feature visual localization pipeline, which is based on pose refinement through feature alignment. Its extension, the privacy-preserving GSFFs-PR Privacy visual localization pipeline based on segmentations, is described in Section 4. We show an overview of the architecture and training pipeline in Fig. 1.

3.1 Scene Representation

We adopt the Gaussian Opacity Fields model [89] as our Gaussian Field representation, where a scene is composed of a set of 3D Gaussian primitives parametrized by their center, scaling and rotation matrices, opacity and spherical harmonics coefficients. Given a ray $r$ emanating from a camera pose $P$, [89] finds the intersection between the ray and 3D Gaussians and computes the contribution $C_{i}(r,P)$ of each Gaussian ${\mathcal{G}}_{i}$ traversed by the ray. The Gaussians are then ordered based on depth and the pixel’s color is obtained by alpha blending

\begin{equation}
c(r,P)=\sum_{i=1}^{N}c_{i}\,\alpha_{i}\,C_{i}(r,P)\prod_{j=1}^{i-1}\bigl(1-\alpha_{j}C_{j}(r,P)\bigr), \tag{1}
\end{equation}

where $c_{i}$ is the view-dependent color, modeled with spherical harmonics associated with ${\mathcal{G}}_{i}$, and $\alpha_{i}$ are the blending weights. The ray-Gaussian intersection formulation allows for depth distortion and normal consistency regularization [89], which improves the geometry of the 3D scene.
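The alpha blending of Eq. 1 can be illustrated for a single ray with a minimal NumPy sketch. It assumes the Gaussians hit by the ray are already sorted from near to far and that the contributions $C_i(r,P)$ are given; the function name `composite_ray` and its arguments are hypothetical and not part of [89]. Passing per-Gaussian features instead of colors yields the rendered feature for that ray.

```python
import numpy as np

def composite_ray(values, alphas, contributions):
    """Front-to-back alpha blending of per-Gaussian values along one ray.

    values:        (N, C) per-Gaussian colors or features, sorted near to far
    alphas:        (N,)   blending weights alpha_i
    contributions: (N,)   ray-Gaussian contributions C_i(r, P) in [0, 1]
    """
    values = np.asarray(values, dtype=float)
    w = np.asarray(alphas, dtype=float) * np.asarray(contributions, dtype=float)
    # Transmittance before Gaussian i: prod_{j < i} (1 - alpha_j * C_j)
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - w)[:-1]))
    weights = w * transmittance
    return weights @ values, weights

# Two Gaussians: a red one in front of a green one.
color, weights = composite_ray(
    values=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    alphas=[0.5, 1.0],
    contributions=[1.0, 0.5],
)
```

The second Gaussian only receives the light that the first one lets through, which is why its weight is scaled by the remaining transmittance.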

3D feature fields. Similar to Eq. 1, features or segmentation labels can be rendered through the same alpha blending by simply replacing each Gaussian’s color by the feature or segmentation label assigned to the Gaussian. However, this requires that we assign to each Gaussian ${\mathcal{G}}_{i}$ a feature, which in general might be extremely costly, especially for high-dimensional features. To avoid this, instead of associating independent features to each Gaussian, we introduce a feature field parametrized by a triplane grid [10]. It is centered at the origin of the world coordinate space and is composed of three two-dimensional orthogonal planes $H_{xy},H_{xz},H_{yz}\in\mathbb{R}^{R\times R}$, where $R$ is the resolution of the grid. For simplicity, we will also denote by $H_{xy},H_{xz},H_{yz}\in\mathbb{R}^{R\times R\times D}$ (and use them interchangeably) the three corresponding feature tensors, where $D$ is the dimensionality of the feature space we aim to learn.

To compute the volumetric feature ${\bf g}_{i}$ associated with a 3D Gaussian ${\mathcal{G}}_{i}$ from the triplane, we proceed as follows. We project the 3D Gaussian ${\mathcal{G}}_{i}$ onto the three planes and compute for the three resulting 2D Gaussians ${\mathcal{G}}^{xy}_{i},{\mathcal{G}}^{xz}_{i},{\mathcal{G}}^{yz}_{i}$ three corresponding features ${\bf g}^{xy}_{i},{\bf g}^{xz}_{i},{\bf g}^{yz}_{i}$ by applying an RBF kernel parametrized by the 2D Gaussians (as detailed in the Appendix). The three grid features are averaged, yielding the volumetric feature ${\bf g}^{\text{3D}}_{i}$ associated with the Gaussian ${\mathcal{G}}_{i}$. This representation, called Gaussian Splatting Feature Field (GSFFs), enables the features to be scale-aware, as 3D Gaussians with large spatial span will aggregate feature information over a large area within the grid, and small Gaussians over smaller areas.
Further advantages of the GSFFs representation are: 1) they allow sharing information between Gaussians based on overlapping projections onto the planes, as the Gaussian features are optimized interdependently through rasterization, and 2) the field can be queried from any 3D position, making it suitable for 3DGS splitting and merging mechanisms.
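The scale-aware triplane lookup can be sketched as follows. The exact kernel is given in the Appendix and is not reproduced here: this sketch assumes a normalized Gaussian (RBF) weighting of grid cells, a grid spanning $[-\text{extent}, \text{extent}]^2$ on each plane, and an orthogonal projection obtained by taking the marginal of the 3D Gaussian. The names `plane_feature`, `gaussian_feature` and `extent` are hypothetical.

```python
import numpy as np

def plane_feature(H, mean2d, cov2d, extent=1.0):
    """RBF-weighted average of grid features H (R, R, D) under a 2D Gaussian."""
    R = H.shape[0]
    # Cell centers of a grid assumed to span [-extent, extent]^2 around the origin.
    centers = (np.arange(R) + 0.5) / R * 2.0 * extent - extent
    gx, gy = np.meshgrid(centers, centers, indexing="ij")
    diff = np.stack([gx, gy], axis=-1) - mean2d              # (R, R, 2)
    maha = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov2d), diff)
    w = np.exp(-0.5 * maha)
    w = w / w.sum()   # normalized kernel; assumes the Gaussian overlaps the grid
    return np.einsum("xy,xyd->d", w, H)

def gaussian_feature(planes, mean3d, cov3d, extent=1.0):
    """Average the three plane features of one 3D Gaussian."""
    feats = []
    for name, axes in (("xy", [0, 1]), ("xz", [0, 2]), ("yz", [1, 2])):
        # Orthogonal projection onto an axis-aligned plane = marginal Gaussian.
        mu = mean3d[axes]
        cov = cov3d[np.ix_(axes, axes)]
        feats.append(plane_feature(planes[name], mu, cov, extent))
    return np.mean(feats, axis=0)

# A plane whose single feature channel equals the x cell index (0..7).
R = 8
H = np.zeros((R, R, 1))
H[:, :, 0] = np.arange(R)[:, None]
x6 = (6 + 0.5) / R * 2.0 - 1.0                      # center of cell 6 along x
small = plane_feature(H, np.array([x6, 0.0]), 1e-4 * np.eye(2))
large = plane_feature(H, np.array([x6, 0.0]), 100.0 * np.eye(2))
```

A small Gaussian reads out the feature of the cell it sits on, while a large Gaussian at the same position pools over nearly the whole plane, which is the scale-aware behavior described above.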

Self-supervised training. Our aim is to perform pose refinement based on aligning pixel-level 2D encoded features from the image with 3D rendered features. Hence, it is important for these features to be locally discriminative and robust to viewpoint changes. Formally, we want to learn the GSFFs feature field jointly with a 2D encoder such that when we render, at pose $P$, a feature map ${\bf F}^{\text{3D}}$ from the feature field, it is aligned with the corresponding 2D feature map ${\bf F}^{\text{2D}}$ of the image $I$ associated with the pose $P$. The rendered feature map ${\bf F}^{\text{3D}}$ is obtained with alpha blending similarly to Eq. 1, where for each pixel $u$, we replace $c_{i}$ with ${\bf g}^{\text{3D}}_{i}$. To align ${\bf F}^{\text{3D}}$ and ${\bf F}^{\text{2D}}$, we train the model in a self-supervised manner with the contrastive loss [51]:

\begin{equation}
L_{NCE}=-\frac{1}{2HW}\sum_{u\in I}\log\left(\frac{\exp\left({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{u}/\tau\right)^{2}}{A}\right), \tag{2}
\end{equation}

where $\tau$ is the temperature parameter, and $A$ is a normalizing factor (detailed in the Appendix).
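To give an intuition for this kind of pixel-level contrastive objective, here is a minimal NumPy sketch of a generic symmetric InfoNCE loss. It is not the exact loss of Eq. 2: the normalizing factor $A$ is defined in the Appendix and is not reproduced, so the sketch assumes that the negatives of a pixel are all other pixels of the same image and that features are L2-normalized. The function names and the default `tau` are hypothetical.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(f3d, f2d, tau=0.1):
    """f3d, f2d: (M, D) pixel-aligned features; row u of each is a positive pair."""
    f3d = f3d / np.linalg.norm(f3d, axis=1, keepdims=True)
    f2d = f2d / np.linalg.norm(f2d, axis=1, keepdims=True)
    logits = f3d @ f2d.T / tau                       # (M, M) similarities
    # Positives lie on the diagonal; all other pixels act as negatives (assumption).
    loss_3d_to_2d = -np.diag(log_softmax(logits, axis=1))
    loss_2d_to_3d = -np.diag(log_softmax(logits, axis=0))
    return (loss_3d_to_2d.sum() + loss_2d_to_3d.sum()) / (2 * len(f3d))

feats = np.eye(4)                                    # four distinct pixel features
aligned = symmetric_info_nce(feats, feats)
misaligned = symmetric_info_nce(feats, np.roll(feats, 1, axis=0))
```

The loss is close to zero when rendered and encoded features agree pixel by pixel, and grows when a pixel's rendered feature is more similar to the encoded feature of a different pixel.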

Figure 2: First and third rows: original image, encoder feature maps ${\bf F}^{\text{2D}}$ (second and fourth columns) and rendered feature maps ${\bf F}^{\text{3D}}$ (third and fifth columns), at coarse level (second and third columns) and fine level (fourth and fifth columns). Second and fourth rows: the corresponding rendered image, encoder segmentation maps ${\bf S}^{\text{2D}}$ (second and fourth columns) and rendered segmentation maps ${\bf S}^{\text{3D}}$ (third and fifth columns), at coarse and fine level respectively. Features are visualized through PCA projection.

3.2 Prototypical Feature Regularization

To better align the features and prepare the transition from features to segmentations, we propose to structure the feature space around a set of classes. To that end, we cluster the feature field, derive 3D prototypes, and apply an auxiliary contrastive loss to enforce that corresponding pixel-aligned 2D extracted and 3D rendered features are close to the same prototypes in the feature space.

Spatial prototypes. We want to find a set of feature prototypes that best encode the spatial prior reflecting the Gaussian cloud structure. One option would be to apply spectral clustering on a matrix containing pairwise distances between Gaussian centers. However, given the high number of Gaussians, this would quickly become intractable. Therefore, we start by applying a Delaunay triangulation [18] of the Gaussian centers, which yields a graph that already captures local geometric information. Moreover, the eigenvalues and eigenvectors can now be computed from the Laplacian of the sparse adjacency matrix derived from the graph. The resulting spectral embeddings (the rows of the eigenvector matrix, one for each Gaussian ${\mathcal{G}}_{i}$) are clustered into $K$ groups, which assigns each 3D Gaussian ${\mathcal{G}}_{i}$ to a cluster $k$. To build the representative feature (cluster prototype) ${\bf p}_{k}$ for each cluster, we simply average the volumetric features ${\bf g}_{i}$ of all Gaussians belonging to cluster $k$.
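The clustering and prototype construction can be sketched on a toy graph. The sketch makes several simplifying assumptions that are not claims about the actual implementation: the edge list is given (in practice it would come from a Delaunay triangulation of the Gaussian centers, e.g. via `scipy.spatial.Delaunay`), the Laplacian is unnormalized and dense (a sparse eigensolver such as `scipy.sparse.linalg.eigsh` would be needed at scale), and the grouping uses a small deterministic k-means. All function names are hypothetical.

```python
import numpy as np

def spectral_labels(num_nodes, edges, k, iters=20):
    """Cluster graph nodes with unnormalized spectral clustering + k-means."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A                   # graph Laplacian
    _, vecs = np.linalg.eigh(L)                      # eigenvalues in ascending order
    emb = vecs[:, :k]                                # one k-dim embedding per node
    # Deterministic farthest-point initialization, then Lloyd iterations.
    centers = [emb[0]]
    for _ in range(1, k):
        d = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        d = ((emb[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = emb[labels == c].mean(axis=0)
    return labels

def cluster_prototypes(features, labels, k):
    """Prototype p_k = mean feature of the nodes assigned to cluster k
    (assumes every cluster is non-empty)."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(k)])

# Two triangles joined by a single edge: the natural split is {0,1,2} vs {3,4,5}.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = spectral_labels(6, edges, k=2)
protos = cluster_prototypes(np.arange(6, dtype=float)[:, None], labels, k=2)
```

Because only graph connectivity enters the Laplacian, the clusters follow the spatial structure of the point cloud rather than the learned features; the features enter only when prototypes are averaged.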

Prototypical loss. These prototypes implicitly define "classes" and we use them to maximize intra-class compactness and inter-class separability within the feature space and infuse spatial priors. Given a batch of $N$ pairs of pixel-aligned rendered/encoder features $\{{\bf F}^{\text{3D}}_{u},{\bf F}^{\text{2D}}_{u}\}$, we associate one prototype ${\bf p}_{k}$ per feature pair and want to enforce both features to be close to the prototype. We thus add the following prototypical contrastive loss:

\begin{equation}
L_{PRO}=-\frac{1}{N}\sum_{n=1}^{N}\log\left(\frac{\exp\left(({\bf F}^{\text{3D}}_{n}{\bf p}_{n}^{t}+{\bf F}^{\text{2D}}_{n}{\bf p}_{n}^{t})/\tau\right)}{B}\right), \tag{3}
\end{equation}

where ${\bf F}^{\text{2D}}_{n},{\bf F}^{\text{3D}}_{n}$ are the features corresponding to the pixel $u_{n}$, ${\bf p}_{n}$ is the prototype assigned to them, $\tau$ is a temperature parameter and $B$ is a normalizing factor (detailed in the Appendix along with derivations). To compute the associations over the batch of feature pairs, we use an optimal transport procedure based on the Sinkhorn-Knopp algorithm [17] (see the Appendix for details).
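As an illustration of the assignment step, the following NumPy sketch runs Sinkhorn-Knopp iterations on a feature-to-prototype score matrix. It follows the common balanced-assignment formulation, in which every prototype is encouraged to receive the same share of the batch; the exact variant used here is described in the Appendix and may differ. The function name, `eps` and `iters` are hypothetical.

```python
import numpy as np

def sinkhorn_assign(scores, eps=0.05, iters=50):
    """Soft-assign N features to K prototypes with balanced prototype usage.

    scores: (N, K) similarities between features and prototypes.
    Returns Q of shape (N, K): rows sum to 1, columns sum to about N / K.
    """
    N, K = scores.shape
    Q = np.exp((scores - scores.max()) / eps)        # shift for numerical stability
    Q /= Q.sum()
    for _ in range(iters):
        Q /= Q.sum(axis=0, keepdims=True) * K        # each prototype gets mass 1/K
        Q /= Q.sum(axis=1, keepdims=True) * N        # each feature gets mass 1/N
    return Q * N

scores = np.array([[1.0, 0.0],
                   [0.9, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.8]])
Q = sinkhorn_assign(scores)
assigned = Q.argmax(axis=1)                          # hard prototype index per pair
```

The hard assignment `assigned` plays the role of the index selecting ${\bf p}_{n}$ for each feature pair; the balancing constraint discourages the degenerate solution where all pairs collapse onto one prototype.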

Multi-view consistency. The rendering depends on the incidence of the ray traversing the 3D Gaussians. Hence, rendered features may vary across viewing angles. In order to encourage multi-view consistency of both the encoder and the feature field, and to make our features generalizable to out-of-distribution views, we align features belonging to different views but associated with the same 3D points. To that end, given a pixel correspondence $(u,v)$ between two images $I$ and $\widehat{I}$ in the scene (along with their encoder/rendered features $\{{\bf F}^{\text{3D}},{\bf F}^{\text{2D}}\}$, $\{\widehat{{\bf F}}^{\text{3D}},\widehat{{\bf F}}^{\text{2D}}\}$), we randomly replace in $L_{NCE}$ and $L_{PRO}$ a subset of the pixel-aligned feature pairs $\{{\bf F}^{\text{3D}}_{u},{\bf F}^{\text{2D}}_{u}\}$ by $\{\widehat{{\bf F}}^{\text{3D}}_{v},{\bf F}^{\text{2D}}_{u}\}$ or $\{{\bf F}^{\text{3D}}_{u},\widehat{{\bf F}}^{\text{2D}}_{v}\}$. To extract correspondences, given a training image $I$ with pose $P$ and rendered depth $D$, we generate a random pose $\widehat{P}$ in the vicinity of $P$. We render the image $\widehat{I}$, depth $\widehat{D}$, and features $\widehat{{\bf F}}^{\text{3D}}$ from pose $\widehat{P}$ and extract the 2D feature map $\widehat{{\bf F}}^{\text{2D}}$ from the rendered image. Then, for each pixel $u$ in a randomly sampled set of pixels of $I$, we backproject $u$ into 3D using $D$ and re-project it into $\widehat{I}$, yielding the pixel $v$ in the rendered image. Then, we backproject $v$ into 3D using $\widehat{D}$ and re-project it into $I$. If the pixel distance between $u$ and the reprojected $v$ falls within a small threshold, $(u,v)$ is considered a valid correspondence; otherwise it is rejected.
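The forward-backward check used to validate correspondences can be sketched as follows. The sketch assumes a shared pinhole intrinsic matrix `K`, poses given as world-to-camera `(R, t)` pairs, depth maps storing z-depth, and a nearest-pixel depth lookup; these conventions, the function names and the default threshold are illustrative assumptions rather than details from the method.

```python
import numpy as np

def backproject(uv, depth, K, R, t):
    """Pixel (x, y) with z-depth -> 3D world point; (R, t) maps world to camera."""
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    return R.T @ (depth * ray - t)

def project(X, K, R, t):
    """3D world point -> pixel (x, y) and z-depth in the camera (R, t)."""
    x_cam = R @ X + t
    uv = (K @ x_cam)[:2] / x_cam[2]
    return uv, x_cam[2]

def cycle_consistent(u, depth_a, depth_b, K, pose_a, pose_b, thresh=1.0):
    """Return (v, ok): the match of u in view B and whether the round trip closes."""
    H, W = depth_b.shape
    X = backproject(u, depth_a[u[1], u[0]], K, *pose_a)
    v, z = project(X, K, *pose_b)
    vi = np.round(v).astype(int)
    if z <= 0 or not (0 <= vi[0] < W and 0 <= vi[1] < H):
        return v, False                      # behind the camera or outside image B
    X_back = backproject(v, depth_b[vi[1], vi[0]], K, *pose_b)
    u_back, _ = project(X_back, K, *pose_a)
    return v, bool(np.linalg.norm(u_back - np.asarray(u)) <= thresh)

# A wall at z = 2 seen by two cameras that differ by a 10 cm sideways shift.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
pose_a = (np.eye(3), np.zeros(3))
pose_b = (np.eye(3), np.array([-0.1, 0.0, 0.0]))
depth_a = np.full((64, 64), 2.0)
depth_b = np.full((64, 64), 2.0)
v, ok = cycle_consistent((32, 32), depth_a, depth_b, K, pose_a, pose_b)

# An occluder in view B at the matched pixel breaks the round trip.
depth_b_occluded = depth_b.copy()
depth_b_occluded[32, 27] = 1.0
_, ok_occluded = cycle_consistent((32, 32), depth_a, depth_b_occluded, K, pose_a, pose_b)
```

The check rejects pixels whose 3D point is occluded or poorly reconstructed in the second view, since the depth read at $v$ then no longer corresponds to the point seen at $u$.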

\begin{tabular}{llccccccc}
\hline
 & Model & Chess & Fire & Heads & Office & Pumpkin & Redkitchen & Stairs \\
\hline
SBM & HLoc [60] & 0.8/0.11/100 & 0.9/0.24/99 & 0.6/0.25/100 & 1.2/0.20/100 & 1.4/0.15/100 & 1.1/0.14/99 & 2.9/0.80/72 \\
 & DSAC* [4] & 0.5/0.17/100 & 0.8/0.28/99 & 0.5/0.34/100 & 1.2/0.34/98 & 1.2/0.28/99 & 0.7/0.21/97 & 2.7/0.78/92 \\
 & ACE [6] & 0.5/0.18 & 0.8/0.33 & 0.5/0.33 & 1.0/0.29 & 1/0.22 & 0.8/0.2 & 2.9/0.81 \\
 & ACE + GSLoc [41] & 0.5/0.15 & 0.6/0.25 & 0.4/0.28 & 0.9/0.26 & 1.0/0.23 & 0.7/0.17 & 1.4/0.42 \\
\hline
RBM & NeFeS [14] (DFNet [13]) & 2/0.79 & 2/0.78 & 2/1.36 & 2/0.60 & 2/0.63 & 2/0.62 & 5/1.31 \\
 & MCLoc [74] & 2/0.8 & 3/1.4 & 3/1.3 & 4/1.3 & 5/1.6 & 6/1.6 & 6/2.0 \\
 & SSL-Nif [56] (DV) & 1/0.22/93 & 0.8/0.28/91 & 0.8/0.49/71 & 1.7/0.41/81 & 1.5/0.34/86 & 2.1/0.41/75 & 6.5/0.63/49 \\
 & NeRFMatch [99] (DV) & 0.9/0.3 & 1.3/0.4 & 1.6/1.0 & 3.3/0.7 & 3.2/0.6 & 1.3/0.3 & 7.2/1.3 \\
 & GSplatLoc [68] (GSplatLoc) & 0.43/0.16 & 1.03/0.32 & 1.06/0.62 & 1.85/0.4 & 1.8/0.35 & 2.71/0.55 & 8.83/2.34 \\
 & GSLoc [41] (DFNet) & 0.7/0.20 & 0.9/0.32 & 0.6/0.36 & 1.2/0.32 & 1.3/0.31 & 0.9/0.25 & 2.2/0.61 \\
 & GSFFs-PR Feature (DV) & 0.4/0.19/95 & 0.6/0.26/98 & 0.5/0.36/94 & 1.0/0.31/97 & 1.3/0.38/85 & 0.6/0.23/93 & 25.1/0.63/32 \\
 & GSFFs-PR Privacy (DV) & 0.8/0.28/96 & 0.8/0.33/94 & 1.0/0.67/90 & 1.5/0.51/90 & 2.0/0.50/78 & 1.2/0.33/85 & 28.2/0.98/29 \\
\hline
\end{tabular}
Table 1: Localization results on 7Scenes using SfM-based pseudo GT poses from [5]. We compare our results with Structure-Based Methods (SBM) and Rendering-Based Methods (RBM). Median position error (cm) ($\downarrow$) / median rotation error (°) ($\downarrow$) / recall at 5cm/5° (\%) ($\uparrow$), when available. Bold are the best results, underline the second best.

3.3 Feature-based Localization (GSFFs-PR)

After training, the feature space along with the 3D Gaussians can be used for pose refinement by finding the pose that minimizes the feature-metric error between the 2D features extracted from the query image and the 3D features rendered from that pose. In particular, given a query image whose pose is unknown, we first extract its 2D feature map ${\bf F}^{\text{2D}}$ with our trained encoder. Second, we retrieve the closest database image with a global descriptor (our method is agnostic to the global descriptor used) and use the pose $P_{init}$ of the retrieved image as initialization. From $P_{init}$, we render the feature map ${\bf F}^{\text{3D}}$ from the GSFFs. Feature inconsistencies between the two feature maps are then iteratively minimized with respect to the pose:

\[
P^{*} = \operatorname*{arg\,min}_{P \in SE(3)} \left\| {\bf F}^{\text{2D}} - {\bf F}^{\text{3D}}(P,{\mathcal{G}}) \right\|_{2}^{2} \enspace, \tag{4}
\]

where the pose is parametrized on $SE(3)$ and updates are performed in the Lie algebra $\mathfrak{se}(3)$ by backpropagating explicitly through the rasterizer [46] (which we modify to support feature-metric refinement). At each refinement iteration, the pose is updated and the features are re-rendered from the updated pose. We minimize the objective of Eq. 4 until convergence.
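To make the pose parametrization concrete, the following NumPy sketch shows how a 6-dimensional update in $\mathfrak{se}(3)$ can be mapped back to $SE(3)$ and applied to the current pose. It illustrates only the update step: in the pipeline above, the update direction comes from gradients backpropagated through the rasterizer, which is not reproduced here. The ordering of the twist (translation part first), the left-multiplicative update, and the function names are assumptions made for this example.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ x equals the cross product w x x."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map from a twist xi = (rho, omega) in se(3) to a 4x4 matrix in SE(3)."""
    rho, omega = np.asarray(xi[:3], float), np.asarray(xi[3:], float)
    theta = np.linalg.norm(omega)
    W = hat(omega)
    if theta < 1e-8:
        # First-order approximation for very small rotations
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues' formula
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ rho
    return T

def update_pose(T, xi):
    """Apply a small se(3) update to the current pose (left-multiplicative convention)."""
    return se3_exp(xi) @ T
```

Working in the Lie algebra keeps the rotation part of the pose a valid rotation matrix after every step, without any re-orthogonalization.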

4 Privacy-Preserving GSFFs

Due to the clustering and the subsequent prototypical formulation adopted in GSFFs, we can seamlessly extend our model to be privacy-preserving. We follow the privacy definition from the literature [98, 55, 79], where privacy refers to the inability of an attacker to recover texture/color information and fine-level details. Similarly, following prior work, we consider that coarse geometric information does not violate privacy. Therefore, inspired by [55], we replace the high-dimensional features ${\bf g}_{i}$, which contain a lot of privacy-critical information, with a single segmentation label. After training, the feature field, prototypes, and color information are removed.

Optimizing the segmentations. Features can be converted to segmentations by soft-assigning them to the set of prototypes. Any Gaussian feature ${\bf g}_{i}$ or encoder feature ${\bf f}_{i}$ can be assigned to a prototype ${\bf p}_{k}$ according to the likelihood that the feature belongs to cluster $k$, defined as $l^{3D}_{ik}=\exp({\bf g}_{i}^{\top}{\bf p}_{k})/\sum_{k^{\prime}}\exp({\bf g}_{i}^{\top}{\bf p}_{k^{\prime}})$. The resulting scores form a pseudo-logits vector ${\bf l}_{i}$ that can be rendered with alpha-blending (cf. Eq. 1), where the color is replaced by the scores. Similarly, encoder features can be soft-assigned via $l^{2D}_{ik}=\exp({\bf f}_{i}^{\top}{\bf p}_{k})/\sum_{k^{\prime}}\exp({\bf f}_{i}^{\top}{\bf p}_{k^{\prime}})$.
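The soft assignment is a softmax over feature-prototype dot products and can be sketched in a few lines of NumPy. The function name and array shapes are assumptions for illustration; the same routine applies to Gaussian features ${\bf g}_{i}$ and encoder features ${\bf f}_{i}$.

```python
import numpy as np

def soft_assign(features, prototypes):
    """Soft-assign features (N, d) to prototypes (K, d); returns (N, K), rows sum to 1."""
    logits = features @ prototypes.T                      # dot products g_i^T p_k
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability only
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)
```

Subtracting the row-wise maximum does not change the result, since the softmax is invariant to a constant shift of its inputs.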

While the prototypical learning framework induces relatively well-aligned encoder/rendered soft assignments, the localization accuracy can be further improved by learning to directly predict the 2D segmentations. Therefore, during training, we add a shallow classification head on top of the encoded features that learns to output the segmentation map ${\bf S}^{\text{2D}}$ directly from the images. To learn the segmentation head and to further refine our features in a self-supervised way, we add the following cross-entropy loss to Eq. 2 and Eq. 3:

\[
L_{CE} = -\sum_{u \in I} {\mathbbm{1}}_{u} \cdot \left( \log({\bf S}^{\text{2D}}_{u}) + \log({\bf S}^{\text{3D}}_{u}) \right) \enspace, \tag{5}
\]

where ${\bf S}^{\text{3D}}$ denotes the segmentation map obtained from the rendered pseudo-logits $l^{3D}_{ik}$ and ${\mathbbm{1}}_{u}$ is the one-hot vector corresponding to the prototype label associated with pixel $u$. We use the exact same associations as in the prototypical contrastive loss, thereby encouraging consistency between the features and the segmentations.
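A minimal NumPy sketch of the loss in Eq. 5 is given below. It assumes that both segmentation maps are already flattened to per-pixel class probabilities and that the prototype labels are given as integer indices. The function name and the small constant `eps`, added to avoid taking the logarithm of zero, are not part of the original formulation.

```python
import numpy as np

def segmentation_ce(S2d, S3d, labels, eps=1e-12):
    """Cross-entropy of Eq. 5, summed over pixels.

    S2d, S3d: (N, K) per-pixel class probabilities from the head / the renderer.
    labels:   (N,) integer prototype label of each pixel (the one-hot target).
    """
    idx = np.arange(len(labels))
    p2d = S2d[idx, labels]   # probability the 2D head gives to the target class
    p3d = S3d[idx, labels]   # probability the rendered map gives to the target class
    return -np.sum(np.log(p2d + eps) + np.log(p3d + eps))
```

Because the target is one-hot, the dot product with the log-probabilities reduces to picking the log-probability of the target class, which is what the indexing implements.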

Privacy-preserving localization, GSFFs-PR Privacy. After training, any potentially privacy-sensitive information (spherical harmonics, features, and GSFFs) is removed from the 3D Gaussian model, as neither photometric information nor features are required for privacy-preserving localization. The pseudo-logits obtained from the soft assignments still contain too much information [55]. Therefore, we instead store hard assignments in the form of a single label per Gaussian ${\mathcal{G}}_{i}$, $k^{\ast}=\text{argmax}_{k}({\bf l}_{ik})$. After this labeling procedure, the triplane GSFFs feature field and the prototypes are removed. As a result, the 3D model contains only geometric information (Gaussians without color) and segmentation information (cluster labels), effectively increasing the level of privacy. Furthermore, storing only the cluster labels makes the 3D representation orders of magnitude smaller than storing features. Given a pose $P$, segmentation maps ${\bf S}^{\text{3D}}_{P}$ can be rendered by alpha-blending (via Eq. 1) the one-hot vectors ${\mathbbm{1}}_{k}$ (corresponding to label $k$). Given an image, the 2D segmentation map ${\bf S}^{\text{2D}}$ is obtained directly from the segmentation head. The localization pipeline is similar to the one in Section 3.3, with the difference that we minimize inconsistencies between segmentation maps instead of features:

\[
P^{*} = \operatorname*{arg\,min}_{P \in SE(3)} CE({\bf S}^{\text{2D}}, {\bf S}^{\text{3D}}_{P}) \enspace. \tag{6}
\]
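The two steps specific to the privacy-preserving variant, namely the hard labeling of the Gaussians and the rendering of one-hot labels, can be sketched as follows for a single ray. The sketch assumes that Eq. 1 is the standard front-to-back alpha compositing used in 3DGS and that the Gaussians along the ray are already sorted by depth with known per-ray opacities. The function names are hypothetical.

```python
import numpy as np

def hard_labels(pseudo_logits):
    """Keep a single cluster label per Gaussian: k* = argmax_k l_ik."""
    return np.argmax(pseudo_logits, axis=1)

def render_segmentation(labels, alphas, num_classes):
    """Alpha-blend the one-hot labels of depth-sorted Gaussians along one ray.

    labels: (M,) integer label per Gaussian, ordered front to back.
    alphas: (M,) opacity contribution of each Gaussian on this ray.
    Returns a (num_classes,) vector of blended class scores.
    """
    one_hot = np.eye(num_classes)[labels]                         # (M, K)
    # Transmittance before each Gaussian: product of (1 - alpha) of those in front
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * transmittance                              # blending weights
    return weights @ one_hot
```

The blended scores sum to at most one; the remainder is the transmittance left after the last Gaussian, so a normalization may be needed before comparing them with the output of the segmentation head.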

5 Experimental Evaluation

In this section, we first provide details regarding the training and evaluation of GSFFs-PR. Then, in Section 5.1, we evaluate our feature-based VL pipeline (dubbed GSFFs-PR Feature) and our privacy-preserving VL pipeline (dubbed GSFFs-PR Privacy) on multiple real-world datasets. We show that GSFFs-PR is, in general, more accurate than the respective non-privacy-preserving and privacy-preserving baselines. Section 5.2 ablates and discusses important aspects of our approach.

\begin{tabular}{llcccc}
\hline
 & Model & King's College & Old Hospital & Shop Facade & St. Mary's Church \\
\hline
SBM & HLoc [60] & 11 / 0.20 & 15 / 0.31 & 4 / 0.2 & 7 / 0.24 \\
 & Kapture R2D2 [29] & 5 / 0.10 & 9 / 0.20 & 2 / 0.10 & 3 / 0.10 \\
 & DSAC* [4] & 15 / 0.30 & 21 / 0.40 & 5 / 0.30 & 13 / 0.40 \\
 & ACE [6] & 29 / 0.38 & 31 / 0.61 & 5 / 0.30 & 19 / 0.60 \\
 & ACE + GSLoc [41] & 25 / 0.29 & 26 / 0.38 & 5 / 0.23 & 13 / 0.41 \\
\hline
PPM & GoMatch [98] & 25 / 0.64 & 283 / 8.14 & 48 / 4.77 & 335 / 9.94 \\
 & DGC-GNN [79] & 18 / 0.47 & 75 / 2.83 & 15 / 1.57 & 106 / 4.03 \\
 & SegLoc [55] (DV) & 24 / 0.26 & 36 / 0.52 & 11 / 0.34 & 17 / 0.46 \\
 & GSFFs-PR Privacy (DV) & 24 / 0.39 & 26 / 0.49 & 5 / 0.27 & 13 / 0.48 \\
\hline
RBM & CROSSFIRE [49] (DV) & 47 / 0.7 & 43 / 0.7 & 20 / 1.2 & 39 / 1.4 \\
 & NeFeS [14] (DFNet) & 37 / 0.62 & 55 / 0.9 & 14 / 0.47 & 32 / 0.99 \\
 & MCLoc [74] & 31 / 0.42 & 39 / 0.73 & 12 / 0.45 & 26 / 0.88 \\
 & SSL-Nif [56] (DV) & 32 / 0.40 & 30 / 0.49 & 8 / 0.36 & 24 / 0.66 \\
 & NeRFMatch [99] (DV) & 13 / 0.2 & 21 / 0.4 & 8.7 / 0.4 & 11.3 / 0.4 \\
 & GSLoc [41] (DFNet) & 26 / 0.34 & 48 / 0.72 & 10 / 0.36 & 27 / 0.62 \\
 & GSplatLoc [68] & 27 / 0.46 & 20 / 0.71 & 5 / 0.36 & 16 / 0.61 \\
 & GSFFs-PR Feature (DV) & 18 / 0.27 & 21 / 0.4 & 4 / 0.26 & 10 / 0.34 \\
 & GSFFs-PR Feature (tuned) (DV) & 17 / 0.26 & 18 / 0.36 & 4 / 0.25 & 8 / 0.26 \\
\hline
\end{tabular}
Table 2: Localization results on Cambridge Landmarks. Comparison with Structure-Based Methods (SBM), Privacy-Preserving Methods (PPM), and Rendering-Based Methods (RBM). Median position error (cm) ($\downarrow$) / median rotation error (°) ($\downarrow$). Bold: best results per method category, underline: second best. (tuned) means that the training hyperparameters are tuned for the outdoor dataset as opposed to the indoor datasets.

Experimental details. We train one model per scene for 50K iterations on a single NVIDIA A100 GPU. During the first 15K iterations, only the Gaussian parameters are updated through the photometric loss. From iteration 15K onward, the triplane fields and the encoder are also optimized with the joint loss $L=L_{PHO}+0.5L_{NCE}+0.5L_{PRO}+0.5L_{CE}+0.1L_{TVL}+0.05L_{Depth}$, where $L_{TVL}$ is a total variation loss applied to the triplane parameters for regularization and $L_{Depth}$ is a depth prior, used when available. To increase the convergence basin of our method, we define a coarse feature/segmentation space and a fine feature/segmentation space (both are trained in a similar fashion). The coarse triplane has a resolution of $R=256$ and the fine triplane a resolution of $R=1024$. Coarse image-based features are processed through a ViT [52] with a patch size of 14, while fine image-based features are not downsampled. The feature dimension $d$ is set to 16 and the number of clusters $K$ to 34 for both the fine and coarse levels. The estimated pose of a query image is first iteratively refined with the coarse-level features or segmentations and then with the fine-level ones.
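The training schedule and loss weighting described above can be summarized by the following sketch. The weights and the 15K-iteration warm-up are taken from the text; the dictionary keys, the function name, and the assumption that only the photometric term is active during the warm-up are illustrative choices.

```python
# Loss weights as stated in the experimental details
LOSS_WEIGHTS = {"pho": 1.0, "nce": 0.5, "pro": 0.5, "ce": 0.5, "tvl": 0.1, "depth": 0.05}
WARMUP_ITERS = 15_000  # only the Gaussian parameters are trained before this iteration

def joint_loss(losses, iteration, has_depth_prior=True):
    """Combine the individual loss values (a dict of scalars) for a given iteration."""
    if iteration < WARMUP_ITERS:
        # Warm-up phase: photometric loss only (assumption: no other term is active)
        return losses["pho"]
    total = 0.0
    for name, weight in LOSS_WEIGHTS.items():
        if name == "depth" and not has_depth_prior:
            continue  # the depth prior is only used when available
        total += weight * losses[name]
    return total
```

For example, if every individual loss evaluates to 1, the joint loss is 1 during the warm-up and 2.65 afterwards (2.6 without a depth prior).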

We evaluate our localization pipeline on the 7Scenes (with Depth SLAM pseudo ground truth poses [67] and with SfM-based pseudo ground truth poses [5]), Indoor6 [19], and Cambridge Landmarks [33] real-world datasets. Results on the 12Scenes [75] dataset and comparisons against [90, 91] are provided in the Appendix. The median position error (cm) and median rotation error (°) are reported. Additionally, we report the recall at 5cm/5° for the indoor scenes. Training parameters (loss coefficients, temperature, 3DGS hyperparameters) are kept the same for all datasets. More details are provided in the Appendix.
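For reference, the reported metrics can be computed as in the sketch below. It assumes that positions are camera centers expressed in centimeters, that rotations are 3x3 matrices, and that a query counts as localized when both errors are at or below the thresholds. The function names and the inclusive thresholds are assumptions made for this example.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Position error (in the unit of t) and rotation error (degrees) for one query."""
    pos_err = np.linalg.norm(t_est - t_gt)
    # Angle of the relative rotation R_est^T R_gt, from its trace
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    rot_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return pos_err, rot_err

def summarize(pos_errs_cm, rot_errs_deg, pos_thresh_cm=5.0, rot_thresh_deg=5.0):
    """Median position error, median rotation error, and recall (%) at the thresholds."""
    pos = np.asarray(pos_errs_cm, dtype=float)
    rot = np.asarray(rot_errs_deg, dtype=float)
    recall = 100.0 * np.mean((pos <= pos_thresh_cm) & (rot <= rot_thresh_deg))
    return np.median(pos), np.median(rot), recall
```

Clipping the cosine guards against values slightly outside $[-1, 1]$ caused by floating-point round-off.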

5.1 Visual Localization

Comparison with feature rendering-based approaches. We primarily compare GSFFs-PR Feature against other non-privacy-preserving rendering-based VL methods, which use either NeRFs [14, 56, 99] or 3DGS [41, 68] as their underlying 3D representation. These methods all require an initial pose estimate, which is obtained via pose retrieval with a global descriptor. More accurate global descriptors lead to more accurate initial pose estimates, which in turn make refinement easier. For the sake of fairness, we try to use global descriptors that are similar to or less accurate than those used by the other baselines. The global descriptor used is specified in brackets. We report 7Scenes results (with SfM pseudo GT) in Table 1 and Cambridge Landmarks results in Table 2.

From Table 1, we observe that our GSFFs-PR Feature (DV) is the best on 4 out of 7 scenes and the second best on two further ones, in spite of relying on a less accurate initialization, i.e., DenseVLAD (DV) is less accurate than DFNet or ACE. On Stairs, the Opacity Gaussian Field struggles due to the scene's textureless and flat layout, yielding poor results for the initial 3D structure, which directly affects GSFFs-PR's refinement framework. From Table 2, we can see that GSFFs-PR is more accurate than the other 3D Gaussian rendering-based baselines. Overall, our feature pipeline is even competitive with state-of-the-art structure-based methods on some scenes.

\begin{tabular}{lccccccc}
\hline
Model & Chess & Fire & Heads & Office & Pumpkin & Redkitchen & Stairs \\
\hline
GoMatch [98] & 4/1.65 & 13/3.86 & 9/5.17 & 11/2.48 & 16/3.32 & 13/2.84 & 89/21.12 \\
DGC-GNN [79] & 3/1.43 & 5/1.77 & 4/2.95 & 6/1.61 & 8/1.93 & 8/2.09 & 71/19.5 \\
SegLoc [55] & 4.5/1.02 & 4.0/1.51 & 4.1/1.80 & 4.7/1.33 & 6.3/1.88 & 6.0/1.87 & 43.7/8.27 \\
GSFFs-PR & 2.9/0.92 & 2.3/0.91 & 1.5/1.11 & 4.0/1.28 & 5.7/1.48 & 5.1/1.66 & 28.3/2.39 \\
\hline
\end{tabular}
Table 3: Localization results on 7Scenes (DSLAM-based pseudo GT), with median position error (cm) ($\downarrow$) / median rotation error (°) ($\downarrow$). We compare GSFFs-PR (Privacy) with other privacy-preserving methods. Both SegLoc and GSFFs-PR refine the pose from a DenseVLAD [73] initialization. Best numbers in bold.

\begin{tabular}{llcccccc}
\hline
 & Model & scene1 & scene2a & scene3 & scene4a & scene5 & scene6 \\
\hline
NPPM & DSAC* [4] & 12.3/2.06/18.7 & 7.9/0.9/28.0 & 13.1/2.34/19.7 & 3.7/0.95/60.8 & 40.7/6.72/10.6 & 6.0/1.40/44.3 \\
 & NBE+SLD [19] & 6.5/0.9/38.4 & 7.2/0.68/32.7 & 4.4/0.91/53.0 & 3.8/0.94/66.5 & 6.0/0.91/40.0 & 5.0/0.99/50.5 \\
 & GSFFs-PR Feature (34 classes) (Sgl) & 4.9/0.68/51 & 3.1/0.25/68 & 3.0/0.58/69 & 5.1/0.85/49 & 6.3/0.84/41 & 3.3/0.69/62 \\
 & GSFFs-PR Feature (84 classes) (Sgl) & 3.6/0.5/63 & 2.7/0.2/77 & 2.4/0.39/73 & 3.6/0.54/63 & 4.8/0.53/53 & 2.5/0.42/68 \\
\hline
PPM & SegLoc [55] (Sgl) & 3.9/0.72/51.0 & 3.2/0.37/56.4 & 4.2/0.86/41.8 & 6.6/1.27/33.84 & 5.1/0.81/43.1 & 3.5/0.78/34.5 \\
 & GSFFs-PR Privacy (34 classes) (Sgl) & 10.2/1.8/28 & 5.8/0.49/42 & 5.8/1.2/45 & 6.5/1.2/40 & 13.9/2.32/23 & 6.5/1.5/46 \\
 & GSFFs-PR Privacy (84 classes) (Sgl) & 4.6/0.67/53 & 4.1/0.33/49 & 4.1/0.74/55 & 4.0/0.61/58 & 6.6/0.92/38 & 3.7/0.75/57 \\
\hline
\end{tabular}
Table 4: Localization results on Indoor6, comparison with Non-Privacy-Preserving Methods (NPPM) and Privacy-Preserving Methods (PPM). Median position error (cm) ($\downarrow$) / median rotation error (°) ($\downarrow$) / recall at 5cm/5° (\%) ($\uparrow$).

Comparison with privacy-preserving approaches. We compare the GSFFs-PR Privacy version against other privacy-preserving visual localization methods, including SegLoc [55] and descriptor-less visual localization methods [98, 79], on Cambridge Landmarks in Table 2 and on 7Scenes with DSLAM pseudo GT in Table 3. Similar to our approach, these baselines try to remove the reliance on traditional high-dimensional features for visual localization and hence try to improve privacy. On both datasets, our pipeline proves to be much more accurate than all the other privacy-preserving methods. Finally, Table 4 shows results on the challenging Indoor6 dataset, which contains scenes with multiple rooms, uneven camera distributions, and large illumination changes. GSFFs-PR trained with 84 classes outperforms SegLoc in average recall, while GSFFs-PR trained with 34 classes is on par with SegLoc. This shows that increasing the number of classes increases the discriminative power of the segmentation, which is particularly beneficial for larger and more complex scenes such as scene1 and scene5. The Appendix provides an ablation study on varying the number of classes.

\begin{tabular}{lcccc}
\hline
Variant & KC (F) & OH (F) & KC (P) & OH (P) \\
\hline
Coarse only & 37.5/0.80 & 670/1.10 & 92.2/1.55 & 82.5/1.50 \\
Fine only & 22.6/0.32 & 55.1/0.77 & 28.0/0.45 & 151/0.96 \\
k-means & 25.2/0.33 & 27.6/0.50 & 37.2/0.64 & 52.3/0.84 \\
No SOpt & 23.5/0.30 & 23.4/0.45 & 31.0/0.45 & 30.8/0.57 \\
No MVR & 20.0/0.28 & 22.5/0.49 & 29.6/0.51 & 29.1/0.49 \\
No SE & 23.5/0.31 & 21.9/0.42 & 27.0/0.42 & 26.0/0.47 \\
GSFFs-PR & 17.9/0.27 & 21.4/0.41 & 24.3/0.39 & 25.6/0.49 \\
\hline
\end{tabular}
Table 5: Ablation of different components of our pipeline on King's College (KC) and Old Hospital (OH) for the Feature (F) and Privacy (P) versions. Median position error (cm) / median rotation error (°) ($\downarrow$).

5.2 Ablations and Discussion

Table 5 shows the results of our ablation studies, where we study a single component per ablation experiment.

No hierarchy. First, we evaluate the importance of the hierarchical approach by removing either the fine level (Coarse only) or the coarse level (Fine only). The results in Table 5 (first two rows) show that alignment at the finer level yields better pose accuracy and that increasing the convergence basin with a hierarchical approach is necessary to ensure accurate localization.

Clustering. As shown in Table 5 (k-means), when we replace the spectral clustering with a simple k-means clustering to derive the prototypes in the 3D feature space, the accuracy drops, confirming that spectral clustering is a critical component of the representation learning process.

Feature field learning. The accuracy also drops when we replace the scale-aware feature encoding with a simple point-wise projection of the 3D Gaussian centers (cf. Table 5 (No SE)) or when we remove the multi-view regularization in the contrastive losses (cf. Table 5 (No MVR)). This shows that both components leverage the geometric information contained in the 3D Gaussian model, yielding more robust and more discriminative representations and hence a higher pose accuracy.

Segmentation optimization. By removing the loss from Eq. 5, we do not optimize the segmentations during training (Table 5 (No SOpt)). Instead, during inference, we align the soft assignments between features and prototypes. Due to the prototypical loss, the soft assignments are aligned well enough to provide decent accuracy; however, the segmentation head provides a significant gain in accuracy. Furthermore, the feature alignment also benefits from the segmentation optimization during training.

Sparsity of the representation. To evaluate the role of the dense representation, we adopt an approach closer to SegLoc by applying pose refinement only on a sparse set of points. To do this, we reproject only the Gaussian centers into the image space, interpolate the 2D extracted and 3D rendered segmentations at these locations, and apply Eq. 6 for pose refinement. In this sparse setup, the pose refinement does not converge, which suggests that a sparse signal is not sufficient when backpropagating through the renderer.

6 Conclusion

In this paper, we propose Gaussian Splatting Feature Fields (GSFFs), which combine an explicit geometry representation (3DGS) and an implicit feature field into a scene representation suitable for visual localization through pose refinement. The resulting VL framework relies on a learned feature embedding space shared between the scale-aware GSFFs and a 2D encoder. Both the field and the encoder are optimized in a self-supervised way via contrastive learning. By leveraging the geometry and structure of the GS representation, we regularize the training through multi-view consistency and clustering. The clusters are further used to convert the features into segmentations to enable privacy-preserving localization. Both the privacy-preserving and non-privacy-preserving variants of our localization framework outperform the relevant state-of-the-art.

Acknowledgments. Maxime and Torsten received funding from NAVER LABS Europe. Maxime was also supported by the Grant Agency of the Czech Technical University in Prague (No. SGS24/057/OHK3/1T/13). We thank our colleagues for providing feedback, in particular Assia Benbihi.

References

  • Alismail et al. [2017] Hatem Alismail, Brett Browning, and Simon Lucey. Photometric Bundle Adjustment for Vision-Based SLAM. In ACCV, 2017.
  • Bhayani et al. [2021] Snehal Bhayani, Torsten Sattler, Daniel Barath, Patrik Beliansky, Janne Heikkilä, and Zuzana Kukelova. Calibrated and Partially Calibrated Semi-Generalized Homographies. In ICCV, 2021.
  • Bortolon et al. [2024] Matteo Bortolon, Theodore Tsesmelis, Stuart James, Fabio Poiesi, and Alessio Del Bue. 6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model. In ECCV, 2024.
  • Brachmann and Rother [2021] Eric Brachmann and Carsten Rother. Visual Camera Re-Localization from RGB and RGB-D Images Using DSAC. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 44(9):5847–5865, 2021.
  • Brachmann et al. [2021] Eric Brachmann, Martin Humenberger, Carsten Rother, and Torsten Sattler. On the Limits of Pseudo Ground Truth in Visual Camera Re-Localisation. In ICCV, 2021.
  • Brachmann et al. [2023] Eric Brachmann, Tommaso Cavallari, and Victor Adrian Prisacariu. Accelerated Coordinate Encoding: Learning to Relocalize in Minutes using RGB and Poses. In CVPR, 2023.
  • Brahmbhatt et al. [2018] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-Aware Learning of Maps for Camera Localization. In CVPR, 2018.
  • Chelani et al. [2021] Kunal Chelani, Fredrik Kahl, and Torsten Sattler. How Privacy-Preserving Are Line Clouds? Recovering Scene Details From 3D Lines. In CVPR, 2021.
  • Chelani et al. [2025] Kunal Chelani, Assia Benbihi, Fredrik Kahl, Torsten Sattler, and Zuzana Kukelova. Obfuscation Based Privacy Preserving Representations are Recoverable Using Neighborhood Information. In 3DV, 2025.
  • Chen et al. [2022a] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial Radiance Fields. In ECCV, 2022a.
  • Chen et al. [2024a] Le Chen, Weirong Chen, Rui Wang, and Marc Pollefeys. Leveraging Neural Radiance Fields for Uncertainty-aware Visual Localization. In ICRA, 2024a.
  • Chen et al. [2021] Shuai Chen, Zirui Wang, and Victor A. Prisacariu. Direct-PoseNet: Absolute Pose Regression with Photometric Consistency. In 3DV, 2021.
  • Chen et al. [2022b] Shuai Chen, Xinghui Li, Zirui Wang, and Victor A. Prisacariu. DFNet: Enhance Absolute Pose Regression with Direct Feature Matching. In ECCV, 2022b.
  • Chen et al. [2024b] Shuai Chen, Yash Bhalgat, Xinghui Li, Jiawang Bian, Kejie Li, Zirui Wang, and Victor A. Prisacariu. Neural Refinement for Absolute Pose Regression with Feature Synthesis. In CVPR, 2024b.
  • Cheng et al. [2024] Kai Cheng, Xiaoxiao Long, Kaizhi Yang, Yao Yao, Wei Yin, Yuexin Ma, Wenping Wang, and Xuejin Chen. GaussianPro: 3D Gaussian Splatting with Progressive Propagation. In ICML, 2024.
  • Chung et al. [2024] Jaeyoung Chung, Jeongtaek Oh, and Kyoung Mu Lee. Depth-Regularized Optimization for 3D Gaussian Splatting in Few-Shot Images. In CVPR, 2024.
  • Cuturi [2013] Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In NeurIPS, 2013.
  • Delaunay [1924] Boris Delaunay. Sur la sphère vide: A la mémoire de Georges Voronoï. In Proceedings du Congrés international des mathématiciens, 1924.
  • Do et al. [2022] Tien Do, Ondrej Miksik, Joseph DeGol, Hyun Soo Park, and Sudipta N Sinha. Learning To Detect Scene Landmarks for Camera Localization. In CVPR, 2022.
  • Dusmanu et al. [2021] Mihai Dusmanu, Johannes L. Schönberger, Sudipta N. Sinha, and Marc Pollefeys. Privacy-Preserving Image Features via Adversarial Affine Subspace Embeddings. In CVPR, 2021.
  • Engel et al. [2014] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale Direct Monocular SLAM. In ECCV, 2014.
  • Engel et al. [2017] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct Sparse Odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(3):611–625, 2017.
  • Geppert et al. [2022] Marcel Geppert, Viktor Larsson, Johannes L Schönberger, and Marc Pollefeys. Privacy Preserving Partial Localization. In CVPR, 2022.
  • Germain et al. [2019] Hugo Germain, Guillaume Bourmaud, and Vincent Lepetit. Sparse-to-Dense Hypercolumn Matching for Long-Term Visual Localization. In 3DV, 2019.
  • Germain et al. [2022] Hugo Germain, Daniel DeTone, Geoffrey Pascoe, Tanner Schmidt, David Novotny, Richard Newcombe, Chris Sweeney, Richard Szeliski, and Vasileios Balntas. Feature Query Networks: Neural Surface Description for Camera Pose Refinement. In CVPR Workshops, 2022.
  • Guédon and Lepetit [2024] Antoine Guédon and Vincent Lepetit. SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering. In CVPR, 2024.
  • Heng et al. [2019] Lionel Heng, Benjamin Choi, Zhaopeng Cui, Marcel Geppert, Sixing Hu, Benson Kuan, Peidong Liu, Rang Nguyen, Ye Chuan Yeo, Andreas Geiger, Gim Hee Lee, Marc Pollefeys, and Torsten Sattler. Project AutoVision: Localization and 3D Scene Perception for an Autonomous Vehicle with a Multi-Camera System. In ICRA, 2019.
  • Huang et al. [2024] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2D Gaussian Splatting for Geometrically Accurate Radiance Fields. In SIGGRAPH, 2024.
  • Humenberger et al. [2020] Martin Humenberger, Yohann Cabon, Nicolas Guerin, Julien Morat, Jérôme Revaud, Philippe Rerole, Noé Pion, César Roberto de Souza, Vincent Leroy, and Gabriela Csurka. Robust Image Retrieval-based Visual Localization using Kapture. arXiv preprint arXiv:2007.13867, 2020.
  • Humenberger et al. [2022] Martin Humenberger, Yohann Cabon, Noé Pion, Philippe Weinzaepfel, Donghwan Lee, Nicolas Guérin, Torsten Sattler, and Gabriela Csurka. Investigating the Role of Image Retrieval for Visual Localization. International Journal of Computer Vision (IJCV), 130(7):1811–1836, 2022.
  • Ji et al. [2024] Shengxiang Ji, Guanjun Wu, Jiemin Fang, Jiazhong Cen, Taoran Yi, Wenyu Liu, Qi Tian, and Xinggang Wang. Segment Any 4D Gaussians. arXiv preprint arXiv:2407.04504, 2024.
  • Keetha et al. [2024] Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM. In CVPR, 2024.
  • Kendall et al. [2015] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: a Convolutional Network for Real-Time 6-DOF Camera Relocalization. In ICCV, 2015.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Lee et al. [2023] Chunghwan Lee, Jaihoon Kim, Chanhyuk Yun, and Je Hyeong Hong. Paired-Point Lifting for Enhanced Privacy-Preserving Visual Localization. In CVPR, 2023.
  • Liao et al. [2024] Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Jingdong Wang, Qing Li, and Kanglin Liu. CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding. arXiv preprint arXiv:2404.14249, 2024.
  • Lim et al. [2015] Hyon Lim, Sudipta N. Sinha, Michael F. Cohen, Matt Uyttendaele, and H. Jin Kim. Real-time Monocular Image-based 6-DoF Localization. International Journal of Robotics Research, 34(4–5):476–492, 2015.
  • Lin et al. [2023] Yunzhi Lin, Thomas Müller, Jonathan Tremblay, Bowen Wen, Stephen Tyree, Alex Evans, Patricio A. Vela, and Stan Birchfield. Parallel Inversion of Neural Radiance Fields for Robust Pose Estimation. In ICRA, 2023.
  • Lindenberger et al. [2021] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-Perfect Structure-from-Motion with Featuremetric Refinement. In ICCV, 2021.
  • Liu et al. [2024] Changkun Liu, Shuai Chen, Yash Bhalgat, Siyan Hu, Zirui Wang, Ming Cheng, Victor Adrian Prisacariu, and Tristan Braud. GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting. arXiv preprint arXiv:2408.11085, 2024.
  • Liu et al. [2022] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A Convnet for the 2020s. In CVPR, 2022.
  • Luiten et al. [2023] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. arXiv preprint arXiv:2308.09713, 2023.
  • Lynen et al. [2015] Simon Lynen, Torsten Sattler, Michael Bosse, Joel Hesch, Marc Pollefeys, and Roland Siegwart. Get out of my Lab: Large-scale, Real-Time Visual-Inertial Localization. In RSS, 2015.
  • Maggio et al. [2023] Dominic Maggio, Marcus Abate, Jingnan Shi, Courtney Mario, and Luca Carlone. Loc-NeRF: Monte Carlo Localization using Neural Radiance Fields. In ICRA, 2023.
  • Matsuki et al. [2024] Hidenobu Matsuki, Riku Murai, Paul H.J. Kelly, and Andrew J. Davison. Gaussian Splatting SLAM. In CVPR, 2024.
  • Moon et al. [2024] Heejoon Moon, Chunghwan Lee, and Je Hyeong Hong. Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds. In CVPR, 2024.
  • Moreau et al. [2022] Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. LENS: Localization Enhanced by NeRF Synthesis. In CoRL, 2022.
  • Moreau et al. [2023] Arthur Moreau, Nathan Piasco, Moussab Bennehar, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. CROSSFIRE: Camera Relocalization on Self-Supervised Features from an Implicit Representation. In ICCV, 2023.
  • Ng et al. [2022] Tony Ng, Adrian Lopez-Rodriguez, Vassileios Balntas, and Krystian Mikolajczyk. OoD-Pose: Camera Pose Regression From Out-of-Distribution Synthetic Views. In 3DV, 2022.
  • Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • Pan et al. [2023] Linfei Pan, Johannes L Schönberger, Viktor Larsson, and Marc Pollefeys. Privacy Preserving Localization via Coordinate Permutations. In ICCV, 2023.
  • Panek et al. [2022] Vojtech Panek, Zuzana Kukelova, and Torsten Sattler. MeshLoc: Mesh-Based Visual Localization. In ECCV, 2022.
  • Pietrantoni et al. [2023] Maxime Pietrantoni, Martin Humenberger, Torsten Sattler, and Gabriela Csurka. SegLoc: Learning Segmentation-Based Representations for Privacy-Preserving Visual Localization. In CVPR, 2023.
  • Pietrantoni et al. [2024] Maxime Pietrantoni, Gabriela Csurka, Martin Humenberger, and Torsten Sattler. Self-Supervised Learning of Neural Implicit Feature Fields for Camera Pose Refinement. In 3DV, 2024.
  • Pittaluga and Zhuang [2023] Francesco Pittaluga and Bingbing Zhuang. LDP-Feat: Image Features with Local Differential Privacy. In ICCV, 2023.
  • Pittaluga et al. [2019] Francesco Pittaluga, Sanjeev J. Koppal, Sing Bing Kang, and Sudipta N. Sinha. Revealing Scenes by Inverting Structure from Motion Reconstructions. In CVPR, 2019.
  • Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision Transformers for Dense Prediction. In ICCV, 2021.
  • Sarlin et al. [2019] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In CVPR, 2019.
  • Sarlin et al. [2021] Paul-Edouard Sarlin, Ajaykumar Unagar, Mans Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, and Torsten Sattler. Back to the Feature: Learning Robust Camera Localization From Pixels To Pose. In CVPR, 2021.
  • Sattler et al. [2015] Torsten Sattler, Michal Havlena, Filip Radenović, Konrad Schindler, and Marc Pollefeys. Hyperpoints and Fine Vocabularies for Large-Scale Location Recognition. In ICCV, 2015.
  • Sattler et al. [2019] Torsten Sattler, Qunjie Zhou, Marc Pollefeys, and Laura Leal-Taixé. Understanding the Limitations of CNN-based Absolute Camera Pose Regression. In CVPR, 2019.
  • Schöps et al. [2017] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A Multi-view Stereo Benchmark with High-Resolution Images and Multi-camera Videos. In CVPR, 2017.
  • Schöps et al. [2019] Thomas Schöps, Torsten Sattler, and Marc Pollefeys. BAD SLAM: Bundle Adjusted Direct RGB-D SLAM. In CVPR, 2019.
  • Shibuya et al. [2020] Mikiya Shibuya, Shinya Sumikura, and Ken Sakurada. Privacy Preserving Visual SLAM. In ECCV, 2020.
  • Shotton et al. [2013] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In CVPR, 2013.
  • Sidorov et al. [2024] Gennady Sidorov, Malik Mohrat, Ksenia Lebedeva, Ruslan Rakhimov, and Sergey Kolyubin. GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization. arXiv preprint arXiv:2409.16502, 2024.
  • Speciale et al. [2019a] Pablo Speciale, Johannes L. Schönberger, Sudipta N. Sinha, and Marc Pollefeys. Privacy Preserving Image Queries for Camera Localization. In ICCV, 2019a.
  • Speciale et al. [2019b] Pablo Speciale, Johannes L. Schönberger, Sing Bing Kang, Sudipta N. Sinha, and Marc Pollefeys. Privacy Preserving Image-Based Localization. In CVPR, 2019b.
  • Taira et al. [2021] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomáš Pajdla, and Akihiko Torii. InLoc: Indoor Visual Localization with Dense Matching and View Synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 43(4):1293–1307, 2021.
  • Toft et al. [2020] Carl Toft, Will Maddern, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, et al. Long-term Visual Localization Revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2020.
  • Torii et al. [2018] Akihiko Torii, Relja Arandjelović, Josef Sivic, Masatoshi Okutomi, and Tomáš Pajdla. 24/7 Place Recognition by View Synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(2):257–271, 2018.
  • Trivigno et al. [2024] Gabriele Trivigno, Carlo Masone, Barbara Caputo, and Torsten Sattler. The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement. In CVPR, 2024.
  • Valentin et al. [2016] Julien Valentin, Angela Dai, Matthias Nießner, Pushmeet Kohli, Philip Torr, Shahram Izadi, and Cem Keskin. Learning to Navigate the Energy Landscape. In 3DV. IEEE, 2016.
  • Von Stumberg et al. [2020] Lukas Von Stumberg, Patrick Wenzel, Qadeer Khan, and Daniel Cremers. GN-Net: The Gauss-Newton Loss for Multi-Weather Relocalization. IEEE Robotics and Automation Letters, 5(2):890–897, 2020.
  • von Stumberg et al. [2020] Lukas von Stumberg, Patrick Wenzel, Nan Yang, and Daniel Cremers. LM-Reloc: Levenberg-Marquardt Based Direct Visual Relocalization. In 3DV, 2020.
  • Wang et al. [2020] Bing Wang, Changhao Chen, Chris Xiaoxuan Lu, Peijun Zhao, Niki Trigoni, and Andrew Markham. AtLoc: Attention Guided Camera Localization. In AAAI, 2020.
  • Wang et al. [2024] Shuzhe Wang, Juho Kannala, and Daniel Barath. DGC-GNN: Descriptor-free Geometric-Color Graph Neural Network for 2D-3D Matching. In CVPR, 2024.
  • Wang et al. [2004] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing (TIP), 13(4):600–612, 2004.
  • Wu and He [2018] Yuxin Wu and Kaiming He. Group Normalization. In ECCV, 2018.
  • Xiong et al. [2023] Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, and Achuta Kadambi. SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting. arXiv preprint arXiv:2312.00206, 2023.
  • Xu et al. [2021] Binbin Xu, Andrew Davison, and Stefan Leutenegger. Deep Probabilistic Feature-metric Tracking. IEEE Robotics and Automation Letters, 6(1):223–230, 2021.
  • Yang et al. [2024] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction. In CVPR, 2024.
  • Ye et al. [2024] Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images. arXiv preprint arXiv:2410.24207, 2024.
  • Yen-Chen et al. [2021] Lin Yen-Chen, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. INeRF: Inverting Neural Radiance Fields for Pose Estimation. In IROS, 2021.
  • Yu et al. [2024a] Hanyang Yu, Xiaoxiao Long, and Ping Tan. LM-Gaussian: Boost Sparse-view 3D Gaussian Splatting with Large Model Priors. arXiv preprint arXiv:2409.03456, 2024a.
  • Yu et al. [2024b] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3D Gaussian Splatting. In CVPR, 2024b.
  • Yu et al. [2024c] Zehao Yu, Torsten Sattler, and Andreas Geiger. Gaussian Opacity Fields: Efficient Adaptive Surface Reconstruction in Unbounded Scenes. IEEE Transactions on Graphics, 43(6):1–15, 2024c.
  • Zhai et al. [2024] Hongjia Zhai, Xiyu Zhang, Boming Zhao, Hai Li, Yijia He, Zhaopeng Cui, Hujun Bao, and Guofeng Zhang. SplatLoc: 3D Gaussian Splatting-based Visual Localization for Augmented Reality. arXiv preprint arXiv:2409.14067, 2024.
  • Zhai et al. [2025] Hongjia Zhai, Boming Zhao, Hai Li, Xiaokun Pan, Yijia He, Zhaopeng Cui, Hujun Bao, and Guofeng Zhang. Neuraloc: Visual localization in neural implicit map with dual complementary features. arXiv preprint arXiv:2503.06117, 2025.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR, 2018.
  • Zhang and Kosecka [2006] W. Zhang and J. Kosecka. Image based Localization in Urban Environments. In 3DPVT, 2006.
  • Zhang et al. [2024] Zheng Zhang, Wenbo Hu, Yixing Lao, Tong He, and Hengshuang Zhao. Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting. arXiv preprint arXiv:2403.15530, 2024.
  • Zhao et al. [2024] Boming Zhao, Luwei Yang, Mao Mao, Hujun Bao, and Zhaopeng Cui. PNeRFLoc: Visual localization with point-based neural radiance fields. In AAAI, 2024.
  • Zheng and Wu [2015] Enliang Zheng and Changchang Wu. Structure From Motion Using Structure-Less Resection. In ICCV, 2015.
  • Zhou et al. [2020] Qunjie Zhou, Torsten Sattler, Marc Pollefeys, and Laura Leal-Taixé. To Learn or not to Learn: Visual Localization from Essential Matrices. In ICRA, 2020.
  • Zhou et al. [2022] Qunjie Zhou, Sergio Agostinho, Aljosa Osep, and Laura Leal-Taixe. Is Geometry Enough for Matching in Visual Localization? In ECCV, 2022.
  • Zhou et al. [2024] Qunjie Zhou, Maxim Maximov, Or Litany, and Laura Leal-Taixé. The NeRFect Match: Exploring NeRF Features for Visual Localization. arXiv preprint arXiv:2403.09577, 2024.

APPENDIX

In this appendix, we first provide, in Appendix A, additional explanations regarding the contrastive losses, the optimal transport association procedure, and the volumetric feature encoding. In Appendix B, we detail the training and visual localization parameters. We also provide a pseudo-algorithm describing GSFFs training in Algorithm 1. Then, in Appendix C, we provide visual localization results on the 12Scenes dataset with ablation studies, and in Appendix D we present a privacy attack experiment. Finally, in Appendix E we display more visualizations of rendered segmentations and comment on the attached supplementary videos.

Appendix A Technical details of GSFFs

A.1 Contrastive losses and regularization terms

Here we provide more details regarding the derivation of the losses in Section 3.1. We recall that our aim is to train the model in a self-supervised manner to align the feature maps ${\bf F}^{\text{3D}}$ and ${\bf F}^{\text{2D}}$. Therefore, at each training step, we sample $N$ pixels in these maps to be aligned. Let us denote the $N$ corresponding pairs of pixel-aligned extracted/rendered features by $\{{\bf F}^{\text{2D}}_{n},{\bf F}^{\text{3D}}_{n}\}_{n=1}^{N}$. The contrastive loss has two terms. The first term contrasts each 3D rendered feature against the 2D extracted features, and the second term contrasts each 2D extracted feature against the 3D rendered features:

\[
L_{NCE}=-\frac{1}{N}\sum_{u=1}^{N}\log\left(\frac{\exp\left({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{u}/\tau\right)}{\sum_{j=1}^{N}\exp\left({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{j}/\tau\right)}\right)-\frac{1}{N}\sum_{u=1}^{N}\log\left(\frac{\exp\left({\bf F}^{\text{2D}}_{u}\cdot{\bf F}^{\text{3D}}_{u}/\tau\right)}{\sum_{j=1}^{N}\exp\left({\bf F}^{\text{2D}}_{u}\cdot{\bf F}^{\text{3D}}_{j}/\tau\right)}\right)
\]

yielding $L_{NCE}$ equal to:

\[
-\frac{1}{N}\sum_{u=1}^{N}\log\left(\frac{\exp\left({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{u}/\tau\right)\exp\left({\bf F}^{\text{2D}}_{u}\cdot{\bf F}^{\text{3D}}_{u}/\tau\right)}{\sum_{j=1}^{N}\exp\left({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{j}/\tau\right)\sum_{j=1}^{N}\exp\left({\bf F}^{\text{2D}}_{u}\cdot{\bf F}^{\text{3D}}_{j}/\tau\right)}\right).
\]

From this we can derive Eq. 2, where the normalization factor $A$ is:

\[
A=\left(\sum_{j=1}^{N}\exp\left({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{j}/\tau\right)\right)\left(\sum_{j=1}^{N}\exp\left({\bf F}^{\text{2D}}_{u}\cdot{\bf F}^{\text{3D}}_{j}/\tau\right)\right).
\]
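As an illustration, the following minimal Python sketch evaluates the two-term loss and the single-logarithm form with the factor $A$, so that the equivalence of the two expressions can be checked numerically. The function names (`info_nce_two_terms`, `info_nce_combined`), the toy features, and the temperature value are assumptions made for this example and are not taken from the GSFFs implementation.

```python
import math

def dot(a, b):
    """Dot product of two equally sized feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce_two_terms(f3d, f2d, tau):
    """Sum of the 3D-to-2D and 2D-to-3D contrastive terms."""
    n = len(f3d)
    loss = 0.0
    for u in range(n):
        pos = dot(f3d[u], f2d[u]) / tau
        den_3d = sum(math.exp(dot(f3d[u], f2d[j]) / tau) for j in range(n))
        den_2d = sum(math.exp(dot(f2d[u], f3d[j]) / tau) for j in range(n))
        loss -= (pos - math.log(den_3d)) / n
        loss -= (pos - math.log(den_2d)) / n
    return loss

def info_nce_combined(f3d, f2d, tau):
    """Single-log form: numerator is the product of both positives, denominator is A."""
    n = len(f3d)
    loss = 0.0
    for u in range(n):
        pos = math.exp(dot(f3d[u], f2d[u]) / tau)
        a = (sum(math.exp(dot(f3d[u], f2d[j]) / tau) for j in range(n))
             * sum(math.exp(dot(f2d[u], f3d[j]) / tau) for j in range(n)))
        loss -= math.log(pos * pos / a) / n
    return loss

# Toy unit-norm features (hypothetical values); tau = 0.1 is an arbitrary choice.
features = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
misaligned = features[1:] + features[:1]  # break the pixel alignment
print(info_nce_two_terms(features, features, 0.1))
print(info_nce_two_terms(features, misaligned, 0.1))
```

On this toy input, correctly paired features give a lower loss than misaligned ones, and both functions return the same value up to floating-point error.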

Similarly, the prototypical contrastive loss has a term enforcing the similarity between 2D extracted features and the associated prototypes, and a second term enforcing similarity between 3D rendered features and the associated prototypes:

\[
L_{PRO}=-\frac{1}{N}\sum_{n=1}^{N}\log\left(\frac{\exp\left({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{n}/\tau\right)}{\sum_{j=1}^{K}\exp\left({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{j}/\tau\right)}\right)-\frac{1}{N}\sum_{n=1}^{N}\log\left(\frac{\exp\left({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{n}/\tau\right)}{\sum_{j=1}^{K}\exp\left({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{j}/\tau\right)}\right)
\]

yielding $L_{PRO}$ equal to

\[
-\frac{1}{N}\sum_{n=1}^{N}\log\left(\frac{\exp\left({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{n}/\tau\right)\exp\left({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{n}/\tau\right)}{\left(\sum_{j=1}^{K}\exp\left({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{j}/\tau\right)\right)\left(\sum_{j=1}^{K}\exp\left({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{j}/\tau\right)\right)}\right)
\]

which simplifies to Eq. 3 with the normalization factor:

\[
B=\left(\sum_{j=1}^{K}\exp\left({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{j}/\tau\right)\right)\left(\sum_{j=1}^{K}\exp\left({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{j}/\tau\right)\right).
\]
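A small sketch of the prototypical loss may help to see how the prototype associated with a pair acts as a class label shared by the rendered and the extracted feature. The helper names, the toy prototypes, and the temperature are hypothetical; the `assignments` list stands in for the prototype index associated with each pair.

```python
import math

def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def log_softmax_at(feature, prototypes, target, tau):
    """Log-probability of prototype `target` under a softmax over all K prototypes."""
    logits = [dot(feature, p) / tau for p in prototypes]
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[target] - log_z

def prototypical_loss(f3d, f2d, prototypes, assignments, tau):
    """Both terms of L_PRO: rendered and extracted features share one prototype."""
    total = 0.0
    for feat3d, feat2d, k in zip(f3d, f2d, assignments):
        total += log_softmax_at(feat3d, prototypes, k, tau)
        total += log_softmax_at(feat2d, prototypes, k, tau)
    return -total / len(f3d)

# Toy example with K = 2 prototypes and N = 2 feature pairs (hypothetical values).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
f3d = [[1.0, 0.0], [0.0, 1.0]]
f2d = [[0.8, 0.6], [0.6, 0.8]]
print(prototypical_loss(f3d, f2d, prototypes, [0, 1], tau=0.1))
print(prototypical_loss(f3d, f2d, prototypes, [1, 0], tau=0.1))
```

Swapping the assignments in this toy example increases the loss, since both features of a pair are then pushed toward a prototype they are less similar to.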
Data: $M$ posed training images, $K$ target number of classes, and a total number of training iterations $N_{\textrm{iter}}$
Build SfM and pretrain a GoF [89] with the training images
Apply spectral clustering on the set of 3D Gaussian centers to initialize the spatial prototypes $\{p_{k}\}_{k=1}^{K}$
Then, to train the GSFFs, iterate:
for iter in range($N_{\textrm{iter}}$) do
   Sample a random image $I$ and viewpoint $P$
   Extract 2D image-based features ${\bf F}^{\text{2D}}$ and segmentation map ${\bf S}^{\text{2D}}$
   For each 3D Gaussian ${\mathcal{G}}_{i}$, extract a scale-aware volumetric feature ${\bf g}_{i}$ from the triplane using the covariance-kernel-based encoding (cf. Sec. 3.1)
   Assign a 3D segmentation label $s_{i}$ to each 3D Gaussian ${\mathcal{G}}_{i}$ based on volumetric feature/prototype similarities (cf. Sec. 4)
   From the pose $P$, rasterize 3D segmentation labels and volumetric features to obtain ${\bf S}^{\text{3D}}$ and ${\bf F}^{\text{3D}}$
   Sample a novel viewpoint $\widehat{P}$, find correspondences between image $I$ and $\hat{I}$ (cf. Sec. 3.2)
   Render 3D feature/segmentation maps from $\widehat{P}$ and replace features/segmentations in ${\bf F}^{\text{3D}}$, ${\bf S}^{\text{3D}}$ with $\widehat{{\bf F}}^{\text{3D}}$, $\widehat{{\bf S}}^{\text{3D}}$ at the correspondence locations
   Repeat the replacement process for 2D features/segmentations
   Compute the contrastive loss $L_{NCE}$ from ${\bf F}^{\text{2D}}$ and ${\bf F}^{\text{3D}}$
   Associate a prototype $p_{k}$ to each pair $\{{\bf F}^{\text{2D}}_{i},{\bf F}^{\text{3D}}_{i}\}$ using the OT labeling procedure (cf. Section A.2)
   Compute $L_{PRO}$, $L_{CE}$ from these associations as well as from ${\bf F}^{\text{2D}}$, ${\bf F}^{\text{3D}}$, ${\bf S}^{\text{2D}}$, ${\bf S}^{\text{3D}}$ (cf. Sec. 3.2)
   Compute regularization losses $L_{TVL}$, $L_{Depth}$ and photometric loss $L_{PHO}$ from the rendered image
   Jointly update the image encoder, the triplane field and 3D Gaussians by minimizing $L=L_{PHO}+0.5L_{NCE}+0.5L_{PRO}+0.5L_{CE}+0.1L_{TVL}+0.05L_{Depth}$
   Update the prototypes $p_{k}$ with an EMA scheme based on volumetric features ${\bf g}_{i}$ and the spectral clustering assignments
end for
Algorithm 1 Pseudo algorithm describing the training process of GSFFs.
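The last step of Algorithm 1 only states that prototypes are updated with an EMA scheme. One plausible reading, given here purely as a sketch, is to move each prototype toward the mean of the volumetric features assigned to its cluster. The function name `ema_update_prototypes`, the use of a per-cluster mean, and the momentum value are assumptions and are not specified in the text.

```python
def ema_update_prototypes(prototypes, features, assignments, momentum=0.9):
    """Return p_k <- momentum * p_k + (1 - momentum) * mean(features assigned to k).

    `momentum` and the per-cluster mean are illustrative assumptions; prototypes
    without any assigned feature are left unchanged.
    """
    dim = len(prototypes[0])
    sums = [[0.0] * dim for _ in prototypes]
    counts = [0] * len(prototypes)
    for g, k in zip(features, assignments):
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += g[d]
    updated = []
    for k, p in enumerate(prototypes):
        if counts[k] == 0:
            updated.append(list(p))
            continue
        mean = [s / counts[k] for s in sums[k]]
        updated.append([momentum * p[d] + (1.0 - momentum) * mean[d]
                        for d in range(dim)])
    return updated

# Toy usage: two prototypes, two Gaussian features both assigned to cluster 0.
new_prototypes = ema_update_prototypes(
    [[0.0, 0.0], [1.0, 1.0]], [[2.0, 0.0], [4.0, 0.0]], [0, 0], momentum=0.5)
print(new_prototypes)
```

A momentum close to 1 keeps the prototypes near their spectral clustering initialization, while smaller values let them follow the learned features more quickly.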

A.2 Optimal transport (OT) associations

In this section, we detail the optimal transport association step from Section 3.2. Given a batch of $N$ pairs of pixel-aligned extracted/rendered features $\{{\bf F}^{\text{2D}}_{n},{\bf F}^{\text{3D}}_{n}\}_{n=1}^{N}$ and a set of $K$ prototypes $P\in{\rm I\!R}^{K\times D}$, we aim to associate one prototype with each pair of pixel-aligned features. We want this association to respect two criteria: 1) A single prototype must be associated with each pair of features, so that the extracted and rendered features are pushed toward the same "class" in the feature space. 2) Predictions must be as balanced as possible to avoid collapse.

To satisfy these constraints, we resort to optimal transport, framing the problem as finding a mapping $Q\in{\rm I\!R}^{N\times K}$ between pixels and prototypes that maximizes the similarity between the pairs of features and the prototypes. We define the joint feature/prototype similarities $S\in{\rm I\!R}^{K\times N}$ as:

\[
S_{kn}=\exp\left(\left({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{k}+{\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{k}\right)/\tau\right)/C,
\]

with

\[
C=\left(\sum_{k}\exp\left({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{k}/\tau\right)\right)\left(\sum_{k}\exp\left({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{k}/\tau\right)\right).
\]

We introduce the following objective, where $Q$ maximizes the joint feature/prototype similarities $S$:

\[
\max_{Q\in U(\frac{1}{N},\frac{1}{K})}\mathrm{Tr}\left(Q(-\log S)^{t}\right)+\lambda h(Q),
\]

where the entropy term $h(Q)$ encourages balanced predictions, while using joint extracted/rendered feature/prototype similarities yields a single association per feature pair. If we relax $Q$ such that it belongs to the transportation polytope $U(\frac{1}{N},\frac{1}{K})$ [17], $Q$ can be efficiently computed with the iterative Sinkhorn-Knopp algorithm [17]. The final associations $\Gamma\in{\rm I\!R}^{N}$ are obtained with $\Gamma=\text{argmax}_{k}(Q)$.
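The association step can be sketched in a few lines of dependency-free Python. The sketch computes the joint similarities $S$, runs Sinkhorn-Knopp scaling so that the rows of $Q$ sum to $1/N$ and its columns to $1/K$, and takes the per-row argmax. It assumes that the intended objective rewards high similarity, so the scaling starts from $S^{1/\lambda}$; the function names, iteration count, and toy values are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def joint_similarity(f2d, f3d, prototypes, tau):
    """S[k][n] = exp((F2D_n . p_k + F3D_n . p_k) / tau) / C_n."""
    n, k = len(f2d), len(prototypes)
    s = [[0.0] * n for _ in range(k)]
    for i in range(n):
        e2d = [math.exp(dot(f2d[i], p) / tau) for p in prototypes]
        e3d = [math.exp(dot(f3d[i], p) / tau) for p in prototypes]
        c = sum(e2d) * sum(e3d)  # normalization constant C for pixel i
        for j in range(k):
            s[j][i] = e2d[j] * e3d[j] / c
    return s

def sinkhorn(s, lam=1.0, n_iters=50):
    """Scale S^(1/lam) into Q (N x K) with row sums 1/N and column sums 1/K.

    Assumption: high similarity should receive high transport mass.
    """
    k, n = len(s), len(s[0])
    q = [[s[j][i] ** (1.0 / lam) for j in range(k)] for i in range(n)]
    for _ in range(n_iters):
        for j in range(k):  # balance prototypes: each column sums to 1/K
            col = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] *= (1.0 / k) / col
        for i in range(n):  # each pixel carries mass 1/N
            row = sum(q[i])
            for j in range(k):
                q[i][j] *= (1.0 / n) / row
    return q

def associate(q):
    """Gamma: index of the prototype with the largest mass for each pixel."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]

# Toy usage with N = 4 feature pairs and K = 2 prototypes (hypothetical values).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gamma = associate(sinkhorn(joint_similarity(feats, feats, prototypes, tau=0.5)))
print(gamma)
```

Because the 2D and 3D similarities are multiplied inside $S$, both features of a pair necessarily receive the same prototype, which is the first criterion stated above; the column constraint enforces the second.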

A.3 Projecting 3D Gaussians onto the triplane

In this subsection, we explain how a volumetric feature for each 3D Gaussian is derived from the triplane grid (Section 3.1). The triplane grid is centered at the origin of the world coordinate space and is composed of three orthogonal 2D planes $H_{xy},H_{xz},H_{yz}\in{\rm I\!R}^{D\times R\times R}$, $D$ being the triplane feature dimension and $R$ the resolution of the grid. We project each 3D Gaussian ${\mathcal{G}}_{i}$ onto these planes and derive three Gaussian kernels ${\mathcal{G}}^{xy}_{i},{\mathcal{G}}^{xz}_{i},{\mathcal{G}}^{yz}_{i}$ from the projections, which we use to obtain scale-aware volumetric features ${\bf g}_{i}^{\text{3D}}$.

To obtain the $xy$ feature, we proceed as follows (the $xz$ and $yz$ features are obtained similarly). Let $m_{i}$ be the center of ${\mathcal{G}}_{i}$ and $\Sigma_{i}$ its covariance matrix. We first project the center onto the plane, yielding $m_{i}^{xy}$. We then perform an orthographic projection of the covariance matrix to obtain $\Sigma_{i}^{xy}$. We define a grid of dimension 5 by 5 centered on $m_{i}^{xy}$. On the $xy$ plane, we use the coordinates ${\bf u}$ of the points in the grid to define the following Gaussian kernel:

\[
{\mathcal{G}}^{xy}_{i}({\bf u})=\frac{1}{Z}\exp\left(-\frac{{\bf u}\,(\Sigma_{i}^{xy})^{-1}\,{\bf u}^{t}}{2}\right),
\]

where $Z$ is a normalization constant. We query the feature plane $H_{xy}$ for each point $u_{k}$ in the grid and apply the Gaussian kernel to the queried features. This yields a $D$-dimensional feature ${\bf g}_{i}^{xy}$ associated with ${\mathcal{G}}^{xy}_{i}$. We repeat this operation for $H_{xz},{\mathcal{G}}^{xz}_{i}$ and $H_{yz},{\mathcal{G}}^{yz}_{i}$. The resulting features ${\bf g}_{i}^{xy},{\bf g}_{i}^{xz},{\bf g}_{i}^{yz}$ are summed to obtain the volumetric feature ${\bf g}_{i}^{\text{3D}}$ of ${\mathcal{G}}_{i}$.
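The per-plane computation can be sketched as follows. This is a simplified illustration under several assumptions that the text does not fix: the kernel is evaluated on offsets from the projected center, the grid spacing `step` is a free parameter, $Z$ is chosen so that the 25 weights sum to one, and `query` is a hypothetical callable that returns the $D$-dimensional plane feature at a 2D location (for instance by bilinear interpolation).

```python
import math

def project_covariance(sigma, axes=(0, 1)):
    """Orthographic projection of a 3x3 covariance: keep the 2x2 block of `axes`."""
    a, b = axes
    return [[sigma[a][a], sigma[a][b]], [sigma[b][a], sigma[b][b]]]

def kernel_weights(sigma2d, step=1.0, size=5):
    """Evaluate exp(-u S^-1 u^t / 2) on a size x size grid of offsets u."""
    (a, b), (c, d) = sigma2d
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    half = size // 2
    offsets = [(i * step, j * step)
               for i in range(-half, half + 1) for j in range(-half, half + 1)]
    raw = []
    for ux, uy in offsets:
        quad = (ux * (inv[0][0] * ux + inv[0][1] * uy)
                + uy * (inv[1][0] * ux + inv[1][1] * uy))
        raw.append(math.exp(-0.5 * quad))
    z = sum(raw)  # assumed normalization: the weights sum to one
    return offsets, [w / z for w in raw]

def plane_feature(query, center2d, sigma2d, step=1.0):
    """Kernel-weighted sum of the plane features queried on the grid."""
    offsets, weights = kernel_weights(sigma2d, step)
    feats = [query(center2d[0] + ux, center2d[1] + uy) for ux, uy in offsets]
    return [sum(w * f[d] for w, f in zip(weights, feats))
            for d in range(len(feats[0]))]

def volumetric_feature(g_xy, g_xz, g_yz):
    """Sum the three per-plane features into the volumetric feature."""
    return [a + b + c for a, b, c in zip(g_xy, g_xz, g_yz)]
```

Since the projected covariance controls the spread of the weights, a larger Gaussian pools features over a wider area of the plane, which is what makes the resulting feature scale-aware.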

Appendix B Implementation details

B.1 Additional training details

Similar to GoF [89], densification and pruning operations based on image space gradients are applied until iteration 15000. The densification interval is set to 600 iterations until iteration 7500, and reduced to 400 iterations afterward. To facilitate the convergence of the 3D Gaussian model, we train on images downscaled by a factor of 4 until iteration 7500. From iteration 15000, the geometric regularization losses from [89] are applied until the end of training. The triplane learning rate is set to 7e-3, while the encoder learning rate is set to 1e-4. The learning rates for the 3D Gaussian primitives are identical to those of GoF [89].
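The schedule above can be summarized in a small sketch. The function and key names are illustrative, and two details are assumptions rather than statements from the text: whether the boundary iterations (7500 and 15000) are inclusive, and that training switches to the full training resolution once the downscaling phase ends.

```python
def training_schedule(iteration):
    """Return the settings active at a given iteration (boundaries are assumed)."""
    densify = iteration <= 15000
    return {
        # Densification and pruning run until iteration 15000.
        "densify_interval": (600 if iteration <= 7500 else 400) if densify else None,
        # Images are downscaled by a factor of 4 until iteration 7500.
        "image_downscale": 4 if iteration <= 7500 else 1,
        # Geometric regularization losses are applied from iteration 15000.
        "geometric_regularization": iteration >= 15000,
    }

# Learning rates stated in the text; the 3D Gaussian rates follow GoF [89].
LEARNING_RATES = {"triplane": 7e-3, "encoder": 1e-4}
```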

GSFFs is optimized with the Adam optimizer [35]. The prototypes are updated based on an exponential moving average (EMA) scheme with $\alpha=0.9995$ after each training iteration. The temperature $\tau$ for the contrastive losses is set to 0.05. In the Multi-view consistency paragraph from Sec. 3.2, the pixel reprojection threshold is set to 2 for the fine level, and to 4 for the coarse level. The coarse encoder uses a pretrained DINOv2 [52] backbone, followed by projection convolutional layers (convolutional layers with kernel size 1 that reduce the dimension while maintaining the resolution) and a ConvNeXt block [42]. The fine encoder is composed of shallow convolutional layers followed by a ConvNeXt block [42]. The segmentation heads contain convolutional layers with ReLU activations and GroupNorm [81] normalization.
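To give an intuition for the EMA coefficient, the following sketch shows one update and the resulting half-life. It assumes the common convention in which $\alpha$ weights the previous prototype; the `target` argument (the new estimate towards which the prototype moves) is a placeholder, since its exact definition is not restated here.

```python
import math

ALPHA = 0.9995  # EMA coefficient from the text

def ema_update(prototype, target, alpha=ALPHA):
    """One EMA step: keep a fraction alpha of the old prototype (assumed convention)."""
    return [alpha * p + (1.0 - alpha) * t for p, t in zip(prototype, target)]

def ema_half_life(alpha=ALPHA):
    """Number of iterations after which an old contribution is halved."""
    return math.log(0.5) / math.log(alpha)
```

Under this convention, `ema_half_life()` evaluates to roughly 1386 iterations, so the prototypes change slowly relative to the network weights.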

We provide in Algorithm 1 the pseudocode describing the training process of GSFFs.

B.2 Data pre-processing

To reduce the running time and the memory footprint, images are rescaled such that the image width is 1024 pixels for Cambridge Landmarks and 480 pixels for Indoor6. The original image resolution of 640 by 480 pixels is used on 7Scenes. During visual localization, we use the same image resolution as the one used during training.

Cambridge Landmarks and Indoor6 contain images with illumination changes. We therefore learn an embedding per training image to capture these changes. These embeddings are only used during training to learn the scene representation. On Cambridge Landmarks, we mask out the sky and pedestrians during training. The masks are extracted using the semantic segmentation model from [59]. As Indoor6 contains day/night images with extreme illumination changes, we further apply CLAHE normalization to the images.

B.3 Visual localization setup

For the pose refinement, we use the Adam optimizer [35] with coarse/fine learning rates of 0.5/0.2 on Cambridge Landmarks, 0.3/0.2 on Indoor6, and 0.2/0.1 on 7Scenes respectively. The number of refinement steps for the coarse and fine level is set to 150/300 on Cambridge Landmarks and Indoor6, and 75/150 on 7Scenes. Rendered areas with high distortion (see [89] for the definition of distortion) are masked out during refinement. Additionally, the sky is masked out on Cambridge Landmarks.
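These settings can be collected in a small configuration sketch. The numerical values are those stated above. The coarse-then-fine ordering of the two stages and the `step_fn` callback (a hypothetical function performing one optimizer update on the pose) are assumptions made for illustration.

```python
REFINEMENT = {
    # dataset: (coarse_lr, fine_lr, coarse_steps, fine_steps)
    "Cambridge Landmarks": (0.5, 0.2, 150, 300),
    "Indoor6": (0.3, 0.2, 150, 300),
    "7Scenes": (0.2, 0.1, 75, 150),
}

def refinement_stages(dataset):
    """Return the (level, learning rate, steps) triplets for a dataset."""
    coarse_lr, fine_lr, coarse_steps, fine_steps = REFINEMENT[dataset]
    return [("coarse", coarse_lr, coarse_steps), ("fine", fine_lr, fine_steps)]

def refine(pose, dataset, step_fn):
    """Run the coarse stage, then the fine stage.

    `step_fn(pose, level, lr)` is a hypothetical callback that performs one
    update and returns the new pose.
    """
    for level, lr, steps in refinement_stages(dataset):
        for _ in range(steps):
            pose = step_fn(pose, level, lr)
    return pose
```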

\begin{tabular}{llcccc}
\hline
Pass & N Classes & KingsCollege & OldHospital & ShopFacade & StMarysChurch \\
\hline
FP & 34 & 0.034 & 0.032 & 0.024 & 0.028 \\
FP & 84 & 0.046 & 0.043 & 0.027 & 0.031 \\
\hline
BP & 34 & 0.065 & 0.065 & 0.071 & 0.076 \\
BP & 84 & 0.462 & 0.401 & 0.287 & 0.392 \\
\hline
\end{tabular}
Table 6: Computation time (s) for a forward pass FP (rendering) and a backward pass BP (pose optimization) on Cambridge Landmarks for GSFFs trained with 34 classes and 84 classes.

B.4 Training time and rendering quality

In Table 7, we report novel view rendering quality, evaluated with the PSNR, SSIM [80], and LPIPS [92] image metrics, as well as the corresponding model training time for the Cambridge Landmarks scenes. From these results, we observe that GSFFs-Feature, compared to GoF [89], adds computational overhead (the cost of training the feature fields) but leaves the novel view synthesis quality essentially unchanged. Note that GSFFs-Privacy cannot render RGB images for privacy reasons.

We also provide the rendering time and the backward pass time for the pose optimization in Table 6 for 34 and for 84 classes. We observe that while the forward pass time increases only slightly, the cost of the backward pass increases significantly with the number of classes. Overall, we found in our experiments that using 34 classes is a good compromise between rendering speed and pose accuracy (see Table 2 for the accuracies and Table 6 for the timings).
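As a rough sanity check of these timings, assume that every refinement step costs one forward and one backward pass at the per-pass times of Table 6, and that coarse and fine steps cost the same. This is a simplification made for illustration only. With the 150/300 refinement steps used on Cambridge Landmarks, the estimated time per query on KingsCollege is

\[
\begin{aligned}
t_{34} &\approx (150+300)\times(0.034+0.065)\,\text{s} = 44.55\,\text{s},\\
t_{84} &\approx (150+300)\times(0.046+0.462)\,\text{s} = 228.6\,\text{s}.
\end{aligned}
\]

The 34-class estimate is close to the runtime of 45 s per query reported in Table 9 for images of width 1024, while under the same assumption 84 classes would be roughly five times slower.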

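\begin{tabular}{lccccc}
\hline
 & Training & KC & OH & SF & SMC \\
 & time (h) & \multicolumn{4}{c}{PSNR / SSIM / LPIPS} \\
\hline
GoF [89] & 0.7 & 16.95/0.69/0.27 & 16.98/0.63/0.30 & 20.04/0.74/0.21 & 18.49/0.71/0.25 \\
GSFFs & 10.8 & 16.92/0.69/0.27 & 16.94/0.61/0.30 & 20.02/0.74/0.21 & 18.47/0.71/0.26 \\
\hline
\end{tabular}
Table 7: Evaluating novel view rendering.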

Appendix C Additional Localization Experiments

\begin{tabular}{lcccccccccccc}
\hline
Model & kitchen & living & bed & kitchen & living & luke & gates362 & gates381 & lounge & manolis & office2 5a & office2 5b \\
\hline
\multicolumn{13}{l}{SfM pGT} \\
GSFFs-PR Feature (34 Classes) (NV) & 0.3/0.2/99 & 0.3/0.18/100 & 0.4/0.17/100 & 0.7/0.42/91 & 0.4/0.21/96 & 0.6/0.27/97 & 0.5/0.23/100 & 0.5/0.27/99 & 0.8/0.29/97 & 0.5/0.22/99 & 0.9/0.41/99 & 1.1/0.41/94 \\
GSFFs-PR Privacy (34 Classes) (NV) & 0.6/0.32/97 & 0.6/0.29/100 & 0.5/0.23/100 & 0.9/0.51/75 & 0.7/0.30/99 & 0.8/0.30/96 & 0.8/0.31/98 & 0.7/0.36/98 & 1.6/0.54/91 & 0.7/0.31/97 & 1.3/0.65/95 & 1.8/0.83/69 \\
\hline
\multicolumn{13}{l}{DSLAM pGT} \\
NeRF-SCR [11] & 0.9/0.5 & 2.1/0.6 & 1.6/0.7 & 1.2/0.5 & 2.0/0.8 & 2.6/1.0 & 2.0/0.8 & 2.7/1.2 & 1.8/0.6 & 1.6/0.7 & 2.5/0.9 & 2.6/0.8 \\
PNeRFLoc [95] & 1.0/0.6 & 1.5/0.5 & 1.2/0.5 & 0.8/0.4 & 1.4/0.5 & 8.1/3.3 & 1.6/0.7 & 8.7/3.2 & 2.3/0.8 & 1.1/0.5 & X & 2.8/0.9 \\
GSPlatloc [90] & 0.8/0.4 & 1.1/0.4 & 1.2/0.5 & 1.0/0.5 & 1.2/0.5 & 1.5/0.6 & 1.1/0.5 & 1.2/0.5 & 1.6/0.5 & 1.1/0.5 & 1.4/0.6 & 1.5/0.5 \\
NeuraLoc [91] & 0.9/0.5 & 1.1/0.4 & 1.3/0.6 & 1.0/0.6 & 1.2/0.5 & 1.4/0.7 & 1.1/0.5 & 1.1/0.5 & 1.7/0.6 & 1.0/0.5 & 1.3/0.6 & 1.5/0.5 \\
GSFFs-PR Feature (34 Classes) (NV) & 0.7/0.4 & 1.1/0.4 & 1.1/0.4 & 0.8/0.4 & 1.0/0.5 & 1.3/0.5 & 1.1/0.4 & 1.2/0.5 & 1.4/0.5 & 0.8/0.4 & 1.2/0.5 & 1.7/0.6 \\
\hline
\end{tabular}
Table 8: Localization results on 12Scenes (SfM pGT top rows / DSLAM pGT bottom rows). Median pose error (cm) ($\downarrow$) / median angle error (°) ($\downarrow$) / recall at 5cm/5° (\%) ($\uparrow$).

C.1 12Scenes Dataset

In Table 8, we report localization results on the 12Scenes [75] dataset for both the SfM pseudo ground truth (pGT) and the DSLAM pGT. We compare our GSFFs-Feature model against four non-privacy-preserving baselines: NeRF-SCR [11], PNeRFLoc [95], NeuraLoc [91], and GSPlatloc [90].

GSFFs clearly outperforms the rendering-based approaches [11, 95] and is more accurate than [90, 91] on most of the scenes.
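For reference, the three numbers reported per scene in Table 8 can be computed from per-query errors as in the following sketch. The function name and the input layout (a list of translation/rotation error pairs, in cm and degrees) are illustrative choices, and the strict inequality at the threshold is an assumption.

```python
from statistics import median

def summarize(errors, max_cm=5.0, max_deg=5.0):
    """Return (median translation error, median rotation error, recall in %).

    `errors` is a list of (translation_cm, rotation_deg) pairs, one per query.
    """
    trans = [t for t, _ in errors]
    rots = [r for _, r in errors]
    # A query counts as localized if both errors are below their thresholds.
    hits = sum(1 for t, r in errors if t < max_cm and r < max_deg)
    return median(trans), median(rots), 100.0 * hits / len(errors)
```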

C.2 Varying the number of prototypes

In this section, we study the influence of the number of classes/prototypes on both the feature- and the segmentation-based variants of our approach. All experiments in the main paper use a feature dimension of 16 with 34 segmentation classes. As we render both features and segmentations in the same variable (each channel is independently rendered), we compile the rasterizer with a fixed rendering dimension of 50 (16 + 34) yielding a good compromise between rendering speed and discriminative power.

Figure 3: Median pose error (cm) ($\downarrow$) and median angle error (°) ($\downarrow$) on Cambridge Landmarks for models trained with 15/34/59/84 classes.

In this ablation, we maintain a feature dimension of 16 and vary the number of classes. We compile the rasterizer with a map dimension of 100 (16 + 84) and then train and evaluate GSFFs-PR with 84 classes, 59 classes, and 15 classes. In Fig. 3, we plot the median translation and rotation errors on Cambridge Landmarks against the number of classes of each model. In Table 4, we compare 34 classes versus 84 on the Indoor6 dataset. From these results, we derive the following observations. Using only a few classes is not sufficient because of the lack of discriminative power. Increasing the number of classes first yields a significant gain; however, above a certain limit we observe a drop in accuracy. We suspect that the reason is over-clustering, which makes it harder for the representation to converge during training.

Overall, the number of classes must be high enough to make the segmentation sufficiently discriminative for localization, but not so high that convergence is hampered. Naturally, larger scenes with diverse viewpoints will require more classes, while for simpler scenes it is better to use fewer classes.

Furthermore, we can see from Table 2 that the GSFFs-PR Feature localization pipeline also benefits from the increased number of classes since, during training, gradients are backpropagated from both segmentation and feature maps and $L_{PRO}$ implicitly uses the prototypes.

C.3 Effect of the initialization

In Table 9, we provide pose refinement results on Cambridge Landmarks with different initial poses. Note that poses estimated with ACE [6] and HLoc [60] are much more accurate than those from DenseVLAD (DV) [73]. We observe that, as expected, improving the initial pose results in higher final GSFFs-PR localization accuracy. It also requires fewer refinement steps to converge. Starting from a wide baseline such as DenseVLAD [73], the model needs more refinement steps, but the optimization ultimately reaches high accuracy (especially for high-resolution images), showing the robustness of GSFFs-PR.

\begin{tabular}{lllllcccc}
\hline
Method & Init. & Image Res. & Coarse-Fine Steps & Runtime / Query (s) & KC & OH & SF & SMC \\
 & & & & & \multicolumn{4}{c}{cm / deg} \\
\hline
[64] & [64] & ? & 350 & 3 & 27/0.46 & 20/0.71 & 5/0.36 & 16/0.61 \\
[40] & ACE & 512 & x & 0.2 & 25/0.29 & 26/0.38 & 5/0.23 & 13/0.41 \\
\hline
DV & - & - & - & - & 280/5.7 & 401/7.1 & 111/7.6 & 231/8 \\
Vanilla-GS & DV & 1024 & 150-300 & 45 & 21.9/0.34 & 22.3/0.42 & 4.5/0.27 & 11.8/0.35 \\
GSFFs-PR & DV & 1024 & 150-300 & 45 & 17.9/0.27 & 21.4/0.41 & 4.1/0.26 & 10.4/0.30 \\
GSFFs-PR & DV & 480 & 150-300 & 8.1 & 19.7/0.27 & 21.7/0.36 & 4.7/0.23 & 8.7/0.29 \\
GSFFs-PR & DV & 480 & 50-50 & 1.8 & 74.6/1.26 & 162/2.81 & 16.4/0.73 & 90.2/2.68 \\
\hline
ACE & - & - & - & - & 28.0/0.4 & 31.0/0.6 & 5.0/0.3 & 18.0/0.6 \\
GSFFs-PR & ACE & 1024 & 150-300 & 45 & 16.1/0.22 & 17.9/0.34 & 3.9/0.18 & 7.6/0.23 \\
GSFFs-PR & ACE & 480 & 150-300 & 8.1 & 17.8/0.22 & 18.7/0.33 & 4.7/0.21 & 8.5/0.25 \\
GSFFs-PR & ACE & 480 & 75-150 & 4 & 18.2/0.24 & 21.1/0.35 & 4.8/0.22 & 8.5/0.25 \\
GSFFs-PR & ACE & 480 & 50-50 & 1.8 & 19.1/0.27 & 25.4/0.49 & 5.1/0.26 & 11.1/0.35 \\
GSFFs-PR & ACE & 480 & 25-25 & 0.9 & 19.8/0.44 & 25.6/0.66 & 5.8/0.45 & 13.2/0.53 \\
\hline
HLoc & - & - & - & - & 11.1/0.20 & 15.5/0.31 & 4.4/0.2 & 7.1/0.24 \\
GSFFs-PR & HLoc & 1024 & 0-50 & 5 & 10.9/0.18 & 14.1/0.29 & 3.9/0.19 & 6.4/0.21 \\
GSFFs-PR & HLoc & 480 & 0-50 & 0.9 & 11.0/0.19 & 14.2/0.30 & 3.9/0.18 & 6.4/0.22 \\
\hline
\end{tabular}
Table 9: Varying initialization and image resolution.

Figure 4: Left to right: original image, image inversion attack from rendering our features (middle, GSFFs-PR Feature) or rendering our segmentation (right, GSFFs-PR Privacy).

C.4 Varying resolution and refinement steps

In Table 9, we also provide results with different numbers of coarse/fine refinement steps, image resolutions, and runtimes per query. We can see that performance on high-resolution images is slightly better than on low-resolution images, but this comes at a higher inference cost. We can further decrease the running time by decreasing the number of refinement steps. The loss in performance is relatively small provided that we start from a good initialization. This suggests that we could further increase the localization speed by combining low- and high-resolution refinement and stopping the optimization earlier.

Appendix D Privacy attack

To visualize and assess the degree of privacy of GSFFs-PR-Privacy, we train an inversion model [58, 55] to reconstruct images from rendered segmentations. As a baseline, we train another inversion model to reconstruct images from feature maps rendered from GSFFs-PR-Feature. Both inversion models are trained on 6 scenes (fire, heads, office, pumpkin, redkitchen, stairs) from the 7Scenes dataset and evaluated on the remaining scene, chess. Example reconstructed images from both GSFFs-PR-Feature and GSFFs-PR-Privacy are displayed in Fig. 4. We observe that images reconstructed from GSFFs-PR-Feature reveal many scene details, while images reconstructed from GSFFs-PR-Privacy obfuscate privacy-sensitive information.

Appendix E Visualizations

We show in Fig. 5 pairs of 2D extracted / 3D rendered segmentations for the coarse and fine levels, comparing segmentations from models trained with 34 and 84 classes. The segmentation classes of the coarse level capture larger and less sharply defined segments in the image compared to the fine level segmentations. As shown in the ablation in Table 5, this allows us to increase the convergence basin of the pose refinement, while improving fine accuracy. Using 84 classes instead of 34 results in visually finer-grained segmentations, which in turn allows for more accurate pose refinement.

Figure 5: From left to right: coarse encoder/rendered segmentation, fine encoder/rendered segmentation. Comparison between models trained with 34 classes (rows 1/3/5/7) and models trained with 84 classes (rows 2/4/6/8). The 2D extracted and 3D rendered segmentations are well aligned, which allows for accurate privacy-preserving visual localization.