
¹ Faculty of Electrical Engineering, Czech Technical University in Prague, {firstname.lastname}@cvut.cz
² Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague
³ NAVER LABS Europe, {firstname.lastname}@naverlabs.com
Teaser figure: We extract 2D feature ${\bf F}^{\text{2D}}$ or segmentation ${\bf S}^{\text{2D}}$ maps from a query image with unknown pose. We estimate the pose by aligning the maps with feature ${\bf F}^{\text{3D}}$ or segmentation ${\bf S}^{\text{3D}}$ maps rendered from our Gaussian Splatting Feature Fields (GSFFs), which are learned in a self-supervised way. Using segmentations instead of features makes our localization pipeline privacy-preserving.

Gaussian Splatting Feature Fields for (Privacy-Preserving) Visual Localization

Abstract

Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. Localization is achieved through pose refinement, which aligns either feature maps or segmentations from a query image with those rendered from the GSFFs scene representation. The resulting privacy- and non-privacy-preserving localization pipelines, evaluated on multiple real-world datasets, show state-of-the-art performance.

1 Introduction

Visual localization (VL), a core part of self-driving cars [27] and autonomous robots [38], is the task of estimating the 6DoF camera pose from which an image was captured.

VL methods can be distinguished by how they represent scenes and how they estimate the pose of a query image w.r.t. the scene representation. Popular choices for the former include 3D Structure-from-Motion (SfM) point clouds [29, 60, 24, 30, 61], databases of images with known intrinsic and extrinsic parameters [97, 2, 93, 63], the weights of neural networks [33, 7, 48, 4, 6], or dense renderable representations such as meshes [54, 74], neural radiance fields [49, 14, 56, 99], or 3D Gaussian Splatting [41, 68, 90, 3]. Common VL approaches are based on establishing 2D-3D correspondences via feature matching [60, 71, 54, 49, 99] or scene coordinate regression [4, 6], relative pose estimation from feature matches [97, 96, 2, 93] or regression [50, 97], absolute pose regressed via neural networks [78, 12, 13, 48], or feature-based pose refinement [74, 61, 55] of an initial pose estimate. While local feature matching-based approaches provide the most accurate camera pose estimates, scale to large scenes [62, 30], handle changing conditions [72] and fit onto mobile devices [44, 38], they suffer from potential privacy-related issues as image details can be recovered from the feature descriptors [58, 9].

Since VL systems are often deployed through cloud-based solutions, preserving the privacy of user-uploaded images and scenes is a critical aspect [70, 69]. An interesting approach to both query and scene privacy is SegLoc [55], a pose refinement-based method that represents the scene as a sparse SfM point cloud. Most refinement-based methods project features associated with the scene geometry into the query image [61, 76, 77, 56, 49, 74]. An initial pose estimate, e.g., obtained via image retrieval, is then optimized by aligning the projected features with features extracted from the query image. To increase privacy while decreasing the memory required to store high-dimensional features, SegLoc [55] proposes to quantize the features into integer values, which is equivalent to assigning segmentation labels to pixels in the query image and to the 3D scene points. These quantized representations lead to better privacy preservation: only coarse image information without any details can be recovered from both 2D segmentation images and 3D point clouds with associated labels. Final poses are refined by maximizing label consistency between the projected points and pixel labels.

The segmentations used by SegLoc are learned purely in 2D, providing no guarantee that the predicted labels are consistent between viewpoints. In contrast, [56] jointly trains a dense scene representation (in the form of a NeRF) together with an implicit feature field, ensuring multi-view consistency and state-of-the-art feature-based pose refinement results. Yet, [56] is not privacy-preserving. In this work, we therefore investigate learning a feature field that can be used for privacy-preserving visual localization jointly with a dense scene representation. We choose 3D Gaussian Splatting (3DGS) [34] due to its fast rendering time, which makes it particularly suited for dense pose refinement. Compared to SegLoc, it allows us to ground representation learning in 3D. Additionally, the explicit and finite nature of 3DGS is better suited for privacy-preserving localization than NeRFs as, similarly to [55], we can simply quantize the features associated with the Gaussians.

Some prior [3] and concurrent works [41, 68, 90] also use 3DGS in the context of visual localization. However, their pipelines use matching-based solutions, whereas we take advantage of the ability to densely render from any viewpoint within the scene to perform feature-metric or segmentation-based pose refinement. We explicitly backpropagate through the rasterizer on the se(3) Lie algebra, which yields more accurate pose estimates. Furthermore, instead of relying on pre-trained features as in [74], we learn our features in a self-supervised manner, defining a Gaussian Splatting Feature Field (GSFFs), which associates 3D Gaussians with volumetric features extracted with a kernel-based encoding based on the covariance of the 3D Gaussians. These features are rendered and aligned to features provided by the 2D encoder (which is jointly trained with the GSFFs) through contrastive losses. We apply further regularization by leveraging the geometry of the 3DGS model and by spatially clustering the GSFFs. These clusters enable converting features into segmentations that can be used to perform effective privacy-preserving pose refinement (cf. the teaser figure).

To summarize, our contributions are threefold: 1) We introduce Gaussian Splatting Feature Fields (GSFFs), a novel representation for VL that is jointly learnt with the image feature encoder and enables pose refinement by aligning rendered and extracted features. 2) The explicit nature of the 3D representation in GSFFs allows us to spatially cluster the Gaussian cloud and to segment the feature field based on cluster centers, yielding a privacy-preserving 3D scene representation. Similar to [55], the resulting method is based on aligning discrete segmentation labels, thus leading to similar privacy-preserving properties. 3) We demonstrate the accuracy of both GSFFs-based localization pipelines (based on features and on segmentations, respectively) on multiple real-world datasets.

Our approaches outperform prior and concurrent work for privacy- and non-privacy-preserving visual localization.

2 Related Works

Gaussian Splatting. 3D Gaussian Splatting [34] represents a scene as a set of Gaussian primitives and renders images by rasterizing Gaussians using depth ordering and alpha-blending, thus allowing efficient training and high-resolution real-time rendering. Thanks to these properties, 3D Gaussians have been applied to numerous tasks including SLAM [32, 46], dynamic scene modeling [43, 84], scene segmentation [31, 37], and surface reconstruction [28, 26, 89]. Extensions have tackled anti-aliasing [88], sparse view setups [82, 87], removing reliance on estimated poses [85], and regularization [16, 15, 94]. In this paper, we adopt the Gaussian Opacity Fields model [89] as our 3D representation, which integrates the anti-aliasing module from [88] and offers highly accurate geometry through regularization, which is critical for visual localization.

Camera pose refinement-based visual localization. Pose refinement approaches minimize, with respect to the pose, the difference between the query image and a rendering obtained by projecting the scene representation into the image using the current pose estimate [1, 21, 22, 65, 64]. The pose is iteratively refined from an initial estimate. L2 distances between deep features are often used to measure the differences between the query image and the projected features from the scene [76, 77, 61, 25, 40, 83]. [74] integrates pre-trained features with a particle filter, thus removing the need to learn representations per scene. [55] replaces deep features with segmentation labels and aligns them with 2D segmentation maps. In contrast, our GSFFs Pose Refinement pipeline (GSFFs-PR) associates a deep feature or a segmentation label to each 3D Gaussian primitive and performs pose refinement by 1) rasterizing these quantities and 2) explicitly backpropagating the feature-metric or segmentation errors through the rasterizer with respect to the camera pose.

Rendering-based visual localization. Leveraging the novel view synthesis capability of NeRFs, iNeRF [86, 39] is the first to perform pose refinement through photometric alignment, where the optimization is performed by direct gradient descent back-propagation through the neural field. [45, 39, 74] parallelize this process with faster neural radiance fields or particle filters. NeRF-based representations have also been used to match [49, 99] or align rendered features from an implicit field [14, 56].

Multiple visual localization works have emerged that leverage efficient 3DGS models instead of NeRFs. [3] matches rays emerging from ellipsoids with pixels using an attention mechanism and estimates the pose with a closed-form solution. [41] performs matching between the query and a rendered image, lifting the matches to 3D with the rendered depth to refine the pose by solving a PnP problem. Similar to NeRF-based approaches, [68, 90, 91] distill pretrained features into a Gaussian-based 3D feature field. These methods estimate poses by matching these features, followed by PnP+RANSAC. In contrast, we do not rely on pretrained features but learn our representations in a self-supervised manner. Furthermore, we solely estimate query poses through feature-metric refinement by explicitly backpropagating through the rasterizer, which yields higher accuracy even without an additional PnP+RANSAC step. Finally, contrary to concurrent works, our 3D features take scale into account.

Privacy-preserving visual localization. Pittaluga et al. [58] have shown that detailed and recognizable images of a scene can be obtained from sparse 3D point clouds associated with local descriptors. While [70, 69, 66] tried to address this by transforming 3D point clouds into 3D line clouds, it was shown in [8, 36] that these line clouds still preserve a significant amount of information about the scene geometry that can be used to recover 3D point clouds, hence enabling the inversion attack [58]. Instead, [36] lifts pairs of points to lines and [47] relies on a few spatial anchors, resulting in 3D lines with non-uniformly sampled distributions. Another strategy is to swap coordinates between random pairs of points [53] or to perform partial pose estimation against distributed partial maps [23], but as shown in [9], the point positions may be recovered by identifying point neighborhoods from co-occurring descriptors.

Another line of work studies the privacy on the feature representation level. [20, 57] embed descriptors in subspaces designed to withstand different privacy attacks. Closer to our work, [98, 79, 55] remove the need for high-dimensional features. [98, 79] match keypoints and 3D points without descriptors and [55] replace high-dimensional descriptors with segmentation labels. Similar to [55], we also rely on labels to ensure privacy, but our dense label rendering through 3DGS achieves better localization accuracy than [55], which relies on reprojected sparse labels.

Figure 1: Training pipeline. We extract a feature map ${\bf F}^{\text{2D}}$ and segmentation map ${\bf S}^{\text{2D}}$ from a training image with known pose (left). For each 3D Gaussian ${\mathcal{G}}_{i}$ (right), a scale-aware feature ${\bf g}_{i}$ is extracted from a triplane representation. Spectral clustering is applied on the Delaunay graph derived from the Gaussian cloud, yielding a set of prototypes $P$. A label is associated with each 3D Gaussian by assigning the volumetric features to the prototypes. The features and labels are then rendered to obtain the feature map ${\bf F}^{\text{3D}}$ and the segmentation map ${\bf S}^{\text{3D}}$, which are respectively aligned with their encoder counterparts ${\bf F}^{\text{2D}}$ and ${\bf S}^{\text{2D}}$ through $L_{NCE},L_{PRO}$ and $L_{CE}$.

3 3DGS Feature Fields (GSFFs)

First, in Section 3.1, we introduce our Gaussian Splatting Feature Field (GSFFs), which associates a scale-aware feature with each 3D Gaussian. We describe how to jointly optimize it along with a 2D feature extractor by aligning rendered and 2D extracted features in a self-supervised manner. In Section 3.2, we propose to further improve the features' discriminativeness and their 3D awareness by clustering the 3D Gaussian cloud and encouraging both features in a pixel-aligned pair of 2D/3D features to be close to the same prototype through an auxiliary contrastive loss. These prototypes not only help distill spatial information into the feature space, but also allow for a smooth transition from feature maps to segmentation maps, enabling privacy-preserving localization (as described in Section 4). Finally, in Section 3.3, we describe the GSFFs-PR Feature visual localization pipeline, which is based on pose refinement through feature alignment. Its extension, the privacy-preserving GSFFs-PR Privacy visual localization pipeline based on segmentations, is described in Section 4. We show an overview of the architecture and training pipeline in Fig. 1.

3.1 Scene Representation

We adopt the Gaussian Opacity Fields model [89] as our Gaussian Field representation, where a scene is composed of a set of 3D Gaussian primitives parametrized by their center, scaling and rotation matrices, opacity and spherical harmonics coefficients. Given a ray $r$ emanating from a camera pose $P$ , [89] finds the intersection between the ray and 3D Gaussians and computes the contribution $C_{i}(r,P)$ of each Gaussian ${\mathcal{G}}_{i}$ traversed by the ray. The Gaussians are then ordered based on depth and the pixel’s color is obtained by alpha blending

\begin{equation}
c(r,P)=\sum_{i=1}^{N}c_{i}\alpha_{i}C_{i}(r,P)\prod_{j=1}^{i-1}(1-\alpha_{j}C_{j}(r,P))\enspace,
\tag{1}
\end{equation}

where $c_{i}$ is the view-dependent color, modeled with spherical harmonics associated with ${\mathcal{G}}_{i}$ and $\alpha_{i}$ are the blending weights. The ray-Gaussian intersection formulation allows for depth distortion and normal consistency regularization [89], which improves the geometry of the 3D scene.
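
To make Eq. 1 concrete, the following is a minimal NumPy sketch of the compositing for a single ray. It assumes the Gaussians are already sorted by depth and that the contributions $C_{i}(r,P)$ are given; the function and argument names are illustrative and are not taken from [89].

```python
import numpy as np

def alpha_blend(values, alphas, contributions):
    """Composite per-Gaussian values along one ray, front to back (Eq. 1).

    values:        (N, C) per-Gaussian colors (or features / one-hot labels)
    alphas:        (N,)   blending weights alpha_i
    contributions: (N,)   ray-Gaussian contributions C_i(r, P) in [0, 1]
    All inputs are assumed to be sorted by increasing depth already.
    """
    values = np.asarray(values, dtype=float)
    w = np.asarray(alphas, dtype=float) * np.asarray(contributions, dtype=float)
    # Transmittance before Gaussian i: prod_{j<i} (1 - alpha_j * C_j).
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - w)[:-1]))
    return (values * (w * transmittance)[:, None]).sum(axis=0)

# Toy ray: a half-transparent red Gaussian in front of an opaque blue one.
colors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
pixel = alpha_blend(colors, alphas=[0.5, 1.0], contributions=[1.0, 1.0])
print(pixel)  # equal parts red and blue
```

Because the blending weights do not depend on what is being blended, the same routine applies unchanged when colors are swapped for features or label vectors.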

3D feature fields. Similar to Eq. 1, features or segmentation labels can be rendered through the same alpha blending by simply replacing each Gaussian's color by the feature or segmentation label assigned to the Gaussian. However, this requires that we assign to each Gaussian ${\mathcal{G}}_{i}$ a feature, which in general might be extremely costly, especially for high-dimensional features. To avoid this, instead of associating independent features with each Gaussian, we introduce a feature field parametrized by a triplane grid [10]. It is centered at the origin of the world coordinate space and is composed of three two-dimensional orthogonal planes $H_{xy},H_{xz},H_{yz}\in{\rm I\!R}^{R\times R}$, where $R$ is the resolution of the grid. For simplicity, we will also denote by $H_{xy},H_{xz},H_{yz}\in{\rm I\!R}^{R\times R\times D}$ (and use them interchangeably) the three corresponding feature tensors, where $D$ is the dimensionality of the feature space we aim to learn.

To compute the volumetric feature ${\bf g}_{i}$ associated with a 3D Gaussian ${\mathcal{G}}_{i}$ from the triplane, we proceed as follows. We project the 3D Gaussian ${\mathcal{G}}_{i}$ onto the three planes and compute for the three resulting 2D Gaussians ${\mathcal{G}}^{xy}_{i},{\mathcal{G}}^{xz}_{i},{\mathcal{G}}^{yz}_{i}$ three corresponding features ${\bf g}^{xy}_{i},{\bf g}^{xz}_{i},{\bf g}^{yz}_{i}$ by applying an RBF kernel parametrized by the 2D Gaussians (as detailed in the Appendix). The three grid features are averaged, yielding the volumetric feature ${\bf g}^{\text{3D}}_{i}$ (also written ${\bf g}_{i}$) associated with the Gaussian ${\mathcal{G}}_{i}$. This representation, called Gaussian Splatting Feature Field (GSFFs), enables the features to be scale-aware, as 3D Gaussians with a large spatial span aggregate feature information over a large area within the grid, and small Gaussians over smaller areas. Further advantages of the GSFFs representation are: 1) it allows sharing information between Gaussians based on overlapping projections onto the planes, as the Gaussian features are optimized interdependently through rasterization, and 2) the field can be queried from any 3D position, making it suitable for 3DGS splitting and merging mechanisms.
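
The exact kernel is given in the Appendix; the sketch below only illustrates the idea under explicit assumptions. It assumes that projecting a 3D Gaussian onto a plane amounts to keeping the matching sub-block of its mean and covariance, that each plane covers $[-e,e]^2$ with features stored at cell centers, and that the RBF weights are normalized. The names `plane_feature`, `gaussian_feature`, and `extent` are hypothetical.

```python
import numpy as np

def plane_feature(plane, mean2d, cov2d, extent=1.0):
    """Kernel-weighted average of an (R, R, D) feature plane under a 2D Gaussian."""
    R = plane.shape[0]
    # Cell centers of a grid spanning [-extent, extent]^2 (assumed layout).
    ticks = (np.arange(R) + 0.5) / R * 2.0 * extent - extent
    gx, gy = np.meshgrid(ticks, ticks, indexing="ij")
    diff = np.stack([gx, gy], axis=-1) - mean2d              # (R, R, 2)
    maha = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov2d), diff)
    w = np.exp(-0.5 * maha)                                   # RBF weights
    return (w[..., None] * plane).sum(axis=(0, 1)) / w.sum()

def gaussian_feature(planes, mean, cov, extent=1.0):
    """Average the three plane features of one 3D Gaussian (mean (3,), cov (3, 3))."""
    axes = {"xy": [0, 1], "xz": [0, 2], "yz": [1, 2]}
    feats = [plane_feature(planes[name], mean[idx], cov[np.ix_(idx, idx)], extent)
             for name, idx in axes.items()]
    return np.mean(feats, axis=0)

rng = np.random.default_rng(0)
planes = {name: rng.normal(size=(8, 8, 4)) for name in ("xy", "xz", "yz")}
center = np.array([0.125, 0.125, 0.125])                      # center of cell (4, 4)
g_small = gaussian_feature(planes, center, 1e-4 * np.eye(3))  # reads a single cell
g_large = gaussian_feature(planes, center, 1e2 * np.eye(3))   # averages the planes
```

In this toy setting, the small Gaussian effectively reads the feature of one grid cell per plane, whereas the large Gaussian pools over almost the entire plane, which is the scale-aware behavior described above.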

Self-supervised training. Our aim is to perform pose refinement based on aligning pixel-level 2D encoded features from the image with 3D rendered features. Hence, it is important for these features to be locally discriminative and robust to viewpoint changes. Formally, we want to learn the GSFFs feature field jointly with a 2D encoder such that when we render, at pose $P$, a feature map ${\bf F}^{\text{3D}}$ from the feature field, it is aligned with the corresponding 2D feature map ${\bf F}^{\text{2D}}$ of the image $I$ associated with the pose $P$. The rendered feature map ${\bf F}^{\text{3D}}$ is obtained with alpha blending similarly to Eq. 1, where for each pixel $u$, we replace $c_{i}$ with ${\bf g}^{\text{3D}}_{i}$. To align ${\bf F}^{\text{3D}}$ and ${\bf F}^{\text{2D}}$, we train the model in a self-supervised manner with the contrastive loss [51]:

\begin{equation}
L_{NCE}=-\frac{1}{2HW}\sum_{u\in I}\log\left(\frac{\exp{\left({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{u}/\tau\right)}^{2}}{A}\right),
\tag{2}
\end{equation}

where $\tau$ is the temperature parameter, and $A$ is a normalizing factor (detailed in the Appendix).
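
One plausible reading of Eq. 2 is sketched below, under the assumption that $A$ is the product of the two directional partition sums (3D-to-2D and 2D-to-3D), in which case the loss becomes the average of two standard InfoNCE terms. The actual definition of $A$ is in the Appendix, so this is illustrative only; `symmetric_info_nce` is a hypothetical name.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def symmetric_info_nce(f3d, f2d, tau=0.1):
    """f3d, f2d: (M, D) L2-normalized features of M pixel-aligned pairs."""
    logits = f3d @ f2d.T / tau          # (M, M) similarity of every 3D/2D pair
    pos = np.diag(logits)               # matching pairs sit on the diagonal
    loss_3d_to_2d = -(pos - logsumexp(logits, axis=1)).mean()
    loss_2d_to_3d = -(pos - logsumexp(logits, axis=0)).mean()
    return 0.5 * (loss_3d_to_2d + loss_2d_to_3d)

# Toy check: aligned pairs give a much lower loss than shuffled pairs.
feats = np.eye(4)
loss_aligned = symmetric_info_nce(feats, feats)
loss_shuffled = symmetric_info_nce(feats, np.roll(feats, 1, axis=0))
```

The negatives here are the other pixels in the batch; which negatives the method actually uses is not specified in this section.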

3.2 Prototypical Feature Regularization

To better align the features and prepare the transition from features to segmentations, we propose to structure the feature space around a set of classes. To that end, we cluster the feature field, derive 3D prototypes, and apply an auxiliary contrastive loss to enforce that corresponding pixel-aligned 2D extracted and 3D rendered features are close to the same prototypes in the feature space.

Spatial prototypes. We want to find a set of feature prototypes that best encode the spatial prior reflecting the Gaussian cloud structure. One option would be to apply spectral clustering on a matrix containing pairwise distances between Gaussian centers. However, given the high number of Gaussians, this would quickly become intractable. Therefore, we start by applying a Delaunay triangulation [18] to the Gaussian centers, which yields a graph that already captures local geometric information. Furthermore, the eigenvalues and eigenvectors can now be computed from the Laplacian of the sparse adjacency matrix derived from the graph. The resulting eigenvector embeddings, one for each Gaussian ${\mathcal{G}}_{i}$, are clustered into $K$ groups, which assigns each 3D Gaussian ${\mathcal{G}}_{i}$ to a cluster $k$. To build the representative feature (cluster prototype) ${\bf p}_{k}$ for each cluster, we simply average the volumetric features ${\bf g}_{i}$ of all Gaussians belonging to cluster $k$.
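
The clustering step can be sketched on a toy graph as follows. This is an illustration under assumptions, not the paper's implementation: the edge list is passed in directly (in practice it would come from the Delaunay triangulation, e.g. the simplices returned by `scipy.spatial.Delaunay`), a dense eigendecomposition replaces the sparse solver needed at scene scale, and a minimal k-means with farthest-point initialization stands in for whichever clustering routine is actually used.

```python
import numpy as np

def spectral_labels(num_nodes, edges, k, iters=20):
    """Cluster graph nodes with the k smallest Laplacian eigenvectors + k-means."""
    adj = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    lap = np.diag(adj.sum(axis=1)) - adj        # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(lap)               # eigenvalues in ascending order
    emb = vecs[:, :k]                           # one k-dim embedding per node
    # Minimal k-means (Lloyd) with farthest-point initialization.
    centers = [emb[0]]
    for _ in range(1, k):
        d = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([emb[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels

def prototypes(features, labels, k):
    """Prototype p_k = mean volumetric feature g_i over the Gaussians in cluster k."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(k)])

# Toy graph: two triangles joined by a single edge should form two clusters.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = spectral_labels(6, edges, k=2)
feats = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
protos = prototypes(feats, labels, k=2)
```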

Prototypical loss. These prototypes implicitly define "classes", and we use them to maximize intra-class compactness and inter-class separability within the feature space and to infuse spatial priors. Given a batch of $N$ pairs of pixel-aligned rendered/encoder features $\{{\bf F}^{\text{3D}}_{u},{\bf F}^{\text{2D}}_{u}\}$, we associate one prototype ${\bf p}_{k}$ with each feature pair and want to enforce both features to be close to the prototype. We thus add the following prototypical contrastive loss:

\begin{equation}
L_{PRO}=-\frac{1}{N}\sum_{n=1}^{N}\log\left(\frac{\exp\left(({\bf F}^{\text{3D}}_{n}{\bf p}_{n}^{\top}+{\bf F}^{\text{2D}}_{n}{\bf p}_{n}^{\top})/\tau\right)}{B}\right),
\tag{3}
\end{equation}

where ${\bf F}^{\text{2D}}_{n},{\bf F}^{\text{3D}}_{n}$ are the features corresponding to the pixel $u_{n}$ , ${\bf p}_{n}$ is the prototype assigned to them, $\tau$ is a temperature parameter and $B$ is a normalizing factor (detailed in the Appendix along with derivations). To compute the associations over the batch of feature pairs, we use an optimal transport procedure based on the Sinkhorn-Knopp algorithm [17] (see the Appendix for details).
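
The precise optimal transport formulation is deferred to the Appendix. As a hedged illustration, the sketch below follows the common balanced-assignment use of Sinkhorn-Knopp: an exponentiated score matrix is alternately rescaled so that every prototype receives the same total mass and every feature pair sums to one. The score definition, the regularization `eps`, and the iteration count are assumptions.

```python
import numpy as np

def sinkhorn_assign(scores, eps=0.05, iters=50):
    """Balanced soft assignment of N samples to K prototypes.

    scores: (N, K) similarities, e.g. F3D @ P.T + F2D @ P.T for N feature pairs.
    Returns an (N, K) matrix whose rows sum to 1 and whose columns approach N / K.
    """
    n, k = scores.shape
    q = np.exp((scores - scores.max()) / eps)         # shift for numerical stability
    for _ in range(iters):
        q *= (n / k) / q.sum(axis=0, keepdims=True)   # equalize prototype usage
        q /= q.sum(axis=1, keepdims=True)             # each sample sums to 1
    return q

# Toy batch: every sample slightly prefers prototype 0.
scores = np.array([[1.0, 0.9], [1.0, 0.8], [1.0, 0.2], [1.0, 0.0]])
naive = scores.argmax(axis=1)                         # collapses onto prototype 0
balanced = sinkhorn_assign(scores, eps=0.5).argmax(axis=1)
```

In this toy batch a plain argmax sends every sample to prototype 0, while the balanced assignment moves the two samples with the weakest preference to prototype 1, which avoids collapsed prototypes.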

Multi-view consistency. The rendering depends on the incidence of the ray traversing the 3D Gaussians. Hence, rendered features may vary across viewing angles. In order to encourage multi-view consistency of both the encoder and the feature field, and to make our features generalizable to out-of-distribution views, we align features belonging to different views but associated with the same 3D points. To that end, given a pixel correspondence $(u,v)$ between two images $I$ and $\widehat{I}$ in the scene – along with their encoder/rendered features $\{{\bf F}^{\text{3D}},{\bf F}^{\text{2D}}\}$, $\{\widehat{{\bf F}}^{\text{3D}},\widehat{{\bf F}}^{\text{2D}}\}$ – we randomly replace in $L_{NCE}$ and $L_{PRO}$ a subset of the pixel-aligned feature pairs $\{{\bf F}^{\text{3D}}_{u},{\bf F}^{\text{2D}}_{u}\}$ by $\{\widehat{{\bf F}}^{\text{3D}}_{v},{\bf F}^{\text{2D}}_{u}\}$ or $\{{\bf F}^{\text{3D}}_{u},\widehat{{\bf F}}^{\text{2D}}_{v}\}$. To extract correspondences, given a training image $I$ with pose $P$ and rendered depth $D$, we generate a random pose $\widehat{P}$ in the vicinity of $P$. We render the image $\widehat{I}$, depth $\widehat{D}$, and features $\widehat{{\bf F}}^{\text{3D}}$ from pose $\widehat{P}$ and extract the 2D feature map $\widehat{{\bf F}}^{\text{2D}}$ from the rendered image. Then, for each pixel $u$ in a randomly sampled set of pixels of $I$, we backproject $u$ into 3D using $D$ and re-project it into $\widehat{I}$, yielding the pixel $v$ in the rendered image. Next, we backproject $v$ into 3D using $\widehat{D}$ and re-project it into $I$. If the pixel distance between $u$ and the reprojected $v$ falls within a small threshold, $(u,v)$ is considered a valid correspondence; otherwise, it is rejected.
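
The forward-backward check can be sketched as follows, assuming a pinhole camera with intrinsics `K`, camera-to-world poses stored as 4x4 matrices, depth measured along the optical axis, and nearest-pixel depth lookup in the second view. These conventions and all names are illustrative assumptions; points behind the camera are not handled.

```python
import numpy as np

def backproject(pix, depth, K, cam_to_world):
    """Lift pixel (x, y) with its depth to a 3D point in world coordinates."""
    ray = np.linalg.inv(K) @ np.array([pix[0], pix[1], 1.0])
    return cam_to_world[:3, :3] @ (ray * depth) + cam_to_world[:3, 3]

def project(point, K, cam_to_world):
    """Project a 3D world point to pixel coordinates (x, y)."""
    p_cam = cam_to_world[:3, :3].T @ (point - cam_to_world[:3, 3])
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def cycle_consistent(u, depth, depth_hat, K, pose, pose_hat, thresh=1.0):
    """Return (v, is_valid) for pixel u of image I against the rendered view I_hat."""
    v = project(backproject(u, depth[u[1], u[0]], K, pose), K, pose_hat)
    col, row = np.round(v).astype(int)
    h, w = depth_hat.shape
    if not (0 <= row < h and 0 <= col < w):
        return v, False                         # u is not visible in the second view
    u_back = project(backproject(v, depth_hat[row, col], K, pose_hat), K, pose)
    return v, bool(np.linalg.norm(u_back - np.asarray(u)) < thresh)

# Toy scene: a fronto-parallel plane at depth 2, second camera shifted 0.1 along x.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
pose, pose_hat = np.eye(4), np.eye(4)
pose_hat[0, 3] = 0.1
depth = np.full((64, 64), 2.0)
v, valid = cycle_consistent((20, 30), depth, depth, K, pose, pose_hat)
```

If the depth rendered in the second view disagrees with the first (for instance because of an occluder), the reprojected pixel lands away from $u$ and the pair is rejected.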

	Model	Chess	Fire	Heads	Office	Pumpkin	Redkitchen	Stairs
SBM	HLoc [60]	0.8/0.11/100	0.9/0.24/99	0.6/0.25/100	1.2/0.20/100	1.4/0.15/100	1.1/0.14/99	2.9/0.80/72
	DSAC*[4]	0.5/0.17/100	0.8/0.28/99	0.5/0.34/100	1.2/0.34/98	1.2/0.28/99	0.7/0.21/97	2.7/0.78/92
	ACE [6]	0.5/0.18	0.8/0.33	0.5/0.33	1.0/0.29	1/0.22	0.8/0.2	2.9/0.81
	ACE + GSLoc [41]	0.5/0.15	0.6/0.25	0.4/0.28	0.9/0.26	1.0/0.23	0.7/0.17	1.4/0.42
RBM	NeFeS [14] (DFNet [13])	2/0.79	2/0.78	2/1.36	2/0.60	2/0.63	2/0.62	5/1.31
	MCLoc [74]	2/0.8	3/1.4	3/1.3	4/1.3	5/1.6	6/1.6	6/2.0
	SSL-Nif [56] (DV)	1/0.22/93	0.8/0.28/91	0.8/0.49/71	1.7/0.41/81	1.5/0.34/86	2.1/0.41/75	6.5/0.63/49
	NeRFMatch [99] (DV)	0.9/0.3	1.3/0.4	1.6/1.0	3.3/0.7	3.2/0.6	1.3/0.3	7.2/1.3
	GSplatLoc [68] (GSplaLoc)	0.43/0.16	1.03/0.32	1.06/0.62	1.85/0.4	1.8/0.35	2.71/0.55	8.83/2.34
	GSLoc [41] (DFNet)	0.7/0.20	0.9/0.32	0.6/0.36	1.2/0.32	1.3/0.31	0.9/0.25	2.2/0.61
	GSFFs-PR Feature (DV)	0.4/0.19/95	0.6/0.26/98	0.5/0.36/94	1.0/0.31/97	1.3/0.38/85	0.6/0.23/93	25.1/0.63/32
	GSFFs-PR Privacy (DV)	0.8/0.28/96	0.8/0.33/94	1.0/0.67/90	1.5/0.51/90	2.0/0.50/78	1.2/0.33/85	28.2/0.98/29

Table 1: Localization results on 7Scenes using SfM-based pseudo GT poses from [5]. We compare our results with Structure-Based Methods (SBM) and Rendering-Based Methods (RBM). We report median position error (cm, $\downarrow$) / median rotation error (°, $\downarrow$) / recall at 5cm/5° (%, $\uparrow$), when available. Bold are the best results, underline the second best.

3.3 Feature-based Localization (GSFFs-PR)

After training, the feature space along with the 3D Gaussians can be used for pose refinement by finding the pose that minimizes feature-metric errors between 2D extracted query features and 3D features rendered from the aforementioned pose. In particular, given a query image whose pose is unknown, we first extract its 2D feature map ${\bf F}^{\text{2D}}$ with our trained encoder. Second, we retrieve the closest database image with a global descriptor (our method is agnostic to the global descriptor used) and use the pose $P_{init}$ of the retrieved image as initialization. From $P_{init}$ , we render the ${\bf F}^{\text{3D}}$ feature map from GSFFs. Feature inconsistencies between the two feature maps are iteratively minimized with regard to the pose:

\begin{equation}
P^{*}=\operatorname*{arg\,min}_{P\in SE(3)}\|{\bf F}^{\text{2D}}-{\bf F}^{\text{3D}}(P,{\mathcal{G}})\|_{2}^{2}\enspace,
\tag{4}
\end{equation}

where the pose is parametrized on SE(3) and updates are performed on the Lie algebra se(3) by explicitly backpropagating through the rasterizer [46] (which we modify to support feature-metric refinement). At each refinement iteration, the pose is updated and features are rendered from the updated pose. We minimize the objective in Eq. 4 until convergence.
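
A single update on se(3) can be sketched as below. The sketch assumes a twist ordered as (translation, rotation), a left-multiplicative update, and a plain gradient step with learning rate `lr`; in the actual pipeline the gradient would come from backpropagating the feature-metric error through the rasterizer, which is not reproduced here. Function names are illustrative.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ x == np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map from a twist xi = (rho, omega) in se(3) to a 4x4 pose in SE(3)."""
    rho, omega = np.asarray(xi[:3], float), np.asarray(xi[3:], float)
    theta = np.linalg.norm(omega)
    W = hat(omega)
    if theta < 1e-8:                      # small-angle limits of the coefficients
        A, B, C = 1.0, 0.5, 1.0 / 6.0
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
    T = np.eye(4)
    T[:3, :3] = np.eye(3) + A * W + B * (W @ W)             # Rodrigues' formula
    T[:3, 3] = (np.eye(3) + B * W + C * (W @ W)) @ rho
    return T

def update_pose(T, grad_xi, lr):
    """One gradient step in the tangent space, mapped back onto SE(3)."""
    return se3_exp(-lr * np.asarray(grad_xi, float)) @ T

T_new = update_pose(np.eye(4), grad_xi=[1.0, 0.0, 0.0, 0.0, 0.0, 0.0], lr=0.1)
```

Updating in the tangent space keeps the rotation block orthonormal at every iteration, which a naive additive update of the matrix entries would not guarantee.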

4 Privacy-Preserving GSFFs

Due to the clustering and subsequent prototypical formulation adopted in GSFFs, we can seamlessly extend our model to be privacy-preserving. We follow the privacy definition from the literature [98, 55, 79], where privacy refers to the inability of an attacker to recover texture/color information and fine-level details. Similarly, following prior works, we consider that having coarse geometric information does not violate privacy. Therefore, inspired by [55], we replace the high-dimensional features ${\bf g}_{i}$, which contain a lot of privacy-critical information, with a single segmentation label. After training, the feature field, prototypes, and color information are removed.

Optimizing the segmentations. Features may be converted to segmentations by soft-assigning them to the set of prototypes. Any Gaussian feature ${\bf g}_{i}$ or encoder feature ${\bf f}_{i}$ can be assigned to a prototype ${\bf p}_{k}$ by the likelihood of the feature belonging to the cluster $k$, defined by $l^{3D}_{ik}=\exp({\bf g}_{i}^{\top}{\bf p}_{k})/\sum_{k^{\prime}}\exp({\bf g}_{i}^{\top}{\bf p}_{k^{\prime}})$. The resulting scores form a pseudo-logits vector ${\bf l}_{i}$ that can be rendered with alpha-blending (cf. Eq. 1), where the color is replaced by the scores. Similarly, encoder features can be soft-assigned via $l^{2D}_{ik}=\exp({\bf f}_{i}^{\top}{\bf p}_{k})/\sum_{k^{\prime}}\exp({\bf f}_{i}^{\top}{\bf p}_{k^{\prime}})$.
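
The soft assignment is a softmax over feature-prototype dot products, as in this short NumPy sketch (the function name and the toy prototypes are illustrative):

```python
import numpy as np

def soft_assign(features, protos):
    """l_ik = exp(f_i . p_k) / sum_k' exp(f_i . p_k') for features (N, D), protos (K, D)."""
    logits = features @ protos.T
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

protos = np.array([[1.0, 0.0], [0.0, 1.0]])
l = soft_assign(np.array([[1.0, 0.0], [0.0, 1.0]]), protos)
```

The same function applies to Gaussian features ${\bf g}_{i}$ and to encoder features ${\bf f}_{i}$, since both live in the shared embedding space.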

While the prototypical learning framework induces relatively well-aligned encoder/rendered soft assignments, the localization accuracy can be further improved by learning to directly predict the 2D segmentations. Therefore, during training, we add and train a shallow classification head on top of the encoded features that learns to output the segmentation map ${\bf S}^{\text{2D}}$ directly from the images. To learn the segmentation head and to further refine our features in a self-supervised way, we add the following cross-entropy loss to Eq. 2 and Eq. 3:

\begin{equation}
L_{CE}=-\sum_{u\in I}{\mathbbm{1}}_{u}\cdot\left(\log({\bf S}^{\text{2D}}_{u})+\log({\bf S}^{\text{3D}}_{u})\right)\enspace,
\tag{5}
\end{equation}

where ${\bf S}^{\text{3D}}$ denotes the segmentation map with rendered pseudo-logits $l^{3D}_{ik}$ and ${\mathbbm{1}}_{u}$ is the one-hot vector corresponding to the prototype label associated with the pixel $u$. We use the exact same associations as in the prototypical contrastive loss, which encourages consistency between the features and segmentations.

Privacy-preserving localization, GSFFs-PR Privacy. After training, any potentially privacy-sensitive information is removed from the 3D Gaussian model (spherical harmonics, features, and the GSFFs feature field), as neither photometric information nor features are required for the privacy-preserving localization. The pseudo-logits obtained from the soft assignments still contain too much information [55]. Therefore, we instead store hard assignments in the form of a single label per Gaussian ${\mathcal{G}}_{i}$, $k^{\ast}=\text{argmax}_{k}(l^{3D}_{ik})$. After this labeling procedure, the triplane GSFFs feature field and the prototypes are removed. As such, the 3D models contain only geometric information (Gaussians without color information) and segmentation information (cluster labels), effectively increasing the level of privacy. Furthermore, storing only the cluster labels makes the storage of the 3D representation orders of magnitude smaller compared to storing features. Given a pose $P$, segmentation maps can be rendered by alpha-blending (via Eq. 1) from one-hot vectors ${\mathbbm{1}}_{k}$ (corresponding to label $k$), yielding ${\bf S}^{\text{3D}}_{P}$. Given an image, 2D segmentation maps ${\bf S}^{\text{2D}}$ are directly obtained through the segmentation head. The localization pipeline is similar to the one in Section 3.3, with the difference that we minimize inconsistencies between the segmentation labels instead of between features:

\begin{equation}
P^{*}=\operatorname*{arg\,min}_{P\in SE(3)}CE({\bf S}^{\text{2D}},{\bf S}^{\text{3D}}_{P})\enspace.
\tag{6}
\end{equation}
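
The label-based pipeline can be illustrated for a single pixel as follows. The sketch assumes that the per-Gaussian blending weights of Eq. 1 are already available for the ray and that the cross-entropy uses the query's predicted 2D label as the target and the rendered class scores as the prediction; the direction of the cross-entropy is not spelled out in this section, so this is an assumption, as are all function names.

```python
import numpy as np

def hard_labels(soft):
    """Keep a single integer label per Gaussian from (N, K) soft assignments."""
    return soft.argmax(axis=1)

def render_segmentation(labels, weights, num_classes):
    """Blend one-hot label vectors with the per-Gaussian blending weights of one ray."""
    one_hot = np.eye(num_classes)[labels]          # (N, K)
    return weights @ one_hot                       # (K,) class scores of the pixel

def segmentation_ce(label_2d, s3d, eps=1e-8):
    """Cross-entropy between a 2D label and the rendered class scores of a pixel."""
    return -np.log(s3d[label_2d] + eps)

soft = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])
labels = hard_labels(soft)                         # only these integers are stored
s3d = render_segmentation(labels, np.array([0.6, 0.3, 0.1]), num_classes=2)
loss_match = segmentation_ce(0, s3d)               # query pixel predicted as class 0
loss_mismatch = segmentation_ce(1, s3d)            # query pixel predicted as class 1
```

Only the integer labels and the Gaussian geometry are needed at localization time, which is what allows the feature field, prototypes, and colors to be discarded.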

5 Experimental Evaluation

In this section, we first provide details regarding the training and evaluation of GSFFs-PR. Then, in Section 5.1, we evaluate our feature-based VL pipeline (dubbed GSFFs-PR Feature) and our privacy-preserving VL pipeline (dubbed GSFFs-PR Privacy) on multiple real-world datasets. We show that GSFFs-PR is in general more accurate than the respective non-privacy and privacy-preserving baselines. Section 5.2 ablates and discusses important aspects of our approach.

	Model	King’s	Old	Shop	St. Mary’s
SBM	HLoc [60]	11 / 0.20	15 / 0.31	4 / 0.2	7 / 0.24
	Kapture R2D2 [29]	5 / 0.10	9 / 0.20	2 / 0.10	3 / 0.10
	DSAC*[4]	15 / 0.30	21 / 0.40	5 / 0.30	13 / 0.40
	ACE [6]	29 / 0.38	31 / 0.61	5 / 0.30	19 / 0.60
	ACE + GSLoc [41]	25 / 0.29	26 / 0.38	5 / 0.23	13 / 0.41
PPM	GoMatch [98]	25 / 0.64	283 / 8.14	48 / 4.77	335 / 9.94
	DGC-GNN [79]	18 / 0.47	75 / 2.83	15 / 1.57	106 / 4.03
	SegLoc [55] (DV)	24 / 0.26	36 / 0.52	11 / 0.34	17 / 0.46
	GSFFs-PR Privacy (DV)	24 / 0.39	26 / 0.49	5 / 0.27	13 / 0.48
RBM	CROSSFIRE [49] (DV)	47 / 0.7	43 / 0.7	20 / 1.2	39 / 1.4
	NeFeS [14] (DFNet)	37 / 0.62	55 / 0.9	14 / 0.47	32 / 0.99
	MCLoc [74]	31 / 0.42	39 / 0.73	12 / 0.45	26 / 0.88
	SSL-Nif [56] (DV)	32 / 0.40	30 / 0.49	8 / 0.36	24 / 0.66
	NeRFMatch [99] (DV)	13 / 0.2	21 / 0.4	8.7 / 0.4	11.3 / 0.4
	GSLoc [41] (DFNet)	26 / 0.34	48 / 0.72	10 / 0.36	27 / 0.62
	GSplaLoc [68]	27 / 0.46	20 / 0.71	5 / 0.36	16 / 0.61
	GSFFs-PR Feature (DV)	18 / 0.27	21 / 0.4	4 / 0.26	10 / 0.34
	GSFFs-PR Feature (tuned) (DV)	17 / 0.26	18 / 0.36	4 / 0.25	8 / 0.26

Table 2: Localization results on Cambridge Landmarks. Comparison with Rendering-Based Methods (RBM), Privacy-Preserving Methods (PPM) and Structure-Based Methods (SBM). We report median position error (cm, $\downarrow$) / median rotation error (°, $\downarrow$). Bold: best results per method category, underline: second best. (tuned) means that training hyperparameters are tuned for the outdoor dataset as opposed to the indoor datasets.

Experimental details. We train and optimize one model per scene for 50K iterations on a single NVIDIA A100 GPU. During the first 15K iterations, only the Gaussian parameters are updated, through the photometric loss. From iteration 15K onward, the triplane fields and the encoder are also optimized with the joint loss $L=L_{PHO}+0.5L_{NCE}+0.5L_{PRO}+0.5L_{CE}+0.1L_{TVL}+0.05L_{Depth}$, where $L_{TVL}$ is a total variation loss applied to the triplane parameters for regularization and $L_{Depth}$ is a depth prior used when available. To increase the convergence basin of our method, we define a coarse feature/segmentation space and a fine feature/segmentation space (both are trained in a similar fashion). The coarse triplane has a resolution of $R=256$ and the fine triplane a resolution of $R=1024$. Coarse image-based features are processed through a ViT [52] with a patch size of 14, while fine image-based features are not downsampled. The feature dimension $d$ is set to 16 and the number of clusters $K$ to 34 for both the fine and coarse levels. The estimated pose of a query image is first iteratively refined with the coarse-level features or segmentations and then with the fine-level ones.
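The two-stage schedule and the loss weights stated above can be summarized in a short Python sketch. The weights and the 15K warm-up come from the text; the dictionary keys, the function name, and the assumption that only the photometric term is active during the warm-up (i.e., that the depth prior is not) are illustrative choices for this example, not a description of the actual training code.

```python
# Loss weights from the joint loss in the text; key names are illustrative.
LOSS_WEIGHTS = {
    "pho": 1.0,
    "nce": 0.5,
    "pro": 0.5,
    "ce": 0.5,
    "tvl": 0.1,
    "depth": 0.05,
}
WARMUP_ITERS = 15_000  # iterations during which only L_PHO is used


def joint_loss(iteration, losses):
    """Combine per-term loss values following the two-stage schedule.

    losses: dict mapping a term name to its scalar value. The "depth"
            entry may be missing when no depth prior is available.
    """
    if iteration < WARMUP_ITERS:
        # Assumption: only the photometric term is active in this stage.
        return losses["pho"]
    return sum(w * losses[name] for name, w in LOSS_WEIGHTS.items() if name in losses)
```

For instance, `joint_loss(20_000, {"pho": 1.0, "nce": 1.0, "pro": 1.0, "ce": 1.0, "tvl": 1.0})` evaluates the weighted sum without the depth term.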

We evaluate our localization pipeline on the 7Scenes (with Depth SLAM pseudo ground truth poses [67] and with SfM-based pseudo ground truth poses [5]), Indoor6 [19], and Cambridge Landmarks [33] real-world datasets. Results on the 12Scenes [75] dataset and comparisons against [90, 91] are provided in the Appendix. The median position error (cm) and median rotation error (°) are reported. Additionally, we report the recall at 5cm/5° for the indoor scenes. Training parameters (loss coefficients, temperature, 3DGS hyperparameters) are kept the same for all datasets. More details are provided in the Appendix.
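For reference, the reported metrics can be computed as in the following Python sketch. It uses the standard definitions (Euclidean distance between camera centers, angle of the relative rotation), which we assume match the evaluation protocol of the datasets; the function names, the input conventions (camera centers in meters, rotations as nested 3x3 lists), and the inclusive thresholds are assumptions made for this example.

```python
import math
from statistics import median


def position_error_cm(c_est, c_gt):
    """Euclidean distance between camera centers (meters in, cm out)."""
    return 100.0 * math.dist(c_est, c_gt)


def rotation_error_deg(R_est, R_gt):
    """Angle of the relative rotation R_est^T R_gt, in degrees."""
    # trace(R_est^T R_gt) equals the element-wise product summed over entries.
    trace = sum(R_est[k][i] * R_gt[k][i] for i in range(3) for k in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_angle))


def summarize(errors, t_thresh_cm=5.0, r_thresh_deg=5.0):
    """errors: list of (position_cm, rotation_deg) pairs, one per query.

    Returns (median position error, median rotation error, recall in %).
    """
    med_t = median(e[0] for e in errors)
    med_r = median(e[1] for e in errors)
    hits = sum(1 for t, r in errors if t <= t_thresh_cm and r <= r_thresh_deg)
    return med_t, med_r, 100.0 * hits / len(errors)
```

A query counts toward the recall only if both its position and rotation errors are within the thresholds.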

5.1 Visual Localization

Comparison with feature rendering-based approaches. We primarily compare GSFFs-PR Feature against other non-privacy-preserving rendering-based VL methods, which use either NeRFs [14, 56, 99] or 3DGS [41, 68] as their underlying 3D representation. These methods all require an initial pose estimate, which is obtained via pose retrieval with a global descriptor. More accurate global descriptors lead to more accurate initial pose estimates, which in turn make refinement easier. For the sake of fairness, we try to use global descriptors that are similar to, or less accurate than, those used by the baselines. The global descriptor used is specified in parentheses. We report 7Scenes results (with SfM pseudo GT) in Table 1, and Cambridge Landmarks results in Table 2.

From Table 1, we observe that GSFFs-PR Feature (DV) is best on 4 out of 7 scenes and second best on two further ones, in spite of relying on a less accurate initialization, i.e., DenseVLAD is less accurate than DFNet or ACE. On Stairs, the Gaussian Opacity Field struggles due to the textureless and flat layout of the scene, yielding poor results for the initial 3D structure, which directly affects GSFFs-PR’s refinement framework. From Table 2, we can see that GSFFs-PR is more accurate than the other 3D Gaussian rendering-based baselines. Overall, our feature pipeline is even competitive on some scenes with state-of-the-art structure-based methods.

Model	Chess	Fire	Heads	Office	Pumpkin	Redkitchen	Stairs
GoMatch [98]	4/1.65	13/3.86	9/5.17	11/2.48	16/3.32	13/2.84	89/21.12
DGC-GNN [79]	3/1.43	5/1.77	4/2.95	6/1.61	8/1.93	8/2.09	71/19.5
SegLoc [55]	4.5/1.02	4.0/1.51	4.1/1.80	4.7/1.33	6.3/1.88	6.0/1.87	43.7/8.27
GSFFs-PR	2.9/0.92	2.3/0.91	1.5/1.11	4.0/1.28	5.7/1.48	5.1/1.66	28.3/2.39

Table 3: Localization results on 7Scenes (DSLAM-based pseudo GT), with median position error (cm) ($\downarrow$) / median rotation error (°) ($\downarrow$). We compare GSFFs-PR (Privacy) with other privacy-preserving methods. Both SegLoc and GSFFs-PR refine the pose from a DenseVLAD [73] initialization. Best numbers in bold.

	Model	scene1	scene2a	scene3	scene4a	scene5	scene6
NPPM	DSAC* [4]	12.3/2.06/18.7	7.9/0.9/28.0	13.1/2.34/19.7	3.7/0.95/60.8	40.7/6.72/10.6	6.0/1.40/44.3
	NBE+SLD [19]	6.5/0.9/38.4	7.2/0.68/32.7	4.4/0.91/53.0	3.8/0.94/66.5	6.0/0.91/40.0	5.0/0.99/50.5
	GSFFs-PR Feature (34 classes) (Sgl)	4.9/0.68/51	3.1/0.25/68	3.0/0.58/69	5.1/0.85/49	6.3/0.84/41	3.3/0.69/62
	GSFFs-PR Feature (84 classes) (Sgl)	3.6/0.5/63	2.7/0.2/77	2.4/0.39/73	3.6/0.54/63	4.8/0.53/53	2.5/0.42/68
PPM	SegLoc (Sgl) [55]	3.9/0.72/51.0	3.2/0.37/56.4	4.2/0.86/41.8	6.6/1.27/33.84	5.1/0.81/43.1	3.5/0.78/34.5
	GSFFs-PR Privacy (34 classes) (Sgl)	10.2/1.8/28	5.8/0.49/42	5.8/1.2/45	6.5/1.2/40	13.9/2.32/23	6.5/1.5/46
	GSFFs-PR Privacy (84 classes) (Sgl)	4.6/0.67/53	4.1/0.33/49	4.1/0.74/55	4.0/0.61/58	6.6/0.92/38	3.7/0.75/57

Table 4: Localization results on Indoor6, comparison with Non-Privacy-Preserving Methods (NPPM) and Privacy-Preserving Methods (PPM). Median position error (cm) ($\downarrow$) / median rotation error (°) ($\downarrow$) / recall at 5cm/5° (%) ($\uparrow$).

Comparison with privacy-preserving approaches. We compare the performance of GSFFs-PR Privacy against other privacy-preserving visual localization methods, including SegLoc [55] and descriptor-less visual localization methods [98, 79], on Cambridge Landmarks in Table 2 and on 7Scenes with DSLAM pseudo GT in Table 3. Similar to our approach, both types of baselines try to remove the reliance on traditional high-dimensional features for visual localization and hence try to improve privacy. On both datasets, our pipeline proves to be much more accurate than all the other privacy-preserving methods. Finally, Table 4 shows results on the challenging Indoor6 dataset, which contains scenes with multiple rooms, uneven camera distributions, and large illumination changes. GSFFs-PR trained with 84 classes outperforms SegLoc in average recall, while GSFFs-PR trained with 34 classes is on par with SegLoc. This shows that increasing the number of classes increases the discriminative power of the segmentation, which is particularly beneficial for larger and more complex scenes such as scene1 and scene5. The Appendix provides an ablation study on varying the number of classes.

	KC (F)	OH (F)	KC (P)	OH (P)
Coarse only	37.5/0.80	670/1.10	92.2/1.55	82.5/1.50
Fine only	22.6/0.32	55.1/0.77	28.0/0.45	151/0.96
k-means	25.2/0.33	27.6/0.50	37.2/0.64	52.3/0.84
No SOpt	23.5/0.30	23.4/0.45	31.0/0.45	30.8/0.57
No MVR	20.0/0.28	22.5/0.49	29.6/0.51	29.1/0.49
No SE	23.5/0.31	21.9/0.42	27.0/0.42	26.0/0.47
GSFFs-PR	17.9/0.27	21.4/0.41	24.3/0.39	25.6/0.49

Table 5: Ablation of different components of our pipeline on King’s College (KC) and Old Hospital (OH) for the Feature (F) and Privacy (P) versions. Median position error (cm) / median rotation error (°) ($\downarrow$).

5.2 Ablations and Discussion

Table 5 shows results for our ablation studies, where we study a single component per ablation experiment.

No hierarchy. First, we evaluate the importance of the hierarchical approach by either removing the fine level (Coarse only) or by removing the coarse level (Fine only). The results in Table 5 (first two rows) show that alignment on the finer level yields better pose accuracy and that increasing the convergence basin with a hierarchical approach is necessary to ensure accurate localization.

Clustering. As shown in Table 5 (k-means), when we replace the spectral clustering with a simple k-means clustering to derive the prototypes in the 3D feature space, the accuracy drops, confirming that spectral clustering is a critical component of the representation learning process.

Feature field learning. The accuracy also drops when we replace the scale-aware feature encoding with a simple point-wise projection of 3D Gaussian centers (cf. Table 5 (No SE)), or when we remove the multi-view regularization in the contrastive losses (cf. Table 5 (No MVR)). This shows that both components leverage the geometric information contained in the 3D Gaussian model, yielding more robust and more discriminative representations, and hence a higher pose accuracy.

Segmentation optimization. By removing the loss from Eq. 5, we do not optimize the segmentations during training (Table 5 (No SOpt)). Instead, during inference, we align soft assignments between features and prototypes. Due to the prototypical loss, the soft assignments are aligned well enough to provide decent accuracy; however, the segmentation head provides a significant gain in accuracy. Furthermore, the feature alignment also benefits from the segmentation optimization during training.

Sparsity of the representation. To evaluate the role of the dense representation, we adopt an approach closer to SegLoc by applying pose refinement only on a sparse set of points. To do this, we reproject only the Gaussian centers into the image space, interpolate the 2D extracted/3D rendered segmentations at these locations, and apply Eq. 6 for pose refinement. In this sparse setup, the pose refinement does not converge, which suggests that a sparse signal is not sufficient when backpropagating through the renderer.

6 Conclusion

In this paper, we propose Gaussian Splatting Feature Fields (GSFFs), which combine an explicit geometry representation (3DGS) and an implicit feature field into a scene representation suitable for visual localization through pose refinement. The resulting VL framework relies on a learned feature embedding space shared between the scale-aware GSFFs and a 2D encoder. Both the field and the encoder are optimized in a self-supervised way via contrastive learning. By leveraging the geometry and structure of the GS representation, we regularize the training through multi-view consistency and clustering. The clusters are further used to convert the features into segmentations to enable privacy-preserving localization. Both the privacy-preserving and non-privacy-preserving variants of our localization framework outperform the relevant state-of-the-art.

Acknowledgments. Maxime and Torsten received funding from NAVER LABS Europe. Maxime was also supported by the Grant Agency of the Czech Technical University in Prague (No. SGS24/057/OHK3/1T/13). We thank our colleagues for providing feedback, in particular Assia Benbihi.

References

Alismail et al. [2017] Hatem Alismail, Brett Browning, and Simon Lucey. Photometric Bundle Adjustment for Vision-Based SLAM. In ACCV, 2017.
Bhayani et al. [2021] Snehal Bhayani, Torsten Sattler, Daniel Barath, Patrik Beliansky, Janne Heikkilä, and Zuzana Kukelova. Calibrated and Partially Calibrated Semi-Generalized Homographies. In ICCV, 2021.
Bortolon et al. [2024] Matteo Bortolon, Theodore Tsesmelis, Stuart James, Fabio Poiesi, and Alessio Del Bue. 6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model. In ECCV, 2024.
Brachmann and Rother [2021] Eric Brachmann and Carsten Rother. Visual Camera Re-Localization from RGB and RGB-D Images Using DSAC. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 44(9):5847–5865, 2021.
Brachmann et al. [2021] Eric Brachmann, Martin Humenberger, Carsten Rother, and Torsten Sattler. On the Limits of Pseudo Ground Truth in Visual Camera Re-Localisation. In ICCV, 2021.
Brachmann et al. [2023] Eric Brachmann, Tommaso Cavallari, and Victor Adrian Prisacariu. Accelerated Coordinate Encoding: Learning to Relocalize in Minutes using RGB and Poses. In CVPR, 2023.
Brahmbhatt et al. [2018] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-Aware Learning of Maps for Camera Localization. In CVPR, 2018.
Chelani et al. [2021] Kunal Chelani, Fredrik Kahl, and Torsten Sattler. How Privacy-Preserving Are Line Clouds? Recovering Scene Details From 3D Lines. In CVPR, 2021.
Chelani et al. [2025] Kunal Chelani, Assia Benbihi, Fredrik Kahl, Torsten Sattler, and Zuzana Kukelova. Obfuscation Based Privacy Preserving Representations are Recoverable Using Neighborhood Information. In 3DV, 2025.
Chen et al. [2022a] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial Radiance Fields. In ECCV, 2022a.
Chen et al. [2024a] Le Chen, Weirong Chen, Rui Wang, and Marc Pollefeys. Leveraging Neural Radiance Fields for Uncertainty-aware Visual Localization. In ICRA, 2024a.
Chen et al. [2021] Shuai Chen, Zirui Wang, and Victor A. Prisacariu. Direct-PoseNet: Absolute Pose Regression with Photometric Consistency. In 3DV, 2021.
Chen et al. [2022b] Shuai Chen, Xinghui Li, Zirui Wang, and Victor A. Prisacariu. DFNet: Enhance Absolute Pose Regression with Direct Feature Matching. In ECCV, 2022b.
Chen et al. [2024b] Shuai Chen, Yash Bhalgat, Xinghui Li, Jiawang Bian, Kejie Li, Zirui Wang, and Victor A. Prisacariu. Neural Refinement for Absolute Pose Regression with Feature Synthesis. In CVPR, 2024b.
Cheng et al. [2024] Kai Cheng, Xiaoxiao Long, Kaizhi Yang, Yao Yao, Wei Yin, Yuexin Ma, Wenping Wang, and Xuejin Chen. GaussianPro: 3D Gaussian Splatting with Progressive Propagation. In ICML, 2024.
Chung et al. [2024] Jaeyoung Chung, Jeongtaek Oh, and Kyoung Mu Lee. Depth-Regularized Optimization for 3D Gaussian Splatting in Few-Shot Images. In CVPR, 2024.
Cuturi [2013] Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In NeurIPS, 2013.
Delaunay [1924] Boris Delaunay. Sur la sphère vide: A la mémoire de Georges Voronoï. In Proceedings du Congrés international des mathématiciens, 1924.
Do et al. [2022] Tien Do, Ondrej Miksik, Joseph DeGol, Hyun Soo Park, and Sudipta N Sinha. Learning To Detect Scene Landmarks for Camera Localization. In CVPR, 2022.
Dusmanu et al. [2021] Mihai Dusmanu, Johannes L. Schönberger, Sudipta N. Sinha, and Marc Pollefeys. Privacy-Preserving Image Features via Adversarial Affine Subspace Embeddings. In CVPR, 2021.
Engel et al. [2014] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale Direct Monocular SLAM. In ECCV, 2014.
Engel et al. [2017] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct Sparse Odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(3):611–625, 2017.
Geppert et al. [2022] Marcel Geppert, Viktor Larsson, Johannes L Schönberger, and Marc Pollefeys. Privacy Preserving Partial Localization. In CVPR, 2022.
Germain et al. [2019] Hugo Germain, Guillaume Bourmaud, and Vincent Lepetit. Sparse-to-Dense Hypercolumn Matching for Long-Term Visual Localization. In 3DV, 2019.
Germain et al. [2022] Hugo Germain, Daniel DeTone, Geoffrey Pascoe, Tanner Schmidt, David Novotny, Richard Newcombe, Chris Sweeney, Richard Szeliski, and Vasileios Balntas. Feature Query Networks: Neural Surface Description for Camera Pose Refinement. In CVPR Workshops, 2022.
Guédon and Lepetit [2024] Antoine Guédon and Vincent Lepetit. SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering. In CVPR, 2024.
Heng et al. [2019] Lionel Heng, Benjamin Choi, Zhaopeng Cui, Marcel Geppert, Sixing Hu, Benson Kuan, Peidong Liu, Rang Nguyen, Ye Chuan Yeo, Andreas Geiger, Gim Hee Lee, Marc Pollefeys, and Torsten Sattler. Project AutoVision: Localization and 3D Scene Perception for an Autonomous Vehicle with a Multi-Camera System. In ICRA, 2019.
Huang et al. [2024] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2D Gaussian Splatting for Geometrically Accurate Radiance Fields. In SIGGRAPH, 2024.
Humenberger et al. [2020] Martin Humenberger, Yohann Cabon, Nicolas Guerin, Julien Morat, Jérôme Revaud, Philippe Rerole, Noé Pion, César Roberto de Souza, Vincent Leroy, and Gabriela Csurka. Robust Image Retrieval-based Visual Localization using Kapture. arXiv preprint arXiv:2007.13867, 2020.
Humenberger et al. [2022] Martin Humenberger, Yohann Cabon, Noé Pion, Philippe Weinzaepfel, Donghwan Lee, Nicolas Guérin, Torsten Sattler, and Gabriela Csurka. Investigating the Role of Image Retrieval for Visual Localization. International Journal of Computer Vision (IJCV), 130(7):1811–1836, 2022.
Ji et al. [2024] Shengxiang Ji, Guanjun Wu, Jiemin Fang, Jiazhong Cen, Taoran Yi, Wenyu Liu, Qi Tian, and Xinggang Wang. Segment Any 4D Gaussians. arXiv preprint arXiv:2407.04504, 2024.
Keetha et al. [2024] Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM. In CVPR, 2024.
Kendall et al. [2015] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: a Convolutional Network for Real-Time 6-DOF Camera Relocalization. In ICCV, 2015.
Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Lee et al. [2023] Chunghwan Lee, Jaihoon Kim, Chanhyuk Yun, and Je Hyeong Hong. Paired-Point Lifting for Enhanced Privacy-Preserving Visual Localization. In CVPR, 2023.
Liao et al. [2024] Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Jingdong Wang, Qing Li, and Kanglin Liu. CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding. arXiv preprint arXiv:2404.14249, 2024.
Lim et al. [2015] Hyon Lim, Sudipta N. Sinha, Michael F. Cohen, Matt Uyttendaele, and H. Jin Kim. Real-time Monocular Image-based 6-DoF Localization. International Journal of Robotics Research, 34(4–5):476–492, 2015.
Lin et al. [2023] Yunzhi Lin, Thomas Müller, Jonathan Tremblay, Bowen Wen, Stephen Tyree, Alex Evans, Patricio A. Vela, and Stan Birchfield. Parallel Inversion of Neural Radiance Fields for Robust Pose Estimation. In ICRA, 2023.
Lindenberger et al. [2021] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-Perfect Structure-from-Motion with Featuremetric Refinement. In ICCV, 2021.
Liu et al. [2024] Changkun Liu, Shuai Chen, Yash Bhalgat, Siyan Hu, Zirui Wang, Ming Cheng, Victor Adrian Prisacariu, and Tristan Braud. GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting. arXiv preprint arXiv:2408.11085, 2024.
Liu et al. [2022] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A Convnet for the 2020s. In CVPR, 2022.
Luiten et al. [2023] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. arXiv preprint arXiv:2308.09713, 2023.
Lynen et al. [2015] Simon Lynen, Torsten Sattler, Michael Bosse, Joel Hesch, Marc Pollefeys, and Roland Siegwart. Get out of my Lab: Large-scale, Real-Time Visual-Inertial Localization. In RSS, 2015.
Maggio et al. [2023] Dominic Maggio, Marcus Abate, Jingnan Shi, Courtney Mario, and Luca Carlone. Loc-NeRF: Monte Carlo Localization using Neural Radiance Fields. In ICRA, 2023.
Matsuki et al. [2024] Hidenobu Matsuki, Riku Murai, Paul H.J. Kelly, and Andrew J. Davison. Gaussian Splatting SLAM. In CVPR, 2024.
Moon et al. [2024] Heejoon Moon, Chunghwan Lee, and Je Hyeong Hong. Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds. In CVPR, 2024.
Moreau et al. [2022] Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. LENS: Localization Enhanced by NeRF Synthesis. In CoRL, 2022.
Moreau et al. [2023] Arthur Moreau, Nathan Piasco, Moussab Bennehar, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. CROSSFIRE: Camera Relocalization on Self-Supervised Features from an Implicit Representation. In ICCV, 2023.
Ng et al. [2022] Tony Ng, Adrian Lopez-Rodriguez, Vassileios Balntas, and Krystian Mikolajczyk. OoD-Pose: Camera Pose Regression From Out-of-Distribution Synthetic Views. In 3DV, 2022.
Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.
Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
Pan et al. [2023] Linfei Pan, Johannes L Schönberger, Viktor Larsson, and Marc Pollefeys. Privacy Preserving Localization via Coordinate Permutations. In ICCV, 2023.
Panek et al. [2022] Vojtech Panek, Zuzana Kukelova, and Torsten Sattler. MeshLoc: Mesh-Based Visual Localization. In ECCV, 2022.
Pietrantoni et al. [2023] Maxime Pietrantoni, Martin Humenberger, Torsten Sattler, and Gabriela Csurka. SegLoc: Learning Segmentation-Based Representations for Privacy-Preserving Visual Localization. In CVPR, 2023.
Pietrantoni et al. [2024] Maxime Pietrantoni, Gabriela Csurka, Martin Humenberger, and Torsten Sattler. Self-Supervised Learning of Neural Implicit Feature Fields for Camera Pose Refinement. In 3DV, 2024.
Pittaluga and Zhuang [2023] Francesco Pittaluga and Bingbing Zhuang. LDP-Feat: Image Features with Local Differential Privacy. In ICCV, 2023.
Pittaluga et al. [2019] Francesco Pittaluga, Sanjeev J. Koppal, Sing Bing Kang, and Sudipta N. Sinha. Revealing Scenes by Inverting Structure from Motion Reconstructions. In CVPR, 2019.
Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision Transformers for Dense Prediction. In ICCV, 2021.
Sarlin et al. [2019] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In CVPR, 2019.
Sarlin et al. [2021] Paul-Edouard Sarlin, Ajaykumar Unagar, Mans Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, and Torsten Sattler. Back to the Feature: Learning Robust Camera Localization From Pixels To Pose. In CVPR, 2021.
Sattler et al. [2015] Torsten Sattler, Michal Havlena, Filip Radenović, Konrad Schindler, and Marc Pollefeys. Hyperpoints and Fine Vocabularies for Large-Scale Location Recognition. In ICCV, 2015.
Sattler et al. [2019] Torsten Sattler, Qunjie Zhou, Marc Pollefeys, and Laura Leal-Taixé. Understanding the Limitations of CNN-based Absolute Camera Pose Regression. In CVPR, 2019.
Schöps et al. [2017] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A Multi-view Stereo Benchmark with High-Resolution Images and Multi-camera Videos. In CVPR, 2017.
Schöps et al. [2019] Thomas Schöps, Torsten Sattler, and Marc Pollefeys. BAD SLAM: Bundle Adjusted Direct RGB-D SLAM. In CVPR, 2019.
Shibuya et al. [2020] Mikiya Shibuya, Shinya Sumikura, and Ken Sakurada. Privacy Preserving Visual SLAM. In ECCV, 2020.
Shotton et al. [2013] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In CVPR, 2013.
Sidorov et al. [2024] Gennady Sidorov, Malik Mohrat, Ksenia Lebedeva, Ruslan Rakhimov, and Sergey Kolyubin. GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization. arXiv preprint arXiv:2409.16502, 2024.
Speciale et al. [2019a] Pablo Speciale, Johannes L. Schönberger, Sudipta N. Sinha, and Marc Pollefeys. Privacy Preserving Image Queries for Camera Localization. In ICCV, 2019a.
Speciale et al. [2019b] Pablo Speciale, Johannes L. Schönberger, Sing Bing Kang, Sudipta N. Sinha, and Marc Pollefeys. Privacy Preserving Image-Based Localization. In CVPR, 2019b.
Taira et al. [2021] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomáš Pajdla, and Akihiko Torii. InLoc: Indoor Visual Localization with Dense Matching and View Synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 43(4):1293–1307, 2021.
Toft et al. [2020] Carl Toft, Will Maddern, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, et al. Long-term Visual Localization Revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2020.
Torii et al. [2018] Akihiko Torii, Relja Arandjelović, Josef Sivic, Masatoshi Okutomi, and Tomáš Pajdla. 24/7 Place Recognition by View Synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(2):257–271, 2018.
Trivigno et al. [2024] Gabriele Trivigno, Carlo Masone, Barbara Caputo, and Torsten Sattler. The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement. In CVPR, 2024.
Valentin et al. [2016] Julien Valentin, Angela Dai, Matthias Nießner, Pushmeet Kohli, Philip Torr, Shahram Izadi, and Cem Keskin. Learning to Navigate the Energy Landscape. In 3DV. IEEE, 2016.
Von Stumberg et al. [2020] Lukas Von Stumberg, Patrick Wenzel, Qadeer Khan, and Daniel Cremers. GN-Net: The Gauss-Newton Loss for Multi-Weather Relocalization. IEEE Robotics and Automation Letters, 5(2):890–897, 2020.
von Stumberg et al. [2020] Lukas von Stumberg, Patrick Wenzel, Nan Yang, and Daniel Cremers. LM-Reloc: Levenberg-Marquardt Based Direct Visual Relocalization. In 3DV, 2020.
Wang et al. [2020] Bing Wang, Changhao Chen, Chris Xiaoxuan Lu, Peijun Zhao, Niki Trigoni, and Andrew Markham. AtLoc: Attention Guided Camera Localization. In AAAI, 2020.
Wang et al. [2024] Shuzhe Wang, Juho Kannala, and Daniel Barath. DGC-GNN: Descriptor-free Geometric-Color Graph Neural Network for 2D-3D Matching. In CVPR, 2024.
Wang et al. [2004] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing (TIP), 13(4):600–612, 2004.
Wu and He [2018] Yuxin Wu and Kaiming He. Group Normalization. In ECCV, 2018.
Xiong et al. [2023] Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, and Achuta Kadambi. SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting. arXiv preprint arXiv:2312.00206, 2023.
Xu et al. [2021] Binbin Xu, Andrew Davison, and Stefan Leutenegger. Deep Probabilistic Feature-metric Tracking. IEEE Robotics and Automation Letters, 6(1):223–230, 2021.
Yang et al. [2024] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction. In CVPR, 2024.
Ye et al. [2024] Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images. arXiv preprint arXiv:2410.24207, 2024.
Yen-Chen et al. [2021] Lin Yen-Chen, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. INeRF: Inverting Neural Radiance Fields for Pose Estimation. In IROS, 2021.
Yu et al. [2024a] Hanyang Yu, Xiaoxiao Long, and Ping Tan. LM-Gaussian: Boost Sparse-view 3D Gaussian Splatting with Large Model Priors. arXiv preprint arXiv:2409.03456, 2024a.
Yu et al. [2024b] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3D Gaussian Splatting. In CVPR, 2024b.
Yu et al. [2024c] Zehao Yu, Torsten Sattler, and Andreas Geiger. Gaussian Opacity Fields: Efficient Adaptive Surface Reconstruction in Unbounded Scenes. ACM Transactions on Graphics, 43(6):1–15, 2024c.
Zhai et al. [2024] Hongjia Zhai, Xiyu Zhang, Boming Zhao, Hai Li, Yijia He, Zhaopeng Cui, Hujun Bao, and Guofeng Zhang. SplatLoc: 3D Gaussian Splatting-based Visual Localization for Augmented Reality. arXiv preprint arXiv:2409.14067, 2024.
Zhai et al. [2025] Hongjia Zhai, Boming Zhao, Hai Li, Xiaokun Pan, Yijia He, Zhaopeng Cui, Hujun Bao, and Guofeng Zhang. Neuraloc: Visual localization in neural implicit map with dual complementary features. arXiv preprint arXiv:2503.06117, 2025.
Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR, 2018.
Zhang and Kosecka [2006] W. Zhang and J. Kosecka. Image based Localization in Urban Environments. In 3DPVT, 2006.
Zhang et al. [2024] Zheng Zhang, Wenbo Hu, Yixing Lao, Tong He, and Hengshuang Zhao. Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting. arXiv preprint arXiv:2403.15530, 2024.
Zhao et al. [2024] Boming Zhao, Luwei Yang, Mao Mao, Hujun Bao, and Zhaopeng Cui. PNeRFLoc: Visual localization with point-based neural radiance fields. In AAAI, 2024.
Zheng and Wu [2015] Enliang Zheng and Changchang Wu. Structure From Motion Using Structure-Less Resection. In ICCV, 2015.
Zhou et al. [2020] Qunjie Zhou, Torsten Sattler, Marc Pollefeys, and Laura Leal-Taixé. To Learn or not to Learn: Visual Localization from Essential Matrices. In ICRA, 2020.
Zhou et al. [2022] Qunjie Zhou, Sergio Agostinho, Aljosa Osep, and Laura Leal-Taixe. Is Geometry Enough for Matching in Visual Localization? In ECCV, 2022.
Zhou et al. [2024] Qunjie Zhou, Maxim Maximov, Or Litany, and Laura Leal-Taixé. The NeRFect Match: Exploring NeRF Features for Visual Localization. arXiv preprint arXiv:2403.09577, 2024.

APPENDIX

In this appendix, we first provide, in Appendix A, additional explanations regarding the contrastive losses, the optimal transport association procedure, and the volumetric feature encoding. In Appendix B, we detail the training and visual localization parameters and provide a pseudo-algorithm describing GSFFs training in Algorithm 1. Then, in Appendix C, we report visual localization results on the 12Scenes dataset with ablation studies, and in Appendix D we present a privacy attack experiment. Finally, in Appendix E, we show more visualizations of rendered segmentations and comment on the attached supplementary videos.

Appendix A Technical details of GSFFs

A.1 Contrastive losses and regularization terms

Here we provide more details regarding the derivation of the losses in Section 3.1. We recall that our goal is to train the model in a self-supervised manner to align the feature maps ${\bf F}^{\text{3D}}$ and ${\bf F}^{\text{2D}}$. Therefore, at each training step, we sample $N$ pixels in these maps to be aligned. Let us denote the $N$ corresponding pairs of pixel-aligned extracted/rendered features by $\{{\bf F}^{\text{2D}}_{n},{\bf F}^{\text{3D}}_{n}\}_{n=1}^{N}$. The contrastive loss has two terms. The first term enforces the similarity of each 3D rendered feature with its corresponding 2D extracted feature relative to all other 2D extracted features, and the second term enforces the similarity of each 2D extracted feature with its corresponding 3D rendered feature relative to all other 3D rendered features:

\textstyle{L_{NCE}=-\frac{1}{N}\sum_{u=1}^{N}\log\left(\frac{\exp{\left({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{u}/\tau\right)}}{\sum_{j=1}^{N}\exp({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{j}/\tau)}\right)}

\textstyle{-\frac{1}{N}\sum_{u=1}^{N}\log\left(\frac{\exp{\left({\bf F}^{\text{2D}}_{u}\cdot{\bf F}^{\text{3D}}_{u}/\tau\right)}}{\sum_{j=1}^{N}\exp({\bf F}^{\text{2D}}_{u}\cdot{\bf F}^{\text{3D}}_{j}/\tau)}\right)}

yielding $L_{NCE}$ equal to:

\textstyle{-\frac{1}{N}\sum_{u=1}^{N}\log\left(\frac{\exp{\left({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{u}/\tau\right)}\exp{\left({\bf F}^{\text{2D}}_{u}\cdot{\bf F}^{\text{3D}}_{u}/\tau\right)}}{\left(\sum_{j=1}^{N}\exp({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{j}/\tau)\right)\left(\sum_{j=1}^{N}\exp({\bf F}^{\text{2D}}_{u}\cdot{\bf F}^{\text{3D}}_{j}/\tau)\right)}\right)}\enspace.

From this we can derive Eq. 2, where the normalization factor $A$ is:

\textstyle{A=\left(\sum_{j=1}^{N}\exp({\bf F}^{\text{3D}}_{u}\cdot{\bf F}^{\text{2D}}_{j}/\tau)\right)\left(\sum_{j=1}^{N}\exp({\bf F}^{\text{2D}}_{u}\cdot{\bf F}^{\text{3D}}_{j}/\tau)\right)}
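The two-term loss above can be written as a short Python sketch over lists of feature vectors. This is an illustrative reference computation, not the training code: the function names are hypothetical, the temperature default `tau=0.1` is an assumption (the value used in the paper is not stated in this section), and features are assumed to be already normalized.

```python
import math


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(scores, index):
    """log(exp(scores[index]) / sum_j exp(scores[j])), computed stably."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[index] - log_sum


def info_nce(f3d, f2d, tau=0.1):
    """Symmetric InfoNCE over N pixel-aligned pairs (f3d[u], f2d[u])."""
    n = len(f3d)
    loss = 0.0
    for u in range(n):
        # First term: 3D anchor contrasted against all 2D features.
        loss -= log_softmax_at([dot(f3d[u], f2d[j]) / tau for j in range(n)], u)
        # Second term: 2D anchor contrasted against all 3D features.
        loss -= log_softmax_at([dot(f2d[u], f3d[j]) / tau for j in range(n)], u)
    return loss / n
```

The product of the two log-sum-exp denominators accumulated per anchor $u$ corresponds to the normalization factor $A$.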

Similarly, the prototypical contrastive loss has a term enforcing the similarity between 2D extracted features and the associated prototypes, and a second term enforcing similarity between 3D rendered features and the associated prototypes:

\textstyle{L_{PRO}=-\frac{1}{N}\sum_{n=1}^{N}\log\left(\frac{\exp{(({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{n})/\tau)}}{\sum_{j=1}^{K}\exp({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{j}/\tau)}\right)}

\textstyle{-\frac{1}{N}\sum_{n=1}^{N}\log\left(\frac{\exp{(({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{n})/\tau)}}{\sum_{j=1}^{K}\exp({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{j}/\tau)}\right)}

yielding $L_{PRO}$ equal to

\textstyle{-\frac{1}{N}\sum_{n=1}^{N}\log\left(\frac{\exp{(({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{n})/\tau)}\exp{(({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{n})/\tau)}}{(\sum_{j=1}^{K}\exp({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{j}/\tau))(\sum_{j=1}^{K}\exp({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{j}/\tau))}\right)}

which simplifies to Eq. 3 with the normalization factor:

\textstyle{B=\left(\sum_{j=1}^{K}\exp({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{j}/\tau)\right)\left(\sum_{j=1}^{K}\exp({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{j}/\tau)\right)}\enspace.
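The prototypical loss has the same structure, with the $K$ prototypes playing the role of the negatives. The sketch below is a hedged illustration: `prototypical_loss` and its argument layout are assumptions, and `assignment` stands for the per-pair prototype index (the index of ${\bf p}_{n}$), however it was obtained.

```python
import numpy as np

def prototypical_loss(f2d, f3d, prototypes, assignment, tau=0.05):
    """Sum of the 3D and 2D prototype cross-entropy terms.

    f2d, f3d:   (N, D) pixel-aligned extracted / rendered features.
    prototypes: (K, D) prototype vectors.
    assignment: (N,) integer index of the prototype associated with each pair.
    """
    idx = np.arange(len(assignment))
    total = 0.0
    for feats in (f3d, f2d):
        logits = feats @ prototypes.T / tau                 # (N, K)
        shifted = logits - logits.max(axis=1, keepdims=True)
        log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # Negative log-probability of the associated prototype.
        total += -log_prob[idx, assignment].mean()
    return total
```

Because both terms use the same `assignment`, the extracted and rendered features of a pair are pulled toward the same prototype.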

Data: $M$ posed training images, $K$ target number of classes and a total number of training iterations $N_{\textrm{iter}}$

Build SfM and pretrain a GoF [89] with the training images

Apply spectral clustering on the set of 3D Gaussian centers to initialize the spatial prototypes $\{p_{k}\}_{k=1}^{K}$

Then, to train the GSFFs, iterate:

for iter in range($N_{\textrm{iter}}$) do

Sample a random image $I$ and viewpoint $P$

Extract 2D image-based features ${\bf F}^{\text{2D}}$ and segmentation map ${\bf S}^{\text{2D}}$

For each 3D Gaussian ${\mathcal{G}}_{i}$, extract a scale-aware volumetric feature ${\bf g}_{i}$ from the triplane using the covariance kernel based encoding (cf. Sec. 3.1)

Assign a 3D segmentation label $s_{i}$ to each 3D Gaussian ${\mathcal{G}}_{i}$ based on volumetric feature/prototypes similarities (cf. Sec. 4)

From the pose $P$, rasterize 3D segmentation labels and volumetric features to obtain ${\bf S}^{\text{3D}}$ and ${\bf F}^{\text{3D}}$

Sample a novel viewpoint $\widehat{P}$, find correspondences between image $I$ and $\widehat{I}$ (cf. Sec. 3.2)

Render 3D features/segmentation maps from $\widehat{P}$ and replace features/segmentation in ${\bf F}^{\text{3D}}$/${\bf S}^{\text{3D}}$ with $\widehat{{\bf F}}^{\text{3D}}$/$\widehat{{\bf S}}^{\text{3D}}$ at the correspondence locations

Repeat the replacement process for 2D features/segmentations

Compute the contrastive loss $L_{NCE}$ from ${\bf F}^{\text{2D}}$ and ${\bf F}^{\text{3D}}$

Associate a prototype $p_{k}$ to each pair $\{{\bf F}^{\text{2D}}_{i},{\bf F}^{\text{3D}}_{i}\}$ using the OT labeling procedure (cf. Section A.2)

Compute $L_{PRO}$ and $L_{CE}$ from these associations as well as from ${\bf F}^{\text{2D}}$, ${\bf F}^{\text{3D}}$, ${\bf S}^{\text{2D}}$ and ${\bf S}^{\text{3D}}$ (cf. Sec. 3.2)

Compute regularization losses $L_{TVL}$, $L_{Depth}$ and photometric loss $L_{PHO}$ from the rendered image

Jointly update the image encoder, the triplane field and 3D Gaussians by minimizing $L=L_{PHO}+0.5L_{NCE}+0.5L_{PRO}+0.5L_{CE}+0.1L_{TVL}+0.05L_{Depth}$

Update the prototypes $p_{k}$ with an EMA scheme based on volumetric features ${\bf g}_{i}$ and the spectral clustering assignments

end for

Algorithm 1 Pseudo algorithm describing the training process of GSFFs.

A.2 Optimal transport (OT) associations

In this section, we detail the optimal transport association step from Section 3.2. Given a batch of $N$ pairs of pixel-aligned extracted/rendered features $\{{\bf F}^{\text{2D}}_{n},{\bf F}^{\text{3D}}_{n}\}_{n=1}^{N}$ and a set of $K$ prototypes $P\!\in\!{\rm I\!R}^{K\times D}$, we aim to associate one prototype with each pair of pixel-aligned features. We want this association to respect two criteria: 1) A single prototype must be associated with each pair of features so that the extracted/rendered features are pushed toward the same "class" in the feature space. 2) Predictions must be as balanced as possible to avoid collapse.

To satisfy these constraints, we resort to optimal transport, framing the problem as finding a mapping $Q\!\in\!{\rm I\!R}^{N\times K}$ between pixels and prototypes that maximizes the feature similarity between the pairs of features and the prototypes. We define the joint feature/prototypes similarities $S\in{\rm I\!R}^{K\times N}$ as:

\textstyle{S_{kn}=\exp{\left(({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{k}+{\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{k})/\tau\right)}/C}\enspace,

with

\textstyle{C=\left(\sum_{k}\exp{({\bf F}^{\text{2D}}_{n}\cdot{\bf p}_{k}/\tau)}\right)\left(\sum_{k}\exp{({\bf F}^{\text{3D}}_{n}\cdot{\bf p}_{k}/\tau)}\right)}\enspace.

We introduce the following objective where $Q$ maximizes the joint feature/prototypes similarities $S$ :

\textstyle{\max_{Q\in U(\frac{1}{N},\frac{1}{K})}\mathrm{Tr}(Q\log S)+\lambda h(Q)}\enspace,

where the entropy term $h(Q)$ encourages balanced predictions while using joint extracted/rendered feature prototypes similarities yields a single association per feature pair. If we relax $Q$ such that it belongs to the transportation polytope $U(\frac{1}{N},\frac{1}{K})$ [17], $Q$ can be efficiently computed with the iterative Sinkhorn-Knopp algorithm [17]. The final associations $\Gamma\!\in\!{\rm I\!R}^{N}$ are obtained with $\Gamma\!=\!\text{argmax}_{k}(Q)$ .
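The association step can be sketched with a log-domain Sinkhorn-Knopp iteration, shown below as an illustration rather than the authors' implementation. The helper names, the default values of `lam` and `n_iters`, and the choice to work with $\log S$ (to avoid underflow at small temperatures) are all assumptions of this sketch. The plan $Q$ is returned as an $N\times K$ array, with one row per feature pair.

```python
import numpy as np

def _logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def joint_log_similarity(f2d, f3d, prototypes, tau=0.05):
    """log S, shape (K, N): sum of the 2D and 3D log-softmax over prototypes."""
    logits_2d = prototypes @ f2d.T / tau
    logits_3d = prototypes @ f3d.T / tau
    log_p2d = logits_2d - _logsumexp(logits_2d, axis=0)
    log_p3d = logits_3d - _logsumexp(logits_3d, axis=0)
    return log_p2d + log_p3d

def sinkhorn_knopp(log_S, lam=0.05, n_iters=50):
    """Entropy-regularized plan Q (N, K) with uniform marginals.

    Alternately rescales columns (prototypes) to mass 1/K and rows
    (feature pairs) to mass 1/N, starting from the kernel S^(1/lam).
    """
    K, N = log_S.shape
    log_Q = log_S.T / lam
    for _ in range(n_iters):
        log_Q = log_Q - _logsumexp(log_Q, axis=0) - np.log(K)
        log_Q = log_Q - _logsumexp(log_Q, axis=1) - np.log(N)
    return np.exp(log_Q)

def associate(f2d, f3d, prototypes, tau=0.05, lam=0.05, n_iters=50):
    """Index of the prototype associated with each feature pair, shape (N,)."""
    Q = sinkhorn_knopp(joint_log_similarity(f2d, f3d, prototypes, tau), lam, n_iters)
    return Q.argmax(axis=1)
```

The column rescaling is what enforces balanced predictions: no prototype can absorb more than a $1/K$ share of the batch, which prevents all pairs from collapsing onto a single class.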

A.3 Projecting 3D Gaussians onto the triplane

In this subsection, we explain how a volumetric feature for each 3D Gaussian is derived from the triplane grid introduced in Section 3.1. The triplane grid is centered at the origin of the world coordinate space and is composed of three orthogonal 2D planes $H_{xy},H_{xz},H_{yz}\in{\rm I\!R}^{D\times R\times R}$, $D$ being the triplane feature dimension and $R$ the resolution of the grid. We project each 3D Gaussian ${\mathcal{G}}_{i}$ onto these planes and derive three Gaussian kernels ${\mathcal{G}}^{xy}_{i},{\mathcal{G}}^{xz}_{i},{\mathcal{G}}^{yz}_{i}$ from the projections, which we use to obtain the scale-aware volumetric feature ${\bf g}_{i}^{\text{3D}}$.

To obtain the $xy$ feature, we proceed as follows (the $xz$ and $yz$ features are obtained similarly). Let $m_{i}$ be the center of ${\mathcal{G}}_{i}$ and $\Sigma_{i}$ its covariance matrix. We first project the center onto the plane, yielding $m_{i}^{xy}$. We then perform an orthographic projection of the covariance matrix to obtain $\Sigma_{i}^{xy}$. We define a grid of dimension 5 by 5 centered on $m_{i}^{xy}$. On the plane $xy$, we use the coordinates ${\bf u}$ of the grid points to define the following Gaussian kernel:

\textstyle{{\mathcal{G}}^{xy}_{i}({\bf u})=\frac{1}{Z}\exp\left(-\frac{{\bf u}(\Sigma_{i}^{xy})^{-1}{\bf u}^{t}}{2}\right)}\enspace,

where $Z$ is a normalization constant. We query the feature plane $H_{xy}$ for each point ${\bf u}_{k}$ in the grid and apply the Gaussian kernel to the queried features. This yields a $D$-dimensional feature ${\bf g}_{i}^{xy}$ associated with ${\mathcal{G}}^{xy}_{i}$. We repeat this operation for $H_{xz},{\mathcal{G}}^{xz}_{i}$ and $H_{yz},{\mathcal{G}}^{yz}_{i}$. The resulting features ${\bf g}_{i}^{xy},{\bf g}_{i}^{xz},{\bf g}_{i}^{yz}$ are summed to obtain the volumetric feature ${\bf g}_{i}^{\text{3D}}$ of ${\mathcal{G}}_{i}$.
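The procedure can be illustrated with the following NumPy sketch. Several choices here are assumptions made for the example, not details confirmed by the text: the Gaussian center and covariance are taken to be already expressed in grid (pixel) units, ${\bf u}$ is interpreted as the offset of a grid point from the projected center, $Z$ is chosen so that the 25 kernel weights sum to one, and planes are sampled with nearest-neighbor lookup. The names `plane_feature` and `volumetric_feature` are hypothetical.

```python
import numpy as np

AXES = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}

def plane_feature(H, center_2d, cov_2d, size=5):
    """Kernel-weighted feature of one projected Gaussian on one plane.

    H:         (D, R, R) feature plane.
    center_2d: (2,) projected center, in grid units (assumption).
    cov_2d:    (2, 2) projected covariance, in grid units (assumption).
    """
    R = H.shape[1]
    half = size // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    du, dv = np.meshgrid(offsets, offsets, indexing="ij")
    u = np.stack([du.ravel(), dv.ravel()], axis=1)        # (size*size, 2)
    # Gaussian kernel exp(-u Sigma^-1 u^T / 2), normalized to sum to one.
    maha = np.einsum("ni,ij,nj->n", u, np.linalg.inv(cov_2d), u)
    weights = np.exp(-0.5 * maha)
    weights /= weights.sum()
    # Nearest-neighbor lookup, clamped to the plane borders.
    pts = np.clip(np.rint(center_2d + u).astype(int), 0, R - 1)
    feats = H[:, pts[:, 0], pts[:, 1]]                    # (D, size*size)
    return feats @ weights                                # (D,)

def volumetric_feature(planes, mean, cov):
    """Sum of the three per-plane features of a 3D Gaussian."""
    g = 0.0
    for name, axes in AXES.items():
        ax = list(axes)
        # Orthographic projection: keep the two coordinates of the plane.
        g = g + plane_feature(planes[name], mean[ax], cov[np.ix_(ax, ax)])
    return g
```

Since the kernel width follows the projected covariance, a large Gaussian averages features over the whole 5 by 5 neighborhood while a small one concentrates on the center sample, which is what makes the feature scale-aware.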

Appendix B Implementation details

B.1 Additional training details

Similar to GoF [89], densification and pruning operations based on image space gradients are applied until iteration 15000. The densification interval is set to 600 iterations until iteration 7500, and reduced to 400 iterations afterward. To facilitate the convergence of the 3D Gaussian model, we train on images downscaled by a factor of 4 until iteration 7500. From iteration 15000, the geometric regularization losses from [89] are applied until the end of the training. The triplane learning rate is set to 7e-3, while the encoder learning rate is set to 1e-4. The learning rates for 3D Gaussian primitives are identical to GoF [89].

GSFFs is optimized with the Adam optimizer [35]. The prototypes are updated based on an exponential moving average (EMA) scheme with $\alpha=0.9995$ after each training iteration. The temperature $\tau$ for the contrastive losses is set to 0.05. In the Multi-view consistency paragraph from Sec. 3.2, the pixel reprojection threshold is set to 2 for the fine level, and to 4 for the coarse level. The coarse encoder uses a DINOv2 [52] pretrained backbone, followed by projection convolutional layers (convolutional layers with kernel size 1 that reduce the dimension while maintaining the resolution) and a ConvNeXt block [42]. The fine encoder is composed of shallow convolutional layers followed by a ConvNeXt block [42]. The segmentation heads contain convolutional layers with ReLU activations and GroupNorm [81] normalization.
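A minimal sketch of such an EMA prototype update is given below. The exact update rule is an assumption: each prototype is moved toward the mean of the volumetric features currently assigned to it, and prototypes without assigned features are left unchanged. The function name `ema_update` is hypothetical.

```python
import numpy as np

def ema_update(prototypes, feats, labels, alpha=0.9995):
    """Exponential moving average update of the prototypes.

    prototypes: (K, D) current prototypes.
    feats:      (G, D) volumetric features of the 3D Gaussians.
    labels:     (G,) cluster index assigned to each Gaussian.
    """
    updated = prototypes.copy()
    for k in range(len(prototypes)):
        members = feats[labels == k]
        if len(members) == 0:
            continue  # no feature assigned: keep the prototype as is
        updated[k] = alpha * prototypes[k] + (1.0 - alpha) * members.mean(axis=0)
    return updated
```

With $\alpha=0.9995$, a single iteration moves a prototype by only 0.05% of the way toward the current cluster mean, so the prototypes evolve slowly and remain stable targets for the contrastive losses.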

We provide in Algorithm 1 the pseudo algorithm describing the training process of the GSFFs.

B.2 Data pre-processing

To reduce running time and the memory footprint, images are rescaled such that image width is 1024 pixels for Cambridge Landmarks and 480 pixels for Indoor6. The original image resolution of 640 by 480 pixels is used on 7Scenes. During visual localization, we use the same image resolution as the one used during training.

Cambridge Landmarks and Indoor6 contain images with illumination changes. We therefore learn an embedding per training image to capture these changes. These embeddings are only used during training to learn the scene representation. On Cambridge Landmarks, we mask out the sky and pedestrians during training. The masks are extracted using the semantic segmentation model from [59]. As Indoor6 contains day/night images with extreme illumination changes, we further apply CLAHE normalization to its images.

B.3 Visual localization setup

For the pose refinement, we use the Adam optimizer [35] with coarse/fine learning rates of 0.5/0.2 on Cambridge Landmarks, 0.3/0.2 on Indoor6, and 0.2/0.1 on 7Scenes respectively. The number of refinement steps for the coarse and fine level is set to 150/300 on Cambridge Landmarks and Indoor6, and 75/150 on 7Scenes. Rendered areas with high distortion (see [89] for the definition of distortion) are masked out during refinement. Additionally, the sky is masked out on Cambridge Landmarks.

Pass	# Classes	KingsCollege	OldHospital	ShopFacade	StMarysChurch
FP	34	0.034	0.032	0.024	0.028
FP	84	0.046	0.043	0.027	0.031
BP	34	0.065	0.065	0.071	0.076
BP	84	0.462	0.401	0.287	0.392

Table 6: Computation time (s) for a forward pass FP (rendering) and a backward pass BP (pose optimization) on Cambridge Landmarks for GSFFs trained with 34 classes and 84 classes.

B.4 Training time and rendering quality

In Table 7, we report the novel view rendering quality, evaluated with the PSNR, SSIM [80] and LPIPS [92] image metrics, as well as the corresponding model training time for the Cambridge Landmarks scenes. From these results, we observe that GSFFs-Feature adds computational overhead compared to GoF [89] (the cost of training the feature fields), but does not decrease the novel view synthesis quality. Note that GSFFs-Privacy cannot render RGB images for privacy reasons.

We also provide the rendering time and the backward pass time of the pose optimization in Table 6 for 34 and 84 classes. We observe that while the forward pass time increases only slightly, the cost of the backward pass increases significantly with the number of classes. Overall, we found in our experiments that using 34 classes is a good compromise between rendering speed and pose accuracy (see Table 2 for the accuracies and Table 6 for the computation times).

Model	Training time (h)	KC (PSNR / SSIM / LPIPS)	OH (PSNR / SSIM / LPIPS)	SF (PSNR / SSIM / LPIPS)	SMC (PSNR / SSIM / LPIPS)
GoF [89]	0.7	16.95/0.69/0.27	16.98/0.63/0.30	20.04/0.74/0.21	18.49/0.71/0.25
GSFFs	10.8	16.92/0.69/0.27	16.94/0.61/0.30	20.02/0.74/0.21	18.47/0.71/0.26

Table 7: Evaluating novel view rendering.

Appendix C Additional Localization Experiments

Model	apt1/kitchen	apt1/living	apt2/bed	apt2/kitchen	apt2/living	apt2/luke	office1/gates362	office1/gates381	office1/lounge	office1/manolis	office2/5a	office2/5b
GSFFs-PR Feature (34 Classes) (NV)	0.3/0.2/99	0.3/0.18/100	0.4/0.17/100	0.7/0.42/91	0.4/0.21/96	0.6/0.27/97	0.5/0.23/100	0.5/0.27/99	0.8/0.29/97	0.5/0.22/99	0.9/0.41/99	1.1/0.41/94
GSFFs-PR Privacy (34 Classes) (NV)	0.6/0.32/97	0.6/0.29/100	0.5/0.23/100	0.9/0.51/75	0.7/0.30/99	0.8/0.30/96	0.8/0.31/98	0.7/0.36/98	1.6/0.54/91	0.7/0.31/97	1.3/0.65/95	1.8/0.83/69
NeRF-SCR [11]	0.9/0.5	2.1/0.6	1.6/0.7	1.2/0.5	2.0/0.8	2.6/1.0	2.0/0.8	2.7/1.2	1.8/0.6	1.6/0.7	2.5/0.9	2.6/0.8
PNeRFLoc [95]	1.0/0.6	1.5/0.5	1.2/0.5	0.8/0.4	1.4/0.5	8.1/3.3	1.6/0.7	8.7/3.2	2.3/0.8	1.1/0.5	X	2.8/0.9
GSPlatloc [90]	0.8/0.4	1.1/0.4	1.2/0.5	1.0/0.5	1.2/0.5	1.5/0.6	1.1/0.5	1.2/0.5	1.6/0.5	1.1/0.5	1.4/0.6	1.5/0.5
NeuraLoc [91]	0.9/0.5	1.1/0.4	1.3/0.6	1.0/0.6	1.2/0.5	1.4/0.7	1.1/0.5	1.1/0.5	1.7/0.6	1.0/0.5	1.3/0.6	1.5/0.5
GSFFs-PR Feature (34 Classes) (NV)	0.7/0.4	1.1/0.4	1.1/0.4	0.8/0.4	1.0/0.5	1.3/0.5	1.1/0.4	1.2/0.5	1.4/0.5	0.8/0.4	1.2/0.5	1.7/0.6

Table 8: Localization results on 12Scenes (SfM pGT top rows / DSLAM pGT bottom rows). Median translation error (cm) ($\downarrow$) / median angle error (°) ($\downarrow$) / recall at 5cm/5° (%) ($\uparrow$).

C.1 12Scenes Dataset

In Table 8, we report localization results on the 12Scenes [75] dataset for both the SfM pseudo ground truth (pGT) and the DSLAM pGT. We compare our GSFFs-Feature model against four non-privacy-preserving baselines: NeRF-SCR [11], PNeRFLoc [95], NeuraLoc [91] and GSPlatloc [90].

GSFFs clearly outperforms the rendering-based approaches [11, 95], and improves accuracy over [90, 91] on most of the scenes.
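For reference, the three metrics reported in Table 8 can be computed as in the sketch below. The conventions are assumptions made for the example: poses are given as rotation matrices and camera positions in meters, the rotation error is the angle of the relative rotation, and the recall uses strict thresholds. The function names are hypothetical.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Per-query translation error (cm) and rotation error (deg).

    R_est, R_gt: (M, 3, 3) rotation matrices.
    t_est, t_gt: (M, 3) camera positions in meters (assumed convention).
    """
    trans_cm = 100.0 * np.linalg.norm(t_est - t_gt, axis=1)
    rel = np.einsum("mij,mkj->mik", R_est, R_gt)          # R_est @ R_gt^T
    cos = (np.trace(rel, axis1=1, axis2=2) - 1.0) / 2.0
    rot_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return trans_cm, rot_deg

def localization_summary(trans_cm, rot_deg, max_cm=5.0, max_deg=5.0):
    """Median translation error, median rotation error and recall (%)."""
    ok = (trans_cm < max_cm) & (rot_deg < max_deg)
    return np.median(trans_cm), np.median(rot_deg), 100.0 * ok.mean()
```

A query counts toward the recall only if both its translation and rotation errors are below their thresholds, so the recall can be low even when one of the two medians looks good.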

C.2 Varying the number of prototypes

In this section, we study the influence of the number of classes/prototypes on both the feature- and the segmentation-based variants of our approach. All experiments in the main paper use a feature dimension of 16 with 34 segmentation classes. As we render both features and segmentations in the same variable (each channel is independently rendered), we compile the rasterizer with a fixed rendering dimension of 50 (16 + 34) yielding a good compromise between rendering speed and discriminative power.
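To make the shared rendering buffer concrete, the sketch below splits a 50-channel rendered map into its feature and segmentation parts. The channel order (features first, class scores last) and the use of an argmax to obtain a label map are assumptions made for the example, as is the function name.

```python
import numpy as np

FEATURE_DIM = 16
NUM_CLASSES = 34

def split_rendered_map(rendered):
    """Split a (FEATURE_DIM + NUM_CLASSES, H, W) map into features and labels.

    Assumed channel layout: the first FEATURE_DIM channels hold the feature
    map, the remaining NUM_CLASSES channels hold per-class scores.
    """
    if rendered.shape[0] != FEATURE_DIM + NUM_CLASSES:
        raise ValueError("unexpected number of rendered channels")
    features = rendered[:FEATURE_DIM]                      # (16, H, W)
    labels = rendered[FEATURE_DIM:].argmax(axis=0)         # (H, W)
    return features, labels
```

Since every channel is rasterized independently, the rendering cost grows with the total number of channels, which is why the number of classes directly affects the timings.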

In this ablation, we maintain a feature dimension of 16 and vary the number of classes. We compile the rasterizer with a map dimension of 100 and then train and evaluate GSFFs-PR with 84 classes, 59 classes, and 15 classes. In Fig. 3, we plot the median translation and rotation errors on Cambridge Landmarks against the number of classes of each model. In Table 4, we compare 34 classes versus 84 on the Indoor6 dataset. From these results, we derive the following observations. Using only a few classes is not sufficient because of the lack of discriminative power. Increasing the number of classes first yields a significant gain, but above a certain limit we observe a drop in accuracy. We suspect that the reason is over-clustering, which makes it more difficult for the representation to converge during training.

Overall, the number of classes must be high enough to make the segmentation discriminative enough for localization, but not so high that convergence is compromised. Naturally, larger scenes with diverse viewpoints will require more classes, while for simpler scenes it is better to use fewer classes.

Furthermore, we can see from Table 2 that the GSFFs-PR Feature localization pipeline also benefits from the increased number of classes since, during training, gradients are backpropagated from both the segmentation and feature maps, and $L_{PRO}$ implicitly uses the prototypes.

C.3 Effect of the initialization

In Table 9, we provide pose refinement results on Cambridge Landmarks with different initial poses. Note that poses estimated with ACE [6] and HLoc [60] are much more accurate than those from DenseVLAD (DV) [73]. We observe that, as expected, improving the initial pose results in higher final GSFFs-PR localization accuracy. It also requires fewer refinement steps to converge. Starting from a wide baseline such as DenseVLAD [73], the model needs more refinement steps, but the optimization ultimately reaches high accuracy (especially for high-resolution images), showing the robustness of GSFFs-PR.

Model	Init.	Image Res.	Coarse-Fine Steps	Runtime per Query (s)	KC (cm / deg)	OH (cm / deg)	SF (cm / deg)	SMC (cm / deg)
$[64]$	$[64]$	?	350	3	27/0.46	20/0.71	5/0.36	16/0.61
$[40]$	Ace	512	x	0.2	25/0.29	26/0.38	5/0.23	13/0.41
DV	-	-	-	-	280/5.7	401/7.1	111/7.6	231/8
Vanilla-GS	DV	1024	150-300	45	21.9/0.34	22.3/0.42	4.5/0.27	11.8/0.35
GSFFs-PR	DV	1024	150-300	45	17.9/0.27	21.4/0.41	4.1/0.26	10.4/0.30
GSFFs-PR	DV	480	150-300	8.1	19.7/0.27	21.7/0.36	4.7/0.23	8.7/0.29
GSFFs-PR	DV	480	50-50	1.8	74.6/1.26	162/2.81	16.4/0.73	90.2/2.68
ACE	-	-	-	-	28.0/0.4	31.0/0.6	5.0/0.3	18.0/0.6
GSFFs-PR	ACE	1024	150-300	45	16.1/0.22	17.9/0.34	3.9/0.18	7.6/0.23
GSFFs-PR	ACE	480	150-300	8.1	17.8/0.22	18.7/0.33	4.7/0.21	8.5/0.25
GSFFs-PR	ACE	480	75-150	4	18.2/0.24	21.1/0.35	4.8/0.22	8.5/0.25
GSFFs-PR	ACE	480	50-50	1.8	19.1/0.27	25.4/0.49	5.1/0.26	11.1/0.35
GSFFs-PR	ACE	480	25-25	0.9	19.8/0.44	25.6/0.66	5.8/0.45	13.2/0.53
Hloc	-	-	-	-	11.1/0.20	15.5/0.31	4.4/0.2	7.1/0.24
GSFFs-PR	Hloc	1024	0-50	5	10.9/0.18	14.1/0.29	3.9/0.19	6.4/0.21
GSFFs-PR	Hloc	480	0-50	0.9	11.0/0.19	14.2/0.30	3.9/0.18	6.4/0.22

Table 9: Varying initialization and image resolution.

C.4 Varying resolution and refinement steps

In Table 9, we also provide results for different numbers of coarse/fine refinement steps and image resolutions, together with the runtime per query. We can see that performance on high-resolution images is slightly better than on low-resolution images, but this comes at a higher inference cost. We can further decrease the running time by decreasing the number of refinement steps. The loss in performance is relatively small provided that we start from a good initialization. This suggests that we could further increase the localization speed by combining low- and high-resolution refinement and stopping the optimization earlier.

Appendix D Privacy attack

To visualize and assess the degree of privacy of GSFFs-PR-Privacy, we train an inversion model [58, 55] to reconstruct images from rendered segmentations. As a baseline, we train another inversion model to reconstruct images from feature maps rendered from GSFFs-PR-Feature. Both inversion models are trained on 6 scenes (fire, heads, office, pumpkin, redkitchen, stairs) from the 7Scenes dataset and evaluated on the remaining scene, chess. Example reconstructed images from both GSFFs-PR-Feature and GSFFs-PR-Privacy are displayed in Fig. 4. We can observe that images reconstructed from GSFFs-PR-Feature reveal many scene details, while images reconstructed from GSFFs-PR-Privacy totally obfuscate privacy-sensitive information.

Appendix E Visualizations

We show in Fig. 5 pairs of 2D extracted / 3D rendered segmentations for the coarse and fine levels, comparing segmentations from models trained with 34 and 84 classes. The segmentation classes of the coarse level capture larger and less sharply defined segments in the image than the fine-level segmentations. As shown in the ablation in Table 5, this allows us to increase the convergence basin of the pose refinement, while improving fine accuracy. Using 84 classes instead of 34 results in visually finer-grained segmentations, which in turn allow for more accurate pose refinement.