VMatcher: State-Space Semi-Dense Local Feature Matching

Ali Youssef
Abstract

This paper introduces VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. Learning-based feature matching methods, whether detector-based or detector-free, achieve state-of-the-art performance but depend heavily on the Transformer’s attention mechanism, which, while effective, incurs high computational costs due to its quadratic complexity. In contrast, Mamba introduces a Selective State-Space Model (SSM) that achieves comparable or superior performance with linear complexity, offering significant efficiency gains. VMatcher leverages a hybrid approach, integrating Mamba’s highly efficient long-sequence processing with the Transformer’s attention mechanism. Multiple VMatcher configurations are proposed, including hierarchical architectures, demonstrating their effectiveness in setting new benchmarks efficiently while ensuring robustness and practicality for real-time applications where rapid inference is crucial. Source Code is available at: https://github.com/ayoussf/VMatcher

1 Introduction

Finding reliable and accurate feature matches between image pairs is a core component in a multitude of Computer Vision tasks such as Structure-from-Motion (SfM), visual Simultaneous Localization and Mapping (vSLAM), Visual Localization, etc. Typically, feature detection and matching consist of three phases: feature detection, where distinctive interest points are extracted from an image; feature description, which involves encoding the local neighbourhood around the feature point into a stable high-dimensional descriptor vector; and feature matching, where descriptors from two images are paired to establish point-to-point correspondences, often through the mutual nearest-neighbour algorithm.

Traditionally, this problem has been approached with handcrafted methods, where feature detectors and descriptors such as [22, 28, 2] played a dominant role. These early methods relied on manually designed criteria to identify keypoints and establish matches, making them computationally efficient and interpretable. However, they often struggled under varying conditions, such as significant variation in viewpoint, illumination, and textureless regions, thus limiting their reliability in complex real-world scenarios. Over the past decade, learning-based methods have emerged, demonstrating greater robustness and consistently outperforming classical handcrafted approaches. This paradigm shift emphasises the importance of developing effective models and training strategies to maximise the potential of learning-based approaches.

In particular, the emergence of Transformers [39] marked a breakthrough in feature detection and matching, leveraging their attention mechanism to capture complex global dependencies across images. This capability enabled both detector-based and detector-free matching methods to achieve remarkable accuracy and performance. However, the quadratic complexity of attention with respect to the sequence length imposes a significant computational burden, challenging real-time and resource-limited applications. Recently, Mamba [12] introduced a Selective State-Space Model (SSM) designed for efficient handling of long sequences with linear complexity, achieving performance on par with or surpassing Transformers in language tasks. Mamba’s efficiency stems from a novel selection mechanism that dynamically adjusts its state-space parameters based on the input data, enabling effective modelling of complex patterns without the computational overhead associated with the attention mechanism. Nonetheless, Mamba’s autoregressive design limits its effectiveness in computer vision tasks, where spatial relationships are inherently parallel rather than sequential, and capturing global context is essential for accurate analysis [15].

Building on these insights, this work proposes VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching that merges the strengths of Mamba’s efficient linear processing capabilities with the Transformer’s powerful attention mechanism. This hybrid approach effectively addresses computational bottlenecks while preserving the benefits of global context modelling, overcoming the ineffectiveness associated with Mamba’s autoregressive formulation for spatial data, and mitigating the high computational costs inherent to Transformer architectures.

VMatcher introduces multiple design patterns and configurations, which achieve state-of-the-art accuracy on par with existing methods while improving computational efficiency, making it highly practical for real-world applications where rapid inference is essential. This work aims to strike a balance between robustness, speed, and accuracy in semi-dense matching, to establish VMatcher as a versatile and scalable solution for a diverse range of Computer Vision tasks.

2 Related Work

2.1 Detector-Based Feature Matching

Traditional feature detection and matching methods, such as [22, 2, 28], relied on handcrafted algorithms to identify repeatable, distinctive feature points and matched them using techniques such as nearest-neighbour (NN) search and mutual nearest neighbours (MNN). Despite their wide use, these classical methods often struggle with large variations in viewpoint, illumination, and appearance. To overcome these limitations, recent approaches have adopted learning-based methods, which deliver significantly better performance. SuperPoint [7] performs joint detection and description within a self-supervised Convolutional Neural Network (CNN) framework to improve robustness; however, it requires a complex training setup and high computational cost at large image scales. Following SuperPoint, other works [9, 27, 38, 11] have also leveraged CNN-based architectures for feature detection and description. Recently, XFeat [26] introduced a lightweight architecture that optimises both speed and accuracy for resource-efficient feature detection and matching, offering an ideal balance for real-time applications.

Additionally, learning-based matchers such as SuperGlue [30] use Graph Neural Networks (GNNs) to ensure consistent correspondences by modelling relationships between feature points; however, SuperGlue can be computationally expensive due to its quadratic complexity relative to the number of keypoints. In contrast, LightGlue [20] adopts an adaptive matching scheme that balances efficiency and accuracy by adjusting computation based on matching difficulty, allowing for faster processing while maintaining competitive accuracy.

2.2 Detector-Free Feature Matching

Detector-free approaches perform direct feature matching without explicit keypoint detection, improving flexibility and robustness in challenging scenarios. LoFTR [35] adopts a coarse-to-fine transformer framework, computing semi-dense correspondences to capture spatial relationships across an image pair without keypoint detection. While this approach results in high-quality matches, it suffers from significant computational overhead due to the need to compute the Transformer’s [39] attention over the entire feature map.

Subsequent works addressed efficiency limitations through various techniques. AspanFormer [3] introduced a hierarchical attention mechanism that dynamically adjusts the attention span based on flow maps, thereby reducing computational overhead while preserving matching accuracy. Similarly, MatchFormer [40] offers a more refined matching process which follows an extract-and-match methodology on multiple feature scales. QuadTree Attention [37] introduces a hierarchical attention mechanism that reduces computational cost without compromising performance. Rather than computing attention maps globally, it focuses on smaller, more relevant sections of the feature map, making it particularly effective for high-resolution matching tasks where managing computational load is crucial. Lastly, ELoFTR [41] enhances the LoFTR framework [35] through architectural refinements and the introduction of a novel transformer architecture that significantly improves efficiency while maintaining matching performance.

2.3 Mamba

The Mamba architecture [12] introduces a Selective Scan Mechanism, designed to effectively balance local and global feature extraction while maintaining flexibility for various tasks. Initially developed for language modelling, Mamba has since been adapted and evolved into several variants for computer vision applications, including Vim [44], which incorporated bidirectional encoding to improve spatial context, and EfficientVMamba [25], which combined SSMs with CNN layers for a hierarchical scanning approach. VMamba [21] introduced the Cross-Scan Module to increase the receptive field, though it remains constrained by the directional scanning paths.

More recently, MambaVision [15] introduced several modifications to the Mamba block, addressing its limitations in computer vision tasks. In particular, it replaces causal convolutions with standard convolutions and introduces a symmetric branch with a convolution and SiLU activation function. With its parallel architecture for both sequential and spatial processing, and the ability to model both short- and long-range dependencies, MambaVision [15] offers a rich feature representation with high efficiency, making it an ideal backbone for VMatcher’s architecture.

Figure 1: VMatcher pipeline overview. 1) A VGG-style backbone extracts multi-scale feature maps from images $I_A$ and $I_B$, including coarse feature maps $\tilde{F}_{C_A}$ and $\tilde{F}_{C_B}$ at $\frac{1}{8}$ resolution. 2) The hybrid Mamba-Transformer module processes these for enhanced discrimination. 3) Coarse-level matches $\{M_c\}$ are obtained via correlation and MNN matching. 4) Fine-level feature maps are generated by fusing $\tilde{F}_{C_A}$ and $\tilde{F}_{C_B}$ with the $\frac{1}{4}$ and $\frac{1}{2}$ resolution backbone feature maps. Two-stage refinement: extracting patches centred on $\{M_c\}$, correlating via MNN, and aligning each point from $I_A$ with a $3\times 3$ patch in $I_B$. Final-level matches $\{M_f\}$ are derived via softmax and expectation computation.

3 Methodology

This section presents VMatcher’s methodology and model architecture. Given a pair of images $I_A$ and $I_B$, the objective is to establish accurate correspondences between them efficiently. Fig. 1 provides an overview of VMatcher’s pipeline.

3.1 Feature Extraction

VMatcher adopts a lightweight and compact VGG-style feature extractor to maximise efficiency without compromising performance. The backbone consists of three sequential groups, each containing three blocks of Conv2D-BatchNorm-ReLU layers. Coarse feature maps $\tilde{F}_{C_A}$ and $\tilde{F}_{C_B}$ are extracted at $\frac{1}{8}$ resolution, along with finer feature maps at $\frac{1}{4}$ and $\frac{1}{2}$ resolutions, which are used during the fine-level extraction and refinement stage (Sec. 3.5). Lastly, convolutional and batch normalisation layers are fused during inference to minimise computational overhead. In contrast to previous work [35, 41, 3, 40], a lightweight extractor is favoured: it has far fewer parameters than, for instance, the RepVGG [8] feature extractor of ELoFTR [41] (1.5M vs. 9.5M), yet maintains effectiveness while improving computational efficiency, as detailed in Sec. 5.5.
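The fusion of convolution and batch normalisation mentioned above can be illustrated with the standard folding identity, in which the BatchNorm scale and shift are absorbed into the convolution weights and bias. The sketch below is a minimal illustration rather than VMatcher's implementation: the function names are hypothetical, and a 1×1 convolution on a single pixel is assumed for brevity (the same per-output-channel scaling applies to larger kernels).

```python
import math

def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm statistics into the preceding convolution.

    weight: one flat kernel (list of floats) per output channel.
    bias, gamma, beta, mean, var: one float per output channel.
    """
    fused_w, fused_b = [], []
    for w, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)           # per-channel BatchNorm scale
        fused_w.append([scale * wi for wi in w])
        fused_b.append(scale * (b - m) + bt)
    return fused_w, fused_b

def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to one pixel x (list of input channels)."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b
            for w, b in zip(weight, bias)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm using running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]

# Toy values (arbitrary, for illustration only).
weight = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.3]
gamma, beta = [1.5, 0.8], [0.2, -0.1]
mean, var = [0.05, -0.2], [0.9, 1.7]
x = [1.0, 2.0]

reference = batchnorm(conv1x1(x, weight, bias), gamma, beta, mean, var)
fused_w, fused_b = fuse_conv_bn(weight, bias, gamma, beta, mean, var)
fused = conv1x1(x, fused_w, fused_b)
```

After folding, a single convolution reproduces the output of the convolution followed by BatchNorm, so the normalisation layer adds no cost at inference.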

3.2 Mamba

3.2.1 Preliminaries

State-space models (SSMs) serve as a framework for representing continuous-time dynamics through two primary equations: the state transition equation and the observation equation. These equations define the behaviour of the system over time, and are based on a latent state $h(t)$ and state derivative $h'(t)$ that adjusts relative to input $x(t)$, with an output $y(t)$:

$$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t). \tag{1}$$
$$y(t) = \mathbf{C}h(t). \tag{2}$$

Here, matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ determine the system dynamics in continuous time.

Discretisation. To apply continuous models to discrete data, the system equations must be discretised. Mamba [12] leverages the Zero-Order Hold (ZOH) method, using a sampling interval $\Delta$ to map the continuous matrices $\mathbf{A}$ and $\mathbf{B}$ into their discrete counterparts, $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$:

$$\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}). \tag{3}$$
$$\bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\left(\exp(\Delta\mathbf{A}) - \mathbf{I}\right)\cdot\Delta\mathbf{B}. \tag{4}$$

This transformation allows the model to process sequential or other discrete inputs efficiently, aligning with modern digital computation [13].
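As a small numerical check of Eqs. (3) and (4), the sketch below discretises a scalar (one-dimensional, hence diagonal) system with ZOH and compares one discrete step against a fine-grained Euler integration of Eq. (1) with the input held constant. The scalar setting, the function names, and the parameter values are assumptions chosen for illustration, and $a \neq 0$ is required.

```python
import math

def zoh_discretise(a, b, delta):
    """Zero-Order Hold for a scalar state: returns (a_bar, b_bar)."""
    a_bar = math.exp(delta * a)                                   # Eq. (3)
    b_bar = (math.expm1(delta * a) / (delta * a)) * (delta * b)   # Eq. (4)
    return a_bar, b_bar

def euler_reference(a, b, h0, x, delta, steps=100_000):
    """Integrate h' = a*h + b*x over [0, delta] with x held constant."""
    dt = delta / steps
    h = h0
    for _ in range(steps):
        h += dt * (a * h + b * x)
    return h

a, b, delta = -0.7, 1.3, 0.5      # arbitrary illustrative values
h0, x = 0.4, 2.0

a_bar, b_bar = zoh_discretise(a, b, delta)
h_zoh = a_bar * h0 + b_bar * x    # one exact discrete step
h_ref = euler_reference(a, b, h0, x, delta)
```

Because ZOH assumes the input is constant over each sampling interval, the single discrete step matches the continuous solution up to the integration error of the reference.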

Convolutional Computation. Mamba [12] accelerates sequence processing by reformulating the discretised system as a convolution with the discretised matrices $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ rather than computing outputs sequentially, thus enhancing parallelisation. The output $y_i$ at time step $i$ is computed as:

$$y_i = \sum_{j=0}^{i} \mathbf{C}\bar{\mathbf{A}}^{i-j}\bar{\mathbf{B}}x_j. \tag{5}$$

By constructing a convolutional kernel $\bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}}, \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}}, \dots, \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}\right)$, the output is efficiently computed as:

$$y = x \ast \bar{\mathbf{K}}. \tag{6}$$

Here, $x = [x_0, x_1, \dots]$ and $y = [y_0, y_1, \dots] \in \mathbb{R}^{L}$, with $L$ denoting the sequence length. This convolutional structure enables parallel matrix operations, improving efficiency on high-dimensional data.
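The equivalence between the sequential recurrence and the convolutional form of Eqs. (5) and (6) can be verified on a toy example. The sketch below assumes a scalar state and time-invariant parameters; the function names and values are illustrative only.

```python
def ssm_recurrent(a_bar, b_bar, c, x):
    """Sequential form: h_t = a_bar*h_{t-1} + b_bar*x_t, y_t = c*h_t."""
    h, y = 0.0, []
    for xt in x:
        h = a_bar * h + b_bar * xt
        y.append(c * h)
    return y

def ssm_kernel(a_bar, b_bar, c, length):
    """Kernel entries K_k = c * a_bar^k * b_bar for k = 0..length-1."""
    return [c * (a_bar ** k) * b_bar for k in range(length)]

def causal_conv(x, kernel):
    """y_i = sum_{j=0}^{i} K_{i-j} * x_j, the causal convolution of Eq. (6)."""
    return [sum(kernel[i - j] * x[j] for j in range(i + 1))
            for i in range(len(x))]

x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
a_bar, b_bar, c = 0.8, 0.5, 1.2

y_rec = ssm_recurrent(a_bar, b_bar, c, x)
y_conv = causal_conv(x, ssm_kernel(a_bar, b_bar, c, len(x)))
```

Both forms produce the same outputs; the convolutional form has no dependency between time steps, which is what allows it to be evaluated in parallel.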

Selection Mechanism. Traditional SSMs lack the flexibility to adjust outputs based on inputs due to inherent time invariance. Mamba [12] overcomes this by introducing an input-dependent selection mechanism that generates dynamic time-varying weight matrices, allowing it to filter irrelevant information while preserving details relevant to each input.
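To make the idea of input-dependent parameters concrete, the following sketch runs a single-channel scan in which the step size, input matrix, and output matrix are recomputed from each input value. It is a simplified illustration, not Mamba's implementation: the scalar projection weights `w_delta`, `b_delta`, `w_b`, and `w_c` are hypothetical, the state matrix is assumed diagonal, and the input matrix is discretised with the simplification $\bar{\mathbf{B}} \approx \Delta\mathbf{B}$ instead of the full expression in Eq. (4).

```python
import math

def softplus(v):
    """Smooth positive mapping, used here to keep the step size positive."""
    return math.log1p(math.exp(v))

def selective_scan(x, a, w_delta, b_delta, w_b, w_c):
    """Single-channel selective SSM with a diagonal state matrix.

    x: input sequence (list of floats).
    a: diagonal entries of A (negative floats), one per state dimension.
    w_delta, b_delta, w_b, w_c: hypothetical projection weights that make
        the step size, B_t and C_t depend on the current input x_t.
    """
    h = [0.0] * len(a)
    ys, deltas = [], []
    for xt in x:
        delta = softplus(w_delta * xt + b_delta)    # input-dependent step size
        b_t = [wb * xt for wb in w_b]               # input-dependent B_t
        c_t = [wc * xt for wc in w_c]               # input-dependent C_t
        h = [math.exp(delta * an) * hn + delta * bn * xt
             for an, hn, bn in zip(a, h, b_t)]
        ys.append(sum(cn * hn for cn, hn in zip(c_t, h)))
        deltas.append(delta)
    return ys, deltas

a = [-1.0, -0.5]
w_b, w_c = [0.3, -0.2], [1.0, 0.5]
ys, deltas = selective_scan([0.5, -1.0, 2.0], a, 0.8, 0.1, w_b, w_c)
zero_ys, _ = selective_scan([0.0, 0.0], a, 0.8, 0.1, w_b, w_c)
```

A larger step size lets the current input overwrite more of the state, while a smaller one preserves the existing state, which is how the scan can suppress or retain information per token.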

3.2.2 MambaVision

MambaVision [15] modifies vanilla Mamba [12] to enhance feature representation and generalisation in vision tasks. As shown in Fig. 2, the causal 1D convolution is replaced with a regular (non-causal) 1D convolution to eliminate directional limitations. A symmetric branch bypassing the selective scan mechanism is added, consisting of a 1D convolution layer with SiLU activation, to recover information potentially lost due to sequential constraints imposed by the SSM. To maintain a parameter count similar to that of vanilla Mamba [12], inputs are projected to half the embedding dimension in each branch, and the branch outputs are concatenated and projected via a final linear layer. The output $y$ is computed as:

$$x = \text{SSM}\left(\sigma\left(\text{Conv}\left(\text{Linear}\left(C, \tfrac{C}{2}\right)(x_{\text{in}})\right)\right)\right), \tag{7}$$
$$z = \sigma\left(\text{Conv}\left(\text{Linear}\left(C, \tfrac{C}{2}\right)(x_{\text{in}})\right)\right), \tag{8}$$
$$y = \text{Linear}(C, C)\left(\text{Concat}(x, z)\right). \tag{9}$$

Here, $\sigma$ denotes the SiLU activation and $C$ the embedding dimension.
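The data flow of Eqs. (7) to (9) can be sketched in plain Python as follows. This is only a structural illustration under several assumptions: `toy_ssm` is a fixed linear recurrence standing in for the selective scan, the convolution shares one kernel across channels, and all weights are arbitrary. None of the names come from the MambaVision or VMatcher code.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def linear(x, weight):
    """x: (L, C_in) list of tokens; weight: (C_out, C_in). Returns (L, C_out)."""
    return [[sum(w * xi for w, xi in zip(row, tok)) for row in weight]
            for tok in x]

def conv1d_same(x, kernel):
    """Non-causal 1D convolution along the sequence with zero padding.
    One odd-length kernel is shared by all channels (a simplification)."""
    half = len(kernel) // 2
    length, channels = len(x), len(x[0])
    out = []
    for t in range(length):
        row = []
        for ch in range(channels):
            acc = 0.0
            for k, tap in enumerate(kernel):
                s = t + k - half          # reads past AND future tokens
                if 0 <= s < length:
                    acc += tap * x[s][ch]
            row.append(acc)
        out.append(row)
    return out

def toy_ssm(x, decay=0.9):
    """Stand-in for the selective scan: per-channel linear recurrence."""
    h = [0.0] * len(x[0])
    out = []
    for tok in x:
        h = [decay * hi + xi for hi, xi in zip(h, tok)]
        out.append(list(h))
    return out

def mambavision_mixer(x_in, w_x, w_z, w_out, kernel):
    act = lambda seq: [[silu(v) for v in tok] for tok in seq]
    x = toy_ssm(act(conv1d_same(linear(x_in, w_x), kernel)))   # Eq. (7)
    z = act(conv1d_same(linear(x_in, w_z), kernel))            # Eq. (8)
    concat = [xt + zt for xt, zt in zip(x, z)]                 # C/2 + C/2 = C
    return linear(concat, w_out)                               # Eq. (9)

# Toy configuration: sequence length 5, embedding dimension C = 4.
x_in = [[0.2 * (t + 1) + 0.1 * c for c in range(4)] for t in range(5)]
w_x = [[0.1 * (i + j + 1) for j in range(4)] for i in range(2)]
w_z = [[0.05 * (j - i + 1) for j in range(4)] for i in range(2)]
w_out = [[1.0 if i == j else 0.1 for j in range(4)] for i in range(4)]
kernel = [0.25, 0.5, 0.25]
y = mambavision_mixer(x_in, w_x, w_z, w_out, kernel)
```

Because the convolution is not causal, the output at the first position already depends on the second input token, which a causal convolution would not allow.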
Figure 2: Mamba block comparison: (a) vanilla Mamba [12] block vs. (b) MambaVision [15] block. The MambaVision block enhances vanilla Mamba by replacing the causal convolution with a regular 1D convolution and adding a symmetric branch, addressing the limitations of causal convolutions and thus improving performance in vision tasks.

3.3 Downsampled-Transformer

Following the MambaVision module, feature vectors $\tilde{F}_{C_A}$ and $\tilde{F}_{C_B}$ are passed through the Downsampled-Transformer (DS-Transformer) module, which computes self-attention or cross-attention. The standard attention mechanism combines queries $Q$, keys $K$, and values $V$ through weighted sums:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V. \tag{10}$$

Here, $d_k$ is the dimensionality of the key vectors, and the factor $\frac{1}{\sqrt{d_k}}$ scales the dot products for stability.
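Eq. (10) can be written out directly for small inputs. The sketch below is a minimal single-head version with illustrative names; for self-attention, `Q`, `K`, and `V` would be derived from the same image, and for cross-attention, `Q` would come from one image and `K`, `V` from the other.

```python
import math

def softmax(row):
    m = max(row)                          # subtract the max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v), all as lists of lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# A zero query scores all keys equally, so the output is the mean of V.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 4.0]]
out = attention(Q, K, V)
```

The score computation touches every query-key pair, which is the source of the quadratic cost discussed in the surrounding text.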

This form of attention effectively models dependencies between elements but can be computationally intensive. To mitigate computational costs, previous work has explored linear attention [35], token-pyramid downsampling transformers [37], or, in ELoFTR [41], an aggregated attention module that applies depth-wise convolution to the query-generating features and max-pooling to the key- and value-generating features. In contrast, VMatcher leverages Mamba’s [12] selective scan mechanism to filter irrelevant information, eliminating the need for the intricate techniques employed by prior methods. As such, $\tilde{F}_{C_A}$ and $\tilde{F}_{C_B}$ are simply downsampled using bilinear interpolation before attention computation and upsampled afterwards, which minimises model parameters while maintaining performance.
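The effect of downsampling on attention cost can be estimated by counting tokens and query-key pairs. The sketch below uses a 640×480 input (the indoor resolution used in Sec. 5.2) at $\frac{1}{8}$ resolution; the additional downsampling factor of 4 is a hypothetical value chosen for illustration, since the factor used by VMatcher is not stated in this section.

```python
def attention_pairs(height, width, stride=8, downsample=1):
    """Return (number of tokens, number of query-key pairs) for a feature
    map at 1/stride resolution, optionally downsampled further."""
    h = height // (stride * downsample)
    w = width // (stride * downsample)
    tokens = h * w
    return tokens, tokens * tokens

# Coarse feature map at 1/8 resolution: 60 x 80 tokens.
full_tokens, full_pairs = attention_pairs(480, 640)

# Hypothetical extra downsampling by 4 before attention: 15 x 20 tokens.
small_tokens, small_pairs = attention_pairs(480, 640, downsample=4)

reduction = full_pairs // small_pairs
```

Since the number of pairs grows with the square of the token count, reducing each spatial dimension by a factor $s$ shrinks the attention matrix by $s^4$ (here $4^4 = 256$).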

Moreover, the DS-Transformer module omits the typical Multi-Layer Perceptron (MLP) post-attention layer and instead adds the attention output directly to the input, as including the MLP led to degraded results (see Sec. 5.5). Lastly, similar to previous work [20, 35, 41, 3, 40], Rotary Positional Embedding (RoPE) [34] is employed for queries and keys during self-attention to capture spatial dependencies. However, as Mamba’s [12] sequence modelling properties and parameter structure [19, 15] inherently encode positional information, utilising RoPE [34] may not be necessary, which is further detailed in Sec. 5.5. An illustration of the DS-Transformer can be seen in Fig. 3.

Figure 3: Downsampled Transformer illustration.

3.4 Coarse-Level Matching

Coarse-level correspondences are derived by densely correlating the transformed feature vectors $\tilde{F}^{t}_{C_A}$ and $\tilde{F}^{t}_{C_B}$ to compute the score matrix $S_C$, followed by a dual-softmax operation across both dimensions of $S_C$ to generate a matching probability map, as in [35, 41, 38, 11]. Coarse-level matches $\{M_c\}$ are obtained using mutual nearest neighbour (MNN) matching and a confidence threshold $\tau$.
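The coarse matching step can be sketched as a dual-softmax over a score matrix followed by mutual nearest neighbour selection with a threshold. The example below is a minimal illustration on a 3×3 score matrix; the function names, the scores, and the threshold value of 0.2 are assumptions, not values taken from VMatcher.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dual_softmax(S):
    """Multiply the row-wise and column-wise softmax of score matrix S."""
    n_rows, n_cols = len(S), len(S[0])
    rows = [softmax(r) for r in S]
    cols = [softmax([S[i][j] for i in range(n_rows)]) for j in range(n_cols)]
    return [[rows[i][j] * cols[j][i] for j in range(n_cols)]
            for i in range(n_rows)]

def mutual_nearest_neighbours(P, tau):
    """Keep (i, j) if each is the other's best match and P[i][j] > tau."""
    matches = []
    for i, row in enumerate(P):
        j = max(range(len(row)), key=row.__getitem__)       # best j for i
        i_back = max(range(len(P)), key=lambda r: P[r][j])  # best i for j
        if i_back == i and P[i][j] > tau:
            matches.append((i, j))
    return matches

# Rows index coarse features in image A, columns those in image B.
S = [[6.0, 1.0, 0.0],
     [0.5, 5.0, 0.5],
     [1.0, 0.0, 0.2]]
P = dual_softmax(S)
matches = mutual_nearest_neighbours(P, tau=0.2)
all_mutual = mutual_nearest_neighbours(P, tau=0.0)
```

In this example the third feature is a mutual nearest neighbour of itself but has low confidence, so it passes the MNN check and is rejected only by the threshold.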

3.5 Fine-Level Extraction and Refinement

Adapted from ELoFTR [41], the fine-level refinement module efficiently achieves sub-pixel precision through two components:

Fine Feature Extraction. The transformed coarse feature vectors $\tilde{F}^{t}_{C_A}$ and $\tilde{F}^{t}_{C_B}$ are fused with backbone feature maps extracted at $\frac{1}{4}$ and $\frac{1}{2}$ resolutions through convolution and upsampling operations, yielding fine maps $\hat{F}_{f_A}$ and $\hat{F}_{f_B}$ at fine-level resolution. Fine-level feature patches centred on coarse matches $\{M_c\}$ are then extracted from $\hat{F}_{f_A}$ and $\hat{F}_{f_B}$, allowing efficient yet precise refinement.

Two-Stage Refinement. First, dense correlation between fine-level feature patches produces a local patch score matrix $S_f$, where high-confidence pixel-level matches are identified via mutual nearest neighbour (MNN) matching. Second, each point in $I_A$ is correlated with a $3\times 3$ feature patch centred at its matched location in $I_B$. Softmax is applied to the resulting match distribution matrix, and the refined matches $\{M_f\}$ are obtained by computing the expectation.
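The second refinement stage, softmax followed by an expectation over a $3\times 3$ neighbourhood, can be sketched as below. The function name, the coordinate convention (offsets of -1, 0, +1 in fine-level pixel units around the matched location), and the example scores are assumptions made for illustration.

```python
import math

def refine_subpixel(scores, centre):
    """Sub-pixel refinement by expectation over a 3x3 score patch.

    scores: 3x3 list of correlation scores around the matched location.
    centre: (x, y) position of the patch centre in image B.
    """
    flat = [s for row in scores for s in row]
    m = max(flat)
    exps = [math.exp(s - m) for s in flat]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Expected offset: column index gives dx, row index gives dy.
    dx = sum(p * (k % 3 - 1) for k, p in enumerate(probs))
    dy = sum(p * (k // 3 - 1) for k, p in enumerate(probs))
    return centre[0] + dx, centre[1] + dy

# Uniform scores: no evidence for any offset, the match stays at the centre.
flat_x, flat_y = refine_subpixel([[0.0] * 3 for _ in range(3)], (10.0, 20.0))

# A strong response on the right-hand neighbour pulls the match to the right.
peaked = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 4.0],
          [0.0, 0.0, 0.0]]
ref_x, ref_y = refine_subpixel(peaked, (10.0, 20.0))
```

Because the expectation is a weighted average of offsets within the patch, the refined location always stays within one pixel of the centre in each direction.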

3.6 Supervision

Following [30, 35, 41], coarse-level supervision is achieved using ground truth matches $\{M_c\}_{gt}$ generated by warping grid points from $I_A$ to $I_B$ using depth maps and camera poses. The coarse-level score matrix $S_C$ is supervised by minimising the negative log-likelihood loss at $\{M_c\}_{gt}$:

$$L_c = -\frac{1}{N}\sum_{(\tilde{i},\tilde{j})\in\{M_c\}_{gt}} \log S_C(\tilde{i},\tilde{j}), \tag{11}$$

where $N$ is the number of ground truth coarse matches.

Fine-level supervision adopts the two-stage approach of [41]. $L_{f_1}$ minimises the negative log-likelihood of each fine local score matrix $S_f$ using pixel-level ground truth matches, while $L_{f_2}$ computes the $\ell_2$ loss between the final sub-pixel matches $\{M_f\}$ and the ground truth matches $\{M_f\}_{gt}$, as in [3, 35, 41]. The total loss is expressed as:

$$L = L_c + \alpha L_{f_1} + \beta L_{f_2}. \tag{12}$$
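The coarse loss of Eq. (11) and the weighted sum of Eq. (12) can be sketched as follows. The probability values are toy numbers, and the defaults for the weights `alpha` and `beta` are placeholders, since their values are not given in this section.

```python
import math

def coarse_nll(prob, gt_matches):
    """Mean negative log-likelihood at ground truth coarse matches, Eq. (11).

    prob: matching probability matrix as a list of lists.
    gt_matches: list of (i, j) index pairs.
    """
    return -sum(math.log(prob[i][j]) for i, j in gt_matches) / len(gt_matches)

def total_loss(l_c, l_f1, l_f2, alpha=1.0, beta=1.0):
    """Weighted sum of coarse and fine losses, Eq. (12).
    alpha and beta are placeholder weights."""
    return l_c + alpha * l_f1 + beta * l_f2

prob = [[0.5, 0.1],
        [0.2, 0.25]]
gt = [(0, 0), (1, 1)]
l_c = coarse_nll(prob, gt)                       # (ln 2 + ln 4) / 2
combined = total_loss(1.0, 2.0, 3.0, alpha=0.5, beta=0.25)
```

The loss approaches zero as the probabilities at the ground truth locations approach one, and grows without bound as they approach zero.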

4 Implementation Details

VMatcher Variations. VMatcher offers two configurations: VMatcher-B (Base) with 9.5M parameters and VMatcher-T (Tiny) with 6.9M parameters; architecture details are provided in the Supplementary Material. Moreover, Sec. 5 includes optimised variants of both models, in which the dual-softmax operation is omitted during coarse-level matching, following [41]. This modification substantially reduces inference time while largely maintaining performance.

Inference Time Measurement. Comparing inference times between sparse and semi-dense methods is not straightforward. A direct end-to-end comparison is slightly biased towards semi-dense methods, since applications such as SfM extract sparse feature points once and cache them, making inference time primarily a factor of feature matching. This does not extend to semi-dense methods, as caching large feature maps is costly. Thus, for fairness, feature detection and matching times are reported separately for sparse methods in Sec. 5.

5 Experiments

To ensure a fair evaluation, the LightGlue [20] approach is adopted, tuning RANSAC’s inlier threshold for each model on each test dataset, as the emphasis is on evaluating the models rather than RANSAC itself. Tabs. 1, 2 and 3 present results using PoseLib [17] LO-RANSAC and OpenCV RANSAC [10], with runtimes marked mp if mixed precision was used during inference. Dense methods are excluded despite strong performance, as their computational demands and runtime significantly exceed those of sparse and semi-dense approaches. For consistency, Flash-Attention [5] is omitted to ensure fair comparisons.

5.1 Homography Estimation

Homography estimation is assessed on the HPatches dataset [1], which includes 57 planar scenes with illumination changes and 59 with viewpoint changes. Each scene comprises 6 images linked by ground truth homography matrices.

Baselines. Comparisons are made against sparse methods, including SuperPoint [7] with Nearest Neighbours and SuperGlue [30], as well as SuperPoint, DISK [38], and Aliked [43] with LightGlue [20]. For semi-dense methods, VMatcher is evaluated against QuadTree [37], AspanFormer [3], LoFTR [35], and ELoFTR [41].

Evaluation Protocol. In line with prior work [35, 30, 11], the shorter side of each image is resized to 480 pixels. The mean reprojection error of the image’s corner points is computed, and the area under the cumulative error curve (AUC) is reported at thresholds of 3, 5, and 10 pixels. For consistency with sparse methods, results are reported using the top 1024 predicted matches.
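The metric described above can be sketched in two steps: a mean corner error between an estimated and a ground truth homography, and the area under the cumulative error curve up to each threshold. The trapezoidal AUC below follows a commonly used formulation and is an assumption about the exact protocol; the function names and the toy error values are illustrative.

```python
import math

def warp_point(H, x, y):
    """Apply a 3x3 homography H to the point (x, y)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def mean_corner_error(H_est, H_gt, width, height):
    """Mean distance between the four image corners warped by H_est and H_gt."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    dists = []
    for x, y in corners:
        xe, ye = warp_point(H_est, x, y)
        xg, yg = warp_point(H_gt, x, y)
        dists.append(math.hypot(xe - xg, ye - yg))
    return sum(dists) / len(dists)

def error_auc(errors, thresholds):
    """Area under the recall-vs-error curve, normalised by each threshold."""
    errs = [0.0] + sorted(errors)
    recall = [i / len(errors) for i in range(len(errors) + 1)]
    aucs = []
    for t in thresholds:
        last = sum(1 for e in errs if e < t)    # points strictly below t
        e = errs[:last] + [t]                   # close the curve at t
        r = recall[:last] + [recall[last - 1]]
        area = sum((e[k + 1] - e[k]) * (r[k + 1] + r[k]) / 2.0
                   for k in range(len(e) - 1))
        aucs.append(area / t)
    return aucs

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shifted = [[1, 0, 3], [0, 1, 4], [0, 0, 1]]     # translation by (3, 4)
corner_err = mean_corner_error(shifted, identity, 640, 480)

aucs = error_auc([1.0, 2.0, 4.0, 20.0], thresholds=[3, 5, 10])
```

An AUC of 1 at a given threshold would mean every image pair has zero corner error, so the metric rewards both the fraction of pairs under the threshold and how small their errors are.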

Results. As shown in Tab. 1, VMatcher models outperform both sparse and semi-dense methods despite limiting keypoints to 1024. VMatcher-B matches ELoFTR [41] in runtime (further analysed in Sec. 5.4), while VMatcher-T processes images 1.8$\times$ faster than LoFTR [35] and 2.2$\times$ faster than AspanFormer [3] while maintaining competitive accuracy. The optimised variants further enhance efficiency, with VMatcher-T$_{\text{Opt}}$ reducing runtime by 16.9% relative to VMatcher-T, achieving an optimal balance between accuracy and computational efficiency.

| Matcher Type | Method | LO-RANSAC (AUC@3px / AUC@5px / AUC@10px) | RANSAC (AUC@3px / AUC@5px / AUC@10px) | Time (ms) |
|---|---|---|---|---|
| Sparse | SP+NN | 63.7 / 74.5 / 84.4 | 54.6 / 67.8 / 80.2 | 17.50 (15.58/2.36) |
| Sparse | SP+SG | 68.3 / 78.6 / 87.8 | 59.8 / 71.0 / 82.8 | 46.05 (15.58/30.47$^{mp}$) |
| Sparse | SP+LG | 67.7 / 78.0 / 87.2 | 59.1 / 72.1 / 83.7 | 34.11 (15.14/18.97$^{mp}$) |
| Sparse | DISK+LG | 61.7 / 74.0 / 84.9 | 52.6 / 67.2 / 80.7 | 51.28 (33.42/17.85$^{mp}$) |
| Sparse | Aliked+LG | 66.3 / 77.1 / 86.7 | 61.1 / 73.4 / 84.5 | 36.74 (21.01/15.73$^{mp}$) |
| Semi-Dense | QuadTree | 69.5 / 79.2 / 87.3 | 67.4 / 77.4 / 85.9 | 100.64$^{mp}$ |
| Semi-Dense | AspanFormer | 68.7 / 78.5 / 86.9 | 66.7 / 76.6 / 85.7 | 65.54$^{mp}$ |
| Semi-Dense | LoFTR | 69.5 / 78.7 / 86.7 | 67.6 / 76.9 / 85.4 | 52.35$^{mp}$ |
| Semi-Dense | ELoFTR | 69.6 / 78.9 / 87.1 | 67.5 / 77.3 / 86.0 | 34.48$^{mp}$ |
| Semi-Dense | VMatcher-B | 69.8 / 79.1 / 87.5 | 67.8 / 77.5 / 86.4 | 35.43$^{mp}$ |
| Semi-Dense | VMatcher-T | 70.2 / 79.3 / 87.4 | 67.8 / 77.3 / 86.0 | 29.23$^{mp}$ |
| Semi-Dense-Opt | ELoFTR$_{\text{Opt}}$ | 68.6 / 78.2 / 86.7 | 65.6 / 75.4 / 84.5 | 28.62$^{mp}$ |
| Semi-Dense-Opt | VMatcher-B$_{\text{Opt}}$ | 68.9 / 78.5 / 86.9 | 66.4 / 76.1 / 85.3 | 29.58$^{mp}$ |
| Semi-Dense-Opt | VMatcher-T$_{\text{Opt}}$ | 67.6 / 77.5 / 86.5 | 64.7 / 74.8 / 84.7 | 24.29$^{mp}$ |

Table 1: Homography Estimation Results on HPatches. Reprojection error AUC scores and runtimes for sparse and semi-dense methods. For sparse methods, times are given as total (detection/matching); $mp$ denotes mixed precision.

5.2 Relative Pose Estimation

Relative pose estimation evaluation is performed on both outdoor and indoor datasets. Following prior work [20, 30, 41, 35] for outdoor scenes, the MegaDepth dataset [18] test split from [35] is utilised, which includes 1500 image pairs from “Sacre Coeur” and “St. Peter’s Square”. For indoor scenes, the ScanNet dataset [4] is used, which comprises 1613 sequences capturing challenges such as viewpoint variation and textureless regions. Evaluation is performed on the ScanNet [4] test split of 1500 image pairs from [30].

Baselines. Relative pose estimation is evaluated against the same baselines as in Sec. 5.1.

Evaluation Protocol. For outdoor scenes, images are resized so that the longest edge is 1184 pixels for semi-dense methods and 1600 pixels for sparse methods, as specified in [20]. For indoor scenes, images are resized to $640\times 480$ pixels. The pose error is measured as the maximum of the angular rotation and translation errors, with AUC reported at thresholds of $5^{\circ}$, $10^{\circ}$, and $20^{\circ}$. Sparse methods use the top 2048 keypoints for evaluation.
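The pose error described above can be sketched as follows. This is a minimal illustration assuming rotations are given as 3x3 nested lists and translations as 3-vectors; the function names are hypothetical, and any additional handling used in practice (for example, sign ambiguity of the estimated translation) is deliberately left out.

```python
import math


def rotation_error_deg(R_est, R_gt):
    """Angle of the relative rotation R_est^T R_gt, in degrees."""
    # trace(R_est^T R_gt) equals the element-wise (Frobenius) inner product.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_deg(t_est, t_gt):
    """Angle between estimated and ground-truth translation directions."""
    dot = sum(a * b for a, b in zip(t_est, t_gt))
    norm = math.sqrt(sum(a * a for a in t_est)) * math.sqrt(sum(b * b for b in t_gt))
    if norm == 0.0:
        raise ValueError("translation vectors must be non-zero")
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def pose_error_deg(R_est, t_est, R_gt, t_gt):
    """Pose error as the maximum of the rotation and translation angles."""
    return max(rotation_error_deg(R_est, R_gt), translation_error_deg(t_est, t_gt))
```

Because only the direction of the translation is compared, estimates that differ from the ground truth purely in scale incur no translation error.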

Results. As shown in Tab. 2, VMatcher variants outperform all sparse and semi-dense methods on MegaDepth [18] while offering considerable runtime advantages. However, the performance differences between models are minimal ($\approx 1.2\%$ in AUC@$10^{\circ}$), with the primary distinction lying in computational efficiency: VMatcher-B is $1.15\times$ faster than ELoFTR [41] and $1.45\times$ faster than LoFTR [35], while VMatcher-T achieves even greater speed-ups of $1.26\times$ and $1.60\times$, respectively. The optimised variants achieve near-sparse processing speeds with an insignificant degradation in performance.

From Tab. 3, VMatcher performance is comparable to other semi-dense methods, with AspanFormer [3] leading in accuracy. Notably, VMatcher-B’s runtime is similar to that of ELoFTR [41], consistent with the findings in Sec. 5.1 (further analysed in Sec. 5.4), while VMatcher-T offers additional runtime advantages. Interestingly, VMatcher-T$_{\text{Opt}}$ strikes the best balance between accuracy and runtime, outperforming all sparse methods in both aspects.

A key limitation to highlight is the instability in relative pose estimation, particularly in translation error computation. The error is derived from the angular difference between ground-truth and estimated translation vectors, which becomes unreliable due to scale ambiguity [14, 42], especially with small ground-truth translation norms ($\|t_{GT}\|$). This affects 139 of 1500 image pairs in the ScanNet [4] test split.

| Matcher Type | Method | LO-RANSAC (AUC@$5^{\circ}$ / $10^{\circ}$ / $20^{\circ}$) | RANSAC (AUC@$5^{\circ}$ / $10^{\circ}$ / $20^{\circ}$) | Time (ms) |
|---|---|---|---|---|
| Sparse | SP+NN | 51.1 / 64.4 / 73.9 | 39.1 / 54.3 / 66.4 | 43.58 (40.73/2.85) |
| Sparse | SP+SG | 66.0 / 78.8 / 87.6 | 50.7 / 68.2 / 80.9 | 82.61 (40.73/41.88mp) |
| Sparse | SP+LG | 66.8 / 79.4 / 88.0 | 51.5 / 68.4 / 81.1 | 58.25 (40.73/17.54mp) |
| Sparse | DISK+LG | 61.1 / 74.2 / 83.9 | 44.7 / 62.5 / 76.4 | 161.04 (143.67/17.36mp) |
| Sparse | Aliked+LG | 66.3 / 78.9 / 87.5 | 51.7 / 68.3 / 80.1 | 80.04 (62.58/17.45mp) |
| Semi-Dense | QuadTree | 66.5 / 78.9 / 87.2 | 51.5 / 68.4 / 80.6 | 398.38 |
| Semi-Dense | AspanFormer | 69.2 / 80.9 / 88.7 | 55.2 / 71.5 / 83.2 | 160.73mp |
| Semi-Dense | LoFTR | 68.3 / 80.1 / 88.3 | 53.5 / 69.8 / 81.9 | 126.94mp |
| Semi-Dense | ELoFTR | 69.4 / 80.8 / 88.7 | 56.0 / 72.1 / 83.6 | 100.25mp |
| Semi-Dense | VMatcher-B | 69.6 / 81.1 / 88.9 | 56.0 / 72.2 / 83.6 | 87.56mp |
| Semi-Dense | VMatcher-T | 69.4 / 81.0 / 88.8 | 55.8 / 71.9 / 83.4 | 79.46mp |
| Semi-Dense-Opt | ELoFTR$_{\text{Opt}}$ | 68.8 / 80.3 / 88.2 | 55.9 / 71.9 / 83.3 | 73.52mp |
| Semi-Dense-Opt | VMatcher-B$_{\text{Opt}}$ | 69.5 / 81.0 / 88.7 | 56.3 / 72.3 / 83.6 | 66.16mp |
| Semi-Dense-Opt | VMatcher-T$_{\text{Opt}}$ | 69.0 / 80.7 / 88.4 | 55.5 / 71.6 / 83.2 | 55.94mp |

Table 2: Outdoor Relative Pose Estimation Results on the MegaDepth dataset [18]. VMatcher variants deliver improved performance and runtime speed advantages compared to semi-dense methods.

| Matcher Type | Method | LO-RANSAC (AUC@$5^{\circ}$ / $10^{\circ}$ / $20^{\circ}$) | RANSAC (AUC@$5^{\circ}$ / $10^{\circ}$ / $20^{\circ}$) | Time (ms) |
|---|---|---|---|---|
| Sparse | SP+NN | 15.7 / 31.7 / 48.6 | 11.5 / 25.6 / 42.3 | 19.04 (14.56/4.48) |
| Sparse | SP+SG | 22.1 / 40.8 / 58.1 | 18.1 / 36.6 / 54.4 | 57.93 (14.56/43.47mp) |
| Sparse | SP+LG | 20.0 / 38.0 / 55.6 | 16.9 / 34.3 / 51.9 | 42.79 (14.56/28.23mp) |
| Sparse | DISK+LG | 18.8 / 34.1 / 49.2 | 19.0 / 35.1 / 50.1 | 48.16 (28.81/19.35mp) |
| Sparse | Aliked+LG | 21.0 / 39.1 / 56.0 | 17.0 / 33.6 / 50.3 | 45.78 (23.76/22.02mp) |
| Semi-Dense | QuadTree | 24.8 / 43.8 / 60.3 | 20.2 / 38.4 / 55.4 | 89.77 |
| Semi-Dense | AspanFormer | 26.4 / 45.7 / 62.4 | 21.6 / 40.2 / 57.1 | 49.70mp |
| Semi-Dense | LoFTR | 22.4 / 39.8 / 55.6 | 17.2 / 34.0 / 50.3 | 33.02mp |
| Semi-Dense | ELoFTR | 25.0 / 43.8 / 60.0 | 21.5 / 39.8 / 56.2 | 23.92mp |
| Semi-Dense | VMatcher-B | 25.0 / 43.9 / 60.0 | 21.5 / 39.5 / 56.1 | 24.52mp |
| Semi-Dense | VMatcher-T | 23.3 / 42.0 / 58.4 | 19.7 / 37.5 / 54.1 | 19.41mp |
| Semi-Dense-Opt | ELoFTR$_{\text{Opt}}$ | 24.5 / 43.5 / 59.8 | 20.0 / 38.2 / 54.7 | 18.59mp |
| Semi-Dense-Opt | VMatcher-B$_{\text{Opt}}$ | 24.9 / 43.8 / 60.0 | 21.1 / 39.1 / 55.5 | 19.84mp |
| Semi-Dense-Opt | VMatcher-T$_{\text{Opt}}$ | 22.9 / 41.4 / 58.5 | 19.5 / 37.1 / 53.7 | 14.54mp |

Table 3: Indoor Relative Pose Estimation Results on ScanNet [4]. VMatcher variants are on par with other sparse and semi-dense methods, while VMatcher-T$_{\text{Opt}}$ offers the optimal balance between speed and accuracy.

5.3 Visual Localization

Figure 4: Runtime Breakdown. A runtime breakdown of VMatcher’s pipeline across multiple image resolutions shows efficiency gains over ELoFTR [41] as the resolution increases.

Visual localization is evaluated on the Aachen v1.1 and InLoc datasets [31, 36]. Aachen v1.1 [31] presents challenging viewpoint variations and illumination changes, including day-night transitions, while InLoc [36] features indoor scenes with repetitive structures and textureless regions. Following [35, 3, 41, 20], the Hierarchical Localization (HLoc) framework [29] is employed for benchmarking. Results are reported for Aachen v1.1’s day/night scenes and InLoc’s DUC1/DUC2 test sets.

Baselines. Sparse methods include SuperPoint [7] with SuperGlue [30] and LightGlue [20], while for semi-dense methods VMatcher is evaluated against ELoFTR [41].

Evaluation Protocol. HLoc’s [29] evaluation pipeline is followed for both sparse and semi-dense methods. For semi-dense methods, the longest edge of Aachen v1.1 and InLoc [31, 36] images is resized to 1184 and 800 pixels, respectively. ELoFTR [41] is evaluated under these conditions since no evaluation settings are reported. Results are reported as the percentage of poses within predefined distance and angular error thresholds. For Absolute Pose Estimation (APE) on the Aachen v1.1 dataset [31], HLoc’s default configuration using PyCOLMAP [32, 33] is adopted. In contrast, due to the stochastic behaviour of PyCOLMAP on the InLoc dataset [36], which affects both sparse and semi-dense methods, APE is performed using PoseLib [17].

Results. From Tab. 4, both sparse and semi-dense methods yield similar results on Aachen v1.1 [31], with insignificant variations attributed to dataset saturation [20]. Moreover, Tab. 5 shows that VMatcher performs on par with state-of-the-art feature matchers on InLoc [36], with differences being statistically insignificant due to limited query samples per test split [20]. However, the results highlight the computational advantage of VMatcher, particularly of the optimised models, which achieve sparse-like runtime speeds.

| Matcher Type | Method | Day | Night | Time (ms) |
|---|---|---|---|---|
| Sparse | SP+SG | 90.3 / 96.2 / 99.4 | 76.4 / 90.1 / 100.0 | 209.50 (19.27/168.23mp) |
| Sparse | SP+LG | 90.3 / 96.2 / 99.2 | 77.5 / 91.6 / 99.5 | 51.35 (19.27/32.08mp) |
| Semi-Dense | ELoFTR | 89.0 / 96.0 / 98.7 | 77.0 / 91.1 / 99.5 | 90.91mp |
| Semi-Dense | VMatcher-B | 89.4 / 96.5 / 99.0 | 77.0 / 91.6 / 99.5 | 78.32mp |
| Semi-Dense | VMatcher-T | 88.7 / 96.0 / 98.8 | 77.5 / 91.1 / 99.5 | 68.06mp |
| Semi-Dense-Opt | ELoFTR$_{\text{Opt}}$ | 89.0 / 95.5 / 98.4 | 76.4 / 91.1 / 99.0 | 60.13mp |
| Semi-Dense-Opt | VMatcher-B$_{\text{Opt}}$ | 89.3 / 96.1 / 98.8 | 77.5 / 91.6 / 99.0 | 54.19mp |
| Semi-Dense-Opt | VMatcher-T$_{\text{Opt}}$ | 88.8 / 95.6 / 98.1 | 77.0 / 91.1 / 99.0 | 46.58mp |

Table 4: Aachen v1.1 Dataset [31] Visual Relocalization. Day and Night columns report results at (0.25m, $2^{\circ}$) / (0.5m, $5^{\circ}$) / (1.0m, $10^{\circ}$).

| Matcher Type | Method | DUC1 | DUC2 | Time (ms) |
|---|---|---|---|---|
| Sparse | SP+SG | 46.0 / 68.7 / 79.8 | 57.3 / 77.9 / 79.4 | 153.12 (21.44/131.68mp) |
| Sparse | SP+LG | 44.9 / 68.2 / 77.8 | 56.5 / 73.3 / 77.9 | 40.21 (21.44/18.77mp) |
| Semi-Dense | ELoFTR | 52.5 / 74.2 / 86.4 | 64.1 / 81.7 / 86.3 | 39.44mp |
| Semi-Dense | VMatcher-B | 55.1 / 74.2 / 85.9 | 61.1 / 81.7 / 85.5 | 36.79mp |
| Semi-Dense | VMatcher-T | 53.5 / 74.2 / 85.4 | 59.5 / 79.4 / 83.2 | 32.52mp |
| Semi-Dense-Opt | ELoFTR$_{\text{Opt}}$ | 50.5 / 72.2 / 84.3 | 59.5 / 78.6 / 84.0 | 32.75mp |
| Semi-Dense-Opt | VMatcher-B$_{\text{Opt}}$ | 50.5 / 72.2 / 83.8 | 61.1 / 79.4 / 84.7 | 30.09mp |
| Semi-Dense-Opt | VMatcher-T$_{\text{Opt}}$ | 52.0 / 74.2 / 85.4 | 60.3 / 78.6 / 81.7 | 24.82mp |

Table 5: InLoc Dataset [36] Visual Relocalization. DUC1 and DUC2 columns report results at (0.25m, $2^{\circ}$) / (0.5m, $5^{\circ}$) / (1.0m, $10^{\circ}$).

5.4 Runtime Breakdown

From Sec. 5.1 to Sec. 5.3, it is evident that the runtime of VMatcher varies across experiments. Therefore, Fig. 4 presents a runtime breakdown on ScanNet [4] across multiple resolutions, providing a more detailed analysis. As a sequence-length-dependent model, the Mamba [12, 6] architecture imposes a constant computational overhead of $\approx 0.8$ ms per layer. At lower resolutions, VMatcher and its optimised variants achieve runtimes comparable to ELoFTR [41], with VMatcher-T being slightly more efficient. At higher resolutions, this overhead becomes negligible, allowing VMatcher to maintain competitive performance while improving efficiency across all variants.
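A back-of-the-envelope calculation makes the effect of the constant overhead concrete. The sketch below assumes that the $\approx 0.8$ ms applies to each Mamba layer (the M and Ms entries of the layer patterns in Appendix A, i.e. 10 for VMatcher-B and 6 for VMatcher-T) and that the overhead is purely additive. Both are simplifying assumptions, and the resulting shares are estimates rather than measured values.

```python
MAMBA_OVERHEAD_MS = 0.8  # approximate constant overhead per Mamba layer (Sec. 5.4)

# Assumed Mamba layer counts: M and Ms entries in the Appendix A layer patterns.
MAMBA_LAYERS = {"VMatcher-B": 10, "VMatcher-T": 6}


def fixed_overhead_ms(model):
    """Estimated resolution-independent overhead of all Mamba layers."""
    return MAMBA_LAYERS[model] * MAMBA_OVERHEAD_MS


def overhead_share(model, total_runtime_ms):
    """Estimated fraction of the total runtime spent on the fixed overhead."""
    return fixed_overhead_ms(model) / total_runtime_ms


# VMatcher-B runtimes taken from Tab. 3 (ScanNet) and Tab. 2 (MegaDepth).
low_res_share = overhead_share("VMatcher-B", 24.52)
high_res_share = overhead_share("VMatcher-B", 87.56)
```

Under these assumptions, the fixed cost is roughly 8 ms for VMatcher-B, about a third of its ScanNet runtime but under a tenth of its MegaDepth runtime, which is in line with the overhead becoming negligible as resolution grows.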

5.5 Ablation Study

| Method | LO-RANSAC (AUC@$5^{\circ}$ / $10^{\circ}$ / $20^{\circ}$) | RANSAC (AUC@$5^{\circ}$ / $10^{\circ}$ / $20^{\circ}$) | Time (ms) |
|---|---|---|---|
| VMatcher-B (Full Model) | 69.5 / 81.1 / 88.9 | 56.0 / 72.2 / 83.6 | 87.56mp |
| a) VGG Backbone only | 61.8 / 72.3 / 82.2 | 47.7 / 64.1 / 75.4 | 64.42mp |
| b) VGG + DS-Transformer (w/o MambaV + gMLP) | 64.9 / 75.6 / 84.1 | 49.9 / 65.9 / 77.6 | 69.51mp |
| c) VGG + MambaVision + gMLP (w/o DS-Transformer) | 65.4 / 76.9 / 85.7 | 50.5 / 66.4 / 78.3 | 80.14mp |
| d) Full Model w/ Rep-VGG Backbone | 69.1 / 80.7 / 88.5 | 55.6 / 71.4 / 83.1 | 97.22mp |
| e) MLP replacing gMLP | 68.1 / 80.1 / 88.0 | 53.7 / 70.2 / 82.2 | 86.86mp |
| f) MLP after Attention | 67.7 / 79.9 / 87.8 | 53.4 / 70.0 / 82.0 | 89.77mp |
| g) Full Model w/o Self-Attention RoPE | 69.7 / 81.2 / 88.9 | 56.0 / 72.1 / 83.2 | 86.79mp |

Table 6: Ablation Study on the MegaDepth dataset. Analysing VMatcher’s architecture.

Tab. 6 presents an ablation study of VMatcher, highlighting the impact of the hybrid Mamba-Transformer architecture and key design choices. The VGG backbone alone (row a) shows significantly reduced performance, while adding either DS-Transformer or MambaVision+gMLP substantially improves results (rows b, c). However, neither component independently matches the full model’s capabilities, confirming the effectiveness of the hybrid approach. Replacing the lightweight VGG backbone with RepVGG (row d) yields lower performance with higher computational overhead, validating the lightweight backbone choice in Sec. 3.1. Substituting the Gated MLP with a standard Feed-Forward MLP (row e) offers minor runtime improvements but compromises accuracy. As detailed in Sec. 3.3, adding an MLP post-attention (row f) negatively impacts both performance and efficiency. Notably, removing RoPE [34] from self-attention (row g) slightly improves runtime and LO-RANSAC accuracy, which aligns with the hypothesis in Sec. 3.3 that Mamba’s [12] inherent positional encoding makes additional positional encoding unnecessary.

6 Conclusion

This paper introduces VMatcher, a hybrid Mamba-Transformer semi-dense feature matcher that prioritises computational efficiency without sacrificing performance. Evaluations demonstrated that VMatcher achieves runtime gains over existing semi-dense matchers, with optimised variants reaching sparse-like processing speeds, marking a notable advancement in semi-dense feature matching efficiency. Future work may focus on further refining the architecture, aiming for greater computational efficiency while maintaining accuracy. This work hopes to pave the way for more efficient feature matching methodologies, contributing to the ongoing evolution of the field and offering a promising direction towards more practical and efficient solutions.

References

[1] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In CVPR, 2017.
[2] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Computer Vision – ECCV 2006, pages 404–417, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
[3] Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David Mckinnon, Yanghai Tsin, and Long Quan. Aspanformer: Detector-free image matching with adaptive span transformer, 2022.
[4] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas A. Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. CoRR, abs/1702.04405, 2017.
[5] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023.
[6] Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality, 2024.
[7] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. CoRR, abs/1712.07629, 2017.
[8] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again, 2021.
[9] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8084–8093, 2019.
[10] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
[11] Pierre Gleize, Weiyao Wang, and Matt Feiszli. Silk: Simple learned keypoints. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22499–22508, 2023.
[12] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024.
[13] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state-space layers, 2021.
[14] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
[15] Ali Hatamizadeh and Jan Kautz. Mambavision: A hybrid mamba-transformer vision backbone, 2024.
[16] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image Matching across Wide Baselines: From Paper to Practice. International Journal of Computer Vision, 2020.
[17] Viktor Larsson and contributors. PoseLib - Minimal Solvers for Camera Pose Estimation, 2020.
[18] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
[19] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham. Jamba: A hybrid transformer-mamba language model, 2024.
[20] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed, 2023.
[21] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. Vmamba: Visual state space model, 2024.
[22] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[23] NVIDIA Corporation. Triton Inference Server: An Optimized Cloud and Edge Inferencing Solution., 2020.
[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. CoRR, abs/1912.01703, 2019.
[25] Xiaohuan Pei, Tao Huang, and Chang Xu. Efficientvmamba: Atrous selective scan for light weight visual mamba, 2024.
[26] Guilherme Potje, Felipe Cadar, Andre Araujo, Renato Martins, and Erickson R. Nascimento. Xfeat: Accelerated features for lightweight image matching, 2024.
[27] Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2d2: Repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195, 2019.
[28] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In 2011 International Conference on Computer Vision, pages 2564–2571, 2011.
[29] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale, 2019.
[30] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks, 2020.
[31] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomas Pajdla. Benchmarking 6dof outdoor visual localization in changing conditions, 2018.
[32] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[34] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023.
[35] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8918–8927, 2021.
[36] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In CVPR, 2018.
[37] Shitao Tang, Jiahui Zhang, Siyu Zhu, and Ping Tan. Quadtree attention for vision transformers, 2022.
[38] Michał J. Tyszkiewicz, Pascal Fua, and Eduard Trulls. Disk: Learning local features with policy gradient, 2020.
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
[40] Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, and Rainer Stiefelhagen. Matchformer: Interleaving attention in transformers for feature matching, 2022.
[41] Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. Efficient loftr: Semi-dense local feature matching with sparse-like speed, 2024.
[42] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2666–2674, 2018.
[43] Xiaoming Zhao, Xingming Wu, Weihai Chen, Peter C. Y. Chen, Qingsong Xu, and Zhengguo Li. Aliked: A lighter keypoint and descriptor extraction network via deformable transformation, 2023.
[44] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model, 2024.

Appendix

A VMatcher Model Configurations

VMatcher configurations introduced in Sec. 4 are designed to balance computational efficiency and matching accuracy by varying the model size:

  • VMatcher-B (Base):

    • Architecture: 24 layers (9.5M parameters)

    • Layer Pattern: [M G Ms G S M G Ms G C M G Ms G S M G Ms G C M G Ms G]

  • VMatcher-T (Tiny):

    • Architecture: 14 layers (6.9M parameters)

    • Layer Pattern: [M G Ms G S M G Ms G C M G Ms G]

Where the architectural components are defined as follows:

  • M: MambaVision layer

  • Ms: MambaVisions layer (MambaVision applied to the transposed input, with the output restored to its original orientation). This switches the scan from row-major to column-major order.

  • G: Gated Multi-Layer Perceptron (gMLP) layer

  • S: Downsampled-Transformer Self-attention layer

  • C: Downsampled-Transformer Cross-attention layer
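The layer patterns above can be checked programmatically. The following sketch parses each pattern string and counts the layers per type; the dictionary and function names are illustrative and are not taken from the released code.

```python
from collections import Counter

# Layer symbols as defined in the list above.
LAYER_NAMES = {
    "M": "MambaVision layer (row-major scan)",
    "Ms": "MambaVisions layer (column-major scan)",
    "G": "gMLP layer",
    "S": "Downsampled-Transformer self-attention layer",
    "C": "Downsampled-Transformer cross-attention layer",
}

PATTERNS = {
    "VMatcher-B": "M G Ms G S M G Ms G C M G Ms G S M G Ms G C M G Ms G",
    "VMatcher-T": "M G Ms G S M G Ms G C M G Ms G",
}


def parse_pattern(pattern):
    """Split a pattern string into layer symbols and validate each one."""
    tokens = pattern.split()
    unknown = [t for t in tokens if t not in LAYER_NAMES]
    if unknown:
        raise ValueError(f"unknown layer symbols: {unknown}")
    return tokens


def summarise(pattern):
    """Return the total number of layers and the count per layer type."""
    tokens = parse_pattern(pattern)
    return len(tokens), Counter(tokens)


for name, pattern in PATTERNS.items():
    total, counts = summarise(pattern)
    print(name, total, dict(counts))
```

The totals agree with the stated depths of 24 and 14 layers, and each attention layer (S or C) is preceded by a block of two Mamba layers interleaved with two gMLP layers.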

B Training Details

All models were trained exclusively on the MegaDepth dataset [18], featuring 196 tourist landmarks with camera calibration, poses, and depth maps derived through COLMAP SfM and Multi-view Stereo (MVS) [32, 33]. Training used a learning rate of $1\times 10^{-4}$ with the AdamW optimiser, a weight decay of 0.1, and a batch size of 1 over 30 epochs. Before the attention layers, the feature maps are downsampled by a factor of 4 using bilinear interpolation. The loss weights $\alpha$ and $\beta$ in Eq. 12 were set to 1.0 and 0.25, respectively. To accommodate more samples per update, gradient accumulation was set to 8 for VMatcher-B and 32 for VMatcher-T. All training and evaluations were conducted on a single RTX 3090Ti GPU using PyTorch [24], with each epoch taking approximately 2 hours.
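With a batch size of 1, the gradient accumulation setting determines the effective batch size (8 for VMatcher-B and 32 for VMatcher-T). The toy sketch below illustrates the idea on a scalar least-squares problem: averaging gradients over accumulated micro-batches of size 1 gives the same gradient as one larger batch. The loss, the data, and the function names are illustrative assumptions, and the plain averaged gradient stands in for the AdamW update used in training.

```python
def sample_gradient(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w for one sample."""
    return (w * x - y) * x


def accumulated_gradient(w, micro_batches, accum_steps):
    """Accumulate gradients over `accum_steps` micro-batches of size 1."""
    grad = 0.0
    for x, y in micro_batches[:accum_steps]:
        # Scaling each contribution by 1 / accum_steps yields the mean gradient.
        grad += sample_gradient(w, x, y) / accum_steps
    return grad


def batch_gradient(w, batch):
    """Mean gradient over a full batch processed at once."""
    return sum(sample_gradient(w, x, y) for x, y in batch) / len(batch)


data = [(float(i), 2.0 * i + 1.0) for i in range(1, 9)]  # 8 toy samples
g_accum = accumulated_gradient(0.5, data, accum_steps=8)
g_batch = batch_gradient(0.5, data)
```

The two gradients agree up to floating-point rounding, which is why accumulation can emulate a larger batch on a single GPU at the cost of more forward and backward passes per optimiser step.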

C Bidirectional Models

Bidirectional variants of VMatcher-B and VMatcher-T are introduced in which the MambaVision layer [15] scans the feature sequence bidirectionally (Fig. 5) rather than unidirectionally (see Fig. 2). For the bidirectional models, the MambaVision layer computes the following:

$$y_{f} = \text{MambaVision}(\text{Norm}(x_{in})). \tag{13}$$

$$y_{b} = \text{flip}(\text{MambaVision}(\text{flip}(\text{Norm}(x_{in})))). \tag{14}$$

The outputs of Eqs. 13 and 14 are subsequently summed and passed through the final linear layer, and the input is added back as a residual, giving the final output of the MambaVision layer in Eq. 15:

$$y = x_{in} + \text{Linear}(y_{f} + y_{b}). \tag{15}$$

Figure 5: Bidirectional MambaVision Block.
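The data flow of Eqs. 13 to 15 can be sketched in a few lines. In the sketch below, a simple causal recurrence stands in for the MambaVision mixer, and the normalisation and linear layers default to the identity; all of these are placeholders chosen for illustration, not the actual layers.

```python
def causal_scan(x, decay=0.5):
    """Stand-in for a unidirectional mixer: h_t = decay * h_{t-1} + x_t."""
    out, h = [], 0.0
    for value in x:
        h = decay * h + value
        out.append(h)
    return out


def flip(x):
    """Reverse the sequence order."""
    return x[::-1]


def bidirectional_block(x_in, mixer=causal_scan, norm=lambda x: x, linear=lambda x: x):
    """Forward and backward scans, summed, projected, and added to the input."""
    x = norm(x_in)
    y_f = mixer(x)                # Eq. 13: forward scan
    y_b = flip(mixer(flip(x)))    # Eq. 14: backward scan, restored to input order
    mixed = linear([f + b for f, b in zip(y_f, y_b)])
    return [xi + m for xi, m in zip(x_in, mixed)]  # Eq. 15: residual connection
```

For an impulse in the middle of a sequence, the forward scan alone only influences later positions, whereas the bidirectional block spreads it symmetrically to both sides, at the cost of running the mixer twice.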

As seen in Tab. 7, the bidirectional models offer no substantial improvement over their unidirectional counterparts while increasing inference time. This aligns with the findings in other vision tasks [15, 44], where bidirectional scans offer limited practical benefit over unidirectional ones.

| Method | AUC@$5^{\circ}$ | AUC@$10^{\circ}$ | AUC@$20^{\circ}$ | Time (ms) |
|---|---|---|---|---|
| VMatcher-B | 69.6 | 81.1 | 88.9 | 87.56mp |
| VMatcher-B-Bi | 69.3 | 81.0 | 88.8 | 97.67mp |
| VMatcher-T | 69.4 | 81.0 | 88.8 | 79.46mp |
| VMatcher-T-Bi | 69.4 | 80.6 | 88.6 | 86.25mp |

Table 7: Unidirectional and Bidirectional VMatcher model comparison on the MegaDepth dataset.

Figure 6: Runtime Breakdown Extended. Flash Attention [5] enabled during inference time measurement on ScanNet [4].

D Design Choices

D.1 Downsampled Transformer

Preserving information during downsampling is crucial, which prompted experiments with several methods: Bilinear Interpolation, Area Interpolation, Depth-wise Convolution, Max Pooling, and Average Pooling. Although Bilinear Interpolation is the slowest of these (Tab. 8), Max and Average Pooling caused the largest performance degradation, $\approx 1.5-2\%$. Depth-wise Convolution was avoided despite its efficiency, since it introduces additional learnable parameters and makes the downsampling quality dependent on the learnt convolutional kernels. Area Interpolation came closest to Bilinear Interpolation but still showed a slight performance drop of $\approx 0.5-1.0\%$. Given that the runtime difference is negligible in the context of the complete pipeline, Bilinear Interpolation was ultimately chosen for its better overall trade-off.
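To make the difference between these operators concrete, the sketch below downsamples a small feature map by a factor of 4 with max pooling, average pooling, and bilinear interpolation. It assumes the half-pixel-centre sampling convention without anti-aliasing, under which bilinear downsampling by 4 reads only the central 2×2 values of each 4×4 block. Whether this matches the exact settings used in VMatcher is an assumption, and the function names are illustrative.

```python
import math


def pool_downsample(img, factor, reduce_fn):
    """Reduce each non-overlapping factor x factor block with reduce_fn."""
    h, w = len(img), len(img[0])
    return [
        [
            reduce_fn([img[i + di][j + dj] for di in range(factor) for dj in range(factor)])
            for j in range(0, w - factor + 1, factor)
        ]
        for i in range(0, h - factor + 1, factor)
    ]


def mean(values):
    return sum(values) / len(values)


def bilinear_downsample(img, factor):
    """Bilinear resampling with half-pixel centres and no anti-aliasing."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h // factor):
        sy = (i + 0.5) * factor - 0.5          # source row coordinate
        y0 = int(math.floor(sy))
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0
        row = []
        for j in range(w // factor):
            sx = (j + 0.5) * factor - 0.5      # source column coordinate
            x0 = int(math.floor(sx))
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# A 4x4 block with one strong response near its centre.
block = [[0.0] * 4 for _ in range(4)]
block[1][1] = 16.0
```

On this block, max pooling returns 16.0, average pooling 1.0, and bilinear interpolation 4.0, showing that the three operators summarise the same neighbourhood quite differently.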

| Downsampling Method | Time (ms) |
|---|---|
| Bilinear Interpolation | 0.072 |
| Area Interpolation | 0.042 |
| Depth-wise convolution | 0.035 |
| Max pooling | 0.033 |
| Average pooling | 0.036 |

Table 8: Downsampling methods runtimes.

Figure 7: VMatcher MambaVision layers scan directions visualisation.

D.2 MambaVision

Numerous Mamba [12] variants have been proposed for Computer Vision tasks; however, existing methods tend to prioritise either high accuracy or fast inference speed, often at the cost of the other. Initially, the architecture used Vanilla Mamba [12] in both bidirectional and unidirectional configurations. However, Vanilla Mamba proved less efficient than MambaVision [15], with nearly $1.5\times$ the runtime and slightly lower performance. Since the goal is to optimise both runtime and performance, MambaVision [15] was chosen as the variant for VMatcher’s architecture.

E Rotation Invariance

DISK’s [38] rotation invariance evaluation protocol was followed, using the Image Matching Challenge (IMC) 2020 [16] validation set. For each angle $\theta$, 36 images are randomly selected and matched with their rotated counterparts. The ratio of correct matches, defined as those with a reprojection error below 3 pixels, is then computed. Results are shown in Fig. 8.

Figure 8: Rotation Invariance evaluation on the IMC 2020 [16] validation set.

F Runtime Breakdown Extended

VMatcher benefits less from Flash Attention [5] than [35, 41, 3, 20, 30], due to the limited number of Transformer layers [39] in its architecture. However, for fairness, the analysis in Fig. 4 was extended by enabling Flash Attention during inference time measurement. As seen in Fig. 6, VMatcher maintains its superior inference speed even with Flash Attention [5] enabled.

G VMatcher Scans Visualization

Secs. 4 and A described VMatcher’s architecture, which incorporates MambaVision and MambaVisions layers. Fig. 7 visualises their distinct scanning patterns using a simple $3\times 3$ image example. The MambaVision layer processes the flattened image in row-major order, while MambaVisions first transposes the image to enable column-major scanning before restoring its original orientation. For bidirectional models introduced in Sec. C, MambaVision-Bi scans both the flattened image and its flipped copy, then returns the flipped version to its original orientation as shown in Fig. 5. Similarly, MambaVisions-Bi transposes the image, scans both the transposed image and its flipped version, then restores both to their original orientation.
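The scan orders described above can be reproduced on a $3\times 3$ example with pixel indices 0 to 8. The helper names below are illustrative; the sketch only shows the order in which positions are visited, not the Mamba computation itself.

```python
def row_major_scan(image):
    """Flatten the image row by row (MambaVision scan order)."""
    return [value for row in image for value in row]


def transpose(image):
    """Swap rows and columns."""
    return [list(column) for column in zip(*image)]


def column_major_scan(image):
    """Transpose first, then flatten (MambaVisions scan order)."""
    return row_major_scan(transpose(image))


def bidirectional_scans(sequence):
    """Return the sequence and its flipped copy, as used by the -Bi variants."""
    return sequence, sequence[::-1]


image = [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]

print(row_major_scan(image))     # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(column_major_scan(image))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```

Alternating the two orders lets consecutive Mamba layers treat horizontally and vertically adjacent pixels as sequence neighbours in turn.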

H Limitations

  • The VMatcher model utilises the Mamba architecture, where a portion of the Mamba model is implemented in Triton [23], requiring a Graphics Processing Unit (GPU). While powerful hardware is common in state-of-the-art feature matching methods, this reliance on GPUs and modern hardware is believed to remain a general limitation.

  • While VMatcher offers significant runtime improvements, the efficiency of the Mamba [12] architecture depends on the sequence length. For smaller sequences, its runtime is comparable to Transformer-based models, but for larger sequences, it demonstrates marked efficiency gains. As Computer Vision tasks increasingly involve higher-resolution images, this advantage is expected to become more prominent. Moreover, as a relatively new architecture, Mamba [12] is likely to benefit from future optimisations, following a trajectory similar to the steady improvements seen in Transformer-based models.

I Matching Examples

Fig. 9 provides visual examples on the MegaDepth and ScanNet datasets [18, 4]. Mismatches are not filtered out in the figures to present an unbiased demonstration.

Figure 9: Matching Examples on MegaDepth and ScanNet [18, 4] datasets.