VMatcher: State-Space Semi-Dense Local Feature Matching

Ali Youssef

Abstract

This paper introduces VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. Learning-based feature matching methods, whether detector-based or detector-free, achieve state-of-the-art performance but depend heavily on the Transformer’s attention mechanism, which, while effective, incurs high computational costs due to its quadratic complexity. In contrast, Mamba introduces a Selective State-Space Model (SSM) that achieves comparable or superior performance with linear complexity, offering significant efficiency gains. VMatcher leverages a hybrid approach, integrating Mamba’s highly efficient long-sequence processing with the Transformer’s attention mechanism. Multiple VMatcher configurations are proposed, including hierarchical architectures, demonstrating their effectiveness in setting new benchmarks efficiently while ensuring robustness and practicality for real-time applications where rapid inference is crucial. Source Code is available at: https://github.com/ayoussf/VMatcher

1 Introduction

Finding reliable and accurate feature matches between image pairs is a core component in a multitude of Computer Vision tasks such as Structure-from-Motion (SfM), visual Simultaneous Localization and Mapping (vSLAM), Visual Localization, etc. Typically, feature detection and matching consist of three phases: feature detection, where distinctive interest points are extracted from an image; feature description, which involves encoding the local neighbourhood around the feature point into a stable high-dimensional descriptor vector; and feature matching, where descriptors from two images are paired to establish point-to-point correspondences, often through the mutual nearest-neighbour algorithm.

Traditionally, this problem has been approached with handcrafted methods, where feature detectors and descriptors such as [22, 28, 2] played a dominant role. These early methods relied on manually designed criteria to identify keypoints and establish matches, making them computationally efficient and interpretable. However, they often struggled under varying conditions, such as significant variation in viewpoint, illumination, and textureless regions, thus limiting their reliability in complex real-world scenarios. Over the past decade, learning-based methods have emerged, demonstrating greater robustness and consistently outperforming classical handcrafted approaches. This paradigm shift emphasises the importance of developing effective models and training strategies to maximise the potential of learning-based approaches.

In particular, the emergence of Transformers [39] marked a breakthrough in feature detection and matching, leveraging their attention mechanism to capture complex global dependencies across images. This capability enabled both detector-based and detector-free matching methods to achieve remarkable accuracy and performance. However, the quadratic complexity of attention with respect to the sequence length imposes a significant computational burden, challenging real-time and resource-limited applications. Recently, Mamba [12] introduced a Selective State-Space Model (SSM) designed for efficient handling of long sequences with linear complexity, achieving performance on par with or surpassing Transformers in language tasks. Mamba’s efficiency stems from a novel selection mechanism that dynamically adjusts its state-space parameters based on the input data, enabling effective modelling of complex patterns without the computational overhead associated with the attention mechanism. Nonetheless, Mamba’s autoregressive design limits its effectiveness in computer vision tasks, where spatial relationships are inherently parallel rather than sequential, and capturing global context is essential for accurate analysis [15].

Building on these insights, this work proposes VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching that merges Mamba’s efficient linear-time processing with the Transformer’s powerful attention mechanism. This hybrid approach addresses computational bottlenecks while preserving the benefits of global context modelling, overcoming the limitations of Mamba’s autoregressive formulation for spatial data and mitigating the high computational cost inherent to Transformer architectures.

VMatcher introduces multiple design patterns and configurations, which achieve state-of-the-art accuracy on par with existing methods while improving computational efficiency, making it highly practical for real-world applications where rapid inference is essential. This work aims to strike a balance between robustness, speed, and accuracy in semi-dense matching, to establish VMatcher as a versatile and scalable solution for a diverse range of Computer Vision tasks.

2 Related Work

2.1 Detector-Based Feature Matching

Traditional feature detection and matching methods, such as [22, 2, 28], relied on handcrafted algorithms to identify repeatable, distinctive feature points and matched them using techniques such as nearest-neighbour (NN) search and mutual nearest neighbours (MNN). Despite their wide use, these classical methods often struggle with large variations in viewpoint, illumination, and appearance. To overcome these limitations, recent approaches have adopted learning-based methods, which deliver significantly better performance. SuperPoint [7] performs joint detection and description within a self-supervised Convolutional Neural Network (CNN) framework to improve robustness; however, it requires a complex training setup and high computational cost at large image scales. Following SuperPoint, other works [9, 27, 38, 11] have also leveraged CNN-based architectures for feature detection and description. Recently, XFeat [26] introduced a lightweight architecture that optimises both speed and accuracy for resource-efficient feature detection and matching, offering an ideal balance for real-time applications.

Additionally, learning-based matchers such as SuperGlue [30] use Graph Neural Networks (GNNs) to ensure consistent correspondences by modelling relationships between feature points; however, SuperGlue can be computationally expensive due to its quadratic complexity relative to the number of keypoints. In contrast, LightGlue [20] adopts an adaptive matching scheme that balances efficiency and accuracy by adjusting computation based on matching difficulty, allowing for faster processing while maintaining competitive accuracy.

2.2 Detector-Free Feature Matching

Detector-free approaches perform direct feature matching without explicit keypoint detection, improving flexibility and robustness in challenging scenarios. LoFTR [35] adopts a coarse-to-fine transformer framework, computing semi-dense correspondences to capture spatial relationships across an image pair without keypoint detection. While this approach results in high-quality matches, it suffers from significant computational overhead due to the need to compute the Transformer’s [39] attention over the entire feature map.

Subsequent works addressed efficiency limitations through various techniques. AspanFormer [3] introduced a hierarchical attention mechanism that dynamically adjusts the attention span based on flow maps, thereby reducing computational overhead while preserving matching accuracy. Similarly, MatchFormer [40] offers a more refined matching process which follows an extract-and-match methodology on multiple feature scales. QuadTree Attention [37] introduces a hierarchical attention mechanism that reduces computational cost without compromising performance. Rather than computing attention maps globally, it focuses on smaller, more relevant sections of the feature map, making it particularly effective for high-resolution matching tasks where managing computational load is crucial. Lastly, ELoFTR [41] enhances the LoFTR framework [35] through architectural refinements and the introduction of a novel transformer architecture that significantly improves efficiency while maintaining matching performance.

2.3 Mamba

The Mamba architecture [12] introduces a Selective Scan Mechanism, designed to effectively balance local and global feature extraction while maintaining flexibility for various tasks. Initially developed for language modelling, Mamba has since been adapted and evolved into several variants for computer vision applications, including Vim [44], which incorporated bidirectional encoding to improve spatial context, and EfficientVMamba [25], which combined SSMs with CNN layers for a hierarchical scanning approach. VMamba [21] introduced the Cross-Scan Module to increase the receptive field, though it remains constrained by the directional scanning paths.

More recently, MambaVision [15] introduced several modifications to the Mamba block, addressing its limitations in computer vision tasks. In particular, it replaces causal convolutions with standard convolutions and introduces a symmetric branch with a convolution and SiLU activation function. With its parallel architecture for both sequential and spatial processing, and the ability to model both short- and long-range dependencies, MambaVision [15] offers a rich feature representation with high efficiency, making it an ideal backbone for VMatcher’s architecture.

Figure 1: VMatcher pipeline overview. 1) A VGG-style backbone extracts multi-scale feature maps from images $I_{A}$ and $I_{B}$, including coarse feature maps ${\tilde{F}_{C_{A}}}$ and ${\tilde{F}_{C_{B}}}$ at $\frac{1}{8}$ resolution. 2) A hybrid Mamba-Transformer module processes these for enhanced discrimination. 3) Coarse-level matches { $M_{c}$ } are obtained via correlation and MNN matching. 4) Fine-level feature maps are generated by fusing ${\tilde{F}_{C_{A}}}$ and ${\tilde{F}_{C_{B}}}$ with the $\frac{1}{4}$ and $\frac{1}{2}$ resolution backbone feature maps. Two-stage refinement: extracting patches centred on { $M_{c}$ }, correlating via MNN, and aligning each point from $I_{A}$ with a $3\times 3$ patch in $I_{B}$. Final-level matches { $M_{f}$ } are derived via softmax and expectation computation.

3 Methodology

This section presents VMatcher’s methodology and model architecture. Given a pair of images $I_{A}$ and $I_{B}$, the objective is to establish accurate correspondences between them efficiently. Fig. 1 provides an overview of VMatcher’s pipeline.

3.1 Feature Extraction

VMatcher adopts a lightweight and compact VGG-style feature extractor to maximise efficiency without compromising performance. The backbone consists of three sequential groups, each containing three blocks of Conv2D-BatchNorm-ReLU layers. Coarse feature maps ${\tilde{F}_{C_{A}}}$ and ${\tilde{F}_{C_{B}}}$ are extracted at $\frac{1}{8}$ resolution, along with additional finer feature maps at $\frac{1}{4}$ and $\frac{1}{2}$ resolutions utilised during the Fine Refinement and Matching stage. Lastly, convolutional and batch normalisation layers are fused during inference to minimise computational overhead. In contrast to previous work [35, 41, 3, 40], a lightweight extractor is favoured: it has far fewer parameters than, for instance, the RepVGG [8] feature extractor used by ELoFTR [41] (1.5M vs. 9.5M), yet it maintains effectiveness while improving computational efficiency, as detailed in Sec. 5.5.

3.2 Mamba

3.2.1 Preliminaries

State-space models (SSMs) serve as a framework for representing continuous-time dynamics through two primary equations: the state transition equation and the observation equation. These equations define the behaviour of the system over time, and are based on a latent state $h(t)$ and state derivative $h^{\prime}(t)$ that adjusts relative to input $x(t)$ , with an output $y(t)$ :

$$h^{\prime}(t)=\mathbf{A}h(t)+\mathbf{B}x(t). \tag{1}$$

$$y(t)=\mathbf{C}h(t). \tag{2}$$

Where matrices $\mathbf{A}$ , $\mathbf{B}$ , and $\mathbf{C}$ determine the system dynamics in continuous time.

Discretisation. To apply continuous models to discrete data, the system equations must be discretised. Mamba [12] leverages the Zero-Order Hold (ZOH) method, using a sampling interval $\Delta$ to map the continuous matrices $\mathbf{A}$ and $\mathbf{B}$ into their discrete counterparts, $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ :

$$\bar{\mathbf{A}}=\exp(\Delta\mathbf{A}). \tag{3}$$

$$\bar{\mathbf{B}}=(\Delta\mathbf{A})^{-1}\left(\exp(\Delta\mathbf{A})-\mathbf{I}\right)\cdot\Delta\mathbf{B}. \tag{4}$$

This transformation allows the model to process sequential or other discrete inputs efficiently, aligning with modern digital computation [13].
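As an illustration of Eqs. 3 and 4, the following minimal Python sketch applies the ZOH rule to a diagonal state matrix, for which the matrix exponential and inverse reduce to element-wise operations. The diagonal assumption, the function name `zoh_discretise_diagonal`, and the numerical values are illustrative choices and are not taken from the VMatcher implementation.

```python
import math

def zoh_discretise_diagonal(a_diag, b_vec, delta):
    """ZOH discretisation for a diagonal state matrix A (illustrative assumption).

    a_diag: diagonal entries of A (assumed non-zero).
    b_vec:  entries of B for a single input channel.
    delta:  sampling interval.
    """
    # A_bar = exp(delta * A), element-wise for a diagonal A.
    a_bar = [math.exp(delta * a) for a in a_diag]
    # (delta*a)^-1 * (exp(delta*a) - 1) * delta*b simplifies to (exp(delta*a) - 1) / a * b.
    b_bar = [(math.exp(delta * a) - 1.0) / a * b for a, b in zip(a_diag, b_vec)]
    return a_bar, b_bar

a_bar, b_bar = zoh_discretise_diagonal([-1.0, -2.0], [1.0, 0.5], delta=0.1)
```

For a small $\Delta$, each entry of $\bar{\mathbf{B}}$ stays close to $\Delta\mathbf{B}$, which is a quick sanity check on the discretised values.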

Convolutional Computation. Mamba [12] accelerates sequence processing by reformulating the discretised system into a convolutional operation using discretised matrices $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ rather than sequentially computing outputs, thus enhancing parallelisation. The output $y_{i}$ at time step $i$ is computed as:

$$y_{i}=\sum_{j=0}^{i}\mathbf{C}\bar{\mathbf{A}}^{i-j}\bar{\mathbf{B}}x_{j}. \tag{5}$$

By constructing a set of convolutional kernels $\mathbf{\bar{K}}=\left(\mathbf{C}\bar{\mathbf{B}},\mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\dots,\mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}\right)$ , the output is efficiently computed as:

$$y=x\ast\mathbf{\bar{K}}. \tag{6}$$

Where $x=[x_{0},x_{1},\dots]$ and $y=[y_{0},y_{1},\dots]\in\mathbb{R}^{L}$ , with $L$ denoting the sequence length. This convolutional structure improves efficiency by enabling parallel matrix operations, improving performance on high-dimensional data.
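The equivalence between the sequential and convolutional views can be checked numerically. The sketch below, written for a scalar state purely for readability, unrolls the recurrence $h_{i}=\bar{\mathbf{A}}h_{i-1}+\bar{\mathbf{B}}x_{i}$ with $h_{-1}=0$ and compares it with a causal convolution against the kernel $\mathbf{\bar{K}}$. The function names and the parameter values are hypothetical.

```python
def ssm_recurrent(a_bar, b_bar, c, xs):
    """Scalar-state recurrence: h_i = a_bar * h_{i-1} + b_bar * x_i, y_i = c * h_i."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys

def ssm_convolutional(a_bar, b_bar, c, xs):
    """Same outputs via the kernel K = (c*b_bar, c*a_bar*b_bar, ..., c*a_bar^(L-1)*b_bar)."""
    length = len(xs)
    kernel = [c * (a_bar ** k) * b_bar for k in range(length)]
    # Causal convolution: y_i = sum over j <= i of K[i - j] * x_j.
    return [sum(kernel[i - j] * xs[j] for j in range(i + 1)) for i in range(length)]

xs = [1.0, 2.0, -1.0, 0.5]
y_rec = ssm_recurrent(0.9, 0.1, 2.0, xs)
y_conv = ssm_convolutional(0.9, 0.1, 2.0, xs)
```

Both functions return the same sequence up to floating-point rounding. The convolutional form exposes all outputs at once, which is what makes it amenable to parallel computation.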

Selection Mechanism. Traditional SSMs lack the flexibility to adjust outputs based on inputs due to inherent time invariance. Mamba [12] overcomes this by introducing an input-dependent selection mechanism that generates dynamic time-varying weight matrices, allowing it to filter irrelevant information while preserving details relevant to each input.

3.2.2 MambaVision

MambaVision [15] modifies vanilla Mamba [12] to enhance feature representation and generalisation in vision tasks. As shown in Fig. 2, the causal 1D convolution is replaced with a regular (non-causal) 1D convolution to eliminate directional limitations. A symmetric branch bypassing the selective scan mechanism is added, consisting of a 1D convolution layer with SiLU activation to recover information potentially lost due to sequential constraints imposed by the SSM. To maintain a parameter count similar to that of vanilla Mamba [12], inputs are projected to half the embedding dimension, and outputs are concatenated and projected via a final linear layer. The output $y$ is computed as:

$$x=\text{SSM}(\sigma(\text{Conv}(\text{Linear}(C,\tfrac{C}{2})(x_{\text{in}})))), \tag{7}$$

$$z=\sigma(\text{Conv}(\text{Linear}(C,\tfrac{C}{2})(x_{\text{in}}))), \tag{8}$$

$$y=\text{Linear}(C,C)(\text{Concat}(x,z)). \tag{9}$$

3.3 Downsampled-Transformer

Following the MambaVision module, feature vectors $\tilde{F}_{C_{A}}$ and $\tilde{F}_{C_{B}}$ are passed through the Downsampled-Transformer (DS-Transformer) module, which computes self-attention or cross-attention. The standard attention mechanism combines queries $Q$, keys $K$, and values $V$ through weighted sums:

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V. \tag{10}$$

Where $d_{k}$ is the dimensionality of the feature vector, scaling the dot products for stability.

This form of attention effectively models dependencies between elements but can be computationally intensive. To mitigate computational costs, previous work has explored utilising linear attention [35], token-pyramid downsampling transformers [37], or ELoFTR’s [41] aggregated attention module that uses depth-wise convolution on the query-generating vector and max-pooling on the key- and value-generating vectors. In contrast, VMatcher leverages Mamba’s [12] selective scan mechanism to filter irrelevant information, eliminating the need for the intricate techniques employed by prior methods. As such, ${\tilde{F}_{C_{A}}}$ and ${\tilde{F}_{C_{B}}}$ are simply downsampled using bilinear interpolation before attention computation and upsampled afterwards, which minimises model parameters while maintaining performance.

Moreover, the DS-Transformer module omits the typical Multi-Layer Perceptron (MLP) post-attention layer and instead adds the attention output directly to the input, as including the MLP led to degraded results (see Sec. 5.5). Lastly, similar to previous work [20, 35, 41, 3, 40], Rotary Positional Embedding (RoPE) [34] is employed for queries and keys during self-attention to capture spatial dependencies. However, as Mamba’s [12] sequence modelling properties and parameter structure [19, 15] inherently encode positional information, utilising RoPE [34] may not be necessary, which is further detailed in Sec. 5.5. An illustration of the DS-Transformer can be seen in Fig. 3.
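The downsample, attend, upsample, and residual-add pattern described above can be sketched in a few lines of plain Python. This is a simplified illustration under several assumptions: tokens form a 1D sequence whose length is divisible by the factor, average pooling and nearest-neighbour repetition stand in for the bilinear interpolation used by VMatcher, and the learned query, key, and value projections as well as RoPE are omitted. All function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention over lists of d-dimensional token vectors."""
    d_k = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def downsample(tokens, factor):
    """Average consecutive groups of `factor` tokens (stand-in for bilinear resizing)."""
    return [[sum(col) / factor for col in zip(*tokens[i:i + factor])]
            for i in range(0, len(tokens), factor)]

def upsample(tokens, factor):
    """Repeat each token `factor` times (nearest-neighbour stand-in)."""
    return [list(t) for t in tokens for _ in range(factor)]

def ds_attention(x, source, factor=2):
    """Downsample, attend, upsample, then add the result directly to the input."""
    q = downsample(x, factor)
    kv = downsample(source, factor)
    message = upsample(attention(q, kv, kv), factor)
    return [[a + b for a, b in zip(xi, mi)] for xi, mi in zip(x, message)]

feat_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
feat_b = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
self_out = ds_attention(feat_a, feat_a)   # self-attention
cross_out = ds_attention(feat_a, feat_b)  # cross-attention
```

With a downsampling factor of $s$ along the token dimension, the attention score matrix shrinks by a factor of $s^{2}$, which is where the saving comes from in this sketch.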

3.4 Coarse-Level Matching

Coarse-level correspondences are derived by densely correlating transformed feature vectors $\tilde{F}^{t}_{C_{A}}$ and $\tilde{F}^{t}_{C_{B}}$ to compute score matrix $S_{C}$ , followed by a dual Softmax operation across both dimensions of $S_{C}$ to generate a matching probability map, as in [35, 41, 38, 11]. Coarse-level matches { $M_{c}$ } are obtained using mutual nearest neighbour (MNN) and a threshold $\tau$ .
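The coarse matching step can be illustrated with a small pure-Python sketch: a dual softmax over a toy score matrix, followed by mutual nearest-neighbour selection and thresholding. The score values, the threshold `tau=0.2`, and the function names are illustrative assumptions, and any temperature scaling of the scores is omitted.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dual_softmax(scores):
    """P[i][j] = (softmax over row i)[j] * (softmax over column j)[i]."""
    rows = [softmax(r) for r in scores]
    cols = [softmax(list(c)) for c in zip(*scores)]  # cols[j][i]
    return [[rows[i][j] * cols[j][i] for j in range(len(scores[0]))]
            for i in range(len(scores))]

def mutual_nearest_matches(prob, tau):
    """Keep (i, j) when it is the maximum of both its row and its column and exceeds tau."""
    row_best = [max(range(len(r)), key=r.__getitem__) for r in prob]
    col_best = [max(range(len(prob)), key=lambda i, j=j: prob[i][j])
                for j in range(len(prob[0]))]
    return [(i, j) for i, j in enumerate(row_best)
            if col_best[j] == i and prob[i][j] > tau]

scores = [[9.0, 1.0, 0.0],
          [1.0, 8.0, 0.5],
          [0.5, 0.2, 0.3]]
prob = dual_softmax(scores)
matches = mutual_nearest_matches(prob, tau=0.2)
```

In this toy example the pair (2, 2) is a mutual nearest neighbour but its probability falls below the threshold, so only (0, 0) and (1, 1) are retained.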

3.5 Fine-Level Extraction and Refinement

Adapted from ELoFTR [41], the fine-level refinement module efficiently achieves sub-pixel precision through two components:

Fine Feature Extraction. The transformed coarse feature vectors $\tilde{F}^{t}_{C_{A}}$ and $\tilde{F}^{t}_{C_{B}}$ are fused with backbone feature maps extracted at $\frac{1}{4}$ and $\frac{1}{2}$ resolutions through convolution and upsampling operations, yielding fine maps $\hat{F}_{f_{A}}$ and $\hat{F}_{f_{B}}$ at fine-level resolution. Fine-level feature patches centred on coarse matches { $M_{c}$ } are then extracted from $\hat{F}_{f_{A}}$ and $\hat{F}_{f_{B}}$ , allowing efficient yet precise refinement.

Two-Stage Refinement. First, dense correlation between fine-level feature patches produces a local patch score matrix $S_{f}$ , where high-confidence pixel-level matches are identified via mutual nearest neighbour (MNN) matching. Second, each point in $I_{A}$ is correlated with a $3\times 3$ feature patch centred at its matched location in $I_{B}$ . Softmax is applied to the resulting match distribution matrix, and the refined matches { $M_{f}$ } are obtained by computing the expectation.
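The second refinement stage amounts to a soft-argmax over a $3\times 3$ neighbourhood. The sketch below computes the expected offset under a softmax of the local correlation scores. The offset convention (columns map to $dx$, rows to $dy$, both in $\{-1,0,1\}$), the `temperature` parameter, and the function name are assumptions made for illustration.

```python
import math

def refine_offset(scores_3x3, temperature=1.0):
    """Expected (dx, dy) offset under a softmax over a 3x3 correlation patch.

    scores_3x3[r][c] is the correlation at offset (dx, dy) = (c - 1, r - 1).
    """
    flat = [s / temperature for row in scores_3x3 for s in row]
    m = max(flat)
    exps = [math.exp(s - m) for s in flat]
    total = sum(exps)
    dx = sum(e * (idx % 3 - 1) for idx, e in enumerate(exps)) / total
    dy = sum(e * (idx // 3 - 1) for idx, e in enumerate(exps)) / total
    return dx, dy

# A symmetric patch yields no correction; extra weight on the right pulls dx above zero.
dx0, dy0 = refine_offset([[0, 1, 0], [1, 2, 1], [0, 1, 0]])
dx1, dy1 = refine_offset([[0, 1, 0], [1, 2, 2], [0, 1, 0]])
```

Adding the expected offset to the pixel-level match location gives a sub-pixel estimate, since the expectation is a continuous quantity even though the patch is discrete.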

3.6 Supervision

Following [30, 35, 41], coarse-level supervision uses ground-truth matches { $M_{c_{gt}}$ } generated by warping grid points from $I_{A}$ to $I_{B}$ using depth maps and camera poses. The coarse-level score matrix $S_{C}$ is supervised by minimising the negative log-likelihood over { $M_{c_{gt}}$ }:

$$L_{c}=-\frac{1}{N}\sum_{(\tilde{i},\tilde{j})\in\{M_{c_{gt}}\}}\log S_{C}(\tilde{i},\tilde{j}), \tag{11}$$

where $N$ is the number of ground-truth coarse matches.

Fine-level supervision adopts the two-stage approach of [41]. $L_{f_{1}}$ minimises the negative log-likelihood of each fine local score matrix $S_{f}$ using pixel-level ground-truth matches, while $L_{f_{2}}$ computes the $\ell_{2}$ loss between the final sub-pixel matches { $M_{f}$ } and the ground-truth matches { $M_{f_{gt}}$ }, as in [3, 35, 41]. The total loss is expressed as:

$$L=L_{c}+\alpha L_{f_{1}}+\beta L_{f_{2}}. \tag{12}$$
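A minimal numerical sketch of the coarse term and the weighted total is given below. It assumes that $N$ is the number of ground-truth coarse matches and uses placeholder weights, since the values of $\alpha$ and $\beta$ are not stated in this section. The function names are hypothetical.

```python
import math

def coarse_loss(prob, gt_matches):
    """Mean negative log-likelihood of the matching probabilities at ground-truth pairs."""
    return -sum(math.log(prob[i][j]) for i, j in gt_matches) / len(gt_matches)

def total_loss(l_c, l_f1, l_f2, alpha=1.0, beta=1.0):
    """Weighted sum of coarse and fine terms; alpha and beta are placeholder values."""
    return l_c + alpha * l_f1 + beta * l_f2

# Toy 2x2 matching probability matrix with ground-truth pairs on the diagonal.
prob = [[0.9, 0.05], [0.1, 0.8]]
l_c = coarse_loss(prob, [(0, 0), (1, 1)])
```

The coarse term approaches zero as the probabilities at the ground-truth locations approach one.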

4 Implementation Details

VMatcher Variations. VMatcher offers two configurations: VMatcher-B (Base) with 9.5M parameters and VMatcher-T (Tiny) with 6.9M parameters; architecture details are provided in the Supplementary Material. Moreover, Sec. 5 includes optimised variants of both models, in which the Dual-Softmax operation is omitted during coarse-level matching, following [41]. This modification substantially reduces inference time at a small cost in accuracy.

Inference Time Measurement. Measuring inference time can be complex when comparing sparse and semi-dense methods. A direct comparison is slightly biased towards semi-dense methods since applications such as SfM extract feature points once and cache them, making inference time primarily a factor of feature matching. This does not extend to semi-dense methods, as caching large feature vectors is computationally intensive. Thus, for fairness, feature detection and matching times are separately reported for sparse methods in Sec. 5.

5 Experiments

To ensure a fair evaluation, the LightGlue [20] approach is adopted, tuning RANSAC’s inlier threshold for each model on each test dataset, as the emphasis is on evaluating the models rather than RANSAC itself. Tabs. 1, 2 and 3 present results using PoseLib [17] LO-RANSAC and OpenCV RANSAC [10], with runtimes marked mp if mixed precision was used during inference. Dense methods are excluded despite strong performance, as their computational demands and runtime significantly exceed those of sparse and semi-dense approaches. For consistency, Flash-Attention [5] is omitted to ensure fair comparisons.

5.1 Homography Estimation

Homography estimation is assessed on the HPatches dataset [1], which includes 57 planar scenes with illumination changes and 59 with viewpoint changes. Each scene comprises 6 images linked by ground-truth homography matrices.

Baselines. Comparisons are made against sparse methods, including SuperPoint [7] with Nearest Neighbours and SuperGlue [30], as well as SuperPoint, DISK [38], and Aliked [43] with LightGlue [20]. For semi-dense methods, VMatcher is evaluated against QuadTree [37], AspanFormer [3], LoFTR [35], and ELoFTR [41].

Evaluation Protocol. In line with prior work [35, 30, 11], the shorter side of each image is resized to 480 pixels. The mean projection error for the image’s corner points is computed, and the area under the curve (AUC) error is reported at 3, 5, and 10 pixel thresholds. For consistency with sparse methods, results are reported using the top 1024 predicted matches.
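For readers unfamiliar with this metric, the sketch below shows one common way of computing the mean corner error and the AUC of the cumulative error curve, following the convention used in publicly available matching benchmarks. It is an assumption that the evaluation here follows exactly this convention, and the function names and toy inputs are hypothetical.

```python
import bisect
import math

def warp_point(h, x, y):
    """Apply a 3x3 homography (nested lists) to the point (x, y)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

def mean_corner_error(h_est, h_gt, width, height):
    """Mean distance between the four image corners warped by h_est and by h_gt."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    dists = [math.dist(warp_point(h_est, x, y), warp_point(h_gt, x, y))
             for x, y in corners]
    return sum(dists) / len(dists)

def error_auc(errors, thresholds):
    """Area under the cumulative error curve, normalised by each threshold."""
    errs = [0.0] + sorted(errors)
    recall = [i / len(errors) for i in range(len(errors) + 1)]
    aucs = []
    for t in thresholds:
        last = bisect.bisect_left(errs, t)
        e = errs[:last] + [t]
        r = recall[:last] + [recall[last - 1]]
        # Trapezoidal integration of recall over error, up to the threshold.
        area = sum((e[k + 1] - e[k]) * (r[k + 1] + r[k]) / 2 for k in range(len(e) - 1))
        aucs.append(area / t)
    return aucs

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shifted = [[1, 0, 3], [0, 1, 4], [0, 0, 1]]  # pure translation by (3, 4)
corner_err = mean_corner_error(shifted, identity, 640, 480)
aucs = error_auc([1.0, 2.0, 4.0, 20.0], thresholds=[3, 5, 10])
```

In the toy example, a translation of (3, 4) pixels gives a mean corner error of 5 pixels, and the four sample errors produce AUC values that increase with the threshold.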

Results. As shown in Tab. 1, VMatcher models match or outperform both sparse and semi-dense methods on most metrics despite limiting keypoints to 1024. VMatcher-B matches ELoFTR [41] in runtime (further analysed in Sec. 5.4), while VMatcher-T processes images 1.8 $\times$ faster than LoFTR [35] and 2.2 $\times$ faster than AspanFormer [3] while maintaining competitive accuracy. The optimised variants further enhance efficiency, with VMatcher-T$_{\text{Opt}}$ reducing runtime by a further 16.9% relative to VMatcher-T, achieving a favourable balance between accuracy and computational efficiency.

Matcher Type	Method	HPatches, LO-RANSAC (AUC@3px / AUC@5px / AUC@10px)	HPatches, RANSAC (AUC@3px / AUC@5px / AUC@10px)	Time (ms)
Sparse	SP+NN	63.7 / 74.5 / 84.4	54.6 / 67.8 / 80.2	17.50 (15.58/2.36)
Sparse	SP+SG	68.3 / 78.6 / 87.8	59.8 / 71.0 / 82.8	46.05 (15.58/30.47_mp)
Sparse	SP+LG	67.7 / 78.0 / 87.2	59.1 / 72.1 / 83.7	34.11 (15.14/18.97_mp)
Sparse	DISK+LG	61.7 / 74.0 / 84.9	52.6 / 67.2 / 80.7	51.28 (33.42/17.85_mp)
Sparse	Aliked+LG	66.3 / 77.1 / 86.7	61.1 / 73.4 / 84.5	36.74 (21.01/15.73_mp)
Semi-Dense	QuadTree	69.5 / 79.2 / 87.3	67.4 / 77.4 / 85.9	100.64_mp
Semi-Dense	AspanFormer	68.7 / 78.5 / 86.9	66.7 / 76.6 / 85.7	65.54_mp
Semi-Dense	LoFTR	69.5 / 78.7 / 86.7	67.6 / 76.9 / 85.4	52.35_mp
Semi-Dense	ELoFTR	69.6 / 78.9 / 87.1	67.5 / 77.3 / 86.0	34.48_mp
Semi-Dense	VMatcher-B	69.8 / 79.1 / 87.5	67.8 / 77.5 / 86.4	35.43_mp
Semi-Dense	VMatcher-T	70.2 / 79.3 / 87.4	67.8 / 77.3 / 86.0	29.23_mp
Semi-Dense-Opt	ELoFTR$_{\text{Opt}}$	68.6 / 78.2 / 86.7	65.6 / 75.4 / 84.5	28.62_mp
Semi-Dense-Opt	VMatcher-B$_{\text{Opt}}$	68.9 / 78.5 / 86.9	66.4 / 76.1 / 85.3	29.58_mp
Semi-Dense-Opt	VMatcher-T$_{\text{Opt}}$	67.6 / 77.5 / 86.5	64.7 / 74.8 / 84.7	24.29_mp

Table 1: Homography Estimation Results on HPatches. Reprojection error AUC scores and runtimes for sparse and semi-dense methods.

5.2 Relative Pose Estimation

Relative pose estimation evaluation is performed on both outdoor and indoor datasets. Following prior work [20, 30, 41, 35] for outdoor scenes, the MegaDepth dataset [18] test split from [35] is utilised, which includes 1500 image pairs from “Sacre Coeur” and “St. Peter’s Square”. For indoor scenes, the ScanNet dataset [4] is used, which comprises 1613 sequences capturing challenges such as viewpoint variation and textureless regions. Evaluation is performed on the ScanNet [4] test split of 1500 image pairs from [30].

Baselines. Relative pose estimation evaluation utilises similar baselines as in Sec. 5.1.

Evaluation Protocol. For outdoor scenes, images are resized with the longest edge at 1184 pixels for semi-dense methods and 1600 for sparse methods, as specified in [20]. For indoor scenes, images are resized to $640\times 480$ pixels. The pose error is measured as the maximum of the angular rotation and translation errors, with the AUC reported at thresholds of $5^{\circ}$, $10^{\circ}$, and $20^{\circ}$. Sparse methods use the top 2048 keypoints for evaluation.

Results. As shown in Tab. 2, VMatcher variants outperform all sparse and semi-dense methods on MegaDepth [18] while offering considerable runtime advantages. However, the performance differences between models are minimal ( $\approx$ 1.2% in AUC@ $10^{\circ}$ ), with the primary distinction lying in computational efficiency, where VMatcher-B is 1.15 $\times$ faster than ELoFTR [41] and 1.45 $\times$ faster than LoFTR [35], while VMatcher-T achieves even greater speed improvements at 1.26 $\times$ and 1.60 $\times$ faster, respectively. The optimised variants achieve near-sparse processing speeds with an insignificant degradation in performance.

From Tab. 3, VMatcher performance is comparable to other semi-dense methods, with AspanFormer [3] leading in performance. Notably, VMatcher-B runtime is similar to that of ELoFTR [41], consistent with the findings in Sec. 5.1 (further analysed in Sec. 5.4), while VMatcher-T offers additional runtime advantages. Interestingly, VMatcher-T$_{\text{Opt}}$ strikes the best balance between accuracy and runtime, outperforming all sparse methods under LO-RANSAC while also running faster.

A key limitation to highlight is the instability in relative pose estimation, particularly in translation error computation. The error is derived from the angular difference between ground truth and estimated translation vectors, which becomes unreliable due to scale ambiguity [14, 42], especially with small ground-truth translation norms ( $||t_{GT}||$ ). This affects 139 of 1500 image pairs in the ScanNet [4] test split.
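The sensitivity described above can be reproduced with a short sketch of the angular pose error. The same absolute perturbation of the estimated translation produces a negligible angular error when the ground-truth translation is large and a large angular error when it is small. The function names and toy vectors are illustrative assumptions, and benchmark code may additionally handle sign ambiguity in the translation direction, which is not shown here.

```python
import math

def rotation_error_deg(r_est, r_gt):
    """Angle of the relative rotation R_est^T R_gt, in degrees."""
    trace = sum(r_est[k][i] * r_gt[k][i] for i in range(3) for k in range(3))
    cos = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos))

def translation_error_deg(t_est, t_gt):
    """Angle between translation directions; ill-conditioned when a norm is near zero."""
    dot = sum(a * b for a, b in zip(t_est, t_gt))
    norm = math.sqrt(sum(a * a for a in t_est)) * math.sqrt(sum(b * b for b in t_gt))
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))

def pose_error_deg(r_est, t_est, r_gt, t_gt):
    """Maximum of the rotation and translation angular errors."""
    return max(rotation_error_deg(r_est, r_gt), translation_error_deg(t_est, t_gt))

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
rot_err = rotation_error_deg(rot_z_90, identity)

# The same 0.001 perturbation along y, for a large and a small ground-truth translation.
err_large = translation_error_deg([1.0, 0.001, 0.0], [1.0, 0.0, 0.0])
err_small = translation_error_deg([0.001, 0.001, 0.0], [0.001, 0.0, 0.0])
```

Here `err_large` is a small fraction of a degree whereas `err_small` is 45 degrees, which illustrates why pairs with small $||t_{GT}||$ can dominate the reported pose error.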

Matcher Type	Method	MegaDepth, LO-RANSAC (AUC@ $5^{\circ}$ / AUC@ $10^{\circ}$ / AUC@ $20^{\circ}$ )	MegaDepth, RANSAC (AUC@ $5^{\circ}$ / AUC@ $10^{\circ}$ / AUC@ $20^{\circ}$ )	Time (ms)
Sparse	SP+NN	51.1 / 64.4 / 73.9	39.1 / 54.3 / 66.4	43.58 (40.73/2.85)
Sparse	SP+SG	66.0 / 78.8 / 87.6	50.7 / 68.2 / 80.9	82.61 (40.73/41.88_mp)
Sparse	SP+LG	66.8 / 79.4 / 88.0	51.5 / 68.4 / 81.1	58.25 (40.73/17.54_mp)
Sparse	DISK+LG	61.1 / 74.2 / 83.9	44.7 / 62.5 / 76.4	161.04 (143.67/17.36_mp)
Sparse	Aliked+LG	66.3 / 78.9 / 87.5	51.7 / 68.3 / 80.1	80.04 (62.58/17.45_mp)
Semi-Dense	QuadTree	66.5 / 78.9 / 87.2	51.5 / 68.4 / 80.6	398.38
Semi-Dense	AspanFormer	69.2 / 80.9 / 88.7	55.2 / 71.5 / 83.2	160.73_mp
Semi-Dense	LoFTR	68.3 / 80.1 / 88.3	53.5 / 69.8 / 81.9	126.94_mp
Semi-Dense	ELoFTR	69.4 / 80.8 / 88.7	56.0 / 72.1 / 83.6	100.25_mp
Semi-Dense	VMatcher-B	69.6 / 81.1 / 88.9	56.0 / 72.2 / 83.6	87.56_mp
Semi-Dense	VMatcher-T	69.4 / 81.0 / 88.8	55.8 / 71.9 / 83.4	79.46_mp
Semi-Dense-Opt	ELoFTR$_{\text{Opt}}$	68.8 / 80.3 / 88.2	55.9 / 71.9 / 83.3	73.52_mp
Semi-Dense-Opt	VMatcher-B$_{\text{Opt}}$	69.5 / 81.0 / 88.7	56.3 / 72.3 / 83.6	66.16_mp
Semi-Dense-Opt	VMatcher-T$_{\text{Opt}}$	69.0 / 80.7 / 88.4	55.5 / 71.6 / 83.2	55.94_mp

Table 2: Outdoor Relative Pose Estimation Results on the MegaDepth dataset [18]. VMatcher variants deliver improved performance and runtime speed advantages compared to semi-dense methods.

Matcher Type	Method	ScanNet, LO-RANSAC (AUC@ $5^{\circ}$ / AUC@ $10^{\circ}$ / AUC@ $20^{\circ}$ )	ScanNet, RANSAC (AUC@ $5^{\circ}$ / AUC@ $10^{\circ}$ / AUC@ $20^{\circ}$ )	Time (ms)
Sparse	SP+NN	15.7 / 31.7 / 48.6	11.5 / 25.6 / 42.3	19.04 (14.56/4.48)
Sparse	SP+SG	22.1 / 40.8 / 58.1	18.1 / 36.6 / 54.4	57.93 (14.56/43.47_mp)
Sparse	SP+LG	20.0 / 38.0 / 55.6	16.9 / 34.3 / 51.9	42.79 (14.56/28.23_mp)
Sparse	DISK+LG	18.8 / 34.1 / 49.2	19.0 / 35.1 / 50.1	48.16 (28.81/19.35_mp)
Sparse	Aliked+LG	21.0 / 39.1 / 56.0	17.0 / 33.6 / 50.3	45.78 (23.76/22.02_mp)
Semi-Dense	QuadTree	24.8 / 43.8 / 60.3	20.2 / 38.4 / 55.4	89.77
Semi-Dense	AspanFormer	26.4 / 45.7 / 62.4	21.6 / 40.2 / 57.1	49.70_mp
Semi-Dense	LoFTR	22.4 / 39.8 / 55.6	17.2 / 34.0 / 50.3	33.02_mp
Semi-Dense	ELoFTR	25.0 / 43.8 / 60.0	21.5 / 39.8 / 56.2	23.92_mp
Semi-Dense	VMatcher-B	25.0 / 43.9 / 60.0	21.5 / 39.5 / 56.1	24.52_mp
Semi-Dense	VMatcher-T	23.3 / 42.0 / 58.4	19.7 / 37.5 / 54.1	19.41_mp
Semi-Dense-Opt	ELoFTR$_{\text{Opt}}$	24.5 / 43.5 / 59.8	20.0 / 38.2 / 54.7	18.59_mp
Semi-Dense-Opt	VMatcher-B$_{\text{Opt}}$	24.9 / 43.8 / 60.0	21.1 / 39.1 / 55.5	19.84_mp
Semi-Dense-Opt	VMatcher-T$_{\text{Opt}}$	22.9 / 41.4 / 58.5	19.5 / 37.1 / 53.7	14.54_mp

Table 3: Indoor Relative Pose Estimation Results on ScanNet [4]. VMatcher variants are on par with sparse and other semi-dense methods, while VMatcher-T$_{\text{Opt}}$ offers the optimal balance between speed and accuracy.

5.3 Visual Localization

Visual localization is evaluated on the Aachen v1.1 and InLoc datasets [31, 36]. Aachen v1.1 [31] presents challenging viewpoint variations and illumination changes, including day-night transitions, while InLoc [36] features indoor scenes with repetitive structures and textureless regions. Following [35, 3, 41, 20], the Hierarchical Localization (HLoc) framework [29] is employed for benchmarking. Results are reported for Aachen v1.1’s day/night scenes and InLoc’s DUC1/DUC2 test sets.

Baselines. Sparse methods include SuperPoint [7] with SuperGlue [30] and LightGlue [20], while for semi-dense methods VMatcher is evaluated against ELoFTR [41].

Evaluation Protocol. HLoc’s [29] evaluation pipeline is followed for both sparse and semi-dense methods. For semi-dense methods, the longest edge of Aachen v1.1 and InLoc [31, 36] images is resized to 1184 and 800 pixels, respectively. ELoFTR [41] is evaluated under these conditions since no evaluation settings are reported. Results are reported as the percentage of poses within predefined distance and angular error thresholds. For Absolute Pose Estimation (APE) on the Aachen v1.1 dataset [31], HLoc’s default configuration using PyCOLMAP [32, 33] is adopted. In contrast, due to the stochastic behaviour of PyCOLMAP on the InLoc dataset [36], affecting both sparse and semi-dense methods, APE is performed using PoseLib.

Results. From Tab. 4, both sparse and semi-dense methods yield similar results on Aachen v1.1 [31], with insignificant variations attributed to dataset saturation [20]. Moreover, Tab. 5 shows that VMatcher performs on par with state-of-the-art feature matchers on InLoc [36], with differences being statistically insignificant due to limited query samples per test split [20]. However, the results highlight the computational advantage of VMatcher, particularly the optimised models, which achieve sparse-like runtime speeds.

Matcher Type	Method	Aachen v1.1, Day: (0.25m, $2^{\circ}$ )/(0.5m, $5^{\circ}$ )/(1.0m, $10^{\circ}$ )	Aachen v1.1, Night: (0.25m, $2^{\circ}$ )/(0.5m, $5^{\circ}$ )/(1.0m, $10^{\circ}$ )	Time (ms)
Sparse	SP+SG	90.3 / 96.2 / 99.4	76.4 / 90.1 / 100.0	209.50 (19.27/168.23_mp)
Sparse	SP+LG	90.3 / 96.2 / 99.2	77.5 / 91.6 / 99.5	51.35 (19.27/32.08_mp)
Semi-Dense	ELoFTR	89.0 / 96.0 / 98.7	77.0 / 91.1 / 99.5	90.91_mp
Semi-Dense	VMatcher-B	89.4 / 96.5 / 99.0	77.0 / 91.6 / 99.5	78.32_mp
Semi-Dense	VMatcher-T	88.7 / 96.0 / 98.8	77.5 / 91.1 / 99.5	68.06_mp
Semi-Dense-Opt	ELoFTR$_{\text{Opt}}$	89.0 / 95.5 / 98.4	76.4 / 91.1 / 99.0	60.13_mp
Semi-Dense-Opt	VMatcher-B$_{\text{Opt}}$	89.3 / 96.1 / 98.8	77.5 / 91.6 / 99.0	54.19_mp
Semi-Dense-Opt	VMatcher-T$_{\text{Opt}}$	88.8 / 95.6 / 98.1	77.0 / 91.1 / 99.0	46.58_mp

Table 4: Aachen v1.1 Dataset [31] Visual Relocalization.

Matcher Type	Method	InLoc, DUC1: (0.25m, $2^{\circ}$ )/(0.5m, $5^{\circ}$ )/(1.0m, $10^{\circ}$ )	InLoc, DUC2: (0.25m, $2^{\circ}$ )/(0.5m, $5^{\circ}$ )/(1.0m, $10^{\circ}$ )	Time (ms)
Sparse	SP+SG	46.0 / 68.7 / 79.8	57.3 / 77.9 / 79.4	153.12 (21.44/131.68_mp)
Sparse	SP+LG	44.9 / 68.2 / 77.8	56.5 / 73.3 / 77.9	40.21 (21.44/18.77_mp)
Semi-Dense	ELoFTR	52.5 / 74.2 / 86.4	64.1 / 81.7 / 86.3	39.44_mp
Semi-Dense	VMatcher-B	55.1 / 74.2 / 85.9	61.1 / 81.7 / 85.5	36.79_mp
Semi-Dense	VMatcher-T	53.5 / 74.2 / 85.4	59.5 / 79.4 / 83.2	32.52_mp
Semi-Dense-Opt	ELoFTR$_{\text{Opt}}$	50.5 / 72.2 / 84.3	59.5 / 78.6 / 84.0	32.75_mp
Semi-Dense-Opt	VMatcher-B$_{\text{Opt}}$	50.5 / 72.2 / 83.8	61.1 / 79.4 / 84.7	30.09_mp
Semi-Dense-Opt	VMatcher-T$_{\text{Opt}}$	52.0 / 74.2 / 85.4	60.3 / 78.6 / 81.7	24.82_mp

Table 5: InLoc Dataset [36] Visual Relocalization.

5.4 Runtime Breakdown

From Sec. 5.1 to Sec. 5.3, it is evident that the runtime of VMatcher varies across experiments. Therefore, Fig. 4 presents a runtime breakdown on ScanNet [4] across multiple resolutions, providing a more detailed analysis. As a sequence-length-dependent model, the Mamba [12, 6] architecture imposes a constant computational overhead of $\approx 0.8$ ms per layer. At lower resolutions, VMatcher and its optimised variants achieve runtimes comparable to ELoFTR [41], with VMatcher-T being slightly more efficient. At higher resolutions, this overhead becomes negligible, allowing VMatcher to maintain competitive performance while improving efficiency across all variants.

5.5 Ablation Study

Method	MegaDepth, LO-RANSAC	MegaDepth, RANSAC	Time (ms)
	AUC@$5^{\circ}$ / AUC@$10^{\circ}$ / AUC@$20^{\circ}$	AUC@$5^{\circ}$ / AUC@$10^{\circ}$ / AUC@$20^{\circ}$	
VMatcher-B (Full Model)	69.5 / 81.1 / 88.9	56.0 / 72.2 / 83.6	87.56_mp
a) VGG Backbone only	61.8 / 72.3 / 82.2	47.7 / 64.1 / 75.4	64.42_mp
b) VGG + DS-Transformer (w/o MambaV + gMLP)	64.9 / 75.6 / 84.1	49.9 / 65.9 / 77.6	69.51_mp
c) VGG + MambaVision + gMLP (w/o DS-Transformer)	65.4 / 76.9 / 85.7	50.5 / 66.4 / 78.3	80.14_mp
d) Full Model w/ Rep-VGG Backbone	69.1 / 80.7 / 88.5	55.6 / 71.4 / 83.1	97.22_mp
e) MLP replacing gMLP	68.1 / 80.1 / 88.0	53.7 / 70.2 / 82.2	86.86_mp
f) MLP after Attention	67.7 / 79.9 / 87.8	53.4 / 70.0 / 82.0	89.77_mp
g) Full Model w/o Self-Attention RoPE	69.7 / 81.2 / 88.9	56.0 / 72.1 / 83.2	86.79_mp

Table 6: Ablation Study. Analysing VMatcher’s architecture.

Tab. 6 presents an ablation study of VMatcher, highlighting the impact of the hybrid Mamba-Transformer architecture and key design choices. The VGG backbone alone (row a) shows significantly reduced performance, while adding either the DS-Transformer or MambaVision+gMLP substantially improves results (rows b, c). However, neither component on its own matches the full model, confirming the effectiveness of the hybrid approach. Replacing the lightweight VGG backbone with RepVGG (row d) yields lower performance at a higher computational cost, validating the lightweight backbone choice in Sec. 3.1. Substituting the Gated MLP with a standard Feed-Forward MLP (row e) offers a minor runtime improvement but compromises accuracy. As detailed in Sec. 3.3, adding an MLP after attention (row f) negatively impacts both performance and efficiency. Notably, removing RoPE [34] from self-attention (row g) slightly improves runtime and LO-RANSAC accuracy, with only a marginal drop under RANSAC, which aligns with the hypothesis in Sec. 3.3 that Mamba’s [12] inherent positional encoding makes additional positional encoding unnecessary.
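To make these comparisons concrete, the differences relative to the full model can be computed directly from Tab. 6. The sketch below uses only the AUC@$5^{\circ}$ (LO-RANSAC) and runtime columns; the shortened row labels and the helper name `deltas` are illustrative choices, not part of the original evaluation code.

```python
# Values copied from Tab. 6: (AUC@5 deg with LO-RANSAC, runtime in ms).
FULL = (69.5, 87.56)
ABLATIONS = {
    "a) VGG backbone only": (61.8, 64.42),
    "b) VGG + DS-Transformer": (64.9, 69.51),
    "c) VGG + MambaVision + gMLP": (65.4, 80.14),
    "d) RepVGG backbone": (69.1, 97.22),
    "e) MLP replacing gMLP": (68.1, 86.86),
    "f) MLP after attention": (67.7, 89.77),
    "g) w/o self-attention RoPE": (69.7, 86.79),
}


def deltas(full, rows):
    """Return (AUC difference, runtime difference in ms) of each row vs. the full model."""
    result = {}
    for name, (auc, ms) in rows.items():
        result[name] = (round(auc - full[0], 1), round(ms - full[1], 2))
    return result


for name, (d_auc, d_ms) in deltas(FULL, ABLATIONS).items():
    print(f"{name:30s} AUC@5: {d_auc:+.1f}  time: {d_ms:+.2f} ms")
```

For instance, row a trades 7.7 AUC points for a 23.14 ms speed-up, whereas row d loses 0.4 points while adding 9.66 ms.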

6 Conclusion

This paper introduces VMatcher, a hybrid Mamba-Transformer semi-dense feature matcher that prioritises computational efficiency without sacrificing performance. Evaluations demonstrated that VMatcher achieves runtime gains over existing semi-dense matchers, with optimised variants reaching sparse-like processing speeds, marking a notable advancement in semi-dense feature matching efficiency. Future work may focus on further refining the architecture, aiming for greater computational efficiency while maintaining accuracy. This work hopes to pave the way for more efficient feature matching methodologies, contributing to the ongoing evolution of the field and offering a promising direction towards more practical and efficient solutions.

References

Balntas et al. [2017] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In CVPR, 2017.
Bay et al. [2006] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Computer Vision – ECCV 2006, pages 404–417, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
Chen et al. [2022] Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David Mckinnon, Yanghai Tsin, and Long Quan. Aspanformer: Detector-free image matching with adaptive span transformer, 2022.
Dai et al. [2017] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas A. Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. CoRR, abs/1702.04405, 2017.
Dao [2023] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023.
Dao and Gu [2024] Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality, 2024.
DeTone et al. [2017] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. CoRR, abs/1712.07629, 2017.
Ding et al. [2021] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again, 2021.
Dusmanu et al. [2019] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8084–8093, 2019.
Fischler and Bolles [1981] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
Gleize et al. [2023] Pierre Gleize, Weiyao Wang, and Matt Feiszli. Silk: Simple learned keypoints. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22499–22508, 2023.
Gu and Dao [2024] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024.
Gu et al. [2021] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state-space layers, 2021.
Hartley and Zisserman [2004] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
Hatamizadeh and Kautz [2024] Ali Hatamizadeh and Jan Kautz. Mambavision: A hybrid mamba-transformer vision backbone, 2024.
Jin et al. [2020] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image Matching across Wide Baselines: From Paper to Practice. International Journal of Computer Vision, 2020.
Larsson and contributors [2020] Viktor Larsson and contributors. PoseLib - Minimal Solvers for Camera Pose Estimation, 2020.
Li and Snavely [2018] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
Lieber et al. [2024] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham. Jamba: A hybrid transformer-mamba language model, 2024.
Lindenberger et al. [2023] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed, 2023.
Liu et al. [2024] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. Vmamba: Visual state space model, 2024.
Lowe [2004] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
NVIDIA Corporation [2020] NVIDIA Corporation. Triton Inference Server: An Optimized Cloud and Edge Inferencing Solution., 2020.
Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. CoRR, abs/1912.01703, 2019.
Pei et al. [2024] Xiaohuan Pei, Tao Huang, and Chang Xu. Efficientvmamba: Atrous selective scan for light weight visual mamba, 2024.
Potje et al. [2024] Guilherme Potje, Felipe Cadar, Andre Araujo, Renato Martins, and Erickson R. Nascimento. Xfeat: Accelerated features for lightweight image matching, 2024.
Revaud et al. [2019] Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2d2: Repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195, 2019.
Rublee et al. [2011] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In 2011 International Conference on Computer Vision, pages 2564–2571, 2011.
Sarlin et al. [2019] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale, 2019.
Sarlin et al. [2020] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks, 2020.
Sattler et al. [2018] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomas Pajdla. Benchmarking 6dof outdoor visual localization in changing conditions, 2018.
Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Schönberger et al. [2016] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
Su et al. [2023] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023.
Sun et al. [2021] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8918–8927, 2021.
Taira et al. [2018] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In CVPR, 2018.
Tang et al. [2022] Shitao Tang, Jiahui Zhang, Siyu Zhu, and Ping Tan. Quadtree attention for vision transformers, 2022.
Tyszkiewicz et al. [2020] Michał J. Tyszkiewicz, Pascal Fua, and Eduard Trulls. Disk: Learning local features with policy gradient, 2020.
Vaswani et al. [2023] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
Wang et al. [2022] Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, and Rainer Stiefelhagen. Matchformer: Interleaving attention in transformers for feature matching, 2022.
Wang et al. [2024] Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. Efficient loftr: Semi-dense local feature matching with sparse-like speed, 2024.
Yi et al. [2018] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2666–2674, 2018.
Zhao et al. [2023] Xiaoming Zhao, Xingming Wu, Weihai Chen, Peter C. Y. Chen, Qingsong Xu, and Zhengguo Li. Aliked: A lighter keypoint and descriptor extraction network via deformable transformation, 2023.
Zhu et al. [2024] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model, 2024.

Appendix

A VMatcher Model Configurations

VMatcher configurations introduced in Sec. 4 are designed to balance computational efficiency and matching accuracy by varying the model size:

• VMatcher-B (Base):
  – Architecture: 24 layers (9.5M parameters)
  – Layer Pattern: [M G M_s G S M G M_s G C M G M_s G S M G M_s G C M G M_s G]
• VMatcher-T (Tiny):
  – Architecture: 14 layers (6.9M parameters)
  – Layer Pattern: [M G M_s G S M G M_s G C M G M_s G]

Where the architectural components are defined as follows:

• M: MambaVision layer
• M_s: MambaVision_s layer, i.e. a MambaVision layer applied to the transposed input, after which the output is restored to its original orientation. This switches the scan from row-major to column-major order.
• G: Gated Multi-Layer Perceptron (gMLP) layer
• S: Downsampled-Transformer Self-attention layer
• C: Downsampled-Transformer Cross-attention layer
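As a quick consistency check, the layer patterns above can be parsed and counted. The sketch below is illustrative only: the dictionary and function names are assumptions, and it does not construct any actual network layers.

```python
from collections import Counter

# Layer patterns copied from the configuration list above.
PATTERNS = {
    "VMatcher-B": "M G M_s G S M G M_s G C M G M_s G S M G M_s G C M G M_s G",
    "VMatcher-T": "M G M_s G S M G M_s G C M G M_s G",
}

# Symbol legend as defined above.
LAYER_NAMES = {
    "M": "MambaVision",
    "M_s": "MambaVision_s (column-major scan)",
    "G": "gMLP",
    "S": "DS-Transformer self-attention",
    "C": "DS-Transformer cross-attention",
}


def summarise(pattern):
    """Return (total number of layers, per-symbol counts) for a pattern string."""
    tokens = pattern.split()
    unknown = set(tokens) - set(LAYER_NAMES)
    if unknown:
        raise ValueError(f"unknown layer symbols: {sorted(unknown)}")
    return len(tokens), Counter(tokens)


for model, pattern in PATTERNS.items():
    total, counts = summarise(pattern)
    print(model, total, dict(counts))
```

The totals agree with the stated 24 and 14 layers. Both patterns repeat the four-layer block [M G M_s G], separated alternately by a self-attention (S) and a cross-attention (C) layer.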

B Training Details

All models were trained exclusively on the MegaDepth dataset [18], which features 196 tourist landmarks with camera calibration, poses, and depth maps derived through COLMAP SfM and Multi-View Stereo (MVS) [32, 33]. Training used the AdamW optimiser with a learning rate of $1\times 10^{-4}$, a weight decay of 0.1, and a batch size of 1 over 30 epochs. Before attention layers, the features are downsampled by a factor of 4 using bilinear interpolation. The loss weights $\alpha$ and $\beta$ in Eq. 12 were set to 1.0 and 0.25, respectively. To accommodate more samples per optimiser step, gradients were accumulated over 8 steps for VMatcher-B and 32 steps for VMatcher-T. All training and evaluations were conducted on a single RTX 3090Ti GPU using PyTorch [24], with each epoch taking approximately 2 hours.
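The effect of gradient accumulation with a batch size of 1 can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in: it uses a one-parameter linear model and plain gradient descent rather than AdamW, and all function names and sample values are hypothetical. It only shows that accumulating scaled gradients over $k$ micro-batches of size 1 matches one step on a batch of $k$ samples.

```python
import math

# Accumulation steps from the text; the per-step batch size is 1.
ACCUM_STEPS = {"VMatcher-B": 8, "VMatcher-T": 32}
EFFECTIVE_BATCH = {name: 1 * steps for name, steps in ACCUM_STEPS.items()}


def grad_single(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x


def accumulated_step(w, samples, lr):
    """One update built from len(samples) micro-batches of size 1."""
    grad = 0.0
    for x, y in samples:
        # Each micro-batch gradient is scaled so the sum equals the batch mean.
        grad += grad_single(w, x, y) / len(samples)
    return w - lr * grad


def full_batch_step(w, samples, lr):
    """One update using the mean gradient over the whole batch at once."""
    grad = sum(grad_single(w, x, y) for x, y in samples) / len(samples)
    return w - lr * grad


# Eight hypothetical (x, y) samples, matching the VMatcher-B accumulation setting.
samples = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]
w_acc = accumulated_step(0.5, samples, lr=1e-4)
w_full = full_batch_step(0.5, samples, lr=1e-4)
print(EFFECTIVE_BATCH, math.isclose(w_acc, w_full))
```

Under these assumptions, the effective batch sizes are 8 for VMatcher-B and 32 for VMatcher-T, while memory usage stays that of a single sample.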

C Bidirectional Models

Bidirectional variants of VMatcher-B and VMatcher-T are also introduced. In these variants, the MambaVision layer [15] scans the feature vector bidirectionally, as seen in Fig. 5, rather than unidirectionally (see Fig. 2). For the bidirectional models, the MambaVision layer computes the following:

$$y_{f}=\text{MambaVision}(\text{Norm}(x_{in})). \tag{13}$$

$$y_{b}=\text{flip}(\text{MambaVision}(\text{flip}(\text{Norm}(x_{in})))). \tag{14}$$

The outputs of Eqs. 13 and 14 are subsequently passed through the last linear layer, resulting in the final output of the MambaVision layer, as seen in Eq. 15:

$$y=x_{in}+\text{Linear}(y_{f}+y_{b}). \tag{15}$$
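The data flow of Eqs. 13 to 15 can be sketched on a one-dimensional sequence. In the snippet below, a running sum stands in for the MambaVision scan, and both Norm and Linear default to the identity; these are placeholders chosen purely for illustration and do not reflect the actual layers.

```python
def causal_scan(seq):
    """Stand-in for a unidirectional scan: output i depends only on inputs 0..i."""
    out, total = [], 0.0
    for value in seq:
        total += value
        out.append(total)
    return out


def flip(seq):
    """Reverse the sequence order."""
    return seq[::-1]


def bidirectional_layer(x_in, norm=lambda s: s, linear=lambda s: s):
    """Combine a forward and a backward scan with a residual connection."""
    x = norm(x_in)
    y_f = causal_scan(x)                     # Eq. 13: forward scan
    y_b = flip(causal_scan(flip(x)))         # Eq. 14: scan the flipped input, flip back
    mixed = linear([f + b for f, b in zip(y_f, y_b)])
    return [xi + m for xi, m in zip(x_in, mixed)]  # Eq. 15: residual sum


print(bidirectional_layer([1.0, 2.0, 3.0]))  # [8.0, 10.0, 12.0]
```

With the forward scan alone, the first position sees only itself; after adding the backward scan, every position aggregates information from the whole sequence, at the cost of a second scan.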

As seen in Tab. 7, the bidirectional models offer no substantial improvement over their unidirectional counterparts while increasing inference time. This aligns with findings in other vision tasks [15, 44], where bidirectional scans offer limited practical benefit over unidirectional ones.

Method	MegaDepth Dataset			Time (ms)
	AUC@ $5^{\circ}$	AUC@ $10^{\circ}$	AUC@ $20^{\circ}$
VMatcher-B	69.6	81.1	88.9	87.56_mp
VMatcher-B-Bi	69.3	81.0	88.8	97.67_mp
VMatcher-T	69.4	81.0	88.8	79.46_mp
VMatcher-T-Bi	69.4	80.6	88.6	86.25_mp

Table 7: Unidirectional and Bidirectional VMatcher model comparison.

D Design Choices

D.1 Downsampled Transformer

Preserving information during downsampling is crucial, prompting experiments with various methods, including Bilinear Interpolation, Area Interpolation, Depth-wise Convolution, Max Pooling, and Average Pooling. Although Bilinear Interpolation is the slowest of these methods, as seen in Tab. 8, experiments revealed that Max and Average Pooling caused the largest degradation in performance, by $\approx 1.5$ to $2\%$. Depth-wise Convolution was avoided despite its efficiency, as it would introduce additional learnable parameters and make the downsampling quality dependent on the learnt convolutional kernels. Area Interpolation performed comparably to Bilinear Interpolation but showed a slight performance drop of $\approx 0.5$ to $1.0\%$. Given that the difference in runtime is negligible in the context of the complete pipeline, Bilinear Interpolation was ultimately chosen, as it offers the best performance at a negligible runtime cost.

Downsampling Method	Time (ms)
Bilinear Interpolation	0.072
Area Interpolation	0.042
Depth-wise convolution	0.035
Max pooling	0.033
Average pooling	0.036

Table 8: Downsampling methods runtimes.

D.2 MambaVision

Numerous Mamba [12] variants have been proposed for Computer Vision tasks; however, existing methods prioritise either high accuracy or fast inference speed, often at the cost of the other. Initially, the architecture used Vanilla Mamba [12] in both bidirectional and unidirectional configurations. However, Vanilla Mamba proved less efficient than MambaVision [15], with nearly $1.5\times$ the runtime and slightly lower performance. Given that the goal is to optimise both runtime and performance, MambaVision [15] was chosen as the variant for VMatcher’s architecture.

E Rotation Invariance

DISK’s [38] rotation invariance evaluation protocol was followed, using the Image Matching Challenge (IMC) 2020 [16] validation set. For each angle $\theta$, 36 images are randomly selected and matched with their rotated counterparts. The ratio of correct matches, defined as those with a reprojection error below 3 pixels, is then computed. Results are shown in Fig. 8.
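The ratio of correct matches for a single image and angle can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: the rotation is taken about a given centre, the sign convention follows standard counter-clockwise rotation in $(x, y)$ coordinates, and the function names and example points are hypothetical.

```python
import math


def rotate_point(pt, theta_deg, centre):
    """Rotate a 2D point by theta_deg about `centre` (assumed convention)."""
    theta = math.radians(theta_deg)
    dx, dy = pt[0] - centre[0], pt[1] - centre[1]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (centre[0] + dx * cos_t - dy * sin_t,
            centre[1] + dx * sin_t + dy * cos_t)


def correct_match_ratio(matches, theta_deg, centre, max_error_px=3.0):
    """Fraction of matches whose reprojection error is below `max_error_px`.

    `matches` is a list of (point_in_original, point_in_rotated) pairs.
    """
    if not matches:
        return 0.0
    correct = 0
    for p_orig, p_rot in matches:
        expected = rotate_point(p_orig, theta_deg, centre)
        error = math.hypot(p_rot[0] - expected[0], p_rot[1] - expected[1])
        if error < max_error_px:
            correct += 1
    return correct / len(matches)


# Three hypothetical matches for a 90 degree rotation about (50, 50).
example_matches = [
    ((60.0, 50.0), (50.0, 60.0)),  # exact correspondence
    ((50.0, 40.0), (62.0, 50.0)),  # 2 px off, still counted as correct
    ((70.0, 70.0), (30.0, 75.0)),  # 5 px off, counted as incorrect
]
ratio = correct_match_ratio(example_matches, 90.0, (50.0, 50.0))
print(round(ratio, 3))
```

Averaging this ratio over the 36 sampled images at each angle $\theta$ would give one point on a curve such as the one in Fig. 8.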

F Runtime Breakdown Extended

Unlike [35, 41, 3, 20, 30], VMatcher does not benefit from Flash Attention [5], owing to the limited number of Transformer layers [39] in its architecture. However, for fairness, Fig. 4 was extended by enabling Flash Attention during inference time measurement. As seen in Fig. 6, VMatcher maintains its superior inference speed even with Flash Attention [5] enabled.

G VMatcher Scans Visualization

Sec. 4 and Appendix A describe VMatcher’s architecture, which incorporates MambaVision and MambaVision_s layers. Fig. 7 visualises their distinct scanning patterns using a simple $3\times 3$ image example. The MambaVision layer processes the flattened image in row-major order, while MambaVision_s first transposes the image to enable column-major scanning and then restores the original orientation. For the bidirectional models introduced in Appendix C, MambaVision-Bi scans both the flattened image and its flipped copy, then returns the flipped version to its original orientation, as shown in Fig. 5. Similarly, MambaVision_s-Bi transposes the image, scans both the transposed image and its flipped version, then restores both to their original orientation.
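The scan orders described above can be reproduced on a small grid. The sketch below labels the pixels of a $3\times 3$ image by their row-major index and lists the order in which each variant would visit them; the function name and dictionary keys are illustrative, and no actual Mamba computation is performed.

```python
def scan_orders(height, width):
    """Return the pixel visiting order for each scan variant on a height x width grid."""
    # Label each pixel with its row-major index: pixel (r, c) gets r * width + c.
    image = [[r * width + c for c in range(width)] for r in range(height)]

    # MambaVision: flatten the image row by row.
    row_major = [value for row in image for value in row]
    # MambaVision_s: transpose first, so flattening walks down each column.
    col_major = [image[r][c] for c in range(width) for r in range(height)]

    return {
        "MambaVision": row_major,
        "MambaVision_s": col_major,
        # The bidirectional variants additionally scan a flipped copy.
        "MambaVision-Bi (flipped copy)": row_major[::-1],
        "MambaVision_s-Bi (flipped copy)": col_major[::-1],
    }


for name, order in scan_orders(3, 3).items():
    print(f"{name:32s} {order}")
```

On the $3\times 3$ example, the row-major scan visits pixels 0 to 8 in order, whereas the column-major scan visits 0, 3, 6, 1, 4, 7, 2, 5, 8, so vertically adjacent pixels become neighbours in the sequence.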

H Limitations

• The VMatcher model utilises the Mamba architecture, where a portion of the Mamba model is implemented in Triton [23], requiring a Graphics Processing Unit (GPU). While powerful hardware is common in state-of-the-art feature matching methods, this reliance on GPUs and modern hardware is believed to remain a general limitation.
• While VMatcher offers significant runtime improvements, the efficiency of the Mamba [12] architecture depends on the sequence length. For smaller sequences, its runtime is comparable to Transformer-based models, but for larger sequences, it demonstrates marked efficiency gains. As Computer Vision tasks increasingly involve higher-resolution images, this advantage is expected to become more prominent. Moreover, as a relatively new architecture, Mamba [12] is likely to benefit from future optimisations, following a trajectory similar to the steady improvements seen in Transformer-based models.

I Matching Examples

Fig. 9 provides visual examples on the MegaDepth and ScanNet datasets [18, 4]. Mismatches are not filtered out in the figures to present an unbiased demonstration.