Hyperbolic Cycle Alignment for Infrared-Visible Image Fusion

Timing Li
College of Intelligence and Computing
Tianjin University
litm@tju.edu.cn
Bing Cao
College of Intelligence and Computing
Tianjin University
caobing@tju.edu.cn
Jiahe Feng
College of Intelligence and Computing
Tianjin University
Haifang Cao
School of Computer Science and Technology
Chongqing University of Posts and Telecommunications
Qinghua Hu
College of Intelligence and Computing
Tianjin University
huqinghua@tju.edu.cn
Pengfei Zhu
College of Intelligence and Computing
Tianjin University
zhupengfei@tju.edu.cn
Abstract

Image fusion synthesizes complementary information from multiple sources, mitigating the inherent limitations of unimodal imaging systems. Accurate image registration is essential for effective multi-source data fusion. However, existing registration methods, often based on image translation in Euclidean space, fail to handle cross-modal misalignment effectively, resulting in suboptimal alignment and fusion quality. To overcome this limitation, we explore image alignment in non-Euclidean space and propose a Hyperbolic Cycle Alignment Network (Hy-CycleAlign). To the best of our knowledge, Hy-CycleAlign is the first image registration method based on hyperbolic space. It introduces a dual-path cross-modal cyclic registration framework, in which a forward registration network aligns cross-modal inputs, while a backward registration network reconstructs the original image, forming a closed-loop registration structure with geometric consistency. Additionally, we design a Hyperbolic Hierarchy Contrastive Alignment (H2CA) module, which maps images into hyperbolic space and imposes registration constraints, effectively reducing interference caused by modality discrepancies. We further analyze image registration in both Euclidean and hyperbolic spaces, demonstrating that hyperbolic space enables more sensitive and effective multi-modal image registration. Extensive experiments on misaligned multi-modal images demonstrate that our method significantly outperforms existing approaches in both image alignment and fusion. Our code will be publicly available.

1 Introduction

Multi-modal image fusion integrates complementary information from heterogeneous sensors and has become a key technology for enhancing visual perception and analytical capabilities across various domains. By combining the advantages of different modalities, such as the thermal radiation sensitivity of infrared imaging and the high-resolution texture details of visible imaging, fusion technology generates comprehensive scene representations, enabling applications ranging from all-weather surveillance to search and rescue in harsh environments.

The key to the success of such a fusion system lies in accurate multi-modal image alignment, a process that establishes pixel-level spatial correspondences between modalities. Even minor misalignments, such as a displacement of thermal signatures in infrared images relative to visible edges, can lead to ghosting artifacts or misinterpretations. For example, in security surveillance, an unregistered fusion of infrared and visible feeds might erroneously overlay a human heat signature onto a nearby object, compromising threat identification accuracy. Despite its importance, achieving reliable registration between infrared and visible images remains a challenge.

Multi-modal image misalignment has several primary causes. First, the positions of different sensors cannot be perfectly identical, resulting in displacement deviations in the captured images. Second, infrared imaging relies on thermal radiation while visible imaging relies on light reflection, and this difference in imaging mechanisms can lead to mismatched edge features. Third, complex factors such as viewpoint changes, motion blur, and rotational variations in real-world dynamic scenes further exacerbate the misalignment problem in multi-modal images. Unfortunately, mainstream fusion algorithms typically assume that the input images are pre-aligned, overlooking the inherent connection between registration and fusion. This idealized assumption limits their adaptability to partially aligned or noisy multi-modal data. Given the difference in imaging mechanisms, the misalignment between infrared and visible images is a nonlinear disparity, which is further exacerbated in dynamic scenes.

To address the difficulty of cross-modal pixel alignment of infrared and visible images in Euclidean space, we realize pixel-level multi-modal image alignment in hyperbolic space for the first time. To this end, we propose a Hyperbolic Space-based Cyclic Consistency Alignment Network, termed Hy-CycleAlign. The main contributions are summarized as follows:

  • A Hyperbolic Space-based Cyclic Consistency Alignment Network is proposed, which introduces a dual-path cross-modal cyclic registration framework. By coordinating registration and inverse registration, the framework establishes a closed-loop registration structure.

  • We introduce a Hyperbolic Hierarchy Contrastive Alignment (H2CA) module, which first maps the input images into Poincaré space to guide the alignment process within hyperbolic geometry. This design effectively mitigates the impact of nonlinear cross-modal discrepancies by leveraging the structural properties of hyperbolic space.

  • We provide a theoretical analysis demonstrating that hyperbolic space, particularly within the Poincaré model, exhibits greater sensitivity to distance variations. Extensive experiments on misaligned infrared-visible image fusion tasks validate the effectiveness of our method.

Figure 1: Comparison of different modalities in Euclidean and hyperbolic spaces: the background is shown in magenta, humans in cyan, vehicles in yellow, and object edges in black. In Euclidean space, edge maps tend to appear more scattered and lack hierarchical structure, whereas in hyperbolic space, edge features exhibit more pronounced clustering and hierarchy.

2 Related Works

This section briefly reviews representative deep learning-based methods for multi-modal image alignment and fusion, as well as relevant foundational studies on non-Euclidean space.

2.1 Infrared-visible image alignment and fusion

Fusion methods for aligned images fundamentally depend on the availability of strictly aligned input images. Any misalignment between the source images can significantly degrade the quality of the fused output. Broadly, these techniques can be categorized into three main types based on their underlying architectures: convolutional neural network (CNN)-based methods [1], hybrid CNN-Transformer-based methods [2], and Generative Adversarial Network (GAN)-based methods [3]. These methods improve the final fusion results through a series of carefully designed components, including feature extraction modules, reconstruction modules, and fusion modules.

Fusion methods for unaligned images are specifically designed to address the misalignment and image degradation issues that arise when existing approaches attempt to fuse unaligned image pairs. In recent years, several studies, such as ReCoNet [4], SuperFusion [5], MURF [6], UMF [7], and IMF [8], have addressed this challenge. However, existing registration-fusion methods rely on image translation modules, which not only introduce additional noise but also make image registration heavily dependent on the performance of the translation modules. Therefore, how to improve image registration performance without introducing additional noise remains a critical challenge.

2.2 Hyperbolic Deep Learning

Due to its negative curvature, hyperbolic space can more efficiently capture hierarchical and tree-like structures in data. Therefore, it has been widely adopted in fields such as graphs [9, 10, 11, 12, 13, 14], text [15, 16, 17, 18, 19], and vision [20, 21, 22, 23] tasks to address the limitations of Euclidean space in modeling hierarchical data. In this work, we build upon these foundations and take a step toward pixel-level multi-modal image registration by applying constraints in hyperbolic space to reduce nonlinear modality discrepancies.

Existing research has already demonstrated the substantial potential of hyperbolic space in effectively handling problems characterized by hierarchical structures. GhadimiAtigh et al. [23] verified that embedding pixels into hyperbolic space can accurately map the interiors and edges of an object, thereby achieving precise image segmentation. Khrulkov et al. [22] treated image degradation as a hierarchy, embedding it into hyperbolic space to achieve better re-identification performance. Fu et al. [24] introduced hyperbolic space to the object detection task and achieved weak alignment at the feature level. Li et al. [21] used hyperbolic distance metrics to represent the distance between features, enabling anomaly detection in hyperbolic space.

Although hyperbolic space has shown excellent performance in various vision tasks, most of these tasks focus on feature-level processing and unimodal vision tasks. To date, there has been no research on pixel-level image registration based on hyperbolic space. To fill this gap, we explore the image registration problem in hyperbolic space and propose a hyperbolic space-based pixel-level alignment method. This approach breaks through the nonlinearity limitations of traditional registration methods in Euclidean space and achieves promising results.

3 Method

This section presents our Hy-CycleAlign method, a cyclic consistency alignment model based on hyperbolic space, as shown in Fig. 2. We first analyze the advantages of constraining multi-modal image registration in hyperbolic space. Then, we introduce the cycle-consistent registration framework. Finally, we propose a hierarchical registration constraint in hyperbolic space, which enables pixel-level alignment and fusion of infrared and visible images.

Figure 2: Overview of the proposed method. (a) The Hy-CycleAlign alignment process, in which the Hyperbolic Hierarchy Contrastive Alignment (H2CA) module aligns the infrared image to the visible image, followed by re-aligning the result back to the original image. The H2CA module maps the images into the Poincaré space and imposes constraints there, thereby achieving effective infrared-visible image registration.

3.1 Motivation

Due to the modality differences between infrared and visible images [25], the mapping relationship between them is nonlinear, making it difficult to impose accurate alignment constraints within Euclidean space. Although image translation networks are widely used in modern image registration tasks, they often lead to structural distortions [26, 7], loss of semantic information [27], and accumulation of indirect errors [28, 29]. Considering that multi-modal image registration is a nonlinear problem, and inspired by works such as [16, 21, 23] that leverage hyperbolic space to handle nonlinearities, we explore the impact of hyperbolic space on multi-modal image registration.

Images contain implicit hierarchical structural information, and hyperbolic space can better capture this hierarchy and the complex relationships within it [30, 31, 23]. The Poincaré space has negative curvature, and its tree-like spatial expansion makes the mapping of 2-dimensional graphs into hyperbolic space more natural. We therefore choose the Poincaré space to study multi-modal image alignment in hyperbolic space [32].

Theorem 1.

Compared to Euclidean space, the hyperbolic space represented by the Poincaré space is more sensitive to misalignments, and this sensitivity increases as points approach the boundary of the Poincaré space.

Proof.

Let $u$ and $v$ be the points to be registered from images of different modalities. Their distance in Euclidean space, $d_E(u,v)$, can be expressed as

$$d_E(u,v)=\left\|u-v\right\|_2. \tag{1}$$

Consider the standard Poincaré space $\mathbb{D}^n=\{x\in\mathbb{R}^n:\left\|x\right\|<1\}$, where $x$ denotes a point in the Poincaré space. The distance $d_P(u,v)$ between points $u$ and $v$ in the Poincaré space is

$$d_P(u,v)=\cosh^{-1}\left(1+2\frac{\left\|u-v\right\|^2}{(1-\left\|u\right\|^2)(1-\left\|v\right\|^2)}\right). \tag{2}$$

Let $\delta=v-u$. In the alignment task, the goal is to make $v\to u$, so it follows that $\left\|v\right\|\approx\left\|u\right\|$.

We define $X$ as in Eq. 3,

$$X=2\frac{\left\|u-v\right\|^2}{(1-\left\|u\right\|^2)(1-\left\|v\right\|^2)}\approx 2\frac{\left\|\delta\right\|^2}{(1-\left\|u\right\|^2)^2}. \tag{3}$$

From Eq. 2 and Eq. 3, the following equation can be obtained,

$$d_P(u,v)=\cosh^{-1}(1+X). \tag{4}$$

Expanding Eq. 4 in a Taylor series around $X=0$ and keeping the leading term, $\cosh^{-1}(1+X)\approx\sqrt{2X}$, yields

$$d_P(u,v)\approx\frac{2}{1-\left\|u\right\|^2}\left\|u-v\right\|. \tag{5}$$

Thus, the ratio of the gradient magnitudes of $d_P$ and $d_E$ is

$$\frac{\left\|\nabla_u d_P\right\|}{\left\|\nabla_u d_E\right\|}=\frac{2}{1-\left\|u\right\|^2}>1. \tag{6}$$

Since $0\le\left\|u\right\|<1$, this ratio is at least 2 and grows without bound as $\left\|u\right\|\to 1$, that is, as the point approaches the boundary of the Poincaré space.

This property provides a novel and powerful perspective for tackling complex image alignment challenges that are difficult to address within traditional Euclidean space. By facilitating more flexible representations and transformations, it opens up new possibilities for handling large deformations, non-linear distortions, and modality differences more effectively.
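The sensitivity claimed in Theorem 1 can be checked numerically. The following is a minimal sketch, not part of the original method: it evaluates the Poincaré distance of Eq. 2 and the Euclidean distance of Eq. 1 for a small tangential offset at several radii, and compares their ratio with $2/(1-\|u\|^2)$ from Eq. 6. The function names, the 2-D test points, and the offset size `eps` are illustrative assumptions.

```python
import math


def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in x))


def euclidean_distance(u, v):
    """Eq. 1: d_E(u, v) = ||u - v||_2."""
    return norm([a - b for a, b in zip(u, v)])


def poincare_distance(u, v):
    """Eq. 2: distance in the unit Poincare ball (curvature parameter fixed to 1)."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1.0 - norm(u) ** 2) * (1.0 - norm(v) ** 2)
    return math.acosh(1.0 + 2.0 * sq / denom)


def sensitivity_ratio(radius, eps=1e-4):
    """d_P / d_E for a small misalignment eps at distance `radius` from the origin."""
    u = [radius, 0.0]
    v = [radius, eps]  # tangential offset, so ||v|| stays close to ||u||
    return poincare_distance(u, v) / euclidean_distance(u, v)


# Measured ratio versus the first-order prediction of Eq. 6.
ratios = {r: sensitivity_ratio(r) for r in (0.0, 0.5, 0.9, 0.99)}
predicted = {r: 2.0 / (1.0 - r * r) for r in ratios}

for r in ratios:
    print(f"||u|| = {r:4.2f}  measured = {ratios[r]:9.4f}  predicted = {predicted[r]:9.4f}")
```

Under these assumptions, the measured ratio is close to 2 at the origin and increases as the point moves toward the boundary of the unit ball, in line with Eq. 6.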

3.2 Hyperbolic Cycle Alignment Network (Hy-CycleAlign)

The overall pipeline of our Hy-CycleAlign is shown in Fig. 2 (a). Taking infrared-to-visible alignment as an example, the network consists of two alignment stages during training. In the first stage, the original infrared image $T$ is aligned to the visible image $V$, producing the aligned image $T_v$. In the second stage, the aligned image $T_v$ is inversely aligned back to the infrared image $T$, resulting in the reverse-aligned image $T_{vt}$. It is important to note that during the forward alignment process, the aligned result $T_v$ is also fused with $V$ to generate the final fusion image $F$.

Inspired by [33], we introduce an adversarial discriminator to assist the model in achieving better image registration performance. Specifically, we extract the gradient edges from both the original infrared image $T$ and the aligned image $T_v$ as input to the discriminator, in order to distinguish the edges of the image $T$ from those of the aligned image $T_v$. Similarly, another discriminator is used to differentiate the edges of the reverse-aligned image $T_{vt}$ from those of the original image $T$.

This design allows us to impose a consistency constraint between the reverse-aligned output and the input image. Consequently, the model eliminates the need for additional high-precision, manually aligned image pairs, significantly reducing reliance on costly and time-consuming data annotation.
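The closed loop described in this subsection can be illustrated with a toy example. The sketch below is only an assumption-laden stand-in: the learned forward and backward registration networks are replaced by a hypothetical integer circular translation `shift`, and the consistency constraint is measured with an L1 difference between the reverse-aligned image and the input. It shows that an exact inverse alignment yields zero cycle error without any manually aligned ground truth, while an imperfect inverse is penalized.

```python
def shift(img, dy, dx):
    """Circularly translate a 2-D image (list of rows) by (dy, dx) pixels.

    Hypothetical stand-in for a registration network's warping step.
    """
    h, w = len(img), len(img[0])
    return [[img[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def l1(a, b):
    """Sum of absolute pixel differences between two images of equal size."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Toy "infrared" image T.
T = [[0, 0, 0, 0],
     [0, 9, 5, 0],
     [0, 3, 1, 0],
     [0, 0, 0, 0]]

T_v = shift(T, 1, 1)             # forward alignment: T -> T_v
T_vt_good = shift(T_v, -1, -1)   # backward alignment with the exact inverse
T_vt_bad = shift(T_v, -1, 0)     # backward alignment with an imperfect inverse

cycle_good = l1(T_vt_good, T)    # consistency error of the exact loop
cycle_bad = l1(T_vt_bad, T)      # consistency error of the imperfect loop
print(cycle_good, cycle_bad)
```

In the actual model the warps are dense deformation fields predicted by networks, so this toy only conveys why comparing the reverse-aligned output with the input provides a supervision signal.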

3.3 Hyperbolic Hierarchy Contrastive Alignment (H2CA)

We design the H2CA module to map images from Euclidean space to Poincaré space and apply constraints within the hyperbolic space. According to Eq. 7, a vector $i\in\mathbb{R}^d$ in Euclidean space can be projected into Poincaré space using the function $Proj(\cdot)$, where $c$ denotes the curvature.

$$Proj(i)=\frac{i}{\sqrt{c}\cdot\left\|i\right\|}\tanh(\sqrt{c}\cdot\left\|i\right\|). \tag{7}$$

For $m,n\in\mathbb{D}^n_c$ in the Poincaré space, Möbius addition is used in place of Euclidean addition, as shown in Eq. 8.

$$m\oplus n=\frac{(1+2c\left\langle m,n\right\rangle+c\left\|n\right\|^2)m+(1-c\left\|m\right\|^2)n}{1+2c\left\langle m,n\right\rangle+c^2\left\|m\right\|^2\left\|n\right\|^2}. \tag{8}$$

Thus, the distance between $m$ and $n$ in the Poincaré space can be calculated using Eq. 9.

$$d_P(m,n)=\frac{2}{\sqrt{c}}\tanh^{-1}(\sqrt{c}\left\|-m\oplus n\right\|). \tag{9}$$

According to Eq. 9, registration constraints can be implemented in the Poincaré space. It is important to note that the H2CA imposes constraints on both pixels and edges separately, thus enabling the alignment of different hierarchies.
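The three operations of this subsection can be sketched in a few lines. The following is a minimal illustration rather than the authors' implementation: vectors are plain Python lists, the function names are hypothetical, and the curvature defaults to $c=1$. The distance uses the inverse hyperbolic tangent of $\sqrt{c}\,\|-m\oplus n\|$, the form under which it coincides with the closed-form distance of Eq. 2 when $c=1$; the last lines check this agreement numerically.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def proj(i, c=1.0):
    """Eq. 7: map a Euclidean vector into the Poincare ball of curvature c."""
    n = norm(i)
    if n == 0.0:
        return list(i)  # the origin maps to itself
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in i]


def mobius_add(m, n, c=1.0):
    """Eq. 8: Mobius addition of two points of the Poincare ball."""
    mn, m2, n2 = dot(m, n), dot(m, m), dot(n, n)
    a = 1.0 + 2.0 * c * mn + c * n2
    b = 1.0 - c * m2
    denom = 1.0 + 2.0 * c * mn + c * c * m2 * n2
    return [(a * x + b * y) / denom for x, y in zip(m, n)]


def poincare_dist(m, n, c=1.0):
    """Distance via Mobius addition, using the inverse hyperbolic tangent."""
    neg_m = [-x for x in m]
    r = math.sqrt(c) * norm(mobius_add(neg_m, n, c))
    return 2.0 / math.sqrt(c) * math.atanh(r)


def poincare_dist_closed_form(u, v):
    """Eq. 2: closed-form distance in the unit ball (c = 1)."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq / ((1.0 - dot(u, u)) * (1.0 - dot(v, v))))


p = proj([0.3, -0.4])
q = proj([0.5, 0.2])
d_mobius = poincare_dist(p, q)
d_closed = poincare_dist_closed_form(p, q)
d_origin = poincare_dist([0.0, 0.0], p)  # distance of a projected point from the origin
print(d_mobius, d_closed, d_origin)
```

How pixels and edge maps are arranged into vectors before projection is not specified here, so the sketch operates on generic vectors only.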

3.4 Loss Function

The Hy-CycleAlign model is trained with four alignment loss components, namely the adversarial loss $\mathcal{L}_{adv}$, the cycle consistency loss $\mathcal{L}_{cc}$, the hyperbolic hierarchical contrastive alignment loss $\mathcal{L}_{h2c}$, and the smoothness loss $\mathcal{L}_{sm}$, together with a fusion loss $\mathcal{L}_f$.

Adversarial loss $\mathcal{L}_{adv}$. We apply the adversarial loss to both registration networks. For the first registration $R_{t2v}(T,V):T\to V$ and its discriminator $D_v$, it can be formulated as Eq. 10. Note that $\mathcal{L}_{adv}$ constrains the edges of the image $V$ and the registered image $R_{t2v}(T,V)$, where $\nabla$ denotes the Sobel operator. Similarly, we use $\mathcal{L}_{adv}(V,T,R_{v2t},D_t)$ to constrain the alignment model $R_{v2t}(V,T):V\to T$ and the discriminator $D_t$ in the inverse alignment.

$$\mathcal{L}_{adv}=\mathbb{E}_v[\log D_v(\nabla V)]+\mathbb{E}_t[\log(1-D_v(\nabla R_{t2v}(T,V)))]. \tag{10}$$
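The two ingredients of Eq. 10, Sobel edge extraction and the log-likelihood objective, can be sketched as follows. This is a hedged illustration only: the discriminator is not modeled, so its outputs are passed in as hypothetical probability lists `d_real` (for $D_v(\nabla V)$) and `d_fake` (for $D_v(\nabla R_{t2v}(T,V))$), the expectations are replaced by batch means, and the Sobel response is computed on interior pixels without padding, which is an assumption.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Sobel gradient magnitude on interior pixels; output is (H-2) x (W-2)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def adversarial_loss(d_real, d_fake):
    """Eq. 10 with expectations replaced by means over the given samples."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A vertical step edge produces a constant Sobel response along the edge.
step = [[0, 0, 1, 1] for _ in range(4)]
edges = sobel_magnitude(step)

undecided = adversarial_loss([0.5], [0.5])   # discriminator cannot tell the edges apart
confident = adversarial_loss([0.9], [0.1])   # discriminator separates them well
print(edges, undecided, confident)
```

In the usual adversarial convention, which is assumed here, the discriminator is trained to increase this value while the registration network is trained to decrease it, pushing the edges of the registered image toward those of $V$.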

Cycle consistency loss $\mathcal{L}_{cc}$. The input images $V$ and $T$ are aligned in two steps, $R_{t2v}$ and $R_{v2t}$, to obtain $V_{tv}$ and $T_{vt}$, respectively. Thus, we constrain the reconstructed images using the cycle consistency loss shown in Eq. 11.

$$\mathcal{L}_{cc}=\left\|T_{vt}-T\right\|_1+\left\|V_{tv}-V\right\|_1. \tag{11}$$

Hyperbolic hierarchical contrastive loss $\mathcal{L}_{h2c}$. We apply pixel-level contrastive constraints between $H(T_v)$ and $H(V)$, where $H(\cdot)$ denotes the hyperbolic mapping through the H2CA module. Then, we impose structural-level contrastive constraints between the image edges $\nabla V$ and $\nabla T_v$, thereby achieving hierarchical constraints, as shown in Eq. 12. Similarly, we impose the same constraints on $T$ and $V_t$.

$$\mathcal{L}_{h2c}=-(\log\sigma(-d_P(T_v,V))+\log\sigma(-d_P(\nabla T_v,\nabla V))). \tag{12}$$
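The behavior of the hyperbolic hierarchical contrastive loss can be seen from a scalar sketch. The code below assumes that the pixel-level and edge-level Poincaré distances have already been reduced to two non-negative scalars, `d_pixel` and `d_edge` (how the per-pixel distances are aggregated is not specified here), and evaluates the negative log-sigmoid form with a numerically stable helper. Function names are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigma(x)) for any real x."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def h2c_loss(d_pixel, d_edge):
    """Negative log-sigmoid of the negated pixel-level and edge-level distances."""
    return -(log_sigmoid(-d_pixel) + log_sigmoid(-d_edge))


aligned = h2c_loss(0.0, 0.0)       # both hierarchies perfectly aligned
edge_off = h2c_loss(0.0, 1.0)      # only the edge hierarchy is misaligned
far_apart = h2c_loss(50.0, 50.0)   # large distances in both hierarchies
print(aligned, edge_off, far_apart)
```

Under these assumptions the loss is smallest, $2\log 2$, when both distances vanish, and it grows approximately linearly with the distances once they are large, so either hierarchy being misaligned increases the penalty.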

Smoothness loss $\mathcal{L}_{sm}$. To ensure the smoothness of the aligned image, we impose a constraint on the spatial gradient of the deformation field, $\nabla f(\phi)$, as in Eq. 13.

$$\mathcal{L}_{sm}=\sum_{\phi\in\Phi}\left\|\nabla f(\phi)\right\|^2. \tag{13}$$
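A discrete version of the smoothness penalty can be sketched as below. The representation is an assumption made for illustration: the deformation field is a grid `flow[y][x] = (dy, dx)` of per-pixel displacements, and the spatial gradient is approximated with forward differences along both image axes.

```python
def smoothness_loss(flow):
    """Sum of squared forward differences of a 2-D displacement field.

    flow[y][x] is a (dy, dx) pair; both components are penalized along both axes.
    """
    h, w = len(flow), len(flow[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            for k in range(2):  # k = 0: dy component, k = 1: dx component
                if x + 1 < w:
                    total += (flow[y][x + 1][k] - flow[y][x][k]) ** 2
                if y + 1 < h:
                    total += (flow[y + 1][x][k] - flow[y][x][k]) ** 2
    return total


# A constant translation is perfectly smooth.
constant = [[(1.0, -2.0) for _ in range(3)] for _ in range(2)]
# A field whose horizontal displacement grows by one pixel per column.
stretch = [[(0.0, float(x)) for x in range(3)] for _ in range(2)]

loss_constant = smoothness_loss(constant)
loss_stretch = smoothness_loss(stretch)
print(loss_constant, loss_stretch)
```

A rigid shift therefore incurs no penalty, while locally varying displacements are discouraged in proportion to how abruptly they change.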

Fusion loss $\mathcal{L}_f$. Inspired by [34, 35], the fusion loss is defined as shown in Eq. 14, where $F$ denotes the fused image of size $H\times W$.

$$\mathcal{L}_f=\frac{1}{HW}\left\|F-\max(T_v,V)\right\|_1+\frac{1}{HW}\left\|\left|\nabla F\right|-\max(\left|\nabla T_v\right|,\left|\nabla V\right|)\right\|_1. \tag{14}$$
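The intensity and gradient terms of Eq. 14 can be sketched on small images stored as nested lists. Several choices below are assumptions rather than details given here: $|\nabla\cdot|$ is taken as the Sobel gradient magnitude, borders are handled by replicating edge pixels so that the gradient map keeps the size $H\times W$, and the function names are hypothetical.

```python
import math


def sobel_abs(img):
    """Sobel gradient magnitude with replicated borders; same size as the input."""
    h, w = len(img), len(img[0])

    def px(y, x):
        # Clamp coordinates to the image, i.e. replicate edge pixels.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = (px(y - 1, x + 1) + 2 * px(y, x + 1) + px(y + 1, x + 1)
                  - px(y - 1, x - 1) - 2 * px(y, x - 1) - px(y + 1, x - 1))
            gy = (px(y + 1, x - 1) + 2 * px(y + 1, x) + px(y + 1, x + 1)
                  - px(y - 1, x - 1) - 2 * px(y - 1, x) - px(y - 1, x + 1))
            out[y][x] = math.hypot(gx, gy)
    return out


def fusion_loss(F, Tv, V):
    """Eq. 14: L1 intensity term plus L1 gradient term, each divided by H*W."""
    h, w = len(F), len(F[0])
    gF, gT, gV = sobel_abs(F), sobel_abs(Tv), sobel_abs(V)
    intensity = sum(abs(F[y][x] - max(Tv[y][x], V[y][x]))
                    for y in range(h) for x in range(w))
    gradient = sum(abs(gF[y][x] - max(gT[y][x], gV[y][x]))
                   for y in range(h) for x in range(w))
    return (intensity + gradient) / (h * w)


Tv = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # toy aligned infrared image
V = [[0.0] * 3 for _ in range(3)]                          # toy (empty) visible image

loss_keep = fusion_loss(Tv, Tv, V)  # fused image keeps the salient infrared target
loss_drop = fusion_loss(V, Tv, V)   # fused image loses the target
print(loss_keep, loss_drop)
```

In this toy case, a fused image that preserves the brighter pixels and stronger edges of the two inputs attains zero loss, whereas dropping the target is penalized by both terms.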

Total loss $\mathcal{L}$. Our total loss is:

$$\mathcal{L}=\mathcal{L}_{adv}+\mathcal{L}_{cc}+\mathcal{L}_{h2c}+\mathcal{L}_{sm}+\mathcal{L}_f. \tag{15}$$

4 Experiments

Table 1: Quantitative comparisons on DroneVehicle, LLVIP, and MFNet. Note that a random nonlinear transformation is applied to the infrared images in LLVIP and to the visible images in MFNet. Boldface and underline show the best and second-best values, respectively.
Alignment-free fusion methods: DIDFuse, CDDFuse, EMMA. Alignment-based fusion methods: SuperFusion, ReCoNet, MURF, UMF-CMGR, IMF, Hy-CycleAlign.

| Dataset | Metric | DIDFuse | CDDFuse | EMMA | SuperFusion | ReCoNet | MURF | UMF-CMGR | IMF | Hy-CycleAlign |
|---|---|---|---|---|---|---|---|---|---|---|
| DroneVehicle | HD↓ | \underline{74.20} | 76.41 | \textbf{71.62} | 75.15 | 73.91 | 92.18 | \underline{65.27} | \textbf{65.19} | 70.36 |
| DroneVehicle | HD95↓ | 30.56 | \underline{29.11} | \textbf{27.23} | \underline{30.12} | 30.53 | 44.95 | 31.73 | 30.97 | \textbf{25.76} |
| DroneVehicle | ASSD↓ | 8.17 | \underline{7.30} | \textbf{6.61} | \underline{7.55} | 7.71 | 12.06 | 8.93 | 8.62 | \textbf{6.38} |
| DroneVehicle | DSC↑ | \textbf{0.80} | 0.63 | \underline{0.72} | \underline{0.67} | \textbf{0.75} | 0.66 | 0.57 | 0.59 | \textbf{0.75} |
| DroneVehicle | MEE↓ | 39.37 | \textbf{31.59} | \underline{32.12} | \underline{28.55} | 32.84 | 36.85 | 33.56 | 34.71 | \textbf{28.28} |
| DroneVehicle | SF↑ | \underline{19.97} | \textbf{21.80} | 19.19 | \underline{17.50} | 13.60 | 5.72 | 10.96 | 8.44 | \textbf{21.45} |
| DroneVehicle | EN↑ | 6.95 | \textbf{7.35} | \underline{7.30} | \underline{7.15} | 6.95 | 6.83 | 6.91 | 7.07 | \textbf{7.18} |
| LLVIP | HD↓ | 203.33 | \textbf{159.21} | \underline{200.11} | 213.07 | 226.59 | \underline{199.21} | 234.82 | 211.14 | \textbf{163.49} |
| LLVIP | HD95↓ | 104.27 | \textbf{71.31} | \underline{71.57} | 120.91 | 117.57 | \textbf{88.25} | 140.75 | 119.41 | \underline{113.50} |
| LLVIP | ASSD↓ | 26.61 | \underline{17.19} | \textbf{17.16} | 29.97 | 29.30 | \underline{20.64} | 39.17 | 36.29 | \textbf{15.50} |
| LLVIP | DSC↑ | \textbf{0.80} | \underline{0.72} | 0.71 | 0.59 | 0.48 | \underline{0.79} | 0.49 | 0.58 | \textbf{0.85} |
| LLVIP | MEE↓ | 33.78 | \textbf{18.19} | \underline{18.47} | \textbf{18.08} | 33.42 | 19.05 | 23.11 | 22.01 | \underline{18.30} |
| LLVIP | SF↑ | 11.37 | \textbf{18.66} | \underline{14.92} | \underline{14.10} | 11.43 | \textbf{19.69} | 4.64 | 4.82 | 11.27 |
| LLVIP | EN↑ | 6.02 | \textbf{7.44} | \underline{7.36} | \textbf{7.34} | 5.85 | 6.95 | 6.95 | 7.08 | \underline{7.19} |
| MFNet | HD↓ | 105.82 | \underline{83.11} | \textbf{72.14} | 108.63 | 110.36 | \underline{76.77} | 127.15 | 87.87 | \textbf{67.38} |
| MFNet | HD95↓ | 58.32 | \textbf{28.72} | \underline{29.49} | 63.83 | 69.72 | \underline{30.47} | 100.17 | 68.03 | \textbf{22.43} |
| MFNet | ASSD↓ | 14.00 | \underline{6.75} | \textbf{6.44} | 14.49 | 17.71 | \underline{6.77} | 39.25 | 25.70 | \textbf{4.43} |
| MFNet | DSC↑ | 0.84 | \underline{0.90} | \textbf{0.94} | 0.84 | 0.45 | \underline{0.91} | 0.47 | 0.46 | \textbf{0.93} |
| MFNet | MEE↓ | 35.06 | \textbf{7.83} | \underline{10.08} | 12.53 | 23.40 | \textbf{7.04} | 16.74 | 10.79 | \underline{7.39} |
| MFNet | SF↑ | 8.50 | \textbf{12.10} | \underline{10.84} | 7.90 | 9.51 | \textbf{10.47} | 5.25 | 3.77 | \underline{9.90} |
| MFNet | EN↑ | 5.42 | \textbf{6.58} | \underline{6.57} | 6.21 | 5.39 | \underline{6.38} | 6.01 | 4.15 | \textbf{6.50} |

4.1 Setup

Datasets. We use the popular DroneVehicle [36], LLVIP [37] and MFNet [38] benchmarks to evaluate the performance of our model. To validate how different approaches handle more complex misaligned multi-modal image fusion scenarios, we conduct experiments on the DroneVehicle dataset, which is captured from drone views. Because the LLVIP and MFNet datasets have been manually aligned, misaligned images must be constructed for both training and testing. To generate misaligned data, we apply random nonlinear transformations separately to the infrared images in the LLVIP dataset and the visible images in the MFNet dataset.

Metrics. We use seven metrics to quantitatively measure the alignment and fusion results of the model: Hausdorff distance (HD), 95% Hausdorff distance (HD95), average symmetric surface distance (ASSD), Dice similarity coefficient (DSC), median square error (MEE), spatial frequency (SF), and entropy (EN). These metrics provide a comprehensive evaluation of aligned and fused image quality, detail retention, information integrity, and visual perception performance.
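To make the boundary-based registration metrics concrete, the following is a minimal sketch of HD, HD95, ASSD and DSC on small point sets. It is an illustration under stated assumptions, not the evaluation code used for Table 1: the helper names, the toy coordinates, and the percentile convention for HD95 (the larger of the two directed 95th percentiles, with linear interpolation) are all hypothetical choices, and other conventions for HD95 exist.

```python
import math

def nearest_distances(a, b):
    """For each point in a, the Euclidean distance to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]

def percentile(values, q):
    """q-th percentile (0 to 100) with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worst nearest-neighbour distance in either direction."""
    return max(max(nearest_distances(a, b)), max(nearest_distances(b, a)))

def hd95(a, b):
    """Assumed convention: the larger of the two directed 95th-percentile distances."""
    return max(percentile(nearest_distances(a, b), 95),
               percentile(nearest_distances(b, a), 95))

def assd(a, b):
    """Average symmetric surface distance over both directed distance lists."""
    d = nearest_distances(a, b) + nearest_distances(b, a)
    return sum(d) / len(d)

def dice(a, b):
    """Dice similarity coefficient of two sets of pixel coordinates."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical edge pixels: the prediction misplaces one of four points.
gt = [(0, 0), (0, 1), (0, 2), (0, 3)]
pred = [(0, 0), (0, 1), (0, 2), (2, 3)]
print(hausdorff(gt, pred), hd95(gt, pred), assd(gt, pred), dice(gt, pred))
```

In this toy case a single outlier sets HD, while HD95 and ASSD react less strongly to it, which is one reason the three distances are usually reported together.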

Implementation details. Hy-CycleAlign is trained for 120 epochs. All network parameters are updated with the AdamW optimizer [39] with the initial learning rate set to $10^{-4}$. The effective edge threshold $c$ is $0.01$.

4.2 Comparing with SOTA

In this section, we test Hy-CycleAlign on the three test sets and compare the alignment and fusion results with state-of-the-art methods, including DIDFuse [3], CDDFuse [35], EMMA [40], SuperFusion [5], ReCoNet [4], MURF [6], UMF-CMGR [41], and IMF [8]. SuperFusion, ReCoNet, MURF, UMF-CMGR, and IMF are currently the mainstream alignment-based fusion methods.

Qualitative comparison of alignment effects. We use the fusion results obtained by training and testing EMMA on data without deformation as the ground truth for registration and fusion methods. Edge features are extracted using the Sobel operator and compared with those from the fusion results of existing registration-based methods. As shown by the intensity differences in Fig. 3, our method maintains good alignment when facing targets with significant edge differences.
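The edge comparison described above can be sketched as follows. This is an illustrative toy rather than the visualization code behind Fig. 3: it assumes images are given as plain 2-D lists of intensities, applies the standard 3x3 Sobel kernels, leaves border pixels at zero, and takes the absolute difference of the two edge-magnitude maps. The function names and the step-edge test images are hypothetical.

```python
import math

def sobel_magnitude(img):
    """Sobel gradient magnitude of a 2-D list-of-lists image; border pixels stay 0."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(kx[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(ky[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out

def edge_difference(a, b):
    """Per-pixel absolute difference between the Sobel edge maps of two images."""
    ea, eb = sobel_magnitude(a), sobel_magnitude(b)
    return [[abs(p - q) for p, q in zip(ra, rb)] for ra, rb in zip(ea, eb)]

# Hypothetical 5x5 images with a vertical step edge; the second is shifted by one pixel.
reference = [[0, 0, 1, 1, 1] for _ in range(5)]
shifted = [[0, 0, 0, 1, 1] for _ in range(5)]
diff = edge_difference(reference, shifted)
print(diff[2])
```

A perfectly aligned result gives an all-zero difference map, while a one-pixel shift leaves a residual on both sides of the edge.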

Refer to caption
Figure 3: The aligned MFNet dataset is used as ground truth (GT) to compare the edge intensity differences after applying the alignment method.
Refer to caption
Figure 4: Comparison of results in a DroneVehicle dataset based on drone views.
Refer to caption
Figure 5: Comparison of results for the LLVIP dataset with IR nonlinear transformations.
Refer to caption
Figure 6: Comparison of results for the MFNet dataset with visible nonlinear transformations.

Qualitative comparison of fusion effects. Figs. 4, 5 and 6 show the fusion results under different misalignment conditions. Our method achieves robust alignment and fusion performance across various types of misalignment and diverse scenes. Compared with existing methods, Hy-CycleAlign achieves better registration performance, and no incorrect registration results are observed.

Quantitative comparison. We conducted a quantitative comparison using five commonly used registration metrics and two fusion metrics, as shown in Table 1. Our method achieved the best registration and fusion performance on the real-world misaligned dataset DroneVehicle. Similarly, it performed excellently on the nonlinearly misaligned MFNet dataset. Although the performance advantage on the LLVIP dataset is less pronounced than on the other two datasets, our method remains competitive overall.

4.3 Ablation studies

Ablation experiments are conducted to verify the rationality of the different modules, using HD, HD95, ASSD, DSC and EN as metrics. The results of the experimental groups are shown in Fig. 7 and Tab. 2.

Euclidean space alignment baseline (Eu). In Exp. I, we retained the cyclic adversarial registration network structure and used the Sobel operator to extract edge features from the registered images, applying constraints in Euclidean space. The experimental results indicate that it is difficult to achieve multi-modal image registration in Euclidean space.

Cycle consistent alignment structure (CA). In Exp. II, we removed the reverse alignment process to verify the role of the cyclic alignment structure in the registration task. The results show that the alignment performance improved across all evaluation metrics.

Pixel alignment ($H^{2}CA\text{-}p$) and edge alignment ($H^{2}CA\text{-}e$) in hyperbolic space. In Exp. III and Exp. IV, we evaluate pixel-level alignment and edge alignment in hyperbolic space, respectively. Both approaches independently improve registration performance, but when combined, they complement each other and further enhance the overall alignment effectiveness.

Refer to caption
Figure 7: The analysis of the ablation experiment was conducted using the MFNet dataset.
Table 2: Ablation experiment results on the test set of MFNet. CA denotes whether a cycle alignment structure is used, Eu refers to alignment in Euclidean space, $H^{2}CA\text{-}p$ denotes pixel alignment in hyperbolic space, and $H^{2}CA\text{-}e$ denotes edge alignment in hyperbolic space. Bold indicates the best values.
Methods $Eu$ $CA$ $H^{2}CA\text{-}p$ $H^{2}CA\text{-}e$ HD ↓ HD95 ↓ ASSD ↓ DSC ↑ EN ↑
Exp. I 153.51 86.68 22.99 0.72 6.43
Exp. II 96.13 38.44 8.91 0.92 6.43
Exp. III 95.76 36.15 8.47 0.93 6.42
Exp. IV 93.91 34.71 7.88 0.92 6.44
Hy-CycleAlign $\mathbf{67.38}$ $\mathbf{22.43}$ $\mathbf{4.43}$ $\mathbf{0.93}$ $\mathbf{6.50}$
Refer to caption
Figure 8: Comparison of fusion and detection results on misaligned data from the DroneVehicle.

4.4 Downstream application of alignment and fusion

To further show that the alignment task can effectively enhance fusion and its downstream tasks, we apply the methods compared in Section 4.2 to an object detection task. The experimental results are shown in Fig. 8. Due to space limitations, more experimental results and analyses are provided in the supplementary material.

4.5 Additional Analysis

We compared the computational complexity (FLOPs) and the number of parameters between our model and existing registration models. The results are shown in Tab. 3. Although the transformation from Euclidean space to hyperbolic space introduces more parameters, it does not lead to a significant increase in computational complexity.

Table 3: Comparison of network parameter quantities.
SuperFusion ReCoNet MURF UMF IMF Ours
FLOPs (G) 16.36 15.33 100.10 131.38 123.30 16.66
Parameters (M) 1.96 0.21 4.08 14.44 15.70 18.26

4.6 Limitation

Although our proposed Hy-CycleAlign is the first method to perform multi-modal image registration in hyperbolic space and demonstrates promising results in infrared-visible alignment tasks, several limitations still remain. Future work should focus on improving the computational efficiency of operations within Poincaré space and enhancing the model’s adaptability to images with large modality discrepancies, particularly in real-time or large-scale scenarios. Given the challenges posed by the current Poincaré model in handling complex alignment cases, exploring more advanced or hybrid hyperbolic geometries, combined with adaptive embedding mechanisms and constraint strategies, represents a crucial direction for improving the generality and robustness of hyperbolic registration frameworks.

5 Conclusions

This paper proposes a hyperbolic cyclic alignment and fusion model. By leveraging forward and backward alignment constraints in a cyclic manner, the model effectively performs multi-modal image registration. Hy-CycleAlign is the first registration framework based on hyperbolic space. It maps images from Euclidean space to hyperbolic space and imposes multi-level alignment constraints, which alleviates modality discrepancies. Finally, we provide corresponding theoretical proofs. Experimental results demonstrate that Hy-CycleAlign achieves promising performance in infrared and visible image alignment and fusion.

Appendix A More Explanations of the Motivation

Infrared and visible images differ in modality because of their different imaging principles, and these differences are nonlinear [42]. Although Euclidean space handles linear discrepancies well, its inherently linear geometry limits its ability to represent complex nonlinear relationships [43]. It therefore struggles to capture nonlinear modality gaps accurately, making it inadequate for addressing cross-modal discrepancies in such settings. Owing to its negative curvature, hyperbolic space is naturally good at modeling nonlinear relationships [18, 44]. We therefore explore multi-modal image registration within hyperbolic space. By leveraging its powerful capacity to represent complex structures, hyperbolic space enables more accurate capture of the nonlinear differences between modalities.

Theorem 2.

Compared to Euclidean space, the hyperbolic space represented by the Poincaré space is more sensitive to misalignments, and this sensitivity increases as points approach the boundary of the Poincaré space.

Proof.

Assuming $u$ and $v$ are the points to be registered from different modality images, their distance in Euclidean space $d_{E}(u,v)$ can be expressed as

$$d_{E}(u,v)=\left\|u-v\right\|_{2}. \tag{16}$$

Consider the standard Poincaré space $\mathbb{D}^{n}=\{x\in\mathbb{R}^{n}:\left\|x\right\|<1\}$, where $x$ denotes a point in the Poincaré space. Then, the distance $d_{P}(u,v)$ between points $u$ and $v$ in the Poincaré space is

$$d_{P}(u,v)=\cosh^{-1}\left(1+2\frac{\left\|u-v\right\|^{2}}{(1-\left\|u\right\|^{2})(1-\left\|v\right\|^{2})}\right). \tag{17}$$

Let $\delta=v-u$. In the alignment task, the goal is to make $v\to u$, so that $\left\|v\right\|\approx\left\|u\right\|$.

It can be further shown that

$$1-\left\|u\right\|^{2}\approx 1-\left\|v\right\|^{2}. \tag{18}$$

We define $X$ in Eq. 19,

$$X=2\frac{\left\|u-v\right\|^{2}}{(1-\left\|u\right\|^{2})(1-\left\|v\right\|^{2})}\approx 2\frac{\left\|\delta\right\|^{2}}{(1-\left\|u\right\|^{2})^{2}}. \tag{19}$$

Substituting Eq. 19 into Eq. 17, the distance computation in the Poincaré space becomes:

$$d_{P}(u,v)=\cosh^{-1}(1+X). \tag{20}$$

According to the Taylor series expansion around $X=0$, we obtain:

$$\cosh^{-1}(1+X)=\sqrt{2X}+O(X^{3/2}). \tag{21}$$

Neglecting the higher-order term of the above equation, we have

$$\cosh^{-1}(1+X)\approx\sqrt{2X}. \tag{22}$$

Applying this first-order approximation to Eq. 20, with $X$ given by Eq. 19, yields

$$d_{P}(u,v)\approx\frac{2}{1-\left\|u\right\|^{2}}\left\|u-v\right\|. \tag{23}$$

Thus, the ratio of the gradient magnitudes of $d_{P}$ and $d_{E}$ is:

$$\frac{\left\|\nabla_{u}d_{P}\right\|}{\left\|\nabla_{u}d_{E}\right\|}\approx\frac{2}{1-\left\|u\right\|^{2}}>1. \tag{24}$$

Moreover, as $\left\|u\right\|$ approaches 1, it follows that $1-\left\|u\right\|^{2}\to 0$, and in this case,

$$\left\|\nabla_{u}d_{P}\right\|\to\infty. \tag{25}$$

Through this process, we demonstrate that image registration in hyperbolic space is more sensitive than in Euclidean space. This indicates that even slight misalignments lead to more significant changes in hyperbolic space, suggesting that applying registration constraints in hyperbolic space theoretically yields better results. Moreover, the closer the mapped pixels are to the boundary of the Poincaré space, the more sensitive they become to minor misalignments.
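The first-order relation in Eq. 23 can be checked numerically. The sketch below is a sanity check under stated assumptions, not part of the original proof: it works in two dimensions on the unit Poincaré space, places $u$ at radius $r$, offsets $v$ from it by a hypothetical small step `delta` perpendicular to the radius so that $\left\|v\right\|\approx\left\|u\right\|$, and compares the ratio $d_{P}/d_{E}$ against $2/(1-\left\|u\right\|^{2})$.

```python
import math

def sq_norm(x):
    """Squared Euclidean norm of a vector."""
    return sum(t * t for t in x)

def poincare_distance(u, v):
    """Geodesic distance on the unit Poincaré space (Eq. 17)."""
    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1 + 2 * diff / ((1 - sq_norm(u)) * (1 - sq_norm(v))))

delta = 1e-4  # hypothetical small misalignment
radii = (0.0, 0.5, 0.9, 0.99)
ratios = []
for r in radii:
    u, v = (r, 0.0), (r, delta)
    d_e = math.dist(u, v)
    d_p = poincare_distance(u, v)
    ratios.append(d_p / d_e)
    print(f"|u|={r:.2f}  d_P/d_E={d_p / d_e:.3f}  2/(1-|u|^2)={2 / (1 - r * r):.3f}")
```

For the same Euclidean offset, the measured ratio follows $2/(1-\left\|u\right\|^{2})$ closely and grows as $u$ moves toward the boundary, which is the behavior stated in Theorem 2.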

Appendix B More details of Hy-CycleAlign

We provide implementation details of the training phase of Hy-CycleAlign to clearly describe the training process, as shown in Algorithm 1.

Algorithm 1 Pseudocode for Hy-CycleAlign training phase.

Input: unaligned multi-modal images V and T

Output: fusion images F

for V, T in Dataloader do
  $\phi_{t2v}=R_{t2v}(T,V)$ // Generate the deformation field for infrared-to-visible alignment
  $\phi_{v2t}=R_{v2t}(T,V)$ // Generate the deformation field for visible-to-infrared alignment
  $T_{v}=T\circ\phi_{t2v},\ V_{t}=V\circ\phi_{v2t}$ // Generate aligned images
  $T_{vt}=T_{v}\circ\phi_{v2t},\ V_{tv}=V_{t}\circ\phi_{t2v}$ // Generate reverse-aligned images
  $H^{2}CA(\nabla T_{v},\nabla V),\ H^{2}CA(\nabla V_{t},\nabla T)$ // $H^{2}CA\text{-}e$: Edge-to-Poincaré embedding
  $H^{2}CA(T_{v},V),\ H^{2}CA(V_{t},T)$ // $H^{2}CA\text{-}p$: Pixel-to-Poincaré embedding
  $D_{v}(\nabla T_{v},\nabla V),\ D_{t}(\nabla T,\nabla V_{t})$ // Determining alignment performance
  $F=\mathrm{Decoder}(\mathrm{Encoder}(V)+\mathrm{Encoder}(T_{v}))$ // Fusion of aligned multi-modal images
end for
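The forward and backward warps in Algorithm 1 form a closed loop: applying $\phi_{t2v}$ and then $\phi_{v2t}$ should return the original image. The toy below illustrates only this cycle-consistency idea on a 1-D signal. It is not the training code: the integer displacement fields stand in for the outputs of $R_{t2v}$ and $R_{v2t}$, nearest-neighbour resampling with clamped borders replaces the actual warping layer, and the mean absolute error used as the cycle penalty is an assumption.

```python
def warp(signal, field):
    """Resample signal at positions i + field[i], clamped to the valid index range."""
    n = len(signal)
    return [signal[min(max(i + field[i], 0), n - 1)] for i in range(n)]

def cycle_loss(original, reconstructed):
    """Mean absolute difference between a signal and its cycle reconstruction."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)

# Hypothetical 1-D "infrared" signal and constant displacement fields.
T = [0, 0, 1, 2, 3, 0, 0, 0]
phi_t2v = [1] * len(T)    # stand-in for the forward deformation field
phi_v2t = [-1] * len(T)   # stand-in for a consistent backward field

T_v = warp(T, phi_t2v)      # forward-aligned signal
T_vt = warp(T_v, phi_v2t)   # reverse-aligned (cycle) signal
print(cycle_loss(T, T_vt))  # consistent fields reconstruct T

# An inconsistent backward field (identity) leaves a residual.
T_bad = warp(T_v, [0] * len(T))
print(cycle_loss(T, T_bad))
```

When the two fields are mutual inverses the cycle penalty vanishes, and any inconsistency between them shows up as a nonzero residual that the closed-loop structure can penalize.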

B.1 More Details of Architecture

Hy-CycleAlign simultaneously trains two registration networks, $R_{t2v}$ ($T$ to $V$) and $R_{v2t}$ ($V$ to $T$), two discriminators, $D_{t}$ and $D_{v}$, and a fusion network. $R_{t2v}$ aligns the infrared image to the visible image, producing the registered image $T_{v}$. $R_{v2t}$ aligns the visible image to the infrared image, generating the registered image $V_{t}$. The fusion network then fuses $T_{v}$ and $V$ to generate the final fused image $F$.

B.2 More Details of H2CA

H2CA consists of two parts, H2CA-e and H2CA-p, which map image edge information and image pixels, respectively, from Euclidean space to the Poincaré space.

H2CA-e: H2CA-e is used to map image edge information from Euclidean space to the Poincaré space. Edge features are first extracted using the Sobel operator $\nabla$. Inspired by [22], the extracted edge information is projected onto the hyperbolic tangent space, as shown in Eq. 26.

$$x=Map(i,c)=\frac{i}{\sqrt{c}\cdot\left\|i\right\|}\tanh(\sqrt{c}\cdot\left\|i\right\|), \tag{26}$$

where the function $Map(\cdot)$ projects a vector $i\in\mathbb{R}^{d}$ in Euclidean space into the Poincaré space, and $c$ denotes the curvature.

Then, to avoid $\left\|x\right\|\geq\frac{1}{\sqrt{c}}$, i.e., to prevent the mapped values from exceeding the boundary of the Poincaré space, we use Eq. 27 to ensure that the projection lies within the Poincaré space.

$$Proj(x,c)=\begin{cases}x&\text{if }\left\|x\right\|<\frac{1}{\sqrt{c}}-\varepsilon\\ \frac{(1-\varepsilon)}{\sqrt{c}}\frac{x}{\left\|x\right\|}&\text{otherwise}\end{cases}. \tag{27}$$

To prevent overflow beyond the boundary, we introduce a small constant $\varepsilon$ and set it to $10^{-6}$.
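Eqs. 26 and 27 can be sketched directly for a single vector. The code below is a minimal illustration, not the released implementation: the function names are hypothetical, vectors are plain Python lists, and the guard for a zero-norm input is an added assumption, since Eq. 26 is undefined at $i=0$.

```python
import math

def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in x))

def map_to_poincare(i, c):
    """Eq. 26: map a Euclidean vector i into the Poincaré space with curvature parameter c."""
    n = norm(i)
    if n == 0.0:  # assumption: the origin maps to itself
        return list(i)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * t for t in i]

def proj(x, c, eps=1e-6):
    """Eq. 27: pull points at or beyond the boundary radius 1/sqrt(c) back inside."""
    n = norm(x)
    if n < 1 / math.sqrt(c) - eps:
        return list(x)
    return [(1 - eps) / math.sqrt(c) * t / n for t in x]

c = 0.01                                  # boundary radius is 1/sqrt(c) = 10
inner = proj(map_to_poincare([3.0, 4.0], c), c)
outer = proj(map_to_poincare([1000.0, 0.0], c), c)
print(norm(inner), norm(outer))
```

The second input shows why the projection is needed in practice: for large inputs `tanh` saturates to 1.0 in floating point, so the mapped point would otherwise sit on the boundary, where the Poincaré distance is undefined.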

We constrain image alignment by computing the geodesic distance between different modalities in hyperbolic space and converting it into a similarity probability:

$$\mathcal{L}_{h2c-e}=-\log\sigma(-d_{P}(\nabla T_{v},\nabla V)). \tag{28}$$

H2CA-p: Different from H2CA-e, which constrains edge information, H2CA-p focuses on aligning deep features. It first extracts features from each modality using a VGG-16 network [45], then maps them into the Poincaré space to achieve global multi-modal alignment:

$$\mathcal{L}_{h2c-p}=-\log\sigma(-d_{P}(T_{v},V)). \tag{29}$$

Therefore, the total loss in H2CA is:

$$\mathcal{L}_{h2c}=\mathcal{L}_{h2c-e}+\mathcal{L}_{h2c-p}. \tag{30}$$
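A small sketch of Eqs. 28 to 30 for a single pair of embedded vectors follows. It is an illustration under stated assumptions rather than the training loss as implemented: embeddings are assumed to lie already inside the unit Poincaré space so that Eq. 17 applies, each modality is represented by one vector instead of a full feature map, and the function names are hypothetical. The code uses the identity $-\log\sigma(-d)=\log(1+e^{d})$ to evaluate the loss in a numerically stable way.

```python
import math

def sq_norm(x):
    """Squared Euclidean norm of a vector."""
    return sum(t * t for t in x)

def poincare_distance(u, v):
    """Geodesic distance on the unit Poincaré space (Eq. 17)."""
    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1 + 2 * diff / ((1 - sq_norm(u)) * (1 - sq_norm(v))))

def neg_log_sigmoid_of_neg(d):
    """-log(sigmoid(-d)) = log(1 + exp(d)), in a numerically stable form."""
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))

def h2ca_loss(edge_a, edge_b, feat_a, feat_b):
    """Sum of the edge term (Eq. 28) and the pixel/feature term (Eq. 29), as in Eq. 30."""
    loss_e = neg_log_sigmoid_of_neg(poincare_distance(edge_a, edge_b))
    loss_p = neg_log_sigmoid_of_neg(poincare_distance(feat_a, feat_b))
    return loss_e + loss_p

# Hypothetical embedded edge and feature vectors.
aligned = h2ca_loss([0.2, 0.1], [0.2, 0.1], [0.5, 0.0], [0.5, 0.0])
misaligned = h2ca_loss([0.2, 0.1], [0.3, 0.1], [0.5, 0.0], [0.6, 0.1])
print(aligned, misaligned)
```

Because $d_{P}\geq 0$, each term is bounded below by $\log 2$, attained when the two embeddings coincide, and it grows monotonically with the geodesic distance.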

B.3 More Details of Fusion Module

To eliminate the influence of complex fusion strategies on experimental conclusions, we deliberately adopt a simple fusion architecture to more clearly verify the direct relationship between registration quality and final fusion performance. The fusion module uses two encoders to extract features from the registered infrared and visible images, and then fuses them to produce the final fused output.

Appendix C More Experiments

To validate the performance of Hy-CycleAlign, we conducted additional experiments with different negative curvatures $c$ in Poincaré space and tested the model on various misaligned data. The results demonstrate that Hy-CycleAlign consistently achieves good registration and fusion performance, even under different misalignment conditions.

C.1 Hyperparameter Analysis

We analyzed the negative curvature $c$ of the Poincaré space. The experimental results are shown in Fig. 10 and Tab. 4. Hy-CycleAlign achieves good visual registration results across different values of the parameter $c$. Combined with the quantitative results, setting $c$ to $0.01$ yields a balanced performance in both registration and fusion tasks.

Refer to caption
Figure 9: Visualization of the MFNet dataset with hyperparameter $c$.
Refer to caption
Figure 10: Comparison of results in MFNet dataset for hyperparameter $c$.
Table 4: Ablation study results of the hyperparameter $c$ on the MFNet dataset. Bold indicates the best value.
$c$ HD ↓ HD95 ↓ ASSD ↓ DSC ↑ EN ↑
$1$ 96.48 38.38 8.72 0.91 6.38
$1e{-}1$ $\mathbf{93.32}$ 36.04 8.25 0.91 6.35
$1e{-}2$ 93.91 $\mathbf{34.71}$ $\mathbf{7.88}$ $\mathbf{0.92}$ 6.44
$1e{-}3$ 95.95 45.76 10.50 0.88 6.42
$1e{-}4$ 99.83 40.78 8.99 0.91 $\mathbf{6.46}$

C.2 More Downstream Results for Infrared-visible Applications

We apply the registration and fusion results to the task of object detection from the viewpoint of drones. In this task, we use YOLOv11-m [46] as the detector and employ mAP@0.5, precision and recall as evaluation metrics. Compared with existing methods, Hy-CycleAlign achieves the highest detection precision and mAP, indicating that registration and fusion can effectively enhance the performance of object detection tasks. The results are shown in Fig. 11 and Tab. 5.

Refer to caption
Figure 11: Qualitative results for infrared-visible object detection on DroneVehicle dataset
Table 5: Quantitative results of object detection in the DroneVehicle dataset. Bold and underline indicate the best and second-best values, respectively.
Methods DIDFuse CDDFuse EMMA SuperFusion ReCoNet MURF UMF-CMGR IMF Hy-CycleAlign
Recall ↑ 73.8 80.0 70.8 69.2 78.5 53.8 80.8 $\mathbf{83.1}$ $\underline{81.5}$
Precision ↑ 7.7 7.0 8.7 8.3 $\underline{8.9}$ 7.0 $\underline{8.9}$ 8.3 $\mathbf{12.3}$
mAP@0.5 ↑ 7.2 6.5 $\underline{8.7}$ 5.3 8.5 7.3 6.9 8.2 $\mathbf{13.7}$

C.3 Comparisons of Rigid Misalignment Fusion Results

To further validate the alignment and fusion performance of Hy-CycleAlign, we conducted additional experiments on the RoadScene [47] and TNO [48] datasets. Considering that RoadScene is a well-aligned dataset, we randomly shifted the infrared images horizontally by 0.5% to 1.5% of the image width to artificially generate rigid misalignments caused by camera position differences. As shown in Fig. 12 and 13, under varying lighting conditions, Hy-CycleAlign maintains good alignment performance when facing rigid misalignment while better preserving image details.

Refer to caption
Figure 12: Comparison of results for the RoadScene dataset with rigid misalignment.
Refer to caption
Figure 13: Comparison of results for the TNO dataset.
Table 6: Quantitative comparisons on RoadScene and TNO. Boldface and underline show the best and second-best values, respectively.
Dataset Metric Alignment-free fusion methods Alignment-based fusion methods
DIDFuse CDDFuse EMMA SuperFusion ReCoNet MURF UMF-CMGR IMF Hy-CycleAlign

RoadScene

HD↓ 109.88 $\underline{99.34}$ $\mathbf{91.63}$ $\underline{81.56}$ 87.42 86.03 105.02 90.88 $\mathbf{80.24}$
HD95↓ 69.04 $\underline{64.65}$ $\mathbf{56.70}$ $\underline{44.43}$ 45.72 39.36 65.27 56.76 $\mathbf{34.94}$
ASSD↓ 22.21 $\underline{18.10}$ $\mathbf{15.26}$ 10.38 11.57 $\underline{8.43}$ 22.19 18.32 $\mathbf{7.73}$
DSC↑ 0.50 $\underline{0.55}$ $\mathbf{0.64}$ 0.73 $\underline{0.74}$ 0.60 0.59 0.60 $\mathbf{0.84}$
MEE↓ 27.24 $\mathbf{14.76}$ $\underline{25.42}$ $\mathbf{10.06}$ 17.01 $\underline{10.24}$ 28.35 14.52 15.25
SF↑ $\underline{16.57}$ $\mathbf{17.37}$ 15.07 $\underline{11.67}$ 9.02 9.79 3.86 7.43 $\mathbf{12.3}$
EN↑ $\mathbf{7.50}$ 7.27 $\underline{7.42}$ $\mathbf{7.09}$ 7.05 6.73 6.40 7.04 $\underline{7.06}$

TNO

HD↓ $\underline{102.22}$ 105.67 $\mathbf{91.42}$ 93.10 $\mathbf{87.47}$ 111.18 91.00 103.69 $\underline{87.87}$
HD95↓ 57.72 $\underline{56.12}$ $\mathbf{51.55}$ 47.47 $\underline{42.34}$ 67.42 58.61 64.16 $\mathbf{33.99}$
ASSD↓ 21.92 $\underline{18.48}$ $\mathbf{15.39}$ 13.57 $\underline{10.27}$ 19.42 19.21 21.81 $\mathbf{8.56}$
DSC↑ 0.69 $\underline{0.73}$ $\mathbf{0.82}$ 0.79 $\underline{0.83}$ 0.73 0.58 0.59 $\mathbf{0.90}$
MEE↓ 29.94 $\mathbf{8.4}$ $\underline{11.18}$ $\underline{9.30}$ 14.36 23.22 16.24 11.90 $\mathbf{6.04}$
SF↑ $\underline{12.65}$ $\mathbf{13.90}$ 11.74 $\underline{9.47}$ 7.96 3.53 7.05 7.12 $\mathbf{9.72}$
EN↑ 6.85 $\underline{7.09}$ $\mathbf{7.16}$ 6.81 6.68 6.34 6.85 $\underline{6.88}$ $\mathbf{6.94}$
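
For readers unfamiliar with the two fusion metrics in the table, the following minimal sketch computes SF and EN under their commonly used definitions, which we assume here: spatial frequency as the root of the mean squared horizontal and vertical intensity differences, and Shannon entropy of the gray-level histogram in bits. The function names and the list-of-rows image representation are illustrative choices, not the evaluation code used for the reported numbers.

```python
import math
from collections import Counter
from typing import List

# Grayscale image as rows of integer intensities (illustrative representation).
Image = List[List[int]]


def spatial_frequency(img: Image) -> float:
    """SF = sqrt(RF^2 + CF^2), with RF/CF the RMS row/column differences."""
    h, w = len(img), len(img[0])
    # Squared differences between horizontally adjacent pixels (row frequency).
    rf = sum((img[i][j] - img[i][j - 1]) ** 2
             for i in range(h) for j in range(1, w))
    # Squared differences between vertically adjacent pixels (column frequency).
    cf = sum((img[i][j] - img[i - 1][j]) ** 2
             for i in range(1, h) for j in range(w))
    return math.sqrt((rf + cf) / (h * w))


def entropy(img: Image) -> float:
    """Shannon entropy (bits) of the empirical gray-level distribution."""
    counts = Counter(v for row in img for v in row)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    toy = [[0, 255], [0, 255]]  # two gray levels, vertical edge
    print(spatial_frequency(toy), entropy(toy))
```

Both metrics are reference-free: a higher SF indicates richer gradients and a higher EN indicates more information content in the fused image, which is why both are marked with ↑.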

C.4 More Downstream Results for Infrared-visible Applications

We evaluate the registration and fusion results on drone-view object detection, using YOLO11-m [46] as the detector and mAP@0.5, precision, and recall as evaluation metrics. Compared with existing methods, Hy-CycleAlign achieves the highest detection precision and mAP, indicating that registration and fusion can effectively improve downstream object detection. The results are shown in Fig. 11 and Tab. 5.
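
To make the detection metrics concrete, the sketch below computes precision and recall at an IoU threshold of 0.5 for a single image. It is an illustration under our own assumptions (greedy one-to-one matching in descending confidence order, boxes given as corner coordinates); the helper names are hypothetical and this is not the detector's own evaluation code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def precision_recall(preds: List[Tuple[Box, float]],
                     gts: List[Box],
                     thr: float = 0.5) -> Tuple[float, float]:
    """Greedy matching: each prediction, in descending confidence order,
    claims the unmatched ground-truth box with the highest IoU."""
    matched = set()
    tp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(box, gt)
            if v > best_iou:
                best_j, best_iou = j, v
        # A prediction counts as a true positive only above the threshold.
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall


if __name__ == "__main__":
    preds = [((0, 0, 10, 10), 0.9), ((20, 20, 30, 30), 0.8), ((50, 50, 60, 60), 0.7)]
    gts = [(0, 0, 10, 10), (21, 21, 31, 31), (100, 100, 110, 110)]
    print(precision_recall(preds, gts))
```

The same IoU criterion suggests why residual misalignment can hurt detection: a 10×10 box shifted horizontally by 5 pixels overlaps its target with an IoU of only 1/3, which falls below the 0.5 threshold and would be counted as a miss.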

C.5 More Comparisons of Nonlinear Misalignment Fusion Results

Figs. 14 and 15 present additional qualitative comparisons of infrared-visible image registration and fusion results. Our method handles misalignment more effectively while integrating thermal radiation from infrared images with texture details from visible images. Compared to other approaches, Hy-CycleAlign achieves more accurate multi-modal alignment under varying lighting conditions, better preserves fine textures, and highlights structural information.

Figure 14: Comparison of results for the LLVIP dataset with infrared image nonlinear transformations.
Figure 15: Comparison of results for the MFNet dataset with visible image nonlinear transformations.

References

  • [1] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang, “Ifcnn: A general image fusion framework based on convolutional neural network,” Information Fusion, vol. 54, pp. 99–118, 2020.
  • [2] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2fusion: A unified unsupervised image fusion network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 502–518, 2020.
  • [3] Z. Zhao, S. Xu, C. Zhang, J. Liu, P. Li, and J. Zhang, “Didfuse: Deep image decomposition for infrared and visible image fusion,” arXiv preprint arXiv:2003.09210, 2020.
  • [4] Z. Huang, J. Liu, X. Fan, R. Liu, W. Zhong, and Z. Luo, “Reconet: Recurrent correction network for fast and efficient multi-modality image fusion,” in European conference on computer vision. Springer, 2022, pp. 539–555.
  • [5] L. Tang, Y. Deng, Y. Ma, J. Huang, and J. Ma, “Superfusion: A versatile image registration and fusion network with semantic awareness,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 12, pp. 2121–2137, 2022.
  • [6] H. Xu, J. Yuan, and J. Ma, “Murf: Mutually reinforcing multi-modal image registration and fusion,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 10, pp. 12 148–12 166, 2023.
  • [7] Z. Chen, J. Wei, and R. Li, “Unsupervised multi-modal medical image registration via discriminator-free image-to-image translation,” arXiv preprint arXiv:2204.13656, 2022.
  • [8] D. Wang, J. Liu, L. Ma, R. Liu, and X. Fan, “Improving misaligned multi-modality image fusion with one-stage progressive dense registration,” IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  • [9] G. Bachmann, G. Bécigneul, and O. Ganea, “Constant curvature graph convolutional networks,” in International conference on machine learning. PMLR, 2020, pp. 486–496.
  • [10] I. Chami, Z. Ying, C. Ré, and J. Leskovec, “Hyperbolic graph convolutional neural networks,” Advances in neural information processing systems, vol. 32, 2019.
  • [11] J. Dai, Y. Wu, Z. Gao, and Y. Jia, “A hyperbolic-to-hyperbolic graph convolutional network,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 154–163.
  • [12] H. Cao, Y. Wang, J. Li, P. Zhu, and Q. Hu, “Hyperbolic-euclidean deep mutual learning,” in Proceedings of the ACM on Web Conference 2025, 2025, pp. 3073–3083.
  • [13] Q. Liu, M. Nickel, and D. Kiela, “Hyperbolic graph neural networks,” Advances in neural information processing systems, vol. 32, 2019.
  • [14] A. Lou, I. Katsman, Q. Jiang, S. Belongie, S.-N. Lim, and C. De Sa, “Differentiating through the fréchet mean,” in International conference on machine learning. PMLR, 2020, pp. 6393–6403.
  • [15] R. Aly, A. Ossa, A. Köhn, C. Biemann, A. Panchenko, and S. Acharya, “Every child should have parents: A taxonomy refinement algorithm based on hyperbolic term embeddings,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4811–4817.
  • [16] A. Tifrea, G. Bécigneul, and O.-E. Ganea, “Poincaré glove: Hyperbolic word embeddings,” arXiv preprint arXiv:1810.06546, 2018.
  • [17] Y. Zhu, D. Zhou, J. Xiao, X. Jiang, X. Chen, and Q. Liu, “Hypertext: Endowing fasttext with hyperbolic geometry,” arXiv preprint arXiv:2010.16143, 2020.
  • [18] S. Ramasinghe, V. Shevchenko, G. Avraham, and A. Thalaiyasingam, “Accept the modality gap: An exploration in the hyperbolic space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 263–27 272.
  • [19] M. Yang, H. Verma, D. C. Zhang, J. Liu, I. King, and R. Ying, “Hypformer: Exploring efficient transformer fully in hyperbolic space,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 3770–3781.
  • [20] P. Mettes, M. Ghadimi Atigh, M. Keller-Ressel, J. Gu, and S. Yeung, “Hyperbolic deep learning in computer vision: A survey,” International Journal of Computer Vision, vol. 132, no. 9, pp. 3484–3508, 2024.
  • [21] H. Li, Z. Chen, Y. Xu, and J. Hu, “Hyperbolic anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17 511–17 520.
  • [22] V. Khrulkov, L. Mirvakhabova, E. Ustinova, I. Oseledets, and V. Lempitsky, “Hyperbolic image embeddings,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6418–6428.
  • [23] M. G. Atigh, J. Schoep, E. Acar, N. Van Noord, and P. Mettes, “Hyperbolic image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 4453–4462.
  • [24] H. Fu, J. Yuan, G. Zhong, X. He, J. Lin, and Z. Li, “Cf-deformable detr: an end-to-end alignment-free model for weakly aligned visible-infrared object detection,” in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024, pp. 758–766.
  • [25] L. G. Brown, “A survey of image registration techniques,” ACM computing surveys (CSUR), vol. 24, no. 4, pp. 325–376, 1992.
  • [26] L. Kong, C. Lian, D. Huang, Y. Hu, Q. Zhou et al., “Breaking the dilemma of medical image-to-image translation,” Advances in Neural Information Processing Systems, vol. 34, pp. 1964–1978, 2021.
  • [27] Q. Wu, P. Dai, J. Chen, C.-W. Lin, Y. Wu, F. Huang, B. Zhong, and R. Ji, “Discover cross-modality nuances for visible-infrared person re-identification,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4330–4339.
  • [28] Z. Fan, Y. Pi, M. Wang, Y. Kang, and K. Tan, “Gls–mift: A modality invariant feature transform with global-to-local searching,” Information Fusion, vol. 105, p. 102252, 2024.
  • [29] L. Xiang, L. Zhao, S. Chen, and X. Li, “Infrared and visible image registration in uav inspection,” in Proceedings of the 2022 6th International Conference on Video and Image Processing, 2022, pp. 67–71.
  • [30] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on pattern analysis and machine intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
  • [31] Y. Li, Y. Fan, X. Xiang, D. Demandolx, R. Ranjan, R. Timofte, and L. Van Gool, “Efficient and explicit modelling of image hierarchies for image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 278–18 289.
  • [32] A. Pal, M. van Spengler, G. M. D. di Melendugno, A. Flaborea, F. Galasso, and P. Mettes, “Compositional entailment learning for hyperbolic vision-language models,” arXiv preprint arXiv:2410.06912, 2024.
  • [33] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
  • [34] L. Tang, J. Yuan, and J. Ma, “Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network,” Information Fusion, vol. 82, pp. 28–42, 2022.
  • [35] Z. Zhao, H. Bai, J. Zhang, Y. Zhang, S. Xu, Z. Lin, R. Timofte, and L. Van Gool, “Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 5906–5916.
  • [36] Y. Sun, B. Cao, P. Zhu, and Q. Hu, “Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 10, pp. 6700–6713, 2022.
  • [37] X. Jia, C. Zhu, M. Li, W. Tang, and W. Zhou, “Llvip: A visible-infrared paired dataset for low-light vision,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3496–3504.
  • [38] Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, “Mfnet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 5108–5115.
  • [39] I. Loshchilov and F. Hutter, “Fixing weight decay regularization in adam,” arXiv preprint arXiv:1711.05101, 2017.
  • [40] Z. Zhao, H. Bai, J. Zhang, Y. Zhang, K. Zhang, S. Xu, D. Chen, R. Timofte, and L. Van Gool, “Equivariant multi-modality image fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 25 912–25 921.
  • [41] D. Wang, J. Liu, X. Fan, and R. Liu, “Unsupervised misaligned infrared and visible image fusion via cross-modality image generation and registration,” arXiv preprint arXiv:2205.11876, 2022.
  • [42] Z. Jiang, Z. Zhang, J. Liu, X. Fan, and R. Liu, “Breaking modality disparity: Harmonized representation for infrared and visible image registration,” arXiv preprint arXiv:2304.05646, 2023.
  • [43] O. Ganea, G. Bécigneul, and T. Hofmann, “Hyperbolic neural networks,” Advances in neural information processing systems, vol. 31, 2018.
  • [44] L. Zhang, S. Na, T. Liu, D. Zhu, and J. Huang, “Multimodal deep fusion in hyperbolic space for mild cognitive impairment study,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 674–684.
  • [45] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [46] G. Jocher and J. Qiu, “Ultralytics yolo11,” 2024. [Online]. Available: https://github.com/ultralytics/ultralytics
  • [47] H. Xu, J. Ma, Z. Le, J. Jiang, and X. Guo, “Fusiondn: A unified densely connected network for image fusion,” in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
  • [48] A. Toet and M. A. Hogervorst, “Progress in color night vision,” Optical Engineering, vol. 51, no. 1, p. 010901, 2012.