
HyCASS
Adjustable Spatio-Spectral Hyperspectral Image Compression Network
EnMAP
Environmental Mapping and Analysis Program
OHID-1
Orbita Hyperspectral Images Dataset-1
L2A
Level 2A
RS
remote sensing
EO
Earth observation
CV
computer vision
GPU
graphics processing unit
RGB
red, green and blue
HSI
hyperspectral image
CNN
convolutional neural network
ANN
artificial neural network
SPIHT
set partitioning in hierarchical trees
SPECK
set partitioning embedded block
PCA
principal component analysis
DCT
discrete cosine transform
KLT
Karhunen–Loève transform
SE
Squeeze and Excitation
Swin
Shifted windows
RSTB
Residual Swin Transformer Block
NTU
Neural Transformation Unit
FE
feature embedding
FU
feature unembedding
STL
Swin Transformer layer
WA
window attention
SWA
shifted window attention
FM
foundation model
PSNR
peak signal-to-noise ratio
SA
spectral angle
MSE
mean squared error
CR
compression ratio
bpppc
bits per pixel per channel
dB
decibels
GSD
ground sample distance
FLOPs
floating point operations
LR
learning rate
BS
batch size
LeakyReLU
leaky rectified linear unit
PReLU
parametric rectified linear unit
DPCM
Differential Pulse Code Modulation
AE
autoencoder
VAE
variational autoencoder
CAE
convolutional autoencoder
GAN
generative adversarial network
INR
Implicit Neural Representations
A1D-CAE
Adaptive 1-D Convolutional Autoencoder
1D-CAE
1D-Convolutional Autoencoder
SSCNet
Spectral Signals Compressor Network
3D-CAE
3D Convolutional Auto-Encoder
LineRWKV
Line Receptance Weighted Key Value
HyCoT
Hyperspectral Compression Transformer
TIC
Transformer-based Image Compression
S2C-Net
Spatio-Spectral Compression Network
HiFiC
High Fidelity Compression
MLP
multilayer perceptron
1D
one-dimensional
2D
two-dimensional
3D
three-dimensional
LN
layer normalization
RSiM
Remote Sensing Image Analysis
BIFOLD
Berlin Institute for the Foundations of Learning and Data

Adjustable Spatio-Spectral Hyperspectral Image Compression Network

Martin Hermann Paul Fuchs, Behnood Rasti, and Begüm Demir

Martin Hermann Paul Fuchs, Behnood Rasti, and Begüm Demir are with the Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin, 10623 Berlin, Germany (e-mail: m.fuchs@tu-berlin.de; behnood.rasti@tu-berlin.de; demir@tu-berlin.de). Behnood Rasti and Begüm Demir are also with the Berlin Institute for the Foundations of Learning and Data (BIFOLD), 10623 Berlin, Germany.
Abstract

With the rapid growth of hyperspectral data archives in remote sensing (RS), efficient storage has become essential, driving significant attention toward learning-based hyperspectral image (HSI) compression. However, the individual and joint effects of spectral and spatial compression on learning-based HSI compression have not yet been thoroughly investigated. Such an analysis is crucial for understanding how the exploitation of spectral, spatial, and joint spatio-spectral redundancies affects HSI compression. To address this issue, we propose the Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a learning-based model designed for adjustable HSI compression in both spectral and spatial dimensions. HyCASS consists of six main modules: 1) spectral encoder; 2) spatial encoder; 3) compression ratio (CR) adapter encoder; 4) CR adapter decoder; 5) spatial decoder; and 6) spectral decoder. The modules employ convolutional layers and transformer blocks to capture both short-range and long-range redundancies. Experimental results on two HSI benchmark datasets demonstrate the effectiveness of our proposed adjustable model compared to existing learning-based compression models. Based on our results, we establish a guideline for effectively balancing spectral and spatial compression across different CRs, taking into account the spatial resolution of the HSIs. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hycass.

Index Terms:
Hyperspectral image compression, adjustable spatio-spectral compression, deep learning, remote sensing.

I Introduction

Hyperspectral sensors capture images spanning hundreds of continuous bands across the electromagnetic spectrum. Fine spectral information provided in HSIs enables the identification and differentiation of materials within a scene. Hyperspectral sensors, mounted on satellites, airplanes, and drones, enable a wide range of RS applications, including forest monitoring [1], water quality assessment [2], wildfire detection [3], and flood mapping [4]. The continuous improvement in hyperspectral sensors has enabled them to extract increasingly detailed spectral and spatial information, which is essential for advanced analysis. However, the substantial volume of data produced by these sensors presents significant challenges in terms of storage, transmission, and processing. An emerging research area focuses on the efficient compression of hyperspectral data to preserve crucial spectral and spatial information content for subsequent analysis [5].

Many HSI compression methods are presented in the literature. Generally, they can be categorized into three classes: i) lossless; ii) near-lossless; and iii) lossy HSI compression. Each category offers a unique trade-off between data preservation and compression efficiency. Lossless HSI compression ensures a perfect reconstruction of the original data without any loss of information and is therefore particularly important for tasks with zero tolerance for data degradation. However, lossless HSI compression methods typically only achieve CRs of 2 to 4 [6]. This restriction limits their applicability in scenarios with strict bandwidth or storage space constraints that require significantly higher CRs. Near-lossless HSI compression achieves higher CRs than lossless compression while introducing minimal distortion between the original and reconstructed HSIs. The maximum error is upper-bounded, ensuring controlled deviation in the reconstructed HSIs. However, small errors may accumulate, potentially impacting applications that require absolute spectral precision. Furthermore, the achievable CRs remain limited compared to lossy compression. Lossy HSI compression offers significant advantages, particularly in achieving high CRs, which is essential for applications with strict bandwidth or storage limitations [5]. By selectively discarding less important information, lossy compression efficiently reduces data size while preserving key features, making it ideal for large-scale hyperspectral data transmission, storage, and processing. Although lossy compression introduces some degree of information loss, the degradation is typically negligible and does not substantially compromise the usability of the data in most practical applications.

From a methodological point of view, HSI compression can be grouped into two categories: i) traditional methods; and ii) learning-based methods. Traditional HSI compression methods have been extensively investigated in RS with predictive coding emerging as a common approach [7]. Predictive coding takes advantage of both spectral and spatial redundancies by predicting pixel values based on contextual information and encoding only the residuals between predicted and actual values for efficient storage or transmission. For example, in [8], a clustered Differential Pulse Code Modulation (DPCM) compression method, which clusters spectra and calculates an optimized predictor for each cluster, is proposed for HSIs. After linear prediction of a spectrum, the difference is entropy-coded using an adaptive entropy coder for each of the clusters. Another prominent implementation of predictive coding is the CCSDS 123.0-B-2 standard [9] for lossless HSI compression that employs adaptive linear prediction to minimize redundancies in HSIs. Its low complexity and flexible architecture, coupled with the capability to achieve near-lossless compression through closed-loop quantization, make it well-suited for deployment in onboard RS systems. Although prediction-based methods are widely used for lossless and near-lossless HSI compression, they cannot generate meaningful latent representations, and their autoregressive functionality results in a slow processing speed.
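To make the predict-then-encode-residuals idea concrete, the following is a minimal sketch of a first-order DPCM along the spectral dimension of a single pixel. It is deliberately simplified: the previous band serves as the predictor, whereas the clustered method of [8] and CCSDS 123.0-B-2 [9] use optimized or adaptive linear predictors and a subsequent entropy coder. The function names and the example values are hypothetical.

```python
def dpcm_encode(spectrum):
    """Return residuals of a first-order predictor (previous band predicts the next)."""
    residuals = []
    previous = 0  # assumed initial prediction for the first band
    for value in spectrum:
        residuals.append(value - previous)
        previous = value
    return residuals


def dpcm_decode(residuals):
    """Invert dpcm_encode by accumulating the residuals band by band."""
    spectrum = []
    previous = 0
    for residual in residuals:
        previous = previous + residual
        spectrum.append(previous)
    return spectrum


# Hypothetical integer band values of one pixel.
spectrum = [1200, 1210, 1215, 1213, 1220]
residuals = dpcm_encode(spectrum)
print(residuals)  # [1200, 10, 5, -2, 7]
assert dpcm_decode(residuals) == spectrum  # lossless round trip
```

Apart from the first value, the residuals are small in magnitude, which is what makes them cheaper to entropy-code than the raw band values. The decoder has to process the bands sequentially, which illustrates the autoregressive behavior mentioned above.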

In contrast, traditional transform coding methods excel at extracting compact latent representations from hyperspectral data and are frequently used for lossy compression. They project hyperspectral data into a lower-dimensional, decorrelated latent space using mathematical transformations. The resulting reduced number of coefficients is subsequently quantized, introducing some loss of information, and then entropy-coded. Several traditional transform coding methods have been proposed in RS. As an example, in [10], principal component analysis (PCA) is combined with JPEG2000 for joint spatio-spectral compression of HSIs, whereas PCA is applied followed by discrete cosine transform (DCT) in [11]. In [12], three-dimensional (3D) transform coding is achieved by applying wavelet transformation in the spatial dimensions and Karhunen–Loève transform (KLT) in the spectral dimension, followed by 3D-SPIHT for efficient lossy compression. In [13], 3D-SPECK is introduced, which takes advantage of the 3D wavelet transform to efficiently encode HSIs by compressing the redundancies in all three dimensions. Despite their effectiveness in extracting compact features and achieving high CRs, traditional methods often rely on hand-crafted transformations, which limit their ability to fully exploit the rich spatio-spectral structure of hyperspectral data. Consequently, these methods may not generalize well when applied to hyperspectral data captured under different conditions, sensor types, or scene characteristics.

To overcome these limitations, the development of learning-based HSI compression has recently attracted great attention in RS. Learning-based HSI compression methods leverage deep neural networks to automatically learn hierarchical and data-adaptive representations from large-scale training datasets, allowing more effective characterization of complex spectral and spatial redundancies. These methods also lead to an improved rate–distortion performance and better generalization across different scenes [5]. As an example, in [14], the LineRWKV method is presented, which enhances CCSDS 123.0-B-2 [9] by introducing a learning-based predictor. LineRWKV achieves superior lossless and near-lossless reconstruction performance compared to CCSDS 123.0-B-2 at the cost of increased training time and computational complexity. In [15], the authors introduce two generative adversarial network (GAN)-based models designed for spatio-spectral compression of HSIs. Their models extend the High Fidelity Compression (HiFiC) framework [16] by incorporating: i) Squeeze and Excitation (SE) blocks; and ii) 3D convolutions to better exploit spectral redundancies alongside spatial redundancies. Although these models are capable of achieving extremely high CRs, exceeding $10^{4}$, their generative nature can lead to the synthesis of unrealistic spectral and spatial content, potentially compromising the fidelity of the reconstructed data. In [17], a method for HSI compression using Implicit Neural Representations (INR) is presented, where a multilayer perceptron (MLP) network learns to map pixel locations to pixel intensities. The weights of the learned model are stored or transmitted to achieve compression. In [18], a neural video representation approach is proposed. HSIs are treated as a stream of video data, where each spectral band represents a frame of information, and variances between spectral bands represent transformations between frames. Their approach utilizes the spectral band index and the spatial coordinate index as its input to perform network overfitting. A fundamental limitation of neural representation approaches lies in their training procedure, which involves overfitting a separate neural network to each HSI. This instance-specific optimization results in substantial computational costs, limiting the practicality of large-scale data processing.

Most state-of-the-art learning-based HSI compression methods adopt the autoencoder (AE) architecture, where an encoder network compresses the input data into a compact latent representation, and a decoder network reconstructs the data from it. This structure enables end-to-end optimization and allows the networks to learn efficient, data-driven representations of spectral and spatial information. Existing AE models differ primarily in how they process the spectral and spatial dimensions by using one-dimensional (1D), two-dimensional (2D), and 3D convolutional layers to balance compression efficiency, reconstruction quality, and computational complexity. As an example, in [19, 20], 1D-Convolutional Autoencoder (1D-CAE) is presented, which compresses the spectral content without considering spatial redundancies by stacking multiple blocks of 1D convolutions, pooling layers, and leaky rectified linear unit (LeakyReLU) activations. Although high-quality reconstructions can be achieved with this model, the pooling layers limit the achievable CRs to $2^{n}$, where $n$ is the number of pooling layers. Another limitation of this model is the increasing computational complexity with higher CRs due to the deeper network architecture. In [21], Spectral Signals Compressor Network (SSCNet) extracts spatial features via 2D convolutional kernels. 2D max pooling layers introduce spatial compression, while the final CR is adapted via the latent channels within the bottleneck. Although SSCNet enables significantly higher CRs, this comes at the cost of spatially blurred image reconstructions. In [22], 3D Convolutional Auto-Encoder (3D-CAE) is introduced for joint compression of spatio-spectral redundancies via 3D convolutional kernels. Additionally, residual blocks allow gradients to flow through the network directly and improve the model's performance. However, the 3D kernels significantly increase computational complexity. In [23], Spatio-Spectral Compression Network (S2C-Net) is introduced. Initially, a pixelwise AE is pre-trained to capture the essential spectral features of the hyperspectral data. To enhance spatial redundancy removal, a spatial AE network is added to the bottleneck layer of the spectral AE. This dual-layer architecture allows the model to learn both spectral and spatial representations effectively. The entire model is then trained using a mixed loss function that combines reconstruction errors from both the spectral and spatial components. Although this model achieves state-of-the-art performance for high CRs, it falls short in optimally balancing the trade-off between spectral and spatial CR. This limitation suggests room for improvement in achieving more balanced compression across both dimensions. In [24], the authors propose Hyperspectral Compression Transformer (HyCoT), a transformer-based AE designed for pixelwise compression of HSIs that exploits long-range spectral redundancies. The model significantly reduces computational complexity through random sampling and independent pixelwise processing, without notable degradation in reconstruction quality. However, similar to other spectral compression models, HyCoT does not exploit spatial redundancies, thus limiting the achievable CRs.

Existing learning-based HSI compression models are limited in their ability to address several fundamental challenges associated with hyperspectral data, including varying spatial resolutions, high spectral dimensionality, and the need for adjustability across a broad range of CRs. In particular, these models often lack mechanisms to dynamically exploit spectral and spatial redundancies under specific compression requirements and sensor characteristics. As a result, current approaches face limitations in terms of scalability, generalization, and their ability to balance compression efficiency, reconstruction fidelity, and computational complexity.

To overcome the above-mentioned critical issues, in this paper, we introduce HyCASS. The proposed model aims to enable flexible spatio-spectral compression. To this end, HyCASS employs six modules: i) a spectral encoder module; ii) a spatial encoder module; iii) a CR adapter encoder module; iv) a CR adapter decoder module; v) a spatial decoder module; and vi) a spectral decoder module. The novelty of the proposed model consists of the following: 1) a spectral feature extraction that captures spectral redundancies across the whole spectrum of each pixel independently (realized within the spectral encoder module and spectral decoder module); 2) an adjustable number of spatial stages exploiting both short- and long-range spatial redundancies to control spatial compression (realized within the spatial encoder module and spatial decoder module); and 3) an adjustable latent channel dimension to regulate spectral compression (realized within the CR adapter encoder module and CR adapter decoder module). Unlike existing methods in the literature, our proposed model supports flexible compression in both spectral and spatial dimensions. We evaluate the proposed model on two distinct datasets (HySpecNet-11k [25] and MLRetSet [26]), demonstrating its effectiveness in compressing over a broad range of CRs and two different spatial resolutions. Our experimental results show that spectral compression is preferable in cases where the CR or spatial resolution is low, whereas spatio-spectral compression is more effective for high CRs or spatial resolutions. The main contributions of this paper are summarized as follows:

  • We propose a spatio-spectral HSI compression model with adjustable spatial stages and latent channels, capable of effective HSI compression across a broad range of CRs and varying spatial resolutions.

  • We provide extensive experimental results on two benchmark datasets for the introduced model, including an ablation study, comparisons with other approaches, and visual analyses of the reconstructions.

  • We analyze the effects of spectral and spatial HSI compression on the reconstruction quality for multiple CRs and two ground sample distances, providing a comprehensive evaluation of their individual and combined impacts on reconstruction performance.

  • We demonstrate the advantages of adjustable deep learning architectures and derive guidelines for the trade-off between spectral and spatial compression under varying CR conditions, for both low and high spatial resolution hyperspectral data.

The remainder of this paper is organized as follows: Section II introduces the proposed HyCASS model. Section III describes the considered datasets and provides the design of experiments, while the experimental results are presented in Section IV. Finally, in Section V, the conclusion of the work is drawn.

II Proposed Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS)

Let $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$ denote an HSI with spatial dimensions $H$ and $W$, and $C$ spectral bands. In this work, we focus on lossy HSI compression that transforms the original HSI $\mathbf{X}$ into a compact and decorrelated latent representation $\mathbf{Y}\in\mathbb{R}^{\Sigma\times\Omega\times\Gamma}$. Here, $\Sigma$ and $\Omega$ denote the reduced latent spatial dimensions, while $\Gamma$ represents the number of latent channels. The latent representation $\mathbf{Y}$ should retain sufficient information, such that the original HSI can be approximately reconstructed from $\mathbf{Y}$ as $\mathbf{\hat{X}}\in\mathbb{R}^{H\times W\times C}$. The compression aims to minimize the distortion $d:\mathbf{X}\times\mathbf{\hat{X}}\rightarrow[0,\infty)$ between $\mathbf{X}$ and $\mathbf{\hat{X}}$ for a fixed CR.
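As a brief illustration of this formulation, and assuming that every element of $\mathbf{X}$ and $\mathbf{Y}$ is stored with the same bit depth (an assumption made only for this sketch), the CR can be read as the ratio of element counts. The mean squared error is shown as one possible choice of the distortion $d$; it is used here purely as an example, with $x_{h,w,c}$ and $\hat{x}_{h,w,c}$ denoting the entries of $\mathbf{X}$ and $\mathbf{\hat{X}}$:

```latex
\mathrm{CR} = \frac{H \cdot W \cdot C}{\Sigma \cdot \Omega \cdot \Gamma},
\qquad
d_{\mathrm{MSE}}\left(\mathbf{X}, \mathbf{\hat{X}}\right)
  = \frac{1}{H W C} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C}
    \left( x_{h,w,c} - \hat{x}_{h,w,c} \right)^{2}
```

Under this reading, reducing the latent spatial dimensions $\Sigma$ and $\Omega$ or the number of latent channels $\Gamma$ both increase the CR, which is why the two can be traded against each other for a fixed target.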

[Figure 1: block diagram. Encoder: spectral encoder module (Conv2D $1\times1$, $C\rightarrow N$); spatial encoder module, repeated $S\times$ (Residual Swin Transformer Block, then Conv2D $3\times3$, $\downarrow^{2}$, $N\rightarrow N$); CR adapter encoder module (Conv2D $1\times1$, $N\rightarrow\Gamma$, Sigmoid); latent representation. Decoder: CR adapter decoder module (Conv2D $1\times1$, $\Gamma\rightarrow N$); spatial decoder module, repeated $S\times$ (Conv2D $3\times3$, $\uparrow^{2}$, $N\rightarrow N$, then Residual Swin Transformer Block); spectral decoder module (Conv2D $1\times1$, $N\rightarrow C$, Sigmoid).]

Figure 1: Overview of our proposed HyCASS model. Initially, a pixelwise convolution in the spectral encoder module extracts spectral features. The spatial encoder module, composed of $S$ stacked stages, performs both long- and short-range spatial feature extraction, where each spatial stage introduces higher spatial compression. Subsequently, the CR adapter encoder module adjusts the size of the latent representation to match the targeted spatio-spectral CR. The decoder mirrors the encoder structure, replacing downsampling with upsampling operations.

To enable effective spatio-spectral HSI compression, we propose HyCASS. The proposed model combines pixelwise convolutions, strided 2D convolutional layers, and Residual Shifted windows (Swin) Transformer Blocks [27] to leverage both short-range and long-range redundancies across the spectral as well as the spatial dimension of HSIs. HyCASS consists of six modules within both the encoder and decoder: i) a spectral encoder module that involves a spectral feature extraction; ii) a spatial encoder module with a configurable number of spatial stages $S$ for adjustable spatial compression, incorporating short-range and long-range spatial redundancies; iii) a CR adapter encoder module to balance the trade-off between compressing spectral and spatial information content via the number of latent channels $\Gamma$, depending on the joint spatio-spectral target CR; iv) a CR adapter decoder module that recovers spectral information; v) a spatial decoder module that performs spatial reconstruction; and vi) a spectral decoder module for spectral reconstruction. HyCASS facilitates adjustable spatio-spectral compression, striking a balance between reconstruction fidelity and compression efficiency. This balance is achieved through the adjustable parameters $\Gamma$ and $S$, which control spectral and spatial CR, respectively. A schematic overview of HyCASS is provided in Fig. 1 with detailed explanations in the following subsections.

II-A HyCASS Spectral Encoder Module

Initially, the spectral encoder module $E_{\Phi}:\mathbb{R}^{H\times W\times C}\rightarrow\mathbb{R}^{H\times W\times N}$ of the proposed HyCASS model, which is defined as:

$$E_{\Phi}\left(\boldsymbol{\xi}\right)=\text{LeakyReLU}\left(\text{Conv2D}_{1\times1}^{C\rightarrow N}\left(\boldsymbol{\xi}\right)\right) \tag{1}$$

performs spectral feature extraction using a pixelwise convolution, realized via a $1\times1$ kernel inside a 2D convolutional layer. Although 2D convolutions are typically used for capturing spatial patterns, the use of a $1\times1$ kernel ensures that the convolution is applied independently to each pixel location without aggregating any spatial context. The pixelwise convolution captures spectral redundancies along the whole spectrum of each pixel by projecting the $C$ spectral bands into a decorrelated, lower-dimensional spectral representation with $N$ channels ($N<C$). A LeakyReLU is applied after the convolutional layer to introduce non-linearity, which enables the extracted spectral features to capture more complex relationships in the hyperspectral data. The extracted spectral features are subsequently processed by the spatial encoder module.
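The pixelwise behavior of Eq. (1) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the released implementation: the function names, the toy weights, and the LeakyReLU slope of 0.01 are assumptions. A $1\times1$ convolution amounts to applying the same $N\times C$ matrix to the spectrum of every pixel.

```python
def leaky_relu(value, negative_slope=0.01):
    """LeakyReLU; the slope 0.01 is an assumed default, not taken from the paper."""
    return value if value >= 0 else negative_slope * value


def pixelwise_conv(image, weights, bias):
    """Apply a 1x1 convolution followed by LeakyReLU.

    image:   H x W x C nested lists
    weights: N x C nested lists (one row per output channel)
    bias:    list of N values
    """
    output = []
    for row in image:
        output_row = []
        for spectrum in row:
            # Each output channel is a weighted sum over the bands of this pixel only.
            features = [
                leaky_relu(sum(w * x for w, x in zip(weight_row, spectrum)) + b)
                for weight_row, b in zip(weights, bias)
            ]
            output_row.append(features)
        output.append(output_row)
    return output


# Hypothetical toy input: H = 1, W = 2, C = 3 bands, projected to N = 2 channels.
image = [[[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]]
weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
bias = [0.0, 0.0]

features = pixelwise_conv(image, weights, bias)
print(features)
```

Because the inner loop only reads the spectrum of the current pixel, changing a neighboring pixel leaves the output at this location unchanged, which is the property that makes the module a purely spectral operation.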

II-B HyCASS Spatial Encoder Module

Following the spectral encoding, HyCASS applies a spatial encoder module to capture and compress spatial redundancies within the hyperspectral data. The HyCASS spatial encoder $E_{\chi}:\mathbb{R}^{H\times W\times N}\rightarrow\mathbb{R}^{\Sigma\times\Omega\times N}$ defined as:

$$\begin{cases}E_{\chi}\left(\boldsymbol{\xi}\right)=f^{\left(S\right)}\left(\boldsymbol{\xi}\right),\quad S\in\mathbb{N}_{0}\\ f\left(\boldsymbol{\xi}\right)=\text{LeakyReLU}\left(\text{Conv2D}_{3\times3}^{\downarrow^{2}}\left(\text{RSTB}\left(\boldsymbol{\xi}\right)\right)\right)\end{cases} \tag{2}$$

consists of a configurable number $S$ of spatial stages, each implemented as a Neural Transformation Unit (NTU) [28], where $f^{(S)}$ denotes the $S$-fold composition of $f$. This systematically reduces the spatial dimension while enriching the feature representation with contextual spatial information. $S$ is a configurable hyperparameter that determines the spatial CR, given by $\mathrm{CR}_{\text{spat}}=4^{S}$. This hyperparameter helps to adjust spatial compression to align with the targeted spatio-spectral CR. Each NTU performs stepwise spatial compression and is composed of three main components: i) a Residual Swin Transformer Block (RSTB) [27]; ii) a strided 2D convolutional layer with a kernel size of $3\times3$; and iii) a LeakyReLU activation function. The RSTB employs shifted window self-attention to capture long-range spatial redundancies as illustrated in Fig. 2.

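[Figure 2: block diagrams. (a) RSTB: FE, STL, FU, with a residual addition. (b) STL: LN, WA, residual addition; LN, MLP, residual addition; LN, SWA, residual addition; LN, MLP, residual addition.]

Figure 2: Architecture of (a) RSTB and (b) STL. Layout is redesigned based on [27].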

The RSTB consists of three subcomponents: i) feature embedding (FE) that reorders the input feature channels into a token sequence; ii) Swin Transformer layer (STL) that applies multi-head self-attention within and across local windows using layer normalization (LN), window attention (WA), shifted window attention (SWA) and MLP; and iii) feature unembedding (FU) that reorders the tokens back to their original spatial shape. Residual connections mitigate the vanishing gradient issue, enhancing training stability. This design ensures efficient attention computation while preserving spatial locality and contextual information. Notably, patch division and linear embedding for tokenization are omitted as described in [28]. Following the RSTB, the strided 2D convolution captures local short-range spatial redundancies. This layer downsamples the feature maps by a factor of 2 along both height and width, effectively reducing the image size by a factor of 4 per stage. The number of channels $N$ remains constant across all stages. A non-linear LeakyReLU activation function is applied at the end of each NTU to enhance the model's capacity to learn complex patterns. This hierarchical spatial encoding progressively compresses the spatial content while preserving important structural and contextual details. We would like to note that in the case of zero spatial stages ($S=0$), HyCASS operates as a spectral compression model, while spatio-spectral compression is achieved with one or more spatial stages ($S>0$).
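The effect of stacking spatial stages on the latent spatial size can be traced with the standard output-size formula of a strided convolution. The sketch below is illustrative: a padding of 1 for the $3\times3$ kernel and the $128\times128$ input size are assumptions, and the function names are hypothetical.

```python
def strided_conv_output_size(size, kernel=3, stride=2, padding=1):
    """Output length of a strided convolution along one axis (padding=1 is assumed)."""
    return (size + 2 * padding - kernel) // stride + 1


def latent_spatial_size(height, width, stages):
    """Spatial size after `stages` NTUs, each applying one stride-2 convolution."""
    for _ in range(stages):
        height = strided_conv_output_size(height)
        width = strided_conv_output_size(width)
    return height, width


# Hypothetical 128 x 128 input patch.
for stages in range(4):
    sigma, omega = latent_spatial_size(128, 128, stages)
    spatial_cr = (128 * 128) / (sigma * omega)
    print(stages, (sigma, omega), spatial_cr)
```

For input sizes that are divisible by $2^{S}$, each stage halves height and width, so the spatial CR grows as 1, 4, 16, 64 for $S = 0, 1, 2, 3$, in line with the factor of 4 per stage stated above.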

II-C HyCASS CR Adapter Encoder Module

After capturing spectral and spatial redundancies using the respective encoder modules, HyCASS applies the CR adapter encoder module $E_{\Psi}:\mathbb{R}^{\Sigma\times\Omega\times N}\rightarrow\mathbb{R}^{\Sigma\times\Omega\times\Gamma}$ defined as:

$$E_{\Psi}\left(\boldsymbol{\xi}\right)=\text{Sigmoid}\left(\text{Conv2D}_{1\times1}^{N\rightarrow\Gamma}\left(\boldsymbol{\xi}\right)\right). \tag{3}$$

It adjusts the number of latent channels $\Gamma$ to fit the targeted spatio-spectral CR. To this end, a $1\times1$ convolutional layer maps the spatio-spectral features from $N$ channels to $\Gamma$ latent channels. We use the sigmoid activation function to constrain the latent space to the range 0 to 1.

The spatio-spectral CR achieved by our proposed model arises from the joint contributions of both spectral and spatial compression. It can be expressed as:

$$\mathrm{CR}=\mathrm{CR}_{\text{spec}}\cdot\mathrm{CR}_{\text{spat}}=\frac{C}{\Gamma}\cdot4^{S}, \tag{4}$$

where $\mathrm{CR}_{\text{spec}}=\frac{C}{\Gamma}$ denotes the spectral CR determined by the reduction of spectral channels from $C$ to $\Gamma$, and $\mathrm{CR}_{\text{spat}}=4^{S}$ corresponds to the spatial CR introduced through $S$ stages of spatial downsampling.
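Eq. (4) implies that the same target CR can be reached with different combinations of $S$ and $\Gamma$. The following sketch enumerates such combinations; the helper names, the band count of 128, and the limit of two spatial stages are hypothetical choices for illustration.

```python
from fractions import Fraction


def compression_ratio(bands, latent_channels, stages):
    """Spatio-spectral CR = (C / Gamma) * 4**S, following Eq. (4)."""
    return Fraction(bands, latent_channels) * 4 ** stages


def configurations_for(target_cr, bands, max_stages=2):
    """List (S, Gamma) pairs that hit target_cr exactly with an integer Gamma."""
    configs = []
    for stages in range(max_stages + 1):
        # Solve Eq. (4) for Gamma: Gamma = C * 4**S / CR.
        gamma = Fraction(bands) * 4 ** stages / target_cr
        if gamma.denominator == 1 and gamma >= 1:
            configs.append((stages, int(gamma)))
    return configs


# Hypothetical sensor with C = 128 bands and a target CR of 16.
print(compression_ratio(128, 8, 0))   # spectral compression only
print(compression_ratio(128, 32, 1))  # one spatial stage, more latent channels
print(configurations_for(16, bands=128))
```

In this example, $(S, \Gamma) = (0, 8)$, $(1, 32)$, and $(2, 128)$ all yield a CR of 16. Each additional spatial stage allows four times as many latent channels at the same CR, which is the trade-off between spectral and spatial compression that the adjustable parameters expose.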

II-D HyCASS CR Adapter Decoder Module

As the first step to perform reconstruction, the CR adapter decoder module $D_{\Psi^{\prime}}:\mathbb{R}^{\Sigma\times\Omega\times\Gamma}\rightarrow\mathbb{R}^{\Sigma\times\Omega\times N}$ of HyCASS, defined as:

\begin{equation}
D_{\Psi^{\prime}}\left(\mathbf{\xi}\right)=\text{LeakyReLU}\left(\text{Conv2D}_{1\times 1}^{\Gamma\rightarrow N}\left(\mathbf{\xi}\right)\right) \tag{5}
\end{equation}

applies a $1\times 1$ convolution that projects the channels from $\Gamma$ back to $N$ to match the NTU dimension, followed by a LeakyReLU.
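Both CR adapter modules (Eqs. (3) and (5)) are $1\times 1$ convolutions, i.e., the same linear map applied independently at every spatial position, followed by a pointwise activation. The sketch below spells this out in plain Python under stated assumptions: the helper names, the toy weights, and the LeakyReLU negative slope of 0.01 are illustrative choices that do not come from the paper.

```python
import math


def conv1x1(pixels, weight, bias):
    """Apply the same linear map (a 1x1 convolution) to every pixel.

    pixels: list of channel vectors, one per spatial position (length c_in each)
    weight: c_out rows of c_in coefficients; bias: c_out offsets
    """
    return [
        [sum(w * x for w, x in zip(row, px)) + b for row, b in zip(weight, bias)]
        for px in pixels
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def leaky_relu(v, negative_slope=0.01):  # slope is an assumed value
    return v if v >= 0 else negative_slope * v


def adapter_encode(features, weight, bias):
    """Eq. (3): N -> Gamma channels, squashed into (0, 1) by the sigmoid."""
    return [[sigmoid(v) for v in px] for px in conv1x1(features, weight, bias)]


def adapter_decode(latent, weight, bias):
    """Eq. (5): Gamma -> N channels, followed by LeakyReLU."""
    return [[leaky_relu(v) for v in px] for px in conv1x1(latent, weight, bias)]


# Toy example: 2 pixels, N = 3 feature channels, Gamma = 2 latent channels
features = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
enc_w, enc_b = [[1.0, 0.0, 0.5], [0.0, -1.0, 0.0]], [0.0, 0.1]
dec_w, dec_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]

latent = adapter_encode(features, enc_w, enc_b)
restored = adapter_decode(latent, dec_w, dec_b)
print(len(latent[0]), len(restored[0]))  # channel counts: Gamma, then N
```

Because only the channel dimension is touched, the spatial size $\Sigma\times\Omega$ is unchanged by either adapter; changing $\Gamma$ therefore changes only the spectral part of the CR.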

II-E HyCASS Spatial Decoder Module

Afterwards, the spatial decoder module $D_{\chi^{\prime}}:\mathbb{R}^{\Sigma\times\Omega\times N}\rightarrow\mathbb{R}^{H\times W\times N}$ of HyCASS, defined as:

\begin{equation}
\begin{cases}
D_{\chi^{\prime}}\left(\mathbf{\xi}\right)=g^{\left(S\right)}\left(\mathbf{\xi}\right),\quad S\in\mathbb{N}_{0}\\
g\left(\mathbf{\xi}\right)=\text{RSTB}\left(\text{LeakyReLU}\left(\text{Conv2D}_{3\times 3}^{\uparrow^{2}}\left(\mathbf{\xi}\right)\right)\right)
\end{cases} \tag{6}
\end{equation}

applies $S$ stacked NTUs, like the HyCASS spatial encoder module. In each NTU of the decoder, however, the 2D convolution is applied first to aggregate short-range spatial features and upsample the spatial dimensions. Then, the RSTB is applied for long-range spatial feature extraction.

II-F HyCASS Spectral Decoder Module

Finally, the HyCASS spectral decoder module $D_{\Phi^{\prime}}:\mathbb{R}^{H\times W\times N}\rightarrow\mathbb{R}^{H\times W\times C}$, defined as:

\begin{equation}
D_{\Phi^{\prime}}\left(\mathbf{\xi}\right)=\text{Sigmoid}\left(\text{Conv2D}_{1\times 1}^{N\rightarrow C}\left(\mathbf{\xi}\right)\right) \tag{7}
\end{equation}

projects the $N$ channels back to the original $C$ spectral bands using a 2D convolution with a $1\times 1$ kernel. A sigmoid activation constrains the reconstructed output intensities to the valid range of 0 to 1.
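To make the data flow through the three decoder modules concrete, the following sketch lists the tensor shape after each step. It assumes that every spatial stage changes height and width by a factor of two (consistent with $\text{CR}_{\text{spat}}=4^{S}$), so that $\Sigma=H/2^{S}$ and $\Omega=W/2^{S}$. The function name, the default $N=128$, and the example values are illustrative.

```python
def decoder_shapes(height, width, num_bands, latent_channels,
                   feature_channels=128, spatial_stages=2):
    """List the (rows, cols, channels) shape after each decoder step."""
    factor = 2 ** spatial_stages
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by 2**spatial_stages")
    rows, cols = height // factor, width // factor  # Sigma, Omega (assumed)
    shapes = [("latent", (rows, cols, latent_channels))]
    # CR adapter decoder: Gamma -> N channels, spatial size unchanged
    shapes.append(("CR adapter decoder", (rows, cols, feature_channels)))
    # Spatial decoder: S NTUs, each doubling height and width
    for stage in range(1, spatial_stages + 1):
        rows, cols = rows * 2, cols * 2
        shapes.append((f"spatial decoder NTU {stage}", (rows, cols, feature_channels)))
    # Spectral decoder: N -> C channels
    shapes.append(("spectral decoder", (rows, cols, num_bands)))
    return shapes


for name, shape in decoder_shapes(128, 128, 202, latent_channels=51, spatial_stages=2):
    print(f"{name:>24}: {shape}")
```

For $S=0$ the loop body is skipped and the decoder reduces to the two $1\times 1$ projections.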

III Dataset Description and Experimental Setup

III-A Dataset Description

Two HSI datasets were employed in our experiments. These datasets differ in several aspects, including spatial resolution, number of spectral bands, and dataset size, as summarized below.

III-A1 HySpecNet-11k

HySpecNet-11k [25] is a large-scale hyperspectral benchmark dataset constructed from 250 tiles acquired by the Environmental Mapping and Analysis Program (EnMAP) satellite [29]. It includes 11,483 non-overlapping HSIs, each of which consists of $128\times 128$ pixels and 202 bands with a GSD of 30 m (low spatial resolution) and a spectral range of 420–2,450 nm. The data are radiometrically, geometrically, and atmospherically corrected (i.e., the L2A water & land product). We used the recommended splits from [25] for the training, validation, and test sets, covering 70 %, 20 %, and 10 % of the HSIs, respectively. Fig. 3 illustrates example images from this dataset. We used HySpecNet-11k to show the effectiveness and generalization capability of our proposed model, particularly in the context of large-scale training with low spatial resolution HSIs.

Figure 3: An example of HSIs present in the HySpecNet-11k dataset [25].

III-A2 MLRetSet

MLRetSet [26] is a hyperspectral benchmark dataset created from high spatial resolution hyperspectral imagery with a GSD of 27.86 cm. The data were acquired during an airborne flight covering the Turkish towns Yenice and Yeşilkaya on 4 May 2019. The twelve acquired tiles were split into 3,840 non-overlapping HSIs of size $100\times 100$ pixels with 369 spectral bands each. We split the data into i) a training set that includes 70 % of the HSIs; ii) a validation set that includes 20 % of the HSIs; and iii) a test set that includes 10 % of the HSIs. Fig. 4 provides visual examples of typical scenes present in the MLRetSet dataset. We employed MLRetSet to demonstrate the effectiveness of our proposed model on HSIs with a high spatial resolution.

Figure 4: An example of HSIs present in the MLRetSet dataset [26].

III-B Experimental Setup

Our code was implemented in PyTorch based on the CompressAI [30] framework. In the case of MLRetSet, the HSIs were center-cropped to $96\times 96$ pixels to facilitate repeated spatial downsampling by factors of two, which is required for spatial compression models. For HySpecNet-11k, we followed the easy split as introduced in [25]. Training runs were carried out on a single NVIDIA A100 SXM4 80 GB GPU using the Adam optimizer [31]. For the loss function, we employed the mean squared error (MSE) defined as follows:

\begin{equation}
\text{MSE}(\mathbf{X},\mathbf{\hat{X}})=\frac{1}{H\cdot W\cdot C}\sum_{h,w,c}\left(\mathbf{X}(h,w,c)-\mathbf{\hat{X}}(h,w,c)\right)^{2}. \tag{8}
\end{equation}

We compare our proposed model with two traditional compression methods (JPEG2000 and PCA) and the following learning-based models: 1) 1D-CAE [19], a 1D convolutional autoencoder performing spectral compression; 2) SSCNet [21], a spatial compression network; 3) 3D-CAE [22], a 3D convolutional autoencoder jointly compressing spatial and spectral redundancies; and 4) HyCoT [24], a transformer-based AE exploiting spectral redundancies. Training epochs, learning rate (LR), and batch size (BS) were adjusted per model and dataset to optimize training time and GPU memory usage, while ensuring convergence of the loss function for each training run. Table I lists the specific training hyperparameters used for each model and dataset configuration. For 1D-CAE on HySpecNet-11k, the number of epochs was reduced to 250 for $\text{CR}\in\left\{8,16,32\right\}$ due to runtime limitations. For MLRetSet, we had to reduce the BS to 1 for 1D-CAE at $\text{CR}=32$ due to GPU memory constraints. Furthermore, we used the random sampling strategy from [24] to efficiently train HyCoT.

For HyCASS, we fixed the number of spatial encoder module channels to $N=128$ and set the Swin Transformer's window size to 8, consistent with the configuration used in the Transformer-based Image Compression (TIC) model [28]; all remaining parameters were aligned accordingly. The number of spatial stages $S$ was varied between 0 and 3 to assess the effect of different spatial compression levels. For zero spatial stages, we increased $N$ to 1,024 to compensate for the missing spatial stages in terms of model parameters and floating point operations (FLOPs). In this setting, we also applied the random sampling strategy from HyCoT [24] and increased the LR to $1\times 10^{-3}$, the BS to 64, and the number of epochs to 2,000 for HySpecNet-11k and 1,000 for MLRetSet. Note that while the targeted spatio-spectral CRs were $\text{CR}\in\left\{4,8,16,\dots,1024\right\}$, the achieved CRs may deviate because the spectral dimension is not divisible by powers of two.
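The deviation between targeted and achieved CRs can be illustrated for the spectral-only case ($S=0$) with a short sketch. The number of latent channels must be an integer, so the ideal value $C/\text{CR}$ has to be rounded. Rounding up is an assumption made here for illustration; the paper does not state the rule, and the function name is hypothetical.

```python
import math


def spectral_only_setting(num_bands, target_cr):
    """Return (ideal Gamma, integer Gamma, achieved CR) for zero spatial stages.

    Rounding Gamma up to the next integer is an assumption of this sketch.
    """
    ideal = num_bands / target_cr
    gamma = max(1, math.ceil(ideal))
    return ideal, gamma, num_bands / gamma


for target in (4, 8, 16, 32):
    ideal, gamma, achieved = spectral_only_setting(202, target)
    print(f"target CR {target:>2}: ideal Gamma {ideal:.4f}, "
          f"Gamma {gamma}, achieved CR {achieved:.4f}")
```

Under this assumption, the 202 bands of HySpecNet-11k yield $\Gamma=51$ and an achieved CR of about 3.96 for a target of 4, which agrees with the value of $\Gamma$ quoted in the ablation study; configurations with $S>0$ may follow a different rule.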

TABLE I: Training hyperparameters selected for each model on the HySpecNet-11k [25] and MLRetSet [26] datasets.

| Model | Epochs (HySpecNet-11k [25]) | LR (HySpecNet-11k [25]) | BS (HySpecNet-11k [25]) | Epochs (MLRetSet [26]) | LR (MLRetSet [26]) | BS (MLRetSet [26]) |
|---|---|---|---|---|---|---|
| 1D-CAE [19] | 500 | $1\times 10^{-4}$ | 2 | 100 | $1\times 10^{-4}$ | 2 |
| SSCNet [21] | 2,000 | $1\times 10^{-5}$ | 8 | 400 | $1\times 10^{-5}$ | 8 |
| 3D-CAE [22] | 1,000 | $1\times 10^{-4}$ | 2 | 250 | $1\times 10^{-4}$ | 2 |
| HyCoT [24] | 2,000 | $1\times 10^{-3}$ | 64 | 500 | $1\times 10^{-3}$ | 64 |
| HyCASS | 200 | $1\times 10^{-4}$ | 16 | 100 | $1\times 10^{-4}$ | 16 |

III-C Evaluation Metrics

For the evaluation of the HSI compression methods, we consider two kinds of metrics: metrics that measure the compression efficiency and metrics that measure the reconstruction quality. In our experiments, we use the CR to quantify the data reduction. The PSNR and the spectral angle (SA) are used to measure the fidelity of a reconstructed HSI. Given an original HSI $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$, its latent representation $\mathbf{Y}\in\mathbb{R}^{\Sigma\times\Omega\times\Gamma}$, and its reconstruction $\mathbf{\hat{X}}\in\mathbb{R}^{H\times W\times C}$, the metrics are defined as follows.

III-C1 Compression Ratio (CR)

The CR between an original HSI $\mathbf{X}$ with bit depth $N_{b}$ and its representation in the latent space $\mathbf{Y}$ with bit depth $\hat{N}_{b}$ after encoding can be expressed as follows:

\begin{equation}
\text{CR}\left(\mathbf{X},\mathbf{Y}\right)=\frac{N_{b}\cdot H\cdot W\cdot C}{\hat{N}_{b}\cdot\Sigma\cdot\Omega\cdot\Gamma}. \tag{9}
\end{equation}

A higher CR indicates greater compression of the hyperspectral data, which could result in increased loss of information during reconstruction.
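Eq. (9) can be evaluated directly from the two tensor shapes and bit depths, as in the minimal sketch below. The bit depths used in the example calls are hypothetical (the text does not state them); with equal bit depths the expression reduces to the channel and spatial ratios of Eq. (4).

```python
def compression_ratio(orig_shape, latent_shape, orig_bits, latent_bits):
    """Eq. (9): bits of the original HSI divided by bits of the latent."""
    h, w, c = orig_shape
    sigma, omega, gamma = latent_shape
    return (orig_bits * h * w * c) / (latent_bits * sigma * omega * gamma)


# Equal bit depths (assumed 32 bits): only the shapes matter.
print(compression_ratio((128, 128, 202), (128, 128, 51), 32, 32))
# Hypothetical case: halving the latent bit depth doubles the CR.
print(compression_ratio((128, 128, 202), (128, 128, 51), 32, 16))
```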

III-C2 Peak Signal-to-Noise Ratio (PSNR)

For measuring the reconstruction quality, we use the peak signal-to-noise ratio (PSNR) between the original HSI $\mathbf{X}$ and the reconstructed HSI $\mathbf{\hat{X}}$, which is defined as:

\begin{equation}
\text{PSNR}\left(\mathbf{X},\mathbf{\hat{X}}\right)=10\cdot\log_{10}\left(\frac{\text{MAX}^{2}}{\text{MSE}\left(\mathbf{X},\mathbf{\hat{X}}\right)}\right), \tag{10}
\end{equation}

where MAX denotes the maximum possible pixel value (e.g., 1.0 in the case of min-max normalization), and the MSE is defined as in Equation 8. A higher PSNR value indicates better reconstruction quality with less distortion.
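The MSE of Eq. (8) and the PSNR of Eq. (10) can be sketched in a few lines of plain Python. Here an HSI is represented as a list of pixel spectra ($H\cdot W$ entries with $C$ values each), which is a simplification chosen for this example; `max_value=1.0` assumes min-max normalized data.

```python
import math


def mse(original, reconstructed):
    """Eq. (8): mean squared error over all pixels and bands."""
    squared_errors = [
        (x - y) ** 2
        for spectrum, spectrum_hat in zip(original, reconstructed)
        for x, y in zip(spectrum, spectrum_hat)
    ]
    return sum(squared_errors) / len(squared_errors)


def psnr(original, reconstructed, max_value=1.0):
    """Eq. (10) in decibels; max_value = 1.0 assumes min-max normalized data."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / error)


original = [[0.2, 0.4, 0.6], [0.1, 0.5, 0.9]]
reconstructed = [[0.21, 0.39, 0.61], [0.11, 0.49, 0.91]]  # every value off by 0.01
print(round(psnr(original, reconstructed), 2))
```

An absolute error of 0.01 on every value corresponds to an MSE of $10^{-4}$ and therefore to a PSNR of 40 dB.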

III-C3 Spectral Angle (SA)

For some results, we also report the SA defined as:

\begin{equation}
\text{SA}\left(\mathbf{X},\mathbf{\hat{X}}\right)=\frac{1}{H\cdot W}\sum_{h,w}\frac{180}{\pi}\arccos\left(\frac{\sum_{c}\mathbf{X}\left(h,w,c\right)\cdot\mathbf{\hat{X}}\left(h,w,c\right)}{\sqrt{\sum_{c}\mathbf{X}\left(h,w,c\right)^{2}}\sqrt{\sum_{c}\mathbf{\hat{X}}\left(h,w,c\right)^{2}}}\right), \tag{11}
\end{equation}

which quantifies the average spectral similarity between all pixels of an original HSI $\mathbf{X}$ and a reconstructed HSI $\mathbf{\hat{X}}$. A smaller SA indicates higher spectral similarity, and the SA is inherently scale-invariant.
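A minimal sketch of the SA computation is given below, using the usual normalization by the Euclidean norms of both spectra so that the argument of the arccosine is the cosine of the angle between them. As before, an HSI is represented as a list of pixel spectra, and the function name is illustrative. The clamp to $[-1,1]$ is a numerical safeguard added for this example.

```python
import math


def spectral_angle(original, reconstructed):
    """Mean per-pixel spectral angle in degrees between two HSIs.

    Both inputs are lists of pixel spectra (H*W entries with C values each).
    Spectra are assumed to be non-zero.
    """
    angles = []
    for x, x_hat in zip(original, reconstructed):
        dot = sum(a * b for a, b in zip(x, x_hat))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in x_hat))
        cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
        angles.append(math.degrees(math.acos(cosine)))
    return sum(angles) / len(angles)


pixels = [[0.2, 0.4, 0.6], [0.1, 0.5, 0.9]]
scaled = [[2 * v for v in spectrum] for spectrum in pixels]
print(spectral_angle(pixels, scaled))              # close to 0: scaling keeps the angle
print(spectral_angle([[1.0, 0.0]], [[0.0, 1.0]]))  # orthogonal spectra
```

The first call illustrates the scale invariance: multiplying every spectrum by a constant leaves the angle essentially at zero, whereas the MSE and PSNR would change.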

IV Experimental Results

We conducted three sets of experiments, aiming at: 1) assessment of the effects of spectral and spatial compression within the proposed HyCASS model through an ablation study on two benchmark datasets; 2) comparison of our model’s effectiveness with traditional baselines and lossy learning-based state-of-the-art HSI compression models; and 3) qualitative analysis of the reconstruction results.

IV-A Ablation Study

In this subsection, we analyze the impact of varying the spatial stages $S$ and latent channels $\Gamma$ inside HyCASS on the reconstruction quality for multiple CRs. To assess generalization across different spatial resolutions, we report results on the HySpecNet-11k and the MLRetSet datasets. For each configuration, $S$ defines the spatial CR ($\text{CR}_{\text{spat}}$), while $\Gamma$ determines the spectral CR ($\text{CR}_{\text{spec}}$). Given $\text{CR}_{\text{spat}}$, $\text{CR}_{\text{spec}}$ is adjusted accordingly to match the targeted spatio-spectral CR.

IV-A1 HySpecNet-11k

Table II shows the results of the ablation study on the HySpecNet-11k dataset.

TABLE II: HyCASS results obtained by varying the spatial stages $S$ on the easy split test set of the HySpecNet-11k [25] dataset. $\text{CR}_{\text{spec}}$, $\text{CR}_{\text{spat}}$ and CR denote spectral, spatial and joint spatio-spectral compression ratio, respectively. Reconstruction quality is evaluated using PSNR and SA.

| $S$ | CR | $\text{CR}_{\text{spec}}$ | $\text{CR}_{\text{spat}}$ | PSNR (dB) ↑ | SA (°) ↓ |
|---|---|---|---|---|---|
| 0 | 3.9608 | 3.9608 | 1.0 | 56.444 | 1.3940 |
| 1 | 3.9608 | 0.9902 | 4.0 | 49.779 | 2.4409 |
| 2 | 3.9656 | 0.24785 | 16.0 | 48.447 | 2.6473 |
| 3 | 3.9669 | 0.061982812 | 64.0 | 44.647 | 3.3339 |
| 0 | 7.7692 | 7.692 | 1.0 | 55.155 | 1.5574 |
| 1 | 7.7692 | 1.9423 | 4.0 | 49.832 | 2.4431 |
| 2 | 7.7692 | 0.485575 | 16.0 | 48.084 | 2.7065 |
| 3 | 7.7692 | 0.121 | 64.0 | 44.788 | 3.2651 |
| 0 | 15.538 | 15.538 | 1.0 | 52.828 | 1.8364 |
| 1 | 15.538 | 3.8845 | 4.0 | 50.392 | 2.3918 |
| 2 | 15.538 | 0.971125 | 16.0 | 48.798 | 2.6387 |
| 3 | 15.538 | 0.24278125 | 64.0 | 46.057 | 3.0827 |
| 0 | 28.857 | 28.857 | 1.0 | 49.719 | 2.2678 |
| 1 | 28.857 | 7.21425 | 4.0 | 49.249 | 2.5456 |
| 2 | 28.857 | 1.8035625 | 16.0 | 48.031 | 2.7195 |
| 3 | 28.857 | 0.450890625 | 64.0 | 44.940 | 3.2424 |
| 0 | 50.50 | 50.50 | 1.0 | 45.862 | 3.1516 |
| 1 | 50.50 | 12.625 | 4.0 | 48.610 | 2.7177 |
| 2 | 50.50 | 3.15625 | 16.0 | 48.579 | 2.6440 |
| 3 | 50.50 | 0.7890625 | 64.0 | 45.916 | 3.1105 |
| 0 | 101.00 | 101.00 | 1.0 | 39.836 | 5.5225 |
| 1 | 101.00 | 25.25 | 4.0 | 45.969 | 3.155 |
| 2 | 101.00 | 6.3125 | 16.0 | 46.843 | 2.9066 |
| 3 | 101.00 | 1.578125 | 64.0 | 44.441 | 3.3200 |
| 0 | 202.00 | 202.00 | 1.0 | 32.971 | 12.513 |
| 1 | 202.00 | 50.5 | 4.0 | 43.347 | 3.7722 |
| 2 | 202.00 | 12.625 | 16.0 | 45.094 | 3.1558 |
| 3 | 202.00 | 3.15625 | 64.0 | 44.136 | 3.4310 |
| 1 | 404.00 | 101.0 | 4.0 | 41.051 | 4.5494 |
| 2 | 404.00 | 25.25 | 16.0 | 42.652 | 3.7228 |
| 3 | 404.00 | 6.3125 | 64.0 | 42.758 | 3.6519 |
| 1 | 808.00 | 202 | 4.0 | 36.385 | 7.4544 |
| 2 | 808.00 | 50.5 | 16.0 | 40.814 | 4.2197 |
| 3 | 808.00 | 12.625 | 64.0 | 41.455 | 3.9490 |

From the table, one can derive three key observations:

First, for CRs $<32$, HyCASS with zero spatial stages (i.e., spectral compression only) yields superior reconstruction performance compared to HyCASS with one or more spatial stages (i.e., spatio-spectral compression). For example, when $\text{CR}\approx 4$, HyCASS with zero spatial stages achieves a PSNR of 56.44 dB, while the reconstruction quality drops to 49.78 dB, 48.45 dB, and 44.65 dB for one, two, and three spatial stages, respectively. This can be attributed to the relatively high number of latent channels retained in this CR range by HyCASS models without spatial compression (e.g., $\Gamma=51$ for $\text{CR}\approx 4$), which provide sufficient spectral information for accurate reconstruction by the decoder without requiring any spatial feature aggregation. For HySpecNet-11k, spatial compression poses reconstruction challenges due to the limited spatial correlation caused by the low spatial resolution, leading to noticeably blurred reconstruction results. Interestingly, at low CRs, spatio-spectral HyCASS models tend to exhibit PSNR stagnation. This behavior suggests that the high number of latent channels in such configurations contains redundancy, allowing comparable reconstruction performance at significantly higher CRs.

Second, as the CR increases beyond 32, the performance of HyCASS models that rely solely on spectral compression with zero spatial stages diminishes rapidly. In contrast, HyCASS models that incorporate spatial compression (one or more spatial stages) maintain a higher quality of reconstruction at these compression levels. This behavior indicates that with higher CRs, where spectral compression reaches saturation, the inclusion of deeper spatial hierarchies becomes increasingly important for preserving structural and spectral fidelity. In particular, this trend persists even for CRs $>64$, where models with a higher number of spatial stages consistently achieve better reconstruction performance. These findings highlight the increasing importance of spatial compression in highly constrained compression scenarios.

Third, reconstruction quality generally decreases with increasing CRs, dropping from 56.44 dB at $\text{CR}=3.96$ to 41.46 dB at $\text{CR}=808$ with the respective optimal spatial stages. This trend is mirrored by increasing SA values, indicating that stronger compression leads to greater loss of spatial and spectral fidelity.
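The shift of the best-performing number of spatial stages with increasing CR can be read off Table II programmatically. The sketch below copies the PSNR values of four CR groups from the table and selects the stage count with the highest PSNR in each group; the dictionary layout and the function name are illustrative.

```python
# PSNR values (dB) copied from Table II for selected CR groups; inner keys are S.
psnr_by_cr = {
    4.0: {0: 56.444, 1: 49.779, 2: 48.447, 3: 44.647},    # CR ~ 4 group
    50.5: {0: 45.862, 1: 48.610, 2: 48.579, 3: 45.916},
    202.0: {0: 32.971, 1: 43.347, 2: 45.094, 3: 44.136},
    808.0: {1: 36.385, 2: 40.814, 3: 41.455},
}


def best_stage(psnr_by_stage):
    """Return the number of spatial stages with the highest PSNR."""
    return max(psnr_by_stage, key=psnr_by_stage.get)


for cr, results in psnr_by_cr.items():
    stage = best_stage(results)
    print(f"CR ~ {cr:g}: best S = {stage} ({results[stage]:.3f} dB)")
```

For these four groups the selected stage count rises from zero at $\text{CR}\approx 4$ to three at $\text{CR}=808$, in line with the first two observations.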

IV-A2 MLRetSet

Table III shows the ablation study of HyCASS on the MLRetSet dataset.

TABLE III: HyCASS results obtained by varying the number of spatial stages S on the test set of the MLRetSet [26] dataset. CR_spec, CR_spat and CR denote the spectral, spatial and joint spatio-spectral compression ratio, respectively. Reconstruction quality is evaluated using PSNR and SA.

S    CR       CR_spec      CR_spat  PSNR [dB] ↑  SA [°] ↓
0×   4.0109   4.0109       1.0      44.858       1.4403
1×   3.9784   0.9946       4.0      42.644       1.8489
2×   3.9704   0.24815      16.0     42.279       1.8878
3×   3.9704   0.0620375    64.0     40.341       2.1518
0×   7.8511   7.8511       1.0      44.893       1.4354
1×   7.8095   1.952375     4.0      42.617       1.8515
2×   7.7787   0.48616875   16.0     42.119       1.9140
3×   7.7710   0.121421875  64.0     40.975       2.0740
0×   16.043   16.043       1.0      44.855       1.4418
1×   15.702   3.9255       4.0      42.928       1.7858
2×   15.578   0.973625     16.0     42.234       1.9049
3×   15.547   0.242921875  64.0     40.358       2.1499
0×   30.75    30.75        1.0      44.786       1.4524
1×   28.941   7.23525      4.0      42.625       1.8523
2×   28.941   1.8088125    16.0     42.270       1.8920
3×   28.870   0.45109375   64.0     37.633       2.7232
0×   61.5     61.5         1.0      44.231       1.5482
1×   61.5     15.375       4.0      42.888       1.7798
2×   61.5     3.84375      16.0     42.279       1.8850
3×   61.5     0.9609375    64.0     41.282       2.0057
0×   123.0    123.0        1.0      42.930       1.7953
1×   123.0    30.75        4.0      42.443       1.8582
2×   123.0    7.6875       16.0     42.269       1.8906
3×   123.0    1.921875     64.0     41.419       1.9732
0×   184.5    184.5        1.0      40.977       2.164
1×   184.5    46.125       4.0      42.403       1.8436
2×   184.5    11.53125     16.0     42.169       1.9107
3×   184.5    2.8828125    64.0     41.126       2.0246
0×   369.0    369.0        1.0      33.548       4.8136
1×   369.0    92.25        4.0      42.169       1.9021
2×   369.0    23.0625      16.0     42.125       1.9105
3×   369.0    5.765625     64.0     41.970       1.8823
1×   738.0    184.5        4.0      40.431       2.2440
2×   738.0    46.125       16.0     41.433       2.0202
3×   738.0    11.53125     64.0     41.574       1.9447
1×   1,476.0  369.0        4.0      37.680       2.8319
2×   1,476.0  92.25        16.0     40.799       2.0822
3×   1,476.0  23.0625      64.0     41.037       2.0138

One can observe that spatio-spectral compression becomes effective primarily at CRs > 128. This behavior is attributed to the high number of spectral bands (369) in MLRetSet, which allows spectral compression to maintain its effectiveness across a broader range of CRs before saturation. Consequently, the performance disparity between spectral and spatio-spectral compression at lower CRs is less pronounced than for HySpecNet-11k. This suggests that spatial information plays a more significant role in MLRetSet, due to its higher spatial resolution.
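The transition from spectral to spatio-spectral compression can be read directly from Table III. As an illustrative sketch (the helper name is hypothetical and the rows are copied from Table III for four CR levels only), the following code selects, for each CR, the number of spatial stages with the highest PSNR:

```python
# (spatial stages S, CR, PSNR in dB) rows copied from Table III.
ROWS = [
    (0, 123.0, 42.930), (1, 123.0, 42.443), (2, 123.0, 42.269), (3, 123.0, 41.419),
    (0, 184.5, 40.977), (1, 184.5, 42.403), (2, 184.5, 42.169), (3, 184.5, 41.126),
    (0, 369.0, 33.548), (1, 369.0, 42.169), (2, 369.0, 42.125), (3, 369.0, 41.970),
    (1, 738.0, 40.431), (2, 738.0, 41.433), (3, 738.0, 41.574),
]


def best_stage_per_cr(rows):
    """Map each CR to the (stages, PSNR) pair with the highest PSNR."""
    best = {}
    for stages, cr, psnr in rows:
        if cr not in best or psnr > best[cr][1]:
            best[cr] = (stages, psnr)
    return best


for cr, (stages, psnr) in sorted(best_stage_per_cr(ROWS).items()):
    print(f"CR {cr:7.1f}: best S = {stages} ({psnr:.3f} dB)")
```

For these rows, zero spatial stages remain the best choice at CR = 123, whereas one or more spatial stages are preferable from CR = 184.5 onwards.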

We would like to note that HyCASS consistently yields lower PSNR values on MLRetSet compared to those obtained using HySpecNet-11k (Table II), indicating a decrease in reconstruction fidelity. For example, when CR ≈ 16, HyCASS with zero spatial stages reaches 52.83 dB on HySpecNet-11k but only 44.86 dB on MLRetSet. In contrast, the SA values indicate better preservation of the spectral shape for each reconstructed pixel (1.44° instead of 1.84°). This discrepancy is caused by the smaller size of the MLRetSet dataset, which inherently constrains generalization performance. Consequently, the trained models tend to prioritize learning the general spectral shape, resulting in averaged intensity offsets.
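That a reconstruction can have a low SA but a comparatively low PSNR follows from how the two metrics respond to intensity errors. The sketch below uses the standard definitions of the spectral angle and of PSNR with a peak value of 1.0; both the peak value and the toy spectrum are assumptions for illustration, and the intensity error is simplified to a uniform gain error rather than the averaged offsets described above.

```python
import math


def spectral_angle_deg(x, y):
    """Angle in degrees between two spectra treated as vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
    return math.degrees(math.acos(cos))


def psnr_db(x, y, peak=1.0):
    """PSNR in dB for an assumed peak value (1.0 for normalized data)."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


original = [0.20, 0.35, 0.50, 0.40, 0.25]   # toy spectrum (illustrative values)
scaled = [0.9 * v for v in original]        # same shape, 10 % intensity error

print(spectral_angle_deg(original, scaled))  # spectral shape is preserved
print(psnr_db(original, scaled))             # intensity error still lowers PSNR
```

Because the spectral angle ignores the vector length, a model that reproduces the spectral shape but misestimates the overall intensity is penalized by PSNR only.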

[Figure 5: two rate-distortion plots. x-axis: CR (ticks from 2 to 2,048); y-axis: PSNR [dB] (ticks from 35 to 60). Panels: (a) HySpecNet-11k easy split; (b) MLRetSet. Point annotations in the plots give the number of spatial stages: (a) 0×, 0×, 0×, 0×, 1×, 2×, 2×, 3×, 3×; (b) 0×, 0×, 0×, 0×, 0×, 0×, 1×, 3×, 3×.]
Figure 5: Rate-distortion performance on the test set of (a) HySpecNet-11k [25] (easy split) and (b) MLRetSet [26]. Rate is visualized as CR and distortion is given as PSNR in dB.

IV-B Comparison with Other Approaches

This subsection analyzes the effectiveness of HyCASS in terms of PSNR at different CRs, comparing it with several traditional baselines and state-of-the-art learning-based HSI compression models on both the HySpecNet-11k and the MLRetSet datasets. The compared methods are: 1) JPEG2000; 2) PCA; 3) 1D-CAE [19]; 4) SSCNet [21]; 5) 3D-CAE [22]; and 6) HyCoT [24]. Fig. 5 illustrates the corresponding rate-distortion curves for both datasets, where the rate is expressed as the CR and distortion is measured as PSNR in dB.

IV-B1 HySpecNet-11k

Fig. 5(a) shows that our proposed model achieves superior PSNR reconstruction quality across nearly all CRs when compared to state-of-the-art learning-based models. However, traditional methods demonstrate superior performance at CRs < 64. JPEG2000 performs better for CRs below 16, while PCA yields higher PSNR values even for CRs up to 64. At CR ≈ 101, HyCASS with two spatial stages reaches a PSNR of 46.84 dB, clearly surpassing the best-performing state-of-the-art model SSCNet, which achieves 43.597 dB, and also the spectral baseline PCA, which achieves 44.76 dB. Similarly, at a CR of approximately 1,024, HyCASS achieves a PSNR of 41.46 dB, outperforming SSCNet and JPEG2000, which reach 40.11 dB and 35.47 dB, respectively. These results suggest that for CRs < 64 traditional methods remain more effective, and the increased complexity of learning-based models may offer limited benefits in this CR range. For CRs > 64, our results demonstrate the superior effectiveness of our proposed model over the other learning-based models and traditional approaches.
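To put the reported PSNR gaps into perspective, a difference in PSNR can be converted into a ratio of mean squared errors. The sketch below assumes the standard definition PSNR = 10 log10(peak² / MSE) with the same peak value for both methods; since the reported values may be averaged over test images, the resulting ratios should be read as approximate. The function name is hypothetical.

```python
def mse_ratio_from_psnr_gap(psnr_a_db: float, psnr_b_db: float) -> float:
    """Return MSE_b / MSE_a, assuming both PSNR values share the same peak value."""
    return 10.0 ** ((psnr_a_db - psnr_b_db) / 10.0)


# PSNR values at CR of about 101, as reported in the text above.
print(round(mse_ratio_from_psnr_gap(46.84, 43.597), 2))  # HyCASS vs. SSCNet
print(round(mse_ratio_from_psnr_gap(46.84, 44.76), 2))   # HyCASS vs. PCA
```

Under this assumption, the gap of roughly 3.2 dB to SSCNet corresponds to about half the mean squared error.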

A comparative analysis of state-of-the-art learning-based models reveals that for CRs < 32, our proposed model, which exclusively performs spectral compression in this range without any spatial stages, shows limited improvement over the state of the art. This is because HyCASS is not specifically optimized for this CR range, resulting in performance close to that of HyCoT. For CRs > 256, the performance of HyCASS converges with that of SSCNet, a spatial compression model. SSCNet naturally excels under these conditions due to its design for strong spatial compression. It is worth noting that the performance of HyCASS at CRs > 256 could potentially be further enhanced by integrating additional spatial stages (e.g., beyond three), which would allow for an even more effective exploitation of spatial redundancies.

IV-B2 MLRetSet

The considered models are also evaluated on the MLRetSet dataset, which has a higher spatial resolution than the HySpecNet-11k dataset. Fig. 5(b) presents the corresponding rate-distortion curves, from which the following observations can be made. Traditional approaches are highly effective in low-compression regimes. At CRs < 64, JPEG2000 performs particularly well, surpassing all learning-based models and the PCA baseline. The reconstruction quality achieved by the spectral learning-based models 1D-CAE, HyCoT, and HyCASS with zero spatial stages shows only minor variation, indicating comparable performance for long-range and short-range spectral redundancy compression. SSCNet and 3D-CAE, which incorporate spatial compression, perform worse than the spectral compression models, especially for CRs < 64. This suggests that, when sufficient bitrate is available, exploiting spectral redundancies is more straightforward and yields better compression performance than incorporating spatial information. HyCASS achieves comparable performance to learning-based spectral compression models at CRs < 128, while at CRs > 128 it demonstrates advantages thanks to its adjustable design, which allows spatial compression to be integrated.

We would like to highlight that for MLRetSet most models demonstrate reduced reconstruction quality compared to HySpecNet-11k, since fewer training samples hinder the generalization of learning-based models. This observation extends to the PCA baseline, suggesting that spectral compression is inherently more challenging for this dataset. Consequently, learning-based spectral compression models such as 1D-CAE, HyCoT, and HyCASS with zero spatial stages also show a significant PSNR drop relative to HySpecNet-11k, particularly at CRs < 64. This can be attributed to the increased spatial complexity introduced by the higher spatial resolution, which diminishes the regularity of spectral patterns and thus limits their effective exploitation. In contrast, models employing spatial compression, such as SSCNet and HyCASS with one or more spatial stages, are more robust to the higher spatial resolution. These models lose less reconstruction quality when moving from HySpecNet-11k to MLRetSet, as they can exploit the higher spatial redundancies via dedicated 2D architectural components. Notably, 3D-CAE performs slightly better on MLRetSet than on HySpecNet-11k at all considered CRs. This may indicate that 3D kernels are more effective in capturing joint spatio-spectral redundancies in hyperspectral data with a high spatial resolution, enabling the model to better exploit spatial detail.

IV-C Visual Analysis

For a qualitative evaluation, the reconstruction outputs of the considered learning-based compression models are visually compared. Fig. 6 presents the error maps (derived from the SA for each pixel) of a reconstructed HySpecNet-11k image across three representative CRs. This provides a detailed visual assessment of the spatial distribution of reconstruction errors. Each case also reports the corresponding CR and overall PSNR for comprehensive comparison.

[Figure 6 panels: original HSI followed by SA error maps (color scale: SA from 0 to 10) for 1D-CAE [19], SSCNet [21], 3D-CAE [22], HyCoT [24] and HyCASS, with one row of panels per CR level.]

Row 1    1D-CAE [19]  SSCNet [21]  3D-CAE [22]  HyCoT [24]  HyCASS
PSNR     55.19 dB     37.53 dB     36.97 dB     56.30 dB    56.50 dB
CR       3.96         3.96         3.96         3.96        3.96

Row 2    1D-CAE [19]  SSCNet [21]  3D-CAE [22]  HyCoT [24]  HyCASS
PSNR     48.04 dB     37.64 dB     31.69 dB     49.11 dB    48.81 dB
CR       28.86        32.00        31.69        28.86       28.86

Row 3 (four panels)
PSNR     37.39 dB     33.95 dB     38.70 dB     42.49 dB
CR       101.00       126.75       101.00       101.00

Figure 6: SA error maps of state-of-the-art learning-based HSI compression models across three CRs, evaluated on an example image from the HySpecNet-11k [25] dataset.

At CR ≈ 4, the spectral compression models 1D-CAE, HyCoT, and HyCASS with zero spatial stages achieve relatively low SA errors across most of the scene, as also indicated by their high overall PSNR values exceeding 55 dB. Urban areas are reconstructed with high precision, showing SA < 2°, while water regions pose more difficulty, with SA > 5°. In contrast, SSCNet and 3D-CAE achieve lower PSNR values of approximately 37 dB. Consequently, their error maps contain considerable noise in the urban regions. Similar to the spectral compression models, these models also face difficulties in reconstructing water areas, whereas the forest region in the top-left shows the highest reconstruction fidelity. This suggests that at such a low CR, spatial compression adversely affects reconstruction quality, limiting the model’s ability to preserve fine spectral details of each pixel due to the spatial downsampling.

At CR ≈ 32, distortion increases for most methods, as can be seen by the overall PSNR and the SA error maps. SSCNet shows a slight improvement in reconstruction quality compared to its performance at CR ≈ 4, although overall quality remains low. Additionally, the models 1D-CAE, HyCoT, and HyCASS exhibit increased distortion in urban regions, with notably elevated SA values observed in the industrial area situated at the top center of the HSI.

At CR ≈ 128, reconstruction degradation becomes more visible. While SSCNet and 3D-CAE differ only minimally from their results at lower CRs, the spectral compression model HyCoT exhibits substantial errors in urban and water regions, although it maintains relatively strong performance in forested areas. In contrast, HyCASS, employing two spatial stages at this CR, exhibits increased error in urban regions but achieves the highest overall PSNR. The qualitative results indicate that learning-based models struggle to reconstruct water regions. Urban areas also become challenging to reconstruct, especially at higher CRs and when spatial compression is applied. In contrast, forest regions are consistently reconstructed with the highest quality, because they are frequently represented in the training data and have homogeneous spatial and spectral characteristics, which facilitate accurate learning and reconstruction. Overall, HyCASS consistently achieves an effective trade-off between CR and reconstruction fidelity across a wide range of CRs.

V Conclusion and Discussion

In this paper, we have introduced HyCASS, a novel learning-based HSI compression model designed for adjustable spatio-spectral HSI compression. To this end, the proposed model employs six modules: i) a spectral encoder module; ii) a spatial encoder module; iii) a CR adapter encoder module; iv) a CR adapter decoder module; v) a spatial decoder module; and vi) a spectral decoder module. Our model accomplishes: 1) spectral feature extraction (realized within the spectral encoder and decoder modules); 2) spatial compression with variable stages (realized within the spatial encoder and decoder modules); and 3) spectral compression with variable output channels (realized within the CR adapter encoder and CR adapter decoder modules). Unlike existing learning-based HSI compression models, HyCASS provides flexible control over the trade-off between spectral and spatial compression through its modular design. We have conducted extensive experiments on two benchmark datasets, including ablation studies, comparisons with state-of-the-art methods, and visual analyses of reconstruction errors. Our results demonstrate the effectiveness of HyCASS, and reveal how the balance between spectral and spatial compression affects reconstruction fidelity across different CRs for HSIs with both low and high spatial resolution. Our findings confirm the importance of adjustable spatio-spectral compression in addressing the diverse characteristics of hyperspectral data. It is worth emphasizing that with the continuous growth of hyperspectral data archives, spatio-spectral learning-based HSI compression is becoming increasingly important, as it enables significantly higher CRs. In this context, the proposed model offers a promising solution for efficient and flexible HSI compression. Based on our analyses, we have also derived a guideline to select the trade-off between spectral and spatial compression depending on the spatial resolution and overall CR as follows:

  • For HSI data with high spatial resolution: use spectral compression at low CRs; and spatio-spectral compression with greater spatial emphasis at medium and high CRs.

  • For HSI data with low spatial resolution: use spectral compression at low CRs; spatio-spectral compression with greater spectral emphasis at medium CRs; and spatio-spectral compression with greater spatial emphasis at high CRs.
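The guideline above can be encoded as a simple lookup. The following is a minimal sketch with hypothetical names; because the guideline does not specify numeric boundaries between low, medium and high CRs, the CR regime is passed in as a label rather than derived from a threshold.

```python
# Recommendation per (spatial resolution, CR regime), transcribed from the guideline.
GUIDELINE = {
    ("high", "low"): "spectral compression",
    ("high", "medium"): "spatio-spectral compression with greater spatial emphasis",
    ("high", "high"): "spatio-spectral compression with greater spatial emphasis",
    ("low", "low"): "spectral compression",
    ("low", "medium"): "spatio-spectral compression with greater spectral emphasis",
    ("low", "high"): "spatio-spectral compression with greater spatial emphasis",
}


def recommend(spatial_resolution: str, cr_regime: str) -> str:
    """Return the recommended compression strategy for the given setting."""
    key = (spatial_resolution.lower(), cr_regime.lower())
    if key not in GUIDELINE:
        raise ValueError(f"unknown combination: {key}")
    return GUIDELINE[key]


print(recommend("low", "medium"))
print(recommend("high", "medium"))
```

Mapping a concrete CR to one of the three regimes is left to the user, since the appropriate boundaries depend on the dataset, as the two ablation studies indicate.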

As a final remark, we would like to note that the development of foundation models (FMs) has attracted great attention in RS. FMs are usually pre-trained on large-scale datasets and then fine-tuned for specific downstream tasks such as image classification, segmentation, or change detection. We believe that leveraging the generalization capabilities of FMs for HSI compression can lead to more robust and effective compression performance. As future work, we plan to investigate the use of FMs as a backbone for HSI compression to improve reconstruction fidelity and enable adaptation across diverse sensors and acquisition conditions.

References

  • [1] F.-C. Lin, Y.-S. Shiu, P.-J. Wang, U.-H. Wang, J.-S. Lai, and Y.-C. Chuang, “A model for forest type identification and forest regeneration monitoring based on deep learning and hyperspectral imagery,” Ecological Informatics, vol. 80, p. 102507, 2024.
  • [2] A. Fabbretto, M. Bresciani, A. Pellegrino, K. Alikas, M. Pinardi, S. Mangano, R. Padula, and C. Giardino, “Tracking water quality and macrophyte changes in Lake Trasimeno (Italy) from spaceborne hyperspectral imagery,” Remote Sensing, vol. 16, no. 10, p. 1704, 2024.
  • [3] D. Spiller, A. Carbone, S. Amici, K. Thangavel, R. Sabatini, and G. Laneve, “Wildfire detection using convolutional neural networks and PRISMA hyperspectral imagery: A spatial-spectral analysis,” Remote Sensing, vol. 15, no. 19, p. 4855, 2023.
  • [4] I. Masari, G. Moser, and S. B. Serpico, “Manifold learning and deep generative networks for heterogeneous change detection from hyperspectral and synthetic aperture radar images,” IEEE Geoscience and Remote Sensing Letters, 2024.
  • [5] C. Gomes, I. Wittmann, D. Robert, J. Jakubik, T. Reichelt, S. Maurogiovanni, R. Vinge, J. Hurst, E. Scheurer, R. Sedona et al., “Lossy neural compression for geospatial analytics: A review,” IEEE Geoscience and Remote Sensing Magazine, 2025.
  • [6] A. Altamimi and B. Ben Youssef, “Lossless and near-lossless compression algorithms for remotely sensed hyperspectral images,” Entropy, vol. 26, no. 4, p. 316, 2024.
  • [7] Y. Dua, V. Kumar, and R. S. Singh, “Comprehensive review of hyperspectral image compression algorithms,” Optical Engineering, vol. 59, no. 9, p. 090902, 2020.
  • [8] J. Mielikainen and P. Toivanen, “Clustered DPCM for the lossless compression of hyperspectral images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 41, no. 12, pp. 2943–2946, 2003.
  • [9] M. Hernández-Cabronero, A. B. Kiely, M. Klimesh, I. Blanes, J. Ligo, E. Magli, and J. Serra-Sagrista, “The CCSDS 123.0-B-2 “low-complexity lossless and near-lossless multispectral and hyperspectral image compression” standard: A comprehensive review,” IEEE Geoscience and Remote Sensing Magazine, vol. 9, no. 4, pp. 102–119, 2021.
  • [10] Q. Du and J. E. Fowler, “Hyperspectral image compression using JPEG2000 and principal component analysis,” IEEE Geoscience and Remote Sensing Letters, vol. 4, no. 2, pp. 201–205, 2007.
  • [11] R. J. Yadav and M. Nagmode, “Compression of hyperspectral image using PCA–DCT technology,” in Innovations in Electronics and Communication Engineering. Springer, 2018, pp. 269–277.
  • [12] P. L. Dragotti, G. Poggi, and A. R. Ragozini, “Compression of multispectral images by three-dimensional SPIHT algorithm,” IEEE Transactions on Geoscience and Remote Sensing, vol. 38, no. 1, pp. 416–428, 2000.
  • [13] X. Tang and W. A. Pearlman, “Three-dimensional wavelet-based compression of hyperspectral images,” in Hyperspectral Data Compression. Springer, 2006, pp. 273–308.
  • [14] D. Valsesia, T. Bianchi, and E. Magli, “Onboard deep lossless and near-lossless predictive coding of hyperspectral images with line-based attention,” IEEE Transactions on Geoscience and Remote Sensing, 2024.
  • [15] M. H. P. Fuchs, A. P. Byju, A. Walda, B. Rasti, and B. Demir, “Generative adversarial networks for spatio-spectral compression of hyperspectral images,” in IEEE Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing, 2024, pp. 1–5.
  • [16] F. Mentzer, G. D. Toderici, M. Tschannen, and E. Agustsson, “High-fidelity generative image compression,” Advances in Neural Information Processing Systems, vol. 33, pp. 11 913–11 924, 2020.
  • [17] S. Rezasoltani and F. Z. Qureshi, “Hyperspectral image compression using sampling and implicit neural representations,” IEEE Transactions on Geoscience and Remote Sensing, 2024.
  • [18] N. Zhao, T. Pan, Z. Li, E. Chen, and L. Zhang, “The paradigm shift in hyperspectral image compression: A neural video representation methodology,” Remote Sensing, vol. 17, no. 4, p. 679, 2025.
  • [19] J. Kuester, W. Gross, and W. Middelmann, “1D-convolutional autoencoder based hyperspectral data compression,” International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 43, pp. 15–21, 2021.
  • [20] J. Kuester, W. Gross, S. Schreiner, W. Middelmann, and M. Heizmann, “Adaptive two-stage multisensor convolutional autoencoder model for lossy compression of hyperspectral data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–22, 2023.
  • [21] R. La Grassa, C. Re, G. Cremonese, and I. Gallo, “Hyperspectral data compression using fully convolutional autoencoder,” Remote Sensing, vol. 14, no. 10, p. 2472, 2022.
  • [22] Y. Chong, L. Chen, and S. Pan, “End-to-end joint spectral–spatial compression and reconstruction of hyperspectral images using a 3D convolutional autoencoder,” Journal of Electronic Imaging, vol. 30, no. 4, p. 041403, 2021.
  • [23] N. Sprengel, M. H. P. Fuchs, and B. Demir, “Learning-based hyperspectral image compression using a spatio-spectral approach,” EGU General Assembly, 2024.
  • [24] M. H. P. Fuchs, B. Rasti, and B. Demir, “HyCoT: A transformer-based autoencoder for hyperspectral image compression,” in IEEE Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing, 2024, pp. 1–5.
  • [25] M. H. P. Fuchs and B. Demir, “HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods,” in IEEE International Geoscience and Remote Sensing Symposium, 2023, pp. 1779–1782.
  • [26] F. Ömrüuzun, Y. Yardımcı Çetin, U. M. Leloğlu, and B. Demir, “A novel semantic content-based retrieval system for hyperspectral remote sensing imagery,” Remote Sensing, vol. 16, no. 8, p. 1462, 2024.
  • [27] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
  • [28] M. Lu, P. Guo, H. Shi, C. Cao, and Z. Ma, “Transformer-based image compression,” in Data Compression Conference, 2022, p. 469.
  • [29] L. Guanter, H. Kaufmann, K. Segl, S. Foerster, C. Rogass, S. Chabrillat, T. Kuester, A. Hollstein, G. Rossner, C. Chlebek et al., “The EnMAP spaceborne imaging spectroscopy mission for earth observation,” Remote Sensing, vol. 7, no. 7, pp. 8830–8857, 2015.
  • [30] J. Bégaint, F. Racapé, S. Feltman, and A. Pushparaja, “CompressAI: A PyTorch library and evaluation platform for end-to-end compression research,” arXiv preprint arXiv:2011.03029, 2020.
  • [31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
Martin Hermann Paul Fuchs received the B.Sc. and M.Sc. degrees in electrical engineering from Technische Universität Berlin, Berlin, Germany, in 2018 and 2021, respectively. He is currently pursuing a Ph.D. degree in the Remote Sensing Image Analysis (RSiM) group at the Faculty of Electrical Engineering and Computer Science, TU Berlin. His research interests lie at the intersection of remote sensing and deep learning, with a particular focus on hyperspectral imaging and compression.
Behnood Rasti (M’12–SM’19) received the B.Sc. and M.Sc. degrees in electronics and electrical engineering from the Electrical Engineering Department, University of Guilan, Rasht, Iran, in 2006 and 2009, respectively. He was a valedictorian of his M.Sc. class in 2009. He received his Ph.D. degree in Electrical and Computer Engineering from the University of Iceland, Reykjavik, Iceland, in 2014. From 2015 to 2016, he served as a postdoctoral researcher in the Electrical and Computer Engineering Department at the University of Iceland. He subsequently became a lecturer in the Center for Engineering Technology and Applied Sciences, Department of Electrical and Computer Engineering, from 2016 to 2019. Dr. Rasti was a Humboldt Research Fellow in 2020 and 2021, and a Principal Research Associate with Helmholtz Zentrum Dresden-Rossendorf (HZDR), Dresden, Germany, from 2022 to 2023. He is currently a Senior Research Scientist at the Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin, and the Berlin Institute for the Foundations of Learning and Data, Berlin, Germany. His research interests include machine learning, deep learning, signal and image processing, remote sensing, Earth observation, and artificial intelligence. Dr. Rasti was a recipient of the Doctoral Grant of the University of Iceland Research Fund “The Eimskip University Fund” in 2013 and the “Alexander von Humboldt Research Fellowship Grant” in 2019. He serves as an Associate Editor for the IEEE Geoscience and Remote Sensing Letters (GRSL).
Begüm Demir (S’06–M’11–SM’16) received the B.Sc., M.Sc., and Ph.D. degrees in electronic and telecommunication engineering from Kocaeli University, Kocaeli, Turkey, in 2005, 2007, and 2010, respectively. She is currently a Full Professor and the founding head of the Remote Sensing Image Analysis (RSiM) group at the Faculty of Electrical Engineering and Computer Science, TU Berlin, and the head of the Big Data Analytics for Earth Observation research group at the Berlin Institute for the Foundations of Learning and Data (BIFOLD). Her research activities lie at the intersection of machine learning, remote sensing, and signal processing. Specifically, she performs research in the field of processing and analysis of large-scale Earth observation data acquired by airborne and satellite-borne systems. She was awarded the prestigious ‘2018 Early Career Award’ by the IEEE Geoscience and Remote Sensing Society for her research contributions in machine learning for information retrieval in remote sensing. In 2018, she received a Starting Grant from the European Research Council (ERC) for her project “BigEarth: Accurate and Scalable Processing of Big Data in Earth Observation”. She is an IEEE Senior Member and a Fellow of the European Lab for Learning and Intelligent Systems (ELLIS). Dr. Demir is a Scientific Committee member of several international conferences and workshops. She is a referee for several journals, such as the PROCEEDINGS OF THE IEEE, the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, the IEEE TRANSACTIONS ON IMAGE PROCESSING, Pattern Recognition, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, and the International Journal of Remote Sensing, and for several international conferences. Currently, she is an Associate Editor for the IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE.