HyCASS: Adjustable Spatio-Spectral Hyperspectral Image Compression Network
EnMAP: Environmental Mapping and Analysis Program
OHID-1: Orbita Hyperspectral Images Dataset-1
L2A: Level 2A
RS: remote sensing
EO: Earth observation
CV: computer vision
GPU: graphics processing unit
RGB: red, green and blue
HSI: hyperspectral image
CNN: convolutional neural network
ANN: artificial neural network
SPIHT: set partitioning in hierarchical trees
SPECK: set partitioning embedded block
PCA: principal component analysis
DCT: discrete cosine transform
KLT: Karhunen–Loève transform
SE: Squeeze and Excitation
Swin: Shifted windows
RSTB: Residual Swin Transformer Block
NTU: Neural Transformation Unit
FE: feature embedding
FU: feature unembedding
STL: Swin Transformer layer
WA: window attention
SWA: shifted window attention
FM: foundation model
PSNR: peak signal-to-noise ratio
SA: spectral angle
MSE: mean squared error
CR: compression ratio
bpppc: bits per pixel per channel
$\mathrm{dB}$: decibels
GSD: ground sample distance
FLOPs: floating point operations
LR: learning rate
BS: batch size
LeakyReLU: leaky rectified linear unit
PReLU: parametric rectified linear unit
DPCM: Differential Pulse Code Modulation
AE: autoencoder
VAE: variational autoencoder
CAE: convolutional autoencoder
GAN: generative adversarial network
INR: Implicit Neural Representations
A1D-CAE: Adaptive 1-D Convolutional Autoencoder
1D-CAE: 1D-Convolutional Autoencoder
SSCNet: Spectral Signals Compressor Network
3D-CAE: 3D Convolutional Auto-Encoder
LineRWKV: Line Receptance Weighted Key Value
HyCoT: Hyperspectral Compression Transformer
TIC: Transformer-based Image Compression
S2C-Net: Spatio-Spectral Compression Network
HiFiC: High Fidelity Compression
MLP: multilayer perceptron
1D: one-dimensional
2D: two-dimensional
3D: three-dimensional
LN: layer normalization
RSiM: Remote Sensing Image Analysis
BIFOLD: Berlin Institute for the Foundations of Learning and Data

Adjustable Spatio-Spectral Hyperspectral Image Compression Network

Martin Hermann Paul Fuchs, Behnood Rasti, and Begüm Demir

Martin Hermann Paul Fuchs, Behnood Rasti, and Begüm Demir are with the Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin, 10623 Berlin, Germany (e-mail: m.fuchs@tu-berlin.de; behnood.rasti@tu-berlin.de; demir@tu-berlin.de). Behnood Rasti and Begüm Demir are also with the Berlin Institute for the Foundations of Learning and Data (BIFOLD), 10623 Berlin, Germany.

Abstract

With the rapid growth of hyperspectral data archives in remote sensing (RS), the need for efficient storage has become essential, driving significant attention toward learning-based hyperspectral image (HSI) compression. However, the individual and joint effects of spectral and spatial compression on learning-based HSI compression have not yet been comprehensively investigated. Conducting such an analysis is crucial for understanding how the exploitation of spectral, spatial, and joint spatio-spectral redundancies affects HSI compression. To address this issue, we propose Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a learning-based model designed for adjustable HSI compression in both spectral and spatial dimensions. HyCASS consists of six main modules: 1) spectral encoder; 2) spatial encoder; 3) compression ratio (CR) adapter encoder; 4) CR adapter decoder; 5) spatial decoder; and 6) spectral decoder module. The modules employ convolutional layers and transformer blocks to capture both short-range and long-range redundancies. Experimental results on two HSI benchmark datasets demonstrate the effectiveness of our proposed adjustable model compared to existing learning-based compression models. Based on our results, we establish a guideline for effectively balancing spectral and spatial compression across different CRs, taking into account the spatial resolution of the HSIs. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hycass.

Index Terms:

Hyperspectral image compression, adjustable spatio-spectral compression, deep learning, remote sensing.

I Introduction

Hyperspectral sensors capture images spanning hundreds of continuous bands across the electromagnetic spectrum. Fine spectral information provided in HSIs enables the identification and differentiation of materials within a scene. Hyperspectral sensors, mounted on satellites, airplanes, and drones, enable a wide range of RS applications, including forest monitoring [1], water quality assessment [2], wildfire detection [3], and flood mapping [4]. The continuous improvement in hyperspectral sensors has enabled them to extract increasingly detailed spectral and spatial information, which is essential for advanced analysis. However, the substantial volume of data produced by these sensors presents significant challenges in terms of storage, transmission, and processing. An emerging research area focuses on the efficient compression of hyperspectral data to preserve crucial spectral and spatial information content for subsequent analysis [5].

Many HSI compression methods have been presented in the literature. Generally, they can be categorized into three classes: i) lossless; ii) near-lossless; and iii) lossy HSI compression. Each category offers a unique trade-off between data preservation and compression efficiency. Lossless HSI compression ensures a perfect reconstruction of the original data without any loss of information and is therefore particularly important for tasks with zero tolerance for data degradation. However, lossless HSI compression methods typically only achieve CRs of $2$ to $4$ [6]. This restriction limits their applicability in scenarios with strict bandwidth or storage space constraints that require significantly higher CRs. Near-lossless HSI compression achieves higher CRs than lossless compression while introducing minimal distortion between the original and reconstructed HSIs. The maximum error is upper-bounded, ensuring controlled deviation in the reconstructed HSIs. However, small errors may accumulate, potentially impacting applications that require absolute spectral precision. Furthermore, the achievable CRs remain limited compared to lossy compression. Lossy HSI compression offers significant advantages, particularly in achieving high CRs, which is essential for applications with strict bandwidth or storage limitations [5]. By selectively discarding less important information, lossy compression efficiently reduces data size while preserving key features, making it ideal for large-scale hyperspectral data transmission, storage, and processing. Although lossy compression introduces some degree of information loss, the degradation is typically negligible and does not substantially compromise the usability of the data in most practical applications.

From a methodological point of view, HSI compression can be grouped into two categories: i) traditional methods; and ii) learning-based methods. Traditional HSI compression methods have been extensively investigated in RS with predictive coding emerging as a common approach [7]. Predictive coding takes advantage of both spectral and spatial redundancies by predicting pixel values based on contextual information and encoding only the residuals between predicted and actual values for efficient storage or transmission. For example, in [8], a clustered Differential Pulse Code Modulation (DPCM) compression method, which clusters spectra and calculates an optimized predictor for each cluster, is proposed for HSIs. After linear prediction of a spectrum, the difference is entropy-coded using an adaptive entropy coder for each of the clusters. Another prominent implementation of predictive coding is the CCSDS 123.0-B-2 standard [9] for lossless HSI compression that employs adaptive linear prediction to minimize redundancies in HSIs. Its low complexity and flexible architecture, coupled with the capability to achieve near-lossless compression through closed-loop quantization, make it well-suited for deployment in onboard RS systems. Although prediction-based methods are widely used for lossless and near-lossless HSI compression, they cannot generate meaningful latent representations, and their autoregressive functionality results in a slow processing speed.
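The residual-coding idea behind predictive coding can be illustrated with a deliberately simple sketch. It assumes a previous-sample predictor along the spectrum of a single pixel and made-up integer values; it is neither the clustered DPCM of [8] nor the adaptive predictor of CCSDS 123.0-B-2, and the entropy coder is omitted.

```python
def dpcm_encode(spectrum):
    """Return residuals of a previous-sample predictor (first sample predicted as 0)."""
    residuals = []
    previous = 0
    for value in spectrum:
        residuals.append(value - previous)  # only the prediction error is stored
        previous = value
    return residuals


def dpcm_decode(residuals):
    """Invert dpcm_encode by accumulating the residuals."""
    spectrum = []
    previous = 0
    for residual in residuals:
        previous = previous + residual
        spectrum.append(previous)
    return spectrum


# Hypothetical integer values of one pixel across six adjacent bands.
spectrum = [1020, 1034, 1041, 1039, 1050, 1062]
residuals = dpcm_encode(spectrum)      # [1020, 14, 7, -2, 11, 12]
reconstructed = dpcm_decode(residuals)
```

Apart from the first sample, the residuals are much smaller than the original values, which is what makes them cheaper to entropy-code. The sample-by-sample loop in the decoder also hints at why autoregressive schemes are slow.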

In contrast, traditional transform coding methods excel at extracting compact latent representations from hyperspectral data and are frequently used for lossy compression. They project hyperspectral data into a lower-dimensional, decorrelated latent space using mathematical transformations. The resulting reduced number of coefficients is subsequently quantized, introducing some loss of information, and then entropy-coded. Several traditional transform-coding methods have been proposed in RS. As an example, in [10], principal component analysis (PCA) is combined with JPEG2000 for joint spatio-spectral compression of HSIs, whereas PCA is applied followed by discrete cosine transform (DCT) in [11]. In [12], three-dimensional (3D) transform coding is achieved by applying wavelet transformation in the spatial dimensions and Karhunen–Loève transform (KLT) in the spectral dimension, followed by 3D-SPIHT for efficient lossy compression. In [13], 3D-SPECK is introduced, which takes advantage of the 3D wavelet transform to efficiently encode HSIs by compressing the redundancies in all three dimensions. Despite their effectiveness in extracting compact features and achieving high CRs, traditional methods often rely on hand-crafted transformations, which limit their ability to fully exploit the rich spatio–spectral structure of hyperspectral data. Consequently, these methods may not generalize well when applied to hyperspectral data captured under different conditions, sensor types, or scene characteristics.
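The transform, quantize, and reconstruct pipeline described above can be sketched on a single spectrum. As a simplification, the sketch uses a 1D orthonormal DCT along the spectral axis instead of the PCA, KLT, or wavelet transforms of the cited works; the coefficient count `keep`, the quantization step `step`, and the sample values are illustrative assumptions, and entropy coding is left out.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1D signal."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        signal.append(total)
    return signal


def transform_code(signal, keep, step):
    """Keep the first `keep` coefficients, quantize them uniformly, and reconstruct."""
    coeffs = dct(signal)
    quantized = [round(c / step) for c in coeffs[:keep]]  # integers that would be entropy-coded
    dequantized = [q * step for q in quantized] + [0.0] * (len(signal) - keep)
    return quantized, idct(dequantized)


# Hypothetical smooth reflectance spectrum with eight bands.
spectrum = [0.10, 0.12, 0.15, 0.19, 0.22, 0.24, 0.25, 0.25]
quantized, reconstructed = transform_code(spectrum, keep=4, step=0.01)
max_error = max(abs(a - b) for a, b in zip(spectrum, reconstructed))
```

The information loss comes from two places, the discarded coefficients and the rounding of the retained ones, while the transform itself is fixed in advance rather than learned from data.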

To overcome these limitations, the development of learning-based HSI compression has recently attracted great attention in RS. Learning-based HSI compression methods leverage deep neural networks to automatically learn hierarchical and data-adaptive representations from large-scale training datasets, allowing more effective characterization of complex spectral and spatial redundancies. These methods also lead to an improved rate–distortion performance and better generalization across different scenes [5]. As an example, in [14] the LineRWKV method that enhances CCSDS 123.0-B-2 [9] by introducing a learning-based predictor is presented. LineRWKV achieves superior lossless and near-lossless reconstruction performance compared to CCSDS 123.0-B-2 at the cost of increased training time and computational complexity. In [15], the authors introduce two generative adversarial network (GAN)-based models designed for spatio-spectral compression of HSIs. These models extend the High Fidelity Compression (HiFiC) framework [16] by incorporating: i) Squeeze and Excitation (SE) blocks; and ii) 3D convolutions to better exploit spectral redundancies alongside spatial redundancies. Although these models are capable of achieving extremely high CRs, exceeding $10^{4}$, their generative nature can lead to the synthesis of unrealistic spectral and spatial content, potentially compromising the fidelity of the reconstructed data. In [17], a method for HSI compression using Implicit Neural Representations (INR) is presented, where a multilayer perceptron (MLP) network learns to map pixel locations to pixel intensities. The weights of the learned model are stored or transmitted to achieve compression. In [18], a neural video representation approach is proposed. HSIs are treated as a stream of video data, where each spectral band represents a frame of information, and variances between spectral bands represent transformations between frames. 
Their approach utilizes the spectral band index and the spatial coordinate index as its input to perform network overfitting. A fundamental limitation of neural representation approaches lies in their training procedure, which involves overfitting a separate neural network to each HSI. This instance-specific optimization results in substantial computational costs, limiting the practicality of large-scale data processing.

Most state-of-the-art learning-based HSI compression methods adopt the autoencoder (AE) architecture, where an encoder network compresses the input data into a compact latent representation, and a decoder network reconstructs the data from that. This structure enables end-to-end optimization and allows the networks to learn efficient, data-driven representations of spectral and spatial information. Existing AE models differ primarily in how they process the spectral and spatial dimensions by using one-dimensional (1D), two-dimensional (2D), and 3D convolutional layers, to balance compression efficiency, reconstruction quality, and computational complexity. As an example, in [19, 20], 1D-Convolutional Autoencoder (1D-CAE) is presented, which compresses the spectral content without considering spatial redundancies by stacking multiple blocks of 1D convolutions, pooling layers, and leaky rectified linear unit (LeakyReLU) activations. Although high-quality reconstructions can be achieved with this model, the pooling layers limit the achievable CRs to $2^{n}$ , where $n$ is the number of poolings. Another limitation of this model is the increasing computational complexity with higher CRs due to the deeper network architecture. In [21], Spectral Signals Compressor Network (SSCNet) extracts spatial features via 2D convolutional kernels. 2D max pooling layers introduce spatial compression, while the final CR is adapted via the latent channels within the bottleneck. Although SSCNet enables significantly higher CRs, this comes at the cost of spatially blurred image reconstructions. In [22], 3D Convolutional Auto-Encoder (3D-CAE) is introduced for joint compression of spatio-spectral redundancies via 3D convolutional kernels. Additionally, residual blocks allow gradients to flow through the network directly and improve the model’s performance. However, the 3D kernels significantly increase computational complexity. 
In [23], Spatio-Spectral Compression Network (S2C-Net) is introduced. Initially, a pixelwise AE is pre-trained to capture the essential spectral features of the hyperspectral data. To enhance spatial redundancy removal, a spatial AE network is added to the bottleneck layer of the spectral AE. This dual-layer architecture allows the model to learn both spectral and spatial representations effectively. The entire model is then trained using a mixed loss function that combines reconstruction errors from both the spectral and spatial components. Although this model achieves state-of-the-art performance for high CRs, it falls short in optimally balancing the trade-off between spectral and spatial CR. This limitation suggests room for improvement in achieving more balanced compression across both dimensions. In [24], the authors propose Hyperspectral Compression Transformer (HyCoT), a transformer-based AE designed for pixelwise compression of HSIs that exploits long-range spectral redundancies. The model significantly reduces computational complexity through random sampling and independent pixelwise processing, without notable degradation in reconstruction quality. However, similar to other spectral compression models, HyCoT does not exploit spatial redundancies, thus limiting the achievable CRs.

Existing learning-based HSI compression models are limited in their ability to address several fundamental challenges associated with hyperspectral data, including varying spatial resolutions, high spectral dimensionality, and the need for adjustability across a broad range of CRs. In particular, these models often lack mechanisms to dynamically exploit spectral and spatial redundancies under specific compression requirements and sensor characteristics. As a result, current approaches fall short in terms of scalability, generalization, and their ability to balance compression efficiency, reconstruction fidelity, and computational complexity.

To overcome the above-mentioned critical issues, in this paper, we introduce HyCASS. The proposed model aims to enable flexible spatio-spectral compression. To this end, HyCASS employs six modules: i) a spectral encoder module; ii) a spatial encoder module; iii) a CR adapter encoder module; iv) a CR adapter decoder module; v) a spatial decoder module; and vi) a spectral decoder module. The novelty of the proposed model consists of the following: 1) a spectral feature extraction that captures spectral redundancies across the whole spectrum of each pixel independently (realized within the spectral encoder module and spectral decoder module); 2) an adjustable number of spatial stages exploiting both short- and long-range spatial redundancies to control spatial compression (realized within the spatial encoder module and spatial decoder module); and 3) an adjustable latent channel dimension to regulate spectral compression (realized within the CR adapter encoder module and CR adapter decoder module). Unlike existing methods in the literature, our proposed model supports flexible compression in both spectral and spatial dimensions. We evaluate the proposed model on two distinct datasets (HySpecNet-11k [25] and MLRetSet [26]), demonstrating its effectiveness in compressing over a broad range of CRs and two different spatial resolutions. Our experimental results show that spectral compression is preferable in cases where the CR or spatial resolution is low, whereas spatio-spectral compression is more effective for high CRs or spatial resolutions. The main contributions of this paper are summarized as follows:

• We propose a spatio-spectral HSI compression model with adjustable spatial stages and latent channels, capable of effective HSI compression across a broad range of CRs and varying spatial resolutions.

• We provide extensive experimental results on two benchmark datasets for the introduced model, including an ablation study, comparisons with other approaches, and visual analyses of the reconstructions.

• We analyze the effects of spectral and spatial HSI compression on the reconstruction quality for multiple CRs and two ground sample distances, providing a comprehensive evaluation of their individual and combined impacts on reconstruction performance.

• We demonstrate the advantages of adjustable deep learning architectures and derive guidelines for the trade-off between spectral and spatial compression under varying CR conditions, for both low and high spatial resolution hyperspectral data.

The remainder of this paper is organized as follows: Section II introduces the proposed HyCASS model. Section III describes the considered datasets and provides the design of experiments, while the experimental results are presented in Section IV. Finally, in Section V, the conclusion of the work is drawn.

II Proposed Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS)

Let $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$ denote an HSI with spatial dimensions $H$ and $W$ , and $C$ spectral bands. In this work, we focus on lossy HSI compression that transforms the original HSI $\mathbf{X}$ into a compact and decorrelated latent representation $\mathbf{Y}\in\mathbb{R}^{\Sigma\times\Omega\times\Gamma}$ . Here, $\Sigma$ and $\Omega$ denote the reduced latent spatial dimensions, while $\Gamma$ represents the number of latent channels. The latent representation $\mathbf{Y}$ should retain sufficient information, such that the original HSI can be approximately reconstructed from $\mathbf{Y}$ as $\mathbf{\hat{X}}\in\mathbb{R}^{H\times W\times C}$ . The compression aims to minimize the distortion $d:\mathbf{X}\times\mathbf{\hat{X}}\rightarrow[0,\infty)$ between $\mathbf{X}$ and $\mathbf{\hat{X}}$ for a fixed CR.

Figure 1: Overview of our proposed HyCASS model. Initially, a pixelwise convolution in the spectral encoder module extracts spectral features. The spatial encoder module, composed of $S$ stacked stages, performs both long- and short-range spatial feature extraction, where each spatial stage introduces higher spatial compression. Subsequently, the CR adapter encoder module adjusts the size of the latent representation to match the targeted spatio-spectral CR. The decoder mirrors the encoder structure, replacing downsampling with upsampling operations.

To enable effective spatio-spectral HSI compression, we propose HyCASS. The proposed model combines pixelwise convolutions, strided 2D convolutional layers, and Residual Swin Transformer Blocks (RSTBs) [27] to leverage both short-range and long-range redundancies across the spectral as well as the spatial dimension of HSIs. HyCASS consists of six modules distributed across the encoder and the decoder: i) a spectral encoder module that involves a spectral feature extraction; ii) a spatial encoder module with a configurable number of spatial stages $S$ for adjustable spatial compression, incorporating short-range and long-range spatial redundancies; iii) a CR adapter encoder module to balance the trade-off between compressing spectral and spatial information content via the number of latent channels $\Gamma$, depending on the joint spatio-spectral target CR; iv) a CR adapter decoder module that recovers spectral information; v) a spatial decoder module that performs spatial reconstruction; and vi) a spectral decoder module for spectral reconstruction. HyCASS facilitates adjustable spatio-spectral compression, striking a balance between reconstruction fidelity and compression efficiency. This balance is achieved through the adjustable parameters $\Gamma$ and $S$, which control spectral and spatial CR, respectively. A schematic overview of HyCASS is provided in Fig. 1 with detailed explanations in the following subsections.

II-A HyCASS Spectral Encoder Module

Initially, the spectral encoder module $E_{\Phi}:\mathbb{R}^{H\times W\times C}\rightarrow\mathbb{R}^{H\times W\times N}$ of the proposed HyCASS model, which is defined as:

$$E_{\Phi}\left(\mathbf{\xi}\right)=\text{LeakyReLU}\left(\text{Conv2D}_{1\times 1}^{C\rightarrow N}\left(\mathbf{\xi}\right)\right)\tag{1}$$

performs spectral feature extraction using a pixelwise convolution, realized via a $1\times 1$ kernel inside a 2D convolutional layer. Although 2D convolutions are typically used for capturing spatial patterns, the use of a $1\times 1$ kernel ensures that the convolution is applied independently to each pixel location without aggregating any spatial context. The pixelwise convolution captures spectral redundancies along the whole spectrum of each pixel by projecting the $C$ spectral bands into a decorrelated, lower-dimensional spectral representation with $N$ channels ($N<C$). A LeakyReLU is applied after the convolutional layer to introduce non-linearity, which allows the extracted spectral features to capture more complex relationships in the hyperspectral data. The extracted spectral features are subsequently processed by the spatial encoder module.
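That a $1\times 1$ convolution acts on each pixel independently can be made concrete with a small sketch in plain Python. The weights, bias, toy image, and the LeakyReLU negative slope of 0.01 are illustrative assumptions and are not taken from the trained model.

```python
def leaky_relu(value, negative_slope=0.01):
    # negative_slope is an assumed value; the text does not specify it.
    return value if value >= 0 else negative_slope * value


def pixelwise_conv(image, weights, bias):
    """Apply the same linear map (N x C weights) to the spectrum of every pixel.

    image:   H x W x C nested lists
    weights: N x C nested lists
    bias:    list of length N
    """
    return [
        [
            [
                leaky_relu(sum(w * x for w, x in zip(row, pixel)) + b)
                for row, b in zip(weights, bias)
            ]
            for pixel in image_row
        ]
        for image_row in image
    ]


# Toy example: a 1 x 2 image with C = 3 bands mapped to N = 2 channels.
image = [[[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]]
weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
bias = [0.0, 0.0]
features = pixelwise_conv(image, weights, bias)

# Swapping the two pixels simply swaps their feature vectors: no spatial context is mixed in.
swapped = pixelwise_conv([[image[0][1], image[0][0]]], weights, bias)
```

Because every output vector depends only on the spectrum at its own location, this stage changes the channel dimension from $C$ to $N$ and leaves $H$ and $W$ untouched.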

II-B HyCASS Spatial Encoder Module

Following the spectral encoding, HyCASS applies a spatial encoder module to capture and compress spatial redundancies within the hyperspectral data. The HyCASS spatial encoder $E_{\chi}:\mathbb{R}^{H\times W\times N}\rightarrow\mathbb{R}^{\Sigma\times\Omega\times N}$ defined as:

$$\begin{cases}E_{\chi}\left(\mathbf{\xi}\right)=f^{\left(S\right)}\left(\mathbf{\xi}\right),\quad S\in\mathbb{N}_{0}\\ f\left(\mathbf{\xi}\right)=\text{LeakyReLU}\left(\text{Conv2D}_{3\times 3}^{\downarrow^{2}}\left(\text{RSTB}\left(\mathbf{\xi}\right)\right)\right)\end{cases}\tag{2}$$

consists of a configurable number $S$ of sequential spatial stages, each implemented as a Neural Transformation Unit (NTU) [28]. The stages systematically reduce the spatial dimensions while enriching the feature representation with contextual spatial information. $S$ is a hyperparameter that determines the spatial CR, given by $\text{CR}_{\text{spat}}=4^{S}$, and thereby aligns the spatial compression with the targeted spatio-spectral CR. Each NTU performs stepwise spatial compression and is composed of three main components: i) a Residual Swin Transformer Block (RSTB) [27]; ii) a strided 2D convolutional layer with a kernel size of $3\times 3$; and iii) a LeakyReLU activation function. The RSTB employs shifted window self-attention to capture long-range spatial redundancies as illustrated in Fig. 2.

Figure 2: Architecture of (a) the RSTB and (b) the STL. Layout is redesigned based on [27].

The RSTB consists of three subcomponents: i) feature embedding (FE) that reorders the input feature channels into a token sequence; ii) Swin Transformer layer (STL) that applies multi-head self-attention within and across local windows using layer normalization (LN), window attention (WA), shifted window attention (SWA) and MLP; and iii) feature unembedding (FU) that reorders the tokens back to their original spatial shape. Residual connections mitigate the vanishing gradient issue, enhancing training stability. This design ensures efficient attention computation while preserving spatial locality and contextual information. Notably, patch division and linear embedding for tokenization are omitted as described in [28]. Following the RSTB, the strided 2D convolution captures local short-range spatial redundancies. This layer downsamples the feature maps by a factor of $2$ along both height and width, effectively reducing the image size by a factor of $4$ per stage. The number of channels $N$ remains constant across all stages. A non-linear LeakyReLU activation function is applied at the end of each NTU to enhance the model’s capacity to learn complex patterns. This hierarchical spatial encoding progressively compresses the spatial content while preserving important structural and contextual details. We would like to note that in the case of zero spatial stages ($S=0$), HyCASS operates as a spectral compression model, while spatio-spectral compression is achieved with one or more spatial stages ($S>0$).
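The effect of stacking $S$ stages on the spatial size can be traced with a few lines of Python. The sketch assumes that every strided convolution halves height and width exactly, which requires suitable padding and even input sizes; the helper name `latent_spatial_size` is hypothetical.

```python
def latent_spatial_size(height, width, stages):
    """Spatial size after `stages` stride-2 steps, each assumed to halve the size exactly."""
    for _ in range(stages):
        if height % 2 or width % 2:
            raise ValueError(f"{height}x{width} cannot be halved exactly")
        height, width = height // 2, width // 2
    return height, width


size_128 = latent_spatial_size(128, 128, 3)  # three stages on a 128 x 128 patch
size_96 = latent_spatial_size(96, 96, 3)     # three stages on a 96 x 96 patch

try:
    latent_spatial_size(100, 100, 3)         # 100 -> 50 -> 25, then halving fails
    halving_100_works = True
except ValueError:
    halving_100_works = False
```

Under this assumption a $128\times 128$ patch reaches $16\times 16$ and a $96\times 96$ patch reaches $12\times 12$ after three stages, whereas a $100\times 100$ patch can only be halved twice. This is consistent with the center crop to $96\times 96$ pixels used for MLRetSet in the experimental setup.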

II-C HyCASS CR Adapter Encoder Module

After capturing spectral and spatial redundancies using the respective encoder modules, HyCASS applies the CR adapter encoder module $E_{\Psi}:\mathbb{R}^{\Sigma\times\Omega\times N}\rightarrow\mathbb{R}^{\Sigma\times\Omega\times\Gamma}$ defined as:

$$E_{\Psi}\left(\mathbf{\xi}\right)=\text{Sigmoid}\left(\text{Conv2D}_{1\times 1}^{N\rightarrow\Gamma}\left(\mathbf{\xi}\right)\right).\tag{3}$$

It adjusts the number of latent channels $\Gamma$ to fit the targeted spatio-spectral CR. To this end, a $1\times 1$ convolutional layer maps the spatio-spectral features from $N$ channels to $\Gamma$ latent channels. We use the sigmoid activation function to constrain the latent space to the range $[0,1]$.

The spatio-spectral CR achieved by our proposed model arises from the joint contributions of both spectral and spatial compression. It can be expressed as:

$$\text{CR}=\text{CR}_{\text{spec}}\cdot\text{CR}_{\text{spat}}=\frac{C}{\Gamma}\cdot 4^{S},\tag{4}$$

where $\text{CR}_{\text{spec}}=\frac{C}{\Gamma}$ denotes the spectral CR determined by the reduction of spectral channels from $C$ to $\Gamma$, and $\text{CR}_{\text{spat}}=4^{S}$ corresponds to the spatial CR introduced through $S$ stages of spatial downsampling.
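As a quick numerical check of the CR formula, the following sketch evaluates it for one configuration and compares the result with the ratio of element counts before and after compression. The chosen values ($C=202$, $\Gamma=101$, $S=1$, and a $128\times 128$ patch) are illustrative and do not correspond to a specific experiment.

```python
def compression_ratio(bands, latent_channels, stages):
    """CR as the product of the spectral CR (C / Gamma) and the spatial CR (4 ** S)."""
    spectral_cr = bands / latent_channels
    spatial_cr = 4 ** stages
    return spectral_cr * spatial_cr


def compression_ratio_from_shapes(input_shape, latent_shape):
    """CR as the ratio of element counts of the input (H, W, C) and the latent (Sigma, Omega, Gamma)."""
    h, w, c = input_shape
    sigma, omega, gamma = latent_shape
    return (h * w * c) / (sigma * omega * gamma)


cr_formula = compression_ratio(bands=202, latent_channels=101, stages=1)
cr_counts = compression_ratio_from_shapes((128, 128, 202), (64, 64, 101))
```

Both routes give a CR of 8 here: halving the number of channels contributes a factor of 2, and one spatial stage contributes a factor of 4.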

II-D HyCASS CR Adapter Decoder Module

As the first step to perform reconstruction, the CR adapter decoder module $D_{\Psi^{\prime}}:\mathbb{R}^{\Sigma\times\Omega\times\Gamma}\rightarrow\mathbb{R}^{\Sigma\times\Omega\times N}$ of HyCASS defined as:

$$D_{\Psi^{\prime}}\left(\mathbf{\xi}\right)=\text{LeakyReLU}\left(\text{Conv2D}_{1\times 1}^{\Gamma\rightarrow N}\left(\mathbf{\xi}\right)\right)\tag{5}$$

applies a $1\times 1$ convolution that projects the channels from $\Gamma$ back to $N$, to match the NTU dimension, followed by a LeakyReLU.

II-E HyCASS Spatial Decoder Module

Afterwards, the spatial decoder module $D_{\chi^{\prime}}:\mathbb{R}^{\Sigma\times\Omega\times N}\rightarrow\mathbb{R}^{H\times W\times N}$ of HyCASS defined as:

$$\begin{cases}D_{\chi^{\prime}}\left(\mathbf{\xi}\right)=g^{\left(S\right)}\left(\mathbf{\xi}\right),\quad S\in\mathbb{N}_{0}\\ g\left(\mathbf{\xi}\right)=\text{RSTB}\left(\text{LeakyReLU}\left(\text{Conv2D}_{3\times 3}^{\uparrow^{2}}\left(\mathbf{\xi}\right)\right)\right)\end{cases}\tag{6}$$

applies $S$ stacked NTUs like the HyCASS spatial encoder module. However, in each NTU, first the 2D convolution is applied to aggregate short-range spatial features and upsample the spatial dimensions. Then, the RSTB is applied for long-range spatial feature extraction.

II-F HyCASS Spectral Decoder Module

Finally, the HyCASS spectral decoder module $D_{\Phi^{\prime}}:\mathbb{R}^{H\times W\times N}\rightarrow\mathbb{R}^{H\times W\times C}$ defined as:

$$D_{\Phi^{\prime}}\left(\mathbf{\xi}\right)=\text{Sigmoid}\left(\text{Conv2D}_{1\times 1}^{N\rightarrow C}\left(\mathbf{\xi}\right)\right)\tag{7}$$

projects the $N$ channels back to the original $C$ spectral bands using a 2D convolution with a $1\times 1$ kernel. A sigmoid activation constrains the reconstructed output intensities to the valid range of $[0,1]$.
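Chaining the six modules in the order in which they are introduced gives the end-to-end mapping. The composition below is only a compact restatement of the module definitions in Sections II-A to II-F, not an additional definition. The relations $\Sigma=H/2^{S}$ and $\Omega=W/2^{S}$ assume that each spatial stage halves height and width exactly.

$$\mathbf{Y}=E_{\Psi}\left(E_{\chi}\left(E_{\Phi}\left(\mathbf{X}\right)\right)\right)\in\mathbb{R}^{\Sigma\times\Omega\times\Gamma},\qquad\mathbf{\hat{X}}=D_{\Phi^{\prime}}\left(D_{\chi^{\prime}}\left(D_{\Psi^{\prime}}\left(\mathbf{Y}\right)\right)\right)\in\mathbb{R}^{H\times W\times C}$$

$$\mathbb{R}^{H\times W\times C}\xrightarrow{E_{\Phi}}\mathbb{R}^{H\times W\times N}\xrightarrow{E_{\chi}}\mathbb{R}^{\Sigma\times\Omega\times N}\xrightarrow{E_{\Psi}}\mathbb{R}^{\Sigma\times\Omega\times\Gamma}\xrightarrow{D_{\Psi^{\prime}}}\mathbb{R}^{\Sigma\times\Omega\times N}\xrightarrow{D_{\chi^{\prime}}}\mathbb{R}^{H\times W\times N}\xrightarrow{D_{\Phi^{\prime}}}\mathbb{R}^{H\times W\times C}$$

Only $\mathbf{Y}$ needs to be stored or transmitted, so its size $\Sigma\cdot\Omega\cdot\Gamma$ relative to $H\cdot W\cdot C$ determines the CR.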

III Dataset Description and Experimental Setup

III-A Dataset Description

Two HSI datasets were employed in our experiments. These datasets differ in several aspects, including spatial resolution, number of spectral bands, and dataset size, as summarized below.

III-A1 HySpecNet-11k

HySpecNet-11k [25] is a large-scale hyperspectral benchmark dataset constructed from $250$ tiles acquired by the Environmental Mapping and Analysis Program (EnMAP) satellite [29]. It includes $11,483$ nonoverlapping HSIs, each of which consists of $128\times 128$ pixels and $202$ bands with a GSD of $30\,\mathrm{m}$ (low spatial resolution) and a spectral range of $420\,\mathrm{nm}$ to $2,450\,\mathrm{nm}$. The data is radiometrically, geometrically, and atmospherically corrected (i.e., the L2A water & land product). We used the recommended splits from [25] for training, validation, and test sets covering $70\%$, $20\%$, and $10\%$ of the HSIs, respectively. Fig. 3 illustrates example images from this dataset. We would like to note that we used HySpecNet-11k to show the effectiveness and generalization capability of our proposed model, particularly in the context of large-scale training with low spatial resolution HSIs.

III-A2 MLRetSet

MLRetSet [26] is a hyperspectral benchmark dataset created from high spatial resolution hyperspectral imagery with $27.86\,\mathrm{cm}$ GSD. The hyperspectral dataset was acquired during an airborne flight covering the Turkish towns Yenice and Yeşilkaya on 4 May 2019. The twelve acquired tiles were split into $3,840$ non-overlapping HSIs of size $100\times 100$ pixels with $369$ spectral bands each. We split the data into i) a training set that includes $70\%$ of the HSIs; ii) a validation set that includes $20\%$ of the HSIs; and iii) a test set that includes $10\%$ of the HSIs. Fig. 4 provides visual examples of typical scenes present in the MLRetSet dataset. We would like to note that we employed the MLRetSet dataset to demonstrate the effectiveness of our proposed model on HSIs with a high spatial resolution.

III-B Experimental Setup

Our code was implemented in PyTorch based on the CompressAI [30] framework. In the case of MLRetSet, the HSIs were center-cropped to $96\times 96$ pixels to facilitate repeated spatial downsampling by factors of two, which is required for spatial compression models. For HySpecNet-11k, we followed the easy split as introduced in [25]. Training runs were carried out on a single NVIDIA A100 SXM4 80 GB GPU using the Adam optimizer [31]. For the loss function, we employed the mean squared error (MSE) defined as follows:

\displaystyle\text{MSE}(\mathbf{X},\mathbf{\hat{X}})=\frac{1}{H\cdot W\cdot C}\sum_{h,w,c}\left(\mathbf{X}(h,w,c)-\mathbf{\hat{X}}(h,w,c)\right)^{2}.

(8)
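Two details of this setup can be illustrated with a dependency-free sketch: why cropping MLRetSet from 100 to 96 pixels helps repeated halving, and how the MSE of Eq. 8 is evaluated. The helper names and the nested-list layout `x[h][w][c]` are assumptions for illustration only; this is not the authors' PyTorch code.

```python
def max_halvings(size):
    """Count how many times `size` can be halved while staying an integer."""
    count = 0
    while size % 2 == 0:
        size //= 2
        count += 1
    return count


def center_crop(hsi, size):
    """Center-crop an HSI stored as nested lists hsi[h][w][c] to size x size pixels."""
    height, width = len(hsi), len(hsi[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in hsi[top:top + size]]


def mse(x, x_hat):
    """Mean squared error over all H*W*C entries, as in Eq. 8."""
    total, count = 0.0, 0
    for row, row_hat in zip(x, x_hat):
        for pixel, pixel_hat in zip(row, row_hat):
            for value, value_hat in zip(pixel, pixel_hat):
                total += (value - value_hat) ** 2
                count += 1
    return total / count


print(max_halvings(100), max_halvings(96))  # 2 5
```

A side length of 100 can only be halved twice (100, 50, 25), whereas 96 can be halved five times, which leaves room for the three spatial stages considered in the experiments.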

We compare our proposed model with two traditional compression methods (JPEG2000 and PCA) and the following learning-based models: 1) 1D-CAE [19], a 1D convolutional autoencoder performing spectral compression; 2) SSCNet [21], a spatial compression network; 3) 3D-CAE [22], a 3D convolutional autoencoder jointly exploiting spatial and spectral redundancies; and 4) HyCoT [24], a transformer-based AE exploiting spectral redundancies. Training epochs, learning rate (LR), and batch size (BS) were adjusted per model and dataset to optimize training time and GPU memory usage, while ensuring convergence of the loss function for each training run. Table I lists the specific training hyperparameters used for each model and dataset configuration. It is worth noting that on HySpecNet-11k for 1D-CAE, the number of epochs was reduced to $250$ for $\mathrm{CR}\in\{8,16,32\}$ due to runtime limitations. For MLRetSet, we had to reduce the BS to $1$ for 1D-CAE at $\mathrm{CR}=32$ due to GPU memory constraints. Furthermore, we used the random sampling strategy from [24] to efficiently train HyCoT.

For HyCASS, we fixed the number of spatial encoder module channels to $N=128$ and set the Swin Transformer’s window size to $8$, consistent with the configuration used in the Transformer-based Image Compression (TIC) model [28]; all remaining parameters were aligned accordingly. The number of spatial stages was varied between $0\times$ and $3\times$ to assess the effect of different spatial compression levels. For zero spatial stages, we increased $N$ to $1,024$ to compensate for the missing spatial stages in terms of model parameters and floating point operations (FLOPs). Also, we applied the random sampling strategy from HyCoT [24] and increased the LR to $1\times 10^{-3}$, the BS to $64$, and the number of epochs to $2,000$ for HySpecNet-11k and $1,000$ for MLRetSet. It is important to note that while the targeted spatio-spectral CRs were $\mathrm{CR}\in\{4,8,16,\dots,1024\}$, the achieved CRs may deviate because the spectral dimension is not divisible by powers of two.

TABLE I: Training hyperparameters selected for each model on the HySpecNet-11k [25] and MLRetSet [26] datasets.

	HySpecNet-11k [25]			MLRetSet [26]		
Model	Epochs	LR	BS	Epochs	LR	BS
1D-CAE [19]	$500$	$1\times 10^{-4}$	$2$	$100$	$1\times 10^{-4}$	$2$
SSCNet [21]	$2,000$	$1\times 10^{-5}$	$8$	$400$	$1\times 10^{-5}$	$8$
3D-CAE [22]	$1,000$	$1\times 10^{-4}$	$2$	$250$	$1\times 10^{-4}$	$2$
HyCoT [24]	$2,000$	$1\times 10^{-3}$	$64$	$500$	$1\times 10^{-3}$	$64$
HyCASS	$200$	$1\times 10^{-4}$	$16$	$100$	$1\times 10^{-4}$	$16$

III-C Evaluation Metrics

For the evaluation of the HSI compression methods, we consider two kinds of metrics: i) metrics that measure the compression efficiency; and ii) metrics that measure the reconstruction quality. In our experiments, we use the CR to quantify the data reduction. PSNR and spectral angle (SA) are used to measure the fidelity of a reconstructed HSI. Given an original HSI $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$, its latent representation $\mathbf{Y}\in\mathbb{R}^{\Sigma\times\Omega\times\Gamma}$ and reconstruction $\mathbf{\hat{X}}\in\mathbb{R}^{H\times W\times C}$, the metrics are defined as follows.

III-C1 Compression Ratio (CR)

The CR between an original HSI $\mathbf{X}$ with bit depth $N_{b}$ and its representation in the latent space $\mathbf{Y}$ with bit depth $\hat{N}_{b}$ after encoding can be expressed as follows:

\displaystyle\text{CR}\left(\mathbf{X},\mathbf{Y}\right)=\frac{N_{b}\cdot H\cdot W\cdot C}{\hat{N}_{b}\cdot\Sigma\cdot\Omega\cdot\Gamma}.

(9)

A higher CR indicates greater compression of the hyperspectral data, which could result in increased loss of information during reconstruction.
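As a worked example of Eq. 9, the sketch below evaluates the CR directly from the tensor shapes. The default bit depths of 32 (equal for input and latent) are an assumption, and the latent channel counts 92 and 64 are inferred from the spectral CRs reported for MLRetSet rather than stated in the text; the function name is hypothetical.

```python
def compression_ratio(shape_x, shape_y, bits_x=32, bits_y=32):
    """Eq. 9: bits of the original HSI divided by bits of its latent representation."""
    height, width, channels = shape_x
    sigma, omega, gamma = shape_y
    return (bits_x * height * width * channels) / (bits_y * sigma * omega * gamma)


# Spectral-only compression of a 96 x 96 x 369 HSI to 92 latent channels (inferred).
print(round(compression_ratio((96, 96, 369), (96, 96, 92)), 4))  # 4.0109
# Three spatial halvings (96 -> 12) combined with 64 latent channels (inferred).
print(compression_ratio((96, 96, 369), (12, 12, 64)))  # 369.0
```

With equal bit depths the bit-depth terms cancel, so the CR reduces to the ratio of the element counts of $\mathbf{X}$ and $\mathbf{Y}$.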

III-C2 Peak Signal-to-Noise Ratio (PSNR)

For measuring the reconstruction quality, we use the peak signal-to-noise ratio (PSNR) between original HSI $\mathbf{X}$ and reconstructed HSI $\mathbf{\hat{X}}$ , which is defined as:

\displaystyle\text{PSNR}\left(\mathbf{X},\mathbf{\hat{X}}\right)=10\cdot\log_{10}\left(\frac{\text{MAX}^{2}}{\text{MSE}\left(\mathbf{X},\mathbf{\hat{X}}\right)}\right),

(10)

where MAX denotes the maximum possible pixel value (e.g. $1.0$ in the case of min-max normalization), and the MSE is defined as in Equation 8. A higher PSNR value indicates better reconstruction quality with less distortion.
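To make the relation between MSE and PSNR in Eq. 10 concrete, the following sketch converts between the two. The function names are hypothetical, and the default `max_value=1.0` assumes min-max normalized data as in the example given in the text.

```python
import math


def psnr(mse_value, max_value=1.0):
    """Eq. 10: PSNR in dB from an MSE value and the maximum possible pixel value."""
    if mse_value == 0:
        return math.inf  # identical images, no distortion
    return 10.0 * math.log10(max_value ** 2 / mse_value)


def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert Eq. 10 to obtain the MSE that corresponds to a PSNR given in dB."""
    return max_value ** 2 / 10.0 ** (psnr_db / 10.0)
```

Because of the logarithm, every tenfold reduction of the MSE raises the PSNR by 10 dB; for normalized data, an MSE of $10^{-4}$ corresponds to 40 dB.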

III-C3 Spectral Angle (SA)

For some results, we also report the SA defined as:

\displaystyle\text{SA}\left(\mathbf{X},\mathbf{\hat{X}}\right)=\frac{1}{H\cdot W}\sum_{h,w}\frac{180}{\pi}\arccos\left(\frac{\sum_{c}\mathbf{X}\left(h,w,c\right)\cdot\mathbf{\hat{X}}\left(h,w,c\right)}{\sqrt{\sum_{c}\mathbf{X}\left(h,w,c\right)^{2}}\,\sqrt{\sum_{c}\mathbf{\hat{X}}\left(h,w,c\right)^{2}}}\right),

(11)

which quantifies the average spectral angle, in degrees, between corresponding pixels of an original HSI $\mathbf{X}$ and a reconstructed HSI $\mathbf{\hat{X}}$. A smaller SA indicates higher spectral similarity. The SA is inherently scale-invariant, since the inner product of two pixel spectra is normalized by the product of their Euclidean norms.
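The SA can be sketched in plain Python as follows, using the standard spectral-angle normalization by the product of the Euclidean norms of the two pixel spectra. The nested-list layout `x[h][w][c]` and the function name are assumptions for illustration, and the sketch assumes that no pixel spectrum is all zeros.

```python
import math


def spectral_angle(x, x_hat):
    """Mean per-pixel spectral angle in degrees for HSIs stored as x[h][w][c]."""
    total, count = 0.0, 0
    for row, row_hat in zip(x, x_hat):
        for pixel, pixel_hat in zip(row, row_hat):
            dot = sum(a * b for a, b in zip(pixel, pixel_hat))
            norm = math.sqrt(sum(a * a for a in pixel))
            norm_hat = math.sqrt(sum(b * b for b in pixel_hat))
            # Clamp to [-1, 1] so rounding errors cannot leave the acos domain.
            cosine = max(-1.0, min(1.0, dot / (norm * norm_hat)))
            total += math.degrees(math.acos(cosine))
            count += 1
    return total / count
```

Scaling every band of a reconstructed pixel by the same positive factor leaves the angle essentially unchanged, which is why the SA complements the intensity-sensitive PSNR.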

IV Experimental Results

We conducted three sets of experiments, aiming at: 1) assessment of the effects of spectral and spatial compression within the proposed HyCASS model through an ablation study on two benchmark datasets; 2) comparison of our model’s effectiveness with traditional baselines and lossy learning-based state-of-the-art HSI compression models; and 3) qualitative analysis of the reconstruction results.

IV-A Ablation Study

In this subsection, we analyze the impact of varying the spatial stages $S$ and latent channels $\Gamma$ inside HyCASS on the reconstruction quality for multiple CRs. To assess generalization across different spatial resolutions, we report results on the HySpecNet-11k and the MLRetSet datasets. For each configuration, $S$ defines the spatial CR ($\mathrm{CR}_{\text{spat}}$), while $\Gamma$ determines the spectral CR ($\mathrm{CR}_{\text{spec}}$). Given $\mathrm{CR}_{\text{spat}}$, $\mathrm{CR}_{\text{spec}}$ is adjusted accordingly to match the targeted spatio-spectral CR.
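The interplay of $S$ and $\Gamma$ can be sketched as follows. The sketch assumes that each spatial stage halves both height and width, so that the spatial CR is $4^{S}$ (this matches the spatial CRs of 1, 4, 16 and 64 reported in the ablation tables), and that the input and latent bit depths are equal. The function names are hypothetical.

```python
def spatial_cr(stages):
    """Spatial CR when each stage halves both height and width (assumption)."""
    return 4 ** stages


def required_spectral_cr(target_cr, stages):
    """Spectral CR needed so that CR_spec * CR_spat equals the targeted CR."""
    return target_cr / spatial_cr(stages)


def achieved_cr(num_bands, latent_channels, stages):
    """Joint CR actually reached with an integer number of latent channels."""
    return (num_bands / latent_channels) * spatial_cr(stages)


print(required_spectral_cr(4, 3))          # 0.0625
print(round(achieved_cr(202, 51, 0), 4))   # 3.9608
print(achieved_cr(202, 16, 2))             # 202.0
```

A required spectral CR below one, as in the first example, means that the latent representation has more channels than the input has bands. The second example shows why a target of 4 is only approximately met for 202 bands: no integer number of latent channels divides 202 into exactly a quarter.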

IV-A1 HySpecNet-11k

Table II shows the results of the ablation study on the HySpecNet-11k dataset.

TABLE II: HyCASS results obtained by varying the spatial stages $S$ on the easy split test set of the HySpecNet-11k [25] dataset. $\mathrm{CR}_{\text{spec}}$, $\mathrm{CR}_{\text{spat}}$ and CR denote the spectral, spatial and joint spatio-spectral compression ratio, respectively. Reconstruction quality is evaluated using PSNR and SA.

$S$	CR	$\mathrm{CR}_{\text{spec}}$	$\mathrm{CR}_{\text{spat}}$	PSNR $\uparrow$	SA $\downarrow$
$0\times$	$3.9608$	$3.9608$	$1.0$	$56.444\,\mathrm{dB}$	$1.3940^{\circ}$
$1\times$	$3.9608$	$0.9902$	$4.0$	$49.779\,\mathrm{dB}$	$2.4409^{\circ}$
$2\times$	$3.9656$	$0.24785$	$16.0$	$48.447\,\mathrm{dB}$	$2.6473^{\circ}$
$3\times$	$3.9669$	$0.061982812$	$64.0$	$44.647\,\mathrm{dB}$	$3.3339^{\circ}$
$0\times$	$7.7692$	$7.7692$	$1.0$	$55.155\,\mathrm{dB}$	$1.5574^{\circ}$
$1\times$	$7.7692$	$1.9423$	$4.0$	$49.832\,\mathrm{dB}$	$2.4431^{\circ}$
$2\times$	$7.7692$	$0.485575$	$16.0$	$48.084\,\mathrm{dB}$	$2.7065^{\circ}$
$3\times$	$7.7692$	$0.121$	$64.0$	$44.788\,\mathrm{dB}$	$3.2651^{\circ}$
$0\times$	$15.538$	$15.538$	$1.0$	$52.828\,\mathrm{dB}$	$1.8364^{\circ}$
$1\times$	$15.538$	$3.8845$	$4.0$	$50.392\,\mathrm{dB}$	$2.3918^{\circ}$
$2\times$	$15.538$	$0.971125$	$16.0$	$48.798\,\mathrm{dB}$	$2.6387^{\circ}$
$3\times$	$15.538$	$0.24278125$	$64.0$	$46.057\,\mathrm{dB}$	$3.0827^{\circ}$
$0\times$	$28.857$	$28.857$	$1.0$	$49.719\,\mathrm{dB}$	$2.2678^{\circ}$
$1\times$	$28.857$	$7.21425$	$4.0$	$49.249\,\mathrm{dB}$	$2.5456^{\circ}$
$2\times$	$28.857$	$1.8035625$	$16.0$	$48.031\,\mathrm{dB}$	$2.7195^{\circ}$
$3\times$	$28.857$	$0.450890625$	$64.0$	$44.940\,\mathrm{dB}$	$3.2424^{\circ}$
$0\times$	$50.50$	$50.50$	$1.0$	$45.862\,\mathrm{dB}$	$3.1516^{\circ}$
$1\times$	$50.50$	$12.625$	$4.0$	$48.610\,\mathrm{dB}$	$2.7177^{\circ}$
$2\times$	$50.50$	$3.15625$	$16.0$	$48.579\,\mathrm{dB}$	$2.6440^{\circ}$
$3\times$	$50.50$	$0.7890625$	$64.0$	$45.916\,\mathrm{dB}$	$3.1105^{\circ}$
$0\times$	$101.00$	$101.00$	$1.0$	$39.836\,\mathrm{dB}$	$5.5225^{\circ}$
$1\times$	$101.00$	$25.25$	$4.0$	$45.969\,\mathrm{dB}$	$3.155^{\circ}$
$2\times$	$101.00$	$6.3125$	$16.0$	$46.843\,\mathrm{dB}$	$2.9066^{\circ}$
$3\times$	$101.00$	$1.578125$	$64.0$	$44.441\,\mathrm{dB}$	$3.3200^{\circ}$
$0\times$	$202.00$	$202.00$	$1.0$	$32.971\,\mathrm{dB}$	$12.513^{\circ}$
$1\times$	$202.00$	$50.5$	$4.0$	$43.347\,\mathrm{dB}$	$3.7722^{\circ}$
$2\times$	$202.00$	$12.625$	$16.0$	$45.094\,\mathrm{dB}$	$3.1558^{\circ}$
$3\times$	$202.00$	$3.15625$	$64.0$	$44.136\,\mathrm{dB}$	$3.4310^{\circ}$
$1\times$	$404.00$	$101.0$	$4.0$	$41.051\,\mathrm{dB}$	$4.5494^{\circ}$
$2\times$	$404.00$	$25.25$	$16.0$	$42.652\,\mathrm{dB}$	$3.7228^{\circ}$
$3\times$	$404.00$	$6.3125$	$64.0$	$42.758\,\mathrm{dB}$	$3.6519^{\circ}$
$1\times$	$808.00$	$202$	$4.0$	$36.385\,\mathrm{dB}$	$7.4544^{\circ}$
$2\times$	$808.00$	$50.5$	$16.0$	$40.814\,\mathrm{dB}$	$4.2197^{\circ}$
$3\times$	$808.00$	$12.625$	$64.0$	$41.455\,\mathrm{dB}$	$3.9490^{\circ}$

From the table, one can derive three key observations:

First, for $\mathrm{CR}<32$, HyCASS with zero spatial stages (i.e., spectral compression only) yields superior reconstruction performance compared to HyCASS with one or more spatial stages (i.e., spatio-spectral compression). For example, when $\mathrm{CR}\approx 4$, HyCASS with zero spatial stages achieves a PSNR of $56.44\,\mathrm{dB}$, while reconstruction quality drops to $49.78\,\mathrm{dB}$, $48.45\,\mathrm{dB}$ and $44.65\,\mathrm{dB}$ for one, two and three spatial stages, respectively. This can be attributed to the relatively high number of latent channels retained in this CR range by HyCASS models without spatial compression (e.g., $\Gamma=51$ for $\mathrm{CR}\approx 4$), which provide sufficient spectral information for accurate reconstruction by the decoder without requiring any spatial feature aggregation. For HySpecNet-11k, spatial compression poses reconstruction challenges due to the limited spatial correlation caused by the low spatial resolution, leading to noticeably blurred reconstruction results. Interestingly, at low CRs, spatio-spectral HyCASS models tend to exhibit PSNR stagnation. This behavior suggests that the high number of latent channels in such configurations contains redundancy, allowing comparable reconstruction performance at significantly higher CRs.

Second, as the CR increases beyond $32$, the performance of HyCASS models that rely solely on spectral compression with zero spatial stages diminishes rapidly. In contrast, HyCASS models that incorporate spatial compression (one or more spatial stages) maintain a higher quality of reconstruction at these compression levels. This behavior indicates that with higher CRs, where spectral compression reaches saturation, the inclusion of deeper spatial hierarchies becomes increasingly important for preserving structural and spectral fidelity. In particular, this trend persists even for $\mathrm{CR}>64$, where models with a higher number of spatial stages consistently achieve better reconstruction performance. These findings highlight the increasing importance of spatial compression in highly constrained compression scenarios.

Third, reconstruction quality generally decreases with increasing CRs, dropping from $56.44\,\mathrm{dB}$ at $\mathrm{CR}=3.96$ to $41.46\,\mathrm{dB}$ at $\mathrm{CR}=808$ with the respective optimal number of spatial stages. This trend is mirrored by increasing SA values, indicating that stronger compression leads to greater loss of spatial and spectral fidelity.
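The shift of the best-performing configuration towards more spatial stages can be read off Table II programmatically. The sketch below copies the PSNR values of four CRs from the table and selects the number of spatial stages with the highest PSNR; the dictionary and function names are hypothetical.

```python
# PSNR in dB for S = 0, 1, 2, 3 spatial stages, copied from Table II.
TABLE_II_PSNR = {
    28.857: [49.719, 49.249, 48.031, 44.940],
    50.50: [45.862, 48.610, 48.579, 45.916],
    101.00: [39.836, 45.969, 46.843, 44.441],
    202.00: [32.971, 43.347, 45.094, 44.136],
}


def best_stage(psnr_per_stage):
    """Return the number of spatial stages that yields the highest PSNR."""
    return max(range(len(psnr_per_stage)), key=lambda s: psnr_per_stage[s])


for cr, values in TABLE_II_PSNR.items():
    print(cr, best_stage(values))
```

For these rows the selected configuration moves from zero spatial stages at $\mathrm{CR}\approx 29$ to one stage at $\mathrm{CR}\approx 50$ and two stages at $\mathrm{CR}\approx 101$ and $\mathrm{CR}=202$.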

IV-A2 MLRetSet

Table III shows the ablation study of HyCASS on the MLRetSet dataset.

TABLE III: HyCASS results obtained by varying the spatial stages $S$ on the test set of the MLRetSet [26] dataset. $\mathrm{CR}_{\text{spec}}$, $\mathrm{CR}_{\text{spat}}$ and CR denote the spectral, spatial and joint spatio-spectral compression ratio, respectively. Reconstruction quality is evaluated using PSNR and SA.

$S$	CR	$\mathrm{CR}_{\text{spec}}$	$\mathrm{CR}_{\text{spat}}$	PSNR $\uparrow$	SA $\downarrow$
$0\times$	$4.0109$	$4.0109$	$1.0$	$44.858\,\mathrm{dB}$	$1.4403^{\circ}$
$1\times$	$3.9784$	$0.9946$	$4.0$	$42.644\,\mathrm{dB}$	$1.8489^{\circ}$
$2\times$	$3.9704$	$0.24815$	$16.0$	$42.279\,\mathrm{dB}$	$1.8878^{\circ}$
$3\times$	$3.9704$	$0.0620375$	$64.0$	$40.341\,\mathrm{dB}$	$2.1518^{\circ}$
$0\times$	$7.8511$	$7.8511$	$1.0$	$44.893\,\mathrm{dB}$	$1.4354^{\circ}$
$1\times$	$7.8095$	$1.952375$	$4.0$	$42.617\,\mathrm{dB}$	$1.8515^{\circ}$
$2\times$	$7.7787$	$0.48616875$	$16.0$	$42.119\,\mathrm{dB}$	$1.9140^{\circ}$
$3\times$	$7.7710$	$0.121421875$	$64.0$	$40.975\,\mathrm{dB}$	$2.0740^{\circ}$
$0\times$	$16.043$	$16.043$	$1.0$	$44.855\,\mathrm{dB}$	$1.4418^{\circ}$
$1\times$	$15.702$	$3.9255$	$4.0$	$42.928\,\mathrm{dB}$	$1.7858^{\circ}$
$2\times$	$15.578$	$0.973625$	$16.0$	$42.234\,\mathrm{dB}$	$1.9049^{\circ}$
$3\times$	$15.547$	$0.242921875$	$64.0$	$40.358\,\mathrm{dB}$	$2.1499^{\circ}$
$0\times$	$30.75$	$30.75$	$1.0$	$44.786\,\mathrm{dB}$	$1.4524^{\circ}$
$1\times$	$28.941$	$7.23525$	$4.0$	$42.625\,\mathrm{dB}$	$1.8523^{\circ}$
$2\times$	$28.941$	$1.8088125$	$16.0$	$42.270\,\mathrm{dB}$	$1.8920^{\circ}$
$3\times$	$28.870$	$0.45109375$	$64.0$	$37.633\,\mathrm{dB}$	$2.7232^{\circ}$
$0\times$	$61.5$	$61.5$	$1.0$	$44.231\,\mathrm{dB}$	$1.5482^{\circ}$
$1\times$	$61.5$	$15.375$	$4.0$	$42.888\,\mathrm{dB}$	$1.7798^{\circ}$
$2\times$	$61.5$	$3.84375$	$16.0$	$42.279\,\mathrm{dB}$	$1.8850^{\circ}$
$3\times$	$61.5$	$0.9609375$	$64.0$	$41.282\,\mathrm{dB}$	$2.0057^{\circ}$
$0\times$	$123.0$	$123.0$	$1.0$	$42.930\,\mathrm{dB}$	$1.7953^{\circ}$
$1\times$	$123.0$	$30.75$	$4.0$	$42.443\,\mathrm{dB}$	$1.8582^{\circ}$
$2\times$	$123.0$	$7.6875$	$16.0$	$42.269\,\mathrm{dB}$	$1.8906^{\circ}$
$3\times$	$123.0$	$1.921875$	$64.0$	$41.419\,\mathrm{dB}$	$1.9732^{\circ}$
$0\times$	$184.5$	$184.5$	$1.0$	$40.977\,\mathrm{dB}$	$2.164^{\circ}$
$1\times$	$184.5$	$46.125$	$4.0$	$42.403\,\mathrm{dB}$	$1.8436^{\circ}$
$2\times$	$184.5$	$11.53125$	$16.0$	$42.169\,\mathrm{dB}$	$1.9107^{\circ}$
$3\times$	$184.5$	$2.8828125$	$64.0$	$41.126\,\mathrm{dB}$	$2.0246^{\circ}$
$0\times$	$369.0$	$369.0$	$1.0$	$33.548\,\mathrm{dB}$	$4.8136^{\circ}$
$1\times$	$369.0$	$92.25$	$4.0$	$42.169\,\mathrm{dB}$	$1.9021^{\circ}$
$2\times$	$369.0$	$23.0625$	$16.0$	$42.125\,\mathrm{dB}$	$1.9105^{\circ}$
$3\times$	$369.0$	$5.765625$	$64.0$	$41.970\,\mathrm{dB}$	$1.8823^{\circ}$
$1\times$	$738.0$	$184.5$	$4.0$	$40.431\,\mathrm{dB}$	$2.2440^{\circ}$
$2\times$	$738.0$	$46.125$	$16.0$	$41.433\,\mathrm{dB}$	$2.0202^{\circ}$
$3\times$	$738.0$	$11.53125$	$64.0$	$41.574\,\mathrm{dB}$	$1.9447^{\circ}$
$1\times$	$1,476.0$	$369.0$	$4.0$	$37.680\,\mathrm{dB}$	$2.8319^{\circ}$
$2\times$	$1,476.0$	$92.25$	$16.0$	$40.799\,\mathrm{dB}$	$2.0822^{\circ}$
$3\times$	$1,476.0$	$23.0625$	$64.0$	$41.037\,\mathrm{dB}$	$2.0138^{\circ}$

One can observe that spatio-spectral compression becomes effective primarily at $\mathrm{CR}>128$. This behavior is attributed to the high number of spectral bands ($369$) in MLRetSet, which allows spectral compression to maintain its effectiveness across a broader range of CRs before saturation. Consequently, the performance disparity between spectral and spatio-spectral compression is less pronounced at lower CRs. This suggests that spatial information plays a more significant role in MLRetSet, due to its higher spatial resolution.

We would like to note that HyCASS consistently yields lower PSNR values on MLRetSet than on HySpecNet-11k (Table II), indicating a decrease in reconstruction fidelity. For example, when $\mathrm{CR}\approx 16$, HyCASS with zero spatial stages reaches $52.83\,\mathrm{dB}$ on HySpecNet-11k but only $44.86\,\mathrm{dB}$ on MLRetSet. In contrast, the SA values indicate better preservation of the spectral shape of each reconstructed pixel on MLRetSet ($1.44^{\circ}$ instead of $1.84^{\circ}$). This discrepancy is caused by the smaller size of the MLRetSet dataset, which inherently constrains generalization performance. Consequently, the trained models tend to prioritize learning the general spectral shape, resulting in averaged intensity offsets.

Figure 5: Rate-distortion performance on the test set of (a) HySpecNet-11k [25] (easy split) and (b) MLRetSet [26]. Rate is visualized as CR and distortion is given as PSNR in $\mathrm{dB}$.

IV-B Comparison with Other Approaches

This subsection analyzes the effectiveness of HyCASS in terms of PSNR at different CRs, comparing it with several traditional baselines and state-of-the-art learning-based HSI compression models on both the HySpecNet-11k and the MLRetSet datasets. The comparative models include: 1) JPEG2000; 2) PCA; 3) 1D-CAE [19]; 4) SSCNet [21]; 5) 3D-CAE [22]; and 6) HyCoT [24]. Fig. 5 illustrates the corresponding rate-distortion curves for both datasets, where the rate is expressed as the CR and distortion is measured as PSNR in $\mathrm{dB}$.

IV-B1 HySpecNet-11k

Fig. 5(a) shows that our proposed model achieves superior PSNR reconstruction quality across nearly all CRs when compared to state-of-the-art learning-based models. However, traditional methods demonstrate superior performance at $\mathrm{CR}<64$: JPEG2000 performs better for CRs below $16$, while PCA gives higher PSNR values even for CRs up to $64$. At $\mathrm{CR}\approx 101$, HyCASS with two spatial stages reaches a PSNR of $46.84\,\mathrm{dB}$, clearly surpassing the best-performing state-of-the-art model SSCNet, which achieves $43.60\,\mathrm{dB}$, and also the spectral baseline PCA, which achieves $44.76\,\mathrm{dB}$. Similarly, at a targeted CR of $1,024$ (an achieved CR of $808$ for HyCASS, see Table II), HyCASS achieves a PSNR of $41.46\,\mathrm{dB}$, outperforming SSCNet and JPEG2000, which reach $40.11\,\mathrm{dB}$ and $35.47\,\mathrm{dB}$, respectively. These results suggest that for $\mathrm{CR}<64$ traditional methods remain more effective, and the increased complexity of learning-based models may offer limited benefits in this CR range. For $\mathrm{CR}>64$, our results demonstrate the superior effectiveness of our proposed model over the other learning-based models and traditional approaches.

A comparative analysis of state-of-the-art learning-based models reveals that for $\mathrm{CR}<32$, our proposed model, which exclusively performs spectral compression in this range without any spatial stages, shows limited improvement over the state of the art. This is because HyCASS is not specifically optimized for this CR range, resulting in performance very similar to that of HyCoT. For $\mathrm{CR}>256$, HyCASS’s performance converges with that of SSCNet, a spatial compression model. SSCNet naturally excels under these conditions due to its design for strong spatial compression. It is worth noting that the performance of HyCASS at $\mathrm{CR}>256$ could potentially be further enhanced by integrating additional spatial stages (e.g., beyond three), which would allow for an even more effective exploitation of spatial redundancies.

IV-B2 MLRetSet

The considered models are also evaluated on the MLRetSet dataset, which has a higher spatial resolution than the HySpecNet-11k dataset. Fig. 5(b) presents the corresponding rate-distortion curves, from which the following observations can be made. Traditional approaches are highly effective in low-compression regimes. At $\mathrm{CR}<64$, JPEG2000 performs particularly well, surpassing all learning-based models and the PCA baseline. The reconstruction quality achieved by the spectral learning-based models 1D-CAE, HyCoT, and HyCASS with zero spatial stages shows only minor variation, indicating comparable performance for long-range and short-range spectral redundancy compression. SSCNet and 3D-CAE, which incorporate spatial compression, perform worse than the spectral compression models, especially for $\mathrm{CR}<64$. This suggests that, when sufficient bitrate is available, exploiting spectral redundancies is more straightforward and yields better compression performance than incorporating spatial information. HyCASS achieves comparable performance to learning-based spectral compression models at $\mathrm{CR}<128$, while at $\mathrm{CR}>128$ it demonstrates advantages thanks to its adjustable design, which integrates spatial compression.

We would like to highlight that on MLRetSet most models demonstrate reduced reconstruction quality compared to HySpecNet-11k, due to fewer training data samples hindering the generalization of learning-based models. This observation extends to the PCA baseline, suggesting that spectral compression is inherently more challenging for this dataset. Consequently, learning-based spectral compression models such as 1D-CAE, HyCoT, and HyCASS with zero spatial stages also show a significant PSNR drop relative to HySpecNet-11k, particularly at $\mathrm{CR}<64$. This can be attributed to the increased spatial complexity introduced by the higher spatial resolution, which diminishes the regularity of spectral patterns and thus limits their effective exploitation. In contrast, models employing spatial compression, such as SSCNet and HyCASS with one or more spatial stages, show greater robustness to higher spatial resolution. These models exhibit a smaller drop in reconstruction quality from HySpecNet-11k to MLRetSet, as they can exploit the higher spatial redundancies via dedicated 2D architectural components. Notably, 3D-CAE performs slightly better on MLRetSet than on HySpecNet-11k at all considered CRs. This may indicate that 3D kernels are more effective in capturing joint spatio-spectral redundancies in hyperspectral data with a high spatial resolution, enabling the model to better exploit spatial detail.

IV-C Visual Analysis

For a qualitative evaluation, the reconstruction outputs of the considered learning-based compression models are visually compared. Fig. 6 presents the error maps (derived from the SA for each pixel) of a reconstructed HySpecNet-11k image across three representative CRs. This provides a detailed visual assessment of the spatial distribution of reconstruction errors. Each case also reports the corresponding CR and overall PSNR for comprehensive comparison.


Original HSI	1D-CAE [19]	SSCNet [21]	3D-CAE [22]	HyCoT [24]	HyCASS

PSNR	$55.19\text{\,}\mathrm{dB}$	$37.53\text{\,}\mathrm{dB}$	$36.97\text{\,}\mathrm{dB}$	$56.30\text{\,}\mathrm{dB}$	$56.50\text{\,}\mathrm{dB}$
CR	$3.96$	$3.96$	$3.96$	$3.96$	$3.96$

PSNR	$48.04\text{\,}\mathrm{dB}$	$37.64\text{\,}\mathrm{dB}$	$31.69\text{\,}\mathrm{dB}$	$49.11\text{\,}\mathrm{dB}$	$48.81\text{\,}\mathrm{dB}$
CR	$28.86$	$32.00$	$31.69$	$28.86$	$28.86$

PSNR		$37.39\text{\,}\mathrm{dB}$	$33.95\text{\,}\mathrm{dB}$	$38.70\text{\,}\mathrm{dB}$	$42.49\text{\,}\mathrm{dB}$
CR		$101.00$	$126.75$	$101.00$	$101.00$

At $\mathrm{CR}\approx 4$, the spectral compression models 1D-CAE, HyCoT, and HyCASS with zero spatial stages achieve relatively low SA errors across most of the scene, as also indicated by their high overall PSNR values exceeding $55\,\mathrm{dB}$. Urban areas are reconstructed with high precision and show low SA values, while water regions pose more difficulty and show higher SA errors. In contrast, SSCNet and 3D-CAE achieve lower PSNR values of approximately $37\,\mathrm{dB}$. Consequently, their error maps contain considerable noise in the urban regions. Similar to the spectral compression models, these models also face difficulties in reconstructing water areas, whereas the forest region in the top-left shows the highest reconstruction fidelity. This suggests that at such a low CR, spatial compression adversely affects reconstruction quality, limiting the model’s ability to preserve fine spectral details of each pixel due to the spatial downsampling.

At $\mathrm{CR}\approx 32$, distortion increases for most methods, as can be seen from the overall PSNR and the SA error maps. SSCNet shows a slight improvement in reconstruction quality compared to its performance at $\mathrm{CR}\approx 4$, although overall quality remains low. Additionally, the models 1D-CAE, HyCoT, and HyCASS exhibit increased distortion in urban regions, with notably elevated SA values observed in the industrial area situated at the top center of the HSI.

At $\mathrm{CR}\approx 128$, reconstruction degradation becomes more visible. While SSCNet and 3D-CAE exhibit minimal differences compared to lower CRs, the spectral compression model HyCoT exhibits substantial errors in urban and water regions; however, it maintains relatively strong performance in forested areas. In contrast, HyCASS, employing two spatial stages at this CR, exhibits increased error in urban regions but achieves the highest overall PSNR. The qualitative results indicate that learning-based models struggle to reconstruct water regions. Urban areas also become challenging to reconstruct, especially at higher CRs and when spatial compression is applied. In contrast, forested regions are consistently reconstructed with the highest quality, due to their frequent representation in the training data and their homogeneous spatial and spectral characteristics, which facilitate accurate learning and reconstruction. Overall, HyCASS consistently achieves an effective trade-off between CR and reconstruction fidelity across a wide range of CRs.

V Conclusion and Discussion

In this paper, we have introduced HyCASS, a novel learning-based HSI compression model designed for adjustable spatio-spectral HSI compression. To this end, the proposed model employs six modules: i) a spectral encoder module; ii) a spatial encoder module; iii) a CR adapter encoder module; iv) a CR adapter decoder module; v) a spatial decoder module; and vi) a spectral decoder module. Our model accomplishes: 1) spectral feature extraction (realized within the spectral encoder and decoder modules); 2) spatial compression with variable stages (realized within the spatial encoder and decoder modules); and 3) spectral compression with variable output channels (realized within the CR adapter encoder and CR adapter decoder modules). Unlike existing learning-based HSI compression models, HyCASS provides flexible control over the trade-off between spectral and spatial compression through its modular design. We have conducted extensive experiments on two benchmark datasets, including ablation studies, comparisons with state-of-the-art methods, and visual analyses of reconstruction errors. Our results demonstrate the effectiveness of HyCASS, and reveal how the balance between spectral and spatial compression affects reconstruction fidelity across different CRs for HSIs with both low and high spatial resolution. Our findings confirm the importance of adjustable spatio-spectral compression in addressing the diverse characteristics of hyperspectral data. It is worth emphasizing that with the continuous growth of hyperspectral data archives, spatio-spectral learning-based HSI compression is becoming increasingly important, as it enables significantly higher CRs. In this context, the proposed model offers a promising solution for efficient and flexible HSI compression. Based on our analyses, we have also derived a guideline to select the trade-off between spectral and spatial compression depending on the spatial resolution and overall CR as follows:

• For HSI data with high spatial resolution: use spectral compression at low CRs; and spatio-spectral compression with greater spatial emphasis at medium and high CRs.

• For HSI data with low spatial resolution: use spectral compression at low CRs; spatio-spectral compression with greater spectral emphasis at medium CRs; and spatio-spectral compression with greater spatial emphasis at high CRs.
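The guideline above can be written down as a small lookup, sketched here under stated assumptions. The names `GUIDELINE` and `recommend` are hypothetical, and the regimes are passed as labels because the guideline does not fix numeric CR thresholds for "low", "medium" and "high".

```python
# Keys are (spatial resolution, CR regime); values restate the guideline bullets.
GUIDELINE = {
    ("high", "low"): "spectral compression",
    ("high", "medium"): "spatio-spectral compression with greater spatial emphasis",
    ("high", "high"): "spatio-spectral compression with greater spatial emphasis",
    ("low", "low"): "spectral compression",
    ("low", "medium"): "spatio-spectral compression with greater spectral emphasis",
    ("low", "high"): "spatio-spectral compression with greater spatial emphasis",
}


def recommend(spatial_resolution, cr_regime):
    """Look up the recommended compression type for a resolution and CR regime."""
    try:
        return GUIDELINE[(spatial_resolution, cr_regime)]
    except KeyError:
        raise ValueError(
            "spatial_resolution must be 'low' or 'high'; "
            "cr_regime must be 'low', 'medium' or 'high'"
        ) from None


print(recommend("low", "medium"))
```

In practice the regime boundaries would have to be chosen per dataset, for example from rate-distortion curves such as those reported in the ablation study.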

As a final remark, we would like to note that the development of foundation models (FMs) has attracted great attention in RS. FMs are usually pre-trained on large-scale datasets and then fine-tuned for specific downstream tasks such as image classification, segmentation, or change detection. We believe that leveraging the generalization capabilities of FMs for HSI compression can lead to more robust and effective compression performance. As future work, we plan to investigate the use of FMs as a backbone for HSI compression to improve reconstruction fidelity and enable adaptation across diverse sensors and acquisition conditions.

References

[1] F.-C. Lin, Y.-S. Shiu, P.-J. Wang, U.-H. Wang, J.-S. Lai, and Y.-C. Chuang, “A model for forest type identification and forest regeneration monitoring based on deep learning and hyperspectral imagery,” Ecological Informatics, vol. 80, p. 102507, 2024.
[2] A. Fabbretto, M. Bresciani, A. Pellegrino, K. Alikas, M. Pinardi, S. Mangano, R. Padula, and C. Giardino, “Tracking water quality and macrophyte changes in Lake Trasimeno (Italy) from spaceborne hyperspectral imagery,” Remote Sensing, vol. 16, no. 10, p. 1704, 2024.
[3] D. Spiller, A. Carbone, S. Amici, K. Thangavel, R. Sabatini, and G. Laneve, “Wildfire detection using convolutional neural networks and PRISMA hyperspectral imagery: A spatial-spectral analysis,” Remote Sensing, vol. 15, no. 19, p. 4855, 2023.
[4] I. Masari, G. Moser, and S. B. Serpico, “Manifold learning and deep generative networks for heterogeneous change detection from hyperspectral and synthetic aperture radar images,” IEEE Geoscience and Remote Sensing Letters, 2024.
[5] C. Gomes, I. Wittmann, D. Robert, J. Jakubik, T. Reichelt, S. Maurogiovanni, R. Vinge, J. Hurst, E. Scheurer, R. Sedona et al., “Lossy neural compression for geospatial analytics: A review,” IEEE Geoscience and Remote Sensing Magazine, 2025.
[6] A. Altamimi and B. Ben Youssef, “Lossless and near-lossless compression algorithms for remotely sensed hyperspectral images,” Entropy, vol. 26, no. 4, p. 316, 2024.
[7] Y. Dua, V. Kumar, and R. S. Singh, “Comprehensive review of hyperspectral image compression algorithms,” Optical Engineering, vol. 59, no. 9, p. 090902, 2020.
[8] J. Mielikainen and P. Toivanen, “Clustered DPCM for the lossless compression of hyperspectral images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 41, no. 12, pp. 2943–2946, 2003.
[9] M. Hernández-Cabronero, A. B. Kiely, M. Klimesh, I. Blanes, J. Ligo, E. Magli, and J. Serra-Sagrista, “The CCSDS 123.0-B-2 ‘low-complexity lossless and near-lossless multispectral and hyperspectral image compression’ standard: A comprehensive review,” IEEE Geoscience and Remote Sensing Magazine, vol. 9, no. 4, pp. 102–119, 2021.
[10] Q. Du and J. E. Fowler, “Hyperspectral image compression using JPEG2000 and principal component analysis,” IEEE Geoscience and Remote Sensing Letters, vol. 4, no. 2, pp. 201–205, 2007.
[11] R. J. Yadav and M. Nagmode, “Compression of hyperspectral image using PCA–DCT technology,” in Innovations in Electronics and Communication Engineering. Springer, 2018, pp. 269–277.
[12] P. L. Dragotti, G. Poggi, and A. R. Ragozini, “Compression of multispectral images by three-dimensional SPIHT algorithm,” IEEE Transactions on Geoscience and Remote Sensing, vol. 38, no. 1, pp. 416–428, 2000.
[13] X. Tang and W. A. Pearlman, “Three-dimensional wavelet-based compression of hyperspectral images,” in Hyperspectral Data Compression. Springer, 2006, pp. 273–308.
[14] D. Valsesia, T. Bianchi, and E. Magli, “Onboard deep lossless and near-lossless predictive coding of hyperspectral images with line-based attention,” IEEE Transactions on Geoscience and Remote Sensing, 2024.
[15] M. H. P. Fuchs, A. P. Byju, A. Walda, B. Rasti, and B. Demir, “Generative adversarial networks for spatio-spectral compression of hyperspectral images,” in IEEE Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing, 2024, pp. 1–5.
[16] F. Mentzer, G. D. Toderici, M. Tschannen, and E. Agustsson, “High-fidelity generative image compression,” Advances in Neural Information Processing Systems, vol. 33, pp. 11913–11924, 2020.
[17] S. Rezasoltani and F. Z. Qureshi, “Hyperspectral image compression using sampling and implicit neural representations,” IEEE Transactions on Geoscience and Remote Sensing, 2024.
[18] N. Zhao, T. Pan, Z. Li, E. Chen, and L. Zhang, “The paradigm shift in hyperspectral image compression: A neural video representation methodology,” Remote Sensing, vol. 17, no. 4, p. 679, 2025.
[19] J. Kuester, W. Gross, and W. Middelmann, “1D-convolutional autoencoder based hyperspectral data compression,” International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 43, pp. 15–21, 2021.
[20] J. Kuester, W. Gross, S. Schreiner, W. Middelmann, and M. Heizmann, “Adaptive two-stage multisensor convolutional autoencoder model for lossy compression of hyperspectral data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–22, 2023.
[21] R. La Grassa, C. Re, G. Cremonese, and I. Gallo, “Hyperspectral data compression using fully convolutional autoencoder,” Remote Sensing, vol. 14, no. 10, p. 2472, 2022.
[22] Y. Chong, L. Chen, and S. Pan, “End-to-end joint spectral–spatial compression and reconstruction of hyperspectral images using a 3D convolutional autoencoder,” Journal of Electronic Imaging, vol. 30, no. 4, p. 041403, 2021.
[23] N. Sprengel, M. H. P. Fuchs, and B. Demir, “Learning-based hyperspectral image compression using a spatio-spectral approach,” EGU General Assembly, 2024.
[24] M. H. P. Fuchs, B. Rasti, and B. Demir, “HyCoT: A transformer-based autoencoder for hyperspectral image compression,” in IEEE Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing, 2024, pp. 1–5.
[25] M. H. P. Fuchs and B. Demir, “HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods,” in IEEE International Geoscience and Remote Sensing Symposium, 2023, pp. 1779–1782.
[26] F. Ömrüuzun, Y. Yardımcı Çetin, U. M. Leloğlu, and B. Demir, “A novel semantic content-based retrieval system for hyperspectral remote sensing imagery,” Remote Sensing, vol. 16, no. 8, p. 1462, 2024.
[27] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted windows,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[28] M. Lu, P. Guo, H. Shi, C. Cao, and Z. Ma, “Transformer-based image compression,” in Data Compression Conference, 2022, p. 469.
[29] L. Guanter, H. Kaufmann, K. Segl, S. Foerster, C. Rogass, S. Chabrillat, T. Kuester, A. Hollstein, G. Rossner, C. Chlebek et al., “The EnMAP spaceborne imaging spectroscopy mission for Earth observation,” Remote Sensing, vol. 7, no. 7, pp. 8830–8857, 2015.
[30] J. Bégaint, F. Racapé, S. Feltman, and A. Pushparaja, “CompressAI: A PyTorch library and evaluation platform for end-to-end compression research,” arXiv preprint arXiv:2011.03029, 2020.
[31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.