DA-Occ: Efficient 3D Voxel Occupancy Prediction via Directional 2D for Geometric Structure Preservation

Yuchen Zhou
Nanjing Agricultural University
Nanjing, China
2023119007@stu.njau.edu.cn

Yan Luo
Nanjing Agricultural University
Nanjing, China
luoyan@njau.edu.cn

Xiangang Wang
Desay SV Automotive Co., Ltd.
Nanjing, China
Xiangang.Wang@desaysv.com

Xingjian Gu
Nanjing Agricultural University
Nanjing, China
guxingjian@njau.edu.cn

Mingzhou Lu
Nanjing Agricultural University
Nanjing, China
mingzhou.lu@njau.edu.cn

Abstract

Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many current methods focus on high accuracy at the expense of real-time processing needs. To address this challenge of balancing accuracy and inference speed, we propose a directional pure 2D approach. Our method slices 3D voxel features to preserve complete vertical geometric information. This strategy compensates for the loss of height cues in Bird’s-Eye View (BEV) representations, thereby maintaining the integrity of the 3D geometric structure. By employing a directional attention mechanism, we efficiently extract geometric features from different orientations, striking a balance between accuracy and computational efficiency. Experimental results highlight the significant advantages of our approach for autonomous driving. On the Occ3D-nuScenes benchmark, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS. In simulations of edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method’s applicability for real-time deployment in resource-constrained environments.

Keywords Autonomous Driving $\cdot$ 3D Occupancy Prediction $\cdot$ Directional Attention Mechanism $\cdot$ Geometric Structure Preservation

1 Introduction

Vision-based occupancy prediction aims to estimate both object occupancy and semantic information in 3D voxel space using surround-view images captured by ego-vehicle cameras [1, 2, 3, 4]. Accurate 3D geometry and semantic understanding of the surrounding environment are essential for autonomous driving (AD) [5]. Compared to conventional 3D object detection [6, 7], occupancy prediction offers finer-grained 3D scene understanding and is capable of recognizing general objects by learning their occupancy patterns.

Figure 1: The inference speed (FPS) and accuracy (mIoU) of various methods are evaluated on the Occ3D-nuScenes benchmark [2]. Following the definition proposed by [8], we consider an occupancy prediction method to be real-time if it achieves at least 10 FPS.

Occupancy prediction offers several advantages, including enhanced scene understanding, improved handling of unknown obstacles, and joint modeling of geometry and semantics. However, many existing methods [9, 10, 11, 5] focus primarily on precision, often at the expense of real-time processing due to the complexity of their model architectures and large parameter counts. Consequently, these methods suffer from high computational costs and latency, making them impractical for deployment in real-time AD systems [12]. To ensure real-time inference speed and facilitate deployment, some approaches adopt 2D methods to predict 3D voxel occupancy [13, 4, 14, 15, 8]. These methods typically compress 3D voxel features into bird’s-eye view (BEV) representations, retaining only horizontal information. As a result, certain semantic and vertical geometric details are sacrificed, which costs accuracy. Figure 1 illustrates the limitations of current methods: high accuracy often comes with slower inference, while methods designed for faster inference tend to suffer significant accuracy loss, with performance differences exceeding 10%. This issue arises from the compression of vertical geometric structures, as shown in Figure 2. Methods relying solely on 2D BEV representations struggle to capture fine-grained spatial structures, which are essential for accurate scene understanding [12, 15, 5, 8]. The visualization results further emphasize the importance of preserving vertical geometric features to maintain the overall semantic and structural integrity of the 3D voxel space.

To ensure the preservation of geometric structure and achieve more accurate object representation in 3D coordinates, we propose a novel approach for processing 3D voxel features. Our method involves performing slicing operations on the 3D voxel features, which retains the full vertical (height) geometric structure, and subsequently fuses this information with the 2D BEV feature map. This fusion restores the vertical geometric details that are often lost in conventional 2D BEV representations, providing a more comprehensive understanding of the scene.

To maintain high inference speed while preserving accuracy, we propose a directional attention-based occupancy prediction method (DA-Occ). This approach efficiently captures geometric features in both horizontal and vertical directions, enabling the model to selectively focus on relevant features in different orientations. As a result, DA-Occ optimizes computational performance without compromising the integrity of the 3D voxel structure. With an mIoU of 39.3% and an inference speed of 27.7 FPS on the Occ3D-nuScenes dataset, DA-Occ achieves a balance between high accuracy and fast inference, making it well-suited for real-time applications in autonomous driving.

2 Related Work

2.1 2D-to-3D View Transformation.

Many existing methods utilize estimated depth information to project 2D image features into 3D voxel space [16, 17, 18, 19, 20]. LSS [21] explicitly predicts depth distributions to lift 2D image features into 3D space. More recent approaches [3, 22, 23] have highlighted the importance of accurate depth estimation in the view transformation process. For instance, BEVDepth [3] encodes both intrinsic and extrinsic camera parameters into its depth refinement module for improved 3D object detection. BEVStereo [22] introduces an efficient temporal stereo mechanism to enhance depth estimation quality. In addition, some studies [24, 25, 26, 27] focus on optimizing the projection stage itself. However, existing approaches have largely overlooked the potential of height prediction as a direct supervisory signal for view transformation. Inspired by DHD [5], our method leverages height scores from 2D images to guide the transformation into 3D voxel space. This enables more accurate height estimation in the voxel representation and lays a solid foundation for subsequent BEV feature reconstruction.

2.2 3D Occupancy Prediction.

3D occupancy prediction reconstructs the 3D scene by predicting a dense voxel grid that encodes spatial occupancy [12]. A straightforward approach is to extend the BEV representation used in 3D object detection into voxel space by lifting BEV features into 3D [14] and applying a segmentation head [11, 4, 8, 13]. However, this strategy disrupts the geometric integrity of the voxel representation due to the lack of vertical structure awareness. Directly processing full 3D voxel features, on the other hand, incurs substantial computational overhead. To address this, COTR [9] introduces a compact voxel representation via downsampling, while PanoOcc [28] proposes a novel panoramic occupancy segmentation task and leverages sparse 3D convolutions to reduce computation. Nonetheless, these methods still suffer from limited inference speed, making them less suitable for real-time deployment. To tackle this challenge, we propose a 2D-based approach that preserves the geometric structure of the voxel space while significantly improving inference efficiency, offering a more deployable and scalable solution for 3D occupancy prediction.

3 Method

3.1 Problem Formulation

In 3D occupancy prediction tasks, the surrounding environment is discretized into a voxel-based volumetric grid [21]. Assuming the ego vehicle is located at the origin of the global coordinate system, the 3D perception range is defined by the bounds $[X_{s},Y_{s},Z_{s},X_{e},Y_{e},Z_{e}]$ , where $[X_{s},Y_{s},Z_{s}]$ and $[X_{e},Y_{e},Z_{e}]$ denote the lower and upper bounds along the $X$ , $Y$ , and $Z$ axes, respectively. The scene volume is partitioned into a grid of shape $[X,Y,Z]$ (e.g., $[200,200,1]$ as in [14]). Accordingly, the physical size of a single voxel along each axis is computed as $[\frac{X_{e}-X_{s}}{X},\frac{Y_{e}-Y_{s}}{Y},\frac{Z_{e}-Z_{s}}{Z}]$ . At inference time, each voxel is assigned a binary occupancy state (occupied or empty), along with a semantic label from a predefined set of categories or an unknown class. The final voxel-wise predictions are derived from visual features extracted from multi-view images, denoted as $\mathbf{X}\in\mathbb{R}^{N\times C\times H\times W}$ , where $N$ is the number of cameras and $[C,H,W]$ is the feature size. Each camera is associated with both intrinsic and extrinsic parameters ( $K$ and $[R|t]$ ), which are used to transform 2D image features into the unified 3D coordinate space. These parameters enable precise projection and aggregation of multi-view information into the voxel grid for occupancy and semantic prediction.
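As a quick check of the voxel-size formula, the following minimal sketch evaluates it for the Occ3D-nuScenes perception range given in Section 4.1. The function name `voxel_size` and the $200\times 200\times 16$ grid are illustrative assumptions, chosen so that the result matches the stated 0.4 m resolution.

```python
import numpy as np

def voxel_size(bounds, grid_shape):
    """Physical voxel size along X, Y, Z.

    bounds: [X_s, Y_s, Z_s, X_e, Y_e, Z_e] in metres.
    grid_shape: [X, Y, Z], the number of voxels along each axis.
    """
    bounds = np.asarray(bounds, dtype=float)
    grid_shape = np.asarray(grid_shape, dtype=float)
    lower, upper = bounds[:3], bounds[3:]
    return (upper - lower) / grid_shape

# Perception range from Section 4.1; the grid shape is an assumption.
occ3d_bounds = [-40.0, -40.0, -1.0, 40.0, 40.0, 5.4]
size = voxel_size(occ3d_bounds, [200, 200, 16])
print(size)
```

With a single height bin, as in the $[200,200,1]$ example, the same call returns a voxel that spans the whole vertical range, which is the height compression the method aims to avoid.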

3.2 Overall Architecture

The overall architecture of DA-Occ is depicted in detail in Fig. 3. It comprises the following key components: (1) Image Backbone: A convolutional neural network (e.g., ResNet [29]) is employed to extract high-level visual features from multi-view input images. (2) Geometry-Aware Feature Encoder: This integrates a depth estimation network (DepthNet) and a height attention network (HeightNet) to enrich the geometric understanding and improve the structural fidelity of subsequent 3D representations. (3) 2D-to-3D View Transformation: Image-plane features are lifted into the 3D voxel space using calibrated projection matrices, enabling spatial alignment across multiple views. (4) Geometry-Aware Feature Decoder: The final decoding stage incorporates geometric priors to enhance 3D structural consistency and improve the granularity of semantic predictions within the voxel grid.

3.3 Directional Attention Mechanism

To preserve the consistency of 3D geometric structures, particularly along the vertical axis, which is often underrepresented in 2D BEV-based representations, we introduce the Directional Attention Mechanism. This mechanism is used in both the Geometry-Aware Feature Encoder and the Geometry-Aware Feature Decoder, but at different stages of the model pipeline. In the Encoder, it is applied before the 2D-to-3D View Transformation to dynamically acquire height scores, which help enhance the vertical geometric features. After the transformation, in the Decoder, the Directional Height Attention (DHA) and Directional BEV Attention (DBA) modules are employed to refine and reconstruct the vertical and horizontal geometric structures, ensuring a comprehensive and accurate 3D representation. Inspired by ParC-Net [30], which captures long-range dependencies in 2D image space, we extend this mechanism to the 3D voxel domain. Our extension dynamically extracts 3D geometric features, enabling the model to better capture long-range spatial dependencies and enhance the representation of 3D structures. By introducing direction-aware attention that models dependencies along both the vertical and horizontal axes, our approach preserves fine-grained 3D structure within a purely 2D framework. Specifically, DHA is designed to capture detailed height-dependent patterns by processing features along the $Z$ -axis (vertical), while DBA captures geometric dependencies along the horizontal plane within the BEV representation. The directional attention mechanism can be expressed as $\mathbf{F}_{out}=\mathrm{DA}(\mathbf{F}_{in},\text{dir}=i)$ , where $i$ refers to either the horizontal or vertical direction in feature maps. Figure 4 illustrates the structure of this mechanism. The computation proceeds as:

\begin{aligned}
W&=\mathrm{MLP}(\mathrm{Cp}(\mathbf{F}_{in},\text{dir}=i))\\
\mathbf{F}_{ch}&=\mathrm{Cat}(\mathbf{F}_{in},\mathbf{F}_{in}[:,:,:-1,:])\\
\mathbf{F}_{cv}&=\mathrm{Cat}(\mathbf{F}_{in},\mathbf{F}_{in}[:,:,:,:-1])\\
\mathbf{F}_{out}&=\mathrm{Conv}(\mathbf{F}_{ch/cv}+p_{i},W)
\end{aligned}

(1)

where $W$ denotes the convolutional weights dynamically generated by the MLP. The function $\mathrm{Cp}(\cdot)$ performs average pooling along direction $i$ to compress spatial information. $\mathbf{F}_{ch}$ and $\mathbf{F}_{cv}$ represent horizontally and vertically concatenated features, respectively. The position encoding $p_{i}$ is a learnable parameter with a shape of $H\times 1$ or $1\times W$ , and the final convolution $\mathrm{Conv}$ integrates both structure and learned weights to generate $\mathbf{F}_{out}$ .
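The computation in Eq. (1) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions not fixed by the text: $\mathrm{Cp}$ is taken to average out the other spatial axis, the MLP is two linear layers with a ReLU, the generated kernel is depthwise with one weight per position along the chosen direction, and the position encoding is added before the circular concatenation. The names `dynamic_kernel` and `directional_conv` are hypothetical.

```python
import numpy as np

def dynamic_kernel(feat, axis, w1, w2):
    """Generate one 1D kernel per sample and channel (the W of Eq. 1).

    Assumption: Cp averages out the other spatial axis, leaving a length-L
    profile along `axis`; the MLP is two linear layers with a ReLU between.
    """
    other = 3 if axis == 2 else 2
    profile = feat.mean(axis=other)          # (B, C, L)
    hidden = np.maximum(profile @ w1, 0.0)   # (B, C, hidden)
    return hidden @ w2                       # (B, C, L)

def directional_conv(feat, kernel, pos, axis):
    """Circular depthwise convolution of feat (B, C, H, W) along axis 2 or 3."""
    x = feat + pos                           # add the position encoding p_i
    length = x.shape[axis]
    # Cat(F, F[..., :-1]) along `axis`: circular padding to length 2L - 1.
    head = np.take(x, np.arange(length - 1), axis=axis)
    padded = np.concatenate([x, head], axis=axis)
    out = np.zeros_like(x)
    for k in range(length):
        # Window starting at offset k; same size as the input along `axis`.
        window = np.take(padded, np.arange(k, k + length), axis=axis)
        out += kernel[:, :, k][:, :, None, None] * window
    return out

rng = np.random.default_rng(0)
B, C, H, W = 2, 3, 4, 5
feat = rng.normal(size=(B, C, H, W))
pos_v = rng.normal(size=(1, 1, H, 1))        # stands in for the learnable H x 1 encoding
w1, w2 = rng.normal(size=(H, 8)), rng.normal(size=(8, H))
out_v = directional_conv(feat, dynamic_kernel(feat, 2, w1, w2), pos_v, axis=2)
print(out_v.shape)
```

Because the padded tensor wraps around, every output position sees the full extent of the feature map along the chosen direction, which is how a single layer can capture long-range dependencies while the output keeps the input shape.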

3.4 Geometry-Aware Feature Encoder

HeightNet, first introduced in [5], shares a similar structure with the DepthNet module in BEVDepth [3]. It is used to generate a predicted height map $H_{pred}$ , which is supervised by a ground-truth height map $H_{gt}$ to facilitate accurate height estimation. The generation of $H_{gt}$ proceeds as follows: a LiDAR point $p_{l}=[x_{l},y_{l},z_{l},1]^{T}$ in the LiDAR coordinate frame is transformed to the ego-vehicle coordinate system as $p_{e}=[x_{e},y_{e},z_{e},1]^{T}$ . The corresponding 2D projection onto the image plane is computed using:

d[u,v,1]^{T}=K[R|t][x_{l},y_{l},z_{l},1]^{T}

(2)

where $d$ is the depth value in the camera coordinate system, $[u,v,1]$ denotes the pixel coordinates in homogeneous form, $K$ is the intrinsic matrix, and $R\in\mathbb{R}^{3\times 3}$ , $t\in\mathbb{R}^{3}$ are the camera extrinsic parameters.

Following [5], we construct a set of tuples $[u,v,d,z_{e}]^{T}$ , where each pixel location $[u,v]$ is assigned both a projected depth $d$ and a corresponding height $z_{e}$ in the ego coordinate system. For each pixel, only the point with the smallest depth is retained to eliminate occluded redundancies, resulting in a ground-truth height map $H_{gt}$ that represents the visible uppermost surface in the scene.
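The construction of $H_{gt}$ can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline of [5]: the function name `build_height_gt` and the arguments `rt_cam` (a $3\times 4$ LiDAR-to-camera matrix $[R|t]$ as in Eq. (2)) and `t_ego` (a $4\times 4$ LiDAR-to-ego transform) are hypothetical, pixels are assigned by flooring, and pixels without any LiDAR return are marked with NaN.

```python
import numpy as np

def build_height_gt(points_lidar, K, rt_cam, t_ego, height, width):
    """Sparse ground-truth height map from LiDAR points.

    points_lidar: (N, 3) points in the LiDAR frame.
    Returns an (height, width) array holding z_e of the nearest point per pixel.
    """
    n = points_lidar.shape[0]
    homo = np.concatenate([points_lidar, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = (K @ rt_cam @ homo.T).T          # (N, 3), each row is d * [u, v, 1]
    d = cam[:, 2]
    z_e = (t_ego @ homo.T).T[:, 2]         # height in the ego frame

    front = d > 1e-6                       # drop points behind the camera
    cam, d, z_e = cam[front], d[front], z_e[front]
    u = np.floor(cam[:, 0] / d).astype(int)
    v = np.floor(cam[:, 1] / d).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, d, z_e = u[inside], v[inside], d[inside], z_e[inside]

    # Sort by depth so the first occurrence of each pixel is the nearest point.
    order = np.argsort(d, kind="stable")
    flat = (v * width + u)[order]
    _, first = np.unique(flat, return_index=True)
    keep = order[first]

    h_gt = np.full((height, width), np.nan)
    h_gt[v[keep], u[keep]] = z_e[keep]
    return h_gt

# Toy example: two points fall on the same pixel, one lies behind the camera.
K = np.eye(3)
rt_cam = np.hstack([np.eye(3), np.zeros((3, 1))])
t_ego = np.eye(4)
t_ego[2, 3] = 1.5                          # assumed sensor height offset
pts = np.array([[2.0, 1.0, 2.0], [4.0, 2.0, 4.0], [0.0, 0.0, -1.0]])
h_gt = build_height_gt(pts, K, rt_cam, t_ego, height=2, width=3)
```

In the toy example, the point at depth 2 wins over the point at depth 4 on the shared pixel, so only the visible surface contributes to the height target.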

Directional Height Attention Network is designed to predict $H_{pred}$ by leveraging structural priors along the vertical ( $Z$ -axis) direction. It incorporates both the geometric shape prior of objects in height and the contextual information from adjacent horizontal locations. To efficiently capture vertical geometry while maintaining lightweight computation, DHA replaces the last two residual blocks in the original DepthNet with directional attention modules. This architectural modification enables more effective extraction of height-sensitive features with minimal additional cost.

The entire Geometry-Aware Feature Encoder process is described as follows: $\mathbf{F}_{n}$ is processed by DepthNet, generating the depth score $D_{pred}$ and the feature map $\mathbf{F}_{feat}$ . $\mathbf{F}_{n}$ is then passed through HeightNet, producing the height score $H_{pred}$ .

3.5 2D-to-3D View Transformation

To preserve the integrity of 3D geometric structures while maintaining computational efficiency, we adopt the Lift-Splat-Shoot (LSS) framework [21] for view transformation. Existing approaches such as FlashOcc [14] rely on a purely 2D transformation grid (e.g., $1\times 200\times 200$ ), which compresses the vertical dimension and results in significant geometric distortion. In contrast, we additionally adopt a lightweight 3D voxel grid (e.g., $16\times 32\times 32$ ) to enrich the BEV feature map with vertical geometric information. This enables the BEV feature map to retain vertical spatial cues more effectively and recover structural geometry with higher fidelity under similar computational constraints.

Specifically, we construct a 3D voxel grid that enables finer-grained depth and height modeling during the lifting process. The generated 3D voxel features are then sliced and fused along the depth axis ( $Y$ -axis), effectively capturing vertical structure while still allowing the final output to remain in a 2D feature map format for efficient processing. In the Splat stage of LSS, we weight the lifted 3D voxel representation with the predicted height confidence scores. The resulting voxel projection is formulated as:

\mathbf{F}_{3D}=\sum_{i:\,\text{ranks\_bev}[i]=v}\mathrm{softmax}(H_{pred})\cdot\mathbf{F}_{feat,i}

(3)

where $H_{pred}$ denotes the scores along the height direction, $\mathbf{F}_{feat,i}$ represents the $i$ -th pixel feature, and the sum runs over all pixels $i$ whose voxel index $\text{ranks\_bev}[i]$ equals $v$ . This design enhances the model’s perception of vertical geometric structures while maintaining real-time inference efficiency, leading to improved spatial feature representation. After constructing the 3D voxel feature volume $\mathbf{F}_{3D}\in\mathbb{R}^{B\times C\times Z\times Y\times X}$ , we perform slicing along the depth axis (i.e., the $Y$ -axis) and concatenate the resulting slices along the horizontal axis ( $X$ -axis). This transforms the 3D voxel structure into a 2D feature map with enhanced perception of vertical (height-wise) geometric structures, facilitating more efficient extraction of height-aware spatial representations. This process is described as:

\mathbf{F}_{height}=\text{Concat}_{k=0}^{Y-1}\left(\mathbf{F}_{3D}[:,:,:,k,:],\ \text{dim}=4\right)

(4)

where the concatenation is applied along the spatial $X$ -axis, resulting in a feature map $\mathbf{F}_{height}\in\mathbb{R}^{B\times C\times Z\times(Y\times X)}$ .
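A minimal NumPy sketch of the slicing in Eq. (4) is given below. The function name `slice_to_height_map` is hypothetical. Each slice is a 4D tensor, so the concatenation over the $X$ -axis is written with `axis=-1` here.

```python
import numpy as np

def slice_to_height_map(f3d):
    """Slice (B, C, Z, Y, X) along Y and concatenate the slices along X."""
    num_y = f3d.shape[3]
    slices = [f3d[:, :, :, k, :] for k in range(num_y)]   # each (B, C, Z, X)
    return np.concatenate(slices, axis=-1)                # (B, C, Z, Y * X)

f3d = np.arange(2 * 3 * 4 * 5 * 6, dtype=float).reshape(2, 3, 4, 5, 6)
f_height = slice_to_height_map(f3d)
print(f_height.shape)
```

For a contiguous row-major tensor, this slice-and-concatenate result coincides with a plain reshape of the last two axes, so the operation adds essentially no computation while keeping the $Z$ axis as an explicit spatial dimension of the 2D map.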

3.6 Geometry-Aware Feature Decoder

The BEV feature map $\mathbf{F}_{bev}$ , which primarily encodes the target’s depth-related information, and the sliced height-aware feature map $\mathbf{F}_{height}$ , which captures the target’s vertical structural cues, are obtained following the view transformation stage. To enhance the geometric fidelity of $\mathbf{F}_{bev}$ and $\mathbf{F}_{height}$ , we introduce DBA and DHA, which specialize in refining horizontal and vertical geometrical features through direction-aware attention.

Directional-Aware Attention. The $\mathbf{F}_{\text{bev}}$ captures the relative distance between the ego vehicle and surrounding targets. This spatial relationship can be further decomposed into front-back and left-right directional components within the BEV plane. To better model depth-related geometric structures, we apply a directional attention operator $\text{DA}(\cdot,\text{dir})$ along both the $X$ and $Y$ axes of the BEV feature map. The refined BEV representation is computed as:

\mathbf{F}_{BEV}=\text{DA}(\mathbf{F}_{bev},\text{dir}=h)+\text{DA}(\mathbf{F}_{bev},\text{dir}=v)

(5)

where the output $\mathbf{F}_{\text{BEV}}$ has the same shape as the input $\mathbf{F}_{\text{bev}}$ , and the fusion enhances geometric sensitivity in both planar directions.

Because $\mathbf{F}_{\text{height}}$ is obtained by slicing, it preserves the geometric structure in the vertical direction. DHA performs direction-aware processing that aligns with this vertical structure. This directional sensitivity allows it to more effectively capture and refine height-dependent geometric features. The refinement process is defined as:

\begin{aligned}
\mathbf{F}_{h3D}&=\text{reshape}(\text{DA}(\mathbf{F}_{height},\text{dir}=h))\\
\mathbf{F}_{Height}&=\text{Concat}_{z=0}^{Z-1}\left(\mathbf{F}_{h3D}[:,:,z,:,:],\ \text{dim}=1\right)
\end{aligned}

(6)

where $\mathbf{F}_{h3D}\in\mathbb{R}^{B\times C\times Z\times Y\times X}$ is the reshaped 3D height-aware feature volume, and $\mathbf{F}_{Height}\in\mathbb{R}^{B\times(C\cdot Z)\times Y\times X}$ is the final height-enhanced feature map obtained by concatenating the $Z$ -axis slices along the channel dimension.
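The reshape and channel-wise concatenation of Eq. (6) can be sketched as below. The directional attention step is omitted (treated as identity) so that only the tensor bookkeeping is shown; the function name `height_map_to_channels` is hypothetical.

```python
import numpy as np

def height_map_to_channels(f_height, num_y, num_x):
    """Reshape (B, C, Z, Y*X) to (B, C, Z, Y, X), then stack Z slices on channels."""
    b, c, z, yx = f_height.shape
    if yx != num_y * num_x:
        raise ValueError("last axis must have length Y * X")
    f_h3d = f_height.reshape(b, c, z, num_y, num_x)
    slices = [f_h3d[:, :, k, :, :] for k in range(z)]      # each (B, C, Y, X)
    return f_h3d, np.concatenate(slices, axis=1)           # (B, Z * C, Y, X)

B, C, Z, Y, X = 2, 3, 4, 5, 6
f_in = np.arange(B * C * Z * Y * X, dtype=float).reshape(B, C, Z, Y * X)
f_h3d, f_out = height_map_to_channels(f_in, Y, X)
print(f_out.shape)
```

Note the channel ordering: concatenating slices places height bin $z$ and channel $c$ at output channel $z\cdot C+c$ , which matches a transpose of the $C$ and $Z$ axes followed by a reshape, not a direct reshape of the $(C, Z)$ pair. Any later layer that unpacks the channels back into height bins needs to use the same ordering.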

This formulation enables fine-grained recovery of vertical geometry and improves the spatial expressiveness of the height feature representation. The final feature representation $\mathbf{F}_{d\_h}$ is obtained by fusing $\mathbf{F}_{Height}$ and $\mathbf{F}_{BEV}$ , effectively combining fine-grained height-aware and depth-aware information to enhance 3D spatial understanding. The whole process is illustrated in Figure 5.

4 Experiment

4.1 Dataset and Evaluation Metrics

We train and evaluate our model on the Occ3D-nuScenes benchmark [2], which is built upon the nuScenes dataset [31]. The dataset comprises 1,000 video sequences, split into 700 for training, 150 for validation, and 150 for testing. Each keyframe includes a 32-beam LiDAR point cloud, six RGB images from surround-view cameras, and dense voxel-level semantic occupancy annotations. The perception range is defined as $[-40\,\text{m},-40\,\text{m},-1\,\text{m},40\,\text{m},40\,\text{m},5.4\,\text{m}]$ , and the voxel resolution is set to $[0.4\,\text{m},0.4\,\text{m},0.4\,\text{m}]$ . Each voxel is assigned one of 18 semantic classes, including 16 predefined object categories, one other class for unknown objects, and one empty class representing free space. To assess the performance of our approach, we adopt the evaluation metric widely used in 3D occupancy prediction, the mean Intersection-over-Union (mIoU) [32], computed over all semantic categories.
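For reference, the metric can be sketched in a few lines of NumPy. This is a generic per-class IoU average, not the official benchmark evaluation script: the function name `mean_iou`, the optional `mask` argument (standing in for a visibility mask), and the choice to skip classes that appear in neither prediction nor ground truth are assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes, mask=None):
    """Mean IoU over classes, with IoU_c = TP_c / (TP_c + FP_c + FN_c)."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        pred, target = pred[keep], target[keep]
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom > 0:                      # skip classes absent from both
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else float("nan")

pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
score = mean_iou(pred, target, num_classes=2)
masked_score = mean_iou(pred, target, num_classes=2, mask=[1, 0, 1, 1])
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12; masking out the single mislabeled voxel raises the score to 1.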

4.2 Implementation Details

Following common practice [9, 4, 32], we train our models using 4 NVIDIA H800 GPUs with a batch size of 4 per GPU. Unless otherwise specified, all models are optimized using the AdamW optimizer with a learning rate of $1\times 10^{-4}$ and a weight decay of 0.05. During the LSS stage, the resolution of the grid for the depth score is set to $[1\times 200\times 200]$ , and the resolution of the grid for the height score is set to $[16\times 32\times 32]$ . The loss function is designed around the geometric structure recovery in DA-Occ. Specifically, we follow the formulation proposed in MonoScene [1] and define the overall loss as: $\mathcal{L}=\lambda_{1}\mathcal{L}_{\text{bce}}^{\text{depth}}+\lambda_{2}\mathcal{L}_{\text{bce}}^{\text{height}}+\lambda_{3}\mathcal{L}_{\text{ce}}+\lambda_{4}\mathcal{L}_{\text{scal}}^{\text{sem}}+\lambda_{5}\mathcal{L}_{\text{scal}}^{\text{geo}}$ , where $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4},\lambda_{5}$ are set to $1.0,1.0,10.0,1.0,1.0$ , respectively. For inference, we use a single NVIDIA A100 GPU with a batch size of 1. The frames-per-second (FPS) metric is measured using the official mmdetection3d codebase.

4.3 Main Results

Table 1: Comparison with state-of-the-art methods on the Occ3D-nuScenes benchmark. FPS is tested on a single NVIDIA A100 using the mmdetection3d codebase. DA-Occ* and FlashOcc* denote variants that use Stereo4D temporal fusion [11]; the 16-frame variants adopt BEV-level fusion [32].

Method	Venue	History Frames	Backbone	Image Size	mIoU (%)	FPS
MonoScene	CVPR’22	-	ResNet-101	928 $\times$ 1600	6.06	1.2
TPVFormer	CVPR’23	-	ResNet-101	928 $\times$ 1600	27.83	3.1
CTF-Occ	NIPS’24	-	ResNet-101	928 $\times$ 1600	28.53	-
OccFormer	ICCV’23	-	ResNet-50	256 $\times$ 704	20.40	-
BEVDetOcc	arXiv’22	-	ResNet-50	256 $\times$ 704	31.64	-
FlashOcc	arXiv’23	-	ResNet-50	256 $\times$ 704	31.95	79.4
DA-Occ(Ours)	-	-	ResNet-50	256 $\times$ 704	34.38	39.6
BEVDetOcc-4D-Stereo	arXiv’22	1	ResNet-50	256 $\times$ 704	36.01	1.8
FB-OCC	ICCV’23	1	ResNet-50	256 $\times$ 704	37.52	6.9
FlashOcc*	arXiv’23	1	ResNet-50	256 $\times$ 704	37.84	9.1
GSD-Occ	AAAI’25	1	ResNet-50	256 $\times$ 704	34.84	22.7
EOPIA	ICAIIC’25	1	ResNet-50	256 $\times$ 704	35.26	28.7
DA-Occ(Ours)*	-	1	ResNet-50	256 $\times$ 704	38.97	7.3
DA-Occ(Ours)	-	1	ResNet-50	256 $\times$ 704	35.77	31.5
FastOCC	ICRA’24	16	ResNet-101	320 $\times$ 800	37.2	10.1
COTR	CVPR’24	16	ResNet-50	256 $\times$ 704	44.5	0.9
GSD-Occ	AAAI’25	16	ResNet-50	256 $\times$ 704	39.4	20.0
DA-Occ(Ours)	-	16	ResNet-50	256 $\times$ 704	39.3	27.7

Table 1 presents a comparison between DA-Occ and state-of-the-art (SOTA) methods on the Occ3D-nuScenes benchmark. The results are grouped by the number of history frames used for temporal fusion. DA-Occ demonstrates outstanding efficiency while maintaining high accuracy. Notably, even with only a single history frame, DA-Occ* achieves accuracy comparable to that of many multi-frame models, which is largely attributed to the Stereo4D strategy in its temporal fusion module [11]. Without history frames, the proposed method improves mIoU by 2.43 percentage points over FlashOcc (34.38% vs. 31.95%). By employing the frame fusion technique at the BEV level [32], DA-Occ achieves 27.7 FPS and an mIoU of 39.3%. These results highlight DA-Occ’s ability to strike an effective balance between efficiency and accuracy. All comparisons are with methods whose contributions lie in the model design itself, for example MonoScene [1], TPVFormer [33], CTF-Occ [2], OccFormer [34], BEVDetOcc [11], FlashOcc [14], FB-OCC [4], GSD-Occ [12], EOPIA [13] (the original article does not name its model, so we use an abbreviation of the article title), and COTR [9]. These results underscore the effectiveness of DA-Occ in preserving geometric structures using a purely 2D-based approach. To further validate this capability, we conduct both ablation studies and qualitative visualizations, which consistently confirm the robustness of DA-Occ in maintaining geometric structural integrity.

4.4 Ablation Studies

Table 2: Effect of voxel grid resolution on mIoU, inference speed, and FLOPs.

Size of Grid	mIoU (%)	FPS	GFLOPs
$16\times 32\times 32$	34.38	39.6	276.59
$16\times 64\times 64$	34.45	38.7	276.65
$8\times 32\times 32$	33.67	39.9	276.58
$4\times 32\times 32$	32.89	40.1	276.55

Table 3: Effect of DHA and DBA modules on mIoU, inference speed, and FLOPs.

DHA	DBA	mIoU (%)	FPS	GFLOPs
✓	✓	34.38	39.6	276.59
✓	-	34.01	42.5	271.24
-	✓	33.76	41.3	276.57
-	-	32.58	44.7	271.21

Table 4: The mIoU and FPS of each model without the visible mask.

Method	Venue	mIoU (%)	FPS
MonoScene	CVPR’22	6.06	1.2
OccFormer	ICCV’23	20.40	-
CTF-Occ	NIPS’24	28.53	-
BEVDet	arXiv’22	18.05	-
SparseOcc	ECCV’24	30.64	17.7
DA-Occ	-	31.31	27.7

Table 5: Model inference speed on a simulated edge device.

Precision	Threads	GPU Memory	FPS
FP32 (A100)	Full	Full	39.6
FP16 (A100)	2	4GB	14.8

To accelerate experimental evaluation, we conducted ablation experiments on the model without DepthNet and without temporal fusion. To investigate the impact of height information on occupancy prediction, we varied the resolution of the 3D voxel grid and analyzed how vertical geometric granularity influences performance. The impact of the $X$ and $Y$ grid size on accuracy is negligible (34.45% mIoU at $16\times 64\times 64$ versus 34.38% at $16\times 32\times 32$ ), as these axes primarily encode depth information, which is predominantly obtained from the BEV feature map. In contrast, reducing the height resolution from 16 to 4 lowers mIoU from 34.38% to 32.89%. The results, summarized in Table 2, highlight the crucial role of height-aware geometry in achieving accurate 3D occupancy predictions.

Next, to assess the effectiveness of the DHA and DBA modules in preserving geometric structures, we performed additional ablation studies on the nuScenes dataset, as shown in Table 3. The results confirm that the direction-specific dynamic convolutions in DHA and DBA are effective at capturing structural features: together they raise mIoU from 32.58% to 34.38%, while the model still runs at 39.6 FPS (versus 44.7 FPS without them).

To further evaluate the robustness of DA-Occ, we conducted experiments without using the Visible Mask. This setting simulates real-world scenarios where the model may not have access to complete or perfect information, enabling us to evaluate the model’s ability to generalize and maintain accuracy under challenging conditions. The results, presented in Table 4, demonstrate that DA-Occ achieves an mIoU of 31.31% without the Visible Mask, indicating its strong resilience to missing data and its robustness in less-than-ideal environments.

Finally, we simulate the inference speed of DA-Occ on edge devices by imposing several hardware constraints to mimic real-world deployment environments. Specifically, we limit the available GPU memory to 4GB (with other memory being occupied to simulate concurrent tasks on the edge device) and restrict the GPU’s maximum power consumption to 90W. As shown in Table 5, even under these simulated deployment constraints, DA-Occ runs at 14.8 FPS, above the 10 FPS real-time threshold. This further validates the applicability of DA-Occ for real-time inference on edge devices.

4.5 Visualization Studies

In visualization experiments, we adopt FlashOcc [14] as the benchmark model, including both non-temporal variants and deeply supervised versions for comparison. A variety of scenarios from the val set are employed to ensure comprehensive evaluation. All visualizations are generated using the official FlashOcc codebase. As shown in Figure 6, the red box highlights a significant semantic loss along the vertical axis in FlashOcc. In contrast, DA-Occ maintains geometric structures in both horizontal and vertical directions, even within a purely 2D framework. This structural consistency greatly contributes to its outstanding mIoU performance.

5 Conclusion

In this work, we propose DA-Occ, a novel directional 2D framework for efficient and accurate 3D occupancy prediction. The method leverages height-aware voxel slicing and incorporates a directional attention mechanism, including Directional Height Attention (DHA) and Directional BEV Attention (DBA). This design effectively preserves geometric structures, particularly along the vertical axis, while maintaining high inference speed. Extensive experiments on the Occ3D-nuScenes dataset demonstrate that DA-Occ achieves a favorable trade-off between accuracy and efficiency. Compared with existing approaches, it shows superior real-time performance and stronger deployment potential. The proposed framework enhances scene understanding in autonomous driving scenarios and provides a practical solution for real-world applications.

Acknowledgments

This work was supported by Nanjing Desay SV Automotive Co., Ltd. under Grant No. TRD2025001.

References

[1] Anh-Quan Cao and Raoul De Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.
[2] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Advances in Neural Information Processing Systems, 36:64318–64330, 2023.
[3] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 1477–1485, 2023.
[4] Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, and Jose M Alvarez. Fb-bev: Bev representation from forward-backward view transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6919–6928, 2023.
[5] Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, and Jian Yang. Deep height decoupling for precise vision-based 3d occupancy prediction. arXiv preprint arXiv:2409.07972, 2024.
[6] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework. Advances in Neural Information Processing Systems, 35:10421–10434, 2022.
[7] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
[8] Jiawei Hou, Xiaoyan Li, Wenhao Guan, Gang Zhang, Di Feng, Yuheng Du, Xiangyang Xue, and Jian Pu. Fastocc: Accelerating 3d occupancy prediction by fusing the 2d bird’s-eye view and perspective view. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16425–16431. IEEE, 2024.
[9] Qihang Ma, Xin Tan, Yanyun Qu, Lizhuang Ma, Zhizhong Zhang, and Yuan Xie. Cotr: Compact occupancy transformer for vision-based 3d occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19936–19945, 2024.
[10] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023.
[11] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
[12] Yulin He, Wei Chen, Siqi Wang, Tianci Xun, and Yusong Tan. Achieving speed-accuracy balance in vision-based 3d occupancy prediction via geometric-semantic disentanglement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3455–3463, 2025.
[13] Sungjin Park, Jaeha Song, and Soonmin Hwang. Efficient occupancy prediction with instance-level attention. In 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pages 103–107. IEEE, 2025.
[14] Zichen Yu, Changyong Shu, Jiajun Deng, Kangjie Lu, Zongdai Liu, Jiangyong Yu, Dawei Yang, Hui Li, and Yan Chen. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058, 2023.
[15] Feng Li, Kun Xu, Zhaoyue Wang, Yunduan Cui, Mohammad Masum Billah, and Jia Liu. Instancebev: Unifying instance and bev representation for global modeling. arXiv preprint arXiv:2505.13817, 2025.
[16] Zhiqiang Yan, Kun Wang, Xiang Li, Zhenyu Zhang, Jun Li, and Jian Yang. Desnet: Decomposed scale-consistent network for unsupervised depth completion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 3109–3117, 2023.
[17] Zhiqiang Yan, Xiang Li, Le Hui, Zhenyu Zhang, Jun Li, and Jian Yang. Rignet++: Semantic assisted repetitive image guided network for depth completion. International Journal of Computer Vision, pages 1–23, 2025.
[18] Zhiqiang Yan, Zhengxue Wang, Kun Wang, Jun Li, and Jian Yang. Completion as enhancement: A degradation-aware selective image guided network for depth completion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26943–26953, 2025.
[19] Kun Wang, Zhenyu Zhang, Zhiqiang Yan, Xiang Li, Baobei Xu, Jun Li, and Jian Yang. Regularizing nighttime weirdness: Efficient self-supervised monocular depth estimation in the dark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16055–16064, 2021.
[20] Zhiqiang Yan, Yupeng Zheng, Deng-Ping Fan, Xiang Li, Jun Li, and Jian Yang. Learnable differencing center for nighttime depth perception. Visual Intelligence, 2(1):15, 2024.
[21] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV, pages 194–210. Springer, 2020.
[22] Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, and Zeming Li. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1486–1494, 2023.
[23] Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, and Shuchang Zhou. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540, 2023.
[24] Jinqing Zhang, Yanan Zhang, Qingjie Liu, and Yunhong Wang. Sa-bev: Generating semantic-aware bird’s-eye-view feature for multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3348–3357, 2023.
[25] Yichen Xie, Chenfeng Xu, Marie-Julie Rakotosaona, Patrick Rim, Federico Tombari, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. Sparsefusion: Fusing multi-modal sparse representations for multi-sensor 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17591–17602, 2023.
[26] Junjie Huang and Guan Huang. Bevpoolv2: A cutting-edge implementation of bevdet toward deployment. arXiv preprint arXiv:2211.17111, 2022.
[27] Xiaowei Chi, Jiaming Liu, Ming Lu, Rongyu Zhang, Zhaoqing Wang, Yandong Guo, and Shanghang Zhang. Bev-san: Accurate bev 3d object detection via slice attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17461–17470, 2023.
[28] Yuqi Wang, Yuntao Chen, Xingyu Liao, Lue Fan, and Zhaoxiang Zhang. Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17158–17168, 2024.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[30] Haokui Zhang, Wenze Hu, and Xiaoyu Wang. Parc-net: Position aware circular convolution with merits from convnets and transformer. In European Conference on Computer Vision, pages 613–630. Springer, 2022.
[31] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
[32] Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, and Limin Wang. Fully sparse 3d occupancy prediction. In European Conference on Computer Vision, pages 54–71. Springer, 2024.
[33] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9223–9232, 2023.
[34] Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443, 2023.