DA-Occ: Efficient 3D Voxel Occupancy Prediction via Directional 2D for Geometric Structure Preservation

Yuchen Zhou
Nanjing Agricultural University
Nanjing, China
2023119007@stu.njau.edu.cn
Yan Luo
Nanjing Agricultural University
Nanjing, China
luoyan@njau.edu.cn
Xiangang Wang
Desay SV Automotive co.,LTD.
Nanjing, China
Xiangang.Wang@desaysv.com
Xingjian Gu
Nanjing Agricultural University
Nanjing, China
guxingjian@njau.edu.cn
Mingzhou Lu
Nanjing Agricultural University
Nanjing, China
mingzhou.lu@njau.edu.cn
Abstract

Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many current methods focus on high accuracy at the expense of real-time processing needs. To address this challenge of balancing accuracy and inference speed, we propose a directional pure 2D approach. Our method slices 3D voxel features to preserve complete vertical geometric information. This strategy compensates for the loss of height cues in Bird’s-Eye View (BEV) representations, thereby maintaining the integrity of the 3D geometric structure. By employing a directional attention mechanism, we efficiently extract geometric features from different orientations, striking a balance between accuracy and computational efficiency. Experimental results highlight the significant advantages of our approach for autonomous driving. On the Occ3D-nuScenes benchmark, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS, effectively balancing accuracy and efficiency. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method’s applicability for real-time deployment in resource-constrained environments.

Keywords: Autonomous Driving · 3D Occupancy Prediction · Directional Attention Mechanism · Geometric Structure Preservation

1 Introduction

Vision-based occupancy prediction aims to estimate both object occupancy and semantic information in 3D voxel space using surround-view images captured by ego-vehicle cameras [1, 2, 3, 4]. Accurate 3D geometry and semantic understanding of the surrounding environment are essential for autonomous driving (AD) [5]. Compared to conventional 3D object detection [6, 7], occupancy prediction offers finer-grained 3D scene understanding and is capable of recognizing general objects by learning their occupancy patterns.

Figure 1: The inference speed (FPS) and accuracy (mIoU) of various methods are evaluated on the Occ3D-nuScenes benchmark [2]. Following the definition proposed by [8], we consider an occupancy prediction method to be real-time if it achieves at least 10 FPS.

Occupancy prediction offers several advantages, including enhanced scene understanding, improved handling of unknown obstacles, and joint modeling of geometry and semantics. However, many existing methods [9, 10, 11, 5] focus primarily on accuracy, often at the expense of real-time processing, because of their complex model architectures and large parameter counts. Consequently, these methods suffer from high computational costs and latency, making them impractical for deployment in real-time autonomous driving (AD) systems [12]. To ensure real-time inference speed and facilitate deployment, some approaches adopt 2D methods to predict 3D voxel occupancy [13, 4, 14, 15, 8]. These methods typically compress 3D voxel features into bird’s-eye view (BEV) representations, retaining only horizontal information. As a result, certain semantic and vertical geometric details are lost, which reduces accuracy. Figure 1 illustrates the limitations of current methods: high accuracy usually comes with slower inference, while methods designed for faster inference tend to suffer significant accuracy loss, with performance differences exceeding 10%. This issue arises from the compression of vertical geometric structures, as shown in Figure 2. Methods relying solely on 2D BEV representations struggle to capture fine-grained spatial structures, which are essential for accurate scene understanding [12, 15, 5, 8]. The visualization results further emphasize the importance of preserving vertical geometric features to maintain the overall semantic and structural integrity of the 3D voxel space.

Figure 2: The top section illustrates how traditional 2D BEV methods lead to the collapse of geometric structures, particularly in the vertical direction, during compression. The bottom section demonstrates how our approach effectively preserves these structures while maintaining efficiency, even in a purely 2D framework.

To ensure the preservation of geometric structure and achieve more accurate object representation in 3D coordinates, we propose a novel approach for processing 3D voxel features. Our method involves performing slicing operations on the 3D voxel features, which retains the full vertical (height) geometric structure, and subsequently fuses this information with the 2D BEV feature map. This fusion restores the vertical geometric details that are often lost in conventional 2D BEV representations, providing a more comprehensive understanding of the scene.

To maintain high inference speed while preserving accuracy, we propose a directional attention-based occupancy prediction method (DA-Occ). This approach efficiently captures geometric features in both horizontal and vertical directions, enabling the model to selectively focus on relevant features in different orientations. As a result, DA-Occ optimizes computational performance without compromising the integrity of the 3D voxel structure. With an mIoU of 39.3% and an inference speed of 27.7 FPS on the Occ3D-nuScenes dataset, DA-Occ achieves a balance between high accuracy and fast inference, making it well-suited for real-time applications in autonomous driving.

2 Related Work

2.1 2D-to-3D View Transformation.

Many existing methods utilize estimated depth information to project 2D image features into 3D voxel space [16, 17, 18, 19, 20]. LSS [21] explicitly predicts depth distributions to lift 2D image features into 3D space. More recent approaches [3, 22, 23] have highlighted the importance of accurate depth estimation in the view transformation process. For instance, BEVDepth [3] encodes both intrinsic and extrinsic camera parameters into its depth refinement module for improved 3D object detection. BEVStereo [22] introduces an efficient temporal stereo mechanism to enhance depth estimation quality. In addition, some studies [24, 25, 26, 27] focus on optimizing the projection stage itself. However, existing approaches have largely overlooked the potential of height prediction as a direct supervisory signal for view transformation. Inspired by DHD [5], our method uses height scores from 2D images to guide the transformation into 3D voxel space. This enables more accurate height estimation in the voxel representation and lays a solid foundation for subsequent BEV feature reconstruction.

2.2 3D Occupancy Prediction.

3D occupancy prediction reconstructs the 3D scene by predicting a dense voxel grid that encodes spatial occupancy [12]. A straightforward approach is to extend the BEV representation used in 3D object detection into voxel space by lifting BEV features into 3D [14] and applying a segmentation head [11, 4, 8, 13]. However, this strategy disrupts the geometric integrity of the voxel representation due to the lack of vertical structure awareness. Directly processing full 3D voxel features, on the other hand, incurs substantial computational overhead. To address this, COTR [9] introduces a compact voxel representation via downsampling, while PanoOcc [28] proposes a novel panoramic occupancy segmentation task and leverages sparse 3D convolutions to reduce computation. Nonetheless, these methods still suffer from limited inference speed, making them less suitable for real-time deployment. To tackle this challenge, we propose a 2D-based approach that preserves the geometric structure of the voxel space while significantly improving inference efficiency, offering a more deployable and scalable solution for 3D occupancy prediction.

3 Method

3.1 Problem Formulation

In 3D occupancy prediction tasks, the surrounding environment is discretized into a voxel-based volumetric grid [21]. Assuming the ego vehicle is located at the origin of the global coordinate system, the 3D perception range is defined by the bounds $[X_{s},Y_{s},Z_{s},X_{e},Y_{e},Z_{e}]$, where $[X_{s},Y_{s},Z_{s}]$ and $[X_{e},Y_{e},Z_{e}]$ denote the lower and upper bounds along the $X$, $Y$, and $Z$ axes, respectively. The scene volume is partitioned into a grid of shape $[X,Y,Z]$ (e.g., $[200,200,1]$ as in [14]). Accordingly, the physical size of a single voxel along each axis is computed as $[\frac{X_{e}-X_{s}}{X},\frac{Y_{e}-Y_{s}}{Y},\frac{Z_{e}-Z_{s}}{Z}]$.
At inference time, each voxel is assigned a binary occupancy state (occupied or empty) along with a semantic label from a predefined set of categories or an unknown class. The final voxel-wise predictions are derived from visual features extracted from multi-view images, denoted as $\mathbf{X}\in\mathbb{R}^{N\times C\times H\times W}$, where $N$ is the number of cameras and $[C,H,W]$ is the feature size. Each camera is associated with both intrinsic and extrinsic parameters ($K$ and $[R|t]$), which are used to transform 2D image features into the unified 3D coordinate space. These parameters enable precise projection and aggregation of multi-view information into the voxel grid for occupancy and semantic prediction.
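
As a worked instance of the voxel-size formula, the sketch below plugs in the Occ3D-nuScenes perception range reported in Section 4.1. The helper name `voxel_size` and the grid shape `[200, 200, 16]` are illustrative assumptions (the grid shape is chosen so that each voxel edge comes out at the 0.4 m resolution stated there); they are not taken from the authors' code.

```python
def voxel_size(lower, upper, grid):
    """Per-axis voxel edge length: (upper - lower) / number of cells."""
    return [(e - s) / n for s, e, n in zip(lower, upper, grid)]

# Perception range [X_s, Y_s, Z_s] and [X_e, Y_e, Z_e] in metres (Section 4.1).
lower = [-40.0, -40.0, -1.0]
upper = [40.0, 40.0, 5.4]
# Assumed grid shape [X, Y, Z], picked to match a 0.4 m voxel resolution.
grid = [200, 200, 16]

size = voxel_size(lower, upper, grid)
print([round(s, 3) for s in size])
```

With a single cell along $Z$, as in the $[200,200,1]$ example, the same formula gives a voxel that spans the full 6.4 m height range, which is the vertical compression discussed in the introduction.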

3.2 Overall Architecture

Figure 3: This diagram illustrates the overall architecture of DA-Occ. The left side shows the input images processed by the Backbone, generating feature maps $\mathbf{F}_{n}$ that are fed into the DepthNet and HeightNet for depth and height predictions. These features are then used to construct 3D features $\mathbf{F}_{3D}$ (with height) and BEV features $\mathbf{F}_{bev}$ (without height). The DBA and DHA are applied to enhance the feature representation. Finally, these features are fused to produce the final output, which is visualized on the right side. (To facilitate understanding, some feature maps use the original images instead.)

The overall architecture of DA-Occ is depicted in detail in Fig. 3. It comprises the following key components: (1) Image Backbone: A convolutional neural network (e.g., ResNet [29]) is employed to extract high-level visual features from multi-view input images. (2) Geometry-Aware Feature Encoder: This integrates a depth estimation network (DepthNet) and a height attention network (HeightNet) to enrich the geometric understanding and improve the structural fidelity of subsequent 3D representations. (3) 2D-to-3D View Transformation: Image-plane features are lifted into the 3D voxel space using calibrated projection matrices, enabling spatial alignment across multiple views. (4) Geometry-Aware Feature Decoder: The final decoding stage incorporates geometric priors to enhance 3D structural consistency and improve the granularity of semantic predictions within the voxel grid.

3.3 Directional Attention Mechanism

Figure 4: Internal operations of the Directional Attention Mechanism. This process first performs directional average-value compression on the input feature $\mathbf{F}_{in}$ (in the horizontal or vertical direction) to generate one of $\mathbf{F}_{h}$ or $\mathbf{F}_{v}$. This intermediate result is passed through a Multi-Layer Perceptron (MLP) to generate dynamic convolutional weights. These weights are then applied to a concatenated feature tensor via convolution, producing the final output feature $\mathbf{F}_{out}$.

In order to preserve the consistency of 3D geometric structures, particularly along the vertical axis, which is often underrepresented in 2D BEV-based representations, we introduce the Directional Attention Mechanism. This mechanism is utilized in both the Geometry-Aware Feature Encoder and the Geometry-Aware Feature Decoder, but at different stages of the model pipeline. In the Encoder, it is applied before the 2D-to-3D View Transformation to dynamically acquire height scores, which help enhance the vertical geometric features. After the transformation, in the Decoder, the Directional Height Attention (DHA) and Directional BEV Attention (DBA) modules are employed to refine and reconstruct the vertical and horizontal geometric structures, ensuring a comprehensive and accurate 3D representation. Inspired by ParC-Net [30], which captures long-range dependencies in 2D image space, we propose a novel extension of this mechanism to the 3D voxel domain. Our extension dynamically extracts 3D geometric features, enabling the model to better capture long-range spatial dependencies and enhance the representation of 3D structures. By introducing direction-aware attention that models dependencies along both the vertical and horizontal axes, our approach preserves fine-grained 3D structure within a purely 2D framework. Specifically, DHA is designed to capture detailed height-dependent patterns by processing features along the $Z$-axis (vertical), while DBA captures geometric dependencies along the horizontal plane within the BEV representation. The directional attention mechanism can be expressed as $\mathbf{F}_{out}=\mathrm{DA}(\mathbf{F}_{in},\text{dir}=i)$, where $i$ refers to either the horizontal or vertical direction in feature maps. Figure 4 illustrates the structure of this mechanism. The computation proceeds as:

$$
\begin{aligned}
W &= \mathrm{MLP}(\mathrm{Cp}(\mathbf{F}_{in},\text{dir}=i)) \\
\mathbf{F}_{ch} &= \mathrm{Cat}(\mathbf{F}_{in},\mathbf{F}_{in}[:,:,:-1,:]) \\
\mathbf{F}_{cv} &= \mathrm{Cat}(\mathbf{F}_{in},\mathbf{F}_{in}[:,:,:,:-1]) \\
\mathbf{F}_{out} &= \mathrm{Conv}(\mathbf{F}_{ch/cv}+p_{i},W)
\end{aligned} \tag{1}
$$

where $W$ denotes the convolutional weights dynamically generated by the MLP. The function $\mathrm{Cp}(\cdot)$ performs average pooling along direction $i$ to compress spatial information. $\mathbf{F}_{ch}$ and $\mathbf{F}_{cv}$ represent horizontally and vertically concatenated features, respectively. The position encoding $p_{i}$ is a learnable parameter with a shape of $H\times 1$ or $1\times W$, and the final convolution $\mathrm{Conv}$ integrates both structure and learned weights to generate $\mathbf{F}_{out}$.
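
The computation in Eq. (1) can be sketched in NumPy for a single sample of shape $(C, H, W)$. This is only an illustration under several assumptions that the text does not specify: the MLP is taken to be a two-layer perceptron with hypothetical weights `w1` and `w2` that outputs one kernel per channel, the convolution is taken to be depthwise, and the position encoding is added before the concatenation so that shapes match. The function name `directional_attention` is likewise illustrative.

```python
import numpy as np

def directional_attention(f_in, pos, w1, w2, axis):
    """Sketch of DA(F_in, dir) for one sample of shape (C, H, W).

    axis=1 mixes features along H, axis=2 along W.
    pos is the position encoding, shape (H, 1) or (1, W).
    w1, w2 are hypothetical weights of a two-layer MLP.
    """
    n = f_in.shape[axis]
    other = 2 if axis == 1 else 1
    # Cp: average-pool away the other spatial axis -> (C, n).
    pooled = f_in.mean(axis=other)
    # MLP -> one dynamic kernel of length n per channel, shape (C, n).
    kernel = np.maximum(pooled @ w1, 0.0) @ w2
    # Assumption: the position encoding is added before concatenation.
    x = f_in + pos[None]
    # Cat(F, F[..., :-1]) along `axis` has length 2n - 1, so a length-n
    # kernel fits exactly n times: this is a circular convolution.
    head = np.take(x, np.arange(n - 1), axis=axis)
    x_cat = np.concatenate([x, head], axis=axis)
    out = np.zeros_like(x)
    for k in range(n):
        window = np.take(x_cat, np.arange(k, k + n), axis=axis)
        out += kernel[:, k, None, None] * window
    return out

rng = np.random.default_rng(0)
C, H, W = 4, 6, 5
f_in = rng.normal(size=(C, H, W))
pos = rng.normal(size=(H, 1))
w1, w2 = rng.normal(size=(H, 8)), rng.normal(size=(8, H))
f_out = directional_attention(f_in, pos, w1, w2, axis=1)
print(f_out.shape)
```

Because the concatenated tensor wraps around, every output position sees the whole row or column, which is how a single 1D kernel can capture long-range dependencies along one direction while keeping the output the same size as the input.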

3.4 Geometry-Aware Feature Encoder

HeightNet, first introduced in [5], shares a similar structure with the DepthNet module in BEVDepth [3]. It is used to generate a predicted height map $H_{pred}$, which is supervised by a ground-truth height map $H_{gt}$ to facilitate accurate height estimation. The generation of $H_{gt}$ proceeds as follows: a LiDAR point $p_{l}=[x_{l},y_{l},z_{l},1]^{T}$ in the LiDAR coordinate frame is transformed to the ego-vehicle coordinate system as $p_{e}=[x_{e},y_{e},z_{e},1]^{T}$. The corresponding 2D projection onto the image plane is computed using:

$$d[u,v,1]^{T}=K[R|t][x_{l},y_{l},z_{l},1]^{T} \tag{2}$$

where $d$ is the depth value in the camera coordinate system, $[u,v,1]$ denotes the pixel coordinates in homogeneous form, $K$ is the intrinsic matrix, and $R\in\mathbb{R}^{3\times 3}$, $t\in\mathbb{R}^{3}$ are the camera extrinsic parameters.

Following [5], we construct a set of tuples $[u,v,d,z_{e}]^{T}$, where each pixel location $[u,v]$ is assigned both a projected depth $d$ and a corresponding height $z_{e}$ in the ego coordinate system. For each pixel, only the point with the smallest depth is retained to eliminate occluded redundancies, resulting in a ground-truth height map $H_{gt}$ that represents the visible uppermost surface in the scene.
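
The construction of $H_{gt}$ can be sketched as follows. This is a minimal NumPy illustration, not the procedure of [5] verbatim: the function name `height_map_from_lidar`, the argument `lidar2ego` (a 4×4 rigid transform assumed to come from sensor calibration), the toy calibration values, and the choice to mark pixels without a LiDAR return as NaN are all assumptions made for the example.

```python
import numpy as np

def height_map_from_lidar(pts_lidar, K, lidar2cam, lidar2ego, hw):
    """Sparse height map (H, W): ego-frame height of the nearest point per pixel."""
    h, w = hw
    homo = np.hstack([pts_lidar, np.ones((len(pts_lidar), 1))])  # (N, 4)
    proj = (K @ lidar2cam @ homo.T).T      # rows equal d * [u, v, 1], as in Eq. (2)
    z_e = (lidar2ego @ homo.T).T[:, 2]     # height in the ego frame
    depth = np.full((h, w), np.inf)
    height = np.full((h, w), np.nan)
    for (du, dv, d), z in zip(proj, z_e):
        if d <= 0:                         # point lies behind the camera
            continue
        u, v = int(np.floor(du / d)), int(np.floor(dv / d))
        if 0 <= u < w and 0 <= v < h and d < depth[v, u]:
            depth[v, u] = d                # keep only the smallest depth
            height[v, u] = z
    return height

# Toy calibration (hypothetical values, for illustration only).
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
lidar2cam = np.hstack([np.eye(3), np.zeros((3, 1))])   # [R|t]
lidar2ego = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0],
                      [0.0, -1.0, 0.0, 1.5],
                      [0.0, 0.0, 0.0, 1.0]])
pts = np.array([[0.0, -1.0, 10.0],    # far point
                [0.0, -0.5, 5.0],     # nearer point on the same camera ray
                [0.0, 0.0, -3.0]])    # behind the camera, ignored
h_gt = height_map_from_lidar(pts, K, lidar2cam, lidar2ego, (100, 100))
print(h_gt[40, 50])
```

The two visible points fall on the same pixel, and the per-pixel depth buffer keeps the height of the nearer one, which is the occlusion filtering described above.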

Directional Height Attention Network is designed to predict $H_{pred}$ by leveraging structural priors along the vertical ($Z$-axis) direction. It incorporates both the geometric shape prior of objects in height and the contextual information from adjacent horizontal locations. To efficiently capture vertical geometry while maintaining lightweight computation, DHA replaces the last two residual blocks in the original DepthNet with directional attention modules. This architectural modification enables more effective extraction of height-sensitive features with minimal additional cost.

The entire Geometry-Aware Feature Encoder process is described as follows: $\mathbf{F}_{n}$ is processed by DepthNet, generating the depth score $D_{pred}$ and the feature map $\mathbf{F}_{feat}$. $\mathbf{F}_{n}$ is then passed through HeightNet, producing the height score $H_{pred}$.

3.5 2D-to-3D View Transformation

To preserve the integrity of 3D geometric structures while maintaining computational efficiency, we adopt the Lift-Splat-Shoot (LSS) framework [21] for view transformation. Existing approaches such as FlashOcc [14] rely on a purely 2D transformation grid (e.g., $1\times 200\times 200$) that compresses the vertical dimension, resulting in significant geometric distortion. In contrast, we adopt a lightweight 3D voxel grid (e.g., $16\times 32\times 32$) to enrich the BEV feature map with enhanced vertical geometric information. This enables the BEV feature map to retain vertical spatial cues more effectively and recover structural geometry with higher fidelity under similar computational constraints.

Specifically, we construct a 3D voxel grid that enables finer-grained depth and height modeling during the lifting process. The generated 3D voxel features are then sliced and fused along the depth axis ($Y$-axis), effectively capturing vertical structure while still allowing the final output to remain in a 2D feature map format for efficient processing. In the Splat stage of LSS, we enhance the lifted 3D voxel representation with the predicted height confidence scores. The resulting voxel projection is formulated as:

$$\mathbf{F}_{3D}=\sum_{i\in\text{ranks\_bev}[i]=v}\mathrm{softmax}(H_{pred})\cdot\mathbf{F}_{feat,i} \tag{3}$$

where $H_{pred}$ denotes the scores along the height direction, and $\mathbf{F}_{feat,i}$ represents the $i$-th pixel feature. This design enhances the model’s perception of vertical geometric structures while maintaining real-time inference efficiency, leading to improved spatial feature representation. After constructing the 3D voxel feature volume $\mathbf{F}_{3D}\in\mathbb{R}^{B\times C\times Z\times Y\times X}$, we perform slicing along the depth axis (i.e., the $Y$-axis) and concatenate the resulting slices along the horizontal axis ($X$-axis). This transforms the 3D voxel structure into a 2D feature map with enhanced perception of vertical (height-wise) geometric structures, facilitating more efficient extraction of height-aware spatial representations. This process is described as:

$$\mathbf{F}_{height}=\text{Concat}^{k=Y-1}_{k=0}\,\mathbf{F}_{3D}[:,:,:,k,:],\ dim=4 \tag{4}$$

where the concatenation is applied along the spatial $X$-axis, resulting in a feature map $\mathbf{F}_{height}\in\mathbb{R}^{B\times C\times Z\times(Y\times X)}$.
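
The slicing step of Eq. (4) can be illustrated with a few lines of NumPy. The function name `slice_to_height_map` and the tensor sizes are assumptions for the example; in particular, reading the $16\times 32\times 32$ grid as $Z\times Y\times X$ is an assumption. Note that `dim=4` in Eq. (4) refers to the $X$ axis of the 5D tensor; after a $Y$-slice is taken, $X$ is the last axis of each 4D slice.

```python
import numpy as np

def slice_to_height_map(f_3d):
    """(B, C, Z, Y, X) -> (B, C, Z, Y * X): concatenate Y-slices along X."""
    y = f_3d.shape[3]
    slices = [f_3d[:, :, :, k, :] for k in range(y)]   # each (B, C, Z, X)
    return np.concatenate(slices, axis=-1)

B, C, Z, Y, X = 2, 3, 16, 32, 32
f_3d = np.arange(B * C * Z * Y * X, dtype=np.float32).reshape(B, C, Z, Y, X)
f_height = slice_to_height_map(f_3d)
print(f_height.shape)
```

For a contiguous array this is equivalent to `f_3d.reshape(B, C, Z, Y * X)`: the height axis $Z$ stays intact as one spatial dimension of the 2D map, so no vertical information is averaged away.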

3.6 Geometry-Aware Feature Decoder

Figure 5: Combined effects of the DHA and DBA modules. The right side presents an equivalent depiction of these effects, highlighting the collaborative extraction of geometric features along the $X$-, $Y$-, and $Z$-axes and the synergy between the three axes in capturing spatial information.

The BEV feature map $\mathbf{F}_{bev}$, which primarily encodes the target’s depth-related information, and the sliced height-aware feature map $\mathbf{F}_{height}$, which captures the target’s vertical structural cues, are obtained following the view transformation stage. To enhance the geometric fidelity of $\mathbf{F}_{bev}$ and $\mathbf{F}_{height}$, we introduce DBA and DHA, which specialize in refining horizontal and vertical geometric features through direction-aware attention.

Directional-Aware Attention. $\mathbf{F}_{bev}$ captures the relative distance between the ego vehicle and surrounding targets. This spatial relationship can be further decomposed into front-back and left-right directional components within the BEV plane. To better model depth-related geometric structures, we apply a directional attention operator $\text{DA}(\cdot,\text{dir})$ along both the $X$ and $Y$ axes of the BEV feature map. The refined BEV representation is computed as:

$$\mathbf{F}_{BEV}=\text{DA}(\mathbf{F}_{bev},\text{dir}=h)+\text{DA}(\mathbf{F}_{bev},\text{dir}=v) \tag{5}$$

where the output $\mathbf{F}_{BEV}$ has the same shape as the input $\mathbf{F}_{bev}$, and the fusion enhances geometric sensitivity in both planar directions.

Because $\mathbf{F}_{height}$ is produced by the slicing operation, it preserves the geometric structure in the vertical direction. The DHA performs direction-aware processing that aligns with this vertical structure, which allows it to capture and refine height-dependent geometric features more effectively. The refinement process is defined as:

$$
\begin{aligned}
\mathbf{F}_{h3D} &= \text{reshape}(\text{DA}(\mathbf{F}_{height},\text{dir}=h)) \\
\mathbf{F}_{Height} &= \text{Concat}_{z=0}^{Z-1}(\mathbf{F}_{h3D}[:,:,z,:,:]),\ \text{dim}=1
\end{aligned} \tag{6}
$$

where $\mathbf{F}_{h3D}\in\mathbb{R}^{B\times C\times Z\times Y\times X}$ is the reshaped 3D height-aware feature volume, and $\mathbf{F}_{Height}\in\mathbb{R}^{B\times(C\cdot Z)\times Y\times X}$ is the final height-enhanced feature map obtained by concatenating the $Z$-axis slices along the channel dimension.
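
The reshape-and-concatenate step of Eq. (6) can be sketched in NumPy as below. The input `f_refined` stands in for the output of $\text{DA}(\mathbf{F}_{height},\text{dir}=h)$ and is assumed here to keep the $B\times C\times Z\times(Y\times X)$ shape of its input; the function name `height_to_channels` and the sizes are illustrative.

```python
import numpy as np

def height_to_channels(f_refined, y, x):
    """(B, C, Z, Y * X) -> (B, C * Z, Y, X), following Eq. (6)."""
    b, c, z, _ = f_refined.shape
    f_h3d = f_refined.reshape(b, c, z, y, x)              # undo the Y-slicing
    slices = [f_h3d[:, :, k, :, :] for k in range(z)]     # each (B, C, Y, X)
    return np.concatenate(slices, axis=1)                 # stack Z-slices on channels

B, C, Z, Y, X = 2, 3, 4, 5, 6
f_refined = np.arange(B * C * Z * Y * X, dtype=np.float32).reshape(B, C, Z, Y * X)
f_Height = height_to_channels(f_refined, Y, X)
print(f_Height.shape)
```

Channel `k * C + c` of the result holds height level `k` of original channel `c`, so the map has the same $Y\times X$ spatial layout as a BEV feature map and can be fused with it by ordinary 2D operations.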

This formulation enables fine-grained recovery of vertical geometry and improves the spatial expressiveness of the height feature representation. The final feature representation $\mathbf{F}_{d\_h}$ is obtained by fusing $\mathbf{F}_{Height}$ and $\mathbf{F}_{BEV}$, effectively combining fine-grained height-aware and depth-aware information to enhance 3D spatial understanding. The whole process is illustrated in Figure 5.

4 Experiment

4.1 Dataset and Evaluation Metrics

We train and evaluate our model on the Occ3D-nuScenes benchmark [2], which is built upon the nuScenes dataset [31]. The dataset comprises 1,000 video sequences, split into 700 for training, 150 for validation, and 150 for testing. Each keyframe includes a 32-beam LiDAR point cloud, six RGB images from surround-view cameras, and dense voxel-level semantic occupancy annotations. The perception range is defined as $[-40\,\text{m},-40\,\text{m},-1\,\text{m},40\,\text{m},40\,\text{m},5.4\,\text{m}]$, and the voxel resolution is set to $[0.4\,\text{m},0.4\,\text{m},0.4\,\text{m}]$. Each voxel is assigned one of 18 semantic classes, including 16 predefined object categories, one "other" class for unknown objects, and one "empty" class representing free space. To assess the performance of our approach, we adopt the evaluation metric widely used in 3D occupancy prediction: the mean Intersection-over-Union (mIoU) [32], computed over all semantic categories.
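
For reference, a generic per-class IoU average over voxel labels can be sketched as follows. This is a minimal illustration of the metric only: the function name `mean_iou` and the rule of skipping classes that appear in neither prediction nor ground truth are assumptions, and the official benchmark protocol may differ in details such as visibility masks and which classes enter the average.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                      # class absent everywhere: skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

# Toy example with five voxels and three classes.
pred = np.array([0, 0, 1, 1, 2])
gt = np.array([0, 1, 1, 1, 2])
score = mean_iou(pred, gt, num_classes=3)
print(round(score, 4))
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18.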

4.2 Implementation Details

Following common practice [9, 4, 32], we train our models using 4 NVIDIA H800 GPUs with a batch size of 4 per GPU. Unless otherwise specified, all models are optimized using the AdamW optimizer with a learning rate of $1\times 10^{-4}$ and a weight decay of 0.05. During the LSS stage, the resolution of the grid for the depth score is set to $[1\times 200\times 200]$, and the resolution of the grid for the height score is set to $[16\times 32\times 32]$. Regarding the loss function, we design it based on the characteristics of geometric structure recovery in DA-Occ. Specifically, we follow the formulation proposed in MonoScene [1] and define the overall loss as $\mathcal{L}=\lambda_{1}\mathcal{L}_{\text{bce}}^{\text{depth}}+\lambda_{2}\mathcal{L}_{\text{bce}}^{\text{height}}+\lambda_{3}\mathcal{L}_{\text{ce}}+\lambda_{4}\mathcal{L}_{\text{scal}}^{\text{sem}}+\lambda_{5}\mathcal{L}_{\text{scal}}^{\text{geo}}$, where $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4},\lambda_{5}$ are set to $1.0, 1.0, 10.0, 1.0, 1.0$, respectively. For inference, we use a single NVIDIA A100 GPU with a batch size of 1. The frames-per-second (FPS) metric is measured using the official mmdetection3d codebase.
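The weighted sum of the five loss terms can be sketched as follows. The dictionary keys and the `total_loss` helper are hypothetical names chosen for illustration, and the per-term values in the example are made-up placeholders rather than measured losses; only the weights come from the text.

```python
# Weights lambda_1..lambda_5 as stated in the text; the key names are hypothetical.
LOSS_WEIGHTS = {
    "bce_depth": 1.0,   # lambda_1: BCE on the depth score
    "bce_height": 1.0,  # lambda_2: BCE on the height score
    "ce": 10.0,         # lambda_3: cross-entropy on semantic occupancy
    "scal_sem": 1.0,    # lambda_4: semantic scene-class affinity loss
    "scal_geo": 1.0,    # lambda_5: geometric scene-class affinity loss
}


def total_loss(terms):
    """Weighted sum of the individual loss terms (scalars)."""
    missing = LOSS_WEIGHTS.keys() - terms.keys()
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weight * terms[name] for name, weight in LOSS_WEIGHTS.items())


# Placeholder values, for illustration only.
example_terms = {"bce_depth": 0.1, "bce_height": 0.2, "ce": 0.3, "scal_sem": 0.4, "scal_geo": 0.5}
print(total_loss(example_terms))
```

With these weights the cross-entropy term dominates the objective, while the two BCE terms supervise the depth and height scores at equal strength.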

4.3 Main Results

Table 1: Comparison with state-of-the-art methods on the Occ3D-nuScenes benchmark. FPS is measured on a single NVIDIA A100 using the mmdetection3d codebase. DA-Occ* and FlashOcc* use Stereo4D temporal fusion [11]; the 16-frame fusion setting adopts the BEV-level approach [32].
| Method | Venue | History Frames | Backbone | Image Size | mIoU (%) | FPS |
|---|---|---|---|---|---|---|
| MonoScene | CVPR’22 | - | ResNet-101 | 928×1600 | 6.06 | 1.2 |
| TPVFormer | CVPR’23 | - | ResNet-101 | 928×1600 | 27.83 | 3.1 |
| CTF-Occ | NIPS’24 | - | ResNet-101 | 928×1600 | 28.53 | - |
| OccFormer | ICCV’23 | - | ResNet-50 | 256×704 | 20.40 | - |
| BEVDetOcc | arXiv’22 | - | ResNet-50 | 256×704 | 31.64 | - |
| FlashOcc | arXiv’23 | - | ResNet-50 | 256×704 | 31.95 | 79.4 |
| DA-Occ(Ours) | - | - | ResNet-50 | 256×704 | 34.38 | 39.6 |
| BEVDetOcc-4D-Stereo | arXiv’22 | 1 | ResNet-50 | 256×704 | 36.01 | 1.8 |
| FB-OCC | ICCV’23 | 1 | ResNet-50 | 256×704 | 37.52 | 6.9 |
| FlashOcc* | arXiv’23 | 1 | ResNet-50 | 256×704 | 37.84 | 9.1 |
| GSD-Occ | AAAI’25 | 1 | ResNet-50 | 256×704 | 34.84 | 22.7 |
| EOPIA | ICAIIC’25 | 1 | ResNet-50 | 256×704 | 35.26 | 28.7 |
| DA-Occ(Ours)* | - | 1 | ResNet-50 | 256×704 | 38.97 | 7.3 |
| DA-Occ(Ours) | - | 1 | ResNet-50 | 256×704 | 35.77 | 31.5 |
| FastOCC | ICRA’24 | 16 | ResNet-101 | 320×800 | 37.2 | 10.1 |
| COTR | CVPR’24 | 16 | ResNet-50 | 256×704 | 44.5 | 0.9 |
| GSD-Occ | AAAI’25 | 16 | ResNet-50 | 256×704 | 39.4 | 20.0 |
| DA-Occ(Ours) | - | 16 | ResNet-50 | 256×704 | 39.3 | 27.7 |

Table 1 compares DA-Occ with state-of-the-art (SOTA) methods on the Occ3D-nuScenes benchmark. The results are grouped by the number of history frames used for temporal fusion. DA-Occ is highly efficient while maintaining high accuracy. Notably, even with only a single history frame, DA-Occ* achieves accuracy comparable to that of many multi-frame models, which is largely attributable to the Stereo4D strategy in its temporal fusion module [11]. Without history frames, the proposed method improves mIoU by 2.43 percentage points over FlashOcc (34.38% vs. 31.95%). With 16-frame fusion at the BEV level [32], DA-Occ reaches 27.7 FPS at an mIoU of 39.3%. These results highlight DA-Occ’s ability to strike an effective balance between efficiency and accuracy. All methods we compare against are distinguished by innovations within the model itself: MonoScene [1], TPVFormer [33], CTF-Occ [2], OccFormer [34], BEVDetOcc [11], FlashOcc [14], FB-OCC [4], GSD-Occ [12], EOPIA [13] (the original work does not name its model, so we use an abbreviation of the article title), and COTR [9]. These results underscore the effectiveness of DA-Occ in preserving geometric structures using a purely 2D-based approach. To further validate this capability, we conduct both ablation studies and qualitative visualizations, which consistently confirm the robustness of DA-Occ in maintaining geometric structural integrity.
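One way to make the accuracy-speed trade-off concrete is to check which methods are Pareto-optimal, that is, not beaten on both mIoU and FPS by any other method. The sketch below applies this to the 16-history-frame group of Table 1. The Pareto analysis and the `pareto_front` helper are illustrative additions rather than part of the evaluation protocol; the numbers are copied from the table.

```python
# (method, mIoU %, FPS) for the 16-history-frame group of Table 1.
RESULTS_16_FRAMES = [
    ("FastOCC", 37.2, 10.1),
    ("COTR", 44.5, 0.9),
    ("GSD-Occ", 39.4, 20.0),
    ("DA-Occ", 39.3, 27.7),
]


def pareto_front(results):
    """Return the names of methods not dominated in both mIoU and FPS."""
    front = []
    for name, miou, fps in results:
        # A method is dominated if another is at least as good on both
        # axes and strictly better on at least one of them.
        dominated = any(
            o_miou >= miou and o_fps >= fps and (o_miou > miou or o_fps > fps)
            for _, o_miou, o_fps in results
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(RESULTS_16_FRAMES))  # ['COTR', 'GSD-Occ', 'DA-Occ']
```

In this group, DA-Occ sits at the high-speed end of the front: it gives up 0.1 mIoU points relative to GSD-Occ while running 7.7 FPS faster.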

Figure 6: Qualitative comparison between the proposed method and FlashOcc. Our method generalizes better to geometric structures, resulting in more accurate segmentation in challenging areas.

4.4 Ablation Studies

Table 2: Effect of voxel grid resolution on mIoU, inference speed, and FLOPs.

| Size of Grid | mIoU (%) | FPS | GFLOPs |
|---|---|---|---|
| 16×32×32 | 34.38 | 39.6 | 276.59 |
| 16×64×64 | 34.45 | 38.7 | 276.65 |
| 8×32×32 | 33.67 | 39.9 | 276.58 |
| 4×32×32 | 32.89 | 40.1 | 276.55 |

Table 3: Effect of DHA and DBA modules on mIoU, inference speed, and FLOPs.
DHA DBA mIoU (%) FPS FLOPs
34.38 39.6 276.59G
- 34.01 42.5 271.24G
- 33.76 41.3 276.57G
- - 32.58 44.7 271.21G

Table 4: The mIoU and FPS of each model without the visible mask.

| Method | Venue | mIoU (%) | FPS |
|---|---|---|---|
| MonoScene | CVPR’22 | 6.06 | 1.2 |
| OccFormer | ICCV’23 | 20.40 | - |
| CTF-Occ | NIPS’24 | 28.53 | - |
| BECDet | arXiv’22 | 18.05 | - |
| SparseOcc | ECCV’24 | 30.64 | 17.7 |
| DA-Occ | - | 31.31 | 27.7 |

Table 5: Model inference speed of simulated edge device.

| Precision | Threads | GPU Memory | FPS |
|---|---|---|---|
| FP32 (A100) | Full | Full | 39.6 |
| FP16 (A100) | 2 | 4GB | 14.8 |

To accelerate experimental evaluation, we conducted ablation experiments on the model without DepthNet and without temporal fusion. To investigate the impact of height information on occupancy prediction, we varied the resolution of the 3D voxel grid and analyzed how vertical geometric granularity influences performance. The $X$ and $Y$ grid sizes have a negligible impact on accuracy (34.38% vs. 34.45% mIoU when going from $32\times 32$ to $64\times 64$), because these axes primarily encode depth information, which is predominantly obtained from the BEV feature map. In contrast, reducing the $Z$ resolution from 16 to 4 lowers mIoU from 34.38% to 32.89%. The results, summarized in Table 2, highlight the crucial role of height-aware geometry in achieving accurate 3D occupancy predictions.

Next, to assess the effectiveness of the DHA and DBA modules in preserving geometric structures, we performed additional ablation studies on the nuScenes dataset, as shown in Table 3. Removing both modules lowers mIoU from 34.38% to 32.58%, while the full model costs only about 5.4 additional GFLOPs (276.59 vs. 271.21) and runs at 39.6 FPS instead of 44.7 FPS. The results confirm that the direction-specific dynamic convolutions in DHA and DBA capture structural features effectively at a modest computational cost.

To further evaluate the robustness of DA-Occ, we conducted experiments without using the Visible Mask. This setting simulates real-world scenarios where the model may not have access to complete or perfect information, enabling us to evaluate the model’s ability to generalize and maintain accuracy under challenging conditions. The results, presented in Table 4, demonstrate that DA-Occ achieves an mIoU of 31.31% without the Visible Mask, indicating its strong resilience to missing data and its robustness in less-than-ideal environments.

Finally, we simulate the inference speed of DA-Occ on edge devices by imposing several hardware constraints to mimic real-world deployment environments. Specifically, we limit the available GPU memory to 4GB (with the remaining memory occupied to simulate concurrent tasks on the edge device) and restrict the GPU’s maximum power consumption to 90W. As shown in Table 5, even under these simulated deployment constraints, DA-Occ still runs at 14.8 FPS with FP16 precision. This further supports the applicability of DA-Occ for real-time inference on edge devices.
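For readers who want to reproduce a throughput figure of this kind, the following is a minimal timing sketch. It is not the mmdetection3d benchmark used in the paper: `measure_fps` and `dummy_inference` are hypothetical names, and the workload is a CPU stand-in for one forward pass at batch size 1.

```python
import time


def measure_fps(run_once, warmup=10, iters=50):
    """Average frames per second of `run_once`, measured after a warm-up phase."""
    for _ in range(warmup):  # warm-up runs are excluded from the timing
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def dummy_inference():
    # Stand-in workload; replace with one forward pass at batch size 1.
    sum(i * i for i in range(10_000))


fps = measure_fps(dummy_inference, warmup=2, iters=20)
print(f"{fps:.1f} FPS")
```

When timing a GPU model, the device must be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch); otherwise asynchronous kernel launches make the measured FPS look higher than it is.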

4.5 Visualization Studies

In the visualization experiments, we adopt FlashOcc [14] as the baseline model, including both non-temporal variants and deeply supervised versions for comparison. A variety of scenarios from the validation set are used to ensure a comprehensive evaluation. All visualizations are generated using the official FlashOcc codebase. As shown in Figure 6, the red box highlights a significant semantic loss along the vertical axis in FlashOcc. In contrast, DA-Occ maintains geometric structures in both horizontal and vertical directions, even within a purely 2D framework. This structural consistency contributes to its higher mIoU.

5 Conclusion

In this work, we propose DA-Occ, a novel directional 2D framework for efficient and accurate 3D occupancy prediction. The method leverages height-aware voxel slicing and incorporates a directional attention mechanism, including Directional Height Attention (DHA) and Directional BEV Attention (DBA). This design effectively preserves geometric structures, particularly along the vertical axis, while maintaining high inference speed. Extensive experiments on the Occ3D-nuScenes dataset demonstrate that DA-Occ achieves a favorable trade-off between accuracy and efficiency. Compared with existing approaches, it shows superior real-time performance and stronger deployment potential. The proposed framework enhances scene understanding in autonomous driving scenarios and provides a practical solution for real-world applications.

Acknowledgments

This work was supported by Nanjing Desay SV Automotive Co., Ltd. under Grant No. TRD2025001.

References

  • [1] Anh-Quan Cao and Raoul De Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.
  • [2] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Advances in Neural Information Processing Systems, 36:64318–64330, 2023.
  • [3] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 1477–1485, 2023.
  • [4] Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, and Jose M Alvarez. Fb-bev: Bev representation from forward-backward view transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6919–6928, 2023.
  • [5] Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, and Jian Yang. Deep height decoupling for precise vision-based 3d occupancy prediction. arXiv preprint arXiv:2409.07972, 2024.
  • [6] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework. Advances in Neural Information Processing Systems, 35:10421–10434, 2022.
  • [7] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
  • [8] Jiawei Hou, Xiaoyan Li, Wenhao Guan, Gang Zhang, Di Feng, Yuheng Du, Xiangyang Xue, and Jian Pu. Fastocc: Accelerating 3d occupancy prediction by fusing the 2d bird’s-eye view and perspective view. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16425–16431. IEEE, 2024.
  • [9] Qihang Ma, Xin Tan, Yanyun Qu, Lizhuang Ma, Zhizhong Zhang, and Yuan Xie. Cotr: Compact occupancy transformer for vision-based 3d occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19936–19945, 2024.
  • [10] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023.
  • [11] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
  • [12] Yulin He, Wei Chen, Siqi Wang, Tianci Xun, and Yusong Tan. Achieving speed-accuracy balance in vision-based 3d occupancy prediction via geometric-semantic disentanglement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3455–3463, 2025.
  • [13] Sungjin Park, Jaeha Song, and Soonmin Hwang. Efficient occupancy prediction with instance-level attention. In 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pages 0103–0107. IEEE, 2025.
  • [14] Zichen Yu, Changyong Shu, Jiajun Deng, Kangjie Lu, Zongdai Liu, Jiangyong Yu, Dawei Yang, Hui Li, and Yan Chen. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058, 2023.
  • [15] Feng Li, Kun Xu, Zhaoyue Wang, Yunduan Cui, Mohammad Masum Billah, and Jia Liu. Instancebev: Unifying instance and bev representation for global modeling. arXiv preprint arXiv:2505.13817, 2025.
  • [16] Zhiqiang Yan, Kun Wang, Xiang Li, Zhenyu Zhang, Jun Li, and Jian Yang. Desnet: Decomposed scale-consistent network for unsupervised depth completion. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 3109–3117, 2023.
  • [17] Zhiqiang Yan, Xiang Li, Le Hui, Zhenyu Zhang, Jun Li, and Jian Yang. Rignet++: Semantic assisted repetitive image guided network for depth completion. International Journal of Computer Vision, pages 1–23, 2025.
  • [18] Zhiqiang Yan, Zhengxue Wang, Kun Wang, Jun Li, and Jian Yang. Completion as enhancement: A degradation-aware selective image guided network for depth completion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26943–26953, 2025.
  • [19] Kun Wang, Zhenyu Zhang, Zhiqiang Yan, Xiang Li, Baobei Xu, Jun Li, and Jian Yang. Regularizing nighttime weirdness: Efficient self-supervised monocular depth estimation in the dark. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16055–16064, 2021.
  • [20] Zhiqiang Yan, Yupeng Zheng, Deng-Ping Fan, Xiang Li, Jun Li, and Jian Yang. Learnable differencing center for nighttime depth perception. Visual Intelligence, 2(1):15, 2024.
  • [21] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
  • [22] Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, and Zeming Li. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1486–1494, 2023.
  • [23] Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, and Shuchang Zhou. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540, 2023.
  • [24] Jinqing Zhang, Yanan Zhang, Qingjie Liu, and Yunhong Wang. Sa-bev: Generating semantic-aware bird’s-eye-view feature for multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3348–3357, 2023.
  • [25] Yichen Xie, Chenfeng Xu, Marie-Julie Rakotosaona, Patrick Rim, Federico Tombari, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. Sparsefusion: Fusing multi-modal sparse representations for multi-sensor 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17591–17602, 2023.
  • [26] Junjie Huang and Guan Huang. Bevpoolv2: A cutting-edge implementation of bevdet toward deployment. arXiv preprint arXiv:2211.17111, 2022.
  • [27] Xiaowei Chi, Jiaming Liu, Ming Lu, Rongyu Zhang, Zhaoqing Wang, Yandong Guo, and Shanghang Zhang. Bev-san: Accurate bev 3d object detection via slice attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17461–17470, 2023.
  • [28] Yuqi Wang, Yuntao Chen, Xingyu Liao, Lue Fan, and Zhaoxiang Zhang. Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17158–17168, 2024.
  • [29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [30] Haokui Zhang, Wenze Hu, and Xiaoyu Wang. Parc-net: Position aware circular convolution with merits from convnets and transformer. In European conference on computer vision, pages 613–630. Springer, 2022.
  • [31] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
  • [32] Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, and Limin Wang. Fully sparse 3d occupancy prediction. In European Conference on Computer Vision, pages 54–71. Springer, 2024.
  • [33] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023.
  • [34] Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443, 2023.