DA-Occ: Efficient 3D Voxel Occupancy Prediction via Directional 2D for Geometric Structure Preservation
Abstract
Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many current methods focus on high accuracy at the expense of real-time processing needs. To address this challenge of balancing accuracy and inference speed, we propose a directional pure 2D approach. Our method involves slicing 3D voxel features to preserve complete vertical geometric information. This strategy compensates for the loss of height cues in Bird’s-Eye View (BEV) representations, thereby maintaining the integrity of the 3D geometric structure. By employing a directional attention mechanism, we efficiently extract geometric features from different orientations, striking a balance between accuracy and computational efficiency. Experimental results highlight the significant advantages of our approach for autonomous driving. On the Occ3D-nuScenes benchmark, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS, effectively balancing accuracy and efficiency. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method’s applicability for real-time deployment in resource-constrained environments.
Keywords: Autonomous Driving, 3D Occupancy Prediction, Directional Attention Mechanism, Geometric Structure Preservation
1 Introduction
Vision-based occupancy prediction aims to estimate both object occupancy and semantic information in 3D voxel space using surround-view images captured by ego-vehicle cameras [1, 2, 3, 4]. Accurate 3D geometry and semantic understanding of the surrounding environment are essential for autonomous driving (AD) [5]. Compared to conventional 3D object detection [6, 7], occupancy prediction offers finer-grained 3D scene understanding and is capable of recognizing general objects by learning their occupancy patterns.

Occupancy prediction offers several advantages, including enhanced scene understanding, improved handling of unknown obstacles, and joint modeling of geometry and semantics. However, many existing methods [9, 10, 11, 5] focus primarily on precision, often at the expense of real-time processing due to the complexity of their model architectures and large parameter counts. Consequently, these methods suffer from high computational costs and latency, making them impractical for deployment in real-time AD systems [12]. To ensure real-time inference speed and facilitate deployment, some approaches adopt 2D methods to predict 3D voxel occupancy [13, 4, 14, 15, 8]. These methods typically compress 3D voxel features into bird’s-eye view (BEV) representations, retaining only horizontal information. As a result, certain semantic and vertical geometric details are sacrificed, which costs accuracy. Figure 1 illustrates the limitations of current methods: highly accurate methods tend to be slow, whereas methods designed for fast inference tend to suffer significant accuracy loss, with performance differences exceeding 10%. This issue arises from the compression of vertical geometric structures, as shown in Figure 2. Methods relying solely on 2D BEV representations struggle to capture fine-grained spatial structures, which are essential for accurate scene understanding [12, 15, 5, 8]. The visualization results further emphasize the importance of preserving vertical geometric features to maintain the overall semantic and structural integrity of the 3D voxel space.

To ensure the preservation of geometric structure and achieve more accurate object representation in 3D coordinates, we propose a novel approach for processing 3D voxel features. Our method involves performing slicing operations on the 3D voxel features, which retains the full vertical (height) geometric structure, and subsequently fuses this information with the 2D BEV feature map. This fusion restores the vertical geometric details that are often lost in conventional 2D BEV representations, providing a more comprehensive understanding of the scene.
To maintain high inference speed while preserving accuracy, we propose a directional attention-based occupancy prediction method (DA-Occ). This approach efficiently captures geometric features in both horizontal and vertical directions, enabling the model to selectively focus on relevant features in different orientations. As a result, DA-Occ optimizes computational performance without compromising the integrity of the 3D voxel structure. With an mIoU of 39.3% and an inference speed of 27.7 FPS on the Occ3D-nuScenes dataset, DA-Occ achieves a balance between high accuracy and fast inference, making it well-suited for real-time applications in autonomous driving.
2 Related Work
2.1 2D-to-3D View Transformation.
Many existing methods utilize estimated depth information to project 2D image features into 3D voxel space [16, 17, 18, 19, 20]. LSS [21] explicitly predicts depth distributions to lift 2D image features into 3D space. More recent approaches [3, 22, 23] have highlighted the importance of accurate depth estimation in the view transformation process. For instance, BEVDepth [3] encodes both intrinsic and extrinsic camera parameters into its depth refinement module for improved 3D object detection. BEVStereo [22] introduces an efficient temporal stereo mechanism to enhance depth estimation quality. In addition, some studies [24, 25, 26, 27] focus on optimizing the projection stage itself. However, existing approaches have largely overlooked the potential of leveraging height prediction as a direct supervisory signal for view transformation. Inspired by DHD [5], our method leverages height scores from 2D images to guide the transformation into 3D voxel space. This enables more accurate height estimation in the voxel representation and lays a solid foundation for subsequent BEV feature reconstruction.
2.2 3D Occupancy Prediction.
3D occupancy prediction reconstructs the 3D scene by predicting a dense voxel grid that encodes spatial occupancy [12]. A straightforward approach is to extend the BEV representation used in 3D object detection into voxel space by lifting BEV features into 3D [14] and applying a segmentation head [11, 4, 8, 13]. However, this strategy disrupts the geometric integrity of the voxel representation due to the lack of vertical structure awareness. Directly processing full 3D voxel features, on the other hand, incurs substantial computational overhead. To address this, COTR [9] introduces a compact voxel representation via downsampling, while PanoOcc [28] proposes a novel panoramic occupancy segmentation task and leverages sparse 3D convolutions to reduce computation. Nonetheless, these methods still suffer from limited inference speed, making them less suitable for real-time deployment. To tackle this challenge, we propose a 2D-based approach that preserves the geometric structure of the voxel space while significantly improving inference efficiency, offering a more deployable and scalable solution for 3D occupancy prediction.
3 Method
3.1 Problem Formulation
In 3D occupancy prediction tasks, the surrounding environment is discretized into a voxel-based volumetric grid [21]. Assuming the ego vehicle is located at the origin of the global coordinate system, the 3D perception range is defined by the bounds , where and denote the lower and upper bounds along the height, width, and depth axes, respectively. The scene volume is partitioned into a grid of shape (e.g., as in [14]). Accordingly, the physical size of a single voxel along each axis is computed as . At inference time, each voxel is assigned a binary occupancy state (occupied or empty), along with a semantic label from a predefined set of categories or an unknown class. The final voxel-wise predictions are derived from visual features extracted from multi-view images, denoted as , where is the number of cameras and is the feature size. Each camera is associated with both intrinsic and extrinsic parameters ( and ), which are utilized to transform 2D image features into the unified 3D coordinate space. These parameters enable precise projection and aggregation of multi-view information into the voxel grid for occupancy and semantic prediction.
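The relation between perception bounds, grid shape, and voxel size can be illustrated with a minimal sketch. The function names are hypothetical, and the numeric values are assumed from the public Occ3D-nuScenes setup rather than taken from this section; the axis ordering (x, y, z) is likewise an assumption.

```python
import math


def voxel_size(lower, upper, grid_shape):
    """Per-axis voxel edge length: (upper - lower) / number of cells."""
    if not (len(lower) == len(upper) == len(grid_shape)):
        raise ValueError("lower, upper and grid_shape must have equal length")
    return tuple((hi - lo) / n for lo, hi, n in zip(lower, upper, grid_shape))


def voxel_index(point, lower, size, grid_shape):
    """Map a metric point to integer voxel indices, or None if out of range."""
    idx = tuple(math.floor((p - lo) / s) for p, lo, s in zip(point, lower, size))
    inside = all(0 <= i < n for i, n in zip(idx, grid_shape))
    return idx if inside else None


# Assumed example values (Occ3D-nuScenes style), ordered (x, y, z) in metres.
LOWER = (-40.0, -40.0, -1.0)
UPPER = (40.0, 40.0, 5.4)
GRID = (200, 200, 16)

size = voxel_size(LOWER, UPPER, GRID)  # roughly 0.4 m along every axis
```

With these assumed values, every axis yields a voxel edge of about 0.4 m, and points outside the bounds map to no voxel at all.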
3.2 Overall Architecture

The overall architecture of DA-Occ is depicted in detail in Fig. 3. It comprises the following key components: (1) Image Backbone: A convolutional neural network (e.g., ResNet [29]) is employed to extract high-level visual features from multi-view input images. (2) Geometry-Aware Feature Encoder: This integrates a depth estimation network (DepthNet) and a height attention network (HeightNet) to enrich the geometric understanding and improve the structural fidelity of subsequent 3D representations. (3) 2D-to-3D View Transformation: Image-plane features are lifted into the 3D voxel space using calibrated projection matrices, enabling spatial alignment across multiple views. (4) Geometry-Aware Feature Decoder: The final decoding stage incorporates geometric priors to enhance 3D structural consistency and improve the granularity of semantic predictions within the voxel grid.
3.3 Directional Attention Mechanism

In order to preserve the consistency of 3D geometric structures, particularly along the vertical axis, which is often underrepresented in 2D BEV-based representations, we introduce the Directional Attention Mechanism. This mechanism is utilized in both the Geometry-Aware Feature Encoder and the Geometry-Aware Feature Decoder, but at different stages of the model pipeline. In the Encoder, it is applied before the 2D-to-3D View Transformation to dynamically acquire height scores, which help enhance the vertical geometric features. After the transformation, in the Decoder, the Directional Height Attention (DHA) and Directional BEV Attention (DBA) modules are employed to refine and reconstruct the vertical and horizontal geometric structures, ensuring a comprehensive and accurate 3D representation. Inspired by ParC-Net [30], which captures long-range dependencies in 2D image space, we propose a novel extension of this mechanism to the 3D voxel domain. Our extension dynamically extracts 3D geometric features, enabling the model to better capture long-range spatial dependencies and enhance the representation of 3D structures. By introducing direction-aware attention that models dependencies along both the vertical and horizontal axes, our approach preserves fine-grained 3D structure within a purely 2D framework. Specifically, DHA is designed to capture detailed height-dependent patterns by processing features along the -axis (vertical), while DBA captures geometric dependencies along the horizontal plane within the BEV representation. The directional attention mechanism can be expressed as , where refers to either the horizontal or vertical direction in feature maps. Figure 4 illustrates the structure of this mechanism. The computation proceeds as:
(1) | ||||
where denotes the convolutional weights dynamically generated by the MLP. The function performs average pooling along direction to compress spatial information. and represent horizontally and vertically concatenated features, respectively. The position encoding is a learnable parameter with a shape of or , and the final convolution integrates both structure and learned weights to generate .
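To make the ingredients named above concrete (directional average pooling, MLP-generated convolution weights, a learnable position encoding, and a wrap-around convolution obtained by concatenating the feature with itself), the following NumPy sketch shows one possible reading. It is an illustrative assumption, not the exact form of Eq. (1): the function names, the two-layer MLP, and the choice of a kernel as long as the processed axis are all hypothetical.

```python
import numpy as np


def circular_conv(x, kernel, axis):
    """Depthwise circular correlation of x (C, H, W) with kernel (C, L) along `axis`."""
    xm = np.moveaxis(x, axis, -1)               # (C, other, L)
    length = xm.shape[-1]
    x_cat = np.concatenate([xm, xm], axis=-1)   # wrap-around via self-concatenation
    out = np.empty_like(xm)
    for i in range(length):
        window = x_cat[..., i:i + length]       # (C, other, L) window starting at i
        out[..., i] = np.einsum("cl,col->co", kernel, window)
    return np.moveaxis(out, -1, axis)


def directional_attention(x, pos, w1, w2, axis):
    """x: (C, H, W); pos: (C, H, 1) for axis=1 or (C, 1, W) for axis=2.

    w1: (L, hidden) and w2: (hidden, L) are the weights of an assumed two-layer MLP,
    where L is the length of the processed axis.
    """
    x = x + pos                                  # inject learnable position encoding
    other = 2 if axis == 1 else 1
    desc = x.mean(axis=other)                    # average pooling over the other axis -> (C, L)
    kernel = np.maximum(desc @ w1, 0.0) @ w2     # input-dependent kernel, shape (C, L)
    return circular_conv(x, kernel, axis)
```

Under this reading, calling the function with `axis=1` processes the vertical direction of a feature map and `axis=2` the horizontal one; the output keeps the input shape.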
3.4 Geometry-Aware Feature Encoder
HeightNet, first introduced in [5], shares a similar structure with the DepthNet module in BEVDepth [3]. It is used to generate a predicted height map , which is supervised by a ground-truth height map to facilitate accurate height estimation. The generation of proceeds as follows: a LiDAR point in the LiDAR coordinate frame is transformed to the ego-vehicle coordinate system as . The corresponding 2D projection onto the image plane is computed using:
(2) |
where is the depth value in the camera coordinate system, denotes the pixel coordinates in homogeneous form, is the intrinsic matrix, and , are the camera extrinsic parameters.
Following [5], we construct a set of tuples , where each pixel location is assigned both a projected depth and a corresponding height in the ego coordinate system. For each pixel, only the point with the smallest depth is retained to eliminate occluded redundancies, resulting in a ground-truth height map that represents the visible uppermost surface in the scene.
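The construction of the ground-truth height map can be sketched as follows. This is a minimal illustration under stated assumptions: the extrinsics `R`, `t` are taken to map ego coordinates to camera coordinates, the ego z-coordinate is taken as height, and the function name is hypothetical.

```python
import numpy as np


def height_map_from_points(points_ego, K, R, t, image_hw):
    """Per pixel, keep the ego-frame height of the point with the smallest depth.

    points_ego: (N, 3) points in the ego frame (z assumed to be height).
    K: (3, 3) intrinsics; R: (3, 3), t: (3,) assumed ego-to-camera extrinsics.
    Returns an (H, W) array with NaN where no point projects.
    """
    h, w = image_hw
    cam = points_ego @ R.T + t                  # ego frame -> camera frame
    depth = cam[:, 2]
    front = depth > 1e-6                        # discard points behind the camera
    cam, depth, height = cam[front], depth[front], points_ego[front, 2]

    uvw = cam @ K.T                             # pinhole projection (homogeneous)
    u = np.floor(uvw[:, 0] / depth).astype(int)
    v = np.floor(uvw[:, 1] / depth).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    hmap = np.full((h, w), np.nan)
    best_depth = np.full((h, w), np.inf)
    for ui, vi, di, hi in zip(u[inside], v[inside], depth[inside], height[inside]):
        if di < best_depth[vi, ui]:             # nearer point wins the pixel
            best_depth[vi, ui] = di
            hmap[vi, ui] = hi
    return hmap
```

The explicit loop keeps the smallest-depth rule easy to verify; a production pipeline would typically vectorize this step.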
Directional Height Attention Network is designed to predict by leveraging structural priors along the vertical (-axis) direction. It incorporates both the geometric shape prior of objects in height and the contextual information from adjacent horizontal locations. To efficiently capture vertical geometry while maintaining lightweight computation, DHA replaces the last two residual blocks in the original DepthNet with directional attention modules. This architectural modification enables more effective extraction of height-sensitive features with minimal additional cost.
The entire Geometry-Aware Feature Encoder process is described as follows: is processed by DepthNet, generating the depth score and the feature map . is then passed through HeightNet, producing the height score .
3.5 2D-to-3D View Transformation
To preserve the integrity of 3D geometric structures while maintaining computational efficiency, we adopt the Lift-Splat-Shoot (LSS) framework [21] for view transformation. Unlike existing approaches such as FlashOcc [14], which rely on a purely 2D transformation grid (e.g., ) and often compress the vertical dimension, resulting in significant geometric distortion, our method takes a different approach. We adopt a lightweight 3D voxel grid (e.g., ) to enrich the BEV feature map with enhanced vertical geometric information. This enables the BEV feature map to retain vertical spatial cues more effectively and recover structural geometry with higher fidelity under similar computational constraints.
Specifically, we construct a 3D voxel grid that enables finer-grained depth and height modeling during the lifting process. The generated 3D voxel features are then sliced and fused along the depth axis (-axis), effectively capturing vertical structure while still allowing the final output to remain in a 2D feature map format for efficient processing. In the Splat stage of LSS, we enhance the lifted 3D voxel representation with the predicted height confidence scores. The resulting voxel projection is formulated as:
(3) |
where denote the scores along the height directions, and represents the -th pixel feature. This design enhances the model’s perception of vertical geometric structures while maintaining real-time inference efficiency, leading to improved spatial feature representation. After constructing the 3D voxel feature volume , we perform slicing along the depth axis (i.e., the -axis) and concatenate the resulting slices along the horizontal axis (-axis). This transforms the 3D voxel structure into a 2D feature map with enhanced perception of vertical (height-wise) geometric structures, facilitating more efficient extraction of height-aware spatial representations. This process is described as:
(4) |
where the concatenation is applied along the spatial -axis, resulting in a feature map
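The slice-and-concatenate step can be sketched in NumPy as below. The tensor layout `(C, A, B, D)`, the axis choices, and the function names are assumptions for illustration; the point of the sketch is that the 2D map keeps every voxel value, so the volume can be recovered exactly.

```python
import numpy as np


def slice_and_concat(volume, slice_axis, concat_axis):
    """Slice a 4D volume (channels first) along one spatial axis and
    concatenate the slices along another spatial axis, giving a 3D (2D + channels) map."""
    n = volume.shape[slice_axis]
    slices = [np.take(volume, i, axis=slice_axis) for i in range(n)]
    # Removing slice_axis shifts every later axis down by one.
    new_axis = concat_axis - 1 if concat_axis > slice_axis else concat_axis
    return np.concatenate(slices, axis=new_axis)


def unslice(fmap, n_slices, axis):
    """Inverse of slice_and_concat for slice_axis=1: split and restack the slices."""
    return np.stack(np.split(fmap, n_slices, axis=axis), axis=1)


# Toy volume with 2 channels and spatial shape (3, 4, 5).
vol = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
fmap = slice_and_concat(vol, slice_axis=1, concat_axis=3)   # shape (2, 4, 15)
```

Because the operation is a pure rearrangement, any 2D operator applied to `fmap` still sees all values along the sliced axis, which is what allows the rest of the pipeline to stay 2D.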
3.6 Geometry-Aware Feature Decoder

The BEV feature map , which primarily encodes the target’s depth-related information, and the sliced height-aware feature map , which captures the target’s vertical structural cues, are obtained following the view transformation stage. To enhance the geometric fidelity of and , we introduce DBA and DHA, which specialize in refining horizontal and vertical geometrical features through direction-aware attention.
Directional-Aware Attention. The captures the relative distance between the ego vehicle and surrounding targets. This spatial relationship can be further decomposed into front-back and left-right directional components within the BEV plane. To better model depth-related geometric structures, we apply a directional attention operator along both the and axes of the BEV feature map. The refined BEV representation is computed as:
(5) |
where the output has the same shape as the input , and the fusion enhances geometric sensitivity in both planar directions.
Due to the previous processing of sliced , it preserves the geometric structure in the vertical direction. The DHA performs direction-aware processing that better aligns with the vertical structure . This directional sensitivity allows it to more effectively capture and refine height-dependent geometric features. The refinement process is defined as:
(6) | ||||
where is the reshaped 3D height-aware feature volume, and is the final height-enhanced feature map obtained by concatenating the axes slices along the channel dimension.
This formulation enables fine-grained recovery of vertical geometry and improves the spatial expressiveness of the height feature representation. The final feature representation is obtained by fusing and , effectively combining fine-grained height-aware and depth-aware information to enhance 3D spatial understanding. The whole process is illustrated in Figure 5.
4 Experiment
4.1 Dataset and Evaluation Metrics
We train and evaluate our model on the Occ3D-nuScenes benchmark [2], which is built upon the nuScenes dataset [31]. The dataset comprises 1,000 video sequences, split into 700 for training, 150 for validation, and 150 for testing. Each keyframe includes a 32-beam LiDAR point cloud, six RGB images from surround-view cameras, and dense voxel-level semantic occupancy annotations. The perception range is defined as [-40 m, -40 m, -1 m, 40 m, 40 m, 5.4 m], and the voxel resolution is set to 0.4 m. Each voxel is assigned one of 18 semantic classes, including 16 predefined object categories, one “others” class for unknown objects, and one empty class representing free space. To assess the performance of our approach, we adopt a metric widely used in 3D occupancy prediction, the mean Intersection-over-Union (mIoU) [32], computed over all semantic categories.
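For reference, mIoU over voxel labels can be computed along the lines of the sketch below. It is a simplified single-sample illustration: the function name and the optional `mask` argument (for example, a camera visibility mask) are assumptions, and official benchmark scripts may accumulate statistics over the whole dataset and treat the empty class differently.

```python
import numpy as np


def mean_iou(pred, target, num_classes, mask=None):
    """Mean IoU over the classes that occur in the prediction or the ground truth."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    if mask is not None:
        keep = np.asarray(mask).ravel().astype(bool)
        pred, target = pred[keep], target[keep]
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent everywhere: skip it instead of scoring 0 or 1
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```

Restricting the evaluation to masked voxels changes which errors are counted, which is why results with and without a visibility mask are not directly comparable.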
4.2 Implementation Details
Following common practice [9, 4, 32], we train our models using 4 NVIDIA H800 GPUs with a batch size of 4 per GPU. Unless otherwise specified, all models are optimized using the AdamW optimizer with a learning rate of and a weight decay of 0.05. During the LSS stage, the resolution of the grid for the depth score is set to , and the resolution of the grid for the height score is set to . Regarding the loss function, we design it based on the characteristics of geometric structure recovery in DA-Occ. Specifically, we follow the formulation proposed in MonoScene [1] and define the overall loss as: , where are set as . For inference, we use a single NVIDIA A100 GPU with a batch size of 1. The frames-per-second (FPS) metric is measured using the official mmdetection3d codebase.
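The general shape of an FPS measurement at batch size 1 is sketched below. This is a generic timing loop, not the mmdetection3d benchmark script; `measure_fps` and its parameters are hypothetical names, and `step` stands for one forward pass.

```python
import time


def measure_fps(step, n_warmup=10, n_iters=50):
    """Average number of `step()` calls per second, excluding warm-up calls."""
    for _ in range(n_warmup):       # warm-up: caches, lazy initialisation, autotuning
        step()
    start = time.perf_counter()
    for _ in range(n_iters):
        step()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed if elapsed > 0 else float("inf")
```

When `step` launches asynchronous GPU work, the device must be synchronized before each clock reading (for example with `torch.cuda.synchronize()` in PyTorch); otherwise the loop measures launch overhead rather than inference time.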
4.3 Main Results
Method | Venue | History Frames | Backbone | Image Size | mIoU (%) | FPS |
---|---|---|---|---|---|---|
MonoScene | CVPR’22 | - | ResNet-101 | 928×1600 | 6.06 | 1.2 |
TPVFormer | CVPR’23 | - | ResNet-101 | 928×1600 | 27.83 | 3.1 |
CTF-Occ | NIPS’24 | - | ResNet-101 | 928×1600 | 28.53 | - |
OccFormer | ICCV’23 | - | ResNet-50 | 256×704 | 20.40 | - |
BEVDetOcc | arXiv’22 | - | ResNet-50 | 256×704 | 31.64 | - |
FlashOcc | arXiv’23 | - | ResNet-50 | 256×704 | 31.95 | 79.4 |
DA-Occ (Ours) | - | - | ResNet-50 | 256×704 | 34.38 | 39.6 |
BEVDetOcc-4D-Stereo | arXiv’22 | 1 | ResNet-50 | 256×704 | 36.01 | 1.8 |
FB-OCC | ICCV’23 | 1 | ResNet-50 | 256×704 | 37.52 | 6.9 |
FlashOcc* | arXiv’23 | 1 | ResNet-50 | 256×704 | 37.84 | 9.1 |
GSD-Occ | AAAI’25 | 1 | ResNet-50 | 256×704 | 34.84 | 22.7 |
EOPIA | ICAIIC’25 | 1 | ResNet-50 | 256×704 | 35.26 | 28.7 |
DA-Occ (Ours)* | - | 1 | ResNet-50 | 256×704 | 38.97 | 7.3 |
DA-Occ (Ours) | - | 1 | ResNet-50 | 256×704 | 35.77 | 31.5 |
FastOCC | ICRA’24 | 16 | ResNet-101 | 320×800 | 37.2 | 10.1 |
COTR | CVPR’24 | 16 | ResNet-50 | 256×704 | 44.5 | 0.9 |
GSD-Occ | AAAI’25 | 16 | ResNet-50 | 256×704 | 39.4 | 20.0 |
DA-Occ (Ours) | - | 16 | ResNet-50 | 256×704 | 39.3 | 27.7 |
Table 1 presents a comparison between DA-Occ and state-of-the-art (SOTA) methods on the Occ3D-nuScenes benchmark. The results are grouped by the number of history frames used for temporal fusion. DA-Occ is efficient while remaining accurate. Notably, even with only a single history frame, DA-Occ* reaches 38.97% mIoU, which is comparable to many multi-frame models. This is largely attributed to the use of the Stereo4D strategy in its temporal fusion module [11]. Without history frames, the proposed method improves mIoU by 2.43 percentage points over FlashOcc (34.38% vs. 31.95%). With 16 history frames fused at the BEV level [32], DA-Occ reaches 39.3% mIoU at 27.7 FPS. These results highlight DA-Occ’s ability to strike an effective balance between efficiency and accuracy. All comparisons are against methods whose contributions lie in the model design itself, for example MonoScene [1], TPVFormer [33], CTF-Occ [2], OccFormer [34], BEVDetOcc [11], FlashOcc [14], FB-OCC [4], GSD-Occ [12], EOPIA [13] (the original paper does not name its model, so we use an abbreviation of its title), and COTR [9]. These results underscore the effectiveness of DA-Occ in preserving geometric structures using a purely 2D-based approach. To further validate this capability, we conduct both ablation studies and qualitative visualizations, which consistently confirm the robustness of DA-Occ in maintaining geometric structural integrity.

4.4 Ablation Studies
| Size of Grid | mIoU (%) | FPS | GFLOPs |
|---|---|---|---|
| | 34.38 | 39.6 | 276.59 |
| | 34.45 | 38.7 | 276.65 |
| | 33.67 | 39.9 | 276.58 |
| | 32.89 | 40.1 | 276.55 |

DHA | DBA | mIoU (%) | FPS | GFLOPs |
---|---|---|---|---|
✓ | ✓ | 34.38 | 39.6 | 276.59 |
✓ | - | 34.01 | 42.5 | 271.24 |
- | ✓ | 33.76 | 41.3 | 276.57 |
- | - | 32.58 | 44.7 | 271.21 |

Method | Venue | mIoU (%) | FPS |
---|---|---|---|
MonoScene | CVPR’22 | 6.06 | 1.2 |
OccFormer | ICCV’23 | 20.40 | - |
CTF-Occ | NIPS’24 | 28.53 | - |
BEVDet | arXiv’22 | 18.05 | - |
SparseOcc | ECCV’24 | 30.64 | 17.7 |
DA-Occ | - | 31.31 | 27.7 |

Precision | Threads | GPU Memory | FPS |
---|---|---|---|
FP32 (A100) | Full | Full | 39.6 |
FP16 (A100) | 2 | 4GB | 14.8 |

To accelerate experimental evaluation, we conducted ablation experiments on the model without DepthNet and without temporal fusion. To investigate the impact of height information on occupancy prediction, we varied the resolution of the 3D voxel grid and analyzed how vertical geometric granularity influences performance. The impact of the and grid size on accuracy is negligible, as it primarily encodes depth information, which is predominantly obtained from the BEV feature map. The results, summarized in Table 2, highlight the crucial role of height-aware geometry in achieving accurate 3D occupancy predictions.
Next, to assess the effectiveness of the DHA and DBA modules in preserving geometric structures, we performed additional ablation studies on the nuScenes dataset, as shown in Table 3. The results confirm that the direction-specific dynamic convolutions in DHA and DBA are effective at capturing structural features: enabling both modules raises mIoU from 32.58% to 34.38%, while FPS drops only from 44.7 to 39.6 and the computational cost grows by about 5 GFLOPs.
To further evaluate the robustness of DA-Occ, we conducted experiments without using the Visible Mask. This setting simulates real-world scenarios where the model may not have access to complete or perfect information, enabling us to evaluate the model’s ability to generalize and maintain accuracy under challenging conditions. The results, presented in Table 4, demonstrate that DA-Occ achieves an mIoU of 31.31% without the Visible Mask, indicating its strong resilience to missing data and its robustness in less-than-ideal environments.
Finally, we simulate the inference speed of DA-Occ on edge devices by imposing several hardware constraints to mimic real-world deployment environments. Specifically, we limit the available GPU memory to 4GB (with other memory being occupied to simulate concurrent tasks on the edge device) and restrict the GPU’s maximum power consumption to 90W. As shown in Table 5, even under these simulated deployment constraints, DA-Occ still runs at 14.8 FPS with FP16 precision and two threads. This further supports the applicability of DA-Occ for real-time inference on edge devices.
4.5 Visualization Studies
In visualization experiments, we adopt FlashOcc [14] as the baseline model, including both non-temporal variants and deeply supervised versions for comparison. A variety of scenarios from the validation set are employed to ensure comprehensive evaluation. All visualizations are generated using the official FlashOcc codebase. As shown in Figure 6, the red box highlights a significant semantic loss along the vertical axis in FlashOcc. In contrast, DA-Occ maintains geometric structures in both horizontal and vertical directions, even within a purely 2D framework. This structural consistency contributes to its higher mIoU.
5 Conclusion
In this work, we propose DA-Occ, a novel directional 2D framework for efficient and accurate 3D occupancy prediction. The method leverages height-aware voxel slicing and incorporates a directional attention mechanism, including Directional Height Attention (DHA) and Directional BEV Attention (DBA). This design effectively preserves geometric structures, particularly along the vertical axis, while maintaining high inference speed. Extensive experiments on the Occ3D-nuScenes dataset demonstrate that DA-Occ achieves a favorable trade-off between accuracy and efficiency. Compared with existing approaches, it shows superior real-time performance and stronger deployment potential. The proposed framework enhances scene understanding in autonomous driving scenarios and provides a practical solution for real-world applications.
Acknowledgments
This work was supported by Nanjing Desay SV Automotive Co., Ltd. under Grant No. TRD2025001.
References
- [1] Anh-Quan Cao and Raoul De Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.
- [2] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Advances in Neural Information Processing Systems, 36:64318–64330, 2023.
- [3] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 1477–1485, 2023.
- [4] Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, and Jose M Alvarez. Fb-bev: Bev representation from forward-backward view transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6919–6928, 2023.
- [5] Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, and Jian Yang. Deep height decoupling for precise vision-based 3d occupancy prediction. arXiv preprint arXiv:2409.07972, 2024.
- [6] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework. Advances in Neural Information Processing Systems, 35:10421–10434, 2022.
- [7] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
- [8] Jiawei Hou, Xiaoyan Li, Wenhao Guan, Gang Zhang, Di Feng, Yuheng Du, Xiangyang Xue, and Jian Pu. Fastocc: Accelerating 3d occupancy prediction by fusing the 2d bird’s-eye view and perspective view. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16425–16431. IEEE, 2024.
- [9] Qihang Ma, Xin Tan, Yanyun Qu, Lizhuang Ma, Zhizhong Zhang, and Yuan Xie. Cotr: Compact occupancy transformer for vision-based 3d occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19936–19945, 2024.
- [10] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023.
- [11] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
- [12] Yulin He, Wei Chen, Siqi Wang, Tianci Xun, and Yusong Tan. Achieving speed-accuracy balance in vision-based 3d occupancy prediction via geometric-semantic disentanglement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3455–3463, 2025.
- [13] Sungjin Park, Jaeha Song, and Soonmin Hwang. Efficient occupancy prediction with instance-level attention. In 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pages 0103–0107. IEEE, 2025.
- [14] Zichen Yu, Changyong Shu, Jiajun Deng, Kangjie Lu, Zongdai Liu, Jiangyong Yu, Dawei Yang, Hui Li, and Yan Chen. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058, 2023.
- [15] Feng Li, Kun Xu, Zhaoyue Wang, Yunduan Cui, Mohammad Masum Billah, and Jia Liu. Instancebev: Unifying instance and bev representation for global modeling. arXiv preprint arXiv:2505.13817, 2025.
- [16] Zhiqiang Yan, Kun Wang, Xiang Li, Zhenyu Zhang, Jun Li, and Jian Yang. Desnet: Decomposed scale-consistent network for unsupervised depth completion. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 3109–3117, 2023.
- [17] Zhiqiang Yan, Xiang Li, Le Hui, Zhenyu Zhang, Jun Li, and Jian Yang. Rignet++: Semantic assisted repetitive image guided network for depth completion. International Journal of Computer Vision, pages 1–23, 2025.
- [18] Zhiqiang Yan, Zhengxue Wang, Kun Wang, Jun Li, and Jian Yang. Completion as enhancement: A degradation-aware selective image guided network for depth completion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26943–26953, 2025.
- [19] Kun Wang, Zhenyu Zhang, Zhiqiang Yan, Xiang Li, Baobei Xu, Jun Li, and Jian Yang. Regularizing nighttime weirdness: Efficient self-supervised monocular depth estimation in the dark. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16055–16064, 2021.
- [20] Zhiqiang Yan, Yupeng Zheng, Deng-Ping Fan, Xiang Li, Jun Li, and Jian Yang. Learnable differencing center for nighttime depth perception. Visual Intelligence, 2(1):15, 2024.
- [21] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
- [22] Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, and Zeming Li. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1486–1494, 2023.
- [23] Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, and Shuchang Zhou. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540, 2023.
- [24] Jinqing Zhang, Yanan Zhang, Qingjie Liu, and Yunhong Wang. Sa-bev: Generating semantic-aware bird’s-eye-view feature for multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3348–3357, 2023.
- [25] Yichen Xie, Chenfeng Xu, Marie-Julie Rakotosaona, Patrick Rim, Federico Tombari, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. Sparsefusion: Fusing multi-modal sparse representations for multi-sensor 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17591–17602, 2023.
- [26] Junjie Huang and Guan Huang. Bevpoolv2: A cutting-edge implementation of bevdet toward deployment. arXiv preprint arXiv:2211.17111, 2022.
- [27] Xiaowei Chi, Jiaming Liu, Ming Lu, Rongyu Zhang, Zhaoqing Wang, Yandong Guo, and Shanghang Zhang. Bev-san: Accurate bev 3d object detection via slice attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17461–17470, 2023.
- [28] Yuqi Wang, Yuntao Chen, Xingyu Liao, Lue Fan, and Zhaoxiang Zhang. Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17158–17168, 2024.
- [29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [30] Haokui Zhang, Wenze Hu, and Xiaoyu Wang. Parc-net: Position aware circular convolution with merits from convnets and transformer. In European conference on computer vision, pages 613–630. Springer, 2022.
- [31] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
- [32] Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, and Limin Wang. Fully sparse 3d occupancy prediction. In European Conference on Computer Vision, pages 54–71. Springer, 2024.
- [33] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023.
- [34] Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443, 2023.