3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

Yung-Hsu Yang1  Luigi Piccinelli1  Mattia Segu1  Siyuan Li1  Rui Huang1,2
 Yuqian Fu3  Marc Pollefeys1,4  Hermann Blum1,5  Zuria Bauer1
1ETH Zürich  2Tsinghua University  3INSAIT  4Microsoft  5University of Bonn
Abstract

Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift the open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries on a geometry prior to improve the generalization of 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D $\rightarrow$ Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at royyang0714.github.io/3D-MOOD.

1 Introduction

Monocular 3D object detection (3DOD) aims to recognize and localize objects in 3D space from a single 2D image by estimating their 3D positions, dimensions, and orientations. Unlike stereo or LiDAR-based methods, monocular 3DOD relies solely on visual cues, making it significantly more challenging yet cost-effective for robotics and AR/VR applications [65, 10, 53, 36, 16].

While many methods [51, 22, 55, 28, 44, 49, 63] focus on improving 3DOD performance in specific domains, Cube R-CNN [4] and Uni-MODE [23] build unified models on the cross-dataset benchmark Omni3D [4], which consolidates six diverse 3D detection datasets [14, 5, 47, 2, 1, 43]. These advancements have driven the evolution of 3DOD from specialized models to more unified frameworks. However, as shown in Fig. 1, most existing methods, including the unified models, operate under an ideal assumption: the training set and testing set share identical scenes and object categories. This limits their generalizability in real-world applications, as they cannot detect novel objects in unseen domains. This challenge motivates us to explore monocular open-set 3D object detection, further pushing the boundaries of existing 3DOD methods.

Figure 1: Open-set Monocular 3D Object Detection. Unlike previous methods focusing on achieving good results in the closed-set setting, we aim to resolve the open-set monocular 3D object detection problem. This challenge requires the model to classify arbitrary objects while precisely localizing them in unseen scenes.

The first step towards open-set monocular 3DOD is identifying the fundamental obstacles underlying this task. Our key observations are as follows: 1) Cross-modality learning is crucial to breaking the limitation of closed vocabulary for novel class classification [41]. However, 3D data lacks rich visual-language pairs, making it significantly more challenging to learn modality alignment and achieve satisfactory open-set results. 2) Robust depth estimation is essential for monocular 3DOD to generalize well across different scenes compared to LiDAR-based methods [56]. However, monocular depth estimation, particularly in novel scenes, is inherently challenging for existing methods.

Given the scarcity of 3D data and text pairs, we propose to bridge the modality gap by lifting open-set 2D detection into open-set 3D detection. Fortunately, recent universal metric monocular depth estimation methods [38, 40, 39, 3, 57] have shown promising generalization across diverse scenes, which opens new opportunities for addressing open-set monocular 3DOD. Specifically, we design a 3D bounding box head to predict the differentiable lifting parameters from 2D object queries and enable the lift of the detected 2D bounding boxes as 3D object detection. This allows us to jointly train the open-set 2D and 3D detectors in an end-to-end (e2e) way, using both 2D and 3D ground truth (GT). Furthermore, we propose the geometry-aware 3D query generation module, which conditions 2D object queries with the camera intrinsics and depth estimation and generates 3D object queries. These 3D queries encode essential geometric information and are used for the 3D bounding box head to improve the model’s accuracy and generalization ability in 3D object detection. Additionally, we design a more effective canonical image space, which proves crucial for handling datasets with varying image resolutions, as demonstrated in our experiments.

Formally, we introduce the first e2e 3D Monocular Open-set Object Detector (3D-MOOD) by integrating the proposed 3D bounding box head, geometry-aware 3D query generation module, and canonical image space into the open-set 2D detector [27]. Our method takes a monocular input image with the language prompts and outputs the 3D object detection for the desired objects in any given scene. Experimental results demonstrate that 3D-MOOD achieves state-of-the-art (SOTA) performance on the challenging closed-set Omni3D benchmark, surpassing all previous task-specific and unified models. More importantly, in open-set settings, i.e. transferring from Omni3D to Argoverse 2 [54] and ScanNet [8], our method consistently outperforms prior models, achieving clear improvements in generalization and novel-class recognition.

Our main contributions are: (1) We explore monocular 3D object detection in open-set settings, establishing benchmarks that account for both novel scenes and unseen object categories; (2) We introduce 3D-MOOD, the first end-to-end open-set monocular 3D object detector, via 2D to 3D lifting, geometry-aware 3D query generation, and canonical image space; (3) We achieve state-of-the-art performance in both closed-set and open-set settings, demonstrating the effectiveness of our method and the feasibility of open-set monocular 3D object detection.

Figure 2: 3D-MOOD. We propose an end-to-end 3D monocular open-set object detector that takes a monocular image and the language prompts of the objects of interest as input, and classifies and localizes the 3D objects in the scene. Our design transforms the input image and camera intrinsics into the proposed canonical image space and achieves open-set ability for diverse scenes.

2 Related Work

2.1 Open-set 2D Object Detection

In recent years, there has been tremendous progress in 2D object detection [61, 60, 27, 64, 7, 13, 35] by leveraging language models [9] or visual-language foundation models [41] to detect and classify objects from language queries. Among varying definitions of these works, i.e. open-set object detection, open-world object detection, and open-vocabulary detection, we do not distinguish them in this section and describe the formal problem definition in Sec. 3.1.

OVR-CNN [61] aligns ResNet [15] and BERT [9] features to detect novel classes, while OV-DETR [60] uses CLIP [41] image and text embeddings with Detection Transformer [6]. GLIP [21] presents grounded language-image pre-training to align object detection and captions. Detic [66] leverages image-level labels to align object reasoning and text to enable tens of thousands of concepts for classification.

In contrast, G-DINO [27] deeply fuses image and text features at all stages of the detection model [62] and proposes language-guided query generation to allow the open-set 2D detector to detect the desired object classes according to input language prompts. This is more natural and intuitive for humans and can help robots understand scenes in many applications. However, in 3D monocular object detection, the lack of data annotations in 3D due to the cost also increases the difficulty of tackling the open-set classification with visual-language alignment. Thus, in this work, we aim to propose a framework that can universally lift the open-set 2D detection to 3D to address the limitation of the data annotation for open-set classification.

2.2 3D Monocular Object Detection

3D monocular object detection is crucial for autonomous driving and indoor robotic navigation. In the past years, a large number of works [17, 36, 22, 55, 28, 12, 25, 49, 32] proposed various methods to address 3D multi-object detection for specific scenes, i.e. one model for one dataset. Recently, a challenging dataset called Omni3D [4] was proposed, providing a new direction for 3D monocular object detection. It consolidates six popular datasets: outdoor scenes (KITTI [14] and nuScenes [5]), indoor scenes (ARKitScenes [2], Hypersim [43], and SUN-RGB-D [47]), and an object-centric dataset (Objectron [1]). Cube R-CNN [4] proposes virtual depth to address the various focal lengths across the diverse training datasets. Uni-MODE [23] proposes the domain confidence to jointly train the Bird's-Eye-View (BEV) detector on indoor and outdoor datasets.

Although these methods work well on Omni3D, they are still limited by the closed-set classification design and hence lack the ability to detect novel categories. To address this, OVM3D-Det [18] proposes a pipeline to generate pseudo GT for novel classes by using 2D foundation models [27, 19, 38] with Large Language Model (LLM) priors. However, when the quality of the pseudo GT is evaluated on open-set benchmarks, the performance is limited because the pipeline cannot be e2e trained with 3D data. In contrast, our method is designed to estimate the differentiable lifting parameters of the open-set 2D detection with geometry prior. Thus, it can be supervised in an e2e manner while no longer being constrained by closed-set classification. Furthermore, to address open-set regression in 3D, we use the canonical image space to better train 3D detectors across datasets. With our proposed components, 3D-MOOD outperforms these prior works on both closed-set and open-set benchmarks.

3 Method

We aim to propose the first e2e open-set monocular 3D object detector that can be generalized to different scenes and object classes. We first discuss the problem setup in Sec. 3.1 to define the goal of monocular open-set 3D object detection. Then, we introduce the overall pipeline of our proposed open-set monocular 3D object detector, 3D-MOOD, in Sec. 3.2. We illustrate our 3D bounding box head design in Sec. 3.3 and introduce the proposed canonical image space for training monocular 3DOD models across datasets in Sec. 3.4. In Sec. 3.5, we introduce the metric monocular auxiliary depth head, which enhances 3D-MOOD by providing a more comprehensive understanding of the global scene. Finally, in Sec. 3.6, we illustrate the proposed geometry-aware 3D query generation, designed to improve generalization in both closed-set and open-set settings.

3.1 Problem Setup

The goal of 3D monocular open-set object detection is to detect any objects in any image, given a language prompt for the objects of interest. To achieve this, one needs to extend the concept of open-set beyond the distinction of seen (base) and unseen (novel) classes within the same dataset [61]. We follow the protocol of G-DINO [27], which trains the model on other datasets but tests on COCO, which contains base and novel classes in unseen domains. In this work, we aim to extend this research direction to 3DOD. Thus, our main focus is on how to train the open-set detectors using the largest and most diverse pre-training data to date, i.e. Omni3D, and achieve good performance on unseen datasets, e.g. Argoverse 2 and ScanNet.

3.2 Overall Architecture

As shown in Fig. 2, we address monocular open-set 3DOD by lifting open-set 2D detection. Formally, we estimate 2D bounding boxes $\mathbf{\hat{D}}_{\text{2D}}$ from an input image $\mathbf{I}$ and language prompts $\mathbf{T}$, and lift them to 3D oriented bounding boxes $\mathbf{\hat{D}}_{\text{3D}}$ in the corresponding camera coordinate frame with the object classes $\mathbf{\hat{C}}$. A 2D box is defined as $\mathbf{\hat{b}}_{\text{2D}}=[\hat{x}_{1},\hat{y}_{1},\hat{x}_{2},\hat{y}_{2}]$ in pixel coordinates, where $\mathbf{\hat{b}}_{\text{2D}}\in\mathbf{\hat{D}}_{\text{2D}}$. A 3D bounding box is defined as $\mathbf{\hat{b}}_{\text{3D}}=[\hat{x},\hat{y},\hat{z},\hat{w},\hat{l},\hat{h},\hat{R}]$, where $\mathbf{\hat{b}}_{\text{3D}}\in\mathbf{\hat{D}}_{\text{3D}}$. Here, $[\hat{x},\hat{y},\hat{z}]$ is the 3D location in camera coordinates, $[\hat{w},\hat{l},\hat{h}]$ are the object's dimensions (width, length, and height), and $\hat{R}\in\mathrm{SO(3)}$ is the rotation matrix of the object.

We choose G-DINO [27] as our 2D open-set object detector for its early visual-language feature fusion design. On top of it, we build 3D-MOOD with the proposed 3D bounding box head, canonical image space, and geometry-aware 3D query generation module for end-to-end open-set 3D object detection. We use an image encoder [30] to extract image features $\mathbf{q}_{\text{image}}$ from $\mathbf{I}$ and a text backbone [9] to extract text features $\mathbf{q}_{\text{text}}$ from $\mathbf{T}$. Then, following detection transformer architectures [6, 67, 62], we pass $\mathbf{q}_{\text{image}}$ and $\mathbf{q}_{\text{text}}$ to the transformer [50] encoder with early visual-language feature fusion [21]. The image and text features are then used in the proposed language-guided query selection to generate encoder detection results $\mathbf{\hat{D}}_{\text{2D}}^{\text{enc}}$ and bounding box queries $\mathbf{q}_{\text{2d}}^{0}$ for the decoder. Each cross-modality transformer decoder layer $\mathrm{TrD}_{i}$ uses a text cross-attention $\mathrm{CA}_{\text{text}}^{i}$ and an image cross-attention $\mathrm{CA}_{\text{image}}^{i}$ to combine $\mathbf{q}_{\text{2d}}^{i}$ with the multi-modality information as:

$$
\begin{split}
\mathbf{q}_{\text{2d}}^{i} &= \mathrm{CA}_{\text{text}}^{i}(\mathrm{SA}^{i}(\mathbf{q}_{\text{2d}}^{i}),\mathbf{q}_{\text{text}}),\\
\mathbf{q}_{\text{2d}}^{i+1} &= \mathrm{FFN}^{i}(\mathrm{CA}_{\text{image}}^{i}(\mathbf{q}_{\text{2d}}^{i},\mathbf{q}_{\text{image}})),
\end{split}
\tag{1}
$$

where $i$ ranges from $0$ to $l-1$ and $\mathrm{FFN}$ stands for feed-forward neural network. The bounding box queries $\mathbf{q}_{\text{2d}}^{i}$ of each layer are decoded into 2D bounding box predictions $\mathbf{\hat{D}}_{\text{2D}}^{i}$ by the 2D box head as $\mathbf{\hat{D}}_{\text{2D}}^{i}=\mathrm{MLP}_{\text{2D}}^{i}(\mathbf{q}_{\text{2d}}^{i})$, where $\mathrm{MLP}$ stands for Multi-Layer Perceptron. The object classes $\mathbf{\hat{C}}$ are estimated based on the similarity between $\mathbf{q}_{\text{2d}}^{i}$ and the input text embeddings.
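To make the similarity-based classification concrete, the following is a minimal sketch rather than the actual implementation. It assumes that the similarity is a plain dot product between a query and each text embedding, squashed by a sigmoid, and that one embedding is available per class name; the function name `classify_queries` and the toy vectors are hypothetical.

```python
import math


def classify_queries(queries, text_embeddings, class_names):
    """Assign each object query the class with the most similar text embedding.

    Assumption: similarity is a dot product followed by a sigmoid. The
    scoring used in the actual model may differ.
    """
    results = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, t)) for t in text_embeddings]
        scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((class_names[best], scores[best]))
    return results


# Toy example: two queries, two prompted classes (values are made up).
queries = [[1.0, 0.0], [0.0, 2.0]]
text_embeddings = [[2.0, 0.0], [0.0, 1.0]]
print(classify_queries(queries, text_embeddings, ["chair", "car"]))
```

Because the class set is defined only by the text embeddings passed in, changing the prompts changes the label space without touching the detector weights, which is what makes the classification open-set.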

3.3 3D Bounding Box Head

Given the estimated 2D bounding boxes $\mathbf{\hat{D}}_{\text{2D}}$ and the corresponding object queries, our 3D bounding box head predicts the 3D properties of $\mathbf{\hat{D}}_{\text{2D}}$ to lift them and obtain $\mathbf{\hat{D}}_{\text{3D}}$ in the camera coordinate frame.

3D Localization. To localize the 3D center of the 3D bounding boxes in camera coordinates, 3D-MOOD predicts the projected 3D center and the metric depth of the 3D center of the object, as in [17, 12, 4]. More specifically, we predict $[\hat{u},\hat{v}]$ as the offset between the projected 3D center and the center of the 2D detection. We lift the projected center to camera coordinates with the given camera intrinsics $\mathbf{K}$ and the estimated metric depth $\hat{z}$ of the 3D bounding box center. Our 3D bounding box head estimates a scaled logarithmic depth, denoted $\hat{d}$, with depth scale $s_{\text{depth}}$. Thus, the metric depth is obtained as $\hat{z}=\exp(\hat{d}/s_{\text{depth}})$ during inference.
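The localization step can be sketched as follows, under stated assumptions: the offset $[\hat{u},\hat{v}]$ is taken to be in pixels and added to the 2D box center, and $\mathbf{K}$ is a skew-free pinhole matrix. The function name `lift_center` and the numeric values in the example (including the depth scale) are hypothetical.

```python
import math


def lift_center(box_2d, offset_uv, d_hat, K, s_depth):
    """Back-project the predicted projected 3D center into camera coordinates.

    box_2d:    [x1, y1, x2, y2] in pixels.
    offset_uv: predicted (u, v) offset from the 2D box center (assumed pixels).
    d_hat:     scaled logarithmic depth, i.e. s_depth * ln(z).
    K:         3x3 pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    x1, y1, x2, y2 = box_2d
    # Projected 3D center = 2D box center + predicted offset.
    u = 0.5 * (x1 + x2) + offset_uv[0]
    v = 0.5 * (y1 + y2) + offset_uv[1]
    # Metric depth from the scaled logarithmic prediction.
    z = math.exp(d_hat / s_depth)
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    # Inverse pinhole projection.
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
center = lift_center([300.0, 200.0, 340.0, 280.0], (10.0, -5.0),
                     d_hat=2.0 * math.log(10.0), K=K, s_depth=2.0)
print(center)
```

Predicting the depth in scaled log space keeps the regression target in a compact range for both near indoor objects and distant outdoor objects.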

3D Object Dimensions. To estimate the dimensions of arbitrary 3D objects, we follow [17, 12] and directly predict dimensions instead of using a pre-computed category prior as in [4]. Our bounding box head predicts the scaled logarithmic dimensions $[s_{\text{dim}}\times\ln\hat{w},\ s_{\text{dim}}\times\ln\hat{l},\ s_{\text{dim}}\times\ln\hat{h}]$ as the output space, and the width, length, and height are obtained during inference by dividing by the scale $s_{\text{dim}}$ and applying the exponential.

3D Object Orientation. Unlike [17, 12], we follow [20] to predict a 6D parameterization of $\hat{R}$, denoted as $\hat{\text{rot}}_{6d}$, instead of only estimating yaw as in autonomous driving scenes.
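A common way to turn a 6D rotation parameterization into a matrix in $\mathrm{SO(3)}$ is Gram-Schmidt orthogonalization of two 3-vectors. The sketch below assumes that construction and assumes that the two vectors form the first two columns of the rotation matrix; whether [20] uses exactly this convention is not stated here. The function name `rot6d_to_matrix` is hypothetical.

```python
import math


def rot6d_to_matrix(rot6d):
    """Map 6 numbers to a 3x3 rotation matrix via Gram-Schmidt.

    Assumption: rot6d[:3] and rot6d[3:] are (unnormalized) versions of the
    first two columns of R; the third column is their cross product.
    """
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    a1, a2 = list(rot6d[:3]), list(rot6d[3:])
    b1 = normalize(a1)
    # Remove the component of a2 along b1, then normalize.
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = normalize([y - proj * x for x, y in zip(b1, a2)])
    # Right-handed third axis.
    b3 = [b1[1] * b2[2] - b1[2] * b2[1],
          b1[2] * b2[0] - b1[0] * b2[2],
          b1[0] * b2[1] - b1[1] * b2[0]]
    # Stack b1, b2, b3 as columns.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


print(rot6d_to_matrix([2.0, 0.0, 0.0, 1.0, 3.0, 0.0]))
```

Any six numbers with two non-parallel halves map to a valid rotation, so the head can regress them without constraints, and full 3-DoF orientation is covered rather than yaw alone.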

Following the detection transformer (DETR) [6]-like architecture design, we use an MLP as the 3D bounding box head to estimate the $12$-dimensional 3D properties from the 2D object queries $\mathbf{q}_{\text{2d}}^{i}$ for each transformer decoder layer $i$. The 3D detection $\mathbf{\hat{D}}_{\text{3D}}^{i}$ for each layer is estimated by separate 3D bounding box heads ($\mathrm{MLP}_{\text{3D}}^{i}$) as:

$$
\mathbf{\hat{D}}_{\text{3D}}^{i}=\mathbf{Lift}(\mathrm{MLP}_{\text{3D}}^{i}(\mathbf{q}_{\text{2d}}^{i}),\mathbf{\hat{D}}_{\text{2D}}^{i},\mathbf{K}),
\tag{2}
$$

where $\mathbf{Lift}$ denotes obtaining the final 3D detections in camera coordinates by lifting the projected 3D center and combining it with the estimated dimensions and rotation.
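As an illustration of the last part of the lifting, the sketch below decodes scaled logarithmic dimensions and assembles the eight corners of an oriented box from a center, dimensions, and a rotation matrix. It is a sketch under assumptions: mapping $(w, l, h)$ to the box's local $(x, y, z)$ axes is an arbitrary choice made here, and the names `decode_dims` and `box_corners` are hypothetical.

```python
import math
from itertools import product


def decode_dims(scaled_log_dims, s_dim):
    """Invert the scaled-log encoding: value = s_dim * ln(dim)."""
    return [math.exp(v / s_dim) for v in scaled_log_dims]


def box_corners(center, dims, R):
    """Return the 8 corners of an oriented 3D box in camera coordinates.

    Assumption: dims = (w, l, h) are laid out along the local x, y, z axes
    in that order; the real axis convention may differ.
    """
    half = [d / 2.0 for d in dims]
    corners = []
    for signs in product((-1.0, 1.0), repeat=3):
        local = [s * h for s, h in zip(signs, half)]
        # Rotate the local corner and translate it to the box center.
        corners.append(tuple(
            center[i] + sum(R[i][j] * local[j] for j in range(3))
            for i in range(3)
        ))
    return corners


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dims = decode_dims([0.0, 2.0 * math.log(2.0), 2.0 * math.log(4.0)], s_dim=2.0)
for corner in box_corners((0.0, 0.0, 10.0), dims, identity):
    print(corner)
```

Every step is a composition of differentiable operations (exponential, matrix product, addition), which is what allows losses on the 3D outputs to be back-propagated into the head.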

Figure 3: Canonical Image Space. We compare different resizing and padding strategies. It is worth noting that, for previous methods, the same image will have the same camera intrinsics $\mathrm{K}$ despite having very different image resolutions.

3.4 Canonical Image Space

To train the model across datasets whose images have different resolutions, previous works [4, 23, 27] resize either the short or the long edge to a particular value and then use right and bottom padding to align the image resolution within each training batch. However, as shown in Fig. 3, these methods pad heavily with zeros when the images in a batch have very different resolutions, and they leave the camera intrinsics unchanged. This not only wastes computation on uninformative padding but also pairs the same camera intrinsics $\mathbf{K}$ with different image resolutions between training and inference time, while also breaking the center projection assumption.

As illustrated in [57], the ambiguity among image, camera intrinsics, and metric depth confuses the depth estimation model during training with multiple datasets. Thus, we propose the canonical image space, in which the model has a unified observation at both training and test time. We use a fixed input image resolution $[\mathbf{H}_{c}\times\mathbf{W}_{c}]$ and resize the input images and intrinsics so that the height or width reaches $\mathbf{H}_{c}$ or $\mathbf{W}_{c}$ while keeping the original aspect ratio. Then, we center pad the images to $[\mathbf{H}_{c}\times\mathbf{W}_{c}]$ with value $0$ and update the camera intrinsics accordingly. This alignment is necessary for the model to learn universal settings that are consistent across training and test time, and we demonstrate its effectiveness in closed-set and open-set experiments. We show more details in the supplementary material.
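The resize-and-center-pad procedure and the matching intrinsics update can be sketched as below. This is an illustration under assumptions, not the released preprocessing: it ignores half-pixel offset conventions, assumes a skew-free pinhole $\mathbf{K}$, and uses a hypothetical function name `to_canonical` and made-up resolutions.

```python
def to_canonical(h, w, K, canon_h, canon_w):
    """Compute the resize, center padding, and intrinsics for a canonical space.

    The image is scaled (keeping aspect ratio) until its height or width
    reaches the canonical size, then zero-padded symmetrically.
    """
    scale = min(canon_h / h, canon_w / w)
    new_h, new_w = round(h * scale), round(w * scale)
    pad_top = (canon_h - new_h) // 2
    pad_left = (canon_w - new_w) // 2
    pad_bottom = canon_h - new_h - pad_top
    pad_right = canon_w - new_w - pad_left
    # Use the realized per-axis scale so rounding does not bias the intrinsics.
    sx, sy = new_w / w, new_h / h
    K_canon = [
        [K[0][0] * sx, 0.0, K[0][2] * sx + pad_left],  # fx, cx shift by left pad
        [0.0, K[1][1] * sy, K[1][2] * sy + pad_top],   # fy, cy shift by top pad
        [0.0, 0.0, 1.0],
    ]
    return {
        "resized": (new_h, new_w),
        "padding": (pad_top, pad_bottom, pad_left, pad_right),
        "K": K_canon,
    }


K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
print(to_canonical(480, 640, K, canon_h=960, canon_w=1600))
```

In this example the principal point starts at the image center and remains at the center of the padded canvas, which right and bottom padding would not preserve.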

3.5 Auxiliary Metric Depth Estimation

A significant challenge in monocular 3DOD is accurately localizing objects in 3D. 3D object localization is directly tied to the localization in the image plane and the object metric depth, making the metric depth estimation sub-task crucial for 3DOD. Previous methods [36, 55, 26] have emphasized the importance of incorporating an auxiliary depth estimation head to improve 3D localization. However, achieving accurate depth localization becomes more difficult when attempting to generalize depth estimation across different datasets. Recent methods [38, 39, 57] demonstrate that universal depth estimation can be tackled by leveraging the camera information. While Cube R-CNN [4] obtains its virtual depth with an approach similar to Metric3D [57], we argue that conditioning depth features on camera intrinsics yields a more robust solution. This approach avoids being limited by variations in camera models and enhances generalizability. To this end, we design an auxiliary depth estimation head conditioned on the camera information, as proposed in UniDepth [38, 40], to achieve generalizable monocular depth estimation.

In particular, our model architecture incorporates an additional Feature Pyramid Network (FPN) [24] to extract depth features $\mathbf{F}$ from the image backbone [30]. We rescale them to $1/16$ of the input image height $H$ and width $W$ and generate the depth features $\mathbf{F}_{16}^{d}$ using a Transformer block [50]. We condition $\mathbf{F}_{16}^{d}$ using camera embeddings $\mathbf{E}$, as described in [38]. We then upsample the depth features to $1/8$ of the input image height and width, i.e. $\mathbf{F}_{8}^{d}|\mathbf{E}$, to estimate the metric depth with a convolutional block. We generate the scaled logarithmic depth prediction $\hat{d}_{\text{full}}$ with the same depth scale $s_{\text{depth}}$ as our 3D bounding box head. Thus, the final metric depth $\hat{z}_{\text{full}}$ is obtained as $\hat{z}_{\text{full}}=\exp(\hat{d}_{\text{full}}/s_{\text{depth}})$.

3.6 Geometry-aware 3D Query Generation

To ensure that the 3D bounding box estimation generalizes to diverse scenes, we propose a geometry-aware 3D query generation to condition the 2D object query $\mathbf{q}_{\text{2d}}$ with the learned geometry prior. First, we use the camera embeddings $\mathbf{E}$ from our auxiliary depth head to make the model aware of the scene-specific prior via a cross-attention layer. Due to the sparsity of the 3D bounding box annotations compared to the per-pixel depth supervision, we further leverage the depth features $\mathbf{F}_{8}^{d}|\mathbf{E}$ to condition the object query. This allows us to align the metric depth prediction and 3D bounding box estimation while leveraging the learned depth estimation. Our full geometry-aware query generation produces the 3D box queries $\mathbf{q}_{\text{3d}}$ as:

$$
\begin{split}
\mathbf{q}_{\text{3d}}^{i} &= \mathrm{FFN}_{\text{cam}}^{i}(\mathrm{CA}_{\text{cam}}^{i}(\mathrm{SA}_{\text{cam}}^{i}(\mathbf{q}_{\text{2d}}^{i}),\mathbf{E})),\\
\mathbf{q}_{\text{3d}}^{i} &= \mathrm{FFN}_{\text{depth}}^{i}(\mathrm{CA}_{\text{depth}}^{i}(\mathrm{SA}_{\text{depth}}^{i}(\mathbf{q}_{\text{3d}}^{i}),\mathbf{F}_{8}^{d}|\mathbf{E})).
\end{split}
\tag{3}
$$

We replace the 2D object queries in Eq. 2 with the generated 3D queries $\mathbf{q}_{\text{3d}}^{i}$ for each decoder layer as

$$
\mathbf{\hat{D}}_{\text{3D}}^{i}=\mathbf{Lift}(\mathrm{MLP}_{\text{3D}}^{i}(\mathbf{q}_{\text{3d}}^{i}),\mathbf{\hat{D}}_{\text{2D}}^{i},\mathbf{K}).
\tag{4}
$$

It is worth noting that we detach the gradient from the cross attention between 3D query and depth features to stabilize the training. We validate our geometry-aware 3D query generation in our ablation studies for both closed-set and open-set settings. The results suggest that incorporating geometric priors enhances model convergence during closed-set multi-dataset training and improves the robustness of 3D bounding box estimation in real-world scenarios.
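To make the role of the intrinsics $\mathbf{K}$ in Eq. 4 more concrete, the following is a minimal sketch of pinhole back-projection, one standard ingredient of lifting a 2D detection into 3D. It assumes the 3D head provides a projected object center in pixels and a metric depth; the function name `backproject_center`, this parameterization, and the example intrinsics are illustrative assumptions and not the exact definition of $\mathbf{Lift}$.

```python
from typing import Sequence, Tuple


def backproject_center(
    u: float, v: float, depth: float, K: Sequence[Sequence[float]]
) -> Tuple[float, float, float]:
    """Back-project a pixel (u, v) with metric depth to a camera-frame 3D point.

    Assumes a pinhole camera without skew or distortion, with
    K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth


# Hypothetical intrinsics and prediction, for illustration only.
K = [[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]]
center_3d = backproject_center(u=740.0, v=360.0, depth=10.0, K=K)
```

In this toy case, a center 100 pixels to the right of the principal point at 10 m depth maps to a lateral offset of 1 m, which shows why consistent intrinsics matter when training across datasets.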

3.7 Training Loss

We train 3D-MOOD with 2D losses $L_{\text{2D}}$, 3D losses $L_{\text{3D}}$, and an auxiliary depth loss $L_{\text{depth}}^{\text{aux}}$ in conjunction. For the 2D losses, we follow MM G-DINO [64] and use the L1 loss and GIoU [42] loss for 2D bounding box regression, and a contrastive loss between predicted objects and language tokens for bounding box classification, as in GLIP [21]. For the 3D losses, we use the L1 loss to supervise each estimated 3D property. We compute the 2D and 3D losses for each transformer decoder layer $i$ and obtain $L_{\text{2D}}^{i}$ and $L_{\text{3D}}^{i}$. For auxiliary depth estimation, we refer to each original dataset of Omni3D to obtain the depth GT, or use the projected LiDAR points or structure-from-motion (SfM) [45, 46] points. We use the scale-invariant log loss [11] as the auxiliary depth loss $L_{\text{depth}}^{\text{aux}}$ with $\lambda_{\text{depth}}$ as the loss weight for supervision. Finally, we set the loss weights for 2D and 3D detection to $1.0$ and $\lambda_{\text{depth}}$ to $10$ and obtain the final loss $L_{\text{final}}$ as

$$L_{\text{final}}=\sum_{i=0}^{l}(L_{\text{2D}}^{i}+L_{\text{3D}}^{i})+\lambda_{\text{depth}}L_{\text{depth}}^{\text{aux}}.\tag{5}$$
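The aggregation in Eq. 5 can be sketched in a few lines of plain Python. The scale-invariant log loss below follows the form given by Eigen et al. [11]; the balancing factor `lam = 0.5`, the convention that non-positive GT values mark missing depth (as with sparse LiDAR or SfM points), and the function names are assumptions made for this illustration, not details taken from the paper.

```python
import math
from typing import Sequence


def silog_loss(pred: Sequence[float], gt: Sequence[float], lam: float = 0.5) -> float:
    """Scale-invariant log loss in the form of Eigen et al. [11].

    Entries with gt <= 0 are treated as missing and skipped (assumption).
    Predictions are expected to be strictly positive.
    """
    d = [math.log(p) - math.log(g) for p, g in zip(pred, gt) if g > 0]
    n = len(d)
    if n == 0:
        return 0.0
    return sum(x * x for x in d) / n - lam * (sum(d) ** 2) / (n * n)


def final_loss(
    loss_2d: Sequence[float],
    loss_3d: Sequence[float],
    depth_aux: float,
    lambda_depth: float = 10.0,
) -> float:
    """Eq. 5: sum the per-layer 2D and 3D losses, then add the weighted depth loss."""
    assert len(loss_2d) == len(loss_3d)
    return sum(a + b for a, b in zip(loss_2d, loss_3d)) + lambda_depth * depth_aux


# Toy values: the prediction is exactly twice the GT; the last pixel has no GT.
pred = [2.0, 4.0, 8.0, 5.0]
gt = [1.0, 2.0, 4.0, 0.0]
depth_aux = silog_loss(pred, gt)
total = final_loss([1.0, 2.0], [0.5, 0.5], depth_aux)
```

With `lam = 1.0`, a pure global scale error such as the one above yields a loss of zero, which is the sense in which the loss is scale-invariant; smaller values of `lam` still penalize part of the scale error.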

4 Experiments

We first describe our implementation details for 3D-MOOD and datasets in Sec. 4.1 and discuss the evaluation metrics in Sec. 4.2. Then, we show the open-set, cross-domain, and closed-set results in Sec. 4.3, Sec. 4.4, and Sec. 4.5, and analyze the results of ablation studies in Sec. 4.6. We show some qualitative results in Sec. 4.7 and more in the supplementary material.

4.1 Implementation Details

Model. We use Vis4D [59] as the framework to implement 3D-MOOD in PyTorch [37] and CUDA [33]. We train the full model for $120$ epochs with a batch size of $128$ and set the initial learning rate to $0.0004$, following [64]. For the ablation studies, we train the model for $12$ epochs with a batch size of $64$. We choose $800\times 1333$ as our canonical image shape, as described in Sec. 3.4. During training, we use random resizing with scales in $[0.75,1.25]$ and random horizontal flipping with a probability of $0.5$ as data augmentation. We decay the learning rate by a factor of $10$ at epochs $8$ and $11$ for the $12$-epoch setting and at epochs $80$ and $110$ for the $120$-epoch setting.
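The step schedule described above can be written as a small function. This is a sketch under the assumption that epochs are counted from zero and that the decay applies from the milestone epoch onward; the function name `learning_rate` is hypothetical and the paper does not state the indexing convention.

```python
from typing import Sequence


def learning_rate(
    epoch: int,
    base_lr: float = 4e-4,
    milestones: Sequence[int] = (8, 11),
    gamma: float = 0.1,
) -> float:
    """Step schedule: multiply the rate by `gamma` once per milestone reached."""
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma**num_decays


# 12-epoch ablation setting and 120-epoch full setting.
ablation_schedule = [learning_rate(e) for e in range(12)]
full_schedule = [learning_rate(e, milestones=(80, 110)) for e in range(120)]
```

Under these assumptions the rate stays at 0.0004 until the first milestone, drops to 0.00004, and ends at 0.000004 after the second milestone in both settings.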

Closed-set Data. We use Omni3D [4] as training data, which contains six popular monocular 3D object detection datasets, i.e., KITTI [14], nuScenes [5], SUN RGB-D [47], Objectron [1], ARKitScenes [2], and Hypersim [43]. There are $176573$ training images, $19127$ validation images, and $39452$ testing images with $98$ classes. Following [4, 23], we use the training and validation splits of Omni3D [4] with $50$ classes for training and test the model on the test split.

Open-set Data. We choose two challenging datasets for indoor and outdoor scenes as the open-set monocular 3D object detection benchmarks. For outdoor settings, we use the validation split of the Argoverse 2 (AV2) [54] Sensor Dataset as the benchmark. We sample $4806$ images from the ring front-center camera, which provides portrait resolution ($2048\times 1550$), and evaluate all the official classes that appear in the validation set. For indoor settings, we use the validation split of ScanNet [8] with the official $18$ classes as the indoor benchmark. We uniformly sample $6240$ images with $968\times 1296$ resolution along with the axis-aligned 3D bounding boxes. We provide more details in the supplementary material.

4.2 Evaluation

We use the average precision (AP) metric to evaluate the performance of 2D and 3D detection results. Omni3D [4] matches the predictions and GT by computing the intersection-over-union ($\text{IoU}_{\text{3D}}$) of 3D cuboids. The mean 3D AP, i.e., $\text{AP}_{\text{3D}}$, is reported across classes and over a range of $\text{IoU}_{\text{3D}}$ thresholds $\in[0.05,0.1,\dots,0.5]$. However, this matching criterion is too restrictive for small or thin objects in monocular object detection, especially in open-set scenarios. In Fig. 4, we report the difference between matching criteria over three classes and methods under open-set settings. The performance on large objects, such as Regular Vehicles (cars), remains consistent between center-distance (CD) based and IoU-based matching. However, for smaller objects (e.g., Sinks) and thinner objects (e.g., Pictures), IoU-based matching fails to accurately reflect the true performance of 3D monocular object detection. Thus, we refer to the nuScenes detection score (NDS) [5] and the composite detection score (CDS) [54] to propose a new 3D object detection score for open-set monocular object detection, denoted as open detection score (ODS).
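A toy calculation illustrates why IoU-based matching is harsh on thin objects. The sketch below computes the IoU of axis-aligned boxes only, which is a simplification of the oriented cuboid IoU used by Omni3D; the box sizes and the 10 cm depth error are made-up values chosen for illustration.

```python
from typing import Sequence


def iou_3d_axis_aligned(
    center_a: Sequence[float],
    size_a: Sequence[float],
    center_b: Sequence[float],
    size_b: Sequence[float],
) -> float:
    """IoU of two axis-aligned 3D boxes given as (center, full size) per axis."""
    inter = 1.0
    for ca, sa, cb, sb in zip(center_a, size_a, center_b, size_b):
        lo = max(ca - sa / 2, cb - sb / 2)
        hi = min(ca + sa / 2, cb + sb / 2)
        inter *= max(0.0, hi - lo)  # no overlap on any axis gives zero volume
    vol_a = size_a[0] * size_a[1] * size_a[2]
    vol_b = size_b[0] * size_b[1] * size_b[2]
    return inter / (vol_a + vol_b - inter)


# Hypothetical boxes: sizes are (x, y, z) in metres, with z as the depth axis.
car = (1.8, 1.5, 4.5)
picture = (1.0, 1.0, 0.05)
gt_center, pred_center = (0.0, 0.0, 10.0), (0.0, 0.0, 10.1)  # 10 cm depth error

iou_car = iou_3d_axis_aligned(gt_center, car, pred_center, car)
iou_picture = iou_3d_axis_aligned(gt_center, picture, pred_center, picture)
```

With these numbers the car keeps an IoU of roughly 0.96, while the picture drops to 0 and would fail even the loosest threshold of 0.05, although both predictions have the same small depth error.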

To use ODS for both indoor and outdoor datasets, we use the 3D Euclidean distance instead of the bird's-eye-view (BEV) distance used in autonomous driving scenes. Furthermore, unlike NDS and CDS, which use fixed distances as matching thresholds, we set the matching distances as the uniform range $\in[0.5,0.55,\dots,1.0]$ of the radius of the 3D GT boxes. This allows a flexible matching criterion given the object size and strikes a balance between IoU matching and other distance matching. We report the mean 3D AP using normalized distance-based matching as $\text{AP}_{\text{3D}}^{\text{dist}}$ over classes. We compute several true positive (TP) errors for each matched prediction and GT pair. We report the mean average translation error (mATE), mean average scale error (mASE), and mean average orientation error (mAOE) to evaluate how precise the true positives are compared to the matched GT. The final ODS is computed as the weighted sum of $\text{AP}_{\text{3D}}^{\text{dist}}$, mATE, mASE, and mAOE as:

$$\text{ODS}=\frac{1}{6}[3\times\text{AP}_{\text{3D}}^{\text{dist}}+\sum(1-\text{mTPE})],\tag{6}$$

where $\text{mTPE}\in[\text{mATE},\text{mASE},\text{mAOE}]$. ODS considers the average precision and true positive errors under the flexible distance matching, making it suitable for evaluating the monocular 3D detection results, especially for open-set settings. In this work, we report $\text{AP}_{\text{3D}}$, $\text{AP}_{\text{3D}}^{\text{dist}}$, and ODS in the form of percentage by default. Additional details are provided in the supplementary material.
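The matching rule and Eq. 6 can be sketched as follows. The sketch assumes that the radius of a GT box is half of its diagonal, which the text does not spell out, and it applies no clipping to the error terms since Eq. 6 states none; the function names are hypothetical. The ODS inputs are the Argoverse 2 values of our Swin-T model from Tab. 1, with AP expressed as a fraction.

```python
import math
from typing import List, Sequence


def matching_thresholds() -> List[float]:
    """Relative distance thresholds 0.5, 0.55, ..., 1.0 (11 values)."""
    return [round(0.5 + 0.05 * k, 2) for k in range(11)]


def is_match(
    pred_center: Sequence[float],
    gt_center: Sequence[float],
    gt_size: Sequence[float],
    rel_threshold: float,
) -> bool:
    """Match if the 3D center distance is within rel_threshold * GT radius.

    The radius is assumed to be half of the GT box diagonal.
    """
    dist = math.dist(pred_center, gt_center)
    radius = 0.5 * math.sqrt(sum(s * s for s in gt_size))
    return dist <= rel_threshold * radius


def open_detection_score(ap_dist: float, mate: float, mase: float, maoe: float) -> float:
    """Eq. 6 with AP given as a fraction and the three mean TP errors."""
    return (3.0 * ap_dist + (1.0 - mate) + (1.0 - mase) + (1.0 - maoe)) / 6.0


# A thin, hypothetical 1.0 x 1.0 x 0.05 m object with a 10 cm depth error.
matched = is_match((0.0, 0.0, 10.1), (0.0, 0.0, 10.0), (1.0, 1.0, 0.05), 0.5)

# Swin-T on Argoverse 2 (Tab. 1): AP 14.8, mATE 0.782, mASE 0.697, mAOE 0.612.
ods = open_detection_score(0.148, 0.782, 0.697, 0.612)
```

Here `ods` evaluates to about 0.2255, in line with the 22.5 reported in Tab. 1, and the thin object is matched at the strictest relative threshold because the distance is normalized by the object size.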

Figure 4: Matching function. Different matching criteria over three methods on three different classes on AV2 and ScanNet. CD stands for matching prediction and GT using our proposed normalized center distance matching, while IoU stands for using $\text{IoU}_{\text{3D}}$.
Table 1: Open-set Results. We propose the first 3D monocular open-set object detection benchmark with Argoverse 2 [54] (Outdoor) and ScanNet [8] (Indoor). Each dataset contains seen (base) and unseen (novel) categories in the unseen scenes. Besides the Cube R-CNN [4] full model, we evaluate Cube R-CNN (In/Out) as domain expert variants, which are only trained and tested on Omni3D indoor/outdoor datasets. It is worth noting that OVM3D-Det's depth estimation model [38] is trained on AV2 and ScanNet. We further evaluate the generalization of seen classes and the ability to detect novel classes through ODS (B) and ODS (N). 3D-MOOD establishes the SOTA performance on this new challenging open-set benchmark.

| Method | AV2 $\text{AP}_{\text{3D}}^{\text{dist}}\uparrow$ | AV2 mATE $\downarrow$ | AV2 mASE $\downarrow$ | AV2 mAOE $\downarrow$ | AV2 ODS $\uparrow$ | AV2 ODS (B) $\uparrow$ | AV2 ODS (N) $\uparrow$ | ScanNet $\text{AP}_{\text{3D}}^{\text{dist}}\uparrow$ | ScanNet mATE $\downarrow$ | ScanNet mASE $\downarrow$ | ScanNet mAOE $\downarrow$ | ScanNet ODS $\uparrow$ | ScanNet ODS (B) $\uparrow$ | ScanNet ODS (N) $\uparrow$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cube R-CNN (In) | - | - | - | - | - | - | - | 19.5 | 0.725 | 0.771 | 0.858 | 20.5 | 24.6 | 0.0 |
| Cube R-CNN (Out) | 10.5 | 0.896 | 0.869 | 0.991 | 9.3 | 19.5 | 0.0 | - | - | - | - | - | - | - |
| Cube R-CNN [4] | 8.6 | 0.903 | 0.867 | 0.953 | 8.9 | 18.6 | 0.0 | 20.0 | 0.733 | 0.774 | 0.921 | 19.5 | 23.4 | 0.0 |
| OVM3D-Det [18] | 7.7 | 0.914 | 0.893 | 0.899 | 8.8 | 16.5 | 1.7 | 15.6 | 0.798 | 0.871 | 0.818 | 16.3 | 17.8 | 8.8 |
| Ours (Swin-T) | 14.8 | 0.782 | 0.697 | 0.612 | 22.5 | 31.7 | 14.2 | 27.3 | 0.630 | 0.726 | 0.650 | 30.2 | 33.6 | 13.4 |
| Ours (Swin-B) | 14.7 | 0.755 | 0.680 | 0.580 | 23.8 | 33.6 | 14.8 | 28.8 | 0.612 | 0.706 | 0.655 | 31.5 | 34.7 | 15.7 |

Table 2: Cross-domain results. We validate the cross-domain generalization of 3D-MOOD by training on one of the indoor datasets from Omni3D while testing on the other two in a zero-shot manner. 3D-MOOD consistently generalizes better in all three settings.

| Method | Trained on Hypersim: $\text{AP}_{\text{3D}}^{\text{hyp}}\uparrow$ | Trained on Hypersim: $\text{AP}_{\text{3D}}^{\text{sun}}\uparrow$ | Trained on Hypersim: $\text{AP}_{\text{3D}}^{\text{ark}}\uparrow$ | Trained on SUN RGB-D: $\text{AP}_{\text{3D}}^{\text{hyp}}\uparrow$ | Trained on SUN RGB-D: $\text{AP}_{\text{3D}}^{\text{sun}}\uparrow$ | Trained on SUN RGB-D: $\text{AP}_{\text{3D}}^{\text{ark}}\uparrow$ | Trained on ARKitScenes: $\text{AP}_{\text{3D}}^{\text{hyp}}\uparrow$ | Trained on ARKitScenes: $\text{AP}_{\text{3D}}^{\text{sun}}\uparrow$ | Trained on ARKitScenes: $\text{AP}_{\text{3D}}^{\text{ark}}\uparrow$ |
|---|---|---|---|---|---|---|---|---|---|
| Cube R-CNN [4] | 15.2 | 9.5 | 7.5 | 9.5 | 34.7 | 14.2 | 7.5 | 13.1 | 38.6 |
| Uni-MODE [23] | 14.7 | 5.6 | 3.6 | 3.0 | 28.5 | 8.8 | 4.2 | 13.0 | 35.0 |
| Ours | 25.6 | 15.9 | 14.5 | 13.8 | 42.1 | 21.4 | 12.9 | 23.8 | 43.9 |

4.3 Open-Set Results

Benchmarks. We establish the first 3D monocular open-set object detection benchmark, as shown in Tab. 1. We treat the diverse Omni3D [4] dataset as the training set and test the model performance on the Argoverse 2 (outdoor) [54] and ScanNet (indoor) [8] validation splits as open-set testing.

Baselines. We validate the performance of 3D-MOOD by comparing it to several baselines. To the best of our knowledge, there are only two methods [4, 23] trained on the entire Omni3D training set. However, at the time of submission, Uni-MODE [23] had not released its model weights. Hence, we use Cube R-CNN [4] to build several baselines. We further compare the generalizability of 3D-MOOD with Uni-MODE in Sec. 4.4. We use three different Cube R-CNN models, which are trained with indoor-only, outdoor-only, or full Omni3D training sets, as the specialized closed-set models for indoor (In), outdoor (Out), and universal data (Cube R-CNN). We map the predicted categories from Omni3D to Argoverse 2 (AV2) and ScanNet to conduct the 3D detection on open data, which provides $11$ and $15$ seen (base) classes, respectively. Another baseline is OVM3D-Det [18], which uses G-DINO [27], SAM [19], UniDepth [38], and an LLM to generate pseudo GT for 3D detection. We run the OVM3D-Det pipeline on AV2 and ScanNet to generate the pseudo GT as open-set detection results and evaluate it against the real GT.

Results. As shown in Tab. 1, 3D-MOOD achieves SOTA results on both challenging datasets in open-set settings. The Cube R-CNN baselines (rows 1 to 3) show that closed-set methods cannot recognize novel objects due to their closed-vocabulary model design, which heavily affects the overall open-set performance when more than half of the classes are novel, e.g., on AV2. Furthermore, the performance differences between 3D-MOOD and Cube R-CNN on the seen (base) classes are more significant in the unseen domain. This suggests that 3D-MOOD benefits from the proposed canonical image space and geometry-aware 3D query generation, which lead to better generalization to unseen domains. The comparison to OVM3D-Det [18] shows the importance of an end-to-end design that better aligns the 2D open-set detector with 3D object detection. Given that UniDepth [38] is trained on AV2 and ScanNet, the depth estimation of OVM3D-Det is much more accurate. However, its lack of training on 3D data leads to worse performance for both base and novel classes.

4.4 Cross-Domain Results

Since we cannot directly compare with Uni-MODE [23] on our proposed open-set benchmarks, we follow [4, 23] and conduct cross-domain generalization experiments within the Omni3D datasets. We train 3D-MOOD on one indoor dataset at a time and test it zero-shot on the other two datasets. As shown in Tab. 2, our method achieves higher performance on both in-domain data, i.e., the seen dataset, and out-of-domain data. We believe this demonstrates the ability of the model to detect base objects in unseen scenes, which benefits from our geometry-aware design.

Table 3: Results on Omni3D. We compare 3D-MOOD with other closed-set detectors on the Omni3D test set. $\text{AP}_{\text{3D}}^{\text{omni}}\uparrow$ is the average score over the $6$ Omni3D datasets. All methods are trained with the Omni3D train and val splits, and "-" represents results not reported in previous literature [4, 23]. 3D-MOOD achieves SOTA performance in the closed-set setting while retaining the open-set ability.

| Method | $\text{AP}_{\text{3D}}^{\text{kit}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{nus}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{sun}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{hyp}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{ark}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{obj}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{omni}}\uparrow$ |
|---|---|---|---|---|---|---|---|
| ImVoxelNet [44] | - | - | - | - | - | - | 9.4 |
| SMOKE [29] | - | - | - | - | - | - | 9.6 |
| FCOS3D [51] | - | - | - | - | - | - | 9.8 |
| PGD [52] | - | - | - | - | - | - | 11.2 |
| Cube R-CNN [4] | 32.6 | 30.1 | 15.3 | 7.5 | 41.7 | 50.8 | 23.3 |
| Uni-MODE [23] | 29.2 | 36.0 | 23.0 | 8.1 | 48.0 | 66.1 | 28.2 |
| Ours (Swin-T) | 32.8 | 31.5 | 21.9 | 10.5 | 51.0 | 64.3 | 28.4 |
| Ours (Swin-B) | 31.4 | 35.8 | 23.8 | 9.1 | 53.9 | 67.9 | 30.0 |

Table 4: Ablations of 3D-MOOD. CI denotes canonical image space, Depth denotes the auxiliary depth estimation head, and GA stands for geometry-aware 3D query generation. We report the IoU-based AP for the Omni3D test split and our ODS for the AV2 and ScanNet validation splits. $\text{AP}_{\text{3D}}^{\text{omni}}\uparrow$ is the average score over the $6$ Omni3D datasets, while $\text{ODS}^{\text{open}}\uparrow$ is the average over the open-set datasets. The results show that our proposed components help in both closed-set and open-set settings.

| Row | CI | Depth | GA | $\text{AP}_{\text{3D}}^{\text{kit}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{nus}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{hyp}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{sun}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{ark}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{obj}}\uparrow$ | $\text{AP}_{\text{3D}}^{\text{omni}}\uparrow$ | $\text{ODS}^{\text{av2}}\uparrow$ | $\text{ODS}^{\text{scan}}\uparrow$ | $\text{ODS}^{\text{open}}\uparrow$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | - | - | - | 32.5 | 29.7 | 8.1 | 17.3 | 46.5 | 54.9 | 24.1 | 18.2 | 29.0 | 23.6 |
| 2 | ✓ | - | - | 31.1 | 30.5 | 9.1 | 19.1 | 47.7 | 58.1 | 25.5 | 19.5 | 29.5 | 24.5 |
| 3 | ✓ | ✓ | - | 29.8 | 30.7 | 10.3 | 19.9 | 48.6 | 58.8 | 26.2 | 20.0 | 29.4 | 24.7 |
| 4 | ✓ | ✓ | ✓ | 32.1 | 31.9 | 9.9 | 20.8 | 49.1 | 60.2 | 26.8 | 22.0 | 30.0 | 26.0 |

4.5 Closed-Set Results

We compare 3D-MOOD with the other closed-set models on the Omni3D [4] benchmark. As shown in Tab. 3, 3D-MOOD achieves SOTA performance on the Omni3D test split. Our model with Swin-Transformer [30] Tiny (Swin-T) as the backbone achieves similar performance to the previous SOTA Uni-MODE [23], which uses the ConvNeXt [31] Base model. When we use an image backbone comparable to ConvNeXt-Base (89M), i.e., Swin-Transformer Base (Swin-B, 88M), 3D-MOOD achieves $30.1\%$ AP on the Omni3D test set and establishes the new SOTA results on the benchmark.

4.6 Ablation Studies

We ablate each contribution in Tab. 4 for both closed-set and open-set settings. We build the naive baseline (row 1) by directly using the 2D queries $\mathbf{q}_{\text{2d}}$ to generate the 3D detection results.

Canonical Image Space. As shown in [42, 38], it is crucial to resolve the ambiguity between image, intrinsics, and depth. With the proposed canonical image (CI) space, we align the image shape and camera intrinsics between training and testing time. Row 2 shows that CI improves the closed-set and open-set results by $1.4$ AP and $0.9$ ODS, respectively. This shows that the model learns a universal property, so that its detection ability generalizes well across datasets at both training and testing time.
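The coupling between image shape and intrinsics can be illustrated with a short sketch. It assumes an aspect-ratio-preserving resize into the canonical shape followed by optional padding; this policy, the function name `to_canonical`, and the example intrinsics are assumptions for illustration, while the actual procedure is the one described in Sec. 3.4. The sketch also ignores half-pixel center conventions.

```python
from typing import List, Sequence, Tuple


def to_canonical(
    height: int,
    width: int,
    K: Sequence[Sequence[float]],
    canon_hw: Tuple[int, int] = (800, 1333),
    pad_top: int = 0,
    pad_left: int = 0,
) -> Tuple[Tuple[int, int], List[List[float]]]:
    """Resize to fit inside `canon_hw` (keeping aspect ratio) and update K.

    Returns the resized (height, width) before padding and the adjusted
    intrinsics. Padding at the bottom or right leaves K unchanged, while
    padding at the top or left shifts the principal point.
    """
    scale = min(canon_hw[0] / height, canon_hw[1] / width)
    new_hw = (round(height * scale), round(width * scale))
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    K_new = [
        [fx * scale, 0.0, cx * scale + pad_left],
        [0.0, fy * scale, cy * scale + pad_top],
        [0.0, 0.0, 1.0],
    ]
    return new_hw, K_new


# A 968 x 1296 image (the ScanNet resolution above) with hypothetical intrinsics.
K = [[1170.0, 0.0, 648.0], [0.0, 1170.0, 484.0], [0.0, 0.0, 1.0]]
new_hw, K_new = to_canonical(968, 1296, K)
```

Scaling the focal lengths and principal point together with the image keeps the projection of every 3D point consistent, so a metric depth predicted in the resized image still corresponds to the same 3D location.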

Auxiliary Depth Head. We validate the effect of the auxiliary depth head in row 3. Learning metric depth is essential for the network to better understand the geometry of the 3D scene instead of merely relying on the sparse depth supervision signal from the 3D bounding box loss. With the auxiliary depth head, 3D-MOOD improves by $0.7$ AP in the closed-set setting yet only slightly improves the open-set setting by $0.2$ ODS. We hypothesize that the depth data in Omni3D is not diverse and rich enough compared to the data that other generalizable depth estimation methods [38, 57, 3] use for training. Thus, the benefit of the depth head is small in open-set settings.

Geometry-aware 3D Query Generation. Finally, we ablate our proposed geometry-aware 3D query generation module in row 4. We show that the geometry condition improves the performance by $0.6$ AP and $1.3$ ODS in the closed-set and open-set settings, respectively. It is worth noting that the geometry information significantly improves the model's generalizability, which demonstrates our contribution to 3D monocular open-set object detection.
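The per-component gains quoted in this section follow directly from consecutive rows of Tab. 4. The short script below recomputes them from the $\text{AP}_{\text{3D}}^{\text{omni}}$ and $\text{ODS}^{\text{open}}$ columns; the row labels are shorthand introduced here for readability.

```python
# (AP3D omni, ODS open) for rows 1 to 4 of Tab. 4.
rows = {
    "baseline": (24.1, 23.6),
    "+ CI": (25.5, 24.5),
    "+ Depth": (26.2, 24.7),
    "+ GA": (26.8, 26.0),
}

names = list(rows)
gains = {}
for prev, curr in zip(names, names[1:]):
    d_ap = round(rows[curr][0] - rows[prev][0], 1)
    d_ods = round(rows[curr][1] - rows[prev][1], 1)
    gains[curr] = (d_ap, d_ods)

# Total improvement of the full model over the naive baseline.
total_gain = (
    round(rows["+ GA"][0] - rows["baseline"][0], 1),
    round(rows["+ GA"][1] - rows["baseline"][1], 1),
)
```

The script reproduces the gains of 1.4 and 0.9 for CI, 0.7 and 0.2 for the depth head, and 0.6 and 1.3 for GA, for a total of 2.7 AP and 2.4 ODS over the baseline.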

Figure 5: In-the-wild Qualitative Results. We show the visualization of 3D-MOOD for in-the-wild images. The red boxes in the 3D visualization (last row) are the GT annotations.

4.7 Qualitative Results

We show the open-set qualitative results in Fig. 5 to demonstrate the generalizability of 3D-MOOD, where we successfully detect novel objects in unseen scenes. More results are reported in the supplementary material.

5 Conclusion

In this work, we introduce 3D-MOOD, the first end-to-end 3D monocular open-set object detection method, which achieves state-of-the-art performance in closed-set settings while demonstrating strong generalization to unseen scenes and object classes in open-set scenarios. We design a 3D bounding box head with the proposed geometry-aware 3D query generation to lift the open-set 2D detection to the corresponding 3D space. Our proposed method can be trained end-to-end and yields better overall performance. Furthermore, our proposed canonical image space resolves the ambiguity between image, intrinsics, and metric depth, leading to more robust results in closed-set and open-set settings. We propose a challenging 3D monocular open-set object detection benchmark using two out-of-domain datasets. 3D-MOOD sets the new state-of-the-art performance on the challenging Omni3D benchmark compared to other closed-set methods. Moreover, the results on the open-set benchmark demonstrate our method's ability to generalize monocular 3D object detection to in-the-wild settings.

Acknowledgements. This research is supported by the ETH Foundation Project 2025-FS-352, Swiss AI Initiative and a grant from the Swiss National Supercomputing Centre (CSCS) under project ID a03 on Alps, and the Lamarr Institute for Machine Learning and Artificial Intelligence. The authors thank Linfei Pan and Haofei Xu for helpful discussions and technical support.

References

  • Ahmadyan et al. [2021] Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • Baruch et al. [2021] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
  • Bochkovskii et al. [2024] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024.
  • Brazil et al. [2023] Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3d: A large benchmark and model for 3d object detection in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13154–13164, 2023.
  • Caesar et al. [2019] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
  • Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
  • Cheng et al. [2024] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. arXiv preprint arXiv:2401.17270, 2024.
  • Dai et al. [2017] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
  • Devlin [2018] Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dong et al. [2022] Xingshuai Dong, Matthew A Garratt, Sreenatha G Anavatti, and Hussein A Abbass. Towards real-time monocular depth estimation for robotics: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(10):16940–16961, 2022.
  • Eigen et al. [2014] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems (NeurIPS), 27, 2014.
  • Fischer et al. [2022] Tobias Fischer, Yung-Hsu Yang, Suryansh Kumar, Min Sun, and Fisher Yu. Cc-3dt: Panoramic 3d object tracking via cross-camera fusion. In 6th Annual Conference on Robot Learning, 2022.
  • Fu et al. [2024] Yuqian Fu, Yu Wang, Yixuan Pan, Lian Huai, Xingyu Qiu, Zeyu Shangguan, Tong Liu, Yanwei Fu, Luc Van Gool, and Xingqun Jiang. Cross-domain few-shot object detection via enhanced open-set object detector. In European Conference on Computer Vision, pages 247–264. Springer, 2024.
  • Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • He et al. [2025] Sicheng He, Zeyu Shangguan, Kuanning Wang, Yongchong Gu, Yuqian Fu, Yanwei Fu, and Daniel Seita. Sequential multi-object grasping with one dexterous hand. IROS, 2025.
  • Hu et al. [2022] Hou-Ning Hu, Yung-Hsu Yang, Tobias Fischer, Trevor Darrell, Fisher Yu, and Min Sun. Monocular quasi-dense 3d object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1992–2008, 2022.
  • Huang et al. [2024] Rui Huang, Henry Zheng, Yan Wang, Zhuofan Xia, Marco Pavone, and Gao Huang. Training an open-vocabulary monocular 3d detection model without 3d data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023.
  • Kundu et al. [2018] Abhijit Kundu, Yin Li, and James M Rehg. 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3559–3568, 2018.
  • Li et al. [2022a] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022a.
  • Li et al. [2022b] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022b.
  • Li et al. [2024] Zhuoling Li, Xiaogang Xu, SerNam Lim, and Hengshuang Zhao. Unimode: Unified monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16561–16570, 2024.
  • Lin et al. [2017] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • Lin et al. [2022] Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581, 2022.
  • Lin et al. [2023] Xuewu Lin, Zixiang Pei, Tianwei Lin, Lichao Huang, and Zhizhong Su. Sparse4d v3: Advancing end-to-end 3d detection and tracking. arXiv preprint arXiv:2311.11722, 2023.
  • Liu et al. [2023] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  • Liu et al. [2022a] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. arXiv preprint arXiv:2203.05625, 2022a.
  • Liu et al. [2020] Zechen Liu, Zizhang Wu, and Roland Tóth. Smoke: Single-stage monocular 3d object detection via keypoint estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 996–997, 2020.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • Liu et al. [2022b] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022b.
  • Naiden et al. [2019] Andretti Naiden, Vlad Paunescu, Gyeongmo Kim, ByeongMoon Jeon, and Marius Leordeanu. Shift r-cnn: Deep monocular 3d object detection with closed-form geometric constraints. In 2019 IEEE international conference on image processing (ICIP), pages 61–65. IEEE, 2019.
  • Nickolls et al. [2008] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda: Is cuda the parallel programming model that application developers have been waiting for? Queue, 6(2):40–53, 2008.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • Pan et al. [2024] Jiancheng Pan, Yanxing Liu, Yuqian Fu, Muyuan Ma, Jiahao Li, Danda Pani Paudel, Luc Van Gool, and Xiaomeng Huang. Locate anything on earth: Advancing open-vocabulary object detection for remote sensing community. arXiv preprint arXiv:2408.09110, 2024.
  • Park et al. [2021] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), pages 8024–8035. Curran Associates, Inc., 2019.
  • Piccinelli et al. [2024] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • Piccinelli et al. [2025a] Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniK3D: Universal camera monocular 3d estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025a.
  • Piccinelli et al. [2025b] Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler. arXiv:2502.20110, 2025b.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Rezatofighi et al. [2019] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019.
  • Roberts et al. [2021] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV) 2021, 2021.
  • Rukhovich et al. [2022] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2397–2406, 2022.
  • Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Schönberger et al. [2016] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
  • Song et al. [2015] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015.
  • Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Tu et al. [2023] Tao Tu, Shun-Po Chuang, Yu-Lun Liu, Cheng Sun, Ke Zhang, Donna Roy, Cheng-Hao Kuo, and Min Sun. Imgeonet: Image-induced geometry-aware voxel representation for multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6996–7007, 2023.
  • Vaswani [2017] A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
  • Wang et al. [2021] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 913–922, 2021.
  • Wang et al. [2022] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In Conference on Robot Learning, pages 1475–1485. PMLR, 2022.
  • Wang et al. [2019] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8445–8453, 2019.
  • Wilson et al. [2021] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021), 2021.
  • Yang et al. [2022] Chenyu Yang, Yuntao Chen, Haofei Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Y. Qiao, Lewei Lu, Jie Zhou, and Jifeng Dai. Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. ArXiv, 2022.
  • Yin et al. [2021] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
  • Yin et al. [2023] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9043–9053, 2023.
  • Yu et al. [2018] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2403–2412, 2018.
  • Yang et al. [2024] Yung-Hsu Yang, Tobias Fischer, Thomas E. Huang, René Zurbrügg, Tao Sun, and Fisher Yu. Vis4d. https://github.com/SysCV/vis4d, 2024.
  • Zang et al. [2022] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. In European Conference on Computer Vision, pages 106–122. Springer, 2022.
  • Zareian et al. [2021] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14393–14402, 2021.
  • Zhang et al. [2022a] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022a.
  • Zhang et al. [2022b] Renrui Zhang, Han Qiu, Tai Wang, Xuanzhuo Xu, Ziyu Guo, Yu Qiao, Peng Gao, and Hongsheng Li. Monodetr: Depth-guided transformer for monocular 3d object detection. ICCV 2023, 2022b.
  • Zhao et al. [2024] Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, and Haian Huang. An open and comprehensive pipeline for unified object grounding and detection. arXiv preprint arXiv:2401.02361, 2024.
  • Zhou et al. [2019] Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun. Does computer vision matter for action? Science Robotics, 4, 2019.
  • Zhou et al. [2022] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.
  • Zhu et al. [2020] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

Supplementary Material

This supplementary material provides further details on our main paper. In Sec. A, we illustrate how the proposed canonical image space reduces the GPU resources required for training and how it helps the model learn a better geometry prior. In Sec. B, we compare our proposed geometry-aware query generation to the virtual depth proposed by Cube R-CNN [4]. In Sec. C, we provide further details of our proposed open-set benchmark. We analyze the depth estimation performance in Sec. D, compare backbones in Sec. E, and report FPS in Sec. F. In Sec. G, we discuss the importance of our proposed open detection score (ODS) and compare the evaluation results in detail to the IoU-based AP. Finally, we provide more qualitative results in Sec. H for both closed-set and open-set settings.

A Canonical Image Space

As shown in the main paper, we compare different resizing and padding strategies for training the model. Consider a training batch of size 2 whose two samples have very different aspect ratios, e.g. [376 × 1241] and [1920 × 1080]. The first strategy, following [4], resizes the shortest edge to the desired value, e.g. 512, and applies right-bottom padding to align the two samples' resolutions. This leads to considerable padding for the portrait image, while the padding does not change the camera intrinsic $\mathbf{K}$.

The second strategy, used by Grounding DINO (G-DINO) [27], resizes the longest edge to the desired value, e.g. 1333, and applies the same right-bottom padding to align the resolutions. This leads to considerable padding for the landscape image, and again the padding does not change the camera intrinsic $\mathbf{K}$. Both strategies increase GPU usage through unnecessary padding, and the resolution of a given image varies with the image it is paired with in the batch.
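To make the padding overhead concrete, the following minimal sketch computes the padded batch resolution for the two strategies on the example pair above and compares it with a fixed 800 × 1333 canvas. It assumes both resolutions are written as (height, width), and it uses the pixel count per sample only as a rough proxy for memory. The helper names `resize_to_edge` and `padded_batch_shape` are hypothetical and not taken from the released code.

```python
def resize_to_edge(h, w, target, edge):
    """Scale (h, w) so that its short or long edge equals `target`."""
    ref = min(h, w) if edge == "short" else max(h, w)
    scale = target / ref
    return round(h * scale), round(w * scale)


def padded_batch_shape(shapes):
    """Right-bottom padding aligns every sample to the per-batch maximum."""
    return max(h for h, _ in shapes), max(w for _, w in shapes)


# The two example resolutions from the text, assumed to be (height, width).
batch = [(376, 1241), (1920, 1080)]

short_edge = padded_batch_shape(
    [resize_to_edge(h, w, 512, "short") for h, w in batch]
)
long_edge = padded_batch_shape(
    [resize_to_edge(h, w, 1333, "long") for h, w in batch]
)
canonical = (800, 1333)  # fixed canvas, independent of the paired sample

for name, (h, w) in [
    ("short edge", short_edge),
    ("long edge", long_edge),
    ("canonical", canonical),
]:
    print(f"{name:>10}: {h} x {w} = {h * w} pixels per sample")
```

Under these assumptions the fixed canvas holds the fewest pixels per sample and the long-edge strategy the most, which is the same ordering as the GPU RAM figures reported in Tab. 5.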

In contrast, our proposed canonical image space fixes the image resolution, e.g. [800 × 1333], and resizes either the longest or the shortest edge depending on the image aspect ratio. As shown in Tab. 5, our method reduces the required GPU resources compared to the previous strategies. Furthermore, we use center padding so that the canonical image space adjusts the camera intrinsic $\mathbf{K}$ accordingly, unifying it not only between training and testing time but also across datasets.
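One way to read this paragraph is sketched below: the image is scaled by the largest factor at which it still fits the fixed canvas, centered with padding, and the intrinsics are updated to match. This is an illustrative assumption rather than the authors' implementation. It assumes a zero-skew pinhole model, ignores half-pixel conventions, and uses a hypothetical function name `to_canonical` and made-up intrinsic values.

```python
def to_canonical(height, width, K, canon_h=800, canon_w=1333):
    """Resize to fit a fixed canvas, center-pad, and update the intrinsics.

    K is a 3x3 pinhole intrinsic matrix as nested lists (zero skew assumed).
    Returns the resized shape, the (top, left) padding, and the updated K.
    """
    # Largest scale at which the resized image still fits inside the canvas.
    scale = min(canon_h / height, canon_w / width)
    new_h, new_w = round(height * scale), round(width * scale)

    # Center padding: split the leftover space evenly (top/left get the floor).
    pad_top = (canon_h - new_h) // 2
    pad_left = (canon_w - new_w) // 2

    # Resizing scales focal lengths and principal point;
    # the top/left padding shifts the principal point.
    K_new = [
        [K[0][0] * scale, 0.0, K[0][2] * scale + pad_left],
        [0.0, K[1][1] * scale, K[1][2] * scale + pad_top],
        [0.0, 0.0, 1.0],
    ]
    return (new_h, new_w), (pad_top, pad_left), K_new


# Hypothetical intrinsics for a 376 x 1241 (height x width) image.
K_example = [[720.0, 0.0, 620.0], [0.0, 720.0, 188.0], [0.0, 0.0, 1.0]]
shape, pad, K_canon = to_canonical(376, 1241, K_example)
print(shape, pad, K_canon)
```

Because the canvas is fixed, the same camera always maps to the same resized shape, padding, and updated intrinsics, regardless of which other image shares the batch.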

At inference time, a given camera captures images of the same shape with the same intrinsics. The previous strategies fail to preserve this consistency between training and inference. We speculate that this hinders the model's understanding of the relation between intrinsics, image shape, and metric depth. In contrast, our canonical image space keeps the image and intrinsics consistent. As shown in the ablation studies of the main paper, the model benefits from the proposed canonical image space in both closed-set and open-set settings, while requiring fewer GPU resources for training.

Table 5: GPU RAM Consumption. We compare the GPU resource usage of different training padding and resizing methods. We show the results of training our full model with Swin-T [30] and a batch size of 2, using gradient checkpointing on an RTX 4090.

| Resize | Padding | Image Resolutions | GPU RAM (GB) |
|---|---|---|---|
| Short Edge | Right-Bottom | Short Edge to 512 | 21 |
| Long Edge | Right-Bottom | Long Edge to 1333 | 23 |
| Ours | Center | 800 × 1333 | 17 |

Table 6: Comparison with Virtual Depth. We compare our geometry-aware 3D query generation (GA) with the virtual depth proposed by Cube R-CNN [4]. The results show that GA converges better than the virtual depth mechanism.

| Method | Virtual Depth | GA | $\text{AP}_{\text{3D}}^{\text{omni}}\uparrow$ |
|---|---|---|---|
| Cube R-CNN [4] | ✓ | - | 23.3 |
| Ours (Swin-T) | ✓ | - | 21.6 |
| Ours (Swin-T) | - | ✓ | 26.8 |

B Comparison with Virtual Depth

We compare our proposed geometry-aware 3D query generation with the virtual depth proposed in [4]. As shown in Tab. 6, the virtual depth leads our model to converge much more slowly than our proposed geometry-aware 3D query generation. We speculate that virtual depth requires much more training time to learn universal geometry, which also leads to underperformance.

C Open-set Benchmark

We show more details about our proposed open-set benchmark and list the full classes for each dataset in Tab. 7.

Table 7: Classes for Argoverse 2 and ScanNet. We list the base and novel categories for our proposed open-set benchmark. For ScanNet, the bold categories are the supercategories. We further list all 168 categories of the ScanNet200 setting.
Argoverse 2
Base: regular vehicle, pedestrian, bicyclist, construction cone, construction barrel, large vehicle, bus, truck, vehicular trailer, bicycle, motorcycle
Novel: motorcyclist, wheeled rider, bollard, sign, stop sign, box truck, articulated bus, mobile pedestrian crossing sign, truck cab, school bus, wheeled device, stroller

ScanNet
Base: cabinet (cabinet, kitchen cabinet, file cabinet, bathroom vanity, bathroom cabinet, cabinet door, trash cabinet, media center), bed (bed, mattress, loft bed, sofa bed, air mattress), chair (chair, office chair, armchair, sofa chair, folded chair, massage chair, recliner chair, rocking chair), sofa (couch, sofa), table (table, coffee table, end table, dining table, folded table, round table, side table, air hockey table), door (door, doorframe, bathroom stall door, closet door, mirror door, glass door, sliding door, closet doorframe), window, picture (picture, poster, painting), counter (kitchen counter, counter, bathroom counter), desk, curtain, refrigerator (refrigerator, mini fridge, cooler), toilet (toilet, urinal), sink, bathtub
Novel: bookshelf, shower curtain, other furniture (trash can, radiator, recycling bin, ottoman, bench, tv stand, wardrobe, trash bin, seat, closet, ladder, piano, water cooler, stand, washing machine, rack, wardrobe, clothes dryer, ironing board, keyboard piano, music stand, furniture, crate, drawer, footrest, piano bench, foosball table, footstool, compost bin, tripod, treadmill, chest, folded ladder, drying rack, pool table, heater, toolbox, beanbag chair, dollhouse, ping pong table, clothing rack, podium, luggage stand, rack stand, futon, book rack, workbench, easel, headboard, display rack, crib, bedframe, bunk bed, magazine rack, furnace, stepladder, baby changing station, flower stand, display)

ScanNet200 chair, table, door, couch, cabinet, shelf, desk, office chair, bed, pillow, sink, picture, window, toilet, bookshelf, monitor, curtain, book, armchair, coffee table, box, refrigerator, lamp, kitchen cabinet, towel, clothes, tv, nightstand, counter, dresser, stool, plant, bathtub, end table, dining table, keyboard, bag, backpack, toilet paper, printer, tv stand, whiteboard, blanket, shower curtain, trash can, closet, stairs, microwave, stove, shoe, computer tower, bottle, bin, ottoman, bench, board, washing machine, mirror, copier, basket, sofa chair, file cabinet, fan, laptop, shower, paper, person, paper towel dispenser, oven, blinds, rack, plate, blackboard, piano, suitcase, rail, radiator, recycling bin, container, wardrobe, soap dispenser, telephone, bucket, clock, stand, light, laundry basket, pipe, clothes dryer, guitar, toilet paper holder, seat, speaker, column, ladder, cup, jacket, storage bin, coffee maker, dishwasher, paper towel roll, machine, mat, windowsill, bar, bulletin board, ironing board, fireplace, soap dish, kitchen counter, doorframe, toilet paper dispenser, mini fridge, fire extinguisher, ball, hat, shower curtain rod, water cooler, paper cutter, tray, pillar, ledge, toaster oven, mouse, toilet seat cover dispenser, cart, scale, tissue box, light switch, crate, power outlet, decoration, sign, projector, closet door, vacuum cleaner, headphones, dish rack, broom, range hood, hair dryer, water bottle, vent, mailbox, bowl, paper bag, projector screen, divider, laundry detergent, bathroom counter, stick, bathroom vanity, closet wall, laundry hamper, bathroom stall door, ceiling light, trash bin, dumbbell, stair rail, tube, bathroom cabinet, coffee kettle, shower head, case of water bottles, power strip, calendar, poster, mattress

C.1 Argoverse 2

Compared to other autonomous driving datasets, Argoverse 2 (AV2) [54] covers more diverse classes. Moreover, its front camera captures portrait images, providing unseen cameras and domains. These factors make AV2 a challenging dataset for benchmarking open-set monocular 3D object detection. We sample every 5th frame from the official validation split and obtain 4806 images as the open-set testing set. Among the official 26 classes, 23 appear in the testing set, comprising 11 base and 12 novel classes.

C.2 ScanNet

ScanNet [8] provides diverse indoor scenes with 18 supercategories, as shown in Tab. 7. We uniformly sample at most 20 frames from each scan in the official ScanNet validation split and obtain 6240 images in total as the open-set testing set. Given that 15 supercategories are seen in the Omni3D training set, this benchmark still allows us to evaluate domain generalization, where Tab. 4 of the main paper indicates issues of previous works. Furthermore, among the remaining 3 novel classes, the supercategory other furniture requires models to detect various types of furniture.

To further test 3D-MOOD, we extend ScanNet using the ScanNet200 setting, in which 168 thing classes appear in the testing set. As shown in Tab. 8, 3D-MOOD still achieves the best performance given more diverse classes.

Table 8: ScanNet200 Results. 3D-MOOD achieves SOTA results given diverse novel categories in unseen scenes.
| Method | $\text{AP}_{\text{3D}}^{\text{dist}}\uparrow$ | mATE $\downarrow$ | mASE $\downarrow$ | mAOE $\downarrow$ | ODS $\uparrow$ |
|---|---|---|---|---|---|
| Cube R-CNN [4] | 2.1 | 0.962 | 0.970 | 0.985 | 2.5 |
| OVM3D-Det [18] | 3.1 | 0.957 | 0.973 | 0.946 | 3.6 |
| Ours (Swin-B) | 6.2 | 0.811 | 0.835 | 0.799 | 12.4 |

D Metric Monocular Depth Estimation

Because auxiliary depth estimation (ADE) is only used to help with 3D object detection, we evaluated its effectiveness in this regard. As shown in Tab. 4 of our paper, ADE improves the closed-set AP by 0.7 but reduces the performance on unseen scenes, indicating that ADE helps only for known scenes. We further evaluate our depth quality on the KITTI Eigen-split test set, where UniDepth [38] achieves 4.21% absolute relative error, Metric3Dv2 [57] achieves 4.4%, and 3D-MOOD obtains 9.1%. We believe the gap is due to limited training data, different training objectives, and the model backbones [30, 34].
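For reference, the absolute relative error quoted above is commonly computed as the mean of |prediction − ground truth| / ground truth over pixels with valid ground-truth depth. The sketch below follows that common definition. The function name `abs_rel_error` and the sample values are hypothetical, and the exact evaluation protocol (cropping, depth caps) is not specified here.

```python
def abs_rel_error(pred_depths, gt_depths):
    """Mean absolute relative depth error over pixels with valid ground truth."""
    # Pixels without ground truth are conventionally stored as 0 and skipped.
    valid = [(p, g) for p, g in zip(pred_depths, gt_depths) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


# Hypothetical depths in meters; the third pixel has no ground truth.
pred = [11.0, 18.0, 5.0]
gt = [10.0, 20.0, 0.0]
print(f"AbsRel = {100 * abs_rel_error(pred, gt):.2f}%")
```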

E Backbone Comparison

As shown in Tab. 9, Swin-B performs on par with ConvNeXt-B, and 3D-MOOD is comparable to Cube R-CNN and Uni-MODE when using the same backbone, but with much shorter training. This shows that the gains come from our proposed designs rather than from the backbone [34].

Table 9: Backbone comparison. We ablate the choice of different model backbones. All experiments are trained for 12 epochs.

| Backbone | Parameters | $\text{AP}_{\text{3D}}^{\text{omni}}\uparrow$ |
|---|---|---|
| DLA-34 [58] | 15M | 24.9 |
| Swin-Transformer (Tiny) [30] | 29M | 26.8 |
| Swin-Transformer (Base) [30] | 88M | 28.2 |
| ConvNeXt-B [31] | 89M | 28.4 |

Table 10: Comparison between different matching criteria for evaluation. The same detection results will have a huge AP difference when using different matching settings.

| Matching | Pedestrian | Construction cone | Monitor | Door |
|---|---|---|---|---|
| IoU | 7.4 | 0.5 | 0.5 | 2.0 |
| Distance | 26.2 | 6.5 | 9.4 | 24.2 |

F Inference Time

We compare the FPS on KITTI using an RTX 4090: Cube R-CNN (DLA-34) runs at 68 FPS, while 3D-MOOD (Swin-T) achieves 17 FPS. For reference, Uni-MODE reports 21 FPS on a single A100 in its paper.

G Open Detection Score (ODS)

Compared to point-cloud-based 3D object detectors, using a single image to estimate 3D objects requires the network to predict metric depth, whereas the depth scale is known in a point cloud. This extra challenge causes monocular methods to fail to match the ground truth under IoU-based matching because of depth errors of only a few centimeters, especially for small or thin objects in open-set settings.

To obtain a more suitable evaluation metric for monocular 3D object detection, we use the 3D Euclidean distance between prediction and GT as the matching criterion. With a dynamic matching threshold, e.g. the radius of the GT 3D box, $\text{AP}_{\text{3D}}^{\text{dist}}$ can be used for both indoor and outdoor scenes. As shown in Tab. 10, the same detection results on Argoverse 2 [54] and ScanNet [8] will have large AP differences depending on the matching criterion.
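The distance-based criterion can be sketched as follows. The sketch assumes that the "radius" of a GT box is half of its 3D diagonal and that predictions are assigned greedily in descending score order to the nearest unmatched GT box. Both choices, as well as the names `box_radius` and `match_predictions`, are illustrative assumptions rather than the exact protocol of the benchmark.

```python
import math


def box_radius(dims):
    """Half of the 3D diagonal of a box with the given side lengths (assumed)."""
    return 0.5 * math.sqrt(sum(d * d for d in dims))


def match_predictions(preds, gts):
    """Greedy distance-based matching.

    preds: list of (score, center) tuples; gts: list of (center, dims) tuples.
    Returns (pred_index, gt_index or None) pairs in descending score order.
    """
    taken = set()
    matches = []
    for i in sorted(range(len(preds)), key=lambda k: -preds[k][0]):
        _, center = preds[i]
        best, best_dist = None, float("inf")
        for j, (gt_center, gt_dims) in enumerate(gts):
            if j in taken:
                continue
            dist = math.dist(center, gt_center)
            # The threshold adapts to the object size via the GT box radius.
            if dist < box_radius(gt_dims) and dist < best_dist:
                best, best_dist = j, dist
        if best is not None:
            taken.add(best)
        matches.append((i, best))
    return matches


# Hypothetical scene: a thin cone and a car, both 10 m away.
gts = [((0.0, 0.0, 10.0), (0.3, 0.3, 0.7)), ((3.0, 0.0, 10.0), (1.8, 1.5, 4.5))]
preds = [(0.9, (0.0, 0.0, 10.2)), (0.8, (3.0, 0.0, 10.5)), (0.4, (0.0, 0.0, 11.0))]
print(match_predictions(preds, gts))
```

With a size-dependent threshold, a small depth error on a thin object can still count as a match, whereas the 3D IoU of two thin boxes offset along the depth axis can drop to zero.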

We also propose the normalized true positive errors (TPE) to further analyze the matched predictions. First, we compute the 3D Euclidean distance between prediction and GT and normalize it by the matching criterion to obtain the translation error (TE). Second, we compute the IoU, i.e. $\text{IoU}_{\text{3D}}$, between prediction and GT after aligning the 3D centers and orientations, and use $1-\text{IoU}_{\text{3D}}$ to measure the scale error (SE). Finally, we compute the SO(3) relative angle between prediction and GT, normalized by $\pi$, as the orientation error (OE). We average the TP errors across classes over different recall thresholds to get mATE, mASE, and mAOE. Combining $\text{AP}_{\text{3D}}^{\text{dist}}$ with the proposed normalized true positive errors to obtain ODS provides a better matching criterion for monocular 3D object detection while still evaluating localization, orientation, and dimension estimation at the same time.
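The three per-match errors can be sketched as below for a single matched pair. The sketch assumes boxes are given by side lengths and 3×3 rotation matrices, takes the matching radius as an input, and does not handle box symmetries in the orientation term. The function names are hypothetical, and the averaging over classes and recall thresholds is omitted.

```python
import math


def translation_error(pred_center, gt_center, match_radius):
    """Center distance normalized by the matching threshold."""
    return math.dist(pred_center, gt_center) / match_radius


def scale_error(pred_dims, gt_dims):
    """1 - IoU of two boxes that share the same center and orientation."""
    inter = math.prod(min(p, g) for p, g in zip(pred_dims, gt_dims))
    union = math.prod(pred_dims) + math.prod(gt_dims) - inter
    return 1.0 - inter / union


def orientation_error(R_pred, R_gt):
    """Geodesic angle on SO(3) between two rotations, normalized by pi."""
    # trace(R_pred^T R_gt) equals the element-wise (Frobenius) inner product.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle) / math.pi


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
yaw_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

print(translation_error((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), match_radius=2.0))
print(scale_error((2.0, 1.0, 1.0), (1.0, 1.0, 1.0)))
print(orientation_error(identity, yaw_90))
```

Each error lies in [0, 1] for a matched pair: TE because a match requires the distance to be below the threshold, SE because IoU lies in [0, 1], and OE because the relative angle is at most $\pi$.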

[Figure: two rows of image pairs; left column: Gemini 2, right column: 3D-MOOD]
Figure 6: Comparison with Gemini 2. We qualitatively compare with Gemini 2 given the novel classes.

H Qualitative Results

As shown in Fig. 6, we qualitatively compare our method with the 3D object detection beta functionality of the closed-source Gemini 2 [48], where 3D-MOOD provides more accurate localization. We provide more qualitative results in Fig. 7 for the open-set settings and Fig. 8 for the closed-set settings. We use a score threshold of 0.1 with class-agnostic non-maximum suppression for better visualization.
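The visualization filter can be sketched as a score threshold followed by greedy class-agnostic non-maximum suppression. The sketch assumes that suppression operates on axis-aligned 2D boxes in (x1, y1, x2, y2) format and uses an IoU threshold of 0.5. Neither is stated in the text, and the names `iou_2d` and `filter_for_visualization` are hypothetical.

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_for_visualization(dets, score_thr=0.1, iou_thr=0.5):
    """Drop low-score detections, then suppress overlaps regardless of class.

    dets: list of dicts with keys 'box', 'score', and 'label'.
    iou_thr is an assumed value; only score_thr = 0.1 comes from the text.
    """
    kept = []
    candidates = [d for d in dets if d["score"] >= score_thr]
    for det in sorted(candidates, key=lambda d: -d["score"]):
        # Class-agnostic: labels are ignored when checking overlap.
        if all(iou_2d(det["box"], k["box"]) <= iou_thr for k in kept):
            kept.append(det)
    return kept


# Hypothetical detections.
dets = [
    {"box": (0, 0, 10, 10), "score": 0.9, "label": "chair"},
    {"box": (1, 1, 10, 10), "score": 0.6, "label": "sofa"},
    {"box": (20, 20, 30, 30), "score": 0.05, "label": "lamp"},
    {"box": (20, 20, 30, 30), "score": 0.3, "label": "table"},
]
print([d["label"] for d in filter_for_visualization(dets)])
```

Ignoring labels during suppression keeps a single box per object even when an open-vocabulary detector assigns several class names to the same region, which makes the figures easier to read.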

[Figure: rows of four images each, labeled Argoverse 2, Argoverse 2, ScanNet, ScanNet, and 3D]
Figure 7: Open-set Qualitative Results. We show more visualization on Argoverse 2 [54] and ScanNet [8].

[Figure: rows of four images each, labeled KITTI, nuScenes, Objectron, Hypersim, ARKitScenes, SUN RGB-D, and 3D]
Figure 8: Closed-set Qualitative Results. We show the qualitative results for 3D-MOOD on Omni3D [4] test set.