
OV9D: Open-Vocabulary Category-Level 9D Object Pose and Size Estimation

Junhao Cai1  Yisheng He2  Weihao Yuan2  Siyu Zhu3  Zilong Dong2
Liefeng Bo2  Qifeng Chen1
1Hong Kong University of Science and Technology  2Alibaba Group  3Fudan University
Abstract

This paper studies a new open-set problem, the open-vocabulary category-level object pose and size estimation. Given human text descriptions of arbitrary novel object categories, the robot agent seeks to predict the position, orientation, and size of the target object in the observed scene image. To enable such generalizability, we first introduce OO3D-9D, a large-scale photorealistic dataset for this task. Derived from OmniObject3D, OO3D-9D is the largest and most diverse dataset in the field of category-level object pose and size estimation. It includes additional annotations for the symmetry axis of each category, which help resolve symmetric ambiguity. Apart from the large-scale dataset, we find another key to enabling such generalizability is leveraging the strong prior knowledge in pre-trained visual-language foundation models. We then propose a framework built on pre-trained DinoV2 and text-to-image stable diffusion models to infer the normalized object coordinate space (NOCS) maps of the target instances. This framework fully leverages the visual semantic prior from DinoV2 and the aligned visual and language knowledge within the text-to-image diffusion model, which enables generalization to various text descriptions of novel categories. Comprehensive quantitative and qualitative experiments demonstrate that the proposed open-vocabulary method, trained on our large-scale synthesized data, significantly outperforms the baseline and can effectively generalize to real-world images of unseen categories. The project page is at https://ov9d.github.io.

Equal contributions. Corresponding author.

Vision-based object pose estimation (OPE), which aims to estimate the object’s position and orientation from the observed visual information, is a fundamental problem in computer vision and robotics. This task is widely used in numerous applications such as robotic manipulation [56], augmented reality [36], surface reconstruction [57], etc.

From the perspective of working scope, the existing work of object pose estimation can be divided into three categories: 1) instance-level [60, 53, 20, 21], 2) generalized instance-level [34], and 3) category-level [54, 63]. Instance-level OPE deals with a limited number of objects that are fully available in both the training and testing sets. Generalized instance-level OPE methods can handle objects that are unseen during the training stage, but these methods usually require complete 3D object models [34] or large amounts of templates with ground-truth poses [34, 47] during the inference stage, which inevitably limits their application. Category-level OPE aims at estimating object poses from the category-canonicalized object frame to the camera frame, allowing the model to perform estimation on novel object instances within the same category without the requirement of object models. However, such methods still suffer from similar problems as instance-level OPE, i.e., they cannot generalize to objects with novel categories.

Figure 1: The open-vocabulary learning of category-level pose and size estimation is trained on a large dataset with diverse categories, such that it can generalize to novel categories given text prompts describing an unseen target object in novel scene images.

While the existing methods of OPE are still confined to a limited generalization scope, the rapid development of open-vocabulary learning [58] has demonstrated that open-vocabulary approaches [14, 31, 61, 67, 66] can effectively eliminate the gap between close-set and open-set scenarios by making use of the feature alignment learned by visual language models. These methods have achieved decent performance on many vision tasks such as detection [14] and segmentation [31, 61, 67], even on novel categories in the wild. Moreover, many vision foundation models trained on Internet-scale data have also shown remarkable zero-shot performance on many tasks [61, 66, 64]. However, the task of open-vocabulary pose and size estimation has not yet been explored.

In this paper, we take one step forward in this direction by introducing a new challenging problem: open-vocabulary category-level object pose estimation, which aims to estimate the pose and size of objects from novel categories in a novel RGB-D scene image given only text inputs. To enable such generalization capability, our key insight is that we can take advantage of the prior alignment knowledge in visual-language foundation models pre-trained on large-scale language and image datasets. This alignment can guide the training toward consistent pose estimation using identical text descriptions for same-category instances and enables knowledge transfer to new categories. Specifically, we design an open-vocabulary framework based on CLIP (Contrastive Language-Image Pretraining) [40], the text-to-image stable diffusion model [43], and DinoV2 [38] to estimate normalized object coordinate space (NOCS) maps [54] from monocular RGB images. Concretely, text features from the CLIP model and latent visual features from VQVAE [15] are fused in the diffusion UNet to generate diffusion features. The RGB images are also fed into the DinoV2 module to extract additional discriminative features. These two types of features are jointly leveraged to estimate the NOCS maps of the target objects.

However, the current datasets [48, 54, 63] for training such a framework contain either limited object instances or few categories, which is not enough for open-vocabulary learning. To resolve this, we create a large-scale photo-realistic dataset, namely OO3D-9D, which is derived from OmniObject3D (OO3D) [59]. This dataset comprises 5,371 objects spanning 216 categories, encompassing both single-object and multi-object scenarios. Each single-object scenario is composed of 1,000 RGB-D images with ground truth object poses. Each multi-object scenario consists of 5 captures of 5-20 objects, and there are 100K multi-object scenes in total.

In the experiments, we evaluate our method using both synthesized data and real data from Co3D [42]. The results demonstrate that our approach, trained with the generated large-scale dataset, achieves notable performance on novel object instances as well as objects belonging to previously unseen categories.

Our main contributions are then summarized as follows:

  • We introduce a new challenging problem, the open-vocabulary category-level object pose and size estimation, and establish a benchmark to study it.

  • We introduce a large-scale photo-realistic dataset for model training in this problem. This synthesized dataset is the largest and most diverse dataset for category-level object pose and size estimation. To generate this dataset, we add annotations of the objects’ symmetry axes in OO3D, making it suitable for addressing our problem.

  • We propose an open-vocabulary framework to tackle this problem, where we fully leverage the visual semantic prior from the pre-trained Dino and the aligned visual and text knowledge within the text-to-image diffusion model. For the first time, we reveal that these priors can boost the network generalizability to real-world images of unseen categories and enable open-vocabulary pose and size estimation given free-form descriptions of the target object.

1 Related Work

1.1 Generalized Object Pose Estimation

The task of generalized instance-level object pose estimation focuses on estimating 6-degree-of-freedom (6-DoF) poses for novel objects that are not available during training or finetuning. Traditional methods follow the pipeline of template rendering, feature extraction, and template matching to find the matched template and point correspondences [24, 26, 27]. To extract more robust features, recent methods use deep neural networks to perform object detection and correspondence matching between the reference and query images [47, 23, 34, 65]. Although these methods provide a more general framework to estimate poses for unseen objects, they inevitably require object models or templates with ground-truth 6-DoF poses, which might be impractical in real-world applications.

1.2 Category-Level Object Pose Estimation

Category-level object pose estimation [3] aims to estimate the scale, position, and orientation of objects in the same category that share the same reference frame, which is a more general task than instance-level object pose estimation [60]. Early work investigates this task by jointly detecting the object and estimating the corresponding normalized coordinates in the normalized object coordinate space (NOCS) [54]. Follow-up works leverage the geometric information from the observed point cloud or object shape priors to estimate both the canonical object coordinates and object poses [6, 22, 25, 50, 55, 7, 32, 29]. Other methods solve this problem in an analysis-by-synthesis manner by comparing the rendered and observed images [8, 28, 10, 2]. While these methods can only deal with object categories seen during training, our open-vocabulary framework can generalize to unseen categories given text descriptions.

Figure 2: Visualization of aligned objects: Row 1 features non-symmetric toy trucks with their heads aligned to the X-axis. Row 2 presents handbags containing discrete symmetric axes, with their openings aligned to the Y-axis. Row 3 shows bowls that possess continuous symmetric axes.

1.3 Datasets

Although the category-level methods perform well on novel instances within the same category, they cannot generalize to objects of a novel category. One main challenge is the lack of training data. The most widely used benchmarks are REAL275 and CAMERA25 proposed by [54], which contain only 6 categories and 18 object instances. Wild6D [63] scales up the number of object instances to 1,722, but its diversity is still limited to 5 categories. On the other hand, a large number of 3D object datasets have been proposed for various tasks including shape reconstruction and generation [5, 42, 13, 9, 59]. OO3D [59] introduces an aligned, real-scanned 3D database of 6,000 objects in 190 categories, which is diverse enough to cover commonly seen objects in daily life. However, this dataset only contains 3D models, and the object symmetry information is missing. In this work, we add annotations of the objects' symmetry axes and render a large-scale dataset for open-vocabulary pose and size estimation. Our dataset is so far the largest and most diverse dataset for category-level object pose and size estimation.

1.4 Open-Vocabulary Learning and Vision Foundation Models

Recently, visual language models trained with Internet-scale data have shown remarkable zero-shot performance on various vision tasks [40, 58]. These methods, including open-vocabulary object detection [14], segmentation [31, 61, 67], and depth estimation [66], make use of the alignment learned in the visual language models to reduce the gap between in-distribution and out-of-distribution scenarios. Meanwhile, many research results have demonstrated that visual models, including text-to-image diffusion models [43, 44] and ViTs [4, 38] trained on billions of samples [46], can serve as generalized feature descriptors, which remarkably improve the performance of generalized tasks including open-vocabulary learning [61, 66] and correspondence matching [64, 49]. Differently, we exploit features extracted from both stable diffusion [43] and DINOv2 [38] to infer the objects' normalized coordinates and estimate the pose and size parameters. To the best of our knowledge, we are the first to show that these priors can enable open-vocabulary pose and size estimation given free-form descriptions of the target object.

Figure 3: Example images in the OO3D-9D dataset. Single-object scenes, similar to those in CO3D, are displayed in the first row, while challenging multi-object scenes are displayed in the second row.
Figure 4: Overall framework. Text features are acquired from the prompt through the CLIP model and fed to the SD UNet. By combining these text features with latent visual features from VQVAE, SD feature maps are generated. Simultaneously, the DinoV2 module processes the masked RGB image to obtain Dino features. Both features are then combined in the decoder to estimate the NOCS map of the target object. During the inference stage, the depth map is utilized to establish correspondence between NOCS and the camera frame. Finally, the object’s size and pose are computed using a pose-fitting algorithm.

2 Methodology

2.1 Problem Statement

In this paper, we define the task of open-vocabulary category-level object pose estimation as estimating the scale, position, and orientation of objects in novel categories that are not present in the training stage. For the objects within the same category, the model should be able to recognize them and assign them implicitly consistent reference frames. In other words, the body frames attached to these objects are aligned in scale, position, and orientation in the model's understanding. Figure 2 gives an example of such alignment.

Formally, suppose an object set is divided into $m+n$ subsets as

$\mathcal{O}=\mathcal{O}_{M}\bigcup\mathcal{O}_{N}$   (1)

where $\mathcal{O}_{M}=\mathcal{O}_{1}\bigcup\mathcal{O}_{2}\bigcup\dots\bigcup\mathcal{O}_{m}$ and $\mathcal{O}_{N}=\mathcal{O}_{m+1}\bigcup\dots\bigcup\mathcal{O}_{m+n}$. Each subset $\mathcal{O}_{i}$ has a text description $d_{i}$ characterizing the category of the object set and a reference frame ${}^{\mathcal{O}_{i}}T$ shared by all the instances in the set. For a learning-based method, we further assume that the first $m$ subsets are the training set, and the remaining $n$ subsets are unknown to the model. Therefore, the task of open-vocabulary category-level object pose estimation can be formulated as estimating a 9-DoF size and pose ${}^{c}T_{\mathcal{O}_{i}}=\{s,{}^{c}R_{\mathcal{O}_{i}},t_{c}\}$ from the observed quadruple $\{I_{o_{ij}},D_{o_{ij}},M_{o_{ij}},d_{i}\}$, where $s$ is the 3D scale vector, ${}^{c}R_{\mathcal{O}_{i}}$ and $t_{c}$ are the rotation and translation components, $o_{ij}\in\mathcal{O}_{i}$, $\{I_{o_{ij}},D_{o_{ij}}\}$ is the RGB-D pair containing the object, $M_{o_{ij}}$ is the object mask, and $i\in[1,m+n]$. It is worth noting that in the ideal case, for any two objects in the same category $o_{ij},o_{ik}\in\mathcal{O}_{i}$, intra-category frame consistency requires ${}^{o_{ik}}T_{o_{ij}}=I$.
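To make the parameterization concrete, the following is a minimal NumPy sketch (our own illustration, not part of the original formulation) of how the 9-DoF size and pose $\{s, {}^{c}R_{\mathcal{O}_{i}}, t_{c}\}$ can be composed into a single transform mapping normalized object coordinates into the camera frame, under the assumption that a NOCS point $y$ maps as $x_{c}=R(s\odot y)+t$.

```python
import numpy as np

def compose_size_and_pose(s: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build a 4x4 transform mapping NOCS points into the camera frame.

    s: (3,) per-axis scale (object size along each NOCS axis) -- an assumption
       on how the scale vector enters the mapping.
    R: (3, 3) rotation from the object frame to the camera frame.
    t: (3,) translation of the object origin in the camera frame.
    """
    T = np.eye(4)
    T[:3, :3] = R @ np.diag(s)   # scale in the object frame, then rotate
    T[:3, 3] = t
    return T

# A NOCS point y maps to the camera frame as x_c = R @ (s * y) + t,
# i.e. x_c = (compose_size_and_pose(s, R, t) @ np.append(y, 1.0))[:3].
```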

2.2 Datasets

To ensure the network generalization capability, we generate a new large-scale and diverse pose dataset, called OO3D-9D, based on OmniObject3D [59] and BlenderProc [11]. Overall, OO3D-9D consists of 5,371 object instances in 216 categories. Each instance contains 1,000 RGB-D image pairs with ground truth 3D object sizes, 6-DoF poses, and masks. From this dataset, we hold out 10 categories with 230 instances as the test set of novel categories. We further select 214 instances from the known categories as the test set of novel instances. We also generate a multi-object dataset from these objects. Table 1 provides an overview of OO3D-9D and comparisons with other datasets.

Datasets #instance #category #image GT
CAMERA25 [54] 184 6 300K ✓
REAL275 [54] 24 6 8K ✓
Wild6D-train [63] 1,560 5 1M ✗
Wild6D-test [63] 162 5 100K ✓
FS6D [23] 12,490 51 800K ✓
OO3D-9D 5,371 216 5M ✓
Table 1: Comparison of different datasets. Our OO3D-9D is the largest and most diverse in object categories, which is critical in network generalizability under our open-vocabulary setting. GT denotes ground-truth object pose and size annotation.

2.2.1 Symmetry Axis Annotation of OO3D

While OmniObject3D contains large amounts of objects with diverse categories, the symmetric property of the objects is not provided, which is essential information for handling symmetry-shaped objects for the task of object pose estimation. To deal with this problem, we manually annotate symmetry axes for objects based on their shapes. Concretely, we follow the BOP data format [48] and add transformation matrices for discrete symmetry and rotation axes for continuous symmetry. Overall, the dataset comprises 1129 objects exhibiting discrete symmetry and 2066 objects displaying continuous symmetry. Figure 2 also visualizes some objects with their symmetry axes.

It is worth noting that many objects in OmniObject3D, such as apples and broccoli, neither have a clear sense of direction nor exhibit strict symmetry. For these objects, we annotate continuous symmetric axes along recognizable directions and consider them as the principal directions of the objects. Consequently, the model will estimate only the principal directions when dealing with such objects.

2.2.2 OO3D-9D Generation

We follow the usage of BlenderProc [11] by the BOP benchmark [48] to render data for each object. Specifically, RGB-D images with object masks and camera poses are rendered from randomized viewpoints sampled on spheres whose radii range between 3 and 5 times the diagonal size of the object. The principal directions of the cameras always point toward the object center. Such a configuration not only ensures a well-distributed pose space but also guarantees a clear view of the objects.
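Below is a minimal sketch of the viewpoint sampling described above, assuming uniform directions on a sphere and a standard look-at construction; the helper is our own illustration and does not reflect the BlenderProc API.

```python
import numpy as np

def sample_look_at_camera(object_diag: float, rng: np.random.Generator):
    """Sample a camera pose on a sphere of radius 3-5x the object diagonal,
    with the optical axis pointing at the object center (assumed at the origin)."""
    radius = rng.uniform(3.0, 5.0) * object_diag
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)            # uniform direction on the unit sphere
    cam_pos = radius * v

    forward = -cam_pos / np.linalg.norm(cam_pos)   # camera z-axis looks at the origin
    up = np.array([0.0, 1.0, 0.0])
    if abs(np.dot(forward, up)) > 0.99:            # avoid a degenerate up vector
        up = np.array([1.0, 0.0, 0.0])
    right = np.cross(up, forward)
    right /= np.linalg.norm(right)
    true_up = np.cross(forward, right)
    R_c2w = np.stack([right, true_up, forward], axis=1)  # camera-to-world rotation
    return R_c2w, cam_pos
```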

To construct a more comprehensive benchmark, we follow the pipeline of photorealistic image synthesis used in BOP Challenge [48] to further generate a multi-object dataset. Concretely, a random number of objects are sampled from the object set and placed on the workspace with random poses. We then randomly select one of the points of interest based on the location of the objects as the viewpoint of the camera. The visual data with ground truth poses is finally rendered with randomized lighting conditions and background textures. To simplify the usage, both datasets are organized in BOP format [48].

2.3 Approach

2.3.1 Preliminaries

CLIP [40]. CLIP (Contrastive Language-Image Pretraining) is a visual language model based on a transformer architecture [52] pre-trained using a contrastive learning approach that maximizes the agreement between semantically related images and text while minimizing the agreement between unrelated pairs.

Diffusion model [43]. The text-to-image diffusion model is a generative framework that synthesizes high-quality images from text descriptions. The architecture of the diffusion model comprises three modules: an encoder and a decoder derived from VQVAE [15] to map features between image and latent spaces, and a U-Net that performs a denoising operation in the latent space.

Dino [4, 38]. Dino and DinoV2 are self-supervised learning methods for visual representation learning. The model is based on Vision Transformers (ViT) [12] and employs contrastive learning on large-scale datasets, encouraging it to learn discriminative and diverse features.

2.3.2 Overview

The proposed method consists of a neural network $f_{\theta}(I,d)$ that takes as input a masked RGB image $I$ and the text description $d$ of the object, and outputs an estimate of the normalized object coordinate map $Y$ of that object in the normalized object coordinate space (NOCS) [54]. The learning objective is to optimize the prediction of $Y$ such that, by leveraging the depth map and the camera intrinsics, we can estimate the scale, rotation, and translation from the NOCS to the camera frame.

The overview of the network $f$ is shown in Figure 4, which consists of five parts: 1) the encoder of VQVAE $f_{\theta_{v}}$ that maps the image features to the latent space, 2) the CLIP model $f_{\theta_{c}}$ that encodes the text description to latent features, 3) the denoising UNet $f_{\theta_{u}}$ in Stable Diffusion that captures semantic relationships between text and visual features, 4) the DinoV2 model $f_{\theta_{d}}$ that extracts discriminative features from the images, and 5) the NOCS decoder $f_{\theta_{n}}$ that aggregates the features together and predicts the NOCS map of the object.

2.3.3 Normalized Object Coordinate Estimation From Prior Foundation Model

The encoder builds upon VPD [66] and consists of 4 parts. Latent visual features $v_{v}=f_{\theta_{v}}(I)$ and text features $v_{c}=f_{\theta_{c}}(d)$ are first extracted by the encoder of VQVAE and the CLIP model, respectively. These two features are then fused by the cross-attention modules in the denoising UNet, resulting in semantic features $v_{s}=f_{\theta_{u}}(v_{v},v_{c})$ of the target objects. Since the estimation of normalized object coordinates requires the model to recognize every part of the object in a discriminative manner, we further employ DinoV2 on the RGB images to extract additional image features $v_{d}=f_{\theta_{d}}(I)$.

With $v_{s}$ and $v_{d}$ in place, we concatenate and aggregate them with a convolutional layer. Since, in the implementation of the diffusion UNet, the resolution of the output feature maps is $1/32$ of the original image size, we use a 5-layer fully convolutional residual network as the decoder [19], where each layer contains a bilinear $2\times$ upsampling module, such that it outputs a coordinate map $\hat{Y}=f_{\theta_{n}}(v_{s},v_{d})$ with the same size and resolution as the input image.
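The following PyTorch sketch illustrates this fusion and decoding step. Feature dimensions, layer widths, and the exact residual layout are assumptions for illustration; the frozen encoders and the SD UNet that produce $v_s$ and $v_d$ are treated as black boxes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NOCSDecoder(nn.Module):
    """Fuse SD UNet features v_s with DinoV2 features v_d and upsample to a NOCS map."""

    def __init__(self, c_sd=1280, c_dino=1024, c_mid=256):  # channel sizes are assumptions
        super().__init__()
        self.fuse = nn.Conv2d(c_sd + c_dino, c_mid, kernel_size=1)
        # 5 upsampling stages: 1/32 -> 1/16 -> 1/8 -> 1/4 -> 1/2 -> 1/1 of the input size.
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(5)
        ])
        self.head = nn.Conv2d(c_mid, 3, kernel_size=1)   # 3-channel NOCS map

    def forward(self, v_s, v_d):
        # Resize Dino features to the SD feature resolution before concatenation.
        v_d = F.interpolate(v_d, size=v_s.shape[-2:], mode="bilinear", align_corners=False)
        x = self.fuse(torch.cat([v_s, v_d], dim=1))
        for block in self.blocks:
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = block(x) + x                             # residual refinement per scale
        return torch.sigmoid(self.head(x))               # coordinates assumed in [0, 1]
```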

The model is trained by minimizing the smooth L1 loss [18] between the predicted and ground truth normalized coordinate maps, which is denoted as

$L=\frac{1}{N}\sum_{i=1}^{H}\sum_{j=1}^{W}M_{ij}S_{ij},$   (2)

where $M$ is the object mask, $N=\sum_{i=1}^{H}\sum_{j=1}^{W}M_{ij}$, and $S_{ij}$ is the smooth L1 loss at pixel $ij$

$S_{ij}=\begin{cases}0.5(Y_{ij}-\hat{Y}_{ij})^{2}/\beta, & |Y_{ij}-\hat{Y}_{ij}|<\beta,\\ |Y_{ij}-\hat{Y}_{ij}|-0.5\beta, & \text{otherwise},\end{cases}$   (3)

where $\beta$ is the smooth threshold and is set to 0.1 in this work.

For objects with symmetry, we first augment the ground truth by transforming the ground-truth coordinate map according to the symmetry axes, forming an augmented coordinate map set $\mathcal{S}=\{S\}$. The corresponding loss function then becomes

$L=\inf_{S\in\mathcal{S}}\Big[\frac{1}{N}\sum_{i=1}^{H}\sum_{j=1}^{W}M_{ij}S_{ij}\Big].$   (4)
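A PyTorch sketch of the masked smooth-L1 loss of Eqns. (2)-(3) and its symmetry-aware version in Eqn. (4) is given below. The tensor shapes and the per-channel summation are our assumptions; the symmetry-augmented ground-truth maps are assumed to be precomputed.

```python
import torch
import torch.nn.functional as F

def masked_smooth_l1(pred, gt, mask, beta=0.1):
    """Eqns. (2)-(3): smooth-L1 averaged over masked pixels.
    pred, gt: (B, 3, H, W) NOCS maps; mask: (B, 1, H, W) binary object mask."""
    per_pixel = F.smooth_l1_loss(pred, gt, beta=beta, reduction="none").sum(dim=1, keepdim=True)
    return (per_pixel * mask).sum(dim=(1, 2, 3)) / mask.sum(dim=(1, 2, 3)).clamp(min=1.0)

def symmetry_aware_loss(pred, gt_set, mask, beta=0.1):
    """Eqn. (4): minimum of the masked loss over the symmetry-augmented GT maps.
    gt_set: (B, K, 3, H, W), one entry per symmetry transform of the GT map."""
    losses = torch.stack(
        [masked_smooth_l1(pred, gt_set[:, k], mask, beta) for k in range(gt_set.shape[1])],
        dim=1,
    )                                    # (B, K) per-sample losses
    return losses.min(dim=1).values.mean()
```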

2.3.4 Pose and Size Estimation

Given the observed depth map and the camera intrinsics, the partial point cloud of the object in the camera frame can be obtained. With the estimated normalized coordinates, we can further establish 3D correspondences between the camera frame and the normalized coordinate space. Following the pipeline in [54], we recover the object pose by using the Umeyama algorithm [51] within the RANSAC paradigm [16].
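The following NumPy sketch illustrates this pose-fitting step: a Umeyama similarity alignment between the predicted NOCS points and the back-projected depth points, wrapped in a simple RANSAC loop. Thresholds, sample sizes, and iteration counts are illustrative choices, not values from the paper.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ≈ s * R @ src + t.
    src: (N, 3) NOCS coordinates; dst: (N, 3) camera-frame points from depth."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                    # avoid reflections
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / xs.var(axis=0).sum()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

def ransac_pose(nocs_pts, cam_pts, iters=100, thresh=0.01, rng=None):
    """Fit (scale, R, t) robustly; thresh is an inlier distance in camera units."""
    rng = rng or np.random.default_rng(0)
    best, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(nocs_pts), size=4, replace=False)
        s, R, t = umeyama(nocs_pts[idx], cam_pts[idx])
        residual = np.linalg.norm(cam_pts - (s * (nocs_pts @ R.T) + t), axis=1)
        inliers = residual < thresh
        if inliers.sum() > best_inliers:
            best_inliers = inliers.sum()
            best = umeyama(nocs_pts[inliers], cam_pts[inliers])  # refit on all inliers
    return best   # (scale, rotation, translation)
```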

2.3.5 Network Architecture

We use the stable diffusion model [43] pre-trained on LAION-5b [46] as one of the image feature extractors, where we only use the feature maps extracted from the last block of the denoising UNet. Following VPD [66] and ODISE [61], we add zero noise to the input feature maps by setting the time step to 0. For DinoV2 [38], we use the distilled version ViT-L whose output stride is set to 14. We further perform bilinear interpolation on the Dino features to make the output image features consistent with features extracted from the diffusion UNet. As for the CLIP model [40], we only use the text encoder to extract the text features.

All these modules are initialized with the corresponding pre-trained models. Only the decoder is initialized with Kaiming Initialization. To fully exploit the performance of the pre-trained models, we keep CLIP, DinoV2, and the encoder of VQVAE fixed and only tune the parameters of the diffusion UNet and the decoder during training.
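A short sketch of this training setup is given below: the pre-trained encoders are frozen and only the SD UNet and the NOCS decoder are optimized. The module attribute names are placeholders for the corresponding components, not an actual API.

```python
import itertools
import torch

def build_optimizer(model, lr=1e-4):
    """Freeze CLIP text encoder, DinoV2, and the VQVAE encoder; tune only the
    diffusion UNet and the NOCS decoder (attribute names are illustrative)."""
    for module in (model.clip_text, model.dino, model.vqvae_encoder):
        for p in module.parameters():
            p.requires_grad_(False)
    trainable = itertools.chain(model.sd_unet.parameters(), model.decoder.parameters())
    return torch.optim.AdamW(trainable, lr=lr)
```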

Figure 5: Qualitative results on OO3D-9D. Row 1: single object scene. Row 2: occluded scene. Predicted and ground truth boxes are colored in blue and green, respectively.
Abs IoU@50 Abs 5°5cm Abs 10°5cm Abs 10°10cm Rel 5°5cm Rel 10°5cm Rel 10°10cm
PCA IST Ours PCA IST Ours PCA IST Ours PCA IST Ours PCA IST Ours PCA IST Ours PCA IST Ours
bowl† 19.2 53.7 97.3 1.2 14.7 41.7 1.3 32.5 62.3 1.3 32.5 62.3 2.3 1.0 66.3 4.7 2.7 90.5 9.3 2.7 90.5
bumbag‡ 29.6 40.4 91.7 2.4 0.0 10.9 5.2 0.2 41.5 5.8 0.2 41.5 2.4 0.2 14.2 4.0 0.2 44.2 11.8 0.2 44.2
dumpling‡ 46.7 64.5 81.2 3.7 0.0 1.8 6.6 0.4 11.0 6.6 0.4 11.0 6.0 0.1 1.9 11.3 0.1 10.4 11.3 0.1 10.4
facial cream‡ 3.6 37.1 91.3 0.4 7.7 34.0 1.7 17.1 52.7 1.7 17.1 52.7 2.1 0.3 25.3 4.9 2.3 56.4 4.9 2.3 58.4
handbag‡ 35.5 52.3 91.8 1.6 0.0 22.0 3.0 0.2 54.7 4.1 0.2 56.0 4.1 0.0 25.3 6.8 0.1 56.4 10.0 0.1 56.4
litchi† 32.4 89.9 94.6 1.7 15.3 7.0 2.8 26.4 18.6 2.8 26.4 18.6 2.4 0.4 18.4 3.7 2.5 40.5 3.7 2.5 40.5
mouse* 10.8 28.8 72.9 0.3 0.0 6.3 0.8 0.1 17.7 0.8 0.1 17.7 2.7 0.1 3.9 7.3 0.1 15.0 7.3 0.1 15.0
pineapple† 29.7 41.0 89.4 1.4 6.9 30.8 2.8 14.2 49.5 3.2 15.0 49.5 1.9 0.2 47.3 3.4 0.8 69.0 3.4 0.8 72.3
teddy bear* 64.0 61.8 91.6 0.2 0.0 1.0 0.4 0.6 5.0 0.6 0.6 5.2 2.0 0.2 1.0 3.0 0.2 2.0 4.8 0.2 2.0
toy truck* 58.7 57.5 76.5 5.9 0.0 2.8 9.6 0.0 11.3 9.6 0.0 11.3 2.2 0.1 4.2 5.0 0.2 12.9 5.0 0.2 12.9
all 33.2 52.7 87.8 1.9 4.5 15.8 3.42 9.2 32.4 3.7 9.3 32.6 2.8 0.3 26.8 5.41 0.9 43.8 7.2 0.9 44.4
Table 2: Quantitative results on OO3D-9D. †: continuous symmetric. ‡: discrete symmetric. *: non-symmetric.
Figure 6: Qualitative results on unseen real-world data. Our model trained with OO3D-9D data could directly perform 6D pose and size estimation on the unseen Co3Dv2 dataset. The odd rows display cropped real-world images with our estimated 6D pose and size (visualized with the object bounding box), while the even rows display their corresponding aligned shapes in the normalized object coordinate space.
Figure 7: The absolute and relative precision on OO3D-9D. The average precision (AP) with respect to different thresholds of 3D IoU, rotation precision, and translation precision are reported.

3 Experiments

3.1 Implementation Details

Training details. We train the model for 25 epochs with a batch size of 64 on 4 A100 GPUs. We use AdamW [35] as the optimizer with a fixed learning rate of 0.0001. The input image size is set to $480^{2}$. To compute the training loss in Eqn. 4 for continuous symmetric objects, we approximate the augmented map set $\mathcal{S}$ by sampling 36 angles around the rotation axis at equal angular intervals. The sampled rotation matrices are applied to $S$, yielding an approximated map set $\mathcal{S}$. In practice, the input mask can be obtained by an open-vocabulary masking model [61] and the depth can be obtained by a depth estimation model [62]. In the experiments, for simplicity in evaluating our core modules, we use the ground-truth masks and depths provided in the dataset.
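A sketch of the continuous-symmetry approximation described above is shown below: 36 rotations at equal angular intervals about the annotated symmetry axis are applied to the ground-truth NOCS map. The use of SciPy and the assumption that NOCS coordinates are centered at 0.5 are implementation choices for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def symmetry_augmented_maps(gt_nocs, axis, num_samples=36):
    """gt_nocs: (H, W, 3) ground-truth NOCS map (assumed centered at 0.5);
    axis: (3,) unit symmetry axis in the object frame.
    Returns (num_samples, H, W, 3) rotated copies approximating the set S."""
    centered = gt_nocs - 0.5
    angles = np.linspace(0.0, 2 * np.pi, num_samples, endpoint=False)
    maps = []
    for a in angles:
        R = Rotation.from_rotvec(a * np.asarray(axis)).as_matrix()
        maps.append(centered @ R.T + 0.5)     # rotate coordinates about the NOCS center
    return np.stack(maps, axis=0)
```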

Dataset rendering. We employ the camera intrinsics used by the YCB-V [60] dataset as the parameters of the virtual camera in BlenderProc. The image resolution is set to 480×640, which is also in accordance with YCB-V.

3.2 Benchmark Datasets

To thoroughly evaluate the performance of the models, we mainly conduct experiments on OO3D-9D single-object scenes with unseen categories. Concretely, we manually select 10 categories from the dataset as the test data, which are listed in Table 2. Overall, the dataset includes a total of 230 instances, comprising non-symmetric, discrete symmetric, and continuous symmetric objects.

To demonstrate the effectiveness of our method, we further conduct experiments on Co3Dv2 [42] and visualize the prediction results of various novel instances. For further details on the data and experiments involving new viewpoints of known instances and new instances of known categories, please refer to the supplementary material.

3.3 Evaluation Metrics

We first evaluate our results on the widely used Rot&Trans precision ($a^{\circ}$ $b$ cm) [30] and 3D Intersection over Union (IoU) [17] metrics, and then evaluate with a newly proposed metric, namely the relative Rot&Trans precision.

Rot&Trans precision is the average precision of the samples where the error of rotation is less than $a^{\circ}$ and the error of translation is less than $b$ cm, as

$\frac{1}{N}\sum_{j=1}^{N}f_{ab}(\hat{{}^{c_{j}}T_{\mathcal{O}_{i}}},{}^{c_{j}}T_{\mathcal{O}_{i}},a,b),$   (5)

where $f_{ab}\in\{0,1\}$ is 1 only if the pose differences are within the thresholds $a$ and $b$. We note that for objects with discrete symmetry, we first augment the ground truth poses according to the set of symmetry matrices and then compare the estimated results with the augmented pose set. The pose with minimal error is considered the final ground truth pose. For objects with continuous symmetry, we only compare the angle difference between the ground truth and estimated principal axes when computing the rotation error.
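A sketch of the threshold check $f_{ab}$ with the symmetry handling described above is given below; the array conventions and the unit assumption (translations in meters, converted to cm) are ours.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    cos = np.clip((np.trace(R_pred.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def axis_error_deg(R_pred, R_gt, axis):
    """Continuous symmetry: compare only the principal (symmetry) axis directions."""
    a_pred, a_gt = R_pred @ axis, R_gt @ axis
    return np.degrees(np.arccos(np.clip(a_pred @ a_gt, -1.0, 1.0)))

def pose_hit(R_pred, t_pred, R_gt, t_gt, a_deg, b_cm, sym_rotations=None, axis=None):
    """f_ab in Eqn. (5): 1 if both rotation and translation errors are within thresholds."""
    if axis is not None:                               # continuous symmetry
        rot_err = axis_error_deg(R_pred, R_gt, axis)
    elif sym_rotations is not None:                    # discrete symmetry: best over augmented GT
        rot_err = min(rotation_error_deg(R_pred, R_gt @ S) for S in sym_rotations)
    else:
        rot_err = rotation_error_deg(R_pred, R_gt)
    trans_err_cm = 100.0 * np.linalg.norm(t_pred - t_gt)   # assuming meters -> cm
    return float(rot_err <= a_deg and trans_err_cm <= b_cm)
```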

3D IoU denotes the intersection over union (IoU) of two 3D bounding boxes. In this work, we follow [54] and compute axis-aligned 3D IoU@50 based on the Pytorch3D implementation [41].
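For the axis-aligned case used here, the 3D IoU reduces to a product of per-axis overlaps; the following NumPy sketch is an equivalent illustration (the paper itself relies on the PyTorch3D implementation [41]).

```python
import numpy as np

def axis_aligned_iou3d(box_a, box_b):
    """box_*: (2, 3) arrays holding the (min_corner, max_corner) of each 3D box."""
    inter_min = np.maximum(box_a[0], box_b[0])
    inter_max = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(inter_max - inter_min, 0.0, None))   # overlap volume
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter + 1e-12)
```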

Relative Rot&Trans precision. For objects of unseen categories, the model may have its own cognition of the object reference frames, which might differ from the ones provided by the data. We therefore propose a novel metric that evaluates the consistency of the estimated poses. Suppose the estimated poses for a specific object category $\mathcal{O}_{i}$ are denoted as $({}^{c_{1}}T_{\mathcal{O}_{i}p_{1}},{}^{c_{2}}T_{\mathcal{O}_{i}p_{2}},\dots,{}^{c_{n}}T_{\mathcal{O}_{i}p_{n}})$ and their corresponding ground truths are $({}^{c_{1}}T_{\mathcal{O}_{i}},{}^{c_{2}}T_{\mathcal{O}_{i}},\dots,{}^{c_{n}}T_{\mathcal{O}_{i}})$, where $\mathcal{O}_{i}p_{j}$ represents the predicted object reference frame recognized by the model for the $j$th sample. We can thus obtain the relative pose set from the predicted to the ground truth object frame, denoted as $({}^{\mathcal{O}_{i}}T_{\mathcal{O}_{i}p_{1}},{}^{\mathcal{O}_{i}}T_{\mathcal{O}_{i}p_{2}},\dots,{}^{\mathcal{O}_{i}}T_{\mathcal{O}_{i}p_{n}})$. We then compute the minimal relative difference among this pose set and regard it as the relative pose error. The average precision of the relative poses is computed as

$\frac{1}{N-1}\max_{j}\sum_{k=1,k\neq j}^{N}f_{ab}({}^{\mathcal{O}_{i}}T_{\mathcal{O}_{i}p_{k}},{}^{\mathcal{O}_{i}}T_{\mathcal{O}_{i}p_{j}},a,b).$   (6)
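A sketch of Eqn. (6) follows: the relative pose from each prediction to its ground-truth frame is computed, each sample is tried in turn as the reference, and the best agreement count is kept. The pose comparison reuses the `pose_hit` check sketched above; 4x4 homogeneous poses are an assumed input format.

```python
import numpy as np

def relative_precision(T_pred_list, T_gt_list, a_deg, b_cm, pose_hit_fn):
    """Eqn. (6). T_pred_list / T_gt_list: 4x4 camera-frame poses for one category."""
    # Relative pose from the predicted object frame to the GT object frame.
    rel = [np.linalg.inv(T_gt) @ T_pred for T_pred, T_gt in zip(T_pred_list, T_gt_list)]
    best = 0
    for j, T_ref in enumerate(rel):              # try each sample as the reference frame
        hits = sum(
            pose_hit_fn(T_k[:3, :3], T_k[:3, 3], T_ref[:3, :3], T_ref[:3, 3], a_deg, b_cm)
            for k, T_k in enumerate(rel) if k != j
        )
        best = max(best, hits)
    return best / max(len(rel) - 1, 1)
```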

3.4 Baselines

Since no prior work addresses this novel task, very few baselines can be applied to this problem. Widely used baselines such as ICP [1] or CPD [37] cannot be used in this task, since we assume that the 3D object model is not available during the inference stage. Since principal component analysis (PCA) can infer a coordinate system from the observed data and is commonly used as a baseline in shape canonicalization [45], we adopt it as the baseline for estimating translation and rotation from the observed point cloud. To compare with other methods, we further apply the training data to IST-Net [33], the current state-of-the-art method for close-set category-level object pose estimation, and assess the metrics on the test set with new categories.

3.5 Evaluation Results

3.5.1 Results on OO3D-9D

In this experiment, we sample 50 views for each object instance and evaluate the metrics for each category. The mean average precision (mAP) of all metrics is presented in Table 2, while Fig. 7 shows more comprehensive results for each category.

Overall, our method significantly outperforms the baseline across all metrics. Specifically, our method attains mAPs of 87.8%, 15.8%, and 32.4% for absolute IoU@50, 5°5cm, and 10°5cm, respectively, indicating the preference of the model for determining object reference frames aligned with manually labeled data. On the other hand, we also obtain mAPs of 26.8% and 43.8% for relative 5°5cm and 10°5cm, compared with 2.8% and 5.4% achieved by PCA, demonstrating that our method recognizes more consistent reference object frames within the same category. While IST-Net achieves slightly better results on absolute metrics than PCA, it performs worse on relative metrics, which shows that such close-set methods cannot directly adapt to unseen object categories even with large amounts of training data.

Furthermore, we observe that our method attains mAPs of 44.0%, 31.6%, and 3.0% for relative 5°5cm on categories with continuous symmetric, discrete symmetric, and non-symmetric instances, respectively. We believe that symmetric objects tend to have regular shapes, making it easier for the model to capture their internal principal directions.

3.5.2 Results on Co3Dv2

We visualize in Figure 6 the qualitative results on unseen real-world data from Co3Dv2. The even rows display point clouds of instances transformed into normalized object coordinate space based on the estimation of our method. It can be observed that our open-vocabulary framework trained with only synthesized data can accurately estimate pose and size across instances with various novel categories. Interestingly, the upward directions of all instances align with the y-axes, consistent with most training data instances (some shown in Figure 3). This further suggests that by training with our OO3D-9D, the open-vocabulary framework can transfer the cognition of canonical object frames to novel objects in a manner consistent with human understanding.

REAL275 50 cat. 100 cat. All cat.
Abs IoU@50 48.2 75.5 81.2 87.8
Abs 5°5cm 2.8 6.9 11.8 15.8
Rel 5°5cm 4.5 7.8 21.9 26.8
Table 3: Ablation study of different scales of training data.
Dino Full (Dino+SD)
Abs IoU@50 75.0 87.8
Abs 5°5cm 8.8 15.8
Rel 5°5cm 9.8 26.8
Table 4: Framework ablation.
Dummy Free-form text Category Labels
Abs 5°5cm 8.34 13.4 15.8
Rel 5°5cm 11.2 21.9 26.8
Table 5: Ablation study of different text prompts.

3.5.3 Ablation Study

Effects of different network components. We first examine the effectiveness of different encoder modules of the proposed framework. Specifically, we train the model using only the Dino encoder and compare it to the full framework (Dino + Stable Diffusion), yielding the results in Table 4. It is noticeable that while the Dino module performs well on various visual tasks such as depth estimation [38], directly adapting its features to this task results in inferior performance, achieving mAPs of only 8.8% and 9.8% on the absolute and relative 5°5cm precision, respectively.

Effects of the scale of the training data. We ablate the number of object categories used by our method to assess the impact of the data scale. We randomly select 2 subsets from the original OO3D-9D dataset, containing 50 and 100 object categories, respectively. We then train our model on these two sub-datasets and evaluate the performance on data with novel categories. We also train our method on the NOCS REAL275 dataset and perform the evaluation on the same test data. The results, presented in Table 3, indicate that performance improves as the number of training categories increases, demonstrating that a larger dataset can enhance the transfer ability of open-vocabulary category-level pose estimation.

Figure 8: Visualization of results with free-form text prompts.

Effects of text descriptions. We investigate the influence of prompts by utilizing free-form descriptions to characterize objects. We also use dummy prompts consisting of meaningless spaces for comparison. Specifically, we substitute object names with descriptions sourced from Wikipedia and employ the textual features of these sentences to guide the estimation of NOCS maps for the target objects. Some of the text descriptions and their corresponding estimated results are depicted in Fig. 8, and the quantitative results are shown in Table 5. Thanks to the prior knowledge from the diffusion UNet, which is pre-trained with large-scale diverse image-text pairs, our method still gains effective information from these free-form texts.

4 Conclusions and Limitations

This paper presents a new open-set problem for estimating object pose and size from RGBD images and text descriptions. We propose an open-vocabulary framework that predicts NOCS maps using foundation models and develop OO3D-9D, a large-scale, photorealistic dataset with diverse categories. Experiments show that the proposed method trained with OO3D-9D can be effectively generalized to unseen object categories. However, our method achieves inferior performance when faced with non-symmetric object instances or categories with large intra-class variance. Moreover, this method relies on precise object masks to crop out the region of the target objects and the scenes in the training data are relatively simple compared with multi-object scenes. Future work will consider leveraging this dataset and integrating object detection and pose estimation in a unified framework.

Appendix A Appendix

A.1 Legitimacy of evaluation metrics

It is worth noting that in the setting of the open-vocabulary category-level object pose estimation problem, the model should have its own cognitive reference frame for each category, and these reference frames might differ from the ones manually annotated by human beings, which makes the absolute metrics mentioned in Sec. 3.3 less appropriate for this task. However, in OmniObject3D, the reference frames of most objects are aligned with specific patterns, e.g., the upward directions of the instances in OmniObject3D are usually aligned with the y-axes, and the front directions are usually aligned with the x-axes. Example object instances with labeled reference frames in these patterns are visualized in Figure 9. These patterns can be learned by the model during training and generalized to unseen categories, as visualized in Figure 10. Therefore, the absolute metrics are still a good indicator for evaluating the model in learning the patterned poses, which also benefits subsequent applications like grasping.

Figure 9: Visualization of instances in OmniObject3D with coordinate frames. In these examples, it is clear that the upward directions are aligned with the y-axes (green arrows) and the front directions are consistent with the x-axes (red arrows).
Figure 10: Qualitative results on unseen data. The odd rows display the estimated (blue boxes) and the ground-truth (green) poses, while the even rows display their estimated aligned shapes in the normalized object coordinate space.

A.2 Applications

In this section, we will showcase potential applications stemming from this task, highlighting the significance of conducting research on it. Our focus will primarily be on two tasks: robotic grasping and object reconstruction. It is essential to reiterate that, due to the formulation of open-vocabulary category-level object pose estimation, the canonical frames of instances within the same category must be consistent. Furthermore, instances from different categories with similar attributes will possess analogous canonical frames, resulting from the alignment learned through the language model and manually labeled data.

Robotic grasping. Building upon the aforementioned setup, we can transfer grasping knowledge to novel instances. An example is provided in Figure 11. Given an object instance from a specific category with predefined grasp poses (left), we can directly transfer the gripper poses to new instances within the same category (middle) once the object pose of this instance is recognized. Likewise, we can also transfer them to unseen instances (right) whose object attributes resemble the predefined instance.

Figure 11: Illustration of cross-instance / cross-category grasping. Left: pre-defined gripper pose for the box beverage instance. Middle: transferred gripper pose for the new box beverage instance. Right: transferred gripper pose for the bottle instance.

Object reconstruction. As the model possesses a unique reference frame for a specific object instance, we can utilize the recognized poses from the model to perform object reconstruction. Figure 12 displays the results of combined point clouds for the same instances, where the partial point clouds originate from depth maps of various views and are aligned by the estimated object poses.

Figure 12: Examples of object reconstruction.

A.3 Visualization of the data

In Sec. 3.5.3, we study the effect of the data scale on performance. Figure 14 visualizes the specific categories used in the ablation analysis of data scale. The numbers of instances, including the numbers of non-symmetric, discrete symmetric, and continuous symmetric instances, for each subset are listed in Table 6.

For the multi-object dataset, we also count the distributions of the number of instances and the number of object categories per image among 50k samples. The density distributions are visualized in Figure 13.

# cat. # ins. # NS # DS # CS
Set 1 50 1290 649 348 331
Set 2 100 2291 1127 513 689
Set 3 204 4782 1982 1004 1904

NS: non-symmetric. DS: discrete symmetric. CS: continuous symmetric.

Table 6: Statistics of data.
Figure 13: Statistics of the multi-object dataset.
(a) Set1: 50 object categories
(b) Set2: 100 object categories
(c) Set3: 206 object categories
Figure 14: Visualization of the object categories in the OmniObject3D dataset. Larger font size indicates more examples of that object category.
Abs IoU@50 Abs 5°5cm Abs 10°5cm Abs 10°10cm
New views NS 97.1 37.6 65.9 66.5
DS 97.5 46.7 73.9 74.4
CS 99.0 59.2 82.2 82.3
All 97.9 47.6 73.7 74.0
New instances NS 93.0 23.5 46.3 47.2
DS 96.1 44.0 68.6 68.7
CS 97.3 45.1 65.4 65.4
All 95.5 36.7 58.8 59.2
Table 7: Extra results of OO3D-9D.

A.4 More results on OO3D-9D

We conduct extra experiments on OO3D-9D by evaluating the metrics on known objects with novel views and unseen objects with known categories. To test the metrics on known objects with novel views, we randomly sample 50 novel views for each instance and perform the evaluation on these new samples. To construct the new instance set, we randomly select 2 instances from an object set of a specific category if the number of instances is greater than 20. The number of test views is 50 as well. The results are shown in Table 7. It can be seen that 1) the model performs better on objects with known categories, and it achieves the best performance when dealing with known instances, and 2) similar to the results in Sec. 3.5.1, our method achieves better precision on objects with symmetry for both test sets, which further implies the challenge that non-symmetric objects pose for open-vocabulary category-level object pose estimation.

To evaluate the close-set performance in comparison to other methods, we apply the training data to IST-Net [33], which is the current state-of-the-art method for close-set category-level object pose estimation, and assess the metrics on the test set with new instances. Specifically, we maintain a training batch size of 64, keeping other settings unchanged. The results can be found in Table 8. The performance of IST-Net is much inferior to that of our method, as it employs ResNet18 [19] and PointNet++ [39] as the network backbones, which have limited capacity when trying to fit our dataset containing diverse object categories.

Abs IoU@50 Abs 5°5cm Abs 10°5cm Abs 10°10cm
IST-Net [33] 56.6 8.3 15.1 15.1
Ours 95.5 36.7 58.8 59.2
Table 8: Comparison with IST-Net on the OO3D-9D test set with new instances.

A.5 Implementation Details

A.5.1 Details of generating text features

In this work, we borrow the prompt engineering used by [66] to generate text features for each object category. Concretely, we substitute each category name into the ImageNet template containing 80 sentences. Each sentence is then fed into the CLIP textual encoder to generate the corresponding feature. We compute the mean of the features within the same category and consider it as the final feature for that category. For the experiments with free-form texts, we use the descriptions looked up from the Collins English Dictionary to replace the original object names. Examples of the descriptions are listed in Table 9.
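A sketch of this prompt-ensembling step is shown below, using the Hugging Face CLIP text encoder as a stand-in (the actual implementation follows [66]); the template list is truncated, and whether pooled or token-level features are averaged is an implementation detail we assume here.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

TEMPLATES = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "a photo of one {}.",
    # ... the full 80-template ImageNet prompt set is used in practice
]

@torch.no_grad()
def category_text_feature(name: str, model_id: str = "openai/clip-vit-large-patch14"):
    """Encode all templated prompts for one category and average their features."""
    tokenizer = CLIPTokenizer.from_pretrained(model_id)
    text_model = CLIPTextModel.from_pretrained(model_id).eval()
    prompts = [t.format(name.replace("_", " ")) for t in TEMPLATES]
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    feats = text_model(**tokens).last_hidden_state   # (num_templates, seq_len, dim)
    return feats.mean(dim=0)                          # mean over templates -> (seq_len, dim)
```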

Object Name Description
bowl a round container
bumbag a small bag worn on a belt, round the waist
dumpling a small lump of dough that is cooked and eaten
facial_cream a cream used cosmetically for softening and cleaning the skin
handbag a small bag which a woman uses to carry things such as her money and keys in when she goes out
litchi a small rounded fruit with sweet white scented flesh, a large central stone, and a thin rough skin
mouse a device that is connected to a computer
pineapple a large tropical fruit with a rough orange or brown skin and pointed leaves on top
teddy_bear a toy of children, made from soft or furry material, which looks like a friendly bear
toy_truck a mini version of a real truck, designed for play and fun
Table 9: Examples of free-form text description.

References

  • Besl and McKay [1992] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Sensor fusion IV, pages 586–606. Spie, 1992.
  • Bruns and Jensfelt [2022] Leonard Bruns and Patric Jensfelt. Sdfest: Categorical pose and shape estimation of objects from rgb-d using signed distance fields. RAL, 7(4):9597–9604, 2022.
  • Bruns and Jensfelt [2023] Leonard Bruns and Patric Jensfelt. Rgb-d-based categorical object pose and shape estimation: Methods, datasets, and evaluation. arXiv preprint arXiv:2301.08147, 2023.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021.
  • Chang et al. [2015] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], 2015.
  • Chen et al. [2020a] Dengsheng Chen, Jun Li, Zheng Wang, and Kai Xu. Learning canonical shape space for category-level 6d object pose and size estimation. In CVPR, pages 11973–11982, 2020a.
  • Chen and Dou [2021] Kai Chen and Qi Dou. Sgpa: Structure-guided prior adaptation for category-level 6d object pose estimation. In ICCV, pages 2773–2782, 2021.
  • Chen et al. [2020b] Xu Chen, Zijian Dong, Jie Song, Andreas Geiger, and Otmar Hilliges. Category level object pose estimation via neural analysis-by-synthesis. In ECCV, pages 139–156. Springer, 2020b.
  • Deitke et al. [2023] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023.
  • Deng et al. [2022] Xinke Deng, Junyi Geng, Timothy Bretl, Yu Xiang, and Dieter Fox. icaps: Iterative category-level object pose and shape estimation. RAL, 7(2):1784–1791, 2022.
  • Denninger et al. [2023] Maximilian Denninger, Dominik Winkelbauer, Martin Sundermeyer, Wout Boerdijk, Markus Wendelin Knauer, Klaus H Strobl, Matthias Humt, and Rudolph Triebel. Blenderproc2: A procedural pipeline for photorealistic rendering. Journal of Open Source Software, 8(82):4901, 2023.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  • Downs et al. [2022] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In ICRA, pages 2553–2560. IEEE, 2022.
  • Du et al. [2022] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In CVPR, pages 14084–14093, 2022.
  • Esser et al. [2021] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021.
  • Fischler and Bolles [1981] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  • Geiger et al. [2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 32(11):1231–1237, 2013.
  • Girshick [2015] Ross Girshick. Fast r-cnn. In ICCV, pages 1440–1448, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • He et al. [2020] Yisheng He, Wei Sun, Haibin Huang, Jianran Liu, Haoqiang Fan, and Jian Sun. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In IEEE/CVF conference on computer vision and pattern recognition, pages 11632–11641, 2020.
  • He et al. [2021] Yisheng He, Haibin Huang, Haoqiang Fan, Qifeng Chen, and Jian Sun. Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. In CVPR, pages 3003–3013, 2021.
  • He et al. [2022a] Yisheng He, Haoqiang Fan, Haibin Huang, Qifeng Chen, and Jian Sun. Towards self-supervised category-level object pose and size estimation. arXiv preprint arXiv:2203.02884, 2022a.
  • He et al. [2022b] Yisheng He, Yao Wang, Haoqiang Fan, Jian Sun, and Qifeng Chen. Fs6d: Few-shot 6d pose estimation of novel objects. In CVPR, pages 6814–6824, 2022b.
  • Hinterstoisser et al. [2013] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In ACCV, pages 548–562. Springer, 2013.
  • Irshad et al. [2022] Muhammad Zubair Irshad, Thomas Kollar, Michael Laskey, Kevin Stone, and Zsolt Kira. Centersnap: Single-shot multi-object 3d shape reconstruction and categorical 6d pose and size estimation. In ICRA, pages 10632–10640. IEEE, 2022.
  • Konishi et al. [2016] Yoshinori Konishi, Yuki Hanzawa, Masato Kawade, and Manabu Hashimoto. Fast 6d pose estimation from a monocular image using hierarchical pose trees. In ECCV, pages 398–413. Springer, 2016.
  • Konishi et al. [2019] Yoshinori Konishi, Kosuke Hattori, and Manabu Hashimoto. Real-time 6d object pose estimation on cpu. In IROS, pages 3451–3458. IEEE, 2019.
  • Lee et al. [2021] Taeyeop Lee, Byeong-Uk Lee, Myungchul Kim, and In So Kweon. Category-level metric scale object shape and pose estimation. RAL, 6(4):8575–8582, 2021.
  • Li et al. [2023] Guanglin Li, Yifeng Li, Zhichao Ye, Qihang Zhang, Tao Kong, Zhaopeng Cui, and Guofeng Zhang. Generative category-level shape and pose estimation with semantic primitives. In CoRL, pages 1390–1400. PMLR, 2023.
  • Li et al. [2018] Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, and Dieter Fox. Deepim: Deep iterative matching for 6d pose estimation. In ECCV, pages 683–698, 2018.
  • Liang et al. [2023] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In CVPR, pages 7061–7070, 2023.
  • Lin et al. [2022] Jiehong Lin, Zewei Wei, Changxing Ding, and Kui Jia. Category-level 6d object pose and size estimation using self-supervised deep prior deformation networks. In ECCV, pages 19–34. Springer, 2022.
  • Liu et al. [2023] Jianhui Liu, Yukang Chen, Xiaoqing Ye, and Xiaojuan Qi. Ist-net: Prior-free category-level pose estimation with implicit space transformation. In ICCV, pages 13978–13988, 2023.
  • Liu et al. [2022] Yuan Liu, Yilin Wen, Sida Peng, Cheng Lin, Xiaoxiao Long, Taku Komura, and Wenping Wang. Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images. In ECCV, pages 298–315. Springer, 2022.
  • Loshchilov and Hutter [2018] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2018.
  • Marchand et al. [2016] Éric Marchand, Hideaki Uchiyama, and Fabien Spindler. Pose estimation for augmented reality: A hands-on survey. TVCG, 22(12):2633–2651, 2016.
  • Myronenko and Song [2010] Andriy Myronenko and Xubo Song. Point set registration: Coherent point drift. PAMI, 32(12):2262–2275, 2010.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • Qi et al. [2017] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  • Ravi et al. [2020] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020.
  • Reizenstein et al. [2021] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, pages 10901–10911, 2021.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NIPS, 35:36479–36494, 2022.
  • Sajnani et al. [2022] Rahul Sajnani, Adrien Poulenard, Jivitesh Jain, Radhika Dua, Leonidas J Guibas, and Srinath Sridhar. Condor: Self-supervised canonicalization of 3d pose for partial shapes. In CVPR, pages 16969–16979, 2022.
  • Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NIPS, 35:25278–25294, 2022.
  • Shugurov et al. [2022] Ivan Shugurov, Fu Li, Benjamin Busam, and Slobodan Ilic. Osop: A multi-stage one shot object pose estimation framework. In CVPR, pages 6835–6844, 2022.
  • Sundermeyer et al. [2023] Martin Sundermeyer, Tomáš Hodaň, Yann Labbe, Gu Wang, Eric Brachmann, Bertram Drost, Carsten Rother, and Jiří Matas. Bop challenge 2022 on detection, segmentation and pose estimation of specific rigid objects. In CVPR, pages 2784–2793, 2023.
  • Tang et al. [2024] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. NIPS, 36, 2024.
  • Tian et al. [2020] Meng Tian, Marcelo H Ang, and Gim Hee Lee. Shape prior deformation for categorical 6d object pose and size estimation. In ECCV, pages 530–546. Springer, 2020.
  • Umeyama [1991] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. PAMI, 13(04):376–380, 1991.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS, 30, 2017.
  • Wang et al. [2019a] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martin Martin, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In CVPR, pages 3343–3352, 2019a.
  • Wang et al. [2019b] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In CVPR, pages 2642–2651, 2019b.
  • Wang et al. [2021] Jiaze Wang, Kai Chen, and Qi Dou. Category-level 6d object pose estimation via cascaded relation and recurrent reconstruction networks. In IROS, 2021.
  • Wang and Tian [2023] Zhongli Wang and Guohui Tian. Object pose estimation from rgb-d images with affordance-instance segmentation constraint for semantic robot manipulation. RAL, 9(1):595–602, 2023.
  • Wen et al. [2023] Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Müller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. Bundlesdf: Neural 6-dof tracking and 3d reconstruction of unknown objects. In CVPR, pages 606–617, 2023.
  • Wu et al. [2024] Jianzong Wu, Xiangtai Li, Shilin Xu, Haobo Yuan, Henghui Ding, Yibo Yang, Xia Li, Jiangning Zhang, Yunhai Tong, Xudong Jiang, et al. Towards open vocabulary learning: A survey. PAMI, 2024.
  • Wu et al. [2023] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In CVPR, pages 803–814, 2023.
  • Xiang et al. [2018] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. RSS, 2018.
  • Xu et al. [2023] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, pages 2955–2966, 2023.
  • Yuan et al. [2022] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected crfs for monocular depth estimation. In CVPR, pages 3916–3925, 2022.
  • Ze and Wang [2022] Yanjie Ze and Xiaolong Wang. Category-level 6d object pose estimation in the wild: A semi-supervised learning approach and a new dataset. NIPS, 35:27469–27483, 2022.
  • Zhang et al. [2024] Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. NIPS, 36, 2024.
  • Zhao et al. [2022] Chen Zhao, Yinlin Hu, and Mathieu Salzmann. Fusing local similarities for retrieval-based 3d orientation estimation of unseen objects. In ECCV, pages 106–122. Springer, 2022.
  • Zhao et al. [2023] Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu. Unleashing text-to-image diffusion models for visual perception. In ICCV, pages 5729–5739, 2023.
  • Zou et al. [2024] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. NIPS, 36, 2024.