CRT-6D: Fast 6D Object Pose Estimation with Cascaded Refinement Transformers
1 Introduction
We would like to thank the reviewers and area chairs for the time spent reviewing our paper. The main weaknesses found by the reviewers (mainly Reviewers 1 and 3) were typos, formatting, and notation errors. We made a thorough pass over the paper and corrected the errors pointed out by the reviewers, as well as other small oversights. Below, we answer some concerns in more detail, either to clarify our position or to describe a more involved correction.
2 Addressing Concerns
L114. We replace "NOCS" with "Normalized Object Coordinates (NOCS)" and add references to relevant works that use this representation.
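For clarity, the following is a minimal sketch of how normalized object coordinates can be computed from a model's vertices (one common axis-aligned bounding-box variant); the function name and the toy point set are illustrative and not taken from our implementation.

```python
import numpy as np

def nocs_from_vertices(vertices: np.ndarray) -> np.ndarray:
    """Map object-space vertices (N, 3) to normalized object coordinates in [0, 1]^3.

    Each vertex is shifted and scaled by the model's axis-aligned bounding box,
    so the coordinates serve as pose-invariant, per-point correspondence targets.
    """
    vmin = vertices.min(axis=0)
    vmax = vertices.max(axis=0)
    extent = np.maximum(vmax - vmin, 1e-8)  # avoid division by zero for degenerate models
    return (vertices - vmin) / extent

# Illustrative usage with a toy point set (not a real object model):
verts = np.array([[0.0, -0.02, 0.05], [0.1, 0.03, -0.05], [0.05, 0.0, 0.0]])
nocs = nocs_from_vertices(verts)  # values lie in [0, 1] per axis
```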
Table 1. This table contained two major errors. First, as pointed out by Reviewer 3, we fix the formatting to position "N.S.O." correctly. Second, we had not provided citations for the compared methods, which we have fixed in the current draft.
Figure 4. "Why is SurfEmb not compared here?": The omission of SurfEmb [surfemb], as well as ZebraPose [zebrapose], with which we compare results, was deliberate. Their runtimes lie considerably far from those of the other methods, which would place both works outside the plot range with the current x-axis scaling. Their pose estimation runtimes for a single crop are 9× and 75× higher than the time our method needs for an average YCB-V image (the dataset has an average of 4.5 crops per image). For formatting and visualization reasons, we decided the plot would be better without these results. We will add a note to the final draft explaining why they were omitted: "ZebraPose [zebrapose] and SurfEmb [surfemb] are not shown as their results would lie outside the runtime range, with estimation taking over 250ms and 2000ms per object crop, respectively."
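As an illustrative sanity check of the quoted ratios, the snippet below recomputes them from the stated per-crop times; the per-image time assumed for our method is inferred from those ratios and is not an official benchmark number.

```python
# Back-of-the-envelope check of the runtime ratios quoted above.
# The per-image time for our method is inferred from the stated ratios
# and is only illustrative, not a reported benchmark figure.
zebrapose_per_crop_ms = 250.0   # lower bound stated above
surfemb_per_crop_ms = 2000.0    # lower bound stated above
ours_per_image_ms = 27.0        # assumed: one full YCB-V image (~4.5 crops on average)

print(zebrapose_per_crop_ms / ours_per_image_ms)  # ~9x  (one crop vs. one full image)
print(surfemb_per_crop_ms / ours_per_image_ms)    # ~75x
```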
"In Formula (5) the script JK(…) are not in the underscript." We fix this typo both in the equation and in subsequent uses of the affected symbol. As Reviewer 3 suggested, we replace "where refers to the predicted sampling offset for level and object keypoint and refers to the number of sampling points used for deformation." with "where refers to the predicted sampling offset for the deformable position, spatial level and object keypoint. refers to the number of sampling points used for deformation.", which should clarify the notation for the reader.
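To illustrate how the predicted sampling offsets and the number of sampling points enter the deformable sampling step discussed above, the following is a minimal single-level sketch in the style of standard deformable attention; the tensor names, shapes, and bilinear sampling via grid_sample are illustrative assumptions and do not reproduce the exact formulation of Eq. (5).

```python
import torch
import torch.nn.functional as F

def sample_deformable_features(feat, ref_points, offsets):
    """Gather features at deformed sampling locations (single feature level).

    feat:       (B, C, H, W) feature map for one spatial level.
    ref_points: (B, Q, 2) reference locations per query (keypoint), normalized to [0, 1].
    offsets:    (B, Q, K, 2) predicted sampling offsets for K sampling points per query,
                also in normalized coordinates.
    Returns (B, Q, K, C) sampled features, which an attention head would then
    combine using predicted attention weights.
    """
    locs = ref_points[:, :, None, :] + offsets   # deformed locations, (B, Q, K, 2) in [0, 1]
    grid = locs * 2.0 - 1.0                      # grid_sample expects coordinates in [-1, 1]
    sampled = F.grid_sample(feat, grid, mode="bilinear", align_corners=False)  # (B, C, Q, K)
    return sampled.permute(0, 2, 3, 1)           # (B, Q, K, C)

# Illustrative shapes: 1 image, 256 channels, a 32x32 level, 8 keypoint queries, K=4 points.
feat = torch.randn(1, 256, 32, 32)
refs = torch.rand(1, 8, 2)
offs = 0.05 * torch.randn(1, 8, 4, 2)
out = sample_deformable_features(feat, refs, offs)  # -> torch.Size([1, 8, 4, 256])
```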
2.1 Literature Review Missing Applications of Transformers in Pose Estimation
This is a valid concern, which we address by adding the following passage to the literature review in our current draft. To make space, we reduce the size of Table 1.
"Given the rising effectiveness of transformers in computer vision tasks, there have been attempts to use transformers to improve human, hand, and object [DProST, yolopose, 6dvit, osop, handobjecttransformer] pose estimation. For object pose, such approaches aim at improving results [DProST, yolopose], category-level estimation [6dvit, osop], or hand-object interaction [handobjecttransformer]. However, these improvements come at the cost of runtime, making them unsuitable for real-time applications. Our approach not only improves results but also decreases runtime compared to prior object pose estimation methods."
Note: In this rebuttal, we do not provide citations for human and hand pose estimation with transformers due to lack of space. We will add them to the final draft.
2.2 BOP Results
We will add our full results to the BOP challenge website [bop] before the 2022 challenge submission deadline. Additionally, we address Reviewer 2's first concern by making the training and inference code publicly available, along with the pre-trained models used in our experiments, allowing others to extend our method to their own real-time applications.