
TAX-Pose: Task-Specific Cross-Pose Estimation
for Robot Manipulation - Supplement

Appendix A Translational Equivariance

One benefit of our method is that it is translationally equivariant by construction. This means that if the object point clouds, $\mathbf{P}_{\mathcal{A}}$ and $\mathbf{P}_{\mathcal{B}}$, are translated by random translations $\mathbf{t}_{\alpha}$ and $\mathbf{t}_{\beta}$, respectively, i.e. $\mathbf{P}_{\mathcal{A}'}=\mathbf{P}_{\mathcal{A}}+\mathbf{t}_{\alpha}$ and $\mathbf{P}_{\mathcal{B}'}=\mathbf{P}_{\mathcal{B}}+\mathbf{t}_{\beta}$, then the resulting corrected virtual correspondences, $\mathbf{\tilde{V}}_{\mathcal{B}}$ and $\mathbf{\tilde{V}}_{\mathcal{A}}$, are transformed accordingly, i.e. $\mathbf{\tilde{V}}_{\mathcal{B}}+\mathbf{t}_{\beta}$ and $\mathbf{\tilde{V}}_{\mathcal{A}}+\mathbf{t}_{\alpha}$, respectively, as we will show below. This results in an estimated cross-pose transformation that is also equivariant to translation by construction.

This is achieved because our learned features and correspondence residuals are invariant to translation, and our virtual correspondence points are equivariant to translation.

First, our point features are a function of centered point clouds. That is, given point clouds $\mathbf{P}_{\mathcal{A}}$ and $\mathbf{P}_{\mathcal{B}}$, the mean of each point cloud is computed as

$$\bar{\mathbf{p}}_{k}=\frac{1}{N_{k}}\sum_{i=1}^{N_{k}}\mathbf{p}_{i}^{k}.$$

This mean is then subtracted from the clouds,

$$\bar{\mathbf{P}}_{k}=\mathbf{P}_{k}-\bar{\mathbf{p}}_{k},$$

which centers the cloud at the origin. The features are then computed on the centered point clouds:

$$\mathbf{\Phi}_{k}=g_{k}(\bar{\mathbf{P}}_{k}).$$

Since the point clouds are centered before features are computed, the features $\mathbf{\Phi}_{k}$ are invariant to an arbitrary translation $\mathbf{P}_{k'}=\mathbf{P}_{k}+\mathbf{t}_{\kappa}$.

These translationally invariant features are then used, along with the original point clouds, to compute “corrected virtual points” as a combination of virtual correspondence points, $\mathbf{v}_{i}^{k'}$, and correspondence residuals, $\boldsymbol{\delta}_{i}^{k'}$. As we will see below, the “corrected virtual points” will be translationally equivariant by construction.

The virtual correspondence points, $\mathbf{v}_{i}^{k'}$, are computed using weights that are a function of only the translationally invariant features, $\mathbf{\Phi}_{k}$:

$$\mathbf{w}_{i}^{\mathcal{A}'\to\mathcal{B}'}=\text{softmax}\left(\mathbf{\Phi}_{\mathcal{B}'}^{\top}\boldsymbol{\phi}_{i}^{\mathcal{A}'}\right)=\text{softmax}\left(\mathbf{\Phi}_{\mathcal{B}}^{\top}\boldsymbol{\phi}_{i}^{\mathcal{A}}\right)=\mathbf{w}_{i}^{\mathcal{A}\to\mathcal{B}};$$

thus the weights are also translationally invariant. These translationally invariant weights are applied to the translated cloud

$$\mathbf{v}_{i}^{\mathcal{A}'}=\mathbf{P}_{\mathcal{B}'}\mathbf{w}_{i}^{\mathcal{A}\to\mathcal{B}}=(\mathbf{P}_{\mathcal{B}}+\mathbf{t}_{\beta})\mathbf{w}_{i}^{\mathcal{A}\to\mathcal{B}}=\sum_{j}\mathbf{p}_{j}^{\mathcal{B}}\,w_{i,j}^{\mathcal{A}\to\mathcal{B}}+\mathbf{t}_{\beta}\sum_{j}w_{i,j}^{\mathcal{A}\to\mathcal{B}}=\mathbf{P}_{\mathcal{B}}\mathbf{w}_{i}^{\mathcal{A}\to\mathcal{B}}+\mathbf{t}_{\beta},$$

since $\sum_{j=1}^{N_{\mathcal{B}}}w_{i,j}^{\mathcal{A}\to\mathcal{B}}=1$. Thus the virtual correspondence points $\mathbf{v}_{i}^{\mathcal{A}'}$ are equivalently translated. The same logic follows for the virtual correspondence points $\mathbf{v}_{i}^{\mathcal{B}'}$. This gives us a set of translationally equivariant virtual correspondence points $\mathbf{v}_{i}^{\mathcal{A}'}$ and $\mathbf{v}_{i}^{\mathcal{B}'}$.
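
To make this argument concrete, here is a minimal NumPy sketch (random arrays stand in for the learned, translation-invariant features; this is not the paper's implementation) that checks that a virtual correspondence point computed from a translated anchor cloud shifts by exactly $\mathbf{t}_{\beta}$, because the softmax weights sum to one:

```python
# Minimal NumPy sketch: random features stand in for the learned
# translation-invariant embeddings Phi_A, Phi_B.
import numpy as np

rng = np.random.default_rng(0)
N_A, N_B, d = 5, 7, 16
Phi_A = rng.normal(size=(N_A, d))       # translation-invariant features of A
Phi_B = rng.normal(size=(N_B, d))       # translation-invariant features of B
P_B = rng.normal(size=(N_B, 3))         # anchor object cloud
t_beta = np.array([0.3, -1.2, 0.5])     # arbitrary translation of object B

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

w_i = softmax(Phi_B @ Phi_A[0])         # weights of point 0 of A over points of B; sums to 1
v_i = w_i @ P_B                         # virtual correspondence point
v_i_shifted = w_i @ (P_B + t_beta)      # same (invariant) weights, translated cloud

assert np.allclose(v_i_shifted, v_i + t_beta)   # equivariance: v_i shifts by t_beta
```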

The correspondence residuals, $\boldsymbol{\delta}_{i}^{k'}$, are a direct function of only the translationally invariant features $\mathbf{\Phi}_{k}$,

$$\boldsymbol{\delta}_{i}^{k'}=g_{\mathcal{R}}(\boldsymbol{\phi}_{i}^{k'})=g_{\mathcal{R}}(\boldsymbol{\phi}_{i}^{k})=\boldsymbol{\delta}_{i}^{k},$$

therefore they are also translationally invariant.

Since the virtual correspondence points are translationally equivariant, $\mathbf{v}_{i}^{\mathcal{A}'}=\mathbf{v}_{i}^{\mathcal{A}}+\mathbf{t}_{\beta}$, and the correspondence residuals are translationally invariant, $\boldsymbol{\delta}_{i}^{k'}=\boldsymbol{\delta}_{i}^{k}$, the final corrected virtual correspondence points, $\tilde{\mathbf{v}}_{i}^{\mathcal{A}'}$, are translationally equivariant, i.e. $\tilde{\mathbf{v}}_{i}^{\mathcal{A}'}=\mathbf{v}_{i}^{\mathcal{A}}+\boldsymbol{\delta}_{i}^{k}+\mathbf{t}_{\beta}$. This also holds for $\tilde{\mathbf{v}}_{i}^{\mathcal{B}'}$, giving us the final translationally equivariant correspondences between the translated object clouds as $\left(\mathbf{P}_{\mathcal{A}}+\mathbf{t}_{\alpha},\mathbf{\tilde{V}}_{\mathcal{B}}+\mathbf{t}_{\beta}\right)$ and $\left(\mathbf{P}_{\mathcal{B}}+\mathbf{t}_{\beta},\mathbf{\tilde{V}}_{\mathcal{A}}+\mathbf{t}_{\alpha}\right)$, where $\mathbf{\tilde{V}}_{\mathcal{B}}=\begin{bmatrix}\tilde{\mathbf{v}}_{1}^{\mathcal{A}}&\dots&\tilde{\mathbf{v}}_{N_{\mathcal{A}}}^{\mathcal{A}}\end{bmatrix}^{\top}$.

As a result, the final computed transformation will be automatically adjusted accordingly. Given that we use weighted SVD to compute the optimal transform, $\mathbf{T}_{\mathcal{A}\mathcal{B}}$, with rotational component $\mathbf{R}_{\mathcal{A}\mathcal{B}}$ and translational component $\mathbf{t}_{\mathcal{A}\mathcal{B}}$, the optimal rotation remains unchanged if the point cloud is translated, $\mathbf{R}_{\mathcal{A}'\mathcal{B}'}=\mathbf{R}_{\mathcal{A}\mathcal{B}}$, since the rotation is computed as a function of the centered point clouds. The optimal translation is defined as

$$\mathbf{t}_{\mathcal{A}\mathcal{B}}:=\bar{\tilde{\mathbf{v}}}_{\mathcal{A}}-\mathbf{R}_{\mathcal{A}\mathcal{B}}\cdot\bar{\mathbf{p}}_{\mathcal{A}},$$

where $\bar{\tilde{\mathbf{v}}}_{\mathcal{A}}$ and $\bar{\mathbf{p}}_{\mathcal{A}}$ are the means of the corrected virtual correspondence points, $\mathbf{\tilde{V}}_{\mathcal{B}}$, and the object cloud $\mathbf{P}_{\mathcal{A}}$, respectively, for object $\mathcal{A}$. Therefore, the optimal translation between the translated point cloud $\mathbf{P}_{\mathcal{A}'}$ and corrected virtual correspondence points $\mathbf{\tilde{V}}^{\mathcal{A}'}$ is

\begin{align*}
\mathbf{t}_{\mathcal{A}'\mathcal{B}'} &= \bar{\tilde{\mathbf{v}}}_{\mathcal{A}'} - \mathbf{R}_{\mathcal{A}\mathcal{B}}\cdot\bar{\mathbf{p}}_{\mathcal{A}'} \\
&= \bar{\tilde{\mathbf{v}}}_{\mathcal{A}} + \mathbf{t}_{\beta} - \mathbf{R}_{\mathcal{A}\mathcal{B}}\cdot(\bar{\mathbf{p}}_{\mathcal{A}} + \mathbf{t}_{\alpha}) \\
&= \bar{\tilde{\mathbf{v}}}_{\mathcal{A}} + \mathbf{t}_{\beta} - \mathbf{R}_{\mathcal{A}\mathcal{B}}\cdot\bar{\mathbf{p}}_{\mathcal{A}} - \mathbf{R}_{\mathcal{A}\mathcal{B}}\,\mathbf{t}_{\alpha} \\
&= \mathbf{t}_{\mathcal{A}\mathcal{B}} + \mathbf{t}_{\beta} - \mathbf{R}_{\mathcal{A}\mathcal{B}}\,\mathbf{t}_{\alpha}.
\end{align*}

To simplify the analysis, if we assume that, for a given example, $\mathbf{R}_{\mathcal{A}\mathcal{B}}=\mathbf{I}$, then we get $\mathbf{t}_{\mathcal{A}'\mathcal{B}'}=\mathbf{t}_{\mathcal{A}\mathcal{B}}+\mathbf{t}_{\beta}-\mathbf{t}_{\alpha}$, demonstrating that the computed transformation is translation-equivariant by construction.

Appendix B Weighted SVD

The objective function for computing the optimal rotation and translation, given a set of correspondences for object $\mathcal{K}$, $\{\mathbf{p}_{i}^{k}\rightarrow\tilde{\mathbf{v}}_{i}^{k}\}_{i}^{N_{k}}$, and weights $\{\alpha_{i}^{k}\}_{i}^{N_{k}}$, is as follows:

$$\mathcal{J}(\mathbf{T}_{\mathcal{A}\mathcal{B}})=\sum_{i=1}^{N_{\mathcal{A}}}\alpha_{i}^{\mathcal{A}}\left\|\mathbf{T}_{\mathcal{A}\mathcal{B}}\,\mathbf{p}_{i}^{\mathcal{A}}-\tilde{\mathbf{v}}_{i}^{\mathcal{A}}\right\|_{2}^{2}+\sum_{i=1}^{N_{\mathcal{B}}}\alpha_{i}^{\mathcal{B}}\left\|\mathbf{T}_{\mathcal{A}\mathcal{B}}^{-1}\,\mathbf{p}_{i}^{\mathcal{B}}-\tilde{\mathbf{v}}_{i}^{\mathcal{B}}\right\|_{2}^{2}$$

First we center (denoted with *) the point clouds and virtual points independently, with respect to the learned weights, and stack them into frame-specific matrices (along with weights) retaining their relative position and correspondence:

$$\mathbf{A}=\begin{bmatrix}\mathbf{P}_{\mathcal{A}}^{*\top}&\mathbf{\tilde{V}}_{\mathcal{B}}^{*\top}\end{bmatrix},\;\;\mathbf{B}=\begin{bmatrix}\mathbf{\tilde{V}}_{\mathcal{A}}^{*\top}&\mathbf{P}_{\mathcal{B}}^{*\top}\end{bmatrix},\;\;\boldsymbol{\Gamma}=\text{diag}\left(\begin{bmatrix}\boldsymbol{\alpha}_{\mathcal{A}}&\boldsymbol{\alpha}_{\mathcal{B}}\end{bmatrix}\right)$$

Then the minimizing rotation $\mathbf{R}_{\mathcal{A}\mathcal{B}}$ is given by:

$$\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}=\mathrm{svd}(\mathbf{A}\boldsymbol{\Gamma}\mathbf{B}^{\top})\tag{1}$$
$$\mathbf{R}_{\mathcal{A}\mathcal{B}}=\mathbf{U}\mathbf{\Sigma}_{*}\mathbf{V}^{\top}\tag{2}$$

where $\mathbf{\Sigma}_{*}=\mathrm{diag}\left(1,\,1,\,\det(\mathbf{U}\mathbf{V}^{\top})\right)$ and $\mathrm{svd}$ is a differentiable SVD operation [papadopoulo2000estimating].

The optimal translation can be computed as:

$$\mathbf{t}_{\mathcal{A}}=\bar{\tilde{\mathbf{v}}}_{\mathcal{B}}-\mathbf{R}_{\mathcal{A}\mathcal{B}}\,\bar{\mathbf{p}}_{\mathcal{A}}$$
$$\mathbf{t}_{\mathcal{B}}=\bar{\mathbf{p}}_{\mathcal{B}}-\mathbf{R}_{\mathcal{A}\mathcal{B}}\,\bar{\tilde{\mathbf{v}}}_{\mathcal{A}}$$
$$\mathbf{t}=\frac{N_{\mathcal{A}}}{N}\mathbf{t}_{\mathcal{A}}+\frac{N_{\mathcal{B}}}{N}\mathbf{t}_{\mathcal{B}}\tag{3}$$

with $N=N_{\mathcal{A}}+N_{\mathcal{B}}$. In the special translation-only case, the optimal translation can be computed by setting $\mathbf{R}_{\mathcal{A}\mathcal{B}}$ to the identity in the above equations. The final transform can then be assembled as:

$$\mathbf{T}_{\mathcal{A}\mathcal{B}}=\begin{bmatrix}\mathbf{R}_{\mathcal{A}\mathcal{B}}&\mathbf{t}\\0&1\end{bmatrix}\tag{4}$$
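
The following is a minimal NumPy sketch of this weighted SVD (weighted Procrustes) step, in the spirit of Equations 1-4; it is not the released implementation, and row/column conventions may differ. In the paper's formulation, the source stack would be $[\mathbf{P}_{\mathcal{A}}^{*};\,\mathbf{\tilde{V}}_{\mathcal{B}}^{*}]$, the target stack $[\mathbf{\tilde{V}}_{\mathcal{A}}^{*};\,\mathbf{P}_{\mathcal{B}}^{*}]$, and the weights $[\boldsymbol{\alpha}_{\mathcal{A}};\,\boldsymbol{\alpha}_{\mathcal{B}}]$.

```python
# Minimal sketch of weighted rigid alignment (weighted Kabsch/Procrustes).
import numpy as np

def weighted_procrustes(src, tgt, weights):
    """Return the 4x4 transform T minimizing sum_i w_i ||R @ src_i + t - tgt_i||^2.

    src, tgt: (N, 3) corresponding points; weights: (N,) non-negative weights.
    """
    w = weights / weights.sum()
    src_bar, tgt_bar = w @ src, w @ tgt          # weighted centroids
    A = (src - src_bar).T                        # 3 x N centered source points
    B = (tgt - tgt_bar).T                        # 3 x N centered target points
    H = A @ np.diag(w) @ B.T                     # 3 x 3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # reflection guard
    R = Vt.T @ D @ U.T
    t = tgt_bar - R @ src_bar
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```

Setting all weights to $\frac{1}{N}$ recovers the un-weighted SVD variant used in the ablation of Appendix F.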

Appendix C Cross-Object Attention Weight Computation

To map the estimated features obtained from an object-specific Embedding Network (DGCNN), $\boldsymbol{\Psi}_{\mathcal{A}}$ and $\boldsymbol{\Psi}_{\mathcal{B}}$ for objects $\mathcal{A}$ and $\mathcal{B}$, respectively, to a set of normalized weight vectors $\mathbf{W}_{\mathcal{A}\to\mathcal{B}}$ and $\mathbf{W}_{\mathcal{B}\to\mathcal{A}}$, we use the cross-attention mechanism of our Transformer module [vaswani2017attention]. Following Equations 5a and 5b from the paper, we can extract the desired normalized weight vector $\mathbf{w}_{i}^{\mathcal{A}\to\mathcal{B}}$ for any point $\mathbf{p}_{i}^{\mathcal{A}}$ using the intermediate attention embeddings of the cross-object attention module as:

$$\mathbf{w}_{i}^{\mathcal{A}\to\mathcal{B}}=\text{softmax}\left(\frac{\mathbf{K}_{\mathcal{B}}\,\mathbf{q}_{i}^{\mathcal{A}}}{\sqrt{d}}\right),\;\;\mathbf{w}_{i}^{\mathcal{B}\to\mathcal{A}}=\text{softmax}\left(\frac{\mathbf{K}_{\mathcal{A}}\,\mathbf{q}_{i}^{\mathcal{B}}}{\sqrt{d}}\right)\tag{5}$$

where $\mathbf{q}_{i}^{\mathcal{K}}\in\mathbf{Q}_{\mathcal{K}}$, and $\mathbf{Q}_{\mathcal{K}},\mathbf{K}_{\mathcal{K}}\in\mathbb{R}^{N_{\mathcal{K}}\times d}$ are the query and key (respectively) for object $\mathcal{K}$ associated with the cross-object attention Transformer module $g_{\mathcal{T}_{\mathcal{K}}}$, as shown in Figure S.1. These weights are then used to compute the virtual corresponding points $\mathbf{V}_{\mathcal{A}}$, $\mathbf{V}_{\mathcal{B}}$ using Equations 5a and 5b in the main paper.
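
As a rough PyTorch sketch of this weight extraction (Equation 5): the layer names and dimensions below are illustrative assumptions, not the paper's exact Transformer configuration.

```python
# Hedged sketch: separate query/key projections stand in for the intermediate
# attention embeddings of the cross-object attention module.
import torch
import torch.nn.functional as F

d, N_A, N_B = 64, 128, 256
W_q = torch.nn.Linear(d, d, bias=False)     # query projection (assumed)
W_k = torch.nn.Linear(d, d, bias=False)     # key projection (assumed)

Psi_A = torch.randn(N_A, d)                 # per-point embeddings of object A
Psi_B = torch.randn(N_B, d)                 # per-point embeddings of object B

Q_A, K_B = W_q(Psi_A), W_k(Psi_B)
# Row i is w_i^{A->B}: a distribution over the points of B for point i of A.
W_AB = F.softmax(Q_A @ K_B.T / d ** 0.5, dim=-1)    # (N_A, N_B), rows sum to 1

P_B = torch.randn(N_B, 3)
V_A = W_AB @ P_B                            # virtual corresponding points for A
```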

Figure S.1: Cross-object attention weight computation for the virtual soft correspondence $\mathbf{V}_{\mathcal{A}}$ from object $\mathcal{A}$ to $\mathcal{B}$. $\mathbf{Q}_{\mathcal{K}},\mathbf{K}_{\mathcal{K}},\textbf{Val}_{\mathcal{K}}\in\mathbb{R}^{N_{\mathcal{K}}\times d}$ are the query, key, and value (respectively) for object $\mathcal{K}$ associated with the cross-object attention Transformer module $g_{\mathcal{T}_{\mathcal{K}}}$. The Transformer block is modified from Figure 2(b) in DCP [wang2019deep].

C.1 Ablation

To explore the importance of the weight computation design choice described in Equation 5, we conducted an ablation experiment against an alternative, arguably simpler method for cross-object attention weight computation used in prior work [wang2019deep]. Since the point embeddings $\boldsymbol{\phi}_{i}^{\mathcal{A}}$ and $\boldsymbol{\phi}_{i}^{\mathcal{B}}$ have the same dimension $d$, we can use the inner product as a similarity metric between two embeddings. For any point $\mathbf{p}_{i}^{\mathcal{A}}$, we can extract the desired normalized weight vector $\mathbf{w}_{i}^{\mathcal{A}\to\mathcal{B}}$ with the softmax function:

$$\mathbf{w}_{i}^{\mathcal{A}\to\mathcal{B}}=\text{softmax}\left(\boldsymbol{\Phi}_{\mathcal{B}}^{\top}\boldsymbol{\phi}_{i}^{\mathcal{A}}\right),\;\;\mathbf{w}_{i}^{\mathcal{B}\to\mathcal{A}}=\text{softmax}\left(\boldsymbol{\Phi}_{\mathcal{A}}^{\top}\boldsymbol{\phi}_{i}^{\mathcal{B}}\right)\tag{6}$$

This is the approach used in the prior work of Deep Closest Point (DCP) [wang2019deep]. In the experiments below, we refer to this approach as point embedding dot-product.

We compared the weight computation method used in TAX-Pose (Equation 5) against the simpler approach from DCP [wang2019deep] (Equation 6) on the upright mug hanging task in simulation. The models are trained from 10 demonstrations and tested on 100 trials over the test mug set. As seen in Table LABEL:tab:ablation_weight, the TAX-Pose approach (Equation 5) outperforms the point embedding dot-product (Equation 6) in all three evaluation categories: grasp, place, and overall test success rate.

Appendix D Supervision Details

To train the encoders $g_{\mathcal{A}}(\bar{\mathbf{P}}_{\mathcal{A}})$ and $g_{\mathcal{B}}(\bar{\mathbf{P}}_{\mathcal{B}})$ as well as the residual networks $g_{\mathcal{R}}\left(\boldsymbol{\phi}_{i}^{\mathcal{A}}\right)$ and $g_{\mathcal{R}}\left(\boldsymbol{\phi}_{i}^{\mathcal{B}}\right)$, we use a set of losses defined below. We assume we have access to a set of demonstrations of the task, in which the action and anchor objects are in the target relative pose such that $\mathbf{T}_{\mathcal{A}\mathcal{B}}=\mathbf{I}$.

Point Displacement Loss [xiang2017posecnn, li2018deepim]: Instead of directly supervising the rotation and translation (as is done in DCP), we supervise the predicted transformation using its effect on the points. For this loss, we take the point clouds of the objects in the demonstration configuration, and transform each cloud by a random transform, $\hat{\mathbf{P}}_{\mathcal{A}}=\mathbf{T}_{\alpha}\mathbf{P}_{\mathcal{A}}$ and $\hat{\mathbf{P}}_{\mathcal{B}}=\mathbf{T}_{\beta}\mathbf{P}_{\mathcal{B}}$. This would give us a ground-truth transform of $\mathbf{T}_{\mathcal{A}\mathcal{B}}^{GT}=\mathbf{T}_{\beta}\mathbf{T}_{\alpha}^{-1}$; the inverse of this transform would move object $\mathcal{B}$ to the correct position relative to object $\mathcal{A}$. Using this ground-truth transform, we compute the MSE loss between the correctly transformed points and the points transformed using our prediction:

$$\mathcal{L}_{\mathrm{disp}}=\left\|\mathbf{T}_{\mathcal{A}\mathcal{B}}\mathbf{P}_{\mathcal{A}}-\mathbf{T}_{\mathcal{A}\mathcal{B}}^{GT}\mathbf{P}_{\mathcal{A}}\right\|^{2}+\left\|\mathbf{T}_{\mathcal{A}\mathcal{B}}^{-1}\mathbf{P}_{\mathcal{B}}-\left(\mathbf{T}_{\mathcal{A}\mathcal{B}}^{GT}\right)^{-1}\mathbf{P}_{\mathcal{B}}\right\|^{2}\tag{7}$$

Direct Correspondence Loss. While the Point Displacement Loss best describes errors seen at inference time, it can lead to correspondences that are inaccurate but whose errors average out to the correct pose. To reduce these errors, we directly supervise the learned correspondences $\mathbf{\tilde{V}}_{\mathcal{A}}$ and $\mathbf{\tilde{V}}_{\mathcal{B}}$:

$$\mathcal{L}_{\mathrm{corr}}=\left\|\mathbf{\tilde{V}}_{\mathcal{A}}-\mathbf{T}_{\mathcal{A}\mathcal{B}}^{GT}\mathbf{P}_{\mathcal{A}}\right\|^{2}+\left\|\mathbf{\tilde{V}}_{\mathcal{B}}-\left(\mathbf{T}_{\mathcal{A}\mathcal{B}}^{GT}\right)^{-1}\mathbf{P}_{\mathcal{B}}\right\|^{2}.\tag{8}$$

Correspondence Consistency Loss. Furthermore, a consistency loss can be used. This loss penalizes correspondences that deviate from the final predicted transform. A benefit of this loss is that it can help the network learn to respect the rigidity of the object while it is still learning to accurately place the object. Note that this is similar to the Direct Correspondence Loss, but uses the predicted transform instead of the ground-truth one. As such, this loss requires no ground truth:

$$\mathcal{L}_{\mathrm{cons}}=\left\|\mathbf{\tilde{V}}_{\mathcal{A}}-\mathbf{T}_{\mathcal{A}\mathcal{B}}\mathbf{P}_{\mathcal{A}}\right\|^{2}+\left\|\mathbf{\tilde{V}}_{\mathcal{B}}-\mathbf{T}_{\mathcal{A}\mathcal{B}}^{-1}\mathbf{P}_{\mathcal{B}}\right\|^{2}.\tag{9}$$

Overall Training Procedure. We train with a combined loss $\mathcal{L}_{\mathrm{net}}=\mathcal{L}_{\mathrm{disp}}+\lambda_{1}\mathcal{L}_{\mathrm{corr}}+\lambda_{2}\mathcal{L}_{\mathrm{cons}}$, where $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters. We use a network architecture similar to DCP [wang2019deep], which consists of a DGCNN [wang2019dynamic] and a Transformer [vaswani2017attention]. We also optionally incorporate a contextual embedding vector into each DGCNN module, identical to the contextual encoding proposed in the original DGCNN paper, which can be used to provide an embedding of the specific placement relationship desired in a scene (e.g. selecting a “top” vs. “left” placement position) and thus enable goal-conditioned placement. We refer to this variant as TAX-Pose GC (goal-conditioned). We briefly experimented with Vector Neurons [deng2021vector] and found that this led to worse performance on this task. In order to quickly adapt to new tasks, we optionally pre-train the DGCNN embedding networks over a large set of individual objects using the InfoNCE loss [oord2018representation] with a geometric distance weighting and random transformations, to learn $SE(3)$-invariant embeddings (see Appendix G.1 for further details).
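
A simplified, batch-free PyTorch sketch of these losses (Equations 7-9) is shown below; it averages over points rather than reproducing the exact reduction in the training code, and the default weights follow the values reported in Appendix F (0.1 for the consistency term, 1 for the correspondence term).

```python
# Hedged sketch of the supervision losses; V_A, V_B denote the corrected virtual
# correspondences (V-tilde in the text), T_pred and T_gt are 4x4 transforms.
import torch

def apply_tf(T, P):
    """Apply a 4x4 homogeneous transform to an (N, 3) point set."""
    return P @ T[:3, :3].T + T[:3, 3]

def taxpose_loss(T_pred, T_gt, P_A, P_B, V_A, V_B, lam_corr=1.0, lam_cons=0.1):
    T_pred_inv, T_gt_inv = torch.inverse(T_pred), torch.inverse(T_gt)

    # point displacement loss (Eq. 7): predicted vs. ground-truth placements
    l_disp = ((apply_tf(T_pred, P_A) - apply_tf(T_gt, P_A)) ** 2).sum(-1).mean() \
           + ((apply_tf(T_pred_inv, P_B) - apply_tf(T_gt_inv, P_B)) ** 2).sum(-1).mean()

    # direct correspondence loss (Eq. 8): supervise the corrected virtual points
    l_corr = ((V_A - apply_tf(T_gt, P_A)) ** 2).sum(-1).mean() \
           + ((V_B - apply_tf(T_gt_inv, P_B)) ** 2).sum(-1).mean()

    # consistency loss (Eq. 9): correspondences should agree with the prediction
    l_cons = ((V_A - apply_tf(T_pred, P_A)) ** 2).sum(-1).mean() \
           + ((V_B - apply_tf(T_pred_inv, P_B)) ** 2).sum(-1).mean()

    return l_disp + lam_corr * l_corr + lam_cons * l_cons
```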

Appendix E Visual Explanation

E.1 Illustration of Corrected Virtual Correspondence

The virtual corresponding points, $\mathbf{V}_{\mathcal{A}}$, $\mathbf{V}_{\mathcal{B}}$, given by Equation 3 in the main text, are constrained to be within the convex hull of each object. However, correspondences which are constrained to the convex hull are insufficient to express a large class of desired tasks. For instance, we might want a point on the handle of a teapot to correspond to some point above a stovetop, which lies outside the convex hull of the points on the stovetop. To allow for such placements, for each point-wise embedding $\boldsymbol{\phi}_{i}$, we further learn a residual vector, $\boldsymbol{\delta}_{i}^{\mathcal{A}}\in\boldsymbol{\Delta}_{\mathcal{A}}$, that corrects each virtual corresponding point, allowing us to displace each virtual corresponding point to any arbitrary location that might be suitable for the task. Concretely, we use a point-wise neural network $g_{\mathcal{R}}$ which maps each embedding into a 3D residual vector:

$$\boldsymbol{\delta}_{i}^{\mathcal{A}}=g_{\mathcal{R}}\left(\boldsymbol{\phi}_{i}^{\mathcal{A}}\right)\in\mathbb{R}^{3},\;\;\boldsymbol{\delta}_{i}^{\mathcal{B}}=g_{\mathcal{R}}\left(\boldsymbol{\phi}_{i}^{\mathcal{B}}\right)\in\mathbb{R}^{3}$$

Applying these to the virtual points, we get a set of corrected virtual correspondences, $\tilde{\mathbf{v}}_{i}^{\mathcal{A}}\in\mathbf{\tilde{V}}_{\mathcal{A}}$ and $\tilde{\mathbf{v}}_{i}^{\mathcal{B}}\in\mathbf{\tilde{V}}_{\mathcal{B}}$, defined as

$$\tilde{\mathbf{v}}_{i}^{\mathcal{A}}=\mathbf{v}_{i}^{\mathcal{A}}+\boldsymbol{\delta}_{i}^{\mathcal{A}},\;\;\tilde{\mathbf{v}}_{i}^{\mathcal{B}}=\mathbf{v}_{i}^{\mathcal{B}}+\boldsymbol{\delta}_{i}^{\mathcal{B}}\tag{10}$$

These corrected virtual correspondences $\tilde{\mathbf{v}}_{i}^{\mathcal{A}}$ define the estimated goal location relative to object $\mathcal{B}$ for each point $\mathbf{p}_{i}\in\mathbf{P}_{\mathcal{A}}$ in object $\mathcal{A}$, and likewise for each point in object $\mathcal{B}$, as shown in Figure S.2.
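
A hedged PyTorch sketch of the residual head $g_{\mathcal{R}}$ and the correction in Equation 10; the MLP depth and widths here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

d = 512                                  # per-point embedding dimension (assumed)
g_R = nn.Sequential(                     # point-wise MLP: embedding -> 3D residual
    nn.Linear(d, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3),
)

N_A = 1024
Phi_A = torch.randn(N_A, d)              # per-point embeddings phi_i^A
V_A = torch.randn(N_A, 3)                # virtual corresponding points v_i^A

Delta_A = g_R(Phi_A)                     # residuals delta_i^A (one 3-vector per point)
V_A_tilde = V_A + Delta_A                # corrected virtual correspondences (Eq. 10)
```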

Figure S.2: Computation of Corrected Virtual Correspondence. Given a pair of object point clouds $\mathbf{P}_{\mathcal{A}},\mathbf{P}_{\mathcal{B}}$, a per-point soft correspondence $\mathbf{V}_{\mathcal{A}}$ is first computed. Next, to allow the predicted correspondence to lie beyond the object's convex hull, these soft correspondences are adjusted with correspondence residuals, $\boldsymbol{\Delta}_{\mathcal{A}}$, which results in the corrected virtual correspondence, $\tilde{\mathbf{V}}_{\mathcal{A}}$. The color and size of each point on the rack represent the value of the attention weights: redder, larger points have higher attention weights, while grayer, smaller points have lower attention weights.

E.2 Learned Importance Weights

The learned importance weights, $\alpha_{\mathcal{A}}$ and $\alpha_{\mathcal{B}}$, for the mug and rack are visualized by both color and point size in Figure S.3.

Figure S.3: Learned Importance Weights for Weighted SVD on Mug and Rack. The color and size of each point on both objects represent the value of the learned importance weights: yellower, larger points have higher importance weights, while more purple, smaller points have lower importance weights.

Appendix F Additional NDF Task Experiments

F.1 Further Ablations on Mug Hanging Task

In order to examine the effects of different design choices in the training pipeline, we conduct ablation experiments with final task success (grasp, place, overall) as evaluation metrics on the Mug Hanging task with upright pose initialization, for the following components of our method; see Table LABEL:tab:mug_rack_ablation_full for full ablation results. For consistency, all ablated models are trained to 15K batch steps.

1. Loss. In the full pipeline reported, we use a weighted sum of the three types of losses described in Section 4.2 of the paper. Specifically, the loss $\mathcal{L}_{\mathrm{net}}$ is given by

    $$\mathcal{L}_{\mathrm{net}}=\mathcal{L}_{\mathrm{disp}}+\lambda_{1}\mathcal{L}_{\mathrm{cons}}+\lambda_{2}\mathcal{L}_{\mathrm{corr}}\tag{11}$$

    where we chose $\lambda_{1}=0.1$ and $\lambda_{2}=1$ after a hyperparameter search.

    We ablate the usage of all three loss types by reporting the final task performance in simulation; specifically, we report task success for the following $\mathcal{L}_{\mathrm{net}}$ variants:

    (a) Remove the point displacement loss term, $\mathcal{L}_{\mathrm{disp}}$, after which we are left with

        $\mathcal{L}'_{\mathrm{net}}=0.1\,\mathcal{L}_{\mathrm{cons}}+\mathcal{L}_{\mathrm{corr}}$

    (b) Remove the direct correspondence loss term, $\mathcal{L}_{\mathrm{corr}}$, after which we are left with

        $\mathcal{L}'_{\mathrm{net}}=\mathcal{L}_{\mathrm{disp}}+0.1\,\mathcal{L}_{\mathrm{cons}}$

    (c) Remove the correspondence consistency loss term, $\mathcal{L}_{\mathrm{cons}}$, after which we are left with

        $\mathcal{L}'_{\mathrm{net}}=\mathcal{L}_{\mathrm{disp}}+\mathcal{L}_{\mathrm{corr}}$

    (d) From testing the loss variants above, we found that the point displacement loss is a vital contributing factor for task success: removing this loss term results in no overall task success, as shown in Table LABEL:tab:mug_rack_ablation_full. However, in practice, we have found that adding the correspondence consistency loss and the direct correspondence loss generally helps to lower the rotational error of the predicted placement pose relative to the ground truth of the collected demos. To further investigate the effects of the combination of these two loss terms, we used a scaled weighted combination of $\mathcal{L}_{\mathrm{cons}}$ and $\mathcal{L}_{\mathrm{corr}}$, such that the former weight of the displacement loss term is transferred to the consistency loss term, with the new $\lambda_{1}=1.1$ and $\lambda_{2}=1$ unchanged. Note that this is different from variant (a) above, as the consistency loss is now given a weight comparable to that of the direct correspondence loss. This makes intuitive sense, since the consistency loss is a function of the predicted transform $\mathbf{T}_{\mathcal{A}\mathcal{B}}$, while the direct correspondence loss is instead a function of the ground-truth transform $\mathbf{T}_{\mathcal{A}\mathcal{B}}^{GT}$, which provides less direct supervision of the predicted transform. Thus we are left with

        $\mathcal{L}'_{\mathrm{net}}=1.1\,\mathcal{L}_{\mathrm{cons}}+\mathcal{L}_{\mathrm{corr}}$
2. Usage of Correspondence Residuals. After predicting a per-point soft correspondence between objects $\mathcal{A}$ and $\mathcal{B}$, we adjust the location of each predicted corresponding point by further predicting a point-wise correspondence residual vector that displaces it. This allows the predicted corresponding points to be mapped to free space outside of the convex hulls of objects $\mathcal{A}$ and $\mathcal{B}$. This is a desirable adjustment for the mug hanging task, as the desired cross-pose usually requires points on the mug handle to be placed near, but not in contact with, the mug rack, which can be outside of the convex hull of the rack points. We ablate correspondence residuals by directly using the soft correspondence predictions to find the cross-pose transform through weighted SVD, without any adjustment via correspondence residuals.

3. Weighted SVD vs. Non-weighted SVD. We use weighted SVD as described in Section 4.1 of the paper, with predicted per-point weights signifying the importance of each correspondence. We ablate this by using an un-weighted SVD, where instead of the predicted weights, each correspondence is assigned an equal weight of $\frac{1}{N}$, where $N$ is the number of points in the point cloud $\mathbf{P}$ used.

4. Pretraining. In our full pipeline, we pretrain the point cloud embedding networks such that they are $SE(3)$ invariant. Specifically, the mug-specific embedding network is pretrained on 200 ShapeNet mug objects, while the rack-specific and gripper-specific embedding networks are trained on the same rack and Franka gripper used at test time, respectively. We conduct ablation experiments where

    (a) we omit the pretraining phase of the embedding networks, and

    (b) we do not finetune the embedding networks during downstream training with task-specific demonstrations.

    Note that in practice, we find that pretraining speeds up downstream training by about a factor of 3, while models with and without pretraining reach similar final task success after convergence.

5. Usage of Transformer as Cross-object Attention Module. In the full pipeline, we use a Transformer as the cross-object attention module. We ablate this design choice by replacing the Transformer architecture with a simple 3-layer MLP with ReLU activations and a hidden dimension of 256, and find that this leads to worse place and grasp success.

6. Dimension of Embedding. In the full pipeline, the embedding dimension is 512. We conduct an experiment with a much lower dimension of 16 and find that the place success rate drops substantially, from 0.97 to 0.59.

F.2 Effects of Pretraining on Mug Hanging Task

We explore the effects of pretraining on final task performance as well as training convergence speed. We find that pretraining the point cloud embedding network as described in G.1 is a helpful but not necessary component of our training pipeline. Specifically, while pretraining reduces training time, allowing the model to reach similar task performance and train rotation/translation error with far fewer training steps, it is not necessary if training time is not a concern. In fact, as seen in Table LABEL:tab:no_pretraining_longer, for the mug hanging task, models trained from scratch without our pretraining are able to reach a similar level of task performance: 0.99 grasp, 0.92 place, and 0.92 overall success rate. They also achieve a similar train rotation error of $4.91^{\circ}$ and translation error of $0.01$ m, compared to the models with pretraining. However, without pretraining, the model needs to be trained for about 5 times longer (26K steps compared to 5K steps) to reach this level of performance. We therefore adopt object-level pretraining in our overall pipeline to reduce training time.

Another benefit of pretraining is that the pretraining for each object category is done in a task-agnostic way, so the network can be more quickly adapted to new tasks after the pretraining is performed. For example, we use the same pre-trained mug embeddings for both the gripper-mug cross-pose estimation for grasping as well as the mug-rack cross-pose estimation for mug hanging.

F.3 Additional Simulation Experiments on Bowl and Bottle Placement Task

Additional results on Grasp, Place, and Overall success rates in simulation for Bowl and Bottle are shown in Table LABEL:tab:bottle_bowl. For the bottle and bowl experiments, we follow the same experimental setup as in [simeonov2021neural]: a grasp is successful if a stable grasp of the object is obtained, and a place is successful when the bottle or bowl is stably placed upright on the elevated flat slab over the table without falling onto the table. Reported task success results are for both Upright Pose and Arbitrary Pose, run over 100 trials each.

Appendix G Additional Training Details

G.1 Pretraining

We utilize pretraining for the embedding network for the mug hanging task, and describe the details below.

We pretrain an embedding network for each object category (mug, rack, gripper), such that the embedding network is $SE(3)$ invariant with respect to point clouds of that specific object category. Specifically, the mug-specific embedding network is pretrained on 200 ShapeNet [chang2015shapenet] mug instances, while the rack-specific and gripper-specific embedding networks are trained on the same rack and Franka gripper used at test time, respectively. Note that before our pretraining, the network is randomly initialized with the Kaiming initialization scheme [he2015delving]; we do not adopt any third-party pretrained models.

For the network to be trained to be $SE(3)$ invariant, we pre-train with the InfoNCE loss [oord2018representation] with a geometric distance weighting and random $SE(3)$ transformations. Specifically, given a point cloud of an object instance, $\mathbf{P}_{\mathcal{A}}$, of a specific object category $\mathcal{A}$, and an embedding network $g_{\mathcal{A}}$, we define the point-wise embedding for $\mathbf{P}_{\mathcal{A}}$ as $\Phi_{\mathcal{A}}=g_{\mathcal{A}}(\mathbf{P}_{\mathcal{A}})$, where $\phi_{i}^{\mathcal{A}}\in\Phi_{\mathcal{A}}$ is a $d$-dimensional vector for each point $p_{i}^{\mathcal{A}}\in\mathbf{P}_{\mathcal{A}}$. Given a random $SE(3)$ transformation, $\mathbf{T}$, we define $\Psi_{\mathcal{A}}=g_{\mathcal{A}}(\mathbf{T}\mathbf{P}_{\mathcal{A}})$, where $\psi_{i}^{\mathcal{A}}\in\Psi_{\mathcal{A}}$ is the $d$-dimensional vector for the $i$th point $p_{i}^{\mathcal{A}}\in\mathbf{P}_{\mathcal{A}}$.

The weighted contrastive loss used for pretraining, $\mathcal{L}_{wc}$, is defined as

\begin{align*}
\mathcal{L}_{wc} &:= -\sum_{i}\log\left[\frac{\exp(\phi_{i}\cdot\psi_{i})}{\sum_{j}\exp\left(d_{ij}\,(\phi_{i}\cdot\psi_{j})\right)}\right] \\
d_{ij} &:= \begin{cases}\frac{1}{\mu}\tanh\left(\lambda\|p_{i}^{\mathcal{A}}-p_{j}^{\mathcal{A}}\|_{2}\right), & \text{if } i\neq j\\ 1, & \text{otherwise}\end{cases} \\
\mu &:= \max_{i,j}\left(\tanh\left(\lambda\|p_{i}^{\mathcal{A}}-p_{j}^{\mathcal{A}}\|_{2}\right)\right)
\end{align*}

For this pretraining, we use $\lambda:=10$.
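
A minimal PyTorch sketch of this weighted contrastive loss, written for a single (unbatched) point cloud and intended only to mirror the definition above, not the pretraining code itself:

```python
import torch

def weighted_infonce(phi, psi, P, lam=10.0):
    """phi, psi: (N, d) embeddings of a cloud and its SE(3)-transformed copy;
    P: (N, 3) points of the original cloud."""
    sim = phi @ psi.T                                # (N, N) embedding similarities
    w = torch.tanh(lam * torch.cdist(P, P))          # geometric weighting tanh(lambda * dist)
    d_ij = w / w.max()                               # divide by mu = max over pairs
    d_ij.fill_diagonal_(1.0)                         # d_ii = 1
    log_prob = sim.diagonal() - torch.logsumexp(d_ij * sim, dim=1)
    return -log_prob.sum()
```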

Appendix H PartNet-Mobility Objects Placement Task Details

In this section, we describe the PartNet-Mobility Objects Placement experiments in detail.

H.1 Dataset Preparation

Simulation Setup. We leverage the PartNet-Mobility dataset [Xiang2020-oz] to find common household objects to serve as the anchor object for TAX-Pose prediction. The selected subset of the dataset contains 8 categories of objects. We split the objects into 54 seen and 14 unseen instances. During training, for a specific task of each of the seen objects, we generate an action-anchor object pair by randomly sampling transformations from $SE(3)$ as cross-poses. The action object is chosen from the Ravens simulator's rigid body objects dataset [Zeng2020-tk]. We define a subset of four tasks (“In”, “On”, “Left” and “Right”) for each selected anchor object. Thus, there exists a ground-truth cross-pose (defined manually by a human) associated with each defined task. We then use the ground-truth TAX-Poses to supervise each task's TAX-Pose prediction model. For each action-anchor object pair, we sample 100 such configurations using the aforementioned procedure for the training and testing datasets.
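
For illustration, a hedged sketch of sampling a random $SE(3)$ cross-pose for data generation; the translation range below is an arbitrary placeholder, not the value used to build the dataset.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def random_se3(trans_range=0.5, rng=None):
    """Sample a 4x4 transform with a uniform random rotation and a bounded translation."""
    rng = rng if rng is not None else np.random.default_rng()
    T = np.eye(4)
    T[:3, :3] = Rotation.random(random_state=rng).as_matrix()   # uniform random rotation
    T[:3, 3] = rng.uniform(-trans_range, trans_range, size=3)   # random translation
    return T
```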

Real-World Setup. In the real world, we select a set of anchor objects (Drawer, Fridge, and Oven) and a set of action objects (Block and Bowl). We test 3 TAX-Pose models (“In”, “On”, and “Left”) in the real world without retraining or finetuning. The point is to show the method's capability of generalizing to unseen real-world objects.

H.2 Metrics

Simulation Metrics. In simulation, with access to the object's ground-truth pose, we are able to quantitatively calculate the translational and rotational errors of the TAX-Pose prediction models. Thus, we report the following metrics on a held-out set of anchor objects in simulation:

Translational Error: the L2 distance between the inferred cross-pose translation ($\mathbf{t}_{\mathcal{A}\mathcal{B}}^{\mathrm{pred}}$) and the ground-truth pose translation ($\mathbf{t}_{\mathcal{A}\mathcal{B}}^{\mathrm{GT}}$). Rotational Error: the geodesic $SO(3)$ distance [huynh2009metrics, hartley2013rotation] between the predicted cross-pose rotation ($\mathbf{R}_{\mathcal{A}\mathcal{B}}^{\mathrm{pred}}$) and the ground-truth rotation ($\mathbf{R}_{\mathcal{A}\mathcal{B}}^{\mathrm{GT}}$):

$$\mathcal{E}_{\mathbf{t}}=\|\mathbf{t}_{\mathcal{A}\mathcal{B}}^{\mathrm{pred}}-\mathbf{t}_{\mathcal{A}\mathcal{B}}^{\mathrm{GT}}\|_{2},\qquad\mathcal{E}_{\mathbf{R}}=\frac{1}{2}\arccos\left(\frac{\mathrm{tr}(\mathbf{R}_{\mathcal{A}\mathcal{B}}^{\mathrm{pred}\top}\mathbf{R}_{\mathcal{A}\mathcal{B}}^{\mathrm{GT}})-1}{2}\right)$$
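
A minimal NumPy sketch of the two metrics as defined above (the clipping is only a numerical guard):

```python
import numpy as np

def translational_error(t_pred, t_gt):
    return np.linalg.norm(t_pred - t_gt)

def rotational_error(R_pred, R_gt):
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return 0.5 * np.arccos(np.clip(cos, -1.0, 1.0))
```
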
(a) Failure of “In” prediction. The predicted TAX-Pose violates physical constraints by penetrating too far into the oven base.
(b) Failure of “Left” prediction. The predicted TAX-Pose violates physical constraints by colliding with the leg of the drawer.
Figure S.4: An illustration of unsuccessful real-world TAX-Pose predictions. In both subfigures, red points represent the anchor object, blue points represent the action object's starting pose, and green points represent the action object's predicted pose.

Real-World Metrics. In the real world, due to the difficulty of defining a ground-truth TAX-Pose, we instead manually and qualitatively define goal “regions” for each anchor-action pair. The goal region should have the following properties:


  • The predicted TAX-Pose of the action object should appear visually correct. For example, if the specified task is “In”, then the action object should be indeed contained within the anchor object after being transformed by predicted TAX-Pose.

  • The predicted TAX-Pose of the action object should not violate physical constraints of the workspace and of the relation between the action and anchor objects. Specifically, the action object should not interfere with/collide with the anchor object after being transformed by the predicted TAX-Pose. See Figure S.4 for an illustration of TAX-Pose predictions that fail to meet this criterion.

H.3 Motion Planning

In both simulated and real-world experiments, we use off-the-shelf motion-planning tools to find a path between the starting pose and goal pose of the action object.

Simulation. To actuate the action object from its starting pose $\mathbf{T}_{0}$ to its goal pose transformed by the predicted TAX-Pose, $\hat{\mathbf{T}}_{\mathcal{A}\mathcal{B}}\mathbf{T}_{0}$, we plan a collision-free path. Learning-based methods such as [danielczuk2021object] handle collision checking on point clouds by training a collision classifier. A more data-efficient method is to leverage computer graphics techniques, reconstructing meshes from the point clouds via marching cubes [lorensen1987marching]. Once the triangular meshes are reconstructed, we can deploy off-the-shelf collision checking methods such as FCL [pan2012fcl] to detect collisions along the planned path. Thus, in our case, we use position control to plan a trajectory of the action object $\mathcal{A}$ from its starting pose to the predicted goal pose. We use OMPL [sucan2012open] as the motion planning tool, and the constraint function passed to the motion planner comes from the output of FCL after converting the point clouds to meshes via marching cubes.

Figure S.5: Real-world experiments illustration. Left: work-space setup for physical experiments. Center: Octomap visualization of the perceived anchor object.

Real World. In real-world experiments, we need to resolve several practical issues to make the TAX-Pose prediction model viable. First, we do not have access to a mask that labels the action and anchor objects. We therefore define a mask by thresholding the $y$-coordinate, automatically detecting the discontinuity in $y$ that corresponds to the gap between the action and anchor objects upon placement. Next, grasping the action objects is a non-trivial task. Since we only use 2 action objects (a cube and a bowl), we manually define a grasping primitive for each action object. This is done by hand-picking an offset from the centroid of the action object before grasping, and an approach direction after the robot reaches the pre-grasp pose to make contact with the object of interest. The offsets are chosen via kinesthetic teaching on the robot when the action object is under the identity rotation (canonical pose). Finally, we need to estimate the action object's starting pose for motion planning. This is done by first statistically cleaning the point cloud [EisnerZhang2022FLOW] of the action object, and then calculating the centroid of the action object point cloud as the starting position. For the starting rotation, we make sure the range of rotations is not too large for the pre-defined grasping primitive to handle. Another implementation choice here is to use ICP [besl1992method] to calculate a transformation between the current point cloud and a pre-scanned point cloud in the canonical (identity) pose. We use the estimated starting pose to guide the pre-defined grasp primitive. Once a successful grasp is made, the robot end-effector is rigidly attached to the action object, and we can then use the same predicted TAX-Pose to calculate the end pose of the robot end effector; we feed the two poses into MoveIt! to get a full trajectory in joint space. Note that the collision function in motion planning is comprised of two parts: the workspace and the anchor object. That is, we first reconstruct the workspace using boxes to avoid collision with the table top and camera mount, and we then reconstruct the anchor object in RViz as an Octomap [hornung2013octomap] from the cleaned anchor object point cloud. In this way, the robot is able to avoid collision with the anchor object as well. See Figure S.5 for the workspace.
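
As a rough sketch of the $y$-discontinuity mask and the centroid-based start-position estimate described above (which side of the gap holds the action object is an assumption here, and the real system may instead use a hand-tuned threshold):

```python
import numpy as np

def split_action_anchor(points):
    """points: (N, 3) merged cloud; split at the largest gap in sorted y-coordinates."""
    y = np.sort(points[:, 1])
    i = np.argmax(np.diff(y))                   # largest discontinuity in y
    threshold = 0.5 * (y[i] + y[i + 1])
    action_mask = points[:, 1] < threshold      # assume the action object is on the low-y side
    return points[action_mask], points[~action_mask]

def estimate_start_position(action_points):
    return action_points.mean(axis=0)           # centroid as the starting position
```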

H.4 Goal-Conditioned Variant

We train two variants of our model, one goal-conditioned variant (TAX-Pose GC), and one task-specific variant (TAX-Pose): the only difference being that the TAX-Pose GC variant receives an encoding of the desired semantic goal position (‘top’, ‘left’, …) for the task. The goal-conditioned variant is trained across all semantic goal positions, whereas the task-specific variant is trained separately on each semantic goal category (for a total of 4 models). Importantly, both variants are trained across all PartNet-Mobility object categories. We report the performance of the variants in Table LABEL:tab:gi-Rt.

H.5 Baselines Description

In simulation, we compare our method to a variety of baseline methods.

E2E Behavioral Cloning: We generate motion-planned trajectories using OMPL that take the action object from start to goal. These serve as “expert” trajectories for Behavioral Cloning (BC). We then use a PointNet++ network to output a sparse policy that, at each time step, takes as input the point cloud observation of the action and anchor objects and outputs an incremental 6-DoF transformation that imitates the expert trajectory. The 6-DoF transformation is expressed using a Euclidean $xyz$ translation and a rotation quaternion. The “prediction” is the final achieved pose of the action object at the terminal state.

E2E DAgger: Using the same BC dataset and the same PointNet++ architecture as above, we train a sparse policy that outputs the same transformation representation as in BC using DAgger [ross2011reduction]. The “prediction” is the final achieved pose of the action object at the terminal state.

Trajectory Flow: Using the same BC dataset with DAgger, we train a dense policy using PointNet++ to predict a dense per-point 3D flow vector at each time step instead of a single incremental 6-DOF transformation. Given this dense per-point flow, we add the per-point flow to each point of the current time-step’s point cloud, and we are able to extract a rigid transformation between the current point cloud and the point cloud transformed by adding per-point flow vectors using SVD, yielding the next pose. The “prediction” is the final achieved pose of the action object at the terminal state.

Goal Flow: Instead of training a multi-step sparse/dense policy to reach the goal, we train a PointNet++ network to output a single dense flow prediction that assigns a per-point 3D flow vector pointing from each action object point at its starting pose directly to its corresponding goal location. Given this dense per-point flow, we add the per-point flow to each point of the start point cloud, and we extract a rigid transformation between the start point cloud and the flowed point cloud using SVD, yielding the goal pose. We pass the start and goal poses into a motion planner (OMPL) and execute the planned trajectory. The “prediction” is thus given by the SVD output.
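
A hedged sketch of the flow-to-pose extraction used by the Goal Flow baseline, reusing the (unweighted) Procrustes idea from Appendix B; this is an illustrative reconstruction, not the baseline's code.

```python
import numpy as np

def flow_to_goal_pose(P_start, flow):
    """P_start: (N, 3) action-object points; flow: (N, 3) predicted per-point flow."""
    P_goal = P_start + flow
    src = P_start - P_start.mean(0)
    tgt = P_goal - P_goal.mean(0)
    U, _, Vt = np.linalg.svd(src.T @ tgt)                 # 3x3 cross-covariance SVD
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = P_goal.mean(0) - R @ P_start.mean(0)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```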

Figure S.6: A visualization of all categories of anchor objects and associated semantic tasks, with action objects in ground-truth TAX-Poses used in simulation training.

H.6 Per-Task Results

In the main body of the paper, we have shown aggregate results for each method by averaging the quantitative metrics over all sub-tasks (“In”, “On”, “Left”, and “Right” in simulation and “In”, “On”, and “Left” in the real world). Here we show each sub-task's results in Table LABEL:tab:gi-Rt_goal0 (anchor categories from left to right: microwave, dishwasher, oven, fridge, table, washing machine, safe, and drawer), Table LABEL:tab:gi-Rt_goal1, Table LABEL:tab:gi-Rt_goal2, and Table LABEL:tab:gi-Rt_goal3, respectively.

As mentioned above, not all anchor objects have all 4 tasks, for practical reasons. For example, the doors of safes might occlude the action object completely, making it impossible to show the action object in the captured image for the “Left” and “Right” tasks (due to the handedness of the door); a table might be too tall for the camera to see the action object for the “Top” task. Under these circumstances, for the sake of simplicity and consistency, we define a subset of the 4 goals for each object such that anchor objects of the same category have consistent task definitions. We show a collection of visualizations of each task defined for each category in Figure S.6.

Moreover, we also show per-task success rate for real-world experiments in Table LABEL:tab:rw_each.

Appendix I Mug Hanging Task Details

In this section, we describe the Mug Hanging task and experiments in detail. The Mug Hanging task consists of two subtasks: grasp and place. A grasp is successful when the mug is grasped stably by the gripper, while a place is successful when the mug is hung stably on the hanger of the rack. Overall mug hanging success is achieved when the predicted transforms enable both grasp and place success in the same trial. See Figure S.7 for a detailed breakdown of the mug hanging task in stages.

Figure S.7: Visualization of the Mug Hanging Task (Upright Pose). The mug hanging task consists of two stages: given a mug that is randomly initialized on the table, the model first predicts an $SE(3)$ transform from the gripper end effector to the mug rim, $\mathbf{T}_{g\rightarrow m}$, and grasps the mug by the rim. Next, the model predicts another $SE(3)$ transform from the mug to the rack, $\mathbf{T}_{m\rightarrow r}$, such that the mug handle is hung on the mug rack.

I.1 Baseline Description

In simulation, we compare our method to the results described in  [simeonov2021neural].


  • Dense Object Nets (DON) [florence2018dense]: Using manually labeled semantic keypoints on the demonstration clouds, DON is used to compute sparse correspondences with the test objects. These correspondences are converted to a pose using SVD. A full description of usage of DON for the mug hanging task can be found in [simeonov2021neural].

  • Neural Descriptor Field (NDF) [simeonov2021neural]: Using the learned descriptor field for the mug, the positions of a constellation of task specific query points are optimized to best match the demonstration using gradient descent.

I.2 Training Data

To be directly comparable with these baselines, we use the exact same sets of demonstration data used to train the network in NDF [simeonov2021neural], where the data are generated via teleportation in PyBullet, collected on 10 mug instances with random pose initialization.

I.3 Training and Inference

Using the pretrained embedding networks for the mug and gripper, we train a grasping model to predict a transformation $\mathbf{T}_{g\rightarrow m}$, in the gripper's frame, from the gripper to the mug to complete the grasp stage of the task. Similarly, using the pretrained embedding networks for the rack and mug, we train a placement model to predict a transformation $\mathbf{T}_{m\rightarrow r}$, in the mug's frame, from the mug to the rack to complete the place stage of the task. Both models are trained with the same combined loss $\mathcal{L}_{\mathrm{net}}$ described in the main paper. During inference, we simply use the grasping model to predict $\mathbf{T}_{g\rightarrow m}$ and the placement model to predict $\mathbf{T}_{m\rightarrow r}$ at test time.

I.4 Motion Planning

After the model predicts the transformations $\mathbf{T}_{g\rightarrow m}$ and $\mathbf{T}_{m\rightarrow r}$, using the gripper's known world-frame pose, we calculate the desired gripper end-effector poses at grasping and placement, and pass these end-effector poses to IKFast to get the desired joint positions of the Franka at grasping and placement. Next, we pass the joint positions at the gripper's initial pose and the desired grasping joint positions to the OpenRAVE motion planning library to solve for a trajectory from the gripper's initial pose to the grasp pose, and then from the grasp pose to the placement pose for the gripper's end effector.

I.5 Real-World Experiments

We pre-train the DGCNN embedding network with a rotation-equivariant loss on simulated ShapeNet mug point clouds. Using the pre-trained embeddings, we then train the full TAX-Pose model on the 10 collected real-world point clouds.

I.6 Failure Cases

Some failure cases for TAX-Pose occur when the predicted gripper pose misses the rim of the mug due to an $xy$-plane translation error, resulting in grasp failure, as seen in Figure S.8(a). A common failure mode for the mug placement subtask is an erroneous transform prediction that results in the mug's handle completely missing the rack hanger, resulting in placement failure, as seen in Figure S.8(b).

(a) Failure of grasp prediction: the predicted TAX-Pose for the gripper misses the rim of the mug.
(b) Failure of place prediction: the predicted TAX-Pose for the mug results in the mug handle completely missing the rack hanger.
Figure S.8: An illustration of unsuccessful TAX-Pose predictions for mug hanging. In both subfigures, red points represent the anchor object, blue points represent the action object's starting pose, and green points represent the action object's predicted pose.