
Model Predictive Manipulation of Compliant Objects with Multi-Objective Optimizer and Adversarial Network for Occlusion Compensation

Jiaming Qi, Dongyu Li, Yufeng Gao, Peng Zhou, and David Navarro-Alarcon

This work is supported in part by the Research Grants Council (RGC) of Hong Kong under grants 14203917 and 15212721, in part by the Key-Area Research and Development Program of Guangdong Province under Grant 2020B090928001, and in part by the Jiangsu Industrial Technology Research Institute Collaborative Funding Scheme under grant 43-ZG9V.
J. Qi and Y. Gao are with the Harbin Institute of Technology, Department of Control Science and Engineering, Harbin, Heilongjiang, China (e-mail: [email protected], [email protected]).
D. Li is with Beihang University, School of Cyber Science and Technology, Beijing, China (e-mail: [email protected]).
P. Zhou and D. Navarro-Alarcon are with The Hong Kong Polytechnic University, Department of Mechanical Engineering, Kowloon, Hong Kong (e-mail: [email protected], [email protected]).
Abstract

The robotic manipulation of compliant objects is currently one of the most active problems in robotics due to its potential to automate many important applications. Despite the progress achieved by the robotics community in recent years, the 3D shaping of these types of materials remains an open research problem. In this paper, we propose a new vision-based controller to automatically regulate the shape of compliant objects with robotic arms. Our method uses an efficient online surface/curve fitting algorithm that quantifies the object's geometry with a compact vector of features; This feedback-like vector enables the establishment of an explicit shape servo-loop. To coordinate the motion of the robot with the computed shape features, we propose a receding-time estimator that approximates the system's sensorimotor model while satisfying various performance criteria. A deep adversarial network is developed to robustly compensate for visual occlusions in the camera's field of view, which makes it possible to guide the shaping task even with partial observations of the object. Model predictive control is utilized to compute the robot's shaping motions subject to workspace and saturation constraints. A detailed experimental study is presented to validate the effectiveness of the proposed control framework.

Index Terms:
Robotics; Visual Servoing; Deformable Objects; Occlusion Compensation; Model Predictive Control

I Introduction

The manipulation/shaping of deformable bodies by robots is a fundamental problem that has recently attracted the attention of many researchers [1]; Its complexity has forced researchers to develop new methods in a wide range of fundamental areas that include representation, learning, planning, and control. From an applied research perspective, this challenging problem has shown great potential in various economically-important tasks such as the assembly of compliant/delicate objects [2], surgical/medical robotics [3], cloth/fabric folding [4], etc. The manipulation of compliant materials contrasts with its rigid-body counterpart in that physical interactions invariably change the object's shape, which introduces additional degrees of freedom to the typically unknown objects, and hence, complicates the manipulation task. While great progress has been achieved in recent years, the development of these types of embodied manipulation capabilities is still largely considered an open research problem in robotics and control.

There are various technical issues that hamper the implementation of these tasks in real-world unstructured environments, which largely differ from implementations in ideal simulation environments (a trend followed by many recent works). Here, we argue that to effectively servo-control the shape of deformable materials in the field, a sensor-based controller must possess the following features: 1) Efficient compression of the object's high-dimensional shape; 2) Occlusion-tolerant estimation of the object's geometry; 3) Adaptive prediction of the differential shape-motion model. Our aim in this paper is precisely to develop a new shape controller endowed with all the above-mentioned features. The proposed method is formulated under the model predictive control (MPC) framework, which enables the computation of shaping actions that satisfy multiple performance criteria (a key property for real engineering applications).

I-A Related Work

Many researchers have previously studied this challenging problem (we refer the reader to [5, 6] for comprehensive reviews). To servo-control the object's non-rigid shape, it is essential to design a low-dimensional feature representation that can capture the key geometric properties of the object. Several representation methods have been proposed before, e.g., geometric features based on points, angles, curvatures, etc. [7, 8]; However, due to their hard-coded nature, these methods can only be used to represent a single shaping action. Other geometric features computed from contours and centerlines [9, 10, 11] can represent soft object deformation in a more general way. Various data-driven approaches have also been proposed to represent shapes, e.g., using fast point feature histograms [12], bottleneck layers [13, 14], principal component analysis [15], etc. However, there is no widely accepted approach to compute efficient/compact feature representations for 3D shapes; This is still an open research problem.

Occlusions of a camera's field of view pose many complications to the implementation of visual servoing controllers, as the computation of (standard) feedback features requires complete visual observations of the object at all times. Many methods have been developed to tackle this critical issue, e.g., [16] used the estimated interaction matrix to handle information loss in visual servoing; yet, this method requires a calibrated visual-motor model. Coherent point drift was utilized in [17] to register the topological structure from previous sequences and to predict occlusions; However, this method is sensitive to the initial point set, which further affects the registration result. A structure-preserved registration method was presented in [18] to track occluded objects; This approach has good accuracy and robustness to noise, yet its efficiency decreases with the number of points. To efficiently implement vision-based strategies in the field, it is essential to develop algorithms that can robustly guide the servoing task, even in the presence of occlusions.

To visually guide the manipulation task, control methods must have some form of model that (at least approximately) describes how the input robot motions produce output shape changes [19]; In the visual servoing community, such a differential relation is typically captured by the so-called interaction (Jacobian) matrix [20]. Many methods have been proposed to address this issue, e.g., the Broyden update rule [21] is a classical algorithm to iteratively estimate this transformation matrix [7, 22, 23]. Although these types of algorithms do not require knowledge of the model's structure, their estimation properties are only valid locally. Other approaches with global estimation properties include algorithms based on (deep) artificial neural networks [2], optimization-based algorithms [24], adaptive estimators [25], etc. However, the majority of existing methods estimate this model based on a single performance criterion (typically, a first-order Jacobian-like relation), which limits the types of dynamic responses that the robot can achieve during a task. A multi-objective model estimation is particularly important in the manipulation of compliant materials, as their mechanical properties are rarely known in practice, thus making it hard to meet various performance requirements.

To compute the active shaping motions, most control methods only formulate the problem in terms of the final target shape and do not typically consider the system’s physical constraints. MPC represents a feasible solution to these issues, as it performs the control tasks by optimizing cost functions over a finite time-horizon rather than finding the exact analytical solution [26]; This allows MPC to compute controls that guide the task while satisfying a set of constraints, e.g., control saturation, workspace bounds, etc. Despite its valuable and flexible properties, MPC has not been sufficiently studied in the context of shape control of deformable objects.

I-B Our Contribution

To solve the above-mentioned issues, in this paper we propose a new control framework to manipulate purely-elastic objects into desired configurations. The original features of our new methodology include: 1) A parametric shape descriptor to efficiently characterize 3D deformations based on online curve/surface fitting; 2) A robust shape prediction network based on adversarial neural networks to compensate for visual occlusions; 3) An optimization-based estimator to approximate the deformation Jacobian matrix while satisfying various performance constraints; 4) An MPC-based motion controller to guide the shaping motions while simultaneously satisfying workspace and saturation constraints.

To the best of the authors' knowledge, this is the first time that a shape servoing controller has been developed with all the functions proposed in this paper. To validate the effectiveness of our new methodology, we report a detailed experimental study with a robotic platform manipulating various types of compliant objects.

Figure 1: Schematic diagram of the manipulation of elastic objects, including deformation (centerline, contour, and surface) and positioning tasks of rigid objects. The framework aims to command the robot to manipulate elastic objects into the target configurations.
Figure 2: Various shape configurations, including the centerline, contour, and surface. The first row shows the pixel coordinates for each shape. The second row shows the Cartesian coordinates related to the corresponding pixel ones through OpenCV/RealSense. The generated shapes are ordered, fixed-sampled, and equidistant. The rigid object adopts the surface representation.

II Problem Formulation

Notation: In this paper, we use the following frequently-used notation: Bold small letters, e.g., $\mathbf{v}$, denote column vectors, while bold capital letters, e.g., $\mathbf{M}$, denote matrices. Time-evolving variables are denoted as $\mathbf{x}_k$, for $k$ as the discrete time instant. The $n\times m$ matrix of ones is denoted by $\mathbf{I}_{n\times m}$ and the identity matrix by $\mathbf{E}_n$. $\mathbf{L}_n$ represents the lower-triangular part of $\mathbf{I}_{n\times n}$, and $\otimes$ represents the Kronecker product.

The schematic diagram of the proposed shape servoing framework is conceptually illustrated in Fig. 1. A depth camera with eye-to-hand configuration observes the shapes of elastic objects that are manipulated by the robot, see Fig. 2. We denote the 3D measurement points captured by the vision system as:

$\bar{\mathbf{c}}=\left[\mathbf{c}_1^\intercal,\ldots,\mathbf{c}_N^\intercal\right]^\intercal\in\mathbb{R}^{3N},\qquad \mathbf{c}_i=\left[x_i,y_i,z_i\right]^\intercal\in\mathbb{R}^3$ (1)

for $N$ as the number of points, and $\mathbf{c}_i$ as the 3D coordinates of the $i$th point, expressed in the camera frame.

II-A Feature Sensorimotor Model

Let us denote the position of the robot's end-effector by $\mathbf{r}\in\mathbb{R}^3$. As the dimension $3N$ of the observed points $\bar{\mathbf{c}}$ is typically very large, their direct use as a feedback signal for shape servo-control is impractical. Therefore, an efficient controller typically requires some form of dimension reduction technique. To deal with this issue, we construct a feature vector $\mathbf{s}=\mathbf{f}_s(\bar{\mathbf{c}}):\mathbb{R}^{3N}\mapsto\mathbb{R}^p$, for $p\ll 3N$, to represent the geometric feedback $\bar{\mathbf{c}}$ of the object. This compact feedback-like signal will be used to design our automatic 3D shape controller.

For purely elastic objects undergoing deformations, it is reasonable to assume that the object's configuration depends only on its potential energy $\mathcal{P}$ (thus, all inertial and viscous effects are neglected from our analysis [10]). We model $\mathcal{P}$ as fully determined by the feedback feature vector $\mathbf{s}$ and the robot's position $\mathbf{r}$ [27], i.e., $\mathcal{P}=\mathcal{P}(\mathbf{s},\mathbf{r})$. In steady state, the extremum expression satisfies $(\partial\mathcal{P}/\partial\mathbf{s})^\intercal=\mathbf{a}(\mathbf{s},\mathbf{r})=\mathbf{0}$ [28]. We then define the following matrices:

$\frac{\partial^2\mathcal{P}}{\partial\mathbf{s}^2}=\mathbf{G}(\mathbf{s},\mathbf{r}),\qquad \frac{\partial^2\mathcal{P}}{\partial\mathbf{r}\,\partial\mathbf{s}}=\mathbf{K}(\mathbf{s},\mathbf{r})$ (2)

which are useful to linearize the extremum equation (by using first-order Taylor’s series expansion) as follows:

$\mathbf{a}(\mathbf{s}+\Delta\mathbf{s},\mathbf{r}+\Delta\mathbf{r})\approx\mathbf{a}(\mathbf{s},\mathbf{r})+\mathbf{G}(\mathbf{s},\mathbf{r})\Delta\mathbf{s}+\mathbf{K}(\mathbf{s},\mathbf{r})\Delta\mathbf{r}$ (3)

for $\Delta\mathbf{s}$ and $\Delta\mathbf{r}$ as small changes. Note that as $\mathbf{a}(\mathbf{s}+\Delta\mathbf{s},\mathbf{r}+\Delta\mathbf{r})=\mathbf{a}(\mathbf{s},\mathbf{r})=\mathbf{0}$ is satisfied, we can obtain the following motion model:

$\Delta\mathbf{s}\approx-\mathbf{G}(\mathbf{s},\mathbf{r})^{-1}\mathbf{K}(\mathbf{s},\mathbf{r})\Delta\mathbf{r}=\mathbf{J}(\mathbf{s},\mathbf{r})\mathbf{u}$ (4)

where $\mathbf{J}(\mathbf{s},\mathbf{r})=-\mathbf{G}(\mathbf{s},\mathbf{r})^{-1}\mathbf{K}(\mathbf{s},\mathbf{r})\in\mathbb{R}^{p\times 3}$ represents the deformation Jacobian matrix (which depends on both the feature vector and the robot position), and $\mathbf{u}$ represents the robot's motion control input. This model can be expressed in an intuitive discrete-time form:

$\Delta\mathbf{s}_{k+1}=\mathbf{J}_k(\mathbf{s}_k,\mathbf{r}_k)\mathbf{u}_k$ (5)

The deformation Jacobian matrix (DJM) $\mathbf{J}_k$ indicates how the robot's action $\mathbf{u}_k=\mathbf{r}_k-\mathbf{r}_{k-1}$ produces changes in the feedback features $\Delta\mathbf{s}_{k+1}=\mathbf{s}_{k+1}-\mathbf{s}_k$. Clearly, the analytical computation of $\mathbf{J}_k$ requires knowledge of the physical properties and models of the elastic object and the vision system, which are difficult to obtain in practice. Thus, numerical methods are often used to approximate this matrix in real time, which enables vision-guided manipulation tasks to be performed.

In this paper, we consider a robot manipulator whose control inputs represent velocity commands (here, modelled as the differential changes $\Delta\mathbf{r}$). It is assumed that $\Delta\mathbf{r}$ can be instantaneously executed without delay [11].

Problem statement. Design a vision-based control method to automatically manipulate a compliant object into a target 3D configuration, while simultaneously compensating for visual occlusions of the camera and estimating the deformation Jacobian matrix of the object-robot system.

III Shape Representation

This section presents how online curve/surface fitting (least-squares minimization (LSM) [29] and moving least squares (MLS) [30]) is combined with a parametric shape descriptor to compute a compact vector of feedback shape features.

III-A LSM-Based Features

III-A1 Centerline and Contour Extraction

Both are expressed as a parametric curve dependent on the normalized arc-length $0\leq\rho\leq 1$. A point can then be represented as $\mathbf{c}_i=\mathbf{f}(\rho_i)$, with $\rho_i$ as the normalized arc-length between the start point $\mathbf{c}_1$ and $\mathbf{c}_i$, where $\rho_1=0$ and $\rho_N=1$. The fitting function $\mathbf{f}(\cdot)$ is constructed as follows:

$\mathbf{f}(\rho)=\sum_{j=0}^{n}\mathbf{p}_j B_{j,n}(\rho)$ (6)

where the vector $\mathbf{p}_j\in\mathbb{R}^3$ denotes the shape weights, $n\in\mathbb{N}^*$ specifies the fitting order, and the scalar $B_{j,n}(\rho)$ represents a parametric regression function, which may take various forms, such as [31]:

  • Polynomial parameterization [32]:

    $B_{j,n}(\rho)=\rho^j$ (7)

  • Bernstein parameterization [33]:

    $B_{j,n}(\rho)=C_n^j(1-\rho)^{n-j}\rho^j$ (8)

    where $C_n^j$ represents the binomial coefficient.

  • Cox-deBoor parameterization [34]:

    $B_{j,n}(\rho)=\frac{1}{n!}\sum_{l=0}^{n-j}\left((-1)^l C_{n+1}^l(\rho+n-j-l)^n\right)$ (9)

  • Trigonometric parameterization [35]:

    $B_{j,n}(\rho)=\begin{cases}1,& j=0\\ \cos(\frac{j+1}{2}\rho),& j>0,\ j\ \text{is odd}\\ \sin(\frac{j}{2}\rho),& j>0,\ j\ \text{is even}\end{cases}$ (13)

From (1) and (6), we can compute the following fitting cost function:

$Q=\left(\mathbf{B}\mathbf{s}-\bar{\mathbf{c}}\right)^\intercal\left(\mathbf{B}\mathbf{s}-\bar{\mathbf{c}}\right)$ (14)

for a “tall” regression-like matrix $\mathbf{B}$ constructed as:

$\mathbf{B}=[\mathbf{B}_1^\intercal,\ldots,\mathbf{B}_N^\intercal]^\intercal\in\mathbb{R}^{3N\times 3(n+1)},\qquad \mathbf{B}_i=[B_{0,n}(\rho_i),\ldots,B_{n,n}(\rho_i)]\otimes\mathbf{E}_3\in\mathbb{R}^{3\times 3(n+1)}$ (15)

and $\mathbf{s}=[\mathbf{p}_0^\intercal,\ldots,\mathbf{p}_n^\intercal]^\intercal\in\mathbb{R}^{3(n+1)}$ as a compact feedback feature vector that represents the shape. We seek to minimize (14) to obtain a feature vector $\mathbf{s}$ that closely approximates $\bar{\mathbf{c}}\approx\mathbf{B}\mathbf{s}$. The solution to (14) is:

$\mathbf{s}=\left(\mathbf{B}^\intercal\mathbf{B}\right)^{-1}\mathbf{B}^\intercal\bar{\mathbf{c}}$ (16)

where it is assumed that $N\gg n+1$. (In the following sections, $Q$ is used to generically represent different, albeit related, functions.)
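As a concrete illustration, the following Python/NumPy sketch shows one possible implementation of the LSM feature extraction (6)-(16) with the Bernstein basis (8); the function names and the example fitting order are our own illustrative choices, not part of the reported implementation.

import numpy as np
from scipy.special import comb

def bernstein_basis(rho, n):
    # Row vector [B_{0,n}(rho), ..., B_{n,n}(rho)] of Bernstein polynomials, eq. (8)
    j = np.arange(n + 1)
    return comb(n, j) * (1.0 - rho) ** (n - j) * rho ** j

def lsm_curve_features(points, n=5):
    # points: (N, 3) array of ordered, equidistant samples c_1, ..., c_N
    # Returns the compact feature vector s of dimension 3*(n+1), eq. (16)
    N = points.shape[0]
    rho = np.linspace(0.0, 1.0, N)                     # normalized arc-length
    # Stack B_i = basis(rho_i) (x) E_3 into the tall regression matrix B, eq. (15)
    B = np.vstack([np.kron(bernstein_basis(r, n), np.eye(3)) for r in rho])
    c_bar = points.reshape(-1)                         # c_bar = [c_1^T, ..., c_N^T]^T
    # Least-squares solution of (14), equivalent to s = (B^T B)^{-1} B^T c_bar
    s, *_ = np.linalg.lstsq(B, c_bar, rcond=None)
    return s

The reconstructed shape is then simply $\mathbf{B}\mathbf{s}$, which can be compared against the raw measurement to monitor the fitting residual.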

III-A2 Surface Extraction

The equation of the surface is defined as follows:

$z=f(x,y)=\sum_{j=0}^{n_x}\sum_{l=0}^{n_y}B_{j,n_x}(x)B_{l,n_y}(y)q_{jl}$ (17)

where $n_x,n_y\in\mathbb{N}^*$ are the fitting orders along the $x$ and $y$ directions, and $q_{jl}\in\mathbb{R}$ is the shape weight. As in (14), a fitting cost function is introduced:

$Q=\left(\mathbf{D}\mathbf{s}-\mathbf{z}\right)^\intercal\left(\mathbf{D}\mathbf{s}-\mathbf{z}\right)$ (18)

for an “augmented” regression matrix $\mathbf{D}$ satisfying:

$\mathbf{D}=[\mathbf{D}_1^\intercal,\ldots,\mathbf{D}_N^\intercal]^\intercal\in\mathbb{R}^{N\times(n_x+1)(n_y+1)},\qquad \mathbf{D}_i=[B_{0,n_x}(x_i)B_{0,n_y}(y_i),\ldots,B_{n_x,n_x}(x_i)B_{n_y,n_y}(y_i)]$, with $\mathbf{D}_i^\intercal\in\mathbb{R}^{(n_x+1)(n_y+1)}$, for $i=1,\ldots,N$ (19)

The depth vector is defined as $\mathbf{z}=[z_1,\ldots,z_N]^\intercal\in\mathbb{R}^N$, and the feature vector is $\mathbf{s}=[q_{00},\ldots,q_{n_x n_y}]^\intercal\in\mathbb{R}^{(n_x+1)(n_y+1)}$. The solution to the minimization of (18) that approximates $\mathbf{z}\approx\mathbf{D}\mathbf{s}$ is as follows:

$\mathbf{s}=\left(\mathbf{D}^\intercal\mathbf{D}\right)^{-1}\mathbf{D}^\intercal\mathbf{z}$ (20)

for $N\gg(n_x+1)(n_y+1)$.
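For the surface case, a minimal sketch of (17)-(20) with the polynomial basis (7) could look as follows; again, the function names and the default orders are illustrative assumptions.

import numpy as np

def poly_basis(t, n):
    # Polynomial basis (7): [1, t, ..., t^n]
    return t ** np.arange(n + 1)

def lsm_surface_features(points, nx=2, ny=2):
    # points: (N, 3) array of [x_i, y_i, z_i]; assumes N >> (nx+1)(ny+1)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Each row D_i holds all products B_{j,nx}(x_i) * B_{l,ny}(y_i), eq. (19)
    D = np.vstack([np.kron(poly_basis(xi, nx), poly_basis(yi, ny))
                   for xi, yi in zip(x, y)])
    # Least-squares solution of (18), i.e., s = (D^T D)^{-1} D^T z, eq. (20)
    s, *_ = np.linalg.lstsq(D, z, rcond=None)
    return s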

Figure 3: Schematic diagram of MLS. The weight $\omega(\varepsilon_i)$ scales the squared error between the fitted value and the given value. The fitting smoothness is regulated by adjusting the weight values.

III-B MLS-based Feature Extraction

Although LSM has an efficient one-step calculation, the weights of $B_{j,n}$ are the same over the entire parameter domain (e.g., $\rho$ or $x,y$). Thus, to approximate complex curves/surfaces, the fitting order needs to be increased, which may lead to over-fitting problems. To address this issue, MLS assumes that $\mathbf{s}$ is parameter-dependent, i.e., it changes with respect to $\rho$ (for the centerline and contour), or to $(x,y)$ (for the surface). This enables MLS to represent complex shapes with a lower fitting order. A support field [36] is introduced to ensure that the value of each weight is only influenced by data points within the support field. A conceptual diagram is given in Fig. 3.

III-B1 Centerline and Contour Extraction

Referring to (6), the parametric equation with $\boldsymbol{\sigma}_j(\rho)$ is:

$\mathbf{f}(\rho)=\sum_{j=0}^{n}\boldsymbol{\sigma}_j(\rho)B_{j,n}(\rho)$ (21)

where the parameter-dependent weight is denoted as $\boldsymbol{\sigma}_j(\rho)\in\mathbb{R}^3$. The weighted squared residual function is defined as:

$Q(\rho)=\sum_{i=1}^{N}\omega(\varepsilon_i)\Big\|\sum_{j=0}^{n}\boldsymbol{\sigma}_j(\rho)B_{j,n}(\rho_i)-\mathbf{c}_i\Big\|^2$ (22)

where $\varepsilon_i=|\rho-\rho_i|/d>0$, and $d$ is the constant support field radius. The scalar function $\omega(\varepsilon_i)$ is calculated as follows [37]:

$\omega(\varepsilon_i)=\begin{cases}\frac{2}{3}-4\varepsilon_i^2+4\varepsilon_i^3,&\varepsilon_i\leq 0.5\\ \frac{4}{3}-4\varepsilon_i+4\varepsilon_i^2-\frac{4}{3}\varepsilon_i^3,& 0.5<\varepsilon_i\leq 1\\ 0,&\varepsilon_i>1\end{cases}$ (23)

The function $\omega(\varepsilon_i)$ indicates the weight of $\rho$ relative to $\rho_i$; $\omega(\varepsilon_i)$ decreases as $\varepsilon_i$ increases inside the support field. When $\rho$ is outside the support field, $\omega(\varepsilon_i)=0$. MLS reduces to LSM when $\omega(\varepsilon_i)$ is constant. The cost (22) can be equivalently expressed in matrix form as:

$Q(\rho)=\left(\mathbf{B}\boldsymbol{\vartheta}(\rho)-\bar{\mathbf{c}}\right)^\intercal\mathbf{W}(\varepsilon)\left(\mathbf{B}\boldsymbol{\vartheta}(\rho)-\bar{\mathbf{c}}\right)$ (24)

where $\boldsymbol{\vartheta}(\rho)$ and $\mathbf{W}(\varepsilon)$ are defined as follows:

$\boldsymbol{\vartheta}(\rho)=[\boldsymbol{\sigma}_0^\intercal(\rho),\ldots,\boldsymbol{\sigma}_n^\intercal(\rho)]^\intercal\in\mathbb{R}^{3(n+1)},\qquad \mathbf{W}(\varepsilon)=\operatorname{diag}(\omega(\varepsilon_1),\ldots,\omega(\varepsilon_N))\otimes\mathbf{E}_3\in\mathbb{R}^{3N\times 3N}$ (25)

The value of $\boldsymbol{\vartheta}(\rho)$ that minimizes (24) is computed as follows:

$\boldsymbol{\vartheta}(\rho)=(\mathbf{B}^\intercal\mathbf{W}(\varepsilon)\mathbf{B})^{-1}\mathbf{B}^\intercal\mathbf{W}(\varepsilon)\bar{\mathbf{c}}$ (26)

By completing $N$ iterations along the parameters $\rho_1,\ldots,\rho_N$, we can compute the augmented shape features as:

$\boldsymbol{\Pi}=[\boldsymbol{\vartheta}(\rho_1),\ldots,\boldsymbol{\vartheta}(\rho_N)]\in\mathbb{R}^{3(n+1)\times N}$ (27)

As the dimension of (27) is very large, it is impractical to use its components as a feedback signal for control. Thus, principal component analysis (PCA) [15] is used to reduce the augmented structure (27) into a compact form $\widetilde{\boldsymbol{\Pi}}\in\mathbb{R}^{3(n+1)\times m}$, where $m\ll N$, by selecting the $m$ most significant dimensions. The feature vector $\mathbf{s}\in\mathbb{R}^{3m(n+1)}$ is computed by vectorizing the elements of $\widetilde{\boldsymbol{\Pi}}$.
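The sketch below illustrates one possible reading of the MLS pipeline (21)-(27) for curves, using the weight function (23) and a simple SVD-based truncation in place of the PCA step; since the exact PCA variant is not detailed above, that reduction should be treated as an assumption.

import numpy as np

def mls_weight(eps):
    # Cubic spline weight (23); zero outside the support field
    w = np.zeros_like(eps)
    m1 = eps <= 0.5
    m2 = (eps > 0.5) & (eps <= 1.0)
    w[m1] = 2/3 - 4*eps[m1]**2 + 4*eps[m1]**3
    w[m2] = 4/3 - 4*eps[m2] + 4*eps[m2]**2 - (4/3)*eps[m2]**3
    return w

def mls_curve_features(points, n=4, d=0.2, m=1):
    # points: (N, 3) ordered samples; d: support radius; m: retained components
    N = points.shape[0]
    rho = np.linspace(0.0, 1.0, N)
    basis = lambda t: t ** np.arange(n + 1)            # polynomial basis (7)
    B = np.vstack([np.kron(basis(r), np.eye(3)) for r in rho])
    c_bar = points.reshape(-1)
    cols = []
    for r in rho:                                      # one weighted solve per rho_i, eq. (26)
        w = mls_weight(np.abs(r - rho) / d)
        W = np.kron(np.diag(w), np.eye(3))
        cols.append(np.linalg.lstsq(B.T @ W @ B, B.T @ W @ c_bar, rcond=None)[0])
    Pi = np.column_stack(cols)                         # augmented features, eq. (27)
    # PCA-like truncation: keep the m most significant directions of Pi
    U, S, _ = np.linalg.svd(Pi, full_matrices=False)
    Pi_tilde = U[:, :m] * S[:m]                        # 3(n+1) x m
    return Pi_tilde.reshape(-1)                        # feature vector s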

III-B2 Surface Extraction

The equation of the surface is constructed as:

$f(x,y)=\sum_{j=0}^{n_x}\sum_{l=0}^{n_y}B_{j,n_x}(x)B_{l,n_y}(y)\varpi_{jl}(x,y)$ (28)

where $\varpi_{jl}(x,y)\in\mathbb{R}$ is the parameter-dependent weight related to $(x,y)$. Algorithm 1 gives a pseudocode description of this method.

Algorithm 1 Surface fitting procedure of MLS.
1: Input: $\bar{\mathbf{c}}$, $n_x$, $n_y$, $m$, and $d$;
2: Calculate the node distance:
   $\varepsilon_i=\sqrt{(x-x_i)^2+(y-y_i)^2}/d$
3: Construct the fitting cost function:
   $Q=\left(\mathbf{D}\boldsymbol{\phi}(x,y)-\mathbf{z}\right)^\intercal\mathbf{W}(\varepsilon)\left(\mathbf{D}\boldsymbol{\phi}(x,y)-\mathbf{z}\right)$
   for $\boldsymbol{\phi}(x,y)$ and $\mathbf{W}(\varepsilon)$ satisfying:
   $\boldsymbol{\phi}=[\varpi_{00},\ldots,\varpi_{n_x n_y}]^\intercal,\qquad \mathbf{W}=\operatorname{diag}(\omega(\varepsilon_1),\ldots,\omega(\varepsilon_N))$ (29)
4: Compute the structure $\boldsymbol{\phi}(x,y)$:
   $\boldsymbol{\phi}=(\mathbf{D}^\intercal\mathbf{W}(\varepsilon)\mathbf{D})^{-1}\mathbf{D}^\intercal\mathbf{W}(\varepsilon)\mathbf{z}$
5: Use $x_i,y_i,\ i\in[1,N]$ to compute:
   $\boldsymbol{\Pi}=[\boldsymbol{\phi}(x_1,y_1),\ldots,\boldsymbol{\phi}(x_N,y_N)]$ (30)
6: Use PCA to calculate $\widetilde{\boldsymbol{\Pi}}$;
7: Vectorize $\widetilde{\boldsymbol{\Pi}}$ to obtain $\mathbf{s}$;
8: return $\mathbf{s}$;
Remark 1.

In addition to the proposed basis functions, there are other approaches that can be used to obtain similar results, e.g., Chebyshev polynomials [38] and Legendre basis transformations [39].

IV Shape Prediction Network

During the manipulation process, occlusions caused by obstacles or the robot itself may affect the integrity of the observed shapes, and hence, the vector $\mathbf{s}$ cannot properly describe the object's configuration. As a solution to this critical issue, in this paper we propose an occlusion-compensating shape prediction network (SPN), which is composed of a regulation input (RI), a multi-resolution encoder (MRE), and a discriminator network (DN) [40]. The proposed SPN utilizes the robot and object configurations and the active robot motions (i.e., $\bar{\mathbf{c}}_k$, $\mathbf{r}_k$, $\mathbf{u}_k$) as input to the network to predict the next-instant shape, here denoted by $\hat{\bar{\mathbf{c}}}_{k+1}$. Fig. 4 shows the overall architecture of the SPN.

IV-A Input Data Preprocessing

As the input data $\bar{\mathbf{c}}_k,\mathbf{r}_k,\mathbf{u}_k$ to the network have different sizes, they need to be rearranged into structures with unified dimensions. To this end, $\bar{\mathbf{c}}_k$ is first rearranged into the matrix $\mathbf{T}^{high}_k=[\mathbf{c}_1,\mathbf{c}_2,\ldots,\mathbf{c}_N]^\intercal\in\mathbb{R}^{N\times 3}$. Then, farthest point sampling (FPS) [41] is used to downsample $\mathbf{T}^{high}_k$ to two resolutions, $\mathbf{T}^{mid}_k\in\mathbb{R}^{\frac{N}{\delta}\times 3}$ and $\mathbf{T}^{low}_k\in\mathbb{R}^{\frac{N}{\delta^2}\times 3}$, for $\delta$ as the resolution scale. Finally, the vectors $\mathbf{r}_k$ and $\mathbf{u}_k$ are rearranged into the matrices $\boldsymbol{\Lambda}_k^{high}=\mathbf{I}_{N\times 1}\otimes\mathbf{r}_k^\intercal\in\mathbb{R}^{N\times 3}$ and $\boldsymbol{\Sigma}_k^{high}=\mathbf{I}_{N\times 1}\otimes\mathbf{u}_k^\intercal\in\mathbb{R}^{N\times 3}$, which are similarly downsampled into mid and low resolutions as follows:

$\boldsymbol{\Lambda}_k^{mid}=\mathbf{I}_{\frac{N}{\delta}\times 1}\otimes\mathbf{r}_k^\intercal,\qquad \boldsymbol{\Lambda}_k^{low}=\mathbf{I}_{\frac{N}{\delta^2}\times 1}\otimes\mathbf{r}_k^\intercal$, (31)
$\boldsymbol{\Sigma}_k^{mid}=\mathbf{I}_{\frac{N}{\delta}\times 1}\otimes\mathbf{u}_k^\intercal,\qquad \boldsymbol{\Sigma}_k^{low}=\mathbf{I}_{\frac{N}{\delta^2}\times 1}\otimes\mathbf{u}_k^\intercal$ (32)

Thus, three different resolutions are generated for the network: high $\{\mathbf{T}_k^{high},\boldsymbol{\Lambda}_k^{high},\boldsymbol{\Sigma}_k^{high}\}$, mid $\{\mathbf{T}_k^{mid},\boldsymbol{\Lambda}_k^{mid},\boldsymbol{\Sigma}_k^{mid}\}$, and low $\{\mathbf{T}_k^{low},\boldsymbol{\Lambda}_k^{low},\boldsymbol{\Sigma}_k^{low}\}$, which provide a total input data of dimension $[N\times 9,\frac{N}{\delta}\times 9,\frac{N}{\delta^2}\times 9]$. $\mathbf{T}_k^{high},\mathbf{T}_k^{mid},\mathbf{T}_k^{low}$ are the geometric shapes of the object under different compression sizes specified by $\delta$; thus, these three terms can describe the latent feature structure of the object. The proposed SPN aims to predict the object's shape that results from the robot's actions under the current object-robot configuration. By using this multi-resolution data {high, mid, low}, the encoder can better learn the latent feature information of the shapes.
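A small sketch of this preprocessing step (FPS downsampling plus the Kronecker-style tiling of $\mathbf{r}_k$ and $\mathbf{u}_k$) is given below; it assumes $N$ is divisible by $\delta^2$ and uses illustrative function names.

import numpy as np

def farthest_point_sampling(points, n_samples):
    # Greedy FPS [41]: iteratively pick the point farthest from the chosen set
    chosen = [0]
    dist = np.full(points.shape[0], np.inf)
    for _ in range(n_samples - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(dist)))
    return points[chosen]

def build_spn_inputs(c_bar, r_k, u_k, N=64, delta=2):
    # Arrange (c_bar_k, r_k, u_k) into the high/mid/low resolutions fed to the MRE
    T_high = c_bar.reshape(N, 3)
    T_mid = farthest_point_sampling(T_high, N // delta)
    T_low = farthest_point_sampling(T_high, N // delta**2)
    inputs = []
    for T in (T_high, T_mid, T_low):
        rows = T.shape[0]
        Lam = np.tile(r_k.reshape(1, 3), (rows, 1))    # Lambda = 1_{rows x 1} (x) r_k^T
        Sig = np.tile(u_k.reshape(1, 3), (rows, 1))    # Sigma  = 1_{rows x 1} (x) u_k^T
        inputs.append(np.hstack([T, Lam, Sig]))        # rows x 9
    return inputs                                      # [N x 9, N/delta x 9, N/delta^2 x 9]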

Figure 4: SPN predicts the next-moment shape $\bar{\mathbf{c}}_{k+1}$ by using the current-moment data in the case of occlusion, i.e., $(\bar{\mathbf{c}}_k,\mathbf{r}_k,\mathbf{u}_k)\rightarrow\hat{\bar{\mathbf{c}}}_{k+1}$. FPS [41] regulates $\bar{\mathbf{c}}_k$ (shown in yellow) uniformly to different scales (shown in red and blue). MRE stacks the three-resolution data together to form the total feature information, and obtains the predicted next-moment shape through convolution. DN is used to further improve the accuracy of the network.

IV-B Multi-Resolution Encoder

A combined multi-layer perceptron (CMLP) is the feature extractor of the MRE; it uses the output of each layer in an MLP to form a multi-dimensional feature vector. Traditional methods adopt the last layer of the MLP output as the features and do not consider the outputs of the intermediate layers, which may lose important local information [42]. CMLP makes good use of the low-level and mid-level features that include useful intermediate-transition information [40]. CMLP utilizes an MLP to encode the input data into multiple dimensions $[64,128,256,512,1024]$. Then, we max-pool the outputs of the last four layers to construct a multi-dimensional feature vector as follows:

$\mathbf{v}_1\in\mathbb{R}^{128},\ \mathbf{v}_2\in\mathbb{R}^{256},\ \mathbf{v}_3\in\mathbb{R}^{512},\ \mathbf{v}_4\in\mathbb{R}^{1024}$ (33)

The combined feature vector is constructed as $\bar{\mathbf{v}}_i=[\mathbf{v}_1^\intercal,\mathbf{v}_2^\intercal,\mathbf{v}_3^\intercal,\mathbf{v}_4^\intercal]^\intercal\in\mathbb{R}^{1920}$. Three independent CMLPs map the three resolutions into three individual $\bar{\mathbf{v}}_i$, for $i=1,2,3$. Each $\bar{\mathbf{v}}_i$ represents the extracted latent information of each resolution. Then, the augmented feature matrix $\mathbf{V}$ is generated by arranging its columns as $\mathbf{V}=[\bar{\mathbf{v}}_1,\bar{\mathbf{v}}_2,\bar{\mathbf{v}}_3]\in\mathbb{R}^{1920\times 3}$, which is further passed through a 1D convolution to obtain $\mathbf{M}\in\mathbb{R}^{N\times 3}$. Finally, the predicted next-moment shape $\hat{\bar{\mathbf{c}}}_{k+1}\in\mathbb{R}^{3N}$ is obtained by vectorizing $\mathbf{M}$. The prediction loss of the MRE is:

$L_{mre}=\left\|\hat{\bar{\mathbf{c}}}_{k+1}-\bar{\mathbf{c}}_{k+1}\right\|$ (34)

where $\bar{\mathbf{c}}_{k+1}$ represents the ground-truth next-moment shape in the training dataset.
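A hedged PyTorch sketch of one CMLP branch is shown below; the layer widths follow the text, but the other details (1x1 convolutions, batch normalization placement) mirror common point-cloud encoders and should be taken as assumptions rather than the exact reported architecture.

import torch
import torch.nn as nn

class CMLP(nn.Module):
    # Combined MLP: per-point 1x1 convolutions whose last four outputs are
    # max-pooled and concatenated into the 1920-D vector of (33)
    def __init__(self, in_dim=9):
        super().__init__()
        dims = [in_dim, 64, 128, 256, 512, 1024]
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(dims[i], dims[i + 1], 1),
                          nn.BatchNorm1d(dims[i + 1]),
                          nn.ReLU())
            for i in range(len(dims) - 1)])

    def forward(self, x):                 # x: (batch, in_dim, n_points)
        pooled = []
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i >= 1:                    # keep the 128/256/512/1024 layers
                pooled.append(torch.max(x, dim=2).values)
        return torch.cat(pooled, dim=1)   # (batch, 128 + 256 + 512 + 1024 = 1920)

Three such branches (one per resolution) would produce the columns of $\mathbf{V}$, which a final 1D convolution maps to the $N\times 3$ prediction $\mathbf{M}$.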

IV-C Discriminator Network

A Generative Adversarial Network (GAN) is chosen as the DN to enhance the prediction accuracy. For simplicity, we define $\Phi=\mathrm{MRE}(\cdot)$ and $\Psi=\mathrm{DN}(\cdot)$. We define $(\bar{\mathbf{c}}_k,\mathbf{r}_k,\mathbf{u}_k)$ as the input set $\mathcal{X}$ fed into $\Phi$, while $\mathcal{Y}$ represents the true shapes $\bar{\mathbf{c}}_{k+1}$. $\Psi$ is a classification network with a similar structure to CMLP, constituted by serial MLP layers $[128,256,512,1024]$, that distinguishes the predicted shape $\Phi(\mathcal{X})$ from the real shape $\mathcal{Y}$. We max-pool the last three layers of $\Psi$ to obtain feature vectors of sizes $[256,512,1024]$. The three feature vectors are concatenated into a latent vector $\mathbf{m}\in\mathbb{R}^{1792}$, and then passed through the fully-connected layers $[512,256,64,1]$ followed by a sigmoid classifier to obtain the evaluation. The adversarial loss is defined as follows:

$L_{adv}=\sum_{1\leq i\leq v}\log\left(1-\Psi(\beta_i)\right)+\sum_{1\leq i\leq v}\log\left(\Psi(\Phi(\alpha_i))\right)$ (35)

where $\alpha_i\in\mathcal{X}$, $\beta_i\in\mathcal{Y}$, $i=1,\ldots,v$, and $v$ is the size of the dataset comprising $\mathcal{X}$ and $\mathcal{Y}$. The total loss of the SPN is:

$L=\zeta_{mre}L_{mre}+\zeta_{adv}L_{adv}$ (36)

where $\zeta_{mre}$ and $\zeta_{adv}$ are the weights of $L_{mre}$ and $L_{adv}$, respectively, which satisfy the condition $\zeta_{mre}+\zeta_{adv}=1$.
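A compact sketch of how the total loss (34)-(36) could be evaluated during training is given below; the weight values and the small epsilon added for numerical stability are illustrative assumptions, and the usual alternating generator/discriminator updates of a GAN are assumed to be handled by the surrounding training loop.

import torch

def spn_loss(pred_shape, true_shape, discriminator,
             zeta_mre=0.95, zeta_adv=0.05, eps=1e-8):
    # Reconstruction term (34): distance between predicted and ground-truth shapes
    l_mre = torch.norm(pred_shape - true_shape, dim=-1).mean()
    # Adversarial term (35): discriminator scores on real and predicted shapes
    d_real = discriminator(true_shape)       # Psi(beta_i), sigmoid output in (0, 1)
    d_fake = discriminator(pred_shape)       # Psi(Phi(alpha_i))
    l_adv = torch.log(1.0 - d_real + eps).mean() + torch.log(d_fake + eps).mean()
    # Weighted total loss (36), with zeta_mre + zeta_adv = 1
    return zeta_mre * l_mre + zeta_adv * l_adv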

V Receding-time Model Estimation

In this paper, the objects are assumed to be manipulated by the robot slowly; thus, $\mathbf{J}_k(\mathbf{s}_k,\mathbf{r}_k)$ is expected to change smoothly. For ease of presentation, we omit the arguments of $\mathbf{J}_k(\mathbf{s}_k,\mathbf{r}_k)$ and denote it as $\mathbf{J}_k$ from now on. To estimate the DJM, three indicators are considered, viz., accuracy, smoothness, and singularity. To this end, an optimization-based receding-time model (RTM) estimator is presented to estimate the change of the Jacobian matrix, denoted by $\Delta\hat{\mathbf{J}}_k=\hat{\mathbf{J}}_k-\hat{\mathbf{J}}_{k-1}$, which enables the estimation procedure to be monitored. $\Delta\hat{\mathbf{J}}_k$ can be obtained by considering the following three constraints:

  • ($Q_1$) Constraint of receding-time error [43]. As $\mathbf{J}_k$ depicts the relationship between $\Delta\mathbf{s}_{k+1}$ and $\mathbf{u}_k$ only in a local range, we consider the accumulated error over the $\eta$ past moments to ensure the estimation accuracy, where $\eta$ is the receding window size. The receding-time error is given by:

    $Q_1=\sum_{j=1}^{\eta}\gamma^j\Big\|\Delta\mathbf{s}_{k+1-j}-(\hat{\mathbf{J}}_{k-1}+\Delta\hat{\mathbf{J}}_k)\mathbf{u}_{k-j}\Big\|^2$ (37)

    The sensitivity to noise can be improved by adjusting $\eta$, which helps to address measurement fluctuations. $0<\gamma\leq 1$ is a constant forgetting factor giving less weight to older observation data.

  • ($Q_2$) Constraint of estimation smoothness [43]. As $\mathbf{J}_k$ is assumed to be smooth, $\Delta\hat{\mathbf{J}}_k$ should be estimated smoothly to avoid sudden large fluctuations, which can be achieved by minimizing the Frobenius norm of $\Delta\hat{\mathbf{J}}_k$:

    $Q_2=\|\Delta\hat{\mathbf{J}}_k\|_F^2$ (38)

  • ($Q_3$) Constraint of shape manipulability [44]. It evaluates the feasibility of changing the object's shape under the current object-robot configuration:

    $Q_3=\Big\|\frac{\lambda_{\max}\big((\hat{\mathbf{J}}_{k-1}+\Delta\hat{\mathbf{J}}_k)^\intercal(\hat{\mathbf{J}}_{k-1}+\Delta\hat{\mathbf{J}}_k)\big)}{\lambda_{\min}\big((\hat{\mathbf{J}}_{k-1}+\Delta\hat{\mathbf{J}}_k)^\intercal(\hat{\mathbf{J}}_{k-1}+\Delta\hat{\mathbf{J}}_k)\big)}\Big\|^2$ (39)

    where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum eigenvalues, respectively, and $Q_3\geq 1$. When $Q_3=1$, the object can deform isotropically in any direction. A growing $Q_3$ indicates that the object is approaching a singular (non-manipulable) configuration.

Finally, the total weighted optimization index is given by:

$Q(\Delta\hat{\mathbf{J}}_k)=\mu_1 Q_1+\mu_2 Q_2+\mu_3 Q_3$ (40)

where $\mu_i>0$ are the weights that specify the contribution of each constraint, and which satisfy $\mu_1+\mu_2+\mu_3=1$. The index (40) is then minimized by using numerical optimization tools (e.g., Matlab/fmincon or Python/CasADi) to obtain $\Delta\hat{\mathbf{J}}_k$ and thus iteratively update the deformation Jacobian matrix as follows: $\hat{\mathbf{J}}_k=\hat{\mathbf{J}}_{k-1}+\Delta\hat{\mathbf{J}}_k$.
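The sketch below shows one way this minimization could be carried out in Python with SciPy instead of fmincon/CasADi; the history containers and the unconstrained BFGS solve are illustrative choices (the weights follow the values reported later, $\mu=(0.8,0.1,0.1)$).

import numpy as np
from scipy.optimize import minimize

def rtm_update(J_prev, ds_hist, u_hist, eta=10, gamma=0.9, mu=(0.8, 0.1, 0.1)):
    # ds_hist[j], u_hist[j]: past pairs (Delta s, u) over the receding window, most recent first
    p, q = J_prev.shape

    def cost(dj_flat):
        dJ = dj_flat.reshape(p, q)
        J = J_prev + dJ
        # Q1: receding-time error with forgetting factor gamma, eq. (37)
        q1 = sum(gamma ** (j + 1) * np.linalg.norm(ds_hist[j] - J @ u_hist[j]) ** 2
                 for j in range(min(eta, len(u_hist))))
        # Q2: smoothness of the update, eq. (38)
        q2 = np.linalg.norm(dJ, 'fro') ** 2
        # Q3: manipulability (eigenvalue ratio of J^T J), eq. (39)
        eig = np.linalg.eigvalsh(J.T @ J)
        q3 = (eig[-1] / max(eig[0], 1e-9)) ** 2
        return mu[0] * q1 + mu[1] * q2 + mu[2] * q3

    res = minimize(cost, np.zeros(p * q), method='BFGS')
    return J_prev + res.x.reshape(p, q)       # J_hat_k = J_hat_{k-1} + Delta J_hat_k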

VI Model Predictive Controller

It is assumed that the matrix $\mathbf{J}_k$ has been accurately estimated at time instant $k$ by the RTM, such that it satisfies $\Delta\mathbf{s}_{k+1}=\hat{\mathbf{J}}_k\mathbf{u}_k$. Based on this model, we propose an MPC-based controller to derive the velocity inputs $\mathbf{u}_k$ for the robot, while taking saturation and workspace constraints into account. Two vectors are defined as follows:

$\bar{\mathbf{s}}_k=[\mathbf{s}_{k+1|k}^\intercal,\ldots,\mathbf{s}_{k+h|k}^\intercal]^\intercal\in\mathbb{R}^{ph},\qquad \bar{\mathbf{u}}_k=[\mathbf{u}_{k|k}^\intercal,\ldots,\mathbf{u}_{k+h-1|k}^\intercal]^\intercal\in\mathbb{R}^{3h}$ (41)

where $\bar{\mathbf{s}}_k$ and $\bar{\mathbf{u}}_k$ represent the predictions of $\mathbf{s}_k$ and $\mathbf{u}_k$ over the next $h$ periods, respectively. The vectors $\mathbf{s}_{k+i|k}$ and $\mathbf{u}_{k+i|k}$ denote the $i$th predictions of $\mathbf{s}_k$ and $\mathbf{u}_k$ from the time instant $k$, where $\mathbf{s}_{k|k}=\mathbf{s}_k$ and $\mathbf{u}_{k|k}=\mathbf{u}_k$ must hold. The prediction $\bar{\mathbf{s}}_k$ can be calculated from the estimated Jacobian matrix by noting that $\hat{\mathbf{J}}_k\approx\hat{\mathbf{J}}_{k+h}$ is satisfied during the period $[k,k+h]$ (which is reasonable, given the regularity of the object). This way, the predictions are computed as follows:

$\mathbf{s}_{k+j|k}=\mathbf{s}_k+\sum_{i=0}^{j-1}\hat{\mathbf{J}}_k\mathbf{u}_{k+i|k},\qquad j=1,\ldots,h$ (42)

All predictions are then grouped and arranged into a single vector form:

$\bar{\mathbf{s}}_k=\mathbf{A}\mathbf{s}_k+\boldsymbol{\Theta}\bar{\mathbf{u}}_k,\qquad \mathbf{A}=\mathbf{I}_{h\times 1}\otimes\mathbf{E}_p\in\mathbb{R}^{ph\times p},\qquad \boldsymbol{\Theta}=\mathbf{L}_h\otimes\hat{\mathbf{J}}_k\in\mathbb{R}^{ph\times 3h}$ (43)

In addition to $\bar{\mathbf{s}}_k$ and $\bar{\mathbf{u}}_k$, we define the constant sequence vector $\bar{\mathbf{s}}_k^*$ that represents the desired shape feature as:

$\bar{\mathbf{s}}_k^*=[\mathbf{s}_{k+1|k}^{*\intercal},\ldots,\mathbf{s}_{k+h|k}^{*\intercal}]^\intercal\in\mathbb{R}^{ph}$ (44)

The cost function $Q(\bar{\mathbf{u}}_k)$ for the optimization of the control input is formulated as:

$Q(\bar{\mathbf{u}}_k)=(\bar{\mathbf{s}}_k-\bar{\mathbf{s}}_k^*)^\intercal\boldsymbol{\Upsilon}_1(\bar{\mathbf{s}}_k-\bar{\mathbf{s}}_k^*)+\bar{\mathbf{u}}_k^\intercal\boldsymbol{\Upsilon}_2\bar{\mathbf{u}}_k$ (45)

where $\boldsymbol{\Upsilon}_1>0$ and $\boldsymbol{\Upsilon}_2>0$ are the weights for the error convergence rate and the smoothness of $\mathbf{u}_k$, respectively. Two constraints are considered:

  • Saturation limits. In practice, robots have limits on their achievable joint speeds. These constraints are also useful in soft object manipulation tasks to avoid damaging the object. Therefore, $\bar{\mathbf{u}}_k$ needs to be constrained:

    $\bar{\mathbf{u}}_{\min}\leq\bar{\mathbf{u}}_k\leq\bar{\mathbf{u}}_{\max}$ (46)

    where $\bar{\mathbf{u}}_{\min}$ and $\bar{\mathbf{u}}_{\max}$ are the constant lower and upper bounds, respectively.

  • Workspace limits. Robots are also often required to operate in a confined workspace to avoid colliding with the environment. In soft object manipulation, this constraint is needed to avoid over-stretching or over-compressing the manipulated object. To this end, the following constant bounds are introduced:

    $\mathbf{r}_{k+i|k}^{\min}\leq\mathbf{r}_{k+i|k}\leq\mathbf{r}_{k+i|k}^{\max},\qquad 0\leq i\leq h-1$ (47)

    Similarly to (42), the recursive structure of (47) can be obtained as follows:

    $\boldsymbol{\Xi}_{\min}\leq\mathbf{C}\bar{\mathbf{u}}_k\leq\boldsymbol{\Xi}_{\max}$ (48)

    for $\boldsymbol{\Xi}_{\min},\boldsymbol{\Xi}_{\max}\in\mathbb{R}^{3h}$ and $\mathbf{C}\in\mathbb{R}^{3h\times 3h}$ defined as:

    $\boldsymbol{\Xi}_{\min}=[(\mathbf{r}_{k|k}^{\min}-\mathbf{r}_{k-1})^\intercal,\ldots,(\mathbf{r}_{k+h-1|k}^{\min}-\mathbf{r}_{k-1})^\intercal]^\intercal,\quad \boldsymbol{\Xi}_{\max}=[(\mathbf{r}_{k|k}^{\max}-\mathbf{r}_{k-1})^\intercal,\ldots,(\mathbf{r}_{k+h-1|k}^{\max}-\mathbf{r}_{k-1})^\intercal]^\intercal,\quad \mathbf{C}=\mathbf{L}_h\otimes\mathbf{E}_3\in\mathbb{R}^{3h\times 3h}$ (49)

The quadratic optimization problem is formulated as follows:

$\begin{aligned}\min_{\bar{\mathbf{u}}_k}\ & Q(\bar{\mathbf{u}}_k)=\frac{1}{2}\bar{\mathbf{u}}_k^\intercal\mathbf{H}\bar{\mathbf{u}}_k+\mathbf{q}^\intercal\bar{\mathbf{u}}_k\\ \mathrm{s.t.}\ & \bar{\mathbf{u}}_{\min}\leq\bar{\mathbf{u}}_k\leq\bar{\mathbf{u}}_{\max},\quad \boldsymbol{\Xi}_{\min}\leq\mathbf{C}\bar{\mathbf{u}}_k\leq\boldsymbol{\Xi}_{\max}\end{aligned}$ (50)

where $\mathbf{H}=2(\boldsymbol{\Theta}^\intercal\boldsymbol{\Upsilon}_1\boldsymbol{\Theta}+\boldsymbol{\Upsilon}_2)\in\mathbb{R}^{3h\times 3h}$, $\mathbf{q}^\intercal=2\boldsymbol{\Omega}^\intercal\boldsymbol{\Upsilon}_1\boldsymbol{\Theta}\in\mathbb{R}^{1\times 3h}$, and $\boldsymbol{\Omega}=\mathbf{A}\mathbf{s}_k-\bar{\mathbf{s}}_k^*\in\mathbb{R}^{ph}$ are constant matrices. Then, $\bar{\mathbf{u}}_k$ can be obtained by using a standard quadratic solver on (50). Finally, $\mathbf{u}_k$ is calculated by the receding horizon scheme:

$\mathbf{u}_k=[\mathbf{E}_3,\mathbf{0},\ldots,\mathbf{0}]\bar{\mathbf{u}}_k$ (51)
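For reference, a Python sketch of the full receding-horizon step (43)-(51) is given below; it assumes diagonal weights $\boldsymbol{\Upsilon}_1=\gamma_1\mathbf{E}$ and $\boldsymbol{\Upsilon}_2=\gamma_2\mathbf{E}$ and uses SciPy's SLSQP routine in place of a dedicated QP solver, so it should be read as an illustration rather than the exact implementation.

import numpy as np
from scipy.optimize import minimize, Bounds

def mpc_control(J_hat, s_k, s_star, r_prev, r_min, r_max,
                h=5, u_max=0.01, gamma1=1.0, gamma2=0.1):
    p = J_hat.shape[0]
    Lh = np.tril(np.ones((h, h)))                    # lower-triangular L_h
    A = np.kron(np.ones((h, 1)), np.eye(p))          # eq. (43)
    Theta = np.kron(Lh, J_hat)
    Omega = A @ s_k - np.tile(s_star, h)             # Omega = A s_k - s_bar*
    U1 = gamma1 * np.eye(p * h)
    U2 = gamma2 * np.eye(3 * h)
    H = 2 * (Theta.T @ U1 @ Theta + U2)              # Hessian of (50)
    q = 2 * Theta.T @ U1 @ Omega
    C = np.kron(Lh, np.eye(3))                       # workspace map, eq. (49)
    xi_min = np.tile(r_min - r_prev, h)
    xi_max = np.tile(r_max - r_prev, h)
    cost = lambda u: 0.5 * u @ H @ u + q @ u
    grad = lambda u: H @ u + q
    cons = [{'type': 'ineq', 'fun': lambda u: C @ u - xi_min},
            {'type': 'ineq', 'fun': lambda u: xi_max - C @ u}]
    bounds = Bounds(-u_max * np.ones(3 * h), u_max * np.ones(3 * h))
    res = minimize(cost, np.zeros(3 * h), jac=grad, method='SLSQP',
                   bounds=bounds, constraints=cons)
    return res.x[:3]                                 # receding horizon: apply only u_k, eq. (51)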

Fig. 5 presents the conceptual block diagram of the proposed framework.

Remark 2.

The proposed MPC-based technique (50) computes the robot's shaping actions based on a performance objective and subject to the system's constraints; This approach does not require the identification of the full analytical model of the deformable object. Its quadratic optimization form enables the integration of additional metrics into the problem, e.g., rise time and overshoot.

Figure 5: The block diagram of the proposed shape servoing framework, including representation, prediction, approximation, and manipulation within constraints.
(a) Centerline
(b) Contour
(c) Surface
(d) Plane
Figure 6: The experimental setup, including the objects (elastic and rigid), the single-arm robot (UR5), and the D455 depth camera.

VII Results

VII-A Experimental Setup

Vision-based manipulation experiments are conducted to validate our proposed framework. The experimental platform used in our study includes a fixed D455 depth sensor, a UR5 robot manipulator, and the various deformable objects shown in Fig. 6. The depth sensor provides the video stream, from which the 3D shapes are computed using the OpenCV and RealSense libraries. In our experiments, only 3 DOF of the robot manipulator are considered; therefore, the control input $\mathbf{u}=[u_x,u_y,u_z]^\intercal\in\mathbb{R}^3$ represents the linear velocity of the end-effector. A saturation limit of $|u_i|\leq 0.01$ m/s is applied for $i=x,y,z$. The motion control algorithm is implemented in ROS/Python and runs with a servo-control loop of 10 Hz. A video of the conducted experiments can be downloaded from https://github.com/JiamingQi-Tom/experiment_video/raw/master/paper4/video.mp4

The proposed shape extraction algorithm is depicted in Fig. 7. The RGB image from the camera is transformed into HSV and combined with mask processing to obtain a binary image. OpenCV/thinning is utilized with FPS to extract a fixed number of object points. The point on the centerline closest to the gripper's green marker is chosen to sort the centerline along the cable. We then obtain the 3D shapes by using the RealSense sensor and looking up the corresponding 2D pixels. We adopt the method in [11] to compute the object's contour. A surface is obtained in a similar way to the centerline, i.e., by sorting points from top to bottom, and from left to right. As 2D pixels and 3D points have a one-to-one correspondence in a depth camera, our extraction method improves the robustness to measurement noise and is simpler than traditional point cloud processing algorithms.
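A minimal OpenCV sketch of the centerline branch of this pipeline is given below; the HSV bounds are placeholders, the thinning call requires the opencv-contrib package (cv2.ximgproc), and the subsequent sorting from the gripper marker and the 2D-to-3D lookup through the RealSense library are omitted.

import cv2
import numpy as np

def extract_centerline_pixels(bgr_image, hsv_low, hsv_high, n_points=64):
    # Segment the object in HSV space and skeletonize the binary mask
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(hsv_low), np.array(hsv_high))
    skeleton = cv2.ximgproc.thinning(mask)
    ys, xs = np.nonzero(skeleton)
    pixels = np.column_stack([xs, ys]).astype(float)
    # Keep a fixed number of skeleton pixels via farthest point sampling
    chosen, dist = [0], np.full(len(pixels), np.inf)
    for _ in range(n_points - 1):
        dist = np.minimum(dist, np.linalg.norm(pixels - pixels[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(dist)))
    return pixels[chosen]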

Figure 7: Shape extraction for the centerline, contour, and surface. The first row shows the 2D image pixels, and the second row shows the 3D shapes. All shapes are fixed-sampled, equidistant, and ordered.
Figure 8: Shapes of various objects manipulated by the UR5.
(a) Centerline fitting
(b) Contour fitting
(c) Surface fitting
Figure 9: Comparison of centerline, contour, and surface fitting accuracy and computation time among (7), (8), (9), and (13), between LSM and MLS with $d=0.9$, $d=0.1$, and $d=0.2$, respectively. $n_i$, $i=x,y$ are the surface fitting orders.
Table I: Fitting configurations. “N/A” stands for “Not Applicable”.
                   Centerline   Contour         Surface / Rigid
Method             LSM          MLS             MLS
Support radius     N/A          d = 0.2         d = 0.2
PCA                N/A          m = 1           m = 1
Fitting order      n = 5        n = 4           n_x = n_y = 2
Basis function     Bernstein    Trigonometric   Polynomial
Number of points   N = 64       N = 64          N = 32

VII-B Online Fitting of the Parametric Shape Representation

In this section, ten thousand samples of centerlines, contours, and surfaces with $N=64,64,32$, respectively, are collected by commanding the robot to manipulate the objects, whose configurations are then captured by the depth sensor. These shaping actions are shown in Fig. 8 and visualized in the accompanying multimedia attachment. This data is used to evaluate the performance (viz., the accuracy and computation time) of our representation framework. For that, we calculate the average error between the feedback shape $\bar{\mathbf{c}}$ and the reconstructed shape $\hat{\bar{\mathbf{c}}}$ as $\operatorname{mean}(\sum\|\bar{\mathbf{c}}_i-\hat{\bar{\mathbf{c}}}_i\|)$. The computation time is defined as the average overall processing time of all sample data for each method.

Fig. 9 shows that the larger the scalars $n,n_x,n_y$ are, the better the fitting accuracy of LSM and MLS is. MLS fits better than LSM under the same conditions, as MLS calculates independent weights while LSM assumes that each node has the same weight. MLS works better in fitting the contour because the parametric curve may not be continuous at the end corner; thus, the equal-weight assumption of LSM is not suitable here. As the number of data points for the surface is $N=32$, it does not satisfy the condition $N\gg(n_x+1)(n_y+1)$ for higher-order fitting models (e.g., $n_x=n_y\geq 5$); thus, we only use $n_x=n_y\leq 4$. The results show that MLS performs better than LSM in the surface representation; Interestingly, MLS obtains satisfactory performance even with $n_x=n_y=1$. Fig. 9 also shows that larger $n,n_x,n_y$ will also increase the computation time. MLS has a more noticeable increase, as it calculates the weights at all nodes while LSM calculates them once. The trigonometric approach is the fastest, polynomial and Bernstein follow, while Cox-deBoor is the slowest with the most iterative operations. The above analysis verifies the effectiveness of the proposed extraction framework, which can represent objects with a low-dimensional feature. Details of the curve fitting configurations are given in Table I.

(a) Centerline
(b) Contour
(c) Surface
Figure 10: Validation of the SPN for the three shape configurations while the obstacle is moved manually. The RGB image shows the occluded case. The red dots in the second row give the feedback shape within the occlusion at the current instant, and the blue ones are the predicted shape according to the past data.

VII-C Occlusion-Robust Prediction of Object Shapes

The data collected in Section VII-B is used to evaluate our proposed SPN: 80% of the data is used as the training set, and the remaining 20% is used to test the trained network. We set $\delta=2$ for the centerline, contour, and surface parametrization. The SPN is built using PyTorch and trained with the ADAM optimizer with a batch size of 500 and an initial learning rate of 0.0001. ReLU activations and batch normalization are adopted to improve the network's performance. In this validation test, the robot deforms the objects with small babbling motions while a cardboard sheet covers parts of the objects. Fig. 10 shows that the SPN can predict and provide relatively complete shapes for the three types of manipulated objects. The accompanying video demonstrates the performance of this method.

(a) Centerline DJM Estimation
(b) Contour DJM Estimation
(c) Surface DJM Estimation
Figure 11: Curves of $T_1$, $T_2$, and $Q_3$, showing the accuracy, smoothness, and singularity of the DJM, respectively. $T_1$ represents the feature estimation error, and $T_2$ depicts the estimation differential error. Here, $\mu_1=0.8$, $\mu_2=0.1$, $\mu_3=0.1$.

VII-D Estimation of the Sensorimotor Model

This section aims to evaluate the RTM (40) that approximates the deformation Jacobian matrix; The performance of the RTM estimator is compared with the interaction matrix estimator (IEM) in [15] and the unscented Kalman filter (UKF) in [31]. To this end, we introduce two metrics, $T_1$ and $T_2$, to quantitatively compare the performance of these algorithms:

$T_1=\|\hat{\mathbf{s}}_{k+1}-\mathbf{s}_{k+1}\|,\qquad T_2=\|\Delta\mathbf{s}_{k+1}-\hat{\mathbf{J}}_k\mathbf{u}_k\|$ (53)

where $\hat{\mathbf{s}}_{k+1}=\hat{\mathbf{s}}_k+\hat{\mathbf{J}}_k\mathbf{u}_k$ is the approximated shape feature that is computed based on the control actions. Fig. 11 shows that RTM with $\eta=20$ provides the best performance for $T_1$ among the methods; This means that constraint $Q_1$ (37) enables RTM to learn from past data by adjusting $\eta$. Small values of $T_2$ reflect that RTM accurately predicts the differential changes induced by the DJM; This means that constraint $Q_2$ (38) helps to compute a smooth matrix $\hat{\mathbf{J}}_k$. As the proposed method incorporates the constraint $Q_3$ (39), RTM can prevent singularities in the estimation of the Jacobian matrix, while IEM and UKF are prone to reach ill-conditioned estimations. We use RTM with $\eta=10$ in the following sections.

(a) Exp1 / Centerline
(b) Exp2 / Contour
(c) Exp3 / Surface
(d) Exp4 / Rigid Plane
Figure 12: Shape manipulation experiments. Exp1, Exp2, and Exp3 are deformation tasks, and Exp4 is a positioning task. The first row shows the 2D image manipulation process (from left to right: initial, intermediate, intermediate, and desired shape), and the second row shows the 3D shape manipulation process. The obstacle moves randomly during the process. The target shape is shown in black, red represents the start shapes, and the gradient colors are the intermediate shapes.
(a) Exp1 / Centerline
(b) Exp2 / Contour
(c) Exp3 / Surface
(d) Exp4 / Rigid Plane
Figure 13: Profiles of the manipulation error $\|\mathbf{s}^*-\mathbf{s}_k\|$, the robot pose $\mathbf{r}_k$, the velocity command $\mathbf{u}_k$, and the manipulability index $Q_3$ among the four experiments (Exp1, …, Exp4). Each experiment adopts four methods, i.e., $\mathrm{CM}_1,\ldots,\mathrm{CM}_4$.
Figure 14: Performance comparison in the manipulation tasks (Exp1 to Exp4).

VII-E Automatic Shape Servoing Control

This section presents four automatic manipulation experiments, labeled Exp1, Exp2, Exp3, and Exp4, respectively. The target shape $\bar{\mathbf{c}}^*$ is obtained from previous demonstrations of the manipulation task, which ensures its reachability. A cardboard sheet is manually placed over the object to produce (partial) occlusions and test the robustness of our algorithm. The estimation methods in [15] and [31] are compared with the proposed receding-time model with MPC ($h=5$ and $h=15$). We label these methods as $\mathrm{CM}_1,\ldots,\mathrm{CM}_4$, and each method has been tuned to achieve a balance of stability, convergence, and responsiveness.

In addition to the feedback shape error $\|\mathbf{s}^*-\mathbf{s}_k\|$, we also compare $T_{max}$ (the number of steps from start to finish), $t_d$ (the steps for $\|\mathbf{s}^*-\mathbf{s}_0\|$ to decrease to 10% of its initial value), $t_s$ (the steps from 10% of $\|\mathbf{s}^*-\mathbf{s}_0\|$ to the threshold value), and $d_{eff}$ (the total moving distance of the end-effector) [11, 23]. Fig. 12 shows the shaping motions of the manipulated objects toward the desired configuration (black curve), with the cardboard blocking the view at various instances. The results demonstrate that the SPN can predict the object's shape during occlusions and feed it back to the controller to enforce the shape servo-loop; This results in the objects being gradually manipulated towards the desired configuration. The error norm $\|\mathbf{s}^*-\mathbf{s}_k\|$ plots in Fig. 13 show that $\mathrm{CM}_3$ provides the best control performance for the error minimization, with $\mathrm{CM}_4$ as the second-best, and $\mathrm{CM}_1$ and $\mathrm{CM}_2$ showing similar performance. A higher $h$ is helpful for feature prediction; yet, since we assume that $\hat{\mathbf{J}}_k$ is constant over the window period $h$, it may lead to inaccurate predictions and even wrong manipulation of the objects. Therefore, $h$ should be chosen according to the performance requirements of the system.

The $Q_{3}$ plots in Fig. 13 show that RTM with $\eta=10$ yields the smallest value, which validates that RTM enhances manipulation feasibility and prevents the shape from falling into singular configurations. The black dashed lines represent the workspace constraints on $r_{x},r_{y},r_{z}$, against which the end-effector trajectories in the Cartesian frame can be compared; the results show that $\mathrm{CM}_{3}$ and $\mathrm{CM}_{4}$ remain within the workspace, while $\mathrm{CM}_{1}$ and $\mathrm{CM}_{2}$ may violate this constraint. Owing to the input saturation adopted on our platform, all four methods satisfy the control saturation constraint. Exp4 shows that our method generalizes well: it applies not only to shape control but also to traditional rigid-object positioning.
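A minimal sketch of how the constant-Jacobian horizon prediction interacts with the saturation and workspace constraints is given below; the box-shaped workspace, the unit sampling time in the pose update, and all bounds are illustrative assumptions rather than the exact formulation used in our controller.

import numpy as np

def rollout(s_k, r_k, J_hat, u_seq, u_max, r_min, r_max):
    # Propagate the shape feature and end-effector pose over the horizon h,
    # keeping J_hat constant and checking the saturation/workspace constraints.
    s, r = s_k.copy(), r_k.copy()
    predictions, feasible = [], True
    for u in u_seq:                                 # u_seq contains h velocity commands
        u = np.clip(u, -u_max, u_max)               # input saturation constraint
        s = s + J_hat @ u                           # constant-Jacobian feature prediction
        r = r + u                                   # pose update (unit sampling time assumed)
        feasible &= bool(np.all((r_min <= r) & (r <= r_max)))  # box workspace constraint
        predictions.append(s.copy())
    return np.array(predictions), feasible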

A performance comparison of these experiments is given in Fig. 14. The results show that our proposed shape servoing framework achieves the best performance in terms of the manipulation speed $T_{max}$, response speed $t_{d}$, convergence speed $t_{s}$, and motion distance $d_{eff}$.

VIII Conclusions

In this paper, we present an occlusion-robust shape servoing framework to control the shapes of elastic objects toward target configurations while considering workspace and saturation constraints. A low-dimensional feature extractor is proposed to represent 3D shapes based on LSM and MLS. A deep neural network is introduced to predict the object's configuration under occlusions and feed it to the shape servo-controller. A receding-time model estimator is designed to approximate the deformation Jacobian matrix subject to constraints on accuracy, smoothness, and singularity avoidance. The conducted experiments validate the proposed methodology on multiple unstructured shape servoing tasks under visual occlusion and with unknown deformation models.

However, our framework has some limitations. For example, since the support field radius $d$ is constant (i.e., it does not adapt to dynamic shapes), the computed representation lacks flexibility. Also, the SPN requires substantial offline training data to work properly, which might pose complications in practice. Note that our method may not accurately shape objects with negligible elastic properties (e.g., fabrics, food materials, etc.). Future work includes the incorporation of shape-reachability detection into the framework in order to determine the feasibility of a given shaping task beforehand.

References

  • [1] J. Zhu, A. Cherubini, C. Dune, D. Navarro-Alarcon, F. Alambeigi, D. Berenson, F. Ficuciello, K. Harada, X. Li, J. Pan, et al., “Challenges and outlook in robotic manipulation of deformable objects,” arXiv preprint arXiv:2105.01767, 2021.
  • [2] X. Li, X. Su, and Y.-H. Liu, “Vision-based robotic manipulation of flexible PCBs,” IEEE/ASME Transactions on Mechatronics, vol. 23, no. 6, pp. 2739–2749, 2018.
  • [3] L. Cao, X. Li, P. T. Phan, A. M. H. Tiong, H. L. Kaan, J. Liu, W. Lai, Y. Huang, H. M. Le, M. Miyasaka, et al., “Sewing up the wounds: A robotic suturing system for flexible endoscopy,” IEEE Robotics & Automation Magazine, vol. 27, no. 3, pp. 45–54, 2020.
  • [4] Z. Hu, P. Sun, and J. Pan, “Three-dimensional deformable object manipulation using fast online gaussian process regression,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 979–986, 2018.
  • [5] J. Sanchez, J.-A. Corrales, B.-C. Bouzgarrou, and Y. Mezouar, “Robotic manipulation and sensing of deformable objects in domestic and industrial applications: a survey,” Int. J. of Rob. Res., 2018.
  • [6] H. Yin, A. Varava, and D. Kragic, “Modeling, learning, perception, and control methods for deformable object manipulation,” Science Robotics, vol. 6, no. 54, 2021.
  • [7] D. Navarro-Alarcon, Y.-H. Liu, J. G. Romero, and P. Li, “Model-free visually servoed deformation control of elastic objects by robot manipulators,” IEEE Transactions on Robotics, vol. 29, no. 6, pp. 1457–1468, 2013.
  • [8] D. Navarro-Alarcon, Y.-h. Liu, J. G. Romero, and P. Li, “On the visual deformation servoing of compliant objects: Uncalibrated control methods and experiments,” The International Journal of Robotics Research, vol. 33, no. 11, pp. 1462–1480, 2014.
  • [9] H. Wang, B. Yang, J. Wang, X. Liang, W. Chen, and Y.-H. Liu, “Adaptive visual servoing of contour features,” IEEE/ASME Transactions on Mechatronics, vol. 23, no. 2, pp. 811–822, 2018.
  • [10] D. Navarro-Alarcon and Y.-H. Liu, “Fourier-based shape servoing: a new feedback method to actively deform soft objects into desired 2-d image contours,” IEEE Transactions on Robotics, vol. 34, no. 1, pp. 272–279, 2017.
  • [11] J. Qi, G. Ma, J. Zhu, P. Zhou, Y. Lyu, H. Zhang, and D. Navarro-Alarcon, “Contour moments based manipulation of composite rigid-deformable objects with finite time model estimation and shape/position control,” IEEE/ASME Transactions on Mechatronics, 2021.
  • [12] Z. Hu, T. Han, P. Sun, J. Pan, and D. Manocha, “3-d deformable object manipulation using deep neural networks,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4255–4261, 2019.
  • [13] J. Qi, G. Ma, P. Zhou, H. Zhang, Y. Lyu, and D. Navarro-Alarcon, “Towards latent space based manipulation of elastic rods using autoencoder models and robust centerline extractions,” Advanced Robotics, vol. 36, no. 3, pp. 101–115, 2022.
  • [14] P. Zhou, J. Zhu, S. Huo, and D. Navarro-Alarcon, “LaSeSOM: A latent and semantic representation framework for soft object manipulation,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 5381–5388, 2021.
  • [15] J. Zhu, D. Navarro-Alarcon, R. Passama, and A. Cherubini, “Vision-based manipulation of deformable and rigid objects using subspace projections of 2d contours,” Robotics and Autonomous Systems, vol. 142, p. 103798, 2021.
  • [16] N. Cazy, P.-B. Wieber, P. R. Giordano, and F. Chaumette, “Visual servoing when visual information is missing: Experimental comparison of visual feature prediction schemes,” in 2015 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2015, pp. 6031–6036.
  • [17] T. Tang, C. Wang, and M. Tomizuka, “A framework for manipulating deformable linear objects by coherent point drift,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3426–3433, 2018.
  • [18] T. Tang and M. Tomizuka, “Track deformable objects from point clouds with structure preserved registration,” The International Journal of Robotics Research, p. 0278364919841431, 2018.
  • [19] D. Navarro-Alarcon, J. Qi, J. Zhu, and A. Cherubini, “A lyapunov-stable adaptive method to approximate sensorimotor models for sensor-based control,” Frontiers in Neurorobotics, vol. 14, 2020.
  • [20] F. Chaumette and S. Hutchinson, “Visual servo control, part I: Basic approaches,” IEEE Robotics and Automation Magazine, vol. 13, no. 4, pp. 82–90, 2006.
  • [21] K. Hosoda and M. Asada, “Versatile visual servoing without knowledge of true jacobian,” in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’94), vol. 1.   IEEE, 1994, pp. 186–193.
  • [22] F. Alambeigi, Z. Wang, R. Hegeman, Y.-H. Liu, and M. Armand, “Autonomous data-driven manipulation of unknown anisotropic deformable tissues using unmodelled continuum manipulators,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 254–261, 2018.
  • [23] R. Lagneau, A. Krupa, and M. Marchal, “Active deformation through visual servoing of soft objects,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 8978–8984.
  • [24] R. Lagneau, A. Krupa, and M. Marchal, “Automatic shape control of deformable wires based on model-free visual servoing,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5252–5259, 2020.
  • [25] L. Han, H. Wang, Z. Liu, W. Chen, and X. Zhang, “Visual tracking control of deformable objects with a FAT-based controller,” IEEE Transactions on Industrial Electronics, 2021.
  • [26] A. Hajiloo, M. Keshmiri, W.-F. Xie, and T.-T. Wang, “Robust online model predictive control for a constrained image-based visual servoing,” IEEE Transactions on Industrial Electronics, vol. 63, no. 4, pp. 2242–2250, 2015.
  • [27] H. Wakamatsu and S. Hirai, “Static modeling of linear object deformation based on differential geometry,” The International Journal of Robotics Research, vol. 23, no. 3, pp. 293–311, 2004.
  • [28] M. Yu, H. Zhong, and X. Li, “Shape control of deformable linear objects with offline and online learning of local linear deformation models,” arXiv preprint arXiv:2109.11091, 2021.
  • [29] H. Abdi et al., “The method of least squares,” Encyclopedia of Measurement and Statistics. CA, USA: Thousand Oaks, 2007.
  • [30] P. Lancaster and K. Salkauskas, “Surfaces generated by moving least squares methods,” Mathematics of computation, vol. 37, no. 155, pp. 141–158, 1981.
  • [31] J. Qi, W. Ma, D. Navarro-Alarcon, H. Gao, and G. Ma, “Adaptive shape servoing of elastic rods using parameterized regression features and auto-tuning motion controls,” arXiv preprint arXiv:2008.06896, 2020.
  • [32] K. Smith, “On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations,” Biometrika, vol. 12, no. 1/2, pp. 1–85, 1918.
  • [33] G. G. Lorentz, Bernstein polynomials.   American Mathematical Soc., 2013.
  • [34] I. J. Schoenberg, Cardinal spline interpolation.   SIAM, 1973.
  • [35] M. J. D. Powell et al., Approximation theory and methods.   Cambridge university press, 1981.
  • [36] W. Y. Tey, N. A. Che Sidik, Y. Asako, M. W. Muhieldeen, and O. Afshar, “Moving least squares method and its improvement: A concise review,” Journal of Applied and Computational Mechanics, vol. 7, no. 2, pp. 883–889, 2021.
  • [37] H. Zhang, C. Guo, X. Su, and C. Zhu, “Measurement data fitting based on moving least squares method,” Mathematical Problems in Engineering, vol. 2015, 2015.
  • [38] J. C. Mason and D. C. Handscomb, Chebyshev polynomials.   CRC press, 2002.
  • [39] R. T. Farouki, “Legendre–bernstein basis transformations,” Journal of Computational and Applied Mathematics, vol. 119, no. 1-2, pp. 145–160, 2000.
  • [40] Z. Huang, Y. Yu, J. Xu, F. Ni, and X. Le, “Pf-net: Point fractal network for 3d point cloud completion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7662–7670.
  • [41] X. Yan, C. Zheng, Z. Li, S. Wang, and S. Cui, “Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5589–5598.
  • [42] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [43] H. Mo, B. Ouyang, L. Xing, D. Dong, Y. Liu, and D. Sun, “Automated 3-d deformation of a soft object using a continuum robot,” IEEE Transactions on Automation Science and Engineering, 2020.
  • [44] K. M. Lynch and F. C. Park, Modern robotics.   Cambridge University Press, 2017.
Jiaming Qi received the M.Sc. degree in integrated circuit engineering from Harbin Institute of Technology, Harbin, China, in 2018. In 2019, he was a visiting Ph.D. student at The Hong Kong Polytechnic University. He is currently pursuing the Ph.D. degree in control science and engineering at Harbin Institute of Technology, Harbin, China. His current research interests include data-driven control for soft object manipulation, visual servoing, robotics, and control theory.
Dongyu Li (Member, IEEE) received the B.S. and Ph.D. degrees in control science and engineering from Harbin Institute of Technology, China, in 2016 and 2020, respectively. He was a joint Ph.D. student with the Department of Electrical and Computer Engineering, National University of Singapore, from 2017 to 2019, and a research fellow with the Department of Biomedical Engineering, National University of Singapore, from 2019 to 2021. He is currently an Associate Professor with the School of Cyber Science and Technology, Beihang University, China. His research interests include networked system cooperation, adaptive systems, and robotic control.
Yufeng Gao (Student Member, IEEE) received the Bachelor of Engineering and Bachelor of Economics degrees from Wuhan University of Technology and Wuhan University, China, respectively, both in 2016, and the Ph.D. degree in control science and engineering from Harbin Institute of Technology, China, in 2021. He was a joint Ph.D. student with the Chair of Automatic Control Engineering, Technical University of Munich, Germany, from 2018 to 2019. His current research interests include spacecraft system modeling, control, and optimization.
Peng Zhou (Student Member, IEEE) was born in China. He received the M.Sc. degree in software engineering from the School of Software Engineering, Tongji University, Shanghai, China, in 2017, and is currently pursuing the Ph.D. degree in mechanical engineering at The Hong Kong Polytechnic University, Kowloon, Hong Kong. His research interests include deformable object manipulation, motion planning, and robot learning.
David Navarro-Alarcon (Senior Member, IEEE) received the Ph.D. degree in mechanical and automation engineering from The Chinese University of Hong Kong (CUHK), Shatin, Hong Kong, in 2014. From 2014 to 2017, he was a Postdoctoral Fellow and then a Research Assistant Professor with the CUHK T Stone Robotics Institute. Since 2017, he has been with The Hong Kong Polytechnic University, Hong Kong, where he is currently an Assistant Professor with the Department of Mechanical Engineering and the Principal Investigator of the Robotics and Machine Intelligence Laboratory. His current research interests include perceptual robotics and control theory.