ExTraCT – Explainable Trajectory Corrections from language inputs using Textual description of features

J-Anne Yow¹,²,∗, Neha Priyadarshini Garg¹, Manoj Ramanathan¹ and Wei Tech Ang¹,²

¹ Rehabilitation Research Institute of Singapore (RRIS), Clinical Science Building, 308232 Singapore. RRIS is a joint research institute by Nanyang Technological University (NTU), Agency for Science, Technology and Research (A*STAR) and National Healthcare Group (NHG), Singapore.

² Singapore-ETH Centre, Future Health Technologies Programme, CREATE Tower, Singapore.

∗ Correspondence: J-Anne Yow ([email protected])

This work is supported by the Rehabilitation Research Institute of Singapore and the National Research Foundation, Prime Minister’s Office, Singapore, under its Campus for Research Excellence and Technological Enterprise (CREATE) programme.

Abstract

Natural language provides an intuitive and expressive way of conveying human intent to robots. Prior works employed end-to-end methods for learning trajectory deformations from language corrections. However, such methods do not generalize to new initial trajectories or object configurations. This work presents ExTraCT, a modular framework for trajectory corrections using natural language that combines Large Language Models (LLMs) for natural language understanding and trajectory deformation functions. Given a scene, ExTraCT generates the trajectory modification features (scene-specific and scene-independent) and their corresponding natural language textual descriptions for the objects in the scene online based on a template. We use LLMs for semantic matching of user utterances to the textual descriptions of features. Based on the feature matched, a trajectory modification function is applied to the initial trajectory, allowing generalization to unseen trajectories and object configurations. Through user studies conducted both in simulation and with a physical robot arm, we demonstrate that trajectories deformed using our method were more accurate and were preferred in about 80% of cases, outperforming the baseline. We also showcase the versatility of our system in a manipulation task and an assistive feeding task.

Index Terms:

Natural dialog for HRI, human-centered robotics, human factors and human-in-the-loop

I Introduction

Motion planning algorithms optimize trajectories based on predefined cost functions, which usually consider robot dynamics and environmental constraints. However, robots need to account for human preferences to assist effectively when working with humans. Human preferences can vary based on the environment and human factors; e.g. while throwing trash, humans might want the robot to avoid food items, or when placing a cup, they may prefer the robot to stay close to the table and avoid other objects. Thus, incorporating all possible human preferences in the cost function is challenging. To address this, we explore the problem of natural language trajectory corrections, i.e. modifying a robot’s initial trajectory based on natural language corrections provided by a human. We choose natural language as it provides an intuitive and expressive way for humans to convey their preferences.

Figure 1: Architecture of ExTraCT. Given the objects in a scene, the features $\phi$ and corresponding textual descriptions $T_{\phi}$ are generated online. We obtain the embedding of the language correction, $q(l)$, and the embeddings $q(t_{\phi})$ of the phrases $t_{\phi}\in T_{\phi}$ in the textual descriptions of the features, and use semantic textual similarity to obtain the most similar textual description, which is mapped to feature $\phi^{*}$. A deformation function $\delta$ is used to deform the initial trajectory $\xi_{0}$ based on the feature $\phi^{*}$ and the object positions in the environment $E$. A trajectory optimizer is used to ensure that the robot’s kinematic constraints are satisfied.

A key challenge in natural language trajectory corrections is mapping the natural language to the robot action, which is the deformed trajectory. Existing works [4, 21, 6, 7] try to learn a direct mapping between natural language and robot trajectories or actions using offline training paradigms. To improve generalization to different objects and phrases, some works [21, 6, 7] leverage foundation models, i.e. BERT [10] and CLIP [17]. However, these models struggle to generalize to varied initial trajectories and object poses due to dataset limitations. Furthermore, with a direct mapping learnt between natural language and robot action, it is difficult to understand the root cause of failures, which could stem from issues in language grounding, scene understanding, or inaccuracies in the trajectory deformation function.

To address these limitations, we separate language understanding from trajectory deformation, thus enabling a more accurate interpretation of instructions. First, we match the language corrections to a short description of the change in trajectory, which we term a feature. We define a set of feature templates and their corresponding textual description templates, which are then used to generate features and their textual descriptions (Fig. 1) for any given scene online. The language uttered by the user can then be mapped to the most likely feature by computing the semantic similarity between the textual descriptions of the feature and the correction uttered by the user using Large Language Models [10, 24, 5, 8]. For features that are mapped with insufficient confidence ( $\leq 0.6$ ), our approach can alert the end-user instead of generating a random modified trajectory.

Once the most likely feature is determined, the initial trajectory can be modified based on a trajectory deformation function, allowing generalization to different object configurations and trajectories.

This separation of language understanding from trajectory deformation allows for more precise and context-aware robotic responses. By decoupling these two elements, our method can easily expand to new tasks, as shown in Section IV-E. Furthermore, it provides clearer insights into potential sources of failure, as it distinguishes between errors in language interpretation and trajectory execution. Thus, our approach improves the accuracy and generalization of language corrections and enhances the interpretability and reliability of robotic systems in executing these corrections.

While our approach is limited by the feature types in the feature templates, end-to-end training methods would encounter similar limitations when dealing with unseen features. However, they are additionally limited by the object configurations and initial trajectories used in the training data, as shown in Section IV-A. Our formulation can generalize to different object configurations and initial trajectories without training a model. Our key insight is that integrating the strengths of LLMs in handling language diversity with a hand-crafted approach for trajectory modifications bypasses the need for end-to-end training while achieving comparable or better performance. Furthermore, end-to-end training methods are data-intensive. Collecting data is challenging in the robotics domain, while generating data would require a similar amount of hand-crafted features as our proposed approach.

We evaluated our system through within-subject user studies in simulation and the real world. Our results show that our method had higher accuracy and was rated higher in approximately $80\%$ of test cases compared to the state-of-the-art method LaTTe [7], which also uses LLMs but was trained in an end-to-end fashion. We further analysed the failure cases of our method and showed that our method could be improved further by adding more phrases to the textual description templates.

Our contributions in this work are threefold: (1) We introduce a modular framework that integrates LLMs with trajectory deformation functions for trajectory corrections using language without end-to-end training. (2) We conducted extensive quantitative experiments on a substantial dataset, complemented by user studies comparing our method against a baseline that deforms trajectories based on language in an end-to-end fashion. (3) We demonstrate the versatility of our framework through its application to a range of tasks, including general object manipulation and assistive feeding.

II Related Work

There are various ways of conveying human preference to the robot, such as through language [6, 7, 21], physical interaction [2, 1, 3], rankings [13] and joystick inputs [22]. In this work, we use language as it is the most natural and intuitive way of communicating human preference [23].

Language correction works can be broadly classified into two categories – generating new trajectories [4, 21, 14, 26] and modifying existing ones [6, 7]. The first category, trajectory generation, involves creating new motion plans to enable robots to correct errors to complete tasks. Consider a task where a robot provides feeding assistance to a user. An online correction such as “move 5cm to the left” when acquiring the food might be provided so the robot can align more accurately with the food morsel before acquiring it. This directs the robot to generate a new trajectory to move to the left for more precise alignment. On the other hand, our work focuses on the second category of corrections, which involves modifying an existing trajectory. This includes relative corrections such as “scoop a larger amount of food”. Such corrections require adjustments to an existing trajectory, requiring an understanding of the initial plan or trajectory. Modifying existing trajectories is important, as human preferences are commonly expressed in relative terms.

Both categories of language corrections face a common challenge: translating natural language to robot actions, also known as language grounding. Language grounding can be categorized into three approaches – semantic parsing to probabilistic graphs, end-to-end learning using embeddings and prompting LLMs to generate code.

II-1 Semantic Parsing to Probabilistic Graphs

Earlier works in language correction use a grammatical structure to represent language and ground language by learning the weights of functions of factors [4]. A probabilistic graphical model (distributed correspondence graph, DCG) is used to ground language to a set of features that relate to the environment and context. Each feature represents a cost or constraint, which a motion planner then optimizes. Given the variety of possible phrases a human can provide to correct the same feature, grounding language using a fixed grammatical structure limits the generalization of this approach. We differ from this approach in how we ground language. Instead of parsing language corrections to phrases with a fixed grammar, we leverage the textual embeddings in LLMs [10, 24, 5, 8] to ground them to textual descriptions of features.

II-2 End-to-End Learning Using Embeddings

Prior works have explored LLMs, combining BERT [10] (textual) embeddings and CLIP [17] (textual and visual) embeddings to align visual and language representations. Sharma et al. [21] learn a 2D cost map from CLIP and BERT embeddings, which converts language corrections to a cost map that is optimized using a motion planner to obtain the corrected trajectory. Bucker et al. go a step further by directly learning a corrected trajectory from CLIP and BERT embeddings in 2D [6] and 3D [7]. Additionally, a transformer encoder obtains geometric embeddings of the object poses and an initial trajectory. A transformer decoder combines textual, visual and geometric embeddings to generate the corrected trajectory. Geometric embeddings have also been combined with textual embeddings to learn a robot policy conditioned on language [9], but a shared autonomy paradigm was used to reduce the complexity. These approaches deploy an end-to-end learning approach requiring large multi-modal datasets, including image, text, and trajectory data, which are difficult to obtain. Compared to these works, our proposed approach leverages LLMs to summarize natural language corrections into concise and informative textual representations, thus removing the need for multi-modal training data to ground language corrections.

II-3 Prompting LLMs to Generate Code

Concurrent with our work, advances in Large Language Models (LLMs) have led to recent works [14, 12, 25, 26] employing LLMs to generate executable code from natural language instructions. These works mainly focus on generating robot plans or trajectories given a phrase, either by calling pre-defined motion primitives [14, 26] or by designing specific reward functions [12, 25]. This approach has shown versatility in handling diverse instructions and constructing sequential policy logic. However, the adoption of these methods is hindered by significant computational costs due to the need for larger, general-purpose models like GPT-4, which cannot be run on standard commercial hardware. Remote execution via API calls is a viable option, but this is limited by the need for a stable internet connection and the rate limits imposed on the APIs, which can hinder real-time applications. Moreover, similar to our approach that integrates hand-crafted templates, the effective use of LLMs for specific task-oriented code generation also requires many in-context examples. In contrast, our approach provides a more cost-effective solution for language understanding in robotic systems by using semantic textual similarity, eliminating the need for extensive prompting or high computational demands. Although our method requires additional modules to process complex language utterances, such as referring expressions and compound sentences, it is a more viable alternative in settings constrained by limited computational resources and internet connectivity. This makes our framework well-suited for practical applications where efficiency is a key consideration.

III Approach

III-A Problem Definition

Our goal is to develop an interface that allows users to modify the trajectory of robot manipulators based on their preferences conveyed through natural language corrections. More specifically, the problem can be described as finding the most likely trajectory $\xi^{*}$ given the environment $E$ , the language correction $l$ provided by the human and an initial trajectory $\xi_{0}$ .

\xi^{*}=\underset{\xi}{\mathrm{argmax}}\,P(\xi|E,l,\xi_{0})

(1)

The environment consists of a set of objects, with each object having two attributes – object name $o_{name}$ and object position $o_{pos}$ , i.e. $E=\{(o_{name}^{i},o_{pos}^{i})\}$ . The object names and poses in the environment can be obtained using perception algorithms (e.g. Mask R-CNN [11], YOLO [18]) or foundational models (e.g. OWL-ViT [16], Grounding DINO [15]). In our work, we used Mask R-CNN to get the environment $E$ .

III-B Features

The space of possible trajectories $\xi$ is infinite. However, realistically, it can be bounded by the environment and motion planner constraints [4]. Therefore, we assume that the language correction $l$ can be mapped to a finite set of features $\Phi$ that can be scene-specific or scene-independent. Each feature $\phi\in\Phi$ corresponds to a deformed trajectory, i.e. $\xi=\delta(\phi,\xi_{0},E)$ where $\delta$ is the trajectory deformation function which we describe in Section III-D. Thus, we can obtain the most likely trajectory $\xi^{*}=\delta(\phi^{*},\xi_{0},E)$ from the most likely feature $\phi^{*}$. (We assume that a given language correction will correspond to only one feature. Practically, a complex language correction can map to multiple features. We will address this issue in the future.) This reduces the problem of finding the most likely trajectory $\xi^{*}$ to the problem of finding the most likely feature $\phi^{*}$. Thus, our problem can be rewritten as:

\begin{split}\phi^{*}&=\underset{\phi\in\Phi}{\mathrm{argmax}}\,P(\phi|l)\\ \end{split}

(2)

where $P(\phi|l)$ is the probability of feature $\phi$ given language utterance $l$. The features can be categorized into two types – scene-specific and scene-independent features. Scene-specific features depend on the objects in the scene, while scene-independent features do not.

As proof of concept, we define 2 scene-specific and 6 scene-independent feature templates (Table I). Scene-specific features are generated online for each object detected in the scene, allowing the approach to generalize to different objects. They are defined based on object distance, which is to either increase (obj_distance_increase) or decrease (obj_distance_decrease) the distance to an object. Scene-independent features are based on the 3 axes in cartesian space, i.e. to move the gripper up (z_cart_increase), down (z_cart_decrease), left (y_cart_decrease), right (y_cart_increase), forward (x_cart_increase) and backward (x_cart_decrease). The space of feature templates is expandable and can be made specific to the robot’s task, as demonstrated in Section IV-E.

TABLE I: Feature Templates (FT) and their Textual Description Templates (TDT)

FT	obj_distance_decrease	obj_distance_increase
TDT	Move closer to {obj}	Move further away from {obj}
	Stay close to {obj}	Stay away from {obj}
	Decrease distance to {obj}	Increase distance to {obj}
	Keep a smaller distance from {obj}	Keep a bigger distance from {obj}
		Avoid {obj}
FT	z_cart_decrease	z_cart_increase
TDT	Move closer to table	Move further away from table
	Stay closer to table	Stay away from table
	Move lower	Move higher
	Move down	Move up
	Stay down	Stay up
	Go to the bottom	Stay on the upper part
	Down	Up
	Low	Top
		Go to the top
FT	y_cart_decrease	y_cart_increase
TDT	Stay on the left	Stay on the right
	Go to the left	Go to the right
	Move left	Move right
	Move more towards the left	Move more towards the right
	Left	Right
FT	x_cart_decrease	x_cart_increase
TDT	Stay at the back	Stay at the front
	Go to the back	Go to the front
	Move back	Move front
	Stay back	Stay front
	Move backward	Move forward
	Go behind

Only the bolded phrases were used for the analysis in Section IV-C3.

III-C Textual Descriptions and Optimal Feature Selection

To compute $P(\phi|l)$ , we generate a textual description $T_{\phi}$ for each feature $\phi$ . Each textual description consists of a set of language phrases, which includes commonly used phrases to modify the trajectory for a particular feature. Table I shows the textual description templates (TDT) for each feature template (FT). During feature generation for a scene, the {obj} placeholder in scene-specific feature templates will be replaced by the object name $o_{name}$ obtained using object detection, as shown in Fig. 1.
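The template expansion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the dictionary names and the `generate_features` helper are hypothetical, only a few phrases per feature are copied from Table I, and the `<object>_distance_increase` naming is assumed from the feature names quoted in Section IV-B4.

```python
# Abbreviated copies of the Table I templates (a subset of phrases per feature).
SCENE_SPECIFIC_TEMPLATES = {
    "{obj}_distance_decrease": ["Move closer to {obj}", "Stay close to {obj}"],
    "{obj}_distance_increase": ["Move further away from {obj}", "Avoid {obj}"],
}
SCENE_INDEPENDENT_TEMPLATES = {
    "z_cart_decrease": ["Move lower", "Move down"],
    "z_cart_increase": ["Move higher", "Move up"],
    "y_cart_decrease": ["Go to the left", "Move left"],
    "y_cart_increase": ["Go to the right", "Move right"],
    "x_cart_decrease": ["Go to the back", "Move backward"],
    "x_cart_increase": ["Go to the front", "Move forward"],
}


def generate_features(object_names):
    """Return {feature name: list of phrases} for the objects detected in one scene."""
    # Scene-independent features are the same for every scene.
    features = {name: list(phrases) for name, phrases in SCENE_INDEPENDENT_TEMPLATES.items()}
    # Scene-specific features are instantiated once per detected object.
    for obj in object_names:
        for name_template, phrase_templates in SCENE_SPECIFIC_TEMPLATES.items():
            features[name_template.format(obj=obj)] = [
                phrase.format(obj=obj) for phrase in phrase_templates
            ]
    return features


if __name__ == "__main__":
    for name, phrases in generate_features(["cup", "bottle"]).items():
        print(name, phrases)
```

Because features are generated per scene, adding a phrase to a template makes it available for every object without any retraining.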

We initially define a few phrases in the textual description templates. If feature matching fails due to out-of-distribution user utterances, additional phrases can be added to the textual description templates easily to improve feature matching. We examine how the phrases in the textual description templates can affect feature matching and performance in Section IV-C3.

Since there is a one-to-one mapping between the feature and its textual description, Eq. 2 can be rewritten as:

\begin{split}\phi^{*}&=\underset{\phi\in\Phi}{\mathrm{argmax}}\,P(T_{\phi}|l)\\ \end{split}

(3)

To compute $P(T_{\phi}|l)$, we leverage large language models (LLMs) to map diverse language phrases to fixed-length vectors called embeddings. To capture the semantic meaning of sentences, we used Sentence Transformers [19], which is fine-tuned for semantic similarity tasks. We chose the pre-trained all-MiniLM-L6-v2 model [24] provided by Sentence Transformers (https://www.sbert.net/docs/pretrained_models.html) as it provided us with the best trade-off between speed and performance. Semantically closer language phrases are more likely to have higher cosine similarity between their embeddings. Thus, $P(T_{\phi}|l)$ can be defined as follows:

P(T_{\phi}|l)\propto\underset{t_{\phi}\in T_{\phi}}{\mathrm{max}}\,\frac{q(t_{\phi})\cdot q(l)}{\|q(t_{\phi})\|\,\|q(l)\|}

(4)

where $q(x)$ is the embedding for a language phrase $x$. Once the most likely feature $\phi^{*}$ is obtained, the trajectory $\xi^{*}$ can be obtained using a deformation function.
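As a rough sketch of Eqs. 3 and 4, the snippet below scores each feature by the maximum cosine similarity between the correction and any phrase in that feature's textual description, then rejects matches at or below the 0.6 confidence threshold mentioned in Section I. To stay self-contained it uses a toy bag-of-words similarity (`toy_similarity`, a hypothetical stand-in); in the actual system the similarity would come from sentence-transformer embeddings. The function names are illustrative assumptions.

```python
import math
from collections import Counter


def toy_similarity(phrase_a, phrase_b):
    """Cosine similarity of bag-of-words counts.

    Stand-in only: NOT the sentence-embedding model used in the paper.
    """
    u, v = Counter(phrase_a.lower().split()), Counter(phrase_b.lower().split())
    dot = sum(u[word] * v[word] for word in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def match_feature(correction, descriptions, similarity=toy_similarity, threshold=0.6):
    """Return (feature, score); feature is None when the best score is <= threshold.

    `descriptions` maps each feature name to a non-empty list of phrases.
    """
    best_feature, best_score = None, -1.0
    for feature, phrases in descriptions.items():
        # Eq. 4: a feature is scored by its single best-matching phrase.
        score = max(similarity(phrase, correction) for phrase in phrases)
        if score > best_score:
            best_feature, best_score = feature, score
    if best_score <= threshold:
        return None, best_score  # insufficient confidence: alert the user instead
    return best_feature, best_score


if __name__ == "__main__":
    descriptions = {
        "z_cart_increase": ["Move up", "Move higher"],
        "z_cart_decrease": ["Move down", "Move lower"],
    }
    print(match_feature("move up", descriptions))
    print(match_feature("spin around", descriptions))
```

Passing the similarity function as a parameter keeps the matching logic independent of the embedding model, so a different model can be swapped in without touching the selection rule.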

III-D Deformation Function

A deformation function $\delta(\phi,\xi_{0},E)$ modifies a trajectory based on a feature $\phi$ . First, we calculate the force $F$ to be exerted on each waypoint of the trajectory.

For scene-specific features, i.e. object distance features, the force exerted is dependent on the object position $o_{pos}$ and a radius of deformation $r$ . In our experiments, we set $r=0.3$ , which was determined empirically. We note that $r$ can be set adaptively based on environmental constraints and human preferences but this will be part of our future work. For waypoints within the radius of deformation $r$ from the object position $o_{pos}$ , a force is applied on the waypoints in the direction of the distance vector between the waypoint and the object. The force is $0$ for other waypoints.

For scene-independent features, a force is exerted on all waypoints of the trajectory, where the direction of the force is dependent on the feature.

The trajectory is deformed based on the force calculated on each waypoint: $\delta(\phi,\xi_{0},E)=\xi_{0}+wF$, where the weight $w$ changes the magnitude of the deformation. We empirically determined the value of $w$ to be a constant of $1.0$ in our experiments, but this can be modified based on the intensity of the language correction in the future.

Finally, the deformed trajectory is passed to a trajectory optimizer to ensure the robot’s kinematic constraints are satisfied.
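The deformation step can be illustrated with the following sketch, which computes a per-waypoint force and returns $\xi_{0}+wF$. Several details are assumptions made for illustration because the text does not specify them: the unit magnitude of the axis-aligned forces, the linear falloff of the object-distance force inside the radius $r$, and the `deform` and `AXIS_DIRECTIONS` names. The trajectory optimizer that follows this step is not shown.

```python
import math

# Unit force directions for the scene-independent features (magnitude is an assumption).
AXIS_DIRECTIONS = {
    "x_cart_increase": (1.0, 0.0, 0.0), "x_cart_decrease": (-1.0, 0.0, 0.0),
    "y_cart_increase": (0.0, 1.0, 0.0), "y_cart_decrease": (0.0, -1.0, 0.0),
    "z_cart_increase": (0.0, 0.0, 1.0), "z_cart_decrease": (0.0, 0.0, -1.0),
}


def deform(trajectory, feature, obj_pos=None, r=0.3, w=1.0):
    """Return xi_0 + w * F for a list of (x, y, z) waypoints."""
    if feature in AXIS_DIRECTIONS:
        sign = None  # scene-independent: same force on every waypoint
    elif feature.endswith("_distance_increase"):
        sign = 1.0   # push waypoints away from the object
    elif feature.endswith("_distance_decrease"):
        sign = -1.0  # pull waypoints towards the object
    else:
        raise ValueError(f"unknown feature: {feature}")

    deformed = []
    for point in trajectory:
        if sign is None:
            force = AXIS_DIRECTIONS[feature]
        else:
            # Vector from the object to the waypoint.
            diff = [p - o for p, o in zip(point, obj_pos)]
            dist = math.sqrt(sum(d * d for d in diff))
            # Assumed linear falloff: strongest near the object, zero at and beyond r.
            falloff = 1.0 - dist / r if dist < r else 0.0
            force = [sign * falloff * d for d in diff]
        deformed.append(tuple(p + w * f for p, f in zip(point, force)))
    return deformed


if __name__ == "__main__":
    xi_0 = [(0.15, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(deform(xi_0, "cup_distance_increase", obj_pos=(0.0, 0.0, 0.0)))
    print(deform(xi_0, "z_cart_increase"))
```

With this particular falloff, a "decrease" force scales a waypoint's distance to the object by `dist / r` at `w = 1.0`, so the waypoint never overshoots the object. Other force profiles would be equally compatible with the description in the text.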

IV Experiments

We conducted simulation and real-world experiments to validate our proposed approach. We hypothesize that:

H1 Our approach will be able to generalize to natural language phrases and environments with different objects;

H2 Our approach will deform trajectories at least as accurately as end-to-end methods for trajectory deformation;

H3 Our approach will obtain higher rankings from users compared to end-to-end methods;

H4 Our approach will be more interpretable than end-to-end methods.

We compared our approach against LaTTe [7], an end-to-end approach for language corrections that can deform trajectories in 3D space. Experiments were conducted to evaluate the generalization ability of the proposed approach. We also conducted user studies to evaluate the end-users’ satisfaction with the deformed trajectories and to obtain more diverse language utterances. The Nanyang Technological University Institutional Review Board approved the user studies. In the next section, we briefly describe our baseline method, LaTTe.

IV-A Baseline

LaTTe first created a dataset consisting of tuples of initial trajectory, language correction, object images, names and poses in the scene, and deformed trajectory. This dataset is then used to train a neural network which can map the initial trajectory, language correction and objects in the scene to a deformed trajectory. For generalization to various objects in the scene and different natural language phrases, pre-trained BERT and CLIP are used to generate embeddings for the language correction and the object images, which are provided as input to the neural network. We used the model provided by the authors (https://github.com/arthurfenderbucker/LaTTe-Language-Trajectory-TransformEr), trained on 70k samples, for our experiments. Before generating the deformed trajectory using the trained model, a locality factor hyper-parameter must be set, determining the range of desired change over the trajectory. In our user studies, we set the locality factor hyper-parameter to be $0.3$, the mean value in their dataset.

IV-B Generalization Experiments

To evaluate the ability of our approach to generalize to different object configurations, trajectories and corrections, we used the dataset that was originally employed to train LaTTe. The LaTTe dataset includes 100k samples, with object names sampled from the ImageNet dataset. The full dataset contains three types of trajectory modifications – cartesian changes, object distance changes and speed changes. We removed the samples related to speed changes in our evaluation as we did not include speed change features in our templates, resulting in 65261 samples. We set the locality hyper-parameter for LaTTe by referencing the value provided in each sample in the dataset.

IV-B1 Evaluation Metrics

To evaluate the accuracy of the deformed trajectories, LaTTe compared the similarity between the deformed trajectory using their method and the ground-truth output trajectory in the dataset using metrics like dynamic time warping (DTW) distance. However, this may not accurately capture the correctness of the trajectory modification. For example, even if the trajectory change occurs in a direction contrary to the intended one, the DTW distance may still register as small. Additionally, relying on a single ground-truth trajectory for comparison is problematic, as multiple valid trajectories can achieve the correct deformation.

Thus, we developed a way to assess the performance of trajectory deformations based on language. Our approach evaluates the accuracy of deformed trajectories by considering the type of correction applied, as shown in Fig. 2.

For corrections involving cartesian changes, we first sampled a range of trajectory deformations of different intensities, i.e. by varying the value of $w$ from -2.5 to 2.5. Dynamic time warping (DTW) distance was then used to find the weight that best represents the deformed trajectory. A deformation is deemed correct if the weight aligns with the intended change direction. For example, a positive weight should correspond to a positive change (i.e. increase), and a negative weight should correspond to a negative change (i.e. decrease).
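A possible reading of this check is sketched below: shift the initial trajectory along the relevant axis for each candidate weight, pick the weight whose shifted copy is DTW-closest to the deformed trajectory, and compare its sign with the intended direction. The 0.1 step between candidate weights, the pure axis shift used to generate candidates, and all function names are assumptions for illustration.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance with Euclidean point-to-point cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def best_weight(initial, deformed, axis, weights):
    """Weight whose axis-shifted copy of `initial` is DTW-closest to `deformed`."""
    def shifted(w):
        return [
            tuple(c + (w if k == axis else 0.0) for k, c in enumerate(point))
            for point in initial
        ]
    return min(weights, key=lambda w: dtw_distance(shifted(w), deformed))


def cartesian_correct(initial, deformed, axis, intended_sign, weights=None):
    """True if the best-fitting weight has the same sign as the intended change."""
    if weights is None:
        weights = [k / 10 for k in range(-25, 26)]  # -2.5 to 2.5; step 0.1 is assumed
    return best_weight(initial, deformed, axis, weights) * intended_sign > 0


if __name__ == "__main__":
    xi_0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    xi = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5), (2.0, 0.0, 0.5)]
    print(cartesian_correct(xi_0, xi, axis=2, intended_sign=+1))
```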

For corrections involving object distance changes, we measure the accuracy by comparing the distance between the object and the waypoint closest to the object in both the original and deformed trajectories. The key metric is whether the modified trajectory brings the waypoint closer to or further from the target object, which can be quantified without sampling multiple trajectories. A deformation is deemed correct if, for a positive change, the waypoint in the deformed trajectory is further from the object, and for a negative change, it is closer to the object.
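The object distance check is simpler and might look like the sketch below. It interprets "the waypoint closest to the object" as the minimum waypoint-to-object distance computed separately for each trajectory, which is one plausible reading; the function names are hypothetical.

```python
import math


def min_distance(trajectory, obj_pos):
    """Distance from the object to its closest waypoint on the trajectory."""
    return min(math.dist(point, obj_pos) for point in trajectory)


def object_distance_correct(initial, deformed, obj_pos, intended_sign):
    """intended_sign: +1 for 'increase distance', -1 for 'decrease distance'."""
    before = min_distance(initial, obj_pos)
    after = min_distance(deformed, obj_pos)
    return after > before if intended_sign > 0 else after < before


if __name__ == "__main__":
    obj = (0.0, 0.0, 0.0)
    xi_0 = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
    xi = [(0.0, 0.0, 0.3), (1.0, 0.0, 0.1)]
    print(object_distance_correct(xi_0, xi, obj, intended_sign=+1))
```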

To facilitate comparison with the findings reported in LaTTe, we also evaluated the performance in terms of the similarity between the deformed trajectory of each approach and the ground-truth output trajectory in the dataset using the dynamic time-warping (DTW) distance.

IV-B2 Results

TABLE II: Generalization Experiment Results

Approach	Overall Accuracy	Overall DTW	Object Distance Changes Accuracy	Object Distance Changes DTW	Cartesian Changes Accuracy	Cartesian Changes DTW
ExTraCT	89.23%	2.6386	86.10%	3.0600	92.27%	2.2100
LaTTe	73.37%	3.4099	53.48%	3.7499	92.64%	3.0804

Table II shows the evaluation results. Our approach outperformed the baseline overall, especially for object distance changes, showing support for H1 and H2. To understand whether performance depends on the type of change, we analysed the results for each type separately. Both approaches perform similarly for cartesian changes, but the performance of LaTTe degrades for object distance changes, with an accuracy of only $53.48\%$, which is approximately equivalent to random chance.

IV-B3 Analysis of Failure Cases in LaTTe

To better understand the degraded performance of LaTTe for distance changes, we performed qualitative analysis on LaTTe by picking a random sample in LaTTe’s dataset (Fig. 3a) and modifying the target object pose (highlighted in red, Fig. 3b), initial trajectory (Fig. 3d) and language correction (Fig. 3c). The trajectories deformed by LaTTe shown in Fig. 3 demonstrate that these changes did not result in an expected change in the deformed trajectory. Fig. 3b and 3c show a similar (but incorrect) deformed trajectory to the original sample, while Fig. 3d shows the deformed trajectory in the opposite direction to the correction. On the other hand, ExTraCT deforms the trajectories correctly in all these cases.

To understand why a change in the language correction (Fig. 3c) did not result in a correct change in the deformed trajectory, we analysed the embeddings of the language correction. We used Sentence Transformers [19] to find the eight sentences in LaTTe’s dataset with the highest semantic similarity to the correction. Sentences that convey the opposite meaning but are lexically similar, such as “stay closer to the Egyptian cat”, had a high cosine similarity score to “stay further away from the Egyptian cat”, as highlighted in Table III. This makes learning trajectory deformations from textual embeddings difficult, as the BERT embeddings used may not always capture the semantic meaning of the corrections.

Unfortunately, for the cases where we modified the target object pose (Fig. 3b) and the input trajectory (Fig. 3d), it was difficult to understand why failures occurred. Failures could arise from various factors, such as errors in embedding geometrical information like trajectories and object configurations and insufficient training data. The lack of transparency makes identifying and rectifying specific issues difficult, motivating our separation of the problem into two distinct phases – language understanding and trajectory deformation.

IV-B4 Analysis of Failure Cases in ExTraCT

All the failure cases in ExTraCT can be attributed to incorrect feature mapping. For example, “Go to the upper part” was incorrectly mapped to z_cart_decrease as the most similar phrase was “Go to the bottom”. “Drive a lot closer to the meat market” was incorrectly mapped to meat market_distance_increase as the closest phrase was “Stay a lot further away from meat market”. These errors occurred because the embeddings from large language models (LLMs) may not always capture the nuanced semantic meanings of language phrases (Table III).

To improve the language understanding capabilities of our system, we can either fine-tune the existing language model for our application or employ a larger LLM with better semantic understanding. Another way to improve performance is to include previously mismatched phrases in our textual description templates. In Section IV-C3, we show how this can improve performance. Note that for this evaluation, we deliberately did not include all the phrases in LaTTe’s dataset in our template descriptions, as that would naturally lead to an exact match in the sentence, which may not realistically reflect the model’s true language understanding capabilities.

TABLE III: Top 8 Sentences with Highest Similarity Scores for “Stay further away from the Egyptian cat” in LaTTe’s Dataset

Sentence	Similarity Score
Stay further away from the Egyptian cat	1.00
Stay a lot further away from the Egyptian cat	0.98
Stay closer to the Egyptian cat *	0.91
Walk a lot further away from the Egyptian cat	0.90
Walk a bit further away from the Egyptian cat	0.89
Stay a lot closer to the Egyptian cat *	0.88
Stay very closer to the Egyptian cat *	0.88
Stay a bit closer to the Egyptian cat *	0.87

* Sentences with the opposite meaning to the query (bolded in the original table).

IV-C User Studies in Simulation

We recruited 15 subjects (8 male, 7 female) for the simulation study. Of these participants, 6 did not have a background in robotics. The study interface (Fig. 4), modified from LaTTe, displayed the initial trajectory, the modified trajectory using LaTTe and the modified trajectory using our method (ExTraCT) at the same time. The trajectories were displayed to the users in 3D and subjects could interact with the plots to change the view. The participants were not informed which method was used to deform the trajectories, and the modified trajectories were labelled only with “1” and “2”. The labels were kept consistent throughout the experiment so that we could obtain the subjects’ overall rankings and preferences for each method across different scenes.

Three scenes were presented to each subject, as shown in Figures 4 and 5. Different household objects were placed at randomized locations in each scene. The objects in scenes 1 and 2 were selected from LaTTe’s dataset, while those in scene 3 were randomly selected from outside LaTTe’s dataset. Since there were no images of the objects, no CLIP image embeddings were used for LaTTe. Instead, CLIP textual embeddings of the object names were used for LaTTe to identify the correct target object.

There were 5 language corrections for each scene: 3 were given by us (for consistency across subjects and to familiarize subjects with the types of corrections), and 2 were given by the subjects. Subjects were free to provide any language corrections, after which both approaches would modify the trajectory. The corrections we provided were in LaTTe’s dataset for scenes 1 and 2, and were not present in LaTTe’s dataset for scene 3. After each correction, subjects rated how well each method modified the trajectory and indicated which trajectory they preferred, if any. Specifically, for each method, they had to choose whether the deformed trajectory was 1) Completely wrong, 2) Somewhat wrong, 3) Neutral, 4) Somewhat correct, or 5) Completely correct. After each trajectory deformation and at the end of the experiments, they had to compare both methods. For this, they had to choose whether 1) Method 1 was much better, 2) Method 1 was a bit better, 3) Both methods were the same, 4) Method 2 was a bit better, or 5) Method 2 was much better. At the end of the study, subjects were asked to elaborate on which method they preferred and why. We also measured the performance based on the accuracy of the trajectory deformations, as outlined in Section IV-B1.

IV-C1 Results

TABLE IV: Simulation User Study Results

	Overall		Object Distance Changes		Cartesian Changes
Approach	Accuracy	Mean User Rank	Accuracy	Mean User Rank	Accuracy	Mean User Rank
ExTraCT	$88.00\%$	$4.471\pm 0.062$	$96.53\%$	$4.521\pm 0.0668$	$100.00\%$	$4.776\pm 0.060$
LaTTe	$55.56\%$	$2.840\pm 0.0790$	$47.22\%$	$2.722\pm 0.0988$	$94.83\%$	$3.207\pm 0.136$

Table IV shows the results of our user study: our method achieved higher accuracy and was preferred by users, supporting H2 and H3. A Wilcoxon signed-rank test showed a significant difference between the rankings of our method and LaTTe ( $p<0.0001$ ) on how well the trajectory followed the correction. When comparing the two methods after each correction, subjects rated our method as better in $85.5\%$ of the cases, LaTTe as better in $6.2\%$ of the cases, and found no difference in $8.0\%$ of the cases. When asked at the end of the experiments, all the subjects preferred our method, with $33.3\%$ of subjects rating our method as “slightly better” and $66.7\%$ rating it as “much better” than LaTTe.
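For readers unfamiliar with the test, the following is a minimal standard-library sketch of a two-sided Wilcoxon signed-rank test on paired ratings. It assumes the normal approximation with a tie correction and drops zero differences; the text does not state which variant or software was used, and the ratings below are made-up illustrative values, not data from the study.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation, tie-corrected).

    Returns (W, p), where W is the smaller of the positive and negative rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p


# Hypothetical 1-5 ratings for the same corrections under each method.
extract_ratings = [5, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5]
latte_ratings = [3, 2, 4, 1, 2, 3, 4, 3, 2, 5, 1, 3]
w, p = wilcoxon_signed_rank(extract_ratings, latte_ratings)
print(f"W = {w}, p = {p:.4f}")
```

A paired, rank-based test suits this design because each subject rates both methods on the same correction and the ratings are ordinal rather than interval-scaled.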

The higher preference for our method could be largely attributed to the accuracy of our approach, with many subjects stating that ExTraCT “follows [the] correction more”. In many cases, LaTTe did not deform trajectories according to the corrections, such as in Fig. 5a and 5b, where the deformed trajectories were opposite to the corrections.

IV-C2 Failure Cases

ExTraCT failed in $12.00\%$ of the cases. Due to the explainability of our approach, we can analyse the feature matching and confidence score for each failure case (H4). We labelled the correct feature(s) for corrections within our feature space. Some corrections, however, fell outside our feature space. For example, “Move below the fridge”, which contains a directional component relative to an object, is out of our feature space. Such failures can be attributed to the lack of a suitable deformation function. There were 12 such corrections ( $5.33\%$ ), which we removed from subsequent analysis.

The remaining 15 failures could be attributed to two causes. Ten failures were due to multiple trajectory modification features within a single correction, such as “Move away from bottle and then closer to cup”. Since we assumed that each correction contains only one feature, such failures were expected. The other 5 failures were due to incorrect feature mapping. For example, “Stay further away from the pineapple” was incorrectly mapped to pineapple_distance_decrease, as the most similar phrase was “Stay close to pineapple”, and “Move to the cake” was mapped to cake_distance_increase, since the most similar phrase was “Move further away from cake”. These failures highlight the limitations of using Sentence Transformers for capturing semantic nuances in language corrections. Fine-tuning the model with in-domain data could improve the performance.
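The failure counts can be cross-checked against the reported percentages with a short calculation. The total of 225 corrections is an inference from the protocol (15 subjects, 3 scenes, 5 corrections per scene) and assumes every subject provided all corrections; it is not stated explicitly in the text.

```python
# Inferred total number of corrections in the simulation study (assumption).
subjects, scenes, corrections_per_scene = 15, 3, 5
total = subjects * scenes * corrections_per_scene

# Failure counts as reported in this section.
out_of_feature_space = 12
multi_feature = 10
wrong_mapping = 5
failures = out_of_feature_space + multi_feature + wrong_mapping

failure_rate = 100 * failures / total
out_of_space_rate = 100 * out_of_feature_space / total
accuracy = 100 * (total - failures) / total

print(f"total corrections: {total}, failures: {failures}")
print(f"failure rate: {failure_rate:.2f}%")
print(f"out of feature space: {out_of_space_rate:.2f}%")
print(f"accuracy: {accuracy:.2f}%")
```

Under this assumption, the counts are consistent with the $12.00\%$ failure rate, the $5.33\%$ share of out-of-feature-space corrections, and the $88.00\%$ overall accuracy in Table IV.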

IV-C3 Effect of textual description templates on performance

We investigated the impact of the size of textual descriptions (number of phrases in each textual description) and phrase selection on performance. We selected phrases less frequently provided in the simulation experiments from the original set of textual description templates. The resulting textual description templates contained only 2 phrases, highlighted in bold in Table I. Using this smaller set of phrases, we examined the features matched for the 59 unique language corrections from the simulation study. With a smaller set of language phrases, the number of errors in terms of feature mapping remained at 6, but there were 3 instances of low confidence. Even though the number of errors was the same, the phrases with incorrect feature mapping differed.

We also demonstrate how we can easily add phrases to our textual description templates and improve the system’s performance. We added the following phrases that were incorrectly mapped to the feature during the simulation study – “Stay further away from the {obj}” to obj_distance_inc; “Move towards {obj}”, “Move to {obj}” to obj_distance_dec; and “Move to the right” to y_cart_inc. After adding the phrases with incorrect feature mappings to the textual description templates, there were no more inaccurate feature mappings.
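The mechanism of expanding the templates can be sketched as follows. This is an illustrative toy, not our implementation: a token-set Jaccard overlap stands in for the embedding-based similarity, the two starting phrases are a reduced hypothetical template set, and the correction “Move towards cake” is a hypothetical user input. All function names are assumptions.

```python
from __future__ import annotations


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a toy stand-in for embedding cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def instantiate(templates: dict[str, list[str]], objects: list[str]) -> dict[str, str]:
    """Fill {obj} in every template phrase; returns description -> feature name."""
    descriptions = {}
    for feature, phrases in templates.items():
        for phrase in phrases:
            for obj in objects:
                # e.g. obj_distance_inc becomes cake_distance_inc
                descriptions[phrase.format(obj=obj)] = feature.replace("obj", obj, 1)
    return descriptions


def match(correction: str, descriptions: dict[str, str]) -> tuple[str, str, float]:
    """Return (feature, closest description, similarity score) for a correction."""
    best = max(descriptions, key=lambda d: jaccard(correction, d))
    return descriptions[best], best, jaccard(correction, best)


templates = {
    "obj_distance_inc": ["Move further away from {obj}"],
    "obj_distance_dec": ["Stay close to {obj}"],
}
correction = "Move towards cake"  # hypothetical user correction

before = match(correction, instantiate(templates, ["cake"]))

# Add the previously mismatched phrase to the template of the intended feature.
templates["obj_distance_dec"].append("Move towards {obj}")
after = match(correction, instantiate(templates, ["cake"]))

print("before:", before)
print("after: ", after)
```

Because each match returns the closest description along with its score, a wrong mapping can be traced to the phrase that caused it. The fix is then a one-line addition to the template rather than retraining.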

From this, we can conclude that while a pre-trained LLM used for semantic similarity matching can capture some of the complexity of natural language, it does not generalize to all cases due to the diversity and flexibility of natural language. The explainability of our approach provides valuable insights to enhance the system’s performance. We demonstrate that language grounding to features can be improved by expanding the phrases in the textual description templates, improving our approach’s generalizability to language variations.

IV-D User Studies with Real Arm

We also conducted user studies using the xArm-6 robotic arm (UFactory, China) with the xArm two-finger gripper. An Intel RealSense Depth Camera D435 was mounted on the xArm-6 gripper to capture images of the scene and estimate the locations of the objects. Additionally, for LaTTe, the bounding boxes of the objects were estimated to obtain the CLIP embeddings of the objects. Object detection was performed using Mask R-CNN [11] in our experiments. TrajOpt [20] was used to optimize the trajectory subject to the robot’s kinematic constraints.

We recruited 5 subjects (2 male, 3 female) for a within-subject study, where each subject evaluated two methods – our approach (ExTraCT) and LaTTe.

IV-D1 Setup and Protocol

There were 3 scenes with different objects on the table. Fig. 6 shows the scenes used for the real arm experiments. The object types, number of objects and object positions varied across the scenes. For the first scene, we provided the language corrections for consistency across subjects and also to familiarize the subjects with the types of corrections. The subjects had to rate the performance of both methods on the following corrections – “Go down”, “Keep a bigger distance away from bowl”, “Stay closer to spoon”.

For the next 2 scenes, subjects had to use language corrections to modify a trajectory to complete tasks. Subjects were informed that they could only give 1 correction at a time, with 3 corrections required. The 2nd scene required subjects to throw trash into the bin while avoiding the food items and objects on the table. The 3rd scene required subjects to bring an apple to a dummy for handover while staying away from the bowl and staying closer to the table. In both scenes, subjects also had to modify the trajectory’s final waypoint to move closer to the bin or the dummy. The language corrections were manually typed into the computer running both methods. Subjects completed both methods for each scene before moving to the next scene.

The order of the methods was randomized across subjects to counteract the effects of novelty and practice. Subjects filled out a qualitative survey after each task and each method, rating on a scale of 1 to 5 whether the modified trajectory followed the language correction. At the end of the study, subjects were asked to elaborate on their preferred method and why. We also measured the accuracy of the trajectory deformations using the same method as in Section IV-B1.

IV-D2 Results

The real-arm study showed similar results to the simulation study (Table V), where ExTraCT deformed trajectories more accurately than LaTTe (H2). A Wilcoxon signed-rank test was performed on the user rankings. Statistically significant differences were found, with our trajectory deformations following the corrections better ( $p<0.0001$ ), showing support for H3. Subjects generally preferred our method, with $80\%$ of subjects rating our method slightly better or much better than LaTTe. Subjects preferred our method because it “captures the user’s intention better” and has a “greater accuracy”.

TABLE V: Real Arm Experiment Results

Method	Accuracy	Mean User Rank
ExTraCT	$\textbf{97.78}\%$	$\textbf{4.38}\pm 0.12$
LaTTe	$66.67\%$	$3.16\pm 0.20$

IV-E Application in Assistive Feeding

We deployed our framework in an assistive feeding task to show that our framework is versatile and can be applied across diverse scenarios. To tailor our framework for this context, we defined two features – bite size and feeding speed. The features and textual descriptions are shown in Table VI. We show some sample interactions on how the amount of food scooped can be modified using language in Fig. 7. The modularity of our framework allows different trajectory modification methods to be used depending on the task’s specific requirements. In this case, we used parameterized dynamic movement primitives (DMPs) to modify the bite size.

Fig. 7b displays the variations in scooping trajectories based on language corrections. After the correction “Feed faster” was provided, the scooping speed increased. Similarly, following the correction “Feed me bigger bites”, the scooping trajectory was modified by scaling up the weight of the DMP, resulting in a larger bite size, as seen in Fig. 7d.
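As a rough illustration of how scaling DMP weights changes a trajectory, the sketch below integrates a textbook one-dimensional discrete DMP with Euler steps. It is not the parameterized DMP used in our system: the gains, basis-function widths, weight values and the `scale` parameter are all assumptions chosen for illustration, and the single dimension can be read as scooping depth.

```python
import math


def rollout(weights, scale=1.0, y0=0.0, goal=0.0, tau=1.0, dt=0.001,
            alpha=25.0, beta=6.25, alpha_x=3.0):
    """Integrate a 1-D discrete DMP with Euler steps and return the positions.

    `scale` multiplies every basis-function weight of the forcing term.
    Requires at least two weights.
    """
    n = len(weights)
    # Gaussian basis centres placed along the decaying phase variable x.
    centres = [math.exp(-alpha_x * i / (n - 1)) for i in range(n)]
    widths = [n ** 1.5 / c for c in centres]  # heuristic widths (assumed)
    x, y, z = 1.0, y0, 0.0
    path = [y]
    for _ in range(int(tau / dt)):
        psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centres)]
        # The usual (goal - y0) factor is omitted because start and goal coincide here.
        forcing = x * sum(p * scale * w for p, w in zip(psi, weights)) / (sum(psi) + 1e-10)
        dz = (alpha * (beta * (goal - y) - z) + forcing) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z, y, x = z + dz * dt, y + dy * dt, x + dx * dt
        path.append(y)
    return path


# Hypothetical weights that push the spoon downward mid-motion.
weights = [0.0, -200.0, -400.0, -200.0, 0.0]
base = rollout(weights)
bigger = rollout(weights, scale=1.5)

depth_base = -min(base)
depth_bigger = -min(bigger)
print(f"depth ratio after scaling weights by 1.5: {depth_bigger / depth_base:.3f}")
```

With the start and goal at the same height, the system is linear in the forcing term, so scaling the weights scales the dip of the trajectory by the same factor. This is one simple way a “bigger bite” correction could translate into a deeper scoop.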

TABLE VI: Features (F) and their Textual Descriptions (TD) for Assistive Feeding

F	bitesize_increase	bitesize_decrease
TD	I want a bigger bite	I want a smaller bite
	Increase the spoonful size	Decrease the spoonful size
	I want a larger bite next	I want a smaller bite next
	I want more food	I want less food
	Increase bite size	Decrease bite size
F	speed_increase	speed_decrease
TD	Move faster	Move slower
	Increase speed	Decrease speed
	Too slow	Too fast
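The table can be read as a small lookup structure from features to phrases. The sketch below encodes it as a dictionary and applies a matched feature to two scalar parameters. The exact-match lookup is only a placeholder for semantic matching, and `FeedingParams`, `apply_feature`, the step size and the lower bound are hypothetical names and values.

```python
from dataclasses import dataclass

FEEDING_TEMPLATES = {
    "bitesize_increase": ["I want a bigger bite", "Increase the spoonful size",
                          "I want a larger bite next", "I want more food",
                          "Increase bite size"],
    "bitesize_decrease": ["I want a smaller bite", "Decrease the spoonful size",
                          "I want a smaller bite next", "I want less food",
                          "Decrease bite size"],
    "speed_increase": ["Move faster", "Increase speed", "Too slow"],
    "speed_decrease": ["Move slower", "Decrease speed", "Too fast"],
}

# Reverse index: lower-cased phrase -> feature name.
PHRASE_TO_FEATURE = {
    phrase.lower(): feature
    for feature, phrases in FEEDING_TEMPLATES.items()
    for phrase in phrases
}


@dataclass
class FeedingParams:
    bite_scale: float = 1.0   # multiplier on the DMP weights (hypothetical)
    speed_scale: float = 1.0  # multiplier on execution speed (hypothetical)


def lookup(utterance: str) -> str:
    """Exact-match placeholder for the semantic matching step."""
    return PHRASE_TO_FEATURE[utterance.strip().lower()]


def apply_feature(params: FeedingParams, feature: str, step: float = 0.25) -> FeedingParams:
    """Nudge the parameter named by the feature up or down by `step`."""
    quantity, direction = feature.rsplit("_", 1)
    delta = step if direction == "increase" else -step
    if quantity == "bitesize":
        params.bite_scale = max(0.1, params.bite_scale + delta)
    elif quantity == "speed":
        params.speed_scale = max(0.1, params.speed_scale + delta)
    else:
        raise ValueError(f"unknown feature: {feature}")
    return params


params = FeedingParams()
apply_feature(params, lookup("Too slow"))
apply_feature(params, lookup("I want a smaller bite"))
print(params)
```

Keeping the feature names separate from the parameter update means new phrases can be added to a feature without touching the code that modifies the trajectory.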

V Conclusion and Discussion

In conclusion, we propose a modular trajectory correction framework, ExTraCT. ExTraCT creates more accurate trajectory modification features for natural language corrections, which is important for safety and building trust in human-robot interaction. Our proposed architecture uses pre-trained LLMs for grounding user corrections and semantically maps them to the textual description of the features.

Our approach combines the strengths of hand-crafted features in trajectory deformations to generalize to different object configurations and initial trajectories and the language modelling capabilities of LLMs to handle language variations. By separating the problem of language understanding and trajectory modification, we have shown improvements in interpreting and executing language corrections, even for non-templated language phrases. The transparency of our approach allowed us to understand the root causes of failures, whether in language understanding or the trajectory deformation process, enabling more targeted improvements to the system.

This work is just a step towards understanding how explicitly obtaining the features for trajectory modification can help provide a more explainable and generalizable approach. Our feature space is limited, and we cannot extract multiple features from a single correction. While our current implementation allows a chain of corrections, correcting the trajectory one feature at a time may not be efficient, especially since natural language can handle greater complexity. Future work aims to increase our feature space and handle more complex language utterances, such as different intensities of trajectory modifications, compound sentences and referring expressions. Another direction is to look into bi-directional communication between the robot and the user, allowing the robot to ask the user for clarifications in cases of uncertainty (i.e. low confidence in feature mapping). We hope our framework can provide a more transparent approach to learning human preferences, which can serve as a basis for transferring human preferences across different contexts.

Ethics Statement

The studies involving humans were approved by the Nanyang Technological University Institutional Review Board. The participants provided their written consent to participate in the studies.

Funding

This work is supported by the Rehabilitation Research Institute of Singapore and the National Research Foundation, Prime Minister’s Office, Singapore, under its Campus for Research Excellence and Technological Enterprise (CREATE) programme.

Acknowledgments

The authors acknowledge the use of OpenAI’s ChatGPT in the editing process of the paper.

References

[1] Andrea Bajcsy, Dylan P Losey, Marcia K O’Malley and Anca D Dragan “Learning from physical human corrections, one feature at a time” In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, 2018, pp. 141–149
[2] Andrea Bajcsy, Dylan P Losey, Marcia K O’Malley and Anca D Dragan “Learning robot objectives from physical human interaction” In Conference on Robot Learning, 2017, pp. 217–226 PMLR
[3] Andreea Bobu, Marius Wiggert, Claire Tomlin and Anca D Dragan “Feature expansive reward learning: Rethinking human input” In Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, 2021, pp. 216–224
[4] Alexander Broad, Jacob Arkin, Nathan Ratliff, Thomas Howard and Brenna Argall “Real-time natural language corrections for assistive robotic manipulators” In The International Journal of Robotics Research 36.5-7 SAGE Publications Sage UK: London, England, 2017, pp. 684–698
[5] Tom Brown et al. “Language Models are Few-Shot Learners” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 1877–1901 URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[6] Arthur Bucker, Luis Figueredo, Sami Haddadin, Ashish Kapoor, Shuang Ma and Rogerio Bonatti “Reshaping robot trajectories using natural language commands: A study of multi-modal data alignment using transformers” In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, pp. 978–984 IEEE
[7] Arthur Bucker et al. “LATTE: LAnguage Trajectory TransformEr” In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 7287–7294 DOI: 10.1109/ICRA48891.2023.10161068
[8] Aakanksha Chowdhery et al. “PaLM: Scaling Language Modeling with Pathways” arXiv, 2022 DOI: 10.48550/ARXIV.2204.02311
[9] Yuchen Cui, Siddharth Karamcheti, Raj Palleti, Nidhya Shivakumar, Percy Liang and Dorsa Sadigh “No, to the Right: Online Language Corrections for Robotic Manipulation via Shared Autonomy”, HRI ’23 Stockholm, Sweden: Association for Computing Machinery, 2023, pp. 93–101 DOI: 10.1145/3568162.3578623
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova “BERT: Pre-training of deep bidirectional transformers for language understanding” In arXiv preprint arXiv:1810.04805, 2018
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross Girshick “Mask R-CNN” In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969
[12] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu and Li Fei-Fei “VoxPoser: Composable 3D value maps for robotic manipulation with language models” In arXiv preprint arXiv:2307.05973, 2023
[13] Ashesh Jain, Shikhar Sharma, Thorsten Joachims and Ashutosh Saxena “Learning preferences for manipulation tasks from online coactive feedback” In The International Journal of Robotics Research 34.10 SAGE Publications Sage UK: London, England, 2015, pp. 1296–1313
[14] Jacky Liang et al. “Code as policies: Language model programs for embodied control” In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 9493–9500 IEEE
[15] Shilong Liu et al. “Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection” In arXiv preprint arXiv:2303.05499, 2023
[16] Matthias Minderer et al. “Simple open-vocabulary object detection” In European Conference on Computer Vision, 2022, pp. 728–755 Springer
[17] Alec Radford et al. “Learning transferable visual models from natural language supervision” In International conference on machine learning, 2021, pp. 8748–8763 PMLR
[18] Joseph Redmon and Ali Farhadi “YOLOv3: An incremental improvement” In arXiv preprint arXiv:1804.02767, 2018
[19] Nils Reimers and Iryna Gurevych “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing Association for Computational Linguistics, 2019 URL: https://arxiv.org/abs/1908.10084
[20] John Schulman et al. “Motion planning with sequential convex optimization and convex collision checking” In The International Journal of Robotics Research 33.9 SAGE Publications Sage UK: London, England, 2014, pp. 1251–1270
[21] Pratyusha Sharma et al. “Correcting robot plans with natural language feedback” In arXiv preprint arXiv:2204.05186, 2022
[22] Jonathan Spencer et al. “Learning from interventions: Human-robot interaction as both explicit and implicit feedback” In 16th Robotics: Science and Systems, RSS 2020, 2020 MIT Press Journals
[23] Stefanie Tellex, Nakul Gopalan, Hadas Kress-Gazit and Cynthia Matuszek “Robots that use language” In Annual Review of Control, Robotics, and Autonomous Systems 3 Annual Reviews, 2020, pp. 25–55
[24] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang and Ming Zhou “MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, 2020 arXiv:2002.10957 [cs.CL]
[25] Wenhao Yu et al. “Language to Rewards for Robotic Skill Synthesis” In arXiv preprint arXiv:2306.08647, 2023
[26] Lihan Zha et al. “Distilling and Retrieving Generalizable Knowledge for Robot Manipulation via Language Corrections” In 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023