How Can LLMs and Knowledge Graphs Contribute to Robot Safety? A Few-Shot Learning Approach
Abstract
Large Language Models (LLMs) are transforming the robotics domain by enabling robots to comprehend and execute natural language instructions. The cornerstone benefits of LLMs include processing textual data from technical manuals, instructions, academic papers, and user queries based on the knowledge provided. However, deploying LLM-generated code in robotic systems without safety verification poses significant risks. This paper outlines a safety layer that verifies the code generated by ChatGPT before executing it to control a drone in a simulated environment. The safety layer consists of a fine-tuned GPT-4o model using Few-Shot learning, supported by knowledge graph prompting (KGP). Our approach improves the safety and compliance of robotic actions, ensuring that they adhere to the regulations of drone operations.
1 Introduction
The increasing complexity of robotic systems has raised the demand for efficient programming tools that can simplify the development process. Traditional robot programming methods require a deep understanding of both hardware and software, creating a barrier for those without specialized expertise. To address these challenges, recent advances in Natural Language Processing (NLP), driven in particular by Large Language Models (LLMs) such as ChatGPT [?], have led to the development of systems capable of generating code directly from human-readable instructions.

Robotics code generation using NLP promises to revolutionize how robots are controlled: such systems demonstrate remarkable capabilities in understanding natural language and translating commands into planned robotic actions [?], enabling new applications across various domains. Moreover, this allows developers to specify complex behaviors through natural language commands rather than writing intricate code [?]. By leveraging NLP techniques, code generation tools can interpret user inputs in natural language and translate them into executable code for various robotic tasks, enabling users to control robots with natural language instructions and reshaping human-robot interaction [?]. This significantly lowers the technical entry barrier and accelerates the development process, allowing for faster prototyping, testing, and deployment of drone applications.
For example, a user might instruct a drone to ‘Navigate to a target and avoid obstacles on its way,’ and the system would generate the necessary low-level code to perform the task autonomously. Conversely, commands that violate the drone’s safe operation parameters can cause incidents affecting physical assets and human life and increase the risk of crashes, e.g., “fly upwards to an altitude of 200 meters,” which should be rejected because a drone exceeding a 120-meter altitude is not permitted by the safety regulations of drone operations, as it might interfere with a crewed aircraft. Moreover, an illustrative example of current systems that lack a safety layer to prevent harmful commands from being executed in robot operations can be seen in Figure 1.
The potential impact of NLP-driven code generation in robotics extends beyond reducing development time. It opens the door to more collaborative and accessible robotic programming environments where non-expert users, such as industrial operators or educators, can interact with robots more intuitively. These tools could facilitate the broader adoption of robotics in diverse fields, from manufacturing and healthcare to the education and service industries. However, despite the promise of this approach, several challenges remain. Translating complex human language into precise robotic commands requires not only advanced NLP algorithms but also an understanding of the physical and logical constraints of robotic systems. Furthermore, ensuring that the generated code is reliable, safe and optimized for real-time execution presents additional hurdles.
Traditional methods of controlling robots require experience in robot programming and coding. However, LLMs like GPT-4o are capable of comprehending natural language and translating it into code executable by robots. The limitation of using large language models is the lack of a safety mechanism for the code generated to control the robot. For example, if the user issues a command in natural language to fly the drone to a specific altitude, the LLM might misinterpret the user command due to model hallucination [?]. The model would then generate and execute wrong code that could lead the drone to breach the safety rules of drone operations. Our contribution fills this gap by adding a safety layer that verifies the safety of the generated code before the final low-level code execution. The key contributions can be summarized as follows:
1. Presentation of a novel safety layer that adds a fine-tuned GPT-4o model using Few-Shot learning for code classification, enhancing the safety of LLM-generated robot action code.
2. Development of a labeled code dataset by employing a Large Language Model (LLM) and a supervised learning procedure. This approach enabled us to produce a domain-based set of safe and unsafe code generation data that we manually labeled according to the rules of safe drone operations.
3. Integration of knowledge graph prompting to incorporate drone rules into the model’s decision-making process.
4. Evaluation of the fine-tuned and baseline models, with and without Knowledge Graph Prompting, under zero-temperature settings.
2 Related Work
Numerous language models (LMs) have been developed, typically pre-trained with specific objectives (e.g., masked language modeling [?]) and later fine-tuned for various downstream tasks. These LMs generally fall into three categories: 1) Masked LMs, which predict masked words in a sentence based on their context, such as BERT [?] and RoBERTa [?]; 2) Encoder-Decoder models, used for tasks such as translation and summarization, where the encoder converts the input into a fixed-length representation and the decoder generates the output, examples include T5, BART, and MASS [?]; and 3) Left-to-Right LMs, trained to predict the next word in a sequence based on prior words, like GPT-3 and GPT-4. Most of these models are built upon the Transformer architecture, which uses self-attention mechanisms to efficiently handle long-range dependencies and adapt to diverse downstream tasks.
There have been attempts to integrate large models to control robotic action [?]. Furthermore, beyond language-conditioned robotic manipulation, foundation models have driven notable advances in robotics. For instance, LID [?] introduces a method for sequential decision-making by leveraging a pre-trained LM to initialize a policy network that embeds both goals and observations. R3M [?] demonstrates how visual representations learned from diverse human video data, through time-contrastive learning and video-language models [?], enable efficient learning of robotic manipulation tasks. CACTI presents a scalable visual imitation learning framework, using pre-trained models to convert pixel data into low-dimensional latent embeddings for better generalization [?]. DALL-E-Bot [?], on the other hand, uses Stable Diffusion [?] to create images of the target scene that guide robot actions, offering a unique approach.
Our work is mainly based on [?], in which the authors highlight the ability of ChatGPT to understand user commands and generate relevant code using only a prompt and a function library. In addition, they experimented with different robotic tasks to demonstrate their approach, for example, drone control for navigation and obstacle avoidance using real-time sensor information. The authors conducted simulation experiments using the AirSim simulator as well as real-world experiments to demonstrate ChatGPT’s ability to control complex robot tasks with the user in the feedback loop. However, a key issue with these approaches is the absence of a safety pipeline that can verify the integrity of the generated code before deploying it to the robot. This is a crucial step, especially in real-world deployments in which wrong code could endanger the safety of humans and assets.
One of the emerging themes in AI safety is preventing AI systems from endangering the safety of humans [?]. Safety is an essential element of robotic operation in dynamic environments [?]. Furthermore, with the advancement of LLMs and their ability to control robotic systems, a safety layer that constrains robot actions, regardless of the performance of the LLM, is essential. Most approaches lack a layer that verifies the safety of the code generated by LLMs for the robot’s actions, specifically in the drone operations domain.
Large language models such as GPT-4o are trained on billions of parameters and are not specialized to a specific knowledge domain. However, Few-Shot learning is an effective way to fine-tune a model to be more capable in the targeted knowledge domain, especially if the model is large [?]. Moreover, LLMs are black-box models that are limited by their training data and do not have access to evolving knowledge [?]. This might lead to insufficient generalization during inference [?] and also makes them subject to hallucinating false world information [?].
In contrast, Knowledge Graphs (KGs) can store accurate, domain-specific, and evolving knowledge that presents a formal understanding of the world [?]. The unification of LLMs and KGs can improve performance in terms of knowledge awareness [?]. One of the methods for KG-enhanced LLMs is KG Prompting (KGP), which converts KG structures into text sequences that can be injected into LLMs to enhance their reasoning during inference [?]. KGP can exploit the full potential of LLMs to provide better reasoning on domain-specific knowledge without retraining the model. The main downside is that crafting KGPs requires extensive human effort.


3 Methodology
The system starts with the user input, which goes to a GPT-4o model provided with the following: (1) an API file containing a finite set of high-level programmed functions, (2) a system prompt defining the system role and a user prompt describing each function in the API file, along with a few usage examples, and (3) the user input as commands and feedback. The GPT-4o model can comprehend and translate user commands into executable Python code for drone control in the AirSim simulation environment [?]. Before final code execution, the GPT-4o model can ask the user for additional information to clarify the given command, keeping the user in the feedback loop through a dialogue approach. This phase was developed by [?], whose process ends by executing the code to achieve natural-language drone control. Our contribution extends their work by incorporating a safety layer before the final execution of the code. The entire process of developing and integrating our safety classifier is described in the following sections.
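The API file itself is not reproduced here; as an illustration, a minimal sketch of such a high-level wrapper library (here named aw, matching the snippets in Table 1) around the AirSim Python API might look as follows. The exact function set and implementation used in our system may differ.

```python
# Minimal sketch of a high-level function library wrapping the AirSim Python API.
# Function names follow the snippets in Table 1; the implementation is illustrative.
import airsim


class AirSimWrapper:
    def __init__(self):
        self.client = airsim.MultirotorClient()
        self.client.confirmConnection()
        self.client.enableApiControl(True)
        self.client.armDisarm(True)

    def takeoff(self):
        # Take off and block until the maneuver completes.
        self.client.takeoffAsync().join()

    def land(self):
        self.client.landAsync().join()

    def fly_to(self, point, velocity=5):
        # AirSim uses NED coordinates, so a positive altitude maps to a negative z.
        x, y, z = point
        self.client.moveToPositionAsync(x, y, -abs(z), velocity).join()


aw = AirSimWrapper()
```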
3.1 Few-Shot learning
We have selected GPT-4o as our target system for code classification because it understands complex language and code syntax. Moreover, since the model generating the code is a GPT-4o model, it made sense to incorporate an LLM to comprehend the generated code and classify it as safe or unsafe. In addition, choosing an LLM to classify the code makes the integration with the current system less complex. Since large language models are pre-trained on billions of parameters, we decided to fine-tune our model with 100 generated SAFE and UNSAFE code samples under a Few-Shot learning approach, summarized as follows:
1. Dataset preparation: We developed a small dataset consisting of 100 code snippets, evenly split between SAFE and UNSAFE labels. This dataset was crafted to ensure full coverage of various scenarios relevant to drone operations and safety regulations, using the AirSim simulator and the GPT-4o model at temperature 0 to minimize randomness. Furthermore, our data collection and training process can be seen in Figure 2(b).
2. Supervised fine-tuning: The model was fine-tuned using supervised learning on the few labeled examples. The objective was to adjust the model parameters to improve its performance in the safety classification task for drone operations.
3. Model optimization: The entire optimization process was performed through OpenAI’s API, which hosts the pre-trained GPT-4o model used in this work. The fine-tuning job used the gpt-4o-2024-08-06 model instance for four epochs with a batch size of two. In addition, a total of 16,460 training tokens were used to optimize the model with a learning rate of 0.1. A sketch of the corresponding API call is shown after this list.
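For reproducibility, a minimal sketch of how such a fine-tuning job can be launched through the OpenAI Python client is shown below. The training file name is a placeholder, and the reported learning rate of 0.1 is interpreted here as the API's learning-rate multiplier.

```python
# Sketch of launching the fine-tuning job through the OpenAI Python client.
# Assumes the labeled SAFE/UNSAFE examples are stored as chat-formatted JSONL.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file (hypothetical file name).
training_file = client.files.create(
    file=open("drone_safety_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job with the hyperparameters reported above.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",
    training_file=training_file.id,
    hyperparameters={
        "n_epochs": 4,
        "batch_size": 2,
        "learning_rate_multiplier": 0.1,
    },
)
print(job.id)
```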
3.2 Knowledge Graph Prompting
After fine-tuning, we extended our model with a KGP that contains the safety rules for drone operation obtained from the Australian Government Civil Aviation Safety Authority (CASA) (https://www.casa.gov.au/knowyourdrone/drone-rules). A few examples of the safe operation rules are:
1. Maximum altitude: Drones must not fly higher than 120 meters (400 feet) above ground level.
2. Proximity to people: Drones must not fly closer than 30 meters to, or above, people unrelated to the operation at any time.
3. Proximity to objects: Drones must not fly closer than 30 meters to populated areas, including buildings, highways, cars, and beaches.
4. Weather conditions: Drones should only be operated in visual line-of-sight conditions, during the day, and in safe weather conditions.

After collecting knowledge on the safety rules of drone operations, we then proceeded to convert it into a KG triple format, as follows:
- (Drone, must_not_fly_higher_than, 120_meters)
- (Drone, must_maintain_distance_from, People_30_meters)
- (Drone, must_not_hover_above, people_at_all)
- (Drone, should_operate_during, Daytime)
- (Drone, should_operate_in, Safe_Weather_Conditions)
After we converted the information into KGP triples, we transformed the triples into clear sentences to form the knowledge graph prompt provided to the model. In addition, this prompt includes the safety rules in a clear and concise manner, along with code examples to illustrate unsafe code. By extending our fine-tuned model with KGP, we aimed to have a model with domain-specific knowledge for a robust classification of SAFE and UNSAFE code for drone operations. An example of the KGP can be seen in Figure 3.
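As an illustration, a minimal sketch of how the triples above could be verbalized into the sentences included in the prompt is shown below; the verbalization template is our assumption, not the exact prompt used.

```python
# Sketch: verbalizing the KG triples into plain sentences for the knowledge graph prompt.
triples = [
    ("Drone", "must_not_fly_higher_than", "120_meters"),
    ("Drone", "must_maintain_distance_from", "People_30_meters"),
    ("Drone", "must_not_hover_above", "people_at_all"),
    ("Drone", "should_operate_during", "Daytime"),
    ("Drone", "should_operate_in", "Safe_Weather_Conditions"),
]


def verbalize(subject: str, relation: str, obj: str) -> str:
    # Replace underscores so each triple reads as a natural-language sentence.
    return f"{subject} {relation.replace('_', ' ')} {obj.replace('_', ' ')}."


kg_prompt = "Drone safety rules:\n" + "\n".join(verbalize(*t) for t in triples)
print(kg_prompt)
```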
3.3 Integration
After fine-tuning the GPT-4o model as a code classifier, we integrate it into the original method to classify the generated code before the drone executes it. An example of the full integration with the AirSim simulation can be seen in Figure 4. The fine-tuned model is placed in the pipeline so that it receives the code from the first GPT-4o model and classifies it as SAFE or UNSAFE. If the code is SAFE, it is executed for low-level drone control. On the contrary, if the code is UNSAFE, it is not executed and is returned to the user, who is asked to issue a new, safe command. The overall system’s architecture is shown in Figure 2(a).
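A minimal sketch of this gating logic is shown below; the fine-tuned model identifier, prompt wording, and helper names are placeholders rather than the exact ones used in our pipeline.

```python
# Sketch of the safety gate: the generated code is executed only when the
# fine-tuned classifier labels it SAFE. Model id and prompts are placeholders.
from openai import OpenAI

client = OpenAI()
FT_MODEL = "ft:gpt-4o-2024-08-06:org::placeholder"  # fine-tuned model id


def classify_code(code: str, kg_prompt: str) -> str:
    response = client.chat.completions.create(
        model=FT_MODEL,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the drone control code as SAFE or UNSAFE.\n" + kg_prompt},
            {"role": "user", "content": code},
        ],
    )
    return response.choices[0].message.content.strip()


def safe_execute(code: str, kg_prompt: str) -> None:
    if classify_code(code, kg_prompt) == "SAFE":
        exec(code)  # low-level drone control through the `aw` wrapper
    else:
        print("Command rejected: please issue a new, safe command.")
```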
| Code Snippet | Label | Description |
|---|---|---|
| aw.takeoff() | SAFE | The drone is in line of sight. |
| aw.land() | SAFE | The drone is in line of sight. |
| aw.fly_to([0, 0, 150]) | UNSAFE | The drone exceeds the 120 m altitude limit. |
| aw.fly_to(crowd_position) | UNSAFE | The drone is less than 30 m away from the crowd/person. |
| Approach | Model | TP | FP | TN | FN | Acc. | Prec. | Rec. | F1-score | MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o KGP | GPT-4o | 4 | 8 | 32 | 36 | 45.00% | 33.33% | 10.00% | 15.39% | 0.00% |
| w/o KGP | FTGPT-4o | 28 | 22 | 18 | 12 | 57.50% | 56.00% | 70.00% | 62.22% | 15.49% |
| w/ KGP | GPT-4o | 16 | 0 | 40 | 24 | 70.00% | 100.00% | 40.00% | 57.14% | 50.00% |
| w/ KGP | FTGPT-4o | 28 | 12 | 28 | 12 | 70.00% | 70.00% | 70.00% | 70.00% | 40.00% |
4 Dataset
The experimental setup was developed using the AirSim simulation environment [?], built on the Unreal physics engine. The AirSim simulator has been widely used for different autonomous drone control approaches due to its simple interface. Additionally, it provides realistic graphics with straightforward API methods to control the drone within the simulated environment. As the focus of this work is to develop a safety pipeline using an LLM as a classifier for safe and unsafe code, outdoor scenes were used, following previous work [?].
The environment is a large outdoor scene that includes a crowd of people in the center and two wind turbines, with five power towers connected to an electrical substation. The UAV starts from a fixed location, waiting for user commands to begin its mission. The user commands are prompted through GPT-4o, which interprets the natural-language instruction into Python code for AirSim using the respective API calls.
The provided scene, including its elements and components, made it suitable for generating a custom dataset based on the operational rules established by the Australian Civil Aviation Safety regulations (CASA drone rules). For this purpose, a pre-trained GPT-4o model was used to generate Python code based on user commands, as presented in Figure 2(b). We created a dataset consisting of 100 code snippets based on the safety constraint rules defined previously. The dataset was balanced, with 50 instances labeled as SAFE and the other 50 labeled as UNSAFE. A SAFE instance in the dataset means that the processed command complies with all safety guidelines. Conversely, an UNSAFE instance means that the given instruction violates at least one safety rule. An example of the dataset is shown in Table 1.
The dataset includes various drone operation scenarios, such as altitude limitations, flying near objects, and navigating around crowds or specific locations. We issued commands to control the drone, ensuring the usage of as many library functions as possible.
To ensure safety compliance, the dataset is organized into four main categories, each representing a specific safety rule for drone operation. These categories contain both SAFE and UNSAFE examples, as detailed below:
- Altitude: This category is related to the maximum altitude of 120 m that drones must not exceed. SAFE examples refer to instances where the drone remains within the specified altitude, while UNSAFE examples involve drones exceeding this altitude.
- Minimum distance from objects: This category is related to the safety constraint of the 30 m distance that drones must maintain when approaching objects in the scene, such as turbine1, turbine2, solar panels, car, tower1, tower2, and tower3. SAFE examples demonstrate that adequate distance is maintained, while UNSAFE examples reflect situations where the drone flies too close to objects.
- Minimum distance from the crowd: This category mirrors the previous one, but the drone must not fly closer than 30 m to a crowd or a person.
- Hovering above a crowd or person: This category is related to the safety rule that a drone must not hover over a crowd or a person. A SAFE example in this category is a drone that avoids flying or hovering directly over a crowd or person, while an UNSAFE example involves a drone violating this rule by hovering above people.
The 100 examples were split into training and validation subsets, using 80% and 20% of the data, respectively. The training subset was used during model optimization, while a separate testing subset was intended to evaluate the model’s generalization capability. The testing subset comprises 80 examples balanced across the categories, with 8 for altitude and 24 for each of the remaining categories. We made this dataset publicly available at https://github.com/AbdulrahmanN4/LLMs-KGP-for-drone-safety.
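For illustration, a single labeled example might be serialized into the chat-formatted JSONL used for supervised fine-tuning roughly as sketched below; the system prompt wording and file name are assumptions.

```python
# Sketch of appending one labeled example to the chat-formatted JSONL training file.
import json

example = {
    "messages": [
        {"role": "system",
         "content": "Classify the drone control code as SAFE or UNSAFE."},
        {"role": "user", "content": "aw.fly_to([0, 0, 150])"},
        {"role": "assistant", "content": "UNSAFE"},
    ]
}

with open("drone_safety_train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```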

5 Results and discussion

For result comparison, two approaches, each with two models, were considered to evaluate the classification task between SAFE and UNSAFE commands. In each approach, the GPT-4o model provided by OpenAI was used as the comparison baseline for our proposed model, named FTGPT-4o. The first approach, without KGP, considers a simple system instruction prompt for both models. The second approach uses the KGP based on the safety-rule regulations for both the GPT-4o and FTGPT-4o models.
The general results demonstrate that the FTGPT-4o model classifies SAFE commands and identifies UNSAFE instructions better than GPT-4o, whose code predictions from user input are based solely on its pre-training data.
The outcome of the fine-tuning job demonstrated that the model was able to fit the newly presented data, although not yet with high performance. Our main approach therefore consists of including the safety rules through KGP in the fine-tuned model, making it the best-performing approach on the overall testing dataset.
As our main concern is to increase safety in the drone operation pipeline, UNSAFE is treated as the positive class: UNSAFE examples correspond to the True Positive (TP) and False Negative (FN) counts, while SAFE examples correspond to the True Negative (TN) and False Positive (FP) counts. Since the testing data is balanced, 40 TP instances and 40 TN instances are required to achieve perfect performance. In this regard, the quantitative analysis was carried out using the accuracy, precision, recall, F1-score, and Matthews correlation coefficient (MCC) metrics.
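For clarity, the metrics reported in Table 2 follow the standard confusion-matrix definitions; a brief sketch of their computation, shown here for the FTGPT-4o with KGP counts, is:

```python
# Standard confusion-matrix metrics (UNSAFE is the positive class).
import math


def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# FTGPT-4o with KGP: Acc. 70%, Prec. 70%, Rec. 70%, F1 70%, MCC 0.40.
print(metrics(tp=28, fp=12, tn=28, fn=12))
```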
The classification comparison in Table 2 shows the overall performance for each approach and model. Using the proposed dataset to fine-tune the model significantly improves performance over the GPT-4o baseline in identifying UNSAFE commands, in both approaches with and without KGP. For the approach without KGP, the fine-tuned model increases accuracy from 45% to 57.50% and precision from 33.33% to 56%, with a marked enhancement in recall and F1-score, from 10% to 70% and from 15.39% to 62.22%, respectively. Finally, the MCC increased from 0% to 15.49%. A general difference that can be observed is that the GPT-4o model struggles to identify UNSAFE commands, classifying a large number of examples as SAFE. In contrast, the fine-tuned GPT-4o model is able to recognize both types of commands, classifying most of the samples as UNSAFE.
For the KGP approach, given explicit guidance, both models improved their performance. GPT-4o was able to increase the number of correctly identified UNSAFE commands; however, it continues to classify too many examples as SAFE. Although GPT-4o has 100% precision on the 16 examples it flagged as UNSAFE, the 40% recall indicates that it was not able to correctly identify all the UNSAFE instructions. In contrast, the fine-tuned GPT-4o improves two of the five metrics compared to GPT-4o with KGP. It matches the baseline’s 70% accuracy and presents a reduction in precision and MCC, from 100% to 70% and from 50% to 40%, respectively. However, it increases recall and F1-score from 40% to 70% and from 57.14% to 70%, respectively. The identical values obtained for accuracy, precision, recall, and F1-score indicate a more balanced performance in identifying both SAFE and UNSAFE commands. Even with a considerable number of FP instances, the fine-tuned GPT-4o is the preferable model for safe drone operations, since it is better to alert on a possibly risky situation than to consider it SAFE.
5.1 Categories analysis
The performance of our approaches across the four testing categories, which consist of altitude, distance from objects (DistObject), distance from crowd (DistCrowd), and hovering above the crowd (HoverCrowd), can be seen in Figure 5. In the altitude and distance-from-objects categories, the results are evenly distributed between the models with KGP and similar for the models without KGP. A point worth highlighting is that both FTGPT-4o and GPT-4o with KGP achieved perfect scores in identifying SAFE and UNSAFE commands in these two categories, showing how KGP can reinforce the model with domain-specific knowledge.
In the DistCrowd category, covering the minimum-distance-from-the-crowd rules, the GPT-4o model cannot classify the UNSAFE commands, either with or without KGP. In contrast, FTGPT-4o correctly classifies UNSAFE commands even without KGP. However, FTGPT-4o with KGP slightly improves the classification of SAFE instructions at the cost of precision on UNSAFE instances. The high similarity between the distance-to-objects and distance-to-crowd instructions in the testing data and the KGP rules probably explains this behavior of the FTGPT-4o model.
For the set of rules related to avoiding hovering above the crowd (HoverCrowd), the baseline model always struggled to classify the UNSAFE examples, even when KGP was added. However, when using KGP, the GPT-4o model was able to improve its precision in classifying SAFE commands. In contrast, the FTGPT-4o model detects the UNSAFE instructions well but has difficulties with the SAFE examples. Even when adding KGP, the fine-tuned model is not able to improve the classification of SAFE commands and shows a drawback on the UNSAFE instructions.
6 Conclusions
Most current systems that allow UAV control via natural language lack a safety barrier that prevents the UAV from violating the regulations of safe drone operations. Our work addressed this gap by developing a safety layer that prevents harmful commands from being executed as low-level robotic actions. The obtained results show that a fine-tuned LLM with KGP can help ensure that drone operations obey local authorities’ regulations. Future work will address more complex CASA drone rules, for example, flying the drone during daytime only and avoiding flight near areas where emergency operations are taking place, to ensure full compliance with safety regulations. Having LLMs follow the rules of drone operations is one approach to ensuring the safety of the generated code; future research will also consider the drone’s current hardware capabilities. For instance, issuing commands that exceed the drone’s safe velocity parameters can lead to unstable flight patterns, reduced maneuverability, and an increased risk of crashes, e.g., the instruction “accelerate to 50 m/s and maintain altitude” could lead the drone to crash into an obstacle due to reduced control accuracy at high speeds.
We hope that our work contributes to answering the open research question of how to prevent AI systems from causing harm to humans and assets without limiting robot control or increasing latency, and to the continual learning required for LLMs controlling drones to meet the challenges of object-goal navigation in unpredictable, dynamic outdoor environments.
Acknowledgments
This research was partially financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, Fundação de Amparo a Ciência e Tecnologia do Estado de Pernambuco (FACEPE), and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)—Brazilian research agencies.
References
- [Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
- [Brown et al., 2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020.
- [Bucker et al., 2022] Arthur Bucker, Luis Figueredo, Sami Haddadin, Ashish Kapoor, Shuang Ma, and Rogerio Bonatti. Reshaping robot trajectories using natural language commands: A study of multi-modal data alignment using transformers, 2022.
- [Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
- [Future of Life Institute, 2015] Future of Life Institute. Autonomous weapons: An open letter from AI & robotics researchers. https://futureoflife.org/open-letter-autonomous-weapons/, 2015. Signed by 20,000+ people.
- [Ji et al., 2022] Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and Philip S. Yu. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, 33(2):494–514, 2022.
- [Ji et al., 2023] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, March 2023.
- [Kapelyukh et al., 2023] Ivan Kapelyukh, Vitalis Vosylius, and Edward Johns. Dall-e-bot: Introducing web-scale diffusion models to robotics. IEEE Robotics and Automation Letters, 8(7):3956–3963, 2023.
- [Li et al., 2022] Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems, 35:31199–31212, 2022.
- [Li et al., 2023] Shiyang Li, Yifan Gao, Haoming Jiang, Qingyu Yin, Zheng Li, Xifeng Yan, Chao Zhang, and Bing Yin. Graph reasoning for question answering with triplet retrieval, 2023.
- [Liang et al., 2022] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In arXiv preprint arXiv:2209.07753, 2022.
- [Liu et al., 2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
- [Mandi et al., 2022] Zhao Mandi, Homanga Bharadhwaj, Vincent Moens, Shuran Song, Aravind Rajeswaran, and Vikash Kumar. Cacti: A framework for scalable multi-task multi-scene visual imitation learning. arXiv preprint arXiv:2212.05711, 2022.
- [Matuszek et al., 2013] Cynthia Matuszek, Evan Herbst, Luke Zettlemoyer, and Dieter Fox. Learning to Parse Natural Language Commands to a Robot Control System, pages 403–415. Springer International Publishing, Heidelberg, 2013.
- [McCoy et al., 2019] Tom McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy, July 2019. Association for Computational Linguistics.
- [Mitchell et al., 2018] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. Never-ending learning. Communications of the ACM, 61(5):103–115, May 2018.
- [Nair et al., 2022] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.
- [Niu et al., 2022] Changan Niu, Chuanyi Li, Vincent Ng, Jidong Ge, Liguo Huang, and Bin Luo. Spt-code: Sequence-to-sequence pre-training for learning source code representations. In Proceedings of the 44th international conference on software engineering, pages 2006–2018, 2022.
- [Pan et al., 2024] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering, 36:3580–3599, 2024.
- [Radosavovic et al., 2023] Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023.
- [Rombach et al., 2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [Salazar et al., 2019] Julian Salazar, Davis Liang, Toan Q Nguyen, and Katrin Kirchhoff. Masked language model scoring. arXiv preprint arXiv:1910.14659, 2019.
- [Shah et al., 2017] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017.
- [Vemprala et al., 2024] Sai H. Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. Chatgpt for robotics: Design principles and model abilities. IEEE Access, 12:55682–55696, 2024.
- [Wei et al., 2022] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. Survey Certification.
- [Yu et al., 2023] Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montserrat Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, et al. Language to rewards for robotic skill synthesis. In Conference on Robot Learning, pages 374–404. PMLR, 2023.