BenchBot: Evaluating Robotics Research in Photorealistic 3D Simulation and on Real Robots
Abstract
We introduce BenchBot, a novel software suite for benchmarking the performance of robotics research across both photorealistic 3D simulations and real robot platforms. BenchBot provides a simple interface to the sensorimotor capabilities of a robot when solving robotics research problems; an interface that is consistent regardless of whether the target platform is simulated or a real robot. In this paper we outline the BenchBot system architecture, and explore the parallels between its user-centric design and an ideal research development process devoid of tangential robot engineering challenges. The paper describes the research benefits of using the BenchBot system, including: enhanced capacity to focus solely on research problems, direct quantitative feedback to inform research development, tools for deriving comprehensive performance characteristics, and submission formats which promote shareability and repeatability of research outcomes. BenchBot is publicly available on GitHub via http://benchbot.org, and we encourage its use in the research community for comprehensively evaluating the simulated and real world performance of novel robotic algorithms.
Index Terms:
benchmarking research, research evaluation, 3D simulation, robotics software
I Introduction

Robotics research requires comprehensive evaluation prior to being deployed on real systems to guarantee robustness, a characteristic crucial for robots that operate with humans in the real world. Guaranteeing the robustness of robotic systems in the real world is challenging due to the dichotomy between the deterministic computing environment in which algorithms are conceived and the unpredictability of the real world. Consequently, evaluation processes that operate both within simulated computing environments and real world applications play a crucial role in determining the robustness of novel research.
Standardised benchmarks have become synonymous with evaluating performance in fields like computer vision, but have had limited adoption in the robotics field due to the challenges associated with standardising robotics [1]. Fundamentally, standardising robotics problems diverges from a number of the key characteristics that define robotics research—e.g. robustness under changing environments, lighting conditions, climates, sensors, and platforms.
The research community has taken varying approaches in trying to reliably evaluate the performance of robotics research outcomes. These include static datasets like the Oxford RobotCar dataset [2] that ignore the agency of a robot, engineering-heavy real world competitions like the DARPA robotics challenges [3], game-engine powered high fidelity simulations like AirSim [4], and 3D environment reconstructions from real world data [5]. Although there is a wide range of approaches, a standard approach and toolset for comprehensively evaluating the performance of robotics research has not yet been established.
We present the BenchBot system (shown in Fig. 1) as a software tool for comprehensively evaluating the performance of novel robotics research in both high-fidelity 3D simulation and on real robots. The system allows users to define tasks their research is trying to solve, declare metrics for evaluation of performance, and integrate their research with the sensorimotor capabilities provided by robotic systems. The software suite seamlessly transitions between evaluation in simulation and the real world to facilitate comprehensive testing of novel research systems. This paper describes the following key contributions provided by the BenchBot software suite:
• a simple Python API for interaction with the underlying robot system,
• support for complex changes in target scope (i.e. research task, robot platform, and operating environment),
• ability to run the same research on both simulated and real robot platforms without code changes,
• a customisable evaluation pipeline for guiding research development through quantitative feedback,
• a batch operation mode for building comprehensive performance profiles of novel research algorithms, and
• modular design for easy extension to new research tasks, robot platforms, and operating environments.
The rest of the paper is organised as follows. Section II discusses existing approaches to benchmarking and evaluation in robotics, and methods for robot simulation. Next, the BenchBot system and its underlying components are formally described in Section III. The paper concludes in Section IV with a discussion of the results and future intentions for the BenchBot system.
II Related Work
We provide context for the BenchBot system by outlining the current standards for benchmarking and evaluation within robotics, and then look specifically at robot simulation processes meant to enable benchmarking.
II-A Benchmarking and Evaluation in Robotics
Standardised benchmarks are not common in robotics research when compared to fields like computer vision. This can largely be attributed to how difficult standardising a robotics test can be, requiring standardised hardware, software, and environments [1]. This has led to a culture of robotics testing via experimentation to prove a hypothesis rather than comparison and evaluation [6]. Corke et al. [6] postulate that this is a factor limiting the speed of progress in robotics research compared to similar fields that instead perform regular evaluation and comparison of techniques. Although benchmarks remain uncommon, there are some typical approaches used when trying to create them for robotic systems.
The first approach is to use pre-recorded data and evaluate how well algorithms interpret that data for specific tasks. Well-known examples are the KITTI [7], Cityscapes [8] and Oxford RobotCar [2] datasets, which enable tasks such as object tracking, visual odometry, SLAM, and semantic segmentation to be evaluated. While this approach drives research in data interpretation, it loses the active nature of robotics wherein observations inform actions to solve problems.
Another approach to standardising robotics testing is to provide a consistent environment for testing while leaving other variables of robot design open. This enables both active interaction with the environment, and comparison of different hardware and software solutions. This approach is seen in competitions like RoboCup@Home [9], the DARPA robotics challenge [3], and the Amazon picking challenge [10]. While enabling system comparison, we see three main limitations to this approach. Firstly, these events are too infrequent to drive research in the same manner seen in computer vision and machine learning research. Secondly, while good for systems-level comparison, precise research outputs (algorithms, sensor design, etc.) cannot be easily compared. Finally, these competitions are financially restrictive, with groups needing access to their own physical robotic platforms, large engineering investment, and significant transport funding.
An interesting, newly adopted approach is to provide remote access to robot platforms. This is present in challenges like RoboThor [11], iGibson [12] and the Real Robot Challenge (https://real-robot-challenge.com/en). This avenue shows promise in giving users access to physical hardware without most users bearing the costs of setting up and maintaining that hardware. While not currently a highly scalable solution, using real robot platforms provides the most realistic real-world performance whilst enabling interaction. The issue of scalability can be lessened somewhat by combining remote access to real platforms with the use of high-fidelity simulations.
II-B Robot Simulation
Robot simulation endeavours to provide tools that enable consistent robotics testing. We identify two approaches utilised within the literature.
The first approach is the creation of fully simulated environments that are hand-crafted and designed, typically using a game engine. Some well-known examples of this are AirSim [4] and CARLA [13] for outdoor environments, and AI2Thor [14], RoboThor [11] and Isaac (https://www.nvidia.com/en-us/deep-learning-ai/industries/robotics/) for indoor environments.
The second approach is the creation of simulated environments using real-world data. Generally this comes in the form of a full 3D environment that a simulated agent can explore freely, as seen in AIHabitat [15], Gibson [5], and iGibson [12]. Functionally, these can act identically to those created in a game engine, but they are generated from sets of depth and image data collected throughout the environment and stitched together to create the final simulated environment. Alternatively, real-world data can be used to provide precise real-world sensor readings from specific pre-defined poses within an environment. This is the approach of the active vision dataset [16], which, while not a simulator in the same sense as the others discussed here, enables “traversal” between densely sampled poses to simulate movement while providing the precise visual data captured at each location.
There are distinct advantages and disadvantages to fully simulated and real-world data simulators. On a practical level, fully simulated environments are easily manipulated and adapted to new conditions (e.g. lighting variations, rearranging objects, etc.). This is more challenging for simulated environments built from observations of real-world environments. However, manually collected data typically yields more naturally messy environments with randomly cluttered surfaces, which seem more realistic than the clean and spacious environments found in many fully simulated environments. The relative visual realism of the two approaches remains a contested topic. Without fixed agent poses such as those used in the active vision dataset [16], using real data to create a virtual environment can introduce visual artefacts, and realistic lighting reflections cannot be achieved as lighting is not actively calculated. While this is not an issue for fully simulated environments, current robotics simulators of this type do not typically have particularly realistic textures (unlike those built from real-world data), which can leave object appearances looking “flat”. A simple comparison of the two simulator approaches is shown in Figure 2. Regardless of the approach used, it is important to accept that a sim-to-real gap will always be present when using simulators.
The sim-to-real gap is perhaps best addressed by combining simulation with real-world robot platforms. This is the approach used by RoboThor [11] and iGibson [12], which can directly examine the performance degradation in sim-to-real transfer. It enables rapid, repeatable prototyping in high-fidelity simulation while still allowing direct analysis of real-world performance, and is a key component in the design of our BenchBot system.
III The BenchBot System
Robotics researchers using the BenchBot system are able to focus on the research process—defining problems, creating solutions, and improving results—without being encumbered by the complications that underpin complex robotic systems. The BenchBot system is a software suite that manages the process of applying entire robot systems to a variety of research tasks, regardless of whether the robot systems are simulated or operate in the real world. The software suite provides this wide scope of capabilities while minimising the configuration and interaction burden placed on the end-user.
BenchBot provides three scripts for using the system, which are denoted by the coloured sections of the system architecture shown in Fig. 3. The scripts allow users to: 1) select a research task, robot platform, and environment to run; 2) submit a solution for the research task; and 3) obtain evaluation feedback to iteratively improve their solution’s performance. Each of these three steps directly map to the steps of the ideal research process described above: defining problems, creating solutions, and improving results.
The user simply employs these scripts in their research process, and BenchBot manages all of the complex underpinnings required of a robot system. Each area of the BenchBot system managed by these scripts is discussed in further detail in Sections III-A, III-B, and III-C respectively, along with their underlying components. Lastly, this section concludes in Section III-D with a discussion of batching—BenchBot’s tool for effortlessly building comprehensive performance profiles of robotics research.
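To make the mapping to the research process concrete, the sketch below drives the three scripts from Python. The script names come from this section; the specific flag names and option values shown are illustrative assumptions, not the documented command-line interface.

```python
# Minimal sketch of the three-step BenchBot workflow, driven from Python.
# The script names are from this paper; flag names and option values here
# are assumptions for illustration only.
import subprocess

# 1) Define the problem: start the back end with a task, robot, and environment.
back_end = subprocess.Popen(
    ["benchbot_run", "--task", "semantic_slam",
     "--robot", "sim_robot", "--env", "office_day"])  # hypothetical options

# 2) Create a solution: submit a local Python script to the running back end.
subprocess.run(["benchbot_submit", "--native", "my_solution.py"], check=True)

# 3) Improve results: score the results the solution produced.
subprocess.run(["benchbot_eval", "results.json"], check=True)

back_end.terminate()
```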

III-A Declaring a problem with the BenchBot back end
Using BenchBot begins by clearly declaring a problem that needs to be solved, a step synonymous with beginning research. The back end of the BenchBot system (including entire robot platforms, simulators, configuration, initialisation, networking, and interfacing) is started through a single script called benchbot_run. The script requires the user to select a target research task, robot platform, and operating environment from the pool of available options. Supported options are listed through helper flags, and the script validates whether the selected configuration is achievable (e.g. running a real robot in a simulated environment is not a valid configuration).
Options are declared to the BenchBot back end simply by creating a YAML file in the appropriate pool that describes the configuration option, along with any data the configuration requires. For example, a simulated environment definition would contain the data for the environment simulation and a YAML file declaring an identifier, the type of environment (simulated or real), a start pose for the robot, a trajectory the robot may use to travel through the environment, etc. Robots are declared as a series of directed connections between the robot platform and the BenchBot API along with functions for translating data between the two endpoints, and tasks are declared as a list of available robot capabilities.
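As a rough illustration of what such a declaration might contain, the snippet below parses a hypothetical environment YAML file; the field names are assumptions based on the description above, not BenchBot's actual schema.

```python
# Parse a hypothetical environment declaration. Field names are illustrative
# assumptions based on the paper's description, not BenchBot's actual schema.
import yaml  # requires PyYAML

ENV_DECLARATION = """
name: office_day                                  # identifier used at run time
type: sim_unreal                                  # simulated or real environment
start_pose: [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # position + orientation quaternion
trajectory:                                       # optional poses a guided robot may follow
  - [1.0, 0.0, 0.0]
  - [2.0, 1.5, 0.0]
"""

env = yaml.safe_load(ENV_DECLARATION)
assert env["type"] in ("sim_unreal", "real"), "unknown environment type"
print(f"Loaded environment '{env['name']}' with {len(env['trajectory'])} waypoints")
```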
III-A1 BenchBot supervisor
is started once a valid selection has been provided. The supervisor serves as the central component of the BenchBot back end, providing a single interface to handle the conglomeration of data required to manage the robot system: command line selections, HTTP communication with the BenchBot API, configuration YAML files, ROS sensorimotor data, environment initialisation commands, and HTTP control commands for simulators and real robots. In handling this wide range of data, the supervisor is able to load data for the selected configuration from the available pools, manage the life cycle of the underlying robot platform and environment (whether that be real or simulated), and provide a conduit between the sensorimotor capabilities of the robot and the simplicity of access provided by the BenchBot API.
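The supervisor's conduit role can be pictured as a thin HTTP layer sitting between the BenchBot API on one side and the selected configuration and robot on the other. The sketch below, using Flask, is a deliberately simplified stand-in; the route names and payloads are our own assumptions, not the supervisor's real interface.

```python
# Highly simplified, hypothetical sketch of the supervisor's conduit role:
# serve the selected configuration and relay sensor observations over HTTP.
# Route names and payload formats are illustrative, not BenchBot's real ones.
from flask import Flask, jsonify

app = Flask(__name__)

SELECTED_CONFIG = {  # would be assembled from the YAML pools at start-up
    "task": "semantic_slam",
    "robot": "sim_robot",
    "environment": "office_day",
}

LATEST_OBSERVATIONS = {}  # would be populated from ROS sensor topics


@app.route("/config")
def config():
    # Queried by the BenchBot API to learn which observations and actions
    # the selected task exposes.
    return jsonify(SELECTED_CONFIG)


@app.route("/observations")
def observations():
    # Polled by the BenchBot API to retrieve the latest sensor data.
    return jsonify(LATEST_OBSERVATIONS)


if __name__ == "__main__":
    app.run(port=10000)
```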
III-A2 Real robot platforms
sit below the supervisor in the back end architecture, and follow a pattern similar to typical ROS systems. BenchBot adds a robot controller to facilitate the ease-of-use functionality provided by the BenchBot API. Examples include preventing collisions before commands issued through the API can cause them, guiding the robot along static trajectories for simpler tasks, and returning the robot to a consistent starting pose between trials.
III-A3 Simulated platforms
in the BenchBot back end heavily leverage the capabilities provided by the NVIDIA Isaac SDK and Isaac Unreal Engine Simulation platform [18]. The Isaac simulator gives a robot platform agency within an environment rendered by Unreal Engine. These capabilities allow the robot to be simulated in a wide variety of environments that range in scale, lighting conditions, and even time of day (as shown in Fig. 4). A BenchBot simulator package sits above the Isaac components to translate the Isaac robot interface into ROS, and to control aspects of the simulator life cycle such as restarting from a clean state, declaring robot collisions, and dynamically changing environment selections.
III-B Creating research task solutions with the BenchBot API
The next step in the research process is creating a solution to the research problem, which once again has an analogous step in the BenchBot process. The user creates a solution that uses the BenchBot Python API to interact with sensorimotor robot capabilities, obtain back end configuration details, and generate structured task results. A user solution is submitted to a running BenchBot back end using the benchbot_submit script, which supports both native (i.e. running a Python script locally) and containerised (i.e. building a Docker image from a Dockerfile) submission modes. Containerised research solutions can run independently on other systems, enabling easy access and verification of solutions by the research community. Shareable and repeatable research outcomes, a key contribution of the BenchBot system, are a significant driver of progress in the research community.
Design of the BenchBot API is inspired by the “observe and act” framing employed in areas of robotics like reinforcement learning and the OpenAI Gym ecosystem [19]. The API uses data in the task definition to provide a list of sensor observations and possible robot actions to the user. A solution can either combine these manually into a control loop, or declare an agent with the three capabilities required to complete a BenchBot task: choosing an action given a set of observations, knowing when the task is done, and saving results for the task. Breaking the entire process of solving a complex robotics task down into providing just three functions embodies the directness with which research can be conducted using the BenchBot system.
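A minimal agent sketch is shown below. The three methods mirror the three capabilities listed above, but the class name and exact method signatures are assumptions based on that description rather than the documented BenchBot API.

```python
# Minimal agent sketch mirroring the three capabilities the paper lists:
# pick an action from observations, know when the task is done, and save
# results. Names and signatures are assumptions, not the documented API.
import json
import random


class RandomExplorer:
    """Trivial agent that wanders for a fixed budget of actions."""

    def __init__(self, max_steps=100):
        self._steps = 0
        self._max_steps = max_steps

    def pick_action(self, observations, action_list):
        # Capability 1: choose an action given the current observations
        # (here the observations are ignored and an action is chosen at random).
        self._steps += 1
        return random.choice(action_list)

    def is_done(self, action_result):
        # Capability 2: decide whether the task is complete.
        return self._steps >= self._max_steps

    def save_result(self, filename, results):
        # Capability 3: write the structured task results to disk.
        with open(filename, "w") as f:
            json.dump(results, f)
```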
III-C Measuring performance with BenchBot evaluation tools
Once results have been obtained through the BenchBot system, the final step is to evaluate the solution’s performance on those results. A benchbot_eval script is provided to pass a collection of results to the underlying Python evaluation module. The evaluation module supports scoring results individually and producing a summary score for multiple results.
An appropriate evaluation method is selected from the pool of available methods based on the task identifier provided with the results. This flexibility recognises that different tasks will have different metrics that best capture their performance, while also consolidating metrics into single reusable implementations rather than each researcher creating their own implementation.
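One way to picture this selection process is a registry mapping task identifiers to reusable metric implementations. The sketch below uses hypothetical task names, result fields, and a placeholder metric purely for illustration.

```python
# Hypothetical sketch of evaluation dispatch: the task identifier stored with
# the results selects a reusable metric implementation from a registry.
# Task names, result fields, and the metric itself are illustrative only.

def object_recall(results):
    # Placeholder metric: fraction of ground-truth object classes reported.
    reported = {o["class"] for o in results["objects"]}
    truth = set(results["ground_truth_classes"])
    return len(reported & truth) / max(len(truth), 1)


EVALUATION_METHODS = {
    "semantic_slam": object_recall,
}


def evaluate(results):
    task_id = results["task_name"]
    return {"task": task_id, "score": EVALUATION_METHODS[task_id](results)}


def summarise(all_results):
    # Consolidate individual scores into a single summary score.
    scores = [evaluate(r)["score"] for r in all_results]
    return sum(scores) / len(scores)
```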
III-D Building comprehensive performance profiles with batches
Although the BenchBot system makes generating a result simple, we recognise that a single result is rarely enough to gain a comprehensive understanding of an algorithm’s performance. BenchBot provides a final script, benchbot_batch, that generates a set of task results by sweeping over a set of different environment and robot combinations. The script, in combination with evaluation tool support for multiple results, allows a user to produce a comprehensive performance profile for their research contribution with a single command. A performance profile comprising results from multiple varying simulated environments, multiple robot platforms, and the real world empowers researchers to glean more comprehensive and meaningful insights from their research.
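As a sketch of what such a profile might look like once per-run scores are available, the snippet below aggregates hypothetical scores from several environment and robot combinations; the keys, values, and derived statistics are illustrative only.

```python
# Sketch of assembling a performance profile from per-run evaluation scores,
# the kind of summary a batch sweep enables. All values are illustrative.
from statistics import mean, stdev

per_run_scores = {  # one score per (environment, robot) combination
    ("office_day", "sim_robot"): 0.71,
    ("office_night", "sim_robot"): 0.58,
    ("office_day", "real_robot"): 0.49,
}

profile = {
    "mean_score": mean(per_run_scores.values()),
    "score_std_dev": stdev(per_run_scores.values()),
    "worst_case": min(per_run_scores, key=per_run_scores.get),
    # Same environment, simulated vs real robot: a crude sim-to-real gap estimate.
    "sim_to_real_gap": per_run_scores[("office_day", "sim_robot")]
    - per_run_scores[("office_day", "real_robot")],
}
print(profile)
```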
IV Conclusions & Future Work
In summary, we have described the BenchBot system from an architectural level and explored the capabilities the user-centric system design affords researchers. BenchBot provides simple tools for: targeting a multitude of research tasks, robot platforms, and environments; interacting with the sensorimotor capabilities of a robot platform; developing and iteratively improving research solutions with quantitative feedback; autonomously generating comprehensive performance characteristics for robotics research; and sharing research solutions to promote repeatability and accessibility.
BenchBot is in its infancy, currently targeting semantic scene understanding tasks on a limited number of robot platforms. As discussed throughout the paper, the modular system architecture employed in BenchBot allows a wide variety of expansions and improvements to the system. Depending on collaborative interest and development drivers, possible future outcomes we could explore with the BenchBot system include:
• using the semantic scene understanding tasks internally to produce novel outcomes in the semantic SLAM and scene understanding research fields;
• widening the range of supported robot platforms, particularly those with different actuation capabilities like robot manipulators;
• adding support for simulation via NVIDIA Omniverse, a new ray-tracing enabled high-fidelity 3D simulation platform;
• exposing resource pools (i.e. research tasks, robot platforms, environments, and evaluation methods) to end-users so they can easily create their own content and expansions; and
• providing novel research challenges to the community using the BenchBot platform to stimulate and drive innovation.
BenchBot allows researchers to focus on developing novel robotics algorithms without the tangential engineering challenges posed by complex robotic systems. The tools provided with BenchBot facilitate feedback-guided research development, and present users with deep insights into the performance characteristics of their research. We encourage researchers to try BenchBot in their research process, get in contact with us if they have any feedback, and help us enable the development of robust robotics research through comprehensive evaluation.
References
- [1] A. P. del Pobil, R. Madhavan, and E. Messina, “Benchmarks in robotics research,” in IROS Workshop, 2006.
- [2] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 Year, 1000km: The Oxford RobotCar Dataset,” The International Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017. [Online]. Available: http://dx.doi.org/10.1177/0278364916679498
- [3] E. Krotkov, D. Hackett, L. Jackel, M. Perschbacher, J. Pippine, J. Strauss, G. Pratt, and C. Orlowski, “The DARPA robotics challenge finals: Results and perspectives,” Journal of Field Robotics, vol. 34, no. 2, pp. 229–240, 2017.
- [4] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2017. [Online]. Available: https://arxiv.org/abs/1705.05065
- [5] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese, “Gibson env: Real-world perception for embodied agents,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9068–9079.
- [6] P. Corke, F. Dayoub, D. Hall, J. Skinner, and N. Sünderhauf, “What can robotics research learn from computer vision research?” arXiv preprint, 2020.
- [7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” International Journal of Robotics Research (IJRR), 2013.
- [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [9] “RoboCup@Home 2019: Rules and regulations (draft),” http://www.robocupathome.org/rules/2019_rulebook.pdf, 2019.
- [10] N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman, “Analysis and observations from the first Amazon picking challenge,” IEEE Transactions on Automation Science and Engineering, vol. 15, no. 1, pp. 172–188, 2016.
- [11] M. Deitke, W. Han, A. Herrasti, A. Kembhavi, E. Kolve, R. Mottaghi, J. Salvador, D. Schwenk, E. VanderBilt, M. Wallingford et al., “RoboThor: An open simulation-to-real embodied AI platform,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3164–3174.
- [12] F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. E. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese, “Interactive Gibson benchmark: A benchmark for interactive navigation in cluttered environments,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 713–720, 2020.
- [13] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
- [14] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi, “AI2-THOR: An interactive 3D environment for visual AI,” arXiv preprint arXiv:1712.05474, 2017.
- [15] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A platform for embodied AI research,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- [16] P. Ammirato, P. Poirson, E. Park, J. Kosecka, and A. C. Berg, “A dataset for developing and benchmarking active vision,” in IEEE International Conference on Robotics and Automation (ICRA), 2017.
- [17] J. Skinner, D. Hall, H. Zhang, F. Dayoub, and N. Sünderhauf, “The probabilistic object detection challenge,” arXiv preprint arXiv:1903.07840, 2019.
- [18] NVIDIA Corporation, “NVIDIA Isaac: The platform for robotics,” 2019, accessed: 31-07-20. [Online]. Available: https://www.nvidia.com/en-au/deep-learning-ai/industries/robotics/
- [19] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016.