
Mechatronic generation of datasets for acoustics research

Abstract

We address the challenge of making spatial audio datasets by proposing a shared mechanized recording space that can run custom acoustic experiments: a Mechatronic Acoustic Research System (MARS). To accommodate a wide variety of experiments, we implement an extensible architecture for wireless multi-robot coordination which enables synchronized robot motion for dynamic scenes with moving speakers and microphones. Using a virtual control interface, we can remotely design automated experiments to collect large-scale audio data. This data is shown to be similar across repeated runs, demonstrating the reliability of MARS. We discuss the potential for MARS to make audio data collection accessible for researchers without dedicated acoustic research spaces.

Index Terms—  Audio recording, robotics, remote control, signal processing, source separation, speech processing

1 Introduction

Rich, high-quality audio data is a vital resource for the development and evaluation of audio signal processing algorithms. However, the process of acquiring data can be time-consuming and labor-intensive.

Researchers have developed a number of approaches for generating audio data with desired spatial characteristics. Simulation methods such as the Image Source Model (ISM) can be used to generate a room impulse response (RIR), which can be convolved with an audio signal to model room acoustics [1]. Unfortunately, RIRs are hard to leverage in live applications. Therefore, researchers interested in evaluating their algorithms in live settings will use data from carefully arranged microphones and loudspeakers [2]. Some researchers opt to record human subjects engaging in everyday tasks, which provides rich, dynamic data [3, 4]. Sensors can be used to measure the position of subjects [5], which is useful for research in localization and tracking. Although realistic, these experiments are difficult to replicate. In contrast, robotic experiments are repeatable and can emulate complex human-like motion. An example is the LOCATA challenge, which provides a data corpus that includes recordings from a moving humanoid robot [6, 7]. Robots also enable dense spatial samplings of acoustic spaces. The CAMIL dataset was created for research on binaural manifolds [8] by using a robot to direct an acoustic head simulator at various orientations. Although robots can benefit spatial audio research, they are costly to develop and thus inaccessible to many.

Fig. 1: Photograph of the MARS setup used in this paper

A tool that grants access to a robot-enabled recording space would enable many researchers to generate real data from finely controlled acoustic scenes. The Mechatronic Acoustic Research System (MARS) is such a remotely-accessible workbench for the creation of audio datasets. We summarize the tradeoffs of our proposed system with other approaches in Table 1.

                        | Simulation       | Live human experiment           | Live robotic experiment          | MARS prototype
Modeling inaccuracies   | Inexact [9, 10]  | Inexact positioning and labels  | Servo noise from robotic motion  | Servo noise from robotic motion
Required resources      | Low              | High                            | High                             | Remotely accessible
Repeatable              | Yes              | No                              | Yes                              | Yes
Setup time              | Minimal          | High                            | High                             | Minimal
Dense spatial sampling  | Yes              | No                              | Yes                              | Yes
Table 1: Summary of spatial audio experiment methods.

The challenges that must be solved for MARS to achieve our goals are explored in Section 2. The design and control of the MARS prototype are discussed in detail in Sections 3 and 4. We evaluate our prototype’s data-collection ability in Section 5 and summarize our results in Section 6.

2 Challenges

For MARS to be a viable general-purpose audio data collection tool, it must allow users to specify a wide variety of experiments and be able to run them consistently. We refer to these two tasks as design and control, respectively. To design an experiment, a user must be able to fully describe complicated acoustic scenes, which may involve many speakers and microphones, each with its own audio, positioning, and motion. The more elements in an experiment, the more challenging it is to design. To address this, MARS must represent acoustic scenes in a concise manner, such that users can easily specify even the most elaborate of experiments. Additionally, MARS is intended to be remotely accessible, so visualization tools are necessary for users to fully understand the recording space’s capabilities and limitations.

The control aspect of MARS also presents many unique challenges. Multi-robot coordination is required for repeatable motion across a fleet of robots; however, this is a complex task [11], and MARS must address it while remaining extensible so that state-of-the-art devices can be seamlessly integrated on a rolling basis.

Furthermore, MARS must meet precision constraints on timing and positioning to be suitable as a shared data-collection platform.

3 User design of complex experiments

3.1 Describing an experiment

Central to the task of designing an experiment is the ability to accurately specify an acoustic scene. The volume of information required to fully characterize such scenes makes concise representation challenging, thereby increasing the tedium of creating and iterating on experiment designs. Our solution is to provide an object-oriented application programming interface (API) in Python, modeled loosely after pyroomacoustics [9]. Any audio or robot device in MARS is dubbed a component, each of which offers a range of actions that can be requested via instructions. A first pass of the experiment can be done without calling upon the actual equipment, so that logs are provided nearly instantly. We find that this streamlines the process of iterating on experiment designs.
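
To illustrate the flavor of this interface, the sketch below shows how a user might describe a simple dynamic scene. All module, class, and method names here (mars_api, Experiment, Spiderbot, move_to, dry_run, and so on) are hypothetical stand-ins rather than the actual MARS API.

```python
# Hypothetical experiment script; names are illustrative, not the MARS API.
from mars_api import Experiment, Spiderbot, Speaker, Microphone  # hypothetical module

exp = Experiment(name="moving_source_demo")

# Components: a speaker carried by the spiderbot and a fixed microphone.
spider = exp.add_component(Spiderbot("spiderbot"))
speaker = exp.add_component(Speaker("speaker", mount=spider))
mic = exp.add_component(Microphone("mic_left", position=(1.0, 2.0, 1.5)))

# Instructions: timestamped actions requested from each component (times in seconds).
spider.move_to((0.5, 3.0, 2.0), at=0.0)    # start pose
spider.move_to((2.5, 3.0, 2.0), at=10.0)   # traverse while audio plays
speaker.play("speech_sample.wav", at=1.0)
mic.record(duration=12.0, at=0.0)

exp.dry_run()   # first pass without hardware: logs are returned almost instantly
exp.submit()    # queue the experiment for execution on the physical system
```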

Fig. 2: A Gazebo simulation of the prototype recording space

3.2 Simulation in Gazebo

Remote access to a mechanized room poses several safety concerns. In particular, the possibility for user-defined experiments to damage equipment must be addressed. A digital twin of the system is developed in Gazebo, a simulation toolbox and physics engine commonly used in robotics [12], so that experiments can be simulated in advance to check for collisions. This tool also provides a visualization of the physical space, as shown in Figure 2. We anticipate that this tool will be instrumental in the design and execution of a new class of robot-enabled audio experiments.

3.3 Designing and monitoring

We operate the MARS prototype using a virtual interface, which lets us design and run experiments remotely. Experiments may involve a large number of devices or span multiple days; thus, a monitoring tool is provided to track experiment progress, report system status, and stream a camera feed of the recording space.

4 Control of audio and robot devices

4.1 Multi-robot coordination

To convert descriptions of experiments into audio data, MARS must offer positioning, playback, and recording capabilities. Our prototype does this by providing an extensible API that defines such behavior for an arbitrary set of devices.

The positioning of microphones and loudspeakers within an acoustic scene is done using multiple robots, which can range from simple linear actuators to complex pulley-driven systems. These robots must move to requested positions at specified times while avoiding collision. Collision avoidance in multi-robot systems is well studied [13] and can be highly sensitive. Fortunately, MARS controls a known environment where motion planning can be done in advance. Additionally, for the sake of repeatability, the motion provided by MARS is mostly described by the user, so there is little need for advanced decision-making and path-planning algorithms. These factors make MARS suitable for a centralized architecture, which can be more efficient than a decentralized one [14].
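
Because trajectories are specified by the user and known in advance, a centralized controller can sweep the planned paths for clearance violations before anything moves. The following is a minimal sketch of such an offline check, assuming each trajectory is available as a function of time; the separation tolerance and duration are illustrative values, not MARS parameters.

```python
import numpy as np

def check_clearance(trajectories, min_separation=0.3, duration=60.0, dt=0.05):
    """Offline pairwise clearance check over pre-planned trajectories.

    trajectories: dict mapping robot name -> callable t -> (x, y, z) in metres.
    Returns a list of (time, robot_a, robot_b) tuples where the planned
    separation falls below min_separation.
    """
    names = list(trajectories)
    violations = []
    for t in np.arange(0.0, duration, dt):
        pos = {n: np.asarray(trajectories[n](t), dtype=float) for n in names}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if np.linalg.norm(pos[a] - pos[b]) < min_separation:
                    violations.append((t, a, b))
    return violations
```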

4.2 Device integration

We acknowledge the need for MARS to provide access to state-of-the-art equipment. To this end, MARS is built atop the Robot Operating System (ROS), a collection of open-source robotics software that offers a publisher-subscriber communication architecture over IP [15]. To support device integration with MARS, we offer modular client and server objects that request and execute instructions, respectively. The use of clients and servers in the handling of an experiment is shown in Figure 3. A single component can run many servers in parallel, each of which passes instruction data to an arbitrary callback function. This implementation is extensible: updating a server to fit a new device only requires changing the instruction type and callback implementation, so devices relevant to audio research, such as the Tympan [16], can be added with little overhead. Using ROS, wireless devices can be interfaced easily, making MARS suitable for research on cooperative listening with IoT devices.
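
A minimal sketch of the component-side server pattern in ROS 1 (rospy) is shown below; the topic name and plain-string payload are illustrative placeholders for the MARS instruction messages.

```python
# Minimal component-side instruction server using rospy (ROS 1).
# The topic name and String payload are illustrative stand-ins.
import rospy
from std_msgs.msg import String

def handle_instruction(msg):
    # Arbitrary callback: parse the instruction payload and act on it.
    rospy.loginfo("turret received instruction: %s", msg.data)
    # e.g. rotate the head to the requested angle

if __name__ == "__main__":
    rospy.init_node("turret_server")
    # One component can run many such subscribers in parallel,
    # each forwarding instruction data to its own callback.
    rospy.Subscriber("/mars/turret/instruction", String, handle_instruction)
    rospy.spin()  # process incoming instructions until shutdown
```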

4.3 Precisely timed actions

Although the modified TCP protocol provided by ROS guarantees data integrity and order of arrival, large latency and jitter are observed, particularly on the low-power microcontrollers that drive many of the robots within MARS. This does not pose an issue for sequentially executed instructions, as order of execution is easily maintained. To support timestamped instructions, the prototype controller transmits messages in advance, which gives components ample time to receive and execute instructions according to their internal hardware clocks. Assuming synchronized hardware clocks across the network, this guarantees that robot motion occurs on a shared timeline. The effects of clock drift are minimized by running a local Network Time Protocol (NTP) server [17]. This approach allows for the creation of dynamic acoustic scenes, even on busy wireless networks.
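
A minimal sketch of this pre-transmission scheme is given below, assuming NTP-disciplined clocks and UNIX-time timestamps; the function and parameter names are illustrative.

```python
import time
import threading

def schedule(instruction, execute_at, action):
    """Execute action(instruction) when the local (NTP-disciplined) clock
    reaches the wall-clock time execute_at (UNIX seconds).

    Because the instruction arrives ahead of time, network latency and
    jitter only need to be smaller than the scheduling margin."""
    def _run():
        delay = execute_at - time.time()
        if delay > 0:
            time.sleep(delay)      # wait out the pre-transmission margin
        action(instruction)        # fire on the shared timeline
    threading.Thread(target=_run, daemon=True).start()
```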

4.4 Precise motion

Given that MARS is designed to provide spatial audio, we aim to record and validate the positions of robots in the environment. Recording this data can be achieved using ROS tools, namely rosbag. Validation requires ground-truth position data, which can be found using a camera system and visual fiducial markers [18, 19]. With the ground-truth positions known, position error can be calculated and used to interrupt experiments that cause the MARS system to behave erratically. Doing so serves as a form of quality control, wherein only high-quality, precise data is captured.
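
One simple form such a check could take is a per-pose comparison between the commanded position and the fiducial-derived ground truth; the tolerance below is an illustrative value, not a MARS specification.

```python
import numpy as np

MAX_POSITION_ERROR = 0.05  # metres; illustrative tolerance

def validate_pose(commanded_xyz, fiducial_xyz):
    """Compare the commanded robot position with the ground truth recovered
    from fiducial markers; abort the experiment on excessive error."""
    error = np.linalg.norm(np.asarray(commanded_xyz) - np.asarray(fiducial_xyz))
    if error > MAX_POSITION_ERROR:
        raise RuntimeError(
            f"position error {error:.3f} m exceeds tolerance; stopping experiment")
    return error
```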

Fig. 3: Sequence diagram of MARS experiment

4.5 Managing large datasets

Complex experiments have the potential to involve a massive number of devices, each of which logs its own performance locally and writes audio files to internal storage. MARS offers the ability to consolidate this data onto the host machine, where it can be labeled and uploaded for the user to access. Data upload can be requested concurrently with experiment execution, allowing for validation and feature extraction even while an experiment is ongoing. This minimizes the downtime between requesting an experiment and receiving data.
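
A sketch of how such concurrent consolidation might be structured on the host is shown below, using a background worker that copies finished recordings while the experiment continues; paths and names are illustrative.

```python
import queue
import shutil
import threading

upload_queue = queue.Queue()

def uploader(dest_dir):
    """Background consumer: copy finished recordings to host storage
    while the experiment is still running."""
    while True:
        path = upload_queue.get()
        if path is None:             # sentinel: experiment finished
            break
        shutil.copy(path, dest_dir)  # consolidate onto the host machine

threading.Thread(target=uploader,
                 args=("/data/mars/session_001",),  # hypothetical destination
                 daemon=True).start()

# Components enqueue files as they finish, e.g.:
# upload_queue.put("/tmp/turret/take_0042.wav")
# upload_queue.put(None)  # signal completion at the end of the experiment
```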

5 Evaluation

Data collected from baseline experiments was used to evaluate MARS. The following equipment setup was used:

  • The interface component, a fanless computer with a 64-channel audio interface (Antelope Audio Galaxy 64 Synergy Core), allows an audio file to be played through a speaker array while a set of omnidirectional lavalier condenser microphones records audio concurrently. Scripting is handled with PortAudio [20] linked to the interface’s ASIO driver; a minimal playback/record sketch is shown after this list.

  • A 3D-printed acoustic head simulator, shown in Figure 1. One microphone was placed atop the head, and two were inserted into the left and right ear canals.

  • The spiderbot component, modeled after the spider-cam system used in professional sports arenas, which carries a speaker along a three-dimensional path.

  • The turret component, which rotates the head.

  • The rail component, a motor-driven linear guide rail that translates the turret along a single axis.
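
For concreteness, simultaneous playback and capture through a PortAudio-backed interface can be scripted along the following lines. This sketch uses the python-sounddevice binding to PortAudio as a stand-in for the actual interface scripts; the file name and channel count are illustrative.

```python
import sounddevice as sd   # Python binding to PortAudio
import soundfile as sf

# Illustrative stimulus file; in practice this comes from the experiment design.
stimulus, fs = sf.read("speech_sample.wav", dtype="float32")

# Play through the speaker array while simultaneously recording three
# microphone channels on the same interface (e.g. head-top, left ear, right ear).
recording = sd.playrec(stimulus, samplerate=fs, channels=3)
sd.wait()                                 # block until playback and capture end
sf.write("take_0001.wav", recording, fs)  # store the multichannel take
```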

5.1 High density spatial sampling

To demonstrate robustness, MARS was used to take a dense spatial sampling of an acoustic scene. The acoustic head simulator was placed in various poses by the rail and turret. Over the course of around fifty hours, 40,000 three-channel audio files were collected at a sampling frequency of 48 kHz without human supervision or intervention.

5.2 Repeatability of experiments

Using the densely sampled MARS data, we verify the repeatability of static scene creation with four measurements corresponding to the same head pose. The recorded data was highpass filtered to remove background noise. The value of the maximum entry in the normalized cross-correlation (NCC) between two signals is used as a metric for similarity between recordings. To evaluate repeatability for multiple instances of the experiment, the NCC between the first recording and each of the others is calculated, as shown in Figure 4. The maximum entries of the NCCs had an average value of 0.98 with a standard deviation of 0.01. A maximum NCC value of 1.00 corresponds to perfect similarity between two signals; thus, our results indicate highly repeatable performance.
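
For reference, this similarity metric can be computed as in the sketch below: highpass filter both recordings, normalize, cross-correlate, and take the maximum entry. The filter order and cutoff shown here are illustrative choices, not necessarily those used in the evaluation.

```python
import numpy as np
from scipy.signal import butter, correlate, sosfiltfilt

def max_ncc(x, y, fs=48000, cutoff_hz=100.0):
    """Maximum of the normalized cross-correlation between two recordings,
    after highpass filtering to suppress background noise.

    A value of 1.00 indicates the recordings are identical up to a delay
    and gain; the cutoff frequency here is illustrative."""
    sos = butter(4, cutoff_hz, btype="highpass", fs=fs, output="sos")
    x = sosfiltfilt(sos, x)
    y = sosfiltfilt(sos, y)
    ncc = correlate(x, y, mode="full") / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(np.max(ncc))
```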

To verify repeatability for dynamic scenes, an experiment was run in which the spiderbot carried its speaker payload along a trajectory while audio was played concurrently. Recordings were taken from a fixed microphone across ten repeated runs of the experiment. Across nine comparisons to a reference recording, the maximum NCC was 0.93 on average, with a standard deviation of 0.04. We observe that repeatability was only marginally lower when motion was introduced.

The dynamic experiment was repeated with the fixed acoustic head simulator in place of the microphone. The interaural time difference (ITD) remained constant across separate runs of the experiment, demonstrating reliable collection of spatial audio.
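
The ITD can be estimated, for example, as the lag that maximizes the cross-correlation between the two ear-canal signals; the sketch below shows one such estimator and is not necessarily the procedure used here.

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def estimate_itd(left, right, fs=48000):
    """Interaural time difference (seconds), taken as the lag maximizing
    the cross-correlation between the ear-canal signals."""
    xcorr = correlate(left, right, mode="full")
    lags = correlation_lags(len(left), len(right), mode="full")
    return lags[np.argmax(xcorr)] / fs
```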

6 Conclusion

MARS demonstrates how remote access to coordinated robots can be applied to the collection of custom, high-quality audio data. Using a virtual interface, we ran several experiments that would otherwise have required significant human labor and time. We evaluated this data to show that our framework for wireless multi-robot coordination is capable of collecting repeatable data from both large-scale static scenes and challenging dynamic scenes. The process of designing experiments was streamlined by our scripting interface, which concisely describes the motion and playback/recording of robot-driven microphones and loudspeakers. By creating MARS with extensibility in mind, we open up the possibility of accommodating an ever-growing variety of experiments. With the development of this initial version of MARS, we have provided solutions to several of the major challenges that must be solved for an open-access mechatronic acoustic platform to become a reality.

Fig. 4: Three cross-correlations corresponding to the same pose. The amplitudes of these peaks are used as indicators of similarity between pairs of recordings.

Fig. 5: Superimposition of the spiderbot at various timesteps along a trajectory

References

  • [1] Jont B. Allen and David A. Berkley, “Image method for efficiently simulating small‐room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
  • [2] Ryan M. Corey, Microphone array processing for augmented listening, Ph.D. thesis, University of Illinois Urbana-Champaign, Urbana, 2019.
  • [3] Jon Barker, Shinji Watanabe, Emmanuel Vincent, and Jan Trmal, “The fifth ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” CoRR, vol. abs/1803.10609, 2018.
  • [4] K. Kinoshita, M. Delcroix, S. Gannot, E. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and T. Yoshioka, “A summary of the REVERB challenge: State-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP Journal on Advances in Signal Processing, 2016.
  • [5] Charles Fox, Yulan Liu, Erich Zwyssig, and Thomas Hain, “The Sheffield wargames corpus,” in Proc. Interspeech, 2013, pp. 1116–1120.
  • [6] Heinrich W. Löllmann, Christine Evers, Alexander Schmidt, Heinrich Mellmann, Hendrik Barfuss, Patrick A. Naylor, and Walter Kellermann, “The LOCATA challenge data corpus for acoustic source localization and tracking,” in IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM), 2018, pp. 410–414.
  • [7] Christine Evers, Heinrich W. Löllmann, Heinrich Mellmann, Alexander Schmidt, Hendrik Barfuss, Patrick A. Naylor, and Walter Kellermann, “The LOCATA challenge: Acoustic source localization and tracking,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1620–1643, 2020.
  • [8] Antoine Deleforge, Florence Forbes, and Radu Horaud, “Acoustic space learning for sound-source separation and localization on binaural manifolds,” International Journal of Neural Systems, vol. 25, no. 01, 2015.
  • [9] Robin Scheibler, Eric Bezzam, and Ivan Dokmanic, “Pyroomacoustics: A python package for audio room simulations and array processing algorithms,” CoRR, vol. abs/1710.04196, 2017.
  • [10] Dirk Schröder, Physically based real-time auralization of interactive virtual environments, Ph.D. thesis, RWTH Aachen University, Aachen, 2012.
  • [11] Imad Jawhar, Nader Mohamed, Jie Wu, and Jameela Al-Jaroodi, “Networking of multi-robot systems: Architectures and requirements,” Journal of Sensor and Actuator Networks, vol. 7, pp. 52, 11 2018.
  • [12] Evan Ackerman, “Latest version of gazebo simulator makes it easier than ever to not build a robot,” IEEE Spectrum, Feb. 2016.
  • [13] Zhi Yan, Nicolas Jouandeau, and Arab Ali, “A survey and analysis of multi-robot coordination,” International Journal of Advanced Robotic Systems, vol. 10, pp. 1, 12 2013.
  • [14] Ryan Luna and Kostas E. Bekris, “Efficient and complete centralized multi-robot path planning,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011, pp. 3268–3275.
  • [15] Morgan Quigley, Brian Gerkey, Ken Conley, Josh Faust, Tully Foote, Jeremy Leibs, Eric Berger, Rob Wheeler, and Andrew Ng, “ROS: An open-source robot operating system,” in IEEE International Conference on Robotics and Automation, 2009.
  • [16] Joshua Alexander, Odile Clavier, and William Audette, “Audiologic evaluation of the Tympan open source hearing aid,” The Journal of the Acoustical Society of America, vol. 143, pp. 1736–1736, 03 2018.
  • [17] D.L. Mills, “Internet time synchronization: The network time protocol,” IEEE Transactions on Communications, vol. 39, no. 10, pp. 1482–1493, 1991.
  • [18] Michail Kalaitzakis, Brennan Cain, Sabrina Carroll, Anand Ambrosi, Camden Whitehead, and Nikolaos Vitzilaios, “Fiducial markers for pose estimation,” Journal of Intelligent & Robotic Systems, vol. 101, no. 4, Mar. 2021.
  • [19] Tomáš Krajník, Matías Nitsche, Jan Faigl, Petr Vaněk, Martin Saska, Libor Přeučil, Tom Duckett, and Marta Mejail, “A practical multirobot localization system,” Journal of Intelligent & Robotic Systems, vol. 76, no. 3-4, pp. 539–562, Apr. 2014.
  • [20] Ross Bencina and Phil Burk, PortAudio – an Open Source Cross Platform Audio API, 2001.