SoundSil-DS: Deep Denoising and Segmentation of
Sound-field Images with Silhouettes
Abstract
Development of optical technology has enabled imaging of two-dimensional (2D) sound fields. Such acousto-optic sensing enables understanding of the interaction between sound and objects, such as reflection and diffraction. Moreover, it is expected to be used as an advanced measurement technology for sonars in self-driving vehicles and assistive robots. However, the low sound-pressure sensitivity of acousto-optic sensing results in a high level of noise in the images. Therefore, denoising is an essential task for visualizing and analyzing sound fields. In addition to denoising, segmentation of the sound field and object silhouettes is also required to analyze interactions between them. In this paper, we propose sound-field-images-with-object-silhouette denoising and segmentation (SoundSil-DS), which jointly performs denoising and segmentation of sound fields and object silhouettes in visualized images. We developed a new model based on the current state-of-the-art denoising network. We also created a dataset through acoustic simulation to train and evaluate the proposed method. The proposed method was evaluated using both simulated and measured data, and we confirmed that it can be applied to experimentally measured data. These results suggest that the proposed method may improve post-processing of sound fields, such as physical-model-based three-dimensional reconstruction, since it can remove unwanted noise and separate sound fields from object silhouettes. Our code is available at https://github.com/nttcslab/soundsil-ds.
1 Introduction
Sound, as well as vision, is one of the most important cues for understanding scenes. For example, self-driving cars and assistive robots are equipped with ultrasonic sonars and microphones, which are used to gather information about their surroundings. Recently, research has been conducted on converting sound information into visual information and vice versa. Lindell et al. have proposed an acoustic non-line-of-sight imaging method for resolving three-dimensional object shapes hidden around corners through acoustic echoes [17]. Davis et al. have proposed a method in which sound waves are recovered from object vibrations captured using a high-speed camera [6]. Sheinin et al. have proposed a method of sensing sound at high speeds through object-surface vibrations [20]. These studies demonstrated the potential of capturing sound as images, paving the way for further advancements in the field.

A sound-visualization technique involving directly capturing and visualizing the density variations in air caused by sound has been proposed [27]. Such acousto-optic sensing can capture sound without any microphones by observing modulations of the phase of light passing through sound fields. By using high-speed cameras as sensors, it becomes possible to create visual representations of invisible sound waves as images [5, 22]. However, a significant challenge remains: the phase modulation of light induced by sound is extremely small, leading to a high level of noise in the measured images. Ishikawa et al. [4] have proposed deep sound-field denoiser (DSFD), the first denoiser to use deep neural networks (DNNs) for noise reduction in sound-field images, demonstrating that DNN-based methods achieve superior denoising performance compared with conventional filtering methods.
With such optical technology and signal processing, we can better understand the nature of sound propagation. In particular, interactions between sound and objects, such as reflections and diffractions, can be visualized using acousto-optic sensing. To analyze such interactions, both denoising and segmentation of the sound-field and object regions in the visualized images should be carried out simultaneously.
We propose a method for simultaneous denoising and segmentation of sound-field images that include object silhouettes. A conceptual diagram is presented in Fig. 1. With this method, optically measured sound-field images with object silhouettes are denoised and segmented using a DNN. The sound field on the laser path is captured with a high-speed camera. The area where the object blocks the laser light appears as a noisy silhouette in the visualized images. The visualized images are converted into denoised sound-field images and segmentation images with the DNN. The DNN is constructed based on a state-of-the-art (SOTA) denoising network, which also has the potential for per-pixel segmentation. We created a dataset through acoustic simulation since no existing dataset includes acoustic scattering by objects. Denoising and segmentation are expected to (1) enable analysis of the propagation, reflection, and diffraction of sound waves in space and (2) be used as an advanced measurement technology for sonars in self-driving vehicles and assistive robots. The contributions of our work are summarized as follows:
- We propose a method for simultaneous denoising and segmentation of sound-field images with silhouettes.
- We created a dataset considering acoustic scattering caused by objects of various shapes.
- We confirmed that the proposed method performs effectively on both denoising and segmentation tasks.
2 Related Work
2.1 Acousto-optic Sensing
Acousto-optic sensing detects sound by illuminating sound fields with laser light from a distance, exploiting the changes in the refractive index of air caused by the density variations due to sound. Since the method can capture sound without microphones, it is useful for measuring sound where microphones cannot be used, such as in narrow spaces or inside airflow. This technique has been applied to various situations, such as visualizing sound fields generated by loudspeakers [19, 10, 5], measuring flow-induced sounds [21, 24], and microphone calibration [15, 11]. Using a high-speed camera to capture sound-field images provides an intuitive understanding of acoustic phenomena [5, 24, 11].
While acousto-optic sensing can non-intrusively capture 2D sound fields, the measured data often contain a significant amount of noise due to the small modulation of the light phase. An example of measured and visualized images is shown in Fig. 1. The sound wave propagates from right to left within the images, with reflections and diffractions occurring at the reflector. The data contain significant noise; therefore, the diffracted waves are almost invisible.
2.2 Sound-field-image Denoising
Sound-field images captured by a high-speed camera form three-dimensional data with two spatial dimensions and one temporal dimension. Sound-field-image denoising has been performed in both the spatial and temporal dimensions.
Classical Filters
For more than a decade, classical filters have been used to reduce noise in sound fields [29, 5, 23]. The most straightforward method involves time-domain filtering [29]. When 2D sound fields are captured by a high-speed camera, the time-directional signal at each pixel can be regarded as a microphone signal. Hence, by applying time-domain filtering on a per-pixel basis, it becomes feasible to extract images corresponding to specific frequencies. Although time-domain filters can remove noise at frequencies other than that of the sound, noise components that fluctuate at the same frequency as the sound cannot be removed. In contrast, by using spatio-temporal filters [5, 23], we can extract components that are consistent with the sound frequency in both the temporal and spatial domains.
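As a concrete illustration of per-pixel time-domain filtering, the following is a minimal SciPy sketch that applies a zero-phase band-pass filter along the time axis of a sound-field video; the array shape, frame rate, and band edges are illustrative assumptions, not values from the cited works.

```python
# Minimal sketch: per-pixel time-domain band-pass filtering of a sound-field video.
# Assumes the video is a NumPy array of shape (T, H, W); fs and band edges are examples.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_per_pixel(video, fs, f_lo, f_hi, order=4):
    """Apply a zero-phase Butterworth band-pass filter along the time axis (axis 0)."""
    sos = butter(order, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    # Filtering along axis 0 acts independently on every pixel's time signal.
    return sosfiltfilt(sos, video, axis=0)

# Example: keep a narrow band around a 20-kHz tone captured at 50,000 fps.
video = np.random.randn(200, 128, 128)  # stand-in for measured frames
filtered = bandpass_per_pixel(video, fs=50_000, f_lo=19_000, f_hi=21_000)
```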
Deep Sound-Field Denoiser (DSFD) [4]
The DNN-based sound-field denoising method DSFD was the first attempt at sound-field denoising using a DNN. The network of DSFD is based on the nonlinear activation free network (NAFNet) [1], a network for natural-image denoising. The input data consist of frequency-domain data obtained by applying a Fourier transform (FT) in the time direction to each pixel of a sound-field video. The real and imaginary parts of the complex amplitudes at a specific frequency are treated as two separate images, which are then stacked along the channel direction, resulting in input data of size $N \times C \times W \times H$, where $N$ is the number of images, $C$ is the number of channels, and $W$ and $H$ are the numbers of pixels in width and height, respectively. The training data were generated from 2D sound-field simulations in the frequency domain. Supervised learning is carried out using the simulated noisy images as inputs and the corresponding clean images as targets. By using DNNs, denoising can be achieved with higher accuracy than with traditional classical filtering methods.
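For clarity, the DSFD-style input representation can be sketched as follows, assuming a sound-field video stored as a NumPy array of shape (T, H, W); the function and variable names are illustrative and not taken from the official implementation.

```python
# Minimal sketch: convert a sound-field video to a 2-channel frequency-domain input.
import numpy as np

def to_freq_domain_input(video, fs, f_sound):
    """FFT along time per pixel; stack real/imag parts of one frequency bin as 2 channels."""
    spectrum = np.fft.rfft(video, axis=0)                 # (F, H, W), complex amplitudes
    freqs = np.fft.rfftfreq(video.shape[0], d=1.0 / fs)
    k = np.argmin(np.abs(freqs - f_sound))                # nearest bin to the sound frequency
    bin_k = spectrum[k]                                   # (H, W), complex
    return np.stack([bin_k.real, bin_k.imag], axis=0)     # (2, H, W)

x = to_freq_domain_input(np.random.randn(200, 128, 128), fs=50_000, f_sound=20_000)
print(x.shape)  # (2, 128, 128)
```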

2.3 Image Denoising and Segmentation
As demonstrated with DSFD, 2D sound-field data can be regarded as images, allowing the application of DNNs commonly used for RGB/grayscale images. Numerous natural-image denoising models have been proposed, including convolutional neural network (CNN)-based [7, 2, 1, 9, 8] and transformer-based models [30, 28]. Recently, denoising models that use lightweight CNNs instead of transformers have been proposed. In fact, the top three SOTA denoising models on the Smartphone Image Denoising Dataset (SIDD) [SIDD2018] use CNNs [8, 9, 1]. These CNN-based models are designed to retain global information and achieve impressive accuracy. For instance, Cascaded Gaze Net (CGNet) [8], which outperforms NAFNet [1], has achieved a peak signal-to-noise ratio (PSNR) of 40.39 dB on the SIDD. Therefore, we adopted this SOTA network, rather than NAFNet, as the base network in our study.
The task of joint denoising and segmentation, which is the objective of this study, has been explored in the field of microscopy imaging. DenoiSeg [1], for instance, is a method that performs both denoising and segmentation on biological images. DenoiSeg achieves this multitasking by adding channels corresponding to the segmentation classes to the decoder of the denoising network of Noise2Void [16]. Losses are then calculated for each task, enabling the model to produce both denoised and segmentation images. The largest difference from our proposed method is that DenoiSeg uses self-supervised learning because clean biological images are difficult to obtain, whereas sound-field images can be generated by acoustic simulation, making supervised learning a rational approach.
3 Proposed Methods
To simultaneously carry out denoising and segmentation, we use the SOTA denoising network CGNet [8] as the base network. We increase the number of output channels in the final layer of CGNet to three. The first two channels are for denoising sound-field images, the same as with DSFD. The last channel is for the segmentation of object silhouettes. By increasing the number of channels in the final layer, the proposed method accomplishes both tasks, resulting in lower computational costs compared with performing denoising and segmentation sequentially.
3.1 Proposed Architecture
Overviews of the training and inference processes are shown in Fig. 2(a) and Fig. 2(b), respectively. As with DSFD, denoising is carried out in the frequency domain. To achieve this, frequency-domain data, obtained by applying a 1D FT in the time direction to time-domain sound-field videos, are used. In the training process (Fig. 2(a)), simulated sound-field images with added noise are input to CGNet. The input channels consist of the real and imaginary parts of the frequency-domain data obtained from the 1D FT. The number of channels of the last layer of the network is set to three: the first two channels are for denoising, and the last channel is for segmentation. The segmentation output is passed through a sigmoid function so that its range becomes $[0, 1]$, and binary segmentation labels are obtained by thresholding it; the labels are 0 for the sound class and 1 for the silhouette class. The loss function $\mathcal{L}$ is the weighted sum of the denoising loss $\mathcal{L}_{\mathrm{dn}}$ and the segmentation loss $\mathcal{L}_{\mathrm{seg}}$:

$$\mathcal{L} = \mathcal{L}_{\mathrm{dn}} + \lambda \mathcal{L}_{\mathrm{seg}}, \tag{1}$$

where $\lambda$ is the weighting coefficient that balances the two losses. For denoising, the negative PSNR loss is used, as in CGNet. For segmentation, a combination of the binary cross-entropy loss $\mathcal{L}_{\mathrm{BCE}}$ and Dice loss $\mathcal{L}_{\mathrm{Dice}}$ is used to reduce the bias caused by the imbalance in the number of pixels between the sound and silhouette classes:

$$\mathcal{L}_{\mathrm{seg}} = \alpha \mathcal{L}_{\mathrm{BCE}} + (1 - \alpha) \mathcal{L}_{\mathrm{Dice}}, \tag{2}$$

where $\alpha$ is the weighting coefficient. These loss functions were determined based on the results of a preliminary experiment (see the supplementary material for details).
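As an illustration of this training objective, the following is a minimal PyTorch sketch of the multitask loss: a negative-PSNR term for the two denoising channels and a weighted BCE + Dice term for the segmentation channel. The function names and default coefficient values are assumptions, not the official implementation.

```python
# Minimal sketch of the multitask loss in Eqs. (1) and (2); coefficient values are placeholders.
import torch
import torch.nn.functional as F

def neg_psnr_loss(pred, target, max_val=1.0, eps=1e-8):
    mse = F.mse_loss(pred, target)
    return -10.0 * torch.log10(max_val ** 2 / (mse + eps))

def dice_loss(prob, target, eps=1e-6):
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def multitask_loss(output, clean, seg_label, lam=1.0, alpha=0.5):
    """output: (B, 3, H, W); first two channels denoise, last channel is segmentation logits."""
    denoised, seg_logit = output[:, :2], output[:, 2]
    l_dn = neg_psnr_loss(denoised, clean)                       # denoising term
    prob = torch.sigmoid(seg_logit)                             # map to [0, 1]
    l_seg = alpha * F.binary_cross_entropy(prob, seg_label) \
        + (1.0 - alpha) * dice_loss(prob, seg_label)            # Eq. (2)
    return l_dn + lam * l_seg                                   # Eq. (1)
```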
The inference process is shown in Fig. 2(b). To carry out inference in the frequency domain, the noisy time-sequential images are converted to noisy complex amplitudes by a 1D FT along the temporal axis, where $T$ and $F$ denote the numbers of data samples along the temporal and frequency axes, respectively. Denoising and segmentation are carried out for each frequency bin. The image in which the real and imaginary parts of the $f$-th frequency bin are arranged in the channel direction is input to the network in the same manner as in the training process. After inference has been carried out for all frequency bins, the denoised images are converted back to time-sequential data with an inverse FT. The segmentation image at the sound frequency is used as the final segmentation label, since the image at this frequency exhibits the highest contrast between the sound field and silhouettes and is therefore easier to separate into the two classes.
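The whole inference procedure can be sketched as follows, assuming a trained model that maps a (1, 2, H, W) tensor to a (1, 3, H, W) tensor; the names and the 0.5 threshold are illustrative.

```python
# Minimal sketch of frequency-domain inference followed by an inverse FT.
import numpy as np
import torch

@torch.no_grad()
def denoise_and_segment(video, model, fs, f_sound):
    spectrum = np.fft.rfft(video, axis=0)                         # (F, H, W), complex
    freqs = np.fft.rfftfreq(video.shape[0], d=1.0 / fs)
    k_sound = np.argmin(np.abs(freqs - f_sound))
    denoised_spec = np.zeros_like(spectrum)
    seg_at_sound = None
    for k in range(spectrum.shape[0]):
        x = np.stack([spectrum[k].real, spectrum[k].imag])[None]  # (1, 2, H, W)
        y = model(torch.from_numpy(x).float()).numpy()[0]         # (3, H, W)
        denoised_spec[k] = y[0] + 1j * y[1]                       # denoised complex amplitude
        if k == k_sound:
            # The segmentation label is taken at the sound frequency only.
            seg_at_sound = 1.0 / (1.0 + np.exp(-y[2])) > 0.5
    denoised_video = np.fft.irfft(denoised_spec, n=video.shape[0], axis=0)
    return denoised_video, seg_at_sound
```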
3.2 Dataset Creation
Since no dataset exists for training the network, we created a dataset of sound fields including objects. We used acoustic numerical simulation to create a training dataset because it is difficult to collect sound-field data under various conditions through experiments.
Simulation Conditions
To simulate sound fields including objects, we conducted time-domain simulations of the sound fields with MATLAB [12] using the k-Wave toolbox [26]. The simulation setup is shown in Fig. 3(a). Following DSFD, to obtain the image size of pixels, the observation area was set to m m, and observation points were set in a grid pattern at intervals of m. Sound sources were randomly placed outside the observation area, within a range of m m. Objects were placed inside the observation area. The shapes of the objects included ellipses, lines, and polygons, and their positions and other parameters were randomly selected. The medium other than the object was air, and the object was made of expanded polystyrene (EPS) to enable stable calculations. The resulting reflectivity of the object was 93.2%.
The simulation conditions are listed in Tab. 1. The frequency range of the sound was set to Hz, corresponding to wavenumbers of rad/m. The sound frequency for each condition was fixed, meaning that all sound sources in a condition had the same frequency, which was randomly selected from a uniform distribution. The amplitude of the first sound source was set to Pa, and those of the other sound sources were randomly selected from a uniform distribution between and Pa. The calculated time-sequential data were converted to the frequency domain with an FT, and the data at the sound frequency were extracted. The total number of images was , with images for each number of sound sources. Sample simulated data are shown in the second row of Fig. 3(b). Reflections and diffractions occur around the silhouettes highlighted in white in the first row of Fig. 3(b).

Noisy-Data Creation
Since the noise characteristics of the sound-field regions and object-silhouette regions can differ, we added different types of noise to each. For the sound-field regions, we added white noise with various SNRs, randomly selected from a uniform distribution between and dB. For the silhouette regions, we created noise on the basis of experimentally obtained data (details are given in the supplementary material). We estimated the probability density function (PDF) from the empirical data using kernel density estimation [4] and, on the basis of the estimated PDF, generated noise for the silhouette regions by using the inverse transform sampling method [25]. The SNRs of the noise for the silhouette regions were also selected from a uniform distribution between and dB; however, since the clean-image values in the silhouette regions are zero, the SNRs were defined with respect to the signals in the sound-field regions.
Created noisy data are shown in the last row of Fig. 3(b). Because the SNRs of the silhouette and sound-field regions were set independently, the intensities of the noise differed.
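For reference, a minimal sketch of adding white noise to the sound-field regions at a target SNR is given below; the array layout (real/imaginary channels first) and the mask convention are assumptions, and the silhouette-region noise is handled separately as described above.

```python
# Minimal sketch: add white noise to sound-field pixels at a target SNR.
import numpy as np

def add_field_noise(clean, silhouette, snr_db):
    """clean: (2, H, W) real/imag image; silhouette: (H, W) bool mask, True inside the object."""
    field = ~silhouette
    signal_power = np.mean(clean[:, field] ** 2)          # SNR defined over sound-field pixels
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), clean.shape)
    noisy = clean.copy()
    noisy[:, field] += noise[:, field]                     # silhouette noise is added separately
    return noisy
```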
Parameter | Values
---|---
Spatial grid size [m] | 0.01
Temporal discretization step [s] | 
Speed of sound in air [m/s] | 340
Speed of sound in EPS [m/s] [3] | 414
Density of air [kg/m³] | 1.21
Density of EPS [kg/m³] [3] | 28.0
Number of sound sources | 
Frequency of sound [Hz] | 
Amplitude of sound source [Pa] | 

4 Experiment
4.1 Implementation Details
The network was implemented in PyTorch on the basis of the official implementation of CGNet [8, 7]. CGNet consists of an input layer, four encoder stages, one middle stage, four decoder stages, and an output layer. The numbers of blocks in the encoder stages were set to , , , and , respectively, in order of increasing depth. The number of blocks in the middle stage was . The numbers of blocks in the decoder stages were set to , , , and , respectively, in order of decreasing depth.
The input image size was set to pixels, and the number of input channels was two. The weighting coefficient $\lambda$ of the loss function in Eq. 1 was set to on the basis of experiments conducted with a small amount of data. The coefficient $\alpha$ of the loss function in Eq. 2 was set to . The AdamW optimizer was used, where the learning rate was 0.001, the weight decay was , and $\beta_1$ and $\beta_2$ were and , respectively. The cosine annealing scheduler was used, where the maximum number of iterations was and the minimum learning rate was 1e-7. The batch size was , and the number of epochs was . The training time was approximately hours with a single NVIDIA GeForce RTX 4090 GPU. The numbers of training, validation, and evaluation images were , , and , respectively.
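For reproducibility, the optimizer and scheduler settings listed above can be set up roughly as follows; the weight-decay, beta, and iteration values shown here are placeholders where the exact numbers are not given in this section.

```python
# Minimal sketch of the training configuration (AdamW + cosine annealing to 1e-7).
import torch

model = torch.nn.Conv2d(2, 3, 3, padding=1)  # stand-in for the modified CGNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              weight_decay=1e-4, betas=(0.9, 0.999))   # placeholder values
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                       T_max=100_000,  # placeholder
                                                       eta_min=1e-7)
```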
The PSNR and structural similarity (SSIM) were used as evaluation metrics for denoising, and the intersection over union (IoU) was used for segmentation. The IoUs were calculated for class 1, i.e., the silhouette regions.
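As a reference for the segmentation metric, a minimal IoU computation for the silhouette class (class 1) might look as follows.

```python
# Minimal sketch: IoU for the silhouette class from binary prediction and label maps.
import numpy as np

def iou_class1(pred, label):
    pred, label = pred.astype(bool), label.astype(bool)
    inter = np.logical_and(pred, label).sum()
    union = np.logical_or(pred, label).sum()
    return inter / union if union > 0 else 1.0
```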
4.2 Compared Models
In our proposed method, we used CGNet as the base network. To validate this choice, we also evaluated the performance when using existing networks as the base network. The compared networks were selected from denoising networks, since denoising is a more complicated task than segmentation, which in our task is a two-class classification.
The following four conventional denoising networks were used: the denoising convolutional neural network (DnCNN) [7], the lightweight residual dense neural network based on the U-net neural network (LRDUNet) [2], NAFNet [1], and the kernel basis network (KBNet) [9]. Since all four networks were designed for denoising only, we changed the number of output channels of the last layer to be the same as that of the proposed method. For training DnCNN, LRDUNet, NAFNet, and KBNet, the weighting coefficient $\lambda$ of the loss function in Eq. 1 was set to , , , and , respectively. The batch sizes for DnCNN, LRDUNet, NAFNet, and KBNet were set to , , , and , respectively. The number of epochs for all models was . The implementation details are provided in the supplementary material.
4.3 Results
The results are shown in Tab. 2 and Fig. 4. The table shows that our method, which uses CGNet as the base network, recorded the highest scores in terms of PSNR, SSIM, and IoU. As shown in Fig. 4, both sound-field and silhouette regions were denoised by all models except DnCNN. In the segmentation results, small objects that could not be detected by KBNet were successfully detected by ours. From these results, we confirmed that using CGNet as the base network is valid for joint denoising and segmentation of sound-field images.
To analyze the denoising ability, we investigated the effect of the input SNRs of the sound-field regions on the PSNRs, as plotted in Fig. 5(a). The horizontal axis shows the input SNRs of the sound-field regions, and the vertical axis shows the output PSNRs. Except for DnCNN (blue dots), the output PSNRs positively correlated with the input SNRs. Although the models other than DnCNN performed similarly at low input SNRs, ours (purple dots) showed improved output PSNRs at high input SNRs, around to dB.
To analyze the segmentation ability, we investigated the effect of the percentage of silhouette area on the IoUs, as plotted in Fig. 5(b). The percentage of silhouette area was calculated as the number of pixels in the object area divided by the total number of pixels in the image. Except for DnCNN (blue dots), the number of data points with low IoUs tended to increase as the size of the object area decreased. For ours (purple dots), many data points remained near an IoU of 1.0 even where the percentages of silhouette area were small, around to %.



5 Evaluation for Experimental Data
To confirm the applicability to experimentally measured data, we applied our method to two types of experimental data: (1) a sound field diffracted by a thin plate, and (2) a sound field generated by wooden finger castanets, a percussive musical instrument. All data were captured by parallel phase-shifting interferometry (PPSI) [5], which is often used for sound-field imaging because of its high sensitivity and spatial resolution. PPSI can capture sound-field images within an area of 100 mm in diameter by using a high-speed polarization camera [18].
5.1 Sound Field with Reflection and Diffraction
Experimental Setup
We recorded a sound field with reflection and diffraction, where a reflector was set in front of a loudspeaker. The schematic diagram of the setup is shown in Fig. 6(a). The loudspeaker (NS-BP200, YAMAHA) was set outside the measurement area. The reflector was set inside the measurement area, 150 mm from the loudspeaker. The dimensions of the reflector were 150 mm in height, 10 mm in width, and 200 mm in depth. A 20-kHz sinusoidal wave was emitted from the loudspeaker. The frame rate of the camera was set to 50,000 frames per second. A frequency spectrum of the captured data is shown in Fig. 6(b). The spectrum was calculated from the time-series signal at a single pixel index $(h, w)$, where $h$ and $w$ are the pixel coordinates in height and width, respectively. Since the sound emitted from the loudspeaker was a 20-kHz sinusoidal wave, the spectrum exhibits a distinct peak at 20 kHz.
Results
The results are shown in Fig. 6(c). The top row is the input data from to ms with an interval of ms. The sound wave propagated from right to left in the imaging area, with reflection and diffraction occurring around the reflector. Since the color range was kept consistent across all conditions for fair visualization, the denoised results of DnCNN are saturated; it therefore performed the worst. Although the denoised results of LRDUNet seem to visualize the sound wave clearly, the amplitudes of the diffracted waves appear larger than in the input data. With ours, noise was effectively removed while the amplitudes of the sound waves remained close to those of the input data. Ours also excelled in denoising the silhouette regions compared with the other models. In terms of segmentation, although there were undetected pixels, ours produced the fewest.
5.2 Sound Field with Sound Source Object
Experimental Setup
We also conducted denoising and segmentation for a sound field including a sound-source object. We used wooden finger castanets, a percussive musical instrument, as the sound source. The schematic diagram of the experimental setup is shown in Fig. 7(a). The castanets were installed so that their edge was included in the measurement area. The castanets were played by a human hand, and the sound was recorded with the PPSI system at 20,000 frames per second. A frequency spectrum of the captured data is shown in Fig. 7(b). The spectrum was calculated from the time-series signal at a single pixel. The spectrum exhibits a broad frequency distribution, indicating that the signal contained multiple frequencies.
Results
The results are shown in Fig. 7(c). The top row is the input data from to ms with intervals of ms. The sound-wave propagation from the castanets, located at the bottom-left corner, can be seen. Since the color range was consistent across all conditions, the denoised results of DnCNN are saturated. All networks except DnCNN could remove the fine noise, and the sound wave was smoothed. With KBNet, the shape of the wavefront became more rounded, especially at and ms, whereas ours kept the shape of the wavefront. Although segmentation did not work well for any of the models, ours could capture the edge of the castanets. Post-processing, such as dilation, would fill the holes.
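A minimal sketch of such dilation-based post-processing, using SciPy morphology on a binary segmentation mask, is given below; the number of iterations is an assumption.

```python
# Minimal sketch: fill small undetected pixels in the silhouette mask by dilation.
import numpy as np
from scipy.ndimage import binary_dilation

def fill_small_holes(mask, iterations=2):
    """Dilate the binary silhouette mask; binary_closing could be used to avoid growing edges."""
    return binary_dilation(mask.astype(bool), iterations=iterations)
```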
6 Comparison of Single-tasking and Multitasking
We evaluated how carrying out denoising and segmentation simultaneously changes the accuracy compared with carrying out each as a single task. To do so, denoising and segmentation were conducted separately on the basis of CGNet. The results are shown in Tab. 3. The performance of multitask denoising and segmentation was slightly better than that of single-tasking. We also confirmed that multitasking can be implemented with minimal impact on the inference time and model size for a single-image input.
For further investigation, PSNRs as a function of input SNR and IoUs as a function of the percentage of the object-silhouette region are shown in Fig. 8. From Fig. 8(a), there was no significant difference in denoising performance between single-tasking and multitasking. On the other hand, according to Fig. 8(b), the multitasking IoUs (blue dots) tended to be higher than the single-tasking IoUs (orange dots) where the object-silhouette areas were less than 10%.
Task | PSNR [dB] | SSIM | IoU | Inference time [ms] | Model size [MB]
---|---|---|---|---|---
Denoising | 43.1 | 0.986 | - | 21.99 | 1243.87
Segmentation | - | - | 0.984 | 21.82 | 1243.74
Ours | 43.2 | 0.987 | 0.986 | 22.10 | 1244.01

7 Conclusions
We proposed a denoising and segmentation method for 2D sound-field images with object silhouettes. To handle the sound fields with acoustic scattering by objects, we created a dataset through acoustic simulation. Multitasking was realized by using the output of the final layer of CGNet as channels for denoising and segmentation and calculating the loss function for each task. We confirmed that the proposed method can be applied to both simulated and experimental data. We believe that this method can be used for analyzing sound fields with interacting objects and for sonars in self-driving vehicles and assistive robots. Future work includes improving segmentation performance for experimental data, evaluating with recent segmentation architectures, and enabling moving-object segmentation.
References
- [1] Tim-Oliver Buchholz, Mangal Prakash, Alexander Krull, and Florian Jug. DenoiSeg: Joint denoising and segmentation. pages 324–337, 2020.
- [2] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. In ECCV, pages 17–33, 2022.
- [3] Wensu Chen, Hong Hao, Dylan Hughes, Yanchao Shi, Jian Cui, and Zhong-Xian Li. Static and dynamic mechanical properties of expanded polystyrene. Materials & Design, 69:170–180, 2015.
- [4] Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161–187, 2017.
- [5] Nachanant Chitanont, Kohei Yatabe, Kenji Ishikawa, and Yasuhiro Oikawa. Spatio-temporal filter bank for visualizing audible sound field by schlieren method. Applied Acoustics, 115:109–120, 2017.
- [6] Abe Davis, Michael Rubinstein, Neal Wadhwa, Gautham Mysore, Frédo Durand, and William T. Freeman. The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics (Proc. SIGGRAPH), 33(4):79:1–79:10, 2014.
- [7] Amirhosein Ghasemabadi, Muhammad Kamran Janjua, Mohammad Salameh, Chunhua Zhou, Fengyu Sun, and Di Niu. CascadedGaze. https://github.com/Ascend-Research/CascadedGaze, 2024. Accessed on 1st, July, 2024.
- [8] Amirhosein Ghasemabadi, Muhammad Kamran Janjua, Mohammad Salameh, Chunhua Zhou, Fengyu Sun, and Di Niu. CascadedGaze: Efficiency in global context extraction for image restoration. Transactions on Machine Learning Research, 20 pages, 2024.
- [9] Javier Gurrola-Ramos, Oscar Dalmau, and Teresa Alarcón. U-Net based neural network for fringe pattern denoising. Optics and Lasers in Engineering, 149:14 pages, 2022.
- [10] Michael J. Hargather, Gary S. Settles, and Matthew J. Madalis. Schlieren imaging of loud sounds and weak shock waves in air near the limit of visibility. Shock Waves, 20:9–17, 2010.
- [11] Denny Hermawanto, Kenji Ishikawa, Kohei Yatabe, and Yasuhiro Oikawa. Determination of microphone acoustic center from sound field projection measured by optical interferometry. The Journal of the Acoustical Society of America, 153(2):1138–1146, 2023.
- [12] The MathWorks Inc. MATLAB version: 24.1.0.2603908 (R2024a).
- [13] Kenji Ishikawa, Daiki Takeuchi, Noboru Harada, and Takehiro Moriya. Deep sound-field denoiser: optically-measured sound-field denoising using deep neural network. Optics Express, 31:33405–33420, 2023.
- [14] Kenji Ishikawa, Kohei Yatabe, Nachanant Chitanont, Yusuke Ikeda, Yasuhiro Oikawa, Takashi Onuma, Hayato Niwa, and Minoru Yoshii. High-speed imaging of sound using parallel phase-shifting interferometry. Optics Express, 24(12):12922–12932, Jun 2016.
- [15] Triantafillos Koukoulas and Ben Piper. Towards direct realisation of the SI unit of sound pressure in the audible hearing range based on optical free-field acoustic particle measurements. Applied Physics Letters, 106(16):164101, 2015.
- [16] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. Noise2Void - learning denoising from single noisy images. In CVPR, 2019.
- [17] David B. Lindell, Gordon Wetzstein, and Vladlen Koltun. Acoustic non-line-of-sight imaging. In CVPR, June 2019.
- [18] Takashi Onuma and Yukitoshi Otani. A development of two-dimensional birefringence distribution measurement system with a sampling rate of 1.3MHz. Optics Communications, 315:69–73, 2014.
- [19] Tatsuya Sakoda and Yoshito Sonoda. Visualization of sound field with uniform phase distribution using laser beam microphone coupled with computerized tomography method. Acoustical Science and Technology, 29(4):295–299, 2008.
- [20] Mark Sheinin, Dorian Chan, Matthew O’Toole, and Srinivasa G. Narasimhan. Dual-shutter optical vibration sensing. In CVPR, 2022.
- [21] Yoshito Sonoda and Yoichi Nakazono. Development of optophone with no diaphragm and application to sound measurement in jet flow. Advances in Acoustics and Vibration, 2012(1):17 pages, 2012.
- [22] Yuki Takase, Kazuki Shimizu, Shogo Mochida, Tomoyoshi Inoue, Kenzo Nishio, Sudheesh K. Rajput, Osamu Matoba, Peng Xia, and Yasuhiro Awatsuji. High-speed imaging of the sound field by parallel phase-shifting digital holography. Applied Optics, 60(4):A179–A187, 2021.
- [23] Risako Tanigawa, Kohei Yatabe, and Yasuhiro Oikawa. Guided-spatio-temporal filtering for extracting sound from optically measured images containing occluding objects. In ICASSP, pages 945–949, 2019.
- [24] Risako Tanigawa, Kohei Yatabe, and Yasuhiro Oikawa. Experimental visualization of aerodynamic sound sources using parallel phase-shifting interferometry. Experiments in Fluids, 61(206):10 pages, 2020.
- [25] Roberto Tempo, Giuseppe Calafiore, and Fabrizio Dabbene. Random Number and Variate Generation, pages 193–215. Springer London, London, 2013.
- [26] Bradley E. Treeby and Benjamin T. Cox. k-Wave: MATLAB toolbox for the simulation and reconstruction of photoacoustic wave fields. Journal of Biomedical Optics, 15:12 pages, 2010.
- [27] Samuel A. Verburg, Kenji Ishikawa, Efren Fernandez-Grande, and Yasuhiro Oikawa. A century of acousto-optics: from early discoveries to modern sensing of sound with light. Acoustics Today, 19(3):54–62, 2023.
- [28] Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li. Uformer: A general u-shaped transformer for image restoration. In CVPR, pages 17683–17693, June 2022.
- [29] Kohei Yatabe, Risako Tanigawa, Kenji Ishikawa, and Yasuhiro Oikawa. Time-directional filtering of wrapped phase for observing transient phenomena with parallel phase-shifting interferometry. Optics Express, 26:13705–13720, 2018.
- [30] Syed Waqas Zamir, Aditya Arora, Salman Hameed Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In CVPR, pages 5728–5739, June 2022.
- [31] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
- [32] Yi Zhang, Dasong Li, Xiaoyu Shi, Dailan He, Kangning Song, Xiaogang Wang, Honwei Qin, and Hongsheng Li. KBNet: Kernel basis network for image restoration. arXiv preprint arXiv:2303.02881, 2023.
Supplemental Material
We summarize the content of the supplementary material as follows. Section S1 presents the issue with using existing denoising/segmentation methods, supporting the motivation of our task. Section S2 provides details on the creation of noise data for silhouette regions based on experimental data. Section S3 presents the preliminary experiment used to determine the loss functions. Section S4 provides the implementation details of the compared models. Section S5 evaluates the denoising performance for sound fields without objects.
S1 Issue with Existing Methods
DSFD [4] focuses only on sound fields without objects. Thus, it cannot be directly applied to a sound field with objects. To provide evidence for this, denoising results for sound-field images with object silhouettes are shown in Fig. S1. The denoising was carried out by DSFD trained with the without-silhouette (w/o silhouette) dataset. The trained model was obtained from the publicly available GitHub repository of the authors of DSFD [3]. We created evaluation data that included object silhouettes. Sound waves appear inside the silhouette regions in the second-row images; therefore, DSFD cannot properly denoise sound-field images, especially in the silhouette regions.


For segmentation, it may seem natural to use a foundation model designed for natural-image segmentation. To confirm the applicability of such a model, we conducted a preliminary experiment. The Segment Anything Model (SAM) [6] was used to estimate zero-shot segmentation labels. To handle sound-field data with SAM, we extracted only the real-part channel of the input sound-field images (floating-point numbers), converted them to values ranging from 0 to 255, and treated them as one-channel images similar to grayscale images. Subsequently, these images were converted to RGB for input into SAM. The segmentation results are shown in Fig. S2. The top two rows are the input and segmented images for noisy data, the next two rows are the input and segmented images for clean data, and the last row is the ground truth of the segmentation labels. The segmentation masks obtained by SAM are visualized by overlaying randomly assigned colors for each mask on the input images; the same color therefore represents a single segmentation mask. For the results with noisy images as input (see the second row of Fig. S2), the segmentation does not perform well where the noise in the input images is high, for example, in the second, fourth, sixth, and ninth columns from the left. For the results with clean images as input (see the fourth row of Fig. S2), there are no images in which the object silhouettes are entirely unsegmented. However, some images show multiple segments within the same object silhouette, for example, in the first, second, and fourth columns from the left. From these results, it can be concluded that even with SAM, a foundation model for image segmentation, zero-shot segmentation on noisy data is ineffective, and the performance even on clean data is inadequate. Hence, we considered the task of jointly training and inferring denoising and segmentation.
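For reference, the preprocessing used for this SAM experiment can be sketched as follows; the min-max rescaling is an assumption beyond what is stated above.

```python
# Minimal sketch: convert the real part of a sound-field image to an 8-bit RGB image for SAM.
import numpy as np

def to_sam_input(field_real):
    lo, hi = field_real.min(), field_real.max()
    gray = ((field_real - lo) / (hi - lo + 1e-12) * 255.0).astype(np.uint8)  # 0-255 grayscale
    return np.stack([gray, gray, gray], axis=-1)                             # (H, W, 3) RGB
```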
S2 Noise Creation based on Experimental Data
As mentioned in the main paper regarding dataset creation, we calculated the noise for silhouette regions from experimentally obtained data. In this section, we provide supplemental information for data collection and noise creation.
We estimated the probability density function (PDF) by kernel density estimation (KDE) on the basis of experimentally measured data. The data were collected by parallel phase-shifting interferometry (PPSI) [5]. The experimental setup is shown in Fig. S3(a). We installed a shielding object between two optical flats and recorded the data five times. The frame rate of the high-speed camera in the PPSI system was set to 20,000 frames per second, and 200 images were collected for each recording. To remove low-frequency noise, a high-pass filter with a 500-Hz cut-off frequency was applied to the recorded images along the time direction. The real and imaginary parts of the Fourier-transformed data were regarded as one image. Each pixel value was regarded as one sample, and 28,800,000 samples in total were used for estimating the PDF. Histograms of the measured data and the estimated PDF are shown in Fig. S3(b). There is good agreement between the estimated PDF and the histogram of the measured data.
Noise data for the silhouette regions were generated on the basis of the estimated PDF by using the inverse transform sampling method. An example of the generated data is shown in Fig. S3(c). The left and right figures show the measured and generated data, respectively. The generated data were obtained by sampling from the estimated PDF as many values as there are pixels in the image and reshaping them to the image dimensions. We confirmed that the generated noise data were similar to the measured data, except for spatial patterns originating from the optical elements.
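A minimal sketch of this KDE-based noise generation via inverse transform sampling is given below, assuming `samples` holds the measured silhouette-region pixel values; the grid size is an assumption.

```python
# Minimal sketch: estimate a PDF by KDE and draw silhouette noise by inverse transform sampling.
import numpy as np
from scipy.stats import gaussian_kde

def make_silhouette_noise(samples, shape, n_grid=4096):
    kde = gaussian_kde(samples)
    grid = np.linspace(samples.min(), samples.max(), n_grid)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]                                   # normalize to a valid CDF
    u = np.random.uniform(size=np.prod(shape))       # uniform samples in [0, 1)
    noise = np.interp(u, cdf, grid)                  # numerically invert the CDF
    return noise.reshape(shape)
```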

S3 Preliminary experiment for loss function
To determine the loss functions for the proposed method, we conducted a preliminary experiment comparing the performance of different loss functions. For the denoising loss, mean squared error (MSE), mean absolute error (L1), and negative peak signal-to-noise ratio (N-PSNR) losses were compared. For the segmentation loss, binary cross-entropy (BCE) and balanced BCE and Dice (BCE+Dice) losses were compared. We conducted training and evaluation with all six combinations of the three denoising losses and two segmentation losses. The evaluation result is shown in Tab. S1. The weighting coefficient was set so that the magnitudes of the loss values were roughly matched. Using N-PSNR as the denoising loss gave the best denoising performance, and using BCE+Dice as the segmentation loss gave the best segmentation performance. Since the combination of N-PSNR and BCE+Dice achieved the best performance in both denoising and segmentation, we selected these as the loss functions for the proposed method.
Denoising loss | Segmentation loss | PSNR [dB] | SSIM | IoU
---|---|---|---|---
MSE | BCE | 40.8 | 0.983 | 0.981
MSE | BCE+Dice | 41.5 | 0.984 | 0.984
L1 | BCE | 42.2 | 0.986 | 0.980
L1 | BCE+Dice | 42.3 | 0.987 | 0.982
N-PSNR | BCE | 43.2 | 0.987 | 0.985
N-PSNR | BCE+Dice | 43.2 | 0.987 | 0.986
S4 Implementation Details
In this section, the implementation details for the compared models are provided. The following settings were common to all models. All models were implemented in PyTorch. The loss function for segmentation was the combination of the binary cross-entropy loss and Dice loss given in Eq. 2, with the weighting coefficient $\alpha$ set to . The number of epochs was set to . The numbers of channels of the input and output layers were set to two and three, respectively. The parameters that differ between the models are listed below.
DnCNN [7]
The denoising and segmentation model based on DnCNN was implemented by referencing the publicly available code in the DSFD repository [3]. The network architecture was almost the same as in the original paper [7] except for the number of input/output channels. The Adam optimizer was used, where the learning rate was and $\beta_1$ and $\beta_2$ were and , respectively. The exponential learning-rate scheduler was used, where the multiplicative factor was . MSE loss was used as the loss function for denoising.
LRDUNet [2]
The denoising and segmentation model based on LRDUNet was implemented by referencing the publicly available code in the DSFD repository [3]. The network architecture was almost the same as in the original paper [2] except for the number of input/output channels. The Adam optimizer was used, where the learning rate was and $\beta_1$ and $\beta_2$ were and , respectively. The exponential learning-rate scheduler was used, where the multiplicative factor was . L1 loss was used as the loss function for denoising.
NAFNet [1]
The denoising and segmentation model based on NAFNet was implemented by referencing the publicly available code in the DSFD repository [3]. The network architecture was almost the same as in the original paper [1] except for the number of input/output channels. The Adam optimizer was used, where the learning rate was and $\beta_1$ and $\beta_2$ were and , respectively. The exponential learning-rate scheduler was used, where the multiplicative factor was . MSE loss was used as the loss function for denoising.
KBNet [9]
The denoising and segmentation model based on KBNet was implemented by referencing the publicly available code in the KBNet repository [8]. The network architecture was almost the same as in the original paper [9] except for the number of input/output channels. The AdamW optimizer was used, where the learning rate was 3e-4, the weight decay was 1e-4, and $\beta_1$ and $\beta_2$ were and , respectively. Cosine annealing with restarts was used, where the periods of the cosine annealing cycles were set to and , the restart weights were all set to , and the minimum learning rates of the cycles were set to 3e-4 and 1e-6. L1 loss was used as the loss function for denoising.
S5 Evaluation of Denoising Performance for Sound Fields without Objects
To confirm the applicability of the proposed method to sound-field images without object silhouettes, we created an evaluation dataset without objects. The parameters of the dataset, such as the positions, frequencies, and sound pressures of the sound sources, are the same as those of the dataset described in Sec. 3.2 of the main paper except for the existence of objects. The evaluation result is shown in Tab. S2. The model trained with the with-silhouette dataset (w/ silhouettes) was used for the evaluation. For the w/o-silhouette data, the IoU was calculated for class 0 (sound fields). These results indicate that the proposed method can be applied to sound fields without objects even though the network was trained only on data w/ silhouettes.
Evaluation data | PSNR [dB] | SSIM | IoU
---|---|---|---
w/o silhouettes | 43.5 | 0.991 | 1.00 (for class 0)
w/ silhouettes | 43.2 | 0.987 | 0.986 (for class 1)
For further verification, the results of applying our method to experimental data without objects are shown in Fig. S4. In this experiment, sound-field images of a 12-kHz burst wave generated from a loudspeaker (FOSTEX FT48D) [4] were used. The top row is the input data, in which the burst wave propagates from left to right. The noise was eliminated by our method. For segmentation, although all values should be 0 (black), some pixels were falsely detected as the silhouette class (white).

References
- [1] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. In ECCV, pages 17–33, 2022.
- [2] Javier Gurrola-Ramos, Oscar Dalmau, and Teresa Alarcón. U-Net based neural network for fringe pattern denoising. Optics and Lasers in Engineering, 149:14 pages, 2022.
- [3] Kenji Ishikawa, Daiki Takeuchi, Noboru Harada, and Takehiro Moriya. deep-sound-field-denoiser. https://github.com/nttcslab/deep-sound-field-denoiser, 2023. Accessed on 1st, July, 2024.
- [4] Kenji Ishikawa, Daiki Takeuchi, Noboru Harada, and Takehiro Moriya. Deep sound-field denoiser: optically-measured sound-field denoising using deep neural network. Optics Express, 31:33405–33420, 2023.
- [5] Kenji Ishikawa, Kohei Yatabe, Nachanant Chitanont, Yusuke Ikeda, Yasuhiro Oikawa, Takashi Onuma, Hayato Niwa, and Minoru Yoshii. High-speed imaging of sound using parallel phase-shifting interferometry. Optics Express, 24(12):12922–12932, Jun 2016.
- [6] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In ICCV, pages 4015–4026, October 2023.
- [7] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
- [8] Yi Zhang, Dasong Li, Xiaoyu Shi, Dailan He, Kangning Song, Xiaogang Wang, Honwei Qin, and Hongsheng Li. KBNet. https://github.com/zhangyi-3/kbnet, 2023. Accessed on 1st, July, 2024.
- [9] Yi Zhang, Dasong Li, Xiaoyu Shi, Dailan He, Kangning Song, Xiaogang Wang, Honwei Qin, and Hongsheng Li. KBNet: Kernel basis network for image restoration. arXiv preprint arXiv:2303.02881, 2023.