
An efficient lightweight signal reconstruction method consisting of the Fast Fourier Transform and a Convolutional-based Autoencoder

Pu-Yun Kow 0000-0001-5718-9316 Department of Bioenvironmental System Engineering, National Taiwan University, Taipei 106, Taiwan. [email protected]  and  Pu-Zhao Kow 0000-0002-2990-3591 Department of Mathematical Sciences, National Chengchi University, Taipei 116, Taiwan. [email protected]
Abstract.

The main theme of this paper is the reconstruction of audio signals from interrupted measurements. We present a lightweight model consisting only of the discrete Fourier transform and a Convolutional-based Autoencoder (ConvAE), called the FFT-ConvAE model, for the Helsinki Speech Challenge 2024. The FFT-ConvAE model is lightweight (in terms of real-time factor) and efficient (in terms of character error rate), as verified by the organizers. Furthermore, the FFT-ConvAE is a general-purpose model capable of handling all tasks with a unified configuration.

Key words and phrases:
inverse problem, Fourier transform, Convolutional-based Autoencoder (ConvAE), convolutional neural network (CNN), artificial neural network (ANN), artificial intelligence (AI)
2020 Mathematics Subject Classification:
35R25, 35R30

Acknowledgments

This study is supported by the National Science and Technology Council of Taiwan, NSTC 112-2115-M-004-004-MY3.

1. Introduction

In this paper, we are interested in the reconstruction of audio signals from noisy measurements. We participated in the Helsinki Speech Challenge 2024 [LKJS24a] by developing an algorithm consisting of the fast Fourier transform (FFT) and a Convolutional-based Autoencoder model (ConvAE), called the FFT-ConvAE model. We use a free and open-source Python ConvAE implementation from TensorFlow (https://www.tensorflow.org/tutorials/generative/autoencoder). This package is easy to use: one can create and develop new neural networks by simply combining building blocks and setting parameters. We were quite surprised that our lightweight model is also quite efficient, as verified by the organizers; see Section 4 for more details.

The convolutional neural network (CNN) is well known owing to the deep learning concept, see e.g. [KLS+22]. Its unique convolutional layers have a strong ability to extract nonlinear underlying features from the input data. The Autoencoder (AE) is an important artificial intelligence (AI) architecture, which can be used to denoise image datasets as well as time series datasets, see e.g. [GLL+19, WCWW20, XMY16]. The AE is a very powerful tool for handling high-dimensional samples owing to its data compression ability. The Convolutional-based Autoencoder (ConvAE) is constructed based on the architecture of the AE but replaces its hidden layers with CNN layers, see e.g. [CSTK18, KLS+24, WHK+23]. It is worth mentioning that the ConvAE has been applied in different fields; for example, the work [KLS+24] studies the forecast of watershed groundwater levels with satisfactory performance (with $R^{2}>0.7$).

2. Difficulties of the problem

Training datasets for the Helsinki Speech Challenge can be found in the Zenodo repository [LKJS24b], and the test datasets were also provided in the same Zenodo repository after the official results were published. The organizers designed 7 filtering experiments (called "Task 1") as well as 3 reverb experiments (called "Task 2"). The organizers also designed 2 experiments combining the filtering and reverb setups (called "Task 3"). More details about the data can be found in [LKJS24a].

The audio is recorded with a sample rate of $16\,{\rm kHz}$, i.e. $16000$ samples are recorded per second, and each sample is a floating-point number between $-1$ and $1$ (as a NumPy array). We handle the audio in the 16-bit integer (16-bit PCM) format, i.e. we multiply the audio signal by $32767$ and round to the closest integer (there are exactly $2^{16}=65536$ integers ranging from $-32768$ to $32767$, which explains the term "16 bit"), in order to increase storage efficiency without compromising playback quality. In other words, each audio signal can be represented by an integer-valued vector. The length of an audio signal, represented by an integer-valued vector $\mathbf{v}=(v_{1},\cdots,v_{\ell})$, is defined by ${\rm length}(\mathbf{v}):=\ell$. The dimension of a task/level is defined by $\max\ell$, where the maximum is taken over all audio signals $\mathbf{v}=(v_{1},\cdots,v_{\ell})$ corresponding to the task/level. The dimension of each task/level is shown in Table 2.1.

Level | Task 1 | Task 2 | Task 3
Level 1 | 233340 | 291517 | 301117
Level 2 | 240252 | 301117 | 296893
Level 3 | 223740 | 296893 |
Level 4 | 250236 | |
Level 5 | 220668 | |
Level 6 | 238716 | |
Level 7 | 246012 | |
Table 2.1. Dimension of data
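To illustrate the 16-bit PCM conversion described above, here is a minimal NumPy sketch (the variable and function names are ours; this is not code from the submitted pipeline):

import numpy as np

def float_to_pcm16(x):
    """Map samples in [-1, 1] to 16-bit integers by scaling with 32767 and rounding."""
    return np.round(np.clip(x, -1.0, 1.0) * 32767).astype(np.int16)

def pcm16_to_float(v):
    """Map 16-bit integer samples back to floating-point values in [-1, 1]."""
    return v.astype(np.float64) / 32767.0

x = np.array([0.0, 0.5, -0.25, 1.0])   # four samples of a hypothetical signal
v = float_to_pcm16(x)                   # array([     0,  16384,  -8192,  32767], dtype=int16)
assert np.max(np.abs(pcm16_to_float(v) - x)) < 1e-4   # round trip loses almost nothing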

The main difficulty of this problem is that we need to handle many samples subject to a low real-time factor (RTF), defined as the processing time divided by the audio length. In other words, we need to handle high-dimensional data using only lightweight models. According to the additional rules of the challenge [LKJS24a, Section 5.2], the RTF must average no more than 3 (meaning that at most 3 seconds are allowed to process each second of audio signal), and participants are encouraged to create lightweight models.

It is interesting to mention that the Nyquist-Shannon criterion gives a sufficient condition on the sample rate $f_{s}$ that permits a discrete sequence of samples to capture "almost all" information from a continuous-time signal of finite bandwidth [Sha49]: if a continuous-time signal $x(t)$ contains no frequencies higher than $B$ hertz (Hz), then an "almost perfect" reconstruction is guaranteed to be possible for a bandlimit

B<\frac{f_{s}}{2}.

For example, the CD audio sample rate is $44.1\,{\rm kHz}$, which captures frequencies up to $22050\,{\rm Hz}$; this is enough since humans can only hear frequencies ranging from $20\,{\rm Hz}$ to $20\,{\rm kHz}$. In our case, the sample rate of $16\,{\rm kHz}$ can effectively capture frequencies up to $8\,{\rm kHz}$, which is enough to record speech.
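The following minimal NumPy sketch (our own illustration, with tone frequencies chosen only for demonstration) shows the criterion in action: a tone below the Nyquist frequency $f_{s}/2$ is captured at its true frequency, while a tone above it is aliased to a lower frequency.

import numpy as np

fs = 16000                 # sample rate (Hz), as in the challenge data
t = np.arange(fs) / fs     # one second of samples

for f0 in (3000, 9000):    # one tone below and one above the Nyquist frequency fs/2 = 8000 Hz
    x = np.sin(2 * np.pi * f0 * t)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    peak = freqs[np.argmax(spectrum)]
    # The 3000 Hz tone peaks at 3000 Hz; the 9000 Hz tone aliases to 16000 - 9000 = 7000 Hz.
    print(f"input tone: {f0} Hz -> spectral peak: {peak:.0f} Hz")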

3. Methodology

Let a clean audio signal $\mathbf{x}^{\rm clean}\in\mathbb{R}^{m}$ (of length $m\in\mathbb{N}$) and an interrupted audio signal $\mathbf{x}^{\rm interrupt}\in\mathbb{R}^{m}$ be given. Our goal is to find an approximation $\mathbf{x}^{\rm approx}\in\mathbb{R}^{m}$ of $\mathbf{x}^{\rm clean}$. Using the NumPy fast Fourier transform, we first compute their discrete Fourier transforms $\widehat{\mathbf{x}}^{\rm clean}\in\mathbb{C}^{m}$ and $\widehat{\mathbf{x}}^{\rm interrupt}\in\mathbb{C}^{m}$, respectively, given by the formula

\hat{x}_{k}=\sum_{j=1}^{m}x_{j}\exp\left(-2\pi\mathbf{i}\frac{(j-1)(k-1)}{m}\right)\quad\text{for $k=1,\cdots,m$,}

where $\mathbf{i}$ is the imaginary unit, which can be formally understood as $\mathbf{i}=\sqrt{-1}$, and $\exp$ is the complex exponential, which can be defined via Euler's formula. Now it suffices to find an approximation $\widehat{\mathbf{x}}^{\rm approx}\in\mathbb{C}^{m}$ of $\widehat{\mathbf{x}}^{\rm clean}$, since its inverse Fourier transform is exactly the desired approximation $\mathbf{x}^{\rm approx}$.
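In code, this step amounts to a pair of NumPy calls. The following minimal sketch (with variable names of our own choosing) computes the discrete Fourier transform of a signal and checks that the inverse transform recovers it:

import numpy as np

x = np.random.default_rng(0).integers(-32767, 32768, size=16000).astype(np.float64)

x_hat = np.fft.fft(x)             # discrete Fourier transform, a complex vector of length m
x_back = np.fft.ifft(x_hat).real  # inverse transform; the imaginary parts are numerical noise

assert np.allclose(x, x_back)     # reconstruction is exact up to floating-point error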

Based on some mathematical results (see Section 5 below), we decided to seek an approximator of the form

(3.1) \widehat{\mathbf{x}}^{\rm approx}=(\hat{x}_{1}^{\rm approx},\cdots,\hat{x}_{m}^{\rm approx})\quad\text{with}\quad\hat{x}_{j}^{\rm approx}=z_{j}\frac{\hat{x}_{j}^{\rm interrupt}}{\lvert\hat{x}_{j}^{\rm interrupt}\rvert}

for some $z_{j}\in\mathbb{R}$. This is equivalent to using $\left(\lvert\hat{x}_{1}^{\rm interrupt}\rvert,\cdots,\lvert\hat{x}_{m}^{\rm interrupt}\rvert\right)$ to find an approximation $\mathbf{z}=(z_{1},\cdots,z_{m})$ of $\left(\lvert\hat{x}_{1}^{\rm clean}\rvert,\cdots,\lvert\hat{x}_{m}^{\rm clean}\rvert\right)$, and we will use a Convolutional-based Autoencoder (ConvAE) model to do so.
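Concretely, once estimated magnitudes $z_{j}$ are available, (3.1) keeps the phase of the interrupted signal and replaces only the magnitude. A minimal NumPy sketch of this reconstruction step (with predict_magnitude a hypothetical stand-in for the trained ConvAE) reads:

import numpy as np

def reconstruct(x_interrupt, predict_magnitude):
    """Implement (3.1): keep the phase of the interrupted signal, replace only its magnitude."""
    x_hat = np.fft.fft(x_interrupt)
    eps = 1e-12                                   # avoid division by zero for vanishing bins
    phase = x_hat / np.maximum(np.abs(x_hat), eps)
    z = predict_magnitude(np.abs(x_hat))          # estimated magnitudes of the clean spectrum
    x_hat_approx = z * phase
    return np.fft.ifft(x_hat_approx).real         # x^approx, the reconstructed audio signal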

Similar to the Autoencoder (AE), the ConvAE has two phases, called the encoder and the decoder. Suppose that the encoder and decoder have $\ell_{*}$ and $L_{*}$ hidden layers, respectively. Let an injective function $\phi:(0,\infty)\rightarrow\mathbb{R}$ be given, with inverse function $\phi^{-1}:\phi((0,\infty))\rightarrow(0,\infty)$. We begin the encoder with the input

m_{0}=m,\quad\mathbf{y}^{0}:=\left(\phi\left(\lvert\hat{x}_{1}^{\rm interrupt}\rvert\right),\cdots,\phi\left(\lvert\hat{x}_{m}^{\rm interrupt}\rvert\right)\right)\quad\text{and activation functions $\left\{f^{\ell}\right\}_{\ell=1}^{\ell_{*}}$.}

Let $\mathbf{y}^{\ell}\in\mathbb{R}^{m_{\ell}}$, for some $m_{\ell}\in\mathbb{N}$, be the state vector of the $\ell^{\rm th}$ hidden layer, satisfying the relation (in terms of matrix multiplication)

\mathbf{y}^{\ell}=f^{\ell}\left(\mathsf{w}^{\ell}\mathbf{y}^{\ell-1}+\mathbf{a}^{\ell}\right)\quad\text{for all $\ell=1,\cdots,\ell_{*}$}

with (real-valued) matrices $\mathsf{w}^{\ell}\in\mathbb{R}^{m_{\ell}\times m_{\ell-1}}$ and (real-valued) vectors $\mathbf{a}^{\ell}\in\mathbb{R}^{m_{\ell}}$, which are called the weights. Here the $m_{\ell}$ may be distinct and the $\mathsf{w}^{\ell}$ need not be square matrices. The number $m_{\ell}\in\mathbb{N}$ is called the number of neurons in the $\ell^{\rm th}$ hidden layer. If we expand $\mathbf{y}^{\ell}=(y_{1}^{\ell},\cdots,y_{m_{\ell}}^{\ell})\in\mathbb{R}^{m_{\ell}}$, then the number $y_{j}^{\ell}\in\mathbb{R}$ is the state of the $j^{\rm th}$ neuron in the $\ell^{\rm th}$ hidden layer. The last vector $\widetilde{\mathbf{y}}^{\ell_{*}}=(y_{1}^{\ell_{*}},\cdots,y_{m_{\ell_{*}}}^{\ell_{*}})$ is our desired encoded data. Next, we begin the decoder with the input

n_{0}=m_{\ell_{*}},\quad\mathbf{z}^{0}:=\widetilde{\mathbf{y}}^{\ell_{*}}=(y_{1}^{\ell_{*}},\cdots,y_{m_{\ell_{*}}}^{\ell_{*}})\quad\text{and activation functions $\left\{g^{L}\right\}_{L=1}^{L_{*}}$.}

Let $\mathbf{z}^{L}\in\mathbb{R}^{n_{L}}$, for some $n_{L}\in\mathbb{N}$, be the state vector of the $L^{\rm th}$ hidden layer, satisfying the relation (in terms of matrix multiplication)

\mathbf{z}^{L}=g^{L}\left(\mathsf{W}^{L}\mathbf{z}^{L-1}+\mathbf{b}^{L}\right)\quad\text{for all $L=1,\cdots,L_{*}$}

with (real-valued) matrices $\mathsf{W}^{L}\in\mathbb{R}^{n_{L}\times n_{L-1}}$ and (real-valued) vectors $\mathbf{b}^{L}\in\mathbb{R}^{n_{L}}$, which are also called the weights. Finally, the ConvAE terminates and outputs

\mathbf{z}^{L_{*}}=\left(z_{1}^{L_{*}},\cdots,z_{n_{L_{*}}}^{L_{*}}\right),

which is the desired decoded data. In our case, we choose the output dimension $n_{L_{*}}=m$ (and the encoded dimension $m_{\ell_{*}}<m$), and we obtain the approximator (3.1) with

z_{j}=\phi^{-1}\left(z_{j}^{L_{*}}\right)\quad\text{for all $j=1,\cdots,m$.}

See Figure 3.1 for the model architecture of the above-mentioned FFT-ConvAE model (with the choice $\phi(t)=\log t$).


Figure 3.1. Model architecture of the FFT-ConvAE Model

In our case, we choose $f^{\ell}\equiv\mathrm{Id}$ and $g^{L}\equiv\mathrm{Id}$ for every task/level, i.e. our model is linear. We choose $\phi\equiv\mathrm{Id}$ for Level 1, Level 2 and Level 3 of Task 1; see Figure 3.3 for the training of some samples in Task 1 Level 1. We choose $\phi(t)=\log t$ for all other tasks/levels; see Figure 3.4 for plots in different scales, and see Figure 3.2 for the overall performance of the training stage, measured in terms of CER using evaluate.py from the Zenodo repository (https://zenodo.org/records/14007505). We emphasize that we do not train the model using the aforementioned evaluate.py. As shown in (3.1), we do not train the phase of the signal, since the model is highly sensitive to phase shifts and over-fitting often occurred when we tried to train the phase of the signals. Our model is light in terms of the real-time factor (RTF): our RTF is much lower than 1, see Table 3.1 below.
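For readers who wish to experiment, the following is a minimal TensorFlow/Keras sketch in the spirit of the ConvAE described above. It is not the exact architecture of Figure 3.1: the number of layers, filter counts and kernel size are illustrative assumptions. Consistent with our choices, all activations are linear, and the model maps $\phi(\lvert\hat{x}_{j}^{\rm interrupt}\rvert)$ to an approximation of $\phi(\lvert\hat{x}_{j}^{\rm clean}\rvert)$ with $\phi(t)=\log t$:

import numpy as np
import tensorflow as tf

def build_convae(m, kernel_size=9):
    """A small linear ConvAE acting on length-m vectors of log-magnitudes.
    Layer and filter sizes are illustrative only; inputs/targets have shape (batch, m, 1)."""
    inputs = tf.keras.Input(shape=(m, 1))
    # encoder: strided (linear) convolutions compress the spectrum
    h = tf.keras.layers.Conv1D(16, kernel_size, strides=2, padding="same", activation="linear")(inputs)
    h = tf.keras.layers.Conv1D(8, kernel_size, strides=2, padding="same", activation="linear")(h)
    # decoder: transposed convolutions restore the original length (assumes m divisible by 4)
    h = tf.keras.layers.Conv1DTranspose(16, kernel_size, strides=2, padding="same", activation="linear")(h)
    outputs = tf.keras.layers.Conv1DTranspose(1, kernel_size, strides=2, padding="same", activation="linear")(h)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")  # trained on pairs (phi(|x_hat_interrupt|), phi(|x_hat_clean|))
    return model

# phi(t) = log t and its inverse, applied to the Fourier magnitudes
def phi(mag, eps=1e-8):
    return np.log(mag + eps)

def phi_inv(y):
    return np.exp(y)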


Figure 3.2. The performance of training

Task / Level | processing time (seconds) | audio length (seconds) | real-time factor
Task 1 Level 1 | 73 | 2400 | 0.03
Task 1 Level 2 | 93 | 2440 | 0.04
Task 1 Level 3 | 63 | 2444 | 0.03
Task 1 Level 4 | 123 | 2444 | 0.05
Task 1 Level 5 | 63 | 2444 | 0.03
Task 1 Level 6 | 93 | 2444 | 0.04
Task 1 Level 7 | 73 | 2444 | 0.03
Task 2 Level 1 | 53 | 1292 | 0.04
Task 2 Level 2 | 63 | 1120 | 0.06
Task 2 Level 3 | 53 | 1184 | 0.04
Task 3 Level 1 | 53 | 1120 | 0.05
Task 3 Level 2 | 63 | 1120 | 0.06
Table 3.1. Real-time factor (RTF)

Figure 3.3. Task 1 Level 1: blue represents the magnitude of the Fourier-transformed clean signal. Red in (i) represents the filtered signal, and red in (ii) represents the trained signal

Figure 3.4. Sample #16 and Sample #516 in Task 1 Level 4: blue represents the magnitude of the Fourier-transformed clean signal. Red in (i) represents the magnitude of the Fourier-transformed filtered signal, and red in (ii) represents the magnitude of the Fourier-transformed trained signal

4. Results

All results shown in this section were provided by the organizers. The organizers used Mozilla DeepSpeech (https://github.com/mozilla/DeepSpeech) to recognize the speech: evaluate.py takes a sound track (in .wav format) as input, outputs a .txt transcription, and computes the character error rate (CER), defined as the ratio of the number of wrong/missing characters to the total number of characters in the original text. The CER is a real number ranging from 0 (all characters are correct) to 1 (all characters are incorrect). The average CER is shown in Figure 4.1, which is posted on the Helsinki Speech Challenge 2024 official results page (https://blogs.helsinki.fi/helsinki-speech-challenge/results/). We also present the spectrograms, the texts transcribed by evaluate.py, and the CER of some samples in Figures 4.2 and 4.3.
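For illustration only (this is our own sketch, not the official evaluate.py, which additionally performs speech recognition and text normalization), the CER can be computed as a character-level Levenshtein distance divided by the length of the reference text:

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance between the strings divided by len(reference)."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        cur = [i]
        for j, h in enumerate(hypothesis, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(reference)

# example: one wrong character out of eleven
print(cer("hello world", "hello worle"))  # 1/11, roughly 0.09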


Figure 4.1. Our group won second place (labeled as NTU)

We next compare the performance of the FFT-ConvAE in the training phase (see Figure 3.2) and the testing phase (see Figure 4.1). The performance of the FFT-ConvAE remains consistent across all levels of Task 1 during both stages. However, the mean CER increases in Tasks 2 and 3 when comparing the testing phase to the training phase. This is likely due to high-frequency content being contained in the phase of the Fourier transform for these tasks, as the phase of the interrupted signal's Fourier transform is directly used as the phase in the FFT-ConvAE. Additionally, beyond the discrete Fourier transform and the log transform, more advanced data preprocessing techniques should be considered for Tasks 2 and 3, as these tasks are more complex than Task 1.

Figure 4.2 demonstrates that the FFT-ConvAE effectively avoids removing useful signal from the interrupted signal in Sample #11, as the CER remains unchanged after signal reconstruction. However, in Task 1 Level 1, where the audio is relatively less interrupted, the model tends to over-denoise, as observed in Sample #101, resulting in an increased CER. Despite the deterioration in denoising performance for Sample #101 in terms of CER, the FFT-ConvAE successfully captures the high-frequency components of the signal, enhancing the overall audio quality of the interrupted signal. For Task 1 Level 4, Figure 4.3 shows that the FFT-ConvAE effectively reduces the CER after audio reconstruction for both samples (Sample #16 and Sample #516). This improvement is attributed to the model's ability to effectively learn high-frequency information from the clean signal.


Sample #11
Before reconstruction (CER = 0): i have not said the provincial mayor
After reconstruction (CER = 0): i have not said the provincial mayor
True text: I have not, said the Provincial Mayor

Sample #101
Before reconstruction (CER = 0): You need not be prompted to write with the appearance of sorrow for his disappointment.
After reconstruction (CER = 0.0694): you need not be prompted to write that the appearance of sorrow or his disappointment
True text: You need not be prompted to write with the appearance of sorrow for his disappointment

Figure 4.2. Spectrogram, texts transcribed by evaluate.py and CER of (a) Sample #11 and (b) Sample #101 in Task 1 Level 1

Sample #16
Before reconstruction (CER = 0.5): onn about a mateself the difference
After reconstruction (CER = 0.115): those e ye anything about it must have felt the difference
True text: Those who knew any thing about it, must have felt the difference

Sample #516
Before reconstruction (CER = 0.436): noman fhop left my sond still more grose
After reconstruction (CER = 0.128): et only inpruthd lest my sriend still more grave
True text: It only, in truth, left my friend still more grave

Figure 4.3. Spectrogram, texts transcribed by evaluate.py and CER of (a) Sample #16 and (b) Sample #516 in Task 1 Level 4

5. Discussions and related works

It is not surprising to use the discrete Fourier transform to handle audio signals. In practice, it is also difficult to handle audio signals without the discrete Fourier transform; see Figure 5.1 for a demonstration (in which the vanishing gradient effect occurs).


Figure 5.1. Blue and red represent the clean and trained audio signals, respectively, using FFT-ConvAE (left) versus a pure ConvAE without FFT (right)

To address the vanishing gradient problem in deep learning, the discrete Fourier transform emerges as a vital tool. Figure 4.2 highlights the significant discrepancies between the filtered signal and the clean signal in the original scale. However, after applying the Fourier transform (see Figure 3.3(i)), the difference between the filtered and clean signals is noticeably reduced. Furthermore, when the log scale is applied to the magnitude of the Fourier-transformed signal, the discrepancy becomes even smaller. We now explain some mathematical results from [KSZ24] (see also [KRS21]), which provide examples demonstrating some mechanisms of inverse problems.

Given any $f\in L^{2}(\mathcal{S}^{n-1})$ with $n\geq 2$, the corresponding (scaled) Herglotz wave function is formally defined by

A_{\kappa}(f):=\kappa^{\frac{n-1}{2}}\left.P_{\kappa}f\right|_{B_{1}}\quad\text{with}\quad(P_{\kappa}f)(x):=\int_{\mathcal{S}^{n-1}}e^{\mathbf{i}\kappa\omega\cdot x}f(\omega)\,\mathrm{d}S(\omega)\equiv(f\,\mathrm{d}S)\,\widehat{\rule{0.0pt}{6.0pt}}\,(-\kappa x).

By a version of the Agmon-Hörmander estimate [KSZ24, Lemma 2.3], there exists a constant $C=C(n)>0$ such that for any integer $m\geq 0$ one has

\lVert A_{\kappa}f\rVert_{L^{2}(B_{1})}\leq C(Cm\kappa)^{2m}\lVert f\rVert_{H^{-2m}(\mathcal{S}^{n-1})}\quad\text{for all $f\in L^{2}(\mathcal{S}^{n-1})$,}

where $H^{-2m}(\mathcal{S}^{n-1})$ is the standard Hilbert space, which can be defined in terms of the Laplace-Beltrami operator $-\Delta_{\mathcal{S}^{n-1}}$ on $\mathcal{S}^{n-1}$. We use Weyl asymptotics (see e.g. [Tay11, Theorem 8.3.1]) to simplify our quantification. The case $m=0$ can be found in [AH76, Theorem 2.1]. This shows that

(5.1) A_{\kappa}:L^{2}(\mathcal{S}^{n-1})\rightarrow L^{2}(B_{1})

is a bounded linear operator which is compact. In addition, the analyticity of $P_{\kappa}f$ (due to the Paley-Wiener-Schwartz theorem, see e.g. [FJ98, Theorem 10.2.1(i)]) implies that $f$ is uniquely determined by $A_{\kappa}f$; thus (5.1) is injective, and it has a sequence of singular values $\sigma_{j}=\sigma_{j}(A_{\kappa})$ with $\sigma_{1}\geq\sigma_{2}\geq\cdots\rightarrow 0$, see e.g. [KRS21, Proposition 2.3]. In order to simplify our notation, we write $A\lesssim B$ (resp. $A\gtrsim B$ or $A\simeq B$) for $A\leq CB$ (resp. $A\geq C^{-1}B$ or $C^{-1}A\leq B\leq CA$), where $C$ is a constant independent of the asymptotic parameters (here $j$ and $\kappa$). For each $\kappa\geq 1$, it was proved in [KSZ24, Theorem 1.1] that the singular values $\sigma_{j}(A_{\kappa})$ of (5.1) satisfy

(5.2a) \sigma_{j}(A_{\kappa})\simeq 1\quad\text{for all $j\lesssim\kappa^{n-1}$},
(5.2b) \sigma_{j}(A_{\kappa})\lesssim\exp\left(-c\kappa^{-1}j^{\frac{1}{n-1}}\right)\quad\text{for all $j\gtrsim\kappa^{n-1}$},

where the constant $c>0$ and the implied constants are independent of $\kappa$ and $j$. From (5.2a)–(5.2b), by refining the results in [KRS21], it was proved in [KSZ24, Theorem 1.2] that a necessary condition for the existence of a non-decreasing function $t\in\mathbb{R}_{+}\mapsto\omega(t)\in\mathbb{R}_{+}$ with

\lVert f\rVert_{L^{2}(\mathcal{S}^{n-1})}\leq\omega\left(\lVert A_{\kappa}f\rVert_{L^{2}(B_{1})}\right)\quad\text{whenever $\lVert f\rVert_{H^{1}(\mathcal{S}^{n-1})}\leq 1$}

is

(5.3) \omega(t)\gtrsim\max\left\{t,\kappa^{-1}(1+\log(1/t))^{-1}\right\}\quad\text{for all $0<t\lesssim 1$,}

where the implied constants are independent of $\kappa$ and $t$. By inspecting the proof, one sees that the stability bound $\omega(t)\gtrsim t$ follows from (5.2a), while the instability bound $\omega(t)\gtrsim\kappa^{-1}(1+\log(1/t))^{-1}$ follows from (5.2b); therefore (5.2a) and (5.2b) characterize the number of stable and unstable features in the inverse problem. For each fixed $\kappa>0$, we conclude from (5.3) that the inverse problem is ill-posed. However, one can choose a large $\kappa$ to reduce the effect of the instability term $\kappa^{-1}(1+\log(1/t))^{-1}$ as well as to increase the number of stable features in the sense of (5.2a). This is called the increasing resolution phenomenon.
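To indicate, heuristically, where the exponential decay in (5.2b) comes from (this is only an informal sketch under simplifying assumptions, not the argument of [KSZ24]), one may test $A_{\kappa}$ on an $L^{2}$-normalized spherical harmonic $Y_{j}$ with Laplace-Beltrami eigenvalue $\lambda_{j}$, for which $\lVert Y_{j}\rVert_{H^{-2m}(\mathcal{S}^{n-1})}\simeq(1+\lambda_{j})^{-m}$ and, by Weyl asymptotics, $\lambda_{j}\simeq j^{\frac{2}{n-1}}$. The Agmon-Hörmander estimate above then gives

\lVert A_{\kappa}Y_{j}\rVert_{L^{2}(B_{1})}\lesssim(Cm\kappa)^{2m}j^{-\frac{2m}{n-1}}\quad\text{for every integer $m\geq 0$,}

and choosing the integer $m$ proportional to $\kappa^{-1}j^{\frac{1}{n-1}}$ (with a small enough proportionality constant, which is admissible when $j\gtrsim\kappa^{n-1}$) makes the base $Cm\kappa j^{-\frac{1}{n-1}}$ smaller than $1$, so the right-hand side is bounded by $\exp\left(-c\kappa^{-1}j^{\frac{1}{n-1}}\right)$, consistent with (5.2b).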

Similar mechanisms for the linearized inverse acoustic scattering problem were also studied in [KSZ24]. We also remark that one can discuss the stability of inverse problems using a Bayesian approach [FKW24, KW24]. In principle, we believe that many inverse problems have some features which can be stably recovered, while most features are unstable to recover. The rationale behind the choice (3.1) is that we want to train only the stable features in order to keep our model lightweight, and our idea seems to work for Task 1 (the filtering experiments) when compared with the results of the other groups (see Figure 4.1). Unfortunately, our method does not work for Task 2 (the reverb experiments), which means that we still miss some stable features. Interestingly, we still partially improved Task 3 (the combination of filtering and reverb experiments). Comparing the training performance (see Figure 3.2) with the results verified by the organizers (see Figure 4.1), we see that we successfully capture stable features of the filtering experiments, but not of the reverb experiments.

6. Conclusions

The combination of the discrete Fourier transform and the Convolutional-based Autoencoder (FFT-ConvAE) proves to be an effective model for extracting high-frequency components from clean signals, resulting in a significant reduction in CER compared to the interrupted signals. By applying the Fourier transform (possibly in log scale), the discrepancies between the interrupted and clean signals are first substantially reduced. The magnitude of the Fourier-transformed signal is then further refined by the ConvAE, effectively avoiding the vanishing gradient problem and successfully extracting useful high-frequency information from the clean audio. Moreover, our proposed FFT-ConvAE is a general-purpose model capable of handling various tasks across different scenarios. Additionally, it is a lightweight model, with a low real-time factor (RTF), making it highly suitable for practical, everyday applications. Many inverse problems are unstable, but one can still recover some stable features, which can be extracted by carefully preprocessing the data before employing machine learning algorithms.

References

  • [AH76] S. Agmon and L. Hörmander. Asymptotic properties of solutions of differential equations with simple characteristics. J. Analyse Math., 30:1–38, 1976. MR0466902, Zbl:0335.35013, doi:10.1007/BF02786703.
  • [CSTK18] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto. Deep Convolutional Autoencoder-based lossy image compression. PCS, pages 253–257, 2018. doi:10.1109/PCS.2018.8456308.
  • [FJ98] F. G. Friedlander and M. Joshi. Introduction to the theory of distributions. Cambridge University Press, Cambridge, second edition, 1998. MR1721032, Zbl:0971.46024.
  • [FKW24] T. Furuya, P.-Z. Kow, and J.-N. Wang. Consistency of the Bayes method for the inverse scattering problem. Inverse Problems, 40(5), 2024. Paper No. 055001, MR4723841, Zbl:7867314, doi:10.1088/1361-6420/ad3089.
  • [GLL+19] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. van den Hengel. Memorizing normality to detect anomaly: memory-augmented Deep Autoencoder for unsupervised anomaly detection. Proc. IEEE Int. Conf. Comput. Vis., pages 1705–1714, 2019. doi:10.1109/ICCV.2019.00179, arXiv:1904.02639.
  • [KRS21] H. Koch, A. Rüland, and M. Salo. On instability mechanisms for inverse problems. Ars Inven. Anal., 2021. Paper No. 7, 93 pages, MR4462475, Zbl:1482.35002, doi:10.15781/c93s-pk62, arXiv:2012.01855.
  • [KLS+22] P.-Y. Kow, M.-H. Lee, W. Sun, M.-H. Yao, and F.-J. Chang. Integrate deep learning and physically-based models for multi-step-ahead microclimate forecasting. Expert Syst. Appl., 210, 2022. Article number 118481, doi:10.1016/j.eswa.2022.118481.
  • [KLS+24] P.-Y. Kow, J.-Y. Liou, W. Sun, L.-C. Chang, and F.-J. Chang. Watershed groundwater level multistep ahead forecasts by fusing convolutional-based autoencoder and LSTM models. J. Environ. Manag., 351, 2024. Article number 119789, doi:10.1016/j.jenvman.2023.119789.
  • [KSZ24] P.-Z. Kow, M. Salo, and S. Zou. Increasing resolution and instability for linear inverse scattering problems. arXiv preprint, 2024. arXiv:2404.18482.
  • [KW24] P.-Z. Kow and J.-N. Wang. Increasing stability in an inverse boundary value problem – Bayesian viewpoint. Taiwanese J. Math., 2024. 40 pages, doi:10.11650/tjm/240704.
  • [LKJS24a] M. Ludvigsen, E. Karvonen, M. Juvonen, and S. Siltanen. Helsinki Speech Challenge 2024. arXiv preprint, 2024. arXiv:2406.04123.
  • [LKJS24b] M. Ludvigsen, E. Karvonen, M. Juvonen, and S. Siltanen. Helsinki Speech Challenge 2024 open audio dataset. Zenodo, 2024. doi:10.5281/zenodo.14007505.
  • [Sha49] C. E. Shannon. Communication in the presence of noise. Proceedings of the IRE, 37(1):10–21, 1949. doi:10.1109/JRPROC.1949.232969.
  • [Tay11] M. E. Taylor. Partial differential equations I. Basic theory, volume 115 of Applied Mathematical Sciences. Springer, New York, second edition, 2011. MR2744150, Zbl:1206.35002, doi:10.1007/978-1-4419-7055-8.
  • [WCWW20] S. Wang, H. Chen, L. Wu, and J. Wang. A novel smart meter data compression method via stacked convolutional sparse auto-encoder. Int. J. Electr. Power Energy Syst., 118, 2020. Article number 105761. doi:10.1016/j.ijepes.2019.105761.
  • [WHK+23] K.-Y. Wu, I-W. Hsia, P.-Y. Kow, L.-C. Chang, and F.-J. Chang. High-spatiotemporal-resolution PM2.5{\rm PM}_{2.5} forecasting by hybrid deep learning models with ensembled massive heterogeneous monitoring data. J. Clean. Prod., 433, 2023. Article number 139825, doi:10.1016/j.jclepro.2023.139825.
  • [XMY16] C. Xing, L. Ma, and X. Yang. Stacked Denoise Autoencoder based feature extraction and classification for hyperspectral images. J. Sens., 2016. Article number 3632943, 10 pages. doi:10.1155/2016/3632943.