
An efficient lightweight signal reconstruction method consisting of the Fast Fourier Transform and a Convolutional-based Autoencoder

Pu-Yun Kow 0000-0001-5718-9316 Department of Bioenvironmental System Engineering, National Taiwan University, Taipei 106, Taiwan. [email protected]  and  Pu-Zhao Kow 0000-0002-2990-3591 Department of Mathematical Sciences, National Chengchi University, Taipei 116, Taiwan. [email protected]
Abstract.

The main theme of this paper is the reconstruction of audio signals from interrupted measurements. We present a lightweight model consisting only of the discrete Fourier transform and a Convolutional-based Autoencoder (ConvAE), called the FFT-ConvAE model, for the Helsinki Speech Challenge 2024. The FFT-ConvAE model is lightweight (in terms of real-time factor) and efficient (in terms of character error rate), as verified by the organizers. Furthermore, the FFT-ConvAE is a general-purpose model capable of handling all tasks with a unified configuration.

Key words and phrases:
inverse problem, Fourier transform, Convolutional-based Autoencoder (ConvAE), convolutional neural network (CNN), artificial neural network (ANN), artificial intelligence (AI)
2020 Mathematics Subject Classification:
35R25, 35R30

Acknowledgments

This study is supported by the National Science and Technology Council of Taiwan, NSTC 112-2115-M-004-004-MY3.

1. Introduction

In this paper, we are interested in the reconstruction of audio signals from noisy measurements. We participated in the Helsinki Speech Challenge 2024 [LKJS24a] by developing an algorithm consisting of the fast Fourier transform (FFT) and a Convolutional-based Autoencoder model (ConvAE), called the FFT-ConvAE model. We use a free and open-source Python ConvAE implementation from TensorFlow (https://www.tensorflow.org/tutorials/generative/autoencoder). This package is easy to use: one can create and develop new neural networks by simply combining building blocks and setting parameters. We were quite surprised that our lightweight model is also quite efficient, as verified by the organizers; see Section 4 for more details.

The convolutional neural network (CNN) is well known owing to the deep learning concept, see e.g. [KLS+22]. Its unique convolutional layers have a strong ability to extract nonlinear underlying features from the input data. The Autoencoder (AE) is an important artificial intelligence (AI) architecture, which can be used to denoise image datasets as well as time series datasets, see e.g. [GLL+19, WCWW20, XMY16]. The AE is a very powerful tool for handling high-dimensional samples owing to its data compression ability. The Convolutional-based Autoencoder (ConvAE) is constructed based on the architecture of the AE but replaces its hidden layers with CNN layers, see e.g. [CSTK18, KLS+24, WHK+23]. It is worth mentioning that the ConvAE has been applied in different fields; for example, the work [KLS+24] studies the forecast of watershed groundwater levels with satisfactory performance (with $R^{2}>0.7$).

2. Difficulties of the problem

Training datasets for the Helsinki Speech Challenge can be found in the Zenodo repository [LKJS24b], and the test datasets were also provided in the same Zenodo repository after the official results were published. The organizers designed 7 filtering experiments (called "Task 1") as well as 3 reverb experiments (called "Task 2"). The organizers also designed 2 experiments combining the filtering and reverb setups (called "Task 3"). More details about the data can be found in [LKJS24a].

The audio is recorded with a sample rate of $16\,{\rm kHz}$, i.e. $16000$ samples are recorded per second, and each sample is a floating-point number between $-1$ and $1$ (as a NumPy array). We handle the audio in the 16-bit integer (16-bit PCM) format, i.e. we multiply the audio signal by $32767$ and round to the closest integer (there are exactly $2^{16}=65536$ integers ranging from $-32768$ to $32767$, which explains the term "16 bit"), in order to increase storage efficiency without compromising playback quality. In other words, each audio signal can be represented by an integer-valued vector. The length of an audio signal, represented by an integer-valued vector $\mathbf{v}=(v_{1},\cdots,v_{\ell})$, is defined by ${\rm length}(\mathbf{v}):=\ell$. The dimension of a task/level is defined by $\max\ell$, where the maximum is taken over all audio signals $\mathbf{v}=(v_{1},\cdots,v_{\ell})$ corresponding to the task/level. The dimension of each task/level is shown in Table 2.1.

Level | Task 1 | Task 2 | Task 3
Level 1 | 233340 | 291517 | 301117
Level 2 | 240252 | 301117 | 296893
Level 3 | 223740 | 296893 |
Level 4 | 250236 | |
Level 5 | 220668 | |
Level 6 | 238716 | |
Level 7 | 246012 | |
Table 2.1. Dimension of data
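To illustrate the 16-bit PCM conversion described above, here is a minimal NumPy sketch (the variable and function names are ours; this is not code from the submitted pipeline):

import numpy as np

def float_to_pcm16(x):
    """Map samples in [-1, 1] to 16-bit integers by scaling with 32767 and rounding."""
    return np.round(np.clip(x, -1.0, 1.0) * 32767).astype(np.int16)

def pcm16_to_float(v):
    """Map 16-bit integer samples back to floating-point values in [-1, 1]."""
    return v.astype(np.float64) / 32767.0

x = np.array([0.0, 0.5, -0.25, 1.0])   # four samples of a hypothetical signal
v = float_to_pcm16(x)                   # array([     0,  16384,  -8192,  32767], dtype=int16)
assert np.max(np.abs(pcm16_to_float(v) - x)) < 1e-4   # round trip loses almost nothing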

The main difficulty of this problem is that we need to handle many samples subject to a low real-time factor (RTF), defined as the processing time divided by the audio length. In other words, we need to handle high-dimensional data using only lightweight models. According to the additional rules of the challenge [LKJS24a, Section 5.2], the RTF must average no more than 3 (meaning that at most 3 seconds are allowed to process each second of audio signal), and participants are encouraged to create lightweight models.

It is interesting to mention that the Nyquist-Shannon criterion gives a sufficient condition on the sample rate $f_{s}$ that permits a discrete sequence of samples to capture "almost all" information from a continuous-time signal of finite bandwidth [Sha49]: if a continuous-time signal $x(t)$ contains no frequencies higher than $B$ hertz (Hz), then an "almost perfect" reconstruction is guaranteed to be possible for a bandlimit

B<\frac{f_{s}}{2}.

For example, the CD audio sample rate is $44.1\,{\rm kHz}$, which captures frequencies up to $22050\,{\rm Hz}$; this is enough since humans can only hear frequencies ranging from $20\,{\rm Hz}$ to $20\,{\rm kHz}$. In our case, the sample rate of $16\,{\rm kHz}$ can effectively capture frequencies up to $8\,{\rm kHz}$, which is enough to record speech.
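The following minimal NumPy sketch (our own illustration, with tone frequencies chosen only for demonstration) shows the criterion in action: a tone below the Nyquist frequency $f_{s}/2$ is captured at its true frequency, while a tone above it is aliased to a lower frequency.

import numpy as np

fs = 16000                 # sample rate (Hz), as in the challenge data
t = np.arange(fs) / fs     # one second of samples

for f0 in (3000, 9000):    # one tone below and one above the Nyquist frequency fs/2 = 8000 Hz
    x = np.sin(2 * np.pi * f0 * t)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    peak = freqs[np.argmax(spectrum)]
    # The 3000 Hz tone peaks at 3000 Hz; the 9000 Hz tone aliases to 16000 - 9000 = 7000 Hz.
    print(f"input tone: {f0} Hz -> spectral peak: {peak:.0f} Hz")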

3. Methodology

Let a clean audio signal $\mathbf{x}^{\rm clean}\in\mathbb{R}^{m}$ (of length $m\in\mathbb{N}$) and an interrupted audio signal $\mathbf{x}^{\rm interrupt}\in\mathbb{R}^{m}$ be given. Our goal is to find an approximation $\mathbf{x}^{\rm approx}\in\mathbb{R}^{m}$ of $\mathbf{x}^{\rm clean}$. Using the NumPy fast Fourier transform, we first compute their discrete Fourier transforms $\widehat{\mathbf{x}}^{\rm clean}\in\mathbb{C}^{m}$ and $\widehat{\mathbf{x}}^{\rm interrupt}\in\mathbb{C}^{m}$, respectively, given by the formula

\hat{x}_{k}=\sum_{j=1}^{m}x_{j}\exp\left(-2\pi\mathbf{i}\frac{(j-1)(k-1)}{m}\right)\quad\text{for $k=1,\cdots,m$,}

where $\mathbf{i}$ is the imaginary unit, which can be formally understood as $\mathbf{i}=\sqrt{-1}$, and $\exp$ is the complex exponential, which can be defined via Euler's formula. Now it suffices to find an approximation $\widehat{\mathbf{x}}^{\rm approx}\in\mathbb{C}^{m}$ of $\widehat{\mathbf{x}}^{\rm clean}$, since its inverse Fourier transform is exactly the desired approximation $\mathbf{x}^{\rm approx}$.
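In code, this step amounts to a pair of NumPy calls. The following minimal sketch (with variable names of our own choosing) computes the discrete Fourier transform of a signal and checks that the inverse transform recovers it:

import numpy as np

x = np.random.default_rng(0).integers(-32767, 32768, size=16000).astype(np.float64)

x_hat = np.fft.fft(x)             # discrete Fourier transform, a complex vector of length m
x_back = np.fft.ifft(x_hat).real  # inverse transform; the imaginary parts are numerical noise

assert np.allclose(x, x_back)     # reconstruction is exact up to floating-point error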

Based on some mathematical results (see Section 5 below), we decided to seek an approximator of the form

(3.1) \widehat{\mathbf{x}}^{\rm approx}=(\hat{x}_{1}^{\rm approx},\cdots,\hat{x}_{m}^{\rm approx})\quad\text{with}\quad\hat{x}_{j}^{\rm approx}=z_{j}\frac{\hat{x}_{j}^{\rm interrupt}}{\lvert\hat{x}_{j}^{\rm interrupt}\rvert}

for some $z_{j}\in\mathbb{R}$. This is equivalent to using $\left(\lvert\hat{x}_{1}^{\rm interrupt}\rvert,\cdots,\lvert\hat{x}_{m}^{\rm interrupt}\rvert\right)$ to find an approximation $\mathbf{z}=(z_{1},\cdots,z_{m})$ of $\left(\lvert\hat{x}_{1}^{\rm clean}\rvert,\cdots,\lvert\hat{x}_{m}^{\rm clean}\rvert\right)$, and we will use a Convolutional-based Autoencoder (ConvAE) model to do so.
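Concretely, once estimated magnitudes $z_{j}$ are available, (3.1) keeps the phase of the interrupted signal and replaces only the magnitude. A minimal NumPy sketch of this reconstruction step (with predict_magnitude a hypothetical stand-in for the trained ConvAE) reads:

import numpy as np

def reconstruct(x_interrupt, predict_magnitude):
    """Implement (3.1): keep the phase of the interrupted signal, replace only its magnitude."""
    x_hat = np.fft.fft(x_interrupt)
    eps = 1e-12                                   # avoid division by zero for vanishing bins
    phase = x_hat / np.maximum(np.abs(x_hat), eps)
    z = predict_magnitude(np.abs(x_hat))          # estimated magnitudes of the clean spectrum
    x_hat_approx = z * phase
    return np.fft.ifft(x_hat_approx).real         # x^approx, the reconstructed audio signal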

Similar to the Autoencoder (AE), the ConvAE has two phases, called the encoder and the decoder. Suppose that the encoder and decoder have $\ell_{*}$ and $L_{*}$ hidden layers, respectively. Let an injective function $\phi:(0,\infty)\rightarrow\mathbb{R}$ be given, with inverse function $\phi^{-1}:\phi((0,\infty))\rightarrow(0,\infty)$. We begin the encoder with the input

m_{0}=m,\quad\mathbf{y}^{0}:=\left(\phi\left(\lvert\hat{x}_{1}^{\rm interrupt}\rvert\right),\cdots,\phi\left(\lvert\hat{x}_{m}^{\rm interrupt}\rvert\right)\right)\quad\text{and activation functions $\left\{f^{\ell}\right\}_{\ell=1}^{\ell_{*}}$.}

Let $\mathbf{y}^{\ell}\in\mathbb{R}^{m_{\ell}}$, for some $m_{\ell}\in\mathbb{N}$, be the state vector of the $\ell^{\rm th}$ hidden layer, satisfying the relation (in terms of matrix multiplication)

\mathbf{y}^{\ell}=f^{\ell}\left(\mathsf{w}^{\ell}\mathbf{y}^{\ell-1}+\mathbf{a}^{\ell}\right)\quad\text{for all $\ell=1,\cdots,\ell_{*}$}

with (real-valued) matrices $\mathsf{w}^{\ell}\in\mathbb{R}^{m_{\ell}\times m_{\ell-1}}$ and (real-valued) vectors $\mathbf{a}^{\ell}\in\mathbb{R}^{m_{\ell}}$, which are called the weights. Here the $m_{\ell}$ may be distinct and the $\mathsf{w}^{\ell}$ need not be square matrices. The number $m_{\ell}\in\mathbb{N}$ is called the number of neurons in the $\ell^{\rm th}$ hidden layer. If we expand $\mathbf{y}^{\ell}=(y_{1}^{\ell},\cdots,y_{m_{\ell}}^{\ell})\in\mathbb{R}^{m_{\ell}}$, then the number $y_{j}^{\ell}\in\mathbb{R}$ is the state of the $j^{\rm th}$ neuron in the $\ell^{\rm th}$ hidden layer. The last vector $\widetilde{\mathbf{y}}^{\ell_{*}}=(y_{1}^{\ell_{*}},\cdots,y_{m_{\ell_{*}}}^{\ell_{*}})$ is our desired encoded data. Next, we begin the decoder with the input

n_{0}=m_{\ell_{*}},\quad\mathbf{z}^{0}:=\widetilde{\mathbf{y}}^{\ell_{*}}=(y_{1}^{\ell_{*}},\cdots,y_{m_{\ell_{*}}}^{\ell_{*}})\quad\text{and activation functions $\left\{g^{L}\right\}_{L=1}^{L_{*}}$.}

Let $\mathbf{z}^{L}\in\mathbb{R}^{n_{L}}$, for some $n_{L}\in\mathbb{N}$, be the state vector of the $L^{\rm th}$ hidden layer, satisfying the relation (in terms of matrix multiplication)

\mathbf{z}^{L}=g^{L}\left(\mathsf{W}^{L}\mathbf{z}^{L-1}+\mathbf{b}^{L}\right)\quad\text{for all $L=1,\cdots,L_{*}$}

with (real-valued) matrices $\mathsf{W}^{L}\in\mathbb{R}^{n_{L}\times n_{L-1}}$ and (real-valued) vectors $\mathbf{b}^{L}\in\mathbb{R}^{n_{L}}$, which are also called the weights. Finally, the ConvAE terminates and outputs

\mathbf{z}^{L_{*}}=\left(z_{1}^{L_{*}},\cdots,z_{n_{L_{*}}}^{L_{*}}\right),

which is the desired decoded data. In our case, we choose the output dimension $n_{L_{*}}=m$ (and the encoded dimension $m_{\ell_{*}}<m$), and we obtain the approximator (3.1) with

z_{j}=\phi^{-1}\left(z_{j}^{L_{*}}\right)\quad\text{for all $j=1,\cdots,m$.}

See Figure 3.1 for the model architecture of the above-mentioned FFT-ConvAE model (with the choice $\phi(t)=\log t$).


Figure 3.1. Model architecture of the FFT-ConvAE Model

In our case, we choose $f^{\ell}\equiv\mathrm{Id}$ and $g^{L}\equiv\mathrm{Id}$ for every task/level, i.e. our model is linear. We choose $\phi\equiv\mathrm{Id}$ for Level 1, Level 2 and Level 3 of Task 1; see Figure 3.3 for the training of some samples in Task 1 Level 1. We choose $\phi(t)=\log t$ for all other tasks/levels; see Figure 3.4 for plots in different scales, and see Figure 3.2 for the overall performance of the training stage, measured in terms of CER using evaluate.py from the Zenodo repository (https://zenodo.org/records/14007505). We emphasize that we do not train the model using the aforementioned evaluate.py. As shown in (3.1), we do not train the phase of the signal, since the model is highly sensitive to phase shifts and over-fitting often occurred when we tried to train the phase of the signals. Our model is light in terms of the real-time factor (RTF): our RTF is much lower than 1, see Table 3.1 below.
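For readers who wish to experiment, the following is a minimal TensorFlow/Keras sketch in the spirit of the ConvAE described above. It is not the exact architecture of Figure 3.1: the number of layers, filter counts and kernel size are illustrative assumptions. Consistent with our choices, all activations are linear, and the model maps $\phi(\lvert\hat{x}_{j}^{\rm interrupt}\rvert)$ to an approximation of $\phi(\lvert\hat{x}_{j}^{\rm clean}\rvert)$ with $\phi(t)=\log t$:

import numpy as np
import tensorflow as tf

def build_convae(m, kernel_size=9):
    """A small linear ConvAE acting on length-m vectors of log-magnitudes.
    Layer and filter sizes are illustrative only; inputs/targets have shape (batch, m, 1)."""
    inputs = tf.keras.Input(shape=(m, 1))
    # encoder: strided (linear) convolutions compress the spectrum
    h = tf.keras.layers.Conv1D(16, kernel_size, strides=2, padding="same", activation="linear")(inputs)
    h = tf.keras.layers.Conv1D(8, kernel_size, strides=2, padding="same", activation="linear")(h)
    # decoder: transposed convolutions restore the original length (assumes m divisible by 4)
    h = tf.keras.layers.Conv1DTranspose(16, kernel_size, strides=2, padding="same", activation="linear")(h)
    outputs = tf.keras.layers.Conv1DTranspose(1, kernel_size, strides=2, padding="same", activation="linear")(h)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")  # trained on pairs (phi(|x_hat_interrupt|), phi(|x_hat_clean|))
    return model

# phi(t) = log t and its inverse, applied to the Fourier magnitudes
def phi(mag, eps=1e-8):
    return np.log(mag + eps)

def phi_inv(y):
    return np.exp(y)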


Figure 3.2. The performance of training

Task / Level | processing time (seconds) | audio length (seconds) | real-time factor
Task 1 Level 1 | 73 | 2400 | 0.03
Task 1 Level 2 | 93 | 2440 | 0.04
Task 1 Level 3 | 63 | 2444 | 0.03
Task 1 Level 4 | 123 | 2444 | 0.05
Task 1 Level 5 | 63 | 2444 | 0.03
Task 1 Level 6 | 93 | 2444 | 0.04
Task 1 Level 7 | 73 | 2444 | 0.03
Task 2 Level 1 | 53 | 1292 | 0.04
Task 2 Level 2 | 63 | 1120 | 0.06
Task 2 Level 3 | 53 | 1184 | 0.04
Task 3 Level 1 | 53 | 1120 | 0.05
Task 3 Level 2 | 63 | 1120 | 0.06
Table 3.1. Real-time factor (RTF)

Figure 3.3. Task 1 Level 1: blue represents the magnitude of the Fourier-transformed clean signal. Red in (i) represents the filtered signal, and red in (ii) represents the trained signal

Figure 3.4. Sample #16 and Sample #516 in Task 1 Level 4: blue represents the magnitude of the Fourier-transformed clean signal. Red in (i) represents the magnitude of the Fourier-transformed filtered signal, and red in (ii) represents the magnitude of the Fourier-transformed trained signal

4. Results

All results shown in this section were provided by the organizers. The organizers used Mozilla DeepSpeech (https://github.com/mozilla/DeepSpeech) to recognize the speech: evaluate.py takes a sound track (in .wav format) as input, outputs a .txt transcription, and computes the character error rate (CER), defined as the ratio of the number of wrong/missing characters to the total number of characters in the original text. The CER is a real number ranging from 0 (all characters are correct) to 1 (all characters are incorrect). The average CER is shown in Figure 4.1, which is posted on the Helsinki Speech Challenge 2024 official results page (https://blogs.helsinki.fi/helsinki-speech-challenge/results/). We also present the spectrograms, the texts transcribed by evaluate.py, and the CER of some samples in Figures 4.2 and 4.3.
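For illustration only (this is our own sketch, not the official evaluate.py, which additionally performs speech recognition and text normalization), the CER can be computed as a character-level Levenshtein distance divided by the length of the reference text:

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance between the strings divided by len(reference)."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        cur = [i]
        for j, h in enumerate(hypothesis, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(reference)

# example: one wrong character out of eleven
print(cer("hello world", "hello worle"))  # 1/11, roughly 0.09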


Figure 4.1. Our group won second place (labeled as NTU)

We next compare the performance of the FFT-ConvAE in the training phase (see Figure 3.2) and the testing phase (see Figure 4.1). The performance of the FFT-ConvAE remains consistent across all levels of Task 1 during both stages. However, the mean CER increases in Tasks 2 and 3 when comparing the testing phase to the training phase. This is likely due to high-frequency content being contained in the phase of the Fourier transform for these tasks, as the phase of the interrupted signal's Fourier transform is directly used as the phase in the FFT-ConvAE. Additionally, beyond the discrete Fourier transform and the log transform, more advanced data preprocessing techniques should be considered for Tasks 2 and 3, as these tasks are more complex than Task 1.

Figure 4.2 demonstrates that the FFT-ConvAE effectively avoids removing useful signal from the interrupted signal in Sample #11, as the CER remains unchanged after signal reconstruction. However, in Task 1 Level 1, where the audio is relatively less interrupted, the model tends to over-denoise, as observed in Sample #101, resulting in an increased CER. Despite the deterioration in denoising performance for Sample #101 in terms of CER, the FFT-ConvAE successfully captures the high-frequency components of the signal, enhancing the overall audio quality of the interrupted signal. For Task 1 Level 4, Figure 4.3 shows that the FFT-ConvAE effectively reduces the CER after audio reconstruction for both samples (Sample #16 and Sample #516). This improvement is attributed to the model's ability to effectively learn high-frequency information from the clean signal.


Sample #11
Before reconstruction (CER = 0): i have not said the provincial mayor
After reconstruction (CER = 0): i have not said the provincial mayor
True text: I have not, said the Provincial Mayor

Sample #101
Before reconstruction (CER = 0): You need not be prompted to write with the appearance of sorrow for his disappointment.
After reconstruction (CER = 0.0694): you need not be prompted to write that the appearance of sorrow or his disappointment
True text: You need not be prompted to write with the appearance of sorrow for his disappointment

Figure 4.2. Spectrogram, texts transcribed by evaluate.py and CER of (a) Sample #11 and (b) Sample #101 in Task 1 Level 1

Sample #16
Before reconstruction (CER = 0.5): onn about a mateself the difference
After reconstruction (CER = 0.115): those e ye anything about it must have felt the difference
True text: Those who knew any thing about it, must have felt the difference

Sample #516
Before reconstruction (CER = 0.436): noman fhop left my sond still more grose
After reconstruction (CER = 0.128): et only inpruthd lest my sriend still more grave
True text: It only, in truth, left my friend still more grave

Figure 4.3. Spectrogram, texts transcribed by evaluate.py and CER of (a) Sample #16 and (b) Sample #516 in Task 1 Level 4

5. Discussions and related works

It is not surprising to use the discrete Fourier transform to handle audio signals. In practice, it is also difficult to handle audio signals without the discrete Fourier transform; see Figure 5.1 for a demonstration (in which the vanishing gradient effect occurs).


Figure 5.1. Blue and red represent the clean and trained audio signals, respectively, using FFT-ConvAE (left) versus a pure ConvAE without FFT (right)

To address the vanishing gradient problem in deep learning, the discrete Fourier transform emerges as a vital tool. Figure 4.2 highlights the significant discrepancies between the filtered signal and the clean signal in the original scale. However, after applying the Fourier transform (see Figure 3.3(i)), the difference between the filtered and clean signals is noticeably reduced. Furthermore, when the log scale is applied to the magnitude of the Fourier-transformed signal, the discrepancy becomes even smaller. We now explain some mathematical results from [KSZ24] (see also [KRS21]), which provide examples demonstrating some mechanisms of inverse problems.

Given any $f\in L^{2}(\mathcal{S}^{n-1})$ with $n\geq 2$, the corresponding (scaled) Herglotz wave function is formally defined by

A_{\kappa}(f):=\kappa^{\frac{n-1}{2}}\left.P_{\kappa}f\right|_{B_{1}}\quad\text{with}\quad(P_{\kappa}f)(x):=\int_{\mathcal{S}^{n-1}}e^{\mathbf{i}\kappa\omega\cdot x}f(\omega)\,\mathrm{d}S(\omega)\equiv(f\,\mathrm{d}S)\,\widehat{\rule{0.0pt}{6.0pt}}\,(-\kappa x).

By a version of the Agmon-Hörmander estimate [KSZ24, Lemma 2.3], there exists a constant $C=C(n)>0$ such that for any integer $m\geq 0$ one has

\lVert A_{\kappa}f\rVert_{L^{2}(B_{1})}\leq C(Cm\kappa)^{2m}\lVert f\rVert_{H^{-2m}(\mathcal{S}^{n-1})}\quad\text{for all $f\in L^{2}(\mathcal{S}^{n-1})$,}

where $H^{-2m}(\mathcal{S}^{n-1})$ is the standard Hilbert space, which can be defined in terms of the Laplace-Beltrami operator $-\Delta_{\mathcal{S}^{n-1}}$ on $\mathcal{S}^{n-1}$. We use Weyl asymptotics (see e.g. [Tay11, Theorem 8.3.1]) to simplify our quantification. The case $m=0$ can be found in [AH76, Theorem 2.1]. This shows that

(5.1) A_{\kappa}:L^{2}(\mathcal{S}^{n-1})\rightarrow L^{2}(B_{1})

is a bounded linear operator which is compact. In addition, the analyticity of $P_{\kappa}f$ (due to the Paley-Wiener-Schwartz theorem, see e.g. [FJ98, Theorem 10.2.1(i)]) implies that $f$ is uniquely determined by $A_{\kappa}f$; thus (5.1) is injective, and it has a sequence of singular values $\sigma_{j}=\sigma_{j}(A_{\kappa})$ with $\sigma_{1}\geq\sigma_{2}\geq\cdots\rightarrow 0$, see e.g. [KRS21, Proposition 2.3]. In order to simplify our notation, we write $A\lesssim B$ (resp. $A\gtrsim B$ or $A\simeq B$) for $A\leq CB$ (resp. $A\geq C^{-1}B$ or $C^{-1}A\leq B\leq CA$), where $C$ is a constant independent of the asymptotic parameters (here $j$ and $\kappa$). For each $\kappa\geq 1$, it was proved in [KSZ24, Theorem 1.1] that the singular values $\sigma_{j}(A_{\kappa})$ of (5.1) satisfy

(5.2a) \sigma_{j}(A_{\kappa})\simeq 1\quad\text{for all $j\lesssim\kappa^{n-1}$},
(5.2b) \sigma_{j}(A_{\kappa})\lesssim\exp\left(-c\kappa^{-1}j^{\frac{1}{n-1}}\right)\quad\text{for all $j\gtrsim\kappa^{n-1}$},

where the constant $c>0$ and the implied constants are independent of $\kappa$ and $j$. From (5.2a)–(5.2b), by refining the results in [KRS21], it was proved in [KSZ24, Theorem 1.2] that a necessary condition for the existence of a non-decreasing function $t\in\mathbb{R}_{+}\mapsto\omega(t)\in\mathbb{R}_{+}$ with

\lVert f\rVert_{L^{2}(\mathcal{S}^{n-1})}\leq\omega\left(\lVert A_{\kappa}f\rVert_{L^{2}(B_{1})}\right)\quad\text{whenever $\lVert f\rVert_{H^{1}(\mathcal{S}^{n-1})}\leq 1$}

is

(5.3) \omega(t)\gtrsim\max\left\{t,\kappa^{-1}(1+\log(1/t))^{-1}\right\}\quad\text{for all $0<t\lesssim 1$,}

where the implied constants are independent of $\kappa$ and $t$. By inspecting the proof, one sees that the stability bound $\omega(t)\gtrsim t$ follows from (5.2a), while the instability bound $\omega(t)\gtrsim\kappa^{-1}(1+\log(1/t))^{-1}$ follows from (5.2b); therefore (5.2a) and (5.2b) characterize the number of stable and unstable features in the inverse problem. For each fixed $\kappa>0$, we conclude from (5.3) that the inverse problem is ill-posed. However, one can choose a large $\kappa$ to reduce the effect of the instability term $\kappa^{-1}(1+\log(1/t))^{-1}$ as well as to increase the number of stable features in the sense of (5.2a). This is called the increasing resolution phenomenon.
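To indicate, heuristically, where the exponential decay in (5.2b) comes from (this is only an informal sketch under simplifying assumptions, not the argument of [KSZ24]), one may test $A_{\kappa}$ on an $L^{2}$-normalized spherical harmonic $Y_{j}$ with Laplace-Beltrami eigenvalue $\lambda_{j}$, for which $\lVert Y_{j}\rVert_{H^{-2m}(\mathcal{S}^{n-1})}\simeq(1+\lambda_{j})^{-m}$ and, by Weyl asymptotics, $\lambda_{j}\simeq j^{\frac{2}{n-1}}$. The Agmon-Hörmander estimate above then gives

\lVert A_{\kappa}Y_{j}\rVert_{L^{2}(B_{1})}\lesssim(Cm\kappa)^{2m}j^{-\frac{2m}{n-1}}\quad\text{for every integer $m\geq 0$,}

and choosing the integer $m$ proportional to $\kappa^{-1}j^{\frac{1}{n-1}}$ (with a small enough proportionality constant, which is admissible when $j\gtrsim\kappa^{n-1}$) makes the base $Cm\kappa j^{-\frac{1}{n-1}}$ smaller than $1$, so the right-hand side is bounded by $\exp\left(-c\kappa^{-1}j^{\frac{1}{n-1}}\right)$, consistent with (5.2b).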

Similar mechanisms for the linearized inverse acoustic scattering problem were also studied in [KSZ24]. We also remark that one can discuss the stability of inverse problems using a Bayesian approach [FKW24, KW24]. In principle, we believe that many inverse problems have some features which can be stably recovered, while most features are unstable to recover. The rationale behind the choice (3.1) is that we want to train only the stable features in order to keep our model lightweight, and our idea seems to work for Task 1 (the filtering experiments) when compared with the results of the other groups (see Figure 4.1). Unfortunately, our method does not work for Task 2 (the reverb experiments), which means that we still miss some stable features. Interestingly, we still partially improved Task 3 (the combination of filtering and reverb experiments). Comparing the training performance (see Figure 3.2) with the results verified by the organizers (see Figure 4.1), we see that we successfully capture stable features of the filtering experiments, but not of the reverb experiments.

6. Conclusions

The combination of the discrete Fourier transform and the Convolutional-based Autoencoder (FFT-ConvAE) proves to be an effective model for extracting high-frequency components from clean signals, resulting in a significant reduction in CER compared to the interrupted signals. By applying the Fourier transform (possibly in log scale), the discrepancies between the interrupted and clean signals are first substantially reduced. The magnitude of the Fourier-transformed signal is then further refined by the ConvAE, effectively avoiding the vanishing gradient problem and successfully extracting useful high-frequency information from the clean audio. Moreover, our proposed FFT-ConvAE is a general-purpose model capable of handling various tasks across different scenarios. Additionally, it is a lightweight model, with a low real-time factor (RTF), making it highly suitable for practical, everyday applications. Many inverse problems are unstable, but one can still recover some stable features, which can be extracted by carefully preprocessing the data before employing machine learning algorithms.

References

  • [AH76] S. Agmon and L. Hörmander. Asymptotic properties of solutions of differential equations with simple characteristics. J. Analyse Math., 30:1–38, 1976. MR0466902, Zbl:0335.35013, doi:10.1007/BF02786703.
  • [CSTK18] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto. Deep Convolutional Autoencoder-based lossy image compression. PCS, pages 253–257, 2018. doi:10.1109/PCS.2018.8456308.
  • [FJ98] F. G. Friedlander and M. Joshi. Introduction to the theory of distributions. Cambridge University Press, Cambridge, second edition, 1998. MR1721032, Zbl:0971.46024.
  • [FKW24] T. Furuya, P.-Z. Kow, and J.-N. Wang. Consistency of the Bayes method for the inverse scattering problem. Inverse Problems, 40(5), 2024. Paper No. 055001, MR4723841, Zbl:7867314, doi:10.1088/1361-6420/ad3089.
  • [GLL+19] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. van den Hengel. Memorizing normality to detect anomaly: memory-augmented Deep Autoencoder for unsupervised anomaly detection. Proc. IEEE Int. Conf. Comput. Vis., pages 1705–1714, 2019. doi:10.1109/ICCV.2019.00179, arXiv:1904.02639.
  • [KRS21] H. Koch, A. Rüland, and M. Salo. On instability mechanisms for inverse problems. Ars Inven. Anal., 2021. Paper No. 7, 93 pages, MR4462475, Zbl:1482.35002, doi:10.15781/c93s-pk62, arXiv:2012.01855.
  • [KLS+22] P.-Y. Kow, M.-H. Lee, W. Sun, M.-H. Yao, and F.-J. Chang. Integrate deep learning and physically-based models for multi-step-ahead microclimate forecasting. Expert Syst. Appl., 210, 2022. Article number 118481, doi:10.1016/j.eswa.2022.118481.
  • [KLS+24] P.-Y. Kow, J.-Y. Liou, W. Sun, L.-C. Chang, and F.-J. Chang. Watershed groundwater level multistep ahead forecasts by fusing convolutional-based autoencoder and LSTM models. J. Environ. Manag., 351, 2024. Article number 119789, doi:10.1016/j.jenvman.2023.119789.
  • [KSZ24] P.-Z. Kow, M. Salo, and S. Zou. Increasing resolution and instability for linear inverse scattering problems. arXiv preprint, 2024. arXiv:2404.18482.
  • [KW24] P.-Z. Kow and J.-N. Wang. Increasing stability in an inverse boundary value problem – Bayesian viewpoint. Taiwanese J. Math., 2024. 40 pages, doi:10.11650/tjm/240704.
  • [LKJS24a] M. Ludvigsen, E. Karvonen, M. Juvonen, and S. Siltanen. Helsinki Speech Challenge 2024. arXiv preprint, 2024. arXiv:2406.04123.
  • [LKJS24b] M. Ludvigsen, E. Karvonen, M. Juvonen, and S. Siltanen. Helsinki Speech Challenge 2024 open audio dataset. Zenodo, 2024. doi:10.5281/zenodo.14007505.
  • [Sha49] C. E. Shannon. Communication in the presence of noise. Proceedings of the IRE, 37(1):10–21, 1949. doi:10.1109/JRPROC.1949.232969.
  • [Tay11] M. E. Taylor. Partial differential equations I. Basic theory, volume 115 of Applied Mathematical Sciences. Springer, New York, second edition, 2011. MR2744150, Zbl:1206.35002, doi:10.1007/978-1-4419-7055-8.
  • [WCWW20] S. Wang, H. Chen, L. Wu, and J. Wang. A novel smart meter data compression method via stacked convolutional sparse auto-encoder. Int. J. Electr. Power Energy Syst., 118, 2020. Article number 105761. doi:10.1016/j.ijepes.2019.105761.
  • [WHK+23] K.-Y. Wu, I-W. Hsia, P.-Y. Kow, L.-C. Chang, and F.-J. Chang. High-spatiotemporal-resolution PM2.5{\rm PM}_{2.5} forecasting by hybrid deep learning models with ensembled massive heterogeneous monitoring data. J. Clean. Prod., 433, 2023. Article number 139825, doi:10.1016/j.jclepro.2023.139825.
  • [XMY16] C. Xing, L. Ma, and X. Yang. Stacked Denoise Autoencoder based feature extraction and classification for hyperspectral images. J. Sens., 2016. Article number 3632943, 10 pages. doi:10.1155/2016/3632943.