A Computationally Efficient Semi-Blind Source Separation Approach for Nonlinear Echo Cancellation Based on an Element-Wise Iterative Source Steering
Abstract
While semi-blind source separation-based acoustic echo cancellation (SBSS-AEC) has received much research attention due to its promising performance during double-talk compared to traditional adaptive algorithms, it suffers from system latency and nonlinear distortions. To circumvent these drawbacks, recently developed ideas on convolutive transfer function (CTF) approximation and nonlinear expansion have been incorporated into the iterative projection (IP)-based semi-blind source separation (SBSS) algorithm. However, the CTF approximation and nonlinear expansion make this algorithm computationally very expensive, which hinders its implementation in embedded systems. In this paper, we improve this IP-based algorithm by developing an element-wise iterative source steering (EISS) algorithm. In comparison with the IP-based SBSS algorithm, the proposed algorithm is computationally much more efficient, especially when the nonlinear expansion order is high and the CTF filter is long, while its AEC performance is as good as that of IP-based SBSS.
Index Terms— Semi-blind source separation, acoustic echo cancellation, convolutive transfer function approximation, nonlinear expansion, element-wise iterative source steering
1 Introduction
In telecommunications or teleconferencing, acoustic echo, which is formed by the coupling between loudspeakers and microphones, is detrimental to full-duplex communication. One widely used way to eliminate the detrimental echo effects is through acoustic echo cancellation, in which adaptive filters, such as the normalized least mean square (NLMS), recursive least squares (RLS), and Kalman filters [1, 2], are used to identify the acoustic impulse response (AIR) [3]. While they have been widely used, adaptive filters generally suffer from severe performance degradation or even divergence during double-talk, in which both the far-end and near-end speech are present [4, 5].
One way to improve acoustic echo cancellation (AEC) performance during double-talk is to adopt the principle used in blind source separation (BSS) [6, 7, 8, 9] and reformulate the AEC problem as one of semi-blind source separation (SBSS) [10, 11]. Several algorithms have been derived based on this SBSS framework and have demonstrated promising performance [12, 13, 14]. However, those algorithms still suffer from a number of drawbacks. First, they are computationally very expensive, which makes them difficult to implement in embedded systems. A viable approach to reducing complexity is to use the so-called multiplicative transfer function (MTF) model [15] and perform echo cancellation in the short-time Fourier transform (STFT) domain [16, 17]. But this method increases the system latency, as the MTF requires the length of the analysis window to be longer than the effective part of the AIR. Recently, the so-called convolutive transfer function (CTF) model [18, 19, 20] was applied to SBSS-AEC to achieve a different compromise between system latency and computational complexity [21, 22]. Second, the performance of SBSS-AEC degrades significantly in the presence of loudspeaker nonlinearity, which often happens in small devices [23, 24]. One way to deal with this issue is through nonlinear AEC, in which nonlinear expansion [25, 26, 27] is used to model the loudspeaker nonlinearity [28]. In [22], researchers proposed a framework that combines the CTF model and nonlinear expansion, and an iterative projection (IP) [8] method is used to solve the corresponding optimization problem.
However, the IP-based algorithm still faces the challenge of computational complexity, as it requires calculating the inverse of an auxiliary matrix. To circumvent this problem, we attempt in this work to reduce the complexity, thereby developing an element-wise iterative source steering (EISS) algorithm, which is an extension of the work in [29, 30, 31]. Since no matrix inverse is required, the developed algorithm is computationally much more efficient, with a complexity one order of magnitude lower than that of the IP-based algorithm, yet its performance is as good as that of the IP-based algorithm.
2 Signal Model and Problem Formulation
Consider the full-duplex speech communication scenario where a microphone is used to pick up the sound signal from the near-end speaker and a loudspeaker is used to play back the signal from the far-end. The microphone output signal at time instant $t$, denoted by $y(t)$, can be written as
$$y(t) = s(t) + d(t) = s(t) + h(t) \ast f\left[x(t)\right], \qquad (1)$$
where $s(t)$ denotes the near-end speech signal, $d(t)$ is the nonlinear acoustic echo, $h(t)$ denotes the acoustic impulse response from the loudspeaker to the microphone, $\ast$ represents linear convolution, $f[\cdot]$ stands for the response of the loudspeaker, which includes both the linear and nonlinear effects, and $x(t)$ is the far-end signal. The problem of acoustic echo cancellation is to mitigate or eliminate the echo signal $d(t)$ while preserving the near-end signal $s(t)$.
While they have been widely used in practical systems, adaptive filtering algorithms often suffer from severe performance degradation in the presence of loudspeaker nonlinearity. One way to deal with loudspeaker nonlinear distortion is to approximate $f[x(t)]$ through a $P$th-order basis-generic expansion [26], i.e.,
$$f\left[x(t)\right] \approx \sum_{p=1}^{P} a_p\, \phi_p\left[x(t)\right], \qquad (2)$$
where $\phi_p[\cdot]$ is the $p$-th order basis function and $a_p$ is the corresponding coefficient. For real-valued signals, the power-series basis $\phi_p[x(t)] = x^p(t)$ can be used, i.e.,
$$f\left[x(t)\right] \approx \sum_{p=1}^{P} a_p\, x^p(t). \qquad (3)$$
Substituting (2) into (1) gives
$$y(t) = s(t) + \sum_{p=1}^{P} h_p(t) \ast x_p(t), \qquad (4)$$
where $h_p(t) = a_p h(t)$ denotes the echo path of the $p$-th order expanded signal $x_p(t) = \phi_p[x(t)]$.
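As a concrete sketch, the expanded far-end signals above can be generated as follows. This is a minimal NumPy illustration assuming the power-series basis; the function name is ours, not from the paper:

```python
import numpy as np

def expand_far_end(x, P=3):
    """Stack the expanded far-end signals x_p(t) = x(t)**p for
    p = 1..P (power-series basis assumed for illustration)."""
    return np.stack([x ** p for p in range(1, P + 1)])

# Each row x_p then feeds its own echo path h_p in Eq. (4).
x = np.array([0.5, -0.8, 0.2])
xp = expand_far_end(x, P=3)
```

Each of the $P$ rows is filtered by its own linear echo path, which turns the single nonlinear identification problem into $P$ linear ones.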
To reduce the computational complexity, the MTF model can be introduced; however, it requires the analysis window to be longer than the effective part of the AIR, leading to larger system latency. To achieve a proper compromise between computational complexity and system latency, the CTF approximation is adopted instead. The signal model in (4) is then written in the STFT domain as
$$Y(k,n) \approx S(k,n) + \sum_{p=1}^{P} \sum_{l=0}^{L-1} H_p(k,l)\, X_p(k,n-l), \qquad (5)$$
where $k$ and $n$ denote the frequency and time-frame indexes, $L$ is the length of the CTF filter, and $Y(k,n)$, $S(k,n)$, $H_p(k,l)$, $X_p(k,n)$ denote, respectively, the STFTs of $y(t)$, $s(t)$, $h_p(t)$, and $x_p(t)$. Putting (5) into a vector/matrix form gives
$$\mathbf{y}(k,n) = \mathbf{A}(k)\, \mathbf{s}(k,n), \qquad (6)$$
where
$$\mathbf{y}(k,n) = \left[ Y(k,n),\; \mathbf{x}^T(k,n) \right]^T, \qquad (7)$$
$$\mathbf{s}(k,n) = \left[ S(k,n),\; \mathbf{x}^T(k,n) \right]^T, \qquad (8)$$
$$\mathbf{x}(k,n) = \left[ \mathbf{x}_1^T(k,n), \ldots, \mathbf{x}_P^T(k,n) \right]^T, \qquad (9)$$
$$\mathbf{x}_p(k,n) = \left[ X_p(k,n), \ldots, X_p(k,n-L+1) \right]^T, \qquad (10)$$
$$\mathbf{A}(k) = \begin{bmatrix} 1 & \mathbf{h}^T(k) \\ \mathbf{0} & \mathbf{I}_{PL} \end{bmatrix}, \qquad (11)$$
$$\mathbf{h}(k) = \left[ \mathbf{h}_1^T(k), \ldots, \mathbf{h}_P^T(k) \right]^T, \qquad (12)$$
$$\mathbf{h}_p(k) = \left[ H_p(k,0), \ldots, H_p(k,L-1) \right]^T, \qquad (13)$$
the superscript $^T$ denotes the transpose operation, and $\mathbf{I}_{PL}$ denotes the identity matrix of size $PL \times PL$. Note that the matrix $\mathbf{A}(k)$, which is called the mixing matrix in the literature of BSS, is of size $(PL+1) \times (PL+1)$, and $\mathbf{0}$ is a zero vector of length $PL$.
Following the notation in BSS and echo cancellation, we now define the demixing matrix as
$$\mathbf{W}(k) = \begin{bmatrix} \mathbf{w}^H(k) \\ \begin{bmatrix} \mathbf{0} & \mathbf{I}_{PL} \end{bmatrix} \end{bmatrix}, \qquad (14)$$
where $\mathbf{w}(k)$ is a column vector with $PL+1$ parameters to be estimated and the superscript $^H$ denotes the conjugate transpose. The near-end signal extraction filter can then be expressed as $\mathbf{w}(k)$. Applying the near-end signal extraction filter to the input signal gives the estimate of the near-end signal, i.e.,
$$\widehat{S}(k,n) = \mathbf{w}^H(k)\, \mathbf{y}(k,n). \qquad (15)$$
Given the aforementioned signal model and problem formulation, the objective of nonlinear SBSS-AEC is to estimate $\mathbf{w}(k)$ by exploiting the independence between the near-end and the reference signals.
3 Nonlinear SBSS-AEC Algorithms
3.1 Probabilistic Model
We consider modeling the source signal with a generalized Gaussian distribution, i.e.,
$$p\left[\widehat{\mathbf{s}}(n)\right] \propto \exp\left[ - \left( \frac{r(n)}{\gamma} \right)^{\beta} \right], \qquad (16)$$
where $\widehat{\mathbf{s}}(n) = \left[ \widehat{S}(1,n), \ldots, \widehat{S}(K,n) \right]^T$ collects the estimates over all $K$ frequency bins and
$$r(n) = \left\| \widehat{\mathbf{s}}(n) \right\|_2 = \sqrt{ \sum_{k=1}^{K} \left| \widehat{S}(k,n) \right|^2 } \qquad (17)$$
stands for its $\ell_2$ norm. $\gamma$ and $\beta$ are the scale and shape parameters, respectively. Since the reference signal is accessible, the negative log-likelihood function can then be calculated as
$$\mathcal{J}(n) = \sum_{\tau=1}^{n} \alpha^{n-\tau} \left( \frac{r(\tau)}{\gamma} \right)^{\beta} - 2 \sum_{k=1}^{K} \log \left| w_1(k) \right|, \qquad (18)$$
where $\alpha \in (0,1)$ is a forgetting factor and $w_1(k)$ denotes the first element of $\mathbf{w}(k)$. By using the well-known majorization-minimization (MM) method [32], the following auxiliary function can be obtained:
$$\mathcal{Q}(n) = \sum_{k=1}^{K} \left[ \mathbf{w}^H(k)\, \mathbf{V}(k,n)\, \mathbf{w}(k) - 2 \log \left| w_1(k) \right| \right]. \qquad (19)$$
To track time-varying signals and acoustic environments, the recursive estimation of $\mathbf{V}(k,n)$ is generally used, i.e.,
$$\mathbf{V}(k,n) = \alpha\, \mathbf{V}(k,n-1) + (1-\alpha)\, \varphi(n)\, \mathbf{y}(k,n)\, \mathbf{y}^H(k,n), \qquad (20)$$
and
$$\varphi(n) = \frac{\beta}{2 \gamma^{\beta}\, r^{2-\beta}(n)}, \qquad (21)$$
$$r(n) = \sqrt{ \sum_{k=1}^{K} \left| \mathbf{w}^H(k)\, \mathbf{y}(k,n) \right|^2 }. \qquad (22)$$
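The recursive covariance update can be sketched as follows. This is a minimal NumPy illustration under our reading of the GGD-weighted recursion in (20); the exact weight expression and parameter handling are assumptions, with the forgetting factor and shape parameter matching the values used later in the experiments:

```python
import numpy as np

def update_covariance(V, y, r, alpha=0.992, beta=0.4, gamma=1.0, eps=1e-8):
    """One recursive update of the weighted covariance matrix:
    V(n) = alpha * V(n-1) + (1 - alpha) * phi(n) * y y^H, with a
    GGD-style weight phi(n) (assumed form; eps guards against r -> 0)."""
    phi = beta / (2.0 * gamma ** beta * max(r, eps) ** (2.0 - beta))
    return alpha * V + (1.0 - alpha) * phi * np.outer(y, y.conj())

# Toy update with a random observation vector.
M = 4
V = np.eye(M, dtype=complex)
y = np.random.randn(M) + 1j * np.random.randn(M)
V = update_covariance(V, y, r=np.linalg.norm(y))
```

Because the update is a Hermitian rank-1 correction of a Hermitian matrix, $\mathbf{V}(k,n)$ stays Hermitian positive semidefinite throughout the recursion.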
Note that the norm of the near-end signal extraction filter does not affect the independence criterion. Therefore, a two-stage estimation strategy can be used in which the extraction filter is updated in the first stage using the maximum likelihood criterion and the updated filter is then normalized in the second stage such that its first element is equal to 1. Since the extraction filters at different frequency bins are estimated independently, we shall omit the frequency index $k$ in the remainder of this paper without introducing any confusion.
3.2 Conventional Iterative Projection Based Method
The IP method can be derived by identifying the Wirtinger derivative of (19) with respect to $\mathbf{w}$ and then forcing the result to be $\mathbf{0}$. The update rules are as follows:
$$\mathbf{w} \leftarrow \left[ \mathbf{W}\, \mathbf{V}(n) \right]^{-1} \mathbf{e}_1, \qquad (23)$$
$$\mathbf{w} \leftarrow \mathbf{w} / w_1, \qquad (24)$$
where $\mathbf{e}_1$ is the first column of $\mathbf{I}_{PL+1}$ and $w_1$ is the first element of $\mathbf{w}$.
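The two IP steps above can be sketched in a few lines. This is an illustrative NumPy implementation under our reading of the update rules, not the authors' code; solving the linear system replaces forming the inverse explicitly, but the cubic cost remains:

```python
import numpy as np

def ip_update(W, V):
    """One iterative-projection step: w <- (W V)^{-1} e_1, followed by
    the second-stage normalization forcing the first element to 1."""
    e1 = np.zeros(W.shape[0], dtype=complex)
    e1[0] = 1.0
    w = np.linalg.solve(W @ V, e1)   # O(M^3) linear solve, M = PL + 1
    return w / w[0]

# Toy example with a random Hermitian positive-definite V.
M = 4
A = np.random.randn(M, M) + 1j * np.random.randn(M, M)
V = A @ A.conj().T + M * np.eye(M)
W = np.eye(M, dtype=complex)
w = ip_update(W, V)
```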
3.3 Proposed Element-wise Iterative Source Steering Method
Implementation of the IP method requires computing the inverse of the matrix $\mathbf{W}\,\mathbf{V}(n)$ at every frame, which makes the algorithm computationally very expensive. To reduce the complexity, we propose to update the near-end signal extraction filter with the EISS method in the first stage [29, 30, 31], i.e.,
$$\mathbf{w} \leftarrow \mathbf{w} - \delta_i\, \mathbf{c}_i, \quad i = 1, 2, \ldots, PL+1, \qquad (25)$$
where $\mathbf{c}_1 = \mathbf{w}$, $\mathbf{c}_i = \mathbf{e}_i$ for $i \geq 2$ with $\mathbf{e}_i$ denoting the $i$-th column of $\mathbf{I}_{PL+1}$, $w_i$ is the $i$-th element of $\mathbf{w}$, and $\delta_i$ is a parameter to estimate. All the $\delta_i$'s need to be estimated sequentially. In other words, the algorithm first computes $\delta_1$ to update all the elements of $\mathbf{w}$; it then computes $\delta_2$ to update $w_2$; it subsequently computes $\delta_3$ to update $w_3$, and so forth. The EISS update rules are illustrated in Fig. 1.
[Fig. 1. Illustration of the EISS update rules.]
Substituting (25) into the auxiliary function, we obtain
$$\mathcal{Q}(\delta_i) = \left[ \mathbf{w} - \delta_i \mathbf{c}_i \right]^H \mathbf{V}(n) \left[ \mathbf{w} - \delta_i \mathbf{c}_i \right] - 2 \log \left| w_1 - \delta_i c_{i,1} \right|, \qquad (26)$$
where
$$c_{i,1} = \begin{cases} w_1, & i = 1 \\ 0, & i \geq 2 \end{cases} \qquad (27)$$
is the first element of $\mathbf{c}_i$. Identifying the Wirtinger derivative of (26) with respect to $\delta_i^{*}$ and forcing the result to be $0$, one can obtain the following solution:
$$\delta_i = \begin{cases} 1 - \left[ \mathbf{w}^H \mathbf{V}(n)\, \mathbf{w} \right]^{-1/2}, & i = 1 \\ \dfrac{\mathbf{v}_i^H\, \mathbf{w}}{\mathbf{e}_i^H\, \mathbf{v}_i}, & i \geq 2 \end{cases} \qquad (28)$$
where $\mathbf{v}_i$ denotes the $i$-th column of $\mathbf{V}(n)$ and $\mathbf{e}_i^H \mathbf{v}_i$ stands for the $i$-th diagonal element of $\mathbf{V}(n)$.
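One full EISS sweep can be sketched as follows. This is an illustrative NumPy implementation under our reading of the update rules in (25) and (28), not the authors' code; note that no matrix inverse appears anywhere:

```python
import numpy as np

def eiss_update(w, V, eps=1e-12):
    """One element-wise ISS sweep over the extraction filter: delta_1
    rescales the whole filter, while each delta_i (i >= 2) adjusts
    only the i-th element, so no matrix inverse is needed."""
    w = w.astype(complex).copy()
    # i = 1: w <- (1 - delta_1) w, with delta_1 = 1 - (w^H V w)^{-1/2}
    power = np.real(w.conj() @ V @ w)
    w /= np.sqrt(max(power, eps))
    # i >= 2: delta_i = v_i^H w / V_ii, touching only w[i]
    for i in range(1, len(w)):
        w[i] -= (V[:, i].conj() @ w) / max(np.real(V[i, i]), eps)
    return w / w[0]      # second-stage normalization: first element = 1

# Toy example with a random Hermitian positive-definite V.
M = 4
A = np.random.randn(M, M) + 1j * np.random.randn(M, M)
V = A @ A.conj().T + M * np.eye(M)
w = eiss_update(np.ones(M, dtype=complex), V)
```

After updating element $i \geq 2$, the corresponding stationarity condition $\mathbf{v}_i^H \mathbf{w} = 0$ holds exactly, which is the per-coordinate analogue of the IP fixed point.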
4 Complexity analysis
In this section, we present the complexity of the IP and EISS algorithms in terms of the number of multiplications/divisions needed per frequency bin and time frame. For the IP algorithm, the complexity is dominated by computing the inverse of the matrix $\mathbf{W}\,\mathbf{V}(n)$, which has a complexity proportional to $(PL+1)^3$. The complexity of all the other computations is of order $(PL+1)^2$. For a regular setup, $1$ is much smaller than $PL$. Therefore, the complexity of the IP method is
$$\mathcal{C}_{\mathrm{IP}} = \mathcal{O}\!\left( P^3 L^3 \right). \qquad (29)$$
The EISS algorithm needs to estimate $PL+1$ coefficients. Computation of $\delta_1$ has a complexity of $(PL+1)^2$. The complexity for computing each $\delta_i$, $i \geq 2$, is $PL+1$, and it is $(PL+1)^2$ for all the other operations. Similarly, if we only consider the operations that dominate the complexity, the overall complexity of the EISS method is
$$\mathcal{C}_{\mathrm{EISS}} = \mathcal{O}\!\left( P^2 L^2 \right), \qquad (30)$$
which is one order lower than that of the IP method.
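For the setup used later in the experiments (expansion order 3, CTF length 5), the dominant operation counts can be compared numerically. This is only an order-of-magnitude sketch, since the big-O expressions hide constant factors:

```python
# Dominant multiplication counts per frequency bin and frame
# (orders only; constant factors are ignored).
P, L = 3, 5                 # expansion order and CTF filter length
M = P * L + 1               # extraction-filter length, PL + 1
ip_cost = M ** 3            # matrix inversion dominates the IP method
eiss_cost = M ** 2          # the V w products dominate the EISS sweep
ratio = ip_cost / eiss_cost
print(ip_cost, eiss_cost, ratio)   # 4096 256 16.0
```

Even at this modest filter length, the dominant term differs by a factor of $PL+1 = 16$, and the gap widens as $P$ or $L$ grows.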
5 Simulations and Experiments
5.1 Experimental Setup
In this section, the proposed EISS-based algorithm is compared with the IP-based algorithm and the single-microphone form of the state-space model (SSM)-based nonlinear acoustic echo cancellation algorithm proposed in [27]. We evaluated the AEC algorithms with objective measures that quantify performance in terms of echo reduction and speech distortion. For the single-talk case, echo return loss enhancement (ERLE) [26] is used as the performance metric, and for double-talk, true ERLE (tERLE) [17] is used. In addition, perceptual evaluation of speech quality (PESQ) [33] and short-time objective intelligibility (STOI) [34] are used for performance evaluation. The sampling rate for all the signals in this work is 16 kHz.
To assess the efficiency of our algorithm, we also conducted a comparative analysis of the runtime of our algorithm and the IP-based algorithm. We processed 100 signals, each lasting 10 seconds, on a laptop equipped with an i7-10750H CPU and computed the average runtime per signal as the final test result. The average runtime is shown in Fig. 3. For the short-time analysis, the frame length is 1024 points. A Hanning window is applied and the windowed signal is then transformed into the STFT domain with a 1024-point fast Fourier transform (FFT). To balance computational complexity and performance, the nonlinear expansion order is set to 3 and the length of the CTF filter is set to 5. The forgetting factor is set to 0.992 and the shape parameter is set to 0.4. In all experiments, the demixing matrix is initialized as an identity matrix.
5.2 AEC Performance for Hard Clipping Mapping
In this experiment, we validate the ability of EISS to handle nonlinear distortion in both single-talk and double-talk scenarios. We use the same data as in [22]. The hard clipping function [26] is used to simulate the loudspeaker nonlinearity, in which the clipping threshold is set to . The reverberation time is approximately 300 ms. The signal-to-echo ratio (SER) for double-talk is 0 dB. Figure 2 (a) and (b) show the performance of the IP and EISS methods in the double-talk and single-talk situations, respectively. One can see that the ERLE and tERLE of EISS and IP algorithms are almost the same, but when compared to SSM, both of them have significantly higher values. Therefore, the proposed algorithm demonstrates superior AEC performance compared to the conventional SSM algorithm.
[Fig. 2. Performance of the SSM, IP, and EISS methods: (a) tERLE in the double-talk situation and (b) ERLE in the single-talk situation.]
5.3 Overall Performance on the AEC Challenge Dataset
In this experiment, a total of 30 signals are arbitrarily taken from the AEC challenge synthetic dataset [35]. The nonlinear far-end signals are generated by clipping the maximum amplitude or by applying the sigmoidal function [36] and learned distortion functions to the far-end signal. The SER of these signals covers a range of values, and the reverberation time ranges from 200 ms to 1200 ms.
| Algorithm | PESQ | STOI | tERLE (dB) |
|---|---|---|---|
| SSM | 1.57 | 0.87 | 8.77 |
| IP | 1.89 | 0.93 | 12.89 |
| EISS | 1.90 | 0.94 | 12.63 |
Table 1 lists the overall performance of the three algorithms in terms of PESQ, STOI, and tERLE. It can be seen that the proposed algorithm significantly outperforms traditional nonlinear AEC algorithms across various noise and reverberation environments and exhibits similar performance to the IP-based algorithm.
5.4 Runtime Comparison
In the last set of experiments, we compare the runtime of the IP and EISS methods with the same setup as described previously. The time measured here includes the time to compute and update the covariance matrix and auxiliary variables. In this experiment, the nonlinear expansion order is set to 3 and 4, respectively, and the CTF filter length is varied.
[Fig. 3. Average runtime of the IP and EISS methods versus the CTF filter length for nonlinear expansion orders 3 and 4.]
As shown in Fig. 3, the EISS method is computationally much more efficient than the IP method, and the difference becomes much more dramatic as the nonlinear expansion order and the CTF filter length increase.
6 Conclusion
This paper investigated the problem of AEC in the presence of double-talk and loudspeaker nonlinearity in reverberant environments. Following the framework of SBSS, we developed an element-wise iterative source steering algorithm, which combines the CTF model and nonlinear expansion within the SBSS framework. Unlike the conventional IP-based SBSS method, which requires computing the inverse of the auxiliary matrix, the proposed algorithm applies an element-wise update strategy in which no matrix inverse is involved; as a result, its computational complexity is one order of magnitude lower than that of the IP-based algorithm. Moreover, experiments showed that this efficient algorithm achieves performance similar to that of the IP-based algorithm.
References
- [1] J. Benesty, T. Gänsler, D. R. Morgan, M. M. Sondhi, S. L. Gay et al., Advances in network and acoustic echo cancellation. Springer, 2001.
- [2] G. Enzner and P. Vary, “Frequency-domain adaptive Kalman filter for acoustic echo control in hands-free telephones,” Signal Process., vol. 86, no. 6, pp. 1140–1156, 2006.
- [3] X. Wang, G. Huang, J. Benesty, J. Chen, and I. Cohen, “Time difference of arrival estimation based on a Kronecker product decomposition,” IEEE Signal Process. Lett., vol. 28, pp. 51–55, 2020.
- [4] T. Gansler, S. L. Gay, M. M. Sondhi, and J. Benesty, “Double-talk robust fast converging algorithms for network echo cancellation,” IEEE Trans. Speech, Audio Process., vol. 8, no. 6, pp. 656–663, 2000.
- [5] H. Buchner, J. Benesty, T. Gansler, and W. Kellermann, “Robust extended multidelay filter and double-talk detector for acoustic echo cancellation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 5, pp. 1633–1644, 2006.
- [6] P. Comon, “Independent component analysis, a new concept?” Signal Process., vol. 36, no. 3, pp. 287–314, 1994.
- [7] T. Kim, H. T. Attias, S.-Y. Lee, and T.-W. Lee, “Blind source separation exploiting higher-order frequency dependencies,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 1, pp. 70–79, 2006.
- [8] N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” in Proc. WASPAA, 2011, pp. 189–192.
- [9] S. Makino, Audio source separation. Springer, 2018.
- [10] M. Joho, H. Mathis, and G. S. Moschytz, “Combined blind/nonblind source separation based on the natural gradient,” IEEE Signal Process. Lett., vol. 8, no. 8, pp. 236–238, 2001.
- [11] S. Miyabe, T. Takatani, H. Saruwatari, K. Shikano, and Y. Tatekura, “Barge-in-and noise-free spoken dialogue interface based on sound field control and semi-blind source separation,” in Proc. EUSIPCO, 2007, pp. 232–236.
- [12] J. Gunther, “Learning echo paths during continuous double-talk using semi-blind source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 2, pp. 646–660, 2011.
- [13] Z. Koldovský, J. Málek, M. Müller, and P. Tichavský, “On semi-blind estimation of echo paths during double-talk based on nonstationarity,” in Proc. IWAENC, 2014, pp. 198–202.
- [14] J. Gunther and T. Moon, “Blind acoustic echo cancellation without double-talk detection,” in Proc. WASPAA, 2015, pp. 1–5.
- [15] Y. Avargel and I. Cohen, “On multiplicative transfer function approximation in the short-time Fourier transform domain,” IEEE Signal Process. Lett., vol. 14, no. 5, pp. 337–340, 2007.
- [16] T. S. Wada, S. Miyabe, and B.-H. F. Juang, “Use of decorrelation procedure for source and echo suppression,” in Proc. IWAENC, 2008, pp. 1–5.
- [17] F. Nesta, T. S. Wada, and B.-H. Juang, “Batch-online semi-blind source separation applied to multi-channel acoustic echo cancellation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 3, pp. 583–599, 2010.
- [18] R. Talmon, I. Cohen, and S. Gannot, “Relative transfer function identification using convolutive transfer function approximation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 546–555, 2009.
- [19] R. Talmon, I. Cohen, and S. Gannot, “Convolutive transfer function generalized sidelobe canceler,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 7, pp. 1420–1434, 2009.
- [20] X. Wang, A. Brendel, G. Huang, Y. Yang, W. Kellermann, and J. Chen, “Spatially informed independent vector analysis for source extraction based on the convolutive transfer function model,” in Proc. IEEE ICASSP, 2023, pp. 1–5.
- [21] Z. Wang, Y. Na, Z. Liu, B. Tian, and Q. Fu, “Weighted recursive least square filter and neural network based residual echo suppression for the AEC-challenge,” in Proc. IEEE ICASSP, 2021, pp. 141–145.
- [22] G. Cheng, L. Liao, K. Chen, Y. Hu, C. Zhu, and J. Lu, “Semi-blind source separation using convolutive transfer function for nonlinear acoustic echo cancellation,” J. Acoust. Soc. Am., vol. 153, no. 1, pp. 88–95, 2023.
- [23] R. Niemistö and T. Mäkelä, “On performance of linear adaptive filtering algorithms in acoustic echo control in presence of distorting loudspeakers,” in Proc. IWAENC, 2003, pp. 79–82.
- [24] M. I. Mossi, N. W. Evans, and C. Beaugeant, “An assessment of linear adaptive filter performance with nonlinear distortions,” in Proc. IEEE ICASSP, 2010, pp. 313–316.
- [25] S. Malik and G. Enzner, “Fourier expansion of Hammerstein models for nonlinear acoustic system identification,” in Proc. IEEE ICASSP, 2011, pp. 85–88.
- [26] ——, “State-space frequency-domain adaptive filtering for nonlinear acoustic echo cancellation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 7, pp. 2065–2079, 2012.
- [27] J. Park and J.-H. Chang, “State-space microphone array nonlinear acoustic echo cancellation using multi-microphone near-end speech covariance,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 10, pp. 1520–1534, 2019.
- [28] G. Cheng, L. Liao, H. Chen, and J. Lu, “Semi-blind source separation for nonlinear acoustic echo cancellation,” IEEE Signal Process. Lett., vol. 28, pp. 474–478, 2021.
- [29] R. Scheibler and N. Ono, “Fast and stable blind source separation with rank-1 updates,” in Proc. IEEE ICASSP, 2020, pp. 236–240.
- [30] T. Nakashima and N. Ono, “Inverse-free online independent vector analysis with flexible iterative source steering,” in Proc. APSIPA, 2022, pp. 749–753.
- [31] T. Nakashima, R. Ikeshita, N. Ono, S. Araki, and T. Nakatani, “Fast online source steering algorithm for tracking single moving source using online independent vector analysis,” in Proc. IEEE ICASSP, 2023, pp. 1–5.
- [32] K. Lange, MM optimization algorithms. SIAM, 2016.
- [33] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. IEEE ICASSP, vol. 2, 2001, pp. 749–752.
- [34] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. IEEE ICASSP, 2010, pp. 4214–4217.
- [35] R. Cutler, A. Saabas, T. Parnamaa, M. Purin, H. Gamper, S. Braun, K. Sørensen, and R. Aichner, “ICASSP 2022 acoustic echo cancellation challenge,” in Proc. IEEE ICASSP, 2022, pp. 9107–9111.
- [36] C. M. Lee, J. W. Shin, and N. S. Kim, “DNN-based residual echo suppression,” in Proc. Interspeech, 2015, pp. 1175–1179.