Corresponding author: Ross Cutler (email:[email protected]).
ICASSP 2023 Speech Signal Improvement Challenge
Abstract
The ICASSP 2023 Speech Signal Improvement Challenge is intended to stimulate research in the area of improving speech signal quality in communication systems. Speech signal quality, which can be measured with SIG in ITU-T P.835, is still a top issue in audio communication and conferencing systems. For example, in the ICASSP 2022 Deep Noise Suppression Challenge, the improvement in background and overall quality was impressive, but the improvement in the speech signal was not statistically significant. To improve the speech signal, the following speech impairment areas must be addressed: coloration, discontinuity, loudness, reverberation, and noise. A training set and a test set were provided for the challenge, and the winners were determined using an extended crowdsourced implementation of the listening phase of ITU-T P.804. The results show that significant improvement was made across all measured dimensions of speech quality.
speech enhancement, deep learning, subjective testing, speech quality assessment
1 Introduction
Audio telecommunication systems such as remote collaboration systems (Microsoft Teams, Skype, Zoom, etc.), smartphones, and telephones are used by nearly everyone on the planet and have become essential tools for both work and personal use. Since the invention of the telephone in 1876 by Alexander Graham Bell, audio engineers and researchers have innovated to improve the speech quality of telecommunication systems, with the ultimate goal of making audio telecommunication as good as or better than face-to-face communication. After nearly 150 years of effort, there is still a long way to go toward this goal, especially with mainstream devices. For example, it is still common to hear frequency response distortions, isolated and non-stationary distortions, loudness issues, reverberation, and background noise in audio calls.
The ICASSP 2023 Speech Signal Improvement Challenge is intended to stimulate research in the area of improving the quality of the send speech signal (in telecommunication, the audio captured by the near-end microphone, processed, and sent to the far end is called the send signal) in mainstream telecommunication systems. Subjective speech quality assessment is the gold standard for evaluating speech enhancement, processing, and telecommunication systems. The ITU-T has developed several recommendations for subjective speech quality assessment. In particular, ITU-T Rec. P.835 [1] provides a lab-based subjective evaluation framework targeting systems that include noise suppression algorithms, yielding quality scores for the speech signal (SIG), background noise (BAK), and overall quality (OVRL). In this framework, participants are asked to listen to short clips of speech in a controlled environment and rate the quality of each clip in terms of the speech signal, background noise, and overall quality on three discrete Likert scales (where 1 is Bad quality and 5 is Excellent quality). Each clip is rated by multiple raters, and the results are averaged to obtain a Mean Opinion Score (MOS). By measuring SIG, BAK, and OVRL, P.835 provides a more reliable subjective assessment [1] and allows researchers to determine which area to focus on to improve the overall quality.
The speech signal is still a top issue in audio telecommunication and conferencing systems. For example, in the ICASSP 2022 Deep Noise Suppression Challenge [2], the improvement in BAK and OVRL quality was impressive, but no improvement in SIG was observed. The same was true for the INTERSPEECH 2021 Deep Noise Suppression Challenge [3] and for the more recent ICASSP 2023 Deep Noise Suppression Challenge [4], which focuses on personalized noise suppression. Table 1 shows the amount of improvement in SIG, BAK, and OVRL needed to reach excellent-quality speech (MOS = 5) in the ICASSP 2022 Deep Noise Suppression Challenge. This shows that the key area for improvement is SIG, which has roughly 2.3 times the improvement headroom of BAK. To improve SIG, the following dimensions of speech quality should be improved [5]:
- Coloration: Frequency response distortions
- Discontinuity: Isolated and non-stationary distortions
- Loudness: Important for the overall quality and intelligibility
- Reverberation: Room reverberation of speech and noise signals
- Noisiness: Background noise and circuit and coding noise
The correlation of SIG to these dimensions is given in Figure 5. Theoretically, improving BAK is not necessary to improve SIG as they are orthogonal metrics by design. However, in practice, it is hard for subjective test participants to assess speech signal quality in the presence of strong dominant background noise.
2 Related work
While there have been previous challenges in background noise and reverberation, there have been no challenges in coloration and loudness and a limited challenge in discontinuities (see Table 2). Moreover, there have been no previous challenges that explicitly measure and target improving SIG.
There are many previous methods to improve noisiness, coloration, discontinuity, loudness, and reverberation separately. Two new methods that target universal improvement of the speech signal are [6, 7].
The ITU-T has developed several recommendations for subjective speech quality assessment. ITU-T P.800 [8] describes lab-based methods for the subjective determination of speech quality, including the Absolute Category Rating (ACR). ITU-T P.808 [9] describes a crowdsourcing approach for conducting subjective evaluations of speech quality. It provides guidance on test material, experimental design, and a procedure for conducting listening tests in the crowd. These methods are complementary to the laboratory-based evaluations described in P.800. An open-source implementation of P.808 is described in [10], and an open-source implementation of P.835 is described in [11]. More recent multi-dimensional speech quality assessment standards are ITU-T P.863.2 [12] and P.804 [5] (listening phase), which measure the perceptual dimensions of speech quality, namely noisiness, coloration, discontinuity, and loudness (see Table 3).
Intrusive objective speech quality assessment tools such as Perceptual Evaluation of Speech Quality (PESQ) [13] and Perceptual Objective Listening Quality Analysis (POLQA) [14] require a clean speech reference. Non-intrusive objective speech quality assessment tools like ITU-T P.563 [15] do not require a reference, though P.563 has a low correlation to subjective quality [16]. Newer neural-network-based methods such as [16, 17, 18, 19] provide better correlations to subjective quality. NISQA [20] is an objective metric for P.804, though its correlation to subjective quality is not sufficient for use as a challenge metric (in the ConferencingSpeech 2022 Challenge [19], NISQA was used as a baseline model and achieved a Pearson correlation coefficient of 0.724 to MOS).
3 Challenge description
This challenge benchmarks the performance of speech enhancement models with a real (not simulated) test set. The telecommunication scenario is the near end only send signal; it does not include echo impairments (there is no far end speech or noise). Participants evaluated their speech enhancement model (SEM) on a test set and submitted the results (clips) for subjective evaluation.
3.1 Challenge tracks
The challenge has two tracks:
1. Real-time SEM
2. Non-real-time SEM
The goal of the first track is to develop a model that can be used today on a typical personal computer, while the goal of the second track is to develop a model that could be run on machines much faster than a typical personal computer or be run offline.
3.2 Latency and runtime requirements
Algorithmic latency is defined as the offset introduced by the whole processing chain, including the short-time Fourier transform (STFT), inverse STFT, overlap-add, additional lookahead frames, etc., compared to simply passing the signal through without modification. It does not include buffering latency. Some examples are given below, followed by a short computational sketch:
- An STFT-based processing with window length = 20 ms and hop length = 10 ms introduces an algorithmic delay of window length – hop length = 10 ms.
- An STFT-based processing with window length = 32 ms and hop length = 8 ms introduces an algorithmic delay of window length – hop length = 24 ms.
- An overlap-save-based processing algorithm introduces no additional algorithmic latency.
- A time-domain convolution with a filter kernel size = 16 samples introduces an algorithmic latency of kernel size – 1 = 15 samples. Using one-sided padding, the operation can be made fully “causal”, i.e., left-sided padding with kernel size – 1 samples results in no algorithmic latency.
- An STFT-based processing with window length = 20 ms and hop length = 10 ms using 2 future frames of information introduces an algorithmic latency of (window length – hop length) + 2 × hop length = 30 ms.
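As a quick check of the formulas above, the following minimal Python sketch (not part of the official challenge tooling) computes the algorithmic latency of an STFT-based system with optional lookahead frames:

```python
# Minimal sketch: algorithmic latency of an overlap-add STFT-based enhancer,
# following the formula used in the examples above.
def stft_algorithmic_latency_ms(window_ms: float, hop_ms: float,
                                lookahead_frames: int = 0) -> float:
    """Latency is (window - hop) plus any lookahead frames, in milliseconds."""
    return (window_ms - hop_ms) + lookahead_frames * hop_ms

if __name__ == "__main__":
    print(stft_algorithmic_latency_ms(20, 10))     # 10 ms
    print(stft_algorithmic_latency_ms(32, 8))      # 24 ms
    print(stft_algorithmic_latency_ms(20, 10, 2))  # 30 ms (2 lookahead frames)
```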
Buffering latency is defined as the latency introduced by block-wise processing, often referred to as hop length, frame-shift, or temporal stride. Some examples are:
- An STFT-based processing has a buffering latency corresponding to the hop size.
- An overlap-save processing has a buffering latency corresponding to the frame size.
- A time-domain convolution with stride 1 introduces a buffering latency of 1 sample.
Real-time factor (RTF) is defined as the ratio of the compute time for one processing step to the duration of that step: RTF = compute time / time step. For an STFT-based algorithm, one processing step corresponds to the hop size; for a time-domain convolution, one processing step is 1 sample.
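The following minimal sketch illustrates how such an RTF measurement could be set up; the `process_frame` callable is a hypothetical stand-in for a submitted model's per-hop processing, not the challenge's official benchmarking code:

```python
# Minimal sketch: estimate the real-time factor (RTF) by timing one processing
# step repeatedly and dividing the average compute time by the hop duration.
import time
import numpy as np

def measure_rtf(process_frame, hop_samples: int, sample_rate: int = 48000,
                n_steps: int = 1000) -> float:
    hop_seconds = hop_samples / sample_rate
    frame = np.zeros(hop_samples, dtype=np.float32)    # dummy input hop
    start = time.perf_counter()
    for _ in range(n_steps):
        process_frame(frame)                           # one processing step
    elapsed = time.perf_counter() - start
    return (elapsed / n_steps) / hop_seconds           # RTF = compute time / time step

if __name__ == "__main__":
    # Pass-through "model" with a 10 ms hop at 48 kHz; a real submission must
    # meet RTF <= 0.5 on the specified CPU using a single thread.
    print(measure_rtf(lambda x: x, hop_samples=480))
```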
All models submitted to this challenge must meet all of the below requirements:
1. To be able to execute an algorithm in real time, and to accommodate the variance in compute time that occurs in practice, we require RTF ≤ 0.5 in the challenge on an Intel Core i5 quad-core clocked at 2.4 GHz using a single thread.
2. Algorithmic latency + buffering latency ≤ 20 ms.
3. No future information can be used during model inference.
Area | Headroom (DMOS) |
---|---|
SIG | 0.70 |
BAK | 0.30 |
OVRL | 0.87 |
Area | Related challenge |
---|---|
Noisiness | Deep Noise Suppression [21, 22, 3, 2, 4] |
Coloration | None |
Discontinuity | Packet Loss Concealment [23] |
Loudness | None |
Reverberation | REVERB [24] |
Echo | AEC Challenge [25, 26, 27, 28] |
More details of the challenge are available at https://aka.ms/sig-challenge.
4 Training set
This challenge suggested using the ICASSP 2022 Deep Noise Suppression Challenge [2] and ICASSP 2022 Acoustic Echo Cancellation Challenge [27] training and test sets for training. The AEC Challenge training set in particular includes over 10K unique environments, devices, and speakers. The near-end single-talk clips have been rated using P.835, and these ratings are provided; they can be used during training to improve SIG and OVRL.
5 Test set
The test set consisted of 500 send clips, each using a unique device, environment, and person speaking. The clips were captured from both PCs and mobile devices using the same methodology as described in [27]. The recordings were stratified to have an approximately uniform distribution over the impairment areas listed in Table 3. The test set contains English, German, Dutch, French, and Spanish speech, with the majority of the files (around 80%) in English. The test set was released near the end of the competition. The distribution of subjective ratings of the test set based on P.804 (see Section 6) for all dimensions is shown in Figure 1.

6 Evaluation methodology
The challenge evaluation is based on a subjective listening test. We have developed an extension of P.804 (listening phase) / P.863.2 (Annex A) based on crowdsourcing and the P.808 toolkit [10] for subjective evaluation. In particular, we added reverberation, speech quality, and overall quality to P.804’s listening phase (see Table 3). Details of this P.804 extension are given in Section 6.1 and [31].
The challenge metric M is defined in Eq. (1). In addition, differential SIG (DSIG) must be non-negative. Since OVRL depends on both BAK and SIG [11], BAK should also be improved to increase OVRL. The other metrics measured in Table 3 are informational only. However, in Section 8.5 we show that overall quality is influenced by the speech signal, reverberation, loudness, discontinuity, coloration, and noisiness, so optimizing each of them is a good strategy. The clips are evaluated as fullband (48 kHz) in P.804, so frequency extension can help.
Area | Description | Possible source |
---|---|---|
Noisiness | Background noise, circuit noise, coding noise; BAK | Coding, circuit or background noise; device |
Coloration | Frequency response distortions | Bandwidth limitation, resonances, unbalanced freq. response |
Discontinuity | Isolated and non-stationary distortions | Packet loss; processing; non-linearities |
Loudness | Important for the overall quality and intelligibility | Automatic gain control; mic distance |
Reverberation | Room reverberation of speech and noise | Rooms with high reverberation |
Speech Signal | SIG | |
Overall | OVRL |
6.1 Online subjective evaluation framework
We extended the P.808 Toolkit [10] with a test template for multi-dimensional quality assessment. The toolkit provides scripts for preparing the test, including packing the test clips into small test packages, preparing the reliability-check questions, and analyzing the results. We ask participants to rate the perceptual quality dimensions of speech, namely coloration, discontinuity, noisiness, loudness, and reverberation, as well as the signal quality and overall quality of each audio clip. In the following, each section of the test template, as seen by participants, is described. These sections are predefined, and only the audio clips under test change from one study to another.
In the first section, the participant's eligibility and device suitability are tested, and a qualification that remains valid for the entire experiment is assigned to those who pass. The participant's hearing ability is evaluated through a digit-triplet test [32]. Moreover, we test whether their listening device supports the required bandwidths (i.e., fullband, wideband, and narrowband); details are given in Section 6.1.1.
Next, the participant's environment and device are tested using a modified JND test [33], in which they must select which stimulus from a pair has better quality over four questions. A temporary certificate is issued to participants who pass this section; it expires after two hours, after which the section must be repeated. Detailed instructions are given in the next section of the template, including an introduction to the rating scales and multiple samples for each perceptual dimension. Participants are required to listen to all samples the first time. Figure 2 illustrates how the rating scale for the quality dimensions is presented to participants. In addition, we used a 5-point Likert scale for signal quality and overall quality as specified by ITU-T Rec. P.835.

In the Training section, participants first adjust the playback loudness to a comfortable level by listening to a provided sample and then rate 7 audio clips. This section is similar to the Ratings section, but the platform provides live feedback based on their ratings. Completing this section grants a temporary certificate that is valid for one hour. Last is the Ratings section, where participants listen to ten audio clips plus two gold-standard and trapping questions and cast their votes on each scale. Gold-standard questions are clips whose answers (excellent or bad quality) are already known to the experimenter; participants are expected to vote on each scale with only a minor deviation from the known answer [32]. Trapping questions overlay a synthetic voice on a normal clip that asks participants to provide a specific vote to show their attention [34]. For this test, we provide scripts for creating the trapping clips, which ask participants to select the answers reflecting the best or worst quality on all scales. To rate an audio clip, the participant must first listen to the clip to its end and can then cast their votes; during voting, the audio is played back in a loop. After participants finish a test set, they can continue with the next one, where only the Ratings section is shown as long as the other temporary certificates remain valid. When a certificate expires, the corresponding section is shown again when they start the next test set.

[Figure 2: Rating scales for the perceptual quality dimensions as presented to test participants.]
6.1.1 Survey optimization
We used the multi-scale template in various research studies and improved it based on feedback from experts and test participants.
Descriptive adjectives: Understanding the perceptual dimensions may not be intuitive for naive test participants; therefore, the P.804 recommendation includes a set of descriptive adjectives describing the presence or absence of each quality dimension. We expanded this list through multiple preliminary studies in which participants were asked to listen to samples from each perceptual dimension and name three adjectives that best describe them. For each dimension, we selected the three most frequently named terms and presented them below each pole of the scale, as shown in Figure 2. The list of selected terms is reported in Table 4. We used discrete scales for the dimensions to be consistent with the signal and overall scales.
Bandwidth check: This test ensures that the participant's device supports the expected bandwidth. The test consists of five samples, each with two parts separated by a beep tone. The second part is the same as the first, but in three of the samples it is superimposed with additive noise. Participants listen to each sample and indicate whether the two parts have the same or different quality. We filtered white noise with the following bandpass filters: 3.5–22 kHz (all devices should play the noise), 9.5–22 kHz (super-wideband or fullband is supported), and 15–22 kHz (fullband is supported).
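A minimal sketch of how such band-limited noise stimuli could be generated is shown below, assuming a 48 kHz sampling rate and the band edges listed above; this is illustrative and not the toolkit's actual stimulus-generation code:

```python
# Minimal sketch: band-limit white noise with a Butterworth bandpass filter
# to create bandwidth-check stimuli.
import numpy as np
from scipy.signal import butter, sosfilt

def bandlimited_noise(low_hz: float, high_hz: float, seconds: float = 1.0,
                      fs: int = 48000) -> np.ndarray:
    noise = np.random.randn(int(seconds * fs)).astype(np.float32)
    sos = butter(8, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    band = sosfilt(sos, noise)
    return band / np.max(np.abs(band))    # normalize to avoid clipping

# Conditions: audible on any device, only with super-wideband/fullband support,
# and only with fullband support.
conditions = {"all": (3500, 22000), "swb_fb": (9500, 22000), "fb": (15000, 22000)}
stimuli = {name: bandlimited_noise(lo, hi) for name, (lo, hi) in conditions.items()}
```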
Gold questions: Gold questions are widely used in crowdsourcing [32]. Here we observed that gold questions representing either a strong impairment in a single dimension or a clear absence of impairment in all dimensions best reveal inattentive participants.
Randomization: We randomize the presentation order of scales for each participant. However, the signal and overall quality are always presented at the end. The randomized order is kept for each participant until a new round of training is required.
7 Results
There were 7 entries for the real-time track and 5 for the non-real-time track, though the top 3 entries in the non-real-time track were identical to submissions in the real-time track and were therefore only considered for the real-time track. Team Ctv-tencent was statistically tied with Legends-tencent and withdrew.
The P.804 and P.835 subjective results for both tracks are given in Table 5 and Table 6. The ANOVAs for each track are given in Table 7 and Table 8. P.835 results are given for reference only but agree with the P.804 results. Objective results are given in Table 9.
[Table 5: P.804 and P.835 subjective results for the real-time track.]
[Table 6: P.804 and P.835 subjective results for the non-real-time track.]
Team | Legends-tencent | Ctv-tencent | Genius-team | Hitiot | Noisy | Njuaa-lab | N&B |
---|---|---|---|---|---|---|---|
Ctv-tencent | 0.609 | ||||||
Genius-team | 0.001 | 0.004 | |||||
Hitiot | 0.000 | 0.000 | 0.000 | ||||
Noisy | 0.000 | 0.000 | 0.000 | 0.000 | |||
Njuaa-lab | 0.000 | 0.000 | 0.000 | 0.000 | 0.681 | ||
N&B | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.002 | |
Kuaishou | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.094 |
Team | Legends-tencent | Ctv-tencent | Genius-team | N&B | Hamburg |
---|---|---|---|---|---|
Ctv-tencent | 0.609 | ||||
Genius-team | 0.001 | 0.004 | |||
N&B | 0.000 | 0.000 | 0.000 | ||
Hamburg | 0.000 | 0.000 | 0.000 | 0.285 | |
Noisy | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
[Table 9: Objective metrics (DNSMOS and NISQA) for the challenge entries.]
8 Analysis
8.1 Comparison of methods
A high-level comparison of the top-5 entries is given in Table 10 and Table 11. Some observations are given below:
Place | Track | Team | Params | Real-time factor | Training set | Training set hours | Stages | Domain | M
---|---|---|---|---|---|---|---|---|---
1 | Real-time | Legends-tencent [35] | 12.1 M | 0.37 | DNS [2], private | 1500 | 3 | time, STFT | 0.610
2 | Real-time | Genius-team [36] | 5.2 M | 0.36 | DNS [2] | 1500 | 2 | time, STFT | 0.589
3 | Real-time | HITIoT [37] | 9.2 M | 0.36 | DNS [2] | 1500 | 1 | STFT | 0.531
4 | Non-real-time | N&B [38] | 10 M | 1.48 | DNS [2] | 421 | 2 | STFT | 0.446
5 | Non-real-time | Hamburg [39] | 55.7 M | 30.1 | VCTK [40] | 28.2 | 1 | STFT | 0.445
PCC to M | | | -0.58 | -0.60 | | 0.91 | 0.61 | |
Team | Model
---|---
Legends-tencent [35] | AGC → GSM-GAN (Restore) → Enhance
Genius-team [36] | TRGAN (Restore) → MTFAA-Lite (Enhance)
HITIoT [37] | Half temporal, half frequency attention U-Net
N&B [38] | GateDCCRN (Repairing) → GateDCCRN, S-DCCRN (Denoising)
Hamburg [39] | Generative diffusion model (modified NCSN++)

- The correlation between the training set hours (the total duration of data used) and the overall score is PCC = 0.91. Models with larger training sets tended to do better.
- The correlation between the runtime factor and the overall score is PCC = -0.60. We expected the non-real-time track entries to exceed the performance of the real-time track, but that was not the case. We observed a similar result in the INTERSPEECH 2021 Deep Noise Suppression Challenge [3], where the non-real-time track also performed significantly worse than the real-time track. In both cases we received more entries in the real-time track than in the non-real-time track, and there may be more researchers working on real-time speech enhancement than on non-real-time speech enhancement. One approach to obtaining better non-real-time models is to take the winner of the real-time track and increase the model size and complexity by 100x, which would very likely increase performance while making the model no longer real-time.
- The correlation between the model size and the overall score is PCC = -0.58. Smaller models tended to perform better.
- The correlation between the number of stages and the overall score is PCC = 0.61. More stages tended to perform better.
- There is still significant room for improvement on this test set for OVRL and SIG.
- None of the teams used the ICASSP 2022 Acoustic Echo Cancellation Challenge [27] dataset for training, even though it has thousands of clips with real-world speech signal impairments. This is likely because no clean speech is available for this dataset, and using it would require semi-supervised or unsupervised training. Instead, all teams used the ICASSP 2022 Deep Noise Suppression Challenge [2] data for training, and the winning team, Legends-tencent [35], augmented it with a private training set.


8.2 Distribution of dimensions
Figure 4 shows the distribution of the subjective dimensions compared to overall quality at the model level. All of the dimensions except discontinuity and reverberation have a significant linear correlation to the overall quality (see Figure 5). The high correlation between signal and overall quality (0.98 at the model level and 0.93 at the clip level) can be attributed to the preponderance of signal impairments in this dataset, as opposed to other datasets such as the DNS Challenge datasets, where background noise was the focus. A majority (82%) of the clips in this dataset were found to have lower signal quality than background noise quality (SIG < BAK), whereas this number was below 30% in the most recent DNS challenges. Given that the minimum of signal and background quality is a strong determinant of perceived overall quality [1], the observed high correlation between signal and overall quality in this dataset was expected.
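The following minimal sketch illustrates this style of analysis, assuming a hypothetical per-clip ratings file with model, SIG, BAK, and OVRL columns (the file name and column names are placeholders, not the challenge's released data format):

```python
# Minimal sketch: fraction of clips with SIG < BAK, and clip- vs model-level
# Pearson correlation between signal and overall quality.
import pandas as pd

ratings = pd.read_csv("per_clip_ratings.csv")            # hypothetical file

frac_sig_below_bak = (ratings["SIG"] < ratings["BAK"]).mean()

clip_level_pcc = ratings["SIG"].corr(ratings["OVRL"])    # Pearson by default
model_means = ratings.groupby("model")[["SIG", "OVRL"]].mean()
model_level_pcc = model_means["SIG"].corr(model_means["OVRL"])

print(f"SIG < BAK fraction: {frac_sig_below_bak:.2f}")
print(f"Clip-level PCC: {clip_level_pcc:.2f}, model-level PCC: {model_level_pcc:.2f}")
```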
8.3 Correlation between P.804 and P.835
The correlations between quality scores collected using the P.804- and P.835-based subjective tests, for all entries, are reported in Table 12. We observed a strong correlation for all scores shared between the two subjective methodologies. Considering the rankings of the participating teams, only the ranks of the N&B and Kuaishou teams in the real-time track would swap when scores from the P.835 test are used (they have a tied rank using the P.804 ratings).
Dimension | PCC | SRCC | Kendall Tau-b | Tau-b95 |
---|---|---|---|---|
Background/Noisiness | 0.964 | 0.926 | 0.825 | 0.853 |
Signal | 0.954 | 0.933 | 0.801 | 0.914 |
Overall | 0.965 | 0.940 | 0.825 | 0.822 |
M (challenge metric) | 0.961 | 0.946 | 0.825 | - |
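The correlation measures in Table 12 can be computed with standard tools; a minimal sketch with placeholder per-model scores (not the actual challenge data) is shown below:

```python
# Minimal sketch: Pearson (PCC), Spearman (SRCC), and Kendall tau-b between
# scores from two subjective tests.
from scipy import stats

p804_overall = [3.1, 2.9, 2.5, 2.2, 2.0, 1.8]   # placeholder per-model scores
p835_overall = [3.0, 2.9, 2.6, 2.1, 2.0, 1.7]   # placeholder per-model scores

pcc, _ = stats.pearsonr(p804_overall, p835_overall)
srcc, _ = stats.spearmanr(p804_overall, p835_overall)
tau_b, _ = stats.kendalltau(p804_overall, p835_overall)  # tau-b handles ties

print(f"PCC={pcc:.3f}, SRCC={srcc:.3f}, Kendall tau-b={tau_b:.3f}")
```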
8.4 Correlation of subjective and objective data
In Table 9 we present the objective results on the blind test set using DNSMOS [18] (SIG, BAK, and OVRL estimates) and the NISQA model [20] (NISQA MOS, etc.). Similar to the subjective results, the Legends-tencent, Ctv-tencent, and Genius-team teams attained the best metrics estimated with DNSMOS and NISQA. Moreover, in Table 13 we compute the PCC between the subjective P.804 metrics and the metrics obtained with DNSMOS [18] and NISQA [20]. The clip-level correlations range from PCC = 0.478 to PCC = 0.700, which demonstrates why we still require a subjective test to accurately evaluate speech quality.
Subjective metric | Objective metric | PCC (clip level) | PCC (model level)
---|---|---|---
P.804 Overall | DNSMOS P.835 OVRL | 0.695 | 0.884 |
P.804 Overall | NISQA MOS | 0.681 | 0.766 |
P.804 Signal | DNSMOS P.835 SIG | 0.656 | 0.799 |
P.804 Noisiness | DNSMOS P.835 BAK | 0.545 | 0.933 |
P.804 Noisiness | NISQA NOISE | 0.586 | 0.938 |
P.804 Coloration | NISQA COLOR | 0.663 | 0.872 |
P.804 Discontinuity | NISQA DISCONTINUITY | 0.478 | 0.310 |
P.804 Loudness | NISQA LOUDNESS | 0.700 | 0.784 |
8.5 Model of Overall and other dimensions
We performed Exploratory Factor Analysis (EFA) [42] to investigate the underlying structure of the quality dimensions, namely whether there is shared variance between the sub-dimensions. We used the maximum likelihood extraction method with Varimax rotation and extracted three factors as suggested by the scree plot [43]. The result of Bartlett's test of sphericity was significant, and the KMO value was 0.65, indicating that the data is adequate for exploratory factor analysis. The loading of the quality scores on each factor is presented in Table 14. In total, 62% of the variance in the data is explained by the three factors. Factor 1 represents coloration, with high loadings from signal, coloration, and loudness. Discontinuity loads on factor 2, with some cross-loading from signal, indicating little or no shared variance between discontinuity ratings and either coloration or loudness. As expected, noisiness forms a separate factor orthogonal to the others, with a moderate loading from reverberation. All in all, the EFA results show that coloration, discontinuity, and noisiness load on different orthogonal factors, which aligns with the literature [44]. Signal scores share variance with coloration, discontinuity, and loudness, whereas reverberation shares variance with noisiness. Note that this factor structure represents the construct of the current dataset, and its generalizability should be validated in a separate study.
Quality score | Factor 1 | Factor 2 | Factor 3
---|---|---|---
Signal | 0.824 | 0.481 |
Noisiness | | | 0.742
Coloration | 0.787 | |
Discontinuity | | 0.936 |
Loudness | 0.476 | |
Reverberation | | | 0.413
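A minimal sketch of the EFA described above is given below; it assumes a hypothetical per-clip score file and the third-party factor_analyzer package, and it is illustrative rather than the exact analysis pipeline used here:

```python
# Minimal sketch: exploratory factor analysis with maximum-likelihood
# extraction and Varimax rotation, plus Bartlett's test and KMO.
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

scores = pd.read_csv("per_clip_dimension_scores.csv")   # hypothetical file

chi2, p_value = calculate_bartlett_sphericity(scores)
_, kmo_total = calculate_kmo(scores)
print(f"Bartlett p={p_value:.3g}, KMO={kmo_total:.2f}")

fa = FactorAnalyzer(n_factors=3, method="ml", rotation="varimax")
fa.fit(scores)
loadings = pd.DataFrame(fa.loadings_, index=scores.columns,
                        columns=["Factor 1", "Factor 2", "Factor 3"])
print(loadings.round(2))

var_explained = fa.get_factor_variance()[1].sum()       # proportion of variance explained
print(f"Total variance explained: {var_explained:.2f}")
```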
In addition, we used different regressors to predict the overall quality given the subjective scores of the six sub-dimensions per clip. The results of k-fold cross-validation at the clip and model levels are reported in Table 15. Given that only a limited number of models are available in the dataset, the random forest performed poorly compared to the other regressors at the model level. The coefficients of the linear regression model and the feature importances from the random forest model are reported in Table 16. At the clip level, the feature importance largely agrees between the two models. Given that most of the sub-dimensions have cross-loadings with signal quality in the exploratory factor analysis, we also created regressors to predict the signal quality itself. The performance of those regressors is reported in Table 17 and their coefficients in Table 18. As expected, noisiness and reverberation have the smallest coefficients.
Regressor | Clip-level PCC | Clip-level RMSE | Clip-level R² | Model-level PCC | Model-level RMSE | Model-level R²
---|---|---|---|---|---|---
Linear regression | 0.947 | 0.318 | 0.894 | 0.993 | 0.051 | 0.951
Polynomial (n=4) | 0.959 | 0.276 | 0.920 | 0.996 | 0.047 | 0.969
Random forest | 0.960 | 0.276 | 0.921 | 0.977 | 0.131 | 0.754
Feature | Linear regression coefficient (clip level) | Random forest feature importance (clip level) | Linear regression coefficient (model level) | Random forest feature importance (model level)
---|---|---|---|---
Signal | 0.646 | 0.878 | 0.352 | 0.378
Loudness | 0.146 | 0.044 | 0.251 | 0.162
Coloration | 0.102 | 0.019 | 0.266 | 0.146
Noisiness | 0.100 | 0.027 | 0.134 | 0.248
Discontinuity | 0.065 | 0.016 | 0.190 | 0.051
Reverberation | 0.039 | 0.016 | 0.014 | 0.016
Regressor | Clip-level PCC | Clip-level RMSE | Clip-level R² | Model-level PCC | Model-level RMSE | Model-level R²
---|---|---|---|---|---|---
Linear regression | 0.898 | 0.453 | 0.806 | 0.994 | 0.035 | 0.979
Polynomial (n=4) | 0.907 | 0.434 | 0.821 | 0.992 | 0.043 | 0.964
Random forest | 0.901 | 0.446 | 0.812 | 0.852 | 0.170 | 0.452
Feature | Linear regression coefficient (clip level) | Random forest feature importance (clip level) | Linear regression coefficient (model level) | Random forest feature importance (model level)
---|---|---|---|---
Loudness | 0.128 | 0.061 | 0.184 | 0.142
Coloration | 0.503 | 0.559 | 0.519 | 0.373
Noisiness | 0.072 | 0.044 | 0.051 | 0.306
Discontinuity | 0.430 | 0.281 | 0.580 | 0.130
Reverberation | 0.099 | 0.054 | 0.158 | 0.049
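A minimal sketch of the regression analysis in this section is shown below, assuming a hypothetical per-clip score file with the six sub-dimensions and the overall score (file and column names are placeholders):

```python
# Minimal sketch: predict overall quality from the six sub-dimensions with
# linear regression and a random forest under k-fold cross-validation.
import pandas as pd
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

scores = pd.read_csv("per_clip_dimension_scores.csv")    # hypothetical file
features = ["Signal", "Loudness", "Coloration", "Noisiness",
            "Discontinuity", "Reverberation"]
X, y = scores[features], scores["Overall"]

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, reg in [("Linear regression", LinearRegression()),
                  ("Random forest", RandomForestRegressor(random_state=0))]:
    pred = cross_val_predict(reg, X, y, cv=cv)
    pcc, _ = pearsonr(y, pred)
    rmse = ((y - pred) ** 2).mean() ** 0.5
    print(f"{name}: PCC={pcc:.3f}, RMSE={rmse:.3f}")

# Coefficients / feature importances on the full data (cf. Table 16)
print(LinearRegression().fit(X, y).coef_)
print(RandomForestRegressor(random_state=0).fit(X, y).feature_importances_)
```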
8.6 Word error rate
[Table 19: Word error rates for the real-time track.]
[Table 20: Word error rates for the non-real-time track.]
To provide a more comprehensive view of the speech enhancement models, in Table 19 and Table 20 we include the word error rate (WER) for both tracks. To eliminate potential reference bias introduced by automatic speech recognition (ASR) systems, we used human transcripts as the references when calculating WER; a state-of-the-art speech recognition API from Azure Cognitive Services was used to produce the hypotheses. In the second track, the ranking is identical to the P.804 ranking (excluding noisy), while in the first track there are some shifts between teams. The best WER, attained by the Legends-tencent team, is still slightly worse than the WER computed on the unprocessed noisy files, highlighting that there is still large potential in this research area.
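A minimal sketch of the WER computation is shown below, using the third-party jiwer package; the transcripts are placeholders, whereas in the challenge the references were human transcripts and the hypotheses came from the Azure speech recognition API run on the enhanced clips:

```python
# Minimal sketch: compute word error rate (WER) between reference transcripts
# and ASR hypotheses after light text normalization.
import jiwer

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so formatting differences do not inflate WER.
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())

references = [normalize(t) for t in ["The quick brown fox jumps over the lazy dog."]]
hypotheses = [normalize(t) for t in ["the quick brown fox jumps over a lazy dog"]]

print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
```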
9 Conclusions
Unlike our previous deep noise suppression challenges, this challenge showed several models with significant improvement in the speech signal. The top models improved all areas we measured: noisiness, discontinuity, coloration, loudness, and reverberation. While the improvements are impressive, there is still significant room for improvement in this test set (see Table 21).
All of the models used in this challenge are relatively small compared to large language models or large multimodal language models. An interesting new area would be to apply a large audio language model (e.g., [45]) to speech restoration and enhancement. Even if such a model cannot be run in real time or with low latency, there are still many scenarios where it can be applied. In addition, all of the models submitted to this challenge used training sets with clean speech available. A good future direction of research is to utilize real-world training sets such as [27], which will require semi-supervised or unsupervised learning.
Area | Headroom |
---|---|
Overall | 1.73 |
Signal | 1.74 |
Background | 0.36 |
Coloration | 1.40 |
Loudness | 0.80 |
Discontinuity | 0.89 |
Reverberation | 0.68 |
For future speech signal improvement challenges, we plan to provide an objective metric similar to NISQA. We also plan to add word accuracy rate as an additional metric to optimize. We plan to provide a synthetic data generator and a baseline model to give all participants a better starting point. As noted above, we hypothesize that large multimodal models could yield significant improvements in this area, so keeping a non-real-time track seems important to encourage this exploration.
References
- [1] “ITU-T Recommendation P.835: Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm,” 2003.
- [2] H. Dubey, V. Gopal, R. Cutler, A. Aazami, S. Matusevych, S. Braun, S. E. Eskimez, M. Thakker, T. Yoshioka, H. Gamper, and R. Aichner, “ICASSP 2022 Deep Noise Suppression Challenge,” in ICASSP, 2022.
- [3] C. K. Reddy, H. Dubey, K. Koishida, A. Nair, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, “INTERSPEECH 2021 Deep Noise Suppression Challenge,” in INTERSPEECH. Aug. 2021, pp. 2796–2800, ISCA.
- [4] H. Dubey, A. Aazami, V. Gopal, B. Naderi, S. Braun, R. Cutler, A. Ju, M. Zohourian, M. Tang, H. Gamper, M. Golestaneh, and R. Aichner, “ICASSP 2023 Deep Noise Suppression Challenge,” May 2023, arXiv:2303.11510 [cs, eess].
- [5] “ITU-T Recommendation P.804: Subjective diagnostic test method for conversational speech quality analysis,” 2017.
- [6] J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2021, pp. 166–170, ISSN: 1947-1629.
- [7] J. Serrà, S. Pascual, J. Pons, R. O. Araz, and D. Scaini, “Universal Speech Enhancement with Score-based Diffusion,” 2022, https://openreview.net/forum?id=7BfWbjOqgMf.
- [8] “ITU-T Recommendation P.800: Methods for subjective determination of transmission quality,” 1996.
- [9] “ITU-T Recommendation P.808: Subjective evaluation of speech quality with a crowdsourcing approach,” 2018.
- [10] B. Naderi and R. Cutler, “An Open Source Implementation of ITU-T Recommendation P.808 with Validation,” INTERSPEECH, pp. 2862–2866, Oct. 2020.
- [11] B. Naderi and R. Cutler, “Subjective Evaluation of Noise Suppression Algorithms in Crowdsourcing,” in INTERSPEECH, 2021.
- [12] “ITU-T Recommendation P.863.2: Extension of ITU-T P.863 for multi-dimensional assessment of degradations in telephony speech signals up to full-band,” 2022.
- [13] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in ICASSP. 2001, vol. 2, pp. 749–752, IEEE.
- [14] J. G. Beerends, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl, “Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part I–Temporal Alignment,” J. Audio Eng. Soc., vol. 61, no. 6, pp. 19, 2013.
- [15] “ITU-T Recommendation P.563: Perceptual objective listening quality assessment: An advanced objective perceptual method for end-to-end listening speech quality evaluation of fixed, mobile, and IP-based networks and speech codecs covering narrowband, wideband, and super-wideband signals,” 2011.
- [16] A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev, and J. Gehrke, “Non-intrusive Speech Quality Assessment Using Neural Networks,” in ICASSP, Brighton, United Kingdom, May 2019, pp. 631–635, IEEE.
- [17] C. K. A. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppressors,” in INTERSPEECH, 2021.
- [18] C. K. A. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,” in ICASSP, 2022, pp. 886–890, ISSN: 2379-190X.
- [19] G. Yi, W. Xiao, Y. Xiao, B. Naderi, S. Möller, W. Wardah, G. Mittag, R. Cutler, Z. Zhang, D. S. Williamson, F. Chen, F. Yang, and S. Shang, “ConferencingSpeech 2022 Challenge: Non-intrusive Objective Speech Quality Assessment (NISQA) Challenge for Online Conferencing Applications,” in INTERSPEECH, 2022.
- [20] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in INTERSPEECH, 2021.
- [21] C. K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,” in INTERSPEECH 2020. Oct. 2020, pp. 2492–2496, ISCA.
- [22] C. K. A. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, “ICASSP 2021 Deep Noise Suppression Challenge,” in ICASSP, June 2021, pp. 6623–6627, ISSN: 2379-190X.
- [23] L. Diener, S. Sootla, S. Branets, A. Saabas, R. Aichner, and R. Cutler, “INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge,” in INTERSPEECH, 2022, p. 5.
- [24] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and T. Yoshioka, “A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, pp. 7, Dec. 2016.
- [25] K. Sridhar, R. Cutler, A. Saabas, T. Parnamaa, M. Loide, H. Gamper, S. Braun, R. Aichner, and S. Srinivasan, “ICASSP 2021 Acoustic Echo Cancellation Challenge: Datasets, Testing Framework, and Results,” in ICASSP, 2021.
- [26] R. Cutler, A. Saabas, T. Parnamaa, M. Loide, S. Sootla, M. Purin, H. Gamper, S. Braun, K. Sorensen, R. Aichner, and S. Srinivasan, “INTERSPEECH 2021 Acoustic Echo Cancellation Challenge,” in INTERSPEECH, June 2021.
- [27] R. Cutler, A. Saabas, T. Parnamaa, M. Purin, H. Gamper, S. Braun, and R. Aichner, “ICASSP 2022 Acoustic Echo Cancellation Challenge,” in ICASSP, 2022.
- [28] R. Cutler, A. Saabas, T. Parnamaa, M. Purin, E. Indenbom, N.-C. Ristea, J. Gužvin, H. Gamper, S. Braun, and R. Aichner, “ICASSP 2023 Acoustic Echo Cancellation Challenge,” 2023.
- [29] H. Li and J. Yamagishi, “DDS: A new device-degraded speech dataset for speech enhancement,” in INTERSPEECH, 2022.
- [30] G. J. Mysore, “Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech?—A Dataset, Insights, and Challenges,” IEEE Signal Processing Letters, vol. 22, no. 8, pp. 1006–1010, Aug. 2015.
- [31] B. Naderi, R. Cutler, and N.-C. Ristea, “Multi-dimensional Speech Quality Assessment in Crowdsourcing,” Sept. 2023, arXiv:2309.07385 [cs, eess].
- [32] B. Naderi, R. Zequeira Jiménez, M. Hirth, S. Möller, F. Metzger, and T. Hoßfeld, “Towards speech quality assessment using a crowdsourcing approach: evaluation of standardized methods,” Quality and User Experience, vol. 6, no. 1, pp. 2, Nov. 2020.
- [33] B. Naderi and S. Möller, “Application of just-noticeable difference in quality as environment suitability test for crowdsourcing speech quality assessment task,” in QoMEX. 2020, pp. 1–6, IEEE.
- [34] B. Naderi, T. Polzehl, I. Wechsung, F. Köster, and S. Möller, “Effect of trapping questions on the reliability of speech quality judgments in a crowdsourcing paradigm,” in INTERSPEECH, 2015.
- [35] J. Chen, Y. Shi, W. Liu, W. Rao, S. He, A. Li, Y. Wang, Z. Wu, S. Shang, and C. Zheng, “Gesper: A unified framework for general speech restoration,” in ICASSP, 2023.
- [36] W. Zhu, Z. Wang, J. Lin, C. Zeng, and T. Yu, “SSI-NET: A multi-stage speech signal improvement system for ICASSP 2023 SSI challenge,” in ICASSP, 2023.
- [37] Z. Zhang, S. Xu, X. Zhuang, Y. Qian, L. Zhou, and M. Wang, “Half-temporal and half-frequency attention u2net for speech signal improvement,” in ICASSP, 2023.
- [38] M. Liu, S. Lv, Z. Zhang, R. Han, X. Hao, X. Xia, L. Chen, Y. Xiao, and L. Xie, “Two-stage neural network for ICASSP 2023 speech signal improvement challenge,” in ICASSP, 2023.
- [39] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, T. Peer, and T. Gerkmann, “Speech signal improvement using causal generative diffusion models,” in ICASSP, 2023.
- [40] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019, https://doi.org/10.7488/ds/2645.
- [41] B. Naderi and S. Möller, “Transformation of Mean Opinion Scores to Avoid Misleading of Ranked Based Statistical Techniques,” in QoMEX, May 2020, pp. 1–4, ISSN: 2472-7814.
- [42] M. W. Watkins, “Exploratory Factor Analysis: A Guide to Best Practice,” Journal of Black Psychology, vol. 44, no. 3, pp. 219–246, Apr. 2018.
- [43] R. B. Cattell, “The Scree Test For The Number Of Factors,” Multivariate Behavioral Research, vol. 1, no. 2, pp. 245–276, Apr. 1966.
- [44] M. Wältermann, Dimension-based Quality Modeling of Transmitted Speech, Springer Science & Business Media, Jan. 2013.
- [45] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour, “AudioLM: A Language Modeling Approach to Audio Generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2523–2533, 2023.