
Private Language Model Adaptation for Speech Recognition

Abstract

Speech model adaptation is crucial for handling the discrepancy between server-side proxy training data and the actual data received on users' local devices. Using federated learning (FL), we introduce an efficient approach for continuously adapting neural network language models (NNLMs) on private devices, with application to automatic speech recognition (ASR). To address potential speech transcription errors in the on-device training corpus, we perform empirical studies comparing various strategies for leveraging token-level confidence scores to improve NNLM quality in the FL setting. Experiments show that, compared with no model adaptation, the proposed method achieves relative word error rate (WER) reductions of 2.6% and 10.8% on two speech evaluation datasets, respectively. We also provide an analysis evaluating the privacy guarantees of the presented procedure.

Index Terms: federated learning, language modeling, speech recognition, adaptation, confidence scoring

1 Introduction

Neural network language models (NNLMs) play critical roles in automatic speech recognition (ASR) systems [1, 2, 3, 4]. They typically outperform traditional n-gram LMs owing to their better capability of modeling long-range dependencies. For conventional ASR models, NNLMs are widely used in the second pass via N-best or lattice rescoring [5, 6, 7]. For end-to-end ASR [8, 9, 10], although linguistic information is implicitly learned, NNLMs can still further improve accuracy through fusion in first-pass decoding [11, 12] or second-pass rescoring.

With the latest advances in mobile technologies, hosting an ASR system entirely on-device has important implications for reliability, latency, and especially privacy, and has become an active area of research and industrial application [13]. A common issue after deploying an ASR model on user devices is the discrepancy between training data and the actual data received on local devices. The semantic and acoustic characteristics of real users' speech can differ considerably from those of server-side proxy data, in which case speech model adaptation is indispensable. The privacy-preserving constraint requires user data to stay on local devices, which makes model adaptation more challenging since no ground-truth speech transcriptions from users are available.

To resolve this privacy concern, federated learning (FL) [14, 15, 16], a distributed learning technique, has been proposed and applied in many fields, including recommendation [17], keyboard suggestion [18, 19], keyword spotting [20], health care [21], and more recently, ASR with both hybrid acoustic models and end-to-end models [22, 23, 24]. FL protects data privacy by training a shared model in a decentralized manner on users' local devices, so that raw data never leaves physical devices. Specifically, FL distributes the training process among a large number of client devices, with each client learning from live data and computing model updates independently, then uploading those updates to a central server for aggregation. The updated model is later delivered to each client device, and this procedure is repeated until the model converges.

In this work, we focus on federated NNLM adaptation for speech recognition applications. Federated language modeling has been well explored in mobile keyboard suggestion, where sentences typed by users provide instantly labeled data for supervised learning [19]. However, for on-device ASR with privacy-preserving requirements, users' text data cannot be accessed directly. Instead, we can use decoded hypotheses to perform model adaptation, whose quality is then affected by any ASR transcription errors. Thus, more advanced methods are needed to better leverage the transcribed data for FL-based adaptation in an unsupervised manner.

To alleviate the transcription error issue described above, we leverage confidence scores of transcripts, which estimate how likely each token in a decoded ASR hypothesis is to be correct [25, 26]. Lattice posteriors from conventional ASR systems can be used directly as confidence scores. Model-based approaches, for example, confidence classifiers trained on various decoding features [27], can provide more accurate confidence measurements. In this paper, we propose to mitigate errors in decoded hypotheses by directly modifying the NNLM training objective using token-level confidence scores from a confidence classifier, thereby improving adaptation quality.

Prior work on using ASR confidence scores for LM tasks is limited. The authors of [28] use confidence scores to select text data for LM adaptation. Our paper investigates this direction in a rigorous way, proposes weighting methods for adjusting the cross-entropy loss, and conducts thorough comparisons of the performance of these weighting approaches in the FL framework.

We mainly pursue three goals: (1) to present an effective procedure for FL-based domain adaptation of NNLMs with applications to ASR; (2) to empirically compare approaches of using token-level confidence scores to improve adaptation quality; and (3) to provide an analysis evaluating the privacy guarantees of the proposed method using differential privacy (DP) tools [29, 30]. To the best of our knowledge, this is the first work that leverages FL to fine-tune NNLMs for ASR systems and utilizes confidence scores to address the potential quality degradation from training on mis-transcribed text. The proposed confidence-based approach can also be applied to other tasks, for example, unsupervised speaker adaptation.

The rest of the paper is organized as follows. In Section 2, we introduce the FL-based domain adaptation approach for NNLMs with applications to speech recognition tasks. We evaluate the proposed method in Section 3 and conclude in Section 4.

2 Methods

2.1 Federated Adaptation Framework

FL distributes the model training process across a large number of client devices. Each device trains on private data and computes model updates independently. These updates are then uploaded to a central server for aggregation, and the updated model is deployed to each client afterwards. We describe our approach to FL-based NNLM adaptation for ASR as follows.

Pre-training. We train an initial NNLM using a large general corpus and, if available, any “close-in-domain” proxy data on the server side. The model is then delivered to each local device along with an ASR model;

Client-side update. Once a user has input a sufficient volume of utterances, the locally transcribed personal data is used to fine-tune all parameters of the current NNLM on the device. The client model update is then sent back to the server if the device is selected to join the cohort;

Server-side update. Once sufficient client model updates are received by the server, a global model update is conducted and the updated server model is deployed to each local device.

The client-side and server-side updates above are repeated until model convergence. The procedure is outlined in Algorithm 1, with more details in subsections 2.2, 2.3, and 2.4.

Hyper-parameters: $K, \eta_l, \eta_g, \beta_1, \beta_2, \epsilon_g$;
Initialize $\theta_1$ as a pre-trained NNLM (no adaptation);
for each round $t = 1, 2, \ldots$ do
      Deliver $\theta_t$ to each client
      Sample a subset $\mathcal{I}_t$ of clients
      for each client $i \in \mathcal{I}_t$ in parallel do
            $\theta_{i,1}^{t} := \theta_t$
            Load ASR transcripts on client $i$ for training
            for each local epoch $k = 1, 2, \ldots, K$ do
                  Compute gradients $g_{i,k}^{t}$ on batches
                  $\theta_{i,k+1}^{t} \leftarrow \texttt{SGD}(\theta_{i,k}^{t}, g_{i,k}^{t}, \eta_l)$
            end for
            $\theta_i^{t} := \theta_{i,K+1}^{t}$
            Send $\Delta_i^{t} := \theta_t - \theta_i^{t}$ to server
      end for
      $\theta_{t+1} \leftarrow \texttt{FedAdam}(\theta_t, \{\Delta_i^{t}\}, \{w_i^{t}\}, t, \eta_g, \beta_1, \beta_2, \epsilon_g)$
end for

Algorithm 1: FL-based NNLM adaptation for ASR.
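To make the round structure concrete, the following is a minimal, single-process NumPy sketch of one round of Algorithm 1. The names fl_round, grad_fn, and sample_size are ours for illustration; grad_fn stands in for on-device back-propagation through the NNLM loss, the model is flattened into a single parameter vector, and in a real deployment the inner loop runs on physically separate devices rather than in one process.

import numpy as np

def fl_round(theta, clients, sample_size, K, eta_l, rng):
    # One round of Algorithm 1. `clients` is a list of (grad_fn, num_words)
    # pairs; grad_fn(theta) returns a stochastic gradient of that client's
    # NNLM loss on its local ASR transcripts.
    chosen = rng.choice(len(clients), size=sample_size, replace=False)
    deltas, weights = [], []
    for i in chosen:
        grad_fn, num_words = clients[i]
        theta_i = theta.copy()              # deliver theta_t to the client
        for _ in range(K):                  # K local epochs of SGD, Eq. (1)
            theta_i = theta_i - eta_l * grad_fn(theta_i)
        deltas.append(theta - theta_i)      # Delta_i^t sent to the server
        weights.append(num_words)           # w_i^t: words used in training
    w = np.asarray(weights, dtype=float)
    delta_t = np.tensordot(w / w.sum(), np.stack(deltas), axes=1)  # Eq. (6)
    return delta_t                          # fed into FedAdam (Sec. 2.4)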

2.2 Client-side NNLM Adaptation

Upon initial deployment, the on-device ASR model, with the pre-trained NNLM for second-pass rescoring, runs on each local client to transcribe the user's utterances. The decoded hypotheses are then stored on the local device and serve as in-domain data for on-device NNLM adaptation. In this work, only the 1-best hypothesis of each utterance is stored and used for on-device training.

Suppose that at round $t$ of FL training, each selected client downloads the NNLM $\theta_t$ from the server and performs secure local training on its own device, that is, fine-tuning $\theta_t$ using private data. Mini-batch stochastic gradient descent (SGD) is used as the local optimizer. Specifically, in the $k$th training epoch, the transcript data on client $i$ is split into multiple batches. For the $b$th batch, denote by $\theta_{i,k,b}^{t}$ the current client model and by $g_{i,k,b}^{t}$ the gradients computed after back-propagation. The client model is then updated as

$\theta_{i,k,b+1}^{t} = \theta_{i,k,b}^{t} - \eta_{l} \cdot g_{i,k,b}^{t}(\theta_{i,k,b}^{t})$    (1)

where $\eta_l$ is the local learning rate. After $K$ epochs of training, the client uploads its model update (i.e., the difference of model parameters; see subsection 2.4) to the central server over a secure connection.

In the next subsection, we describe how to compute $g_{i,k,b}^{t}$ for the speech LM task, and how to utilize token-level confidence scores to address the potential quality degradation caused by training on mis-transcribed text.

2.3 NNLM Adaptation with Confidence Scores

Cross-entropy loss is commonly used for NNLM training. For the $b$th batch of the $k$th local training epoch on client $i$, it takes the form

$\mathcal{L}_{i,k,b}^{t}(\theta) = -\frac{1}{n_b}\sum_{j=1}^{n_b}\frac{1}{T_j}\sum_{s=1}^{T_j}\log(\hat{p}_{j,s,v_{j,s}^{*}}(\theta))$    (2)

where $n_b$ denotes the batch size, $T_j$ the sequence length, $v_{j,s}^{*}$ the $s$th word of the $j$th transcript, and $\hat{p}_{j,s,v_{j,s}^{*}}$ the predicted probability of observing $v_{j,s}^{*}$ over the entire vocabulary.

Adapting NNLMs on ASR-transcribed data poses the challenge of dealing with potential transcription errors. In this work, we leverage external confidence classifier models to mitigate this issue. Specifically, we modify the NNLM training objective using confidence scores. Let $\hat{c}_{j,s}$ be the estimated confidence score for the word $v_{j,s}^{*}$ in the $j$th transcript, and

$\hat{c}_j := \frac{1}{T_j}\sum_{s=1}^{T_j}\hat{c}_{j,s}$    (3)

be the estimated utterance-level confidence score, which is the average of token-level confidence scores. We propose the following three modified loss functions for NNLM training.

Hard thresholding. We adopt utterance-level confidence scores for training data selection, which amounts to excluding the set of utterances

$\{j \in [n_b] : \hat{c}_j < c\}$

from training, where $c$ is a fixed threshold. Note that this method is equivalent to including an indicator function as a multiplier on each utterance in the loss function.

Utterance-level weighting. The utterance-level confidence scores are leveraged for loss weighting

$\mathcal{L}_{i,k,b}^{t,\text{utt-weight}}(\theta) = -\frac{1}{n_b}\sum_{j=1}^{n_b}\frac{\hat{c}_j}{T_j}\sum_{s=1}^{T_j}\log(\hat{p}_{j,s,v_{j,s}^{*}}(\theta))$    (4)

Token-level weighting. We utilize token-level confidence scores for weighting in the loss function

$\mathcal{L}_{i,k,b}^{t,\text{token-weight}}(\theta) = -\frac{1}{n_b}\sum_{j=1}^{n_b}\frac{1}{T_j}\sum_{s=1}^{T_j}\hat{c}_{j,s}\log(\hat{p}_{j,s,v_{j,s}^{*}}(\theta))$    (5)
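As an illustration, the PyTorch sketch below computes all three objectives plus the unweighted baseline of Eq. (2) for one batch. The tensor names and shapes are our assumptions (fixed-length sequences, padding omitted for brevity), and the confidence scores are taken as given, e.g., produced by the external confidence classifier.

import torch

def confidence_weighted_nll(log_probs, targets, conf, mode="token", c=0.8):
    # log_probs: (n_b, T, V) log-softmax NNLM outputs
    # targets:   (n_b, T)    token ids of the 1-best ASR hypothesis
    # conf:      (n_b, T)    token-level confidence scores in [0, 1]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (n_b, T)
    utt_conf = conf.mean(dim=1)          # utterance-level score, Eq. (3)
    if mode == "hard":                   # keep utterances with conf >= c
        keep = (utt_conf >= c).float()
        return (keep * nll.mean(dim=1)).mean()
    if mode == "utterance":              # Eq. (4)
        return (utt_conf * nll.mean(dim=1)).mean()
    if mode == "token":                  # Eq. (5)
        return (conf * nll).mean(dim=1).mean()
    return nll.mean(dim=1).mean()        # unweighted baseline, Eq. (2)

Note that token-level weighting reduces to the unweighted loss of Eq. (2) when every confidence score equals one.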

2.4 Server-side NNLM Update

Suppose that at round $t$, the server holds the model $\theta_t$ and samples a set $\mathcal{I}_t$ of clients. Let $\theta_i^{t}$ denote the model of each client $i \in \mathcal{I}_t$ after local training, and $\Delta_i^{t} := \theta_t - \theta_i^{t}$ the model difference on client $i$, which is sent back to the server. Let

$\Delta_t := \frac{\sum_{i\in\mathcal{I}_t} w_i^{t}\Delta_i^{t}}{\sum_{i\in\mathcal{I}_t} w_i^{t}}$    (6)

be the averaged model difference, or “pseudo-gradient”, used in the general server optimizer update. Here, $w_i^{t}$ is the weight assigned to the model difference from client $i$ in the aggregation, i.e., the number of words in the training data used to adapt the client NNLM in round $t$.

We use the FedAdam optimizer for updating the global model [31]. Specifically, let $\eta_g$ be the learning rate and $\beta_1, \beta_2 \in [0, 1)$ be hyper-parameters; then

$m_t = \beta_1 m_{t-1} + (1-\beta_1)\Delta_t$    (7)
$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\Delta_t)^2$    (8)
$\hat{m}_t = m_t / (1-\beta_1^t)$    (9)
$\hat{v}_t = v_t / (1-\beta_2^t)$    (10)
$\theta_{t+1} = \theta_t - \eta_g \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon_g}$    (11)

where $m_0 = 0$, $v_0 = 0$, and $\epsilon_g$ is a small positive constant.
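A compact NumPy sketch of this server update follows. The $\beta_1$, $\beta_2$, and $\epsilon_g$ defaults are common Adam choices we assume for illustration, since only $\eta_g$ is reported in Section 3; the class name FedAdamServer is ours.

import numpy as np

class FedAdamServer:
    # Server-side FedAdam of Eqs. (7)-(11), applied to the averaged
    # pseudo-gradient Delta_t of Eq. (6).
    def __init__(self, eta_g=0.001, beta1=0.9, beta2=0.999, eps_g=1e-8):
        self.eta_g, self.beta1, self.beta2, self.eps_g = eta_g, beta1, beta2, eps_g
        self.m = self.v = None
        self.t = 0

    def step(self, theta, delta_t):
        if self.m is None:
            self.m = np.zeros_like(theta)   # m_0 = 0
            self.v = np.zeros_like(theta)   # v_0 = 0
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * delta_t       # Eq. (7)
        self.v = self.beta2 * self.v + (1 - self.beta2) * delta_t ** 2  # Eq. (8)
        m_hat = self.m / (1 - self.beta1 ** self.t)                     # Eq. (9)
        v_hat = self.v / (1 - self.beta2 ** self.t)                     # Eq. (10)
        return theta - self.eta_g * m_hat / (np.sqrt(v_hat) + self.eps_g)  # Eq. (11)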

2.5 Differential Privacy for NNLM Adaptation

A differentially private mechanism enables the public release of model parameters with strong privacy protection [29, 30].

Definition 2.1 (DP)

A randomized mechanism $\mathcal{M}$ with domain $\mathcal{D}$ and range $\mathcal{S}$ satisfies $(\epsilon, \delta)$-DP if for any two adjacent datasets $d, d' \in \mathcal{D}$ and any subset $S \subseteq \mathcal{S}$, it holds that

$P(\mathcal{M}(d) \in S) \leq e^{\epsilon} P(\mathcal{M}(d') \in S) + \delta.$    (12)

Here, $d$ and $d'$ are adjacent if $d'$ can be formed by adding or removing a single training example from $d$. It is worth noting that the definition of adjacency in Definition 2.1 depends on the application. Most prior work on DP considers example-level adjacency (utterance-level in our case). For our task, a better definition is user-level adjacency, which protects whole user histories in the training set [32], since a sensitive word may be uttered several times by an individual user. Note that given a target $\delta$, a smaller $\epsilon$ provides stronger privacy protection but often degrades model accuracy.

Two modifications to the FL-based NNLM adaptation are needed to make the algorithm differentially private. First, clip the gradient computed on each client in every round to bound a single user's impact on the model parameters. Second, add randomly sampled Gaussian noise to the clipped gradients.
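Below is a minimal NumPy sketch of these two modifications applied at aggregation time, in the style of DP-FedAvg [32]. It assumes equal client weights (the word-count weighting of Eq. (6) requires extra care under DP), simplifies the sensitivity bookkeeping, and uses illustrative names such as dp_aggregate.

import numpy as np

def dp_aggregate(deltas, clip_norm=0.5, noise_multiplier=1.5, rng=None):
    # deltas: list of per-client model updates Delta_i^t for one round.
    rng = rng or np.random.default_rng()
    clipped = []
    for d in deltas:
        scale = min(1.0, clip_norm / max(np.linalg.norm(d), 1e-12))
        clipped.append(d * scale)        # bound each user's L2 contribution
    avg = np.mean(clipped, axis=0)       # equal-weight average
    # Noise std = noise multiplier times the L2 sensitivity of the
    # average (clip_norm / number of clients), matching Section 3.4.
    sigma = noise_multiplier * clip_norm / len(deltas)
    return avg + rng.normal(0.0, sigma, size=avg.shape)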

In subsection 3.4, we perform privacy analysis of the proposed NNLM adaptation approach.

3 Experiments

3.1 Datasets

In our experiments, the ASR model is trained on an in-house video dataset (14K hours), which is sampled from public social media videos and de-identified before transcription; neither transcribers nor researchers have access to any user-identifiable information (UII).

The following two ASR applications are considered: the first is conversational speech and the second is short voice commands for smart devices. For both use cases, in-domain speech datasets are collected on mobile devices through crowd-sourcing from a data supplier for ASR; the data is properly anonymized and contains no UII.

Table 1 summarizes the two datasets, each containing an adaptation split for NNLM adaptation and a test split for model evaluation. It is worth mentioning that the word error rates (WERs) on the adaptation split, using the video ASR model with a pre-trained NNLM for second-pass rescoring, are 9.7% and 12.2% for the conversation and voice command applications, respectively.

Table 1: Summary of speech datasets for two applications.

Split        Feature      Conversation   Voice Command
adaptation   # of utts    166K           63K
             # of words   1,738K         363K
test         # of utts    13K            13K
             # of words   123K           73K

3.2 Setups

We would like to simulate the real-world scenarios after deploying the ASR and pre-trained NNLM models to clients. Voice data from the adaptation split is streamed to each device and decoded by the on-device speech models. Then the transcripts are used to fine-tune the NNLMs for domain adaptation.

For the ASR model, we use the connectionist temporal classification (CTC) criterion [8] to learn an acoustic model, which is further composed with a 5-gram LM in a standard weighted finite-state transducer framework. We adopt a 6-layer latency-controlled bi-directional LSTM encoder with hidden dimension 1,000. The NNLM is used for second-pass 5-best rescoring; we utilize an LSTM-based model with character embeddings [33] of dimension 100 and 2 layers of 512 hidden units. For each hypothesis in the 5-best list, its NNLM score is linearly interpolated with the score from the 5-gram LM, with the interpolation weight chosen to give the lowest WER. The confidence classifier is trained on the video dataset using feed-forward networks and handcrafted input features from decoding results; isotonic regression is used to calibrate the model.
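For concreteness, a tiny sketch of the second-pass interpolation step follows. We assume both scores are log-probabilities and omit the acoustic score for brevity; the field names and the weight value are illustrative, whereas in the experiments the interpolation weight is tuned for the lowest WER.

def rescore_5best(hypotheses, lam=0.5):
    # Return the hypothesis maximizing the interpolated LM score.
    def score(h):
        return lam * h["nnlm_logprob"] + (1.0 - lam) * h["ngram_logprob"]
    return max(hypotheses, key=score)

best = rescore_5best([
    {"text": "turn on the lights", "nnlm_logprob": -3.8, "ngram_logprob": -5.6},
    {"text": "turn on the light",  "nnlm_logprob": -4.1, "ngram_logprob": -5.3},
])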

Figure 1: Histogram of the number of utterances on each device in the simulation.

To simulate the federated environment, we generate a random device label for each utterance from a Zipf distribution, so that utterances sharing a device label are considered to be received by the same device. This results in approximately 8K devices for each application. Figure 1 shows the histogram of the number of training examples (i.e., utterances) per device in the simulation; the empirical distribution is highly skewed.
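This device simulation can be reproduced with a truncated Zipf draw as sketched below; the exponent value is our assumption, as the paper does not report the distribution's parameter.

import numpy as np

rng = np.random.default_rng(0)
num_utts, num_devices = 166_000, 8_000   # conversation adaptation split

a = 1.3                                  # assumed Zipf exponent
ranks = np.arange(1, num_devices + 1, dtype=float)
probs = ranks ** -a
probs /= probs.sum()                     # truncated Zipf over device ranks

# One device label per utterance; utterances sharing a label live on
# the same simulated device.
labels = rng.choice(num_devices, size=num_utts, p=probs)
counts = np.bincount(labels, minlength=num_devices)  # skewed, as in Fig. 1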

Regarding the hyper-parameters of FL training, we set the number of selected users per round to $|\mathcal{I}_t| = 100$, the learning rate of the global FedAdam optimizer to $\eta_g = 0.001$, and that of the client SGD optimizer to $\eta_l = 1.0$. Locally, we train only 1 epoch with batch size 8 on each selected client per FL round. We use 10 epochs for FL training, where each epoch corresponds to 100 rounds.

3.3 Evaluation Results

The baseline model in our experiments uses the server-side pre-trained NNLM for rescoring, without domain adaptation. We compare it with NNLMs fine-tuned in the FL framework on in-domain unsupervised text from the adaptation split (transcribed by the ASR model). Note that such in-domain data never leaves physical devices and is thus not accessible from servers due to privacy restrictions. Multiple methods for handling potential transcription errors in the adaptation data are measured: using all transcripts, hard thresholding, and utterance-level and token-level weighting. For comparison purposes, we also include the result without NNLM rescoring.

Table 2 and Table 3 show the perplexity (PPL) and WER results on the conversation and voice command evaluation datasets. Compared with the baseline model, FL-based domain-adapted NNLMs (using all transcripts) obtain relative WER reductions of 2.1% and 8.4% on the two use cases. In addition, models leveraging confidence scores always outperform the one using all transcripts as training data, yielding up to an additional 0.5% and 2.4% relative WER reduction on the two applications, respectively. For the short voice command evaluation set, hard thresholding gives the best result; for the longer conversation utterances, token-level weighting performs best.

Table 2: Results on the Conversation evaluation set.

Model                                   PPL     WER
No NNLM                                 -       8.07
Server-pretrained NNLM (no adapt)       109.4   7.96
FL-finetuned NNLM (all utts)            32.0    7.79 (-2.1%)
FL-finetuned NNLM (hard thres. utts)    31.2    7.76 (-2.5%)
FL-finetuned NNLM (utt weighted)        30.8    7.77 (-2.4%)
FL-finetuned NNLM (token weighted)      30.3    7.75 (-2.6%)
Table 3: Results on the Voice Command evaluation set.

Model                                   PPL     WER
No NNLM                                 -       10.10
Server-pretrained NNLM (no adapt)       420.4   9.89
FL-finetuned NNLM (all utts)            8.1     9.06 (-8.4%)
FL-finetuned NNLM (hard thres. utts)    8.0     8.82 (-10.8%)
FL-finetuned NNLM (utt weighted)        8.0     9.00 (-9.0%)
FL-finetuned NNLM (token weighted)      8.0     8.93 (-9.7%)

We also evaluate the performance of NNLM adaptation without pre-training, that is, training from scratch on the in-domain data in the FL setting. From the results in Table 4, we see that training from scratch incurs around a 1% relative WER degradation compared to fine-tuning in the FL setting. It is thus beneficial to start from a pre-trained NNLM before on-device adaptation.

Table 4: Results on the Conversation evaluation set without pre-training on NNLMs.

Model                                              PPL     WER
FL-finetuned NNLM (all utts)                       32.0    7.79
FL-finetuned NNLM (token weighted)                 30.3    7.75
FL-trained-from-scratch NNLM (all utts)            35.8    7.85
FL-trained-from-scratch NNLM (token weighted)      34.5    7.82

3.4 Privacy Analysis

Our privacy analysis is performed in the framework of Rényi DP [34], which is a natural relaxation of DP based on the Rényi divergence. It is well-suited for expressing guarantees of privacy-preserving approaches and for composition of heterogeneous mechanisms.

In this analysis, the $L_2$ norm clip is set to 0.5, and the noise multiplier, i.e., the ratio of the standard deviation of the Gaussian noise to the $L_2$ sensitivity, is set to 0.2, 0.5, and 1.5 in our experiments. We set the target $\delta$ to 1e-5 and compute the corresponding value of $\epsilon$.
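As a worked example of this accounting, the sketch below converts the Gaussian mechanism's Rényi DP guarantee into an $(\epsilon, \delta)$ pair by minimizing over orders [34]. It composes over the 1,000 FL rounds of our setup but ignores amplification by client subsampling, so it yields a loose upper bound rather than a reproduction of the $\epsilon$ values reported in Table 5.

import numpy as np

def epsilon_from_rdp(noise_multiplier, rounds, delta=1e-5):
    # Per-round RDP of the Gaussian mechanism at order alpha is
    # alpha / (2 * sigma^2); RDP composes additively over rounds, and
    # eps = RDP(alpha) + log(1/delta) / (alpha - 1) for any alpha > 1.
    orders = np.arange(1.25, 256.0, 0.25)
    rdp = rounds * orders / (2.0 * noise_multiplier ** 2)
    eps = rdp + np.log(1.0 / delta) / (orders - 1.0)
    return eps.min()

for nm in (0.2, 0.5, 1.5):   # the noise multipliers studied above
    print(nm, epsilon_from_rdp(nm, rounds=1000))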

Table 5 displays the privacy analysis results for FL-based adapted NNLMs using all or token-level weighted transcripts. As expected, the lower the value of $\epsilon$, the larger the PPL and WER. It is worth noting that these models still outperform the server-trained NNLM without domain adaptation, although the margins become smaller than those obtained without strong privacy guarantees.

Table 5: Privacy analysis on the Conversation evaluation set.

Model                                 $(\epsilon,\delta)$-DP   PPL    WER
FL-finetuned NNLM (all utts)          (248.6, 1e-5)            46.1   7.88
                                      (14.2, 1e-5)             50.4   7.91
                                      (0.9, 1e-5)              56.7   7.92
FL-finetuned NNLM (token weighted)    (248.6, 1e-5)            45.9   7.85
                                      (14.2, 1e-5)             50.4   7.89
                                      (0.9, 1e-5)              56.3   7.93

4 Conclusion

In this paper, we introduce an NNLM adaptation approach for ASR in the FL setting. In particular, we leverage confidence scoring models to adjust the NNLM training objective accordingly. Experiments show that, compared with no adaptation, the presented method obtains modest WER reductions on two speech datasets. We also perform a privacy analysis of the proposed approach using DP. Future work includes exploring the personalization of NNLMs in an FL framework.

References

  • [1] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, “Recurrent neural network based language model,” in Proc. Interspeech, 2010.
  • [2] X. Chen, X. Liu, M. J. Gales, and P. C. Woodland, “Improving the training and evaluation efficiency of recurrent neural network language models,” in Proc. ICASSP, 2015.
  • [3] H. Xu, K. Li, Y. Wang, J. Wang, S. Kang, X. Chen, D. Povey, and S. Khudanpur, “Neural network language modeling with letter-based features and importance sampling,” in Proc. ICASSP, 2018.
  • [4] K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language modeling with deep transformers,” in Proc. Interspeech, 2019.
  • [5] X. Liu, Y. Wang, X. Chen, M. J. Gales, and P. C. Woodland, “Efficient lattice rescoring using recurrent neural network language models,” in Proc. ICASSP, 2014.
  • [6] H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, and S. Khudanpur, “A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition,” in Proc. ICASSP, 2018.
  • [7] K. Li, D. Povey, and S. Khudanpur, “A parallelizable lattice rescoring strategy with neural language models,” in Proc. ICASSP, 2021.
  • [8] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006.
  • [9] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
  • [10] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, 2016.
  • [11] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. ICASSP, 2018.
  • [12] S. Kim, Y. Shangguan, J. Mahadeokar, A. Bruguier, C. Fuegen, M. L. Seltzer, and D. Le, “Improved neural language model fusion for streaming recurrent neural network transducer,” in Proc. ICASSP, 2021.
  • [13] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang et al., “Streaming end-to-end speech recognition for mobile devices,” in Proc. ICASSP, 2019.
  • [14] J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for on-device intelligence,” arXiv preprint arXiv:1610.02527, 2016.
  • [15] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” NeurIPS Workshop on Private Multi-Party Machine Learning, 2016.
  • [16] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in AISTATS, 2017.
  • [17] F. Chen, M. Luo, Z. Dong, Z. Li, and X. He, “Federated meta-learning with fast convergence and efficient communication,” arXiv preprint arXiv:1802.07876, 2018.
  • [18] K. C. Arnold, K. Z. Gajos, and A. T. Kalai, “On suggesting phrases vs. predicting words for mobile text composition,” in Proceedings of the 29th Annual Symposium on User Interface Software and Technology, 2016.
  • [19] S. Ji, S. Pan, G. Long, X. Li, J. Jiang, and Z. Huang, “Learning private neural language modeling with attentive aggregation,” in IJCNN. IEEE, 2019, pp. 1–8.
  • [20] D. Leroy, A. Coucke, T. Lavril, T. Gisselbrecht, and J. Dureau, “Federated learning for keyword spotting,” in Proc. ICASSP, 2019.
  • [21] J. Xu, B. S. Glicksberg, C. Su, P. Walker, J. Bian, and F. Wang, “Federated learning for healthcare informatics,” Journal of Healthcare Informatics Research, pp. 1–19, 2020.
  • [22] D. Dimitriadis, K. Kumatani, R. Gmyr, Y. Gaur, and S. E. Eskimez, “A federated approach in training acoustic models.” in Proc. Interspeech, 2020.
  • [23] D. Guliani, F. Beaufays, and G. Motta, “Training speech recognition models with federated learning: A quality/cost framework,” in Proc. ICASSP, 2021.
  • [24] X. Cui, S. Lu, and B. Kingsbury, “Federated acoustic modeling for automatic speech recognition,” in Proc. ICASSP, 2021.
  • [25] H. Jiang, “Confidence measures for speech recognition: A survey,” Speech Communication, 2005.
  • [26] P.-S. Huang, K. Kumar, C. Liu, Y. Gong, and L. Deng, “Predicting speech recognition confidence using deep learning with word identity and score features,” in Proc. ICASSP, 2013.
  • [27] K. Kalgaonkar, C. Liu, Y. Gong, and K. Yao, “Estimating confidence scores on ASR results using recurrent neural networks,” in Proc. ICASSP, 2015.
  • [28] S. Xie and L. Chen, “Evaluating unsupervised language model adaptation methods for speaking assessment,” in Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 2013, pp. 288–292.
  • [29] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference. Springer, 2006, pp. 265–284.
  • [30] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
  • [31] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečnỳ, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” in Proc. ICLR, 2021.
  • [32] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, “Learning differentially private recurrent language models,” in Proc. ICLR, 2018.
  • [33] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, “Character-aware neural language models,” in Proc. AAAI, 2016.
  • [34] I. Mironov, “Rényi differential privacy,” in 30th Computer Security Foundations Symposium (CSF). IEEE, 2017, pp. 263–275.