
Speaker Diarization as a Fully Online Learning Problem in MiniVox

Abstract

We proposed a novel machine learning framework to conduct real-time multi-speaker diarization and recognition without prior registration and pretraining in a fully online learning setting. Our contributions are two-fold. First, we proposed a new benchmark to evaluate the rarely studied fully online speaker diarization problem. We built upon existing datasets of real-world utterances to automatically curate MiniVox, an experimental environment which generates infinite configurations of continuous multi-speaker speech streams. Second, we considered the practical problem of online learning with episodically revealed rewards and introduced a solution based on semi-supervised and self-supervised learning methods. Additionally, we provided a workable web-based recognition system which interactively handles the cold-start problem of adding new users by transferring representations of old arms to new ones with an extendable contextual bandit. We demonstrated that our proposed method obtained robust performance in the online MiniVox framework. The web-based application of the real-time system can be accessed at https://www.baihan.nyc/viz/VoiceID/ (as in [1]); the code for benchmark evaluation can be accessed at https://github.com/doerlbh/MiniVox.

Index Terms—  Speaker diarization, online learning, semi-supervised learning, self-supervision, contextual bandit

1 Introduction

Speaker recognition involves two essential steps: registration and identification [2]. In laboratory settings, the state-of-the-art approaches usually emphasize the registration step with deep networks [3] trained on large-scale speaker profile datasets [4]. However, in real life, requiring all users to complete voiceprint registration before a multi-speaker teleconference is hardly a preferable way of system deployment. To deal with this challenge, speaker diarization is the task of partitioning an audio stream into homogeneous segments according to speaker identity [5]. Recent advancements have enabled (1) contrastive audio embedding extractions such as Mel Frequency Cepstral Coefficients (MFCC) [6], i-vectors [7] and d-vectors [8]; (2) effective clustering modules such as Gaussian mixture models (GMM) [9], mean shift [10], K-means and spectral clustering [8] and supervised Bayesian non-parametric methods [11, 12]; and (3) reasonable resegmentation modules such as Viterbi and factor analysis subspace [13]. In this work, we proposed a new paradigm that treats speaker diarization as a fully online learning problem of the speaker recognition task: it combines embedding extraction, clustering and resegmentation into a single online decision-making problem.

Why is this online learning problem different? The state-of-the-art speaker diarization systems usually require large datasets to train their audio embedding extraction and clustering modules, especially the ones with deep neural networks and Bayesian nonparametric models. In many real-world applications in developing countries, however, the training set can be limited and hard to collect. Since these modules are pretrained, applying them to out-of-distribution environments can be problematic. For instance, an intelligent system trained on data from elderly American speakers might find it hard to generalize to a diarization task with Japanese children, because both the acoustic and contrastive features are different. To tackle this problem, we want the system to learn continually. To push this problem to the extreme, we are interested in a fully online learning setting, where not only are the examples available one by one, but the agent also receives no pretraining from any training set before deployment and learns to detect speaker identity on the fly through reward feedback. To the best of our knowledge, this work is the first to consider diarization as a fully online learning problem. Through this work, we aim to understand the extent to which diarization can be solved as merely an online learning problem and whether traditional online learning algorithms (e.g. contextual bandits [14, 15, 16]) can be a practical solution.

What is a preferable online speaker diarization system? A preferable AI engine for such a realistic speaker recognition and diarization system should (1) not require user registration before its deployment, (2) allow new users to be registered into the system in real time, (3) transfer voiceprint information from old users to new ones, and (4) be up and running without pretraining on large amounts of data in advance. While attractive, assumption (4) introduces an additional caveat: the labeling of user profiles happens purely on the fly, trading models pretrained on big data for a user who interacts directly with the system and provides labels by correcting the agent. To tackle these challenges, we formulated this problem as an interactive learning model with cold-start arms and episodically revealed rewards (users can either reveal no feedback, approve by not intervening, or correct the agent).

Why do we need a new benchmark? Traditional datasets for the speaker diarization task are limited: CALLHOME American English [17] and NIST RT-03 English CTS [18] contain a limited number of utterances recorded under controlled conditions. For online learning experiments, a learn-from-scratch agent usually needs a long data stream to reach a comparable result. Large-scale speaker recognition datasets like VoxCeleb [4, 19] and Speakers in the Wild (SITW) [20] contain thousands of speaker utterances recorded in various challenging multi-speaker acoustic environments, but they are usually only used to pretrain diarization embeddings. In this work, we proposed a new benchmark called MiniVox, which can transform any large-scale speaker identification dataset into infinitely long online audio streams with various configurations.

To the best of our knowledge, this is the first approach to apply the Bandit problem to the speaker diarization task. We built upon the Linear Upper Confidence Bound algorithm (LinUCB) [21] and proposed a semi-supervised learning variant to account for the fact that the rewards are entirely missing in many episodes. For each episode without feedback, we applied a self-supervision process to assign a pseudo-action upon which the reward mapping is updated. Finally, we generated new arms by transferring learned arm parameters to similar profiles given user feedback.

Fig. 1: The arm expansion process of the bandit agents.

2 Background: the Bandit Problem

In the online learning setting, the data become available in a sequential order and are later used to update the best predictor for future data or the reward associated with the data features. In many cases, the reward feedback is the only source from which the online learning agent can effectively learn from its sequential past experience. This problem is especially important in the field of sequential decision making, where the agent must choose the best possible action to perform at each step to maximize the cumulative reward over time. One key challenge is to obtain an optimal trade-off between the exploration of new actions and the exploitation of the possible reward mapping from known actions. This framework is usually formulated as the Bandit problem, where each arm of the bandit corresponds to an unknown (but usually fixed) reward probability distribution [22], and the agent selects an arm to play at each round, receives a reward feedback, and updates accordingly. An especially useful variant of the Bandit is the Contextual Bandit, where at each step the agent observes an $N$-dimensional context, or feature vector, before selecting an action. Theoretically, the ultimate goal of the Contextual Bandit is to learn the relationship between the rewards and the context vectors so as to make better decisions given the context [23].

3 The Fully Online Learning Problem

Algorithm 1 presents, at a high level, the problem setting of our interactive learning system for speaker diarization, where $\textbf{x}(t) \in \mathbb{R}^d$ is a vector describing the context $C$ at time $t$, $r_{a,t}(t) \in [0,1]$ is the reward of action $a$ at time $t$, and $\textbf{r}(t) \in [0,1]^N$ denotes a vector of rewards for all arms at time $t$. $\mathbb{P}_{x,r}$ denotes a joint probability distribution over $(\textbf{x}, \textbf{r})$, and $\pi: C \rightarrow A$ denotes a policy. Unlike the traditional setting, in Step 5 the rewards are revealed in an episodic fashion (i.e., sometimes there is feedback of the reward being 0 or 1, and sometimes there is no feedback of any kind). We consider our setting online semi-supervised learning [24], where agents continually learn from both labeled and unlabeled data.

Algorithm 1 Online Learning with Episodic Rewards
1:  for $t = 1, 2, 3, \cdots, T$ do
2:    $(\textbf{x}(t), \textbf{r}(t))$ is drawn according to $\mathbb{P}_{x,r}$
3:    Context $\textbf{x}(t)$ is revealed to the player
4:    Player chooses an action $a_t = \pi_t(\textbf{x}(t))$
5:    Feedback $r_{a_t,t}(t)$ for the arm $a_t$ is episodically revealed
6:    Player updates its policy $\pi_t$
7:  end for
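To make the interaction protocol concrete, the following Python sketch illustrates the loop of Algorithm 1; the environment and policy objects are hypothetical placeholders (any MiniVox-style stream and any contextual bandit agent would fit this interface), so this is a minimal sketch of the protocol rather than the exact implementation.

import numpy as np

def run_online_learning(environment, policy, T):
    """Interaction loop of Algorithm 1: contexts arrive one by one,
    rewards are only revealed episodically."""
    cumulative_reward = 0.0
    for t in range(T):
        x_t, r_t = environment.draw()            # (x(t), r(t)) drawn from P_{x,r}
        a_t = policy.choose(x_t)                 # player picks an arm given the context
        feedback = environment.reveal(a_t, r_t)  # None when no feedback is revealed this round
        policy.update(x_t, a_t, feedback)        # update even when feedback is missing
        cumulative_reward += r_t[a_t]            # true reward, used for evaluation only
    return cumulative_reward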
Fig. 2: (A) The flowchart of the speaker diarization task as a Fully Online Learning problem and (B) the MiniVox Benchmark.

4 Proposed Online Learning Solution

4.1 Contextual Bandits with Extendable Arms

In an ideal online learning scenario without an oracle, we start with a single arm, and new arms are generated as new labels arrive. This problem is loosely modelled by bandits with infinitely many arms [25]. For our specific application to the speaker registration process, we applied the arm expansion process outlined in Figure 1: starting from a single arm (for the “new” action), whenever a feedback confirms a new addition, a new arm is initialized and stored (more details on the handling of growing arms can be found in Section 4.4).
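A minimal sketch of this arm-expansion bookkeeping is shown below; we assume each arm stores LinUCB-style statistics (a covariance matrix A and a reward-mapping vector b, matching the notation of Algorithm 2), which is an illustration rather than the exact code of our system.

import numpy as np

class ExtendableArms:
    """Grow the arm set on the fly; each arm stores LinUCB-style statistics."""
    def __init__(self, d):
        self.d = d
        self.A = []      # per-arm covariance matrices
        self.b = []      # per-arm reward-mapping vectors
        self.add_arm()   # start with a single arm (the "new" action)

    def add_arm(self, source=None):
        """Open a new arm, optionally warm-started from an existing arm's parameters."""
        if source is None:
            self.A.append(np.eye(self.d))
            self.b.append(np.zeros(self.d))
        else:
            self.A.append(self.A[source].copy())
            self.b.append(self.b[source].copy())
        return len(self.A) - 1   # index of the new arm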

4.2 Episodically Rewarded LinUCB

We proposed Background Episodically Rewarded LinUCB (BerlinUCB [26]), a semi-supervised and self-supervised online contextual bandit which updates the context representation and the reward mapping separately depending on whether the feedback is present or missing. As in Algorithm 2, Steps 1 through 12 of BerlinUCB are the same as the standard LinUCB algorithm [21]; for the case of a missing reward, we introduced Steps 13 through 20 as the alternative strategy. We assume that: (1) when feedback is available, it is genuine and assigned by the oracle; and (2) when the feedback is missing (not revealed by the background), it is either because the action is preferred (no intervention required by the oracle, i.e., with an implied default reward), or because the oracle did not have a chance to respond or intervene (i.e., with an unknown reward). In particular, in Step 15, when there is no feedback, we assign the context $\textbf{x}_t$ to a class $a'$ (an action arm) via self-supervision given the previously labelled context history (Section 4.3). Since we do not have the actual label for this context, we only update the reward mapping parameter $\textbf{b}_{a'}$ and leave the covariance matrix $\textbf{A}_{a'}$ untouched. This additional usage of unlabelled data (or unrevealed feedback) is especially important in our model.

Algorithm 2 BerlinUCB
1:  Initialize $c_t \in \mathbb{R}_+$, $\textbf{A}_a \leftarrow \textbf{I}_d$, $\textbf{b}_a \leftarrow \textbf{0}_{d \times 1}$ for all $a \in \mathcal{A}_t$
2:  for $t = 1, 2, 3, \cdots, T$ do
3:    Observe features $\textbf{x}_t \in \mathbb{R}^d$
4:    for all $a \in \mathcal{A}_t$ do
5:      $\hat{\theta}_a \leftarrow \textbf{A}_a^{-1} \textbf{b}_a$
6:      $p_{t,a} \leftarrow \hat{\theta}_a^{\top} \textbf{x}_t + c_t \sqrt{\textbf{x}_t^{\top} \textbf{A}_a^{-1} \textbf{x}_t}$
7:    end for
8:    Choose arm $a_t = \arg\max_{a \in \mathcal{A}_t} p_{t,a}$
9:    if the background revealed the feedback then
10:     Observe feedback $r_{a_t,t}$
11:     $\textbf{A}_{a_t} \leftarrow \textbf{A}_{a_t} + \textbf{x}_t \textbf{x}_t^{\top}$
12:     $\textbf{b}_{a_t} \leftarrow \textbf{b}_{a_t} + r_{a_t,t} \textbf{x}_t$
13:   else if the background revealed NO feedback then
14:     if using self-supervision feedback then
15:       $r' = [\,a_t == \text{predict}(\textbf{x}_t)\,]$  % clustering modules
16:       $\textbf{b}_{a_t} \leftarrow \textbf{b}_{a_t} + r' \textbf{x}_t$
17:     else  % ignore self-supervision signals
18:       $\textbf{A}_{a_t} \leftarrow \textbf{A}_{a_t} + \textbf{x}_t \textbf{x}_t^{\top}$
19:     end if
20:   end if
21: end for
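The following Python sketch is one way to implement Algorithm 2; it is a minimal illustration under our assumptions, not the exact code used for the experiments. The self_supervise argument stands for any clustering module from Section 4.3 that returns a predicted arm index for a context, and the arm set is kept fixed here for brevity (Section 4.4 describes how arms grow).

import numpy as np

class BerlinUCB:
    def __init__(self, d, n_arms, c=1.0, self_supervise=None):
        self.c = c
        self.A = [np.eye(d) for _ in range(n_arms)]     # A_a <- I_d
        self.b = [np.zeros(d) for _ in range(n_arms)]   # b_a <- 0
        self.self_supervise = self_supervise            # optional clustering module

    def choose(self, x):
        """Steps 4-8: pick the arm with the highest upper confidence bound."""
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a
            scores.append(theta @ x + self.c * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, x, a, reward):
        if reward is not None:                           # Steps 10-12: feedback revealed
            self.A[a] += np.outer(x, x)
            self.b[a] += reward * x
        elif self.self_supervise is not None:            # Steps 14-16: pseudo-reward from clustering
            r_pseudo = 1.0 if self.self_supervise(x) == a else 0.0
            self.b[a] += r_pseudo * x                    # update the reward mapping only
        else:                                            # Steps 17-18: ignore self-supervision
            self.A[a] += np.outer(x, x)                  # update the covariance only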

4.3 Self-Supervision and Semi-Supervision Modules

We construct our self-supervision modules based on the cluster assumption of the semi-supervision problem: points within the same cluster are more likely to share a label. As shown in many works on modern speaker diarization, clustering algorithms like GMM [9] and spectral clustering [8] are powerful unsupervised modules, especially in their offline versions. Their online variants, however, often perform poorly [12]. Nonetheless, we chose three popular clustering algorithms as the self-supervision: GMM, K-means and K-nearest neighbors (KNN), all in their online versions.
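As one concrete instantiation of the predict($\textbf{x}_t$) call in Algorithm 2, a minimal online K-means module is sketched below; the simple centroid update, learning rate and random initialization are our assumptions of a reasonable online variant, not the paper's exact module.

import numpy as np

class OnlineKMeans:
    """Incremental K-means: assign to the nearest centroid, then nudge it toward x."""
    def __init__(self, n_clusters, d, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.normal(size=(n_clusters, d))
        self.lr = lr

    def predict(self, x):
        """Return the index of the nearest centroid (used as the pseudo-label)."""
        return int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))

    def partial_fit(self, x):
        """Online update: move the winning centroid a small step toward x."""
        k = self.predict(x)
        self.centroids[k] += self.lr * (x - self.centroids[k])
        return k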

4.4 Complete Engine for Online Speaker Diarization

To adapt our BerlinUCB algorithm to the specific application of speaker recognition, we first define our actions. There are three major classes of actions: an arm “New” to denote that a new speaker is detected, an arm “No Speaker” to denote that no one is speaking, and $N$ different arms “User n” to denote that user n is speaking. Table 1 presents the reward assignment given the four types of feedback. Note that we assume that when the agent correctly identifies the speaker (or no speaker), the user (as the feedback dispenser) sends no feedback to the system by doing nothing. In other words, in an ideal scenario where the agent does a perfect job by correctly identifying the speaker all the time, we no longer need to be around to correct it (i.e., it is truly feedback free). As we pointed out earlier, this can be a challenge early on, because other than implicitly approving the agent’s choice, receiving no feedback could also mean the feedback was not revealed properly (e.g. the human oracle took a break). Furthermore, we note that when the “No Speaker” and “User n” arms are correctly identified, there is no feedback from us, the human oracle (meaning that these arms would never learn from a single positive reward if we did not use the “None” feedback iterations at all!). The semi-supervision by self-supervision step is tailored exactly for a scenario like this, where the lack of revealed positive reward for the “No Speaker” and “User n” arms is compensated by additional training of the reward mapping $\textbf{b}_{a_t}$ whenever the context $\textbf{x}_t$ is assigned to the right arm.

To tackle the cold-start problem, the agent grows its arms in the following fashion: the agent starts with two arms, “No Speaker” and “New”; if it is actually a new speaker speaking, we have the following three conditions: (1) if “New” is chosen, the user approves this arm by giving it a positive reward (i.e. clicking on it) and the agent initializes a new arm called “User $N$” and updates $N = N+1$ (where $N$ is the number of registered speakers at the moment); (2) if “No Speaker” is chosen, the user disapproves this arm by giving it a zero reward (and clicking on “New” instead), while the agent initializes a new arm; (3) if one of the user arms is chosen (e.g. “User 5” is chosen while in fact a new person is speaking), the agent copies the wrong user arm’s parameters to initialize the new arm, since the voiceprint of the mistaken one might be beneficial to initialize the new user profile. In this way, we can transfer what has been learned for a similar context representation to the new arm. (Potential problems might occur if the number of users grows steadily through misclassifications. In future work, we will investigate possible branch pruning strategies and the processing of very sparse reward feedback.)
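As an illustration of the three cases above, the sketch below reuses the hypothetical ExtendableArms container from the Section 4.1 sketch to show how a new arm is opened and, in case (3), warm-started from the mistaken user arm; the arm indices NEW and NO_SPEAKER and the clicked_arm encoding are assumptions made only for this example, and the reward updates themselves are handled by the bandit as in Algorithm 2.

def register_feedback(arms, chosen_arm, clicked_arm, NEW=0, NO_SPEAKER=1):
    """Handle the three cold-start cases when a genuinely new speaker was talking."""
    if clicked_arm == NEW:
        if chosen_arm == NEW:
            # (1) agent chose "New" correctly: positive reward, open a fresh user arm
            return arms.add_arm()
        if chosen_arm == NO_SPEAKER:
            # (2) agent chose "No Speaker": zero reward, still open a fresh user arm
            return arms.add_arm()
        # (3) agent confused the new speaker with an existing user:
        #     warm-start the new arm from that user's (similar) voiceprint parameters
        return arms.add_arm(source=chosen_arm)
    return None  # not a new-speaker event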

Feedback type   (+, +)   (+, -)   (-, +)   None
New             r = 1    r = 0    -        Alg. 2, Step 13
No Speaker      -        r = 0    r = 0    Alg. 2, Step 13
User n          -        r = 0    r = 0    Alg. 2, Step 13
Table 1: Possible algorithm routes given no feedback, or a feedback telling the agent that the correct label is $a^*$. (+, +) means the agent guessed right by choosing the correct arm; (+, -) means the agent chose this arm incorrectly, since the correct one is another arm; (-, +) means the agent did not choose this arm, while it turned out to be the correct one. “-” marks scenarios that are not applicable.

5 Benchmark Description: MiniVox

MiniVox is an automatic framework to transform any speaker-labelled dataset into a continuous speech data stream with episodically revealed label feedback. Since our online learning problem setting assumes learning voiceprints without any previous training data at all, MiniVox’s flexibility in length and configuration is especially important. As outlined in Figure 2, MiniVox has a straightforward data stream generation pipeline: given a pool of single-speaker-annotated utterances, randomly concatenate multiple pieces with a chosen number of speakers and a desired length. The reward stream is then sparsified with a parameter $p$, the percentage of time a feedback is revealed.
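A minimal sketch of this generation pipeline is shown below; the utterance_pool dict mapping each speaker to a list of per-frame feature arrays is a hypothetical input format assumed for illustration.

import numpy as np

def generate_minivox_stream(utterance_pool, n_speakers, target_len, p, seed=0):
    """Concatenate single-speaker utterances into one stream and sparsify feedback."""
    rng = np.random.default_rng(seed)
    speakers = rng.choice(list(utterance_pool), size=n_speakers, replace=False)
    frames, labels = [], []
    while sum(len(f) for f in frames) < target_len:
        spk = rng.choice(speakers)                                  # pick a speaker at random
        utt = utterance_pool[spk][rng.integers(len(utterance_pool[spk]))]
        frames.append(utt)                                          # (n_frames, feature_dim) array
        labels.extend([spk] * len(utt))
    stream = np.concatenate(frames)[:target_len]
    labels = np.array(labels)[:target_len]
    revealed = rng.random(target_len) < p                           # feedback revealed with probability p
    return stream, labels, revealed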

There are two scenarios that we can evaluate in MiniVox: if we assume there is an oracle, the online learning model is given the fixed number of speakers in the stream; if we assume there is no oracle, the online learning model starts with zero speakers and then gradually discovers and registers new speakers for future identification and diarization.

6 Empirical Evaluation

Fig. 3: Example reward curves where (a, b, c, d) BerlinUCB is the best; (e, f) the self-supervision is the best; (g, h) LinUCB is the best.

6.1 Experimental Setup and Metrics

We applied MiniVox on VoxCeleb [4] to generate three data streams with 5, 10 and 20 speakers to simulate real-world conversations. We extracted two types of features (more details in Section 6.2) and evaluated them in two scenarios (with or without an oracle). The reward streams are sparsified given a revealing probability of 0.5, 0.1 and 0.01. In summary, we evaluated our models in a combinatorial total of 3 speaker numbers $\times$ 3 reward-revealing probabilities $\times$ 2 feature types $\times$ 2 test scenarios $=$ 36 online learning environments. The online learning timescale ranges from ~12,000 to ~60,000 time frames, with a frame shift of 10 ms. To denote a specific MiniVox environment, in this paper we write, for example, “MiniVox C5-MFCC-60k” for an environment with 5 speakers spanning 60k time frames using MFCC features.
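For clarity, the 36 environment configurations can be enumerated as in the sketch below; the naming follows the notation above, and the pairing of MFCC with 60k frames and CNN with 12k frames is taken from Tables 2 and 3.

from itertools import product

# Enumerate the 36 MiniVox environments used in the evaluation;
# the name format "C{speakers}-{feature}-{length}" follows the paper's notation.
speakers = [5, 10, 20]
reveal_probs = [0.5, 0.1, 0.01]
features = {"MFCC": "60k", "CNN": "12k"}   # stream length (time frames) per feature type
oracle_modes = [True, False]

environments = [
    (f"C{n}-{feat}-{length}", p, oracle)
    for n, p, (feat, length), oracle in product(speakers, reveal_probs, features.items(), oracle_modes)
]
assert len(environments) == 36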

To evaluate performance in the above MiniVox environments, we reported the Diarization Error Rate (DER), the standard metric introduced in the NIST Rich Transcription 2009 (RT-09) evaluation. In addition, as a common metric in the online learning literature, we also recorded the cumulative reward: at each frame, if the agent correctly predicts the active speaker, the reward is counted as +1 (regardless of whether the agent observes it or not).
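A sketch of how both metrics can be tracked at the frame level is given below; the simple misattribution rate is only a frame-level proxy for DER under the MiniVox setup (single-speaker segments, no overlapped speech), not the official NIST scoring tool.

import numpy as np

def cumulative_reward(predictions, labels):
    """+1 whenever the agent names the active speaker correctly (whether observed or not)."""
    return int(np.sum(np.asarray(predictions) == np.asarray(labels)))

def frame_error_rate(predictions, labels):
    """Fraction of misattributed frames; a frame-level proxy for DER in MiniVox."""
    predictions, labels = np.asarray(predictions), np.asarray(labels)
    return float(np.mean(predictions != labels))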

We compared five agents. The baseline, LinUCB, is the contextual bandit with extendable arms proposed in Section 4.1. BerlinUCB is our standard contextual bandit model designed for sparse feedback, without the self-supervision modules. To test the effect of self-supervision, we introduced three clustering modules into BerlinUCB (Alg. 2, Step 15), denoted B-Kmeans, B-KNN (with K=5), and B-GMM, whose clustering modules are randomly initialized and updated online.

6.2 Feature Embeddings: MFCC and Neural Networks

We utilized two feature embeddings for our evaluation: MFCC [6] and a convolutional neural network (CNN). We used the same CNN architecture as the VGG-M [27] used in the VoxCeleb evaluation [4]. It takes the spectrogram of an utterance as input and generates a 1024-dimensional feature vector at layer fc8 (see Table 4 in [4] for details about this CNN).
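For reference, here is a hedged sketch of frame-level MFCC extraction with the librosa library; the toolkit choice, 25 ms analysis window and 13 coefficients are our assumptions for illustration, and only the 10 ms frame shift is stated in Section 6.1.

import librosa
import numpy as np

def mfcc_frames(wav_path, n_mfcc=13):
    """Return one MFCC vector per 10 ms frame, as an array of shape (frames, n_mfcc)."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),       # 25 ms analysis window (assumed)
        hop_length=int(0.010 * sr),  # 10 ms frame shift, matching Section 6.1
    )
    return mfcc.T.astype(np.float32)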

Why don’t we use more complicated embeddings? Although more complicated embedding extraction modules such as i-vectors [7] or d-vectors [8] can improve diarization, they require extensive pretraining on big datasets, which is contradictory to our problem setting and beyond our scope.

Why do we still include a pretrained CNN embedding? Indeed, if our end goal is to let the system learn from scratch without pretraining, why do we consider it in our evaluation? The CNN model was trained for the speaker verification task on VoxCeleb, and we are curious about the relationship between a learned representation and our online learning agents. That said, we are most interested in the performance given MFCC, because we aim to push the system to the extreme of not having pretraining of any kind before deployment.

Table 2: Diarization Error Rate (%) in MiniVox without Oracle
           MiniVox C5-MFCC-60k             MiniVox C5-CNN-12k
           p=0.5    p=0.1    p=0.01        p=0.5    p=0.1    p=0.01
BerlinUCB  71.81    80.03    82.38         17.42    32.03    65.16
LinUCB     74.74    78.71    79.30         17.81    32.73    58.98
B-Kmeans   82.82    79.15    77.39         28.83    63.67    82.58
B-KNN      78.71    80.62    77.39         28.36    82.58    82.58
B-GMM      85.32    83.41    87.67         99.61    99.61    99.69

           MiniVox C10-MFCC-60k            MiniVox C10-CNN-12k
           p=0.5    p=0.1    p=0.01        p=0.5    p=0.1    p=0.01
BerlinUCB  82.46    85.31    89.26         42.77    57.41    74.02
LinUCB     84.36    86.73    93.36         49.55    68.57    81.16
B-Kmeans   91.15    92.58    96.68         60.89    70.89    99.55
B-KNN      89.73    90.05    96.68         60.89    82.05    99.55
B-GMM      90.21    94.63    98.42         99.20    93.57    99.64

           MiniVox C20-MFCC-60k            MiniVox C20-CNN-12k
           p=0.5    p=0.1    p=0.01        p=0.5    p=0.1    p=0.01
BerlinUCB  88.62    87.02    92.79         41.72    59.06    83.28
LinUCB     91.35    88.94    88.46         51.56    83.52    74.84
B-Kmeans   95.19    95.99    96.96         72.03    75.31    99.53
B-KNN      93.43    95.99    96.79         72.03    74.06    99.53
B-GMM      92.79    96.31    97.76         87.73    81.09    83.28
Table 3: Diarization Error Rate (%) in MiniVox with Oracle
           MiniVox C5-MFCC-60k             MiniVox C5-CNN-12k
           p=0.5    p=0.1    p=0.01        p=0.5    p=0.1    p=0.01
BerlinUCB  74.89    77.24    86.93         17.27    22.19    66.02
LinUCB     72.83    78.12    76.80         17.73    32.73    58.98
B-Kmeans   75.33    78.27    83.11         20.55    40.70    58.98
B-KNN      77.39    77.97    83.99         20.47    41.33    58.98
B-GMM      74.16    76.21    77.24         52.58    81.02    58.98

           MiniVox C10-MFCC-60k            MiniVox C10-CNN-12k
           p=0.5    p=0.1    p=0.01        p=0.5    p=0.1    p=0.01
BerlinUCB  88.31    90.21    95.89         45.18    65.27    79.38
LinUCB     84.99    91.63    97.00         50.00    72.14    65.18
B-Kmeans   87.84    91.47    91.94         50.27    72.50    72.32
B-KNN      86.73    85.78    92.58         49.64    72.14    77.77
B-GMM      88.94    84.52    92.58         76.52    71.88    69.46

           MiniVox C20-MFCC-60k            MiniVox C20-CNN-12k
           p=0.5    p=0.1    p=0.01        p=0.5    p=0.1    p=0.01
BerlinUCB  92.31    94.55    96.31         58.75    68.98    88.83
LinUCB     89.10    93.43    95.67         53.44    70.47    83.44
B-Kmeans   92.95    95.67    96.96         55.16    70.86    94.06
B-KNN      91.83    92.47    97.44         54.30    89.84    96.72
B-GMM      95.19    91.99    97.44         86.48    77.97    96.64

6.3 Results

Given MFCC features without pretraining, our online learning agent demonstrated robust performance (Figure 3a,b,c,d): in most cases, it significantly outperformed the baseline. We note the overall high diarization error rates in all MFCC benchmarks: it is important to keep in mind that bandit feedback (correct or incorrect classification) makes the online speaker diarization problem significantly more challenging than standard supervised learning in offline speaker diarization, since in the bandit setting the true label is never revealed unless the classification is correct. Thus, the diarization error rate in the bandit online setting is expected to be much higher than in the supervised learning setting; this is not due to any inferiority of the bandit decision-making algorithm relative to other classifiers, but to the increased problem difficulty.

Learning without Oracle (Table 2). In both the MFCC and CNN MiniVox environments, we observed that BerlinUCB and its variants outperform the baseline most of the time. The discrepancy in performance between the MFCC and CNN environments can be explained by the innate difficulty of the two tasks: while the CNN embeddings are already well separated because they were pretrained with a contrastive loss [4], in the MFCC environments our online learning models need to learn from scratch both how to cluster and how to map rewards to features, while maintaining a good balance between exploitation and exploration.

Learning with Oracle (Table 3). Given the number of speakers, the online clustering modules appear to be more effective. However, the behaviors vary: we observed that B-GMM performed the poorest in the oracle-free environments but the best in some environments with an oracle; we also noted that, despite being the consistently best model in many oracle-free environments, the standard BerlinUCB was surpassed by the baseline and its self-supervised variants in a few MFCC cases with an oracle; and in certain challenging cases where the reward is sparsely revealed (p=0.01 or 0.1), the self-supervised variants improved the performance of BerlinUCB.

Is self-supervision useful? To our surprise, our benchmark results suggested that in most cases the proposed self-supervision modules did not improve upon our proposed contextual bandit model. Only in specific conditions (e.g. MiniVox C20-MFCC-60k with p=0.1 and an oracle) did the self-supervised contextual bandits outperform both the standard BerlinUCB and the baseline. Further investigation of the reward curves revealed more complicated interactions between the self-supervision modules and the online learning module (the contextual bandit): as shown in Figure 3e,f, B-GMM and B-KNN built upon the effective reward mapping of their BerlinUCB backbone and benefited from the unlabelled data points to yield fairly good performance.

7 Conclusion and outlook

We considered the novel problem of online learning for speaker diarization. We formulated the practical task as an interactive system that episodically receives sparse bandit feedback from users. During unlabelled episodes, we proposed to learn from pseudo-feedback generated by self-supervised modules enabled by clustering. We provided a benchmark to evaluate this task and demonstrated the empirical merit of the proposed methods over standard online learning algorithms. Ongoing work includes extending the online learning framework in both the extraction and clustering modules, branch management (e.g. routing [28]), and self-supervision with graph methods.

References

  • [1] Baihan Lin and Xinxin Zhang, “VoiceID on the fly: A speaker recognition system that learns from scratch,” in INTERSPEECH, 2020.
  • [2] Sreenivas Sremath Tirumala, Seyed Reza Shahamiri, Abhimanyu Singh Garhwal, and Ruili Wang, “Speaker identification features extraction methods: A systematic review,” Expert Syst. Appl., vol. 90, pp. 250–271, 2017.
  • [3] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in ICASSP. IEEE, 2018, pp. 5329–5333.
  • [4] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A large-scale speaker identification dataset,” in INTERSPEECH, 2017.
  • [5] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, “Speaker diarization: A review of recent research,” IEEE Trans Audio Speech Lang Process, vol. 20, no. 2, pp. 356–370, 2012.
  • [6] Md Rashidul Hasan, Mustafa Jamil, MGRMS Rahman, et al., “Speaker identification using mel frequency cepstral coefficients,” variations, vol. 1, no. 4, 2004.
  • [7] Stephen H Shum, Najim Dehak, Réda Dehak, and James Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Trans Audio Speech Lang Process, vol. 21, pp. 2015–2028, 2013.
  • [8] Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, “Speaker diarization with LSTM,” in ICASSP. IEEE, 2018, pp. 5239–5243.
  • [9] Z. Zajíc, M. Hrúz, and L. Müller, “Speaker diarization using convolutional neural network for statistics accumulation refinement,” in INTERSPEECH, 2017.
  • [10] Mohammed Senoussaoui, Patrick Kenny, Themos Stafylakis, and Pierre Dumouchel, “A study of the cosine distance-based mean shift for telephone speech diarization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, pp. 217–227, 2013.
  • [11] Emily Fox, Erik Sudderth, Michael Jordan, and Alan Willsky, “A sticky HDP-HMM with application to speaker diarization,” Ann. Appl. Stat., pp. 1020–1056, 2011.
  • [12] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang, “Fully supervised speaker diarization,” in ICASSP. IEEE, 2019, pp. 6301–6305.
  • [13] Gregory Sell and Daniel Garcia-Romero, “Diarization resegmentation in the factor analysis subspace,” in ICASSP. IEEE, 2015, pp. 4794–4798.
  • [14] John Langford and Tong Zhang, “The epoch-greedy algorithm for contextual multi-armed bandits,” in NIPS. Citeseer, 2007, pp. 817–824.
  • [15] Baihan Lin, Guillermo Cecchi, Djallel Bouneffouf, Jenna Reinen, and Irina Rish, “Unified models of human behavioral agents in bandits, contextual bandits and rl,” arXiv preprint arXiv:2005.04544, 2020.
  • [16] Baihan Lin, Djallel Bouneffouf, and Guillermo Cecchi, “Online learning in iterated prisoner’s dilemma to mimic human behavior,” arXiv preprint arXiv:2006.06580, 2020.
  • [17] A. Canavan, D. Graff, and G. Zipperlen, “CALLHOME American English Speech LDC97S42,” Web download, Philadelphia, PA, USA: Linguistic Data Consortium, 1997.
  • [18] Alvin Martin and Mark Przybocki, “The NIST 1999 speaker recognition evaluation—an overview,” Digital Signal Processing, vol. 10, no. 1-3, pp. 1–18, 2000.
  • [19] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman, “VoxCeleb: Large-scale speaker verification in the wild,” Comput. Sci. & Lang., 2019.
  • [20] Mitchell McLaren, Luciana Ferrer, Diego Castan, and Aaron Lawson, “The Speakers in the Wild (SITW) speaker recognition database,” in INTERSPEECH, 2016.
  • [21] Lihong Li, Wei Chu, John Langford, and Robert E Schapire, “A contextual-bandit approach to personalized news article recommendation,” in WWW, 2010.
  • [22] T. L. Lai and Herbert Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
  • [23] Shipra Agrawal and Navin Goyal, “Thompson sampling for contextual bandits with linear payoffs,” in ICML (3), 2013, pp. 127–135.
  • [24] B. Yver, “Online semi-supervised learning: Application to dynamic learning from radar data,” in RADAR, 2009.
  • [25] Donald A Berry, Robert W Chen, Alan Zame, David C Heath, and Larry A Shepp, “Bandit problems with infinitely many arms,” Ann. Stat., pp. 2103–2116, 1997.
  • [26] Baihan Lin, “Online semi-supervised learning in contextual bandits with episodic reward,” in Australasian Joint Conference on Artificial Intelligence, 2020.
  • [27] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in BMVC, 2014.
  • [28] Baihan Lin, Djallel Bouneffouf, Guillermo Cecchi, and Irina Rish, “Contextual bandit with adaptive feature extraction,” in 2018 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2018.