
Learning Adaptive Embedding Considering Incremental Class

Yang Yang, Zhen-Qiang Sun, HengShu Zhu, Yanjie Fu, Hui Xiong, and Jian Yang

Yang Yang and Jian Yang are with the Nanjing University of Science and Technology, Nanjing 210094, China. E-mail: yyang, [email protected]. Zhen-Qiang Sun is with the Nanjing Normal University, Nanjing 210023, China. E-mail: [email protected]. HengShu Zhu is with Baidu Talent Intelligence Center, Baidu Inc, Beijing 100000, China. E-mail: [email protected]. Yanjie Fu is with the Missouri University of Science and Technology, Rolla, MO 65401, USA. E-mail: [email protected]. Hui Xiong is with the Management Science and Information Systems Department, Rutgers Business School, Rutgers University, Newark, NJ 07102, USA. E-mail: [email protected]. Yang Yang and Jian Yang are also with PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology. Yang Yang is the corresponding author.
Abstract

Class-Incremental Learning (CIL) aims to train a reliable model on streaming data in which unknown classes emerge sequentially. Different from traditional closed set learning, CIL faces two main challenges: 1) Novel class detection. The initial training data contain only an incomplete set of classes, and the streaming test data will contain unknown classes. Therefore, the model needs not only to accurately classify the known classes, but also to effectively detect the unknown classes; 2) Model expansion. After novel classes are detected, the model needs to be updated without re-training on the entire previous data. However, traditional CIL methods have not fully considered these two challenges: first, they are always restricted to detecting a single novel class in each phase and suffer from the embedding confusion caused by unknown classes; besides, they ignore the catastrophic forgetting of known categories during model update. To this end, we propose a Class-Incremental Learning without Forgetting (CILF) framework, which aims to learn adaptive embeddings for processing novel class detection and model update in a unified framework. In detail, CILF regularizes classification with a decoupled prototype based loss, which significantly improves the intra-class and inter-class structure and consequently yields a compact embedding representation for novel class detection. Then, CILF employs a learnable curriculum clustering operator to estimate the number of semantic clusters by fine-tuning the learned network, in which the curriculum operator adaptively learns the embedding in a self-taught form. Therefore, CILF can detect multiple novel classes and mitigate the embedding confusion problem. Last, with the labeled streaming test data, CILF updates the network with robust regularization to mitigate catastrophic forgetting. Consequently, CILF is able to perform novel class detection and model update iteratively. We verify the effectiveness of our model on four streaming classification tasks, and empirical studies show the superior performance of the proposed method.

Index Terms:
Class-incremental learning, Novel class detection, Incremental Model Update, Open Environment

1 Introduction

Traditional closed set recognition (CSR) assumes that training and testing data are drawn from the same space, i.e., the same label and feature spaces, and various methods have achieved significant success in different applications [1, 2]. However, the real world is dynamically changing, and many applications are non-stationary: they receive data in streaming form, and many unknown classes emerge sequentially. For example, driverless cars need to identify unknown objects, face recognition needs to distinguish unseen personal pictures, and image retrieval often encounters new categories. This setting is defined as class-incremental learning (CIL) in the literature, which is more challenging and practical than CSR. As shown in Figure 1, CIL includes two key components: novel class detection (NCD) and incremental model update (IMU). The main difficulty of NCD is to effectively distinguish the known and unknown classes, i.e., the instances from novel classes during testing that are unknown in the training phase, as shown in Figure 1 (a). Meanwhile, after novel class detection, we also need to consider IMU, which aims to re-train the model with the newly labeled instances from the unknown classes, as shown in Figure 1 (b). Consequently, as the streaming test data continue to present novel classes, we need to conduct the NCD and IMU operations iteratively.

Figure 1: Schematic of class-incremental learning. Unknown categories occur with the streaming data: (a) the model first detects novel classes with the pre-trained model; (b) the model is then updated with newly labeled instances from the unknown classes, with no or limited examples from the known classes.

To address the NCD issue, zero-shot learning (ZSL) was first proposed [3, 4], which aims to classify instances from unknown categories by merely utilizing seen class examples and semantic information about the unknown classes. However, standard ZSL methods test only on unknown classes, rather than on both known and unknown classes. Thus, generalized zero-shot learning (GZSL) was proposed, which detects known and unknown classes simultaneously. For example, Lampert et al. introduced an attribute-based classification method, which detects new objects based on a high-level description in terms of semantic attributes [6]; Changpinyo et al. proposed a GZSL method via manifold learning, which aligns the semantic space with visual features [7]; Li et al. introduced the feature confusion GAN, which proposes a boundary loss to maximize the decision boundary between seen categories and unseen ones [8]. However, both ZSL and GZSL assume that semantic information (for example, attributes or descriptions) of the unknown classes is given, so they are limited to detection with prior knowledge and have no ability to detect incrementally.

Therefore, a more realistic setting is to detect unknown classes without any information, either instances or side-information, during training. Recent NCD approaches usually leverage powerful deep neural networks and can mainly be divided into two categories: discriminative and generative models. Discriminative models mainly utilize the powerful feature learning and prediction capabilities of deep models to design corresponding distance or prediction confidence measures. For example, Bendale and Boult proposed the OpenMax model, which trains a deep neural network with the normal SoftMax layer using a Weibull distribution fitting score [9], yet it fails to recognize adversarial images that are visually indistinguishable from training instances; Hassen and Chan learned a neural network based representation, which restricts the inter-class and intra-class distances during training, thus leaving larger spaces for novelty detection [10]. However, as shown in Figure 2 (a) and (b), it is notable that on simple visual datasets such as MNIST, known and unknown classes have strong separability under the trained model, so the distance measure is effective; whereas on more complex visual datasets such as CIFAR-10, unknown and known categories suffer from embedding confusion in the feature space, i.e., instances of unknown and known classes are mixed, so the performance of distance measure based methods greatly degrades. On the other hand, generative models mainly employ adversarial learning to generate instances that can fool the discriminative model, thus detecting the novel class. Ge et al. utilized a generative model to generate unknown class instances near the decision margin, which provides explicit probability estimation over the generated unknown classes [11]. However, as shown in Figure 2 (c) and (d), we reach conclusions similar to the discriminative case, i.e., generated instances on complex datasets such as CIFAR-10 are almost useless, which is also mentioned in [13]. Besides, most existing detection methods are limited to detecting a single novel class, i.e., they assume that only one novel class appears in each period.

Furthermore, after detection we need incremental model update (IMU) with the newly labeled instances of the novel classes. Different from re-training with all previous known data, IMU aims to re-train the model with no or only limited known data, which ensures the efficiency of the incremental update. Therefore, a big challenge for IMU is the catastrophic forgetting phenomenon [14], i.e., the knowledge learned from the previous task (classification of known classes) is lost when information relevant to the current task (classification of novel classes) is incorporated. To mitigate catastrophic forgetting, there have been many attempts, including replay-based methods, which explicitly re-train on stored examples while training on new tasks [15, 16], and regularization-based methods, which utilize an extra regularization term on outputs or parameters to consolidate previous knowledge [13, 17, 18].

[Figure 2: four t-SNE panels: (a) DM on MNIST; (b) DM on CIFAR-10; (c) GM on MNIST; (d) GM on CIFAR-10]

Figure 2: t-SNE of the discriminative model (DM) [17] and the generative model (GM) [13] on simple (MNIST) and complex (CIFAR-10) datasets. In detail, we train the two models with five classes (i.e., 0-4) in the training stage, then utilize the pre-trained models to obtain the feature embeddings of data from two unknown classes (i.e., 5, 6) and the known classes (i.e., 0-4) appearing in the testing stage; the t-SNE results are given in (a)-(d).

To this end, we propose a Class-Incremental Learning without Forgetting (CILF) framework, which aims to process novel class detection and model update iteratively. In detail, we first develop a novel decoupled prototype based network to train the known classes, which employs a constrained clustering loss to regularize the inter-class and intra-class structure. In testing, considering the emergence of single or multiple novel classes, we develop a curriculum operator for learning adaptive embeddings, which conducts learnable clustering from easy to hard instances and overcomes the embedding confusion. Then, with the limited memory data of known classes, CILF updates the network with robust regularization to mitigate catastrophic forgetting. In summary, the main contributions are as follows:

  • Propose the “Class-Incremental Learning without Forgetting” (CILF) framework, which considers both novel class detection and incremental model update;

  • Propose a novel decoupled prototype based network, which can conduct novel class detection and model update effectively;

  • Propose curriculum clustering operator for better multiple novel classes detection and robust regularization to mitigate catastrophic forgetting.

In the remaining sections of this paper, Section 2 introduces the related work, Section 3 presents the proposed method, and Section 4 evaluates it. Finally, the whole work is concluded in Section 5.

2 Related Work

Our work aims to detect novel classes in streaming data and to update the model with limited known data without forgetting. Therefore, our work is related to two directions: novel class detection and incremental model update.

Traditional novel class detection approaches mainly restrict the intra-class and inter-class distance properties of the training data, then detect novel classes by identifying outliers. For example, Da et al. developed an SVM-based method, which learns the concept of known classes while incorporating the structure presented in unlabeled data collected from the open set [19]; Mu et al. proposed to dynamically maintain two low-dimensional matrix sketches for detecting novel classes [20]. However, these approaches are difficult to apply in high-dimensional spaces considering the complex matrix operations involved. Recently, with the development of deep learning techniques, several studies have applied convolutional neural networks (CNNs) to the detection scenario. Hendrycks and Gimpel verified that a CNN trained on MNIST images can predict with high confidence (90%) on Gaussian noise instances, and that the softmax output probabilities can be used to distinguish known/unknown classes [21]. Furthermore, Liang et al. directly utilized temperature scaling or added small perturbations to separate the softmax score distributions between in- and out-of-distribution images [22]; Neal et al. introduced a novel augmentation technique, which adopts an encoder-decoder GAN architecture to generate synthetic instances similar to the known classes [13]; Wang et al. proposed a CNN-based prototype ensemble method, which adaptively updates the prototypes for robust detection [17]. Nevertheless, these methods are always limited to detecting a single novel class at a time. Therefore, Han et al. proposed an extended deep transfer embedded clustering method for multiple novel class detection [24]. Still, existing NCD methods usually achieve superior detection performance on simple datasets but are easily interfered with by embedding confusion on complex datasets.

Incremental learning is always applied to streaming data. In most situations, only a few examples from known classes/features/distributions are available in the beginning, and data with new classes/features/distributions emerge thereafter. Incremental learning methods aim to update the model from streaming data sequentially, using only the newly coming data and limited previous data, without re-training on all previous data [25]. As a matter of fact, incremental deep learning can be directly applied with online backpropagation, yet with one important drawback: catastrophic forgetting, the tendency to lose the learned knowledge of previous distributions (previously known classes/features/distributions). To solve this problem, there have been many attempts. For example, Rebuffi et al. stored a subset of examples per class, selected to best approximate the mean of each class in the feature space [15]; Lopez-Paz and Ranzato projected the estimated gradient direction onto the feasible region outlined by previous tasks through a first order Taylor series approximation [26]; Li and Hoiem utilized the output of the previous model as soft labels for previous tasks [27]; Kirkpatrick et al. proposed elastic weight consolidation to reduce catastrophic forgetting [28]; Lee et al. proposed to incrementally match the moments of the posterior distribution of the neural network [29].

3 Proposed Method

In this section, we formalize the problem of class-incremental learning with streaming data, and give the details of the proposed framework.

3.1 Problem Definition

Without any loss of generality, at the initial time we have a supervised training set $D^{0}=\{({\bf x}_{i}^{0},{\bf y}_{i}^{0})\}_{i=1}^{N}$, where ${\bf x}_{i}^{0}\in\mathbb{R}^{d}$ denotes the $i$-th instance, ${\bf y}_{i}^{0}\in Y^{0}=\{1,2,\cdots,C\}$ denotes the corresponding label, and the superscript 0 represents the initial time. Then, we receive non-stationary unlabeled testing data $D^{1}=\{{\bf x}_{j}\}_{j=1}^{N_{1}}$, where ${\bf x}_{j}\in\mathbb{R}^{d}$ denotes the $j$-th instance, the label ${\bf y}_{j}\in\hat{Y}=\{1,2,\cdots,C,C+1,\cdots,C+K^{1}\}$ is unknown, and $K^{1}$ is the number of unknown classes. Thus, novel class detection can be defined as:

Definition 1

Novel Class Detection (NCD). Given the initial training set $D^{0}=\{({\bf x}_{i}^{0},{\bf y}_{i}^{0})\}_{i=1}^{N}$, we aim to construct a model $f^{0}:X^{0}\rightarrow Y^{0}$. Then, with the pre-trained model $f^{0}$, novel class detection is to accurately classify the known and unknown classes in $D^{1}$.

On the other hand, it is notable that streaming data with novel classes have two characteristics: (1) Data window. At time window $t$, we only get the data of the current time window for detection, not the full stream; and (2) Novel class continuity. At time window $t$, only some novel classes appear, or even none. Therefore, we need to detect novel classes incrementally, i.e., NCD is performed every time after receiving the data of time window $t$ [30]. Specifically, the streaming test data $D$ can be denoted as $D=\{D^{t}\}_{t=1}^{T}$, where $D^{t}=\{{\bf x}_{j}^{t}\}_{j=1}^{N_{t}}$ contains $N_{t}$ unlabeled instances, and the underlying label ${\bf y}_{j}^{t}\in\hat{Y}^{t}$ is unknown, with $\hat{Y}^{t}=\hat{Y}^{t-1}\cup Y^{t}$, where $\hat{Y}^{t-1}$ is the set of cumulative known classes until the $(t-1)$-th time window, $Y^{t}$ is the set of new classes in the $t$-th window, and $\hat{Y}^{T}=\hat{Y}=\{1,2,\cdots,C,C+1,\cdots,C+K\}$. Therefore, we can give the definition of class-incremental learning:

Definition 2

Class-Incremental Learning (CIL). At time $t\in\{1,2,\cdots,T\}$, we have the pre-trained model $f^{t-1}$ and a finite stored instance set $M^{t-1}$ from the classes known until the $(t-1)$-th time, and receive streaming data $D^{t}$. First, we aim to classify known and unknown classes in $D^{t}$ as in Definition 1. Then, with the newly labeled data from the novel classes and the stored data $M^{t-1}$, we update the model while mitigating forgetting to acquire $f^{t}$. This process cycles until terminated.

Note that there exist two labeling schemes after novel class detection, i.e., manual labeling or self-taught labeling [20]. Following most approaches [30, 20, 17], we consider manual labeling, which avoids label noise accumulation and is more in line with real-world applications.

Figure 3: Overview of the CILF framework. The blue and orange dots are the initial training data for developing the deep network, while the gray dots are the unlabeled testing data of the $t$-th time window, received from the stream. With the trained deep network, CILF aims to classify the known and novel classes, then query the ground truths of the novel class instances for updating the network continuously.

3.2 CILF Framework

The main idea of CILF is to learn the feature embeddings such that instances exhibit distinguishing characteristics for label prediction, novel class detection, and subsequent model update over the non-stationary streaming data. Therefore, the most critical parts of CILF are: (1) feature embedding network; (2) novel class detection operator; and (3) model update mechanism.

  • Feature Embedding Network: With the initial training data, i.e., the blue and orange dots shown in Figure 3, the decoupled neural network is trained on the labeled initial data with the prototype based loss, which concerns the intra-class/inter-class structure and can be easily adapted for NCD;

  • Novel Class Detection Operator: At time $t$, we receive a set of unlabeled data from the stream, i.e., the gray dots shown in Figure 3, which includes known (blue and orange dots) and unknown (green dots) classes. The observed instance set $D^{t}$ is transformed through the learned network to obtain feature representations. Then we employ the pre-trained $f^{t-1}$ for the curriculum clustering operator, which detects multiple unknown classes from easy to difficult instances in a self-taught form;

  • Model Update Mechanism: After NCD, the true labels of instances from the novel classes are queried (partially or fully); then IMU is performed on $f^{t-1}$ with the newly labeled data, while regularizing the performance on the stored data from known classes to mitigate forgetting. The updated model is then used to classify further incoming instances along the stream.

This process is repeated until the end of the streaming data. Figure 3 illustrates the overall streaming data classification performed by the CILF framework, and Table I provides the definitions of the symbols used in this paper.

TABLE I: Description of symbols.
Sym.: Definition
$D^{0}=\{({\bf x}_{i}^{0},{\bf y}_{i}^{0})\}_{i=1}^{N}$: initial supervised training data
$D^{t}=\{{\bf x}_{j}^{t}\}_{j=1}^{N_{t}}$: set of unlabeled data at the $t$-th time window
$D_{new}^{t}$: set of labeled new class data at time $t$
$\hat{Y}^{t}$: label set at the $t$-th time window
$f^{t}$: trained model at the $t$-th time window
$M^{t}$: stored memory data of known classes until the $t$-th time window
$f^{t}({\bf x})$: feature embedding of instance ${\bf x}$
$p_{i}^{t}$: prediction of the $i$-th instance at time $t$
$\mu_{c}^{t}$: prototype of the $c$-th class at time $t$
$w_{j}^{t}$: weight of each instance at time $t$
$h(l)$: pacing function determining the number of selected instances in each mini-batch

3.3 Feature Embedding Network

Given the initial training data $D^{0}$, our primary objective is to build an effective model $f^{0}$ for subsequent classification. Recent research has demonstrated the effectiveness of deep models for feature embedding and subsequent tasks, so we employ deep models for building $f^{0}$, for example, convolutional neural networks for images. Importantly, the built deep model needs to consider two aspects: (1) Distance measure. The model needs to emphasize the exploitation of feature embeddings considering intra-class compactness and inter-class separability, thus leaving larger space for novel class detection; (2) Model scalability. The model needs to effectively learn novel class knowledge and incrementally update itself as novel class data emerge. However, traditional deep models using cross entropy cannot model the distance measure effectively and are difficult to update incrementally (the prediction layer is coupled to the fully connected layer and is difficult to expand). Consequently, we develop a decoupled deep embedding network with a prototype based loss to improve the inter-class and intra-class structure.

Particularly, for a given input ${\bf x}_{i}^{0}$, the output feature representation is denoted as $f^{0}({\bf x}_{i}^{0},\theta)$, where $\theta$ denotes the corresponding network parameters; we use the notation $f^{0}({\bf x}_{i}^{0})$ for clarity. Inspired by metric learning [31], the loss can be defined as:

L=L_{intra}+\lambda L_{inter} \qquad (1)

where $L_{intra}$ aims to pull data towards their neighbors from the same class, $L_{inter}$ pushes data away from different classes, and $\lambda$ is the balance parameter.

3.3.1 Intra-Class Compactness

$L_{intra}$ can be obtained by calculating the distance between each instance and the corresponding prototype; here we utilize the class center. It is similar to the cross entropy loss [32], i.e., $\sum_{i=1}^{N}-{\bf y}_{i}\log(g(f({\bf x}_{i},\theta)))$, where ${\bf y}_{i}$ is the ground truth of the $i$-th instance and $g(\cdot)$ denotes the fully connected layer with softmax function, but $L_{intra}$ depends on the feature output $f^{0}({\bf x}^{0})$. Consequently, we define the prototype-based cross entropy loss as follows:

L_{intra}=\sum_{i=1}^{N}\sum_{c=1}^{C}-y_{i_{c}}^{0}\log(p_{i_{c}}^{0}) \qquad (2)

where $p_{i_{c}}^{0}$ is the probability of ${\bf x}_{i}^{0}$ being classified as class $c$, which is negatively related to the distance between the instance and the prototype of the $c$-th class, i.e., the probability is larger if the distance is smaller, and vice versa. Thus $p_{i_{c}}^{0}\propto-\|f^{0}({\bf x}_{i}^{0})-\mu_{c}^{0}\|_{2}^{2}$, where $\mu_{c}^{0}$ is the representation of the $c$-th class prototype, and $p_{i_{c}}^{0}$ can be defined as follows:

p_{i_{c}}^{0}=\frac{\exp(-\alpha\|f^{0}({\bf x}_{i}^{0})-\mu_{c}^{0}\|_{2}^{2})}{\sum_{m=1}^{C}\exp(-\alpha\|f^{0}({\bf x}_{i}^{0})-\mu_{m}^{0}\|_{2}^{2})} \qquad (3)

where $C$ is the number of classes and $\alpha$ is a hyperparameter that controls the strength of the distance, similar to the large margin cross entropy [33]. Note that Eq. 3 minimizes the loss by maximizing the probability of ${\bf x}_{i}^{0}$ being associated with its prototype $\mu_{y_{i}}^{0}$. Moreover, it is crucial to initialize and update each class prototype effectively. Since the labels of the initial training data $\mathcal{D}^{0}$ are given, we use the output representation $f^{0}({\bf x}^{0})$ for each prototype initialization:

\mu_{c}^{0}=\frac{1}{|\pi_{c}|}\sum_{{\bf x}^{0}\in\pi_{c}}f^{0}({\bf x}^{0})

where $|\pi_{c}|$ is the size of the $c$-th class. On the other hand, the key idea of the prototype update is to anneal the clusters slowly so as to eliminate biased instances in each mini-batch. Thus we propose to smooth the annealing process via a temporal ensemble [34]:

\mu_{c}^{0_{e}}=\beta\mu_{c}^{0_{e-1}}+(1-\beta)\mu_{c}^{0_{e}} \qquad (4)

where $\beta$ is a momentum term controlling the ensemble, and $0_{e}$ indicates the $e$-th epoch of the initial training.
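To make this concrete, below is a minimal PyTorch sketch of the prototype-based intra-class loss (Eqs. 2-3) together with the temporal-ensemble prototype update (Eq. 4). The tensor layout, the names `embed` and `prototypes`, and running the update once per epoch are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of Eqs. 2-4; shapes: embed (B, d), labels (B,),
# prototypes (C, d). alpha and beta follow the paper's notation.
import torch
import torch.nn.functional as F

def intra_loss(embed, labels, prototypes, alpha=0.3):
    """Prototype-based cross entropy (Eq. 2) with Eq. 3 as the softmax."""
    dist = torch.cdist(embed, prototypes, p=2) ** 2   # squared distance to each prototype
    log_p = F.log_softmax(-alpha * dist, dim=1)       # Eq. 3 in log space
    return F.nll_loss(log_p, labels)                  # Eq. 2

@torch.no_grad()
def update_prototypes(prototypes, embed, labels, beta=0.8):
    """Temporal-ensemble smoothing of the prototypes (Eq. 4)."""
    for c in range(prototypes.size(0)):
        mask = labels == c
        if mask.any():                                # skip classes absent this epoch
            mu_new = embed[mask].mean(dim=0)
            prototypes[c] = beta * prototypes[c] + (1 - beta) * mu_new
    return prototypes
```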

3.3.2 Inter-Class Separability

The prototype-based cross entropy loss guarantees local intra-class compactness while neglecting inter-class separability. To make the projection of instances robust in the distance measure, $L_{inter}$ focuses on improving the global separation between different classes. Particularly, $L_{inter}$ aims to keep instances from the same class closer than those from different classes, i.e., $d(f^{0}({\bf x}_{i}^{0}),f^{0}({\bf x}_{p}^{0}))<d(f^{0}({\bf x}_{i}^{0}),f^{0}({\bf x}_{n}^{0}))$, where ${\bf x}_{i}^{0}$ and ${\bf x}_{p}^{0}$ share the same class, ${\bf x}_{n}^{0}$ is from a different class, and $d(f^{0}({\bf x}_{i}^{0}),f^{0}({\bf x}_{j}^{0}))$ is a metric function measuring distances in the embedding space; we use the notation $d_{i,j}$ for clarity. This is known as the triplet loss with a pre-specified margin value $m$, i.e., $[m+d_{a,p}-d_{a,n}]_{+}=\max\{0,m+d_{a,p}-d_{a,n}\}$. It is notable that the triplet loss always suffers from slow convergence, so triplet construction is central to improving performance. Inspired by [35], we consider hard triplets to fully explore multiple negative examples from different classes in each mini-batch, which can further improve the inter-class distances. As a result, a hard triplet is denoted as:

\Omega_{i}=({\bf x}_{i}^{0},{\bf x}_{p}^{0},{\bf x}_{n_{1}}^{0},\cdots,{\bf x}_{n_{C-1}}^{0}) \qquad (5)

where $C$ is the number of classes and the $C-1$ negative examples ${\bf x}_{n_{c}}^{0}$ come from the different classes. Eq. 5 can better consider the global inter-class distances. Thereby $L_{inter}$ can be defined as:

L_{inter}=\sum_{\Omega_{i}\in\mathbb{T}}\big[m+d_{i,p}-\min_{{\bf x}_{n_{c}}^{0}\in\Omega_{i}}d_{i,n_{c}}\big]_{+} \qquad (6)

where $\mathbb{T}$ denotes the hard triplet set. Here we utilize the Euclidean distance to evaluate the distance between two examples:

d_{i,j}=\|f^{0}({\bf x}_{i}^{0})-f^{0}({\bf x}_{j}^{0})\|_{2}^{2} \qquad (7)
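The hard triplet term can be mined directly within each mini-batch. Below is a hedged PyTorch sketch of $L_{inter}$ (Eqs. 5-7) in the common batch-hard form: for each anchor, the closest in-batch negative realizes the $\min$ over negatives in Eq. 6, while taking the hardest in-batch positive (rather than a fixed one) is an assumption of this sketch.

```python
# Hedged batch-hard sketch of Eqs. 5-7; embed (B, d), labels (B,).
import torch

def inter_loss(embed, labels, m=1.0):
    d = torch.cdist(embed, embed, p=2) ** 2                    # pairwise d_{i,j} (Eq. 7)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)          # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=embed.device)

    pos = d.masked_fill(~same | eye, float('-inf')).max(dim=1).values  # hardest positive
    neg = d.masked_fill(same, float('inf')).min(dim=1).values          # min over negatives (Eq. 6)

    valid = torch.isfinite(pos) & torch.isfinite(neg)          # anchors with both pos and neg
    if not valid.any():
        return embed.new_zeros(())
    return torch.clamp(m + pos[valid] - neg[valid], min=0).mean()      # [m + d_{i,p} - min d_{i,n}]_+
```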

Consequently, by optimizing Eq. 1 we can learn a discriminative feature embedding and boost the performance of classification and detection: (1) the prototype-based loss highlights the compactness of the representation, i.e., the intra-class structure becomes more compact and the inter-class structure more distant, a property well suited to distinguishing the known and unknown classes; and (2) the prototype-based loss operates on the output feature embedding, which is independent of the prediction layer. Therefore, it is easy to update the model and learn novel classes without expanding the model structure (prediction layer). The details are shown in Algorithm 1.

Algorithm 1 Feature Embedding Network
  • Input:
  • Data set: $D^{0}=\{({\bf x}_{i}^{0},{\bf y}_{i}^{0})\}_{i=1}^{N}$
  • Parameters: $\lambda$, $\alpha$; learning rate: $\eta$
  • Output:
  • Decoupled deep clustering network: $f^{0}$
1:  Initialize the model parameters $\theta$;
2:  Initialize the prototype $\mu$ for each class;
3:  while stop condition is not triggered do
4:     for each instance mini-batch do
5:        Calculate $L_{intra}$ according to Eq. 2;
6:        Calculate $L_{inter}$ according to Eq. 6;
7:        Calculate the loss $L=L_{intra}+\lambda L_{inter}$ according to Eq. 1;
8:        Update the model parameters using gradient descent;
9:     end for
10:     Update the prototype $\mu$ according to Eq. 4;
11:  end while

3.4 Novel Class Detection

Traditional closed-set methods predict the known classes of the training phase, in which the number of possible labels at testing is known and fixed. However, in the class-incremental setting, instances belonging to unknown classes may appear in the streaming test data; therefore, we need to distinguish the known and unknown classes. Specifically, we receive a set of unlabeled data $D^{t}$ at the $t$-th time, in which $K^{t}$ novel classes may occur, where $K^{t}\geq 0$. However, most current detection methods either assume that only one novel class appears per time window, i.e., $K^{t}=1$, or classify multiple novel classes as one super-class, which is impractical and inefficient to operate. To solve this problem, we fine-tune the deep clustering network $f^{t-1}$ of the last time window for multiple novel class detection. As shown in Figure 2, a key challenge is that instances of novel classes mix with the known classes in complex scenarios, leading to embedding confusion and greatly affecting the clustering, i.e., biased prototypes for known and unknown classes. To address this, we employ a learnable curriculum clustering operator, which conducts clustering from easy (distinguishable) to difficult (confused) instances via curriculum learning [36]. Consequently, we can acquire more reliable prototypes and novel class detection results.

In detail, considering that the model training in Algorithm 1 is entirely supervised whereas $D^{t}$ is unlabeled, we aim to discover novel classes in $D^{t}$ by unsupervised clustering, which fine-tunes the $f^{t-1}$ trained in phase $t-1$ with easy instances first and then clusters the mixed ones. We address this challenge by decomposing the learnable curriculum clustering into two closely related sub-tasks, as in curriculum learning. The first is a weighting function, which calculates the weight of each instance and initializes the prototypes with weighted k-means. The second is a pacing function, which determines the pace at which data are presented to fine-tune the model, thus conducting curriculum clustering.

3.4.1 Weighting Function

Inspired by [37], we evaluate the weight of each instance by a self-taught weighting function. In detail, we compute a confidence score for each instance ${\bf x}_{j}^{t}$ in $D^{t}$ using the existing model $f^{t-1}$. We first obtain the statistical confidence by applying the intra-class distance of Eq. 2, i.e., $u_{j}^{t}=\sum_{c=1}^{|\hat{Y}^{t-1}|}-y_{j_{c}}^{t-1}\log(p_{j_{c}}^{t})$. It is notable that $u_{j}^{t}$ is small for instances near a prototype and large for instances away from all class prototypes. Therefore, the weight of each instance can be denoted as $w_{j}^{t}=(u_{j}^{t}-\gamma)^{2}$, where $\gamma$ is a threshold parameter. Thereby the highly confident and the clearly unsure instances receive large weights, while the confusing ones receive low weights.

On the other hand, Algorithm 1 requires an initial setting for the prototypes $\mu_{c}^{t},c\in\hat{Y}^{t}$. Thus we initialize the prototypes by running a semi-supervised weighted k-means algorithm [38] that combines the unlabeled set $D^{t}$ and the pre-trained $\mu_{c}^{t-1}$. As a result, we obtain more robust initial prototypes:

\mu_{c}^{t}=\begin{cases}\beta\mu_{c}^{t-1}+(1-\beta)\frac{\sum_{{\bf x}_{j}^{t}\in\pi_{c}}w_{j}^{t}f^{t-1}({\bf x}_{j}^{t})}{\sum_{{\bf x}_{j}^{t}\in\pi_{c}}w_{j}^{t}}, & c\in\hat{Y}^{t-1},\\ \frac{\sum_{{\bf x}_{j}^{t}\in\pi_{c}}w_{j}^{t}f^{t-1}({\bf x}_{j}^{t})}{\sum_{{\bf x}_{j}^{t}\in\pi_{c}}w_{j}^{t}}, & c\in Y^{t}\end{cases} \qquad (8)

where $\pi_{c}$ is the set of the $c$-th class, in which the pseudo-label $\arg\max_{c}\{p_{j_{c}}^{t}\}$ of each instance can be calculated by Eq. 3. A sketch of the weighting function and this initialization is given below.
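The sketch below illustrates the weighting function and the weighted prototype initialization of Eq. 8 in PyTorch. The cluster assignments `assign` are assumed to come from the semi-supervised weighted k-means mentioned above; scoring each instance against its nearest prototype, and the names used here, are illustrative assumptions.

```python
# Hedged sketch of Section 3.4.1: w_j = (u_j - gamma)^2 and Eq. 8.
import torch
import torch.nn.functional as F

def instance_weights(embed, old_protos, alpha=0.3, gamma=1.0):
    """Confident and clearly-unsure instances get large weights,
    confusing mid-range instances get small ones."""
    dist = torch.cdist(embed, old_protos, p=2) ** 2
    log_p = F.log_softmax(-alpha * dist, dim=1)        # Eq. 3
    u = -log_p.max(dim=1).values                       # confidence score u_j
    return (u - gamma) ** 2

@torch.no_grad()
def init_prototypes(embed, weights, assign, old_protos, num_classes, beta=0.8):
    """Eq. 8: weighted cluster means, momentum-smoothed for known classes."""
    protos = []
    for c in range(num_classes):
        mask = assign == c                             # cluster pi_c
        w = weights[mask].unsqueeze(1)
        mu = (w * embed[mask]).sum(0) / w.sum().clamp(min=1e-8)
        if c < old_protos.size(0):                     # known class in \hat{Y}^{t-1}
            mu = beta * old_protos[c] + (1 - beta) * mu
        protos.append(mu)
    return torch.stack(protos)
```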

3.4.2 Pacing Function

A direct way to classify the known and unknown classes in $D^{t}$ is to fine-tune $f^{t-1}$ using all the instances. However, considering the embedding confusion, the initialized prototypes are biased and the pseudo-labels are noisy. If we randomly sampled batches from the full data to fine-tune the model, the embedding confusion would further corrupt the update of the prototypes and pseudo-labels. Therefore, rather than drawing a sequence of mini-batches uniformly from $D^{t}$ as in the most common training procedure, we sort the instances according to difficulty and present them from easier to harder for fine-tuning as the model capability increases, where $L$ is the number of batches.

In detail, the pacing function $h:[L]\rightarrow[N_{t}]$ is used to determine a sequence of subsets $\{B_{1},B_{2},\cdots,B_{L}\}\subseteq D^{t}$ of size $|B_{l}|=h(l)$. The $l$-th subset $B_{l}$ includes the first $h(l)$ elements of the instances, sorted by the scoring function in ascending order. Here, we utilize fixed exponential pacing, which has a fixed step length and an exponentially increasing subset size in each step. Formally, it is given by:

h(l)=\min(\upsilon\cdot\delta^{\lfloor\frac{l}{\phi}\rfloor},1)\cdot N_{t} \qquad (9)

where $\upsilon$ denotes the fraction of data in the initial step, $\delta$ is the exponential factor for increasing the size of the sampled mini-batches in each step, $\phi$ is the number of iterations in each step, $\lfloor\cdot\rfloor$ denotes rounding down, $l$ is the index of the batch, and $N_{t}$ is the number of instances. Consequently, in each mini-batch we select episodic data of variable length for reliable fine-tuning, as sketched below.
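The following is a short sketch of the fixed exponential pacing function (Eq. 9) and a curriculum batch schedule built on it. Instances are assumed to be pre-sorted from easy to hard by the scoring function; drawing each mini-batch uniformly from the current prefix is an illustrative choice.

```python
# Hedged sketch of Eq. 9 and the easy-to-hard batch schedule.
import random

def pacing_size(l, n_total, upsilon=0.2, delta=3.0, phi=10):
    """h(l) = min(upsilon * delta^floor(l/phi), 1) * N_t."""
    return max(1, int(min(upsilon * delta ** (l // phi), 1.0) * n_total))

def curriculum_batches(sorted_idx, num_batches, batch_size, **pace_kw):
    """sorted_idx: instance indices sorted by difficulty (ascending)."""
    for l in range(num_batches):
        prefix = sorted_idx[:pacing_size(l, len(sorted_idx), **pace_kw)]
        yield random.sample(prefix, min(batch_size, len(prefix)))
```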

3.4.3 Fine-tune Clustering

With the sampled mini-batches $\{B_{1},B_{2},\cdots,B_{L}\}$ in each epoch, we aim to fine-tune $f^{t-1}$ from easier to harder instances, and Eq. 1 can be reformulated as:

\begin{split}L^{t}&=L_{intra}^{t}+\lambda_{1}L_{inter}^{t}+\lambda_{2}R^{t}\\ L_{intra}^{t}&=\sum_{l=1}^{L}\sum_{j=1}^{|B_{l}|}\sum_{c=1}^{|\hat{Y}^{t}|}-\bar{y}_{l_{j_{c}}}^{t}\log(p_{l_{j_{c}}}^{t})\\ L_{inter}^{t}&=\sum_{l=1}^{L}\sum_{\Omega_{j}\in\mathbb{T}_{l}}\big[m+d_{l_{j,p}}-\min_{{\bf x}_{n_{c}}^{t}\in\Omega_{j}}d_{l_{j,n_{c}}}\big]_{+}\\ R^{t}&=\sum_{c=1}^{|\hat{Y}^{t-1}|}\|\mu_{c}^{t}-\mu_{c}^{t-1}\|_{2}^{2}\end{split} \qquad (10)

where $\mathbb{T}_{l}$ denotes the triplet set of the $l$-th batch, and $R^{t}$ constrains the updated prototypes of the known classes to stay close to the pre-trained ones, which regularizes the embeddings of the known classes. The pseudo-label $\bar{{\bf y}}_{j}^{t}=\arg\max_{c}\{p_{j_{c}}^{t}\}$ of each instance can be calculated by Eq. 3. So far, we have assumed that the number of classes $K^{t}$ is known, which is impractical in real applications. Thus we aim to estimate the number of classes in the unlabeled data. Specifically, we fine-tune the clustering on $D^{t}$ while varying the number of unknown classes. The resulting clusters are then examined by computing a cluster validity index (CVI), which weighs intra-cluster cohesion against inter-cluster separation. We select the commonly used Silhouette index [39]:

CVI=\sum_{{\bf x}\in D^{t}}\frac{b({\bf x})-a({\bf x})}{\max\{a({\bf x}),b({\bf x})\}} \qquad (11)

where $a({\bf x})$ is the average distance between ${\bf x}$ and all other instances within the same cluster, and $b({\bf x})$ is the smallest average distance from ${\bf x}$ to the instances of any other cluster. The optimal number of categories is the inflection point of the CVI with maximum curvature. The details are shown in Algorithm 2, and a model selection sketch follows.
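As a sketch of the model selection step, the snippet below estimates the number of classes by sweeping $K$ and scoring each clustering with a Silhouette-style index; scikit-learn's silhouette_score stands in for Eq. 11, and taking the best-scoring $K$ is a simplification of the maximum-curvature rule. `cluster_fn` abstracts the fine-tune clustering and is an assumed interface.

```python
# Hedged sketch of estimating K^t via the CVI of Eq. 11.
import numpy as np
from sklearn.metrics import silhouette_score

def estimate_num_classes(embed, cluster_fn, k_max):
    """embed: (N, d) features from f^{t-1}; cluster_fn(k) -> labels."""
    scores = []
    for k in range(1, k_max + 1):
        labels = cluster_fn(k)          # fine-tune with k assumed novel classes
        scores.append(silhouette_score(embed, labels))
    return int(np.argmax(scores)) + 1   # K* with the best validity value
```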

Algorithm 2 Novel Class Detection
  • Input:
  • Data set: $D^{t}=\{{\bf x}_{j}^{t}\}_{j=1}^{N_{t}}$
  • Parameters: $\beta$, $\gamma$, $\upsilon$, $\delta$, $\phi$
  • Output:
  • Novel class detection network: $\hat{f}^{t}$
1:  for $0\leq K\leq K_{max}^{t}$ do
2:     Initialize the prototypes $\mu_{c}^{t}$ according to Eq. 8;
3:     while stop condition is not triggered do
4:        Generate mini-batches $\{B_{1},B_{2},\cdots,B_{L}\}$ according to Eq. 9;
5:        for each instance mini-batch $B_{l}$ do
6:           Calculate $L^{t}$ using Eq. 10, similar to Algorithm 1;
7:           Fine-tune the model parameters using gradient descent;
8:        end for
9:        Update the prototypes $\mu_{c}^{t}$ according to Eq. 4;
10:        Update the pseudo-labels $\bar{{\bf y}}$ according to Eq. 3;
11:     end while
12:     Compute the CVI for $D^{t}$ according to Eq. 11;
13:  end for
14:  Return $\hat{f}^{t}$ corresponding to the $K^{*}$ with the optimal CVI value.

3.5 Incremental Model Update

Ideally, the initial model training and novel class detection processes can identify the known and unknown classes. However, considering streaming data with unceasing novel classes, we need reliable training data of the novel classes to create new prototypes and update the model parameters in an incremental fashion. Thus, we need to collect novel class data for labeling, which can be used to re-train $f^{t-1}$. Similar to previous studies [40, 20], after the curriculum clustering operator for detection, we can obtain potential novel class instances $D_{new}^{t}$ and query their true labels. Note that we can query all or only part of the novel class data. However, catastrophic forgetting of the known classes occurs if we only use the new data to update the model.

To solve this problem, we develop a mechanism to incorporate the stored memory and the novel class information incrementally, which can mitigate the forgetting of the discriminative characteristics of the known classes. In detail, we utilize the exemplar data $M^{t-1}$ for regularization in re-training:

\begin{split}L(D_{new}^{t},M^{t-1})&=\hat{L}_{intra}^{t}+\lambda_{1}\hat{L}_{inter}^{t}+\lambda_{2}R^{t}\\ \hat{L}_{intra}^{t}&=\sum_{{\bf x}_{j}\in D_{new}^{t}\cup M^{t-1}}-y_{j}^{t}\log(p_{j}^{t})+\sum_{{\bf x}_{j}\in M^{t-1}}f^{t-1}({\bf x}_{j})\log f^{t}({\bf x}_{j})\\ \hat{L}_{inter}^{t}&=\sum_{\Omega_{i}\in\mathbb{T}^{t}}\big[m+d_{i,p}-\min_{{\bf x}_{n_{c}}^{t}\in\Omega_{i}}d_{i,n_{c}}\big]_{+}\\ R^{t}&=\sum_{c=1}^{|\hat{Y}^{t-1}|}\|\mu_{c}^{t}-\mu_{c}^{t-1}\|_{2}^{2}\end{split} \qquad (12)

The first term encourages the network to output the correct class indicator (classification loss) for all labeled examples, i.e., $D_{new}^{t}$ and $M^{t-1}$, and to reproduce the scores calculated in the previous step (distillation loss) for the stored in-class examples, i.e., $M^{t-1}$; $\mathbb{T}^{t}$ is constituted from $D_{new}^{t}$ and $M^{t-1}$. After re-training, we need to update $M^{t-1}$ to store key points of the novel classes: we randomly remove $\frac{|Y^{t}||M|}{|\hat{Y}^{t-1}||\hat{Y}^{t}|}$ instances for each known class and sample $\frac{|M|}{|\hat{Y}^{t}|}$ instances for each novel class, as sketched below. The details are shown in Algorithm 3.
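A minimal sketch of this memory update is given below: each class ends up with about $|M|/|\hat{Y}^{t}|$ exemplars, which matches the removal and sampling rates above; the dict-of-lists layout and uniform random selection per class are illustrative assumptions.

```python
# Hedged sketch of the exemplar memory update in Section 3.5.
import random

def update_memory(memory, new_data, mem_size):
    """memory: {class_id: [exemplars]} for known classes;
    new_data: {class_id: [instances]} for the labeled novel classes."""
    per_class = mem_size // (len(memory) + len(new_data))    # |M| / |Y^t hat|
    for c in memory:                                         # shrink known classes
        if len(memory[c]) > per_class:
            memory[c] = random.sample(memory[c], per_class)
    for c, items in new_data.items():                        # admit novel classes
        memory[c] = random.sample(items, min(per_class, len(items)))
    return memory
```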

Algorithm 3 Class-Incremental Learning
  • Input:
  • Data sets: memory data $M^{t-1}$, labeled novel class data $D_{new}^{t}$
  • Learning rate: $\eta$
  • Output:
  • Re-trained deep clustering network: $f^{t}$
1:  Calculate $f^{t-1}({\bf x}_{j})$ for the examples from $M^{t-1}$ and $D_{new}^{t}$;
2:  while stop condition is not triggered do
3:     for each instance mini-batch do
4:        Calculate $L(D_{new}^{t},M^{t-1})$ according to Eq. 12;
5:        Re-train the model parameters using gradient descent;
6:     end for
7:     Update the prototype $\mu$ according to Eq. 4;
8:  end while

4 Experiments

In this section, we mainly verify the proposed CILF from two aspects: (1) classification of known and novel classes; and (2) forgetting of known classes. Considering that most large-scale datasets are concentrated on images, we empirically evaluate CILF against state-of-the-art approaches on four simulated streaming image datasets.

[Figure 4 panels: (a) Single Novel Class; (b) Multiple Novel Classes]
Figure 4: The class distribution of the simulated stream on the CIFAR-10 dataset as an example. (a) represents the single novel class case, and (b) the multiple novel classes case. The X-axis denotes the streaming data and the Y-axis the class information.
TABLE II: Classification of known classes and novel class detection performance over streaming data in the single novel class case. The best results are highlighted in bold.
Methods Average NA ↑ Average Macro-F-Measure ↑
MNIST CIFAR-10 CIFAR-50 CIFAR-100 MNIST CIFAR-10 CIFAR-50 CIFAR-100
Iforest .189±.021 .131±.023 .045±.006 .040±.004 .156±.023 .096±.011 .035±.005 .027±.004
One-SVM .211±.032 .136±.031 .043±.007 .039±.000 .135±.021 .090±.014 .032±.003 .024±.005
LACU-SVM .222±.034 .141±.020 .045±.005 .039±.004 .170±.018 .096±.010 .034±.004 .026±.006
SENC-MAS .203±.030 .134±.033 .043±.006 .038±.001 .149±.031 .082±.013 .031±.003 .022±.005
ODIN-CNN .293±.011 .194±.020 .068±.001 .034±.027 .323±.029 .156±.033 .031±.002 .055±.045
CFO .276±.008 .202±.015 .037±.001 .022±.008 .292±.022 .167±.025 .045±.002 .033±.013
CPE .298±.013 .250±.020 .055±.001 .047±.017 .337±.034 .281±.033 .109±.002 .087±.031
DTC .289±.058 .245±.152 .035±.001 .027±.001 .357±.081 .261±.344 .048±.001 .018±.001
CILF .371±.022 .288±.034 .092±.008 .055±.004 .365±.012 .302±.033 .158±.005 .092±.008
Methods Average Micro-F-Measure ↑ Average AUROC ↑
MNIST CIFAR-10 CIFAR-50 CIFAR-100 MNIST CIFAR-10 CIFAR-50 CIFAR-100
Iforest .234±.036 .142±.021 .066±.008 .052±.008 .083±.014 .140±.343 .077±.014 .105±.009
One-SVM .214±.074 .147±.031 .064±.008 .046±.007 .117±.009 .101±.021 .096±.011 .074±.020
LACU-SVM .258±.042 .148±.022 .064±.007 .050±.008 .123±.013 .068±.012 .089±.016 .117±.081
SENC-MAS .216±.065 .140±.036 .062±.008 .044±.006 .104±.036 .082±.026 .043±.010 .079±.096
ODIN-CNN .285±.021 .188±.040 .035±.001 .058±.041 .259±.189 .185±.063 .102±.045 .123±.096
CFO .351±.016 .304±.030 .054±.001 .047±.017 .208±.180 .162±.073 .076±.030 .102±.092
CPE .336±.026 .299±.040 .110±.001 .092±.033 .270±.184 .193±.060 .114±.037 .119±.100
DTC .323±.073 .289±.305 .064±.001 .025±.001 .264±.187 .190±.058 .108±.036 .114±.097
CILF .355±.025 .310±.025 .122±.008 .103±.008 .317±.018 .255±.039 .125±.012 .130±.027
[Figure 5: t-SNE panels, rows (a-1)-(a-6) Original, (b-1)-(b-6) CPE, (c-1)-(c-6) DTC, (d-1)-(d-6) CILF, one panel per time window]
Figure 5: t-SNE visualization of both known and unknown classes on CIFAR-10 in the single novel class case. (a) original feature space; (b) representations learned by the single-class detection method CPE [17]; (c) representations learned by the multi-class detection method DTC [24]; (d) representations learned by the proposed CILF. Method-$t$ indicates the t-SNE of the $t$-th time window of each method.
TABLE III: Forgetting measure of known classes over streaming data in the single novel class case. The best results are highlighted in bold.
Methods Forgetting ↓
MNIST CIFAR-10 CIFAR-50 CIFAR-100
Iforest .027±.006 .021±.002 .043±.001 .067±.001
One-SVM .028±.007 .019±.004 .042±.002 .068±.002
LACU-SVM .032±.005 .029±.002 .038±.003 .061±.001
SENC-MAS .028±.005 .020±.001 .036±.002 .060±.001
ODIN-CNN .040±.004 .012±.006 .023±.002 .018±.005
CFO .023±.012 .010±.003 .014±.001 .016±.005
CPE .022±.001 .007±.001 .017±.001 .013±.003
DTC .026±.005 .009±.001 .015±.002 .014±.001
DNN-Base .042±.005 .012±.007 .015±.003 .024±.006
DNN-L2 .032±.010 .011±.004 .012±.007 .029±.003
DNN-EWC .023±.014 .016±.010 .014±.009 .016±.007
IMM .030±.012 .009±.003 .016±.004 .025±.004
DEN .024±.003 .007±.004 .011±.009 .017±.001
CILF .020±.016 .006±.001 .010±.001 .013±.003
[Figure 6 panels: (a) NA; (b) Macro-F-Measure; (c) Micro-F-Measure; (d) AUROC]
Figure 6: Performance on the known classes at different time windows of CIFAR-10.

4.1 Datasets

We utilize three commonly used visual datasets for the class-incremental scenario following [17, 30, 13]: MNIST [41], CIFAR-10 [42], and CIFAR-100 (http://www.cs.toronto.edu/kriz/cifar.html). In detail, the MNIST dataset contains labeled handwritten digit images from 10 categories, where each class contains between 6313 and 7877 monochrome images; the CIFAR-10 dataset has a total of 60000 color images of 32x32 pixels from 10 natural image classes; the CIFAR-100 dataset is an enlarged CIFAR-10, and we structure it into 2 datasets, CIFAR-50 and CIFAR-100, according to [13].

Inspired by [43, 17, 30, 13, 44], we utilize the given testing data from the raw datasets as a holdout set to evaluate forgetting, and use the given training data to generate the streaming data. Specifically, we rearrange the instances in each dataset to emulate a stream with novel classes in two forms: (1) a single novel class in each time window; (2) multiple novel classes in each time window. For the single novel class case, we randomly choose $C$ initial classes, and at most 1 novel class may start in each time window. To be more in line with real-world applications, each known class may disappear randomly at the end of the current time window. Specifically, we set $C=5$ for MNIST and CIFAR-10, $C=30$ for CIFAR-50, and $C=50$ for CIFAR-100. Figure 4 (a) presents a simulated example on CIFAR-10, i.e., we randomly choose 5 initial classes, and there are 5 time windows with 1 novel class starting in each time window. For the multiple novel classes case, we randomly choose $C$ initial classes, and $K^{t}$ novel classes (i.e., $K^{t}\in[2,K]$) may randomly start in each time window. Similar to the single class setting, each class may disappear randomly. Specifically, we set $C=3$ for MNIST and CIFAR-10, $C=30$ for CIFAR-50, and $C=50$ for CIFAR-100. Figure 4 (b) presents a simulated example on CIFAR-10, i.e., we choose 3 initial classes and there are 3 time windows: 2 novel classes start in the first time window, 3 in the second, and 2 in the last.

TABLE IV: Classification of known classes and novel class detection performance over streaming data in the multiple novel classes case. The best results are highlighted in bold.
Methods Average NA ↑ Average Macro-F-Measure ↑
MNIST CIFAR-10 CIFAR-50 CIFAR-100 MNIST CIFAR-10 CIFAR-50 CIFAR-100
Iforest .348±.069 .277±.062 .133±.021 .095±.014 .181±.062 .103±.004 .040±.004 .028±.002
One-SVM .361±.085 .277±.049 .136±.017 .089±.014 .188±.085 .107±.008 .038±.005 .026±.003
LACU-SVM .357±.054 .274±.062 .133±.012 .090±.017 .191±.054 .098±.008 .032±.005 .023±.002
SENC-MAS .336±.075 .280±.050 .139±.021 .091±.013 .163±.071 .106±.002 .035±.004 .024±.002
ODIN-CNN .396±.021 .304±.039 .181±.008 .071±.007 .268±.032 .243±.005 .160±.006 .054±.004
CFO .353±.012 .287±.027 .212±.009 .104±.008 .380±.021 .363±.001 .288±.007 .183±.005
CPE .413±.030 .401±.061 .240±.020 .132±.020 .236±.047 .339±.037 .249±.022 .244±.021
DTC .350±.142 .398±.084 .206±.012 .089±.016 .327±.315 .302±.217 .230±.017 .110±.011
CILF .493±.051 .422±.054 .258±.040 .179±.031 .392±.045 .407±.033 .332±.035 .259±.041
Methods Average Micro-F-Measure ↑ Average AUROC ↑
MNIST CIFAR-10 CIFAR-50 CIFAR-100 MNIST CIFAR-10 CIFAR-50 CIFAR-100
Iforest .272±.079 .153±.009 .067±.002 .049±.003 .140±.015 .126±.019 .113±.023 .140±.009
One-SVM .293±.122 .161±.018 .068±.008 .045±.003 .158±.016 .153±.011 .146±.008 .092±.010
LACU-SVM .294±.060 .144±.006 .062±.006 .043±.002 .149±.002 .158±.016 .145±.005 .152±.024
SENC-MAS .250±.082 .165±.012 .063±.008 .045±.004 .107±.092 .123±.006 .086±.051 .043±.032
ODIN-CNN .242±.024 .193±.006 .072±.009 .067±.006 .231±.052 .145±.245 .193±.140 .133±.250
CFO .245±.016 .212±.004 .108±.010 .104±.008 .195±.050 .214±.257 .115±.145 .159±.245
CPE .207±.037 .367±.035 .200±.023 .146±.022 .241±.056 .255±.243 .201±.140 .183±.254
DTC .294±.292 .231±.204 .141±.016 .116±.011 .234±.061 .217±.076 .197±.137 .171±.253
CILF .422±.043 .442±.043 .212±.022 .152±.016 .286±.028 .261±.022 .189±.034 .192±.016
TABLE V: Forgetting measure of known classes over streaming data in the multiple novel classes case. The best results are highlighted in bold.
Methods Forgetting ↓
MNIST CIFAR-10 CIFAR-50 CIFAR-100
Iforest .014±.015 .006±.004 .035±.002 .048±.005
One-SVM .018±.010 .005±.003 .033±.004 .049±.003
LACU-SVM .017±.010 .004±.002 .029±.003 .042±.004
SENC-MAS .011±.011 .006±.002 .032±.002 .041±.004
ODIN-CNN .011±.008 .013±.011 .015±.006 .013±.010
CFO .019±.005 .008±.006 .009±.008 .023±.003
CPE .007±.001 .011±.003 .009±.002 .005±.002
DTC .002±.003 .009±.001 .006±.002 .023±.001
DNN-Base .022±.002 .023±.009 .032±.007 .039±.006
DNN-L2 .021±.002 .026±.001 .035±.005 .041±.002
DNN-EWC .016±.010 .017±.010 .020±.005 .021±.009
IMM .022±.030 .023±.010 .024±.012 .030±.019
DEN .010±.009 .013±.005 .013±.002 .021±.011
CILF .001±.001 .005±.001 .001±.001 .008±.001

4.2 Compared Methods

To validate the effectiveness of the proposed CILF, we compare it with existing state-of-the-art novel class detection approaches and incremental learning methods.

First, we compare CILF with existing NCD and incremental NCD methods, including traditional anomaly detection and linear methods: Iforest [45], One-Class SVM (One-SVM) [ScholkopfPSSW01], LACU-SVM (LACU) [19], and SENC-MAS (SENC) [20]; and deep methods: ODIN-CNN (ODIN) [22], CFO [13], CPE [17], and DTC [24] (abbreviations in parentheses). DTC is a clustering based method for multiple unknown class detection. Note that Iforest, One-SVM, LACU, ODIN, CFO, and DTC are NCD methods, while SENC and CPE are incremental NCD methods. All NCD baselines except Iforest can be updated incrementally using the newly labeled unknown class data and the memory data.

  • Iforest: an ensemble tree method to detect outliers;

  • One-Class SVM (One-SVM): a baseline for out-of-class detection and classification;

  • LACU-SVM (LACU): a SVM-based method that incorporates the unlabeled data from open set for unknown class detection;

  • SENC-MAS (SENC): a matrix sketching method that approximates original information with a dynamic low-dimensional structure;

  • ODIN-CNN (ODIN): a CNN-based method that distinguishes in-distribution and out-of-distribution over softmax score;

  • CFO: a generative method that adopts an encoder-decoder GAN to generate synthetic unknown instances;

  • CPE: a CNN-based ensemble method, which adaptively updates the prototype for detection;

  • DTC: an extended deep transfer clustering method for novel class detection.

Specifically, 1) Iforest, ODIN, and CFO can only perform binary classification, i.e., whether an instance belongs to an unknown class; thus we further conduct unsupervised clustering on both the known and unknown class data for subdividing; 2) all baselines except DTC are one-class methods, i.e., they perform NCD in two steps: first detect the super-class of the unknown classes, then perform unsupervised clustering; 3) all baselines except LACU, SENC, and CPE are pure NCD methods, but they can be applied to incremental NCD by combining memory data for updates following [17].

To validate the incremental model update, we also compare with state-of-the-art forgetting mitigation methods: DNN-Base, DNN-L2, DNN-EWC [28], IMM [29], and DEN [46], where each time window is regarded as a task. In detail, the compared methods are:

  • DNN-Base: base DNN with $L_{2}$-regularization;

  • DNN-L2: base DNN, where at each stage $t$, $\Theta_{t}$ is initialized as $\Theta_{t-1}$ and continuously trained with $L_{2}$-regularization between $\Theta_{t}$ and $\Theta_{t-1}$;

  • DNN-EWC: Deep network trained with elastic weight consolidation for regularization, which remembers old stages by selectively slowing down learning on the weights important for those stages;

  • IMM: An incremental moment matching method with two extensions: Mean-IMM and Mode-IMM, which incrementally matches the posterior distribution of the neural network trained on the previous stages;

  • DEN: A deep network architecture for incremental learning, which can dynamically decide its network structure with a sequence of stages and learn the overlapping knowledge sharing structure among stages.

4.3 Evaluation Metrics

CILF is designed to distinguish the known and unknown classes while mitigating forgetting. Thus we measure the proposed method from two aspects: (1) NCD performance; (2) forgetting performance.

Following [30], we adopt the commonly used evaluation metrics for novel class detection: 1) Normalized Accuracy (NA), which weights the accuracy of known and novel classes [47]; 2) Macro-F-measure and Micro-F-measure; and 3) AUROC, which treats the NCD task as a combination of novelty detection and multi-class recognition [13]. Moreover, to validate catastrophic forgetting, we compute the forgetting profile of the different learning algorithms as in [44]: let $acc_{m,n}$ be the accuracy evaluated on the hold-out sets of the novel classes that emerge in the $n$-th time window ($n\leq m$) after training the network incrementally from stage 1 to $m$; the average accuracy at time $m$ is then defined as $A_{m}=\frac{1}{m}\sum_{n=1}^{m}acc_{m,n}$ [44], where a higher $A_{m}$ represents a better classifier. Finally, $Forgetting=\frac{A^{*}-mean(A)}{A^{*}}$, where $A^{*}$ is the optimal accuracy obtained with the entire data. We repeat all experiments 5 times and record the mean and std.
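For clarity, the forgetting profile can be computed as in the sketch below, where `acc[m][n]` is assumed to hold the hold-out accuracy $acc_{m,n}$ and `a_star` the optimal accuracy $A^{*}$ obtained with the entire data.

```python
# Hedged sketch of A_m and the Forgetting measure of Section 4.3.
import numpy as np

def average_accuracy(acc, m):
    """A_m = (1/m) * sum_{n=1..m} acc_{m,n} (windows are 1-indexed)."""
    return np.mean([acc[m][n] for n in range(1, m + 1)])

def forgetting(acc, a_star, T):
    """Forgetting = (A* - mean(A)) / A*, with mean(A) over all windows."""
    mean_A = np.mean([average_accuracy(acc, m) for m in range(1, T + 1)])
    return (a_star - mean_A) / a_star
```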

[Figure 7: t-SNE panels, rows (a-1)-(a-5) Original, (b-1)-(b-5) CPE, (c-1)-(c-5) DTC, (d-1)-(d-5) CILF, one panel per time window]
Figure 7: T-SNE Visualization for both known and unknown classes on CIFAR-10 in multiple novel class case. (a) original feature space; (b) Learned representations through single detection method CPE [17]; (c) Learned representations through multi detection method DEC [24]; (d) Learned representations through proposed CILF. Methodt-t indicates the T-SNE of tt-th time window of different methods.
Figure 8: Performance criteria on different time windows on CIFAR-10 in the multiple novel class case: (a) NA; (b) Macro-F-Measure; (c) Micro-F-Measure; (d) AUROC.

4.4 Implementation

We develop CILF on a ResNet-18 convolutional backbone [48]. Note that we use an identical set of hyperparameters ($\lambda_1=1$, $\lambda_2=1$, $\alpha=0.3$, $\beta=0.8$, $\upsilon=0.2$, $\delta=3$, $\phi=10$). In all models and experiments, we adopt standard SGD with Nesterov momentum [49], with the momentum set to 0.9. We train the initial model $f$ as follows: 20 epochs, batch size 128, learning rate 0.01, and weight decay 0.001. We implement all baselines and perform all experiments based on the code released by the corresponding authors. For CNN-based methods, we use the same network architecture and training settings, such as optimizer, learning rate schedule, and data pre-processing. Our method is implemented on an RTX 2080Ti GPU with PyTorch 0.4.0 (https://pytorch.org/).
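For clarity, the optimizer configuration described above corresponds to the following sketch, assuming torchvision's ResNet-18; the classification head and data pipeline are dataset-specific and omitted:

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)  # e.g., CIFAR-10; adjust the head per dataset
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # initial learning rate
    momentum=0.9,       # Nesterov momentum coefficient
    nesterov=True,
    weight_decay=0.001,
)
# The initial model f is trained for 20 epochs with batch size 128.
```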

4.5 Single Novel Class Detection

Table II compares the detection performance of CILF with all baseline methods on each stream with a single novel class. We observe that: (1) CNN-based methods outperform traditional detection approaches, i.e., One-SVM, LACU-SVM, and SENC-MAS, which indicates that neural networks provide superior feature embeddings for prediction and detection on high-dimensional streaming data; (2) CILF consistently outperforms all compared CNN-based methods on all criteria; for example, on CIFAR-10, CILF provides at least a 2% improvement over the other methods, which indicates the effectiveness of the prototype based loss for feature embedding and the curriculum clustering operator for detection; and (3) detection performance on large-scale datasets still needs to be improved, as the results of all methods remain low.

Figure 5 shows the feature embedding within each time window using T-SNE [50], where each class randomly samples 800 instances. Note that we use the more complex CIFAR-10 dataset as the example, rather than the simpler MNIST dataset used in previous work. Clearly, the best discriminative method (CPE) and generative method (DEC) suffer heavily from embedding confusion, whereas the output of CILF forms more distinct groups for different classes, which is attributed to the prototype based loss used in model training. Moreover, instances from novel classes are well separated from the known clusters, which benefits novel class detection. The compactness of the new class indicates the effectiveness of the curriculum clustering operator, in which reliable prototypes are developed and the model is fine-tuned from easy to difficult instances.
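The visualization itself follows the standard T-SNE recipe; a minimal sketch, where features (penultimate-layer embeddings, 800 samples per class) and labels are assumed placeholder arrays:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# features: (N, d) embedding matrix; labels: (N,) class ids (placeholders).
emb = TSNE(n_components=2, random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=2, cmap='tab10')
plt.show()
```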

Table III compares the forgetting performance of CILF with all baseline methods. Forgetting of the classes that emerge in a particular window is defined as the difference between the maximum knowledge gained about that window throughout the learning process and the knowledge we currently have about it; the lower the difference, the better. The results show that CILF exhibits the least forgetting, which validates that memory distillation and prototype regularization can mitigate the forgetting of known class data. Moreover, Figure 6 gives more direct evidence; due to page limitations, we only report the results on CIFAR-10. They indicate that, at different windows, the performance on known classes drops more slowly, showing that CILF mitigates forgetting efficiently.

Figure 9: Relationship between detection performance during the stream and the label request percentage for each time window: (a) single novel class case on CIFAR-10; (b) multiple novel class case on CIFAR-10.

4.6 Multiple Novel Class Detection

Table IV compares the detection performance of CILF with all baseline methods when multiple novel classes are considered. We observe that: (1) the multiple novel class detection method, i.e., DEC, achieves no advantage over single novel class detection methods with a subsequent clustering operator, which indicates that direct clustering may be affected by embedding confusion; and (2) CILF consistently outperforms all compared CNN-based methods on all criteria except AUROC on CIFAR-50, which further indicates the effectiveness of the curriculum clustering operator for detection.

Figure 7 shows the feature embedding within each time window using T-SNE, where each class randomly samples 800 instances. Similarly, the output of CILF forms more distinct groups for different classes compared to the other methods, which indicates that CILF solves the embedding confusion effectively. Table V and Figure 8 compare the forgetting performance of CILF with the baseline methods. Again, the results show that CILF has the least forgetting and that the performance on known classes falls more slowly, demonstrating that CILF also mitigates forgetting in the multiple novel class scenario.

4.7 Influence of Query Size

Figure 9 shows the influence of the number of queried potential novel class instances; due to page limitations, we only report the results on CIFAR-10. Here, we randomly query a subset, i.e., a percentage of the potential instances in the current window (sketched below). The figure shows that prediction performance improves as the amount of labeled data increases, which verifies the importance of ground-truth labels for model update.
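The query step itself is plain random sampling; a sketch, where candidate_idx indexes the potential novel instances in the current window and ratio is the percentage varied in Figure 9:

```python
import numpy as np

def random_query(candidate_idx, ratio, seed=0):
    """Randomly select a fraction of potential novel-class instances for labeling."""
    rng = np.random.default_rng(seed)
    k = int(len(candidate_idx) * ratio)
    return rng.choice(candidate_idx, size=k, replace=False)
```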

Figure 10: Parameter sensitivity of $\lambda_1$ and $\lambda_2$ for CIFAR-10 in novel class detection: (a) single novel class case; (b) multiple novel class case.

4.8 Parameter Sensitivity

The main parameters in novel class detection and model update are $\lambda_1$ and $\lambda_2$ in Eq. 10. We vary these parameters over $\{0.01, 0.1, 1, 10, 100\}$ to study their sensitivity with respect to classification performance, and record the AUROC results in Figure 10 (a sketch of this grid follows below). Both the single and multiple cases indicate that performance is higher when $\lambda_1$ is set to a larger value, i.e., larger than 1.
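The study amounts to a simple grid search; a sketch with a hypothetical train_and_eval helper that trains CILF with the given trade-off weights and returns the AUROC:

```python
grid = [0.01, 0.1, 1, 10, 100]
# train_and_eval is a hypothetical helper (not part of the released code).
results = {(l1, l2): train_and_eval(lambda1=l1, lambda2=l2)
           for l1 in grid for l2 in grid}
```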

Figure 11: Execution time analysis in the multiple novel class case: (a) CIFAR-10; (b) CIFAR-50.

4.9 Execution Time for Model Update

Since our method focuses on multiple novel class detection, we analyze the execution time for detecting novel classes and updating the model in the multiple novel class case. In detail, we select the five deep methods, i.e., ODIN-CNN, CFO, CPE, DTC, and CILF, and record the results in Figure 11. CILF is the fastest; the other methods require additional clustering operations, and embedding confusion slows down clustering convergence, which indicates that curriculum clustering can accelerate detection.

5 Conclusion

Real-world applications often receive data in streaming form, where previously unknown classes emerge sequentially. Incremental NCD has two main challenges: 1) novel class detection, since streaming test data will contain unknown classes; and 2) model expansion, since the model needs to be updated effectively after new classes are detected. However, traditional methods have not fully addressed these two challenges. To this end, we proposed the Class-Incremental Learning without Forgetting (CILF) framework. CILF regularizes classification with a decoupled prototype based loss, which significantly improves the intra-class and inter-class structure and yields a compact embedding representation for novel class detection. Then, CILF employs a learnable curriculum clustering operator to estimate the number of semantic clusters by fine-tuning the learned network. Last, CILF updates the network with robust regularization to mitigate catastrophic forgetting. Empirical studies demonstrated the superior performance of CILF.

References

  • Masi et al. [2018] I. Masi, Y. Wu, T. Hassner, and P. Natarajan, “Deep face recognition: A survey,” in Proceedings of the 31st SIBGRAPI Conference on Graphics, Patterns and Images, Parana, Brazil, 2018, pp. 471–478.
  • Schmarje et al. [2020] L. Schmarje, M. Santarossa, S.-M. Schroder, and R. Koch, “A survey on semi-, self- and unsupervised techniques in image classification,” CoRR, vol. abs/2002.08721, 2020.
  • Palatucci et al. [2009] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell, “Zero-shot learning with semantic output codes,” in Advances in Neural Information Processing Systems 22, Vancouver, British Columbia, 2009, pp. 1410–1418.
  • Fu et al. [2018] Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, and S. Gong, “Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content,” IEEE Signal Process. Mag., vol. 35, no. 1, pp. 112–125, 2018.
  • Min et al. [2019] S. Min, H. Yao, H. Xie, Z.-J. Zha, and Y. Zhang, “Domain-specific embedding network for zero-shot recognition,” in Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 2019, pp. 2070–2078.
  • Lampert et al. [2014] C. H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, pp. 453–465, 2014.
  • Changpinyo et al. [2016] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha, “Synthesized classifiers for zero-shot learning,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, pp. 5327–5336.
  • Li et al. [2019] J. Li, M. Jing, K. Lu, L. Zhu, Y. Yang, and Z. Huang, “Alleviating feature confusion for generative zero-shot learning,” in Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 2019, pp. 1587–1595.
  • Bendale and Boult [2016] A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, pp. 1563–1572.
  • Hassen and Chan [2018] M. Hassen and P. K. Chan, “Learning a neural-network-based representation for open set recognition,” CoRR, vol. abs/1802.04365, 2018.
  • Ge et al. [2017] Z. Ge, S. Demyanov, and R. Garnavi, “Generative openmax for multi-class open set classification,” in Proceedings of the British Machine Vision Conference, London, UK, 2017.
  • Jo et al. [2018] I. Jo, J. Kim, H. Kang, Y. Kim, and S. Choi, “Open set recognition by regularising classifier with fake data generated by generative adversarial networks,” in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018, pp. 2686–2690.
  • Neal et al. [2018] L. Neal, M. L. Olson, X. Z. Fern, W.-K. Wong, and F. Li, “Open set learning with counterfactual images,” in Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 2018, pp. 620–635.
  • Ratcliff [1990] R. Ratcliff, “Connectionist models of recognition memory: constraints imposed by learning and forgetting functions.” Psychol. Review, vol. 97, no. 2, p. 285, 1990.
  • Rebuffi et al. [2017] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “iCaRL: Incremental classifier and representation learning,” in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 2017, pp. 5533–5542.
  • Isele and Cosgun [2018] D. Isele and A. Cosgun, “Selective experience replay for lifelong learning,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, 2018, pp. 3302–3309.
  • Wang et al. [2019] Z. Wang, Z. Kong, S. Chandra, H. Tao, and L. Khan, “Robust high dimensional stream classification with novel class detection,” in Proceedings of the 35th IEEE International Conference on Data Engineering, Macao, China, 2019, pp. 1418–1429.
  • Nguyen et al. [2017] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner, “Variational continual learning,” CoRR, vol. abs/1710.10628, 2017.
  • Da et al. [2014] Q. Da, Y. Yu, and Z.-H. Zhou, “Learning with augmented class by exploiting unlabeled data,” in Proceedings of the 28th AAAI Conference on Artificial Intelligence, Quebec, Canada, 2014, pp. 1760–1766.
  • Mu et al. [2017] X. Mu, F. Zhu, J. Du, E.-P. Lim, and Z.-H. Zhou, “Streaming classification with emerging new class by class matrix sketching,” in Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, California, 2017, pp. 2373–2379.
  • Hendrycks and Gimpel [2017] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017.
  • Liang et al. [2018] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” in Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
  • Hsu et al. [2019] Y. Hsu, Z. Lv, J. Schlosser, P. Odom, and Z. Kira, “Multi-class classification without multi-class labels,” in Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, 2019.
  • Han et al. [2019] K. Han, A. Vedaldi, and A. Zisserman, “Learning to discover novel visual categories via deep transfer clustering,” in Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Korea, 2019, pp. 8400–8408.
  • Zhang et al. [2016] L.-J. Zhang, T. Yang, R. Jin, Y. Xiao, and Z.-H. Zhou, “Online stochastic linear optimization under one-bit feedback,” in Proceedings of the 33rd International Conference on Machine Learning, New York City, NY, 2016, pp. 392–401.
  • Lopez-Paz and Ranzato [2017] D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning,” in Advances in Neural Information Processing Systems 30, Long Beach, CA, 2017, pp. 6467–6476.
  • Li and Hoiem [2016] Z. Li and D. Hoiem, “Learning without forgetting,” in Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016, pp. 614–629.
  • Kirkpatrick et al. [2016] J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell, “Overcoming catastrophic forgetting in neural networks,” CoRR, vol. abs/1612.00796, 2016.
  • Lee et al. [2017a] S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang, “Overcoming catastrophic forgetting by incremental moment matching,” in Advances in Neural Information Processing Systems 30, Long Beach, CA, 2017, pp. 4655–4665.
  • Geng et al. [2018] C. Geng, S.-J. Huang, and S. Chen, “Recent advances in open set recognition: A survey,” CoRR, vol. abs/1811.08581, 2018.
  • Kulis [2013] B. Kulis, “Metric learning: A survey,” Foundations and Trends in Machine Learning, vol. 5, no. 4, pp. 287–364, 2013.
  • Bishop and Nasrabadi [2007] C. M. Bishop and N. M. Nasrabadi, “Pattern recognition and machine learning,” J. Electronic Imaging, vol. 16, no. 4, p. 049901, 2007.
  • Liu et al. [2016] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for convolutional neural networks,” in Proceedings of the 33rd International Conference on Machine Learning, New York City, NY, 2016, pp. 507–516.
  • Laine and Aila [2017] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” in Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017.
  • Yu and Tao [2019] B. Yu and D. Tao, “Deep metric learning with tuplet margin loss,” in Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Korea, 2019, pp. 6489–6498.
  • Bengio et al. [2009] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Canada, 2009, pp. 41–48.
  • Hacohen and Weinshall [2019] G. Hacohen and D. Weinshall, “On the power of curriculum learning in training deep networks,” in Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, 2019, pp. 2535–2544.
  • Dhillon et al. [2007] I. S. Dhillon, Y. Guan, and B. Kulis, “Weighted graph cuts without eigenvectors: A multilevel approach,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 11, pp. 1944–1957, 2007.
  • Arbelaitz et al. [2013] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Perez, and I. Perona, “An extensive comparative study of cluster validity indices,” Pattern Recognit., vol. 46, no. 1, pp. 243–256, 2013.
  • Haque et al. [2016] A. Haque, L. Khan, and M. Baron, “SAND: semi-supervised adaptive novel class detection and classification over data stream,” in Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, Arizona, 2016, pp. 1652–1658.
  • LeCun et al. [1998] Y. LeCun, C. Cortes, and C. J. Burges, “The MNIST database of handwritten digits,” 1998, http://yann.lecun.com/exdb/mnist.
  • Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • Masud et al. [2011] M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham, “Classification and novel class detection in concept-drifting data streams under time constraints,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 6, pp. 859–874, 2011.
  • Chaudhry et al. [2018] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr, “Riemannian walk for incremental learning: Understanding forgetting and intransigence,” CoRR, vol. abs/1801.10112, 2018.
  • Liu et al. [2008] F. T. Liu, K. M. Ting, and Z. Zhou, “Isolation forest,” in Proceedings of the 8th IEEE International Conference on Data Mining, Pisa, Italy, 2008, pp. 413–422.
  • Lee et al. [2017b] J. Lee, J. Yoon, E. Yang, and S. J. Hwang, “Lifelong learning with dynamically expandable networks,” CoRR, vol. abs/1708.01547, 2017.
  • Junior et al. [2017] P. R. M. Junior, R. M. de Souza, R. de Oliveira Werneck, B. V. Stein, D. V. Pazinato, W. R. de Almeida, O. A. B. Penatti, R. da Silva Torres, and A. Rocha, “Nearest neighbors distance ratio open-set classifier,” Mach. Learn., vol. 106, no. 3, pp. 359–386, 2017.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, pp. 770–778.
  • Sutskever et al. [2013] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, 2013, pp. 1139–1147.
  • Maaten and Hinton [2008] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
Yang Yang received the Ph.D. degree in computer science from Nanjing University, China, in 2019. In the same year, he became a faculty member at Nanjing University of Science and Technology, China, where he is currently a Professor with the School of Computer Science and Engineering. His research interests lie primarily in machine learning and data mining, including heterogeneous learning, model reuse, and incremental mining. He serves as a PC member of leading conferences such as IJCAI, AAAI, ICML, and NIPS.
Zhen-Qiang Sun is working towards the M.Sc. degree with the School of Computer Science & Technology at Nanjing Normal University, China. His research interests lie primarily in machine learning and data mining, including incremental learning.
Yanjie Fu received the BE degree from the University of Science and Technology of China, in 2008, the ME degree from the Chinese Academy of Sciences, in 2011, and the PhD degree from Rutgers University, in 2016. He is currently an assistant professor with the Missouri University of Science and Technology. His general interests are data mining and big data analytics. He has published prolifically in refereed journals and conference proceedings, such as the IEEE Transactions on Knowledge and Data Engineering, the ACM Transactions on Knowledge Discovery from Data, the IEEE Transactions on Mobile Computing, and ACM SIGKDD.
HengShu Zhu (SM’19) is currently a principal data scientist & architect at Baidu Inc. He received the Ph.D. degree in 2014 and B.E. degree in 2009, both in Computer Science from University of Science and Technology of China (USTC), China. His general area of research is data mining and machine learning, with a focus on developing advanced data analysis techniques for innovative business applications. He has published prolifically in refereed journals and conference proceedings, including IEEE Transactions on Knowledge and Data Engineering (TKDE), IEEE Transactions on Mobile Computing (TMC), ACM Transactions on Information Systems (ACM TOIS), ACM Transactions on Knowledge Discovery from Data (TKDD), ACM SIGKDD, ACM SIGIR, WWW, IJCAI, and AAAI. He has served regularly on the organization and program committees of numerous conferences, including as a program co-chair of the KDD Cup-2019 Regular ML Track, and a founding co-chair of the first International Workshop on Organizational Behavior and Talent Analytics (OBTA) and the International Workshop on Talent and Management Computing (TMC), in conjunction with ACM SIGKDD. He was the recipient of the Distinguished Dissertation Award of CAS (2016), the Distinguished Dissertation Award of CAAI (2016), the Special Prize of President Scholarship for Postgraduate Students of CAS (2014), the Best Student Paper Award of KSEM-2011, WAIM-2013, CCDM-2014, and the Best Paper Nomination of ICDM-2014. He is a senior member of IEEE, ACM, and CCF.
Hui Xiong (SM’07) is currently a Full Professor at Rutgers, The State University of New Jersey, where he received the 2018 Ram Charan Management Practice Award as the Grand Prix winner from the Harvard Business Review, RBS Dean’s Research Professorship (2016), the Rutgers University Board of Trustees Research Fellowship for Scholarly Excellence (2009), the ICDM Best Research Paper Award (2011), and the IEEE ICDM Outstanding Service Award (2017). He received the Ph.D. degree from the University of Minnesota (UMN), USA. He is a co-Editor-in-Chief of Encyclopedia of GIS, an Associate Editor of IEEE Transactions on Big Data (TBD), ACM Transactions on Knowledge Discovery from Data (TKDD), and ACM Transactions on Management Information Systems (TMIS). He has served regularly on the organization and program committees of numerous conferences, including as a Program Co-Chair of the Industrial and Government Track for the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), a Program Co-Chair for the IEEE 2013 International Conference on Data Mining (ICDM), a General Co-Chair for the IEEE 2015 International Conference on Data Mining (ICDM), and a Program Co-Chair of the Research Track for the 2018 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. He is an IEEE Fellow and an ACM Distinguished Scientist.
Jian Yang (M’08) received the Ph.D. degree in pattern recognition and intelligence systems from the Nanjing University of Science and Technology (NUST), Nanjing, China, in 2002. In 2003, he was a Post-Doctoral Researcher with the University of Zaragoza, Zaragoza, Spain. From 2004 to 2006, he was a Post-Doctoral Fellow with the Biometrics Centre, The Hong Kong Polytechnic University, Hong Kong. From 2006 to 2007, he was a Post-Doctoral Fellow with the Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA. He is currently a Chang-Jiang Professor with the School of Computer Science and Engineering, NUST. He has authored more than 200 scientific papers in pattern recognition and computer vision. His papers have been cited more than 6000 times in the Web of Science and 15,000 times in Google Scholar. His current research interests include pattern recognition, computer vision, and machine learning. Dr. Yang is a Fellow of IAPR. He is currently an Associate Editor of Pattern Recognition, Pattern Recognition Letters, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, and Neurocomputing.