
Autonomous Cross Domain Adaptation under Extreme Label Scarcity

Weiwei Weng, Mahardhika Pratama, Choiru Za’in, Marcus de Carvalho,
Rakaraddi Appan, Andri Ashfahani, Edward Yapp Kien Yee
W. Weng and M. Pratama share equal contributions. W. Weng, A. Ashfahani, M. de Carvalho, and R. Appan are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. M. Pratama is with the STEM academic unit, University of South Australia, Adelaide, Australia. C. Za’in is with the School of IT, Monash University. E. Y. Kien Yee is with the Singapore Institute of Manufacturing Technology, A*Star, Singapore. The majority of this work was done when M. Pratama was with SCSE, NTU, Singapore. E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected].
Abstract

Cross-domain multistream classification is a challenging problem calling for fast domain adaptations to handle different but related streams in never-ending and rapidly changing environments. Although existing multistream classifiers assume no labelled samples in the target stream, they still incur expensive labelling costs since they require fully labelled samples of the source stream. This paper aims to attack the problem of extreme label shortage in cross-domain multistream classification problems where only very few labelled samples of the source stream are provided before the process runs. Our solution, namely Learning Streaming Process from Partial Ground Truth (LEOPARD), is built upon a flexible deep clustering network whose hidden nodes, layers and clusters are added and removed dynamically in response to varying data distributions. The deep clustering strategy is underpinned by a simultaneous feature learning and clustering technique leading to clustering-friendly latent spaces. The domain adaptation strategy relies on the adversarial domain adaptation technique where a feature extractor is trained to fool a domain classifier that classifies source and target streams. Our numerical study demonstrates the efficacy of LEOPARD, which delivers improved performances compared to prominent algorithms in 15 of 24 cases. The source code of LEOPARD is shared at https://github.com/wengweng001/LEOPARD.git to enable further study.

Index Terms:
Multistream Classification, Transfer Learning, Data Streams, Incremental Learning, Concept Drifts

I Introduction

Background: Multistream classification problems [1] portray a classification problem across many streaming processes running simultaneously but independently. Each streaming process features different but related characteristics to be handled by a single model having a stream-invariant trait. That is, each stream suffers from the domain shift problem in which the streams follow different distributions. The multistream classification problem also considers the issue of labelling cost where ground truth access is only provided in the source stream, leaving the target stream with no labelled samples at all. Unlike traditional domain adaptation problems, the multistream problem deals with continuous information flows which must be handled in a fast and sample-wise fashion. Another typical problem is the asynchronous drift problem which distinguishes it from the conventional single-stream problem. The asynchronous drift problem refers to independent drifts between source and target streams taking place at different time points. The multistream classification problem distinguishes itself from the online transfer learning problem [2] in that both source and target domains are streaming in nature, whereas the online transfer learning problem assumes a static source domain although it considers a streaming target domain. The underlying goal of the multistream classification problem is to build a predictive model f(.) which simultaneously performs the unsupervised domain adaptation as well as addresses the issue of data streams. Notwithstanding the recent progress of the multistream classification area, most works are designed from a single-domain perspective in which both source and target streams are drawn from the same feature space. In addition, existing solutions incur expensive labelling cost because they require a full supervision of the source stream.

Practical Scenario: This paper puts into perspective a cross-domain multistream classification problem under extreme label scarcity where, unlike conventional multistream classification problems, the source stream and the target stream are generated from different feature spaces but share the same target attributes. The extreme label scarcity issue manifests in the fact that no label is provided for the target stream while only few labelled samples are made available in the source stream during the warm-up phase. That is, an operator is only capable of labelling a few prerecorded samples of the source stream while leaving the rest of the data samples of the source stream unlabelled. The practical scenario of this problem is seen in the condition monitoring problem involving different machines. Instead of building a machine-specific model for monitoring purposes, a single machine-invariant model is constructed, thereby saving significant development costs because data collection, annotation and preprocessing do not have to be repeated for each machine. Nevertheless, this task is challenging because data samples captured by sensors are streaming in nature. Different machines are fitted with different sensors or sensor types, thereby producing different feature spaces, while having different sampling rates leading to different batch sizes. Process deviations due to tool wear or any other external influencing factors occur independently in each machine at different time points, leading to drifting data distributions in each machine with different rates, magnitudes and types. The issue of labelling cost occurs because visual inspections, which interrupt machine operations, are necessitated to annotate data samples. This hinders the labelling process during process runs. The labelling process can be done only for prerecorded samples to avoid frequent stoppages of machine operations.

We visualize the significance of label scarcity in the context of domain adaptation in Fig. 1 where DANN [3] is evaluated under different label proportions of source streams. Our numerical results are produced in the office31 problem (Webcam \rightarrow DSLR) using five label ratios: 10\%, 20\%, 30\%, 40\%, 50\%. It is observed that DANN’s performances are significantly compromised with reductions of label proportions, falling below 10\% accuracy on source and target streams under the 5\% and 10\% label proportions. That is, its accuracy on source and target streams consistently slips.

Our Contribution: The Learning Streaming Process from Partial Ground Truth (LEOPARD) approach is proposed in this paper to resolve the cross-domain multistream classification problem under extreme label scarcity. LEOPARD is developed under the framework of a flexible deep clustering network featuring an elastic and progressive network structure to handle changing data distributions. That is, hidden nodes, hidden layers and hidden clusters are self-evolved in response to the asynchronous drift problem in both source and target streams. The learning process of LEOPARD aims to achieve two objectives under shared network parameters, a clustering-friendly latent space and cross-domain alignment, by minimizing three loss functions. The reconstruction loss functions as a nonlinear dimension reducer which projects input samples into a low dimension and establishes a common latent space between the source stream and the target stream. It is achieved by a stacked autoencoder (SAE) performing nonlinear mapping. The second component is the clustering loss creating a clustering-friendly latent space and preventing the trivial solution. The cross-domain adaptation loss is meant to induce the domain alignment and utilizes the adversarial domain adaptation approach [3]. This strategy relies on a domain classifier classifying the origin of data samples and a feature extractor. The feature extractor and the domain classifier compete with each other, thus resulting in domain-invariant representations. LEOPARD does not call for any labelled samples for its updates, and the few prerecorded labelled samples of the source stream are used only to establish the class-to-cluster relationship.

This paper presents four major contributions: 1) it proposes a new problem, namely the cross-domain multistream classification problem under extreme label scarcity; 2) an algorithm, namely LEOPARD, is developed to address the issue of label scarcity in the cross-domain multistream classification problem; 3) a joint optimization problem is formulated to attain the clustering-friendly latent space as well as the domain alignment such that the target stream can be predicted accurately with very few labels of the source stream and no labels of the target stream; 4) the source code of LEOPARD along with all datasets are made public at https://github.com/wengweng001/LEOPARD.git to enable further study. Our numerical study has substantiated the efficacy of LEOPARD in handling the issue of extreme label scarcity in the cross-domain multistream classification problem. It delivers highly competitive performances compared to prominent algorithms.

II Related Works

Multistream Classification: The area of multistream classification has attracted growing research interest, as evidenced by the number of works published in the literature. A pioneering work is proposed in [1] using the kernel mean matching (KMM) method as a domain adaptation technique combined with a drift detection method to detect the concept drift in each domain. Considering the high computational complexity and memory demand of [1], FUSION is proposed in [4] where the KLIEP method is implemented for domain adaptation while a density ratio method is designed for detecting the asynchronous drifts. MSCRDR is put forward in [5] and uses the Pearson divergence method for domain adaptation. Recently, a deep learning algorithm, namely ATL, is proposed to solve the multistream classification problem using an encoder-decoder structure under shared parameters coupled with the KL divergence method for domain adaptation [6]. ATL characterizes an inherent drift handling aptitude with a self-evolving network structure. MELANIE is proposed in [7] to handle the multi-source multistream classification problem. This work is extended in [8]. Another solution of multi-source multistream classification is offered in [9] where a CMD-based regularization is integrated. The problem of multi-source unsupervised domain adaptation under both homogeneous and heterogeneous settings is discussed in [10]. The area of multistream classification deserves an in-depth study for at least two reasons: 1) these approaches are designed for a single-domain problem where both source and target streams share the same feature space (domain). To the best of our knowledge, there exists only one work in the literature handling the cross-domain multistream classification problem [11] using the empirical maximum mean discrepancy for domain adaptation. However, this approach is based on a non-deep-learning solution relying on a simple linear projection for feature transformation, prone to trivial solutions; 2) although these approaches rely on unsupervised domain adaptation approaches where no label is offered for the target stream, full annotations are required for the source stream. On the other hand, the multistream classification problem distinguishes itself from the online transfer learning problem [2] assuming a fixed and static source domain. Hence, the asynchronous drift problem is absent in the online transfer learning problem.

Semi-Supervised Transfer Learning: The issue of labelling cost has attracted research interest in the transfer learning community. In [12], the notion of complementary labels, incurring a less expensive labelling cost than true class labels, is implemented. Dual deep neural networks are designed in which one focuses on complementary labels while the other handles the domain adaptation. [13] concerns the reduction of the labelling cost in the heterogeneous domain adaptation problem, which usually calls for some labelled samples of the target domain. Another effort is devoted to reducing the labelling cost in [14], which concerns an open-set domain adaptation where the target domain contains unknown classes. The use of noisy labels for unsupervised domain adaptation has been investigated in [15]. Our work differs from these works in two aspects: 1) LEOPARD handles the situation of cross-domain multistream classification under extreme label scarcity. That is, labelled samples are only revealed for the source stream during the warm-up period while no labelled samples for either stream are given for model updates during the process runs; 2) the learning approach is designed for the stream learning scenario.


Figure 1: DANN performance on Office31 (D\rightarrowW) under different label proportions of the source stream, leaving the target stream unlabelled.

III Problem Formulation

Suppose that D_{S},D_{T} stand for the source and target domains respectively. The goal of domain adaptation is to solve a classification problem of the target domain D_{T} without any labels by transferring knowledge from the source domain D_{S} where there exist some labelled samples. Referring to [16], the generalization error of the target domain is upper bounded by how well a model f(.) learns the source domain and the discrepancies between the two domains:

\epsilon_{T}(f)\leq\underbrace{\epsilon_{S}(f)}_{1^{st}}+\underbrace{d_{1}(D_{S},D_{T})}_{2^{nd}}+\underbrace{\min{(E_{D_{S}}(f_{S},f_{T}),E_{D_{T}}(f_{S},f_{T}))}}_{3^{rd}} (1)

where the first term is the source error, the second term is the divergence between the two domains, and the last term is the difference in the labelling functions of the two domains, which should be small [16]. Direct minimization of the divergence is a challenging task due to a lack of correspondence between data samples of the two domains. For streaming data, this poses a major challenge because the divergence measure works with a finite number of samples.

A cross-domain multistream classification problem under extreme label scarcity is defined as a classification problem of two independent streaming data B_{1}^{S},B_{2}^{S},...,B_{K_{S}}^{S} and B_{1}^{T},B_{2}^{T},...,B_{K_{T}}^{T}, termed the source stream and the target stream respectively, where K_{S},K_{T} are the numbers of source-stream and target-stream batches, unknown in practice. B_{k_{s}}^{S},B_{k_{t}}^{T} are drawn from the source domain D_{S} and the target domain D_{T} respectively. Extreme label scarcity is perceived in the limited access to ground truth where only prerecorded samples of the source stream B_{0}^{S}=\{x_{i}^{S},y_{i}^{S}\}_{i=1}^{N_{m}} are labelled while no label is provided during the process runs B_{k_{s}}^{S}=\{x_{i}^{S}\}_{i=1}^{N_{S}}. N_{m},N_{S} denote the number of prerecorded data samples of the source stream and the size of the source stream respectively. On the other hand, the target stream suffers from the absence of true class labels, B_{k_{t}}^{T}=\{x_{i}^{T}\}_{i=1}^{N_{T}}, where N_{T} is the size of the target stream. Note that we consider a case where both source and target domains are streaming in nature. x_{i}^{S}\in\mathcal{X}_{S}, x_{i}^{T}\in\mathcal{X}_{T}, \mathcal{X}_{S}\neq\mathcal{X}_{T} are input vectors of the source stream and the target stream while y_{i}=[l_{1},l_{2},...,l_{m}] is a target vector formed as a one-hot vector, y_{i}^{S},y_{i}^{T}\in\mathcal{Y}. (x_{i}^{S},y_{i}^{S})\in\mathcal{X}_{S}\times\mathcal{Y} and (x_{i}^{T},y_{i}^{T})\in\mathcal{X}_{T}\times\mathcal{Y}. That is, the two domains feature different feature spaces but share the same labelling function and target variables (cross-domain). The two streaming data are sampled at different speeds, resulting in N_{S}\neq N_{T}, i.e., different batch sizes, while following different distributions P(x_{S})\neq P(x_{T}) (covariate shift). The source stream and the target stream are non-stationary in nature where their concepts are drifting, P(x,y)_{t}^{S}\neq P(x,y)_{t+1}^{S}, P(x,y)_{t^{\prime}}^{T}\neq P(x,y)_{t^{\prime}+1}^{T}, t\neq t^{\prime}, i.e., concept drifts of the two streams might develop at different time periods t\neq t^{\prime} (asynchronous drift).

IV Learning Procedure of LEOPARD

IV-A Network Structure of LEOPARD

LEOPARD is structured as a deep clustering network developed with a feature extraction layer extracting natural features Z from raw input features x by means of a mapping function F_{W_{f}}(.) where W_{f} stands for the parameters of the feature extractor. The extracted features Z are passed to a fully connected layer formed as a stacked autoencoder (SAE) with a tied-weight constraint. That is, the decoder parameters are the inverse mapping of the encoder parameters. The natural features Z\in\Re^{u^{\prime}} are projected to a low-dimensional latent space h^{l}\in\Re^{R_{l}} where u^{\prime},R_{l} are respectively the number of natural features and the number of hidden nodes at the l-th layer, R_{l}<<u^{\prime}. The encoding and decoding mechanisms are expressed as:

h^{l}=r(W_{enc}^{l}h^{l-1}+b^{l});\quad h^{0}=Z (2)
\hat{h}^{l-1}=r(W_{dec}^{l}h^{l}+c^{l});\quad\forall l=1,\dots,L (3)

where W_{enc}^{l}\in\Re^{R_{l}\times u_{l}},b^{l}\in\Re^{R_{l}} stand for the connective weights and biases of the l-th layer of the encoder respectively while W_{dec}^{l}\in\Re^{u_{l}\times R_{l}},c^{l}\in\Re^{u_{l}} denote the connective weights and biases of the l-th layer of the decoder respectively. The tied-weight constraint W_{dec}^{l}=(W_{enc}^{l})^{T} functions as a regularization mechanism preventing the issue of overfitting.
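The sketch below illustrates one tied-weight SAE layer implementing (2)-(3) in PyTorch; the activation choices and the class name are illustrative assumptions rather than LEOPARD's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoderLayer(nn.Module):
    """One SAE layer with a tied-weight decoder, W_dec^l = (W_enc^l)^T."""

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.empty(hidden_dim, in_dim))
        nn.init.xavier_uniform_(self.W_enc)              # Xavier initialization, as in the text
        self.b = nn.Parameter(torch.zeros(hidden_dim))   # encoder bias b^l
        self.c = nn.Parameter(torch.zeros(in_dim))       # decoder bias c^l

    def encode(self, h_prev):
        # Eq. (2): h^l = r(W_enc^l h^{l-1} + b^l); ReLU assumed for intermediate layers
        return F.relu(F.linear(h_prev, self.W_enc, self.b))

    def decode(self, h):
        # Eq. (3): h_hat^{l-1} = r(W_dec^l h^l + c^l) with W_dec^l = (W_enc^l)^T
        return torch.sigmoid(F.linear(h, self.W_enc.t(), self.c))

    def forward(self, h_prev):
        h = self.encode(h_prev)
        return h, self.decode(h)
```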

The clustering mechanism is carried out in each deep embedding space, i.e., each latent space. That is, it takes place in every hidden layer of the SAE h^{l}, creating different representations of data samples. The inference mechanism is performed by first calculating the similarity degree between a data sample and a hidden cluster [17]:

\phi_{j}^{l}=\frac{(1+\|h^{l}-C_{j}^{l}\|_{2}/\lambda)^{\frac{-(\lambda+1)}{2}}}{\sum_{j=1}^{Clus^{l}}(1+\|h^{l}-C_{j}^{l}\|_{2}/\lambda)^{\frac{-(\lambda+1)}{2}}} (4)

where C_{j}^{l},h^{l} are the centroid of the j-th cluster of the l-th layer and the latent representation of a data sample x at the l-th layer, while Clus^{l} is the number of clusters created in the l-th latent space, i.e., the l-th layer of the SAE. \lambda=1 is chosen here. The Student's t-distribution is adopted to model the similarity degree, and \phi_{j}^{l} is also regarded as the cluster posterior probability P(C_{j}^{l}|X) [18] where P(C_{j}|X)=1 represents the case of a perfect match between h^{l} and C_{j}^{l}. The similarity degree \phi_{j}^{l} is aggregated across the N_{m} prerecorded samples having true class labels B_{0}^{S}=\{x_{i}^{S},y_{i}^{S}\}_{i=1}^{N_{m}}. This operation produces the cluster’s allegiance [19] measuring a cluster’s tendency toward a particular class. Suppose that N_{o} stands for the number of prerecorded samples having the o-th class as their labels, the cluster allegiance Ale_{j,o}^{l} is calculated as:

Ale_{j,o}^{l}=\frac{\sum_{n=1}^{N_{o}}\phi_{j,o}^{n,l}}{\sum_{o=1}^{m}\sum_{n=1}^{N_{o}}\phi_{j,o}^{n,l}} (5)

where \phi_{j,o}^{n,l} measures the similarity degree between the cluster C_{j}^{l} and the n-th prerecorded sample h_{o}^{l} falling into the o-th class. (5) pinpoints the neighborhood degree of the j-th cluster to the o-th class and implies that an unclean cluster, occupied by data samples of mixed classes, possesses a low cluster allegiance. The winner-takes-all principle win=\arg\max_{j=1,...,Clus^{l}}\phi_{j}^{l} is adopted here, where a data sample is associated with the nearest cluster. The local score of the l-th layer is defined as the allegiance of the winning cluster Score^{l}=Ale_{win}^{l}. The predicted class label \hat{Y} is determined as the class label maximizing its global score. The global score is calculated as the summation of the local scores across the L layers:

\hat{Y}=\arg\max_{o=1,...,m}\sum_{l=1}^{L}Score^{l} (6)

where a majority voting approach is implemented. It is evident that LEOPARD merely benefits from the labelled prerecorded samples of the source stream B_{0}^{S} to associate a cluster with a specific class. No label at all from either stream is solicited in the streaming phase, which confirms its applicability in extreme label scarcity environments. Fig. 2 visualizes the network structure of LEOPARD. It is perceived that the clustering process occurs in every hidden layer of LEOPARD, thus producing its local outputs. The final predicted class label is aggregated across all layers making use of a summation operation. R_{l},L,Clus^{l} are self-evolved in response to varying distributions.
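The inference path of (4)-(6) can be sketched as follows; this is a minimal illustration assuming batched tensors and a fixed set of centroids per layer, with helper names chosen for readability rather than taken from the released code.

```python
import torch

def cluster_similarity(h, centroids, lam=1.0):
    """Eq. (4): Student-t similarity of latent samples h (B, R_l) to centroids (K, R_l)."""
    dist = torch.cdist(h, centroids)                     # ||h^l - C_j^l||_2
    sim = (1.0 + dist / lam) ** (-(lam + 1.0) / 2.0)
    return sim / sim.sum(dim=1, keepdim=True)            # normalise over clusters

def cluster_allegiance(sim, labels, n_classes):
    """Eq. (5): per-cluster class tendencies from the few labelled prerecorded samples."""
    one_hot = torch.nn.functional.one_hot(labels, n_classes).float()  # (N_m, m)
    votes = one_hot.t() @ sim                            # (m, K): summed phi per class
    return (votes / votes.sum(dim=0, keepdim=True)).t()  # (K, m): Ale_{j,o}

def predict(h_per_layer, centroids_per_layer, allegiance_per_layer):
    """Eq. (6): winner-takes-all per layer, then sum the local scores across layers."""
    global_score = 0.0
    for h, C, Ale in zip(h_per_layer, centroids_per_layer, allegiance_per_layer):
        sim = cluster_similarity(h, C)
        win = sim.argmax(dim=1)                          # nearest cluster per sample
        global_score = global_score + Ale[win]           # local score = winner's allegiance
    return global_score.argmax(dim=1)                    # class maximising the summed score
```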


Figure 2: Network Structure of LEOPARD: LEOPARD adopts the different-depth network structure where the clustering module is implemented in every layer of SAE thus producing its own local outputs. The final predicted label is aggregated across different embedding layers.


Figure 3: LEOPARD operates in the extreme label scarcity condition where only prerecorded samples of the source stream are labelled while the rest are unlabelled. The learning algorithm of LEOPARD consists of three modules (feature extractor, classifier, domain classifier). The feature extractor generates latent features, the classifier produces the final prediction, and the domain classifier identifies the sample origin. The feature extractor is updated by taking the gradient of the clustering loss and the cross-domain loss. The gradient reversal layer is implemented to change the sign of the gradient of the cross-domain loss. The classifier is updated by minimizing the clustering loss and the domain classifier is adjusted by minimizing the cross-domain loss. The classifier features a self-evolving characteristic whereas the domain classifier and feature extractor have fixed structures.

IV-B Parameter Learning of LEOPARD

Adversarial Domain Adaptation: the idea of domain adaptation is to minimize the divergence between the target domain and the source domain. The concept of adversarial domain adaptation is founded on the idea of the H divergence [3], which relies on a hypothesis class H, a set of binary classifiers. Definition 1 [20]: Given the two domains D_{S} and D_{T} and the hypothesis class H, the H divergence between D_{S} and D_{T} is defined as follows:

d_{H}(D_{S},D_{T})=2\sup_{\eta\in H}|\Pr_{x\sim D_{S}}[\eta(x)=1]-\Pr_{x\sim D_{T}}[\eta(x)=1]| (7)

The H divergence in (7) relies on the hypothesis class H to distinguish data samples generated from D_{S} from data samples generated from D_{T}. In [20], the empirical H divergence can be used in the case of a symmetric hypothesis class H:

d_{H}(D_{S},D_{T})=2\left(1-\min_{\eta\in H}\left[\frac{1}{n}\sum_{x\sim D_{S}}I[\eta(x)=0]+\frac{1}{n^{\prime}}\sum_{x\sim D_{T}}I[\eta(x)=1]\right]\right) (8)

where I[a] denotes an indicator function returning 1 if a is true or 0 otherwise. This implies that the H divergence can be minimized by finding a representation where the source and target samples are indistinguishable [3].
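For concreteness, the empirical estimate in (8) can be evaluated as below once a binary domain classifier is available; the function and argument names are hypothetical, and the minimum over H is approximated in practice by the trained domain classifier.

```python
def empirical_h_divergence(eta, X_source, X_target):
    """Evaluate Eq. (8) for a single binary classifier eta(x) in {0, 1}.

    eta labels a sample with 1 when it believes the sample comes from the
    target domain; the min over the hypothesis class H is approximated in
    practice by plugging in a trained domain classifier.
    """
    n, n_prime = len(X_source), len(X_target)
    term_s = sum(int(eta(x) == 0) for x in X_source) / n        # I[eta(x)=0] on source
    term_t = sum(int(eta(x) == 1) for x in X_target) / n_prime  # I[eta(x)=1] on target
    return 2.0 * (1.0 - (term_s + term_t))
```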

The concept of adversarial domain adaptation can be implemented by deploying a domain classifier \zeta_{W_{DC}}(F_{W_{f}}(.)) working along with the feature extractor F_{W_{f}}(.) and a classifier \xi_{W_{C}}(F_{W_{f}}(.)). The domain classifier predicts the origin of data samples, whether they are generated by the source domain D_{S} or the target domain D_{T}, while the classifier generates the final output of the network. The gradient reversal layer is implemented in updating the feature extractor such that indistinguishable features of the source and target domains are induced. That is, the overall loss function is written as follows:

L=\frac{1}{N_{S}}\sum_{n_{1}=1}^{N_{S}}L_{\xi}(\xi_{W_{C}}(F_{W_{f}}(x_{n_{1}})),y_{n_{1}})-\lambda\Big(\frac{1}{N_{S}}\sum_{n_{2}=1}^{N_{S}}L_{\zeta}(\zeta_{W_{DC}}(F_{W_{f}}(x_{n_{2}})),d_{n_{2}})+\frac{1}{N_{T}}\sum_{n_{3}=1}^{N_{T}}L_{\zeta}(1-\zeta_{W_{DC}}(F_{W_{f}}(x_{n_{3}})),d_{n_{3}})\Big) (9)

where L_{\xi,\zeta}(.) is implemented as the cross entropy loss function and d_{n} is the domain identity, i.e., 1 for the source domain and 0 for the target domain. From (9), the gradient reversal layer inserts a negative constant confusing the domain classifier, i.e., generating indistinguishable samples. The parameter learning process is formulated as follows:

W_{f}=W_{f}-\mu\Big(\frac{\partial L_{\xi}}{\partial W_{f}}-\alpha_{1}\frac{\partial L_{\zeta}}{\partial W_{f}}\Big) (10)
W_{C}=W_{C}-\mu\frac{\partial L_{\xi}}{\partial W_{C}} (11)
W_{DC}=W_{DC}-\mu\lambda\frac{\partial L_{\zeta}}{\partial W_{DC}} (12)

where the feature extractor is trained to produce similar features for the two domains, as reflected by the negative sign of the gradient, thereby leading to a domain-invariant network.
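A common way to realise the sign flip in (10) within automatic differentiation is a gradient reversal layer; the sketch below is a generic PyTorch illustration (names and the usage snippet are illustrative, not LEOPARD's released code).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass,
    gradient multiplied by -alpha in the backward pass."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

def grad_reverse(x, alpha=0.1):
    return GradReverse.apply(x, alpha)

# Typical usage in the adversarial branch (hypothetical module names):
#   z = feature_extractor(x)                          # F_{W_f}(x)
#   domain_logit = domain_classifier(grad_reverse(z, alpha))
#   loss_cd = bce(domain_logit, d)                    # descent on W_DC,
#                                                     # ascent on W_f via the reversal
```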
Loss Function: the parameter learning strategy of LEOPARD is constructed using a joint loss function comprising two modules: clustering loss and cross-domain adaptation loss. The underlying goal is to produce domain-invariant parameters as well as clustering-friendly latent spaces such that the online cross domain adaptation can be solved under extreme label scarcity. The overall cost function is formalized as follows:

L_{all}=L_{cluster}-\alpha_{1}L_{cd} (13)

where L_{cluster},L_{cd} respectively denote the clustering loss and the cross-domain adaptation loss while \alpha_{1} is a trade-off constant controlling the influence of the cross-domain adaptation loss. It is an unconstrained optimization problem which can be optimized using the stochastic gradient descent approach in a single pass or with a few epochs per batch to assure scalability in streaming environments. That is, a number of iterations is performed per batch, and a data batch is discarded once the iterations across the predetermined number of epochs are completed, bounding complexity. The negative sign in (13) follows the gradient reversal strategy generating similar features across the two domains. In other words, the gradient of the cross-domain loss, i.e., the domain classifier loss, is subtracted from the gradient of the clustering loss [3].
Clustering-Friendly Latent Space: the clustering loss aims to achieve the clustering-friendly latent space via simultaneous feature learning and clustering. The clustering loss is formulated as the reconstruction loss plus the KL divergence loss minimizing the probabilistic distance between the latent-space distribution and the auxiliary target distribution [17]:

L_{cluster}=\underbrace{L_{\xi}(x_{S,T},\hat{x}_{S,T})}_{L_{1}}+\underbrace{\sum_{l=1}^{L}\left(L_{\xi}(h_{S,T}^{l},\hat{h}_{S,T}^{l})+\alpha_{2}KL(\phi^{l}|\Phi^{l})\right)}_{L_{2}} (14)

where \Phi^{l} is the auxiliary target distribution of the l-th latent space and \alpha_{2} is a regularization constant controlling the strength of the KL divergence loss. \phi^{l} is the similarity degree of the current sample to existing clusters. L_{\xi}(.) is the reconstruction loss formed as the mean square error (MSE) loss function. It also performs nonlinear dimension reduction preventing the trivial solutions often happening in the case of linear mapping. It guarantees that a data sample can be mapped back to its original representation. The key difference between the two loss functions lies in the adaptation mechanism in which L_{1} is solved in the end-to-end fashion while L_{2} is carried out in the layer-wise fashion.

The last term, also known as the KL divergence loss, minimizes the discrepancy between the distribution of the current data batch calculated via (4) and the auxiliary target distribution, KL(\phi^{l}|\Phi^{l})=\sum_{i}\sum_{j}\phi_{i,j}^{l}\log{\frac{\phi_{i,j}^{l}}{\Phi_{i,j}^{l}}}. The auxiliary target distribution should satisfy three requirements [17]: 1) improve prediction; 2) emphasize samples of high confidence; 3) normalize the loss contribution of each cluster to avoid the creation of large clusters. We adopt the same auxiliary distribution as in [17] where \Phi_{i,j}^{l} is obtained by raising \phi_{i,j}^{l} to the second power and normalizing by the frequency per cluster:

\Phi_{i,j}^{l}=\frac{(\phi_{i,j}^{l})^{2}/\zeta_{j}}{\sum_{j=1}^{Clus^{l}}(\phi_{i,j}^{l})^{2}/\zeta_{j}} (15)

where \zeta_{j}=\sum_{i=1}^{N}\phi_{i,j}^{l} is the frequency of a cluster. This strategy is understood as the soft-cluster assignment [17] where all clusters are updated, and differs from the hard-cluster assignment which only tunes the winning cluster. The clustering mechanism is hard to conduct in a high-dimensional space [21], thus calling for feature learning steps to be committed simultaneously. The clustering process takes place in every latent space h^{l} set as the common feature space between the source and target domains. That is, (14) is executed using samples of both the source and target streams. This process also functions as an implicit domain adaptation strategy since the minimization of the reconstruction loss across the two streams with shared parameters ends up with an overlapping region of both domains [6]. The optimization procedure takes place simultaneously where the network parameters and the cluster parameters are adjusted concurrently with the SGD method.
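The per-layer KL term and the auxiliary distribution (15) can be sketched as below; this is a minimal illustration assuming batched soft assignments per layer, with the detached target and the function names being assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def target_distribution(phi):
    """Eq. (15): sharpen soft assignments phi (batch, clusters),
    normalised by the cluster frequency zeta_j."""
    weight = phi ** 2 / phi.sum(dim=0, keepdim=True)     # (phi_ij)^2 / zeta_j
    return weight / weight.sum(dim=1, keepdim=True)      # renormalise per sample

def clustering_loss(x, x_hat, h_layers, h_hat_layers, phi_layers, alpha2=1.0):
    """Eq. (14): end-to-end reconstruction (L1) plus layer-wise
    reconstruction and KL divergence (L2)."""
    loss = F.mse_loss(x_hat, x)                          # L1
    for h, h_hat, phi in zip(h_layers, h_hat_layers, phi_layers):
        Phi = target_distribution(phi).detach()          # auxiliary target, treated as fixed
        # KL(phi | Phi) = sum_i sum_j phi_ij * log(phi_ij / Phi_ij)
        kl = F.kl_div(Phi.clamp_min(1e-8).log(), phi, reduction="batchmean")
        loss = loss + F.mse_loss(h_hat, h) + alpha2 * kl # L2 per layer
    return loss
```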
Domain-Invariant Network: LEOPARD consists of three sub-modules, a feature extractor F(.), a classifier \xi(.) and a domain classifier \zeta(.), to achieve the domain-invariant property as depicted in Fig. 3. The feature extractor is parameterized by W_{f} and the classifier, formed as the deep clustering module, is parameterized by W_{C}\in\{W_{enc}^{l},W_{dec}^{l},C^{l}\}, while the domain classifier, formed as a single hidden layer network, is parameterized by W_{DC}. The feature extractor and the domain classifier play a minimax game via the gradient reversal layer where the feature extractor is trained to fool the domain classifier via the production of similar features of source and target streams while the domain classifier is trained to identify the origin of data samples. The cross-domain adaptation loss is thus formulated as the domain classifier loss as follows:

L_{cd}=\frac{1}{N_{S}}\sum_{n=1}^{N_{S}}L_{\zeta}(\zeta_{W_{DC}}(F_{W_{f}}(x_{n})),d_{n})+\frac{1}{N_{T}}\sum_{n^{\prime}=1}^{N_{T}}L_{\zeta}(1-\zeta_{W_{DC}}(F_{W_{f}}(x_{n^{\prime}})),d_{n^{\prime}}) (16)

where d_{n} stands for the origin of data samples, i.e., 1 for the source stream and 0 for the target stream. The domain classifier is tasked to solve a binary classification problem where L_{\zeta}(.) is set as the cross entropy loss function. This leads to similar parameter learning processes for the feature extractor, domain classifier and classifier respectively as in (10)-(12), except for the presence of the clustering loss (14) instead of the cross-entropy loss: W_{f}=W_{f}-\mu(\frac{\partial L_{cluster}}{\partial W_{f}}-\alpha_{1}\frac{\partial L_{cd}}{\partial W_{f}});\;W_{C}=W_{C}-\mu\frac{\partial L_{cluster}}{\partial W_{C}};\;W_{DC}=W_{DC}-\mu\alpha_{1}\frac{\partial L_{cd}}{\partial W_{DC}}, where \mu denotes the learning rate. Note that the gradient reversal layer has no parameters and simply alters the sign of the gradients, allowing the maximization process to be carried out via the stochastic gradient descent approach. This only applies to the feature extractor as illustrated in Fig. 3.

IV-C Structural Learning of LEOPARD

Evolution of Clusters: The classifier of LEOPARD implements a self-organizing mechanism of network clusters where clusters are flexibly grown in every hidden layer h^{l} if changing data distributions are identified. Furthermore, it is performed for both source data samples h_{S}^{l} and target data samples h_{T}^{l}. That is, the clustering mechanism does not generate stream-specific clusters. Suppose that D(X,Y) stands for the L_{2} distance between two variables X,Y and the i-th cluster of the l-th layer is parameterized by its centre C_{i}^{l}, the growing condition is formulated as follows:

\min_{i=1,...,Clus^{l}}D(h^{l},C_{i}^{l})>\mu_{D,i}^{l}+k_{1}\sigma_{D,i}^{l} (17)

where \mu_{D,i}^{l},\sigma_{D,i}^{l} denote the mean and standard deviation of the distance D(h^{l},C_{i}^{l}) of the i-th cluster of the l-th layer while k_{1}=2\exp(-\|h^{l}-C_{win}^{l}\|)+2, leading to a dynamic confidence degree. The dynamic confidence degree enables the cluster growing phase to be carried out in the case of a far proximity between a data sample and the winning cluster. (17) examines the coverage span of existing clusters where a new cluster is inserted if a data sample is remote from the influence zone of existing clusters or a concept drift develops. A new cluster is crafted by assigning the current sample of interest as the cluster’s center, C_{Clus^{l}+1}^{l}=h^{l}, and setting the cluster’s cardinality to N_{Clus^{l}+1}=1.
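A sketch of this growing rule is given below; it assumes running distance statistics are kept per cluster (their online update is omitted) and that the winning cluster's statistics are used in the test, so the variable names are illustrative.

```python
import torch

def maybe_grow_cluster(h, centroids, dist_mean, dist_std, cardinality):
    """Growing rule (17): add a cluster when the nearest centroid is too far away."""
    d = torch.cdist(h.unsqueeze(0), centroids).squeeze(0)     # D(h^l, C_i^l) for all i
    win = d.argmin()
    k1 = 2.0 * torch.exp(-d[win]) + 2.0                       # dynamic confidence degree
    if d[win] > dist_mean[win] + k1 * dist_std[win]:          # Eq. (17)
        centroids = torch.cat([centroids, h.unsqueeze(0)])    # new centre C = h^l
        cardinality = torch.cat([cardinality, torch.ones(1)]) # cardinality N = 1
    return centroids, cardinality
```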
Evolution of Network Structure: The classifier of LEOPARD is equipped with hidden node growing and pruning strategies adapting to the concept drifts of data streams. That is, this mechanism takes place for both the source stream and the target stream. The self-organizing mechanism is controlled by the network significance (NS) method [22] adopting the bias-variance decomposition concept for every layer. That is, a high bias situation leads to the introduction of a new node while a high variance condition triggers the node pruning mechanism. Note that the network bias and variance here are evaluated with respect to the local error of a layer. All of this is carried out in an unsupervised fashion with respect to the reconstruction error. The network significance (NS) method is formalized as follows:

NS=(E[\hat{h}^{l}]-h^{l})^{2}+(E[(\hat{h}^{l})^{2}]-E[\hat{h}^{l}]^{2}) (18)

(18) can be solved by finding the expected output E[\hat{h}^{l}] under a certain probability density function p(x), assumed to follow the normal distribution N(\mu,\sigma^{2}) with mean \mu and variance \sigma^{2}. The bottleneck of this approach is found in the case of drift, p(x)_{t}\neq p(x)_{t+1}, where it does not keep pace with rapidly changing distributions. To correct this shortcoming, the Autonomous Gaussian Mixture Model (AGMM) can be used to estimate a complex probability density function p(x) as done in [6]. However, AGMM is computationally expensive and often unstable in the high input dimension case due to the use of the product norm. Furthermore, we deal with a multi-layer network here, doubling the complexity of AGMM.

The hidden unit growing and pruning steps are signalled by the statistical process control (SPC) approach [23] commonly used for anomaly detection tasks. The SPC method is applied here to detect the high bias or high variance condition and written as follows:

\mu_{bias}^{n,l}+\sigma_{bias}^{n,l}\geq\mu_{bias}^{min,l}+k_{2}\sigma_{bias}^{min,l} (19)
\mu_{var}^{n,l}+\sigma_{var}^{n,l}\geq\mu_{var}^{min,l}+2k_{3}\sigma_{var}^{min,l} (20)

The SPC method is generalized here using k_{2}=1.3\exp{(-Bias^{2})}+0.7 and k_{3}=1.3\exp{(-Var^{2})}+0.7. This modification leads to dynamic confidence levels enabling flexible growing and pruning phases. That is, the node growing process is likely to be performed in the case of a high bias while being strict in the case of a low bias. The same applies to the node pruning mechanism. \mu_{bias}^{min,l} and \sigma_{bias}^{min,l} are reset if the growing condition (19) is satisfied. On the other hand, if the pruning condition is met, \mu_{var}^{min,l} and \sigma_{var}^{min,l} are reset. The initialization of a new node is carried out using the Xavier initialization strategy. The node with the least statistical contribution is subject to the pruning step if (20) is observed. Since LEOPARD is constructed under a different-depth structure where every layer performs its own clustering mechanism and produces its local output, the growing and pruning steps are undertaken independently per layer. Furthermore, this mechanism occurs for both source and target streams to anticipate the asynchronous drift problem where the network structure is shared across the two domains.
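The growing and pruning tests (19)-(20) with the dynamic confidence levels can be sketched as follows; `stats` is a hypothetical container for the running and minimum bias/variance statistics, whose online updates are omitted here.

```python
import math

def spc_signals(bias, var, stats):
    """SPC-style tests (19)-(20) with dynamic confidence degrees k2, k3."""
    k2 = 1.3 * math.exp(-bias ** 2) + 0.7   # shrinks toward 0.7 when bias is high -> easier to grow
    k3 = 1.3 * math.exp(-var ** 2) + 0.7    # shrinks toward 0.7 when variance is high -> easier to prune

    grow = stats.mean_bias + stats.std_bias >= stats.min_mean_bias + k2 * stats.min_std_bias
    prune = stats.mean_var + stats.std_var >= stats.min_mean_var + 2 * k3 * stats.min_std_var

    if grow:    # reset the recorded bias minima after a node is added
        stats.min_mean_bias, stats.min_std_bias = stats.mean_bias, stats.std_bias
    elif prune: # reset the recorded variance minima after a node is pruned
        stats.min_mean_var, stats.min_std_var = stats.mean_var, stats.std_var
    return grow, prune
```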

The classifier of LEOPARD is fitted with a hidden layer growing mechanism which expands the network depth based on a drift detection mechanism [24]. The drift detection mechanism is designed from the concept of the Hoeffding bound and analyzes the dynamics of the latent features Z to identify a change of the marginal distribution. Note that no labelled samples are offered for model updates and the drift detection approach is executed for both source and target streams. The addition of a network layer is desired in practice because it is capable of boosting network capacity significantly, thus enhancing the model’s generalization. The drift detection procedure starts by finding the cutting point, a point where the population mean increases. A cutting point is declared by the following condition:

\hat{P}+\epsilon_{P}\geq\hat{Q}+\epsilon_{Q} (21)

where P\in\Re^{2N} is a data matrix containing two consecutive data batches [B_{k-1},B_{k}], i.e., the previous and current data batches, while Q\in\Re^{cut} is a data matrix with cut as the hypothetical cutting point of interest, cut<2N. Two data batches are applied here to increase the sensitivity of the cutting point identification because latent features are relatively stable compared to the original input space. The hypothetical cutting points are arranged as cut=[25\%,50\%,75\%]\times 2N instead of every point to avoid false alarms. \hat{P},\hat{Q} denote the statistics of the data matrices P,Q. \epsilon_{P,Q} stand for the error bounds derived from the concept of the Hoeffding bound as follows:

\epsilon_{P,Q}=\sqrt{\frac{1}{2\times size}\ln{\frac{1}{\alpha_{x}}}} (22)

where \alpha_{x} is the significance level being inversely proportional to the confidence level 1-\alpha_{x} while size refers to the size of the data matrix of interest P,Q.

Once the cutting point of interest cut is elicited, a data matrix R\in\Re^{2N-cut} is constructed. A drift is signalled if |\hat{R}-\hat{Q}|\geq\epsilon_{D}. Besides the drift condition, a warning condition is set and pinpoints a case where a drift needs to be confirmed by the next data batch. That is, \epsilon_{W}\leq|\hat{R}-\hat{Q}|\leq\epsilon_{D} where \alpha_{D}<\alpha_{W}. The error bounds \epsilon_{D,W} are defined as follows:

\epsilon_{D,W}=(b-a)\times\sqrt{\frac{size-cut}{2\times cut\times size}\ln{\frac{1}{\alpha_{D,W}}}} (23)

where [a,b] denotes the range of the data matrix P. A new layer is created if a concept drift is found. The number of nodes of the new layer is set to half the width of the previous layer l-1. This step enables nonlinear feature reduction and avoids an over-complete network. The domain classifier and the feature extractor have a fixed structure because the structural learning of the classifier suffices to address the asynchronous drift problem.
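The layer-growing drift test (21)-(23) can be sketched as below; the sketch operates on 1-D summaries of two consecutive latent feature batches, uses the batch means as the statistics \hat{P},\hat{Q},\hat{R}, and omits the confirmation of a warning by the next batch, all of which are assumptions made for illustration.

```python
import numpy as np

def hoeffding_bound(size, alpha):
    """Eq. (22): error bound for a sample of the given size at significance alpha."""
    return np.sqrt(np.log(1.0 / alpha) / (2.0 * size))

def detect_drift(Z_prev, Z_cur, alpha_x=0.001, alpha_d=0.001, alpha_w=0.005):
    """Return "drift", "warning" or "stable" for two consecutive latent batches."""
    P = np.concatenate([Z_prev, Z_cur])              # two consecutive batches, length 2N
    n = len(P)
    a, b = P.min(), P.max()

    for frac in (0.25, 0.50, 0.75):                  # hypothetical cutting points
        cut = int(frac * n)
        Q = P[:cut]
        # Eq. (21): a cutting point is declared when the prefix statistic increases
        if P.mean() + hoeffding_bound(n, alpha_x) >= Q.mean() + hoeffding_bound(cut, alpha_x):
            R = P[cut:]
            gap = abs(R.mean() - Q.mean())
            scale = (b - a) * np.sqrt((n - cut) / (2.0 * cut * n))   # Eq. (23) prefactor
            eps_d = scale * np.sqrt(np.log(1.0 / alpha_d))
            eps_w = scale * np.sqrt(np.log(1.0 / alpha_w))
            if gap >= eps_d:
                return "drift"
            if eps_w <= gap:
                return "warning"
    return "stable"
```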

IV-D Algorithm

The learning policy of LEOPARD is visualized in Fig. 3 and Algorithm 1 where LEOPARD is driven by the feature extractor, the classifier and the domain classifier. The forward pass is done by feeding raw input attributes x_{S,T} to the feature extractor F(.), leading to latent input features Z_{S,T}. The latent features are passed to the classifier \xi(.) implemented as the SAE and the clustering module. Note that the clustering module exists in every layer of the SAE, producing its own local output Score^{l}, where majority voting is performed to generate the final predicted output. The learning process starts with a warm-up phase using N_{init} unlabelled samples iterated across E epochs to avoid the cold start problem. This process only involves the reconstruction losses L_{\xi}(x_{S,T},\hat{x}_{S,T}) and L_{\xi}(h^{l}_{S,T},\hat{h}^{l}_{S,T}), affecting only the network parameters W_{f} and W_{enc}^{l},W_{dec}^{l}. The main training loop is executed by minimizing L_{cluster}(.) and is applied to W_{f},W_{enc}^{l},W_{dec}^{l},C^{l}. The minimization of the clustering loss across the two domains can also be seen as a domain adaptation strategy because it leads to an overlapping region of the source domain and the target domain, i.e., both the source stream and the target stream are used under shared parameters. The adversarial domain adaptation is carried out afterward by minimizing L_{cd}(.) where the domain classifier \zeta_{W_{DC}}(.) is updated as well as the feature extractor F_{W_{f}}(.) using the cross-domain loss. The gradient reversal strategy is adopted when adjusting the feature extractor, thus converting the minimization problem into a maximization problem and in turn resulting in indistinguishable features of the source stream and the target stream. The cross-domain adaptation strategy makes it possible for the source and target streams, following different distributions, to be mapped similarly, i.e., the covariate shift is addressed.

Input: Source streaming data \{B_{0}^{S},B_{1}^{S},B_{2}^{S},...,B_{K_{S}}^{S}\}, target streaming data \{B_{1}^{T},B_{2}^{T},...,B_{K_{T}}^{T}\}, initialization epochs E_{init}, batch number of source and target streaming data b_{k}, epoch number E.
Output: Network parameters of the feature extractor W_{f}, classifier W_{C} and domain classifier W_{DC}. Average accuracy Acc.
1: for i=1:E_{init} do
2:     Initialize clusters using the scarcely labelled data B_{0}^{S};
3: end for
4: for j=1:E do
5:     Network layer evolution of the classifier (SAE) \xi_{W_{C}} by Eq. (23);
6:     Hidden unit growing and pruning of the classifier (SAE) \xi_{W_{C}} by Eqs. (19) and (20);
7:     L_{cluster}=L_{\xi}(x_{S,T},\hat{x}_{S,T})+\sum_{l=1}^{L}(L_{\xi}(h_{S,T}^{l},\hat{h}_{S,T}^{l})+\alpha_{2}KL(\phi^{l}|\Phi^{l}));
8:     Update the feature extractor parameters W_{f} and classifier parameters W_{C} with respect to L_{cluster};
9:     L_{cd}=\frac{1}{N_{S}}\sum_{n=1}^{N_{S}}L_{\zeta}(\zeta_{W_{DC}}(F_{W_{f}}(x_{n})),d_{n})+\frac{1}{N_{T}}\sum_{n^{\prime}=1}^{N_{T}}L_{\zeta}(1-\zeta_{W_{DC}}(F_{W_{f}}(x_{n^{\prime}})),d_{n^{\prime}});
10:    Update the feature extractor parameters W_{f} and domain classifier parameters W_{DC} with respect to L_{cd};
11: end for
12: return W_{f},W_{C},W_{DC} and average accuracy Acc;
Algorithm 1 LEOPARD

The structural learning process occurs in both the initialization phase and the main training phase, and includes the cluster growing process, the hidden node growing and pruning processes and the hidden layer growing process. As with the warm-up phase, an initialization phase using N_{init} prerecorded samples over E epochs is implemented if a new layer is created. It is obvious that LEOPARD does not exploit any labelled samples for model updates except for the labelled samples used to calculate the cluster allegiance (5). The structural learning mechanism addresses the issue of asynchronous drifts across both streams.

V Numerical Study

This section presents a numerical validation of LEOPARD putting forward nine datasets leading to 24 independent numerical results. An ablation study is added in this section to further numerically validate the contribution of each learning component. The source code of LEOPARD can be found at https://github.com/wengweng001/LEOPARD.git. Our analysis of label proportions and visualizations of LEOPARD’s learning performances are offered in the supplemental document.

V-A Dataset

MNIST(MN)\leftrightarrowUSPS(US): this problem presents a digit recognition problem having 10 classes. The data samples are formed by gray-scale images of hand-written digits resized to 28\times 28 in both the US\rightarrowMN and MN\rightarrowUS cases.
Amazon@X(AM): this is a multi-domain sentiment analysis problem encompassing product reviews obtained from Amazon.com. X stands for the product type [25]. Five product types, namely beauty, books, industrial, luxury and magazine, are selected here where the cross-domain multistream classification problem is formulated with two products with similar contexts but different topics. The averaged summed outputs from Google’s word2vec model pretrained on 100 billion words [26] are used to perform feature extraction.
Office31: this problem presents three domains: amazon (A), DSLR (D) and Webcam (W). It comprises 31 categories of office objects. We present the case of D\leftrightarrowW where D comprises 498 images and W consists of 795 images. The characteristics of the nine datasets are summarized in Table I.

V-B Simulation Protocol

The numerical study is carried out using the prequential test-then-train protocol as per [23], where a model is tested first before being updated with the same data batch. The numerical evaluation is independently undertaken per batch where the numerical result is averaged across all batches. Our simulation is repeated 5 times to guarantee the consistency of numerical results where the final numerical results are averaged over 5 independent runs. The asynchronous drift problem is induced by applying the scaling hyper-plane strategy [27, 28] where a data stream is scaled as x_{i}=\frac{d_{z}\times x_{i}}{||x||}. d_{z} is a randomly generated concept drift vector where z is the number of concept drifts in the stream: z=1 for every source stream and z=1 for every target stream. A fixed random seed is selected in setting d_{z} to assure fair comparison. In the MN\leftrightarrowUS problems, the concept drift occurs at k=35 for the source stream and k=36 for the target stream, whereas the concept drift takes place at k=5 for the source stream and k=6 for the target stream in the amazon@X and Office31 problems. These configurations ensure that the asynchronous drift is present.
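A minimal sketch of this drift injection is shown below; whether d_z is a scalar or a per-feature vector and its sampling range are assumptions here, since only the scaling formula and the fixed seed are specified above.

```python
import numpy as np

def inject_drift(X, seed=0, low=0.5, high=2.0):
    """Scaling hyper-plane drift [27, 28]: x_i <- d_z * x_i / ||x_i||.

    A fixed seed keeps the drift vector d_z identical across algorithms so the
    comparison stays fair; the uniform sampling range is an assumption.
    """
    rng = np.random.default_rng(seed)
    d_z = rng.uniform(low, high, size=X.shape[1])        # random concept drift vector
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return (d_z * X) / np.clip(norms, 1e-12, None)
```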

TABLE I: Characteristics of Datasets
Dataset Attributes Labels Samples NB
MNIST(MN) 784 10 70000 65
USPS(US) 256 10 9298 65
Amazon@Beauty(AM1) 300 5 5150 20
Amazon@Books(AM2) 300 5 500000 20
Amazon@Industrial(AM3) 300 5 73146 20
Amazon@Luxury(AM4) 300 5 33784 20
Amazon@Magazine(AM5) 300 5 2230 20
Office31(D) 36636672 31 498 10
Office31(W) 921600 31 795 10

NB: Number of Batches

V-C Baseline

LEOPARD is compared with five algorithms: the autonomous deep clustering network (ADCN) [29], the deep clustering network (DCN) [30], an autoencoder followed by K-Means (AE+KMeans), deep embedding clustering (DEC) [17] and domain adversarial neural networks (DANN) [3]. ADCN is a self-evolving deep clustering network where hidden clusters, nodes and layers are grown and pruned dynamically. Its loss function is formulated as a combination of a clustering loss and a reconstruction loss. ADCN is not equipped with a specific domain adaptation loss function and applies the hard cluster assignment approach as in [30], i.e., the L_{2} distance loss between the winning cluster and the latent sample is put forward. DCN adopts a fixed network structure where the clustering mechanism only takes place at the bottleneck layer. It applies the same loss function as ADCN. AE+KMeans differs from DCN in that the clustering mechanism is carried out after the training process. It does not utilize any clustering loss. DEC adopts the soft-assignment approach as with LEOPARD except that it relies on a static network structure and suffers from the absence of any domain adaptation loss. The reconstruction loss in the baseline algorithms is perceived as a domain adaptation procedure because it is carried out for both source and target streams under shared network parameters. DANN utilizes the adversarial domain adaptation as per LEOPARD without any clustering mechanism.

All of them work under the extreme label scarcity condition as with LEOPARD where access to true class labels is only provided for the prerecorded samples of the source stream while no label is offered during the process runs for either the source stream or the target stream. The comparison with ADCN is done by executing its published code to assure a fair comparison. We utilize our own implementations of DCN, AE+KMeans, DEC and DANN.

V-D Hyperparameters

The learning rate and momentum of LEOPARD are allocated as 0.01 and 0.95 while the regularization constant of the clustering loss \alpha_{2} is set as 1 and the trade-off constant of the cross-domain loss \alpha_{1} is set as 0.1. LEOPARD also depends on labelled prerecorded samples B_{0}^{S} of the source stream set as 10\% of the source samples proportionally taken from each class, N_{m}=10\%N_{S}. That is, each class contributes the same number of samples. The number of initial epochs is set as E=100 (amazon@X), E=50 (MN\leftrightarrowUS), and E=500 (D\leftrightarrowW) respectively. The initialization phase is carried out using the labelled prerecorded samples of the source stream. The parameters of the drift detector \alpha_{x},\alpha_{D},\alpha_{W} are selected respectively as 0.001, 0.001, 0.005. For the amazon@X problems, LEOPARD runs in the one-pass learning procedure whereas for the MN\leftrightarrowUS experiments, the training process of the clustering loss adopts the epoch-per-batch strategy with 10 epochs (MN\rightarrowUS, W\leftrightarrowD) and 5 epochs (US\rightarrowMN) respectively. The epoch-per-batch strategy satisfies the online learning requirement because a data batch is discarded after training over the predetermined epochs. The same setting is also applied to the baseline algorithms, assuring fair comparisons.
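For reference, the hyperparameters stated above can be summarized in a single configuration; the dictionary layout below is merely illustrative and not the format of the released code.

```python
# Values taken from the text above; only the layout of this dictionary is an assumption.
LEOPARD_CONFIG = {
    "learning_rate": 0.01,
    "momentum": 0.95,
    "alpha_2": 1.0,                     # clustering (KL) loss regularization constant
    "alpha_1": 0.1,                     # cross-domain loss trade-off constant
    "labelled_source_fraction": 0.10,   # N_m = 10% of N_S, class-balanced
    "init_epochs": {"amazon@X": 100, "MN<->US": 50, "D<->W": 500},
    "drift_detector": {"alpha_x": 0.001, "alpha_D": 0.001, "alpha_W": 0.005},
    "epochs_per_batch": {"amazon@X": 1, "MN->US": 10, "W<->D": 10, "US->MN": 5},
}
```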

For the MN\leftrightarrowUS problem, the feature extractor is formed as a convolutional neural network. The encoder part is constructed as 2 convolutional layers using 16 and 4 filters respectively, with a max-pooling layer in between. The decoder part is built upon two transposed convolutional layers with 4 and 16 filters respectively. For the amazon@X sentiment analysis problems, a multi-layer perceptron feature extractor is put forward with two hidden layers where the numbers of nodes are fixed as 300 and 100. For the office31 problem, ResNet34 is applied as the feature extractor. The initial number of nodes of the fully connected layer is simply assigned as 96 for the MNIST\leftrightarrowUSPS problem, 30 for the amazon@X sentiment analysis problems and 500 for D\leftrightarrowW. The ReLU activation function is applied for the intermediate layers while the decoder output utilizes the sigmoid activation function producing a normalized reconstructed output. The network structures of the baseline algorithms are set similarly to ensure a fair comparison. Further details of our numerical studies are explained in LEOPARD’s code shared at https://github.com/wengweng001/LEOPARD.git.

These parameters are fixed throughout all study cases to guarantee non ad-hoc performance of LEOPARD. The hyper-parameters of the baselines are selected as per the guidelines of their publications and hand-tuned if their performances are surprisingly compromised. The hyper-parameters of all consolidated algorithms are listed in the supplemental document.

TABLE II: Average Accuracy (%) of the Target Stream across 5 runs; * indicates statistically significant results and BOLD denotes the best numerical results
Experiments LEOPARD ADCN AE-kmeans DCN DEC DANN
AM1 \rightarrow AM2 20.6320 ±\pm 2.3958 19.8160 ±\pm 4.7555 27.8012 ±\pm 1.6392 27.7792 ±\pm 1.8319 18.3774 ±\pm 1.3356 15.2291 ±\pm 22.7966
AM1 \rightarrow AM3 *71.5300 ±\pm 1.0819 57.0520 ±\pm 10.7930 25.8713 ±\pm 1.6412 26.1010 ±\pm 1.9047 17.2324 ±\pm 1.5599 34.8559 ±\pm 29.7157
AM1 \rightarrow AM4 *57.7840 ±\pm 0.0476 43.9800 ±\pm 3.3004 27.9307 ±\pm 1.2602 28.1326 ±\pm 1.1633 16.6092 ±\pm 0.6464 44.4809 ±\pm 20.6428
AM1 \rightarrow AM5 *63.5100 ±\pm 1.2016 60.7980 ±\pm 3.0386 31.3240 ±\pm 0.8684 31.1996 ±\pm 0.8381 13.9947 ±\pm 1.0473 41.5806 ±\pm 13.6181
AM2 \rightarrow AM1 *71.5480 ±\pm 8.8031 25.2880 ±\pm 6.6382 36.7868 ±\pm 1.5799 36.9540 ±\pm 2.0093 8.8334 ±\pm 0.9026 49.4693 ±\pm 39.4318
AM2 \rightarrow AM3 45.4920 ±\pm 13.2055 19.9380 ±\pm 15.1533 31.0612 ±\pm 1.8636 30.9986 ±\pm 1.8202 14.2852 ±\pm 1.0375 43.4064 ±\pm 28.3212
AM2 \rightarrow AM4 *48.5600 ±\pm 3.5658 14.1460 ±\pm 1.7088 27.1297 ±\pm 1.0116 27.2251 ±\pm 0.9788 15.6150 ±\pm 1.4221 42.8489 ±\pm 14.2926
AM2 \rightarrow AM5 50.3680 ±\pm 15.8943 31.9160 ±\pm 5.8286 28.7212 ±\pm 1.6700 28.4330 ±\pm 1.0860 18.3333 ±\pm 2.6212 60.4227 ±\pm 8.1435
AM3 \rightarrow AM1 37.3520 ±\pm 4.7246 53.0180 ±\pm 17.7237 25.7504 ±\pm 1.2193 25.4591 ±\pm 1.4807 7.7442 ±\pm 1.5116 17.1871 ±\pm 19.2744
AM3 \rightarrow AM2 31.3120 ±\pm 8.9392 8.9900 ±\pm 2.0600 22.1118 ±\pm 1.1815 25.3291 ±\pm 5.8252 17.2165 ±\pm 1.6547 40.9799 ±\pm 13.1832
AM3 \rightarrow AM4 18.7240 ±\pm 1.0962 28.4340 ±\pm 4.5292 23.2826 ±\pm 1.4628 23.1707 ±\pm 1.6232 15.2796 ±\pm 2.2464 22.2822 ±\pm 17.1165
AM3 \rightarrow AM5 *59.5520 ±\pm 2.8339 37.1540 ±\pm 8.1052 22.1968 ±\pm 0.9799 22.3941 ±\pm 0.6026 16.0303 ±\pm 2.0662 20.1265 ±\pm 21.2838
AM4 \rightarrow AM1 45.5560 ±\pm 6.3118 69.4040 ±\pm 6.0339 23.2919 ±\pm 3.1591 23.4620 ±\pm 3.0392 8.0682 ±\pm 0.7888 55.4146 ±\pm 40.1126
AM4 \rightarrow AM2 23.2340 ±\pm 5.5658 21.9140 ±\pm 6.7656 21.5370 ±\pm 0.8304 21.3970 ±\pm 0.9446 18.1515 ±\pm 1.2285 22.8453 ±\pm 19.9388
AM4 \rightarrow AM3 58.0500 ±\pm 4.2461 62.6160 ±\pm 3.7864 22.4032 ±\pm 1.3615 22.6766 ±\pm 1.4520 17.2892 ±\pm 1.9414 54.9784 ±\pm 22.9740
AM4 \rightarrow AM5 *64.3480 ±\pm 0.0895 56.0280 ±\pm 2.6346 21.3376 ±\pm 1.1374 21.3101 ±\pm 0.9658 15.5601 ±\pm 1.4566 20.0455 ±\pm 13.7290
AM5 \rightarrow AM1 *87.6760 ±\pm 0.3844 56.3760 ±\pm 19.8715 20.1306 ±\pm 1.2201 20.4453 ±\pm 0.8831 8.8645 ±\pm 1.4946 61.0045 ±\pm 17.2396
AM5 \rightarrow AM2 12.5060 ±\pm 1.1611 10.8880 ±\pm 3.3736 19.3033 ±\pm 1.0140 19.6829 ±\pm 0.8429 15.7549 ±\pm 2.1423 36.7490 ±\pm 25.1993
AM5 \rightarrow AM3 *36.7900 ±\pm 6.2853 27.9480 ±\pm 3.3931 21.4889 ±\pm 2.2889 21.1898 ±\pm 2.2209 16.9886 ±\pm 3.9668 31.0573 ±\pm 32.7811
AM5 \rightarrow AM4 *50.7580 ±\pm 2.7020 32.5880 ±\pm 4.3853 19.8205 ±\pm 0.7625 19.5470 ±\pm 0.6045 16.0969 ±\pm 1.2545 33.5787 ±\pm 21.3489
MNIST \rightarrow USPS 45.3740 ±\pm 14.3497 62.9800 ±\pm 3.0666 10.1138 ±\pm 0.2647 9.9913 ±\pm 0.3020 10.3657 ±\pm 1.5318 23.1336 ±\pm 3.0172
USPS \rightarrow MNIST *49.4660 ±\pm 2.1841 33.3800 ±\pm 14.5818 10.0563 ±\pm 0.3269 9.6323 ±\pm 0.8033 9.3434 ±\pm 0.5433 39.0328 ±\pm 6.7573
D \rightarrow W *41.8080 ±\pm 9.8034 4.0000 ±\pm 0.5589 3.7722 ±\pm 0.3789 3.0633 ±\pm 0.7778 3.2658 ±\pm 0.6425 10.9821 ±\pm 2.7133
W \rightarrow D *35.2820 ±\pm 12.0613 3.7560 ±\pm 0.4568 2.9388 ±\pm 0.6657 3.1429 ±\pm 0.7370 2.8163 ±\pm 0.4725 6.4323 ±\pm 1.8659
  • * Statistical significance is determined by t-tests between LEOPARD and the other baselines, yielding 4 t-scores per experiment; a result is marked as statistically significant when the t-score exceeds 2.015 for at least 3 of the 4 comparisons.

V-E Numerical Results

From Table II, it is seen that LEOPARD outperforms the other algorithms in 15 of 24 cases with noticeable margins. This portrays the efficacy of the adversarial domain adaptation approach and the soft cluster assignment mechanism, both of which are absent from the baseline algorithms; i.e., ADCN, DCN and AE+KMeans adopt the hard cluster assignment strategy and lack the adversarial domain adaptation approach. This finding also confirms the advantage of adversarial domain adaptation over the feature reconstruction strategy with shared parameters across the two streams implemented in all baselines. The soft cluster assignment approach, where the cluster and network parameters are simultaneously optimized via SGD, performs better than the hard cluster assignment strategy, as demonstrated by the fact that LEOPARD beats ADCN in a significant number of cases. There is no significant performance difference between AE+KMeans and DCN. On the other hand, the importance of the structural learning component in handling data streams is clearly portrayed here: LEOPARD and ADCN are superior to the other algorithms, which have static structures. Such a mechanism allows timely reactions to the asynchronous drift problem across the source stream and the target stream. The performance of DANN, implemented under a conventional neural network structure, is far inferior to LEOPARD under the extreme label scarcity condition. This confirms the advantage of the clustering approach over the conventional neural network structure in reducing label dependencies. The statistical test is undertaken using the t-test ($p<0.05$), confirming the advantage of LEOPARD where it beats the other algorithms with statistically significant gaps in 13 of 24 cases.
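For reference, the following sketch illustrates the significance criterion described under Table II, computing independent two-sample t-scores between LEOPARD's per-run accuracies and those of each baseline; the accuracy arrays and the selection of baselines are placeholders rather than the actual experimental records, and the test variant (independent, equal-variance) is an assumption.

```python
# A minimal sketch (not the authors' script) of the Table II significance criterion:
# compute t-scores between LEOPARD's 5 per-run accuracies and each baseline's, and
# count how many exceed the 2.015 threshold. All values below are placeholders.
import numpy as np
from scipy import stats

leopard = np.array([71.2, 72.0, 70.9, 71.8, 71.7])          # 5 runs (placeholder)
baselines = {
    "ADCN":      np.array([56.1, 58.3, 55.0, 57.9, 57.8]),
    "AE-kmeans": np.array([25.4, 26.0, 25.9, 26.3, 25.7]),
    "DCN":       np.array([26.2, 26.4, 25.8, 26.5, 25.6]),
    "DEC":       np.array([17.0, 17.5, 16.9, 17.4, 17.3]),
}

THRESHOLD = 2.015
wins = 0
for name, acc in baselines.items():
    t_score, _ = stats.ttest_ind(leopard, acc)   # independent two-sample t-test
    print(f"{name}: t = {t_score:.3f}")
    wins += int(t_score > THRESHOLD)

# The text marks a case as statistically significant when at least 3 of the 4
# t-scores exceed the threshold.
print("statistically significant:", wins >= 3)
```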

TABLE III: Ablation Study of LEOPARD
Experiments | A | B | C | D | E
AM1 → AM3 | 56.6620 ± 7.3010 | 70.0860 ± 2.3159 | 51.7800 ± 5.3515 | 40.8860 ± 2.6187 | 71.5300 ± 1.0819
AM1 → AM4 | 49.4800 ± 3.9323 | 57.2160 ± 0.3800 | 49.3940 ± 3.5215 | 27.7520 ± 1.9921 | 57.7840 ± 0.0476
AM5 → AM1 | 85.6580 ± 2.3485 | 83.2040 ± 6.6126 | 86.7800 ± 1.5401 | 31.1860 ± 1.5308 | 87.6760 ± 0.3844
AM5 → AM4 | 29.6480 ± 5.5269 | 43.5760 ± 6.5908 | 29.4860 ± 3.4137 | 24.7100 ± 1.6597 | 50.7580 ± 2.7020
  • A: LEOPARD without network structure evolution and without the additional losses ($KL(\phi^{l}|\Phi^{l})$ and $L_{cd}$)
  • B: LEOPARD without network structure evolution
  • C: LEOPARD without $KL(\phi^{l}|\Phi^{l})$ and $L_{cd}$
  • D: LEOPARD using BERT as the feature extractor
  • E: the full LEOPARD model

V-F Ablation Study

This section studies the effect of LEOPARD’s learning modules by analyzing its performance when a particular learning module is deactivated. LEOPARD is configured into four ablated models: (A) the structural learning method is switched off, leaving LEOPARD with a static network structure, and the parameter learning strategy excludes both the cross-domain adaptation loss and the KL divergence loss; in short, LEOPARD is driven by the reconstruction loss only; (B) the structural learning strategy is deactivated while the parameter learning step utilizes both the cross-domain adaptation loss and the KL divergence loss; (C) the structural learning mechanism is activated but the KL divergence loss and the cross-domain adaptation loss are absent; (D) BERT is applied as the feature extractor in lieu of the word2vec model. Our ablation study is carried out using four study cases: AM1→AM3, AM1→AM4, AM5→AM1 and AM5→AM4.
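To make these configurations concrete, the sketch below shows one possible way of toggling the structural learning flag and the loss terms; the class and function names and the trade-off coefficients are illustrative assumptions and do not reproduce LEOPARD's actual loss definitions.

```python
# Illustrative sketch (assumed names, not the actual LEOPARD code) of how the
# ablation configurations (A)-(E) toggle the loss terms and structural learning.
from dataclasses import dataclass

@dataclass
class AblationConfig:
    structural_learning: bool   # grow/prune nodes, layers and clusters
    use_kl_loss: bool           # KL(phi^l | Phi^l) clustering regularizer
    use_cd_loss: bool           # L_cd cross-domain (adversarial) adaptation loss

CONFIGS = {
    "A": AblationConfig(structural_learning=False, use_kl_loss=False, use_cd_loss=False),
    "B": AblationConfig(structural_learning=False, use_kl_loss=True,  use_cd_loss=True),
    "C": AblationConfig(structural_learning=True,  use_kl_loss=False, use_cd_loss=False),
    "E": AblationConfig(structural_learning=True,  use_kl_loss=True,  use_cd_loss=True),
}

def total_loss(l_rec, l_kl, l_cd, cfg, alpha=1.0, beta=1.0):
    """Compose the training objective for a given ablation configuration.

    l_rec, l_kl, l_cd are scalar losses already computed elsewhere; alpha and
    beta are assumed trade-off coefficients.
    """
    loss = l_rec                       # reconstruction loss is always active
    if cfg.use_kl_loss:
        loss = loss + alpha * l_kl     # KL divergence clustering loss
    if cfg.use_cd_loss:
        loss = loss + beta * l_cd      # cross-domain adaptation loss
    return loss
```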

From Table III, LEOPARD suffers major performance degradation under all configurations (A)-(D), confirming the efficacy of its full version. For AM5→AM1, configuration (A), where the structural learning strategy, the KL divergence loss and the cross-domain adaptation loss are all absent, produces poor results. The performance improves slightly with the activation of the KL divergence loss and the cross-domain adaptation loss as per configuration (B). Even though the KL divergence loss and the cross-domain adaptation loss are deactivated, the structural learning mechanism improves the accuracy as per configuration (C), although it is not yet on par with LEOPARD. An interesting observation appears in the case of AM5→AM4, where configuration (C) produces poor results; the performance improves in configurations (A) and (B) without the structural learning mechanism. Nevertheless, configurations (A)-(C) are not comparable to the full LEOPARD where all modules are engaged. We note that the size of the source stream is much smaller than the size of the target stream in the AM5→AM4 case, leading to very few labelled samples being provided in the warm-up phase. For AM1→AM4, the absence of structural evolution, configuration (B), drops LEOPARD’s accuracy slightly. More severe degradation than configuration (B) occurs in configuration (A) and configuration (C), where the KL divergence loss and the cross-domain adaptation loss are deactivated. The same pattern is demonstrated in the case of AM1→AM3. These findings clearly confirm the advantage of the KL divergence loss and the cross-domain adaptation loss for LEOPARD. In addition, the structural evolution boosts the performance of LEOPARD, especially in the presence of drifts. The use of BERT as the feature extractor, as shown in configuration (D), significantly worsens the predictive performance of LEOPARD. This finding confirms the suitability of the word2vec model over the BERT model as the feature extractor of LEOPARD, most likely due to the absence of a self-attention mechanism or recurrent connection in LEOPARD.

[Figure 4: USPS → MNIST t-SNE plots on 1000 target data samples; (a) before training, (b) after training.]

V-G t-SNE Plots

Fig. 4 illustrates the t-SNE plots [18] of LEOPARD for the USPS→MNIST case on the target stream before and after the training process. It is observed that no cluster structure exists initially in this problem, but clear cluster structures are present after the training process, showing the effectiveness of (14). These facts confirm that LEOPARD does not call for the existence of any cluster structures beforehand and that the clustering loss $L_{cluster}$ is capable of establishing a clustering-friendly latent space.
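For readers wishing to reproduce such plots, a minimal sketch using scikit-learn's t-SNE on latent features is given below; the encoder calls in the usage comment are hypothetical stand-ins for LEOPARD's feature extractor.

```python
# Minimal sketch of producing Fig. 4-style plots: project 1000 target-stream
# latent features with t-SNE. `encoder_untrained` / `encoder_trained` below are
# hypothetical stand-ins for LEOPARD's feature extractor before/after training.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(features, labels, title):
    """features: (N, D) latent vectors; labels: (N,) digit labels used for coloring only."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(4, 4))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.tight_layout()
    plt.show()

# Usage (hypothetical):
# z_before = encoder_untrained(x_target[:1000]).detach().numpy()
# z_after  = encoder_trained(x_target[:1000]).detach().numpy()
# tsne_plot(z_before, y_target[:1000], "before training")
# tsne_plot(z_after,  y_target[:1000], "after training")
```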

V-H Future Directions

This paper has successfully developed an algorithmic solution for multistream classification problems under extreme label shortage, LEOPARD. That is, given two different but related streaming processes, LEOPARD functions properly with only a few prerecorded labelled samples of the source domain and without any labels once the streaming processes run. This benefit goes one step beyond existing multistream classifiers or unsupervised domain adaptation methods calling for fully labelled source streams. Nonetheless, the problem of multistream classification remains at an infant stage, leaving several open issues for future works.

The problem of open set domain adaptation presents a case where the source and target domains do not share the same target classes [31]. Such a setting is also seen as a way to reduce the labelling cost and is beneficial in the realm of multistream classification. [31] proposes a feature transformation strategy associating target classes of the target domain with those of the source domain. [32] puts forward the concepts of openness and unknown classes in the open set domain adaptation problem. The theoretical bound of open set domain adaptation is derived in [33], and its application to deep learning is demonstrated in [34]. These approaches are limited to the offline case, calling for extensions to the multistream classification setting. Gradual domain adaptation [35] is highly relevant to the multistream classification context because the multistream classification problem also considers different but related streams and constant discrepancies, although the asynchronous drifts usually appear suddenly. In addition, (1) assumes a small and fixed combined risk, which is unrealistic because the combined risk may increase during the training process [36]; this issue is still ignored in multistream classification problems. Few-shot hypothesis adaptation [37] is another interesting direction for the multistream classification topic, extending the few-shot domain adaptation problem to the case without any source domain data.

VI Conclusion

Learning Streaming Process from Partial Ground Truth (LEOPARD) is proposed in this paper to cope with cross domain multistream classification problems under a lack of labelled samples. The advantage of LEOPARD has been numerically validated using 24 study cases combined from nine datasets, where LEOPARD outperforms its counterparts with noticeable margins in 15 of 24 cases. The ablation study further confirms the efficacy of LEOPARD’s learning modules. One limitation of LEOPARD lies in the adversarial domain adaptation strategy, which only performs domain alignment; this approach performs poorly when large conditional distribution discrepancies exist. This issue is rather tricky here because LEOPARD does not benefit from any labels for its model updates. Our initial insight shows the feasibility of a pseudo-labelling strategy to attack this problem by improving class inferences. Noisy labels remain an open issue to be explored in future work.

VII Acknowledgement

We acknowledge the financial support of National Research Foundation, Singapore under IAFPP in the AME domain (contract no.: A19C1A0018) and the UniSA’s start-up grant.

References

  • [1] S. Chandra, A. Haque, L. Khan, and C. Aggarwal, “An adaptive framework for multistream classification,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ser. CIKM ’16.   New York, NY, USA: ACM, 2016, pp. 1181–1190.
  • [2] P. Zhao, S. C. H. Hoi, J. Wang, and B. Li, “Online transfer learning,” Artif. Intell., vol. 216, pp. 76–102, 2014.
  • [3] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” J. Mach. Learn. Res., vol. 17, no. 1, p. 2096–2030, Jan. 2016.
  • [4] A. Haque, Z. Wang, S. Chandra, B. Dong, L. Khan, and K. W. Hamlen, “Fusion: An online method for multistream classification,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ser. CIKM ’17.   New York, NY, USA: ACM, 2017, pp. 919–928.
  • [5] B. Dong, S. Chandra, Y. Gao, and L. Khan, “Multistream classification with relative density ratio estimation,” in AAAI 2019, 2019.
  • [6] M. Pratama, M. de Carvalho, X. Renchunzi, E. Lughofer, and J. Lu, “Atl: Autonomous knowledge transfer from many streaming processes,” in Proceedings of The 28th ACM International Conference on Information and Knowledge Management, ser. CIKM’19, 2019, pp. 3861–3870.
  • [7] H. Du, L. L. Minku, and H. Zhou, “Multi-source transfer learning for non-stationary environments,” in 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8.
  • [8] ——, “Marline: Multi-source mapping transfer learning for non-stationary environments,” in 2020 IEEE International Conference on Data Mining (ICDM), 2020, pp. 122–131.
  • [9] X. Renchunzi and M. Pratama, “Automatic online multi-source domain adaptation,” Information Sciences, vol. 582, pp. 480–494, 2022.
  • [10] F. Liu, G. Zhang, and J. Lu, “Multisource heterogeneous unsupervised domain adaptation via fuzzy relation neural networks,” IEEE Transactions on Fuzzy Systems, vol. 29, pp. 3308–3322, 2021.
  • [11] H. Tao, Z. Wang, Y. Li, M. Zamani, and L. Khan, “Comc: A framework for online cross-domain multistream classification,” in 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8.
  • [12] Y. Zhang, F. Liu, Z. Fang, B. Yuan, G. Zhang, and J. Lu, “Clarinet: A one-step approach towards budget-friendly unsupervised domain adaptation,” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, C. Bessiere, Ed.   International Joint Conferences on Artificial Intelligence Organization, 7 2020, pp. 2526–2532, main track.
  • [13] F. Liu, G. Zhang, and J. Lu, “Heterogeneous domain adaptation: An unsupervised approach,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 12, 2020.
  • [14] Z. Fang, J. Lu, F. Liu, J. Xuan, and G. Zhang, “Open set domain adaptation: Theoretical bound and algorithm,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2020.
  • [15] F. Liu, J. Lu, B. Han, G. Niu, G. Zhang, and M. Sugiyama, “Butterfly: Robust one-step approach towards wildly-unsupervised domain adaptation,” 2019.
  • [16] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Mach. Learn., vol. 79, no. 1–2, p. 151–175, May 2010.
  • [17] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ser. ICML’16.   JMLR.org, 2016, p. 478–487.
  • [18] L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.
  • [19] J. Smith, S. Baer, Z. Kira, and C. Dovrolis, “Unsupervised continual learning and self-taught associative memory hierarchies,” in 2019 International Conference on Learning Representations Workshops, 2019.
  • [20] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” in Proceedings of the 19th International Conference on Neural Information Processing Systems, ser. NIPS’06.   Cambridge, MA, USA: MIT Press, 2006, p. 137–144.
  • [21] Z. Wang, Z. Kong, S. Chandra, H. Tao, and L. Khan, “Robust high dimensional stream classification with novel class detection,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019, pp. 1418–1429.
  • [22] M. Pratama, C. Za’in, A. Ashfahani, Y. S. Ong, and W. Ding, “Automatic construction of multi-layer perceptron network from streaming examples,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1171–1180.
  • [23] J. Gama, Knowledge Discovery from Data Streams, 1st ed.   Chapman & Hall/CRC, 2010.
  • [24] M. Pratama, W. Pedrycz, and G. I. Webb, “An incremental construction of deep neuro fuzzy system for continual learning of nonstationary data streams,” IEEE Transactions on Fuzzy Systems, vol. 28, pp. 1315–1328, 2020.
  • [25] J. Ni, J. Li, and J. McAuley, “Justifying recommendations using distantly-labeled reviews and fine-grained aspects,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).   Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 188–197.
  • [26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” CoRR, vol. abs/1301.3781, 2013.
  • [27] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A survey on concept drift adaptation,” ACM Comput. Surv., vol. 46, no. 4, pp. 44:1–44:37, Mar. 2014.
  • [28] M. de Carvalho, M. Pratama, J. Zhang, and E. K. Y. Yapp, “Acdc: Online unsupervised cross-domain adaptation,” 2021.
  • [29] A. Ashfahani and M. Pratama, “Unsupervised continual learning in streaming environments,” IEEE transactions on neural networks and learning systems, vol. PP, 2022.
  • [30] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong, “Towards k-means-friendly spaces: Simultaneous deep learning and clustering,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70.   PMLR, 06–11 Aug 2017, pp. 3861–3870.
  • [31] P. P. Busto and J. Gall, “Open set domain adaptation,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 754–763, 2017.
  • [32] H. Liu, Z. Cao, M. Long, J. Wang, and Q. Yang, “Separate to adapt: Open set domain adaptation via progressive separation,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2922–2931, 2019.
  • [33] Z. Fang, J. Lu, F. Liu, J. Xuan, and G. Zhang, “Open set domain adaptation: Theoretical bound and algorithm,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, pp. 4309–4322, 2021.
  • [34] L. Zhong, Z. Fang, F. Liu, B. Yuan, G. Zhang, and J. Lu, “Bridging the theoretical bound and deep algorithms for open set domain adaptation,” IEEE transactions on neural networks and learning systems, vol. PP, 2021.
  • [35] A. Kumar, T. Ma, and P. Liang, “Understanding self-training for gradual domain adaptation,” ArXiv, vol. abs/2002.11361, 2020.
  • [36] L. Zhong, Z. Fang, F. Liu, J. Lu, B. Yuan, and G. Zhang, “How does the combined risk affect the performance of unsupervised domain adaptation approaches?” ArXiv, vol. abs/2101.01104, 2021.
  • [37] H. Chi, F. Liu, W. Yang, L. Lan, T. Liu, B. Han, W. Cheung, and J. T.-Y. Kwok, “Tohan: A one-step approach towards few-shot hypothesis adaptation,” ArXiv, vol. abs/2106.06326, 2021.