
Autonomous Cross Domain Adaptation under Extreme Label Scarcity

Weiwei Weng, Mahardhika Pratama, Choiru Za’in, Marcus de Carvalho,
Rakaraddi Appan, Andri Ashfahani, Edward Yapp Kien Yee
W. Weng and M. Pratama share equal contributions. W. Weng, A. Ashfahani, M. de Carvalho, and R. Appan are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. M. Pratama is with the STEM academic unit, University of South Australia, Adelaide, Australia. C. Za’in is with the School of IT, Monash University. E. Y. Kien Yee is with the Singapore Institute of Manufacturing Technology, A*Star, Singapore. The majority of this work was done when M. Pratama was with SCSE, NTU, Singapore. E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected].
Abstract

Cross-domain multistream classification is a challenging problem calling for fast domain adaptations to handle different but related streams in never-ending and rapidly changing environments. Although existing multistream classifiers assume no labelled samples in the target stream, they still incur expensive labelling costs since they require fully labelled samples of the source stream. This paper aims to attack the problem of extreme label shortage in cross-domain multistream classification problems where only very few labelled samples of the source stream are provided before the process runs. Our solution, namely Learning Streaming Process from Partial Ground Truth (LEOPARD), is built upon a flexible deep clustering network whose hidden nodes, layers and clusters are added and removed dynamically in response to varying data distributions. The deep clustering strategy is underpinned by a simultaneous feature learning and clustering technique leading to clustering-friendly latent spaces. The domain adaptation strategy relies on the adversarial domain adaptation technique where a feature extractor is trained to fool a domain classifier that classifies source and target streams. Our numerical study demonstrates the efficacy of LEOPARD, which delivers improved performances compared to prominent algorithms in 15 of 24 cases. The source code of LEOPARD is shared at https://github.com/wengweng001/LEOPARD.git to enable further study.

Index Terms:
Multistream Classification, Transfer Learning, Data Streams, Incremental Learning, Concept Drifts

I Introduction

Background: Multistream classification problems [1] portray a classification problem across many streaming processes running simultaneously but independently. Each streaming process features different but related characteristics to be handled by a single model having a stream-invariant trait. That is, each stream suffers from the domain shift problem in which the streams follow different distributions. The multistream classification problem also considers the issue of labelling cost where ground truth access is only provided in the source stream, leaving the target stream with no labelled samples at all. Unlike traditional domain adaptation problems, the multistream problem deals with continuous information flows which must be handled in a fast and sample-wise fashion. Another typical problem is the asynchronous drift problem which distinguishes it from the conventional single-stream problem. The asynchronous drift problem refers to independent drifts between source and target streams taking place at different time points. The multistream classification problem distinguishes itself from the online transfer learning problem [2] in that both source and target domains are streaming in nature, whereas the online transfer learning problem assumes a static source domain although it considers a streaming target domain. The underlying goal of the multistream classification problem is to build a predictive model f(.) which simultaneously performs the unsupervised domain adaptation as well as addresses the issue of data streams. Notwithstanding the recent progress of the multistream classification area, most works are designed from a single-domain perspective in which both source and target streams are drawn from the same feature space. In addition, existing solutions incur expensive labelling cost because they require a full supervision of the source stream.

Practical Scenario: This paper puts into perspective a cross-domain multistream classification problem under extreme label scarcity where, unlike conventional multistream classification problems, the source stream and the target stream are generated from different feature spaces but share the same target attributes. The extreme label scarcity issue manifests in the fact that no label is provided for the target stream while only few labelled samples are made available in the source stream during the warm-up phase. That is, an operator is only capable of labelling a few prerecorded samples of the source stream while leaving the rest of the data samples of the source stream unlabelled. The practical scenario of this problem is seen in the condition monitoring problem involving different machines. Instead of building a machine-specific model for monitoring purposes, a single machine-invariant model is constructed, thereby saving significant development costs because data collection, annotation and preprocessing do not have to be repeated for each machine. Nevertheless, this task is challenging because data samples captured by sensors are streaming in nature. Different machines are fitted with different sensors or sensor types, thereby producing different feature spaces, while having different sampling rates leading to different batch sizes. Process deviations due to tool wear or any other external influencing factors occur independently in each machine at different time points, leading to drifting data distributions in each machine with different rates, magnitudes and types. The issue of labelling cost occurs because visual inspections, which interrupt machine operations, are necessitated to annotate data samples. This hinders the labelling process during process runs. The labelling process can be done only for prerecorded samples to avoid frequent stoppages of machine operations.

We visualize the significance of label scarcity in the context of domain adaptation in Fig. 1 where DANN [3] is evaluated under different label proportions of source streams. Our numerical results are produced in the office31 problem (Webcam \rightarrow DSLR) using five label ratios: 10\%, 20\%, 30\%, 40\%, 50\%. It is observed that DANN’s performances are significantly compromised with reductions of label proportions, falling below 10\% accuracy on source and target streams under the 5\% and 10\% label proportions. That is, its accuracy on source and target streams consistently slips.

Our Contribution: The Learning Streaming Process from Partial Ground Truth (LEOPARD) approach is proposed in this paper to resolve the cross-domain multistream classification problem under extreme label scarcity. LEOPARD is developed under the framework of a flexible deep clustering network featuring an elastic and progressive network structure to handle changing data distributions. That is, hidden nodes, hidden layers and hidden clusters are self-evolved in response to the asynchronous drift problem in both source and target streams. The learning process of LEOPARD aims to achieve two objectives under shared network parameters, a clustering-friendly latent space and cross-domain alignment, by minimizing three loss functions. The reconstruction loss functions as a nonlinear dimension reducer which projects input samples into a low dimension and establishes a common latent space between the source stream and the target stream. It is achieved by a stacked autoencoder (SAE) performing nonlinear mapping. The second component is the clustering loss creating a clustering-friendly latent space and preventing the trivial solution. The cross-domain adaptation loss is meant to induce the domain alignment and utilizes the adversarial domain adaptation approach [3]. This strategy relies on a domain classifier classifying the origin of data samples and a feature extractor. The feature extractor and the domain classifier compete with each other, thus resulting in domain-invariant representations. LEOPARD does not call for any labelled samples for its updates, and the few prerecorded labelled samples of the source stream are used only to establish the class-to-cluster relationship.

This paper presents four major contributions: 1) it proposes a new problem, namely the cross-domain multistream classification problem under extreme label scarcity; 2) an algorithm, namely LEOPARD, is developed to address the issue of label scarcity in the cross-domain multistream classification problem; 3) a joint optimization problem is formulated to attain the clustering-friendly latent space as well as the domain alignment such that the target stream can be predicted accurately with very few labels of the source stream and no labels of the target stream; 4) the source code of LEOPARD along with all datasets are made public at https://github.com/wengweng001/LEOPARD.git to enable further study. Our numerical study has substantiated the efficacy of LEOPARD in handling the issue of extreme label scarcity in the cross-domain multistream classification problem. It delivers highly competitive performances compared to prominent algorithms.

II Related Works

Multistream Classification: The area of multistream classification has attracted growing research interest, as evidenced by the number of works published in the literature. A pioneering work is proposed in [1] using the kernel mean matching (KMM) method as a domain adaptation technique combined with a drift detection method to detect the concept drift in each domain. Considering the high computational complexity and memory demand of [1], FUSION is proposed in [4] where the KLIEP method is implemented for domain adaptation while a density ratio method is designed for detecting the asynchronous drifts. MSCRDR is put forward in [5] and uses the Pearson divergence method for domain adaptation. Recently, a deep learning algorithm, namely ATL, is proposed to solve the multistream classification problem using an encoder-decoder structure under shared parameters coupled with the KL divergence method for domain adaptation [6]. ATL characterizes an inherent drift handling aptitude with a self-evolving network structure. MELANIE is proposed in [7] to handle the multi-source multistream classification problem. This work is extended in [8]. Another solution of multi-source multistream classification is offered in [9] where a CMD-based regularization is integrated. The problem of multi-source unsupervised domain adaptation under both homogeneous and heterogeneous settings is discussed in [10]. The area of multistream classification deserves an in-depth study for at least two reasons: 1) these approaches are designed for a single-domain problem where both source and target streams share the same feature space (domain). To the best of our knowledge, there exists only one work in the literature handling the cross-domain multistream classification problem [11] using the empirical maximum mean discrepancy for domain adaptation. However, this approach is based on a non-deep-learning solution relying on a simple linear projection for feature transformation, prone to trivial solutions; 2) although these approaches rely on unsupervised domain adaptation approaches where no label is offered for the target stream, full annotations are required for the source stream. On the other hand, the multistream classification problem distinguishes itself from the online transfer learning problem [2] assuming a fixed and static source domain. Hence, the asynchronous drift problem is absent in the online transfer learning problem.

Semi-Supervised Transfer Learning: The issue of labelling cost has attracted research interest in the transfer learning community. In [12], the notion of complementary labels, incurring a less expensive labelling cost than true class labels, is implemented. Dual deep neural networks are designed in which one focuses on complementary labels while the other handles the domain adaptation. [13] concerns the reduction of the labelling cost in the heterogeneous domain adaptation problem, which usually calls for some labelled samples of the target domain. Another effort is devoted to reducing the labelling cost in [14], which concerns an open-set domain adaptation where the target domain contains unknown classes. The use of noisy labels for unsupervised domain adaptation has been investigated in [15]. Our work differs from these works in two aspects: 1) LEOPARD handles the situation of cross-domain multistream classification under extreme label scarcity. That is, labelled samples are only revealed for the source stream during the warm-up period while no labelled samples for either stream are given for model updates during the process runs; 2) the learning approach is designed for the stream learning scenario.


Figure 1: DANN performance on Office31 (D\rightarrowW) under different label proportions of the source stream, leaving the target stream unlabelled.

III Problem Formulation

Suppose that D_{S},D_{T} stand for the source and target domains respectively. The goal of domain adaptation is to solve a classification problem of the target domain D_{T} without any labels by transferring knowledge from the source domain D_{S} where there exist some labelled samples. Referring to [16], the generalization error of the target domain is upper bounded by how well a model f(.) learns the source domain and the discrepancies between the two domains:

\epsilon_{T}(f)\leq\underbrace{\epsilon_{S}(f)}_{1^{st}}+\underbrace{d_{1}(D_{S},D_{T})}_{2^{nd}}+\underbrace{\min{(E_{D_{S}}(f_{S},f_{T}),E_{D_{T}}(f_{S},f_{T}))}}_{3^{rd}} (1)

where the first term is the source error, the second term is the divergence between the two domains, and the last term is the difference in the labelling functions of the two domains, which should be small [16]. Direct minimization of the divergence is a challenging task due to a lack of correspondence between data samples of the two domains. For streaming data, this poses a major challenge because the divergence measure works with a finite number of samples.

A cross-domain multistream classification problem under extreme label scarcity is defined as a classification problem of two independent streaming data B_{1}^{S},B_{2}^{S},...,B_{K_{S}}^{S} and B_{1}^{T},B_{2}^{T},...,B_{K_{T}}^{T}, termed the source stream and the target stream respectively, where K_{S},K_{T} are the numbers of source-stream and target-stream batches, unknown in practice. B_{k_{s}}^{S},B_{k_{t}}^{T} are drawn from the source domain D_{S} and the target domain D_{T} respectively. Extreme label scarcity is perceived in the limited access to ground truth where only prerecorded samples of the source stream B_{0}^{S}=\{x_{i}^{S},y_{i}^{S}\}_{i=1}^{N_{m}} are labelled while no label is provided during the process runs B_{k_{s}}^{S}=\{x_{i}^{S}\}_{i=1}^{N_{S}}. N_{m},N_{S} denote the number of prerecorded data samples of the source stream and the size of the source stream respectively. On the other hand, the target stream suffers from the absence of true class labels, B_{k_{t}}^{T}=\{x_{i}^{T}\}_{i=1}^{N_{T}}, where N_{T} is the size of the target stream. Note that we consider a case where both source and target domains are streaming in nature. x_{i}^{S}\in\mathcal{X}_{S}, x_{i}^{T}\in\mathcal{X}_{T}, \mathcal{X}_{S}\neq\mathcal{X}_{T} are input vectors of the source stream and the target stream while y_{i}=[l_{1},l_{2},...,l_{m}] is a target vector formed as a one-hot vector, y_{i}^{S},y_{i}^{T}\in\mathcal{Y}. (x_{i}^{S},y_{i}^{S})\in\mathcal{X}_{S}\times\mathcal{Y} and (x_{i}^{T},y_{i}^{T})\in\mathcal{X}_{T}\times\mathcal{Y}. That is, the two domains feature different feature spaces but share the same labelling function and target variables (cross-domain). The two streaming data are sampled at different speeds, resulting in N_{S}\neq N_{T}, i.e., different batch sizes, while following different distributions P(x_{S})\neq P(x_{T}) (covariate shift). The source stream and the target stream are non-stationary in nature where their concepts are drifting, P(x,y)_{t}^{S}\neq P(x,y)_{t+1}^{S}, P(x,y)_{t^{\prime}}^{T}\neq P(x,y)_{t^{\prime}+1}^{T}, t\neq t^{\prime}, i.e., concept drifts of the two streams might develop at different time periods t\neq t^{\prime} (asynchronous drift).

IV Learning Procedure of LEOPARD

IV-A Network Structure of LEOPARD

LEOPARD is structured as a deep clustering network developed with a feature extraction layer extracting natural features Z from raw input features x by means of a mapping function F_{W_{f}}(.) where W_{f} stands for the parameters of the feature extractor. The extracted features Z are passed to a fully connected layer formed as a stacked autoencoder (SAE) with a tied-weight constraint. That is, the decoder parameters are the inverse mapping of the encoder parameters. The natural features Z\in\Re^{u^{\prime}} are projected to a low-dimensional latent space h^{l}\in\Re^{R_{l}} where u^{\prime},R_{l} are respectively the number of natural features and the number of hidden nodes at the l-th layer, R_{l}<<u^{\prime}. The encoding and decoding mechanisms are expressed as:

h^{l}=r(W_{enc}^{l}h^{l-1}+b^{l});\quad h^{0}=Z (2)
\hat{h}^{l-1}=r(W_{dec}^{l}h^{l}+c^{l});\quad\forall l=1,\dots,L (3)

where W_{enc}^{l}\in\Re^{R_{l}\times u_{l}},b^{l}\in\Re^{R_{l}} stand for the connective weights and biases of the l-th layer of the encoder respectively while W_{dec}^{l}\in\Re^{u_{l}\times R_{l}},c^{l}\in\Re^{u_{l}} denote the connective weights and biases of the l-th layer of the decoder respectively. The tied-weight constraint W_{dec}^{l}=(W_{enc}^{l})^{T} functions as a regularization mechanism preventing the issue of overfitting.
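The sketch below illustrates one tied-weight SAE layer implementing (2)-(3) in PyTorch; the activation choices and the class name are illustrative assumptions rather than LEOPARD's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoderLayer(nn.Module):
    """One SAE layer with a tied-weight decoder, W_dec^l = (W_enc^l)^T."""

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.empty(hidden_dim, in_dim))
        nn.init.xavier_uniform_(self.W_enc)              # Xavier initialization, as in the text
        self.b = nn.Parameter(torch.zeros(hidden_dim))   # encoder bias b^l
        self.c = nn.Parameter(torch.zeros(in_dim))       # decoder bias c^l

    def encode(self, h_prev):
        # Eq. (2): h^l = r(W_enc^l h^{l-1} + b^l); ReLU assumed for intermediate layers
        return F.relu(F.linear(h_prev, self.W_enc, self.b))

    def decode(self, h):
        # Eq. (3): h_hat^{l-1} = r(W_dec^l h^l + c^l) with W_dec^l = (W_enc^l)^T
        return torch.sigmoid(F.linear(h, self.W_enc.t(), self.c))

    def forward(self, h_prev):
        h = self.encode(h_prev)
        return h, self.decode(h)
```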

The clustering mechanism is carried out in each deep embedding space, i.e., each latent space. That is, it takes place in every hidden layer of the SAE h^{l}, creating different representations of data samples. The inference mechanism is performed by first calculating the similarity degree between a data sample and a hidden cluster [17]:

\phi_{j}^{l}=\frac{(1+\|h^{l}-C_{j}^{l}\|_{2}/\lambda)^{\frac{-(\lambda+1)}{2}}}{\sum_{j=1}^{Clus^{l}}(1+\|h^{l}-C_{j}^{l}\|_{2}/\lambda)^{\frac{-(\lambda+1)}{2}}} (4)

where C_{j}^{l},h^{l} are the centroid of the j-th cluster of the l-th layer and the latent representation of a data sample x at the l-th layer, while Clus^{l} is the number of clusters created in the l-th latent space, i.e., the l-th layer of the SAE. \lambda=1 is chosen here. The Student's t-distribution is adopted to model the similarity degree, and \phi_{j}^{l} is also regarded as the cluster posterior probability P(C_{j}^{l}|X) [18] where P(C_{j}|X)=1 represents the case of a perfect match between h^{l} and C_{j}^{l}. The similarity degree \phi_{j}^{l} is aggregated across the N_{m} prerecorded samples having true class labels B_{0}^{S}=\{x_{i}^{S},y_{i}^{S}\}_{i=1}^{N_{m}}. This operation produces the cluster’s allegiance [19] measuring a cluster’s tendency toward a particular class. Suppose that N_{o} stands for the number of prerecorded samples having the o-th class as their labels, the cluster allegiance Ale_{j,o}^{l} is calculated as:

Ale_{j,o}^{l}=\frac{\sum_{n=1}^{N_{o}}\phi_{j,o}^{n,l}}{\sum_{o=1}^{m}\sum_{n=1}^{N_{o}}\phi_{j,o}^{n,l}} (5)

where \phi_{j,o}^{n,l} measures the similarity degree between the cluster C_{j}^{l} and the n-th prerecorded sample h_{o}^{l} falling into the o-th class. (5) pinpoints the neighborhood degree of the j-th cluster to the o-th class and implies that an unclean cluster, occupied by data samples of mixed classes, possesses a low cluster allegiance. The winner-takes-all principle win=\arg\max_{j=1,...,Clus^{l}}\phi_{j}^{l} is adopted here, where a data sample is associated with the nearest cluster. The local score of the l-th layer is defined as the allegiance of the winning cluster Score^{l}=Ale_{win}^{l}. The predicted class label \hat{Y} is determined as the class label maximizing its global score. The global score is calculated as the summation of the local scores across the L layers:

\hat{Y}=\arg\max_{o=1,...,m}\sum_{l=1}^{L}Score^{l} (6)

where a majority voting approach is implemented. It is evident that LEOPARD merely benefits from the labelled prerecorded samples of the source stream B_{0}^{S} to associate a cluster with a specific class. No label at all from either stream is solicited in the streaming phase, which confirms its applicability in extreme label scarcity environments. Fig. 2 visualizes the network structure of LEOPARD. It is perceived that the clustering process occurs in every hidden layer of LEOPARD, thus producing its local outputs. The final predicted class label is aggregated across all layers making use of a summation operation. R_{l},L,Clus^{l} are self-evolved in response to varying distributions.
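The inference path of (4)-(6) can be sketched as follows; this is a minimal illustration assuming batched tensors and a fixed set of centroids per layer, with helper names chosen for readability rather than taken from the released code.

```python
import torch

def cluster_similarity(h, centroids, lam=1.0):
    """Eq. (4): Student-t similarity of latent samples h (B, R_l) to centroids (K, R_l)."""
    dist = torch.cdist(h, centroids)                     # ||h^l - C_j^l||_2
    sim = (1.0 + dist / lam) ** (-(lam + 1.0) / 2.0)
    return sim / sim.sum(dim=1, keepdim=True)            # normalise over clusters

def cluster_allegiance(sim, labels, n_classes):
    """Eq. (5): per-cluster class tendencies from the few labelled prerecorded samples."""
    one_hot = torch.nn.functional.one_hot(labels, n_classes).float()  # (N_m, m)
    votes = one_hot.t() @ sim                            # (m, K): summed phi per class
    return (votes / votes.sum(dim=0, keepdim=True)).t()  # (K, m): Ale_{j,o}

def predict(h_per_layer, centroids_per_layer, allegiance_per_layer):
    """Eq. (6): winner-takes-all per layer, then sum the local scores across layers."""
    global_score = 0.0
    for h, C, Ale in zip(h_per_layer, centroids_per_layer, allegiance_per_layer):
        sim = cluster_similarity(h, C)
        win = sim.argmax(dim=1)                          # nearest cluster per sample
        global_score = global_score + Ale[win]           # local score = winner's allegiance
    return global_score.argmax(dim=1)                    # class maximising the summed score
```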


Figure 2: Network Structure of LEOPARD: LEOPARD adopts the different-depth network structure where the clustering module is implemented in every layer of SAE thus producing its own local outputs. The final predicted label is aggregated across different embedding layers.


Figure 3: LEOPARD operates in the extreme label scarcity condition where only prerecorded samples of the source stream are labelled while the rest are unlabelled. The learning algorithm of LEOPARD consists of three modules (feature extractor, classifier, domain classifier). The feature extractor generates latent features, the classifier produces the final prediction, and the domain classifier identifies the sample origin. The feature extractor is updated by taking the gradient of the clustering loss and the cross-domain loss. The gradient reversal layer is implemented to change the sign of the gradient of the cross-domain loss. The classifier is updated by minimizing the clustering loss and the domain classifier is adjusted by minimizing the cross-domain loss. The classifier features a self-evolving characteristic whereas the domain classifier and feature extractor have fixed structures.

IV-B Parameter Learning of LEOPARD

Adversarial Domain Adaptation: the idea of domain adaptation is to minimize the divergence between the target domain and the source domain. The concept of adversarial domain adaptation is founded on the idea of the H divergence [3], which relies on a hypothesis class H, a set of binary classifiers. Definition 1 [20]: Given the two domains D_{S} and D_{T} and the hypothesis class H, the H divergence between D_{S} and D_{T} is defined as follows:

d_{H}(D_{S},D_{T})=2\sup_{\eta\in H}|\Pr_{x\sim D_{S}}[\eta(x)=1]-\Pr_{x\sim D_{T}}[\eta(x)=1]| (7)

The H divergence in (7) relies on the hypothesis class H to distinguish data samples generated from D_{S} from data samples generated from D_{T}. In [20], the empirical H divergence can be used in the case of a symmetric hypothesis class H:

d_{H}(D_{S},D_{T})=2\left(1-\min_{\eta\in H}\left[\frac{1}{n}\sum_{x\sim D_{S}}I[\eta(x)=0]+\frac{1}{n^{\prime}}\sum_{x\sim D_{T}}I[\eta(x)=1]\right]\right) (8)

where I[a] denotes an indicator function returning 1 if a is true or 0 otherwise. This implies that the H divergence can be minimized by finding a representation where the source and target samples are indistinguishable [3].
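For concreteness, the empirical estimate in (8) can be evaluated as below once a binary domain classifier is available; the function and argument names are hypothetical, and the minimum over H is approximated in practice by the trained domain classifier.

```python
def empirical_h_divergence(eta, X_source, X_target):
    """Evaluate Eq. (8) for a single binary classifier eta(x) in {0, 1}.

    eta labels a sample with 1 when it believes the sample comes from the
    target domain; the min over the hypothesis class H is approximated in
    practice by plugging in a trained domain classifier.
    """
    n, n_prime = len(X_source), len(X_target)
    term_s = sum(int(eta(x) == 0) for x in X_source) / n        # I[eta(x)=0] on source
    term_t = sum(int(eta(x) == 1) for x in X_target) / n_prime  # I[eta(x)=1] on target
    return 2.0 * (1.0 - (term_s + term_t))
```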

The concept of adversarial domain adaptation can be implemented by deploying a domain classifier \zeta_{W_{DC}}(F_{W_{f}}(.)) working along with the feature extractor F_{W_{f}}(.) and a classifier \xi_{W_{C}}(F_{W_{f}}(.)). The domain classifier predicts the origin of data samples, whether they are generated by the source domain D_{S} or the target domain D_{T}, while the classifier generates the final output of the network. The gradient reversal layer is implemented in updating the feature extractor such that indistinguishable features of the source and target domains are induced. That is, the overall loss function is written as follows:

L=\frac{1}{N_{S}}\sum_{n_{1}=1}^{N_{S}}L_{\xi}(\xi_{W_{C}}(F_{W_{f}}(x_{n_{1}})),y_{n_{1}})-\lambda\Big(\frac{1}{N_{S}}\sum_{n_{2}=1}^{N_{S}}L_{\zeta}(\zeta_{W_{DC}}(F_{W_{f}}(x_{n_{2}})),d_{n_{2}})+\frac{1}{N_{T}}\sum_{n_{3}=1}^{N_{T}}L_{\zeta}(1-\zeta_{W_{DC}}(F_{W_{f}}(x_{n_{3}})),d_{n_{3}})\Big) (9)

where L_{\xi,\zeta}(.) is implemented as the cross entropy loss function and d_{n} is the domain identity, i.e., 1 for the source domain and 0 for the target domain. From (9), the gradient reversal layer inserts a negative constant confusing the domain classifier, i.e., generating indistinguishable samples. The parameter learning process is formulated as follows:

W_{f}=W_{f}-\mu\Big(\frac{\partial L_{\xi}}{\partial W_{f}}-\alpha_{1}\frac{\partial L_{\zeta}}{\partial W_{f}}\Big) (10)
W_{C}=W_{C}-\mu\frac{\partial L_{\xi}}{\partial W_{C}} (11)
W_{DC}=W_{DC}-\mu\lambda\frac{\partial L_{\zeta}}{\partial W_{DC}} (12)

where the feature extractor is trained to produce similar features for the two domains, as reflected by the negative sign of the gradient, thereby leading to a domain-invariant network.
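A common way to realise the sign flip in (10) within automatic differentiation is a gradient reversal layer; the sketch below is a generic PyTorch illustration (names and the usage snippet are illustrative, not LEOPARD's released code).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass,
    gradient multiplied by -alpha in the backward pass."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

def grad_reverse(x, alpha=0.1):
    return GradReverse.apply(x, alpha)

# Typical usage in the adversarial branch (hypothetical module names):
#   z = feature_extractor(x)                          # F_{W_f}(x)
#   domain_logit = domain_classifier(grad_reverse(z, alpha))
#   loss_cd = bce(domain_logit, d)                    # descent on W_DC,
#                                                     # ascent on W_f via the reversal
```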
Loss Function: the parameter learning strategy of LEOPARD is constructed using a joint loss function comprising two modules: clustering loss and cross-domain adaptation loss. The underlying goal is to produce domain-invariant parameters as well as clustering-friendly latent spaces such that the online cross domain adaptation can be solved under extreme label scarcity. The overall cost function is formalized as follows:

L_{all}=L_{cluster}-\alpha_{1}L_{cd} (13)

where L_{cluster},L_{cd} respectively denote the clustering loss and the cross-domain adaptation loss while \alpha_{1} is a trade-off constant controlling the influence of the cross-domain adaptation loss. It is an unconstrained optimization problem which can be optimized using the stochastic gradient descent approach in a single pass or with a few epochs per batch to assure scalability in streaming environments. That is, a number of iterations is performed per batch, and a data batch is discarded once the iterations across the predetermined number of epochs are completed, bounding complexity. The negative sign in (13) follows the gradient reversal strategy generating similar features across the two domains. In other words, the gradient of the cross-domain loss, i.e., the domain classifier loss, is subtracted from the gradient of the clustering loss [3].
Clustering-Friendly Latent Space: the clustering loss aims to achieve the clustering-friendly latent space via simultaneous feature learning and clustering. The clustering loss is formulated as the reconstruction loss plus the KL divergence loss minimizing the probabilistic distance between the latent-space distribution and the auxiliary target distribution [17]:

L_{cluster}=\underbrace{L_{\xi}(x_{S,T},\hat{x}_{S,T})}_{L_{1}}+\underbrace{\sum_{l=1}^{L}\left(L_{\xi}(h_{S,T}^{l},\hat{h}_{S,T}^{l})+\alpha_{2}KL(\phi^{l}|\Phi^{l})\right)}_{L_{2}} (14)

where \Phi^{l} is the auxiliary target distribution of the l-th latent space and \alpha_{2} is a regularization constant controlling the strength of the KL divergence loss. \phi^{l} is the similarity degree of the current sample to existing clusters. L_{\xi}(.) is the reconstruction loss formed as the mean square error (MSE) loss function. It also performs nonlinear dimension reduction preventing the trivial solutions often happening in the case of linear mapping. It guarantees that a data sample can be mapped back to its original representation. The key difference between the two loss functions lies in the adaptation mechanism in which L_{1} is solved in the end-to-end fashion while L_{2} is carried out in the layer-wise fashion.

The last term, also known as the KL divergence loss, minimizes the discrepancy between the distribution of the current data batch calculated via (4) and the auxiliary target distribution, KL(\phi^{l}|\Phi^{l})=\sum_{i}\sum_{j}\phi_{i,j}^{l}\log{\frac{\phi_{i,j}^{l}}{\Phi_{i,j}^{l}}}. The auxiliary target distribution should satisfy three requirements [17]: 1) improve prediction; 2) emphasize samples of high confidence; 3) normalize the loss contribution of each cluster to avoid the creation of large clusters. We adopt the same auxiliary distribution as in [17] where \Phi_{i,j}^{l} is obtained by raising \phi_{i,j}^{l} to the second power and normalizing by the frequency per cluster:

\Phi_{i,j}^{l}=\frac{(\phi_{i,j}^{l})^{2}/\zeta_{j}}{\sum_{j=1}^{Clus^{l}}(\phi_{i,j}^{l})^{2}/\zeta_{j}} (15)

where \zeta_{j}=\sum_{i=1}^{N}\phi_{i,j}^{l} is the frequency of a cluster. This strategy is understood as the soft-cluster assignment [17] where all clusters are updated, and differs from the hard-cluster assignment which only tunes the winning cluster. The clustering mechanism is hard to conduct in a high-dimensional space [21], thus calling for feature learning steps to be committed simultaneously. The clustering process takes place in every latent space h^{l} set as the common feature space between the source and target domains. That is, (14) is executed using samples of both the source and target streams. This process also functions as an implicit domain adaptation strategy since the minimization of the reconstruction loss across the two streams with shared parameters ends up with an overlapping region of both domains [6]. The optimization procedure takes place simultaneously where the network parameters and the cluster parameters are adjusted concurrently with the SGD method.
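The per-layer KL term and the auxiliary distribution (15) can be sketched as below; this is a minimal illustration assuming batched soft assignments per layer, with the detached target and the function names being assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def target_distribution(phi):
    """Eq. (15): sharpen soft assignments phi (batch, clusters),
    normalised by the cluster frequency zeta_j."""
    weight = phi ** 2 / phi.sum(dim=0, keepdim=True)     # (phi_ij)^2 / zeta_j
    return weight / weight.sum(dim=1, keepdim=True)      # renormalise per sample

def clustering_loss(x, x_hat, h_layers, h_hat_layers, phi_layers, alpha2=1.0):
    """Eq. (14): end-to-end reconstruction (L1) plus layer-wise
    reconstruction and KL divergence (L2)."""
    loss = F.mse_loss(x_hat, x)                          # L1
    for h, h_hat, phi in zip(h_layers, h_hat_layers, phi_layers):
        Phi = target_distribution(phi).detach()          # auxiliary target, treated as fixed
        # KL(phi | Phi) = sum_i sum_j phi_ij * log(phi_ij / Phi_ij)
        kl = F.kl_div(Phi.clamp_min(1e-8).log(), phi, reduction="batchmean")
        loss = loss + F.mse_loss(h_hat, h) + alpha2 * kl # L2 per layer
    return loss
```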
Domain-Invariant Network: LEOPARD consists of three sub-modules, a feature extractor F(.), a classifier \xi(.) and a domain classifier \zeta(.), to achieve the domain-invariant property as depicted in Fig. 3. The feature extractor is parameterized by W_{f} and the classifier, formed as the deep clustering module, is parameterized by W_{C}\in\{W_{enc}^{l},W_{dec}^{l},C^{l}\}, while the domain classifier, formed as a single hidden layer network, is parameterized by W_{DC}. The feature extractor and the domain classifier play a minimax game via the gradient reversal layer where the feature extractor is trained to fool the domain classifier via the production of similar features of source and target streams while the domain classifier is trained to identify the origin of data samples. The cross-domain adaptation loss is thus formulated as the domain classifier loss as follows:

L_{cd}=\frac{1}{N_{S}}\sum_{n=1}^{N_{S}}L_{\zeta}(\zeta_{W_{DC}}(F_{W_{f}}(x_{n})),d_{n})+\frac{1}{N_{T}}\sum_{n^{\prime}=1}^{N_{T}}L_{\zeta}(1-\zeta_{W_{DC}}(F_{W_{f}}(x_{n^{\prime}})),d_{n^{\prime}}) (16)

where d_{n} stands for the origin of data samples, i.e., 1 for the source stream and 0 for the target stream. The domain classifier is tasked to solve a binary classification problem where L_{\zeta}(.) is set as the cross entropy loss function. This leads to similar parameter learning processes for the feature extractor, domain classifier and classifier respectively as in (10)-(12), except for the presence of the clustering loss (14) instead of the cross-entropy loss: W_{f}=W_{f}-\mu(\frac{\partial L_{cluster}}{\partial W_{f}}-\alpha_{1}\frac{\partial L_{cd}}{\partial W_{f}});\;W_{C}=W_{C}-\mu\frac{\partial L_{cluster}}{\partial W_{C}};\;W_{DC}=W_{DC}-\mu\alpha_{1}\frac{\partial L_{cd}}{\partial W_{DC}}, where \mu denotes the learning rate. Note that the gradient reversal layer has no parameters and simply alters the sign of the gradients, allowing the maximization process to be carried out via the stochastic gradient descent approach. This only applies to the feature extractor as illustrated in Fig. 3.

IV-C Structural Learning of LEOPARD

Evolution of Clusters: The classifier of LEOPARD implements a self-organizing mechanism of network clusters where clusters are flexibly grown in every hidden layer h^{l} if changing data distributions are identified. Furthermore, it is performed for both source data samples h_{S}^{l} and target data samples h_{T}^{l}. That is, the clustering mechanism does not generate stream-specific clusters. Suppose that D(X,Y) stands for the L_{2} distance between two variables X,Y and the i-th cluster of the l-th layer is parameterized by its centre C_{i}^{l}, the growing condition is formulated as follows:

\min_{i=1,...,Clus^{l}}D(h^{l},C_{i}^{l})>\mu_{D,i}^{l}+k_{1}\sigma_{D,i}^{l} (17)

where \mu_{D,i}^{l},\sigma_{D,i}^{l} denote the mean and standard deviation of the distance D(h^{l},C_{i}^{l}) of the i-th cluster of the l-th layer while k_{1}=2\exp(-\|h^{l}-C_{win}^{l}\|)+2, leading to a dynamic confidence degree. The dynamic confidence degree enables the cluster growing phase to be carried out in the case of a far proximity between a data sample and the winning cluster. (17) examines the coverage span of existing clusters where a new cluster is inserted if a data sample is remote from the influence zone of existing clusters or a concept drift develops. A new cluster is crafted by assigning the current sample of interest as the cluster’s center, C_{Clus^{l}+1}^{l}=h^{l}, and setting the cluster’s cardinality to N_{Clus^{l}+1}=1.
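A sketch of this growing rule is given below; it assumes running distance statistics are kept per cluster (their online update is omitted) and that the winning cluster's statistics are used in the test, so the variable names are illustrative.

```python
import torch

def maybe_grow_cluster(h, centroids, dist_mean, dist_std, cardinality):
    """Growing rule (17): add a cluster when the nearest centroid is too far away."""
    d = torch.cdist(h.unsqueeze(0), centroids).squeeze(0)     # D(h^l, C_i^l) for all i
    win = d.argmin()
    k1 = 2.0 * torch.exp(-d[win]) + 2.0                       # dynamic confidence degree
    if d[win] > dist_mean[win] + k1 * dist_std[win]:          # Eq. (17)
        centroids = torch.cat([centroids, h.unsqueeze(0)])    # new centre C = h^l
        cardinality = torch.cat([cardinality, torch.ones(1)]) # cardinality N = 1
    return centroids, cardinality
```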
Evolution of Network Structure: The classifier of LEOPARD is equipped with hidden node growing and pruning strategies adapting to the concept drifts of data streams. That is, this mechanism takes place for both the source stream and the target stream. The self-organizing mechanism is controlled by the network significance (NS) method [22] adopting the bias-variance decomposition concept for every layer. That is, a high bias situation leads to the introduction of a new node while a high variance condition triggers the node pruning mechanism. Note that the network bias and variance here are evaluated with respect to the local error of a layer. All of this is carried out in an unsupervised fashion with respect to the reconstruction error. The network significance (NS) method is formalized as follows:

NS=(E[\hat{h}^{l}]-h^{l})^{2}+(E[(\hat{h}^{l})^{2}]-E[\hat{h}^{l}]^{2}) (18)

(18) can be solved by finding the expected output E[\hat{h}^{l}] under a certain probability density function p(x), assumed to follow the normal distribution N(\mu,\sigma^{2}) with mean \mu and variance \sigma^{2}. The bottleneck of this approach is found in the case of drift, p(x)_{t}\neq p(x)_{t+1}, where it does not keep pace with rapidly changing distributions. To correct this shortcoming, the Autonomous Gaussian Mixture Model (AGMM) can be used to estimate a complex probability density function p(x) as done in [6]. However, AGMM is computationally expensive and often unstable in the high input dimension case due to the use of the product norm. Furthermore, we deal with a multi-layer network here, doubling the complexity of AGMM.

The hidden unit growing and pruning steps are signalled by the statistical process control (SPC) approach [23] commonly used for anomaly detection tasks. The SPC method is applied here to detect the high bias or high variance condition and written as follows:

\mu_{bias}^{n,l}+\sigma_{bias}^{n,l}\geq\mu_{bias}^{min,l}+k_{2}\sigma_{bias}^{min,l} (19)
\mu_{var}^{n,l}+\sigma_{var}^{n,l}\geq\mu_{var}^{min,l}+2k_{3}\sigma_{var}^{min,l} (20)

The SPC method is generalized here using k_{2}=1.3\exp{(-Bias^{2})}+0.7 and k_{3}=1.3\exp{(-Var^{2})}+0.7. This modification leads to dynamic confidence levels enabling flexible growing and pruning phases. That is, the node growing process is likely to be performed in the case of a high bias while being strict in the case of a low bias. The same applies to the node pruning mechanism. \mu_{bias}^{min,l} and \sigma_{bias}^{min,l} are reset if the growing condition (19) is satisfied. On the other hand, if the pruning condition is met, \mu_{var}^{min,l} and \sigma_{var}^{min,l} are reset. The initialization of a new node is carried out using the Xavier initialization strategy. The node with the least statistical contribution is subject to the pruning step if (20) is observed. Since LEOPARD is constructed under a different-depth structure where every layer performs its own clustering mechanism and produces its local output, the growing and pruning steps are undertaken independently per layer. Furthermore, this mechanism occurs for both source and target streams to anticipate the asynchronous drift problem where the network structure is shared across the two domains.
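The growing and pruning tests (19)-(20) with the dynamic confidence levels can be sketched as follows; `stats` is a hypothetical container for the running and minimum bias/variance statistics, whose online updates are omitted here.

```python
import math

def spc_signals(bias, var, stats):
    """SPC-style tests (19)-(20) with dynamic confidence degrees k2, k3."""
    k2 = 1.3 * math.exp(-bias ** 2) + 0.7   # shrinks toward 0.7 when bias is high -> easier to grow
    k3 = 1.3 * math.exp(-var ** 2) + 0.7    # shrinks toward 0.7 when variance is high -> easier to prune

    grow = stats.mean_bias + stats.std_bias >= stats.min_mean_bias + k2 * stats.min_std_bias
    prune = stats.mean_var + stats.std_var >= stats.min_mean_var + 2 * k3 * stats.min_std_var

    if grow:    # reset the recorded bias minima after a node is added
        stats.min_mean_bias, stats.min_std_bias = stats.mean_bias, stats.std_bias
    elif prune: # reset the recorded variance minima after a node is pruned
        stats.min_mean_var, stats.min_std_var = stats.mean_var, stats.std_var
    return grow, prune
```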

The classifier of LEOPARD is fitted with a hidden layer growing mechanism which expands the network depth based on a drift detection mechanism [24]. The drift detection mechanism is designed from the concept of the Hoeffding bound and analyzes the dynamics of the latent features Z to identify a change of the marginal distribution. Note that no labelled samples are offered for model updates and the drift detection approach is executed for both source and target streams. The addition of a network layer is desired in practice because it is capable of boosting network capacity significantly, thus enhancing the model’s generalization. The drift detection procedure starts by finding the cutting point, a point where the population mean increases. A cutting point is declared by the following condition:

\hat{P}+\epsilon_{P}\geq\hat{Q}+\epsilon_{Q} (21)

where P\in\Re^{2N} is a data matrix containing two consecutive data batches [B_{k-1},B_{k}], i.e., the previous and current data batches, while Q\in\Re^{cut} is a data matrix with cut as the hypothetical cutting point of interest, cut<2N. Two data batches are applied here to increase the sensitivity of the cutting point identification because latent features are relatively stable compared to the original input space. The hypothetical cutting points are arranged as cut=[25\%,50\%,75\%]\times 2N instead of every point to avoid false alarms. \hat{P},\hat{Q} denote the statistics of the data matrices P,Q. \epsilon_{P,Q} stand for the error bounds derived from the concept of the Hoeffding bound as follows:

\epsilon_{P,Q}=\sqrt{\frac{1}{2\times size}\ln{\frac{1}{\alpha_{x}}}} (22)

where \alpha_{x} is the significance level being inversely proportional to the confidence level 1-\alpha_{x} while size refers to the size of the data matrix of interest P,Q.

Once the cutting point of interest cut is elicited, a data matrix R\in\Re^{2N-cut} is constructed. A drift is signalled if |\hat{R}-\hat{Q}|\geq\epsilon_{D}. Besides the drift condition, a warning condition is set and pinpoints a case where a drift needs to be confirmed by the next data batch. That is, \epsilon_{W}\leq|\hat{R}-\hat{Q}|\leq\epsilon_{D} where \alpha_{D}<\alpha_{W}. The error bounds \epsilon_{D,W} are defined as follows:

\epsilon_{D,W}=(b-a)\times\sqrt{\frac{size-cut}{2\times cut\times size}\ln{\frac{1}{\alpha_{D,W}}}} (23)

where [a,b] denotes the range of the data matrix P. A new layer is created if a concept drift is found. The number of nodes of the new layer is set to half the width of the previous layer l-1. This step enables nonlinear feature reduction and avoids an over-complete network. The domain classifier and the feature extractor have a fixed structure because the structural learning of the classifier suffices to address the asynchronous drift problem.
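The layer-growing drift test (21)-(23) can be sketched as below; the sketch operates on 1-D summaries of two consecutive latent feature batches, uses the batch means as the statistics \hat{P},\hat{Q},\hat{R}, and omits the confirmation of a warning by the next batch, all of which are assumptions made for illustration.

```python
import numpy as np

def hoeffding_bound(size, alpha):
    """Eq. (22): error bound for a sample of the given size at significance alpha."""
    return np.sqrt(np.log(1.0 / alpha) / (2.0 * size))

def detect_drift(Z_prev, Z_cur, alpha_x=0.001, alpha_d=0.001, alpha_w=0.005):
    """Return "drift", "warning" or "stable" for two consecutive latent batches."""
    P = np.concatenate([Z_prev, Z_cur])              # two consecutive batches, length 2N
    n = len(P)
    a, b = P.min(), P.max()

    for frac in (0.25, 0.50, 0.75):                  # hypothetical cutting points
        cut = int(frac * n)
        Q = P[:cut]
        # Eq. (21): a cutting point is declared when the prefix statistic increases
        if P.mean() + hoeffding_bound(n, alpha_x) >= Q.mean() + hoeffding_bound(cut, alpha_x):
            R = P[cut:]
            gap = abs(R.mean() - Q.mean())
            scale = (b - a) * np.sqrt((n - cut) / (2.0 * cut * n))   # Eq. (23) prefactor
            eps_d = scale * np.sqrt(np.log(1.0 / alpha_d))
            eps_w = scale * np.sqrt(np.log(1.0 / alpha_w))
            if gap >= eps_d:
                return "drift"
            if eps_w <= gap:
                return "warning"
    return "stable"
```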

IV-D Algorithm

The learning policy of LEOPARD is visualized in Fig. 3 and Algorithm 1 where LEOPARD is driven by the feature extractor, the classifier and the domain classifier. The forward pass is done by feeding raw input attributes x_{S,T} to the feature extractor F(.), leading to latent input features Z_{S,T}. The latent features are passed to the classifier \xi(.) implemented as the SAE and the clustering module. Note that the clustering module exists in every layer of the SAE, producing its own local output Score^{l}, where majority voting is performed to generate the final predicted output. The learning process starts with a warm-up phase using N_{init} unlabelled samples iterated across E epochs to avoid the cold start problem. This process only involves the reconstruction losses L_{\xi}(x_{S,T},\hat{x}_{S,T}) and L_{\xi}(h^{l}_{S,T},\hat{h}^{l}_{S,T}), affecting only the network parameters W_{f} and W_{enc}^{l},W_{dec}^{l}. The main training loop is executed by minimizing L_{cluster}(.) and is applied to W_{f},W_{enc}^{l},W_{dec}^{l},C^{l}. The minimization of the clustering loss across the two domains can also be seen as a domain adaptation strategy because it leads to an overlapping region of the source domain and the target domain, i.e., both the source stream and the target stream are used under shared parameters. The adversarial domain adaptation is carried out afterward by minimizing L_{cd}(.) where the domain classifier \zeta_{W_{DC}}(.) is updated as well as the feature extractor F_{W_{f}}(.) using the cross-domain loss. The gradient reversal strategy is adopted when adjusting the feature extractor, thus converting the minimization problem into a maximization problem and in turn resulting in indistinguishable features of the source stream and the target stream. The cross-domain adaptation strategy makes it possible for the source and target streams, following different distributions, to be mapped similarly, i.e., the covariate shift is addressed.

Input: Source streaming data \{B_{0}^{S},B_{1}^{S},B_{2}^{S},...,B_{K_{S}}^{S}\}, target streaming data \{B_{1}^{T},B_{2}^{T},...,B_{K_{T}}^{T}\}, initialization epochs E_{init}, batch number of source and target streaming data b_{k}, epoch number E.
Output: Network parameters of the feature extractor W_{f}, classifier W_{C} and domain classifier W_{DC}. Average accuracy Acc.
1: for i=1:E_{init} do
2:     Initialize clusters using the scarcely labelled data B_{0}^{S};
3: end for
4: for j=1:E do
5:     Network layer evolution of the classifier (SAE) \xi_{W_{C}} by Eq. (23);
6:     Hidden unit growing and pruning of the classifier (SAE) \xi_{W_{C}} by Eqs. (19) and (20);
7:     L_{cluster}=L_{\xi}(x_{S,T},\hat{x}_{S,T})+\sum_{l=1}^{L}(L_{\xi}(h_{S,T}^{l},\hat{h}_{S,T}^{l})+\alpha_{2}KL(\phi^{l}|\Phi^{l}));
8:     Update the feature extractor parameters W_{f} and classifier parameters W_{C} with respect to L_{cluster};
9:     L_{cd}=\frac{1}{N_{S}}\sum_{n=1}^{N_{S}}L_{\zeta}(\zeta_{W_{DC}}(F_{W_{f}}(x_{n})),d_{n})+\frac{1}{N_{T}}\sum_{n^{\prime}=1}^{N_{T}}L_{\zeta}(1-\zeta_{W_{DC}}(F_{W_{f}}(x_{n^{\prime}})),d_{n^{\prime}});
10:    Update the feature extractor parameters W_{f} and domain classifier parameters W_{DC} with respect to L_{cd};
11: end for
12: return W_{f},W_{C},W_{DC} and average accuracy Acc;
Algorithm 1 LEOPARD

The structural learning process occurs in both the initialization phase and the main training phase, and includes the cluster growing process, the hidden node growing and pruning processes and the hidden layer growing process. As with the warm-up phase, an initialization phase using N_{init} prerecorded samples over E epochs is implemented if a new layer is created. It is obvious that LEOPARD does not exploit any labelled samples for model updates except for the labelled samples used to calculate the cluster allegiance (5). The structural learning mechanism addresses the issue of asynchronous drifts across both streams.

V Numerical Study

This section presents a numerical validation of LEOPARD putting forward nine datasets leading to 24 independent numerical results. An ablation study is added in this section to further numerically validate the contribution of each learning component. The source code of LEOPARD can be found at https://github.com/wengweng001/LEOPARD.git. Our analysis of label proportions and visualizations of LEOPARD’s learning performances are offered in the supplemental document.

V-A Dataset

MNIST(MN)\leftrightarrowUSPS(US): this problem presents a digit recognition problem having 10 classes. The data samples are formed by gray-scale images of hand-written digits resized to 28\times 28 in both the US\rightarrowMN and MN\rightarrowUS cases.
Amazon@X(AM): this is a multi-domain sentiment analysis problem encompassing product reviews obtained from Amazon.com. X stands for the product type [25]. Five product types, namely beauty, books, industrial, luxury and magazine, are selected here where the cross-domain multistream classification problem is formulated with two products with similar contexts but different topics. The averaged summed outputs from Google’s word2vec model pretrained on 100 billion words [26] are used to perform feature extraction.
Office31: this problem presents three domains: amazon (A), DSLR (D) and Webcam (W). It comprises 31 categories of office objects. We present the case of D\leftrightarrowW where D comprises 498 images and W consists of 795 images. The characteristics of the nine datasets are summarized in Table I.

V-B Simulation Protocol

The numerical study is carried out using the prequential test-then-train protocol as per [23], where a model is tested first before being updated with the same data batch. The numerical evaluation is independently undertaken per batch where the numerical result is averaged across all batches. Our simulation is repeated 5 times to guarantee the consistency of numerical results where the final numerical results are averaged over 5 independent runs. The asynchronous drift problem is induced by applying the scaling hyper-plane strategy [27, 28] where a data stream is scaled as x_{i}=\frac{d_{z}\times x_{i}}{||x||}. d_{z} is a randomly generated concept drift vector where z is the number of concept drifts in the stream: z=1 for every source stream and z=1 for every target stream. A fixed random seed is selected in setting d_{z} to assure fair comparison. In the MN\leftrightarrowUS problems, the concept drift occurs at k=35 for the source stream and k=36 for the target stream, whereas the concept drift takes place at k=5 for the source stream and k=6 for the target stream in the amazon@X and Office31 problems. These configurations ensure that the asynchronous drift is present.
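A minimal sketch of this drift injection is shown below; whether d_z is a scalar or a per-feature vector and its sampling range are assumptions here, since only the scaling formula and the fixed seed are specified above.

```python
import numpy as np

def inject_drift(X, seed=0, low=0.5, high=2.0):
    """Scaling hyper-plane drift [27, 28]: x_i <- d_z * x_i / ||x_i||.

    A fixed seed keeps the drift vector d_z identical across algorithms so the
    comparison stays fair; the uniform sampling range is an assumption.
    """
    rng = np.random.default_rng(seed)
    d_z = rng.uniform(low, high, size=X.shape[1])        # random concept drift vector
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return (d_z * X) / np.clip(norms, 1e-12, None)
```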

TABLE I: Characteristics of Datasets
Dataset Attributes Labels Samples NB
MNIST(MN) 784 10 70000 65
USPS(US) 256 10 9298 65
Amazon@Beauty(AM1) 300 5 5150 20
Amazon@Books(AM2) 300 5 500000 20
Amazon@Industrial(AM3) 300 5 73146 20
Amazon@Luxury(AM4) 300 5 33784 20
Amazon@Magazine(AM5) 300 5 2230 20
Office31(D) 36636672 31 498 10
Office31(W) 921600 31 795 10

NB: Number of Batches

V-C Baseline

LEOPARD is compared with five algorithms: the autonomous deep clustering network (ADCN) [29], the deep clustering network (DCN) [30], an autoencoder followed by K-Means (AE+KMeans), deep embedding clustering (DEC) [17] and domain adversarial neural networks (DANN) [3]. ADCN is a self-evolving deep clustering network where hidden clusters, nodes and layers are grown and pruned dynamically. Its loss function is formulated as a combination of a clustering loss and a reconstruction loss. ADCN is not equipped with a specific domain adaptation loss function and applies the hard cluster assignment approach as in [30], i.e., the L_{2} distance loss between the winning cluster and the latent sample is put forward. DCN adopts a fixed network structure where the clustering mechanism only takes place at the bottleneck layer. It applies the same loss function as ADCN. AE+KMeans differs from DCN in that the clustering mechanism is carried out after the training process. It does not utilize any clustering loss. DEC adopts the soft-assignment approach as with LEOPARD except that it relies on a static network structure and suffers from the absence of any domain adaptation loss. The reconstruction loss in the baseline algorithms is perceived as a domain adaptation procedure because it is carried out for both source and target streams under shared network parameters. DANN utilizes the adversarial domain adaptation as per LEOPARD without any clustering mechanism.

All of them work under the extreme label scarcity condition as with LEOPARD where access to true class labels is only provided for the prerecorded samples of the source stream while no label is offered during the process runs for either the source stream or the target stream. The comparison with ADCN is done by executing its published code to assure a fair comparison. We utilize our own implementations of DCN, AE+KMeans, DEC and DANN.

V-D Hyperparameters

The learning rate and momentum of LEOPARD are allocated as 0.01 and 0.95 while the regularization constant of the clustering loss \alpha_{2} is set as 1 and the trade-off constant of the cross-domain loss \alpha_{1} is set as 0.1. LEOPARD also depends on labelled prerecorded samples B_{0}^{S} of the source stream set as 10\% of the source samples proportionally taken from each class, N_{m}=10\%N_{S}. That is, each class contributes the same number of samples. The number of initial epochs is set as E=100 (amazon@X), E=50 (MN\leftrightarrowUS), and E=500 (D\leftrightarrowW) respectively. The initialization phase is carried out using the labelled prerecorded samples of the source stream. The parameters of the drift detector \alpha_{x},\alpha_{D},\alpha_{W} are selected respectively as 0.001, 0.001, 0.005. For the amazon@X problems, LEOPARD runs in the one-pass learning procedure whereas for the MN\leftrightarrowUS experiments, the training process of the clustering loss adopts the epoch-per-batch strategy with 10 epochs (MN\rightarrowUS, W\leftrightarrowD) and 5 epochs (US\rightarrowMN) respectively. The epoch-per-batch strategy satisfies the online learning requirement because a data batch is discarded after training over the predetermined epochs. The same setting is also applied to the baseline algorithms, assuring fair comparisons.
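For reference, the hyperparameters stated above can be summarized in a single configuration; the dictionary layout below is merely illustrative and not the format of the released code.

```python
# Values taken from the text above; only the layout of this dictionary is an assumption.
LEOPARD_CONFIG = {
    "learning_rate": 0.01,
    "momentum": 0.95,
    "alpha_2": 1.0,                     # clustering (KL) loss regularization constant
    "alpha_1": 0.1,                     # cross-domain loss trade-off constant
    "labelled_source_fraction": 0.10,   # N_m = 10% of N_S, class-balanced
    "init_epochs": {"amazon@X": 100, "MN<->US": 50, "D<->W": 500},
    "drift_detector": {"alpha_x": 0.001, "alpha_D": 0.001, "alpha_W": 0.005},
    "epochs_per_batch": {"amazon@X": 1, "MN->US": 10, "W<->D": 10, "US->MN": 5},
}
```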

For the MN\leftrightarrowUS problem, the feature extractor is formed as a convolutional neural network. The encoder part is constructed as 2 convolutional layers using 16 and 4 filters respectively, with a max-pooling layer in between. The decoder part is built upon two transposed convolutional layers with 4 and 16 filters respectively. For the amazon@X sentiment analysis problems, a multi-layer perceptron feature extractor is put forward with two hidden layers where the numbers of nodes are fixed as 300 and 100. For the office31 problem, ResNet34 is applied as the feature extractor. The initial number of nodes of the fully connected layer is simply assigned as 96 for the MNIST\leftrightarrowUSPS problem, 30 for the amazon@X sentiment analysis problems and 500 for D\leftrightarrowW. The ReLU activation function is applied for the intermediate layers while the decoder output utilizes the sigmoid activation function producing a normalized reconstructed output. The network structures of the baseline algorithms are set similarly to ensure a fair comparison. Further details of our numerical studies are explained in LEOPARD’s code shared at https://github.com/wengweng001/LEOPARD.git.

These parameters are fixed throughout all study cases to guarantee non ad-hoc performance of LEOPARD. The hyper-parameters of the baselines are selected as per the guidelines of their publications and hand-tuned if their performances are surprisingly compromised. The hyper-parameters of all consolidated algorithms are listed in the supplemental document.

TABLE II: Average Accuracy (%) of the Target Stream across 5 runs; * indicates statistically significant results and BOLD denotes the best numerical results
Experiments LEOPARD ADCN AE-kmeans DCN DEC DANN
AM1 \rightarrow AM2 20.6320 ±\pm 2.3958 19.8160 ±\pm 4.7555 27.8012 ±\pm 1.6392 27.7792 ±\pm 1.8319 18.3774 ±\pm 1.3356 15.2291 ±\pm 22.7966
AM1 \rightarrow AM3 *71.5300 ±\pm 1.0819 57.0520 ±\pm 10.7930 25.8713 ±\pm 1.6412 26.1010 ±\pm 1.9047 17.2324 ±\pm 1.5599 34.8559 ±\pm 29.7157
AM1 \rightarrow AM4 *57.7840 ±\pm 0.0476 43.9800 ±\pm 3.3004 27.9307 ±\pm 1.2602 28.1326 ±\pm 1.1633 16.6092 ±\pm 0.6464 44.4809 ±\pm 20.6428
AM1 \rightarrow AM5 *63.5100 ±\pm 1.2016 60.7980 ±\pm 3.0386 31.3240 ±\pm 0.8684 31.1996 ±\pm 0.8381 13.9947 ±\pm 1.0473 41.5806 ±\pm 13.6181
AM2 \rightarrow AM1 *71.5480 ±\pm 8.8031 25.2880 ±\pm 6.6382 36.7868 ±\pm 1.5799 36.9540 ±\pm 2.0093 8.8334 ±\pm 0.9026 49.4693 ±\pm 39.4318
AM2 \rightarrow AM3 45.4920 ±\pm 13.2055 19.9380 ±\pm 15.1533 31.0612 ±\pm 1.8636 30.9986 ±\pm 1.8202 14.2852 ±\pm 1.0375 43.4064 ±\pm 28.3212
AM2 \rightarrow AM4 *48.5600 ±\pm 3.5658 14.1460 ±\pm 1.7088 27.1297 ±\pm 1.0116 27.2251 ±\pm 0.9788 15.6150 ±\pm 1.4221 42.8489 ±\pm 14.2926
AM2 \rightarrow AM5 50.3680 ±\pm 15.8943 31.9160 ±\pm 5.8286 28.7212 ±\pm 1.6700 28.4330 ±\pm 1.0860 18.3333 ±\pm 2.6212 60.4227 ±\pm 8.1435
AM3 \rightarrow AM1 37.3520 ±\pm 4.7246 53.0180 ±\pm 17.7237 25.7504 ±\pm 1.2193 25.4591 ±\pm 1.4807 7.7442 ±\pm 1.5116 17.1871 ±\pm 19.2744
AM3 \rightarrow AM2 31.3120 ±\pm 8.9392 8.9900 ±\pm 2.0600 22.1118 ±\pm 1.1815 25.3291 ±\pm 5.8252 17.2165 ±\pm 1.6547 40.9799 ±\pm 13.1832
AM3 \rightarrow AM4 18.7240 ±\pm 1.0962 28.4340 ±\pm 4.5292 23.2826 ±\pm 1.4628 23.1707 ±\pm 1.6232 15.2796 ±\pm 2.2464 22.2822 ±\pm 17.1165
AM3 \rightarrow AM5 *59.5520 ±\pm 2.8339 37.1540 ±\pm 8.1052 22.1968 ±\pm 0.9799 22.3941 ±\pm 0.6026 16.0303 ±\pm 2.0662 20.1265 ±\pm 21.2838
AM4 \rightarrow AM1 45.5560 ±\pm 6.3118 69.4040 ±\pm 6.0339 23.2919 ±\pm 3.1591 23.4620 ±\pm 3.0392 8.0682 ±\pm 0.7888 55.4146 ±\pm 40.1126
AM4 \rightarrow AM2 23.2340 ±\pm 5.5658 21.9140 ±\pm 6.7656 21.5370 ±\pm 0.8304 21.3970 ±\pm 0.9446 18.1515 ±\pm 1.2285 22.8453 ±\pm 19.9388
AM4 \rightarrow AM3 58.0500 ±\pm 4.2461 62.6160 ±\pm 3.7864 22.4032 ±\pm 1.3615 22.6766 ±\pm 1.4520 17.2892 ±\pm 1.9414 54.9784 ±\pm 22.9740
AM4 \rightarrow AM5 *64.3480 ±\pm 0.0895 56.0280 ±\pm 2.6346 21.3376 ±\pm 1.1374 21.3101 ±\pm 0.9658 15.5601 ±\pm 1.4566 20.0455 ±\pm 13.7290
AM5 \rightarrow AM1 *87.6760 ±\pm 0.3844 56.3760 ±\pm 19.8715 20.1306 ±\pm 1.2201 20.4453 ±\pm 0.8831 8.8645 ±\pm 1.4946 61.0045 ±\pm 17.2396
AM5 \rightarrow AM2 12.5060 ±\pm 1.1611 10.8880 ±\pm 3.3736 19.3033 ±\pm 1.0140 19.6829 ±\pm 0.8429 15.7549 ±\pm 2.1423 36.7490 ±\pm 25.1993
AM5 \rightarrow AM3 *36.7900 ±\pm 6.2853 27.9480 ±\pm 3.3931 21.4889 ±\pm 2.2889 21.1898 ±\pm 2.2209 16.9886 ±\pm 3.9668 31.0573 ±\pm 32.7811
AM5 \rightarrow AM4 *50.7580 ±\pm 2.7020 32.5880 ±\pm 4.3853 19.8205 ±\pm 0.7625 19.5470 ±\pm 0.6045 16.0969 ±\pm 1.2545 33.5787 ±\pm 21.3489
MNIST \rightarrow USPS 45.3740 ±\pm 14.3497 62.9800 ±\pm 3.0666 10.1138 ±\pm 0.2647 9.9913 ±\pm 0.3020 10.3657 ±\pm 1.5318 23.1336 ±\pm 3.0172
USPS \rightarrow MNIST *49.4660 ±\pm 2.1841 33.3800 ±\pm 14.5818 10.0563 ±\pm 0.3269 9.6323 ±\pm 0.8033 9.3434 ±\pm 0.5433 39.0328 ±\pm 6.7573
D \rightarrow W *41.8080 ±\pm 9.8034 4.0000 ±\pm 0.5589 3.7722 ±\pm 0.3789 3.0633 ±\pm 0.7778 3.2658 ±\pm 0.6425 10.9821 ±\pm 2.7133
W \rightarrow D *35.2820 ±\pm 12.0613 3.7560 ±\pm 0.4568 2.9388 ±\pm 0.6657 3.1429 ±\pm 0.7370 2.8163 ±\pm 0.4725 6.4323 ±\pm 1.8659
  • * Statistical significance is determined by t-tests between LEOPARD and the other baselines, yielding 4 t-scores per experiment; a result is marked as statistically significant when the t-score exceeds 2.015 for at least 3 of the 4 comparisons.

V-E Numerical Results

From Table II, it is seen that LEOPARD outperforms the other algorithms in 15 of 24 cases with noticeable margins. This portrays the efficacy of the adversarial domain adaptation approach and the soft cluster assignment mechanism, both of which are absent from the baseline algorithms; i.e., ADCN, DCN and AE+KMeans adopt the hard cluster assignment strategy and lack the adversarial domain adaptation approach. This finding also confirms the advantage of adversarial domain adaptation over the feature reconstruction strategy with shared parameters across the two streams implemented in all baselines. The soft cluster assignment approach, where the cluster and network parameters are simultaneously optimized via SGD, performs better than the hard cluster assignment strategy, as demonstrated by the fact that LEOPARD beats ADCN in a significant number of cases. There is no significant performance difference between AE+KMeans and DCN. On the other hand, the importance of the structural learning component in handling data streams is clearly portrayed here: LEOPARD and ADCN are superior to the other algorithms, which have static structures. Such a mechanism allows timely reactions to the asynchronous drift problem across the source stream and the target stream. The performance of DANN, implemented under a conventional neural network structure, is far inferior to LEOPARD under the extreme label scarcity condition. This confirms the advantage of the clustering approach over the conventional neural network structure in reducing label dependencies. The statistical test is undertaken using the t-test ($p<0.05$), confirming the advantage of LEOPARD where it beats the other algorithms with statistically significant gaps in 13 of 24 cases.
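For reference, the following sketch illustrates the significance criterion described under Table II, computing independent two-sample t-scores between LEOPARD's per-run accuracies and those of each baseline; the accuracy arrays and the selection of baselines are placeholders rather than the actual experimental records, and the test variant (independent, equal-variance) is an assumption.

```python
# A minimal sketch (not the authors' script) of the Table II significance criterion:
# compute t-scores between LEOPARD's 5 per-run accuracies and each baseline's, and
# count how many exceed the 2.015 threshold. All values below are placeholders.
import numpy as np
from scipy import stats

leopard = np.array([71.2, 72.0, 70.9, 71.8, 71.7])          # 5 runs (placeholder)
baselines = {
    "ADCN":      np.array([56.1, 58.3, 55.0, 57.9, 57.8]),
    "AE-kmeans": np.array([25.4, 26.0, 25.9, 26.3, 25.7]),
    "DCN":       np.array([26.2, 26.4, 25.8, 26.5, 25.6]),
    "DEC":       np.array([17.0, 17.5, 16.9, 17.4, 17.3]),
}

THRESHOLD = 2.015
wins = 0
for name, acc in baselines.items():
    t_score, _ = stats.ttest_ind(leopard, acc)   # independent two-sample t-test
    print(f"{name}: t = {t_score:.3f}")
    wins += int(t_score > THRESHOLD)

# The text marks a case as statistically significant when at least 3 of the 4
# t-scores exceed the threshold.
print("statistically significant:", wins >= 3)
```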

TABLE III: Ablation Study of LEOPARD
Experiments | A | B | C | D | E
AM1 → AM3 | 56.6620 ± 7.3010 | 70.0860 ± 2.3159 | 51.7800 ± 5.3515 | 40.8860 ± 2.6187 | 71.5300 ± 1.0819
AM1 → AM4 | 49.4800 ± 3.9323 | 57.2160 ± 0.3800 | 49.3940 ± 3.5215 | 27.7520 ± 1.9921 | 57.7840 ± 0.0476
AM5 → AM1 | 85.6580 ± 2.3485 | 83.2040 ± 6.6126 | 86.7800 ± 1.5401 | 31.1860 ± 1.5308 | 87.6760 ± 0.3844
AM5 → AM4 | 29.6480 ± 5.5269 | 43.5760 ± 6.5908 | 29.4860 ± 3.4137 | 24.7100 ± 1.6597 | 50.7580 ± 2.7020
  • A: LEOPARD without network structure evolution and without the additional losses ($KL(\phi^{l}|\Phi^{l})$ and $L_{cd}$)
  • B: LEOPARD without network structure evolution
  • C: LEOPARD without $KL(\phi^{l}|\Phi^{l})$ and $L_{cd}$
  • D: LEOPARD using BERT as the feature extractor
  • E: the full LEOPARD model

V-F Ablation Study

This section studies the effect of LEOPARD’s learning modules by analyzing its performance when a particular learning module is deactivated. LEOPARD is configured into four ablated models: (A) the structural learning method is switched off, leaving LEOPARD with a static network structure, and the parameter learning strategy excludes both the cross-domain adaptation loss and the KL divergence loss; in short, LEOPARD is driven by the reconstruction loss only; (B) the structural learning strategy is deactivated while the parameter learning step utilizes both the cross-domain adaptation loss and the KL divergence loss; (C) the structural learning mechanism is activated but the KL divergence loss and the cross-domain adaptation loss are absent; (D) BERT is applied as the feature extractor in lieu of the word2vec model. Our ablation study is carried out using four study cases: AM1→AM3, AM1→AM4, AM5→AM1 and AM5→AM4.
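To make these configurations concrete, the sketch below shows one possible way of toggling the structural learning flag and the loss terms; the class and function names and the trade-off coefficients are illustrative assumptions and do not reproduce LEOPARD's actual loss definitions.

```python
# Illustrative sketch (assumed names, not the actual LEOPARD code) of how the
# ablation configurations (A)-(E) toggle the loss terms and structural learning.
from dataclasses import dataclass

@dataclass
class AblationConfig:
    structural_learning: bool   # grow/prune nodes, layers and clusters
    use_kl_loss: bool           # KL(phi^l | Phi^l) clustering regularizer
    use_cd_loss: bool           # L_cd cross-domain (adversarial) adaptation loss

CONFIGS = {
    "A": AblationConfig(structural_learning=False, use_kl_loss=False, use_cd_loss=False),
    "B": AblationConfig(structural_learning=False, use_kl_loss=True,  use_cd_loss=True),
    "C": AblationConfig(structural_learning=True,  use_kl_loss=False, use_cd_loss=False),
    "E": AblationConfig(structural_learning=True,  use_kl_loss=True,  use_cd_loss=True),
}

def total_loss(l_rec, l_kl, l_cd, cfg, alpha=1.0, beta=1.0):
    """Compose the training objective for a given ablation configuration.

    l_rec, l_kl, l_cd are scalar losses already computed elsewhere; alpha and
    beta are assumed trade-off coefficients.
    """
    loss = l_rec                       # reconstruction loss is always active
    if cfg.use_kl_loss:
        loss = loss + alpha * l_kl     # KL divergence clustering loss
    if cfg.use_cd_loss:
        loss = loss + beta * l_cd      # cross-domain adaptation loss
    return loss
```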

From Table III, LEOPARD suffers major performance degradation under all configurations (A)-(D), confirming the efficacy of its full version. For AM5→AM1, configuration (A), where the structural learning strategy, the KL divergence loss and the cross-domain adaptation loss are all absent, produces poor results. The performance improves slightly with the activation of the KL divergence loss and the cross-domain adaptation loss as per configuration (B). Even though the KL divergence loss and the cross-domain adaptation loss are deactivated, the structural learning mechanism improves the accuracy as per configuration (C), although it is not yet on par with LEOPARD. An interesting observation appears in the case of AM5→AM4, where configuration (C) produces poor results; the performance improves in configurations (A) and (B) without the structural learning mechanism. Nevertheless, configurations (A)-(C) are not comparable to the full LEOPARD where all modules are engaged. We note that the size of the source stream is much smaller than the size of the target stream in the AM5→AM4 case, leading to very few labelled samples being provided in the warm-up phase. For AM1→AM4, the absence of structural evolution, configuration (B), drops LEOPARD’s accuracy slightly. More severe degradation than configuration (B) occurs in configuration (A) and configuration (C), where the KL divergence loss and the cross-domain adaptation loss are deactivated. The same pattern is demonstrated in the case of AM1→AM3. These findings clearly confirm the advantage of the KL divergence loss and the cross-domain adaptation loss for LEOPARD. In addition, the structural evolution boosts the performance of LEOPARD, especially in the presence of drifts. The use of BERT as the feature extractor, as shown in configuration (D), significantly worsens the predictive performance of LEOPARD. This finding confirms the suitability of the word2vec model over the BERT model as the feature extractor of LEOPARD, most likely due to the absence of a self-attention mechanism or recurrent connection in LEOPARD.

[Figure 4: USPS → MNIST t-SNE plots on 1000 target data samples; (a) before training, (b) after training.]

V-G t-SNE Plots

Fig. 4 illustrates the t-SNE plots [18] of LEOPARD for the USPS→MNIST case on the target stream before and after the training process. It is observed that no cluster structure exists initially in this problem, but clear cluster structures are present after the training process, showing the effectiveness of (14). These facts confirm that LEOPARD does not call for the existence of any cluster structures beforehand and that the clustering loss $L_{cluster}$ is capable of establishing a clustering-friendly latent space.
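For readers wishing to reproduce such plots, a minimal sketch using scikit-learn's t-SNE on latent features is given below; the encoder calls in the usage comment are hypothetical stand-ins for LEOPARD's feature extractor.

```python
# Minimal sketch of producing Fig. 4-style plots: project 1000 target-stream
# latent features with t-SNE. `encoder_untrained` / `encoder_trained` below are
# hypothetical stand-ins for LEOPARD's feature extractor before/after training.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(features, labels, title):
    """features: (N, D) latent vectors; labels: (N,) digit labels used for coloring only."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(4, 4))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.tight_layout()
    plt.show()

# Usage (hypothetical):
# z_before = encoder_untrained(x_target[:1000]).detach().numpy()
# z_after  = encoder_trained(x_target[:1000]).detach().numpy()
# tsne_plot(z_before, y_target[:1000], "before training")
# tsne_plot(z_after,  y_target[:1000], "after training")
```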

V-H Future Directions

This paper has successfully developed an algorithmic solution for multistream classification problems under extreme label shortage, LEOPARD. That is, given two different but related streaming processes, LEOPARD functions properly with only a few prerecorded labelled samples of the source domain and without any labels once the streaming processes run. This benefit goes one step beyond existing multistream classifiers or unsupervised domain adaptation methods calling for fully labelled source streams. Nonetheless, the problem of multistream classification remains at an infant stage, leaving several open issues for future works.

The problem of open set domain adaptation presents a case where the source and target domains do not share the same target classes [31]. Such a setting is also seen as a way to reduce the labelling cost and is beneficial in the realm of multistream classification. [31] proposes a feature transformation strategy associating target classes of the target domain with those of the source domain. [32] puts forward the concepts of openness and unknown classes in the open set domain adaptation problem. The theoretical bound of open set domain adaptation is derived in [33], and its application to deep learning is demonstrated in [34]. These approaches are limited to the offline case, calling for extensions to the multistream classification setting. Gradual domain adaptation [35] is highly relevant to the multistream classification context because the multistream classification problem also considers different but related streams and constant discrepancies, although the asynchronous drifts usually appear suddenly. In addition, (1) assumes a small and fixed combined risk, which is unrealistic because the combined risk may increase during the training process [36]; this issue is still ignored in multistream classification problems. Few-shot hypothesis adaptation [37] is another interesting direction for the multistream classification topic, extending the few-shot domain adaptation problem to the case without any source domain data.

VI Conclusion

Learning Streaming Process from Partial Ground Truth (LEOPARD) is proposed in this paper to cope with cross domain multistream classification problems under a lack of labelled samples. The advantage of LEOPARD has been numerically validated using 24 study cases combined from nine datasets, where LEOPARD outperforms its counterparts with noticeable margins in 15 of 24 cases. The ablation study further confirms the efficacy of LEOPARD’s learning modules. One limitation of LEOPARD lies in the adversarial domain adaptation strategy, which only performs domain alignment; this approach performs poorly when large conditional distribution discrepancies exist. This issue is rather tricky here because LEOPARD does not benefit from any labels for its model updates. Our initial insight shows the feasibility of a pseudo-labelling strategy to attack this problem by improving class inferences. Noisy labels remain an open issue to be explored in future work.

VII Acknowledgement

We acknowledge the financial support of National Research Foundation, Singapore under IAFPP in the AME domain (contract no.: A19C1A0018) and the UniSA’s start-up grant.

References

  • [1] S. Chandra, A. Haque, L. Khan, and C. Aggarwal, “An adaptive framework for multistream classification,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ser. CIKM ’16.   New York, NY, USA: ACM, 2016, pp. 1181–1190.
  • [2] P. Zhao, S. C. H. Hoi, J. Wang, and B. Li, “Online transfer learning,” Artif. Intell., vol. 216, pp. 76–102, 2014.
  • [3] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” J. Mach. Learn. Res., vol. 17, no. 1, p. 2096–2030, Jan. 2016.
  • [4] A. Haque, Z. Wang, S. Chandra, B. Dong, L. Khan, and K. W. Hamlen, “Fusion: An online method for multistream classification,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ser. CIKM ’17.   New York, NY, USA: ACM, 2017, pp. 919–928.
  • [5] B. Dong, S. Chandra, Y. Gao, and L. Khan, “Multistream classification with relative density ratio estimation,” in AAAI 2019, 2019.
  • [6] M. Pratama, M. de Carvalho, X. Renchunzi, E. Lughofer, and J. Lu, “Atl: Autonomous knowledge transfer from many streaming processes,” in Proceedings of The 28th ACM International Conference on Information and Knowledge Management, ser. CIKM’19, 2019, pp. 3861–3870.
  • [7] H. Du, L. L. Minku, and H. Zhou, “Multi-source transfer learning for non-stationary environments,” in 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8.
  • [8] ——, “Marline: Multi-source mapping transfer learning for non-stationary environments,” in 2020 IEEE International Conference on Data Mining (ICDM), 2020, pp. 122–131.
  • [9] X. Renchunzi and M. Pratama, “Automatic online multi-source domain adaptation,” Information Sciences, vol. 582, pp. 480–494, 2022.
  • [10] F. Liu, G. Zhang, and J. Lu, “Multisource heterogeneous unsupervised domain adaptation via fuzzy relation neural networks,” IEEE Transactions on Fuzzy Systems, vol. 29, pp. 3308–3322, 2021.
  • [11] H. Tao, Z. Wang, Y. Li, M. Zamani, and L. Khan, “Comc: A framework for online cross-domain multistream classification,” in 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8.
  • [12] Y. Zhang, F. Liu, Z. Fang, B. Yuan, G. Zhang, and J. Lu, “Clarinet: A one-step approach towards budget-friendly unsupervised domain adaptation,” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, C. Bessiere, Ed.   International Joint Conferences on Artificial Intelligence Organization, 7 2020, pp. 2526–2532, main track.
  • [13] F. Liu, G. Zhang, and J. Lu, “Heterogeneous domain adaptation: An unsupervised approach,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 12, 2020.
  • [14] Z. Fang, J. Lu, F. Liu, J. Xuan, and G. Zhang, “Open set domain adaptation: Theoretical bound and algorithm,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2020.
  • [15] F. Liu, J. Lu, B. Han, G. Niu, G. Zhang, and M. Sugiyama, “Butterfly: Robust one-step approach towards wildly-unsupervised domain adaptation,” 2019.
  • [16] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Mach. Learn., vol. 79, no. 1–2, p. 151–175, May 2010.
  • [17] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ser. ICML’16.   JMLR.org, 2016, p. 478–487.
  • [18] L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.
  • [19] J. Smith, S. Baer, Z. Kira, and C. Dovrolis, “Unsupervised continual learning and self-taught associative memory hierarchies,” in 2019 International Conference on Learning Representations Workshops, 2019.
  • [20] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” in Proceedings of the 19th International Conference on Neural Information Processing Systems, ser. NIPS’06.   Cambridge, MA, USA: MIT Press, 2006, p. 137–144.
  • [21] Z. Wang, Z. Kong, S. Chandra, H. Tao, and L. Khan, “Robust high dimensional stream classification with novel class detection,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019, pp. 1418–1429.
  • [22] M. Pratama, C. Za’in, A. Ashfahani, Y. S. Ong, and W. Ding, “Automatic construction of multi-layer perceptron network from streaming examples,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1171–1180.
  • [23] J. Gama, Knowledge Discovery from Data Streams, 1st ed.   Chapman & Hall/CRC, 2010.
  • [24] M. Pratama, W. Pedrycz, and G. I. Webb, “An incremental construction of deep neuro fuzzy system for continual learning of nonstationary data streams,” IEEE Transactions on Fuzzy Systems, vol. 28, pp. 1315–1328, 2020.
  • [25] J. Ni, J. Li, and J. McAuley, “Justifying recommendations using distantly-labeled reviews and fine-grained aspects,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).   Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 188–197.
  • [26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” CoRR, vol. abs/1301.3781, 2013.
  • [27] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A survey on concept drift adaptation,” ACM Comput. Surv., vol. 46, no. 4, pp. 44:1–44:37, Mar. 2014.
  • [28] M. de Carvalho, M. Pratama, J. Zhang, and E. K. Y. Yapp, “Acdc: Online unsupervised cross-domain adaptation,” 2021.
  • [29] A. Ashfahani and M. Pratama, “Unsupervised continual learning in streaming environments,” IEEE transactions on neural networks and learning systems, vol. PP, 2022.
  • [30] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong, “Towards k-means-friendly spaces: Simultaneous deep learning and clustering,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70.   PMLR, 06–11 Aug 2017, pp. 3861–3870.
  • [31] P. P. Busto and J. Gall, “Open set domain adaptation,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 754–763, 2017.
  • [32] H. Liu, Z. Cao, M. Long, J. Wang, and Q. Yang, “Separate to adapt: Open set domain adaptation via progressive separation,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2922–2931, 2019.
  • [33] Z. Fang, J. Lu, F. Liu, J. Xuan, and G. Zhang, “Open set domain adaptation: Theoretical bound and algorithm,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, pp. 4309–4322, 2021.
  • [34] L. Zhong, Z. Fang, F. Liu, B. Yuan, G. Zhang, and J. Lu, “Bridging the theoretical bound and deep algorithms for open set domain adaptation,” IEEE transactions on neural networks and learning systems, vol. PP, 2021.
  • [35] A. Kumar, T. Ma, and P. Liang, “Understanding self-training for gradual domain adaptation,” ArXiv, vol. abs/2002.11361, 2020.
  • [36] L. Zhong, Z. Fang, F. Liu, J. Lu, B. Yuan, and G. Zhang, “How does the combined risk affect the performance of unsupervised domain adaptation approaches?” ArXiv, vol. abs/2101.01104, 2021.
  • [37] H. Chi, F. Liu, W. Yang, L. Lan, T. Liu, B. Han, W. Cheung, and J. T.-Y. Kwok, “Tohan: A one-step approach towards few-shot hypothesis adaptation,” ArXiv, vol. abs/2106.06326, 2021.