
SR2CNN: Zero-Shot Learning for Signal Recognition

Yihong Dong, Xiaohan Jiang, Huaji Zhou, Yun Lin, and Qingjiang Shi. This work was supported in part by the National Key Research and Development Project under grant 2017YFE0119300, and in part by the NSFC under Grants 61731018 and U1709219. (Corresponding author: Qingjiang Shi.) Y. Dong, X. Jiang and Q. Shi are with the School of Software Engineering, Tongji University, Shanghai 201804, China. Q. Shi is also with the Shenzhen Research Institute of Big Data, Shenzhen 518172, China (e-mail: [email protected]; [email protected]; [email protected]). H. Zhou is with the School of Artificial Intelligence, Xidian University, Xi'an 710071, China (e-mail: [email protected]). Y. Lin is with the College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China (e-mail: [email protected]).
Abstract

Signal recognition is one of the significant and challenging tasks in the signal processing and communications field. It is common that no training data is accessible for some signal classes in a recognition task. Hence, zero-shot learning (ZSL), which is widely used in the image processing field, is also very important for signal recognition. Unfortunately, ZSL in this field has hardly been studied due to inexplicable signal semantics. This paper proposes a ZSL framework, signal recognition and reconstruction convolutional neural networks (SR2CNN), to address relevant problems in this situation. The key idea behind SR2CNN is to learn a representation of the signal semantic feature space by introducing a proper combination of cross entropy loss, center loss and reconstruction loss, as well as adopting a suitable distance metric space, such that semantic features have greater minimal inter-class distance than maximal intra-class distance. The proposed SR2CNN can discriminate signals even if no training data is available for some signal class. Moreover, SR2CNN can gradually improve itself with the aid of signal detection, because the class center vectors in the semantic feature space are constantly refined. These merits are all verified by extensive experiments with ablation studies.

Index Terms:
Zero-Shot Learning, Signal Recognition, CNN, Autoencoder, Deep Learning.

I Introduction

Nowadays, developments in deep convolutional neural networks (CNNs) have brought remarkable achievements in the area of signal recognition, improving the state of the art significantly, e.g., [1, 2, 3, 4, 5]. Generally, a vast majority of existing learning methods follow a closed-set assumption [6], that is, all of the test classes are assumed to be the same as the training classes. However, in real-world applications, new signal categories often appear while the model is only trained on the current dataset with some limited known classes. Open-set learning [7, 8, 9, 10] was proposed to partially tackle this issue (i.e., test samples could be from unknown classes). The goal of an open-set recognition system is to reject test samples from unknown classes while maintaining the performance on known classes. However, in some cases, the learned model should be able not only to differentiate the unknown classes from known classes, but also to distinguish among different unknown classes. Zero-shot learning (ZSL) [11, 12, 13] is one way to address the above challenges and has been applied in image tasks. For images, it is easy to extract some human-specified high-level descriptions as semantic attributes. For example, from a picture of a zebra, we can extract the following semantic attributes: 1) color: white and black, 2) stripes: yes, 3) size: medium, 4) shape: horse, 5) land: yes. However, for a real-world signal it is almost impossible to have a high-level description due to obscure signal semantics. Therefore, although ZSL has been widely used in image tasks, to the best of our knowledge it has not yet been studied for signal recognition.¹

¹A closely related work is [14], which proposed a ZSL method for fault diagnosis based on vibration signals. Notice that fault diagnosis is a binary classification problem, which is different from multi-class signal recognition. More importantly, the ZSL definition in this paper is standard and quite different from that of [14], where ZSL refers to fault diagnosis with unknown motor loads and speeds, which is essentially domain adaptation; in our paper, ZSL refers to the recognition of unknown signal classes.



Figure 1: Overview of SR2CNN. In SR2CNN, a pre-processing step (top left) transforms signal data into the input x. A deep net (right) is trained to provide the semantic feature z for known classes while maintaining the performance of the decoder and classifier according to the reconstruction x̃ and the prediction y. A zero-shot learning classifier, which consists of a known classifier and an unknown classifier, exploits z for discrimination.


Figure 2: The architecture of the feature extractor (F), classifier (C) and decoder (D). F takes any input signal x and produces a latent semantic feature z, which is used by C and D to predict the class label and to reconstruct the signal x̃, respectively. The losses L_ce, L_ct and L_r are calculated when training these networks.

In this paper, unlike the conventional signal recognition task where a classifier is learned to distinguish only known classes (i.e., the labels of test data and training data are all within the same set of classes), we aim to propose a learning framework that can classify not only known classes but also unknown classes without annotations. To do so, a key issue that needs to be addressed is to automatically learn a representation of the semantic attribute space of signals. In our scheme, a CNN combined with an autoencoder is exploited to extract the semantic attribute features. Afterwards, semantic attribute features are well classified using a suitably defined distance metric. An overview of the proposed scheme is illustrated in Fig. 1.

In addition, to make a self-evolving learning model, incremental learning needs to be considered when the algorithm is executed continuously. The goal of incremental learning is to dynamically adapt the model to new knowledge from newly arriving data without forgetting what has already been learned. Based on incremental learning, the obtained model will gradually improve its performance over time.

In summary, the main contributions of this paper are threefold:

  • First, we propose a deep CNN-based zero-shot learning framework, called SR2CNN, for open-set signal recognition. SR2CNN is trained to extract the semantic feature z while maintaining the performance of the decoder and classifier. Afterwards, the semantic feature z is exploited to discriminate signal classes.

  • Second, extensive experiments on various signal datasets show that the proposed SR2CNN can discriminate not only known classes but also unknown classes and it can gradually improve itself.

  • Last but not least, we provide a new signal dataset SIGNAL-202002 including eight digital and three analog modulation classes.

II Related Work

In recent years, signal recognition via deep learning has achieved a series of successes. O'Shea et al. [15] proposed convolutional radio modulation recognition networks, which can adapt to the complex temporal radio signal domain and also work well at low signal-to-noise ratios (SNRs). The work [1] used a residual neural network [16] to perform signal recognition tasks across a range of configurations and channel impairments, offering referable statistics. Peng et al. [3] used two convolutional neural networks, AlexNet and GoogLeNet, to address modulation classification tasks, demonstrating the significant advantage of deep learning based approaches in this field. The authors in [17] presented a deep learning based big data processing architecture for end-to-end signal processing tasks, seeking to obtain important information from radio signals. The work presented in [18] evaluated adversarial evasion attacks that cause misclassification in the context of wireless communications. In [19], the authors proposed an automatic classification method for multiple multicarrier waveforms and used principal component analysis to suppress the additive white Gaussian noise and reduce the input dimensions of CNNs. Additionally, the work [20] proposed a specific emitter identification method using CNN-based inphase/quadrature (I/Q) imbalance estimators. The work [21] proposed a compressive convolutional neural network for automatic modulation classification. In [22], the authors used unsynchronized off-the-shelf software-defined radios to build a complete communications system composed solely of deep neural networks, demonstrating that over-the-air transmissions are possible.

Moreover, the work [23] proposed an LPI radar waveform recognition technique based on a single-shot multi-box detector and a supplementary classifier. The work [24] proposed a more flexible network architecture with an augmented hierarchical-leveled training technique to decently classify a total of 29 signals. O'Shea et al. [25] used both an auto-encoder-based communications system and a feature learning-based radio signal sensor to emulate the optimization procedure directly on real-world data samples and distributions. Baldini et al. [26] utilized various techniques to transform the time series derived from the radio frequency into images, then applied a deep CNN to conduct the identification task, finally outperforming conventional dissimilarity-based methods. The work [27] trained a convolutional neural network on time and stockwell channeled images for radio modulation classification tasks, performing superior to networks trained on just raw I/Q time series samples or time-frequency images. The authors of [28] demonstrated the general effectiveness of deep learning at the interference source identification task, while using band selection, SNR selection and sample selection to optimize training time. The work [29] presented a novel system based on CNNs to “fingerprint” a unique radio from a large pool of devices by deep-learning the fine-grained hardware impairments imposed by radio circuitry on physical-layer I/Q samples. The work [30] proposed a DNN-based power control method that aims at solving the non-convex optimization problem of maximizing the sum rate of a fading multi-user interference channel. Chen et al. [31] proposed an adaptive transmission scheme and a generalized data representation scheme to address the limited data rate issue. In [32], the authors proposed a radio frequency (RF) adversarial learning framework for building a robust system to identify rogue RF transmitters by designing and implementing a generative adversarial net. The work [33] presented an intelligent duty-cycle medium access control protocol to realize effective and fair spectrum sharing between LTE and WiFi systems without requiring signalling exchanges.

For semi-supervised learning, the work [34] proposed a generative adversarial networks-based automatic modulation recognition method for cognitive radio networks. When it comes to unsupervised learning, the authors in [35] provided a comprehensive survey of the applications of unsupervised learning in the domain of networking, offering certain instructions. The work [36] built an automatic modulation recognition architecture based on a stacked convolutional autoencoder, using the reconfigurability of field-programmable gate arrays. These works basically follow the closed-set assumption, namely, their deep models are expected to, and are only capable of, distinguishing among already-known signal classes.

All the above works cannot handle the case with unknown signal classes. When considering the recognition task for unknown signal classes, some traditional machine learning methods such as anomaly (also called outlier or novelty) detection can more or less provide some guidance. Isolation Forest [37] constructs a binary search tree to preferentially isolate anomalies. Elliptic Envelope [38] fits an ellipse enveloping the central data points, while rejecting the outsiders. One-class SVM [39], an extension of SVM, finds a decision hyperplane to separate the positive samples and the outliers. Local Outlier Factor [40] uses distance and density to determine whether a data point is abnormal or not. The work [41] proposed a classification-reconstruction learning method for open-set recognition that utilizes latent representations for reconstruction and enables robust unknown detection without harming the known-class classification accuracy. Geng et al. [42] provided a comprehensive survey of existing open-set recognition techniques covering various aspects ranging from related definitions, representations of models, datasets, and evaluation criteria to algorithm comparisons. The work [43] proposed a multitask deep learning method that simultaneously conducts classification and reconstruction in the open world where unknown classes may exist. Moreover, the work [44] proposed a generative adversarial networks based technique to address an open-set problem, namely to identify rogue RF transmitters and classify trusted ones. The work [45] presented a spectrum anomaly detector with interpretable features, an adversarial autoencoder based unsupervised model for wireless spectrum anomaly detection. The above open-set learning methods can indeed identify known samples (positive samples) and detect unknown ones (outliers). However, a common and inevitable defect of these methods is that they can never carry out any further classification among the unknown signal classes.

Zero-shot learning is well known to be able to classify unknown classes, and it has already been widely used in image tasks. For example, the work [11] proposed a ZSL framework that can predict unknown classes omitted from a training set by leveraging a semantic knowledge base. Another paper [12] proposed a novel model for jointly doing standard and ZSL classification based on deeply learned word and image representations. The efficiency of ZSL in the image processing field profits mainly from perspicuous semantic attributes which can be manually defined by high-level descriptions. However, it is almost impossible to give any high-level description of a signal, and thus the corresponding semantic attributes cannot be easily acquired beforehand. This may be the main reason why ZSL has not yet been studied in signal recognition.

Figure 3: Diagrams of (a) max unpooling, (b) average unpooling and (c) deconvolution. (a) Max unpooling with a 2×2 grid, where the stride and padding are 2 and 0, respectively. (b) Average unpooling with a 2×2 grid, where the stride and padding are 2 and 0, respectively. (c) Deconvolution with a 3×3 kernel, where the stride and padding are 1 and 0, respectively.

III Problem Definition

We begin by formalizing the problem. Let X and Y be the signal input space and output space, respectively. The set Y is partitioned into K and U, denoting the collections of known class labels and unknown class labels, respectively.

Given training data {(x_1, y_1), …, (x_n, y_n)} ⊂ X × K, the task is to extrapolate and recognize signal classes that belong to Y. Specifically, given the signal input data x ∈ X, the proposed learning framework, elaborated in the sequel, should rightly predict the label y. Notice that our learning framework differs from open-set learning in that we not only classify x into either K or U, but also predict the label y ∈ Y, where Y includes both the known classes K and the unknown classes U.

We restrict our attention to ZSL that uses semantic knowledge to recognize K and extrapolate to U. To this end, we first map X into the semantic space Z, and then map this semantic encoding to a class label. Mathematically, we can describe our scheme by a nonlinear mapping H, which is the composition of two functions F and P defined below, such that:

H = P(F(\cdot)) (1)
F: X \rightarrow Z
P: Z \rightarrow Y

Therefore, our task is to find proper F and P to build up a learning framework that can identify both known signal classes and unknown signal classes.

IV Proposed Approach

This section formally presents an annotation-free zero-shot learning framework for signal recognition. Overall, the proposed framework is mainly composed of the following four modules:

  1. Feature Extractor (F)

  2. Classifier (C)

  3. Decoder (D)

  4. Discriminator (P)

Our approach consists of two main steps. In the first step, we build a semantic space for signals through F, C and D. Fig. 2 shows the architecture of F, C and D. F is modeled by a CNN architecture that projects the input signal onto a latent semantic space representation. C, modeled by a fully-connected neural network, takes the latent semantic space representation as input and determines the label of the data. D, modeled by another CNN architecture, aims to produce the reconstructed signal, which is expected to be as similar as possible to the input signal. In the second step, we find a proper distance metric for the trained semantic space and use the distance to discriminate the signal classes. P is devised to discriminate among all classes, both known and unknown.

IV-A Feature Extractor, Classifier and Decoder

Signals are a special data type, very different from images. While it is easy to give a description of the semantic attributes of images in terms of visual information, extracting semantic features of signals without relying on any computation is almost impossible. Hence, a natural way to automatically extract the semantic information of signal data is to use a feature extractor network F. Considering the unique features of signals, the input shape of F should be a rectangular matrix with 2 rows rather than a square matrix. In our scheme, F consists of four convolutional layers and two fully connected layers.
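As a concrete reference, the following PyTorch sketch shows one possible shape of F and of the classifier C described below, under the stated structure (four convolutional layers and two fully connected layers for F, fully connected layers for C, input of shape 2×128 as in Table I). The layer widths, kernel sizes and the semantic dimension are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """F: maps a 2 x 128 I/Q frame to a latent semantic feature z."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=(2, 3), padding=(0, 1)), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 1 * 128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),            # semantic feature z
        )

    def forward(self, x):                        # x: (N, 1, 2, 128)
        return self.fc(self.conv(x).flatten(1))

class Classifier(nn.Module):
    """C: fully connected layers mapping z to logits over the known classes."""
    def __init__(self, feat_dim=64, n_known=9):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                nn.Linear(64, n_known))

    def forward(self, z):
        return self.fc(z)
```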

Generally, F can be represented by a mapping from the input space X to the latent semantic space Z. In order to minimize the intra-class variations in Z while keeping the inter-class semantic features well separated, the center loss [46] is used. Let x_i ∈ X and y_i be the label of x_i; then z_i = F(x_i) ∈ Z. Assuming that the batch size is N, the center loss is expressed as follows:

L_{ct}=\frac{1}{2N}\sum_{i=1}^{N}||F(x_{i})-c_{y_{i}}||^{2}_{2} (2)

where c_{y_i} denotes the semantic center vector of class y_i in Z; c_{y_i} needs to be updated as the semantic features of class y_i change. Ideally, the entire training dataset should be taken into account and the features of each class should be averaged in every iteration. In practice, c_{y_i} can be updated for each batch according to c_{y_i} ← c_{y_i} − αΔ_{c_{y_i}}, where α is the learning rate and Δ_{c_{y_i}} is computed via

\left\{\begin{aligned} &\Delta_{c_{y_{i}}}=0, &&\text{if}\,\,\sum_{j=1}^{N}\delta(y_{j}=y_{i})=0,\\ &\Delta_{c_{y_{i}}}=\frac{\sum_{j=1}^{N}\delta(y_{j}=y_{i})(c_{y_{i}}-F(x_{j}))}{\sum_{j=1}^{N}\delta(y_{j}=y_{i})}, &&\text{otherwise}.\end{aligned}\right. (3)

where δ(·) = 1 if the condition inside the parentheses holds true, and δ(·) = 0 otherwise.
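In code, the center loss of Eq. (2) and the batch-wise center update of Eq. (3) amount to a few lines. The sketch below assumes `centers` is a (number of classes × feature dimension) tensor; the value of α is an illustrative assumption.

```python
import torch

def center_loss(z, y, centers):
    # Eq. (2): mean halved squared distance between features and their class centers.
    return 0.5 * ((z - centers[y]) ** 2).sum(dim=1).mean()

@torch.no_grad()
def update_centers(z, y, centers, alpha=0.5):
    # Eq. (3): per-batch update c_j <- c_j - alpha * delta_j.
    # Classes absent from the batch keep delta_j = 0, i.e. are left untouched.
    for j in y.unique():
        mask = (y == j)
        delta = (centers[j] - z[mask]).mean(dim=0)
        centers[j] -= alpha * delta
```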

The classifier C discriminates the label of samples based on their semantic features. It consists of several fully connected layers. Furthermore, the cross entropy loss L_ce is utilized to control the error of the classifier C, which is defined as

L_{ce}=-\frac{1}{N}\sum_{i=1}^{N}y_{i}\log(C(F(x_{i}))) (4)

where C(F(x_i)) is the prediction for x_i.

Further, an auto-encoder [47, 48, 49] is used in order to retain the effective semantic information in Z. As shown in the right part of Fig. 2, the decoder D is used to reconstruct X from Z. It is made up of deconvolution, unpooling and fully connected layers. Unpooling is the reverse of pooling and deconvolution is the reverse of convolution. Specifically, max unpooling keeps the maximum-position information recorded during max pooling, restores the maximum values to the corresponding positions and sets the remaining positions to zero, as shown in Fig. 3(a). Analogously, average unpooling expands the feature map by copying values, as shown in Fig. 3(b).

Deconvolution, also called transposed convolution, recovers the shape of the input from the output, as shown in Fig. 3(c). See Appendix A for the detailed convolution and deconvolution operations, as well as toy examples.
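These building blocks of D map directly onto standard PyTorch layers, as the short sketch below shows; the specific shapes are toy assumptions chosen to match Fig. 3.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 128)                    # a toy I/Q frame

# Max pooling / unpooling pair (Fig. 3(a)): unpooling restores each recorded
# maximum to its original position and fills the rest with zeros.
pool = nn.MaxPool2d(kernel_size=(1, 2), stride=(1, 2), return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=(1, 2), stride=(1, 2))
p, idx = pool(x)                                 # (1, 1, 2, 64) plus max positions
u = unpool(p, idx)                               # back to (1, 1, 2, 128)

# Average unpooling (Fig. 3(b)) expands the map by copying each value;
# nearest-neighbour upsampling realizes exactly this copy.
avg_unpool = nn.Upsample(scale_factor=(1, 2), mode='nearest')

# Deconvolution (Fig. 3(c)): a 2x2 map with a 3x3 kernel, stride 1, padding 0
# grows back to 4x4, the input shape of the corresponding convolution.
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, padding=0)
a = deconv(torch.randn(1, 1, 2, 2))              # shape (1, 1, 4, 4)
```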

In addition, a reconstruction loss is utilized to evaluate the difference between the original signal data and the reconstructed signal data:

L_{r}=\frac{1}{2N}\sum_{i=1}^{N}||D(F(x_{i}))-x_{i}||_{2}^{2} (5)

where D(F(x_i)) is the reconstruction of signal x_i. Intuitively, the more completely the signal is reconstructed, the more valid information is carried within Z. Thus, the auto-encoder greatly helps the model to generate appropriate semantic features.

As a result, the total loss function combines cross entropy loss, center loss and reconstruction loss as

L_{t}=L_{ce}+\lambda_{ct}L_{ct}+\lambda_{r}L_{r} (6)

where the weights λ_ct and λ_r are used to balance the three loss functions. We have carefully designed the total loss function. The cross entropy loss is used to learn information from the labels. The center loss minimizes the intra-class variations in the semantic space while keeping the inter-class semantic features well separated, which also helps separate the unknown classes. The reconstruction loss makes the model learn more information about the signal data, since the data must be well reconstructed. The ablation study in Section V also validates these points. The whole learning process with loss L_t is summarized in Algorithm 1, where θ_F, θ_C and θ_D denote the model parameters of the feature extractor F, the classifier C and the decoder D, respectively.

Algorithm 1 Pseudocode for SR2CNN Update
Require: Labeled input and output set {(x_i, y_i)} and hyperparameters N, η, α, λ_ct, λ_r.
Ensure: Parameters θ_F, θ_C, θ_D and {c_j}.
  Initialize parameters θ_F, θ_C, θ_D.
  Initialize parameters {c_j | j ∈ K}.
  repeat
     for each batch with size N do
        Update c_j for each j: c_j ← c_j − αΔ_{c_j}.
        Calculate L_ct via Eq. (2).
        Calculate L_ce via Eq. (4).
        Calculate L_r via Eq. (5).
        L_t = L_ce + λ_ct L_ct + λ_r L_r.
        Update θ_F: θ_F ← θ_F − η∇_{θ_F} L_t.
        Update θ_C: θ_C ← θ_C − η∇_{θ_C} L_t.
        Update θ_D: θ_D ← θ_D − η∇_{θ_D} L_t.
     end for
  until convergence
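One iteration of Algorithm 1 translates into the following training step. It assumes the `FeatureExtractor`/`Classifier` sketches above plus a decoder module `d` taking z, and reuses the `center_loss`/`update_centers` helpers; the default loss weights are placeholders, since the paper tunes the hyperparameters by grid search (Section V).

```python
import torch.nn.functional as F_nn

def train_step(f, c, d, centers, batch_x, batch_y, optimizer,
               alpha=0.5, lambda_ct=0.1, lambda_r=0.1):
    # Center update of Eq. (3), detached so it stays out of the gradient graph.
    update_centers(f(batch_x).detach(), batch_y, centers, alpha)

    z = f(batch_x)
    loss_ce = F_nn.cross_entropy(c(z), batch_y)                       # Eq. (4)
    loss_ct = center_loss(z, batch_y, centers)                        # Eq. (2)
    loss_r = 0.5 * ((d(z) - batch_x) ** 2).sum(dim=(1, 2, 3)).mean()  # Eq. (5)
    loss_t = loss_ce + lambda_ct * loss_ct + lambda_r * loss_r        # Eq. (6)

    optimizer.zero_grad()
    loss_t.backward()            # one gradient step on theta_F, theta_C, theta_D
    optimizer.step()
    return loss_t.item()
```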

IV-B Discriminator

The discriminator P is the last module of the proposed framework, but also its core. It discriminates among known and unknown classes based on the latent semantic space Z. For each known class k, the feature extractor F extracts the features and computes the corresponding semantic center vector S_k as:

S_{k}=\frac{\sum_{j=1}^{m}\delta(y_{j}=k)F(x_{j})}{\sum_{j=1}^{m}\delta(y_{j}=k)} (7)

where m is the number of all training samples. When a test signal I appears and F(I) is obtained, the difference between the vector F(I) and S_k can be measured for each k. Specifically, the generalized distance between F(I) and S_k is used, which is defined as follows:

d(F(\mathcal{I}),S_{k})=\sqrt{(F(\mathcal{I})-S_{k})^{T}A_{k}^{-1}(F(\mathcal{I})-S_{k})} (8)

where A_k is the transformation matrix associated with class k and A_k^{-1} denotes the inverse of A_k. When A_k is the covariance matrix Σ of the semantic features of the signals of class k, d(·,·) is the Mahalanobis distance. When A_k is the identity matrix I (the only possible choice when the covariance matrix Σ is not available, which happens for example when the signal set of some class is a singleton), d(·,·) reduces to the Euclidean distance. A_k can also be Λ or σ²I, where Λ is the diagonal matrix formed by the diagonal elements of Σ and σ² ≜ trace(Σ)/t, with t being the dimension of S_k. The distances based on A_k = Λ and A_k = σ²I are called the second distance and the third distance, respectively. Note that when the Mahalanobis, second or third distance is applied, the covariance matrix of each known class needs to be computed in advance.
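The center of Eq. (7), the generalized distance of Eq. (8) and the four choices of A_k can be sketched as follows; the feature dimension and sample counts are toy values.

```python
import numpy as np

def class_center(feats):
    # Eq. (7): mean of the semantic features of one known class.
    return feats.mean(axis=0)

def generalized_distance(z, S_k, A_k):
    # Eq. (8): sqrt((z - S_k)^T A_k^{-1} (z - S_k)).
    diff = z - S_k
    return float(np.sqrt(diff @ np.linalg.inv(A_k) @ diff))

feats = np.random.randn(100, 64)                  # toy features of one class
t = feats.shape[1]
Sigma = np.cov(feats, rowvar=False)               # A_k = Sigma: Mahalanobis distance
Lambda = np.diag(np.diag(Sigma))                  # A_k = Lambda: "second distance"
sigma2_I = (np.trace(Sigma) / t) * np.eye(t)      # A_k = sigma^2 I: "third distance"
I = np.eye(t)                                     # A_k = I: Euclidean distance
```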

With the above distance metric, we can establish our discriminant model, which consists of two steps: first, distinguish between known and unknown classes; second, discriminate which known class or unknown class the test signal belongs to. The first step is done by comparing the threshold Θ_1 with the minimal distance d_1 given by

d_{1}=\min_{S_{k}\in S}d(F(\mathcal{I}),S_{k}) (9)

where S is the set of known semantic center vectors. Let us denote by y_I the prediction for I. If d_1 < Θ_1, then y_I ∈ K; otherwise y_I ∈ U. Owing to the use of the center loss in training, the semantic features of the signals of class k are assumed to obey a multivariate Gaussian distribution. Inspired by the three-sigma rule [50], we set Θ_1 as follows:

\Theta_{1}=\lambda_{1}\times 3\sqrt{t} (10)

where λ_1 is a control parameter referred to as the discrimination coefficient.

Two remarks are made as follows to explain the Gaussian distribution assumption and the choice of Θ1\Theta_{1}, respectively.

Remark 1

In our loss function, we have the center loss component which aims to minimize (2) with respect to the semantic layer. It is not difficult to show that

\begin{split}\arg\min_{\theta_{F}}L_{ct}&=\arg\max_{\theta_{F}}-L_{ct}\\ &=\arg\max_{\theta_{F}}-\frac{1}{2N}\sum_{i=1}^{N}||F(x_{i})-c_{y_{i}}||^{2}_{2}\\ &=\arg\max_{\theta_{F}}-\frac{1}{2N}\sum_{i=1}^{N}(F(x_{i})-c_{y_{i}})^{T}(F(x_{i})-c_{y_{i}})\end{split} (11)

Because of the monotonicity of the exponential function (and since the argmax is unaffected by raising to the power 1/N or by a positive constant factor), we have

\begin{split}&\arg\max_{\theta_{F}}-\frac{1}{2N}\sum_{i=1}^{N}(F(x_{i})-c_{y_{i}})^{T}(F(x_{i})-c_{y_{i}})\\ &=\arg\max_{\theta_{F}}\prod_{i=1}^{N}e^{-\frac{(F(x_{i})-c_{y_{i}})^{T}(F(x_{i})-c_{y_{i}})}{2}}\\ &=\arg\max_{\theta_{F}}\prod_{i=1}^{N}e^{-\frac{(F(x_{i})-c_{y_{i}})^{T}I^{-1}(F(x_{i})-c_{y_{i}})}{2}}\\ &=\arg\max_{\theta_{F}}(2\pi)^{\frac{Nt}{2}}|I|^{\frac{N}{2}}\prod_{i=1}^{N}\frac{1}{(2\pi)^{\frac{t}{2}}|I|^{\frac{1}{2}}}e^{-\frac{(F(x_{i})-c_{y_{i}})^{T}I^{-1}(F(x_{i})-c_{y_{i}})}{2}}\end{split} (12)

where t denotes the dimension of the Gaussian distribution and I denotes the identity matrix. Letting β ≜ (2π)^{Nt/2}|I|^{N/2} and P(F(x_{i})|y_{i})=\frac{1}{(2\pi)^{\frac{t}{2}}|I|^{\frac{1}{2}}}e^{-\frac{(F(x_{i})-c_{y_{i}})^{T}I^{-1}(F(x_{i})-c_{y_{i}})}{2}}, the above equation can be equivalently written as

\arg\max_{\theta_{F}}\beta\prod_{i=1}^{N}P(F(x_{i})|y_{i}) (13)

where P(F(x_i)|y_i) = 𝒩(c_{y_i}, I). This indicates that the output of the semantic layer very likely follows a Gaussian distribution. (Note, however, that due to the existence of the other two component loss functions, we propose using a general covariance matrix to describe the output of the semantic layer, as shown in (8).)

Remark 2

The choice of Θ_1 in (10) is made based on the following two considerations. First, the well-known three-sigma rule of thumb is often used for the identification of outliers [51]. It is shown in [51] that this rule should be properly generalized due to the impact of the dimension in the multi-dimensional case. We here present a natural generalization to the t-dimensional case by scaling with √t, so as to remove the impact of the dimension on the choice of Θ_1. This explains the term 3√t in (10). Second, a control parameter λ_1 is incorporated to make the choice of Θ_1 more flexible so that it can work well for complex recognition tasks. Our numerical experiments later validate the effectiveness of this choice of Θ_1.

The second step is more complicated. If I belongs to the known classes, its label y_I can be easily obtained via

y_{\mathcal{I}}=\arg\min_{k}d(F(\mathcal{I}),S_{k}). (14)

Obviously, the main difficulty lies in dealing with the case when I is classified as unknown in the first step. To illustrate, let us denote by R the set of recorded unknown classes and define S_R to be the set of the semantic center vectors of R. In the case R = ∅, a new signal label R_1 is added to R and F(I) is set to be the semantic center vector S_{R_1}. The unknown signal I is saved in the set G_{R_1} and we let y_I = R_1. In the case R ≠ ∅, the threshold Θ_2 is compared with the minimal distance d_2, which is defined by

d_{2}=\min_{S_{R_{u}}\in S_{R}}d(F(\mathcal{I}),S_{R_{u}}) (15)
TABLE I: Standard metadata of dataset 2016.10A. For a larger version, 2016.10B, the class "AM-SSB" is removed, while the number of samples for each class is sixfold (120000). For a smaller one, 2016.04C, all 11 classes are included, but the number of samples per class is disparate (ranging from 4120 to 24940).

total samples: 220000
# of samples per class: 20000
# of samples per SNR: 1000
feature dimension: 2×128
classes (modulations): 11
modulation types: 8PSK, AM-DSB, AM-SSB, BPSK, CPFSK, GFSK, PAM4, QAM16, QAM64, QPSK, WBFM
# of SNR values: 20
SNR values: -20, -18, -16, -14, -12, -10, -8, -6, -4, -2, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18

Intuitively, a good choice of Θ_2 may be made based on the distances between F(x) and the S_k's. The minimum distance d_1 was the first candidate in our tests of the choice of Θ_2. In fact, we test a set of choices of Θ_2 and numerically find that unknown classes can often be correctly identified when Θ_2 is set between d_1 and d_med, where d_med is the median distance between F(x) and the S_k's. Therefore, the threshold Θ_2 is finally set as

\Theta_{2}=\frac{d_{1}+\lambda_{2}\times d_{med}}{1+\lambda_{2}} (16)

where λ_2 is used to balance the two distances d_1 and d_med.

To proceed, let n_R denote the number of recorded signal labels in R. If d_2 > Θ_2, a new signal label R_{n_R+1} is added to R and we set y_I = R_{n_R+1}. Note that we do not impose any prior restriction on the value of n_R (the size of the set R), i.e., our model never needs to know the number of unknown classes pending discrimination. If instead d_2 ≤ Θ_2, we set

y_{\mathcal{I}}=\arg\min_{R_{u}}d(F(\mathcal{I}),S_{R_{u}}) (17)

and save the signal I in G_{y_I}. Accordingly, S_{y_I} is updated via

S_{y_{\mathcal{I}}}=\frac{\sum_{k\in G_{y_{\mathcal{I}}}}F(k)}{\#(G_{y_{\mathcal{I}}})} (18)

where #(G_{y_I}) denotes the number of signals in the set G_{y_I}. As a result, as the number of predictions for unknown signals increases, the model gradually improves itself by refining the S_{R_u}'s.

Algorithm 2 Pseudocode for Discriminator PP
Require: Test input I, transformation matrices {A_k, A_{R_u}}, sets S, R, S_R, {G_j} and hyperparameters Θ_1, Θ_2.
Ensure: y_I.
  Calculate F(I).
  Calculate d_1 via Eq. (9).
  Calculate d_2 via Eq. (15).
  if d_1 < Θ_1 then
     Calculate y_I via Eq. (14).
  else if d_1 ≥ Θ_1 and R = ∅ then
     Add R_1 to R.
     y_I = R_1.
  else if d_1 ≥ Θ_1, R ≠ ∅ and d_2 > Θ_2 then
     Add R_{n_R+1} to R.
     y_I = R_{n_R+1}.
  else
     Calculate y_I via Eq. (17).
  end if
  Save I in G_{y_I}.
  Update S_{y_I} via Eq. (18).
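Algorithm 2 can be condensed into the following sketch (Euclidean variant, i.e., A_k = I, for brevity). `known_centers` holds the S_k's, `unknown_centers` the S_{R_u}'s and `unknown_sets` the G sets; the default λ values mirror settings discussed in Section V but are otherwise assumptions.

```python
import numpy as np

def discriminate(z, known_centers, unknown_centers, unknown_sets,
                 lambda1=0.4, lambda2=1.0):
    t = z.shape[0]
    dists = {k: np.linalg.norm(z - S) for k, S in known_centers.items()}
    d1 = min(dists.values())                              # Eq. (9)
    theta1 = lambda1 * 3 * np.sqrt(t)                     # Eq. (10)
    if d1 < theta1:                                       # known class, Eq. (14)
        return min(dists, key=dists.get)

    d_med = np.median(list(dists.values()))
    theta2 = (d1 + lambda2 * d_med) / (1 + lambda2)       # Eq. (16)
    u_dists = {u: np.linalg.norm(z - S) for u, S in unknown_centers.items()}
    d2 = min(u_dists.values()) if u_dists else np.inf     # Eq. (15); inf if R is empty

    if d2 > theta2:                                       # open a new unknown class
        label = f"R{len(unknown_centers) + 1}"
        unknown_sets[label] = []
    else:                                                 # existing unknown, Eq. (17)
        label = min(u_dists, key=u_dists.get)
    unknown_sets[label].append(z)
    unknown_centers[label] = np.mean(unknown_sets[label], axis=0)   # Eq. (18)
    return label
```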


Figure 4: In-training statistics on three datasets. The accuracy is based on the known test set.

To summarize, we present the whole procedure of the discriminator in Algorithm 2. We emphasize that SR2CNN is different from common open-set recognition methods. Assuming that there are n known classes and an uncertain number of unknown classes, a traditional open-set recognition method will only distinguish the test samples into n+1 classes, while SR2CNN distinguishes them into n+n_R classes via Algorithm 2, where n_R is the number of unknown classes recognized by the discriminator. Specifically, when a test sample belongs to an unknown class, we determine whether it belongs to an existing unknown class or a new unknown class by comparing d_2 with the threshold Θ_2. Hence, the notable advantage of SR2CNN over common open-set recognition methods lies in that SR2CNN can roughly distinguish how many unknown classes there are in the test set, not just label test samples as unknown.

V Experiments and Results

TABLE II: Contrast between supervised learning and our ZSL scenario on three datasets. Dashed lines in the ZSL columns mark the boundary between known and unknown classes. Bold: accuracy for a certain unknown class. Italic: accuracy computed only to help draw a transverse comparison. Items split by a slash "/", like "75.9%/8.4%", denote the accuracies of two isotopic classes, respectively. "-" denotes that there is no corresponding result for such a case.

indicator | class | 2016.10A supervised | 2016.10A ZSL | 2016.10B supervised | 2016.10B ZSL | 2016.04C supervised | 2016.04C ZSL
accuracy | 8PSK (1) | 85.0% | 85.5% | 95.5% | 86.7% | 74.9% | 69.3%
accuracy | AM-DSB (2) | 100.0% | 73.5% | 100.0% | 41.3% | 100.0% | 91.1%
accuracy | BPSK (4) | 99.0% | 95.0% | 99.8% | 96.5% | 99.8% | 97.6%
accuracy | PAM4 (7) | 98.5% | 94.5% | 97.6% | 93.4% | 99.6% | 96.8%
accuracy | QAM16 (8) | 41.6% | 49.3% | 56.8% | 40.0% | 97.6% | 98.4%
accuracy | QAM64 (9) | 60.6% | 44.0% | 47.5% | 49.6% | 94.0% | 97.6%
accuracy | QPSK (10) | 95.0% | 90.5% | 98.9% | 90.6% | 86.8% | 81.5%
accuracy | WBFM (11) | 38.2% | 32.0% | 39.6% | 50.4% | 88.8% | 86.9%
accuracy | CPFSK (5) | 100.0% | 99.0% | 100.0% | 75.9%/8.4% | 100.0% | 96.2%
accuracy | GFSK (6) | 100.0% | 99.0% | 100.0% | 95.6%/2.3% | 100.0% | 82.0%
accuracy | AM-SSB (3) | 100.0% | 100.0% | - | - | 100.0% | 100.0%
total accuracy | | 83.5% | 78.4% | 83.6% | 72.0% | 94.7% | 91.5%
average known accuracy | | 79.8% | 73.7% | 79.5% | 68.5% | 93.5% | 91.6%
true known rate | | - | 95.9% | - | 86.9% | - | 97.0%
true unknown rate | | - | 99.5% | - | 91.1% | - | 90.0%


Figure 5: Correlation between true known/unknown accuracy and the discrimination coefficient λ_1 on three datasets.

In this section, we demonstrate the effectiveness of the proposed SR2CNN approach by conducting extensive experiments on the dataset 2016.10A, as well as its two counterparts, 2016.10B and 2016.04C [15]. The data description is presented in Table I. All 11 modulation types are numbered with class labels from left to right.

Sieve samples. Samples with SNR less than 16 dB are first filtered out, leaving only a purer and higher-quality portion (one-tenth of the original) to serve as the overall datasets in our experiments.

Choose unknown classes. Empirically, a class whose features are hard to learn is an arduous challenge for a standard supervised learning model, let alone when it plays an unknown role in our ZSL scenario (especially when no prior knowledge about the number of unknown classes is given, as mentioned in Section IV-B). Hence, a completely supervised learning stage is necessarily carried out beforehand to help us nominate suitable unknown classes. If the prediction accuracy of the fully supervised method is rather low for a certain class, it is reasonable to exclude this class from the unknown candidates in ZSL, because ZSL will certainly not yield a good performance for it. In our experiments, unknown classes are randomly selected from the set of classes for which the accuracy of full supervision is higher than 50%. As shown in Table II, the ultimate candidates are AM-SSB (3) and GFSK (6) for 2016.10A and 2016.04C, and CPFSK (5) and GFSK (6) for 2016.10B.

Split training, validation and test data. 70% of the samples from the known classes make up the overall training set, while 15% make up the known validation set and the remaining 15% make up the known test set. For the unknown classes, only a test set is needed, which consists of 15% of the unknown samples.

After the three preprocessing steps, we get a small copy of, e.g., dataset 2016.10A, which contains a training set of 12600 samples, a known validation set of 2700 samples, a known test set of 2700 samples and an unknown test set of 600 samples.

All of the networks in SR2CNN are run on a single GTX Titan X graphics processor, implemented in Python, and trained using the Adam optimizer with learning rate η = 0.001 and batch size N = 256. Generally, we allow our model to learn and update itself for at most 250 epochs. In addition, grid search on the validation set is applied to determine the hyperparameters.

V-A In-training Views

Basically, the average softmax accuracy on the known test set converges roughly to 80% on both 2016.10A and 2016.10B, and to 94% on 2016.04C, as indicated in Fig. 4. Note that there is almost no perceptible loss in accuracy when predicting with the clustering approach (i.e., the distance measure-based classification method described in Section IV) instead of softmax, meaning that the semantic feature space established by our SR2CNN functions very well. For ease of exposition, we will refer to the known cluster accuracy as the upbound (UB).

During the training course, the cross entropy loss undergoes sharp and violent oscillations. This phenomenon makes sense, since the extra center loss and reconstruction loss intermittently shift the learning focus of SR2CNN.

TABLE III: Ablation study of the discrimination task via P on 2016.10A in test. Bold: performance of the original SR2CNN model. The F1 score denotes 2 × accuracy × precision / (accuracy + precision).

indicator | SR2CNN | without Cross Entropy Loss | without Center Loss | without Reconstruction Loss | L1 Loss
accuracy AM-SSB(3) | 100.0% | 100.0% | 99.5% | 100.0% | 100.0%
accuracy GFSK(6) | 99.5% | 98.5% | 61.0% | 94.8% | 95.8%
average known accuracy | 73.7% | 72.1% | 69.0% | 72.3% | 70.4%
precision known | 76.8% | 75.3% | 79.1% | 74.5% | 82.8%
precision unknown | 96.1% | 95.2% | 82.4% | 94.5% | 86.1%
F1 score known | 75.3% | 73.6% | 73.7% | 73.3% | 76.1%
F1 score unknown | 98.0% | 97.2% | 81.3% | 95.9% | 91.6%

V-B Critical Results

The most critical results are presented in Table II. To better illustrate them, we first make a few definitions in analogy to the binary classification problem. By replacing the binary conditions positive and negative with known and unknown, respectively, we can similarly elicit true known (TK), true unknown (TU), false known (FK) and false unknown (FU). Subsequently, we get two important indicators as follows:

true~known~rate~(TKR)=\frac{TK}{K}=\frac{TK}{TK+FU}
true~unknown~rate~(TUR)=\frac{TU}{U}=\frac{TU}{TU+FK}

Furthermore, we define precision likewise as follows:

known~precision~(KP)=\frac{S_{correct}}{TK+FK}
unknown~precision~(UP)=\frac{U_{dominantly\_correct}}{TU+FU}

where S_correct denotes the total number of known samples that are correctly classified to their exact known classes, and U_dominantly_correct denotes the total number of unknown samples that are correctly classified to their exact newly-identified unknown classes. For evaluation, the real label of a certain newly-recorded unknown class is determined as the label of the majority of signal samples in that class. Note that sometimes, unexpectedly, our SR2CNN may classify a small portion of signals into different unknown classes whose real labels are actually identical and correspond to one certain unknown class (we name these unknown classes isotopic classes). In this rare case, we only count the identified unknown class with the highest accuracy when calculating U_dominantly_correct.
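These indicators, together with the weighted true rate used below, are straightforward to compute from the tallied counts; a plain transcription follows (the variable names are ours):

```python
def tkr(tk, fu):
    return tk / (tk + fu)                  # true known rate

def tur(tu, fk):
    return tu / (tu + fk)                  # true unknown rate

def known_precision(s_correct, tk, fk):
    return s_correct / (tk + fk)

def unknown_precision(u_dom_correct, tu, fu):
    return u_dom_correct / (tu + fu)

def weighted_true_rate(tkr_val, tur_val):
    return 0.4 * tkr_val + 0.6 * tur_val   # WTR, as used below

def f1(accuracy, precision):
    # F1 score as defined in Table III
    return 2 * accuracy * precision / (accuracy + precision)
```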

For ZSL, we test SR2CNN with several different combinations of the aforementioned parameters λ_1 and λ_2, hoping to obtain a satisfying result out of multiple trials. Fixing λ_2 to 1 simply leads to fair performance; still, we adjust λ_1 in a range between 0.05 and 1.0. Here, the pre-defined indicators above play an indispensable part in helping us sift the results. Generally, a well-chosen result is supposed to meet the following requirements: 1. the weighted true rate (WTR), 0.4×TKR + 0.6×TUR, is as great as possible; 2. KP > 0.95×UB, where UB is the upbound defined as the known cluster accuracy; 3. #_isotopic^j ≤ 2 for all possible j, where #_isotopic^j denotes the number of isotopic classes corresponding to a certain unknown class j.

Figure 6: Ablation study of the known accuracy on 2016.10A in training.
Figure 7: Effect of center loss. The presence of center loss is distinguished by line shape (solid or dashed). The known and unknown accuracies are distinguished by line color (blue or green).
TABLE IV: Performance among different sets of chosen unknown classes on 2016.10A. Bold: recall rate. Items split by a slash "/", like "87.8%/9.0%", and "-" have the same meanings as in Table II.

indicator \ unknown classes | AM-SSB and GFSK | CPFSK and GFSK | AM-SSB and CPFSK | AM-SSB, CPFSK and GFSK
accuracy AM-SSB(3) | 100.0% | - | 100.0% | 100.0%
accuracy CPFSK(5) | - | 71.0% | 87.8%/9.0% | 65.5%
accuracy GFSK(6) | 99.5% | 100.0% | - | 90.5%
average known accuracy | 73.7% | 68.3% | 75.6% | 69.6%
true known rate | 95.9% | 89.6% | 96.2% | 90.9%
true unknown rate | 99.8% | 85.5% | 98.4% | 85.4%
precision known | 76.8% | 73.6% | 78.3% | 74.0%
precision unknown | 96.1% | 89.2% | 91.9% | 90.4%

In order to better make a transverse comparison, we compute two extra indicators, the average total accuracy in the ZSL scenario and the average known accuracy in completely supervised learning, shown in italics in Table II. On the whole, the results are promising. However, we have to admit that ZSL incurs a small performance loss compared with the fully supervised model. Looking vertically, among all modulations the performance loss especially occurs in the class AM-DSB; looking horizontally, among all datasets it especially occurs in dataset 2016.10B. After all, when losing sight of the two unknown classes, SR2CNN can only acquire a segment of the intact knowledge that would be fully learned in the supervised case. It is this imperfection that presumably leads to the apparent variation in each class's accuracy compared with supervised learning. Among these classes, the poorest victim is always AM-DSB, with a considerable portion of its samples rejected as unknown. Besides, the features, especially those of the unknown classes, of these three datasets are not exactly at the same difficulty level of learning. Some unknown features may even be similar to known ones, which can consequently cause confusion in the discrimination tasks. There is no doubt that these uncertainties and differences in the feature domain matter a lot. Take 2016.10B: compared with its two counterparts, it suffers the greatest loss (more than 10%) in average accuracy (both total and known), and also a pair of inferior true rates. Moreover, it is indeed the only case where both unknown classes are separately identified into two isotopic classes.

It is obvious that the average accuracy strongly depends on the weighted true rate (WTR), since the clearer the discrimination between known and unknown, the more accurate the further classification and identification. Therefore, to better study this discrimination ability, we depict Fig. 5 to elucidate its variation trends with respect to the discrimination coefficient λ_1. At the same time, we introduce a new concept, the discrimination interval, defined as an interval of λ_1 where the weighted true rate is always greater than 80%. The width of this interval is used to help quantify the discrimination ability.

TABLE V: Comparison between our SR2CNN model and several traditional open-set models and outlier detectors on 2016.10A. Bold: performance of the dominant SR2CNN model. Italic: performance of the traditional methods when their true known rates reach the highest. The vertical bar "||" splits the standard results from the italic ones.

indicator | SR2CNN | IsolationForest [37] | EllipticEnvelope [38] | OneClassSVM [39] | LocalOutlierFactor [40] | OpenMax [8] | MDL4OW [43]
AM-SSB(3) | 100.0% | 72.3% || 00.0% | 100.0% || 100.0% | 96.3% || 26.0% | 100.0% | 100.0% | 99.3%
GFSK(6) | 99.5% | 01.3% || 00.0% | 90.0% || 00.0% | 00.0% || 00.0% | 00.0% | 00.0% | 26.5%
true known rate | 95.9% | 81.3% || 99.9% | 46.1% || 97.6% | 85.5% || 92.0% | 96.7% | 98.1% | 79.4%
true unknown rate | 99.8% | 36.8% || 00.0% | 95.0% || 50.0% | 48.1% || 13.0% | 50.0% | 50.0% | 62.9%

TABLE VI: Contrast between supervised learning and our ZSL scenario on dataset SIGNAL-202002. The dashed line in the ZSL column marks the boundary between known and unknown classes. Bold: accuracy for a certain unknown class. Italic: accuracy computed only to help draw a transverse comparison. "-" has the same meaning as in Table II.

indicator | class | supervised learning | zero-shot learning
accuracy | BPSK (1) | 84.3% | 70.8%
accuracy | QPSK (2) | 86.5% | 67.8%
accuracy | 8PSK (3) | 67.8% | 70.3%
accuracy | 16QAM (4) | 99.5% | 96.8%
accuracy | 64QAM (5) | 95.5% | 84.8%
accuracy | PAM4 (6) | 97.0% | 89.0%
accuracy | GFSK (7) | 56.3% | 38.3%
accuracy | AM-DSB (10) | 63.8% | 67.3%
accuracy | AM-SSB (11) | 44.3% | 62.0%
accuracy | CPFSK (8) | 100.0% | 81.0%
accuracy | B-FM (9) | 93.5% | 74.5%
average total accuracy | | 80.8% | 73.0%
average known accuracy | | 77.3% | 71.9%
true known rate | | - | 82.3%
true unknown rate | | - | 84.9%
precision known | | - | 87.4%
precision unknown | | - | 91.6%

Apparently, the curves for the two primary kinds of true rate are monotonic, increasing for the known and decreasing for the unknown. The maximum points of the weighted true rate curves for the three datasets are at about 0.4, 0.2 and 0.4, respectively. These points exactly correspond to the results shown in Table II. Besides, the width of the discrimination interval of 2016.10B is only approximately one third of those of 2016.10A and 2016.04C. This implies that the features of 2016.10B are more difficult to learn, which accounts for its relatively poor performance.

V-C Ablation Study

In this subsection, we explain the necessity of each of the three loss functions. Relevant experiments are mainly based on 2016.10A.

Fig. 6 presents the known accuracy during training in the absence of the cross entropy loss, center loss and reconstruction loss, respectively. In general, we find that the best training performance degrades when any one of these three loss functions is excluded. It can be observed that both the cross entropy loss and the reconstruction loss make a positive impact on the known accuracy, boosting it by about 3% to 5%, while the effect of the center loss seems slightly weaker.

Analyzing Table III, we can easily discern the effects of these three loss functions in the test course, especially that of the center loss. The results show that the F1 score in the absence of cross entropy loss, center loss and reconstruction loss decreases by 1.8%, 1.7% and 2.0%, respectively, for the known classes. For the unknown classes, the minimum degradation in F1 score is 0.8% after removing the cross entropy loss, while the maximum degradation is 16.7% after removing the center loss. In fact, Fig. 7 indicates that the usage of the center loss on 2016.10A indeed helps our model discriminate more distinctly, resulting in a notably broader discrimination interval. Besides, we have also attempted to apply the L1 loss [52] to calculate the center loss (Eq. (2) in Section IV) and the reconstruction loss (Eq. (5) in Section IV). The related results are presented in the last column of Table III. It is seen that the L1 loss can indeed slightly increase the F1 score of the known classes by 0.8%, however, at the cost of a 6.4% decrease in the F1 score of the unknown classes.

In sum, the three loss functions, though not promoting SR2CNN in exactly the same way or in the same respects, are all indeed useful.

V-D Other Extensions

We tentatively change the unknown classes on 2016.10A, seeking to excavate more in the feature domain of the data. As shown in Table IV, both the known precision (KP) and the unknown precision (UP) are insensitive to the change of unknown classes, showing that the classification ability of SR2CNN is consistent and well-preserved for the considered dataset. Nevertheless, the unknown class CPFSK is obviously always the hardest obstacle in the course of discrimination, since the accuracy of CPFSK is always the lowest and some isotopic classes are observed in this case. Especially, when the classes CPFSK and GFSK simultaneously take the unknown roles, the performance loss (on both TKR and TUR) is quite striking. We speculate that the unknown CPFSK and GFSK may share a considerable number of similarities with some known classes, which misleads SR2CNN in the further discrimination task.

To justify SR2CNN's superiority, we compare it with a couple of traditional methods prevalent in the field of outlier detection, as well as two open-set recognition methods, i.e., OpenMax [8] and MDL4OW [43]. For the outlier detection methods, a detected outlier is regarded as an unknown sample. For OpenMax, an extra dimension is appended to the output vector to indicate the probability of the current sample being unknown. For MDL4OW, extreme value theory is adopted to detect the unknown classes by modeling the distribution of the loss. The results are presented in Table V. It is found that our SR2CNN significantly outperforms both the outlier detection methods and the open-set recognition methods in terms of the true unknown rate. Furthermore, we find that most of the aforementioned methods cannot correctly identify GFSK as unknown. For example, in our experiment, OpenMax wrongly classifies all GFSK samples as known. As for MDL4OW, it identifies a small percentage of GFSK samples at the cost of the true known rate. However, the experimental results show that our SR2CNN still works very well for this open-set recognition task.

Note that no unknown-class identification tasks are launched here; only discrimination tasks are considered. Hence, for a certain unknown class j, we compute its unknown rate, instead of accuracy, as #_unknown^j / N_j, where N_j denotes the number of samples from unknown class j and #_unknown^j denotes the number of samples from unknown class j that are discriminated as unknown.

Figure 8: ROC curves of SR2CNN, OpenMax and some other outlier detectors on 2016.10A.

In addition, the ROC curves for the above comparison experiments are depicted in Fig. 8. It is observed that SR2CNN has the largest AUC, indicating its superiority over the other methods. Besides, there is notably a steep ‘cliff’ where the false known rate approximately equals 0.5, particularly for EllipticEnvelope, LocalOutlierFactor and OpenMax. This means that almost half of the samples of unknown classes are not easy to discriminate correctly. Correspondingly, according to Table V, we can speculate that these ‘hard’ samples all come from the unknown class GFSK.

VI Dataset SIGNAL-202002

We newly synthesize a dataset, denominated SIGNAL-202002, which we hope will be of great use for further research in the signal recognition field. The dataset consists of 11 modulation types: BPSK, QPSK, 8PSK, 16QAM, 64QAM, PAM4, GFSK, CPFSK, B-FM, AM-DSB and AM-SSB. Each type is composed of 20000 frames. Data is modulated at a rate of 8 samples per symbol, with 128 samples per frame. The channel impairments are modeled by a combination of additive white Gaussian noise, Rayleigh fading, multipath channel and clock offset. We pass each frame of our synthetic signals independently through the above channel model, seeking to emulate the real-world case, which must consider translation, dilation, impulsive noise, etc. The configuration is set as follows:

  • 20000 samples per modulation type

  • 2×128 feature dimension

  • 20 different SNRs, even values between 2 dB and 40 dB

The complete dataset is stored as a Python pickle file, about 450 MBytes in complex 32-bit floating point type. Related code for the generation process is implemented in MATLAB, and the SIGNAL-202002 dataset is available at the link: https://drive.google.com/file/d/1EDfKRNIk_txxyAyPCR7BEGs0BvEk3Bof/view.
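For readers who want to experiment with the file, a minimal loading sketch is shown below. It assumes the pickle follows the layout of the RadioML 2016 datasets (a dict keyed by (modulation, snr) tuples, each value an array of frames of shape (n_frames, 2, 128)); this layout, as well as the file name, is an assumption rather than a documented fact.

```python
import pickle

import numpy as np

# encoding='latin1' guards against the file having been pickled under Python 2.
with open('SIGNAL-202002.pkl', 'rb') as fh:
    data = pickle.load(fh, encoding='latin1')

mods = sorted({mod for mod, _ in data})          # the 11 modulation types
X = np.vstack(list(data.values()))               # all frames, stacked
y = np.concatenate([[mods.index(mod)] * len(frames)
                    for (mod, _), frames in data.items()])
print(X.shape, y.shape)                          # e.g. (220000, 2, 128) (220000,)
```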

We conduct zero-shot learning experiments on our newly-generated dataset and report the results here. As mentioned above, a supervised learning trial is similarly carried out to help us get an overview of the regular performance for each class of SIGNAL-202002. Unfortunately, as Table VI shows, the original two candidates of 2016.10A, AM-SSB and GFSK, both fail to stay on top. Therefore, we relocate the unknown roles to two other modulations: CPFSK, which has the highest accuracy overall, and B-FM, which stands out among the three analog modulation types (B-FM, AM-SSB and AM-DSB).

According to Table VI, an apparent loss of discrimination ability is observed, as both the TKR and the TUR just barely pass 80%. However, our SR2CNN still maintains its classification ability, as the accuracy for each class remains encouraging compared with the completely supervised model. A notable fact is that the known precision (KP) is remarkably high, even exceeding the KPs on 2016.10A shown in Table IV by almost 10%. To account for this, we speculate that the absence of two unknown classes may unintentionally allow SR2CNN to better focus on the features of the known ones, which consequently leads to superior performance on the known classification task.

VII Conclusion

In this paper, we have proposed a ZSL framework, SR2CNN, which can successfully extract precise semantic features of signals and discriminate both known and unknown classes. SR2CNN can work very well in situations where we have insufficient training data for certain classes. Moreover, SR2CNN can generally improve itself by updating the semantic center vectors. Extensive experiments demonstrate the effectiveness of SR2CNN. In addition, we provide a new signal dataset, SIGNAL-202002, including eight digital and three analog modulation classes for further research. Finally, we would like to point out that, because we often have I/Q signals, a possible direction for future research is to use complex neural networks [53] to establish the semantic space.

Appendix A Convolution and Deconvolution Operation

Let $\bm{a}\in\mathbb{R}^{n}$ and $\bm{b}\in\mathbb{R}^{m}$ denote the vectorized input and output matrices, respectively. Then the convolution operation can be expressed as

\bm{b}=\mathbf{M}\bm{a} \qquad (19)

where $\mathbf{M}\in\mathbb{R}^{m\times n}$ denotes the convolutional matrix, which is sparse. In back propagation through the convolution, $\frac{\partial Loss}{\partial\bm{b}}$ is given, and thus

\frac{\partial Loss}{\partial a_{j}}=\sum_{i}\frac{\partial Loss}{\partial b_{i}}\frac{\partial b_{i}}{\partial a_{j}}=\sum_{i}\frac{\partial Loss}{\partial b_{i}}\mathbf{M}_{i,j}=\mathbf{M}_{*,j}^{T}\frac{\partial Loss}{\partial\bm{b}} \qquad (20)

where $a_{j}$ denotes the $j$-th element of $\bm{a}$, $b_{i}$ denotes the $i$-th element of $\bm{b}$, $\mathbf{M}_{i,j}$ denotes the element in the $i$-th row and $j$-th column of $\mathbf{M}$, and $\mathbf{M}_{*,j}$ denotes the $j$-th column of $\mathbf{M}$. Hence,

\frac{\partial Loss}{\partial\bm{a}}=\left[\begin{matrix}\frac{\partial Loss}{\partial a_{1}}\\ \frac{\partial Loss}{\partial a_{2}}\\ \vdots\\ \frac{\partial Loss}{\partial a_{n}}\end{matrix}\right]=\left[\begin{matrix}\mathbf{M}_{*,1}^{T}\frac{\partial Loss}{\partial\bm{b}}\\ \mathbf{M}_{*,2}^{T}\frac{\partial Loss}{\partial\bm{b}}\\ \vdots\\ \mathbf{M}_{*,n}^{T}\frac{\partial Loss}{\partial\bm{b}}\end{matrix}\right]=\mathbf{M}^{T}\frac{\partial Loss}{\partial\bm{b}}. \qquad (21)

Similarly, the deconvolution operation can be written as

\bm{a}=\widetilde{\mathbf{M}}\bm{b} \qquad (22)

where $\widetilde{\mathbf{M}}$ denotes a convolutional matrix that has the same shape as $\mathbf{M}^{T}$ and whose entries are learned. The back propagation of the deconvolution can then be formulated as follows:

\frac{\partial Loss}{\partial\bm{b}}=\widetilde{\mathbf{M}}^{T}\frac{\partial Loss}{\partial\bm{a}}. \qquad (23)

For example, suppose the input and output matrices are of size $4\times 4$ and $2\times 2$, respectively, as shown in Fig. 3(c). Then $\bm{a}$ is a 16-dimensional vector and $\bm{b}$ is a 4-dimensional vector. Define the convolutional kernel $\mathbf{K}$ as

\mathbf{K}=\left[\begin{matrix}w_{00}&w_{01}&w_{02}\\ w_{10}&w_{11}&w_{12}\\ w_{20}&w_{21}&w_{22}\end{matrix}\right]. \qquad (24)

It is then easy to see that $\mathbf{M}$ is a $4\times 16$ matrix, which can be represented as follows:

\left[\begin{matrix}w_{00}&w_{01}&w_{02}&0&\ldots&0&0&0&0\\ 0&w_{00}&w_{01}&w_{02}&\ldots&0&0&0&0\\ 0&0&0&0&\ldots&w_{20}&w_{21}&w_{22}&0\\ 0&0&0&0&\ldots&0&w_{20}&w_{21}&w_{22}\end{matrix}\right]. \qquad (25)

Hence, deconvolution amounts to left-multiplying by $\widetilde{\mathbf{M}}$ in forward propagation and by $\widetilde{\mathbf{M}}^{T}$ in back propagation.
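As a sanity check of the appendix, the following NumPy sketch (an illustrative reconstruction, not the paper's code) builds the sparse matrix $\mathbf{M}$ of (25) for the stride-1 ‘valid’ convolution in the example, and numerically verifies the forward mapping (19), the transpose back-propagation rule (21), and the shape of the deconvolution matrix in (22).

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.standard_normal((3, 3))   # kernel of Eq. (24)
A = rng.standard_normal((4, 4))   # 4x4 input of the example

# Build the sparse 4x16 matrix M of Eq. (25): output pixel (p, q)
# places K over A at offset (p, q) (stride 1, 'valid' padding).
M = np.zeros((4, 16))
for p in range(2):
    for q in range(2):
        for u in range(3):
            for v in range(3):
                M[p * 2 + q, (p + u) * 4 + (q + v)] = K[u, v]

a = A.reshape(-1)   # vectorized input
b = M @ a           # forward convolution, Eq. (19)

# Compare with the direct sliding-window computation.
direct = np.array([[np.sum(K * A[p:p + 3, q:q + 3]) for q in range(2)]
                   for p in range(2)])
print(np.allclose(b.reshape(2, 2), direct))   # True

# Check Eq. (21): with Loss = 0.5*||b||^2 we have dLoss/db = b, so
# dLoss/da should equal M.T @ b; verify by finite differences.
eps = 1e-6
grad_fd = np.array([
    (0.5 * np.sum((M @ (a + eps * np.eye(16)[j])) ** 2)
     - 0.5 * np.sum(b ** 2)) / eps
    for j in range(16)
])
print(np.allclose(M.T @ b, grad_fd, atol=1e-4))   # True

# Deconvolution (Eq. (22)) left-multiplies a matrix shaped like M.T,
# mapping the 4-vector b back to a 16-vector; in practice it is learned.
M_tilde = rng.standard_normal(M.T.shape)   # 16 x 4
print((M_tilde @ b).shape)                 # (16,)
```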

References

  • [1] T. J. O’Shea, T. Roy, and T. C. Clancy, “Over-the-air deep learning based radio signal classification,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 168–179, 2018.
  • [2] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, “Convolutional neural network architectures for signals supported on graphs,” IEEE Transactions on Signal Processing, vol. 67, no. 4, pp. 1034–1049, 2018.
  • [3] S. Peng, H. Jiang, H. Wang, H. Alwageed, Y. Zhou, M. M. Sebdani, and Y.-D. Yao, “Modulation classification based on signal constellation diagrams and deep learning,” IEEE transactions on neural networks and learning systems, vol. 30, no. 3, pp. 718–727, 2018.
  • [4] T. J. O’Shea, J. Corgan, and T. C. Clancy, “Unsupervised representation learning of structured radio communication signals,” in 2016 First International Workshop on Sensing, Processing and Learning for Intelligent Machines (SPLINE).   IEEE, 2016, pp. 1–5.
  • [5] L. Du, H. Liu, P. Wang, B. Feng, M. Pan, and Z. Bao, “Noise robust radar hrrp target recognition based on multitask factor analysis with small training data size,” IEEE Transactions on Signal Processing, vol. 60, no. 7, pp. 3546–3559, 2012.
  • [6] G. C. Garriga, P. Kralj, and N. Lavrač, “Closed sets for labeled data,” Journal of Machine Learning Research, vol. 9, pp. 559–580, 2008.
  • [7] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, “Toward open set recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 7, pp. 1757–1772, 2012.
  • [8] A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1563–1572.
  • [9] H. Liu, Z. Cao, M. Long, J. Wang, and Q. Yang, “Separate to adapt: Open set domain adaptation via progressive separation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2927–2936.
  • [10] C. Geng and S. Chen, “Collective decision for open set recognition,” IEEE Transactions on Knowledge and Data Engineering, 2020.
  • [11] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell, “Zero-shot learning with semantic output codes,” in Advances in neural information processing systems, 2009, pp. 1410–1418.
  • [12] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng, “Zero-shot learning through cross-modal transfer,” in Advances in neural information processing systems, 2013, pp. 935–943.
  • [13] Z. Wang, X. Ye, C. Wang, J. Cui, and P. Yu, “Network embedding with completely-imbalanced labels,” IEEE Transactions on Knowledge and Data Engineering, 2020.
  • [14] Y. Gao, L. Gao, X. Li, and Y. Zheng, “A zero-shot learning method for fault diagnosis under unknown working loads,” Journal of Intelligent Manufacturing, pp. 1–11, 2019.
  • [15] T. J. O’Shea, J. Corgan, and T. C. Clancy, “Convolutional radio modulation recognition networks,” in International conference on engineering applications of neural networks.   Springer, 2016, pp. 213–226.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [17] S. Zheng, S. Chen, L. Yang, J. Zhu, Z. Luo, J. Hu, and X. Yang, “Big data processing architecture for radio signals empowered by deep learning: Concept, experiment, applications and challenges,” IEEE Access, vol. 6, pp. 55 907–55 922, 2018.
  • [18] B. Flowers, R. M. Buehrer, and W. C. Headley, “Evaluating adversarial evasion attacks in the context of wireless communications,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 1102–1113, 2019.
  • [19] S. Duan, K. Chen, X. Yu, and M. Qian, “Automatic multicarrier waveform classification via pca and convolutional neural networks,” IEEE Access, vol. 6, pp. 51 365–51 373, 2018.
  • [20] L. J. Wong, W. C. Headley, and A. J. Michaels, “Specific emitter identification using convolutional neural network-based iq imbalance estimators,” IEEE Access, vol. 7, pp. 33 544–33 555, 2019.
  • [21] S. Huang, L. Chai, Z. Li, D. Zhang, Y. Yao, Y. Zhang, and Z. Feng, “Automatic modulation classification using compressive convolutional neural network,” IEEE Access, vol. 7, pp. 79 636–79 643, 2019.
  • [22] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep learning based communication over the air,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, 2017.
  • [23] L. M. Hoang, M. Kim, and S.-H. Kong, “Automatic recognition of general lpi radar waveform using ssd and supplementary classifier,” IEEE Transactions on Signal Processing, vol. 67, no. 13, pp. 3516–3530, 2019.
  • [24] G. Vanhoy, N. Thurston, A. Burger, J. Breckenridge, and T. Bose, “Hierarchical modulation classification using deep learning,” in MILCOM 2018-2018 IEEE Military Communications Conference (MILCOM).   IEEE, 2018, pp. 20–25.
  • [25] T. J. O’Shea, T. Roy, N. West, and B. C. Hilburn, “Demonstrating deep learning based communications systems over the air in practice,” in 2018 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN).   IEEE, 2018, pp. 1–2.
  • [26] G. Baldini, C. Gentile, R. Giuliani, and G. Steri, “Comparison of techniques for radiometric identification based on deep convolutional neural networks,” Electronics Letters, vol. 55, no. 2, pp. 90–92, 2018.
  • [27] S. M. Hiremath, S. Behura, S. Kedia, S. Deshmukh, and S. K. Patra, “Deep learning-based modulation classification using time and stockwell domain channeling,” in 2019 National Conference on Communications (NCC).   IEEE, 2019, pp. 1–6.
  • [28] X. Zhang, T. Seyfi, S. Ju, S. Ramjee, A. El Gamal, and Y. C. Eldar, “Deep learning for interference identification: Band, training snr, and sample selection,” in 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC).   IEEE, 2019, pp. 1–5.
  • [29] K. Sankhe, M. Belgiovine, F. Zhou, L. Angioloni, F. Restuccia, S. D’Oro, T. Melodia, S. Ioannidis, and K. Chowdhury, “No radio left behind: Radio fingerprinting through deep learning of physical-layer hardware impairments,” IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 1, pp. 165–178, 2019.
  • [30] F. Liang, C. Shen, W. Yu, and F. Wu, “Towards optimal power control via ensembling deep neural networks,” IEEE Transactions on Communications, vol. 68, no. 3, pp. 1760–1776, 2019.
  • [31] X. Chen, J. Cheng, Z. Zhang, L. Wu, J. Dang, and J. Wang, “Data-rate driven transmission strategies for deep learning-based communication systems,” IEEE Transactions on Communications, vol. 68, no. 4, pp. 2129–2142, 2020.
  • [32] D. Roy, T. Mukherjee, M. Chatterjee, E. Blasch, and E. Pasiliao, “Rfal: Adversarial learning for rf transmitter identification and classification,” IEEE Transactions on Cognitive Communications and Networking, 2019.
  • [33] J. Tan, L. Zhang, Y.-C. Liang, and D. Niyato, “Intelligent sharing for lte and wifi systems in unlicensed bands: A deep reinforcement learning approach,” IEEE Transactions on Communications, vol. 68, no. 5, pp. 2793–2808, 2020.
  • [34] M. Li, O. Li, G. Liu, and C. Zhang, “Generative adversarial networks-based semi-supervised automatic modulation recognition for cognitive radio networks,” Sensors, vol. 18, no. 11, p. 3913, 2018.
  • [35] M. Usama, J. Qadir, A. Raza, H. Arif, K.-L. A. Yau, Y. Elkhatib, A. Hussain, and A. Al-Fuqaha, “Unsupervised machine learning for networking: Techniques, applications and research challenges,” IEEE Access, vol. 7, pp. 65 579–65 615, 2019.
  • [36] Z.-L. Tang, S.-M. Li, and L.-J. Yu, “Implementation of deep learning-based automatic modulation classifier on FPGA SDR platform,” Electronics, vol. 7, no. 7, p. 122, 2018.
  • [37] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining.   IEEE, 2008, pp. 413–422.
  • [38] P. J. Rousseeuw and K. V. Driessen, “A fast algorithm for the minimum covariance determinant estimator,” Technometrics, vol. 41, no. 3, pp. 212–223, 1999.
  • [39] Y. Chen, X. S. Zhou, and T. S. Huang, “One-class SVM for learning in image retrieval,” in Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), vol. 1.   IEEE, 2001, pp. 34–37.
  • [40] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000, pp. 93–104.
  • [41] R. Yoshihashi, W. Shao, R. Kawakami, S. You, M. Iida, and T. Naemura, “Classification-reconstruction learning for open-set recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4016–4025.
  • [42] C. Geng, S.-j. Huang, and S. Chen, “Recent advances in open set recognition: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [43] S. Liu, Q. Shi, and L. Zhang, “Few-shot hyperspectral image classification with unknown classes using multitask deep learning,” IEEE Transactions on Geoscience and Remote Sensing, 2020.
  • [44] D. Roy, T. Mukherjee, M. Chatterjee, and E. Pasiliao, “Detection of rogue RF transmitters using generative adversarial nets,” in 2019 IEEE Wireless Communications and Networking Conference (WCNC).   IEEE, 2019, pp. 1–7.
  • [45] S. Rajendran, W. Meert, V. Lenders, and S. Pollin, “Saife: Unsupervised wireless spectrum anomaly detection with interpretable features,” in 2018 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN).   IEEE, 2018, pp. 1–9.
  • [46] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision.   Springer, 2016, pp. 499–515.
  • [47] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel, “Variational lossy autoencoder,” arXiv preprint arXiv:1611.02731, 2016.
  • [48] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-vae: Learning basic visual concepts with a constrained variational framework.” ICLR, vol. 2, no. 5, p. 6, 2017.
  • [49] A. Ng et al., “Sparse autoencoder,” CS294A Lecture notes, vol. 72, no. 2011, pp. 1–19, 2011.
  • [50] F. Pukelsheim, “The three sigma rule,” The American Statistician, vol. 48, no. 2, pp. 88–91, 1994.
  • [51] P. Bajorski, Statistics for imaging, optics, and photonics.   John Wiley & Sons, 2011, vol. 808.
  • [52] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47–57, 2017.
  • [53] A. Hirose, Complex-valued neural networks: Advances and applications.   John Wiley & Sons, 2013, vol. 18.
Yihong Dong received his B.S. degree in Computer Science from Shanghai University, China, in 2019. He is currently a graduate student with the School of Software Engineering, Tongji University. His research interests include machine learning with applications in signal processing.
Xiaohan Jiang is currently an undergraduate with the School of Software Engineering, Tongji University. His research interests include deep learning and computer vision.
Huaji Zhou was born in Jinhua, Zhejiang, China, in 1988. He received the B.S. degree in Navigation, Guidance, and Control Technology and the M.S. degree in Pattern Recognition and Intelligent System from Xidian University, Xi’an, China, in 2010 and 2013, respectively. He is currently a Research Assistant with the Science and Technology on Communication Information Security Control Laboratory, Jiaxing, China, and is pursuing a Ph.D. degree in Electronics and Information at Xidian University, Xi’an, China. His current research interests include machine learning and electromagnetic signal processing.
Yun Lin received the B.S. degree from Dalian Maritime University, Dalian, China, in 2003, the M.S. degree from the Harbin Institute of Technology, Harbin, China, in 2005, and the Ph.D. degree from Harbin Engineering University, Harbin, China, in 2010. He was a research scholar with Wright State University, USA, from 2014 to 2015. He is currently a full professor in the College of Information and Communication Engineering, Harbin Engineering University, China. His current research interests include machine learning and data analytics over wireless networks, signal processing and analysis, cognitive radio and software defined radio, and artificial intelligence and pattern recognition. He has published more than 150 international peer-reviewed journal/conference papers in venues such as IEEE IoT, TII, TVT, TCCN, TR, Access, INFOCOM, GLOBECOM, ICC, VTC, and ICNC. He has four highly cited papers and several best conference papers. He serves as an editor for the IEEE TRANSACTIONS ON RELIABILITY, KSII Transactions on Internet and Information Systems, and the International Journal of Performability Engineering. In addition, he served as General Chair of ADHIP 2020, TPC Chair of MOBIMEDIA 2020, ICEICT 2019 and ADHIP 2017, and TPC member of GLOBECOM, ICC, ICNC and VTC. He has successfully organized several international workshops and symposia with top-ranked IEEE conferences, including INFOCOM, GLOBECOM, DSP, and ICNC, among others.
Qingjiang Shi received his Ph.D. degree in electronic engineering from Shanghai Jiao Tong University, Shanghai, China, in 2011. From September 2009 to September 2010, he visited Prof. Z.-Q. (Tom) Luo’s research group at the University of Minnesota, Twin Cities. In 2011, he worked as a Research Scientist at Bell Labs China. From 2012, he was with the School of Information and Science Technology at Zhejiang Sci-Tech University. From Feb. 2016 to Mar. 2017, he worked as a research fellow at Iowa State University, USA. Since Mar. 2018, he has been a full professor with the School of Software Engineering at Tongji University. He is also with the Shenzhen Research Institute of Big Data. His interests lie in algorithm design and analysis with applications in machine learning, signal processing and wireless networks. So far, he has published more than 60 IEEE journal papers and filed about 30 national patents. Dr. Shi was an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING. He was awarded the Gold Medal at the 46th International Exhibition of Inventions of Geneva in 2018, and was also the recipient of the First Prize of Science and Technology Award from the China Institute of Communications in 2018, the National Excellent Doctoral Dissertation Nomination Award in 2013, the Shanghai Excellent Doctoral Dissertation Award in 2012, and the Best Paper Award from the IEEE PIMRC’09 conference.