Variational Temporal Deep Generative Model for Radar HRRP Target Recognition

Dandan Guo, Bo Chen, Wenchao Chen, Chaojie Wang, Hongwei Liu, and Mingyuan Zhou This research was partially supported by the Program for Oversea Talent by the Chinese Central Government, the 111 Project under Grant B18039, NSFC under Grant 61771361, the National Science Fund for Distinguished Young Scholars of China (61525105), and Shaanxi Innovation Team Project. (Corresponding author: Bo Chen.) Dandan Guo, Bo Chen, Wenchao Chen, Chaojie Wang and Hongwei Liu are with the National Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China. (e-mail:[email protected]; [email protected]; [email protected]; [email protected]; [email protected]) M. Zhou is with McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA. (e-mail: [email protected])

Abstract

We develop a recurrent gamma belief network (rGBN) for radar automatic target recognition (RATR) based on high-resolution range profile (HRRP), which characterizes the temporal dependence across the range cells of HRRP. The proposed rGBN adopts a hierarchy of gamma distributions to build its temporal deep generative model. For scalable training and fast out-of-sample prediction, we propose the hybrid of a stochastic-gradient Markov chain Monte Carlo (MCMC) and a recurrent variational inference model to perform posterior inference. To utilize the label information to extract more discriminative latent representations, we further propose supervised rGBN to jointly model the HRRP samples and their corresponding labels. Experimental results on synthetic and measured HRRP data show that the proposed models are efficient in computation, have good classification accuracy and generalization ability, and provide highly interpretable multi-stochastic-layer latent structure.

Index Terms:

Recurrent gamma belief network (rGBN), radar automatic target recognition (RATR), high-resolution range profile (HRRP), stochastic-gradient MCMC, recurrent variational inference

I Introduction

Radar automatic target recognition (RATR) is to identify the unknown target from its radar echoes, which plays an important role in many applications, such as surveillance, homeland security, and military tasks [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. Generally, the researches of RATR in high-resolution wideband radar can be divided into two categories, including RATR based on high-resolution range profile (HRRP) and that based on synthetic aperture radar (SAR) or inverse SAR (ISAR). Compared with SAR and ISAR, HRRP owns the distinct advantage of being processed directly without first forming an image [16]. Specifically, HRRP is composed of the amplitude of the coherent summations of the complex returns from target scatterers in each range cell, which represents the projection of the complex echoes from the target scattering center onto the radar line-of-sight (LOS). Since HRRP contains abundant discriminative information, such as the target size and scatterer distribution, HRRP based RATR has received significant attention [9, 10, 11, 12, 13, 14, 15, 16, 17, 18].

As the key step of HRRP based RATR, feature extraction has been widely studied, which can be motivated by different opinions according to the specific requirement. For example, based on the integration of the bispectra of range profiles, Liao et al. [18] investigate bispectral features from real HRRP data. Zhang et al. [9] and Du et al. [10] further develop HRRP target recognition methods using high-order spectra features for their shift-invariance properties. Molchanov et al. [12] study the possibility of classifying aerial targets using the micro-Doppler signatures, where the novel features are computed in the form of cepstral coefficients and bicoherence estimates. Despite their usefulness for target recognition, those engineered features are hand-crafted and heavily rely on personal experiences, limiting their use in practice [15].

To learn data-driven features, Du et al. [19] introduce principal component analysis (PCA) to extract the complex HRRPs’ feature subspace, by minimizing the reconstruction error for the HRRP data. Du et al. [20] further propose a factor analysis (FA) model based on multitask learning to describe the Fast Fourier Transform (FFT)-magnitude feature of complex HRRP. Some researchers [21] adopt the K-Singularly Valuable Decomposition (K-SVD) dictionary learning method to extract the desired sparse over-complete features of HRRP data. Those methods have proven their effectiveness in practice, but they are all shallow models and only good at extracting linear features [22]. Inspired by the ability of deep learning methods in extracting multilayer non-linear features [23], Feng et al. [15] adopt stacked corrective auto-encoders (SCAE) to extract robust hierarchical features for HRRP, employing the average profile of each HRRP frame as the correction term. Pan et al. [24] propose a discriminant deep belief network (discriminant DBN) to recognize non-cooperative targets with imbalanced HRRP training data. Nevertheless, all these models only depict the global structure of the target in a single HRRP sample, ignoring the sequential relationships across range cells within a single HRRP sample.

Several approaches have been proposed to exploit the temporal dependence in HRRP. Du et al. [16] propose a Bayesian dynamic model for the HRRP sequence, where the spatial structure across range cells in HRRP is depicted by a hidden Markov model (HMM) and the temporal dependence between HRRP samples is depicted by the time evolution of the transition probabilities. Pan et al. [25, 17] characterize the spectrogram feature from a single HRRP sample via an HMM, which is a two-dimensional time-frequency representation providing the variation of the target in both the frequency and time domains. Besides, Wang et al. [26] characterize the frequency spectrum amplitude of HRRP based on linear dynamical systems (LDS) to relax the requirement of a large training set. Despite the tremendous success of HMM and LDS in many areas, efficiently capturing complex dependencies between range cells in HRRP remains a challenging problem due to their limited expressiveness. With the development of temporal one-dimensional convolution neural network (CNN) in raw audio generation tasks [27], Wan et al. [28] use one-dimensional CNN to handle HRRP in the time domain and construct a two-dimensional CNN model for the corresponding spectrogram representation. Besides, recurrent neural networks (RNNs) [29, 30, 31] have been developed to capture complex temporal behavior in high-dimensional sequences. While in principle RNN is a powerful model, it does not consider the kind of variability observed in highly structured data [32, 33] and ignores the weight uncertainty when updating its parameters with stochastic optimization such as stochastic gradient descent (SGD) [34]. Moreover, learning both hierarchical and temporal representations has been a long-standing challenge for RNNs, in spite of the fact that hierarchical structures naturally exist in many sequential data and the learned latent hierarchical structures can provide useful information to other downstream tasks such as sequence generation and classification [35, 36]. In addition, the neural network representations are generally inaccessible and uninterpretable to humans [37]. More recently, several deep probabilistic dynamical models have been proposed to capture the relationships between latent variables across multiple stochastic layers on video and music sequences, text streams, and motion capture data [38, 39, 40].

Deep Poisson gamma dynamical system (DPGDS) is a deep Bayesian top-down directed generative model (decoder) [40], which takes advantage of the hierarchical structure to efficiently incorporate both between-layer and temporal dependencies. Different from classical Kalman filters, such as LDS, where the uses of linear transition and emission distribution limit the capacity to model complex phenomena [41], DPGDS directly chains the observed data in a state space model (deep gamma Markov chain) that evolves with gamma noise. To take advantage of the temporal dependence within each HRRP sequence, we can construct a sequential HRRP RATR model with DPGDS, where each HRRP can be divided into multiple overlapping sequential HRRP segments as input. Despite being able to infer the multilayer contextual representation of observed HRRP sequences with scalable inference, the inference of DPGDS relies on a potentially large number of Markov chain Monte Carlo (MCMC) iterations to extract the latent representation of a new sample at the testing stage, which may be unattractive when real-time processing is desired. Thus, the key challenge for DPGDS in RATR (unconventional deep Markov chain) is to infer the gamma distributed latent states with higher efficiency. In this paper, we first generalize DPGDS to recurrent gamma belief network (rGBN) for real-time processing at the test stage, by equipping it with a novel inference network (encoder). Specifically, to provide scalable training and fast out-of-sample prediction, the potential solution is to build an inference model with a variational auto-encoder (VAE) [42]. But existing VAE variants are mostly restricted to non-sequential observed data and Gaussian distributed latent variables. To address these constraints, motivated by related constructions in Zhang et al. [43], we construct a novel recurrent variational inference network (encoder) to map the observed HRRP samples directly to their latent temporal representations. Then we provide the hybrid of stochastic-gradient-MCMC (SG-MCMC) and recurrent variational inference to infer both the posterior distribution (rather than a point estimate) of the global parameters of generative model and latent temporal representations. To the best of our knowledge, the proposed rGBN, characterized by a top-down generative structure with temporal feedforward structure on each layer and a novel inference model, is the first deep probabilistic dynamical model for the HRRP RATR task. Although the features unsupervisedly learned by rGBN can be fed into a downstream classifier to make predictions, it is often beneficial to incorporate the target label information into the model [44]. To explore this potential, we further develop an end-to-end supervised rGBN (s-rGBN), whose extracted features are good for both classification and data representation.

The remainder of the paper is organized as follows. In Section II, we present a brief description of HRRP, HMM, LDS, and RNN. We introduce our generative and inference networks in Section III. By jointly modeling HRRP samples and their labels, we further propose the supervised model, $i.e.$ , s-rGBN, in Section IV. The detailed experimental results based on synthetic data and measured HRRP data are reported in Section V. Section VI concludes the paper.

II Preliminaries

II-A Review of HRRP

Refer to caption — Figure 1: Radar returns from the scatterers on the target are projected onto the line-of-sight (LOS), resulting in an HRRP. This figure is quoted from Zwart et al. [45].

For a high resolution radar (HRR), the wavelength of the radar signal is far smaller than the target size. Intuitively, HRRP is a representation of the time domain response of the target to an HRR pulse as a one-dimensional signature, which is the expression of the distribution of radar scattering centers along with the radar LOS [16][20][22], as shown in Fig. 1. Suppose the transmitted signal of an HRR is $s(t)e^{j2\pi f_{c}t}$ , where $s(t)$ denotes the complex envelop and $f_{c}$ is the radar carrier frequency. Following the literatures [13, 19], the $n$ th complex echo from the $d$ th range cell ( $n=1,2,...,N,d=1,2,...,D$ ) in the baseband is defined as follows

\displaystyle x_{d}(t,n)=\sum_{i}^{P_{d}}\sigma_{di}s\left(t-\textstyle\frac{2R_{di}(n)}{c}\right)e^{-j[\frac{4\pi}{\lambda}R_{di}(n)-\theta_{di0}]}

(1)

where $P_{d}$ represents the number of target scatterers in the $d$ th range cell, $\lambda$ is the wavelength of HRR, $\sigma_{di}$ and $\theta_{di0}$ represent the intensity and initial phase of the $i$ th scatterer in the $d$ th range cell, respectively. $R_{di}(n)$ can be regarded as the radial distance between the $i$ th scatterer in the $d$ th range cell of the $n$ th returned echo and the radar.

We set $R(n)$ as the radial distance between the target reference center in the $n$ th returned echo and the radar, and set $L_{x}$ as the radial length of the target, usually $R(n)\gg L_{x}$ . Due to the target rotation, there are radial displacements for the scatterers. Then $R_{di}(n)\approx R(n)+\Delta r_{di}(n)$ , where $\Delta r_{di}(n)$ represents the radial displacement of the $i$ th scatterer in the $d$ th range cell in the $n$ th returned echo. When $s(\cdot)$ is a rectangular pulse signal with unit intensity, it is usually omitted. Equation (1) can be approximated as $x_{d}(t,n)\approx x_{d}(n)=\sum_{i}^{P_{d}}\sigma_{di}e^{-j\frac{4\pi}{\lambda}R(n)}e^{j\phi_{di}(n)}$ , where $\phi_{di}(n)=\theta_{di0}-\frac{4\pi}{\lambda}\Delta r_{di}(n)$ denotes the remained echo phase of the $i$ th scatterer in the $d$ th range cell of the $n$ th returned echo, and $e^{-j\frac{4\pi}{\lambda}R(n)}$ represents the initial phase of the $n$ th returned echo related to the target distance and radar wavelength. Since $e^{-j\frac{4\pi}{\lambda}R(n)}$ does not contain the target discriminative information, we can eliminate it and define the $n$ th real HRRP sample $\boldsymbol{x}(n)$ as

		$\displaystyle\boldsymbol{x}(n)\triangleq\left[\|x_{1}(n)\|,\|x_{2}(n)\|,\ldots,\|x_{D}(n)\|\right]$		(2)
		$\displaystyle=\small\left[\left\|\sum_{i}^{P_{1}}\sigma_{1i}e^{j\phi_{1i}(n)}\right\|,\left\|\sum_{i}^{P_{2}}\sigma_{2i}e^{j\phi_{2i}(n)}\right\|,\ldots\left\|\sum_{i}^{P_{D}}\sigma_{Di}e^{j\phi_{Di}(n)}\right\|\right],$		(2)

where $|\cdot|$ means taking absolute value, and $D$ is the dimensionality of $\boldsymbol{x}(n)$ .

II-B Related dynamical models

Denoting a sample as $T$ sequentially observed $V$ -dimensional vectors, we present a short review of the existing dynamical models used to model the temporal dependence across the range cells in HRRP [16, 17, 25, 26, 31].

Hidden Markov Model: The joint likelihood of the observation and underlying state sequence can be expressed as

	$\displaystyle P(\boldsymbol{x}_{n},\boldsymbol{s}_{n}\,\|\,\boldsymbol{\omega}_{0},\boldsymbol{\Pi},\boldsymbol{\phi})=P(\boldsymbol{s}_{1,n}\,\|\,\omega_{0})\prod_{t=1}^{T-1}P(\boldsymbol{s}_{t,n}\,\|\,\boldsymbol{s}_{t-1,n},\boldsymbol{\Pi})$
	$\displaystyle\times\prod_{t=1}^{T}P(\boldsymbol{x}_{t,n}\,\|\,\boldsymbol{s}_{t,n},\boldsymbol{\phi}),$		(3)

where $\boldsymbol{s}_{n}=\{\boldsymbol{s}_{1,n},..,\boldsymbol{s}_{t,n},..,\boldsymbol{s}_{T,n}\}$ denotes the latent states of the $n$ th sample, $\boldsymbol{\omega}_{0}$ is the initial state probability, $\boldsymbol{\Pi}$ is the state transition distribution and $\boldsymbol{\phi}$ are a set of parameters governing the emission probability.

Linear Dynamical System: It consists of the following observation and state equations

\displaystyle\boldsymbol{x}_{t,n}\sim N(\boldsymbol{\Phi}\boldsymbol{s}_{t,n},\Sigma),~{}\boldsymbol{s}_{t,n}\sim N(\boldsymbol{\Pi}\boldsymbol{s}_{t-1,n},\Delta),

(4)

where $\boldsymbol{\Pi}$ is the transition matrix, $\Sigma$ and $\Delta$ are covariance matrices, and $\boldsymbol{s}_{t,n}$ denotes the latent state that is projected to the observed space via the factor loading matrix $\boldsymbol{\Phi}$ .

Recurrent Neural Network: At timestep $t$ , RNN reads the symbol $\boldsymbol{x}_{t,n}$ and updates its hidden state $\boldsymbol{h}_{t,n}^{(l)}$ at layer $l$

\small\boldsymbol{h}_{t,n}^{(l)}=\left\{\begin{array}[]{ll}{f\left({\bf W}_{hh}^{(l)}\boldsymbol{h}_{t-1,n}^{(l)}+{\bf W}_{xh}^{(l)}\boldsymbol{x}_{t,n}\right)},&{\text{ if }l=1},\\ f\left({\bf W}_{hh}^{(l)}\boldsymbol{h}_{t-1,n}^{(l)}+{\bf W}_{xh}^{(l)}\boldsymbol{h}_{t,n}^{(l-1)}\right),&{\text{ if }L\geq l>1},\\ \end{array}\right.

(5)

where $f$ is a deterministic non-linear function, ${\bf W}_{hh}^{(l)}$ is the transition matrix, ${\bf W}_{xh}^{(l)}$ is the weight matrix at layer $l$ , and the bias vectors are omitted for conciseness.

Although HMM and LDS have been widely studied, their representation power may be limited when modeling the HRRP sequential samples. Compared with LDS and HMM, RNN typically owns extra expressive power due to the existence of deep hidden states and flexible non-linear transition function. However, the internal structure of RNN is in general entirely deterministic, with the only source of variability provided by its conditional output probability model, which may be inappropriate to model the kind of variability observed in the HRRP data [33].

III Variational Temporal Deep Generative Model

III-A Input representation

According to Xu et al. [31], to discover the temporal dependence between the range cells within the single HRRP, we divide the $n$ th $D$ -dimensional real HRRP $\boldsymbol{x}(n)$ in (2) into $\boldsymbol{x}_{n}\in\mathbb{R}_{+}^{V\times T}$ , which consists of $T$ sequentially observed $V$ -dimensional vector. Shown in Fig. 2, we denote the HRRP sequence as $\boldsymbol{x}_{n}=[\boldsymbol{x}_{1,n},..,\boldsymbol{x}_{t,n},..\boldsymbol{x}_{T,n}]$ , where $\boldsymbol{x}_{t,n}\in\mathbb{R}_{+}^{V\times 1}$ is the time sequential feature at timestep $t$ of sample $n$ and can be defined as

\displaystyle{\boldsymbol{x}_{t,n}=\boldsymbol{x}(n)\left(a_{t}+1:a_{t}+V\right),}

(6)

where $V$ is the size of window function, intercepting $V$ range cells from the real HRRP sample, $a_{t}=(V\!-\!O)\ast(t\!-\!1)$ is the index at timestep $t$ , and $O$ denotes the overlap length across the windows, determining the degree of correlation between adjacent timesteps. Thus $T=\left\lfloor{(D-O)/(V-O)}\right\rfloor$ . We denote ${\bf X}=\{\boldsymbol{x}_{n}\}_{n=1}^{N}$ as the HRRP dataset, which consists of $N$ independent and identically distributed (IID) HRRP sequences, and ${\bf Y}=\{y_{n}\}_{n=1}^{N}$ as corresponding labels, where $y_{n}$ is the label of $\boldsymbol{x}_{n}$ . Note sequential inputs $\boldsymbol{x}_{1:T,n}$ from one HRRP sample $\boldsymbol{x}_{n}$ share the same label. The dataset $\{{\bf X},{\bf Y}\}$ can be fed into a dynamic model as $N$ IID samples, each of which contains $T$ sequentially observed $V$ -dimensional vectors (with identical label). For simplicity, we only exhibit the modeling process for the $n$ th HRRP sequence.

III-B Generative Model

To characterize the sequential feature within a single HRRP sample, we generalize the deep Poisson gamma dynamical system (DPGDS) [40] to rGBN, whose generative model is sketched in Fig. 3 (a). Specifically, we consider the deep architecture with $L$ layers, and denote $\boldsymbol{s}_{t,n}^{(l)}\in\mathbb{R}_{+}^{K_{l}}$ as the latent state of $\boldsymbol{x}_{t,n}$ in (6) at layer $l$ , time step $t$ , where $K_{l}$ is the number of states at layer $l$ . Different from HMM and LDS whose single-layer latent state at time step $t$ only relies on the state at previous time step $t-1$ , the latent state $\boldsymbol{s}_{t,n}^{(l)}$ of rGBN, from the top layer to bottom, is formulated as

\displaystyle\small\boldsymbol{s}_{t,n}^{(l)}\sim\mbox{Gam}\left(b^{(l)}\left(\boldsymbol{\Pi}^{(l)}\boldsymbol{s}_{t-1,n}^{(l)}+\boldsymbol{\Phi}^{(l+1)}\boldsymbol{s}_{t,n}^{(l+1)}\right),1/b^{(l)}\right),

(7)

where $x\sim\mbox{Gam}(a,1/b)$ denotes the gamma distribution with shape $a$ , scale $1/b$ , and mean $a/b$ ; $\boldsymbol{\Pi}^{(l)}\in\mathbb{R}_{+}^{K_{l}\times K_{l}}$ the transition matrix of layer $l$ , $\boldsymbol{\Phi}^{(l)}\in\mathbb{R}_{+}^{K_{l-1}\times K_{l}}$ the weight matrix connecting layers $l-1$ and $l$ , $K_{0}=V$ , and $1/b^{(l)}$ the gamma scale parameter at layer $l$ . When $t=1$ , $\boldsymbol{s}_{1,n}^{(l)}\sim\mbox{Gam}\left(b^{(l)}\boldsymbol{\Phi}^{(l+1)}\boldsymbol{s}_{1,n}^{(l+1)},1/b^{(l)}\right)$ for $1\leq l<L$ and $\boldsymbol{s}_{1,n}^{(L)}\sim\mbox{Gam}\left(b^{(L)}\textrm{1}_{K_{L}},1/b^{(L)}\right)$ . When $t>1$ , $\boldsymbol{s}_{t,n}^{(L)}\sim\mbox{Gam}\left(b^{(L)}\boldsymbol{\Pi}^{(L)}\boldsymbol{s}_{t-1,n}^{(L)},1/b^{(L)}\right)$ .

The gamma shape parameter of $\boldsymbol{s}_{t,n}^{(l)}$ can be divided into two parts. One is the product of the transition matrix $\boldsymbol{\Pi}^{(l)}$ and latent state $\boldsymbol{s}_{t-1,n}^{(l)}$ , capturing the temporal dependence at the current layer, while the other is the product of the connection weight matrix $\boldsymbol{\Phi}^{(l+1)}$ and latent state $\boldsymbol{s}_{t,n}^{(l+1)}$ , capturing the hierarchical dependence at the current time. Moving beyond RNN using deterministic non-linear transition functions, we construct a dynamic probabilistic model using gamma distributed non-negative hidden units. Therefore, the proposed model is characterized by its expressive structure, which not only captures the correlations between the latent states across all layers and time steps, but also models the variability of the latent states, improving its ability to model sequential HRRP.

As a generative model with the deep temporal structure, the observed HRRP sequence at time step $t$ can be drawn from $p(\boldsymbol{x}_{t,n}\,|\,\boldsymbol{\Phi}^{(1)},\boldsymbol{s}_{t,n}^{(1)})$ . To consider the non-negative nature of the HRRP data and facilitate inference, we introduce the Poisson randomized gamma (PRG) distribution defined as $\mbox{PRG}\left(\boldsymbol{x}_{t,n}\,|\,\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)},c\right)$ in Zhou et al. [46], which has a point mass at $\boldsymbol{x}_{t,n}=0$ and is continuous for $\boldsymbol{x}_{t,n}>0$ . Since the PRG distribution is generated as a Poisson mixed gamma distribution, the data likelihood can be expressed as

\displaystyle\boldsymbol{x}_{t,n}\sim\mbox{Gam}\left(\boldsymbol{r}_{t,n},1/c\right),\boldsymbol{r}_{t,n}\sim\mbox{Pois}\left(\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)}\right),

(8)

where $c>0$ , $x\sim\mbox{Pois}(\lambda)$ represents the Poisson distribution with mean $\lambda$ and variance $\lambda$ , and $\boldsymbol{\Phi}^{(1)}\in\mathbb{R}_{+}^{V\times K_{1}}$ is the weight matrix of layer $1$ .

For scale identifiability and ease of inference and interpretation, we place the Dirichlet distribution prior on each column of $\boldsymbol{\Phi}^{(l)}$ and $\boldsymbol{\Pi}^{(l)}$ , $i.e.$ , $\boldsymbol{\phi}_{k}^{(l)}$ and $\boldsymbol{\pi}_{k}^{(l)}$ , by letting

	$\displaystyle\boldsymbol{\phi}_{k}^{(l)}\sim\textrm{Dir}(\boldsymbol{\eta}^{(l)},...,\boldsymbol{\eta}^{(l)}),$		(9)
	$\displaystyle\boldsymbol{\pi}_{k}^{(l)}\sim\textrm{Dir}(\boldsymbol{\nu}^{(l)},...,\boldsymbol{\nu}^{(l)}),$		(10)

for $l\in{1,...,L}$ , which makes the elements of each column be non-negative and sum to one. Note $\boldsymbol{\pi}_{k}^{(l)}=(\boldsymbol{\pi}_{1,k}^{(l)},..,\boldsymbol{\pi}_{k_{1},k}^{(l)},..,\boldsymbol{\pi}_{K_{l},k}^{(l)})$ and $\boldsymbol{\pi}_{k_{1},k}^{(l)}$ describes how much the weight of state $k$ of the previous time at layer $l$ is transited to influence state $k_{1}$ of the current time at the same layer. Under the hierarchical dynamical model defined by (7) and (8), the joint likelihood of the observation HRRP and the temporal latent states can be constructed as

		$\displaystyle P\left(\boldsymbol{x}_{t,n},\{\boldsymbol{s}_{t,n}^{(l)}\}_{l=1}^{L}\,\|\,\{\boldsymbol{\Phi}^{(l)},\boldsymbol{\Pi}^{(l)},b^{(l)}\}_{l=1}^{L},c\right)$
		$\displaystyle=\left[\prod\limits_{l=1}^{L}{p\left(\boldsymbol{s}_{t,n}^{(l)}\,\|\,\boldsymbol{\Phi}^{(l+1)}\boldsymbol{s}_{t,n}^{(l+1)},\boldsymbol{\Pi}^{(l)}\boldsymbol{s}_{t-1,n}^{(l)},b^{(l)}\right)}\right]$		(11)
		$\displaystyle\times p\left(\boldsymbol{x}_{t,n}\,\|\,\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)},c\right).$

The parameters of the generative model comprise the transition and weight matrices, which we write as $\{\boldsymbol{\Pi}^{(l)},\boldsymbol{\Phi}^{(l)}\}_{l=1}^{L}$ . Fig. 3 (a) shows the graphical representation of the proposed generative model and Fig. 3(b) is the unfolded representation of the model structure.

Structure Analysis: As discussed above, if $x\sim\mbox{Gam}(a,1/b)$ , the mean of $x$ is $a/b$ ; while if $x\sim\mbox{Pois}(\lambda)$ , the mean of $x$ is $\lambda$ . Therefore, the expected value of $\boldsymbol{x}_{t,n}$ in (8) and $\boldsymbol{s}_{t,n}^{(l)}$ in (7) can be expressed as

\displaystyle\small\mathbb{E}\left[\boldsymbol{x}_{t,n}\,|\,\boldsymbol{r}_{t,n},c\right]=\mathbb{E}\left[\boldsymbol{r}_{t,n}\,|\,\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)}\right]/c=\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)}/c,

(12)

\displaystyle\small\mathbb{E}\left[\boldsymbol{s}_{t,n}^{(l)}\,|\,\boldsymbol{s}_{t,n}^{(l+1)},\boldsymbol{s}_{t-1,n}^{(l)},\boldsymbol{\Phi}^{(l+1)},\boldsymbol{\Pi}^{(l)}\right]=\boldsymbol{\Phi}^{(l+1)}\boldsymbol{s}_{t,n}^{(l+1)}+\boldsymbol{\Pi}^{(l)}\boldsymbol{s}_{t-1,n}^{(l)}.

(13)

Based on (12) and (13), for a three-hidden layer rGBN shown in Fig. 3(a), we have

		$\displaystyle\mathbb{E}[\boldsymbol{x}_{t,n}\,\,\|\,\,\boldsymbol{s}_{t-1,n}^{(1)},\boldsymbol{s}_{t-2,n}^{(2)},\boldsymbol{s}_{t-3,n}^{(3)}]/c$
		$\displaystyle=\boldsymbol{\Phi}^{(1)}\boldsymbol{\Pi}^{(1)}\boldsymbol{s}_{t-1,n}^{(1)}+\boldsymbol{\Phi}^{(1)}\boldsymbol{\Phi}^{(2)}[\boldsymbol{\Pi}^{(2)}]^{2}\boldsymbol{s}_{t-2,n}^{(2)}$		(14)
		$\displaystyle~{}~{}~{}~{}+\boldsymbol{\Phi}^{(1)}\boldsymbol{\Phi}^{(2)}\left(\boldsymbol{\Pi}^{(2)}\boldsymbol{\Phi}^{(3)}+\boldsymbol{\Phi}^{(3)}\boldsymbol{\Pi}^{(3)}\right)[\boldsymbol{\Pi}^{(3)}]^{2}\boldsymbol{s}_{t-3,n}^{(3)},$

where the expected value of $\boldsymbol{x}_{t,n}$ depends on the latent states at the previous time step in each layer, indicating our proposed model captures and transmits long-range temporal information through its higher hidden layers. In addition, the proposed model can be viewed as the generalization of LDS and HMM with deep gamma distributed latent representations, and also can be considered as a probabilistic construction of the traditionally deterministic RNN by adding uncertainty into the latent space via a deep generative model.

Ignoring the temporal structure of the equation, we notice that the information of the whole HRRP data set can be compressed into the inferred network $\{\boldsymbol{\Phi}^{(1)},\boldsymbol{\Phi}^{(2)},\boldsymbol{\Phi}^{(3)}\}$ , which depicts the global structure of the target in a single HRRP. To be more specific, we can visualize the $\boldsymbol{\phi}_{k}^{(l)}$ at layer $l$ as $\left[\prod_{p=1}^{l-1}\boldsymbol{\Phi}^{(p)}\right]\boldsymbol{\phi}_{k}^{(l)}$ , which are often quite specific at the bottom layer and become increasingly more general when moving upwards. We will examine the weights of different layers to understand the general and specific aspects of the HRRP data, and illustrate how the weights of different layers are related to each other.

Quantizing the HRRP data: While the non-negative real HRRP dataset can be modeled by PRG distribution, the latent count $\boldsymbol{r}_{t,n}$ in (8) need to be sampled in each iteration of the training stage. Specifically, the conditional posterior of $\boldsymbol{r}_{t,n}$ given $\boldsymbol{x}_{t,n}$ , $\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)}$ , and $c$ can be expressed as

\displaystyle\leavevmode\resizebox{413.38667pt}{}{$p\left(\boldsymbol{r}_{t,n}\,|\,\boldsymbol{x}_{t,n},\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)},c\right)\!=\!\frac{\mbox{Gam}\left(\boldsymbol{r}_{t,n},1/c\right)\mbox{Pois}\left(\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)}\right)}{\sum\limits_{\boldsymbol{r}_{t,n}\!=\!0}^{\infty}\mbox{Gam}\left(\boldsymbol{r}_{t,n},1/c\right)\mbox{Pois}\left(\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)}\right)}$}.

(15)

More details about the PRG distribution can be found in Zhou et al. [46] and are omitted here for brevity. Despite the desired ability to depict non-negative continuous data, the PRG distribution may be time-consuming for the iterative procedure to infer latent counts in real-world applications. Considering the limited computation power in the training stage, HRRP data can be modeled with Poisson distribution, by directly discretizing the HRRP data to counts ( $i.e.$ , non-negative integers) before training, where the sampling step of the counts in each iteration can be avoided. Therefore, instead of modeling the observed data with (8), we assume that

\displaystyle\lfloor\mu\boldsymbol{x}_{t,n}\rfloor\sim\mbox{Pois}\left(\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)}\right),

(16)

where the scaling factor $\mu$ controls the fineness of this discretization. There is a trade-off between the benefit of the PRG distribution and restriction of computation resources. Note that modeling the observed HRRP with the Poisson distribution can still obtain similar temporal dependencies in (III-B) and hierarchical structure $\prod_{p=1}^{l}\boldsymbol{\Phi}^{(p)}$ , omitted here for brevity.

III-C Hybrid Inference Model

In this section, we first infer the generative model global parameters $\{\boldsymbol{\Pi}^{(l)}\}_{l=1}^{L}$ and $\{\boldsymbol{\Phi}^{(l)}\}_{l=1}^{L}$ with MCMC, then introduce a recurrent inference model to infer the latent states $\{\boldsymbol{s}_{t,n}^{(l)}\}_{t=1,l=1,n=1}^{T,L,N}$ . Finally, we provide a hybrid SG-MCMC and recurrent variational inference, which is scalable at the training stage and fast at the testing stage.

MCMC inference for the generative network: Given the HRRP sequences, the inference task here is to find the weight matrices $\{\boldsymbol{\Phi}^{(l)}\}_{l=1}^{L}$ , transition matrices $\{\boldsymbol{\Pi}^{(l)}\}_{l=1}^{L}$ , and latent states $\{\boldsymbol{s}_{t,n}^{(l)}\}_{t=1,l=1,n=1}^{T,L,N}$ . While it is difficult to infer the introduced model for the coupling of $\{\boldsymbol{s}_{t,n}^{(l)}\}_{t=1,l=1,n=1}^{T,L,N}$ with $\{\boldsymbol{\Phi}^{(l)}\}_{l=1}^{L}$ and $\{\boldsymbol{\Pi}^{(l)}\}_{l=1}^{L}$ , the latent variables of the proposed model can be trained with a backward-upward–forward-downward (BUFD) Gibbs sampler described in Guo et al. [40], based on a variety of variable augmentation techniques. However, the Gibbs sampler needs to process all HRRP data samples in each iteration and hence has limited scalability. For scalable inference, we adopt the topic-layer-adaptive stochastic gradient Riemannian (TLASGR) MCMC algorithm [47, 43], which is proposed to update simplex-constrained global parameters [48] in a mini-batch learning setting. Relying on the Fisher information matrix (FIM) [49] to automatically adjust relative learning rates for different parameters across all layers, TLASGR-MCMC has proven its improved sampling efficiency. Given the latent states, we first sample augmented latent counts, then use TLASGR-MCMC to sample generative model parameters $\{\boldsymbol{\Phi}^{(l)}\}_{l=1}^{L}$ and $\{\boldsymbol{\Pi}^{(l)}\}_{l=1}^{L}$ . To be more specific, we can sample $\boldsymbol{\pi}_{k}^{(l)}$ , the $k$ th column of the transition matrix $\boldsymbol{\Pi}^{(l)}$ , as

	$\left({\boldsymbol{\pi}_{k}^{(l)}}\right)_{i+1}$	$\displaystyle=\leavevmode\resizebox{375.80542pt}{}{$\!\bigg{[}\!\left({\boldsymbol{\pi}_{k}^{(l)}}\!\right)_{i}\!+\!\frac{\varepsilon_{i}}{M_{k}^{(l)}}\!\left[\left(\rho\tilde{}\boldsymbol{z}_{:k\boldsymbol{\cdot}}^{(l)}\!+\!\boldsymbol{\nu}^{(l)}\right)\!-\!\left(\rho\tilde{z}_{\boldsymbol{\cdot}k\boldsymbol{\cdot}}^{(l)}\!+\!\nu_{\boldsymbol{\cdot}}^{(l)}\!\right)\!\left({\boldsymbol{\pi}_{k}^{(l)}}\!\right)_{i}\right]$}$
		$\displaystyle+\!\leavevmode\resizebox{328.82706pt}{}{$\mathcal{N}\!\left(0,\!\frac{2\varepsilon_{i}}{M_{k}^{(l)}}\!\left[\mbox{diag}(\boldsymbol{\pi}_{k}^{(l)})_{i}\!-\!(\boldsymbol{\pi}_{k}^{(l)})_{i}(\boldsymbol{\pi}_{k}^{(l)})_{i}^{T}\!\right]\!\right)\!\bigg{]}_{\angle}$},$		(17)

where ${\left[.\right]_{\angle}}$ denotes the simplex constraint that $\boldsymbol{\pi}_{k_{1},k}^{(l)}\geqslant 0$ and $\sum_{k_{1}=1}^{K_{l}}\boldsymbol{\pi}_{k_{1},k}^{(l)}=1$ , $M_{k}^{(l)}$ is calculated using the estimated FIM, $\varepsilon_{i}$ denotes the learning rate at the $i$ th iteration, both ${\tilde{}\boldsymbol{z}_{:k\boldsymbol{\cdot}}^{\left(l\right)}}$ and ${\tilde{z}_{\boldsymbol{\cdot}k\boldsymbol{\cdot}}^{\left(l\right)}}$ come from the augmented latent counts ${\bf Z}^{(l)}$ , and ${\boldsymbol{\nu}^{\left(l\right)}}$ denotes the prior of ${\boldsymbol{\pi}_{k}^{\left(l\right)}}$ . More details of TLASGR-MCMC can be found in Cong et al. [47] and Guo et al. [40].

Despite the attractive properties, both the proposed Gibbs sampler and TLASGR-MCMC usually rely on an iterative procedure to learn the temporal latent states of a new HRRP sample at the testing stage, which hinders real-time processing of the HRRP based RATR. To allow fast out-of-sample prediction, we further build an inference network to learn the latent states, as described below.

Recurrent variational inference model: Our inference model is motivated by variational auto-encoders (VAEs) [42]. As an example of a directed graphical model, the joint distribution over the observed variables $\boldsymbol{x}$ and latent variables $\boldsymbol{z}$ can be defined as $p(\boldsymbol{x},\boldsymbol{z})=p(\boldsymbol{x}\,|\,\boldsymbol{z})p(\boldsymbol{z})$ , where $p(\boldsymbol{z})$ is the prior placed on the latent variables. To admit efficient inference, VAEs approximate $p(\boldsymbol{z}\,|\,\boldsymbol{x})$ with a variational family of distributions $q(\boldsymbol{z}\,|\,\boldsymbol{x})$ , which adopts an inference network to map the observations directly to their latent space by optimizing the evidence lower bound (ELBO) [33][50][51], expressed as

L=\mathbb{E}_{q(\boldsymbol{z}\,|\,\boldsymbol{x})}[\ln p(\boldsymbol{x}\,|\,\boldsymbol{z})]-\mathbb{E}_{q(\boldsymbol{z}\,|\,\boldsymbol{x})}[\ln[q(\boldsymbol{z}\,|\,\boldsymbol{x})/p(\boldsymbol{z})]].

(18)

The approximate posterior $q(\boldsymbol{z}\,|\,\boldsymbol{x})$ is often taken to be a Gaussian distribution as $N(\boldsymbol{u},diag(\boldsymbol{\sigma}))$ , where the mean $\boldsymbol{u}$ and standard deviation $\boldsymbol{\sigma}$ are the output of highly non-linear functions of the input $\boldsymbol{x}$ . Given the Gaussian variational posterior, the second term of the ELBO in (18) is analytic. Using the reparameterization trick [42], VAEs sample $\boldsymbol{z}$ with $\boldsymbol{z}=\boldsymbol{u}+\boldsymbol{\sigma}\odot\boldsymbol{\epsilon}$ , where $\boldsymbol{\epsilon}$ is a vector of standard Gaussian variables. Therefore, the gradient of the first term of the ELBO with respect to the parameters of the inference network can be constructed as

\displaystyle\small\nabla E_{q(\boldsymbol{z}\,|\,\boldsymbol{x})}[\ln p(\boldsymbol{x}\,|\,\boldsymbol{z})]=E_{q(\boldsymbol{\epsilon})}[\nabla\ln p(\boldsymbol{x}\,|\,\boldsymbol{z}=\boldsymbol{u}+\boldsymbol{\sigma}\odot\boldsymbol{\epsilon})],

(19)

whose estimation via Monte Carlo integration has low variance. The parameters of the inference model can be typically optimized via SGD.

VAEs provide an effective modeling paradigm for complex data distributions. However, their success so far is mostly restricted to non-sequential data with Gaussian distributed latent variables and does not generalize well to model non-negative and sequential HRRP representations. In this section, we propose a recurrent variational inference method to efficiently produce the multilayer temporal HRRP representations with (7) and (8) or (16) as the generative model. Given the global parameters $\{\boldsymbol{\Pi}^{(l)},\boldsymbol{\Phi}^{(l)}\}_{l=1}^{L}$ , the task here is to infer the latent states $\{\boldsymbol{s}_{t,n}^{(l)}\}_{t=1,l=1,n=1}^{T,L,N}$ via an inference network. We first introduce a fully factorized distribution as

\displaystyle q\left(\left\{\boldsymbol{s}_{t,n}^{(l)}\right\}_{t=1,l=1,n=1}^{T,L,N}\right)=\prod\limits_{n=1}^{N}{\prod\limits_{l=1}^{L}{\prod\limits_{t=1}^{T}{q\left(\boldsymbol{s}_{t,n}^{(l)}\right)}}}.

(20)

With (III-B) and (20), the objective function becomes

	$\displaystyle L\left(\left\{q\left({\boldsymbol{s}_{t,n}^{(l)}}\right)\right\}_{l=1,t=1,n=1}^{L,T,N};\boldsymbol{x}_{1:N},\left\{\boldsymbol{\Pi}^{(l)},\boldsymbol{\Phi}^{(l)}\right\}_{l=1}^{L}\right)$
	$\displaystyle=\sum\limits_{n=1}^{N}{\sum\limits_{t=1}^{T}{{\mathbb{E}_{q\left({\boldsymbol{s}_{t,n}^{(1)}}\right)}}\left[{\ln p\left({\boldsymbol{x}_{t,n}\,\|\,{\boldsymbol{\Phi}^{(1)}},\boldsymbol{s}_{t,n}^{(1)}}\right)}\right]}}$		(21)
	$\displaystyle\!-\!\sum\limits_{n=1}^{N}\!{\sum\limits_{l=1}^{L}\!{\sum\limits_{t=1}^{T}{{\mathbb{E}_{q\left({\boldsymbol{s}_{t,n}^{(l)}}\right)}}\!\left[{\ln\frac{\!{q\!\left({\boldsymbol{s}_{t,n}^{(l)}}\!\right)\!}}{p\left({\boldsymbol{s}_{t,n}^{(l)}\,\|\,\boldsymbol{a}_{t,n}^{\left(l\right)},1/{b^{\left(l\right)}}}\right)}}\right]}}},$

where $\boldsymbol{a}_{t,n}^{\left(l\right)}:=b^{(l)}\left({{\boldsymbol{\Phi}^{(l+1)}}\boldsymbol{s}_{t,n}^{(l+1)}+{\boldsymbol{\Pi}^{(l)}}\boldsymbol{s}_{t{\rm{-}}1,n}^{(l)}}\right)$ denotes the shape parameter of $\boldsymbol{s}_{t,n}^{(l)}$ . Note under the BUFD Gibbs sampler [40], the conditional posterior of $\boldsymbol{s}_{t,n}^{(l)}$ given augmented latent counts is gamma distributed, and hence it might be more appropriate to use the gamma rather than Gaussian based distributions to construct the variational distribution $q\left({\boldsymbol{s}_{t,n}^{(l)}}\right)$ . However, the gamma random variable is not reparameterizable with respect to its shape parameter. This motivates us to choose a surrogate distribution, which can be easily reparameterized, to approximate the gamma distribution. The Weibull distribution is a desirable choice for this purpose, as its probability density function resembles that of the gamma distribution, and the second term in the ELBO shown in (III-C) becomes analytic if it is used to construct $q\left({\boldsymbol{s}_{t,n}^{(l)}}\right)$ [43].

We may directly follow Zhang et al. [43] to construct a Weibull distribution based inference network as

\displaystyle q\left({\boldsymbol{s}_{t,n}^{(l)}}\right)\sim\mbox{Weibull}\left({{\boldsymbol{k}_{t,n}^{(l)}},{\boldsymbol{\lambda}_{t,n}^{(l)}}}\right),

(22)

whose parameters $\boldsymbol{k}_{t,n}^{(l)}$ and $\boldsymbol{\lambda}_{t,n}^{(l)}$ are both deterministically transformed from the hidden unit $\boldsymbol{h}_{t,n}$ and specified as

	$\displaystyle\boldsymbol{k}_{t,n}^{(l)}=\ln[1+\exp({\bf W}_{hk}^{(l)}\boldsymbol{h}_{t,n}^{(l)}+\boldsymbol{b}_{1}^{(l)})],$		(23)
	$\displaystyle\boldsymbol{\lambda}_{t,n}^{(l)}=\ln[1+\exp({\bf W}_{h\lambda}^{(l)}\boldsymbol{h}_{t,n}^{(l)}+\boldsymbol{b}_{2}^{(l)})],$		(24)

where $\boldsymbol{h}_{t,n}^{(l)}$ denotes the output of highly non-linear function of the observed $\boldsymbol{x}_{t,n}$ . However, this construction does not take into consideration the temporal information transmitted from the previous time step. To exploit the temporal information, we propose a recurrent inference network which induces temporal dependencies across time steps, as illustrated in Fig. 3 (c). Therefore, similar to the RNN in (5), we define $\boldsymbol{h}_{t,n}^{(l)}$ as

\small\boldsymbol{h}_{t,n}^{(l)}=\left\{\begin{array}[]{ll}{\!\!\textrm{tanh}\left({\bf W}_{xh}^{(l)}\boldsymbol{x}_{t,n}+{\bf W}_{hh}^{(l)}\boldsymbol{h}_{t-1,n}^{(l)}+\boldsymbol{b}_{3}^{(l)}\right)},&{\!\!\!\!\!\text{ if }l=1},\\ \!\!\textrm{tanh}\left({\bf W}_{xh}^{(l)}\boldsymbol{h}_{t,n}^{(l-1)}+{\bf W}_{hh}^{(l)}\boldsymbol{h}_{t-1,n}^{(l)}+\boldsymbol{b}_{3}^{(l)}\right),&{\!\!\!\!\!\text{ if }1<l\leq L},\\ \end{array}\right.

(25)

where at layer $l$ , ${\bf W}_{xh}^{(l)}\in\mathbb{R}^{K_{l}\times K_{l-1}}$ denotes the upward weight matrix, ${\bf W}_{hh}^{(l)}\in\mathbb{R}^{K_{l}\times K_{l}}$ the forward weight matrix connecting the hidden states, and $\boldsymbol{b}_{3}^{(l)}\in\mathbb{R}^{K_{l}}$ the bias vector. Therefore, the variational parameters of $\boldsymbol{s}_{t,n}^{(l)}$ are both non-linearly transformed with neural networks from the hidden state $\boldsymbol{h}_{t,n}^{(l-1)}$ of layer $l-1$ at time $t$ and the hidden state $\boldsymbol{h}_{t-1,n}^{(l)}$ of layer $l$ at time $t-1$ , which are helpful for the inference network to take into account the temporal structure of the observed data.

While the sequential latent states of the inference model are similar to that of an RNN, the internal transition structure of an RNN is in general entirely deterministic. By contrast, the proposed model introduces uncertainty into the latent space to help better model the variability observed in highly structured HRRP data. One can sample the Weibull distributed latent state in (22) using the reparameterization trick as

\displaystyle\small\boldsymbol{s}_{t,n}^{(l)}

\displaystyle\small={\boldsymbol{\lambda}_{t,n}^{(l)}}\left(-\ln(1-{\boldsymbol{\epsilon}_{t,n}^{(l)}})\right)^{1/{\boldsymbol{k}_{t,n}^{(l)}}},~{}{\boldsymbol{\epsilon}_{t,n}^{(l)}}\sim\mbox{Uniform}(0,1).

(26)

For standard VAE, the generative model parameters and the corresponding inference network parameters can be typically jointly optimized via SGD, seeking to maximize the ELBO in (18) with standard backpropagation technique. Instead of finding a point estimate of the global parameters of generative model like in VAEs, we adopt a hybrid MCMC/VAE inference algorithm by combining TLASGR-MCMC and the proposed recurrent variational inference network. In specific, the generative model parameters $\{\boldsymbol{\Pi}^{(l)},\boldsymbol{\Phi}^{(l)}\}_{1,L}$ can be sampled with TLASGR-MCMC in (III-C) and the neural network parameters $\boldsymbol{\Omega}=\{{\bf W}_{xh}^{(l)},{\bf W}_{hh}^{(l)},{\bf W}_{hk}^{(l)},{\bf W}_{h\lambda}^{(l)},\boldsymbol{b}_{1}^{(l)},\boldsymbol{b}_{2}^{(l)},\boldsymbol{b}_{3}^{(l)}\}_{1,L}$ can be updated via SGD by maximizing the ELBO in (III-C). Applying the reparameterization trick of the Weibull distribution in (26), the gradient of the ELBO with respect to $\boldsymbol{\Omega}$ can be evaluated with low variance. In practice, a single Monte Carlo sample from $q\left(\boldsymbol{s}_{t,n}^{(l)}\right)$ is enough to obtain satisfactory performance. The proposed inference network is depicted in Fig. 3 (c) and the training strategy is outlined in Algorithm 1. For a new HRRP sample at the testing stage, given the generative model parameters and inferred network parameters $\boldsymbol{\Omega}$ , we can directly obtain the conditional posteriors of latent states using the inference network, without performing iterations.

IV Supervised Variational Temporal Deep Generative Model

The rGBN discussed above is an unsupervised model, which can infer the hierarchical latent states from HRRP samples under the condition of no class information. Although the latent states can be used together with a downstream classifier to make label predictions, it is often beneficial to learn a joint model that considers both the HRRP samples and corresponding labels to discover more discriminative representations. Therefore, we further develop a supervised rGBN (s-rGBN), providing multilayer latent representations that are good for both HRRP generation and classification.

Denote the $n$ th sequential HRRP sample as a pair $\{\boldsymbol{x}_{n},y_{n}\}$ , where $y_{n}\in\{1,2,...,C\}$ is the ground truth label of input $\boldsymbol{x}_{n}$ and $C$ the total number of target classes. Assume that the HRRP label is generated from a categorical distribution $p(y_{n}\,|\,\boldsymbol{x}_{n},\boldsymbol{\Omega})\sim\mbox{Categorical}(p_{1n},...,p_{Cn})$ , written as

\displaystyle p(y_{n}\,|\,\boldsymbol{x}_{n},\boldsymbol{\Omega})=\prod\limits_{c=1}^{C}{p_{cn}}^{\textrm{I}\{y_{n}=c\}},

(27)

where $p_{cn}$ is the probability of the current input $\boldsymbol{x}_{n}$ classified to label $c$ , $\sum_{c=1}^{C}p_{cn}=1$ , and $\textrm{I}\{y_{n}=c\}$ is an indicator function which is equal to one if $y_{n}=c$ and zero otherwise.

Since the introduced model is able to mine the deep hierarchical structure from the HRRP data, where the weight matrices at different layers reveal different levels of abstraction, we combine the latent states across all hidden layers to define $p_{cn}$ . Note the sequential inputs from one HRRP sample share the same label, making its modeling different from a conventional sequence-to-sequence task addressed by RNNs. We concatenate the latent state $\boldsymbol{s}_{t,n}^{(l)}$ in (26) across all hidden layers and time steps to construct a latent feature vector of dimension $T\sum_{l=1}^{L}{{K_{l}}}$ , denoted as

\displaystyle{\bf S}_{n}=\left[(\boldsymbol{s}_{1,n}^{(1)},...,\boldsymbol{s}_{T,n}^{(1)}),...,(\boldsymbol{s}_{1,n}^{(L)},...,\boldsymbol{s}_{T,n}^{(L)})\right].

(28)

Given ${\bf S}_{n}$ , the label probability vector $\boldsymbol{p}_{n}=(p_{1n},...,p_{Cn})$ is calculated with the softmax function as

\displaystyle\small\boldsymbol{p}_{n}=\left[\frac{e^{\boldsymbol{w}_{1}{\bf S}_{n}}}{\sum_{i=1}^{C}e^{\boldsymbol{w}_{i}{\bf S}_{n}}},...,\frac{e^{\boldsymbol{w}_{c}{\bf S}_{n}}}{\sum_{i=1}^{C}e^{\boldsymbol{w}_{i}{\bf S}_{n}}},...,\frac{e^{\boldsymbol{w}_{C}{\bf S}_{n}}}{\sum_{i=1}^{C}e^{\boldsymbol{w}_{i}{\bf S}_{n}}}\right],

(29)

where $\boldsymbol{w}_{c}$ denotes the $c$ th row of the learnable weight matrix ${\bf W}_{sy}$ . Since ${{\bf S}_{n}}$ is the concatenation of the latent states projected from $\boldsymbol{x}_{n}$ using (22), the label likelihood (27) can be rewritten as $\ p(y_{n}\,|\,{\bf W}_{sy},{\bf S}_{n})$ . The generative model for both the observed HRRP samples and labels can be displayed in Fig. 3(d).

Given the generative process, our proposed s-rGBN can be trained by maximizing the ELBO of the joint likelihood of the HRRP samples and labels, expressed as


		$\displaystyle~{}~{}~{}~{}\leavevmode\resizebox{446.2658pt}{}{$-\sum\limits_{n=1}^{N}\!{\sum\limits_{l=1}^{L}\!{\sum\limits_{t=1}^{T}{{E_{q\left({\boldsymbol{s}_{t,n}^{(l)}}\right)}}\!\left[{\ln\frac{\!{q\!\left({\boldsymbol{s}_{t,n}^{(l)}}\!\right)\!}}{p\left({\boldsymbol{s}_{t,n}^{(l)}\,\|\,\boldsymbol{a}_{t,n}^{\left(l\right)},1/{b^{\left(l\right)}}}\right)}+\ln p(y_{n}\,\|\,{\bf W}_{sy},{\bf S}_{n})}\right]}}}$},$		(30)

where the inference network for the latent states is the same as that of rGBN shown in Fig. 3 (c). We also sample $\{\boldsymbol{\Phi}^{(l)},\boldsymbol{\Pi}^{(l)}\}_{1,L}$ with TLASGR-MCMC, and update the neural network parameters $\boldsymbol{\Omega}$ and classifier weight matrix ${\bf W}_{sy}$ via SGD by maximizing the ELBO in (IV), using the Adam optimizer [52]. Moreover, based on the concatenation in (28), we notice that the supervised network enforces direct supervision for the latent states across all time steps and layers, improving the discriminativeness and robustness of the extracted latent states [53]. At the test stage, given the learned inference network and classifier, s-rGBN can be very efficient on predicting the label of an HRRP sample $\boldsymbol{x}_{m}$ , expressed as

\displaystyle\hat{y}_{m}=\mathop{\arg\max}\limits_{c\in C}p\left(y_{m}=c\,|\,\boldsymbol{w}_{c},{\bf S}_{m}\right),

(31)

where ${\bf S}_{m}$ is the combination of $\boldsymbol{s}_{t,m}^{(l)}$ , which can be sampled with the reparameterization trick shown in (26).

Algorithm 1 Hybrid stochastic-gradient MCMC and recurrent variational inference for rGBN.

Set mini-batch size

M

, the number of layer

L

, the width of layer

K_{l}

and hyperparameters.

Initialize inference model parameters

\boldsymbol{\Omega}

, generative model parameters

\{\boldsymbol{\Pi}^{(l)},\boldsymbol{\Phi}^{(l)}\}_{1,L}

for

iter=1,2,\cdots

Randomly select a mini-batch of

m

HRRP samples to form a subset

{\bf X}=\{\boldsymbol{x}_{m}\}_{1,M}

;Draw random noise

\{{{\boldsymbol{\epsilon}_{t,m}^{(l)}}}\}_{t=1,m=1,l=1}^{T,M,L}

from uniform distribution (26); Sample latent states

\{\boldsymbol{s}_{t,m}^{\left(l\right)}\}_{t=1,m=1,l=1}^{T,M,L}

from (26);Compute subgradient

g={\nabla_{\boldsymbol{\Omega}}}L

according to (III-C), and update

\boldsymbol{\Omega}

using subgradient

g

;

for

l=1,2,\cdots,L

and

k=1,2,\cdots,K_{l}

Update

M_{k}^{(l)}

according to [47]; then

\boldsymbol{\pi}_{k}^{(l)}

with (III-C);Update

\boldsymbol{\phi}_{k}^{(l)}

similar with (III-C);

end for

end forReturn global parameters

\{\boldsymbol{\Omega},\{\boldsymbol{\Pi}^{(l)},\boldsymbol{\Phi}^{(l)}\}_{l=1}^{L}\}

V Experimental Results

V-A Results with Synthetic data

To illustrate the proposed model and compare it to existing methods, here we consider several three-dimensional synthetic datasets. Each toy data discussed below can be denoted as $\boldsymbol{x}\in\mathbb{R}^{V\times T}$ with $V=3,T=100$ .

$\bullet$ Toy data 1: To verify whether our proposed model can learn the transition matrix accurately, we generate the data with rGBN-PRG, where we assume the transition matrix as $\boldsymbol{\Pi}=\begin{bmatrix}0.65&0.20\\ 0.35&0.80\\ \end{bmatrix}$ , initial latent state as $\boldsymbol{s}_{1}\!=\!\begin{bmatrix}100&0\\ \end{bmatrix}^{\mathrm{T}}$ , and the Dirichlet distributed weight matrix as $\boldsymbol{\Phi}\in\mathbb{R}^{3\times 2}$ . The latent states can be generated with $\boldsymbol{s}_{t}\sim\mbox{Gam}\left(\boldsymbol{\Pi}\boldsymbol{s}_{t-1},1\right)$ , where $\boldsymbol{s}_{t}\in\mathbb{R}^{2\times 1},T=100$ . The non-negative observed data can be sampled from $\boldsymbol{r}_{t}\sim\mbox{Pois}\left(\boldsymbol{\Phi}\boldsymbol{s}_{t}\right),\boldsymbol{x}_{t}\sim\mbox{Gam}\left(\boldsymbol{r}_{t},1\right)$ , where $\boldsymbol{x}_{t}\in\mathbb{R}^{3\times 1}$ .

$\bullet$ Toy data 2: $x_{1,t}=t$ , $x_{2,t}=2\exp(-t/15)+\exp(-((t-25)/10)^{2})$ , and $x_{3,t}=5\sin(t^{2})+6$ for $t=1,\ldots,100$ .

$\bullet$ Toy data 3: $x_{1,t}=t$ , $x_{2,t}=2\mbox{mod}(t,3)$ , $x_{3,t}=20\exp(-t/3)$ for $t=1,\ldots,50$ , and $x_{1,t}=2t+30$ , $x_{2,t}=3\mbox{mod}(t,2)+5$ , and $x_{3,t}=30t\exp(-t)+10$ for $t=51,\ldots,100$ , where $\mbox{mod}(t,k)$ denotes the modulo operation that returns the remainder after division of $t$ by $k$ .

Following previous work [54, 39, 55], we choose the particular expressions in Toy 2 and Toy 3 to check if our proposed model can capture such a complex temporal structure. We set the number of latent states as $K=2$ , and compare the proposed single layer rGBN-PRG, LDS, and HMM with both mean squared error (MSE) and prediction MSE (PMSE). MSE is measured between the estimated value and the ground truth for all the models at $t=1:T-1$ , which is observed at the training stage; PMSE is measured between the predicted value and ground truth at the last time step, which is unobserved at the training stage. Table I illustrates the performance of different methods evaluated based on the simulated data. It is clear that the proposed model provides satisfactory performance in both fitting and prediction for all datasets, which shows the benefits of capturing complex temporal structure in the latent space. For Toy 1, both rGBN and HMM are capable of discovering a transition matrix, $e.g.$ , $\hat{}\boldsymbol{\Pi}=\begin{bmatrix}0.712&0.164\\ 0.288&0.836\\ \end{bmatrix}$ in rGBN and $\hat{}\boldsymbol{\Pi}=\begin{bmatrix}0.924&0.018\\ 0.076&0.982\\ \end{bmatrix}$ in HMM, so the estimated transition matrix of the proposed model is closer to the ground truth than that of HMM. For Toy 2, while LDS obtains lower MSE than the proposed model does, its PMSE is much worse, suggesting LDS is prone to overfitting on the training data. For Toy 3, we can find that features of dimensions 2 and 3 have very complex temporal relationships. HMM performs worse in Toy 3, possibly because of the complicated structure in Toy 3 makes HMM difficult to model.

TABLE I: Results on Synthetic Data.

Data	Measure	Our model	HMM	LDS
Toy1	MSE	15.06	20.08	21.48
Toy1	PMSE	4.47	9.83	17.65
Toy2	MSE	2.11	27.59	1.21
Toy2	PMSE	2.47	85.72	7.08
Toy3	MSE	2.11	53.98	2.28
Toy3	PMSE	2.51	250.69	3.92

V-B Measured HRRP data

A widely used HRRP dataset [15, 16, 20, 31], consisting of the measurements from three real airplanes, is adopted to evaluate the performance of the proposed method. Yak-42 is a large and medium-sized jet aircraft, Cessna Citation is a small-sized jet aircraft, and An-26 is a medium-sized propeller aircraft. The detailed parameters of the targets and measurement radar are presented in Table II. We display the projections of the target trajectories onto the ground plane in Fig. 4, where the measured data can be segmented into several parts. Following the experiment settings in previous studies [15, 20, 31, 56], the measured HRRP data used in our experiments has been divided into training and testing data. There are two preconditions for selecting the training and test data. For one thing, the training data should contain almost all of the target-aspect angles of the test data. For another, the elevation angles of the training data should be different from those of the test data to verify the generalization ability of the proposed model. Therefore, we take the second and fifth segments of Yak-42, the sixth and seventh segments of Cessna Citation, and the fifth and sixth segments of An-26 for training, and take the other segments for testing. There are in total 7800 HRRP samples for training, corresponding to 2600 profiles for each of the three classes, and 5200 HRRP samples for testing. The number of target classes is $C=3$ .

V-C Set up

For the networks used in this paper, the elements of all weight matrices are initialized with Gaussian distributions whose standard deviations are set to $0.1$ , and all bias terms are initialized to $0$ . For the proposed models, we set the mini-batch size $M$ as $100$ . The Adam optimizer [52] with learning rate ${10}^{-4}$ is utilized for optimization. Here we only choose a linear SVM (LSVM) [57] for classifying the representations extracted with unsupervised models, in order to highlight the discriminability of the learned features. Real 256-dimensional HRRP vectors in (2) are utilized to train the non-dynamical models. We set $[K_{1},K_{2},K_{3}]=[30,20,10]$ for the proposed methods. All experiments are implemented in non-optimized software written in Python, on a Pentium PC with 3.7-GHz CPU and 64 GB RAM. Unless specified otherwise, this setting will be adopted across all experiments. For Bayesian generative model, to ensure the corresponding random variables drawn from non-informative priors and all the posteriors leaned from data, we directly set $\{\eta^{(l)}=0.1,\nu^{(l)}=0.1,b^{(l)}=1,c=200\}$ without tuning exhaustively. Besides, we set scaling parameter $\mu=100$ , and manipulate the training and test data by a window function with window size $V=30$ and overlap $O=15$ , leaving $T=16$ for each HRRP sequence, since they have the corresponding reasonable range as a priori knowledge. Below, instead of exhaustively optimizing these parameters, we demonstrate the influence of each parameter on model performance.

TABLE II: Parameters of radar and planes in the ISAR experiment.

Radar parameters	Center Frequency	5520MHZ
Radar parameters	bandwidth	400MHZ
Planes	Length(m)	Width(m)	Height(m)
Yak-42	36.38	34.88	9.83
Cessna Citation	14.40	15.90	4.57
An-26	23.80	29.20	8.58

V-D Influence of Model Parameters

As shown in previous studies [15, 31], it is important to analyze how the recognition performance is influenced by the model parameters, which for the proposed models include the scaling parameter $\mu$ , window size $V$ , and overlap $O$ .

Scaling parameter $\mu$ : As discussed in section III, the sequential HRRP sample can be modeled with the PRG distribution in (8), where we refer our models as rGBN-PRG and s-rGBN-PRG. Considering the PRG link may be time-consuming due to its iterative procedure to infer latent counts, we discretize the HRRP sequences to produce counts ( $i.e.$ , non-negative integers) with the scaling parameter $\mu$ . The input counts can be modeled with the Poisson distribution in (16), where we refer our models as rGBN-Poisson and s-rGBN-Poisson. Fig. 5 shows the recognition performances of our proposed models varying with $\mu$ . Clearly, a large value of $\mu$ will lead to overfitting and a small one will drop too much detailed information of the HRRP data, resulting in underfitting. For our proposed models, although using the PRG link generally perform better than using the Poisson link, we can achieve a compromise between the performance and computation with the Poisson link.

Window size $V$ : Fig. 6 shows the variation of the classification accuracies of dynamical models with the value of window size $V\in[10,50]$ , where the overlap length is fixed as $O=V/2$ . As $V$ decreases, we get the sequence with a higher computational and memory burden for all temporal models, thus increasing the inference difficulty and degrading their performance. On the contrary, a large $V$ allows more information of the target to be contained in each subsequence $\boldsymbol{x}_{t,n}$ , but a too-large one may result in the loss of sequential information. Fig. 6 shows that compared with RNN, our proposed models, which provide a stochastic generalization of the deterministic RNN by adding uncertainty into the latent space via a deep generative model, are more robust to the window length.

Overlap $O$ : After fixing the window size as $V=30$ , we compare the performance of different dynamical methods as a function of the overlap length $O$ , which varies from 5 to 25 range cells. Because the redundancy of the segments is determined by the overlap length across the windows, a lower correlation between adjacent inputs is encouraged as parameter $O$ decreases, which will weaken the sequential relationship between time steps. Instead, an improperly large one will raise the length of sequence fast, imposing higher computational and memory burden and reducing the performance. Besides, the proposed models are more robust to the overlap length compared with RNN.

V-E Recognition Performance

TABLE III: Comparison of Recognition Accuracy

Non-dynamical Models	Size	Accuracy
MCC [10]	-	59.00
AGC [56]	-	85.20
LSVM [57]	-	87.14
LDA [58]	2	82.16
PCA [19]	200	83.81
VAE [42]	200	87.84
DBN [59]	200	88.28
	200-100	88.51
	200-100-50	89.16
Dynamical Models	Size	Accuracy
LDS [26]	30	87.65
HMM [17]	30	87.24
TARAN [31]	30	90.10
TCNN [28]	1 $\times$ 9 (32-32-64)	92.57
RNN [30]	30	88.99
	30-20	89.77
	30-20-10	91.64
rGBN-Poisson	30	88.36
	30-20	89.22
	30-20-10	90.23
rGBN-PRG	30	88.66
	30-20	89.67
	30-20-10	90.58
s-rGBN-Poisson	30	90.47
	30-20	91.87
	30-20-10	92.91
s-rGBN-PRG	30	91.02
	30-20	92.52
	30-20-10	93.54

Similar to Feng et al. [15], we evaluate the proposed models (rGBN and s-rGBN) against several commonly used recognition methods for HRRP, such as maximum correlation coefficient (MCC) [10], adaptive Gaussian classifier (AGC) [56], and LSVM using the original HRRP dataset as input [57]. A variety of feature extraction methods, including linear discriminant analysis (LDA) [58], PCA [19], VAE [42], LDS [26], and HMM [17], are also included for comparison. To demonstrate the advantages of constructing deep dynamical systems, we further consider RNN [30] and deep belief network (DBN) [59], where DBN can be viewed as a stack of restricted Boltzmann machines (RBMs) for modeling the binary hidden units in the lower layers. In addition to the basic RNN, we also compare the results of the proposed method with target-aware recurrent attentional network (TARAN) [31]. TARAN first utilizes RNN to explore the sequential relationship between the range cells within an HRRP sample, then employs an attention mechanism to weight up each hidden state and discover the target area. We further compare our model with temporal one-dimensional convolution neural network (TCNN) [28], where the convolution operation only takes place along the range dimension. Note TARAN [31] presents the recognition result of the time sequential HRRP only with one hidden layer (hidden dimension is $30$ ), achieving the performance of 90.10%, and TCNN [28] achieves a recognition accuracy of 92.57% with the structure of 32-32-64 (kernel size is $1\times 9$ in each layer). For dynamical models, we set the number of hidden states of LDS and HMM as $30$ , and set that of the RNN the same as that of the proposed models. The feature dimension of both PCA and VAE is set as 200, while that of DBN as $[K_{1},K_{2},K_{3}]=[200,100,50]$ , which all belong to non-dynamical feature extraction models. According to the literature [15], the hidden dimension of LDA can be set as $C-1=2$ . The extracted features from the training set with different unsupervised models are utilized to train the LSVM classifier, where the regularization parameter is five-fold cross-validated on the training set. The softmax function shown in (31) is adopted to predict the labels of the testing samples for supervised models, including s-rGBN and RNN. Summarized in Table III are the recognition results of various methods on the HRRP dataset.

The shallow dynamical models including LDS, HMM, and TARAN, which take into account the temporal information in HRRP, already clearly outperform other traditional models for HRRP RATR including AGC, MCC, LSVM, LDA, and PCA. Besides, the deep models tend to have better performance on classifying HRRP samples than shallow architectures do, for providing more discriminative hierarchical non-linear features. As for DBN, although it builds a deep generative model, we find that its performance is inferior to our proposed deep dynamical models, which is not surprising as the former ignores the temporal dependence in HRRP samples. Compared with TCNN, RNN, and TARAN, which build the deterministic mapping and find point estimates for the global parameters, s-rGBN-PRG achieves better accuracy, proving the efficiency of the probabilistic framework and the hybrid Bayesian inference algorithm.

Note that our proposed s-rGBN-PRG and s-rGBN-Poisson perform better than the unsupervised rGBN-PRG and rGBN-Poisson, which verifies that introducing the label information into proposed models certainly benefits the recognition performance. Though our proposed unsupervised rGBN schemes underperform supervised models, such as RNN at layer 3 and TCNN, they still can learn hierarchical latent states from HRRP data and obtain acceptable recognition rates in the absence of label information.

TABLE IV: The confusion matrices of the proposed s-rGBN at different layers, with the network architecture set as 30-20-10.

Methods	s-rGBN-layer1			s-rGBN-layer2			s-rGBN-layer3
	An-26	Cessna Citation	Yak-42	An-26	Cessna Citation	Yak-42	An-26	Cessna Citation	Yak-42
An-26	86.17	8.63	5.20	87.87	7.08	5.05	89.39	6.30	4.31
Cessna Citation	5.30	93.45	1.25	4.30	94.40	1.30	4.05	95.30	0.65
Yak-42	3.08	2.17	94.75	1.84	1.33	96.83	1.57	1.18	97.25
Average Recognition Rates	91.02			92.52			93.54

To investigate the performance of the methods on different targets, we list the confusion matrices and average recognition rates in Table IV. Each row of the confusion matrix should be the number of samples in a specific predicted class. For the sake of comparison, we set each row of the matrix as a predicted ratio rather than a number. Specifically, we set the confusion matrix of each method as $P\in\mathbb{R}_{+}^{C\times C}$ , and $P_{c_{1},c_{2}}={N_{c_{1},c_{2}}}/{N_{c_{1}}}$ , where $N_{c_{1}}$ denotes the number of HRRP samples for category $c_{1}$ , $N_{c_{1},c_{2}}$ denotes the the number of samples which are in an actual class $c_{1}$ but are classified into class $c_{2}$ , and $\sum_{c_{2}=1}^{C}P_{c_{1},c_{2}}=1$ . There is a clear trend that the classification performance for the HRRP samples based on s-rGBN is increasing with more layers, where the gains of the accuracy mainly come from the improvement at layer 2 for Yak-42 and Cessna Citation and that at layer 3 for An-26. The An-26 aircraft is a propeller aircraft whose waveform has larger fluctuation, and hence its feature may be much more overdispersed in comparison to these of Cessna Citation and Yak-42. s-rGBN, whose learned neurons become more general with the increase of the layers, maybe more robust to the fluctuation in higher layers and hence learn discriminative features to improve the accuracy.

V-F Qualitative Analysis

As discussed in Section III, in comparison to RNN that builds temporal non-linearity via “black-box” neural networks and other dynamical methods that only model single-layer temporal latent states, our proposed models can provide the desired interpretation, for being a deep Bayesian dynamical generative model. In particular, in addition to quantitative evaluations, we further visualize the learned neurons, reconstruction examples, latent space interpolations on the HRRP test set, and learned latent states, at each layer.

The structured neurons: To visualize the multilayer s-rGBN, it is straightforward to project the neuron $\boldsymbol{\phi}_{k}^{(l)}$ of layer $l$ to the bottom data layer. Specifically, we first choose a node at the top layer, with a large coefficient, then grow the tree downward to include any lower-layer neurons that are connected to the node with non-negligible weights. For $\boldsymbol{\phi}_{k}^{(l)}$ , we plot all its terms whose values are larger than 1% of the largest element of the corresponding $\boldsymbol{\phi}_{k}^{(l)}$ . As shown in Fig. 8, we find that the neurons at layer 1 are fundamental, similar to the original echo per range cell in HRRP. As layer $l$ increases, the neurons become increasingly more general or sparsely utilized, which consist of several echoes from different range cells in HRRP, covering longer-range temporal dependencies. To sum up, our proposed model can not only extract meaningful neurons at each layer but also capture the relationships between the neurons of different layers.

Reconstruction: In Fig. 9, we show the reconstruction examples for the three airplane targets at layers 1, 2, and 3. To reconstruct the HRRP samples, given the learned global parameters $\{\boldsymbol{\Omega},\{\boldsymbol{\Phi}^{(l)},\boldsymbol{\Pi}^{(l)}\}_{l=1}^{L}\}$ , we need to find the conditional posterior of latent state $\boldsymbol{s}_{t,n}^{(l)}$ , whose variational parameters can be directly transformed from the observed HRRP examples using the neural networks in (23), (24), and (25). Then we sample the latent states using the reparameterization trick in (26), where a single Monte Carlo sample from $q\left(\boldsymbol{s}_{t,n}^{(l)}\right)$ is enough to obtain satisfactory performance. We find that our model can not only retain the main structural information of the original test HRRP samples but also reconstruct the details of the targets in each sequential HRRP.

Latent space interpolations: One might want to investigate whether s-rGBN has indeed extracted the abstract representations of HRRP data. Following previous ideas [43, 60], we present the latent space interpolations on the HRRP test set examples. Given two HRRP sequences $\boldsymbol{x}_{1:T,1}$ and $\boldsymbol{x}_{1:T,2}$ , we project them into $\boldsymbol{s}_{1:T,1}^{(1:3)}$ and $\boldsymbol{s}_{1:T,2}^{(1:3)}$ , with the 3-layer model learned before. Using linear interpolation between $\boldsymbol{s}_{1:T,1}^{(l)}$ and $\boldsymbol{s}_{1:T,2}^{(l)}$ at layer $l$ , we can produce new sequences by passing the intermediate points through the dynamical generative model. In Fig. 10, we display the reconstruction results of $\boldsymbol{s}_{1:T,1}^{(3)}$ and $\boldsymbol{s}_{1:T,2}^{(3)}$ in (a) and (f), respectively, and the generated sequences from the linearly interpolated $\boldsymbol{s}$ -values in (b)-(e). The proposed model can generate realistic and interpretable HRRP sequences for all interpolated $\boldsymbol{s}$ -values. In other words, the inferred latent space of the model is on a manifold, indicating our proposed model has learned a generalizable latent state representation instead of concentrating its probability mass around the HRRP training samples.

Latent states: To compare the latent states learned from different methods, we visualize high-dimensional features by mapping them to the two-dimensional subspace with $t$ -distributed stochastic neighbor embedding ( $t$ -SNE). It is a non-linear dimensionality reduction technique well-suited for embedding high-dimensional data into a low-dimensional space [61]. As shown in Fig. 11, each dot represents an HRRP sample and each color-shape pair denotes a category. We visually illustrate that the features learned by the proposed supervised method are more discriminative than those by the unsupervised method and RNN. The phenomenon proves the benefits of learning a supervised hierarchical probabilistic model that considers both target and label generation.

V-G Robustness to Training Data Size

It is worth pointing that a practical RATR system should provide an acceptable recognition rate even with a few training samples [15, 26], such as in the non-cooperative circumstance. In Fig. 12, we depict how the recognition performance of each method varies with the size of the training dataset. With a relatively small training dataset, the accuracy of deterministic RNN drops sharply; a possible explanation for this phenomenon is that a point estimate by SGD ignores model uncertainty and the kind of variability observed in highly structured data is poorly modeled by the output probabilistic model alone. The parameters in the generative model of VAE (the weights) are updated by SGD towards a point estimate, leading to an obvious decrease in test accuracy. The DBN is robust to the small training samples whose weights are updated with Gibbs sampling, but it does not efficiently incorporate both between-layer and temporal dependencies of the observed HRRP samples, leading to lower accuracy than that of the proposed methods. By contrast, by exploiting temporal correlations, the proposed models achieve higher accuracy and maintain acceptable performance even when the training data size becomes much smaller. We attribute that to the following three advantages: 1) building the novel dynamical probabilistic model to describe the complicated sequential HRRP samples; 2) developing the hybrid stochastic gradient MCMC and recurrent variational inference to update model parameters and provide model uncertainty; 3) fusing multi-stochastic-layer features to enhance robustness.

V-H Computational Complexity

TABLE V: Comparison of computational cost at the testing stage.

Models	Testing a sample(s)
LDS	0.6035
HMM	0.4412
VAE	0.0006
RNN layer-3	0.0008
s-rGBN-Poisson layer-3	0.0010
s-rGBN-PRG layer-3	0.0010

In this section, we evaluate the computational complexity of various methods. In Fig. 13, we compare various methods in terms of how their test recognition rates vary as the increase of training time. At each training iteration, for both LDS and HMM, all training HRRP samples are processed, while for RNN, VAE, s-rGBN-PRG, and s-rGBN-Poisson, a mini-batch is randomly selected for training. As shown in Fig. 13, both s-rGBN-PRG and s-rGBN-Poisson outperform RNN and VAE in providing higher performance as time progresses. Note that RNN is only a deterministic recurrent network that does not provide a probabilistic generative model in the latent space to model uncertainty and discover latent hierarchical structure, and VAE is restricted to model non-sequential data. Our mini-batch based inference algorithm converges faster than batch-based ones (including LDS and HMM), which further demonstrates the advantage of our proposed hybrid MCMC/VAE inference algorithm.

In Table V, we compare the computational complexity of various algorithms at the testing stage. Both HMM and LDS have high computational cost, due to the need to perform a number of iterations to infer the latent representation of each text sample. By contrast, by directly mapping a test data into its latent representation via non-linear transformation, RNN, VAE, and the proposed models equipped with the feed-forward inference network, are able to process the test data in real time.

VI Conclusion

For radar high-resolution range profile (HRRP) target recognition, we introduce recurrent gamma belief network (rGBN), a temporal deep generative model, to efficiently capture the global structure of the targets and temporal dependence between the range cells in a single HRRP. Scalable inference for rGBN is developed by integrating stochastic gradient MCMC and recurrent variational inference into a hybrid inference scheme. We further propose supervised rGBN to increase the discriminative power of the latent states by jointly modeling the HRRP samples and their labels. Experimental results on synthetic and measured HRRP data demonstrate that in comparison to existing models, the proposed ones not only exhibit superior recognition performance and enhanced robustness to the variation of the training set size, but also provide highly interpretable latent structure.

References

[1] A. Zyweck and R. E. Bogner, “Radar target classification of commercial aircraft,” IEEE Transactions on Aerospace and Electronic Systems, vol. 32, no. 2, pp. 598–606, 1996.
[2] B. Bhanu, D. E. Dudgeon, E. G. Zelnio, A. Rosenfeld, D. Casasent, and I. S. Reed, “Guest editorial introduction to the special issue on automatic target detection and recognition,” IEEE Transactions on Image Processing, vol. 6, no. 1, pp. 1–6, 1997.
[3] H. Chiang, R. L. Moses, and L. C. Potter, “Model-based classification of radar images,” IEEE Transactions on Information Theory, vol. 46, no. 5, pp. 1842–1854, 2000.
[4] Y. Sun, Z. Liu, S. Todorovic, and J. Li, “Adaptive boosting for SAR automatic target recognition,” IEEE Transactions on Aerospace and Electronic Systems, vol. 43, no. 1, pp. 112–125, 2007.
[5] B. Chen, H. Liu, J. Chai, and Z. Bao, “Large margin feature weighting method via linear programming,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1475–1488, 2009.
[6] H. Zhang, N. M. Nasrabadi, Y. Zhang, and T. S. Huang, “Multi-view automatic target recognition using joint sparse representation,” IEEE Transactions on Aerospace and Electronic Systems, vol. 48, no. 3, pp. 2481–2497, 2012.
[7] U. Srinivas, V. Monga, and R. G. Raj, “SAR automatic target recognition using discriminative graphical models,” IEEE Transactions on Aerospace and Electronic Systems, vol. 50, no. 1, pp. 591–606, 2014.
[8] J. Ding, B. Chen, H. Liu, and M. Huang, “Convolutional neural network with data augmentation for SAR target recognition,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 3, pp. 364–368, 2016.
[9] X. D. Zhang, Y. Shi, and Z. Bao, “A new feature vector using selected bispectra for signal classification with application in radar target recognition,” IEEE Transactions on Signal Processing, vol. 49, no. 9, pp. 1875–1885, 2001.
[10] L. Du, H. Liu, Z. Bao, and M. Xing, “Radar HRRP target recognition based on higher order spectra,” IEEE Transactions on Signal Processing, vol. 53, no. 7, pp. 2359–2368, 2005.
[11] L. Shi, P. Wang, H. Liu, L. Xu, and Z. Bao, “Radar HRRP statistical recognition with local factor analysis by automatic Bayesian Ying-Yang harmony learning,” IEEE Transactions on Signal Processing, vol. 59, no. 2, pp. 610–617, 2011.
[12] P. Molchanov, K. O. Egiazarian, J. Astola, A. V. Totsky, S. Leshchenko, and M. Jaraboamores, “Classification of aircraft using micro-doppler bicoherence-based features,” IEEE Transactions on Aerospace and Electronic Systems, vol. 50, no. 2, pp. 1455–1467, 2014.
[13] L. Du, H. Liu, Z. Bao, and J. Zhang, “A two-distribution compounded statistical model for radar HRRP target recognition,” IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 2226–2238, 2006.
[14] L. Du, H. Liu, and Z. Bao, “Radar HRRP statistical recognition: Parametric model and model selection,” IEEE Transactions on Signal Processing, vol. 56, no. 5, pp. 1931–1944, 2008.
[15] B. Feng, B. Chen, and H. Liu, “Radar HRRP target recognition with deep networks,” Pattern Recognition, vol. 61, pp. 379–393, 2017.
[16] L. Du, P. Wang, H. Liu, M. Pan, F. Chen, and Z. Bao, “Bayesian spatiotemporal multitask learning for radar HRRP target recognition,” IEEE Transactions on Signal Processing, vol. 59, no. 7, pp. 3182–3196, 2011.
[17] M. Pan, L. Du, P. Wang, H. Liu, and Z. Bao, “Multi-task hidden Markov modeling of spectrogram feature from radar high-resolution range profiles,” EURASIP Journal on Advances in Signal Processing, vol. 2012, no. 1, p. 86, 2012.
[18] X. Liao and Z. Bao, “Circularly integrated bispectra: novel shift invariant features for high-resolution radar target recognition,” Electronics Letters, vol. 34, no. 19, pp. 1879–1880, 1998.
[19] L. Du, H. W. Liu, Z. Bao, and J. Zhang, “Radar automatic target recognition using complex high-resolution range profiles,” IET Radar Sonar and Navigation, vol. 1, no. 1, pp. 18–26, 2007.
[20] L. Du, H. Liu, P. Wang, B. Feng, M. Pan, and Z. Bao, “Noise robust radar HRRP target recognition based on multitask factor analysis with small training data size,” IEEE Transactions on Signal Processing, vol. 60, no. 7, pp. 3546–3559, 2012.
[21] B. Feng, L. Du, H. W. Liu, and F. Li, “Radar HRRP target recognition based on K-SVD algorithm,” in IEEE CIE International Conference on Radar, 2012, pp. 642–645.
[22] C. Du, B. Chen, B. Xu, D. Guo, and H. Liu, “Factorized discriminative conditional variational auto-encoder for radar HRRP target recognition,” Signal Processing, vol. 158, pp. 176–189, 2019.
[23] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[24] M. Pan, J. Jiang, Q. Kong, J. Shi, Q. Sheng, and T. Zhou, “Radar HRRP target recognition based on t-SNE segmentation and discriminant deep belief network,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 9, pp. 1609–1613, 2017.
[25] M. Pan, L. Du, P. Wang, H. Liu, and Z. Bao, “Multi-task hidden Markov model for radar automatic target recognition,” in IEEE CIE International Conference on Radar, 2011, pp. 650–653.
[26] P. Wang, L. Du, M. Pan, X. Zhang, and H. Liu, “Radar HRRP target recognition based on linear dynamic model,” in IEEE CIE International Conference on Radar, vol. 1, 2011, pp. 662–665.
[27] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” in 9th ISCA Speech Synthesis Workshop, 2016, pp. 125–125.
[28] J. Wan, B. Chen, B. Xu, H. Liu, and L. Jin, “Convolutional neural networks for radar hrrp target recognition and rejection,” EURASIP Journal on Advances in Signal Processing, vol. 2019, no. 1, p. 5, 2019.
[29] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[30] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
[31] B. Xu, B. Chen, J. Wan, H. Liu, and L. Jin, “Target-aware recurrent attentional network for radar HRRP target recognition,” Signal Processing, vol. 155, pp. 268–280, 2019.
[32] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International conference on machine learning, 2013, pp. 1310–1318.
[33] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in Advances in Neural Information Processing Systems, 2015, pp. 2980–2988.
[34] Z. Gan, C. Li, C. Chen, Y. Pu, Q. Su, and L. Carin, “Scalable bayesian learning of recurrent neural networks for language modeling,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 321–331.
[35] J. Chung, S. Ahn, and Y. Bengio, “Hierarchical multiscale recurrent neural networks,” in International Conference on Learning Representations, 2017.
[36] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber, “A clockwork RNN,” in International Conference on Machine Learning, 2014, pp. 1863–1871.
[37] M. Zaheer, A. Ahmed, and A. J. Smola, “Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data,” in International Conference on Machine Learning, 2017, pp. 3967–3976.
[38] Z. Gan, C. Li, R. Henao, D. E. Carlson, and L. Carin, “Deep temporal sigmoid belief networks for sequence modeling,” in Advances in Neural Information Processing Systems, 2015, pp. 2467–2475.
[39] C. Gong and W. Huang, “Deep dynamic poisson factorization model,” in Advances in Neural Information Processing Systems, 2017, pp. 1666–1674.
[40] D. Guo, B. Chen, H. Zhang, and M. Zhou, “Deep Poisson gamma dynamical systems,” in Advances in Neural Information Processing Systems, 2018, pp. 8442–8452.
[41] R. G. Krishnan, U. Shalit, and D. Sontag, “Deep Kalman Filters,” arXiv preprint arXiv:1511.05121, 2015.
[42] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference on Learning Representations, 2014.
[43] H. Zhang, B. Chen, D. Guo, and M. Zhou, “WHAI: Weibull hybrid autoencoding inference for deep topic modeling,” in International Conference on Learning Representations, 2018.
[44] C. Li, J. Zhu, T. Shi, and B. Zhang, “Max-margin deep generative models,” in Advances in Neural Information Processing Systems, 2015, pp. 1837–1845.
[45] J. P. Zwart, R. van der Heiden, S. Gelsema, and F. Groen, “Fast translation invariant classification of HRR range profiles in a zero phase representation,” IEE Proceedings - Radar, Sonar and Navigation, vol. 150, no. 6, pp. 411–418, 2003.
[46] M. Zhou, Y. Cong, and B. Chen, “Augmentable gamma belief networks,” Journal of Machine Learning Research, vol. 17, no. 163, pp. 1–44, 2016.
[47] Y. Cong, B. Chen, H. Liu, and M. Zhou, “Deep latent dirichlet allocation with topic-layer-adaptive stochastic gradient riemannian mcmc,” in International Conference on Machine Learning, 2017, pp. 864–873.
[48] Y. Cong, B. Chen, and M. Zhou, “Fast simulation of hyperplane-truncated multivariate normal distributions,” Bayesian Analysis, vol. 12, no. 4, pp. 1017–1037, 2017.
[49] M. A. Girolami and B. Calderhead, “Riemann manifold Langevin and Hamiltonian Monte Carlo methods,” Journal of The Royal Statistical Society Series B-statistical Methodology, vol. 73, no. 2, pp. 123–214, 2011.
[50] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
[51] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck, “A hierarchical latent vector model for learning long-term structure in music,” in International Conference on Machine Learning, 2018.
[52] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
[53] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in International Conference on Artificial Intelligence and Statistics, 2015, pp. 562–570.
[54] A. Acharya, J. Ghosh, and M. Zhou, “Nonparametric Bayesian Factor Analysis for Dynamic Count Matrices,” in International Conference on Artificial Intelligence and Statistics, 2015, pp. 1–9.
[55] R. P. Adams, I. Murray, and D. J. MacKay, “Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities,” in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 9–16.
[56] L. Du, H. Liu, and Z. Bao, “Radar HRRP statistical recognition: Parametric model and model selection,” IEEE Transactions on Signal Processing, vol. 56, no. 5, pp. 1931–1944, 2008.
[57] C. Chang and C. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, p. 27, 2011.
[58] H. Yu and J. Yang, “A direct LDA algorithm for high-dimensional data - with application to face recognition,” Pattern Recognition, vol. 34, no. 10, pp. 2067–2070, 2001.
[59] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[60] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. C. Courville, “Adversarially learned inference,” in International Conference on Learning Representations, 2017.
[61] L. V. Der Maaten and G. E. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.

		$\displaystyle P\left(\boldsymbol{x}_{t,n},\{\boldsymbol{s}_{t,n}^{(l)}\}_{l=1}^{L}\,\|\,\{\boldsymbol{\Phi}^{(l)},\boldsymbol{\Pi}^{(l)},b^{(l)}\}_{l=1}^{L},c\right)$
		$\displaystyle=\left[\prod\limits_{l=1}^{L}{p\left(\boldsymbol{s}_{t,n}^{(l)}\,\|\,\boldsymbol{\Phi}^{(l+1)}\boldsymbol{s}_{t,n}^{(l+1)},\boldsymbol{\Pi}^{(l)}\boldsymbol{s}_{t-1,n}^{(l)},b^{(l)}\right)}\right]$		(11)
		$\displaystyle\times p\left(\boldsymbol{x}_{t,n}\,\|\,\boldsymbol{\Phi}^{(1)}\boldsymbol{s}_{t,n}^{(1)},c\right).$