
Multiview Variational Deep Learning with Application to Scalable Indoor Localization

Minseuk Kim
School of Electrical Engineering
KAIST, Daejeon, Republic of Korea
[email protected]
Changjun Kim
School of Electrical Engineering
KAIST, Daejeon, Republic of Korea
[email protected]
Dongsoo Han
School of Computing
KAIST, Daejeon, Republic of Korea
[email protected]
June-Koo Kevin Rhee
School of Electrical Engineering
ITRC of Quantum Computing for AI
KAIST, Daejeon, Republic of Korea
[email protected]
Abstract

Radio channel state information (CSI) measured with many receivers is a good resource for localizing a transmit device with a discriminative machine learning model. However, CSI localization is nontrivial when the radio map is complicated, such as in building corridors. This paper introduces a view-selective deep learning (VSDL) system for indoor localization using WiFi CSI. Multiview training with CSI obtained from multiple groups of access points (APs) generates latent features on a supervised variational deep network. This information is then applied to an additional network for dominant-view classification to enhance the regression accuracy of localization. As non-informative latent features from multiple views are rejected, we achieve a localization accuracy of 0.77 m, which outperforms by 30% the best known accuracy in practical applications in a real building environment. To the best of our knowledge, this is the first approach to apply variational inference and to construct a scalable system for radio localization. Furthermore, our work investigates a methodology for supervised learning with multiview data where informative and non-informative views coexist.

1 Introduction

Machine learning applications with a multiview embodiment rendered from multiple sources can exploit feature correlations among the views to attain the best model for inference. Multiview data can be efficiently generated by variational inference [1] from a single view, as reported in [2]. Variational inference is also adopted in a broad range of discriminative models such as clustering [3], classification [4], and regression [5] to utilize the probabilistic latent feature space. From the perspective of a deep network, variational deep learning (DL) [6] can jointly optimize the objectives of a Gaussian process marginal likelihood to train the deep network. Reparameterization in variational inference [7] derives mean-field latent feature vectors that represent the posterior of an input. Compared to stand-alone deep networks, support vector machines (SVMs), and other recent systems, variational DL can substantially improve classification and regression performance [7]. In this paper, we apply variational DL to the localization of transmit devices based on radio signal data in a practical WiFi environment.

Similarly to an example that generates a single image view from multiple image views [8], radio signals retrieved by multiple groups of receivers at distributed locations can form multiview information for machine learning to locate the transmit device. Because variational DL extracts latent vectors whose elements are uncorrelated, we can map between a multivariate standard normal distribution and each view. As the variational DL is forced to keep its latent vector close to the standard normal distribution with a reduced number of independent Gaussian processes, we can mitigate the signal uncertainties. Compared to the global positioning system (GPS), which attains 5 m to 10 m outdoor localization accuracy, indoor localization requires more accurate positioning of the transmit devices. Recently, the channel state information (CSI) of WiFi has emerged as a strong candidate for indoor localization, rather than the scalar-valued received signal strength indicator (RSSI) [9]. The subcarrier CSIs of a WiFi orthogonal frequency division multiplexing (OFDM) channel form complex vector information, providing much richer information for radio localization with excellent localization accuracy.

1.1 Related work

At the server side of a localization system, CSI data is collected from multiple receivers at the same time to find the transmit device location. Transmit device localization is achieved by geometric analysis of such CSI data to determine the time of flight and angle of arrival of the radio packet [10, 11]. However, in a practical indoor environment, noise and signal fading become critical obstacles to finding true transmit device locations with such analytical methods.

In order to cope with noise and signal fading, many machine learning methods have been developed to find the transmit device location from the complex CSI data by considering it as a single view. References [12, 13, 14] utilized restricted Boltzmann machine (RBM) based approaches to reconstruct the CSI data for better likelihood determination in localization. Convolutional neural network (CNN) based approaches were proposed in [15, 16, 17], where consecutive data packets were concatenated into a single batch as a 2-D CSI image. However, in radio localization, it is preferable that the system be capable of packet-by-packet processing rather than batch processing. References [18] and [19] introduced SVM based classification and regression, respectively. Reference [20] adopted transfer learning to reconstruct CSI data and applied an enhanced $k$-nearest neighbor (KNN) approach for localization. To enhance the accuracy of spot location classification, [21] introduced an autoencoder. In [22], principal component analysis (PCA), a preprocessing method, was applied to reduce the dimensionality of the CSI before passing it through a deep neural network (DNN). Combining RSSI and CSI, [23] proposed a multi-layer perceptron (MLP) and a 1-D CNN. From the perspective of device-free indoor localization, [24] carried out a canonical correlation analysis (CCA) to classify the location of a device with a human, where the device is neither a transmitter nor a receiver. The localization accuracies achieved by all the papers listed above did not reach below one meter in practical application environments, except for the case where the training and test locations are the same. In contrast, the result in this paper achieves sub-meter localization accuracy by adopting a multiview architecture with variational DL.

1.2 Our contributions

Advanced, scalable localization with high accuracy is pursued in this work to extend radio localization over a complicated floor plan for in-building applications. To construct a scalable learning system for localization in a real building environment with corridors, we propose a supervised learning system named view-selective deep learning (VSDL), which takes CSI data consisting of multiple views. The VSDL obtains much-improved regression performance due to latent feature generation by variational inference and the rejection of non-informative latent features among the multiple views. The proposed VSDL achieves a localization accuracy of 0.77 m, which outperforms by more than 30% the best known accuracy of other works applied in practical building experiments. To the best of our knowledge, this is the first approach to apply variational inference to CSI-based WiFi localization and to construct a scalable system for localization in a wide and complex environment. Furthermore, the application of our system can be extended to general supervised learning with multiview data where informative and non-informative views coexist.

2 CSI Preliminaries and Data Collection

In a WiFi network, device localization can be achieved by analyzing the CSIs of a radio packet arriving at the multiple receiver antennas of an access point (AP), complying with the IEEE 802.11a/g/n/ac standards for the multi-input multi-output (MIMO) air interface. In our experiment with an Intel WiFi Link (IWL) 5300 network interface controller (NIC), the physical layer API reports a complex CSI vector of 30 selected subcarriers for each antenna receiving a WiFi packet [25]. The phase differences among the CSIs of multiple antennas provide the angle of arrival of a received packet, which is the key information for localizing the transmit device.

The received CSI of a packet at subcarrier $i\in\{1,\dots,I\}$ of antenna $m\in\{1,\dots,M\}$, with nominal CSI $H_{m,i}$ and noise $N_{m,i}$, is represented as $\hat{H}_{m,i}=|H_{m,i}|e^{j2\pi\angle H_{m,i}}+N_{m,i}$. The nominal CSI amplitude is $|H_{m,i}|$, and the nominal CSI phase is represented as

\angle H_{m,i} = s_i\,\delta_f\,\tau + (m-1)\,f_c\,\frac{d\sin\theta}{c},    (1)

where a subset of subcarriers $\{s_i\}$ is selected among the available subcarriers indexed between $-S$ and $+S$. The constant $\delta_f$ is the frequency difference between subcarriers, $f_c$ is the center frequency of the channel, $d$ is the distance between adjacent receiver antennas, and $c$ is the speed of light. Here, the phase $\angle H_{m,i}$ is a function of the time of flight $\tau$ and the angle of arrival $\theta$, which implies that the subcarrier frequencies and the geometry of the antenna array cause relative phase differences due to different radio arrival times. Many previous localization techniques aimed to find the nominal time of flight and angle of arrival [26]. But in real 802.11 communication, several offsets and noise accompany them, and the measured phase is represented as $\angle\hat{H}_{m,i}=\angle H_{m,i}+s_i\,\lambda+\mu_m+\beta+Z_{m,i}$, where $\lambda$ and $\mu_m$ denote the subcarrier-dependent offset coefficient and the receiver antenna-dependent offset, respectively, and $\beta$ and $Z_{m,i}$ denote the packet-dependent offset and noise, respectively [27]. Empirically, these offsets and noise cause large fluctuations in the CSI phase and thus make the problem hard to solve with analytical methods. In our approach with variational inference, the CSI is mapped to a unit-phasor complex ratio that measures the phase difference of CSIs among different antennas:

x_{m,i} = \frac{\hat{H}_{m,i}/|\hat{H}_{m,i}|}{\hat{H}_{M,i}/|\hat{H}_{M,i}|}, \qquad m\in\{1,\dots,M-1\}.    (2)

Here, we adopt variational DL to mitigate the noise and signal fading problems in the CSI training samples, defined as $\mathbf{x}=[x_{m,i}]$, with $m\in\{1,\dots,M-1\}$ and $i\in\{1,\dots,I\}$.
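For concreteness, the following is a minimal NumPy sketch of the preprocessing in (2); the function name, the array shapes, and the random example input are illustrative assumptions, not part of the IWL 5300 toolchain.

    import numpy as np

    def relative_csi(H_hat):
        """Map measured CSI H_hat (complex, shape (M, I)) to the unit-phasor
        ratios x_{m,i} of (2), using antenna M as the phase reference."""
        phasor = H_hat / np.abs(H_hat)    # unit phasors H/|H| per antenna and subcarrier
        return phasor[:-1] / phasor[-1]   # divide by reference antenna M; shape (M-1, I)

    # Example: M = 3 antennas, I = 30 subcarriers of random complex CSI
    H_hat = np.random.randn(3, 30) + 1j * np.random.randn(3, 30)
    x = relative_csi(H_hat)               # shape (2, 30), |x_{m,i}| = 1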

In order to construct a localization system scalable to a complex area, we should consider the exclusion of non-informative CSI views. One can deploy multiple APs over an area consisting of $K$ sub-areas, where the collection of APs in each sub-area forms one view of the CSI sample. We apply deep learning to classify the dominant views among the multiple sub-areas.

3 Variational Deep Learning

We apply variational DL for regression, trained with input pairs of $\mathbf{x}$ (i.e., CSI in our case) and a true label $\mathbf{y}$ (i.e., a Cartesian coordinate in our case). Let us assume a latent feature vector $\mathbf{z}$ consisting of $z_j=f_j(\mathbf{x})$, $j\in\{1,\dots,J\}$, of independent Gaussian processes (GPs) with $z_j\sim\mathcal{GP}(\mu_j,\sigma_j^2)$, where the mean-field feature vectors $\boldsymbol{\mu}=[\mu_1,\dots,\mu_J]$ and $\boldsymbol{\sigma}=[\sigma_1,\dots,\sigma_J]$ indicate the mean and standard deviation, respectively. The latent vector is used to estimate the regression output $\hat{\mathbf{y}}$, where the true $\mathbf{y}$ is supposed to be represented as $\mathbf{y}(\mathbf{x})|\mathbf{z}\sim\mathcal{N}(\mathbf{y}(\mathbf{x});\mathbf{z},\boldsymbol{\sigma}^2\odot\mathbf{I})$. To make the learning variables differentiable for gradient-descent-based back-propagation, reparameterization is required. The weights and biases of the neural network (NN) that obtains the latent vector $\mathbf{z}$ from the input $\mathbf{x}$ are updated by sampling a noise vector $\boldsymbol{\epsilon}$:

\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I}).    (3)
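As a brief PyTorch sketch, the reparameterization in (3) keeps the sampled $\mathbf{z}$ differentiable with respect to the mean-field parameters; parameterizing $\log\sigma^2$ instead of $\sigma$ is a common numerical convention we adopt here, not part of (3) itself.

    import torch

    J = 120                                        # latent dimension (J = 120 in Section 5)
    mu = torch.zeros(J, requires_grad=True)        # mean-field mean vector
    log_var = torch.zeros(J, requires_grad=True)   # log(sigma^2), for numerical stability

    eps = torch.randn_like(mu)                     # epsilon ~ N(0, I), sampled without gradient
    z = mu + torch.exp(0.5 * log_var) * eps        # z = mu + sigma ⊙ epsilon, as in (3)

    z.sum().backward()                             # gradients reach mu and log_var through z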

Then, with the variational posterior over the estimated distribution $q(\mathbf{z})$, Jensen's inequality can be applied to give the evidence lower bound (ELBO) of the marginal log-likelihood of regression [6]:

\log p(\mathbf{y}) \geq \mathbb{E}_{q(\mathbf{z}|\mathbf{y})}[\log p(\mathbf{y}|\mathbf{z})] - \mathbf{KL}[q(\mathbf{z}|\mathbf{y})\,||\,p(\mathbf{z})] \triangleq \mathcal{L}(q) = \mathbf{ELBO}.    (4)

In (4), the first term on the right-hand side is the cross entropy between the label $\mathbf{y}$ and the regression output from the latent vector $\mathbf{z}$, which is equivalent to the regression loss. The second term is the Kullback–Leibler (KL) divergence between the estimated posterior $q(\mathbf{z}|\mathbf{y})$ and the distribution $p(\mathbf{z})$. Our aim is to approximate the estimated posterior $q(\mathbf{z}|\mathbf{y})$ to the true posterior $p(\mathbf{z}|\mathbf{y})$; in other words, to minimize $\mathbf{KL}[q(\mathbf{z}|\mathbf{y})\,||\,p(\mathbf{z}|\mathbf{y})]$. The log-likelihood can also be represented as

\log p(\mathbf{y}) = \int \log\big(p(\mathbf{y})\big)\, q(\mathbf{z}|\mathbf{y})\, d\mathbf{z}
             = \int \log\frac{p(\mathbf{y},\mathbf{z})}{q(\mathbf{z}|\mathbf{y})}\, q(\mathbf{z}|\mathbf{y})\, d\mathbf{z} + \int \log\frac{q(\mathbf{z}|\mathbf{y})}{p(\mathbf{z}|\mathbf{y})}\, q(\mathbf{z}|\mathbf{y})\, d\mathbf{z}
             = \mathbf{ELBO} + \mathbf{KL}[q(\mathbf{z}|\mathbf{y})\,||\,p(\mathbf{z}|\mathbf{y})].    (5)

Since the log-likelihood $\log p(\mathbf{y})$ is bounded, minimizing the KL divergence $\mathbf{KL}[q(\mathbf{z}|\mathbf{y})\,||\,p(\mathbf{z}|\mathbf{y})]$ requires maximizing the ELBO. The KL divergence term in (4), between the normal distribution $q(\mathbf{z}|\mathbf{y})$ and the standard normal distribution $p(\mathbf{z})$, can be simplified to

\mathbf{KL}[q(\mathbf{z}|\mathbf{y})\,||\,p(\mathbf{z})] = \frac{1}{2}\sum_{j=1}^{J}\big(\mu_j^2 + \sigma_j^2 - \ln(\sigma_j^2) - 1\big).    (6)

By updating the NN parameters to jointly reduce the cross entropy and the KL divergence $\mathbf{KL}[q(\mathbf{z}|\mathbf{y})\,||\,p(\mathbf{z})]$, we can both approximate the posterior and estimate the desired regression output.
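A minimal sketch of this joint objective follows, assuming a diagonal Gaussian posterior with a log-variance parameterization and a squared-error realization of the regression term (both are our implementation choices):

    import torch

    def kl_to_standard_normal(mu, log_var):
        """KL[q(z|y) || p(z)] of (6) for a diagonal Gaussian q and standard normal p."""
        return 0.5 * torch.sum(mu ** 2 + torch.exp(log_var) - log_var - 1.0)

    def negative_elbo(y, y_hat, mu, log_var):
        """Loss to minimize: the regression term plus the KL regularizer of (4)."""
        regression_loss = torch.sum((y - y_hat) ** 2)
        return regression_loss + kl_to_standard_normal(mu, log_var)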

4 Localization: View-Selective Deep Learning

We introduce a novel learning system, the VSDL, in which the relative importance of each view among the multiple views is used to improve the regression accuracy. A type of selective sampling referred to as co-testing was introduced in [28] to efficiently extract features from multiview data; its basic idea is to inject divided views into multiple independent learning networks. However, that model made the strict assumption that the views of each data sample are strongly correlated. In contrast, we focus on a situation where views are neither correlated with each other nor uniformly informative, but it is known whether or not each view is informative. In our case, we need to localize a target in a two-corridor environment, as in Figure 1.

Figure 1: Indoor localization in a two-corridor environment. The radio signals from AP3 to A and from AP1 to B are not on the LoS. In terms of the data views of the APs, these NLoS signals make the corresponding views non-informative.

As Figure 1 shows, a WiFi radio signal can propagate not only through a line-of-sight (LoS) path but also along non-line-of-sight (NLoS) paths. Consequently, the received signal suffers from multi-path fading as well as signal noise. A CSI data view at APs reached only through NLoS paths is non-informative, since the multi-paths are very unpredictable. In such a case, it is inefficient to include all views in training. However, we can judge whether or not a view is informative and utilize this given information for supervised training. In our VSDL model, the learning parameters are updated much more effectively by excluding non-informative data views. At a high level, VSDL is designed as a two-stage learning network for regression, consisting of a view-oriented variational deep network and a view-classified regression network.

4.1 View-Oriented Variational Deep Network


Figure 2: Two-stage VSDL system design illustrated with an example of a two-view case. The view-oriented variational deep network (left grey box) consists of NNs for latent sampling and regression. The view-classified regression network (right grey box) consists of NNs for view classification and desired regression.

The VSDL model is first trained to extract latent feature vectors from the multiview data in a view-oriented way, as shown in the left grey box in Figure 2. We first reconstruct the multiview input data $\mathbf{x}^{\prime}=\{\mathbf{x}_1,\dots,\mathbf{x}_K\}$ from a single sample $\mathbf{x}$ to represent $K$ views. In our application, the views are created by different groups of WiFi APs located in different corridors. View $\mathbf{x}_k$ is the $k$-th subset of correlated data in $\mathbf{x}$; in our case, $K$ is the number of corridors, and $\mathbf{x}_k$ consists of the data from the APs in corridor $k$. We then define $\mathbf{u}=\{u_1,\dots,u_K\}$, $u_k\in\{0,1\}$, as the given view label over the $K$ views. In our supervised training, we set the corridor label $u_k=1$ if the location of the training sample is seen by corridor view $k$, and $u_k=0$ otherwise. We model $K$ independent NNs to optimize the parameters; in short, the NN for the $k$-th view has the training input $(\mathbf{x}_k,\mathbf{y},u_k)$, a tuple of data, true label, and given view label. In our case, $\mathbf{u}$ usually has a value of 1 for only one $k$, unless the training sample is seen by multiple views, as in Figure 1(c), where it has a value of 1 for multiple $k$s.
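A small sketch of this view construction follows, using the AP-to-corridor assignment of the two-corridor experiment in Section 5; the helper name and the placeholder per-AP features are illustrative assumptions.

    import numpy as np

    def build_views(x_per_ap, view_aps):
        """Split one multiview sample into K view subsets x_1, ..., x_K."""
        return [np.concatenate([x_per_ap[a] for a in aps]) for aps in view_aps]

    # Two-corridor example (K = 2): view 1 sees AP1..AP5, view 2 sees AP3..AP7
    view_aps = [[1, 2, 3, 4, 5], [3, 4, 5, 6, 7]]
    x_per_ap = {a: np.random.randn(60) for a in range(1, 8)}  # placeholder per-AP features
    x_views = build_views(x_per_ap, view_aps)                 # x_1 and x_2

    u = np.array([1, 0])   # given view label: this sample is seen only by corridor 1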

To apply variational inference, each latent sampling NN encodes $\mathbf{x}_k$ to a latent feature $\mathbf{z}_k$ by the mapping $\mathbf{z}_k=h_k(\mathbf{x}_k)$, with optimized mean-field vectors $\boldsymbol{\mu}_k=[\mu_{1,k},\dots,\mu_{J,k}]$ and $\boldsymbol{\sigma}_k=[\sigma_{1,k},\dots,\sigma_{J,k}]$ obtained through $L$ hidden layers $h^{(1)}_k,\dots,h^{(L)}_k$. The weight $\mathbf{W}^{(l)}_k$ and bias $\mathbf{b}^{(l)}_k$ of layer $l\in\{1,\dots,L\}$ evaluate the feature output $\boldsymbol{\phi}^{(l)}_k$ (with $\boldsymbol{\phi}^{(0)}_k=\mathbf{x}_k$) as the next layer's input. The output layer $h^{(L)}_k$ generates the mean-field vectors:

[\boldsymbol{\mu}_k, \boldsymbol{\sigma}_k] = \mathbf{W}^{(L)}_k \boldsymbol{\phi}^{(L-1)}_k + \mathbf{b}^{(L)}_k,    (7)

followed by reparameterization for the latent vector $\mathbf{z}_k$ as

\mathbf{z}_k = \boldsymbol{\mu}_k + \boldsymbol{\sigma}_k \odot \boldsymbol{\epsilon}_k, \qquad \boldsymbol{\epsilon}_k \sim \mathcal{N}(0,\mathbf{I}).    (8)

Here, we try to minimize the KL divergence of $\mathbf{z}_k$ according to (6) to approximate the posterior $q(\mathbf{z}_k|\mathbf{y})$ to the distribution $p(\mathbf{z}_k)$.
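A compact PyTorch sketch of one view's latent sampling NN, realizing (7) and (8), is given below; the layer widths follow Section 5, while the class name and the log-variance output are our assumptions.

    import torch
    import torch.nn as nn

    class LatentSamplingNN(nn.Module):
        """Encode view x_k into mean-field vectors and a reparameterized latent z_k."""

        def __init__(self, in_dim, latent_dim=120):
            super().__init__()
            self.hidden = nn.Sequential(               # hidden layers with ReLU
                nn.Linear(in_dim, 1000), nn.ReLU(),
                nn.Linear(1000, 500), nn.ReLU(),
            )
            self.out = nn.Linear(500, 2 * latent_dim)  # output layer h_k^(L), eq. (7)

        def forward(self, x_k):
            mu, log_var = self.out(self.hidden(x_k)).chunk(2, dim=-1)
            eps = torch.randn_like(mu)                 # epsilon_k ~ N(0, I)
            z_k = mu + torch.exp(0.5 * log_var) * eps  # reparameterization, eq. (8)
            return z_k, mu, log_var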

Along with the KL divergence minimization, the regression NN maps the latent vector $\mathbf{z}_k$ to $\hat{\mathbf{y}}$ by the mapping $\hat{\mathbf{y}}=g_k(\mathbf{z}_k|\mathbf{y})$, which consists of $P$ layers $g^{(1)}_k,\dots,g^{(P)}_k$. The weight $\mathbf{W}^{\prime(P)}_k$ and bias $\mathbf{b}^{\prime(P)}_k$ of the last layer estimate the output $\hat{\mathbf{y}}=[\hat{y}_1,\hat{y}_2]$, which in our case is represented in normalized Cartesian coordinates:

\hat{\mathbf{y}} = \boldsymbol{\phi}^{\prime(P)}_k = \mathbf{W}^{\prime(P)}_k \boldsymbol{\phi}^{\prime(P-1)}_k + \mathbf{b}^{\prime(P)}_k.    (9)

The regression loss is then evaluated as the Euclidean distance to the known true $\mathbf{y}$ for supervised learning. We jointly minimize the KL divergence and the regression loss by updating the weight and bias parameters.

In order to achieve supervised view-oriented learning, we utilize the given view label $\mathbf{u}$ during training. For every multiview training sample, the mappings $h_k(\mathbf{x}_k)$ and $g_k(\mathbf{z}_k|\mathbf{y})$ are optimized to generate $\mathbf{z}_k$ only for the informative views as follows:

\begin{cases} \min_{\mathbf{W}_k,\mathbf{b}_k,\mathbf{W}^{\prime}_k,\mathbf{b}^{\prime}_k} \{(y_1-\hat{y}_1)^2+(y_2-\hat{y}_2)^2\} + \mathbf{KL}[q(\mathbf{z}_k|\mathbf{y})\,||\,p(\mathbf{z}_k)] & \text{if } u_k=1, \\ \text{Do nothing} & \text{if } u_k=0. \end{cases}    (10)

We expect the NNs to properly extract latent features while excluding the non-informative data views. Although the trained weights and biases can generate the latent feature vector $\mathbf{z}_k$ for every test sample regardless of the condition $u_k$, the latent features from two different samples may have a strong correlation if both are informative in a certain view $k$ (i.e., $u_k=1$). Now $\mathbf{z}$ for every training and test sample becomes the new input of the next stage. In Section 4.2, we introduce an additional network that enhances the regression using the latent features and their hidden correlations.
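A sketch of the conditional update rule (10) follows, assuming each view network bundles an encoder and a regressor as in the sketch above; the optimizer handling is illustrative.

    import torch

    def view_oriented_step(view_nets, optimizers, x_views, y, u):
        """One training step of (10): update view k only when its label u_k is 1."""
        for k, (net, opt) in enumerate(zip(view_nets, optimizers)):
            if u[k] == 0:
                continue                              # "Do nothing" for non-informative views
            z_k, mu, log_var = net.encoder(x_views[k])
            y_hat = net.regressor(z_k)
            loss = torch.sum((y - y_hat) ** 2) \
                 + 0.5 * torch.sum(mu ** 2 + torch.exp(log_var) - log_var - 1.0)
            opt.zero_grad()
            loss.backward()
            opt.step()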

4.2 View-Classified Regression Network

The view-classified regression network, described in the right grey box in Figure 2, uses the intermediately integrated [29] latent vector $\mathbf{z}=\{\mathbf{z}_1,\dots,\mathbf{z}_K\}$ and the given view label $\mathbf{u}$ of Section 4.1. The network consists of two NNs: 1) one to classify the view information $\hat{\mathbf{u}}$, and 2) one to obtain the regression output $\hat{\mathbf{y}}$, using each classified view indicator $\hat{u}_k$ as a reweight parameter for the subset $\mathbf{z}_k$. Our insight in this network starts from the hypothesis that the latent vector $\mathbf{z}$ generated by the previous view-oriented learning can select the dominant views $k$ through the view classification NN. The regression NN then utilizes the classification result to enhance the desired regression performance. The aim is to jointly approximate the classification output $\hat{\mathbf{u}}$ to the given view label $\mathbf{u}$, and the regression output $\hat{\mathbf{y}}$ to the true label $\mathbf{y}$. The classification output $\hat{\mathbf{u}}=[\hat{u}_1,\dots,\hat{u}_K]$ becomes the reweight parameter, where $\hat{u}_k$ indicates how strongly the regression NN should consider the influence of view $k$. First, since more than one $u_k$ may have a value of 1, we normalize them so that the view classification NN learns balanced reweight parameters:

\tilde{u}_k = \frac{u_k}{\sum_{i=1}^{K} u_i}.    (11)

The view classification NN of layers $h^{(1)}_Q,\dots,h^{(Q)}_Q$ maps $\mathbf{z}$ to $\hat{\mathbf{u}}$ such that $\hat{\mathbf{u}}=h_Q(\mathbf{z}|\tilde{\mathbf{u}})$. Starting from the first layer input $\mathbf{z}$ (with $\boldsymbol{\phi}^{(0)}_Q=\mathbf{z}$), we calculate the view classification result $\hat{\mathbf{u}}$ at the output layer through the softmax activation:

\hat{\mathbf{u}} = \boldsymbol{\phi}^{(Q)}_Q = \mathrm{softmax}(\mathbf{W}^{(Q)}_Q \boldsymbol{\phi}^{(Q-1)}_Q + \mathbf{b}^{(Q)}_Q),    (12)

where $\mathbf{W}^{(q)}_Q$ and $\mathbf{b}^{(q)}_Q$ denote the weight and bias of layer $q\in\{1,\dots,Q\}$, respectively.

Regarding $\hat{\mathbf{u}}$ as the reweight parameter, we recalculate each subset $\mathbf{z}_k$ to $\mathbf{z}^{\prime}_k$:

\mathbf{z}^{\prime}_k = \hat{u}_k \odot \mathbf{z}_k,    (13)

and use it as the regression input. With the concatenated $\mathbf{z}^{\prime}=\{\mathbf{z}^{\prime}_1,\dots,\mathbf{z}^{\prime}_K\}$, the regression NN of layers $h^{(1)}_R,\dots,h^{(R)}_R$ maps $\mathbf{z}^{\prime}$ to $\hat{\mathbf{y}}$, such that $\hat{\mathbf{y}}=h_R(\mathbf{z}^{\prime}|\mathbf{y})$, and obtains, in our case, the Cartesian coordinate output $\hat{\mathbf{y}}=[\hat{y}_1,\hat{y}_2]$:

\hat{\mathbf{y}} = \boldsymbol{\phi}^{(R)}_R = \mathbf{W}^{(R)}_R \boldsymbol{\phi}^{(R-1)}_R + \mathbf{b}^{(R)}_R,    (14)

where $\mathbf{W}^{(r)}_R$ and $\mathbf{b}^{(r)}_R$ are the weight and bias of layer $r\in\{1,\dots,R\}$. The NNs update the parameters $\mathbf{W}_Q$, $\mathbf{b}_Q$, $\mathbf{W}_R$, and $\mathbf{b}_R$ to jointly minimize both Euclidean losses:

\min_{\mathbf{W}_Q,\mathbf{b}_Q,\mathbf{W}_R,\mathbf{b}_R} \alpha\{(y_1-\hat{y}_1)^2+(y_2-\hat{y}_2)^2\} + (1-\alpha)\Big\{\sum_k (\tilde{u}_k-\hat{u}_k)^2\Big\},    (15)

where $\alpha\in(0,1)$ denotes a trade-off parameter between the two losses. With a fingerprint database consisting of the trained weights and biases, we can obtain, for the test data, the regression output $(\hat{y}_1,\hat{y}_2)$, which is the localization result. Strictly speaking, our reweighting differs from the existing iterative reweighting (IR) methods [30, 31], which derive the reweighting from gradient directions. In contrast, we suggest a simpler method that derives the reweight parameters from the given information. Our reweighting method helps the supervised learning system decide which data view should be considered more important, and thereby achieve better performance.
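The following PyTorch sketch puts (11)–(15) together; the class structure and layer widths are illustrative assumptions consistent with Section 5.

    import torch
    import torch.nn as nn

    class ViewClassifiedRegression(nn.Module):
        def __init__(self, K, latent_dim=120):
            super().__init__()
            self.K, self.latent_dim = K, latent_dim
            in_dim = K * latent_dim
            self.classifier = nn.Sequential(        # h_Q: view classification NN
                nn.Linear(in_dim, 1000), nn.ReLU(),
                nn.Linear(1000, 500), nn.ReLU(),
                nn.Linear(500, K),
            )
            self.regressor = nn.Sequential(         # h_R: regression NN
                nn.Linear(in_dim, 1000), nn.ReLU(),
                nn.Linear(1000, 500), nn.ReLU(),
                nn.Linear(500, 2),                  # Cartesian output [y1, y2]
            )

        def forward(self, z):                       # z: concatenated latents, shape (batch, K*J)
            u_hat = torch.softmax(self.classifier(z), dim=-1)      # eq. (12)
            z_k = z.reshape(-1, self.K, self.latent_dim)
            z_prime = (u_hat.unsqueeze(-1) * z_k).flatten(1)       # reweighting, eq. (13)
            return self.regressor(z_prime), u_hat                  # eq. (14)

    def joint_loss(y, y_hat, u, u_hat, alpha=0.5):
        u_tilde = u / u.sum(dim=-1, keepdim=True)                  # normalization, eq. (11)
        return alpha * torch.sum((y - y_hat) ** 2) \
             + (1 - alpha) * torch.sum((u_tilde - u_hat) ** 2)     # trade-off, eq. (15)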

5 Field Experiment

We apply the VSDL system to indoor localization in a two-corridor real building environment with 43 training points and 9 test points, as in Figure 3(a). Each corridor is 7 m long, and the training and test points are spread at 0.5 m spacing. We install seven APs, each a laptop computer with a 3-antenna IWL 5300 NIC, placed at the corners of the corridors. For the transmitter, the same type of laptop with a single antenna is used to transmit WiFi packets on channel 36 at 5.18 GHz. The APs receive packets from the transmitter at the same time using the monitor mode, and we combine them into a multiview input at the server side. Each AP receives the WiFi packet with three antennas, forming three Tx-Rx radio channels. Each channel produces a CSI vector consisting of 30 subcarrier CSIs ($I=30$). We then take one of the three CSI vectors as the reference to produce two relative CSI vectors. In this way, we obtain an input sample $\mathbf{x}$ consisting of 420 ($=7\times(3-1)\times 30$) relative CSIs. We collect 100 sample packets for every training and test point in a noisy environment, which manifests high data fluctuation and makes it difficult to estimate the location by other analytical methods.

Figure 3: Two-corridor experiment. (a) Localization topology with 43 training points (open circles) and 9 test points (solid circles). Seven APs receive packets from the points at the same time. (b) Regression results from a variational DL system with no exclusion of non-informative views. (c) Regression results from the VSDL system with exclusion of non-informative views, which finds the dominant views effectively to enhance the localization accuracy.

In this scenario, we divide the input $\mathbf{x}$ into two views $\mathbf{x}_1$ and $\mathbf{x}_2$ ($K=2$), which represent the AP associations with corridors 1 and 2, respectively. Therefore, $\mathbf{x}_1$ has the information of AP1 to AP5, and $\mathbf{x}_2$ has that of AP3 to AP7. Along with the CSI data, the location label $\mathbf{y}$ in Cartesian coordinates and the given corridor label $\mathbf{u}$ are utilized for training. There are three cases for $\mathbf{u}$ depending on the training location: the location belongs 1) only to corridor 1 ($\mathbf{u}=[1,0]$), 2) only to corridor 2 ($\mathbf{u}=[0,1]$), or 3) to both corridors 1 and 2 ($\mathbf{u}=[1,1]$). The NNs for view-oriented learning update their parameters only when the view is informative ($u_k=1$). Here, the information from AP3 to AP5 is common to both views and considered informative for every sample.

Figure 4: Regression and view classification losses versus $J$.

Figure 5: Regression and view classification losses versus $\alpha$.

Throughout the system, we set the numbers of hidden layers $L$, $P$, $Q$, and $R$ to three. The number of nodes per layer decreases from 1000 to 500. The ReLU activation is used for the feature output of every layer except the last one. Adam optimizers update the parameters with a learning rate of $10^{-5}$. First, we verify that our joint optimization works well. As seen in Figure 4, the losses of regression and view classification are minimized together for any given number of latent variables $J$. This trend implies that our system properly operates to obtain an improved regression result assisted by the view selection. In addition, we obtain the best regression result with $J=120$ rather than keeping the size of the input CSI vector, which corresponds to a mean-field feature compression ratio of $2/7$. There should be a sufficiently large number of variables to estimate the posterior of the CSI data, while too many variables cause overfitting of the network. Naturally, the best $J$ depends on the scenario as well as the application requirements.
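For reference, a sketch of this training configuration, reusing the modules from the earlier sketches; the per-view input width of 600 assumes the real and imaginary parts of one view's $5\times 60$ relative CSIs are stacked, which is one possible encoding of the complex inputs.

    import torch

    encoders = [LatentSamplingNN(in_dim=600, latent_dim=120)    # one encoder per view
                for _ in range(2)]                              # K = 2, J = 120
    head = ViewClassifiedRegression(K=2, latent_dim=120)

    opt_enc = [torch.optim.Adam(enc.parameters(), lr=1e-5) for enc in encoders]
    opt_head = torch.optim.Adam(head.parameters(), lr=1e-5)     # Adam, learning rate 10^-5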

In our VSDL system, the trade-off parameter $\alpha$ influences the regression performance, as seen in Figure 5. As $\alpha$ approaches 0, the network becomes overly sensitive to view classification, which worsens both losses. On the other hand, as $\alpha$ approaches 1, a high view classification loss occurs, resulting in poor regression (localization) accuracy. We obtain the best regression accuracy with $\alpha=0.5$.

Figures 3(b) and 3(c) compare the regression results of a simple variational DL and our VSDL with $J=120$ and $\alpha=0.5$. We take three test locations as representative cases: locations A and B are near the ends of the corridors, and location C is at the intersection of the two corridors. The test results for locations A, B, and C are shown in blue, red, and green, respectively. In terms of multiview data learning, the variational DL extracts features from all corridor views, including the non-informative ones. Therefore, as seen in Figure 3(b), the regression results in many cases fall outside the training topology. In contrast, the VSDL system updates the learning parameters only for informative views to classify the dominant view, and hence achieves better localization, as seen in Figure 3(c).


Figure 6: Localization error CDF of the systems. The proposed VSDL significantly outperforms other previous systems.
Algorithm              Localization error (m)
VSDL (proposed)        0.7715
VDL (variational DL)   1.0607
SVR                    1.1037
DNN                    1.1246
BiLoc                  1.1844
CiFi                   1.9739

Table 1: Localization error comparison. The proposed VSDL improves the localization accuracy by 30%.

Further, we compare the proposed VSDL with several existing machine learning systems. To discriminate the CSI data, both classification and regression methods were introduced in previous papers to improve the localization accuracy. In our experiment environment, in addition to the simple variational DL, we implemented the RBM based classification BiLoc [14], the CNN based classification CiFi [17], the SVM based regression SVR [19], and a stand-alone DNN based regression. Figure 6 and Table 1 show the comparison results. We do not plot the results of CiFi in this scenario, since its convolutional analysis of batch information cannot extract proper features, resulting in a very poor localization accuracy of 1.97 m. First, the variational DL, whose results are shown in Figure 3(b), outperforms the other existing systems due to its use of variational inference. Here, we observe that the introduction of variational inference brings the key advantage for WiFi CSI localization in a noisy radio channel. In addition, the VSDL system, with its novel two-stage view-selective learning on the variational inference base, further improves the localization accuracy by 30%, from 1.10 m to 0.77 m. As the VSDL is very scalable by the nature of its design, we expect further performance improvements in environments with more corridors.

6 Conclusions

WiFi device localization has been a very attractive area of study, as WiFi networks are nowadays omnipresent in providing network application services to anonymous users, and it is anticipated to open new business opportunities as well as new technical challenges. The technical performance of WiFi localization has improved disruptively with the use of channel state information measured at multiple receiving APs.

In this paper, we introduce a machine learning design that combines variational deep learning very effectively with a multiview learning architecture. We report the observation that the latent vectors generated at the intermediate layer of variational deep learning form strong feature behaviors, providing a classification of effective view selection that greatly enhances the accuracy of localization. Our system, the view-selective deep learning, or VSDL, achieves a localization accuracy of 0.77 m, which manifests a more than 30% improvement in a two-corridor field experiment compared with the best known system based on SVM. The VSDL is completely scalable, as it exploits the benefit of multiview-based regression, and hence the WiFi localization network can be expanded without limit, for instance over a complex building structure. Our design of extracting features in the latent space to deal with informative and non-informative views in a multiview variational deep learning network is powerful enough to be applied to various applications with no limit on scalability.

Broader Impact

Indoor localization with radio signals, for example WiFi radio signals, which finds the location of a mobile device very accurately, can create a great deal of impact in mobile service applications. This can benefit off-line stores and services, such as shopping malls, hospitals, and public buildings, where location-based services can directly improve the quality of experience, especially when associated with social network services. Of course, such localization features can also harm the privacy of people in public. Radio localization may fail in a hot spot in the sense of crowded radio network traffic; however, such a failure may not cause critical problems beyond some frustration with internet-based applications.

References

  • [1] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • [2] Bo Zhao, Xiao Wu, Zhi-Qi Cheng, Hao Liu, Zequn Jie, and Jiashi Feng. Multi-view image generation from a single-view. In Proceedings of the 26th ACM international conference on Multimedia, pages 383–391, 2018.
  • [3] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
  • [4] Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. Variational autoencoder for semi-supervised text classification. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [5] Mark Girolami and Simon Rogers. Variational bayesian multinomial probit regression with gaussian process priors. Neural Computation, 18(8):1790–1817, 2006.
  • [6] Andrew G Wilson, Zhiting Hu, Russ R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016.
  • [7] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008.
  • [8] Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2626–2634, 2017.
  • [9] Guoquan Li, Enxu Geng, Zhouyang Ye, Yongjun Xu, Jinzhao Lin, and Yu Pang. Indoor positioning algorithm based on the improved rssi distance model. Sensors, 18(9):2820, 2018.
  • [10] Jie Xiong and Kyle Jamieson. Arraytrack: A fine-grained indoor location system. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 71–84, 2013.
  • [11] Manikanta Kotaru, Kiran Joshi, Dinesh Bharadia, and Sachin Katti. Spotfi: Decimeter level localization using wifi. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, pages 269–282, 2015.
  • [12] Xuyu Wang, Lingjun Gao, Shiwen Mao, and Santosh Pandey. Deepfi: Deep learning for indoor fingerprinting using channel state information. In 2015 IEEE wireless communications and networking conference (WCNC), pages 1666–1671. IEEE, 2015.
  • [13] Xiandi Li, Jingshi Shi, and Jianli Zhao. Defe: indoor localization based on channel state information feature using deep learning. In Journal of Physics: Conference Series, volume 1303, page 012067. IOP Publishing, 2019.
  • [14] Xuyu Wang, Lingjun Gao, and Shiwen Mao. Biloc: Bi-modal deep learning for indoor localization with commodity 5ghz wifi. IEEE Access, 5:4209–4220, 2017.
  • [15] Brieuc Berruet, Oumaya Baala, Alexandre Caminada, and Valery Guillet. Delfin: A deep learning based csi fingerprinting indoor localization in iot context. In 2018 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pages 1–8. IEEE, 2018.
  • [16] Hao Chen, Yifan Zhang, Wei Li, Xiaofeng Tao, and Ping Zhang. Confi: Convolutional neural networks based indoor wi-fi localization using channel state information. IEEE Access, 5:18066–18074, 2017.
  • [17] Xuyu Wang, Xiangyu Wang, and Shiwen Mao. Cifi: Deep convolutional neural networks for indoor localization with 5 ghz wi-fi. In 2017 IEEE International Conference on Communications (ICC), pages 1–6. IEEE, 2017.
  • [18] Pengpeng Chen, Fen Liu, Shouwan Gao, Peihao Li, Xu Yang, and Qiang Niu. Smartphone-based indoor fingerprinting localization using channel state information. IEEE Access, 7:180609–180619, 2019.
  • [19] Rui Zhou, Jiesong Chen, Xiang Lu, and Jia Wu. Csi fingerprinting with svm regression to achieve device-free passive localization. In 2017 IEEE 18th International Symposium on A World of Wireless, Mobile and Multimedia Networks (WoWMoM), pages 1–9. IEEE, 2017.
  • [20] Zhihui Gao, Yunfan Gao, Sulei Wang, Dan Li, Yuedong Xu, and Hongbo Jiang. Crisloc: Reconstructable csi fingerprinting for indoor smartphone localization. arXiv preprint arXiv:1910.06895, 2019.
  • [21] Hsiao-Chien Tsai, Chun-Jie Chiu, Po-Hsuan Tseng, and Kai-Ten Feng. Refined autoencoder-based csi hidden feature extraction for indoor spot localization. In 2018 IEEE 88th vehicular technology conference (VTC-Fall), pages 1–5. IEEE, 2018.
  • [22] Xiaochao Dang, Jiaju Ren, Zhanjun Hao, Yili Hei, Xuhao Tang, and Yan Yan. A novel indoor localization method using passive phase difference fingerprinting based on channel state information. International Journal of Distributed Sensor Networks, 15(4):1550147719844099, 2019.
  • [23] Chaur-Heh Hsieh, Jen-Yang Chen, and Bo-Hong Nien. Deep learning-based indoor localization using received signal strength and channel state information. IEEE access, 7:33256–33267, 2019.
  • [24] Tahsina Farah Sanam and Hana Godrich. A multi-view discriminant learning approach for indoor localization using amplitude and phase features of csi. IEEE Access, 8:59947–59959, 2020.
  • [25] Daniel Halperin, Wenjun Hu, Anmol Sheth, and David Wetherall. Predictable 802.11 packet delivery from wireless channel measurements. ACM SIGCOMM Computer Communication Review, 41(4):159–170, 2011.
  • [26] Ralph Schmidt. Multiple emitter location and signal parameter estimation. IEEE transactions on antennas and propagation, 34(3):276–280, 1986.
  • [27] Asaf Tzur, Ofer Amrani, and Avishai Wool. Direction finding of rogue wi-fi access points using an off-the-shelf mimo–ofdm receiver. Physical Communication, 17:149–164, 2015.
  • [28] Ion Muslea, Steven Minton, and Craig A Knoblock. Selective sampling with redundant views. In AAAI/IAAI, pages 621–626, 2000.
  • [29] William Stafford Noble et al. Support vector machine applications in computational biology. Kernel methods in computational biology, 71:92, 2004.
  • [30] Rick Chartrand and Wotao Yin. Iteratively reweighted algorithms for compressive sensing. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3869–3872. IEEE, 2008.
  • [31] Karthik Mohan and Maryam Fazel. Iterative reweighted algorithms for matrix rank minimization. Journal of Machine Learning Research, 13(Nov):3441–3473, 2012.