Recovering Surveillance Video Using RF Cues
Abstract
Video capture is the most extensively utilized source of human perception due to its intuitively understandable nature. A usable video capture often requires favorable environmental conditions such as ample ambient light, unobstructed space, and a proper camera angle. In contrast, wireless measurements are more ubiquitous and have fewer environmental constraints. In this paper, we propose CSI2Video, a novel cross-modal method that leverages only WiFi signals from commercial devices and a source of human identity information to recover fine-grained surveillance video in a real-time manner. Specifically, two tailored deep neural networks are designed to conduct the cross-modal mapping and video generation tasks respectively. We make use of an auto-encoder-based structure to extract pose features from WiFi frames. Afterward, the extracted pose features and identity information are merged to generate synthetic surveillance video. Our solution generates realistic surveillance videos without any expensive wireless equipment and is ubiquitous, cheap, and real-time.
Introduction
In the human perception field, camera-based and radio frequency (RF)-based approaches have been extensively researched and have achieved substantial success in many tasks such as 2D (Alp Güler, Neverova, and Kokkinos 2018; Sun et al. 2019; Cao et al. 2017) and 3D (Chen and Ramanan 2017; Rayat Imtiaz Hossain and Little 2018; Tome, Russell, and Agapito 2017) pose estimation and body segmentation (Wang et al. 2020a, b; Li et al. 2022c, a). Nevertheless, both existing camera-based and RF-based approaches suffer from technical or practical issues, for instance, illumination and occlusion constraints for camera-based approaches, and equipment cost and energy consumption for LiDAR-based approaches (Wang et al. 2019). The latest works (Wang et al. 2019; Jiang et al. 2020; Huang et al. 2021b, a) show that WiFi signals can achieve human perception precision comparable to camera-based approaches, which can be further exploited in many fields such as surveillance, gaming, and human-robot interaction. However, since a WiFi signal is a 1D sequence, it cannot carry as much information as a 2D image. Limited by this, previous works have not attempted to recover complete frames from WiFi signals. In this paper, we take a step forward and recover RGB surveillance video by merging WiFi signals with a weak source of the human identity profile.


We propose CSI2Video, an innovative cross-modal method that leverages only WiFi signals from commercial devices and a source of human identity information to recover fine-grained surveillance video in a real-time manner. The proposed CSI2Video takes WiFi signals as input and outputs color video frames that accurately reflect the localization and movements of human instances. To make the system more applicable, the WiFi signals used in CSI2Video are designed to be collected from off-the-shelf WiFi routers. Thanks to the rapid development of the internet in recent years, WiFi routers have been widely deployed in all areas of daily life, so WiFi signals can be obtained without additional cost even outside the lab. Moreover, in CSI2Video, we leverage out-sourced human identity information to help the network determine which instance to display on the screen. For example, this identity information can be obtained from surveillance cameras in a security scenario, from user-defined images for virtual portrait generation, or from any kind of personal information provided by intelligent on-body devices. Because our framework places no constraints on the source of the identity profile, in the rest of the paper we only show results that use video frames from surveillance video as the identity source, for convenience. Specifically, we assume the visual perception area and the RF perception area are spatially and temporally separated, since we only need to recognize a human identity once instead of tracking it. By separating the two perception areas, the CSI2Video system can work effectively even when real-time surveillance video is unavailable (e.g., under attack or malfunction).
Figure 1 shows an example CSI2Video system, which contains one surveillance camera and one WiFi router pair. The system works in a two-stage manner. In the first stage, the surveillance camera records video in the normal way, with the aim of obtaining the appearance profiles of people who enter the RF perception area. Then, in the second stage, the WiFi router pair records the wireless measurements, which contain the direct localization and movement information, and transmits the data to the central server. The central server conducts the cross-modal mapping and video generation processes, taking advantage of both the real-time wireless measurements and the previously recorded visual captures, to produce the synthetic surveillance video.
Our contributions are summarized as follows.
- To better visualize the spatial information encoded in WiFi signals and to exploit the potential of WiFi signals in the human perception field, we present a novel cross-modal video generation system, CSI2Video. To the best of our knowledge, this is the first attempt to generate realistic RGB surveillance video from WiFi signals in a real-time manner.
- We propose a novel cross-modal mapping network that maps wireless measurements to human pose features.
- We propose a novel video generation network that takes pose features instead of keypoint coordinates as input.
- We propose a novel evaluation metric for RF-based video generation that efficiently reflects both the localization error and the image quality of the generated human figures.
Related Works
Human Perception
Camera-Based Human Perception
Most works on human perception are rooted in the use of cameras, which enables the well-developed feature extraction methods of the computer vision field. To date, several human perception tasks have achieved impressive performance and have been widely deployed in daily life, such as cognitive process understanding (Li et al. 2020), pose estimation (Alp Güler, Neverova, and Kokkinos 2018; Sun et al. 2019; Cao et al. 2017; Chen and Ramanan 2017; Rayat Imtiaz Hossain and Little 2018; Tome, Russell, and Agapito 2017), body segmentation (Wang et al. 2020a; Li et al. 2021, 2022b, 2022d; Zhao et al. 2022), as well as activity recognition (Ijjina and Chalavadi 2017; Garcia, Morerio, and Murino 2018; Kniaz et al. 2018; Luo, Li, and Younes 2021; Li, Luo, and Younes 2020). More recently, cross-modal fusion has been introduced to improve performance and avoid environmental constraints. For example, Zou et al. (2019) proposed a cross-modal model that merges visual and wireless features to conduct indoor activity recognition and achieved nearly perfect results. Other works (Premebida et al. 2014; Matti, Ekenel, and Thiran 2017; Han et al. 2015) combined LiDAR with RGB cameras for pedestrian detection.
RF-Based Human Perception
The most popular sensing modalities for RF-based systems are 3D point clouds captured by LiDAR and depth maps captured by Frequency Modulated Continuous Wave (FMCW) radar. LiDAR-based sensing is a well-developed method for many human perception tasks such as human detection (Maturana and Scherer 2015) and tracking (Leigh et al. 2015). The FMCW radar system was introduced by Adib et al. (Adib et al. 2015), who first attempted to capture a coarse human figure using dedicated RF radios. After that, they continued to exploit the usability of FMCW radios for 2D (Zhao et al. 2018a) and 3D (Zhao et al. 2018b) pose estimation through walls and occlusions. Compared with the above methods, which rely heavily on dedicated equipment, WiFi signals provide a more ubiquitous and cheaper solution. WiPose (Jiang et al. 2020) generated 3D human skeletons in a single-person scenario. Person in WiFi (Wang et al. 2019) proposed an end-to-end body segmentation method and a two-stage pose estimation method using only commercial WiFi transceivers.
On-Body Devices-Based Human Perception
With the rapid advances in information transmission and battery storage, it has become much easier to deploy on-body devices. Compared to the fixed perception areas of camera-based and RF-based approaches, on-body devices provide a flexible solution. Since on-body sensors can be made tiny and portable, they are convenient to carry around. In the past, the main concern with on-body devices was battery capacity, but battery-free and wireless charging technologies have greatly reduced this problem. In addition, on-body devices can draw on a variety of signal sources to monitor the human body; for example, the most popular biosignal sources are ECG, EMG, and EEG, while the main motion signal sources are acceleration and angular velocity. With these signals, on-body devices have succeeded in recognizing human activity (Lara and Labrador 2012), mood (Zenonos et al. 2016), tachycardia (Acharya et al. 2017), and more.

Image & Video Generation
Deep generative models have been extensively investigated for image and video generation tasks. Variational Auto-Encoders (VAEs) (Kingma and Welling 2013) and Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) are the most popular models. After the debut of GANs, many follow-up works were committed to solving problems encountered in the original GAN. Arjovsky et al. noticed the mode collapse and unstable training of GANs and proposed the improved WGAN (Arjovsky, Chintala, and Bottou 2017), which was further refined into WGAN-GP (Gulrajani et al. 2017). Mirza and Osindero (2014) explored GANs from a conditional perspective and first proposed cGAN to generate images controlled by class attributes. Besides, many other variants of GANs were proposed to improve the original GAN in different aspects. For example, DCGAN (Radford, Metz, and Chintala 2015) enables CNNs to conduct the feature extraction, StarGAN (Choi et al. 2018) allows image-to-image translation among multiple domains, and PatchGAN (Isola et al. 2017) introduces a patch-level discriminator. For human image generation, Yang et al. (2018) proposed a pose-guided method to generate synthetic human video in a disentangled way: plausible motion prediction and coherent appearance generation. Similarly, Cai et al. also proposed a two-stage human motion video generation method via GANs (Cai et al. 2018). Although GANs are a powerful tool for generating videos and have made exceptional breakthroughs in many tasks, we do not employ them in this work since the CSI2Video system has a unique ground truth; therefore, the Mean Square Error (MSE) loss does not impair image quality.
CSI2Video
System Overview
CSI2Video is a cross-modal system that merges wireless and visual information to produce real-time surveillance video. Specifically, the visual and wireless measurements are gathered spatially and temporally separated to obtain appearance and localization features respectively. In our system, the wireless information is extracted from standard IEEE 802.11n WiFi signals using commercial devices. Due to the different electromagnetic properties of the background environment and human bodies, WiFi signals embed rich spatial information about human localization and movements. However, WiFi signals transmitted at the same time may not be received simultaneously by the antennas, owing to the different propagation paths that arise as the signal penetrates, refracts, and reflects. Fortunately, the WiFi communication system adopts channel state information (CSI) to describe the current channel condition between each transmitting ($T_x$) and receiving ($R_x$) antenna pair. This way, the spatial information of the human body is encoded into a matrix $c \in \mathbb{C}^{N_t \times N_r \times N_s}$, where $N_t$ and $N_r$ denote the numbers of transmitting and receiving antennas respectively and $N_s$ denotes the number of orthogonal frequency division multiplexing (OFDM) subcarriers. Let $c_i$ be a single CSI measurement, where $c_i \in \mathbb{C}^{N_t \times N_r \times N_s}$. For $n$ continuous CSI measurements, the recorded CSI data can be represented as $C = \{c_1, c_2, \dots, c_n\}$. The human identity profile is used to give the network side information about which human instance should be displayed, since the wireless measurements contain no explicit identity information. We choose video captures as our source of identity profile in this work. For $m$ continuous video captures, the recorded frames can be denoted as $V = \{v_1, v_2, \dots, v_m\}$. Therein, $v_j$ indicates a single frame, where $v_j \in \mathbb{R}^{H \times W \times 3}$.
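For concreteness, the minimal sketch below lays out the data shapes implied by this notation; the 3 x 3 x 30 dimensions follow the three-antenna, 30-subcarrier setup described later in the experiment section and are illustrative rather than prescriptive.

```python
import numpy as np

# N_t transmitting antennas, N_r receiving antennas, N_s OFDM subcarriers
# (3 x 3 x 30 matches the Intel 5300 setup used later in the paper).
N_t, N_r, N_s = 3, 3, 30

# A single complex CSI measurement c_i (random values stand in for a real capture).
c_i = np.random.randn(N_t, N_r, N_s) + 1j * np.random.randn(N_t, N_r, N_s)

# n consecutive measurements form the recorded CSI sequence C = {c_1, ..., c_n};
# m recorded frames form the identity-profile video V = {v_1, ..., v_m}.
n, m, H, W = 500, 100, 720, 1280
C = np.zeros((n, N_t, N_r, N_s), dtype=np.complex64)
V = np.zeros((m, H, W, 3), dtype=np.uint8)
print(C.shape, V.shape)   # (500, 3, 3, 30) (100, 720, 1280, 3)
```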
As depicted in Figure 2, CSI2Video consists of two parts: cross-modal mapping and video generation. The cross-modal mapping network takes CSI measurements as input and outputs Joint Heat Maps (JHMs) and Part Affinity Fields (PAFs) (Cao et al. 2017). The PAFs and JHMs are then concatenated with selected video frames that contain human appearance features and fed into the video generation network. Finally, the video generation network merges the human pose features and appearance features to produce the synthetic surveillance video. In general, CSI2Video converts the spatial information of the human body carried by the wireless signals into ordinary RGB surveillance video.
Cross-Modal Mapping

The cross-modal mapping network takes CSI measurements as input and outputs PAFs and JHMs. So, to train the cross-modal mapping network, we need a set of synchronized CSI measurements and their corresponding PAF and JHM pairs. For convenience, we employ OpenPose (Cao et al. 2017) to compute the PAFs and JHMs for the training phase and denote OpenPose as $\mathcal{O}(\cdot)$. Let $F = \{f_1, f_2, \dots, f_m\}$ be the frames used to compute body pose, where $f_i$ represents a single recorded frame. Since a CSI measurement is a 1D time series with far fewer points than an image, it cannot carry as much information as a frame. To ease the influence of this unbalanced information density, we sample the CSI measurements at a higher frequency and pair multiple CSI measurements with one video frame. Let $k$ be the number of CSI measurements corresponding to one video frame. Moreover, to feed a complex matrix to the neural network, we use the amplitude of each CSI measurement to represent the whole complex matrix. Therefore, the data pairs used to train the cross-modal mapping network can be represented as Equation 1, where the CSI measurements and video frames are assumed to have already been synchronized.
$\mathcal{D} = \big\{ \big( |c_{(i-1)k+1}|, \dots, |c_{ik}| \big),\ \mathcal{O}(f_i) \big\}_{i=1}^{m}$  (1)
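A minimal sketch of how the training pairs of Equation 1 could be assembled, assuming $k = 5$ CSI amplitude samples per video frame; the `openpose_jhm_paf` helper and the PAF channel count are hypothetical stand-ins for the OpenPose teacher $\mathcal{O}(\cdot)$, not the authors' released code.

```python
import numpy as np

def openpose_jhm_paf(frame):
    """Hypothetical stand-in for the OpenPose teacher O(.): returns (JHMs, PAFs)."""
    h, w = frame.shape[:2]
    # 14 keypoint heat maps; the PAF channel count (e.g. 13 limbs x 2) is an assumption.
    return np.zeros((14, h, w)), np.zeros((26, h, w))

def build_pairs(csi_amp, frames, k=5):
    """Pair k consecutive CSI amplitude samples with the teacher labels of one frame.

    csi_amp : (n, N_t, N_r, N_s) amplitude sequence, sampled k times faster than video
    frames  : (m, H, W, 3) synchronized RGB frames
    """
    pairs = []
    for i, frame in enumerate(frames):
        chunk = csi_amp[i * k:(i + 1) * k]      # the k CSI samples matched to frame i
        if len(chunk) < k:
            break
        pairs.append((chunk, openpose_jhm_paf(frame)))
    return pairs
```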
Video Generation
The video generation network takes two sources of input. The first source is the generated PAFs ($P$) and JHMs ($J$) from the cross-modal mapping network, which reflect the localization and movement features. The second source is the visual captures, which provide side information about the human instances. Although $V$ is a continuous sequence of video captures, we only need a few of its frames to give the video generation network cues about human appearance. Let $s$ be the number of frames selected from $V$. Given the PAFs $P$, JHMs $J$, and selected video frames $\{v_{r_1}, \dots, v_{r_s}\}$, the video generation network $\mathcal{G}$ outputs synthetic surveillance frames as in Equation 2. Specifically, we interpolate all inputs to the same spatial size, where each source keeps its own number of channels $d$.
$\hat{v}_i = \mathcal{G}\big( J_i,\ P_i,\ \{v_{r_1}, \dots, v_{r_s}\} \big)$  (2)
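A small sketch of how the generator inputs could be assembled; the target resolution, the 14/26 channel counts, and $s = 2$ appearance frames are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def assemble_generator_input(jhms, pafs, appearance_frames, size=(256, 256)):
    """Interpolate every input source to a common resolution and stack channels.

    jhms              : (1, C_j, h, w) joint heat maps from the mapping network
    pafs              : (1, C_p, h, w) part affinity fields from the mapping network
    appearance_frames : list of (1, 3, H, W) selected RGB frames giving identity cues
    """
    sources = [jhms, pafs] + list(appearance_frames)
    resized = [F.interpolate(x, size=size, mode='bilinear', align_corners=False)
               for x in sources]
    return torch.cat(resized, dim=1)   # (1, C_j + C_p + 3*s, *size)

# Example with dummy tensors (s = 2 appearance frames).
x = assemble_generator_input(torch.zeros(1, 14, 46, 46),
                             torch.zeros(1, 26, 46, 46),
                             [torch.zeros(1, 3, 720, 1280) for _ in range(2)])
print(x.shape)   # torch.Size([1, 46, 256, 256])
```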
Experiment
Hardware Setup and Data Preparation
We collected the data in an office room with 5 volunteers. During the experiment phase, zero to three volunteers were asked to perform walking, sitting, hand waving, and several random movements concurrently in the perception area. We leveraged one Logitech 720p camera and two Intel 5300 NICs, each equipped with three antennas, to record the CSI and video frame pairs. As shown in Figure 4, the NICs and the RGB camera were placed at heights of 1 m and 2 m respectively. The antennas of each NIC were uniformly spaced 2.6 cm apart, like common commercial WiFi routers, and the two NICs were placed about 6 m away from each other. Moreover, the camera was set to record 720p RGB video at 7.5 FPS. For CSI measurement, we used an open-source tool (Halperin et al. 2011) to record the signals and set the NICs to communicate with a bandwidth of 20 MHz centered at 5.6 GHz in the 5 GHz WiFi band. Under this setting, we recorded the CSI measurements of 30 subcarriers with a sampling rate of 100 Hz.


Due to the unstable nature of wireless communication, the sampling intervals of the measured CSI signals are not strictly equal to 0.01 s. Thus, before synchronizing the CSI measurements and video frames, we conducted data cleaning to remove outliers and resampled the CSI measurements to exactly 100 Hz. In our implementation, $k$ is set to 5, which means 5 CSI measurements are matched with one video frame. Consequently, we created 24,082 pairs, of which 75% are used for training and 25% for testing. Specifically, to better fit our network structure, we interpolated the CSI measurements to an image-size tensor.
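The cleaning and resampling step can be sketched as below; the outlier rule and the use of linear interpolation are assumptions for illustration, since the exact procedure is not specified.

```python
import numpy as np

def resample_csi(timestamps, csi_amp, fs=100.0):
    """Resample irregularly timed CSI amplitudes onto a uniform 100 Hz grid.

    timestamps : (n,) increasing packet arrival times in seconds (not strictly 0.01 s apart)
    csi_amp    : (n, d) flattened amplitude vectors (d = N_t * N_r * N_s)
    """
    # Drop obvious outliers: packets whose inter-arrival time is far from 1/fs
    # (the tolerance below is a placeholder, not the paper's rule).
    dt = np.diff(timestamps, prepend=timestamps[0] - 1.0 / fs)
    keep = np.abs(dt - 1.0 / fs) < 5.0 / fs
    t, x = timestamps[keep], csi_amp[keep]

    # Linear interpolation onto the uniform grid, one antenna/subcarrier column at a time.
    t_uniform = np.arange(t[0], t[-1], 1.0 / fs)
    resampled = np.stack([np.interp(t_uniform, t, x[:, j]) for j in range(x.shape[1])],
                         axis=1)
    return t_uniform, resampled
```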
For the appearance frames, we randomly select them from $V$ in our implementation for convenience. Since we only assume that these frames contain human appearance information and exploit nothing beyond that, selecting them from $V$ is no different from using any other source and does not negatively affect the overall results. We fix $s$ to a small number of frames in our implementation.
Networks
In CSI2Video, two deep neural network structures are tailored. The cross-modal mapping network maps the CSI tensor to JHMs and PAFs, and the video generation network makes use of the JHMs, PAFs, and video frames to generate the synthetic surveillance video. Note that this is a proof-of-concept experiment that aims to show the ability to recover surveillance video from CSI measurements; the results can be further improved by careful frame selection and manual annotation.
Cross-Modal Mapping Network

Figure 5 shows the details of the cross-modal mapping network. It is based on an auto-encoder structure, and the embedded residual blocks (He et al. 2016) serve for domain transformation. To obtain a larger receptive field, the first two layers use a larger convolution kernel than the remaining layers. In our implementation, we chose 14 human-body keypoints for the cross-modal mapping network to learn: Nose, Neck, RShoulder, RElbow, RWrist, LShoulder, LElbow, LWrist, RHip, RKnee, LHip, LKnee, RAnkle, and LAnkle. These 14 keypoints reflect the essential pose information needed to recover the human body in a video frame. Thereby, the output of the cross-modal mapping network consists of one heat map per keypoint for the JHMs and one two-channel vector field per limb for the PAFs.
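As a rough illustration of this encoder, residual-block, decoder layout; channel widths, kernel sizes, block counts, and the 30-channel CSI input are placeholders, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return torch.relu(self.body(x) + x)

class CrossModalMapping(nn.Module):
    """Encoder -> residual blocks (domain transform) -> decoder with JHM/PAF heads."""
    def __init__(self, in_ch=30, jhm_ch=14, paf_ch=26, width=64, n_blocks=4):
        super().__init__()
        # Larger kernels in the first two layers for a larger receptive field.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, width, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 7, stride=2, padding=3), nn.ReLU(inplace=True))
        self.transform = nn.Sequential(*[ResidualBlock(width) for _ in range(n_blocks)])
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.jhm_head = nn.Conv2d(width, jhm_ch, 3, padding=1)
        self.paf_head = nn.Conv2d(width, paf_ch, 3, padding=1)

    def forward(self, x):
        h = self.decoder(self.transform(self.encoder(x)))
        return self.jhm_head(h), self.paf_head(h)

jhms, pafs = CrossModalMapping()(torch.zeros(1, 30, 96, 96))  # (1, 14, 96, 96), (1, 26, 96, 96)
```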
As depicted in Figure 3, we train our cross-modal mapping network using a teacher-student architecture. By doing this, we obtain labels without laborious manual involvement. Moreover, we use the mean square error (MSE) criterion and compute the loss function as Equation 3,
$\mathcal{L} = \lambda_J \mathcal{L}_J + \lambda_P \mathcal{L}_P$  (3)
where $\mathcal{L}_J$ and $\mathcal{L}_P$ are the losses on the JHMs and PAFs respectively, and $\lambda_J$ and $\lambda_P$ are scalar weights that balance the two terms. Further, since the JHMs and PAFs are almost zero everywhere except at body parts, we add an attention mechanism to make the network converge faster. The JHM loss and the pixel-wise weight can be computed as Equation 4 and Equation 5.
$\mathcal{L}_J = \sum_{x,y} W(x,y)\,\big(\hat{J}(x,y) - J^{*}(x,y)\big)^2$  (4)
$W(x,y) = \alpha\, J^{*}(x,y) + \beta$  (5)
where $\alpha$ and $\beta$ are scalars that adjust the attention between the JHMs and the background, $J^{*}$ is the ground truth of the JHMs, and $\hat{J}$ is the output of the cross-modal network. Similarly, we compute the PAF loss $\mathcal{L}_P$.
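A sketch of the attention-weighted MSE of Equations 3 to 5, under the assumption that the pixel-wise weight is an affine function of the ground-truth maps; the alpha, beta, and lambda values are placeholders, not the paper's settings.

```python
import torch

def weighted_mse(pred, target, alpha=2.0, beta=1.0):
    """Attention-weighted MSE: pixels on body parts (large target values) receive
    higher weight, background pixels receive weight beta (placeholder values)."""
    w = alpha * target.abs() + beta
    return (w * (pred - target) ** 2).mean()

def mapping_loss(jhm_pred, jhm_gt, paf_pred, paf_gt, lam_j=1.0, lam_p=1.0):
    """Total cross-modal mapping loss as in Equation 3 (lambda weights are placeholders)."""
    return lam_j * weighted_mse(jhm_pred, jhm_gt) + lam_p * weighted_mse(paf_pred, paf_gt)
```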

Video Generation Network
Before feeding the generated PAFs, JHMs, and appearance profiles to the video generation network, all inputs are interpolated to the same spatial resolution. The video generation network inherits the auto-encoder structure of the cross-modal mapping network but with fewer residual blocks. In addition, unlike the cross-modal mapping network, the video generation network has a symmetric structure. Thus, the output frame is an RGB tensor at the same spatial resolution as the interpolated inputs, which we fix in our implementation.
Since we trained the network with a fixed background, we need the network to pay more attention to the human body reconstruction task. For this reason, we utilize Mask-RCNN, denoted as $\mathcal{M}(\cdot)$, to generate human masks that segment the foreground and background so that the loss can be computed separately. The loss functions can be computed as Equation 6 and Equation 7, where $B$ is the selected background image and $\hat{v}_i$ is the generated frame corresponding to the real frame $v_i$.
$\mathcal{L}_{fg} = \big\|\, \mathcal{M}(v_i) \odot (\hat{v}_i - v_i) \,\big\|_2^2$  (6)
$\mathcal{L}_{bg} = \big\|\, \big(1 - \mathcal{M}(v_i)\big) \odot (\hat{v}_i - B) \,\big\|_2^2$  (7)
Consequently, the total loss of the video generation network can be computed as Equation 8, where $\lambda_{fg}$ and $\lambda_{bg}$ are scalars that balance the two losses.
$\mathcal{L}_{gen} = \lambda_{fg}\,\mathcal{L}_{fg} + \lambda_{bg}\,\mathcal{L}_{bg}$  (8)
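A minimal sketch of the foreground/background loss, following the reconstructed form of Equations 6 to 8 above; the weights and tensor names are illustrative.

```python
import torch

def generation_loss(gen_frame, real_frame, person_mask, background,
                    lam_fg=1.0, lam_bg=1.0):
    """Foreground/background losses of Equations 6-8 (weights are placeholders).

    gen_frame, real_frame, background : (B, 3, H, W) tensors in [0, 1]
    person_mask                       : (B, 1, H, W) binary human mask from Mask-RCNN
    """
    fg = (person_mask * (gen_frame - real_frame) ** 2).mean()        # Eq. 6: human region
    bg = ((1 - person_mask) * (gen_frame - background) ** 2).mean()  # Eq. 7: fixed background
    return lam_fg * fg + lam_bg * bg                                 # Eq. 8: weighted sum
```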
Implementation Details
We train each network for 20 epochs using the PyTorch framework. An Adam optimizer with momentum parameters $\beta_1$ and $\beta_2$ is adopted in all training runs. The learning rate starts from 1e-6 for the cross-modal mapping network and from 1e-3 for the video generation network, and is divided by 10 every 5 epochs. Moreover, the loss weights $\lambda_J$, $\lambda_P$, $\lambda_{fg}$, $\lambda_{bg}$ and the attention scalars $\alpha$ and $\beta$ are fixed in our experiments.
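A minimal sketch of the optimizer and schedule described above; the Adam momentum values are left at library defaults because the exact values are not reproduced here, and `model` is a placeholder.

```python
import torch

def make_optim(model, base_lr):
    """Adam optimizer with a step schedule dividing the learning rate by 10 every 5 epochs."""
    opt = torch.optim.Adam(model.parameters(), lr=base_lr)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.1)
    return opt, sched

# e.g. base_lr = 1e-6 for the cross-modal mapping network, 1e-3 for video generation.
opt, sched = make_optim(torch.nn.Linear(4, 4), base_lr=1e-6)
for epoch in range(20):
    # ... one training epoch (forward, loss, opt.step()) ...
    sched.step()   # epoch-level schedule: lr /= 10 every 5 epochs
```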
Results
Evaluation Metrics
We evaluate the generated JHMs & PAFs and recovered video frames separately.
JHMs & PAFs Evaluation Metrics
We employ the Percentage of Correct Keypoints (PCK) as the JHM and PAF evaluation metric due to the difficulty of directly evaluating heat maps and 2D vector fields. Before computing the PCK, we assemble the JHMs and PAFs into keypoints and calculate their corresponding coordinates using the algorithms provided by OpenPose. The PCK, computed as Equation 9, reflects the distance between the ground-truth body keypoints and the predicted keypoints.
$\mathrm{PCK}_i(\sigma) = \dfrac{1}{H}\sum_{h=1}^{H} \mathbb{1}\!\left(\dfrac{\|\, p_i^{h} - \hat{p}_i^{h} \,\|_2}{d^{h}} \le \sigma\right)$  (9)
In Equation 9, $H$ is the total number of detected human instances, $i$ is the index of the body keypoint, $p_i^{h}$ and $\hat{p}_i^{h}$ are the ground-truth and predicted coordinates of keypoint $i$ for instance $h$, $d^{h}$ is the diagonal length of the instance bounding box, and $\mathbb{1}(\cdot)$ is an indicator function that equals 1 if the condition in brackets is satisfied and 0 otherwise.
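A straightforward NumPy sketch of Equation 9 for a single keypoint index; the array names are illustrative.

```python
import numpy as np

def pck(pred_kpts, gt_kpts, bbox_diags, sigma=0.2):
    """PCK of Equation 9 for one keypoint index across H detected instances.

    pred_kpts, gt_kpts : (H, 2) predicted / ground-truth coordinates of keypoint i
    bbox_diags         : (H,) diagonal length of each instance's bounding box
    sigma              : normalized distance threshold (e.g. 0.2 for PCK@20)
    """
    dist = np.linalg.norm(pred_kpts - gt_kpts, axis=1) / bbox_diags
    return float(np.mean(dist <= sigma))
```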
Recovered Video Evaluation Metrics
As in Equation 7, the video generation network is trained with a fixed background label. Therefore, it is not meaningful to evaluate the whole frame using conventional synthetic-image evaluation metrics. To focus on the generated human body, we use Mask-RCNN (He et al. 2017), pretrained on the COCO dataset, to create masks from the raw video frames and the recovered frames, and then calculate the intersection over union (IoU) between them. This idea is inherently similar to the Inception Score (IS), which utilizes another well-trained neural network as the evaluator. This way, the well-trained Mask-RCNN automatically evaluates the quality of the synthetic human body, and the IoU efficiently reflects the localization and pose discrepancy between the real and recovered video frames. Given the above, we denote Mask-RCNN as $\mathcal{M}(\cdot)$ and compute the IoU as Equation 10,
$\mathrm{IoU}(\sigma) = \dfrac{1}{T}\sum_{t=1}^{T} \mathbb{1}\!\left(\dfrac{|\mathcal{M}(v_t)\cap\mathcal{M}(\hat{v}_t)|}{|\mathcal{M}(v_t)\cup\mathcal{M}(\hat{v}_t)|} \ge \sigma\right)$  (10)
where $T$ is the total number of frames, $\sigma$ is the IoU threshold, and $\mathbb{1}(\cdot)$ is defined the same as in Equation 9.
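Under the reconstructed form of Equation 10, the metric can be sketched as the fraction of frames whose mask IoU exceeds a threshold; extracting the binary masks from the Mask-RCNN outputs is assumed to happen elsewhere.

```python
import numpy as np

def mask_iou(m1, m2):
    """IoU between two binary masks."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union > 0 else 0.0

def video_iou_score(real_masks, gen_masks, sigma=0.5):
    """Fraction of frames whose Mask-RCNN mask IoU exceeds the threshold sigma.

    real_masks, gen_masks : lists of (H, W) boolean masks, one per frame,
                            produced by Mask-RCNN on raw and recovered frames.
    """
    ious = [mask_iou(r, g) for r, g in zip(real_masks, gen_masks)]
    return float(np.mean([iou >= sigma for iou in ious]))
```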
Pose Estimation Quality
Figure 8 describes the PCKs of all 14 keypoints extracted from the cross-modal mapping network. The mean PCK@20 over all 14 keypoints is about 70%, which indicates that the majority of keypoints do not deviate from the ground truth by more than 20% of the diagonal length of the instance's bounding box. This is an acceptable result since the spatial resolution of commercial WiFi signals is approximately 4 cm. To further explore the gap with the vision-based approach, we present the mPCK of both our method and OpenPose in Table 1. The mean PCK@20 drops about 10% when the source is replaced from image pixels to wireless signals.
Video Reconstruction Quality
In Figure 7, we show our video generation results for the single-person scenario. The raw frames are recorded directly from the camera, while the O-frames and C-frames are generated using JHMs and PAFs from OpenPose and from the cross-modal network respectively. The O-frames have almost perfect quality except for the face region. The blurred faces are likely because we only used a limited-resolution image as the label; this could be further improved with higher-resolution ground truth. In comparison to the O-frames, the C-frames have relatively low quality. The gap between O-frames and C-frames indicates that the wireless pixels cannot reflect localization and movement as accurately as image pixels. However, the image quality of the C-frames is sufficient to meet the requirements of behavior monitoring, and it is expected to improve with more training samples and manually annotated ground truth.


Figure 9 depicts an example of the multiple-person scenario. We find that its image quality shows no distinct difference from the single-person scenario, while the appearances of the generated human instances are almost random. This is because the network cannot match human instances with their identities: the video frame only reveals who is in the RF perception field but cannot associate each pose with a specific identity. We will address this issue in CSI2Video2 by obtaining human identity information using indoor identification and localization technology.
Method | Mask-RCNN | Person in WiFi | Ours
IoU | 0.79 | 0.68 | 0.66
Then, we illustrate the results of the objective evaluation metrics. For this purpose, we compute the IoU scores as in Equation 10 on different thresholds, as depicted in Figure 8. We observe that the recovered frames have high IoU scores that decline gently at low threshold values. The IoU begins to decrease rapidly from 0.55 to 0.75, which indicates that the IoUs concentrate in this interval. Additionally, to the best of our knowledge, this is the first attempt to generate RGB surveillance video using WiFi signals. For this reason, we choose Person in WiFi (Wang et al. 2019) and Mask-RCNN (He et al. 2017) as baselines. Specifically, Person in WiFi is an end-to-end body segmentation and pose estimation method using WiFi signals, while Mask-RCNN is a camera-based segmentation method. Since we utilized Mask-RCNN to generate annotations in the training phase, we still need new manually annotated labels for the Mask-RCNN evaluation. To evaluate the results, we randomly selected 250 samples and manually annotated them. Table 1 compares the performance of the three methods. Although these methods use completely different sources to generate masks, they all target fine-grained human body segmentation. The gap with the camera-based method is obvious; incompletely reconstructed and overlapping body parts may account for it. Despite this gap, the IoU of the masks from the generated frames is comparable to that obtained directly from WiFi signals. This indicates that the generated video does not lose information from the WiFi signals; however, the wireless signals cannot carry as much information as video frames. This problem can be mitigated by adding more WiFi router pairs or recording more OFDM subcarriers.
Conclusion
In this paper, we propose CSI2Video, a novel human perception and video generation scheme that generates RGB surveillance video in an accurate and real-time manner by conducting cross-modal mapping and video generation using pervasive WiFi signals from commercial devices and a source of human identity information. Two tailored networks are designed to map CSI measurements to pose features and to generate the synthetic surveillance video respectively. To evaluate the generated video quality, we propose an innovative evaluation metric that reflects image quality and instance localization simultaneously. We conduct extensive experiments, including visualization and performance comparison, to demonstrate the effectiveness of the proposed system.
References
- Acharya et al. (2017) Acharya, U. R.; Fujita, H.; Lih, O. S.; Hagiwara, Y.; Tan, J. H.; and Adam, M. 2017. Automated detection of arrhythmias using different intervals of tachycardia ECG segments with convolutional neural network. Information sciences 405: 81–90.
- Adib et al. (2015) Adib, F.; Hsu, C.-Y.; Mao, H.; Katabi, D.; and Durand, F. 2015. Capturing the human figure through a wall. ACM Transactions on Graphics (TOG) 34(6): 1–13.
- Alp Güler, Neverova, and Kokkinos (2018) Alp Güler, R.; Neverova, N.; and Kokkinos, I. 2018. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7297–7306.
- Arjovsky, Chintala, and Bottou (2017) Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein gan. arXiv preprint arXiv:1701.07875 .
- Cai et al. (2018) Cai, H.; Bai, C.; Tai, Y.-W.; and Tang, C.-K. 2018. Deep video generation, prediction and completion of human action sequences. In Proceedings of the European Conference on Computer Vision (ECCV), 366–382.
- Cao et al. (2017) Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7291–7299.
- Chen and Ramanan (2017) Chen, C.-H.; and Ramanan, D. 2017. 3D human pose estimation = 2D pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7035–7043.
- Choi et al. (2018) Choi, Y.; Choi, M.; Kim, M.; Ha, J.-W.; Kim, S.; and Choo, J. 2018. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 8789–8797.
- Garcia, Morerio, and Murino (2018) Garcia, N. C.; Morerio, P.; and Murino, V. 2018. Modality distillation with multiple stream networks for action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 103–118.
- Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
- Gulrajani et al. (2017) Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of wasserstein gans. In Advances in neural information processing systems, 5767–5777.
- Halperin et al. (2011) Halperin, D.; Hu, W.; Sheth, A.; and Wetherall, D. 2011. Tool release: Gathering 802.11n traces with channel state information. ACM SIGCOMM Computer Communication Review 41(1): 53–53.
- Han et al. (2015) Han, X.; Lu, J.; Tai, Y.; and Zhao, C. 2015. A real-time lidar and vision based pedestrian detection system for unmanned ground vehicles. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 635–639. IEEE.
- He et al. (2017) He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, 2961–2969.
- He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
- Huang et al. (2021a) Huang, Y.; Li, X.; Wang, W.; Jiang, T.; and Zhang, Q. 2021a. Forgery Attack Detection in Surveillance Video Streams Using Wi-Fi Channel State Information. IEEE Transactions on Wireless Communications .
- Huang et al. (2021b) Huang, Y.; Li, X.; Wang, W.; Jiang, T.; and Zhang, Q. 2021b. Towards cross-modal forgery detection and localization on live surveillance videos. In IEEE INFOCOM 2021-IEEE Conference on Computer Communications, 1–10. IEEE.
- Ijjina and Chalavadi (2017) Ijjina, E. P.; and Chalavadi, K. M. 2017. Human action recognition in RGB-D videos using motion sequence information and deep learning. Pattern Recognition 72: 504–516.
- Isola et al. (2017) Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1125–1134.
- Jiang et al. (2020) Jiang, W.; Xue, H.; Miao, C.; Wang, S.; Lin, S.; Tian, C.; Murali, S.; Hu, H.; Sun, Z.; and Su, L. 2020. Towards 3D human pose construction using wifi. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, 1–14.
- Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 .
- Kniaz et al. (2018) Kniaz, V. V.; Knyaz, V. A.; Hladuvka, J.; Kropatsch, W. G.; and Mizginov, V. 2018. Thermalgan: Multimodal color-to-thermal image translation for person re-identification in multispectral dataset. In Proceedings of the European Conference on Computer Vision (ECCV), 0–0.
- Lara and Labrador (2012) Lara, O. D.; and Labrador, M. A. 2012. A survey on human activity recognition using wearable sensors. IEEE communications surveys & tutorials 15(3): 1192–1209.
- Leigh et al. (2015) Leigh, A.; Pineau, J.; Olmedo, N.; and Zhang, H. 2015. Person tracking and following with 2d laser scanners. In 2015 IEEE international conference on robotics and automation (ICRA), 726–733. IEEE.
- Li et al. (2022a) Li, X.; Cao, H.; Zhao, S.; Li, J.; Zhang, L.; and Raj, B. 2022a. Panoramic Video Salient Object Detection with Ambisonic Audio Guidance. arXiv preprint arXiv:2211.14419 .
- Li, Luo, and Younes (2020) Li, X.; Luo, J.; and Younes, R. 2020. ActivityGAN: Generative adversarial networks for data augmentation in sensor-based human activity recognition. In Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, 249–254.
- Li et al. (2021) Li, X.; Wang, J.; Li, X.; and Lu, Y. 2021. Video Instance Segmentation by Instance Flow Assembly. arXiv preprint arXiv:2110.10599 .
- Li et al. (2022b) Li, X.; Wang, J.; Li, X.; and Lu, Y. 2022b. Hybrid instance-aware temporal fusion for online video instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 1429–1437.
- Li et al. (2022c) Li, X.; Wang, J.; Xu, X.; Li, X.; Lu, Y.; and Raj, B. 2022c. R^2VOS: Robust Referring Video Object Segmentation via Relational Multimodal Cycle Consistency. arXiv preprint arXiv:2207.01203.
- Li et al. (2022d) Li, X.; Wang, J.; Xu, X.; Raj, B.; and Lu, Y. 2022d. Online Video Instance Segmentation via Robust Context Fusion. arXiv preprint arXiv:2207.05580 .
- Li et al. (2020) Li, X.; Younes, R.; Bairaktarova, D.; and Guo, Q. 2020. Predicting spatial visualization problems’ difficulty level from eye-tracking data. Sensors 20(7): 1949.
- Luo, Li, and Younes (2021) Luo, J.; Li, X.; and Younes, R. 2021. Toward Data Augmentation and Interpretation in Sensor-Based Fine-Grained Hand Activity Recognition. In International Workshop on Deep Learning for Human Activity Recognition, 30–42. Springer.
- Matti, Ekenel, and Thiran (2017) Matti, D.; Ekenel, H. K.; and Thiran, J.-P. 2017. Combining LiDAR space clustering and convolutional neural networks for pedestrian detection. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 1–6. IEEE.
- Maturana and Scherer (2015) Maturana, D.; and Scherer, S. 2015. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 922–928. IEEE.
- Mirza and Osindero (2014) Mirza, M.; and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 .
- Premebida et al. (2014) Premebida, C.; Carreira, J.; Batista, J.; and Nunes, U. 2014. Pedestrian detection combining RGB and dense LIDAR data. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 4112–4117. IEEE.
- Radford, Metz, and Chintala (2015) Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 .
- Rayat Imtiaz Hossain and Little (2018) Rayat Imtiaz Hossain, M.; and Little, J. J. 2018. Exploiting temporal information for 3d human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), 68–84.
- Sun et al. (2019) Sun, K.; Xiao, B.; Liu, D.; and Wang, J. 2019. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5693–5703.
- Tome, Russell, and Agapito (2017) Tome, D.; Russell, C.; and Agapito, L. 2017. Lifting from the deep: Convolutional 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2500–2509.
- Wang et al. (2019) Wang, F.; Zhou, S.; Panev, S.; Han, J.; and Huang, D. 2019. Person-in-WiFi: Fine-grained person perception using WiFi. In Proceedings of the IEEE International Conference on Computer Vision, 5452–5461.
- Wang et al. (2020a) Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; and Li, L. 2020a. SOLO: Segmenting Objects by Locations. In Proc. Eur. Conf. Computer Vision (ECCV).
- Wang et al. (2020b) Wang, X.; Zhang, R.; Kong, T.; Li, L.; and Shen, C. 2020b. SOLOv2: Dynamic, Faster and Stronger. arXiv preprint arXiv:2003.10152 .
- Yang et al. (2018) Yang, C.; Wang, Z.; Zhu, X.; Huang, C.; Shi, J.; and Lin, D. 2018. Pose guided human video generation. In Proceedings of the European Conference on Computer Vision (ECCV), 201–216.
- Zenonos et al. (2016) Zenonos, A.; Khan, A.; Kalogridis, G.; Vatsikas, S.; Lewis, T.; and Sooriyabandara, M. 2016. HealthyOffice: Mood recognition at work using smartphones and wearable sensors. In 2016 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops), 1–6. IEEE.
- Zhao et al. (2022) Zhao, C.; Li, X.; Dong, S.; and Younes, R. 2022. Self-supervised Multi-Modal Video Forgery Attack Detection. arXiv preprint arXiv:2209.06345 .
- Zhao et al. (2018a) Zhao, M.; Li, T.; Abu Alsheikh, M.; Tian, Y.; Zhao, H.; Torralba, A.; and Katabi, D. 2018a. Through-wall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7356–7365.
- Zhao et al. (2018b) Zhao, M.; Tian, Y.; Zhao, H.; Alsheikh, M. A.; Li, T.; Hristov, R.; Kabelac, Z.; Katabi, D.; and Torralba, A. 2018b. RF-based 3D skeletons. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, 267–281.
- Zou et al. (2019) Zou, H.; Yang, J.; Prasanna Das, H.; Liu, H.; Zhou, Y.; and Spanos, C. J. 2019. Wifi and vision multimodal learning for accurate and robust device-free human activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 0–0.