WixUp: A General Data Augmentation Framework for Wireless Perception in Tracking of Humans
Abstract
Recent advancements in wireless perception technologies, including mmWave, WiFi, and acoustics, have expanded their application in human motion tracking and health monitoring. They are promising alternatives to traditional camera-based perception systems, thanks to their efficacy under diverse conditions or occlusions, and enhanced privacy. However, the integration of deep learning within this field introduces new challenges, such as the need for extensive training data and poor model generalization, especially with sparse and noisy wireless point clouds. Data augmentation is one remedy well explored in other deep learning fields, but existing methods are not directly applicable to the unique characteristics of wireless signals. This motivates us to propose a custom data augmentation framework, WixUp, tailored for wireless perception. Moreover, we aim to make it a general framework supporting various datasets, model architectures, sensing modalities, and tasks, whereas previous wireless data augmentation and generative simulation methods do not exhibit this generalizability and are limited to certain use cases. More specifically, WixUp can reverse-transform lossy coordinates into dense range profiles using Gaussian mixtures and probability tricks, making it capable of in-depth data diversity enhancement; and its mixing-based method enables unsupervised domain adaptation via self-training, allowing models to be trained with no labels from new users or environments in practice. Our extensive evaluation shows that WixUp provides consistent performance improvement across various scenarios and outperforms the baselines.
1 Introduction
The advancements in wireless perception have enabled a broad spectrum of applications, ranging from human tracking and health monitoring to radar systems for self-driving and robotics. These systems rely on various wireless sensing modalities, including mmWave, acoustics, and WiFi. They offer several advantages over traditional camera-based perception: wireless signals are not restricted by lighting conditions or line-of-sight occlusions, and they provide better privacy protection than cameras. Typically, a wireless perception system emits custom signals, such as a frequency-modulated continuous wave (FMCW), from the transmitter [32]; then, by analyzing the reflections recorded by the receiver, it can extract information for localization or motion tracking.
The recent emergence of deep learning for wireless perception systems has greatly improved their tracking capability [62, 61], but it also brings forth a set of new challenges. Firstly, there is a significant demand for extensive training data, which requires considerable resources and effort to collect. Secondly, the trained model often exhibits poor generalization when deployed in new scenarios, such as a new user or environment. Moreover, when the raw FMCW signal is processed into point clouds, it exhibits severe sparsity that degrades training data quality. Consequently, downstream tasks encounter limitations in accuracy and struggle with complex tasks such as multi-person tracking or flow estimation. For instance, Fig. 2 shows two samples of processed mmWave point clouds, which appear sparse and barely resemble a coherent human skeletal structure, while the depth camera provides dense point clouds. This disparity stems from the inherent limitations of wireless technology, whose low granularity is insufficient for multi-part tracking.
To address these challenges, data augmentation is an effective solution that has been proven in many fields of machine learning [34, 57]. There are well-explored methods for augmenting 2D or 3D data like images [57] and Lidar point clouds [51], broadly categorized into global augmentation and local augmentation [51]. 1) Global augmentation performs random scaling, flipping, and rotation on 2D images or 3D Lidar point clouds to transform data. However, it fails to augment local structures and ignores the relevance between elements within one data sample. 2) Local augmentation techniques, instead, tend to mix data to generate new samples. The notable mixup [57] paper uses a convex combination to mix data, and [56, 51] crop-and-paste local structures from one sample to another at various granularities, such as a patch or an object. An additional benefit of mixing-based data augmentation is that it can not only improve model performance in supervised learning but also perform unsupervised domain adaptation via self-training [51], which helps reduce labeling efforts. Nevertheless, mixing-based augmentation requires designing a domain-specific patch/object selection algorithm.
In current wireless research, some work has adopted straightforward global data augmentation methods, such as rotation or Gaussian noise, to classify RF signal types [23, 11, 16, 46], or shifting the range or initial phase for perception [1]. However, mixing-based local data augmentation remains unexplored because of the complexity of incorporating domain-specific representations. Also, these previous methods are confined to specific tasks or models and often lack a comprehensive evaluation.
Besides, a recent trend involves leveraging generative models to simulate and synthesize wireless data [64, 14], sharing a similar motivation with data augmentation; but these are also limited to certain tasks or data. For instance, [5, 59] use video data of the human body to simulate RF signals, but they can only generate scenes from existing video data. [14] overcomes this issue by using text prompts to synthesize 3D meshes of human actions, and then simulates mmWave data from the 3D visuals using ray tracing and Diffusion models. Nevertheless, this only supports action-related tasks, because text prompts can only describe body actions or hand gestures. For example, it can synthesize data for pose estimation or action recognition, but not for user identification, since it is intractable to accurately depict behavioral traits solely with text prompts. Besides, [64] uses NeRF to simulate RF signals for the simple tasks of localization and 5G channel estimation, i.e., static scene synthesis rather than motion; and it needs labeled data from the environment to train the NeRF first. Overall, generative methods demand non-trivial effort to train an extra model and do not generalize to out-of-distribution data.
In contrast, in this work, we aim to propose a mixing-based data augmentation framework, that is training-free and particularly good at increasing data diversity or even closing domain gaps [51]. It supports multiple human tracking tasks, model architectures, and sensing modalities. Moreover, beyond improving model performance, it could also reduce labeling efforts by unsupervised domain adaptation via self-training. In practice, this enables training the model with new user/environment data without any labels. These benefits make our framework much more practical for wireless sensing than either conventional global augmentation or generative simulation.
However, this goal poses two primary challenges: 1) We need a mixing algorithm tailored to the features of wireless signals. Unlike 2D images or Lidar point clouds, wireless raw data is less interpretable, rendering many conventional augmentation strategies, such as flipping, inapplicable. Even after processing the raw data into point clouds, the noise and sparsity issues discussed for Fig. 2 make general 3D data augmentation insufficient. 2) To perform a comprehensive evaluation of our method on diverse real-world data and tasks, we need access to multiple high-quality datasets. Excitingly, in the realm of mmWave sensing, we have recently started seeing a trend toward open-sourcing data [17, 53, 7], but these datasets do not provide the unprocessed raw data. The processed point clouds are lossy because of the constant false alarm rate (CFAR) filtering applied as part of the standard FMCW processing pipeline.
Thereby, we propose WixUp, a range-profile-level data augmentation tailored to the unique characteristics of wireless perception, incorporating Gaussian mixtures to solve the lossy point cloud issue. To elaborate, we employ a custom data processing pipeline to transform any format of raw data into the lossless space of range profiles, by simulating the irreversible CFAR transformation with a Gaussian mixture. The mixing is then equivalent to the intersection of two Gaussian mixtures, along with probability-based methods to bootstrap the results and assign spherical angles. Finally, by building a unified code base, we meticulously investigate the generalizability of our data augmentation framework across three datasets, two model architectures, three tasks, and two sensing modalities (mmWave and acoustics), with a focus on human perception because this topic has the most open research resources available to support such a study. Furthermore, we demonstrate its efficacy in unsupervised domain adaptation via self-training, across unseen environments and users. Experimental results show that our data augmentation framework exhibits consistent improvement over baselines across different evaluation experiments.
In summary, our paper has the following main contributions:
- We propose a custom mixing-based data augmentation method for wireless perception, which preserves characteristic representations for in-depth augmentation and handles imperfect data types.
- Our pipeline is a general data augmentation framework across datasets, tasks, model architectures, and sensing modalities, and we conduct comprehensive experiments in a unified setup to validate its generalizability.
- Our framework can incorporate self-training to reduce labeling efforts through unsupervised domain adaptation. This enables training the model with no labels from new environments or users.
2 Related work
Data augmentation (DA) is well explored for 2D images [22, 45, 39] and 3D Lidar point clouds [51], since collecting and annotating training data is labor-intensive and time-consuming. DA for 2D images typically refers to global augmentation techniques such as random cropping [26, 44], scaling [42, 44], erasing [65], and color jittering [44], which aim to learn transformation invariance for image recognition tasks. In contrast, local augmentation techniques, such as mixup [57] and CutMix [56], generate new training data through various mixing operations. 3D DA has garnered attention recently and adopts similar methods such as scaling, rotation, and translation; recent studies attempt to augment local structures of point clouds [30, 40, 25] and introduce concepts for object-level [15, 27] and scene-level [51] augmentation in self-driving. However, conventional augmentation techniques are not suitable for the unique characteristics of wireless sensing data, as discussed above, necessitating the development of customized approaches.
Data augmentation for labeling efficiency. Mixing-based local DA can not only improve model performance in standard supervised learning, but also reduce labeling efforts via self-training [51]. It enables unsupervised domain adaptation without explicit supervision. Its effectiveness has been studied in text analysis, computer vision [55, 18, 51], and speech processing [41], improving the generalizability of models. It synthesizes silver-standard labels generated from input data to facilitate learning. Similarly, in self-supervised learning, DA [9, 10] brings even more performance improvement than algorithmic techniques like revising model architectures or algorithmic additions [12, 13, 19, 8]. This aspect is crucial for wireless perception because, unlike crawling images and text from the web, collecting sensing data with labels from the real physical world is especially expensive.
Wireless perception is a promising alternative to cameras for tracking humans, relying on radio frequency (RF) signals, such as mmWave and WiFi, or acoustics. RF applications include gesture classification [21, 48], localization [60, 38, 37], motion detection and pose estimation [4, 3, 2, 62, 61, 24], and fine-grained face reconstruction [52]. In addition, acoustic sensing leverages ubiquitous speakers and microphones with little bandwidth for hand gesture recognition [47, 50, 20, 54] and hand tracking [36, 49, 33, 31, 28].
Wireless perception datasets have become more accessible in recent research. Some focus primarily on a single task, with the majority devoted to keypoint estimation or action recognition using mmWave [43, 63]. MARS [7] is one of the pioneers, providing data for rehabilitation using mmWave. mRI [6] and MM-Fi [53] are large-scale datasets that contain around 160K frames. MiliPoint [17] is the first to include all three main tasks in human tracking: user identification, action classification, and keypoint pose estimation, with a total of 49 different actions across 545k frames. These datasets facilitate research on wireless perception and make DA research possible in this domain by enabling comprehensive benchmarking.
Data augmentation in wireless perception was previously mentioned in some work, typically tailored to a specific system; a general data augmentation framework with comprehensive evaluations is still lacking. As examples of global augmentation, MiliPoint [17] employs a DA named stack, which involves zero-padding and random resampling; consequently, the augmented points essentially replicate the original ones with some subsetting and duplication. We use this as a baseline in our experiments. [23, 11, 16, 46] use global DA such as flipping, rotation, Gaussian noise, or wavelets, but only to classify radio signal types, not for perception. [1] augments data by shifting the range profile along the range axis by a small scale; however, beyond a certain augmentation scale, the shift distorts the data, leading to a decrease in model performance. Besides, [58] uses transfer learning to improve data efficiency across environments, but it requires labeled data from the target domain.
Recently, generative models have been investigated to synthesize wireless data [64, 14], but they are limited to certain tasks or data. [5, 59] simulate RF signals from videos of human actions, but the generation is bound to existing video data. [14] solves this issue by using text prompts to synthesize 3D meshes of humans first, and then simulating mmWave data from the 3D visuals using ray tracing and Diffusion models. However, the text prompts are limited to descriptions of actions; it does not support non-action tasks like user identification, where describing behavioral traits with text is intractable. NeRF has also been applied to simulate RF signals [64], but it only demonstrates simple static tasks of localization and 5G channel estimation rather than motion tracking, since the vanilla NeRF is for static synthesis; and it requires labeled data from the target scene to train the NeRF model. In general, generative methods require significant effort to train an additional model and struggle to generalize to out-of-distribution data, while mixing-based DA is training-free and excels at enhancing data diversity.
3 Background: FMCW raw data processing and formats


Before delving into WixUp’s mixing pipeline, we first provide background to reveal the challenges and motivate our custom designs for WixUp.
We briefly introduce the standard FMCW-based data processing pipeline used in the majority of wireless perception systems, including standard mmWave radar [32]; many acoustic and WiFi sensing systems adopt similar radar modulation [35, 29, 62, 61]. Fig. 2 shows the pipeline, including de-chirping and the extraction of range, angle, velocity, and then 3D Cartesian coordinates. In detail, the transmitter (TX) first sends out chirps; after reflecting off the subject, they are captured by the receiver (RX). The signals are then processed through a mixer and filter to produce a mixed Intermediate Frequency (IF) signal (this can also be achieved algorithmically outside the hardware when a mixer is not available). This signal undergoes a Fast Fourier Transform (FFT), whose amplitudes yield the range measurement. Subsequently, another Doppler-FFT is applied to the phases along the slow time for velocity measurement. The multi-channel phases can further estimate the angles of arrival. A Constant False Alarm Rate (CFAR) algorithm then filters the outputs based on the noise level. Finally, the ranges and angles can be transformed into Cartesian coordinates, which, together with velocity and signal intensity, form a 5D time series for downstream applications. This pipeline enables detecting position and movement in human tracking and other applications.
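For concreteness, below is a minimal sketch of the de-chirp range-FFT step on a synthetic single-reflector IF signal. The sampling rate and chirp slope are illustrative choices that reproduce the 512-bin window and 3.75 cm range resolution quoted later in this paper; they do not model any specific radar.

```python
import numpy as np

c = 3e8              # speed of light (m/s)
slope = 40e6 / 1e-6  # chirp slope: 40 MHz/us
fs = 5.12e6          # IF sampling rate (Hz)
n = 512              # window size for de-chirping

# One reflector at 2 m produces a beat frequency f_b = 2 * d * slope / c.
d_true = 2.0
f_beat = 2 * d_true * slope / c
t = np.arange(n) / fs
if_signal = np.exp(2j * np.pi * f_beat * t)  # idealized complex IF signal

# Range FFT: each bin maps to a distance. The bin spacing is the range
# resolution c / (2B), with swept bandwidth B = slope * n / fs = 4 GHz,
# i.e. 3.75 cm per bin and a 512-bin (19.2 m) profile.
spectrum = np.abs(np.fft.fft(if_signal))
ranges = np.fft.fftfreq(n, 1 / fs) * c / (2 * slope)
print(f"estimated range: {ranges[np.argmax(spectrum)]:.3f} m")  # ~2.0 m
```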
The challenge is the prevalent lack of raw data in the majority of public datasets, while the raw data, rich in detail, is critically important for performing meaningful and effective data augmentation. These datasets typically provide only Cartesian coordinates, i.e., the culmination of the data processing sequence. The issue with Cartesian coordinates is their high-level abstraction: owing to the CFAR filtering, they suffer a considerable loss of the fine-grained information that is imperative for in-depth data augmentation. The sparsity of the processed point clouds makes this worse, whereas the range profile is a relatively dense representation. Unfortunately, unlike an inverse FFT, it is intractable to algorithmically reverse CFAR.
Thereby, we aim to propose a framework that can transform any data format of wireless signals into range profiles. The range profile is lossless, unlike Cartesian coordinates, and more interpretable than the raw signal, allowing in-depth DA. The following subsections detail how we achieve this goal, focusing on a way to invert Cartesian coordinates into a simulated range profile.
4 Method

As illustrated at a high level in Fig. 3, WixUp follows the common process of mixing-based DA. In this section, we first provide a formal definition of the mixing operation in the context of wireless sensing.
Next, we detail WixUp’s algorithm, tailored for mixing wireless signals at the range-profile level, and how it is embedded into the common mixing-based process. In overview, as shown in Fig. 3, we first describe our solution for transforming coordinates into range profiles using Gaussian mixtures. Next, we define a fault-tolerant mixing method leveraging the intersection of the Gaussians, with only O(n) computational complexity. Additionally, we use bootstrapping to further densify the mixed output. Finally, if the downstream model input is not a range profile but coordinates, we induce 3D coordinates from the intersected range profiles, along with angles based on a probability distribution.
Furthermore, since we aim to test WixUp’s capability for self-training to reduce labeling efforts, this section also briefly introduces the self-training process to inform our experiment design.
4.1 Define mixing-based data augmentation
A typical mixing-based DA takes two sets of data as input: let $x$ and $y$ denote the sensing data and their ground truth. The input to a mixing operation is a pair of tuples $(x_i, y_i)$ and $(x_j, y_j)$. The output is synthesized data $(\hat{x}, \hat{y})$ in the same format as the two input tuples. Therefore, by iteratively mixing pairs $(x_i, y_i)$ and $(x_{i+1}, y_{i+1})$ for $1 \le i \le N-1$ ($N$ is the size of the original data), we can yield $N-1$ synthesized samples in one pass of iteration. Moreover, the distance between each pair can vary from 1 to 2 or more, i.e., mixing $(x_i, y_i)$ with $(x_{i+2}, y_{i+2})$, and so on. Thereby, by iterating with distances $d = 1, \dots, k$, we scale up the data by $k$ times.
What a DA method focuses on is how to mix $x_i$ and $x_j$, and especially $y_i$ and $y_j$. To elaborate, $\hat{y}$ is usually mixed by a convex combination, i.e., a weighted average. In our paper, we take the plain average regardless of the task-specific ground-truth format. For example, in pose estimation, $y$ could be a $19 \times 3$ matrix representing the 3D skeleton of 19 body keypoints of $x$; taking the average of $y_i$ and $y_j$ yields the 3D skeleton located midway between the two input skeletons. In user identification or action recognition tasks, $y$ is the class ID, usually one-hot encoded; taking the average of two one-hot vectors means the mixed data is half-chance class A and half-chance class B, or the same class when the input classes are equal. As for the data $x$: in DA for 2D images, $x$ could be a matrix of pixels in RGB channels, where popular mixing-based DA takes a convex combination in the same way as for $y$. However, in wireless perception, $x$ could be raw signals, range profiles, 3D point clouds, or 5D time series $(x, y, z, v, I)$, where $v$ is the Doppler velocity and $I$ is the signal intensity. As discussed above, directly adding these does not meaningfully align with the characteristics of wireless signals. Instead, WixUp transforms them all to the range-profile level for the convenience of mixing with rich raw information. Next, we introduce how we mix $x_i$ and $x_j$ through a pipeline of signal processing and probability-based algorithms.
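As a minimal sketch of this generic mixing loop, assuming list-of-tuples data and with `mix_data` standing in for WixUp’s range-profile pipeline described next (all names here are illustrative, not a fixed API):

```python
import numpy as np

def mix_labels(y_i, y_j):
    """Convex combination (plain average) of two ground-truth labels:
    works for 3D keypoint matrices and one-hot class vectors alike."""
    return (np.asarray(y_i) + np.asarray(y_j)) / 2

def augment(dataset, mix_data, max_distance=1):
    """Mixing-based augmentation loop over a list of (x, y) tuples.
    Each pass at distance d yields about N synthetic samples, so the
    data scales by roughly max_distance times overall."""
    synthetic = []
    n = len(dataset)
    for d in range(1, max_distance + 1):
        for i in range(n - d):
            (x_i, y_i), (x_j, y_j) = dataset[i], dataset[i + d]
            synthetic.append((mix_data(x_i, x_j), mix_labels(y_i, y_j)))
    return synthetic
```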
4.2 WixUp data mixing pipeline
4.2.1 Inverse Cartesian coordinates to simulated range profile using Gaussian mixture
Most stages in the FMCW data processing pipeline can be freely transformed back to the range profile using standard signal processing algorithms, since most steps, such as the FFT, are invertible. However, CFAR presents a unique challenge, as it inherently discards a significant quantity of data, making reversal complex. Therefore, we put forth a novel inverse data processing pipeline that operates between the range data and the Cartesian coordinates. This enables the application of WixUp to any public dataset, irrespective of the data format provided, allowing us to bypass the limitation imposed by the absence of raw data.
The core idea is to simulate the range profile as a Gaussian mixture model of Euclidean distances, as shown on the left of Fig. 3. In detail, we first translate Cartesian coordinates to spherical coordinates, a representation better aligned with the transmission features of wireless signals. Next, by dividing the Euclidean range by the range resolution, we get the indices of range bins. Using these indices as the means of a Gaussian mixture model, the probability density function (PDF) is derived as the range profile, i.e., the statistical distribution of the ranges. The system parameters include a window size of 512 (the de-chirping size) and a range resolution calculated as 3.75 cm, yielding a maximum detection range of 19.2 m. Assuming a human joint size of around 10-15 cm, we use a standard Gaussian, so that its main lobe falls within the joint size.
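A minimal sketch of this inverse transform, using the window size and range resolution stated above; treating the Gaussian width in bins as a tunable parameter tied to the joint size is our assumption:

```python
import numpy as np

N_BINS = 512        # de-chirping window size
RANGE_RES = 0.0375  # range resolution (m), i.e. 3.75 cm

def coords_to_range_profile(points_xyz, sigma_bins=1.0):
    """points_xyz: (n, 3) Cartesian coordinates of one frame, sensor at
    the origin. Returns a (N_BINS,) Gaussian-mixture PDF over range bins."""
    r = np.linalg.norm(points_xyz, axis=1)  # spherical radius per point
    mu = r / RANGE_RES                      # range-bin index of each point
    bins = np.arange(N_BINS)
    # Equal-weight Gaussian mixture with one component per point.
    pdf = np.exp(-0.5 * ((bins[None, :] - mu[:, None]) / sigma_bins) ** 2).sum(0)
    return pdf / (pdf.sum() + 1e-12)        # normalize to a PDF
```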
4.2.2 Mixing range profiles using intersection
After transforming data into range profiles, whether through the simulation above or by computing them from the raw signal using standard de-chirping, we mix every pair of data into new synthetic data. We design a mixing strategy based on taking the intersection of two range profiles, which aligns closely with the physical interpretation of wireless range profiles while also being computationally efficient.
As illustrated in the center of Fig. 3, each pair of peaks may intersect midway along the range axis; the height of the intersected side lobe is roughly inversely proportional to the distance between the peaks. This approach takes middle points from all possible bisections of two sets of real points, since it is unclear which body joint each real point corresponds to; thus, we cannot restrict middle points to be extracted from identical joints. By considering all possible bisections, we ensure coverage of eligible synthetic points and introduce some "temperature" into the process, borrowed from sampling in VAEs. This makes the mixing fault-tolerant.
Instead of employing the closed-form intersection-of-Gaussians algorithms, we utilize an O(n) one-pass function to identify intersections. The validity of an intersection can be determined by the sign-change test $(p_i - q_i)(p_{i+1} - q_{i+1}) < 0$, where $p$ and $q$ are the two PDF arrays and $(i, i+1)$ denotes a pair of neighboring bins. The strict inequality excludes degenerate neighbor pairs, such as those where both profiles are zero, which do not represent valid intersection points. By implementing this test, we effectively filter out non-intersecting segments in one pass. This streamlines the intersection detection process, enhancing both efficiency and accuracy, and does not depend on the complexity of the input Gaussian mixture.
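A sketch of this one-pass detection under the sign-change reading of the test above; the sub-bin interpolation of positions and heights is our illustrative choice:

```python
import numpy as np

def intersect_profiles(p, q):
    """O(n) one-pass intersection detection between two range profiles
    p and q (equal-length PDF arrays). A crossing lies between bins
    i and i+1 when (p[i]-q[i]) * (p[i+1]-q[i+1]) < 0; the strict
    inequality drops flat zero-valued neighbor pairs, which are not
    valid intersections. Returns (positions, heights) in bin units."""
    d = p - q
    idx = np.flatnonzero(d[:-1] * d[1:] < 0)
    frac = d[idx] / (d[idx] - d[idx + 1])           # sub-bin crossing offset
    pos = idx + frac
    height = p[idx] + frac * (p[idx + 1] - p[idx])  # profile value at crossing
    return pos, height
```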
4.2.3 Bootstrapping high-probability synthesis
We further bootstrap each intersection point, weighted by its height. Since the height of an intersection is proportional to its validity as an eligible point, we use it as a weight to resample points around it. This follows Monte Carlo simulation, where random samples are generated to simulate the properties of a system or to approximate integrals and sums. To add more randomness, we also keep the original peaks of the simulated range profile: they are assigned the lowest weight of one, and all the intersection weights above are elevated by one as well.
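A possible reading of this weighted resampling, with the weight offsets described above; `n_draws` and the `jitter` spread are illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_points(pos, height, orig_bins, n_draws=64, jitter=0.5):
    """Monte-Carlo resampling around intersection points, weighted by
    intersection height (a proxy for validity). Original peaks keep the
    minimum weight of 1, and intersection weights are raised by 1, as
    described above. Returns n_draws resampled range-bin positions."""
    cand = np.concatenate([pos, orig_bins])
    weights = np.concatenate([height + 1.0, np.ones(len(orig_bins))])
    picks = rng.choice(len(cand), size=n_draws, p=weights / weights.sum())
    return cand[picks] + rng.normal(0.0, jitter, size=n_draws)
```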
All the aforementioned modules are ablated in our experiment section to justify their effectiveness. For instance, we initially also adopted skewness in the Gaussians to represent the multipath induced around the subject, i.e., skewed toward larger values along the range axis. However, the skewness did not help improve model performance, so we discarded it from the design of WixUp.
4.2.4 Enrich range profile with angles based on probability distribution
Finally, if the downstream model input is not a range profile but 3D coordinates, we need to induce the coordinates back. In addition to ranges, reconstructing coordinates also requires azimuth and elevation angles. Therefore, we use a probability-based method to assign angles to each synthetic range by sampling from a distribution built from the actual angles of the input data pairs. The probability distribution of angles mirrors the Gaussian mixture used for ranges. Consequently, the final angle at each range is determined by a convex combination of the probabilities associated with the major nearby points. For instance, ideally, the midpoint between two ranges will also have averaged angles, making it the midpoint in 3D. This is in harmony with the expectations set by our mixing method. Finally, we transform the spherical coordinates to Cartesian, as visualized at the end of Fig. 3.
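A sketch of this angle assignment, assuming a Gaussian kernel of the same form as the range mixture; the kernel width `sigma` (in the units of the ranges) is an assumed parameter:

```python
import numpy as np

def assign_angles(synth_r, real_r, real_az, real_el, sigma=0.05):
    """Assign azimuth/elevation to each synthetic range as a convex
    combination of the input points' angles, weighted by a Gaussian
    kernel over range proximity; then convert spherical to Cartesian."""
    w = np.exp(-0.5 * ((synth_r[:, None] - real_r[None, :]) / sigma) ** 2)
    w /= w.sum(axis=1, keepdims=True) + 1e-12   # normalize to convex weights
    az, el = w @ real_az, w @ real_el
    # Spherical -> Cartesian.
    x = synth_r * np.cos(el) * np.cos(az)
    y = synth_r * np.cos(el) * np.sin(az)
    z = synth_r * np.sin(el)
    return np.stack([x, y, z], axis=1)
```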
4.3 Use WixUp for unsupervised domain adaptation via self-training
We not only aim to evaluate our data augmentation in standard supervised learning for wireless perception tasks, but also to demonstrate its capability for reducing labeling efforts via self-training. This is achieved through our method of mixing data to generate new data. Note that non-mixing methods like random scaling are not a good fit, since they augment one sample at a time, rather than mixing two samples from different domains to close the domain gap. Mixing-based augmentation enables the creation of more realistic semi-real labels by mixing real data and predicted data, facilitating efficient domain adaptation through self-training.
Source domain & Target domain: Training and evaluating unsupervised domain adaptation via self-training means adapting a model trained on a source domain to perform inference well on a target domain, where the target domain dataset may have scarce or even no labeled data. In step 1 of this approach, the model is first trained with labeled data from the source domain. In step 2, part of the target domain data is split out as training data and the rest is reserved for testing. Making predictions on the target testing data with the model from step 1 would likely yield low accuracy. (Note that the labels of the testing data are used only for evaluation.) In step 3, the trained model runs inference on the unlabeled target-domain training data, and the output predictions are kept as pseudo-labels. Next, in step 4, we take pairs of one source-domain sample and one target training sample and mix each pair into one semi-labeled training sample. The mixing algorithm can be customized, such as WixUp tailored for wireless sensing data. Finally, in step 5, the new training data further fine-tunes the trained model. Thus, model performance on the same testing data in the target domain should improve.
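The five steps can be condensed into the skeleton below; the callables (`train_fn`, `predict_fn`, `mix_fn`, `evaluate_fn`) are placeholders for the concrete system’s components rather than a fixed API:

```python
def self_train(model, source_data, target_train, target_test,
               train_fn, predict_fn, mix_fn, evaluate_fn):
    # Step 1: supervised training on labeled source-domain data.
    train_fn(model, source_data)

    # Step 2 is the split itself: target_train is unlabeled target-domain
    # data reserved for adaptation; target_test is held out for evaluation.

    # Step 3: pseudo-label the unlabeled target training data.
    pseudo = [(x, predict_fn(model, x)) for x in target_train]

    # Step 4: mix (source, target) pairs into semi-labeled training data;
    # for WixUp, mix_fn is the range-profile pipeline of Sec. 4.2.
    semi = [mix_fn(src, tgt) for src, tgt in zip(source_data, pseudo)]

    # Step 5: fine-tune on the mixed data, then re-evaluate.
    train_fn(model, semi)
    return evaluate_fn(model, target_test)
```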
Use case: This self-training process proves especially advantageous in wireless sensing, where acquiring new data often demands significant manual effort, and system performance can drop significantly in new rooms or with new users, i.e., unseen domains that are not part of the training data. Currently, to address this issue, many wireless perception systems require users to undergo calibration/personalization before use, but obtaining ground truth for the sensing data at the user’s end is often infeasible. By employing unsupervised domain adaptation through self-training, WixUp addresses this practical challenge by enabling calibration/personalization without the necessity of collecting labels from new users or environments.
5 Experiments
In this section, we carry out an extensive evaluation of WixUp, divided into three groups of experiments:
- First, we benchmark how WixUp enhances model performance in supervised learning across a variety of use cases, regardless of the dataset, model architecture, task, or sensing modality. In general, WixUp increases the size of the training data, which consistently provides a significant margin of improvement over no augmentation and outperforms the baseline augmentation methods.
- The second group of experiments illustrates WixUp’s capability to significantly reduce the need for labeled data by utilizing unsupervised domain adaptation through self-training.
- Finally, we conduct ablation studies on the algorithmic modules and hyper-parameters of our method to verify their effectiveness.
The following subsections begin with an elaboration on the experiment setup; then we present and analyze the results for these three groups of experiments.
5.1 Experiment setup
In the setup, we detail the overall evaluation metrics, task-specific metrics, the baseline methods as a comparison, the datasets we employ, and the implementation we utilize.
5.1.1 Overall evaluation metrics
For both the supervised learning and the self-training experiments, we aim to show that the model yields a lower error on the validation data after leveraging WixUp, under the same settings, including data and training epochs. Moreover, in self-training, WixUp should achieve this goal even when some data has no labels.
Supervised learning typically starts with a labeled dataset that is divided into training and evaluation subsets, allowing us to measure performance on both, using loss or task-specific error metrics. In our experiments, we ensure that only the training data is augmented. More specifically, augmentation allows us to increase the size of the training dataset by two times or more. We then train the model using this enlarged training dataset. For the DA to be effective, both training and evaluation loss and errors should drop, and the training loss should converge faster than without the DA. Furthermore, we aim to show that WixUp surpasses baseline augmentations by achieving more significant improvements in the evaluation results.
Self-training assumes that not all training and evaluation data have labels; a model trained on limited labeled data might not perform well on the evaluation set. WixUp enables further fine-tuning of the trained model with unlabeled data: it mixes the unlabeled data and its predicted pseudo-labels with the real data and feeds the result back into the model. Note that the data for mixing is separate from the evaluation set; also, the evaluation set’s labels are used only for assessing performance, so these labels can be absent in real use cases. In conclusion, effective self-training with WixUp should reduce the error on the evaluation set compared with the model trained only on the limited labeled data.
5.1.2 Task-specific evaluation metrics
To demonstrate the generalizability of our DA across applications, we select three of the most common tasks in human tracking: keypoint pose estimation, action recognition, and user identification. For each task, we train and evaluate the model with WixUp to confirm that it reliably contributes improvements on the evaluation data. This underscores WixUp’s usability in diverse applications.
The evaluation metrics are 1) Mean Per Joint Position Error (MPJPE) for keypoint pose estimation and 2) classification accuracy for action recognition and user identification. In detail, MPJPE is one of the most common metrics for 3D human or hand pose estimation. It calculates the average Euclidean distance between the predicted and ground-truth positions across all joints. MiliPoint uses MLE, so we follow this usage in our paper. Classification accuracy measures the proportion of total predictions that a model gets right, expressed as a percentage. The goal of action recognition is to classify motion into one of the predefined actions (e.g., waving hands, limb extension, jumping up, etc.). For user identification, the objective is to recognize a user based on behavioral or physical traits hidden in the sensor data, like movement patterns or interaction styles with devices.
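Both metrics are simple to compute; for reference, a minimal implementation:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance between
    predicted and ground-truth joints. pred, gt: (num_joints, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def top1_accuracy(pred_labels, gt_labels):
    """Top-1 classification accuracy for action recognition or user
    identification, as the fraction of correct predictions."""
    return (np.asarray(pred_labels) == np.asarray(gt_labels)).mean()
```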
5.1.3 Baseline augmentation methods
Three baselines serve as a comparison with WixUp to show its effectiveness under the same experimental settings: no augmentation, conventional global augmentation, and stacking. We run them in all the following benchmarking experiments.
Baseline Null: no augmentation. Initially, we employ a baseline approach using solely the original real data without any augmentation.
Baseline CGA: conventional global augmentation. Since there is no prior general DA method covering all tasks and models discussed in this paper, we instead use conventional global DA for generic 2D and 3D data as a baseline. Specifically, we opt for a conventional global augmentation, named CGA, involving random scaling within the range of 0.8 to 1.2. Although it cannot facilitate unsupervised domain adaptation through self-training due to its non-mixing-based approach, it is applicable in all the benchmarking experiments that follow. More specifically, in pose estimation, the scaling is applied to both the sensing point clouds and the ground-truth points. For the other two tasks of user and action recognition, the scaling does not apply to the class labels.
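A minimal sketch of CGA as we use it (the function name and interface are ours):

```python
import numpy as np

rng = np.random.default_rng()

def cga(points, keypoints=None):
    """Baseline CGA: one global scale drawn uniformly from [0.8, 1.2].
    For pose estimation, the same scale is applied to the sensing point
    cloud and the keypoint ground truth; for classification tasks,
    keypoints is None and the class label is left untouched."""
    s = rng.uniform(0.8, 1.2)
    return points * s, (None if keypoints is None else keypoints * s)
```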
Baseline Stack: random duplication. The MiliPoint [17] dataset also introduces a simple DA technique, named stack. This method involves zero-padding each frame to standardize the number of points; it then randomly resamples from these points, producing one data duplication. According to the original paper, this procedure is repeated several times for each frame: five times for pose estimation and action recognition, and fifty times for user identification. Intuitively, this approach not only ensures a uniform input size but also enhances data diversity through randomness. However, our reproduction results show that it actually hurts model performance in many cases, so we only run it in the initial benchmarking experiments and omit it from the rest.
WixUp+: a combination of WixUp and CGA. Except for the self-training experiments, it is valid to use WixUp on top of Baseline CGA in benchmarking experiments; in practice, it is common to employ multiple data augmentation methods simultaneously in machine learning applications. Therefore, we combine WixUp with CGA, denoted WixUp+, as another comparison with WixUp alone. However, in our experiments, we observed that the inclusion of CGA does not always bring additional performance enhancements, a point we delve into in the forthcoming analysis. Overall, WixUp itself consistently improves performance beyond the individual baselines in most cases.
5.1.4 Datasets
In our experiments, we leverage three mmWave datasets and one additional acoustic dataset for the cross-modality experiment. Specifically, they are three publicly available mmWave human tracking datasets alongside an acoustic hand-tracking dataset collected from our prior research. This diverse selection enables us to demonstrate the generalizability of WixUp across datasets and sensing modalities.
MiliPoint Dataset, published recently in the NeurIPS dataset track [17], focuses on low-intensity calorie-burning fitness movements. With 49 distinct actions and a massive 545,000 frames from 11 subjects, it surpasses previous datasets in both action diversity and data volume. It also covers pose estimation and user identification. We use this dataset as the main test bed for the benchmarking experiments in this work. The data collection process considers factors such as movement intensity and diversity: participants were asked to perform a series of movements while being monitored by a mmWave radar and a Zed 2 stereo camera for the ground truth. The Texas Instruments IWR1843 mmWave radar, a common choice for wireless perception research, operates between 77 GHz and 81 GHz with a chirp duration of 100 us and a slope of 40 MHz/us; its large 4 GHz bandwidth achieves a range resolution of 4 cm. The Zed 2 stereo camera complements this by providing the ground-truth 3D skeleton of 18 keypoints, obtained by computing a depth map from the disparity between two views and feeding it into a neural network for pose estimation; this offers reliable accuracy for their settings at both 3 and 15 meters. However, this public dataset consists solely of 3D point clouds without the raw mmWave data, as discussed above, leaving around 8-22 points per frame. We show that WixUp can work backward from this processed data for DA.
MARS Dataset (Millimeter-wave Assistive Rehabilitation System) [7] is pioneering work providing a large-scale dataset in wireless sensing, designed for the rehabilitation of motor disorders using mmWave. It comes with a first-of-its-kind dataset of mmWave point cloud data, featuring 70 minutes of 10 different rehabilitation movements performed by 4 human subjects, providing 19 human keypoints and 40,083 labeled frames alongside publicly released video demonstrations. A common Texas Instruments IWR1443 Boost mmWave radar performs data acquisition at 76-81 GHz. The chirp duration is 32 us with a slope of 100 MHz/us, yielding a range resolution of 4.69 cm and a maximum detection range of 3.37 m. Additionally, a Kinect V2 with infrared depth cameras captures the ground truth, collocated with the radar and synchronized to the Kinect’s fixed sampling rate of 30 Hz. Similarly, their signal processing keeps only the first 64 points per frame as 5D time-series points. We take the subset of solely the coordinates to run the same experimental setup as with the other mmWave datasets.
MMFi Dataset stands out as another large-scale wireless dataset [53]. It is the first multi-modal non-intrusive 4D human dataset, including mmWave radar, LiDAR, WiFi CSI, infrared cameras, and RGB cameras, emphasizing sensor fusion for multi-modal perception. In total, it has 320,000 synchronized frames across five modalities from 40 human subjects with 27 categories of daily and rehabilitation actions, providing a valuable resource for both everyday and clinical research in human motion. The radar data is collected by a Texas Instruments IWR6843 60-64 GHz mmWave radar; the detailed chirp parameters were not disclosed. Moreover, a novel mobile mini-PC captures and synchronizes data from multiple sensors, allowing for data collection in diverse environments. We therefore use this dataset in our self-training experiments to test unsupervised domain adaptation across environments.
Acoustic Dataset: We collected an acoustic dataset [1] from a sensing platform designed for human hand tracking, which employs the same FMCW signal algorithms used by mmWave radar systems. It features fine-grained continuous tracking of 21 highly self-occluded finger joints in 3D. Data collection involved 11 participants across three distinct environments, yielding a total of 64 minutes of carefully selected hand motion data covering a wide range of expressive finger joint movements. The hardware setup comprises a development microphone array board alongside a speaker, plus a Leap Motion infrared camera used solely for collecting ground truth during training. The 7-channel mic array shares the same layout and sensitivity specifications as the Amazon Echo 2 home assistant. The system emits ultrasound modulated into 17-20 kHz FMCW chirps with a duration of 10 ms. Through the de-chirping algorithm, it achieves a range resolution of 3.57 mm with only a small bandwidth, owing to the low speed of sound. This acoustic dataset helps demonstrate WixUp’s versatility across wireless sensing modalities and its flexibility in handling other formats of input data.
5.1.5 Implementation
To run a wide range of experiments across datasets, tasks, model architectures, and baselines, we need a flexible code framework. Two of the three mmWave datasets, MARS and MMFi, do not come with a full code base, so we rewrote the MiliPoint code base into a uniform framework to train and evaluate all the datasets. It supports a variety of model architectures, including DGCNN, PointNet++, and PointTransformer, along with all three tasks. We intend to release our framework as open source upon the acceptance of this paper, aiming to facilitate further DA research in this domain.
For context, all experiments are executed on a GPU server equipped with NVIDIA Quadro RTX 8000 cards. Wherever possible, we adhere to the hyper-parameters outlined in the original dataset papers to ensure accurate and equitable reproduction, covering the learning rate, batch size, pre-processing schemes, random seed, and evaluation metrics. The one exception is that we disable the random shuffling of data before splitting into training and testing sets: shuffling does not align with the reality that testing data occurs after the training data in time, although not shuffling might decrease our accuracy. Most experiments utilize subsets of the original data to facilitate extensive ablation studies and benchmarking. We ensure that all comparisons are made against the reproduced original settings under our framework.
Identification (Acc, %) | Action (Acc, %) | Keypoint (MLE, cm) |
Report | 77.65 | 13.61 | 16.53 |
Reproduction | 80.90 | 15.88 | 16.93 |
Table 1 shows our reproduction results for MiliPoint, where we run 50 epochs on the full dataset for each of the three tasks using the DGCNN model. Each task runs in less than 10 hours on a single RTX 8000. Our reproduction yields slightly better results for user identification and action recognition in terms of top-one classification accuracy, and maintains a similar MLE for 9-point keypoint pose estimation. In conclusion, the reproduction is verified to align with the results reported in the original paper. Note that the errors in the following benchmarking experiments are usually higher than those here, because those runs are trained on a subset of the data with five times fewer training epochs, in order to facilitate extensive benchmarking across various scenarios.
5.2 Benchmark the generalizability of WixUp in supervised learning
Following the above experiment setup, first, we assess how WixUp boosts model performance in supervised learning across diverse scenarios, encompassing datasets, model architectures, tasks, and sensing modalities. Broadly, WixUp expands the training data size for supervised learning, consistently delivering notable enhancements over the no-augmentation result and surpassing other baseline augmentation techniques.
5.2.1 Generalize across datasets
The experimental results presented in Table 2 showcase the performance across three distinct datasets: MiliPoint, MARS, and MMFi, as elaborated in the experiment setup (Sec. 5.1.4). To make the training and evaluation data sizes equitable across datasets, we take a 20% subset of MiliPoint and 20% of MMFi; each dataset then has around 40k samples in total, split into 80% for training and 20% for testing. The numbers are MLE errors in cm for keypoint pose estimation, trained with the DGCNN model.
The initial observation, as indicated by the percentages in parentheses, is that all augmentation methods outperform Baseline Null (no augmentation); for example, (+26.95%) means that WixUp+ reduces the error from 28.89 to 21.10, i.e., by 26.95%. The exception is Baseline Stack, proposed by the original MiliPoint paper, which actually hurts accuracy; consequently, we refrain from running this baseline in subsequent benchmarking. Secondly, we observe that WixUp consistently delivers greater improvements than Baseline CGA. Furthermore, the integration of WixUp and CGA, denoted WixUp+, leads to further enhancements in most scenarios. However, as previously mentioned, the addition of CGA does not always yield improvement, potentially due to the unstable nature of random scaling in CGA, a topic we delve into in subsequent discussions. In summary, our method consistently delivers the most significant performance improvements across the three benchmarking datasets, underscoring its robustness and ability to generalize effectively across diverse datasets.
MiliPoint | MARS | MMFi |
Null | 24.36 | 28.89 | 28.53 |
Stack | 25.16(-3.25%) | 28.89(-0.02%) | 28.53(+0.01%) |
CGA | 23.26(+4.55%) | 22.89(+20.76%) | 26.25(+8.00%) |
WixUp | 23.25(+4.58%) | 22.65(+21.60%) | 25.84(+9.42%) |
WixUp+ | 23.10 (+5.18%) | 21.10(+26.95%) | 25.60(+10.28%) |
5.2.2 Generalize across model architectures
| DGCNN | | | PointTransformer | | |
| MiliPoint | MARS | MMFi | MiliPoint | MARS | MMFi |
Null | 24.36 | 28.89 | 28.53 | 17.15 | 25.64 | 24.69 | |
CGA | 23.26(+4.55%) | 22.89(+20.76%) | 26.25(+8.00%) | 16.88(+1.60%) | 20.74(+19.10%) | 24.03(+2.68%) | |
WixUp | 23.25(+4.58%) | 22.65(+21.60%) | 25.84(+9.42%) | 16.81(+1.97%) | 20.35(+20.64%) | 22.73(+7.96%) | |
WixUp+ | 23.10(+5.18%) | 21.10(+26.95%) | 25.60(+10.28%) | 16.67(+2.79%) | 21.35(+16.72%) | 23.26(+5.80%) |
The experimental results in Table 3 highlight the performance across two distinct model architectures, DGCNN and PointTransformer, run on all three datasets: MiliPoint, MARS, and MMFi. The numbers in the table are MLE errors in cm for keypoint pose estimation. As shown, both our methods and the baselines bring improvements across the two model architectures, and WixUp consistently yields larger improvements than Baseline CGA.
Moreover, the supplementary incorporation of CGA alongside WixUp, i.e., WixUp+, results in additional enhancements in most scenarios, except for PointTransformer on MARS and MMFi, where WixUp performs even better without the additional CGA. We believe the incorporation of CGA does not consistently result in improvements owing to the unstable nature of random scaling within CGA: despite its straightforward usage in this context, random scaling introduces significant randomness.
To elaborate, the chosen model architectures represent diverse approaches and have greatly influenced advancements in point cloud analysis: DGCNN captures local and global features through graph convolutions, while PointTransformer utilizes self-attention mechanisms inspired by transformers. Alongside these two, we also tested the PointNet++ model. However, we encountered difficulties reproducing the results reported in the original paper; our errors were significantly larger. As a result, we refrain from presenting it here as a valid setting for testing WixUp, although WixUp still outperformed the other baselines despite the overall high errors.
In summary, WixUp is robust and may even outperform its combination with CGA (WixUp+) because it is a stable method. Overall, this table demonstrates WixUp’s flexibility as a general module that can be integrated into downstream research regardless of the modeling method.
5.2.3 Generalize across human tracking tasks
Table 4 illustrates the performance across three human tracking tasks: keypoint pose estimation, action recognition, and user identification. The numbers in the table are task-specific metrics. The subset of MiliPoint used in benchmarking contains two unique users and 49 unique actions.
In brief, WixUp outperforms Baseline CGA in most cases. Furthermore, WixUp+, the combination of WixUp with CGA, leads to further improvements. Notably, while action recognition with WixUp alone falls short of CGA, WixUp+ surpasses CGA by a significant margin, reaching an 84% improvement. The reason could be that the overall accuracy in action recognition is low, so relative percentages can diverge dramatically.
The choice of keypoint pose estimation, action recognition, and user identification tasks for evaluation in wireless perception underscores their fundamental relevance in various applications. They play a crucial role in scenarios such as security surveillance and healthcare monitoring. The inclusion of these tasks in the evaluation demonstrates the potential of WixUp for widespread deployment in wireless perception applications.
Keypoint (MLE, cm) | Identification (Acc) | Action (Acc) |
Null | 24.36 | 0.8766 | 0.1305 |
CGA | 23.26(+4.55%) | 0.9097(+3.77%) | 0.2143(+64.17%) |
WixUp | 23.25(+4.58%) | 0.9031(+3.03%) | 0.1924(+47.44%) |
WixUp+ | 23.10(+5.18%) | 0.9229(+5.29%) | 0.2405(+84.25%) |
5.2.4 Generalize across sensing modality&data format
To assess WixUp on a different wireless sensing modality and a different raw-data format, we test on the acoustic dataset with its dedicated code base, using a CNN+LSTM model. As outlined in Sec. 5.1.4, this dataset comes from an acoustic sensing system designed for tracking 21 finger joints in 3D. To apply WixUp to acoustic range profiles, we simply bypass the step of simulating range profiles from coordinates and directly intersect the range profiles to generate new ones as model input. In contrast, Baseline CGA no longer applies directly to the range profile; instead, we keep the straightforward data augmentation method of slightly shifting the range profile along the range axis, and we use the original accuracy reported in the paper as a baseline for comparison with WixUp.
In summary, the original pose estimation yielded a mean absolute error (MAE) of 13.93 mm for user-dependent testing. Running the same user-dependent test with WixUp, we achieved an MAE of 10.56 mm, which closely approaches the best results reported in the paper, achieved with user-adaptive testing.
5.3 WixUp reduces labeling efforts by unsupervised domain adaptation via self-training
Beyond testing our data augmentation in standard supervised learning for wireless perception, we extend our objective to show its capability for self-training, founded on its scheme of mixing data to synthesize new data. In this subsection, we conduct another group of experiments illustrating WixUp’s capability to significantly reduce the effort of labeling data by utilizing unsupervised domain adaptation (UDA) through self-training.
It is worth mentioning that non-mixing methods like CGA and stacking cannot be applied to UDA for self-training, because they augment one data point at a time instead of mixing two data points from two distributions. In contrast, synthesizing data by mixing two points allows for the creation of more diverse and realistic samples, aiding domain adaptation via self-training.
In practice, this self-training process proves especially advantageous in wireless sensing, where acquiring new data often demands significant manual efforts, and system performance might significantly drop in new rooms or new users.
5.3.1 Unsupervised domain adaptation across users
In the context of user domain adaptation, the steps of UDA via self-training follow Sec. 4.3. In step 1, the model is trained using labeled data sourced from the training users, constituting the source domain. In step 2, the device is sold to a new user, who starts by providing some new data designated for self-training. Moving to step 3, the trained model performs inference on the unlabeled data from the new user, generating output predictions that are retained as pseudo-labels. Step 4 involves pairing one sample from the source domain with one from the new user’s data and mixing them via WixUp to form semi-labeled training data. Finally, in step 5, the newly created training data is used to fine-tune the trained model. Consequently, the model’s performance on future data from the new user is expected to improve. In other words, training and labeling are confined to the sensor developer’s effort; when a user buys a new device, they can effortlessly improve its performance without labeling their data, since the device may not be equipped with ground-truth cameras.
To prove this, we train and evaluate UDA across users in MiliPoint, leaving one user out at a time. The left of Fig. 4 depicts the MLE error in cm for keypoint pose estimation for six separate users. While some users, such as user 4, achieve strong model performance overall, we see consistent performance improvement with WixUp compared to without (w/o) it. The average improvement is 3.04%, ranging from 1.26% to 5.83%.
5.3.2 Unsupervised domain adaptation across environment
Beyond the need for new users, new environments are often another essential factor that impacts the sensing system’s performance. New environments here refer to new rooms, or the same room with furniture or nearby metal objects rearranged. The multi-path reflections from the environment might influence the received sensing signals, which could distort learning-based sensing algorithms. In the context of environment domain adaptation, the steps of UDA via self-training are similar to the cross-user process. In short, when employing the device in a new room, users can effortlessly improve the device’s performance without any labeling.
To validate this, we use the MMFi dataset because it clearly labels four scenes in data collection. Note that new scenes also mean new users in this dataset. Therefore, we train and evaluate UDA across environments in a leave-one-scene-out manner over the four scenes. The right of Fig. 4 depicts the MLE error in cm for keypoint pose estimation. All four scenes yield a large margin of performance improvement with (w/) WixUp over that without (w/o) WixUp. The average improvement is 17.45%, ranging from 15.04% to 21.73%.
In summary, this mixing-based approach for UDA via self-training has been empirically shown to close the domain gap and improve model generalization across users and environments. In future work, given a source dataset sharing the same label format as a target dataset, we could even self-train across datasets or modalities, without labels for the target dataset.
5.4 Ablation study of WixUp
Finally, this subsection presents ablation studies over WixUp’s algorithmic modules and hyper-parameters to validate their effectiveness.

MiliPoint | MARS | MMFi | |
Null | 24.36 | 28.89 | 28.53 |
vanilla | 24.29(+0.29%) | 22.8(+21.08%) | 25.93(+9.11%) |
+Bootstrap | 23.25(+4.58%) | 22.65(+21.60%) | 25.84(+9.42%) |
+CGA | 23.10 (+5.18%) | 21.10(+26.95%) | 25.60(+10.28%) |
5.4.1 Augmentation size
First, we investigate the impact of mixing distance, i.e., the number of frames between the two samples to be mixed. For example, distance=2 means each mixing pair consists of one real sample and its neighbor two frames ahead. Stacking pairs from varying distances yields a larger augmentation size and thus greater diversity in the data distribution. Although augmenting more data typically improves accuracy, there may be a turning point or plateau where improvements slow down. As shown in Fig. 5, we increase the distance from one to ten for keypoint estimation on MiliPoint, with the error reported as MLE in cm. With excessive augmentation the benefit slows down but, impressively, the error still decreases. The slowdown is possibly due to reduced mixing accuracy for distant pairs; it could also be because the benefit of enriching the data distribution reaches a limit, yielding to other primary bottlenecks such as the learning algorithm itself. To clarify, the experiments above augment only at a distance of one by default to keep them tractable; thus, we could expect further error reduction when extensively tuning the augmentation size for a single job.
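The pairing scheme can be summarized by the short sketch below, where `wixup_mix` again stands in for the mixing operation; stacking all distances up to a hypothetical `d_max` produces the enlarged augmentation sets studied in Fig. 5.

```python
def build_mixing_pairs(frames, labels, d_max, wixup_mix):
    """Mix each frame with its neighbor d frames ahead, for d = 1..d_max."""
    augmented = []
    for d in range(1, d_max + 1):
        for i in range(len(frames) - d):
            # distance=d: one real sample plus its neighbor d frames ahead
            augmented.append(wixup_mix((frames[i], labels[i]),
                                       (frames[i + d], labels[i + d])))
    return augmented
```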
5.4.2 Effectiveness of algorithmic components
We also perform several ablation studies to examine the contribution of each component of WixUp. To recap: 1) the vanilla version of WixUp performs an intersection operation on two range profiles; it first simulates each range profile from coordinates as a Gaussian mixture and then inversely maps the result back with probability-based angles. 2) To further increase the number of points generated by the intersection step, we randomly sample around the intersections, weighted by intersection quality, and also sample the original points with minimal weight; this process is referred to as bootstrapping. 3) Finally, as in the benchmark experiments, we use CGA (random scaling) as a baseline and add it on top of WixUp, denoted WixUp+. As shown in Table 5, each row represents the incremental adoption of these three versions of WixUp, with the Null baseline for comparison. The incremental error drops across rows demonstrate that each component contributes to model accuracy, validating the effectiveness of incorporating these components. In the early stages of our research, we also explored other components, such as a skewed Gaussian, which seemed theoretically promising but did not improve the results; we trusted our trial-and-error ablation results and did not integrate them into the final version of WixUp presented in this study.
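As a rough numerical illustration of these components under our reading of the text, the sketch below simulates each sample's range profile as a Gaussian mixture over its points' ranges, intersects the two profiles, bootstraps extra range samples weighted by intersection quality, and applies CGA-style random scaling on top. All parameters (sigma, grid extent, scale range) are illustrative assumptions, and the probability-based angle back-projection of the vanilla method is omitted for brevity.

```python
import numpy as np

def range_profile(points, grid, sigma=0.05):
    """Gaussian mixture over the points' ranges, evaluated on a range grid."""
    ranges = np.linalg.norm(points, axis=1)            # range of each point (m)
    return np.exp(-(grid[:, None] - ranges[None, :]) ** 2
                  / (2 * sigma ** 2)).sum(axis=1)

def mix_point_clouds(pts_a, pts_b, n_boot=32, rng=None):
    rng = rng or np.random.default_rng()
    grid = np.linspace(0.0, 5.0, 500)                  # 0-5 m range grid
    # 1) Vanilla: intersect the two simulated range profiles.
    inter = np.minimum(range_profile(pts_a, grid),
                       range_profile(pts_b, grid))
    # 2) Bootstrapping: sample extra ranges around strong intersections,
    #    using intersection quality as weights (assumes the profiles overlap).
    w = inter / inter.sum()
    boot_ranges = rng.choice(grid, size=n_boot, p=w)
    # The original points are also kept (with minimal weight in the paper).
    mixed = np.concatenate([pts_a, pts_b])
    # 3) CGA baseline on top of WixUp: random global scaling.
    return mixed * rng.uniform(0.9, 1.1), boot_ranges
```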
6 Discussion
In this paper, our focus is on human tracking tasks within wireless perception, because this popular field offers enough large open-source datasets and methods for comprehensive DA research. However, it is important to acknowledge that wireless perception encompasses a broader spectrum of applications beyond human tracking, for instance, healthcare monitoring, smart agriculture, and supply chain management. We hope to extend WixUp to these applications as they mature and more resources become available.
Moreover, WiFi is also a promising modality for wireless applications, many of which adhere to similar FMCW-based modulation and processing pipelines. However, costly hardware and excessive installation effort leave this area with little publicly available open-source data. For instance, [24, 62, 61] are notable works on this topic but published only their code. We hope WixUp can help open up research in this modality as well.
Besides FMCW, other types of signals exist, such as sine-wave signals analyzed via their phase or Doppler characteristics. Since we emphasize generalizability in this work, we surveyed wireless hardware user manuals and related work and chose to focus on the current settings; this ensures our DA broadly covers industrial and research wireless sensing systems. We encourage practitioners working with other signal types to develop custom DA on top of WixUp, e.g., by customizing the transformation to range profiles.
In our exploration of self-training, we conducted experiments involving cross-user and cross-environment scenarios within the same dataset. Self-training can also be powerful when mixing data from multiple datasets, but the lack of public information hinders us from investigating this here; specifically, the disclosure of ground-truth formats, such as joint order, is not always detailed. Nevertheless, it is a compelling direction for future research. Note that cross-dataset differences, such as hardware variations or signal parameters, do not impede cross-dataset self-training in our approach, owing to its ability to simulate a unified format for sensing data.
References
- [1] Anonymized acoustic sensing dataset paper. 2023.
- [2] Fadel Adib, Chen-Yu Hsu, Hongzi Mao, Dina Katabi, and Frédo Durand. Rf-capture: Capturing a coarse human figure through a wall. ACM Transactions on Graphics, 2015.
- [3] Fadel Adib, Zach Kabelac, Dina Katabi, and Robert C Miller. 3d tracking via body radio reflections. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation, pages 317–329, 2014.
- [4] Fadel Adib and Dina Katabi. See through walls with wifi! In Proceedings of the ACM Conference of the Special Interest Group on Data Communication, pages 75–86, 2013.
- [5] Karan Ahuja, Yue Jiang, Mayank Goel, and Chris Harrison. Vid2doppler: Synthesizing doppler radar data from videos for training privacy-preserving activity recognition. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–10, 2021.
- [6] Sizhe An, Yin Li, and Umit Ogras. mri: Multi-modal 3d human pose estimation dataset using mmwave, rgb-d, and inertial sensors. Advances in Neural Information Processing Systems, 35:27414–27426, 2022.
- [7] Sizhe An and Umit Y. Ogras. Mars: mmwave-based assistive rehabilitation system for smart healthcare. ACM Transactions on Embedded Computing Systems, 20(5s), September 2021.
- [8] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In European Conference on Computer Vision, pages 456–473. Springer, 2022.
- [9] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.
- [10] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
- [11] Tao Chen, Shilian Zheng, Kunfeng Qiu, Luxin Zhang, Qi Xuan, and Xiaoniu Yang. Augmenting radio signals with wavelet transform for deep learning-based modulation recognition. arXiv preprint arXiv:2311.03761, 2023.
- [12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
- [13] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9620–9629, 2021.
- [14] Xingyu Chen and Xinyu Zhang. Rf genesis: Zero-shot generalization of mmwave sensing through simulation-based data synthesis and generative diffusion models. In Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems, pages 28–42, 2023.
- [15] Yunlu Chen, Vincent Tao Hu, Efstratios Gavves, Thomas Mensink, Pascal Mettes, Pengwan Yang, and Cees GM Snoek. Pointmixup: Augmentation for point clouds. In European Conference on Computer Vision, pages 330–345. Springer, 2020.
- [16] William H Clark IV, Steven Hauser, William C Headley, and Alan J Michaels. Training data augmentation for deep learning radio frequency systems. The Journal of Defense Modeling and Simulation, 18(3):217–237, 2021.
- [17] Han Cui, Shu Zhong, Jiacheng Wu, Zichao Shen, Naim Dahnoun, and Yiren Zhao. Milipoint: A point cloud dataset for mmwave radar. Advances in Neural Information Processing Systems, 36, 2024.
- [18] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2918–2928, 2021.
- [19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
- [20] Sidhant Gupta, Daniel Morris, Shwetak Patel, and Desney Tan. Soundwave: using the doppler effect to sense gestures. In Proceedings of the Conference on Human Factors in Computing Systems, pages 1911–1914, 2012.
- [21] Eiji Hayashi, Jaime Lien, Nicholas Gillian, Leonardo Giusti, Dave Weber, Jin Yamanaka, Lauren Bedal, and Ivan Poupyrev. Radarnet: Efficient gesture recognition technique utilizing a miniature radar sensor. In Proceedings of the Conference on Human Factors in Computing Systems, pages 1–14, 2021.
- [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [23] Liang Huang, Weijian Pan, You Zhang, Liping Qian, Nan Gao, and Yuan Wu. Data augmentation for deep learning-based radio modulation classification. IEEE Access, 8:1498–1506, 2019.
- [24] Wenjun Jiang, Hongfei Xue, Chenglin Miao, Shiyang Wang, Sen Lin, Chong Tian, Srinivasan Murali, Haochen Hu, Zhi Sun, and Lu Su. Towards 3d human pose construction using wifi. In Proceedings of the Annual International Conference on Mobile Computing and Networking, pages 1–14, 2020.
- [25] Sihyeon Kim, Sanghyeok Lee, Dasol Hwang, Jaewon Lee, Seong Jae Hwang, and Hyunwoo J Kim. Point cloud augmentation with weighted local transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 548–557, 2021.
- [26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, volume 25, 2012.
- [27] Dogyoon Lee, Jaeha Lee, Junhyeop Lee, Hyeongmin Lee, Minhyeok Lee, Sungmin Woo, and Sangyoun Lee. Regularization strategy for point cloud via rigidly mixed sample. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15900–15909, 2021.
- [28] Dong Li, Jialin Liu, Sunghoon Ivan Lee, and Jie Xiong. Fm-track: pushing the limits of contactless multi-target tracking using acoustic signals. In Proceedings of the Conference on Embedded Networked Sensor Systems, pages 150–163, 2020.
- [29] Dong Li, Jialin Liu, Sunghoon Ivan Lee, and Jie Xiong. Lasense: Pushing the limits of fine-grained activity sensing using acoustic signals. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(1):1–27, 2022.
- [30] Ruihui Li, Xianzhi Li, Pheng-Ann Heng, and Chi-Wing Fu. Pointaugment: An auto-augmentation framework for point cloud classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6378–6387, 2020.
- [31] Chao Liu, Linlin Gao, Ruobing Jiang, and Zhongwen Guo. Acoustic-based 2-d target tracking with constrained intelligent edge device. Journal of Systems Architecture, 131:102696, 2022.
- [32] Cesar Iovescu and Sandeep Rao. The fundamentals of millimeter wave radar sensors. Texas Instruments, July 2020.
- [33] Wenguang Mao, Mei Wang, Wei Sun, Lili Qiu, Swadhin Pradhan, and Yi-Chao Chen. Rnn-based room scale hand motion tracking. In Proceedings of the Annual International Conference on Mobile Computing and Networking, pages 1–16, 2019.
- [34] Warren Morningstar, Alex Bijamov, Chris Duvarney, Luke Friedman, Neha Kalibhat, Luyang Liu, Philip Mansfield, Renan Rojas-Gomez, Karan Singhal, Bradley Green, et al. Augmentations vs algorithms: What works in self-supervised learning. arXiv preprint arXiv:2403.05726, 2024.
- [35] Rajalakshmi Nandakumar, Shyamnath Gollakota, and Nathaniel Watson. Contactless sleep apnea detection on smartphones. In Proceedings of the 13th annual international conference on mobile systems, applications, and services, pages 45–57, 2015.
- [36] Rajalakshmi Nandakumar, Vikram Iyer, Desney Tan, and Shyamnath Gollakota. Fingerio: Using active sonar for fine-grained finger tracking. In Proceedings of the Conference on Human Factors in Computing Systems, pages 1515–1525, 2016.
- [37] Kun Qian, Chenshu Wu, Zheng Yang, Yunhao Liu, and Kyle Jamieson. Widar: Decimeter-level passive tracking via velocity monitoring with commodity wi-fi. In Proceedings of the ACM International Symposium on Mobile Ad Hoc Networking and Computing, pages 1–10, 2017.
- [38] Kun Qian, Chenshu Wu, Yi Zhang, Guidong Zhang, Zheng Yang, and Yunhao Liu. Widar2.0: Passive human tracking with a single wi-fi link. In Proceedings of the Annual International Conference on Mobile Systems, Applications, and Services, pages 350–361, 2018.
- [39] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, volume 28, 2015.
- [40] Shivanand Venkanna Sheshappanavar, Vinit Veerendraveer Singh, and Chandra Kambhamettu. Patchaugment: Local neighborhood augmentation in point cloud classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2118–2127, 2021.
- [41] Connor Shorten, Taghi M Khoshgoftaar, and Borko Furht. Text data augmentation for deep learning. Journal of Big Data, 8(1):101, 2021.
- [42] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [43] Akash Deep Singh, Sandeep Singh Sandha, Luis Garcia, and Mani Srivastava. Radhar: Human activity recognition from point clouds generated through a millimeter-wave radar. In Proceedings of the 3rd ACM Workshop on Millimeter-wave Networks and Sensing Systems, pages 51–56, 2019.
- [44] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
- [45] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114, 2019.
- [46] Peng Wang and Manuel Vindiola. Data augmentation for blind signal classification. In MILCOM 2019 - 2019 IEEE Military Communications Conference (MILCOM), pages 305–310, 2019.
- [47] Penghao Wang, Ruobing Jiang, and Chao Liu. Amaging: Acoustic hand imaging for self-adaptive gesture recognition. In Proceedings of the IEEE Conference on Computer Communications, pages 80–89, 2022.
- [48] Saiwen Wang, Jie Song, Jaime Lien, Ivan Poupyrev, and Otmar Hilliges. Interacting with soli: Exploring fine-grained dynamic gesture recognition in the radio-frequency spectrum. In Proceedings of the Annual Symposium on User Interface Software and Technology, pages 851–860, 2016.
- [49] Wei Wang, Alex X Liu, and Ke Sun. Device-free gesture tracking using acoustic signals. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, pages 82–94, 2016.
- [50] Yanwen Wang, Jiaxing Shen, and Yuanqing Zheng. Push the limit of acoustic gesture recognition. IEEE Transactions on Mobile Computing, 2020.
- [51] Aoran Xiao, Jiaxing Huang, Dayan Guan, Kaiwen Cui, Shijian Lu, and Ling Shao. Polarmix: A general data augmentation technique for lidar point clouds. Advances in Neural Information Processing Systems, 35:11035–11048, 2022.
- [52] Jiahong Xie, Hao Kong, Jiadi Yu, Yingying Chen, Linghe Kong, Yanmin Zhu, and Feilong Tang. mm3dface: Nonintrusive 3d facial reconstruction leveraging mmwave signals. In Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services, pages 462–474, 2023.
- [53] Jianfei Yang, He Huang, Yunjiao Zhou, Xinyan Chen, Yuecong Xu, Shenghai Yuan, Han Zou, Chris Xiaoxuan Lu, and Lihua Xie. Mm-fi: Multi-modal non-intrusive 4d human dataset for versatile wireless sensing. Advances in Neural Information Processing Systems, 36, 2024.
- [54] Zhizheng Yang, Xun Wang, Dongyu Xia, Wei Wang, and Haipeng Dai. Sequence-based device-free gesture recognition framework for multi-channel acoustic signals. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2023.
- [55] Jianlong Yuan, Yifan Liu, Chunhua Shen, Zhibin Wang, and Hao Li. A simple baseline for semi-supervised semantic segmentation with strong data augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8229–8238, 2021.
- [56] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
- [57] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
- [58] Jie Zhang, Zhanyong Tang, Meng Li, Dingyi Fang, Petteri Nurmi, and Zheng Wang. Crosssense: Towards cross-site and large-scale wifi sensing. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, MobiCom '18, pages 305–320, New York, NY, USA, 2018. Association for Computing Machinery.
- [59] Xiaotong Zhang, Zhenjiang Li, and Jin Zhang. Synthesized millimeter-waves for human motion sensing. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, SenSys '22, pages 377–390, New York, NY, USA, 2023. Association for Computing Machinery.
- [60] Yi Zhang, Yue Zheng, Kun Qian, Guidong Zhang, Yunhao Liu, Chenshu Wu, and Zheng Yang. Widar3.0: Zero-effort cross-domain gesture recognition with wi-fi. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021.
- [61] Mingmin Zhao, Tianhong Li, Mohammad Abu Alsheikh, Yonglong Tian, Hang Zhao, Antonio Torralba, and Dina Katabi. Through-wall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7356–7365, 2018.
- [62] Mingmin Zhao, Yonglong Tian, Hang Zhao, Mohammad Abu Alsheikh, Tianhong Li, Rumen Hristov, Zachary Kabelac, Dina Katabi, and Antonio Torralba. Rf-based 3d skeletons. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 267–281, 2018.
- [63] Peijun Zhao, Chris Xiaoxuan Lu, Bing Wang, Niki Trigoni, and Andrew Markham. Cubelearn: End-to-end learning for human motion recognition from raw mmwave radar signals. IEEE Internet of Things Journal, 2023.
- [64] Xiaopeng Zhao, Zhenlin An, Qingrui Pan, and Lei Yang. Nerf2: Neural radio-frequency radiance fields. In Proceedings of the 29th Annual International Conference on Mobile Computing and Networking, pages 1–15, 2023.
- [65] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13001–13008, 2020.