
Evaluating Large Language Models as Virtual Annotators for Time-series Physical Sensing Data

Aritra Hota (Indian Institute of Technology Kharagpur, India), Soumyajit Chatterjee (Nokia Bell Labs, Cambridge, United Kingdom), and Sandip Chakraborty (Indian Institute of Technology Kharagpur, India)
Abstract.

Traditional human-in-the-loop annotation for time-series data, like inertial data, often requires access to alternate modalities, like video or audio from the environment. These alternate sources provide the necessary information to the human annotator, as the raw numeric data is often too obfuscated even for an expert. However, this traditional approach raises several concerns surrounding overall cost, efficiency, storage of the additional modalities, time, scalability, and privacy. Interestingly, recent large language models (LLMs) are trained on vast amounts of publicly available alphanumeric data, which allows them to comprehend and perform well on tasks beyond natural language processing. This opens up a potential avenue for using LLMs as virtual annotators, where the LLM is directly provided with the raw sensor data for annotation instead of relying on any alternate modality, thereby mitigating the problems of the traditional human-in-the-loop approach. Motivated by this observation, we perform a detailed study in this paper to assess whether state-of-the-art (SOTA) LLMs can be used as virtual annotators for labeling time-series physical sensing data. To perform this in a principled manner, we segregate the study into two major phases. In the first phase, we investigate the challenges an LLM like GPT-4 faces in comprehending raw sensor data. Considering the observations from Phase 1, in the next phase we investigate the possibility of encoding the raw sensor data using SOTA self-supervised learning (SSL) approaches and utilizing the projected time-series data to get annotations from the LLM. Detailed evaluation with four benchmark HAR datasets shows that SSL-based encoding and metric-based guidance allow the LLM to make more reasonable decisions and provide accurate annotations without requiring computationally expensive fine-tuning or sophisticated prompt engineering.

Large Language Models, Human-in-the-Loop, Time-series Data

1. Introduction

Human activity recognition (HAR) is now ubiquitous, with applications in domains ranging from personal healthcare monitoring and sports performance analysis to intelligent living assistance. Traditional activity and context-sensing applications primarily rely on accurate labels to train robust supervised machine learning (ML) models. Conventional methods for obtaining high-quality labels depend on human-in-the-loop approaches assisted by sophisticated techniques like active learning (Hossain and Roy, 2019) and experience sampling (Laput and Harrison, 2019). Notably, one of the primary factors behind involving humans in the labeling process is to create a natural transfer of knowledge from humans to the final ML models. Barring some inherent mislabeling problems, missing labels, and label jitter (Kwon et al., 2019; Zeni et al., 2019), the human-in-the-loop approach often provides accurate labels. This is mainly because the human annotators, if chosen correctly, have significant knowledge regarding the activities or the locomotion involved.

Figure 1. Traditional human-in-the-loop annotation scheme

Interestingly, humans cannot perceive raw physical sensor data, such as the IMU data collected from a body-worn device like a smartwatch, smart glass, or earable, for activity recognition. Therefore, the conventional method is to use one of the alternate modalities from the environment, like video or audio, as an input context for the human annotator to obtain the labels (Chatterjee et al., 2023) (see Figure 1). Although the existing literature uses human-generated annotations obtained from such alternate modalities as the gold standard to train and test models developed over physical sensing (i.e., IMU) data, they often come with several challenges and implementation limitations. (1) The data obtained from the alternate modality (like the video/audio) needs to be perfectly time-synced with the primary modality, i.e., the IMU; otherwise, the annotated labels might not align with the corresponding activities. Although existing approaches try to handle such synchronization issues (Zhang et al., 2020), perfectly aligning the activity boundaries in the continuous activity space is severely challenging. (2) There might be perceptual bias from the annotators when the visible activity signatures have only minute differences in the auxiliary modality (say, running vs. jogging), whereas the physical sensing data have significant differences (i.e., running introduces higher intensity than jogging in the acceleration signatures). Considering these challenges, we ask the following question in this paper: Rather than depending solely on the auxiliary modality for ground truth annotation, can we augment the annotation process with information extracted from the physical sensing data to avoid human-induced bias, as discussed above?

Interestingly, most recent large language models (LLMs) have also been trained with vast knowledge from publicly available data, potentially containing information from numeric data beyond the usual text data (Liu et al., 2023; Yao et al., 2022). This, in turn, opens up an avenue to investigate whether these LLMs can directly annotate the raw sensor data without using alternate modalities. To assess this, we first set up a detailed pilot study (Section 4) whereby we analyze whether the state-of-the-art (SOTA) LLMs like GPT-4 can quickly identify the labels (the activity classes) in a multi-class classification setup. This study shows that although GPT-4 can understand the accelerometer signatures, it fails to separate them across different classes even with a fixed number of bootstrap samples as example observations. Notably, some recent studies like (Liu et al., 2023) have already attempted to investigate how LLMs can be used with sensor data streams and reached similar conclusions.

One important alternative to resolving this concern can be efficient fine-tuning of the LLMs to achieve higher accuracy in the downstream task of recognizing the activity classes (Zhang et al., 2023b). However, fine-tuning these large models with a billion parameters can be challenging given their compute and data requirements (Gao et al., 2024). To avoid this, a few works like (Kim et al., 2024) have also looked into prompt-tuning as a viable alternative where the main objective is to include more context information as a part of the prompt, which in turn could provide the LLM with some additional information regarding the data. Although this prompt tuning approach has been seen to work well with specialized medical sensors (Kim et al., 2024), images (Qu et al., 2023), and videos (Zhang et al., 2023a), they have often failed to achieve the desired accuracy for locomotive sensor data (Jin et al., 2024) due to its inherent lack of enough context information.

Figure 2. Proposed setup for evaluating LLMs with encodings obtained from an encoder pre-trained in a self-supervised manner without access to labeled data (Phase 2).

Interestingly, works like (Liang et al., 2022; Belyaeva et al., 2023) have investigated a different approach to designing specialized encoders that can provide meaningful representations to the LLMs, which then can be interpreted by the LLM to offer accurate responses to the downstream task. However, unlike for modalities like images, IMU data is often marred with noises and body-position-based domain heterogeneity (Jain et al., 2022). This, in turn, has led to the unavailability of robust pre-trained encoders for locomotive data, which, in particular, is more challenging when no labels are available and we rely solely on the LLM for annotating the raw data.

Notably, papers like (Tang et al., 2020; Ramamurthy et al., 2022; Jain et al., 2022) have shown that self-supervised pre-training can create robust encoders which, when fine-tuned with a small amount of bootstrap data, can achieve significant accuracy. More specifically, this pre-training step often exploits the idea of contrastive learning (Tang et al., 2020; Jain et al., 2022), which allows the encoder to learn robust representations without requiring labeled data. Motivated by this idea, we design a study in this paper to evaluate LLMs with inertial data encoded using a self-supervised pre-trained encoder and a supporting prompt that includes a few bootstrap examples (see Figure 2). A principled evaluation on four benchmark human activity recognition datasets highlights the overall consistency and the boost in accuracy of the annotations provided by GPT-4 without any need for fine-tuning or labeled data. We also highlight the cost and time required for using LLMs as virtual annotators. This shows that, as virtual annotators, LLMs can help us reduce the overall cost and time of human annotation and eliminate the cost associated with recording and storing data from alternate modalities while reducing the risk of privacy leaks. Finally, we discuss some critical future directions that can be explored given the prospective use of LLMs as virtual annotators and how such systems can be standardized in the general pipeline of developing sophisticated ML applications and services. In summary, the significant contributions of this paper are as follows.

  1. (1)

    Exploring LLM’s Capability as Virtual Annotators of the Physical Sensing Data: We perform a thorough analysis to explore whether SOTA LLMs, like GPT-4, can work as virtual annotators to label physical sensing data, with locomotive data as a use-case for the downstream task of human mobility annotation. We show that although LLMs can understand locomotive data like a triaxial accelerometer signature, they fail to properly annotate the data with the corresponding mobility classes, even when sample bootstrap examples are provided along with the contextual information in the prompt.

  2. (2)

    Extending LLM’s Capability as Virtual Annotators of the Physical Sensing Data: Based on the findings from the pilot study, we develop an approach to encode the locomotive data in an embedding space that inherently learns the signatures of different classes and thus provides rich contextual information to the LLMs for annotating the accelerometer signatures with the latent embedding of the raw locomotive data. We explore self-supervised contrastive learning with two different setups for this purpose – (a) with time-domain signatures only and (b) with both time and frequency-domain signatures. To the best of our knowledge, this is the first work that helps LLMs enhance their capability for annotating locomotive data from the raw accelerometer with the corresponding human mobility classes without requiring any retraining.

  3. (3)

    Thorough Evaluation, Analysis, and Discussion: We perform a detailed investigation using four benchmark HAR datasets of how the encoded accelerometer data offers better reasoning for the LLM to provide accurate annotations with an increasing number of examples. Notably, with paid LLMs like GPT-4, a performance assessment with accuracy as the key highlight is not sufficient, as factors like the number of tokens and rate limits restrict the practical usage of such a service for annotating a dataset. Understanding this, we also provide a time and cost analysis, summarize the trade-offs, and discuss the advantages of using LLMs as virtual annotators for time-series physical sensing data. Additionally, we investigate the inherent limitations of using unimodal sensor data from a triaxial accelerometer with a comprehensive analysis considering more than two classes.

The rest of the paper is organized as follows. Section 2 provides a detailed study of the related literature in exploring LLMs for different practical use cases of time-series data and using self-supervised models for time-series data analysis. Section 3 briefly discusses the datasets used for this study. We next perform a thorough pilot study to analyze LLM’s capability in annotating time-series locomotive data, as mentioned in Section 4. Based on the findings, we next explore in Section 5 the use of self-supervised contrastive learning to encode the locomotive data in an embedding space, which can help LLMs annotate the data correctly. Section 6 provides the implementation details and analyzes the empirical results to highlight the significant observations. We discuss our broad learning and the critical insights from this study in Section 7. Finally, Section 8 concludes the paper.

2. Related Work

Before diving deeper into the study’s design, we first perform a detailed survey of recent works investigating applications of LLMs to time-series data.

2.1. Using LLMs for Understanding Time-Series Data

One of the most critical evaluations that SOTA LLMs have faced concerns their understanding of data modalities other than natural language. Works like ELIXR (Xu et al., 2023c), multimodal approaches like Med-PaLM (Tu et al., 2024), and HeLM (Belyaeva et al., 2023) have explored using LLMs with other data modalities like images and tabular data. Interestingly, unlike image or text data, time-series data is often much more challenging to interpret (Jin et al., 2024). Naturally, it is crucial to investigate whether SOTA LLMs, trained on a massive volume of alphanumeric data, can comprehend time-series data. Subsequently, works like (Liu et al., 2023; Xue and Salim, 2023; Kim et al., 2024; Zhang et al., 2024; Xu et al., 2023a; Jin et al., 2023) have started looking into how LLMs can assist in solving tasks that involve time-series data. Notably, many of these works used examples to provide more context to the LLM to gain assistance in predicting the final output.

Interestingly, works like (Liu et al., 2023) used the raw sensor data as examples; however, the idea did not show consistent performance with increasing classes. More recent works, like (Brown et al., 2020; Kim et al., 2024; Xue et al., 2022; Xu et al., 2023a), have attempted to provide more guidance and examples through in-context learning, which allows the inclusion of more information within the prompt and does not necessarily need any fine-tuning of the LLM parameters. This approach also has the immense benefit of allowing the transfer of human expertise to the LLM (Xu et al., 2023a; Xie and Min, 2022). Furthermore, works like (Kim et al., 2024; Xu et al., 2023a) also show the importance of providing signal-processing (Xu et al., 2023a) or interpretation-based guidance (Kim et al., 2024), which allows the model to make predictions regarding the query sample. However, one key aspect we observe in most of these works is that the input time-series comes from specialized sensors, like medical (Kim et al., 2024) or temperature sensors (Xu et al., 2023a), for which expert guidance is possible. For example, the prompt can describe what a typical high heart rate means and how it can be observed in the ECG signal. However, with more generalized sensors like accelerometers, it is hardly possible to provide such expert guidance directly, given the inherent nature of the sensor data itself. To mitigate this, works like (Jin et al., 2024) have discussed the idea of using time-series encoders, albeit with a proposal to align the time-series features with language-specific representations, which in turn would need further fine-tuning of the LLM. In this study, however, we explore using a time-series encoder to provide more context to the LLM without fine-tuning. Based on this objective, we assess whether an LLM like GPT-4 can be used as a service to obtain annotations for unlabeled accelerometer data for HAR.

2.2. Self-supervised Contrastive Learning with Time-series Data

The human-in-the-loop annotation scheme is expensive, inefficient (Wu et al., 2022), and also known to be noisy (Zeni et al., 2019; Kwon et al., 2019). Owing to these concerns, in the recent past there has been a plethora of works on self-supervised learning (SSL), which proposes the development of label-efficient models that exploit the patterns present inside the data to generate robust representations (Chen et al., 2020). These representations can eventually be used to fine-tune a classifier with a small amount of labeled data (Chen et al., 2020). More specifically, works like SimCLR (Chen et al., 2020) and SwAV (Caron et al., 2020) explored the idea of contrastive learning, where the objective is to train an encoder that clusters similar samples while separating dissimilar samples in the embedding space without the need to access labeled data. To achieve this with image data, these works often relied on different augmentation approaches like cropping, rotations, adding noise, scaling, etc., to generate augmented versions of the same data and then use them to train the encoder in a self-supervised manner without labels (Jaiswal et al., 2020).

Although originally explored for images, recent works like (Ramamurthy et al., 2022; Tang et al., 2020; Jain et al., 2022; Deldari et al., 2021, 2022; Zhang et al., 2022; Xu et al., 2023b) have extended this idea to sensor time-series data. Most of these works explored different ideas for performing contrastive learning by doing time-series specific augmentations like jitter, noise addition, etc. (Tang et al., 2020), whereas works like (Jain et al., 2022) exploited the multi-device setting to find similar and dissimilar samples. Interestingly, works like (Zhang et al., 2022) and (Xu et al., 2023b) have revisited the approach by looking into frequency and subsequence (or motifs) based information as well. Notably, the overall approach of using an encoder pre-trained using the self-supervised contrastive learning-based approach is extremely efficient and robust, especially for wearable sensing-related tasks (Haresamudram et al., 2022; Qian et al., 2022; Spathis et al., 2022; Dhekane et al., 2023; Yuan et al., 2022). Motivated by these observations and the potential of SSL in a contrastive framework, we in this paper analyze the possibility of using encoded accelerometer data for obtaining high-quality annotations from SOTA LLMs without the need for computationally expensive fine-tuning of the LLM or sophisticated prompt engineering.

3. Datasets Used for the Analysis

From the detailed related work on analyzing time-series data using LLMs and a background study on SSL, mainly focusing on HAR, we chose four benchmark datasets for the studies conducted in the remainder of this paper. Before we move into the details of the studies, in this section, we summarize the dataset details and how we use them in this study. A brief description of the datasets can be seen in Table 1, and the details follow.

Table 1. Details of the datasets used in this analysis. In this setup, we hold out a set of participants for generating the test queries used to evaluate the LLM's performance as a virtual annotator.
| Dataset | #Training Samples | #Testing Samples | Selected Device Position | Chosen Classes | Sampling Frequency |
| --- | --- | --- | --- | --- | --- |
| MotionSense (Malekzadeh et al., 2019) | 337171 | 72728 | Trousers’ front pocket | Jogging & Upstairs | 50 Hz |
| PAMAP2 (Reiss, 2012) | 1931748 | 509695 | Hand | Running & Walking | 100 Hz |
| UCI HAR (Reyes-Ortiz Jorge, Anguita Davide, Ghio Alessandro, Oneto Luca, Parra Xavier, 2012) | 164718 | 47033 | Waist | Standing & Walk upstairs | 50 Hz |
| HHAR (Blunck Henrik, Bhattacharya Sourav, Prentow Thor, Kjrgaard Mikkel, and Dey Anind, 2015) | 3020605 | 521516 | Arm | Standing & Stairs up | 50–200 Hz |
  1. (1)

    MotionSense – The MotionSense dataset (Malekzadeh et al., 2019) consists of data from 24 participants performing 6 different activities (Jogging, Sitting, Standing, Walk upstairs, Walk downstairs, and Walking). The data was captured from the embedded IMU sensors (only the readings from the triaxial accelerometer) of a smartphone kept in the trousers’ front pocket of the participants. The accelerometer readings were sampled at a frequency of 50 Hz. Notably, the dataset is divided into 16 different trials, and we considered the last trial for each activity for the analysis. We have considered “Jogging” and “Upstairs” as the activity classes and subjects 1, 6, 14, 19, and 23 as our test subjects. The rest of the subjects across all the activity classes constitute the training set.

  2. (2)

    PAMAP2 – The PAMAP2 dataset (Reiss, 2012) comprises data from 9 participants performing 18 different physical activities, of which 6 are optional. Out of these, we have considered the 12 primary activities (Ascending stairs, Cycling, Descending stairs, Ironing, Lying, Nordic walking, Rope jumping, Running, Sitting, Standing, Vacuum cleaning, and Walking). While collecting this dataset, the participants were equipped with 3 IMUs, each with a sampling rate of 100 Hz, wirelessly attached to 3 body positions: the dominant hand (wrist), chest, and ankle. However, in our evaluation, we have considered only the data captured by the IMU worn on the hand. Additionally, we have considered “Running” and “Walking” as the activity classes and subjects 101 and 108 as the test subjects. The rest of the subjects, across all 12 activity classes, comprise the training set.

  3. (3)

    UCI HAR – The UCI HAR dataset (Reyes-Ortiz Jorge, Anguita Davide, Ghio Alessandro, Oneto Luca, Parra Xavier, 2012) contains inertial data for 6 different activities (Laying, Sitting, Standing, Walking, Walking downstairs, and Walking upstairs). The participants performed these activities while wearing a smartphone around their waist. The IMU has a sampling rate of 50 Hz. For our setup, we have chosen “Standing” and “Walking upstairs” as the activity classes and subjects 1, 3, 5, 7, 9, 11, 13, 15, and 17 as the test subjects, whereas all other remaining subjects across all activity classes are part of the training set.

  4. (4)

    HHAR – The HHAR dataset (Blunck Henrik, Bhattacharya Sourav, Prentow Thor, Kjrgaard Mikkel, and Dey Anind, 2015) contains data recorded from 9 participants performing 6 activities (Biking, Sitting, Standing, Stair down, Stair up and Walking). The participants were given 8 smartphones and 4 smartwatches. For this study, we have used the accelerometer readings from the smartwatch worn on one hand of the user. The accelerometer readings were sampled at a frequency of 50-200 Hz depending on the device used. We have considered “Standing” and “Stair up” as our activity classes and subjects 3 and 9 as our test subjects while considering all other remaining participants across all activities for the training set.
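To make the evaluation protocol concrete, the following is a minimal sketch of the held-out-subject split described above, assuming the data has been consolidated into a pandas DataFrame; the file name and the column names ("subject", "activity") are hypothetical placeholders rather than the actual field names shipped with any of the datasets.

```python
import pandas as pd

# Hypothetical consolidated file and column names; each real dataset ships in its
# own format and needs its own loader.
df = pd.read_csv("motionsense_accelerometer.csv")

TEST_SUBJECTS = {1, 6, 14, 19, 23}          # MotionSense hold-out subjects (Table 1)
CHOSEN_CLASSES = {"Jogging", "Upstairs"}    # classes used for the annotation queries

# Pre-training set: all remaining subjects across *all* activity classes.
train_df = df[~df["subject"].isin(TEST_SUBJECTS)]

# Annotation queries: held-out subjects restricted to the two chosen classes.
test_df = df[df["subject"].isin(TEST_SUBJECTS) & df["activity"].isin(CHOSEN_CLASSES)]
```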

4. Phase 1: Raw Sensor Data for Virtual Annotation

Figure 3. Proposed setup for the motivational experiments (Phase 1)

Before analyzing how different encoding schemes can impact the understanding of sensor data by the SOTA LLMs, we first investigate the key scenarios in which SOTA LLMs fail or succeed in providing quality labels when the given input is the raw inertial data. As shown in Figure 3, this is significantly different from the traditional human-in-the-loop annotation scheme, as the LLM is directly provided with the raw sensor data samples instead of any alternative modality like video or audio from the environment. The details of the pilot experiments conducted in this phase, the basic setup, and the key takeaways are summarized as follows.

4.1. Design of the Pilot Study

To assess the potential of LLMs as virtual annotators, we start by designing a set of experiments. We choose the PAMAP2 dataset (Reiss, 2012) described in Section 3. We use ChatGPT-4 (or GPT-4) as the representative LLM in the remainder of the paper to highlight the idea of using a paid, black-box LLM for obtaining annotations. The details of the experimental setup follow.

4.1.1. Setup

For the analysis, (a) we first choose the simplest binary class annotation task with data selected for the ground-truth activity classes of running and walking. (b) Next, we assess the quality of labels generated by the LLM by providing some bootstrap examples containing sensor data with ground-truth labels sampled across all subjects. A visual representation of the setup has been shown in Figure 3.

4.1.2. Queries

For obtaining the labels as a response from the LLM, we query it using a set of designed queries, including the raw sensor data and the body position-based information. The template for the queries we use is as follows.

“Classify the following triaxial accelerometer data in meters per second square as either walking or running provided that this data is coming from the wearable worn by the user on their dominant hand: $[q_x\ q_y\ q_z]$. Answer in one word, either walking or running.”

Here, $[q_x\ q_y\ q_z]$ is the query sensor sample from the triaxial accelerometer. Notably, many recent works like (Brown et al., 2020; Liu et al., 2023) often provide examples (few-shots) within the prompt to allow the LLM to gain additional context regarding the data. In this study, we first provide a small amount of bootstrap raw accelerometer data with class labels as examples. To include these examples, we modify the query template by adding the following at the beginning.

“Given the following triaxial accelerometer data in meter per second square coming from the wearable worn by the user on their dominant hand corresponds to running : $[s_{x_1}\ s_{y_1}\ s_{z_1}]$ Given the following triaxial accelerometer data in meter per second square coming from the wearable worn by the user on their dominant hand corresponds to walking : $[p_{x_1}\ p_{y_1}\ p_{z_1}]$”

Here, $[s_{x_1}\ s_{y_1}\ s_{z_1}]$ and $[p_{x_1}\ p_{y_1}\ p_{z_1}]$ are the example sensor samples chosen across different subjects for the two different classes. We vary the number of examples to observe the change in the quality of annotations.
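For reference, the following is a minimal sketch of how such a few-shot query can be assembled and sent to GPT-4, assuming the openai (>=1.0) Python client and an OPENAI_API_KEY environment variable; the numeric values are illustrative placeholders, not readings taken from PAMAP2.

```python
from openai import OpenAI  # assumes the openai>=1.0 client; reads OPENAI_API_KEY from the environment

client = OpenAI()

def build_prompt(run_examples, walk_examples, query):
    """Assemble the Phase-1 few-shot prompt from raw accelerometer triples."""
    lines = []
    for s in run_examples:
        lines.append("Given the following triaxial accelerometer data in meter per second "
                     "square coming from the wearable worn by the user on their dominant "
                     f"hand corresponds to running : [{s[0]} {s[1]} {s[2]}]")
    for p in walk_examples:
        lines.append("Given the following triaxial accelerometer data in meter per second "
                     "square coming from the wearable worn by the user on their dominant "
                     f"hand corresponds to walking : [{p[0]} {p[1]} {p[2]}]")
    lines.append("Classify the following triaxial accelerometer data in meters per second "
                 "square as either walking or running provided that this data is coming from "
                 f"the wearable worn by the user on their dominant hand: [{query[0]} {query[1]} "
                 f"{query[2]}]. Answer in one word, either walking or running.")
    return "\n".join(lines)

# Illustrative values only, not actual PAMAP2 samples.
prompt = build_prompt(run_examples=[(5.11, 1.80, 7.98)],
                      walk_examples=[(1.02, 0.35, 9.61)],
                      query=(4.87, 1.12, 8.43))
response = client.chat.completions.create(model="gpt-4",
                                           messages=[{"role": "user", "content": prompt}])
print(response.choices[0].message.content)
```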

4.2. Key Observations

Figure 4. Subject-wise mapping accuracy in PAMAP2 obtained using GPT-4 with different numbers of examples drawn across different subjects (a) 3 examples, (b) 10 examples, and (c) 25 examples. Only those subjects who have both the ground-truth activity classes present in the dataset have been included in this analysis.
Figure 5. Response bias exhibited by GPT-4 for the data from PAMAP2 for binary classes with different numbers of examples drawn across different subjects: (a) 3 examples, (b) 10 examples, and (c) 25 examples.

We next summarize the key observations in terms of accuracy of the responses and also look into whether GPT-4 specifically predicts one class more than the other in a biased manner (a.k.a response bias).

4.2.1. Mapping Sensor Data to Labels

We first look at how well the LLM (here, GPT-4) can map the raw sensor data to the associated physical activity labels. Interestingly, we see that for certain subjects with specific classes, the mapping accuracy is quite appreciable (see Figure 4), especially when the task involves only binary classes. This further motivated us to see whether the mapping accuracy can be improved by giving the LLM more examples from the different classes. However, we observed that with examples, this mapping accuracy did not change significantly. In this paper, we investigate this further by – (a) looking into the consistency of the generated labels with an increasing number of examples and (b) checking whether the responses themselves are biased based on the numerical values that the LLM receives in the query and the examples. The summary of these observations is provided as follows.

4.2.2. Consistency with Increasing Examples

One of the primary factors behind relying on human annotators has been consistency. This usually relates to the overall consistent improvement in the quality of the annotations with an increasing number of bootstrap samples provided as examples during the labeling process. However, in this case, we observe that GPT-4 lacks such consistency. More specifically, we do not see any significant correlation between the mapping accuracy and the number of examples (see Figure 4).

4.2.3. Response Bias

Finally, we also analyzed the responses obtained from the LLM for bias. Notably, we see (from Figure 5) that irrespective of the number of classes or examples provided, the LLM is mostly biased towards labeling the majority of the sensor data as ‘walking’. This tells us that without any specialized encoding of the raw sensor data, the chance that the LLM will map the data to wrong labels is appreciably high. Additionally, this also highlights that the significant mapping accuracy for certain classes might be entirely due to this biased mapping generated by the LLM.

4.2.4. Reasoning

With GPT-4, for the majority of the cases, we obtained responses similar to the following. For example, for subject 101 with ground-truth ‘running’, one of the responses was as follows.

“Sorry, as an AI model, I’m not able to classify real-time activities based on raw accelerometer data instantly. This type of classification typically involves training a machine learning model on large datasets to recognize the patterns associated with different activities. If you’ve trained such a model, you should input this data there.”

Interestingly, for some cases like subject 105 running, even after giving 25 examples each for ‘running’ and ‘walking’, the LLM suspected an outlier sample and gave the following response.

“Based on the provided data, it is not possible to accurately determine the activity (running or walking) directly, as the new reading [5.1111 1.7958 7.9847] falls outside the range of values specified for both running and walking. However, If the above data is the only information we have and we need to make a simplified assumption based on it, we might say the data could be closer to ‘walking’ since the X and Y-axis readings are closer to the walking readings than the running ones, although it is a high-level assumption and not very accurate due to the significant deviation on the Z-axis.”

4.3. Lessons Learnt

One of the critical observations we get through this set of motivational experiments in Phase 1 is that although GPT-4 can understand the mapping of the values to the individual axes of the accelerometer, it cannot comprehend the accurate mapping between the input sample and the final label. Notably, we clearly observed that even with examples, the reasoning given by GPT-4 does not improve, and it lacks consistency in answering.

5. Phase 2: Encoding-based In-context Learning

Figure 6. The broad overview of the setup that the designed study investigates. This shows an example scenario with data from two classes presented for the task of annotation.

Objective and Broad Intuition: One of the critical concerns we saw in Phase 1 was the lack of improvement in the responses generated by the LLM, even with additional examples provided in the prompt. Considering this, we, in this phase, attempt to project the raw sensor data into an embedding space such that the examples provide enough context information to the LLM through their neighborhood information. We proceed in this direction by first including expert knowledge on computing distances in the prompt and then looking into different encoding strategies. A summary of the phases is provided in Figure 6, and the details follow.

5.1. Adding Context with Metric-based Guidance

Recent works like (Xu et al., 2023a) have shown that one of the key ideas for obtaining accurate responses from the LLM is to provide expert knowledge to the LLM through the prompt itself. As our broad intuition is to exploit the neighborhood information in the embedding space, we specifically instruct the LLM in the prompt to consider the distance between the example and the query samples (see Section 6.2). This paper investigates this metric-based guidance in depth with different standard distance metrics.

5.2. Encoding the Raw Sensor Data

As mentioned above, one of the critical aspects of this encoding scheme is to project the raw sensor data into an embedding space where we can exploit the neighborhood of the samples to retrieve the exact label given a query sample. Interestingly, we see that self-supervised contrastive learning allows us to achieve this even without access to the ground-truth labels (Jaiswal et al., 2020; Chen et al., 2020; Spathis et al., 2022; He et al., 2020; Yuan et al., 2022). Broadly, self-supervised contrastive learning algorithms for time-series data can be categorized into two broad categories – (a) methods that look into the time-domain aspects of the signal (Tang et al., 2020; Ramamurthy et al., 2022), and (b) methods that look beyond the time domain into frequency and subsequence information as well (Zhang et al., 2022; Xu et al., 2023b). More specifically, we investigate one algorithm from each of these categories and study their impact on the performance of LLMs in understanding the encoded data. The rationale and the details of the algorithms are summarized as follows.

5.2.1. Time-domain Representations using Augmentations

Classical self-supervised learning (SSL) algorithms on images (Chen et al., 2020) are designed to learn visual representations without labeled data. They operate by creating multiple augmented views of the same image and use a contrastive loss function to train a model to identify which views originate from the same image. Specifically, the contrastive loss function minimizes the distance between the embeddings of similar items while maximizing the distance between the embeddings of dissimilar ones. To ensure that the positive pairs stay closer than the negative pairs in the latent space, a typical contrastive algorithm optimizes the NT-Xent loss (Chen et al., 2020) defined as follows.

$$\mathcal{L}_{\text{SimCLR}}(i,j)=-\log\frac{\exp(\text{sim}(z_{i},z_{j})/\tau)}{\sum_{k=1}^{2N}\mathbbm{1}_{[k\neq i]}\exp(\text{sim}(z_{i},z_{k})/\tau)}$$

where the term $\text{sim}(u,v)$ represents the cosine similarity between two vectors $u$ and $v$. The parameter $\tau$ is the temperature parameter. The variable $N$ denotes the batch size, which effectively becomes $2N$ due to the augmentation strategy. Lastly, the expression $\mathbbm{1}_{[k\neq i]}$ is an indicator function that equals 1 if and only if $k$ is not equal to $i$. This approach is further extended by works like (Tang et al., 2020) to time-series inertial data, where they adopt a similar strategy with appropriate augmentations, like adding Gaussian noise or scaling, which are more appropriate for time-series data.
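As a concrete illustration, the following is a minimal PyTorch sketch of the NT-Xent loss defined above; it is not the code of the cited implementations, and the batch size, projection dimension, and temperature used in the example are arbitrary.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over a batch of N augmented pairs (z1[i], z2[i]); a minimal sketch,
    not the exact code of the SimCLR or Tang et al. repositories."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x d, unit-norm rows
    sim = torch.mm(z, z.t()) / temperature               # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-comparisons (k != i)
    # The positive of sample i is its augmented view at index (i + n) mod 2n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Example: 2-D projections of a mini-batch of 4 windows and their augmented views.
z_a, z_b = torch.randn(4, 2), torch.randn(4, 2)
print(nt_xent_loss(z_a, z_b).item())
```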

5.2.2. Utilizing Frequency-domain Representations

In addition to the general time-domain analysis, a deeper inspection of the frequency-domain information often provides more insights, especially when the time-series data is noisy. Take, for example, the time and frequency-domain information for the two activities “Jogging” and “Upstairs” from the MotionSense dataset (Malekzadeh et al., 2019). We can observe from Figure 7 that the frequency-domain representations can better differentiate between the two activities. Understanding this, we next choose the Time-Frequency Contrastive (TFC) learning approach (Zhang et al., 2022), which allows us to extract both time and frequency-domain embeddings for a given time series. The key idea behind this approach is that the time and frequency-based embeddings of a time-series sample should be close to each other in the latent time-frequency space, even when learned from augmentations of the same sample. To achieve this, TFC uses two encoders, one for generating time-based embeddings (say, $G_T$) and the other for generating frequency-based embeddings (say, $G_F$). Later, the representations coming from these two encoders are mapped to a joint time-frequency space to measure the distance between the embeddings while optimizing the pre-training loss defined as follows. The Time-based contrastive loss is expressed as,

$$\mathcal{L}_{T,i}=d(\mathbf{h}_{i}^{T},\tilde{\mathbf{h}}_{i}^{T},D)=-\log\frac{\exp(\text{sim}(\mathbf{h}_{i}^{T},\tilde{\mathbf{h}}_{i}^{T})/\tau)}{\sum_{x_{j}\in D}\mathbbm{1}_{[i\neq j]}\exp(\text{sim}(\mathbf{h}_{i}^{T},G_{T}(x_{j}))/\tau)}$$

The Frequency-based contrastive loss is expressed as,

$$\mathcal{L}_{F,i}=d(\mathbf{h}_{i}^{F},\tilde{\mathbf{h}}_{i}^{F},D)=-\log\frac{\exp(\text{sim}(\mathbf{h}_{i}^{F},\tilde{\mathbf{h}}_{i}^{F})/\tau)}{\sum_{x_{j}\in D}\mathbbm{1}_{[i\neq j]}\exp(\text{sim}(\mathbf{h}_{i}^{F},G_{F}(x_{j}))/\tau)}$$

Finally, the Time-Frequency contrastive loss is expressed as,

$$\mathcal{L}_{C,i}=\sum_{S^{\text{pair}}}\left(\lvert S_{i}^{TF}-S_{i}^{\text{pair}}\rvert+\delta\right),\quad S^{\text{pair}}\in\{S_{i}^{T\tilde{F}},\,S_{i}^{\tilde{T}F},\,S_{i}^{\tilde{T}\tilde{F}}\}$$

Overall, the pre-training loss is computed as,

$$\mathcal{L}_{TFC,i}=\lambda(\mathcal{L}_{T,i}+\mathcal{L}_{F,i})+(1-\lambda)\mathcal{L}_{C,i}$$

where $\mathbf{h}_{i}^{T}$ and $\tilde{\mathbf{h}}_{i}^{T}$ denote the embeddings obtained from $G_{T}$, representing the transformation in the time domain. Similarly, $\mathbf{h}_{i}^{F}$ and $\tilde{\mathbf{h}}_{i}^{F}$ are the embeddings derived from $G_{F}$, which processes the frequency-domain aspects of the data. The dataset used for pre-training is denoted by $D$. The function $\text{sim}(u,v)$ computes the cosine similarity between vectors $u$ and $v$, facilitating a measure of their alignment. The parameter $\tau$ serves as the temperature coefficient, adjusting the distribution of similarity scores. The indicator function $\mathbbm{1}_{[i\neq j]}$ yields a value of 1 if and only if the indices $i$ and $j$ refer to different elements, ensuring the exclusion of self-comparisons. Each input time-series sample is represented by $x_{j}$. The distances between the transformed embeddings, such as $S_{i}^{TF}=d(z_{i}^{T},z_{i}^{F},D)$ and its variations including $S_{i}^{T\tilde{F}}$, $S_{i}^{\tilde{T}F}$, and $S_{i}^{\tilde{T}\tilde{F}}$, encapsulate the relative positions of embeddings in the transformed space. Here, $z_{i}^{T}=R_{T}(G_{T}(x_{i}^{T}))$ and $z_{i}^{F}=R_{F}(G_{F}(x_{i}^{F}))$ are the results of applying the cross-space projectors $R_{T}$ and $R_{F}$, respectively, ensuring that the embeddings from $G_{T}$ and $G_{F}$ are aligned within a joint time-frequency space.
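The following PyTorch sketch illustrates how the overall TFC objective composes these terms; it follows the structure of the equations above, with the consistency term written in a triplet (margin) form, rather than reproducing the exact official implementation, and all tensor names are placeholders.

```python
import torch
import torch.nn.functional as F

def contrastive_term(h, h_tilde, temperature=0.2):
    """Generic InfoNCE-style term used for both L_T and L_F: the augmented view of a
    sample is the positive, every other sample in the batch acts as a negative."""
    h, h_tilde = F.normalize(h, dim=1), F.normalize(h_tilde, dim=1)
    logits = torch.mm(h, h_tilde.t()) / temperature   # B x B similarity matrix
    targets = torch.arange(h.size(0))                 # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

def tfc_loss(z_t, z_t_aug, z_f, z_f_aug, h_t, h_t_aug, h_f, h_f_aug,
             lam=0.5, delta=1.0):
    """Sketch of L_TFC = lambda * (L_T + L_F) + (1 - lambda) * L_C."""
    l_time = contrastive_term(h_t, h_t_aug)
    l_freq = contrastive_term(h_f, h_f_aug)
    # Consistency term: keep the time-frequency distance S^{TF} smaller (by a margin
    # delta) than the distances involving augmented views (S^{T~F}, S^{~TF}, S^{~T~F}).
    d = lambda a, b: (a - b).pow(2).sum(dim=1).sqrt()
    s_tf = d(z_t, z_f)
    l_cons = sum(F.relu(delta + s_tf - d(a, b)).mean()
                 for a, b in [(z_t, z_f_aug), (z_t_aug, z_f), (z_t_aug, z_f_aug)])
    return lam * (l_time + l_freq) + (1 - lam) * l_cons

# Example with a mini-batch of 8 samples and 32-d projections (random placeholders).
args = [torch.randn(8, 32) for _ in range(8)]
print(tfc_loss(*args).item())
```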

Figure 7. Time and frequency-domain analysis of time-series data from the MotionSense dataset for two different activity classes: (a) and (b) correspond to the time and frequency-domain information for the activity “Jogging”, whereas (c) and (d) correspond to the time and frequency-domain information for the activity “Upstairs”.

5.3. Annotation as Retrieval of Labels using Examples

The trained encoders allow the raw sensor data to be projected to the embedding space where the samples from the same class cluster in space (see Figure 8). Also, we observe that for the TFC-based encoder, the aggregated embeddings from both the time and frequency domains provide more distinct clusters in the embedding space in comparison to the individual domain-based representations. Nevertheless, once the encoders are trained, the next task of obtaining annotation labels from the LLM becomes analogous to a query-based retrieval, given some examples drawn from the embedding space. The objective of this study is to assess whether this approach of providing examples from the embedding space provides enough contextual information for the LLM to provide accurate and consistent annotations for a query sample with proper reasoning. The details of the assessment and the primary observations are summarized in the next section.
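For intuition, the nearest-example retrieval that the prompt asks the LLM to emulate can be written in a few lines of NumPy, as sketched below; the function name is our own and the 2-D embeddings are illustrative values in the spirit of the examples shown later in Table 2.

```python
import numpy as np

def retrieve_label(query_emb, examples_by_class, metric="euclidean"):
    """Return the class whose example embeddings contain the point closest to the query,
    i.e., the retrieval the prompt instructs the LLM to perform."""
    def dist(a, b):
        if metric == "euclidean":
            return np.linalg.norm(a - b)
        if metric == "manhattan":
            return np.abs(a - b).sum()
        # cosine distance
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    return min(examples_by_class,
               key=lambda c: min(dist(query_emb, e) for e in examples_by_class[c]))

# Illustrative 2-D example embeddings for two classes.
examples = {"walking upstairs": np.array([[14.2, 6.4], [0.57, -22.3]]),
            "jogging": np.array([[56.4, 5.0], [53.9, -21.7]])}
print(retrieve_label(np.array([14.24, 7.17]), examples))   # -> "walking upstairs"
```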

Figure 8. t-SNE plots of (a) SimCLR and (b, c, d) TFC embeddings for the HHAR dataset (Blunck Henrik, Bhattacharya Sourav, Prentow Thor, Kjrgaard Mikkel, and Dey Anind, 2015). Here, (b) shows the time-domain, (c) the frequency-domain, and (d) the concatenated time- and frequency-domain embeddings obtained from the TFC-based encoder.

6. Empirical Results and Analysis

In the remainder of this paper, we now investigate the performance benefits of using the SSL-based encoded sensor data in place of raw sensor data and lay out a few strategies for redesigning the prompt to incorporate the embeddings as a part of the query. The basic setup, design choices, key observations, and the final takeaways are summarized below.

6.1. Implementation and Hyperparameters

6.1.1. SimCLR

In this implementation of SimCLR on time-series data, an SGD optimizer with cosine learning-rate decay and an initial learning rate of 0.1 is used for pre-training for 200 epochs with a mini-batch size of 512 for the MotionSense, PAMAP2, and HHAR datasets. However, for the UCI HAR dataset, we use a learning rate of 0.001 and a mini-batch size of 64. For the implementation, we have adapted the official implementation of the paper (Tang et al., 2020), available at https://github.com/iantangc/ContrastiveLearningHAR.

6.1.2. TFC

For training, the Adam optimizer (Kingma and Ba, 2014) is used with $\beta_{1}=0.9$ and $\beta_{2}=0.99$, a learning rate of $3\times 10^{-4}$, and a mini-batch size of 128, pre-training for 40 epochs. These hyperparameters are used for all four datasets we have considered. For implementing the TFC-based contrastive learning, we followed the official implementation available at https://github.com/mims-harvard/TFC-pretraining.
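For quick reference, the two pre-training setups can be summarized as plain configuration objects; the encoder module below is only a stand-in to show how the stated Adam settings would be wired up, not the actual SimCLR or TFC architecture.

```python
import torch

# Hyperparameter summary of Sections 6.1.1 and 6.1.2, kept as simple config dicts.
SIMCLR_CFG = {
    "optimizer": "SGD with cosine learning-rate decay",
    "learning_rate": 0.1,     # 0.001 for UCI HAR
    "batch_size": 512,        # 64 for UCI HAR
    "epochs": 200,
}
TFC_CFG = {
    "optimizer": "Adam",
    "betas": (0.9, 0.99),
    "learning_rate": 3e-4,
    "batch_size": 128,
    "epochs": 40,
}

# Example: instantiating the TFC optimizer for a placeholder encoder module.
encoder = torch.nn.Linear(128, 64)   # stand-in, not the real TFC encoder
optimizer = torch.optim.Adam(encoder.parameters(),
                             lr=TFC_CFG["learning_rate"], betas=TFC_CFG["betas"])
```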

6.2. Redesigning the Prompt

Although this paper does not focus on sophisticated prompt-tuning, we needed to change the initial set of queries designed for the motivational experiment (see Section 4.1.2). These changes accommodate embeddings in place of raw sensor readings. Furthermore, we introduce additional contextual information regarding the distance metric to enable the LLM to reason better about the similarity (or proximity) between the query and the example samples. More specifically, the redesigned prompt is as follows.

The following given embeddings correspond to “class 1”: $[[\mathbb{E}^{1}_{11},\ldots,\mathbb{E}^{1}_{u1}]\ldots[\mathbb{E}^{1}_{1v},\ldots,\mathbb{E}^{1}_{uv}]]$, the following given embeddings correspond to “class 2”: $[[\mathbb{E}^{2}_{11},\ldots,\mathbb{E}^{2}_{u1}]\ldots[\mathbb{E}^{2}_{1v},\ldots,\mathbb{E}^{2}_{uv}]]$, ..., and the following given embeddings correspond to “class n”: $[[\mathbb{E}^{n}_{11},\ldots,\mathbb{E}^{n}_{u1}]\ldots[\mathbb{E}^{n}_{1v},\ldots,\mathbb{E}^{n}_{uv}]]$; classify the embedding $[\mathbb{E}^{q}_{1},\ldots,\mathbb{E}^{q}_{u}]$ as either “class 1” or “class 2” ... or “class n” considering the minimum distance to the example embeddings provided that the distance metric chosen is euclidean distance.

Here, $u$ and $v$ represent the dimension and the total number of example embeddings, respectively. In this study, we vary both $u$ and $v$ to study the impact of the embedding dimension and the number of examples. Furthermore, we investigate the impact of the distance metric in assisting the LLM in arriving at a conclusion regarding the final response.
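A minimal sketch of how such a prompt can be assembled programmatically is shown below; the helper name and formatting details are our own, and the example values mirror the ones shown later in Table 2.

```python
def build_embedding_prompt(examples_by_class, query_emb, metric="euclidean distance"):
    """Sketch of the redesigned prompt: per-class example embeddings followed by the
    query embedding and the distance-metric instruction."""
    parts = []
    for cls, embs in examples_by_class.items():
        emb_str = " ".join("[" + ", ".join(f"{v:.6f}" for v in e) + "]" for e in embs)
        parts.append(f'The following given embeddings correspond to "{cls}": [{emb_str}]')
    classes = " or ".join(f'"{c}"' for c in examples_by_class)
    query_str = "[" + ", ".join(f"{v:.6f}" for v in query_emb) + "]"
    parts.append(f"classify the embedding {query_str} as either {classes} considering "
                 f"the minimum distance to the example embeddings provided that the "
                 f"distance metric chosen is {metric}. Answer in one word.")
    return ", ".join(parts)

# Example with one 2-D example embedding per class (values mirror Table 2).
print(build_embedding_prompt(
    {"walking upstairs": [(14.226081, 6.4440064)], "jogging": [(56.372967, 5.0068436)]},
    query_emb=(14.239516, 7.1703763)))
```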

6.3. Designing the Experiments

Although a straightforward analysis can be done just by comparing the accuracy of the responses obtained from GPT-4 with the ground truth, there are also implicit factors, like token length and the cost of using this approach for annotation, that may ultimately impact the potential of using LLMs as virtual annotators. Thus, to perform this study elaborately, we redesign the set of experiments to investigate the impact of different factors on accuracy while also considering their impact on the length of the final query, the cost, and the time required for annotation.

6.3.1. Studying the Impact of Dimensions

To study the impact of the dimensions, we first fix the choices of distance metric and number of examples as Euclidean and one, respectively. Next, we vary the dimensions from 2 to 15 and record the changes in the accuracy of the response. To reduce the original embedding dimension (96 in the case of SimCLR and 668 for TFC), we use t-SNE as the unsupervised dimensionality reduction approach. Although higher accuracy is a crucial factor, an increase in the dimension leads to a significant increase in the number of tokens and thus a higher cost of annotation (OpenAI, 2024).
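A sketch of this reduction step, assuming scikit-learn's t-SNE, is shown below; note that the default barnes_hut solver supports at most 3 output dimensions, so the exact method is needed for the higher dimensionalities explored here, and the input array is a random placeholder for real encoder outputs.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for real encoder outputs (96-d for SimCLR, 668-d for TFC).
embeddings = np.random.randn(500, 96)

for dim in (2, 5, 10, 15):
    method = "barnes_hut" if dim <= 3 else "exact"   # barnes_hut only supports <= 3 dims
    reduced = TSNE(n_components=dim, method=method, random_state=0).fit_transform(embeddings)
    print(dim, reduced.shape)                        # (500, dim) embeddings sent to the prompt
```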

6.3.2. Studying Impact of Distance

The purpose of including the distance metric is to provide an additional reference for the LLM to assess the similarity (or the proximity in the embedding space) between the queried and the example samples. To study the impact of the distance metric, we first fix the number of examples as 3 and the embedding dimension as 2, except for the baseline case with no encoding of the input, where we keep the dimension as 3 (input from the triaxial accelerometer). We then vary the distance metric among Euclidean, Manhattan, and Cosine. Here, accuracy is the only key factor of analysis, as the choice of distance metric does not significantly impact the number of tokens.

6.3.3. Analysing the Impact of Number of Examples

We study the impact of the number of examples by first fixing the distance metric as Euclidean and the embedding dimension as 2. We then vary the number of examples to 1, 3, 10, and 25. By analogy with a human-in-the-loop setup, a good annotation framework should achieve higher accuracy with more examples. However, an increase in the number of examples also significantly increases the number of tokens and thus the cost of annotation with paid APIs like GPT-4 (OpenAI, 2024).
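To see how the example count inflates the prompt, the token usage can be estimated offline before issuing any paid API call, for instance with OpenAI's tiktoken tokenizer, as sketched below; the prompt builder is a crude self-contained stand-in for the query of Section 6.2 with illustrative 2-D embeddings.

```python
import tiktoken  # OpenAI's tokenizer library, used here only for an offline estimate

enc = tiktoken.encoding_for_model("gpt-4")

def toy_prompt(n_examples):
    """Crude stand-in for the embedding prompt of Section 6.2 with 2-D examples."""
    ex = " ".join("[14.23, 6.44]" for _ in range(n_examples))
    jog = " ".join("[56.37, 5.01]" for _ in range(n_examples))
    return (f'The following given embeddings correspond to "walking upstairs": [{ex}], '
            f'the following given embeddings correspond to "jogging": [{jog}]; classify '
            f'the embedding [14.24, 7.17] as either "walking upstairs" or "jogging" '
            f'considering the minimum distance to the example embeddings provided that '
            f'the distance metric chosen is euclidean distance. Answer in one word.')

for n in (1, 3, 10, 25):
    print(n, "examples ->", len(enc.encode(toy_prompt(n))), "prompt tokens")
```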

Figure 9. Baseline accuracy of responses received from GPT-4 using raw sensor data across different HAR datasets considering – (a) different distance metrics and (b) different numbers of examples.
Figure 10. Accuracy of the responses obtained from GPT-4 using the encoded output from a pre-trained encoder trained using the SimCLR approach. Here, we show the results by varying – (a) the dimensionality of the embeddings, (b) the distance metric for comparing the embedding with the example embeddings included as a part of the query, and (c) the number of examples.

6.4. Performance Accuracy

We start the analysis by first comparing the accuracy of annotations concerning factors like dimensionality of the embeddings, number of examples, and impact of the distance metric. For the results, we compare the baseline approach, with raw sensor data without any encoding presented in Figure 9, with the two encoding approaches of SimCLR and TFC presented in Figure 10 and Figure 11, respectively. The summary of the observations is as follows.

Figure 11. Accuracy of the responses obtained from GPT-4 using the encoded output from a pre-trained encoder trained using the TFC approach. Here, we show the results by varying – (a) the dimensionality of the embeddings, (b) the distance metric for comparing the embedding with the example embeddings included as a part of the query, and (c) the number of examples.
Figure 12. Accuracy of annotations with increasing number of classes with embeddings generated using – (a) SimCLR and (b) TFC. These results show that the accuracy drops with increasing number of classes irrespective of the contrastive setup used.
Figure 13. Confusion matrices for multi-class scenarios with embeddings generated using SimCLR: (a) three classes, (b) four classes, (c) five classes, and (d) six classes. These results expose the inherent limitation of using triaxial accelerometer data for differentiating closely related classes. For example, the annotations provided by the LLM show high confusion between embeddings from closely related classes like “walking” and “walking upstairs”.

6.4.1. Impact of Dimension

For this experiment, we do not consider the baseline approach as the raw sensor data from a triaxial accelerometer has a fixed dimension. For the encoded inputs through SimCLR (see Figure 10) and TFC (see Figure 11), we observe a similar pattern where the best accuracy is achieved at the reduced dimension of 2. Notably, for both these encoding approaches, we see that with higher dimensions, the accuracy of the responses drops significantly.

6.4.2. Impact of Distance Metric

We evaluate the accuracy of the responses across all datasets considering the three distance metrics of Euclidean, Manhattan, and Cosine to evaluate the similarity of the queried sample with the examples given in the query. For all the cases, including the baseline case of using raw sensor data, we do not observe a significant difference in accuracy, albeit there is a slightly higher gain with the Manhattan distance, which is already known to be highly robust for high-dimensional data (see Figure 9, Figure 10, and Figure 11).

6.4.3. Impact of Number of Examples

One key downside of using raw sensor data directly with LLMs, observed in the motivational study and across all datasets in Figure 9, is the lack of consistency with an increasing number of examples. However, with encoded examples (for both SimCLR and TFC), we can clearly observe (from Figure 10 and Figure 11) that the system achieves higher consistency, at times even reaching an accuracy of 1.0 when the number of examples is 25 (an average of 0.02%), the highest that we explore in this study.

Table 2. Impact of including distance metric as a context information in the query. The responses clearly show how GPT-4 then uses the distance metric as the basis for reasoning regarding the generated response.
Raw Sensor Data Query
The following given triaxial accelerometer readings correspond to “jogging”: [[0.0223 0.1115 0.3571]…[0.161 0.1497 -0.2369]] and the following given triaxial accelerometer readings correspond to “walk upstairs”: [[0.2615 0.118 -0.0637]…[-0.0322 0.0437 0.057]]; classify the triaxial accelerometer reading [-0.0933 0.0165 0.0609] as either ”jogging” or ”walking upstairs” provided that this data is coming from the smartphone stored in the user’s trousers’ front pocket considering the minimum distance to the example triaxial accelerometer readings provided that the distance metric chosen is euclidean distance. Answer in one word.
Response
The triaxial accelerometer reading [-0.0933, 0.0165, 0.0609] is classified as “walk upstairs” based on the minimum Euclidean distance to the example triaxial accelerometer readings provided.
Reasoning
The classification of the triaxial accelerometer reading [-0.0933, 0.0165, 0.0609] as “walk upstairs” is based on the principle of the nearest neighbor algorithm, using the Euclidean distance as the metric for similarity. This distance metric was chosen to compare the given accelerometer reading with each of the example readings from both activities (“jogging” and “walking upstairs”). Minimum Distance: For each activity, I calculated the Euclidean distance from the given reading to all example readings of that activity. Then, I found the minimum distance to each activity, determining the closest match for the given reading. Classification: The activity whose examples have the minimum distance to the given reading is considered the most similar or the most likely classification for that reading. In this case, the minimum distance between the given reading and the examples of “walk upstairs” was smaller than that to the examples of “jogging”. This indicates that the given reading is more similar to those recorded during the activity of walking upstairs, leading to its classification as “walk upstairs”. This method is effective for classifying data points based on similarity to known examples, and it’s particularly useful in contexts like activity recognition from accelerometer data, where patterns of motion can be distinctive but vary slightly from person to person.
SimCLR Encoding Query
The following given embeddings correspond to “walking upstairs”: [[14.226081 6.4440064]…[0.57457817 -22.347515]] and the following given embeddings correspond to “jogging”: [[56.372967 5.0068436]…[53.95932 -21.747236]]; classify the embedding [14.239516 7.1703763] as either “walking upstairs” or “jogging” considering the minimum distance to the example embeddings provided that the distance metric chosen is euclidean distance. Answer in one word.
Response
The embedding [14.239516, 7.1703763] is classified as “walking_upstairs” based on the minimum distance to the example embeddings, using the Euclidean distance metric.
Reasoning
To determine the classification of the given embedding[ 14.239516 , 7.1703763 ] as either “walking upstairs” or “jogging”, we calculated the Euclidean distance between this embedding and each of the example embeddings provided for both activities. The Euclidean distance metric measures the straight-line distance between two points in Euclidean space. Here’s how the decision was made: Calculating Distances: For each category (“walking upstairs” and “jogging”), we calculated the distance from the given embedding to all embeddings in that category. Minimum Distance: We then identified the minimum distance from the given embedding to the embeddings in each category. This means we found the closest embedding to the given embedding within both categories. Comparison: The category (“walking upstairs” or “jogging”) whose embeddings had the smallest minimum distance to the given embedding is considered the most likely category for that embedding. In this case, the minimum distance between the given embedding and the “walking upstairs” embeddings was smaller than the minimum distance to any of the “jogging” embeddings. This indicates that the given embedding is more similar to the “walking upstairs” activity based on the spatial distribution of embeddings in the feature space. Thus, considering the principle that similar activities will result in embeddings that are closer together in space, the classification was based on the proximity of the given embedding to the known embeddings for each activity, leading to the conclusion that the given embedding is most likely associated with “walking upstairs”.
TFC Encoding Query
The following given embeddings correspond to “walking upstairs”: [[4.256844 -10.712115] … [39.49697 -33.193176]] and the following given embeddings correspond to “jogging”: [[38.99644 -7.460116] … [40.007973 4.696819]]; classify the embedding [2.0370352 3.47346] as either “walking upstairs” or “jogging” considering the minimum distance to the example embeddings provided that the distance metric chosen is euclidean distance. Answer in one word.
Response
The embedding [ 2.0370352, 3.47346 ] is classified as “walking_upstairs” based on the minimum Euclidean distance to the example embeddings provided for “walking upstairs” and “jogging”. This classification indicates that the given embedding is closer to the set of embeddings associated with “walking upstairs” than it is to those associated with “jogging”.
Reasoning
To determine the classification of the embedding [2.0370352, 3.47346], I calculated the Euclidean distance between this point and each of the given embeddings for “walking upstairs” and “jogging”. After computing the distances to all provided embeddings, I identified the minimum distance from the point to the embeddings of each activity category. The logic behind this approach is that an embedding is more likely to belong to the category to which it is closest in terms of distance in the embedding space. This space is structured so that similar activities (in terms of sensor readings, motions, or whatever features these embeddings represent) are closer together. The results showed that the closest distance between the point [2.0370352, 3.47346] and the “walking upstairs” embeddings was significantly smaller than the closest distance to the “jogging” embeddings. This implies that the point is more similar to, or shares more characteristics with, the “walking upstairs” activity than with “jogging”. Therefore, based on the principle of nearest neighbors (where an entity is more likely to be similar to its nearest neighbors in a feature space), the embedding was classified as “walking upstairs”. This method is a fundamental concept in machine learning for classifying data points based on the known classification of other points nearby in the feature space.

6.5. Time and Cost Analysis of Annotation

One of the critical factors determining the usability of LLMs as virtual annotators is the pricing model of the API-based access and the time required to obtain the annotated data. To assess this, we performed a separate small study with the encoded data from subject 14 of the MotionSense dataset (Malekzadeh et al., 2019) for the two classes “Jogging” and “Upstairs”, with a total of 2330 data samples. For the encoding, we used the pre-trained encoder trained using the TFC approach and fixed the dimensions, distance, and number of examples as 2, Euclidean, and 25, respectively. Given this setup and the API access restrictions (OpenAI, 2023), it took us approximately 13.33 minutes and USD 5.03 to get this data annotated using the GPT-4 API.
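As a rough planning aid, the per-sample figures implied by this batch can be extrapolated to larger annotation jobs. The sketch below simply scales the reported totals; it assumes the same prompt size, model, pricing, and rate limits, and is not an API pricing calculator.

```python
# Extrapolate annotation time and cost from the batch reported above.
# The per-sample figures are derived from the reported totals and are
# approximations; actual cost depends on prompt size and API pricing.
N_SAMPLES = 2330          # annotated samples in the study above
TOTAL_MINUTES = 13.33     # reported wall-clock time
TOTAL_COST_USD = 5.03     # reported GPT-4 API cost

minutes_per_sample = TOTAL_MINUTES / N_SAMPLES
cost_per_sample = TOTAL_COST_USD / N_SAMPLES

def estimate(n_samples):
    """Scale the reported per-sample time and cost to a larger batch."""
    return n_samples * minutes_per_sample, n_samples * cost_per_sample

minutes, usd = estimate(10_000)
print(f"~{minutes:.1f} minutes, ~USD {usd:.2f} for 10,000 samples")
```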

6.6. Reasoning behind the Responses

A critical problem we observed during the motivational experiments (see Section 4.2.4), which used queries without contextual information like the distance metric and examples, was that GPT-4 failed to provide concrete reasoning (Zhao et al., 2023) for how it arrived at a particular conclusion. Notably, redesigning the query and adding some contextual information, like a proper distance metric, allowed the LLM to provide better reasoning for its responses. A summary of the same is provided in Table 2.
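For reference, such a context-enriched query can be assembled programmatically. The sketch below reconstructs a prompt in the style of the TFC query shown in Table 2 (two classes, example embeddings, and an explicit distance metric); the function name and the example values are illustrative, and the call to the GPT-4 API is omitted.

```python
def build_query(label_a, examples_a, label_b, examples_b, test_emb,
                metric="euclidean distance"):
    """Assemble a two-class annotation query with example embeddings and
    an explicit distance metric, following the style shown in Table 2."""
    return (
        f'The following given embeddings correspond to "{label_a}": {examples_a} '
        f'and the following given embeddings correspond to "{label_b}": {examples_b}; '
        f'classify the embedding {test_emb} as either "{label_a}" or "{label_b}" '
        f'considering the minimum distance to the example embeddings provided '
        f'that the distance metric chosen is {metric}. Answer in one word.'
    )

query = build_query(
    "walking upstairs", [[4.256844, -10.712115]],
    "jogging", [[38.99644, -7.460116]],
    [2.0370352, 3.47346],
)
print(query)
```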

Figure 14. t-SNE plots from SimCLR embeddings for the UCI HAR dataset. The embeddings of certain classes, like “walking”, “walking upstairs”, and “walking downstairs”, cluster very closely in the embedding space. This results in a drop in the annotation accuracy provided by GPT-4 for the samples from these closely related classes.

6.7. Comprehensive Multi-class Analysis

After confirming our initial hypothesis with two classes, we next conduct a comprehensive multi-class analysis using the UCI HAR dataset, chosen for its smaller test set compared to the other datasets. Figure 12 reports the overall accuracy we observe as the number of classes increases. As we increase the number of classes, we clearly observe a decline in the LLM's performance. This highlights the importance of identifying which classes are prone to confusion, which we investigate next. From the confusion matrices (shown in Figure 13), we see that with up to three classes, the LLM can still annotate with 100% accuracy (Figure 13(a)). However, as new classes are added to the annotation task, we observe a significant decline in performance (see Figure 13(b)–Figure 13(d)). One of the key reasons is the inherent confusion introduced by a new class, like “walking”, when similar representations are already present for closely related classes, like “walking upstairs” and “walking downstairs”. A similar pattern is observed for classes like “standing” and “sitting”. This can be linked to the limited representational capability of the raw triaxial accelerometer data in differentiating between these classes: their embeddings cluster closely together in the embedding space (Figure 14). Our exploration with both the SimCLR and TFC frameworks revealed consistent results across all classes, with optimal accuracy consistently attained when the dimensionality of the test embeddings was set to 2.
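The cluster overlap discussed above (Figure 14) can be inspected with a standard t-SNE projection of the encoder outputs. A minimal sketch follows; the file names holding the SimCLR embeddings and ground-truth labels are placeholders for whatever format the encoder outputs are stored in.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: SimCLR encoder outputs (n_samples x d) and ground-truth
# activity names (n_samples,) for the UCI HAR test split.
embeddings = np.load("simclr_ucihar_embeddings.npy")
labels = np.load("simclr_ucihar_labels.npy", allow_pickle=True)

# Project to 2-D for visual inspection of class clusters.
proj = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

for activity in np.unique(labels):
    mask = labels == activity
    plt.scatter(proj[mask, 0], proj[mask, 1], s=5, label=activity)
plt.legend(markerscale=3)
plt.title("t-SNE of SimCLR embeddings (UCI HAR)")
plt.show()
```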

6.8. Key Observations

In summary, the key observations from this study are as follows.

  1. (1)

    Encoding Raw Signals to Achieve Accurate Annotations: Encoding the time-series information using self-supervised approaches provides the LLMs with robust representations, which they can then use to generate accurate annotations. Notably, the obtained embeddings can be reduced to significantly lower dimensions, which reduces the number of tokens (and the associated cost) while enhancing accuracy (a minimal sketch of this reduction step follows this list).

  2. (2)

    Adding Context to the Query for Accurate and Reasonable Annotations: One of the key observations from this study is that even simple information like the distance metric adds a significant amount of context to the query. This, in turn, allows the LLM to generate accurate responses with better reasoning.

  3. (3)

    Consistency with a Higher Number of Examples – A Cost vs. Accuracy Trade-off: A primary advantage of using SSL-based pre-trained encoders is that they allow similar representations to cluster in the embedding space without the need for label information. This, in turn, enables the LLM to provide more accurate responses as the number of examples increases. However, a critical limitation is that adding more examples to the query significantly increases the number of tokens and, thus, the cost of obtaining annotations through the paid APIs.

  4. (4)

    Effects of Increasing the Number of Classes for HAR Datasets – The Choice of Classes and the Inherent Limitation of the Modality: The representational capability of the data itself limits the approach of annotating raw sensor data using LLMs. For example, when certain classes are added to the annotation task, the representations obtained from the triaxial accelerometer data become too similar for the LLM to clearly differentiate the classes from each other. This highlights that it is not just the number of classes but also the choice of classes selected for annotation and the modality used that impact the overall accuracy of the annotations obtained from the LLM.
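Referring back to observation (1), the following is a minimal sketch of the dimensionality-reduction step applied before prompting. PCA is used purely for illustration as a stand-in for whatever reduction method is applied in practice, and the random array is a placeholder for real SSL encoder outputs.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for SSL encoder outputs (e.g., 128-D embeddings of 2330 windows).
encoder_outputs = np.random.randn(2330, 128)

# Project to 2-D so each embedding serializes into only a few prompt tokens.
low_dim = PCA(n_components=2).fit_transform(encoder_outputs)

print(low_dim[0].round(4).tolist())   # a short 2-D vector per sample
```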

7. Discussion and Future Work

In this section, we summarize some future directions that may be of interest to the community, given the outcomes of this study.

  1. (1)

    Virtual Annotation using Vector Databases: The recent growth of LLMs has also fueled the development of vector databases (Microsoft, 2023). These databases usually contain high-dimensional embeddings or representations obtained from trained models. Interestingly, the broad idea explored in this paper can be further refined by deploying annotation-as-a-service platforms, where the representations in these databases assist the system in obtaining a few bootstrap samples for any given physical activity mapping. Subsequently, the user can submit the embeddings obtained from an encoder trained with a self-supervised algorithm to obtain labels. This would allow users to get their datasets annotated without needing to transfer any raw data to third-party servers or crowdsourcing platforms.

  2. (2)

    Refined Label Space using Active Learning with Annotations from LLM: Following up on the aforementioned idea of vector databases, a further step towards developing an optimized, high-quality annotation platform is to apply active learning (Hossain and Roy, 2019) on top of the sample labels obtained from LLMs. In this case, the first set of bootstrap-labeled data can be provided by the LLM, while active learning can then help refine the label quality for samples that were difficult for the LLM to classify accurately. Interestingly, such an approach may also help reduce hallucinations by opportunistically including labels with human feedback (Hanneke et al., 2018).

  3. (3)

    Lack of Annotator Agreement: One of the critical benefits of the traditional human-in-the-loop annotation approach is the ability to assess labeling confusion through annotator agreement, which, when properly assessed, can significantly impact the decisions made by an ML model (Gordon et al., 2022). However, with a more stringent labeling setup controlled by a single LLM, this way of assessing confusion in the obtained labels is significantly hampered. Although a straightforward solution is to use multiple such LLMs and compare their outputs for a given dataset, this will surely increase the overall annotation cost.

  4. (4)

    Increasing Cost with Multiclass Annotation: One of the critical restrictions, given that our assessment uses paid API access to GPT-4, is the overall cost of annotating the samples. For example, as the number of classes grows beyond two, the number of examples provided in the prompt also grows. This, in general, increases the number of tokens, which in turn increases the cost significantly.

8. Conclusion

Recent LLMs are trained on vast amounts of publicly available data, which makes them a potential replacement for humans in the conventional human-in-the-loop annotation scheme. This can offer many advantages in terms of scalability, timeliness, and efficiency. However, understanding complex time-series data has been a significant challenge for LLMs, with most SOTA approaches proposing either computationally expensive fine-tuning of the LLM or sophisticated prompt-tuning. This paper presents a different perspective on using LLMs for time-series physical sensing data by encoding the data with pre-trained time-series encoders. To arrive at this idea and assess it thoroughly, we segregated the entire design setup into two phases. The first phase was dedicated to understanding the challenges of using raw sensor data with LLMs. We saw that GPT-4 could not generate accurate and well-reasoned annotations from raw sensor data, even when provided with examples. This understanding laid the foundation for the next phase, where we systematically looked into providing expert guidance by including the distance metric as contextual information. Then, to exploit the idea of neighborhood and proximity in the embedding space, we studied SSL-based pre-training of a time-series encoder. These encoders project the raw sensor data into an embedding space where similar classes form distinct clusters. Thus, with examples drawn from a cluster, the task for the LLM reduces to assessing proximity. The observations we gain from the principled evaluation of this approach on four benchmark datasets validate the broad intuition of using SSL-based time-series encoders as a plug-and-play module over LLMs for virtual annotation.

References

  • Belyaeva et al. (2023) Anastasiya Belyaeva, Justin Cosentino, Farhad Hormozdiari, Krish Eswaran, Shravya Shetty, Greg Corrado, Andrew Carroll, Cory Y McLean, and Nicholas A Furlotte. 2023. Multimodal llms for health grounded in individual-specific data. In ACM ML4MHD. Springer, 86–102.
  • Blunck Henrik, Bhattacharya Sourav, Prentow Thor, Kjærgaard Mikkel, and Dey Anind (2015) Blunck Henrik, Bhattacharya Sourav, Prentow Thor, Kjærgaard Mikkel, and Dey Anind. 2015. Heterogeneity Activity Recognition. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5689X.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS 33 (2020), 1877–1901.
  • Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS 33 (2020), 9912–9924.
  • Chatterjee et al. (2023) Soumyajit Chatterjee, Arun Singh, Bivas Mitra, and Sandip Chakraborty. 2023. Acconotate: Exploiting Acoustic Changes for Automatic Annotation of Inertial Data at the Source. In IEEE DCOSS-IoT. IEEE, 25–33.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In ICML. PMLR, 1597–1607.
  • Deldari et al. (2021) Shohreh Deldari, Daniel V Smith, Hao Xue, and Flora D Salim. 2021. Time series change point detection with self-supervised contrastive predictive coding. In ACM WWW. 3124–3135.
  • Deldari et al. (2022) Shohreh Deldari, Hao Xue, Aaqib Saeed, Daniel V Smith, and Flora D Salim. 2022. Cocoa: Cross modality contrastive learning for sensor data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 3 (2022), 1–28.
  • Dhekane et al. (2023) Sourish Gunesh Dhekane, Harish Haresamudram, Megha Thukral, and Thomas Plötz. 2023. How Much Unlabeled Data is Really Needed for Effective Self-Supervised Human Activity Recognition?. In ACM ISWC. 66–70.
  • Gao et al. (2024) Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, and Xiaojun Wan. 2024. LLM-based NLG Evaluation: Current Status and Challenges. arXiv preprint arXiv:2402.01383 (2024).
  • Gordon et al. (2022) Mitchell L Gordon, Michelle S Lam, Joon Sung Park, Kayur Patel, Jeff Hancock, Tatsunori Hashimoto, and Michael S Bernstein. 2022. Jury learning: Integrating dissenting voices into machine learning models. In ACM CHI. 1–19.
  • Hanneke et al. (2018) Steve Hanneke, Adam Tauman Kalai, Gautam Kamath, and Christos Tzamos. 2018. Actively avoiding nonsense in generative models. In ACM COLT. PMLR, 209–227.
  • Haresamudram et al. (2022) Harish Haresamudram, Irfan Essa, and Thomas Plötz. 2022. Assessing the state of self-supervised human activity recognition using wearables. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 3 (2022), 1–47.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In IEEE CVPR. 9729–9738.
  • Hossain and Roy (2019) HM Sajjad Hossain and Nirmalya Roy. 2019. Active deep learning for activity recognition with context aware annotator selection. In ACM SIGKDD. 1862–1870.
  • Jain et al. (2022) Yash Jain, Chi Ian Tang, Chulhong Min, Fahim Kawsar, and Akhil Mathur. 2022. Collossl: Collaborative self-supervised learning for human activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 1 (2022), 1–28.
  • Jaiswal et al. (2020) Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. 2020. A survey on contrastive self-supervised learning. Technologies 9, 1 (2020), 2.
  • Jin et al. (2023) Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. 2023. Time-llm: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728 (2023).
  • Jin et al. (2024) Ming Jin, Yifan Zhang, Wei Chen, Kexin Zhang, Yuxuan Liang, Bin Yang, Jindong Wang, Shirui Pan, and Qingsong Wen. 2024. Position Paper: What Can Large Language Models Tell Us about Time Series Analysis. arXiv preprint arXiv:2402.02713 (2024).
  • Kim et al. (2024) Yubin Kim, Xuhai Xu, Daniel McDuff, Cynthia Breazeal, and Hae Won Park. 2024. Health-llm: Large language models for health prediction via wearable sensor data. arXiv preprint arXiv:2401.06866 (2024).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kwon et al. (2019) Hyeokhyen Kwon, Gregory D Abowd, and Thomas Plötz. 2019. Handling annotation uncertainty in human activity recognition. In ACM ISWC. 109–117.
  • Laput and Harrison (2019) Gierad Laput and Chris Harrison. 2019. Sensing fine-grained hand activity with smartwatches. In ACM CHI. 1–13.
  • Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022).
  • Liu et al. (2023) Xin Liu, Daniel McDuff, Geza Kovacs, Isaac Galatzer-Levy, Jacob Sunshine, Jiening Zhan, Ming-Zher Poh, Shun Liao, Paolo Di Achille, and Shwetak Patel. 2023. Large Language Models are Few-Shot Health Learners. arXiv preprint arXiv:2305.15525 (2023).
  • Malekzadeh et al. (2019) Mohammad Malekzadeh, Richard G Clegg, Andrea Cavallaro, and Hamed Haddadi. 2019. Mobile sensor data anonymization. In IoTDI. 49–58.
  • Microsoft (2023) Microsoft. 2023. What is a Vector Database? https://learn.microsoft.com/en-us/semantic-kernel/memories/vector-db, Last Accessed: .
  • OpenAI (2023) OpenAI. 2023. Rate limits. https://platform.openai.com/docs/guides/rate-limits?context=tier-free, Last Accessed: .
  • OpenAI (2024) OpenAI. 2024. How much does GPT-4 cost? https://help.openai.com/en/articles/7127956-how-much-does-gpt-4-cost, Last Accessed: .
  • Qian et al. (2022) Hangwei Qian, Tian Tian, and Chunyan Miao. 2022. What makes good contrastive learning on small-scale wearable-based tasks?. In ACM SIGKDD. 3761–3771.
  • Qu et al. (2023) Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, and Tat-Seng Chua. 2023. Layoutllm-t2i: Eliciting layout guidance from llm for text-to-image generation. In ACM MM. 643–654.
  • Ramamurthy et al. (2022) Sreenivasan Ramasamy Ramamurthy, Soumyajit Chatterjee, Elizabeth Galik, Aryya Gangopadhyay, Nirmalya Roy, Bivas Mitra, and Sandip Chakraborty. 2022. Cogax: Early assessment of cognitive and functional impairment from accelerometry. In PerCom. IEEE, 66–76.
  • Reiss (2012) Attila Reiss. 2012. PAMAP2 Physical Activity Monitoring. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5NW2H.
  • Reyes-Ortiz Jorge, Anguita Davide, Ghio Alessandro, Oneto Luca, Parra Xavier (2012) Reyes-Ortiz Jorge, Anguita Davide, Ghio Alessandro, Oneto Luca, Parra Xavier. 2012. Human Activity Recognition Using Smartphones. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C54S4K.
  • Spathis et al. (2022) Dimitris Spathis, Ignacio Perez-Pozuelo, Laia Marques-Fernandez, and Cecilia Mascolo. 2022. Breaking away from labels: The promise of self-supervised machine learning in intelligent health. Patterns 3, 2 (2022).
  • Tang et al. (2020) Chi Ian Tang, Ignacio Perez-Pozuelo, Dimitris Spathis, and Cecilia Mascolo. 2020. Exploring contrastive learning in human activity recognition for healthcare. arXiv preprint arXiv:2011.11542 (2020).
  • Tu et al. (2024) Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. 2024. Towards generalist biomedical ai. NEJM AI 1, 3 (2024), AIoa2300138.
  • Wu et al. (2022) Xingjiao Wu, Luwei Xiao, Yixuan Sun, Junhang Zhang, Tianlong Ma, and Liang He. 2022. A survey of human-in-the-loop for machine learning. Future Generation Computer Systems 135 (2022), 364–381.
  • Xie and Min (2022) Sang Michael Xie and Sewon Min. 2022. How does in-context learning work? A framework for understanding the differences from traditional supervised learning. https://ai.stanford.edu/blog/understanding-incontext/, Last Accessed: .
  • Xu et al. (2023a) Huatao Xu, Liying Han, Mo Li, and Mani Srivastava. 2023a. Penetrative ai: Making llms comprehend the physical world. arXiv preprint arXiv:2310.09605 (2023).
  • Xu et al. (2023b) Maxwell Xu, Alexander Moreno, Hui Wei, Benjamin Marlin, and James Matthew Rehg. 2023b. Retrieval-Based Reconstruction For Time-series Contrastive Learning. In ICLR.
  • Xu et al. (2023c) Shawn Xu, Lin Yang, Christopher Kelly, Marcin Sieniek, Timo Kohlberger, Martin Ma, Wei-Hung Weng, Attila Kiraly, Sahar Kazemzadeh, Zakkai Melamed, et al. 2023c. ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders. arXiv preprint arXiv:2308.01317 (2023).
  • Xue and Salim (2023) Hao Xue and Flora D Salim. 2023. Promptcast: A new prompt-based learning paradigm for time series forecasting. IEEE Transactions on Knowledge and Data Engineering (2023).
  • Xue et al. (2022) Hao Xue, Bhanu Prakash Voutharoja, and Flora D Salim. 2022. Leveraging language foundation models for human mobility forecasting. In ACM SIGSPATIAL. 1–9.
  • Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022).
  • Yuan et al. (2022) Hang Yuan, Shing Chan, Andrew P Creagh, Catherine Tong, David A Clifton, and Aiden Doherty. 2022. Self-supervised learning for human activity recognition using 700,000 person-days of wearable data. arXiv preprint arXiv:2206.02909 (2022).
  • Zeni et al. (2019) Mattia Zeni, Wanyi Zhang, Enrico Bignotti, Andrea Passerini, and Fausto Giunchiglia. 2019. Fixing mislabeling by human annotators leveraging conflict resolution and prior knowledge. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, 1 (2019), 1–23.
  • Zhang et al. (2023a) Hang Zhang, Xin Li, and Lidong Bing. 2023a. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023).
  • Zhang et al. (2024) Xiyuan Zhang, Ranak Roy Chowdhury, Rajesh K. Gupta, and Jingbo Shang. 2024. Large Language Models for Time Series: A Survey. arXiv:2402.01801 [cs.LG]
  • Zhang et al. (2022) Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. 2022. Self-supervised contrastive pre-training for time series via time-frequency consistency. NeurIPS 35 (2022), 3988–4003.
  • Zhang et al. (2020) Yun C Zhang, Shibo Zhang, Miao Liu, Elyse Daly, Samuel Battalio, Santosh Kumar, Bonnie Spring, James M Rehg, and Nabil Alshurafa. 2020. SyncWISE: Window induced shift estimation for synchronization of video and accelerometry from wearable sensors. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4, 3 (2020), 1–26.
  • Zhang et al. (2023b) Zheng Zhang, Chen Zheng, Da Tang, Ke Sun, Yukun Ma, Yingtong Bu, Xun Zhou, and Liang Zhao. 2023b. Balancing specialized and general skills in llms: The impact of modern tuning and data strategy. arXiv preprint arXiv:2310.04945 (2023).
  • Zhao et al. (2023) Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2023. Explainability for large language models: A survey. ACM TIST (2023).