
Non-Contrastive Learning-based Behavioural Biometrics for Smart IoT Devices

Oshan Jayawardana, Fariza Rashid, and Suranga Seneviratne. O. Jayawardana is with The School of Computer Science, The University of Sydney, Australia and The University of Moratuwa, Sri Lanka. F. Rashid and S. Seneviratne are with The School of Computer Science, The University of Sydney, Australia.
Abstract

Behavioural biometrics are being explored as a viable alternative to overcome the limitations of traditional authentication methods such as passwords and static biometrics. They are also being considered as a viable authentication method for IoT devices such as smart headsets with AR/VR capabilities, wearables, and earables, which do not have a large form factor or the ability to seamlessly interact with the user. Recent behavioural biometric solutions use deep learning models that require large amounts of annotated training data. Collecting such volumes of behavioural biometrics data raises privacy and usability concerns. To this end, we propose using SimSiam-based non-contrastive self-supervised learning to improve the label efficiency of behavioural biometric systems. The key idea is to use large volumes of unlabelled (and anonymised) data to build good feature extractors that can subsequently be used in supervised settings. Using two EEG datasets, we show that at lower amounts of labelled data, non-contrastive learning performs 4%–11% better than conventional methods such as supervised learning and data augmentation. We also show that, in general, self-supervised learning methods perform better than other baselines. Finally, through careful experimentation, we show various modifications that can be incorporated into the non-contrastive learning process to achieve high performance.

Index Terms:
Behavioural Biometrics, Smart Sensing, EEG, Authentication, IoT
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I Introduction

The pervasive use of smart devices and the vast amounts of sensitive information stored on those devices exacerbate the problem of user authentication on smart devices. Traditional methods such as passwords, PINs, and security tokens have usability issues [1], and static biometrics such as fingerprinting and face ID are vulnerable to spoofing attacks. As a result, behavioural biometrics has been explored by many works as a more user-friendly (i.e., implicit by nature, requiring no extra effort from the user) and secure (i.e., difficult to spoof and allowing continuous authentication) alternative for user authentication on smart devices. Example behavioural biometric modalities include gait patterns [2], typing patterns [3, 4], breathing acoustics [5, 6], and EEG patterns [7, 8]. Behavioural biometrics also finds applications in smart IoT devices that either lack a large form factor or have limited interactive components [7, 6].

Compared to static biometrics, behavioural biometrics needs a significant number of training samples to be collected from users at registration time and, in most cases, in different contextual settings [5, 7]. Moreover, the majority of recent behavioural biometrics solutions use deep learning models that are known to require large amounts of training data [9, 10]. Specifically, many solutions used Convolutional Neural Networks (CNNs) [11, 6] or Recurrent Neural Networks (RNNs) [8, 12] that require massive amounts of labelled data for better generalisation.

Collecting such volumes of labelled behavioural biometric data is not practical in many real-world scenarios. For instance, collecting a significant amount of training data at registration time will inconvenience users, reduce usability, and raise privacy concerns. As a result, it is important to build learning methods that enable building deep learning models using less labelled data.

While collecting large volumes of labelled data for behavioural biometrics is challenging and inconvenient, collecting large volumes of unlabelled data is relatively easy. Unlabelled data can be collected while the device is in use by the user, without any supervision and anonymously, so that the data does not contain any personally identifiable information, eliminating threats to the user’s privacy. For example, a mobile platform provider planning to build a gait-based behavioural biometric system can collect unlabelled data from the motion sensors of their platform users. Therefore, it is necessary to develop learning methods that can leverage large volumes of unlabelled data to reduce the labelled data requirement of behavioural biometrics. To this end, in this paper, we propose to use non-contrastive self-supervised learning. More specifically, we make the following contributions.

  • We propose a SimSiam [13]-based non-contrastive learning approach and associated modifications such as shallow feature extractors and weight decay to develop label-efficient classifiers for behavioural biometrics data.

  • Using two EEG-based behavioural biometrics datasets in three authentication system development scenarios, we show that the proposed non-contrastive learning approach outperforms conventional supervised learning approaches by 4%–11% at lower amounts of labelled data. We also show that non-contrastive learning performs comparably to a state-of-the-art multi-task learning-based baseline.

  • We conduct further experiments and provide insights into the effectiveness of different types of augmentations on the non-contrastive learning process. We also provide empirical evidence of how our modifications to SimSiam models help in the context of behavioural biometrics.

The rest of the paper is organised as follows. In Section II, we present the related work and in Section III, we describe the overall methodology. Next we explain the datasets and model details in Section IV. Section V presents the results and Section VI presents further analysis of various model parameters’ effect on performance. Finally, Section VII discusses limitations of our work and possible extensions, and concludes the paper.

II Related Work

II-A Behavioural Biometrics

There is a vast body of work proposing various behavioural biometric modalities. Early work involved using typing patterns and touch gestures [3, 14, 15] while later modalities leveraged human physiology  [5, 12, 7, 8, 16, 17]. The authentication solutions generally involve building machine learning classifiers or signature similarity-based approaches [9]. More recent works use deep learning methods, given their broader success in other domains [18].

Other works in behavioural biometrics aimed to increase the training efficiency with class incremental learning [19] or improved label efficiency using few-shot learning [20] and transfer learning [21]. Similar efforts were also made in human activity recognition [22, 23].

In contrast, we propose to improve the label efficiency by using non-contrastive self-supervised learning. Non-contrastive learning leverages large volumes of unlabelled data to build label-efficient classifiers. To the best of our knowledge, our work is the first to use non-contrastive learning for behavioural biometrics.

II-B Self-supervised Learning (SSL)

Self-supervised learning (SSL) refers to a broader family of methods in which a model learns representations from unlabelled data using pretext tasks. The model trained on the pretext task then acts as a feature extractor for supervised learning tasks, reducing the labelled data requirement. For example, in computer vision, a pretext task may train a model to predict whether an image is an original or an augmentation. In this way, the model learns the distinguishing features of the original image. The pretext model is then fine-tuned for a downstream task in a supervised setting with labelled data. Jing et al. [24] provide a survey of SSL methods.

Early work closely resembling modern SSL includes Bromley et al. [25], where the authors proposed the "Siamese" neural network architecture for signature verification. However, due to excessive resource requirements, SSL did not receive much attention until its success in natural language processing. In 2013, Mikolov et al. [26] used self-supervised learning to introduce word2vec, which paved the way to powerful pre-trained language models such as BERT [27], RoBERTa [28], and XLM-R [29].

Nonetheless, neither generative methods [30, 31, 32] nor discriminative approaches [33, 34, 35, 36] were successful in other domains such as computer vision due to high computational complexity [37]. In contrast, Siamese networks-based comparative methods have shown promising results in computer vision [37, 38, 39, 13].

The basic form of Siamese networks consists of two identical neural networks that take two views of the same input (i.e., a positive pair) and output embeddings that have a low energy (or high similarity) between them. To increase the similarity of the two views, the networks learn spatially or temporally transformation-invariant embeddings. Despite many successful applications of Siamese networks, collapsing networks (where the network converges to a trivial solution) limit their performance.

To overcome these limitations, contrastive learning methods  [37, 40, 41, 42, 43] used negatives to avoid collapsing by not only pulling positives towards each other but also by pushing apart negatives in the embedding space. An example is the SimCLR model [37]. However, contrastive learning requires large batch sizes  [37, 43], support sets  [41], or memory queues  [42, 44, 40].

As a result, non-contrastive learning methods, and in particular the SimSiam model [13], emerged as a viable alternative. Non-contrastive learning generally involves clustering [39, 45], momentum encoders [38], or using a cross-correlation matrix between the outputs of two identical networks as the objective function [46] to address collapsing networks. These methods avoid the use of negatives to overcome the limitation of contrastive learning whereby two positive pair samples can get pushed apart in the embedding space, consequently becoming a negative pair and harming the performance of the end task [47]. However, SimSiam [13] outperforms other non-contrastive approaches without using complex training mechanisms such as momentum encoders. It emphasises the importance of the stop-gradient operation and presents an efficient and simple solution to the collapsing networks problem.

II-C SSL in Sensing and Behavioural Biometrics

While SSL has made major contributions to natural language processing, computer vision, and speech processing, its feasibility has also been explored in sensing and mobile computing [48]. Saeed et al. [23] introduced self-supervised learning for time-series sensor data by introducing augmentations that are compatible with time-series data. The authors used a multi-task SSL model to reduce the labelled training data requirement in Human Activity Recognition (HAR). Using ten labelled samples per class, the authors achieved approximately 88.8% of the highest score reached by conventional supervised learning. SimCLR and several other contrastive and non-contrastive SSL methods have also been assessed on HAR problems [49, 50]. Others, such as Wright and Stewart [51] and Miller et al. [10], explored the use of traditional Siamese networks to reduce the training data requirement of behavioural biometrics-based user authentication.

In contrast to these works, to the best of our knowledge, we are the first to propose SimSiam [13]-based non-contrastive learning for behavioural biometrics to reduce the labelled data requirement. Our method neither uses negatives nor requires complex training mechanisms such as momentum encoders to avoid collapsing. We compare our approach with baselines including traditional supervised learning, transfer learning, data augmentation, and state-of-the-art multi-task learning [23] and show that it can outperform supervised learning and provide comparable performance to multi-task learning at lower amounts of labelled data.

III Methodology

Scenario Research Question Baseline Methods
1 Can non-contrastive SSL be used to leverage unlabelled data from a set of users to build a label-efficient classifier for a completely different set of users? Supervised learning, Data augmentation, Transfer learning, Self-supervised multi-task learning, Simple Siamese network
2 Can non-contrastive SSL be used to leverage unlabelled data from a given set of users to train a label-efficient classifier for user authentication? Supervised learning, Data augmentation, Self-supervised multi-task learning, Simple Siamese network
3 Can non-contrastive SSL be used to leverage unlabelled data from an initial set of users to build a label-efficient classifier for both the initial set of users and a whole new set of users? Supervised learning, Data augmentation, Self-supervised multi-task learning, Simple Siamese network
TABLE I: Summary of research questions and baselines

III-A Non-contrastive Learning Approach

Our approach is based on the SimSiam architecture proposed by Chen et al. [13]. SimSiam is a simplified non-contrastive architecture that does not use negative pairs or other complex mechanisms to avoid collapsing. It consists of two twin networks that share weights, as illustrated in Figure 1.

Figure 1: SimSiam Architecture

The idea is to learn a good representation of the inputs by solving the task of increasing the similarity between a positive pair $(x_i, x_j)$. A positive pair consists of two randomly augmented versions of the same input sample $x$. That is,

x_i = \tau_i(x)
x_j = \tau_j(x)

Here $\tau$ is a function that generates a random augmentation each time it is called. The two versions are then encoded using the encoder network $g(x;\theta_g)$,

z_i = g(x_i)
z_j = g(x_j)

The encoder consists of a feature extractor $g_{fe}(x;\theta_{fe})$ and a projector $g_p(x;\theta_p)$. That is,

g(x) = g_p(g_{fe}(x))

The key idea of the projector is to convert the representation learnt by the feature extractor into a vector that can be used to calculate the similarity. Next, the encodings go through another predictor network $h(x;\theta_h)$ before calculating the similarity.

p_i = h(z_i)
p_j = h(z_j)

The purpose of the predictor is to predict the average of the representation vector across all possible augmentations the network has seen [13]. Next, the model calculates the cosine similarity within the pairs $(p_i, z_j)$ and $(p_j, z_i)$.

Sim(p_i, z_j) = \dfrac{p_i \cdot z_j}{\lVert p_i \rVert_2 \, \lVert z_j \rVert_2}
Sim(p_j, z_i) = \dfrac{p_j \cdot z_i}{\lVert p_j \rVert_2 \, \lVert z_i \rVert_2}

Here, $\lVert \cdot \rVert_2$ denotes the $l_2$ norm of a vector. The task of the SimSiam model is to increase the total similarity, $Sim(p_i, z_j) + Sim(p_j, z_i)$. To do that, the symmetric negative cosine similarity loss defined below is used at training time.

L = -\dfrac{1}{2} Sim(p_i, \mathrm{stopgrad}(z_j)) - \dfrac{1}{2} Sim(p_j, \mathrm{stopgrad}(z_i))

Note that applying the stopgrad operation is essential for the SimSiam architecture to work [13]. It treats one side of the network as constant when computing the gradients of the other side, preventing gradients from backpropagating in that direction, as shown in Figure 1. During the training process, the parameters $\theta_{fe}$, $\theta_p$, and $\theta_h$ are learnt.
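For concreteness, the following is a minimal sketch of a single SimSiam pre-training step implementing the symmetric loss and stop-gradient described above. The framework choice (TensorFlow/Keras) and the names encoder, predictor, and augment are illustrative placeholders rather than the exact implementation used in this work.

```python
import tensorflow as tf

def negative_cosine_similarity(p, z):
    # z is treated as a constant via stop_gradient, as required by SimSiam.
    z = tf.stop_gradient(z)
    p = tf.math.l2_normalize(p, axis=-1)
    z = tf.math.l2_normalize(z, axis=-1)
    return -tf.reduce_mean(tf.reduce_sum(p * z, axis=-1))

def simsiam_step(x, augment, encoder, predictor, optimizer):
    # Two random augmentations of the same batch form the positive pair.
    x_i, x_j = augment(x), augment(x)
    with tf.GradientTape() as tape:
        z_i, z_j = encoder(x_i, training=True), encoder(x_j, training=True)      # z = g(x)
        p_i, p_j = predictor(z_i, training=True), predictor(z_j, training=True)  # p = h(z)
        # Symmetric negative cosine similarity loss L.
        loss = (0.5 * negative_cosine_similarity(p_i, z_j)
                + 0.5 * negative_cosine_similarity(p_j, z_i))
    variables = encoder.trainable_variables + predictor.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```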

After the pre-text training of the SimSiam model, we transfer the trained feature extractor $g_{fe}(x;\theta_{fe})$ to our downstream task of building a classifier, as illustrated in Figure 2.

Figure 2: Using the pre-trained feature extractor to build a downstream task classifier

We introduce two modifications to make the SimSiam architecture work for time-series behavioural biometrics data and further improve its performance. They are based on the hypothesis that overly easy self-supervision tasks lead to learning trivial features that do not hold any value for subsequent downstream tasks. Therefore, it is important to make the self-supervised learning task more challenging so that robust features are learned during pre-text training.

  • Shallow feature extractor networks - The original SimSiam architecture was designed for image data. Time series data of behavioural biometrics are less complex compared to images and as such, to avoid the model over-fitting on the pre-text task, we use shallow feature extractor networks. A shallow feature extractor makes the learning task more difficult and, as a result, allows the building of better feature extractors.

  • Weight decay - Using weight decay to prevent over-fitting is common in any machine learning application. We add high weight regularisation to the feature extractor network to avoid overfitting and make the pre-text training more challenging.

Later, in Section VI-B, we provide an analysis of how the performance improves with our modifications. During our experiments, we also came across another important finding about the predictor network. We found that a deeper predictor network can help improve the non-contrastive SSL process. We provide further analysis and an explanation of why this happens in Section VI-C.

III-B Authentication Scenarios

We conduct experiments to demonstrate the effectiveness of non-contrastive SSL in behavioural biometrics under three scenarios that are commonplace in authentication settings.

  • Scenario 1 - This scenario represents what is usually encountered by anyone who is developing a large-scale behavioural biometrics solution. That is, it is possible to collect large volumes of unlabelled data. For example, mobile OS providers can collect unlabelled data streams such as touch patterns and gait patterns from a large number of users, in compliance with privacy regulations. However, only a limited amount of labelled data can be collected from a known set of users due to usability and privacy constraints, either in-house or by explicitly obtaining customer consent.

    More specifically, for $N_1$ users, a large volume of unlabelled data $X_{U_1}$ is available. For a different set of $N_2$ users, only a limited amount of labelled data $(X_{L_{2_{min}}}, Y_2)$ is available. The task is to build a classifier $f_{SSL}(x;\theta)$ to identify the user $y \in \{1, \ldots, N_2\}$ given the input $x$, by using both the unlabelled data $X_{U_1}$ and the limited labelled data $(X_{L_{2_{min}}}, Y_2)$. Here $|X_U| \gg |X_{L_{min}}|$.

    Here, we use the unlabelled data from the $N_1$ users, $X_{U_1}$, to pre-train a SimSiam-based feature extractor. Next, we train a classifier network on top of the pre-trained feature extractor for the $N_2$ users. We use the labelled data $(X_{L_{2_{min}}}, Y_2)$ to train the classifier. We fine-tune the learnt weights of the feature extractor while training the classifier network. The concatenated fine-tuned feature extractor and classifier network create the final classifier $f_{SSL}(x;\theta)$ (cf. Figure 2).

  • Scenario 2 - This scenario is similar to Scenario 1. However, here an organisation is trying to build an in-house authentication system. As a result, again there is a large volume of unlabelled data and a limited amount of labelled data, but, in contrast to Scenario 1, for the same set of users.

    That is, for $N$ users, only a limited amount of labelled data $(X_{L_{min}}, Y)$ is available, while a large volume of unlabelled data $X_U$ is available. The task is to build a classifier $f_{SSL}(x;\theta)$ to correctly identify the user $y \in \{1, \ldots, N\}$ given an input data sample $x$, by using the limited labelled data $X_{L_{min}}$ and the unlabelled data $X_U$. Again, here $|X_U| \gg |X_{L_{min}}|$.

    Similar to Scenario 1, here also we first pre-train a SimSiam-based feature extractor using the unlabelled data $X_U$ and build a classifier network on top of the feature extractor for the $N$ users using the available limited labelled data $(X_{L_{min}}, Y)$. During classifier training, the feature extractor is fine-tuned in the same way as in Scenario 1. The concatenated fine-tuned feature extractor and classifier network create the final classifier $f_{SSL}(x;\theta)$.

  • Scenario 3 - This is an incremental step from Scenario 2, where the organisation has collected labelled and unlabelled data for building the authentication system but has additional new users for whom only a limited amount of labelled data is available.

    That is, for $N_1$ users, a limited amount of labelled data $(X_{L_{1_{min}}}, Y_1)$ and a large volume of unlabelled data $X_{U_1}$ are available. For a different set of $N_2$ users, only a limited amount of labelled data $(X_{L_{2_{min}}}, Y_2)$ is available. The task is to build a classifier $f_{SSL}(x;\theta)$ to correctly identify the user $y \in \{1, \ldots, (N_1+N_2)\}$ from the combined set, given an input data sample, by leveraging both the unlabelled and labelled data.

    Here, we first use the unlabelled data from the $N_1$ users, $X_{U_1}$, to pre-train the feature extractor with the SimSiam architecture. Then, we build a classifier on top of the pre-trained feature extractor for the combined set of $N_1+N_2$ users using both labelled datasets $(X_{L_{1_{min}}}, Y_1)$ and $(X_{L_{2_{min}}}, Y_2)$. Similar to the previous scenarios, we fine-tune the learnt weights of the feature extractor while training the classifier network. The concatenated fine-tuned feature extractor and classifier network create the final classifier $f_{SSL}(x;\theta)$.

III-C Baseline Methods

We compare our non-contrastive SSL approach with multiple baselines. Below we provide a general overview of the different baselines we use. However, we highlight that not all baselines apply to all three scenarios.

  1. Supervised learning - We train a 1D CNN based on the available labelled data. For example, for Scenario 1, we leverage the available limited labelled data, $X_{L_{min}}$, and train $f_S(x;\theta)$. We do all the required hyperparameter tuning, such as finding optimal convolution kernel sizes, the number of convolutional filters in a layer, the depth of the network, learning rates, and weight regularisation constants, to ensure that we leverage the full capability of the supervised learning approach.

  2. Data augmentation - Data augmentation is a default step in any deep neural network training process. It helps to increase the generalisability of the model as well as to learn from limited labelled data to some extent. In this baseline, we augment the available limited labelled training data using two methods, scaling and noise addition, and train a supervised learning classifier as usual using both the limited labelled data and the augmented data.

  3. Multi-task self-supervised learning (MTSSL) - This is a self-supervised learning baseline which, unlike the supervised learning approaches above, can leverage unlabelled data. As a result, it is a closer baseline to our non-contrastive SSL approach. Here, we first train a multi-task model with a common feature extractor using the unlabelled data available in a given scenario. The common feature extractor is connected to several heads, each having a dedicated discriminative task. Each head is a binary classifier learning to discriminate whether an assigned augmentation is applied to a sample or not. Next, we build a classifier on top of the pre-trained feature extractor by adding extra fully connected layers.

  4. Transfer learning - Transfer learning is the most common approach to handling the lack of labelled data. Here, a deep neural network trained using labelled data is leveraged as a feature extractor to facilitate adding new users to an existing behavioural biometric system or building a new behavioural biometric system from less labelled data. During the transfer learning phase, several new layers are added to the previous feature extractor, corresponding to the new classification task. The entire model is then fine-tuned with the available limited training data.

  5. Transfer learning with data augmentation - Here, we do transfer learning together with data augmentation.

In Table I, we summarise the research questions associated with the three scenarios and the baselines used in each scenario. We give further details of the implementation aspects of the baseline methods in Section IV-C.

III-D Performance Metrics

To measure the performance of the trained non-contrastive SSL-based user authentication systems and compare them against the other baselines, we use Cohen’s Kappa coefficient, similar to [23]. Accuracy is the most commonly used metric to evaluate performance in a multi-class setting. It measures the agreement between two raters, the raters here being the true labels and the predicted labels.

\text{Accuracy} = \dfrac{\text{No. of correct predictions}}{\text{No. of total predictions}}

However, accuracy can be misleading on some occasions, especially if the trained model is biased to predict one class more accurately and another class less accurately. In contrast, Cohen’s Kappa coefficient measures the agreement between two raters but discounts the effect of agreement by chance. That is,

\text{Kappa Score} = \dfrac{P_o - P_e}{1 - P_e}

P_o = \text{Probability of agreement (Accuracy)}
P_e = \text{Probability of agreement by chance}

That is,

P_o = \dfrac{\text{No. of correct predictions}}{\text{No. of total predictions}}
P_e = \sum_{i}^{N} \dfrac{n_{true}^{i}}{n_{total}} \times \dfrac{n_{pred}^{i}}{n_{total}}

N = \text{Number of classes}
n_{total} = \text{No. of total predictions}
n_{true}^{i} = \text{No. of true labels of the $i^{th}$ class}
n_{pred}^{i} = \text{No. of predicted labels of the $i^{th}$ class}

The highest possible Kappa Score is 1, indicating the best performance, and the score can take negative values for poor performance. Overall, we progressively increase the amount of labelled data available and compare the performance of the different methods using the Kappa Score. We further discuss this in Section V.
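As a reference point, the Kappa Score above can be computed directly from the true and predicted labels. The following is a minimal NumPy sketch of that computation (it should agree with scikit-learn's cohen_kappa_score); the function name and arguments are illustrative.

```python
import numpy as np

def kappa_score(y_true, y_pred, num_classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_total = len(y_true)
    # P_o: observed agreement (accuracy).
    p_o = np.mean(y_true == y_pred)
    # P_e: agreement expected by chance, from the marginal label distributions.
    p_e = sum((np.sum(y_true == i) / n_total) * (np.sum(y_pred == i) / n_total)
              for i in range(num_classes))
    return (p_o - p_e) / (1 - p_e)
```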

IV Datasets and Models

IV-A Datasets

To demonstrate the effectiveness of non-contrastive learning in behavioural biometrics, we use two datasets: MusicID [7] and MMI [52]. Both are EEG datasets and have been used in behavioural biometric settings before. Note that, for the rest of the paper, a session refers to a single experiment of recording sensor readings in one sitting for one user.

  • MusicID - This dataset consists of brainwave data collected from 20 volunteers while they performed two tasks [7]: listening to a popular English song and listening to the individual’s favourite song. The participants wore a Muse brain-sensing headset while listening to the music and kept their eyes closed. The dataset experiment was approved by the host institution’s Human Research Ethics Committee, as mentioned in the original work. The duration of each task in a single session was 150s and the headset records samples at a rate of 2Hz, resulting in a total of 300 readings per participant, per session. Data was collected from each participant over multiple sessions, with the number of sessions per user varying between 12 and 30 (considering both the same song and favourite song sessions together). Each headset recording contains 24 values: the absolute brainwave values of the alpha, beta, theta, delta, and gamma bands, and the raw EEG, from the standard 4 channels of the Muse headset. That is, a single reading in this dataset is a 24-dimensional vector, $x_i \in \mathbb{R}^{24}$. We use 30 readings (i.e., 15 seconds of data) as a single input sample to the model. Consequently, one input to our model is $x \in \mathbb{R}^{30 \times 24}$.

  • MMI - This is a publicly available dataset known as eegmmidb (EEG Motor Movement/Imagery database). It comprises EEG signals obtained from 109 participants using the BCI2000 system [52]. Separate experiments were conducted where each participant carried out four different tasks, each for a two-minute duration and repeated three times. The tasks involved different combinations of opening and closing fists or feet based on the location of a target shown on a screen. Each participant also performed two one-minute baseline runs. The data was obtained in EDF+ format, consisting of 64 EEG signals and an annotation channel, where the annotation channel indicates the participant’s activity. We randomly selected eight channels (3, 12, 13, 18, 50, 60, 61, and 64) to reduce the memory requirements. For each channel, we filter the raw signal to get the alpha, beta, theta, and delta components and use the raw signal and the four filtered components as input features to our models. That is, a single reading is $x_i \in \mathbb{R}^{40}$. The sessions are recorded with a 160Hz sampling rate. We use 800ms of readings, making our inputs to the models $x \in \mathbb{R}^{128 \times 40}$.

    Even though this dataset has been used for user authentication [8], the previous work only uses a portion of the dataset, EEG-S, which contains data from only eight users. We, on the other hand, incorporate all 109 users in our experiments. Due to the high number of users and the fact that this dataset was not collected with authentication as a target application, the maximum kappa scores we could achieve for user authentication on the MMI dataset are lower than on the MusicID dataset.

IV-B Dataset Splits

To emulate the three scenarios described in Section III-B, we split each dataset into two parts: Dataset 1 and Dataset 2. Dataset 1 contains data from approximately 1/3 of the users of the total dataset and Dataset 2 contains the rest. We use these two datasets in different ways in the three scenarios. Note that in the following description, "labelled data of Dataset 1/Dataset 2" refers to $(x, y)$ pairs coming from a dataset, while "unlabelled data of Dataset 1/Dataset 2" refers to only the $x$ values coming from a dataset, ignoring the user ID labels.

In Scenario 1, we use Dataset 1 as the unlabelled data $X_{U_1}$ coming from the $N_1$ users. We use labelled data from Dataset 2 as the labelled data $(X_{L_{2_{min}}}, Y_2)$ coming from the $N_2$ users who will be part of the authentication system. We progressively increase the amount of labelled data in $(X_{L_{2_{min}}}, Y_2)$ to assess the performance of our method and the other methods, as presented in Section V-A.

Since in Scenario 2 both the labelled and unlabelled data are associated with the same set of users, we use only Dataset 2. In Scenario 3, we use Dataset 1 as the dataset coming from the initial $N_1$ users who provide both unlabelled data $X_{U_1}$ and labelled data $(X_{L_{1_{min}}}, Y_1)$. We use labelled data from Dataset 2 as the labelled dataset $(X_{L_{2_{min}}}, Y_2)$ coming from the new set of $N_2$ users.

We use dedicated sessions for validation and testing. Training data is read with a 50 percent overlapping window, similar to [7]. We do not use overlapping windows when reading validation and test data, since that would artificially increase the performance metrics. We summarise the two datasets and the splits in Table II.
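To illustrate how sessions are turned into model inputs, the sketch below slices a session into fixed-length windows with a configurable overlap; window_session and its parameters are illustrative helpers, with window lengths of 30 (MusicID) and 128 (MMI) readings following Section IV-A.

```python
import numpy as np

def window_session(readings, window_len, overlap=0.5):
    # readings: (time, channels) array for one session.
    # Training data uses a 50% overlapping window (overlap=0.5);
    # validation and test data use non-overlapping windows (overlap=0.0).
    stride = max(1, int(window_len * (1 - overlap)))
    windows = [readings[s:s + window_len]
               for s in range(0, len(readings) - window_len + 1, stride)]
    return np.stack(windows)  # shape: (num_windows, window_len, num_channels)

# e.g., MusicID: 30 x 24 inputs; MMI: 128 x 40 inputs
# train_x = window_session(session_array, window_len=30, overlap=0.5)
```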

Dataset MusicID [7] MMI [52]
No. of users 20 109
Sessions/user 12-30 14
Samples/user 68-170 3,242-4,119
Split - Dataset 1
No. of users 6 36
Unlabelled Data
 Training sessions/user 8-10 6
 Training samples/user 120-150 1,524-1,546
Labelled Data
 Training sessions/user 8-10 2
 Training samples/user 120-150 612-622
 Validation sessions/user 8-10 1
 Validation samples/user 8-10 153-156
 Testing sessions/user 8-10 1
 Testing samples/user 8-10 153-156
Split - Dataset 2
No. of users 14 73
Unlabelled Data
 Training sessions/user 4-6 6
 Training samples/user 60-90 1282-1671
Labelled Data
 Training sessions/user 4-6 2
 Training samples/user 60-90 490-757
 Validation sessions/user 4-6 1
 Validation samples/user 4-6 123-156
 Testing sessions/user 4-6 1
 Testing samples/user 4-6 123-156
TABLE II: Dataset and data split summary

IV-C Deep Learning Models and Training

IV-C1 Model architectures

We use the same 1D ResNet architecture, illustrated in Figure 3, as the feature extractor across all datasets, experiments, and models. The reason behind this choice is that it allows us to effectively compare the learning methods by minimising the effect of the network architecture. The only difference between the models in the two datasets is the number of convolutional filters, which is represented by $k$ in Figure 3. For example, for the MusicID dataset we used $k=(128, 256)$, meaning the first convolutional layer has 128 filters and the ResNet block has 256 filters. For the MMI dataset it was $k=(48, 96)$.

We used the MLP architectures illustrated in Figure 4 for the Projector and Predictor networks of the SimSiam model. Finally, the classifier networks were also MLPs, consisting of two hidden Dense layers with dimensions 256 and 64. The number of neurons in the final layer was equal to the number of users of the corresponding task. We used the ReLU activation function for all the MLP layers except for the final layer, which used the softmax activation.

Figure 3: Feature extractor architecture
Figure 4: Projector and Predictor Networks
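To illustrate how the downstream classifier of Figure 2 is assembled, the sketch below stacks the classifier head described above (two hidden Dense layers of 256 and 64 units with ReLU, and a softmax output over the users) on a pre-trained feature extractor. It assumes a Keras-style feature extractor model and is an illustrative sketch rather than our exact code.

```python
import tensorflow as tf

def build_downstream_classifier(feature_extractor, input_shape, num_users):
    # The pre-trained feature extractor stays trainable so that it is
    # fine-tuned together with the classifier head (as in Scenarios 1-3).
    inputs = tf.keras.Input(shape=input_shape)   # e.g., (30, 24) or (128, 40)
    features = feature_extractor(inputs)
    x = tf.keras.layers.Dense(256, activation="relu")(features)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_users, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```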

IV-C2 Input Transformations for SimSiam and MTSSL

For the MusicID dataset, we used Random Scaling and Jitter to create the two augmented input versions required for SimSiam model training. In Random Scaling, we multiply each channel of an input sample with a randomly generated variable drawn from a normal distribution $N(1, 0.65)$. In Jitter, we generate a noise matrix having the same dimensions as the input. The values of the noise matrix are sampled from a normal distribution $N(0, 0.8)$ and added to the original input sample. We selected both variance values through experiments.

Similarly, for the MMI dataset we use Random Scaling and Flipping as the augmentations. In Flipping, we reverse the time dimension of the input. We made these choices based on some early experiments where we tried various data transformation methods individually and in combination, as explained later in Section VI-A.
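The following NumPy sketch shows how the positive-pair augmentations described above can be generated. It assumes inputs of shape (time, channels), the variance values follow the text, and the helper names and the exact pairing of augmentations are illustrative.

```python
import numpy as np

def random_scaling(x, variance=0.65):
    # Multiply each channel by a factor drawn from N(1, variance).
    factors = np.random.normal(1.0, np.sqrt(variance), size=(1, x.shape[1]))
    return x * factors

def jitter(x, variance=0.8):
    # Add element-wise Gaussian noise drawn from N(0, variance).
    return x + np.random.normal(0.0, np.sqrt(variance), size=x.shape)

def flipping(x):
    # Reverse the time dimension of the sample.
    return x[::-1, :]

def positive_pair_musicid(x):
    # One possible pairing for the MusicID setting (illustrative):
    # one view is randomly scaled, the other is jittered.
    return random_scaling(x), jitter(x)
```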

According to the findings of Saeed et al. [23], using multiple augmentations for multi-task learning leads to higher self-supervised learning performance. Therefore, after a pre-experiment similar to the one above, for the MusicID dataset we use Jitter, Random Scaling, Magnitude Warping, Random Sampling, Flipping, Data Dropping, Time Warping, Negation, Channel Shuffling, and Permutations as transformations in our multi-task learning baseline. For the MMI dataset, we use only four augmentations: Random Scaling, Magnitude Warping, Time Warping, and Negation. This is because the dataset’s larger size requires high computing memory. In fact, this is a limitation of multi-task learning compared to the SimSiam approach, as we discuss in Section VII. We summarise all the transformations we used to generate the results in Section V and Section VI in Table III.

Note that here we use the same multi-headed architecture as Saeed et al. [23] where each head of the multi-task model tries to discriminate whether an assigned augmentation is applied to a sample or not. For example, the first head tries to discriminate whether Gaussian noise is added to a sample or not, and the second head tries to discriminate whether a sample is scaled or not.
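For clarity, the sketch below outlines the multi-headed MTSSL baseline structure: a shared feature extractor with one binary head per augmentation, each predicting whether its assigned augmentation was applied. This is a simplified, illustrative Keras-style rendering of the architecture of Saeed et al. [23], not their exact implementation.

```python
import tensorflow as tf

def build_mtssl_model(feature_extractor, input_shape, augmentation_names):
    # Shared feature extractor; one sigmoid head per augmentation/task.
    inputs = tf.keras.Input(shape=input_shape)
    features = feature_extractor(inputs)
    heads = [tf.keras.layers.Dense(1, activation="sigmoid", name=name)(features)
             for name in augmentation_names]
    model = tf.keras.Model(inputs, heads)
    # Each head is trained with binary cross-entropy on original-vs-augmented samples.
    model.compile(optimizer="adam",
                  loss=["binary_crossentropy"] * len(augmentation_names))
    return model

# e.g., build_mtssl_model(fe, (128, 40),
#                         ["scaling", "magnitude_warp", "time_warp", "negation"])
```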

Augmentation Description
1 Jitter Adding random noise to a sample
2 Random Scaling Scaling each channel of the input with a randomly generated constant
3 Magnitude Warping Random element-wise scaling with a smooth transition along the time dimension
4 Time Warping Stretches the data across the time dimension. New samples are generated using interpolation (based on the entire sample) to stretch the time dimension
5 Flipping Reversing the time dimension of a sample
6 Data Dropping Making parts of the input zero
7 Random Sampling This is similar to Time Warping, where only a subset of the sample is used for interpolation
8 Permutations Randomly slicing and swapping values across the time dimension, within a moving time window
9 Negation Multiplying the sample by -1
10 Channel Shuffling Randomly exchanging the order of the input channels
TABLE III: Summary of transformations/augmentations

IV-C3 Training Process

We use Adam as the optimiser with an exponentially decaying learning rate for both pre-training and classifier training. The initial learning rates are 0.00003 and 0.01 for pre-training and classifier training, respectively. As explained in Section III-A, we use $l_2$ norm weight regularisation with a regularisation factor of $\lambda=0.01$ for the feature extractor networks of the self-supervised learning methods. We train the self-supervised models for up to 30 and 10 epochs for the MusicID and MMI datasets, respectively. We train the final classifiers for up to 30 epochs on both datasets. In all cases, we use early stopping to prevent over-fitting.
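For reference, the training configuration described above could be set up as in the following Keras-style sketch. The decay schedule parameters and early-stopping patience are illustrative placeholders, as they are not specified in the text; only the initial learning rates and the regularisation factor come from the description above.

```python
import tensorflow as tf

# Exponentially decaying learning rates for pre-training and classifier training.
pretrain_lr = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.00003, decay_steps=1000, decay_rate=0.96)
classifier_lr = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.96)

pretrain_optimizer = tf.keras.optimizers.Adam(learning_rate=pretrain_lr)
classifier_optimizer = tf.keras.optimizers.Adam(learning_rate=classifier_lr)

# l2 weight regularisation (lambda = 0.01) for the feature extractor layers, e.g.:
# tf.keras.layers.Conv1D(..., kernel_regularizer=tf.keras.regularizers.l2(0.01))

# Early stopping on the validation loss to prevent over-fitting.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)
```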

We plan to release all of our code and data splits publicly upon acceptance of the paper, for reproducibility of our results and to foster further research in the area.

V Results

Next, we present the results for the three scenarios. For each scenario, we progressively increase the amount of available labelled data (i.e., the percentage of labelled samples per user) and compare the non-contrastive SSL approach with other baseline methods. Ideally, we expect to reach a high kappa score (close to one) using as few data samples as possible. Each result we report is the average of ten experiment runs to avoid any biases in weight initialisation and data splits.

V-A Scenario 1

RQ1 - Can non-contrastive SSL be used to leverage unlabelled data from a set of users to build a label-efficient classifier for a completely different set of users?

Figure 5: Performance results for Scenario 1. (a) MusicID; (b) MMI.

In Figure 5a and Figure 5b, we show the performance of our SimSiam method and the various other baselines for the MusicID and MMI datasets, respectively. The overall kappa score is low for the MMI dataset (approximately 0.80–0.85) because it is a much noisier dataset, not necessarily designed for user authentication (cf. Section IV).

Both figures show that conventional supervised learning and data augmentation do not perform well because there is not enough labelled data to train them for the second set of users. For both datasets, transfer learning works best (i.e., shows a high kappa score at a given percentage of data samples per user), followed by multi-task learning and our SimSiam approach. On the MusicID dataset, SimSiam’s performance is competitive with multi-task learning (MTSSL), and on the MMI dataset, up until 20% of data samples per user, SimSiam is slightly worse but catches up with multi-task learning afterwards.

The high performance of transfer learning can be attributed to the use of labelled data from Dataset 1 during the pre-training phase of transfer learning. SimSiam SSL could still compete with the performance of transfer learning without using any labelled data from Dataset 1, which indicates its capability in extracting features from unlabelled data.

Overall, our results show that non-contrastive SSL can indeed learn generic features from the unlabelled data of one set of users to build a classifier for a totally different set of users using less labelled data. For instance, for the MusicID dataset, between 20% and 40% of samples per user, the average kappa score for SimSiam was 0.978. Within the same percentages of labelled samples per user, the average kappa score of supervised learning and data augmentation was 0.879. This corresponds to an 11% improvement over the baselines. The corresponding percentage increase for the MMI dataset was approximately 4%. Finally, though multi-task SSL performs slightly better than SimSiam, it is computationally expensive, as we explain in Section VII.

V-B Scenario 2

RQ2 - Can non-contrastive SSL be used to leverage large volumes of unlabelled data from a given set of users to train a label-efficient classifier for user authentication?

In Figure 6a and Figure 6b, we compare the results for Scenario 2 for the two datasets. At lower amounts of labelled data, the two self-supervised learning approaches, SimSiam and multi-task learning (MTSSL), perform better than the supervised learning approaches. For example, for the MusicID dataset, when only 20% of labelled samples are used, SimSiam and MTSSL have average kappa scores of 0.956 and 0.985, respectively, while supervised learning and data augmentation only result in kappa scores of 0.885 and 0.800. This difference drops for the MMI dataset, which can be expected since it is a much larger dataset compared to the MusicID dataset, and when 20% of labelled samples are used, that is sufficient to train the supervised learning classifier.

Finally, similar to Scenario 1, it is noticeable that between the two self-supervised learning methods, the multi-task approach shows better performance than the non-contrastive SimSiam approach, and the difference is more visible in the MMI dataset than in the MusicID dataset. Nonetheless, our results show that unlabelled data can be leveraged using non-contrastive SSL to build more label-efficient classifiers. For instance, the average performance improvements of SimSiam over the supervised and data augmentation approaches were 10% and 4% for the MusicID and MMI datasets, respectively.

Figure 6: Performance results for Scenario 2. (a) MusicID; (b) MMI.

V-C Scenario 3

RQ3 - Can non-contrastive SSL be used to leverage unlabelled data from an initial set of users to build a label-efficient classifier for both the initial set of users and a whole new set of users?

Figure 7a and Figure 7b present the results for Scenario 3. Overall, the results are similar to Scenario 2, with SimSiam and multi-task learning performing above the other baselines at lower amounts of labelled data. However, compared to Scenario 2, the performance gap between the different methods is reduced in Scenario 3. The reason is that a larger amount of labelled data is available in this scenario compared to the other two. Here we train the classifier for the total user set, that is, 20 users for MusicID and 109 for MMI. Even though the number of samples a user provides is similar to the other two scenarios, when taken as a whole it creates a large labelled dataset. Supervised learning methods benefit from this data and reach a performance similar to the self-supervised methods. Even then, multi-task SSL and SimSiam outperform the other methods, and on the MMI dataset SimSiam slightly outperforms multi-task SSL after 20% of samples per user.

Figure 7: Performance results for Scenario 3. (a) MusicID; (b) MMI.

VI Performance Analysis

We next present the results of several other experiments to further analyse the performance of the non-contrastive SSL approach, SimSiam. Since multi-task learning (MTSSL), the other SSL approach among the baselines, also resulted in higher performance compared to the traditional supervised learning and data augmentation baselines, where possible, we analyse both SimSiam and MTSSL together.

Overall, we analyse how different model parameters and data transformations of SimSiam and MTSSL affect the self-supervised learning process. In each experiment, we only change a single variable keeping everything else fixed, and train multiple self-supervised models for different variable values. We conduct these experiments in two settings.

  • Experimental Setting 1 ($Ex_1$) - We use Dataset 1 for both pre-training and classifier training. The goal is to assess the quality of the learnt features when pre-training is done with unlabelled data from the same set of users.

  • Experimental Setting 2 ($Ex_2$) - We do pre-training with Dataset 1 and train a classifier on Dataset 2, with the objective of assessing the user invariance of the learnt features.

Note that in these two settings, in contrast to Section V, we do not fine-tune the feature extractor when training the classifier. This is because we aim to assess the quality of pure features learnt in the self-supervised learning phase. When we fine-tune a model with labelled data, features learnt from an earlier phase can get modified or even overwritten.

We train with a fixed number of samples per user: 60 and 300 for the MusicID and MMI datasets, respectively. These two values are chosen according to the data availability of the two datasets. According to Table II, 68 is the minimum number of samples a user has in the MusicID dataset. Thus, to have a balanced dataset, we use 60 samples per user. Using the maximum possible (and balanced) amount of labelled data is important when evaluating the quality of the learnt features to obtain a more generalised view. Similar to MusicID, we try to use the maximum balanced dataset for the MMI dataset, which turns out to be extremely large. At the same time, when we examine Figures 5b, 6b, and 7b, we can observe that after 70% of samples per user, all the learning methods converge to a single value; that corresponds to approximately 300 samples per user in absolute terms for the MMI dataset. Finally, we note that each result we report is the average over 10 experiments to eliminate any biases caused by the random weight initialisation.

VI-A Transformation Methods

As described in Section III, the SimSiam training process involves feeding the network with positive pairs (i.e., two augmented versions of the same input). As such, it is important to identify data transformation/augmentation methods that result in better self-supervised feature learning. Thus, we train the SimSiam network with different augmentation technique pairs and compare their performance. As mentioned earlier, we keep all other network parameters fixed and only change the augmentation technique pair. We evaluate the feature extractors under both experimental settings, $Ex_1$ and $Ex_2$, for both datasets, and report the average kappa scores in Figure 8a and Figure 8b.

According to Figure 8a, any data augmentation technique pair gives high kappa scores of over 0.95 for the MusicID dataset. This can be attributed to the smaller size of the MusicID dataset and to the matching experiment conditions of the data collection process, which was designed for authentication applications from the beginning. In contrast, as can be seen from Figure 8b, no pair of two augmentation techniques results in high kappa scores for the MMI dataset. Also, there is a considerable variation in kappa scores between pairs. For example, the augmentation pair Permutation and Magnitude Warping results in a kappa score of only 0.3049, while the pair Drop and Time Warping results in 0.4902. This can be attributed to the MMI dataset being large and noisy. Nonetheless, this analysis justifies using more than two augmentation methods to train a better feature extractor in Section V. For example, when more augmentations are considered, the kappa score for MMI reaches close to 0.8 (cf. Section V).

Figure 8: Augmentation comparison for the SimSiam method (kappa scores averaged across $Ex_1$ and $Ex_2$, ten runs each). (a) MusicID; (b) MMI.

VI-B Effect of Model Modifications

As mentioned in Section III-A, we made two modifications to the SimSiam learning process to make it more suitable for behavioural biometrics data and improve its performance. Here, we experimentally show how the two modifications we introduced, a shallow feature extractor and weight decay, improve the self-supervised training process.

VI-B1 Depth of the feature extractor

We show that shallow feature extractor networks can lead to higher performance in SimSiam models. Table IV shows the different feature extractor network configurations we tested. The first number in each configuration corresponds to the number of filters used in the first 1D convolutional layer, and the remaining numbers correspond to the numbers of filters in the concatenated 1D ResNet blocks. For instance, Configuration 4 of the MusicID dataset corresponds to the architecture illustrated in Figure 3 with $k=(128, 256)$. We evaluate each of these models using both experimental setups and report the results in Table V; the highest kappa scores are marked in bold. For comparison, we also report the results of MTSSL when the same feature extractor models are used.

According to Table V, for SimSiam, Configuration 3 consistently gives the best results. Note that Configuration 3 has the same layer-wise depth as Configurations 1 and 2. However, Configurations 1 and 2 have a smaller number of filters compared to Configuration 3. From these results, we can conclude that given sufficient convolutional filters, shallow networks work best with SimSiam. In contrast, the best architecture for MTSSL changes across experimental settings as well as datasets.

Config(k) MusicID MMI
1 32 16
2 64 32
3 128 48
4 128, 256 48, 96
5 128, 256, 512 48, 96, 192
6 128, 256, 512, 1024 48, 96, 192, 384
7 128, 256, 512, 1024, 2048 48, 96, 192, 384, 768
TABLE IV: Model configurations for the two datasets
MusicID MMI
SimSiam MTSSL SimSiam MTSSL
Config $Ex_1$ $Ex_2$ $Ex_1$ $Ex_2$ $Ex_1$ $Ex_2$ $Ex_1$ $Ex_2$
1 0.9461 0.9601 0.9709 0.9318 0.4110 0.3796 0.5127 0.4380
2 0.9751 0.9767 0.9544 0.9534 0.4931 0.4423 0.5827 0.5082
3 0.9834 0.9966 0.9627 0.9750 0.5241 0.4670 0.5863 0.5160
4 0.9482 0.9651 0.9668 0.9534 0.4928 0.4163 0.6238 0.5526
5 0.9772 0.9518 0.9399 0.9468 0.4396 0.3583 0.6532 0.5679
6 0.9668 0.9136 0.9419 0.9335 0.0758 0.0430 0.6470 0.5733
7 0.8860 0.7262 0.9192 0.9003 2e-8 -2e-8 0.6392 0.5691
TABLE V: Effect of feature extractor depth

VI-B2 Weight decay

Next, we investigate how high weight decay (a.k.a. regularisation) can improve the performance of the self-supervised models. We apply $l_2$ norm weight regularisation to the feature extractor network and conduct experiments by varying the regularisation coefficient ($\lambda$) while keeping all other parameters fixed. Table VI shows the results, and it is clear that both self-supervised methods benefit from higher weight regularisation.

MusicID MMI
SimSiam MTSSL SimSiam MTSSL
$\lambda$ $Ex_1$ $Ex_2$ $Ex_1$ $Ex_2$ $Ex_1$ $Ex_2$ $Ex_1$ $Ex_2$
0.1 0.9523 0.9036 0.9088 0.9236 0.5044 0.3997 0.6677 0.5859
0.01 0.9730 0.8688 0.9647 0.9169 0.5139 0.4148 0.6293 0.5591
0.001 0.9523 0.8870 0.8943 0.8571 0.4974 0.4001 0.6080 0.5337
0.0001 0.9255 0.8471 0.9420 0.8503 0.5055 0.4033 0.5885 0.5256
0.00001 0.9357 0.8738 0.9378 0.9019 0.5051 0.4100 0.6077 0.5176
TABLE VI: Effect of weight decay

VI-C Depth of the Predictor

In contrast to the feature extractor, which needs to be shallow for higher performance, our experiments found that the predictor network needs to be deeper for better performance. To show this effect, we conduct several experiments keeping all other parameters fixed and only changing the predictor depth. We present our results in Table VII. Predictor architectures are given in a comma-delimited format corresponding to the dimensions of the Dense layers, from left to right. For example, the sixth predictor architecture corresponds to the architecture illustrated in Figure 4.

MusicID MMI
Predictor $Ex_1$ $Ex_2$ $Ex_1$ $Ex_2$
1. 512 0.9212 0.8837 0.2424 0.1720
2. 2048, 512 0.9440 0.9119 0.4828 0.4041
3. 4096, 2048, 512 0.9337 0.9368 0.5008 0.4159
4. 8196, 4096, 2048, 512 0.9399 0.9318 0.4907 0.3993
5. 8196, 8196, 4096, 2048, 512 0.9337 0.9418 0.4860 0.3824
6. 8196, 8196, 8196, 4096, 2048, 512 0.9544 0.9402 0.5177 0.4263
TABLE VII: Effect of predictor network depth

As mentioned in Section III-A, the role of the predictor is to average the representation vector across all possible augmentations the network has seen. A deeper predictor can memorise more augmentations, consequently making the averaging more precise. During the training process, the model compares the output of the predictor, $p$, with the output of the encoder network, $z$, where $p$ is the mean vector over several augmentations of the same sample and $z$ is the representation vector of a single augmented version of that sample. The task of the network is to make $p$ and $z$ similar. Since $p$ is an averaged vector, in order to make $z$ similar, the encoder network is forced to learn a representation that is common to all the averaged versions. If the predictor is shallow, it can compare only a few versions of the sample. Therefore, deeper predictor networks can help to learn a more generalised representation.

VII Discussion & Concluding Remarks

Using two EEG-based behavioural biometric datasets and three authentication scenarios, we demonstrated that non-contrastive SSL allows developing label-efficient user classifiers. The SimSiam SSL approach we proposed achieved 4%–11% higher performance on average compared to the conventional supervised learning and data augmentation baselines. Our approach also resulted in performance comparable to the state-of-the-art multi-task SSL approach in all three scenarios. Next, we discuss the implications of our results, limitations, and possible future extensions.

SimSiam and multi-task SSL - Though SimSiam and multi-task learning showed comparable performance the majority of the time, on some occasions multi-task learning performed better. However, when it comes to training-time resource requirements, SimSiam has a distinct advantage in terms of memory footprint over multi-task learning. Multi-task learning requires adding new heads to the network architecture when more transformations are added to the training process. As a result, the network size increases approximately linearly with the number of transformations. On more complex datasets, multi-task learning will require multi-GPU, multi-server distributed training. In contrast, increasing the number of transformations has no impact on the memory footprint of the SimSiam model. The most likely effect for SimSiam is that the number of epochs it needs to be trained for may increase with the number of transformations.

Larger datasets and accounting for contextual changes - The datasets we explored are relatively homogeneous and stable. That is, the data was collected in similar conditions across sessions. However, contextual changes and biases are major factors in real-world behavioural biometrics systems. Such factors can include user demographics, users’ physical condition and activity levels, and the heterogeneity of the hardware used to collect data. In addition to user-invariant feature learning, context-invariant feature learning will also be required to account for such contextual factors. That means more augmentation techniques need to be investigated, especially those with the potential of being context-invariant, such as augmentation techniques from the frequency domain.

The true capability of non-contrastive SSL will be more visible when the available unlabelled data and the number of users targeted by the authentication application are high. However, the currently publicly available behavioural biometrics datasets for different modalities only contain users numbering in the few tens, which is a limitation for further extensions in the area. Future work can also explore the potential of non-contrastive SSL on other behavioural biometrics modalities such as gait, typing patterns, and breathing acoustics.

Improvements to SimSiam - Although SimSiam-based ideas are relatively new, several subsequent modifications have been proposed to further improve their performance. For example, even though the SimSiam architecture avoids network collapse by using the stop-gradient operation, it was recently discovered that another phenomenon, dimensional collapse, can also impact the learning capability of both contrastive and non-contrastive learning [53, 54]. It is important to analyse such modifications in the context of behavioural biometric data in particular, and sensor data streams in general, to identify possible further improvements.
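A simple diagnostic that follows from this line of work, sketched below under our own assumptions rather than taken from [53] or [54], is to inspect the singular value spectrum of the learned embeddings: a spectrum that decays to near zero indicates that the representations occupy only a low-dimensional subspace.

```python
import torch

def embedding_spectrum(z):
    """Return the singular value spectrum of a batch of embeddings.

    z: (batch, dim) tensor of representations from the trained encoder.
    A spectrum where most singular values are near zero suggests
    dimensional collapse of the representation space.
    """
    z = z - z.mean(dim=0, keepdim=True)          # centre the embeddings
    cov = (z.T @ z) / (z.shape[0] - 1)           # empirical covariance
    return torch.linalg.svdvals(cov)             # sorted singular values
```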

References

  • [1] A. C. Weaver, “Biometric authentication,” Computer, vol. 39, no. 2, pp. 96–97, 2006.
  • [2] M. O. Derawi, C. Nickel, P. Bours, and C. Busch, “Unobtrusive user-authentication on mobile phones using biometric gait recognition,” in 2010 Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2010, pp. 306–311.
  • [3] S. P. Banerjee and D. L. Woodard, “Biometric authentication and identification using keystroke dynamics: A survey,” Journal of Pattern Recognition Research, vol. 7, no. 1, pp. 116–139, 2012.
  • [4] K. S. Killourhy and R. A. Maxion, “Comparing anomaly-detection algorithms for keystroke dynamics,” in 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.   IEEE, 2009.
  • [5] J. Chauhan, Y. Hu, S. Seneviratne, A. Misra, A. Seneviratne, and Y. Lee, “Breathprint: Breathing acoustics-based user authentication,” in Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, 2017, pp. 278–291.
  • [6] J. Chauhan, J. Rajasegaran, S. Seneviratne, A. Misra, A. Seneviratne, and Y. Lee, “Performance characterization of deep learning models for breathing-based authentication on resource-constrained devices,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 4, pp. 1–24, 2018.
  • [7] J. Sooriyaarachchi, S. Seneviratne, K. Thilakarathna, and A. Y. Zomaya, “Musicid: A brainwave-based user authentication system for internet of things,” IEEE Internet of Things Journal, 2020.
  • [8] X. Zhang, L. Yao, S. S. Kanhere, Y. Liu, T. Gu, and K. Chen, “MindID: Person identification from brain waves through attention-based recurrent neural network,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2018.
  • [9] M. Abuhamad, A. Abusnaina, D. Nyang, and D. Mohaisen, “Sensor-based continuous authentication of smartphones’ users using behavioral biometrics: A contemporary survey,” IEEE Internet of Things Journal, vol. 8, no. 1, pp. 65–84, 2020.
  • [10] R. Miller, N. K. Banerjee, and S. Banerjee, “Using siamese neural networks to perform cross-system behavioral authentication in virtual reality,” in 2021 IEEE Virtual Reality and 3D User Interfaces (VR).   IEEE, 2021, pp. 140–149.
  • [11] M. Gadaleta and M. Rossi, “Idnet: Smartphone-based gait recognition with convolutional neural networks,” Pattern Recognition, vol. 74, pp. 25–37, 2018.
  • [12] J. Chauhan, S. Seneviratne, Y. Hu, A. Misra, A. Seneviratne, and Y. Lee, “Breathing-based authentication on resource-constrained iot devices using recurrent neural networks,” IEEE Computer, 2018.
  • [13] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 750–15 758.
  • [14] X. Xu, J. Yu, Y. Chen, Q. Hua, Y. Zhu, Y.-C. Chen, and M. Li, “TouchPass: Towards behavior-irrelevant on-touch user authentication on smartphones leveraging vibrations.” New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3372224.3380901
  • [15] Y. Meng, D. S. Wong, and L.-F. Kwok, “Design of touch dynamics based user authentication with an adaptive mechanism on mobile phones,” in Proceedings of the 29th annual ACM symposium on applied computing, 2014, pp. 1680–1687.
  • [16] T. Zhao, Y. Wang, J. Liu, Y. Chen, J. Cheng, and J. Yu, “Trueheart: Continuous authentication on wrist-worn wearables using ppg-based biometrics,” in IEEE INFOCOM 2020-IEEE Conference on Computer Communications.   IEEE, 2020, pp. 30–39.
  • [17] S. Vhaduri and C. Poellabauer, “Wearable device user authentication using physiological and behavioral metrics,” in 2017 IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC).   IEEE, 2017, pp. 1–6.
  • [18] K. Sundararajan and D. L. Woodard, “Deep learning for biometrics: A survey,” ACM Computing Surveys (CSUR), 2018.
  • [19] J. Chauhan, Y. D. Kwon, P. Hui, and C. Mascolo, “Contauth: continual learning framework for behavioral-based user authentication,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 4, no. 4, pp. 1–23, 2020.
  • [20] J. Solano, L. Tengana, A. Castelblanco, E. Rivera, C. Lopez, and M. Ochoa, “A few-shot practical behavioral biometrics model for login authentication in web applications,” in NDSS Workshop on Measurements, Attacks, and Defenses for the Web (MADWeb’20), 2020.
  • [21] Y. Zhang, Z. Zhao, Y. Deng, X. Zhang, and Y. Zhang, “Human identification driven by deep cnn and transfer learning based on multiview feature representations of ecg,” Biomedical Signal Processing and Control, vol. 68, p. 102689, 2021.
  • [22] T. Sheng and M. Huber, “Siamese networks for weakly supervised human activity recognition,” in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC).   IEEE, 2019, pp. 4069–4075.
  • [23] A. Saeed, T. Ozcelebi, and J. Lukkien, “Multi-task self-supervised learning for human activity detection,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 2, pp. 1–30, 2019.
  • [24] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 11, pp. 4037–4058, 2020.
  • [25] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘Siamese’ time delay neural network,” Advances in neural information processing systems, vol. 6, 1993.
  • [26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” Advances in neural information processing systems, 2013.
  • [27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [29] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” arXiv preprint arXiv:1911.02116, 2019.
  • [30] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with pixelcnn decoders,” Advances in neural information processing systems, vol. 29, 2016.
  • [31] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” arXiv preprint arXiv:1605.08803, 2016.
  • [32] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” Advances in neural information processing systems, vol. 31, 2018.
  • [33] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1422–1430.
  • [34] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European conference on computer vision.   Springer, 2016, pp. 649–666.
  • [35] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European conference on computer vision.   Springer, 2016, pp. 69–84.
  • [36] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.
  • [37] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • [38] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent-a new approach to self-supervised learning,” Advances in Neural Information Processing Systems, 2020.
  • [39] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” Advances in Neural Information Processing Systems, 2020.
  • [40] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020.
  • [41] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman, “With a little help from my friends: Nearest-neighbor contrastive learning of visual representations,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9588–9597.
  • [42] I. Misra and L. v. d. Maaten, “Self-supervised learning of pretext-invariant representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6707–6717.
  • [43] E. Eldele, M. Ragab, Z. Chen, M. Wu, C. K. Kwoh, X. Li, and C. Guan, “Time-series representation learning via temporal and contextual contrasting,” arXiv preprint arXiv:2106.14112, 2021.
  • [44] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020.
  • [45] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 132–149.
  • [46] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in International Conference on Machine Learning.   PMLR, 2021, pp. 12 310–12 320.
  • [47] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 18 661–18 673, 2020.
  • [48] M. Tagliasacchi, B. Gfeller, F. d. C. Quitry, and D. Roblek, “Self-supervised audio representation learning for mobile devices,” arXiv preprint arXiv:1905.11796, 2019.
  • [49] C. I. Tang, I. Perez-Pozuelo, D. Spathis, and C. Mascolo, “Exploring contrastive learning in human activity recognition for healthcare,” arXiv preprint arXiv:2011.11542, 2020.
  • [50] H. Qian, T. Tian, and C. Miao, “What makes good contrastive learning on small-scale wearable-based tasks?” arXiv preprint arXiv:2202.05998, 2022.
  • [51] C. Wright and D. Stewart, “One-shot-learning for visual lip-based biometric authentication,” in International Symposium on Visual Computing.   Springer, 2019, pp. 405–417.
  • [52] G. Schalk, D. J. McFarland, T. Hinterberger, N. Birbaumer, and J. R. Wolpaw, “BCI2000: A general-purpose Brain-Computer Interface (BCI) system,” IEEE Transactions on Biomedical Engineering, vol. 51, no. 6, pp. 1034–1043, 2004.
  • [53] T. Hua, W. Wang, Z. Xue, S. Ren, Y. Wang, and H. Zhao, “On feature decorrelation in self-supervised learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9598–9608.
  • [54] L. Jing, P. Vincent, Y. LeCun, and Y. Tian, “Understanding dimensional collapse in contrastive self-supervised learning,” arXiv preprint arXiv:2110.09348, 2021.