
MTS-LOF: Medical Time-Series Representation Learning via Occlusion-Invariant Features

Huayu Li, Ana S. Carreon-Rascon, Xiwen Chen, Geng Yuan, and Ao Li

This work was supported by grants from the National Heart, Lung, and Blood Institute (#R21HL159661) and the National Science Foundation (#2052528). Huayu Li and Ana S. Carreon-Rascon are with the Department of Electrical & Computer Engineering at the University of Arizona, Tucson, AZ 85719 USA (e-mail: [email protected]; [email protected]). Xiwen Chen is with the School of Computing at Clemson University, Clemson, SC 29634 USA (e-mail: [email protected]). Geng Yuan is with the School of Computing at the University of Georgia, Athens, GA 30602 USA (e-mail: [email protected]). Ao Li is with the Department of Electrical & Computer Engineering and BIO5 Institute at the University of Arizona, Tucson, AZ 85719 USA (e-mail: [email protected]).
Abstract

Medical time series data are indispensable in healthcare, providing critical insights for disease diagnosis, treatment planning, and patient management. The exponential growth in data complexity, driven by advanced sensor technologies, has presented challenges related to data labeling. Self-supervised learning (SSL) has emerged as a transformative approach to address these challenges, eliminating the need for extensive human annotation. In this study, we introduce a novel framework for Medical Time Series Representation Learning, known as MTS-LOF. MTS-LOF leverages the strengths of contrastive learning and Masked Autoencoder (MAE) methods, offering a unique approach to representation learning for medical time series data. By combining these techniques, MTS-LOF enhances the potential of healthcare applications by providing more sophisticated, context-rich representations. MTS-LOF employs a multi-masking strategy to facilitate occlusion-invariant feature learning: the model creates multiple views of the data by masking portions of it, and by minimizing the discrepancy between the representations of these masked views and the fully visible patches, MTS-LOF learns to capture rich contextual information within medical time series datasets. The results of experiments conducted on diverse medical time series datasets demonstrate the superiority of MTS-LOF over other methods. These findings hold promise for significantly enhancing healthcare applications through improved representation learning. Furthermore, our work delves into the integration of joint-embedding SSL and MAE techniques, shedding light on the intricate interplay between temporal and structural dependencies in healthcare data, an understanding that is essential for analyzing such data effectively.

Index Terms:
Medical time series, self-supervised learning, health monitoring, masked autoencoder, representation learning, time series classification, transformer

I Introduction

Medical time series data, comprising physiological signals, human activities, and other time-stamped data, are crucial across various clinical settings and home care scenarios, guiding clinical decision-making and patient management. The analysis and interpretation of these complex temporal data streams are pivotal for enhancing our comprehension of health and facilitating timely interventions. Recently, the exponential growth in both the volume and complexity of medical time series data has been propelled by advanced sensor technologies and enhancements in electronic health records. Consequently, the automatic classification of medical time series data represents a fundamental challenge in the development of intelligent clinical decision support systems. Typically, classification models are trained using supervised learning on labeled data pairs annotated by human experts. Nevertheless, the continuous generation of vast volumes of time series data from various sensors renders human annotation impractical and inconsistent with the principle of cost-effectiveness [1].

In recent years, self-supervised learning (SSL) techniques have emerged as a transformative approach across various domains, including computer vision [2, 3, 4, 5, 6, 7], natural language processing [8, 9, 10, 11], and time series [12, 13, 14, 15, 16]. Supervised learning methods, which rely on extensive labeled data and domain expertise for training, face challenges related to the availability of large annotated datasets. In numerous scenarios, obtaining labeled data is impractical due to factors such as time constraints, cost considerations, or privacy concerns. In response to these challenges, SSL techniques have gained prominence as a game-changing innovation. They provide a versatile alternative that empowers models to autonomously derive meaningful representations directly from raw data, eliminating extensive reliance on labeled samples. As a result, SSL has opened new avenues for knowledge discovery, enabling more efficient, cost-effective, and privacy-conscious solutions to complex data analysis tasks.

Invariance-based methods (also called joint-embedding SSL) comprise two branches, contrastive learning [17, 4, 18, 19, 5, 6] and non-contrastive learning [20, 21, 22, 23], which share the same core idea: learning a representation that is invariant to augmentations of the same instance. That is, the learned representations of two different augmentations of the same image should be close, while the representation space must not collapse to a trivial solution (e.g., all representations mapping to a constant). To avoid such collapse, contrastive learning methods create a dichotomy between positive and negative samples in a latent space, encouraging the model to pull similar samples together while pushing dissimilar ones apart. Non-contrastive learning methods, in contrast, learn to produce similar embeddings for different views of the same image and rely on regularization or stop-gradients to enforce a non-collapsing solution. Recent time series representation learning algorithms [16, 13, 15] build on contrastive learning and achieve substantial improvements over earlier baselines [24, 14]. Compared to SSL in computer vision, however, time series data inherently possess temporal dependencies, a defining characteristic that traditional image-based contrastive learning methods do not adequately address. Moreover, many augmentation techniques commonly employed in image-based contrastive learning, such as color distortion, do not translate seamlessly to time series data, posing significant challenges to adapting these methods effectively.

Benefiting from the advancement of the transformer architecture [25] in computer vision [26], the Masked Autoencoder (MAE) [27] and its broader concept, Masked Image Modeling (MIM) [26, 28], are at the forefront of innovative approaches in computer vision, delivering robust methods for visual representation learning. These methodologies share a common foundational principle: the strategic concealment, or "masking," of specific regions within an image, feature space, or patchified image. Through this selective occlusion, models are guided to scrutinize intricate patterns, complete missing information, and ultimately provide valuable insights into the underlying structure of visual content. Much like the strides made in computer vision tasks, the work by Zerveas et al. [29] presents an exciting development in the field of time series data analysis: a framework for multivariate time series representation learning based on the principles of MAE, marking the first direct application of MAE techniques to the domain of time series analysis.

While time series contrastive learning and time series MAE methods have undeniably ushered in significant advancements in the domain of time series representation learning, a vital question emerges: Can we amalgamate the valuable insights gleaned from MAE with contrastive learning algorithms to craft invariant representations that encompass an even more profound wealth of contextual information? This question lies at the heart of a relentless pursuit to amplify the capabilities of time series representation learning. Contrastive learning and MAE techniques have independently demonstrated their prowess by tackling various facets of feature extraction and data representation. Contrastive learning excels in the discernment of discriminative patterns within time series, while MAE exhibits a remarkable aptitude for capturing the latent dependencies and nuanced context embedded in temporal data.

The significance of this potential integration cannot be overstated. It holds the promise of furnishing time series representation learning with a new level of sophistication and depth, enabling it to yield more comprehensive and context-rich representations. More importantly, such an integration is especially beneficial in the medical time series domain, where temporal and structural dependencies are of nearly equal importance. In clinical decision-making, physicians base some diagnoses on local patterns of the time series data, such as interictal spikes in EEGs, while other diagnoses depend more on the long-term temporal behavior of the data, such as sleep architecture.

In this paper, we envision a novel framework that yields representations by integrating the strengths of these two approaches. We propose a Medical Time-Series representation Learning framework via Occlusion-invariant Features (MTS-LOF). The proposed MTS-LOF framework employs a simple yet efficient multi-masking strategy, without dataset-specific data augmentation, to create multiple views of the patchified data. MTS-LOF then learns occlusion-invariant features by minimizing the discrepancy between each masked view and the complete data. Experiments are conducted on different medical time series datasets, and the results demonstrate that MTS-LOF outperforms a range of baselines. In summary, the main contributions of this work are as follows:

  • MTS-LOF Framework: We present the MTS-LOF framework, a novel paradigm for medical time series representation learning. Unlike other frameworks that rely on specific data augmentation, this framework offers a simple yet highly effective approach to capture rich contextual information within medical time series data.

  • Integration of Joint-Embedding SSL and MAE: MTS-LOF is one of the first frameworks that integrates the advantages of both contrastive learning and MAE techniques in medical time series representation learning, capturing the intricate interplay between temporal and structural dependencies within the data. This fusion underscores the potential for enhanced representation learning in healthcare applications.

  • Performance Evaluation on Diverse Medical Datasets: We conducted extensive experiments on several medical datasets, including Epilepsy Seizure Prediction [30], Human Activity Recognition [31], Sleep-EDF [32], and the Sleep Heart Health Study [33]. The experimental results demonstrate that our framework significantly outperforms state-of-the-art methods. This framework has great potential for healthcare applications in clinical settings and home care.

II Related works

II-A Joint-Embedding Self-Supervised Learning

Joint-Embedding SSL is a prominent paradigm in representation learning, dedicated to discerning similarities and dissimilarities among pairs of data points. It primarily involves maximizing the similarity between augmented views of identical inputs (referred to as "positive" samples) and minimizing the similarity between augmented views of distinct inputs (referred to as "negative" samples). A primary challenge in Joint-Embedding SSL is the risk of representation collapse, in which the model consistently generates identical outputs, making the learned representations uninformative. Contrastive learning methods incorporate both positive and negative samples; the overarching objective is to bring positive samples closer to the anchor sample within an embedding space while simultaneously pushing negative samples farther apart. Contrastive techniques, including Contrastive Predictive Coding (CPC) [4], the Simple Contrastive Learning Framework (SimCLR) [5], and Momentum Contrast (MoCo) [6], have demonstrated substantial progress. Unlike contrastive learning methods, non-contrastive approaches in Joint-Embedding SSL depart from the conventional use of negative samples and instead employ diverse strategies to mitigate the risk of representation collapse. For example, Bootstrap Your Own Latent (BYOL) [20] and Simple Siamese representation learning (SimSiam) [23] employ gradient stopping and predictor modules to avert collapse, while Variance-Invariance-Covariance Regularization (VICReg) [21] introduces covariance regularization to penalize learned representations, mitigating the risk of collapse and enhancing the stability of embeddings.

II-B Masked Auto Encoder and Masked Image Modeling

Within the domain of SSL, masked language modeling (MLM) has emerged as a prominent method in Natural Language Processing (NLP) [8, 11, 9]. MLM and its auto-regressive variations [9] have transformed NLP, enabling the training of extensive language models that excel in a wide range of language comprehension and generation tasks. These methodologies entail predicting masked tokens within sentences or sentence pairs/triplets, harnessing extensive training data to attain remarkable performance. Concurrently, Masked Image Modeling (MIM) has advanced in parallel with MLM. These techniques encompass the deliberate masking of particular regions within images or feature spaces, directing models to discern intricate patterns, impute missing information, and ultimately extract valuable insights regarding the underlying structure of visual content. A seminal contribution in this domain is the context encoder approach [34], which masks regions within original images and predicts the missing pixels, laying the foundation for subsequent advancements. Recent studies [35, 26, 28] have delved into the utilization of MIM for pretraining vision Transformers [26, 36]. Additionally, the Masked Autoencoder (MAE) [27] introduced a straightforward and scalable self-supervised learning approach for computer vision that masks random patches of the input image and reconstructs the missing pixels. Recent work by Kong et al. [37] reveals that the insight behind MIM is to encourage the network to learn occlusion-invariant features.

II-C Self-supervised Learning for Time-Series

SSL techniques have significantly impacted the field of time-series data analysis. Researchers have earnestly explored the adaptation of SSL approaches to capture the intricate temporal dependencies, patterns, and representations inherent in sequential data. Before the emergence of time-series contrastive learning, the predominant approaches utilized multitask learning [24, 14] to obtain time-series representations. These methods typically applied various transformations to the original time series to construct pretext tasks and pseudo-labels, and models were then trained to recognize which transformations had been applied. Motivated by the remarkable successes observed in other domains, contrastive learning has since made inroads into time-series representation learning. For instance, SimCLR [5] has been extended to support EEG representation learning [16]. Contrastive Predictive Coding (CPC) [4] employs predictive modeling in latent spaces to acquire time-series representations, showing competence in speech recognition tasks. The TS-TCC framework [13] introduced temporal and contextual contrasting modules founded on transformer blocks [25], enhancing the acquisition of discriminative representations. Additionally, TS2Vec [15] presented a hierarchical contrasting framework capable of capturing multi-scale temporal and instance dependencies within time series data. Beyond temporal contrastive learning, the direct application of MAE to time series is put forth in [29].

Figure 1: Illustration of the backbone network architecture. This figure provides an overview of the backbone network employed in our study, designed to effectively process multidimensional multivariate time series samples. The input time series undergoes a patching process using a CNN1D, followed by transformation through the transformer encoder to generate meaningful representations. The final representations are obtained from the outputs of the transformer encoder after a global average pooling layer. These representations are then input into a linear classifier to make the final predictions.

III Proposed Methods

III-A Backbone Model Structure

We begin by detailing the backbone network utilized in our study. When dealing with a collection of multidimensional multivariate time series samples, each training sample, denoted as $X\in\mathbb{R}^{t\times m}$, constitutes a sequence of $m$ univariate series $x\in\mathbb{R}^{t}$, where $t$ signifies the number of time steps. As illustrated in Figure 1, our approach diverges from prior work [29], which employed a linear projection of each time step of the original time series into a $d$-dimensional vector space, treating time steps as tokens via $\hat{X}=WX+b$, with $W\in\mathbb{R}^{m\times d}$ and $b\in\mathbb{R}^{d}$. Instead, we adopt a strategy where we initially generate patches as tokens by processing the time series through several convolutional layers. To streamline the discussion, we implement a modified version of the CNN1D architecture from TS-TCC [13]. The configuration of the initial convolutional layer varies depending on the dataset, with distinct kernel sizes ($k$) and strides ($s$). Subsequently, we stack three convolutional layers with fixed kernel size $8$ and stride $2$; together with the initial layer, these layers efficiently downsample the original time series by a factor of $8\times s$. Finally, a single convolutional layer with kernel size $1$ and stride $1$ reduces the feature channel dimension to match the $d$-dimensional input required by the transformer encoder. Each convolutional layer except the last is followed by batch normalization (BN) [38] and a Gaussian Error Linear Unit (GELU) [39]. Several factors motivate this choice. Firstly, akin to the concept of patching in vision transformers (ViT) [26], we consider a single time step as equivalent to a single pixel, lacking the semantic meaning that a word carries in a sentence; it is therefore imperative to extract local semantic information before analyzing the connections between tokens. Secondly, as exemplified in [40], early convolutional layers can enhance the stability and convergence of the transformer during training. Lastly, this design facilitates further modification and extension of the backbone network, enabling the replacement of the simple CNN1D with more complex architectures [41].
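To make the patching stem concrete, the following is a minimal PyTorch sketch of the convolutional tokenizer described above. The class name, channel width, and padding choices are illustrative assumptions rather than the exact implementation:

```python
import torch
import torch.nn as nn

class ConvPatcher(nn.Module):
    """Sketch of the CNN1D patching stem: one dataset-specific conv layer
    (kernel k, stride s), three fixed conv layers (kernel 8, stride 2),
    and a 1x1 conv projecting to the transformer dimension d. The channel
    width and padding are illustrative assumptions."""

    def __init__(self, in_channels: int, d_model: int, k: int, s: int,
                 width: int = 64):
        super().__init__()

        def block(c_in, c_out, kernel, stride):
            # conv -> BN -> GELU, as described for all but the last layer
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel, stride, padding=kernel // 2),
                nn.BatchNorm1d(c_out),
                nn.GELU(),
            )

        self.stem = nn.Sequential(
            block(in_channels, width, k, s),   # dataset-specific kernel/stride
            block(width, width, 8, 2),         # three fixed layers: together
            block(width, width, 8, 2),         # with the first, they
            block(width, width, 8, 2),         # downsample by roughly 8*s
            nn.Conv1d(width, d_model, 1, 1),   # 1x1 projection, no BN/GELU
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, m, t) -> patch tokens (batch, p, d)
        return self.stem(x).transpose(1, 2)
```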

After generating the patches through the convolutional layers, the tokens $\hat{X}\in\mathbb{R}^{p\times d}$, where $p=\frac{t}{8s}$ represents the number of patches, are encoded by adding a fixed 1D cosine positional embedding, denoted as $\text{P}$ and defined as follows:

$$\text{P}_{i,2j}=\sin\left(\frac{i}{10000^{2j/d}}\right),\quad\text{P}_{i,2j+1}=\cos\left(\frac{i}{10000^{2j/d}}\right).\qquad(1)$$

Here, $\text{P}_{i,j}$ signifies the value of the embedding for the $i$-th position and the $j$-th dimension. Typically, $d$ matches the dimension of the Transformer encoder's input embeddings. This positional encoding method enhances the model's capability to capture sequential order and relationships within the time series data, which are crucial for various time-series analysis tasks. It ensures that the model can differentiate between different positions in the sequence, and when combined with the patch-based tokenization, it allows the transformer encoder to effectively process the data for downstream tasks. Subsequently, we feed these encoded tokens into the Transformer encoder. The Transformer encoder employs multi-head attention, where each attention head, denoted as $h=1,\dots,H$, independently processes the input tokens $\hat{X}$. Each head transforms the input into query matrices $Q_{h}$, key matrices $K_{h}$, and value matrices $V_{h}$ through weight matrices $W_{Q}$, $W_{K}$, and $W_{V}$, respectively. Once the query, key, and value matrices are obtained, a scaled dot-product attention mechanism computes the attention output $O_{h}$. The softmax function is utilized to calculate attention weights, which determine the level of attention each patch pays to the others. The calculation of $O_{h}$ is defined as:

$$O_{h}=\text{Attention}(Q_{h},K_{h},V_{h})=\text{softmax}\left(\frac{Q_{h}K_{h}^{\top}}{\sqrt{d_{k}}}\right)V_{h},\qquad(2)$$

where $d_{k}$ denotes the dimension of the key vectors, and the $\sqrt{d_{k}}$ term scales the dot product to prevent it from becoming excessively large. The multi-head attention block incorporates Layer Normalization layers [42] to normalize the data and stabilize training. Additionally, a feed-forward network with residual connections is employed to capture intricate patterns within the data. Consequently, the outcome of the Transformer encoder's operations is a representation denoted as $Z\in\mathbb{R}^{p\times d}$. Finally, a global average pooling layer is applied to obtain the final representation $\hat{Z}\in\mathbb{R}^{d}$, followed by a linear head for predicting results $Y\in\mathbb{R}^{c}$, where $c$ is the number of classes.
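For concreteness, below is a hedged sketch of the fixed sinusoidal positional embedding of Eq. (1) and the scaled dot-product attention of Eq. (2); the function names are our own, and the snippet assumes an even embedding dimension:

```python
import math
import torch

def sinusoidal_positions(p: int, d: int) -> torch.Tensor:
    """Fixed 1D sine/cosine positional embedding from Eq. (1).
    Assumes d is even."""
    pos = torch.arange(p, dtype=torch.float32).unsqueeze(1)        # (p, 1)
    # 1 / 10000^(2j/d) for j = 0, 1, ..., d/2 - 1
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d))                    # (d/2,)
    pe = torch.zeros(p, d)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

def scaled_dot_product_attention(q, k, v):
    """Single-head attention from Eq. (2); q, k, v: (batch, p, d_k)."""
    d_k = q.size(-1)
    weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return weights @ v
```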

III-B Occlusion-Invariant Feature Learning

We now delve into the concept of occlusion-invariant feature learning and the motivation behind the proposed MTS-LOF framework. To establish a comprehensive understanding, it is essential to revisit the MAE objective, a pivotal component of our methodology. The primary purpose of this objective is to predict the original values of the masked patches/tokens $\hat{X}\odot M$. Here, $M\in\mathbb{R}^{p}$ is a mask vector in which each $m_{i}$ takes the value $1$ for masked tokens and $0$ for unmasked tokens, effectively delineating the regions of the time series data that are concealed and revealed. This MAE objective, working in tandem with a transformer decoder denoted as $d_{\phi}(\cdot)$ and the latent representation $Z^{m}$ obtained from the unmasked patches $\hat{X}\odot(1-M)$, is formally articulated as follows:

$$\mathcal{L}_{MAE}=\left\|d_{\phi}(Z^{m})\odot M-\hat{X}\odot M\right\|^{2}\approx\left\|d_{\phi}(Z^{m})-\hat{X}\right\|^{2},\quad Z^{m}=f_{\theta}(\hat{X}\odot(1-M)).\qquad(3)$$

In this equation, $\mathcal{L}_{MAE}$ quantifies the reconstruction error between the decoder's predictions and the actual values of the masked tokens. Fundamentally, the MAE objective encapsulates the idea of reconstructing the entirety of the input data, even when faced with incomplete or occluded information. It is worth noting that, in an ideal scenario with an over-parameterized neural network, the network can memorize all seen training samples. This implies that, with the latent representation $Z$ of the complete input tokens $\hat{X}$, we have:

$$d_{\phi}(Z)=\hat{X},\quad Z=f_{\theta}(\hat{X}).\qquad(4)$$

Consequently, we can reformulate the MAE objective with a distance function $\mathcal{D}$:

$$\mathcal{L}_{MAE}\approx\mathcal{D}(Z,Z^{m})=\left\|Z^{m}-Z\right\|^{2},\qquad(5)$$

which signifies that the primary objective of MAE training is to learn occlusion-invariant features.

Moreover, it is essential to recall the fundamental concept of Joint-Embedding SSL, which is centered on the objective of minimizing the distance between two latent representations derived from augmented views of the same input. This objective can be generally formulated as follows:

$$\mathcal{L}_{Inv}=\mathcal{D}(Z_{1},Z_{2}),\qquad(6)$$
$$Z_{1}=f_{\theta}(\mathcal{A}_{1}(\hat{X})),\quad Z_{2}=f_{\theta}(\mathcal{A}_{2}(\hat{X})),\qquad(7)$$

where the two augmentations are denoted as $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$. Importantly, the distance function can be either parametric or non-parametric: it can be a simple function such as the mean squared error or cosine similarity, or it can incorporate projection heads or predictor networks, as is common in Joint-Embedding SSL settings. We can establish a connection between Joint-Embedding SSL and the MAE framework by assuming that $\mathcal{A}_{1}$ preserves the original input while $\mathcal{A}_{2}$ applies random masking. Consequently, we can define the primary similarity objective of the proposed MTS-LOF framework:

$$\mathcal{L}_{sim}=\mathcal{D}(Z,Z^{m}),\qquad(8)$$
$$Z^{m}=f_{\theta}(\hat{X}\odot(1-M)),\quad Z=f_{\theta}(\hat{X}).\qquad(9)$$

This objective captures the essence of the MTS-LOF framework, aiming to minimize the distance between the original representation $Z$ and the representation $Z^{m}$ obtained after masking a portion of the input data. This formulation bears a resemblance to non-contrastive learning algorithms, which are prone to representation collapse. To mitigate this risk, a covariance regularization technique is employed: the latent representation $Z^{m}$ of the masked patches is penalized using the Total Coding Rate (TCR), which is formulated as:

$$\mathcal{L}_{TCR}(Z)=\frac{1}{2}\log\det\left(I+\frac{d}{b\epsilon^{2}}ZZ^{\top}\right),\qquad(10)$$

where $b$ represents the batch size, $d$ signifies the dimensionality of the representation, and $\epsilon>0$ is a chosen size of distortion.
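As a reference point, a minimal sketch of the TCR term in Eq. (10) is given below. The representations are assumed to be pooled and stacked into a $(b, d)$ matrix, and the value of $\epsilon$ is an illustrative assumption; in training, TCR would be maximized (i.e., its negative added to the loss) to discourage collapse:

```python
import torch

def total_coding_rate(z: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Total Coding Rate of Eq. (10). z: (b, d) batch of pooled
    representations, assumed L2-normalized; eps is the distortion
    parameter (its value here is an illustrative assumption)."""
    b, d = z.shape
    cov = z.T @ z                          # (d, d), equivalent to Z Z^T
    identity = torch.eye(d, device=z.device)
    return 0.5 * torch.logdet(identity + (d / (b * eps ** 2)) * cov)
```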

In practical implementation, the algorithm should capture features that are consistently occlusion-invariant. To achieve this, we apply multiple mask operations to the patches $\hat{X}$, defined as follows:

$$\hat{X}_{1},\dots,\hat{X}_{N}=\hat{X}\odot(1-M_{1}),\dots,\hat{X}\odot(1-M_{N}),\quad\text{where}\ M_{i}\neq M_{j}\ \text{for}\ i\neq j,\qquad(11)$$

where $N$ refers to the number of masks. Subsequently, the similarity objective is employed to compare the representation of the unmasked patches with each representation of the masked patches:

$$\mathcal{L}_{sim}=-\frac{1}{N}\sum_{i=1}^{N}\frac{\hat{Z}}{\left\lVert\hat{Z}\right\rVert_{2}}\cdot\frac{\hat{Z}_{i}^{m}}{\left\lVert\hat{Z}_{i}^{m}\right\rVert_{2}},\qquad(12)$$

where the negative cosine similarity measures the distance between two representation vectors. Note that we calculate $\mathcal{L}_{sim}$, as well as the covariance regularization $\mathcal{L}_{TCR}$, using the globally pooled representations $\hat{Z}$ and $\hat{Z}_{i}^{m}$. A transformer decoder is employed to decode the unmasked patches encoded in the previous step, together with all the masked tokens; position embeddings are added so that each token can determine its respective position, and the masked token is a single shared, learnable vector. The entire workflow of the MTS-LOF framework is illustrated in Figure 2, where a hyperparameter $\lambda$ balances $\mathcal{L}_{sim}$ and $\mathcal{L}_{TCR}$.
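Putting the pieces together, the following is a hedged sketch of one evaluation of an MTS-LOF-style objective combining the multi-masking of Eq. (11), the similarity term of Eq. (12), and the TCR penalty. The encoder interface, the zeroing of masked tokens (a real MAE-style encoder would instead drop them), and the sign convention of the TCR term are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def mtslof_style_step(encoder, tokens, num_masks=20, mask_ratio=0.8,
                      lam=100.0):
    """Hedged sketch of one MTS-LOF-style objective evaluation.
    encoder: maps (b, p, d) tokens to pooled (b, d) representations
             (an assumed interface, for illustration only).
    tokens:  patch embeddings, shape (b, p, d)."""
    b, p, _ = tokens.shape
    z = F.normalize(encoder(tokens), dim=-1)          # full view, Eq. (12)
    sim, tcr = 0.0, 0.0
    for _ in range(num_masks):                        # N masks, Eq. (11)
        keep = (torch.rand(b, p, device=tokens.device) > mask_ratio).float()
        # masked tokens are zeroed here for simplicity; an MAE-style
        # encoder would drop them instead
        z_m = F.normalize(encoder(tokens * keep.unsqueeze(-1)), dim=-1)
        sim = sim - (z * z_m).sum(dim=-1).mean()      # negative cosine
        tcr = tcr + total_coding_rate(z_m)            # from the TCR sketch
    # lambda balances the two terms; subtracting TCR (i.e., maximizing it)
    # to prevent collapse is an assumption about the sign convention
    return sim / num_masks - lam * tcr / num_masks
```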

Figure 2: Illustration of the MTS-LOF framework workflow. The framework leverages occlusion-invariant feature learning (MAE) and Joint-Embedding SSL principles to enhance representation robustness. It employs multiple mask operations to enhance consistency in occlusion-invariant features, ensuring the model's effectiveness in the presence of occluded data. The similarity objective ($\mathcal{L}_{sim}$) measures the agreement between masked and unmasked representations, while covariance regularization ($\mathcal{L}_{TCR}$) is employed to mitigate representation collapse. A transformer decoder and positional embeddings contribute to comprehensive feature extraction. The hyperparameter $\lambda$ balances $\mathcal{L}_{sim}$ and $\mathcal{L}_{TCR}$.
TABLE I: Details of Datasets Used in the Study. The table provides information on the dataset domains, sizes of training, validation, and test sets, sequence length, number of channels, and classes.
Dataset     Domain   # Train   # Val    # Test   Length   # Channel   # Class
HAR         -        5881      1471     2947     128      9           6
Sleep-EDF   -        25612     7786     8910     3000     1           5
Epilepsy    -        7360      1840     2300     178      1           2
FD          a        8184      2728     2728     5120     1           3
FD          b        8184      2728     2728     5120     1           3
FD          c        8184      2728     2728     5120     1           3
FD          d        8184      2728     2728     5120     1           3
SHHS        C4-A1    96753     32251    32252    3750     1           5
SHHS        C3-A2    98158     32720    32720    3750     1           5
TABLE II: Comparison of Linear Probing Performance against the baselines.
                 HAR                       Sleep-EDF                 Epilepsy
Method           Accuracy     MF1          Accuracy     MF1          Accuracy     MF1
SSL-ECG          65.34±1.63   63.75±1.37   74.58±0.60   65.44±0.97   93.72±0.45   89.15±0.93
CPC              83.85±1.51   83.27±1.66   82.82±1.68   73.94±1.75   96.61±0.43   94.44±0.69
SimCLR           80.97±2.46   80.19±2.64   78.91±3.11   68.60±2.71   96.05±0.34   93.53±0.63
TS-TCC           90.37±0.34   90.38±0.39   83.00±0.71   73.57±0.74   97.23±0.10   95.54±0.08
TS2VEC           89.77±1.17   89.77±1.38   83.33±1.54   73.23±2.17   97.09±0.13   96.26±0.24
TST              87.77±0.27   87.66±0.31   83.43±0.24   73.11±0.27   97.17±0.19   95.64±0.17
MTS-LOF (Ours)   93.05±0.30   93.09±0.31   84.35±0.31   73.52±0.58   98.33±0.05   97.41±0.09
Figure 3: Fine-tuning the pretrained backbone with different fractions of labeled data from the Sleep-EDF dataset. The plot illustrates the performance of the MTS-LOF framework under semi-supervised learning conditions, comparing F1 scores obtained with 1%, 5%, 10%, 50%, and 100% of randomly selected subsets of labeled data to the fully supervised learning result with 100% labeled data. These results highlight the framework’s adaptability and ability to leverage minimal labeled data effectively.
TABLE III: Transferability of Learned Representations in SHHS Dataset (Accuracy)
SSL                                    Supervised
Source/Target   C4-A1   C3-A2          Source/Target   C4-A1   C3-A2
C4-A1           85.50   82.76          C4-A1           83.80   82.08
C3-A2           81.81   85.88          C3-A2           81.90   84.91

             In-Domain   Cross-Domain   Overall
SSL          85.69       82.29          83.99
Supervised   84.36       81.99          83.17
TABLE IV: Transferability of Learned Representations in SHHS Dataset (F1 score)
SSL                                    Supervised
Source/Target   C4-A1   C3-A2          Source/Target   C4-A1   C3-A2
C4-A1           72.31   69.80          C4-A1           70.11   68.77
C3-A2           68.82   74.21          C3-A2           68.98   72.44

             In-Domain   Cross-Domain   Overall
SSL          73.33       69.31          71.32
Supervised   71.28       68.88          70.07
TABLE V: Transferability of Learned Representations in FD Dataset (Accuracy)
SSL                                                 Supervised
Source/Target   a        b        c        d        Source/Target   a        b        c        d
a               100.00   51.21    54.94    52.62    a               100.00   45.10    47.96    47.90
b               43.03    100.00   79.17    99.99    b               43.08    100.00   77.06    99.57
c               44.87    92.98    100.00   94.07    c               43.12    97.77    100.00   96.61
d               51.34    100.00   82.61    100.00   d               48.76    99.40    82.63    100.00

             In-Domain   Cross-Domain   Overall
SSL          100.00      70.57          77.93
Supervised   100.00      69.08          76.81
TABLE VI: Transferability of Learned Representations in FD Dataset (F1 Score)
SSL                                                 Supervised
Source/Target   a        b        c        d        Source/Target   a        b        c        d
a               100.00   35.44    38.83    36.35    a               100.00   31.48    33.44    33.29
b               45.75    100.00   83.13    99.99    b               41.10    100.00   79.89    99.68
c               48.69    93.81    100.00   95.50    c               40.12    98.34    100.00   97.51
d               52.14    100.00   86.40    100.00   d               45.90    99.54    85.10    100.00

             In-Domain   Cross-Domain   Overall
SSL          100.00      68.00          76.00
Supervised   100.00      65.45          74.09
(a) F1 score vs. number of masks.
(b) F1 score vs. mask ratio.
Figure 4: Comparison of F1 scores under different combinations of hyperparameters in the ablation study using the HAR dataset. (a) Illustrates the relationship between F1 score and the number of masks while keeping the mask ratio constant at 0.8. (b) Shows the impact of varying mask ratios on the F1 score while maintaining a constant number of 20 masks. These findings provide insights into the sensitivity of the F1 score to these critical hyperparameters.
(a) SSL on Epilepsy. (b) Supervised training on Epilepsy. (c) Finetuning on Epilepsy.
(d) SSL on HAR. (e) Supervised training on HAR. (f) Finetuning on HAR.
(g) SSL on Sleep-EDF. (h) Supervised training on Sleep-EDF. (i) Finetuning on Sleep-EDF.
Figure 5: t-SNE visualizations of learned representations for Epilepsy, HAR, and Sleep-EDF datasets. The subfigures display the effects of different training methodologies: SSL, supervised training, and 5% fine-tuning on each dataset.

IV Experiments

IV-A Materials

Our experiments encompass various medical time series datasets, including those related to human activity recognition, sleep stage classification, and epileptic seizure prediction. Additionally, to assess the generalizability and transferability of the proposed MTS-LOF framework beyond medical time series, we utilize a fault diagnosis dataset. The datasets are summarized in Table I.

Epilepsy Seizure Prediction (Epilepsy) [30]: This dataset comprises EEG recordings from 500 subjects, each recorded for 23.6 seconds. The original data features five classes, with only subjects in class 1 having epileptic seizures, while subjects in classes 2, 3, 4, and 5 do not. We merge the four negative classes into a single class, transforming the dataset into a binary classification problem.

Human Activity Recognition (HAR) [31]: This dataset comprises sensor readings from 30 subjects engaged in six activities, including walking, walking upstairs, walking downstairs, standing, sitting, and lying down. It consists of inertial sensor data collected using a smartphone (Samsung Galaxy S II) positioned at the subjects’ waist. This device’s embedded accelerometer and gyroscope provided 3-axial linear acceleration and 3-axial angular velocity data at a constant rate of 50Hz.

Sleep Stage Classification: We incorporate two sleep stage classification datasets, Sleep-EDF [32] and the Sleep Heart Health Study (SHHS) [33]. The objective of sleep stage classification is to categorize 30-second EEG signals into five distinct stages: Wake (W), Non-rapid eye movement (N1, N2, N3), and Rapid Eye Movement (REM). In the case of Sleep-EDF, we focus on the Fpz-Cz channel, which captures EEG signals sampled at a rate of 100 Hz. As for the SHHS dataset, we partition the records into two segments and analyze the C4-A1 and C3-A2 channels, each with a sampling rate of 125 Hz, separately. This separation enables us to assess the transferability of the MTS-LOF framework across different contexts.

Fault Diagnosis (FD) [43]: This dataset involves motor current signals from electric motors operating under four different conditions, each considered a separate domain due to its distinct characteristics. We employ this dataset to assess our algorithm’s transferability and generalizability, demonstrating its applicability beyond the realm of medical time series. Furthermore, fault diagnosis is of significance in the domain of medical device security [44].

IV-B Experimental Setup

We adopt the data splitting settings from TS-TCC [13] for Epilepsy, Sleep-EDF, FD, and HAR: the data is divided into training (60%), validation (20%), and testing (20%) sets. For the Sleep-EDF dataset, we additionally perform subject-wise splitting. In the case of SHHS, we first divide the records subject-wise into two equal parts to obtain EEG data from two distinct channels; each part is then subdivided into training (60%), validation (20%), and testing (20%) sets. We conducted each experiment five times using different random seeds ($\{2019, 2020, 2021, 2022, 2023\}$), and the results are reported as averages across all runs. Model training, including SSL pretraining, linear probing, and fine-tuning, was carried out using the AdamW optimizer. The training duration was 40 epochs with a learning rate of 5e-4, weight decay of 0.05, and a batch size of 128. The initial convolutional layers have dataset-specific kernel sizes and strides: $k=8, s=1$ for Epilepsy, $k=25, s=6$ for Sleep-EDF, $k=32, s=4$ for FD, $k=8, s=1$ for HAR, and $k=25, s=6$ for SHHS. The remaining hyperparameters are set as follows: $\lambda=100$, mask ratio $=0.8$, and number of masks $=20$. The transformer decoder configuration matches that of the encoder but has 4 layers. We implemented the algorithms and models using PyTorch (code available at https://github.com/HuayuLiArizona/MST-LOF.git) and conducted all experiments on an NVIDIA RTX 3090 GPU.
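For reference, a minimal sketch of this optimization setup is shown below; the model is a placeholder rather than the MTS-LOF backbone, and the training loop body is elided:

```python
import torch
import torch.nn as nn

# Placeholder model; the real backbone is the CNN1D + transformer encoder.
model = nn.Linear(128, 64)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)

EPOCHS, BATCH_SIZE = 40, 128
SEEDS = [2019, 2020, 2021, 2022, 2023]   # results averaged over these runs

for seed in SEEDS:
    torch.manual_seed(seed)
    for epoch in range(EPOCHS):
        pass  # pretraining / linear probing / fine-tuning loop elided
```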

IV-C Comparison with Baseline Approaches

To assess the performance of our proposed approach, we conducted a comprehensive comparative analysis against several baseline methods: SSL-ECG [14], CPC [4], SimCLR [5], TS-TCC [13], TST [29], and TS2Vec [15]. This evaluation encompassed the Human Activity Recognition (HAR), Sleep-EDF, and Epilepsy datasets and focused on standard linear probing results, employing linear classifiers on top of frozen representations from the SSL-pretrained models; a sketch of this protocol follows below. Table II showcases the results of these comparisons, including accuracy and the macro F1 score (MF1). Notably, our approach demonstrates a significant improvement in performance across all three domains. On HAR, our approach achieves an accuracy of 93.05% and an MF1 score of 93.09, surpassing all baseline methods; SSL-ECG, CPC, and SimCLR show lower accuracy and MF1 scores, highlighting the superior capability of our method in recognizing human activities from time series data. For Sleep-EDF, our approach attains an accuracy of 84.35% and an MF1 score of 73.52, a substantial enhancement compared to the baseline methods; TS-TCC, while showing competitive results, falls short of our approach in terms of accuracy, emphasizing the effectiveness of our method in sleep stage classification. In the challenging domain of Epilepsy, our approach outperforms all baseline methods, achieving an impressive accuracy of 98.33% and an MF1 score of 97.41, showcasing the robustness and high predictive power of our approach in distinguishing between subjects with and without epileptic seizures.
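To make the linear probing protocol concrete, the following is a minimal sketch of the evaluation setup, in which the SSL-pretrained encoder is frozen and only a linear classifier is trained; the encoder, dimensions, and class count here are placeholders:

```python
import torch
import torch.nn as nn

# Placeholder encoder standing in for the frozen SSL-pretrained backbone.
encoder = nn.Linear(128, 64)
for param in encoder.parameters():
    param.requires_grad = False            # freeze the representations

classifier = nn.Linear(64, 5)              # e.g., five sleep stages
optimizer = torch.optim.AdamW(classifier.parameters(), lr=5e-4)

x = torch.randn(128, 128)                  # placeholder batch
logits = classifier(encoder(x).detach())   # probe on frozen features
```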

These results validate the effectiveness of our proposed method, which leverages strong self-supervised representation learning, in enhancing predictive performance across various medical time series domains. Our approach consistently exhibits superior accuracy and F1 scores, highlighting its potential to make a significant impact in medical diagnostics and related applications. The success of our method can be attributed to the inherent ability of occlusion-invariant feature learning to capture intricate temporal and local patterns within the time series data, which is vital for distinguishing subtle differences in medical conditions. Moreover, our approach showcases the adaptability of self-supervised learning across a diverse range of medical domains, including human activity recognition, sleep stage classification, and epileptic seizure prediction. This versatility is a testament to the broad utility and applicability of our method in real-world medical scenarios.

IV-D Semi-supervised learning

To further assess the adaptability and performance of the MTS-LOF framework, we conducted experiments under semi-supervised learning conditions, specifically on the Sleep-EDF dataset. In this scenario, we fine-tuned the pre-trained model using varying percentages of partially labeled data, specifically 1%, 5%, 10%, 50%, and 100% of randomly selected subsets from the complete dataset. The results of these semi-supervised experiments are depicted in Figure 3, and they are compared to the results obtained under fully supervised learning conditions with fully labeled data, which yielded an F1 score of 75.55. The outcomes of these semi-supervised experiments are insightful. We observe that the MTS-LOF framework demonstrates its robustness and adaptability across different levels of labeled data. Even with just 1% of the data labeled, the model achieves a competitive F1 score of 72.77. This suggests that the framework can effectively leverage minimal labeled data to produce meaningful results, which is especially relevant in real-world scenarios where labeling medical data can be a time-consuming and costly process.

As the percentage of labeled data increases, we notice a gradual improvement in the F1 score. This trend highlights the framework’s ability to capitalize on additional labeled data for enhancing predictive performance. Notably, when 100% of the data is labeled, the F1 score reaches 75.95, surpassing the fully supervised learning result. This finding is promising, as it implies that SSL pretraining benefits the performance of supervised learning. Comparing these semi-supervised results with the fully supervised counterpart, we can infer that the MTS-LOF framework offers a powerful solution in scenarios where labeling medical data comprehensively is challenging or impractical. The ability to achieve competitive performance even with minimal labeled data reaffirms the framework’s applicability in scenarios where obtaining fully labeled datasets may be a limiting factor.

IV-E Transferability of learned representation

The MTS-LOF framework not only enhances performance but also improves the transferability of learned representations across diverse data distributions. We conducted experiments using the FD and SHHS datasets to explore this aspect. Baseline models, trained with supervised learning, utilized fully labeled data from each dataset. In contrast, SSL pretraining utilized unlabeled data to acquire generalized representations, which were later fine-tuned with fully labeled target domain data through linear probing. It’s essential to note that these models were exclusively trained in one domain (the source domain) and subsequently tested in various domains (the target domains).

In the SHHS dataset, we evaluated the SSL-pretrained model's ability to transfer learned representations across different domains, namely the labeled domains C4-A1 and C3-A2. The results consistently demonstrate strong cross-domain performance in terms of accuracy (Table III). A similar pattern is observed for F1 scores (Table IV): the SSL-pretrained model consistently delivers better cross-domain F1 scores, with particularly significant improvements in domain C3-A2. These results underscore the effectiveness of SSL pretraining in enhancing the transferability of learned representations in the SHHS dataset. Tables V and VI provide accuracy and F1 score results for the FD dataset and demonstrate strong transferability, with consistently high accuracy and F1 scores even when shifting between different domains. These results show that MTS-LOF's exceptional generalizability extends beyond medical time series data.

IV-F Ablation Study

To comprehensively assess the impact of two critical hyperparameters, the number of masks and the mask ratio, we performed an ablation study using the HAR dataset. Figure 4 illustrates the F1 scores obtained across various combinations of these hyperparameters. In the first part of the ablation study, we explored the effect of different numbers of masks while keeping the mask ratio constant at 0.8. The F1 score is clearly sensitive to the number of masks, increasing markedly as we move from one mask to 20; increasing the number of masks to forty, however, does not lead to a substantial further improvement. In the second part, we investigated the effect of different mask ratios while maintaining a constant number of 20 masks. The results again show clear sensitivity to the mask ratio: as it increases from 0.5 to 0.9, the F1 score steadily improves, with the most significant gain occurring at a mask ratio of 0.8.

IV-G Visualization of the Representations

In this section, we provide visual insights into the learned representations on three distinct datasets: Epilepsy, Sleep-EDF, and HAR. For visual exploration, we utilize t-SNE (t-Distributed Stochastic Neighbor Embedding) [45] to produce two-dimensional representations of the latent features obtained through different training methodologies. Figure 5 presents a side-by-side comparison of t-SNE visualizations for three distinct training approaches: SSL, supervised training, and 5% fine-tuning. These visualizations aim to clarify how each training paradigm influences the distribution and clustering of data points within the latent space. This information is essential for assessing the quality and discriminative capabilities of the learned representations. Our observations from the visualizations indicate that the features learned through SSL are already linearly separable, and a 5% fine-tuning can yield results comparable to supervised training.
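For reproducibility, a minimal sketch of the visualization step is given below; the feature matrix is a random placeholder standing in for pooled representations extracted from a trained encoder:

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder features standing in for pooled representations from a
# trained encoder; in practice these come from SSL, supervised, or
# fine-tuned models.
features = np.random.randn(1000, 128)
embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
# embedded has shape (1000, 2); each row can be scattered and colored
# by its class label to produce plots like Figure 5.
```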

V Discussion

In this study, we introduce MTS-LOF, a novel framework tailored for medical time series representation learning. This framework addresses critical challenges and creates new opportunities for knowledge discovery in healthcare applications.

Reduction of Labeling Requirements: MTS-LOF mitigates one of the most significant bottlenecks in medical time series analysis—the demand for extensive labeled data. Traditional supervised learning methods heavily rely on manually annotated datasets, a costly and time-consuming process. In healthcare, where data availability is often constrained by privacy concerns and regulatory hurdles, MTS-LOF provides an efficient alternative. By learning from unlabeled data, MTS-LOF independently extracts meaningful representations directly from raw medical time series. This eliminates the need for extensive expert-annotated datasets and offers a versatile solution for medical data analysis.

Generalizability and Transfer Learning: MTS-LOF excels at acquiring generalized representations. These representations capture the underlying structures and patterns within medical time series data, enabling it to generalize across diverse datasets and tasks. This inherent generalizability enhances the transfer of learned representations across different domains and medical specialties. Models pretrained on one dataset can be fine-tuned for specific clinical applications with limited labeled data, such as in the cases of rare diseases or emerging health trends.

Enhanced Feature Extraction: MTS-LOF exhibits superior feature extraction capabilities for medical time series. It learns intricate patterns and temporal dependencies, which are vital for identifying subtle yet clinically relevant changes in patients' conditions. MTS-LOF has shown exceptional performance in various downstream tasks, including sleep stage classification, human activity recognition, and seizure prediction, leading to improved diagnostic accuracy, early disease detection, and enhanced patient management.

Potential Healthcare Applications: MTS-LOF is a robust framework tailored for medical time series analysis. Designed for diverse clinical settings and home care scenarios, it reduces reliance on expert annotation, adapts across multiple domains, and offers superior feature extraction capabilities. Leveraging these attributes, MTS-LOF is well suited for integration into numerous healthcare applications, ranging from sleep quality analysis and fall detection to seizure prediction. Furthermore, there is potential to refine MTS-LOF for deployment on smartphones paired with wearable sensors, such as smartwatches and EEG headsets, facilitating real-time prediction and continuous monitoring.

VI Conclusion

In this paper, we present MTS-LOF, an unsupervised representation framework that combines the strengths of Joint-Embedding SSL and MAE. The MTS-LOF framework is designed to address the intricate challenges and complexities involved in the analysis of medical time series data when there is a shortage of annotated data. The primary strength of MTS-LOF lies in its capability to generate context-rich representations. Through the integration of joint-embedding SSL and MAE, our framework adeptly captures discriminative patterns within medical time series data and the subtle temporal dependencies inherent to the data. This creative combination leads to representations that offer a comprehensive understanding of healthcare data, making it a powerful tool to support clinical decision-making. Our experiments, conducted on diverse medical time series datasets, illustrate the exceptional performance of MTS-LOF. In conclusion, MTS-LOF marks a substantial advancement in medical time series data analysis. Its contributions extend to enhancing data-driven insights and improving the development of intelligent clinical decision support systems, particularly under constraints related to label availability.

References

  • [1] T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P.-M. Agapow, M. Zietz, M. M. Hoffman et al., “Opportunities and obstacles for deep learning in biology and medicine,” Journal of The Royal Society Interface, vol. 15, no. 141, p. 20170387, 2018.
  • [2] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European conference on computer vision.   Springer, 2016, pp. 69–84.
  • [3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.
  • [4] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • [6] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
  • [7] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, “Learning deep representations by mutual information estimation and maximization,” arXiv preprint arXiv:1808.06670, 2018.
  • [8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [9] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  • [10] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
  • [11] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [12] J.-Y. Franceschi, A. Dieuleveut, and M. Jaggi, “Unsupervised scalable representation learning for multivariate time series,” Advances in neural information processing systems, vol. 32, 2019.
  • [13] E. Eldele, M. Ragab, Z. Chen, M. Wu, C. K. Kwoh, X. Li, and C. Guan, “Time-series representation learning via temporal and contextual contrasting,” arXiv preprint arXiv:2106.14112, 2021.
  • [14] P. Sarkar and A. Etemad, “Self-supervised ecg representation learning for emotion recognition,” IEEE Transactions on Affective Computing, vol. 13, no. 3, pp. 1541–1554, 2020.
  • [15] Z. Yue, Y. Wang, J. Duan, T. Yang, C. Huang, Y. Tong, and B. Xu, “Ts2vec: Towards universal representation of time series,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 8980–8987.
  • [16] M. N. Mohsenvand, M. R. Izadi, and P. Maes, “Contrastive representation learning for electroencephalogram classification,” in Machine Learning for Health.   PMLR, 2020, pp. 238–253.
  • [17] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), vol. 2.   IEEE, 2006, pp. 1735–1742.
  • [18] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” Advances in neural information processing systems, vol. 32, 2019.
  • [19] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” Advances in neural information processing systems, vol. 33, pp. 9912–9924, 2020.
  • [20] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” Advances in neural information processing systems, vol. 33, pp. 21271–21284, 2020.
  • [21] A. Bardes, J. Ponce, and Y. LeCun, “Vicreg: Variance-invariance-covariance regularization for self-supervised learning,” arXiv preprint arXiv:2105.04906, 2021.
  • [22] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in International Conference on Machine Learning. PMLR, 2021, pp. 12310–12320.
  • [23] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15750–15758.
  • [24] A. Saeed, T. Ozcelebi, and J. Lukkien, “Multi-task self-supervised learning for human activity detection,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 2, pp. 1–30, 2019.
  • [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [26] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [27] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009.
  • [28] H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” arXiv preprint arXiv:2106.08254, 2021.
  • [29] G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff, “A transformer-based framework for multivariate time series representation learning,” in Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, 2021, pp. 2114–2124.
  • [30] R. G. Andrzejak, K. Lehnertz, F. Mormann, C. Rieke, P. David, and C. E. Elger, “Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state,” Physical Review E, vol. 64, no. 6, p. 061907, 2001.
  • [31] D. Anguita, A. Ghio, L. Oneto, X. Parra, J. L. Reyes-Ortiz et al., “A public domain dataset for human activity recognition using smartphones.” in Esann, vol. 3, 2013, p. 3.
  • [32] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” circulation, vol. 101, no. 23, pp. e215–e220, 2000.
  • [33] S. F. Quan, B. V. Howard, C. Iber, J. P. Kiley, F. J. Nieto, G. T. O’Connor, D. M. Rapoport, S. Redline, J. Robbins, J. M. Samet et al., “The sleep heart health study: design, rationale, and methods,” Sleep, vol. 20, no. 12, pp. 1077–1085, 1997.
  • [34] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2536–2544.
  • [35] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” in International conference on machine learning.   PMLR, 2020, pp. 1691–1703.
  • [36] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA, USA: IEEE Computer Society, Oct. 2021, pp. 9620–9629. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICCV48922.2021.00950
  • [37] X. Kong and X. Zhang, “Understanding masked image modeling via learning occlusion invariant feature,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6241–6251.
  • [38] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. PMLR, 2015, pp. 448–456.
  • [39] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
  • [40] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick, “Early convolutions help transformers see better,” Advances in neural information processing systems, vol. 34, pp. 30392–30400, 2021.
  • [41] E. Eldele, Z. Chen, C. Liu, M. Wu, C.-K. Kwoh, X. Li, and C. Guan, “An attention-based deep learning approach for sleep stage classification with single-channel eeg,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 29, pp. 809–818, 2021.
  • [42] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • [43] C. Lessmeier, J. K. Kimotho, D. Zimmer, and W. Sextro, “Condition monitoring of bearing damage in electromechanical drive systems by using motor current signals of electric motors: A benchmark data set for data-driven classification,” in PHM Society European Conference, vol. 3, no. 1, 2016.
  • [44] A. S. Carreon-Rascon and J. W. Rozenblit, “Towards requirements for self-healing as a means of mitigating cyber-intrusions in medical devices,” in 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC).   IEEE, 2022, pp. 1500–1505.
  • [45] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.