
Automatic View Learning In Contrastive Learning Of Time Series LEAVES



3 main points
✔️ In contrastive learning, tuning data augmentation policies and parameters is time-consuming
✔️ A method, LEAVES, was developed to automatically generate training views for time series data
✔️ LEAVES was found to be more effective than baselines, including SOTA methods, at finding reasonable views and performing downstream tasks

LEAVES: Learning Views for Time-Series Data in Contrastive Learning
written by Han Yu, Huiyuan Yang, Akane Sano
(Submitted on 13 Oct 2022)
Comments: Published on arXiv.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Contrastive learning, a self-supervised learning method that can learn representations from unlabeled data, has emerged as a promising technique. Many contrastive learning methods rely on data augmentation, which generates different views of the original signal. However, tuning the policies and hyperparameters for effective data augmentation in contrastive learning is often time- and resource-consuming. Researchers have designed approaches that automatically generate new views for a given input signal, especially for image data, but few view-learning methods have been developed for time series data. In this study, we propose a simple and effective module, named learning views for time-series data (LEAVES), to automate view generation for time series data in contrastive learning. The proposed module learns the hyperparameters of the augmentations through adversarial training within contrastive learning. We test the effectiveness of the proposed method on multiple time series datasets. Experimental results demonstrate that the proposed method is more effective at finding reasonable views and performing downstream tasks than the baselines, which include contrastive learning based on manually tuned augmentations and SOTA methods.


Contrastive learning has been used for a variety of downstream tasks on data such as images (Chen et al., 2020; Grill et al., 2020; Wang & Qi, 2022) and time series (Mohsenvand et al., 2020; Mehari & Strodthoff, 2022), and it has been widely applied to improve model robustness. In the contrastive learning methods developed so far, data augmentation plays an essential role in generating different corrupted transformations as views of the original input for the pretext task. For example, Chen et al. (2020) proposed SimCLR, which pre-trains a model by maximizing the agreement between augmented views of the same sample and significantly outperformed previous state-of-the-art methods in image classification when labeled data is scarce. However, the choice of data augmentation methods is usually empirical, and finding an optimized set of augmentations can cost thousands of GPU hours even with automated search algorithms (Cubuk et al., 2019). Thus, how to effectively generate views for new datasets remains an open question.

Instead of using manually designed views, researchers have been training deep learning models to generate optimized views for input samples (Tamkin et al., 2020; Rusak et al., 2020). These methods generate reasonably corrupted views and produce satisfactory results. For example, Tamkin et al. (2020) proposed ViewMaker, an adversarially trained convolutional module for contrastive learning, to generate augmentations for images. However, such methods may not work well when applied directly to time series data. The main challenge is that a time series signal must be perturbed not only in magnitude (the spatial domain) but also along the temporal dimension (Um et al., 2017; Mehari & Strodthoff, 2022). Image-based methods, on the other hand, can only perturb the spatial domain by adding moderate noise to the input data.

We propose LEAVES, a lightweight module for learning views of time series data in contrastive learning. LEAVES is optimized adversarially against the contrastive loss and generates views that are challenging for the encoder during representation learning. We also propose a differentiable data augmentation technique for time series data, named TimeDistort (TimeDis), which introduces smooth temporal perturbations into the generated views. In the ViewMaker example in Fig. 1(a), the temporal positions are not perturbed, and the flat region of the original ECG signal (the T-P interval, an ECG fiducial) is completely distorted. In contrast, LEAVES introduces both magnitude and temporal distortions and, more importantly, reduces the risk of losing intact information through excessive perturbation of the time series. Experimental and analytical results show that LEAVES (1) outperforms baselines, including SimCLR and SOTA methods, and (2) generates more reasonable views of time series data than SOTA methods.

Related Research

Contrastive learning based on augmentations

Among the contrastive learning algorithms proposed in various fields, data augmentation usually plays an essential role in generating views from the original input to form contrastive pairs. Recently, many contrastive learning frameworks have been developed based on image transformations in computer vision (He et al., 2020; Chen et al., 2020; Grill et al., 2020; Chen & He, 2021; Tamkin et al., 2020; Zbontar et al., 2021; Wang & Qi, 2022; Zhang & Ma, 2022). For example, Chen et al. (2020) proposed the SimCLR framework, which maximizes the agreement between two transformed views of the same image. BYOL (Grill et al., 2020) takes two augmented views of an image and encourages two networks, a target network and an online network, to interact and learn from each other. Zbontar et al. (2021) proposed the Barlow Twins framework, which applies a redundancy-reduction objective to two corrupted views of an image to avoid trivial constant solutions in contrastive learning.

Beyond computer vision, contrastive learning algorithms have also been applied to time series data (Gopal et al., 2021; Mehari & Strodthoff, 2022; Wickstrøm et al., 2022). For example, Gopal et al. (2021) proposed augmentations based on clinical domain knowledge to generate views of ECG data for contrastive learning. Mehari & Strodthoff (2022) applied well-established methods such as SimCLR, BYOL, and CPC (Oord et al., 2018) to time series ECG data for clinical downstream tasks. Wickstrøm et al. (2022) applied MixUp augmentation (Zhang et al., 2017) to generate contrastive views. While these studies have shown promising results by leveraging unlabeled data, empirically chosen augmented views may not be optimal, especially for relatively new or less studied datasets, because searching for a suitable set of augmentations is expensive.


Researchers have proposed several methods that search for appropriate augmentation strategies rather than setting augmentations empirically (Cubuk et al., 2019; Ho et al., 2019; Lim et al., 2019; Li et al., 2020; Cubuk et al., 2020; Liu et al., 2021). For example, AutoAugment (Cubuk et al., 2019) is a reinforcement learning-based algorithm that searches for augmentation policies, including the probability and order of applying different augmentation methods. DADA (Li et al., 2020) is a gradient-based optimization strategy that finds the most probable augmentation policy after training, and it can significantly reduce the search time compared to algorithms such as AutoAugment. Although these search methods have shown high performance, they usually require heavy computation to explore the augmentation space thoroughly and find an optimized policy.

Instead of searching for augmentations in a policy space, researchers have also developed learned views, in which data transformations are generated by neural networks rather than by manually tuned augmentations (Tian et al., 2020; Rusak et al., 2020; Tamkin et al., 2020). For example, Rusak et al. (2020) used a CNN to generate noise based on the input data and trained this perturbation generator adversarially against the supervised loss. Similarly, Tamkin et al. (2020) proposed a ResNet-based ViewMaker module to generate views of the data in a contrastive learning framework; ViewMaker is also trained adversarially, by maximizing the contrastive loss against the representation encoder. Nevertheless, these methods do not consider temporal perturbations when applied to time series datasets. Therefore, we design the proposed LEAVES module to generate both magnitude and temporal perturbations in sequences.


Contrastive learning is a form of self-supervised learning that encourages similar representations for transformations of the same input and learns to distinguish them from other samples. In this study, we employ a simple and proven contrastive learning method, SimCLR (Chen et al., 2020). Fig. 2 provides an overview of the pre-training architecture. A differentiable LEAVES module, which generates views that are more challenging yet faithful to the input, is plugged into the SimCLR framework to produce the different views used for contrastive learning. LEAVES is trained together with the encoder in an adversarial manner.


We propose the LEAVES module, a lightweight component that can be easily plugged into a contrastive learning framework. The module consists of a series of differentiable data augmentation methods: jitter T_J, scale T_S, magnitude warp (MagW) T_MW, permutation (Perm) T_P, and the newly proposed time distortion (TimeDis) T_TD. For example, T_J T_P represents transforming the input data with jitter noise followed by permutation. The proposed module generates the view X̂ by applying these augmentations to the input X in sequence (Equation 1).

Here, σ represents the hyperparameter of each data augmentation method, which controls the strength of the corruption relative to the original sample. For example, σ_J is the standard deviation used to generate the jitter noise. The learnable parameters of the module are the σ values of the augmentation methods. By learning these parameters, the module learns a strategy for combining multiple augmentation methods to generate views. The order in which the augmentations are applied to X in Equation 1 is deliberately not tuned, because the hyperparameters of one augmentation are independent of the views produced by the others. For example, the Scale operation to be applied does not depend on the view generated by Jitter.
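The idea of chaining augmentations, each with its own strength σ, can be sketched as follows. This is only an illustration under stated assumptions: the function names (`jitter`, `scale`, `leaves_view`) are hypothetical, only two of the five augmentations are shown, and signals are plain Python lists rather than tensors.

```python
import random

def jitter(x, sigma, rng):
    # Add zero-mean Gaussian noise with standard deviation sigma to every step.
    return [v + sigma * rng.gauss(0.0, 1.0) for v in x]

def scale(x, sigma, rng):
    # Multiply the whole signal by a single factor drawn around 1.0.
    factor = 1.0 + sigma * rng.gauss(0.0, 1.0)
    return [v * factor for v in x]

def leaves_view(x, sigmas, rng):
    # Apply the augmentations in a fixed order; each one has its own sigma.
    view = jitter(x, sigmas["jitter"], rng)
    view = scale(view, sigmas["scale"], rng)
    return view

rng = random.Random(0)
signal = [0.0, 0.5, 1.0, 0.5, 0.0]
sigmas = {"jitter": 0.05, "scale": 0.05}

# Two independent views of the same sample form one contrastive pair.
view_a = leaves_view(signal, sigmas, rng)
view_b = leaves_view(signal, sigmas, rng)
```

Setting every σ to zero returns the input unchanged, which is one way to see that σ alone controls how far a view departs from the original sample.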

Differentiable data augmentation for time series data

Several widely used data augmentation methods were selected for LEAVES. For example, Jitter, Scale, and MagW perturb the magnitude of the original signal, while Time Warping (TimeW) (Um et al., 2017) and Perm corrupt the temporal position. A detailed description of the augmentation methods can be found in Appendix A.1 of the original paper.

To optimize the hyperparameters of these augmentation algorithms, gradients must be propagated to these parameters during training. However, the augmentation methods rely on non-differentiable operations such as drawing random values and indexing. We therefore applied the reparameterization trick (Jang et al., 2016; Maddison et al., 2016) to make these procedures differentiable, except for TimeW, where extracting gradients through indexing operations is difficult. As an alternative, we propose the TimeDis augmentation method, which distorts the temporal information smoothly. Fig. 3 shows examples of the six augmentation methods applied to time series samples. To ensure that the generated corruptions are reasonable, we constrain the noise by setting an upper bound η on the σ values of the magnitude-based methods (Jitter, Scale, and MagW) and by using K as the maximum number of segments in Perm.
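As a rough illustration of why reparameterization makes σ learnable, consider the Gaussian case for Jitter: writing the noise as σ·ε, with ε drawn independently of σ, turns the view into a differentiable function of σ. The sketch below checks the analytic gradient against a finite difference. The helper names and the clamping helper for the upper bound are assumptions made for this example, not the authors' code.

```python
import random

def jitter_reparam(x, sigma, eps):
    # Noise is written as sigma * eps, with eps drawn independently of sigma,
    # so the output is a differentiable function of sigma.
    return [v + sigma * e for v, e in zip(x, eps)]

def grad_wrt_sigma(eps):
    # For this parameterization, d(view_t)/d(sigma) is simply eps_t.
    return list(eps)

def clamp_sigma(sigma, eta):
    # Keep sigma inside [0, eta], mirroring the upper bound on the noise.
    return max(0.0, min(sigma, eta))

rng = random.Random(0)
x = [0.0, 1.0, 0.0, -1.0]
eps = [rng.gauss(0.0, 1.0) for _ in x]

sigma, h = 0.03, 1e-6
analytic = grad_wrt_sigma(eps)
# Central finite difference with the same eps held fixed.
numeric = [
    (a - b) / (2 * h)
    for a, b in zip(jitter_reparam(x, sigma + h, eps),
                    jitter_reparam(x, sigma - h, eps))
]
```

In an automatic differentiation framework the same effect is obtained by sampling ε outside the computation graph and keeping σ as a trainable parameter.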


TimeDis relies on a smooth probability distribution to generate the positions at which the original signal is sampled. A reparameterized Gaussian mixture model with M Gaussian components, Σ_{i=1}^{M} φ_i N(μ_i, σ_i²), is used to generate position indices λ ∈ R^{N×C×L} ranging from -1 to 1. Fig. 4 shows an example using TimeDis. Among the generated position indices, -1 corresponds to the first time step of the original signal (position 1) and 1 corresponds to the last time step (position L). The view X̂ is then obtained by applying an affine transformation to the original signal X with λ: the spacing between samples becomes looser where the λ indices are dense and tighter where they are sparse.
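One way to picture this kind of smooth temporal distortion is the single-channel sketch below. It is not the paper's procedure: the construction of λ from inverse-density step sizes, the `floor` constant, and the use of linear interpolation in place of an affine grid sampler are all assumptions chosen to keep the example short and dependency-free.

```python
import math

def gmm_density(t, weights, means, stds):
    # Mixture of Gaussians evaluated at position t.
    return sum(
        w * math.exp(-0.5 * ((t - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )

def position_indices(length, weights, means, stds, floor=0.1):
    # Step sizes are inversely proportional to the density, so positions are
    # dense where the mixture is high. Steps are rescaled to span [-1, 1].
    mids = [-1.0 + 2.0 * (k + 0.5) / (length - 1) for k in range(length - 1)]
    steps = [1.0 / (gmm_density(t, weights, means, stds) + floor) for t in mids]
    total = sum(steps)
    lam, pos = [-1.0], -1.0
    for s in steps:
        pos += 2.0 * s / total
        lam.append(pos)
    lam[-1] = 1.0  # guard against floating-point drift at the end point
    return lam

def resample(x, lam):
    # Linear interpolation of x at positions lam, where -1 maps to the first
    # time step and 1 maps to the last.
    last = len(x) - 1
    out = []
    for p in lam:
        idx = (p + 1.0) / 2.0 * last
        lo = min(int(math.floor(idx)), last - 1)
        frac = idx - lo
        out.append((1.0 - frac) * x[lo] + frac * x[lo + 1])
    return out

signal = [math.sin(2 * math.pi * k / 49) for k in range(50)]
lam = position_indices(50, weights=[0.6, 0.4], means=[-0.5, 0.4], stds=[0.2, 0.3])
view = resample(signal, lam)
```

Because λ is monotone and pinned to -1 and 1 at the ends, the view keeps the first and last samples and the order of events, while the regions with dense λ are stretched and the others are compressed.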

Adversarial Training

For representation learning, we define z as the representation extracted by the encoder. If the N pairs of representations extracted by the encoder in the SimCLR framework are (z_i, z_j), i, j ∈ [1, N], the loss function that maximizes the agreement within each pair can be defined as follows.

where s(z_i, z_j) is the cosine similarity between z_i and z_j, 1_[k≠i] is an indicator function equal to 1 when k ≠ i, and τ is a temperature parameter, fixed at 0.05 in this study.
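The ingredients named here (cosine similarity, an indicator that excludes k = i, and a temperature τ) are those of the normalized temperature-scaled cross-entropy loss used by SimCLR. A minimal pure-Python sketch of that standard form is given below; the function names and the toy vectors are assumptions for illustration, and a real implementation would use batched tensor operations.

```python
import math

def cosine(u, v):
    # Cosine similarity s(u, v) between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nt_xent(z, pairs, tau=0.05):
    # z: list of all view representations; pairs: (i, j) indices of positives.
    # For each anchor, the denominator sums over every other view (k != anchor).
    losses = []
    for i, j in pairs:
        for anchor, positive in ((i, j), (j, i)):
            num = math.exp(cosine(z[anchor], z[positive]) / tau)
            den = sum(
                math.exp(cosine(z[anchor], z[k]) / tau)
                for k in range(len(z)) if k != anchor
            )
            losses.append(-math.log(num / den))
    return sum(losses) / len(losses)

# Four toy representations: views 0 and 1 are similar, as are views 2 and 3.
z = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
loss_matched = nt_xent(z, pairs=[(0, 1), (2, 3)])
loss_mismatched = nt_xent(z, pairs=[(0, 2), (1, 3)])
```

The loss is small when the two views of a sample map to nearby representations and large when the declared positives are far apart, which is the quantity the encoder tries to reduce.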

As shown in Fig. 2, LEAVES and the encoder are optimized in opposite directions: the encoder aims to minimize L, whereas the LEAVES module aims to maximize L. Through adversarial training, the LEAVES module learns to distort the original signal so that the views are as difficult as possible, while the encoder must still extract the intact information from each pair of views. In this way, the encoder becomes robust by training against the most strongly corrupted views. After the SimCLR pre-training, the weights learned by the encoder are used to initialize the model for supervised learning on the downstream task.
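The opposing updates can be illustrated with a deliberately simple toy problem. Everything in this sketch is an assumption made for illustration: the scalar "encoder weight" `w`, the hand-written loss, the learning rates, and the cap `eta` stand in for the real encoder, the contrastive loss, and the noise bound.

```python
def loss(w, sigma):
    # Toy stand-in for the contrastive loss: stronger views (larger sigma) raise it.
    return (1.0 + sigma) * (w - 2.0) ** 2 + sigma

def grads(w, sigma):
    # Analytic partial derivatives of the toy loss.
    d_w = 2.0 * (1.0 + sigma) * (w - 2.0)
    d_sigma = (w - 2.0) ** 2 + 1.0
    return d_w, d_sigma

w, sigma, eta = 0.0, 0.0, 0.05
for _ in range(200):
    d_w, d_sigma = grads(w, sigma)
    # Encoder: gradient descent on the loss.
    w -= 0.1 * d_w
    # View module: gradient ascent on the same loss, capped at eta.
    sigma = min(eta, max(0.0, sigma + 0.001 * d_sigma))
```

In this toy setting the view parameter is pushed up until it reaches its cap, while the encoder parameter still converges to its minimizer, which mirrors the role of the upper bound in keeping the learned views reasonable.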


To evaluate the proposed method, we conduct experiments on four public time series datasets: Apnea-ECG (Penzel et al., 2000), Sleep-EDFE (Kemp et al., 2018), PTB-XL (Wagner et al., 2020), and PAMAP2 (Reiss & Stricker, 2012), for detecting apnea, sleep stages, arrhythmia, and human activity, respectively. For each dataset, we pre-train the proposed module together with the encoder and then fine-tune the encoder on the downstream task. For comparison, we implement three baselines: (1) a supervised ResNet-18, (2) SimCLR with random augmentations, and (3) a reproduced ViewMaker network adapted to one-dimensional time series inputs.

Detection of Sleep Apnea Syndrome with Single-Lead ECG

The Apnea-ECG dataset (Penzel et al., 2000) is used to study the relationship between sleep apnea symptoms and cardiac activity (monitored by ECG) in humans. It is available from PhysioNet (Goldberger et al., 2000). Following the setup of the original release (Penzel et al., 2000), we used 100 Hz ECG in one-minute segments to predict a binary label for the occurrence of apnea, with 17233 and 17010 samples in the training and test sets, respectively. In the contrastive pre-training phase, the noise threshold δ was set to 0.05 for Jitter, Scale, and MagW, M to 12 for TimeDis, and K to 5 for Perm. We used 100 epochs at a learning rate of 1e-3 for encoder pre-training and 30 epochs at a learning rate of 1e-3 for fine-tuning on the downstream task.

Table 1 shows the evaluation results for sleep apnea detection. Following SOTA work on the same dataset, we used sensitivity (Sen.) and specificity (Spec.) as measures of model performance, which assess the ability to diagnose apnea in patients. LEAVES performed better than the baselines on both Sen. and Spec. SimCLR and ViewMaker both outperformed the supervised baseline, and SimCLR performed slightly better than ViewMaker. Compared to SOTA, the proposed method had a competitive Sen. score but a relatively low Spec. score. This may be due to different settings for filtering noisy samples and for data preprocessing: the structure of our supervised baseline is similar to that of Chang et al. (2020), yet our supervised baseline results also had a lower Spec. than SOTA.
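For readers less familiar with these two metrics, the short sketch below computes them from binary labels, where 1 marks a minute with apnea. The function name and the toy labels are made up for illustration, and the code assumes both classes are present.

```python
def sensitivity_specificity(y_true, y_pred):
    # Count the four cells of the binary confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of apnea minutes that were detected
    specificity = tn / (tn + fp)  # share of normal minutes that were kept normal
    return sensitivity, specificity

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
sen, spec = sensitivity_specificity(y_true, y_pred)
```

A model can score well on one metric and poorly on the other, which is why a competitive Sen. alongside a lower Spec. is reported as a mixed result.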

Sleep stage classification by EEG

Electroencephalography (EEG) is an essential signal for monitoring human brain activity. We tested the proposed method on the Sleep-EDF (expanded) dataset (Kemp et al., 2018), which contains whole-night sleep recordings of 100 Hz Fpz-Cz EEG signals. Following Supratak & Guo (2020), we extracted 42308 30-second samples annotated with 5 sleep stages. For the contrastive pre-training phase, we set the noise threshold δ to 0.05 for Jitter, Scale, and MagW, M to 10 for TimeDis, and K to 5 for Perm. We used 100 epochs at a learning rate of 1e-3 for encoder pre-training and 30 epochs at a learning rate of 1e-3 for fine-tuning on the downstream task. For performance comparison with SOTA, we use accuracy and macro F1 score as evaluation metrics.
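Accuracy and macro F1 can disagree when classes are imbalanced, because macro F1 averages the per-class F1 scores with equal weight. The sketch below shows both computations; the stage names and the toy predictions are illustrative assumptions and do not come from the dataset.

```python
def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted label matches the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    # Average the per-class F1 scores with equal weight for every class.
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

stages = ["W", "N1", "N2", "N3", "REM"]
y_true = ["W", "W", "N1", "N2", "N2", "N3", "REM", "REM"]
y_pred = ["W", "N1", "N1", "N2", "N2", "N3", "REM", "N2"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, stages)
```

On this toy example the accuracy is 0.75 and the macro F1 is 0.76.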

Table 2 shows the performance of the proposed method in classifying sleep stages from EEG signals. The proposed method performs better than the baselines and achieves accuracy and macro F1 scores comparable to SOTA. It must be acknowledged, however, that the experimental setup in this study differs from that of the SOTA works. For example, the preprocessing was not uniform, and the training/test split was not the same: SOTA works widely applied 10- or 20-fold cross-validation, whereas here the validation set was split by subject ID.
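The difference between a subject-wise split and sample-level cross-validation is that no subject contributes samples to both sides. A minimal sketch is shown below; the record fields and subject IDs are hypothetical.

```python
def split_by_subject(samples, held_out_subjects):
    # Every sample of a held-out subject goes to the evaluation side,
    # so no subject appears on both sides of the split.
    held_out = set(held_out_subjects)
    train = [s for s in samples if s["subject"] not in held_out]
    evaluation = [s for s in samples if s["subject"] in held_out]
    return train, evaluation

samples = [
    {"subject": "S01", "epoch": 0, "stage": "W"},
    {"subject": "S01", "epoch": 1, "stage": "N1"},
    {"subject": "S02", "epoch": 0, "stage": "N2"},
    {"subject": "S03", "epoch": 0, "stage": "REM"},
    {"subject": "S03", "epoch": 1, "stage": "N2"},
]
train, evaluation = split_by_subject(samples, ["S03"])
```

Results obtained this way are generally not directly comparable with those from random k-fold splits over samples, where recordings from one subject can appear in both training and evaluation folds.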

Human activity detection using IMU and heart rate

Human activity can be detected using data from wearable devices. PAMAP2 (Reiss & Stricker, 2012) is used to study the relationship between human activity and data collected from wearable sensors, namely three inertial measurement units (IMUs) and a heart rate monitor. The experiments used 100 Hz IMU data and upsampled heart rate data. Following Moya Rueda et al. (2018) and Tamkin et al. (2020), 12 of the 18 physical activity types are used in the experiments. For the contrastive pre-training phase, we set the noise threshold δ to 0.05 for Jitter, Scale, and MagW, M to 7 for TimeDis, and K to 5 for Perm. We used 100 epochs at a learning rate of 1e-3 for encoder pre-training and 20 epochs at a learning rate of 1e-3 for fine-tuning on the downstream task. To compare performance with SOTA, we used accuracy and macro F1 score as evaluation metrics.

Table 3 shows the performance in classifying human activities on the PAMAP2 dataset. The proposed method outperformed all baselines and showed results competitive with the SOTA works that share the same training/test settings (Moya Rueda et al., 2018; Tamkin et al., 2020). The study by Li & Wang (2022) used a subject-dependent setting with a 70%/30% training/test split and achieved the highest performance of all studies. Table 3 also compares the results reported in the original ViewMaker work (Tamkin et al., 2020) with those of the 1D version we reproduced. The accuracy of the original work, which converted the time series data into spectrograms and used a 2D ResNet, is very similar to that of the 1D ResNet version we implemented.


Classification of arrhythmias by 12-lead ECG

Arrhythmia is one of the major causes of cardiovascular disease, and its detection is clinically important. PTB-XL (Wagner et al., 2020) is a large dataset containing 21,837 12-lead, 10-second ECGs at 100 Hz with 5 classes of arrhythmia labels. We follow the split between training and test sets recommended in the original work (Wagner et al., 2020). In the contrastive pre-training phase, we set the noise threshold δ to 0.05 for Jitter, Scale, and MagW, M to 6 for TimeDis, and K to 5 for Perm. We used 100 epochs with a learning rate of 1e-3 for encoder pre-training and 30 epochs with a learning rate of 1e-3 for fine-tuning on the downstream task. To compare model performance with SOTA, we use AUC and accuracy as evaluation metrics.
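AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it for a single class by comparing all positive/negative pairs. The function name and the toy scores are assumptions, and how per-class values are averaged over the 5 classes is not specified here.

```python
def binary_auc(y_true, scores):
    # Compare every positive score with every negative score;
    # a tie counts as half a win.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = binary_auc(y_true, scores)
```

Unlike accuracy, this value does not depend on a decision threshold, which is one reason both metrics are reported together.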

Table 4 shows the results of arrhythmia classification using ECG sequences. The proposed method outperforms the supervised and ViewMaker baselines, while the SimCLR baseline with random augmentations performs slightly better than the proposed method. Compared with SOTA, the proposed method exhibits a higher AUC than several supervised methods (Śmigiel et al., 2021; Li et al., 2021). Interestingly, compared with an ECG-centered self-supervised learning benchmark (Mehari & Strodthoff, 2022), the results of this paper are slightly higher than their implementation of SimCLR, which may also indicate the effectiveness of the proposed LEAVES relative to manually tuned augmentations.


This section discusses the application of the ViewMaker framework to time series data and the tuning of the baseline SimCLR algorithm with different augmentation hyperparameters as ablation studies. It then examines the learned views and the complexity of the proposed LEAVES module.

Ablation Study: ViewMaker in Time Series Data

The research in this paper was inspired by ViewMaker (Tamkin et al., 2020). We tested the ViewMaker framework and observed improvements, as shown in the evaluation section. However, we also observed limitations when ViewMaker was applied to time series data; Fig. 1 shows an example of its limitations in temporal distortion and information preservation. To further validate the fidelity of the generated views, we applied the ECG quality check method of Zhao & Zhang (2018) using the NeuroKit package (Makowski et al., 2021). Table 5 shows the results of the quality check: ViewMaker perturbed almost half of the ECGs to "Unacceptable," a level at which signals are barely recognizable as ECG. This limitation of ViewMaker on time series data motivated us to develop the method proposed in this study.

Ablation Study: Fine-tuning the SimCLR Baseline

Finding the optimal data augmentation methods in contrastive learning is difficult because the search space of augmentations is usually huge. In this study, we tuned the strength of the baseline SimCLR augmentations to obtain a strong SimCLR baseline. For example, T(0.01) represents Jitter, Scale, MagW, and TimeW with σ = 0.01 and K = 5. Table 6 shows the tuning results in terms of accuracy and macro F1 score. The evaluation results varied with the hyperparameters that control the intensity of the augmentations. For some datasets, such as PAMAP2, performance was close across different hyperparameters. For the PTB-XL dataset, however, model performance was more strongly affected by the hyperparameters: with σ = 0.05, we observed a significant decrease in performance compared to SimCLR with σ = 0.03. This indicates that appropriate augmentations benefit contrastive learning, while inappropriate transformations may degrade model performance. Because the proposed method can find appropriate augmentations for time series data without a lengthy search, it should be useful for researchers working with new or rarely studied time series data.

Learning hyperparameters for augmentation

With the proposed differentiable augmentation-based approach, the hyperparameters controlling the augmentations change as the model is trained. Fig. 5 shows how these hyperparameters evolve during training. The σ values for Jitter and MagW keep increasing on all four datasets, while the σ value for Scale shows a decreasing trend. The maximum segment number K for Perm shows an increasing trend on PTB-XL and a decreasing trend on PAMAP2. Although not examined in this study, this phenomenon suggests that the approach may also be useful for finding appropriate views in supervised learning on different datasets. A possible application would be to combine the proposed module with an adversarially trained supervised learning framework, such as that of Rusak et al. (2020).

Time and space complexity

Because the parameters that LEAVES optimizes are the hyperparameters of the augmentation methods, the proposed method has an advantage in space complexity over SOTA approaches such as ViewMaker. For example, our reproduced 1D ViewMaker has 580,000 trainable parameters, whereas LEAVES optimizes only 20 parameters to generate views. In addition, introducing LEAVES into SimCLR adds no noticeable latency: when training for 100 epochs on the Sleep-EDFE dataset with batch size N = 128 on an AWS p3.2xlarge instance (dual NVIDIA V100 GPUs), the baseline SimCLR took 578.0 seconds/epoch on average, whereas SimCLR with LEAVES took 390.8 seconds/epoch. This is because the augmentations in LEAVES are implemented as part of the model and therefore run on the GPU, which makes training even faster than the baseline SimCLR.
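The figures quoted above can be put side by side with a few lines of arithmetic. The variable names are arbitrary; the numbers are the ones reported in the text.

```python
viewmaker_params = 580_000
leaves_params = 20
simclr_sec_per_epoch = 578.0
leaves_sec_per_epoch = 390.8
epochs = 100

# How many times fewer trainable parameters LEAVES uses for view generation.
param_ratio = viewmaker_params / leaves_params

# Relative speed and total wall-clock time saved over the full pre-training run.
speedup = simclr_sec_per_epoch / leaves_sec_per_epoch
time_saved_hours = (simclr_sec_per_epoch - leaves_sec_per_epoch) * epochs / 3600
```

This works out to 29,000 times fewer view-generation parameters, a speedup of roughly 1.48x per epoch, and about 5.2 hours saved over 100 epochs.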


In this study, we introduced LEAVES, a simple and effective module for learning augmentations of time series data in contrastive learning. The proposed method uses adversarial training to optimize the hyperparameters of the data augmentation methods. We evaluated the proposed method on four datasets and found that it performs better than the baselines. In particular, without hyperparameter tuning, LEAVES outperforms the SimCLR baseline on three of the four applications. We also demonstrated that, especially for ECG time series data, the proposed method is superior to SOTA studies in preserving the intact information in the augmented views. In the future, we will add further augmentation methods to LEAVES to improve the variability of the module and explore the possibility of tuning augmentations in supervised learning. We also plan to apply the method to a wider range of time series data. In addition, investigating the interpretability of LEAVES is another interesting direction for better understanding augmentation policies in contrastive learning.

友安 昌幸 (Masayuki Tomoyasu)
JDLA G Certificate 2020#2, E Certificate 2021#1; Japan Society of Data Scientists, DS Certificate; Japan Society for Innovation Fusion, DX Certification Expert; Amiko Consulting LLC, CEO
