# Time-Frequency Consistency (TF-C): The First Realization Of Pre-Training For Time Series With Self-Supervised Contrastive Learning

*3 main points* ✔️ This is a NeurIPS 2022 accepted paper. In time series data, various shifts between the pre-training and target domains may prevent a learned model from being applied successfully.

✔️ To address this challenge, the authors show that TF-C-based models can reach high accuracy without access to target-domain data during pre-training, by performing self-supervised contrastive pre-training in both the time and frequency domains.

✔️ After fine-tuning, the model can be adapted to a variety of downstream tasks such as clustering and anomaly detection.

Self-Supervised Contrastive Pre-Training For Time Series via Time-Frequency Consistency

written by Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, Marinka Zitnik [Submitted on 17 Jun 2022 (v1), last revised 15 Oct 2022 (this version, v3)]

Comments: Accepted by NeurIPS 2022

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

## Summary

Time series pre-training has its own challenges, including shifts in temporal dynamics, fast evolving trends, long- and short-period effects, and other potential mismatches between the pre-training and the target domain that can degrade the performance of downstream tasks. Domain adaptation methods can mitigate these shifts, but are not optimal for pre-training because most methods require examples directly from the target domain.

To solve this challenge, target domains with different temporal dynamics need to be addressed, and there needs to be a way to do this without having to look at target examples during pre-training.

In contrast to other data modalities, time series have time-based and frequency-based representations of the same example, and these are expected to lie close together in a time-frequency space. Therefore, time-frequency consistency (TF-C), i.e., embedding the time-based neighborhood of an example close to its frequency-based neighborhood, is a desirable property for pre-training.

Motivated by TF-C, the authors define a decomposable pre-training model in which the self-supervised signal is provided by the distance between time and frequency components, each of which is trained individually by contrastive estimation. The new method is evaluated on eight datasets.

Comparison experiments with eight state-of-the-art methods show that TF-C outperforms the baselines by an average of 15.4% (F1 score) in a one-to-one setting (e.g., fine-tuning an EEG-pre-trained model on EMG data) and by 8.4% in a one-to-many setting (e.g., fine-tuning an EEG-pre-trained model for either hand gesture recognition or machine fault prediction). Performance improvements can therefore be expected in a wide range of real-world scenarios.

## Introduction

Although representation learning has greatly advanced time series analysis, learning generalizable representations for time series data remains fundamentally difficult. Among the many advantages of representation learning, the ability to pre-train is of particular practical importance. At the heart of pre-training is the question of how to process time series from diverse datasets so that generalization to new time series from different datasets improves significantly. The expectation is that a neural network trained on one dataset and transferred to a new target dataset for fine-tuning, i.e., without explicit re-training on that target data, performs at least as well as a state-of-the-art model tailored to the target dataset.

Unfortunately, the expected performance gains are often not realized for a variety of reasons (e.g., distribution shifts, or properties of the target dataset that are unknown at pre-training time), and the complexity of time series makes this even harder. This complexity limits the usefulness of knowledge transfer from pre-training. For example, pre-training a model on a diverse time series dataset with mostly low-frequency components (smooth trends) may not lead to positive transfer on downstream tasks with high-frequency components (transient events). Examining such cases can provide clues as to which inductive biases facilitate generalizable time series representations.

Also, since the target dataset is not available during pre-training, the pre-trained model must capture a latent property that also holds for a target dataset it has never seen. Underlying this requirement is the idea of a property that is shared by the pre-training and target datasets and that enables knowledge transfer from pre-training to fine-tuning. In computer vision (CV), pre-training is driven by the knowledge that the early layers of a neural network capture universal visual elements, such as edges and shapes, regardless of image style or task. In natural language processing (NLP), the foundation for pre-training is provided by linguistic principles of semantics and grammar that are shared across different languages. However, because of the temporal complexity described above, no such principle has yet been established for time series pre-training. Furthermore, supervised pre-training requires access to large annotated datasets, which limits its use in domains where richly labeled datasets are scarce. For example, in the medical field, labeling large datasets is often infeasible, expensive, and noisy (e.g., experts may disagree on the ground-truth label, such as whether an ECG signal indicates a normal or abnormal rhythm).

Therefore, this paper employs self-supervised learning, which is not constrained by the lack of labeled datasets. The authors introduce a strategy for self-supervised pre-training on time series by modeling time-frequency consistency (TF-C), which stipulates that time-based and frequency-based representations learned from the same time series sample should be closer to each other in the time-frequency space than representations of different time series samples. Specifically, contrastive learning in the time space is used to generate time-based representations. In parallel, the authors propose a new set of augmentations based on characteristics of the frequency spectrum and generate frequency-based embeddings through contrastive instance discrimination. This is the first study to develop frequency-based contrastive augmentations that exploit rich spectral information and explore time-frequency consistency in time series. The goal of pre-training is to minimize the distance between time-based and frequency-based embeddings using a novel consistency loss (Figure 1(a)). This self-supervised loss is used to optimize the pre-training model and to strengthen the consistency between the time and frequency domains in the latent space. The learned relationships encoded in the model parameters are transferred to initialize the fine-tuning model and improve performance on the dataset of interest (Figure 1(b)).

Figure 1: (a) Illustration of time-frequency consistency (TF-C). The time-based embedding $z_i^T$ and the frequency-based embedding $z_i^F$ of the time series sample $x_i^T$, as well as the embeddings $\tilde{z}_i^T$ and $\tilde{z}_i^F$ learned from augmentations of $x_i^T$, should be close to each other in the latent time-frequency space. (b) Using the TF-C property of time series, a model $F$ with parameters $\Theta$ is pre-trained and then fine-tuned to $\Phi$ on a small scenario-specific dataset.

## Related Research

**Pre-training for Time Series** While there is research on self-supervised representation learning for time series and on self-supervised pre-training for images, the intersection of these two areas, namely self-supervised pre-training for time series, remains underexplored. For time series, it is not clear which reasonable assumptions can bridge the pre-training and target datasets. Pre-trained models from CV and NLP are therefore not directly applicable to time series because of the mismatch in data modality, and existing results leave room for improvement. Shi et al. designed a model explicitly for self-supervised time series pre-training that captures local and global temporal patterns, but it is not convincing why the designed pretext task captures generalizable representations. Although several studies have applied transfer learning in the context of time series, there is still no established foundation regarding which conceptual properties are best suited for time series pre-training and why. To address this gap, the authors show that TF-C, which is designed to be invariant across different time series datasets, can produce generalizable pre-trained models.

Unlike domain adaptation, which requires access to the target dataset during training, pre-training does not have access to the fine-tuning dataset. Therefore, to benefit from pre-training, generalizable time series properties must be identified. Furthermore, self-supervised domain adaptation does not require labels in the target dataset, but it still requires labels for model training. In contrast, TF-C does not require any labels during pre-training.

**Contrastive Learning with Time Series** Contrastive learning is a common type of self-supervised learning. Its goal is to learn an encoder that maps inputs into an embedding space so that positive sample pairs (an original augmentation and another augmentation or view of the same input sample) are close together and negative sample pairs (an original augmentation and an augmentation of a different input sample) are far apart. Contrastive learning on time series data has not been well studied, in part because it is difficult to identify augmentations that capture the important invariance properties of time series data. For example, CLOCS defines adjacent time segments as positive pairs, while TNC assumes that overlapping temporal neighborhoods have similar representations. These methods use temporal invariance to define the positive pairs used to compute the contrastive loss, but other invariances are also possible, such as transformation invariance (e.g., SimCLR), contextual invariance (e.g., TS2vec and TS-TCC), and augmentation invariance. The authors propose an augmentation bank that uses multiple invariances to generate a variety of augmentations, which enriches the pre-trained model. Importantly, they propose frequency-based augmentations that perturb the frequency spectrum of a time series (e.g., by adding or removing frequency components or manipulating amplitudes) and learn better representations by exposing the model to local variations in frequency. In previous work, CoST processes sequential signals through the frequency domain, but its augmentations are still implemented in the time space. Similarly, BTSF involves the frequency domain, but its data transformation is implemented only in the time domain using instance-level dropout. To the best of the authors' knowledge, this is the first study to directly perturb the frequency spectrum in order to exploit frequency invariance for contrastive learning. In addition, the authors develop a pre-training model that applies TF-C to the two contrastive encoders.

## Problem Formulation

Let $\mathcal{D}^{\text{pret}} = \{x_i^{\text{pret}} \mid i = 1, \dots, N\}$ be a pre-training dataset of $N$ unlabeled time series samples, where each sample $x_i^{\text{pret}}$ has $K^{\text{pret}}$ channels and $L^{\text{pret}}$ timestamps. Let $\mathcal{D}^{\text{tune}} = \{(x_i^{\text{tune}}, y_i) \mid i = 1, \dots, M\}$ be a fine-tuning dataset of $M$ labeled time series samples, each with $K^{\text{tune}}$ channels and $L^{\text{tune}}$ timestamps. Every sample $x_i^{\text{tune}}$ is accompanied by a label $y_i \in \{1, \dots, C\}$, where $C$ is the number of classes. $x_i^T \equiv x_i$ denotes the input time series sample and $x_i^F$ denotes the discrete frequency spectrum of $x_i$.

**Problem (Self-supervised contrastive pre-training for time series)** Given an unlabeled pre-training dataset $\mathcal{D}^{\text{pret}}$ with $N$ samples and a target dataset $\mathcal{D}^{\text{tune}}$ with $M$ samples ($M \ll N$), the goal is to pre-train a model $F$ using $\mathcal{D}^{\text{pret}}$ and to fine-tune the model parameters on $\mathcal{D}^{\text{tune}}$ so that the fine-tuned model generates a generalizable representation for every $x_i^{\text{tune}}$.

Only the unlabeled dataset $\mathcal{D}^{\text{pret}}$ is available for pre-training, while a small labeled dataset $\mathcal{D}^{\text{tune}}$ is available for fine-tuning. That is, the model $F$ is pre-trained on the unlabeled time series dataset $\mathcal{D}^{\text{pret}}$, and its optimized parameters $\Theta$ are then fine-tuned on $\mathcal{D}^{\text{tune}}$, taking the model from $F(\cdot, \Theta)$ to $F(\cdot, \Phi)$, where $\Phi$ denotes the fine-tuned model parameters. Note that this problem setup (i.e., $\mathcal{D}^{\text{pret}}$ is independent of the target dataset) differs from domain adaptation because the fine-tuning dataset $\mathcal{D}^{\text{tune}}$ is not accessed during pre-training. As a result, the pre-trained model can be used with many different fine-tuning datasets without having to be re-trained.

**Rationale for Time-Frequency Consistency (TF-C)** The central idea is to identify a general property that is preserved across time series datasets and to use it to guide transfer learning for effective pre-training. The time domain shows how sensor readouts change over time, whereas the frequency domain shows how much of the signal lies within each frequency component over the entire spectrum. Explicitly considering the frequency domain reveals behavior of a time series that cannot be captured directly in the time domain alone. Existing contrastive methods, however, focus only on time-domain modeling and ignore the frequency domain entirely. One could argue that, for high-capacity models, these approaches are sufficient because the time and frequency domains are different views of the same data and can be converted into each other using transforms such as the Fourier and inverse Fourier transforms. Nevertheless, the relationship between the two domains, which is grounded in signal processing theory, provides an invariance that holds regardless of the time series distribution and can therefore serve as an inductive bias for pre-training. Approaching this invariance through the lens of representation learning, the authors formulate time-frequency consistency (TF-C). The TF-C property postulates that there exists a latent time-frequency space in which, for every sample $x_i$, the time-based representation and the frequency-based representation of the same sample, together with those of their local augmentations, are close to each other.
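
The statement that the two domains are different views of the same data can be checked numerically. The following is a minimal sketch on a hypothetical toy signal (two sinusoids at 5 Hz and 20 Hz, sampled at 100 Hz, none of which come from the paper), using NumPy's real-input FFT: the spectrum exposes the two frequency components directly, and the inverse transform recovers the original samples.

```python
import numpy as np

# Hypothetical toy signal (not from the paper): two sinusoids sampled at 100 Hz for 1 s.
fs = 100
t = np.arange(fs) / fs
x_time = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

# Time domain -> frequency domain (one-sided spectrum of a real-valued signal).
x_freq = np.fft.rfft(x_time)
amplitude = np.abs(x_freq)

# Frequency domain -> time domain: the inverse transform recovers the samples.
x_back = np.fft.irfft(x_freq, n=len(x_time))

# With 100 samples at 100 Hz, bin k corresponds to k Hz.
dominant_bins = sorted(np.argsort(amplitude)[-2:].tolist())
roundtrip_ok = bool(np.allclose(x_time, x_back))

print("dominant frequency bins:", dominant_bins)
print("round trip recovers the signal:", roundtrip_ok)
```

The frequency view makes the two components explicit, while the time view mixes them into a single waveform, which is the kind of complementary information TF-C is meant to exploit.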

**Representational Time-Frequency Consistency (TF-C)** Let $x_i$ be a time series and let $F$ be a model satisfying TF-C. Then the time-based representation $z_i^T$ and the frequency-based representation $z_i^F$, as well as the representations of local augmentations of $x_i$, are close together in the latent time-frequency space.

The authors' strategy is to use the dataset $\mathcal{D}^{\text{pret}}$ to induce TF-C in the model parameters $\Theta$ of $F$, and then to use $\Theta$ to initialize the target model on $\mathcal{D}^{\text{tune}}$ so that it generates generalizable representations for downstream prediction. Because TF-C is an invariant property, this approach can bridge $\mathcal{D}^{\text{pret}}$ and $\mathcal{D}^{\text{tune}}$ even when large discrepancies exist between them (e.g., in temporal dynamics or semantic meaning), which provides a means for general pre-training on time series.

To realize TF-C, the model $F$ has four components: a time encoder $G_T$, a frequency encoder $G_F$, and two cross-space projectors $R_T$ and $R_F$ that map time-based and frequency-based representations, respectively, into the same time-frequency space (Figure 2). Together, these four components embed $x_i$ into the latent time-frequency space so that the time-based embedding $z_i^T = R_T(G_T(x_i^T))$ and the frequency-based embedding $z_i^F = R_F(G_F(x_i^F))$ are close together.

Figure 2: Overview of the TF-C approach. The TF-C pre-training model $F$ consists of four components: a time encoder $G_T$, a frequency encoder $G_F$, and two cross-space projectors $R_T$ and $R_F$. For an input time series $x_i$, the model generates time-based representations ($z_i^T$ and $\tilde{z}_i^T$ for the input $x_i$ and its augmented version, respectively) and frequency-based representations ($z_i^F$ and $\tilde{z}_i^F$ for the input $x_i$ and its augmented version, respectively). The TF-C property is realized by promoting the alignment of time-based and frequency-based representations in the latent time-frequency space, which provides a means of transferring $F$ to a previously unseen target dataset.

## Proposed Method

The architecture of the developed self-supervised contrastive pre-training model F is shown next.

### Time-based contrastive encoder

For a given input time series sample $x_i$, an augmentation set $\mathcal{X}_i^T$ is generated through a time-based augmentation bank $\mathcal{B}^T: x_i^T \rightarrow \mathcal{X}_i^T$. Each element $\tilde{x}_i^T \in \mathcal{X}_i^T$ is augmented from $x_i$ based on its temporal characteristics. Here, the time-based augmentation bank includes jittering, scaling, time shifting, and neighborhood segments, all of which are well established in contrastive learning. The authors build the augmentation bank to produce diverse augmentations (rather than a single type of augmentation), which exposes the model to complex temporal dynamics and yields more robust time-based embeddings.
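
As a rough illustration of such a bank, the sketch below implements three of the four listed augmentations (jittering, scaling, and time shifting) in NumPy and omits neighborhood segments for brevity. The function names, the noise levels, and the use of a circular shift are assumptions made for this example, not settings taken from the paper.

```python
import numpy as np

def jitter(x, rng, sigma=0.05):
    """Add small Gaussian noise to every observation (sigma is an assumed value)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, rng, sigma=0.1):
    """Multiply each channel by a random factor close to 1 (sigma is an assumed value)."""
    factor = rng.normal(1.0, sigma, size=(x.shape[0], 1))
    return x * factor

def time_shift(x, rng, max_shift=10):
    """Circularly shift the series along the time axis (circular shift is an assumption)."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(x, shift, axis=-1)

def time_augmentation_bank(x, rng):
    """Return a set of augmented versions of x, one per augmentation type."""
    return [jitter(x, rng), scale(x, rng), time_shift(x, rng)]

rng = np.random.default_rng(0)
# Hypothetical sample with shape (channels=1, timestamps=128).
x_i = np.sin(np.linspace(0, 4 * np.pi, 128))[None, :]

bank = time_augmentation_bank(x_i, rng)
# One augmented sample is drawn at random to form the positive pair with x_i.
x_tilde = bank[int(rng.integers(len(bank)))]
print(len(bank), x_tilde.shape)
```

Every augmentation preserves the input shape, so the original and augmented samples can be passed through the same encoder.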

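For the input $x_i$, an augmented sample $\tilde{x}_i^T \in \mathcal{X}_i^T$ is randomly selected and fed into the contrastive time encoder $G_T$, which maps samples to embeddings. This gives $h_i^T = G_T(x_i^T)$ and $\tilde{h}_i^T = G_T(\tilde{x}_i^T)$. Because $\tilde{x}_i^T$ is generated from $x_i^T$ by a temporal augmentation, the embedding $h_i^T$ is assumed to be close to $\tilde{h}_i^T$ after passing through $G_T$, and far from the embeddings $h_j^T$ and $\tilde{h}_j^T$ obtained from another sample $x_j^T$ and its augmentation $\tilde{x}_j^T$.

**Contrastive Time Loss** To maximize similarity within the positive pair $(x_i^T, \tilde{x}_i^T)$ and minimize similarity within the negative pairs ($(x_i^T, x_j^T)$ and $(x_i^T, \tilde{x}_j^T)$), we employ the NT-Xent loss (normalized temperature-scaled cross-entropy loss), widely used in contrastive learning, as the distance function $d$. The loss function for the time-based contrastive encoder is defined as follows:

$$\mathcal{L}_{T,i} = d(h_i^T, \tilde{h}_i^T, \mathcal{D}^{\text{pret}}) = -\log \frac{\exp(\operatorname{sim}(h_i^T, \tilde{h}_i^T)/\tau)}{\sum_{x_j \in \mathcal{D}^{\text{pret}}} \mathbb{1}_{i \neq j} \exp(\operatorname{sim}(h_i^T, G_T(x_j))/\tau)} \tag{1}$$

where $\operatorname{sim}(u, v) = u^{\top} v / (\|u\| \|v\|)$ is the cosine similarity, $\mathbb{1}_{i \neq j}$ is an indicator function that is 0 for $i = j$ and 1 otherwise, and $\tau$ is a temperature parameter that adjusts the scale. $x_j \in \mathcal{D}^{\text{pret}}$ refers to a different time series sample or its augmented sample. This loss function drives the time encoder $G_T$ to produce time-based embeddings that are close together for positive pairs and pushed apart for negative pairs.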
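
To make the NT-Xent loss concrete, here is a minimal NumPy sketch of a batch-level version in the style popularized by SimCLR. It is an illustration under stated assumptions rather than the authors' implementation: the function name `nt_xent`, the toy embeddings, and the choice to average the two directions of each positive pair are all hypothetical, and the denominator runs over the current batch instead of the full pre-training dataset.

```python
import numpy as np

def nt_xent(h, h_aug, tau=0.2):
    """SimCLR-style NT-Xent over a batch; returns one loss value per original sample."""
    n = h.shape[0]
    z = np.concatenate([h, h_aug], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm, so dot product = cosine
    logits = z @ z.T / tau                             # pairwise similarity / temperature
    np.fill_diagonal(logits, -np.inf)                  # indicator: drop the i == j term
    # Numerically stable log of the denominator (sum over all non-self entries).
    m = logits.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    # The positive of row i is row i + n, and vice versa.
    pos_index = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    pos_logit = logits[np.arange(2 * n), pos_index]
    loss_all = log_denom - pos_logit
    return 0.5 * (loss_all[:n] + loss_all[n:])

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))                           # hypothetical encoder outputs
loss_aligned = nt_xent(h, h + 0.01 * rng.normal(size=h.shape))
loss_shuffled = nt_xent(h, np.roll(h, 1, axis=0))      # positives deliberately mismatched
print(loss_aligned.mean(), loss_shuffled.mean())
```

When each augmented embedding stays close to its original, the loss is small; when the positives are mismatched, the loss grows, which is the behavior the encoder is trained to exploit.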

### Frequency-based contrastive encoder

The frequency spectrum $x_i^F$ is generated from a time series sample $x_i^T$ through a transform operator (e.g., the Fourier transform). Frequency information in time series is universal and plays an important role in classical signal processing, but it has rarely been studied in self-supervised contrastive representation learning for time series. Here the authors develop augmentations that perturb $x_i^F$ based on characteristics of the frequency spectrum, and show how frequency-based representations are generated.

Since every frequency component in the frequency spectrum denotes a basis function (e.g., a sinusoidal function for the Fourier transform) with a corresponding frequency and amplitude, the frequency spectrum is perturbed by adding or removing frequency components. Small perturbations in the frequency domain can cause large changes in the temporal pattern in the time domain. To ensure that the perturbed time series still resembles the original sample (in the time domain as well as the frequency domain; Figure 6), a small budget $E$ is used for the perturbation, where $E$ is the number of frequency components that are manipulated. To remove frequency components, $E$ frequency components are selected at random and their amplitudes are set to zero. To add frequency components, $E$ frequency components are selected at random from those with amplitudes smaller than $\alpha \cdot A_m$, and their amplitudes are raised to $\alpha \cdot A_m$, where $A_m$ is the maximum amplitude in the frequency spectrum and $\alpha$ is a predefined coefficient that adjusts the scale of the perturbed frequency components ($\alpha = 0.5$ in this paper). Through the frequency augmentation bank $\mathcal{B}^F: x_i^F \rightarrow \mathcal{X}_i^F$, an augmentation set $\mathcal{X}_i^F$ is generated for $x_i^F$. As mentioned above, $\mathcal{B}^F$ has two augmentation methods (i.e., removing or adding frequency components), so $|\mathcal{X}_i^F| = 2$.
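
The two frequency perturbations described above can be sketched as follows. This is a minimal NumPy example that assumes the perturbations act on the amplitude spectrum of a hypothetical toy signal; the function names are illustrative, and only $E = 1$ and $\alpha = 0.5$ are taken from the text.

```python
import numpy as np

def remove_frequency(amp, rng, E=1):
    """Set the amplitude of E randomly chosen frequency components to zero."""
    out = amp.copy()
    idx = rng.choice(len(amp), size=E, replace=False)
    out[idx] = 0.0
    return out

def add_frequency(amp, rng, E=1, alpha=0.5):
    """Raise E randomly chosen weak components to alpha times the maximum amplitude."""
    out = amp.copy()
    target = alpha * amp.max()
    candidates = np.flatnonzero(amp < target)   # only components weaker than alpha * A_m
    idx = rng.choice(candidates, size=min(E, len(candidates)), replace=False)
    out[idx] = target
    return out

rng = np.random.default_rng(0)
# Hypothetical toy signal: a strong 4-cycle component and a weaker 12-cycle component.
t = np.arange(128) / 128
x_time = np.sin(2 * np.pi * 4 * t) + 0.3 * np.sin(2 * np.pi * 12 * t)
amp = np.abs(np.fft.rfft(x_time))               # amplitude spectrum of the sample

freq_bank = [remove_frequency(amp, rng), add_frequency(amp, rng)]
print([int(np.sum(a != amp)) for a in freq_bank])
```

Because only $E$ components change, each perturbed spectrum differs from the original in at most one bin here, which keeps the augmented sample close to the original.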

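The frequency encoder $G_F$ maps a frequency spectrum (e.g., $x_i^F$) to a frequency-based embedding (e.g., $h_i^F = G_F(x_i^F)$). We assume that $G_F$ can learn similar embeddings for the original frequency spectrum $x_i^F$ and a slightly perturbed frequency spectrum $\tilde{x}_i^F$. Therefore, $(x_i^F, \tilde{x}_i^F)$ is treated as the positive pair, and $(x_i^F, x_j^F)$ and $(x_i^F, \tilde{x}_j^F)$ as the negative pairs.

**Contrastive Frequency Loss**

The frequency-based contrastive loss of sample $x_i$ is calculated as follows:

$$\mathcal{L}_{F,i} = d(h_i^F, \tilde{h}_i^F, \mathcal{D}^{\text{pret}}) = -\log \frac{\exp(\operatorname{sim}(h_i^F, \tilde{h}_i^F)/\tau)}{\sum_{x_j \in \mathcal{D}^{\text{pret}}} \mathbb{1}_{i \neq j} \exp(\operatorname{sim}(h_i^F, G_F(x_j))/\tau)} \tag{2}$$

Preliminary experiments show that the value of $\tau$ has little effect on performance, so the same $\tau$ is used throughout all experiments. $\mathcal{L}_{F,i}$ drives the frequency encoder $G_F$ to produce embeddings that are invariant to perturbations of the frequency spectrum.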

### Time-frequency consistency

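To encourage the learned embeddings to satisfy TF-C, the authors develop a consistency loss term $\mathcal{L}_{C,i}$: for a given sample, its time-based and frequency-based embeddings (and their local neighborhoods) are encouraged to be close to each other. To make the distance between the embeddings measurable, the time encoder output $h_i^T = G_T(x_i^T)$ is mapped from the time space and the frequency encoder output $h_i^F = G_F(x_i^F)$ from the frequency space into a joint time-frequency space through the projectors $R_T$ and $R_F$, respectively. Specifically, each input sample $x_i$ has four embeddings: $z_i^T = R_T(h_i^T)$, $\tilde{z}_i^T = R_T(\tilde{h}_i^T)$, $z_i^F = R_F(h_i^F)$, and $\tilde{z}_i^F = R_F(\tilde{h}_i^F)$. The first two embeddings are generated from temporal characteristics, while the latter two are generated from characteristics of the frequency spectrum.

To enforce that the embeddings in the time-frequency space follow TF-C, the authors design a consistency loss $\mathcal{L}_{C,i}$ that measures the distance between time-based and frequency-based embeddings. Here, $S_i^{TF} = d(z_i^T, z_i^F, \mathcal{D}^{\text{pret}})$ represents the distance between $z_i^T$ and $z_i^F$. Similarly, $S_i^{T\tilde{F}}$, $S_i^{\tilde{T}F}$, and $S_i^{\tilde{T}\tilde{F}}$ are defined for the pairs $(z_i^T, \tilde{z}_i^F)$, $(\tilde{z}_i^T, z_i^F)$, and $(\tilde{z}_i^T, \tilde{z}_i^F)$.

Next, consider $S_i^{TF}$ and $S_i^{T\tilde{F}}$, which involve the three embeddings $z_i^T$, $z_i^F$, and $\tilde{z}_i^F$. Here, $z_i^T$ and $z_i^F$ are learned from the original samples ($x_i^T$ and $x_i^F$), while $\tilde{z}_i^F$ is learned from the augmented $\tilde{x}_i^F$. Thus, intuitively, $z_i^T$ should be closer to $z_i^F$ than to $\tilde{z}_i^F$. This relative relationship encourages the proposed model to learn an $S_i^{TF}$ that is smaller than $S_i^{T\tilde{F}}$. Inspired by the triplet loss, the authors use $(S_i^{TF} - S_i^{T\tilde{F}} + \delta)$ as a term in the consistency loss $\mathcal{L}_{C,i}$, where $\delta$ is a given constant margin that keeps negative samples far away. This term optimizes the model so that $S_i^{TF}$ is small and $S_i^{T\tilde{F}}$ is relatively large. Similarly, $S_i^{TF}$ should be smaller than $S_i^{\tilde{T}F}$ and $S_i^{\tilde{T}\tilde{F}}$. In summary, the consistency loss $\mathcal{L}_{C,i}$ for sample $x_i$ is computed as follows:

$$\mathcal{L}_{C,i} = \sum_{S^{\text{pair}}} \left( S_i^{TF} - S_i^{\text{pair}} + \delta \right), \quad S^{\text{pair}} \in \{ S_i^{T\tilde{F}}, S_i^{\tilde{T}F}, S_i^{\tilde{T}\tilde{F}} \} \tag{3}$$

where $S_i^{\text{pair}}$ is the distance between a time-based embedding and a frequency-based embedding. In each pair, at least one embedding is derived from an augmented sample instead of the original sample. $\delta$ is a predefined constant. By combining all triplet loss terms, $\mathcal{L}_C$ encourages the pre-trained model to capture the consistency between time-based and frequency-based embeddings during model optimization. Note that Equation 3 does not explicitly measure the loss between different time series samples, but the relationship between samples is implicitly covered in the calculation of $S_i^{TF}$ and $S_i^{\text{pair}}$.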
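
The triplet-style construction can be sketched in NumPy as below. The sketch assumes a simplified cross-view contrastive distance computed within a batch (the helper `contrastive_distance` is hypothetical and not necessarily the authors' exact $d$), and it sums the terms $S^{TF} - S^{\text{pair}} + \delta$ over the three pairs without a hinge, following the description above.

```python
import numpy as np

def contrastive_distance(a, b, tau=0.2):
    """Per-sample NT-Xent-style distance between matched rows of a and b (simplified)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau                          # (N, N) cosine similarities / tau
    m = logits.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return log_denom - np.diag(logits)              # small when row i of a matches row i of b

def consistency_loss(z_t, z_t_aug, z_f, z_f_aug, delta=1.0):
    """Sum of (S_TF - S_pair + delta) over the three pairs that involve an augmentation."""
    s_tf = contrastive_distance(z_t, z_f)
    pairs = [
        contrastive_distance(z_t, z_f_aug),         # original time vs augmented frequency
        contrastive_distance(z_t_aug, z_f),         # augmented time vs original frequency
        contrastive_distance(z_t_aug, z_f_aug),     # augmented time vs augmented frequency
    ]
    return sum(s_tf - s_pair + delta for s_pair in pairs)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                        # hypothetical projected embeddings
far_a = rng.normal(size=(8, 16))
far_b = rng.normal(size=(8, 16))

loss_identical = consistency_loss(z, z, z, z)       # every distance equal: 3 * delta
loss_ordered = consistency_loss(z, far_a, z, far_b) # S_TF much smaller than the pairs
print(loss_identical.mean(), loss_ordered.mean())
```

When all four embeddings coincide, every difference vanishes and the loss equals $3\delta$; when the original time and frequency embeddings agree while the augmented ones are far away, the loss is lower, which is the ordering the term rewards.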

### Implementation and technical details

The overall loss function in pre-training has three terms. First, the time-based contrastive loss $\mathcal{L}_T$ drives the model to learn embeddings that are invariant to temporal augmentations. Second, the frequency-based contrastive loss $\mathcal{L}_F$ drives the model to learn embeddings that are invariant to augmentations of the frequency spectrum. Third, the consistency loss $\mathcal{L}_C$ guides the model to preserve consistency between time-based and frequency-based embeddings. In summary, the pre-training loss is defined as follows:

$$\mathcal{L}_{\text{TF-C},i} = \lambda \left( \mathcal{L}_{T,i} + \mathcal{L}_{F,i} \right) + (1 - \lambda) \mathcal{L}_{C,i} \tag{4}$$

Here, $\lambda$ controls the relative importance of the contrastive and consistency losses. The total loss is computed by summing $\mathcal{L}_{\text{TF-C},i}$ across all pre-training samples. In the implementation, the contrastive losses are computed within batches. Following the problem definition, the model $F$ to be trained is the combination of the neural networks $G_T$, $R_T$, $G_F$, and $R_F$. Once pre-training is complete, the parameters of the entire model are stored and denoted as $F(\cdot, \Theta)$, where $\Theta$ represents all trainable parameters. When a sample $x_i^{\text{tune}}$ is presented, the fine-tuned model $F$ generates an embedding $z_i^{\text{tune}}$ by concatenation as follows:

$$z_i^{\text{tune}} = F(x_i^{\text{tune}}, \Phi) = [z_i^{\text{tune},T}; z_i^{\text{tune},F}]$$

where $\Phi$ denotes the parameters of the fine-tuned model.
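
The way the three losses and the two embeddings are combined can be sketched as follows. The weighting form $\lambda(\mathcal{L}_T + \mathcal{L}_F) + (1 - \lambda)\mathcal{L}_C$ and the function names are assumptions made for this illustration, and the loss values and embedding sizes are made-up numbers.

```python
import numpy as np

def tfc_loss(loss_t, loss_f, loss_c, lam=0.5):
    """Weighted sum of the two contrastive losses and the consistency loss (assumed form)."""
    return lam * (loss_t + loss_f) + (1.0 - lam) * loss_c

def fine_tune_embedding(z_time, z_freq):
    """Concatenate time-based and frequency-based embeddings along the feature axis."""
    return np.concatenate([z_time, z_freq], axis=-1)

# Made-up per-sample loss values, for illustration only.
total = tfc_loss(loss_t=1.0, loss_f=2.0, loss_c=4.0, lam=0.5)
print(total)  # 0.5 * (1.0 + 2.0) + 0.5 * 4.0 = 3.5

# Hypothetical batch of 4 samples with 8-dimensional embeddings per view.
z_tune = fine_tune_embedding(np.zeros((4, 8)), np.ones((4, 8)))
print(z_tune.shape)  # (4, 16)
```

The concatenated embedding is what a downstream classifier, clustering method, or anomaly detector would consume during fine-tuning.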

## Experiments

The developed TF-C model is compared to 10 baselines in 8 diverse data sets. Time series classification tasks were investigated in the context of one-to-one and one-to-many transfer learning setups. TF-C was also evaluated in downstream tasks such as clustering and anomaly detection.

**Datasets** (1) SLEEPEEG has 371,055 univariate electroencephalogram recordings (EEG; 100 Hz) collected from 197 individuals. Each sample is associated with one of five sleep stages. (2) EPILEPSY monitors brain activity in 500 subjects with a single-channel EEG sensor (174 Hz). Each sample carries a binary label indicating whether the subject is epileptic or not. (3) FD-A collects vibration signals from rolling bearings in a mechanical system for fault detection purposes. Each sample has 5,120 timestamps and an indicator of one of three mechanical system conditions. (4) FD-B uses the same setup as FD-A, but the rolling bearings are run under different working conditions (e.g., varying rotational speeds). (5) HAR has 10,299 9-dimensional samples from six daily activities. (6) GESTURE includes 440 samples collected from 8 hand gestures recorded with accelerometers. (7) ECG contains 8,528 single-sensor ECG recordings classified into four classes based on human physiology. (8) EMG consists of 163 EMG samples with 3-class labels suggestive of muscle disease.

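**Baselines** Ten baseline methods were considered. These include eight state-of-the-art methods (TS-SD, TS2vec, CLOCS, Mixing-up, TS-TCC, SimCLR, TNC, and CPC) and two further baselines: a non-deep-learning method (KNN) and a randomly initialized model without pre-training (Random Init.).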

**Implementation** We use two 3-layer 1D ResNets as the backbones for the encoders $G_T$ and $G_F$. Because the datasets contain long time series (5,120 observations per sample in FD-A and FD-B), preliminary experiments suggested that ResNet is a better choice than Transformer variants. Two fully connected layers are used for $R_T$ and $R_F$, with no shared parameters. We set $E = 1$ and $\alpha = 0.5$ for the frequency augmentations, and $\tau = 0.2$, $\delta = 1$, and $\lambda = 0.5$ for the loss function.

### Results: one-to-one pre-training evaluation

**Setup** In a one-to-one evaluation, the model is pre-trained on one pre-training data set and used for fine tuning on only one target data set.

Scenario 1 (SLEEPEEG → EPILEPSY): Pre-training is done with SLEEPEEG and fine tuning is done with EPILEPSY. Both datasets describe single-channel EEG, but the signals are from different channels/positions of the scalp, track different physiologies (sleep and epilepsy) and are collected from different patients.

Scenario 2 (FD-A → FD-B): The data set describes a mechanical device operating under different working conditions, such as rotational speed, load torque, and radial force.

Scenario 3 (HAR → GESTURE): different activities are recorded in the data sets (6 different human daily activities vs. 8 different hand gestures). Both data sets include acceleration signals, but HAR has 9 channels and GESTURE has 1 channel.

Scenario 4 (ECG → EMG): Both are physiological datasets, but ECG records electrical signals from the heart, whereas EMG measures the muscle response when nerves stimulate the muscles.

The discrepancies between the pre-training and fine-tuning datasets in the above four scenarios are very large and cover a wide range of variations across time series datasets: semantic meaning, sampling frequency, time series length, number of classes, and system factors (e.g., number of devices and subjects). In addition, the small number of samples available for fine-tuning (EPILEPSY: 60, FD-B: 60, GESTURE: 480, EMG: 122) makes the setup challenging.

**Results** The results for the four scenarios are shown in Tables 1 and 4-6. Overall, the TF-C model wins in 16 of the 24 tests (6 metrics in each of the 4 scenarios) and is second best in the other 8 tests. All metrics are reported, but the discussion below focuses on the F1 score. On average, the TF-C model outperforms all baselines by a large margin of 15.4%. While the strongest baseline varies by scenario (e.g., TS-TCC in Scenario 2, Mixing-up in Scenario 3), the TF-C model outperforms the strongest baseline by 1.5% across all scenarios. Specifically, as shown in Table 1 (HAR → GESTURE, Scenario 3), TF-C achieves the best F1 score of 79.91%, a 7.2% margin over the best baseline TS-TCC (74.57%). One possible reason is that Scenario 3 involves complex datasets (6 classes in HAR and 8 classes in GESTURE) that are difficult to model. The complexity of Scenario 3 is further reflected in the lower performance of all models (around 80%) compared with the other scenarios (around 90%).
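
The 7.2% figure is consistent with a relative improvement over the baseline rather than a difference in percentage points, as the following arithmetic on the two reported F1 scores suggests:

$$\frac{79.91 - 74.57}{74.57} = \frac{5.34}{74.57} \approx 0.072 = 7.2\%$$

Read this way, the absolute gap between the two F1 scores is 5.34 percentage points.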

Table 1: One-to-one pre-training evaluation (Scenario 3): pre-training on HAR, followed by fine-tuning on GESTURE.

Table 4: Performance in the one-to-one setting (Scenario 1): pre-training on SLEEPEEG, followed by fine-tuning on EPILEPSY.

Table 5: Performance in the one-to-one setting (Scenario 2): pre-training on FD-A, followed by fine-tuning on FD-B.

Table 6: Performance in the one-to-one setting (Scenario 4): pre-training on ECG, followed by fine-tuning on EMG.

### Results: one-to-many pre-training evaluation

**Setup** The one-to-many evaluation involves pre-training on one dataset, followed by independent fine-tuning on multiple target datasets, without re-starting pre-training from scratch. Of the eight datasets, SLEEPEEG has the most complex temporal dynamics and is the largest (371,055 samples). Therefore, we pre-train on SLEEPEEG and separately fine-tune the pre-trained model on EPILEPSY, FD-B, GESTURE, and EMG.

**Results** Results are shown in Table 2. Because EEG signals have little in common with vibration, acceleration, or EMG signals, transfer learning is expected to be less effective than in the one-to-one evaluation. In the bottom three blocks (SLEEPEEG → {FD-B, GESTURE, EMG}), the pre-training and fine-tuning datasets are very different. While it is not surprising that larger gaps degrade baseline performance, the TF-C model is markedly more tolerant of knowledge transfer between datasets with large gaps. Notably, the proposed TF-C model obtained the best performance in 14 of the 18 tests across these three challenging settings. This suggests that the TF-C assumption holds broadly for time series. The model has great potential to serve as a universal model when no large pre-training dataset similar to the fine-tuning dataset is available. Furthermore, TF-C consistently outperforms KNN and Random Init. (without pre-training).

Table 2: One-to-many pre-training evaluation. Pre-training on SLEEPEEG, followed by independent fine-tuning on EPILEPSY, FD-B, GESTURE, and EMG.

### Ablation study

We evaluate how much each model component contributes to effective pre-training. As shown in Table 9 (SLEEPEEG → EPILEPSY), removing any of $\mathcal{L}_C$, $\mathcal{L}_T$, or $\mathcal{L}_F$ degrades performance (accuracy). To verify that the improvement is not simply due to having a third loss term that measures some kind of consistency, the consistency loss $\mathcal{L}_C$ was replaced with a loss term that measures consistency only within the time space ($\mathcal{L}_{\text{TT-C}}$) or only within the frequency space ($\mathcal{L}_{\text{FF-C}}$). The consistency loss of the TF-C model outperformed both $\mathcal{L}_{\text{TT-C}}$ and $\mathcal{L}_{\text{FF-C}}$.

Table 9: Ablation study (SLEEPEEG → EPILEPSY).

### Additional downstream tasks: clustering and anomaly detection

**Clustering Task** Using SLEEPEEG → EPILEPSY as an example, we evaluate the clustering performance of TF-C. Specifically, K-means (K=2, since EPILEPSY has two classes) is applied on top of $z_i^{\text{tune}}$ in fine-tuning. The evaluation metrics are the commonly used silhouette score, adjusted Rand index (ARI), and normalized mutual information (NMI). Table 7 shows that TF-C clusters substantially better than the strongest baseline (TS-TCC), with a 5.4% improvement in silhouette score. This indicates that TF-C captures more distinctive representations thanks to the knowledge gained in pre-training, which is consistent with TF-C's advantage in the classification task described above.

Table 7: Performance on downstream clustering. Pre-training on the SLEEPEEG dataset, followed by independent fine-tuning on EPILEPSY. Five baselines were compared: two baselines without pre-training (Random Init. and Non-DL), the best-performing baseline in the classification task (TS-TCC), and two new models (TNC and CPC).

**Anomaly Detection Task** We evaluate how TF-C performs in a sample-level anomaly detection task. Note that this is sample-level anomaly detection, not observation-level anomaly detection. The former aims to detect anomalous time series samples based on global patterns, whereas the latter focuses on local context (as in BTSF and USAD) to find anomalous observations within a sample. Specifically, in the FD-A → FD-B scenario, a small subset of FD-B is constructed with 1,000 samples, 900 of which are from undamaged bearings and the remaining 100 from bearings with inner or outer damage. The undamaged samples are considered "normal" and the samples with inner or outer damage are considered "outliers." For fine-tuning, a one-class SVM was used on top of the learned representation $z_i^{\text{tune}}$. Experimental results (Table 8) show that the proposed TF-C outperforms the five competing baselines by 4.5% in F1 score. This result indicates that the proposed TF-C is more sensitive to anomalous samples and can effectively detect abnormal conditions in mechanical devices.

Table 8: Performance on downstream anomaly detection (FD-A → FD-B).

## Summary

This study developed a pre-training approach that introduces time-frequency consistency (TF-C) as a mechanism to support knowledge transfer between time series datasets. The approach uses self-supervised contrastive estimation and enforces TF-C during pre-training, bringing time-based and frequency-based representations and their local neighborhoods closer together in the latent space.

**Limitations and Future Directions** The TF-C property serves as a universal property for pre-training on diverse time series datasets. Additional generalizable properties, such as temporal autoregressive processes, may also be useful for pre-training on time series. Furthermore, although the method assumes regularly sampled time series as input, irregularly sampled time series can also be handled by using encoders that can embed irregular time series (such as Raindrop or SeFT). For the input of the frequency encoder, there are several options: a regularly sampled signal can be obtained by resampling or interpolation, or a regular or non-uniform FFT can be applied. In addition, the TF-C embedding strategy and loss function favor classification, which exploits global information, over tasks that rely on local context (e.g., forecasting). The results show that the TF-C approach performs well on a wide range of downstream tasks, including classification, clustering, and anomaly detection.
