# Transformer For Time-series Anomaly Detection

*3 main points* ✔️ Finally, Transformer appears in multivariate time series anomaly detection!

✔️ Deep learning, including graphs, has improved the representation of multivariate time series but is still limited to a single point in time

✔️ The paper exploits the Transformer's ability to model global, long-range associations with a two-branch structure built around a modified self-attention (Anomaly-Attention), and confirms that it outperforms previous SOTA methods

Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy

written by Jiehui Xu, Haixu Wu, Jianmin Wang, Mingsheng Long

(Submitted on 6 Oct 2021 (v1), last revised 13 Feb 2022 (this version, v4))

Comments: arXiv

Subjects: Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

## Background

The Transformer has finally arrived in multivariate time series anomaly detection. Strictly speaking, it was already used before this paper in GTA (Chen et al., 2021), a model that learns the relationships among multiple IoT sensors as a graph structure; there it models the time axis and provides the reconstruction criterion for anomaly detection. The block diagram is attached. Other models such as TranAD and TiSAT are also being published one after another, and I would like to introduce them when I have the chance.

The Anomaly Transformer introduced here modifies the self-attention mechanism so that it suits anomaly detection.

As in our previous article, we focus on unsupervised learning. In real-world data, anomalies are rare and labeling them is difficult, so a normal/abnormal discrimination criterion has to be built without any labeled data. Classical approaches include density estimation and clustering, but they ignore the temporal component and are hard to generalize to unseen real-world scenarios. Recent deep learning models have achieved excellent results thanks to the representation learning ability of neural networks. The main category learns pointwise representations with recurrent networks, trained in a self-supervised way through reconstruction or autoregressive tasks. However, because anomalous data is scarce, such pointwise representations struggle to discriminate complex temporal patterns. In addition, the reconstruction or prediction error is computed at each time point separately, which makes it hard to describe the temporal context comprehensively.

Another category of methods detects anomalies through explicit association modeling. Vector autoregression and state-space models belong to this category, as do graph-based methods. As introduced previously, GNNs have been applied to learn dynamic graphs of multivariate time series. They are more expressive, but the learned graph is still limited to a single time point. Subsequence-based methods, on the other hand, detect anomalies by computing the similarity between subsequences. However, they cannot capture the fine-grained temporal relationship between each time point and the whole series.

In this paper, the Transformer is applied to unsupervised time series anomaly detection. Transformers are widely used because they model global representations and long-range relations in a unified way. When applied to time series, each row of the self-attention map represents the temporal association of one time point with all the others. We call this __series association__. Furthermore, because anomalies are rare and normal patterns dominate, it is harder for anomalies to build strong associations with the whole series. Instead, the associations of anomalies should concentrate on adjacent time points, which, due to continuity, are more likely to contain similar abnormal patterns. This inductive bias toward adjacent concentration is called __prior association__. In contrast, the dominant normal time points are not restricted to adjacent regions and can find informative associations with the whole series. Based on this observation, we exploit the inherent normal/abnormal discriminability of the association distribution, which leads to a new anomaly criterion for each time point. It is quantified by the distance between the prior association and the series association of each time point and is called Association Discrepancy. As mentioned above, anomalies have a smaller Association Discrepancy than normal time points, because their associations are more likely to concentrate on adjacent points.

We introduce the Transformer to unsupervised time series anomaly detection and propose the Anomaly Transformer for association learning. To compute the Association Discrepancy, the self-attention mechanism is redesigned as Anomaly-Attention, a two-branch structure that models the prior association and the series association of each time point. The prior association uses a learnable Gaussian kernel to express the inductive bias of adjacent concentration at each time point. The series association, on the other hand, corresponds to the self-attention weights learned from the raw series. In addition, a minimax strategy is applied between the two branches. It amplifies the normal/abnormal discriminability of the Association Discrepancy and leads to a new association-based criterion.

The contributions of this paper are threefold.

Based on the key observation of the Association Discrepancy, we propose the Anomaly Transformer with an Anomaly-Attention mechanism. It models the prior association and the series association simultaneously and thereby embodies the Association Discrepancy.

We propose a minimax strategy that amplifies the normal/abnormal discriminability of the Association Discrepancy and derive a new association-based detection criterion.

The Anomaly Transformer achieves SOTA anomaly detection results on six benchmarks from three real-world applications. This is supported by extensive ablations and insightful case studies.

## Related research

Unsupervised time series anomaly detection methods can be classified as follows.

*Density estimation methods*

LOF (Local Outlier Factor) and COF (Connectivity Outlier Factor) compute local density and local connectivity to determine outliers, while DAGMM and MPPCACD incorporate Gaussian mixture models to estimate density.

*Clustering-based methods*

The anomaly score is given by the distance to the cluster center. SVDD and Deep SVDD gather the representations of normal data into a compact cluster. THOC fuses multi-scale temporal features from intermediate layers through a hierarchical clustering mechanism and detects anomalies from the multi-layer distances.

*Reconstruction-based methods*

Park et al. used an LSTM-VAE model, with the LSTM handling temporal modeling and the VAE handling reconstruction. OmniAnomaly extended this further and used the reconstruction probability for detection. InterFusion replaced the backbone with a hierarchical VAE to model both intra- and inter-series dependencies. GANs are also used for reconstruction-based modeling.

*Autoregressive methods*

These methods detect anomalies from the prediction error. VAR extends ARIMA, and the autoregressive model can also be replaced by an LSTM.

## Method

### Anomaly-Transformer

Anomaly-Attention blocks and feed-forward layers are stacked alternately to form the Anomaly-Transformer, as shown in Fig. 1. This stacking helps the model learn the underlying associations from deep multi-level features. The equation can be expressed as follows.
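A sketch of the layer update in LaTeX, assuming the notation of the original paper as I read it: $X^{l}$ is the output of layer $l$ (for $l = 1, \dots, L$) and $Z^{l}$ is the hidden representation after the attention block. These symbols are not defined elsewhere in this article, so treat them as assumptions.

```latex
\begin{aligned}
Z^{l} &= \mathrm{LayerNorm}\big(\text{Anomaly-Attention}(X^{l-1}) + X^{l-1}\big) \\
X^{l} &= \mathrm{LayerNorm}\big(\text{Feed-Forward}(Z^{l}) + Z^{l}\big)
\end{aligned}
```

Both lines are ordinary residual connections followed by layer normalization; only the attention block differs from a standard Transformer encoder layer.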

*Anomaly-Attention*

Since the usual self-attention mechanism cannot model the prior association and the series association at the same time, we propose Anomaly-Attention, which has two branches. For the prior association, a learnable Gaussian kernel computes the prior from the relative temporal distance between time points. By learning a scale parameter σ for each time point, it adapts to different time series patterns, such as anomalous segments of different lengths. The series association branch learns the associations from the raw series and adaptively finds the most effective ones. Both branches preserve the temporal dependencies of each time point, which are more informative than pointwise representations. The expression is as follows.
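The two branches can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, σ is passed in directly instead of being learned by a projection of the input, and the row-wise rescaling of the Gaussian kernel is an assumption based on my reading of the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def prior_association(sigmas):
    """Prior branch: row i is a Gaussian in the distance |j - i| with scale
    sigmas[i], rescaled so that each row sums to 1."""
    n = len(sigmas)
    rows = []
    for i, sigma in enumerate(sigmas):
        row = [
            math.exp(-((j - i) ** 2) / (2.0 * sigma ** 2))
            / (math.sqrt(2.0 * math.pi) * sigma)
            for j in range(n)
        ]
        total = sum(row)
        rows.append([v / total for v in row])
    return rows


def series_association(queries, keys):
    """Series branch: ordinary scaled dot-product attention weights."""
    d = len(queries[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


# Hypothetical toy inputs: 4 time points, 2 features per query/key.
sigmas = [0.5, 1.0, 3.0, 1.0]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
P = prior_association(sigmas)   # 4 x 4, each row peaks at its own index
S = series_association(queries, keys)  # 4 x 4 attention weights
```

A small σ (time point 0 here) puts almost all prior weight on the point itself and its immediate neighbors, while a large σ (time point 2) spreads the weight over the whole window.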

*Association Discrepancy*

The Association Discrepancy is defined as the symmetrized KL divergence between the prior association and the series association, which represents the information gain between the two distributions. It is averaged over the layers.

For abnormal data, AssDis is smaller than for normal data.
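As a hedged illustration of this definition, the sketch below computes a symmetrized KL divergence per time point and averages it over layers. The function names, the `eps` smoothing term, and the toy matrices are assumptions made for the example, not part of the paper.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists.
    eps is an assumed smoothing term that avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def association_discrepancy(priors, series):
    """priors, series: one row-stochastic matrix per layer (one row per
    time point). Returns one discrepancy value per time point."""
    num_layers = len(priors)
    n = len(priors[0])
    result = []
    for i in range(n):
        total = 0.0
        for P, S in zip(priors, series):
            # Symmetrized KL between row i of the prior and of the series.
            total += kl(P[i], S[i]) + kl(S[i], P[i])
        result.append(total / num_layers)
    return result


# Hypothetical single-layer example with 3 time points.
P = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
S = [[0.7, 0.2, 0.1], [1 / 3, 1 / 3, 1 / 3], [0.1, 0.1, 0.8]]
ass_dis = association_discrepancy([P], [S])
```

In this toy case, time point 0 has a series association that stays concentrated near the point itself (the pattern expected for an anomaly) and therefore a smaller discrepancy than time point 1, whose series association is spread over the whole window.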

### Minimax association learning

The model is optimized with the reconstruction loss, which guides the series association toward the most informative associations. To amplify the difference between normal and abnormal time points, an additional loss enlarges the Association Discrepancy. Because the prior association is unimodal, this discrepancy loss forces the series association to pay more attention to non-adjacent regions. That makes anomalies harder to reconstruct and therefore easier to identify. The loss function combines the reconstruction loss and the Association Discrepancy as follows.
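A minimal sketch of this loss, assuming the form given in the original paper as I read it: the squared Frobenius norm of the reconstruction error minus λ times the L1 norm of the Association Discrepancy. The function name and the toy numbers are hypothetical.

```python
def total_loss(x, x_hat, ass_dis, lam):
    """Reconstruction loss minus lam times the summed association discrepancy.

    x, x_hat: N x d series and its reconstruction (lists of lists).
    ass_dis:  one discrepancy value per time point.
    lam:      trade-off weight; lam > 0 rewards a larger discrepancy.
    """
    recon = sum(
        (a - b) ** 2
        for row, row_hat in zip(x, x_hat)
        for a, b in zip(row, row_hat)
    )
    return recon - lam * sum(abs(v) for v in ass_dis)


# Hypothetical toy values: 2 time points, 2 features.
x = [[1.0, 2.0], [3.0, 4.0]]
x_hat = [[1.0, 1.0], [3.0, 2.0]]
loss = total_loss(x, x_hat, ass_dis=[0.5, 1.5], lam=2.0)
```

With a positive `lam`, minimizing the loss pushes the discrepancy up; flipping the sign of `lam` pushes it down.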

*Minimax strategy*

Directly maximizing the Association Discrepancy would drastically shrink the scale parameter of the Gaussian kernel and make the prior association meaningless, so we adopt the minimax strategy shown in Fig. 2. In the minimization phase, the prior association is driven to approximate the series association learned from the raw series. This lets the prior association adapt to various temporal patterns.

In the maximization phase, the series association is optimized to enlarge the Association Discrepancy. This forces the series association to pay more attention to non-adjacent regions.
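The two phases can be written compactly in LaTeX. This is a sketch under assumed notation from the original paper as I read it: $P$ and $S$ are the prior and series associations, the subscript "detach" means that gradients are not propagated through that branch, and $\mathcal{L}_{\mathrm{Total}}(\hat{X}, P, S, \lambda; X) = \|X - \hat{X}\|_F^2 - \lambda \, \|\mathrm{AssDis}(P, S; X)\|_1$.

```latex
\begin{aligned}
\text{Minimize phase:} \quad & \mathcal{L}_{\mathrm{Total}}\big(\hat{X}, P, S_{\mathrm{detach}}, -\lambda; X\big) \\
\text{Maximize phase:} \quad & \mathcal{L}_{\mathrm{Total}}\big(\hat{X}, P_{\mathrm{detach}}, S, \lambda; X\big)
\end{aligned}
```

With $-\lambda$ the discrepancy term is added to the loss, so only $P$ moves and it moves toward $S$. With $+\lambda$ the term is subtracted, so only $S$ moves and it moves away from $P$.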

*Association-based anomaly criterion*

The normalized Association Discrepancy is incorporated into the reconstruction criterion, so the score benefits from both the temporal representation and the distinguishable Association Discrepancy. The final anomaly score is as follows.
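A hedged sketch of such a score, assuming the form in the original paper as I read it: a softmax over the negated Association Discrepancy along the time axis, multiplied element-wise by the per-point squared reconstruction error. The function name and the toy inputs are hypothetical.

```python
import math


def anomaly_score(x, x_hat, ass_dis):
    """Per-time-point score: softmax(-ass_dis) * squared reconstruction error."""
    # Softmax over the negated discrepancy, so a small discrepancy
    # (the pattern expected for anomalies) receives a large weight.
    peak = max(-d for d in ass_dis)
    exps = [math.exp(-d - peak) for d in ass_dis]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # Squared L2 reconstruction error of each time point.
    errors = [
        sum((a - b) ** 2 for a, b in zip(row, row_hat))
        for row, row_hat in zip(x, x_hat)
    ]
    return [w * e for w, e in zip(weights, errors)]


# Hypothetical toy series: the middle point is poorly reconstructed and
# has a small association discrepancy.
x = [[0.0], [0.0], [0.0]]
x_hat = [[0.1], [1.0], [0.1]]
scores = anomaly_score(x, x_hat, ass_dis=[2.0, 0.2, 2.0])
```

Both factors point the same way for an anomaly: it is reconstructed badly and its discrepancy is small, so the product separates it from normal points more clearly than either factor alone.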

## Experiments

We use the following evaluation datasets, which include those commonly used in other papers: (1) SMD (Server Machine Dataset), (2) PSM (Pooled Server Metrics), (3) MSL (Mars Science Laboratory) and SMAP (Soil Moisture Active Passive satellite), (4) SWaT (Secure Water Treatment), (5) NeurIPS-TS (NeurIPS 2021 Time Series Benchmark).

The baselines are the reconstruction-based models InterFusion, BeatGAN, OmniAnomaly, and LSTM-VAE; the density estimation models DAGMM, MPPCACD, and LOF; the clustering-based models ITAD, THOC, and Deep-SVDD; the autoregressive models CL-MPPCA, LSTM, and VAR; and the classical methods OC-SVM and IsolationForest.

Table 1 summarizes the results. On every dataset, the proposed method achieves the highest F1 score.

Fig. 3 shows the ROC curve. As expected, the Anomaly Transformer shows the best results.

NeurIPS-TS is a benchmark proposed by Lai et al. that contains a variety of point and pattern anomalies. Here too, the Anomaly Transformer achieves the highest F1 score.

Table 2 shows the results of the ablation experiments. In terms of F1 score, the association-based criterion improves on the pure reconstruction criterion by 18.76%, and even using the Association Discrepancy directly as the criterion yields a significant improvement. The learnable prior association and the minimax strategy bring further improvements of 8.43% and 7.48%, respectively.

### Model analysis

To give an intuitive understanding of how the model behaves, its behavior is visualized in Fig. 5.

*Visualization of anomaly criteria*

We find that the association-based criterion is, in general, more distinguishable. Specifically, it yields consistently smaller values for the normal part, and the contrast with the anomalous part is especially stark in two of the point and pattern anomaly cases. The jittery curve of the reconstruction criterion, on the other hand, confuses the detection process and fails in those two cases. The association-based criterion thus highlights anomalies and gives clearly separated values for normal and abnormal points, which makes detection more accurate and reduces the false positive rate.

*Prior association visualization*

During minimax optimization, the prior association is trained to approach the series association, so the learned σ reflects how strongly each time point concentrates on its neighbors. As shown in Fig. 6, σ changes to adapt to the different data patterns in the time series. In particular, the prior association of anomalies usually has a smaller σ than that of normal time points, which matches the inductive bias of adjacent concentration for anomalies.

*Optimization strategy analysis*

With the reconstruction loss alone, anomalous and normal time points show similar association weights for adjacent time points, which corresponds to a contrast value close to 1 (Fig. 3). Maximizing the Association Discrepancy forces the series association to pay more attention to non-adjacent regions. To still be reconstructed well, anomalies must then keep much larger adjacent association weights than normal time points, which corresponds to a larger contrast value. Direct maximization, however, causes optimization problems for the Gaussian kernel and does not amplify the difference between normal and abnormal time points as strongly as expected (SMD: 1.15 → 1.27). The minimax strategy optimizes the prior association and thereby places a stronger constraint on the series association. As a result, it obtains more discriminative contrast values than direct maximization (SMD: 1.27 → 2.39), which improves performance.

More detailed evaluation results and dataset descriptions are provided in Appendices A-L of the paper.

## Summary

This paper studies the unsupervised time series anomaly detection problem. Unlike previous methods, it learns more informative time series associations with Transformers. Based on the key observation of the Association Discrepancy, the authors propose the Anomaly Transformer, whose two-branch Anomaly-Attention embodies that discrepancy. A minimax strategy further amplifies the difference between normal and abnormal time points. By introducing the Association Discrepancy, they propose an association-based criterion that combines the reconstruction performance with the Association Discrepancy. Extensive experiments on a range of datasets confirm that the Anomaly Transformer achieves SOTA results.

As future work, the authors plan to study the Anomaly Transformer theoretically in light of the classical analysis of autoregressive and state-space models.
