MTAD-GAT: Using Graph Attention for Multivariate Time Series Anomaly Detection

Time-series 30/06/2021

3 main points
✔️ A new framework is built that takes into account both the characteristics of multivariate time series data and the purpose for which the results are used.
✔️ Presenting a solution for cases where relationships exist between variables, rather than lumping univariate data together, will pave the way for a leap forward not only in space and cloud computing but also in many
✔️ The performance is even better than that of the SOTA methods published in the last few years.

Multivariate Time-series Anomaly Detection via Graph Attention Network
written by Hang Zhao, Yujing Wang, Juanyong Duan, Congrui Huang, Defu Cao, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, Qi Zhang
(Submitted on 4 Sep 2020)
Comments: Accepted by ICDM 2020.
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)


introduction

The framework developed is named MTAD-GAT, as the title of the paper suggests. For time series forecasting using deep learning, a previous AI SCHOLAR article, "Deep Learning Changes Future Forecasting", introduced a survey paper. That article listed elements such as LSTMs as building blocks and mentioned that deep learning may provide outputs that cannot be obtained with classical methods. However, it did not discuss multivariate data in depth.

LSTM-based encoder-decoders that model the reconstruction probability under normal conditions [3], and stochastic recurrent networks that model multivariate time series with stochastic latent variables [4], have been proposed. However, according to the paper, no prior work explicitly captures the correlations between the variables, and this paper addresses that gap.

method

MTAD-GAT treats each univariate time series in a multivariate time series as a feature and explicitly models the correlations between features. At the same time, it also models the temporal dependence within each time series.

The core components of the framework are two graph attention layers: a feature-oriented graph attention layer and a time-oriented graph attention layer. The feature-oriented layer captures the causal relationships between multiple features, and the time-oriented layer captures the dependencies along the time axis.

Before the data is fed into these layers, a 1D convolution is applied to the time series to extract high-level features.
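The convolution step can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `conv1d_same`, the kernel width of 7, the ReLU activation, and the zero padding that keeps the window length unchanged are all assumptions made for the example.

```python
import numpy as np

def conv1d_same(x, kernels, bias):
    """Apply a 1D convolution along the time axis of one window.

    x:       (n, k) window with n timestamps and k features
    kernels: (width, k, out_channels) filter bank (hypothetical shape)
    bias:    (out_channels,)
    Returns an (n, out_channels) array after ReLU.
    """
    width = kernels.shape[0]
    pad = (width - 1) // 2
    # Zero-pad the time axis so the output has the same length as the input.
    padded = np.pad(x, ((pad, width - 1 - pad), (0, 0)))
    out = np.stack([
        # Contract the (width, k) slice against the first two kernel axes.
        np.tensordot(padded[t:t + width], kernels, axes=([0, 1], [0, 1]))
        for t in range(x.shape[0])
    ]) + bias
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
window = rng.normal(size=(100, 5))             # n = 100 timestamps, k = 5 features
kernels = rng.normal(size=(7, 5, 5)) * 0.1     # assumed kernel width of 7
features = conv1d_same(window, kernels, np.zeros(5))
```

Keeping the output the same shape as the input window means the result can be passed to both graph attention layers without reshaping.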

For graph attention, see Graph Attention Networks. Graph Neural Networks are applied to objects that cannot be captured by the grid-like structure that models such as MLPs and CNNs assume, and Graph Attention Networks further increase the flexibility of the model by characterizing each node with self-attention. The two layers above can be optimized simultaneously by combining the objective functions.

The concatenated hidden representations are then fed into a GRU (Gated Recurrent Unit), which captures the sequential patterns of the time series. The output of the GRU is fed in parallel to the prediction-based model and the reconstruction-based model. A fully connected network is used for the prediction-based model and a VAE (Variational Auto-Encoder) for the reconstruction-based model.

data preprocessing

As preprocessing, min-max normalization and SR (Spectral Residual), a SOTA method for univariate anomaly detection, are applied.
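The normalization part can be sketched as below. This is an illustrative example under stated assumptions: the function names are hypothetical, the minimum and maximum are assumed to be taken per feature from the training data only, and the small `eps` guard against constant features is an addition for numerical safety. The SR step is not shown.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-feature minimum and maximum of the training data (shape (T, k))."""
    return train.min(axis=0), train.max(axis=0)

def min_max_normalize(x, lo, hi, eps=1e-8):
    """Scale x with the training statistics; eps avoids division by zero."""
    return (x - lo) / (hi - lo + eps)

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

lo, hi = fit_min_max(train)
train_scaled = min_max_normalize(train, lo, hi)
test_scaled = min_max_normalize(test, lo, hi)  # test values may fall outside [0, 1]
```

Reusing the training statistics for the test data keeps the two splits on the same scale, at the cost of test values occasionally leaving the unit interval.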

graph attention

The graph attention is expressed for each node by the following equation.
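Written out in the notation of the MTAD-GAT paper (the symbols $v_j$ for the input vector of node $j$ and $h_i$ for the output representation of node $i$ are taken from the paper and are not defined in this article), the node output reads:

```latex
h_i = \sigma\left( \sum_{j=1}^{L} \alpha_{ij} \, v_j \right)
```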

σ is the sigmoid activation function; L is the number of nodes in the neighborhood of i. The attention score α is obtained as follows.
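Following the paper's formulation, with $v_i$ the input vector of node $i$ and $w$ a learnable weight vector whose length is twice the node dimension (symbol names taken from the paper, not from this article), the score can be written as:

```latex
e_{ij} = \mathrm{LeakyReLU}\left( w^{\top} \cdot (v_i \oplus v_j) \right),
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{L} \exp(e_{il})}
```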

The ⊕ symbol (a plus sign inside a circle) denotes concatenation.

1. feature-oriented graph attention layer

In the feature-oriented graph attention layer, each node corresponds to one variable (feature) of the multivariate series, and the input of each node is the sequence vector of that variable over all n timestamps. The graph therefore has k nodes, where k is the number of variables.

2. time-oriented graph attention layer

In the time-oriented graph attention layer, the time series is cut by a sliding window. The graph has n nodes, one for each of the n timestamps in the window, and each node carries a vector whose dimension equals the number of variables.
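The two layers differ only in which axis of the window plays the role of the nodes, which the following NumPy sketch tries to make concrete. It is a rough illustration rather than the authors' code: the graph is assumed to be fully connected (every node attends to every node, including itself), the LeakyReLU slope of 0.2 is an assumption, the weight vector `w` is random instead of learned, and the function names are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def graph_attention(nodes, w):
    """Single-head graph attention over a fully connected graph.

    nodes: (num_nodes, dim) node vectors v_i
    w:     (2 * dim,) weight vector applied to the concatenation of v_i and v_j
    Returns (h, alpha): outputs of shape (num_nodes, dim) and attention
    scores of shape (num_nodes, num_nodes).
    """
    dim = nodes.shape[1]
    # w^T (v_i concat v_j) splits into w[:dim] . v_i + w[dim:] . v_j
    e = leaky_relu((nodes @ w[:dim])[:, None] + (nodes @ w[dim:])[None, :])
    e = e - e.max(axis=1, keepdims=True)          # for a numerically stable softmax
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    h = 1.0 / (1.0 + np.exp(-(alpha @ nodes)))    # sigmoid of the weighted sum
    return h, alpha

rng = np.random.default_rng(0)
n, k = 100, 5                        # n timestamps in the window, k variables
window = rng.normal(size=(n, k))

# Feature-oriented: k nodes, each a length-n sequence vector.
h_feat, a_feat = graph_attention(window.T, rng.normal(size=2 * n))
# Time-oriented: n nodes, each a length-k vector.
h_time, a_time = graph_attention(window, rng.normal(size=2 * k))
```

Transposing the window is all that is needed to switch between the two views, so the same function serves both layers in this sketch.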

joint optimization

The loss function is the sum of the loss of the prediction-based model and the loss of the reconstruction-based model, and the parameters of both models are updated simultaneously.

The first term in the loss function of the reconstruction-based model is the negative log-likelihood and the second term is the Kullback-Leibler divergence.
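A worked form of this loss, following the formulation in the MTAD-GAT paper, is given below. The symbols are taken from the paper rather than from this article: $x_{n,i}$ is the value of feature $i$ at the next timestamp after the window, $\hat{x}_{n,i}$ is its prediction, $q_\phi$ is the VAE encoder, and $p_\theta$ is the VAE decoder and prior.

```latex
\begin{aligned}
\mathrm{Loss} &= \mathrm{Loss}_{for} + \mathrm{Loss}_{rec} \\
\mathrm{Loss}_{for} &= \sqrt{\sum_{i=1}^{k} \left( x_{n,i} - \hat{x}_{n,i} \right)^{2}} \\
\mathrm{Loss}_{rec} &= -\mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right]
  + D_{KL}\left( q_\phi(z \mid x) \,\|\, p_\theta(z) \right)
\end{aligned}
```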

model inference

For each timestamp, there are two inference results: the predicted value and the reconstruction probability. For anomaly detection, the two results are balanced and combined into a single inference score. The threshold for deciding an anomaly is determined automatically by the POT (Peaks-Over-Threshold) algorithm. The inference score is calculated with the hyperparameter γ as follows.
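As a sketch of this combination, the paper's per-feature score can be read as s_i = ((x̂_i − x_i)² + γ(1 − p_i)) / (1 + γ), summed over the features to give the score of a timestamp. The code below is a minimal illustration under that reading. The function name, the array layout, and the default γ of 1.0 are assumptions, and the POT thresholding step is not shown.

```python
import numpy as np

def anomaly_score(x, x_pred, p_rec, gamma=1.0):
    """Combine prediction error and reconstruction probability.

    x, x_pred, p_rec: arrays of shape (T, k) holding the observed values,
    the predicted values, and the reconstruction probabilities.
    Returns (s, score): per-feature scores (T, k) and per-timestamp scores (T,).
    """
    s = ((x_pred - x) ** 2 + gamma * (1.0 - p_rec)) / (1.0 + gamma)
    return s, s.sum(axis=-1)

x = np.array([[0.0, 0.0], [1.0, 1.0]])
x_pred = np.array([[0.0, 0.0], [3.0, 1.0]])
p_rec = np.array([[1.0, 1.0], [0.0, 1.0]])

s, score = anomaly_score(x, x_pred, p_rec, gamma=1.0)
```

In this toy input the first timestamp is predicted and reconstructed perfectly and scores zero, while the second is penalized by both terms on its first feature.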

experiment

Datasets and metrics

Three datasets are used for comparison and evaluation. Two are spacecraft datasets collected by NASA: SMAP (Soil Moisture Active Passive satellite) and MSL (Mars Science Laboratory rover). The third is TSA (Time Series Anomaly detection system), collected from a stream processing framework called Flink by Microsoft's own time series anomaly detection system.

The evaluation metrics are precision, recall, and F1 score.
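For reference, the three metrics can be computed from binary labels as in the short sketch below. It uses plain point-wise counting, and any adjustment the paper applies to contiguous anomaly segments is not reproduced here. The function name is illustrative.

```python
def precision_recall_f1(y_true, y_pred):
    """Point-wise precision, recall, and F1 for binary anomaly labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```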

Comparison with SOTA

The following seven SOTA methods were used for comparison. All of them are multivariate anomaly detection methods published in 2018-2019. They include prediction-based and reconstruction-based models.

OmniAnomaly [4].

It proposes a probabilistic model for multivariate time-series anomaly detection in which temporal dependencies are captured by a GRU and then projected onto a probability distribution by a VAE. Patterns with a low reconstruction probability, compared with normal patterns, are considered anomalies.

LSTM-NDT [2].

It proposes an unsupervised, nonparametric thresholding algorithm applied to the output of an LSTM.

KitNet [13].

Feature extraction, feature mapping, and anomaly detection are performed by an ensemble of autoencoders.

DAGMM [9].

It does not consider temporal dependence and focuses only on multivariate anomaly detection.

GAN-Li [10].

It detects anomalies with a discriminator trained as part of a GAN.

MAD-GAN [11].

It considers all variables simultaneously to find latent interactions between them. It is similar to GAN-Li and comes from almost the same research group.

LSTM-VAE [12].

An LSTM and a VAE are integrated to fuse the signals and reconstruct the expected distribution. In encoding, the multivariate observations and their temporal dependence at each time step are projected into the latent space by the LSTM-based encoder. In decoding, the expected distribution of the multivariate inputs is estimated from the latent representation.

MTAD-GAT achieves a higher F1 score than all of these SOTA methods on every dataset.

Evaluation with different delays

Since this is time-series data, it is important to be able to detect an anomaly without delay. The graph below shows that the F1 score of MTAD-GAT consistently rises more quickly than that of OmniAnomaly.

analysis

Effects of Graph Attention

We evaluate the effect of each of the two graph attention layers. The time-oriented graph attention layer is effective even though a GRU is used. This is interpreted as follows: relationships between distant time points, which the GRU does not capture, influence the occurrence of anomalies.

Effects of joint optimization

The same table also confirms the validity of jointly optimizing the prediction-based and reconstruction-based models.

However, there are some cases that the reconstruction-based model does not capture. For example, the red interval in Fig. 6 is not detected. Although the reconstruction-based model is generally good at capturing the global distribution, it may miss a sudden perturbation that disturbs the periodicity.

Analysis of γ

γ is the ratio that combines the prediction-based error and the reconstruction-based probability, and we also evaluate the dependence on γ. The results are shown in Table 1; they barely change with γ.

anomaly diagnosis

In addition to detecting anomalies, MTAD-GAT provides useful insights for anomaly diagnosis. We evaluated the accuracy using two metrics, HitRate@P% and NDCG.

Both of these values are relatively high.
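The two diagnosis metrics can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code. HitRate@P% is taken here to be the fraction of ground-truth root-cause features found among the top ⌊P% × |GT|⌋ ranked features, NDCG is computed with binary relevance, and the function names, the cutoff `k`, and the feature names are hypothetical.

```python
import math

def hit_rate_at_p(ranked_features, ground_truth, p):
    """Fraction of ground-truth features in the top floor(p% * |GT|) of the ranking."""
    gt = set(ground_truth)
    top_n = math.floor(p / 100 * len(gt))
    hits = len(gt.intersection(ranked_features[:top_n]))
    return hits / len(gt)

def ndcg_at_k(ranked_features, ground_truth, k):
    """NDCG with binary relevance: 1 if a ranked feature is in the ground truth."""
    gt = set(ground_truth)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, feature in enumerate(ranked_features[:k])
              if feature in gt)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(gt))))
    return dcg / ideal if ideal else 0.0

# Features ranked by their contribution to the anomaly score, highest first.
ranked = ["cpu", "mem", "disk", "net"]
root_causes = ["cpu", "disk"]

hr_100 = hit_rate_at_p(ranked, root_causes, 100)
hr_150 = hit_rate_at_p(ranked, root_causes, 150)
ndcg_3 = ndcg_at_k(ranked, root_causes, 3)
```

With P above 100 the metric tolerates a few non-root-cause features ranked ahead of the true ones, which is why the second value can exceed the first.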

case study

In some cases, anomalies are not judged correctly. The green area in the figure below was determined to be abnormal because FLINK_CHECKPOINT_DURATION and the CPU showed unusual spikes, but it was actually normal. DATA_RECEIVED_ON_FLINK and DATA_SENT_FROM_ONTIMER_FLINK also increased at the same time, indicating a temporary increase in data input to the system. However, the increase in load was short-lived and was not abnormal for the system's operating state. Such cases, which require domain knowledge and customer feedback, need further investigation.

summary

MTAD-GAT is a framework designed with careful consideration of both the characteristics of multivariate time series data and the requirements of anomaly detection on such data. It achieves a higher F1 score than many SOTA methods of the last few years. We believe that significant progress has been made.

On the other hand, as the paper itself notes, some cases result in false positives or false negatives. This should encourage the development of even better models that can comprehensively incorporate these properties as well.


友安昌幸 (Masayuki Tomoyasu): JDLA G certificate 2020#2, E certificate2021#1 Japan Society of Data Scientists, DS Certificate Japan Society for Innovation Fusion, DX Certification Expert Amiko Consulting LLC, CEO