Multivariate Time-series Anomaly Detection Using Self-supervised Learning And Adaptive Memory To Capture Hidden Habits.
3 main points
✔️ Self-supervised learning and adaptive memory fusion are applied to compensate for the diversity in normal time series data and the information that cannot be obtained from limited training data.
✔️ The model is fast and has little performance degradation even when it is lightweight.
✔️ Deep insight into the behavior of data (signals) is important to increase the accuracy of the model
Adaptive Memory Networks with Self-supervised Learning for Unsupervised Anomaly Detection
written by Yuxin Zhang, Jindong Wang, Yiqiang Chen, Han Yu, Tao Qin
(Submitted on 3 Jan 2022)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In a paper I introduced a while ago, a model was constructed to match the characteristics of heterogeneous patterns in multivariate time series data. The main idea of this paper is the same. However, this paper observes the patterns in more detail and constructs an adaptive model. Please compare this model with the previous one.
As background, anomalous samples are scarce in anomaly detection datasets, so much research has focused on unsupervised learning. Autoencoders are a powerful approach; they are trained to minimize the reconstruction error. Derived models include LSTM-AE, Convolutional AE, and ConvLSTM-AE.
However, two major challenges remain.
1) Lack of normal data: A shortage of normal data may sound strange, but I take it to mean that normal data has many variations, and a training set that covers all of them cannot be prepared. The figure in the paper contrasts (a) normal, (b) abnormal, and (c) similar signals.
2) Limitation of feature representation: When there is diversity in normal data like (c), conventional methods cannot represent it well.
The aim of the Adaptive Memory Network with Self-supervised Learning (AMSL) proposed in this paper is as follows.
1) Self-supervised learning addresses the lack of normal data, and memory networks address the limitation of feature representation.
2) Learning global and local memories to increase expressive ability, and then using the adaptive memory fusion module to fuse global and local memories into a final expression.
3) Performance is evaluated on four public datasets. Compared to conventional methods, accuracy and F1 score improve by more than 4%. The method is also more robust to noise.
Unsupervised anomaly detection in deep learning methods can be categorized into reconstruction models and prediction models.
Reconstruction models focus on reducing the reconstruction error. For example, autoencoders are often used for anomaly detection by learning to reconstruct a given input. For time-series data, LSTM encoder-decoders are used, but they cannot take spatial correlations into account; convolutional autoencoders can capture 2D image structures; ConvLSTM can capture both spatial and temporal correlations.
Prediction models predict one or more consecutive values. For example, RNN-based models detect anomalies based on the error between future predictions and actual values; LSTNet captures short- and long-term patterns; GAN-based methods use U-Net as a generator to predict the next point in time and compare it to the actual value to detect anomalies. However, these methods lack a reliable mechanism for a fine-grained representation of normal data.
Feature representation learning is one important aspect of deep learning, where a good representation of the input data is essential for generalizability, interpretability, and robustness. Self-supervised learning (SSL) is one of the unsupervised learning paradigms that use the data itself to obtain a good representation. Specific methods span image, natural language processing, and speech recognition. In anomaly detection, it is used to learn features of within-distribution (i.e. normal) samples.
Memory networks are used for question answering. RNNs and LSTMs use local memory cells to capture long-term structure. Because memory records information stably, memory networks have been employed in one-shot learning, neural machine translation, and anomaly detection. In anomaly detection, the memory records various patterns of normal data, and normal and abnormal samples are distinguished by comparing the input with the items in the memory.
Configuration of AMSL
Convolutional AE (CAE) is used as the base network; the loss function of CAE is the mean squared error (MSE) between the input and its reconstruction.
AMSL is composed of four elements.
1) Self-supervised learning module
2) Global memory module
3) Local memory module
4) Adaptive fusion module
And the algorithm also consists of four steps.
1) Six transformations are applied to the raw time-series data, and the encoder maps the original and transformed signals into the latent feature space
2) For self-supervised learning, a multi-class classifier classifies these feature representations to learn generalized features
3) Features are also sent to the global and local memory networks to learn common and transformation-specific features.
4) An adaptive fusion module fuses these features to obtain a new representation that can be used for the reconstruction
The self-supervised learning module of AMSL generalizes the feature representation of normal data. Unknown anomalies may take various patterns, while the normal data available for training is limited. To solve this problem, self-supervised learning is used to increase the generalization capability of the model.
The method assumes that an instance stays consistent (normal remains normal) after data augmentation. Feature transformations of the original data are designed, and the model is trained to recognize which transformation was applied to a sample as an auxiliary task. Specifically, six signal transformations are used (noise, time reversal, permutation, scale, negation, and smoothing). The loss function is the sum of the cross entropies for each transformation.
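As a rough illustration of the auxiliary task described above, the sketch below builds one labeled sample per transformation for a single-channel window. Everything here is an assumption made for illustration: the function names, the parameter values (`sigma`, `factor`, `n_segments`, `width`), and the reading of the two "inversion" entries as time reversal and sign negation. It is not the authors' implementation.

```python
import random

def add_noise(x, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to every point (sigma is a hypothetical value)."""
    rng = rng or random.Random(0)
    return [v + rng.gauss(0.0, sigma) for v in x]

def reverse(x):
    """Reverse the window along the time axis."""
    return x[::-1]

def permute(x, n_segments=4, rng=None):
    """Cut the window into segments and shuffle their order."""
    rng = rng or random.Random(0)
    size = max(1, len(x) // n_segments)
    segments = [x[i:i + size] for i in range(0, len(x), size)]
    rng.shuffle(segments)
    return [v for seg in segments for v in seg]

def scale(x, factor=1.5):
    """Multiply the amplitude by a constant factor."""
    return [v * factor for v in x]

def negate(x):
    """Flip the sign of every point."""
    return [-v for v in x]

def smooth(x, width=3):
    """Moving average with a window that shrinks at the edges."""
    half = width // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

# Label 0 is the untouched original, labels 1..6 are the six transformations.
TRANSFORMS = [lambda x: list(x), add_noise, reverse, permute, scale, negate, smooth]

def make_ssl_batch(window):
    """Return (transformed_window, label) pairs for the transformation classifier."""
    return [(t(window), label) for label, t in enumerate(TRANSFORMS)]
```

A classifier trained with cross-entropy on these labels would play the role of the auxiliary task; a multivariate version would apply the same operations per channel.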
Adaptive Memory Fusion Module
Traditional AEs are adversely affected by noisy or unknown training data and may reconstruct anomalous inputs too well. As a result, the model cannot learn features that are representative of normal data. To address this challenge, an adaptive memory fusion module is proposed that enhances the model's ability to distinguish between normal and abnormal data by recording typical normal patterns.
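The memory module consists of a memory representation that stores the encoded patterns and an update part that updates the memory based on the similarity between the memory items and the input. The memory is a C×F matrix $M$ with rows $m_1, \dots, m_C$. For an input $z$, a weight $w_i$ is obtained for each item from the cosine similarity $d(z, m_i) = \frac{z\, m_i^{\top}}{\lVert z \rVert \, \lVert m_i \rVert}$, and the weighted combination $\hat{z} = \sum_{i=1}^{C} w_i m_i$ is the output of this module.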
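The memory read described above can be sketched in a few lines. This is a minimal sketch under one explicit assumption: the cosine similarities are turned into weights with a softmax, as is common in memory-augmented autoencoders; the article itself only says that the weights come from the cosine similarity. The function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def softmax(scores):
    """Numerically stable softmax (assumed normalization of the similarities)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memory_read(z, memory):
    """Read from a C x F memory (list of C items of length F) with query z."""
    weights = softmax([cosine(z, item) for item in memory])
    z_hat = [sum(w * item[j] for w, item in zip(weights, memory))
             for j in range(len(z))]
    return z_hat, weights
```

Because the output is always a combination of stored normal patterns, an anomalous query tends to be pulled toward normal items, which is what raises its reconstruction error.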
In the training phase, the memory matrix is updated through the reconstruction loss, so that it records the properties of normal data. In the testing phase, the memory network considers multiple patterns of normal characteristics and outputs a representation formed by combining the memory items. Thus, normal instances can be reconstructed appropriately, whereas anomalies, which are reconstructed from the normal patterns retrieved by the memory module, are detected through their higher reconstruction errors.
Adaptive Fusion Module
Furthermore, an adaptive memory fusion network is proposed to learn both common and specific representations from all augmentations. Specifically, a global memory module learns the common representations contained in all transformations, and a local memory module learns the transformation-specific representations. Finally, an adaptive fusion module fuses these two levels of features into the final representation used in the reconstruction. The motivation is to capture both the general patterns of normal data and the specific information carried by each transformation, thus improving the feature representation of normal data in a fine-grained way.
The global memory module is built on a shared memory matrix. Using the encoded representation as a query, it records the items common to all inputs in this matrix and outputs a global representation for each query.
R local memory modules are constructed, one for the original data and one for each of the six transformations. Each memory matrix records the properties of the normal data under the corresponding transformation, and each local module outputs a transformation-specific representation.
Intuitively, the common and specific features are not equally important in representing a particular instance. To fuse these features adaptively, a feed-forward layer takes the features and a free variable r as input and produces the weights α (2 × R weights in total: one for the global and one for the local memory of each of the R inputs). Batch normalization and a sigmoid activation normalize the weights and keep their values within the range (0, 1). r is used to increase the randomness. The adaptive fusion representation is the combination of the global and local memory outputs weighted by α.
α is a weight for common (global) and specific (local) features.
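To make the weighting concrete, here is a deliberately simplified sketch of the fusion for a single input. It assumes the feed-forward layer has already produced one raw score (logit) for the global feature and one for the local feature; the batch normalization and the free variable r are omitted, and the names `fuse`, `logit_global`, and `logit_local` are hypothetical. Only the sigmoid squashing into (0, 1) and the weighted combination are taken from the description above.

```python
import math

def sigmoid(v):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def fuse(z_global, z_local, logit_global, logit_local):
    """Weighted combination of a global and a local memory output.

    The logits stand in for the output of a learned feed-forward layer.
    """
    alpha_g = sigmoid(logit_global)
    alpha_l = sigmoid(logit_local)
    return [alpha_g * g + alpha_l * l for g, l in zip(z_global, z_local)]
```

With both logits at zero, each feature receives a weight of 0.5; during training the layer would learn to shift these weights per transformation.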
The decoder concatenates the output of the encoder and the output of the adaptive fusion as an input to reconstruct the original input. The reconstruction loss is defined to minimize the l2 distance between the decoder output and the original input.
To keep the memory weights w sparse, and thereby prevent anomalies from being reconstructed too well through complex combinations of memory items, a sparsity loss is employed that minimizes the entropy of w.
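The entropy term can be illustrated directly: a weight vector concentrated on one memory item has low entropy, while a uniform one has the maximum. This is a minimal sketch; the function name and the `eps` guard against log(0) are illustrative choices.

```python
import math

def entropy(weights, eps=1e-12):
    """Shannon entropy of a weight vector that sums to 1."""
    return -sum(w * math.log(w + eps) for w in weights)
```

Minimizing this quantity pushes each query to rely on a few memory items rather than blending many of them.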
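Integrating the three loss functions (10), (11), and (3) with the trade-off parameters λ1 and λ2, the overall AMSL loss is a weighted sum of the reconstruction loss, the sparsity loss, and the self-supervised classification loss.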
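A weighted sum of this kind might look as follows. Which trade-off parameter multiplies which term is an assumption here (λ1 on the classification loss, λ2 on the sparsity loss); the default values are the optima reported in the sensitivity analysis later in this article, and the function name is hypothetical.

```python
def total_loss(l_rec, l_ce, l_sparse, lam1=1.0, lam2=0.0002):
    """Combine reconstruction, classification, and sparsity losses.

    Assumed form: l_rec + lam1 * l_ce + lam2 * l_sparse.
    """
    return l_rec + lam1 * l_ce + lam2 * l_sparse
```

The very small weight on the sparsity term means it acts as a gentle regularizer rather than competing with the reconstruction objective.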
Learning is done in an end-to-end fashion. (Please refer to the paper for the algorithm)
At inference time, a threshold is set and each sample Xi is judged normal or anomalous according to the value of its reconstruction error Err(Xi). (See the paper for the algorithm.)
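The decision rule can be sketched as below. It assumes that Err(Xi) is the mean squared error between a window and its reconstruction and that a sample is flagged when the error exceeds the threshold; the exact definition is in the paper, and the function names are hypothetical.

```python
def reconstruction_error(x, x_hat):
    """Mean squared error between a window and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def is_anomaly(x, x_hat, threshold):
    """Flag the sample when its reconstruction error exceeds the threshold."""
    return reconstruction_error(x, x_hat) > threshold
```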
Four datasets are used for benchmarking: DSADS is motion sensor data on daily body movements; PAMAP2 is similar body movement data, but collected using mobile devices; WESAD is wearable stress and emotion sensor data; CAP is sleep state sensor data used to detect sleep apnea.
TABLE 2 shows how activities are assigned to the normal and abnormal classes for DSADS and PAMAP2.
The models compared are four traditional methods (KPCA, ABOD, OCSVM, HMM) and eight deep unsupervised learning methods (CNN-LSTM, LSTM-AE, MSCRED, ConvLSTM-COMPOSITE, BeatGAN, MNAD, GDN, UODA). The evaluation metrics are mean precision, mean recall, mean F1 score, and accuracy.
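For reference, the metrics named above can be computed as in the following sketch. It assumes binary labels (0 = normal, 1 = abnormal) and interprets "mean" as the unweighted average over the two classes; the paper may define the averaging differently. Function names are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_scores(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of precision, recall, and F1 over the classes."""
    per_class = [precision_recall_f1(y_true, y_pred, c) for c in classes]
    return tuple(sum(s[i] for s in per_class) / len(per_class) for i in range(3))

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```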
TABLE 3 shows the evaluation results. On all datasets, AMSL significantly outperforms the other methods. In particular, on the largest dataset, CAP, AMSL improves the F1 score by 4.90% over the second-best method, OCSVM. For the relatively more difficult datasets DSADS, PAMAP2, and CAP, the amount of improvement decreases as the amount of data increases. This suggests that self-supervised learning is most effective when generalized representations are difficult to learn from small datasets. Furthermore, even when the number of samples is relatively small, the improvement from AMSL is large when the number of categories is large, indicating a superior ability to handle diversity with limited training data.
The performance of traditional methods varies depending on the dataset due to the limitations of their feature extraction methods. Among the deep models, reconstruction models are not robust to noise; MNAD and ConvLSTM may not be suitable for multivariate time series since they were originally designed for video data; BeatGAN does not perform well on CAP and WESAD.
The confusion matrix in Fig.4 shows that for most of the datasets, the ratio of misclassification of normal data is lower than that of misclassification of abnormal data; the F1 score is over 93%.
The effects of AMSL's self-supervised learning (SSL), memory (Mem), and adaptive fusion (Ada Mem) modules are isolated in an ablation study. The dataset is PAMAP2. The baseline is Convolutional AE (CAE). The self-supervised learning module and the memory module each bring an improvement. Combining them, and then adding adaptive fusion, improves the results further.
Self-supervised learning helps the network learn the general and diverse features of normal data, which improves the generalization ability of the model and helps it discriminate between unseen normal and abnormal instances. Fig. 3(a) shows a comparative analysis of the performance of each self-supervised data transformation. This evaluation tests whether jointly training on all the augmented data performs better than training on each transformation individually. Except for the noise transformation, the individual transformations are all competitive. Therefore, it is beneficial to combine all the transformations for better generalization.
Adaptive fusion module
In Fig.3(b), we compare CAE, GMSL, LMSL, and AMSL, where GMSL is a global memory network and LMSL is a local memory network. The results show that adaptive fusion performs better than either global or local individual memory networks.
TABLE 5 shows a more detailed comparison of four data sets. In all cases, adaptive fusion shows high performance. Fig. 3(c) shows how the adaptive weights change as learning proceeds; the numbers 1~7 correspond to the transformations in Fig. 3(a).
Robustness to noisy data
In real-world applications, the collection of multivariate time series data can easily be contaminated by noise due to changes in the environment or in the data collection devices. Noisy data poses a serious challenge for unsupervised anomaly detection. Robustness to noise is evaluated by injecting Gaussian noise (µ = 0, σ = 0.3) into randomly selected samples at ratios varying between 1% and 30%. Fig. 6 compares the performance of three methods: UODA, ConvLSTM-COMPOSITE, and AMSL. As the noise increases, the performance of all methods decreases. Among them, AMSL (orange) is significantly better than the other methods.
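The noise-injection protocol described above could be reproduced roughly as follows. The values µ = 0 and σ = 0.3 and the idea of corrupting a random fraction of samples come from the text; the function name, the fixed seed, and the choice to add noise to every point of a selected sample are assumptions.

```python
import random

def inject_noise(samples, ratio, sigma=0.3, seed=0):
    """Add zero-mean Gaussian noise to a random fraction of the samples.

    Returns the (partly) corrupted copies and the set of corrupted indices.
    """
    rng = random.Random(seed)
    n_noisy = round(ratio * len(samples))
    chosen = set(rng.sample(range(len(samples)), n_noisy))
    noisy = [
        [v + rng.gauss(0.0, sigma) for v in s] if i in chosen else list(s)
        for i, s in enumerate(samples)
    ]
    return noisy, chosen
```

Sweeping `ratio` from 0.01 to 0.30 and re-evaluating each model would give a robustness curve of the kind the figure reports.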
In general, the percentage of anomalies is significantly lower than that of normal data. Therefore, experiments are run on the CAP dataset with the percentage of anomalies in the test set set to 1%, 5%, 10%, 15%, 20%, 25%, and 30%. Fig. 7 shows the F1 scores for the anomaly class using different methods. Four methods are compared: OCSVM, ConvLSTM-COMPOSITE, MNAD-R, and AMSL. As the percentage of anomalies decreases, the F1 scores of the other methods decrease significantly, while AMSL (orange) remains stable. This indicates that AMSL achieves high precision and recall on the anomaly class, even when the percentage of anomalies in the test set is very low. In other words, it is robust to class imbalance in the dataset.
Case studies are presented for several normal/abnormal classifications, using 3D signals from the DSADS dataset; AMSL classifies correctly in all cases. In comparison, MNAD and UODA misclassify when the normal sample is different from the majority of normal samples and when the abnormal sample is very similar to the normal sample.
Parameter Sensitivity Analysis
Three key parameters, the length of the time series window V, the size of the memory matrix M, and the filter size F of the final layer of the encoder, are used for sensitivity analysis.
Sensitivity analysis is also performed for LMSL and GMSL. Fig. 9(a-b) shows the window length sensitivity, (c-d) the memory size sensitivity, and (e-f) the filter size sensitivity. The dependence on λ1 and λ2 in the loss function is shown in Fig. 9(g-h), where the optimal values are 1 and 0.0002, respectively.
Threshold µ is also an important parameter; according to TABLE 6, the 99th percentile is likely to predict the optimal threshold. Therefore, we set the 99th percentile as the threshold for anomaly detection.
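A percentile-based threshold of this kind can be computed as in the sketch below. It assumes the percentile is taken over a list of reconstruction errors from data considered normal (the article does not say which set is used) and uses the nearest-rank definition of a percentile; the function name is hypothetical.

```python
import math

def percentile_threshold(errors, q=99.0):
    """Nearest-rank q-th percentile of a list of reconstruction errors."""
    ordered = sorted(errors)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]
```

With the 99th percentile, roughly 1% of the reference errors lie above the threshold, which bounds the false-alarm rate on that reference set.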
Convergence, space-time complexity
Fig. 10(a) shows how the reconstruction loss and the self-supervised loss converge with the memory module; AMSL converges quickly and stably, which makes it easier to apply in practice.
The inference time of AMSL and other strong baselines is also evaluated on the DSADS dataset; as shown in Fig. 10(b), AMSL, in addition to achieving the best performance, requires a shorter execution time than most other methods.
Furthermore, according to TABLE 7, evaluated on the DSADS dataset, the number of parameters and the model size of AMSL are smaller than those of most other methods. Poorly performing transformations can also be discarded by controlling the number of self-supervised data transformations R in TABLE 7, which reduces the number of model parameters: AMSL (R = 6) discards the poorly performing "noise" transformation, AMSL (R = 5) discards the "noise" and "scale" transformations, AMSL (R = 4) discards the "noise", "scale", and "permutation" transformations, and AMSL (R = 3) discards the "noise", "scale", "permutation", and "inversion" transformations. Even in these reduced settings, AMSL still achieves the highest F1 and accuracy scores. For the other datasets, the conclusions are similar. This makes the method selection more flexible for real-world applications.
This paper proposes an adaptive memory network with self-supervised learning (AMSL) for unsupervised anomaly detection of multivariate time series signals. To enhance the generalization capability of the model for unseen anomalies, a self-supervised learning module learns a variety of normal patterns, and an adaptive memory fusion network learns rich feature representations through global and local memory modules. Experiments on four public datasets show that AMSL significantly outperforms existing approaches in terms of accuracy, generalization, and robustness.
In the future, the authors plan to extend AMSL to other modalities, such as image and video, for unsupervised anomaly detection, to develop more efficient learning algorithms, and to pursue a theoretical analysis of the method.
(Author's comment) This method seems to capture the diversity of each series in detail, but it does not seem to take into account the correlation between series. There is a possibility to improve it to a more powerful algorithm by combining it with the methods in other papers.
On the other hand, if the model structure is too closely matched to the characteristics of the data, I think the generalization capability that we are aiming for in this paper may be lost. For example, it is assumed that normal/abnormal does not change for six transformations, but consistency is not necessarily guaranteed depending on the target system or application. In addition to the physical data used in the evaluation, it would be interesting to see the performance of the system on financial and network data.