A Privacy-preserving Time-series Anomaly Detection Architecture

Time-series

3 main points
✔️ It is a privacy-preserving architecture that detects anomalies without collecting all data on the server
✔️ It consists of a combination of very simple models
✔️ Although it depends on the homogeneity of local data, detection performance is hardly degraded by edge processing

Federated Variational Learning for Anomaly Detection in Multivariate Time Series
written by Kai Zhang, Yushan Jiang, Lee Seversky, Chengtao Xu, Dahai Liu, Houbing Song
(Submitted on 18 Aug 2021)
Comments: 
Accepted paper in the IEEE 40th International Performance Computing and Communications Conference - IPCCC 2021
Subjects:  Machine Learning (cs.LG)

The images used in this article are from the paper or created based on it.

introduction

In multivariate time series anomaly detection, it is necessary to capture both temporal dependencies and dependencies among variables. Deep learning methods outperform conventional time series anomaly detection methods such as ARIMA because they can capture the dependencies among variables.

AI-SCHOLAR has previously introduced MTAD-GAT, Stack-VAE, and ScoreGrad. The paper covered here uses a somewhat different model architecture from these. Its most significant feature, however, is that it builds the time series model on a Federated Learning architecture, which is one approach to privacy protection.

related research

Prediction model

In forecasting models, anomalies are detected from the error between the prediction and the observation; LSTM and GRU, which are variants of RNN, are often used. Rather than forecasting future trends of the time series directly, Lin et al. use a VAE to extract local information embeddings for short windows and an LSTM to predict the embedding of the next window. Temporal pattern detection is not the only way to find outliers in a sequence. Microsoft uses spectral residuals (SR) to capture the spectral information of a sequence and feeds it into a CNN to classify outliers. Methods using graph neural networks can model more complex dependencies in the network.

Generative model

The core idea of generative models is to learn pattern representations of normal values rather than of time series anomalies; examples include DAGMM, VAE [3, 14], and GAN [15, 16]. However, DAGMM is not intended for multivariate time series and does not capture the inherent temporal dependence, while [3, 14, 15] consider only temporal dependence and do not explicitly incorporate latent interactions between features.

Federated Learning for Anomaly Detection

Federated learning (FL) allows a large number of edge devices to train a model collaboratively without sharing data [19]. FedAvg is a well-known algorithm in which each local device runs stochastic gradient descent, and a parameter server aggregates the resulting parameters and updates the global model over repeated rounds of communication between the parameter server and the clients. From a privacy perspective, FL addresses the problem of data scarcity and supports the training of robust models for detecting anomalies in cyber-physical systems.
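The FedAvg round described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the linear least-squares model, the full-batch gradient step standing in for SGD, and the names `fedavg` and `local_sgd` are all assumptions made for the example.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server step: average client parameters, weighted by local sample count."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)              # (n_clients, n_params)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

def local_sgd(w, X, y, lr=0.1, epochs=1):
    """Hypothetical client step: gradient descent on a linear least-squares loss.
    A full-batch gradient is used here for brevity instead of mini-batch SGD."""
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])

# Three clients, each holding private data that never leaves the device.
clients = []
for n in (50, 100, 150):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w))

global_w = np.zeros(2)
for _ in range(200):                                # communication rounds
    local = [local_sgd(global_w.copy(), X, y) for X, y in clients]
    global_w = fedavg(local, [len(y) for _, y in clients])
```

Only parameter vectors cross the network in this sketch; the raw `X` and `y` stay with each client, which is the property the privacy argument relies on.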

DÏoT was the first to apply FL to anomaly-detection-based intrusion detection. To reduce the heterogeneity of network traffic packets from each device, packet sequences were mapped to symbols. These were fed into a pre-trained GRU model, which predicted the probability of occurrence of each symbol and detected potential intrusions.

Communication is another bottleneck in FL. This is because edge devices usually have slow processing speed and low reliability. In [22], a sparsification technique is used to obtain a compressed gradient that reduces the communication cost. In [23], multi-task federated learning is proposed. In [24], the generative model DAGMM is used in a federated formulation.

method

As shown in Fig. 1, the training data consists of sensor and actuator readings collected over a certain period from different entities. During training, the data contains only normal values. At test time, outliers are created by putting a part of the training data sequence into a different interval.

FedAnomaly Overview

FedAnomaly, as shown in Fig. 2, consists of two parts: collective learning and online detection. Although not shown in the figure, there are preprocessing steps of transformation, standardization, and fixed-length windowing. After the local models capture the patterns in the training data, the gradients from the edge devices are aggregated to update the global model. The reconstruction error of the observed values is output at the last timestamp of the sequence, and the normal validation data is stored in the cloud. Training of the global model continues until the reconstruction error on this validation data converges. The threshold selection module then uses the reconstruction error to choose the anomaly threshold for online detection. In this paper, the maximum reconstruction error on the validation data is used as the threshold.

The online detection module of each edge device receives the learned model and threshold from the cloud. Each entity can then obtain anomaly detection results for new observations.
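The preprocessing and threshold selection steps can be illustrated with a short sketch. It covers only standardization, fixed-length windowing, and the maximum-error threshold; the function names, the window length of 10, and the synthetic data are assumptions for the example, and the error values passed to `select_threshold` are made up rather than produced by a trained model.

```python
import numpy as np

def standardize(train, other):
    """Scale both arrays with statistics computed on the training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8                  # guard against constant channels
    return (train - mean) / std, (other - mean) / std

def sliding_windows(series, length):
    """(T, features) -> (T - length + 1, length, features), stride 1."""
    return np.stack([series[i:i + length] for i in range(len(series) - length + 1)])

def select_threshold(validation_errors):
    """Use the largest error seen on normal validation data as the anomaly threshold."""
    return float(np.max(validation_errors))

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))    # synthetic normal data
val = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

train_s, val_s = standardize(train, val)
windows = sliding_windows(train_s, length=10)             # shape (91, 10, 3)
threshold = select_threshold([0.1, 0.7, 0.3])             # placeholder errors
```

Taking the maximum makes the threshold sensitive to a single noisy validation window, which is worth keeping in mind when the validation data is small.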
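On the edge side, online detection amounts to scoring each new window and comparing the score with the received threshold. The sketch below assumes a squared-error score at the last timestamp and uses a trivial stand-in (`mean_reconstruct`, hypothetical) where the trained model would go.

```python
import numpy as np

def last_step_error(window, reconstruct):
    """Mean squared error between the last observation and its reconstruction."""
    x_last = window[-1]
    return float(np.mean((x_last - reconstruct(window)) ** 2))

def detect(windows, reconstruct, threshold):
    """Return per-window errors and a boolean anomaly flag for each window."""
    errors = np.array([last_step_error(w, reconstruct) for w in windows])
    return errors, errors > threshold

def mean_reconstruct(window):
    """Hypothetical stand-in for the trained model: mean of the earlier steps."""
    return window[:-1].mean(axis=0)

stream = np.zeros((5, 4, 2))      # 5 windows, length 4, 2 features
stream[2, -1] = 3.0               # inject a spike at the last step of window 2
errors, flags = detect(stream, mean_reconstruct, threshold=0.5)
```

Because scoring needs only the model and one scalar threshold, no raw observations have to leave the device at detection time.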

ConvGRU (Convolutional Gated Recurrent Unit)

We use ConvGRU, which replaces the matrix products of the GRU with convolutions (Fig. 3). Since the time series data is 1D, we use 1D-Conv. Multivariate dependencies are captured through representation learning.

Here we combine it with a VAE as the generative model. A VAE in its standard form is not a sequential model, since it consists only of multilayer perceptrons. Therefore, we connect ConvGRU and VAE as shown in Fig. 4. The hidden feature ht is extracted from the last stage of the lower series of ConvGRUs. From it, the mean and log variance are calculated, and the distribution of the latent variable zt is obtained. The reverse process yields the reconstructed sequence x't. As mentioned earlier, anomaly detection is performed at the last timestamp of the input sequence, so only the hidden state of the last ConvGRU cell is sent to the decoder.
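To make the idea concrete, here is a minimal NumPy sketch of a ConvGRU cell with 1D convolutions. It is written under several assumptions that the article does not confirm: the convolution runs along the variable axis, biases are omitted, the kernel size is odd, and the gate convention is one of the two common GRU forms. All names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv1d_same(x, w):
    """Cross-correlate x (c_in, n) with w (c_out, c_in, k); zero-pad so output is (c_out, n)."""
    c_out, c_in, k = w.shape
    pad = k // 2                                    # assumes an odd kernel size
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.empty((c_out, x.shape[1]))
    for i in range(x.shape[1]):
        out[:, i] = np.einsum("oik,ik->o", w, xp[:, i:i + k])
    return out

class ConvGRUCell:
    """GRU cell whose input and hidden transforms are 1D convolutions."""

    def __init__(self, c_in, c_hidden, kernel_size=3, seed=0):
        rng = np.random.default_rng(seed)

        def init(c_from):
            return rng.normal(scale=0.1, size=(c_hidden, c_from, kernel_size))

        self.Wz, self.Wr, self.Wh = init(c_in), init(c_in), init(c_in)
        self.Uz, self.Ur, self.Uh = init(c_hidden), init(c_hidden), init(c_hidden)

    def step(self, x, h):
        z = sigmoid(conv1d_same(x, self.Wz) + conv1d_same(h, self.Uz))   # update gate
        r = sigmoid(conv1d_same(x, self.Wr) + conv1d_same(h, self.Ur))   # reset gate
        h_cand = np.tanh(conv1d_same(x, self.Wh) + conv1d_same(r * h, self.Uh))
        return (1.0 - z) * h + z * h_cand

# One window: 10 time steps, 1 input channel, 5 variables along the convolved axis.
rng = np.random.default_rng(1)
window = rng.normal(size=(10, 1, 5))
cell = ConvGRUCell(c_in=1, c_hidden=4)
h = np.zeros((4, 5))
for x_t in window:
    h = cell.step(x_t, h)                           # h ends as the last hidden state
```

Each hidden position sees only its neighbours within the kernel width, so the convolution mixes adjacent variables while the recurrence carries information across time.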
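The step from ht to zt can be sketched with the standard VAE reparameterization. The linear heads, the sizes (8 hidden units, 3 latent dimensions), and the function names below are assumptions for illustration; the article does not specify how the mean and log variance are computed from ht.

```python
import numpy as np

def latent_heads(h_t, W_mu, b_mu, W_lv, b_lv):
    """Map the last hidden state to the mean and log variance of q(z_t | x)."""
    mu = W_mu @ h_t + b_mu
    log_var = W_lv @ h_t + b_lv
    return mu, log_var

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(log_var / 2)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
h_t = rng.normal(size=8)                            # flattened last hidden state
W_mu, W_lv = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
mu, log_var = latent_heads(h_t, W_mu, np.zeros(3), W_lv, np.zeros(3))
z_t = reparameterize(mu, log_var, rng)
```

Predicting the log variance rather than the variance keeps the standard deviation positive without any constraint on the network output.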

experiment

We use the SMAP, MSL, and SWaT datasets, which are widely used in other papers, and evaluate the model under two settings: non-federated and federated. The former is optimized with Averaged Stochastic Gradient Descent (ASGD) and the latter with ordinary SGD. There are 128 ConvGRU cells, and the loss function is the reconstruction error, consisting of MSE and KL divergence.

In the federated setting, by default there are three clients with local update epoch E=1. Each client samples data exclusively and non-IID (not independent and identically distributed) from the training data.
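A loss combining MSE and KL divergence can be written as in the sketch below. The closed-form KL term is the standard one for a diagonal Gaussian posterior against a standard normal prior; the weighting factor `beta` and its default of 1.0 are assumptions, since the article does not say how the two terms are balanced.

```python
import numpy as np

def vae_loss(x, x_rec, mu, log_var, beta=1.0):
    """MSE reconstruction term plus KL(N(mu, sigma^2) || N(0, I))."""
    mse = np.mean((x - x_rec) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return float(mse + beta * kl)                   # beta is a hypothetical weight

x = np.array([0.5, -1.0, 2.0])
perfect = vae_loss(x, x, mu=np.zeros(2), log_var=np.zeros(2))
shifted = vae_loss(x, x, mu=np.array([1.0]), log_var=np.array([0.0]))
```

With a perfect reconstruction and a posterior equal to the prior, both terms vanish; shifting the posterior mean by 1 in one dimension adds a KL penalty of 0.5.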
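One simple way to give each client an exclusive share of the training data is to cut the series into disjoint contiguous chunks. This is an assumed partitioning scheme used only for illustration; the article does not describe how the paper actually samples client data. Contiguous chunks tend to be non-IID when the signal drifts over time.

```python
import numpy as np

def exclusive_partition(n_samples, n_clients):
    """Split sample indices into disjoint, contiguous, near-equal chunks."""
    return np.array_split(np.arange(n_samples), n_clients)

parts = exclusive_partition(10, 3)
# parts[0] -> [0 1 2 3], parts[1] -> [4 5 6], parts[2] -> [7 8 9]
```

Every index belongs to exactly one client, so no observation is shared between devices.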

The comparison targets are IF (Isolation Forest), AE, LSTM-VAE, DAGMM, MAD-GAN, OmniAnomaly and USAD. The results are shown in Table II. ConvGRU-VAE gives the best results for F1, and almost the best results for Precision and Recall.

In the Federated setting (FedAnomaly), the results are worse for SMAP and MSL because the telemetry channel on the spacecraft is extremely inhomogeneous. In SWaT, there is almost no degradation.

Model parameter search and delay time analysis

Since the labels and causes of the anomalies are known for SWaT, further analysis was performed on this dataset.

window length search

The dependence on window length is shown in Table III; Precision, Recall, and F1 are maximized for window lengths 5, 20, and 10, respectively. In real-world anomaly detection, anomalies are more likely to occur adjacent to each other than at distant points. The left graph in Fig. 5 shows the delay between the detection and the correct answer; the delay becomes shorter for window lengths longer than 5.
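Detection delay can be measured per anomaly segment, as in the sketch below. The definition used here (number of steps from the start of a ground-truth segment to the first alarm inside it) is an assumption; the article does not state the exact definition used in the paper, and the function names are illustrative.

```python
import numpy as np

def anomaly_segments(labels):
    """Return (start, end) pairs, end exclusive, for each contiguous run of 1s."""
    padded = np.concatenate([[0], np.asarray(labels).astype(int), [0]])
    diff = np.diff(padded)
    starts = np.where(diff == 1)[0]
    ends = np.where(diff == -1)[0]
    return list(zip(starts.tolist(), ends.tolist()))

def detection_delays(labels, preds):
    """Delay of the first alarm in each detected segment; missed segments are skipped."""
    preds = np.asarray(preds).astype(int)
    delays = []
    for start, end in anomaly_segments(labels):
        hits = np.where(preds[start:end] == 1)[0]
        if hits.size:
            delays.append(int(hits[0]))
    return delays

labels = [0, 1, 1, 1, 0, 0, 1, 1, 0]
preds = [0, 0, 1, 0, 0, 0, 0, 0, 0]
segments = anomaly_segments(labels)        # [(1, 4), (6, 8)]
delays = detection_delays(labels, preds)   # [1]: first segment caught one step late
```

In this toy case one of the two segments is detected, with a delay of one step, and the second segment is missed entirely.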

We conclude that the model performs well in terms of overall anomaly patterns, segment detection, and reaction speed with window lengths of 10 and 20.


Hidden variable size search

We investigated the relationship between detection performance and the hidden size. In general, the smaller the hidden size, the less capacity the model has to capture the correlations and temporal dependencies of the features. On the other hand, if the hidden size is too large, the model becomes redundant, which hinders effective representation learning. Table IV and the center and right graphs of Fig. 5 show that a larger hidden size gives better detection performance, with a smaller delay and less variance, and that the detection rate improves when the delay is also taken into account.

Additional experiments on the Federated Learning mechanism

In the federated setting, we examined how performance depends on the number of local epochs L and the number of clients C.

Performance Analysis

In Table V, F1 and Precision are almost best at L=2, and Recall is almost best at L=3. As L increases, Recall improves and Precision worsens; the dependence on C is much weaker.

Table VI summarizes the delays: the first number in each pair is the number of adjusted segments, and the second is the average delay. L=3 is good for most values of C, but as Table V shows, L=3 has lower Precision and tends to produce false alarms.

Analysis of the learning curve

The learning curves show that as the number of clients increases, more communication rounds are needed to minimize the validation loss. In terms of local epochs, convergence is faster for larger values.

summary

The ability to detect anomalies at the entity level, such as edge computers, is a great advantage in terms of unsupervised learning and privacy protection. Our proposed system, ConvGRU-VAE, and its application to the Federated environment, FedAnomaly, show the same or better performance than SOTA in such an environment.

However, there is a problem of performance degradation when the data is non-homogeneous, which is an issue for future research.

(Article Author's Findings)

Several models have recently been proposed to capture the temporal and inter-feature relationships in multivariate time series data, and comparisons on common datasets such as SMAP have yielded almost identical F1 scores. It is interesting that architectures as different as graph attention, stacked VAE, energy-based generative models, and now ConvGRU-VAE all achieve good results. I would like to see an analysis of what these apparently different approaches essentially have in common.
