Time Series Anomaly Detection Starting From Unsupervised (NCAD)

Time-series 26/08/2021

3 main points
✔️ A powerful framework is proposed for time series anomaly detection
✔️ Combines data augmentation with an expressive representation model, not just a prediction model
✔️ Makes use of even a small amount of labeled data, and performance improves as more labels are incorporated into the model

Neural Contextual Anomaly Detection for Time Series
written by Chris U. Carmona, François-Xavier Aubet, Valentin Flunkert, Jan Gasthaus
(Submitted on 16 Jul 2021)
Comments: Published on arxiv.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)


The images used in this article are from the paper or created based on it.

Introduction

This is a paper by AWS AI Labs. The authors propose NCAD (Neural Contextual Anomaly Detection), a framework for time series anomaly detection that seamlessly handles unsupervised and supervised settings, and both univariate and multivariate data.

Purely unsupervised learning on a data set that has a few labels is inefficient because the available information goes unused. The authors therefore designed a framework that handles unsupervised, semi-supervised, and supervised learning seamlessly, with a structure that can incorporate additional information.

Recent work on deep anomaly detection in computer vision has achieved remarkable performance with this idea. A notable example is the hypersphere classifier, which extends one-class classification into a powerful framework for semi-supervised anomaly detection.

NCAD combines recent developments in representation learning for multivariate time series with techniques originally developed for anomaly detection in images, such as HSC (Hypersphere Classifier) and OE (Outlier Exposure), adapted to time series. Injecting synthetic outliers into the available data makes it easier to learn the boundary between normal and anomalous values. All available information can be used efficiently, either as domain knowledge or as training labels for semi-supervised learning.

This method divides the time series into overlapping windows of fixed length. Each window is further divided into two parts: a context window and a suspect window (Fig. 1). The goal is to detect anomalies in the suspect window. Based on the intuition that anomalies induce large perturbations in the embedding, anomalies are identified in the learned latent representation space. That is, the representations of two overlapping segments are expected to be far apart when one of them contains an anomaly and the other does not.
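The window splitting described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `split_windows`, its parameters, and the choice to place the suspect window at the end of each window are assumptions made for this sketch, not code from the paper.

```python
from typing import List, Tuple

Window = Tuple[List[float], List[float], List[float]]


def split_windows(series: List[float], window_len: int, suspect_len: int,
                  stride: int = 1) -> List[Window]:
    """Cut a series into overlapping fixed-length windows.

    Returns (full, context, suspect) triples. Assumption: the suspect
    window is the trailing `suspect_len` points of each full window.
    """
    if not 0 < suspect_len < window_len:
        raise ValueError("suspect_len must be between 1 and window_len - 1")
    windows = []
    for start in range(0, len(series) - window_len + 1, stride):
        full = series[start:start + window_len]
        context = full[:window_len - suspect_len]  # leading part, assumed normal
        suspect = full[window_len - suspect_len:]  # trailing part to be judged
        windows.append((full, context, suspect))
    return windows


series = [float(t) for t in range(10)]
windows = split_windows(series, window_len=6, suspect_len=2, stride=2)
```

With a stride smaller than the window length, consecutive windows overlap, so every time step ends up in the suspect part of at least one window (apart from the first few points of the series).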

Anomalies in time series are inherently contextual. Exploiting this, the HSC loss is extended to a contextual hypersphere loss, in which the center of the hypersphere is fitted dynamically from the representation of the context. To make the boundary between normal and abnormal easier to learn, data augmentation is used: a variant of OE creates contextual anomalies, and simple point outliers are injected.

related research

The paper classifies previous approaches to time series anomaly detection into three categories: 1) predictive approaches, 2) restorative (reconstruction) models, and 3) compression-based approaches.

Predictive approaches include traditional forecasting methods such as ARIMA. SPOT and DSPOT detect outliers in time series using extreme value theory, which models the tails of the distribution.

In deep anomaly detection, restorative methods based on VAEs and GANs grew out of the predictive approaches. DONUT uses a VAE to predict the distribution of sliding windows. SR-CNN trains a supervised CNN on top of an unsupervised anomaly detection model, SR, using labels from injected single-point outliers. AnoGAN uses GANs to model sequences of observations and estimate their probabilities in latent space. DAGMM and LSTM-VAE use recurrent networks and VAEs. OmniAnomaly extends this framework with deep innovation state-space models and normalizing flows. MSCRED uses a convolutional autoencoder and finds anomalies by measuring the reconstruction error. MTAD-GAT is a method based on graph attention networks and is introduced in another article.

Compression-based approaches are becoming more common in image anomaly detection. The principle is the same as the one-class classification used in support vector data description (SVDD), an SVM-like method whose training data contains only one class. Instances are mapped to latent representations that form a sphere in the latent space, and any point far from the center of the sphere is considered an anomaly. DeepSVDD achieves this by minimizing the Euclidean distance to the center. THOC applies this principle to time series.

The Hypersphere Classifier (HSC) improves on DeepSVDD by training with the standard binary cross-entropy loss, which extends the method to the (semi-)supervised setting. The HSC loss is expressed in terms of pseudo-probabilities.
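For reference, the HSC loss for a single example can be written as below. The equation image from the original article is not reproduced here, so this is a reconstruction based on the binary cross-entropy form described above. It assumes $y_i \in \{0, 1\}$ is the label ($y_i = 1$ for an anomaly), $\phi(\cdot;\theta)$ is the encoder, and $l(\cdot)$ maps a representation to a pseudo-probability of being normal.

$$
\mathcal{L}_{\mathrm{HSC}} = -(1 - y_i)\,\log l\big(\phi(x_i;\theta)\big) \;-\; y_i\,\log\Big(1 - l\big(\phi(x_i;\theta)\big)\Big)
$$

With a radial basis function such as $l(z) = \exp(-\lVert z \rVert^2)$, the first term reduces to the squared norm of the representation, which is the DeepSVDD-style objective of pulling normal examples toward the center.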

Several studies have shown that remarkable performance improvements can be obtained with only a small amount of labeled outlier data. An extension of this idea is a powerful tool called OE (Outlier Exposure), which improves detection performance by drawing a large number of outliers from an auxiliary dataset during training. Even though such negative examples are not true anomalies, the contrast they provide is useful for learning feature representations. The combination of OE and HSC has shown remarkable results on images.

For time series, artificial anomalies and the corresponding data augmentation have not been well studied. An example is SR-CNN, which trains a supervised CNN on top of an unsupervised anomaly detection model by injecting single-point outliers.

Model Description

The building blocks of NCAD are as follows: a window-based anomaly detection approach combined with a flexible learning paradigm and effective heuristic data augmentation.

Rather than predicting a binary label (normal or abnormal) directly, the model predicts a positive anomaly score for each time step. A threshold chosen to satisfy the desired precision/recall trade-off is then applied to obtain the anomaly labels.
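To make the score-then-threshold step concrete, here is a minimal sketch in plain Python. The helper names `labels_from_scores` and `precision_recall` and the toy scores are hypothetical, and the point-wise precision/recall used here is the plain definition, not the segment-based evaluation used in the benchmark.

```python
from typing import List, Sequence, Tuple


def labels_from_scores(scores: Sequence[float], threshold: float) -> List[int]:
    """Turn per-time-step anomaly scores into 0/1 labels."""
    return [1 if s > threshold else 0 for s in scores]


def precision_recall(pred: Sequence[int], truth: Sequence[int]) -> Tuple[float, float]:
    """Point-wise precision and recall of predicted labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = [0.1, 0.9, 0.4, 0.8]  # toy anomaly scores
truth = [0, 1, 1, 0]           # toy ground-truth labels

# A lower threshold trades precision for recall.
strict = precision_recall(labels_from_scores(scores, 0.5), truth)
loose = precision_recall(labels_from_scores(scores, 0.3), truth)
```

Sweeping the threshold over the observed scores and keeping the value that best matches the application's needs is one simple way to pick the operating point.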

Window-based contextual hypersphere detection

As mentioned above, time-series data is divided into windows, each of which is further divided into a context window and a suspect window. The suspect window is usually smaller and can be a single point in time.

Anomaly detection is performed by comparing the representation vectors $\phi(w;\theta)$ and $\phi(w^{(c)};\theta)$ of the full window $w$ and the context window $w^{(c)}$, obtained by applying the neural network feature extractor $\phi(\cdot;\theta)$.

The loss function can be seen as a contextual version of the HSC loss.

Using the Euclidean distance for $dist(\cdot,\cdot)$ and a radial basis function for $l(\cdot)$, the loss becomes a function of the squared distance between the two representations.
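Written out, the contextual loss for one window takes the following form. This is a reconstruction rather than a copy of the original equation image. It assumes the radial basis function $l(z) = \exp(-z^2)$ applied to the Euclidean distance, with $y_i = 1$ marking a window whose suspect part contains an anomaly.

$$
\mathcal{L}_i = (1 - y_i)\,\big\lVert \phi(w_i;\theta) - \phi(w_i^{(c)};\theta) \big\rVert_2^2 \;-\; y_i\,\log\Big(1 - \exp\big(-\big\lVert \phi(w_i;\theta) - \phi(w_i^{(c)};\theta) \big\rVert_2^2\big)\Big)
$$

For a normal window the loss pulls the two representations together, and for an anomalous window it pushes them apart.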

Intuitively, this is an HSC loss in which the center of the hypersphere is chosen dynamically for each window as the representation of its context. The expectation is that a model trained this way on generic injected anomalies will generalize to more complex anomalies in the real world. Labeled data can also be added when available.
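A small numerical sketch may help to see how such a contextual loss behaves. It takes two already-computed embedding vectors and assumes the squared Euclidean distance with the score $\exp(-d^2)$. The function names and the `eps` clamp are choices made for this illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contextual_hypersphere_loss(z_full: Sequence[float],
                                z_context: Sequence[float],
                                y: int, eps: float = 1e-10) -> float:
    """Loss for one window; y = 1 means the suspect window is anomalous.

    Assumption: pseudo-probability of "normal" is exp(-d^2), where d is the
    distance between the full-window and context-window embeddings.
    """
    d2 = squared_distance(z_full, z_context)
    if y == 0:
        return d2  # equals -log(exp(-d2)): pull the embeddings together
    p_normal = math.exp(-d2)
    # Clamp to avoid log(0) when the two embeddings coincide.
    return -math.log(max(1.0 - p_normal, eps))


normal_close = contextual_hypersphere_loss([0.0, 0.0], [0.0, 0.0], y=0)
normal_far = contextual_hypersphere_loss([1.0, 0.0], [0.0, 0.0], y=0)
anomaly_far = contextual_hypersphere_loss([1.0, 0.0], [0.0, 0.0], y=1)
anomaly_close = contextual_hypersphere_loss([0.0, 0.0], [0.0, 0.0], y=1)
```

The loss is small when a normal window stays close to its context or an anomalous window lies far from it, and large in the opposite cases.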

architecture

NCAD consists of three parts: 1) a neural network encoder $\phi(\cdot;\theta)$, for which a TCN with adaptive max pooling along the time dimension is used, 2) a distance-like function $dist(\cdot,\cdot)$, and 3) a probabilistic scoring function $l(z)$.

The encoder parameters $\theta$ are learned by minimizing the classification loss over mini-batches of windows $w$.

To detect anomalies in real time, the model is applied to rolling windows of the time series. Each point in time therefore appears in multiple rolling windows, and an alarm can be raised either on the first high score or on the average score.
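The sketch below shows one possible way to turn per-window scores into per-time-step scores when rolling windows overlap. The function `aggregate_scores`, its parameters, and the interpretation of "first" as the score of the earliest window that covers a time step are assumptions for this illustration.

```python
from typing import List, Optional, Sequence


def aggregate_scores(window_scores: Sequence[float], series_len: int,
                     window_len: int, suspect_len: int, stride: int = 1,
                     how: str = "mean") -> List[Optional[float]]:
    """Assign each window's score to the time steps of its suspect window.

    Window k is assumed to cover [k*stride, k*stride + window_len), with the
    suspect window at its end. Steps never covered by a suspect window get None.
    """
    per_step: List[List[float]] = [[] for _ in range(series_len)]
    for k, score in enumerate(window_scores):
        end = k * stride + window_len
        for t in range(end - suspect_len, end):
            per_step[t].append(score)

    result: List[Optional[float]] = []
    for scores in per_step:
        if not scores:
            result.append(None)
        elif how == "mean":
            result.append(sum(scores) / len(scores))
        elif how == "first":
            result.append(scores[0])  # earliest window covering this step
        else:
            raise ValueError("how must be 'mean' or 'first'")
    return result


window_scores = [0.0, 1.0, 0.5]  # toy scores for three rolling windows
mean_scores = aggregate_scores(window_scores, 6, 4, 2, how="mean")
first_scores = aggregate_scores(window_scores, 6, 4, 2, how="first")
```

Averaging smooths out isolated high scores, while using the first score reacts sooner, which matters when alarms have to be raised with low latency.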

Data augmentation

Another feature of this model is a set of data augmentation methods that inject artificial anomalies. Their purpose is to allow supervised training without using ground-truth labels. These methods do not attempt to characterize the entire distribution of anomalies. Instead, they add effective generic heuristics for detecting common types of outliers.

Contextual Outlier Exposure (COE)

Following the success of OE, the authors propose a simple, task-agnostic method for generating contextual out-of-distribution examples. A chunk of the data in the suspect window is replaced with values taken from another time series. Fig. 5 shows the original data, and Fig. 6 shows the data with COE applied to the interval 1500 to 1550. The data are swapped between (a) and (b), and between (c) and (d).
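The replacement step can be sketched as follows for the univariate case. The function `contextual_outlier_exposure`, the random choice of chunk length and position, and the use of same-position values from the donor window are assumptions of this sketch, not details confirmed by the article.

```python
import random
from typing import List, Sequence, Tuple


def contextual_outlier_exposure(window: Sequence[float], donor: Sequence[float],
                                suspect_len: int, rng: random.Random
                                ) -> Tuple[List[float], Tuple[int, int]]:
    """Replace a random chunk inside the suspect window with donor values.

    `donor` is a window of the same length taken from another time series.
    Returns the augmented window (to be labeled anomalous) and the replaced
    index range [start, end).
    """
    if len(window) != len(donor):
        raise ValueError("window and donor must have the same length")
    suspect_start = len(window) - suspect_len
    length = rng.randint(1, suspect_len)                   # chunk length
    start = rng.randint(suspect_start, len(window) - length)
    augmented = list(window)
    augmented[start:start + length] = donor[start:start + length]
    return augmented, (start, start + length)


rng = random.Random(0)
window = [0.0] * 10  # toy window from the series being trained on
donor = [1.0] * 10   # toy window from a different series
augmented, (start, end) = contextual_outlier_exposure(window, donor, 4, rng)
```

Because only the suspect part is altered, the context window still describes the original series, and this mismatch is what the contextual loss is meant to pick up.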

Anomaly Injection

This section proposes one simple type of injected anomaly, the point outlier (PO). A spike is injected as shown in the figure below.
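As a rough illustration of point outlier injection, the sketch below adds a single spike at a random position in the suspect window. Scaling the spike by the standard deviation of the window and the default `scale` of 5 are assumptions chosen for this example. The paper may scale the spike differently.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def inject_point_outlier(window: Sequence[float], suspect_len: int,
                         rng: random.Random, scale: float = 5.0
                         ) -> Tuple[List[float], int]:
    """Add one spike at a random index inside the suspect window.

    The spike height is `scale` times the window's standard deviation
    (falling back to 1.0 for a constant window), with a random sign.
    """
    idx = rng.randrange(len(window) - suspect_len, len(window))
    spread = statistics.pstdev(window) or 1.0
    sign = rng.choice((-1.0, 1.0))
    augmented = list(window)
    augmented[idx] += sign * scale * spread
    return augmented, idx


rng = random.Random(0)
window = [0.0, 1.0] * 5  # toy window of length 10, standard deviation 0.5
augmented, idx = inject_point_outlier(window, suspect_len=4, rng=rng)
```

The returned window is used as a training example with an anomalous label, so no ground-truth annotation is needed.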

Window Mixup

Linear combinations of training examples are created in a manner inspired by MIXUP, as shown in Fig. 8.
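A minimal sketch of the mixing step is shown below. It follows the usual MIXUP recipe of drawing the mixing weight from a Beta distribution and mixing the labels with the same weight. The function name `window_mixup` and the value `alpha = 0.2` are assumptions for this illustration.

```python
import random
from typing import List, Sequence, Tuple


def window_mixup(w_a: Sequence[float], y_a: float,
                 w_b: Sequence[float], y_b: float,
                 lam: float) -> Tuple[List[float], float]:
    """Convex combination of two windows and of their labels."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(w_a, w_b)]
    soft_label = lam * y_a + (1.0 - lam) * y_b
    return mixed, soft_label


# Fixed weight, to make the arithmetic easy to follow.
mixed, soft_label = window_mixup([0.0, 0.0], 0.0, [4.0, 8.0], 1.0, lam=0.25)

# In training, the weight would typically be sampled per pair of examples.
rng = random.Random(0)
alpha = 0.2  # assumed Beta parameter
lam = rng.betavariate(alpha, alpha)
sampled, sampled_label = window_mixup([0.0, 0.0], 0.0, [4.0, 8.0], 1.0, lam)
```

The resulting soft labels lie between 0 and 1, and a cross-entropy style loss that is linear in the label can use them directly.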

experiment

benchmark

Data set

The multivariate evaluation uses data sets that are common in comparative evaluations: NASA's SMAP (Soil Moisture Active Passive satellite) and MSL (Mars Science Laboratory rover), SWaT (Secure Water Treatment), which contains 11 days of water treatment data, and SMD (Server Machine Dataset), collected at an Internet company.

In addition, the univariate evaluation uses the Yahoo data set, which consists of 367 real and synthetic time series published by Yahoo Labs, and the KPI data set released in the AIOps data competition.

Evaluation setting

It is difficult to measure the performance of time series anomaly detection in a universal way, because different applications prioritize sensitivity, specificity, and temporal localization differently. Various evaluation methods have been proposed to account for this. Here the authors follow the method of Xu et al.: if the model detects an anomaly at one or more points of an anomalous segment, the entire segment containing those points is counted as detected.
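This segment-level adjustment is easy to express in code. The sketch below is a plain Python illustration with hypothetical function names (`point_adjust`, `f1_score`) and toy labels. It is not taken from the paper's evaluation code.

```python
from typing import List, Sequence


def point_adjust(pred: Sequence[int], truth: Sequence[int]) -> List[int]:
    """Mark a whole ground-truth segment as detected if any point in it is."""
    adjusted = list(pred)
    n = len(truth)
    i = 0
    while i < n:
        if truth[i]:
            j = i
            while j < n and truth[j]:
                j += 1  # j is one past the end of the anomalous segment
            if any(pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted


def f1_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Point-wise F1 score of 0/1 predictions."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


truth = [0, 1, 1, 1, 0, 0, 1, 1, 0]
pred = [0, 0, 1, 0, 0, 1, 0, 0, 0]
adjusted = point_adjust(pred, truth)
raw_f1 = f1_score(pred, truth)
adjusted_f1 = f1_score(adjusted, truth)
```

In this toy case the first segment is hit once and is therefore counted in full, the second segment is missed entirely, and the false alarm at index 5 is left untouched. The adjusted F1 is accordingly higher than the raw point-wise F1.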

Benchmark Results

Table 1 and Table 2 compare NCAD with state-of-the-art methods on the univariate and multivariate datasets, respectively. The KPI data set is evaluated in both unsupervised and supervised settings. On the univariate datasets, NCAD outperforms the compared methods on the Yahoo data set and achieves almost the same performance as the best method on the KPI data set.

On the multivariate datasets, NCAD is far ahead of the other methods on MSL and SWaT, almost as good as the best method, THOC, on SMAP, and second on SMD after OmniAnomaly.

Ablation study

To understand which parts of the method are effective, an ablation study was performed. The first row is the full NCAD configuration, followed by rows marked with "-", in which the respective component, such as PO or COE, was removed. The contextual loss function makes a significant contribution, and the data augmentation methods also improve performance.

The following table shows the results of the ablation study on the Yahoo data set.

Unsupervised to supervised scaling

To investigate how performance changes when moving from unsupervised to semi-supervised to supervised learning, performance was measured on the Yahoo dataset while varying the number of ground-truth labels.

As expected, performance improves monotonically with the number of ground-truth labels. Using synthetic anomalies (PO or COE) improves performance significantly when only a small number of labels is available. Injecting anomalies that are well matched to the true anomaly type improves performance more than relying on labeled data alone.

On the other hand, if the injected anomalies differ from the anomalies to be detected (here, the COE case), performance is worse than what is obtained when plenty of labeled data is available.

Using specialized anomaly injection

Although generic anomaly injections were used in the benchmark, a byproduct of the method is that anomaly injections can be designed to mimic true anomalies, which guides the model toward detecting the desired class of anomalies. Designing such injections is often simpler than collecting enough true anomaly data. Table (a) below illustrates the effectiveness of this approach.

This approach is effective when the anomalies are subtle and close to normal data, and when there is prior knowledge of the type of anomaly to be detected. However, such prior knowledge is not always available, and designing injections that mimic true anomalies can be time-consuming. This limitation prevents general deployment of the approach, which is why it was not used in the benchmark evaluation.

Generalization from injected anomalies

Artificial anomalies always differ from true anomalies. Whether they are generated by COE, PO, or a more complex method, the model must bridge this gap and generalize from imperfect training examples to true anomalies. MIXUP is used to further improve the generalization of the model, and Fig. 3(b) evaluates the effect. The model is trained with injected single-point outliers, and detection performance is measured on anomalies of longer width. This experiment uses a synthetic dataset consisting of a simple sinusoidal time series plus Gaussian noise. True anomalies of varying width are added to this base dataset by convolving spike outliers with Gaussian filters of different widths. As the MIXUP ratio increases, the model generalizes better to anomalies that differ from the injected examples, and the F1 score improves.
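The synthetic data described above can be generated along the following lines. All concrete values (series length, period, spike height, filter widths) and the helper names are assumptions for this sketch. The article does not state the settings used in the paper.

```python
import math
import random
from typing import List, Tuple


def gaussian_kernel(sigma: float) -> List[float]:
    """Gaussian filter with peak value 1, truncated at three sigma."""
    radius = max(1, int(3 * sigma))
    return [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]


def convolve_same(signal: List[float], kernel: List[float]) -> List[float]:
    """Discrete convolution that keeps the length of `signal`."""
    r = len(kernel) // 2
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < len(signal):
                out[j] += s * w
    return out


def make_series(n: int, period: float, noise_std: float, spike_pos: int,
                spike_height: float, sigma: float, rng: random.Random
                ) -> Tuple[List[float], List[float]]:
    """Sinusoid plus Gaussian noise, with one spike widened by a Gaussian filter.

    Returns the series and the anomaly component that was added to it.
    """
    base = [math.sin(2 * math.pi * t / period) + rng.gauss(0.0, noise_std)
            for t in range(n)]
    spikes = [0.0] * n
    spikes[spike_pos] = spike_height
    bump = convolve_same(spikes, gaussian_kernel(sigma))
    return [b + a for b, a in zip(base, bump)], bump


rng = random.Random(0)
narrow_series, narrow = make_series(200, 50.0, 0.1, 100, 3.0, sigma=2.0, rng=rng)
wide_series, wide = make_series(200, 50.0, 0.1, 100, 3.0, sigma=4.0, rng=rng)

# Number of points where the anomaly exceeds 10% of its height.
narrow_width = sum(1 for v in narrow if v > 0.3)
wide_width = sum(1 for v in wide if v > 0.3)
```

Increasing `sigma` widens the anomaly while keeping its peak height, which gives a family of test anomalies that move progressively further from the single-point spikes seen during training.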

summary

As intended, NCAD achieves performance equal to or better than the state of the art across univariate and multivariate data and across unsupervised, semi-supervised, and supervised settings. In terms of the taxonomy above, the results suggest that methods combining expressive neural representations (the compression-based family) with data augmentation outperform predictive and restorative modeling methods.

On the other hand, and not surprisingly, data augmentation that fits the context of the anomalies to be detected can greatly improve performance, but when it does not fit, the effect is limited. How best to perform such contextual data augmentation is a subject for future research.


友安昌幸 (Masayuki Tomoyasu): JDLA G certificate 2020#2, E certificate2021#1 Japan Society of Data Scientists, DS Certificate Japan Society for Innovation Fusion, DX Certification Expert Amiko Consulting LLC, CEO