2 Time Series Machine Learning Libraries

3 main points
✔️ Two time series machine learning libraries have been announced in quick succession
✔️ Merlion provides a time series anomaly detection and prediction library
✔️ Darts is an attempt to democratize and integrate modern machine learning forecasting approaches under a common, user-friendly API

Merlion: A Machine Learning Library for Time Series
written by Aadyot Bhatnagar, Paul Kassianik, Chenghao Liu, Tian Lan, Wenzhuo Yang, Rowan Cassius, Doyen Sahoo, Devansh Arpit, Sri Subramanian, Gerald Woo, Amrita Saha, Arun Kumar Jagota, Gokulakrishnan Gopalakrishnan, Manpreet Singh, K C Krithika, Sukumar Maddineni, Daeki Cho, Bo Zong, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Steven Hoi, Huan Wang
(Submitted on 20 Sep 2021)
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)


Darts: User-Friendly Modern Machine Learning for Time Series
written by Julien Herzen, Francesco Lässig, Samuele Giuliano Piazzetta, Thomas Neuer, Léo Tafti, Guillaume Raille, Tomas Van Pottelbergh, Marek Pasieka, Andrzej Skrodzki, Nicolas Huguenin, Maxime Dumonal, Jan Kościsz, Dennis Bader, Frédérick Gusset, Mounir Benheddi, Camila Williamson, Michal Kosinski, Matej Petrik, Gaël Grosch
(Submitted on 7 Oct 2021 (v1), last revised 8 Oct 2021 (this version, v2))
Comments: Published on arxiv.

Subjects:  Machine Learning (cs.LG); Computation (stat.CO)


The images used in this article are from the paper, the introductory slides, or were created based on them.

introduction

Two time-series machine learning libraries have been announced in quick succession, so I'll introduce them all together.

The first is Merlion, from a group at Salesforce.

Time series are ubiquitous in monitoring the behavior of complex systems in real-world applications such as IT operations management, manufacturing, and cyber security. They can represent key metrics of computing resources, business indicators, or feedback from marketing campaigns on social networking sites. In all of these applications, it is important to accurately predict the trends and values of key metrics and to quickly and accurately detect anomalies in those metrics. In fact, in the software industry, anomaly detection that notifies operators promptly is one of the key machine learning techniques for automating the identification of problems and incidents to improve the availability of IT systems.

Although several tools have been proposed to account for the different potential applications of time series analysis, there are still several issues with today's industry workflows for time series analysis. These include inconsistent interfaces between data sets and models, inconsistent metrics between academic papers and industrial applications, and a relative lack of support for practical features such as post-processing, AutoML, and model combination. These issues make it difficult to benchmark across multiple datasets and settings, across a variety of models, and to make data-driven decisions about the best model for the target task.

Merlion, the Python library for time series intelligence presented here, provides an end-to-end machine learning framework that covers loading and transforming data, building and training models, post-processing model output, and evaluating model performance. It supports a variety of time series learning tasks, including forecasting and anomaly detection for both univariate and multivariate time series. Key features of Merlion include:

  • A standardized and easily extensible framework for data loading, preprocessing and benchmarking a wide range of time series forecasting and anomaly detection tasks
  • A library of diverse models for both anomaly detection and prediction, integrated under a shared interface. The models include classical statistical methods, decision tree ensembles, and deep learning methods. Advanced users can fully combine each model as needed
  • DefaultDetector and DefaultForecaster model abstractions that offer efficient, robust performance and a starting point for new users
  • AutoML for automated hyperparameter adjustment and model selection
  • Practical, industry-inspired post-processing rules for anomaly detectors that make anomaly scores easier to interpret while reducing false-positive rates.
  • Easy-to-use ensembles that combine the outputs of multiple models for more robust performance
  • A flexible evaluation pipeline that simulates live model deployment and retraining in a production environment to evaluate performance in both prediction and anomaly detection
  • Native support for visualizing model predictions

Table 1 shows the feature table of Merlion compared to other libraries.

The Merlion code can be found on GitHub, and the API documentation is available online.

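Typical usage of the library is as follows, where train_data and test_data are Merlion TimeSeries objects:

# Import the default anomaly detector and its config
from merlion.models.defaults import DefaultDetectorConfig, DefaultDetector
model = DefaultDetector(DefaultDetectorConfig())
# Fit the detector on the training split before labeling new data
model.train(train_data=train_data)
test_pred = model.get_anomaly_label(time_series=test_data)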

Architecture and Design Principles

Merlion's modular architecture consists of five layers. The data layer loads raw data, transforms it into Merlion's TimeSeries data structure, and performs any necessary pre-processing. The modeling layer supports a wide range of models for forecasting and anomaly detection, including AutoML for automated hyperparameter tuning. The post-processing layer provides practical solutions for improving interpretability and reducing the false-positive rate of anomaly detection models. The ensemble layer supports model combination and model selection, and the evaluation layer implements evaluation metrics and pipelines that simulate live deployment. Fig. 1 illustrates the relationship between these modules.

data layer

Merlion's core data structure is the TimeSeries. It represents a general multivariate time series T as a collection of UnivariateTimeSeries U(i). This formulation reflects the reality that individual univariates may be sampled at different rates and may contain missing data at different timestamps. After a TimeSeries has been initialized from raw data, the merlion.transform module provides preprocessing operations that can be applied before passing the TimeSeries to a model. These include resampling, normalization, moving averages, and time differencing.
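
To make the idea concrete, here is a minimal sketch of such a data model in plain Python. It is not Merlion's implementation: the class names Univariate and MultiSeries and the difference function are hypothetical, chosen only to show how univariates with different sampling rates can live in one container and how a differencing transform might be applied before modeling.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Univariate:
    """One variable with its own timestamps (hypothetical, not Merlion's class)."""
    name: str
    times: List[float]
    values: List[float]


@dataclass
class MultiSeries:
    """A multivariate series as a collection of independently sampled univariates."""
    univariates: Dict[str, Univariate] = field(default_factory=dict)

    def add(self, u: Univariate) -> None:
        self.univariates[u.name] = u


def difference(u: Univariate) -> Univariate:
    """Time differencing: x[t] - x[t-1], dropping the first timestamp."""
    diffs = [b - a for a, b in zip(u.values, u.values[1:])]
    return Univariate(u.name, u.times[1:], diffs)


series = MultiSeries()
series.add(Univariate("cpu", [0, 60, 120, 180], [10.0, 12.0, 15.0, 14.0]))
series.add(Univariate("memory", [0, 120], [40.0, 43.0]))  # sampled half as often
cpu_diff = difference(series.univariates["cpu"])
```

Because each univariate keeps its own timestamps, nothing forces the two variables onto a shared grid until a resampling transform is applied.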

model layer

Since no single model works well for all time series and all use cases, it is important to give users the flexibility to choose from a wide range of heterogeneous models. Merlion implements a variety of models for both forecasting and anomaly detection. To make all these choices transparent to the user, all Merlion models are integrated under two generic APIs, one for forecasting and one for anomaly detection. All models are initialized with a config object that contains implementation-specific hyperparameters, and all support the model.train(time_series) method. Given a general multivariate time series, a forecaster is trained to predict the values of a single target univariate. You can then get the model's forecast for a set of future timestamps by calling model.forecast(time_stamps).

Similarly, you can call model.get_anomaly_score(time_series) to get an anomaly detector's sequence of anomaly scores. Forecast-based anomaly detectors provide both model.forecast(time_stamps) and model.get_anomaly_score(time_series).

For models that require additional computation, Merlion provides the Layer interface, which is the basis for its AutoML functionality. Layers can be used to implement additional logic on top of existing model definitions that does not fit well into the model code itself, such as seasonality detection or hyperparameter tuning. Layers have three methods: generate_theta to generate candidate hyperparameters θ, evaluate_theta to evaluate the quality of θ, and set_theta to apply the selected θ to the underlying model. Another class, ForecasterAutoMLBase, implements forecast and train methods that leverage the methods of the Layer class to complete the forecasting model. Finally, all models support conditioning their predictions on historical data time_series_prev that is different from the data used for training. These conditional predictions can be obtained by calling model.forecast(time_stamps, time_series_prev) or model.get_anomaly_score(time_series, time_series_prev).
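
The three-method pattern can be sketched as follows. This is only an illustration of the generate/evaluate/set idea under simplified assumptions: MovingAverageModel and WindowSearchLayer are hypothetical classes, and the method signatures are deliberately reduced and do not match Merlion's.

```python
class MovingAverageModel:
    """Toy forecaster: predicts the mean of the last `window` values."""

    def __init__(self, window=1):
        self.window = window
        self.history = []

    def train(self, values):
        self.history = list(values)

    def forecast(self, steps):
        recent = self.history[-self.window:]
        return [sum(recent) / len(recent)] * steps


class WindowSearchLayer:
    """Hypothetical layer that wraps a model with a hyperparameter search."""

    def __init__(self, model, candidates):
        self.model = model
        self.candidates = candidates

    def generate_theta(self, values):
        # Candidate hyperparameters: only windows that fit in the training data.
        return [w for w in self.candidates if w < len(values)]

    def evaluate_theta(self, theta, values):
        # Hold out the last point and score the one-step forecast error.
        probe = MovingAverageModel(window=theta)
        probe.train(values[:-1])
        return abs(probe.forecast(1)[0] - values[-1])

    def set_theta(self, theta):
        self.model.window = theta

    def train(self, values):
        thetas = self.generate_theta(values)
        best = min(thetas, key=lambda th: self.evaluate_theta(th, values))
        self.set_theta(best)
        self.model.train(values)
        return best


layer = WindowSearchLayer(MovingAverageModel(), candidates=[1, 2, 4])
best_window = layer.train([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

The search logic stays outside the model class, so the same wrapper could be reused with any model that exposes the tuned attribute.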

postprocessing layer

All anomaly detectors have a post_rule that applies important post-processing to the output of model.get_anomaly_score(time_series). This includes calibration and thresholding rules. The post-processed anomaly scores can be obtained directly by calling model.get_anomaly_label(time_series).

Ensemble and Model Selection

An ensemble is structured as a model that represents a combination of several underlying models. To this end, we have a base EnsembleBase class that abstracts the process of obtaining forecasts Y1, ..., Ym from m underlying models on a single time series T, and a Combiner class that then combines the results Y1, ..., Ym into the output of the ensemble. These combinations include traditional mean ensembles as well as model selection based on evaluation metrics such as sMAPE.
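
As a rough sketch of the combining step only, the function below merges per-model forecasts timestamp by timestamp. The function name combine and the example forecasts are made up for illustration and are not part of Merlion.

```python
from statistics import mean, median


def combine(predictions, how="mean"):
    """Combine per-model forecasts (one list per model) timestamp by timestamp."""
    reducer = {"mean": mean, "median": median}[how]
    return [reducer(step) for step in zip(*predictions)]


# Forecasts Y1..Y3 from three hypothetical models over the same 3 timestamps.
y1 = [10.0, 11.0, 12.0]
y2 = [12.0, 11.0, 10.0]
y3 = [20.0, 14.0, 11.0]

mean_forecast = combine([y1, y2, y3], how="mean")
median_forecast = combine([y1, y2, y3], how="median")
```

The median is less affected by the one model that overshoots at the first timestamp, which is one reason to offer more than one combiner.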

evaluation pipeline

When a time series model is deployed live in production, training and inference are typically not performed in batch on the complete time series. Rather, the model is retrained at regular intervals, and inference is performed in streaming mode when possible. To simulate this setting more realistically, we provide an EvaluatorBase class that implements the following evaluation loop.

  1. Train an initial model with recent historical training data
  2. Periodically (e.g., once a day), retrain the entire model on the most recent data. This can be the entire history or a more limited window (e.g., 4 weeks).
  3. Get the model's predictions (forecasts or anomaly scores) for the time series values that occur between retrainings. The user can customize whether this is done in batch, in streaming mode, or at some intermediate cadence.
  4. Compare the model's predictions with the correct answers and report quantitative metrics
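
The loop above can be sketched in a few lines. This is a simplified stand-in, not EvaluatorBase itself: the function name rolling_evaluation is hypothetical, the "model" simply predicts the mean of its training window, and time is counted in steps rather than days.

```python
def rolling_evaluation(values, initial_train, retrain_every, window=None):
    """Simulate deployment: retrain every `retrain_every` steps, predict in between.

    The "model" is a stand-in that predicts the mean of its training window.
    Returns the mean absolute error over all predictions.
    """
    errors = []
    t = initial_train
    while t < len(values):
        # Steps 1-2: (re)train on all history so far, or on a limited recent window.
        start = 0 if window is None else max(0, t - window)
        train = values[start:t]
        model_mean = sum(train) / len(train)

        # Step 3: predict every point that arrives before the next retraining.
        horizon = values[t:t + retrain_every]
        # Step 4: compare predictions with the ground truth.
        errors.extend(abs(model_mean - actual) for actual in horizon)
        t += retrain_every
    return sum(errors) / len(errors)


data = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
mae_full_history = rolling_evaluation(data, initial_train=4, retrain_every=2)
mae_recent_window = rolling_evaluation(data, initial_train=4, retrain_every=2, window=2)
```

On this toy series with a level shift, retraining on a short recent window adapts faster than retraining on the full history, which is the kind of trade-off such a pipeline is meant to expose.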

Merlion also provides a wide range of evaluation metrics for both forecasting and anomaly detection, implemented as the enumerations ForecastMetric and TSADMetric, respectively. Finally, we provide benchmarking scripts that allow the user to apply this logic to easily evaluate model performance on the datasets contained in the ts_datasets module.

time-series forecasting

Merlion contains several models for univariate time series forecasting. These include classical statistical methods such as ARIMA, SARIMA, and ETS (error, trend, and seasonality), more recent algorithms such as Prophet, MSES (Cassius et al., 2021), an earlier algorithm created by the authors' group, and a deep autoregressive LSTM. The multivariate forecasting models are based on autoregressive and decision tree ensemble algorithms. For the autoregressive algorithm, we employ a vector autoregressive (VAR) model that captures the relationships between multiple sequences as they change over time. For the decision tree ensembles, we consider random forests and gradient boosting as base models. We allow these models to generate forecasts for any prediction horizon, as traditional models such as VAR do. Furthermore, all multivariate forecasting models share a common API with the univariate forecasting models, so they can be used for both univariate and multivariate forecasting tasks.


The AutoML module for time series forecasting models is slightly different from AutoML for traditional machine learning models. This is because it considers not only the usual optimization of hyperparameters but also the detection of some properties of the time series. For example, the hyperparameters of SARIMA include the autoregressive order, the difference order, the moving average order, the seasonal autoregressive order, the seasonal difference order, the seasonal moving average order, and the seasonality.

We further reduce the training time of the AutoML module in the following way. We first obtain an initial list of candidate models that achieve good performance with relatively few optimization iterations. We then retrain each of these candidates until it converges and select the best model by AIC.


Merlion's ensemble of predictors allows the user to transparently combine models in two ways. First, it supports traditional ensembles that report the mean or median value predicted by all models at each timestamp. Second, we support automatic model selection. When performing model selection, we split the training data into training and validation data, train each model on the training data, and retrieve the predictions for the validation data. It then evaluates the quality of these predictions using user-specified metrics and, after retraining on the full training data, returns the model that achieved the best performance.
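
The model selection procedure described above can be sketched as follows. The candidate "models" here are two trivial forecasting functions and the function select_model is hypothetical; the point is only to show the split, score, pick, and refit sequence.

```python
def last_value(history, steps):
    """Naive forecaster: repeat the most recent observation."""
    return [history[-1]] * steps


def overall_mean(history, steps):
    """Naive forecaster: repeat the mean of all observations."""
    return [sum(history) / len(history)] * steps


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def select_model(models, train, metric, valid_size):
    """Score each model on a held-out tail of the training data; lower is better."""
    fit, valid = train[:-valid_size], train[-valid_size:]
    scores = {name: metric(valid, model(fit, valid_size))
              for name, model in models.items()}
    best = min(scores, key=scores.get)
    return best, scores


models = {"last_value": last_value, "overall_mean": overall_mean}
train = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0]
best_name, scores = select_model(models, train, mae, valid_size=2)
# "Retrain" the winner on the full training data before forecasting.
final_forecast = models[best_name](train, 3)
```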

There are many ways to evaluate the accuracy of a prediction model. Merlion's ForecastMetric provides MAE, RMSE, sMAPE, MARRE, and other metrics.
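
For reference, three of these metrics can be written out directly. The definitions below are the common textbook ones, with sMAPE on the 0 to 200 percent scale; they are a sketch and are not taken from Merlion's ForecastMetric code.

```python
import math


def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)


def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y))


def smape(y, yhat):
    """Symmetric MAPE in percent (0 to 200). Undefined if y and yhat are both 0."""
    return 200.0 / len(y) * sum(
        abs(a - p) / (abs(a) + abs(p)) for a, p in zip(y, yhat)
    )


actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
metrics = {
    "MAE": mae(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "sMAPE": smape(actual, predicted),
}
```

MAE and RMSE are in the units of the series, while sMAPE is scale-free, which makes it easier to average across many series of different magnitudes.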

time-series anomaly detection

Merlion contains several models dedicated to univariate time series anomaly detection. These fall into two groups: forecast-based and statistical. Merlion's forecasters are easily adapted to anomaly detection because they predict the values of a specific univariate in a general time series. The anomaly score is the residual between the predicted and actual time series values, optionally normalized by the predicted standard error of the underlying forecaster. The univariate statistical methods are Spectral Residual and two simple baselines, WindStats and ZMS. In addition, we offer both statistical methods and deep learning models capable of handling both univariate and multivariate anomaly detection. The statistical methods include Isolation Forest and Random Cut Forest. The deep learning models include autoencoders, deep autoencoding Gaussian mixture models (DAGMM), LSTM encoder-decoders, and variational autoencoders.
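
The forecast-based score described above amounts to very little code. The sketch below assumes the forecasts and standard errors are already available as plain lists; the function name residual_scores is hypothetical.

```python
def residual_scores(actual, predicted, stderr=None):
    """Anomaly score = actual - predicted, optionally divided by the standard error."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    if stderr is None:
        return residuals
    return [r / s for r, s in zip(residuals, stderr)]


actual = [10.0, 10.5, 30.0, 9.5]
predicted = [10.0, 10.0, 10.0, 10.0]
stderr = [0.5, 0.5, 0.5, 0.5]

raw = residual_scores(actual, predicted)
normalized = residual_scores(actual, predicted, stderr)
```

Dividing by the standard error expresses each residual in units of the forecaster's own uncertainty, so the third point stands out as 40 standard errors from the forecast.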

Merlion supports two key post-processing steps for anomaly detectors: calibration and thresholding. Calibration is important for improving the interpretability of the model, while thresholding converts a sequence of continuous anomaly scores into discrete labels and reduces the false-positive rate.

All of Merlion's anomaly detectors return an anomaly score st that is positively correlated with the severity of the anomaly. However, the scale and distribution of these anomaly scores vary widely. For example, Isolation Forest returns an anomaly score st ∈ [0, 1], Spectral Residual returns an unnormalized saliency map, and DAGMM returns a negative log probability.


To use a model successfully, you need to be able to interpret the anomaly scores it returns. Without help, this makes many models immediately unusable for users unfamiliar with the particular implementation. Calibration fills this gap by allowing all anomaly scores to be interpreted as z-scores, i.e. values drawn from a standard normal distribution. This simple post-processing step dramatically improves the interpretability of the anomaly scores returned by individual models.
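
One plausible way to perform such a calibration is to map each raw score to its empirical quantile among the training scores and then through the inverse standard normal CDF. The sketch below takes that approach as an assumption; it is not necessarily how Merlion calibrates scores, and the class name RankCalibrator is hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


class RankCalibrator:
    """Map raw anomaly scores to approximate z-scores via their empirical rank."""

    def __init__(self, train_scores):
        self.sorted_scores = sorted(train_scores)

    def __call__(self, score):
        n = len(self.sorted_scores)
        rank = bisect_right(self.sorted_scores, score)  # training scores <= score
        # The offset keeps the quantile strictly inside (0, 1) so inv_cdf is finite.
        quantile = (rank + 0.5) / (n + 1)
        return NormalDist().inv_cdf(quantile)


# Two hypothetical detectors whose raw scores live on very different scales.
bounded = RankCalibrator([i / 100 for i in range(1, 100)])         # scores in (0, 1)
unbounded = RankCalibrator([float(i * i) for i in range(1, 100)])  # scores up to 9801

z_bounded = bounded(0.995)       # more extreme than every training score
z_unbounded = unbounded(9900.0)  # likewise, on a much larger raw scale
z_typical = bounded(0.5)         # a middle-of-the-pack score
```

After calibration, the two extreme scores land on the same z-score even though their raw values differ by four orders of magnitude, so a single threshold can be shared across detectors.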

Threshold processing

The most common way to determine whether an individual timestamp t is an anomaly is to compare the anomaly score st with a threshold τ. However, in many real-world systems, a human is alerted each time an anomaly is detected. A high false-positive rate increases the burden on the user, who must investigate each alert, and may result in a system that the user does not trust. One way to avoid this problem is to include additional automated checks that must be passed before a human is alerted. These checks can significantly improve precision without adversely affecting recall, and Merlion implements all of them in the user-configurable AggregateAlarms post-processing rule.
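
The article does not list the specific checks, so the sketch below assumes two for illustration: a minimum number of threshold crossings inside a recent window, and suppression of repeat alerts for a few steps after one has fired. The function aggregate_alarms and its parameters are hypothetical and do not mirror the AggregateAlarms configuration.

```python
def aggregate_alarms(scores, threshold, min_in_window=2, window=3, suppress=3):
    """Return the indices at which an alert would be sent to a human.

    Assumed checks (illustrative only):
      1. |score| must exceed `threshold`.
      2. At least `min_in_window` of the last `window` points must exceed it.
      3. After an alert, stay silent for the next `suppress` points.
    """
    above = [abs(s) > threshold for s in scores]
    alerts = []
    silent_until = -1
    for t, is_above in enumerate(above):
        recent = above[max(0, t - window + 1): t + 1]
        if is_above and sum(recent) >= min_in_window and t > silent_until:
            alerts.append(t)
            silent_until = t + suppress
    return alerts


z_scores = [0.1, 3.5, 0.2, 0.1, 3.2, 3.8, 4.1, 3.9, 0.3]
naive = [t for t, s in enumerate(z_scores) if abs(s) > 3.0]
alerts = aggregate_alarms(z_scores, threshold=3.0)
```

Plain thresholding would page the operator five times on this series, while the extra checks reduce that to one alert for the sustained anomaly and ignore the isolated spike.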


Because both time series and their anomalies are so diverse, no single model can be expected to be optimal for all use cases. As a general rule, a heterogeneous ensemble of models is likely to generalize better than the individual models within that ensemble. Since the anomaly scores of all Merlion models can be interpreted as z-scores, an ensemble of anomaly detectors can be constructed by simply reporting the average calibrated anomaly score returned by the individual models and then applying a threshold. Empirically, we find that the ensemble reliably achieves the strongest or most competitive performance across multiple open source and internal datasets for both univariate (Table 10) and multivariate (Table 13) anomaly detection.

evaluation metrics

A key challenge in designing appropriate evaluation metrics for time series anomaly detection lies in the fact that anomalies are almost always time windows rather than discrete points. Thus, while it is easy to compute pointwise (PW) precision, recall, and F1 scores for a predicted anomaly label sequence compared to the ground-truth label sequence, these metrics do not reflect the quantities of interest to human operators.

The point-adjusted (PA) metric is one solution to this problem. If any point in a true anomaly window is labeled as anomalous, then all points in the window are treated as true positives. If no point in the window is flagged as an anomaly, then all of its points are labeled as false negatives. Anomalies predicted outside any true anomaly window are treated as false positives. Precision, recall, and F1 can then be calculated from these adjusted true/false positive/negative counts. However, the drawback of the PA metric is that it is biased toward rewarding models for detecting long anomalies rather than short ones.

An alternative is the revised point-adjusted (RPA) metric. In this case, if any point in a true anomaly window is labeled as anomalous, a single true positive is recorded. If no point in the window is flagged as anomalous, a single false negative is recorded. Any anomaly predicted outside a true anomaly window is treated as a false positive.
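
The difference between the three counting schemes is easiest to see on a small example. The sketch below is a plain-Python illustration with hypothetical helper names (windows, counts, f1); under RPA, a missed window is counted as a single false negative.

```python
def windows(labels):
    """Return (start, end) index pairs (end exclusive) of consecutive runs of 1s."""
    spans, start = [], None
    for i, flag in enumerate(list(labels) + [0]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    return spans


def counts(truth, pred, mode):
    """Return (TP, FP, FN) under pointwise 'PW', point-adjusted 'PA', or 'RPA'."""
    tp = fn = 0
    for start, end in windows(truth):
        hits = sum(pred[start:end])
        size = end - start
        if mode == "PW":
            tp, fn = tp + hits, fn + size - hits
        elif mode == "PA":
            # The whole window counts as detected if any point in it is flagged.
            tp, fn = (tp + size, fn) if hits else (tp, fn + size)
        else:
            # RPA: one true positive or one false negative per window.
            tp, fn = (tp + 1, fn) if hits else (tp, fn + 1)
    fp = sum(1 for t, p in zip(truth, pred) if p and not t)
    return tp, fp, fn


def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# One long anomaly (detected at a single point), one short anomaly (missed),
# and one false alarm outside any anomaly window.
truth = [0, 1, 1, 1, 1, 0, 0, 1, 0, 0]
pred = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

pw, pa, rpa = (counts(truth, pred, m) for m in ("PW", "PA", "RPA"))
```

Here the F1 scores are 2/7 for PW, 0.8 for PA, and 0.5 for RPA: PA gives full credit for the long window and barely penalizes the missed short one, which is the bias described above.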


Since Merlion is a library, it comes equipped with the various methods described so far. The paper also includes a performance comparison, which is a useful reference when selecting a model.

We show benchmark results generated using Merlion with popular baseline models across several time series datasets.

univariate forecasting

We primarily evaluate our models on the M4 benchmark, a well-known time series forecasting competition. The dataset contains 100,000 time series from a variety of domains, including financial, industry, and demographic forecasting, with sampling frequencies ranging from hourly to yearly. Table 2 summarizes the dataset. In addition, we evaluate on three internal datasets of cloud KPIs, which are described in Table 3. To reduce the impact of outliers, we show both the mean and median MAPE for each method.

We compare ARIMA, Prophet (Taylor and Letham, 2017), ETS (Error, Trend, Seasonality), and MSES. All of these are implemented using Merlion.

Tables 4 and 5 show the performance of each model on the public and internal datasets, respectively; Table 6 shows the average improvement achieved using the autoML module.

multivariate forecasting

We collect public and internal datasets (Table 7) and split each into training and test partitions. For some datasets, we resample the data at a specified granularity. For each time series, we train the model on the training split and predict the first univariate as the target sequence. We do not retrain the model. Instead, we use the evaluation pipeline to incrementally obtain forecasts for the test split using a rolling window. We predict the time series values for the next three timestamps while conditioning the prediction on the previous 21 timestamps. We obtain these 3-step predictions for all timestamps of the test split and evaluate the quality of the predictions using sMAPE if possible, otherwise using RMSE.
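
The rolling-window protocol can be sketched as below. The forecaster is a trivial stand-in (the mean of the context window) and the names rolling_forecasts and window_mean are hypothetical; only the 21-step context and 3-step horizon come from the description above.

```python
def rolling_forecasts(values, start, forecaster, context=21, horizon=3):
    """Collect `horizon`-step forecasts across the test split without retraining.

    Each forecast is conditioned on the `context` most recent observations.
    Assumes start >= context so that a full context window always exists.
    """
    results = []
    for t in range(start, len(values) - horizon + 1, horizon):
        history = values[t - context:t]
        predicted = forecaster(history, horizon)
        results.append((t, predicted, values[t:t + horizon]))
    return results


def window_mean(history, steps):
    """Stand-in forecaster: repeat the mean of the context window."""
    return [sum(history) / len(history)] * steps


series = [float(i % 7) for i in range(60)]  # toy series with period 7
folds = rolling_forecasts(series, start=30, forecaster=window_mean)
```

Each fold holds the forecast origin, the three predicted values, and the three actual values, so any of the forecasting metrics can then be computed over the concatenated folds.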

The multivariate predictive models used are based on autoregressive and decision tree ensemble algorithms. We compare the VAR, the GB Forecaster based on the gradient boosting algorithm, and the RF Forecaster based on the random forest algorithm.

Table 8 shows the performance of each model. GBForecaster achieves the best results on three of the four data sets. The VAR model shows competitive performance on only one data set. For this reason, we consider GBForecaster to be a good "default" model for new users and early exploration.