# Do We Really Need Deep Learning Models For Time Series Forecasting?

*3 main points* ✔️ In the domain of time series forecasting, deep learning models have recently shown rapid performance improvements. This raises the question of whether classical machine learning models are still necessary, which motivated this large-scale survey and comparison experiment.

✔️ GBRT is used as a representative of classical learning models. The representation of inter-sequence dependencies realized by deep learning models was replaced by feature engineering-based windowing of the input.

✔️ With preprocessing, the improved GBRT performs as well as or significantly better than several deep learning models on both univariate and multivariate data sets.

Do We Really Need Deep Learning Models for Time Series Forecasting?

written by Shereen Elsayed, Daniela Thyssens, Ahmed Rashed, Hadi Samer Jomaa, Lars Schmidt-Thieme

(Submitted on 6 Jan 2021 (v1), last revised 20 Oct 2021 (this version, v2))

Comments: arXiv

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

## Introduction

Over the past few years, deep learning-based frameworks for time series forecasting have been reported to significantly outperform classical parametric (autoregressive) approaches. The usual argument is that traditional approaches may fail to capture the information carried by a mixture of long- and short-term patterns, so many deep learning methods focus on capturing nonlinear dependencies across time. These approaches have not only been reported to outperform traditional methods such as ARIMA and simple machine learning models such as GBRT, but have also raised the expectation that state-of-the-art time series forecasting requires deep learning.

However, since the publication of the paper "Are we making much progress?" in the field of recommender systems, it has become clear that the performance of deep learning approaches in the various research areas of machine learning needs to be checked regularly against simple yet effective models in order to sustain the credibility of progress in each area. Apart from the increasing complexity of time series forecasting models, a further concern is that the literature approaches the forecasting problem one-sidedly through ever more refined deep learning models. This limits the diversity of existing solution approaches to a problem that shows one of the highest levels of diversity when applied in the real world.

In this work, we show that with a carefully constructed input handling structure, simple yet powerful ensemble models such as GBRT can compete with and even outperform many DNN models in the field of time series forecasting.

The feature-engineered multi-output GBRT model is evaluated with respect to the following two research questions.

1. Within a window-based learning framework for time series forecasting, what is the effect of carefully constructing the input and output structure of the GBRT model?

2. How does a simple yet properly configured GBRT model compare to state-of-the-art deep learning time series forecasting frameworks?

The evaluation covers two types of forecasting tasks: univariate and multivariate forecasting. The GBRT model is compared against state-of-the-art deep learning approaches published at top research conferences.

The overall contributions of this study are as follows.

- GBRT: We elevate GBRT, a simple machine learning method, to the standard of competitive DNN time series forecasting models by first casting it into a window-based regression framework and then feature-engineering the input and output structure of the model so that it benefits most from additional contextual information.

- Comparison with naively configured baselines: To highlight the importance of input handling for time series forecasting models, we compare the window-based input setup of GBRT with forecasts produced by traditionally configured models such as ARIMA and a naive GBRT implementation, and show empirically why it improves performance.

- Competitiveness: We compare the performance of GBRT against various state-of-the-art deep learning time series forecasting models and show its competitiveness on two types of forecasting tasks (univariate and multivariate).

## Research Procedures

### Baseline papers for comparison

We screened papers published between 2016 and 2020 at nine representative conferences (NeurIPS, KDD, etc.) and selected the baselines using the following criteria.

- Topic: only time series forecasting papers are considered.

- Data structure: asynchronous time series, graphs, and other specialized data structures are excluded.

- Reproducibility: the data is publicly available and the code is available from the authors.

- Computability: the results reported in the paper should be reproducible.

### Evaluation

GBRT configured for time series forecasting was evaluated in two settings, univariate and multivariate. To ensure comparability between the selected baselines and GBRT, all models were evaluated on the same pool of datasets (Table 1).

Electricity and Traffic were subsampled for compatibility. To match the conditions, the baseline models were re-evaluated and re-tuned under the current evaluation conditions.

The bottom four in Table 1 are multivariate data.

## Feature-engineered window-based GBRT

The GBRT models investigated, in particular the XGBoost implementation, are easy to apply and work particularly well on structured data. When applied naively to time series data, however, GBRT loses much of its flexibility: instead of being cast as a window-based regression problem, it is fitted on most of the time series as one complete, continuous sequence of data points and then predicts the remaining test portion. In contrast to this naive input handling, we follow successful time series forecasting models and restructure the time series into windowed inputs, training on the resulting multiple training instances (windows). The window length is adjustable. The GBRT model with this window-based input setup is illustrated in Fig. 1.

The first step is to use a transformation function to convert the 2D training instances (time series input windows) into 1D vectors suitable for GBRT. This function concatenates the target values y_i of all instances in the window with the covariate vector of the last instance t of the input window. The flattened vector is then passed to the GBRT model, which predicts the future target values for each instance.
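The transformation can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `make_windows`, its arguments, and the toy data are assumptions introduced here. Each input is the concatenation of the `window` most recent target values and the covariates of the last instance only, and the corresponding output is the next `horizon` target values.

```python
def make_windows(targets, covariates, window, horizon):
    """Turn one series into flat (input, output) pairs for a window-based regressor.

    targets:    list of target values, one per time step
    covariates: list of covariate vectors, one per time step
    """
    inputs, outputs = [], []
    for start in range(len(targets) - window - horizon + 1):
        end = start + window  # index one past the last input time step
        # Flatten: all target values of the window + covariates of the last instance.
        inputs.append(list(targets[start:end]) + list(covariates[end - 1]))
        # The regression target is the next `horizon` values.
        outputs.append(list(targets[end:end + horizon]))
    return inputs, outputs


# Toy data (hypothetical): 10 time steps, one covariate per step.
targets = [float(t) for t in range(10)]
covariates = [[t % 7] for t in range(10)]
X, Y = make_windows(targets, covariates, window=3, horizon=2)
```

Every row of `X` has `window + M` entries, where `M` is the number of covariates per time step, so the input stays a plain tabular feature vector of the kind tree ensembles handle well.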

GBRT does not natively support multiple outputs. However, they can be obtained through problem transformation methods such as the single-target method. Here, we chose a multi-output wrapper, which transforms the multi-output regression problem into several single-target problems. The strategy is simply to extend the number of regressors to the size of the forecast horizon: one regressor, and hence one loss function, is introduced for each step of the horizon. Each target forecast is then computed as the sum over all tree estimators of the corresponding model. This single-target setup comes with the drawback that the target variables within the forecast horizon are predicted independently, so the model cannot exploit potential relationships between them. This is precisely why the emphasis is placed on the window-based input setting: it not only turns the forecasting problem into a regression task but, more importantly, lets the model capture the autocorrelation of the target variable, which compensates for the drawback of independent multi-output forecasts. The window-based input setting significantly improves forecasting performance because the GBRT model can capture the underlying time series structure, and GBRT can therefore be regarded as an adequate machine learning baseline for advanced DNN time series forecasting models.
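The single-target strategy can be illustrated with a small, dependency-free sketch. Both classes below are hypothetical stand-ins written for this article, not the authors' implementation: `BoostedStumps` is a deliberately tiny squared-error booster with depth-1 trees that takes the place of a real GBRT library, and `MultiOutputWrapper` fits one independent copy of it per step of the forecast horizon. Libraries such as scikit-learn ship a comparable wrapper (`MultiOutputRegressor`), which would normally be used instead.

```python
class BoostedStumps:
    """Tiny squared-error gradient boosting with depth-1 trees (stand-in for a GBRT)."""

    def __init__(self, n_rounds=30, learning_rate=0.3):
        self.n_rounds = n_rounds
        self.learning_rate = learning_rate

    def fit(self, X, y):
        self.base = sum(y) / len(y)
        self.stumps = []
        pred = [self.base] * len(y)
        for _ in range(self.n_rounds):
            # Each new tree is fitted to the residuals of the current ensemble.
            residual = [yi - pi for yi, pi in zip(y, pred)]
            stump = self._fit_stump(X, residual)
            self.stumps.append(stump)
            pred = [p + self.learning_rate * self._apply(stump, row)
                    for p, row in zip(pred, X)]
        return self

    @staticmethod
    def _apply(stump, row):
        feature, threshold, left_value, right_value = stump
        return left_value if row[feature] <= threshold else right_value

    @staticmethod
    def _fit_stump(X, residual):
        best, best_err = None, float("inf")
        for feature in range(len(X[0])):
            for threshold in sorted({row[feature] for row in X}):
                left = [r for row, r in zip(X, residual) if row[feature] <= threshold]
                right = [r for row, r in zip(X, residual) if row[feature] > threshold]
                left_value = sum(left) / len(left)
                right_value = sum(right) / len(right) if right else 0.0
                err = (sum((r - left_value) ** 2 for r in left)
                       + sum((r - right_value) ** 2 for r in right))
                if err < best_err:
                    best, best_err = (feature, threshold, left_value, right_value), err
        return best

    def predict(self, X):
        # Forecast = base value + shrunken sum of all tree outputs.
        return [self.base + self.learning_rate * sum(self._apply(s, row) for s in self.stumps)
                for row in X]


class MultiOutputWrapper:
    """Fits one independent single-target regressor per step of the forecast horizon."""

    def __init__(self, make_regressor, horizon):
        self.models = [make_regressor() for _ in range(horizon)]

    def fit(self, X, Y):
        for step, model in enumerate(self.models):
            model.fit(X, [row[step] for row in Y])  # one loss per horizon step
        return self

    def predict(self, X):
        per_step = [model.predict(X) for model in self.models]
        return [list(values) for values in zip(*per_step)]


# Toy windows (hypothetical): 3 lagged targets as input, 2 future targets as output.
series = [float(t % 5) for t in range(40)]
X = [series[i:i + 3] for i in range(len(series) - 4)]
Y = [series[i + 3:i + 5] for i in range(len(series) - 4)]
model = MultiOutputWrapper(BoostedStumps, horizon=2).fit(X, Y)
forecast = model.predict(X[:1])
```

Because the two models never see each other's targets, any consistency between the forecast steps has to come from the shared lagged inputs, which is the role the window-based input plays in the text above.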

By contrast, the naively configured GBRT model is a pointwise regression model: it takes the covariates X at a given time point as input, predicts the single target value Y at that same time point, and is trained by minimizing the training loss over these points.

## Experiments and Results

### Time Series Prediction Approach in Deep Learning

The following prominent deep learning-based models are considered in the evaluation.

- TRMF (Temporal Regularized Matrix Factorization): A matrix factorization-based method. It can only capture linear dependencies in the time series data.

- LSTNet (Long- and Short-term Time-series Network): It captures local multivariate patterns and long-term dependencies.

- DARNN (Dual-Stage Attention-Based RNN): The input is passed through an attention mechanism, followed by an encoder-decoder model with an additional attention mechanism.

- DeepGlo (Deep Global Local Forecaster): It regularizes a global matrix factorization structure via temporal convolutional networks.

- TFT (Temporal Fusion Transformer): The most recent DNN among those discussed in this paper. Recurrent layers capture local dependencies and the transformer-specific self-attention layers capture long-term dependencies.

- DeepAR: An autoregressive probabilistic RNN model that estimates the parametric distribution of a time series using additional time and categorical covariates.

- DeepState (Deep State Space Model): A probabilistic generative model that uses RNNs to learn the parameterization of linear state-space models.

- DAQFF (Deep Air Quality Forecasting Framework): It consists of a two-stage feature representation: three 1D convolutional layers, two bi-directional LSTM layers, and prediction through linear layers.

### Univariate datasets

Table 2 summarizes the forecasting performance on the univariate time series datasets without using simple covariates as predictors. The overall results show that the window-based GBRT is strongly competitive, except for Traffic forecasting. Traditionally configured forecasting models such as ARIMA and GBRT (Naive), on the other hand, perform much worse, as expected. These findings highlight the importance of carefully configuring machine learning baselines and adapting them to the specific problem. Since covariates are not considered in this univariate setting, the improved performance of GBRT (W-b) can only be attributed to the rolling forecast formulation of GBRT.

For Electricity forecasting, the window-based GBRT achieves the best RMSE of all models by a substantial margin, and on WAPE and MAE it is surpassed only by TRMF, which was introduced in 2016. The attention-based DARNN model performs poorly; it was originally evaluated in a multivariate setting on stock market and indoor temperature data. LSTNet, by contrast, was originally evaluated in a univariate setting, but it had to be reimplemented for all datasets in Table 2 because of the different evaluation metrics used. For the Exchange-Rate forecasting task, LSTNet was reimplemented with w = 24. Because Table 2 shows unfavorable results for LSTNet, Table 4 provides confirming results using the originally used metrics and the original experimental setup.

Without time covariates, the traffic forecasting results are mixed: the best results on the hourly Traffic dataset are achieved by DARNN and LSTNet, whereas on the PeMSD7 dataset the window-based GBRT baseline outperforms the DNN models on two of the three metrics. When time covariates are included, however, the performance of GBRT improves significantly (Table 3). For traffic forecasting, the reconfigured GBRT baseline then outperforms all DNN approaches, including DeepGlo and the popular spatio-temporal traffic forecasting model STGCN, which achieves an RMSE of 6.77 on PeMSD7.

Overall, windowing the input and adding simple time covariates to the gradient-boosted tree model yields compelling generalization performance across the various univariate time series datasets in Tables 2 and 3. To further confirm this finding and to avoid disadvantaging the DNN models through different metrics and subsampled datasets, one-on-one experiments against the published performance results are reported below.
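For reference, the three metrics named above (RMSE, WAPE, and MAE) can be computed as follows. The formulas are the commonly used definitions; treating them as exactly the normalization used for Tables 2 and 3 is an assumption, and the toy numbers are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def wape(actual, predicted):
    """Weighted absolute percentage error: total absolute error over total absolute target."""
    return (sum(abs(a - p) for a, p in zip(actual, predicted))
            / sum(abs(a) for a in actual))


# Toy values (hypothetical).
actual = [2.0, 4.0, 6.0]
predicted = [1.0, 4.0, 8.0]
scores = {
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "WAPE": wape(actual, predicted),
}
```

RMSE penalizes large errors more heavily than MAE, while WAPE is scale-free, which is why a model can rank first on one metric and second on another, as seen for Electricity.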

**Comparison with LSTNet**

For the comparison with LSTNet, we additionally use the Solar-Energy dataset, which was introduced together with the Exchange-Rate dataset in the original LSTNet paper. Table 4 shows the results of GBRT (W-b) with time covariates and a forecast horizon of h = 24, evaluated by root relative squared error (RSE) and empirical correlation coefficient (Corr). These complementary results support the above finding that a well-configured GBRT model consistently outperforms a powerful deep learning framework such as LSTNet.
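The two metrics used in Table 4 can be sketched as below. The definitions follow the usual form of RSE and of the empirical correlation coefficient (Pearson correlation per series, averaged over series); whether every detail matches the original evaluation code is an assumption, and the function names and toy series are introduced here for illustration only.

```python
import math


def rse(actual, predicted):
    """Root relative squared error over all series and time steps.

    actual, predicted: lists of series, each series a list of values.
    """
    flat_a = [v for series in actual for v in series]
    flat_p = [v for series in predicted for v in series]
    mean_a = sum(flat_a) / len(flat_a)
    num = sum((a - p) ** 2 for a, p in zip(flat_a, flat_p))
    den = sum((a - mean_a) ** 2 for a in flat_a)
    return math.sqrt(num / den)


def corr(actual, predicted):
    """Pearson correlation per series, averaged over series.

    Assumes no series is constant (otherwise the denominator is zero).
    """
    total = 0.0
    for a, p in zip(actual, predicted):
        mean_a, mean_p = sum(a) / len(a), sum(p) / len(p)
        cov = sum((x - mean_a) * (y - mean_p) for x, y in zip(a, p))
        var_a = sum((x - mean_a) ** 2 for x in a)
        var_p = sum((y - mean_p) ** 2 for y in p)
        total += cov / math.sqrt(var_a * var_p)
    return total / len(actual)


# Toy series (hypothetical): two series with three time steps each.
actual = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
predicted = [[1.5, 2.0, 2.5], [2.0, 5.0, 6.0]]
table_row = {"RSE": rse(actual, predicted), "Corr": corr(actual, predicted)}
```

Lower is better for RSE and higher is better for Corr, so the two columns of such a table have to be read in opposite directions.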

**Comparison with probabilistic/transformer-based models**

Finally, we confirm the above findings on univariate datasets with respect to probabilistic models such as DeepAR and DeepState, as well as a transformer-based model (TFT). To compare directly with published results, we follow the experimental setup of TFT, which uses different versions of the datasets, ElectricityV2 and TrafficV2. ElectricityV2 consists of n = 370 time series of length T = 6000, while TrafficV2 consists of 963 time series of length around T = 4000.

The test periods listed in Table 1 (7 days) remain the same, and simple covariates extracted from timestamps are used in all models. The parameters of the window-based GBRT for the TrafficV2 dataset are the same as those used for the subsampled dataset, but for ElectricityV2 the parameters had to be tuned separately.

The results in Table 5 highlight the competitiveness of GBRT in the rolling-forecast configuration, but also show that considerably more powerful transformer-based models such as TFT outperform GBRT (W-b). TFT is, however, the exception: it is the only DNN model that consistently outperforms GBRT in this study, whereas probabilistic models such as DeepAR and DeepState are outperformed by GBRT on these univariate datasets.

The main finding from these results is that even simple covariates extracted primarily from timestamps significantly improved the performance of the GBRT baseline.
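As an illustration of what such timestamp-derived covariates can look like, the sketch below extracts a few calendar features with Python's standard library. The particular choice of features (hour, weekday, day of month, month) and the function name `time_covariates` are assumptions made for this example; the covariates actually used per dataset are those described in the paper.

```python
from datetime import datetime, timedelta


def time_covariates(timestamp):
    """Simple calendar features extracted from a timestamp."""
    return [
        timestamp.hour,       # 0-23
        timestamp.weekday(),  # 0 = Monday ... 6 = Sunday
        timestamp.day,        # day of month
        timestamp.month,      # 1-12
    ]


# Hypothetical hourly series covering two days.
start = datetime(2021, 1, 6, 0)
timestamps = [start + timedelta(hours=h) for h in range(48)]
covariates = [time_covariates(ts) for ts in timestamps]
```

These features cost nothing to compute and are known in advance for any future time step, which makes them a natural first addition to a tree-based baseline.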

### Multivariate datasets

The multivariate time series forecasting setting considered here is the case where the dataset natively provides multiple features, but only a single target variable has to be predicted. In this case, the external features available for the instances of the input window are more expressive than the simple time covariates extracted from timestamps.

**Comparison against DARNN with Covariates**

For this direct comparison with DARNN, the multivariate forecasting task is to predict the target value, room temperature (SML 2010) and stock price (NASDAQ100) respectively, one step ahead, given a look-back window of 10 data points, which was reported to be the best value for DARNN.

The results in Table 6 corroborate the aforementioned results in this multivariate case as well, showing that a simple, properly constructed GBRT baseline outperforms even the DNN framework using attention specifically conceptualized for multivariate prediction.

On another note, the fact that the only non-DNN baseline in the DARNN evaluation protocol was ARIMA further underlines how one-sided the choice of machine learning forecasting models has been in the field of time series forecasting. In general, care should therefore be taken not only in configuring presumably less powerful machine learning baselines, but also in assembling the pool of baselines for evaluation.

**Comparison against DAQFF**

As the final one-on-one comparison in this study, the reconfigured GBRT baseline is evaluated against the Deep Air Quality Forecasting Framework (DAQFF), an elaborate DNN model built explicitly for the air quality forecasting task. The original results for DAQFF could not be reproduced because the source code is unavailable, but we still had access to the data. The original, well-documented data preprocessing scheme and experimental setup were adopted, with a forecast window of 6 hours and a look-back window of 1 hour for both datasets. Table 7 shows that even a DNN model that is assumed to perform particularly well on this task does not meet expectations: DAQFF performs worse than a simple window-based, feature-engineered gradient boosting regression tree model.

Note that in this experiment, even the GBRT model applied in the traditional (naive) forecasting sense yields better results on the air quality datasets.

### Ablation study

The following results support the feature inclusion scheme used above: competitive results can be obtained by including only the covariates of the last time step in the flattened GBRT input window. For datasets selected from Table 1, both window-based GBRT configurations (all instances, last instance) are evaluated. The experimental setup for the datasets is the same as in the previous evaluations, except for PM2.5, where the look-back and forecast window sizes are set to 6 and 3, respectively.

The results in Table 8 show that considering only the covariates of the last instance incurs little information loss. Applying the "last instance" scheme therefore saves a considerable amount of memory and computation.
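The difference between the two configurations is easiest to see in the size of the flattened input. The sketch below is illustrative only: `flatten_window`, its `scheme` argument, and the toy window are names and values assumed for this example. For a window of w time steps with M covariates each, "last instance" yields w + M features, whereas "all instances" yields w + w·M.

```python
def flatten_window(window_targets, window_covariates, scheme="last"):
    """Flatten one input window into a 1D feature vector.

    scheme="last": targets of all time steps + covariates of the last time step
    scheme="all":  targets of all time steps + covariates of every time step
    """
    if scheme == "last":
        covariate_part = list(window_covariates[-1])
    elif scheme == "all":
        covariate_part = [v for row in window_covariates for v in row]
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return list(window_targets) + covariate_part


# Hypothetical window: w = 3 time steps, M = 2 covariates per step.
window_targets = [0.5, 0.7, 0.6]
window_covariates = [[10, 1], [11, 1], [12, 1]]
last = flatten_window(window_targets, window_covariates, "last")
full = flatten_window(window_targets, window_covariates, "all")
```

With longer windows and more covariates the gap grows linearly in w, which is where the memory and compute savings mentioned above come from.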

## Summary

In this work, we survey and replicate several recent deep learning frameworks for time series forecasting and compare them to a rolling-forecast GBRT on a variety of datasets. The experimental results show that conceptually simple models such as GBRT can compete with, and in some cases outperform, state-of-the-art DNN models through efficient feature engineering of the input and output structures.

To broaden the perspective, this finding suggests that even though deep learning models have been successful, simpler machine learning baselines should not be simply dismissed, but need to be more carefully constructed to ensure the reliability of progress in the field of time series prediction.

As a subject for future research, these results encourage the application of this window-based input setup to other simpler machine learning models such as multilayer perceptrons and support vector machines.
