# Non-stationary Transformer

*3 main points*
✔️ This is a NeurIPS 2022 accepted paper. It proposes the "non-stationary transformer", a forecasting model for time series containing non-stationary segments

✔️ The model consists of two parts, Series Stationarization and Non-stationary Attention, which resolves the dilemma between series predictability and model capability.

✔️ On six real-world datasets, the model was compared against leading conventional models and achieved a nearly 50% MSE reduction.

Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting

written by Yong Liu, Haixu Wu, Jianmin Wang, Mingsheng Long (Submitted on 01 Nov 2022, last modified on 12 Jan 2023)

Comments: NeurIPS 2022

The images used in this article are from the paper, the introductory slides, or were created based on them.

## Summary

This paper adds a twist to the transformer to improve forecasting accuracy in time series data, including non-stationary data. Transformers have been a great force in time series forecasting due to their global range modeling ability, but their performance is severely degraded in real-world non-stationary data, where the joint distribution changes over time.

Previous studies have primarily employed stationarization, which attenuates the non-stationarity of the original series, to improve forecast accuracy. However, stationarized series stripped of their non-stationarity may be of little use for forecasting bursty real-world events. This problem, referred to in this paper as over-stationarization, leads the transformer to generate indistinguishable temporal attentions for different series, hindering the predictive ability of deep models.

To address the dilemma between series predictability and model capability, the authors propose the non-stationary transformer as a generic framework with two interdependent modules: Series Stationarization and Non-stationary Attention.

Specifically, series stationarization unifies the statistics of each input and restores the saved statistics to the output to increase predictability. To counter over-stationarization, non-stationary attention is devised to approximate the distinguishable attentions that would be learned from the raw series, recovering the intrinsic non-stationary information into temporal dependencies.

The authors' non-stationary transformer framework consistently and significantly improves mainstream transformers, reducing MSE by 49.43% for Transformer, 47.34% for Informer, and 46.89% for Reformer, achieving SOTA in time series forecasting.

## Introduction

Time series forecasting has an increasingly wide range of real-world applications, including weather forecasting, energy consumption planning, and financial risk assessment. Transformers are perfectly suited for time series forecasting tasks due to their stacked structure and ability of the attention mechanism to naturally capture temporal dependencies from deep multi-level features.

However, despite their superior architectural design, it is still difficult for transformers to predict real-world time series because of the non-stationarity of the data. Non-stationary time series are characterized by statistical properties and joint distributions that change continuously over time, which makes them hard to predict. Moreover, generalizing a deep model over a changing distribution is a fundamental challenge. Previous work commonly preprocesses time series by stationarizing them, attenuating the non-stationarity of the raw series for better predictability and providing a more stable data distribution for the deep model.

However, non-stationarity is an inherent property of real-world time series and a good guide for discovering the temporal dependencies needed for prediction. Experimentally, the authors found that training on stationarized series weakens the distinguishability of the attentions learned by the transformer. While a vanilla transformer can capture different temporal dependencies from different series, as shown in Figure 1(a), a transformer trained on stationarized series tends to produce indistinguishable attentions, as shown in Figure 1(b). This problem, referred to as over-stationarization, has an unexpected side effect: it prevents the transformer from capturing eventful temporal dependencies, limits the model's predictive ability, and even induces outputs whose non-stationarity deviates greatly from the ground truth. Therefore, how to attenuate the non-stationarity of time series for better predictability while mitigating over-stationarization for model capability is an important issue for further improving forecasting performance.

Figure 1 Visualization of learned temporal attentions for different series with varying mean µ and standard deviation σ. (a) is from a vanilla transformer trained on the raw series. (b) is from a transformer trained on the stationarized series, showing similar attentions. (c) is from the non-stationary transformer, whose non-stationary attention avoids over-stationarization.

In this paper, the authors explore the effect of stationarity in time series forecasting and propose the non-stationary transformer as a general framework. It efficiently equips the transformer with a large predictive capacity for real-world time series. The proposed framework contains two interdependent modules: series stationarization to increase the predictability of non-stationary series, and non-stationary attention to mitigate over-stationarization. Technically, series stationarization employs a simple and effective normalization strategy to unify the key statistics of each series without extra parameters. Non-stationary attention approximates the attention of the non-stationary data and compensates for the non-stationarity inherent in the raw series. With this design, the non-stationary transformer secures both the great predictability of stationarized series and the important temporal dependencies found in the original non-stationary data. The method can be generalized to various transformers for further improvement. The contributions are threefold:

- The ability to predict non-stationary series is critical for real-world forecasting. A detailed analysis shows that current stationarization approaches lead to the over-stationarization problem and limit the predictive ability of transformers.

- The non-stationary transformer is proposed as a general framework, which includes series stationarization to make series more predictable and non-stationary attention to avoid the over-stationarization problem by re-capturing the non-stationarity of the original series.

- The non-stationary transformer consistently and significantly outperformed four mainstream transformers and achieved SOTA performance on six real-world benchmarks.

## Previous Work

### Deep models for time series forecasting

In recent years, RNN-based models and transformers have been applied to time series forecasting. Transformers have great power in sequence modeling. To overcome the quadratic growth of computational complexity with sequence length, subsequent work has aimed to reduce the complexity of self-attention. In particular, for time series prediction, Informer extends self-attention with a KL-divergence criterion to select dominant queries, and Reformer introduces locality-sensitive hashing (LSH), which approximates attention by grouping similar queries. Beyond improved complexity, later models develop delicate building blocks for time series prediction: Autoformer develops Auto-Correlation with fused decomposition blocks to discover series-wise connections, and Pyraformer designs a Pyramidal Attention Module (PAM) to capture temporal dependencies at different hierarchies. Deep models that do not use transformers also achieve remarkable performance: N-BEATS proposes an explicit decomposition into trend and seasonal terms with strong interpretability, and N-HiTS introduces a hierarchical layout with multi-rate sampling to handle the respective frequency bands of time series. Unlike previous studies that focus on architectural design, this paper analyzes the series forecasting task from the fundamental perspective of stationarity, an intrinsic property of time series. Note also that, as a general framework, the proposed non-stationary transformer can easily be applied to a variety of transformer-based models.

### Stationarization for time series forecasting

Although stationarity is important for the predictability of time series, real-world time series are almost always non-stationary. To address this, the classical statistical method ARIMA stationarizes a time series by differencing. For deep models, the distribution shift caused by non-stationarity makes prediction harder, so stationarization methods have been widely studied and are commonly employed as a preprocessing step on model inputs: one line of work applies z-score normalization; DAIN employs a nonlinear neural network to adaptively stationarize time series given the observed training distribution; and RevIN introduces a two-stage instance normalization that transforms the model inputs and outputs, respectively, to reduce the discrepancy between series. In contrast, the authors found that directly stationarizing time series impairs the ability to model particular temporal dependencies. Therefore, unlike previous methods, the non-stationary transformer develops non-stationary attention on top of stationarization to attend to the intrinsic non-stationarity of the raw series.

## Non-stationary Transformer

As mentioned earlier, stationarity is an important component of time series predictability. Previous "direct stationarization" designs can increase predictability by attenuating the non-stationarity of the time series, but they clearly ignore the inherent characteristics of real-world time series, resulting in the over-stationarization problem shown in Figure 1. To address this dilemma, the authors propose the non-stationary transformer as a general framework. This model has two complementary parts: "series stationarization," which attenuates the non-stationarity of the time series, and "non-stationary attention," which re-acquires the non-stationary information of the series. With these designs, the non-stationary transformer improves the predictability of the data while preserving the model's capability to exploit the intrinsic non-stationarity.

### Series Stationarization

Non-stationary time series make the forecasting task difficult for deep models, because it is hard to generalize to series whose statistics (typically the mean and standard deviation) have changed by inference time. The pilot work, RevIN, makes each series follow a similar distribution by applying instance normalization with learnable affine parameters to each input and restoring the statistics to the corresponding output. This design works well even without the learnable parameters, so the authors adopt it as a parameter-free wrapper around the base model. As shown in Figure 2, it involves two corresponding operations: a normalization module that handles the non-stationary series using the sliding-window mean and standard deviation, and a denormalization module that restores the original statistics to the model output. The details are given below.

Figure 2 Non-stationary transformer. Series stationarization is employed as a wrapper around the base model to normalize each input series and denormalize the output. Non-stationary attention replaces the original attention mechanism to approximate the attention learned from the non-stationary series, rescaling the current time-dependent weights with the learned non-stationary factors τ, ∆.

**Normalization Module** To attenuate the non-stationarity of each input series, normalization is performed over a sliding window on the time axis. Each input series $x = (x_1, \ldots, x_S)$ is transformed by translation and scaling into $x'$. The normalization module is formulated as follows:

$$\mu_x = \frac{1}{S}\sum_{i=1}^{S} x_i, \qquad \sigma_x^2 = \frac{1}{S}\sum_{i=1}^{S}(x_i - \mu_x)^2, \qquad x'_i = \frac{1}{\sigma_x} \odot (x_i - \mu_x)$$

Here, the division by σx is element-wise and ⊙ is the element-wise product. The normalization module reduces the distributional discrepancy between input series, making the distribution of model inputs more stable.

**Denormalization Module** As shown in Figure 2, after the base model H predicts future values y′ of length O, denormalization is applied to the model output to obtain the final prediction ŷ. The denormalization module is formulated as follows:

$$\hat{y}_i = \sigma_x \odot y'_i + \mu_x$$

The two-step transformation gives the base model stationarized inputs that follow a stable distribution, which facilitates generalization. This design also makes the model equivariant to translation and scaling perturbations of the time series, which is advantageous for real-world series forecasting.
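As an intuition for the two modules, the following is a minimal NumPy sketch (not the authors' code): per-series z-score normalization with saved statistics, and the matching denormalization applied to the model output. The `eps` stabilizer and all names are assumptions for illustration; around an identity "base model" the round trip recovers the original series.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Per-variable z-score normalization over the time axis (x: [S, C])."""
    mu = x.mean(axis=0, keepdims=True)          # per-variable mean
    sigma = x.std(axis=0, keepdims=True) + eps  # per-variable std (eps avoids /0)
    return (x - mu) / sigma, mu, sigma          # element-wise division

def denormalize(y_prime, mu, sigma):
    """Restore the saved statistics on the model output (y_prime: [O, C])."""
    return sigma * y_prime + mu                 # element-wise product

# Round trip: an identity base model recovers the original statistics.
x = np.random.default_rng(0).normal(3.0, 2.0, size=(96, 7))
x_prime, mu, sigma = normalize(x)
y = denormalize(x_prime, mu, sigma)
```

In the real framework the base model sits between the two calls; the point is that its inputs always have (near-)zero mean and unit variance, while its outputs are mapped back to the original scale.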

### Non-stationary Attention

Although the statistics of each series are explicitly restored to the corresponding predictions, denormalization alone cannot fully recover the non-stationarity of the original series. For example, series stationarization can produce the same normalized input x′ from two different series x₁ and x₂, in which case the base model yields identical attentions and fails to capture the important temporal dependencies carried by the non-stationarity (Figure 1). In other words, the damage done by over-stationarization occurs inside the deep model, particularly in the attention computation. Furthermore, the non-stationary series is split and normalized into several chunks with the same mean and variance, which follow a much more similar distribution than the raw data did before stationarization. As a result, the model tends to produce over-stationarized, uneventful outputs that are incompatible with the natural non-stationarity of the original series.

To address the over-stationarization caused by series stationarization, the authors propose a new non-stationary attention mechanism that approximates the attention that would be obtained without stationarization, allowing the model to discover the particular temporal dependencies of the original non-stationary data.

**Analysis of Plain Models** As mentioned earlier, the over-stationarization problem stems from the loss of intrinsic non-stationary information, which prevents the base model from capturing eventful temporal dependencies for forecasting. The authors therefore attempt to approximate the attention that would be learned from the original non-stationary series. Self-attention is formulated as follows:

$$\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where Q, K, and V are the queries, keys, and values of length S in dk dimensions, and Softmax(·) is applied row by row. After the normalization module, the model receives the stationarized input x′. Under the assumption that the embedding layer is linear, it can be shown that the attention layer receives Q′ = (Q − 1µQ⊤)/σx, along with the correspondingly transformed K′ and V′. Without series stationarization, the input to Softmax(·) in self-attention would be QK⊤/√dk, but now the attention is computed from Q′ and K′ instead.

Since Softmax(·) is invariant to adding the same constant to every entry of a row of its input, we have

$$\mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) = \mathrm{Softmax}\!\left(\frac{\sigma_x^2\, Q'K'^\top + \mathbf{1}\mu_Q^\top K^\top}{\sqrt{d_k}}\right) \quad (5)$$

Equation 5 gives a direct expression for the attention learned from the raw series x. Besides the current Q′ and K′ from the stationarized series x′, this expression requires the non-stationary information σx, µQ, and K that was removed by series stationarization.
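The translation invariance used in this derivation is easy to verify numerically: adding the same constant to every entry of a row leaves the row-wise Softmax unchanged. A small illustrative NumPy check (not from the paper):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 6))   # one row of attention scores per query
shift = rng.normal(size=(4, 1))    # a different constant per row, broadcast across it
same = np.allclose(softmax(scores), softmax(scores + shift))
```

This is exactly why the per-row term dropped between the two sides of Equation 5 does not change the attention weights, while the row-dependent scaling and shift terms must be kept.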

**Non-stationary Attention** To restore the attention that would be learned on the original non-stationary series, the lost non-stationary information is reintroduced into the calculation. The key is to approximate the positive scaling scalar τ = σx² and the shift vector ∆ = KµQ in Equation 5, which are defined as the non-stationary factors. Since strict linearity rarely holds in deep models, instead of spending great effort estimating the exact factors, the authors learn the non-stationary factors directly from the statistics of the unnormalized series with a simple but effective multilayer perceptron (MLP). Since only limited non-stationary information remains in the current Q′ and K′, the only reasonable source to compensate for the non-stationarity is the original, unnormalized x. Therefore, as a direct deep learning implementation of Equation 5, MLP projectors learn the non-stationary factors τ and ∆ separately from the statistics µx and σx of the non-stationary x:

$$\log \tau = \mathrm{MLP}(\sigma_x, x), \qquad \Delta = \mathrm{MLP}(\mu_x, x)$$

The non-stationary attention is then computed as follows:

$$\mathrm{Attn}(Q', K', V') = \mathrm{Softmax}\!\left(\frac{\tau\, Q'K'^\top + \mathbf{1}\Delta^\top}{\sqrt{d_k}}\right)V'$$

Here, the non-stationary factors τ and ∆ are shared by the non-stationary attention at all layers (Figure 2). The non-stationary attention mechanism learns temporal dependencies from both the stationarized series (Q′ and K′) and the non-stationary statistics (µx and σx of x), and multiplies the resulting weights with the stationarized values V′. It thus simultaneously preserves the predictability benefit of the stationarized series and the intrinsic temporal dependencies of the raw series.
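Putting the pieces together, here is a hedged NumPy sketch of the rescaled attention Softmax((τ Q′K′⊤ + 1∆⊤)/√dk) V′. In the paper τ and ∆ are produced by MLPs from the raw statistics; here they are plain placeholder values, so this only illustrates the shapes and the rescaling, not the learned projectors. Setting τ = 1 and ∆ = 0 recovers standard scaled dot-product attention.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def de_stationary_attention(Qp, Kp, Vp, tau, delta):
    """Softmax((tau * Q'K'^T + 1 delta^T) / sqrt(d_k)) V' with scalar tau, vector delta."""
    d_k = Qp.shape[-1]
    scores = (tau * (Qp @ Kp.T) + delta[None, :]) / np.sqrt(d_k)
    return softmax(scores) @ Vp

rng = np.random.default_rng(0)
S, d_k = 8, 16
Qp = rng.normal(size=(S, d_k))
Kp = rng.normal(size=(S, d_k))
Vp = rng.normal(size=(S, d_k))
tau = 1.7                      # placeholder positive scaling factor (learned in the paper)
delta = rng.normal(size=S)     # placeholder shift vector (learned in the paper)
out = de_stationary_attention(Qp, Kp, Vp, tau, delta)
```

Because ∆ enters as a per-key bias inside the Softmax, it reweights which time steps receive attention, while τ sharpens or flattens the whole attention distribution.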

**Overall Architecture** Following prior uses of transformers in time series forecasting, the authors adopt a standard encoder-decoder structure (Figure 2), in which the encoder extracts information from past observations and the decoder aggregates past information to refine the forecast from a simple initialization. The non-stationary transformer wraps series stationarization around both the input and output of the vanilla transformer and replaces self-attention with the proposed non-stationary attention, improving the base model's ability to forecast non-stationary series. The non-stationary attention rescales the terms inside Softmax(·) with the non-stationary factors τ and ∆ in order to reintegrate the non-stationary information.

## Experiments

Extensive experiments evaluate the performance of the non-stationary transformer on six real-world time series forecasting benchmarks and verify the generality of the proposed framework across a variety of mainstream transformer variants.

**Dataset** The dataset used is as follows

(1) Electricity records hourly electricity consumption for 321 clients from 2012 to 2014.

(2) ETT contains time series of oil temperature and power load collected from electricity transformers between July 2016 and July 2018; ETTm1/ETTm2 are recorded every 15 minutes and ETTh1/ETTh2 every hour.

(3) Exchange collects panel data of daily exchange rates for eight countries from 1990 to 2016.

(4) ILI is a collection of ratios of the number of cases of influenza-like illness per week to the total number of cases reported weekly by the U.S. Centers for Disease Control and Prevention from 2002 to 2021.

(5) Traffic includes hourly road occupancy measured by 862 sensors on freeways in the San Francisco Bay Area from January 2015 through December 2016.

(6) Weather includes 21 meteorological indicators collected every 10 minutes in 2020 from the weather station of the Max Planck Institute for Biogeochemistry.

The Augmented Dickey-Fuller (ADF) test statistic is employed as a quantitative measure of the degree of stationarity; a smaller ADF test statistic indicates a higher degree of stationarity, meaning the distribution is more stable. Table 1 summarizes the overall statistics of the datasets, arranged in ascending order of stationarity. Each dataset is split into training, validation, and test subsets in chronological order, following the standard protocol.
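Running the ADF test itself requires a statistics package (e.g. `adfuller` in `statsmodels`), but the intuition it quantifies can be sketched with plain NumPy: a stationary series keeps roughly constant statistics across time windows, while a series with a drifting mean does not. The synthetic series and the drift measure below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def window_stats(series, window=50):
    """Mean/std of consecutive non-overlapping windows; large drift in the
    window means suggests non-stationarity."""
    chunks = [series[i:i + window] for i in range(0, len(series) - window + 1, window)]
    return np.array([c.mean() for c in chunks]), np.array([c.std() for c in chunks])

rng = np.random.default_rng(0)
stationary = rng.normal(0.0, 1.0, 500)                       # fixed distribution
trending = rng.normal(0.0, 1.0, 500) + np.linspace(0.0, 10.0, 500)  # drifting mean

mu_s, _ = window_stats(stationary)
mu_t, _ = window_stats(trending)
drift_s = mu_s.max() - mu_s.min()   # spread of window means, stationary series
drift_t = mu_t.max() - mu_t.min()   # spread of window means, trending series
```

The trending series shows a far larger spread of window means, which is the kind of changing joint distribution that the ADF statistic (and the series stationarization module) is designed to detect and neutralize.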

Table 1 Summary of datasets; a smaller ADF test statistic indicates a more stationary dataset.

**Baseline** The vanilla transformer equipped with the non-stationary transformer framework is evaluated in both multivariate and univariate settings to demonstrate its effectiveness. For multivariate forecasting, six state-of-the-art deep forecasting models are included: Autoformer, Pyraformer, Informer, LogTrans, Reformer, and LSTNet. For univariate forecasting, seven competitive baselines are included: N-HiTS, N-BEATS, Autoformer, Pyraformer, Informer, Reformer, and ARIMA. In addition, the proposed framework is applied to both canonical and efficient transformer variants: Transformer, Informer, Reformer, and Autoformer.

### Main Results

**Prediction Results** For multivariate prediction, the vanilla transformer with the authors' framework consistently achieved state-of-the-art performance across all benchmarks and prediction lengths (Table 2). In particular, on highly non-stationary datasets, the non-stationary transformer clearly outperformed the other deep learning models: with a prediction length of 336, the MSE was reduced by 17% on Exchange (0.509 → 0.421) and by 25% on ILI (2.669 → 2.010), suggesting that the potential of deep learning models is still limited on non-stationary data. Univariate results for two typical datasets with different degrees of stationarity are shown in Table 3; the non-stationary transformer again achieves remarkable predictive performance.

Table 2 Comparison of prediction results for different prediction lengths O ∈ {96, 192, 336, 720}. The length of the input series is set to 36 for ILI and 96 for the others.

Table 3 Univariate results for different prediction lengths O ∈ {96, 192, 336, 720} on two typical datasets with strong non-stationarity. The length of the input series is set to 96.

**Framework Generality** The framework is applied to four mainstream transformers, and the performance promotion for each model is reported (Table 4). The authors' method consistently improves the predictive performance of each base model: on average, an MSE reduction of 49.43% for Transformer, 47.34% for Informer, 46.89% for Reformer, and 10.57% for Autoformer, each outperforming the previous state of the art. Compared to the native blocks of each model, applying the framework adds few parameters and maintains the computational complexity. This verifies that the non-stationary transformer is an effective and lightweight framework that can be widely applied to transformer-based models and achieves state-of-the-art performance by enhancing their non-stationary predictability.

Table 4 Performance promotion of the proposed framework applied to Transformer and its variants. The average MSE/MAE over all forecast lengths (listed in Table 2) and the relative MSE reduction (Promotion) due to the framework are reported.

### Ablation Study

**Qualitative Evaluation** To explore the role of each module in the proposed framework, the ETTm2 forecasting results of three models are compared: the vanilla transformer, a transformer with series stationarization only, and the proposed non-stationary transformer. Figure 3 shows that the two modules enhance the non-stationary forecasting capability of the transformer from different perspectives. Series stationarization enforces consistent statistics across input series and is very effective in helping the transformer generalize to out-of-distribution data. However, as shown in Figure 3(b), such an overly stationary training environment makes deep models prone to outputting series with unexpectedly high stationarity, ignoring the properties of non-stationary real-world data. By adding non-stationary attention, the model takes into account the non-stationarity inherent in real-world time series, which benefits the accurate forecasting of detailed series variation, an essential capability for real-world time series forecasting.

Figure 3 Visualization of ETTm2 predictions by different models

**Quantitative Performance** Beyond the case studies above, the quantitative prediction performance of series stationarization is compared with that of the deep stationarization method RevIN. As shown in Table 5, RevIN and series stationarization yield essentially identical results, indicating that the parameter-free normalization in the proposed framework stationarizes the time series well enough. Furthermore, the proposed non-stationary attention further improves performance, achieving the best results on all six benchmarks. The MSE reduction it provides is particularly noticeable when the dataset is highly non-stationary (Exchange: 0.569 → 0.461, ETTm2: 0.461 → 0.306). This comparison reveals that simply stationarizing the time series limits the predictive ability of the transformer, and that the complementary mechanisms of the non-stationary transformer appropriately release the model's potential for non-stationary series forecasting.

Table 5 Prediction results obtained by applying different methods to Transformer and Reformer. For comparison, the average MSE/MAE over all forecast lengths (Table 2) is reported.

### Model Analysis

**Over-Stationarization Problem** To test the over-stationarization problem statistically, the transformer is trained with each of the aforementioned methods, and the degree of stationarity of the predicted series is compared with that of the ground truth (Figure 4). The model with only the stationarization method tends to output series with unexpectedly high stationarity, whereas the results supported by non-stationary attention are closer to the actual values (relative stationarity ∈ [97%, 103%]). Moreover, the more non-stationary the series, the more severe the over-stationarization problem becomes. This large discrepancy in the degree of stationarity explains the inferior performance of the transformer with stationarization alone, and demonstrates that non-stationary attention, as an internal modification, mitigates over-stationarization.

Figure 4 Relative stationarity, calculated as the ratio of the ADF test statistic between the model predictions and the ground truth. From left to right, the datasets are increasingly non-stationary. Models that only perform stationarization tend to output highly stationary series, while the proposed method yields predictions with stationarity closer to the ground truth.

**Exploring Non-stationary Information Retrieval** Note that identifying over-stationarization with indistinguishable attentions narrows the design space down to the attention mechanism. The authors therefore also explored another way of retrieving non-stationary information: re-injecting µ and σ into the feed-forward layers (DeFF) of the transformer architecture, where the learned µ and σ are repeatedly injected into each feed-forward layer. As shown in Table 6, re-embedding non-stationarity helps only when the inputs are stationarized (Stationarization): it benefits forecasting but otherwise leads to a non-stationarity mismatch at the model output. The proposed design (Stat + DeAttn) provides a further boost and performs best in most cases (77%). In addition to the theoretical analysis, these results further validate the effectiveness of the proposed design for re-acquiring non-stationarity in attention.

Table 6 Ablation of the framework design. Baseline is the vanilla transformer; Stationarization adds series stationarization; DeFF re-incorporates non-stationarity into the feed-forward layers; DeAttn re-incorporates it via non-stationary attention; Stat + DeFF combines series stationarization with DeFF; Stat + DeAttn is the proposed framework.

## Conclusion

This paper addresses time series forecasting from the perspective of stationarity. Unlike prior work that only attenuates non-stationarity and thereby causes over-stationarization, the authors propose an efficient method to increase the stationarity of the series while revamping the internal mechanism to re-acquire the non-stationary information, simultaneously increasing the predictability of the data and the predictive power of the model. Experimentally, the method shows excellent generality and performance on six real-world benchmarks. Detailed derivations and ablations validate the effectiveness of each component of the proposed non-stationary transformer framework. In the future, the authors plan to explore more model-agnostic solutions to the over-stationarization problem.
