
Predictive Performance SCINet Beyond Transformers



3 main points
✔️ This paper, accepted at NeurIPS 2022, proposes SCINet, a time series prediction model that effectively models time series with complex temporal dynamics.
✔️ SCINet is a hierarchical downsample-convolution-interaction structure with rich convolution filters. It iteratively extracts and exchanges information at different temporal resolutions and learns effective representations with enhanced predictability.
✔️ SCINet achieves significant improvements in predictive accuracy over existing convolutional models and Transformer-based solutions on a variety of real-world time series prediction datasets.

SCINet: Time Series Modeling and Forecasting with Sample Convolution and Interaction
written by Minhao Liu, Ailing Zeng, Muxi Chen, Zhijian Xu, Qiuxia Lai, Lingna Ma, Qiang Xu
(Submitted on 17 Jun 2021 (v1), last revised 13 Oct 2022 (this version, v3))
Comments: This paper presents a novel convolutional neural network for time series forecasting, achieving significant accuracy improvements

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


A property unique to time series is that temporal relationships are largely preserved when the series is downsampled into two sub-series. In this paper, we propose a new neural network architecture, named SCINet, that exploits this property to perform sample convolution and interaction for temporal modeling and prediction. Specifically, SCINet is a recursive downsample-convolution-interaction structure. At each layer, multiple convolutional filters are used to extract distinct and valuable temporal features from the downsampled sub-sequences or features. By combining these rich features aggregated from multiple resolutions, SCINet effectively models time series with complex temporal dynamics. Experimental results show that SCINet achieves significant improvements in prediction accuracy on a variety of real-world time series prediction datasets, even when compared with existing convolutional models and Transformer-based solutions.


Time series forecasting (TSF) enables decision making by estimating the future evolution of metrics and events, and therefore plays an important role in various scientific and engineering fields such as healthcare, energy management, traffic flow, and financial investment. There are three main types of deep neural networks used for sequence modeling, all of which have been applied to time series forecasting: (i) Recurrent Neural Networks (RNNs), (ii) Transformer-based models, and (iii) Temporal Convolutional Networks (TCNs). Despite the promising results of TSF methods based on these generic models, they do not take the specific properties of time series data into account during modeling. For example, one characteristic of time series data is that temporal relationships (e.g., the trend and seasonal components of the data) are largely preserved when the series is downsampled into two sub-series. As a result, recursively downsampling a time series into sub-series makes it possible to apply a rich set of convolutional filters that extract dynamic temporal features at multiple resolutions.
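The downsampling property can be checked numerically. The sketch below is only an illustration: it assumes a synthetic hourly series with a 24-step cycle and a mild linear trend (not data from the paper), splits it into even and odd sub-series, and compares the two halves.

```python
import numpy as np

# Synthetic series (an assumption for illustration): 168 hourly points
# with a 24-step seasonal cycle plus a slow linear trend.
t = np.arange(168)
x = np.sin(2 * np.pi * t / 24) + 0.01 * t

# Downsample into two sub-series by taking the even and odd positions.
x_even, x_odd = x[0::2], x[1::2]

# Both halves carry the same trend and seasonality, so they stay
# highly correlated even though each has half the temporal resolution.
corr = np.corrcoef(x_even, x_odd)[0, 1]
print(len(x_even), len(x_odd), round(float(corr), 3))
```

Each sub-series has half the length of the original but still shows the same cycle and trend, which is the property SCINet builds on.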

In light of the above, this paper proposes a new neural network architecture for time series modeling and prediction, named Sample Convolution and Interaction Network (SCINet). The main contributions of this paper are as follows:

- We propose SCINet, a hierarchical downsample-convolution-interact TSF framework that effectively models time series with complex temporal dynamics. By repeatedly extracting and exchanging information at multiple temporal resolutions, an effective representation with enhanced predictability can be learned, which is verified by a relatively low permutation entropy (PE).

- SCI-Blocks, the basic building blocks for constructing SCINet, are designed to downsample the input data/features into two sub-sequences and extract the features of each sub-sequence using different convolutional filters. To compensate for information loss in the downsampling process, we incorporate bidirectional learning between the two convolutional features within each SCI-Block.

Extensive experiments on a variety of real-world TSF datasets show that the proposed model consistently outperforms existing TSF approaches by a considerable margin. Furthermore, although SCINet does not explicitly model spatial relationships, it achieves competitive prediction accuracy in spatial-temporal TSF tasks.

Related Research

Traditional time series forecasting methods such as the autoregressive integrated moving average (ARIMA) model and the Holt-Winters seasonal method have theoretical guarantees. However, they are mainly applicable to univariate forecasting problems, which limits their use on complex time series data. With recent increases in data availability and computational power, deep learning-based TSF methods have shown the potential to achieve better prediction accuracy than traditional approaches. Earlier RNN-based TSF methods compactly summarize past information in an internal memory state that is recursively updated with new inputs at each time step, as shown in Fig. 1(a). However, the vanishing/exploding gradient problem and inefficient training procedures have severely limited the application of RNN-based models. In recent years, Transformer-based models have replaced RNN models in almost all sequence modeling tasks thanks to the effectiveness and efficiency of the self-attention mechanism, and various Transformer-based TSF methods (see Fig. 1(b)) have been proposed in the literature. These works generally focus on the challenging long-term time series forecasting problem, taking advantage of the Transformer's long-sequence modeling capability. Another common type of TSF model is the temporal convolutional network (TCN), in which convolutional filters are used to capture local temporal features (see Fig. 1(c)). The proposed SCINet is also built on temporal convolution. However, it differs in several important ways from TCN models based on dilated causal convolution, as described below.

Rethinking Dilated Causal Convolution for Time Series Modeling and Forecasting

Dilated causal convolution was initially proposed for generating raw audio waveforms in WaveNet. Later, the WaveNet architecture was simplified into the so-called temporal convolutional network (see Fig. 1(c)). A TCN consists of a stack of causal convolutional layers with exponentially growing dilation factors, which allows a large receptive field with only a few convolutional layers. Over the years, TCNs have been widely used for all kinds of time series forecasting problems with promising results. In addition, convolution filters work seamlessly with graph neural networks (GNNs) to solve a variety of spatial-temporal TSF problems.

In the causal convolution of the TCN architecture, output i is convolved only with the i-th and earlier elements of the previous layer. Although causality should be preserved in forecasting tasks, the potential "future information leakage" problem exists only when outputs and inputs overlap in time. In other words, causal convolution is needed only in autoregressive forecasting, where previous outputs serve as inputs for future forecasts. If the forecast is based entirely on known inputs in the lookback window, there is no need to use causal convolution, and one can safely forecast by applying ordinary convolution to the lookback window.
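To make the receptive-field argument concrete, the following sketch computes how far back each layer of a dilated causal convolution stack can see. It assumes the common schedule in which layer i (0-based) uses dilation 2**i; the function name is hypothetical and not taken from any TCN library.

```python
def receptive_fields(kernel_size, num_layers):
    """Receptive field after each layer of a dilated causal conv stack.

    Assumes dilation 2**i at layer i (0-based), the usual TCN schedule.
    """
    fields, rf = [], 1
    for i in range(num_layers):
        # Each layer widens the view by (kernel_size - 1) * dilation steps.
        rf += (kernel_size - 1) * 2 ** i
        fields.append(rf)
    return fields


print(receptive_fields(kernel_size=2, num_layers=8))
# [2, 4, 8, 16, 32, 64, 128, 256]
```

With kernel size 2, only the eighth layer covers a lookback window of 168 steps, while the layers closest to the input see just a handful of time steps.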

More importantly, the dilated architecture in TCN has two inherent limitations:

- A single convolution filter is shared within each layer. Such a unified convolution kernel tends to extract average temporal features from the data/features in the previous layer. However, complex time series may contain substantial temporal dynamics. Therefore, it is important to use rich convolutional filters to extract different but valuable features.

- The final layer of the TCN model can see the entire lookback window, but the effective receptive fields of the intermediate layers (especially those closest to the input) are limited, so temporal relationships are lost during feature extraction.

The above limitations of the TCN architecture motivated the design of the proposed SCINet, as detailed below.

SCINet: Sample Convolution and Interaction Network

SCINet employs an encoder-decoder architecture. The encoder is a hierarchical convolutional network that captures dynamic temporal dependencies at multiple resolutions with a rich set of convolutional filters. As shown in Fig. 2(a), the basic building block, the SCI-Block, downsamples the input data or features into two sub-sequences. It then processes each sub-sequence with a set of convolutional filters to extract distinct but valuable temporal features from each part. To compensate for the loss of information during downsampling, interactive learning takes place between the two sub-sequences. SCINet is constructed by arranging multiple SCI-Blocks in a binary tree structure (Fig. 2(b)). The advantage of such a design is that each SCI-Block has both local and global views of the entire time series, which facilitates the extraction of useful temporal features. After all downsampling, convolution, and interaction operations, the extracted features are realigned into a new sequence representation and added to the original time series, and a fully connected network serves as the decoder that produces the forecast. To facilitate the extraction of complex temporal patterns, multiple SCINets can be further stacked and intermediate supervision applied to obtain a Stacked SCINet, as shown in Fig. 2(c).


The SCI-Block (Fig. 2(a)) is the basic module of SCINet. It decomposes the input feature F into two sub-features F′_odd and F′_even through the operations of splitting and interactive learning. Splitting downsamples the original sequence F into two sub-sequences F_even and F_odd by separating the even and odd elements, which coarsens the temporal resolution but preserves most of the information of the original sequence. Different convolution kernels are then used to extract features from F_even and F_odd. Because the kernels are separate, the extracted features have enhanced representational capability and contain different but valuable temporal relationships. To compensate for the potential information loss due to downsampling, we propose a novel interactive learning strategy that allows information exchange between the two sub-sequences by learning affine transformation parameters from each other. As shown in Fig. 2(a), interactive learning consists of two steps. First, F_even and F_odd are projected into hidden states by two different 1D convolution modules φ and ψ, respectively, transformed by exp, and combined with F_odd and F_even through an element-wise product (Equation (1)):

F^s_odd = F_odd ⊙ exp(φ(F_even)),   F^s_even = F_even ⊙ exp(ψ(F_odd))   (1)

This can be viewed as a scaling transformation of F_even and F_odd in which the scaling coefficients are learned from each other by the neural network modules. Here, ⊙ is the Hadamard (element-wise) product.

The two scaled features F^s_even and F^s_odd are then projected into another two hidden states by two other 1D convolution modules ρ and η, which are added to or subtracted from F^s_odd and F^s_even, as shown in Equation (2):

F′_odd = F^s_odd ± ρ(F^s_even),   F′_even = F^s_even ± η(F^s_odd)   (2)

The final output of the interactive learning module is the two updated sub-features F′_even and F′_odd. Compared to the dilated convolution used in the TCN architecture, the proposed downsample-convolution-interaction architecture achieves an even larger receptive field at each convolution layer. More importantly, unlike TCN, which employs a single shared convolutional filter in each layer and thereby severely limits feature extraction capacity, the SCI-Block aggregates the important information extracted from the two downsampled sub-sequences, each of which has both local and global views of the entire time series.
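The two interactive-learning steps can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the learned modules φ, ψ, ρ and η are replaced by a hypothetical fixed-kernel convolution followed by tanh (`make_conv`), and subtracting in the odd branch while adding in the even branch is an assumption, since the text only says the hidden states are added or subtracted.

```python
import numpy as np


def make_conv(kernel):
    """Hypothetical stand-in for a learned 1D conv module (phi, psi, rho, eta)."""
    kernel = np.asarray(kernel, dtype=float)

    def module(x):
        # mode="same" keeps the sub-sequence length; tanh bounds the hidden state.
        return np.tanh(np.convolve(x, kernel, mode="same"))

    return module


def sci_block(f, phi, psi, rho, eta):
    """One SCI-Block applied to a 1D feature of even length."""
    f_even, f_odd = f[0::2], f[1::2]          # splitting
    # Step 1: scale each half with coefficients learned from the other half.
    fs_odd = f_odd * np.exp(phi(f_even))
    fs_even = f_even * np.exp(psi(f_odd))
    # Step 2: update each half with a projection of the other scaled half.
    # The signs (minus for odd, plus for even) are an assumption of this sketch.
    f_odd_new = fs_odd - rho(fs_even)
    f_even_new = fs_even + eta(fs_odd)
    return f_even_new, f_odd_new


x = np.sin(np.arange(16) / 2.0)
kernels = ([0.2, 0.5, 0.2], [0.1, 0.6, 0.1], [0.3, 0.3, 0.3], [0.5, 0.2, 0.1])
phi, psi, rho, eta = (make_conv(k) for k in kernels)
f_even_new, f_odd_new = sci_block(x, phi, psi, rho, eta)
print(f_even_new.shape, f_odd_new.shape)  # (8,) (8,)
```

Because each output half depends on both input halves, neither sub-sequence loses access to the elements that were removed from it by the split.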


Using the SCI-Blocks described above, a SCINet is constructed by hierarchically arranging multiple SCI-Blocks, resulting in a tree-structured framework as shown in Fig. 2(b).

The l-th level has 2^l SCI-Blocks, where l = 1, ..., L is the index of the level and L is the total number of levels. Within the k-th SCINet of a stacked SCINet, the input time series X (for k = 1) or feature vector (for k > 1) is gradually downsampled and processed by SCI-Blocks through the different levels, allowing for effective feature learning at different temporal resolutions. In particular, information from previous levels is gradually accumulated, i.e., features at deeper levels contain extra temporal information at finer scales conveyed from shallower levels. In this way, both short-term and long-term temporal dependencies in the time series can be captured.
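As a small illustration of how the temporal resolution coarsens level by level, the sketch below lists the length of each sub-sequence produced by repeated even/odd splitting. The helper name is hypothetical, and requiring the window size to be divisible by 2**L is an assumption of this sketch so that every split yields two halves of equal length.

```python
def level_lengths(T, L):
    """Length of each sub-sequence produced at levels 1..L for a lookback window of size T."""
    if T % 2 ** L != 0:
        raise ValueError("T must be divisible by 2**L for equal even/odd splits at every level")
    return [T // 2 ** level for level in range(1, L + 1)]


print(level_lengths(168, 3))  # [84, 42, 21]
```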

After passing through the L levels of SCI-Blocks, the odd-even splitting operation is reversed to reorder all sub-feature elements and concatenate them into a new sequence representation. This representation is then added to the original time series via a residual connection to produce a new sequence with enhanced predictability. Finally, the enhanced sequence representation is decoded into the forecast using a simple fully connected network. Note that, to mitigate distribution shift in some TSF tasks, the value of the last element is subtracted from all data elements in the lookback window before feeding them into the model, and the same value is added back to all data elements in the forecast horizon.
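The realignment, residual connection, and last-value shift can be sketched as follows. All function names are hypothetical, `block` stands for any function that maps an (even, odd) pair to an updated pair, and `predictor` stands for the whole model; none of this is taken from the authors' code.

```python
import numpy as np


def interleave(even, odd):
    """Reverse the even/odd split: put the elements back in their original order."""
    out = np.empty(even.size + odd.size, dtype=float)
    out[0::2], out[1::2] = even, odd
    return out


def scinet_tree(f, levels, block):
    """Recursively split, process with `block`, recurse on both halves, then realign."""
    if levels == 0:
        return f
    even, odd = block(f[0::2], f[1::2])
    return interleave(scinet_tree(even, levels - 1, block),
                      scinet_tree(odd, levels - 1, block))


def encode(x, levels, block):
    """Residual connection: add the realigned features to the input series."""
    return x + scinet_tree(x, levels, block)


def forecast_with_last_value_shift(x, predictor):
    """Subtract the last lookback value before the model and add it back afterwards."""
    last = x[-1]
    return predictor(x - last) + last


x = np.arange(8, dtype=float)
identity = lambda even, odd: (even, odd)
print(scinet_tree(x, 3, identity))  # identity block: the realigned output equals x
```

With an identity block the tree returns the input unchanged, which confirms that the interleaving step exactly undoes the recursive splitting.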

Stacked SCINet

Given sufficient training samples, even better prediction accuracy can be achieved by stacking K SCINets, at the expense of a more complex model structure (see Fig. 2(c)). Specifically, to facilitate the learning of intermediate temporal features, intermediate supervision with ground-truth values is applied to the output of each SCINet. The output X̂^k of the k-th intermediate SCINet has length τ and is concatenated with a portion of the input, X_{t−(T−τ)+1:t}, to recover the length of the original input, and is then fed as the input to the (k + 1)-th SCINet, where k = 1, ..., K − 1 and K is the total number of SCINets in the stacked structure. The output X̂^K of the K-th SCINet is the final prediction.
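The way each intermediate forecast is fed to the next SCINet can be sketched as below. The function name and the idea of passing each SCINet as a plain callable are assumptions made for illustration; each callable is expected to map a window of length T to a forecast of length τ.

```python
import numpy as np


def stacked_forecast(x, scinets):
    """Run K stacked predictors; each maps a length-T window to a length-tau forecast."""
    window, outputs = x, []
    for net in scinets:
        pred = net(window)                 # output of the k-th SCINet, length tau
        outputs.append(pred)
        tau = len(pred)
        # Keep the most recent T - tau inputs and append the forecast,
        # so the next SCINet again receives a window of length T.
        window = np.concatenate([x[tau:], pred])
    # outputs[-1] is the final prediction; the earlier ones are the
    # intermediate outputs that receive supervision during training.
    return outputs
```

For example, with T = 6 and τ = 2, the second SCINet receives the last four input values followed by the two values predicted by the first SCINet.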

Loss Function

When training a stacked SCINet with K (K ≥ 1) SCINets, the loss of the k-th prediction is the L1 loss between the output of the k-th SCINet and the ground-truth values in the horizon window to be predicted:

L_k = (1/τ) Σ_{i=1}^{τ} ‖x̂_i^k − x_i‖_1

The total loss of a stacked SCINet can be written as

L = Σ_{k=1}^{K} L_k
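A minimal numerical sketch of this loss is given below. It assumes that each per-stack loss is the mean absolute error over the horizon and that the total is the plain sum over the K stacks; the function name is hypothetical.

```python
import numpy as np


def stacked_l1_loss(outputs, target):
    """Sum over the K stacked outputs of the mean absolute error against the horizon."""
    target = np.asarray(target, dtype=float)
    per_stack = [
        float(np.mean(np.abs(np.asarray(out, dtype=float) - target)))
        for out in outputs
    ]
    return sum(per_stack), per_stack


total, per_stack = stacked_l1_loss([[1.0, 2.0], [2.0, 4.0]], [1.0, 3.0])
print(total, per_stack)  # 1.5 [0.5, 1.0]
```

Because every intermediate output contributes its own term, each SCINet in the stack is trained against the ground truth directly rather than only through the final prediction.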

Complexity Analysis

Downsampling allows neurons in each convolutional layer of SCINet to have a wider receptive field than neurons in TCN. More importantly, SCINet's rich set of convolutional filters allows for flexible extraction of temporal features from multiple resolutions. As a result, SCINet typically does not need to downsample the original sequence to the coarsest level for effective prediction. Given the lookback window size T, TCN typically requires ⌈log2 T⌉ layers when the dilation factor is 2, while the number of levels L in SCINet can be much smaller than log2 T. Empirical studies show that the best prediction accuracy is achieved with L ≤ 5 in most cases, even when T is large (e.g., 168). Likewise, for the number of stacks K, it has been found empirically that K ≤ 3 is sufficient.

As a result, the computational cost of SCINet is usually comparable to that of the TCN architecture. The worst-case time complexity is O(T log T), which is much smaller than that of the vanilla Transformer-based solution, O(T²).
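The orders of magnitude involved can be illustrated with a short calculation. The sketch assumes growth terms only: T·log2(T) for a full downsampling tree in which every level touches all T elements, and T² for pairwise self-attention, with constants and hidden dimensions ignored. Both function names are hypothetical.

```python
import math


def tcn_layers(T):
    """Layers a TCN needs to cover a lookback window of T with dilation factor 2."""
    return math.ceil(math.log2(T))


def cost_ratio(T):
    """Ratio of the T**2 growth term to the T*log2(T) growth term."""
    return T ** 2 / (T * math.log2(T))


print(tcn_layers(168))            # 8
print(round(cost_ratio(168), 1))  # 22.7
```

For T = 168 a TCN needs 8 layers, whereas SCINet is reported to work well with L ≤ 5 levels, so its actual cost is typically below the worst case.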


Here we present quantitative and qualitative comparisons with state-of-the-art models for time series forecasting. We also present a comprehensive ablation study to evaluate the effectiveness of the various components of SCINet.


Experiments were conducted on 11 popular time series datasets: (1) Electricity Transformer Temperature (ETTh), (2) Traffic, (3) Solar-Energy, (4) Electricity, (5) Exchange-Rate, and (6) PeMS (PEMS03, PEMS04, PEMS07, PEMS08). A brief description of these datasets is given in Table 1.

Tables 2, 3, 4, 5, and 6 show the main experimental results for SCINet, which confirm that SCINet performs better than other TSF models on a variety of tasks, including short-term, long-term, and spatial-temporal time series forecasting.

Short-term time-series forecast

This paper evaluates SCINet's performance on short-term TSF tasks compared to other baseline methods using the Traffic, Solar-Energy, Electricity, and Exchange-Rate datasets. The experimental setup uses an input length of 168 to predict different future horizons {3, 6, 12, 24}.

As can be seen from Table 2, the proposed SCINet outperforms existing RNN/TCN-based (LSTNet, TPA-LSTM, TCN, TCN†) and Transformer-based TSF solutions in most cases. Note that TCN† is a variant of TCN that replaces the causal convolution with a regular convolution, and it improves on the original TCN on all datasets. Furthermore, Transformer-based methods perform poorly in this task. For short-term forecasting, recent data points are generally more important for accurate forecasts. However, the permutation-invariant self-attention mechanism used in Transformer-based methods does not pay much attention to such important information. In contrast, general sequential models (RNN/TCN) can readily capture it and have shown very good results in short-term forecasting.

Long-term time series forecast

Many real-world applications require the prediction of long-term events. Therefore, we conduct experiments on the Exchange-Rate, Electricity, Traffic, and ETT datasets to evaluate SCINet's performance on long-term TSF tasks. In this experiment, we compare SCINet only with Transformer-based methods, because these methods are more common in recent long-term TSF studies.

As can be seen from Table 3, SCINet achieves state-of-the-art performance in most benchmark and forecast length settings. Overall, SCINet improves MSE by an average of 39.89% in the above settings. In particular, for Exchange-Rate, SCINet improves MSE by an average of 65% compared to previous state-of-the-art results. This is likely because the proposed SCINet better captures both short-term (local temporal dynamics) and long-term (trend, seasonality) temporal dependencies and therefore provides accurate forecasts in long-term TSF.

Both multivariate and univariate time series forecasts were performed on the ETT data set. To ensure a fair comparison, all input lengths T were set equal to those of Informer. Results are shown in Table 4 and Table 5, respectively.

Multivariate time series forecasting in ETT

As can be seen from Table 4, Transformer-based methods produce better prediction results than RNN-based methods such as LSTMa and LSTNet. One of the main reasons is that RNN-based solutions make iterative predictions and are therefore inevitably subject to error accumulation. As another direct prediction method, TCN further outperforms vanilla Transformer-based methods. It is worth noting that SCINet outperforms all of the above models by a wide margin. Fig. 3 shows qualitative results for several randomly selected sequences from the ETTh1 dataset, which clearly demonstrate that SCINet is able to capture the trend and seasonality of the time series.

Univariate time series forecasting of ETT

In this experimental setting, we bring powerful baseline methods for univariate forecasting such as ARIMA, Prophet, DeepAR, and N-Beats into the comparison, and in Table 5 we see that N-Beats outperforms the other baseline methods in most cases. In fact, N-Beats also accounts for time series-specific characteristics and uses a deep stack of fully connected layers with residuals to directly learn trend and seasonality models, which is different from leading architectures such as RNN, CNN, and Transformer. Nevertheless, SCINet's performance is far superior to N-Beats.

The newly proposed Transformer-based forecasting model, Autoformer, achieved the second best performance in all experimental settings and outperformed SCINet even in ETTm1 when the forecast horizon was large. This is because, on the one hand, Autoformer is much better at extracting long-term temporal patterns than vanilla Transformer-based methods because it focuses on modeling seasonal patterns and self-attends at the subseries level (rather than on raw data). On the other hand, when making long-term forecasts, trend and seasonal information, rather than the temporal dynamics of the look-back window, often take center stage, and SCINet's advantages may not be fully realized.

Spatial-Temporal Time Series Forecasting

In addition to the general TSF task, there are many datasets related to spatial-temporal forecasting. For example, the traffic datasets PeMS (PEMS03, PEMS04, PEMS07, and PEMS08) are complex spatial-temporal time series from public traffic networks that have been studied for decades. The most recent approaches, DCRNN, STGCN, ASTGCN, GraphWaveNet, STSGCN, AGCRN, LSGCN, and STFGNN, use graph neural networks to capture spatial relationships and conventional TCN or RNN/LSTM architectures to model temporal dependencies. The experiments here follow the same experimental setup as those papers. As shown in Table 6, these GNN-based methods generally outperform pure RNN- or TCN-based methods. However, SCINet achieves better performance without advanced spatial relationship modeling, further demonstrating SCINet's superior temporal modeling capabilities.

Estimation of Predictability

To measure the predictability of the original input and the enhanced representation learned by SCINet, we use permutation entropy (PE): time series with low PE values are considered less complex and therefore theoretically more predictable. The PE values of the original time series and the corresponding enhanced representations are shown in Table 7.

This indicates that the enhanced representation learned by SCINet has a lower PE value than the original input and that it is easier to predict the future from the enhanced representation using the same predictor.
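For readers unfamiliar with permutation entropy, a minimal sketch is given below. It follows the standard ordinal-pattern definition; the order, the delay, and the normalization to the range [0, 1] are assumptions of this sketch, since the article does not state which settings were used for Table 7.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1):
    """Normalized permutation entropy in [0, 1], based on ordinal patterns."""
    n = len(series) - (order - 1) * delay
    if n <= 0:
        raise ValueError("series too short for the chosen order and delay")
    patterns = Counter()
    for i in range(n):
        window = [series[i + j * delay] for j in range(order)]
        # The ordinal pattern is the ordering of positions by their values.
        patterns[tuple(sorted(range(order), key=window.__getitem__))] += 1
    probs = [count / n for count in patterns.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(math.factorial(order))


print(permutation_entropy(list(range(20))))                        # monotone series: zero entropy
print(round(permutation_entropy([4, 7, 9, 10, 6, 11, 3], 2), 3))   # 0.918
```

A perfectly monotone series produces a single ordinal pattern and therefore zero entropy, while a series that mixes rising and falling patterns scores higher, which matches the interpretation that lower PE means a more predictable series.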

Ablation Study

To assess the impact of each of the key components used in SCINet, we experimented with several model variants on two data sets: ETTh1 and PEMS08.


We first set the number of stacks K = 1 and the number of SCINet levels L = 3. For the SCI-Block design, two variants are used to test the effectiveness of interactive learning and of using distinct convolution weights for the two sub-sequences: w/o. InterLearn and WeightShare. The w/o. InterLearn variant is obtained by removing the interactive learning procedure described in Equations (1) and (2). In this case, the two sub-sequences are updated using F′_odd = ρ(φ(F_odd)) and F′_even = η(ψ(F_even)). In the WeightShare variant, the modules φ, ρ, ψ, and η share the same weights.

The evaluation results in Fig. 4 show that both interactive learning and distinct weights are essential to improving prediction accuracy on both datasets at various prediction horizons. At the same time, a comparison of Fig. 4(a) and Fig. 4(b) shows that interactive learning is more effective for longer lookback window sizes. Intuitively, with a longer lookback window, more effective features can be extracted by exchanging information between the downsampled sub-sequences.


For the design of SCINet with multiple levels of SCI-Blocks, we also experimented with two variants. The first variant, w/o. ResConn, is obtained by removing the residual connection from the complete SCINet. The second variant, w/o. Linear, removes the decoder (i.e., the fully connected layer) from the complete model. As can be seen from Fig. 4, removing the residual connection results in a significant performance degradation. Besides the general benefit of facilitating model training, and more importantly, the residual connection increases the predictability of the original time series. The fully connected layer is also important for prediction accuracy, which shows the effectiveness of the decoder in extracting and fusing the most relevant temporal information according to the given supervision. We also performed a comprehensive ablation analysis of the impact of K (number of stacks), L (number of levels), and the choice of operator in the interactive learning mechanism.

Limitations and Future Work

In this paper, we focused primarily on the TSF problem for regular time series, which are collected at equally spaced intervals and arranged in chronological order. However, in real-world applications, time series may contain noise or missing data, or may be collected at irregular time intervals; we call these irregular time series. Although the proposed SCINet is relatively robust to noisy data thanks to its progressive downsampling and interactive learning procedure, if the ratio of missing data exceeds a certain threshold, SCINet's downsampling-based multi-resolution sequence representation introduces bias, which could lead to poor forecasting performance. In addition, the proposed downsampling mechanism may have difficulty handling data collected at irregular intervals. We will consider these issues in the future development of SCINet. Furthermore, this study focuses on the deterministic time series forecasting problem. Many application scenarios require probabilistic forecasts, and SCINet will be revised to generate such forecasts. Finally, while SCINet produces promising results for spatial-temporal time series without explicitly modeling spatial relationships, incorporating a dedicated spatial model could further improve forecast accuracy. We plan to investigate such a solution in a future study.


Motivated by the unique properties of time series data compared to common sequence data, this paper proposes a new neural network architecture for time series modeling and prediction, the sample convolution and interaction network (SCINet). The proposed SCINet is a hierarchical downsample-convolution-interaction structure with rich convolution filters. It iteratively extracts and exchanges information at different temporal resolutions and learns effective representations with enhanced predictability. Extensive experiments on a variety of real-world TSF datasets have demonstrated that the model outperforms state-of-the-art methods.

友安 昌幸 (Masayuki Tomoyasu)
JDLA G certificate 2020#2, E certificate2021#1 Japan Society of Data Scientists, DS Certificate Japan Society for Innovation Fusion, DX Certification Expert Amiko Consulting LLC, CEO
