A Super-fast Alternative For Time Series Data Classification
3 main points
✔️ A method offering a new perspective on time series classification is proposed
✔️ Dramatically reduced training time based on linear classification with high prediction accuracy
✔️ Converts time-series data to a symbolic representation and uses a χ-squared upper bound as the criterion to terminate the search of a trie (branch pruning)
MrSQM: Fast Time Series Classification with Symbolic Representations
written by Thach Le Nguyen, Georgiana Ifrim
(Submitted on 2 Sep 2021)
Comments: Published on arxiv.
Subjects: Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
This paper is about the classification of time series data. It is characterized by the use of symbolic representations and of pattern mining with a chi-squared (Chi2) upper bound. The model is named "Multiple Representations Sequence Miner", abbreviated MrSQM. It builds on earlier linear-classification approaches and improves both accuracy and training efficiency.
The baseline for previous studies of time series classification (TSC) is the 1-nearest-neighbor classifier with Dynamic Time Warping (1NN-DTW). This model is simple but lacks robustness to noise, and research has since developed along three lines.
1) Ensemble classification
HIVE-COTE is popular; it ensembles a large number of classifiers using different data representations and features, weighted according to the quality of each classifier.
The problem is that training it on the UEA/UCR benchmarks takes more than two weeks.
2) Deep learning classification
FCN, ResNet, and similar architectures are efficient and show the same accuracy as HIVE-COTE. InceptionTime has improved accuracy and variance, but training still takes several weeks.
3) Linear classification
Linear classification is a traditional method but has recently shown good results on time series. By taking a large feature space, it reduces the need for a nonlinear classifier, an idea that has been successfully incorporated into large-scale classifiers such as LIBLINEAR.
In TSC, WEASEL generates a large SFA (see below) word feature space, filters it with χ-squared feature selection, and trains a logistic regression classifier; MrSEQL uses a large feature space of SAX (see below) and SFA words with greedy gradient descent and logistic regression. The more recent ROCKET generates a large number of random convolution kernels and uses max pooling and the proportion of positive values (PPV) as features.
Accuracy comparable to ensemble learning or deep learning can be achieved with several orders of magnitude faster learning. The UEA/UCR benchmark can be trained in a few hours.
Another feature is the conceptual simplicity of the method, which can be divided into three stages: 1) data transformation, 2) feature selection, and 3) linear classification.
It builds on the linear classification of previous studies. The input time series is transformed into symbolic representations such as SAX and SFA, techniques developed for document mining: SAX is Symbolic Aggregate Approximation and SFA is Symbolic Fourier Approximation. The whole workflow is shown in Fig. 1.
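The three-stage workflow above can be sketched end to end. This is a minimal illustration, not the paper's implementation: the symbolic transform below is a toy stand-in for SAX/SFA, and the synthetic two-class data, window size, and subword length are all arbitrary choices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

def to_words(series, window=8, alphabet="abcd"):
    """Toy symbolic transform: the mean of each sliding window becomes one symbol."""
    edges = np.quantile(series, np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    means = [series[i:i + window].mean() for i in range(len(series) - window + 1)]
    return "".join(alphabet[np.searchsorted(edges, m)] for m in means)

def subword_features(words, k=3):
    """Candidate features: counts of length-k subwords in each symbolic sequence."""
    vocab = sorted({w[i:i + k] for w in words for i in range(len(w) - k + 1)})
    index = {s: j for j, s in enumerate(vocab)}
    X = np.zeros((len(words), len(vocab)))
    for r, w in enumerate(words):
        for i in range(len(w) - k + 1):
            X[r, index[w[i:i + k]]] += 1
    return X

rng = np.random.default_rng(0)
# Two synthetic classes: pure noise vs. noise plus a sine bump.
X_raw = np.vstack([rng.normal(size=(20, 64)),
                   rng.normal(size=(20, 64)) + np.sin(np.linspace(0, 6, 64))])
y = np.array([0] * 20 + [1] * 20)

words = [to_words(s) for s in X_raw]                                  # 1) data transformation
X = subword_features(words)
X_sel = SelectKBest(chi2, k=min(50, X.shape[1])).fit_transform(X, y)  # 2) feature selection
clf = LogisticRegression(max_iter=1000).fit(X_sel, y)                 # 3) linear classification
print(clf.score(X_sel, y))                                            # training accuracy
```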
Symbolic representation of time series
SAX and SFA commonly take the following three steps to convert data.
1) Extract a segment of a time series in a sliding window
2) Approximate each segment with a vector, usually of shorter length
3) Discretize the approximation to obtain symbol words (e.g. abbacc)
The main differences between SAX and SFA are approximation and discretization techniques, which are summarized in Table 1.
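As a concrete illustration of the three steps above, here is a minimal SAX-style sketch. The window stride, word length, and 4-letter alphabet are arbitrary choices; the breakpoints are the standard N(0,1) quartiles that SAX-style discretization uses.

```python
import numpy as np

def sax_word(segment, word_len=4, alphabet="abcd"):
    # Step 2: PAA -- approximate the (z-normalized) segment with a short vector.
    seg = (segment - segment.mean()) / (segment.std() + 1e-8)
    paa = seg.reshape(word_len, -1).mean(axis=1)
    # Step 3: discretize each PAA value against N(0,1) quantile breakpoints.
    breakpoints = np.array([-0.674, 0.0, 0.674])  # quartiles of N(0, 1)
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

series = np.sin(np.linspace(0, 2 * np.pi, 32))
# Step 1: slide a window over the series; each window becomes one symbol word.
words = [sax_word(series[i:i + 16]) for i in range(0, len(series) - 16 + 1, 4)]
print(words)
```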
Feature Selection for Symbol Sequences
The χ-squared (Chi2) test is a well-known method for ranking candidate features. For an observed frequency O_k and an expected frequency E_k of a feature in class c_k, Chi2 is obtained as Chi2 = Σ_k (O_k − E_k)² / E_k.
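A minimal sketch of this score for a single candidate feature, with E_k computed from the class sizes under the independence assumption (the example counts are hypothetical):

```python
def chi2_score(observed, class_totals):
    """Chi2 of one candidate feature: compare observed per-class counts O_k
    with the counts E_k expected if the feature were class-independent."""
    total = sum(class_totals)
    feature_total = sum(observed)
    score = 0.0
    for o_k, n_k in zip(observed, class_totals):
        e_k = feature_total * n_k / total  # expected count under independence
        score += (o_k - e_k) ** 2 / e_k
    return score

# Hypothetical example: a subword occurring 18 times in class A (30 series)
# and 2 times in class B (30 series) is strongly class-discriminative.
print(chi2_score([18, 2], [30, 30]))   # → 12.8
print(chi2_score([10, 10], [30, 30]))  # → 0.0, matches the expected split
```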
Each subword that appears in the symbolic representation is a candidate feature. Evaluating them all exhaustively and ranking them by Chi2 is expensive; however, the Chi2 statistic has an upper bound that is particularly useful for sequential data, expressed as equation (2) in the paper.
Due to the anti-monotonicity of sequential data, a sequence can occur at most as frequently as any of its prefixes. Therefore, the Chi2 score of a candidate feature can be bounded early by examining its prefix via equation (2). For efficient search and effective use of the upper bound, a trie is used to store sub-sequences. A trie is a tree data structure in which edges correspond to symbols and each node corresponds to the sub-sequence formed by concatenating the edge symbols on the path from the root to that node.
Each node stores an inverted index (a list of positions) of the corresponding sub-sequence, so child nodes can be generated quickly by iterating through that list. For simplicity, suppose we are searching for discriminative sub-sequences in two-class sequence data: the O_k value of a node is less than or equal to that of its parent. Thus, once a node's upper bound falls below the threshold, everything beneath it is also below the threshold without having to search it, and the subtree is pruned from the trie.
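A minimal sketch of the pruning idea, using raw occurrence frequency as a simplified stand-in for the paper's Chi2 upper bound (the alphabet, toy data, and min_count threshold are arbitrary):

```python
def search(words, alphabet="abc", max_len=3, min_count=3):
    """Enumerate subwords occurring in >= min_count series, with pruning:
    a sub-sequence occurs in at most as many series as its prefix, so once
    a prefix's frequency is too low, its whole subtree can be skipped."""
    kept, pruned = [], 0

    def freq(pattern):
        return sum(pattern in w for w in words)

    def expand(prefix):
        nonlocal pruned
        for sym in alphabet:
            cand = prefix + sym
            f = freq(cand)
            if f < min_count:   # bound: no extension can occur more often
                pruned += 1
                continue        # prune the entire subtree under `cand`
            kept.append((cand, f))
            if len(cand) < max_len:
                expand(cand)

    expand("")
    return kept, pruned

words = ["abcabc", "abcaab", "bcbcbc", "ccbbca"]
kept, pruned = search(words)
print(len(kept), pruned)  # → 6 12: twelve subtrees skipped without searching
```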
This is the first time that this method has been applied to time series data. We have obtained an efficient algorithm for symbolic representation time series classification.
Another method to select useful features is Information Gain (IG), which measures how well a feature partitions the data with respect to the target variable. However, its partitioning performance is only comparable to Chi2, and it cannot be applied to multi-class problems.
Sub-sequences of sequence data are highly dependent on one another. Chi2 tests and other supervised methods that rank features independently become progressively weaker in the face of this collinearity: the top-ranked features are highly correlated with each other, so the selected feature set is redundant. Random feature selection increases diversity.
For a symbolic representation of a time series, any subword (substring) of the symbolic sequence is a candidate feature. The random feature selector simply samples the index of a time series, a position within it, and the length of the subword.
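A minimal sketch of this random selector (the length range, seed, and toy sequences are arbitrary choices):

```python
import random

def random_subwords(words, n_features, min_len=2, max_len=5, seed=0):
    """Unsupervised selection: repeatedly sample (sequence, position, length)
    and take the resulting subword as a feature."""
    rng = random.Random(seed)
    feats = set()
    while len(feats) < n_features:
        w = rng.choice(words)                            # which symbolic sequence
        ln = rng.randint(min_len, min(max_len, len(w)))  # subword length
        start = rng.randint(0, len(w) - ln)              # position in the sequence
        feats.add(w[start:start + ln])
    return sorted(feats)

words = ["abcabcab", "bbacabca", "cabcabcc"]
print(random_subwords(words, n_features=5))
```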
Since each method has its advantages and disadvantages, it is necessary to combine supervised and unsupervised methods; in MrSQM, there are two ways to combine them: 1) supervised feature selection and unsupervised filtering; 2) unsupervised feature selection and supervised filtering.
MrSQM Classifier Variants
Based on the above concepts, four variants are defined.
1) MrSQM-R: Unsupervised feature selection by random sampling of feature values
2) MrSQM-RS: Hybrid method, where candidates are selected by MrSQM-R and refined by supervised Chi2 test.
3) MrSQM-S: Supervised selection that searches all sub-sequence features with branch pruning via the Chi2 upper bound and selects the optimal set of top-k sub-sequences by the Chi2 test.
4) MrSQM-SR: Hybrid method. The feature candidates are generated by supervised MrSQM-S and filtered by random sampling.
The data consists of 112 fixed-length univariate time series datasets from the UEA/UCR TSC Archive (MrSQM can be extended to variable lengths). Accuracy comparisons are performed with the Holm-corrected Wilcoxon signed-rank test and visualized with a Critical Difference (CD) diagram, calculated using the R library scmamp. The CD diagram shows the ranking of several methods by mean accuracy rank across the datasets; methods that are not statistically significantly different are connected by horizontal lines (indicating the CD), making it easy to see where significant differences exist.
Sensitivity analysis
Analysis of branch pruning effectiveness
The number of sub-sequences is greatly reduced by branch pruning (the Chi2 bound).
Comparison of SAX, SFA
There is no statistically significant difference between models that use only SAX or only SFA features. A significant difference was identified for MrSQM-R-SS, which uses both.
Comparison of feature selection approaches
There is no significant difference between MrSQM-RS and MrSQM-R, or between MrSQM-SR and MrSQM-S. There is, however, a significant difference between the pair MrSQM-RS/MrSQM-R and the pair MrSQM-SR/MrSQM-S. Training and inference take 70 minutes.
Using both SAX and SFA features takes about twice as much runtime as using only one of them.
MrSQM vs. SOTA symbolic TSC
There is no significant difference in accuracy between MrSQM and conventional symbolic TSC methods (WEASEL, MrSEQL, BOSS, cBOSS, S-BOSS). However, those conventional SOTA methods take more than 10-12 hours to train.
MrSQM vs. other SOTA TSC
To date, the most accurate time series classifiers are HIVE-COTE 1.0, TS-CHIEF, ROCKET, and InceptionTime; all except ROCKET require substantial computing resources and days to weeks of training. Fig. 7 compares the accuracy of each with MrSQM-RS.
The accuracy of MrSQM-RS is hardly inferior, and its training speed is much faster. Only ROCKET, at 1.5 hours for training and inference, is comparable to MrSQM-RS; however, ROCKET degrades accuracy on some datasets, while MrSQM-RS does not. HIVE-COTE and the others require intensive hyperparameter tuning, whereas MrSQM uses default parameters for all datasets.
Summary
This work applies symbolic representation and trie-based sequence mining, techniques developed in other fields, to time series classification. By pruning branches with the χ-squared (Chi2) upper bound, it achieves a significantly faster algorithm than conventional linear-classification approaches, with very little degradation in accuracy. It would be interesting to see whether it can be applied to multivariate data.