GANs in Time Series, Too
3 main points
✔️ A review of research on the application of GANs to time series data generation
✔️ Useful results are demonstrated by addressing the challenges unique both to GANs themselves and to time series
✔️ Privacy protection is one of the key challenges in time series data generation
Generative adversarial networks in time series: A survey and taxonomy
Written by Eoin Brophy, Zhengwei Wang, Qi She, Tomas Ward
(Submitted on 23 Jul 2021)
Comments: Published on arXiv.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
code:
Introduction
This is a survey of research applying GANs to time-series problems, by a group at Dublin City University and ByteDance, which also proposes a taxonomy. As is well known, most time-series data are normal, and collecting a large amount of outlier data requires substantial resources. Various generative models have therefore been proposed, and it is natural to consider applying GANs.
The first model to consider as a generative model is the autoencoder (AE), shown in Fig. 1.
Compared to AEs, GANs are the front-runners because of the quality of the data they generate and their innate privacy protection.
The authors also point to review papers on GANs in other fields, which are omitted here.
Challenges in applying to time series
The three main challenges are as follows.
Stability of learning
Training stability is a challenge for GANs in general, not only for time series. Two problems were described in the original GAN paper: 1) vanishing gradients and 2) mode collapse. Vanishing gradients are caused by directly optimizing the loss function expressed by Eq. (1). When the discriminator D reaches its optimal state, optimizing Eq. (1) for the generator G minimizes the Jensen-Shannon (JS) divergence.
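The loss referred to as Eq. (1) is not reproduced in this article; assuming the survey follows the standard notation of Goodfellow et al., it is the original minimax objective:

```latex
% Eq. (1), reconstructed under the assumption of standard GAN notation
\[
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_r}[\log D(x)] +
  \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
\]
% With the discriminator at its optimum D^*, this reduces, up to a constant,
% to the JS divergence between the real and generated distributions:
\[
V(D^*, G) = 2\,\mathrm{JS}(p_r \,\|\, p_g) - 2\log 2
\]
```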
The JS divergence becomes a constant when there is no overlap between p_r and p_g, which results in a zero gradient. In practice, there is a high probability that p_r and p_g do not overlap, or that the overlap is negligible.
To avoid this, the minimization in Eq. (3) is used to update the generator G instead.
This avoids the vanishing gradient but leads to the problem of mode collapse. Optimizing Eq. (3) is equivalent to optimizing the reverse Kullback-Leibler (KL) divergence KL(p_g || p_r). When optimizing the reverse KL divergence, if p_r has multiple modes, p_g will recover one mode and ignore the others. As a result, training the generator G with Eq. (3) can only generate a few modes of the real data. These problems can be mitigated by changing the architecture and the loss function.
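The generator loss referred to as Eq. (3) is likewise not reproduced here; under the same assumption about notation, it is the non-saturating "-log D" loss, whose connection to the reverse KL divergence is:

```latex
% Eq. (3), reconstructed: the non-saturating generator loss
\[
\min_G \; -\mathbb{E}_{z \sim p_z}[\log D(G(z))]
\]
% With an optimal discriminator this objective equals, up to terms that do
% not depend on G, the reverse KL minus twice the JS divergence:
\[
\mathrm{KL}(p_g \,\|\, p_r) - 2\,\mathrm{JS}(p_r \,\|\, p_g)
\]
```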
Evaluation
Many performance evaluation measures for GANs have been proposed. Evaluation of GANs in computer vision is usually designed around two aspects: the quality and the quantity of the generated data. The most typical qualitative measure is the quality of the generated images, as judged by human annotation. Quantitative measures compare the generated images with real images in terms of statistical properties, such as Maximum Mean Discrepancy (MMD), Inception Score (IS), and Fréchet Inception Distance (FID).
Compared to image-based GANs, time-series data is difficult to evaluate quantitatively in the sense of human perception. For qualitative evaluation, t-SNE or PCA is usually used to visualize how similar the generated samples are to the real ones. For quantitative evaluation, two-sample tests are applied, as with image-based GANs.
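As a concrete illustration of the qualitative check, here is a minimal sketch that projects real and generated windows into two dimensions with PCA and t-SNE and plots the overlap; the array shapes and variable names are illustrative assumptions, not the survey's code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

real = np.random.randn(500, 24)    # stand-in for real windows, shape (n, seq_len)
synth = np.random.randn(500, 24)   # stand-in for GAN samples of the same shape

X = np.vstack([real, synth])
labels = np.array([0] * len(real) + [1] * len(synth))

for name, reducer in [("PCA", PCA(n_components=2)),
                      ("t-SNE", TSNE(n_components=2, perplexity=30))]:
    emb = reducer.fit_transform(X)           # 2-D embedding of both sets
    plt.figure()
    plt.scatter(emb[labels == 0, 0], emb[labels == 0, 1], s=5, label="real")
    plt.scatter(emb[labels == 1, 0], emb[labels == 1, 1], s=5, label="synthetic")
    plt.title(name)
    plt.legend()
plt.show()
```

If the two point clouds are indistinguishable, the generator has at least matched the coarse structure of the data.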
Privacy risk
A wide range of methods is used to assess the privacy risks associated with the data generated by GANs.
Commonly Used Datasets
There are no standard or commonly used benchmark datasets for time-series data generation comparable to the image-based datasets (CIFAR, MNIST, ImageNet). Table 1 shows a list of commonly used datasets. Two repositories stand out: the UCR Time Series Classification/Clustering database and the UCI Machine Learning Repository.
A Taxonomy of Time Series Based GANs
The data are classified into discrete and continuous variants. In a discrete time series, data reporting is infrequent and irregular, and there may be gaps of missing values due to interruptions in reporting. Discrete time-series generation produces sequences that may have temporal dependencies but contain discrete values. A continuous time series has data at every time point. Fig. 3 shows an example of each.
The challenge of generating discrete time series data
The obstacle for GANs is that the gradient is zero almost everywhere: the distribution over discrete objects is not differentiable with respect to its parameters. This limitation means that the generator cannot be trained simply by backpropagation.
The challenge of generating continuous time-series data
GANs originally dealt with continuous data in the form of images, but time series pose additional problems due to their temporal nature. There are complex correlations between temporal features and their attributes; for example, in multi-channel biometric/physiological data, ECG characteristics depend on the age and health status of the individual. Time-series data also exhibits long-term correlations, and determining an appropriate length is harder than for image data. Resizing an image is a well-understood operation, even if it degrades quality; for continuous time-series data, however, there is no standard dimension (length) to feed a GAN, which makes benchmark comparisons difficult.
RNNs (Fig. 4) are suited to sequential data because of their recurrent structure, but they struggle to learn long-term dependencies, so a variant, the LSTM (Fig. 4, right), was developed. Most of the RNN-based architectures discussed in this paper use LSTMs.
The RNN-based Recurrent GAN (RGAN) was proposed in 2016. It includes a recurrent feedback loop in the generator.
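The idea can be sketched in a few lines: an LSTM consumes a noise sequence and emits one output per time step. Below is a minimal sketch of a recurrent generator in that spirit; all layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RecurrentGenerator(nn.Module):
    """LSTM generator: maps a noise sequence to a synthetic time series."""
    def __init__(self, noise_dim=5, hidden_dim=64, out_dim=1):
        super().__init__()
        self.lstm = nn.LSTM(noise_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, z):               # z: (batch, seq_len, noise_dim)
        h, _ = self.lstm(z)             # h: (batch, seq_len, hidden_dim)
        return torch.tanh(self.out(h))  # one sample per time step

z = torch.randn(32, 100, 5)             # 32 noise sequences of length 100
fake = RecurrentGenerator()(z)           # -> (32, 100, 1)
```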
Discrete variant GANs
・Sequence GAN (SeqGAN) (Sept. 2016)
The SeqGAN generator contains LSTM cells and the discriminator is a Convolutional Neural Network (CNN). It addresses the aforementioned challenges for discrete data and outperformed the other methods proposed up to 2016. The generator is updated by a policy gradient, with the reward expectation obtained from the discriminator via Monte Carlo search, much like reinforcement learning. It was originally developed for discrete sequential data such as text, but it opened the door to continuous sequential data and time series. For evaluation, the authors use synthetic data produced by an oracle LSTM initialized with random parameters drawn from a normal distribution, and also compare the results on real-world data.
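The update can be sketched as a REINFORCE step. For brevity, the sketch below rewards whole sequences; SeqGAN proper estimates per-step rewards by completing each prefix with Monte Carlo rollouts. Names such as `sample_with_log_probs` are assumed helpers, not the paper's code.

```python
import torch

def seqgan_generator_step(generator, discriminator, optimizer, batch_size):
    # Sample discrete token sequences, keeping each step's log-probability.
    tokens, log_probs = generator.sample_with_log_probs(batch_size)
    with torch.no_grad():
        rewards = discriminator(tokens)  # P(sequence is real), shape (batch,)
    # REINFORCE: gradients flow through the log-probs, not the discrete samples.
    loss = -(rewards.unsqueeze(1) * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```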
・Quant GAN (Jul. 2019)
Quant GAN is a data-driven model that aims to capture the long-range dependence of financial time-series data. Both the generator and the discriminator use a Temporal Convolutional Network (TCN) with skip connections, i.e., a dilated causal convolutional network like WaveNet, which is suitable for modeling long-range dependence in continuous sequential data. The generator functions as a stochastic volatility neural network consisting of volatility and drift TCNs. The temporal block used in the TCN consists of two dilated causal convolution layers and two parametric ReLU (PReLU) activation functions. The data produced by the generator is passed to the discriminator, and the averaged output becomes the Monte Carlo estimate of the discriminator's loss. Although the method outperforms conventional approaches, the computational complexity of modeling long continuous time series is a problem, so the method was applied to discretized data.
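The temporal block described above can be sketched as follows; channel counts and the left-padding scheme that enforces causality are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBlock(nn.Module):
    """Two dilated causal 1-D convolutions, each followed by PReLU."""
    def __init__(self, channels=32, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad: no future leakage
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act1, self.act2 = nn.PReLU(), nn.PReLU()

    def forward(self, x):                        # x: (batch, channels, time)
        h = self.act1(self.conv1(F.pad(x, (self.pad, 0))))
        return self.act2(self.conv2(F.pad(h, (self.pad, 0))))
```

Stacking such blocks with dilations 1, 2, 4, ... gives the exponentially growing receptive field that makes TCNs suited to long-range dependence.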
Continuous variant GANs
・Continuous RNN-GAN (C-RNN-GAN) (Nov. 2016)
C-RNN-GAN generates continuous sequential data. The generator is an RNN and the discriminator is a bidirectional RNN; the RNNs here are two-layer LSTMs.
・Recurrent Conditional GAN (RCGAN) (2017)
The architecture of RCGAN differs from that of C-RNN-GAN: it uses LSTM RNNs, but the discriminator is not bidirectional, and the output of the generator is not fed back as input at the next time step. In this model, a condition c is given as input to assign relevant labels to the time-series data.
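Conditioning of this kind is usually implemented by concatenating the condition with the noise at every time step, as in this sketch (dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ConditionalRecurrentGenerator(nn.Module):
    """RCGAN-style generator: condition c is appended to the noise per step."""
    def __init__(self, noise_dim=5, cond_dim=4, hidden_dim=64, out_dim=1):
        super().__init__()
        self.lstm = nn.LSTM(noise_dim + cond_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, z, c):             # z: (B, T, noise_dim), c: (B, cond_dim)
        c_rep = c.unsqueeze(1).expand(-1, z.size(1), -1)  # repeat c along time
        h, _ = self.lstm(torch.cat([z, c_rep], dim=-1))
        return self.out(h)
```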
・Sequentially Coupled GAN (SC-GAN) (Apr. 2019)
SC-GAN aims to generate patient-centric medical data that reports the patient's current status and recommends medication dosages according to the patient's condition. Two coupled generators output the patient's current status and the recommended dosage, respectively. The discriminator is a two-layer bidirectional LSTM, and both generators are two-layer unidirectional LSTMs that are pre-trained in a supervised manner.
・Noise Reduction GAN (NR-GAN) (Oct. 2019)
NR-GAN is intended for noise reduction in time-series data, in particular for denoising EEG (electroencephalogram) signals. NR-GAN removes noise in the frequency domain. The generator is a two-layer 1-D CNN with a fully connected output layer, and the discriminator replaces the fully connected layer with a softmax layer to compute probabilities. The generator does not sample from a latent space but tries to generate clean data from the raw EEG. It is equivalent to a classical frequency filter but is limited by the amount of noise it can handle.
・TimeGAN (Dec. 2019)
TimeGAN provides a framework that can handle both traditional unsupervised GAN training and supervised learning with more control. The data takes the form of a tuple of a static component s and a time-varying component; latent representations are obtained from real data through an encoder and from noise through a generator. The combination of these yields the supervised loss, and the path through the discriminator yields the unsupervised loss. In addition, real data passed through the encoder is recovered by a decoder, yielding a reconstruction loss. These three loss functions are used to train the model, and the authors claim improvements over RCGAN, C-RNN-GAN, and WaveGAN.
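Schematically, the three losses combine as in the sketch below. The module names (`embedder`, `recovery`, `generator`, `supervisor`, `discriminator`) and the loss forms are simplified placeholders; the paper gives the exact objectives and weights.

```python
import torch
import torch.nn.functional as F

def timegan_losses(x, z, embedder, recovery, generator, supervisor, discriminator):
    h_real = embedder(x)      # latent codes of real sequences
    h_fake = generator(z)     # latent codes produced from noise
    # 1) Reconstruction: encode real data, then decode it back.
    loss_rec = F.mse_loss(recovery(h_real), x)
    # 2) Supervised: predict the next latent step from the previous ones.
    loss_sup = F.mse_loss(supervisor(h_real)[:, :-1], h_real[:, 1:])
    # 3) Unsupervised (adversarial): fool the discriminator in latent space.
    logits_fake = discriminator(h_fake)
    loss_adv = F.binary_cross_entropy_with_logits(
        logits_fake, torch.ones_like(logits_fake))
    return loss_rec, loss_sup, loss_adv  # weighted and summed during training
```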
・Conditional Sig-Wasserstein GAN (SigCWGAN) (Jun. 2020)
The authors develop a metric called Signature Wasserstein-1 (Sig-W1) that captures the temporal dependence of time-series models, and use it as the discriminator (the orange part of the figure in the paper). It is an abstract, universal representation of a complex data stream and avoids the computational cost of the usual Wasserstein metric. The generator is also novel: a Conditional Autoregressive Feed-forward Neural Network (AR-FNN) that captures the autoregressive character of the time series. The generator maps past data and noise to future data, and the authors claim it outperforms TimeGAN, RCGAN, and the Generative Moment Matching Network (GMMN).
・Decision Aware Time-series conditional GAN (DAT-CGAN) (Sept. 2020)
The framework is designed to support end-user decision-making, in particular the selection of financial portfolios. A multi-Wasserstein loss structured around decision-related quantities is used (the equation is given in the original paper).
The generator is a two-layer feed-forward NN that outputs asset returns, which are input to the discriminator, also a two-layer feed-forward NN. The outputs appear reliable, but the computational cost is high: training a single generator takes a month.
・Synthetic biomedical Signals GAN (SynSigGAN) (Dec. 2020)
SynSigGAN is designed to generate a variety of physiological/biomedical signals: ECG, EEG, EMG, and photoplethysmogram (PPG). It can generate data from the MIT-BIH Arrhythmia Database and other sources. The authors evaluate many variants, such as BiLSTM-GRU and BiLSTM-CNN GANs, and conclude that BiGridLSTM works best.
Applications
Data augmentation
GANs are well established when it comes to data augmentation. Reasons for augmentation range from datasets that are small, lacking in variety, or biased, to reproducing restricted datasets for wider dissemination.
A well-known solution to the data scarcity problem is transfer learning, and GAN-augmented datasets have shown further improvements in certain classification and recognition tasks. As we will see later, data augmentation with GANs also has privacy advantages.
In the pharmaceutical and medical fields, these advantages are beginning to be exploited for time-series data.
Audio and text-to-speech are popular areas; C-RNN-GAN is an example of a music application. GANs are also applied in finance for forecasting and decision-making, and have been used to predict soil temperature and pharmaceutical spending.
Data completion (Imputation)
Missing or corrupted data is a common problem in real-world datasets. Guo et al. use a GAN for multivariate time-series imputation.
Noise reduction
Artifacts in time-series data often appear as noise in the signal and can be a persistent problem for downstream processing and analysis. Corrupted data can bias a dataset and degrade the performance of critical systems such as health applications. Common noise-removal methods include adaptive linear filters; GAN-based methods are also being explored, and denoising EEG data with NR-GAN is competitive with traditional methods.
Anomaly detection
Detecting outliers and anomalies in time-series data is important in many real-world systems and sectors. Whether it is physiological anomalies that herald malignant symptoms or abnormal trading patterns in stock prices, anomaly detection provides vital information. GANs have been applied to detect anomalies in ECGs, cardiovascular disease, taxi traffic, malicious players in cyber-physical systems, stock market manipulation, and more.
Others
Image-based GANs can also be used for time series. The sequence is first converted into an image by some transformation, the GAN is trained on the images, and after training, sequence data is recovered by the inverse transformation. This approach has been used for waveform audio generation, anomaly detection, and generating physiological time-series data.
Evaluation metrics
As mentioned above, evaluating GANs is difficult, and researchers have not agreed on a metric that best captures GAN performance; most proposed metrics target image data. Evaluation metrics can be divided into two categories: qualitative and quantitative. Qualitative evaluation is, in essence, human judgment of appearance; it lacks objectivity and is not considered a complete evaluation of GAN performance. Quantitative evaluation includes statistical metrics for time-series analysis and similarity measures such as PCC (Pearson Correlation Coefficient), PRD (Percent Root Mean Square Difference), MSE, RMSE (Root Mean Squared Error), MRE (Mean Relative Error), and MAE (Mean Absolute Error). These are the most common measures for time series and hence also serve as GAN performance measures.
Several metrics established for image-based GANs have carried over to sequential or time-series GANs, such as IS (Inception Score), FD/FID (Fréchet Inception Distance), and SSIM (Structural Similarity Index). MMD (Maximum Mean Discrepancy), a measure of the similarity/dissimilarity of two probability distributions, is applicable across domains. Another metric generalized to sequential data is the Wasserstein distance.
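MMD itself is easy to write down. Below is a sketch of the standard (biased) estimator of squared MMD with an RBF kernel; the bandwidth `gamma` is an illustrative assumption.

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimator of MMD^2 between sample sets X and Y, RBF kernel."""
    def k(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```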
Data generated by a GAN is often used downstream in classification tasks. TSTR (Train on Synthetic, Test on Real) and TRTS (Train on Real, Test on Synthetic) were proposed as holistic metrics: the performance of the classifier is taken as the quality of the generated data, measured by precision, recall, and F1, as well as WA (Weighted Accuracy) and UAR (Unweighted Average Recall).
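TSTR is simple to set up, as the sketch below shows; the classifier choice and feature layout are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def tstr_score(X_synth, y_synth, X_real_test, y_real_test):
    """Train on synthetic data, test on held-out real data."""
    clf = RandomForestClassifier(n_estimators=100).fit(X_synth, y_synth)
    return f1_score(y_real_test, clf.predict(X_real_test), average="macro")

# TRTS is the mirror image: fit on real data, evaluate on synthetic samples.
```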
Commonly used distance and similarity measures for time-series data are ED (Euclidean Distance), DTW (Dynamic Time Warping), and MTDTW (Multivariate (in)dependent DTW).
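For concreteness, here is a minimal dynamic-programming DTW; it runs in O(nm), and production code would use a library such as dtaidistance.

```python
import numpy as np

def dtw(a, b):
    """DTW distance between two 1-D sequences with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of insertion, deletion, and match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```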
In addition, the ACF (autocorrelation function) score and the DY metric are used in the financial sector; the NS (Nash-Sutcliffe model efficiency coefficient), WI (Willmott index of agreement), and LMI (Legates and McCabe index) for other prediction tasks; and NSDR (Normalized Source-to-Distortion Ratio), SIR (Source-to-Interference Ratio), SAR (Source-to-Artifact Ratio), and t-SNE for speech.
The architectures, applications, metrics, and datasets of all GANs discussed in this review are summarized in Table 2. Results for various GANs for the sine wave and ECG data are available in Tables 3 and 4.
Privacy
There is a wide range of methods used to assess and reduce the privacy risks associated with GAN-generated data.
Differential Privacy
Differential privacy is a concept proposed by Dwork in 2006. It aims to protect the privacy of the individuals underlying a database. Since a GAN models its training data, the privacy question is whether the generated samples capture useful information about the training population without being linkable to any individual's data.
Abadi et al. demonstrated training a DNN with differential privacy, and Xie et al. proposed DPGAN, which achieves differential privacy by adding noise to the gradients in the optimizer during training.
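The core of this idea, clipping each per-example gradient and adding Gaussian noise before the optimizer step, can be sketched as follows. The constants are illustrative, and real implementations (e.g., Opacus) also track the privacy budget (epsilon, delta).

```python
import torch

def dp_gradient_step(params, per_example_grads, optimizer,
                     clip_norm=1.0, noise_multiplier=1.1):
    batch = per_example_grads[0].shape[0]
    for p, g in zip(params, per_example_grads):       # g: (batch, *p.shape)
        norms = g.reshape(batch, -1).norm(dim=1)      # per-example L2 norms
        scale = (clip_norm / (norms + 1e-6)).clamp(max=1.0)
        g = g * scale.view(-1, *([1] * (g.dim() - 1)))  # clip each example
        noise = torch.randn_like(p) * noise_multiplier * clip_norm
        p.grad = (g.sum(dim=0) + noise) / batch         # noisy averaged gradient
    optimizer.step()
    optimizer.zero_grad()
```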
Decentralized/Federated Learning
Distributed or decentralized learning is another approach to limiting privacy risk. The standard approach to machine learning keeps the training data on a single server; applying a decentralized/distributed approach to GANs requires sufficient communication bandwidth and guaranteed convergence. Federated learning makes this possible, and its application to GANs is FedGAN.
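The federated idea behind FedGAN can be sketched as a FedAvg round: each client trains locally and a server averages the weights. `local_train` and the client objects are placeholders, not the paper's implementation.

```python
import copy
import torch

def federated_round(global_generator, clients, local_train):
    local_states = []
    for client in clients:
        local_gen = copy.deepcopy(global_generator)
        local_train(local_gen, client.data)   # a few epochs on local data only
        local_states.append(local_gen.state_dict())
    # FedAvg: element-wise mean of the clients' weights.
    averaged = {k: torch.stack([s[k].float() for s in local_states]).mean(0)
                for k in local_states[0]}
    global_generator.load_state_dict(averaged)
```

Only model weights leave each client; the raw time series never does, which is the source of the privacy benefit.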
Clearly, the combination of Differential Privacy and Federated Learning is the next area of research.
Privacy Protection Assessment
Whether a generative model protects privacy can be assessed through tests known as attribute disclosure and presence disclosure. The latter is better known in machine learning as the membership inference attack: a quantitative assessment of whether a machine learning model leaks information about the individual data records it was trained on.
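A toy version of the presence-disclosure test: if a model's scores on training records are noticeably higher than on unseen records, membership leaks. The threshold-based attacker below is an illustrative assumption, far simpler than published attacks.

```python
import numpy as np

def membership_inference_accuracy(scores_train, scores_holdout):
    """Accuracy of a naive threshold attacker; ~0.5 means no leakage."""
    scores = np.concatenate([scores_train, scores_holdout])
    is_member = np.concatenate([np.ones_like(scores_train),
                                np.zeros_like(scores_holdout)])
    threshold = np.median(scores)      # simplest possible decision rule
    guesses = (scores > threshold).astype(float)
    return float((guesses == is_member).mean())
```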
Hayes et al. applied the membership inference attack to synthetic images and concluded that the quality of the generated data must be sacrificed to achieve an acceptable level of privacy. Conversely, other researchers have shown that networks trained with differential privacy can withstand membership inference attacks without losing much generation quality.
Finally
Time-series GANs have been developed to meet the challenges described at the outset, and progress has been made for both discrete and continuous variants. On the other hand, the architectures of these GANs have diversified by application, and the corresponding loss functions differ, making it difficult to discuss them in a unified way.
It is therefore unrealistic to compare them all and declare a single best model. The more important question is: are they useful in practice?
Personal opinion
As the authors of this review conclude, further research is needed to determine whether GAN-generated data can, for example, help us diagnose patients better. This applies not only to GAN-generated data but to time-series data in general: sequential data that look the same can be normal in one situation and abnormal in another, and even multivariate data from a single system can carry different meanings. For a model to be truly practical, it must be able to express the meaning and value of each data stream, or combination of streams, for users or patients.