Maximizing Model Generalization in Manufacturing Through Self-Supervised and Federated Learning

Federated Learning 07/12/2023

3 main points
✔️ Challenges in smart factory realization include data labeling, scarce fault (negative) examples, and domain shift
✔️ In transfer learning, feature extractors trained with Barlow Twins outperformed supervised classifiers when transferred to operational environments with different process parameters, including new faults
✔️ In addition, incorporating Federated Learning (FL) for distributed learning allows learning generalizable representations for newly emerging faults

Maximizing Model Generalization for Manufacturing with Self-Supervised Learning and Federated Learning
written by Matthew Russell, Peng Wang
[Submitted on 27 Apr 2023 (v1), last revised 22 Sep 2023 (this version, v2)]
Comments: Accepted by arXiv
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Deep learning (DL) can diagnose faults and assess machine health from raw condition monitoring data without the need to manually design statistical features. However, practical manufacturing applications remain extremely challenging with existing DL methods.

Machine data is often unlabeled and derived from very few health conditions (e.g., only normal operating data). In addition, the model often encounters domain shifts as process parameters change or new failure categories emerge.

Traditional supervised learning relies on having a rich set of classes to partition the feature space at decision boundaries and may struggle to learn compact, discriminative representations that generalize to these unknown target domains. Transfer Learning (TL) with domain adaptation attempts to adapt these models to unlabeled target domains, but assumes a similar underlying structure that may not exist if new defects emerge.

We focus on maximizing the generality of features in the source domain and propose to apply TL via weight transfer to copy the model to the target domain. Specifically, Self-Supervised Learning (SSL) with Barlow Twins has the potential to generate more discriminative features for health monitoring than supervised learning by focusing on semantic properties of the data. Additionally, Federated Learning (FL) for distributed learning has the potential to improve generalization by sharing information across multiple client machines, increasing the effective size and diversity of the training data.

Results show that Barlow Twins performs better than supervised learning in unlabeled target domains where new motor faults appear, when the original training data contains few distinct categories. Incorporating FL may also provide a slight advantage by allowing knowledge about health conditions to be spread across machines.

Future research should continue to investigate the performance of SSL and FL in such realistic manufacturing scenarios.

Introduction

In smart factories, early detection and diagnosis of machine failures is critical to prevent costly downtime and repairs. To achieve this goal, machine learning discovers statistical patterns in large data sets and builds classification and regression models for condition monitoring and fault diagnosis. Deep learning (DL) uses multi-layered neural networks to automatically extract features from raw data such as vibration signals, a paradigm shift from traditional manually designed features. However, manufacturers are hesitant to utilize these tools due to a lack of confidence in the model's ability to adapt to changing factory environments. Increasing the generalization capability of models is required to build tolerance to changing process parameters, new operating conditions, and machine-to-machine variability, thereby improving reliability.

Early deep learning (DL) research in manufacturing has shown superiority over traditional methods such as support vector machines (SVMs) in analyzing condition monitoring data sets. While good results have been achieved on controlled laboratory datasets, many practical issues stand in the way of widespread adoption of DL in manufacturing. Unlike the imaging field, with its large volume and variety of data, fault diagnosis often lacks the volume and variety of data required, especially for generalization across data sets, operating conditions, and machines. Real industrial datasets contain few failure cases, and most of the data is unlabeled. In addition, the dynamic nature of the operating environment means that new types of failures can occur without warning and be misclassified by existing models. Further research is needed to overcome these challenges and increase the reliability of DL on the factory floor.

Transfer Learning (TL) is a way to mitigate problems related to model generalization. The goal of TL is to reuse existing models when data or tasks change (e.g., new faults or changes in process parameters). Such changes can affect the statistical properties of the data and push inputs outside the domain in which the model functions effectively. TL transfers the model from the labeled source domain to the unlabeled target domain. However, new faults in the target domain may limit how well a model trained on the source domain transfers. Additionally, the target domain may be unknown, or no data (including unlabeled data) may be available at the time of training. In such cases, TL must learn the most generic representation possible from the available data. This model is then transferred to the target domain and fine-tuned on the target domain data as it becomes available. In this way, it is possible to build a model for the target domain without assuming the same relationships as the source domain conditions.

Supervised learning relies on labeled data and may be inappropriate for practical condition monitoring when training conditions are limited or labels are lacking. In contrast, self-supervised learning (SSL), a technique that groups examples with similar semantic properties into compact feature clusters, lets models learn the variation within categories through random augmentations. For example, by mapping an inverted signal to the same features as the original signal, the model learns to ignore the inversion. SSL is label-free and is useful for learning data-centric representations from unannotated raw factory data.

SSL is well suited for bootstrapping condition monitoring models, but information sharing among machine fleets can further improve generalization. However, bandwidth constraints make it difficult for fleets to continuously aggregate data in the cloud. This is where Federated Learning (FL) comes in handy, allowing the development of globally informed models from distributed data. In this method, each client machine learns on local data and periodically sends its model, rather than raw data, to the server. The global model from the server is then distributed to the client machines, so the information is shared. FL can integrate information from multiple clients, increasing the effective size and variety of the training data without straining communication networks.

The condition monitoring literature lacks a cohesive introduction to SSL and FL for maximizing model generalization. This study outlines how SSL and FL can improve the generalization, and hence the reliability, of DL models on the factory floor through two complementary strategies: SSL extracts useful representations without the need for labeled data, while FL expands the effective size and diversity of the data set. The contributions of this study include:

1. an overview of relevant research in SSL and manufacturing,

2. an overview of relevant research in FL and manufacturing,

3. a theoretically motivated framework for combining SSL and FL to improve model generalization,

4. a case study using a motor fault data set to evaluate SSL and FL under new faults and changing process parameters.

Theoretical Background and Related Research

Supervised Learning and Transfer Learning

Many factors can limit the applicability and robustness of machine learning models. In manufacturing, changes in processing parameters, operating environment, and health conditions can adversely affect performance by shifting the distribution of input data outside the expected domain. Transfer Learning (TL) attempts to avoid the need for large amounts of labeled data in the target task by adapting or reusing models learned in the source domain to the relevant target domain.

・Supervised learning

A typical fault diagnosis model can be partitioned into a feature extraction backbone Gθ parameterized by weights θ and a classification head Fφ with weights φ that predicts the probability of K classes (e.g., faults) from the extracted features. Given the labeled data, the model parameters can be optimized by stochastic gradient descent and backpropagation with a cross-entropy loss (i.e., cost) function:
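For reference, a standard form of this loss is sketched below. The notation is assumed rather than taken from the paper: N labeled examples x_i, one-hot labels y_{i,k}, and [·]_k for the predicted probability of class k.

```latex
\mathcal{L}_{CE}(\theta, \phi)
  = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K}
    y_{i,k} \, \log \left[ F_{\phi}\!\left( G_{\theta}(x_i) \right) \right]_{k}
```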

Optimizing weights to maximize classification accuracy teaches the model to draw "decision boundaries" that separate features in different categories. However, changes in process parameters and operating environment shift the distribution of the input data and, in turn, of the features produced by Gθ. These new features will no longer match the decision boundaries learned by the classifier Fφ, resulting in undefined or inconsistent behavior. This undermines the generalization of the supervised classifier.

・Transfer learning through domain adaptation

Transfer Learning (TL) is one solution to the domain shift problem. For domain adaptation, unlabeled data from known target domains can regularize the supervised learning process so that Gθ produces a stable, matched distribution of source and target domain features for the classifier Fφ. An updated loss function containing data from the unlabeled target domain is used during training:
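A common way to write such a regularized loss is sketched here. The notation is assumed: L_CE denotes the cross-entropy loss on the labeled source data (X_s, Y_s), and X_t is the unlabeled target data.

```latex
\mathcal{L}(\theta, \phi)
  = \mathcal{L}_{CE}\big( F_{\phi}(G_{\theta}(X_s)),\, Y_s \big)
  + \lambda \, D\big( G_{\theta}(X_s),\, G_{\theta}(X_t) \big)
```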

where D(·, ·) is a function that measures the distribution discrepancy between the source domain features Gθ(Xs) and the target domain features Gθ(Xt). The coefficient λ controls the strength of feature regularization. Since the feature extractor Gθ produces a consistent distribution of features from both the source and target domains, the fault classifier Fφ is more likely to produce accurate predictions for the target domain.

A common implementation of D(·, ·) in manufacturing is maximum mean discrepancy (MMD). Using MMD to ensure similarity between source and target features, TL of bearing and gearbox vibration data across different loads and shaft speeds has been demonstrated. Because the kernel is flexible, MMD can be combined with polynomial or Cauchy kernels, as demonstrated on laboratory fault data sets. Applying MMD at multiple levels of a deep feature extractor can also improve lab-to-real transfer in locomotive bearing fault diagnosis and in bearing fault classification and localization.
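To make the discrepancy term concrete, the following is a minimal NumPy sketch of a biased squared-MMD estimate between two feature batches. The Gaussian kernel, the bandwidth `sigma`, and the synthetic feature batches are assumptions chosen for illustration; the studies above use their own kernels and features.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def mmd_squared(source_feats, target_feats, sigma=3.0):
    """Biased estimate of the squared MMD between two (n, d) feature batches."""
    k_ss = gaussian_kernel(source_feats, source_feats, sigma)
    k_tt = gaussian_kernel(target_feats, target_feats, sigma)
    k_st = gaussian_kernel(source_feats, target_feats, sigma)
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()

# Synthetic stand-ins for source and target features (hypothetical data).
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 8))
matched = rng.normal(0.0, 1.0, size=(200, 8))   # same distribution as source
shifted = rng.normal(1.5, 1.0, size=(200, 8))   # mean-shifted "target domain"

mmd_matched = mmd_squared(source, matched)
mmd_shifted = mmd_squared(source, shifted)
```

In this sketch the shifted batch yields a much larger value than the matched batch, which is the signal a domain adaptation loss would penalize.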

Domain-adversarial neural networks (DANN) replace the discrepancy term D(·, ·) with another neural network Dψ that learns to distinguish source features from target features. By training the feature extractor Gθ to fool the domain discriminator Dψ, the feature extractor learns to produce features that match across the source and target domain data. DANN has been used with a 1D CNN feature extractor to facilitate TL between different bearing data sets. Interestingly, combining MMD and DANN can be beneficial and has been demonstrated for TL across datasets.

・Transfer learning through weight transfer

Domain adaptation can run into problems when new faults appear. In particular, trying to match source and target features can be counterproductive when the target domain contains new faults. In addition, the classifier must be recalibrated to detect new faults. Therefore, the focus of transfer learning (TL) shifts from domain adaptation toward maximizing the generalization of the feature representation learned in the source domain. With a sufficiently general representation, network weights can be transferred to the target domain and used to distinguish between new and known faults. Given data from a labeled source domain or an unlabeled target domain, TL via weight transfer pre-trains a discriminative representation for future faults. In image processing, this allows reuse of the low-level, general features of networks trained on huge image data sets. These networks can produce highly discriminative features for new categories of images. By starting with pre-trained weights, it is possible to generate useful feature representations without having to train a reliable image classifier from scratch, even in data-poor fields such as medical imaging.

Researchers in manufacturing have creatively reused pre-trained image networks by transforming condition monitoring data into images. These networks can extract useful low-level information about lines and shapes in the image, even when the high-level task is different. For example, when vibration data is transformed into 2D images using the continuous wavelet transform (CWT), these pre-trained image networks can provide ready-made features for training fault classifiers when labeled manufacturing data is limited. They can also accelerate domain adaptation by providing an initial feature representation before techniques such as MMD are applied. Besides pre-trained image networks, TL through weight transfer has been shown to improve degradation prediction for a target aircraft engine by training a degradation model on a source engine, transferring the weights to the target engine, and fine-tuning them during the target's initial degradation phase. However, TL through weight transfer in manufacturing is often difficult due to the lack of labeled data needed to pre-train highly general feature extractors.

Self-supervised learning

Self-supervised learning (SSL) uses unlabeled data to train a feature extraction network that can then be applied to downstream tasks. Broadly speaking, SSL allows data to "self-supervise" through pretext-task and invariance-based methods to learn useful encodings of input examples. In manufacturing, where labeled data is scarce but unlabeled data is abundant, SSL has transformative potential. This approach enables efficient and effective feature extraction that can be applied to a wide variety of manufacturing tasks by leveraging the large amount of existing unlabeled data.

・Pretext-task SSL

Pretext-task self-supervised learning (SSL) trains models on a related problem using automatically generated labels. Examples of pretext tasks include predicting the rotation of an image, the relative position of patches in an image, or the next word in a natural language sequence (e.g., OpenAI's GPT-n models). Various adaptations of this approach have been explored in manufacturing and health monitoring research. Some studies have relabeled traditional unsupervised techniques as "self-supervised." For example, embeddings learned from normal data using kernel principal component analysis (PCA) were described as "self-supervised" to help detect faults in industrial metal etching processes. Similarly, one study trained a deep autoencoder as a "self-supervised" auxiliary task for bearing fault classification, and another employed a similar approach for anomaly detection in washing machines. Another study predicted the orientation of randomly rotated laser powder bed fusion process images in additive manufacturing and characterized this as a pretext task. However, since the downstream task was also orientation prediction, this is more akin to pre-training with data augmentation than to a distinct pretext objective. True pretext-task SSL extracts features from unlabeled data through a distinct pretext task that does not rely on fault information. For example, a model can learn useful features by predicting statistical properties (mean, variance, skewness, kurtosis, etc.) of unlabeled input signals. In yet another study, input signals were randomly distorted and a model was trained to identify the distortions. These approaches produced features useful for diagnosing bearing faults. Thus, without the need for manual labels, pretext-task SSL can bootstrap the model for future health monitoring tasks.
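As an illustration of how a pretext task generates labels without any fault information, the sketch below computes statistical targets for a batch of unlabeled signal windows. The function name, the window shape, and the use of excess kurtosis are assumptions for this example, not details from the cited studies.

```python
import numpy as np

def statistical_pretext_targets(windows):
    """Return [mean, variance, skewness, excess kurtosis] for each window.

    windows: array of shape (n_windows, window_length) of unlabeled signal.
    """
    mean = windows.mean(axis=1)
    centered = windows - mean[:, None]
    var = (centered**2).mean(axis=1)
    std = np.sqrt(var) + 1e-12          # guard against constant windows
    skew = (centered**3).mean(axis=1) / std**3
    kurt = (centered**4).mean(axis=1) / std**4 - 3.0
    return np.stack([mean, var, skew, kurt], axis=1)

# Hypothetical unlabeled data: 16 windows of 256 samples each.
rng = np.random.default_rng(0)
windows = rng.normal(size=(16, 256))
targets = statistical_pretext_targets(windows)   # shape (16, 4)
```

A regression network trained to predict `targets` from `windows` would then supply the pre-trained feature extractor for a later fault diagnosis task.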

・Invariance-based SSL

Instead of using a pretext task, invariance-based SSL applies random transformations to "seed" examples from the data set to create a family of examples belonging to the same "pseudo-class." The feature extraction network is then trained to produce similar features for all augmented examples in the pseudo-class. A contrastive loss function encourages each pseudo-class to be compact and well separated from the others. Through this process, the network learns to ignore randomized attributes and focus on semantically meaningful ways to cluster the input data (see Figure 1).

Figure 1: SSL techniques attempt to pull the features of augmented examples toward other members of the same pseudo-class while increasing their separation from other pseudo-classes.

Contrastive approaches to invariance-based SSL rely on having plentiful "negative" examples from other pseudo-classes to ensure clustering. For example, consider the InfoNCE loss function:
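A common way to write the InfoNCE loss is given below. The symbols x for the anchor example and x_j^- for the negative examples are assumed notation.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\,\mathbb{E}\left[
      \log \frac{\exp\big( s(x, x^{+}) \big)}
                {\exp\big( s(x, x^{+}) \big)
                 + \sum_{j=1}^{n-1} \exp\big( s(x, x_{j}^{-}) \big)}
    \right]
```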

where n is the size of the batch containing one positive example x+ and n - 1 negative examples (i.e., other pseudo-classes), and s(·, ·) is the similarity metric. Increasing the number of negative examples raises the lower bound on mutual information that the loss provides, which encourages compact feature clusters. However, due to batch size limitations, learning efficiently with enough negative examples is non-trivial. Momentum Contrast (MoCo) increases the number of negative examples by aggregating features across multiple batches. The encoder learns with a contrastive loss to separate the current batch from this larger group of negative example features. A momentum encoder, updated as a running average, embeds previous examples into the latent space so that the representations of negative examples from multiple previous batches remain stable.

A number of conceptually related developments were inspired by Momentum Contrast (MoCo). A Simple Framework for Contrastive Learning of Visual Representations (SimCLR) and Bootstrap Your Own Latent (BYOL) both proposed MoCo-style architectural changes, and BYOL works well even without negative examples. SimCLR made an important contribution: a "projection head" network that maps features into a separate projection space before the contrastive loss is applied. The work by X. Chen and K. He proposed a more direct approach known as Simple Siamese Representation Learning (SimSiam), which maximizes the similarity between feature projections from two augmentations while preventing gradients from one of the projections from updating the encoder. This effectively means that one projection is fixed while the other is moved toward this anchor. This was effective even without large batches, abundant negative examples, or momentum networks. To avoid the problems of contrastive loss altogether, Barlow Twins used a cross-correlation loss, which suppresses redundancy between feature dimensions while learning correlated features between pseudo-class examples (see Figure 2). Variance-Invariance-Covariance Regularization (VICReg) then introduced a somewhat more complex loss function as a generalization of Barlow Twins. These methods proved increasingly useful for computer vision problems.

Figure 2: Barlow Twins reduces representational redundancy by encouraging feature projections that are correlated within each pseudo-class and independent of one another.

Manufacturers can take advantage of invariance-based SSL from computer vision by first converting 1D sensing data into 2D images. Without labels, SimCLR can use image augmentations such as rotation, cropping, and affine transforms to find identifiable fault features of rotating machinery in vibration data. Methods using BYOL extracted bearing fault features after converting vibration data to images with the short-time Fourier transform (STFT) or continuous wavelet transform (CWT). However, applying image-domain techniques to vibration data may lack a robust and physically meaningful interpretation. Therefore, an important step in adapting invariance-based SSL for condition monitoring is to design appropriate random augmentations for the raw time series data (e.g., vibration and current).

・Design of augmentation of time series data

Invariance-based self-supervised learning (SSL) requires careful selection of random augmentations to avoid destroying important semantic information. For example, a high-level semantic label (e.g., a bearing inner ring defect) cannot be reduced to a simple feature test (e.g., normalized vibration amplitude > 0.6). Revealing such indirect correlations is one of the motivations for using deep learning (DL). Semantically important input attributes are difficult to extract and manipulate. Conversely, if an attribute is easy to manipulate, it is likely to be less semantically important. Effective random augmentations therefore need not be complex in order to align the representations of semantically related examples. Existing image-based SSL research supports this view, using simple transformations such as translation, cropping, flipping, rotation, contrast, blurring, and color distortion to achieve state-of-the-art results. Designing similar augmentations for 1D time series data can unlock the potential of invariance-based SSL for raw sensing signals.

Figure 3: By randomizing semantically meaningless attributes, the augmentation allows SSL to identify pseudo-classes through the remaining semantically meaningful properties.

Several studies have examined possible augmentation methods for time series data. Because time series data are temporally correlated, several studies have formed pseudo-classes for invariance-based self-supervised learning (SSL) from pairs of consecutive examples taken from vibration signals. Other augmentations include temporal and amplitude distortions.

For example, in studies using MoCo, augmentations such as Gaussian noise, amplitude scaling, stretching, masking, and time shifting were used to pre-train feature extractors to detect early bearing faults. In studies employing BYOL, truncation (continuous region masking), low-pass filtering, Gaussian noise, geometric scaling, and downsampling were used to learn representations for bearing fault diagnosis from unlabeled raw vibration data. In particular, truncation and downsampling were shown to be effective.

Studies using SimSiam applied truncation, low-pass filtering, Gaussian noise, and time inversion. Another study using a motor condition data set implemented Barlow Twins on multi-channel vibration and current signals using random time shifts, truncation, scaling, and vertical inversion. That study showed that random time shifts were very important for extracting features suitable for the motor fault diagnosis task.

These studies demonstrate effective data augmentation methods in invariance-based SSL for 1D signals.
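The augmentations named above (random time shift, truncation, scaling, and vertical inversion) can be sketched for a multi-channel window as follows. The circular shift, the zero-fill masking, and the parameter ranges (`max_scale`, `max_mask`) are assumptions chosen for illustration and are not the settings of any cited study.

```python
import numpy as np

def augment(window, rng, max_scale=0.2, max_mask=32):
    """Create one random view of a (channels, length) signal window."""
    channels, length = window.shape
    out = window.copy()
    # Random circular time shift (np.roll wraps samples around the end).
    out = np.roll(out, rng.integers(0, length), axis=1)
    # Truncation: zero out one contiguous region of random width.
    width = rng.integers(0, max_mask + 1)
    start = rng.integers(0, length - width + 1)
    out[:, start:start + width] = 0.0
    # Random amplitude scaling around 1.0.
    out *= 1.0 + rng.uniform(-max_scale, max_scale)
    # Vertical inversion with probability 0.5.
    if rng.random() < 0.5:
        out = -out
    return out

# Two views of the same seed window form one pseudo-class (hypothetical data).
rng = np.random.default_rng(0)
seed_window = rng.normal(size=(3, 256))
view_a = augment(seed_window, rng)
view_b = augment(seed_window, rng)
```

Each call draws fresh random parameters, so `view_a` and `view_b` differ in the randomized attributes while sharing the underlying signal content.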

Federated Learning

Federated Learning (FL) facilitates distributed training of predictive deep learning models on private user data via the FedAvg algorithm. To maintain user privacy, network training is performed on the user's device and only the weights and parameters of the updated model are sent to the cloud. In the FedAvg algorithm, the network weights from the clients are averaged to create a global model without sending any client data to the cloud. This allows clients to collaborate to train a more generalizable model while retaining private control over their data. Algorithm 1 provides an overview of FedAvg.
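The server-side averaging step can be sketched in a few lines. This is a minimal illustration rather than the paper's Algorithm 1: it assumes each client's model is a list of NumPy arrays and that clients are weighted by their local sample counts, as in the common formulation of FedAvg.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Average per-layer weights across clients, weighted by local sample counts.

    client_weights: list (one entry per client) of lists of arrays (one per layer).
    client_sizes: number of local training samples held by each client.
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Two hypothetical clients, each holding a model with two weight arrays.
client_a = [np.array([1.0, 2.0]), np.array([[0.0]])]
client_b = [np.array([3.0, 6.0]), np.array([[4.0]])]
global_weights = fed_avg([client_a, client_b], client_sizes=[100, 300])
```

In a full round, the server would send `global_weights` back to every client, each client would continue training on its local data, and the averaging would repeat.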

・FL for condition monitoring and fault diagnosis

The advantage of FL for manufacturing is that models can be trained on multiple datasets without exposing sensitive factory data to servers. Motivated by this privacy perspective, FL was proposed to build fault diagnosis models from isolated datasets. Ignoring client models with poor validation performance when aggregating the global model improves robustness. A peer-to-peer adaptation of FL performs local learning at each node to detect wind turbine and bearing faults. There is also an FL study for bearing fault diagnosis that proposes a vertical FL algorithm based on gradient tree boosting to deal with clients that hold different feature subsets. For remaining useful life (RUL) applications, FL has been implemented for cooperative learning of transformer models based on simulated turbofan aircraft engine degradation data.

・Multi-party and single-party incentives for FL

Beyond privacy issues, FL offers benefits to coalitions of multiple manufacturers as well as within a single distributed manufacturer. A study in additive manufacturing found that FL improves the segmentation of defect images over locally trained client models, and that the improved performance gives manufacturers an incentive to join existing coalitions and gives these coalitions an incentive to welcome new clients. Another study further supports FL's ability to improve model performance relative to locally trained models while maintaining privacy among aircraft manufacturers. Even if manufacturers refuse to federate with competitors to avoid the possibility of model poisoning, FL provides communication-efficient training for distributed data owned by a single manufacturing entity, with the significant benefit of reducing the network traffic required to take full advantage of distributed sensing. However, whether in a multi-party or single-party paradigm, FL implementations must handle client-to-client discrepancies while taking full advantage of the collaborative approach.

・FL for heterogeneous clients

In practical applications, clients may have different tasks and data distributions, so basic FedAvg may not be optimal for every member, although it remains desirable for its privacy benefits. Initializing FL clients with pre-trained global feature extractors can reduce the training time required for personalized downstream tasks. However, that study tested only image-domain tasks. Similarly, a personalized FL approach can optimize feature extractors and classifiers locally and penalize shifts between the local classifier weights and the globally optimized weights. This allows clients to share information without hard constraints such as fixing weights across clients. Surprisingly, one study demonstrated that FL can share classifier information among rotating machine clients even when the classes are unbalanced or non-i.i.d. because the clients observe different faults. It is also possible to globally align classes across models by injecting noise within each client and creating false pseudo-classes. Conversely, a single global model may not be successful if the input distributions of the clients are very different. Another study chose to cluster gradient updates from members and run FL separately within each subgroup. Experiments validated this algorithm on benchmark data and a custom bearing fault dataset. However, these studies of heterogeneous FL fail to address the problem of large amounts of unlabeled data in each client. Furthermore, when the number of observed classes is very limited, relying on supervised learning may hinder the discriminability of the learned representations.

Figure 4: Overview of Federated Learning with FedAvg.

Figure 5: SSL promotes compactness and pseudo-class separation, but supervised representation depends on decision boundaries.

Proposal for a method to maximize model generalization

Supervised learning on large, diverse data sets may produce generalizable features, but may struggle with limited class diversity. Supervised learning shapes the feature space through the decision boundaries of the classifier without explicitly encouraging compact clusters (see Figure 5). When training classes are limited, the model has few decision boundaries to partition the feature space. This results in loosely structured features and increases the likelihood that features of future failures will overlap with features of previous health conditions. Aggregating data from distributed machines can improve centralized models, but fast sensing streams may be limited by bandwidth constraints. Therefore, the proposed method employs SSL to improve the structure of the feature space and FL to increase the effective data set size without flooding the communication network (see Figure 6). The combination of these techniques facilitates data-centric learning and information sharing, and maximizes the generalization of the condition monitoring model to new operating conditions and new faults.

Figure 6: Proposed method for comparing the identifiability of emerging faults when transferring weights from a supervised or self-supervised 1D CNN feature extraction backbone. Federated Learning can be used to efficiently share information among multiple client machines.

Barlow Twins

Rather than organizing features with decision boundaries, SSL using Barlow Twins encourages tighter clusters by maximizing cross-correlation between feature projections from the same pseudo-class. Augmentations used to construct pseudo-classes from condition monitoring time series signals should randomize unimportant signal attributes while preserving the semantic class. Algorithm 2, which extends previously proposed augmentations, outlines the random transformations used in Barlow Twins to create pseudo-classes.

Barlow Twins first computes two augmented versions X′ and X′′ of the input batch (according to Algorithm 2) and their corresponding projections Z′ = Hψ(Gθ(X′)) and Z′′ = Hψ(Gθ(X′′)). Both sets of projections are then normalized across the batch:

Next, the cross-correlation matrix R is computed and normalized by the batch size:

Finally, the loss function is calculated using R:
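In the standard Barlow Twins formulation, the normalization, cross-correlation, and loss steps described above can be written as follows. The notation is assumed rather than taken from the paper: n is the batch size, b indexes examples in the batch, i and j index projection dimensions, and the normalization is applied to both Z′ and Z′′.

```latex
\begin{aligned}
\hat{Z}_{b,i} &= \frac{Z_{b,i} - \mu_i}{\sigma_i},
  \qquad \mu_i = \frac{1}{n} \sum_{b=1}^{n} Z_{b,i},
  \quad \sigma_i^2 = \frac{1}{n} \sum_{b=1}^{n} \left( Z_{b,i} - \mu_i \right)^2 \\
R &= \frac{1}{n} \, \hat{Z}'^{\top} \hat{Z}'' \\
\mathcal{L}_{BT} &= \sum_{i} \left( 1 - R_{ii} \right)^2
  + \lambda \sum_{i} \sum_{j \neq i} R_{ij}^2
\end{aligned}
```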

Here, λ controls the strength of the independence constraint. The first term encourages the diagonal elements to be 1. This means that individual features will be highly correlated (aligned) across the batch, and that instances within the expected variation, as defined by the applied random augmentations, will map to similar feature projections (i.e., cluster together). The second term pushes the off-diagonal elements toward zero so that each feature is independent of the remaining features. This improves representational power by ensuring that multiple features do not encode the same information. This loss function allows the Barlow Twins feature extractor and projection head to be trained with standard stochastic gradient descent and backpropagation. Figure 7 shows the architecture of the 1D CNN backbone Gθ and the Barlow Twins projection head Hψ for extracting features from condition monitoring data.
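The two terms of the loss can be checked numerically with a small NumPy sketch. It follows the standard Barlow Twins formulation; the value `lam=0.005`, the `eps` guard, and the random stand-in projections are assumptions for illustration, and a real training loop would use an automatic-differentiation framework instead.

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=0.005, eps=1e-12):
    """Barlow Twins loss for two projection batches of shape (n, d)."""
    n = z1.shape[0]
    # Normalize each feature dimension across the batch.
    z1n = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + eps)
    z2n = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + eps)
    # Cross-correlation matrix, normalized by the batch size.
    r = z1n.T @ z2n / n
    diag = np.diag(r)
    on_diag = np.sum((1.0 - diag) ** 2)           # pull diagonal toward 1
    off_diag = np.sum(r**2) - np.sum(diag**2)     # push off-diagonal toward 0
    return on_diag + lam * off_diag

# Hypothetical projections: identical views versus unrelated views.
rng = np.random.default_rng(0)
z = rng.normal(size=(512, 16))
loss_same = barlow_twins_loss(z, z)
loss_indep = barlow_twins_loss(z, rng.normal(size=(512, 16)))
```

Identical views give a diagonal of ones and therefore a loss near zero, while unrelated views leave the diagonal near zero and the first term dominates.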

Figure 7: Architecture of the 1D CNN backbone feature extractor Gθ, the supervised K-class classifier Fφ, and the Barlow Twins projection head Hψ.

Federated Learning for Information Sharing

Most factory floors have multiple similar machines that each experience different health conditions during operation. Data from a single machine may contain few different conditions, but network constraints may prevent every machine from streaming all of its sensing data to the cloud to build a unified data set. The machines themselves may not be co-located, or they may belong to different manufacturers without data sharing agreements. To work around these obstacles, models can be trained with FedAvg (see Algorithm 1). Each client machine retains full ownership of its data but indirectly gains knowledge of new health states through the averaging of models on the FL server. This indirect sharing of information among clients via the global model can be seen as a form of TL. When each client receives an updated global model, it benefits from the observations and knowledge of other clients. Thus, if one client lacks training experience with a certain health condition but another client has it, the FL algorithm passes this experience on to the client that lacks it (see Figure 8). FL therefore provides a TL advantage across clients and may improve each client's generalization to future fault states. In addition, client machines send updated models to the FL server only once per round, significantly reducing the volume and rate of data sent to the cloud. By combining FL and SSL, DL can reduce network communication, maintain manufacturer privacy, and operate in realistic condition monitoring scenarios using unlabeled, distributed training data.

Figure 8: Each client experiences different conditions, and by averaging the weights of the model, this knowledge is spread to other clients, maximizing the diversity of the data set and improving performance against new failures.
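
The server-side averaging step can be illustrated with a small sketch. It assumes that each client model is represented as a dictionary mapping a parameter name to a flat list of floats, and that the optional weighting by local sample count follows the standard FedAvg formulation. The parameter names and values are hypothetical, and Algorithm 1 in the paper may differ in detail.

```python
def fed_avg(client_weights, client_sizes=None):
    """Average client parameters; weight by sample count when sizes are given."""
    if client_sizes is None:
        client_sizes = [1] * len(client_weights)  # plain, unweighted mean
    total = sum(client_sizes)
    averaged = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * s for w, s in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Two clients with hypothetical parameters after one round of local training.
client_1 = {"conv.weight": [0.2, 0.4], "fc.bias": [1.0]}
client_2 = {"conv.weight": [0.6, 0.0], "fc.bias": [3.0]}

global_model = fed_avg([client_1, client_2])
weighted_model = fed_avg([client_1, client_2], client_sizes=[3, 1])
```

Only the model parameters cross the network in this scheme. The raw sensor data never leaves the client, which is where the communication and privacy benefits come from.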

Experiment

Two case studies were conducted to test the proposed claims. The first study compares the generalizability of representations after pre-training with supervised or self-supervised learning (SSL) on different numbers of classes. It assesses how well the pre-trained model transfers to classes it has not seen and examines whether supervised learning or SSL learns more generalizable features.

The second study investigates the impact of Federated Learning (FL), a form of distributed training, on model performance in the context of emerging faults. It explores how FL integrates models trained individually on each client machine and the extent to which it improves model performance for unknown faults.

These case studies specifically compare the effectiveness of supervised and self-supervised learning, as well as centralized and distributed training approaches, and demonstrate how these methods can be applied to real-world problems.

Motor Condition Data Set

Both case studies use the motor fault condition data set collected from the SpectraQuest Machinery Fault Simulator (MFS) in Figure 9. Vibration data are acquired by two orthogonally mounted accelerometers at a sampling rate of 12 kHz, and a current clamp measures the current signal. Sixty seconds of steady-state data are collected for eight conditions: normal (N), faulty bearing (FB), bent rotor (BoR), broken rotor (BrR), misaligned rotor (MR), unbalanced rotor (UR), phase loss (PL), and unbalanced voltage (UV). Each condition is run at 2000 RPM and 3000 RPM and at loads of 0.06 N-m and 0.7 N-m, for a total of 32 different combinations of health and process parameters. For simplicity, each unique combination of process parameters is identified by xy, where x is 2 or 3 specifying the RPM parameter and y is "H" or "L" specifying the high or low load parameter (for example, 3L means 3000 RPM with a load of 0.06 N-m). The signal is then normalized to [-1, 1] and divided into 256-point windows for the DL experiments.
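
The preprocessing and naming scheme described above could look roughly like the following sketch. The article does not say how the normalization is computed or whether windows overlap, so min-max scaling and non-overlapping windows are assumptions here, and all function names are hypothetical.

```python
def normalize(signal):
    """Linearly rescale a signal to the range [-1, 1] (min-max scaling, assumed)."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0 for _ in signal]  # constant signal: no range to scale
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in signal]


def split_windows(signal, size=256):
    """Cut a signal into non-overlapping windows, dropping any remainder (assumed)."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def parse_condition_id(code):
    """Decode an 'xy' identifier such as '3L' into (RPM, load in N-m)."""
    rpm = {"2": 2000, "3": 3000}[code[0]]
    load = {"L": 0.06, "H": 0.7}[code[1]]
    return rpm, load


# Synthetic ramp standing in for one sensor channel.
raw = [float(i) for i in range(1000)]
windows = split_windows(normalize(raw))
rpm, load = parse_condition_id("3L")
```

Under these assumptions, a 60-second recording at 12 kHz (720,000 samples) would yield 2,812 full windows per channel.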

Figure 9: SpectraQuest Machinery Fault Simulator used to collect the motor health condition data set.

Transfer Learning Experiments

The first set of experiments tests the claim that SSL is a more effective TL pre-training method. The experimental design reflects the following assumptions:

1. Labeled training data is available from the source set of process parameters.

2. Unlabeled training data is available from the target set of process parameters.

3. Pre-trained models may encounter new types of failures after being deployed.

From this scenario, three methods of comparison can be derived:

- Supervised (source): supervised learning on labeled source domain data

- Barlow Twins (source): self-supervised learning on source domain data (ignores labels)

- Barlow Twins (target): self-supervised learning on unlabeled target domain data

All three methods use the same 1D CNN feature extraction backbone Gθ shown in Figure 7. The supervised network adds a K-class classifier Fϕ to the backbone, and Barlow Twins adds a projection head Hψ. For supervised learning, the networks Fϕ and Gθ are optimized with the cross-entropy loss from (1) using stochastic gradient descent and backpropagation. The Barlow Twins model generates the projections Z′ = Hψ(Gθ(X′)) and Z″ = Hψ(Gθ(X″)) from the augmented input batches X′ and X″ (see Algorithm 2), and the training loss is calculated from (5)-(7) with λ = 0.01. Both the supervised and self-supervised models are trained for 1000 epochs with the Adam optimizer and a learning rate of 0.0005.

To assess the quality and generalizability of each method's representation, a privileged linear evaluation classifier is trained on the frozen features of each pre-trained network, using labeled target domain data from all eight health states (the evaluation dataset). This follows the convention for evaluating SSL models. Because it relies on privileged label information, this classifier could not actually be trained and deployed in practice, but the procedure follows accepted standards for assessing the separability of the underlying feature representations. The evaluation classifier is trained for 75 epochs on the frozen features, and its accuracy on the test set is used to determine the quality of the representation.

To simulate the occurrence of new unknown failures, the training dataset for the source and target domains is limited to two, four, or six randomly selected health conditions. Since the evaluation dataset contains all eight conditions, this corresponds to encountering six, four, or two unknown classes after pre-training, respectively.
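
The way emerging faults are simulated can be sketched as a simple random split of the eight health conditions into known (training) and unknown (emergent) classes. The helper name and the seed handling are hypothetical; the condition sets actually used in the paper are listed in its tables.

```python
import random

# The eight health conditions named in the article.
CONDITIONS = ["N", "FB", "BoR", "BrR", "MR", "UR", "PL", "UV"]


def sample_training_conditions(num_known, seed):
    """Pick num_known conditions for pre-training; the rest emerge only at evaluation."""
    rng = random.Random(seed)  # local generator keeps the split reproducible
    known = sorted(rng.sample(CONDITIONS, num_known))
    unknown = [c for c in CONDITIONS if c not in known]
    return known, unknown


# Two, four, or six known conditions leave six, four, or two unknown ones.
splits = {k: sample_training_conditions(k, seed=0) for k in (2, 4, 6)}
```

The evaluation classifier always sees all eight conditions, so the size of the unknown set directly controls how far the pre-trained features must generalize.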

A total of 450 experiments, 150 for each of the three comparison methods, were conducted to capture variations caused by source/target domain selection, training health conditions, and model initialization.

Table 1: Health condition sets for the transfer learning experiments

Federated Learning Experiments

The FL experiment determines whether sharing model information among clients with disjoint training conditions improves the likelihood of identifying faults that appear in the future.

To evaluate this, each of the two clients is assigned two randomly selected motor health conditions. Each client has local training data for its two conditions from all process parameter combinations (i.e., 2L, 2H, 3L, 3H). The FL server provides both clients with an initial global model with random weights. In each round of FL, each client trains its local model on its own two health conditions and returns the updated model to the server. The server averages the weights and redistributes the new model to the clients in preparation for the next FL round (see Algorithm 1).

The FL experiment is run for 1000 rounds, with each client training on 20 local batches in each round. For supervised learning, each client updates its weights using the cross-entropy loss in (1); for Barlow Twins learning, each client updates its weights using the Barlow Twins loss in (5)-(7). Both supervised learning and Barlow Twins are trained using the same network architecture as in the TL experiments (shown in Figure 7), with the Adam optimizer and a learning rate of 0.0002.

Each of the four model configurations (supervised learning and Barlow Twins, with and without FL, respectively) is trained with five random seeds (0-4) to measure variability due to random initialization. Five unique sets of training conditions will be tested to eliminate the influence of individual health conditions (see Table 2); a total of 100 FL experiments will be conducted with all combinations of the four methods, the five seeds, and the five condition sets.
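
The count of 100 runs follows from enumerating the full experiment grid, as the short sketch below shows. The method labels and the integer indices standing in for the condition sets of Table 2 are placeholders chosen for illustration.

```python
from itertools import product

# Four model configurations: two learning methods, each with and without FL.
methods = ["supervised", "supervised+FL", "barlow_twins", "barlow_twins+FL"]
seeds = range(5)           # random seeds 0-4
condition_sets = range(5)  # placeholder indices for the five sets in Table 2

# Every combination of method, seed, and condition set is one training run.
runs = list(product(methods, seeds, condition_sets))
```

Each tuple in runs identifies one training run, so 4 methods × 5 seeds × 5 condition sets gives the 100 experiments.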

Table 2: Training health condition sets for the federated learning experiments

Results and Discussion

The results show that Barlow Twins produces more generalizable and transferable representations than supervised learning, and that FL for information sharing may further improve performance.

Results of Transfer Learning

Table 3 and Figure 10 compare supervised learning on labeled source process parameters, Barlow Twins on unlabeled source process parameters, and Barlow Twins on unlabeled target process parameters. Accuracy is computed on test partitions of the evaluation dataset containing all eight conditions under the target process parameters. Even with only two conditions available for training, Barlow Twins produces a separable representation with 93.5% accuracy when presented with all eight health conditions. Supervised learning in the same scenario is limited to 83.9% accuracy. Figure 11 shows a representative confusion matrix that highlights the improvement of SSL over supervised learning. For example, supervised learning struggles to distinguish between misaligned rotor (MR) and unbalanced rotor (UR) states, while Barlow Twins improves accuracy within these categories by 15 and 6 points, respectively. In addition, Barlow Twins can use unlabeled target domain data to further improve the representation (Barlow Twins (target) in Table 3), whereas supervised learning cannot use this data because it is unlabeled. Interestingly, Barlow Twins (target) shows no clear improvement over Barlow Twins (source). This indicates that SSL is able to find a general representation from a single source set of process parameters.

The convergence in performance between supervised learning and Barlow Twins as the number of training conditions increases is explained by the optimization objective of each approach. Supervised learning attempts to partition the data along decision boundaries for the classifier. This ensures that training classes are identifiable, but does not guarantee compactness of feature clusters. Thus, features of new faults may overlap with features of faults seen during training. In contrast, Barlow Twins encourages similar input instances to have features that are correlated and closely matched. This emphasis on feature similarity produces dense clusters that reduce the likelihood that features of new failures will overlap with existing clusters. As the number of training conditions increases, the additional decision boundaries created by supervised learning naturally improve the compactness of feature clusters, bringing its evaluation accuracy closer to that of Barlow Twins. However, because the diversity of labeled classes is limited compared to the number of faults that may emerge in manufacturing applications, these results demonstrate the general superiority of SSL-based representations over those transferred from supervised learning in uncertain operational environments.

Table 3: Evaluation Accuracy Results for Transfer Learning (%)

Figure 10: Target domain accuracy of the weight transfer method for all eight motor conditions relative to the number of failures in the training domain.

Figure 11: Representative confusion matrix showing the advantage of using Barlow Twins over supervised learning when transferring the model to a new process parameter (3L→2H) with 6 emergent conditions

Federated Learning Outcomes

Table 4 and Figure 12 show the results of Federated Learning (FL). In supervised learning, the ability to identify emerging faults is significantly improved when FL is incorporated. Without FL, the overall evaluation accuracy across clients is only 67.6%. When FL is introduced, information about health conditions is shared indirectly via the FedAvg server, and the overall accuracy improves. Supervised learning clients trained without FL also show a 6-point accuracy difference between clients, which FL reduces. Barlow Twins outperforms the supervised methods even when FL is excluded: clients trained separately reach an overall evaluation accuracy of 82.4%. When FL and Barlow Twins are combined, performance increases to 83.7%, the highest overall accuracy of all methods. As in the supervised case, FL reduces the accuracy difference between clients, from 3.3 points to 0.1 points. The representative confusion matrix in Figure 13 shows an improvement for Client 1 when FL is included. Phase loss (PL) accuracy increases from 90.5% to 97.8% and misaligned rotor (MR) accuracy increases from 63.9% to 71.4%.

These results indicate that indirect information sharing via the FedAvg server may improve the ability to identify emerging faults when individual clients see a limited number of different health conditions. By integrating models trained on different subsets of health conditions, FL can increase the effective diversity of the training data and improve the generalization of learned features. Future work should test SSL and FL on more data sets and health condition partitions to comprehensively assess the value of FL for improving feature generalization.

Table 4: Accuracy results for Federated Learning (%)

Figure 12: Evaluation accuracy of each client across all health conditions

Figure 13: Representative confusion matrix showing the benefits of including FL for Barlow Twins Client 1. Client 1 was trained on {BoR, N} and client 2 (not shown) was trained on {BrR, FB}.

Conclusion

This study compares the generalization performance of feature representations trained by self-supervised learning (SSL) and supervised learning methods. In weight transfer experiments, feature extractors trained with Barlow Twins performed better than supervised classifiers when transferred to an operational environment with different process parameters, including new faults. Even when using only two health conditions for training, the features learned by Barlow Twins from the source domain resulted in a 9.6 point improvement in accuracy for the evaluation classifier over representations learned by supervised training on labeled source domain data. The results are shown in Figure 3. In addition, multiple SSL client models can share information through FL to improve performance without the need to stream large amounts of data to the cloud. Thus, manufacturing applications with large unlabeled data sets where labeled data is not diverse can use SSL and FL to learn generalizable representations for emerging faults. The improved ability to detect new faults across conditions makes the model more relevant to the factory floor and improves the reliability of practical condition monitoring deployments.


友安昌幸 (Masayuki Tomoyasu): JDLA G certificate 2020#2, E certificate2021#1 Japan Society of Data Scientists, DS Certificate Japan Society for Innovation Fusion, DX Certification Expert Amiko Consulting LLC, CEO