Innovations In Rare Class Prediction Models For Semiconductor Manufacturing
3 main points
✔️ Developing a new forecasting model to address the class imbalance problem in semiconductor manufacturing data
✔️ Optimizing feature selection and data completion methods to enable accurate prediction of rare classes
✔️ Analyzing the impact of data resampling strategies utilizing SMOTE on model accuracy
Rare Class Prediction Model for Smart Industry in Semiconductor Manufacturing
written by Abdelrahman Farrag, Mohammed-Khalil Ghali, Yu Jin
[Submitted on 6 Jun 2024]
Comments: Accepted by arXiv
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Industrial evolution has led to the integration of physical and digital systems, enabling the collection of large amounts of data on manufacturing processes. This integration provides reliable solutions for improving process quality and equipment health management. However, the data collected from real manufacturing processes is fraught with challenges such as severe class imbalances, high missing value rates, and noisy features that hinder effective machine learning implementations.
In this study, we have developed a rare class prediction approach for in-situ data collected from smart semiconductor manufacturing processes. The main goal of this approach is to address the issues of noise and class imbalance and to enhance class separation.
The developed approach showed promising results compared to the existing literature, allowing the prediction of new observations that provide insight into future maintenance planning and production quality. The model was evaluated using a variety of performance indicators, showing an AUC of 0.95 on the ROC curve, a precision of 0.66, and a recall of 0.96.
Introduction
Semiconductor wafer fabrication involves hundreds of advanced manufacturing processes, including oxidation, photolithography, cleaning, etching, and planarization. Wafer yield is calculated as the ratio of qualified chips to the total number of semiconductor chips on the wafer.
Maintaining high yield through reliable and accurate quality control is critical to success in the semiconductor industry. An important step in yield improvement is to identify the operations that significantly affect wafer yield, the so-called "critical process steps."
The selection of critical process steps presents significant challenges due to the inherent complexity of process data. These data are primarily acquired from a large number of in-situ sensors and therefore typically have high-dimensional and noisy features. The data also suffer from high missing value rates due to the limitations of current measurement techniques and low measurement frequencies.
During production, each wafer goes through various process steps and is inspected by measuring equipment. Because these inspections are time consuming and the capacity of the measurement tools is limited, only a small percentage of wafers are actually measured. This random sampling practice further complicates data analysis. For example, if there are five process steps and the measurement rate is 20%, there is a 0.032% chance of obtaining complete measurement data for all steps.
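To make this combinatorial effect concrete, here is a minimal sketch (assuming measurement sampling is independent across steps, as in the example above) of how the probability of a fully measured wafer collapses as the number of steps grows:

```python
# Probability that a single wafer is measured at every process step,
# assuming independent random sampling at each step.
def full_measurement_probability(rate: float, n_steps: int) -> float:
    return rate ** n_steps

print(full_measurement_probability(0.20, 5))    # 0.00032, i.e. 0.032%
print(full_measurement_probability(0.20, 500))  # effectively zero for a real production line
```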
This problem is magnified in actual production lines, where there are more than 500 process steps, making it difficult to establish correlations between process steps. In addition, most mature wafer fabrication lines produce a large number of wafers with high conformance quality, reducing the incidence of low-yield wafers.
However, to effectively study and improve wafer yield, it is important to analyze both high- and low-yield wafers. Because low-yield wafers are produced in small volumes, it is difficult to assess the impact of process variability on overall production quality.
Related Research
Related research is divided into three main sections focusing on key aspects of data preprocessing and feature selection in semiconductor manufacturing.
First, we address the general issue of missing data in a dataset. Next, we address the issue of class imbalance in predictive modeling and how to balance it effectively. Finally, we discuss feature selection methods that improve the predictive accuracy and efficiency of classification models.
Data Imputation Methods
The problem of missing data is an important issue common to many studies, affecting the reliability of statistical analyses and causing information loss and bias in parameter estimation. Missing data can be classified into three forms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
MCAR describes missing data that are independent of both observed and unobserved variables, meaning there is no systematic loss. MAR, on the other hand, occurs when missing instances are related to other observed variables, indicating a systematic relationship driven by other variables in the dataset. The most complex case, MNAR, refers to situations where the missingness depends on the missing values themselves.
In real-world scenarios, such as semiconductor manufacturing, it can be difficult to identify the exact mechanism because each wafer is chosen at random. Thus, missing data are more likely to be MAR in practice because they are related to observed values.
Traditional imputation methods include deletion and mean imputation, which are effective primarily in the MCAR case. In contrast, modern methods such as maximum likelihood, multiple imputation, hot deck imputation, regression imputation, expectation maximization (EM), and Markov chain Monte Carlo (MCMC) methods are designed to provide unbiased estimates for data classified as MCAR or MAR.
Although the percentage of missing data has a significant impact on the quality of statistical inferences, there is no universally accepted threshold for an acceptable missing rate; missing rates below 5% are generally considered negligible, while rates above 10% are likely to introduce bias into statistical analyses.
A new imputation approach, inpainting KNN imputation, was developed and compared to a mean imputation strategy after applying various machine learning approaches. The developed approach outperformed mean imputation, a common imputation technique, with performance metrics improving significantly: a 10% improvement in recall and a 5% improvement in AUC.
The method was also shown to impute missing values effectively by converting all continuous features to nominal data, eliminating the need for a separate approach for each feature type.
Class Imbalance
Sampling of defect data in machine learning and data analysis is an important issue, especially for datasets related to quality control and fault detection. In these scenarios, the data are often imbalanced, with a large gap between the "defect" or "positive" class (e.g., instances of faults or defects) and the "non-defect" or "negative" class.
This imbalance poses a significant challenge in predictive modeling: because defect instances are scarce, the model becomes biased and cannot accurately identify defects. The model may lean toward the majority class and show high accuracy while failing to identify instances of the minority class, thereby increasing the false negative rate.
This is particularly problematic in defect detection, where missing real defects (false negatives) can have serious consequences. The imbalance also leads to a precision-recall tradeoff, where improving one tends to compromise the other.
To address these issues, resampling the data (oversampling the minority class or undersampling the majority class), using alternative performance measures (F1 score, precision-recall curves, ROC-AUC), and adopting algorithms designed for imbalanced data can all be effective.
Undersampling techniques address class imbalance by reducing the majority class while retaining its most representative instances. Through integration with data-driven models, this approach has evolved to mitigate imbalance problems more effectively by selectively removing majority instances that lie close to the minority class.
Specific methods such as cluster-based undersampling, Tomek links, and Condensed Nearest Neighbours (CNN) refine decision boundaries and improve classifier accuracy. Each undersampling technique has its own advantages and challenges; for example, Edited Nearest Neighbours (ENN) uses a k-nearest neighbor algorithm to remove noisy majority class instances, but it is computationally expensive and can lead to information loss.
Oversampling techniques, on the other hand, address class imbalance by augmenting the minority class. Random oversampling replicates minority class instances but can lead to overfitting.
Methods such as SMOTE (Synthetic Minority Over-sampling Technique) increase diversity by creating synthetic instances, but may introduce noise. ADASYN (Adaptive Synthetic Sampling) focuses on minority instances that are difficult to learn, but also risks introducing noise.
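As a rough illustration of these resampling techniques (not the paper's configuration), the imbalanced-learn library offers off-the-shelf implementations; the dataset below is synthetic and only stands in for imbalanced sensor data:

```python
# Sketch: applying the undersampling and oversampling techniques discussed
# above to a synthetic imbalanced dataset (~1:14 class ratio).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours
from imblearn.over_sampling import SMOTE, ADASYN

X, y = make_classification(n_samples=1500, n_features=20,
                           weights=[0.93, 0.07], random_state=42)
print("original:", Counter(y))

for name, sampler in [("Tomek links", TomekLinks()),
                      ("ENN", EditedNearestNeighbours()),
                      ("SMOTE", SMOTE(random_state=42)),
                      ("ADASYN", ADASYN(random_state=42))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```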
Feature Selection Method
Feature selection algorithms including Boruta, multivariate adaptive regression splines (MARS), and principal component analysis (PCA) were applied to select the most important features. Results showed that Boruta and MARS were more accurate than PCA. They also yielded higher accuracy than Gradient Boosting Trees (GBT) when the imbalanced data were classified with Random Forest (RF) and Logistic Regression (LR).
Feature selection approaches such as Chi-square, mutual information, and PCA were also used. LR, k-nearest neighbors (KNN), decision tree (DT), and naïve Bayes (NB) were applied as classification models, with DT performing best, yielding an F-measure of 64% and an accuracy of 67%.
To address the issue of high-dimensional data, PCA was applied for dimensionality reduction and SMOTE was used to balance the classes. The models were evaluated with ROC curves, with RF showing an AUC of 0.77 and outperforming KNN and LR.
Additionally, an early detection prediction model was developed to quickly detect equipment failures and maintain productivity and efficiency. After data preprocessing and feature selection, five prediction models were run (NB, KNN, DT, SVM, and ANN), with NB showing the best results compared to the other models. To improve the accuracy of the classification prediction model for the SECOM dataset, an early detection prediction model using XGBoost was applied, showing significant improvements compared to RF and DT.
An approach was proposed that applies deep learning and meta-heuristics, optimizing the hidden layer nodes with the CSO algorithm and achieving 70% accuracy, 65% recall, and 73% precision. An ensemble of deep learning models was also applied, with model weights determined using PSO. This approach showed better results compared to KNN, RF, AdaBoost, and GBT.
While most classification models are developed based on accuracy, such predictive models suffer from the accuracy paradox: accuracy alone is not sufficient when the data are imbalanced. Predicting rare classes is difficult because they are small compared to the majority class, whereas the majority class is easy to classify with high accuracy.
As a result, if the performance of a predictive model is measured solely by accuracy, the minority class may not be predicted at all; even a model with excellent accuracy is likely to predict only the majority class and ignore the minority class. In these cases, balanced accuracy is the key metric.
Several prior studies employed sampling strategies to increase the number of minority class samples. However, because feature selection depends on the data distribution, the feature selection algorithm can be affected by whether it is applied before or after oversampling the minority class or undersampling the majority class.
Methodology
This section describes an approach to addressing the challenges of in-situ sensor data in semiconductor manufacturing. It includes a case study and details the data preprocessing techniques employed. These preprocessing steps include handling missing values, data partitioning, and data scaling. In addition, the feature selection and data resampling techniques used to correct class imbalance are described.
Proposed Approach
As shown in Figure 1, the proposed methodology is organized into two main stages: data preprocessing, and model development and prediction. The process begins with an initial exploratory data analysis (EDA I), which provides preliminary insights into the data. Next, feature selection is performed and missing values are imputed to ensure data integrity for the next step.
The processed data proceeds to EDA II, where it is further refined through a trial-and-error process. This leads to a second phase of insight-based feature selection (Feature Selection II), where the most relevant features are selected for use in the final model.
Figure 1: Schematic of the proposed approach.
Case Study
The study used the SECOM dataset, an open-source industrial dataset representative of complex semiconductor manufacturing processes (Figures 2 and 3). The dataset contains 591 sensor measurements for 1,567 samples, of which 104 samples were classified as failures.
Handling semiconductor data presents multiple challenges. Due to the high cost of semiconductor manufacturing, the process is managed to minimize defects, resulting in a dataset with a pronounced class imbalance of about 1:14. In addition, the dataset contains a large amount of missing data due to sensor failures and missed operations.
Figure 2: Exploratory data analysis of SECOM data.
Figure 3: Feature analysis of SECOM data.
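For readers who want to reproduce the setup, a minimal loading sketch is shown below; the file names and URL assume the standard UCI Machine Learning Repository layout for SECOM and may need adjusting:

```python
# Minimal sketch: load the SECOM dataset (UCI repository layout assumed).
import pandas as pd

base = "https://archive.ics.uci.edu/ml/machine-learning-databases/secom/"
X = pd.read_csv(base + "secom.data", sep=r"\s+", header=None)            # 1567 samples x 591 sensor readings
labels = pd.read_csv(base + "secom_labels.data", sep=r"\s+", header=None)
y = labels[0].map({-1: 0, 1: 1})   # UCI encodes pass as -1 and fail as 1

print(X.shape, y.value_counts().to_dict())   # expect roughly 1463 pass / 104 fail
```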
Data Preprocessing
Data preprocessing includes handling missing values, data splitting, and data scaling. As a missing value imputation strategy, k-Nearest Neighbors (k-NN) imputation was shown to be the most effective. Certain features were imputed using the median, while others were imputed using the mean to fit their normal distribution curves.
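A hedged sketch of this mixed strategy using scikit-learn is shown below; the column groupings are illustrative assumptions, not the paper's actual feature split:

```python
# Sketch of a mixed imputation strategy: k-NN imputation for most features,
# median or mean imputation for selected ones. Column indices are assumed.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer

knn_cols = list(range(0, 60))       # features imputed with k-NN (assumed grouping)
median_cols = list(range(60, 75))   # skewed features -> median (assumed grouping)
mean_cols = list(range(75, 90))     # near-normal features -> mean (assumed grouping)

imputer = ColumnTransformer([
    ("knn", KNNImputer(n_neighbors=5), knn_cols),
    ("median", SimpleImputer(strategy="median"), median_cols),
    ("mean", SimpleImputer(strategy="mean"), mean_cols),
])

X = np.random.rand(100, 90)
X[np.random.rand(*X.shape) < 0.05] = np.nan   # inject ~5% missing values
X_imputed = imputer.fit_transform(X)
print(X_imputed.shape, np.isnan(X_imputed).sum())  # no NaNs remain
```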
Data Partitioning
Data are partitioned using stratified cross-validation, which is particularly useful for unbalanced data sets. A portion of the data (the training set) is used to train the algorithm, and the remainder (the test set) is used to evaluate the algorithm's performance.
A five-fold cross-validation technique is employed, where data are randomly divided into five subgroups with equal numbers of samples. The process described in the following sections is performed five times, with one fold used as test data and the remaining four folds used as training data. The resulting model is tested using the test data and evaluated using performance metrics.
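The stratified five-fold procedure might look like the following sketch; the classifier and metric here are placeholders rather than the paper's final models:

```python
# Stratified 5-fold cross-validation, preserving the rare-class ratio in
# every fold (illustrative sketch with a placeholder classifier).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

X, y = make_classification(n_samples=1500, n_features=20,
                           weights=[0.93, 0.07], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    score = balanced_accuracy_score(y[test_idx], model.predict(X[test_idx]))
    print(f"fold {fold}: balanced accuracy = {score:.3f}")
```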
Data Scaling
Due to the irregular ranges of the features, scaling is required. Feature scaling improves the classification performance of the learning algorithm. Data are normalized to a linear scale of 0 to 1 using the following equation:

$$X_{\text{scaled}} = \frac{X - \text{Min}(X)}{\text{Max}(X) - \text{Min}(X)}$$

where Min(X) is the minimum value and Max(X) is the maximum value of the feature.
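A minimal sketch of this min-max normalization, checked against scikit-learn's MinMaxScaler:

```python
# Min-max scaling to [0, 1], compared with scikit-learn's MinMaxScaler.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[2.0], [4.0], [6.0], [10.0]])
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled = MinMaxScaler().fit_transform(X)
print(np.allclose(manual, scaled))  # True
```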
Feature Selection
Since most of the hundreds of features are unnecessary, feature selection is critical to building an effective predictive model for rare class prediction. The developed models are oriented toward rare-class features, giving priority to features that contribute significantly to the rare class. Feature selection is an important step in this type of problem, and the selection algorithm can be affected by the high dimensionality of the features.
Therefore, a voting strategy is employed in which features chosen by three or more feature selection methods are selected, considering only the minority class. This process is repeated until the optimal number of features is reached. As a result of the voting, 21 features were ignored by all voters, while 183 features were chosen by at least one feature selection method.
However, only two features were selected by all 12 feature selection methods. In the end, 81 features were selected.
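The voting idea can be sketched as follows; the selectors, dataset, and vote threshold here are illustrative stand-ins for the paper's 12 feature selection methods:

```python
# Sketch of voting-based feature selection: several selectors each nominate
# features, and a feature is kept once it collects >= 3 votes.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=50, n_informative=8,
                           weights=[0.9, 0.1], random_state=1)
k = 10
votes = Counter()

for selector in (SelectKBest(f_classif, k=k),
                 SelectKBest(mutual_info_classif, k=k),
                 RFE(LogisticRegression(max_iter=2000), n_features_to_select=k)):
    selector.fit(X, y)
    votes.update(np.flatnonzero(selector.get_support()))

# Random-forest importances act as a fourth "voter".
rf = RandomForestClassifier(random_state=1).fit(X, y)
votes.update(np.argsort(rf.feature_importances_)[-k:])

selected = sorted(f for f, v in votes.items() if v >= 3)
print("features with >= 3 votes:", selected)
```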
Data Resampling
The main purpose of data resampling is to address the imbalance between the minority and majority classes. This step is applied only to the training dataset so that the test data remain untouched. Two different strategies are implemented: oversampling the minority class and undersampling the majority class.
SMOTE (Synthetic Minority Over-sampling Technique) is applied to the minority class, interpolating between existing data points to create synthetic data points. New synthetic data points are generated by the following equation:

$$x_{\text{new}} = x_i + \lambda \,(x_j - x_i)$$

where x_i and x_j are existing minority class instances and λ is a random number between 0 and 1.
The combined undersampling and SMOTE strategy oversamples the minority class by 40% and undersamples the majority class by 80%, adjusting the ratio from 1:14 to about 4:5. With both resampling approaches, efforts are made to bring class sizes closer together.
This prevents half of the data from being synthetic, which would otherwise result from the large initial class imbalance. These methods are intended to address the class imbalance problem and allow the model to generalize to unseen data.
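A sketch of this combined resampling with imbalanced-learn is shown below; the sampling_strategy values of 0.4 and 0.8 are an interpretation of the 40% and 80% figures above, which indeed yields roughly a 4:5 ratio:

```python
# Sketch of the combined resampling strategy: SMOTE raises the minority class
# to ~40% of the majority, then random undersampling shrinks the majority
# until the minority-to-majority ratio is ~0.8. Applied to training data only.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X_train, y_train = make_classification(n_samples=1400, n_features=20,
                                        weights=[14 / 15, 1 / 15], random_state=7)
print("before:", Counter(y_train))

X_s, y_s = SMOTE(sampling_strategy=0.4, random_state=7).fit_resample(X_train, y_train)
X_res, y_res = RandomUnderSampler(sampling_strategy=0.8, random_state=7).fit_resample(X_s, y_s)
print("after:", Counter(y_res))   # roughly a 4:5 minority-to-majority ratio
```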
Evaluation Metrics
Several metrics are used to evaluate the test results. For imbalanced and rare-class data, balanced accuracy is particularly important because it accounts for the imbalance by averaging sensitivity and specificity. Balanced accuracy is calculated as follows:

$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
Precision indicates the accuracy of positive predictions, i.e., the proportion of true positives among all positive predictions. It is defined by the following equation:

$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall (sensitivity) indicates the model's ability to identify all relevant instances, i.e., the proportion of true positives among all actual positive instances. It is calculated as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$
The false alarm rate (FAR) measures the proportion of false positives among all negative instances. It is given by the following equation:

$$\text{FAR} = \frac{FP}{FP + TN}$$
The receiver operating characteristic (ROC) curve is an evaluation measure for binary classification problems and is a probability curve that plots the true positive rate (TPR) versus the false positive rate (FPR) at various threshold values. The area under the curve (AUC) is a measure of the ability to distinguish between classes and serves as a summary of the ROC curve; a higher AUC indicates better model performance.
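These metrics can be computed with scikit-learn as in the sketch below; the labels and scores are made-up illustrative values, and FAR is derived from the confusion matrix:

```python
# Computing the evaluation metrics described above on illustrative predictions.
from sklearn.metrics import (balanced_accuracy_score, precision_score,
                             recall_score, roc_auc_score, confusion_matrix)

y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.15, 0.3, 0.6, 0.05, 0.9, 0.8, 0.4, 0.7]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("precision:        ", precision_score(y_true, y_pred))
print("recall:           ", recall_score(y_true, y_pred))
print("false alarm rate: ", fp / (fp + tn))
print("ROC AUC:          ", roc_auc_score(y_true, y_score))
```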
Results
Data Preprocessing
First, a pair plot of randomly selected features was generated (see Figure 4). It was observed that the data classes overlapped completely and were irregularly distributed. The proportion of missing values was estimated at 4.5%, and 28 columns with missing value rates greater than 50% were removed. For the remaining 1.26% of missing values, six different imputation approaches were employed.
k-NN imputation showed the best class separation, while some features were imputed with the median and others with the mean to match their normal distribution curves.
Figure 4: EDA of SECOM data after data preprocessing.
Rare Class-Based Feature Selection Voting
The feature selection approach resulted in 183 features receiving votes. Requiring at least three votes per feature, 81 features were selected. The selected features are shown in descending order of votes in Figure 5; features 433 and 210 received votes from all feature selection algorithms.
Figure 5: Voting results for the rare class-based feature selection approach.
Classification Prediction Evaluation
In this section, the results of the classification models across three different test scenarios are presented. The results for each run are reported using performance metrics and ROC curves. Finally, the performance metrics of the three runs are summarized.
Test Scenario I: Imbalanced Model
The results of the first run show that XGB and DTC have the best performance indicators, with GBC showing relatively low precision. However, LR, SVM, and RF did not perform well; despite RF's 100% precision, it does not necessarily predict all positives correctly.
While the RF model predicted all negative cases, it failed to detect positive cases due to very low recall. The best model should have the highest precision, recall, balanced accuracy, and AUC, and the lowest false alarm rate; XGB has relatively high values across these metrics and is the best model for the imbalanced data.
Figure 6: ROC curves for the first test scenario with imbalanced data.
Table 1: Summary results table for the first test scenario with imbalanced data.
Test Scenario II: SMOTE Oversampling Model
After oversampling the minority class by 70% with SMOTE, the AUC and recall of all models improved. This is especially true for LR and RF, where SMOTE produces a more balanced training dataset that allows the models to better learn the data distribution.
A slight decrease in precision and a corresponding increase in recall and false alarm rate are observed. This is because the larger minority class sample allows the classifier to detect more minority instances while potentially misclassifying some negative cases. This result underscores the importance of synthetic data generation.
Figure 7: ROC curves for the second test scenario with the SMOTE model.
Table 2: Summary results table for the second test scenario with the SMOTE model.
Test Scenario III: Combination of Resampling Models
By undersampling the majority class by 80% and oversampling the minority class by 40%, the AUC and recall improved significantly, reaching 0.95 and 0.93 for XGB, respectively. This represents a significant improvement due to data resampling.
The balanced accuracy of DTC also improved to 88%, the highest value among the models. However, there is a noticeable decrease in precision and a slight increase in the false alarm rate. This result is attributed to the smaller amount of synthetic data and the observations of the two classes moving closer together.
Figure 8: ROC curves for the third test scenario with the combined resampling model.
Table 3: Summary results table for the third test scenario with the combined resampling model.
Finally, we summarize the results of the three runs and show the trend for each performance metric. Balanced accuracy did not improve significantly, with the exception of DT and LR. Resampling improved defect detection, with recall jumping to 93% for XGB and 90% for GBT. The increased defect detection resulted in a slight decrease in the classification of conforming products and an increase in the false alarm rate.
Table 4: Comparison with recently published journal articles.
Conclusion
In this study, the SECOM dataset from an actual semiconductor manufacturing plant was analyzed and classified in detail. Eighteen different approaches were evaluated, covering various stages of data imputation, class imbalance handling, feature selection, and classification.
In addition, numerous experiments were conducted to select appropriate algorithms for missing value imputation, to tune the models' hyperparameters, and to adjust the resampling rates.
The proposed approach emphasizes feature selection and feature voting based on rare classes and has shown significant improvements in model predictability for positive cases compared to existing methods. The approach effectively identified the most important features and improved the model's ability to accurately predict failures.
In addition, the features that receive the highest votes will be analyzed along with additional sensor information to provide deeper insight into the causes of failure and to identify the most critical stages of the manufacturing process.
This experimental evaluation identified the best tools and stages for classifying the SECOM dataset. The results show the superiority of XGB for classification, SMOTE for synthetic data generation, feature voting for feature selection, and mixed algorithms for missing data imputation.
These findings demonstrate the effectiveness of the proposed methodology in handling complex and imbalanced industrial datasets and pave the way for more reliable and accurate predictive models in semiconductor manufacturing.
In the future, large language models (LLMs) and generative AI could provide innovative solutions to the class imbalance problem. By generating synthetic data and enhancing data augmentation strategies, these advanced AI techniques can be expected to further improve model robustness and accuracy on imbalanced datasets.