Stage 3: How to leap forward from Information to Knowledge
To reach the Knowledge state of the data, we pre-processed the data: normalizing it, identifying duplicates and quality issues, imputing missing values, and flagging extreme data points as outliers. We used distance-based and genetic algorithms for pattern matching to build a golden record of part IDs, cleaned and standardized across placed orders and varied bills of materials (BOMs). Missing values were imputed with machine learning algorithms that predict the absent values. All of these algorithms support the principle of continuous quality assurance at each stage of the journey. The entire standardization and quality-assurance process was automated following DevOps principles, and each change was logged, analyzed, and communicated to stakeholders.
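As a rough illustration of the idea, not the production implementation, the sketch below matches near-duplicate part IDs under one golden ID using an edit-distance similarity (one kind of distance-based matching) and fills missing numeric attributes with a model-based imputer. The part IDs, column names, and the 0.9 similarity threshold are hypothetical.

```python
# Hypothetical sketch: distance-based part-ID matching plus model-based imputation.
# Part IDs, column names, and thresholds are illustrative, not the project's values.
import pandas as pd
from difflib import SequenceMatcher
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

orders = pd.DataFrame({
    "part_id": ["AB-1001", "AB1001 ", "ab-1001", "CD-2002", "CD_2002"],
    "unit_price": [12.5, None, 12.4, 30.0, None],
    "lead_time_days": [14, 15, None, 21, 20],
})

def normalize(part_id: str) -> str:
    """Basic standardization: trim, uppercase, unify separators."""
    return part_id.strip().upper().replace("_", "-").replace(" ", "")

def similarity(a: str, b: str) -> float:
    """Edit-distance-based similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

orders["part_id_std"] = orders["part_id"].map(normalize)

# Group near-duplicate IDs under one golden ID using a similarity threshold.
canonical = []
golden_ids = []
for pid in orders["part_id_std"]:
    match = next((c for c in canonical if similarity(pid, c) >= 0.9), None)
    if match is None:
        canonical.append(pid)
        match = pid
    golden_ids.append(match)
orders["golden_part_id"] = golden_ids

# Model-based imputation of missing numeric attributes: one way to let a
# machine learning model "predict missing values".
numeric_cols = ["unit_price", "lead_time_days"]
orders[numeric_cols] = IterativeImputer(random_state=0).fit_transform(orders[numeric_cols])
print(orders[["part_id", "golden_part_id"] + numeric_cols])
```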
The historical data for machine failures was heavily skewed: the minority class (failures) accounted for only 3% to 5% of observations. Feeding such data directly into predictive maintenance models would be misleading, and collecting additional failure data was not an option. We therefore used machine learning techniques to turn the imbalanced data into a balanced training dataset, while leaving the test and validation datasets untouched to avoid biasing the validation process.
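The source does not name the specific balancing technique; as one common possibility, the sketch below applies SMOTE oversampling from imbalanced-learn to the training split only, so the test set keeps the original class distribution. The synthetic dataset stands in for the real failure data.

```python
# Hypothetical sketch: rebalance only the training split, leave test data untouched.
# SMOTE is used here as one common oversampling technique; the source does not
# specify which balancing method was actually applied.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the skewed failure data (~4% minority class).
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.96, 0.04], random_state=42
)

# Split first, so the test set keeps the real-world imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority class in the training data only.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

print("train before:", Counter(y_train))
print("train after: ", Counter(y_train_bal))
print("test (unchanged):", Counter(y_test))
```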
Every dataset contains a few extreme observations, and ours was no exception: we had to identify extreme observations that should be excluded when building models. To keep outliers out of the models, we applied statistical and machine learning approaches such as Cook's distance and similar influence measures. With this in place, we pushed onward to the level of Knowledge.
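A minimal sketch of the Cook's distance check follows, using statsmodels on simulated data. The regression target, the injected outliers, and the common 4/n cutoff are illustrative assumptions, not the project's actual values.

```python
# Hypothetical sketch: flag influential/extreme observations with Cook's distance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                 # e.g. sensor-derived features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
y[:5] += 15                                 # inject a few extreme observations

model = sm.OLS(y, sm.add_constant(X)).fit()
cooks_d, _ = model.get_influence().cooks_distance

# A common rule of thumb: points with Cook's distance > 4/n are influential
# and are candidates for exclusion before model training.
threshold = 4 / n
outlier_idx = np.where(cooks_d > threshold)[0]
print(f"{len(outlier_idx)} influential points flagged:", outlier_idx[:10])
```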
Additionally, we faced a cold-start problem wherever new units were installed and we lacked sufficient data to train our models. Machine learning-based similarity algorithms allowed us to match new machines and products to comparable existing ones across a variety of criteria. The nearest-matching historical product/machine data was then used for prediction and refreshed with every use of the model: a feedback loop returned prediction deviations to the model as a self-learning mechanism.
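The sketch below shows one simple way such similarity matching could work, borrowing the failure history of the nearest historical machine for a newly installed one. The machine attributes, IDs, and the single-neighbor choice are hypothetical, and the feedback loop is only indicated in a comment.

```python
# Hypothetical sketch of the cold-start strategy: match a newly installed machine
# to its nearest historical neighbour and reuse that machine's history until
# enough of its own data accumulates. Attribute names and IDs are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Historical machines described by a few criteria (type, capacity, age, load).
historical_profiles = np.array([
    [1, 500, 10, 0.7],
    [1, 520, 8, 0.9],
    [2, 300, 5, 0.6],
    [2, 320, 3, 0.8],
], dtype=float)
historical_ids = ["M-101", "M-102", "M-201", "M-202"]

scaler = StandardScaler().fit(historical_profiles)
nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(historical_profiles))

# A new machine with no failure history of its own.
new_machine = np.array([[2, 310, 0, 0.7]])
_, idx = nn.kneighbors(scaler.transform(new_machine))
print("Borrow history from:", historical_ids[idx[0][0]])

# Feedback loop (not shown): once actual outcomes arrive, the deviation between
# prediction and observation is fed back to retrain or adjust the model.
```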
The challenge of obtaining representative test data
Although test data is typically drawn from the initial data set, we could not obtain production data due to security, privacy, and local regulations. We therefore used unsupervised deep learning algorithms to generate production-grade data modeled on the production systems without exposing any real production data; the generated data preserved the same patterns as the live production data. Data validation tests are now automated through continuous integration: whenever a new data-related feature becomes available, it is integrated into the master flow via automated pipelines. After integration, automated continuous delivery (CD) pipelines push the feature to all candidate platforms, subject to the required approval and exception processes. This enables real-time training and validation of the machine learning solutions on each platform with the appropriate dataset. Section 6 provides a deeper look at synthetic data.
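To make the synthetic-data idea concrete, the sketch below trains a small variational autoencoder (one kind of unsupervised deep learning model) on stand-in tabular data and then samples new rows from its latent space. The architecture, feature count, and training loop are illustrative assumptions; the actual approach is covered in Section 6.

```python
# Hypothetical sketch: a tiny tabular VAE that learns the data distribution and
# samples synthetic rows, so real production rows never leave the source system.
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.mu = nn.Linear(32, latent_dim)
        self.log_var = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization
        return self.decoder(z), mu, log_var

def vae_loss(x, x_hat, mu, log_var):
    recon = nn.functional.mse_loss(x_hat, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kld

# Toy stand-in for production-like data (8 numeric features).
real_like = torch.randn(1024, 8)

vae = TabularVAE(n_features=8)
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
for _ in range(200):                      # brief training loop for illustration
    x_hat, mu, log_var = vae(real_like)
    loss = vae_loss(real_like, x_hat, mu, log_var)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generate synthetic rows by decoding samples from the latent prior.
with torch.no_grad():
    synthetic = vae.decoder(torch.randn(100, 4))
print(synthetic.shape)  # -> torch.Size([100, 8])
```

In a pipeline like the one described above, rows generated this way would then pass through the automated data validation tests in CI before being made available to the CD stages.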