
An Automated Data Mining Technique on Spectral Data Modeling and Prediction

News & Insights | 2023-09-11

Big data and AI technologies are increasingly applied in the pharmaceutical industry to mine spectral data, such as near-infrared (NIR) and Raman spectra, in order to improve efficiency and reduce costs. This article describes an automated data mining technique that models spectral data and makes predictions with high efficiency and precision.


Traditional Data Mining Technique for Spectral Data Modeling and Prediction

In traditional data mining, data preprocessing, feature extraction, and model determination are carried out separately and manually. The choices made in preprocessing and feature extraction strongly influence which model is selected and how precise its predictions are. As a result, data scientists and analysts often spend several days, or even several weeks, verifying the results of different method combinations across these steps.


The Automated Data Mining Technique for Spectral Data Modeling and Prediction

In the automated data mining technique, data preprocessing, feature extraction, and model determination are integrated into one procedure. AI and optimization techniques (e.g., grid search, random search, and genetic algorithms) automatically search for the optimal method combination, judged by criteria such as AIC, BIC, and RMSE, so little or no manual work is needed. Domain knowledge can also be built into the procedure flexibly to make the outcomes more useful.
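
As a rough illustration of searching over method combinations, the sketch below runs an exhaustive grid search scored by validation RMSE. It is not the tool's implementation. The synthetic spectra, the SNV preprocessing option, the correlation-based wavelength ranking, the ridge regression model, and all function names are assumptions chosen only to keep the example small and dependent on NumPy alone.

```python
import itertools
import numpy as np

def snv(X):
    """Standard normal variate: z-score each spectrum (row) on its own."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Hypothetical preprocessing options for the search space.
PREPROCESSORS = {"none": lambda X: X, "snv": snv}

def top_k_by_correlation(X, y, k):
    """Indices of the k wavelengths with the largest |correlation| to y."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:k]

def fit_ridge(X, y, alpha):
    """Ridge regression with an intercept, solved via the normal equations."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def grid_search(X_train, y_train, X_val, y_val, grid):
    """Try every (preprocess, n_features, alpha) combination; keep the lowest RMSE."""
    results = []
    for name, k, alpha in itertools.product(
        grid["preprocess"], grid["n_features"], grid["alpha"]
    ):
        prep = PREPROCESSORS[name]
        Xt, Xv = prep(X_train), prep(X_val)
        idx = top_k_by_correlation(Xt, y_train, k)  # selected on training data only
        w, b = fit_ridge(Xt[:, idx], y_train, alpha)
        results.append({"preprocess": name, "n_features": k, "alpha": alpha,
                        "rmse": rmse(y_val, Xv[:, idx] @ w + b)})
    return min(results, key=lambda r: r["rmse"]), results

def make_toy_spectra(n_samples=80, n_wavelengths=50, seed=0):
    """Synthetic data: one Gaussian peak scaled by concentration, plus offset and noise."""
    rng = np.random.default_rng(seed)
    y = rng.uniform(0.5, 2.0, n_samples)
    peak = np.exp(-0.5 * ((np.arange(n_wavelengths) - 25) / 4.0) ** 2)
    baseline = rng.normal(0.0, 0.3, (n_samples, 1))
    noise = rng.normal(0.0, 0.02, (n_samples, n_wavelengths))
    return y[:, None] * peak + baseline + noise, y

X, y = make_toy_spectra()
grid = {"preprocess": ["none", "snv"], "n_features": [5, 10, 20], "alpha": [0.01, 1.0]}
best, results = grid_search(X[:60], y[:60], X[60:], y[60:], grid)
print(best)
```

A random search or genetic algorithm would replace the `itertools.product` loop with sampled or evolved combinations; the scoring step stays the same.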

Analyzing Spectral Data: Modeling Options

Auto-Mode: First, import the data and define the quality attributes and influencing factors. Then select a sub-mode as needed, either quick mode or high-performance mode, and run the procedure. Using the default settings, the procedure automatically carries out data preprocessing, feature extraction, and model determination.

Expert-Mode: In addition to everything in Auto-Mode, specific methods can be selected in place of the default settings to improve accuracy. For example, Z-score standardization, Savitzky-Golay (S-G) smoothing, and Gaussian filtering can be selected for data preprocessing; information entropy for feature extraction; and PLS and XGBoost for model building. The default parameters and hyperparameters can also be changed. Because irrelevant methods are excluded from the search, Expert-Mode generally takes much less time to find the optimal result.
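
To make two of the preprocessing options named above concrete, here is a minimal NumPy sketch of Z-score standardization and S-G smoothing. The function names, the default window length of 5, the polynomial order of 2, and the edge-padding choice are assumptions for illustration, not the tool's settings. In practice, `scipy.signal.savgol_filter` provides a maintained implementation of the same filter.

```python
import numpy as np

def zscore(X):
    """Column-wise Z-score: each wavelength gets zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - mean) / std

def savgol_coefficients(window_length, polyorder):
    """Weights of a Savitzky-Golay smoothing filter (odd window_length)."""
    half = window_length // 2
    positions = np.arange(-half, half + 1)
    # Columns are 1, t, t^2, ...: a least-squares polynomial fit on the window.
    A = np.vander(positions, polyorder + 1, increasing=True)
    # Row 0 of the pseudo-inverse evaluates the fitted polynomial at the center (t = 0).
    return np.linalg.pinv(A)[0]

def savgol_smooth(spectrum, window_length=5, polyorder=2):
    """Smooth one spectrum; edges are handled by repeating the end values."""
    coeffs = savgol_coefficients(window_length, polyorder)
    padded = np.pad(spectrum, window_length // 2, mode="edge")
    # np.convolve flips its kernel, so pass the reversed weights to get a correlation.
    return np.convolve(padded, coeffs[::-1], mode="valid")

# Usage on a noisy synthetic peak (illustrative data only).
rng = np.random.default_rng(0)
wavelengths = np.arange(100)
clean = np.exp(-0.5 * ((wavelengths - 50) / 8.0) ** 2)
smoothed = savgol_smooth(clean + rng.normal(0.0, 0.05, clean.size))
print(smoothed.shape)
```

Because the filter fits a local polynomial, it preserves peak shape better than a plain moving average of the same width, which is one reason it is common for spectra.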


CASE: Improving the Efficiency of Spectral Data Modeling and API Concentration Prediction Using the Automated Data Mining Technique

Traditionally, building predictive models for the critical quality attributes (CQAs) of an active pharmaceutical ingredient (API) takes a lot of time, especially in the data preprocessing, feature extraction, and model determination stages. The AI and optimization techniques described in this article can shorten this work. The analyst only needs to import the sample data into the tool and run it in Auto-Mode or Expert-Mode, with little effort required to modify or evaluate the process, which greatly improves both efficiency and prediction precision. The details are as follows.

Step 1: Import Sample Data into the Tool to Generate Graphs 


Step 2: Use Auto-Mode to Find the Optimal Combination of Data Preprocessing, Feature Extraction, and Model Determination.

1. Define the Critical Quality Attribute (Concentration) and Influencing Factors (Wavelength, Temperature, etc.)


2. Select Auto-Mode or Expert-Mode


3. Keep or Change the Default Settings, and Click the “TRAINING” Button. The tool then starts searching for the optimal combination of data preprocessing, feature extraction, and model determination methods.

1) Data Preprocessing Methods




2) Feature Extraction


3) Model Building


4) Model Evaluation


4. After “TRAINING” finishes, the optimal model is added to the “Model Pool”. Click it to see the details.

1) Model Pool


2) Model Details
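
The model evaluation sub-step in the training procedure above relies on criteria such as RMSE, AIC, and BIC. The sketch below shows one common way to compute them for a regression model, using the least-squares forms of AIC and BIC (Gaussian errors, additive constants dropped). The function name `evaluate` and the example numbers are hypothetical, and the tool may compute these criteria differently.

```python
import numpy as np

def evaluate(y_true, y_pred, n_params):
    """Return RMSE, R^2, AIC, and BIC for a fitted regression model.

    n_params is the number of fitted parameters (for example, regression
    coefficients plus intercept). The formulas assume at least one nonzero
    residual, since log(0) is undefined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    rss = float(np.sum((y_true - y_pred) ** 2))         # residual sum of squares
    tss = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {
        "rmse": float(np.sqrt(rss / n)),
        "r2": 1.0 - rss / tss,
        "aic": float(n * np.log(rss / n) + 2 * n_params),
        "bic": float(n * np.log(rss / n) + n_params * np.log(n)),
    }

# Illustrative values only: measured vs. predicted concentrations.
metrics = evaluate([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], n_params=2)
print(metrics)
```

RMSE measures fit alone, while AIC and BIC add a penalty that grows with the number of parameters, so they favor simpler models when two candidates fit about equally well.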



Step 3: Prediction of Concentration

“Deploy” the optimal model. It can then be used to predict the product concentration from new spectral data.
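
One point worth making explicit about deployment is that new spectra must pass through exactly the same preprocessing and feature selection that were fitted during training. The sketch below illustrates this with a hypothetical `DeployedModel` class and `train_and_deploy` helper. Z-score scaling and ordinary least squares stand in for whatever combination the tool actually selects, and the data are synthetic.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class DeployedModel:
    """Everything needed to score new spectra, frozen at training time."""
    mean: np.ndarray         # per-wavelength mean from the training set
    std: np.ndarray          # per-wavelength standard deviation from the training set
    feature_idx: np.ndarray  # wavelengths kept by feature extraction
    weights: np.ndarray
    intercept: float

    def predict(self, spectra):
        X = np.atleast_2d(np.asarray(spectra, dtype=float))
        Xs = (X - self.mean) / self.std  # reuse training statistics, never refit
        return Xs[:, self.feature_idx] @ self.weights + self.intercept

def train_and_deploy(X, y, feature_idx):
    """Fit scaling and a linear model on training data, then package them."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    Xs = ((X - mean) / std)[:, feature_idx]
    A = np.column_stack([Xs, np.ones(len(y))])  # last column fits the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return DeployedModel(mean, std, np.asarray(feature_idx), coef[:-1], float(coef[-1]))

# Synthetic example: concentration depends on two of six wavelengths.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(30, 6))
y_train = 2.0 * X_train[:, 1] - X_train[:, 3] + 0.5
model = train_and_deploy(X_train, y_train, feature_idx=[1, 3])

X_new = rng.normal(size=(5, 6))
print(model.predict(X_new))
```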


The traditional approach took more than one week to complete the whole workflow, whereas the automated data mining technique takes approximately 4.5 hours.

In short, this automated data mining technique makes it possible to model spectral data and make predictions efficiently and easily, with more precise outcomes.
