Enhancing Auto-ML with Missing Value Imputation: A Case Study with TPOT2 Library and Industry 4.0

  • Joel Frank Huarayo Quispe UNIFESP
  • Didier A. Vega-Oliveros UNIFESP
  • Lilian Berton UNIFESP

Abstract


Automated Machine Learning (AutoML) is increasingly important in industrial applications because it democratizes the use of machine learning techniques, particularly in Industry 4.0, where robust model development is crucial. To address the challenge of missing data, we introduce a missing data imputation module integrated into the TPOT2 AutoML library, a rewrite of TPOT with additional features. The module incorporates SimpleImputer, IterativeImputer, and KNNImputer, extending TPOT2's ability to handle datasets with missing values. We evaluate the module on three industrial datasets (Mercedes-Benz Greener Manufacturing, NASA Turbofan Jet Engine, and Gearbox Fault Diagnosis) covering classification and regression tasks, with varying levels of missing data (5%, 10%, and 15%). Our results demonstrate that the TPOT2 library, equipped with this imputation module, significantly improves predictive modeling accuracy in the presence of missing data, which supports its practical utility and robustness in industrial contexts.
Keywords: Missing Value Imputation, Auto-ML, Industry 4.0
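
As a rough illustration of the three imputers named in the abstract, the sketch below applies scikit-learn's SimpleImputer, IterativeImputer, and KNNImputer to the same matrix at 5%, 10%, and 15% missingness. It is not the TPOT2 module itself: the synthetic data, the `mask_at_random` helper, the choice to remove entries completely at random, and the RMSE comparison are all assumptions made for the example.

```python
import numpy as np
# IterativeImputer is experimental in scikit-learn and needs this import first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer


def mask_at_random(X, rate, rng):
    """Hypothetical helper: copy X and set roughly `rate` of its entries to NaN."""
    X_missing = X.astype(float).copy()
    X_missing[rng.random(X.shape) < rate] = np.nan
    return X_missing


rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 rows, 5 strongly correlated features.
base = rng.normal(size=(200, 1))
X = base + 0.1 * rng.normal(size=(200, 5))

imputers = {
    "simple_mean": SimpleImputer(strategy="mean"),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
    "knn": KNNImputer(n_neighbors=5),
}

results = {}
for rate in (0.05, 0.10, 0.15):
    X_missing = mask_at_random(X, rate, rng)
    mask = np.isnan(X_missing)
    for name, imputer in imputers.items():
        X_filled = imputer.fit_transform(X_missing)
        # Reconstruction error measured only on the entries that were removed.
        rmse = float(np.sqrt(np.mean((X_filled[mask] - X[mask]) ** 2)))
        results[(rate, name)] = rmse
        print(f"missing={rate:.0%}  imputer={name:<12}  rmse={rmse:.4f}")
```

Reconstruction error is only a proxy here. The paper's evaluation concerns the accuracy of the downstream classification and regression models, which this sketch does not measure.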

Published
17 November 2024
QUISPE, Joel Frank Huarayo; VEGA-OLIVEROS, Didier A.; BERTON, Lilian. Enhancing Auto-ML with Missing Value Imputation: A Case Study with TPOT2 Library and Industry 4.0. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 21., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 97-108. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2024.245232.
