A Comparative Study of the Use of Committee Approaches Regression for hot-deck Imputation

  • Thiago da Silva Pereira, Federal Center for Technological Education of Rio de Janeiro
  • Eduardo Bezerra da Silva, Federal Center for Technological Education of Rio de Janeiro
  • Jorge de Abreu Soares, Federal Center for Technological Education of Rio de Janeiro

Abstract


An essential problem in data preprocessing is dealing with missing data. One possible solution is hot-deck imputation, a technique comprising two steps: first, cluster similar records in the input dataset; then, perform imputation within each cluster separately. However, selecting the best algorithm for the second step is challenging. This article presents a comparative study of hot-deck imputation using two ensemble methods: Bagging and AdaBoost. We evaluate these methods on datasets with different degrees of correlation between their attributes and with varying missing-value rates. Our results, which measure the precision of the data imputed by both techniques, indicate that AdaBoost yields better precision with reasonable processing time.

Keywords: Hot-Deck Imputation, Missing Data, Bagging, AdaBoost, Ensemble
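
The two-step procedure described in the abstract (cluster first, then impute within each cluster) can be illustrated with a minimal sketch. It is not the authors' implementation. For brevity it assumes records of the form (x, y) where only y may be missing, a tiny 1-D k-means for the clustering step, and Bagging over least-squares lines for the regression step. The function names (kmeans_1d, fit_line, bagging_predict, hot_deck_impute) and all parameter values are hypothetical. AdaBoost, the other method compared in the paper, is not shown here.

```python
import random
from statistics import mean


def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means (illustrative only); returns one cluster label per value."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic initialisation: centers spread evenly over the sorted values.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels


def fit_line(points):
    """Least-squares line y = a + b*x; degenerates to the mean if x is constant."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in points) / sxx if sxx else 0.0
    return my - b * mx, b


def bagging_predict(donors, x_new, n_estimators=25, rng=None):
    """Bagging: average the predictions of lines fitted on bootstrap samples."""
    rng = rng or random.Random(0)
    preds = []
    for _ in range(n_estimators):
        sample = [rng.choice(donors) for _ in donors]  # sample with replacement
        a, b = fit_line(sample)
        preds.append(a + b * x_new)
    return mean(preds)


def hot_deck_impute(records, k=2, seed=0):
    """Step 1: cluster records on x. Step 2: impute missing y inside each cluster."""
    rng = random.Random(seed)
    labels = kmeans_1d([x for x, _ in records], k)
    completed = list(records)
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        donors = [records[i] for i in idx if records[i][1] is not None]
        if not donors:
            continue  # no complete record in this cluster to learn from
        for i in idx:
            x, y = records[i]
            if y is None:
                completed[i] = (x, bagging_predict(donors, x, rng=rng))
    return completed


if __name__ == "__main__":
    data = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2),
            (50.0, 10.0), (51.0, 9.1), (52.0, None), (53.0, 7.2)]
    for row in hot_deck_impute(data, k=2):
        print(row)
```

Restricting the donors to the cluster of the incomplete record is what makes this a hot-deck scheme: each regressor is trained only on records that resemble the one being completed. Swapping bagging_predict for a boosting-based predictor would give the second variant the study compares.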

Published
2020-09-28
PEREIRA, Thiago da Silva; DA SILVA, Eduardo Bezerra; SOARES, Jorge de Abreu. A Comparative Study of the Use of Committee Approaches Regression for hot-deck Imputation. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 35., 2020, Online Event. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 163-168. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2020.13635.