An Empirical Analysis of Data Drift Detection Techniques in Machine Learning Systems

  • Lucas Helfstein, Universidade de São Paulo (USP)
  • Kelly Rosa Braghetto, Universidade de São Paulo (USP)

Abstract


Software systems with machine learning (ML) components are being used in a wide range of domains. Developers of such systems face challenges different from those of traditional software because the performance of ML systems is directly tied to their input data. This work shows that ML systems can be improved over time by actively monitoring the data that flows through them and retraining their models when drift is detected. To this end, we first assess widely used statistical and distance-based methods for data drift detection, discussing their advantages and limitations. We then present results from experiments that apply these methods to real-world and synthetic datasets to detect data drift and automatically improve system robustness.
Keywords: Machine Learning, Data Science, Drift Detection in Machine Learning
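
As a minimal illustrative sketch of the monitoring loop described in the abstract (not the implementation evaluated in the paper), the Python snippet below compares a reference sample drawn from the training data against a window of recent production data using one statistical method (the two-sample Kolmogorov-Smirnov test, cf. Hodges Jr, 1958) and one distance-based method (the Hellinger distance over binned histograms, cf. Ditzler and Polikar, 2011). The function names and thresholds (detect_drift, alpha, hellinger_threshold, bins) are illustrative assumptions, not values reported in the paper.

# Sketch: flag drift with a KS two-sample test and a Hellinger distance check.
import numpy as np
from scipy.stats import ks_2samp

def hellinger_distance(p, q):
    # Hellinger distance between two discrete probability distributions.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def detect_drift(reference, window, alpha=0.05, hellinger_threshold=0.1, bins=20):
    # Statistical check: reject equality of distributions at significance alpha.
    _, p_value = ks_2samp(reference, window)
    ks_drift = p_value < alpha

    # Distance-based check: Hellinger distance over a shared binning.
    edges = np.histogram_bin_edges(np.concatenate([reference, window]), bins=bins)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(window, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    hellinger_drift = hellinger_distance(p, q) > hellinger_threshold

    return ks_drift or hellinger_drift

# Usage: compare a sliding window of recent inputs against the training sample.
rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, size=5000)   # reference distribution
incoming_window = rng.normal(0.5, 1.0, size=500)     # shifted production data
if detect_drift(training_feature, incoming_window):
    print("Drift detected: schedule model retraining on recent data.")

In practice, such a check would typically be applied per feature, with a multiple-testing correction such as Bonferroni (Bland and Altman, 1995) when many features are monitored simultaneously.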

References

Bland, J. M. and Altman, D. G. (1995). Multiple significance tests: the Bonferroni method. BMJ, 310(6973):170.

Bock, R. (2007). MAGIC Gamma Telescope. UCI Machine Learning Repository. DOI: 10.24432/C52C8B.

Dasu, T., Krishnan, S., Venkatasubramanian, S., and Yi, K. (2006). An information-theoretic approach to detecting changes in multi-dimensional data streams. In Symposium on the Interface of Statistics, Computing Science, and Applications (Interface).

Ditzler, G. and Polikar, R. (2011). Hellinger distance based drift detection for nonstationary environments. In 2011 IEEE Symposium on Computational Intelligence in Dynamic and Uncertain Environments (CIDUE), pages 41–48.

Gama, J. and Castillo, G. (2006). Learning with local drift detection. In Advanced Data Mining and Applications: Second International Conference, ADMA 2006, Xi’an, China, August 14-16, 2006, Proceedings, pages 42–55.

Harries, M. (1999). Splice-2 comparative evaluation: electricity pricing. Technical report, The University of New South Wales, Sydney.

Hodges Jr, J. (1958). The significance probability of the Smirnov two-sample test. Arkiv för matematik, 3(5):469–486.

Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., and Zhang, G. (2018). Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12):2346–2363.

Pérez-Cruz, F. (2008). Kullback-Leibler divergence estimation of continuous distributions. In 2008 IEEE International Symposium on Information Theory, pages 1666–1670.

Rabanser, S., Günnemann, S., and Lipton, Z. (2019). Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32.

Schlimmer, J. C. and Granger, R. H. (1986). Incremental learning from noisy data. Machine Learning, 1:317–354.

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J. F., and Dennison, D. (2015). Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems, 28:2503–2511.

Souza, V. M. A., Reis, D. M., Maletzke, A. G., and Batista, G. E. A. P. A. (2020). Challenges in benchmarking stream learning algorithms with real-world data. Data Mining and Knowledge Discovery, 34:1805–1858.

Street, W. N. and Kim, Y. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 377–382.

Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L., and Petitjean, F. (2016). Characterizing concept drift. Data Mining and Knowledge Discovery, 30(4):964–994.
Published
14/10/2024
HELFSTEIN, Lucas; BRAGHETTO, Kelly Rosa. An Empirical Analysis of Data Drift Detection Techniques in Machine Learning Systems. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 39., 2024, Florianópolis/SC. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 40-52. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.240606.