Differential Privacy in Gradient Boosting Decision Trees with Partitioning Techniques for Categorical Data

Antonio Gabriel M. Alves; Francisco Lucas F. Pereira; Iago C. Chaves; Javam C. Machado

doi:10.5753/sbbd.2024.240842

Antonio Gabriel M. Alves, Universidade Federal do Ceará (UFC)
Francisco Lucas F. Pereira, Universidade Federal do Ceará (UFC)
Iago C. Chaves, Universidade Federal do Ceará (UFC)
Javam C. Machado, Universidade Federal do Ceará (UFC)

Abstract

This paper proposes a new approach to partitioning categorical data for applying differential privacy in Gradient Boosting Decision Trees. We study improvements in the handling of categorical attributes and in the random selection of split points while providing differential privacy guarantees. Our approach defines a new gain function for these attributes and determines the sensitivity bounds of that function. In addition, we conduct an empirical analysis on 6 real-world datasets, showing that the proposed approach achieves error rates lower than or equal to those of the baseline models.

Keywords: Data mining and analytics, Data privacy and security, Machine Learning, AI, data management and data systems
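
To make the idea of privately choosing a split on a categorical attribute concrete, the sketch below scores each one-vs-rest split with a gain and then samples a split with the exponential mechanism. It is an illustration under stated assumptions, not the method of this paper: the gain uses the common gradient boosting form (squared gradient sum divided by count plus a regularizer), the names `category_gains`, `exponential_mechanism`, and `lam` are hypothetical, and the `sensitivity` value must be supplied by the caller because the paper's own gain function and sensitivity bounds are not reproduced here.

```python
import math
import random
from collections import defaultdict


def category_gains(categories, gradients, lam=1.0):
    """Score the one-vs-rest split for every category value.

    Assumed gain (not the paper's): G_left^2 / (n_left + lam)
    + G_right^2 / (n_right + lam), where the left side holds the
    rows equal to the category and the right side holds the rest.
    """
    total_g = sum(gradients)
    n = len(gradients)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, g in zip(categories, gradients):
        sums[c] += g
        counts[c] += 1

    gains = {}
    for c in sums:
        g_left, n_left = sums[c], counts[c]
        g_right, n_right = total_g - g_left, n - n_left
        gains[c] = g_left ** 2 / (n_left + lam) + g_right ** 2 / (n_right + lam)
    return gains


def exponential_mechanism(scores, epsilon, sensitivity, rng=random):
    """Sample a key with probability proportional to
    exp(epsilon * score / (2 * sensitivity)).

    `sensitivity` is an assumption provided by the caller; the
    privacy guarantee holds only if it really bounds the score's
    sensitivity.
    """
    keys = sorted(scores)
    # Subtracting the maximum avoids overflow and leaves the
    # sampling distribution unchanged.
    top = max(scores.values())
    weights = [
        math.exp(epsilon * (scores[k] - top) / (2.0 * sensitivity))
        for k in keys
    ]
    return rng.choices(keys, weights=weights, k=1)[0]
```

A call such as `exponential_mechanism(category_gains(cats, grads), epsilon=1.0, sensitivity=3.0)` returns one category value to split on; the `3.0` is a placeholder, and smaller values of `epsilon` make the choice closer to uniform.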
