Differential Privacy in Gradient Boosting Decision Trees with Partitioning Techniques for Categorical Data

  • Antonio Gabriel M. Alves Federal University of Ceará (UFC)
  • Francisco Lucas F. Pereira Federal University of Ceará (UFC)
  • Iago C. Chaves Federal University of Ceará (UFC)
  • Javam C. Machado Federal University of Ceará (UFC)

Abstract


Gradient Boosting Decision Trees have achieved state-of-the-art performance in various machine learning tasks. This paper investigates improvements to the handling of categorical attributes and to the random selection of split points under differential privacy guarantees. The contributions include a new gain function for categorical attributes and sensitivity bounds for that function. An empirical analysis on six real-world datasets shows that the proposed approach achieves error rates equal to or lower than those of the baseline models.
Keywords: Data mining and analytics, Data privacy and security, Machine Learning, AI, data management and data systems
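
To illustrate how a sensitivity bound on a gain function is typically used when split points are chosen at random under differential privacy, the following is a minimal sketch of the standard exponential mechanism. It is not the paper's algorithm or its gain function: the function names, the example gain values, and the sensitivity value are all hypothetical and serve only to show the general technique.

```python
import math
import random


def selection_probabilities(scores, epsilon, sensitivity):
    """Exponential-mechanism probabilities for a list of candidate scores.

    Each candidate i is chosen with probability proportional to
    exp(epsilon * scores[i] / (2 * sensitivity)).
    """
    scale = epsilon / (2.0 * sensitivity)
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(scale * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(scores, epsilon, sensitivity, rng=None):
    """Sample the index of one candidate according to those probabilities."""
    rng = rng or random.Random()
    probs = selection_probabilities(scores, epsilon, sensitivity)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Hypothetical gains for three candidate splits (illustrative values only).
candidate_gains = [0.0, 1.0, 3.0]
chosen = exponential_mechanism(candidate_gains, epsilon=1.0, sensitivity=1.0)
```

A tighter sensitivity bound concentrates more probability on high-gain candidates for the same privacy budget `epsilon`, which is why deriving such bounds for a new gain function matters.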

Published
2024-10-14
M. ALVES, Antonio Gabriel; PEREIRA, Francisco Lucas F.; CHAVES, Iago C.; MACHADO, Javam C. Differential Privacy in Gradient Boosting Decision Trees with Partitioning Techniques for Categorical Data. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 444-456. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.240842.