Stochastic Target Encoder - A new categorical feature encoding applied to urban data regression problems
Abstract
Regression problems are Machine Learning (ML) tasks often found in the real world, and many of their attributes are categorical. Most ML algorithms work only with numerical data, so these attributes usually have to be encoded, but common encoding methods ignore the properties of the data, which can lead to poor model performance on high-cardinality data. Target Encoding methods address this, but they map each attribute to a discrete set of values with the same cardinality as the categorical attribute. We propose a Target Encoder that addresses both issues by introducing variability into the encoded data using target statistics, achieving results comparable to those of existing Target Encoders. We test our method against existing encoders and show that it performs robustly.
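As a rough illustration of the contrast drawn in the abstract, the sketch below compares plain mean target encoding, which yields one value per category, with a stochastic variant that draws each encoded value from a distribution built from the category's target statistics. This is not the authors' implementation, which the abstract does not detail. The Gaussian sampling, the function names, and the toy data are all assumptions made for the example.

```python
import random
import statistics
from collections import defaultdict


def fit_target_stats(categories, targets):
    """Collect the per-category mean and standard deviation of the target."""
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)
    stats = {}
    for cat, ys in groups.items():
        mean = statistics.fmean(ys)
        # Population std is used so that single-row categories get 0.0
        std = statistics.pstdev(ys)
        stats[cat] = (mean, std)
    return stats


def mean_target_encode(categories, stats):
    """Plain target encoding: every row of a category gets the same value."""
    return [stats[cat][0] for cat in categories]


def stochastic_target_encode(categories, stats, seed=0):
    """Hypothetical stochastic variant (an assumption, not the paper's method):
    sample each row's value from a Gaussian around the category mean."""
    rng = random.Random(seed)
    return [rng.gauss(stats[cat][0], stats[cat][1]) for cat in categories]


# Toy data, made up for illustration: district -> rental price
districts = ["north", "north", "north", "south", "south", "south"]
prices = [10.0, 12.0, 14.0, 30.0, 33.0, 36.0]

stats = fit_target_stats(districts, prices)
plain = mean_target_encode(districts, stats)
noisy = stochastic_target_encode(districts, stats, seed=42)

print(len(set(plain)))  # 2: one encoded value per category
print(noisy)            # values scattered around each category mean
```

In a real pipeline the statistics would be fitted on training data only, and how unseen categories and test-time rows are encoded is a design choice this sketch leaves open.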
Published
21/07/2024
How to Cite
ARAUJO, João Victor; SANTOS, Gean da Silva; AQUINO, Andre L. L.; QUEIROZ, Fabiane. Stochastic Target Encoder - A new categorical feature encoding applied to urban data regression problems. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO UBÍQUA E PERVASIVA (SBCUP), 16., 2024, Brasília/DF. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 131-140. ISSN 2595-6183. DOI: https://doi.org/10.5753/sbcup.2024.3157.