Mapping Ancestry through Surnames: Machine Learning Approaches Applied to Brazilian Data

doi:10.5753/kdmile.2025.247744

Arthur Lins Wolmer, Centro de Estudos e Sistemas Avançados do Recife (CESAR SCHOOL)
Diego de Freitas Bezerra, Centro de Estudos e Sistemas Avançados do Recife (CESAR SCHOOL)

Abstract

Classifying surname origin as a proxy for ethnic background has long supported sociological, demographic, and genetic studies, particularly in countries with diverse migratory histories. In this article, we introduce a new Brazilian dataset constructed from over one million historical immigration records, propose a pipeline for surname extraction and disambiguation, and evaluate multiple supervised classifiers based on character-level n-grams. In addition to replicating classical models, we implement graph-based methods and an ensemble classifier. Our results confirm that traditional approaches remain competitive, while the ensemble model achieves significant gains.

Keywords: ancestry inference, name disambiguation, ensemble learning, graph-based models, surname classification
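
As a rough illustration of what character-level n-gram features for surnames can look like, the sketch below extracts padded n-grams and assigns a surname to the origin whose n-gram profile overlaps it most. It is not the authors' pipeline or any of their models: the function names (`char_ngrams`, `build_profiles`, `classify`), the `_` boundary marker, the n-gram range, the scoring rule, and the two toy training surnames with their labels are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(surname, n_min=1, n_max=3):
    """Return the character n-grams of a surname, padded with '_' boundary markers."""
    padded = f"_{surname.lower()}_"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def build_profiles(labelled_surnames):
    """Accumulate one n-gram frequency profile (a Counter) per origin label."""
    profiles = {}
    for surname, origin in labelled_surnames:
        profiles.setdefault(origin, Counter()).update(char_ngrams(surname))
    return profiles


def classify(surname, profiles):
    """Pick the origin whose profile gives the surname's n-grams the highest relative frequency."""
    grams = char_ngrams(surname)

    def score(origin):
        profile = profiles[origin]
        total = sum(profile.values())
        # Sum of the profile counts of each n-gram, normalised by profile size.
        return sum(profile[g] for g in grams) / total

    return max(profiles, key=score)


# Toy, hand-labelled data invented for this sketch (not taken from the dataset).
toy_training = [("Silva", "portuguese"), ("Rossi", "italian")]
toy_profiles = build_profiles(toy_training)
predicted = classify("Silva", toy_profiles)
```

A real classifier would need far more training data, smoothing for unseen n-grams, and a held-out evaluation; the frequency-overlap score here is only the simplest rule that makes the feature representation concrete.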
