MRC: A Hybrid Relevance and Correlation Metric for Indicator Selection with Application to SINASC Data
Abstract
Variable selection is a critical step in data analysis and modeling, especially in domains with high-dimensional datasets such as healthcare. Addressing this challenge is particularly crucial in digital governance, where robust indicator prioritization underpins evidence-based policymaking. Traditional approaches often rely solely on statistical correlation or feature importance derived from predictive models, potentially overlooking semantic relevance. In this paper, we introduce the Metric of Relevance and Correlation (MRC), a novel hybrid approach that combines semantic relevance, statistical correlation, and predictive impact to identify the most relevant variables for a target outcome. We apply MRC to the Brazilian Live Birth Information System (SINASC) data to identify key factors associated with the 5-minute APGAR score, an important indicator of newborn health. Our results show that MRC provides a more stable variable ranking than traditional correlation-based and random forest-based methods, with superior performance across multiple stability metrics. MRC successfully identifies clinically relevant variables while discovering non-obvious relationships that traditional methods might miss. This approach offers a more comprehensive framework for variable selection, applicable across various domains requiring robust feature prioritization, and is particularly valuable for digital government initiatives in aiding the formulation of more effective public policies.
References
Alexopoulos, C., Lachana, Z., Androutsopoulou, A., Diamantopoulou, V., Charalabidis, Y., and Loutsaris, M. A. (2019). How machine learning is changing e-government. In Proceedings of the 12th International Conference on Theory and Practice of Electronic Governance, ICEGOV ’19, pages 354–363, New York, NY, USA. Association for Computing Machinery.
Apgar, V. (1953). A proposal for a new method of evaluation of the newborn infant. Anesthesia & Analgesia, 32(1):260–267.
DATASUS (2021). SINASC - sistema de informações sobre nascidos vivos. Ministério da Saúde.
de Winter, J. C., Gosling, S. D., and Potter, J. (2016). Choosing between Pearson and Spearman correlation coefficients to assess correlations in the context of exposure assessment comparison exercises. Journal of Exposure Science & Environmental Epidemiology, 26(5):530–530.
European Commission (2016). Big data analytics for policy making report. Technical report, European Commission, Interoperable Europe. Level of specialisation: Intermediate. Published under the Interoperability Solutions for European Public Administrations programme (2016-2020).
Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182.
Kennedy, J. J. (1970). The eta coefficient in complex ANOVA designs. Educational and Psychological Measurement, 30(4):885–889.
Lynn, L. E., Jr., Heinrich, C. J., and Hill, C. J. (2000). Studying governance and public management: Challenges and prospects. Journal of Public Administration Research and Theory, 10(2):233–262.
Moosazadeh, M., Ifaei, P., Tayerani Charmchi, A. S., Asadi, S., and Yoo, C. (2022). A machine learning-driven spatio-temporal vulnerability appraisal based on socio-economic data for COVID-19 impact prevention in the U.S. counties. Sustainable Cities and Society, 83:103990.
Parimala, K., Rajkumar, G., Ruba, A., and Vijayalakshmi, S. (2017). Challenges and opportunities with big data. International Journal of Scientific Research in Computer Science and Engineering, 5(5):16–20.
Rahutomo, F., Kitasuka, T., and Aritsugi, M. (2012). Semantic cosine similarity. In The 7th International Student Conference on Advanced Science and Technology (ICAST), Seoul, South Korea.
Saeys, Y., Inza, I., and Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517.
Schober, P., Boer, C., and Schwarte, L. A. (2018). Correlation coefficients: appropriate use and interpretation. Anesthesia & Analgesia, 126(5):1763–1768.
Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20-23.
Wan, J., Chen, H., Yuan, Z., Li, T., Yang, X., and Sang, B. (2021). A novel hybrid feature selection method considering feature interaction in neighborhood rough set. Knowledge-Based Systems, 227:107167.
Xu, W., Hou, Y., Hung, Y., and Zou, Y. (2013). A comparative analysis of Spearman’s rho and Kendall’s tau in normal and contaminated normal models. Signal Processing, 93(1):261–276.
Yang, C., Gu, M., and Albitar, K. (2024). Government in the digital age: Exploring the impact of digital transformation on governmental efficiency. Technological Forecasting and Social Change, 208:123722.
Published
2025-07-20
How to Cite
SILVA, Daniel de Amaral da; MARTINS NETO, José Luciano; FREITAS, Adilio J.; BRAGA, Antonio Rafael; GOMES, Danielo G. MRC: A Hybrid Relevance and Correlation Metric for Indicator Selection with Application to SINASC Data. In: LATIN AMERICAN SYMPOSIUM ON DIGITAL GOVERNMENT (LASDIGOV), 12., 2025, Maceió/AL. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 237-248. ISSN 2763-8723. DOI: https://doi.org/10.5753/lasdigov.2025.9303.
