Spending Segmentation and Outlier Detection in Brazilian Elections

  • Leandro G. C. Simoes, Instituto Tecnológico de Aeronáutica
  • Filipe A. N. Verri, Instituto Tecnológico de Aeronáutica
  • Takashi Yoneyama, Instituto Tecnológico de Aeronáutica


Political campaigns in Brazilian elections are financed mostly by public money. Every candidate must submit detailed accountability reports to the legal authorities, who must analyze them within a short time frame in search of possible fraud or suspicious transactions. In this work, we compiled a real data set from the 2016 Brazilian elections for all city councils in the state of São Paulo and used it to propose a framework for data segmentation analysis and validation. An exploratory data analysis is performed to determine the feature distributions and to define the required feature pre-processing tasks. A clustering analysis using the DBSCAN method is applied to a subset of the original data, focusing on segmenting the spending data on contracts with car fuel providers and on detecting potential outliers. Three clusters were identified, and a ridge regression model was used to evaluate which features were most important in defining the clusters. One cluster was related to candidates who received zero votes, and the remaining two separated suppliers according to whether or not their contracts were almost exclusively related to candidate spending on car fuel. The hyperparameters of the clustering analysis were validated using a bootstrap method, and the null hypothesis that the data set structure is random was rejected using a Monte Carlo approach.
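To make the density-based step concrete, the following is a minimal sketch of DBSCAN-style clustering with noise labeling, written in plain Python. It is not the authors' implementation: the two-dimensional points stand in for hypothetical scaled spending features, and the values of `eps` and `min_samples` are assumptions chosen only for this toy data. Points that are neither dense themselves nor reachable from a dense point are labeled as noise, which is how this family of methods surfaces potential outliers.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    # eps-neighborhood of every point (a point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        # i is a core point: start a new cluster and expand it.
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                frontier.extend(neighbors[j])  # core point: keep expanding
        cluster += 1
    return labels


def grid(cx, cy):
    """Hypothetical dense group: a 5 x 6 grid of points spaced 0.05 apart."""
    return [(cx + 0.05 * i, cy + 0.05 * j) for i in range(5) for j in range(6)]


# Toy data (assumed, not from the paper): two dense groups plus two
# isolated points that play the role of suspicious records.
points = grid(0.0, 0.0) + grid(3.0, 3.0) + [(10.0, -10.0), (-10.0, 10.0)]

labels = dbscan(points, eps=0.5, min_samples=5)
outliers = [p for p, label in zip(points, labels) if label == NOISE]

print(sorted(set(labels)))  # [-1, 0, 1]
print(outliers)             # [(10.0, -10.0), (-10.0, 10.0)]
```

In practice a library implementation such as scikit-learn's `DBSCAN` would normally be used instead of a hand-written loop; the sketch only shows how the neighborhood radius and the minimum neighbor count jointly decide which records end up in a cluster and which are flagged as noise.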

Keywords: Clustering algorithms, fraud detection, machine learning, outlier detection


SIMOES, Leandro G. C.; VERRI, Filipe A. N.; YONEYAMA, Takashi. Spending Segmentation and Outlier Detection in Brazilian Elections. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 8., 2020, Online Event. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 65-72. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2020.11960.