Detection of vulnerabilities in Ethereum smart contract bytecodes via CodeBERT embeddings

  • Pedro Henrique F. S. Oliveira UFJF
  • Heder S. Bernardino UFJF
  • Saulo Moraes Villela UFJF
  • Edelberto Franco Silva UFJF
  • Jairo Francisco de Souza UFJF
  • Alex B. Vieira UFJF

Abstract


Ethereum is a cryptocurrency platform that supports smart contracts, autonomous programs that run on a decentralized network. Vulnerabilities in these contracts pose significant financial and security risks to blockchain ecosystems, which motivates automating their detection. This work studies the detection of vulnerabilities in Ethereum smart contracts using embeddings derived from contract bytecode with CodeBERT. Embeddings are vector representations, generated by language models, that capture structural characteristics of the input text. These representations were used as input to logistic regression, decision tree, and random forest classifiers to identify which contracts are vulnerable. The results show that the embeddings carry information useful for distinguishing vulnerable from non-vulnerable contracts. The study also finds that altering the original data distribution during training significantly affects performance, highlighting the sensitivity of embedding-based approaches to sampling strategies.
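
To illustrate the point about sampling, the sketch below shows one way the class distribution of a training set can be altered while the held-out set keeps the original distribution. It is a minimal sketch under stated assumptions, not the procedure used in this work: the abstract does not name a sampling strategy, so random oversampling of the minority class is assumed here, and the labels, the 90/10 imbalance, and the helper names `stratified_split` and `oversample_minority` are hypothetical.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx), keeping each class's share in both splits."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def oversample_minority(indices, labels, seed):
    """Duplicate randomly chosen examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap up to the majority count.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical labels: 1 = vulnerable, 0 = non-vulnerable (made-up 90/10 imbalance).
labels = [0] * 90 + [1] * 10

train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
# Only the training indices are resampled; the test split is left untouched.
balanced_idx = oversample_minority(train_idx, labels, seed=0)

train_counts = Counter(labels[i] for i in train_idx)
balanced_counts = Counter(labels[i] for i in balanced_idx)
test_counts = Counter(labels[i] for i in test_idx)

print("train (original):", dict(train_counts))
print("train (oversampled):", dict(balanced_counts))
print("test (untouched):", dict(test_counts))
```

Resampling only the training indices keeps the evaluation on the original distribution, so any change in measured performance can be attributed to what the classifier saw during training.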

Published
2026-05-25
OLIVEIRA, Pedro Henrique F. S.; BERNARDINO, Heder S.; VILLELA, Saulo Moraes; SILVA, Edelberto Franco; SOUZA, Jairo Francisco de; VIEIRA, Alex B. Detection of vulnerabilities in Ethereum smart contract bytecodes via CodeBERT embeddings. In: BRAZILIAN SYMPOSIUM ON COMPUTER NETWORKS AND DISTRIBUTED SYSTEMS (SBRC), 44., 2026, Praia do Forte/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 631-644. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc.2026.19804.
