Investigating vulnerability datasets

  • Rodrigo Andrade (UFAPE)
  • Vinícius Santos (UFAPE)


Insecure software can cause severe damage to user experience and privacy. Therefore, developers should be able to prevent software vulnerabilities. However, detecting such problems is expensive and time-consuming. To mitigate this issue, researchers have proposed vulnerability datasets that make it easier to investigate the properties of vulnerabilities. In this work, we investigate one such dataset to better understand common vulnerabilities, the authors who introduce them into open-source projects, and the properties of the associated commits. We use the Big-Vul dataset as a case study to help us answer the six research questions we define for this work. Our preliminary results indicate that the most common vulnerabilities occur in the Chromium project. Furthermore, the authors responsible for introducing these vulnerabilities are mostly experienced. Finally, we conclude that such findings could help developers detect vulnerabilities.
Keywords: Vulnerability, Datasets, Commits, CVE


How to Cite

ANDRADE, Rodrigo; SANTOS, Vinícius. Investigating vulnerability datasets. In: WORKSHOP DE VISUALIZAÇÃO, EVOLUÇÃO E MANUTENÇÃO DE SOFTWARE (VEM), 9., 2021, Joinville. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 26-30. DOI: