VulnSyncAI: NLP and LLMs for Construction and Continuous Updating of Vulnerability Datasets

Abstract


Building and maintaining up-to-date vulnerability datasets is hindered by a lack of standardization across sources and by the need for automation. In this work, we present VulnSyncAI, a modular tool that uses Natural Language Processing (NLP) and Large Language Models (LLMs) to correlate information from multiple sources, keeping the resulting datasets current and relevant. By automating dataset construction and making it more efficient to build representative datasets, VulnSyncAI enhances the effectiveness of AI models in threat detection.

Keywords: VulnSyncAI, NLP, LLMs, Vulnerability Datasets, Risk Analysis, Automation, Software Security, Threat Detection, Data Correlation, Vulnerability Sources
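
The idea of correlating information from multiple sources can be illustrated with a minimal sketch. This is not VulnSyncAI's implementation: the record layout, the source names (`source_a`, `source_b`), and the helper functions `extract_cve_ids` and `correlate` are assumptions made for illustration only, and a simple regular expression stands in for the NLP and LLM components described in the abstract.

```python
import re
from collections import defaultdict

# CVE identifiers follow the pattern CVE-<year>-<sequence of 4+ digits>.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def extract_cve_ids(text):
    """Return the sorted, upper-cased, de-duplicated CVE IDs found in text."""
    return sorted({match.upper() for match in CVE_PATTERN.findall(text)})


def correlate(records):
    """Group records from different sources under each CVE ID they mention.

    Each record is a dict with hypothetical keys "source" and "text".
    """
    merged = defaultdict(lambda: {"sources": [], "descriptions": []})
    for record in records:
        for cve_id in extract_cve_ids(record["text"]):
            entry = merged[cve_id]
            if record["source"] not in entry["sources"]:
                entry["sources"].append(record["source"])
            entry["descriptions"].append(record["text"])
    return dict(merged)


# Illustrative input: two hypothetical sources mentioning the same CVE.
records = [
    {"source": "source_a", "text": "CVE-2021-44228: remote code execution in Log4j."},
    {"source": "source_b", "text": "Advisory covering cve-2021-44228 and CVE-2021-45046."},
]
dataset = correlate(records)
```

Keying on the CVE identifier is only one possible correlation strategy; sources that do not carry such an identifier would need a text-similarity or model-based matching step instead.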

Published
2025-05-19
FIDELES, Douglas Rodrigues; LAUTERT, Douglas Paim; KREUTZ, Diego; QUINCOZES, Silvio Ereno. VulnSyncAI: NLP and LLMs for Construction and Continuous Updating of Vulnerability Datasets. In: DEMO SESSION - BRAZILIAN SYMPOSIUM ON COMPUTER NETWORKS AND DISTRIBUTED SYSTEMS (SBRC), 43., 2025, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 139-150. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc_estendido.2025.7880.