YACSDB-NER: Yet Another Cybersecurity Database for Named Entity Recognition Task
Resumo
The increasing number of cybersecurity reports poses a challenge to efficiently retrieving and sharing Cyber Threat Intelligence. However, publicly available cybersecurity datasets for Natural Language Processing (NLP) remain scarce, hindering progress in automated intelligence production. To tackle this challenge, this article presents Yet Another Cybersecurity Database (YACSDB), a dataset designed to enhance Named Entity Recognition (NER) using Structured Threat Information Expression (STIX) entities for interoperability. Our pipeline extracts STIX Domain Objects from unstructured reports, leveraging Google’s Gemini and Bidirectional Encoder Representations from Transformers (BERT) model to assist in labeling and reduce resource needs. The dataset uses Inside–Outside–Beginning (IOB) notation to facilitate fine-tuning in sequence tagging tasks. Reports were selected for representativeness across different years. To the best of our knowledge, it is among the largest cybersecurity NER dataset with temporal information annotated by a single machine-assisted annotator. To evaluate the dataset, we fine-tuned seven BERT models to demonstrate its effectiveness for NER. The results emphasize the importance of domain-specific datasets in cybersecurity NLP and highlight key challenges. YACSDB serves as a benchmark for model comparison, solution development and knowledge graph generation. It is publicly available to foster future research in cybersecurity NLP.
Publicado
29/09/2025
Como Citar
MAIA, Yuri do Amaral Nobre; ALBUQUERQUE, Robson de Oliveira.
YACSDB-NER: Yet Another Cybersecurity Database for Named Entity Recognition Task. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35. , 2025, Fortaleza/CE.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação,
2025
.
p. 559-574.
ISSN 2643-6264.
