NBID Dataset: Towards Robust Information Extraction in Official Documents

  • Lucas Wojcik, UFPR
  • Luiz Coelho, unico - idTech
  • Roger Granada, unico - idTech
  • Gustavo Führ, unico - idTech
  • David Menotti, UFPR


The Visual Document Understanding (VDU) task is of great interest to a variety of organizations, including banks, governments, and schools, all of which would benefit from reliable automatic information extraction from pictures of documents. However, due to the sensitive nature of the data, creating new datasets for official documents, such as identity cards and passports, is very challenging, as the data must first be safely anonymized and synthesized. This process requires modifying the source images, which may affect the performance of VDU models. In this paper, we propose a new dataset and the synthesizer used to generate it, both made publicly available. We also select three state-of-the-art VDU models (PICK, StrucTexT, and DocFormer) and evaluate them on the dataset in order to study the impact of the synthetic data on performance. We train the models under both synthetic-only and synthetic-plus-real data protocols and report the results for both protocols. The results show that our synthesis process benefits training when the synthetic data is used in addition to the real data.
WOJCIK, Lucas; COELHO, Luiz; GRANADA, Roger; FÜHR, Gustavo; MENOTTI, David. NBID Dataset: Towards Robust Information Extraction in Official Documents. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 36., 2023, Rio Grande/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 145-150.