NBID Dataset: Towards Robust Information Extraction in Official Documents

Lucas Wojcik; Luiz Coelho; Roger Granada; Gustavo Führ; David Menotti

Lucas Wojcik, UFPR
Luiz Coelho, unico - idTech
Roger Granada, unico - idTech
Gustavo Führ, unico - idTech
David Menotti, UFPR

Abstract

Visual Document Understanding (VDU) is of great interest to a variety of organizations, including banks, governments, and schools, all of which would benefit from reliable automatic information extraction from pictures of documents. However, due to the sensitive nature of the data, creating new datasets of official documents, such as identity cards and passports, is very challenging, as the data must first be safely anonymized and synthesized. This process requires modifying the source images, which may affect the performance of VDU models. In this paper, we propose a new dataset and the synthesizer used to generate it, both of which are made publicly available. We also evaluate three state-of-the-art VDU models (PICK, StrucTexT, and DocFormer) on the dataset in order to study the impact of synthetic data on performance. We train the models under both synthetic-only and synthetic-plus-real data protocols and present the results for both. Our synthesis process is shown to benefit training when used as an addition to the real data.