BID Dataset: a challenge dataset for document processing tasks

  • Álysson de Sá Soares UPE
  • Ricardo Batista das Neves Junior UPE
  • Byron Leite Dantas Bezerra UPE


The digital relationship between companies and customers takes place through online systems in which consumers must upload pictures of their identification documents to prove their identities. The resulting large volume of document images encourages research on image processing systems that automate tasks usually performed by humans, such as Document Type Classification and Document Reading. The lack of public identification document datasets slows research in document image processing, because researchers must either seek partnerships with private or governmental institutions to obtain the data or build their own datasets. In this context, the main contributions of this work are a system to support the automatic creation of public identification document datasets and the Brazilian Identity Document Dataset (BID Dataset), the first public dataset of Brazilian identification documents. To comply with current personal data privacy law, all information in the BID Dataset comes from fake data. This work aims to accelerate research in identification document image processing, since researchers will be able to use the BID Dataset freely in their own work.


SOARES, Álysson de Sá; DAS NEVES JUNIOR, Ricardo Batista; BEZERRA, Byron Leite Dantas. BID Dataset: a challenge dataset for document processing tasks. In: WORKSHOP DE TRABALHOS EM ANDAMENTO - CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 33., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 143-146. DOI: