An Optical Character Recognition Post-processing Method for technical documents

Lucas Viana da Silva; Paulo Lilles Jorge Drews Junior; Sílvia Silva da Costa Botelho

doi:10.5753/sibgrapi.est.2023.27464

Lucas Viana da Silva, FURG
Paulo Lilles Jorge Drews Junior, FURG
Sílvia Silva da Costa Botelho, FURG

Abstract

Methods for correcting errors generated by Optical Character Recognition (OCR) systems have been developed for a long time, with interesting results in their applications. However, these methods tend to work only on data whose words belong to an existing language and whose text has strong semantic relationships between words. In this work, we propose an error correction method for documents that lack these strong semantic relationships: sparse text in which the words have little semantic relationship to one another. The proposed method uses machine learning to train classifiers capable of finding errors in the OCR output, and then runs an isolated execution of the OCR system to fix each detected error. The final results indicate an error detection accuracy of 88.24% and an improvement in the character error rate (CER) of 14.2%.
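As an illustration of the metric reported above, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein (edit) distance between the OCR output and the ground truth, divided by the length of the ground truth. This is the conventional definition and is an assumption here; the abstract does not state the exact formula or normalization used. The function names `levenshtein` and `cer` and the sample string are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of an OCR hypothesis against a reference.
    Assumes the conventional definition: edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical alphanumeric code, typical of sparse technical text:
# the OCR output confuses the digit 0 with the letter O.
truth = "VALVE-102A"
ocr_output = "VALVE-1O2A"
print(cer(truth, ocr_output))
```

In sparse text such as alphanumeric codes, a dictionary lookup cannot tell that `1O2A` is wrong, which is why a character-level metric like this is a natural way to measure the effect of a correction step.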
