sOCRates - a post-OCR text correction method
Abstract
A significant portion of the textual information of interest to an organization is stored in PDF files, which must be converted into plain text before their contents can be processed by an information retrieval or text mining system. When the PDFs are scanned documents, optical character recognition (OCR) is typically used to extract the text. OCR errors can degrade the quality of information retrieval systems, since the terms in a query will not match incorrectly extracted terms in the documents. This work introduces sOCRates, a post-OCR text correction method that relies on contextual word embeddings and on a classifier that uses format, semantic, and syntactic features. Our experimental evaluation on a test collection in Portuguese showed that sOCRates can accurately correct errors and improve retrieval results.
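The term-mismatch problem described above can be illustrated with a minimal sketch. It is not part of sOCRates: the toy inverted index, the function names `build_index` and `search`, the sample sentences, and the "m" read as "rn" confusion are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased whitespace token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (exact match only)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# The same hypothetical sentence, once clean and once with an OCR-style error.
clean_docs = ["o modelo corrige erros de texto"]
ocr_docs = ["o rnodelo corrige erros de texto"]  # assumed confusion: "m" read as "rn"

print(search(build_index(clean_docs), "modelo"))  # {0}
print(search(build_index(ocr_docs), "modelo"))    # set()
```

With exact term matching, the document is retrieved from the clean text but missed in the OCR version, which is the kind of loss a post-OCR correction step aims to recover.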
Keywords:
Post-OCR text correction, Information Retrieval
Published
2021-10-04
How to Cite
SUAREZ VARGAS, Danny; LIMA DE OLIVEIRA, Lucas; P. MOREIRA, Viviane; TORRESAN BAZZO, Guilherme; ACAUAN LORENTZ, Gustavo. sOCRates - a post-OCR text correction method. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 36., 2021, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 61-72. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2021.17866.