Combination of Optical Character Recognition Engines for Documents Containing Sparse Text and Alphanumeric Codes
Abstract
Many companies that buy machines, parts, or tools retain documents such as notes, receipts, forms, and instruction manuals over the years, and may eventually need to digitize this accumulated material. When optical character recognition (OCR) systems are applied to these documents, two main difficulties arise. The first is locating text that is sparse and noncontinuous on the page; the second is recognizing tokens that resemble codes more than words in human language. Although the literature contains many works on sparse text, such as forms and tables, little attention is usually given to codes, for which dictionaries cannot be relied upon, or to both problems together. To address this issue without gathering extensive databases or training and developing new models, this work proposes taking advantage of pre-trained OCR models, such as those of the Tesseract engine or Google Cloud’s Vision API. To that end, we explore combination strategies, including a new one based on the median string. In the experiments, our string-combination method improved on the best individual engine performance by up to 3.09% in character accuracy and 1.16% in word accuracy.
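As a rough illustration of the median-string idea mentioned in the abstract, the sketch below picks, from the outputs of several OCR engines for the same text region, the candidate with the smallest total edit distance to all the others (the "set median"). This is an assumption-laden simplification and not the authors' method: the abstract does not specify the distance, the search procedure, or how accuracy is computed. The function names `levenshtein`, `set_median`, and `char_accuracy` are hypothetical, and the accuracy formula (one minus edit distance over reference length) is just one common definition.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def set_median(candidates: Sequence[str]) -> str:
    """Return the candidate minimizing the summed distance to all candidates.

    Assumption: the search is restricted to the candidates themselves; a
    generalized median string could be a string that no engine produced.
    Ties are resolved in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return min(candidates,
               key=lambda c: sum(levenshtein(c, other) for other in candidates))


def char_accuracy(predicted: str, reference: str) -> float:
    """Assumed definition: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, reference) / len(reference))


if __name__ == "__main__":
    # Hypothetical outputs of three engines for the same alphanumeric code.
    outputs = ["AB-1234", "A8-1234", "AB-1234"]
    print(set_median(outputs))
```

Because the selection relies only on agreement between engines, it does not need a dictionary, which is why this family of methods suits codes that are not words in any human language.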
Keywords:
Training, Graphics, Codes, Manuals, Optical imaging, Adaptive optics, Internet, optical character recognition, classifier combination, pattern recognition, tesseract, median string
Published
18/10/2021
How to Cite
CORREA, Iago Lourenço; DREWS, Paulo Lilles Jorge; RODRIGUES, Ricardo Nagel. Combination of Optical Character Recognition Engines for Documents Containing Sparse Text and Alphanumeric Codes. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 34., 2021, Online. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021.