Creating Resources and Evaluating the Impact of OCR Quality on Information Retrieval: A Case Study in the Geoscientific Domain

  • Lucas Lima de Oliveira, Universidade Federal do Rio Grande do Sul
  • Viviane P. Moreira, Universidade Federal do Rio Grande do Sul

Abstract


The evaluation paradigm in Information Retrieval (IR) requires a test collection with documents, queries, and relevance judgments. Creating such collections demands significant human effort, mainly to provide the relevance judgments. As a result, many domains and languages still lack a proper evaluation testbed. To bridge this gap, we developed REGIS (Retrieval Evaluation for Geoscientific Information Systems), a test collection for the geoscientific domain in Portuguese. The documents in REGIS are PDF files. Optical Character Recognition (OCR) is typically used to extract the textual contents of scanned texts. The output of OCR can be noisy, especially when the quality of the scanned image is poor, which in turn can impact downstream tasks such as Information Retrieval. This work evaluates the impact of OCR extraction and correction on IR. Our results show significant differences in IR metrics across the different digitization methods.
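
As a rough illustration of how a test collection supports this kind of comparison, the sketch below scores two ranked runs against the same relevance judgments. It is not the evaluation code used in this work: the choice of mean average precision as the metric, the function names, and the toy queries, document ids, and rankings are all assumptions made for the example.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant doc ids."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(run, qrels):
    """Mean of average precision over every query that has judgments."""
    scores = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy, made-up relevance judgments: query id -> set of relevant document ids.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}

# Hypothetical rankings obtained after indexing two different digitizations
# of the same documents (for instance, cleaner text versus noisier OCR output).
run_clean = {"q1": ["d1", "d3", "d4"], "q2": ["d2", "d5"]}
run_noisy = {"q1": ["d4", "d1", "d3"], "q2": ["d5", "d2"]}

map_clean = mean_average_precision(run_clean, qrels)
map_noisy = mean_average_precision(run_noisy, qrels)
print(f"clean: {map_clean:.3f}  noisy: {map_noisy:.3f}")
```

Because the queries and judgments are held fixed, any difference between the two scores can be attributed to the text that was indexed, which is what makes a shared test collection useful for comparing digitization methods.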
Keywords: information retrieval, test collection, OCR errors

Published
14/10/2024
LIMA DE OLIVEIRA, Lucas; P. MOREIRA, Viviane. Creating Resources and Evaluating the Impact of OCR Quality on Information Retrieval: A Case Study in the Geoscientific Domain. In: CONCURSO DE TESES E DISSERTAÇÕES (CTDBD) - SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 202-206. DOI: https://doi.org/10.5753/sbbd_estendido.2024.241190.