Pipeline for identifying lexical errors and generating correction suggestions

Luana Q. Garcia; Miguel H. Chinellato; Helena de M. Caseli; Leandro H. M. Oliveira

doi:10.5753/stil.2023.234034

Luana Q. Garcia UFSCar http://orcid.org/0009-0006-7281-5649
Miguel H. Chinellato UFSCar https://orcid.org/0009-0009-8898-4940
Helena de M. Caseli UFSCar https://orcid.org/0000-0003-3996-8599
Leandro H. M. Oliveira Embrapa http://orcid.org/0000-0002-5628-3682

Abstract

In NLP, texts are the main source of information for building computational models with machine learning. To be useful for learning, however, these texts must correctly represent the phenomenon to be learned, and lexical errors can compromise that. This paper proposes a pipeline for text preparation and/or correction that identifies several categories of lexical errors. The pipeline aims to identify, annotate, and categorize the errors in the texts and to suggest corrections automatically.

Keywords: Preprocessing, Pipeline, Natural Language Processing, Text correction, Lexical error identification
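
The four stages named in the abstract (identify, annotate, categorize, suggest) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy `LEXICON`, the category names returned by `categorize`, and the use of `difflib` string similarity to rank suggestions are all hypothetical stand-ins.

```python
import difflib
import re
from dataclasses import dataclass

# Hypothetical toy lexicon; a real pipeline would load a full dictionary.
LEXICON = {"the", "text", "contains", "some", "errors", "lexical"}


@dataclass
class Annotation:
    token: str         # the token as it appears in the text
    start: int         # character offset of the token
    category: str      # illustrative error category (assumed, not from the paper)
    suggestions: list  # candidate corrections, best match first


def categorize(token: str) -> str:
    """Assign an illustrative category; these labels are assumptions."""
    if re.search(r"(.)\1\1", token):
        return "repeated_character"
    if any(ch.isdigit() for ch in token):
        return "alphanumeric_noise"
    return "unknown_word"


def annotate(text: str, lexicon=LEXICON) -> list:
    """Identify out-of-lexicon tokens, categorize them, and suggest fixes."""
    annotations = []
    for match in re.finditer(r"\w+", text):
        token = match.group(0)
        lower = token.lower()
        if lower in lexicon:
            continue  # known word: nothing to annotate
        # Rank lexicon entries by string similarity to the unknown token.
        suggestions = difflib.get_close_matches(
            lower, sorted(lexicon), n=3, cutoff=0.7
        )
        annotations.append(
            Annotation(token, match.start(), categorize(lower), suggestions)
        )
    return annotations


if __name__ == "__main__":
    for ann in annotate("The text contians some errrors"):
        print(ann)
```

Keeping the annotations separate from the text, instead of rewriting it in place, means the same output can serve either as a report for a human reviewer or as input to an automatic correction step.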
