Pipeline for identification of lexical errors and generation of correction suggestions

Abstract


In Natural Language Processing, texts are the main source of information for building computational models with machine learning. To be useful in the learning process, however, these texts must correctly represent the phenomenon to be learned, and lexical errors can compromise that representation. This article proposes a text preparation and correction pipeline that identifies several categories of lexical errors. The pipeline aims to identify, annotate, and categorize the errors contained in the texts, and to automatically suggest corrections.
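As a rough illustration of the identify, annotate, categorize, and suggest steps described above, the following Python sketch flags tokens missing from a lexicon and proposes close matches. It is not the authors' implementation: the toy vocabulary, the category labels (`repeated_characters`, `misspelling`, `unknown_word`), the function names, and the similarity cutoff are all assumptions chosen for the example, and string similarity from the standard library stands in for whatever suggestion method the pipeline actually uses.

```python
import re
from difflib import get_close_matches

# Hypothetical toy vocabulary; a real pipeline would load a full lexicon.
VOCABULARY = {"the", "pipeline", "identifies", "lexical", "errors", "in", "texts"}


def categorize(token, suggestions):
    """Assign an illustrative category; these labels are assumptions."""
    if re.search(r"(.)\1\1", token):  # same character three times in a row
        return "repeated_characters"
    if suggestions:  # close to a known word
        return "misspelling"
    return "unknown_word"


def annotate(text, vocabulary=VOCABULARY):
    """Return one annotation per alphabetic token not found in the vocabulary."""
    annotations = []
    for match in re.finditer(r"[^\W\d_]+", text):  # runs of letters only
        token = match.group().lower()
        if token in vocabulary:
            continue
        # Rank candidate corrections by string similarity (assumed cutoff 0.8).
        suggestions = get_close_matches(token, sorted(vocabulary), n=3, cutoff=0.8)
        annotations.append({
            "token": match.group(),
            "start": match.start(),
            "category": categorize(token, suggestions),
            "suggestions": suggestions,
        })
    return annotations


if __name__ == "__main__":
    for item in annotate("The pipline identifies lexicall errors in texts"):
        print(item)
```

Keeping the character offset with each annotation means suggestions can be reviewed or applied later without modifying the original text, which is one plausible way to keep annotation and correction as separate steps.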

Keywords: Pre-processing, Pipeline, Natural Language Processing, Text correction, Identification of lexical errors

Published: 2023-09-25
GARCIA, Luana Q.; CHINELLATO, Miguel H.; CASELI, Helena de M.; OLIVEIRA, Leandro H. M. Pipeline for identification of lexical errors and generation of correction suggestions. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 357-361. DOI: https://doi.org/10.5753/stil.2023.234034.