Pipeline for identification of lexical errors and generation of correction suggestions
Abstract
In Natural Language Processing, texts are the main source of information for building computational models with machine learning. To be useful in the learning process, however, these texts must correctly represent the phenomenon to be learned, and lexical errors can compromise that representation. This article proposes a text preparation and correction pipeline that identifies several categories of lexical errors. The pipeline aims to identify, annotate, and categorize the errors contained in the texts, and to automatically suggest corrections.
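The identify, annotate, categorize, and suggest steps can be illustrated with a minimal sketch. It is not the article's implementation: the toy lexicon, the category names (`repeated_character`, `alphanumeric_mix`, `unknown_word`), and the `Annotation` structure are assumptions made for illustration, and Python's standard `difflib` stands in for whatever suggestion method the pipeline actually uses.

```python
import difflib
import re
from dataclasses import dataclass, field

# Hypothetical toy lexicon; a real pipeline would load a full dictionary.
LEXICON = {"the", "model", "learns", "from", "clean", "text"}


@dataclass
class Annotation:
    token: str          # the token as it appears in the text
    start: int          # character offset of the token
    category: str       # illustrative error category (assumption)
    suggestions: list = field(default_factory=list)


def categorize(token):
    """Assign an illustrative category to an out-of-lexicon token."""
    if re.search(r"(.)\1\1", token):
        # Same character three or more times in a row, e.g. "cleeean".
        return "repeated_character"
    if any(ch.isdigit() for ch in token) and any(ch.isalpha() for ch in token):
        # Letters and digits mixed, e.g. "t3xt".
        return "alphanumeric_mix"
    return "unknown_word"


def annotate(text, lexicon=LEXICON):
    """Identify out-of-lexicon tokens, categorize them, and suggest fixes."""
    candidates = sorted(lexicon)
    annotations = []
    for match in re.finditer(r"\w+", text):
        token = match.group().lower()
        if token in lexicon:
            continue
        # Rank lexicon entries by string similarity to the erroneous token.
        suggestions = difflib.get_close_matches(token, candidates, n=3, cutoff=0.7)
        annotations.append(
            Annotation(match.group(), match.start(), categorize(token), suggestions)
        )
    return annotations


if __name__ == "__main__":
    for ann in annotate("The modle learns from cleeean text"):
        print(ann)
```

Keeping each annotation as a record with the offset, category, and suggestions leaves the original text untouched, so a correction can be reviewed before it is applied.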
