Automatic label error detection on text datasets labeled with data programming
Abstract
Supervised machine learning relies on the availability of large volumes of accurately annotated data, a requirement that data programming (DP) alleviates by aggregating weak supervision sources into probabilistic labels. However, DP-generated labels remain susceptible to noise, compromising downstream model performance. In this work, we integrate and evaluate four Automatic Error Detection (AED) techniques within standard two-stage DP pipelines: Retag, Confident Learning, Source-aware Influence Functions, and Unsupervised Labeling Function Correction (ULF). Using the WRENCH benchmark on two text-classification tasks (YouTube and SMS spam detection), we optimize each pipeline via Bayesian optimization and report Matthews correlation coefficient (MCC), accuracy, F1 score, and computational cost. Our experiments show that Influence Functions combined with the Hyper Label Model achieve the most favorable trade-off between accuracy improvement and runtime on balanced data, while simpler DP baselines outperform all analyzed AED methods under class imbalance. These findings underscore both the promise and the practical limitations of AED in refining weakly supervised workflows, guiding future development of cost-effective label-noise mitigation strategies.
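To make the evaluated setup concrete, the sketch below outlines a two-stage DP pipeline with Confident Learning as the AED step. It assumes a WRENCH-style labeling-function matrix (one column per labeling function, -1 for abstain), Snorkel's LabelModel for label aggregation, cleanlab for Confident Learning, and a TF-IDF plus logistic-regression end model; the variable names and hyperparameters are illustrative placeholders, not the paper's exact configuration.

```python
# Minimal sketch: two-stage DP pipeline with Confident Learning as the AED step.
# Assumes a WRENCH-style label matrix `L` (n_examples x n_LFs, -1 = abstain) and
# raw training `texts`; the end model and all hyperparameters are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import matthews_corrcoef, accuracy_score, f1_score
from snorkel.labeling.model import LabelModel          # stage 1: label model
from cleanlab.filter import find_label_issues          # AED: Confident Learning


def run_pipeline(L, texts, test_texts, y_test):
    # Stage 1: aggregate labeling-function votes into (probabilistic) labels.
    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L_train=L, n_epochs=500, seed=0)
    y_weak = label_model.predict(L, tie_break_policy="random")

    # Featurize the text for the end model.
    vec = TfidfVectorizer(min_df=2)
    X = vec.fit_transform(texts)

    # AED step: flag probable label errors with Confident Learning, using
    # out-of-sample predicted probabilities from cross-validation.
    pred_probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y_weak, cv=5, method="predict_proba"
    )
    issues = find_label_issues(labels=y_weak, pred_probs=pred_probs)
    keep = ~issues  # simplest policy: drop flagged examples (a relabeling
                    # variant would instead assign them the model's predictions)

    # Stage 2: train the end model on the cleaned weak labels and evaluate.
    end_model = LogisticRegression(max_iter=1000).fit(X[keep], y_weak[keep])
    y_pred = end_model.predict(vec.transform(test_texts))
    return {
        "mcc": matthews_corrcoef(y_test, y_pred),
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
```

Swapping the aggregation step for the Hyper Label Model, or replacing Confident Learning with one of the other AED techniques, yields the remaining pipeline variants discussed in the abstract.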
References
Alberto, T. and Lochter, J. (2015). YouTube Spam Collection. UCI Machine Learning Repository. DOI: 10.24432/C58885.
Almeida, T. and Hidalgo, J. (2011). SMS Spam Collection. UCI Machine Learning Repository. DOI: 10.24432/C5CC84.
Awasthi, A., Ghosh, S., Goyal, R., and Sarawagi, S. (2020). Learning from rules generalizing labeled exemplars. In International Conference on Learning Representations.
Chicco, D. and Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21:1–13.
Dawid, A. P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28.
Fu, D. Y., Chen, M. F., Sala, F., Hooper, S. M., Fatahalian, K., and Ré, C. (2020). Fast and three-rious: Speeding up weak supervision with triplet methods.
George, T., Nodet, P., Bondu, A., and Lemaire, V. (2024). Mislabeled examples detection viewed as probing machine learning models: concepts, survey and extensive benchmark. arXiv preprint arXiv:2410.15772.
Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, Inc., 2nd edition.
Klie, J.-C., Webber, B., and Gurevych, I. (2022). Annotation error detection: Analyzing the past and present for a more coherent future.
Nogueira, F. (2014–). Bayesian Optimization: Open source constrained global optimization tool for Python.
Northcutt, C. G., Jiang, L., and Chuang, I. L. (2022). Confident learning: Estimating uncertainty in dataset labels.
Ratner, A., Hancock, B., Dunnmon, J., Goldman, R., and Ré, C. (2018). Snorkel metal: Weak supervision for multi-task learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, DEEM’18, New York, NY, USA. Association for Computing Machinery.
Ratner, A., De Sa, C., Wu, S., Selsam, D., and Ré, C. (2017). Data programming: Creating large training sets, quickly.
Sedova, A. and Roth, B. (2023). ULF: Unsupervised labeling function correction using cross-validation for weak supervision. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4162–4176. Association for Computational Linguistics.
Team, S. (2019). spam. GitHub.
van Halteren, H. (2000). The detection of inconsistency in manually tagged text. In Abeille, A., Brants, T., and Uszkoreit, H., editors, Proceedings of the COLING-2000 Workshop on Linguistically Interpreted Corpora, pages 48–55, Centre Universitaire, Luxembourg. International Committee on Computational Linguistics.
Wu, R., Chen, S.-E., Zhang, J., and Chu, X. (2023). Learning hyper label model for programmatic weak supervision. In The Eleventh International Conference on Learning Representations.
Zhang, J., Hsieh, C.-Y., Yu, Y., Zhang, C., and Ratner, A. J. (2022a). A survey on programmatic weak supervision. ArXiv, abs/2202.05433.
Zhang, J., Wang, H., Hsieh, C.-Y., and Ratner, A. (2022b). Understanding programmatic weak supervision via source-aware influence function.
Zhang, J., Yu, Y., Li, Y., Wang, Y., Yang, Y., Yang, M., and Ratner, A. (2021). WRENCH: A comprehensive benchmark for weak supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Zhou, Z.-H. (2017). A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53.
Zhu, Z., Dong, Z., Cheng, H., and Liu, Y. (2021). A good representation detects noisy labels. ArXiv, abs/2110.06283.
Published: 2025-09-29
How to Cite
LEAL, Nalbert G. M.; ARAÚJO, Daniel S. A.; MENEZES NETO, Elias J. Automatic label error detection on text datasets labeled with data programming. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 903-914. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.14268.
