A Robustness Analysis of Automated Essay Scoring Methods

Abstract

This paper analyzes the robustness of a state-of-the-art Automated Essay Scoring (AES) model by applying various linguistically motivated perturbations to the Essay-BR corpus. Our findings reveal that the AES model fails to detect these adversarial modifications, often assigning higher scores to the perturbed essays than to the original ones.
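For intuition, the sketch below shows the general shape of such a robustness check, assuming a hypothetical score_essay function standing in for the trained AES model and sentence shuffling as one illustrative perturbation; the paper's actual perturbation set and scoring pipeline are not reproduced here.

    import random

    def shuffle_sentences(essay: str, seed: int = 0) -> str:
        # One example perturbation: randomly reorder sentences.
        # Naive split on "."; real experiments would use a proper
        # sentence tokenizer for Portuguese.
        sentences = [s.strip() for s in essay.split(".") if s.strip()]
        rng = random.Random(seed)
        rng.shuffle(sentences)
        return ". ".join(sentences) + "."

    def failure_rate(essays, score_essay):
        # Fraction of essays where the perturbed version scores at
        # least as high as the original. score_essay is a placeholder
        # for any AES model's scoring function (e.g., a fine-tuned
        # BERT regressor).
        failures = 0
        for essay in essays:
            if score_essay(shuffle_sentences(essay)) >= score_essay(essay):
                failures += 1
        return failures / len(essays)

A failure rate near 1.0 on a coherence-breaking perturbation like this would mirror the finding above: the model scores perturbed essays at least as high as their originals.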
Keywords: Automated Essay Scoring, Robustness, Adversarial Essays

References

Barzilay, R. and Lapata, M. (2008). Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.

Beigman Klebanov, B. and Madnani, N. (2020). Automated evaluation of writing – 50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7796–7810, Online. Association for Computational Linguistics.

Cohen, J. (1968). Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220.

de Sousa, R. F., Marinho, J. C., Neto, F. A. R., Anchiêta, R. T., and Moura, R. S. (2024). PiLN at PROPOR: A BERT-based strategy for grading narrative essays. In Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2, pages 10–13, Santiago de Compostela, Galicia/Spain. Association for Computational Linguistics.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Higgins, D. and Heilman, M. (2014). Managing what we can measure: Quantifying the susceptibility of automated scoring systems to gaming behavior. Educational Measurement: Issues and Practice, 33(3):36–46.

Kabra, A., Bhatia, M., Singla, Y. K., Jessy Li, J., and Ratn Shah, R. (2022). Evaluation toolkit for robustness testing of automatic essay scoring systems. In Proceedings of the 5th Joint International Conference on Data Science & Management of Data, pages 90–99, Bangalore, India. Association for Computing Machinery.

Liu, R., Wang, X., Liu, J., and Zhou, J. (2024). A comprehensive analysis of evaluating robustness and generalization ability of models in AES. In Proceedings of the 7th International Symposium on Big Data and Applied Statistics, pages 1–5, Beijing, China. IOP Publishing.

Marinho, J. C., Anchiêta, R. T., and Moura, R. S. (2021). Essay-BR: a Brazilian corpus of essays. In XXXIV Simpósio Brasileiro de Banco de Dados: Dataset Showcase Workshop, SBBD 2021, pages 53–64, Online. SBC.

Marinho, J. C., Anchiêta, R. T., and Moura, R. S. (2022a). Essay-BR: a Brazilian corpus to automatic essay scoring task. Journal of Information and Data Management, 13(1):65–76.

Marinho, J. C., C., F., Anchiêta, R. T., and Moura, R. S. (2022b). Automated essay scoring: An approach based on ENEM competencies. In Anais do XIX Encontro Nacional de Inteligência Artificial e Computacional, pages 49–60, Campinas, Brazil. SBC.

Mello, R. F., Oliveira, H., Wenceslau, M., Batista, H., Cordeiro, T., Bittencourt, I. I., and Isotani, S. (2024). PROPOR'24 competition on automatic essay scoring of Portuguese narrative essays. In Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2, pages 1–5, Santiago de Compostela, Galicia/Spain. Association for Computational Linguistics.

Oliveira, H., Ferreira Mello, R., Barreiros Rosa, B. A., Rakovic, M., Miranda, P., Cordeiro, T., Isotani, S., Bittencourt, I., and Gasevic, D. (2023). Towards explainable prediction of essay cohesion in Portuguese and English. In Proceedings of the 13th International Learning Analytics and Knowledge Conference, pages 509–519, Arlington, TX, USA. Association for Computing Machinery.

Page, E. B. (1966). The imminence of... grading essays by computer. The Phi Delta Kappan, 47(5):238–243.

Perelman, L. (2014). When “the state of the art” is counting words. Assessing Writing, 21:104–111.

Tay, Y., Phan, M., Tuan, L. A., and Hui, S. C. (2018). SkipFlow: Incorporating neural coherence features for end-to-end automatic text scoring. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5948–5955, New Orleans, Louisiana, USA. AAAI Press.

Yannakoudakis, H. and Cummins, R. (2015). Evaluating the performance of automated text scoring systems. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 213–223, Denver, Colorado. Association for Computational Linguistics.

Yoon, S.-Y., Cahill, A., Loukina, A., Zechner, K., Riordan, B., and Madnani, N. (2018). Atypical inputs in educational applications. In Bangalore, S., Chu-Carroll, J., and Li, Y., editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 60–67, New Orleans, Louisiana. Association for Computational Linguistics.
Published
17/11/2024
ANCHIÊTA, Rafael T.; DE SOUSA, Rogério F.; MOURA, Raimundo S. A Robustness Analysis of Automated Essay Scoring Methods. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 15., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 75-80. DOI: https://doi.org/10.5753/stil.2024.245419.