Computational Approaches for Simplifying Educational Texts: A Proposal Using spaCy
Abstract
This work presents an investigation into the application of Natural Language Processing (NLP) techniques for the automatic simplification of educational texts in Brazilian Portuguese. The study uses the spaCy library with the pt_core_news_sm model to perform syntactic analysis, named entity recognition, and textual readability assessment. The proposed methodology implements simplification rules based on syntactic dependency analysis, preserving essential elements such as the subject and main predicate while removing complex subordinate constructions. The results show that named entity analysis was effective in identifying people (PER), locations (LOC), organizations (ORG), and miscellaneous elements (MISC) in the analyzed texts. The original texts presented Flesch Reading Ease scores ranging from 25.23 to 54.57, indicating different levels of complexity. This research contributes to the advancement of automatic text simplification techniques in Portuguese and offers insights for the development of more accessible educational tools.
Keywords:
Natural Language Processing, Text Simplification, Syntactic Analysis, Named Entities, Text Readability
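The readability assessment mentioned in the abstract can be illustrated with a minimal sketch of the Flesch Reading Ease calculation. The abstract does not state which variant of the formula or which syllable counter the study used, so this example assumes the classic Flesch coefficients and a rough vowel-group heuristic for syllables. The names flesch_reading_ease, estimate_syllables, and score_text are hypothetical helpers, not part of spaCy or of the paper.

```python
import re

# Assumption: classic Flesch Reading Ease coefficients. The paper may use
# a variant calibrated for Portuguese; the abstract does not say.
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Return the Flesch Reading Ease score computed from raw counts."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("text must contain at least one word and one sentence")
    avg_sentence_length = n_words / n_sentences
    avg_syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# Heuristic (assumption): one syllable per run of consecutive vowels.
# This treats every vowel cluster as a single syllable, so hiatuses are undercounted.
VOWEL_GROUP = re.compile(r"[aeiouáàâãéêíóôõúü]+", re.IGNORECASE)

def estimate_syllables(word):
    """Approximate the syllable count of a word by counting vowel groups."""
    return max(1, len(VOWEL_GROUP.findall(word)))

def score_text(text):
    """Score a text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    syllables = sum(estimate_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)

if __name__ == "__main__":
    print(round(score_text("O gato dorme. A casa cai."), 2))
```

Higher scores indicate easier text. In a spaCy pipeline, the sentence and word counts would more naturally come from the parsed document than from regular expressions; the regex splitting here only keeps the sketch free of external dependencies.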
Published
10/11/2025
How to Cite
SOUZA, Vitor Amadeu. Computational Approaches for Simplifying Educational Texts: A Proposal Using spaCy. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 526-529. DOI: https://doi.org/10.5753/webmedia.2025.15154.
