Experimenting Sentence Split-and-Rephrase Using Part-of-Speech Labels

  • P. Berlanga Neto Universidade de São Paulo
  • E. Y. Okano Universidade de São Paulo
  • E. E. S. Ruiz Universidade de São Paulo


Text simplification (TS) is a natural language transformation process that reduces linguistic complexity while preserving the original meaning. This work presents a research proposal for the automatic simplification of texts, specifically a split-and-rephrase approach based on an encoder-decoder neural network model. The proposed method was trained on the English WikiSplit corpus with the help of a part-of-speech tagger and achieved a validation BLEU score of 74.72%. We also applied the trained model to split-and-rephrase sentences written in Portuguese with relative success, showing the method's potential.

Keywords: natural language processing, neural networks, sentence simplification
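The abstract reports model quality with a BLEU score. As a hedged illustration of what that metric measures, the sketch below computes sentence-level BLEU in plain Python: the geometric mean of modified n-gram precisions, scaled by a brevity penalty. This is only a minimal approximation for intuition; the paper presumably used a standard evaluation toolkit, and corpus-level BLEU aggregates n-gram counts over all sentence pairs rather than averaging per-sentence scores.

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Minimal sentence-level BLEU sketch: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    `candidate` and `reference` are lists of tokens."""
    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum((cand & ref).values())
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)

    # Any zero precision drives the geometric mean to zero
    # (real toolkits often apply smoothing here instead).
    if min(precisions) == 0:
        return 0.0

    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: 1 if the candidate is at least as long as the
    # reference, exp(1 - r/c) otherwise.
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_mean)
```

For example, an output identical to its reference scores 1.0, while an output sharing no 4-grams with the reference scores 0.0 under this unsmoothed variant.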


BERLANGA NETO, P.; OKANO, E. Y.; RUIZ, E. E. S. Experimenting Sentence Split-and-Rephrase Using Part-of-Speech Labels. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 8., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 169-176. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2020.11973.