Summarization of Educational Videos with Transformers Networks

  • Leandro Massetti Ribeiro Oliveira UFMA
  • Li Chang Shuen UFMA
  • Allan Kássio Beckman Soares da Cruz UFMA
  • Carlos de Salles Soares UFMA


This paper presents an approach to summarizing educational videos with deep learning Transformer models. The approach targets educational content: it summarizes the video captions and then uses the resulting text to summarize the videos themselves. In tests on the EDUVSUM dataset, the approach improved upon the original paper’s results, achieving an accuracy of 26.53% in a multi-class problem, with a mean absolute error of 1.49 per video frame and 1.45 per video segment. Transformer techniques for automatic text summarization have proven effective in creating multimedia learning objects. The results suggest that these techniques can generate more efficient, higher-quality digital educational resources while reducing the time and effort required to create them.
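
The abstract reports a multi-class accuracy and a mean absolute error (MAE) at both frame and segment level. As a rough illustration of how such figures could be computed, here is a minimal Python sketch. It assumes that each frame carries an integer importance class and that a segment score is the average of its frame scores; the function names, the toy data, and the segment length are hypothetical and are not taken from the paper.

```python
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted class equals the annotated class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between annotated and predicted scores."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def segment_scores(frame_scores, frames_per_segment):
    """Average frame scores over fixed-length segments (assumed aggregation)."""
    return [
        mean(frame_scores[i:i + frames_per_segment])
        for i in range(0, len(frame_scores), frames_per_segment)
    ]


# Toy per-frame importance classes (hypothetical values, not EDUVSUM data).
true_frames = [1, 1, 2, 4, 5, 5, 9, 10]
pred_frames = [1, 2, 2, 6, 5, 3, 9, 7]

# Frame-level metrics: exact class match and average score distance.
frame_acc = accuracy(true_frames, pred_frames)
frame_mae = mean_absolute_error(true_frames, pred_frames)

# Segment-level MAE, using an assumed segment length of 4 frames.
segment_mae = mean_absolute_error(
    segment_scores(true_frames, 4),
    segment_scores(pred_frames, 4),
)

print(f"frame accuracy: {frame_acc:.2f}")
print(f"frame MAE:      {frame_mae:.2f}")
print(f"segment MAE:    {segment_mae:.2f}")
```

In this sketch accuracy only rewards exact class matches, while MAE also credits near misses. That difference is one plausible reason a multi-class accuracy can look low while the average error stays within a class or two.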

Keywords: Machine learning, transformers, e-learning, video summarization


OLIVEIRA, Leandro Massetti Ribeiro; SHUEN, Li Chang; DA CRUZ, Allan Kássio Beckman Soares; SOARES, Carlos de Salles. Summarization of Educational Videos with Transformers Networks. In: SIMPÓSIO BRASILEIRO DE SISTEMAS MULTIMÍDIA E WEB (WEBMEDIA), 29., 2023, Ribeirão Preto/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 137–143.
