How COVID-19 Impacted Data Science: a Topic Retrieval and Analysis from GitHub Projects’ Descriptions

  • Amanda C. R. Tavares, Universidade Federal de Minas Gerais (UFMG)
  • Natércia A. Batista, Universidade Federal de Minas Gerais (UFMG)
  • Mirella M. Moro, Universidade Federal de Minas Gerais (UFMG)


We present a data-driven study of data-science-oriented code repositories. The goal is to compare their topics of interest, and how those topics evolved over the COVID-19 pandemic, by analyzing Jupyter Notebook and Python projects from the year before the pandemic and from the pandemic period itself. We employ a state-of-the-art algorithm to find topics from the repositories' descriptions, and compare its performance under different hyperparameter settings to improve accuracy.

Keywords: Data Science, GitHub, Python, Jupyter Notebooks, COVID-19


TAVARES, Amanda C. R.; BATISTA, Natércia A.; MORO, Mirella M. How COVID-19 Impacted Data Science: a Topic Retrieval and Analysis from GitHub Projects’ Descriptions. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 36., 2021, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 325-330. ISSN 2763-8979. DOI: