Open-Source Software Projects Curating Model for Empirical Software Engineering Studies

  • Juan Andrés Carruthers UNNE


Software projects are common inputs in Empirical Software Engineering (ESE), and they are often selected without following a specific strategy, leading to biased samples. To avoid this problem, researchers choose to use publicly available datasets instead of picking the projects themselves. However, some datasets are not maintained, containing old versions of projects, or even deprecated ones. This may raise some representativeness issues due to major changes in development practices and technologies over time. The main goal of this research is to develop a procedures model to construct and maintain a software project dataset with their product quality metrics, to support the development of ESE studies.

Palavras-chave: Curating model, Software projects, Datasets, Empirical Software Engineering


Avison, D., Lau, F., Myers, M. and Nielsen, P. A. (1999). Action Research. Communications of the ACM, 42(1), 94–97.

Baltes, S. and Ralph, P. (2020). Sampling in Software Engineering Research: A Critical Review and Guidelines.

Garvin, D. A. (1984). What Does "Product Quality" Really Mean? MIT Sloan Management Review, 25–43.

Irrazábal, E., Vásquez, F., Díaz, R. and Garzás, J. (2011). Applying ISO/IEC 12207:2008 with SCRUM and Agile Methods. Communications in Computer and Information Science, 155 CCIS, 169–180.

Johannesson, P. and Perjons, E. (2014). An introduction to design science (Vol. 10, pp. 978-3). Cham: Springer.

Jureczko, M. and Madeyski, L. (2010). Towards identifying software project clusters with regard to defect prediction. ACM International Conference Proceeding Series, 1.

Kitchenham, B. (2004). Procedures for Performing Systematic Reviews. Keele University, 33, 1–26.

Kitchenham, B. and Pfleeger, S. L. (1996). Software quality: the elusive target. IEEE Software, 13(1), 12–21.

Lehman, M. M. (1996). Laws of software evolution revisited. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 1149, 108–124.

Lewin, K. (1947). Frontiers in Group Dynamics: Concept, Method and Reality in Social Science; Social Equilibria and Social Change. Human Relations, 1(1), 5–41.

Lewowski, T. and Madeyski, L. (2020). Creating Evolving Project Data Sets in Software Engineering. Studies in Computational Intelligence, 851, 1–14.

Mockus, A., Fielding, R. T. and Herbsleb, J. D. (2002). Two case studies of open source software development. ACM Transactions on Software Engineering and Methodology (TOSEM), 11(3), 309–346.

Munaiah, N., Kroh, S., Cabrey, C. and Nagappan, M. (2017). Curating GitHub for engineered software projects. Empirical Software Engineering, 22(6), 3219–3253.

Nagappan, M., Zimmermann, T. and Bird, C. (2013). Diversity in Software Engineering Research. Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering - ESEC/FSE 2013.

Palomba, F., Bavota, G., Penta, M. Di, Fasano, F., Oliveto, R., & Lucia, A. De. (2018). On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation. Empirical Software Engineering, 23(3), 1188–1221.

Parnas, D. L. (2001). Some software engineering principles. In Software fundamentals: collected papers by David L. Parnas (pp. 257–266).

Petersen, K., Feldt, R., Mujtaba, S. and Mattsson, M. (2008). Systematic Mapping Studies in Software Engineering. Proceedings of the 12th International Conference on Evaluation and Assessment in Software Engineering, 68–77.

Rahman, M. M. and Roy, C. K. (2018). Improving IR-based bug localization with context-aware query reformulation. ESEC/FSE 2018 - Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 621–632.

Seaman, C. B. (1999). Qualitative methods in empirical studies of software engineering. IEEE Transactions on Software Engineering, 25(4), 557–572.

Shepperd, M., Song, Q., Sun, Z. and Mair, C. (2013). Data quality: Some comments on the NASA software defect datasets. IEEE Transactions on Software Engineering, 39(9), 1208–1215.

Tempero, E., Anslow, C., Dietrich, J., Han, T., Li, J., Lumpe, M., Melton, H. and Noble, J. (2010). The Qualitas Corpus: A curated collection of Java code for empirical studies. Proceedings - Asia-Pacific Software Engineering Conference, APSEC, 336–345.

Vázquez, H. C., Bergel, A., Vidal, S., Díaz Pace, J. A. and Marcos, C. (2019). Slimming javascript applications: An approach for removing unused functions from javascript libraries. Information and Software Technology, 107, 18–29.

Vidal, S. A., Bergel, A., Marcos, C. and Díaz-Pace, J. A. (2016). Understanding and addressing exhibitionism in Java empirical research about method accessibility. Empirical Software Engineering, 21(2), 483–516.

Vidal, S., Bergel, A., Díaz-Pace, J. A. and Marcos, C. (2016). Over-exposed classes in Java: An empirical study. Computer Languages, Systems & Structures, 46, 1–19.

Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., and Wesslén, A. (2012). Experimentation in software engineering. Springer Science & Business Media.
CARRUTHERS, Juan Andrés. Open-Source Software Projects Curating Model for Empirical Software Engineering Studies. In: CONGRESSO IBERO-AMERICANO EM ENGENHARIA DE SOFTWARE (CIBSE), 25. , 2022, Córdoba. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022 . p. 416-423. DOI: