Is There an Interplay Between Library Usage and Repository Features?: An Analysis with Regression Models

João Victor Esteves (UERJ)
Daniel Coutinho (UERJ)
Marcelo Schots (UERJ)
Igor Machado Coelho (UERJ)

Abstract

The advent of open source has changed the way developers reuse software. The availability of libraries and their source code in public software repositories enables new ways of analyzing project aspects that can provide clues about their stability and maintainability. However, the literature lacks studies that aim to identify whether, and which, repository features correlate with the likelihood that a library is used. To address this gap, we present a factorial experiment using three regression models (Multiple Linear Regression, Random Forest, and Neural Networks) to analyze whether library usage correlates with a set of features extracted from release management and version control repositories. The results allowed us to identify features with a positive impact on learning, such as the number of stars, pull requests, and downloads, as well as features that contributed much less to the models (e.g., the repository size). Although the impact of each feature varied from model to model, the regression results also showed that the models achieved higher accuracy when considering only a subset of the features.

Keywords: regression models, software reuse, mining software repositories, library usage
