Criteria for choosing the number of dimensions in a principal component analysis: An empirical assessment

  • Renata B. Silva Universidade Federal Fluminense
  • Daniel de Oliveira Universidade Federal Fluminense
  • Davi P. Santos Universidade de São Paulo
  • Lucio F. D. Santos Instituto Federal do Norte de Minas Gerais
  • Rodrigo E. Wilson Universidade Federal Fluminense
  • Marcos Bedo Universidade Federal Fluminense

Abstract

Principal component analysis (PCA) is an efficient model for the optimization problem of finding the d' axes of a subspace ℝ^{d'} ⊆ ℝ^{d} that minimize the mean squared distances from a given set R of points to those axes. Although steadily employed since 1901 in fields such as mechanics, PCA has also become an important link in chained machine-learning tasks, e.g., feature learning and AutoML designs. A frequent yet open issue in supervised problems is how many PCA axes are required for the performance of machine-learning constructs to be properly tuned. Accordingly, we investigate the behavior of six independent and uncoupled criteria for estimating the number of PCA axes, namely Scree-Plot %, Scree-Plot Gap, Kaiser-Guttman, Broken-Stick, p-Score, and 2D. We evaluate those approaches on 20 high-dimensional datasets by using (i) four different classifiers and (ii) a hypothesis test on the reported F-Measures. Results indicate the Broken-Stick and Scree-Plot % criteria consistently outperformed their competitors in supervised tasks, whereas the Kaiser-Guttman and Scree-Plot Gap estimators delivered poor performances in the same scenarios.
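As a hedged illustration only (not the authors' implementation or datasets), the sketch below shows how two of the evaluated stopping rules, Kaiser-Guttman and Broken-Stick, can estimate the number of PCA axes from the eigenvalue spectrum; it assumes NumPy and scikit-learn, and the stand-in digits dataset and all variable names are our own choices.

```python
# Minimal sketch of two PCA stopping rules discussed in the paper:
# Kaiser-Guttman and Broken-Stick. Illustrative only; the evaluated
# datasets and classifiers are not reproduced here.
import numpy as np
from sklearn.datasets import load_digits          # stand-in dataset (assumption)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)  # standardize features
eigvals = PCA().fit(X).explained_variance_              # eigenvalues, descending

# Kaiser-Guttman: retain axes whose eigenvalue exceeds the mean eigenvalue
# (equivalently, > 1 for standardized data / a correlation matrix).
kaiser = int(np.sum(eigvals > eigvals.mean()))

# Broken-Stick: retain axes whose share of the total variance exceeds the
# expected share b_k = (1/d) * sum_{i=k}^{d} 1/i of a randomly broken stick.
d = len(eigvals)
bstick_model = np.array([np.sum(1.0 / np.arange(k, d + 1)) / d
                         for k in range(1, d + 1)])
share = eigvals / eigvals.sum()
below = share < bstick_model                      # first axis under the model
broken_stick = int(np.argmax(below)) if below.any() else d

print(f"Kaiser-Guttman: {kaiser} axes; Broken-Stick: {broken_stick} axes")
```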

Keywords: Feature transformation, PCA, Number of principal components, AutoML

References

Aggarwal, C. (2015). Data Mining: The Textbook. Springer.

Guttman, L. (1954). Some necessary conditions for common-factor analysis. Psychometrika, 19(2):149–161.

Jackson, D. A. (1993). Stopping rules in principal components analysis: A comparison of heuristical and statistical approaches. Ecology, 74(8):2204–2214.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning, volume 112. Springer.

Legendre, P. and Legendre, L. F. (2012). Numerical Ecology. Elsevier.

Neto, P., Jackson, D., and Somers, K. (2005). How many principal components? Stopping rules for determining the number of non-trivial axes revisited. Computational Stat. & Data Analysis, 49(4):974–997.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Phil. Magazine and J. of Science, 2(11):559–572.

Pestov, V. (2008). An axiomatic approach to intrinsic dimension of a dataset. Neural Networks, 21(2-3):204–213.

Wilcoxon, F. (1992). Individual comparisons by ranking methods. In Breakthroughs in statistics, pages 196–202. Springer.

Zhu, M. and Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Stat. & Data Analysis, 51(2):918–930.
Published
28/09/2020
How to Cite

SILVA, Renata B.; OLIVEIRA, Daniel de; SANTOS, Davi P.; SANTOS, Lucio F. D.; WILSON, Rodrigo E.; BEDO, Marcos. Criteria for choosing the number of dimensions in a principal component analysis: An empirical assessment. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 35., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 145-150. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2020.13632.