Criteria for choosing the number of dimensions in a principal component analysis: An empirical assessment

  • Renata B. Silva, Universidade Federal Fluminense
  • Daniel de Oliveira, Universidade Federal Fluminense
  • Davi P. Santos, Universidade de São Paulo
  • Lucio F. D. Santos, Instituto Federal do Norte de Minas Gerais
  • Rodrigo E. Wilson, Universidade Federal Fluminense
  • Marcos Bedo, Universidade Federal Fluminense

Abstract

Principal component analysis (PCA) is an efficient model for the optimization problem of finding d' axes of a subspace ℝ^{d'} ⊆ ℝ^d so that the mean squared distances from a given set R of points to the axes are minimal. Although steadily employed since 1901 in different scenarios, e.g., mechanics, PCA has become an important link in chained machine-learning tasks, such as feature learning and AutoML designs. A frequent yet open issue arising in supervised problems is how many PCA axes are required for the performance of machine-learning constructs to be tuned. Accordingly, we investigate the behavior of six independent and uncoupled criteria for estimating the number of PCA axes, namely Scree-Plot %, Scree-Plot Gap, Kaiser-Guttman, Broken-Stick, p-Score, and 2D. In total, we evaluate those approaches on 20 high-dimensional datasets using (i) four different classifiers and (ii) a hypothesis test on the reported F-measures. Results indicate the Broken-Stick and Scree-Plot % criteria consistently outperformed their competitors in supervised tasks, whereas the Kaiser-Guttman and Scree-Plot Gap estimators delivered poor performances in the same scenarios.
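The paper itself does not list code, but two of the evaluated stopping rules follow directly from their textbook definitions (Guttman, 1954; Jackson, 1993): Kaiser-Guttman keeps every component whose eigenvalue exceeds the average eigenvalue, and Broken-Stick keeps the leading components whose variance share beats the expected share of a randomly broken stick. Below is a minimal sketch in Python with NumPy; the function names and the synthetic data are our own illustration, not the authors' implementation.

```python
import numpy as np

def kaiser_guttman(eigenvalues):
    # Retain components whose eigenvalue exceeds the average eigenvalue
    # (equivalently, > 1 when PCA is run on the correlation matrix).
    return int(np.sum(eigenvalues > eigenvalues.mean()))

def broken_stick(eigenvalues):
    # Retain the leading components whose variance share exceeds the
    # expected share of the k-th largest piece of a stick broken at
    # random into d parts: b_k = (1/d) * sum_{i=k..d} 1/i.
    d = len(eigenvalues)
    expected = np.array([np.sum(1.0 / np.arange(k, d + 1)) / d
                         for k in range(1, d + 1)])
    observed = eigenvalues / eigenvalues.sum()
    keep = observed > expected
    # Count the leading run of components that beat the null model.
    return d if keep.all() else int(np.argmax(~keep))

# Toy usage: eigenvalues of the sample covariance matrix of synthetic
# data, sorted in decreasing order, as both rules expect.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 12)) @ np.diag(np.linspace(3.0, 0.5, 12))
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
print("Kaiser-Guttman keeps:", kaiser_guttman(eigvals), "axes")
print("Broken-Stick keeps:  ", broken_stick(eigvals), "axes")
```

Note that both rules consume only the eigenvalue spectrum, which is what makes the criteria "independent and uncoupled" from any downstream classifier used in the evaluation.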

Keywords: Feature transformation, PCA, Number of principal components, AutoML

References

Aggarwal, C. (2015). Data Mining: The Textbook. Springer.

Guttman, L. (1954). Some necessary conditions for common-factor analysis. Psychometrika, 19(2):149–161.

Jackson, D. A. (1993). Stopping rules in principal components analysis: A comparison of heuristical and statistical approaches. Ecology, 74(8):2204–2214.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning, volume 112. Springer.

Legendre, P. and Legendre, L. F. (2012). Numerical Ecology. Elsevier.

Neto, P., Jackson, D., and Somers, K. (2005). How many principal components? Stopping rules for determining the number of non-trivial axes revisited. Computational Stat. & Data Analysis, 49(4):974–997.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Phil. Magazine and J. of Science, 2(11):559–572.

Pestov, V. (2008). An axiomatic approach to intrinsic dimension of a dataset. Neural Networks, 21(2-3):204–213.

Wilcoxon, F. (1992). Individual comparisons by ranking methods. In Breakthroughs in statistics, pages 196–202. Springer.

Zhu, M. and Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Stat. & Data Analysis, 51(2):918–930.
Published
28/09/2020
SILVA, Renata B.; OLIVEIRA, Daniel de; SANTOS, Davi P.; SANTOS, Lucio F. D.; WILSON, Rodrigo E.; BEDO, Marcos. Criteria for choosing the number of dimensions in a principal component analysis: An empirical assessment. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 35., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 145-150. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2020.13632.