TY - JOUR
AU - Silva, Renata B.
AU - Oliveira, Daniel de
AU - Santos, Davi P.
AU - Santos, Lucio F. D.
AU - Wilson, Rodrigo E.
AU - Bedo, Marcos
PY - 2020/09/28
TI - Criteria for choosing the number of dimensions in a principal component analysis: An empirical assessment
JF - Anais do Simpósio Brasileiro de Banco de Dados (SBBD); 2020: Anais do XXXV Simpósio Brasileiro de Bancos de Dados
DO - 10.5753/sbbd.2020.13632
KW -
N2 - Principal component analysis (PCA) is an efficient model for the optimization problem of finding d' axes of a subspace ℝ^{d'} ⊆ ℝ^d so that the mean squared distances from a given set R of points to the axes are minimal. Despite being steadily employed since 1901 in different scenarios, e.g., mechanics, PCA has become an important link in machine learning chained tasks, such as feature learning and AutoML designs. A frequent yet open issue that arises from supervised-based problems is how many PCA axes are required for the performance of machine learning constructs to be tuned. Accordingly, we investigate the behavior of six independent and uncoupled criteria for estimating the number of PCA axes, namely Scree-Plot %, Scree-Plot Gap, Kaiser-Guttman, Broken-Stick, p-Score, and 2D. In total, we evaluate the performance of those approaches on 20 high-dimensional datasets by using (i) four different classifiers, and (ii) a hypothesis test upon the reported F-Measures. Results indicate that the Broken-Stick and Scree-Plot % criteria consistently outperformed the competitors regarding supervised-based tasks, whereas the estimators Kaiser-Guttman and Scree-Plot Gap delivered poor performances in the same scenarios.
UR - https://sol.sbc.org.br/index.php/sbbd/article/view/13632