Interpretation and Hierarchical methods for Dimensionality Reduction
Abstract
High-dimensional data analysis is a ubiquitous task in practical and research activities. Dimensionality Reduction (DR) techniques are usually employed because they map high-dimensional data to lower-dimensional spaces and allow for knowledge discovery. This thesis focuses on the interpretability and representation of the output of non-linear DR approaches such as t-SNE and UMAP. That is, we propose methods for interpreting and hierarchically learning embeddings. To accomplish these goals, the following main research activities were carried out, representing separate but interconnected works: (1) a sampling method in visual space (ℝ²) that can preserve class boundary structures while keeping outliers visible; (2) a technique for understanding cluster formation by leveraging statistical tests on the feature values after dimensionality reduction; (3) an adaptation of SHAP to explain cluster formation after dimensionality reduction, advancing the state of the art; (4) a novel hierarchical DR technique that employs an adaptive kernel for global/local neighborhood learning while preserving context across embeddings.
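As a rough illustration of item (2), the sketch below ranks features by how strongly their values inside one cluster differ from their values in the rest of the data. It is a minimal sketch under stated assumptions, not the method proposed in the thesis: the choice of Welch's t-statistic, the function names `welch_t` and `rank_features`, and the toy data are all hypothetical, and the cluster labels are assumed to come from some grouping of the 2D embedding.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_features(rows, labels, cluster):
    """Rank feature indices by |t| between one cluster and all other points.

    rows    : original high-dimensional feature values, one list per point
    labels  : cluster label of each point (assumed to be derived from the embedding)
    cluster : the label of the cluster to explain
    """
    inside = [r for r, l in zip(rows, labels) if l == cluster]
    outside = [r for r, l in zip(rows, labels) if l != cluster]
    scores = []
    for j in range(len(rows[0])):
        t = welch_t([r[j] for r in inside], [r[j] for r in outside])
        scores.append((j, t))
    # Largest absolute statistic first: the most distinctive feature for this cluster.
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)


# Hypothetical toy data: feature 0 separates the two clusters, feature 1 does not.
rows = [
    [1.0, 5.0], [1.2, 4.0], [0.9, 6.0],  # points labelled as cluster 0
    [3.0, 5.5], [3.1, 4.5], [2.9, 5.0],  # points labelled as cluster 1
]
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_features(rows, labels, cluster=0)
```

Here `ranking` lists feature 0 first with a negative statistic, meaning its values are lower inside cluster 0 than outside, while feature 1 scores zero because both groups share the same mean.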
Published
November 6, 2023
How to Cite
MARCÍLIO-JR, Wilson E.; ELER, Danilo M. Interpretation and Hierarchical methods for Dimensionality Reduction. In: WORKSHOP DE TESES E DISSERTAÇÕES - CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 36., 2023, Rio Grande/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 83-89. DOI: https://doi.org/10.5753/sibgrapi.est.2023.27456.