A strategy for interpreting and visualizing the results of matrix-trifactorization-based coclustering algorithms
Abstract
Information yielded by unsupervised learning is often hard to interpret because no predefined labels are available. To overcome this, we propose and illustrate a strategy for interpreting and visualizing the results of coclustering algorithms based on matrix trifactorization. Our method consists of three steps: (1) vector space visualization; (2) cluster characterization by top documents/words; and (3) cocluster characterization by comparing top words between different clusters. The third step allows exploring the resulting clusters in a way that considers the relationship between an attribute cluster and every data cluster, instead of only the data cluster most strongly associated with that attribute cluster. We illustrate our method using the Non-negative Block Value Decomposition on a dataset of scientific abstracts.
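Steps (2) and (3) can be illustrated with a minimal sketch. It assumes the usual trifactorization form X ≈ F S Gᵀ, where F holds document-cluster memberships, G holds word-cluster memberships, and S holds the association between each document cluster and each word cluster. The matrix names, the toy values, and the helper functions `top_items` and `describe_coclusters` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def top_items(membership, names, n_top=2):
    """For each cluster (column), list the items with the largest membership weight."""
    order = np.argsort(-membership, axis=0)[:n_top]  # shape: (n_top, n_clusters)
    return [[names[i] for i in order[:, c]] for c in range(membership.shape[1])]

def describe_coclusters(S, top_words):
    """Pair every document cluster with every word cluster, not only the strongest one."""
    rows = []
    for p in range(S.shape[0]):          # document clusters
        for q in range(S.shape[1]):      # word clusters
            rows.append((p, q, float(S[p, q]), top_words[q]))
    return rows

# Toy factors (assumed values, not results from the paper).
docs = ["d1", "d2", "d3", "d4"]
words = ["gene", "protein", "cell", "graph", "node"]
F = np.array([[0.9, 0.1], [0.8, 0.0], [0.1, 0.7], [0.0, 0.9]])   # documents x document clusters
G = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
              [0.0, 0.9], [0.1, 0.8]])                           # words x word clusters
S = np.array([[5.0, 0.5], [0.3, 4.0]])                           # document clusters x word clusters

top_docs = top_items(F, docs)     # step (2): top documents per document cluster
top_words = top_items(G, words)   # step (2): top words per word cluster
summary = describe_coclusters(S, top_words)  # step (3): all cluster pairings

for p, q, weight, tw in summary:
    print(f"doc cluster {p} x word cluster {q}: weight={weight:.1f}, top words={tw}")
```

Listing every (document cluster, word cluster) pair with its weight in S, rather than keeping only the largest entry per column, is what lets a reader compare how the same word cluster relates to each document cluster.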