Unsupervised Machine Learning Based on Heterogeneous Networks for Text Clustering
Abstract
Network-based representations can model different types of relationships between texts and can capture patterns that the vector space model hardly captures. In addition, network-based clustering algorithms, such as label propagation, have linear complexity. However, network-based clustering has not been explored (i) specifically for clustering texts or (ii) with different network representations of texts. This article therefore explores and analyzes clustering techniques applied to different types of network representations. Experiments were performed on 30 text collections from different domains, represented as bag-of-words, as k-Nearest Neighbors similarity networks, and as bipartite networks. Label propagation applied to similarity networks obtained the best results for most evaluation measures and most text collections.
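To make the best-performing configuration concrete, the following is a minimal sketch of label propagation clustering on a k-Nearest Neighbors similarity network built from bag-of-words vectors. It is not the authors' implementation. The abstract does not specify the similarity measure, the value of k, or the label propagation variant, so cosine similarity, a symmetrized neighbor graph, the asynchronous majority-vote update, and the names `knn_graph` and `label_propagation` are all assumptions made for illustration. For brevity the sketch computes all pairwise similarities, which is quadratic in the number of documents; only the propagation step is linear in the number of edges per pass.

```python
import numpy as np


def knn_graph(X, k):
    """Build a symmetric kNN adjacency list from rows of X (assumed: cosine similarity)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty documents
    Xn = X / norms
    sim = Xn @ Xn.T                  # dense pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)   # a document is never its own neighbor
    n = X.shape[0]
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in np.argsort(-sim[i])[:k]:
            # Symmetrize: an edge exists if either endpoint selects the other.
            neighbors[i].add(int(j))
            neighbors[int(j)].add(i)
    return neighbors


def label_propagation(neighbors, max_iter=100, seed=0):
    """Asynchronous label propagation: each node adopts its neighbors' majority label."""
    rng = np.random.default_rng(seed)
    n = len(neighbors)
    labels = list(range(n))          # every document starts in its own cluster
    for _ in range(max_iter):
        changed = False
        for i in rng.permutation(n):
            if not neighbors[i]:
                continue
            counts = {}
            for j in neighbors[i]:
                counts[labels[j]] = counts.get(labels[j], 0) + 1
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[i] in candidates:
                continue             # current label is already a majority label
            labels[i] = candidates[rng.integers(len(candidates))]
            changed = True
        if not changed:
            break                    # stable: no node changed in a full pass
    return labels


# Toy bag-of-words matrix (hypothetical data): two groups with disjoint vocabularies.
docs = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)

clusters = label_propagation(knn_graph(docs, k=2))
```

On this toy input the first three documents share one label and the last three share another, because the two groups have no terms in common and the neighbor graph splits into two disconnected components.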
