Topic Coherence Metrics: How Sensitive Are They?


  • João Marcos Campagnolo Federal University of Fronteira Sul
  • Denio Duarte Federal University of Fronteira Sul
  • Guilherme Dal Bianco Federal University of Fronteira Sul



Coherence Metrics, Model Evaluation, Sensibility, Topic Modeling, Unsupervised Machine Learning


Topic modeling approaches extract the most relevant sets of words (grouped into so-called topics) from a document collection. The extracted topics can be used to analyze the latent semantic structure hidden in the collection. The task is intrinsically unsupervised (no label information is available), so evaluating the quality of the discovered topics is challenging. To address this, different unsupervised metrics have been proposed, and some of them, e.g., coherence metrics, are close to human perception. However, metrics behave differently when facing noise (i.e., unrelated words) in the topics. This article presents an exploratory analysis of how state-of-the-art metrics are affected by perturbations in the topics. By perturbation, we mean that intruder words are synthetically inserted into the topics to measure each metric's ability to deal with noise. Our findings highlight the importance of overlooked choices in the context of metric sensitivity. We show that some topic modeling metrics are highly sensitive to perturbation, while others handle noisy topics with little change in their scores. As a result, we rank the chosen metrics by sensitivity, and we believe the results can help developers better evaluate the discovered topics.
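The perturbation idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the toy corpus, the choice of a document-level NPMI coherence score, the convention of scoring never-co-occurring pairs as -1, the hypothetical helper names `npmi_coherence` and `perturb`, and the decision to replace topic words with intruders so that topic size stays fixed.

```python
import math
import random
from itertools import combinations


def npmi_coherence(topic, docs):
    """Mean NPMI over all word pairs in `topic`, using document co-occurrence."""
    n = len(docs)

    def p(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in docs) / n

    scores = []
    for a, b in combinations(topic, 2):
        p_ab = p(a, b)
        if p_ab == 0.0:
            scores.append(-1.0)  # assumed convention: the pair never co-occurs
        elif p_ab == 1.0:
            scores.append(1.0)   # the pair co-occurs in every document
        else:
            pmi = math.log(p_ab / (p(a) * p(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)


def perturb(topic, intruders, k, rng):
    """Return a copy of `topic` with k random words replaced by intruders."""
    positions = rng.sample(range(len(topic)), k)
    chosen = rng.sample(intruders, k)
    noisy = list(topic)
    for pos, word in zip(positions, chosen):
        noisy[pos] = word
    return noisy


# Toy corpus (illustrative only): each document is a set of words.
docs = [
    {"game", "team", "player", "score"},
    {"game", "team", "coach", "score"},
    {"team", "player", "coach", "season"},
    {"stock", "market", "price", "bank"},
    {"market", "price", "trade", "bank"},
    {"stock", "trade", "bank", "price"},
]
topic = ["game", "team", "player", "score"]
intruders = ["stock", "market", "bank"]

rng = random.Random(0)
base = npmi_coherence(topic, docs)
print(f"clean topic: {base:.3f}")
for k in (1, 2):
    noisy = perturb(topic, intruders, k, rng)
    score = npmi_coherence(noisy, docs)
    # The drop relative to the clean topic is a simple sensitivity signal.
    print(f"k={k} {noisy}: {score:.3f} (drop {base - score:.3f})")
```

Running the same loop with several metrics and comparing the size of the drop for each would give a rough sensitivity ranking in the spirit of the analysis described in the abstract.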








How to Cite

Campagnolo, J. M., Duarte, D., & Dal Bianco, G. (2022). Topic Coherence Metrics: How Sensitive Are They? Journal of Information and Data Management, 13(4).



Regular Papers