What Happened in 2020: a Topic Modeling Approach based on a Topic Similarity Metric
Keywords: LDA, Metrics, News Article, Semantic Drift, Topic Evolution, Topic Modeling
The year 2020 was atypical, mainly because of the onset of the Covid-19 pandemic, which became a widely discussed subject worldwide. Unsurprisingly, online news websites followed this trend while continuing to publish on traditional subjects (e.g., sports, business, and politics). Understanding how these subjects interacted with each other over the year is a challenge. In this paper, we build a timeline of 2020 based on the subjects and their similarity, using a topic modeling approach (LDA) and a novel topic similarity metric. To accomplish that, we scrape news websites to build a collection of 2020 news articles. The collection is then pre-processed and sliced by month. We use LDA to discover the latent topics in all the monthly collections. Next, we calculate the similarity between the topics across 2020 using five semantic correlations: born, death, keep, merge, and split. The discovered topics and the semantic drift between them show that building a meaningful 2020 timeline is possible.
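As a rough illustration of the last step, the sketch below labels the relations between the topics of two consecutive monthly slices. It is not the paper's metric: the abstract does not define the novel similarity measure, so plain cosine similarity over topic-word weights is swapped in, and the threshold value, the function names (`cosine`, `label_transitions`), the toy topics, and the exact rules for each of the five labels are all assumptions made for this example.

```python
from math import sqrt


def cosine(p, q):
    """Cosine similarity between two sparse word -> weight dicts."""
    dot = sum(w * q.get(word, 0.0) for word, w in p.items())
    norm_p = sqrt(sum(w * w for w in p.values()))
    norm_q = sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def label_transitions(prev_topics, next_topics, threshold=0.5):
    """Label relations between the topics of two consecutive time slices.

    prev_topics / next_topics: dict topic_id -> {word: weight}.
    Returns a list of (label, prev_ids, next_ids) tuples.
    The threshold and the labeling rules are assumptions of this sketch.
    """
    # Link every pair of topics whose similarity reaches the threshold.
    succ = {
        a: [b for b, q in next_topics.items() if cosine(p, q) >= threshold]
        for a, p in prev_topics.items()
    }
    pred = {b: [a for a in prev_topics if b in succ[a]] for b in next_topics}

    events = []
    for a, targets in succ.items():
        if not targets:
            events.append(("death", [a], []))       # no successor
        elif len(targets) > 1:
            events.append(("split", [a], targets))  # several successors
    for b, sources in pred.items():
        if not sources:
            events.append(("born", [], [b]))        # no predecessor
        elif len(sources) > 1:
            events.append(("merge", sources, [b]))  # several predecessors
        elif len(succ[sources[0]]) == 1:
            events.append(("keep", sources, [b]))   # one-to-one link
    return events


# Toy topics standing in for LDA output on two monthly slices (made-up data).
january = {
    "j0": {"election": 0.6, "vote": 0.4},
    "j1": {"football": 0.7, "league": 0.3},
}
february = {
    "f0": {"election": 0.5, "vote": 0.5},
    "f1": {"covid": 0.8, "virus": 0.2},
}
events = label_transitions(january, february)
for label, sources, targets in events:
    print(label, sources, targets)
```

Applying the same function to each pair of adjacent months would yield a chain of labeled links that can be read as a timeline. In practice the topic-word weights would come from an LDA model fitted on each monthly slice.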
Copyright (c) 2022 The authors
This work is licensed under a Creative Commons Attribution 4.0 International License.