A Comparative Study of Text Document Representation Approaches Using Point Placement-based Visualizations
Abstract: In natural language processing, text representation plays an important role and can affect the performance of language models and machine learning algorithms. Basic vector space models, such as term frequency-inverse document frequency (TF-IDF), became popular approaches for representing text documents. In recent years, approaches based on word embeddings have been proposed to preserve the meaning and semantic relations of words, phrases, and texts. In this paper, we study the influence of different text representations on the quality of the 2D visual spaces (layouts) generated by state-of-the-art visualizations based on point placement. For that purpose, a visualization-assisted approach is proposed to support users in exploring such representations in classification tasks. Experiments using two public labeled corpora were conducted to assess the quality of the layouts and to discuss possible relations to classification performance. The results are promising, indicating that the proposed approach can guide users to understand the relevant patterns of a corpus under each representation.
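The pipeline described above can be illustrated with a minimal sketch (not the authors' implementation): a toy labeled corpus is encoded with one of the basic vector space models mentioned in the abstract (TF-IDF), and the resulting high-dimensional vectors are placed in a 2D layout with t-SNE, one of the point-placement techniques under study. The corpus, labels, and parameter values are illustrative assumptions; scikit-learn is assumed to be available.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Toy labeled corpus standing in for a real document collection.
corpus = [
    "stocks rallied after the earnings report",
    "the central bank raised interest rates",
    "the team won the championship game",
    "the striker scored twice in the final",
]
labels = ["business", "business", "sports", "sports"]

# Step 1: high-dimensional text representation (TF-IDF vector space model).
X = TfidfVectorizer().fit_transform(corpus).toarray()

# Step 2: 2D point-placement layout; perplexity must be smaller than
# the number of documents.
layout = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)

print(layout.shape)  # one 2D point per document
```

Swapping the vectorizer for an embedding-based encoder while keeping the projection step fixed is exactly the kind of comparison the paper's layout-quality assessment targets.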