Accelerating vocabulary and co-occurrence matrix construction in text corpora (original title: Acelerando a construção de vocabulário e matriz de co-ocorrência em bases textuais)

  • Chayner Barros, Universidade Federal de Goiás
  • Wellington Martins, Universidade Federal de Goiás

Abstract

Two important preprocessing tasks in natural language processing are vocabulary building and word co-occurrence matrix computation. As datasets grow and non-static corpora become common, these tasks become increasingly computationally demanding. In this article, we present parallel algorithms to extract the vocabulary and compute the co-occurrence matrix. These algorithms are mapped to both a multicore (CPU) and a manycore (GPU) architecture. Our experiments using a standard dataset show speedups of up to 21x compared to a sequential state-of-the-art implementation (GloVe) performing the same tasks.
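To make the two tasks concrete, here is a minimal sequential sketch in Python. It is not the authors' algorithm and not the GloVe code: the function names, the whitespace tokenization, the symmetric window, and the unweighted counts are all assumptions chosen for illustration. It shows only what a vocabulary and a sparse co-occurrence table look like, and why partial counts from separate corpus chunks can be summed.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Map each token to an integer id, most frequent token first.

    Assumption: tokens are separated by whitespace.
    """
    freq = Counter(tok for sent in sentences for tok in sent.split())
    # Sort by descending frequency, then alphabetically for a stable order.
    ordered = sorted(
        (item for item in freq.items() if item[1] >= min_count),
        key=lambda item: (-item[1], item[0]),
    )
    return {tok: idx for idx, (tok, _) in enumerate(ordered)}


def cooccurrence(sentences, vocab, window=2):
    """Count symmetric co-occurrences within `window` tokens, per sentence.

    Assumption: plain integer counts, no distance weighting.
    """
    counts = Counter()
    for sent in sentences:
        # Tokens missing from the vocabulary are dropped before windowing.
        ids = [vocab[tok] for tok in sent.split() if tok in vocab]
        for i, center in enumerate(ids):
            # Look only to the left; add both orders to keep the table symmetric.
            for j in range(max(0, i - window), i):
                counts[(center, ids[j])] += 1
                counts[(ids[j], center)] += 1
    return counts


def merge_counts(partials):
    """Sum sparse count tables computed independently on corpus chunks."""
    total = Counter()
    for part in partials:
        total.update(part)  # Counter.update adds counts rather than replacing
    return total
```

Because the counts are additive and each sentence is processed independently here, chunks of the corpus could be handed to separate workers and the partial tables combined with `merge_counts`. How the paper actually partitions work across CPU cores or GPU threads is not described in this abstract.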

Published
2019-11-08
How to Cite
BARROS, Chayner; MARTINS, Wellington. Acelerando a construção de vocabulário e matriz de co-ocorrência em bases textuais. Proceedings of the Symposium on High Performance Computing Systems (SSCAD), [S.l.], p. 418-429, nov. 2019. ISSN 0000-0000. Available at: <https://sol.sbc.org.br/index.php/sscad/article/view/8687>. Date accessed: 18 may 2024. doi: https://doi.org/10.5753/wscad.2019.8687.