Accelerating vocabulary and co-occurrence matrix construction in text corpora

Chayner Barros; Wellington Martins

doi:10.5753/wscad.2019.8687

Chayner Barros, Universidade Federal de Goiás
Wellington Martins, Universidade Federal de Goiás


Abstract

Two tasks stand out in text preprocessing: building a vocabulary and generating a word co-occurrence matrix. For a growing, non-static volume of data, these tasks carry a high computational cost. In this paper, we exploit parallelism to make this processing feasible. We present parallel algorithms for extracting the vocabulary and producing the co-occurrence matrix, and we implement them on multicore and manycore (GPU) architectures. Experiments on a standard dataset show that our implementation is up to 21x faster than a sequential state-of-the-art solution (GloVe) performing the same tasks.
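To make the two tasks concrete, the following is a minimal sequential sketch in Python. It is not the parallel implementation described in this paper. The function names `build_vocab` and `cooccurrence`, the `min_count` and `window` parameters, and the tie-breaking rule are assumptions made for illustration. The 1/distance weighting follows the scheme described in the GloVe paper.

```python
from collections import Counter, defaultdict


def build_vocab(tokens, min_count=1):
    """Map each word seen at least min_count times to an integer id.

    Ids follow descending frequency; ties are broken alphabetically
    (an illustrative choice, not taken from the paper).
    """
    counts = Counter(tokens)
    kept = sorted(
        (w for w, c in counts.items() if c >= min_count),
        key=lambda w: (-counts[w], w),
    )
    return {w: i for i, w in enumerate(kept)}


def cooccurrence(tokens, vocab, window=2):
    """Accumulate symmetric co-occurrence weights in a sparse dict.

    Two in-vocabulary words that are d positions apart (d <= window)
    add 1/d to both (a, b) and (b, a).
    """
    ids = [vocab.get(w) for w in tokens]  # None marks out-of-vocabulary words
    matrix = defaultdict(float)
    for pos, center in enumerate(ids):
        if center is None:
            continue
        # Look only to the left; symmetry covers the right-hand context.
        for dist in range(1, window + 1):
            if pos - dist < 0:
                break
            context = ids[pos - dist]
            if context is None:
                continue
            matrix[(center, context)] += 1.0 / dist
            matrix[(context, center)] += 1.0 / dist
    return dict(matrix)


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    vocab = build_vocab(tokens)
    print(vocab)
    print(cooccurrence(tokens, vocab, window=2))
```

Both loops are single passes over the token stream. In this sketch the shared counter and the shared sparse matrix are the structures a parallel version would have to partition or update atomically.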

References

Alcantara, D. A., Sharf, A., Abbasinejad, F., Sengupta, S., Mitzenmacher, M., Owens, J. D., and Amenta, N. (2009). Real-time parallel hashing on the gpu. ACM Transactions on Graphics (TOG), 28(5):154.

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., and Robinson, T. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second international conference on Human Language Technology Research, pages 138–145. Morgan Kaufmann Publishers Inc.

Khuc, V. N., Shivade, C., Ramnath, R., and Ramanathan, J. (2012). Towards building large-scale distributed systems for twitter sentiment analysis. In Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC ’12, pages 459–464, New York, NY, USA. ACM.

Kirk, D. B. and Hwu, W. W. (2012). Programming massively parallel processors: a hands-on approach. Newnes.

Lin, J. (2009). Scalable language processing algorithms for the masses: A case study in computing word cooccurrence matrices with mapreduce. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 419–428, Stroudsburg, PA, USA.

Manning, C. D. and Schütze, H. (1999). Foundations of statistical natural language processing. MIT press.

McCormick, C. (2017). Word2vec tutorial part 2 - negative sampling [blog post]. http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/.

Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends® in Information Retrieval, 2(1–2):1–135.

Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Role, F. and Nadif, M. (2011). Handling the impact of low frequency events on cooccurrence based measures of word similarity - a case study of pointwise mutual information.

Schuetze, H. (1997). Document information retrieval using global word co-occurrence patterns. US Patent 5,675,819.

Weiss, S. M., Indurkhya, N., Zhang, T., and Damerau, F. J. (2005). From textual information to numerical vectors. In Text Mining, pages 15–46. Springer.