Generation of co-occurrence matrix on the GPU with applications in textual bases

  • Chayner Cordeiro Barros UFG
  • Wellington S. Martins UFG

Abstract


Co-occurrence matrices are data structures used in applications such as representation models for typical NLP (Natural Language Processing) tasks, including machine translation, text mining, and document classification. Building them is computationally expensive because it requires analyzing the co-occurrence relationships between the terms in a corpus, and storing these relationships can require a large amount of memory when the vocabulary, the context window, and the corpus are very large. To overcome these limitations, our solution is based on a hash-like data structure capable of efficiently storing an inherently sparse matrix of large dimensions. The solution adapts easily to different architectures and has been implemented and tested on single-threaded, multicore, and manycore architectures.
Keywords: Co-occurrence matrix, GPU
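
The idea of storing a sparse co-occurrence matrix in a hash-like structure can be illustrated with a minimal single-threaded sketch. This is not the authors' implementation: it uses a Python dict as a stand-in for the hash structure, and the function name `build_cooccurrence`, the symmetric window, and the unweighted counts are all assumptions made for illustration. Only term pairs that actually co-occur get an entry, so memory grows with the number of observed pairs rather than with the square of the vocabulary size.

```python
from collections import defaultdict


def build_cooccurrence(sentences, window=2):
    """Count term co-occurrences within a context window.

    Hypothetical helper for illustration only. `sentences` is an
    iterable of token lists; `window` is the maximum distance
    (in tokens) between two terms that count as co-occurring.
    """
    vocab = {}                 # term -> integer id, in order of first appearance
    counts = defaultdict(int)  # (id_a, id_b) with id_a <= id_b -> count

    for tokens in sentences:
        # Map each token to its id, extending the vocabulary as needed.
        ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        for i, center in enumerate(ids):
            # Look only to the right; ordering the pair below makes the
            # stored matrix symmetric without visiting each pair twice.
            for j in range(i + 1, min(i + 1 + window, len(ids))):
                a, b = center, ids[j]
                key = (a, b) if a <= b else (b, a)
                counts[key] += 1

    return vocab, dict(counts)


if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "cat"]]
    vocab, counts = build_cooccurrence(corpus, window=1)
    for (a, b), n in sorted(counts.items()):
        print(a, b, n)
```

A parallel version would have to handle concurrent updates to the same key, for example with atomic increments or per-thread partial tables that are merged afterwards; the sketch above leaves that out.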

Published
2020-09-14
BARROS, Chayner Cordeiro; MARTINS, Wellington S. Generation of co-occurrence matrix on the GPU with applications in textual bases. In: REGIONAL HIGH PERFORMANCE SCHOOL OF THE MIDWEST (ERAD-CO), 3., 2020, Campo Grande. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 17-20. DOI: https://doi.org/10.5753/eradco.2020.12647.