Heterogeneous Parallel Architecture for Inverted Index Generation

  • Tiago Silveira Pontifical Catholic University of Minas Gerais
  • Felipe Soares Pontifícia Universidade Católica de Minas Gerais
  • Wladmir Brandão Pontifical Catholic University of Minas Gerais
  • Henrique Cota Freitas Pontifícia Universidade Católica de Minas Gerais

Resumo


The amount of data generated on the Web has increased dramatically, as well as the need for computational power to prepare this information. In particular, indexers process these data to extract terms and their occurrences, storing them in an inverted file, a compact data structure that provides quick search. However, this task involves processing of a large amount of data, requiring high computational power. In this article, we present a heterogeneous parallel architecture that uses CPU and GPU in a cluster to accelerate inverted index generation. Experimental results show that the proposed architecture provides faster execution times, up to 60 times in classification and 23 times in the compression of 1 million elements.

Referências

Celikik, M. and Bast, H. (2009). Fast single-pass construction of a half-inverted index.

In Karlgren, J., Tarhio, J., and Hyyrö, H., editors, String Processing and Information Retrieval, pages 194–205, Berlin, Heidelberg. Springer Berlin Heidelberg.

Chen, C. P. and Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information sciences, 275:314–347.

Cho, J. and Garcia-Molina, H. (2002). Parallel crawlers. In Proceedings of the 11th international conference on World Wide Web, pages 124–135. ACM.

Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203.

Guana, V. and Davidson, J. (2012). On comparing inverted index parallel implementations using mapreduce. University of Alberta.

Jung, S., Chang, D.-J., and Park, J. W. (2017). Large scale document inversion using a multi-threaded computing system. SIGAPP Appl. Comput. Rev., 17(2):27–35.

Kucukyilmaz, T., Turk, A., and Aykanat, C. (2012). A parallel framework for inmemory construction of term-partitioned inverted indexes. The Computer Journal, 55(11):1317–1330.

Melink, S., Raghavan, S., Yang, B., and Garcia-Molina, H. (2001). Building a distributed full-text index for the web. ACM Transactions on Information Systems (TOIS), 19(3):217–241.

Moffat, A. and Bell, T. A. H. (1995). In situ generation of compressed inverted files. Journal of the American Society for Information Science, 46(7):537–550.

Pasari, R., Chaudhari, V., Borkar, A., and Joshi, A. (2016). Parallelization of vertical search engine using hadoop and mapreduce. In Proceedings of the International Conference on Advances in Information Communication Technology & Computing, page 51. ACM.

Wei, Z. and JaJa, J. (2012). A fast algorithm for constructing inverted files on heterogeneous platforms. Journal of Parallel and Distributed Computing, 72(5):728 – 738.

Wei, Z. and JaJa, J. (2012). An optimized high-throughput strategy for constructing inverted files. IEEE Transactions on Parallel and Distributed Systems, 23(11):2033–2044.

Zhengyou, L. and Tao, C. (2009). A distributed parallel algorithm for web page inverted indexes construction on the cluster computing systems. In 2009 International Forum on Information Technology and Applications, volume 2, pages 33–36.

Zhou, J., Guo, Q., Jagadish, H. V., Krcal, L., Liu, S., Luan, W., Tung, A. K. H., Yang, Y., and Zheng, Y. (2018). A generic inverted index framework for similarity search on the gpu. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 893–904.
Publicado
08/11/2019
SILVEIRA, Tiago; SOARES, Felipe; BRANDÃO, Wladmir; FREITAS, Henrique Cota. Heterogeneous Parallel Architecture for Inverted Index Generation. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 20. , 2019, Campo Grande. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019 . p. 145-156. DOI: https://doi.org/10.5753/wscad.2019.8664.