Extending a High-Performance Computing Environment for Massive Data Processing

  • Lucas M. Ponce UFMG
  • Walter dos Santos UFMG
  • Wagner Meira Jr. UFMG
  • Dorgival Guedes UFMG

Abstract


High-performance computing (HPC) and massive data processing (Big Data) are two trends in computing systems that are beginning to converge. This paper reports our experience along that path: we extend COMP Superscalar (COMPSs), a parallel and distributed programming model well established in the HPC community, to process massive data. To that end, we integrated COMPSs with HDFS, the most widely used distributed file system for Big Data, and with Lemonade, a data mining and analysis tool developed at UFMG. The results show that the HDFS integration benefits COMPSs through the data abstraction it provides, and that the Lemonade integration makes COMPSs easier to use and helps popularize it in the Data Science community.

Published
2018-05-10
PONCE, Lucas M.; SANTOS, Walter dos; MEIRA JR., Wagner; GUEDES, Dorgival. Extensão de um ambiente de computação de alto desempenho para o processamento de dados massivos. In: BRAZILIAN SYMPOSIUM ON COMPUTER NETWORKS AND DISTRIBUTED SYSTEMS (SBRC), 36., 2018, Campos do Jordão. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 1173-1186. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc.2018.2486.
