Performance Comparison of Virtualized Distributed Environments in Data Mining (Comparação de Desempenho entre Ambientes Distribuídos Virtualizados na Mineração de Dados)

  • Joelson dos Santos (USP)
  • Murilo Naldi (UFV)

Abstract


The large volumes of data generated today are challenging to handle and create the need to distribute and manage huge data sets across separate repositories. New distributed systems have been designed to scale from a single server to thousands of machines. Systems such as Apache Hadoop and Apache Mahout are flexible and reliable, and they support Data Mining techniques. In this context, virtualization has become an important tool for developing cheap and stable systems that support the analysis of large amounts of data. Several consolidated virtualization tools are available on the market, such as VMware, VirtualBox and Xen, among others. However, it may be difficult to determine which tool performs best in a given application scenario. Computational performance evaluation techniques are therefore important for accurately assessing the advantages and disadvantages of each virtualization tool. The main objective of this work is to compare the performance of different distributed, virtualized environments built on VirtualBox, VMware Player and Xen to support data mining tasks executed on the Apache Hadoop and Apache Mahout platforms. The performance of each environment is compared in order to evaluate the advantages of using virtualization in the data mining context.

Keywords: Virtualization, Data Mining, Apache Hadoop, Apache Mahout, Big Data.

Published
2015-07-20
DOS SANTOS, Joelson; NALDI, Murilo. Comparação de Desempenho entre Ambientes Distribuídos Virtualizados na Mineração de Dados. In: WORKSHOP ON PERFORMANCE OF COMPUTER AND COMMUNICATION SYSTEMS (WPERFORMANCE), 14., 2015, Recife. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2015. p. 1-14. ISSN 2595-6167. DOI: https://doi.org/10.5753/wperformance.2015.10393.