Processamento Distribuído de Grafos: Modelagem de Desempenho e Escalonamento de Tarefas Moldáveis

  • Daniel Presser UFSC
  • Frank Siqueira UFSC

Abstract


The ever growing datasets observed in modern applications also applies to datasets modeled as graphs. Several large scale distributed graph processing models, such as Pregel, have been proposed. These models are designed to run in large clusters, where the resources must be allocated efficiently. In this paper we present a prediction model and a scheduler for distributed graph processing jobs. The scheduler treats the jobs as moldable tasks and, based on the predictions, allocates the best number of workers to each job in order to minimize makespan. Experimental results show that the prediction model has accuracy close to 90%, which allows the scheduler to work within the theoretical approximation limits of the optimal makespan.

References

Avery, C. (2011). Giraph: Large-scale graph processing infrastructure on Hadoop. In Proceedings of Hadoop Summit.

Boldi, P. and Vigna, S. (2004). The WebGraph framework I: Compression techniques. In Proc. of the 13th International World Wide Web Conference, pages 595–601, Manhattan, USA.

Dutot, P.-F., Netto, M. A., Goldman, A., and Kon, F. (2005). Scheduling moldable BSP tasks. In Proc. of the 11th International Workshops on Job Scheduling Strategies for Parallel Processing, pages 157–172. Springer.

Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. (2012). PowerGraph: Distributed graph-parallel computation on natural graphs. In Proc. of the 10th Symposium on Operating System Design and Implementation.

Guo, Y., Biczak, M., Varbanescu, A. L., Iosup, A., Martella, C., and Willke, T. L. (2014). How well do graph-processing platforms perform? An empirical performance evaluation and analysis. Proc. of the International Parallel and Distributed Processing Symposium, pages 395–404.

Han, M., Daudjee, K., Ammar, K., Özsu, M. T., Wang, X., and Jin, T. (2014). An experimental comparison of pregel-like graph processing systems. Proceedings of the VLDB Endowment.

Kc, K. and Anyanwu, K. (2010). Scheduling hadoop jobs to meet deadlines. In Proc. of the 2nd IEEE International Conference on Cloud Computing Technology and Science (CloudCom).

Leskovec, J. and Krevl, A. (2014). SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data.

Leung, J. Y. (2004). Handbook of scheduling: algorithms, models, and performance analysis. CRC Press.

Li, Z., Zhang, B., Ren, S., Liu, Y., Qin, Z., Goh, R. S. M., and Gurusamy, M. (2017). Performance modelling and cost effective execution for distributed graph processing on congurable VMs. Proc. of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 74–83.

Lumsdaine, A., Gregor, D., Hendrickson, B., and Berry, J. (2007). Challenges in parallel graph processing. Parallel Processing Letters, 17(01):5–20.

Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. In Proc. of the 2010 ACM SIGMOD International Conference on Management of Data, pages 135–146.

Padala, S., Kumar, D., Raj, A., and Dharanipragada, J. (2015). Octopus: A multi-job scheduler for Graphlab. In Proc. of the 2015 IEEE International Conference on Big Data, pages 293–298.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

Qu, H., Mashayekhi, O., Terei, D., and Levis, P. (2016). Canary: A scheduling architecture for high performance cloud computing. arXiv preprint arXiv:1602.01412.

Salihoglu, S. and Widom, J. (2013). GPS: A graph processing system. In Proc. of the 25th International Conference on Scientic and Statistical Database Management.

Seo, S., Yoon, E. J., Kim, J., Jin, S., Kim, J.-S., and Maeng, S. (2010). Hama: An efcient matrix computation with the mapreduce framework. In Proc. of the 2nd IEEE International Conference on Cloud Computing Technology and Science, pages 721–726.

Turek, J., Wolf, J. L., and Yu, P. S. (1992). Approximate algorithms scheduling parallelizable tasks. In Proceedings of the 4th ACM Symposium on Parallel Algorithms and Architectures.

White, T. (2012). Hadoop: The denitive guide. O’Reilly Media, Inc.

Wolf, J., Rajan, D., Hildrum, K., Khandekar, R., Kumar, V., Parekh, S., Wu, K.-L., and balmin, A. (2010). Flex: A slot allocation scheduling optimizer for mapreduce workloads. In Proc. of the 11th ACM/IFIP/USENIX International Conference on Middleware.

Xin, R. S., Gonzalez, J. E., Franklin, M. J., and Stoica, I. (2013). GraphX: A resilient distributed graph system on spark. In Proc. of the 1st International Workshop on Graph Data Management Experiences and Systems.

Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., et al. (2016). Apache spark: A unied engine for big data processing. Communications of the ACM, 59(11):56–65.
Published
2018-05-10
PRESSER, Daniel; SIQUEIRA, Frank. Processamento Distribuído de Grafos: Modelagem de Desempenho e Escalonamento de Tarefas Moldáveis. In: BRAZILIAN SYMPOSIUM ON COMPUTER NETWORKS AND DISTRIBUTED SYSTEMS (SBRC), 36. , 2018, Campos do Jordão. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018 . p. 1243-1256. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc.2018.2491.

Most read articles by the same author(s)