Uma Implementação MPI Tolerante a Falhas do Algoritmo Hyperquicksort
Resumo
O Hyperquicksort é um algoritmo de ordenação paralela baseado em um hipercubo virtual. As implementações disponíveis desse algoritmo não consideraram a tolerância a falhas. Este trabalho apresenta uma implementação do Hyperquicksort que é capaz de tolerar até n-1 falhas de processos, onde n é o número total de processos empregados na ordenação. O Hyperquicksort tolerante a falhas foi implementado de acordo com a mais recente especificação de tolerância a falhas do MPI, a User Level Failure Mitigation (ULFM). Dessa forma, o Hyperquicksort é programado para se reconfigurar e continuar a sua execução, apesar de falhas de processos. Resultados mostram a eficiência da implementação na ordenação de até 1 bilhão de números inteiros.
Referências
Bland, W., Bosilca, G., Bouteiller, A., Herault, T., and Dongarra, J. (2012a). A proposal for user-level failure mitigation in the mpi-3 standard. Technical report, Department of Electrical Engineering and Computer Science, University of Tennessee.
Bland, W., Bouteiller, A., Hérault, T., Bosilca, G., and Dongarra, J. (2013). Post-failure recovery of MPI communication capability: Design and rationale. International Journal of HPC Applications, 27(3):244–254.
Bland, W., Bouteiller, A., Hérault, T., Hursey, J., Bosilca, G., and Dongarra, J. J. (2012b). An evaluation of user-level failure mitigation support in MPI. In EuroMPI, volume 7490 of LNCC, pages 193–203. Springer.
Bland, W., Du, P., Bouteiller, A., Hérault, T., Bosilca, G., and Dongarra, J. (2012c). A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI. In EuroPar.
Chandra, T. D. and Toueg, S. (1996). Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267.
Chen and Chung (2001). Improved fault-tolerant sorting algorithm in hypercubes. TCS: Theoretical Computer Science, 255.
Cormer, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2009). Introductions to Algorithms’. The MIT Press.
Duarte, Jr., E. P. and Nanya, T. (1998). A hierarchical adaptive distributed system-level diagnosis algorithm. IEEE Transactions on Computers, 47(1):34–45.
Fagg, G. E. and Dongarra, J. (2000). FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In Recent advances in PVM-MPI, LNCS. Springer.
Freiling, F. C., Guerraoui, R., and Kuznetsov, P. (2011). The failure detector abstraction. ACM Comput. Surv., 43(2):9:1–9:40.
Gabriel, E., Fagg, G. E., Bosilca, G., Angskun, T., Dongarra, J. J., Squyres, J. M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R. H., Daniel, D. J., Graham, R. L., and Woodall, T. S. (2004). Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users’ Group Meeting, pages 97–104, Budapest, Hungary.
Gropp, W., Lusk, E., Doss, N., and Skjellum, A. (1996). A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828.
Gropp, W. and Lusk, E. L. (2004). Fault tolerance in message passing interface programs. International Journal of HPC Applications, 18(3):363–372.
Herault, T., Bouteiller, A., Bosilca, G., Gamell, M., Teranishi, K., Parashar, M., and Dongarra, J. (2015). Practical scalable consensus for pseudo-synchronous distributed systems. In SuperComputing conference.
Hoare, C. A. R. (1962). Quicksort. The Computer Journal, 5(4):10–15.
Hursey, J., Graham, R. L., Bronevetsky, G., Buntinas, D., Pritchard, H., and Solt, D. G. (2011). Run-through stabilization: An MPI proposal for process fault tolerance. In EuroMPI.
Johnsson (1984). Combining parallel and sequential sorting on a boolean n-cube. In ICPP: 13th International Conference on Parallel Processing.
Leighton, F. T. (1991). Introduction to Parallel Algorithms and Architectures: Array, Trees, Hypercubes. Morgan Kaufmann Publishers.
MPI Forum (2015). Document for a standard message-passing interface 3.1. Technical report, University of Tennessee, http://www.mpi-forum.org/docs/mpi-3.1.
Parhami, B. (1999). Introduction to Parallel Processing: Algorithms and Architectures. Kluwer Academic Publishers, Norwell, MA, USA.
Plaxton, C. G. (1989). Load balancing, selection sorting on the hypercube. In SPAA, pages 64–73.
Quinn, M. J. M. J. (2003). Parallel programming in C with MPI and OpenMP. McGraw-Hill, pub-MCGRAW-HILL:adr.
Seidel, S. R. and George, W. L. (1988). Binsorting on hypercubes with d-port communication. In Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications - Volume 2, C3P, pages 1455–1461, New York, NY, USA. ACM.
Sheu, Chen, and Chang (1992). Fault-tolerant sorting algorithm on hypercube multicomputers. In ICPP: 21th International Conference on Parallel Processing.
Suo, G., Lu, Y., Liao, X., Xie, M., and Cao, H. (2013). Nr-mpi: A non-stop and fault resilient mpi. In ICPADS.
Wagar, B. (1987). Hyperquicksort: A fast sorting algorithm for hypercubes. Hypercube Multiprocessors, 1987:292–299.
Won, Y. and Sahni, S. (1988). A balanced bin sort for hypercube multicomputers. The Journal of Supercomputing, 2(4).