Distributed Diagnosis with Imperfect Tests Applied to Stability Detection in MPI-Based Systems

Edson Tavares de Camargo; Elias Duarte Jr.; Weyne Pietniczka

doi:10.5753/wscad.2014.15009

Edson Tavares de Camargo (UFPR / UTFPR)
Elias Duarte Jr. (UFPR)
Weyne Pietniczka (UFPR)

Abstract

This work presents a model for distributed diagnosis that assumes imperfect tests: a test executed on a fault-free process may indicate instability. A diagnosis algorithm proposed on top of the model lets the processes tested as fault-free form a core of stable processes. In addition, a process considered unstable that responds correctly to a sequence of tests can have its classification changed after consensus is executed within the core of stable processes. The proposed model was implemented in an MPI-based system. Results show the efficiency of the core of stable processes for parallel sorting based on the HyperQuickSort algorithm, implemented to tolerate up to N-1 unstable processes while sorting up to 1 billion integers.
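
The classification cycle outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the names `Tester`, `record_test`, `reintegrate` and `core_agrees`, the threshold `k` of consecutive passed tests, and the use of a strict majority vote as a stand-in for the consensus step are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

STABLE, UNSTABLE = "stable", "unstable"


@dataclass
class Tester:
    """Hypothetical local view kept by one tester process."""
    n: int        # number of processes in the system
    k: int = 3    # assumed: consecutive passed tests needed before proposing a change
    state: dict = field(default_factory=dict)
    streak: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every process starts in the stable core with no recorded streak.
        for p in range(self.n):
            self.state[p] = STABLE
            self.streak[p] = 0

    def record_test(self, p, passed):
        """Record one test of process p.

        Tests are imperfect: passed=False marks p unstable even though
        p may in fact be fault-free. Returns True when an unstable p has
        passed k tests in a row and may be proposed for reclassification.
        """
        if not passed:
            self.state[p] = UNSTABLE
            self.streak[p] = 0
            return False
        if self.state[p] == UNSTABLE:
            self.streak[p] += 1
            return self.streak[p] >= self.k
        return False

    def core(self):
        """Processes currently classified as stable, in ascending order."""
        return sorted(p for p, s in self.state.items() if s == STABLE)


def core_agrees(votes):
    """Stand-in for consensus in the core (assumed: strict majority)."""
    return 2 * sum(votes) > len(votes)


def reintegrate(tester, p, votes):
    """Reclassify p as stable only if the core agrees."""
    if tester.state[p] == UNSTABLE and core_agrees(votes):
        tester.state[p] = STABLE
        tester.streak[p] = 0
        return True
    return False
```

In this sketch a process leaves the core on a single failed test but returns only after a run of passed tests plus agreement among the core members, which mirrors the asymmetry the abstract describes between being marked unstable and being reclassified.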
