Detecção e Correção de Falhas Transitórias Através da Descrição de Programas Usando Matrizes

Ronaldo R. Ferreira; Álvaro F. Moreira; Luigi Carro

doi:10.5753/wtf.2010.23101

Ronaldo R. Ferreira UFRGS
Álvaro F. Moreira UFRGS
Luigi Carro UFRGS

DOI: https://doi.org/10.5753/wtf.2010.23101

Resumo

Os avanços na fabricação de transistores têm permitido reduzir o tamanho da tecnologia, sustentando a Lei de Moore. Neste cenário, os sistemas embarcados serão projetados com margem reduzida para a implantação de técnicas de tolerância a falhas devido ao aumento no consumo de potência que essas técnicas requerem. Neste artigo, defendemos a detecção e correção (D&C) de falhas em nível de sistema através da codificação de quaisquer programas e algoritmos com matrizes. Essa codificação possibilita empregarmos técnicas estabelecidas de D&C de erros em matrizes, incorrendo em acréscimo inexpressivo de potência e energia. Avaliamos a nossa proposta através de dois estudos de caso relevantes para o domínio de sistemas embarcados, para os quais observamos em alguns casos decréscimo de somente 5% em desempenho e de aumento 8% em tamanho de programa.

Referências

Argyrides, C., Lisboa, C. A. L., Pradhan, D. K., and Carro, L. (2009). A fast error correction technique for matrix multiplication algorithms. In IOLTS ’09: Proc. of the 15th IEEE International On-Line Testing Symposium, pages 133–137, Los Alamitos, CA, USA. IEEE.

Atallah, M. J., Kosaraju, S. R., Larmore, L. L., Miller, G. L., and Teng, S.-H. (1989). Constructing trees in parallel. In SPAA ’89: Proc. of the 1st Ann. ACM Symposium on Parallel Algorithms and Architectures, pages 421–431, New York, NY, USA. ACM.

Blum, L., Shube, M., and Smale, S. (1989). On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines. Bulletin (New Series) of the American Mathematical Society, 21(1):1–46.

Blum, M. and Kanna, S. (1989). Designing programs that check their work. In STOC ’89: Proc. of the 21st Annual ACM Symposium on Theory of Computing, pages 86–97, New York, NY, USA. ACM.

Chen, G. and Kandemir, M. (2005). Improving java virtual machine reliability for memory-constrained embedded systems. In DAC ’05: Proc. of the 42nd Annual Design Automation Conference, pages 690–695, New York, NY, USA. ACM.

Cheng, M.-H. and Hsu, Y.-H. (2003). Fast imdct and mdct algorithms - a matrix approach. IEEE Transactions on Signal Processing, 51(1):221–229.

Cheynet, P., Nicolescu, B., Velazco, R., Rebaudengo, M., Sonza Reorda, M., and Violante, M. (2000). Experimentally evaluating an automatic approach for generating safety-critical software with respect to transient errors. IEEE Transactions on Nuclear Science, 47(6):2231–2236.

Coppersmith, D. and Winograd, S. (1987). Matrix multiplication via arithmetic progressions. In STOC ’87: Proc. of the 19th Annual ACM Symposium on Theory of Computing, pages 1–6, New York, NY, USA. ACM.

Ernst, M. D., Perkins, J. H., Guo, P. J., McCamant, S., Pacheco, C., Tschantz, M. S., and Xiao, C. (2007). The daikon system for dynamic detection of likely invariants. Science of Computer Programming, 69(1-3):35–45.

Florio, V. D. and Blondia, C. (2008). A survey of linguistic structures for application-level fault tolerance. ACM Computing Surveys, 40(2):1–37.

Huang, K.-H. and Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6):518–528.

Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101.

Krishnamurthi, S. and Reiss, S. P. (2003). Automated fault localization using potential invariants. In AADEBUG ’03: Proceedings of the International Workshop on Automated and Algorithmic Debugging, pages 1–4.

Lisboa, C., Erigson, M., and Carro, L. (2007). System level approaches for mitigation of long duration transient faults in future technologies. In ETS ’07: Proc. of the 12th IEEE European Test Symposium, pages 165–172, Los Alamitos, CA, USA. IEEE.

Lisboa, C. A., Grando, C. N., Moreira, A. F., and Carro, L. (2009). Building robust software using invariant checkers to detect run-time transient errors. In DFR ’09: Proceedings of the 1st Workshop on Design for Reliability, pages 48–54.

Motwani, R. and Raghavan, P. (1995). Randomized algorithms. Cambridge University Press, New York, NY, USA.

Oh, N., Mitra, S., and McCluskey, E. J. (2002). Ed4i: Error detection by diverse data and duplicated instructions. IEEE Transactions on Computers, 51(2):180–199.

Pattabiraman, K., Kalbarczyk, Z., and Iyer, R. K. (2007). Automated derivation of application-aware error detectors using static analysis. In IOLTS ’07: Proceedings of the 13th IEEE International On-Line Testing Symposium, pages 211–216, Washington, DC, USA. IEEE.

Prata, P. and Silva, J. G. (1999). Algorithm based fault tolerance versus result-checking for matrix computations. In FCTS ’99: Proc. of the 29th Annual International Symposium on Fault-Tolerant Computing, pages 4–11, Los Alamitos, CA, USA. IEEE.

Princen, J., Johnson, A., and Bradley, A. (1987). Subband/transform coding using filter bank designs based on time domain aliasing cancellation. In ICASSP ’87: Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing, volume 12, pages 2161–2164, Los Alamitos, CA, USA. IEEE.

Vemu, R., Gurumurthy, S., and Abraham, J. (2007). ACCE: Automatic correction of control-flow errors. In ITC ’07. IEEE International Test Conference, pages 1–10.