Detecção e Correção de Falhas Transitórias Através da Descrição de Programas Usando Matrizes
Abstract
Advances in transistor manufacturing have enabled technology scaling along the years, sustaining Moore’s law. In this scenario, embedded systems will face restricted resources available to deploy fault-tolerance due the increase of power consumption that these techniques require. In this paper, we claim the detection and correction (D&C) of failures at system level by using matrices to encode whole programs and algorithms. With such encoding, it is possible to employ estabilished D&C techniques of errors occurring in matrices, running with unexpressive overhead of power and energy. We evaluated this proposal using two case studies significant for the embedded system domain. We observed in some cases an overhead of only 5% in performance and 8% in program size.
References
Atallah, M. J., Kosaraju, S. R., Larmore, L. L., Miller, G. L., and Teng, S.-H. (1989). Constructing trees in parallel. In SPAA ’89: Proc. of the 1st Ann. ACM Symposium on Parallel Algorithms and Architectures, pages 421–431, New York, NY, USA. ACM.
Blum, L., Shube, M., and Smale, S. (1989). On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines. Bulletin (New Series) of the American Mathematical Society, 21(1):1–46.
Blum, M. and Kanna, S. (1989). Designing programs that check their work. In STOC ’89: Proc. of the 21st Annual ACM Symposium on Theory of Computing, pages 86–97, New York, NY, USA. ACM.
Chen, G. and Kandemir, M. (2005). Improving java virtual machine reliability for memory-constrained embedded systems. In DAC ’05: Proc. of the 42nd Annual Design Automation Conference, pages 690–695, New York, NY, USA. ACM.
Cheng, M.-H. and Hsu, Y.-H. (2003). Fast imdct and mdct algorithms - a matrix approach. IEEE Transactions on Signal Processing, 51(1):221–229.
Cheynet, P., Nicolescu, B., Velazco, R., Rebaudengo, M., Sonza Reorda, M., and Violante, M. (2000). Experimentally evaluating an automatic approach for generating safety-critical software with respect to transient errors. IEEE Transactions on Nuclear Science, 47(6):2231–2236.
Coppersmith, D. and Winograd, S. (1987). Matrix multiplication via arithmetic progressions. In STOC ’87: Proc. of the 19th Annual ACM Symposium on Theory of Computing, pages 1–6, New York, NY, USA. ACM.
Ernst, M. D., Perkins, J. H., Guo, P. J., McCamant, S., Pacheco, C., Tschantz, M. S., and Xiao, C. (2007). The daikon system for dynamic detection of likely invariants. Science of Computer Programming, 69(1-3):35–45.
Florio, V. D. and Blondia, C. (2008). A survey of linguistic structures for application-level fault tolerance. ACM Computing Surveys, 40(2):1–37.
Huang, K.-H. and Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6):518–528.
Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101.
Krishnamurthi, S. and Reiss, S. P. (2003). Automated fault localization using potential invariants. In AADEBUG ’03: Proceedings of the International Workshop on Automated and Algorithmic Debugging, pages 1–4.
Lisboa, C., Erigson, M., and Carro, L. (2007). System level approaches for mitigation of long duration transient faults in future technologies. In ETS ’07: Proc. of the 12th IEEE European Test Symposium, pages 165–172, Los Alamitos, CA, USA. IEEE.
Lisboa, C. A., Grando, C. N., Moreira, A. F., and Carro, L. (2009). Building robust software using invariant checkers to detect run-time transient errors. In DFR ’09: Proceedings of the 1st Workshop on Design for Reliability, pages 48–54.
Motwani, R. and Raghavan, P. (1995). Randomized algorithms. Cambridge University Press, New York, NY, USA.
Oh, N., Mitra, S., and McCluskey, E. J. (2002). Ed4i: Error detection by diverse data and duplicated instructions. IEEE Transactions on Computers, 51(2):180–199.
Pattabiraman, K., Kalbarczyk, Z., and Iyer, R. K. (2007). Automated derivation of application-aware error detectors using static analysis. In IOLTS ’07: Proceedings of the 13th IEEE International On-Line Testing Symposium, pages 211–216, Washington, DC, USA. IEEE.
Prata, P. and Silva, J. G. (1999). Algorithm based fault tolerance versus result-checking for matrix computations. In FCTS ’99: Proc. of the 29th Annual International Symposium on Fault-Tolerant Computing, pages 4–11, Los Alamitos, CA, USA. IEEE.
Princen, J., Johnson, A., and Bradley, A. (1987). Subband/transform coding using filter bank designs based on time domain aliasing cancellation. In ICASSP ’87: Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing, volume 12, pages 2161–2164, Los Alamitos, CA, USA. IEEE.
Vemu, R., Gurumurthy, S., and Abraham, J. (2007). ACCE: Automatic correction of control-flow errors. In ITC ’07. IEEE International Test Conference, pages 1–10.
