DF-DTM: Exploiting Task Redundancy in Dataflow
Abstract
Instruction reuse is a technique adopted in Von Neumann architectures to improve performance by avoiding the redundant execution of instructions (or instruction traces) when the result to be produced can be retrieved from a table holding the history of input and output operands of that instruction. However, these techniques still need to be studied in the context of the Dataflow model, which has gained prominence in the high-performance computing community due to its inherent parallelism. This work proposes an approach to reuse in dataflow, called DF-DTM (Dataflow Dynamic Task Memoization). The technique supports reuse at the level of nodes and subgraphs, which is analogous to the reuse of instructions and traces, respectively. The potential of DF-DTM is evaluated through a series of experiments with three relevant applications, resulting in the reuse of up to 97% of the executed tasks.
References
Alves, T. A. O., Goldstein, B. F., França, F. M. G., and Marzulo, L. A. J. (2014). A minimalistic dataflow programming library for Python. In Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on, pages 96–101.
Bosilca, G., Bouteiller, A., Danalis, A., Hérault, T., Lemarinier, P., and Dongarra, J. (2012). Dague: A generic distributed dag engine for high performance computing. Parallel Computing, 38(1-2):37–51.
da Costa, A. T., Franca, F. M. G., and Filho, E. M. C. (2000). The dynamic trace memoization reuse technique. In Parallel Architectures and Compilation Techniques, 2000. Proceedings. International Conference on, pages 92–99.
Duran, A., Ayguadé, E., Badia, R. M., Labarta, J., Martinell, L., Martorell, X., and Planas, J. (2011). Ompss: A proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21:173–193.
Gajinov, V., Stipić, S., Erić, I., Unsal, O. S., Ayguadé, E., and Cristal, A. (2014). Dash: A benchmark suite for hybrid dataflow and shared memory programming models: with comparative evaluation of three hybrid dataflow models. In Proceedings of the 11th ACM Conference on Computing Frontiers, CF '14, pages 4:1–4:11, New York, NY, USA. ACM.
Giorgi, R. et al. (2014). TERAFLUX: Harnessing dataflow in next generation teradevices. Microprocessors and Microsystems, pages –.
Marzulo, L. A., Alves, T. A., França, F. M., and Costa, V. S. (2014). Couillard: Parallel programming via coarse-grained data-flow compilation. Parallel Computing, 40(10):661–680.
Michie, D. (1968). "Memo" Functions and Machine Learning. Nature, 218:19–22.
Gilmore, P. C. and Gomory, R. E. (1961). A linear programming approach to the cutting-stock problem. Operations Research, 9(6):849–859.
Pell, O., Mencer, O., Tsoi, K., and Luk, W. (2013). Maximum performance computing with dataflow engines, pages 747–774.
Shibata, Y., Tsumura, T., Tsumura, T., and Nakashima, Y. (2014). An implementation of auto-memoization mechanism on arm-based superscalar processor. In System-on-Chip (SoC), 2014 International Symposium on, pages 1–8.
Sodani, A. and Sohi, G. S. (1997). Dynamic instruction reuse. In Computer Architecture, 1997. Conference Proceedings. The 24th Annual International Symposium on, pages 194–205.
Swanson, S., Michelson, K., Schwerin, A., and Oskin, M. (2003). Wavescalar. In Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on, pages 291–302.
Tsai, Y. Y. and Chen, C. H. (2011). Energy-efficient trace reuse cache for embedded processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(9):1681–1694.
Wozniak, J., Armstrong, T., Wilde, M., Katz, D., Lusk, E., and Foster, I. (2013). Swift/t: Large-scale application composition via distributed-memory dataflow processing. In Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, pages 95–102.
Published
2016-10-05
How to Cite
ROUBERTE, Leandro; SENA, Alexandre; NERY, Alexandre; MARZULO, Leandro; ALVES, Tiago; FRANÇA, Felipe. DF-DTM: explorando redundância de tarefas em Dataflow. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 17., 2016, Aracajú. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2016. p. 275-286. DOI: https://doi.org/10.5753/wscad.2016.14266.