Maximizing GPU Resource Usage by Reordering the Submission of Concurrent Kernels
Abstract
The increase in the amount of resources available in modern GPUs has sparked renewed interest in the problem of sharing those resources among different kernels. The latest generations of GPUs allow concurrent kernel execution, but they are still limited by the fact that scheduling decisions are made by the hardware at runtime. These decisions depend on the order in which kernels are submitted for execution, which can produce executions in which the GPU does not necessarily reach its best occupancy rate. In this work, we propose an optimization that reorders kernel submission with two goals: maximizing resource utilization and improving the average turnaround time. We model the assignment of kernels to the GPU as a series of knapsack problems and use a dynamic programming approach to solve them. We evaluated our proposal using kernels with different sizes and resource requirements. Our results show significant gains in average turnaround time and throughput compared with the default kernel submission implemented in modern GPUs.
References
Adriaens, J. T., Compton, K., Kim, N. S., and Schulte, M. J. (2012). The case for GPGPU spatial multitasking. In IEEE 18th International Symposium on High Performance Computer Architecture (HPCA), pages 1–12. IEEE.
Choi, H. J., Son, D. O., Kang, S. G., Kim, J. M., Lee, H.-H., and Kim, C. H. (2013). An efficient scheduling scheme using estimated execution time for heterogeneous computing systems. The Journal of Supercomputing, 65(2):886–902.
Eyerman, S. and Eeckhout, L. (2008). System-level performance metrics for multiprogram workloads. IEEE Micro, 28(3):42–53.
Gregg, C., Dorn, J., Hazelwood, K., and Skadron, K. (2012). Fine-grained resource sharing for concurrent GPGPU kernels. In 4th USENIX Workshop on Hot Topics in Parallelism (HotPar).
Li, T., Narayana, V. K., and El-Ghazawi, T. (2015). A power-aware symbiotic scheduling algorithm for concurrent GPU kernels. In The 21st IEEE International Conference on Parallel and Distributed Systems (ICPADS).
Liang, Y., Huynh, P., Rupnow, K., Goh, R., and Chen, D. (2015). Efficient GPU spatial-temporal multitasking. IEEE Transactions on Parallel and Distributed Systems, 26:748–760.
Lopez-Novoa, U., Mendiburu, A., and Miguel-Alonso, J. (2015). A survey of performance modeling and simulation techniques for accelerator-based computing. IEEE Transactions on Parallel and Distributed Systems, 26(1):272–281.
Martello, S. and Toth, P. (1990). Knapsack problems: algorithms and computer implementations. John Wiley & Sons, Inc.
NVIDIA (2016). CUDA Profiler. http://docs.nvidia.com/cuda/profiler-users-guide.
Pai, S., Thazhuthaveetil, M. J., and Govindarajan, R. (2013). Improving GPGPU concurrency with elastic kernels. In ACM SIGPLAN Notices, volume 48, pages 407–418.
Peters, H., Koper, M., and Luttenberger, N. (2010). Efficiently using a CUDA-enabled GPU as shared resource. In IEEE 10th International Conference on Computer and Information Technology (CIT), pages 1122–1127. IEEE.
Ravi, V. T., Becchi, M., Agrawal, G., and Chakradhar, S. (2011). Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework. In Proceedings of the 20th international symposium on High performance distributed computing, pages 217–228. ACM.
Wang, L., Huang, M., and El-Ghazawi, T. (2011). Exploiting concurrent kernel execution on graphic processing units. In International Conference on High Performance Computing and Simulation (HPCS), pages 24–32. IEEE.
Wende, F., Cordes, F., and Steinke, T. (2012). On improving the performance of multi-threaded CUDA applications with concurrent kernel execution by kernel reordering. In Symposium on Application Accelerators in High Performance Computing (SAAHPC), pages 74–83.
Zhong, J. and He, B. (2014). Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling. IEEE Transactions on Parallel and Distributed Systems, 25(6):1522–1532.
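The abstract describes modeling the assignment of kernels to the GPU as a series of knapsack problems solved by dynamic programming. As a rough illustration only (not the authors' implementation), the sketch below applies the classic 0/1 knapsack DP to choose a set of pending kernels that fits a single abstract resource capacity; the kernel names, resource demands, and "value" scores are hypothetical, and a real scheduler would need to handle multiple resource dimensions (threads, registers, shared memory) per streaming multiprocessor.

```python
def select_kernels(kernels, capacity):
    """Pick a subset of kernels maximizing total value within a resource capacity.

    kernels  -- list of (name, resource_demand, value) tuples; 'value' could be
                an estimated occupancy or throughput gain (hypothetical here).
    capacity -- integer amount of the single abstract resource available.
    Returns (best_total_value, list_of_selected_kernel_names).
    """
    n = len(kernels)
    # best[c] = best achievable value with capacity c, over kernels seen so far.
    best = [0] * (capacity + 1)
    # choice[i][c] = True if kernel i is taken at residual capacity c.
    choice = [[False] * (capacity + 1) for _ in range(n)]

    for i, (_, demand, value) in enumerate(kernels):
        # Iterate capacity downward so each kernel is used at most once (0/1).
        for c in range(capacity, demand - 1, -1):
            if best[c - demand] + value > best[c]:
                best[c] = best[c - demand] + value
                choice[i][c] = True

    # Backtrack to recover which kernels were selected.
    selected, c = [], capacity
    for i in range(n - 1, -1, -1):
        if choice[i][c]:
            selected.append(kernels[i][0])
            c -= kernels[i][1]
    return best[capacity], selected[::-1]


# Hypothetical pending kernels: (name, resource demand, estimated value).
pending = [("k1", 4, 40), ("k2", 3, 30), ("k3", 5, 50)]
total, picked = select_kernels(pending, capacity=8)
```

In the paper's setting, this selection step would be repeated ("a series of knapsack problems"): each round picks a batch of kernels to submit together, the batch is removed from the queue, and the process repeats for the remaining kernels.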
Published
05/10/2016
How to Cite
BREDER, Bernardo; CHARLES, Eduardo; CRUZ, Rommel; CLUA, Esteban; BENTES, Cristiana; DRUMMOND, Lucia. Maximizando o Uso dos Recursos de GPU Através da Reordenação da Submissão de Kernels Concorrentes. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 17., 2016, Aracaju. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2016. p. 251-262. DOI: https://doi.org/10.5753/wscad.2016.14264.