Analyzing and Estimating the Performance of Concurrent Kernels Execution on GPUs
GPUs have established a new baseline for power efficiency and computing power, delivering higher bandwidth and more computing units with each new generation. Modern GPUs support concurrent kernel execution to maximize resource utilization, allowing one kernel to exploit resources that another leaves idle. However, the decision to execute different kernels simultaneously is made by the hardware, and GPUs sometimes do not allow blocks from other kernels to execute even when resources are available. In this work, we present an in-depth study of the simultaneous execution of kernels on the GPU. We present the necessary conditions for executing kernels simultaneously, define the factors that influence competition among them, and describe a model that can estimate the resulting performance degradation. Finally, we validate the model using synthetic and real-world kernels with different computation and memory requirements.