Cache Pollution and Thrashing in High-Performance Parallel Applications
Abstract
As processors evolve, the performance of computer systems is increasingly limited by memory access time. Caches are used to mitigate this problem, but the data stored in them must be managed intelligently to prevent problems such as pollution and thrashing from degrading their performance. This work presents an analysis of cache pollution and thrashing in high-performance parallel applications. The results show that caches with higher associativity suffer more from these problems. Up to 28% of the cache misses in the L1 could be avoided with a smarter cache replacement policy, rising to 62% in the L2 cache and 98% in the LLC.
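To make the notion of misses that a smarter replacement policy could avoid more concrete, the following minimal sketch compares an LRU baseline against an oracle policy in the style of Belady's optimal algorithm on a single cache set. It is an illustration under stated assumptions, not the methodology used in this work: the function names, the toy trace, and the choice of LRU as the baseline are all hypothetical.

```python
from collections import OrderedDict


def lru_misses(trace, ways):
    """Count misses for one cache set managed with LRU (assumed baseline)."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)  # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) >= ways:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return misses


def opt_misses(trace, ways):
    """Count misses for an oracle that evicts the block reused farthest in the future."""
    cache = set()
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= ways:
            def next_use(b):
                # Position of the next reference to b, or infinity if never reused.
                try:
                    return trace.index(b, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses


# Hypothetical thrashing pattern: 5 blocks cycling through a 4-way set.
trace = [0, 1, 2, 3, 4] * 4
lru = lru_misses(trace, ways=4)
opt = opt_misses(trace, ways=4)
print(f"LRU misses: {lru}, oracle misses: {opt}, avoidable: {(lru - opt) / lru:.0%}")
```

In this toy trace the working set exceeds the set's capacity by a single block, so LRU evicts each block just before it is reused, while the oracle keeps most of the working set resident. The gap between the two miss counts is one possible way to read "misses that could be avoided"; real measurements depend on the simulator, the workloads, and the cache hierarchy.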