Efficient ECC-Based Directory Implementations for Scalable Multiprocessors

Kourosh Gharachorloo; Luiz André Barroso; Andreas Nowatzyk

doi:10.5753/sbac-pad.2000.41216

Kourosh Gharachorloo, Compaq Computer Corporation
Luiz André Barroso, Compaq Computer Corporation
Andreas Nowatzyk, Compaq Computer Corporation

Abstract

With increasing chip densities, next-generation microprocessor designs have the opportunity to integrate many of the traditional system-level modules onto the same chip as the processor. This integration changes some of the design trade-offs for how and where to store directory information. One extremely attractive option is to support directory data with virtually no memory space overhead by computing memory ECC at a coarser granularity and using the unused bits to store the directory information. Compared to providing a dedicated memory and datapath for directory storage, this approach leads to lower cost and a simpler design by requiring fewer components and pins. Furthermore, this approach leverages the low-latency, high-bandwidth path to memory provided by the integration of memory controllers onto the processor chip. However, without careful design, maintaining data and directory bits together can lead to inefficiencies in the form of extra memory bandwidth usage, extra memory controller occupancy, and extra memory latency. This paper describes the techniques used in the context of the Piranha design [3] to provide an efficient ECC-based directory implementation that addresses the occupancy/bandwidth and latency issues. Our approach for dealing with the occupancy/bandwidth issues involves either eliminating the extra read and write operations or performing partial memory accesses (instead of accessing the whole block). This is achieved by a combination of techniques which include (i) augmenting the L2 caching state to keep track of some critical directory state, (ii) making up dummy data for protocol transactions with a stale memory copy, and (iii) maintaining a partial ECC that is used to compute the combined ECC of the data and the modified directory bits without needing the actual data bits.
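Technique (iii) can be illustrated with a small sketch. The abstract does not say how the partial ECC is defined, so the following assumes a linear code over GF(2), in which the check bits of a word equal the XOR of the contributions of its individual bits. Under that assumption, a stored data-only ECC can be combined with the contribution of new directory bits without reading the data. The toy `column` function, the 44-bit `DIR_BITS` width, and all function names are hypothetical and are not taken from the paper.

```python
DIR_BITS = 44  # assumed width of the directory field (hypothetical)

def column(pos: int) -> int:
    """Check-bit pattern contributed by bit position `pos`.
    Toy choice for illustration only; not a real SECDED matrix."""
    return pos + 1

def ecc(word: int) -> int:
    """Linear checksum over GF(2): XOR of the columns of all set bits."""
    check, pos = 0, 0
    while word:
        if word & 1:
            check ^= column(pos)
        word >>= 1
        pos += 1
    return check

def pack(data: int, directory: int) -> int:
    """Place the directory bits below the data bits in one protected word."""
    assert 0 <= directory < (1 << DIR_BITS)
    return (data << DIR_BITS) | directory

def partial_ecc(data: int) -> int:
    """Data-only contribution, computed once while the data is available."""
    return ecc(pack(data, 0))

def combined_ecc(partial: int, directory: int) -> int:
    """Combined ECC from the stored partial ECC and the directory bits alone."""
    return partial ^ ecc(directory)

data = 0x0123456789ABCDEF
partial = partial_ecc(data)
for directory in (0b1010, 0b0110, (1 << DIR_BITS) - 1):
    # The shortcut matches a full recomputation over data and directory.
    assert combined_ecc(partial, directory) == ecc(pack(data, directory))
```

The loop at the end checks that a directory update needs only the stored partial value and the new directory bits, which is the property the abstract relies on to avoid reading the data.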
To address the latency issues, we replicate critical directory state in different segments of the memory line, which allows us to efficiently support the critical-word-first optimization by pipelining data from memory to the requester before all the data is read from memory. The combination of the above techniques also eliminates all the inefficiencies that arise due to maintaining a combined ECC for directory and data bits. Therefore, we benefit from the more efficient use of bits provided by the combined ECC with virtually no performance penalty compared to maintaining separate ECC bits for data and directory. Finally, the optimizations used in Piranha are general and applicable to other designs that use ECC-based directories.
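The bit savings from coarser ECC can be estimated with a short calculation. The sketch below assumes a Hamming-based SECDED code (single error correction, double error detection), a 64-byte memory line, and a baseline that protects each 64-bit word separately. None of these parameters are stated in the abstract, and the function names are hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SECDED code over `data_bits` bits:
    smallest r with 2**r >= data_bits + r + 1, plus one overall parity bit."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1

def freed_bits(line_bits: int, baseline_granule: int, coarse_granule: int) -> int:
    """Bits left over per line when ECC moves from the baseline to a coarser granule."""
    budget = (line_bits // baseline_granule) * secded_check_bits(baseline_granule)
    needed = (line_bits // coarse_granule) * secded_check_bits(coarse_granule)
    return budget - needed

LINE_BITS = 64 * 8  # assumed 64-byte memory line
for granule in (64, 128, 256, 512):
    print(granule, secded_check_bits(granule), freed_bits(LINE_BITS, 64, granule))
```

Under these assumptions, protecting 256-bit segments instead of 64-bit words needs 20 check bits per line rather than 64, which leaves 44 bits available for directory state.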

References

[1] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz. An Evaluation of Directory Schemes for Cache Coherence. In ACM Annual International Symposium on Computer Architecture, pages 280–289, May 1988.

[2] P. Bannon. Alpha 21364: A Scalable Single-chip SMP. Presented at the Microprocessor Forum ’98, October 1998.

[3] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 282–293, Vancouver, Canada, June 2000.

[4] L. A. Barroso, K. Gharachorloo, A. Nowatzyk, and B. Verghese. Impact of Chip-Level Integration on Performance of OLTP Workloads. In 6th International Symposium on High-Performance Computer Architecture, January 2000.

[5] A. Gupta, W.-D. Weber, and T. Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. In International Conference on Parallel Processing, July 1990.

[6] T. Horel and G. Lauterbach. UltraSPARC-III: Designing Third-Generation 64-Bit Performance. IEEE Micro, Volume 19, No. 3, May/June 1999.

[7] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In ACM Annual International Symposium on Computer Architecture, pages 241–251, June 1997.

[8] A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, W. Radke, and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. In International Conference on Parallel Processing (ICPP’95), pages I.1–I.10, July 1995.

[9] A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In 23rd Annual International Symposium on Computer Architecture, May 1996.