Fetching One Scalar Instruction per Cycle Does Not Inhibit Instruction-Level Parallelism
Abstract
Superscalar machines fetch multiple scalar instructions from the instruction cache per cycle. However, machines that fetch only one scalar instruction per clock cycle from the instruction cache, such as those based on the Dynamically Trace Scheduled VLIW (DTSVLIW) architecture, have shown performance comparable to that of Superscalar machines. In this work, we show experimentally that, thanks to the locality of execution present in programs, fetching a single scalar instruction per machine cycle from the instruction cache is enough to achieve practically the same performance as fetching several instructions per cycle. We also present the first direct comparison of the Superscalar, Trace Cache, and DTSVLIW architectures. Our results show that a DTSVLIW machine capable of executing up to 16 instructions per cycle outperforms a Superscalar machine by 21.9% and a Trace Cache machine by 6.6%, each with equivalent hardware.