A Systolic Array Optimizing Compiler
Springer Science & Business Media, 31/01/1989 - 202 pages

This book is a revision of my Ph.D. dissertation submitted to Carnegie Mellon University in 1987. It documents the research and results of the compiler technology developed for the Warp machine. Warp is a systolic array built out of custom, high-performance processors, each of which can execute up to 10 million floating-point operations per second (10 MFLOPS). Under the direction of H. T. Kung, the Warp machine matured from an academic, experimental prototype to a commercial product of General Electric. The Warp machine demonstrated that the scalable architecture of high-performance, programmable systolic arrays represents a practical, cost-effective solution to present and future computation-intensive applications. The success of Warp led to the follow-on iWarp project, a joint project with Intel, to develop a single-chip 20 MFLOPS processor. The availability of the highly integrated iWarp processor will have a significant impact on parallel computing.

One of the major challenges in the development of Warp was to build an optimizing compiler for the machine. First, the processors in the array cooperate at a fine granularity of parallelism; interaction between processors must be considered in the generation of code for individual processors. Second, the individual processors themselves derive their performance from a VLIW (Very Long Instruction Word) instruction set and a high degree of internal pipelining and parallelism. The compiler contains optimizations pertaining to the array level of parallelism, as well as optimizations for the individual VLIW processors.
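The central cell-level technique in the book is software pipelining: overlapping successive loop iterations so that a new iteration starts every "initiation interval" cycles instead of waiting for the previous one to drain. The following is a toy sketch of the cycle-count arithmetic only, not code from the book; the three-stage loop body and the functions are hypothetical illustrations.

```python
# Toy illustration of why software pipelining pays off (not from the book).
# Assume a hypothetical 3-stage loop body: load, multiply, store.

STAGES = 3  # depth of the loop body in cycles (hypothetical)

def sequential_cycles(n_iters, stages=STAGES):
    """Each iteration runs to completion before the next one starts."""
    return n_iters * stages

def pipelined_cycles(n_iters, ii=1, stages=STAGES):
    """A new iteration is initiated every `ii` cycles; the final
    iteration still needs `stages` cycles to drain the pipeline."""
    return (n_iters - 1) * ii + stages

if __name__ == "__main__":
    n = 100
    print(sequential_cycles(n))   # 300 cycles without overlap
    print(pipelined_cycles(n))    # 102 cycles with initiation interval 1
```

With an initiation interval of 1, the steady state sustains one result per cycle; the chapter 5 material (bounds on the initiation interval, scheduling cyclic graphs, modulo variable expansion) is about achieving the smallest feasible interval under resource and dependence constraints.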
Contents
Introduction | 1 |
1.1 Research approach | 4 |
1.2 Overview of results | 5 |
1.2.2 Cell level optimizations | 6 |
1.3 This presentation | 8 |
Architecture of Warp | 11 |
2.1 The architecture | 13 |
2.1.1 Warp cell | 14 |
2.1.2 Interface unit | 18 |
2.2 Application domain of Warp | 19 |
2.3 Programming complexity | 21 |
A Machine Abstraction | 25 |
3.1 Previous systolic array synthesis techniques | 26 |
3.2 Comparisons of machine abstractions | 29 |
3.2.1 Programmability | 30 |
3.2.1.1 Partitioning methods | 31 |
3.2.1.2 Programmability of synchronous models | 32 |
3.2.2 Efficiency | 35 |
asynchronous communication | 37 |
3.3.1 Effect of parallelism in cells | 39 |
3.3.2 Scheduling constraints between receives and sends | 43 |
3.3.2.1 The problem | 44 |
3.3.2.2 The analysis | 45 |
3.3.2.3 Implications | 48 |
3.3.3 Discussion on the asynchronous communication model | 50 |
3.4 Hardware and software support | 51 |
queues | 52 |
3.4.2 Compile-time flow control | 53 |
3.4.2.1 The skewed computation model | 54 |
3.4.2.2 Algorithm to find the minimum skew | 58 |
3.4.2.3 Hardware design | 64 |
3.5 Chapter summary | 66 |
The W2 Language and Compiler | 69 |
4.1 The W2 language | 70 |
4.2 Compiler overview | 73 |
4.3 Scheduling a basic block | 77 |
4.3.2 List scheduling | 79 |
4.3.3 Ordering and priority function | 80 |
Software Pipelining | 83 |
5.1 Introduction to software pipelining | 87 |
5.2 The scheduling problem | 91 |
5.2.2 Definition and complexity of problem | 95 |
5.3 Scheduling algorithm | 96 |
5.3.1 Bounds on the initiation interval | 98 |
5.3.2 Scheduling an acyclic graph | 100 |
5.3.3 Scheduling a cyclic graph | 103 |
5.3.3.1 Combining strongly connected components | 104 |
5.3.3.2 Scheduling a strongly connected component | 106 |
5.3.3.3 Complete algorithm | 113 |
5.4 Modulo variable expansion | 114 |
5.5 Code size requirement | 117 |
5.6 Comparison with previous work | 120 |
5.6.2 The polycyclic machine | 121 |
5.7 Chapter summary | 123 |
Hierarchical Reduction | 125 |
6.1 The iterative construct | 128 |
6.2 The conditional construct | 130 |
6.2.1 Branches taking different amounts of time | 133 |
6.2.2 Code size | 134 |
6.3 Global code motions | 135 |
6.4 Comparison with previous work | 139 |
6.4.1 Trace scheduling | 140 |
6.4.1.1 Loop branches | 141 |
6.4.1.2 Conditionals | 143 |
6.4.2 Comparison with vector instructions | 144 |
Evaluation | 147 |
7.1 The experiment | 148 |
7.1.1 Status of compiler | 149 |
7.1.2 The programs | 151 |
7.2 Performance analysis of global scheduling techniques | 158 |
7.2.2 Efficiency of scheduler | 161 |
7.2.2.1 Exclusive IO time | 163 |
7.2.2.2 Global resource use count | 165 |
7.2.2.3 Data dependency | 168 |
7.2.3 Discussion on effectiveness of the Warp architecture | 169 |
7.3 Performance of software pipelining | 170 |
7.3.2 Effectiveness of software pipelining | 171 |
7.3.3 Feasibility of software pipelining | 176 |
7.4 Livermore Loops | 180 |
7.5 Summary and discussion | 183 |
Conclusions | 187 |
8.1 Machine abstraction for systolic arrays | 188 |
8.2 Code scheduling techniques | 190 |
Popular passages
Page 194 - Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 62-73, 1992.
References to this book
iWarp: Anatomy of a Parallel Computing System — Thomas Gross, David Richard O'Hallaron, 1998