
79f87b9a951bb170c9440c35edd94315.ppt
- Number of slides: 44
New Opportunities with Platform Based Design
Frank Vahid, Associate Professor, Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend
Frank Vahid, UC Riverside
How Much is Enough?
How Much is Enough? Perhaps a bit small
How Much is Enough? Reasonably sized
How Much is Enough? Probably plenty big
How Much is Enough? More than typically necessary
How Much is Enough? Very few people could use this
How Much Custom Logic is Enough? 1993: ~1 million logic transistors. Perhaps a bit small. [Figure: IC inside its IC package]
How Much Custom Logic is Enough? 1996: ~5-8 million logic transistors. Reasonably sized
How Much Custom Logic is Enough? 1999: ~10-50 million logic transistors. Probably plenty big
How Much Custom Logic is Enough? 2002: ~100-200 million logic transistors. More than typically necessary
How Much Custom Logic is Enough?
- 2008: >1 BILLION logic transistors. Perhaps very few people could design this
- Point of diminishing returns
  - 1993: 1 M
  - 32-bit ARM: ~30 K
  - MPEG dcd: ~1 M
- Other examples
  - Fast cars (> 100 mph)
  - High res digital cameras (> 4 M)
  - Disk space
  - Even IC performance
Very Few Companies Can Design High-End ICs
[Figure: design productivity gap, 1981-2007. Axes: logic transistors per chip (in millions) for IC capacity, and productivity (K) trans./staff-mo. The gap between IC capacity and productivity widens over time. Source: ITRS'99]
- Designer productivity growing at slower rate
  - 1981: 100 designer months, ~$1 M
  - 2002: 30,000 designer months, ~$300 M
Meanwhile, ICs Themselves are Costlier
Tech (micron):  0.8      0.35     0.18     0.13
NRE:            $40k     $100k    $350k    $1,000k
Turnaround:     42 days  49 days  56 days  76 days
Market:         $3.5B    $6B      $12B     $18B
Source: DAC'01 panel on embedded programmable logic
- And take longer to fabricate
- While market windows are shrinking
- Fewer than 1,000 out of 10,000 ASIC designs have volumes to justify fabrication in 0.13 micron
Summarizing So Far...
- Transistors are less scarce
  - ICs are big enough, fast enough
- ICs take more time and money to design and fabricate
  - While market windows are shrinking
- Designers: buy pre-fabricated system-level ICs (platforms)
Trend Towards Pre-Fabricated Platforms: ASSPs
- ASSP: application specific standard product
  - Domain-specific prefabricated IC
  - e.g., digital camera IC
- ASIC: application specific IC
- ASSP revenue > ASIC
- ASSP design starts > ASIC
  - Unique IC design
  - Ignores quantity of same IC
- ASIC design starts decreasing
  - Due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest September'01
Will High End ICs Still be Made?
- YES
- The point is that mainstream designers likely won't be making them
- Very high volume or very high cost products
  - Becoming out of reach of mainstream designers
- Platforms are one such product: high volume
  - Need to be highly configurable to adapt to different applications and constraints
Configurable Platform Design: Cache
[Figure: pre-fabricated platform IC (a pre-designed system-level architecture) containing uP, DSP, L1 cache, JPEG dcd, peripherals, and FPGA]
- ARM920T: Caches consume half of total power (Segars 01)
- M*CORE: Unified cache consumes half of total power (Lee/Moyer/Arends 99)
Best Cache Architecture for Embedded Systems
- Not clear
  - Huge variety among popular embedded processors
- What's the best...
  - Associativity, Line size, Total size?
Cache Associativity
- Direct mapped cache (1-way set associative)
  - Certain bits "index" into cache
  - Remaining "tag" bits compared
- Set associative cache
  - Multiple "ways"
  - Fewer index bits, more tag bits, simultaneous comparisons
  - More expensive, but better hit rate
[Figure: addresses A (00 0 000), B (01 0 000), C (10 0 000), and D (11 0 000) all map to index 0000 of the direct mapped cache and conflict (D has tag 11). In the 2-way set associative cache, index 000 holds both C (tag 100) and D (tag 110)]
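The tag/index split described on the slide above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: 6-bit addresses as in the figure, a 16-line cache, and no line-offset bits. The function name `split_address` is hypothetical and does not come from the talk.

```python
def split_address(addr, num_lines=16, ways=1):
    """Split an address into (tag, index) for a cache with `ways` ways.

    Assumes `num_lines` total cache lines, power-of-two sizes, and no
    line-offset bits (illustrative simplification).
    """
    num_sets = num_lines // ways              # more ways -> fewer sets
    index_bits = num_sets.bit_length() - 1    # log2(num_sets)
    index = addr & (num_sets - 1)             # low bits select the set
    tag = addr >> index_bits                  # remaining high bits are compared
    return tag, index


# The four example addresses from the figure (6 bits each)
addresses = {"A": 0b000000, "B": 0b010000, "C": 0b100000, "D": 0b110000}

for name, addr in addresses.items():
    dm_tag, dm_index = split_address(addr, ways=1)
    sa_tag, sa_index = split_address(addr, ways=2)
    print(f"{name}: direct mapped tag={dm_tag:02b} index={dm_index:04b} | "
          f"2-way tag={sa_tag:03b} index={sa_index:03b}")
```

All four addresses land on index 0 in both organizations. The direct mapped cache can hold only one of them at a time, while each set of the 2-way cache can hold two, at the price of one more tag bit and two simultaneous comparisons.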
Cache Associativity
- Reduces miss rate, thus improving performance
- Impact on power and energy?
  - (Energy = Power * Time)
Associativity is Costly
- Associativity improves hit rate, but at the cost of more power per access
- Are the power savings from reduced misses outweighed by the increased power per hit?
[Figures: energy per access for 8 Kbyte cache; energy access breakdown for 8 Kbyte, 4-way set associative cache (considering dynamic power only)]
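The trade-off raised on the slide above can be made concrete with a simple energy model. The sketch below is not the model used in the talk: it assumes every access pays a fixed lookup energy and every miss pays a fixed extra penalty, and all numeric values are hypothetical placeholders chosen only to show that either cache can win.

```python
def total_energy(accesses, miss_rate, energy_per_access, energy_per_miss):
    """Total dynamic energy: each access pays the lookup cost,
    and each miss pays an additional penalty."""
    misses = accesses * miss_rate
    return accesses * energy_per_access + misses * energy_per_miss


# Hypothetical values (arbitrary energy units), not measurements from the talk
ACCESSES = 1_000_000
MISS_PENALTY = 20.0
DM_ACCESS = 0.5      # direct mapped: cheap lookup
FOUR_WAY_ACCESS = 1.2  # 4-way: four tag/data arrays read per lookup

# Program with few conflicts: associativity barely helps the miss rate
dm_easy = total_energy(ACCESSES, 0.02, DM_ACCESS, MISS_PENALTY)
fw_easy = total_energy(ACCESSES, 0.015, FOUR_WAY_ACCESS, MISS_PENALTY)

# Program with many conflicts: direct mapped misses often
dm_hard = total_energy(ACCESSES, 0.10, DM_ACCESS, MISS_PENALTY)
fw_hard = total_energy(ACCESSES, 0.02, FOUR_WAY_ACCESS, MISS_PENALTY)

print("few conflicts:  direct mapped", dm_easy, "4-way", fw_easy)
print("many conflicts: direct mapped", dm_hard, "4-way", fw_hard)
```

With these assumed numbers the direct mapped cache uses less energy on the first program and the 4-way cache uses less on the second, which is the situation the following slides call the associativity dilemma.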
Associativity and Energy
- Best performing cache is not always lowest energy
[Figure annotation: significantly poorer energy]
Associativity Dilemma
- Direct mapped cache
  - Good hit rate on most examples
  - Low power per access
  - But poor hit rate on some examples
    - High power due to many misses
- Four-way set-associative cache
  - Good hit rate on nearly all examples
  - But high power per access
    - Overkill for most examples, thus wasting energy
- Dilemma: Design for the average or worst case?
Associativity Dilemma
- Obviously not a clear choice
Our Solution: Configurable Cache
- Can be configured as 4, 2, or 1 way
  - Ways can be concatenated
- Size can also be configured
  - By shutting down ways
  - Saves static power (leakage)
[Figure: way concatenation example with addresses C and D at index 0000; for address 11 0 000 (D), one address bit selects the way]
Configurable Cache Design: Way Concatenation (4, 2 or 1 way)
- Small area and performance overhead
[Figure: way-concatenation circuit. Address fields: tag address (a31-a13), a12 and a11, index (a10-a5), line offset (a4-a0). A configuration circuit with reg0 and reg1 takes a11 and a12 and generates the way signals c0-c3 for the tag parts and data arrays. Also shown: bitlines, column mux, sense amps, mux driver, and data output, with the critical path marked]
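One way to picture the configuration circuit above is as a function from the configured mode and address bits a12 and a11 to the four way-enable signals c0-c3. The Python sketch below is a behavioral illustration only. The `mode` parameter and the particular mapping of address bits to ways are assumptions made for this example, not the encoding of reg0 and reg1 in the actual circuit.

```python
def way_enables(mode, a12, a11):
    """Return enable flags [c0, c1, c2, c3] for the four ways.

    mode: 4, 2 or 1 (number of ways the cache behaves as).
    a12, a11: address bits that act as extra index bits when ways
    are concatenated. The bit-to-way mapping is an assumed encoding.
    """
    if mode == 4:
        # All four ways are searched in parallel; a12 and a11 stay in the tag
        return [1, 1, 1, 1]
    if mode == 2:
        # a11 selects one way out of each concatenated pair;
        # two ways are still compared in parallel
        return [1, 0, 1, 0] if a11 == 0 else [0, 1, 0, 1]
    if mode == 1:
        # a12 and a11 together select exactly one way (direct mapped)
        selected = (a12 << 1) | a11
        return [1 if way == selected else 0 for way in range(4)]
    raise ValueError("mode must be 4, 2 or 1")


for mode in (4, 2, 1):
    for a12 in (0, 1):
        for a11 in (0, 1):
            print(mode, a12, a11, way_enables(mode, a12, a11))
```

Fewer enabled ways per access means fewer tag and data arrays are activated, which is where the energy saving of the lower-associativity configurations comes from.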
Configurable Cache Experiments
- Motorola PowerStone benchmark g3fax
- Way concatenation outperforms 4-way and direct mapped
Configurable Cache Experiments
[Figure: 100% = 4-way conventional cache]
- Configurable cache with both way concatenation and way shutdown was best on average
  - Considered programs from Powerstone, MediaBench, and Spec 2000
- And, it was superior on every benchmark
Configurable Cache Experiments: Line Size Too
[Figure: 100% = 4-way conventional cache; csb: concatenate plus shutdown cache]
- Best line size also differs per example
  - Our cache can be configured for line of 16, 32 or 64 bytes
  - 64 is usually best, but 16 is much better in a couple of cases
- A configurable cache with way concatenation, way shutdown, and variable line size can save a lot of energy
Configurable Platform Use
[Figure: pre-fabricated platform IC containing uP, DSP, L1 cache, JPEG dcd, peripherals, and FPGA]
- Platforms increasingly come with on-chip FPGA
- Can we use that FPGA to improve software performance and energy?
Commercial Single-Chip Microprocessor/FPGA Platforms
- Triscend E5: based on 8-bit 8051 CISC core
  - 10 Dhrystone MIPS at 40 MHz
  - 60 kbytes on-chip RAM
  - Up to 40K logic gates
  - Cost only about $4 (in volume)
[Figure: Triscend E5 chip: configurable logic, 8051 processor plus other peripherals, memory]
Single-Chip Microprocessor/FPGA Platforms
- Atmel FPSLIC
  - Field-Programmable System-Level IC
  - Based on AVR 8-bit RISC core
    - 20 Dhrystone MIPS
  - 5k-40k configurable logic gates
  - On-chip RAM (20-36 Kb) and EEPROM
  - $5-$10
Courtesy of Atmel
Single-Chip Microprocessor/FPGA Platforms
- Triscend A7 chip
- Based on ARM7 32-bit RISC processor
  - 54 Dhrystone MIPS at 60 MHz
  - Up to 40k logic gates
  - On-chip cache and RAM
  - $10-$20 in volume
Courtesy of Triscend
Single-Chip Microprocessor/FPGA Platforms
- Altera's Excalibur EPXA10
- ARM (922T) hard core
  - ~200 Dhrystone MIPS at ~200 MHz
- Devices range from ~200k to ~2 million programmable logic gates
Source: www.altera.com
Single-Chip Microprocessor/FPGA Platforms
- Xilinx Virtex II Pro
- PowerPC based
  - 420 Dhrystone MIPS at 300 MHz
  - 1 to 4 PowerPCs
- 4 to 16 gigabit transceivers
  - 622 Mbps to 3.125 Gbps
- 12 to 216 multipliers
- 3,000 to 50,000 logic cells
- 200k to 4M bits RAM
- 204 to 852 I/O
- $100-$500 (>25,000 units)
[Figure: Virtex II Pro chip: config. logic, PowerPCs, up to 16 serial transceivers. Courtesy of Xilinx]
Single-Chip Microprocessor/FPGA Platforms
- Why wouldn't future microprocessor chips include some amount of on-chip FPGA?
Single-Chip Microprocessor/FPGA Platforms
- Lots of silicon area taken up by configurable logic
  - As discussed earlier, less of an issue every year
  - Smaller area doesn't necessarily mean higher yield (lower costs) any more
    - Previously could pack more die onto a wafer
    - But die are becoming pad (pin) limited in nanoscale technologies
- Configurable logic typically used for peripherals, glue logic, etc.
  - We have investigated another use...
Software Improvements using On-Chip Configurable Logic
- Partitioned software critical loops onto on-chip FPGA for several benchmarks
  - Most time spent in one or two loops
- Extensive simulated results for 8051 and MIPS
  - For Powerstone (PS), MediaBench (MB) and Netbench (NB)
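Why moving only one or two loops to the FPGA can pay off follows from Amdahl's law: the overall speedup depends on the fraction of runtime spent in the accelerated loops. The sketch below is a generic illustration, not the authors' partitioning tool, and the profile numbers in it are hypothetical.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: overall speedup when only `loop_fraction` of the
    original runtime is accelerated by a factor of `loop_speedup`."""
    remaining = 1.0 - loop_fraction
    return 1.0 / (remaining + loop_fraction / loop_speedup)


# Hypothetical profile: 80% of time in critical loops,
# and the loops run 10x faster in configurable logic
print(round(overall_speedup(0.80, 10.0), 2))

# The software left on the processor bounds the gain:
# with 50% of time in loops, the speedup stays below 2x
print(round(overall_speedup(0.50, 1000.0), 2))
```

The same relation explains why benchmarks that spend most of their time in a small kernel benefit most from this kind of hardware/software partitioning.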
Software Improvements using On-Chip Configurable Logic
- Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)
Speedup Gained with Relatively Few Gates
- Created several partitioned versions of each benchmark
  - Most speedup gained with first 20,000 gates
  - Surprisingly few gates
- Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002
- Stitt and Vahid, IEEE Design and Test, Dec. 2002
- J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear)
Software Improvements using On-Chip Configurable Logic: Verified through Physical Measurement
- Performed physical measurements on Triscend A7 and E5 devices
  - Similar results (even a bit better)
[Figure: Triscend A7 development board with the A7 IC]
Other Types of Configurability
[Figure: platform IC containing uP, DSP, L1 cache, JPEG dcd, peripherals, and FPGA]
- Microprocessor (other researchers)
  - VLIW configurations
  - Voltage scaling
- Peripherals
  - e.g., JPEG decoder with different precisions
- Bus topology
- Etc.
Conclusions
- Trend is away from semi-custom IC fabrication
  - Pressures encourage buying pre-fabricated platforms
- Platforms must be highly configurable
  - To be useful for a variety of applications, and hence mass produced
- We have discussed
  - Software speedup/energy benefits of on-chip configurable logic: 3x speedups and 34% energy savings with only ~10,000 gates
  - Creating a highly-configurable cache architecture: 40% energy savings compared to conventional cache
- Designing highly-configurable platforms, and facilitating their use with good exploration tools, can help enable platform-based design
- See http://www.cs.ucr.edu/~vahid for more information