Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs
Frank Vahid
Associate Professor, Dept. of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend.
Frank Vahid, UC Riverside

General Purpose vs. Special Purpose
- Standard tradeoff
- Amazing to think this came from wolves
- Oct. 14, 2002, Cincinnati, Ohio: physicians at Cincinnati Children's Hospital Medical Center report duct tape effective at treating warts.

General Purpose vs. Single Purpose Processors
- Designers have long known that:
  - General-purpose processors are flexible
  - Single-purpose processors are fast
- ENIAC, 1940's: its flexibility was the big deal
- Example computation:
    total = 0
    for i = 1 to N loop
      total += M[i]
    end loop
- Tradeoff: general purpose (flexibility, design cost, time-to-market) vs. single purpose (performance, power efficiency, size)
[Figure: general-purpose processor (controller with control logic and state register, IR, PC; datapath with register file and general ALU; program memory holding assembly code for "total = 0, for i = 1 to …"; data memory) OR single-purpose processor (controller with control logic and state register; datapath with i, total, and an adder; data memory)]
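
For reference, the summing loop on this slide can be written as ordinary C for the general-purpose (software) side. This is only a minimal sketch: the function name `sum_array` and the `int` element type are assumptions made for illustration and do not come from the slides.

```c
#include <stddef.h>

/* Software version of the slide's loop:
 *   total = 0; for i = 1 to N loop total += M[i] end loop
 * sum_array and the int element type are illustrative assumptions.
 * C arrays are 0-based, so the loop covers m[0] .. m[n-1]. */
long sum_array(const int *m, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += m[i];
    }
    return total;
}
```

A single-purpose version of the same computation would need only the `i` and `total` registers, one adder, and a small controller, which is the contrast the figure draws.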

Mixing General and Single Purpose Processors
- A.k.a. hardware/software partitioning
  - Hardware: single-purpose processors
    - Coprocessor, accelerator, peripheral, etc.
  - Software: general-purpose processors
    - Though hardware underneath!
- Especially important for embedded systems
  - Computers embedded in devices (cameras, cars, toys, even people)
  - Speed, cost, time-to-market, power, size, … demands are tough
[Figure: digital camera chip with blocks: lens, CCD preprocessor, A2D, D2A, Pixel coprocessor, JPEG codec, Microcontroller, Multiplier/Accumulator, DMA controller, Memory controller, Display control, ISA bus interface, UART, LCD control]

How is Partitioning Done for Embedded Systems?
- Partitioning into hw and sw blocks done early
  - During conceptual stage
- Sw design done separately from hw design
- Attempts since late 1980s to automate not yet successful
  - Partitioning manually is reasonably straightforward
  - Spec is informal and not machine readable
  - Sw algorithms may differ from hw algorithms
  - No compelling need for tools
[Figure: Informal spec -> System Partitioning -> (Sw spec -> Sw design -> Processor) and (Hw spec -> Hw design -> ASIC)]

New Platforms Invite New Efforts in Hw/Sw Partitioning
- New single-chip platforms contain both a general-purpose processor and an FPGA
- FPGA: Field-programmable gate array
  - Programmable just like software
  - Flexible
  - Intended largely to implement single-purpose processors
- Can we perform a later partitioning to improve the software too?
[Figure: Informal spec -> System Partitioning -> (Sw spec -> Sw design -> Partitioning -> Processor + FPGA) and (Hw spec -> Hw design -> ASIC)]

Commercial Single-Chip Microprocessor/FPGA Platforms
- Triscend E5: based on 8-bit 8051 CISC core (2000)
  - 10 Dhrystone MIPS at 40 MHz
  - Up to 40K logic gates
  - Cost only about $4
[Figure: Triscend E5 chip: configurable logic, 8051 processor plus other peripherals, memory]

Single-Chip Microprocessor/FPGA Platforms
- Atmel FPSLIC
  - Field-Programmable System-Level IC
  - Based on AVR 8-bit RISC core
    - 20 Dhrystone MIPS
    - 5k-40k logic gates
    - $5-$10
[Figure courtesy of Atmel]

Single-Chip Microprocessor/FPGA Platforms
- Triscend A7 chip (2001)
- Based on ARM7 32-bit RISC processor
  - 54 Dhrystone MIPS at 60 MHz
  - Up to 40k logic gates
  - $10-$20 in volume
[Figure courtesy of Triscend]

Single-Chip Microprocessor/FPGA Platforms
- Altera's Excalibur EPXA10 (2002)
- ARM (922T) hard core
  - 200 Dhrystone MIPS at 200 MHz
  - ~200k to ~2 million logic gates
Source: www.altera.com

Single-Chip Microprocessor/FPGA Platforms
- Xilinx Virtex II Pro (2002)
- PowerPC based
  - 420 Dhrystone MIPS at 300 MHz
  - 1 to 4 PowerPCs
- 4 to 16 gigabit transceivers
- 12 to 216 multipliers
- Millions of logic gates
- 200k to 4M bits RAM
- 204 to 852 I/O
- $100-$500 (>25,000 units)
[Figure labels: Config. logic; PowerPCs; up to 16 serial transceivers, 622 Mbps to 3.125 Gbps. Courtesy of Xilinx]

Single-Chip Microprocessor/FPGA Platforms
- Why wouldn't future microprocessor chips include some amount of on-chip FPGA?
- One argument against: area
  - Lots of silicon area taken up by FPGA
    - FPGA about 20-30 times less area efficient than custom logic
  - FPGA used to be for prototyping, too big for final products
- But chip trends imply that FPGAs will be O.K. in final products…

How Much is Enough?
(Series of four picture slides, captioned in order)
- Perhaps a bit small
- Reasonably sized
- Probably plenty big for most of us
- More than typically necessary

How Much Custom Logic is Enough?
(Logic transistors per IC, by year)
- 1993: ~1 million. Perhaps a bit small.
- 1996: ~5-8 million. Reasonably sized.
- 1999: ~10-50 million. Probably plenty big for most of us.
- 2002: ~100-200 million. More than typically necessary.
- 2008: >1 BILLION (vs. 1 M in 1993). Perhaps very few people could design this.
For scale: 8-bit processor: 50,000 tr.; Pentium: 3 million tr.; MPEG decoder: several million tr.
[Figure: IC inside an IC package]

Very Few Companies Can Design High-End ICs
- Design productivity gap
  - Designer productivity growing at slower rate
    - 1981: 100 designer months, ~$1M
    - 2002: 30,000 designer months, ~$300M
[Figure: IC capacity (logic transistors per chip, in millions, following Moore's Law) and designer productivity (K trans./staff-mo.) plotted on log scales for 1981-2007, with a widening gap between the two curves. Source: ITRS'99]

Single-Chip Platforms with On-Chip FPGAs
- So, big FPGAs on-chip are O.K., because mainstream designers couldn't have used all that silicon area anyway
- But couldn't designers use custom logic instead of FPGAs to make smaller chips and save costs?
[Figure annotation: high-end IC capacity is becoming out of reach of mainstream designers]

Shrinking Chips
- Yes, but there's a limit
  - Chips becoming pin limited
  - A football huddle can only get so small
[Figure: a chip being shrunk, ringed by pads connecting to external pins. This area will exist whether we use it all or not.]

Trend Towards Pre-Fabricated Platforms: ASSPs
- ASSP: application specific standard product
  - Domain-specific prefabricated IC
  - e.g., digital camera IC
- ASIC: application specific IC
- ASSP revenue > ASIC
- ASSP design starts > ASIC
  - Unique IC design
  - Ignores quantity of same IC
- ASIC design starts decreasing
  - Due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest, September '01

Microprocessor/FPGA Platforms
- Trends point towards such platforms increasing in popularity
- Can we automatically partition the software to utilize the FPGA?
  - For improved speed and energy

Automatic Hardware/Software Partitioning
- Since late 1980s, the goal has been spec in, hw/sw out
  - But no successful commercial tool yet. Why?
[Figure: "Spec" -> Ideal Partitioner -> (Software -> Compilation -> Processor) and (Hardware -> Synthesis -> ASIC/FPGA)]
Example "spec":

    // From MediaBench's JPEG codec
    GLOBAL(void)
    jpeg_fdct_ifast (DCTELEM * data)
    {
      DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
      DCTELEM tmp10, tmp11, tmp12, tmp13;
      DCTELEM z1, z2, z3, z4, z5, z11, z13;
      DCTELEM *dataptr;
      int ctr;
      SHIFT_TEMPS

      /* Pass 1: process rows. */
      dataptr = data;
      for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
        tmp0 = dataptr[0] + dataptr[7];
        tmp7 = dataptr[0] - dataptr[7];
        tmp1 = dataptr[1] + dataptr[6];
        …
    // Thousands of lines like this in dozens of files

Why No Successful Tool Yet?
- Most research has focused on extensive exploration
  - Roots in VLSI CAD
  - Decompose problem into fine-grained operations
  - Apply sophisticated partitioning algorithms
- Examples
  - Min-cut, dynamic programming, simulated annealing, tabu-search, genetic evolution, etc.
- Is this overkill?
[Figure: "Spec" -> 1000s of nodes (like circuit partitioning) -> Partitioner]

We Really Only Need Consider a Few Loops, Due to the 90-10 Rule
- Recent appearance of embedded benchmark suites
  - Enables analysis and understanding of the real problem
  - We've examined UCLA's MediaBench, NetBench, Motorola's Powerstone
  - Currently examining EEMBC (embedded equivalent of SPEC)
- UCR loop analysis tools based on SimpleScalar and Simics
- Assigned each loop a number, sorted by fraction of contribution to total execution time
[Figure: plot over loops 1 to 10, vertical axis from 0 to 1, shown beside the jpeg_fdct_ifast excerpt from MediaBench's JPEG codec (Pass 1 row loop)]

The 90-10 Rule Holds for Embedded Systems
- In fact, the most frequent loop alone took 50% of time, using 1% of code
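
The loop ranking described on these slides (number each loop, sort by contribution to total execution time, then see how few loops cover most of the time) can be sketched as follows. This is an illustrative sketch only, not the UCR loop analysis tools: the function name `loops_to_cover`, the array-of-doubles input, and the `target` parameter are all assumptions.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator: orders doubles from largest to smallest. */
static int cmp_desc(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x < y) - (x > y);
}

/* Hypothetical helper (not from the slides).
 * loop_time[i] is the time spent in loop i; total_time is the whole
 * program's execution time. Sorts loop_time in place, most expensive
 * first, and returns the smallest number of loops whose combined time
 * reaches target * total_time (e.g. target = 0.9 for the 90-10 rule).
 * Returns n if the loops together never reach the target. */
size_t loops_to_cover(double *loop_time, size_t n,
                      double total_time, double target)
{
    qsort(loop_time, n, sizeof loop_time[0], cmp_desc);

    double covered = 0.0;
    for (size_t i = 0; i < n; i++) {
        covered += loop_time[i];
        if (covered >= target * total_time) {
            return i + 1;
        }
    }
    return n;
}
```

With measured per-loop times from a profiler or simulator, a small return value for `target = 0.9` would indicate that the program follows the 90-10 rule.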

So Need We Only Consider the First Few Loops? Not Necessarily
- What if programs were self-similar w.r.t. the 90-10 rule?
  - Remove most frequent loop: does the 90-10 rule still hold?
  - Intuition might say yes: remove loop, and we have another program.
- So we need only speed up the first few loops
  - After that, speedups are limited
  - Good from tool perspective!
[Figure: plots of % remaining execution time vs. loop (1 to 10) and of speedup vs. loop (1 to 10)]

Manually Partitioned Several PowerStone Benchmarks onto Triscend A7 and E5 Chips
- Used multimeter and timer to measure performance and power
- Obtained good speedups and energy savings by partitioning software among microprocessor and on-chip FPGA
[Photos: E5 IC; Triscend A7 development board]

Simulation-Based Results for More Benchmarks
(Quicker than physical implementation; results matched reasonably well)
- Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)

Looking at Multiple Loops per Benchmark
- Manually created several partitioned versions of each benchmark
  - Most speedup gained with first 20,000 gates
  - Surprisingly few gates!
References:
- Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM), 2002
- Stitt and Vahid, IEEE Design and Test, Dec. 2002
- J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear)

Ideal Speedups for Different Architectures
- Varied loop speedup ratio (sw time / hw time of loop itself) to see impact of faster microprocessor or slower FPGA: 30, 20, 10 (base case), 5 and 2
- Loop speedups of 5 or more work fine for first few loops, not hard to achieve
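
The effect of the loop speedup ratio on whole-program speedup can be approximated with Amdahl's law. The sketch below is an idealized model, not the procedure used to produce the slide's data: the function name `overall_speedup` is hypothetical, every moved loop is assumed to get the same speedup ratio, and communication overhead between processor and FPGA is ignored.

```c
#include <stddef.h>

/* Idealized (Amdahl's law) overall speedup. Assumptions for illustration:
 *  - f[i] is the fraction of original execution time spent in loop i,
 *    with loops already sorted most frequent first;
 *  - the first k loops move to hardware, each running loop_speedup
 *    times faster there (sw time / hw time of the loop itself);
 *  - processor/FPGA communication overhead is ignored. */
double overall_speedup(const double *f, size_t k, double loop_speedup)
{
    double moved = 0.0;
    for (size_t i = 0; i < k; i++) {
        moved += f[i];
    }
    /* New execution time, normalized so the original time is 1.0. */
    double new_time = (1.0 - moved) + moved / loop_speedup;
    return 1.0 / new_time;
}
```

For example, if one loop accounts for half the execution time, a loop speedup ratio of 10 gives an overall speedup of about 1.8, and raising the ratio to 30 only brings it to about 1.9, since the overall speedup can never exceed 2 in that case. Under this model the result is bounded by the time left in software, which fits the observation that modest loop speedups already work well for the first few loops.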

Ideal Energy Savings for Different Architectures
- Varied loop power ratio (FPGA power / microprocessor power) to account for different architectures: 2.5, 2.0, 1.5 (base case), 1.0
- Energy savings quite resilient to variations
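
A rough way to see why energy savings are resilient to the power ratio is to model energy as power times time. The sketch below is a simplified model under stated assumptions, not the model behind the slide's data: the function name `energy_savings` is hypothetical, the microprocessor is assumed to draw constant power while running software and negligible power while the FPGA runs the loop, and static power is ignored.

```c
/* Simplified energy model (all modeling choices are assumptions):
 *  - original energy is normalized to 1.0 (processor power x total time);
 *  - moved is the fraction of execution time moved to the FPGA;
 *  - loop_speedup is sw time / hw time for the moved loops;
 *  - power_ratio is FPGA power / microprocessor power;
 *  - the processor is treated as drawing no power while the FPGA runs.
 * Returns the fraction of energy saved (negative means more energy used). */
double energy_savings(double moved, double loop_speedup, double power_ratio)
{
    double sw_energy = 1.0 - moved;                         /* still on processor */
    double hw_energy = power_ratio * moved / loop_speedup;  /* now on FPGA */
    return 1.0 - (sw_energy + hw_energy);
}
```

With half the time moved and a loop speedup of 10, this model gives savings of 42.5% at a power ratio of 1.5 and 37.5% at 2.5. The hardware runs for so short a time that even a much higher FPGA power changes the total only slightly.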

How is Automated Partitioning Done?
- Previous data obtained manually
[Figure: Informal spec -> System Partitioning -> (Sw spec -> Sw design -> Partitioning -> Processor + FPGA) and (Hw spec -> Hw design -> ASIC)]

Source-Level Partitioning
- Front-end converts code into intermediate format, such as SUIF (Stanford University Intermediate Format)
- Intermediate format explored for hardware candidates
- Binary is generated from assembling and linking
- Hw source is generated and synthesized into netlist
[Figure: SW Source -> Compiler Front-End -> Hw/Sw Partitioning -> (Compiler Back-End -> Assembly & object files -> Assembler & Linker -> Binary -> Processor) and (Hw source -> Synthesis -> Netlists -> FPGA)]

Problems with Source-Level Partitioning
- Though technically superior, source-level partitioning
  - Disrupts standard commercial tool flow significantly
    - Requires special compiler (ouch!)
  - Must handle multiple source languages, and changing source languages
  - How to deal with library code, assembly code, object code?
[Figure: compiler front-end: C source -> C SUIF compiler; C++ source -> C++ SUIF compiler; Java source -> ?]

Binary Partitioning
- Source code is first compiled and linked in order to create a binary
- Candidate hardware regions (a few small, frequent loops) are decompiled for partitioning
- HDL is generated and synthesized, and binary is updated to use hardware
[Figure: SW Source -> Compilation -> Assembly & object files -> Assembler & Linker -> Binary -> Hw/Sw Partitioning -> (Updated Binary -> Processor) and (Hw source -> Synthesis -> Netlists -> FPGA)]

Binary-Level Partitioning Results (ICCAD'02)
- Source-Level
  - Average speedup, 1.5
  - Average energy savings, 27%
  - Average 4,361 gates
- Binary-Level
  - Average speedup, 1.4
  - Average energy savings, 13%
  - Large area overhead averaging 10,325 gates

Binary Partitioning Could Eventually Lead to Dynamic Hw/Sw Partitioning
- Dynamic software optimization gaining interest
  - e.g., HP's Dynamo
  - What better optimization than moving to FPGA?
- Add component on-chip:
  - Detects most frequent sw loops
  - Decompiles a loop
  - Performs compiler optimizations
  - Synthesizes to a netlist
  - Places and routes the netlist onto (simple) FPGA
  - Updates sw to call FPGA
- Self-improving IC
  - Can be invisible to designer
  - Appears as efficient processor
  - HARD! Much future work.
[Figure: chip with Processor, I$, D$, Mem, Profiler, Proc., DMA, and Config. Logic blocks]
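
The first step in the list above, detecting the most frequent software loops, could in principle be done by counting taken backward branches. The following is a speculative sketch of that one step, not the design proposed in the talk: the `loop_profile` structure, the 16-entry table, and both function names are assumptions, and a real on-chip profiler would be hardware rather than C.

```c
#include <stddef.h>
#include <stdint.h>

#define PROFILE_ENTRIES 16  /* table size: an illustrative assumption */

/* Hypothetical profiler state: one counter per observed loop start address. */
typedef struct {
    uint32_t target[PROFILE_ENTRIES]; /* loop start addresses */
    uint32_t count[PROFILE_ENTRIES];  /* back edges seen per loop */
    size_t used;                      /* number of valid entries */
} loop_profile;

/* Record one taken branch. A branch to a lower address is treated as a
 * loop back edge; forward branches are ignored. */
void profile_branch(loop_profile *p, uint32_t branch_addr, uint32_t target_addr)
{
    if (target_addr >= branch_addr) {
        return;
    }
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target_addr) {
            if (p->count[i] < UINT32_MAX) {
                p->count[i]++;            /* saturate instead of wrapping */
            }
            return;
        }
    }
    if (p->used < PROFILE_ENTRIES) {
        p->target[p->used] = target_addr;
        p->count[p->used] = 1;
        p->used++;
    }
    /* Table full: new loops are simply dropped in this sketch. */
}

/* Return the start address of the most frequently iterated loop,
 * or 0 if no loop has been observed. */
uint32_t hottest_loop(const loop_profile *p)
{
    uint32_t best = 0;
    uint32_t best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = p->target[i];
        }
    }
    return best;
}
```

The remaining steps (decompilation, synthesis, place and route, binary update) are far harder and are not sketched here.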

Conclusions
- Hardware/software partitioning can significantly improve software speed and energy
  - Single-chip microprocessor/FPGA platforms, increasing in popularity, make such partitioning even more attractive
- Successful commercial tool still on the horizon
  - Binary-level partitioning may help in some cases
  - Source-level can yield massive parallelism (Profs. Najjar/Payne)
  - Future dynamic hw/sw partitioning possible?
- Distinction between sw/hw continually being blurred!
- Many people involved:
  - Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid Najjar, Jason Villarreal, Tom Payne, several others…
  - Support from NSF, Triscend, and soon SRC…
- Exciting new directions!