2702c13fff1df8b167a21d0068410c26.ppt
- Number of slides: 27
Architecture Tuning in Embedded Systems
Greg Stitt, Frank Vahid, Tony Givargis, Roman Lysecky
Dept. of Computer Science & Engineering, University of California, Riverside
Department of IP Management, Conexant, Newport Beach
This work was supported by the National Science Foundation under grants CCR-9811164 and CCR-9876006, and by a Design Automation Conference graduate scholarship. This work is being presented at CASES '00 (Compilers, Architectures and Synthesis for Embedded Systems), November 18-19, 2000, San Jose, CA.
A “short list” of embedded systems: Anti-lock brakes, Auto-focus cameras, Automatic teller machines, Automatic toll systems, Automatic transmission, Avionic systems, Battery chargers, Camcorders, Cell phones, Cell-phone base stations, Cordless phones, Cruise control, Curbside check-in systems, Digital cameras, Disk drives, Electronic card readers, Electronic instruments, Electronic toys/games, Factory control, Fax machines, Fingerprint identifiers, Home security systems, Life-support systems, Medical testing systems, Modems, MPEG decoders, Network cards, Network switches/routers, On-board navigation, Pagers, Photocopiers, Point-of-sale systems, Portable video games, Printers, Satellite phones, Scanners, Smart ovens/dishwashers, Speech recognizers, Stereo systems, Teleconferencing systems, Televisions, Temperature controllers, Theft tracking systems, TV set-top boxes, VCRs, DVD players, Video game consoles, Video phones, Washers and dryers. And the list goes on and on.
Introduction: Traditional microprocessor use in embedded systems
- Tasks (not necessarily in the given order)
  - (1) Buy a microprocessor IC (integrated circuit)
  - (2) Integrate it with other ICs onto a board and insert it into an embedded system
  - (3) Download a software program
- Notice that the processor IC is designed independently of the software
  - Different microprocessor variations thus exist, such as low-power or high-performance ICs
[Figure: software, processor, and board, labeled with steps 1-3]
Introduction: Modern core-based approach
- Tasks
  - (1) Buy a microprocessor CORE
    - Hard: layout; Firm: structural HDL; Soft: synthesizable HDL
    - You are buying Intellectual Property, like a file that may come on a floppy, CD-ROM, over the web, etc. You are NOT buying hardware.
  - (2) Design a system-on-a-chip (SOC) from this and other cores
  - (3) Fabricate a SOC IC
  - (4) Insert the IC into an embedded system
  - (5) Download a software program
[Figure: processor HDL and software, labeled with steps 1-5]
Introduction: The unique embedded-system feature of a fixed program
- SOCs implementing an embedded system have a unique feature
  - Each implements a particular application
  - Thus, the processor may execute a single fixed program that never changes
    - Unlike desktop systems, which execute a variety of programs
  - Examples: digital camera, automobile cruise controller
- We can exploit this fixed-program feature
  - For example, by using mask-programmed ROM
  - But much more can be done
[Figure callout: The software in here never changes after production]
Introduction: Proposed core-based approach with architecture tuning
- Tasks
  - (1) Buy a microprocessor core
  - (2) Design a system-on-a-chip (SOC) from this and other cores
  - (3) TUNE the SOC architecture to a software program
  - (4) Fabricate a SOC IC
  - (5) Insert the IC into an embedded system
  - (6) Download the software program
[Figure: successive processor HDL versions and the software, labeled with steps 1-6]
Introduction: Architecture tuning
- A way to exploit the fixed-program feature of embedded systems
- First, do architecture design for the particular application
- Then, “tune” the core-based system architecture to the particular application program, before IC fabrication
- Goals: better performance, power, size
[Figure: a fixed program and a core library (Peripheral A, Peripheral B, Processor X) go through architecture design, then architecture tuning (producing tuned cores in HDL), then fabrication into an IC]
Introduction: Architecture tuning
- Examples of tuning optimizations
  - Memory hierarchy: no cache, L1+L2 cache
  - Cache organization: size, associativity, write policies
  - Bus structure, data/address encoding
  - DMA block sizes
  - Microprocessor optimizations
    - Internal small-loop table
    - Controller partitioning
    - Datapath shortcuts
    - Register file copies
Introduction: Tuning is a special case of Y-Chart iteration
- Philips/TriMedia approach of simultaneously developing an architecture and its applications
[Figure: Y-chart in which Architecture and Applications feed Mapping, followed by Analysis producing Numbers; annotated “Our focus”]
Problem description
- Focus of this work:
  - Tuning a microcontroller to its program
  - Goal is reduced power without performance loss
  - Restrict tuning to maintain exact instruction-set compatibility
    - No instructions may be added or deleted
    - Thus, no modification to the software development environment
    - Also, no problems with porting software to/from other versions of the microcontroller
- Instruction-set incompatibility can be a show stopper
  - Maintenance/upgrades/re-porting of binaries over the lifetime of a product and for product variations is a key issue
  - Likewise, a stable software development environment is needed
Previous work
- Application-specific instruction-set processors [Fisher 99]
  - Customize a microprocessor to its application(s)
  - Delete unnecessary instructions, add new ones along with accompanying datapath extensions
  - e.g., Tensilica
  - A customized instruction set requires customized development tools (e.g., compiler, debugger)
- Tuning compiler to architecture [Tiwari et al 94]
  - Architectural description languages to inform the compiler of architecture features [Halambi et al 99]
- Tuning cache and cache/bus organization to application [Givargis et al 99]
Tuning environment
- Currently for the 8051 microcontroller
  - Starts from a synthesizable VHDL model of the 8051 (soft core)
  - Uses Synopsys synthesis, simulation and power analysis
  - Uses an 8051 instruction-set simulator
  - Uses numerous scripts
- Goal of the environment
  - Understand how power is being consumed for a particular application, so that modifications to the architecture (or application) can be made to minimize that power
- Three main tools
  - Architectural view
  - Instruction-set view
  - Program/data memory view
Tuning environment: architectural view tool
- The microprocessor soft core passes through an RT-synthesizer to produce the microprocessor structure
- The program binary passes through a ROM generator to produce a ROM entity
- A simulator and power analyzer, run on the structure and ROM entity, produces “flat” power data
- A structural hierarchical power data translator and the xdu display show power per component
- Example output: Total 7.66 mW; CTRL 2.69 mW; ALU 1.62 mW; RAM 1.42 mW; ROM 1.04 mW; DECODER 0.07 mW
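The translator's job is to turn flat power data into a per-component breakdown. The Python sketch below shows one way such a roll-up could work, assuming the flat records arrive as hypothetical `(instance_path, power_mw)` pairs with `/`-separated hierarchy; `roll_up` and `with_shares` are illustrative names, not part of the authors' tool.

```python
from collections import defaultdict


def roll_up(flat_records, depth=1):
    """Sum flat (instance_path, power_mw) records by their leading
    hierarchy levels, e.g. "CTRL/state_reg" -> "CTRL" when depth=1."""
    totals = defaultdict(float)
    for path, power_mw in flat_records:
        key = "/".join(path.split("/")[:depth])
        totals[key] += power_mw
    # Largest consumers first, the order in which a designer would inspect them.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def with_shares(rolled, total_mw):
    """Attach each component's percentage of the measured total and report
    whatever the listed components do not account for as "other"."""
    rows = [(name, mw, 100.0 * mw / total_mw) for name, mw in rolled]
    other = total_mw - sum(mw for _, mw in rolled)
    rows.append(("other", other, 100.0 * other / total_mw))
    return rows


if __name__ == "__main__":
    # Component totals (mW) taken from the example output above.
    components = [("ROM", 1.04), ("ALU", 1.62), ("RAM", 1.42),
                  ("CTRL", 2.69), ("DECODER", 0.07)]
    for name, mw, pct in with_shares(roll_up(components), total_mw=7.66):
        print(f"{name:8s} {mw:5.2f} mW {pct:5.1f}%")
```

With the example numbers, CTRL comes out at about 35% of the 7.66 mW total, and roughly 0.82 mW is not attributed to any of the five listed components.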
Tuning environment: instruction-set view tool
- Binaries that exercise instruction 1, instruction 2, instruction 3, ... each pass through the ROM generator to produce a ROM entity
- The simulator and power analyzer, run with the microprocessor structure, produce flat power data for each instruction
- A power data collector, structural power data translator, and xdu display summarize the results
- Example output:

  Instruction  Power (mW)
  ADDC_1       7.340834
  ADD_1        7.350741
  ANL_1        6.631394
  CLR_1        3.76228
  CPL_1        5.481627
  DA           5.28897
  DEC_1        5.368807
  DIV          7.716592
  INC_1        4.662862
  MOVC_1       6.078014
  MOVC_2       5.021021
  MOV_1        5.577664
  MOV_2        6.164267
  MUL          5.522886
  NOP          4.900275
  ORL_1        6.954121
  POP          8.103867
  PUSH         8.7116
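The instruction-set view needs one binary per instruction that executes that instruction over and over. A minimal Python sketch of such a generator follows. It assumes each test program is simply the instruction's encoding repeated `n` times, followed by an 8051 `SJMP` (opcode 0x80) back to address 0. The opcodes used for `NOP` (0x00) and `INC A` (0x04) are standard 8051 encodings, but how names such as `INC_1` map to specific variants is not stated here, and `exerciser` is a hypothetical helper.

```python
def exerciser(instr_bytes, repeats):
    """Build a tiny 8051 program: `repeats` copies of one instruction,
    then SJMP back to address 0 so the instruction runs indefinitely."""
    body = list(instr_bytes) * repeats
    pc_after_sjmp = len(body) + 2     # SJMP is 2 bytes; its offset is relative to the next PC
    offset = -pc_after_sjmp           # target address 0
    if offset < -128:
        raise ValueError("loop body too long for an SJMP relative offset")
    return body + [0x80, offset & 0xFF]  # 0x80 = SJMP rel, offset in two's complement


def to_hex_bytes(program):
    return " ".join(f"{b:02X}" for b in program)


if __name__ == "__main__":
    print(to_hex_bytes(exerciser([0x00], 4)))  # 00 00 00 00 80 FA  (NOP x4)
    print(to_hex_bytes(exerciser([0x04], 2)))  # 04 04 80 FC        (INC A x2)
```

Each such binary would then go through the ROM generator and simulation steps listed above. Repeating the instruction several times before the jump keeps the `SJMP` itself a small part of the measured activity.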
Tuning environment: program/data memory view tool
- Per-instruction power data (from the previous tool) and the program binary feed an instruction-set simulator
- The simulator produces program/data memory access frequencies and power
- A program hierarchy power translator and xdu display present the results
- Example program memory view: addresses 00000 through 00022 with their instructions (LJMP, MOV_9, RET, MOV_4, LCALL), execution frequencies, per-instruction power (mW), and Freq*Pwr; the largest Freq*Pwr entries are 589.752, 147.438, and 130.547
- Example data memory view:

  Addr   Purpose  Accesses
  00128  P0       1311
  00129  SP       70317
  00130  DPL      31189
  00131  DPH      7977
  00144  P1       161
  00208  PSW      413527
  00224  ACC      360949
  00240  B        2598
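The Freq*Pwr column is each instruction's execution count multiplied by its per-instruction power from the instruction-set view; for example, 108 × 5.46067 = 589.752. The Python sketch below shows that weighting, assuming the simulator reports hypothetical `(address, mnemonic, exec_freq)` rows. The addresses and counts in the usage are invented, and the result is a relative ranking weight rather than an energy figure.

```python
# Per-instruction power (mW), taken from the instruction-set view table.
INSTR_POWER_MW = {
    "MOV_1": 5.577664,
    "MOV_2": 6.164267,
    "PUSH": 8.7116,
    "POP": 8.103867,
    "NOP": 4.900275,
}


def weighted_hot_spots(program_rows, instr_power=INSTR_POWER_MW):
    """program_rows: (address, mnemonic, exec_freq) tuples from an
    instruction-set simulator run. Returns (address, mnemonic, freq*power)
    tuples sorted from the largest weight down."""
    weighted = []
    for address, mnemonic, freq in program_rows:
        # A KeyError here means the instruction was never characterized.
        weighted.append((address, mnemonic, freq * instr_power[mnemonic]))
    return sorted(weighted, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical addresses and execution counts, for illustration only.
    rows = [(0x0010, "PUSH", 10), (0x0012, "MOV_1", 100), (0x0014, "NOP", 1)]
    for address, mnemonic, weight in weighted_hot_spots(rows):
        print(f"{address:05d} {mnemonic:6s} {weight:10.3f}")
```

Ranking by this product, rather than by frequency alone, lets a cheap but very frequent instruction outrank an expensive but rare one.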
Tuning environment
- Inputs: the program binary and the microprocessor core
- Program/data memory view tool (seconds) produces program power data
- Architectural view tool (1 hour) produces architecture power data
- Instruction-set power view tool (1 day) produces instruction-set power data
Design flow using the tuning environment
- Change the application, then run the program/data memory view tool
- Or change the architecture, then run the architecture view tool and the instruction-set view tool
- Satisfied? If no, make further changes and repeat; if yes, DONE
Experiments
- Started with an 8051 soft core in VHDL
- The tuning environment was used to
  - Examine where power consumption was occurring for a given application
  - Quickly evaluate the impact of tuning optimizations
- These are early results; much more work remains
Power consumption of the initial 8051 model
- Power consumption
  - Mainly due to switching wires
  - Any wire whose value changes (from 0 to 1) consumes power
  - Want to minimize switching
- 8051 power consumption
  - 5 main components
  - Controller, RAM, and ALU are the most expensive components
  - These components have potential for general optimizations
- Average power: 37.1824 mW
- Total gates: 25854
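Because power here is dominated by wires changing value, a useful first metric is how many bit transitions a bus sees. In the common CMOS model, dynamic power is roughly α·C·V²·f, where α is the switching activity and C the switched capacitance, so fewer transitions on high-capacitance wires means less power. The Python sketch below counts 0-to-1 transitions (the case named above) and total toggles over a sequence of bus values; the 8-bit default width is an assumption matching the 8051 data bus.

```python
def rising_transitions(values, width=8):
    """Count bits that go from 0 to 1 between consecutive bus values."""
    mask = (1 << width) - 1
    count = 0
    for prev, cur in zip(values, values[1:]):
        count += bin(~prev & cur & mask).count("1")
    return count


def toggles(values, width=8):
    """Count all bit changes (0 to 1 and 1 to 0) between consecutive values."""
    mask = (1 << width) - 1
    return sum(bin((prev ^ cur) & mask).count("1")
               for prev, cur in zip(values, values[1:]))


if __name__ == "__main__":
    trace = [0x00, 0xFF, 0x00, 0x0F]
    print(rising_transitions(trace), toggles(trace))  # 12 20
```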
General optimizations made to the 8051
- Prevent unnecessary switching on wires connecting to memories
  - Wires connecting the processor to memories have high capacitance
  - They were switching even when not being used
  - So we inserted latches to hold the previous value, a standard power-saving technique
- Prevent unnecessary switching in the decoder and ALU
  - Again, by latching the inputs coming from the controller
- Fetch instruction bytes only when needed
- Hold the ROM output when it is not being read
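The effect of holding a memory interface's inputs while it is idle can be estimated by counting toggles with and without the hold. Below is a minimal Python model, assuming each cycle is a hypothetical `(enable, internal_value)` pair in which the processor's internal value keeps changing even on idle cycles. It is a behavioral illustration only, not the VHDL change made to the 8051 model.

```python
def bus_trace(cycles, latched, reset_value=0):
    """Value seen on the memory-side wires each cycle.
    Unlatched: the wires follow the internal value every cycle.
    Latched: the wires update only when the memory is actually accessed."""
    held = reset_value
    trace = []
    for enable, internal_value in cycles:
        if enable or not latched:
            held = internal_value
        trace.append(held)
    return trace


def toggles(values, width=8):
    """Total bit changes between consecutive values (both directions)."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    # Hypothetical cycles: an access, two idle cycles with changing internals, an access.
    cycles = [(True, 0x12), (False, 0xA5), (False, 0x5A), (True, 0x13)]
    print(toggles(bus_trace(cycles, latched=False)))  # 17
    print(toggles(bus_trace(cycles, latched=True)))   # 1
```

In this toy trace the hold removes every transition caused by idle cycles, leaving only the single bit that differs between the two real accesses.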
Power after general optimizations
- Overall power reduction from 37.2 to 11.6 mW (about 69%)
- Average power: 11.6025 mW
- Total gates: 25951
- % improvements by component: ROM 82.9%, RAM 70.5%, ALU 60.0%, CTR 19.9%
Tuning optimizations
- Sought to tune the microprocessor to a particular application
  - GCD (greatest common divisor) computation
- Tuning optimizations invoked
  - 1) Replace frequently accessed RAM locations by internal registers
  - 2) Create datapath shortcuts for the most common instructions
  - 3) Partition the controller into a big controller and a small controller, with the small one handling the most frequently executed GCD instructions
Sample tuning optimization
- Observation
  - RAM consumes much power
  - Address 224 (ACC) is accessed frequently
- Possible tuning optimization
  - Replace this RAM location by a register
- Steps
  - Modify the VHDL model
  - Run all three view tools
- Results
  - Power reduction: 7.67 to 7.27 mW
  - RAM reduced from 1.42 to 0.8 mW; CTRL increased slightly
- Architectural view before the change: Total 7.66 mW; CTRL 2.69 mW; ALU 1.62 mW; RAM 1.42 mW; ROM 1.04 mW; DECODER 0.07 mW
- Data memory view:

  Addr   Purpose  Accesses
  00128  P0       1311
  00129  SP       70317
  00130  DPL      31189
  00131  DPH      7977
  00144  P1       161
  00208  PSW      413527
  00224  ACC      360949
  00240  B        2598
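Choosing which RAM locations to move into registers can start from the access counts in the data memory view. The Python sketch below ranks the listed addresses by access count and reports the share of accesses the top `k` would absorb. `register_candidates` is a hypothetical helper, and the selection rule (access count only, ignoring the extra controller logic a register may add) is an assumption. With `k=2` it picks PSW and ACC, the two locations that the following slide moves into internal registers.

```python
# Data-memory access counts from the data memory view:
# address -> (register name, number of accesses).
ACCESSES = {
    128: ("P0", 1311),
    129: ("SP", 70317),
    130: ("DPL", 31189),
    131: ("DPH", 7977),
    144: ("P1", 161),
    208: ("PSW", 413527),
    224: ("ACC", 360949),
    240: ("B", 2598),
}


def register_candidates(accesses, k):
    """Return the k most frequently accessed locations and the fraction of
    all listed accesses they account for."""
    ranked = sorted(accesses.items(), key=lambda item: item[1][1], reverse=True)
    chosen = [(addr, name, count) for addr, (name, count) in ranked[:k]]
    total = sum(count for _, count in accesses.values())
    covered = sum(count for _, _, count in chosen)
    return chosen, covered / total


if __name__ == "__main__":
    chosen, fraction = register_candidates(ACCESSES, k=2)
    print(chosen)             # [(208, 'PSW', 413527), (224, 'ACC', 360949)]
    print(f"{fraction:.1%}")  # 87.2%
```

The 87.2% figure covers only the eight addresses in the table, not all of data memory.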
Replacing certain RAM locations by registers
- PSW and accumulator are separated from the RAM entity and placed in internal registers
- Average power: 9.7684 mW
- Total gates: 26465
- % improvements: RAM 46.1%, overall 15.8%
Optimized datapath
- MOV from register 7 to ACC is very common
- Add a “shortcut” signal to the register file
  - Avoids having data go through the ALU
- Average power: 11.2857 mW
- Total gates: 26315
- Power reduced by 0.32 mW (2.7%)
[Table: program memory view for addresses 00000 (LJMP), 00003 to 00018 (MOV_9, with RET at 00011), 00020 (MOV_4), and 00022 (LCALL), with execution frequency, per-instruction power (mW), and Freq*Pwr; the largest Freq*Pwr entries are 589.752, 147.438, and 130.547]
Controller partitioning
- Motivation
  - In many applications, 90% of the time is spent in 10% of the code (or some similar ratio)
  - So partition the controller into two, one handling the frequently executed 10% of the code
  - This smaller controller should consume less power
- Results
  - Average power reduced from 11.6 mW to 11.3 mW (2.6%)
  - Total gates: 28731
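The partitioning idea is to let a small controller handle the few instructions that dominate execution. One simple way to pick them, sketched below in Python, is to take instruction types in descending order of dynamic count until a coverage target is reached. The 90% target echoes the 90/10 rule of thumb above; splitting by instruction type follows the GCD slide's "most frequently executed GCD instructions". The counts in the usage are hypothetical, and `small_controller_set` is an illustrative name, not the authors' procedure.

```python
def small_controller_set(instr_counts, target=0.9):
    """Pick the fewest instruction types whose combined dynamic count
    reaches `target` of all executed instructions. Taking types in
    descending order of count is optimal for this goal."""
    total = sum(instr_counts.values())
    chosen, covered = [], 0
    for mnemonic, count in sorted(instr_counts.items(),
                                  key=lambda item: item[1], reverse=True):
        if covered >= target * total:
            break
        chosen.append(mnemonic)
        covered += count
    return chosen, covered / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical dynamic instruction counts for one program run.
    counts = {"MOV_9": 700, "MOV_4": 150, "LCALL": 60,
              "DIV": 49, "RET": 40, "LJMP": 1}
    chosen, coverage = small_controller_set(counts)
    print(chosen, f"{coverage:.0%}")  # ['MOV_9', 'MOV_4', 'LCALL'] 91%
```

Counting executed instructions rather than clock cycles is a simplification; multi-cycle instructions would shift the choice if cycles were weighted instead.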
Conclusions
- Described an environment for tuning a microprocessor to its application for low power
  - Full instruction-set compatibility
  - Multiple views help find power hogs
  - Fully automated
- Focus is now on developing tuning optimizations
  - Controller partitioning, small-loop table, datapath shortcuts, register-file copies, etc.
  - Investigate the possibility of automating tuning optimizations; develop a more general tuning methodology
- The environment for the 8051 is available on the web:
  - http://www.cs.ucr.edu/~dalton