  • Number of slides: 26

Survey of Digital Signal Processors
Michael Warner
ECD: VLSI Communication Systems

Agenda
§ Industry Trends
§ DSP Architecture
§ DSP Micro-Architecture
§ DSP Systems

Agenda
§ Industry Trends
§ DSP Architecture
§ DSP Micro-Architecture
§ DSP Systems

Moore’s Law Drives Processor Development
[Figure: transistors per die (log scale, 10^0 to 10^10) versus year (1965 to 2010), showing Moore's 1965 data and microprocessors: 4004, 8008, 8080, 80286, 386™, 486™, Pentium®, Pentium® II, Pentium® III, Pentium® 4, Itanium®, Itanium® 2. Source: Intel internal]
§ Doubling the number of transistors every 18-24 months at the same price point drives significant product opportunities, especially if you have little regard for power
§ But what if energy-delay had to be reduced every generation by an order of magnitude?

Gene’s Law Drives DSP Development
[Figure: DSP power in mW/MIPS (log scale, 0.00001 to 1,000) versus year (1982 to 2008), with the Gene’s Law trend line]
§ Gene’s Law will have its challenges to hold the line!

What’s Driving Gene’s Law?
§ Digital Audio
  § MP3
  § Real Audio
§ Streaming Video
  § MPEG4
  § H.263
§ Connectivity
  § Internet
  § Bluetooth
§ Modem Standards
  § UMTS
  § GSM
[Figure text: Buy Now? Yes / No; TXN 160 +4; UPX 12 3/4]

DSP Design Constraints
DEVICE CAPABILITIES    1982      1992      2002
Technology (µm)        3         0.8       0.1
Transistors            50K       500K      180M
MIPS                   5         40        5,000
RAM (bytes)            256       2K        3M
Power (mW/MIPS)        250       12.5      0.1
Price/MIPS             $30.00    $0.38     $0.02
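The Power (mW/MIPS) row above implies a rate of improvement that can be worked out directly. The sketch below is only an illustration: it assumes a constant exponential decline between the 1982 and 2002 figures, and the function name `halving_time_years` is hypothetical, not from the slides.

```python
import math

# Endpoints copied from the Power (mW/MIPS) row of the DSP Design Constraints table.
YEAR_START, POWER_START = 1982, 250.0
YEAR_END, POWER_END = 2002, 0.1


def halving_time_years(y0, p0, y1, p1):
    """Years needed for the value to halve, assuming a constant exponential decline."""
    halvings = math.log2(p0 / p1)  # number of times the value was cut in half
    return (y1 - y0) / halvings


years = halving_time_years(YEAR_START, POWER_START, YEAR_END, POWER_END)
print(f"{years:.2f} years per halving (about {years * 12:.0f} months)")
```

Under that assumption the table's own numbers give a halving of mW/MIPS roughly every 21 months.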

Agenda
§ Industry Trends
§ DSP Architecture
§ DSP Micro-Architecture
§ DSP Systems

What Makes a DSP?
§ Single-Cycle MAC
§ Multiple Execution Units
§ High Bandwidth (Flat) Memory Sub-Systems
§ Efficient Zero-Overhead Looping
§ Short Pipeline
§ High Bandwidth I/O
§ Specialized Instruction Sets
§ Sophisticated DMA
§ Little to No Speculation

Single Cycle MAC
§ MACs Typically Determine DSP Performance and Pipeline Length (EX)
§ Most DSPs Have 2-8 MAC Units
§ MACs Typically Operate in Both a Scalar and Vector Mode
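A multiply-accumulate (MAC) is the inner step of a FIR filter tap or dot product. The sketch below is a software model, not a description of any particular device: `mac`, `fir_output` and `fir_output_dual_mac` are hypothetical names, and the dual-MAC version only loosely models two MAC units working on alternate taps.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: returns acc + a * b."""
    return acc + a * b


def fir_output(coeffs, samples):
    """Single MAC unit: one MAC per filter tap, one accumulator."""
    acc = 0
    for c, x in zip(coeffs, samples):
        acc = mac(acc, c, x)
    return acc


def fir_output_dual_mac(coeffs, samples):
    """Two MAC units (assumed): even and odd taps use separate accumulators."""
    acc0 = acc1 = 0
    for i in range(0, len(coeffs) - 1, 2):
        acc0 = mac(acc0, coeffs[i], samples[i])          # first MAC unit
        acc1 = mac(acc1, coeffs[i + 1], samples[i + 1])  # second MAC unit
    if len(coeffs) % 2:                                  # leftover tap for odd lengths
        acc0 = mac(acc0, coeffs[-1], samples[-1])
    return acc0 + acc1                                   # combine partial sums


print(fir_output([1, 2, 3, 4], [5, 6, 7, 8]))
print(fir_output_dual_mac([1, 2, 3, 4], [5, 6, 7, 8]))
```

Both functions return the same sum; the dual-MAC form needs about half as many loop iterations, which is why the number of MAC units tends to set peak throughput.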

Multiple Instruction Units
§ VLIW Architectures Driving ILP
§ Typical Instruction Units
  § M-Unit - MAC
  § S-Unit - Shift
  § L-Unit - ALU
  § D-Unit - Load/Store
§ Industry Has Converged on an ILP of ~8
[Figure: data-path diagram with register files A0-A15 and B0-B15, functional units L1, S1, M1, D1 and D2, M2, S2, L2, cross paths 1X and 2X, and load-data buses DDATA_I1 and DDATA_I2]

High Bandwidth Memory Sub-Systems
§ Multiple Load-Store Units Required to Feed Data Path
§ Tightly Coupled Memory is Typically Dual Ported
§ Harvard Architecture is Heavily Banked
[Figure: block diagram with MUXes, central arithmetic logic unit, PC, ARs, MAC, A, B, ALU, shifter, external memory, internal memory and CNTL]
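Banking lets several accesses proceed in the same cycle as long as they fall in different banks. The sketch below is a simplified model under stated assumptions: low-order address interleaving, and each bank serving one access per cycle. The names `bank_of` and `access_cycles` are hypothetical.

```python
def bank_of(address, num_banks):
    """Low-order interleaving (assumed): consecutive words map to consecutive banks."""
    return address % num_banks


def access_cycles(addresses, num_banks):
    """Cycles needed to serve simultaneous accesses when each bank
    can serve only one access per cycle (assumed single-ported banks)."""
    per_bank = {}
    for address in addresses:
        bank = bank_of(address, num_banks)
        per_bank[bank] = per_bank.get(bank, 0) + 1
    # The most heavily loaded bank determines how long the group takes.
    return max(per_bank.values(), default=0)


print(access_cycles([0, 1], num_banks=4))  # different banks: served together
print(access_cycles([0, 4], num_banks=4))  # same bank: serialized
```

In this model two loads to adjacent words complete together, while two loads whose addresses differ by a multiple of the bank count collide and take an extra cycle.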

Specialized Instruction Sets
§ Base RISC ISA Plus CISC ISA Driven by End Application
  § MAC
  § SAD
  § LMS
  § FIRS
  § Viterbi
§ Support For Both Scalar and Vector Instructions
§ Support For 8, 16 and 32-Bit
§ Instructions are Highly Orthogonal
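As one example of what an application-driven instruction computes, SAD (sum of absolute differences) compares two blocks of samples element by element. The sketch below shows the arithmetic only; the function name `sad` is hypothetical and this is not any device's actual instruction semantics.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same length")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


reference = [10, 20, 30]
candidate = [12, 18, 30]
print(sad(reference, candidate))
```

A lower SAD means a closer match, which is how the operation is typically used in block matching for video motion estimation.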

Scalar (55x) vs VLIW (64x)
§ Scalar DSPs Tend to be More CISC Like
  § Hurts Compiler Performance
  § Improves Energy-Delay
  § Improves Code Density
  § Limits Top End Performance
§ VLIW DSPs Tend to be More RISC Like
  § RISC + GP Regs + Orthogonality Makes For a Good C Compiler
  § Assembler Code Is Challenging
  § RISC ISA Allows for Higher Frequencies
  § Load-Store Hurts Energy-Delay

TMS320C54x

TMS320C54x Protected Pipeline
CYCLES:  1    2    3    4    5    6    7    8    9    10   11
         P1   F1   D1   A1   R1   X1
              P2   F2   D2   A2   R2   X2
                   P3   F3   D3   A3   R3   X3
                        P4   F4   D4   A4   R4   X4
                             P5   F5   D5   A5   R5   X5
                                  P6   F6   D6   A6   R6   X6
Fully loaded pipeline: the cycle in which P6, F5, D4, A3, R2 and X1 are all active
§ Prefetch: Calculate address of instruction
§ Fetch: Collect instruction
§ Decode: Interpret instruction
§ Access: Collect address of operand
§ Read: Collect operand
§ Execute: Perform operation
Note: Protected Pipeline Limits Micro-Architectural Flexibility and Performance
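The staggered pattern of the six-stage diagram can be generated mechanically. The sketch below assumes an idealized pipeline with one instruction issued per cycle and no stalls; `total_cycles` and `occupancy` are hypothetical helper names.

```python
# Stage letters from the slide: Prefetch, Fetch, Decode, Access, Read, Execute.
STAGES = ["P", "F", "D", "A", "R", "X"]


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to drain n instructions through the pipeline (no stalls assumed)."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def occupancy(n_instructions):
    """One row per instruction; each row lists the stage label held in each cycle."""
    cycles = total_cycles(n_instructions)
    rows = []
    for i in range(n_instructions):
        row = [""] * cycles
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name}{i + 1}"  # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


for row in occupancy(6):
    print(" ".join(f"{cell:>3}" for cell in row))
print("total cycles:", total_cycles(6))
```

For six instructions the model takes 11 cycles, and the sixth cycle is the first in which every stage is busy.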

TMS320C6xx
[Figure: ’C6xx CPU core block diagram with program fetch, instruction dispatch, instruction decode, control registers, control logic, test, emulation and interrupts; data path 1 with the A register file and units L1, S1, M1, D1; data path 2 with the B register file and units D2, M2, S2, L2. Legend: arithmetic logic unit, auxiliary logic unit, multiplier unit]

TMS320C6xx Exposed Pipeline
Stage order: PG PS PW PR (Fetch), DP DC (Decode), E1 E2 E3 E4 E5 (Execute)
§ Fetch
  § PG: Program Address Generate
  § PS: Program Address Send
  § PW: Program Access Ready Wait
  § PR: Program Fetch Packet Receive
§ Decode
  § DP: Instruction Dispatch
  § DC: Instruction Decode
§ Execute
  § E1 - E5: Execute 1 through Execute 5
[Figure: pipeline occupancy for Execute Packets 1 through 7, each packet entering PG one cycle after the previous one and advancing one stage per cycle through E5]
Note: Exposed Pipeline Adds Risk to Programming Model
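The risk noted on this slide comes from the hardware not interlocking: if code reads a register before a multi-cycle result has landed, it silently gets the old value. The sketch below is a toy model of that behavior, not the C6xx itself; `run_exposed`, the tuple format and the 3-cycle load latency are all assumptions chosen for illustration.

```python
def run_exposed(program, registers):
    """Toy exposed pipeline: one instruction (or None for a NOP) issues per cycle.
    Each instruction is (dest, func, srcs, latency); its result becomes visible
    `latency` cycles after issue, and nothing stalls to wait for it."""
    regs = dict(registers)
    pending = []  # entries are (ready_cycle, dest, value)
    for cycle, instr in enumerate(program):
        # Commit results whose latency has elapsed before this cycle's read.
        for item in [p for p in pending if p[0] <= cycle]:
            regs[item[1]] = item[2]
            pending.remove(item)
        if instr is not None:
            dest, func, srcs, latency = instr
            value = func(*(regs[s] for s in srcs))  # reads whatever is there now
            pending.append((cycle + latency, dest, value))
    # Drain results still in flight after the last issue.
    for _, dest, value in sorted(pending, key=lambda p: p[0]):
        regs[dest] = value
    return regs


load = ("r1", lambda: 42, (), 3)            # hypothetical 3-cycle load
add = ("r2", lambda x: x + 1, ("r1",), 1)   # single-cycle add reading r1

hazard = run_exposed([load, add], {"r1": 0, "r2": 0})
safe = run_exposed([load, None, None, add], {"r1": 0, "r2": 0})
print(hazard["r2"], safe["r2"])
```

In the first program the add issues too early and uses the stale `r1`; padding with two NOPs lets the load complete first. On a protected pipeline the hardware would insert that wait itself, whereas here the programmer or compiler must schedule it.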

Agenda
§ Industry Trends
§ DSP Architecture
§ DSP Micro-Architecture
§ DSP Systems

Micro-Architectural Challenges
§ Accessing (Flat) On Chip Memory At Speed
  § Within 2-3 cycles
§ Feeding Multiple Functional Units From a Single Register File
§ Running 600 MHz+ with a 7-9 Stage Pipeline
§ Linking Multiple Functional Units with Result Forwarding
§ Implementing CISC Data-path to Meet Area and Performance Goals
§ Achieving ARM Like Code Density

What Does and Doesn’t Work?
§ Do
  § Banked Memory
  § Dual Access Memory
  § Full Custom Register Files
  § Split/Multiple Register Files
  § Custom/Semi-Custom Data-paths
  § Variable Length Instructions
  § CISC ISA
  § Co-Processors
  § Multi-Core
§ Don’t
  § Multi-Level Caches
  § Super-Scalar
  § VLIW Packet Descriptors
  § Speculative Branching
  § Full Synthesis
  § Dynamic Logic
§ Consider
  § Multi-Threading
  § µP with Co-Processors

Agenda
§ Industry Trends
§ DSP Architecture
§ DSP Micro-Architecture
§ DSP Systems

DSP Systems
[Figure: device callouts by application area.
Application areas: Wireless Infrastructure, Wired Infrastructure, Performance Audio, Digital Still Camera, Wireless Client.
Devices: TMS320C5561, OMAP5910, TMS320C6416, TMS320DM310, TMS320DA610.
Callout text: 6 DSP CPU @ 300 MHz; 600 MHz; 225 MHz; Floating point; DSP+GPP; 24 Mb integrated memory; 3 MB integrated memory; 180M transistors; Viterbi and Turbo hardware accelerators; Imaging accelerators; Low power consumption; Voice, data, video]

VoIP Platform
§ TNETV3010 Features
  § 6 C55x DSP @ 300 MHz
  § Shared Instruction Memory
  § Broadcast DMA
  § 24M Bits of On Chip SRAM

DaVinci Platform

OMAP Platform
§ OMAP2420 Features
  § ARM1136 @ 330 MHz, VFP (Vector Floating Point), 32K/32K I/D cache
  § DSP @ 220 MHz
  § 2D/3D graphics accelerator
  § IVA supports still images to >4 Mpixels, 30 fps VGA video decode
  § Output to TV for gaming and video playback
  § Encryption hardware for DRM and security
[Figure: OMAP2420 block diagram with ARM11 + VFP, Imaging & Video Accelerator (IVA), TMS320C55x DSP, 2D/3D Graphics Accelerator, L3 Interconnect, L4 Interconnect, Peripherals, Security, Camera I/F, LCD Controller, Memory I/F, Internal SRAM and Video Out]