Network-on-Chip
Prof. Jun-Dong Cho, Sungkyunkwan University

Contents
• Introduction to NoC
• On-chip network architecture
• On-chip network design case studies
• NoC design case studies
• A Dynamic Routing Mechanism for Network on Chip

Technology Evolution

NoC definition
• A flexible and scalable packet-based on-chip micro-network designed according to a layered methodology
• Los Angeles: reducing commute time by 15 min -> $15 billion economic impact
• On-chip communication will dominate performance and power efficiency.

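The need for NoC
• Wireless processing systems require high throughput and heavy computation, yet operate under strict power constraints.
• Reconfigurable SoC implementations pursue performance gains through parallelism and make use of IP reuse.
• Algorithm partitioning driven by performance prediction that accounts for hot-spot bottlenecks (or traffic)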

Network-on-chip Architecture

Design challenges for On-Chip Communication Architectures
Three system-on-chip design issues:
• Technology issues
• Performance issues
• Design productivity issues

Design challenges for On-Chip Communication Architectures
• Technology issues:
– Global wire delay limits the on-chip distance that critical signals can travel.
– Self-synchronous cores that communicate with one another through a network-centric architecture avoid deep sub-micron effects (clock skew, power associated with clock distribution trees).
– Signal integrity issues can be addressed by designing regular structures, which allow the electrical parameters of wires to be optimised and well controlled.

Design challenges for On-Chip Communication Architectures
• Performance issues: network congestion causes large latency fluctuations in packet delivery. There are two ways to address this problem:
– Network overdimensioning (for NoCs that support best-effort traffic).
– Dedicated mechanisms that provide guarantees for timing-constrained traffic (e.g., lossless data transport, minimal bandwidth, bounded latency, throughput).

Design challenges for On-Chip Communication Architectures
• Design productivity issues:
– Reusing complex pre-verified design blocks is an efficient means of increasing productivity.
– Using processing elements across different platforms in a plug-and-play style requires a scalable and modular on-chip network.
– The use of processing elements is facilitated by a standardized network, which makes the modularity of the NoC effective.
– Several standards for networks-on-chip have been proposed, such as those from the Virtual Socket Interface Alliance (VSIA) and OCP.

Network-on-chip Architecture
• Network Interface (NI):
– Hides the details of the network communication protocol from the cores, so that cores can be developed independently of the communication infrastructure.
– Converts communication protocols (from end-to-end protocol to network protocol).
– Packetizes data (packet assembly, delivery, and disassembly). This is a critical task. Messages that have to be transmitted across the network are partitioned into fixed-length packets. Packets are broken into flits, which represent logical units of information. A phit is the unit of information that can be transferred across a physical channel.
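The packetization step can be illustrated with a minimal sketch. The packet length (12 bytes), the flit size (4 bytes), the zero padding, and the head/body/tail labels are assumptions chosen for illustration and are not taken from the slides; a real network interface would also carry routing information in the head flit.

```python
from dataclasses import dataclass
from typing import List

PACKET_BYTES = 12  # hypothetical fixed packet length
FLIT_BYTES = 4     # hypothetical flit size


@dataclass
class Flit:
    kind: str        # "head", "body", or "tail"
    packet_id: int   # which packet this flit belongs to
    payload: bytes


def packetize(message: bytes, packet_bytes: int = PACKET_BYTES) -> List[bytes]:
    """Partition a message into fixed-length packets, zero-padding the last one."""
    return [
        message[i:i + packet_bytes].ljust(packet_bytes, b"\x00")
        for i in range(0, len(message), packet_bytes)
    ]


def to_flits(packet_id: int, packet: bytes, flit_bytes: int = FLIT_BYTES) -> List[Flit]:
    """Break one packet into flits; the first is the head, the last the tail."""
    chunks = [packet[i:i + flit_bytes] for i in range(0, len(packet), flit_bytes)]
    flits = []
    for idx, chunk in enumerate(chunks):
        if idx == 0:
            kind = "head"
        elif idx == len(chunks) - 1:
            kind = "tail"
        else:
            kind = "body"
        flits.append(Flit(kind, packet_id, chunk))
    return flits


def reassemble(flits: List[Flit], message_length: int) -> bytes:
    """Disassembly at the receiving NI: order by packet, join, strip the padding."""
    ordered = sorted(flits, key=lambda f: f.packet_id)  # stable sort keeps flit order
    return b"".join(f.payload for f in ordered)[:message_length]


message = b"network-on-chip data"
flits = [f for pid, p in enumerate(packetize(message)) for f in to_flits(pid, p)]
print([f.kind for f in flits])
print(reassemble(flits, len(message)))
```

The receiving side needs the original message length (or an equivalent field in the packet header) to discard the padding, which is why `reassemble` takes it as a parameter here.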

Network-on-chip Architecture
• Network Switch:
– Carries packets injected into the network to their final destination, following a statically defined or dynamically determined routing path.
– A switch may have both input and output buffers, or only one type of buffer.
– Network flow control (routing mode) addresses the limited amount of buffering resources.
– The three network flow control policies are:
    – Store-and-forward routing: an entire packet is received and stored before being forwarded to the next switch.
    – Virtual cut-through routing: also requires buffer space for a packet, but allows lower-latency communication.

Network-on-chip Architecture
• Network Switch:
– The three network flow control policies are (cont.):
    – Wormhole routing: reduces switch memory requirements and permits low-latency communication. The first flit is decoded and the switch creates a path for the following flits. A flit is passed to the next switch as soon as there is enough space to store it, even if there is not enough space to store the whole packet.
– Guaranteeing quality of service (QoS) in switch operation is necessary when time-constrained traffic is to be supported.
– Contention-related delays are responsible for large fluctuations in performance metrics.
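The latency and buffering difference between the three flow control policies can be made concrete with a simplified model. The sketch below assumes an idealized, contention-free network in which each link transfers one flit per cycle and switches add no pipeline delay; the function names are illustrative and do not come from the slides.

```python
def store_and_forward_cycles(hops: int, flits_per_packet: int) -> int:
    """Each switch receives the whole packet before forwarding it,
    so the full packet transfer time is paid on every hop."""
    return hops * flits_per_packet


def cut_through_cycles(hops: int, flits_per_packet: int) -> int:
    """Virtual cut-through and wormhole (no contention): the head flit needs
    one cycle per hop, and the remaining flits follow it in a pipeline."""
    return hops + flits_per_packet - 1


def min_buffer_flits(policy: str, flits_per_packet: int) -> int:
    """Smallest per-switch buffer, in flits, that each policy needs."""
    if policy in ("store-and-forward", "virtual cut-through"):
        return flits_per_packet  # must be able to hold a whole packet
    if policy == "wormhole":
        return 1                 # a single flit is enough to make progress
    raise ValueError(f"unknown policy: {policy}")


hops, flits_per_packet = 4, 8
print(store_and_forward_cycles(hops, flits_per_packet))  # 32
print(cut_through_cycles(hops, flits_per_packet))        # 11
print(min_buffer_flits("wormhole", flits_per_packet))    # 1
```

Under these assumptions virtual cut-through and wormhole have the same unloaded latency; they differ in what happens when a packet blocks. Virtual cut-through can absorb the whole packet in one switch, while a blocked wormhole packet stays spread across several switches.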

From spaghetti wires to NoC
• Marcello Coppola, MPSoC 05, On-chip communication infrastructure

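On-chip network architecture
● Development of router/scheduler algorithms
● Network model design and verification using SystemC
● Design of core IP for star and mesh on-chip networks
● Design of master/slave network interfaces and a high-performance memory management interface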

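SoC design platform based on on-chip networks
● Development of a tool for generating distributed crossbar switch topologies and mapping IP onto them
● Development of an IP-to-mesh-tile mapping tool
● Development of a tool that generates the network topology from an analysis of data flow between IP blocks, and construction of the SoC platform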

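Application areas
- Supports protocols that guarantee QoS, making it suitable for real-time applications and for applications that demand high data bandwidth
- System-level chips needed to build products such as multimedia SoCs, portable and communication terminals, Internet set-top boxes, game consoles, and network terminals
- SoC design for high-volume multimedia applications such as high-frame-rate video and 3D graphics
- A platform-based design environment that combines the core on-chip network IP and the design support tools into a single platform, to be used across a variety of SoC designs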

On chip communication

Putting the blocks together posed tough questions:
• Do the hardware interfaces work with one another?
• Does the chip have enough bus and memory bandwidth under worst-case loads?
• Do software tasks communicate without deadlock?
• Do all applications and features of the full system meet functional goals?
• Does the system meet performance goals?
• Are the cost and power acceptable?

IBM's CoreConnect
Bus bandwidth has been extended from the initial 32 bits up to 128 bits.

Sonics Smart Interconnect IP

SMART (Sonics Methodology and Architecture for Rapid Time-to-Market)
• Plug-and-play on-chip communications network
• Packet-based
• 50 employees in a year
• Provides IP and a design environment; supports SoC design
• Alliance with Cadence
• SiliconBackplane III targets communications plus media

Arteris NoC layered architecture

OCN Configuration
• A regular interconnect structure and static scheduling eliminate unnecessary interconnect switching.
• Balancing the computational load across all cores improves performance.
• Overhead of the configuration streams:
– Configuration streams must be scheduled periodically along with the data.
– The configuration streams use 4% of the bandwidth.
• Depending on data content variation and the system operating environment, the core interfaces and the cores themselves are dynamically reconfigured into low-power mode.

Scheduled Communication
• A tile is a computational core.
• The core interface enables the use of heterogeneous processing.
• Statically scheduled mesh of interconnect
• Data moves between neighboring tiles through a communication pipeline, which allows a fast clock rate and time-division sharing of interconnect resources.
• The ability to reconfigure the cores and the interconnect at run time enables dynamic power management.
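A statically scheduled, time-shared interconnect can be sketched as a periodic slot table per tile. The example below is a toy model of a one-dimensional row of tiles. The port names ("core", "west", "east"), the table format, and the two-slot period are all assumptions made for illustration; they are not the schedule format of any architecture described in these slides.

```python
from typing import Dict, List, Optional, Tuple

# One schedule per tile: a periodic list of slots, each mapping an
# input port to an output port for that cycle (hypothetical format).
Schedule = List[Dict[str, str]]


def simulate_row(schedules: List[Schedule],
                 inject: Dict[int, str],
                 cycles: int) -> List[Tuple[int, str]]:
    """Simulate a row of tiles joined by one-cycle west-to-east links.

    inject maps a cycle number to the word that the core of tile 0 offers.
    Returns (cycle, word) pairs delivered to the core of any tile.
    A word that arrives in a slot with no matching entry is dropped; a
    static scheduler is expected to rule that case out at compile time.
    """
    n = len(schedules)
    west_in: List[Optional[str]] = [None] * n  # word at each tile's west port
    delivered: List[Tuple[int, str]] = []
    for cycle in range(cycles):
        next_west: List[Optional[str]] = [None] * n
        for i, sched in enumerate(schedules):
            slot = sched[cycle % len(sched)]
            inputs = {
                "west": west_in[i],
                "core": inject.get(cycle) if i == 0 else None,
            }
            for src, dst in slot.items():
                word = inputs.get(src)
                if word is None:
                    continue
                if dst == "east" and i + 1 < n:
                    next_west[i + 1] = word      # arrives next cycle
                elif dst == "core":
                    delivered.append((cycle, word))
        west_in = next_west
    return delivered


# Period-2 schedules that pipeline a stream from tile 0 to tile 2.
schedules = [
    [{"core": "east"}, {}],   # tile 0 sends in even cycles
    [{}, {"west": "east"}],   # tile 1 forwards in odd cycles
    [{"west": "core"}, {}],   # tile 2 receives in even cycles
]
print(simulate_row(schedules, {0: "a", 2: "b"}, cycles=5))
```

Because every transfer is fixed to a slot in advance, no arbitration or buffering beyond one register per link is needed, and unused slots remain available to other streams.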

Adaptive System on Chip

Communication Interface
- Stream data that passes through a communication interface is scheduled for a specific communication clock cycle based on data link availability.
- The result of scheduling for each interface is a set of

9-core and 16-core Mode

Evaluation Methodology

Performance of the Benchmarks

iSOC Compiler
• Divides applications into parts, each of which fits into a specific core.
• Determines data communications between the cores in a space-time fashion.
• Generates interconnect memory contents for each individual interface.

References
• Jian Liang, Sriram Swaminathan, and Russell Tessier, "aSOC: A Scalable, Single-Chip Communications Architecture," University of Massachusetts, Amherst, MA 01003. {jliang, tessier}@ecs.umass.edu
• Krishna Sekar (ksekar@ece.ucsd.edu), Kanishka Lahiri (klahiri@nec-labs.com), and Sujit Dey (dey@ece.ucsd.edu), "Configurable Platforms With Dynamic Platform Management: An Efficient Alternative to Application-Specific System-on-Chips," Dept. of ECE, UC San Diego, La Jolla, CA; NEC Laboratories America, Princeton, NJ

Benchmarks, EE Times, 7/2005
• Xpipes, Bologna and Stanford: compared with an AMBA AHB multilayer bus, 21% faster, but worse latency
• Wehn, Univ. of Kaiserslautern: LDPC decoder: 500 MHz vs. 64 MHz (fixed bus), but 30 W vs. 700 mW, and twice the die size
• Arteris: better die size, comparable power consumption, 740 MHz (250 MHz)
• SonicsMX: power-efficient mobile handset with power management
• STNoC, Spidergon: topology with degree 2-3

NoC Applications
http://www.eit.uni-kl.de/wehn
• Turbo decoder, UMTS compliant, 100 Mbit: large flexibility with 14 parallel units, area = 16.84 mm² (14 mm² PUs, 2.8 mm² NoC)
• LDPC decoding, T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin, Int. Conference on VLSI Design 2005
– 1024-bit block size, 1.2 Gb/s, R = 0.75
– NoC: 5 x 5 2D mesh, dimension-order routing, large flexibility
– 160 nm CMOS technology, 1.8 V, 500 MHz, 110 mm², ~30 W
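Dimension-order routing on a 2D mesh, as used in the LDPC decoder example, can be sketched in a few lines. The version below is the common XY variant (route fully along X first, then along Y); the coordinate convention and the function name are assumptions for illustration, and the cited design may order the dimensions differently.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a switch in the mesh


def xy_route(src: Coord, dst: Coord, width: int = 5, height: int = 5) -> List[Coord]:
    """Return the switches visited from src to dst, correcting the X
    coordinate first and the Y coordinate second."""
    for x, y in (src, dst):
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"({x}, {y}) is outside the {width} x {height} mesh")
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # first dimension: X
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # second dimension: Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
print(route)           # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(len(route) - 1)  # 3 hops
```

The path depends only on the source and destination, so it is deterministic and needs no routing tables, and because every packet finishes its X moves before any Y move, the routes cannot form a cyclic dependency in a mesh.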
