Bounded Dataflow Networks and Latency-Insensitive Circuits (Cont.)

Arvind
Computer Science and Artificial Intelligence Laboratory, MIT
Based on the work of Murali Vijayaraghavan and Arvind [MEMOCODE 2009]
November 19, 2009
http://csg.csail.mit.edu/korea
L23-1

Modular transformation
[Figure: an SSM made of SSM1, SSM2, and SSM3 is transformed into a BDN made of BDN1, BDN2, and BDN3.]
Is this transformation correct? Yes: provided each BDNi implements SSMi and is latency-insensitive, the resulting BDN implements the SSM and is latency-insensitive.

BDN Implementing an SSM
A BDN is said to implement an SSM iff:
1. There is a bijective mapping between the inputs (outputs) of the SSM and the BDN
2. The output histories of the SSM and the BDN match whenever the input histories match
3. The BDN is deadlock-free

Latency-Insensitive BDN (LI-BDN)
A BDN implementing an SSM is an LI-BDN iff it has:
- the No-Extraneous-Dependencies (NED) property
- the Self-Cleaning (SC) property
Theorem: A BDN in which all the nodes are LI-BDNs will not deadlock.

No-Extraneous-Dependency (NED) property
[Figure: an SSM with output out, showing the inputs combinationally connected to out, and the corresponding BDN with output FIFO outQ.]
Production of outQ waits only for the input FIFOs of the inputs that are combinationally connected to out.

Self-Cleaning (SC) property
If the BDN has enqueued all its outputs, it will dequeue all its inputs.

Modular refinement - revisited
[Figure: SSM1 is the module to be refined and SSM2 is the rest of the design. Labels: LI-BDN2 (automatically generated), LI-BDN1 implementing SSM1, and LI-BDN1 refined manually alongside the unchanged LI-BDN2.]

Writing an LI-BDN wrapper for an SSM
Given the SSM:
    o_j(t) = f_j(i_j1(t), ..., i_jIj(t), s(t))    // i_j1, i_j2, ..., i_jIj are combinationally connected to o_j
    s(t+1) = g(i_1(t), i_2(t), ..., s(t))
LI-BDN:
    // introduce a done flag and a rule for each output
    rule O_j when (!done_j)
        done_j <= True;
        o_j.enq(f_j(i_j1.first, ..., i_jIj.first, s))
    // introduce the Finish rule
    rule Finish when (done_1 && done_2 && ...)
        done_1 <= False; done_2 <= False; ...;
        s <= g(i_1.first, i_2.first, ..., s);
        i_1.deq; i_2.deq; ...
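The wrapper rules above can be exercised with a small software model. The following is a minimal Python sketch, not the authors' implementation: the class `LIBDNWrapper`, the helper `make_example`, and the example SSM (o0 = i0 + s, o1 = i1 + s, s' = s + i0 + i1) are all hypothetical names chosen for illustration, and the output FIFOs are left unbounded for simplicity, whereas a real BDN would also guard each enqueue on not-full.

```python
from collections import deque


class LIBDNWrapper:
    """Sketch of the LI-BDN wrapper: one done flag and one rule per output, plus Finish."""

    def __init__(self, out_fns, deps, next_state, s0, n_inputs):
        self.out_fns = out_fns        # out_fns[j](dep_values, s) plays the role of f_j
        self.deps = deps              # deps[j]: inputs combinationally connected to o_j
        self.next_state = next_state  # next_state(all_inputs, s) plays the role of g
        self.s = s0
        self.inq = [deque() for _ in range(n_inputs)]   # input FIFOs
        self.outq = [[] for _ in out_fns]                # output FIFOs (unbounded here)
        self.done = [False] * len(out_fns)

    def rule_output(self, j):
        """rule O_j: fire when done_j is clear and the inputs o_j depends on are present."""
        if self.done[j] or any(not self.inq[i] for i in self.deps[j]):
            return False
        heads = [self.inq[i][0] for i in self.deps[j]]
        self.outq[j].append(self.out_fns[j](heads, self.s))
        self.done[j] = True
        return True

    def rule_finish(self):
        """rule Finish: fire when every output is done and every input is present."""
        if not all(self.done) or any(not q for q in self.inq):
            return False
        self.s = self.next_state([q[0] for q in self.inq], self.s)
        for q in self.inq:
            q.popleft()
        self.done = [False] * len(self.done)
        return True

    def run(self):
        """Keep firing enabled rules until none is enabled."""
        progress = True
        while progress:
            fired = [self.rule_output(j) for j in range(len(self.done))]
            fired.append(self.rule_finish())
            progress = any(fired)


def make_example():
    # Hypothetical SSM: o0 = i0 + s, o1 = i1 + s, s' = s + i0 + i1
    return LIBDNWrapper(
        out_fns=[lambda v, s: v[0] + s, lambda v, s: v[0] + s],
        deps=[[0], [1]],
        next_state=lambda ins, s: s + ins[0] + ins[1],
        s0=0,
        n_inputs=2,
    )


bdn = make_example()
bdn.inq[0].extend([1, 2, 3])      # only i0 has arrived so far
bdn.run()
print(bdn.outq)                   # [[1], []]: o0 is produced without waiting for i1
bdn.inq[1].extend([10, 20, 30])   # i1 arrives late
bdn.run()
print(bdn.outq)                   # [[1, 13, 36], [10, 31, 63]]
```

The first `run` shows the no-extraneous-dependency behaviour (o0 appears as soon as i0 does), and the final histories are the same as those of the example SSM evaluated cycle by cycle, regardless of when i1 arrived.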

Wrapper circuit
[Figure: wrapper circuit around a patient SSM. Signal labels: input FIFO Ii (first, deq, not-empty), output FIFO Oj (enq, not-full), Depends-on(Oj), donej, All dones, All input deq, and the SSM's value and enable ports.]

Patient SSM
[Figure: an SSM drawn as Inputs, Combinational Logic, and Outputs, shown with an added Enable signal.]

Example: 3-port and 1-port Register Files
[Figure: a 3-port register file rf with inputs ra0, ra1, wen, wa, wd and outputs rd0, rd1; a 1-port register file rf with inputs en, R/W, a, d and output out.]
    interface RegisterFile3Ports
        method Value rd0(Addr a);
        method Value rd1(Addr a);
        method Action wr(Addr a, Value x);
    endinterface
    interface RegisterFile1port
        method ActionValue#(Value) access(Req r);
    endinterface
    // Response to write access is unconstrained
    typedef union tagged {
        W struct {a: Addr, v: Value};
        R struct {a: Addr};
    } Req;

LI-BDN for a 3-port register file
[Figure: register file rf with input FIFOs ra0, ra1, wen, wa, wd, output FIFOs rd0, rd1, and flags rd0Done, rd1Done.]
    rule RD0 when (!rd0Done)
        rd0.enq(rf.rd0(ra0.first));
        rd0Done <= True;
    rule RD1 when (!rd1Done)
        rd1.enq(rf.rd1(ra1.first));
        rd1Done <= True;
    rule finish when (rd0Done && rd1Done)
        ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq;
        if (wen.first) rf.wr(wa.first, wd.first);
        rd0Done <= False; rd1Done <= False;

Refinement into a one-ported register file
[Figure: LI-BDN with input FIFOs ra0, ra1, wen, wa, wd, output FIFOs rd0, rd1, flags rd0Done, rd1Done, and a register file rf with ports en, R/W, a, d, out. This uses 1 port.]
    rule RD0 when (!rd0Done)
        let x <- rf.access(R {a: ra0.first});
        rd0.enq(x);
        rd0Done <= True;
    rule RD1 when (!rd1Done)
        let x <- rf.access(R {a: ra1.first});
        rd1.enq(x);
        rd1Done <= True;
    rule finish when (rd0Done && rd1Done)
        ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq;
        if (wen.first) rf.access(W {a: wa.first, v: wd.first});
        rd0Done <= False; rd1Done <= False;
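To see why the one-ported version is a valid refinement, it can help to simulate both register-file BDNs and compare their output histories. The sketch below is a simplified Python model under stated assumptions: `run_regfile_bdn` and its parameters are hypothetical names, the five input FIFOs are modelled as a single FIFO of tuples that move in lockstep, and the schedule (finish never fires in the same clock cycle as a read rule; at most `ports` register-file accesses per clock cycle) is an assumption of this sketch rather than something stated in the slides.

```python
from collections import deque


def run_regfile_bdn(requests, ports, nregs=4):
    """Simulate the register-file LI-BDN with `ports` accesses allowed per clock cycle.

    requests: one (ra0, ra1, wen, wa, wd) tuple per SSM cycle.
    Returns (rd0 history, rd1 history, clock cycles used).
    """
    rf = [0] * nregs
    inq = deque(requests)   # simplification: the five input FIFOs move in lockstep
    rd0, rd1 = [], []
    rd0_done = rd1_done = False
    cycles = 0
    while inq:
        ra0, ra1, wen, wa, wd = inq[0]
        cycles += 1
        budget = ports          # register-file accesses still available this cycle
        read_fired = False
        if not rd0_done and budget > 0:          # rule RD0
            rd0.append(rf[ra0])
            rd0_done, budget, read_fired = True, budget - 1, True
        if not rd1_done and budget > 0:          # rule RD1
            rd1.append(rf[ra1])
            rd1_done, budget, read_fired = True, budget - 1, True
        # rule finish: assumed not to fire in the same cycle as a read rule
        if not read_fired and rd0_done and rd1_done:
            if wen:
                rf[wa] = wd
            inq.popleft()
            rd0_done = rd1_done = False
    return rd0, rd1, cycles


reqs = [(0, 1, True, 0, 5), (0, 1, True, 1, 7), (0, 1, False, 0, 0)]
print(run_regfile_bdn(reqs, ports=3))   # ([0, 5, 5], [0, 0, 7], 6)
print(run_regfile_bdn(reqs, ports=1))   # ([0, 5, 5], [0, 0, 7], 9)
```

Both runs produce the same rd0 and rd1 histories; only the number of clock cycles differs, which is the sense in which the timing of a module can change without affecting what the rest of the design observes.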

Pipelining combinational circuits
[Figure: a circuit over signals a, b, c, d, e with labels S1, S2, S3, f2, f3, and a second version of the same circuit with labels R1, R2, R3, f2, f3.]
Can potentially reduce the critical path of the entire circuit.

Optimizing an LI-BDN
[Figure: a mux with ports a, b, c, d, shown twice.]
- Does not wait for don't-care inputs
- Counters are used to keep track of how many inputs to drop
- Can potentially increase throughput
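The counter idea can be sketched in a few lines of Python. This is an illustrative model only: it assumes that c is the mux select and that d = a when c is true and d = b otherwise (the figure labels do not say which port is the select), and the class `OptimizedMux` and its fields are hypothetical names. The counters are unbounded here; a hardware version would need bounded counters and would stall when they saturate.

```python
from collections import deque


class OptimizedMux:
    """d = a if c else b, without waiting for the input that c does not select."""

    def __init__(self):
        self.q = {"a": deque(), "b": deque(), "c": deque()}   # input FIFOs
        self.d = []                                            # output history
        self.drop = {"a": 0, "b": 0}                           # stale tokens still to discard

    def _discard_stale(self):
        # Drop don't-care tokens that belong to cycles whose output was already produced.
        for name in ("a", "b"):
            while self.drop[name] > 0 and self.q[name]:
                self.q[name].popleft()
                self.drop[name] -= 1

    def step(self):
        """Try to produce one output; return True if one was produced."""
        self._discard_stale()
        if not self.q["c"]:
            return False
        sel, other = ("a", "b") if self.q["c"][0] else ("b", "a")
        if not self.q[sel]:       # selected input for this cycle has not arrived yet
            return False
        self.d.append(self.q[sel].popleft())
        self.q["c"].popleft()
        self.drop[other] += 1     # the unselected token for this cycle is a don't-care
        return True

    def run(self):
        while self.step():
            pass
        self._discard_stale()


mux = OptimizedMux()
mux.q["c"].extend([True, True, False])
mux.q["a"].extend([1, 2, 3])
mux.run()
print(mux.d, mux.drop)            # [1, 2] {'a': 0, 'b': 2}: two outputs before any b arrives
mux.q["b"].extend([10, 20, 30])
mux.run()
print(mux.d)                      # [1, 2, 30]
```

The output history matches a plain mux evaluated cycle by cycle, but the first two outputs are available before b has delivered anything, which is where the potential throughput gain comes from.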

Summary
Latency-insensitive BDNs allow true modular refinement of a system: even the timing contract of a module can be changed without affecting the rest of the system.

A Design Flow issue
[Figure: processor pipeline with blocks Fetch 1, Fetch 2, Crack, Decode, Reg File, Addr Calc / Branch Resolve, Mem 1, Mem 2 / ALU / Exception Handler, Register Write, and signals Branch Prediction, Branch Pred Exception, Branch Resolution.]
- Register file implemented as a BRAM
- Pipelined multiplier
- Multicycle divider
We can apply the technique discussed to refine this design. But where does this design come from in the first place? Verilog Compiler Output? Bluespec?

Design Flow Issues
- Generation of appropriate RTL is the major problem
- RTL / specifications should be written in such a way that they are amenable to refinements
- Latency-Insensitive Design Methodology

The PowerPC Project
Cycle-accurate modeling of PowerPC on FPGAs

PPC In-order Pipeline
[Figure: in-order pipeline with blocks PC, Fetch, BrPred, Crack, Decode, AddrCalc, BrRes, RegRd, Mem1, Mem2, ALU, Excep, RegWr; stall, bypass, and epochs signals; and I$/ITlb1, I$/ITlb2, D$/DTlb1, D$/DTlb2, Mem.]
- The designer specifies the FSM for each stage
- The FIFOs are latency-insensitive; that is, the correctness of the specification does not depend on the depth of the FIFOs or the number of stages

The steps in cycle-accurate implementation on FPGAs (can be mechanized)
- The specs are turned into Bluespec code to give a target SSM
    - Once the size of the FIFOs is fixed, the whole design has a precise timing specification
- If the FPGA implementation requires refining some stages, then cuts are made in the design to isolate the stages (SSMs) to be refined
- Each SSM is turned into a BDN by introducing FIFOs for each input and output wire, including the wires going in and out of the model FIFOs of the SSM
    - This converts the nth time cycle of the SSM into the nth enqueue into input FIFOs and the nth dequeue from output FIFOs
- Atomic rules for the operation of each BDN are defined so that no extraneous dependencies are introduced
    - This also ensures deadlock-free operation

Initial results using XUPV5 FPGA
Direct prototype:
- Developed mainly at IBM (Kattamuri Ekanadham, Jessica Tseng)
- 92% LUT resources
- 20 MHz, can possibly be increased to 40 MHz
- Boots Linux
Cycle-accurate model using BDN theory:
- Developed mainly at MIT (Asif Khan, Yuan Tang), re-using a lot of components
- 24% LUT resources
- 125 MHz

Detailed Preliminary Results
Asif Khan & Murali Vijayaraghavan (June 2009)
Cycle-accurate refinements onto Xilinx XUPV5
- Slice Logic Utilization:
    - Number of Slice Registers: 15448 out of 69120 (22%)
    - Number of Slice LUTs: 16702 out of 69120 (24%)
- Specific Feature Utilization:
    - Number of Block RAM/FIFO: 1 out of 148 (0%) (only 1 BRAM for the register file)
    - Number of DSP48Es: 12 out of 64 (18%) (these are used for the divider)
- Minimum period: 7.988 ns (Maximum Frequency: 125.188 MHz)
- Partially verified by running a 50-instruction program
- No numbers yet for actual work done
Compared to: Jessica's port onto Xilinx XUPV5 takes up 92% of the area; 20 MHz, possibly 40 MHz
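The percentages and the maximum frequency above follow directly from the raw counts and the minimum period. A short Python check, with hypothetical helper names and assuming the report truncates percentages to whole numbers (which matches the 18% shown for 12 out of 64):

```python
def utilization_percent(used, available):
    """Integer percentage, truncated toward zero."""
    return 100 * used // available


def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a given minimum period in nanoseconds."""
    return 1000.0 / min_period_ns


resources = {
    "Slice Registers": (15448, 69120),
    "Slice LUTs": (16702, 69120),
    "Block RAM/FIFO": (1, 148),
    "DSP48Es": (12, 64),
}
for name, (used, available) in resources.items():
    print(f"{name}: {utilization_percent(used, available)}%")
print(f"Maximum frequency: {max_frequency_mhz(7.988):.3f} MHz")
```

This prints 22%, 24%, 0%, and 18% for the four resources and a maximum frequency of 125.188 MHz.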

Conclusion
- Cycle-accurate modeling of processors on FPGAs is feasible and offers a three-orders-of-magnitude improvement in performance over software simulators
- BDNs offer a way to refine RTL without losing cycle-accuracy
- Bluespec makes quick RTL generation feasible
    - The generation of BDNs can be automated
We plan to release our Bluespec designs under open-source licensing to strengthen the PowerPC ecosystem.