An FPGA Computing Demo Core for Space Charge

An FPGA Computing Demo Core for Space Charge Simulation
Wu, Jinyuan (Fermilab, jywu168@fnal.gov)
Huang, Yifei (Illinois Math & Science Academy)
May 2009

About Illinois Math & Science Academy
• One coauthor (Huang Yifei) is with the Illinois Math & Science Academy (IMSA).
• IMSA enrolls academically talented Illinois students in grades 10-12.
• Nobel Laureate Dr. Leon Lederman is an IMSA founder and Resident Scholar at IMSA.
• The work was done through the Student Inquiry and Research (SIR) program. (SIR consists of 22 Wednesdays in the 2008-09 academic year. The work was done at Fermilab. Transportation to Fermilab was provided by IMSA.)

What?
For the space charge simulation computing task, a 16-bit demo core has been developed in a low-cost FPGA:
• One FPGA = 5 Intel Core 2 Duo 2.2 GHz CPUs
• (0.5 W) vs. (5 x 75 W)

The Space Charge Computing

Number of Electrons    Calculations/Iteration    Computing Time/1000 Iterations
10^3                   ~10^6                     100 s
10^4                   ~10^8                     2.7 hours
10^5                   ~10^10                    11.6 days
10^6                   ~10^12                    3.2 years
(@ 10^7 calculations/s)

• Each electron sees the sum of the Coulomb forces from the other N-1 electrons.
• The total number of calculations is about N^2, and each calculation of the Coulomb force requires a square root, a division and several multiplications.
• Regular sequential computers are not fast enough.
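The N^2 cost described on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, and unit charges with a unit Coulomb constant are assumed so that the arithmetic (square root, division, multiplications) stays visible.

```python
import math

RATE = 1e7          # Coulomb force calculations per second (from the slide)
ITERATIONS = 1000   # iterations per run (from the slide)


def direct_sum_forces(positions):
    """Sum the Coulomb force on each particle from the other N-1 particles.

    Assumption: unit charges and a unit Coulomb constant (arbitrary units).
    """
    forces = []
    for i, (xi, yi, zi) in enumerate(positions):
        fx = fy = fz = 0.0
        for j, (xj, yj, zj) in enumerate(positions):
            if i == j:
                continue  # a particle exerts no force on itself
            dx, dy, dz = xi - xj, yi - yj, zi - zj
            r2 = dx * dx + dy * dy + dz * dz      # three multiplications
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))   # square root and division
            fx += dx * inv_r3
            fy += dy * inv_r3
            fz += dz * inv_r3
        forces.append((fx, fy, fz))
    return forces


def computing_time_seconds(n, iterations=ITERATIONS, rate=RATE):
    """Estimated run time for n particles at `rate` pair calculations/s."""
    return n * (n - 1) * iterations / rate
```

With these assumptions, `computing_time_seconds(10**6)` comes out at roughly 1e8 s, which is the "3.2 years" row of the table.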

The FPGA Board
• Up to 16 FPGA devices ($32 each) can be installed on each board.
• Each FPGA hosts one core.

The 16-bit Demo Core
[Block diagram. Labels: 16-bit coordinates (xi, yi, zi) and (xj, yj, zj); subtractors; squarers (x^2); adder; LUT (10 b in, 16 b out); multipliers (X); accumulators (Σ); 32-bit forces; 16-bit velocities (vxj, vyj, vzj)]

A Double-Layer + Single-Layer Sequencer
A double-layer loop is followed by a single-layer loop.
[Diagram. Labels: State Control; Outer Loop; Inner Loop; counters AA and BA; index sequences 0, 1, 2, 3, 4, ..., 255]
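The control flow of such a sequencer can be sketched as a Python generator. The phase names `"pair"` and `"update"` are hypothetical, and what each loop does is an assumption here (pairwise force accumulation in the double loop, a per-particle update in the single loop); the slide itself only states that a double-layer loop is followed by a single-layer loop over indices 0 to 255.

```python
def sequencer(n=256):
    """Yield (phase, outer, inner) states: a double loop, then a single loop.

    Assumption: the "pair" phase visits every (outer, inner) index pair and
    the "update" phase visits every index once; phase names are illustrative.
    """
    for outer in range(n):
        for inner in range(n):
            yield ("pair", outer, inner)
    for index in range(n):
        yield ("update", index, None)


# Small usage example with 4 indices instead of 256.
states = list(sequencer(4))
print(len(states))  # 4 * 4 pair states followed by 4 update states
```

For the full 256 indices this gives 256 x 256 + 256 states per iteration, so at one state per clock the double loop dominates the run time.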

The Lookup Table
The LUT replaces:
• A square root
• Two multiplications
• A reciprocal operation
[Diagram. Labels: squarers (x^2); adder; LUT (10 b in, 16 b out)]
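One way to read the list above is that the table maps the sum of squares r^2 directly to 1/r^3, which folds the square root, the two multiplications for r^3 and the reciprocal into a single lookup. The sketch below builds such a table under that assumption. The scaling, the saturation of small inputs and the name `build_inv_r3_lut` are illustrative choices, not taken from the slides; only the 10-bit input and 16-bit output widths are.

```python
LUT_IN_BITS = 10    # from the slide: 10 b in
LUT_OUT_BITS = 16   # from the slide: 16 b out


def build_inv_r3_lut(in_bits=LUT_IN_BITS, out_bits=LUT_OUT_BITS):
    """Tabulate r2 ** -1.5 for every in_bits-wide value of r2.

    Assumptions (not from the slides): inputs are pre-shifted so that they
    are at least 2 ** (in_bits - 2), that smallest input maps to full scale,
    and any smaller index simply saturates at full scale.
    """
    size = 1 << in_bits
    full_scale = (1 << out_bits) - 1
    smallest = 1 << (in_bits - 2)          # 256 for a 10-bit input
    scale = full_scale * smallest ** 1.5
    table = [full_scale] * size
    for r2 in range(smallest, size):
        table[r2] = round(scale * r2 ** -1.5)
    return table


lut = build_inv_r3_lut()
print(len(lut), lut[256], lut[1023])
```

The table has 1024 words, compared with the 4 G words a full 32-bit input would need.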

Number of Bits for Input to LUT
• A 32-bit input LUT is too big: 2^32 = 4 G words.
• Shifters are used before and after the LUT. Leading zeros are eliminated:
  • 0000000101011000
[Diagram. Labels: 16-bit coordinates (xi, yi, zi); subtractors; squarers (x^2); 32-bit sum of squares; LUT (10 b in, 16 b out); multipliers (X); 32-bit forces]

Bit Evolution Before LUT
[Diagram. Bit widths through the stages: x1, x2; (x1 - x2); (x1 - x2)^2; sum of 3 squares; LUT]
If ((High Bits) != 0) Choose (High Bits) Else Choose (Low Bits)

Bit Evolution After LUT
• Shift 2n before LUT
• Shift 3n after LUT
[Diagram. Labels: (x1 - x2); LUT]
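The pairing of a 2n-bit shift before the LUT with a 3n-bit shift after it is consistent with a table that maps r^2 to r^-3: dividing r^2 by 2^(2n) divides r by 2^n and therefore multiplies r^-3 by 2^(3n), which a 3n-bit right shift undoes. The sketch below checks this numerically. It is an illustration under that reading, with the LUT modelled as an exact function call rather than a quantized 16-bit table, and `inv_r3_via_lut` is a hypothetical name.

```python
LUT_IN_BITS = 10  # from the slides: 10 b in


def inv_r3_via_lut(r2):
    """Approximate r2 ** -1.5 for a positive integer r2 via a 10-bit index.

    Assumption: the table lookup is replaced by an exact computation, so the
    only error left is the truncation of the low bits of r2.
    """
    if r2 <= 0:
        raise ValueError("r2 must be a positive integer")
    n = 0
    while (r2 >> (2 * n)) >= (1 << LUT_IN_BITS):
        n += 1                          # drop two more low bits per step
    index = r2 >> (2 * n)               # shift 2n before the LUT
    lut_value = index ** -1.5           # stand-in for the table lookup
    return lut_value / (1 << (3 * n))   # shift 3n after the LUT


for r2 in (344, 1_000_000, 2 ** 31):
    exact = r2 ** -1.5
    approx = inv_r3_via_lut(r2)
    print(r2, abs(approx - exact) / exact)
```

Because the shifted index keeps at least 9 significant bits whenever a shift is applied, the relative error from truncation stays below about 0.6% in this model.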

Two Electrons with Natural Scales
[Figure: two electrons (e). Labels: 256 nm, 28 ps]

256 Charged Particles, Iteration 0

256 Charged Particles, Iteration 5

256 Charged Particles, Iteration 10

256 Charged Particles, Iteration 15

256 Charged Particles, Iteration 20

256 Charged Particles, Iteration 25

256 Charged Particles, Iteration 30

256 Charged Particles, Iteration 35

256 Charged Particles, Iteration 40

Speed Comparison with Regular CPU
• The FPGA core is 10 times faster than a typical 2.2 GHz CPU core.
• The FPGA core runs at 200 MHz, or 200 M Coulomb force calculations/s.
• The CPU core appears to need 80-100 clock cycles for each Coulomb force calculation.
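The figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above (200 M calculations/s for the FPGA core, a 2.2 GHz CPU clock, 80 to 100 cycles per calculation); the function names are hypothetical.

```python
FPGA_RATE = 200e6   # Coulomb force calculations per second (from the slide)
CPU_CLOCK = 2.2e9   # CPU clock in Hz (from the slide)


def cpu_rate(cycles_per_calculation, clock=CPU_CLOCK):
    """CPU calculations per second for a given cycle cost per calculation."""
    return clock / cycles_per_calculation


def speedup(cycles_per_calculation):
    """Ratio of the FPGA core rate to the CPU core rate."""
    return FPGA_RATE / cpu_rate(cycles_per_calculation)


for cycles in (80, 100):
    print(f"{cycles} cycles/calc: CPU {cpu_rate(cycles) / 1e6:.1f} M calc/s, "
          f"FPGA speedup x{speedup(cycles):.1f}")
```

At 80 to 100 cycles per calculation the ratio comes out between roughly 7 and 9, so the slide's factor of 10 reads as a rounded figure.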

One Board: 8 FPGA Cores
One Core/FPGA = 5 Dual Core CPUs
8 Cores/Board = 40 Dual Core CPUs
• One board has a calculation capacity equivalent to 40 dual-core CPUs.
• The power consumption of one board is < 4.5 W.
• Newer FPGAs capable of hosting 4 cores/FPGA are available.

The Execution & Non-Execution Cycles
From the MIT 6.823 Open Course site
• In current microprocessors:
  • Each instruction takes one clock cycle to execute.
  • It takes many clock cycles to prepare for executing an instruction.
  • Pipelined? Yes. But the non-execution pipeline stages consume silicon area, power, etc.
• To execute an instruction != to do useful calculation.
• Can we do something different? Arithmetic, Algorithm, Architecture.

The End
Thanks