- Number of slides: 54
1 Faster unicores are still needed André Seznec INRIA/IRISA
2 DAL: Defying Amdahl’s Law
• ERC advanced grant to A. Seznec (2011-2016)
• DAL objective: « Given that Amdahl’s Law is Forever, propose (impact) the microarchitecture of the 2020 general-purpose manycore »
3 Multicores are everywhere
• Multicores in servers, desktops, laptops
§ 2-4-8-12 out-of-order (O-O-O) cores
• Multicores in smartphones, tablets
§ 2-4 (not that simple) cores
• Manycores for niche markets
§ 48-80-100 simple cores
§ Tilera, Intel Phi
4 Multicore/multithread for everyone
• End user: improved usage comfort
§ Can surf the web and listen to MP3s
• Parallel performance for the masses?
§ Very few (scalable) mainstream parallel apps
§ Graphics
§ Niche market segments
5 No parallel software bonanza in the near future
• Legacy of sequential codes
• Parallelism is not cost-effective for most apps
• Sequential programming will remain dominant
6 Legacy of sequential codes
• Software is more resilient than hardware
§ Apps survive and evolve for years, often decades
§ Very few parallel apps now
• Redeveloping apps as parallel apps from scratch is unlikely
• Compute-intensive sections will be parallelized
§ But significant code sections will remain sequential
7 Parallelism is not cost-effective for most apps
• Why parallelism?
§ Only for performance
• But costly:
§ Difficult, time-consuming, error-prone
§ Poorly portable: functionality and performance
8 Sequential programming will remain dominant
§ Just easier
§ The « Joe » programmer
§ Portability, maintenance, debug
+ compiler to parallelize
+ parallel libraries
+ software components (developed by experts)
9 Looking backwards
10 2002: The End of the Uniprocessor Road
• Power and temperature walls:
§ Stopped the frequency increase
• 2x transistors: 5%? 10%? more performance (if any)
• Economic logic: buy smaller chips!
• IC industry needs to sell new (expensive) chips:
§ Marketing: « You need hyperthreading, 2, 4, 8 cores »
11 Marketing multicores to the masses
• 2002-...: SMT, dual-core SMT, quad-core SMT: GREAT!!
12 And now? The end user is not such a fool...
13 Following the trend: 2020
• Silicon area, power envelope
§ ≈ 100 Nehalem-class cores or
§ ≈ 1,000 simple cores (VLIW, in-order superscalar)
14 Amdahl’s Law
“Cannot run faster than sequential part”
[Figure: execution time split into a sequential (seq.) part and a parallel part]
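For reference, the law quoted on this slide is usually written as below. The symbols are chosen here and do not appear on the slide: s is the fraction of the execution that is sequential and N is the number of processors.

```latex
\[
  \mathrm{Speedup}(N) = \frac{1}{s + \dfrac{1 - s}{N}},
  \qquad
  \lim_{N \to \infty} \mathrm{Speedup}(N) = \frac{1}{s}
\]
```

With s = 1%, for example, the speed-up can never exceed 100, however many cores are added.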
15 OK, parallel applications do not scale
• Our recent study on parallel application scaling:
§ Execution time modeled as a function of the input set and the processor number
• In general: bp > -1: sublinear scaling (bp: exponent of the processor number in the parallel part)
• Sometimes: bs > 0: sequential part increases (bs: exponent of the processor number in the sequential part)
16 But let us use a naive (overoptimistic) model
• A parallel application:
§ Parallel section: can use 1000 processors, with linear speed-up
§ Sequential section: runs on a single processor
• SEQ: constant fraction of sequential code
17 Complex cores against simple cores
• CC: 100 complex cores vs. SC: 1000 simple cores
• A complex core is 2x faster than a simple core
• If SEQ > 0.8% then CC > SC
18 And hybrid SC + CC?
• CC_SC:
§ 50 complex cores
§ 500 simple cores
• If SEQ > 0.2% then CC_SC > SC
19 And if...
• Use a huge amount of resources for a single core:
§ 10x the area of the complex core
§ 10x the power of the complex core
• Use all the uniprocessor techniques
§ Very wide issue (8-16?), ultimate frequency (« heat and run »), helper threads, value prediction
• Invent new techniques
• Ultra complex cores
20 DAL architecture proposition
• Heterogeneous architecture:
§ A few ultra complex cores, to enable performance on sequential codes and/or critical sections
§ A « sea » of simple cores, for parallel sections
21 For the naive model
• « DAL »: UC_SC, 5 ultra complex cores + 500 simple cores
• If SEQ > 0.13% then « DAL » > SC
• « DAL » always better than UC, CC_SC
22 Need for research on faster unicores
• Silicon area is a 2nd-order issue: can use the area of 10 complex cores
• Power/energy is a 2nd-order issue: can use the power of 10 complex cores
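The break-even SEQ values quoted for CC, CC_SC and « DAL » can be reproduced with a few lines of Python. This is only a sketch of the naive model, and two of its inputs are assumptions that the slides do not state: the parallel sections of the hybrid designs are taken to run on the 500 simple cores only, and an ultra complex core is taken to be 4x faster than a simple core. Both were picked because they give back the 0.2% and 0.13% figures. The names `exec_time`, `CONFIGS` and `break_even_seq` are illustrative.

```python
def exec_time(seq, seq_speed, par_throughput):
    """Normalized time; 1.0 is the whole program on one simple core."""
    return seq / seq_speed + (1.0 - seq) / par_throughput


# name -> (speed of the core running sequential code, parallel throughput),
# both expressed in units of one simple core.
CONFIGS = {
    "SC": (1.0, 1000.0),    # 1000 simple cores
    "CC": (2.0, 200.0),     # 100 complex cores, each 2x a simple core
    "CC_SC": (2.0, 500.0),  # ASSUMPTION: parallel code runs on the 500 simple cores only
    "DAL": (4.0, 500.0),    # ASSUMPTION: ultra complex core = 4x a simple core
}


def break_even_seq(name, baseline="SC"):
    """Sequential fraction above which `name` is faster than `baseline`."""
    s1, p1 = CONFIGS[name]
    s0, p0 = CONFIGS[baseline]
    # Solve seq/s1 + (1 - seq)/p1 == seq/s0 + (1 - seq)/p0 for seq.
    par_handicap = 1.0 / p1 - 1.0 / p0   # what `name` loses on parallel code
    seq_advantage = 1.0 / s0 - 1.0 / s1  # what `name` gains on sequential code
    return par_handicap / (par_handicap + seq_advantage)


if __name__ == "__main__":
    for name in ("CC", "CC_SC", "DAL"):
        print(f"{name}: beats SC when SEQ > {100 * break_even_seq(name):.2f}%")
```

Under these assumptions the thresholds come out at about 0.79%, 0.20% and 0.13%, in line with the slides. Changing the assumed speed of the ultra complex core shows how sensitive the « DAL » threshold is to it.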
23 Ongoing work: Revisiting Value Prediction with Arthur Pérais
24 Value prediction?
• Lipasti et al., Gabbay and Mendelson, 1996
• Basic idea:
§ Eliminate (some) true data dependencies through predicting instruction results
[Figure: dependence chain of instructions I0, I1, I3, I4, I5, with edges labeled +2, +3, +1, +3]
25 Value Prediction:
• Large body of research, 1996-2002
• Quite efficient:
§ Surprisingly high number of predictable instructions
• Not implemented so far:
§ High cost: is it still relevant now?
§ High penalty on misprediction: don’t lose all the benefit
26 Last Value Predictor
• Just predict the last produced value
§ Set-associative table
§ Use confidence counters
• Analogy with PC-based branch prediction
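A last value predictor fits in a few lines of Python. The sketch below is a behavioral model only: a dictionary keyed by the PC stands in for the set-associative table, and the confidence policy (a 3-bit counter that must be saturated before predicting and is reset on a wrong value) is an assumption. The class and method names are illustrative.

```python
class LastValuePredictor:
    """Behavioral sketch: predicts that an instruction repeats its last result."""

    def __init__(self, conf_max=7):
        self.table = {}           # pc -> [last_value, confidence]
        self.conf_max = conf_max  # 7 = saturation value of a 3-bit counter

    def predict(self, pc):
        """Return the predicted value, or None when confidence is too low."""
        entry = self.table.get(pc)
        if entry is not None and entry[1] == self.conf_max:
            return entry[0]
        return None

    def update(self, pc, actual):
        """Train with the value actually produced by the instruction."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0]
        elif entry[0] == actual:
            entry[1] = min(entry[1] + 1, self.conf_max)
        else:
            entry[0] = actual     # remember the new value
            entry[1] = 0          # and rebuild confidence from scratch
```

An instruction has to produce the same value eight times in a row (one allocation plus seven confirmations) before this model starts predicting it.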
27 Stride value predictor
• Prediction = last value + (last difference)
[Figure: table indexed by the PC, feeding an adder]
• Analogy with stride prefetcher, but also with loop predictor
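The stride scheme can be sketched the same way. As before this is a behavioral model under assumptions: a dictionary replaces the hardware table, and confidence is a 3-bit counter that is incremented when the stride repeats and reset when it changes. Names are illustrative.

```python
class StridePredictor:
    """Behavioral sketch: predicts last value + last observed difference."""

    def __init__(self, conf_max=7):
        self.table = {}           # pc -> [last_value, stride, confidence]
        self.conf_max = conf_max  # 7 = saturation value of a 3-bit counter

    def predict(self, pc):
        """Return last value + stride, or None when confidence is too low."""
        entry = self.table.get(pc)
        if entry is not None and entry[2] == self.conf_max:
            return entry[0] + entry[1]
        return None

    def update(self, pc, actual):
        """Train with the value actually produced by the instruction."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0, 0]
            return
        new_stride = actual - entry[0]
        if new_stride == entry[1]:
            entry[2] = min(entry[2] + 1, self.conf_max)
        else:
            entry[1] = new_stride  # learn the new stride
            entry[2] = 0           # and rebuild confidence from scratch
        entry[0] = actual
```

A stride of 0 makes this model behave like a last value predictor, which is why the two are often merged in practice.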
28 Finite Context Method predictors
• Use the history of the last values produced by the instruction
[Figure: table indexed by the PC]
• Analogy with local history branch predictor
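A minimal two-level model of the Finite Context Method is sketched below. It makes several simplifying assumptions: the second level is keyed by the exact (PC, history) pair instead of a hash of the history, tables are unbounded dictionaries, and confidence counters are left out. The `order` parameter (number of past values kept) and all names are illustrative.

```python
class FCMPredictor:
    """Behavioral sketch of an order-k Finite Context Method predictor."""

    def __init__(self, order=2):
        self.order = order
        self.histories = {}  # level 1: pc -> tuple of the last `order` values
        self.values = {}     # level 2: (pc, history) -> value that followed it

    def predict(self, pc):
        """Return the value that last followed the current local history."""
        hist = self.histories.get(pc)
        if hist is None or len(hist) < self.order:
            return None
        return self.values.get((pc, hist))

    def update(self, pc, actual):
        """Record which value followed the history, then shift the history."""
        hist = self.histories.get(pc, ())
        if len(hist) == self.order:
            self.values[(pc, hist)] = actual
        self.histories[pc] = (hist + (actual,))[-self.order:]
```

Unlike the stride scheme, this model captures repeating patterns such as 1, 2, 3, 1, 2, 3, but every prediction depends on the most recent values of the same instruction.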
29 And global value history
• Just no sense!
§ Need the history of the last instructions
§ Too late!!
• But global branch history!?!
§ ITTAGE is the state-of-the-art indirect branch predictor!!
§ And it predicts values!
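30 ITTAGE VTAGE
[Figure: a tagless base predictor indexed by the pc, plus tagged components indexed by the pc and global history slices h[0:L1], h[0:L2], h[0:L3]; each tagged component has a tag comparator (=?) producing a 1-bit hit signal and a 32-bit prediction]
• Longest matching component provides the prediction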
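The selection rule "longest matching component provides the prediction" can be illustrated with a heavily simplified model. Everything below is an assumption made for readability rather than the actual VTAGE design: the exact (pc, history slice) pair is used as the key instead of a hashed index plus a partial tag, tables are unbounded, the history lengths 2, 4 and 8 are arbitrary, there are no confidence or usefulness counters, and on a misprediction one entry is allocated in the next longer component.

```python
class VTAGESketch:
    """Toy model: tagless base table + tagged tables keyed by (pc, branch history)."""

    def __init__(self, history_lengths=(2, 4, 8)):
        self.history_lengths = history_lengths
        self.base = {}                               # tagless: pc -> value
        self.tagged = [{} for _ in history_lengths]  # (pc, history slice) -> value
        self.ghist = ()                              # global branch history, newest last

    def record_branch(self, taken):
        """Shift one branch outcome into the global history."""
        keep = max(self.history_lengths)
        self.ghist = (self.ghist + (bool(taken),))[-keep:]

    def _key(self, pc, length):
        if len(self.ghist) < length:
            return None
        return (pc, self.ghist[-length:])

    def _provider(self, pc):
        """Longest-history component with a matching entry, else the base table."""
        for i in reversed(range(len(self.history_lengths))):
            key = self._key(pc, self.history_lengths[i])
            if key is not None and key in self.tagged[i]:
                return i, key
        return -1, pc

    def predict(self, pc):
        i, key = self._provider(pc)
        table = self.base if i < 0 else self.tagged[i]
        return table.get(key)

    def update(self, pc, actual):
        i, key = self._provider(pc)
        table = self.base if i < 0 else self.tagged[i]
        mispredicted = table.get(key) != actual
        table[key] = actual
        # On a misprediction, try a longer history to disambiguate next time.
        if mispredicted and i + 1 < len(self.tagged):
            new_key = self._key(pc, self.history_lengths[i + 1])
            if new_key is not None:
                self.tagged[i + 1][new_key] = actual


def train_alternating(predictor, pc, rounds=20):
    """The value at `pc` is 100 after a taken branch and 200 otherwise."""
    for i in range(rounds):
        taken = (i % 2 == 0)
        predictor.record_branch(taken)
        predictor.update(pc, 100 if taken else 200)
```

In `train_alternating` the value never repeats from one instance to the next, so a last value predictor would always be wrong, while the branch history is enough to tell the two cases apart. No previous value of the instruction is needed at prediction time.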
31 The repair issue on misprediction
[Figure: dependence chain of instructions I0, I1, I3, I4, I5 with a value misprediction marked]
32 Pipeline squash
[Figure: instructions I0, I1, I3, I4, I5]
• Handled like an exception or a branch misprediction
• Very high penalty
33 Selective replay
[Figure: instructions I0, I1, I3, I4, I5]
• Cancel all dependent instructions, but save the others
• Very complex to implement:
§ Unlimited dependence chains
34 Critical path
• Predicted value needed late in the pipeline:
§ Dispatch time is sufficient
• Except that:
35 An FCM implementation issue
[Figure: predictor indexed by the PC, with a speculative window]
• Must take the last local values
• Might be a critical path
36 Critical path on the stride value predictor
[Figure: table indexed by the PC, adder, speculative window]
• Can be reused on the next cycle
• Stride AND speculative last value must be high confidence
37 Experiments
• 8-way superscalar, deep pipeline
• Use prediction only on high confidence
§ 3-bit counters: predict only when saturated, reset on misprediction
38 Squashing
[Figure: chart (y-axis from 0.8 to 1.4) comparing LVP, Stride, FCM and VTAGE on 19 benchmarks, from 164.gzip to 470.lbm]
39 Selective replay
[Figure: chart (y-axis from 0.9 to 1.4) comparing LVP, Stride, FCM and VTAGE on the same benchmarks, from 164.gzip to 470.lbm]
40 High confidence through probabilistic counters
• Need for very high confidence:
§ 95% accuracy insufficient
§ >> 99% needed
• TRADING ACCURACY AGAINST COVERAGE
• Saturation with only very low probability
§ 1/32, 1/256
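One way to obtain "saturation with only very low probability" is to let a small counter move forward only on a random draw. The sketch below is a guess at the mechanism, not the exact scheme evaluated in the talk: it applies a single probability to every increment, resets on any misprediction, and takes an injectable random generator so that runs can be made repeatable. All names are illustrative.

```python
import random


class ProbabilisticCounter:
    """Confidence counter that advances only with a small probability."""

    def __init__(self, bits=3, inc_probability=1 / 32, rng=None):
        self.max = (1 << bits) - 1  # saturation value, 7 for 3 bits
        self.value = 0
        self.inc_probability = inc_probability
        self.rng = rng if rng is not None else random.Random()

    def is_confident(self):
        """Predictions are used only once the counter is saturated."""
        return self.value == self.max

    def update(self, correct):
        """Train with the outcome of the latest (possibly unused) prediction."""
        if not correct:
            self.value = 0          # any wrong value resets confidence
        elif self.value < self.max and self.rng.random() < self.inc_probability:
            self.value += 1         # a correct value advances only sometimes
```

With a probability of 1/32 on each of the 7 steps, an entry needs about 7 × 32 = 224 consecutive correct values on average before it is used, so a 3-bit counter behaves like a much wider one. That is how accuracy is traded against coverage: instructions that are only mostly predictable rarely reach saturation.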
41 Squashing
[Figure: chart (y-axis from 0.9 to 1.4) comparing LVP, Stride, FCM and VTAGE on the benchmarks from 164.gzip to 470.lbm]
42 And hybrids
[Figure: chart (y-axis from 0.9 to 1.5) comparing Stride, VTAGE-Stride and 3c-Hybrid on the benchmarks from 164.gzip to 470.lbm]
43 Current status
• All value predictors amenable to very high confidence
§ No complex selective repair needed
• No need for local value prediction
§ No complex critical path in the local value predictor
44 Ongoing work: Selective Prediction of Predicated Instructions with Nathanael Prémillieu
45 Who cares about predicated instructions?
• CMOV in all ISAs
• ARM, Itanium:
§ All instructions are predicated
• Out-of-order execution: just a nightmare
46 The multiple definition problem
Before renaming:
I1: (p) R1 ← R2, R3
I2: R4 ← R1, R2
Mapping table: R1 → P11, R2 → P15, R3 → P22
After renaming:
I1: (p) P1 ← P15, P22
I2: P13 ← ???, P15
Mapping table: R1 → P1 || P11, R2 → P15, R3 → P22, R4 → P13
47 Expansion/Serialization
After renaming:
I1a: P1 ← P15, P22
I1b: P27 ← (p) ? P1 : P11
I2: P13 ← P27, P15
Mapping table: R1 → P27, R2 → P15, R3 → P22, R4 → P13
• Create an extra instruction
• Force I1b → I2 dependency
48 Aggressive serialization
I1: P18 ← (p) ? (op P15, P22) : P23
I2: P13 ← P18, P15
Mapping table: R1 → P18, R2 → P15, R3 → P22, R4 → P13
• No expansion, but an extra operand on I1:
§ Complexity on register file, issue logic, bypass network
• Force I1 → I2 dependency
49 Predicting the predicates
• Use branch history or branch+predicate history to predict the predicates
§ Eliminates multiple definitions
§ Predicate mispredictions become branch mispredictions
50 Not that convincing!
[Figure: chart (y-axis from -20 to 10) comparing predicate prediction with branch+predicate history (Br & Pred) and with branch history only (Branch), on SPEC CPU2006 benchmark inputs from 400.perlbench.checkspam to 483.xalancbmk.ref]
51
• Filter the predicate predictions
• Replay the mispredicted predicates at rename time
52
[Figure: chart (y-axis from -4 to 10) comparing NonAggressive, NA SPREPI and Ag. SPREPI on SPEC CPU2006 benchmark inputs from 400.perlbench.checkspam to 483.xalancbmk.ref]
53
• Predicate prediction + filtering allows:
§ Better performance
§ Without aggressive out-of-order implementation
• Current compilers are « shy » on predication usage: might be worth reconsidering
54 Conclusion
• Faster cores are needed:
§ Amdahl’s law, uniprocessor workloads
• Silicon, power, etc. are available:
§ Just grab the resources from the rest of the system
• Do research as if (area, power) were not a constraint:
§ Then, take the constraints into account (or somebody else will manage to do it)