  • Number of slides: 50

Status of the LLTT (TPU) project
G. Punzi (Pisa), on behalf of the LLTT group
Meeting with INFN referees, 20/3/2014

Brief recap
- Feb. 2013: LHCb Trigger workshop: first LHCb presentation
- June 2013: Feasibility study presented @LLT workshop
- Received an official list of questions from management
- 4 Oct 2013: LLT workshop: answers to questions + simulation of a baseline system
- Asked to be reviewed to become a project for the upgrade
- LHCb outlook for the online evolves considerably
- December 2013: Presentation to LHCb week
- 1-2 Feb. 2014: Presentations to LHCb Trigger Workshop
- 1 Mar 2014: Talk at international instrumentation conference (INSTR-14)
- 10 Mar 2014: Internal note presented for review
- 18 Mar 2014: Presentation to Technical Board meeting
- TODAY: presentation to INFN referees
- 31 Mar 2014: Presentation to LHCb external review committee

Effects of changes in readout design
- DAQ structure evolved to bi-directional EB
- Baseline readout has moved FPGA cards to EB
- LLT hardware shrunk; most (or all) functions moved into EB CPUs (“software LLT”)
- BIG strategy choice: push investments upward
- But HLT wants to keep a “safety net” (LLT)
- LLTT more substantial hardware: → Track Processing Unit (TPU), connected to EB
- Regarded by trigger group mostly as HLT co-processor, or pre-processor
- Raises the bar considerably!
- Can still fuel a software-LLTT functionality
- Safety net as well...

Hardware Status

Architecture
- Tracking layers (separate trigger-DAQ path)
- Switching network: custom switching network delivers hits to appropriate cells
- Data organized by cell coordinates
- Cellular engines: blocks of cellular processors
- Fitter → to DAQ
- Track finding and parameter determination

Basic Principles
- For each cellular unit in the parameter space (u, v), calculate a weighted response summing over all hits and all layers.
- Tracks are peaking structures in the parameter space.
- Find a track as a cluster of excited cells.
Trigger&Tracking Workshop, M. J. Morello
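The response calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian weighting function, the `sigma` parameter, the dictionary layout of the cells and the name `retina_response` are all hypothetical and are not taken from the firmware or from the C++ simulation.

```python
import math

def retina_response(cells, hits, sigma):
    """Return one weighted response per cell.

    cells: list of dicts mapping layer index -> (x, y), the point where the
           cell's base track crosses that layer (hypothetical layout).
    hits:  list of (layer, x, y) tuples.
    sigma: width of the ASSUMED Gaussian weighting function.
    """
    responses = []
    for receptors in cells:
        total = 0.0
        for layer, x, y in hits:
            if layer not in receptors:
                continue  # this cell has no receptor on that layer
            rx, ry = receptors[layer]
            dist_sq = (x - rx) ** 2 + (y - ry) ** 2
            # Hits close to the receptor contribute ~1, far hits ~0.
            total += math.exp(-dist_sq / (2.0 * sigma ** 2))
        responses.append(total)
    return responses

# Toy example: two cells on three layers; the hits lie on cell 0's base track.
cells = [
    {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0)},
    {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (2.0, 2.0)},
]
hits = [(0, 0.0, 0.0), (1, 1.0, 0.0), (2, 2.0, 0.0)]
responses = retina_response(cells, hits, sigma=0.5)
```

In this toy case the cell whose base track matches the hits collects the largest response, which is the "peaking structure" a later clustering stage would look for.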

Hit delivery by the switching logic
- Hits must be delivered only to the cells that need them (there can be more than one)
- The switch network “knows” where to deliver hits
- All information about the network of connections is embedded in the network via distributed LUTs
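A software analogue of LUT-based delivery, as a rough sketch only: it uses a one-dimensional geometry, a single flat table instead of distributed ones, and hypothetical names (`build_lut`, `route_hit`, `reach`, `bin_width`) that do not come from the actual switch firmware.

```python
import math
from collections import defaultdict

def build_lut(receptors, reach, bin_width):
    """Precompute which cells must receive hits falling in each (layer, bin).

    receptors: {cell_id: {layer: x_receptor}} (hypothetical 1-D geometry).
    reach:     a cell is served by every bin overlapping x_receptor +/- reach.
    """
    lut = defaultdict(list)
    for cell_id, per_layer in receptors.items():
        for layer, x_r in per_layer.items():
            first = math.floor((x_r - reach) / bin_width)
            last = math.floor((x_r + reach) / bin_width)
            for b in range(first, last + 1):
                lut[(layer, b)].append(cell_id)
    return lut

def route_hit(lut, layer, x, bin_width):
    """Return the cells that need this hit; a single table lookup, no search."""
    return lut.get((layer, math.floor(x / bin_width)), [])

# Toy geometry: three cells with one receptor each on layer 0.
receptors = {0: {0: 1.0}, 1: {0: 1.6}, 2: {0: 5.0}}
lut = build_lut(receptors, reach=0.5, bin_width=1.0)

shared = route_hit(lut, layer=0, x=1.2, bin_width=1.0)   # delivered to two cells
nobody = route_hit(lut, layer=0, x=3.2, bin_width=1.0)   # no cell needs it
single = route_hit(lut, layer=0, x=5.1, bin_width=1.0)   # delivered to one cell
```

The point of the sketch is that all geometry knowledge is spent once, when the table is built; routing a hit afterwards costs one lookup.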

Cellular engine
- Performs calculation of weights for a hit into a cell
- Deals with surrounding cells as well
- Handles time-skew between events
- In second stage performs local clustering in parallel, and queues results to output

Track parameter estimation by cluster Center-of-Mass
[Block diagram: ENGINE outputs → MUX (REQ/ACK/DATA) → CoM UNIT with registers and pipelined divider]
- Due to data reduction out of the engine, a 1:12 ratio is sufficient to keep up with the data flow
- Final parameter determination can be done on EB CPUs to achieve full “offline-compliance”
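The center-of-mass step can be illustrated as follows. This is a minimal sketch, assuming a 3×3 neighbourhood around each local maximum and a simple threshold; the function name `find_tracks` and the grid layout are hypothetical, and the result is expressed in cell-index units rather than physical track parameters.

```python
def find_tracks(grid, threshold):
    """Locate local maxima above threshold and return their 3x3 centers of mass.

    grid[i][j] is the response of the cell at index (i, j) in the (u, v) plane.
    Border cells are skipped so that every candidate has a full neighbourhood.
    """
    n_u, n_v = len(grid), len(grid[0])
    tracks = []
    for i in range(1, n_u - 1):
        for j in range(1, n_v - 1):
            center = grid[i][j]
            if center < threshold:
                continue
            block = [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            # Keep the cell only if no cell in the block is strictly larger
            # (ties between equal neighbours are both reported in this sketch).
            if any(grid[a][b] > center for a, b in block):
                continue
            total = sum(grid[a][b] for a, b in block)
            u = sum(a * grid[a][b] for a, b in block) / total
            v = sum(b * grid[a][b] for a, b in block) / total
            tracks.append((u, v))
    return tracks

# Toy response map with one cluster, slightly skewed towards larger v.
grid = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.5, 0.0],
    [0.0, 1.0, 3.0, 2.0, 0.0],
    [0.0, 0.5, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
tracks = find_tracks(grid, threshold=2.5)
```

The weighted average interpolates between cells, so the estimate can be finer than the cell pitch; here the skewed cluster pulls the v estimate off the central cell index.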

Studies of Stratix-V capacity
- TPU completely implemented in firmware
- All main components (Switch, Engines, CoM) implemented in VHDL and placed in FPGA
- Fit ~750 engines/chip on Stratix-V
  • exact number depends on details (time-ordering of pixel data, etc.)
- Arria 10 allows double the logic at the same price, with lower power consumption

Simulation and Timing
[Simulation waveform; signals: Ready, CELL, Hit_data (LAY, X, Y), address, Intersect_x/_y, d_x, d_y, d_x^2, d_y^2, sum_sq, weight]
- Exceeds 350 MHz clock frequency
- 40 MHz throughput
- Total latency < 1 µs!
- Much better than AM: originally intended as “Low Level Track Trigger”
- (Not accounting for I/O)

Tracking Configuration
VELO+UT, 16+2 layers
- Split into two separate telescopes for ease of cabling
- Covers longable tracks

Layer configurations

Layer configuration acceptances

Integration in the DAQ
[Diagram: TPU with links to the Event Builder]
- ATC40 scheme (not the baseline anymore)
- Shows the map of connections
- Additional optical links needed to copy data to the TPU cards

TPU integrated in the EB
TPU appears to the EB as an additional “virtual detector” producing tracks

Data flow inside EB
[Diagram labels: Tracks in post-EB; Tracks in pre-EB; Small flows in TPU boxes]
TPU behaves as a virtual “track detector”
- Local CPUs can be used to refine FPGA output
- Availability of TRACKS in the Event Builder
  – Can control rate by confirming LLT muon (hadron) with stiff track
  – In the “partial reconstruction” scheme, could have HLT1 inside EB

Lab test with TEL62
- We have a plan for testing the retina algorithm with real FPGAs (Stratix-3), in a simplified configuration.
- This is lower-speed, but helps us demonstrate that we can put together and operate a complete system.
- We exploit TEL62 boards, which are compatible with the current LHCb DAQ and can be easily inserted in the system (agreement with local DAQ experts).
- TEL62 boards have been ordered together with the NA62 order, and will arrive soon (~month).
- Pisa has lab space for a bench test. TEL62 is used for both sequence generation and “retina” implementation.
- Work ongoing in Pisa on connection boards and logistics.

Preparations for test on silicon telescope @Milano
- Second stage of testing planned with CR in a Si telescope being built in Milano
- Details in UT talk yesterday.

Document presented to LHCb Technical Board, 10/3/2014

Performance parameters

Simulation
• At least 3 VELO hits in the last 8 VELO stations
• At least 1 hit in each axial UT layer
Fiducial region: |u| < 0.35, |v| < 0.35 (about theta < 50 mrad); |z| < 15 cm; electron rejection.
Some details on the LHCb simulation used:
- Ebeam = 7 TeV
- nu = 7.6 (L = 2 x 10^33) and nu = 11.4 (L = 3 x 10^33), bunch crossing: 25 ns, with spillover
- Geometry: DDDB: dddb-20131025, CONDDB: sim-20130830-vc-md10
- VeloUT offline reconstruction: Brunel v44r9 with default settings
Performance on small-angle telescope with 8 VELO + 2 UT.

Mapping of detector to receptor cell array
- Intersection of “base tracks” with detectors gives a map of “nerve endings”. This encodes the information about the geometry.
- Every hit on the detector produces a signal on nearby receptors, depending on distance. (I skip several subtleties; for instance, effective operation requires the distribution to be non-uniform.)
- Not unlike the distribution of photoreceptors in the visual system, but it is all virtual in our case, that is, implemented in the internal LUTs of the system.

Simulated full LHCb events (µ=7.6)
[Event display; legend: Generated, Out of acceptance]
- Used ~45,000 cell engines
- C++ code, can be inserted in standard analysis code

Efficiency/Uniformity: p, pT

Efficiency/Uniformity: z, IP

Efficiency/Uniformity: (u, v) ~ (θx, θy)

Momentum resolution
[Two panels: σk = 0.0102 and σk = 0.0126]

Physics performance and robustness

COSTING

Detailed cost estimate from the online group
- Estimated at current prices: 940 kCHF
- Does not account for savings from moving to Arria-10
- Assumes using identical boxes to the EB for simplicity
  – Some further savings still possible

Is the TPU cost effective, TODAY?
- Various estimates:
  – (16/26)*2.3 + 1.5 = 2.9 ms (%GEC)
  – 60% * 3.8 = 2.3 ms
  – 3.8 ms - (VELO 10) = 2.4 ms
- Cost of (naked) CPU: ~120 SWF/core → 1 ms @ 40 MHz = 4.8 MCHF
- TPU equivalent: 10 ÷ 15 MCHF
- Timing of the piece of code yielding the performance we have been comparing to: 3.8 ms (standalone CPU-2012)
- It was later understood that this piece of code performs further extra work (backward VELO layers)
- We have no piece of code doing exactly the TPU work on the sample, with the same performance
TPU clearly cost-effective solution at present time
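The arithmetic behind these figures can be reproduced with a short script. It is a sketch under assumptions: "SWF" is read here as Swiss francs (CHF), one core is assumed to process one event at a time, and the third estimate is entered as its quoted result because the subtracted VELO term is not spelled out on the slide. The 940 kCHF TPU figure is the online group's estimate quoted in the costing slide.

```python
CORE_COST_CHF = 120.0    # "~120 SWF/core", read as Swiss francs (assumption)
EVENT_RATE_HZ = 40e6     # 40 MHz event rate
TPU_COST_CHF = 940e3     # online-group estimate at current prices

def cpu_farm_cost(ms_per_event):
    """Cost of enough cores to sustain the event rate at a given CPU time per event."""
    cores = EVENT_RATE_HZ * ms_per_event * 1e-3   # one event per core at a time
    return cores * CORE_COST_CHF

estimates_ms = {
    "(16/26)*2.3 + 1.5": (16 / 26) * 2.3 + 1.5,
    "60% * 3.8": 0.60 * 3.8,
    "3.8 - VELO term": 2.4,   # quoted result only; the term itself is not given
}

cost_per_ms = cpu_farm_cost(1.0)
costs = {name: cpu_farm_cost(ms) for name, ms in estimates_ms.items()}
ratios = {name: cost / TPU_COST_CHF for name, cost in costs.items()}

for name in estimates_ms:
    print(f"{name}: {estimates_ms[name]:.1f} ms, "
          f"{costs[name] / 1e6:.1f} MCHF, {ratios[name]:.0f}x the TPU estimate")
```

With these inputs, 1 ms per event corresponds to 40,000 cores, and all three timing estimates land inside the quoted 10 to 15 MCHF range, roughly an order of magnitude above the TPU estimate.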

Projections to 2020
“It is always difficult to make predictions, especially about the future” – Yogi Berra

Timings presented by HLT group

How the HLT group plans to use the TPU
Does not include:
- Multicore inefficiency
- Data/MC effects

CPU cost projections from online group

Assumptions behind the 8 ms
- TPU price 2020 = 2014
- CPU price drop 16x
- No inefficiency factor for 400 jobs/node
- Additional 2x to CPU for other uses
- Full cost of TPU vs. “scalable” cost of CPU

Conclusions of TB (18/3/2014)

Summary
• We designed a system capable of track reconstruction at 40 MHz with offline-like performance and ~1 µs latency.
• The cost of the TPU is an order of magnitude smaller than today's CPU solutions.
• Projections to the upgrade era made by the HLT group and online group predict that the CPU solution will become more convenient, based on some assumptions.
• The TB recommended a CPU-only solution as baseline for TDR.

People
Many thanks to all people who contributed to the development of this design:
• A. Abba (MI)
• F. Bedeschi (PI)
• F. Caponio (MI)
• M. Citterio (MI)
• D. Corbino (CERN)
• A. Cusimano (MI)
• A. Geraci (MI)
• S. Leo (PI)
• F. Lionetto (PI)
• P. Marino (PI)
• M. J. Morello (PI)
• N. Neri (MI)
• A. Piucci (PI)
• G. Punzi (PI)
• L. Ristori (PI)
• F. Ruffini (PI)
• F. Spinella (PI)
• S. Stracka (PI)
• D. Tonelli (CERN)
• J. Walsh (PI)

BACKUP

Basic principle
- We inject real hits (xr, yr)k in the detector layers.
- For each cellular unit i in the parameter space (u, v), calculate the response Ri summing over all hits and all layers.
- Tracks are peaking structures in the parameter space.
- Find a track as a cluster of excited cells.
3/17/14, Trigger&Tracking Workshop, M. J. Morello
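One common way to write such a response is with a Gaussian weight. The slide does not give the weighting function, so the Gaussian form, the width $\sigma$, the layer index $l$ and the receptor coordinates $(x_{il}, y_{il})$ below are assumptions made for illustration:

```latex
R_i = \sum_{l} \sum_{k} \exp\!\left( -\frac{s_{ilk}^{2}}{2\sigma^{2}} \right),
\qquad
s_{ilk}^{2} = \left( x_r^{(k)} - x_{il} \right)^{2} + \left( y_r^{(k)} - y_{il} \right)^{2}
```

Here $(x_r^{(k)}, y_r^{(k)})$ is the $k$-th real hit on layer $l$, and $(x_{il}, y_{il})$ is the point where the base track of cell $i$ crosses that layer, so a hit lying exactly on the base track contributes 1 and distant hits contribute almost nothing.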

Tracking Efficiency
- Reconstructed “offline” VELO+UT tracks using official LHCb-MC Bs→φφ with mu = 7.6.
- Require on offline reconstructed tracks: p > 3 GeV/c, pT > 500 MeV/c, and a geometrical acceptance (retina acceptance) 20 < theta < 60 mrad.
- Found that ~95% of offline tracks have a compatible match within the geometric acceptance of our track processor.
- All VELO and UT hits sent to the LLTT without any requirements.

Simulated full LHCb events (µ=7.6)
[Event display, full LHCb-MC; legend: Generated, Out of acceptance]
NB: it can be simulated with 100% accuracy; C++ code available to users

Software simulation
4/10/13: Benchmark study on 6 VELOPIX layers + 2 UT planes
- Used 36,000 cell units in (r, φ) parameters: polar coordinates on the virtual plane of track intersection.
- Mapping using LHCb-MC ParticleGun.
- Tracks from official production Bs→φφ LHCb-MC: L = 2 × 10^33 cm^-2 s^-1, sqrt(s) = 14 TeV, mu = 7.6
- DDDBtag = “dddb-20130408”, CondDBtag = “simcond-20121001-vc-md100”
- No kinematic cuts applied. No requirement on hits.

Layer configurations

Layer configuration acceptances