
Parallel Computer Architecture Part I (Part 2 on March 18)

Latency
• Defn: Latency is the time it takes one message to travel from source to destination. Includes various overheads.
• E.g. taking the T (or the SMRT for the Singaporeans):
  – Time to walk out the door to the station (overhead)
  – Time to wait for the train (overhead)
  – Time on the train (productive communication)
  – Time to walk to the destination (overhead)

Bandwidth
• Maximum rate at which the network can propagate information
• E.g. (speed of train) * (number of passengers)

Important
1. Bandwidth is easy to measure and easy to hype. It is an idealized peak communication rate, most likely to be achieved for long messages under conditions of no traffic contention.
   • (Long train ride, no delays.)
   • Ethernet = 100 Mbps
2. Latency varies with applications, conditions, etc.
   • (Train delays, rush hour, the weather, …)
3. Latency = how long it takes you to get to work. Bandwidth = how many miles you can ideally move people per unit time.
4. Latency & bandwidth are related but not the same. Latency includes more realities!

Latency: the details
1. Latency = Sender Overhead + Time of Flight + Transmission Time + Receiver Overhead
2. Sender Overhead: time for the processor to inject the message into the network; the processor is tied up and kept from useful work while it is sending.
3. Time of Flight: time for the start of the message to arrive at the receiver.
4. Transmission Time: time for the remainder of the message to arrive at the receiver = (# bits) / bandwidth
5. Receiver Overhead: time for the receiver to pull the message in (usually larger than the sender overhead)
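That decomposition is easy to turn into arithmetic. Here is a minimal Python sketch; the overhead and flight-time numbers are made-up illustrative values, not measurements from any machine:

```python
def message_latency(bits, bandwidth_bps, sender_overhead_s,
                    time_of_flight_s, receiver_overhead_s):
    """Latency = sender overhead + time of flight
               + transmission time + receiver overhead."""
    transmission_time_s = bits / bandwidth_bps
    return (sender_overhead_s + time_of_flight_s
            + transmission_time_s + receiver_overhead_s)

# Illustrative values only: a 1 KB message on a 100 Mbps link.
print(message_latency(bits=8 * 1024, bandwidth_bps=100e6,
                      sender_overhead_s=5e-6, time_of_flight_s=50e-6,
                      receiver_overhead_s=10e-6))
```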

Latency vs Bandwidth on the internet
Experiment: sending packets round trip from MIT to Singapore
• Specifications
  • Time: 5 am EST this morning (Feb 20) = 6 pm in Singapore
  • Source: pythagoras.mit.edu (ip address = 18.87.0.29)
  • Dest: sunsvr.comp.nus.edu.sg (ip address = 137.132.88.6)
  • Method: ping -s 137.132.88.6 num_bytes 5 (5 = num_trials)
• Data:
  bytes: 8   100 500 1000 2000 3000 4000 5000 6000 7000
  msec:  276 263 264 270  404  645  690  777  868  923
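One rough way to read this data (my own analysis, not from the slide): fit a least-squares line to (bytes, msec); the intercept estimates the fixed round-trip latency and the slope gives an effective bandwidth. A Python sketch, keeping in mind that large ping payloads fragment, so treat the outputs as indicative only:

```python
# Ping data from the slide: round-trip time vs. payload size.
bytes_sent = [8, 100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000]
rtt_msec   = [276, 263, 264, 270, 404, 645, 690, 777, 868, 923]

n = len(bytes_sent)
mean_x = sum(bytes_sent) / n
mean_y = sum(rtt_msec) / n
slope = (sum((x - mean_x) * (y - mean_y)
             for x, y in zip(bytes_sent, rtt_msec))
         / sum((x - mean_x) ** 2 for x in bytes_sent))  # msec per byte
intercept = mean_y - slope * mean_x                     # fixed RTT, msec

# Each payload byte crosses the network twice (out and back).
bandwidth_bytes_per_sec = 2 / (slope / 1000)
print(f"fixed round-trip latency ~ {intercept:.0f} msec")
print(f"effective bandwidth ~ {bandwidth_bytes_per_sec / 1e3:.0f} KB/s")
```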

Congestion?
(figure: latency)

Congestion (Traffic!)
Latency and bandwidth are not the whole story; congestion can also slow down communication!

Node Architecture
1. CPU, e.g. Athlon MP 1800
   • Registers: O(400 B), speed: O(1 nsec)
2. Cache (on chip)
   • L1: 64 KB Instruction + 64 KB Data
   • L2: 256 KB Data, O(10 nsec)
   [Memory Bus]
3. Memory: e.g. 1 GB PC2100 DDR ECC, O(100 nsec)
   (Double Data Rate / Error-Correcting Code)
   [I/O Bus]
4. Disk: IBM 120GXP 40 GB, O(5 msec)
Warning: all O's are just guesses made in the early morning

Node Architecture
1. CPU: O(400 B), O(1 nsec)
2. Cache
   • L1: 128 KB
   • L2: 256 KB, O(10 nsec)
   [Memory Bus]
3. Memory: 1 GB, O(100 nsec)
   [I/O Bus]
4. Disk: 40 GB, O(5 msec)
Warning: all O's are just guesses made in the early morning

Bus as an interconnect network
• One transaction at a time between source and destination
• The "P"'s and "M"'s could be cache and memory, disk and memory, etc.

Where does the network meet the processor? (Distance to processor)
1. I/O Bus • Most clusters
2. Memory Bus • Many MPPs
3. Processor Registers • RAW architecture at MIT

ASCI White at Lawrence Livermore (currently #1)
• IBM SP system: SP = Scalable POWERparallel
• 512 "high nodes" ({thin, wide, high} = physical size)
• 1 high node = 16-processor SMP connected by 1 SP switch
• Weight = 106 tons = roughly 100 Honda Civics!
• Processor = RS/6000 POWER3; RS = RISC System, RISC = "Reduced Instruction Set Computer"
• An SP switch connects to 16 nodes and up to 16 other switches

IBM SP Switch (up to 16 P's)

IBM SP Switch (up to 16 P's)
(figure labels: 16 P's, Other Nodes)

IBM SP Switch (up to 16 P's)
8 switch chips per switch

Cross Bar
I'm guessing that every switch chip is a crossbar; someone check please, maybe not???

Multiple Data Flows on an SP Switch
Four "on-chip" flows stay on chip; off-chip pairs take three hops, but are not typically much slower.

Connecting two 16-way SMPs = 32 processors

Connecting three 16-way SMPs = 48 processors
Note: every pair of boards shares 8 wires.

Connecting five 16-way SMPs = 80 processors
Every pair of SP chips not already on the same board is connected. There are 16 off-board SP chips. Total of 40 wires. (This is a guess, but the only logical one, I think.)
Always 4 parallel paths between any two processors!
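The wire counts on these slides follow from one rule: each switch board has 16 off-board links (matching "up to 16 other switches" on the ASCI White slide), split evenly among the other boards. A small sketch of that arithmetic; this is my reading of the slides, not an IBM spec:

```python
def sp_wiring(num_boards, off_board_links_per_board=16):
    """Split each board's off-board links evenly among the other boards."""
    wires_per_pair = off_board_links_per_board // (num_boards - 1)
    total_wires = num_boards * off_board_links_per_board // 2
    return wires_per_pair, total_wires

print(sp_wiring(3))  # (8, 24): every pair of boards shares 8 wires
print(sp_wiring(5))  # (4, 40): 4 parallel paths, 40 wires total
```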

Connecting eight 16-way SMPs = 128 processors
Intermediate switch boards (ISBs) are needed for 81-128 nodes. Each chip is connected to one ISB!

Guess at doubling: what must be done?

Parallel Computer Architecture Part II

Caches: a means of solving the memory-to-processor latency problem
• Memory just can't get data out the door fast enough for the processor to handle
• Not all the data is needed all the time
• The data that is needed can often be predicted!
• Solution: faster (= more expensive) but smaller memory in between

Caches: write-through vs write-back
• In a write-through cache, when data is written to the cache, it is also written to memory.
• In a write-back cache, when data is written to the cache it is marked "dirty" in the cache but is not written to memory. If the cache later needs the space, the dirty data is then written back to memory.
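A toy sketch of the two policies, using a hypothetical one-entry cache just to make the difference concrete:

```python
class ToyCache:
    """One-entry cache illustrating write-through vs. write-back."""
    def __init__(self, memory, write_back):
        self.memory = memory          # backing store: {address: value}
        self.write_back = write_back
        self.addr = None              # address of the cached item
        self.value = None
        self.dirty = False

    def write(self, addr, value):
        if self.addr != addr:
            self.evict()              # make room for the new item
            self.addr = addr
        self.value = value
        if self.write_back:
            self.dirty = True         # memory update is deferred
        else:
            self.memory[addr] = value # write-through: memory updated now

    def evict(self):
        if self.write_back and self.dirty:
            self.memory[self.addr] = self.value  # flush the dirty item
        self.dirty = False

mem = {}
cache = ToyCache(mem, write_back=True)
cache.write(0x10, 42)
print(mem)      # {} : write-back has not touched memory yet
cache.evict()
print(mem)      # {16: 42} : memory updated only on eviction
```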

Cache Coherence Error Due to False Sharing
• Memory holds a single line containing both foo.A and foo.B.
• Proc A and Proc B both get the same line; A only needs foo.A, B only needs foo.B.
• A writes HELLO into its half: A's cached line now reads HELLO | foo.B.
• B writes WORLD into its half: B's cached line now reads foo.A | WORLD.
• B writes its cached line back to memory: memory reads foo.A | WORLD.
• A writes its cached line back to memory: memory reads HELLO | foo.B, and B's WORLD is lost!
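A sketch simulating that sequence with whole-line write-backs; it is a toy model with no coherence protocol, which is exactly the bug being illustrated:

```python
# One memory line holds two variables that are never both needed
# by the same processor.
memory_line = {"foo_A": "foo.A", "foo_B": "foo.B"}

cache_A = dict(memory_line)   # Proc A caches the whole line
cache_B = dict(memory_line)   # Proc B caches the same line

cache_A["foo_A"] = "HELLO"    # A only touches foo.A
cache_B["foo_B"] = "WORLD"    # B only touches foo.B

memory_line = dict(cache_B)   # B writes its line back to memory
memory_line = dict(cache_A)   # A writes its line back last...
print(memory_line)            # {'foo_A': 'HELLO', 'foo_B': 'foo.B'}
                              # B's WORLD has been lost: false sharing
```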

Evolution of Shared Memory
(diagram: processors P, each with a cache C, attached to main memory)
Processors' use of caches has freed main memory from many of its burdens, so other caches can be attached.

Multiprocessor Cache Coherence defined
A memory system is coherent if:
• Uniprocessor coherence: if P writes to location X, then reads from location X, the written value is read.
• Multiprocessor coherence: if P1 writes to location X, then P2 reads from location X, the written value is read if sufficient time has passed. (Instantly is impossible!)
• Serialization: if P1 writes "1" to X and P2 writes "2" to X, then one occurs before the other, say 1 then 2. All processors see the writes in the same order; no processor can read 2 and then later read it as 1.

Enforcing Coherence
• Directory-based schemes: one central directory keeps the sharing status.
• Snooping: caches keep the status and "snoop" on the shared memory bus to determine it.
  Advantage: uses pre-existing hardware, namely the bus to memory.

Write Invalidate vs Write Update Protocols
• Two approaches to ensuring coherence:
• Write invalidate: when P1 writes to X, all other cached versions of X are invalidated, forcing every other processor to read X from main memory.
  (Invalidate has become the method of choice.)
• Write update: when P1 writes to X, all other cached versions of X are updated with the new value.
Examples follow.

Write Invalidate Example
1. P1 reads X (from main memory to its cache)
2. P2 reads X (from main memory (where else?) to its cache)
3. P1 writes X (to its cache and through to main memory); this invalidates all other cached copies of X
4. P2 reads X (from main memory), since its cached copy is invalid
• Bus activity: 1, 2, and 4 are cache misses for X; 3 is an invalidate for X.

Write Update Example
1. P1 reads X (from main memory to its cache)
2. P2 reads X (from main memory (where else?) to its cache)
3. P1 writes X (to its cache and through to main memory); this updates all other cached copies of X
4. P2 reads X (from its own cache)
• Bus activity: 1 and 2 are cache misses for X; 3 is a write broadcast of X to the caches.
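The two examples above differ only in what the write does to the other caches. A sketch running both on a toy snooping bus (my own simplified model, made up for illustration):

```python
def run(protocol):
    """Replay the 4-step example; protocol is 'invalidate' or 'update'."""
    memory = {"X": 0}
    caches = {"P1": {}, "P2": {}}      # per-processor: X -> (value, valid)
    bus = []

    def read(p):
        if "X" not in caches[p] or not caches[p]["X"][1]:
            bus.append(f"{p}: cache miss on X")
            caches[p]["X"] = (memory["X"], True)

    def write(p, value):
        caches[p]["X"] = (value, True)
        memory["X"] = value            # write-through to main memory
        for q in caches:
            if q != p and "X" in caches[q]:
                if protocol == "invalidate":
                    caches[q]["X"] = (caches[q]["X"][0], False)
                else:
                    caches[q]["X"] = (value, True)
        bus.append(f"{p}: {protocol} broadcast for X")

    read("P1"); read("P2"); write("P1", 7); read("P2")
    return bus

print(run("invalidate"))  # 3 cache misses plus 1 invalidate
print(run("update"))      # 2 cache misses plus 1 update broadcast
```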

Performance
• The goal is to move as little data as possible over the buses; indeed, since another processor might never need the data, it could be foolish to broadcast it all the time.
• Write invalidate is more popular since it gives better performance.

Write invalidate with a write-back cache
• Every cache item has three extra bits:
  • Valid bit (the write-invalidation model requires this)
  • Dirty bit (the write-back cache requires this)
  • Shared bit (this is new!)
• When writing to a shared block, the cache generates an invalidate on the bus and marks its block as private.
• If another processor requests this block, it becomes shared again (the owner would "see" the cache miss on the snooping bus).
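A sketch of those transitions as code; this is essentially a simplified MSI-style protocol built from the three bits on the slide, not a faithful model of any particular machine:

```python
class CacheBlock:
    """Per-block coherence bits: valid, dirty, shared."""
    def __init__(self):
        self.valid = self.dirty = self.shared = False

    def local_write(self, bus):
        if self.shared:
            bus.append("invalidate")   # kick out all other copies
            self.shared = False        # the block is now private
        self.valid = True
        self.dirty = True              # write-back: memory is now stale

    def snoop_remote_read(self, bus):
        if self.valid and not self.shared:
            if self.dirty:
                bus.append("write back dirty block")
                self.dirty = False
            self.shared = True         # another processor has a copy again

    def snoop_remote_invalidate(self):
        self.valid = False             # our copy is now stale

bus = []
block = CacheBlock()
block.valid = block.shared = True  # start: block cached and shared
block.local_write(bus)             # invalidate on the bus, go private
block.snoop_remote_read(bus)       # another processor asks for the block
print(bus)                         # ['invalidate', 'write back dirty block']
```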

Distributed Shared Memory
(diagram: processors P, each with a cache C and a local memory M)
Can still support the shared memory model, e.g. OpenMP on our Beowulf.
• Simplest approach: only private data in the cache; once data is asked for by another processor, it is removed from the cache.
• Common approach: directory-based cache coherence. States include:
  • Shared: one or more processors have the block cached (and current)
  • Uncached: no processor has a copy
  • Exclusive: exactly one processor has a copy of the data
• The directory also needs to know which processors have shared data, in order to invalidate them.
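A sketch of what one directory entry might hold; the field and method names are my own, chosen to match the three states on the slide:

```python
from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    """One entry per memory block: its state plus the set of sharers,
    so the directory knows exactly whom to invalidate."""
    state: str = "uncached"       # "uncached" | "shared" | "exclusive"
    sharers: set = field(default_factory=set)

    def read(self, proc):
        self.sharers.add(proc)
        self.state = "shared"

    def write(self, proc):
        to_invalidate = self.sharers - {proc}   # everyone else's copy dies
        self.sharers = {proc}
        self.state = "exclusive"
        return sorted(to_invalidate)

entry = DirectoryEntry()
entry.read("P1"); entry.read("P2")
print(entry.state)          # shared
print(entry.write("P1"))    # ['P2'] : processors to invalidate
print(entry.state)          # exclusive
```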

Future of Caches
According to the "RAW Design Document" there have been three stages of processor design:
• Famine (70s): no room on the chip; concentrate on I/O
• Moderation (80s): some room on the chip; build caches
• Abundance (now): send data straight through
Viewpoint: caches are a barrier to the memory. They work great on traditional problems, but current problems seek much non-repeated data from the outside world.

Basics of Raw: chip = 4 x 4 Tiles