Developing Efficient Graphics Software

Developing Efficient Graphics Software: Intent of Course
• Identify application and hardware interaction
• Quantify and optimize interaction
• Identify efficient software structure
• Balance software and hardware system component use

Developing Efficient Graphics Software: Outline
• 1:35 Hardware and graphics architecture and performance
• 2:05 Software and System Performance
• Break
• 2:55 Software profiling and performance analysis
• 3:20 C/C++ language issues
• 3:50 Graphics techniques and algorithms
• 4:40 Performance Hints

Developing Efficient Graphics Software: Speakers
• Applications Consulting Engineers for SGI – optimizing, differentiating, graphics
• Keith Cok, Bob Kuehne, Thomas True, Alan Commike

Hardware & Graphics Architecture & Performance
Bob Kuehne, SGI

Course Overview
Why is your application drawing so slowly?
• Could actually be the graphics
• Could be the data traversal
• Could be something entirely different

Tour Guide
Platform architecture & components
• CPU
• Memory
• Graphics performance
• Measurements: triangle rate, fill rate, misc.
• Reproduce & maximize

Bottlenecks & Balance
Bottlenecks
• Find them
• Eliminate them (sort of - move them around)
Balance
• Understand hardware architecture
• Fully utilize hardware

Yin & Yang
• “Yin and yang are the two primal cosmic principles of the universe”
• “The best state for everything in the universe is a state of harmony represented by a balance of yin and yang.”
– Skeptics Dictionary, http://skepdic.com/yinyang.html

Write Once Run Everywhere?
My application ran fast on that platform! Why is this one so slow?
• Different platforms require different tuning
• Different platforms implement hardware differently
  – Macro: Architecture & features
  – Micro: Storage capacities, buffers, & caches
  – Effect: Bandwidth & latency

Latency & Bandwidth
Definitions:
• Latency: time required to communicate a unit of data
• Bandwidth: data transferred per unit time
Example (s: texture setup time, t: texture download time):
• Latency bottleneck: s t s t s t – setup is paid on every transfer
• Bandwidth bottleneck: s t t t – one setup, then the downloads themselves dominate

Platform: Software View
(Block diagram: CPU, memory, graphics, I/O, net, and miscellaneous devices as the components the application sees.)

Platform: PCI, AGP
(Block diagrams: in the AGP design, CPU and memory connect through glue logic to graphics over a dedicated AGP port, with disk, network, and other I/O on PCI; in the PCI-only design, graphics shares the PCI bus with the other I/O devices.)

Platform: UMA, Switched Hub
(Block diagrams: a unified memory architecture (UMA) system and a switched-hub system, each connecting CPUs, memory, graphics, and PCI I/O (net, disk) through glue logic.)

Platform: The Points
Why learn about hardware?
• To understand how your app interacts with it
• To best utilize the hardware
• Potentially can use extra hardware features
Where?
• Platform documentation
• Talk with hardware vendor

CPU: Overview
CPU Operation
• Data transferred from main memory to registers
• CPU works on data in registers
Latency
• Registers: 0 (free)
• Level-1 (L1) cache: 1
• Level-2 (L2) cache: 10x L1
• Main memory: 100x L1
(Diagram: CPU -> Registers -> L1 -> L2 -> Main Memory)

CPU, Cache, and Memory
Caches designed to exploit data locality
• Temporal locality
• Spatial locality
(Diagram: CPU -> Registers -> L1 -> L2 -> Main Memory)
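
The locality point is easiest to see in code. The sketch below is illustrative rather than taken from the course notes: both functions compute the same sum, but the first walks the array in the order it is laid out in memory and reuses each fetched cache line, while the second strides across rows and misses on nearly every access.

#define N 1024
static float x[N][N];

/* Row-major order: consecutive j values are adjacent in memory,
 * so each fetched cache line is fully used (spatial locality). */
float sum_row_major(void)
{
    int i, j;
    float sum = 0.0f;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += x[i][j];
    return sum;
}

/* Column-major order: the same arithmetic, but each access jumps
 * N floats ahead, touching a new cache line almost every time. */
float sum_column_major(void)
{
    int i, j;
    float sum = 0.0f;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum += x[i][j];
    return sum;
}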

Memory: Cache & Logical Flow
(Decision flow:)
• In a register? Compute.
• In L1? Copy to register (1), compute.
• In L2? Copy to L1 (10), then to a register.
• Otherwise copy from main memory to L2 (100), then up the hierarchy.

Memory: Cache & Physical Flow
(Diagram: data moves a page or cache line at a time from Main Memory -> L2 Cache -> L1 Cache -> Registers -> CPU.)

Memory: Allocation & Pools
• List elements are often allocated as-needed
  – This leads to spatial disparity
• Mitigated by use of application memory management
  – Bad: malloc, ...
  – Good: pools - pool_init, pool_alloc, ...
• Graphics example:
  – Vertices, normals, textures, etc.
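
pool_init and pool_alloc on the slide only suggest the shape of the interface; here is a minimal sketch of a fixed-size object pool, with names and layout chosen for illustration rather than taken from the course:

#include <stdlib.h>

/* Fixed-size object pool: one allocation up front, O(1) "allocations"
 * afterwards, and the objects come back adjacent in memory. */
typedef struct {
    char   *block;       /* one contiguous slab */
    size_t  obj_size;
    size_t  capacity;
    size_t  used;
} pool_t;

int pool_init(pool_t *p, size_t obj_size, size_t count)
{
    p->block    = malloc(obj_size * count);
    p->obj_size = obj_size;
    p->capacity = count;
    p->used     = 0;
    return p->block != NULL;
}

void *pool_alloc(pool_t *p)
{
    if (p->used == p->capacity)
        return NULL;                     /* pool exhausted */
    return p->block + p->obj_size * p->used++;
}

void pool_free_all(pool_t *p)
{
    free(p->block);                      /* releases every object at once */
    p->block = NULL;
}

Allocating all the vertices or normals of a mesh from one pool keeps them in a handful of pages and cache lines instead of scattered across the heap.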

Memory: Graphics! Vertex Arrays

Graphics: Pipe
FIFO -> xf -> light -> clip -> rast -> fx -> fops
• xf: world to screen
• light: apply light
• clip: clip to view
• rast: convert to pixels
• fx: apply texture, etc.
• fops: test pixel ops

Graphics: Pipe & Akeley Taxonomy
G -> T -> X -> R -> D
• G - Generate geometric data
• T - Traverse data structures
• X - Transform primitives world to screen
• R - Rasterize triangles to pixels
• D - Display framebuffer on output device

Graphics: Hardware
4 types of hardware common (stages after the dash are implemented in graphics hardware)
• G-TXRD : all hardware
• GT-XRD : transform and rasterization in hardware
• GTX-RD : rasterization in hardware, transform on the host
• GTXR-D : all software

Graphics: Performance Benchmarks
• “Trust, but verify.” - an ex-president
Definitions
• Triangle rate: speed at which primitives are transformed (X)
• Fill rate: speed at which primitives are rasterized (R)
  – Depth complexity: number of times pixel filled
Caveats
• Quantization, fastpath

Graphics: Quantization
• Frame quantization is the result of swapbuffers occurring at the next vertical retrace.
  – Necessary to avoid image artifacts such as tearing
• Example: 100 Hz display refresh

Graphics: Quantization
(Timing diagram: a frame that would run at 120 Hz with no sync displays at 100 Hz, 50 Hz, or 33 Hz on a 100 Hz display, depending on whether it finishes within one, two, or three retrace intervals; tn marks 1/100-second retraces, and each bar spans one graphics frame.)
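
The quantization arithmetic can be written out directly. This helper is an illustration, not course code: with buffer swaps locked to vertical retrace, the displayed rate is the refresh rate divided by the number of retrace intervals a frame spans, so an 11 ms frame on a 100 Hz display shows at 50 Hz rather than 91 Hz.

#include <math.h>

/* Displayed frame rate when buffer swaps wait for vertical retrace:
 * a frame that misses a 10 ms retrace on a 100 Hz display is held
 * until the next one, so rendering in 11 ms displays at 50 Hz. */
double displayed_fps(double render_time_s, double refresh_hz)
{
    double retrace  = 1.0 / refresh_hz;
    double retraces = ceil(render_time_s / retrace);
    if (retraces < 1.0)
        retraces = 1.0;
    return 1.0 / (retraces * retrace);
}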

Graphics: Fastpath
Definition
• Fastpath: the most optimized path through graphics hardware
Example
• fast path: float verts, float norms, ABGR textures, z-test
• less fast path: float verts, float norms, RGBA textures, z-test

Graphics: Fastpath Example

Graphics: Fastpath Points
• Fast path is often synonymous with ideal path.
  – Real usage of graphics falls on a continuum from the fast path (hardware) to the slow path (software), trading speed against quality. Where is your application?
• Must quantify what hardware can do
  – Quality & speed

Graphics Hardware: Testing
Duplicate performance numbers simply:
• Good: build a simple test program
• Better: glPerf - http://www.spec.org
Maximize performance in an app:
• Good: Use fast API extensions
• Better: Create an “is-fast” test, use what is verified as fast

Graphics Hardware: “Is-Fast”
Test each platform to determine fast path
• Once, per-machine, test primitives and modes
  – Vertex array format, texture format, display list, etc.
• Store data in database
  – Detect hardware changes or time-to-live
• Read data from database at startup
  – Check database or re-generate data

Graphics Hardware: “Is-Fast” Pseudo-code

if ( new_machine() || hardware_changed() ) {
    test_interesting_modes();
    store_in_database();
} else {
    // have database entry
    get_performance_data_from_database();
}
// use the modes & primitives that are "fast" when rendering

Think Globally, Act Locally
Think globally
• Know the platforms & graphics hardware
• Use hardware effectively in your app
• Balance hardware utilization
Act locally
• Use in-cache data
• Understand hardware & graphics fastpaths
• Balance quality vs. performance

Software and System Performance
Thomas J. True, SGI

A Four Step Process
1. Quantify
2. System Evaluation
3. Graphics Analysis
4. Bottleneck Elimination

Quantify
Characterize
• Application Space
• Primitive Types
• Primitive Counts
• Rendering Characteristics
• Frame Rate

Quantify
Compare

Examine System Configuration
Resources
• Memory
• Disk
Setup
• Display
• Network

Graphics Analysis
Ideal Performance
• Keep graphics pipeline full.
• 100% CPU utilization running application code.
• 100% graphics utilization.

Graphics Analysis: Graphics Bound
(Screenshot: “Acme Electronics” system-monitor utilization gauge, scale 0-100.)

Graphics Analysis
Graphics Bound
• Graphics subsystem processes data slower than CPU can feed it.
• Graphics subsystem issues an interrupt which causes the CPU to stall.
• Data processing within application stops until graphics subsystem can again accept data.

Graphics Analysis
Geometry Limited
• Limited by the rate at which vertices can be transformed and clipped.
Fill Limited
• Limited by the rate at which transformed vertices can be rasterized.

Graphics Analysis: CPU Bound
(Screenshot: “Acme Electronics” system-monitor utilization gauge, scale 0-100.)

Graphics Analysis
CPU Bound
• CPU at 100% utilization but can’t feed graphics fast enough.
• Graphics subsystem at less than 100% utilization.
• All CPU cycles consumed by data processing.

Graphics Analysis
Determination Techniques
• Remove graphics API calls.
• Shrink graphics window.
• Reduce geometry processing requirements.
• Use system monitoring tool.

Graphics Analysis
(Decision flowchart for a performance problem:)
• Remove the rendering (graphics API) calls: no change in frame rate means the problem is not graphics; a frame rate increase means it is a graphics performance problem.
• Shrink the graphics window: a frame rate increase means graphics bound, fill limited.
• Reduce the geometry load: a frame rate increase means graphics bound, geometry limited.
• Use a system monitoring tool: excessive or unexpected CPU activity suggests the application has fallen off the fast path.

Graphics Analysis
Graphics Architecture: GTXR-D
(Screenshot: “Acme Electronics” system monitor for a GTXR-D system.)

Graphics Analysis
Graphics Architecture: GTXR-D (aka Dumb Frame Buffer)
• CPU does everything.
• Typically CPU bound.
• To remedy, buy a “real” graphics board.

Graphics Analysis
Graphics Architecture: GTX-RD
(Screenshot: “Acme Electronics” system monitor for a GTX-RD system.)

Graphics Analysis
Graphics Architecture: GTX-RD
• Screen space operations performed by graphics.
• Object-space to screen-space transform on host.
• Can easily become CPU bound.
  “Roughly 100 single-precision floating point operations are required to transform, light, clip test, project and map an object-space vertex to screen space.” - K. Akeley & T. Jermoluk
• Beware of fast-path and slow-path issues.

Graphics Analysis
Graphics Architecture: GTX-RD
• If Graphics Bound:
  – Reduce per-pixel operations.
  – Reduce depth complexity.
  – Use native-format data.

Graphics Analysis
Graphics Architecture: GTX-RD
• If CPU Bound:
  – Reduce scene complexity.
  – Use more efficient graphics algorithms.

Graphics Analysis
Graphics Architecture: GT-XRD
(Screenshot: “Acme Electronics” system monitor for a GT-XRD system.)

Graphics Analysis
Graphics Architecture: GT-XRD
• Transformation and rasterization performed by graphics.
• Can be CPU or graphics bound.
• Beware of fast-path and slow-path issues.
• Subject to host bandwidth limitations.

Graphics Analysis
Graphics Architecture: GT-XRD
• If Graphics Bound:
  – Move lighting back to CPU.
  – Use native data formats within application.
  – Use display lists or vertex arrays.
  – Use less expensive lighting modes.

Graphics Analysis
Graphics Architecture: GT-XRD
• If CPU Bound:
  – Move lighting from CPU to graphics subsystem.
  – Do matrix operations in graphics hardware.
  – Profile in search of computational performance issues.

Bottleneck Elimination
Bottlenecks

Bottleneck Elimination
Bottlenecks
• Understanding them is crucial to effective tuning.
• They will always exist; tune to balance.
• Not always a bad thing.

Bottleneck Elimination
Graphics
• Use native graphics formats.
• Remove excessive state changes.
• Package graphics primitives efficiently.
• Use textures that fit in texture cache.
• Don’t use unnecessary rendering modes.
• Decrease depth complexity.
• Cull out excessive geometry.

Bottleneck Elimination
Memory
• Don’t allocate memory in rendering loop.
• Avoid copying and repackaging of graphics data.
• Organize graphics data.
• Avoid memory fragmentation.

Bottleneck Elimination
Memory Bandwidth and Fragmentation
• Independent Triangles – 9 vertices: 504 bytes
• Triangle Strip – 5 vertices: 280 bytes
• Vertex Array – 5 vertices: 280 bytes
Vertex = RGBA + XYZW + XYZ + STR = 56 bytes
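
The 56-byte figure follows directly from the attribute list on the slide: 4 color + 4 position + 3 normal + 3 texture-coordinate floats. A struct along these lines (illustrative, not the course's code) makes the arithmetic concrete:

/* One vertex as laid out on the slide: 14 floats = 56 bytes, no padding. */
typedef struct {
    float r, g, b, a;       /* RGBA color          : 16 bytes */
    float x, y, z, w;       /* XYZW position       : 16 bytes */
    float nx, ny, nz;       /* XYZ normal          : 12 bytes */
    float s, t, q;          /* STR texture coords  : 12 bytes */
} vertex;

/* 3 independent triangles: 9 * 56 = 504 bytes moved to the pipe.
 * The same 3 triangles as a strip: 5 * 56 = 280 bytes. */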

Bottleneck Elimination
Code and Language
• Use native data types.
• Avoid contention for a single shared resource.
• Avoid application bottlenecks in non-graphics code.
• Reduce API call overhead.

Bottleneck Elimination
API Call Overhead
• Independent Triangles: (XYZW + RGBA + XYZ + STR) * 9 vertices: 36 function calls
• Triangle Strips: (XYZW + RGBA + XYZ + STR) * 5 vertices: 20 function calls
• Vertex Array: 5 function calls
• Display List: 1 function call
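
As a sketch of where those call counts come from, the fragment below (assuming the interleaved 56-byte vertex struct sketched above and a current OpenGL context; not code from the course) replaces four immediate-mode calls per vertex with a few client-state and pointer bindings plus one draw call:

#include <GL/gl.h>

/* Draw 'count' strip vertices from one interleaved array of the
 * 56-byte vertex struct above, using OpenGL 1.1 vertex arrays. */
void draw_strip(const vertex *verts, int count)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);

    glColorPointer(4, GL_FLOAT, sizeof(vertex), &verts[0].r);
    glVertexPointer(4, GL_FLOAT, sizeof(vertex), &verts[0].x);
    glNormalPointer(GL_FLOAT, sizeof(vertex), &verts[0].nx);
    glTexCoordPointer(3, GL_FLOAT, sizeof(vertex), &verts[0].s);

    glDrawArrays(GL_TRIANGLE_STRIP, 0, count);   /* one call for all vertices */
}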

Conclusion
Performance Tuning: an Iterative Process
Quantify -> System Evaluation -> Graphics Analysis -> Bottleneck Elimination

Conclusion
It’s all about balance!

Profiling and Performance Analysis
Keith Cok, SGI

Profile and Performance Analysis
• Profiling points out code areas that take up most time
• Imperative for well balanced application
• Points out code and system bottlenecks

Two Methods of Software Profiling
Basic block
• A section of code that has one entry and one exit
• Measures ideal time
Statistical sampling
• Interrupts program execution and examines current location
• Measures actual CPU cycles spent executing a line of code

How Do You Profile Code?
• Compile/link with compiler optimizations turned on
  – cc foo.c -use_all_optimization_flags ...
• Instrument the code
  – Unix: pixie foo.exe -> foo.exe.pixie
  – Visual Studio: embedded in tool suite
• Run the application with relevant data sets
  – foo.exe.pixie -args -> produces results data file
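
pixie is SGI-specific; on other Unix toolchains the same instrument-run-report loop looks roughly like this with gcc and gprof (shown as a common alternative, not the course's workflow; the data-set argument is hypothetical):

# instrument: -pg adds profiling hooks (gcc or a compatible cc)
cc -O2 -pg foo.c -o foo
# run with a representative workload; writes gmon.out
./foo typical_data_set
# report: flat profile plus call graph
gprof foo gmon.out > foo.profile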

Profiling: Finding the Hot Spot
Function list, in descending order by exclusive ideal time

       excl.%  cum.%  instructions    calls  function (dso: file, line)
  [1]  10.3%  10.3%     190583064     3203  S_Update_ (foo: snd_dma.c, 848)
  [2]   8.9%  19.2%     173920781   338787  R_RenderBrushPoly (foo: gl_rsurf.c, 641)
  [3]   8.2%  27.4%     145950460      240  GL_LoadTexture (foo: gl_draw.c, 990)
  [4]   5.9%  33.3%      97798122  1975976  __sin (libm.so: sin.c, 194)
  [5]   4.1%  37.4%      82310479    16797  R_DrawAliasModel (foo: gl_rmain.c, 232)
  [6]   3.4%  40.8%      50786176  1204269  __glMgrim_Begin (libGLcore.so: mgras_prim.c, 221)
  [7]   3.2%  44.0%      58099072    30981  EmitWaterPolys (foo: gl_warp.c, 187)
  [8]   3.1%  47.1%      53832546   290970  R_RecursiveWorldNode (foo: gl_rsurf.c, 894)
  [9]   3.1%  50.2%      43855299   437627  R_CullBox (foo: gl_rlight.c, 313; compiled in gl_rmain.c)
 [10]   2.8%  53.0%      44666700    11484  GL_CreateSurfaceLightmap (foo: gl_rsurf.c, 1293)

Profiling: Fixing the Hot Spot
What do you look for?
• Common sub-expressions
• Loop invariant code
• Repeated pointer de-referencing
• Global variables and cache misses
• “Thin” loops

Profiling Example

// Code the old way
19: void old_loop() {
20:     sum = 0;
21:     for (i = 0; i < NUM; i++)
22:         sum += x[i];
23:     printf("sum = %f\n", sum);
24: }

// Code the new way
27: void new_loop() {
28:     sum = 0;
29:     ii = NUM % 4;
30:     for (i = 0; i < ii; i++)
31:         sum += x[i];
32:     for (i = ii; i < NUM; i += 4) {
33:         sum += x[i];
34:         sum += x[i+1];
35:         sum += x[i+2];
36:         sum += x[i+3];
37:     }
38:     printf("sum = %f\n", sum);
39: }

Profiling Example: Profile Results

With old_loop:
        cycles  instructions  calls  function (dso: file, line)
  [1]     6160          6168      1  old_loop (blahdso.so: blahdso.c, 19)
  [2]     4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)

With new_loop:
  [1]     4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)
  [2]     4625          4891      1  new_loop (blahdso.so: blahdso.c, 27)

Profile Example: Line Analysis
Line list, in descending order by time

    cycles  invocations  function  line
      4096         1024  old_loop  sum += x[i];
      2061         1024  old_loop  for (i = 0; i < NUM; i++)
       978          256  new_loop  sum += x[i+3];
       968          256  new_loop  sum += x[i+2];
       968          256  new_loop  sum += x[i+1];
       968          256  new_loop  sum += x[i];
       733          256  new_loop  for (i = ii; i < NUM; i += 4)
         7            1  new_loop  ii = NUM%4;

Profile and Performance Analysis
Profile Example: Visual C++/Intel

    Function   Percent of   Hit
    Time(s)    Run Time     Count   Function
    ------------------------------------------
      0.410        39.4        1    _old_loop
      0.249        23.9        1    _new_loop

Statistical vs. Basic Block Profile
Three variants of a nested summation loop (ijk_loop, kji_loop, ikj_loop), identical except for the order of the loop indices:

void ijk_loop() {
    sum = 0;
    for (i = 0; i < YNUM; i++)
        ...
}

Basic Block vs. Statistical Sampling

Basic Block:
      Percent     cycles       inst  calls  function
  [1]   25.3%   51141434   37101028      1  ijk_loop (foo.c, 47)
  [2]   25.3%   51141434   37101028      1  kji_loop (foo.c, 57)
  [3]   25.3%   51141434   37101028      1  ikj_loop (foo.c, 66)

Statistical Sampling:
      Percent   Samples   Procedure (Function)
  [1]   38.0%      2700   kji_loop (foo.c, 57)
  [2]   23.9%      1700   setup_data (foo.c, 15)
  [3]   19.7%      1400   ikj_loop (foo.c, 66)
  [4]   18.3%      1300   ijk_loop (foo.c, 47)

Now We Know About Hot Spots...
What do we do next?
• Use compilers to fine-tune code
• Use knowledge of language to optimize
• Hand-tune code
Profiling is fun, hard, and iterative and it can be highly effective

Compiler and Language Issues
Keith Cok, SGI
Bob Kuehne, SGI

Compiler and Language Issues
Compiler Optimizations:
• Occur within a compromise of speed and memory space vs. time to compile and link
• An iterative process to discover what does and doesn’t work
• Important to keep at it

Compiler Issues: Trade-Offs
• Trade-offs:
  – Round-off vs. needed precision
  – Inter-procedural analysis vs. link time
  – Pointer aliasing vs. coding constraints
  – Optimizing for processor architectures vs. work of multiple binaries (support, test)
• Explore other compilers than your first choice
• Different source code - different flags

Compiler and Language Issues
Comments on 32 vs. 64 bit code
• Benefits of 64 bit code:
  – Increased address space
  – Higher precision
• Downsides of 64 bit code:
  – Application memory footprint
  – Need to port, which can be difficult!
• Performance issues

Language Issues
• Data Management
• Unrolling loops
• Arrays
• Temporary variables
• Pointer aliasing

Language Issues: Data Management
Manipulate data structures efficiently since graphics IS data

Before (the large member separates the frequently used fields):

struct {
    str *next;
    str *prev;
    large_type foo;
    int key;
} str;

After (next, prev, and key stay together):

struct {
    str *next;
    str *prev;
    int key;
    large_type foo;
} str;

Language Issues: Data Management
Pack data efficiently

struct foo {
    char  aa;   // 8 bits + 24 pad
    float bb;   // 32 bits
    char  cc;   // 8 bits + 24 pad
    float dd;   // 32 bits
    char  ee;   // 8 bits + 24 pad
} foo_t;        // 160 bits

struct foo_better {
    float bb;   // 32 bits
    char  aa;   // 8 bits
    char  cc;   // 8 bits
    char  ee;   // 8 bits + 8 pad
    float dd;   // 32 bits
} foo_t;        // 96 bits

Language Issues: Data Management
Examine your arrays and note their caching behavior
• Break up large arrays into smaller sub-arrays for better memory access patterns
• Understand the implications of data layout and cache behavior

Language Issues: Loop Unrolling
Profiling Example

// Code the old way
19: void old_loop() {
20:     sum = 0;
21:     for (i = 0; i < NUM; i++)
22:         sum += x[i];
23:     printf("sum = %f\n", sum);
24: }

// Code the new way
27: void new_loop() {
28:     sum = 0;
29:     ii = NUM % 4;
30:     for (i = 0; i < ii; i++)
31:         sum += x[i];
32:     for (i = ii; i < NUM; i += 4) {
33:         sum += x[i];
34:         sum += x[i+1];
35:         sum += x[i+2];
36:         sum += x[i+3];
37:     }
38:     printf("sum = %f\n", sum);
39: }

Language Issues: Loop Unrolling
Profile Example: Line Analysis
Line list, in descending order by time

    cycles  invocations  function  line
      4096         1024  old_loop  sum += x[i];
      2061         1024  old_loop  for (i = 0; i < NUM; i++)
       978          256  new_loop  sum += x[i+3];
       968          256  new_loop  sum += x[i+2];
       968          256  new_loop  sum += x[i+1];
       968          256  new_loop  sum += x[i];
       733          256  new_loop  for (i = ii; i < NUM; i += 4)
         7            1  new_loop  ii = NUM%4;

Language Issues: Loop Unrolling
Issues with loop unrolling:
• Code complexity
• Clutter
• Compiler may/may not do this
• Flags may affect compiler time spent optimizing
Only “thin” loops gain performance
Use application knowledge to take advantage of loop unrolling

Language Issues: Local temporary variables
Use local temporary variables to avoid repeatedly de-referencing a pointer structure

Example:

x = global_ptr->record_str->a;
y = global_ptr->record_str->b;

Use:

tmp = global_ptr->record_str;
x = tmp->a;
y = tmp->b;

Language Issues: Using tmp vars for global vars within a function

Before (accumulates every term through the destination pointer):

void tr_point(FLOAT *old_pt, FLOAT *m, FLOAT *new_pt)
{
    FLOAT *c1, *c2, *c3, *c4, *op, *np;
    int j;
    c1 = m; c2 = m+4; c3 = m+8; c4 = m+12;
    for (j = 0, np = new_pt; j < 4; j++) {
        op = old_pt;
        *np += *op++ * *c1++;
        *np += *op++ * *c2++;
        *np += *op++ * *c3++;
        *np++ += *op * *c4++;
    }
}

After (accumulate in a local temporary, write through the pointer once):

void tr_point(FLOAT *old_pt, FLOAT *m, FLOAT *new_pt)
{
    FLOAT *c1, *c2, *c3, *c4, *op, *np, tmp;
    int j;
    c1 = m; c2 = m+4; c3 = m+8; c4 = m+12;
    for (j = 0, np = new_pt; j < 4; j++) {
        op = old_pt;
        tmp  = *op++ * *c1++;
        tmp += *op++ * *c2++;
        tmp += *op++ * *c3++;
        *np++ = tmp + (*op * *c4++);
    }
}

Language Issues: Pointer Aliasing
• Pointers are aliases when they point to potentially overlapping regions of memory
• If regions never overlap, may optimize for this case. Not possible, though, in general
• Compiler can't tell when pointers are aliased
• Use restrict keyword or compiler option

Language Issues: Pointer Aliasing
Unaliased Pointers
Compilers may use:
- Parallelism
- Pipelining
(Diagram: separate in and out arrays vs. aliased pointers into overlapping memory.)

Language Issues: Pointer Aliasing

void process_data(float * restrict in,
                  float * restrict out,
                  float gain)
{
    int i;
    for (i = 0; i < NSAMPS; i++) {
        out[i] = in[i] * gain;
    }
}

C++: General Issues
• Language features
  – RTTI, safe casts, etc.
• Use const, mutable, volatile, & inline
  – hints to compilers
• Object construction
  – arrays, default constructors, arguments, etc.
• Method invocation issues
  – operators, overloads, conversion, etc.

C++: Virtual Functions
• Good - used to invoke child method when managing base-class handles
• Expensive - incur an additional pointer de-reference
  – one, find VTBL, two, find method, invoke
  – bad for caching
• Use when necessary, but not for common objects
  – Good for ‘large’ methods that do lots of work
  – Bad for ‘small’ methods, like a vertex query

C++: Exceptions & Templates
Exceptions
• Great for error checking
• Performance penalty
  – Additional stack information required
Templates
• Great for code re-use
• Memory penalty
  – Across libraries, across object files

Code & Language Issues: The End
Balance
• Know your compiler
  – Features & performance
• Know your language
  – Features & performance
• Know your app
  – Features & performance

Idioms and Application Architectures
Alan Commike, SGI

Starting Quote
The best tuned, most efficient bubble sort is still a bubble sort. Additional tweaking won't improve performance. Change The Algorithm!
- Commike '99

Introduction
To write an efficient graphics application, one must:
• Understand the platform
• Use graphics efficiently
• Write good code
Use efficient application structures and algorithms

Outline
• Background
• Culling
• Level of Detail (LOD) management
• Application architectures

Application Architectures: Rendering Path
• Application work, culling, LOD, drawing
• Pipelined rendering path
App -> Cull -> LOD -> Draw

Application Architectures: Rendering Path
• Application work, culling, LOD, drawing
• Pipelined rendering path
(Timing diagram: the stages overlap across frames; while frame 0 is in Draw, frame 1 is in Cull/LOD and frame 2 is in App.)

Application Architectures: Target Frame Rate
A target frame rate attempts to bound the maximum render time
• Control Culling and LOD aggressiveness
• Maintain a constant frame rate
• Achieve an acceptable interactive frame rate

Graphics Idioms
• Culling
  – Removing geometry that isn't visible
• Level of Detail Management
  – Reducing geometric complexity

Culling
Don’t draw what you can’t see

Culling: Culling Types
Use one. Use all. Pipeline them together.
• View Frustum Culling
• Backface Culling
• Contribution Culling
• Occlusion Culling

Culling: Bounding Volumes
Test against a bounding volume, not individual primitives
• Can be bounding sphere, box, oriented box, or any enclosing volume
• Hierarchical bounding volumes to reduce cull time
• Spheres are fast, boxes are more accurate
  – Use a combination of both

Culling: View Frustum
Graphics pipeline clips data that falls outside the View Frustum
If it will be clipped, don’t bother drawing

Culling: View Frustum Usefulness
• Improves geometry rate
  – Culled vertices are not transformed, lit, and clipped
• Improves host download rate
  – Less data moved from memory into graphics
• Does not change fill rate
  – Triangles outside the View Frustum would not have been drawn anyway

Culling: View Frustum Implementation
• Transform vertices to clip coordinates (in OpenGL multiply by the Model-View and Projection matrix)
• Check each vertex against View Frustum
• Geometry is either In, Out, or Partial
• Render In and Partial
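
A minimal sketch of the In/Out/Partial classification against a bounding sphere, assuming the six frustum planes have been extracted with their normals pointing into the view volume (illustrative code, not from the course):

/* Classify a bounding sphere against six frustum planes stored as
 * (a, b, c, d) with normals pointing into the view volume. */
typedef enum { CULL_OUT, CULL_IN, CULL_PARTIAL } cull_result;

cull_result cull_sphere(const float planes[6][4],
                        float cx, float cy, float cz, float radius)
{
    int i, all_inside = 1;
    for (i = 0; i < 6; i++) {
        float dist = planes[i][0] * cx + planes[i][1] * cy +
                     planes[i][2] * cz + planes[i][3];
        if (dist < -radius)
            return CULL_OUT;        /* entirely outside one plane: cull */
        if (dist < radius)
            all_inside = 0;         /* straddles this plane */
    }
    return all_inside ? CULL_IN : CULL_PARTIAL;
}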

Culling: Skip the Clip
In software transform systems (GTX-RD) skip the clip
• Partial and In geometry classified
  – Pipe renders Partial as usual
  – Pipe can render In without a View Frustum clip
• Might be a hint to render
• Can improve geometry rates if not already fill-limited

Culling: Backface
Only half of any closed polyhedron is visible at any one time
Don’t render what you can’t see

Culling: Backface Usefulness
• Improves fill rate when using a native implementation
  – Primitives are transformed and lit before culling
• Helps both geometry and fill with an application specific algorithm
  – More computationally expensive
  – Balance graphics and CPU work
• This may not work well when you can enter closed geometry or need two-sided lighting

Lava. Hot!

Random Quote
Try not. Do, or do not. There is no try.
- Yoda '80

Culling: Contribution
If it’s too small to make a difference, don’t render it

Culling: Contribution Usefulness
• Improves geometry rate
  – Culled vertices are not transformed, lit, and clipped
• Improves host download rate
  – Less data moved from memory into graphics
• Does not change fill rate
  – Screen space projection already minimal
  – Removes few pixels from rasterization stage

Culling: Contribution Implementation
Don’t render items that fall below a size threshold
• Screen space size of bounding volume
• A less computational approach
  – Distance to object combined with some notion of global object size
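
The screen-space test can be as small as this sketch (the names and the field-of-view-based projection are assumptions for illustration, not course code): estimate the projected diameter of the bounding sphere in pixels and skip the object when it falls below a threshold.

#include <math.h>

/* Approximate projected diameter of a bounding sphere in pixels and
 * compare it to a threshold (e.g. a few pixels). */
int contributes(float radius, float distance_to_eye,
                float fovy_radians, float viewport_height,
                float min_pixels)
{
    float pixels_per_radian = viewport_height / (2.0f * tanf(fovy_radians * 0.5f));
    float projected_pixels  = (2.0f * radius / distance_to_eye) * pixels_per_radian;
    return projected_pixels >= min_pixels;
}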

Culling: Occlusion
If you can’t see it, don’t draw it

Culling: Occlusion Goals
Find the optimal set of occluders that will enable drawing the minimal number of occludees
• Occluders: The geometry that is visible
• Occludees: The geometry that is not visible
• Use general purpose occlusion culling algorithms
• Use application specific spatial knowledge if possible

Culling: Occlusion Culling Usefulness
• Can improve both transform-limited and fill-limited applications
• Computationally expensive
  – Beware of time trade-offs
• Possible hardware support

Culling: General Occlusion Culling
• Used for arbitrary scenes
• Can improve both transform limited and fill limited applications
• Computationally expensive for arbitrary scenes

Culling: Occlusion Spatial Partitioning
“Cell and Portal” Culling
• Spatial organization leads to Cells and Portals
• Games that move from room to room
• Architectural walkthroughs

LOD: Overview
After culling, need to draw what is left
• Still too much geometry:
  – Use multiple Levels of Detail, i.e. multi-resolution objects
• Match geometric complexity to visible on-screen space coverage
• Reduce geometric complexity to maintain target frame rate

LOD: Issues
• Generating LODs:
  – Height Fields vs. 3D objects
  – View-Dependent: nice, but compute intensive
  – View-Independent: fast, memory intensive
• Need to decide which LOD level to use
  – Not trivial!
• Need smooth transitions between levels
  – Geomorphs

LOD: Height Fields
• Generally thought of as infinite terrain
• Specialized algorithms can be used

LOD: 3D Models
• General purpose simplification algorithm
• Can use on height fields also
• Some recent real-time view-dependent algorithms
• Also used for compression
(Figure: the same model at 1024, 256, 64, and 16 triangles.)

LOD: When to switch LOD levels
Ability to only generate LOD models is not sufficient
• Need to know when to use which LOD level
  – single constant hard metric: distance from eye
  – Multiple heuristics: cost, benefit, rankings
• Can bias LODs to ensure frame rate targets are reached
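
A minimal sketch of the "single constant hard metric" case, with a global stress factor standing in for the bias mentioned on the slide (names and the threshold scheme are illustrative, not from the course):

/* Pick an LOD level from per-object switch distances (ascending, one per
 * transition).  'stress' > 1.0 biases toward coarser levels when the
 * frame-rate controller reports that frames are running long. */
int select_lod(const float *switch_distance, int num_levels,
               float eye_distance, float stress)
{
    int level;
    float d = eye_distance * stress;
    for (level = 0; level < num_levels - 1; level++) {
        if (d < switch_distance[level])
            break;
    }
    return level;    /* 0 = most detailed */
}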

LOD: Level determination
• Determine system rendering characteristics
• Determine cost of rendering each object
• Render objects with highest benefit while remaining under the target frame rate
Level determination can be time consuming!
“take the time to time the time taken to reduce the rendering time”

Going, and going...

LOD: Determining cost of rendering
Cost is affected by many factors
• Graphics hardware: published benchmarks, startup tests
• Number of vertices: primarily a function of LOD algorithm
• Rendering Quality: lighting, shading, wire frame, anti-aliasing, etc.
• Global Factors: total texture memory, dirty internal state

LOD: Benefit Function
Cost alone is not good enough, need benefit also
• Rendered size of object
• Error tolerance between LOD level and reference model
• Importance in scene
• Frame-to-frame coherency

LOD: The Optimal LODs
For all Objects, at each LOD Level, rendered with each RenderType:
Maximize the Benefit function: Benefit(Object, Level, RenderType)
Subject to: Cost(Object, Level, RenderType) <= TargetFrameRate

LOD: Optimal Optimizations
• Simulated Annealing
• Monte Carlo Simulations
• Simplex Searches

LOD: Optimal Optimizations
• Simulated Annealing
• Monte Carlo Simulations
• Simplex Searches
Dude, can you spare a few dozen CPUs?

LOD: Trade-offs
Don’t have enough time to run the full LOD optimization problem and render the scene
• Simplify cost and benefit functions
• Simplify optimization problem into a ranking of Benefit/Cost
• Use frame-to-frame coherency
• Be sure to consider time taken to calculate LODs

Application Architectures: Multi-Threading
• More stages give more time to cull or generate LODs
• Each stage adds latency
(Timing diagram: with App, Cull, LOD, and Draw as separate pipeline stages, frames 0, 1, and 2 are in different stages at the same time step.)

Application Architectures: Multi-Threading
• Hard part is data synchronization
• Watch out for memory bloat

Application Architectures: Scene Graphs
A scene graph is the basic data structure holding the description of your scene
• Cull-able, sort-able, and can contain multi-resolution objects
• Hierarchical Bounding Volumes
• Statistics gathering and timing infrastructure
• For large scenes can do memory management and database paging

Application Architectures: Trade-offs
• Quality
• Speed
• Memory
• Complexity

Conclusion: Most importantly - Think about balance!

Performance Hints
Keith Cok, SGI

Performance Hints: Pipeline Management
• Avoid round trips to graphics server
  – Cache own state/attribute information
  – Avoid pipeline queries (e.g., glGet*)
  – Flush buffer efficiently (glFlush vs. glFinish)
• Reduce state changes. Sort by expense. For example, sort geometry by type (triangles, quads, etc.) and then by color
• Eliminate unused attributes
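
Caching your own state can be as simple as shadowing each enable bit in the application, so redundant changes and glIsEnabled/glGet round trips never reach the pipe. A minimal sketch (illustrative, not course code):

#include <GL/gl.h>

/* Shadow one piece of pipeline state in the application so redundant
 * changes are skipped and no glIsEnabled/glGet query is ever needed. */
static GLboolean depth_test_on = GL_FALSE;

void set_depth_test(GLboolean enable)
{
    if (enable == depth_test_on)
        return;                          /* redundant: don't touch the pipe */
    if (enable)
        glEnable(GL_DEPTH_TEST);
    else
        glDisable(GL_DEPTH_TEST);
    depth_test_on = enable;
}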

Performance Hints: Debugging
Detect graphic errors:

#ifdef DEBUG
#define GLEND() glEnd(); \
    { int err; \
      err = glGetError(); \
      if (err != GL_NO_ERROR) \
          printf("%s\n", gluErrorString(err)); \
      assert(err == GL_NO_ERROR); \
    }
#else
#define GLEND() glEnd()
#endif

Performance Hints: Geometry
• Maximize data between glBegin/glEnd
  – Sort geometry by type (triangle, quad, etc.) and group them together
  – Find best fit for length of glBegin/glEnd pair
• Use strip primitives (GL_TRIANGLE_STRIP, ...) to reduce geometry data sent to the pipeline
• Avoid GL_POLYGON. Use specific geometric primitives instead (GL_TRIANGLES, GL_QUADS, etc.)
• Use GL_FASTEST with glHint calls where possible

Performance Hints: Geometry
• Use flat display lists for static geometry. Deep display lists may induce unwanted memory thrashing
• Use API matrix operations instead of your own
• Use texture to simulate complex geometry
• Use vertex arrays. Test vertex, interleaved, precompiled arrays
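
A flat display list in OpenGL is built once and replayed each frame; a minimal sketch (draw_static_geometry is a hypothetical application routine):

#include <GL/gl.h>

void draw_static_geometry(void);         /* hypothetical application routine */

/* Compile static geometry into a flat display list once at startup. */
GLuint build_static_scene(void)
{
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);
    draw_static_geometry();              /* the usual glBegin/glEnd work */
    glEndList();
    return list;
}

/* Each frame: glCallList(scene_list); */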

Performance Hints: Geometry
• Pass one normal (not 3 or 4) per flat shaded polygon
• Use a data format suitable for quick transfer to the graphics subsystem
• Disable unneeded operations (alpha blending, depth, stencil, blending, dithering, fog, etc.)

Performance Hints: Lighting
• Reduce lighting requirements:
  – Use as few lights as possible
  – Use directional (infinite) lighting. Use glLightfv(GL_LIGHTn, GL_POSITION, {x, y, z, 0});
  – Use positional lights rather than spot lights
  – Use one-sided lighting when possible (be aware of issues associated with normals)
  – Don’t change material properties frequently

Performance Hints: Lighting
• Use normalized normal vectors
  – Supply unit length vectors
  – Don’t enable GL_NORMALIZE
  – Don’t scale using model-view matrix
• Pre-multiply geometry, if possible

Performance Hints: Visuals/Pixel Formats
• Pick the correct visual. Use hardware accelerated visuals
• Structure windows and contexts to maximize performance (app may block after context swaps)
• Put GUI elements in overlay planes to avoid unwanted graphics window refreshes

Performance Hints: Buffers
• Turn off depth buffer when possible
• Use HW accelerated off-screen buffer for backing-store
• Use stencil buffer for interactive picking and quick re-render (see course notes for full algorithm)
• Use color/depth buffer data for interactive editing of complex scenes (see course notes for full algorithm)

Performance Hints: Textures
• Be aware of texture sizes
  – Reduce texture resolution
  – Use texture LOD extension (OpenGL 1.2)
• Use texture objects. Create textures once
• Don’t swap textures frequently, if possible
  – Mosaic multiple textures into one large texture
  – Sort geometry by texture
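
Texture objects in OpenGL 1.1 make "create once, bind many times" straightforward; a minimal sketch (parameters and formats chosen for illustration, not from the course):

#include <GL/gl.h>

/* Create a texture object once; later frames only bind it. */
GLuint create_texture(const unsigned char *rgba, int width, int height)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, rgba);
    return tex;
}

/* At draw time: glBindTexture(GL_TEXTURE_2D, tex); */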

Performance Hints: Textures
• Use texture as an additional data lookup to simulate more complex data:
  – Lighting, geometry, color, clipping, application-space data
• Use glTexSubImage to replace part of a texture rather than creating a whole new texture
• Avoid expensive texture filter modes
• Use texture lookup tables instead of multi-channel textures

Conclusion
Know how your application works within the system
• Don’t let caches, latencies, bandwidths, etc. slow you down
• Know how fast you can go
• Identify system performance characteristics
• Work your compiler
• Get all you can out of the hardware

Questions and Answers