General Purpose GPU (GPGPU)
Aaron Smith
University of Texas at Austin
Spring 2003

Motivation
• Graphics processors are becoming more programmable
  – DirectX/OpenGL - Vertex and Pixel Shaders
• Explore the current state of the art
  – How would a typical application run on a GPU?
  – What are the difficulties? Requirements?

MPEG Overview
• Format for storing compressed audio and video
• Uses prediction between frames to achieve compression (exploits temporal locality)
  – “I” or intra-frames
    • simply a frame encoded as a still image (no history)
  – “P” or predicted frames
    • predicted from the most recently reconstructed I or P frame
    • can also be treated like I frames when there is no good match
  – “B” or bi-directional frames
    • predicted from the closest two I or P frames, one in the past and one in the future
    • if there is no good match, intra-coded like an I frame
• Typical sequence looks like:
  – IBBPBBPBBPBBIBBPBBPB...
• Remember what a B frame is?
  – decode the I frame, then the first P frame, then the first and second B frames
  – 0xx312645
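
The reordering described on this slide can be sketched in C. This is a minimal illustration, not the decoder's own code: the function name `decode_order` and the rule that trailing B frames are flushed at the end of the input are assumptions made for the example.

```c
#include <stddef.h>

/*
 * Reorder a display-order frame-type string such as "IBBPBBP" into
 * decode order. out[k] receives the display index of the k-th frame
 * to decode. Returns the number of indices written (equal to n).
 *
 * Assumption (not from the slides): B frames with no later I or P
 * frame in the input are simply flushed at the end.
 */
size_t decode_order(const char *types, size_t n, size_t *out)
{
    size_t written = 0;
    size_t first_pending = 0; /* display index of the first held-back B frame */
    size_t pending = 0;       /* B frames waiting for their future reference */

    for (size_t i = 0; i < n; i++) {
        if (types[i] == 'B') {
            if (pending == 0)
                first_pending = i;
            pending++;
        } else {
            /* I or P frame: decode it first, then the B frames that needed it */
            out[written++] = i;
            for (size_t k = 0; k < pending; k++)
                out[written++] = first_pending + k;
            pending = 0;
        }
    }
    for (size_t k = 0; k < pending; k++)
        out[written++] = first_pending + k;
    return written;
}
```

Calling `decode_order("IBBPBBP", 7, out)` fills `out` with 0, 3, 1, 2, 6, 4, 5: each reference frame comes before the B frames that depend on it.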

GPU Programming Model
• Streams Programming
• Pixel Shaders
  – store data in texture memory
  – use multiple passes to render and re-render to texture memory
• Vertex Shaders?
  – more powerful than pixel shaders from an instruction standpoint
  – but... not very useful because of restrictions on accessing texture memory
• What are the limitations?
  – branching?
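
The multi-pass idea can be imitated on the CPU with two buffers that swap roles after every pass. The sketch below is only an analogy for render-to-texture: `run_passes`, `pixel_kernel`, and the sample kernel `halve` are hypothetical names, and a real GPU would run the kernel for all pixels in parallel.

```c
#include <stddef.h>

/* A per-pixel kernel: one input texel in, one output texel out. */
typedef float (*pixel_kernel)(float texel);

/* Hypothetical example kernel used only for illustration. */
float halve(float texel) { return texel * 0.5f; }

/*
 * Run `passes` passes over n texels. Each pass reads one buffer and
 * writes the other, then the two swap roles ("ping-pong"), much as a
 * pixel shader renders to a texture that the next pass reads back.
 * Returns a pointer to whichever buffer holds the final result.
 */
float *run_passes(float *a, float *b, size_t n, int passes, pixel_kernel k)
{
    float *src = a;
    float *dst = b;
    for (int p = 0; p < passes; p++) {
        for (size_t i = 0; i < n; i++)
            dst[i] = k(src[i]);
        float *tmp = src;
        src = dst;
        dst = tmp;
    }
    return src;
}
```

The kernel never reads and writes the same buffer within a pass, which mirrors the restriction that a shader cannot sample the texture it is currently rendering into.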

MPEG and the GPU
• decoding is sequential
• data structures are regular
  – typical video stream is 352 x 240
• basic result is pixel color data

NVIDIA Cg
• High Level Shading Language
• Vertex and Pixel Shaders
• OpenGL and DirectX Support
• Can be compiled at runtime!

Cg Profiles
1) Which profile do we choose? Will the model fit?
2) What about portability? Can we move between architectures?

DirectX 8 – PS_2_0

PS_2_0 Cont.

MPEG -> Cg Challenges
• Data Types
  – float/int are the basic types on the GPU
  – unsigned char is the dominant type in MPEG
• Loops
  – Most profiles do not support loops unless they can be completely unrolled
  – e.g. loop.cg(49) : warning C7012: not unrolling loop that executes 352 times since maximum loop unroll count is 256
• No recursion
  – Normally not a problem: we can change to an iterative version
  – But on the GPU we have a problem with “Loops”
• Arrays
  – Severe restrictions on index variables
  – Some profiles assign each array element to a register
    • e.g. float array[10] uses ten registers
• Pointers
  – Not supported
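
The data-type mismatch can be made concrete with a small C sketch. It assumes the 8-bit samples are stored in a normalized texture, so a fetch returns `sample / 255` as a float in [0, 1]; the helper names `texel_from_byte` and `byte_from_texel` are hypothetical.

```c
/* 8-bit sample -> normalized float, as a fetch from an 8-bit
 * normalized texture would return it (assumed storage format). */
float texel_from_byte(unsigned char s)
{
    return (float)s / 255.0f;
}

/* Normalized float -> 8-bit sample, clamping to [0, 1] and rounding
 * to the nearest integer. */
unsigned char byte_from_texel(float t)
{
    if (t <= 0.0f)
        return 0;
    if (t >= 1.0f)
        return 255;
    return (unsigned char)(t * 255.0f + 0.5f);
}
```

With this mapping the value 128 that MPEG code subtracts from chroma samples corresponds to roughly 0.5 in shader arithmetic.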

Implementation
• Only support 352 x 240 resolution
• Allocate fixed data structures to hold a frame
  – 352 x 240 = 84480 + 2 x 21120 (yuv)
• Hold data in texture memory
• Use Cg pixel shaders
  – vertex shaders cannot access texture memory
• Work backwards
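
A fixed-size frame layout along these lines might look as follows in C. This is a sketch assuming 4:2:0 chroma, so the U and V planes are 176 x 120 = 21120 samples each; the type name `frame420` and its field names are made up for the example.

```c
#include <stddef.h>

enum {
    FRAME_W  = 352,
    FRAME_H  = 240,
    CHROMA_W = FRAME_W / 2, /* 4:2:0 halves chroma horizontally... */
    CHROMA_H = FRAME_H / 2  /* ...and vertically (assumed format) */
};

/* One decoded frame with statically sized planes: no malloc, no pointers. */
typedef struct {
    unsigned char y[FRAME_W * FRAME_H];   /* 84480 luma samples */
    unsigned char u[CHROMA_W * CHROMA_H]; /* 21120 chroma samples */
    unsigned char v[CHROMA_W * CHROMA_H]; /* 21120 chroma samples */
} frame420;

/* Total storage for one frame in bytes. */
size_t frame420_bytes(void)
{
    return sizeof(frame420);
}
```

Fixing the resolution at compile time makes every array bound and loop count a constant, which is what the restricted shader profiles require.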

An Example C -> Cg
• Convert the MPEG decoder store() routine into a Cg shader
  – Simplify…simplify
  – Factor

store_ppm_tga() - Original

static void store_ppm_tga(outname, src, offset, incr, height, tgaflag)
char *outname;
unsigned char *src[];
int offset, incr, height;
int tgaflag;
{
  int i, j;
  int y, u, v, r, g, b;
  int crv, cbu, cgu, cgv;
  unsigned char *py, *pu, *pv;
  static unsigned char tga24[14] = {0,0,2,0,0,0,0, 0,0,0,0,0,24,32};
  char header[FILENAME_LENGTH];
  static unsigned char *u422, *v422, *u444, *v444;

  if (chroma_format==CHROMA444)
  {
    u444 = src[1];
    v444 = src[2];
  }
  else
  {
    if (!u444)
    {
      /* first call: allocate the chroma conversion buffers */
      if (chroma_format==CHROMA420)
      {
        if (!(u422 = (unsigned char *)malloc((Coded_Picture_Width>>1)
                                             * Coded_Picture_Height)))
          Error("malloc failed");
        if (!(v422 = (unsigned char *)malloc((Coded_Picture_Width>>1)
                                             * Coded_Picture_Height)))
          Error("malloc failed");
      }

      if (!(u444 = (unsigned char *)malloc(Coded_Picture_Width
                                           * Coded_Picture_Height)))
        Error("malloc failed");
      if (!(v444 = (unsigned char *)malloc(Coded_Picture_Width
                                           * Coded_Picture_Height)))
        Error("malloc failed");
    }

    if (chroma_format==CHROMA420)
    {
      conv420to422(src[1], u422);
      conv420to422(src[2], v422);
      conv422to444(u422, u444);
      conv422to444(v422, v444);
    }
    else
    {
      conv422to444(src[1], u444);
      conv422to444(src[2], v444);
    }
  }

  strcat(outname, tgaflag ? ".tga" : ".ppm");

  if ((outfile = open(outname, O_CREAT|O_TRUNC|O_WRONLY|O_BINARY, 0666))==-1)
  {
    sprintf(Error_Text, "Couldn't create %s\n", outname);
    Error(Error_Text);
  }

  optr = obfr;

  if (tgaflag)
  {
    /* TGA header */
    for (i=0; i<12; i++)
      putbyte(tga24[i]);

    putword(horizontal_size); putword(height);
    putbyte(tga24[12]); putbyte(tga24[13]);
  }

  /* matrix coefficients */
  crv = Inverse_Table_6_9[matrix_coefficients][0];
  cbu = Inverse_Table_6_9[matrix_coefficients][1];
  cgu = Inverse_Table_6_9[matrix_coefficients][2];
  cgv = Inverse_Table_6_9[matrix_coefficients][3];

  /* The loop headers, sample fetch, and red channel were lost in the slide
     export. They are restored here to match the surviving g and b lines;
     the 76309 luma scale follows the MPEG-2 reference decoder. */
  for (i=0; i<height; i++)
  {
    py = src[0] + offset + incr*i;
    pu = u444 + offset + incr*i;
    pv = v444 + offset + incr*i;

    for (j=0; j<horizontal_size; j++)
    {
      u = *pu++ - 128;
      v = *pv++ - 128;
      y = 76309 * (*py++ - 16); /* (255/219)*65536 */

      r = Clip[(y + crv*v + 32768)>>16];
      g = Clip[(y - cgu*u - cgv*v + 32768)>>16];
      b = Clip[(y + cbu*u + 32768)>>16];

      if (tgaflag)
      {
        putbyte(b); putbyte(g); putbyte(r);
      }
      else
      {
        putbyte(r); putbyte(g); putbyte(b);
      }
    }
  }

  if (optr!=obfr)
    write(outfile, obfr, optr-obfr);

  close(outfile);
}

Quick Analysis
• Pointers
  – Remove
• Conditionals (if/else)
  – Remove
• Dynamic Memory
  – Remove
• File I/O
  – Remove
• Table lookups
  – Remove
• Constant array indexes
  – OK!
• Constant loop invariants
  – OK!

store_tga() - Simplified

static void store_tga(unsigned char *src[])
{
  int i, j;
  int y, u, v, r, g, b;
  int crv, cbu, cgu, cgv;
  int incr = 352;
  int height = 240;
  int data_idx = 0;   /* index into BitMap.data[] */
  static unsigned char u422[176*240];
  static unsigned char v422[176*240];
  static unsigned char u444[352*240];
  static unsigned char v444[352*240];

  /* chrominance conversion: 4:2:0 -> 4:2:2 -> 4:4:4 */
  conv420to422(src[1], u422);   /* u422 = src[1] */
  conv420to422(src[2], v422);   /* v422 = src[2] */
  conv422to444(u422, u444);     /* u444 = u422 */
  conv422to444(v422, v444);     /* v444 = v422 */

  /* matrix coefficients */
  crv = 104597; cbu = 132201; cgu = 25675; cgv = 53279;

  /* 352 x 240 x 3 frame */
  BitMap.channels = 3;
  BitMap.size_x = 352;
  BitMap.size_y = 240;

#define CLIP(x) ( ((x)<0) ? 0 : (((x)>255) ? 255 : (x)) )

  /* convert YUV to RGB */
  /* The loop headers and sample fetch were lost in the slide export; the
     array-indexed form below is an assumption that mirrors the original
     routine without pointer arithmetic. */
  for (i=0; i<height; i++)
  {
    for (j=0; j<incr; j++)
    {
      u = u444[incr*i + j] - 128;
      v = v444[incr*i + j] - 128;
      y = 76309 * (src[0][incr*i + j] - 16);   /* (255/219)*65536 */

      r = CLIP((y + crv*v + 32768)>>16);
      g = CLIP((y - cgu*u - cgv*v + 32768)>>16);
      b = CLIP((y + cbu*u + 32768)>>16);

      BitMap.data[data_idx++] = r;
      BitMap.data[data_idx++] = g;
      BitMap.data[data_idx++] = b;
    }
  }

#ifdef _WIN32
  /* output the frame */
  DrawGLScene((tImageTGA *)&BitMap);
#endif
}
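
The per-pixel arithmetic can be isolated into a standalone function for testing. The sketch below uses the matrix coefficients from the slide; the luma scale `76309 * (Y - 16)` and the 128 chroma offset are assumptions taken from the MPEG-2 reference decoder, and the names `rgb8`, `clip_shift16`, and `yuv_to_rgb_fixed` are hypothetical. It assumes `int` is at least 32 bits wide.

```c
typedef struct { unsigned char r, g, b; } rgb8;

/* Drop the 16 fractional bits and clamp to 0..255. Negative values are
 * clamped before shifting so no right shift of a negative int occurs. */
static unsigned char clip_shift16(int x)
{
    if (x < 0)
        return 0;
    x >>= 16;
    return (unsigned char)(x > 255 ? 255 : x);
}

/* Convert one 8-bit Y, U, V triple to RGB in 16.16 fixed point. */
rgb8 yuv_to_rgb_fixed(unsigned char ys, unsigned char us, unsigned char vs)
{
    /* matrix coefficients from the slide */
    const int crv = 104597, cbu = 132201, cgu = 25675, cgv = 53279;

    int y = 76309 * (ys - 16); /* assumed luma scale: (255/219)*65536 */
    int u = us - 128;
    int v = vs - 128;
    rgb8 out;

    /* 32768 is 0.5 in 16.16 fixed point, added for rounding */
    out.r = clip_shift16(y + crv * v + 32768);
    out.g = clip_shift16(y - cgu * u - cgv * v + 32768);
    out.b = clip_shift16(y + cbu * u + 32768);
    return out;
}
```

Video black (16, 128, 128) maps to RGB (0, 0, 0) and video white (235, 128, 128) maps to (255, 255, 255).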

Quick Analysis
• Removed
  – If/else
  – Pointers
  – File I/O
  – Table lookups
• What’s Left?
  – Function calls (for chrominance conversion)
    • conv420to422() and conv422to444()
  – YUV to RGB loop
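
For intuition about what the remaining chrominance conversion does, here is a deliberately simplified C sketch. It swaps in plain sample replication and goes from 4:2:0 straight to 4:4:4 in one step; the decoder's own conv420to422() and conv422to444() routines may apply interpolation filters instead, and the name `upsample420_nearest` is hypothetical.

```c
#include <stddef.h>

/*
 * Expand a 4:2:0 chroma plane of size (w/2) x (h/2) to a full w x h
 * plane by replicating each sample into a 2 x 2 block.
 * Simplification: nearest-neighbour replication, not a filter.
 */
void upsample420_nearest(const unsigned char *src, unsigned char *dst,
                         size_t w, size_t h)
{
    size_t cw = w / 2; /* chroma plane width */
    for (size_t row = 0; row < h; row++)
        for (size_t col = 0; col < w; col++)
            dst[row * w + col] = src[(row / 2) * cw + (col / 2)];
}
```

Each output sample depends only on its own coordinates, which is the shape of computation a pixel shader expresses naturally.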

YUV -> RGB (Cg)

float3 main(
    in float3 texcoords0 : TEXCOORD0,       /* texture coord */
    uniform sampler2D yImage : TEXUNIT0,    /* handle to texture with Y data */
    in float3 texcoords1 : TEXCOORD1,       /* texture coord */
    uniform sampler2D uImage : TEXUNIT1,    /* handle to texture with U data */
    in float3 texcoords2 : TEXCOORD2,       /* texture coord */
    uniform sampler2D vImage : TEXUNIT2     /* handle to texture with V data */
) : COLOR
{
  float3 yuvcolor;   // f(xyz) -> yvu
  float3 rgbcolor;

  yuvcolor.x = tex2D(yImage, texcoords0).x;
  yuvcolor.z = tex2D(uImage, texcoords1).y - 0.5;
  yuvcolor.y = tex2D(vImage, texcoords2).z - 0.5;

  rgbcolor.r = 2*(yuvcolor.x/2 + 1.402/2 * yuvcolor.z);
  rgbcolor.g = 2*(yuvcolor.x/2 - 0.344136 * yuvcolor.y/2 - 0.714136 * yuvcolor.z/2);
  rgbcolor.b = 2*(yuvcolor.x/2 + 1.773/2 * yuvcolor.y);

  return rgbcolor;
}

Compiled output: 17 instructions, 2 R-regs.

  dcl_2d s0
  dcl_2d s1
  dcl_2d s2
  dcl t0.xyz
  dcl t1.xyz
  dcl t2.xyz

[Assembly listing: the texld, add, mov, dp3 and mul instructions and the constant definitions c0 to c3 are scrambled in the slide export and are not reproduced here.]
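
A CPU-side float version of the same arithmetic is handy for checking shader output. The C sketch below mirrors the shader's expressions and its x/y/z component naming; `rgbf`, `saturate`, and `shader_reference` are hypothetical names, and the clamp to [0, 1] is an assumption standing in for what a fixed-point color output would do.

```c
typedef struct { float r, g, b; } rgbf;

/* Clamp to [0, 1] (assumed behaviour of the color output). */
static float saturate(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* Inputs are normalized texel values in [0, 1], one per texture. */
rgbf shader_reference(float y_tex, float u_tex, float v_tex)
{
    float x = y_tex;        /* yuvcolor.x in the shader */
    float z = u_tex - 0.5f; /* yuvcolor.z in the shader */
    float y = v_tex - 0.5f; /* yuvcolor.y in the shader */
    rgbf out;

    /* The shader's 2*(a/2 + b/2) form reduces to a + b. */
    out.r = saturate(x + 1.402f * z);
    out.g = saturate(x - 0.344136f * y - 0.714136f * z);
    out.b = saturate(x + 1.773f * y);
    return out;
}
```

With both chroma inputs at 0.5 the offsets vanish, so the output is a gray level equal to the luma input.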

Quick Analysis
• YUV -> RGB
  – 17 instructions and 2 registers
  – 352 x 240 = 84480 px * 17 = ~1.4M instr/frame

Just for Fun
• What if we needed 1024 instructions?
  – 352 x 240 = 84480 px * 1024 = 86,507,520 instr/frame
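
The per-frame instruction counts on the last two slides follow from one multiplication, sketched here in C. The function name `instr_per_frame` is hypothetical, and the model assumes the shader runs exactly once per output pixel.

```c
/* Shader instructions executed per frame, assuming one shader
 * invocation per output pixel. */
unsigned long instr_per_frame(unsigned w, unsigned h, unsigned instr_per_pixel)
{
    return (unsigned long)w * h * instr_per_pixel;
}
```

For a 352 x 240 frame this gives 1,436,160 instructions at 17 instructions per pixel and 86,507,520 at 1024 instructions per pixel.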