- Number of slides: 19
General Purpose GPU (GPGPU) Aaron Smith University of Texas at Austin Spring 2003
Motivation
• Graphics processors are becoming more programmable
  – DirectX/OpenGL - Vertex and Pixel Shaders
• Explore the current state of the art
  – How would a typical application run on a GPU?
  – What are the difficulties? Requirements?
MPEG Overview
• Format for storing compressed audio and video
• Uses prediction between frames to achieve compression (exploits temporal locality)
  – “I” or intra-frames
    • simply a frame encoded as a still image (no history)
  – “P” or predicted frames
    • predicted from the most recently reconstructed I or P frame
    • can also be coded like I frames when there is no good match
  – “B” or bi-directional frames
    • predicted from the closest two I or P frames, one in the past and one in the future
    • if there is no good match, intra-coded like an I frame
• Typical sequence looks like:
  – IBBPBBPBBPBBIBBPBBPB...
• Remember what a B frame is???
  – decode the I frame, then the first P frame, then the first and second B frames
  – decode order by display index: 0 3 1 2 6 4 5
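The reordering rule above (decode each I or P frame before the B frames that precede it in display order) can be sketched in C. This is a minimal illustration, not code from the decoder: the function name `decode_order_from_display` and the string encoding of frame types are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given frame types in display order (e.g. "IBBPBBP"),
 * write display-order indices into decode_order[] in the order a decoder
 * has to process them. Each run of B frames is held back until the I or P
 * frame that follows it has been emitted. Trailing B frames with no later
 * reference frame are dropped. Returns the number of indices written.
 */
size_t decode_order_from_display(const char *types, size_t *decode_order)
{
    size_t n = strlen(types);
    size_t out = 0;
    size_t first_pending_b = 0;   /* start of the current run of held-back B frames */
    size_t i, k;

    for (i = 0; i < n; i++) {
        if (types[i] == 'B')
            continue;                  /* hold B frames back */
        decode_order[out++] = i;       /* the reference frame (I or P) goes first */
        for (k = first_pending_b; k < i; k++)
            decode_order[out++] = k;   /* then the B frames that depend on it */
        first_pending_b = i + 1;
    }
    return out;
}
```

Called with "IBBPBBP" and an array of seven entries, this fills in 0 3 1 2 6 4 5: the I frame, the first P frame, the two B frames between them, and so on.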
GPU Programming Model
• Streams programming
• Pixel shaders
  – store data in texture memory
  – use multiple passes to render and re-render to texture memory
• Vertex shaders???
  – more powerful than pixel shaders from an instruction standpoint
  – but... not very useful because of the restriction on accessing texture memory
• What are the limitations?
  – branching?
MPEG and the GPU
• decoding is sequential
• data structures are regular
  – typical video stream is 352 x 240
• basic result is pixel color data
NVIDIA Cg
• High Level Shading Language
• Vertex and Pixel Shaders
• OpenGL and DirectX Support
• Can be compiled at runtime!
Cg Profiles
1) Which profile do we choose? Will the model fit?
2) What about portability? Can we move between architectures?
DirectX 8 – PS_2_0
PS_2_0 Cont.
MPEG -> Cg Challenges
• Data types
  – float/int are the basic types on the GPU
  – unsigned char is the dominant type in MPEG
• Loops
  – Most profiles do not support loops unless they can be completely unrolled
  – e.g. loop.cg(49) : warning C7012: not unrolling loop that executes 352 times since maximum loop unroll count is 256
• No recursion
  – Normally not a problem: we can change to an iterative form
  – But on the GPU we then run into the problem with “Loops”
• Arrays
  – Severe restrictions on index variables
  – Some profiles assign each array element to a register
    • i.e. float array[10] uses ten registers
• Pointers
  – Not supported
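The data-type mismatch can be made concrete with a small C sketch of the conversion between MPEG's 8-bit samples and the floats a shader works with. It assumes that an 8-bit texture fetch yields the sample divided by 255, which is an assumption of this example and is not stated on the slides. The helper names `sample_to_unit` and `unit_to_sample` are hypothetical.

```c
/* Map an 8-bit sample to [0, 1], mimicking the assumed texture-fetch behaviour. */
float sample_to_unit(unsigned char s)
{
    return (float)s / 255.0f;
}

/* Map a float back to an 8-bit sample, clamping and rounding to nearest. */
unsigned char unit_to_sample(float f)
{
    if (f <= 0.0f)
        return 0;
    if (f >= 1.0f)
        return 255;
    return (unsigned char)(f * 255.0f + 0.5f);
}
```

Under this assumption every 8-bit value survives a round trip, so storing the decoder's unsigned char data in textures need not lose precision by itself.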
Implementation
• Only support 352 x 240 resolution
• Allocate fixed data structures to hold a frame
  – 352 x 240 = 84480 (Y) plus 2 x 21120 (U, V)
• Hold data in texture memory
• Use Cg pixel shaders
  – vertex shaders cannot access texture memory
• Work backwards
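The fixed buffer sizes follow directly from the resolution. The sketch below works the arithmetic through in C, assuming 4:2:0 chroma (each chroma plane is half the width and half the height of the luma plane), which is what a 21120-sample plane implies for a 352 x 240 frame. The names are illustrative only.

```c
#include <stddef.h>

enum { FRAME_WIDTH = 352, FRAME_HEIGHT = 240 };

/* Samples in the luma (Y) plane: one per pixel. */
size_t luma_samples(void)
{
    return (size_t)FRAME_WIDTH * FRAME_HEIGHT;
}

/* Samples in one chroma (U or V) plane, assuming 4:2:0 subsampling. */
size_t chroma_samples_420(void)
{
    return (size_t)(FRAME_WIDTH / 2) * (FRAME_HEIGHT / 2);
}

/* Total samples for one frame: Y plus the two chroma planes. */
size_t frame_samples_420(void)
{
    return luma_samples() + 2 * chroma_samples_420();
}
```

This gives 84480 luma samples, 21120 samples per chroma plane, and 126720 samples per frame in total.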
An Example C -> Cg
• Convert the MPEG decoder store() routine into a Cg shader
  – Simplify… simplify
  – Factor
store_ppm_tga() - Original
static void store_ppm_tga(outname, src, offset, incr, height, tgaflag)
char *outname;
unsigned char *src[];
int offset, incr, height;
int tgaflag;
{
  int i, j;
  int y, u, v, r, g, b;
  int crv, cbu, cgu, cgv;
  unsigned char *py, *pu, *pv;
  static unsigned char tga24[14] = {0, 0, 2, 0, 0, 0, 24, 32};
  char header[FILENAME_LENGTH];
  static unsigned char *u422, *v422, *u444, *v444;

  if (chroma_format==CHROMA444) {
    u444 = src[1];
    v444 = src[2];
  } else {
    if (!u444) {
      if (chroma_format==CHROMA420) {
        if (!(u422 = (unsigned char *)malloc((Coded_Picture_Width>>1) * Coded_Picture_Height)))
          Error("malloc failed");
        if (!(v422 = (unsigned char *)malloc((Coded_Picture_Width>>1) * Coded_Picture_Height)))
          Error("malloc failed");
      }
      if (!(u444 = (unsigned char *)malloc(Coded_Picture_Width * Coded_Picture_Height)))
        Error("malloc failed");
      else {
        conv422to444(src[1], u444);
        conv422to444(src[2], v444);
      }
    }
  }

  strcat(outname, tgaflag ? ".tga" : ".ppm");
  if ((outfile = open(outname, O_CREAT|O_TRUNC|O_WRONLY|O_BINARY, 0666))==-1) {
    sprintf(Error_Text, "Couldn't create %s\n", outname);
    Error(Error_Text);
  }
  optr = obfr;

  if (tgaflag) {
    /* TGA header */
    for (i=0; i<12; i++)
      putbyte(tga24[i]);
    putword(horizontal_size);
    putword(height);
    putbyte(tga24[12]);
    putbyte(tga24[13]);
  }

  crv = Inverse_Table_6_9[matrix_coefficients][0];
  cbu = Inverse_Table_6_9[matrix_coefficients][1];
  cgu = Inverse_Table_6_9[matrix_coefficients][2];
  cgv = Inverse_Table_6_9[matrix_coefficients][3];

  for (i=0; i ... (listing truncated)
Quick Analysis
• Pointers – Remove
• Conditionals (if/else) – Remove
• Dynamic memory – Remove
• File I/O – Remove
• Table lookups – Remove
• Constant array indexes – OK!
• Constant loop invariants – OK!
store_tga() - Simplified
static void store_tga(unsigned char *src[])
{
  int i, j;
  int y, u, v, r, g, b;
  int crv, cbu, cgu, cgv;
  int incr = 352;
  int height = 240;
  int data_idx = 0; /* index into BitMap.data[] */
  static unsigned char u422[176*240];
  static unsigned char v422[176*240];
  static unsigned char u444[352*240];
  static unsigned char v444[352*240];

  /* matrix coefficients */
  crv = 104597;
  cbu = 132201;
  cgu = 25675;
  cgv = 53279;

  /* convert YUV to RGB */
  for (i=0; i ... (listing truncated)
Quick Analysis
• Removed
  – If/else
  – Pointers
  – File I/O
  – Table lookups
• What’s left?
  – Function calls (for chrominance conversion)
    • conv420to422() and conv422to444()
  – YUV to RGB loop
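For orientation, here is a hedged C sketch of what one iteration of a fixed-point YUV to RGB loop could look like with the four matrix coefficients from the simplified listing. The loop body is not visible in the slides, so the rest is assumed for illustration: the luma scale 76309 (about 255/219 in 16.16 fixed point), the offsets 16 and 128, the rounding term, and the names `clip_fixed16` and `yuv_to_rgb_pixel`.

```c
/* Matrix coefficients from the simplified listing, scaled by 65536. */
enum { CRV = 104597, CBU = 132201, CGU = 25675, CGV = 53279 };

/* Assumed luma scale: (255/219) * 65536, for video-range luma (16..235). */
enum { Y_SCALE = 76309 };

/* Round a 16.16 fixed-point value to an integer and clamp it to 0..255. */
static unsigned char clip_fixed16(long v)
{
    v += 32768;                 /* add 0.5 in 16.16 fixed point */
    if (v < 0)
        return 0;
    v >>= 16;
    return (unsigned char)(v > 255 ? 255 : v);
}

/* Convert one pixel; rgb[0..2] receive R, G and B. */
void yuv_to_rgb_pixel(unsigned char y_in, unsigned char u_in, unsigned char v_in,
                      unsigned char rgb[3])
{
    long y = (long)Y_SCALE * (y_in - 16);   /* assumed luma offset */
    long u = (long)u_in - 128;              /* assumed chroma offset */
    long v = (long)v_in - 128;

    rgb[0] = clip_fixed16(y + CRV * v);
    rgb[1] = clip_fixed16(y - CGU * u - CGV * v);
    rgb[2] = clip_fixed16(y + CBU * u);
}
```

Everything here is integer arithmetic with constant coefficients and no table lookups, which is the shape the analysis above is aiming for before the move to floats on the GPU.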
YUV -> RGB (cg)
float3 main(
    in float3 texcoords0 : TEXCOORD0,      /* texture coord */
    uniform sampler2D yImage : TEXUNIT0,   /* handle to texture with Y data */
    in float3 texcoords1 : TEXCOORD1,      /* texture coord */
    uniform sampler2D uImage : TEXUNIT1,   /* handle to texture with U data */
    in float3 texcoords2 : TEXCOORD2,      /* texture coord */
    uniform sampler2D vImage : TEXUNIT2    /* handle to texture with V data */
) : COLOR
{
    float3 yuvcolor; // f(xyz) -> yvu
    float3 rgbcolor;

    yuvcolor.x = tex2D(yImage, texcoords0).x;
    yuvcolor.z = tex2D(uImage, texcoords1).y - 0.5;
    yuvcolor.y = tex2D(vImage, texcoords2).z - 0.5;

    rgbcolor.r = 2*(yuvcolor.x/2 + 1.402/2 * yuvcolor.z);
    rgbcolor.g = 2*(yuvcolor.x/2 - 0.344136 * yuvcolor.y/2 - 0.714136 * yuvcolor.z/2);
    rgbcolor.b = 2*(yuvcolor.x/2 + 1.773/2 * yuvcolor.y);

    return rgbcolor;
}

Compiled output: 17 instructions, 2 R-regs
[Assembly listing: samplers s0-s2 (dcl_2d), texture coordinates t0-t2 (.xyz), constants c0-c3, three texld fetches followed by add, mov, dp3 and mul arithmetic, result written to oC0.]
Quick Analysis
• YUV -> RGB
  – 17 instructions and 2 registers
  – 352 x 240 = 84480 px * 17 = ~1.4M instr/frame
Just for Fun
• What if we needed 1024 instructions??
  – 352 x 240 = 84480 px * 1024 = 86,507,520 instr/frame
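The per-frame estimates are a straight multiplication of pixel count by shader length, sketched below in C. The per-second variant takes a frame rate parameter. No frame rate is given in the slides, so any value passed in (for example 30) is an assumption of the caller. Function names are illustrative.

```c
/* Pixel-shader instructions executed for one frame. */
unsigned long instructions_per_frame(unsigned long width, unsigned long height,
                                     unsigned long instr_per_pixel)
{
    return width * height * instr_per_pixel;
}

/* Instructions per second at a caller-supplied (assumed) frame rate. */
unsigned long instructions_per_second(unsigned long width, unsigned long height,
                                      unsigned long instr_per_pixel,
                                      unsigned long frames_per_second)
{
    return instructions_per_frame(width, height, instr_per_pixel) * frames_per_second;
}
```

For 352 x 240, a 17-instruction shader comes to 1,436,160 instructions per frame and a 1024-instruction shader to 86,507,520. At an assumed 30 frames per second the latter is roughly 2.6 billion instructions per second.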


