Vector operations, swizzle and macros in CUDA

Hi,
I’m sorry if this has been asked before, but I couldn’t find any references…

From what I know, the shader units that execute CUDA are vector SIMD processors that support swizzle operators and vector instructions at the hardware level…

For example, in a shader it was perfectly legal to write something like “position.xyz /= position.w” or “uv.xy = uv.yx”;
scalar operations were in fact avoided, because the GPU could use float4s natively and this greatly boosted performance.

Now, in CUDA, everything looks like a scalar instruction; even a float4 is just made of 4 single floats… from what I know this is the worst thing one could do with a SIMD core, so here’s finally the question :D
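To show what I mean, this is roughly how the perspective divide from the shader example above ends up looking in CUDA, with every component written out by hand (just a made-up sketch of a device function, the name is mine):

__device__ float4 perspectiveDivide(float4 position)
{
    // no swizzle in CUDA: each component of the float4 is touched explicitly
    position.x /= position.w;
    position.y /= position.w;
    position.z /= position.w;
    return position;
}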

Is CUDA code eventually compiled to use vector instructions like in shaders, or is what we see what the GPU executes?
And if so, why doesn’t this kill performance?

Sorry if the question is really n00b :D

CUDA (and the GPUs) are really scalar and (roughly) what we see is what runs.

Warps of threads can be viewed as SIMD (and I am pretty sure that is how the rendering pipeline treats them), although it is probably better thought of as “Single Program Multiple Data”; I am pretty sure the ROPs just see each TPA as a SIMD stream, but the inner workings are scalar.

Whether the GPU is a SIMD processor that operates on vectors, or many scalar threads running concurrently, is somewhat a matter of perspective. The programming guide describes it as SIMT (single instruction, multiple thread), which is somewhat like SIMD except that the threads are treated as scalar threads.

But regardless of whether you consider it SIMD, the way it’s generally used is different. Instead of having a vector register hold a single float4 and adding two 4-element vectors in a single instruction like C = A + B, what’s more common is to have each thread be responsible for a different point. It then takes 4 instructions to add 32 4-element vectors: C[id].x = A[id].x + B[id].x, C[id].y = A[id].y + B[id].y, C[id].z = A[id].z + B[id].z, and C[id].w = A[id].w + B[id].w (where id is a thread ID that ranges from 0 to 31). Those 4 instructions produce 32 float4 results, because the 32 threads run concurrently. It’s data-parallel, but not parallel at the level of the floats within a float4. It’s not parallel across R, G, B, A within a pixel, but rather parallel across pixels, or across vectors or nodes or “items”, depending on what your application is. If that makes any sense.
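A minimal sketch of that pattern as a CUDA kernel (the kernel name and launch configuration are just illustrative):

__global__ void addFloat4(const float4 *A, const float4 *B, float4 *C, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per float4
    if (id < n) {
        // four scalar adds per thread; across a warp each line covers 32 float4s
        C[id].x = A[id].x + B[id].x;
        C[id].y = A[id].y + B[id].y;
        C[id].z = A[id].z + B[id].z;
        C[id].w = A[id].w + B[id].w;
    }
}

// launched e.g. as: addFloat4<<<(n + 255) / 256, 256>>>(A, B, C, n);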

So every instruction is a vector instruction in the sense that, if all threads are doing the same thing, it executes across all 32 threads of a warp as a single instruction. If threads within a warp are not executing the same instruction, the hardware can run the threads sequentially to maintain the correctness of the threaded model, but this is avoided as much as possible because performance deteriorates quickly.
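As a hypothetical illustration of that divergence penalty: in the kernel below, even and odd lanes of the same warp take different branches, so the hardware has to execute the two paths one after the other instead of together.

__global__ void divergentKernel(float *data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        if (threadIdx.x % 2 == 0)
            data[id] *= 2.0f;     // even lanes run this path while odd lanes wait
        else
            data[id] += 1.0f;     // then odd lanes run this path while even lanes wait
    }
}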

So instead of using a whole SIMD unit for one float4, the hardware uses it for 4 threads that each operate on 4 floats, and those threads are merged (executed together) whenever possible?
This way it makes sense :rolleyes:

I had thought of something like this, but I couldn’t be really sure… thanks for the reply!