Vector operations, swizzle and macros in CUDA

Hi,
I’m sorry if this has been asked before, but I couldn’t find any references…

From what I know, the shader units that execute CUDA are vector SIMD processors that support swizzle operators and vector instructions at the hardware level…

For example, in a shader it was perfectly legal to write something like “position.xyz /= position.w” or “uv.xy = uv.yx”;
scalar operations were in fact avoided, because the GPU could use float4s natively and this greatly boosted performance.

Now, in CUDA, everything looks like a scalar instruction; even a float4 is just made of 4 single floats… from what I know this is the worst thing one could do with a SIMD core, so here’s finally the question :D
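To show what I mean, this is roughly how the perspective divide from the shader example above ends up looking in CUDA, with every component written out by hand (just a made-up sketch of a device function, the name is mine):

__device__ float4 perspectiveDivide(float4 position)
{
    // no swizzle in CUDA: each component of the float4 is touched explicitly
    position.x /= position.w;
    position.y /= position.w;
    position.z /= position.w;
    return position;
}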

Is CUDA code eventually compiled to use vector instructions like in shaders, or is what we see what the GPU executes?
And if so, why doesn’t this kill performance?

Sorry if the question is really n00b :D

CUDA (and the GPUs) are really scalar and (roughly) what we see is what runs.

Warps of threads can be viewed as SIMD (and I am pretty sure that is how the rendering pipeline treats them), although it is probably better thought of as “Single Program Multiple Data”; I am pretty sure the ROPs just see each TPA as a SIMD stream, but the inner workings are scalar.

Whether the GPU is a SIMD processor that operates on vectors, or many scalar threads running concurrently, is somewhat a matter of perspective. The programming guide describes it as SIMT (single instruction, multiple thread), which is somewhat like SIMD except that the threads are treated as scalar threads.

But regardless of whether you consider it SIMD, the way it’s generally used is different. Instead of having a vector register hold a single float4 and adding two 4-element vectors in a single instruction like C = A + B, what’s more common is to have each thread be responsible for a different point. It then takes 4 instructions to add 32 4-element vectors: C[id].x = A[id].x + B[id].x, C[id].y = A[id].y + B[id].y, C[id].z = A[id].z + B[id].z, and C[id].w = A[id].w + B[id].w (where id is a thread ID that ranges from 0 to 31). Those 4 instructions produce 32 float4 results, because the 32 threads run concurrently. It’s data-parallel, but not parallel at the level of the floats within a float4. It’s not parallel across R, G, B, A within a pixel, but rather parallel across pixels, or across vectors or nodes or “items”, depending on what your application is. If that makes any sense.
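A minimal sketch of that pattern as a CUDA kernel (the kernel name and launch configuration are just illustrative):

__global__ void addFloat4(const float4 *A, const float4 *B, float4 *C, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per float4
    if (id < n) {
        // four scalar adds per thread; across a warp each line covers 32 float4s
        C[id].x = A[id].x + B[id].x;
        C[id].y = A[id].y + B[id].y;
        C[id].z = A[id].z + B[id].z;
        C[id].w = A[id].w + B[id].w;
    }
}

// launched e.g. as: addFloat4<<<(n + 255) / 256, 256>>>(A, B, C, n);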

So every instruction is a vector instruction in the sense that, if all threads are doing the same thing, it executes across all 32 threads of a warp as a single instruction. If threads within a warp are not executing the same instruction, the hardware can run the threads sequentially to maintain the correctness of the threaded model, but this is avoided as much as possible because performance deteriorates quickly.
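As a hypothetical illustration of that divergence penalty: in the kernel below, even and odd lanes of the same warp take different branches, so the hardware has to execute the two paths one after the other instead of together.

__global__ void divergentKernel(float *data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        if (threadIdx.x % 2 == 0)
            data[id] *= 2.0f;     // even lanes run this path while odd lanes wait
        else
            data[id] += 1.0f;     // then odd lanes run this path while even lanes wait
    }
}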

So instead of using a whole SIMD unit for one float4, the hardware uses it for 4 threads that each operate on 4 floats, and those threads are merged (executed together) whenever possible?
This way it makes sense :rolleyes:

I had thought of something like this, but I couldn’t be really sure… thanks for the reply!