Vector operations, swizzle and macros in CUDA

Whether the GPU is a SIMD processor that operates on vectors, or a collection of scalar threads running concurrently, is somewhat a matter of perspective. The programming guide describes it as SIMT (single instruction, multiple thread), which is somewhat like SIMD except that each lane is programmed as an independent scalar thread.

But regardless of whether you consider it SIMD, the way it's generally used is different. Instead of having a vector contain a single float4 and, say, adding two 4-element vectors in a single instruction like C = A + B, what's more common is to have each thread be responsible for a different point. Then it takes 4 instructions to add 32 4-element vectors: C[id].x = A[id].x + B[id].x, C[id].y = A[id].y + B[id].y, C[id].z = A[id].z + B[id].z, and C[id].w = A[id].w + B[id].w (where id is a thread ID that ranges from 0 to 31). So these 4 instructions produce 32 resulting float4 outputs, because the 32 threads run concurrently. It's data-parallel, but not parallel at the level of floats within a float4. It's not parallel across R, G, B, A within a pixel, but rather parallel across pixels. Or across vectors or nodes or "items", depending on what your application is. If that makes any sense.
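Here's a minimal sketch of that pattern as a kernel. The kernel name add_float4 and the bounds check are my own additions for illustration, but the body is exactly the four per-component adds described above:

```
// Each thread owns one float4. The four component additions are
// separate scalar instructions, and all 32 threads of a warp execute
// each of them together, so 4 instructions yield 32 float4 results.
__global__ void add_float4(const float4 *A, const float4 *B, float4 *C, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        C[id].x = A[id].x + B[id].x;
        C[id].y = A[id].y + B[id].y;
        C[id].z = A[id].z + B[id].z;
        C[id].w = A[id].w + B[id].w;
    }
}
```

Launched with something like add_float4<<<(n + 255) / 256, 256>>>(dA, dB, dC, n), each warp of 32 threads retires those four adds in lockstep.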

So every instruction is a vector instruction in the sense that if all threads are doing the same thing, it executes across all 32 threads of a warp as a single instruction. If threads within a warp are not executing the same instruction, the hardware runs the divergent paths sequentially to preserve the correctness of the threaded model, but this is avoided as much as possible because performance deteriorates quickly.
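To make the divergence case concrete, here's a hypothetical kernel (name and arithmetic made up for illustration) where the branch condition depends on the thread ID, so threads within the same warp disagree and the hardware has to serialize the two paths:

```
// The branch condition differs within a warp, so the hardware executes
// both sides one after the other, with the inactive threads masked off.
__global__ void divergent_scale(float *out, const float *in, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;

    if (id % 2 == 0)
        out[id] = in[id] * 2.0f;   // half the warp is active here...
    else
        out[id] = in[id] + 1.0f;   // ...and the other half here
}
```

A condition that is uniform within each warp, for example one based on blockIdx.x or on id / 32, avoids the serialization, since the whole warp takes the same path.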