SIMD on GPU

Hi,

I have a few basic questions about the execution of vector instructions:

  1. Why are there no special vector instructions like SSE on the CPU (e.g. _mm_mul_ps()…)?

  2. When I have something like
    float4 a,b,c;
    c.x=a.x+b.x;
    c.y=a.y+b.y;
    c.z=a.z+b.z;
    c.w=a.w+b.w;
    Does this code evaluate to a single instruction on the GPU? What happens if the w-component is set by a different operation or to a different value?

  3. On a CPU, SSE instructions allow you to process multiple vectors at once by packing the data together (array of structures vs. structure of arrays). Is this also possible on the GPU? Would it make my code faster?

  4. Is there anything special (dos and don’ts, best practices) to watch for when porting code that makes heavy use of SIMD/SSE optimizations?

While the hardware is SIMD, you cannot access the vectors directly. Basically each thread only sees the scalar at one specific element index. So each element in your 8-wide vector register is associated with one thread. Have a look at the Programming Guide where all of this (and why SoA is good for you) is explained.

Thanks, I guess that answered my third question, but I’m not sure you understood me correctly (or vice versa). I’m actually talking about geometric 4D vectors, not the vector data structure. The code was meant to be completely inside a kernel, with a and b coming from some array of float4 and c written to another array. If you could execute 4 ops at the same time in a single thread, wouldn’t that be faster than executing 1 op at a time in each of 4 threads (threading overhead?)?
I’m doing some graphics programming, so I just wondered why you don’t have vector instructions (overloaded operators, dot/cross product…) in CUDA although you have them in GLSL/HLSL/Cg.
So how do kernels differ from shader programs? I have the feeling they don’t have as much in common as I think.
I know these are basic questions, but I have some trouble understanding the guides (English isn’t my native language…)

Ahh! … The “threads” are not fully autonomous; they behave more like the lanes of a very wide (SSE-style) vector, so there is no threading overhead. Doing 4 ops at once in each of the 32 threads of a warp might be faster, but the hardware would then also be 4 times as expensive and hot, with no obvious advantage over simply adding 4 times as many processors executing 4 times as many threads.

If you had an array of float4 and wanted to do similar operations on each of the 4 vector elements, you could load one element in each of 4 neighboring threads.
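To make that mapping concrete, here is a rough sketch in plain host-side C++ rather than CUDA. A loop stands in for the thread index; in a real kernel each iteration would be its own thread, with the index derived from threadIdx and blockIdx. Float4, get, set and add_per_component are made-up names for illustration, not part of any CUDA API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's float4: four packed floats.
struct Float4 { float x, y, z, w; };

// Read component i (0..3) of a Float4.
inline float get(const Float4& v, std::size_t i) {
    switch (i) {
        case 0: return v.x;
        case 1: return v.y;
        case 2: return v.z;
        default: return v.w;
    }
}

// Write component i (0..3) of a Float4.
inline void set(Float4& v, std::size_t i, float value) {
    switch (i) {
        case 0: v.x = value; break;
        case 1: v.y = value; break;
        case 2: v.z = value; break;
        default: v.w = value; break;
    }
}

// Emulates "one thread per scalar component": logical thread t handles
// component (t % 4) of vector (t / 4), so 4 neighboring threads cover one
// float4. Assumes a and b have the same length.
std::vector<Float4> add_per_component(const std::vector<Float4>& a,
                                      const std::vector<Float4>& b) {
    std::vector<Float4> c(a.size());
    const std::size_t num_threads = 4 * a.size();
    for (std::size_t t = 0; t < num_threads; ++t) {  // t plays the thread index
        const std::size_t vec = t / 4;   // which float4 this "thread" touches
        const std::size_t comp = t % 4;  // which component of that float4
        set(c[vec], comp, get(a[vec], comp) + get(b[vec], comp));
    }
    return c;
}
```

Each logical thread does exactly one scalar add, which is the point made above: the per-thread code is scalar, and the width comes from running many threads side by side.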

Nope. Consider the following points about rendering:

  1. You pretty much always have a lot of elements to process. Either a lot of vertices or a lot of fragments.

  2. Not all operations in a shader are 4D. A lot are 3D, some are scalar.

  3. If you have a 4-wide SIMD, you can perform 4 operations per clock.

Now up to G7x, the SIMD would be used in pretty much the way you would expect, with a float4 in your shader code really being a full 4-way register. This leads to the problem that it is really, really hard to get peak computing power out of your SIMD, as most of the time one or more elements in your vector-register will not be used. Actually, back in the day the scheduling of shader instructions so that you could get the most out of your pipeline was very, very hard. Ask any PS3 programmer. ;)

However, if you have enough elements to process, you can “transpose” the problem. Say you want to store 4 3D position vectors. Naively, you could take four vector registers, extend your vectors to 4D and store them. Transposed, you take three vector registers and place all x-components in the first, all y-components in the second and all z-components in the third. This is often referred to as “Struct of Arrays” as opposed to “Array of Structs”, although that is a bit misleading.

As you can see, you use your registers much more efficiently. The same is true for your computation units, btw. As an added bonus, you need fewer special instructions when doing this. Take a dot-product for example. Transposed, this is just a multiply and a few MADDs. (If you ever wondered why the Cell has no dot-product instruction, now you know.)
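A minimal sketch of the transposed layout and the resulting dot product, in plain host-side C++ with a loop standing in for the four SIMD lanes. Vec3x4, transpose and dot4 are hypothetical names chosen for this example, and the 4-wide register is only modeled with std::array, not real vector hardware.

```cpp
#include <array>
#include <cstddef>

// Four 3D vectors stored "transposed": one 4-wide register per component,
// so no lane is wasted on a padding w-component.
struct Vec3x4 {
    std::array<float, 4> x, y, z;
};

// Convert four separate 3D vectors (array-of-structs) into the
// transposed struct-of-arrays layout.
Vec3x4 transpose(const std::array<std::array<float, 3>, 4>& aos) {
    Vec3x4 soa{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        soa.x[lane] = aos[lane][0];
        soa.y[lane] = aos[lane][1];
        soa.z[lane] = aos[lane][2];
    }
    return soa;
}

// Four dot products at once. Per lane this is one multiply followed by
// two multiply-adds; no horizontal "dot" instruction is needed.
std::array<float, 4> dot4(const Vec3x4& a, const Vec3x4& b) {
    std::array<float, 4> r{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        float acc = a.x[lane] * b.x[lane];   // MUL
        acc = a.y[lane] * b.y[lane] + acc;   // MADD
        acc = a.z[lane] * b.z[lane] + acc;   // MADD
        r[lane] = acc;
    }
    return r;
}
```

In the naive layout, the same four dot products would each need a multiply plus a sum across the lanes of a single register, which is where dedicated dot-product instructions come from.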

Bottom line: What you write in shader code is no longer what the hardware does (and hasn’t really been for some time).

I hope that helped.

Thanks T.B. and jma, that helped.
My diffuse knowledge slowly starts to converge to something meaningful.

Make sure to read the documentation. It specifically talks about SIMD and how it relates to SIMT. The docs are very clear and easy to understand. I seriously believe that CUDA is going to go way further than ATI’s CAL because of Nvidia’s documentation.