Hardware accelerated vector operations?

I use lots of dot() calls in a CUDA kernel. However, I’ve seen that they are simply expanded, so that a dot() call is just translated to x*x + y*y + z*z.

I would like to know whether there is any function to perform native dot products on the GPU (without having to perform 3 multiplications and 2 additions), since the GPU should be capable of it (I'm thinking of shaders).


See the FAQ, Q32:

In short, no, current NVIDIA GPUs are scalar within each thread, although you can think of them as vector (SIMD) across the warp.