I just want to know if GPUs are more optimized for single-precision arithmetic than for double precision. I noticed that CUDA's performance relative to Intel MKL drops when I switch the data from single to double precision. Also, if so, can anyone please point me to a source that says so?
NVIDIA states this outright, and it is the basis for the marketing of the Fermi Teslas. Double-precision support exists only on compute capability 1.3 cards and up (the high-end GT200-based cards), and even there it is one double-precision FPU for every 8 single-precision FPUs. That does not necessarily mean you will see an 8x performance difference, since the bottleneck may lie elsewhere: if doubles are used sparingly, I have seen specific codes run at the same speed after changing parts of them to double precision.
The same 1:8 ratio is kept on the GeForce Fermis. The Tesla (and I think Quadro) Fermis combine 2 single-precision FPUs into one double-precision FPU (effectively 16 double-precision units per SM instead of 32 single-precision ones), giving half the single-precision throughput. That is part of the marketing buzz around them: half the single-precision rate instead of one eighth.
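If you want to measure this yourself, here is a minimal sketch of a microbenchmark: a toy dependent multiply-add loop timed with CUDA events. The kernel name, grid dimensions, and iteration count are arbitrary choices of mine, not from any NVIDIA benchmark; on a compute 1.3+ card the double run should come out noticeably slower than the float run.

```
// Toy float-vs-double arithmetic throughput test (illustrative, not a rigorous benchmark).
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
__global__ void fma_loop(T *out, T a, T b, int iters) {
    T x = a;
    // Dependent multiply-adds keep the FPUs busy and defeat dead-code elimination.
    for (int i = 0; i < iters; ++i)
        x = x * b + a;
    out[threadIdx.x + blockIdx.x * blockDim.x] = x;
}

template <typename T>
void time_kernel(const char *label) {
    const int blocks = 128, threads = 256, iters = 10000;
    T *out;
    cudaMalloc(&out, blocks * threads * sizeof(T));
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_loop<<<blocks, threads>>>(out, (T)1.0001, (T)0.9999, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%s: %.3f ms\n", label, ms);
    cudaFree(out);
    cudaEventDestroy(start); cudaEventDestroy(stop);
}

int main() {
    time_kernel<float>("float ");
    time_kernel<double>("double");
    return 0;
}
```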
If you use SSE, you should see the same kind of performance drop on CPUs, since SSE registers have a fixed width in bytes (see the sketch below). Without SSE, compute-limited code on a CPU won't slow down, although that is misleading, since you are not using the CPU's full power in the first place.
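To make the vector-width point concrete, a small host-side C++ sketch with SSE2 intrinsics (the array contents are purely illustrative): a single 128-bit register operation processes 4 floats but only 2 doubles, so a vectorized loop does half the work per instruction in double precision.

```
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdio>

int main() {
    float  fa[4] = {1, 2, 3, 4}, fb[4] = {10, 20, 30, 40}, fr[4];
    double da[2] = {1, 2},       db[2] = {10, 20},         dr[2];

    // One 128-bit add handles 4 single-precision lanes...
    __m128 vf = _mm_add_ps(_mm_loadu_ps(fa), _mm_loadu_ps(fb));
    _mm_storeu_ps(fr, vf);

    // ...but only 2 double-precision lanes.
    __m128d vd = _mm_add_pd(_mm_loadu_pd(da), _mm_loadu_pd(db));
    _mm_storeu_pd(dr, vd);

    printf("float lanes : %g %g %g %g\n", fr[0], fr[1], fr[2], fr[3]);
    printf("double lanes: %g %g\n", dr[0], dr[1]);
    return 0;
}
```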
In any case, you also double the data size, so communication takes more time.
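A rough sketch of that communication cost, timing a host-to-device copy of the same element count in both precisions (the size `n` here is an arbitrary example): the double copy moves twice the bytes, so it should take roughly twice as long over the bus.

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

template <typename T>
float copy_time_ms(int n) {
    T *host = (T*)calloc(n, sizeof(T));  // pageable host buffer, zero-filled
    T *dev;
    cudaMalloc(&dev, n * sizeof(T));
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dev, host, n * sizeof(T), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaFree(dev); free(host);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 24;  // 16M elements
    printf("float : %.3f ms (%zu MB)\n", copy_time_ms<float>(n),  n * sizeof(float)  >> 20);
    printf("double: %.3f ms (%zu MB)\n", copy_time_ms<double>(n), n * sizeof(double) >> 20);
    return 0;
}
```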