The kernel in the attached code simply performs 2500 fused multiply-add (FMA) instructions per thread, to measure arithmetic performance in the absence of any memory bottleneck. Under CUDA 3.0 it achieves 1033 GFLOPS on a GTX 480 with 7680 threads (and 1319 GFLOPS with 491520 threads, but that is far more threads than I can use in my applications). Excellent. (Varying the FLOPs per thread indicates, for this kernel and configuration, a 4.5 microsecond kernel launch overhead followed by approx. 1184 GFLOPS once the kernel has actually launched ... but I am getting off my own topic!)
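In case it helps anyone reading without the attachment, here is a minimal sketch of the kind of kernel described: a dependent FMA chain per thread, with the result stored so the compiler cannot eliminate the loop. The function names, seed values, and block size below are my assumptions; the actual minprog0b.cu may differ.

```cuda
#define FMAS_PER_THREAD 2500   // 2 FLOPs each -> 5000 FLOPs per thread

// Each thread iterates a dependent a = a*b + c chain, which the compiler
// turns into a single FMA per iteration on Fermi (sm_20).
__global__ void fmaKernel(float *out, float b, float c)
{
    float a = (float)threadIdx.x;          // per-thread seed
    #pragma unroll 100
    for (int i = 0; i < FMAS_PER_THREAD; ++i)
        a = a * b + c;
    // Store the result so the FMA chain is not dead-code eliminated.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}
```

With 7680 threads and enough independent warps per SM, the dependent chain's latency is hidden and the kernel approaches the card's peak FMA issue rate.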
Unfortunately, under CUDA 3.1/3.2rc2/3.2 that 1033 GFLOPS figure drops by about 17% (or, measured the other way round, CUDA 3.0 is approx. 20% faster; the post title is wrong) to about 860 GFLOPS. Before I submit a bug report, can anyone see anything I have done wrong or misinterpreted? Has anyone else encountered such a drop moving from 3.0 to 3.1/3.2? Has there been a change to loop unrolling, or anything else that might explain this? I have searched the docs and these forums but have not found anything that explains the drop.
Here are my measurements (GFLOPS) under different CUDA toolkit and driver versions:
Driver \ Toolkit    3.0     3.1     3.2rc2  3.2
260.19.21            973     854     854     858
260.19.14            971     855     868     856
256.40              1019     862     ---     ---
195.36.15           1033     ---     ---     ---
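For completeness, the GFLOPS figures above come from timing the kernel and dividing total FLOPs by elapsed time. A sketch of such a harness using CUDA events (the kernel signature and launch parameters match my sketch above, not necessarily the attached code):

```cuda
#include <cstdio>

__global__ void fmaKernel(float *out, float b, float c);  // as sketched above

int main()
{
    const int threads = 7680, block = 256;
    float *d_out;
    cudaMalloc(&d_out, threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    fmaKernel<<<threads / block, block>>>(d_out, 0.999f, 0.001f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // 2500 FMAs x 2 FLOPs per thread; convert ms to s, FLOP/s to GFLOPS.
    double gflops = (double)threads * 2500 * 2 / (ms * 1e-3) / 1e9;
    printf("%.1f GFLOPS\n", gflops);

    cudaFree(d_out);
    return 0;
}
```

Note that with only ~38 MFLOP of total work the kernel runs for roughly 30-40 microseconds, so the 4.5 microsecond launch overhead mentioned above is a non-trivial fraction of the measured time.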
Additional details: I am using Fedora 13, which comes with gcc 4.4.4 (so I compile with --compiler-options -fno-inline for CUDA 3.0; including those flags with the later versions does not restore performance). I could easily try different gcc versions, but the 3.2 toolkit is labelled as being for Fedora 13, so presumably gcc 4.4.4 is the version it was tested against. The 1033 figure is pretty consistent for CUDA 3.0 with its matching driver (195.36.15); the figures for CUDA 3.1/3.2rc2/3.2 are a little more variable.
minprog0b.cu (1.59 KB)