FMA regression with CUDA 3.1/3.2: 17% slower than 3.0?

The kernel in the attached code simply performs 2500 Fused Multiply Add (FMA) instructions in an attempt to test arithmetic performance in the absence of any memory bottleneck. Under CUDA 3.0 it achieves 1033 GFLOPS on a GTX 480 with 7680 threads (and 1319 GFLOPS with 491520 threads but that’s far more threads than I can use in my applications). Excellent. (Varying the FLOPs per thread indicates, for this kernel and configuration, a 4.5 microsecond kernel launch overhead followed by approx. 1184 GFLOPS once the kernel has actually launched … but I am getting off my own topic!)

Unfortunately, under CUDA 3.1/3.2rc2/3.2 that 1033 GFLOPS drops by about 17% to about 860 GFLOPS (so the same work takes approx. 20% longer to run: the post title is not quite right). Before I submit a bug report, can anyone see anything I have done wrong or misinterpreted? Has anyone else encountered such a drop moving from 3.0 to 3.1/3.2? Has there been a change to loop unrolling or anything else that might explain this? I have searched the docs and these forums but have not found anything that explains the drop.
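To make the 17% versus 20% distinction concrete, here is a small sketch that computes both figures from the two GFLOPS numbers quoted above. The variable names are just for illustration.

```python
# Figures quoted in the post (GFLOPS on a GTX 480 with 7680 threads).
gflops_cuda30 = 1033.0
gflops_later = 860.0   # approximate figure for CUDA 3.1/3.2rc2/3.2

# Throughput view: fraction of the CUDA 3.0 rate that has been lost.
throughput_drop = 1.0 - gflops_later / gflops_cuda30

# Runtime view: how much longer the same amount of work now takes.
runtime_increase = gflops_cuda30 / gflops_later - 1.0

print(f"throughput drop:  {throughput_drop:.1%}")
print(f"runtime increase: {runtime_increase:.1%}")
```

Both numbers describe the same regression; which one is quoted depends on whether throughput or runtime is taken as the baseline.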

Here are my measurements (GFLOPS) under different CUDA toolkit and driver versions:

Driver \ CUDA   3.0    3.1    3.2rc2   3.2
260.19.21        973    854    854      858
260.19.14        971    855    868      856
256.40          1019    862    ---      ---
195.36.15       1033    ---    ---      ---

Additional details: I am using Fedora 13 which comes with gcc 4.4.4 (so compiling with --compiler-options -fno-inline for CUDA 3.0 … but including those flags for the later versions also does not restore performance). I could easily try different gcc versions but the 3.2 toolkit is labelled as being for Fedora 13 so presumably gcc 4.4.4 is the version it was tested against. The 1033 figure is pretty consistent for CUDA 3.0 with its driver version (195.36.15). The figures for CUDA 3.1/3.2rc2/3.2 are a little more variable.
minprog0b.cu (1.59 KB)

  1. Have you checked the PTX code? Is there a difference in code generated by the different toolkits?

  2. If there's no change, it is possibly a driver problem (the dynamic compilation).
    Run the CUDA 3.0-generated binary on the driver that ships with CUDA 3.1.

  3. There's a complete sequential dependency within a single thread. The GPU should be able to handle it, though.
    Hopefully the active threads in an SM are always > 384 (or whatever the higher number for Fermi is).
    But for pure performance purposes, you could consider having a few more independent instructions per
    thread so you can be sure that RAW hazards are not hampering performance.

17% you say? Strange coincidence! Look at my old post:

http://forums.nvidia.com/index.php?showtopic=186972

and still no reply from anyone.

Regards

I have just looked at this again. By varying the number of FMA instructions per thread (I used some template code and plotted the line of best fit for time against #FLOPs) it is possible to determine that:

Under CUDA 3.0 with its driver version (195.36.15) the average kernel runtime (ms_mean in the code) is 4.5 microseconds (kernel launch overhead?) plus the time it takes for the total number of FLOPs (i.e. #threads * #FLOPs-per-thread) to run at 1180 GFLOPS.

Under CUDA 3.1 and 3.2 with their driver versions (I have tested 3.1 with 256.40 and both with 260.19.{21,26,29}) the average kernel runtime is 8.5 microseconds plus the time it takes for the total number of FLOPs to run at 1060 GFLOPS.

Trying the 260.19.{21,26,29} drivers with CUDA 3.0 gives a combination of the previous results: 8.5 microsecond kernel launch overhead followed by 1180 GFLOPS. So the 90% increase in launch overhead seems to be due to the later drivers while the 10% drop in post-launch performance seems to be due to the later CUDA versions.
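The "fixed overhead plus sustained rate" model described above can be sketched as follows. This is only an illustration: the function names are made up here, the timings fed to the fit are synthetic (generated from the model itself, not measured), and the 5000 FLOPs per thread comes from the 2500 FMAs in the original kernel.

```python
def kernel_time_us(total_flops, overhead_us, rate_gflops):
    """Modelled runtime: fixed launch overhead plus work at a sustained rate."""
    # 1 GFLOPS equals 1e3 FLOPs per microsecond.
    return overhead_us + total_flops / (rate_gflops * 1e3)

def effective_gflops(total_flops, overhead_us, rate_gflops):
    """Throughput seen by the caller once launch overhead is included."""
    return total_flops / kernel_time_us(total_flops, overhead_us, rate_gflops) / 1e3

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

THREADS = 7680
total = THREADS * 5000  # 2500 FMAs per thread, 2 FLOPs each

cuda30 = effective_gflops(total, 4.5, 1180.0)  # CUDA 3.0, driver 195.36.15
cuda32 = effective_gflops(total, 8.5, 1060.0)  # CUDA 3.1/3.2 with their drivers
print(f"predicted: {cuda30:.0f} GFLOPS (3.0), {cuda32:.0f} GFLOPS (3.1/3.2)")

# Synthetic timings (NOT measurements): generated from the model to show how
# the intercept and slope of the fitted line map back to overhead and rate.
flops_per_thread = [1000, 2000, 5000, 10000]
xs = [THREADS * f for f in flops_per_thread]
ys = [kernel_time_us(x, 4.5, 1180.0) for x in xs]
overhead_fit, slope = fit_line(xs, ys)
rate_fit = 1.0 / slope / 1e3  # slope is microseconds per FLOP
print(f"fitted overhead {overhead_fit:.2f} us, fitted rate {rate_fit:.0f} GFLOPS")
```

With these inputs the model gives roughly 1037 and 859 GFLOPS, close to the 1033 and 858 measured earlier. For CUDA 3.0 on the 260.19.x drivers it gives roughly 936 GFLOPS, somewhat below the 971 to 973 in the table, so it should be treated as a rough guide only.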

I’ll open a bug report. Although … surely such a large increase in launch overhead would have been picked up in testing and (by now) by many other users. Have I missed something obvious?

I suggest checking the overhead on another OS, and perhaps using a few large kernels rather than a lot of small ones.

Thanks. It’s certainly true that launch overhead becomes less of an issue with larger kernels :) 7680*5000 is around about where a couple of different applications my students and I are prototyping would fit best though.

I have tested this under Fedora 13 and RHEL 6beta but don’t have a version of Windows to hand (plus I’m currently in another country to the machines, making a Windows install tricky). Perhaps Nvidia will test under other OS’s when they see the bug report. Or maybe there’s a Mac or Windows user with a GTX 480 out there who could verify?

A bigger kernel could help determine where the loss of performance occurs: is it the compiler, or something else? By the way, how many registers do you use? I assume you use a small number of registers.

The attached code (minprog0b2.cu) and its output (graphs) show how performance changes under CUDA 3.0 and CUDA 3.2 as the kernel size (#FLOPs per thread) increases.

Compiling with '--ptxas-options=-v' shows that the kernel uses 4 registers under CUDA 3.0 and 5 registers under CUDA 3.2. Given that the block size is 64, that’s 256 registers per block under CUDA 3.0 and 320 registers per block under CUDA 3.2. As a GTX 480 has 32K (32768) 32-bit registers per SM, that’s (up to) 128 blocks active per SM under CUDA 3.0 and (up to) 102 blocks active per SM under CUDA 3.2 … if it weren’t for the maximum of 8 resident blocks per SM anyway! The grid size is 120 and a GTX 480 has 15 SMs so I’d expect 8 blocks active per SM. So I wouldn’t expect any performance difference because of the extra register, once the kernel is running. But perhaps it could cause a hit in kernel launch overhead? I have never worked out a rough formula for launch overhead. Does anyone have one?
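The register arithmetic above can be checked with a short sketch. It uses only the figures from this post and deliberately ignores register allocation granularity and shared memory limits, so it is an approximation rather than a full occupancy calculation.

```python
REGISTERS_PER_SM = 32768   # 32-bit registers per SM on a GTX 480
MAX_BLOCKS_PER_SM = 8      # resident block limit per SM
NUM_SMS = 15
BLOCK_SIZE = 64
GRID_SIZE = 120

def blocks_per_sm(regs_per_thread):
    """Return (limit from registers alone, limit after the 8-block cap)."""
    regs_per_block = regs_per_thread * BLOCK_SIZE
    register_limit = REGISTERS_PER_SM // regs_per_block
    return register_limit, min(register_limit, MAX_BLOCKS_PER_SM)

for toolkit, regs in (("CUDA 3.0", 4), ("CUDA 3.2", 5)):
    by_registers, resident = blocks_per_sm(regs)
    print(f"{toolkit}: {regs * BLOCK_SIZE} registers/block, "
          f"{by_registers} blocks by registers, {resident} resident")

# Blocks actually available to each SM from this grid.
blocks_from_grid = GRID_SIZE // NUM_SMS
print(f"grid supplies {blocks_from_grid} blocks per SM")
```

In both cases the 8-block cap is the binding limit, which is why the extra register should not change occupancy for this launch configuration.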
minprog0b2-CUDA3.2-driver260.19.21.pdf (26.9 KB)
minprog0b2-CUDA3.0-driver195.36.15.pdf (24.1 KB)
minprog0b2.cu (2.19 KB)

I suspect that a block size of 64 is not optimal; what is the point of it? You have only 512 threads per SM on Fermi, while you can have up to 1536. You had better make it at least 128. Maybe it is not important in this case, though. So the CUDA 3.2 code is a bit different. I suggest comparing the PTX output to see the difference. It looks like the PTX differs and the new version is slower: it uses more registers, so it presumably computes things in a different way. So we can see two sources of slowdown: the startup time for this kernel is now about twice as long, and the kernel itself is slower.

Better yet, run the benchmark with block size 32,64,96,128,…1024 and plot all of the performance numbers. Don’t just arbitrarily guess what block size is fastest.
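As a complement to measuring, the resident-thread ceiling for each block size in that sweep can be tabulated first. This sketch assumes the compute capability 2.0 limits of 1536 resident threads and 8 resident blocks per SM, and it ignores register and shared memory limits (the kernel here uses only 4 or 5 registers per thread). It does not replace running the benchmark.

```python
MAX_THREADS_PER_SM = 1536  # assumed limit for compute capability 2.0 (Fermi)
MAX_BLOCKS_PER_SM = 8      # resident block limit per SM

def resident_threads(block_size):
    """Threads resident on one SM when limited only by thread and block caps."""
    blocks = min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // block_size)
    return blocks * block_size

print("block  threads  occupancy")
for block_size in range(32, 1025, 32):
    threads = resident_threads(block_size)
    print(f"{block_size:5d}  {threads:7d}  {threads / MAX_THREADS_PER_SM:9.0%}")
```

Under these assumptions a block size of 64 caps an SM at 512 threads, 128 at 1024, and 192 is the smallest block size that reaches the full 1536.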

I already know that the block size is not optimal in terms of GFLOPS alone. Nor is the grid size. Plenty of graphs plotted here :) But they are optimal in terms of useful performance for the kinds of applications I am interested in. Let me explain:

Under both CUDA versions (3.0 and 3.2) this code achieves near-peak performance (over 1300 GFLOPS) on a GTX 480 using a grid size of 3840 (#SM*8*32) and a block size of 128. But not every application can use half a million or so threads.

Under CUDA 3.0 the code achieves over 1000 GFLOPS on a GTX 480 using just 7680 threads (grid size 120, block size 64). That’s excellent performance using a far more useful number of threads, for me at least. In fact it’s a little high on the thread count for my applications: using fewer threads for a slight loss in performance would be better but performance drops off dramatically (as one would expect) for either a block size below 64 or a grid size below 8*#SM.

So, I am hoping to discover why the performance with 7680 threads has dropped off after CUDA 3.0 and how to get it back! Clearly the hardware is capable of it.