Performance drop in nvcc 4.2 RC?

allanmac · April 12, 2012, 6:11am

Has anyone else noticed a very significant drop in performance with nvcc 4.2 RC on CC2.0/2.1 Fermi devices?

I’m seeing a large drop in performance in my kernels, even though fewer registers are being used.

Just checking…

DrAnderson42 · April 12, 2012, 12:51pm

I haven’t done a full sweep, but one standard benchmark of my code shows no measurable difference between 4.2 and 4.1 (720 vs. 717, ±5).

allanmac · April 12, 2012, 4:15pm

OK, thanks. I went from 194m/sec. → 164m/sec. in a register-intensive kernel that has no spills or locals. Clearly that’s a troubling drop in performance for me.

njuffa · April 12, 2012, 5:37pm

It is possible that the code is affected by a suboptimal balance between register use and re-computation. High register pressure leading to reduced occupancy or register spilling, which in turn leads to lower performance, is a common issue on Fermi. Often it is possible for the compiler to reduce register pressure by re-computing intermediate results on the fly, which increases dynamic instruction count and can also have a negative effect on performance. Finding the optimal balance between the two effects is not trivial. To my knowledge these decisions are driven by a number of heuristics; like any collection of heuristics they cannot work optimally for all code. If the affected kernels are crucial to the performance of your application, I would suggest filing a bug with a repro case.
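
To make the occupancy side of that trade-off concrete, here is a rough sketch of how registers per thread cap the number of resident warps on a CC 2.x multiprocessor. The per-SM limits (32768 registers, 48 warps) are the documented Fermi values; the 64-register per-warp allocation unit and the function names are assumptions for illustration, and the model ignores block size, shared memory, and the other limits that the real occupancy calculator takes into account.

```python
# Simplified register-limited occupancy model for Fermi (compute capability 2.x).
# Not a replacement for the CUDA occupancy calculator: block size, shared memory,
# and the per-SM block limit are deliberately ignored.

REGS_PER_SM = 32768      # 32-bit registers per multiprocessor on CC 2.x
MAX_WARPS_PER_SM = 48    # maximum resident warps per multiprocessor on CC 2.x
WARP_SIZE = 32
REG_ALLOC_UNIT = 64      # assumption: registers are allocated per warp in units of 64


def warps_limited_by_registers(regs_per_thread):
    """Resident warps per SM when register use is the only limiter."""
    raw = regs_per_thread * WARP_SIZE
    # Round the per-warp register footprint up to the allocation unit.
    regs_per_warp = -(-raw // REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    return min(MAX_WARPS_PER_SM, REGS_PER_SM // regs_per_warp)


def occupancy(regs_per_thread):
    """Fraction of the maximum resident warps that can be active."""
    return warps_limited_by_registers(regs_per_thread) / MAX_WARPS_PER_SM
```

Under these assumptions a kernel at 63 registers per thread is limited to 16 resident warps (one third occupancy), 32 registers allows 32 warps, and anything at or below 20 registers reaches the full 48. The register counts to plug in are the ones ptxas reports with `nvcc -Xptxas -v`. Note that the model says nothing about the other half of the trade-off: a lower register count bought with extra re-computation can still run slower.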

allanmac · April 12, 2012, 6:32pm

FYI, I just compared the .ptx output of 4.1 and 4.2. They’re identical outside of .loc and other PTX file directives.

The SASS that ptxas generates (dumped with cuobjdump) is another story. The 4.2 cubin/SASS is quite a bit bigger.

Making sense of the diff between the .sass files will take some time.
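
For anyone who wants to repeat the PTX comparison, a minimal sketch of the kind of filtered diff I mean is below. The set of ignored directives (`.loc`, `.file`, and `//` comments) and the function names are my assumptions, so adjust them to whatever differs in your toolchain output.

```python
import difflib

# Assumption: these debug/file directives are the only noise worth ignoring.
IGNORED_DIRECTIVES = {".loc", ".file"}


def normalize_ptx(text):
    """Return the PTX lines that matter for comparison, stripped of whitespace."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue  # skip blank lines and comments
        # Compare the whole first token so that ".local" is not mistaken for ".loc".
        if stripped.split(None, 1)[0] in IGNORED_DIRECTIVES:
            continue
        kept.append(stripped)
    return kept


def ptx_diff(old_text, new_text):
    """Unified diff of two PTX listings after normalization (empty if identical)."""
    return list(difflib.unified_diff(
        normalize_ptx(old_text),
        normalize_ptx(new_text),
        fromfile="ptx-4.1",
        tofile="ptx-4.2",
        lineterm="",
    ))
```

Feed it the text of the two files produced by `nvcc -ptx` under each toolkit; an empty list means the PTX matches apart from the ignored lines. The same idea does not carry over directly to `cuobjdump -sass` output, since instruction addresses shift as soon as the code size changes.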

allanmac · April 19, 2012, 5:24pm

The release version of CUDA 4.2 ptxas is still generating suboptimal code. I have a clean compute-bound kernel that has a performance drop of 14% on Fermi SMs in the release version of 4.2.

I filed a related ptxas issue with the CUDA team.