Performance drop in nvcc 4.2 RC?

Has anyone else noticed a very significant drop in performance with nvcc 4.2 RC on CC2.0/2.1 Fermi devices?

I’m seeing a large drop in performance in my kernels, even though fewer registers are being used.

Just checking…

I haven’t done a full sweep, but one standard benchmark of my code shows no measurable difference between 4.2 and 4.1 (720 vs. 717, ±5).

OK, thanks. I went from 194m/sec to 164m/sec (roughly a 15% drop) in a register-intensive kernel that has no spills or locals. That’s a troubling drop in performance for me.

It is possible that the code is affected by a suboptimal balance between register use and re-computation. High register pressure leading to reduced occupancy or register spilling, which in turn leads to lower performance, is a common issue on Fermi. Often it is possible for the compiler to reduce register pressure by re-computing intermediate results on the fly, which increases dynamic instruction count and can also have a negative effect on performance. Finding the optimal balance between the two effects is not trivial. To my knowledge these decisions are driven by a number of heuristics; like any collection of heuristics they cannot work optimally for all code. If the affected kernels are crucial to the performance of your application, I would suggest filing a bug with a repro case.
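To get a feel for the register side of that trade-off, here is a rough sketch of a register-limited occupancy estimate for CC 2.x. The per-SM limits are the ones I recall from the compute-capability tables (32768 registers, 48 warps, 8 blocks per SM, registers allocated per warp in units of 64), so check them against the occupancy calculator. The function name `fermi_occupancy` is made up for this example, and shared memory limits are ignored.

```python
import math

# Per-SM limits assumed for compute capability 2.x (Fermi).
# Verify these against the CUDA occupancy calculator before relying on them.
REGISTERS_PER_SM = 32768
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8
WARP_SIZE = 32
REG_ALLOC_UNIT = 64  # assumed: registers are allocated per warp, rounded up to 64


def fermi_occupancy(regs_per_thread, threads_per_block):
    """Estimate occupancy limited by registers only (shared memory is ignored)."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    # Round the per-warp register footprint up to the allocation unit.
    regs_per_warp = math.ceil(regs_per_thread * WARP_SIZE / REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    warps_by_regs = REGISTERS_PER_SM // regs_per_warp
    # Resident blocks are capped by the block limit, the register file,
    # and the warp limit, whichever is smallest.
    blocks = min(
        MAX_BLOCKS_PER_SM,
        warps_by_regs // warps_per_block,
        MAX_WARPS_PER_SM // warps_per_block,
    )
    return blocks * warps_per_block / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for regs in (20, 32, 63):
        print(regs, "regs/thread ->", round(fermi_occupancy(regs, 256), 2))
```

Under these assumptions, 256-thread blocks at 63 registers per thread come out at one third occupancy, while 20 registers per thread reach full occupancy. Lower register use does not guarantee a faster kernel, though, since the extra re-computation can cost more than the added occupancy gains.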

FYI, I just compared the .ptx output of 4.1 and 4.2. They’re identical outside of .loc and other PTX file directives.
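For anyone who wants to repeat that comparison, a minimal sketch of the filtering is below. It drops debug and file-level directives before diffing, so only real instruction changes show up. The list of ignored directives is my assumption about what counts as noise, and the helper names `normalize_ptx` and `ptx_diff` are hypothetical.

```python
import difflib

# Directives treated as noise when comparing two PTX files.
# This set is an assumption; adjust it to what your toolchain emits.
IGNORED_DIRECTIVES = {".loc", ".file", ".version", ".target", ".address_size"}


def normalize_ptx(text):
    """Return PTX lines with ignored directives, comments, and blank lines removed."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        # Compare the whole first token so that ".local" is not mistaken for ".loc".
        first_token = stripped.split(None, 1)[0]
        if first_token in IGNORED_DIRECTIVES:
            continue
        kept.append(stripped)
    return kept


def ptx_diff(old_text, new_text):
    """Unified diff of two PTX texts after normalization; an empty list means no change."""
    return list(
        difflib.unified_diff(
            normalize_ptx(old_text),
            normalize_ptx(new_text),
            fromfile="old.ptx",
            tofile="new.ptx",
            lineterm="",
        )
    )
```

Feed it the contents of the two .ptx files produced by each toolkit. An empty result means the PTX matches once the directives are ignored, which points at ptxas rather than the front end.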

The SASS that ptxas generates (as dumped by cuobjdump) is another story. The 4.2 cubin/SASS is quite a bit bigger.

Making sense of the diff between the .sass files will take some time.

The release version of CUDA 4.2 ptxas is still generating suboptimal code. I have a clean compute-bound kernel that shows a 14% performance drop on Fermi SMs with the release version of 4.2.

I filed a related ptxas issue with the CUDA team.