Who else is seeing a performance regression on CUDA 7.0 final?

I’m running 347.88 on Win7/x64 with CUDA 7.0 (final) and Nsight 4.6.

I’m seeing terrible register spilling and a resulting 7x drop in performance on dozens of kernels in an unchanged codebase. I’m only focusing on sm_50 right now.

I was running CUDA 7.0.18 RC and Nsight 4.5.x. It worked fine.

Reverting immediately.

My kernel runs reasonably fast with CUDA 6.5. One of my old kernels even runs faster with 6.0.
But with CUDA 7.0 (and RC), the spills go from 0 to 384 bytes.
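For anyone who wants to check their own kernels for this, spill counts can be seen by passing the verbose flag through to ptxas. A minimal sketch (the file name `kernel.cu` and the architecture are placeholders; substitute your own):

```shell
# Ask ptxas to report per-kernel register usage and spill statistics.
# The output contains lines of the form:
#   ptxas info : Used NN registers, NN bytes spill stores, NN bytes spill loads
nvcc -arch=sm_50 -Xptxas -v -c kernel.cu -o kernel.o
```

Comparing that output between the 6.5 and 7.0 toolchains makes the regression easy to spot without running anything.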

It also produces some really silly assembly:

MOV R33, R81;
MOV R34, R82;
MOV R14, R62;

Something like that repeated for 20 lines.
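If anyone wants to see those redundant MOV chains in their own binaries, cuobjdump can disassemble the embedded machine code. A sketch, assuming an object file `kernel.o` built for your target architecture:

```shell
# Disassemble the SASS embedded in the object file;
# long runs of back-to-back MOV instructions show up directly in the listing.
cuobjdump --dump-sass kernel.o
```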

However, if I use 7.0’s ptxas to compile the PTX generated by 6.5, everything works just fine.
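For anyone wanting to try the same mixed-toolchain workaround, the two stages can be split by hand. A hedged sketch, assuming both toolkits are installed under their default Linux paths (adjust paths and file names to your setup):

```shell
# Stage 1: emit PTX with the CUDA 6.5 front-end.
/usr/local/cuda-6.5/bin/nvcc -arch=compute_50 -ptx kernel.cu -o kernel.ptx

# Stage 2: assemble that PTX with the CUDA 7.0 ptxas,
# printing register/spill statistics for comparison.
/usr/local/cuda-7.0/bin/ptxas -arch=sm_50 -v kernel.ptx -o kernel.cubin
```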

I think the last resort for really high performance is to use Scott Gray’s maxas, but the documentation is really difficult for me to understand.

Bummer. The observation that the PTX from CUDA 6.5 compiled with PTXAS from CUDA 7.0 delivers good code is interesting, because the previous reports of weird code generation with CUDA 7.0 seemed to be more indicative of an issue inside the CUDA 7.0 PTXAS. It is of course entirely possible that there is more than one code generation issue, or a problem of interference between the compiler front-end and back-end (what I like to call an “impedance mismatch”).

Whatever the underlying root cause(s), I would encourage all CUDA programmers who encounter such regressions to report them via the bug reporting form that is linked from the CUDA registered developer website.

@ilway25, I was hoping your observation would also work for my kernels.

I just tried ptxas 7.0.27 on PTX emitted by 6.5 and unfortunately it didn’t resolve my unexpected spills. :(

As @njuffa notes, it could be more than one issue. Yikes!

I am really pleased with CUDA 7.0.18 RC…

Hopefully this gets resolved soon and we’ll see a minor update.

I’ve seen it; a description is here:
https://devtalk.nvidia.com/default/topic/821817/cuda-programming-and-performance/a-performance-regression-on-cuda-7-0-final/?offset=3#4496364

A 2x regression on Kepler sm_35 (GTX Titan) when using 255 (!) registers per thread.
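When register pressure is that extreme, it can at least be capped from the command line, though doing so trades spills against occupancy rather than fixing the codegen. A sketch (the cap value of 128 is just an example):

```shell
# Cap registers per thread at 128 for the whole compilation unit,
# and print the resulting register/spill statistics.
nvcc -arch=sm_35 -maxrregcount=128 -Xptxas -v -c kernel.cu -o kernel.o
```

Per-kernel control is also possible with a `__launch_bounds__()` qualifier in the source instead of a unit-wide flag.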

@allanmac :

Could you please provide a test case (or PTX) so we can reproduce and investigate the performance regression in the CUDA 7.0 final release? Thanks.

I also saw a decrease in performance of around 20–30% in my simulation. This happens if I compile my program under CUDA 6.5 and run it using the CUDA 7 drivers. I have not tested compiling under CUDA 7 yet.

I am running the code on a 780M under Mac OS X 10.10.2.

I’ll also post a bug report.