Why is CUDA 4.1RC about 10-15% slower than 4.0?

Hello,

Curious if anyone else has noticed diminished performance with CUDA 4.1RC. Here’s what I did:

  • removed the 4.0 toolkit
  • installed the 4.1RC toolkit
  • recompiled the existing code with the same flags, etc. as before
  • runtime increased by 10-15%

Removing 4.1RC, installing 4.0, and recompiling restored the original (better) performance.

Am I missing something related to the new compiler? Are there any new flags/settings we should be aware of that did not exist in 4.0 and could impact relative performance?

Thanks in advance, Joe

Are the results the same? The new compiler is very different. Also, you can compile for compute capability 1.2 (sm_12) on CUDA 4.1RC1 to get the old compiler.
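For example, something along these lines (the file name and the flags other than the architecture are placeholders; in 4.1, sm_1x targets still go through the old Open64 front end, while sm_2x uses the new LLVM-based one):

    nvcc -arch=sm_12 -O3 -o mc_sim mc_sim.cu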

Results are the same save for some rounding (differences in the 9th decimal place), which makes sense given the updated math functions. I'm currently compiling with sm_20, and switching to 1.2 is not a good option for me, since I would like to keep using some of the Fermi features. Perhaps it's time to hit the registered developer portal with this feedback.

Although much effort has been spent trying to ensure that there are no significant performance regressions with the new compiler, such regressions can of course happen. If you have a small self-contained repro case I would suggest filing a bug against the compiler so the compiler team can take a look.

Much appreciated. Unfortunately, I couldn't send in this code without a proper NDA, and it will probably be rather painful to figure out which part of the (rather large) code is responsible for the performance difference. It's essentially a large-scale financial MC simulation complete with custom RNG, and I had hoped a similar runtime regression had been noted elsewhere. Otherwise, I will try to write up a simple enough sample once I can isolate it.

Check for the obvious things first:

  • does the register count go up, or does the number of spills increase significantly, in any of your kernels? (compile with --ptxas-options=-v; see the sketch after this list)
  • run in the compute profiler and find which kernels run faster and which run slower
  • due to different register usage in CUDA 4.1, you will need to retune your block size for optimal performance
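For the first check, a sketch of what to compare between the 4.0 and 4.1 builds (kernel and file names are made up, and the numbers below are illustrative rather than real output):

    nvcc -arch=sm_20 --ptxas-options=-v -c mc_kernel.cu

    ptxas info    : Compiling entry function '_Z9mc_kernelPfi' for 'sm_20'
    ptxas info    : Function properties for _Z9mc_kernelPfi
        328 bytes stack frame, 300 bytes spill stores, 296 bytes spill loads
    ptxas info    : Used 63 registers, 48 bytes cmem[0]

The "spill stores"/"spill loads" and "Used N registers" lines are the ones to watch.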

Joe, it seems that your problem is not unique. I did an MC simulation of a nuclear reactor. My program became 200% slower when 4.1 was used, so I had to switch back to 4.0. I guess the problem of thread divergence inherent in MC calculations is somehow worsened in 4.1.

My own MC particle transport code did not see a change in execution time, for better or worse; it was within 2%.

Interesting to know. Did you do neutrons, photons or charged particles? Any frequent data transfer between CPU and GPU?

Given that the performance of your code was cut in half by switching to CUDA 4.1, I would encourage you to file a bug with a repro case so this performance regression can be analyzed. Thank you for your help.

Thanks for the suggestions.

Yes, the spills increase dramatically, almost by a factor of 2 (from about 300 to 600) for both stores and loads. The optimal block size did not change much. So it seems the new compiler is somehow less efficient at minimizing spills. Not sure if there is much to be done in the way of coding.

I do coupled photons-electrons. I only do one up-front data transfer to send the phantoms to video memory and one final transfer to bring back the results. No transfers between the different batches of the simulation, except for some single-integer counters.

Are you using __launch_bounds__() directives or the -maxrregcount compiler option? That would explain why the compiler spilled so much and the optimal block size didn't change (another explanation, of course, would be that you are simply using up all 63 available registers).

If so, allowing for more registers might improve performance until Nvidia manages to reduce register usage of the new compiler to the level of the old one.
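For reference, a minimal sketch of both mechanisms (the kernel is a made-up placeholder and the numbers are only an example):

    // asks ptxas to keep at least 4 blocks of 256 threads resident per SM;
    // on Fermi (32K registers per SM) that caps the kernel at 32 registers
    // per thread, and anything beyond that is spilled to local memory
    __global__ void __launch_bounds__(256, 4)
    mc_kernel(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] *= 2.0f;
    }

    // the file-wide equivalent is the compiler option:
    //     nvcc -arch=sm_20 -maxrregcount=32 -c mc_kernel.cu

Loosening either limit trades occupancy for fewer spills, which is the trade-off suggested above.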

In my code, CUDA 4.1 uses on average 3-6 fewer registers in each kernel and performance is 6% faster overall. I’ll repeat what njuffa says and encourage everyone who experiences significantly higher reg usage (or spills) to submit bug reports. The new compiler is sure to have corner cases where reg usage explodes.

Are you already setting the 48K L1 cache mode? That might lessen the impact of the added spills (assuming you don’t need the extra shared memory).
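In case it is useful, a minimal sketch (the kernel name is hypothetical); spilled registers go through L1 on Fermi, so the larger cache can soften heavy spilling:

    #include <cuda_runtime.h>

    __global__ void mc_kernel(float *out, int n);   // hypothetical kernel

    void prefer_l1(void)
    {
        // device-wide: request the 48 KB L1 / 16 KB shared memory split
        cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
        // or per kernel:
        cudaFuncSetCacheConfig(mc_kernel, cudaFuncCachePreferL1);
    }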

No, neither __launch_bounds__ nor -maxrregcount. All 63 registers are being used under both compilers. I guess I can always try digging into the compiler source code.

Interesting, it uses all 63 registers in my case.

Yessir, cache preference set to L1. Now all I have to do is prepare a repro with similar register usage that is NOT my current code.