Curious if anyone else has noticed diminished performance with CUDA 4.1RC. Here’s what I did:
removed the 4.0 toolkit
installed the 4.1 RC toolkit
recompiled the existing code with the same flags, etc., as before
runtime increased 10-15%
Removing 4.1 RC, installing 4.0, and recompiling restored the original (better) performance.
Am I missing something related to the new compiler? Are there any new flags/settings we should be aware of that did not exist in 4.0 and could impact relative performance?
Results are the same save for some rounding (9th decimal place), which makes sense given the updated machine math functions. I’m currently compiling for sm_20, and switching to compute capability 1.2 is not a good option for me, since I would like to keep using some of the Fermi features. Perhaps it’s time to hit the registered developer portal with this feedback.
Although much effort has been spent trying to ensure that there are no significant performance regressions with the new compiler, such regressions can of course happen. If you have a small self-contained repro case I would suggest filing a bug against the compiler so the compiler team can take a look.
Much appreciated. Unfortunately, I couldn’t send in this code without a proper NDA, and it will probably be rather painful to figure out which part of the (rather large) code is responsible for the performance difference. It’s essentially a large-scale financial MC simulation complete with custom RNG, and I had hoped a similar runtime degradation had been noted elsewhere. Otherwise, I will try to write up a simple enough sample once I can isolate it.
Joe, it seems that your problem is not unique. I did MC simulation of a nuclear reactor. My program became 200% slower when 4.1 was used, so I had to switch back to 4.0. I guess the problem of thread divergence inherent in MC calculations is somehow worsened in 4.1.
Given that the performance of your code was cut in half by switching to CUDA 4.1, I would encourage you to file a bug with a repro case so this performance regression can be analyzed. Thank you for your help.
Yes, the spills increase dramatically, almost by a factor of 2 (from about 300 to 600) for both stores and loads. Blocksize optimality did not change much. So it seems the new compiler is somehow less efficient at minimizing the number of spills. Not sure there is much to be done in the way of coding.
I do coupled photon-electron transport. I only do one up-front data transfer to send the phantoms to video memory and one final transfer to bring back the results. No transfers between the different batches of simulation except for some single-integer counters.
Are you using __launch_bounds__() directives or the -maxrregcount compiler option? That would explain why the compiler spilled so much while the optimal blocksize didn’t change (another explanation, of course, would be that you are simply using up all 63 available registers).
If so, allowing more registers might improve performance until NVIDIA manages to reduce the new compiler’s register usage to the level of the old one.
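For anyone checking their own builds: compiling with -Xptxas -v makes ptxas print the per-kernel register and spill-store/spill-load counts, and a launch-bounds directive is written directly on the kernel. A minimal sketch (the kernel name and body are made up for illustration):

```cuda
// Sketch of a __launch_bounds__ annotation. The compiler must fit the
// kernel into the register budget implied by the bounds (here: at most
// 256 threads per block, at least 2 resident blocks per SM), spilling
// to local memory when it cannot -- one way to end up with large spill
// counts even though the source code itself did not change.
__global__ void
__launch_bounds__(256, 2)
mcKernel(float *results, int n)   // hypothetical kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* ... per-path Monte Carlo work ... */
        results[i] = 0.0f;
    }
}
```

Loosening the bounds (or raising -maxrregcount) gives the compiler more registers to work with, at the cost of occupancy.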
In my code, CUDA 4.1 uses on average 3-6 fewer registers in each kernel and performance is 6% faster overall. I’ll repeat what njuffa says and encourage everyone who experiences significantly higher reg usage (or spills) to submit bug reports. The new compiler is sure to have corner cases where reg usage explodes.
Are you already setting the 48K L1 cache mode? That might lessen the impact of the added spills (assuming you don’t need the extra shared memory).
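In case it helps anyone reading along, the cache preference is set from host code. A short sketch, where mcKernel stands in for your own kernel and error checking is omitted:

```cuda
#include <cuda_runtime.h>

extern __global__ void mcKernel(float *results, int n);  // placeholder kernel

// Opt into the larger L1 split on Fermi. Both calls return a
// cudaError_t that real code should check.
void preferL1(void)
{
    // Per-kernel preference: 48 KB L1 / 16 KB shared memory
    cudaFuncSetCacheConfig(mcKernel, cudaFuncCachePreferL1);
    // Or set the preference device-wide (CUDA 4.x runtime API):
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
}
```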
No, no __launch_bounds__ nor -maxrregcount. All 63 registers are being used under both compilers. I guess I can always try digging into the compiler source code.