Just updated to the latest developer drivers and CUDA 3.1, ran my code, and took a 30% performance hit. Almost all kernels slowed down, regardless of complexity or content; some of these kernels do little more than move data around. Some contain FFTs; I switched my FFT plans to use CUFFT_COMPATIBILITY_NATIVE, but got no improvement from that.
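For reference, here's roughly how I'm creating the plans and setting the mode (the function, sizes, and pointer names below are placeholders, not my real code):

```
#include <cufft.h>

// Sketch: batched 1D C2C plan with the relaxed native data layout.
// nx, batch, and d_data are placeholders, not my real parameters.
void runFFT(cufftComplex *d_data, int nx, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, batch);
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE); // new in CUFFT 3.1
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);           // in-place transform
    cufftDestroy(plan);
}
```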
Basic question is, without going into all the details of my code, is there anything obvious I should be looking for or thinking about when switching to 3.1?
FWIW: Windows 7 64-bit, GTX 480s, upgraded from CUDA 3.0.
– updated –
(still need help)
I’ve finally had a little bit of time to look into this, and was hoping to get some more input.
First, register usage is WAY higher in 3.1. Almost every kernel uses 2-4 more registers. When a kernel used 10 registers in 3.0 and uses 14 in 3.1, that's a pretty freaking big increase. Nothing is spilling into local memory, though. I haven't had enough time yet to comb through everything and find out what causes the increase. A few kernels use the same number of registers, so it's not like something is being added indiscriminately to all kernels. (I saw that someone earlier mentioned the possibility of the new printf functionality causing increased register usage, but that doesn't explain why some (very few) kernels remain unchanged…) Any suggestions as to why?
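For anyone wanting to check their own kernels: I'm reading the counts from verbose ptxas output, and one experiment I plan to try is capping registers with __launch_bounds__. The kernel and bounds below are made-up placeholders, not my real code:

```
// Per-kernel register counts come from verbose ptxas output:
//   nvcc -arch=sm_20 --ptxas-options=-v mykernel.cu
// (or cap registers globally with -maxrregcount=N)

// Experiment: force a 3.0-era register budget and see if performance
// returns, at the risk of spills. Placeholder kernel for illustration.
__global__ void __launch_bounds__(256, 6)  // <=256 threads/block, >=6 blocks/SM
copyKernel(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];  // trivial data-movement body
}
```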
Reading the release notes for 3.1, I was under the impression that FFT performance was improved in 3.1 as long as you are willing to relax some of the FFT requirements. Maybe I'm reading it wrong - I thought that if I didn't change anything in the code, FFT performance should not change. My experience has been basically the opposite: FFT performance has been absolutely killed (a 2x performance hit!) UNLESS you are willing to relax the requirements - and even then, I'm struggling just to get equal performance.
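In case anyone wants to reproduce the regression, this is roughly the harness I'm timing with (sizes and iteration count are arbitrary placeholders; the data contents don't matter for timing):

```
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

// Minimal timing harness to compare cufftExecC2C throughput
// between toolkit versions.
int main()
{
    const int NX = 1024, BATCH = 256, ITERS = 100;  // placeholders
    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * NX * BATCH);

    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < ITERS; ++i)
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.3f ms per batched transform\n", ms / ITERS);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```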
I've started going through the PTX outputs line by line, comparing 3.0 results with 3.1 results, which is painstaking because they don't match up very well. The only thing I've noticed so far is that mul.lo.u64 instructions are now mul.wide.u32, which, at least in my mind, are functionally equivalent. Can anyone comment on this?
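To illustrate what I'm comparing: as I read the PTX ISA, mul.lo.u64 multiplies two 64-bit values and keeps the low 64 bits, while mul.wide.u32 multiplies two 32-bit values and produces the full 64-bit product, so they agree whenever the operands fit in 32 bits. Below is a made-up addressing pattern of the kind that seems to generate it (the kernel is a placeholder; I'm dumping the PTX with nvcc -ptx):

```
// nvcc -arch=sm_20 -ptx kernel.cu   (dump PTX for side-by-side comparison)

// Placeholder kernel: 64-bit address arithmetic from 32-bit quantities.
__global__ void scale(float *out, const float *in, unsigned int pitch)
{
    // The blockIdx.x * pitch product gets widened for addressing:
    //   3.0 emitted: mul.lo.u64   %rd3, %rd1, %rd2;  // 64x64 -> low 64 bits
    //   3.1 emits:   mul.wide.u32 %rd3, %r1, %r2;    // 32x32 -> full 64 bits
    size_t offset = (size_t)blockIdx.x * pitch + threadIdx.x;
    out[offset] = 2.0f * in[offset];
}
```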
I still can't get concurrent kernels to work in 3.1 (outside of the carefully crafted SDK example). The very specific situation I outlined in another post, where the SDK example seems to fail in 3.0, does seem to run correctly in 3.1, so I guess I need to find a new example.
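For context, this is the shape of what I'm testing - independent kernels in separate streams, which my understanding says Fermi should overlap (the stream count, kernel, and pointers are placeholders):

```
// Sketch: launch independent work into separate streams so the GPU
// can, in principle, run the kernels concurrently.
const int NSTREAMS = 4;  // placeholder count
cudaStream_t streams[NSTREAMS];
for (int i = 0; i < NSTREAMS; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < NSTREAMS; ++i)
    copyKernel<<<grid, block, 0, streams[i]>>>(d_out[i], d_in[i]);  // placeholders

cudaThreadSynchronize();  // wait for all streams to drain

for (int i = 0; i < NSTREAMS; ++i)
    cudaStreamDestroy(streams[i]);
```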
It looks like the performance of cudaThreadSynchronize() is worse? Has anyone else experienced this? At least, this is my best guess for now - I haven't had time to really look into it. I have a chunk of code that is essentially "start timer; CPU-only code; cudaThreadSynchronize(); stop timer;" (don't worry about why :P), and it takes 25% longer using 3.1. Since the timer is part of the SDK, which I'm not changing, and the CPU code shouldn't be compiled any differently, the only thing I can come up with is worse performance for cudaThreadSynchronize - maybe because we can launch 16 concurrent kernels now? Alternatively, I suppose the timer is not perfect and is being affected by what's going on around it. Of course, the obvious thing to try is to just call it a million times and see if it actually takes longer, but I haven't had a chance yet… let's ignore the fact that I could have tried it in the time it took to write this…
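(Writing that obvious test out anyway - a minimal sketch, assuming a plain CPU clock is accurate enough over a million calls:)

```
#include <cuda_runtime.h>
#include <cstdio>
#include <ctime>

// Measure the raw overhead of cudaThreadSynchronize() on an idle context
// by calling it many times between two CPU clock readings.
int main()
{
    const int N = 1000000;  // arbitrary iteration count
    cudaFree(0);            // force context creation before timing

    clock_t t0 = clock();
    for (int i = 0; i < N; ++i)
        cudaThreadSynchronize();
    clock_t t1 = clock();

    double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("%.0f ns per cudaThreadSynchronize\n", seconds * 1e9 / N);
    return 0;
}
```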
As always, appreciate any input at all. Thanks Lev and Romant for your suggestions.