30% performance hit with CUDA 3.1. Anything obvious I should be looking for?

Just updated to the latest developer drivers and CUDA 3.1, ran my code, and took a 30% performance hit. Almost all kernels, regardless of complexity or content, took a hit. Some of these kernels are little more than moving data around. Some have FFTs; I switched my FFT plans to use CUFFT_COMPATIBILITY_NATIVE, but got no improvement from that.
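For reference, the change was roughly this (N, BATCH, and d_data are example placeholders, not my real sizes and buffers):

#include <cufft.h>

#define N     1024   // placeholder transform size
#define BATCH 16     // placeholder batch count

void runFFT(cufftComplex *d_data)   // d_data: device buffer of N*BATCH elements
{
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, BATCH);
    // relax FFTW compatibility; must be set after plan creation and
    // before the first cufftExec* call
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}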

Basic question is, without going into all the details of my code, is there anything obvious I should be looking for or thinking about when switching to 3.1?

FWIW, this is on Windows 7 64-bit with GTX 480s, upgraded from CUDA 3.0.

– updated –
(still need help)

I’ve finally had a little bit of time to look into this, and was hoping to get some more input.

First, register usage is WAY higher in 3.1. Almost every kernel uses 2-4 more registers. When we’re using 10 registers in 3.0 and 14 in 3.1, that’s a pretty freaking big increase. Nothing is spilling into local memory, though. I haven’t had enough time yet to comb through everything and find out what causes the increase. A few kernels use the same number of registers, so it’s not as if something is being added indiscriminately to all kernels. (I saw that someone earlier had mentioned the possibility of the new printf functionality causing increased register usage, but that doesn’t explain why some (very few) kernels remain unchanged…) Any suggestions as to why?
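(In case anyone wants to compare numbers: I’m reading the per-kernel register counts from the verbose ptxas output, compiling with something like the following, where mykernels.cu stands in for the real source file:)

nvcc -arch=sm_20 -Xptxas -v -c mykernels.cu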

Reading the release notes for 3.1, I was under the impression that FFT performance was improved in 3.1 as long as you are willing to relax some of the FFT compatibility requirements. Maybe I’m reading it wrong - I thought that if I didn’t change anything in the code, FFT performance should not change. My experience has been basically the opposite: FFT performance has been absolutely killed (a 2x performance hit!) UNLESS you relax the requirements - and even then, I’m struggling just to get equal performance.

I’ve started going through the PTX outputs line by line, comparing 3.0 results with 3.1 results, which is painstaking because they don’t match up very well. The only thing I’ve noticed so far is that mul.lo.u64 instructions are now mul.wide.u32, which, at least in my mind, are functionally equivalent (mul.wide.u32 takes two 32-bit operands and produces the full 64-bit product, so it should match mul.lo.u64 whenever the operands fit in 32 bits - and I’d expect it to be cheaper, since it avoids a full 64-bit multiply). Can anyone comment on this?
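(For anyone following along, I’m generating the PTX to diff with something like this, where mykernels.cu is again a placeholder for the real file:)

nvcc -arch=sm_20 -ptx mykernels.cu -o mykernels.ptx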

I still can’t get concurrent kernels to work in 3.1 (outside of the carefully crafted SDK example), although the very specific situation I outlined in another post, where the SDK example seems to fail in 3.0, now runs correctly in 3.1 - so I guess I need to find a new example.
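The kind of stripped-down test I’ve been trying looks roughly like this (same idea as the SDK sample, just minimal - the spin kernel and the cycle count are placeholders, not my real code):

#include <cstdio>

// busy-wait on the device for roughly 'cycles' clock ticks
__global__ void spin(unsigned int cycles)
{
    unsigned int start = (unsigned int)clock();
    while ((unsigned int)clock() - start < cycles) { /* spin */ }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t begin, end;
    cudaEventCreate(&begin);
    cudaEventCreate(&end);

    // two long single-block kernels in different streams: if they run
    // concurrently, total time is ~1 kernel; if serialized, ~2 kernels
    cudaEventRecord(begin, 0);
    spin<<<1, 1, 0, s1>>>(100000000);
    spin<<<1, 1, 0, s2>>>(100000000);
    cudaEventRecord(end, 0);
    cudaEventSynchronize(end);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, begin, end);
    printf("total: %.1f ms (%s)\n", ms, cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(begin);
    cudaEventDestroy(end);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}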

It looks like the performance of cudaThreadSynchronize() is worse? Has anyone else experienced this? At least, this is my best guess for now; I haven’t had time to really look into it. I have a chunk of code that is essentially “start timer; CPU-only code; cudaThreadSynchronize(); stop timer;” (don’t worry about why :P) and it takes 25% longer using 3.1. Since the timer is part of the SDK, which I’m not changing, and the CPU code shouldn’t be compiled any differently, the only thing I can come up with is worse performance for cudaThreadSynchronize(), maybe because we can launch 16 concurrent kernels now? Alternatively, I suppose the timer is not perfect and is being affected by what’s going on around it. Of course, the obvious thing to try is to just call it a million times and see if it actually takes longer, but I haven’t had a chance yet… let’s ignore the fact that I could have tried it in the time it took to write this…
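(When I get to it, the test I have in mind is just something like this, built against 3.0 and then 3.1, on an otherwise idle context:)

#include <cstdio>
#include <ctime>

int main()
{
    const int N = 1000000;
    cudaFree(0);                     // force context creation up front
    clock_t t0 = clock();
    for (int i = 0; i < N; ++i)
        cudaThreadSynchronize();     // idle device: measures pure call overhead
    clock_t t1 = clock();
    printf("%.3f us per call\n",
           1e6 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)N);
    return 0;
}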

As always, appreciate any input at all. Thanks Lev and Romant for your suggestions.

Maybe it is a 64-bit Windows issue. It somehow runs slower than XP on all kernels. Any chance to try it on another OS?

Do your kernels require the same amount of resources after being compiled with CUDA 3.1? Registers, shared/local/const memory.

Updated; could use some more suggestions. Thanks!

Really, cudaThreadSynchronize() should not be called often in most programs. Anyway, cudaMemcpy() gives you a sync, as do some other functions.
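For example (the fill kernel and buffers here are just placeholders):

#include <cstdio>

__global__ void fill(int *out)      // placeholder kernel
{
    out[threadIdx.x] = threadIdx.x;
}

int main()
{
    const int n = 256;
    int h_out[n];
    int *d_out;
    cudaMalloc((void**)&d_out, n * sizeof(int));

    fill<<<1, n>>>(d_out);
    // no cudaThreadSynchronize() needed here: the blocking cudaMemcpy
    // waits for the preceding kernel to finish before copying, and only
    // returns once the copy is complete
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h_out[255] = %d\n", h_out[255]);
    cudaFree(d_out);
    return 0;
}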

CUDA 3.1 introduces an ABI (application binary interface), which enables real (non-inlined) function calls and new features such as printf() and recursive functions. Note that the ABI is only supported on Fermi and later architectures. We have typically seen only a +/-3% performance delta from running with and without the ABI, but if you want to turn it off you can try adding the following to the nvcc command line:

-Xptxas -abi=no

Note that compiling this way will make it impossible to use any of the new features that require the ABI.

If you’re still seeing performance regressions with CUDA 3.1, please file a bug.

Very interesting information, thank you very much. Why is such an interesting option not discussed in the nvcc_3.1.pdf doc?

In the case of my kernel, register usage with the ABI is 54 registers, but without the ABI only 33 registers! This forces me to use the __launch_bounds__ directive to limit the number of registers used in order to get maximum performance. However, register spilling seems to be extremely effective on Fermi: even when I explicitly limit the number of registers, local memory usage is still zero, and kernel execution is still 2.3 times faster than on GT200.
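In case it helps anyone else, the limit looks roughly like this (256 threads per block and 4 resident blocks per SM are stand-ins for my real numbers, and the kernel body is a placeholder):

// tells ptxas to cap register usage so that at least 4 blocks of
// 256 threads can be resident per multiprocessor
__global__ void __launch_bounds__(256, 4)
myKernel(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;   // placeholder body
}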

Good question, we’ll try and add something about this to the next release of the documentation.

Wow, that is a big difference; I guess you have a lot of function calls? Yes, register spilling to local memory is very efficient on Fermi due to the caches.

Yes, a lot of function calls. After you gave the info on the ABI, I tested my kernel with the ABI and the launch_bounds limitation (so the kernel required 32 registers) and without the ABI and without launch_bounds (54 registers) - the speed is just the same.

Another question… There was no recursion support before CUDA 3.1, but now recursion is possible. My current code emulates recursion using stacks in shared memory. What do you think - does it make sense to drop the recursion emulation and use native recursive calls instead? Can it give any speedup?

I think it depends. L1 should be as fast as shared memory; however, it may be used in other ways. In principle, emulated recursion should be faster.

In general, I would try to avoid recursion (just like on the CPU). You have more control by using your own stack in local memory, and with recursive calls I think every function parameter has to be put on the stack.
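To illustrate the explicit-stack idea (a minimal sketch, not anyone’s actual code - Node, MAX_DEPTH, and the “visit” step are hypothetical):

#define MAX_DEPTH 32   // hypothetical bound on tree depth

struct Node {
    float value;
    Node *left, *right;
};

// iterative preorder traversal with a per-thread stack in local memory,
// instead of a recursive device function: only a pointer goes on the
// stack per level, not a whole call frame
__device__ float traverse(Node *root)
{
    Node *stack[MAX_DEPTH];   // local memory, cached in L1 on Fermi
    int top = 0;
    float sum = 0.0f;
    if (root) stack[top++] = root;
    while (top > 0) {
        Node *n = stack[--top];
        sum += n->value;                         // the "visit" step
        if (n->right && top < MAX_DEPTH) stack[top++] = n->right;
        if (n->left  && top < MAX_DEPTH) stack[top++] = n->left;
    }
    return sum;
}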