Occupancy and performance (plus a question about .cubin files)

Hi all,

I’ve been optimizing my application and have some questions about what the results of the occupancy calculator and .cubin files mean for performance.

The opening question, however, is about the .cubin file that nvcc -cubin generates. It contains two code{} blocks for the same kernel function. If I don’t use -maxrregcount, the register usage differs between the two blocks; if I do, the local memory usage may differ instead.
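To show what I mean, here is roughly the shape of the file. The mangled name and the numbers below are illustrative, not pasted from my actual .cubin:

    architecture {sm_10}
    abiversion {1}
    modname {cubin}
    code {
        name = _Z8myKernelPfS_i
        lmem = 0
        smem = 44
        reg  = 24
        bincode {
            ...
        }
    }
    code {
        name = _Z8myKernelPfS_i
        lmem = 0
        smem = 44
        reg  = 22
        bincode {
            ...
        }
    }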

When I compile without -maxrregcount, register usage is 22 or 24 (depending on which code block I look at), and the occupancy calculator tells me that yields only 50% occupancy. Not good, right?

The occupancy calculator suggests that I limit the number of registers to 16 for 100% occupancy. When I do this, the .cubin file says lmem = 0. Great, I thought!
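For reference, the command I use looks like this (file names are just placeholders):

    nvcc -cubin -maxrregcount=16 -o mykernel.cubin mykernel.cu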

But performance of the 16-register version is abysmal: its runtime is 140% longer than the 24-register version's. If I raise the register limit to 17, the occupancy calculator says I should get 75% occupancy, with 0 or 4 bytes of lmem (depending on which code block I look at), and I get close to the same performance as the 24-register version.
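In case it matters, here is my understanding of where the calculator gets those percentages. The 256-thread block size and the 16384 registers per SM (a compute 1.3 part) are assumptions for illustration, not my actual configuration, and I'm ignoring the calculator's allocation granularity:

    16 regs: 256 * 16 = 4096 regs/block -> floor(16384 / 4096) = 4 blocks = 1024 threads = 32/32 warps = 100%
    17 regs: 256 * 17 = 4352 regs/block -> floor(16384 / 4352) = 3 blocks =  768 threads = 24/32 warps =  75%
    24 regs: 256 * 24 = 6144 regs/block -> floor(16384 / 6144) = 2 blocks =  512 threads = 16/32 warps =  50%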

Can anyone help me sort this out? What should I be thinking about and looking at? Thanks!

Actually, except for a few cases, 50% occupancy is fine. Occupancy is not a measure of GPU activity or anything like that. It is just a measure of how close you are to the maximum number of threads on the multiprocessor. More threads means more opportunities to hide memory latency, which is good. However, many algorithms can hide memory latency just fine with less than 100% occupancy. This is probably why you see similar performance for 50% and 75% occupancy.

Unless your occupancy is extremely low (<25%), there is usually little to be gained by improving it. Verifying your global memory transfers are coalesced with the profiler (and improving that if possible) is a much more important optimization step.
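To make that concrete, here is a toy pair of CUDA kernels (not your code, just a sketch) showing the two access patterns the profiler distinguishes on 1.x hardware:

    // Consecutive threads read consecutive 4-byte words, so each
    // half-warp's reads coalesce into a single memory transaction.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Consecutive threads read words far apart, so the hardware must
    // issue many separate transactions for the same half-warp.
    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[(i * stride) % n];
    }

If the profiler shows a high uncoalesced-load count, restructuring your data layout toward the first pattern usually pays off far more than chasing occupancy.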

I was under the impression that occupancy was how much of the GPU was active, like how many warps would be active over the maximum number of warps. Guess not. Thanks for the help, I’ll look at my memory transfers :)

But I’m still not happy with how well I understand this behavior. Anyone else have a comment? Why would lower register usage and higher occupancy with no local memory usage kill performance? More bank conflicts due to the added parallelism?

Then there’s the bit about the two code blocks in the .cubin file. Is that normal?

That statement about warps is true. However, keep in mind that a streaming multiprocessor can issue only one warp at a time. There are only 8 scalar processors in a single SM, and each of them handles 4 threads of the warp, so one warp instruction takes 4 clock cycles to issue.

In an ideal situation, if your code consisted purely of computation, one warp would be enough to use the full potential of your GPU. However, you do read some memory, and reads introduce latency (reading global memory takes hundreds of clock cycles!). While the expensive load is handled by the memory controller, the scalar processors are given work from another warp.
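A rough back-of-the-envelope, with all numbers approximate:

    one warp instruction = 32 threads / 8 SPs = 4 clocks to issue
    one global memory read = roughly 400-600 clocks of latency
    400 clocks / 4 clocks per warp instruction = ~100 warp instructions
      needed from other warps to completely cover a single load

So the more resident warps the scheduler can pick from, the easier it is to fill that gap.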

If your kernel does a lot of global memory reads, maxing out your occupancy may benefit your code. At some point, however, the global memory latency is fully hidden, and once that happens, additional active warps do you no good.

Assuming global memory reads are no longer a problem (as explained above), you actually want to use as many registers as are available to you. Those extra registers may be used to store intermediate values or to enable other optimisations. If you set a harder register limit, some expressions may have to be recomputed over and over again. This won’t be reflected in local memory usage, because nothing has spilled there yet, but performance will be hampered by the extra computation.
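A contrived sketch of what I mean; this is a hypothetical kernel, not anyone's real code:

    // With enough registers, the compiler keeps 'k' live in a register
    // for the whole loop. Under a tight -maxrregcount it may instead
    // re-evaluate the expression on every use: no lmem spill appears
    // in the .cubin, but extra instructions get executed.
    __global__ void shade(const float *x, float *y, int n, float a, float b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float k = a * a + b * b + a * b;  // common subexpression
        float acc = 0.0f;
        for (int j = 0; j < 8; ++j)
            acc += x[i] * k + j;          // 'k' reused every iteration
        y[i] = acc;
    }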

Thanks Cygnus, that helps a lot. You make a lot of fine posts on these boards, I’ve noticed :) .

I just learned that the GTX 260 does more advanced coalescing than most other GPUs, so uncoalesced global memory accesses are unlikely to be an issue on this machine. Hmm, wait a second, I’m looking at cuda_profile.log and it tells me that the occupancy is only .125 instead of the .75 that I was expecting from the .cubin and occupancy calculator!

Thank you. I am debugging my code and since compilation takes several minutes to complete, I have a lot of time to poke over here ;)

I think you are referring to its compute capability 1.3? Indeed, the global memory coalescing rules there are much less restrictive and painful :)

How many blocks are you launching? If there are only a few blocks, or just one, the CUDA profiler will report the occupancy actually achieved, while the occupancy calculator assumes enough blocks are launched to saturate all available resources.
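For example (the 128-thread block size here is only an assumption to show the arithmetic):

    1 block of 128 threads = 4 resident warps on one SM
    4 warps / 32 warps max per SM = 0.125 occupancy

and every other SM on the chip sits completely idle.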

Oh, hehe, of course. I’m running a small problem set right now so I don’t have to wait ages, which means just one block. Thanks!