Occupancy mystery: lo-occ, hi-reg faster than hi-occ, lo-reg?

Hi all,

Does anyone have an idea why a higher-register-count, low-occupancy kernel can be faster than a lower-register-count, high-occupancy one? (The kernels perform the same operations and differ only in the order of those operations.) That’s exactly what I’m observing here:

method=[ apply_flux ] gputime=[ 1567.680 ] occupancy=[ 1.000 ]

method=[ apply_flux ] gputime=[ 1068.960 ] occupancy=[ 0.500 ]

method=[ apply_flux ] gputime=[ 1361.504 ] occupancy=[ 1.000 ]

method=[ apply_flux ] gputime=[ 1071.584 ] occupancy=[ 0.500 ]

method=[ apply_flux ] gputime=[ 1251.840 ] occupancy=[ 1.000 ]

method=[ apply_flux ] gputime=[ 1357.568 ] occupancy=[ 1.000 ]

The kernels above, despite all sharing the same name, are actually different; in the same order as above, they compile with the following properties:

flux: lmem=0 smem=7576 regs=16

flux: lmem=0 smem=7576 regs=17

flux: lmem=0 smem=7576 regs=16

flux: lmem=0 smem=7576 regs=17

flux: lmem=0 smem=7576 regs=16

flux: lmem=0 smem=7576 regs=16

All kernels run with identical grid and block sizes.

I don’t understand how the 17-reg kernels that achieve 0.5 occupancy can actually be faster than the 16-reg kernels that achieve full occupancy. :blink:

Anybody?


NB: This is with CUDA 2.0 driving a GTX280.
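
For reference, here is roughly where the 0.5 seems to come from. The sketch below assumes a 512-thread block purely for illustration (the actual launch configuration isn’t listed above) and ignores register allocation granularity, so it is only an approximation of what the occupancy calculator computes:

// Rough sketch: how one extra register per thread can halve occupancy on a
// GT200 (compute capability 1.3). The 512-thread block size is an assumption;
// the real launch configuration is not shown in this thread.
#include <stdio.h>

int main(void)
{
    const int regs_per_sm        = 16384;  // register file per SM on GT200
    const int max_threads_per_sm = 1024;   // resident-thread limit, cc 1.3
    const int block_size         = 512;    // assumed, not taken from the post

    for (int regs_per_thread = 16; regs_per_thread <= 17; ++regs_per_thread) {
        int blocks_by_regs    = regs_per_sm / (regs_per_thread * block_size);
        int blocks_by_threads = max_threads_per_sm / block_size;
        int blocks = blocks_by_regs < blocks_by_threads ? blocks_by_regs
                                                        : blocks_by_threads;
        printf("regs=%d -> %d block(s) per SM -> occupancy %.2f\n",
               regs_per_thread, blocks,
               (double)(blocks * block_size) / max_threads_per_sm);
    }
    return 0;
}

With those assumptions, two 16-register, 512-thread blocks fit into the 16K-entry register file (occupancy 1.0), while a 17-register block needs 8704 registers, so only one block fits per SM (occupancy 0.5). Two blocks’ worth of the 7576 bytes of shared memory still fit into the 16 KB per SM, so shared memory doesn’t appear to be the limiter.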

And who the heck said occupancy was important? NVIDIA provides this tool, so everyone thinks they need to pay attention to it.

In what way are the kernels different? Using more regs can easily cause a speedup, since registers are fast on-die memory that can be put to good use. A 128-register kernel that never has to use DDR will be much faster than a 16-register kernel that has to read/write DDR constantly.
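
To make that concrete, here is a toy pair of kernels (nothing to do with the apply_flux kernel in this thread): both compute the same result as long as coeff and data don’t overlap, but the second one spends a couple of extra registers to keep its operands out of global memory inside the loop. Because coeff and data might alias, the compiler can’t safely hoist the load out of the loop in the first version on its own.

__global__ void scale_reread(const float *coeff, float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int k = 0; k < 8; ++k)
        data[i] *= coeff[0];      // global read and write on every iteration
}

__global__ void scale_cached(const float *coeff, float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float c = coeff[0];           // one global read, held in a register
    float v = data[i];
    for (int k = 0; k < 8; ++k)
        v *= c;                   // all work stays in registers
    data[i] = v;                  // one global write at the end
}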

P.S. Another dumb thing about the occupancy tool is that it doesn’t tell anyone that 50% occupancy is often barely worse than 100%. Even 25% is pretty good, and you don’t even need that if you’re hardly touching DDR. People see a percentage and think they need 100.

Well, here are the kernels:

As I said, except for ordering, they’re virtually identical.

In the end, I’m just trying to understand what drives performance in CUDA, and in this case I’m drawing a complete blank. I’ve been told “more occupancy at least can’t hurt”, but the opposite is true here. There isn’t much arithmetic relative to quite a bit of memory traffic, so I would certainly expect occupancy to make a difference.

As for registers, I understand your argument that they can do a lot of good. But in this instance, there really isn’t that much register pressure, and I’m not artificially limiting register count, either.

So where does the time difference come from?

Thanks for any insight,

Andreas

It looks like you’re fetching the same value from the texture multiple times. I’m guessing that in one version the compiler manages to optimize this down to a single access (storing the value in the extra register), and in the other it doesn’t.
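
Something along these lines, as a made-up illustration rather than your actual kernel (coeff_tex is a hypothetical texture): in one ordering the compiler effectively turns use_twice into use_once, paying one extra register for the cached value, and in the other ordering it leaves both fetches in.

texture<float, 1, cudaReadModeElementType> coeff_tex;   // hypothetical texture

__global__ void use_twice(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(coeff_tex, i) + 2.0f * tex1Dfetch(coeff_tex, i);
}

__global__ void use_once(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float c = tex1Dfetch(coeff_tex, i);   // single fetch, one extra register
        out[i] = c + 2.0f * c;
    }
}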

Good catch. I had spotted that too and fixed it, but the described effect still occurs. I guess the solution to the mystery is simply that when the compiler uses that extra register, it manages to generate better code. Still weird: I’d hope the compiler would generate the fastest code possible in all situations, and only secondarily optimize for register count.

Thanks,
Andreas

I imagine that working out at compile time whether higher occupancy or higher register usage will be faster at run time is rather tricky.

Yes, the compiler is very annoying in its inconsistency. The solution when really optimizing code is to use decuda and double-check what’s going on. Then move a statement here, a statement there, until stuff looks ideal.
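
Roughly like this, though the exact invocation may differ with your decuda version, and the file names here are made up: dump a cubin with nvcc and disassemble it.

nvcc -cubin flux.cu
decuda flux.cubin > flux.asm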

Exactly, that’s the job of the user, and that’s precisely my point: the compiler should use the maximum number of registers it can take advantage of, unless you tell it not to with --maxrregcount=N.
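
For example (as an experiment, with a made-up file name), pinning the compiler to the 16-register case would look something like

nvcc --maxrregcount=16 -cubin flux.cu

but that cap should be something I opt into, not something the compiler aims for by default.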

Wishful thinking, I guess. I’m not much of a compiler person myself… :)

Andreas