Anyone have an idea why a higher register-count, low-occupancy kernel can be faster than a low register-count, high-occupancy one? (Both of the kernels perform the same operations, only differing in order) That’s exactly what I’m observing here:
All kernels run with identical grid and block sizes.
I don’t understand how the 17-reg kernels that achieve 0.5 occupancy can actually be faster than the 16-reg kernels that achieve full occupancy. :blink:
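For concreteness, here's how one extra register can halve occupancy. The hardware numbers below are my assumption (GT200-class limits, 512-thread blocks), not something stated in the post:

```cuda
// Assumed GT200-class limits (not from the post):
//   register file per SM : 16384 32-bit registers
//   max resident threads : 1024 per SM
// With 512-thread blocks:
//   16 regs/thread: 512 * 16 = 8192 regs/block -> 2 blocks fit
//                   -> 1024 resident threads -> 100% occupancy
//   17 regs/thread: 512 * 17 = 8704 regs/block -> only 1 block fits
//                   ->  512 resident threads ->  50% occupancy
// A single extra register per thread drops occupancy from 100% to 50%
// at this block size.
```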
And who the heck said occupancy was important? NVIDIA provides the occupancy calculator, so everyone thinks they need to pay attention to it.
In what way are the kernels different? Using more regs can easily cause a speedup, since registers are fast on-die memory that can be put to good use. A 128-register kernel that never has to use DDR will be much faster than a 16-register kernel that has to read/write DDR constantly.
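A minimal sketch of that trade-off (my own illustration; the kernel, names, and loop count are made up, not the poster's code). One version pays one global read and keeps the value in a register; the alternative re-reads memory on every use:

```cuda
// Hypothetical kernel: each thread scales one column value into 8 rows.
__global__ void scale_rows(const float *in, float *out, int n, int stride)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;

    // One global-memory read; the value then lives in a register.
    float v = in[col];
    for (int row = 0; row < 8; ++row)
        out[row * stride + col] = v * (row + 1);  // register reuse, no re-read

    // The register-thrifty alternative would read in[col] inside the
    // loop on every iteration: one register saved, eight global-memory
    // round trips paid.
}
```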
P.S. Another dumb thing about the occupancy tool is that it doesn’t tell you that 50% occupancy is actually barely worse than 100%. Even 25% is pretty good, and you don’t even need that if you’re not accessing DDR much. People see a percentage and think they need 100.
As I said, except for ordering, they’re virtually identical.
In the end, I’m just trying to understand what drives performance in CUDA. And in this case, I’m drawing a complete blank. I’ve been told “more occupancy at least can’t hurt”, but the opposite is true here. There’s not a whole lot of arithmetic for quite a bit of memory traffic. I would certainly expect occupancy to make a difference.
As for registers, I understand your argument that they can do a lot of good. But in this instance, there really isn’t that much register pressure, and I’m not artificially limiting register count, either.
It looks like you’re fetching the same value from the texture multiple times. I’m guessing that in one version the compiler manages to optimize this into a single access (storing the value in the extra register), and in the other it doesn’t.
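In other words, something like the following (a sketch of the suspected difference; the texture name and arithmetic are hypothetical, using the old texture-reference API of that era):

```cuda
// Assumed 1-D float texture bound elsewhere via cudaBindTexture.
texture<float, 1, cudaReadModeElementType> tex;

// Version A: the fetched value is held in a register -- one texture
// access per thread, one extra register.
__global__ void kernelA(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float t = tex1Dfetch(tex, i);
    out[i] = t * t + t;
}

// Version B: the same element is fetched three times -- one register
// fewer, but extra texture traffic per thread.
__global__ void kernelB(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = tex1Dfetch(tex, i) * tex1Dfetch(tex, i) + tex1Dfetch(tex, i);
}
```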
Good catch. I had spotted that too and fixed it, but the described effect still occurs. I guess the solution to the mystery is simply that when the compiler uses that extra register, it manages to generate better code. Still weird: I’d hope the compiler would generate the fastest code possible in all situations, and optimize for register count only secondarily.
Yes, the compiler is very annoying in its inconsistency. The solution when really optimizing code is to use decuda and double-check what’s going on. Then move a statement here, a statement there, until stuff looks ideal.
Exactly: as things stand, that’s left to the user. That’s precisely my point: the compiler should use the maximum number of registers it can take advantage of, until you tell it not to with --maxrregcount=N.
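For reference, that flag is passed to nvcc like this (the file name is a placeholder):

```shell
# Cap the kernel at 16 registers per thread. Note this may force
# spills to slow local memory if the kernel genuinely needs more.
nvcc --maxrregcount=16 -o kernel kernel.cu
```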
Wishful thinking, I guess. I’m not much of a compiler person myself… :)