Uses 8 registers, 4 bytes lmem. When I look at the PTX output, it appears to load z from global memory, immediately store it to local memory, and then read z.x and z.y back from local memory in two separate local loads.
I don’t think the arrays are going into local memory, since they are constant-indexed, and none of them is 4 bytes long (F is 12 bytes, Z is 24 bytes).
In my real kernel this adds up to 12 32-bit local stores and 24 16-bit local loads, which can’t be good for performance; and since my occupancy is already low for other reasons, the kernel won’t be able to hide the local memory latency well. Any idea how to prevent the compiler from doing this?
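For reference, a reduced sketch of the kind of kernel I mean — all names, sizes, and the arithmetic here are placeholders, not my actual code:

```cuda
// Hypothetical reduction of the problem: a float2 read from global memory,
// plus small constant-indexed arrays F (12 bytes) and Z (24 bytes).
__global__ void repro(const float2 *in, float2 *out)
{
    float  F[3];                 // constant-indexed, should stay in registers
    float2 Z[3];                 // constant-indexed, should stay in registers
    float2 z = in[threadIdx.x];  // PTX shows this bounced through local memory

    for (int i = 0; i < 3; ++i) {
        F[i] = z.x * i;
        Z[i] = make_float2(z.x + i, z.y - i);
    }

    float2 acc = make_float2(0.f, 0.f);
    for (int i = 0; i < 3; ++i) {
        acc.x += F[i] * Z[i].x;
        acc.y += F[i] * Z[i].y;
    }
    out[threadIdx.x] = acc;
}
```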
I suspect that your occupancy is low because your kernel uses too many registers, which leads the compiler to spill to “local” memory (which is in fact global memory).
You should check your register usage; if registers are free, check your CUDA SDK and driver versions, and as a last resort force the use of registers instead of constant-indexed arrays (ugly code, but it may save your day).
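By forcing registers I mean something along these lines — names purely illustrative:

```cuda
// Workaround sketch: replace a small constant-indexed array with
// individual scalar variables, so there is nothing that *can* spill.
__device__ float2 sum3(float2 a, float2 b, float2 c)
{
    // float2 Z[3] = {a, b, c};       // the array form that may spill
    float2 z0 = a, z1 = b, z2 = c;    // scalars live in registers

    return make_float2(z0.x + z1.x + z2.x,
                       z0.y + z1.y + z2.y);
}
```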
That’s really odd. I don’t have the tools to compile it here, but you could try inlining the variable z. Shouldn’t make any difference at all, but might circumvent a compiler bug.
By inlining, do you mean using the value I assign to it directly in the expression? If so, I already do this in my actual kernel; I only separated it out for clarity in the sample code for this post. It doesn’t seem to affect the local memory usage either way.
Registers are not indexable. Any array that is not addressed with constant indices is going to be put in local memory.
Now, I’m not sure whether the compiler is smart enough to see that this is “somewhat” constant indexing. But if your loop goes from 1 to 3, why not unroll it manually and be done with it?
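That is, something like this (illustrative only):

```cuda
// Manually unrolled 3-iteration loop: every access uses a literal index,
// so f0..f2 have no reason to end up anywhere but registers.
__device__ float sum_of_squares3(float a, float b, float c)
{
    float f0 = a * a;
    float f1 = b * b;
    float f2 = c * c;
    return f0 + f1 + f2;
}
```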
Did you read the first post? The local memory is too small to be the array itself. Furthermore, I have read the PTX output and can see what the compiler is doing, and it is pointless.
I don’t unroll it manually because in my real kernel this is a template, where N = 3 is an adjustable parameter.
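The closest I can get to manual unrolling while keeping N adjustable is a template parameter plus #pragma unroll, so the compiler still sees a literal index for every iteration after unrolling. A sketch (not my actual kernel):

```cuda
// N is a compile-time constant, so after #pragma unroll every F[i]
// access has a literal index and F can stay in registers.
template <int N>
__global__ void kern(const float *in, float *out)
{
    float F[N];

    #pragma unroll
    for (int i = 0; i < N; ++i)
        F[i] = in[i] * in[i];

    float acc = 0.f;
    #pragma unroll
    for (int i = 0; i < N; ++i)
        acc += F[i];

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// launched as e.g.:  kern<3><<<grid, block>>>(d_in, d_out);
```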