Strange local memory usage

I’m having a strange local memory usage problem. This code:

__device__ void Test(ushort2 * F[3])
{
	float2 Z[3];

#pragma unroll
	for(int i = 0; i < 3; ++i)
	{
		ushort2 z = F[i][threadIdx.x];
		Z[i] = make_float2(__half2float(z.x), __half2float(z.y));
	}

	F[1][threadIdx.x] = make_ushort2(__float2half_rn(Z[0].x + Z[1].x + Z[2].x),
	                                 __float2half_rn(Z[0].y + Z[1].y + Z[2].y));
}

__global__ void CallTest(ushort2 * F1, ushort2 * F2, ushort2 * F3)
{
	ushort2 * F[3] = { F1, F2, F3 };
	Test(F);
}



uses 8 registers and 4 bytes of lmem. When I look at the PTX output, it appears to load z from global memory, immediately store it to local memory, then read z.x and z.y back from local memory in two separate local loads.

I don’t think the arrays themselves are going into local memory, as they are constant-indexed, and neither of them is 4 bytes long (F is 12 bytes, Z is 24 bytes).

In my real kernel, this adds up to 12 32-bit local stores and 24 16-bit local loads, which can’t be good for performance. Since my occupancy is already low for other reasons, the kernel won’t be able to hide the local memory latency well. Any idea how to prevent the compiler from doing this?

I suspect that you have low occupancy because your kernel uses too many registers, which leads the compiler to spill to “local” memory (which is actually global memory).

You should check your register usage; if registers are free, check your CUDA SDK and driver versions, and as a last resort force the use of individual registers instead of constant-indexed arrays (ugly code, but it may save your day).
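For what it’s worth, a minimal sketch of that last suggestion, rewriting the posted Test() with separate scalars in place of the indexable arrays (the name Test3 is mine, not from the original code):

```cuda
// Hypothetical rewrite: no ushort2*[] or float2[] arrays at all, so there
// is nothing the compiler could decide to index through local memory.
__device__ void Test3(ushort2 *F0, ushort2 *F1, ushort2 *F2)
{
	// One global load per input array.
	ushort2 a = F0[threadIdx.x];
	ushort2 b = F1[threadIdx.x];
	ushort2 c = F2[threadIdx.x];

	// Convert each half pair to float2, all in plain scalar registers.
	float2 Z0 = make_float2(__half2float(a.x), __half2float(a.y));
	float2 Z1 = make_float2(__half2float(b.x), __half2float(b.y));
	float2 Z2 = make_float2(__half2float(c.x), __half2float(c.y));

	// Sum and store back, as in the original Test().
	F1[threadIdx.x] = make_ushort2(__float2half_rn(Z0.x + Z1.x + Z2.x),
	                               __float2half_rn(Z0.y + Z1.y + Z2.y));
}
```

The kernel would then call Test3(F1, F2, F3) directly instead of building the pointer array.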

Did you compile the posted code? It uses 8 registers. The compiler should not use local memory.

That’s really odd. I don’t have the tools to compile it here, but you could try inlining the variable z. Shouldn’t make any difference at all, but might circumvent a compiler bug.

By inlining, do you mean using the value I assign to it directly in the expression? If so, I already do this in my actual kernel; I only separated it out for clarity in the sample code for this post. It doesn’t seem to affect the local memory usage issue.

Does anyone have a solution for this? I need to find a workaround.

Can I write inline assembly to do these few lines of code manually? Does anyone have an example of this?
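Not the original poster, but CUDA does accept GCC-style inline PTX via asm(). An untested sketch of doing the half-to-float conversion by hand with the PTX cvt instruction (the function name is made up for illustration):

```cuda
// Untested sketch: convert a raw 16-bit half (stored in an unsigned short,
// as in the ushort2 fields above) to float with inline PTX, bypassing the
// __half2float intrinsic. "h" binds a 16-bit register, "=f" a float.
__device__ float HalfToFloatAsm(unsigned short h)
{
	float f;
	asm("cvt.f32.f16 %0, %1;" : "=f"(f) : "h"(h));
	return f;
}
```

Whether this actually changes the generated local loads/stores would need checking in the PTX output; the compiler may emit the same code it does for the intrinsic.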

It also looks like this needs to be filed as a bug with nvidia.

Registers are not indexable. Any array that is not addressed with a constant index is going to be put in local memory.

Now, I’m not sure if the compiler is smart enough to see that this is “somewhat” constant indexing. But if your loop only goes from 0 to 2, why not unroll it manually and be done with it?

Did you read the first post? The local memory is too small to be the array itself. Furthermore, I have read the PTX output and can see what the compiler is doing, and it is pointless.

I can’t manually unroll it because in my real kernel this is a template, where N = 3 is adjustable.
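One way to keep N adjustable and still get the effect of a manual unroll is compile-time recursion, so the per-iteration values never live in an indexable array. A sketch under that assumption (the names SumHalf2/TestN are hypothetical, not from the posted code):

```cuda
// Recursively sum F[0..N-1][threadIdx.x] as float2. Each recursion level
// is a distinct instantiation, so everything stays in scalar registers.
template<int N>
__device__ float2 SumHalf2(ushort2 * const F[])
{
	ushort2 z = F[N - 1][threadIdx.x];        // constant index: N-1 is a compile-time value
	float2 rest = SumHalf2<N - 1>(F);         // recurse over the remaining pointers
	return make_float2(rest.x + __half2float(z.x),
	                   rest.y + __half2float(z.y));
}

// Base case: empty sum.
template<>
__device__ float2 SumHalf2<0>(ushort2 * const F[])
{
	return make_float2(0.0f, 0.0f);
}

// Equivalent of the original Test(), with N as a template parameter.
template<int N>
__device__ void TestN(ushort2 * F[])
{
	float2 s = SumHalf2<N>(F);
	F[1][threadIdx.x] = make_ushort2(__float2half_rn(s.x),
	                                 __float2half_rn(s.y));
}
```

The pointer array F is still indexed, but always with compile-time constants, so it should stay in registers the same way the original constant-indexed loop was supposed to.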