Avoiding local memory with structs

When compiling my code I realised that, for some reason, it is using local memory.

My register usage is well below the limit, so this is not just a register-spilling problem. I am also not addressing any registers dynamically, nor using any local arrays.

While searching for the cause I narrowed down to the following construction:

int4 data = context.global_array[threadIdx.x];

context is a struct of some pointers to global memory, which is passed as a parameter to the kernel (hence resides in shared memory).
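For reference, a stripped-down sketch of the setup (the real struct has more pointers; names here are simplified):

```cuda
// Minimal sketch of the setup described above.
struct Context {
    int4 *global_array;   // points to global memory, allocated with cudaMalloc
    // ... other global-memory pointers ...
};

__global__ void kernel(Context context)
{
    // This single vector load is what triggers the local memory use:
    int4 data = context.global_array[threadIdx.x];
    // ... use data.x, data.y, data.z, data.w ...
}
```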

If I replace the above with the following:

int4 data;
data.x = context.global_array[threadIdx.x].x;
data.y = context.global_array[threadIdx.x].y;
data.z = context.global_array[threadIdx.x].z;
data.w = context.global_array[threadIdx.x].w;

then suddenly I do not use local memory at all. What could be the reason, and how can I avoid it?

My suspicion is that the compiler does not know whether the pointer context.global_array is aligned, so it cannot tell whether a 16-byte-wide load instruction is safe, and somehow this forces the load to go through local memory instead of registers.

However, the pointer I am using here is a value returned by cudaMalloc and should therefore be well aligned; the compiler simply cannot know that at kernel compilation time.

If that is the case, how can I inform the compiler that a 16-byte-wide load instruction is safe at this point?
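One workaround I have considered (an untested sketch, assuming a parameter struct like the Context described above): hoist the pointer out of the struct into a local variable first, so the compiler deals with a plain int4*, whose type already implies 16-byte alignment:

```cuda
__global__ void kernel(Context context)
{
    // Copy the pointer out of the kernel-parameter struct so the
    // compiler sees a plain int4*; the int4 type guarantees 16-byte
    // alignment, so a single 128-bit load should be legal here.
    const int4 *p = context.global_array;
    int4 data = p[threadIdx.x];

    // On sm_35 and later one could also try the read-only-cache load:
    // int4 data = __ldg(&p[threadIdx.x]);
}
```

Whether this actually makes the local memory use disappear I have not verified.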

If that is not the case, what am I doing wrong and how can it be avoided?

Just my guess:
I think the compiler is treating the "int4" data structure as a local array. When you access its members as "x", "y", "z" and "w" throughout the kernel, they are treated as constant indices into that array and hence it can be allocated in registers. If you don't, it moves to local memory…

But I can't justify why the compiler should do that…

But operation like this

int4 a, b;
[...]
a = b;

won't access it dynamically, will it? Even if you replace int4 with some bigger struct, as long as it fits into register space I see no reason not to keep it there.
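A quick illustration of the distinction I mean (a sketch, not compiled; names are made up). As long as every member access uses a compile-time-constant offset, the struct copy is just register moves; only taking an address and indexing it at runtime should force an addressable (local-memory) copy:

```cuda
__global__ void demo(const int4 *in, int *out, int i)
{
    int4 a = in[threadIdx.x];
    int4 b = a;               // plain struct copy: four register-to-register moves

    out[threadIdx.x] = b.y;   // .y is a constant offset, so b can stay in registers

    // By contrast, indexing the struct with a runtime value makes it
    // addressable and should spill it to local memory:
    // out[threadIdx.x] = ((const int *)&b)[i];
}
```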