Private arrays always forced into "local" memory?

There is a small private array declared in the kernel:

#define forward_register_workload_width 3
#define forward_register_workload_height 1
float sum[forward_register_workload_width * forward_register_workload_height];

The PTX code I see in Parallel Nsight starts with:

.local .align 4 .b8 %Depot1[12];
.reg .b32 %SP;
.reg .f32 %f<284>;
.reg .pred %p<29>;
.reg .s32 %r<147>;

Why is that? Why did the compiler decide to put three consecutive 32-bit variables into "local" memory? This dramatically slows down the kernel.

I have a GTX 560 Ti, driver version 275.33, on Windows 7 Ultimate 64-bit.

I am attaching a text file with the OpenCL source code for the kernel.
Convolution.txt (11.8 KB)

Do you access your private array with an index that is only known at runtime? Registers are not addressable, so the architecture cannot keep an array in registers, even a small one, unless every access to it uses a constant index known at compile time.
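To make the distinction concrete, here is a minimal sketch of the two access patterns. It is written as plain C so it can be compiled on the host; the function names, the `slot` array and the `WIDTH` macro are hypothetical and are not taken from the attached kernel. The same pattern applies inside an OpenCL kernel, but whether the array actually ends up in registers is the compiler's decision.

```c
#define WIDTH 3  /* stands in for the 3-element private array in the question */

/* Runtime index: slot[k] is only known while the code runs, so sum[]
   has to be addressable. On the GPU this is the pattern that tends to
   push the array into local memory. */
float accumulate_dynamic(const float *in, const int *slot, int n)
{
    float sum[WIDTH] = {0.0f, 0.0f, 0.0f};
    for (int k = 0; k < n; ++k)
        sum[slot[k]] += in[k];
    return sum[0] + sum[1] + sum[2];
}

/* Constant indices: every access names a fixed element, so the
   compiler is free to treat sum[0], sum[1], sum[2] as three scalars. */
float accumulate_constant(const float *in, int n)
{
    float sum[WIDTH] = {0.0f, 0.0f, 0.0f};
    for (int k = 0; k + WIDTH <= n; k += WIDTH) {
        sum[0] += in[k];
        sum[1] += in[k + 1];
        sum[2] += in[k + 2];
    }
    return sum[0] + sum[1] + sum[2];
}
```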

Thanks a lot. This is exactly what I do.

I have just checked the code and replaced the truly dynamic indexing with conditions. There is another place where I access an array element by index, but there the compiler is able to determine the index value, because the loop follows the #pragma unroll directive I specified.
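For anyone finding this later, a rough sketch of what that kind of rewrite can look like. This is plain C, not the code from Convolution.txt; the function name, the `slot` array and `WIDTH` are made up for illustration, and the branch chain assumes the index is always in the range 0..WIDTH-1.

```c
#define WIDTH 3  /* hypothetical size, matching the 3-element array above */

float accumulate_with_conditions(const float *in, const int *slot, int n)
{
    float sum[WIDTH] = {0.0f, 0.0f, 0.0f};

    for (int k = 0; k < n; ++k) {
        const float v = in[k];
        const int i = slot[k];  /* runtime value, never used as an index */

        /* Each branch names a fixed element instead of writing sum[i]. */
        if (i == 0)
            sum[0] += v;
        else if (i == 1)
            sum[1] += v;
        else
            sum[2] += v;        /* assumes 0 <= i < WIDTH */
    }

    /* Fixed trip count: in the OpenCL kernel a "#pragma unroll" placed
       before this loop lets the compiler turn sum[j] into
       sum[0], sum[1], sum[2]. */
    float total = 0.0f;
    for (int j = 0; j < WIDTH; ++j)
        total += sum[j];
    return total;
}
```

The branches cost a few extra comparisons, but every array access now uses a constant index.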

It worked! No more register spilling.

Thank you very much, Seibert.