Local memory limit?

**Note: this is on CUDA 2.0 beta**

I have had problems with kernels using large arrays declared as local data.

(namely: "ptxas error : Entry function '…' uses too much local data (… bytes … system, 0x4000 max)")

The program compiled and ran fine in emulation mode, but I cannot compile it in debug or release mode.

So instead of declaring the large arrays as local data in my kernel, I declared and allocated them from the host, then used offsets computed from threadIdx and blockIdx to privatize the data for each thread.

The program then compiles without a problem.
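A minimal sketch of what I mean, with made-up names (the kernel, `SCRATCH_PER_THREAD`, and the host-side snippet are just for illustration):

```cuda
// Hypothetical sketch of the workaround: instead of a large per-thread local
// array, a single global buffer is allocated from the host and each thread
// indexes its own private chunk of it.
#define SCRATCH_PER_THREAD 4096   // elements each thread would otherwise declare locally

__global__ void kernel(float *scratch)
{
    // Global thread index selects this thread's private chunk.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *myScratch = scratch + tid * SCRATCH_PER_THREAD;   // chunked layout

    for (int i = 0; i < SCRATCH_PER_THREAD; ++i)
        myScratch[i] = 0.0f;      // use it as if it were a local array
}

// Host side: allocate one chunk per thread before launching.
//   float *dScratch;
//   cudaMalloc((void**)&dScratch,
//              numBlocks * threadsPerBlock * SCRATCH_PER_THREAD * sizeof(float));
//   kernel<<<numBlocks, threadsPerBlock>>>(dScratch);
```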

Does that mean there is a software limit to the amount of data you can declare as local variables inside a kernel? From my understanding, local data is nothing more than (if not allocated in registers) global data saved in a privatized manner.

Any inputs? I could not find any information in the programming manual about such a limit.

Given your error message, there clearly is a limit, which happens to be 16 KB (0x4000 bytes). This is the same size as shared memory; maybe there is a connection, maybe it is a coincidence.

On another note, it seems like your workaround is bound to be slow since it will make it impossible for threads to coalesce their reads from the array. (Unless you interleave the array in global memory instead of having chunks for each thread.)
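A rough sketch of that interleaved layout (same hypothetical `SCRATCH_PER_THREAD` as above): element i of thread t is stored at `scratch[i * totalThreads + t]`, so at each loop step the threads of a warp touch consecutive addresses and the accesses can coalesce.

```cuda
#define SCRATCH_PER_THREAD 4096

// Hypothetical interleaved version: the per-thread "array" is strided by the
// total thread count instead of being a contiguous chunk per thread.
__global__ void kernelInterleaved(float *scratch, int totalThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < SCRATCH_PER_THREAD; ++i)
        scratch[i * totalThreads + tid] = 0.0f;   // stride = total thread count
}
```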