**Note:** this is on CUDA 2.0 beta.
I have had problems with kernels that use large arrays declared as local data, namely:
"ptxas error : Entry function ‘…’ uses too much local data (… bytes … system, 0x4000 max)"
The program compiled and ran fine in emulation mode, but I cannot compile it in the debug or release configurations.
So instead of declaring the large arrays as local data in my kernel, I declared and allocated them from the host, then used offsets computed from the thread and block IDs to give each thread its own private slice of the data.
The program then compiles without a problem.
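For anyone hitting the same error, here is a minimal sketch of that workaround. It is not my actual code: the kernel name `scratch_kernel`, the size `SCRATCH_ELEMS`, and the launch configuration are made up for illustration. The per-thread array is chosen to be 32 KB, which is above the 0x4000 (16 KB) maximum quoted in the error message.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical per-thread scratch size: 8192 floats = 32 KB,
// larger than the 0x4000 (16 KB) limit reported by ptxas in the post.
#define SCRATCH_ELEMS 8192

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void scratch_kernel(float *scratch, float *out)
{
    // Instead of declaring "float local[SCRATCH_ELEMS];" inside the kernel,
    // each thread takes a private slice of a buffer allocated by the host.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *mine = scratch + (size_t)tid * SCRATCH_ELEMS;

    // Fill the private slice, then reduce it, just to exercise the memory.
    for (int i = 0; i < SCRATCH_ELEMS; ++i)
        mine[i] = 1.0f;

    float sum = 0.0f;
    for (int i = 0; i < SCRATCH_ELEMS; ++i)
        sum += mine[i];

    out[tid] = sum;
}

int main()
{
    const int blocks = 4, threads = 64;
    const int n = blocks * threads;

    float *d_scratch = 0, *d_out = 0;
    // One slice of SCRATCH_ELEMS floats per thread (8 MB in total here).
    check(cudaMalloc((void **)&d_scratch,
                     (size_t)n * SCRATCH_ELEMS * sizeof(float)),
          "cudaMalloc(scratch)");
    check(cudaMalloc((void **)&d_out, n * sizeof(float)), "cudaMalloc(out)");

    scratch_kernel<<<blocks, threads>>>(d_scratch, d_out);
    check(cudaGetLastError(), "kernel launch");

    float h_out[n];
    check(cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost),
          "cudaMemcpy");
    printf("thread 0 sum = %f\n", h_out[0]);

    cudaFree(d_scratch);
    cudaFree(d_out);
    return 0;
}
```

The sketch gives each thread a contiguous slice because that is the simplest way to show the idea. An interleaved layout (element `i` of thread `tid` stored at `i * n + tid`) may give better memory coalescing, but I have not measured that here.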
Does that mean there is a software limit on the amount of data you can declare as local variables inside a kernel? From my understanding, local data (when it is not allocated in registers) is nothing more than global memory set aside privately for each thread.
Any input? I could not find any information about such a limit in the programming manual.