What is the maximum CUDA Stack frame size per Kerenl.

I am working on a Project which uses local arrays of size almost : 135200 Bytes,inside the cuda kernel ;

The problem comes when the application size is increased, which increases the local array size to 320,000 Bytes leading to kernel Launch Failures; Similarly, it happens for other increases in the application sizes.

i have tried not using local arrays, but that increases the computation so much that the speed is almost halved.

I have tried to increase the Stack Size Limit by setting the variable cudaLimitStackSize using the function : cudaDeviceSetLimit, but of no avail.

i have found some comments regarding the matter on : Where does CUDA allocate the stack frame for kernels? - Stack Overflow

But the issue is not getting solved. So any thoughts ?.

The compiler reports stack frame usage on a per-thread basis. The maximum stack frame size per thread for a given GPU is determined by (a) a hard architecture limit on the amount of local memory per thread (b) the amount of available GPU memory.

The architectural limit on the amount of local memory per thread is documented in the programming guide section G.1, table 12.

Available stack frame size per thread can then be approximated by

stack frame size available per thread =
min (amount of local memory per thread as documented in section G.1 table 12,
available GPU memory / number of SMs / maximum resident threads per SM)

The reason this is approximate is because there are various levels of allocation granularity that, best I know, are not documented and may vary from GPU to GPU. I do not know anything about your use case, but in general massive local memory usage would suggest to me that one might want to re-think the mapping of work to the GPU.