I am working on a project that uses local arrays of almost 135,200 bytes inside a CUDA kernel.
The problem appears when the application size grows: the local arrays then grow to about 320,000 bytes, and the kernel launch fails. The same thing happens for other increases in application size.
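A minimal sketch of the pattern described above, not the project's actual code: the kernel, the LOCAL_FLOATS macro and the launchAndCheck helper are hypothetical. It checks cudaGetLastError right after the launch and cudaDeviceSynchronize afterwards, so the specific error string is printed instead of only a generic launch failure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-thread scratch size: 80,000 floats = 320,000 bytes,
// matching the failing case above. Override with -DLOCAL_FLOATS=...
#ifndef LOCAL_FLOATS
#define LOCAL_FLOATS 80000
#endif

// Each thread owns a private array. An array this large, indexed with
// runtime values, is placed by the compiler in local memory.
__global__ void localArrayKernel(const float *in, float *out, int n)
{
    float scratch[LOCAL_FLOATS];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < LOCAL_FLOATS; ++i)
        scratch[i] = in[i % n] + static_cast<float>(tid);

    if (tid < n)
        out[tid] = scratch[(tid * 7) % LOCAL_FLOATS];
}

// Launches the kernel and checks both kinds of failure: errors detected
// at launch time (configuration, resources) and errors raised while the
// kernel runs. Returns the first error encountered, or cudaSuccess.
cudaError_t launchAndCheck(const float *d_in, float *d_out, int n,
                           int threadsPerBlock)
{
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    localArrayKernel<<<blocks, threadsPerBlock>>>(d_in, d_out, n);

    cudaError_t err = cudaGetLastError();   // launch-time error
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }
    err = cudaDeviceSynchronize();          // execution-time error
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

With an all-zero input, each successful thread writes its own index to out, which makes the result easy to verify on the host.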
I have tried not using local arrays, but that increases the computation so much that the speed is almost halved.
I have tried increasing the stack size limit by calling cudaDeviceSetLimit with cudaLimitStackSize, but to no avail.
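For reference, a minimal sketch of the stack-limit call described above, with its return codes checked and the value currently in effect read back through cudaDeviceGetLimit. The helper name setAndReportStackLimit is hypothetical; cudaDeviceSetLimit, cudaDeviceGetLimit and cudaLimitStackSize are CUDA Runtime API names. The limit is per thread and has to be set before the kernel is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: requests a per-thread stack size (in bytes) and
// reads back the value the runtime reports afterwards.
// Returns the first error encountered, or cudaSuccess.
cudaError_t setAndReportStackLimit(size_t requestedBytes, size_t *appliedBytes)
{
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, requestedBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit(%zu) failed: %s\n",
                requestedBytes, cudaGetErrorString(err));
        return err;
    }

    err = cudaDeviceGetLimit(appliedBytes, cudaLimitStackSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit failed: %s\n",
                cudaGetErrorString(err));
        return err;
    }

    printf("requested %zu bytes, limit now %zu bytes\n",
           requestedBytes, *appliedBytes);
    return cudaSuccess;
}
```

For example, calling setAndReportStackLimit(320000, &applied) before the launch shows whether the request itself returned an error or was accepted while the launch still failed.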
I found some comments on the matter here: http://stackoverflow.com/questions/7810740/where-does-cuda-allocate-the-stack-frame-for-kernels
but the issue remains unsolved. Any thoughts?