What is the maximum CUDA stack frame size per kernel?

The compiler reports stack frame usage on a per-thread basis (e.g. via nvcc -Xptxas -v). The maximum stack frame size per thread for a given GPU is determined by (a) a hard architectural limit on the amount of local memory per thread and (b) the amount of available GPU memory.

The architectural limit on the amount of local memory per thread is documented in the programming guide section G.1, table 12.
[url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications[/url]

Available stack frame size per thread can then be approximated by

stack frame size available per thread =
    min(amount of local memory per thread as documented in section G.1 table 12,
        available GPU memory / number of SMs / maximum resident threads per SM)

The reason this is only an approximation is that there are various levels of allocation granularity that, as best I know, are not documented and may vary from GPU to GPU. I do not know anything about your use case, but in general, massive local memory usage would suggest to me that one might want to rethink the mapping of work to the GPU.
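
For concreteness, here is a minimal sketch (CUDA runtime API, error checking omitted) that computes the approximation above for the current device. The 512 KB architectural cap used here is the table value for compute capability 2.x and later; consult table 12 for your GPU. Note also that the default per-thread stack is much smaller than this cap and can be raised with cudaDeviceSetLimit(cudaLimitStackSize, ...) before a kernel launch.

[code]
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int dev = 0;
    int smCount = 0, maxThreadsPerSM = 0;

    /* query SM count and maximum resident threads per SM */
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);
    cudaDeviceGetAttribute(&maxThreadsPerSM, cudaDevAttrMaxThreadsPerMultiProcessor, dev);

    /* query currently available GPU memory */
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);

    /* architectural cap on local memory per thread: 512 KB for
       compute capability 2.x and later (see table 12 for your GPU) */
    const size_t archLimit = 512 * 1024;

    /* approximation from above; allocation granularity is not accounted for */
    size_t perThread = freeMem / smCount / maxThreadsPerSM;
    size_t estimate  = (perThread < archLimit) ? perThread : archLimit;

    printf("SMs: %d, max resident threads per SM: %d\n", smCount, maxThreadsPerSM);
    printf("free GPU memory: %zu bytes\n", freeMem);
    printf("approx. stack frame available per thread: %zu bytes\n", estimate);

    /* the default per-thread stack size is far smaller; it can be raised with
       cudaDeviceSetLimit(cudaLimitStackSize, bytes) before launching a kernel */
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("current cudaLimitStackSize: %zu bytes\n", stackSize);

    return 0;
}
[/code]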