What is the maximum CUDA stack frame size per kernel?

rohit89 · November 18, 2013, 10:25am

I am working on a project that uses local arrays totalling almost 135,200 bytes inside the CUDA kernel.

The problem arises when the application size is increased, which grows the local arrays to 320,000 bytes and leads to kernel launch failures. The same happens for other increases in the application size.

I have tried not using local arrays, but that increases the computation so much that the speed is almost halved.

I have tried to increase the stack size limit by setting cudaLimitStackSize with cudaDeviceSetLimit, but to no avail.

I have found some comments on the matter at: Where does CUDA allocate the stack frame for kernels? - Stack Overflow

But the issue is still not solved. Any thoughts?

njuffa · November 18, 2013, 6:46pm

The compiler reports stack frame usage on a per-thread basis. The maximum stack frame size per thread for a given GPU is determined by (a) a hard architectural limit on the amount of local memory per thread, and (b) the amount of available GPU memory.

The architectural limit on the amount of local memory per thread is documented in the programming guide, section G.1, table 12:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications

Available stack frame size per thread can then be approximated by

stack frame size available per thread =
min (amount of local memory per thread as documented in section G.1 table 12,
available GPU memory / number of SMs / maximum resident threads per SM)

This is only an approximation because there are various levels of allocation granularity that, as best I know, are not documented and may vary from GPU to GPU. I do not know anything about your use case, but in general, massive local memory usage suggests to me that one might want to rethink the mapping of work to the GPU.
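To make the approximation above concrete, here is a minimal host-side sketch of the min() formula. The helper name estimate_stack_bytes_per_thread and the constant kArchLocalMemPerThread are hypothetical. The 512 KB default is the per-thread local memory limit the programming guide lists for compute capability 2.x and later, so check it against the table for your GPU. The result ignores the undocumented allocation granularity mentioned above and should be read as a rough upper bound only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumption: 512 KB of local memory per thread, the architectural limit
// listed for compute capability 2.x and later. Verify for your device.
constexpr std::uint64_t kArchLocalMemPerThread = 512ull * 1024ull;

// Hypothetical helper implementing the approximation from the post:
//   min(architectural local memory limit per thread,
//       available GPU memory / number of SMs / max resident threads per SM)
// All inputs are supplied by the caller; nothing is queried from a device.
std::uint64_t estimate_stack_bytes_per_thread(
    std::uint64_t available_gpu_bytes,
    std::uint32_t num_sms,
    std::uint32_t max_resident_threads_per_sm,
    std::uint64_t arch_limit = kArchLocalMemPerThread)
{
    assert(num_sms > 0 && max_resident_threads_per_sm > 0);
    // Memory-bound term: free memory shared across every thread that
    // could be resident on the GPU at once.
    const std::uint64_t memory_bound =
        available_gpu_bytes / num_sms / max_resident_threads_per_sm;
    return std::min(arch_limit, memory_bound);
}
```

With purely illustrative inputs (2 GiB available, 8 SMs, 2048 resident threads per SM; not the hardware from the question), the memory-bound term is 131,072 bytes. That is well below the 512 KB architectural limit and also below the 320,000 bytes mentioned in the question. On a real device, the inputs could be read with cudaMemGetInfo and the multiProcessorCount and maxThreadsPerMultiProcessor fields returned by cudaGetDeviceProperties.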
