I successfully set cudaLimitStackSize to 124 KB using cudaDeviceSetLimit, and afterwards was able to define a large enough array in the kernel function. However, when I attempted to restore cudaLimitStackSize by setting it to 1 KB with cudaDeviceSetLimit, the kernel that had been written with the large array still ran without error, contrary to my expectations.
I'm not sure whether it is possible to change the stack size later on. You could make your kernel a template function (or declare and define two copies of the kernel under different names) so that you have several copies of it and can set the stack size individually around each launch; a sketch follows.
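A minimal sketch of that template idea, not the poster's code: each instantiation is a distinct kernel with its own compile-time stack footprint, so the limit only needs to be raised around launches of the large variant. Whether lowering the limit afterwards actually shrinks the allocation is exactly the open question in this thread; sizes and launch shapes below are illustrative.

#include <cstdint>
#include <cuda_runtime.h>

template <int N>
__global__ void kernelFunc(uint64_t* out) {
    uint64_t array[N];                      // per-thread stack array, size fixed at compile time
    for (int i = 0; i < N; ++i) array[i] = i;
    out[threadIdx.x] = array[threadIdx.x];  // keep the array observable so it isn't optimized away
}

int main() {
    uint64_t* out = nullptr;
    cudaMalloc(&out, 32 * sizeof(uint64_t));

    cudaDeviceSetLimit(cudaLimitStackSize, (size_t)(124.5 * 1024));
    kernelFunc<15940><<<1, 32>>>(out);      // large-stack instantiation

    cudaDeviceSetLimit(cudaLimitStackSize, 1024);
    kernelFunc<64><<<1, 32>>>(out);         // small-stack instantiation

    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}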
size_t myStackSize = 124.5 * 1024;  // 127488 bytes per thread
cudaError_t stat = cudaDeviceSetLimit(cudaLimitStackSize, myStackSize);
assert(stat == cudaSuccess);
stat = cudaDeviceGetLimit(&myStackSize, cudaLimitStackSize);
assert(stat == cudaSuccess);
printf("%zu\n", myStackSize);
...
__global__ void kernelFunc(.....) {
uint64_t array[15940];  // 15940 * 8 B ≈ 124.5 KiB of per-thread stack
......
}
When setting myStackSize to 124.5 * 1024, printf prints 127488 and the program runs normally; when setting myStackSize to 16, printf prints 16 and the program continues to run normally; and when not setting myStackSize at all, printf prints the default value of 1024 and the program still runs normally.
However, I remember that when I initially wrote the kernel function without setting myStackSize, I could not define a very large array (e.g., uint64_t array[15940]); it would report "out of memory" at runtime. Does this mean that once myStackSize is set to a large value it cannot be reset, and is this a bug? Additionally, all of the programs above were run on an RTX 4090.
Can you run the kernel first and then set the stack size limit? My assumption is that once the kernel has run, you can no longer change the stack size.
This may be of interest:
LOCAL MEMORY ALLOCATION
On existing GPUs increasing the per thread stack size (cudaLimitStackSize) results in increasing the local memory allocation to RoundToAllocationGranularity(requestedSize) * CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT * CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR.
On an RTX 4090 with 128 SMs this scales the allocation to 128 SMs x 1536 threads/SM = 196,608 threads, i.e. 196,608x the per-thread stack size.
In your example of 124.5 KiB the local memory allocation is increased to at least 23.34 GiB. This is at the 4090 device memory limit so I would not be surprised if the launch fails.
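For reference, a minimal sketch that estimates this footprint from the device attributes. The rounding granularity of RoundToAllocationGranularity is not publicly documented, so treat the result as a lower bound.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int smCount = 0, maxThreadsPerSm = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);
    cudaDeviceGetAttribute(&maxThreadsPerSm, cudaDevAttrMaxThreadsPerMultiProcessor, 0);

    size_t stackBytes = (size_t)(124.5 * 1024);         // 127488 B per thread
    size_t footprint  = stackBytes * smCount * maxThreadsPerSm;

    // On an RTX 4090: 128 SMs x 1536 threads/SM x 127488 B ≈ 23.34 GiB
    printf("%d SMs x %d threads/SM -> local memory >= %.2f GiB\n",
           smCount, maxThreadsPerSm, footprint / (1024.0 * 1024.0 * 1024.0));
    return 0;
}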
DOCUMENTATION
From cuCtxCreate_v3:
- CU_CTX_LMEM_RESIZE_TO_MAX: Instruct CUDA to not reduce local memory after resizing local memory for a kernel. This can prevent thrashing by local memory allocations when launching many kernels with high local memory usage at the cost of potentially increased memory usage. Deprecated: This flag is deprecated and the behavior enabled by this flag is now the default and cannot be disabled. Instead, the per-thread stack size can be controlled with cuCtxSetLimit().
I recommend filing a bug with the CUDA team, as the size and algorithm of the allocation should be more transparent. The Nsight VSE CUDA trace tool supported tracing this allocation (I added it) for debugging this specific issue. Unfortunately, the feature has not returned in NSYS or NCU.
ALTERNATIVE SOLUTION
If you are launching a very small number of threads with a significant stack size, then you may want to allocate a separate device memory buffer, pass it into the kernel, and index into it with a unique per-thread index, as sketched below. If you are launching a single wave of blocks, you can index off the block index. If you are launching multiple waves, then either a grid-stride loop or an atomic to assign a portion of the buffer would be required.
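A hedged sketch of the buffer approach, assuming a single wave of blocks so a global thread index uniquely identifies each thread's slice. The names (scratch, SLOTS) and the launch shape are illustrative, not from the thread.

#include <cstdint>
#include <cuda_runtime.h>

constexpr int SLOTS = 15940;   // elements per thread, matching the original stack array

__global__ void kernelFunc(uint64_t* scratch) {
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    uint64_t* array = scratch + tid * SLOTS;   // replaces "uint64_t array[15940];"
    array[0] = tid;                            // ... use array[0..SLOTS-1] as before ...
}

int main() {
    int blocks = 4, threads = 128;             // must fit in a single wave
    uint64_t* scratch = nullptr;
    size_t bytes = (size_t)blocks * threads * SLOTS * sizeof(uint64_t);
    cudaMalloc(&scratch, bytes);               // one slice per launched thread
    kernelFunc<<<blocks, threads>>>(scratch);
    cudaDeviceSynchronize();
    cudaFree(scratch);
    return 0;
}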
LMEM is interleaved at 4 B granularity on a warp basis. Depending on the access pattern you plan to use, it may be necessary to interleave the memory accesses in the kernel yourself, mimicking what the hardware does for a stack array, to keep the accesses coalesced. A sketch of this interleaving follows.
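A hedged illustration of manual warp interleaving: element i of lane `lane` lives at warpBase[i * 32 + lane], so when all lanes of a warp touch element i the accesses land in adjacent words and coalesce. The hardware's LMEM interleave is at 4 B granularity; this sketch approximates it at element granularity, and the names and sizes are illustrative.

#include <cstdint>

constexpr int SLOTS = 15940;   // elements per thread, as above

__global__ void kernelInterleaved(uint64_t* scratch) {
    const int lane = threadIdx.x % 32;
    const int warp = (int)(((size_t)blockIdx.x * blockDim.x + threadIdx.x) / 32);
    uint64_t* warpBase = scratch + (size_t)warp * 32 * SLOTS;

    for (int i = 0; i < SLOTS; ++i)
        warpBase[i * 32 + lane] = i;   // warp-wide access to element i is contiguous
}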