I successfully set cudaLimitStackSize to 124 KB using cudaDeviceSetLimit, and afterwards was able to define a large enough array in the kernel function. However, when I attempted to restore cudaLimitStackSize by setting it to 1 KB with cudaDeviceSetLimit, the kernel that had been written with the large array still ran without error, contrary to my expectations.
I'm not sure whether it is possible to change the stack size later on. You could make your kernel a template function (or declare and define two copies of the kernel under different names) so that you have several copies of it and can set the stack size individually around each launch; a sketch follows.
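A minimal sketch of that template idea, not the poster's code: each instantiation is a distinct kernel with its own compile-time stack footprint, so the limit only needs to be raised around launches of the large variant. Whether lowering the limit afterwards actually shrinks the allocation is exactly the open question in this thread; sizes and launch shapes below are illustrative.

#include <cstdint>
#include <cuda_runtime.h>

template <int N>
__global__ void kernelFunc(uint64_t* out) {
    uint64_t array[N];                      // per-thread stack array, size fixed at compile time
    for (int i = 0; i < N; ++i) array[i] = i;
    out[threadIdx.x] = array[threadIdx.x];  // keep the array observable so it isn't optimized away
}

int main() {
    uint64_t* out = nullptr;
    cudaMalloc(&out, 32 * sizeof(uint64_t));

    cudaDeviceSetLimit(cudaLimitStackSize, (size_t)(124.5 * 1024));
    kernelFunc<15940><<<1, 32>>>(out);      // large-stack instantiation

    cudaDeviceSetLimit(cudaLimitStackSize, 1024);
    kernelFunc<64><<<1, 32>>>(out);         // small-stack instantiation

    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}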
size_t myStackSize = 124.5 * 1024;  // 127488 bytes per thread
cudaError_t stat = cudaDeviceSetLimit(cudaLimitStackSize, myStackSize);
assert(stat == cudaSuccess);
stat = cudaDeviceGetLimit(&myStackSize, cudaLimitStackSize);
assert(stat == cudaSuccess);
printf("%zu\n", myStackSize);
...
__global__ void kernelFunc(.....) {
uint64_t array[15940];  // 15940 * 8 B ≈ 124.5 KiB of per-thread stack
......
}
When setting myStackSize to 124.5 * 1024, printf prints 127488 and the program runs normally; when setting myStackSize to 16, printf prints 16 and the program continues to run normally; and when not setting myStackSize at all, printf prints the default value of 1024 and the program still runs normally.
However, I remember that when I initially wrote the kernel function without setting myStackSize, I could not define a very large array (e.g., uint64_t array[15940]); it would report "out of memory" at runtime. Does this mean that once myStackSize is set to a large value it cannot be reset, and is this a bug? Additionally, all of the programs above were run on an RTX 4090.
Can you run the kernel first and then set the stack size limit? My assumption is that once the kernel has run, you can no longer change the stack size.
This may be of interest:
LOCAL MEMORY ALLOCATION
On existing GPUs increasing the per thread stack size (cudaLimitStackSize) results in increasing the local memory allocation to RoundToAllocationGranularity(requestedSize) * CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT * CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR.
On an RTX 4090 with 128 SMs this scales the allocation to 128 SMs x 1536 threads/SM = 196,608 threads, i.e. 196,608x the per-thread stack size.
In your example of 124.5 KiB the local memory allocation is increased to at least 23.34 GiB. This is at the 4090 device memory limit so I would not be surprised if the launch fails.
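For reference, a minimal sketch that estimates this footprint from the device attributes. The rounding granularity of RoundToAllocationGranularity is not publicly documented, so treat the result as a lower bound.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int smCount = 0, maxThreadsPerSm = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);
    cudaDeviceGetAttribute(&maxThreadsPerSm, cudaDevAttrMaxThreadsPerMultiProcessor, 0);

    size_t stackBytes = (size_t)(124.5 * 1024);         // 127488 B per thread
    size_t footprint  = stackBytes * smCount * maxThreadsPerSm;

    // On an RTX 4090: 128 SMs x 1536 threads/SM x 127488 B ≈ 23.34 GiB
    printf("%d SMs x %d threads/SM -> local memory >= %.2f GiB\n",
           smCount, maxThreadsPerSm, footprint / (1024.0 * 1024.0 * 1024.0));
    return 0;
}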
DOCUMENTATION
From cuCtxCreate_v3:
- CU_CTX_LMEM_RESIZE_TO_MAX: Instruct CUDA to not reduce local memory after resizing local memory for a kernel. This can prevent thrashing by local memory allocations when launching many kernels with high local memory usage at the cost of potentially increased memory usage. Deprecated: This flag is deprecated and the behavior enabled by this flag is now the default and cannot be disabled. Instead, the per-thread stack size can be controlled with cuCtxSetLimit().
I recommend filing a bug with the CUDA team, as the size and algorithm of the allocation should be more transparent. The Nsight VSE CUDA trace tool supported tracing this allocation (I added it) for debugging this specific issue. Unfortunately, the feature has not returned in NSYS or NCU.
ALTERNATIVE SOLUTION
If you are launching a very small number of threads with a significant stack size, then you may want to allocate a separate device memory buffer, pass it into the kernel, and index into it with a unique per-thread index, as sketched below. If you are launching a single wave of blocks, you can index off the block index. If you are launching multiple waves, then either a grid-stride loop or an atomic to assign a portion of the buffer would be required.
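A hedged sketch of the buffer approach, assuming a single wave of blocks so a global thread index uniquely identifies each thread's slice. The names (scratch, SLOTS) and the launch shape are illustrative, not from the thread.

#include <cstdint>
#include <cuda_runtime.h>

constexpr int SLOTS = 15940;   // elements per thread, matching the original stack array

__global__ void kernelFunc(uint64_t* scratch) {
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    uint64_t* array = scratch + tid * SLOTS;   // replaces "uint64_t array[15940];"
    array[0] = tid;                            // ... use array[0..SLOTS-1] as before ...
}

int main() {
    int blocks = 4, threads = 128;             // must fit in a single wave
    uint64_t* scratch = nullptr;
    size_t bytes = (size_t)blocks * threads * SLOTS * sizeof(uint64_t);
    cudaMalloc(&scratch, bytes);               // one slice per launched thread
    kernelFunc<<<blocks, threads>>>(scratch);
    cudaDeviceSynchronize();
    cudaFree(scratch);
    return 0;
}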
LMEM is interleaved at 4 B granularity on a warp basis. Depending on the access pattern you plan to use, it may be necessary to interleave the memory accesses in the kernel yourself, mimicking what the hardware does for a stack array, to keep the accesses coalesced. A sketch of this interleaving follows.
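A hedged illustration of manual warp interleaving: element i of lane `lane` lives at warpBase[i * 32 + lane], so when all lanes of a warp touch element i the accesses land in adjacent words and coalesce. The hardware's LMEM interleave is at 4 B granularity; this sketch approximates it at element granularity, and the names and sizes are illustrative.

#include <cstdint>

constexpr int SLOTS = 15940;   // elements per thread, as above

__global__ void kernelInterleaved(uint64_t* scratch) {
    const int lane = threadIdx.x % 32;
    const int warp = (int)(((size_t)blockIdx.x * blockDim.x + threadIdx.x) / 32);
    uint64_t* warpBase = scratch + (size_t)warp * 32 * SLOTS;

    for (int i = 0; i < SLOTS; ++i)
        warpBase[i * 32 + lane] = i;   // warp-wide access to element i is contiguous
}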