Maximum stack size?

How can I calculate, from the device properties, the maximum stack size that can be set?

I tried setting the stack size by trial and error, and received an “out of memory” error message
when trying to change the default value of 1024 to 1024*1024.

Apparently, it tried to allocate over 10GB of GPU memory and failed, which was surprising.

How can I calculate that number myself, so it doesn’t exceed my total GPU memory?

Edit: Sorry, I don’t think my previous answer is correct; it returns the current limit, not the maximum settable one. Could you show the code that does not work for you?

You have to keep in mind that the stack size applies per thread, so for example on an A100 with 108 SMs and up to 2048 resident threads per SM, specifying a stack size of 1 MB per thread would require 108 * 2048 * 1 MB = 216 GB of memory.
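
For reference, here is a minimal sketch of how the limit is typically set and read back with the runtime API (cudaDeviceSetLimit / cudaDeviceGetLimit with cudaLimitStackSize; the 1 MB value and the error handling are just illustrative):

size_t requested = 1024 * 1024;   // 1 MB per thread
cudaError_t err = cudaDeviceSetLimit( cudaLimitStackSize, requested );
if ( err != cudaSuccess )
    printf( "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString( err ) );

size_t current = 0;
cudaDeviceGetLimit( &current, cudaLimitStackSize );   // reads back the limit currently in effect
printf( "current per-thread stack limit: %zu bytes\n", current );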

Thanks for the reply.

Would this be correct, assuming I wanted a 1 MB stack size per thread?

cudaDeviceProp props;
cudaGetDeviceProperties( &props, deviceID );

size_t size_per_thread = 1024 * 1024;
size_t total_stack_size = props.multiProcessorCount * props.maxThreadsPerBlock * size_per_thread;

So, if my device returned 80 multiProcessorCount and 1024 maxThreadsPerBlock, that would be 80 GB?
Seems a bit wasteful if I don’t intend to use all those threads.

If I make a call, such as:
RunKernel<<<1, 1>>>(params);

Then I’m only using one thread. CUDA isn’t smart enough to resolve stack size based on my kernel call?

Correct. The stack reservation is not sized to your launch configuration.

Correction:
MaxSmCount x MaxThreadsPerSm x StackSizePerThread.

The difference is MaxThreadsPerSm vs. maxThreadsPerBlock: the reservation has to cover the maximum number of threads that can be resident on an SM, not the maximum size of a single block.

Ah, so you can get the total GPU threads using maxThreadsPerMultiProcessor.

 cudaDeviceProp props;
 cudaGetDeviceProperties( &props, deviceID );

 size_t total_gpu_threads = props.multiProcessorCount * props.maxThreadsPerMultiProcessor;
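
So the worst-case footprint for a 1 MB per-thread stack should then be something like this (just extending the snippet above; the names are mine):

 size_t size_per_thread = 1024 * 1024;   // 1 MB per thread
 size_t worst_case_stack = total_gpu_threads * size_per_thread;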

Well, that’s interesting, because I assumed one “block” was considered to be one “MultiProcessor”, but these two device properties return different values:

         maxThreadsPerBlock: 1024
maxThreadsPerMultiProcessor: 1536

So, that assumption can’t be correct.

A block of threads is something that runs on a multiprocessor. In the absence of other resource restrictions, a multiprocessor that supports up to 1536 threads could run three blocks of 512 threads each, for example, but only one block of 1024 threads, leaving a third of its thread slots unused. Running just one block per multiprocessor is often a poor choice; it is usually better to use finer block granularity, such as blocks of 128 or 256 threads.
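
You can check this for your own kernel with the occupancy API. A minimal sketch (MyKernel is just a placeholder for any __global__ function):

__global__ void MyKernel( int *data ) { /* ... */ }

int blocksOf512 = 0, blocksOf1024 = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor( &blocksOf512, MyKernel, 512, 0 );
cudaOccupancyMaxActiveBlocksPerMultiprocessor( &blocksOf1024, MyKernel, 1024, 0 );
// On a device that supports 1536 threads per SM (and with no other limiting resources),
// blocksOf512 comes back as 3 (3 * 512 = 1536), while blocksOf1024 is 1.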
