Maximum stack size?

How can I calculate the maximum stack size that is settable using information from the device properties?

I tried setting the stack size by trial and error, and received an “out of memory” error message
when trying to change the default value of 1024 to 1024*1024.

Apparently, it tried to allocate over 10GB of GPU memory and failed, which was surprising.

How can I calculate that number myself, so it doesn’t exceed my total GPU memory?

Edit: Sorry, I don’t think my previous answer is correct; it returns the current limit. Could you show the code that does not work for you?

You have to keep in mind that the stack size is set per thread, so for example on an A100 with 108 SMs and up to 2048 resident threads per SM, specifying a stack size of 1 MB per thread would require 108 * 2048 * 1 MB = 216 GB of memory.

Thanks for the reply.

Would this be correct? Assuming I wanted a 1MB stack size per thread.

cudaDeviceProp props;
cudaGetDeviceProperties( &props, deviceID );

size_t size_per_thread = 1024 * 1024;
size_t total_stack_size = props.multiProcessorCount * props.maxThreadsPerBlock * size_per_thread;

So, if my device returned 80 multiProcessorCount and 1024 maxThreadsPerBlock, that would be 80 GB?
Seems a bit wasteful if I don’t intend to use all those threads.

If I make a call, such as:
RunKernel<<<1, 1>>>(params);

Then I’m only using one thread. CUDA isn’t smart enough to resolve stack size based on my kernel call?


MaxSmCount x MaxThreadsPerSm x StackSizePerThread.

The difference is MaxThreadsPerSm vs maxThreadsPerBlock.

Ah, so you can get the total GPU threads using maxThreadsPerMultiProcessor.

 cudaDeviceProp props;
 cudaGetDeviceProperties( &props, deviceID );

 size_t total_gpu_threads = props.multiProcessorCount * props.maxThreadsPerMultiProcessor;

Well, that’s interesting, because I assumed one “block” corresponded to one “multiprocessor”, but these two device properties return different values:

         maxThreadsPerBlock: 1024
maxThreadsPerMultiProcessor: 1536

So, that assumption can’t be correct.

A block of threads is something that runs on a multiprocessor. In the absence of other unrelated restrictions, a multiprocessor that supports up to 1536 threads could run three blocks of 512 threads each, for example. Such a multiprocessor can only run one block of 1024 threads. Running just one block per multiprocessor is often a poor choice, and it is usually better to use finer block granularity, such as blocks with 128 or 256 threads.
